diff mbox series

[2/2] Honor GIT_DEFAULT_HASH for empty clones without remote algo

Message ID 20230426205324.326501-3-sandals@crustytoothpaste.net (mailing list archive)
State Superseded
Headers show
Series Fix empty SHA-256 clones with v0 and v1 | expand

Commit Message

brian m. carlson April 26, 2023, 8:53 p.m. UTC
From: "brian m. carlson" <bk2204@github.com>

The previous commit introduced a change that allows HTTP v0 and v1
operations to determine the hash of an empty remote repository by
sending capabilities.  However, there are still some cases, such as when
cloning locally, where the capabilities are not sent.  This is because
for local operations, we don't strip out the fake "capabilities^{}" ref,
and thus "git ls-remote" would produce incorrect values if we did.

However, up until 8b214c2e9d ("clone: propagate object-format when
cloning from void", 2023-04-05), we honored GIT_DEFAULT_HASH in this
case, so let's continue to do that.  Check whether the hash algorithm
was explicitly set, and if so, continue to use that value.  If not, use
the default value for GIT_DEFAULT_HASH to ensure that we can at least
properly configure an empty clone whose hash algorithm we know.

Note that without this patch, git clone cannot create a SHA-256
repository from an empty remote without protocol v2 (except over HTTP,
as in the previous patch).
---
 Documentation/git.txt  | 10 +++++++---
 builtin/clone.c        |  8 +++++---
 connect.c              |  5 ++++-
 pkt-line.h             |  2 ++
 t/t5700-protocol-v1.sh | 11 +++++++++++
 transport-helper.c     |  1 +
 transport.c            | 14 ++++++++++++++
 transport.h            | 14 ++++++++++++++
 8 files changed, 58 insertions(+), 7 deletions(-)

Comments

Junio C Hamano April 26, 2023, 9:18 p.m. UTC | #1
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> However, up until 8b214c2e9d ("clone: propagate object-format when
> cloning from void", 2023-04-05), we honored GIT_DEFAULT_HASH in this
> case, so let's continue to do that.

Let's not.  Once we identified the bug of mistakenly honoring a
wrong variable, let's fix it and keep it fixed.

The local optimization, if necessary, can be taught to peek the
source repository and propagating the object format selection to the
destination repository.
Junio C Hamano April 26, 2023, 9:33 p.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>  `GIT_DEFAULT_HASH`::
>  	If this variable is set, the default hash algorithm for new
>  	repositories will be set to this value. This value is currently
> +	ignored when cloning if the remote value can be definitively
> +	determined; the setting of the remote repository is used
> +	instead. The value is honored if the remote repository's
> +	algorithm cannot be determined, such as some cases when
> +	the remote repository is empty. The default is "sha1".
> +	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> +	in linkgit:git-init[1].

We'd need to evantually cover all the transports (and non-transport
like the "--local" optimization) so that the object-format and other
choices are communicated from the origin to a new clone anyway, so
this extra complexity "until X is fixed, it behaves this way, but
otherwise the variable is read in the meantime" may be a disservice
to the end users, even though it may make it easier in the shorter
term for maintainers of programs that rely on the buggy "git clone"
that partially honored this environment variable.

In short, I am still not convinced that the above is a good design
choice in the longer term.

Thanks.
Jeff King April 27, 2023, 5:43 a.m. UTC | #3
On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:

> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> >  `GIT_DEFAULT_HASH`::
> >  	If this variable is set, the default hash algorithm for new
> >  	repositories will be set to this value. This value is currently
> > +	ignored when cloning if the remote value can be definitively
> > +	determined; the setting of the remote repository is used
> > +	instead. The value is honored if the remote repository's
> > +	algorithm cannot be determined, such as some cases when
> > +	the remote repository is empty. The default is "sha1".
> > +	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > +	in linkgit:git-init[1].
> 
> We'd need to evantually cover all the transports (and non-transport
> like the "--local" optimization) so that the object-format and other
> choices are communicated from the origin to a new clone anyway, so
> this extra complexity "until X is fixed, it behaves this way, but
> otherwise the variable is read in the meantime" may be a disservice
> to the end users, even though it may make it easier in the shorter
> term for maintainers of programs that rely on the buggy "git clone"
> that partially honored this environment variable.
> 
> In short, I am still not convinced that the above is a good design
> choice in the longer term.

I also think it is working against the backwards-compatible design of
the hash function transition. If we do not see an object-format line
from the remote, then either:

  1. They sent us capabilities, but it did not include object-format. So
     if we are in GIT_DEFAULT_HASH=sha256 mode locally, but the other
     side is an older version of Git (or even a current version of other
     implementations, like Dulwich) that do not send object-format at
     all, then we will not correctly fall back to assuming they are
     sha1. In a non-empty repo, this means we'll fail to parse their ref
     advertisement (we'll expect sha256 hashes but get sha1), and
     cloning will be broken.

  2. They did not send us capabilities, because the repo is empty (and
     the server does not have brian's patch 1). The hash transition doc
     says we're supposed to assume they're sha1. It's _sort of_
     academic, in that they also are not telling us about any refs on
     their side. But we may end up mis-matched with the server (again,
     this is the 50/50 thing; we don't know what their format is).
     Presumably that bites us later when we try to push up new objects
     (but would eventually work when we support interop).

I think handling (2) is iffy as a goal, but the collateral damage of (1)
is a complete show-stopper for this patch. If we wanted to do (2) by
itself, we'd have to distinguish "did they even send us a capabilities
line" as a separate case (but I tend to agree with you that it is not
worth doing for now).

-Peff
Felipe Contreras May 2, 2023, 11:46 p.m. UTC | #4
Hi,

Changing the subject as this message seems like a different topic.

Jeff King wrote:
> On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > 
> > >  `GIT_DEFAULT_HASH`::
> > >  	If this variable is set, the default hash algorithm for new
> > >  	repositories will be set to this value. This value is currently
> > > +	ignored when cloning if the remote value can be definitively
> > > +	determined; the setting of the remote repository is used
> > > +	instead. The value is honored if the remote repository's
> > > +	algorithm cannot be determined, such as some cases when
> > > +	the remote repository is empty. The default is "sha1".
> > > +	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > +	in linkgit:git-init[1].
> > 
> > We'd need to evantually cover all the transports (and non-transport
> > like the "--local" optimization) so that the object-format and other
> > choices are communicated from the origin to a new clone anyway, so
> > this extra complexity "until X is fixed, it behaves this way, but
> > otherwise the variable is read in the meantime" may be a disservice
> > to the end users, even though it may make it easier in the shorter
> > term for maintainers of programs that rely on the buggy "git clone"
> > that partially honored this environment variable.
> > 
> > In short, I am still not convinced that the above is a good design
> > choice in the longer term.
> 
> I also think it is working against the backwards-compatible design of
> the hash function transition.

To be honest this whole approach seems to be completely flawed to me and
against the whole design of git in the first place.

In a recent email Linus Torvalds explained why object ids were
calculated based {type, size, data} [1], and he explained very clearly
that two objects with exactly the same data are not supposed to have the
same id if the type is different.

If even the tiniest change such as adding a period to a commit messange
changes the object id (and thus semantically makes it a different
object), then it makes sense that changing the type of an object also
changes the object id (and thus it's also a different object).

And because the id of the parent is included in the content of every
commit, the top-level id ensures the integrity of the whole graph.

But then comes this notion that the hash algorithm is a property of the
repository, and not part of the object storage, which means changing the
whole hash algorithm of a repository is considered less of a change than
adding a period to the commit message, worse: not a change at all.

I am reminded of the warning Sam Smith gave to the Git project [2] which
seemed to be unheard, but the notion of cryptographic algorithm agility
makes complete sense to me.

In my view one repository should be able to have part SHA-1 history,
part SHA3-256 history, and part BLAKE2b history.

Changing the hash algorithm of one commit should change the object id of
that commit, and thus make it semantically a different commit.

In other words: an object of type "blob" should never be confused with
an object of type "blob:sha-256", even if the content is exactly the
same.

The fact that apparently it's so easy to clone a repository with
the wrong hash algorithm should give developers pause, as it means the
whole point of using cryptographic hash algorithms to ensure the
integrity of the commit history is completely gone.

I have not been following the SHA-1 -> OID discussions, but I
distinctively recall Linus Torvalds mentioning that the choice of using
SHA-1 wasn't even for security purposes, it was to ensure integrity.
When I do a `git fetch` as long as the new commits have the same SHA-1
as parent as the SHA-1s I have in my repository I can be relatively
certain the repository has not been tampered with. Which means that if I
do a `git fetch` that suddenly brings SHA-256 commits, some of them must
have SHA-1 parents that match the ones I currently have. Otherwise how
do I know it's the same history?

Maybe that's one of the reasons people don't seem particularly eager to
move away from SHA-1:

Better the SHA-1 you know, than the SHA-256 you don't.

Cheers.

[1] https://lore.kernel.org/git/CAHk-=wjr-CMLX2Jo2++rwcv0VNr+HmZqXEVXNsJGiPRUwNxzBQ@mail.gmail.com/
[2] https://lore.kernel.org/git/D433038A-2643-4F63-8677-CA8AB6904AE1@samuelsmith.org/
Adam Majer May 3, 2023, 9:03 a.m. UTC | #5
On 5/3/23 01:46, Felipe Contreras wrote:
> To be honest this whole approach seems to be completely flawed to me and
> against the whole design of git in the first place.

The discussion above is mostly moot now since this has been fixed in 
later patches in this thread, AFAIK. It's also moot for other reasons, 
like the hash function transition plan is not really implemented, yet.

Also, this was about corner-case, like it often is.


> In a recent email Linus Torvalds explained why object ids were
> calculated based {type, size, data} [1], and he explained very clearly
> that two objects with exactly the same data are not supposed to have the
> same id if the type is different.

This is different. But aside, type + size + data are not really much 
different from just having data in a hash function. There are plenty of 
hash collisions where

     HASH(type + size + data) == HASH(type + size + data')

by definition of how these functions work. The problem is always in 
finding these collisions. But anyway...

> In my view one repository should be able to have part SHA-1 history,
> part SHA3-256 history, and part BLAKE2b history.

Yes, that would be great. Please provide patch series for this :-)

> I have not been following the SHA-1 -> OID discussions, but I
> distinctively recall Linus Torvalds mentioning that the choice of using
> SHA-1 wasn't even for security purposes, it was to ensure integrity.

These are different sides of the same coin. Hashes are used to provide 
integrity. Hashes like MD4, MD5, SHA1, SHA256 are there for integrity. 
Some of these are no longer recommended and some are completely broken.

> Better the SHA-1 you know, than the SHA-256 you don't.

Wrong conclusion ;) Also, we know SHA-256

The problem in git-core and virtually all clients and other 
implementations is/was that SHA1 was hardcoded and assumed to be THE ONE 
and ONLY hash. It will take quite a bit of work outside of git-core to 
remove this one assumption (remember two digit year and 2000? - yes I'm 
old). Once this hash assumption is removed, you can start talking about 
adding other hashes and interop.

Keep in mind -- hashes are there for object reference. They are the glue 
in git. But there is really nothing stopping us from recalculating them 
"on the fly". If you have SHA1 repo, you can calculate a SHA256 or 
whatever hash for any type object. That's not the problem, conceptually 
speaking.

Finally, let not have a "bike shed" discussion about this. The 
GIT_DEFAULT_HASH is meant to be used by `git init` in-lieu of 
--object-format parameter, so it's not flawed. When used in other 
applications, it probably indicates a bug. But we can't fix all the bugs 
at once :-)

Cheers,
- Adam
demerphq May 3, 2023, 9:09 a.m. UTC | #6
On Wed, 3 May 2023 at 02:17, Felipe Contreras
<felipe.contreras@gmail.com> wrote:
>
> Hi,
>
> Changing the subject as this message seems like a different topic.
>
> Jeff King wrote:
> > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > >
> > > >  `GIT_DEFAULT_HASH`::
> > > >   If this variable is set, the default hash algorithm for new
> > > >   repositories will be set to this value. This value is currently
> > > > + ignored when cloning if the remote value can be definitively
> > > > + determined; the setting of the remote repository is used
> > > > + instead. The value is honored if the remote repository's
> > > > + algorithm cannot be determined, such as some cases when
> > > > + the remote repository is empty. The default is "sha1".
> > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > > + in linkgit:git-init[1].
> > >
> > > We'd need to evantually cover all the transports (and non-transport
> > > like the "--local" optimization) so that the object-format and other
> > > choices are communicated from the origin to a new clone anyway, so
> > > this extra complexity "until X is fixed, it behaves this way, but
> > > otherwise the variable is read in the meantime" may be a disservice
> > > to the end users, even though it may make it easier in the shorter
> > > term for maintainers of programs that rely on the buggy "git clone"
> > > that partially honored this environment variable.
> > >
> > > In short, I am still not convinced that the above is a good design
> > > choice in the longer term.
> >
> > I also think it is working against the backwards-compatible design of
> > the hash function transition.
>
> To be honest this whole approach seems to be completely flawed to me and
> against the whole design of git in the first place.
>
> In a recent email Linus Torvalds explained why object ids were
> calculated based {type, size, data} [1], and he explained very clearly
> that two objects with exactly the same data are not supposed to have the
> same id if the type is different.

He said:

--- quote-begin ---
The "no aliasing" means that no two distinct pointers can point to the
same data. So a tagged pointer of type "commit" can not point to the
same object as a tagged pointer of type "blob". They are distinct
pointers, even if (maybe) the commit object encoding ends up then
being identical to a blob object.
--- quote-end ---

As far as I could tell he didn't really explain *why* he wanted this,
and IMO it is non-obvious why he would care if a blob and a commit had
the same text, and thus the same ID. He just said he didnt want it to
happen, not why. I can imagine some aesthetic reasons why you might
want to ensure that no blob has the same ID as a commit, and I can
imagine it might make debugging easier at certain points, but it seems
unnecessary given the data is write once.

> If even the tiniest change such as adding a period to a commit messange
> changes the object id (and thus semantically makes it a different
> object), then it makes sense that changing the type of an object also
> changes the object id (and thus it's also a different object).
>
> And because the id of the parent is included in the content of every
> commit, the top-level id ensures the integrity of the whole graph.
>
> But then comes this notion that the hash algorithm is a property of the
> repository, and not part of the object storage, which means changing the
> whole hash algorithm of a repository is considered less of a change than
> adding a period to the commit message, worse: not a change at all.

I really dont understand why you think having two hash functions
producing different results for the same data is comparable to a
single hash producing different results for different data. In one
case you have two different continuum of identifiers, with one ID per
continuum, and in the other you have two different identifiers in the
same continuum, and  if you a continuum you would have 4 different
identifiers right? Eg, the two cases are really quite different at a
fundamental level.

> I am reminded of the warning Sam Smith gave to the Git project [2] which
> seemed to be unheard, but the notion of cryptographic algorithm agility
> makes complete sense to me.
>
> In my view one repository should be able to have part SHA-1 history,
> part SHA3-256 history, and part BLAKE2b history.

Isn't this orthagonal to your other points?

> Changing the hash algorithm of one commit should change the object id of
> that commit, and thus make it semantically a different commit.
>
> In other words: an object of type "blob" should never be confused with
> an object of type "blob:sha-256", even if the content is exactly the
> same.

This doesn't make sense to me.  As long as we can distinguish the
hashes produced by the different hash functions in use we can create a
mapping of the data that is hashed such that we have a 1:1 mapping of
identifiers of each type at which point it really doesn't matter which
hash function is used.

> The fact that apparently it's so easy to clone a repository with
> the wrong hash algorithm should give developers pause, as it means the
> whole point of using cryptographic hash algorithms to ensure the
> integrity of the commit history is completely gone.

This is a leap too far. The fact that it is "so easy to clone a repo
with the wrong hash algorithm" is completely orthogonal to the
fundamental principles of hash identifiers from strong hash functions.
You seem to be deriving grand conclusions from what sounds to me like
a simple bug/design-oversight.

> I have not been following the SHA-1 -> OID discussions, but I
> distinctively recall Linus Torvalds mentioning that the choice of using
> SHA-1 wasn't even for security purposes, it was to ensure integrity.
> When I do a `git fetch` as long as the new commits have the same SHA-1
> as parent as the SHA-1s I have in my repository I can be relatively
> certain the repository has not been tampered with. Which means that if I
> do a `git fetch` that suddenly brings SHA-256 commits, some of them must
> have SHA-1 parents that match the ones I currently have. Otherwise how
> do I know it's the same history?

So consider what /could/ happen here. You fetch a commit which uses
SHA-256 into a repo where all of your local commits use SHA-1. The
commit you fetched says its parent is some SHA-256 ID you don't know
about as all your ID's are SHA-1. So git then could go and construct
an index, hashing each item using SHA-256 instead of SHA-1, and using
the result to build a bi-directional mapping from SHA-1 to SHA-256 and
back.  All it has to do then is look into the mapping to find if the
SHA-256 parent id is present in your repo. If it is then you know it's
the same history.

The key point here is that if you ignore SHAttered artifacts (which
seems reasonable as you can detect the attack during hashing)  you can
build a 1:1 map of SHA-1 and SHA-256 ids.  Once you have that mapping
it doesn't matter which ID is used.

> Maybe that's one of the reasons people don't seem particularly eager to
> move away from SHA-1:

Maybe, but it doesn't make sense to me.  You seem to be putting undue
weight on an unnecessary aspect of the git design: there doesn't seem
to be a reason for Linuses "no aliasing" policy, and it seems like one
could build a git-a-like without it without suffering any significant
penalties. Regardless, provided that the hash functions allow a 1:1
mapping of ID's (which is assumed by using "collision free hash
functions"), it seems like it really doesn't matter which hash is used
at any given time.

cheers,
Yves
Felipe Contreras May 3, 2023, 3:44 p.m. UTC | #7
Adam Majer wrote:
> On 5/3/23 01:46, Felipe Contreras wrote:
> > To be honest this whole approach seems to be completely flawed to me and
> > against the whole design of git in the first place.
> 
> The discussion above is mostly moot now since this has been fixed in 
> later patches in this thread, AFAIK.

That particular isssue might be fixed, but that issue should never have
happened in the first place if the design was correct.

A bad design makes certain errors prone to happen, a good design makes the same
errors happen rarely, a great design makes those errors impossible.

Git was designed to make it *impossible* to confuse two commits with similar
data.

The symptom might have been fixed, that doesn't mean there's no underlying
problem.

> It's also moot for other reasons, like the hash function transition plan is
> not really implemented, yet.

The implemention of the plan isn't the problem, it's the plan itself.

> Also, this was about corner-case, like it often is.

A corner-case that should be impossible.

> > In a recent email Linus Torvalds explained why object ids were
> > calculated based {type, size, data} [1], and he explained very clearly
> > that two objects with exactly the same data are not supposed to have the
> > same id if the type is different.
> 
> This is different. But aside, type + size + data are not really much 
> different from just having data in a hash function.

It's completely different.

> There are plenty of hash collisions where
> 
>      HASH(type + size + data) == HASH(type + size + data')
> 
> by definition of how these functions work. The problem is always in 
> finding these collisions. But anyway...

I don't think you understand why Linus Torvalds chose to hash objects.

> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> Yes, that would be great. Please provide patch series for this :-)

I have hundreds of patches being ignored, why would I write yet another patch
series that will be ignored?

> > I have not been following the SHA-1 -> OID discussions, but I
> > distinctively recall Linus Torvalds mentioning that the choice of using
> > SHA-1 wasn't even for security purposes, it was to ensure integrity.
> 
> These are different sides of the same coin. Hashes are used to provide 
> integrity. Hashes like MD4, MD5, SHA1, SHA256 are there for integrity. 
> Some of these are no longer recommended and some are completely broken.

There are different philosophical views of what "security" means, and it seems
pretty clear to me that your view does not align with the view of Linus
Torvalds.

> > Better the SHA-1 you know, than the SHA-256 you don't.
> 
> Wrong conclusion ;) Also, we know SHA-256

You don't understand what is being said.

Which hash is more trustworthy?

 a. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
 b. d891b12414e1d9331f8cbb15acfe690671974f27ba76e2b423294cfb7a055f2f

If you answer b just beacuse it's SHA-256 you don't understand security.

b is a random commit I generated, a is the current git.git master.

A SHA-1 hash from a source you trust is inifinitely more trustworthy than a
random SHA-256 hash. Even a known MD5 hash is better in this respect.

> Keep in mind -- hashes are there for object reference.

No. I don't think you understand why Linus Torvalds used hashes.

> If you have SHA1 repo, you can calculate a SHA256 or whatever hash for any
> type object.

I know it *can* be done, I understand how hash algorithms work, but just
because something *can* be done doesn't mean it *should*.

You *can* generate a SHA-1 of a blob's data, instead of a SHA-1 of a blob's
`type + size + data`, does that mean we should? No.

> Finally, let not have a "bike shed" discussion about this.

Discussing the original design of git's object storage which has withstood the
test of time for 18 years is not "bike sheding".

I don't even think you understand what I'm trying to say.

---

Why do you think these commands generate different hashes?

  git hash-object -t blob /dev/null
  git hash-object -t tree /dev/null
Adam Majer May 3, 2023, 5:21 p.m. UTC | #8
On May 3, 2023 5:44:24 p.m. GMT+02:00, Felipe Contreras <felipe.contreras@gmail.com> wrote:
>Git was designed to make it *impossible* to confuse two commits with similar
>data.

That was never ever the problem here.



>> This is different. But aside, type + size + data are not really much 
>> different from just having data in a hash function.
>
>It's completely different.

How so? Type and size are just about 2 and a dozen bits of data, respectfully.


>There are different philosophical views of what "security" means, and it seems
>pretty clear to me that your view does not align with the view of Linus
>Torvalds.


I'm not sure why you are name dropping Linus everywhere or assuming you know more than anyone here about hash functions.

Your explanation is quite clear to me (and probably everyone else here). But I'll just leave it at that.

Cheers,
Adam
Felipe Contreras May 3, 2023, 6:20 p.m. UTC | #9
demerphq wrote:
> On Wed, 3 May 2023 at 02:17, Felipe Contreras
> <felipe.contreras@gmail.com> wrote:
> > Changing the subject as this message seems like a different topic.
> > Jeff King wrote:
> > > On Wed, Apr 26, 2023 at 02:33:30PM -0700, Junio C Hamano wrote:
> > > > "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> > > >
> > > > >  `GIT_DEFAULT_HASH`::
> > > > >   If this variable is set, the default hash algorithm for new
> > > > >   repositories will be set to this value. This value is currently
> > > > > + ignored when cloning if the remote value can be definitively
> > > > > + determined; the setting of the remote repository is used
> > > > > + instead. The value is honored if the remote repository's
> > > > > + algorithm cannot be determined, such as some cases when
> > > > > + the remote repository is empty. The default is "sha1".
> > > > > + THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
> > > > > + in linkgit:git-init[1].
> > > >
> > > > We'd need to evantually cover all the transports (and non-transport
> > > > like the "--local" optimization) so that the object-format and other
> > > > choices are communicated from the origin to a new clone anyway, so
> > > > this extra complexity "until X is fixed, it behaves this way, but
> > > > otherwise the variable is read in the meantime" may be a disservice
> > > > to the end users, even though it may make it easier in the shorter
> > > > term for maintainers of programs that rely on the buggy "git clone"
> > > > that partially honored this environment variable.
> > > >
> > > > In short, I am still not convinced that the above is a good design
> > > > choice in the longer term.
> > >
> > > I also think it is working against the backwards-compatible design of
> > > the hash function transition.
> >
> > To be honest this whole approach seems to be completely flawed to me and
> > against the whole design of git in the first place.
> >
> > In a recent email Linus Torvalds explained why object ids were
> > calculated based {type, size, data} [1], and he explained very clearly
> > that two objects with exactly the same data are not supposed to have the
> > same id if the type is different.
> 
> He said:
> 
> --- quote-begin ---
> The "no aliasing" means that no two distinct pointers can point to the
> same data. So a tagged pointer of type "commit" can not point to the
> same object as a tagged pointer of type "blob". They are distinct
> pointers, even if (maybe) the commit object encoding ends up then
> being identical to a blob object.
> --- quote-end ---
> 
> As far as I could tell he didn't really explain *why* he wanted this,
> and IMO it is non-obvious why he would care if a blob and a commit had
> the same text, and thus the same ID. He just said he didnt want it to
> happen, not why.

But we don't need to understand why to know it's part of the core design.

If something is part of the core design as a rule it's better to not
mess with it.

> I can imagine some aesthetic reasons why you might want to ensure that
> no blob has the same ID as a commit, and I can imagine it might make
> debugging easier at certain points, but it seems unnecessary given the
> data is write once.

I don't know, but to me separating objects makes sense not just conceptually,
but in practice there's a whole class of potential errors that could be
avoided.

For example, I can think of an implementation of `git prune` that would check
commits first, then trees, then blobs, and the blobs that not reachable from
any trees are removed. But if a commit can have the same id as a blob, you have
to think of a different implementation.

If that's not possible, then you just forget about those potential issues.

> > If even the tiniest change such as adding a period to a commit messange
> > changes the object id (and thus semantically makes it a different
> > object), then it makes sense that changing the type of an object also
> > changes the object id (and thus it's also a different object).
> >
> > And because the id of the parent is included in the content of every
> > commit, the top-level id ensures the integrity of the whole graph.
> >
> > But then comes this notion that the hash algorithm is a property of the
> > repository, and not part of the object storage, which means changing the
> > whole hash algorithm of a repository is considered less of a change than
> > adding a period to the commit message, worse: not a change at all.
> 
> I really dont understand why you think having two hash functions
> producing different results for the same data is comparable to a
> single hash producing different results for different data.

That depends on what you consider the "data" to be.

If you consider the content of a blob to be the data, then you wouldn't
have different results if a commit has the same data: it would be the
same id.

If instead you consider the data to be `type+content`, then you would
have different results.

> In one case you have two different continuum of identifiers, with one
> ID per continuum, and in the other you have two different identifiers
> in the same continuum, and  if you a continuum you would have 4
> different identifiers right? Eg, the two cases are really quite
> different at a fundamental level.

That entirely depends on what data you hash.

If you hash `algo+type+size+data` there's only one id per object.
Period.

> > I am reminded of the warning Sam Smith gave to the Git project [2] which
> > seemed to be unheard, but the notion of cryptographic algorithm agility
> > makes complete sense to me.
> >
> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> Isn't this orthagonal to your other points?

Not if you consider changing the hash algorithm of a repository to be an
important part of its history (more important than adding a period to a
commit).

> > Changing the hash algorithm of one commit should change the object id of
> > that commit, and thus make it semantically a different commit.
> >
> > In other words: an object of type "blob" should never be confused with
> > an object of type "blob:sha-256", even if the content is exactly the
> > same.
> 
> This doesn't make sense to me.  As long as we can distinguish the
> hashes produced by the different hash functions in use we can create a
> mapping of the data that is hashed such that we have a 1:1 mapping of
> identifiers of each type at which point it really doesn't matter which
> hash function is used.

Yes, we *can*, that doesn't mean we *should*.

If you do `git commit --ammend --signoff` to add your `Signed-off-by` to
a commit, there's a 1:1 mapping from the original commit, to the new
one, but conceptually in git they are different objects.

I recall Linus Torvalds mentioned he used Monotone as a guideline of
what *not* to do. In Monotone you could add the equivalent of
`Signed-off-by` without changing the hash of the commit, in fact, you
could add any metadata if I recall correctly. But this opens a whole can
of worms because now how do you know you have all the metadata relevant
to the commit?

Making all the metadata of a commit part of the commit solves the
integrity problem Monotone had at the cost of making git commits
essentially immutable: any change means it's a different commit.

If making *any* change in the object, makes it conceptually a different
object, including the type of the object, how on Earth is changing the
hash algorithm not considered a change?

This object:

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Is considered different from this object:

  ❯ git hash-object -t tree /dev/null
  4b825dc642cb6eb9a060e54bf8d69288fbee4904

That's why they have a different hash.

Why would these objects be considered the same?

  ❯ git hash-object -t blob /dev/null
  e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

  ❯ git hash-object -t blob /dev/null
  473a0f4c3be8a93681a267e3b1e9a7dcda1185436fe141f7749120a303721813

It makes *zero* sense that adding a period changes the object, adding a
s-o-b changes the object, changing the type changes the object, but
changing the hash algorithm does not.

> > The fact that apparently it's so easy to clone a repository with
> > the wrong hash algorithm should give developers pause, as it means the
> > whole point of using cryptographic hash algorithms to ensure the
> > integrity of the commit history is completely gone.
> 
> This is a leap too far. The fact that it is "so easy to clone a repo
> with the wrong hash algorithm" is completely orthogonal to the
> fundamental principles of hash identifiers from strong hash functions.

Only if you think changing the hash algorithm is a less important part
of an object than adding a period.

Do you honestly think these two should be considered the same object?

a)

  tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

b)

  tree 6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321
  author Felipe Contreras <felipe.contreras@gmail.com> 0 -0600
  committer Felipe Contreras <felipe.contreras@gmail.com> 0 -0600

  Initial commit

> You seem to be deriving grand conclusions from what sounds to me like
> a simple bug/design-oversight.

I think you are dismissing the brilliant idea that made git's object
storage model so successful.

> > I have not been following the SHA-1 -> OID discussions, but I
> > distinctively recall Linus Torvalds mentioning that the choice of using
> > SHA-1 wasn't even for security purposes, it was to ensure integrity.
> > When I do a `git fetch` as long as the new commits have the same SHA-1
> > as parent as the SHA-1s I have in my repository I can be relatively
> > certain the repository has not been tampered with. Which means that if I
> > do a `git fetch` that suddenly brings SHA-256 commits, some of them must
> > have SHA-1 parents that match the ones I currently have. Otherwise how
> > do I know it's the same history?
> 
> So consider what /could/ happen here. You fetch a commit which uses
> SHA-256 into a repo where all of your local commits use SHA-1. The
> commit you fetched says its parent is some SHA-256 ID you don't know
> about as all your ID's are SHA-1. So git then could go and construct
> an index, hashing each item using SHA-256 instead of SHA-1, and using
> the result to build a bi-directional mapping from SHA-1 to SHA-256 and
> back.  All it has to do then is look into the mapping to find if the
> SHA-256 parent id is present in your repo. If it is then you know it's
> the same history.

Yeah, that *could* happen. Doesn't mean it *should*.

> The key point here is that if you ignore SHAttered artifacts (which
> seems reasonable as you can detect the attack during hashing)  you can
> build a 1:1 map of SHA-1 and SHA-256 ids.  Once you have that mapping
> it doesn't matter which ID is used.

It may not matter to you. It matters to me.

> > Maybe that's one of the reasons people don't seem particularly eager to
> > move away from SHA-1:
> 
> Maybe, but it doesn't make sense to me.  You seem to be putting undue
> weight on an unnecessary aspect of the git design: there doesn't seem
> to be a reason for Linuses "no aliasing" policy, and it seems like one
> could build a git-a-like without it without suffering any significant
> penalties.

The fact you don't see a reason doesn't mean there isn't one. This is an
argument from ignorance fallacy.

The appendix was considered a vestigial organ because nobody could see a
reason for it. Did that mean it served no puprpose? No.

> Regardless, provided that the hash functions allow a 1:1 mapping of
> ID's (which is assumed by using "collision free hash functions"), it
> seems like it really doesn't matter which hash is used at any given
> time.

People who designed CVS, Subversion, Monotone, and Mercurial didn't see
a reason for many of Git's design choices either.

I'd argue they were wrong.

I think changing the hash algorithm of a commit matters.

Cheers.
brian m. carlson May 3, 2023, 10:54 p.m. UTC | #10
On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
> In my view one repository should be able to have part SHA-1 history,
> part SHA3-256 history, and part BLAKE2b history.

That is practically very difficult and it means that it's hard to have
confidence in the later history because SHA-1 is weak and you have to
rely on it to verify the SHA-256 history later.  Since attacks always
get better, SHA-1 will eventually be so weak that collisions can be
computed in the amount of time we now take for MD4 or MD5 collisions
(i.e., seconds), and with your plan, we'd have to retain that history
forever with the resulting lack of confidence in part of the history.

This also doesn't work with various structures like trees, the index,
and pack and index formats, which have no indication of the algorithm
used and simply rely on fixed-size, often 4-byte aligned object IDs
without any metadata.  In addition, the internals of the code often
don't pass around enough data to make these values variable and thus
this approach would substantially complicate the code in many ways.

Also, we've already decided on the current design a long time ago with
the transition plan after extensive, thoughtful discussion by many
people.  Very few people other than me have worked on sending patches to
work on the hash function transition, and that work up to now has all
been done on my personal time, without compensation of any sort, out of
a desire to improve the project.  Lots of people have opined on how it
should have been different without sending any patches.

If you would like to propose patches for the extensive amount of work to
implement your solution, then we could consider them, although I will
warn you that your approach will likely require at least several hundred
patches.  However, I refer you to the list archives to determine why
your approach is not the one we chose and is not, in my view, the best
path forward.  I should also be clear that I have no intention of
submitting patches to change our approach now or in the future, or
redoing the patches I've already sent.

> The fact that apparently it's so easy to clone a repository with
> the wrong hash algorithm should give developers pause, as it means the
> whole point of using cryptographic hash algorithms to ensure the
> integrity of the commit history is completely gone.

No, it doesn't.  It means that our empty repositories until recently
lacked any indication of the algorithm or other capabilities, which was
a mistake in our original protocol design that has now been corrected.

If you interact with the repository later on when it has data, then if
you're using the wrong hash algorithm, you'll find that you get a
helpful error message that that's not yet supported.  If you patched Git
to ignore that check, you'd find that your repository would just be very
broken in many ways with lots of random crashing and seemingly unrelated
error messages instead of subtly using the wrong algorithm.
Felipe Contreras May 8, 2023, 12:34 a.m. UTC | #11
Adam Majer wrote:
> On May 3, 2023 5:44:24 p.m. GMT+02:00, Felipe Contreras <felipe.contreras@gmail.com> wrote:
> >Git was designed to make it *impossible* to confuse two commits with similar
> >data.
> 
> That was never ever the problem here.

But it will be.

> >> This is different. But aside, type + size + data are not really much 
> >> different from just having data in a hash function.
> >
> >It's completely different.
> 
> How so? Type and size are just about 2 and a dozen bits of data, respectfully.

Do you understand how checksums work?

Compare these two objects:

 1. 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
 2. 6ba62a7c5e3e9a260c5a30adf2756882c02f12a6

Are they a) "not much different", or b) "completely different"?

Answer: doesn't matter, they are *different*. Period.

> >There are different philosophical views of what "security" means, and it seems
> >pretty clear to me that your view does not align with the view of Linus
> >Torvalds.
> 
> 
> I'm not sure why you are name dropping Linus everywhere

I don't know if you are aware, but Linus Torvalds is the author of git.

He also happens to be the author of the most successful software project
in history: Linux.

So generally his design choices are considered to be good.

> or assuming you know more than anyone here about hash functions.

I don't assume such a thing.

But I'm pretty certain not many people are aware of the integrity issues
VCSs presented circa 2004, that git hashes solved in 2005, because if
they did, they could have created an object model storage similar to
git's, and no one did (except Linus Torvalds).

> Your explanation is quite clear to me (and probably everyone else
> here). But I'll just leave it at that.

Is it? Then you would have no trouble steel manning my argument, which
you haven't done.
Felipe Contreras May 8, 2023, 2 a.m. UTC | #12
brian m. carlson wrote:
> On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
> > In my view one repository should be able to have part SHA-1 history,
> > part SHA3-256 history, and part BLAKE2b history.
> 
> That is practically very difficult and it means that it's hard to have
> confidence in the later history because SHA-1 is weak and you have to
> rely on it to verify the SHA-256 history later.

Why would I have to rely on SHA-1 to verify the SHA-256 history later
on?

> Since attacks always get better, SHA-1 will eventually be so weak that
> collisions can be computed in the amount of time we now take for MD4
> or MD5 collisions (i.e., seconds), and with your plan, we'd have to
> retain that history forever with the resulting lack of confidence in
> part of the history.

We have to do the same with your plan as well.

Your plan relies on SHA-256 being interchangeable with SHA-1, so if the
Git project decided to switch *today* to SHA-256, we would have two
object ids:

 1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
 2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116

This object would refer to a tree and a parent object with SHA-1 ids,
which would be OK, because they would be interchangeable with some
corresponding SHA-256 ids.

Isn't that your plan?

Therefore the SHA-1 of the parent of the commit, and the tree of the
commit would be trusted and retained forever.

> This also doesn't work with various structures like trees, the index,
> and pack and index formats, which have no indication of the algorithm
> used and simply rely on fixed-size, often 4-byte aligned object IDs
> without any metadata.

So? The index and pack objects can be regenerated, so at any point in
time they could be regenerated for SHA-1 or SHA-256.

The tree object is a no-brainer. For an object of type "commit:256" you
require a tree of type "tree:256". Easy.

> In addition, the internals of the code often don't pass around enough
> data to make these values variable and thus this approach would
> substantially complicate the code in many ways.

Really? `enum object_type` is not passed around?

> Also, we've already decided on the current design a long time ago with
> the transition plan after extensive, thoughtful discussion by many
> people.

Who is "we"?

I've participated in many discussions in the git mailing list where the
consensus is that 99% of people decide to do something, and that
something never happens.

The fact that "we" have decided something doesn't carry as much weight
as you seem to think it does.

Moreover, haven't "we" decided that this transitioning plan is
*tentantive*, and the SHA-256 feature is *experimental*?

> Very few people other than me have worked on sending patches to
> work on the hash function transition, and that work up to now has all
> been done on my personal time, without compensation of any sort, out of
> a desire to improve the project.

Which seems to suggest if there is a need, it's not very pressing.

Doesn't it?

> Lots of people have opined on how it should have been different
> without sending any patches.

As is typical.

> If you would like to propose patches for the extensive amount of work
> to implement your solution, then we could consider them, although I
> will warn you that your approach will likely require at least several
> hundred patches.

That's not an issue. I've started projects with several hundred patches
just to prove that something is possible.

> However, I refer you to the list archives to determine why
> your approach is not the one we chose and is not, in my view, the best
> path forward.

Yeah? Provide me with *one* mail proposing my approach.

> I should also be clear that I have no intention of submitting patches
> to change our approach now or in the future, or redoing the patches
> I've already sent.

You don't have to. (and it's not really necessary as it's typically the
case that people don't provide patches for designs that compete against
their own).

> > The fact that apparently it's so easy to clone a repository with
> > the wrong hash algorithm should give developers pause, as it means the
> > whole point of using cryptographic hash algorithms to ensure the
> > integrity of the commit history is completely gone.
> 
> No, it doesn't.  It means that our empty repositories until recently
> lacked any indication of the algorithm or other capabilities, which was
> a mistake in our original protocol design that has now been corrected.

Yes it does.

Can I clone a repository that already transitioned to SHA-256, and then
push a SHA-1 commit?

Well, of course I can't because that's not currently implemented, but if
we followed the current plan that apparently "we" have decided on, it
should be.

> If you interact with the repository later on when it has data, then if
> you're using the wrong hash algorithm, you'll find that you get a
> helpful error message that that's not yet supported.

That isn't true according to your plan.

SHA-1 would be interchangeable with SHA-256, would it not? So according
to the current plan, I would be able to push a SHA-1 commit on a SHA-256
repository.

---

Is it not the case that the current plan aims to have support for SHA-1
and SHA-256 object ids at the same time?

In other words: in your ideal world, the following object ids would
*both* refer to the same git object:

 1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
 2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116

Would they not?
brian m. carlson May 8, 2023, 9:38 p.m. UTC | #13
On 2023-05-08 at 02:00:56, Felipe Contreras wrote:
> brian m. carlson wrote:
> > On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
> > > In my view one repository should be able to have part SHA-1 history,
> > > part SHA3-256 history, and part BLAKE2b history.
> > 
> > That is practically very difficult and it means that it's hard to have
> > confidence in the later history because SHA-1 is weak and you have to
> > rely on it to verify the SHA-256 history later.
> 
> Why would I have to rely on SHA-1 to verify the SHA-256 history later
> on?

If your history contains mixed and matched hash algorithms, you'll need
to be able to verify those commits to the root to have any confidence in
a signed commit or tag, which means trusting SHA-1 if you have any SHA-1
commits in the repository.

> > Since attacks always get better, SHA-1 will eventually be so weak that
> > collisions can be computed in the amount of time we now take for MD4
> > or MD5 collisions (i.e., seconds), and with your plan, we'd have to
> > retain that history forever with the resulting lack of confidence in
> > part of the history.
> 
> We have to do the same with your plan as well.
> 
> Your plan relies on SHA-256 being interchangeable with SHA-1, so if the
> Git project decided to switch *today* to SHA-256, we would have two
> object ids:
> 
>  1. 69c786637d7a7fe3b2b8f7d989af095f5f49c3a8
>  2. 2b4ebdace10518280172449c012af17b51e9d46e023a91a5d3dd3a8ad9e4a116
> 
> This object would refer to a tree and a parent object with SHA-1 ids,
> which would be OK, because they would be interchangeable with some
> corresponding SHA-256 ids.
> 
> Isn't that your plan?

For a period of time, yes.  At some point, people will abandon SHA-1 and
won't use it anymore for a particular repository, and then its security
doesn't matter.

> Therefore the SHA-1 of the parent of the commit, and the tree of the
> commit would be trusted and retained forever.

Nope.  At some point, we just turn off SHA-1.

> > This also doesn't work with various structures like trees, the index,
> > and pack and index formats, which have no indication of the algorithm
> > used and simply rely on fixed-size, often 4-byte aligned object IDs
> > without any metadata.
> 
> So? The index and pack objects can be regenerated, so at any point in
> time they could be regenerated for SHA-1 or SHA-256.

Right, and that point you've basically converted the repository over.

> The tree object is a no-brainer. For an object of type "commit:256" you
> require a tree of type "tree:256". Easy.

That doesn't work with the pack format because there are only seven
valid types of objects, and five of them are used.

> > Also, we've already decided on the current design a long time ago with
> > the transition plan after extensive, thoughtful discussion by many
> > people.
> 
> Who is "we"?

The list members.

> I've participated in many discussions in the git mailing list where the
> consensus is that 99% of people decide to do something, and that
> something never happens.
> 
> The fact that "we" have decided something doesn't carry as much weight
> as you seem to think it does.

Jonathan Nieder decided to propose an approach for how we'd go about
this, and it was discussed extensively on the list and the parts that
I've implemented have almost completely conformed to that documentation.

I think his approach was very thoughtful and addressed many questions
about how the project was to proceed, and I appreciate that he sent it
and that others contributed in a helpful way.

The project wouldn't have been possible unless we had a clear decision
on how to implement things.  There was an opportunity at the time to
comment on the approach and propose alternatives, and we didn't choose
to adopt any.

> Moreover, haven't "we" decided that this transitioning plan is
> *tentantive*, and the SHA-256 feature is *experimental*?

We've documented that it's experimental because it's not seeing wide
use.  I expect that will change, at which point it will no longer be
experimental.  The design is not tentative and there are no plans to
change it.

> > Very few people other than me have worked on sending patches to
> > work on the hash function transition, and that work up to now has all
> > been done on my personal time, without compensation of any sort, out of
> > a desire to improve the project.
> 
> Which seems to suggest if there is a need, it's not very pressing.
> 
> Doesn't it?

No, I don't agree.  An approach to continue to use SHA-1 indefinitely
means that Git will not be viable at many major organizations and
in many major governments which have restrictions on using insecure
cryptography.

I think this ends the point at which I'd like to respond to your
proposal further.  I continue to feel that it's argumentative,
unproductive, and not in the best interests of the project, and I don't
think continuing here is going to shed any more light on the topic.

Ultimately, I feel that my previous statement that we're not going to
see eye to eye on most things continues to be correct.  I also don't
think arguing with you is a productive use of my time or a positive
contribution to the community, and in general, I'm going to avoid
commenting on your patches or proposals because I can't see a way that
we can interact in a useful and helpful way on the list or elsewhere. As
a result, I'll kindly ask you to refrain from CCing me on further emails
to the list or otherwise emailing me for the indefinite future, and I'll
do the same for you after this email.
Oswald Buddenhagen May 9, 2023, 10:32 a.m. UTC | #14
On Mon, May 08, 2023 at 09:38:56PM +0000, brian m. carlson wrote:
>On 2023-05-08 at 02:00:56, Felipe Contreras wrote:
>> brian m. carlson wrote:
>> > On 2023-05-02 at 23:46:02, Felipe Contreras wrote:
>> > > In my view one repository should be able to have part SHA-1 history,
>> > > part SHA3-256 history, and part BLAKE2b history.
>> > 
>> > That is practically very difficult and it means that it's hard to have
>> > confidence in the later history because SHA-1 is weak and you have to
>> > rely on it to verify the SHA-256 history later.
>> 
>> Why would I have to rely on SHA-1 to verify the SHA-256 history later
>> on?
>
>If your history contains mixed and matched hash algorithms, you'll need
>to be able to verify those commits to the root to have any confidence in
>a signed commit or tag, which means trusting SHA-1 if you have any SHA-1
>commits in the repository.
>
the history is traversed from the end anyway, so having sha-1 in the 
history is entirely irrelevant for verifying sha-256 commits, assuming 
one may only upgrade the algorithm.

the transition plan implies the intent to ultimately get rid of old 
algos, but this is a non-starter, because old histories need to remain 
accessible indefinitely (you can't rewrite all external references, and 
even for in-history references this would be unreliable and would 
falsify historical builds).

i won't try making an argument for mixed histories, as i'm assuming i 
wouldn't add anything that hasn't already been written.

-- ossi
Junio C Hamano May 9, 2023, 4:47 p.m. UTC | #15
Oswald Buddenhagen <oswald.buddenhagen@gmx.de> writes:

>>If your history contains mixed and matched hash algorithms, you'll need
>>to be able to verify those commits to the root to have any confidence in
>>a signed commit or tag, which means trusting SHA-1 if you have any SHA-1
>>commits in the repository.
>>
> the history is traversed from the end anyway, so having sha-1 in the
> history is entirely irrelevant for verifying sha-256 commits, assuming
> one may only upgrade the algorithm.

That depends on what is meant by "verify a commit".  If we are only
interested in the tree contents, then the newer commits whose trees
are hashed with a more secure hash algorithm recursively down to the
blobs would lack any weak link hashed by a less secure algorithm.
Most people do not care as deeply how their project tree came to the
shape it has today as they care about what is in the recent trees,
so this is an acceptable stance to take.

If the less secure algorithm becomes so weak that the history, up to
the last commit that was signed by it, can be rewritten arbitrarily,
however, an attacker can lie about why the code that survives to
this day looks the way it does by forging old parts of the history,
and mislead today's developers to do wrong things.  The only way to
prevent that kind of attack is to verify a commit by recursively
making sure not just the commit and its tree, but its parents are
all authentic, and less secure algorithm may make it impossible.

The old history hashed with the old algorithm needs to be kept to
help external references, but I think as a mitigation, those who
care about that part of history can create a copy of the history
hashed with the new algorithm and publish the correspondence between
the two parallel histories to assure the integrity of that part of
the old history.
diff mbox series

Patch

diff --git a/Documentation/git.txt b/Documentation/git.txt
index 74973d3cc4..48eda9f883 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -547,9 +547,13 @@  double-quotes and respecting backslash escapes. E.g., the value
 `GIT_DEFAULT_HASH`::
 	If this variable is set, the default hash algorithm for new
 	repositories will be set to this value. This value is currently
-	ignored when cloning; the setting of the remote repository
-	is used instead. The default is "sha1". THIS VARIABLE IS
-	EXPERIMENTAL! See `--object-format` in linkgit:git-init[1].
+	ignored when cloning if the remote value can be definitively
+	determined; the setting of the remote repository is used
+	instead. The value is honored if the remote repository's
+	algorithm cannot be determined, such as some cases when
+	the remote repository is empty. The default is "sha1".
+	THIS VARIABLE IS EXPERIMENTAL! See `--object-format`
+	in linkgit:git-init[1].
 
 Git Commits
 ~~~~~~~~~~~
diff --git a/builtin/clone.c b/builtin/clone.c
index 186845ef0b..c207798de9 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1316,13 +1316,15 @@  int cmd_clone(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	if (transport_get_hash_algo_explicit(transport)) {
 		/*
 		 * Now that we know what algorithm the remote side is using,
 		 * let's set ours to the same thing.
 		 */
-	hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
-	initialize_repository_version(hash_algo, 1);
-	repo_set_hash_algo(the_repository, hash_algo);
+		hash_algo = hash_algo_by_ptr(transport_get_hash_algo(transport));
+		initialize_repository_version(hash_algo, 1);
+		repo_set_hash_algo(the_repository, hash_algo);
+	}
 
 	if (mapped_refs) {
 		/*
diff --git a/connect.c b/connect.c
index 3a0186280c..40cb9bf261 100644
--- a/connect.c
+++ b/connect.c
@@ -243,8 +243,10 @@  static void process_capabilities(struct packet_reader *reader, int *linelen)
 	if (feat_val) {
 		char *hash_name = xstrndup(feat_val, feat_len);
 		int hash_algo = hash_algo_by_name(hash_name);
-		if (hash_algo != GIT_HASH_UNKNOWN)
+		if (hash_algo != GIT_HASH_UNKNOWN) {
 			reader->hash_algo = &hash_algos[hash_algo];
+			reader->hash_algo_explicit = 1;
+		}
 		free(hash_name);
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
@@ -493,6 +495,7 @@  static void send_capabilities(int fd_out, struct packet_reader *reader)
 		if (hash_algo == GIT_HASH_UNKNOWN)
 			die(_("unknown object format '%s' specified by server"), hash_name);
 		reader->hash_algo = &hash_algos[hash_algo];
+		reader->hash_algo_explicit = 1;
 		packet_write_fmt(fd_out, "object-format=%s", reader->hash_algo->name);
 	} else {
 		reader->hash_algo = &hash_algos[GIT_HASH_SHA1];
diff --git a/pkt-line.h b/pkt-line.h
index 8e9846f315..10700a9d8c 100644
--- a/pkt-line.h
+++ b/pkt-line.h
@@ -190,6 +190,8 @@  struct packet_reader {
 	int line_peeked;
 
 	unsigned use_sideband : 1;
+	/* indicates if we saw an explicit capability */
+	unsigned hash_algo_explicit : 1;
 	const char *me;
 
 	/* hash algorithm in use */
diff --git a/t/t5700-protocol-v1.sh b/t/t5700-protocol-v1.sh
index 3cd9db9012..ad24c7fe64 100755
--- a/t/t5700-protocol-v1.sh
+++ b/t/t5700-protocol-v1.sh
@@ -244,6 +244,17 @@  test_expect_success 'push with ssh:// using protocol v1' '
 	grep "push< version 1" log
 '
 
+test_expect_success 'clone propagates object-format from empty repo' '
+	test_when_finished "rm -fr src256 dst256" &&
+
+	echo sha256 >expect &&
+	git init --object-format=sha256 src256 &&
+	GIT_DEFAULT_HASH=sha256 git -c protocol.version=1 clone --no-local src256 dst256 &&
+	git -C dst256 rev-parse --show-object-format >actual &&
+
+	test_cmp expect actual
+'
+
 # Test protocol v1 with 'http://' transport
 #
 . "$TEST_DIRECTORY"/lib-httpd.sh
diff --git a/transport-helper.c b/transport-helper.c
index 6b816940dc..c65cf7c620 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -1236,6 +1236,7 @@  static struct ref *get_refs_list_using_list(struct transport *transport,
 					die(_("unsupported object format '%s'"),
 					    value);
 				transport->hash_algo = &hash_algos[algo];
+				transport->hash_algo_explicit = 1;
 			}
 			continue;
 		}
diff --git a/transport.c b/transport.c
index 67afdae57c..7774487e8d 100644
--- a/transport.c
+++ b/transport.c
@@ -147,6 +147,12 @@  static void get_refs_from_bundle_inner(struct transport *transport)
 		die(_("could not read bundle '%s'"), transport->url);
 
 	transport->hash_algo = data->header.hash_algo;
+	/*
+	 * This is always set, even if we didn't get an explicit object-format
+	 * capability, since we know that a missing capability or a v2 bundle
+	 * definitively indicates SHA-1.
+	 */
+	transport->hash_algo_explicit = 1;
 }
 
 static struct ref *get_refs_from_bundle(struct transport *transport,
@@ -190,6 +196,7 @@  static int fetch_refs_from_bundle(struct transport *transport,
 	ret = unbundle(the_repository, &data->header, data->fd,
 		       &extra_index_pack_args, 0);
 	transport->hash_algo = data->header.hash_algo;
+	transport->hash_algo_explicit = 1;
 	return ret;
 }
 
@@ -360,6 +367,7 @@  static struct ref *handshake(struct transport *transport, int for_push,
 	}
 	data->finished_handshake = 1;
 	transport->hash_algo = reader.hash_algo;
+	transport->hash_algo_explicit = reader.hash_algo_explicit;
 
 	if (reader.line_peeked)
 		BUG("buffer must be empty at the end of handshake()");
@@ -1190,6 +1198,7 @@  struct transport *transport_get(struct remote *remote, const char *url)
 	}
 
 	ret->hash_algo = &hash_algos[GIT_HASH_SHA1];
+	ret->hash_algo_explicit = 0;
 
 	return ret;
 }
@@ -1199,6 +1208,11 @@  const struct git_hash_algo *transport_get_hash_algo(struct transport *transport)
 	return transport->hash_algo;
 }
 
+int transport_get_hash_algo_explicit(struct transport *transport)
+{
+	return transport->hash_algo_explicit;
+}
+
 int transport_set_option(struct transport *transport,
 			 const char *name, const char *value)
 {
diff --git a/transport.h b/transport.h
index 6393cd9823..ce67eefc58 100644
--- a/transport.h
+++ b/transport.h
@@ -128,6 +128,11 @@  struct transport {
 	 * in transport_set_verbosity().
 	 **/
 	unsigned progress : 1;
+	/*
+	 * Indicates whether the hash algorithm was initialized explicitly as
+	 * opposed to using a fallback.
+	 */
+	unsigned hash_algo_explicit : 1;
 	/*
 	 * If transport is at least potentially smart, this points to
 	 * git_transport_options structure to use in case transport
@@ -305,6 +310,15 @@  int transport_get_remote_bundle_uri(struct transport *transport);
  * This can only be called after fetching the remote refs.
  */
 const struct git_hash_algo *transport_get_hash_algo(struct transport *transport);
+/*
+ * Fetch whether the hash algorithm provided was explicitly set.
+ *
+ * If this value is false, "transport_get_hash_algo" will always return a value
+ * of SHA-1, which is the default algorithm if none is specified.
+ *
+ * This can only be called after fetching the remote refs.
+ */
+int transport_get_hash_algo_explicit(struct transport *transport);
 int transport_fetch_refs(struct transport *transport, struct ref *refs);
 
 /*