mbox series

[00/11] Stop relying on SHA1 fallback for `the_hash_algo`

Message ID cover.1713519789.git.ps@pks.im (mailing list archive)
Headers show
Series Stop relying on SHA1 fallback for `the_hash_algo` | expand

Message

Patrick Steinhardt April 19, 2024, 9:51 a.m. UTC
Hi,

when starting up, Git will initialize `the_repository` by calling
`initialize_the_repository()`. Part of this is also that we set the hash
algo of `the_repository` to SHA1, which implicitly sets `the_hash_algo`
because it is a macro expanding to `the_repository->hash_algo`.

Usually, we eventually set up the correct hash algorithm here once we
have properly set up `the_repository` via `setup_git_directory()`. But
in some commands we actually don't require a Git repository, and thus we
will leave the SHA1 hash algorithm in place.

This has led to some subtle bugs when the context really asks for a
SHA256 repository, which this patch series corrects for most of the
part. Some commands need further work, like for example git-diff(1),
where the user might want to have the ability to pick a hash function
when run outside of a repository.

Ultimately, the last patch then drops the setup of the fallback hash
algorithm completely. This will cause `the_hash_algo` to be a `NULL`
pointer unless explicitly configured, and thus we now start to crash
when it gets accessed without doing so beforehand. This is a rather
risky thing to do, but it does catch bugs where we might otherwise
accidentally do the wrong thing. And even though I think it is the right
thing to do even conceptually, I'd be okay to drop it if folks think
that the risk is not worth it.

Patrick

Patrick Steinhardt (11):
  path: harden validation of HEAD with non-standard hashes
  parse-options-cb: only abbreviate hashes when hash algo is known
  attr: don't recompute default attribute source
  attr: fix BUG() when parsing attrs outside of repo
  remote-curl: fix parsing of detached SHA256 heads
  builtin/rev-parse: allow shortening to more than 40 hex characters
  builtin/blame: don't access potentially unitialized `the_hash_algo`
  builtin/bundle: abort "verify" early when there is no repository
  builtin/diff: explicitly set hash algo when there is no repo
  builtin/shortlog: don't set up revisions without repo
  repository: stop setting SHA1 as the default object hash

 attr.c                     | 31 ++++++++++++++++++++++---------
 builtin/blame.c            |  5 ++---
 builtin/bundle.c           |  5 +++++
 builtin/diff.c             |  9 +++++++++
 builtin/rev-parse.c        |  5 ++---
 builtin/shortlog.c         |  2 +-
 parse-options-cb.c         |  3 ++-
 path.c                     |  2 +-
 remote-curl.c              | 19 ++++++++++++++++++-
 repository.c               |  2 --
 t/t0003-attributes.sh      | 15 +++++++++++++++
 t/t0040-parse-options.sh   | 17 +++++++++++++++++
 t/t1500-rev-parse.sh       |  6 ++++++
 t/t5550-http-fetch-dumb.sh | 15 +++++++++++++++
 14 files changed, 115 insertions(+), 21 deletions(-)

Comments

brian m. carlson April 19, 2024, 7:12 p.m. UTC | #1
On 2024-04-19 at 09:51:03, Patrick Steinhardt wrote:
> Hi,
> 
> when starting up, Git will initialize `the_repository` by calling
> `initialize_the_repository()`. Part of this is also that we set the hash
> algo of `the_repository` to SHA1, which implicitly sets `the_hash_algo`
> because it is a macro expanding to `the_repository->hash_algo`.
> 
> Usually, we eventually set up the correct hash algorithm here once we
> have properly set up `the_repository` via `setup_git_directory()`. But
> in some commands we actually don't require a Git repository, and thus we
> will leave the SHA1 hash algorithm in place.
> 
> This has led to some subtle bugs when the context really asks for a
> SHA256 repository, which this patch series corrects for most of the
> part. Some commands need further work, like for example git-diff(1),
> where the user might want to have the ability to pick a hash function
> when run outside of a repository.
> 
> Ultimately, the last patch then drops the setup of the fallback hash
> algorithm completely. This will cause `the_hash_algo` to be a `NULL`
> pointer unless explicitly configured, and thus we now start to crash
> when it gets accessed without doing so beforehand. This is a rather
> risky thing to do, but it does catch bugs where we might otherwise
> accidentally do the wrong thing. And even though I think it is the right
> thing to do even conceptually, I'd be okay to drop it if folks think
> that the risk is not worth it.

I've taken a look, and other than the minor typo I noted, this seems
fine.  I'm in favour of getting rid of the SHA-1 default, even though my
gut tells me we might find a bug or two along the way where things
aren't initialized properly.  I still think that'll be okay and it's
worth doing, since it'll help us prepare for the case in the future
where we want to switch the default and also for libification, where we
won't want to make those kinds of assumptions.
Junio C Hamano April 19, 2024, 7:16 p.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> I've taken a look, and other than the minor typo I noted, this seems
> fine.  I'm in favour of getting rid of the SHA-1 default, even though my
> gut tells me we might find a bug or two along the way where things
> aren't initialized properly.  I still think that'll be okay and it's
> worth doing, since it'll help us prepare for the case in the future
> where we want to switch the default and also for libification, where we
> won't want to make those kinds of assumptions.

Yup, thanks for reviewing (and thanks Patrick for writing the
series).
Patrick Steinhardt April 22, 2024, 4:56 a.m. UTC | #3
On Fri, Apr 19, 2024 at 07:12:59PM +0000, brian m. carlson wrote:
> On 2024-04-19 at 09:51:03, Patrick Steinhardt wrote:
> > Hi,
> > 
> > when starting up, Git will initialize `the_repository` by calling
> > `initialize_the_repository()`. Part of this is also that we set the hash
> > algo of `the_repository` to SHA1, which implicitly sets `the_hash_algo`
> > because it is a macro expanding to `the_repository->hash_algo`.
> > 
> > Usually, we eventually set up the correct hash algorithm here once we
> > have properly set up `the_repository` via `setup_git_directory()`. But
> > in some commands we actually don't require a Git repository, and thus we
> > will leave the SHA1 hash algorithm in place.
> > 
> > This has led to some subtle bugs when the context really asks for a
> > SHA256 repository, which this patch series corrects for most of the
> > part. Some commands need further work, like for example git-diff(1),
> > where the user might want to have the ability to pick a hash function
> > when run outside of a repository.
> > 
> > Ultimately, the last patch then drops the setup of the fallback hash
> > algorithm completely. This will cause `the_hash_algo` to be a `NULL`
> > pointer unless explicitly configured, and thus we now start to crash
> > when it gets accessed without doing so beforehand. This is a rather
> > risky thing to do, but it does catch bugs where we might otherwise
> > accidentally do the wrong thing. And even though I think it is the right
> > thing to do even conceptually, I'd be okay to drop it if folks think
> > that the risk is not worth it.
> 
> I've taken a look, and other than the minor typo I noted, this seems
> fine.  I'm in favour of getting rid of the SHA-1 default, even though my
> gut tells me we might find a bug or two along the way where things
> aren't initialized properly.  I still think that'll be okay and it's
> worth doing, since it'll help us prepare for the case in the future
> where we want to switch the default and also for libification, where we
> won't want to make those kinds of assumptions.

Yeah, I would expect there to be a few more bugs that I just didn't
spot. But I think if we do this early in the next release cycle it would
give us plenty of time to detect any such issues. And in the worst case
we would still be able to revert the last patch of this series. So all
in all it hopefully shouldn't be all that bad and prepares us for the
future.

Thanks for your review!

Patrick