[0/4] hash.h: support choosing a separate SHA-1 for non-cryptographic uses

Message ID cover.1725206584.git.me@ttaylorr.com (mailing list archive)
Series hash.h: support choosing a separate SHA-1 for non-cryptographic uses

Message

Taylor Blau Sept. 1, 2024, 4:03 p.m. UTC
This series adds a build-time knob to allow selecting an alternative
SHA-1 implementation for non-cryptographic hashing within Git, starting
with the `hashwrite()` family of functions.

This series is the result of starting to roll out verbatim multi-pack
reuse within GitHub's infrastructure. I noticed that on larger
repositories, it has so far been hard to measure a CPU speed-up on clones
where multi-pack reuse is enabled.

After some profiling, I noticed that we spend a significant amount of
time in hashwrite(), which is not all that surprising. But much of that
time is wasted in GitHub's infrastructure, since we are using the same
collision-detecting SHA-1 implementation to produce a trailing checksum
for the pack which does not need to be cryptographically secure.
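
For context, hashwrite() is part of the csum-file API, which feeds
every byte it writes through the hash function so that it can append
the result as a trailing checksum. Rough usage (signatures are
abbreviated from csum-file.h, with error handling elided):

    struct hashfile *f = hashfd(fd, "pack-XXX.pack");
    hashwrite(f, buf, len);             /* hashes as it writes */
    finalize_hashfile(f, hash, FSYNC_COMPONENT_PACK,
                      CSUM_FSYNC | CSUM_CLOSE);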

This series teaches a new set of build-time knobs: OPENSSL_SHA1_FAST,
BLK_SHA1_FAST, and APPLE_COMMON_CRYPTO_SHA1_FAST, which can be used to
select an alternative SHA-1 implementation for non-cryptographic uses
within Git.
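
For example, building with

    make OPENSSL_SHA1_FAST=1

selects OpenSSL's SHA-1 for these non-cryptographic call sites while
leaving the collision-detecting implementation in place for object
hashing.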

The series is laid out as follows:

  - The first two patches are preparatory, allowing us to include
    multiple SHA-1 wrapper headers and adding scaffolding functions for
    the _fast() variants, respectively (see the sketch after this list).

  - The third patch introduces the build-time knobs for selecting which
    SHA-1 implementation is used for non-cryptographic purposes.

  - The fourth and final patch updates the hashwrite() implementation to
    use the _fast() variants.
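
To sketch the shape of the scaffolding (illustrative only, not the
literal patch text; the typedefs are the ones already in hash.h):

    struct git_hash_algo {
            /* ...existing fields... */
            git_hash_init_fn   init_fn;        /* collision-detecting */
            git_hash_update_fn update_fn;
            git_hash_final_fn  final_fn;

            git_hash_init_fn   fast_init_fn;   /* non-cryptographic */
            git_hash_update_fn fast_update_fn;
            git_hash_final_fn  fast_final_fn;
            git_hash_clone_fn  fast_clone_fn;
    };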

As noted in the final commit, a clone of the kernel which took 19.219
seconds on my machine with no build-time modifications now takes only
11.597 seconds when compiling with OPENSSL_SHA1_FAST=1, for a ~39.7%
speed-up.
On a freshly-repacked copy of git.git, the time goes from 758ms to 333ms
for a ~56% speed-up.

Thanks in advance for your review!

Taylor Blau (4):
  sha1: do not redefine `platform_SHA_CTX` and friends
  hash.h: scaffolding for _fast hashing variants
  Makefile: allow specifying a SHA-1 for non-cryptographic uses
  csum-file.c: use fast SHA-1 implementation when available

 Makefile          | 25 ++++++++++++++++++
 block-sha1/sha1.h |  2 ++
 csum-file.c       | 18 ++++++-------
 hash.h            | 67 +++++++++++++++++++++++++++++++++++++++++++++++
 object-file.c     | 42 +++++++++++++++++++++++++++++
 sha1/openssl.h    |  2 ++
 sha1dc_git.h      |  3 +++
 7 files changed, 150 insertions(+), 9 deletions(-)


base-commit: 159f2d50e75c17382c9f4eb7cbda671a6fa612d1

Comments

Junio C Hamano Sept. 2, 2024, 3:41 a.m. UTC | #1
Taylor Blau <me@ttaylorr.com> writes:

> After some profiling, I noticed that we spend a significant amount of
> time in hashwrite(), which is not all that surprising. But much of that
> time is wasted in GitHub's infrastructure, since we are using the same
> collision-detecting SHA-1 implementation to produce a trailing checksum
> for the pack which does not need to be cryptographically secure.

Cute.

I wish we could upgrade the file formats so that a writer can choose
the hash algorithm independently from whatever the payload uses.
Most of our uses of the tail sum are for files that are consumed
locally in the same repository, so nobody should need to know that
you are using xxHash for the tail sum instead.

Except that the story is not so simple for packfiles, which are named
after the file's tail sum, so you cannot use a hash algorithm of
your choice independently without affecting other folks.  All other
csum-file protected file types are a lot smaller than the pack files,
too small to matter, sadly.
brian m. carlson Sept. 2, 2024, 2:08 p.m. UTC | #2
On 2024-09-01 at 16:03:15, Taylor Blau wrote:
> This series adds a build-time knob to allow selecting an alternative
> SHA-1 implementation for non-cryptographic hashing within Git, starting
> with the `hashwrite()` family of functions.
> 
> This series is the result of starting to roll out verbatim multi-pack
> reuse within GitHub's infrastructure. I noticed that on larger
> repositories, it has so far been hard to measure a CPU speed-up on clones
> where multi-pack reuse is enabled.
> 
> After some profiling, I noticed that we spend a significant amount of
> time in hashwrite(), which is not all that surprising. But much of that
> time is wasted in GitHub's infrastructure, since we are using the same
> collision-detecting SHA-1 implementation to produce a trailing checksum
> for the pack which does not need to be cryptographically secure.

Hmm, I'm not sure this is the case.  Let's consider the case where SHA-1
becomes as easy to collide as MD4, which requires less than 2 hash
operations for a collision, in which case we can assume that it's
trivial, because eventually we expect that will happen with advances in
technology.

So in that case, we believe that an attacker who knows what's in a pack
file and can collide one or more of the objects can create another
packfile with a different, colliding object and cause the pack checksum
to be the same.  Because we use the pack file hash as the name of the
pack and we use rename(2), which ignores whether the destination exists,
that means we have to assume that eventually an attacker will be able to
overwrite one pack file with another with different contents without
being detected simply by pushing a new pack into the repository.

Even if we assume that SHA-1 attacks only become as easy as MD5 attacks,
the RapidSSL exploits[0][1] demonstrate that an attacker can create
collisions based on predictable outputs even with imprecise feedback. We
know SHA-1 is quite weak, so it could actually be quite soon that
someone finds an improvement in attacking the algorithm.  Note that
computing 2^56 DES keys by brute force costs about $20 from cloud
providers[2], and SHA-1 provides only 2^61 collision security, so a
small improvement would probably make this pretty viable to attack on
major providers with dedicated hardware.

This is actually worse on some providers where the operations tend to be
single threaded.  In those situations, there is no nondeterminism from
threads to make packs slightly different, and thus it would be extremely
easy to create such collisions predictably based on what an appropriate
upstream version of Git does.

So I don't think we can accurately say that cryptographic security isn't
needed here.  If we need a unique name that an attacker cannot control
and there are negative consequences (such as data loss) from the
attacker being able to control the name, then we need cryptographic
security here, which would imply a collision-detecting SHA-1 algorithm.

We could avoid this problem if we used link(2) and unlink(2) to avoid
renaming over existing pack files, though, similar to loose objects.
We'd need to not use the new fast algorithm unless core.createObject is
set to "link", though.

[0] https://en.wikipedia.org/wiki/MD5
[1] https://www.win.tue.nl/hashclash/rogue-ca/
[2] https://crack.sh/
Taylor Blau Sept. 3, 2024, 7:47 p.m. UTC | #3
On Mon, Sep 02, 2024 at 02:08:25PM +0000, brian m. carlson wrote:
> On 2024-09-01 at 16:03:15, Taylor Blau wrote:
> > This series adds a build-time knob to allow selecting an alternative
> > SHA-1 implementation for non-cryptographic hashing within Git, starting
> > with the `hashwrite()` family of functions.
> >
> > This series is the result of starting to roll out verbatim multi-pack
> > reuse within GitHub's infrastructure. I noticed that on larger
> > repositories, it has so far been hard to measure a CPU speed-up on clones
> > where multi-pack reuse is enabled.
> >
> > After some profiling, I noticed that we spend a significant amount of
> > time in hashwrite(), which is not all that surprising. But much of that
> > time is wasted in GitHub's infrastructure, since we are using the same
> > collision-detecting SHA-1 implementation to produce a trailing checksum
> > for the pack which does not need to be cryptographically secure.
>
> Hmm, I'm not sure this is the case.  Let's consider the case where SHA-1
> becomes as easy to collide as MD4, which requires less than 2 hash
> operations for a collision, in which case we can assume that it's
> trivial, because eventually we expect that will happen with advances in
> technology.

I'm not sure this attack is possible as you described.

We still run any packs through index-pack before landing them in
$GIT_DIR/objects/pack, and index-pack still uses the collision-detecting
SHA-1 implementation (if the repository uses SHA-1 and Git was compiled
with it).

So if I were a malicious attacker trying to compromise data on a forge,
I would have to first (a) know the name of some pack that I was trying
to collide, then (b) create a pack which collides with that one before
actually pushing it. (b) seems difficult to impossible to execute
(certainly today, maybe ever) because the attacker only controls the
object contents within the pack, but can't adjust the pack header,
object headers, compression, etc.

But even if the attacker could do all of that, the remote still needs to
index that pack, and while checksumming the pack, it would notice the
collision (or SHA-1 mismatch) and reject the pack by die()-ing either
way. (AFAICT, this all happens in
builtin/index-pack.c::parse_pack_objects()).

> So in that case, we believe that an attacker who knows what's in a pack
> file and can collide one or more of the objects can create another
> packfile with a different, colliding object and cause the pack checksum
> to be the same.  Because we use the pack file hash as the name of the
> pack and we use rename(2), which ignores whether the destination exists,
> that means we have to assume that eventually an attacker will be able to
> overwrite one pack file with another with different contents without
> being detected simply by pushing a new pack into the repository.

Right... but I think we would die() before we attempt to rename() the
pack into place as above.

Thanks,
Taylor
Taylor Blau Sept. 3, 2024, 7:48 p.m. UTC | #4
On Sun, Sep 01, 2024 at 08:41:36PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > After some profiling, I noticed that we spend a significant amount of
> > time in hashwrite(), which is not all that surprising. But much of that
> > time is wasted in GitHub's infrastructure, since we are using the same
> > collision-detecting SHA-1 implementation to produce a trailing checksum
> > for the pack which does not need to be cryptographically secure.
>
> Cute.
>
> I wish we could upgrade the file formats so that a writer can choose
> the hash algorithm independently from whatever the payload uses.
> Most of our uses of the tail sum are for files that are consumed
> locally in the same repository, so nobody should need to know that
> you are using xxHash for the tail sum instead.

Yeah. I would actually like to get here in the long-term, but of course
that change is much larger than this one (the protocol would have to be
adjusted to learn a new "tailsum" capability for callers to negotiate
which checksumming hash function they want to use, etc.).

> Except that the story is not so simple for packfiles, which are named
> after the file's tail sum, so you cannot use a hash algorithm of
> your choice independently without affecting other folks.  All other
> csum-file protected file types are a lot smaller than the pack files,
> too small to matter, sadly.

Right.

Thanks,
Taylor
Junio C Hamano Sept. 3, 2024, 8:44 p.m. UTC | #5
Taylor Blau <me@ttaylorr.com> writes:

> Yeah. I would actually like to get here in the long-term, but of course
> that change is much larger than this one (the protocol would have to be
> adjusted to learn a new "tailsum" capability for callers to negotiate
> which checksumming hash function they want to use, etc.).

I dunno.  I was talking about a strictly local tail sum, like the
ones in the index file.  There is nobody to negotiate anything with.

We already agreed that the name of a packfile is a different matter.
Junio C Hamano Sept. 3, 2024, 10:41 p.m. UTC | #6
Taylor Blau <me@ttaylorr.com> writes:

> But even if the attacker could do all of that, the remote still needs to
> index that pack, and while checksumming the pack, it would notice the
> collision (or SHA-1 mismatch) and reject the pack by die()-ing either
> way. (AFAICT, this all happens in
> builtin/index-pack.c::parse_pack_objects()).

The hosting side writes a packfile and computes the tail sum once.
You force the clients that clone or fetch to validate the tail sum.
Usually clients outnumber the hoster by many orders of magnitude.

That sounds like you are optimizing for the wrong side, but it does
point at another aspect of this problem.

Even without limiting ourselves to the tail sum, our uses of the
hash function fall into two categories: ones where we do not have to
be overly cautious (i.e., when we are generating data and computing
the hash over that data), and others where we do want to be paranoid
(i.e., when we receive check-summed data from the outside world and
suspect that the data was generated by an adversary).
brian m. carlson Sept. 4, 2024, 2:01 p.m. UTC | #7
On 2024-09-03 at 19:47:39, Taylor Blau wrote:
> We still run any packs through index-pack before landing them in
> $GIT_DIR/objects/pack, and index-pack still uses the collision-detecting
> SHA-1 implementation (if the repository uses SHA-1 and Git was compiled
> with it).
> 
> So if I were a malicious attacker trying to compromise data on a forge,
> I would have to first (a) know the name of some pack that I was trying
> to collide, then (b) create a pack which collides with that one before
> actually pushing it. (b) seems difficult to impossible to execute
> (certainly today, maybe ever) because the attacker only controls the
> object contents within the pack, but can't adjust the pack header,
> object headers, compression, etc.

Packing single-threaded is deterministic in my tests, so it would seem
that this is possible, even if inconvenient or difficult to execute.
It's not very hard to get access to the configuration a forge is using,
either because it's open source or open core, or just from getting the
on-premises version's configuration, so we have to assume that the
attacker knows the configuration, and we can also determine what packs
are on the server side if we've pushed all of the objects ourselves.

> But even if the attacker could do all of that, the remote still needs to
> index that pack, and while checksumming the pack, it would notice the
> collision (or SHA-1 mismatch) and reject the pack by die()-ing either
> way. (AFAICT, this all happens in
> builtin/index-pack.c::parse_pack_objects()).

If we're certain that we'll always index the pack, then I agree we would
detect this at that point, and so it would probably be safe.  As you and
I discussed elsewhere, I'm not the expert on the pack code, so I'll
defer to your analysis here.
Jeff King Sept. 5, 2024, 10:37 a.m. UTC | #8
On Tue, Sep 03, 2024 at 03:47:39PM -0400, Taylor Blau wrote:

> > Hmm, I'm not sure this is the case.  Let's consider the case where SHA-1
> > becomes as easy to collide as MD4, which requires less than 2 hash
> > operations for a collision, in which case we can assume that it's
> > trivial, because eventually we expect that will happen with advances in
> > technology.
> 
> I'm not sure this attack is possible as you described.
> 
> We still run any packs through index-pack before landing them in
> $GIT_DIR/objects/pack, and index-pack still uses the collision-detecting
> SHA-1 implementation (if the repository uses SHA-1 and Git was compiled
> with it).

I agree this is not a problem with your series as it is, but I think in
the long run we'd want to switch over index-pack, too, for two reasons:

  1. It could benefit from the same speedups on the receiving side that
     your patches give on the sending side (though the difference is
     less noticeable there, because we're also hashing the expanded
     contents).

  2. We'll have to do so if we switch to a non-cryptographic hash (as is
     discussed elsewhere in this thread), since the two obviously have
     to match.

So let's imagine for a moment that we make that change.

I don't think you can smuggle any duplicate objects this way. We'll
still index each individual object using sha1dc, so any attempts to
collide there will be caught. You'd need totally different objects that
are in a packfile that happens to hash to the same checksum. And then
since the receiver is the one computing the object id of those objects,
they won't match (modulo some racing, which I'll get to).

But I do think you could do a denial-of-service attack by corrupting the
receiving repo. Imagine that:

  1. I (somehow) know you have a pack with hash XYZ, and thus that
     you'll be storing objects/pack/pack-XYZ.pack.

  2. I generate a new, valid pack that contains 100 random objects
     (or enough to defeat transfer.unpackLimit). I mutate those objects'
     contents until my new pack also has hash XYZ.

  3. I send you that pack (e.g., via push if you're a server, or by
     manipulating you into fetch if you're a client). You index the
     pack, see that it should be called pack-XYZ.pack, and then rename()
     it on top of your existing file.

Now you've just lost access to all of those objects, and your repo is
broken.

We should be able to simulate this in practice. First, let's weaken the
"fast" sha1 such that it retains only the final two bits:

diff --git a/sha1/openssl.h b/sha1/openssl.h
index 1038af47da..f0d5c59c43 100644
--- a/sha1/openssl.h
+++ b/sha1/openssl.h
@@ -32,6 +32,8 @@ static inline void openssl_SHA1_Final(unsigned char *digest,
 {
 	EVP_DigestFinal_ex(ctx->ectx, digest, NULL);
 	EVP_MD_CTX_free(ctx->ectx);
+	memset(digest, 0, 19);
+	digest[19] &= 0x3;
 }
 
 static inline void openssl_SHA1_Clone(struct openssl_SHA1_CTX *dst,

Now we can build with OPENSSL_SHA1_FAST=1, and use the result (which I
call git.compile below) to repack some victim repo:

  # use regular Git to clone first
  git clone --no-local --bare /some/repo victim.git

  # and now use our weak hash to repack; you should end up with
  # pack-000000...003.pack or similar
  git.compile -C victim.git repack -ad

We can even fsck it, if we teach the fsck code to use our weak hash (as
an aside, I think it is weird that fsck has its own hash verification;
it should probably be relying on verify-pack for this, but I haven't dug
into this in a while and IIRC there are some complications):

diff --git a/pack-check.c b/pack-check.c
index e883dae3f2..1b80616c70 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -67,7 +67,7 @@ static int verify_packfile(struct repository *r,
 	if (!is_pack_valid(p))
 		return error("packfile %s cannot be accessed", p->pack_name);
 
-	r->hash_algo->init_fn(&ctx);
+	r->hash_algo->fast_init_fn(&ctx);
 	do {
 		unsigned long remaining;
 		unsigned char *in = use_pack(p, w_curs, offset, &remaining);
@@ -76,9 +76,9 @@ static int verify_packfile(struct repository *r,
 			pack_sig_ofs = p->pack_size - r->hash_algo->rawsz;
 		if (offset > pack_sig_ofs)
 			remaining -= (unsigned int)(offset - pack_sig_ofs);
-		r->hash_algo->update_fn(&ctx, in, remaining);
+		r->hash_algo->fast_update_fn(&ctx, in, remaining);
 	} while (offset < pack_sig_ofs);
-	r->hash_algo->final_fn(hash, &ctx);
+	r->hash_algo->fast_final_fn(hash, &ctx);
 	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
 	if (!hasheq(hash, pack_sig, the_repository->hash_algo))
 		err = error("%s pack checksum mismatch",

And now let's teach index-pack to use the weak hash, too:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index fd968d673d..de99405880 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -284,7 +284,7 @@ static void flush(void)
 	if (input_offset) {
 		if (output_fd >= 0)
 			write_or_die(output_fd, input_buffer, input_offset);
-		the_hash_algo->update_fn(&input_ctx, input_buffer, input_offset);
+		the_hash_algo->fast_update_fn(&input_ctx, input_buffer, input_offset);
 		memmove(input_buffer, input_buffer + input_offset, input_len);
 		input_offset = 0;
 	}
@@ -357,7 +357,7 @@ static const char *open_pack_file(const char *pack_name)
 		output_fd = -1;
 		nothread_data.pack_fd = input_fd;
 	}
-	the_hash_algo->init_fn(&input_ctx);
+	the_hash_algo->fast_init_fn(&input_ctx);
 	return pack_name;
 }
 
@@ -1202,9 +1202,9 @@ static void parse_pack_objects(unsigned char *hash)
 
 	/* Check pack integrity */
 	flush();
-	the_hash_algo->init_fn(&tmp_ctx);
-	the_hash_algo->clone_fn(&tmp_ctx, &input_ctx);
-	the_hash_algo->final_fn(hash, &tmp_ctx);
+	the_hash_algo->fast_init_fn(&tmp_ctx);
+	the_hash_algo->fast_clone_fn(&tmp_ctx, &input_ctx);
+	the_hash_algo->fast_final_fn(hash, &tmp_ctx);
 	if (!hasheq(fill(the_hash_algo->rawsz), hash, the_repository->hash_algo))
 		die(_("pack is corrupted (SHA1 mismatch)"));
 	use(the_hash_algo->rawsz);

Now the code is all set. We need to adjust two things in the victim
repo:

  # we could just send 100+ fake objects from the client, but setting this to
  # "1" makes our attack loop simpler
  git -C victim.git config transfer.unpackLimit 1

  # Pushing happens into a quarantine area. And then when we move the
  # files out of quarantine, we do it with finalize_object_file(), which
  # does the usual link collision detection. But:
  #
  #   1. That doesn't happen for fetches, which are done directly in
  #      the pack directory, so the rename() done by index-pack applies
  #      there.
  #
  #   2. If the link() doesn't work for any reason other than EEXIST,
  #      we'll retry it as a rename(). So depending on your filesystem,
  #      this might be triggerable even for push.
  #
  # To simulate the worst case, we'll manually force the push into
  # rename mode.
  git -C victim.git config core.createObject rename

And now we can attack, like so:

  git init evil
  cd evil
  while git.compile -C ../victim.git fsck; do
          ls -l ../victim.git/objects/pack/
          git.compile commit --allow-empty -m foo
          git.compile push ../victim.git HEAD:foo
  done
  ls -l ../victim.git/objects/pack/

It's random whether we'll collide, but since there are only 4
possibilities (from 2 bits), it happens within a couple attempts.
Obviously sha1 is stronger than that, but I think the point is to
prepare for a world where collisions are cheap and easy to produce
(whether because sha1 gets more broken, or because we start using a
non-cryptographic hash).

So I do think there are problems in this general path, even though your
patch is not (yet) exposing them (although it would be easy to do so
accidentally; I was actually moderately surprised that index-pack is not
relying on the hashfile API already).

Probably the solution is:

  - renaming packfiles into place should use finalize_object_file() to
    avoid collisions.  That happens for tmp-objdir migration already,
    but we should do it more directly in index-pack (and maybe even
    pack-objects). And possibly we should implement an actual
    byte-for-byte comparison if we think we saw a collision, rather than
    just assuming that the write was effectively a noop (see the FIXME
    in that function; a rough sketch follows below). That becomes more
    important if the checksum gets
    more likely to collide accidentally (we essentially ignore the
    possibility that sha1 would ever do so).

  - possibly object_creation_mode should have a more strict setting that
    refuses to fall back to renames. Or alternatively, we should do our
    own check for existence when falling back to a rename() in
    finalize_object_file().
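
To make that byte-for-byte comparison concrete, here is a rough sketch
(not finalize_object_file()'s actual code; names are illustrative):

    #include <stdio.h>
    #include <string.h>

    /*
     * Compare the two files' contents instead of trusting that a
     * matching checksum implies matching bytes.
     */
    static int files_identical(const char *a, const char *b)
    {
            FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
            char ba[8192], bb[8192];
            int same = fa && fb;

            while (same) {
                    size_t na = fread(ba, 1, sizeof(ba), fa);
                    size_t nb = fread(bb, 1, sizeof(bb), fb);
                    if (na != nb || memcmp(ba, bb, na))
                            same = 0;
                    else if (!na)
                            break; /* simultaneous EOF: identical */
            }
            if (fa) fclose(fa);
            if (fb) fclose(fb);
            return same;
    }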


And finally, I mentioned races earlier. I didn't try to reproduce it,
but I suspect there could also be a race like:

  - you have pack-XYZ.pack and pack-XYZ.idx

  - attacker generates evil pack-XYZ.pack as above (and assuming we have
    not fixed anything as I suggested above, and they really can replace
    the receiver's copy).

  - at some moment we will have moved pack-XYZ.pack into place, but not
    yet the matching idx. So we'll have the old idx and the new pack. An
    object lookup at that moment could cause us to find the object using
    the old idx, but then get the data out of the new pack file,
    replacing real data with the attacker's data. It's a pretty small
    window, but probably possible with enough simultaneous reading
    processes. Not something you'd probably want to spend $40k
    generating a collision for, but if we used a weak enough checksum,
    then attempts become cheap.

So I think we really do need to address this to prefer local data. At
which point it should be safe to use a much weaker checksum. But IMHO it
is still worth having a "fast" SHA1. Even if the future is SHA256 repos
or xxHash pack trailers, older clients will still use SHA1.

-Peff
Junio C Hamano Sept. 5, 2024, 3:41 p.m. UTC | #9
Jeff King <peff@peff.net> writes:

> Probably the solution is:
>
>   - renaming packfiles into place should use finalize_object_file() to
>     avoid collisions.  That happens for tmp-objdir migration already,
>     but we should do it more directly in index-pack (and maybe even
>     pack-objects). And possibly we should implement an actual
>     byte-for-byte comparison if we think we saw a collision, rather than
>     just assuming that the write was effectively a noop (see the FIXME
>     in that function). That becomes more important if the checksum gets
>     more likely to collide accidentally (we essentially ignore the
>     possibility that sha1 would ever do so).

Yes.  I somehow mistakenly thought that Taylor analyzed the code
path when brian raised the "we rename over, overwriting existing
files" issue, and I included fixing it as one of the steps necessary
to safely switch the tail sum to a weaker and faster hash, but after
reading the thread again, it seems that I was hallucinating X-<.
This needs to be corrected.

>   - possibly object_creation_mode should have a more strict setting that
>     refuses to fall back to renames. Or alternatively, we should do our
>     own check for existence when falling back to a rename() in
>     finalize_object_file().

True, too.

>   - at some moment we will have moved pack-XYZ.pack into place, but not
>     yet the matching idx. So we'll have the old idx and the new pack. An
>     object lookup at that moment could cause us to find the object using
>     the old idx, but then get the data out of the new pack file,
>     replacing real data with the attacker's data. It's a pretty small
>     window, but probably possible with enough simultaneous reading
>     processes. Not something you'd probably want to spend $40k
>     generating a collision for, but if we used a weak enough checksum,
>     then attempts become cheap.

This reminds me why we changed the hash we use to name packfiles; we
used to use "hash of sorted object names contained in the pack", but
that would mean a (forced) repack of the sole pack of a fully packed
repository could create a packfile with contents and object layout
different from the original but with the same name, creating this
exact race for yourself without involving any evil attacker.  We of
course use the hash of the actual pack data stream these days, and
that avoids this problem.

It is funny to compare this with the reason why we switched how we
name individual objects very early in the history.  We used to name
an object after the hash value of the _compressed_ object header
plus payload, but that obviously means a different compression level
(or an improvement in the compressor implementation) would give
different names to the same contents, and that is why we swapped the
order to use the hash value of the object header plus payload before
compression.  Of course, that _requires_ us to avoid overwriting an
existing file to foil collision attacks.  That brings us back to the
topic here ;-).

> So I think we really do need to address this to prefer local data. At
> which point it should be safe to use a much weaker checksum. But IMHO it
> is still worth having a "fast" SHA1. Even if the future is SHA256 repos
> or xxHash pack trailers, older clients will still use SHA1.

Yup. 100% agreed.

Thanks.
Taylor Blau Sept. 5, 2024, 4:23 p.m. UTC | #10
On Thu, Sep 05, 2024 at 08:41:16AM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > Probably the solution is:
> >
> >   - renaming packfiles into place should use finalize_object_file() to
> >     avoid collisions.  That happens for tmp-objdir migration already,
> >     but we should do it more directly in index-pack (and maybe even
> >     pack-objects). And possibly we should implement an actual
> >     byte-for-byte comparison if we think we saw a collision, rather than
> >     just assuming that the write was effectively a noop (see the FIXME
> >     in that function). That becomes more important if the checksum gets
> >     more likely to collide accidentally (we essentially ignore the
> >     possibility that sha1 would ever do so).
>
> Yes.  I somehow mistakenly thought that Taylor analyzed the code
> path when brian raised the "we rename over, overwriting existing
> files" issue, and I included fixing it as one of the steps necessary
> to safely switch the tail sum to a weaker and faster hash, but after
> reading the thread again, it seems that I was hallucinating X-<.
> This needs to be corrected.

Just to make sure I understand you here... this is talking about a
prerequisite step for switching index-pack to use a faster hash, right?

If so, I agree, but would note that this series does not yet switch
index-pack to use the non-collision detecting SHA-1 implementation when
available, so that would not be a prerequisite for merging this series.

Thanks,
Taylor
Junio C Hamano Sept. 5, 2024, 4:51 p.m. UTC | #11
Taylor Blau <me@ttaylorr.com> writes:

> If so, I agree, but would note that this series does not yet switch
> index-pack to use the non-collision detecting SHA-1 implementation when
> available, so that would not be a prerequisite for merging this series.

Hmph, I am confused.  It needs to be corrected in order to address
the tail-sum collisions Peff raised, as the tail sum would no longer
be checked with SHA1DC but with the "fast" SHA-1.

Doesn't it make it a prerequisite?
Taylor Blau Sept. 5, 2024, 5:04 p.m. UTC | #12
On Thu, Sep 05, 2024 at 09:51:00AM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > If so, I agree, but would note that this series does not yet switch
> > index-pack to use the non-collision detecting SHA-1 implementation when
> > available, so that would not be a prerequisite for merging this series.
>
> Hmph, I am confused.  It needs to be corrected in order to address
> the tail-sum collisions Peff raised, as the tail sum would no longer
> be checked with SHA1DC but with the "fast" SHA-1.

Peff's mail supposes that we have already modified index-pack to use the
non-collision detecting SHA-1 implementation. But this series does not
do that, so I don't think we have an issue to address here.

In a hypothetical future series where we do modify index-pack to use
the _FAST SHA-1 implementation, we would need to address the issue
that Peff raised as a prerequisite.

Thanks,
Taylor
Taylor Blau Sept. 5, 2024, 5:51 p.m. UTC | #13
On Thu, Sep 05, 2024 at 01:04:34PM -0400, Taylor Blau wrote:
> On Thu, Sep 05, 2024 at 09:51:00AM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > If so, I agree, but would note that this series does not yet switch
> > > index-pack to use the non-collision detecting SHA-1 implementation when
> > > available, so that would not be a prerequisite for merging this series.
> >
> > Hmph, I am confused.  It needs to be corrected in order to address
> > the tail-sum collisions Peff raised, as the tail sum would no longer
> > be checked with SHA1DC but with the "fast" SHA-1.
>
> Peff's mail supposes that we have already modified index-pack to use the
> non-collision detecting SHA-1 implementation. But this series does not
> do that, so I don't think we have an issue to address here.
>
> In a hypothetical future series where we do modify index-pack to use
> the _FAST SHA-1 implementation, we would need to address the issue
> that Peff raised as a prerequisite.

I verified that this was the case by applying only the following to my
series:

--- 8< ---
diff --git a/sha1/openssl.h b/sha1/openssl.h
index 1038af47da..f0d5c59c43 100644
--- a/sha1/openssl.h
+++ b/sha1/openssl.h
@@ -32,6 +32,8 @@ static inline void openssl_SHA1_Final(unsigned char *digest,
 {
 	EVP_DigestFinal_ex(ctx->ectx, digest, NULL);
 	EVP_MD_CTX_free(ctx->ectx);
+	memset(digest, 0, 19);
+	digest[19] &= 0x3;
 }

 static inline void openssl_SHA1_Clone(struct openssl_SHA1_CTX *dst,
--- >8 ---

and then creating a victim.git repository (which in my case was born
from git.git) and repacking it to produce the following state:

    $ ls -la victim.git/objects/pack
    total 262704
    drwxr-xr-x 2 ttaylorr ttaylorr      4096 Sep  5 13:45 .
    drwxr-xr-x 4 ttaylorr ttaylorr      4096 Sep  5 13:46 ..
    -r--r--r-- 1 ttaylorr ttaylorr   3306804 Sep  5 13:45 pack-0000000000000000000000000000000000000003.bitmap
    -r--r--r-- 1 ttaylorr ttaylorr  15588224 Sep  5 13:44 pack-0000000000000000000000000000000000000003.idx
    -r--r--r-- 1 ttaylorr ttaylorr 247865480 Sep  5 13:44 pack-0000000000000000000000000000000000000003.pack
    -r--r--r-- 1 ttaylorr ttaylorr   2226788 Sep  5 13:44 pack-0000000000000000000000000000000000000003.rev

Then I set up an "evil" repository like in Peff's recipe and started
repeatedly pushing. fsck is slow here, so the loop is just "while true",
but it doesn't matter that we're not fscking the victim repository since
I'll show in a second that it's not corrupted to begin with.

Running this loop:

  $ while true
    do
      ls -l ../victim.git/objects/pack/
      git.compile commit --allow-empty -m foo
      git.compile push ../victim.git HEAD:foo
    done
  $ ls -l ../victim.git/objects/pack/

This loop fails very quickly and produces the following:

    [main 727346d] foo
    Enumerating objects: 12, done.
    Counting objects: 100% (12/12), done.
    Delta compression using up to 20 threads
    Compressing objects: 100% (11/11), done.
    Writing objects: 100% (12/12), 779 bytes | 779.00 KiB/s, done.
    Total 12 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
    remote: fatal: final sha1 did not match
    error: remote unpack failed: unpack-objects abnormal exit
    To ../victim.git
     ! [remote rejected] HEAD -> foo (unpacker error)
    error: failed to push some refs to '../victim.git'

The victim repository rightly rejects the push: even though the evil
repository generated a pack with a colliding checksum value, the victim
validated it using the collision-detecting / non-broken SHA-1
implementation and rejected the pack appropriately.

Of course, if index-pack were updated to use the non-collision detecting
implementation of SHA-1 when compiled using one of the _FAST knobs,
*and* we blindly updated index-pack to use the _fast variants
without doing anything else in Peff's mail, then we would have
corruption.

But I think the point of Peff's mail is to illustrate that this is only
a problem in a world where index-pack uses the _fast SHA-1
implementation, but does not have any additional protections in place.

Thanks,
Taylor
Taylor Blau Sept. 5, 2024, 8:21 p.m. UTC | #14
On Thu, Sep 05, 2024 at 01:51:10PM -0400, Taylor Blau wrote:
> This loop fails very quickly and produces the following:
>
>     [main 727346d] foo
>     Enumerating objects: 12, done.
>     Counting objects: 100% (12/12), done.
>     Delta compression using up to 20 threads
>     Compressing objects: 100% (11/11), done.
>     Writing objects: 100% (12/12), 779 bytes | 779.00 KiB/s, done.
>     Total 12 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
>     remote: fatal: final sha1 did not match
>     error: remote unpack failed: unpack-objects abnormal exit
>     To ../victim.git
>      ! [remote rejected] HEAD -> foo (unpacker error)
>     error: failed to push some refs to '../victim.git'

I didn't set transfer.unpackLimit to zero here, which is why this says
"unpack-objects abnormal exit". But if I had remembered, it would
instead say:

    remote: fatal: pack is corrupted (SHA1 mismatch)
    error: remote unpack failed: index-pack abnormal exit
    To ../victim.git
     ! [remote rejected] HEAD -> foo2 (unpacker error)
    error: failed to push some refs to '../victim.git'

So we're safe here whether or not you use the unpackLimit setting.

Thanks,
Taylor
Jeff King Sept. 5, 2024, 8:27 p.m. UTC | #15
On Thu, Sep 05, 2024 at 01:04:34PM -0400, Taylor Blau wrote:

> On Thu, Sep 05, 2024 at 09:51:00AM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > If so, I agree, but would note that this series does not yet switch
> > > index-pack to use the non-collision detecting SHA-1 implementation when
> > > available, so that would not be a prerequisite for merging this series.
> >
> > Hmph, I am confused.  It needs to be corrected in order to address
> > collisions of the tail sum Peff raised, as no longer checked the
> > tail sum with SHA1DC but with "fast" SHA-1.
> 
> Peff's mail supposes that we have already modified index-pack to use the
> non-collision detecting SHA-1 implementation. But this series does not
> do that, so I don't think we have an issue to address here.

Yes, this is correct. You are modifying only the writing side (which of
course would be under attacker control in this scenario anyway). So we
are only getting the benefit there, but without any additional threat on
the reading side.

But I'm not sure how comfortable I am leaving us in that state, even if
it is by itself OK. It feels fragile, and we're a small step away from
somebody accidentally using the "fast" variant to do reading.

If it's possible to fix the overwrite issue without too much code (and I
think it probably is), then that leaves us in a much safer spot, even
with what's in your series. And the fact that it lets us speed up the
reading side _and_ opens the door to possible alternate-hash
improvements is the cherry on top.

  Side note: I do really like the xxHash direction in general, but I
  think we may be underestimating the difficulty. Obviously it needs a
  protocol extension for sending packfiles, but what about other
  cross-repo access? E.g., dumb http blindly downloads the packfiles.
  How does it learn which hash is in use?

-Peff
Junio C Hamano Sept. 5, 2024, 9:27 p.m. UTC | #16
Jeff King <peff@peff.net> writes:

> But I'm not sure how comfortable I am leaving us in that state, even if
> it is by itself OK. It feels fragile, and we're a small step away from
> somebody accidentally using the "fast" variant to do reading.

Yup, avoiding renaming over existing files gives us a much more
defensive posture.

>   Side note: I do really like the xxHash direction in general, but I
>   think we may be underestimating the difficulty. Obviously it needs a
>   protocol extension for sending packfiles, but what about other
>   cross-repo access? E.g., dumb http blindly downloads the packfiles.
>   How does it learn which hash is in use?

We need to make the file contents a bit more self-describing
regardless.  A dumb http clone should look more or less like

 - download "packfiles"
 - download "info/refs"

 - run verify-pack on the packfiles
 - create an empty repository
 - move the pack we downloaded to .git/objects/pack/
 - populate refs from downloaded info/refs

 - make sure the tips of refs are all connected.

but as we recently discussed, verify-pack and index-pack assume the
hash algorithm of the receiving repository, so we need to give a bit
more of a clue in the packfiles themselves about which hash algorithm
they use, etc.