Message ID | cover.1725206584.git.me@ttaylorr.com (mailing list archive)
---|---
Series | hash.h: support choosing a separate SHA-1 for non-cryptographic uses
Taylor Blau <me@ttaylorr.com> writes:

> After some profiling, I noticed that we spend a significant amount of
> time in hashwrite(), which is not all that surprising. But much of that
> time is wasted in GitHub's infrastructure, since we are using the same
> collision-detecting SHA-1 implementation to produce a trailing checksum
> for the pack which does not need to be cryptographically secure.

Cute.

I wish we could upgrade the file formats so that a writer can choose the hash algorithm for the tail sum independently from whatever the payload uses. Most of our uses of the tail sum are for files that are consumed locally in the same repository, so nobody else should need to know that you are using xxHash for the tail sum instead.

Except that the story is not so simple for packfiles, which are named after the file's tail sum, so you cannot independently use a hash algorithm of your choice without affecting other folks. All other csum-file protected file types are, sadly, a lot smaller than pack files, so for them it matters much less.
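To make the "tail sum" being discussed concrete: every byte written to the file also feeds a running hash, and the digest is appended as the file's last bytes, so a later reader can recompute it and compare. The sketch below illustrates only that shape; the struct and function names are invented for the example, and a 64-bit FNV-1a stands in for whichever hash the writer picks (Git's csum-file.c currently uses the repository's object hash for this).

```c
/*
 * Illustrative write-side sketch of a "tail sum": hash everything we
 * write, then append the digest as the trailer.  Names and the FNV-1a
 * stand-in hash are hypothetical, not Git's csum-file API.
 */
#include <stdint.h>
#include <stdio.h>

struct tail_sum_writer {
	FILE *out;
	uint64_t sum;   /* running checksum over everything written */
};

static void tail_sum_init(struct tail_sum_writer *w, FILE *out)
{
	w->out = out;
	w->sum = 0xcbf29ce484222325ull;   /* FNV-1a offset basis */
}

static void tail_sum_write(struct tail_sum_writer *w, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	for (size_t i = 0; i < len; i++) {
		w->sum ^= p[i];
		w->sum *= 0x100000001b3ull;   /* FNV-1a prime */
	}
	fwrite(buf, 1, len, w->out);
}

/* Append the checksum itself as the file's trailer. */
static void tail_sum_finalize(struct tail_sum_writer *w)
{
	unsigned char trailer[8];
	for (int i = 0; i < 8; i++)
		trailer[i] = (w->sum >> (8 * (7 - i))) & 0xff;
	fwrite(trailer, 1, sizeof(trailer), w->out);
	fflush(w->out);
}

int main(void)
{
	struct tail_sum_writer w;
	FILE *f = fopen("demo.bin", "wb");
	if (!f)
		return 1;
	tail_sum_init(&w, f);
	tail_sum_write(&w, "payload bytes", 13);
	tail_sum_finalize(&w);   /* file now ends with its own checksum */
	fclose(f);
	return 0;
}
```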
On 2024-09-01 at 16:03:15, Taylor Blau wrote: > This series adds a build-time knob to allow selecting an alternative > SHA-1 implementation for non-cryptographic hashing within Git, starting > with the `hashwrite()` family of functions. > > This series is the result of starting to roll out verbatim multi-pack > reuse within GitHub's infrastructure. I noticed that on larger > repositories, it is harder thus far to measure a CPU speed-up on clones > where multi-pack reuse is enabled. > > After some profiling, I noticed that we spend a significant amount of > time in hashwrite(), which is not all that surprising. But much of that > time is wasted in GitHub's infrastructure, since we are using the same > collision-detecting SHA-1 implementation to produce a trailing checksum > for the pack which does not need to be cryptographically secure. Hmm, I'm not sure this is the case. Let's consider the case where SHA-1 becomes as easy to collide as MD4, which requires less than 2 hash operations for a collision, in which case we can assume that it's trivial, because eventually we expect that will happen with advances in technology. So in that case, we believe that an attacker who knows what's in a pack file and can collide one or more of the objects can create another packfile with a different, colliding object and cause the pack contents to be the same. Because we use the pack file hash as the name of the pack and we use rename(2), which ignores whether the destination exists, that means we have to assume that eventually an attacker will be able to overwrite one pack file with another with different contents without being detected simply by pushing a new pack into the repository. Even if we assume that SHA-1 attacks only become as easy as MD5 attacks, the RapidSSL exploits[0][1] demonstrate that an attacker can create collisions based on predictable outputs even with imprecise feedback. We know SHA-1 is quite weak, so it could actually be quite soon that someone finds an improvement in attacking the algorithm. Note that computing 2^56 DES keys by brute force costs about $20 from cloud providers[2], and SHA-1 provides only 2^61 collision security, so a small improvement would probably make this pretty viable to attack on major providers with dedicated hardware. This is actually worse on some providers where the operations tend to be single threaded. In those situations, there is no nondeterminism from threads to make packs slightly different, and thus it would be extremely easy to create such collisions predictably based on what an appropriate upstream version of Git does. So I don't think we can accurately say that cryptographic security isn't needed here. If we need a unique name that an attacker cannot control and there are negative consequences (such as data loss) from the attacker being able to control the name, then we need cryptographic security here, which would imply a collision-detecting SHA-1 algorithm. We could avoid this problem if we used link(2) and unlink(2) to avoid renaming over existing pack files, though, similar to loose objects. We'd need to not use the new fast algorithm unless core.createObject is set to "link", though. [0] https://en.wikipedia.org/wiki/MD5 [1] https://www.win.tue.nl/hashclash/rogue-ca/ [2] https://crack.sh/
On Mon, Sep 02, 2024 at 02:08:25PM +0000, brian m. carlson wrote: > On 2024-09-01 at 16:03:15, Taylor Blau wrote: > > This series adds a build-time knob to allow selecting an alternative > > SHA-1 implementation for non-cryptographic hashing within Git, starting > > with the `hashwrite()` family of functions. > > > > This series is the result of starting to roll out verbatim multi-pack > > reuse within GitHub's infrastructure. I noticed that on larger > > repositories, it is harder thus far to measure a CPU speed-up on clones > > where multi-pack reuse is enabled. > > > > After some profiling, I noticed that we spend a significant amount of > > time in hashwrite(), which is not all that surprising. But much of that > > time is wasted in GitHub's infrastructure, since we are using the same > > collision-detecting SHA-1 implementation to produce a trailing checksum > > for the pack which does not need to be cryptographically secure. > > Hmm, I'm not sure this is the case. Let's consider the case where SHA-1 > becomes as easy to collide as MD4, which requires less than 2 hash > operations for a collision, in which case we can assume that it's > trivial, because eventually we expect that will happen with advances in > technology. I'm not sure this attack is possible as you described. We still run any packs through index-pack before landing them in $GIT_DIR/objects/pack, and index-pack still uses the collision-detecting SHA-1 implementation (if the repository uses SHA-1 and Git was compiled with it). So if I were a malicious attacker trying to compromise data on a forge, I would have to first (a) know the name of some pack that I was trying to collide, then (b) create a pack which collides with that one before actually pushing it. (b) seems difficult to impossible to execute (certainly today, maybe ever) because the attacker only controls the object contents within the pack, but can't adjust the pack header, object headers, compression, etc. But even if the attacker could do all of that, the remote still needs to index that pack, and while checksumming the pack, it would notice the collision (or SHA-1 mismatch) and reject the pack by die()-ing either way. (AFAICT, this all happens in builtin/index-pack.c::parse_pack_objects()). > So in that case, we believe that an attacker who knows what's in a pack > file and can collide one or more of the objects can create another > packfile with a different, colliding object and cause the pack contents > to be the same. Because we use the pack file hash as the name of the > pack and we use rename(2), which ignores whether the destination exists, > that means we have to assume that eventually an attacker will be able to > overwrite one pack file with another with different contents without > being detected simply by pushing a new pack into the repository. Right... but I think we would die() before we attempt to rename() the pack into place as above. Thanks, Taylor
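The check Taylor describes has a simple shape: while the pack streams in, index-pack keeps a running hash over everything before the trailer and die()s if the recomputed digest does not match the stored one. Below is a stand-alone illustration of that read-side check, reusing the illustrative 64-bit stand-in hash and file layout from the write-side sketch earlier rather than Git's real SHA-1 machinery.

```c
/*
 * Read-side counterpart to the earlier sketch: re-hash everything except
 * the trailer and compare with the stored trailer, failing on mismatch --
 * the same shape as index-pack's "pack is corrupted (SHA1 mismatch)"
 * check, but with the illustrative stand-in hash.  Reads small demo files
 * only.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint64_t fnv1a(const unsigned char *p, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ull;
	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ull;
	}
	return h;
}

int main(void)
{
	static unsigned char buf[1 << 16];
	FILE *f = fopen("demo.bin", "rb");
	size_t len;

	if (!f)
		return 1;
	len = fread(buf, 1, sizeof(buf), f);
	fclose(f);
	if (len < 8) {
		fprintf(stderr, "fatal: file too short to carry a trailer\n");
		return 1;
	}

	/* Hash the payload, then rebuild the expected trailer bytes. */
	uint64_t sum = fnv1a(buf, len - 8);
	unsigned char expect[8];
	for (int i = 0; i < 8; i++)
		expect[i] = (sum >> (8 * (7 - i))) & 0xff;

	if (memcmp(expect, buf + len - 8, 8)) {
		fprintf(stderr, "fatal: trailing checksum mismatch\n");
		return 1;   /* index-pack would die() here */
	}
	printf("trailer OK\n");
	return 0;
}
```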
On Sun, Sep 01, 2024 at 08:41:36PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > After some profiling, I noticed that we spend a significant amount of
> > time in hashwrite(), which is not all that surprising. But much of that
> > time is wasted in GitHub's infrastructure, since we are using the same
> > collision-detecting SHA-1 implementation to produce a trailing checksum
> > for the pack which does not need to be cryptographically secure.
>
> Cute.
>
> I wish we could upgrade the file formats so that a writer can choose
> the hash algorithm for the tail sum independently from whatever the
> payload uses. Most of our uses of the tail sum are for files that are
> consumed locally in the same repository, so nobody else should need to
> know that you are using xxHash for the tail sum instead.

Yeah. I would actually like to get there in the long-term, but of course that change is much larger than this one (the protocol would have to be adjusted to learn a new "tailsum" capability for callers to negotiate which checksumming hash function they want to use, etc.).

> Except that the story is not so simple for packfiles, which are named
> after the file's tail sum, so you cannot independently use a hash
> algorithm of your choice without affecting other folks. All other
> csum-file protected file types are, sadly, a lot smaller than pack
> files, so for them it matters much less.

Right.

Thanks,
Taylor
Taylor Blau <me@ttaylorr.com> writes:

> Yeah. I would actually like to get there in the long-term, but of course
> that change is much larger than this one (the protocol would have to be
> adjusted to learn a new "tailsum" capability for callers to negotiate
> which checksumming hash function they want to use, etc.).

I dunno. I was talking about strictly local tail sums, like the one in the index file, where there is nobody to negotiate anything with. We already agreed that the name of a packfile is a different matter.
Taylor Blau <me@ttaylorr.com> writes:

> But even if the attacker could do all of that, the remote still needs to
> index that pack, and while checksumming the pack, it would notice the
> collision (or SHA-1 mismatch) and reject the pack by die()-ing either
> way. (AFAICT, this all happens in
> builtin/index-pack.c::parse_pack_objects()).

The hosting side writes a packfile and computes the tail sum once, and you force the clients that clone or fetch to validate that tail sum; clients usually outnumber the hosting side by orders of magnitude. That sounds like you are optimizing for the wrong side, but it does point at another aspect of this problem.

Even without limiting ourselves to the tail sum, our uses of the hash function fall into two categories: ones where we do not have to be overly cautious (i.e., when we are generating data and computing the hash over that data), and ones where we do want to be paranoid (i.e., when we receive check-summed data from the outside world and suspect that it may have been generated by an adversary).
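Junio's split suggests an API shape in which the caller states which of the two situations it is in and the implementation picks the paranoid or the fast code path accordingly. The sketch below is a simplified, hypothetical version of that idea (loosely echoing the fast_*_fn members that appear in the diffs later in this thread); the types, names, and the two stand-in "hash" functions are invented for illustration, not Git's.

```c
/*
 * Hypothetical sketch: one descriptor carrying two implementations, and a
 * helper that selects the paranoid one only for untrusted input.
 */
#include <stddef.h>
#include <stdio.h>

typedef void (*hash_fn)(const void *data, size_t len, unsigned char out[8]);

/* Stand-ins: a real setup would plug in SHA1DC and a cheaper SHA-1 here. */
static void slow_careful_hash(const void *data, size_t len, unsigned char out[8])
{
	const unsigned char *p = data;
	unsigned long long h = 1469598103934665603ull;   /* FNV-1a */
	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 1099511628211ull;
	}
	for (int i = 0; i < 8; i++)
		out[i] = (h >> (8 * i)) & 0xff;
}

static void fast_local_hash(const void *data, size_t len, unsigned char out[8])
{
	const unsigned char *p = data;
	unsigned char x = 0;
	for (size_t i = 0; i < len; i++)
		x ^= p[i];
	for (int i = 0; i < 8; i++)
		out[i] = x;   /* obviously not a real hash; just a cheap stand-in */
}

struct checksum_impl {
	hash_fn safe;   /* collision-detecting; for data from the outside world */
	hash_fn fast;   /* cheaper; for checksums over data we generated ourselves */
};

enum checksum_purpose {
	CHECKSUM_VERIFY_UNTRUSTED,   /* e.g. indexing a pack someone pushed to us */
	CHECKSUM_WRITE_LOCAL,        /* e.g. the trailer of a pack we are writing */
};

static hash_fn checksum_for(const struct checksum_impl *impl,
			    enum checksum_purpose purpose)
{
	/* Only input we did not generate ourselves needs the paranoid path. */
	return purpose == CHECKSUM_VERIFY_UNTRUSTED ? impl->safe : impl->fast;
}

int main(void)
{
	struct checksum_impl impl = { slow_careful_hash, fast_local_hash };
	unsigned char digest[8];

	checksum_for(&impl, CHECKSUM_WRITE_LOCAL)("pack bytes", 10, digest);
	printf("trailer byte 0: %02x\n", digest[0]);
	return 0;
}
```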
On 2024-09-03 at 19:47:39, Taylor Blau wrote: > We still run any packs through index-pack before landing them in > $GIT_DIR/objects/pack, and index-pack still uses the collision-detecting > SHA-1 implementation (if the repository uses SHA-1 and Git was compiled > with it). > > So if I were a malicious attacker trying to compromise data on a forge, > I would have to first (a) know the name of some pack that I was trying > to collide, then (b) create a pack which collides with that one before > actually pushing it. (b) seems difficult to impossible to execute > (certainly today, maybe ever) because the attacker only controls the > object contents within the pack, but can't adjust the pack header, > object headers, compression, etc. Packing single-threaded is deterministic in my tests, so it would seem that this is possible, even if inconvenient or difficult to execute. It's not very hard to get access to the configuration a forge is using either because it's open source or open core, or just from getting the on-premises version's configuration, so we have to assume that the attacker knows the configuration, and we also can determine what packs are on the server side if we've pushed all of the objects ourselves. > But even if the attacker could do all of that, the remote still needs to > index that pack, and while checksumming the pack, it would notice the > collision (or SHA-1 mismatch) and reject the pack by die()-ing either > way. (AFAICT, this all happens in > builtin/index-pack.c::parse_pack_objects()). If we're certain that we'll always index the pack, then I agree we would detect this at that point, and so it would probably be safe. As you and I discussed elsewhere, I'm not the expert on the pack code, so I'll defer to your analysis here.
On Tue, Sep 03, 2024 at 03:47:39PM -0400, Taylor Blau wrote:

> > Hmm, I'm not sure this is the case. Let's consider the case where SHA-1
> > becomes as easy to collide as MD4, which requires less than 2 hash
> > operations for a collision, in which case we can assume that it's
> > trivial, because eventually we expect that will happen with advances in
> > technology.
>
> I'm not sure this attack is possible as you described.
>
> We still run any packs through index-pack before landing them in
> $GIT_DIR/objects/pack, and index-pack still uses the collision-detecting
> SHA-1 implementation (if the repository uses SHA-1 and Git was compiled
> with it).

I agree this is not a problem with your series as it is, but I think in the long run we'd want to switch over index-pack, too, for two reasons:

  1. It could benefit from the same speedups on the receiving side that
     your patches give on the sending side (though the difference is less
     noticeable there, because we're also hashing the expanded contents).

  2. We'll have to do so if we switch to a non-cryptographic hash (as is
     discussed elsewhere in this thread), since the two obviously have to
     match.

So let's imagine for a moment that we make that change.

I don't think you can smuggle any duplicate objects this way. We'll still index each individual object using sha1dc, so any attempts to collide there will be caught. You'd need totally different objects that are in a packfile that happens to hash to the same checksum. And then since the receiver is the one computing the object id of those objects, they won't match (modulo some racing, which I'll get to).

But I do think you could do a denial-of-service attack by corrupting the receiving repo. Imagine that:

  1. I (somehow) know you have a pack with hash XYZ, and thus that you'll
     be storing objects/pack/pack-XYZ.pack.

  2. I generate a new, valid pack that contains 100 random objects (or
     enough to defeat transfer.unpackLimit). I mutate those objects'
     contents until my new pack also has hash XYZ.

  3. I send you that pack (e.g., via push if you're a server, or by
     manipulating you into fetching if you're a client). You index the
     pack, see that it should be called pack-XYZ.pack, and then rename()
     it on top of your existing file.

Now you've just lost access to all of those objects, and your repo is broken.

We should be able to simulate this in practice.
First, let's weaken the "fast" sha1 such that it retains only the final two bits:

diff --git a/sha1/openssl.h b/sha1/openssl.h
index 1038af47da..f0d5c59c43 100644
--- a/sha1/openssl.h
+++ b/sha1/openssl.h
@@ -32,6 +32,8 @@ static inline void openssl_SHA1_Final(unsigned char *digest,
 {
 	EVP_DigestFinal_ex(ctx->ectx, digest, NULL);
 	EVP_MD_CTX_free(ctx->ectx);
+	memset(digest, 0, 19);
+	digest[19] &= 0x3;
 }
 
 static inline void openssl_SHA1_Clone(struct openssl_SHA1_CTX *dst,

Now we can build with OPENSSL_SHA1_FAST=1, and use the result (which I call git.compile below) to repack some victim repo:

  # use regular Git to clone first
  git clone --no-local --bare /some/repo victim.git

  # and now use our weak hash to repack; you should end up with
  # pack-000000...003.pack or similar
  git.compile -C victim.git repack -ad

We can even fsck it, if we teach the fsck code to use our weak hash (as an aside, I think it is weird that fsck has its own hash verification; it should probably be relying on verify-pack for this, but I haven't dug into this in a while and IIRC there are some complications):

diff --git a/pack-check.c b/pack-check.c
index e883dae3f2..1b80616c70 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -67,7 +67,7 @@ static int verify_packfile(struct repository *r,
 	if (!is_pack_valid(p))
 		return error("packfile %s cannot be accessed", p->pack_name);
 
-	r->hash_algo->init_fn(&ctx);
+	r->hash_algo->fast_init_fn(&ctx);
 	do {
 		unsigned long remaining;
 		unsigned char *in = use_pack(p, w_curs, offset, &remaining);
@@ -76,9 +76,9 @@
 			pack_sig_ofs = p->pack_size - r->hash_algo->rawsz;
 		if (offset > pack_sig_ofs)
 			remaining -= (unsigned int)(offset - pack_sig_ofs);
-		r->hash_algo->update_fn(&ctx, in, remaining);
+		r->hash_algo->fast_update_fn(&ctx, in, remaining);
 	} while (offset < pack_sig_ofs);
-	r->hash_algo->final_fn(hash, &ctx);
+	r->hash_algo->fast_final_fn(hash, &ctx);
 	pack_sig = use_pack(p, w_curs, pack_sig_ofs, NULL);
 	if (!hasheq(hash, pack_sig, the_repository->hash_algo))
 		err = error("%s pack checksum mismatch",

And now let's teach index-pack to use the weak hash, too:

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index fd968d673d..de99405880 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -284,7 +284,7 @@ static void flush(void)
 	if (input_offset) {
 		if (output_fd >= 0)
 			write_or_die(output_fd, input_buffer, input_offset);
-		the_hash_algo->update_fn(&input_ctx, input_buffer, input_offset);
+		the_hash_algo->fast_update_fn(&input_ctx, input_buffer, input_offset);
 		memmove(input_buffer, input_buffer + input_offset, input_len);
 		input_offset = 0;
 	}
@@ -357,7 +357,7 @@ static const char *open_pack_file(const char *pack_name)
 		output_fd = -1;
 		nothread_data.pack_fd = input_fd;
 	}
-	the_hash_algo->init_fn(&input_ctx);
+	the_hash_algo->fast_init_fn(&input_ctx);
 	return pack_name;
 }
 
@@ -1202,9 +1202,9 @@ static void parse_pack_objects(unsigned char *hash)
 
 	/* Check pack integrity */
 	flush();
-	the_hash_algo->init_fn(&tmp_ctx);
-	the_hash_algo->clone_fn(&tmp_ctx, &input_ctx);
-	the_hash_algo->final_fn(hash, &tmp_ctx);
+	the_hash_algo->fast_init_fn(&tmp_ctx);
+	the_hash_algo->fast_clone_fn(&tmp_ctx, &input_ctx);
+	the_hash_algo->fast_final_fn(hash, &tmp_ctx);
 	if (!hasheq(fill(the_hash_algo->rawsz), hash, the_repository->hash_algo))
 		die(_("pack is corrupted (SHA1 mismatch)"));
 	use(the_hash_algo->rawsz);

Now the code is all set.
We need to adjust two things in the victim repo:

  # we could just send 100+ fake objects from the client, but setting this to
  # "1" makes our attack loop simpler
  git -C victim.git config transfer.unpackLimit 1

  # Pushing happens into a quarantine area. And then when we move the
  # files out of quarantine, we do it with finalize_object_file(), which
  # does the usual link collision detection. But:
  #
  #   1. That doesn't happen for fetches, which are done directly in
  #      the pack directory, so the rename() done by index-pack applies
  #      there.
  #
  #   2. If the link() doesn't work for any reason other than EEXIST,
  #      we'll retry it as a rename(). So depending on your filesystem,
  #      this might be triggerable even for push.
  #
  # To simulate the worst case, we'll manually force the push into
  # rename mode.
  git -C victim.git config core.createObject rename

And now we can attack, like so:

  git init evil
  cd evil
  while git.compile -C ../victim.git fsck; do
    ls -l ../victim.git/objects/pack/
    git.compile commit --allow-empty -m foo
    git.compile push ../victim.git HEAD:foo
  done
  ls -l ../victim.git/objects/pack/

It's random whether we'll collide, but since there are only 4 possibilities (from 2 bits), it happens within a couple attempts.

Obviously sha1 is stronger than that, but I think the point is to prepare for a world where collisions are cheap and easy to produce (whether because sha1 gets more broken, or because we start using a non-cryptographic hash). So I do think there are problems in this general path, even though your patch is not (yet) exposing them (although it would be easy to do so accidentally; I was actually moderately surprised that index-pack is not relying on the hashfile API already).

Probably the solution is:

  - renaming packfiles into place should use finalize_object_file() to
    avoid collisions. That happens for tmp-objdir migration already,
    but we should do it more directly in index-pack (and maybe even
    pack-objects). And possibly we should implement an actual
    byte-for-byte comparison if we think we saw a collision, rather than
    just assuming that the write was effectively a no-op (see the FIXME
    in that function). That becomes more important if the checksum gets
    more likely to collide accidentally (we essentially ignore the
    possibility that sha1 would ever do so).

  - possibly object_creation_mode should have a more strict setting that
    refuses to fall back to renames. Or alternatively, we should do our
    own check for existence when falling back to a rename() in
    finalize_object_file().

And finally, I mentioned races earlier. I didn't try to reproduce it, but I suspect there could also be a race like:

  - you have pack-XYZ.pack and pack-XYZ.idx

  - attacker generates evil pack-XYZ.pack as above (and assuming we have
    not fixed anything as I suggested above, and they really can replace
    the receiver's copy).

  - at some moment we will have moved pack-XYZ.pack into place, but not
    yet the matching idx. So we'll have the old idx and the new pack. An
    object lookup at that moment could cause us to find the object using
    the old idx, but then get the data out of the new pack file,
    replacing real data with the attacker's data. It's a pretty small
    window, but probably possible with enough simultaneous reading
    processes. Not something you'd probably want to spend $40k
    generating a collision for, but if we used a weak enough checksum,
    then attempts become cheap.

So I think we really do need to address this to prefer local data. At which point it should be safe to use a much weaker checksum.
But IMHO it is still worth having a "fast" SHA1. Even if the future is SHA256 repos or xxHash pack trailers, older clients will still use SHA1. -Peff
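The dangerous ingredient in the scenario above is that rename(2) silently replaces an existing destination, while link(2) refuses with EEXIST; the latter is the collision detection that finalize_object_file() relies on and that the core.createObject=rename setting in the recipe deliberately bypasses. The snippet below is a stand-alone demonstration of that syscall difference only; the file names are made up, and it is not Git code.

```c
/*
 * Demonstrate why rename-into-place is dangerous for content-named files:
 * link(2) refuses to clobber an existing destination, rename(2) does not.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Helper for the demo: create a small file with known contents. */
static void write_file(const char *path, const char *contents)
{
	FILE *f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fputs(contents, f);
	fclose(f);
}

int main(void)
{
	write_file("pack-XYZ.pack", "original contents");
	write_file("incoming.pack", "attacker contents");

	/* link(2) refuses to overwrite: the existing pack survives. */
	if (link("incoming.pack", "pack-XYZ.pack") < 0 && errno == EEXIST)
		printf("link(2): destination exists, refused (EEXIST)\n");

	/* rename(2) silently replaces it: the original bytes are gone. */
	if (!rename("incoming.pack", "pack-XYZ.pack"))
		printf("rename(2): existing pack-XYZ.pack was overwritten\n");

	unlink("pack-XYZ.pack");
	return 0;
}
```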
Jeff King <peff@peff.net> writes:

> Probably the solution is:
>
>   - renaming packfiles into place should use finalize_object_file() to
>     avoid collisions. That happens for tmp-objdir migration already,
>     but we should do it more directly in index-pack (and maybe even
>     pack-objects). And possibly we should implement an actual
>     byte-for-byte comparison if we think we saw a collision, rather than
>     just assuming that the write was effectively a no-op (see the FIXME
>     in that function). That becomes more important if the checksum gets
>     more likely to collide accidentally (we essentially ignore the
>     possibility that sha1 would ever do so).

Yes. I somehow mistakenly thought that Taylor analyzed the code path when brian raised the "we rename over, overwriting existing files" issue, and I included fixing it as one of the steps necessary to safely switch the tail sum to a weaker and faster hash, but after reading the thread again, it seems that I was hallucinating X-<. This needs to be corrected.

>   - possibly object_creation_mode should have a more strict setting that
>     refuses to fall back to renames. Or alternatively, we should do our
>     own check for existence when falling back to a rename() in
>     finalize_object_file().

True, too.

>   - at some moment we will have moved pack-XYZ.pack into place, but not
>     yet the matching idx. So we'll have the old idx and the new pack. An
>     object lookup at that moment could cause us to find the object using
>     the old idx, but then get the data out of the new pack file,
>     replacing real data with the attacker's data. It's a pretty small
>     window, but probably possible with enough simultaneous reading
>     processes. Not something you'd probably want to spend $40k
>     generating a collision for, but if we used a weak enough checksum,
>     then attempts become cheap.

This reminds me why we changed the hash we use to name packfiles; we used to use "hash of sorted object names contained in the pack", but that would mean a (forced) repack of the sole pack of a fully packed repository can create a packfile with contents and object layout different from the original but with the same name, creating this exact race for yourself without involving any evil attacker. We of course use the hash of the actual pack data stream these days, and that avoids this problem.

It is funny to compare this with the reason why we switched how we name individual objects very early in the history. We used to name an object after the hash value of the _compressed_ object header plus payload, but that obviously means a different compression level (or an improvement in the compressor implementation) would give different names to the same contents, and that is why we swapped the order to use the hash value of the object header plus payload before compression. Of course, that _requires_ us to avoid overwriting an existing file to foil collision attacks. That brings us back to the topic here ;-).

> So I think we really do need to address this to prefer local data. At
> which point it should be safe to use a much weaker checksum. But IMHO it
> is still worth having a "fast" SHA1. Even if the future is SHA256 repos
> or xxHash pack trailers, older clients will still use SHA1.

Yup. 100% agreed. Thanks.
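Peff's first bullet points at the FIXME in finalize_object_file(): when the destination already exists, the code today just assumes the contents must be the same. A sketch of the byte-for-byte comparison he suggests might look like the helper below; the function name and structure are hypothetical, not Git's actual implementation.

```c
/*
 * Hypothetical helper for the "suspected collision" case: only treat an
 * already-existing destination as a harmless duplicate when the two files
 * really have identical contents.
 */
#include <stdio.h>
#include <string.h>

/* Returns 1 if identical, 0 if they differ, -1 on error. */
static int files_identical(const char *a, const char *b)
{
	char bufa[8192], bufb[8192];
	FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
	int ret = -1;

	if (!fa || !fb)
		goto out;
	for (;;) {
		size_t na = fread(bufa, 1, sizeof(bufa), fa);
		size_t nb = fread(bufb, 1, sizeof(bufb), fb);
		if (na != nb || memcmp(bufa, bufb, na)) {
			ret = 0;   /* same name, different bytes: real collision */
			break;
		}
		if (!na) {
			ret = 1;   /* both hit EOF together: benign duplicate */
			break;
		}
	}
out:
	if (fa)
		fclose(fa);
	if (fb)
		fclose(fb);
	return ret;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file-a> <file-b>\n", argv[0]);
		return 2;
	}
	int same = files_identical(argv[1], argv[2]);
	printf("%s\n", same == 1 ? "identical" : same == 0 ? "different" : "error");
	return same == 1 ? 0 : 1;
}
```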
On Thu, Sep 05, 2024 at 08:41:16AM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
>
> > Probably the solution is:
> >
> >   - renaming packfiles into place should use finalize_object_file() to
> >     avoid collisions. That happens for tmp-objdir migration already,
> >     but we should do it more directly in index-pack (and maybe even
> >     pack-objects). And possibly we should implement an actual
> >     byte-for-byte comparison if we think we saw a collision, rather than
> >     just assuming that the write was effectively a no-op (see the FIXME
> >     in that function). That becomes more important if the checksum gets
> >     more likely to collide accidentally (we essentially ignore the
> >     possibility that sha1 would ever do so).
>
> Yes. I somehow mistakenly thought that Taylor analyzed the code
> path when brian raised the "we rename over, overwriting existing
> files" issue, and I included fixing it as one of the steps necessary to
> safely switch the tail sum to a weaker and faster hash, but after
> reading the thread again, it seems that I was hallucinating X-<.
> This needs to be corrected.

Just to make sure I understand you here... this is talking about a prerequisite step for switching index-pack to use a faster hash, right?

If so, I agree, but would note that this series does not yet switch index-pack to use the non-collision detecting SHA-1 implementation when available, so that would not be a prerequisite for merging this series.

Thanks,
Taylor
Taylor Blau <me@ttaylorr.com> writes: > If so, I agree, but would note that this series does not yet switch > index-pack to use the non-collision detecting SHA-1 implementation when > available, so that would not be a prerequisite for merging this series. Hmph, I am confused. It needs to be corrected in order to address collisions of the tail sum Peff raised, as no longer checked the tail sum with SHA1DC but with "fast" SHA-1. Doesn't it make it a prerequisite?
On Thu, Sep 05, 2024 at 09:51:00AM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > If so, I agree, but would note that this series does not yet switch > > index-pack to use the non-collision detecting SHA-1 implementation when > > available, so that would not be a prerequisite for merging this series. > > Hmph, I am confused. It needs to be corrected in order to address > collisions of the tail sum Peff raised, as no longer checked the > tail sum with SHA1DC but with "fast" SHA-1. Peff's mail supposes that we have already modified index-pack to use the non-collision detecting SHA-1 implementation. But this series does not do that, so I don't think we have an issue to address here. In a hypothetical future series where we do modify index-pack to use the _FAST SHA-1 implementation, then we would need to address the issue that Peff raised first as a prerequisite. Thanks, Taylor
On Thu, Sep 05, 2024 at 01:04:34PM -0400, Taylor Blau wrote:
> On Thu, Sep 05, 2024 at 09:51:00AM -0700, Junio C Hamano wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> >
> > > If so, I agree, but would note that this series does not yet switch
> > > index-pack to use the non-collision detecting SHA-1 implementation when
> > > available, so that would not be a prerequisite for merging this series.
> >
> > Hmph, I am confused. It needs to be corrected in order to address
> > collisions of the tail sum Peff raised, as no longer checked the
> > tail sum with SHA1DC but with "fast" SHA-1.
>
> Peff's mail supposes that we have already modified index-pack to use the
> non-collision detecting SHA-1 implementation. But this series does not
> do that, so I don't think we have an issue to address here.
>
> In a hypothetical future series where we do modify index-pack to use the
> _FAST SHA-1 implementation, then we would need to address the issue that
> Peff raised first as a prerequisite.

I verified that this was the case by applying only the following to my series:

--- 8< ---
diff --git a/sha1/openssl.h b/sha1/openssl.h
index 1038af47da..f0d5c59c43 100644
--- a/sha1/openssl.h
+++ b/sha1/openssl.h
@@ -32,6 +32,8 @@ static inline void openssl_SHA1_Final(unsigned char *digest,
 {
 	EVP_DigestFinal_ex(ctx->ectx, digest, NULL);
 	EVP_MD_CTX_free(ctx->ectx);
+	memset(digest, 0, 19);
+	digest[19] &= 0x3;
 }
 
 static inline void openssl_SHA1_Clone(struct openssl_SHA1_CTX *dst,
--- >8 ---

and then creating a victim.git repository (which in my case was born from git.git) and then repacking to produce the following state:

  $ ls -la victim.git/objects/pack
  total 262704
  drwxr-xr-x 2 ttaylorr ttaylorr      4096 Sep  5 13:45 .
  drwxr-xr-x 4 ttaylorr ttaylorr      4096 Sep  5 13:46 ..
  -r--r--r-- 1 ttaylorr ttaylorr   3306804 Sep  5 13:45 pack-0000000000000000000000000000000000000003.bitmap
  -r--r--r-- 1 ttaylorr ttaylorr  15588224 Sep  5 13:44 pack-0000000000000000000000000000000000000003.idx
  -r--r--r-- 1 ttaylorr ttaylorr 247865480 Sep  5 13:44 pack-0000000000000000000000000000000000000003.pack
  -r--r--r-- 1 ttaylorr ttaylorr   2226788 Sep  5 13:44 pack-0000000000000000000000000000000000000003.rev

Then I set up an "evil" repository like in Peff's recipe and started repeatedly pushing. fsck is slow here, so the loop is just "while true", but it doesn't matter that we're not fscking the victim repository since I'll show in a second that it's not corrupted to begin with.

Running this loop:

  $ while true
    do
      ls -l ../victim.git/objects/pack/
      git.compile commit --allow-empty -m foo
      git.compile push ../victim.git HEAD:foo
    done
  $ ls -l ../victim.git/objects/pack/

, fails very quickly and produces the following:

  [main 727346d] foo
  Enumerating objects: 12, done.
  Counting objects: 100% (12/12), done.
  Delta compression using up to 20 threads
  Compressing objects: 100% (11/11), done.
  Writing objects: 100% (12/12), 779 bytes | 779.00 KiB/s, done.
  Total 12 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
  remote: fatal: final sha1 did not match
  error: remote unpack failed: unpack-objects abnormal exit
  To ../victim.git
   ! [remote rejected] HEAD -> foo (unpacker error)
  error: failed to push some refs to '../victim.git'

The victim repository rightly rejects the push, since even though the evil repository generated a pack with a colliding checksum value, the victim repository validated it using the collision-detecting / non-broken SHA-1 implementation and rejected the pack appropriately.
Of course, if index-pack were updated to use the non-collision detecting implementation of SHA-1 when compiled using one of the _FAST knobs, *and* we blindly updated index-pack to use the _fast variants without doing anything else from Peff's mail, then we would have corruption.

But I think the point of Peff's mail is to illustrate that this is only a problem in a world where index-pack uses the _fast SHA-1 implementation but does not have any additional protections in place.

Thanks,
Taylor
On Thu, Sep 05, 2024 at 01:51:10PM -0400, Taylor Blau wrote:
> , fails very quickly and produces the following:
>
>   [main 727346d] foo
>   Enumerating objects: 12, done.
>   Counting objects: 100% (12/12), done.
>   Delta compression using up to 20 threads
>   Compressing objects: 100% (11/11), done.
>   Writing objects: 100% (12/12), 779 bytes | 779.00 KiB/s, done.
>   Total 12 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
>   remote: fatal: final sha1 did not match
>   error: remote unpack failed: unpack-objects abnormal exit
>   To ../victim.git
>    ! [remote rejected] HEAD -> foo (unpacker error)
>   error: failed to push some refs to '../victim.git'

I didn't set transfer.unpackLimit to zero here, which is why this says "(unpacker error)". But if I did remember, it would instead say:

  remote: fatal: pack is corrupted (SHA1 mismatch)
  error: remote unpack failed: index-pack abnormal exit
  To ../victim.git
   ! [remote rejected] HEAD -> foo2 (unpacker error)
  error: failed to push some refs to '../victim.git'

So we're safe here whether or not you use the unpackLimit setting.

Thanks,
Taylor
On Thu, Sep 05, 2024 at 01:04:34PM -0400, Taylor Blau wrote: > On Thu, Sep 05, 2024 at 09:51:00AM -0700, Junio C Hamano wrote: > > Taylor Blau <me@ttaylorr.com> writes: > > > > > If so, I agree, but would note that this series does not yet switch > > > index-pack to use the non-collision detecting SHA-1 implementation when > > > available, so that would not be a prerequisite for merging this series. > > > > Hmph, I am confused. It needs to be corrected in order to address > > collisions of the tail sum Peff raised, as no longer checked the > > tail sum with SHA1DC but with "fast" SHA-1. > > Peff's mail supposes that we have already modified index-pack to use the > non-collision detecting SHA-1 implementation. But this series does not > do that, so I don't think we have an issue to address here. Yes, this is correct. You are modifying only the writing side (which of course would be under attacker control in this scenario anyway). So we are only getting the benefit there, but without any additional threat on the reading side. But I'm not sure how comfortable I am leaving us in that state, even if it is by itself OK. It feels fragile, and we're a small step away from somebody accidentally using the "fast" variant to do reading. If it's possible to fix the overwrite issue without too much code (and I think it probably is), then that leaves us in a much safer spot, even with what's in your series. And the fact that it lets us speed up the reading side _and_ opens the door to possible alternate-hash improvements is the cherry on top. Side note: I do really like the xxHash direction in general, but I think we may be underestimating the difficulty. Obviously it needs a protocol extension for sending packfiles, but what about other cross-repo access? E.g., dumb http blindly downloads the packfiles. How does it learn which hash is in use? -Peff
Jeff King <peff@peff.net> writes:

> But I'm not sure how comfortable I am leaving us in that state, even if
> it is by itself OK. It feels fragile, and we're a small step away from
> somebody accidentally using the "fast" variant to do reading.

Yup, avoiding renaming over existing files gives us a much more defensive posture.

> Side note: I do really like the xxHash direction in general, but I
> think we may be underestimating the difficulty. Obviously it needs a
> protocol extension for sending packfiles, but what about other
> cross-repo access? E.g., dumb http blindly downloads the packfiles.
> How does it learn which hash is in use?

We need to make the file contents a bit more self-describing regardless. A dumb http clone should look more or less like:

 - download "packfiles"
 - download "info/refs"
 - run verify-pack on the packfiles
 - create an empty repository
 - move the packs we downloaded to .git/objects/pack/
 - populate refs from the downloaded info/refs
 - make sure the tips of refs are all connected

but, as we recently discussed, verify-pack and index-pack assume the hash algorithm of the receiving repository, so we would need to give a bit more of a clue in the packfiles themselves about which hash algorithm they use, etc.
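To see why a packfile cannot currently describe its own hash choice: the on-disk pack header carries only a magic, a format version, and an object count, so nothing in the file says which algorithm produced the object names or the trailer. The small reader below parses that existing 12-byte header; it is a reader for today's format, not a proposal for extending it.

```c
/*
 * Print the fields of an existing pack header: "PACK", a 4-byte
 * network-order version, and a 4-byte network-order object count.
 */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	unsigned char hdr[12];
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <packfile>\n", argv[0]);
		return 2;
	}
	f = fopen(argv[1], "rb");
	if (!f || fread(hdr, 1, sizeof(hdr), f) != sizeof(hdr)) {
		fprintf(stderr, "could not read pack header\n");
		return 1;
	}
	fclose(f);

	if (memcmp(hdr, "PACK", 4)) {
		fprintf(stderr, "not a packfile (bad signature)\n");
		return 1;
	}

	uint32_t version, nr_objects;
	memcpy(&version, hdr + 4, 4);
	memcpy(&nr_objects, hdr + 8, 4);
	printf("version %u, %u objects\n", ntohl(version), ntohl(nr_objects));
	printf("(hash algorithm: not recorded anywhere in the header)\n");
	return 0;
}
```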