diff mbox series

[v4,3/8] finalize_object_file(): implement collision check

Message ID ed9eeef8513e08935c59defafde99956eb62d49a.1727199118.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series hash.h: support choosing a separate SHA-1 for non-cryptographic uses | expand

Commit Message

Taylor Blau Sept. 24, 2024, 5:32 p.m. UTC
We've had "FIXME!!! Collision check here ?" in finalize_object_file()
since aac1794132 (Improve sha1 object file writing., 2005-05-03). That
is, when we try to write a file with the same name, we assume the
on-disk contents are the same and blindly throw away the new copy.

One of the reasons we never implemented this is because the files it
moves are all named after the cryptographic hash of their contents
(either loose objects, or packs which have their hash in the name these
days). So we are unlikely to see such a collision by accident. And even
though there are weaknesses in sha1, we assume they are mitigated by our
use of sha1dc.

So while it's a theoretical concern now, it hasn't been a priority.
However, if we start using weaker hashes for pack checksums and names,
this will become a practical concern. So in preparation, let's actually
implement a byte-for-byte collision check.

The new check will cause the write of new differing content to be a
failure, rather than a silent noop, and we'll retain the temporary file
on disk. If there's no collision present, we'll clean up the temporary
file as usual after either rename()-ing or link()-ing it into place.

Note that this may cause some extra computation when the files are in
fact identical, but this should happen rarely.

Loose objects are exempt from this check, and the collision check may be
skipped by calling the _flags variant of this function with the
FOF_SKIP_COLLISION_CHECK bit set. This is done for a couple of reasons:

  - We don't treat the hash of the loose object file's contents as a
    checksum, since the same loose object can be stored using different
    bytes on disk (e.g., when adjusting core.compression, using a
    different version of zlib, etc.).

    This is fundamentally different from cases where
    finalize_object_file() is operating over a file which uses the hash
    value as a checksum of the contents. In other words, a pair of
    identical loose objects can be stored using different bytes on disk,
    and that should not be treated as a collision.

  - We already use the path of the loose object as its hash value /
    object name, so checking for collisions at the content level doesn't
    add anything.

    This is why we do not bother to check the inflated object contents
    for collisions either, since either (a) the object contents have the
    fingerprint of a SHA-1 collision, in which case the collision
    detecting SHA-1 implementation used to hash the contents to give us
    a path would have already rejected it, or (b) the contents are part
    of a colliding pair which does not bear the same fingerprints of
    known collision attacks, in which case we would not have caught it
    anyway.

    So skipping the collision check here does not change for better or
    worse the hardness of loose object writes.

As a small note related to the latter bullet point above, we must teach
the tmp-objdir routines to similarly skip the content-level collision
checks when calling migrate_one() on a loose object file, which we do by
setting the FOF_SKIP_COLLISION_CHECK bit when we are inside of a loose
object shard.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++---
 object-file.h |  6 +++++
 tmp-objdir.c  | 26 ++++++++++++++------
 3 files changed, 89 insertions(+), 10 deletions(-)

Comments

Jeff King Sept. 24, 2024, 8:37 p.m. UTC | #1
On Tue, Sep 24, 2024 at 01:32:17PM -0400, Taylor Blau wrote:

> Loose objects are exempt from this check, and the collision check may be
> skipped by calling the _flags variant of this function with the
> FOF_SKIP_COLLISION_CHECK bit set. This is done for a couple of reasons:
> 
>   - We don't treat the hash of the loose object file's contents as a
>     checksum, since the same loose object can be stored using different
>     bytes on disk (e.g., when adjusting core.compression, using a
>     different version of zlib, etc.).
> 
>     This is fundamentally different from cases where
>     finalize_object_file() is operating over a file which uses the hash
>     value as a checksum of the contents. In other words, a pair of
>     identical loose objects can be stored using different bytes on disk,
>     and that should not be treated as a collision.

OK, this explains why a byte-for-byte hash-check would be the wrong
thing (it would cause false positives)...

>   - We already use the path of the loose object as its hash value /
>     object name, so checking for collisions at the content level doesn't
>     add anything.
> 
>     This is why we do not bother to check the inflated object contents
>     for collisions either, since either (a) the object contents have the
>     fingerprint of a SHA-1 collision, in which case the collision
>     detecting SHA-1 implementation used to hash the contents to give us
>     a path would have already rejected it, or (b) the contents are part
>     of a colliding pair which does not bear the same fingerprints of
>     known collision attacks, in which case we would not have caught it
>     anyway.
> 
>     So skipping the collision check here does not change for better or
>     worse the hardness of loose object writes.

...and this is the argument for why it is not necessary to do a
content-level check. As you say, we're not making anything worse here,
as we've never implemented this collision check. But I think the "FIXME"
in the code was really there under the notion that it was a backstop to
SHA-1 being broken (belt-and-suspenders, if you will). And your argument
above assumes that SHA-1 (using the DC variant) is safe.

But I think we can expand the argument a bit. I don't think this is a
very good place for such a collision check, because race conditions
aside, we'll already have bailed long before! When we do a write via
write_object_file(), for example, we'll hit the freshen paths that
assume the contents are identical, and skip the write (and thus this
finalize) entirely.

So if you want a collision check, we have to do it at a much higher
level. And indeed we do: index-pack already implements such a check
(through the confusingly named "check_collison" function). I don't think
unpack-objects does so (it looks like it just feeds the result to
write_object_file(), which does the freshening thing).

So the argument I'd make here is more like: this is the wrong place to
do it.

  Side thoughts on collision checks:

    I think index-pack is safe in the sense that it will always prefer
    on-disk copies and will complain if it sees a collision.
    unpack-objects is not, nor are regular in-repo writes (which
    normally cannot be triggered by remote, but on forges that do
    merges, etc, that's not always true). Both of the latter are also
    subject to races, where a simultaneous collision might let the
    attacker win.

    That race is kind of moot in a world where anybody can push to a
    fork of a repo that ends up in the same shared location (so they can
    actually win and become the "on disk" copy), and we're relying on
    sha1dc for protection there. That's specific to certain forges, but
    they do represent a lot of Git use.

    In general, the use of unpack-objects versus index-pack is up to the
    attacker (based on pack size). So I think it would be nice if
    unpack-objects did the collision check. Even if the attacker beats
    you to writing the object, it would be nice to see "holy crap, there
    was a collision" instead of just silently throwing your pushed
    object away.

    I know that GitHub only ever runs index-pack, which may give some
    amount of protection here. In general, I think we should consider
    deprecating unpack-objects in favor of teaching index-pack to
    unpack. It has many enhancements (this one, but also threading) that
    unpack-objects does not. I have an old patch series for this (sorry,
    only from 2017, I'm slipping). IIRC the sticking point was that
    unpack-objects is better about memory use in some cases (streaming
    large blobs?) and I hadn't figured out a way around that.

Phew. Sorry, that was a long digression for "I think you're right, but I
might have argued it a little differently". I think the direction of the
patch (skipping checks entirely for loose objects) is the right thing.

> As a small note related to the latter bullet point above, we must teach
> the tmp-objdir routines to similarly skip the content-level collision
> checks when calling migrate_one() on a loose object file, which we do by
> setting the FOF_SKIP_COLLISION_CHECK bit when we are inside of a loose
> object shard.

OK, this is the part I was wondering how you were going to deal with. :)
Let's look at the implementation...

> @@ -239,11 +247,15 @@ static int migrate_paths(struct strbuf *src, struct strbuf *dst)
>  
>  	for (i = 0; i < paths.nr; i++) {
>  		const char *name = paths.items[i].string;
> +		enum finalize_object_file_flags flags_copy = flags;
>  
>  		strbuf_addf(src, "/%s", name);
>  		strbuf_addf(dst, "/%s", name);
>  
> -		ret |= migrate_one(src, dst);
> +		if (is_loose_object_shard(name))
> +			flags_copy |= FOF_SKIP_COLLISION_CHECK;
> +
> +		ret |= migrate_one(src, dst, flags_copy);
>  
>  		strbuf_setlen(src, src_len);
>  		strbuf_setlen(dst, dst_len);

This looks pretty reasonable overall, though I'd note that
migrate_paths() is called recursively. So if I had:

  tmp-objdir-XXXXXX/pack/ab/foo.pack

we'd skip the collision check. I'm not sure how much we want to worry
about that. I don't think we'd ever create such a path, and the general
form of the paths is all under local control, so an attacker can't
convince us to do so. And I find it pretty unlikely that we'd change the
on-disk layout to accidentally trigger this.

So even though there are security implications if we have a false
positive from is_loose_object_shard(), I don't know that it's worth
caring about.

If we did want to, I think a more robust check is to make sure the shard
comes at the start of the path. Which is tricky, of course, because
we're building a path on top of an arbitrary root (tmp-objdir for
the src, and the repo object dir for the dst).

So you'd have to save that base_len and use it as a root, like the patch
below. I'm OK if you want to write all of this off as over-engineering,
though.

diff --git a/tmp-objdir.c b/tmp-objdir.c
index 9da0071cba..27ba9f4f57 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -207,10 +207,18 @@ static int read_dir_paths(struct string_list *out, const char *path)
 }
 
 static int migrate_paths(struct strbuf *src, struct strbuf *dst,
-			 enum finalize_object_file_flags flags);
+			 size_t base_len);
+
+static int is_loose_path(const char *name)
+{
+	return *name++ == '/' &&
+	       isxdigit(*name++) &&
+	       isxdigit(*name++) &&
+	       *name++ == '/';
+}
 
 static int migrate_one(struct strbuf *src, struct strbuf *dst,
-		       enum finalize_object_file_flags flags)
+		       size_t base_len)
 {
 	struct stat st;
 
@@ -222,18 +230,16 @@ static int migrate_one(struct strbuf *src, struct strbuf *dst,
 				return -1;
 		} else if (errno != EEXIST)
 			return -1;
-		return migrate_paths(src, dst, flags);
+		return migrate_paths(src, dst, base_len);
 	}
-	return finalize_object_file_flags(src->buf, dst->buf, flags);
+	return finalize_object_file_flags(src->buf, dst->buf,
+					  is_loose_path(src->buf + base_len) ?
+					  FOF_SKIP_COLLISION_CHECK : 0);
 }
 
-static int is_loose_object_shard(const char *name)
-{
-	return strlen(name) == 2 && isxdigit(name[0]) && isxdigit(name[1]);
-}
 
 static int migrate_paths(struct strbuf *src, struct strbuf *dst,
-			 enum finalize_object_file_flags flags)
+			 size_t base_len)
 {
 	size_t src_len = src->len, dst_len = dst->len;
 	struct string_list paths = STRING_LIST_INIT_DUP;
@@ -247,15 +253,11 @@ static int migrate_paths(struct strbuf *src, struct strbuf *dst,
 
 	for (i = 0; i < paths.nr; i++) {
 		const char *name = paths.items[i].string;
-		enum finalize_object_file_flags flags_copy = flags;
 
 		strbuf_addf(src, "/%s", name);
 		strbuf_addf(dst, "/%s", name);
 
-		if (is_loose_object_shard(name))
-			flags_copy |= FOF_SKIP_COLLISION_CHECK;
-
-		ret |= migrate_one(src, dst, flags_copy);
+		ret |= migrate_one(src, dst, base_len);
 
 		strbuf_setlen(src, src_len);
 		strbuf_setlen(dst, dst_len);
@@ -283,7 +285,7 @@ int tmp_objdir_migrate(struct tmp_objdir *t)
 	strbuf_addbuf(&src, &t->path);
 	strbuf_addstr(&dst, repo_get_object_directory(the_repository));
 
-	ret = migrate_paths(&src, &dst, 0);
+	ret = migrate_paths(&src, &dst, src.len);
 
 	strbuf_release(&src);
 	strbuf_release(&dst);
Junio C Hamano Sept. 24, 2024, 9:32 p.m. UTC | #2
Taylor Blau <me@ttaylorr.com> writes:

> One of the reasons we never implemented this is because the files it
> moves are all named after the cryptographic hash of their contents
> (either loose objects, or packs which have their hash in the name these
> days).

So we are saying that both loose objects and packfiles would benefit
if we did collision checks here.

> ... So in preparation, let's actually
> implement a byte-for-byte collision check.

But the byte-for-byte collision check is only good for packfiles.

In other words, the definition of "content" that gets hashed to
derive the name can differ among csum-file users, so the way to
check for collision can be written for these different types.

> Note that this may cause some extra computation when the files are in
> fact identical, but this should happen rarely.
>
> Loose objects are exempt from this check, and the collision check may be
> skipped by calling the _flags variant of this function with the
> FOF_SKIP_COLLISION_CHECK bit set. This is done for a couple of reasons:
>
>   - We don't treat the hash of the loose object file's contents as a
>     checksum, since the same loose object can be stored using different
>     bytes on disk (e.g., when adjusting core.compression, using a
>     different version of zlib, etc.).

That's a good explanation why byte-for-byte check is inappropriate.

>   - We already use the path of the loose object as its hash value /
>     object name, so checking for collisions at the content level doesn't
>     add anything.

"We already ... doesn't add anything" -> "The point of collision
check is to find an attempt to store different contents that hashes
down to the same object name, and we continue to use sha1dc to hash
the content name so such a collision would have already been
detected".

>     This is why we do not bother to check the inflated object contents
>     for collisions either, since either (a) the object contents have the
>     fingerprint of a SHA-1 collision, in which case the collision
>     detecting SHA-1 implementation used to hash the contents to give us
>     a path would have already rejected it, or (b) the contents are part
>     of a colliding pair which does not bear the same fingerprints of
>     known collision attacks, in which case we would not have caught it
>     anyway.

I do not understand (b).  If you compare inflated contents, you
would find such a pair that sha1dc missed but the "implement
collision check" patch would have caught as a pair of byte sequences
that hash to the same value.  Isn't it an opportunity to make it safer
to implement such a loose object collision check?

Thanks.
Taylor Blau Sept. 24, 2024, 9:59 p.m. UTC | #3
On Tue, Sep 24, 2024 at 04:37:18PM -0400, Jeff King wrote:
> But I think we can expand the argument a bit. I don't think this is a
> very good place for such a collision check, because race conditions
> aside, we'll already have bailed long before! When we do a write via
> write_object_file(), for example, we'll hit the freshen paths that
> assume the contents are identical, and skip the write (and thus this
> finalize) entirely.
>
> So if you want a collision check, we have to do it at a much higher
> level. And indeed we do: index-pack already implements such a check
> (through the confusingly named "check_collison" function). I don't think
> unpack-objects does so (it looks like it just feeds the result to
> write_object_file(), which does the freshening thing).
>
> So the argument I'd make here is more like: this is the wrong place to
> do it.

I think that is reasonable, and I agree with your reasoning here. I'm
happy to reword the commit message if you think doing so would be
useful, or drop the paragraph entirely if you think it's just confusing.

Let me know what you think, I'm happy to do whatever here, reroll or not
:-).

>   Side thoughts on collision checks:
>
>     I think index-pack is safe in the sense that it will always prefer
>     on-disk copies and will complain if it sees a collision.
>     unpack-objects is not, nor are regular in-repo writes (which
>     normally cannot be triggered by remote, but on forges that do
>     merges, etc, that's not always true). Both of the latter are also
>     subject to races, where a simultaneous collision might let the
>     attacker win.

Yup.

>     That race is kind of moot in a world where anybody can push to a
>     fork of a repo that ends up in the same shared location (so they can
>     actually win and become the "on disk" copy), and we're relying on
>     sha1dc for protection there. That's specific to certain forges, but
>     they do represent a lot of Git use.
>
>     In general, the use of unpack-objects versus index-pack is up to the
>     attacker (based on pack size). So I think it would be nice if
>     unpack-objects did the collision check. Even if the attacker beats
>     you to writing the object, it would be nice to see "holy crap, there
>     was a collision" instead of just silently throwing your pushed
>     object away.

Right, I agree that unpack-objects definitely should do the collision
check here. And indeed, that is the case, since that code (which all is
directly implemented in the builtin) uses the regular
collision-detecting SHA-1 implementation.

>     I know that GitHub only ever runs index-pack, which may give some
>     amount of protection here. In general, I think we should consider
>     deprecating unpack-objects in favor of teaching index-pack to
>     unpack. It has many enhancements (this one, but also threading) that
>     unpack-objects does not. I have an old patch series for this (sorry,
>     only from 2017, I'm slipping). IIRC the sticking point was that
>     unpack-objects is better about memory use in some cases (streaming
>     large blobs?) and I hadn't figured out a way around that.

Only seven years old? Sheesh, you really are slipping ;-).

> Phew. Sorry, that was a long digression for "I think you're right, but I
> might have argued it a little differently". I think the direction of the
> patch (skipping checks entirely for loose objects) is the right thing.
>
> > As a small note related to the latter bullet point above, we must teach
> > the tmp-objdir routines to similarly skip the content-level collision
> > checks when calling migrate_one() on a loose object file, which we do by
> > setting the FOF_SKIP_COLLISION_CHECK bit when we are inside of a loose
> > object shard.
>
> OK, this is the part I was wondering how you were going to deal with. :)
> Let's look at the implementation...
>
> > @@ -239,11 +247,15 @@ static int migrate_paths(struct strbuf *src, struct strbuf *dst)
> >
> >  	for (i = 0; i < paths.nr; i++) {
> >  		const char *name = paths.items[i].string;
> > +		enum finalize_object_file_flags flags_copy = flags;
> >
> >  		strbuf_addf(src, "/%s", name);
> >  		strbuf_addf(dst, "/%s", name);
> >
> > -		ret |= migrate_one(src, dst);
> > +		if (is_loose_object_shard(name))
> > +			flags_copy |= FOF_SKIP_COLLISION_CHECK;
> > +
> > +		ret |= migrate_one(src, dst, flags_copy);
> >
> >  		strbuf_setlen(src, src_len);
> >  		strbuf_setlen(dst, dst_len);
>
> This looks pretty reasonable overall, though I'd note that
> migrate_paths() is called recursively. So if I had:
>
>   tmp-objdir-XXXXXX/pack/ab/foo.pack
>
> we'd skip the collision check. I'm not sure how much we want to worry
> about that. I don't think we'd ever create such a path, and the general
> form of the paths is all under local control, so an attacker can't
> convince us to do so. And I find it pretty unlikely that we'd change the
> on-disk layout to accidentally trigger this.

I thought about this when writing it, and came to the conclusion that
the checks we have for "are we in something that looks like a loose
object shard?" are sane. That's only because we won't bother reading a
pack in $GIT_DIR/objects/pack/xx/yy.pack, since we do not read packs
recursively from the pack sub-directory.

So I think that the diff you posted below isn't hurting anything, and
the implementation looks correct to me, but I also don't think it's
helping anything either as a consequence of the above.

Thanks,
Taylor
Taylor Blau Sept. 24, 2024, 10:02 p.m. UTC | #4
On Tue, Sep 24, 2024 at 02:32:21PM -0700, Junio C Hamano wrote:
> >   - We already use the path of the loose object as its hash value /
> >     object name, so checking for collisions at the content level doesn't
> >     add anything.
>
> "We already ... doesn't add anything" -> "The point of collision
> check is to find an attempt to store different contents that hashes
> down to the same object name, and we continue to use sha1dc to hash
> the content name so such a collision would have already been
> detected".

Exactly.

> >     This is why we do not bother to check the inflated object contents
> >     for collisions either, since either (a) the object contents have the
> >     fingerprint of a SHA-1 collision, in which case the collision
> >     detecting SHA-1 implementation used to hash the contents to give us
> >     a path would have already rejected it, or (b) the contents are part
> >     of a colliding pair which does not bear the same fingerprints of
> >     known collision attacks, in which case we would not have caught it
> >     anyway.
>
> I do not understand (b).  If you compare inflated contents, you
> would find such a pair that sha1dc missed but the "implement
> collision check" patch would have caught as a pair of byte sequences
> that hash to the same value.  Isn't it an opportunity to make it safer
> to implement such a loose object collision check?

All I was saying with (b) was that if you had some new collision attack
that didn't trigger the existing detection mechanisms in the existing
collision-detecting SHA-1 implementation, then you wouldn't notice the
collision anyway.

You could inflate the contents and compare them, but I don't think it
would much matter at that point since you've already let one of them
enter the repository, and it's not clear which is the "correct" one to
keep since they both hash to the same value.

Thanks,
Taylor
Jeff King Sept. 24, 2024, 10:20 p.m. UTC | #5
On Tue, Sep 24, 2024 at 05:59:10PM -0400, Taylor Blau wrote:

> > So the argument I'd make here is more like: this is the wrong place to
> > do it.
> 
> I think that is reasonable, and I agree with your reasoning here. I'm
> happy to reword the commit message if you think doing so would be
> useful, or drop the paragraph entirely if you think it's just confusing.
> 
> Let me know what you think, I'm happy to do whatever here, reroll or not
> :-).

I'm content to let this live in the list archive, but it sounds like
Junio had the same reaction, so it may be worth trying to rework the
commit message a bit.

> >     In general, the use of unpack-objects versus index-pack is up to the
> >     attacker (based on pack size). So I think it would be nice if
> >     unpack-objects did the collision check. Even if the attacker beats
> >     you to writing the object, it would be nice to see "holy crap, there
> >     was a collision" instead of just silently throwing your pushed
> >     object away.
> 
> Right, I agree that unpack-objects definitely should do the collision
> check here. And indeed, that is the case, since that code (which all is
> directly implemented in the builtin) uses the regular
> collision-detecting SHA-1 implementation.

It uses sha1dc, but what I mean by "collision check" is an additional
belt-and-suspenders check. That even once we see an object which
hashes to a particular sha1 which we already have, we'll do the
byte-for-byte comparison. See index-pack's "check_collison".

And that's what I think should be added to unpack-objects (but again,
this is all orthogonal to your series).

> I thought about this when writing it, and came to the conclusion that
> the checks we have for "are we in something that looks like a loose
> object shard?" are sane. That's only because we won't bother reading a
> pack in $GIT_DIR/objects/pack/xx/yy.pack, since we do not read packs
> recursively from the pack sub-directory.
> 
> So I think that the diff you posted below isn't hurting anything, and
> the implementation looks correct to me, but I also don't think it's
> helping anything either as a consequence of the above.

Right, it's purely about future-proofing against a hypothetical change
where we'd be more inclusive (both in what we read, but also in what
we'd ourselves write!).

-Peff
Taylor Blau Sept. 25, 2024, 6:06 p.m. UTC | #6
On Tue, Sep 24, 2024 at 06:20:39PM -0400, Jeff King wrote:
> On Tue, Sep 24, 2024 at 05:59:10PM -0400, Taylor Blau wrote:
>
> > > So the argument I'd make here is more like: this is the wrong place to
> > > do it.
> >
> > I think that is reasonable, and I agree with your reasoning here. I'm
> > happy to reword the commit message if you think doing so would be
> > useful, or drop the paragraph entirely if you think it's just confusing.
> >
> > Let me know what you think, I'm happy to do whatever here, reroll or not
> > :-).
>
> I'm content to let this live in the list archive, but it sounds like
> Junio had the same reaction, so it may be worth trying to rework the
> commit message a bit.

Here's the relevant portion of my range-diff that has the new wording
(which is more or less equivalent to your "this isn't the right place to
do it, and we're not fundamentally changing anything from a security
perspective here" argument). Let me know what you think:

--- 8< ---
3:  ed9eeef851 ! 3:  41d38352a4 finalize_object_file(): implement collision check
    @@ Commit message
             object name, so checking for collisions at the content level doesn't
             add anything.

    -        This is why we do not bother to check the inflated object contents
    -        for collisions either, since either (a) the object contents have the
    -        fingerprint of a SHA-1 collision, in which case the collision
    -        detecting SHA-1 implementation used to hash the contents to give us
    -        a path would have already rejected it, or (b) the contents are part
    -        of a colliding pair which does not bear the same fingerprints of
    -        known collision attacks, in which case we would not have caught it
    -        anyway.
    +        Adding a content-level collision check would have to happen at a
    +        higher level than in finalize_object_file(), since (avoiding race
    +        conditions) writing an object loose which already exists in the
    +        repository will prevent us from even reaching finalize_object_file()
    +        via the object freshening code.
    +
    +        There is a collision check in index-pack via its `check_collision()`
    +        function, but there isn't an analogous function in unpack-objects,
    +        which just feeds the result to write_object_file().

             So skipping the collision check here does not change for better or
             worse the hardness of loose object writes.
    @@ Commit message

         Co-authored-by: Jeff King <peff@peff.net>
         Signed-off-by: Jeff King <peff@peff.net>
    +    Helped-by: Elijah Newren <newren@gmail.com>
         Signed-off-by: Taylor Blau <me@ttaylorr.com>

      ## object-file.c ##
--- >8 ---

Thanks,
Taylor
diff mbox series

Patch

diff --git a/object-file.c b/object-file.c
index 5a1da577115..b9a3a1f62db 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1932,10 +1932,67 @@  static void write_object_file_prepare_literally(const struct git_hash_algo *algo
 	hash_object_body(algo, &c, buf, len, oid, hdr, hdrlen);
 }
 
+static int check_collision(const char *filename_a, const char *filename_b)
+{
+	char buf_a[4096], buf_b[4096];
+	int fd_a = -1, fd_b = -1;
+	int ret = 0;
+
+	fd_a = open(filename_a, O_RDONLY);
+	if (fd_a < 0) {
+		ret = error_errno(_("unable to open %s"), filename_a);
+		goto out;
+	}
+
+	fd_b = open(filename_b, O_RDONLY);
+	if (fd_b < 0) {
+		ret = error_errno(_("unable to open %s"), filename_b);
+		goto out;
+	}
+
+	while (1) {
+		ssize_t sz_a, sz_b;
+
+		sz_a = read_in_full(fd_a, buf_a, sizeof(buf_a));
+		if (sz_a < 0) {
+			ret = error_errno(_("unable to read %s"), filename_a);
+			goto out;
+		}
+
+		sz_b = read_in_full(fd_b, buf_b, sizeof(buf_b));
+		if (sz_b < 0) {
+			ret = error_errno(_("unable to read %s"), filename_b);
+			goto out;
+		}
+
+		if (sz_a != sz_b || memcmp(buf_a, buf_b, sz_a)) {
+			ret = error(_("files '%s' and '%s' differ in contents"),
+				    filename_a, filename_b);
+			goto out;
+		}
+
+		if (sz_a < sizeof(buf_a))
+			break;
+	}
+
+out:
+	if (fd_a > -1)
+		close(fd_a);
+	if (fd_b > -1)
+		close(fd_b);
+	return ret;
+}
+
 /*
  * Move the just written object into its final resting place.
  */
 int finalize_object_file(const char *tmpfile, const char *filename)
+{
+	return finalize_object_file_flags(tmpfile, filename, 0);
+}
+
+int finalize_object_file_flags(const char *tmpfile, const char *filename,
+			       enum finalize_object_file_flags flags)
 {
 	struct stat st;
 	int ret = 0;
@@ -1974,7 +2031,9 @@  int finalize_object_file(const char *tmpfile, const char *filename)
 			errno = saved_errno;
 			return error_errno(_("unable to write file %s"), filename);
 		}
-		/* FIXME!!! Collision check here ? */
+		if (!(flags & FOF_SKIP_COLLISION_CHECK) &&
+		    check_collision(tmpfile, filename))
+				return -1;
 		unlink_or_warn(tmpfile);
 	}
 
@@ -2228,7 +2287,8 @@  static int write_loose_object(const struct object_id *oid, char *hdr,
 			warning_errno(_("failed utime() on %s"), tmp_file.buf);
 	}
 
-	return finalize_object_file(tmp_file.buf, filename.buf);
+	return finalize_object_file_flags(tmp_file.buf, filename.buf,
+					  FOF_SKIP_COLLISION_CHECK);
 }
 
 static int freshen_loose_object(const struct object_id *oid)
@@ -2350,7 +2410,8 @@  int stream_loose_object(struct input_stream *in_stream, size_t len,
 		strbuf_release(&dir);
 	}
 
-	err = finalize_object_file(tmp_file.buf, filename.buf);
+	err = finalize_object_file_flags(tmp_file.buf, filename.buf,
+					 FOF_SKIP_COLLISION_CHECK);
 	if (!err && compat)
 		err = repo_add_loose_object_map(the_repository, oid, &compat_oid);
 cleanup:
diff --git a/object-file.h b/object-file.h
index d6414610f80..81b30d269c8 100644
--- a/object-file.h
+++ b/object-file.h
@@ -117,7 +117,13 @@  int check_object_signature(struct repository *r, const struct object_id *oid,
  */
 int stream_object_signature(struct repository *r, const struct object_id *oid);
 
+enum finalize_object_file_flags {
+	FOF_SKIP_COLLISION_CHECK = 1,
+};
+
 int finalize_object_file(const char *tmpfile, const char *filename);
+int finalize_object_file_flags(const char *tmpfile, const char *filename,
+			       enum finalize_object_file_flags flags);
 
 /* Helper to check and "touch" a file */
 int check_and_freshen_file(const char *fn, int freshen);
diff --git a/tmp-objdir.c b/tmp-objdir.c
index c2fb9f91930..9da0071cba8 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -206,9 +206,11 @@  static int read_dir_paths(struct string_list *out, const char *path)
 	return 0;
 }
 
-static int migrate_paths(struct strbuf *src, struct strbuf *dst);
+static int migrate_paths(struct strbuf *src, struct strbuf *dst,
+			 enum finalize_object_file_flags flags);
 
-static int migrate_one(struct strbuf *src, struct strbuf *dst)
+static int migrate_one(struct strbuf *src, struct strbuf *dst,
+		       enum finalize_object_file_flags flags)
 {
 	struct stat st;
 
@@ -220,12 +222,18 @@  static int migrate_one(struct strbuf *src, struct strbuf *dst)
 				return -1;
 		} else if (errno != EEXIST)
 			return -1;
-		return migrate_paths(src, dst);
+		return migrate_paths(src, dst, flags);
 	}
-	return finalize_object_file(src->buf, dst->buf);
+	return finalize_object_file_flags(src->buf, dst->buf, flags);
 }
 
-static int migrate_paths(struct strbuf *src, struct strbuf *dst)
+static int is_loose_object_shard(const char *name)
+{
+	return strlen(name) == 2 && isxdigit(name[0]) && isxdigit(name[1]);
+}
+
+static int migrate_paths(struct strbuf *src, struct strbuf *dst,
+			 enum finalize_object_file_flags flags)
 {
 	size_t src_len = src->len, dst_len = dst->len;
 	struct string_list paths = STRING_LIST_INIT_DUP;
@@ -239,11 +247,15 @@  static int migrate_paths(struct strbuf *src, struct strbuf *dst)
 
 	for (i = 0; i < paths.nr; i++) {
 		const char *name = paths.items[i].string;
+		enum finalize_object_file_flags flags_copy = flags;
 
 		strbuf_addf(src, "/%s", name);
 		strbuf_addf(dst, "/%s", name);
 
-		ret |= migrate_one(src, dst);
+		if (is_loose_object_shard(name))
+			flags_copy |= FOF_SKIP_COLLISION_CHECK;
+
+		ret |= migrate_one(src, dst, flags_copy);
 
 		strbuf_setlen(src, src_len);
 		strbuf_setlen(dst, dst_len);
@@ -271,7 +283,7 @@  int tmp_objdir_migrate(struct tmp_objdir *t)
 	strbuf_addbuf(&src, &t->path);
 	strbuf_addstr(&dst, repo_get_object_directory(the_repository));
 
-	ret = migrate_paths(&src, &dst);
+	ret = migrate_paths(&src, &dst, 0);
 
 	strbuf_release(&src);
 	strbuf_release(&dst);