[v3,3/9] finalize_object_file(): implement collision check

Message ID	0feee5d1d4fc1e0aa1ca5d98f2f404ecdbb2f6b6.1725651952.git.me@ttaylorr.com (mailing list archive)
State	Superseded
Headers	show Received: from mail-yb1-f176.google.com (mail-yb1-f176.google.com [209.85.219.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 540B71D7E2D for <git@vger.kernel.org>; Fri, 6 Sep 2024 19:46:15 +0000 (UTC) Date: Fri, 6 Sep 2024 15:46:12 -0400 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: Jeff King <peff@peff.net>, "brian m. carlson" <sandals@crustytoothpaste.net>, Elijah Newren <newren@gmail.com>, Patrick Steinhardt <ps@pks.im>, Junio C Hamano <gitster@pobox.com> Subject: [PATCH v3 3/9] finalize_object_file(): implement collision check Message-ID: <0feee5d1d4fc1e0aa1ca5d98f2f404ecdbb2f6b6.1725651952.git.me@ttaylorr.com> References: <cover.1725206584.git.me@ttaylorr.com> <cover.1725651952.git.me@ttaylorr.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <cover.1725651952.git.me@ttaylorr.com>
Series	hash.h: support choosing a separate SHA-1 for non-cryptographic uses \| expand [v3,0/9] hash.h: support choosing a separate SHA-1 for non-cryptographic uses [v3,1/9] finalize_object_file(): check for name collision before renaming [v3,2/9] finalize_object_file(): refactor unlink_or_warn() placement [v3,3/9] finalize_object_file(): implement collision check [v3,4/9] pack-objects: use finalize_object_file() to rename pack/idx/etc [v3,5/9] i5500-git-daemon.sh: use compile-able version of Git without OpenSSL [v3,6/9] sha1: do not redefine `platform_SHA_CTX` and friends [v3,7/9] hash.h: scaffolding for _fast hashing variants [v3,8/9] Makefile: allow specifying a SHA-1 for non-cryptographic uses [v3,9/9] csum-file.c: use fast SHA-1 implementation when available

Taylor Blau Sept. 6, 2024, 7:46 p.m. UTC

We've had "FIXME!!! Collision check here ?" in finalize_object_file()
since aac1794132 (Improve sha1 object file writing., 2005-05-03). That
is, when we try to write a file with the same name, we assume the
on-disk contents are the same and blindly throw away the new copy.

One of the reasons we never implemented this is because the files it
moves are all named after the cryptographic hash of their contents
(either loose objects, or packs which have their hash in the name these
days). So we are unlikely to see such a collision by accident. And even
though there are weaknesses in sha1, we assume they are mitigated by our
use of sha1dc.

So while it's a theoretical concern now, it hasn't been a priority.
However, if we start using weaker hashes for pack checksums and names,
this will become a practical concern. So in preparation, let's actually
implement a byte-for-byte collision check.

The new check will cause the write of new differing content to be a
failure, rather than a silent noop, and we'll retain the temporary file
on disk.

Note that this may cause some extra computation when the files are in
fact identical, but this should happen rarely. For example, when writing
a loose object, we compute the object id first, then check manually for
an existing object (so that we can freshen its timestamp) before
actually trying to write and link the new data.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 object-file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 53 insertions(+), 2 deletions(-)

Junio C Hamano Sept. 6, 2024, 9:44 p.m. UTC | #1

Taylor Blau <me@ttaylorr.com> writes:

> Note that this may cause some extra computation when the files are in
> fact identical, but this should happen rarely. For example, when writing
> a loose object, we compute the object id first, then check manually for
> an existing object (so that we can freshen its timestamp) before
> actually trying to write and link the new data.

True.

> +static int check_collision(const char *filename_a, const char *filename_b)
> +{
> +	char buf_a[4096], buf_b[4096];
> +	int fd_a = -1, fd_b = -1;
> +	int ret = 0;
> +
> +	fd_a = open(filename_a, O_RDONLY);
> +	if (fd_a < 0) {
> +		ret = error_errno(_("unable to open %s"), filename_a);
> +		goto out;
> +	}
> +
> +	fd_b = open(filename_b, O_RDONLY);
> +	if (fd_b < 0) {
> +		ret = error_errno(_("unable to open %s"), filename_b);
> +		goto out;
> +	}

Two and two half comments on this function.

 * We compare 4k at a time here, while copy.c copies 8k at a time,
   and bulk-checkin.c uses 16k at a time.  Outside the scope of this
   topic, we probably should pick one number and stick to it, unless
   we have measured to pick perfect number for each case (and I know
   I picked 8k for copy.c and 16k for bulk-checkin.c both out of
   thin air).

 * I would have expected at least we would fstat() them to declare
   difference immediately after we find their sizes differ, for
   example.  As we assume that calling into this function should be
   rare, we prefer not to pay in complexity for performance here?

 * We use read_in_full() and assume that a short-read return from
   the function happens only at the end of file due to EOF, which is
   another reason why we can do away without fstat() on these files.

 * An error causes the caller to assume collision (because we assign
   the return value of error() to ret), which should do the same
   action as an actual collision to abort and keep the problematic
   file for forensics.

>  /*
>   * Move the just written object into its final resting place.
>   */
> @@ -1941,8 +1992,8 @@ int finalize_object_file(const char *tmpfile, const char *filename)
>  			errno = saved_errno;
>  			return error_errno(_("unable to write file %s"), filename);
>  		}
> -		/* FIXME!!! Collision check here ? */
> -		unlink_or_warn(tmpfile);
> +		if (check_collision(tmpfile, filename))
> +			return -1;
>  	}

OK.

Chris Torek Sept. 6, 2024, 9:51 p.m. UTC | #2

On Fri, Sep 6, 2024 at 2:44 PM Junio C Hamano <gitster@pobox.com> wrote:
> Two and two half comments on [check_collision].
>
>  * We compare 4k at a time here, while copy.c copies 8k at a time,
>    and bulk-checkin.c uses 16k at a time.  Outside the scope of this
>    topic, we probably should pick one number and stick to it, unless
>    we have measured to pick perfect number for each case (and I know
>    I picked 8k for copy.c and 16k for bulk-checkin.c both out of
>    thin air).

In Ye Olden Days, 4k would be fine, and going back 40+ years
even 512 would be fine. For *writing* modern systems often
prefer at least 128K or even 1M; anything under 8K is Right Out.
For *writing* it tends to be less important due to caches.  Still:

>  * I would have expected at least we would fstat() them to declare
>    difference immediately after we find their sizes differ, for
>    example.  As we assume that calling into this function should be
>    rare, we prefer not to pay in complexity for performance here?

Another benefit of calling `stat` is that you get `st_blksize`, which
is the system's recommended I/O block size.

I almost commented about this earlier but the "should be rare"
thing held me back. :-)

(I have no comments on the rest of the comments.)

Chris

Jeff King Sept. 10, 2024, 6:53 a.m. UTC | #3

On Fri, Sep 06, 2024 at 02:44:12PM -0700, Junio C Hamano wrote:

> > +static int check_collision(const char *filename_a, const char *filename_b)
> [...]
> 
> Two and two half comments on this function.
> 
>  * We compare 4k at a time here, while copy.c copies 8k at a time,
>    and bulk-checkin.c uses 16k at a time.  Outside the scope of this
>    topic, we probably should pick one number and stick to it, unless
>    we have measured to pick perfect number for each case (and I know
>    I picked 8k for copy.c and 16k for bulk-checkin.c both out of
>    thin air).

I can confirm that 4k was picked out of thin air by me while writing
this function. ;) It's my usual size for I/O just because it often
matches the page size. I wouldn't be surprised if other values are
faster, but I hope that this code would run pretty infrequently.

So I wouldn't worry too much about it, but if there's an obviously
better I/O size and we have a #define for it, it would make sense to use
it. I do wonder if we'd run into stack limitations at some point.

As an alternative, we could also mmap the files and then compare the
full arrays directly. That might be faster?

>  * I would have expected at least we would fstat() them to declare
>    difference immediately after we find their sizes differ, for
>    example.  As we assume that calling into this function should be
>    rare, we prefer not to pay in complexity for performance here?

I actually had the opposite thought about fstat() performance while
writing it.  In a world where you expect most name collisions to
actually have different content, then checking their sizes lets you skip
the more expensive byte-wise check. But in a world where you mostly
expect them _not_ to differ, then the size check usually does not help,
and you waste time doing it. I.e., it is a cache-miss problem with
weighted costs.

Now thinking on it more, that view is probably dumb. fstat() is really
cheap, and byte-wise comparisons are really expensive, so if it has even
a tiny chance of helping, it might be worth doing.

Though again, I'd hope this will trigger pretty rarely in practice,
because it's probably a sign that we could have skipped work earlier
(i.e., by realizing we were just going to generate an identical file,
and not generated it in the first place).

So I'd be content with just about any implementation, and waiting to see
if it ever becomes a performance pain point.

>  * We use read_in_full() and assume that a short-read return from
>    the function happens only at the end of file due to EOF, which is
>    another reason why we can do away without fstat() on these files.
> 
>  * An error causes the caller to assume collision (because we assign
>    the return value of error() to ret), which should do the same
>    action as an actual collision to abort and keep the problematic
>    file for forensics.

Yup, both of those were very intentional. :)

-Peff

Junio C Hamano Sept. 10, 2024, 3:14 p.m. UTC | #4

Jeff King <peff@peff.net> writes:

> Now thinking on it more, that view is probably dumb. fstat() is really
> cheap, and byte-wise comparisons are really expensive, so if it has even
> a tiny chance of helping, it might be worth doing.

If we see too many "yikes we need to check for collision" cases,
then yes, but I agree that it should be rare to matter.

> Though again, I'd hope this will trigger pretty rarely in practice,
> because it's probably a sign that we could have skipped work earlier
> (i.e., by realizing we were just going to generate an identical file,
> and not generated it in the first place).

Yes, exactly.

Patrick Steinhardt Sept. 16, 2024, 10:45 a.m. UTC | #5

On Fri, Sep 06, 2024 at 03:46:12PM -0400, Taylor Blau wrote:
> diff --git a/object-file.c b/object-file.c
> index 54a82a5f7a0..85f91516429 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1899,6 +1899,57 @@ static void write_object_file_prepare_literally(const struct git_hash_algo *algo
>  	hash_object_body(algo, &c, buf, len, oid, hdr, hdrlen);
>  }
>  
> +static int check_collision(const char *filename_a, const char *filename_b)
> +{
> +	char buf_a[4096], buf_b[4096];
> +	int fd_a = -1, fd_b = -1;
> +	int ret = 0;
> +
> +	fd_a = open(filename_a, O_RDONLY);
> +	if (fd_a < 0) {
> +		ret = error_errno(_("unable to open %s"), filename_a);
> +		goto out;
> +	}
> +
> +	fd_b = open(filename_b, O_RDONLY);
> +	if (fd_b < 0) {
> +		ret = error_errno(_("unable to open %s"), filename_b);
> +		goto out;
> +	}
> +
> +	while (1) {
> +		ssize_t sz_a, sz_b;
> +
> +		sz_a = read_in_full(fd_a, buf_a, sizeof(buf_a));
> +		if (sz_a < 0) {
> +			ret = error_errno(_("unable to read %s"), filename_a);
> +			goto out;
> +		}
> +
> +		sz_b = read_in_full(fd_b, buf_b, sizeof(buf_b));
> +		if (sz_b < 0) {
> +			ret = error_errno(_("unable to read %s"), filename_b);
> +			goto out;
> +		}
> +
> +		if (sz_a != sz_b || memcmp(buf_a, buf_b, sz_a)) {
> +			ret = error(_("files '%s' and '%s' differ in contents"),
> +				    filename_a, filename_b);
> +			goto out;
> +		}
> +
> +		if (sz_a < sizeof(buf_a))
> +			break;
> +	}
> +
> +out:
> +	if (fd_a > -1)
> +		close(fd_a);
> +	if (fd_b > -1)
> +		close(fd_b);
> +	return ret;
> +}
> +
>  /*
>   * Move the just written object into its final resting place.
>   */

This function compares the exact contents, but isn't that wrong? The
contents may differ even though the object is the same because the
object hash is derived from the uncompressed data, whereas we store
compressed data on disk.

So this might trigger when you have different zlib versions, but also if
you configure core.compression differently.

Patrick

Taylor Blau Sept. 16, 2024, 3:54 p.m. UTC | #6

On Mon, Sep 16, 2024 at 12:45:38PM +0200, Patrick Steinhardt wrote:
> This function compares the exact contents, but isn't that wrong? The
> contents may differ even though the object is the same because the
> object hash is derived from the uncompressed data, whereas we store
> compressed data on disk.
>
> So this might trigger when you have different zlib versions, but also if
> you configure core.compression differently.

Oh, shoot -- you're right for when we call finalize_object_file() on
loose objects.

For packfiles and other spots where we use the hashfile API, the
name of the file depends on the checksum'd contents, so we're safe
there. But for loose objects the, the name of the file is based on the
object hash, which is the hash of the uncompressed contents.

So I think we would need something like this on top:

--- 8< ---
diff --git a/object-file.c b/object-file.c
index 84dd7d0fab..2f1616651e 100644
--- a/object-file.c
+++ b/object-file.c
@@ -1992,10 +1992,8 @@ static int check_collision(const char *filename_a, const char *filename_b)
 	return ret;
 }

-/*
- * Move the just written object into its final resting place.
- */
-int finalize_object_file(const char *tmpfile, const char *filename)
+static int finalize_object_file_1(const char *tmpfile, const char *filename,
+				  unsigned check_collisions)
 {
 	struct stat st;
 	int ret = 0;
@@ -2044,6 +2042,14 @@ int finalize_object_file(const char *tmpfile, const char *filename)
 	return 0;
 }

+/*
+ * Move the just written object into its final resting place.
+ */
+int finalize_object_file(const char *tmpfile, const char *filename)
+{
+	return finalize_object_file_1(tmpfile, filename, 1);
+}
+
 static void hash_object_file_literally(const struct git_hash_algo *algo,
 				       const void *buf, unsigned long len,
 				       const char *type, struct object_id *oid)
@@ -2288,7 +2294,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr,
 			warning_errno(_("failed utime() on %s"), tmp_file.buf);
 	}

-	return finalize_object_file(tmp_file.buf, filename.buf);
+	return finalize_object_file_1(tmp_file.buf, filename.buf, 0);
 }

 static int freshen_loose_object(const struct object_id *oid)
@@ -2410,7 +2416,7 @@ int stream_loose_object(struct input_stream *in_stream, size_t len,
 		strbuf_release(&dir);
 	}

-	err = finalize_object_file(tmp_file.buf, filename.buf);
+	err = finalize_object_file_1(tmp_file.buf, filename.buf, 0);
 	if (!err && compat)
 		err = repo_add_loose_object_map(the_repository, oid, &compat_oid);
 cleanup:
--- >8 ---

But I'd like for some others to take a look at this before I send a new
round that includes this change.

Thanks for catching that!

Thanks,
Taylor

Taylor Blau Sept. 16, 2024, 4:03 p.m. UTC | #7

On Mon, Sep 16, 2024 at 11:54:14AM -0400, Taylor Blau wrote:
> On Mon, Sep 16, 2024 at 12:45:38PM +0200, Patrick Steinhardt wrote:
> > This function compares the exact contents, but isn't that wrong? The
> > contents may differ even though the object is the same because the
> > object hash is derived from the uncompressed data, whereas we store
> > compressed data on disk.
> >
> > So this might trigger when you have different zlib versions, but also if
> > you configure core.compression differently.
>
> Oh, shoot -- you're right for when we call finalize_object_file() on
> loose objects.
>
> For packfiles and other spots where we use the hashfile API, the
> name of the file depends on the checksum'd contents, so we're safe
> there. But for loose objects the, the name of the file is based on the
> object hash, which is the hash of the uncompressed contents.
>
> So I think we would need something like this on top:

Thinking about this a little more, I think that most cases should
actually be OK. Of course, this only affects repositories that are
changing their zlib version, and/or changing their core.compression
setting regularly, so this is all pretty niche, but still...

We often guard calls to write_loose_object() with either calling
freshen_loose_object(), or freshen_packed_object(), which will update
the mtime of the containing pack or loose object path for a given hash.

So if we already have a loose object with hash X in the object store,
and we try to write that same object again with a different zlib
version and/or core.compression setting, we'll simply call
freshen_loose_object() and optimize out the actual object write.

Of course, that's not always the case. We might e.g., force an object
loose via loosen_packed_objects() which could encounter a TOCTOU race
with an incoming object write using a different compression setting. But
that seems exceedingly rare to me.

I think that we should still take that change in the mail I'm replying
to, but I would definitely appreciate others' thoughts here.

Thanks,
Taylor

Junio C Hamano Sept. 17, 2024, 8:40 p.m. UTC | #8

Patrick Steinhardt <ps@pks.im> writes:

>> +static int check_collision(const char *filename_a, const char *filename_b)
>> +{
>> ...
>> +	return ret;
>> +}
>> +
>>  /*
>>   * Move the just written object into its final resting place.
>>   */
>
> This function compares the exact contents, but isn't that wrong? The
> contents may differ even though the object is the same because the
> object hash is derived from the uncompressed data, whereas we store
> compressed data on disk.

Very true.  So check_collision is *not* a good name for this helper
function.  

For .pack files, a collision means two resulting files have the
identical contents, so the above function, perhaps renamed to
"compare_files()" or something, is a very appropriate helper to
detect a collision once you get two files that claim to be the same
.pack file.

For loose object files, a collision means the contents before
compression are the same, so if we wanted to check collisions within
the same framework, we'd need "compare_files_after_inflating()" or
something, that opens both files and compare their contents while
"inflating" via zlib.

Thanks.

[v3,3/9] finalize_object_file(): implement collision check

Commit Message

Comments

Patch