Message ID | 0feee5d1d4fc1e0aa1ca5d98f2f404ecdbb2f6b6.1725651952.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | hash.h: support choosing a separate SHA-1 for non-cryptographic uses | expand |
Taylor Blau <me@ttaylorr.com> writes: > Note that this may cause some extra computation when the files are in > fact identical, but this should happen rarely. For example, when writing > a loose object, we compute the object id first, then check manually for > an existing object (so that we can freshen its timestamp) before > actually trying to write and link the new data. True. > +static int check_collision(const char *filename_a, const char *filename_b) > +{ > + char buf_a[4096], buf_b[4096]; > + int fd_a = -1, fd_b = -1; > + int ret = 0; > + > + fd_a = open(filename_a, O_RDONLY); > + if (fd_a < 0) { > + ret = error_errno(_("unable to open %s"), filename_a); > + goto out; > + } > + > + fd_b = open(filename_b, O_RDONLY); > + if (fd_b < 0) { > + ret = error_errno(_("unable to open %s"), filename_b); > + goto out; > + } Two and two half comments on this function. * We compare 4k at a time here, while copy.c copies 8k at a time, and bulk-checkin.c uses 16k at a time. Outside the scope of this topic, we probably should pick one number and stick to it, unless we have measured to pick perfect number for each case (and I know I picked 8k for copy.c and 16k for bulk-checkin.c both out of thin air). * I would have expected at least we would fstat() them to declare difference immediately after we find their sizes differ, for example. As we assume that calling into this function should be rare, we prefer not to pay in complexity for performance here? * We use read_in_full() and assume that a short-read return from the function happens only at the end of file due to EOF, which is another reason why we can do away without fstat() on these files. * An error causes the caller to assume collision (because we assign the return value of error() to ret), which should do the same action as an actual collision to abort and keep the problematic file for forensics. > /* > * Move the just written object into its final resting place. > */ > @@ -1941,8 +1992,8 @@ int finalize_object_file(const char *tmpfile, const char *filename) > errno = saved_errno; > return error_errno(_("unable to write file %s"), filename); > } > - /* FIXME!!! Collision check here ? */ > - unlink_or_warn(tmpfile); > + if (check_collision(tmpfile, filename)) > + return -1; > } OK.
On Fri, Sep 6, 2024 at 2:44 PM Junio C Hamano <gitster@pobox.com> wrote: > Two and two half comments on [check_collision]. > > * We compare 4k at a time here, while copy.c copies 8k at a time, > and bulk-checkin.c uses 16k at a time. Outside the scope of this > topic, we probably should pick one number and stick to it, unless > we have measured to pick perfect number for each case (and I know > I picked 8k for copy.c and 16k for bulk-checkin.c both out of > thin air). In Ye Olden Days, 4k would be fine, and going back 40+ years even 512 would be fine. For *writing* modern systems often prefer at least 128K or even 1M; anything under 8K is Right Out. For *writing* it tends to be less important due to caches. Still: > * I would have expected at least we would fstat() them to declare > difference immediately after we find their sizes differ, for > example. As we assume that calling into this function should be > rare, we prefer not to pay in complexity for performance here? Another benefit of calling `stat` is that you get `st_blksize`, which is the system's recommended I/O block size. I almost commented about this earlier but the "should be rare" thing held me back. :-) (I have no comments on the rest of the comments.) Chris
On Fri, Sep 06, 2024 at 02:44:12PM -0700, Junio C Hamano wrote: > > +static int check_collision(const char *filename_a, const char *filename_b) > [...] > > Two and two half comments on this function. > > * We compare 4k at a time here, while copy.c copies 8k at a time, > and bulk-checkin.c uses 16k at a time. Outside the scope of this > topic, we probably should pick one number and stick to it, unless > we have measured to pick perfect number for each case (and I know > I picked 8k for copy.c and 16k for bulk-checkin.c both out of > thin air). I can confirm that 4k was picked out of thin air by me while writing this function. ;) It's my usual size for I/O just because it often matches the page size. I wouldn't be surprised if other values are faster, but I hope that this code would run pretty infrequently. So I wouldn't worry too much about it, but if there's an obviously better I/O size and we have a #define for it, it would make sense to use it. I do wonder if we'd run into stack limitations at some point. As an alternative, we could also mmap the files and then compare the full arrays directly. That might be faster? > * I would have expected at least we would fstat() them to declare > difference immediately after we find their sizes differ, for > example. As we assume that calling into this function should be > rare, we prefer not to pay in complexity for performance here? I actually had the opposite thought about fstat() performance while writing it. In a world where you expect most name collisions to actually have different content, then checking their sizes lets you skip the more expensive byte-wise check. But in a world where you mostly expect them _not_ to differ, then the size check usually does not help, and you waste time doing it. I.e., it is a cache-miss problem with weighted costs. Now thinking on it more, that view is probably dumb. fstat() is really cheap, and byte-wise comparisons are really expensive, so if it has even a tiny chance of helping, it might be worth doing. Though again, I'd hope this will trigger pretty rarely in practice, because it's probably a sign that we could have skipped work earlier (i.e., by realizing we were just going to generate an identical file, and not generated it in the first place). So I'd be content with just about any implementation, and waiting to see if it ever becomes a performance pain point. > * We use read_in_full() and assume that a short-read return from > the function happens only at the end of file due to EOF, which is > another reason why we can do away without fstat() on these files. > > * An error causes the caller to assume collision (because we assign > the return value of error() to ret), which should do the same > action as an actual collision to abort and keep the problematic > file for forensics. Yup, both of those were very intentional. :) -Peff
Jeff King <peff@peff.net> writes: > Now thinking on it more, that view is probably dumb. fstat() is really > cheap, and byte-wise comparisons are really expensive, so if it has even > a tiny chance of helping, it might be worth doing. If we see too many "yikes we need to check for collision" cases, then yes, but I agree that it should be rare to matter. > Though again, I'd hope this will trigger pretty rarely in practice, > because it's probably a sign that we could have skipped work earlier > (i.e., by realizing we were just going to generate an identical file, > and not generated it in the first place). Yes, exactly.
On Fri, Sep 06, 2024 at 03:46:12PM -0400, Taylor Blau wrote: > diff --git a/object-file.c b/object-file.c > index 54a82a5f7a0..85f91516429 100644 > --- a/object-file.c > +++ b/object-file.c > @@ -1899,6 +1899,57 @@ static void write_object_file_prepare_literally(const struct git_hash_algo *algo > hash_object_body(algo, &c, buf, len, oid, hdr, hdrlen); > } > > +static int check_collision(const char *filename_a, const char *filename_b) > +{ > + char buf_a[4096], buf_b[4096]; > + int fd_a = -1, fd_b = -1; > + int ret = 0; > + > + fd_a = open(filename_a, O_RDONLY); > + if (fd_a < 0) { > + ret = error_errno(_("unable to open %s"), filename_a); > + goto out; > + } > + > + fd_b = open(filename_b, O_RDONLY); > + if (fd_b < 0) { > + ret = error_errno(_("unable to open %s"), filename_b); > + goto out; > + } > + > + while (1) { > + ssize_t sz_a, sz_b; > + > + sz_a = read_in_full(fd_a, buf_a, sizeof(buf_a)); > + if (sz_a < 0) { > + ret = error_errno(_("unable to read %s"), filename_a); > + goto out; > + } > + > + sz_b = read_in_full(fd_b, buf_b, sizeof(buf_b)); > + if (sz_b < 0) { > + ret = error_errno(_("unable to read %s"), filename_b); > + goto out; > + } > + > + if (sz_a != sz_b || memcmp(buf_a, buf_b, sz_a)) { > + ret = error(_("files '%s' and '%s' differ in contents"), > + filename_a, filename_b); > + goto out; > + } > + > + if (sz_a < sizeof(buf_a)) > + break; > + } > + > +out: > + if (fd_a > -1) > + close(fd_a); > + if (fd_b > -1) > + close(fd_b); > + return ret; > +} > + > /* > * Move the just written object into its final resting place. > */ This function compares the exact contents, but isn't that wrong? The contents may differ even though the object is the same because the object hash is derived from the uncompressed data, whereas we store compressed data on disk. So this might trigger when you have different zlib versions, but also if you configure core.compression differently. Patrick
On Mon, Sep 16, 2024 at 12:45:38PM +0200, Patrick Steinhardt wrote: > This function compares the exact contents, but isn't that wrong? The > contents may differ even though the object is the same because the > object hash is derived from the uncompressed data, whereas we store > compressed data on disk. > > So this might trigger when you have different zlib versions, but also if > you configure core.compression differently. Oh, shoot -- you're right for when we call finalize_object_file() on loose objects. For packfiles and other spots where we use the hashfile API, the name of the file depends on the checksum'd contents, so we're safe there. But for loose objects the, the name of the file is based on the object hash, which is the hash of the uncompressed contents. So I think we would need something like this on top: --- 8< --- diff --git a/object-file.c b/object-file.c index 84dd7d0fab..2f1616651e 100644 --- a/object-file.c +++ b/object-file.c @@ -1992,10 +1992,8 @@ static int check_collision(const char *filename_a, const char *filename_b) return ret; } -/* - * Move the just written object into its final resting place. - */ -int finalize_object_file(const char *tmpfile, const char *filename) +static int finalize_object_file_1(const char *tmpfile, const char *filename, + unsigned check_collisions) { struct stat st; int ret = 0; @@ -2044,6 +2042,14 @@ int finalize_object_file(const char *tmpfile, const char *filename) return 0; } +/* + * Move the just written object into its final resting place. + */ +int finalize_object_file(const char *tmpfile, const char *filename) +{ + return finalize_object_file_1(tmpfile, filename, 1); +} + static void hash_object_file_literally(const struct git_hash_algo *algo, const void *buf, unsigned long len, const char *type, struct object_id *oid) @@ -2288,7 +2294,7 @@ static int write_loose_object(const struct object_id *oid, char *hdr, warning_errno(_("failed utime() on %s"), tmp_file.buf); } - return finalize_object_file(tmp_file.buf, filename.buf); + return finalize_object_file_1(tmp_file.buf, filename.buf, 0); } static int freshen_loose_object(const struct object_id *oid) @@ -2410,7 +2416,7 @@ int stream_loose_object(struct input_stream *in_stream, size_t len, strbuf_release(&dir); } - err = finalize_object_file(tmp_file.buf, filename.buf); + err = finalize_object_file_1(tmp_file.buf, filename.buf, 0); if (!err && compat) err = repo_add_loose_object_map(the_repository, oid, &compat_oid); cleanup: --- >8 --- But I'd like for some others to take a look at this before I send a new round that includes this change. Thanks for catching that! Thanks, Taylor
On Mon, Sep 16, 2024 at 11:54:14AM -0400, Taylor Blau wrote: > On Mon, Sep 16, 2024 at 12:45:38PM +0200, Patrick Steinhardt wrote: > > This function compares the exact contents, but isn't that wrong? The > > contents may differ even though the object is the same because the > > object hash is derived from the uncompressed data, whereas we store > > compressed data on disk. > > > > So this might trigger when you have different zlib versions, but also if > > you configure core.compression differently. > > Oh, shoot -- you're right for when we call finalize_object_file() on > loose objects. > > For packfiles and other spots where we use the hashfile API, the > name of the file depends on the checksum'd contents, so we're safe > there. But for loose objects the, the name of the file is based on the > object hash, which is the hash of the uncompressed contents. > > So I think we would need something like this on top: Thinking about this a little more, I think that most cases should actually be OK. Of course, this only affects repositories that are changing their zlib version, and/or changing their core.compression setting regularly, so this is all pretty niche, but still... We often guard calls to write_loose_object() with either calling freshen_loose_object(), or freshen_packed_object(), which will update the mtime of the containing pack or loose object path for a given hash. So if we already have a loose object with hash X in the object store, and we try to write that same object again with a different zlib version and/or core.compression setting, we'll simply call freshen_loose_object() and optimize out the actual object write. Of course, that's not always the case. We might e.g., force an object loose via loosen_packed_objects() which could encounter a TOCTOU race with an incoming object write using a different compression setting. But that seems exceedingly rare to me. I think that we should still take that change in the mail I'm replying to, but I would definitely appreciate others' thoughts here. Thanks, Taylor
Patrick Steinhardt <ps@pks.im> writes: >> +static int check_collision(const char *filename_a, const char *filename_b) >> +{ >> ... >> + return ret; >> +} >> + >> /* >> * Move the just written object into its final resting place. >> */ > > This function compares the exact contents, but isn't that wrong? The > contents may differ even though the object is the same because the > object hash is derived from the uncompressed data, whereas we store > compressed data on disk. Very true. So check_collision is *not* a good name for this helper function. For .pack files, a collision means two resulting files have the identical contents, so the above function, perhaps renamed to "compare_files()" or something, is a very appropriate helper to detect a collision once you get two files that claim to be the same .pack file. For loose object files, a collision means the contents before compression are the same, so if we wanted to check collisions within the same framework, we'd need "compare_files_after_inflating()" or something, that opens both files and compare their contents while "inflating" via zlib. Thanks.
diff --git a/object-file.c b/object-file.c index 54a82a5f7a0..85f91516429 100644 --- a/object-file.c +++ b/object-file.c @@ -1899,6 +1899,57 @@ static void write_object_file_prepare_literally(const struct git_hash_algo *algo hash_object_body(algo, &c, buf, len, oid, hdr, hdrlen); } +static int check_collision(const char *filename_a, const char *filename_b) +{ + char buf_a[4096], buf_b[4096]; + int fd_a = -1, fd_b = -1; + int ret = 0; + + fd_a = open(filename_a, O_RDONLY); + if (fd_a < 0) { + ret = error_errno(_("unable to open %s"), filename_a); + goto out; + } + + fd_b = open(filename_b, O_RDONLY); + if (fd_b < 0) { + ret = error_errno(_("unable to open %s"), filename_b); + goto out; + } + + while (1) { + ssize_t sz_a, sz_b; + + sz_a = read_in_full(fd_a, buf_a, sizeof(buf_a)); + if (sz_a < 0) { + ret = error_errno(_("unable to read %s"), filename_a); + goto out; + } + + sz_b = read_in_full(fd_b, buf_b, sizeof(buf_b)); + if (sz_b < 0) { + ret = error_errno(_("unable to read %s"), filename_b); + goto out; + } + + if (sz_a != sz_b || memcmp(buf_a, buf_b, sz_a)) { + ret = error(_("files '%s' and '%s' differ in contents"), + filename_a, filename_b); + goto out; + } + + if (sz_a < sizeof(buf_a)) + break; + } + +out: + if (fd_a > -1) + close(fd_a); + if (fd_b > -1) + close(fd_b); + return ret; +} + /* * Move the just written object into its final resting place. */ @@ -1941,8 +1992,8 @@ int finalize_object_file(const char *tmpfile, const char *filename) errno = saved_errno; return error_errno(_("unable to write file %s"), filename); } - /* FIXME!!! Collision check here ? */ - unlink_or_warn(tmpfile); + if (check_collision(tmpfile, filename)) + return -1; } out: