Message ID | 94c07fd8a557c569fdc83015d5f3902094f21994.1736363652.git.me@ttaylorr.com
---|---
State | Superseded
Series | hash: introduce unsafe_hash_algo(), drop unsafe_ variants
On Wed, Jan 08, 2025 at 02:14:51PM -0500, Taylor Blau wrote:

> Introduce and use a new function which ensures that both parts of a
> hashfile and hashfile_checkpoint pair use the same hash function
> implementation to avoid such crashes.

That makes sense. This should have been encapsulated all along, just
like the actual hash initialization happens inside hashfile_init().

A hashfile_checkpoint is sort of inherently tied to a hashfile, right?
I mean, it is recording an offset that only makes sense in the context
of the parent hashfile. And that is only more true after the
unsafe-hash patches, because now it needs to use the "algop" pointer
from the parent hashfile (though for now we expect all hashfiles to use
the same unsafe-algop, in theory we could use different checksums for
each file).

So in the new constructor:

> +void hashfile_checkpoint_init(struct hashfile *f,
> +			      struct hashfile_checkpoint *checkpoint)
> +{
> +	memset(checkpoint, 0, sizeof(*checkpoint));
> +	f->algop->init_fn(&checkpoint->ctx);
> +}

...should we actually record "f" itself? And then in the existing
functions:

> void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)

...they'd no longer need to take the extra parameter. It creates a
lifetime dependency of the checkpoint struct on the "f" it is
checkpointing, but I think that is naturally modeling the domain.

A semi-related thing I wondered about: do we need a destructor/release
function of some kind? Long ago when this checkpoint code was added, a
memcpy() of the sha_ctx struct was sufficient. But these days we use
clone_fn(), which may call openssl_SHA1_Clone(), which does
EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
doesn't allocate any resources that might need a call to _Final() to
release (or I guess the more efficient way is directly
EVP_MD_CTX_free() under the hood)?

My reading of the openssl manpages suggests that we should be doing
that, or we may see leaks. But it may also be the case that it doesn't
happen to trigger for their implementation.

At any rate, we do not seem to have such a cleanup function. So it is
certainly an orthogonal issue to your series. I wondered about it here
because if we did have one, it would be necessary to clean up the
checkpoint before the hashfile due to the lifetime dependency I
mentioned above.

-Peff
On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:

> So in the new constructor:
>
> > +void hashfile_checkpoint_init(struct hashfile *f,
> > +			      struct hashfile_checkpoint *checkpoint)
> > +{
> > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > +	f->algop->init_fn(&checkpoint->ctx);
> > +}
>
> ...should we actually record "f" itself? And then in the existing
> functions:
>
> > void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
>
> ...they'd no longer need to take the extra parameter.
>
> It creates a lifetime dependency of the checkpoint struct on the "f" it
> is checkpointing, but I think that is naturally modeling the domain.

Thanks, I really like these suggestions. I adjusted the series
accordingly to do this cleanup in two patches (one for
hashfile_checkpoint(), another for hashfile_truncate()) after the patch
introducing hashfile_checkpoint_init().

> A semi-related thing I wondered about: do we need a destructor/release
> function of some kind? Long ago when this checkpoint code was added, a
> memcpy() of the sha_ctx struct was sufficient. But these days we use
> clone_fn(), which may call openssl_SHA1_Clone(), which does
> EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
> doesn't allocate any resources that might need a call to _Final() to
> release (or I guess the more efficient way is directly EVP_MD_CTX_free()
> under the hood).
>
> My reading of the openssl manpages suggests that we should be doing
> that, or we may see leaks. But it may also be the case that it doesn't
> happen to trigger for their implementation.
>
> At any rate, we do not seem to have such a cleanup function. So it is
> certainly an orthogonal issue to your series. I wondered about it here
> because if we did have one, it would be necessary to clean up checkpoint
> before the hashfile due to the lifetime dependency I mentioned above.

I like the idea of a cleanup function, but let's do so in a separate
series.

Thanks,
Taylor
On Fri, Jan 10, 2025 at 04:50:25PM -0500, Taylor Blau wrote:

> On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> > So in the new constructor:
> >
> > > +void hashfile_checkpoint_init(struct hashfile *f,
> > > +			      struct hashfile_checkpoint *checkpoint)
> > > +{
> > > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > > +	f->algop->init_fn(&checkpoint->ctx);
> > > +}
> >
> > ...should we actually record "f" itself? And then in the existing
> > functions:
> >
> > > void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
> >
> > ...they'd no longer need to take the extra parameter.
> >
> > It creates a lifetime dependency of the checkpoint struct on the "f" it
> > is checkpointing, but I think that is naturally modeling the domain.
>
> Thanks, I really like these suggestions. I adjusted the series
> accordingly to do this cleanup in two patches (one for
> hashfile_checkpoint(), another for hashfile_truncate()) after the patch
> introducing hashfile_checkpoint_init().

Hmm. I'm not sure that I like this as much as I thought I did. I agree
with you that ultimately the hashfile_checkpoint is (or should be) tied
to the lifetime of the hashfile that it is checkpointing underneath.
But in practice things are a little funky.
Let's suppose I did something like the following:

--- 8< ---
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef7..47b8317a1f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,15 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }

+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+
+	checkpoint->f = f;
+	checkpoint->f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a..8016509c71 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -34,8 +34,10 @@ struct hashfile {
 struct hashfile_checkpoint {
 	off_t offset;
 	git_hash_ctx ctx;
+	struct hashfile *f;
 };

+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
--- >8 ---

, where I'm eliding the trivial changes necessary to make this work at
the two callers. Let's look a little closer at the bulk-checkin caller.
If I do this on top:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..892176d23d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);

+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;

 	while (1) {
--- >8 ---

then test 14 in t1050-large.sh fails because of a segfault in 'git
add'. If we compile with SANITIZE=address, we can see that there's a
use-after-free in hashflush(), which is called by hashfile_checkpoint().
That is a result of the max pack-size code.
So we could try something like:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 7b8a6eb2df..9dc114d132 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint;
+	struct hashfile_checkpoint checkpoint = { 0 };
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -274,17 +274,14 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	the_hash_algo->update_fn(&ctx, obuf, header_len);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0) {
+	if ((flags & HASH_WRITE_OBJECT) != 0)
 		CALLOC_ARRAY(idx, 1);
-
-		prepare_to_stream(state, flags);
-		hashfile_checkpoint_init(state->f, &checkpoint);
-	}
-
 	already_hashed_to = 0;

 	while (1) {
 		prepare_to_stream(state, flags);
+		if (checkpoint.f != state->f)
+			hashfile_checkpoint_init(state->f, &checkpoint);
 		if (idx) {
 			hashfile_checkpoint(&checkpoint);
 			idx->offset = state->offset;
--- >8 ---

which would do the trick, but it feels awfully hacky to have the "if
(checkpoint.f != state->f)" check in there, since that feels too
intimately tied to the implementation of the hashfile_checkpoint API
for my comfort.

It would be nice if the checkpoint could be declared only within the
loop body itself, but we can't do that because we need to call
hashfile_truncate() outside of the loop.

Anyway, that's all to say that I think that while this is probably
doable in theory, in practice it's kind of a mess, at least currently.
I would rather see if there are other ways to clean up the
deflate_blob_to_pack() function first in a way that made this change
less awkward.

I think the most reasonable course here would be to pursue a minimal
change like the one presented here and then think about further clean
up as a separate step.

Thanks,
Taylor
On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:

> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 433070a3bd..892176d23d 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  	git_hash_ctx ctx;
>  	unsigned char obuf[16384];
>  	unsigned header_len;
> -	struct hashfile_checkpoint checkpoint = {0};
> +	struct hashfile_checkpoint checkpoint;
>  	struct pack_idx_entry *idx = NULL;
>
>  	seekback = lseek(fd, 0, SEEK_CUR);
> @@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  					  OBJ_BLOB, size);
>  	the_hash_algo->init_fn(&ctx);
>  	the_hash_algo->update_fn(&ctx, obuf, header_len);
> -	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
>
>  	/* Note: idx is non-NULL when we are writing */
> -	if ((flags & HASH_WRITE_OBJECT) != 0)
> +	if ((flags & HASH_WRITE_OBJECT) != 0) {
>  		CALLOC_ARRAY(idx, 1);
>
> +		prepare_to_stream(state, flags);
> +		hashfile_checkpoint_init(state->f, &checkpoint);
> +	}
> +
>  	already_hashed_to = 0;
>
>  	while (1) {

Yeah, that's ugly. We are potentially throwing away the hashfile that
the checkpoint was created for.
That makes my instinct to push the checkpoint down into the loop where
we might restart a new pack, like this (and like you suggested below):

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	already_hashed_to = 0;

 	while (1) {
+		struct hashfile_checkpoint checkpoint = {0};
 		prepare_to_stream(state, flags);
 		if (idx) {
+			hashfile_checkpoint_init(state->f, &checkpoint);
 			hashfile_checkpoint(state->f, &checkpoint);
 			idx->offset = state->offset;
 			crc32_begin(state->f);

but that doesn't work, because the checkpoint is also used later for
the already_written() check:

	if (already_written(state, result_oid)) {
		hashfile_truncate(state->f, &checkpoint);
		state->offset = checkpoint.offset;
		free(idx);
	} else

That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a
new pack? Then the original checkpoint would now be bogus! It would
mention an offset in the original packfile which doesn't make any sense
with what we have open.

But I think this is OK, because we can only leave the loop when
stream_blob_to_pack() returns, and we always establish a new checkpoint
before then. So I do think that moving the initialization of the
checkpoint into the loop, but _not_ moving the variable, would work the
same way it does now (i.e., what you suggested below).

But I admit that the way this loop works kind of makes my head spin. It
can really only ever run twice, but it is hard to see: we break out if
stream_blob_to_pack() returns success.
And it will only return error if we would bust the packsize limit (all
other errors cause us to die()). And only if we would bust the limit
_and_ we are not the only object in the pack. And since we start a new
pack if we loop, that will never be true on the second iteration; we'll
always either die() or return success.

I do think it would be much easier to read with a single explicit
retry:

	if (checkpoint_and_try_to_stream() < 0) {
		/* we busted the limit; make a new pack and try again */
		hashfile_truncate();
		etc...
		if (checkpoint_and_try_to_stream() < 0)
			BUG("yikes, we should not fail a second time!");
	}

where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.

Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime
a bit more obvious and less scary to refactor. But it does imply to me
that the data dependency introduced by my suggestion is not always so
straight-forward as I thought it would be, and we should probably punt
on it for your series.

> which would do the trick, but it feels awfully hacky to have the "if
> (checkpoint.f != state->f)" check in there, since that feels too
> intimately tied to the implementation of the hashfile_checkpoint API for
> my comfort.

I think you could unconditionally checkpoint at that part; we're about
to do a write, so we want to store the state before the write in case
we need to roll back.

> Anyway, that's all to say that I think that while this is probably
> doable in theory, in practice it's kind of a mess, at least currently.
> I would rather see if there are other ways to clean up the
> deflate_blob_to_pack() function first in a way that made this change
> less awkward.

Yeah, I actually wrote what I wrote above before reading this far down
in your email, but we arrived at the exact same conclusion. ;)
Hopefully what I wrote might give some pointers if somebody wants to
refactor later.
> I think the most reasonable course here would be to pursue a minimal
> change like the one presented here and then think about further clean up
> as a separate step.

Yep. Thanks for looking into it.

-Peff
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0f86392761a..4a6c7ab52ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -1106,7 +1106,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 	    || (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();

-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
+	hashfile_checkpoint_init(pack_file, &checkpoint);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bda..892176d23d2 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);

+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;

 	while (1) {
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef71..232121f415f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,13 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }

+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+	f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a2..b7475f16c20 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -36,6 +36,7 @@ struct hashfile_checkpoint {
 	git_hash_ctx ctx;
 };

+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of
    calling 'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use
    when calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    However, we only need to call 'prepare_to_stream()' when we have
    the HASH_WRITE_OBJECT bit set in our flags. Without that bit,
    calling 'prepare_to_stream()' does not assign 'state->f', so we
    have nothing to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.
Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/fast-import.c | 2 +-
 bulk-checkin.c        | 9 ++++++---
 csum-file.c           | 7 +++++++
 csum-file.h           | 1 +
 4 files changed, 15 insertions(+), 4 deletions(-)