Message ID | 94c07fd8a557c569fdc83015d5f3902094f21994.1736363652.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New |
Series | hash: introduce unsafe_hash_algo(), drop unsafe_ variants |
On Wed, Jan 08, 2025 at 02:14:51PM -0500, Taylor Blau wrote:

> Introduce and use a new function which ensures that both parts of a
> hashfile and hashfile_checkpoint pair use the same hash function
> implementation to avoid such crashes.

That makes sense. This should have been encapsulated all along, just
like the actual hash initialization happens inside hashfile_init().

A hashfile_checkpoint is sort of inherently tied to a hashfile, right?
I mean, it is recording an offset that only makes sense in the context
of the parent hashfile.

And that is only more true after the unsafe-hash patches, because now it
needs to use the "algop" pointer from the parent hashfile (though for
now we expect all hashfiles to use the same unsafe-algop, in theory we
could use different checksums for each file).

So in the new constructor:

> +void hashfile_checkpoint_init(struct hashfile *f,
> +			      struct hashfile_checkpoint *checkpoint)
> +{
> +	memset(checkpoint, 0, sizeof(*checkpoint));
> +	f->algop->init_fn(&checkpoint->ctx);
> +}

...should we actually record "f" itself? And then in the existing
functions:

> void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)

...they'd no longer need to take the extra parameter.

It creates a lifetime dependency of the checkpoint struct on the "f" it
is checkpointing, but I think that is naturally modeling the domain.

A semi-related thing I wondered about: do we need a destructor/release
function of some kind? Long ago when this checkpoint code was added, a
memcpy() of the sha_ctx struct was sufficient. But these days we use
clone_fn(), which may call openssl_SHA1_Clone(), which does
EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
doesn't allocate any resources that might need a call to _Final() to
release (or I guess the more efficient way is directly EVP_MD_CTX_free()
under the hood).

My reading of the openssl manpages suggests that we should be doing
that, or we may see leaks. But it may also be the case that it doesn't
happen to trigger for their implementation.

At any rate, we do not seem to have such a cleanup function. So it is
certainly an orthogonal issue to your series. I wondered about it here
because if we did have one, it would be necessary to clean up checkpoint
before the hashfile due to the lifetime dependency I mentioned above.

-Peff
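
To make the "record f itself" suggestion above concrete, here is a minimal sketch of a checkpoint that keeps a pointer to its parent file, so the save and restore helpers take a single argument. Everything in it is a toy stand-in invented for illustration (the `toy_` types and functions, the trivial rolling hash); it is not git's csum-file API and not the code that ended up in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-ins for illustration only; these are NOT git's definitions. */
typedef struct { unsigned long state; } toy_hash_ctx;

struct toy_hash_algo {
	void (*init_fn)(toy_hash_ctx *ctx);
	void (*clone_fn)(toy_hash_ctx *dst, const toy_hash_ctx *src);
};

struct toy_hashfile {
	const struct toy_hash_algo *algop;
	toy_hash_ctx ctx;
	size_t total;	/* bytes hashed so far */
};

struct toy_checkpoint {
	struct toy_hashfile *f;	/* borrowed; must outlive the checkpoint */
	size_t offset;
	toy_hash_ctx ctx;
};

static void toy_init(toy_hash_ctx *ctx) { ctx->state = 0; }
static void toy_clone(toy_hash_ctx *dst, const toy_hash_ctx *src) { *dst = *src; }
static const struct toy_hash_algo toy_algo = { toy_init, toy_clone };

void toy_hashfile_init(struct toy_hashfile *f)
{
	memset(f, 0, sizeof(*f));
	f->algop = &toy_algo;
	f->algop->init_fn(&f->ctx);
}

/* Feed bytes into a trivial rolling hash (a placeholder for a real digest). */
void toy_hashfile_write(struct toy_hashfile *f, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i < len; i++)
		f->ctx.state = f->ctx.state * 31 + p[i];
	f->total += len;
}

/* The checkpoint remembers its parent and borrows the parent's algorithm. */
void toy_checkpoint_init(struct toy_hashfile *f, struct toy_checkpoint *cp)
{
	memset(cp, 0, sizeof(*cp));
	cp->f = f;
	f->algop->init_fn(&cp->ctx);
}

/* No "f" parameter needed: it was recorded at init time. */
void toy_checkpoint_save(struct toy_checkpoint *cp)
{
	assert(cp->f);
	cp->offset = cp->f->total;
	cp->f->algop->clone_fn(&cp->ctx, &cp->f->ctx);
}

void toy_checkpoint_restore(struct toy_checkpoint *cp)
{
	assert(cp->f);
	cp->f->total = cp->offset;
	cp->f->algop->clone_fn(&cp->f->ctx, &cp->ctx);
}
```

The cost of this shape is the lifetime dependency mentioned above: the checkpoint holds a borrowed pointer, so it is only valid while the file it points at is.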
On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:

> So in the new constructor:
>
> > +void hashfile_checkpoint_init(struct hashfile *f,
> > +			      struct hashfile_checkpoint *checkpoint)
> > +{
> > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > +	f->algop->init_fn(&checkpoint->ctx);
> > +}
>
> ...should we actually record "f" itself? And then in the existing
> functions:
>
> > void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
>
> ...they'd no longer need to take the extra parameter.
>
> It creates a lifetime dependency of the checkpoint struct on the "f" it
> is checkpointing, but I think that is naturally modeling the domain.

Thanks, I really like these suggestions. I adjusted the series
accordingly to do this cleanup in two patches (one for
hashfile_checkpoint(), another for hashfile_truncate()) after the patch
introducing hashfile_checkpoint_init().

> A semi-related thing I wondered about: do we need a destructor/release
> function of some kind? Long ago when this checkpoint code was added, a
> memcpy() of the sha_ctx struct was sufficient. But these days we use
> clone_fn(), which may call openssl_SHA1_Clone(), which does
> EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
> doesn't allocate any resources that might need a call to _Final() to
> release (or I guess the more efficient way is directly EVP_MD_CTX_free()
> under the hood).
>
> My reading of the openssl manpages suggests that we should be doing
> that, or we may see leaks. But it may also be the case that it doesn't
> happen to trigger for their implementation.
>
> At any rate, we do not seem to have such a cleanup function. So it is
> certainly an orthogonal issue to your series. I wondered about it here
> because if we did have one, it would be necessary to clean up checkpoint
> before the hashfile due to the lifetime dependency I mentioned above.

I like the idea of a cleanup function, but let's do so in a separate
series.

Thanks,
Taylor
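
For readers wondering what the cleanup function discussed above might involve, the following is a rough sketch under stated assumptions. It models a hash context that owns heap memory (loosely standing in for an EVP_MD_CTX-backed context) with plain `calloc()`/`free()`. The `demo_` names, the release functions, and the allocation counter are all hypothetical; as noted in the thread, no such function exists in git at this point.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_STATE_LEN 32

/* Hypothetical context that owns heap memory; not git's git_hash_ctx. */
typedef struct { unsigned char *state; } demo_ctx;

/* Counts contexts that have been initialised but not yet released. */
static int demo_live;

void demo_ctx_init(demo_ctx *ctx)
{
	ctx->state = calloc(1, DEMO_STATE_LEN);
	if (!ctx->state)
		abort();
	demo_live++;
}

/* Copy src's state into dst; both must already be initialised. */
void demo_ctx_clone(demo_ctx *dst, const demo_ctx *src)
{
	assert(dst->state && src->state);
	memcpy(dst->state, src->state, DEMO_STATE_LEN);
}

/* Safe to call more than once on the same context. */
void demo_ctx_release(demo_ctx *ctx)
{
	if (!ctx->state)
		return;
	free(ctx->state);
	ctx->state = NULL;
	demo_live--;
}

struct demo_hashfile { demo_ctx ctx; };

struct demo_checkpoint {
	struct demo_hashfile *f;	/* borrowed from the caller */
	demo_ctx ctx;
};

void demo_hashfile_init(struct demo_hashfile *f)
{
	demo_ctx_init(&f->ctx);
}

void demo_hashfile_release(struct demo_hashfile *f)
{
	demo_ctx_release(&f->ctx);
}

void demo_checkpoint_init(struct demo_hashfile *f, struct demo_checkpoint *cp)
{
	memset(cp, 0, sizeof(*cp));
	cp->f = f;
	demo_ctx_init(&cp->ctx);
}

/* Release the checkpoint's own context and drop the borrowed pointer. */
void demo_checkpoint_release(struct demo_checkpoint *cp)
{
	demo_ctx_release(&cp->ctx);
	cp->f = NULL;
}

int demo_live_contexts(void)
{
	return demo_live;
}
```

In this model a caller would release the checkpoint first and the file second, matching the ordering constraint raised in the thread, since the checkpoint's borrowed pointer must not outlive the file.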
diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0f86392761a..4a6c7ab52ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -1106,7 +1106,7 @@ static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 	    || (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();
 
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
+	hashfile_checkpoint_init(pack_file, &checkpoint);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bda..892176d23d2 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
 
 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);
 
+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;
 
 	while (1) {
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef71..232121f415f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,13 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }
 
+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+	f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a2..b7475f16c20 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -36,6 +36,7 @@ struct hashfile_checkpoint {
 	git_hash_ctx ctx;
 };
 
+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
 
In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are noops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/fast-import.c | 2 +-
 bulk-checkin.c        | 9 ++++++---
 csum-file.c           | 7 +++++++
 csum-file.h           | 1 +
 4 files changed, 15 insertions(+), 4 deletions(-)
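
The coupling the commit message describes can be illustrated with a small model. The sketch below is an assumption-laden toy, not git code: it tags each context with the backend that initialised it, and its clone function returns an error on a mismatch where, per the commit message, the real code crashed. All `mock_` names and the two fake backends are invented for the example.

```c
#include <assert.h>
#include <string.h>

/* Which (fake) backend initialised a context; 0 means "never initialised". */
enum mock_backend { MOCK_UNINIT = 0, MOCK_SAFE, MOCK_UNSAFE };

typedef struct {
	enum mock_backend backend;
	unsigned long state;
} mock_ctx;

struct mock_algo {
	void (*init_fn)(mock_ctx *ctx);
	int (*clone_fn)(mock_ctx *dst, const mock_ctx *src);
};

static void mock_safe_init(mock_ctx *c) { c->backend = MOCK_SAFE; c->state = 0; }
static void mock_unsafe_init(mock_ctx *c) { c->backend = MOCK_UNSAFE; c->state = 0; }

/* The toy reports a backend mismatch as -1 instead of crashing. */
static int mock_clone(mock_ctx *dst, const mock_ctx *src)
{
	if (dst->backend != src->backend)
		return -1;
	dst->state = src->state;
	return 0;
}

const struct mock_algo mock_safe = { mock_safe_init, mock_clone };
const struct mock_algo mock_unsafe = { mock_unsafe_init, mock_clone };

struct mock_hashfile {
	const struct mock_algo *algop;
	mock_ctx ctx;
};

struct mock_checkpoint { mock_ctx ctx; };

void mock_hashfile_init(struct mock_hashfile *f, const struct mock_algo *algop)
{
	f->algop = algop;
	f->algop->init_fn(&f->ctx);
}

/* Mirrors the shape of the new helper: zero, then init via the file's algop. */
void mock_checkpoint_init(struct mock_hashfile *f, struct mock_checkpoint *cp)
{
	assert(f->algop);
	memset(cp, 0, sizeof(*cp));
	f->algop->init_fn(&cp->ctx);
}

/* Returns 0 on success, -1 if the checkpoint was set up by another backend. */
int mock_checkpoint(struct mock_hashfile *f, struct mock_checkpoint *cp)
{
	return f->algop->clone_fn(&cp->ctx, &f->ctx);
}
```

A caller that initialises the checkpoint by naming a backend directly only works while that backend happens to match the one the file uses. Going through `mock_checkpoint_init()` removes that assumption, because the choice is read from the file itself.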