[v2,7/8] csum-file: introduce hashfile_checkpoint_init()

Message ID 94c07fd8a557c569fdc83015d5f3902094f21994.1736363652.git.me@ttaylorr.com (mailing list archive)
State New
Series hash: introduce unsafe_hash_algo(), drop unsafe_ variants

Commit Message

Taylor Blau Jan. 8, 2025, 7:14 p.m. UTC
In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it is used to checkpoint.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.
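
As a rough illustration of the pairing described above, here is a minimal
sketch using toy stand-ins rather than Git's real types. All toy_* names are
hypothetical; only the shape of the init function (zero the checkpoint, then
initialize its context through the hashfile's own algorithm pointer) follows
the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for Git's hash context and algorithm table. */
struct toy_ctx {
	int backend;        /* which implementation initialized this context */
	unsigned long sum;  /* toy running state */
};

struct toy_algo {
	int backend;
	void (*init_fn)(struct toy_ctx *ctx);
};

void toy_safe_init(struct toy_ctx *ctx)   { ctx->backend = 1; ctx->sum = 0; }
void toy_unsafe_init(struct toy_ctx *ctx) { ctx->backend = 2; ctx->sum = 0; }

const struct toy_algo toy_safe   = { 1, toy_safe_init };
const struct toy_algo toy_unsafe = { 2, toy_unsafe_init };

struct toy_hashfile {
	const struct toy_algo *algop;  /* chosen once, when the file is set up */
	struct toy_ctx ctx;
};

struct toy_checkpoint {
	unsigned long offset;
	struct toy_ctx ctx;
};

void toy_hashfile_init(struct toy_hashfile *f, const struct toy_algo *algop)
{
	f->algop = algop;
	f->algop->init_fn(&f->ctx);
}

/* Zero the checkpoint, then initialize its context via the file's algop. */
void toy_checkpoint_init(struct toy_hashfile *f, struct toy_checkpoint *cp)
{
	assert(f && f->algop);
	memset(cp, 0, sizeof(*cp));
	f->algop->init_fn(&cp->ctx);
}

/* Returns 1 when the checkpoint and the hashfile agree on the backend. */
int toy_backends_match(const struct toy_hashfile *f,
		       const struct toy_checkpoint *cp)
{
	return f->ctx.backend == cp->ctx.backend;
}
```

Because the caller never names an algorithm, switching the toy hashfile from
toy_unsafe to toy_safe changes the checkpoint's backend with it.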

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are no-ops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.
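
The guard and no-op behavior described in the list above can be sketched
roughly as follows. This is a simplified model under assumptions, not the
bulk-checkin code: the mock_* names, MOCK_WRITE_OBJECT, and the open_count
field are all hypothetical and exist only to make the behavior observable.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_WRITE_OBJECT 1u  /* hypothetical stand-in for HASH_WRITE_OBJECT */

struct mock_file { int opened; };

struct mock_state {
	struct mock_file *f;       /* NULL until a pack has been opened */
	struct mock_file storage;  /* backing storage for the sketch */
	int open_count;            /* how many times a pack was really opened */
};

/* Opens the pack at most once, and only when asked to write objects. */
void mock_prepare_to_stream(struct mock_state *state, unsigned flags)
{
	if (!(flags & MOCK_WRITE_OBJECT) || state->f)
		return;
	state->storage.opened = 1;
	state->f = &state->storage;
	state->open_count++;
}

/*
 * Returns 1 when state->f is available to initialize a checkpoint from,
 * 0 when we are not writing and there is nothing to initialize.
 */
int mock_begin_blob(struct mock_state *state, unsigned flags)
{
	if (!(flags & MOCK_WRITE_OBJECT))
		return 0;
	mock_prepare_to_stream(state, flags);
	assert(state->f);
	return 1;
}
```

In this model, calling mock_prepare_to_stream() again inside a loop after
mock_begin_blob() leaves open_count unchanged, which is the property the
extra call outside the main loop relies on.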

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/fast-import.c | 2 +-
 bulk-checkin.c        | 9 ++++++---
 csum-file.c           | 7 +++++++
 csum-file.h           | 1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

Comments

Jeff King Jan. 10, 2025, 10:37 a.m. UTC | #1
On Wed, Jan 08, 2025 at 02:14:51PM -0500, Taylor Blau wrote:

> Introduce and use a new function which ensures that both parts of a
> hashfile and hashfile_checkpoint pair use the same hash function
> implementation to avoid such crashes.

That makes sense. This should have been encapsulated all along, just
like the actual hash initialization happens inside hashfile_init().

A hashfile_checkpoint is sort of inherently tied to a hashfile, right? I
mean, it is recording an offset that only makes sense in the context of
the parent hashfile.

And that is only more true after the unsafe-hash patches, because now it
needs to use the "algop" pointer from the parent hashfile (for now we
expect all hashfiles to use the same unsafe-algop, though in theory we
could use different checksums for each file).

So in the new constructor:

> +void hashfile_checkpoint_init(struct hashfile *f,
> +			      struct hashfile_checkpoint *checkpoint)
> +{
> +	memset(checkpoint, 0, sizeof(*checkpoint));
> +	f->algop->init_fn(&checkpoint->ctx);
> +}

...should we actually record "f" itself? And then in the existing
functions:

>  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)

...they'd no longer need to take the extra parameter.

It creates a lifetime dependency of the checkpoint struct on the "f" it
is checkpointing, but I think that is naturally modeling the domain.
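
One possible shape for that suggestion, sketched with toy types: the
checkpoint borrows a pointer to its hashfile at init time, so the save and
restore functions take only the checkpoint. The demo_* names and the toy
checksum are hypothetical and are not the signatures from the series.

```c
#include <assert.h>
#include <string.h>

struct demo_hashfile {
	unsigned long total;  /* bytes written so far */
	unsigned long state;  /* toy running checksum */
};

struct demo_checkpoint {
	struct demo_hashfile *f;  /* borrowed; must outlive the checkpoint */
	unsigned long offset;
	unsigned long state;
};

void demo_write(struct demo_hashfile *f, const unsigned char *buf,
		unsigned long len)
{
	unsigned long i;

	for (i = 0; i < len; i++)
		f->state = f->state * 31 + buf[i];
	f->total += len;
}

/* Record the parent once; later calls no longer need it as a parameter. */
void demo_checkpoint_init(struct demo_hashfile *f, struct demo_checkpoint *cp)
{
	assert(f);
	memset(cp, 0, sizeof(*cp));
	cp->f = f;
}

void demo_checkpoint_save(struct demo_checkpoint *cp)
{
	cp->offset = cp->f->total;
	cp->state = cp->f->state;
}

/* Roll the parent back to the saved point; -1 if it is behind that point. */
int demo_checkpoint_restore(struct demo_checkpoint *cp)
{
	if (cp->offset > cp->f->total)
		return -1;
	cp->f->total = cp->offset;
	cp->f->state = cp->state;
	return 0;
}
```

The cost is the lifetime dependency: a demo_checkpoint must not be used
after its demo_hashfile has gone away.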

A semi-related thing I wondered about: do we need a destructor/release
function of some kind? Long ago when this checkpoint code was added, a
memcpy() of the sha_ctx struct was sufficient. But these days we use
clone_fn(), which may call openssl_SHA1_Clone(), which does
EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
doesn't allocate any resources that might need a call to _Final() to
release (or I guess the more efficient way is directly EVP_MD_CTX_free()
under the hood)?

My reading of the openssl manpages suggests that we should be doing
that, or we may see leaks. But it may also be the case that it doesn't
happen to trigger for their implementation.

At any rate, we do not seem to have such a cleanup function. So it is
certainly an orthogonal issue to your series. I wondered about it here
because if we did have one, it would be necessary to clean up checkpoint
before the hashfile due to the lifetime dependency I mentioned above.
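
To make the concern concrete, here is a toy model of a context that owns
heap memory, so that cloning without a matching release would leak. It says
nothing about what OpenSSL actually allocates; the ex_* names and the
ex_live_allocs counter are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context that owns a heap buffer. */
struct ex_ctx {
	unsigned char *buf;
	size_t len;
};

int ex_live_allocs;  /* buffers currently allocated and not yet released */

int ex_init(struct ex_ctx *ctx, size_t len)
{
	ctx->buf = calloc(len, 1);
	if (!ctx->buf)
		return -1;
	ctx->len = len;
	ex_live_allocs++;
	return 0;
}

/* Safe to call on a zeroed or already released context. */
void ex_release(struct ex_ctx *ctx)
{
	if (ctx->buf) {
		free(ctx->buf);
		ex_live_allocs--;
	}
	ctx->buf = NULL;
	ctx->len = 0;
}

/* Deep copy: dst ends up owning its own buffer, which needs ex_release(). */
int ex_clone(struct ex_ctx *dst, const struct ex_ctx *src)
{
	unsigned char *copy = malloc(src->len);

	if (!copy)
		return -1;
	memcpy(copy, src->buf, src->len);
	ex_release(dst);  /* drop whatever dst held before */
	dst->buf = copy;
	dst->len = src->len;
	ex_live_allocs++;
	return 0;
}
```

With a memcpy()-style checkpoint there is nothing to release, but with a
deep-copying clone like this one, each checkpoint context has to be released
to bring ex_live_allocs back to zero.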

-Peff

Taylor Blau Jan. 10, 2025, 9:50 p.m. UTC | #2
On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> So in the new constructor:
>
> > +void hashfile_checkpoint_init(struct hashfile *f,
> > +			      struct hashfile_checkpoint *checkpoint)
> > +{
> > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > +	f->algop->init_fn(&checkpoint->ctx);
> > +}
>
> ...should we actually record "f" itself? And then in the existing
> functions:
>
> >  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
>
> ...they'd no longer need to take the extra parameter.
>
> It creates a lifetime dependency of the checkpoint struct on the "f" it
> is checkpointing, but I think that is naturally modeling the domain.

Thanks, I really like these suggestions. I adjusted the series
accordingly to do this cleanup in two patches (one for
hashfile_checkpoint(), another for hashfile_truncate()) after the patch
introducing hashfile_checkpoint_init().

> A semi-related thing I wondered about: do we need a destructor/release
> function of some kind? Long ago when this checkpoint code was added, a
> memcpy() of the sha_ctx struct was sufficient. But these days we use
> clone_fn(), which may call openssl_SHA1_Clone(), which does
> EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
> doesn't allocate any resources that might need a call to _Final() to
> release (or I guess the more efficient way is directly EVP_MD_CTX_free()
> under the hood)?
>
> My reading of the openssl manpages suggests that we should be doing
> that, or we may see leaks. But it may also be the case that it doesn't
> happen to trigger for their implementation.
>
> At any rate, we do not seem to have such a cleanup function. So it is
> certainly an orthogonal issue to your series. I wondered about it here
> because if we did have one, it would be necessary to clean up checkpoint
> before the hashfile due to the lifetime dependency I mentioned above.

I like the idea of a cleanup function, but let's do so in a separate
series.

Thanks,
Taylor

Patch

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0f86392761a..4a6c7ab52ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -1106,7 +1106,7 @@  static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		|| (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();
 
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
+	hashfile_checkpoint_init(pack_file, &checkpoint);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bda..892176d23d2 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@  static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@  static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
 
 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);
 
+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;
 
 	while (1) {
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef71..232121f415f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,13 @@  struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }
 
+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+	f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a2..b7475f16c20 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -36,6 +36,7 @@  struct hashfile_checkpoint {
 	git_hash_ctx ctx;
 };
 
+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);