[v2,7/8] csum-file: introduce hashfile_checkpoint_init()

Message ID 94c07fd8a557c569fdc83015d5f3902094f21994.1736363652.git.me@ttaylorr.com (mailing list archive)
State Superseded
Series hash: introduce unsafe_hash_algo(), drop unsafe_ variants

Commit Message

Taylor Blau Jan. 8, 2025, 7:14 p.m. UTC
In 106140a99f (builtin/fast-import: fix segfault with unsafe SHA1
backend, 2024-12-30) and 9218c0bfe1 (bulk-checkin: fix segfault with
unsafe SHA1 backend, 2024-12-30), we observed the effects of failing to
initialize a hashfile_checkpoint with the same hash function
implementation as is used by the hashfile it checkpoints.

While both 106140a99f and 9218c0bfe1 work around the immediate crash,
changing the hash function implementation within the hashfile API to,
for example, the non-unsafe variant would re-introduce the crash. This
is a result of the tight coupling between initializing hashfiles and
hashfile_checkpoints.

Introduce and use a new function which ensures that both parts of a
hashfile and hashfile_checkpoint pair use the same hash function
implementation to avoid such crashes.
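
For illustration, the resulting caller-side pattern (see the
fast-import and bulk-checkin hunks below; 'f' here stands in for
whatever hashfile is being written) is roughly:

    struct hashfile_checkpoint checkpoint;

    hashfile_checkpoint_init(f, &checkpoint);  /* match f's hash implementation */
    hashfile_checkpoint(f, &checkpoint);       /* record offset and hash state */
    ...
    hashfile_truncate(f, &checkpoint);         /* roll back to the checkpoint */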

A few things worth noting:

  - In the change to builtin/fast-import.c::stream_blob(), we can see
    that by removing the explicit reference to
    'the_hash_algo->unsafe_init_fn()', we are hardened against the
    hashfile API changing away from the_hash_algo (or its unsafe
    variant) in the future.

  - The bulk-checkin code no longer needs to explicitly zero-initialize
    the hashfile_checkpoint, since it is now done as a result of calling
    'hashfile_checkpoint_init()'.

  - Also in the bulk-checkin code, we add an additional call to
    prepare_to_stream() outside of the main loop in order to initialize
    'state->f' so we know which hash function implementation to use when
    calling 'hashfile_checkpoint_init()'.

    This is OK, since subsequent 'prepare_to_stream()' calls are no-ops.
    However, we only need to call 'prepare_to_stream()' when we have the
    HASH_WRITE_OBJECT bit set in our flags. Without that bit, calling
    'prepare_to_stream()' does not assign 'state->f', so we have nothing
    to initialize.

  - Other uses of the 'checkpoint' in 'deflate_blob_to_pack()' are
    appropriately guarded.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 builtin/fast-import.c | 2 +-
 bulk-checkin.c        | 9 ++++++---
 csum-file.c           | 7 +++++++
 csum-file.h           | 1 +
 4 files changed, 15 insertions(+), 4 deletions(-)

Comments

Jeff King Jan. 10, 2025, 10:37 a.m. UTC | #1
On Wed, Jan 08, 2025 at 02:14:51PM -0500, Taylor Blau wrote:

> Introduce and use a new function which ensures that both parts of a
> hashfile and hashfile_checkpoint pair use the same hash function
> implementation to avoid such crashes.

That makes sense. This should have been encapsulated all along, just
like the actual hash initialization happens inside hashfile_init().

A hashfile_checkpoint is sort of inherently tied to a hashfile, right? I
mean, it is recording an offset that only makes sense in the context of
the parent hashfile.

And that is only more true after the unsafe-hash patches, because now it
needs to use the "algop" pointer from the parent hashfile (though for
now we expect all hashfiles to use the same unsafe-algop, in theory we
could use different checksums for each file).

So in the new constructor:

> +void hashfile_checkpoint_init(struct hashfile *f,
> +			      struct hashfile_checkpoint *checkpoint)
> +{
> +	memset(checkpoint, 0, sizeof(*checkpoint));
> +	f->algop->init_fn(&checkpoint->ctx);
> +}

...should we actually record "f" itself? And then in the existing
functions:

>  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)

...they'd no longer need to take the extra parameter.

It creates a lifetime dependency of the checkpoint struct on the "f" it
is checkpointing, but I think that is naturally modeling the domain.

A semi-related thing I wondered about: do we need a destructor/release
function of some kind? Long ago when this checkpoint code was added, a
memcpy() of the sha_ctx struct was sufficient. But these days we use
clone_fn(), which may call openssl_SHA1_Clone(), which does
EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
doesn't allocate any resources that would need a call to _Final() to
release (or, I guess more efficiently, a direct EVP_MD_CTX_free() under
the hood)?

My reading of the openssl manpages suggests that we should be doing
that, or we may see leaks. But it may also be the case that it doesn't
happen to trigger for their implementation.

At any rate, we do not seem to have such a cleanup function. So it is
certainly an orthogonal issue to your series. I wondered about it here
because if we did have one, it would be necessary to clean up checkpoint
before the hashfile due to the lifetime dependency I mentioned above.
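
For concreteness, a minimal sketch of the kind of release helper I mean
(purely hypothetical, and assuming that finalizing the cloned context is
enough to free whatever clone_fn may have allocated, which is what my
reading of the openssl docs suggests):

  static void hashfile_checkpoint_release(struct hashfile *f,
                                          struct hashfile_checkpoint *checkpoint)
  {
          unsigned char scratch[GIT_MAX_RAWSZ];

          /* finalize purely to release any resources held by the ctx */
          f->algop->final_fn(scratch, &checkpoint->ctx);
  }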

-Peff
Taylor Blau Jan. 10, 2025, 9:50 p.m. UTC | #2
On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> So in the new constructor:
>
> > +void hashfile_checkpoint_init(struct hashfile *f,
> > +			      struct hashfile_checkpoint *checkpoint)
> > +{
> > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > +	f->algop->init_fn(&checkpoint->ctx);
> > +}
>
> ...should we actually record "f" itself? And then in the existing
> functions:
>
> >  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
>
> ...they'd no longer need to take the extra parameter.
>
> It creates a lifetime dependency of the checkpoint struct on the "f" it
> is checkpointing, but I think that is naturally modeling the domain.

Thanks, I really like these suggestions. I adjusted the series
accordingly to do this cleanup in two patches (one for
hashfile_checkpoint(), another for hashfile_truncate()) after the patch
introducing hashfile_checkpoint_init().

> A semi-related thing I wondered about: do we need a destructor/release
> function of some kind? Long ago when this checkpoint code was added, a
> memcpy() of the sha_ctx struct was sufficient. But these days we use
> clone_fn(), which may call openssl_SHA1_Clone(), which does
> EVP_MD_CTX_copy_ex() under the hood. Do we have any promise that this
> doesn't allocate any resources that would need a call to _Final() to
> release (or, I guess more efficiently, a direct EVP_MD_CTX_free() under
> the hood)?
>
> My reading of the openssl manpages suggests that we should be doing
> that, or we may see leaks. But it may also be the case that it doesn't
> happen to trigger for their implementation.
>
> At any rate, we do not seem to have such a cleanup function. So it is
> certainly an orthogonal issue to your series. I wondered about it here
> because if we did have one, it would be necessary to clean up checkpoint
> before the hashfile due to the lifetime dependency I mentioned above.

I like the idea of a cleanup function, but let's do so in a separate
series.

Thanks,
Taylor
Taylor Blau Jan. 17, 2025, 9:30 p.m. UTC | #3
On Fri, Jan 10, 2025 at 04:50:25PM -0500, Taylor Blau wrote:
> On Fri, Jan 10, 2025 at 05:37:56AM -0500, Jeff King wrote:
> > So in the new constructor:
> >
> > > +void hashfile_checkpoint_init(struct hashfile *f,
> > > +			      struct hashfile_checkpoint *checkpoint)
> > > +{
> > > +	memset(checkpoint, 0, sizeof(*checkpoint));
> > > +	f->algop->init_fn(&checkpoint->ctx);
> > > +}
> >
> > ...should we actually record "f" itself? And then in the existing
> > functions:
> >
> > >  void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
> >
> > ...they'd no longer need to take the extra parameter.
> >
> > It creates a lifetime dependency of the checkpoint struct on the "f" it
> > is checkpointing, but I think that is naturally modeling the domain.
>
> Thanks, I really like these suggestions. I adjusted the series
> accordingly to do this cleanup in two patches (one for
> hashfile_checkpoint(), another for hashfile_truncate()) after the patch
> introducing hashfile_checkpoint_init().

Hmm. I'm not sure that I like this as much as I thought I did.

I agree with you that ultimately the hashfile_checkpoint is (or should
be) tied to the lifetime of the hashfile that it is checkpointing
underneath. But in practice things are a little funky.

Let's suppose I did something like the following:

--- 8< ---
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef7..47b8317a1f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,15 @@ struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }

+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+
+	checkpoint->f = f;
+	checkpoint->f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a..8016509c71 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -34,8 +34,10 @@ struct hashfile {
 struct hashfile_checkpoint {
 	off_t offset;
 	git_hash_ctx ctx;
+	struct hashfile *f;
 };

+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);
--- >8 ---

, where I'm eliding the trivial changes necessary to make this work at
the two callers. Let's look a little closer at the bulk-checkin caller.
If I do this on top:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..892176d23d 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);

+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;

 	while (1) {
--- >8 ---

then test 14 in t1050-large.sh fails because of a segfault in 'git add'.
If we compile with SANITIZE=address, we can see that there's a
use-after-free in hashflush(), which is called by hashfile_checkpoint().

That is a result of the max pack-size code. So we could try something
like:

--- 8< ---
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 7b8a6eb2df..9dc114d132 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint;
+	struct hashfile_checkpoint checkpoint = { 0 };
 	struct pack_idx_entry *idx = NULL;

 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -274,17 +274,14 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	the_hash_algo->update_fn(&ctx, obuf, header_len);

 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0) {
+	if ((flags & HASH_WRITE_OBJECT) != 0)
 		CALLOC_ARRAY(idx, 1);
-
-		prepare_to_stream(state, flags);
-		hashfile_checkpoint_init(state->f, &checkpoint);
-	}
-
 	already_hashed_to = 0;

 	while (1) {
 		prepare_to_stream(state, flags);
+		if (checkpoint.f != state->f)
+			hashfile_checkpoint_init(state->f, &checkpoint);
 		if (idx) {
 			hashfile_checkpoint(&checkpoint);
 			idx->offset = state->offset;
--- >8 ---

which would do the trick, but it feels awfully hacky to have the "if
(checkpoint.f != state->f)" check in there, since that feels too
intimately tied to the implementation of the hashfile_checkpoint API for
my comfort.

It would be nice if we could make the checkpoint only declared within
the loop body itself, but we can't because we need to call
hashfile_truncate() outside of the loop.

Anyway, that's all to say that I think that while this is probably
doable in theory, in practice it's kind of a mess, at least currently.
I would rather see if there are other ways to clean up the
deflate_blob_to_pack() function first in a way that made this change
less awkward.

I think the most reasonable course here would be to pursue a minimal
change like the one presented here and then think about further clean up
as a separate step.

Thanks,
Taylor
Jeff King Jan. 18, 2025, 12:15 p.m. UTC | #4
On Fri, Jan 17, 2025 at 04:30:05PM -0500, Taylor Blau wrote:

> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 433070a3bd..892176d23d 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -261,7 +261,7 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  	git_hash_ctx ctx;
>  	unsigned char obuf[16384];
>  	unsigned header_len;
> -	struct hashfile_checkpoint checkpoint = {0};
> +	struct hashfile_checkpoint checkpoint;
>  	struct pack_idx_entry *idx = NULL;
> 
>  	seekback = lseek(fd, 0, SEEK_CUR);
> @@ -272,12 +272,15 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
>  					  OBJ_BLOB, size);
>  	the_hash_algo->init_fn(&ctx);
>  	the_hash_algo->update_fn(&ctx, obuf, header_len);
> -	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
> 
>  	/* Note: idx is non-NULL when we are writing */
> -	if ((flags & HASH_WRITE_OBJECT) != 0)
> +	if ((flags & HASH_WRITE_OBJECT) != 0) {
>  		CALLOC_ARRAY(idx, 1);
> 
> +		prepare_to_stream(state, flags);
> +		hashfile_checkpoint_init(state->f, &checkpoint);
> +	}
> +
>  	already_hashed_to = 0;
> 
>  	while (1) {

Yeah, that's ugly. We are potentially throwing away the hashfile that
the checkpoint was created for. That makes my instinct to push the
checkpoint down into the loop where we might start a new pack, like
this (and like you suggested below):

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bd..efa59074fb 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,6 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -281,8 +280,10 @@ static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	already_hashed_to = 0;
 
 	while (1) {
+		struct hashfile_checkpoint checkpoint = {0};
 		prepare_to_stream(state, flags);
 		if (idx) {
+			hashfile_checkpoint_init(state->f, &checkpoint);
 			hashfile_checkpoint(state->f, &checkpoint);
 			idx->offset = state->offset;
 			crc32_begin(state->f);

but that doesn't work, because the checkpoint is also used later for the
already_written() check:

        if (already_written(state, result_oid)) {
                hashfile_truncate(state->f, &checkpoint);
                state->offset = checkpoint.offset;
                free(idx);
	} else

That made me wonder if there is a bug lurking there. What if we found
the pack was too big, truncated to our checkpoint, and then opened a new
pack? Then the original checkpoint would now be bogus! It would mention
an offset in the original packfile which doesn't make any sense with
what we have open. But I think this is OK, because we can only leave the
loop when stream_blob_to_pack() returns success, and we always
establish a new checkpoint before then.

So I do think that moving the initialization of the checkpoint into the
loop, but _not_ moving the variable would work the same way it does now
(i.e., what you suggested below).

But I admit that the way this loop works kind of makes my head spin. It
can really only ever run twice, but it is hard to see: we break out if
stream_blob_to_pack() returns success. And it will only return error if
we would bust the packsize limit (all other errors cause us to die()).
And only if we would bust the limit _and_ we are not the only object in
the pack. And since we start a new pack if we loop, that will never be
true on the second iteration; we'll always either die() or return
success.

I do think it would be much easier to read with a single explicit retry:

  if (checkpoint_and_try_to_stream() < 0) {
	/* we busted the limit; make a new pack and try again */
	hashfile_truncate();
	etc...
        if (checkpoint_and_try_to_stream() < 0)
		BUG("yikes, we should not fail a second time!");
  }

where checkpoint_and_try_to_stream() is the first half of the loop, up
to the stream_blob_to_pack() call.

Anyway, that is all outside of your patch, and relevant only because
_if_ we untangled it a bit more, it might make the checkpoint lifetime a
bit more obvious and less scary to refactor.

But it does imply to me that the data dependency introduced by my
suggestion is not always so straight-forward as I thought it would be,
and we should probably punt on it for your series.

> which would do the trick, but it feels awfully hacky to have the "if
> (checkpoint.f != state->f)" check in there, since that feels too
> intimately tied to the implementation of the hashfile_checkpoint API for
> my comfort.

I think you could unconditionally checkpoint at that part; we're about
to do a write, so we want to store the state before the write in case we
need to roll back.
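
I.e., something like this squashed on top of your second diff (totally
untested):

 	while (1) {
 		prepare_to_stream(state, flags);
-		if (checkpoint.f != state->f)
-			hashfile_checkpoint_init(state->f, &checkpoint);
+		hashfile_checkpoint_init(state->f, &checkpoint);
 		if (idx) {
 			hashfile_checkpoint(&checkpoint);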

> Anyway, that's all to say that I think that while this is probably
> doable in theory, in practice it's kind of a mess, at least currently.
> I would rather see if there are other ways to clean up the
> deflate_blob_to_pack() function first in a way that made this change
> less awkward.

Yeah, I actually wrote what I wrote above before reading this far down
in your email, but we arrived at the exact same conclusion. ;) Hopefully
what I wrote might give some pointers if somebody wants to refactor
later.

> I think the most reasonable course here would be to pursue a minimal
> change like the one presented here and then think about further clean up
> as a separate step.

Yep. Thanks for looking into it.

-Peff

Patch

diff --git a/builtin/fast-import.c b/builtin/fast-import.c
index 0f86392761a..4a6c7ab52ac 100644
--- a/builtin/fast-import.c
+++ b/builtin/fast-import.c
@@ -1106,7 +1106,7 @@  static void stream_blob(uintmax_t len, struct object_id *oidout, uintmax_t mark)
 		|| (pack_size + PACK_SIZE_THRESHOLD + len) < pack_size)
 		cycle_packfile();
 
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
+	hashfile_checkpoint_init(pack_file, &checkpoint);
 	hashfile_checkpoint(pack_file, &checkpoint);
 	offset = checkpoint.offset;
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 433070a3bda..892176d23d2 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -261,7 +261,7 @@  static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 	git_hash_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
-	struct hashfile_checkpoint checkpoint = {0};
+	struct hashfile_checkpoint checkpoint;
 	struct pack_idx_entry *idx = NULL;
 
 	seekback = lseek(fd, 0, SEEK_CUR);
@@ -272,12 +272,15 @@  static int deflate_blob_to_pack(struct bulk_checkin_packfile *state,
 					  OBJ_BLOB, size);
 	the_hash_algo->init_fn(&ctx);
 	the_hash_algo->update_fn(&ctx, obuf, header_len);
-	the_hash_algo->unsafe_init_fn(&checkpoint.ctx);
 
 	/* Note: idx is non-NULL when we are writing */
-	if ((flags & HASH_WRITE_OBJECT) != 0)
+	if ((flags & HASH_WRITE_OBJECT) != 0) {
 		CALLOC_ARRAY(idx, 1);
 
+		prepare_to_stream(state, flags);
+		hashfile_checkpoint_init(state->f, &checkpoint);
+	}
+
 	already_hashed_to = 0;
 
 	while (1) {
diff --git a/csum-file.c b/csum-file.c
index ebffc80ef71..232121f415f 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -206,6 +206,13 @@  struct hashfile *hashfd_throughput(int fd, const char *name, struct progress *tp
 	return hashfd_internal(fd, name, tp, 8 * 1024);
 }
 
+void hashfile_checkpoint_init(struct hashfile *f,
+			      struct hashfile_checkpoint *checkpoint)
+{
+	memset(checkpoint, 0, sizeof(*checkpoint));
+	f->algop->init_fn(&checkpoint->ctx);
+}
+
 void hashfile_checkpoint(struct hashfile *f, struct hashfile_checkpoint *checkpoint)
 {
 	hashflush(f);
diff --git a/csum-file.h b/csum-file.h
index 2b45f4673a2..b7475f16c20 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -36,6 +36,7 @@  struct hashfile_checkpoint {
 	git_hash_ctx ctx;
 };
 
+void hashfile_checkpoint_init(struct hashfile *, struct hashfile_checkpoint *);
 void hashfile_checkpoint(struct hashfile *, struct hashfile_checkpoint *);
 int hashfile_truncate(struct hashfile *, struct hashfile_checkpoint *);