diff mbox series

[10/20] midx: bounds-check large offset chunk

Message ID 20231009210530.GJ3282181@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit 2abd56e9b2195c8111ff5d16efafabc5bccba92b
Headers show
Series bounds-checks for chunk-based files | expand

Commit Message

Jeff King Oct. 9, 2023, 9:05 p.m. UTC
When we see a large offset bit in the regular midx offset table, we
use the entry as an index into a separate large offset table (just like
a pack idx does). But we don't bounds-check the access to that large
offset table (nor even record its size when we parse the chunk!).

The equivalent code for a regular pack idx is in check_pack_index_ptr().
But things are a bit simpler here because of the chunked format: we can
just check our array index directly.

As a bonus, we can get rid of the st_mult() here. If our array
bounds-check is successful, then we know that the result will fit in a
size_t (and the bounds check uses a division to avoid overflow
entirely).

Signed-off-by: Jeff King <peff@peff.net>
---
 midx.c                      |  8 +++++---
 midx.h                      |  1 +
 t/t5319-multi-pack-index.sh | 20 ++++++++++++++++++++
 3 files changed, 26 insertions(+), 3 deletions(-)

Comments

Taylor Blau Oct. 11, 2023, 6:38 p.m. UTC | #1
On Mon, Oct 09, 2023 at 05:05:30PM -0400, Jeff King wrote:
> When we see a large offset bit in the regular midx offset table, we
> use the entry as an index into a separate large offset table (just like
> a pack idx does). But we don't bounds-check the access to that large
> offset table (nor even record its size when we parse the chunk!).
>
> The equivalent code for a regular pack idx is in check_pack_index_ptr().
> But things are a bit simpler here because of the chunked format: we can
> just check our array index directly.
>
> As a bonus, we can get rid of the st_mult() here. If our array
> bounds-check is successful, then we know that the result will fit in a
> size_t (and the bounds check uses a division to avoid overflow
> entirely).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  midx.c                      |  8 +++++---
>  midx.h                      |  1 +
>  t/t5319-multi-pack-index.sh | 20 ++++++++++++++++++++
>  3 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 7b1b45f381..3e768d0df0 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -180,7 +180,8 @@ struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
>  	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
>  		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
>
> -	pair_chunk_unsafe(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
> +	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
> +		   &m->chunk_large_offsets_len);
>
>  	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
>  		pair_chunk_unsafe(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
> @@ -303,8 +304,9 @@ off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
>  			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
>
>  		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
> -		return get_be64(m->chunk_large_offsets +
> -				st_mult(sizeof(uint64_t), offset32));
> +		if (offset32 >= m->chunk_large_offsets_len / sizeof(uint64_t))
> +			die(_("multi-pack-index large offset out of bounds"));
> +		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);

Makes sense, and this seems very reasonable. I had a couple of thoughts
on this hunk, one nitpick, and one non-nitpick ;-).

The nitpick is the easy one, which is that I typically think of scaling
some index to produce an offset into some chunk, instead of dividing and
going the other way.

So I probably would have written something like:

    if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
        die(_("multi-pack-index large offset out of bounds"));

But that is definitely a subjective/minor point, and I would not at all
be sad if you felt differently about it. That said, I do wish for a
little more information in the die() message, perhaps:

    if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
        die(_("multi-pack-index large offset for %s out of bounds "
              "(%"PRIuMAX" is beyond %"PRIuMAX")"),
            (uintmax_t)(offset32 * sizeof(uint64_t)), /* checked earlier */
            (uintmax_t)m->chunk_large_offsets_len);

I can imagine that for debugging corrupt MIDXs in the future, having
some extra information like the above would give us a better starting
point than popping into a debugger to get the same information.

Thanks,
Taylor
Jeff King Oct. 11, 2023, 11:18 p.m. UTC | #2
On Wed, Oct 11, 2023 at 02:38:07PM -0400, Taylor Blau wrote:

> >  		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
> > -		return get_be64(m->chunk_large_offsets +
> > -				st_mult(sizeof(uint64_t), offset32));
> > +		if (offset32 >= m->chunk_large_offsets_len / sizeof(uint64_t))
> > +			die(_("multi-pack-index large offset out of bounds"));
> > +		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
> 
> Makes sense, and this seems very reasonable. I had a couple of thoughts
> on this hunk, one nitpick, and one non-nitpick ;-).
> 
> The nitpick is the easy one, which is that I typically think of scaling
> some index to produce an offset into some chunk, instead of dividing and
> going the other way.
> 
> So I probably would have written something like:
> 
>     if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
>         die(_("multi-pack-index large offset out of bounds"));

Yeah, I admit that my inclination is to think of it that way, too, and
the division is a bit of an inversion. But I guess I am hard-wired to do
bounds checks with division, because I know it avoids overflow issues.
And the behavior is actually different, since st_mult() dies (with a
somewhat vague message) rather than returning with an error.

I was actually tempted to write a small inline helper to do
bounds-checks (since this pattern appears a few times in this series).
But of course it suffers from the same issues (it dies with a vague
message, or it returns a result code and then is awkward to call/use
because you have to stick the output in an out-parameter).

> But that is definitely a subjective/minor point, and I would not at all
> be sad if you felt differently about it. That said, I do wish for a
> little more information in the die() message, perhaps:
> 
>     if (st_mult(offset32, sizeof(uint64_t)) >= m->chunk_large_offsets_len)
>         die(_("multi-pack-index large offset for %s out of bounds "
>               "(%"PRIuMAX" is beyond %"PRIuMAX")"),
>             (uintmax_t)(offset32 * sizeof(uint64_t)), /* checked earlier */
>             (uintmax_t)m->chunk_large_offsets_len);
> 
> I can imagine that for debugging corrupt MIDXs in the future, having
> some extra information like the above would give us a better starting
> point than popping into a debugger to get the same information.

As you'll see when you get to the BDAT/BIDX messages, I put in a load of
detail. That's because I did those ones first, and then after seeing the
terse "eh, it's the wrong size" messages in the rest of the commit-graph
and midx code, I just followed suit.

We can circle back and improve the detail in the messages. One slightly
annoying thing is dealing with it in the tests. We'd have to do one of:

  - make the tests more brittle by hard-coding positions and offsets we
    don't necessarily care about

  - loosen the tests by just matching with "grep", which can sometimes
    miss other unexpected output

  - do some post-processing on the output to massage out the details we
    don't care about; this in theory is the best of both worlds, but
    reliably post-processing is non-trivial.

So of the three I'd probably just loosen to use "grep" in most spots.

-Peff
diff mbox series

Patch

diff --git a/midx.c b/midx.c
index 7b1b45f381..3e768d0df0 100644
--- a/midx.c
+++ b/midx.c
@@ -180,7 +180,8 @@  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local
 	if (read_chunk(cf, MIDX_CHUNKID_OBJECTOFFSETS, midx_read_object_offsets, m))
 		die(_("multi-pack-index required object offsets chunk missing or corrupted"));
 
-	pair_chunk_unsafe(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets);
+	pair_chunk(cf, MIDX_CHUNKID_LARGEOFFSETS, &m->chunk_large_offsets,
+		   &m->chunk_large_offsets_len);
 
 	if (git_env_bool("GIT_TEST_MIDX_READ_RIDX", 1))
 		pair_chunk_unsafe(cf, MIDX_CHUNKID_REVINDEX, &m->chunk_revindex);
@@ -303,8 +304,9 @@  off_t nth_midxed_offset(struct multi_pack_index *m, uint32_t pos)
 			die(_("multi-pack-index stores a 64-bit offset, but off_t is too small"));
 
 		offset32 ^= MIDX_LARGE_OFFSET_NEEDED;
-		return get_be64(m->chunk_large_offsets +
-				st_mult(sizeof(uint64_t), offset32));
+		if (offset32 >= m->chunk_large_offsets_len / sizeof(uint64_t))
+			die(_("multi-pack-index large offset out of bounds"));
+		return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * offset32);
 	}
 
 	return offset32;
diff --git a/midx.h b/midx.h
index 5b2a7da043..e8e8884d16 100644
--- a/midx.h
+++ b/midx.h
@@ -37,6 +37,7 @@  struct multi_pack_index {
 	const unsigned char *chunk_oid_lookup;
 	const unsigned char *chunk_object_offsets;
 	const unsigned char *chunk_large_offsets;
+	size_t chunk_large_offsets_len;
 	const unsigned char *chunk_revindex;
 
 	const char **pack_names;
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 30687d5452..16050f39d9 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -1118,4 +1118,24 @@  test_expect_success 'reader notices too-small object offset chunk' '
 	test_cmp expect err
 '
 
+test_expect_success 'reader bounds-checks large offset table' '
+	# re-use the objects64 dir here to cheaply get access to a midx
+	# with large offsets.
+	git init repo &&
+	test_when_finished "rm -rf repo" &&
+	(
+		cd repo &&
+		(cd ../objects64 && pwd) >.git/objects/info/alternates &&
+		git multi-pack-index --object-dir=../objects64 write &&
+		midx=../objects64/pack/multi-pack-index &&
+		corrupt_chunk_file $midx LOFF clear &&
+		test_must_fail git cat-file \
+			--batch-check --batch-all-objects 2>err &&
+		cat >expect <<-\EOF &&
+		fatal: multi-pack-index large offset out of bounds
+		EOF
+		test_cmp expect err
+	)
+'
+
 test_done