diff mbox series

[2/3] t7703: demonstrate object corruption with pack.packSizeLimit

Message ID 08da02fa74c211ae1019cb0a9f4e30cc239e1ab9.1653073280.git.me@ttaylorr.com (mailing list archive)
State Accepted
Commit aab7bea14fee0f185b26413bf27a1cfefbe0114d
Headers show
Series repack: handle --keep-pack, --max-pack-size for geometric repacks | expand

Commit Message

Taylor Blau May 20, 2022, 7:01 p.m. UTC
When doing a `--geometric=<d>` repack, `git repack` determines a
splitting point among packs ordered by their object count such that:

  - each pack above the split has at least `<d>` times as many objects
    as the next-largest pack by object count, and
  - the first pack above the split has at least `<d>` times as many
    object as the sum of all packs below the split line combined

`git repack` then creates a pack containing all of the objects contained
in packs below the split line by running `git pack-objects
--stdin-packs` underneath. Once packs are moved into place, then any
packs below the split line are removed, since their objects were just
combined into a new pack.

But `git repack` tries to be careful to avoid removing a pack that it
just wrote, by checking:

    struct packed_git *p = geometry->pack[i];
    if (string_list_has_string(&names, hash_to_hex(p->hash)))
      continue;

in the `delete_redundant` and `geometric` conditional towards the end of
`cmd_repack`.

But it's possible to trick `git repack` into not recognizing a pack that
it just wrote when `names` is out-of-order (which violates
`string_list_has_string()`'s assumption that the list is sorted and thus
binary search-able).

When this happens in just the right circumstances, it is possible to
remove a pack that we just wrote, leading to object corruption.

Luckily, this is quite difficult to provoke in practice (for a couple of
reasons):

  - we ordinarily write just one pack, so `names` usually contains just
    one entry, and is thus sorted
  - when we do write more than one pack (e.g., due to `--max-pack-size`)
    we have to: (a) write a pack identical to one that already
    exists, (b) have that pack be below the split line, and (c) have
    the set of packs written by `pack-objects` occur in an order which
    tricks `string_list_has_string()`.

Demonstrate the above scenario in a failing test, which causes `git
repack --geometric` to write a pack which occurs below the split line,
_and_ fail to recognize that it wrote that pack.

The following patch will fix this bug.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/t7703-repack-geometric.sh | 47 +++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

Comments

Victoria Dye May 20, 2022, 7:42 p.m. UTC | #1
Taylor Blau wrote:
> When doing a `--geometric=<d>` repack, `git repack` determines a
> splitting point among packs ordered by their object count such that:
> 
>   - each pack above the split has at least `<d>` times as many objects
>     as the next-largest pack by object count, and
>   - the first pack above the split has at least `<d>` times as many
>     object as the sum of all packs below the split line combined
> 
> `git repack` then creates a pack containing all of the objects contained
> in packs below the split line by running `git pack-objects
> --stdin-packs` underneath. Once packs are moved into place, then any
> packs below the split line are removed, since their objects were just
> combined into a new pack.
> 
> But `git repack` tries to be careful to avoid removing a pack that it
> just wrote, by checking:
> 
>     struct packed_git *p = geometry->pack[i];
>     if (string_list_has_string(&names, hash_to_hex(p->hash)))
>       continue;
> 
> in the `delete_redundant` and `geometric` conditional towards the end of
> `cmd_repack`.
> 
> But it's possible to trick `git repack` into not recognizing a pack that
> it just wrote when `names` is out-of-order (which violates
> `string_list_has_string()`'s assumption that the list is sorted and thus
> binary search-able).
> 
> When this happens in just the right circumstances, it is possible to
> remove a pack that we just wrote, leading to object corruption.
> 
> Luckily, this is quite difficult to provoke in practice (for a couple of
> reasons):
> 
>   - we ordinarily write just one pack, so `names` usually contains just
>     one entry, and is thus sorted
>   - when we do write more than one pack (e.g., due to `--max-pack-size`)
>     we have to: (a) write a pack identical to one that already
>     exists, (b) have that pack be below the split line, and (c) have
>     the set of packs written by `pack-objects` occur in an order which
>     tricks `string_list_has_string()`.
> 
> Demonstrate the above scenario in a failing test, which causes `git
> repack --geometric` to write a pack which occurs below the split line,
> _and_ fail to recognize that it wrote that pack.
> 
> The following patch will fix this bug.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  t/t7703-repack-geometric.sh | 47 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
> index 91bb2b37a8..2cd1de7295 100755
> --- a/t/t7703-repack-geometric.sh
> +++ b/t/t7703-repack-geometric.sh
> @@ -7,6 +7,7 @@ test_description='git repack --geometric works correctly'
>  GIT_TEST_MULTI_PACK_INDEX=0
>  
>  objdir=.git/objects
> +packdir=$objdir/pack
>  midx=$objdir/pack/multi-pack-index
>  
>  test_expect_success '--geometric with no packs' '
> @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' '
>  	)
>  '
>  
> +test_expect_failure '--geometric with pack.packSizeLimit' '
> +	git init pack-rewrite &&
> +	test_when_finished "rm -fr pack-rewrite" &&
> +	(
> +		cd pack-rewrite &&
> +
> +		test-tool genrandom foo 1048576 >foo &&
> +		test-tool genrandom bar 1048576 >bar &&
> +

I was a bit worried about this test being flaky in the future (relying on
particular pseudorandomly-generated file contents and the subsequent
ordering of hashes on the packs). But, since neither 'genrandom' nor the
pack hash generation seem likely to change (and I can't come up with an
alternative to this approach anyway), the test looks good as-is. 

> +		git add foo bar &&
> +		test_tick &&
> +		git commit -m base &&
> +
> +		git rev-parse HEAD:foo HEAD:bar >p1.objects &&
> +		git rev-parse HEAD HEAD^{tree} >p2.objects &&
> +
> +		# These two packs each contain two objects, so the following
> +		# `--geometric` repack will try to combine them.
> +		p1="$(git pack-objects $packdir/pack <p1.objects)" &&
> +		p2="$(git pack-objects $packdir/pack <p2.objects)" &&
> +
> +		# Remove any loose objects in packs, since we do not want extra
> +		# copies around (which would mask over potential object
> +		# corruption issues).
> +		git prune-packed &&
> +
> +		# Both p1 and p2 will be rolled up, but pack-objects will write
> +		# three packs:
> +		#
> +		#   - one containing object "foo",
> +		#   - another containing object "bar",
> +		#   - a final pack containing the commit and tree objects
> +		#     (identical to p2 above)
> +		git repack --geometric 2 -d --max-pack-size=1048576 &&
> +
> +		# Ensure `repack` can detect that the third pack it wrote
> +		# (containing just the tree and commit objects) was identical to
> +		# one that was below the geometric split, so that we can save it
> +		# from deletion.
> +		#
> +		# If `repack` fails to do that, we will incorrectly delete p2,
> +		# causing object corruption.
> +		git fsck
> +	)
> +'
> +
>  test_done
Junio C Hamano May 20, 2022, 8:54 p.m. UTC | #2
Taylor Blau <me@ttaylorr.com> writes:

> When this happens in just the right circumstances, it is possible to
> remove a pack that we just wrote, leading to object corruption.
>
> Luckily, this is quite difficult to provoke in practice (for a couple of
> reasons):

I'd call it unlucky that it is hard to trigger, actually ;-) With a
system like Git with more than a few handful of users, even an issue
that is hard-to-trigger is bound to hit somebody every day, but it
it is hard to trigger without the right condition, it is hard to
debug.

Thanks for finding and fixing (presumably we need a call to keep the
list sorted at the right places?)

>   - we ordinarily write just one pack, so `names` usually contains just
>     one entry, and is thus sorted
>   - when we do write more than one pack (e.g., due to `--max-pack-size`)
>     we have to: (a) write a pack identical to one that already
>     exists, (b) have that pack be below the split line, and (c) have
>     the set of packs written by `pack-objects` occur in an order which
>     tricks `string_list_has_string()`.
>
> Demonstrate the above scenario in a failing test, which causes `git
> repack --geometric` to write a pack which occurs below the split line,
> _and_ fail to recognize that it wrote that pack.
>
> The following patch will fix this bug.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  t/t7703-repack-geometric.sh | 47 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
>
> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
> index 91bb2b37a8..2cd1de7295 100755
> --- a/t/t7703-repack-geometric.sh
> +++ b/t/t7703-repack-geometric.sh
> @@ -7,6 +7,7 @@ test_description='git repack --geometric works correctly'
>  GIT_TEST_MULTI_PACK_INDEX=0
>  
>  objdir=.git/objects
> +packdir=$objdir/pack
>  midx=$objdir/pack/multi-pack-index
>  
>  test_expect_success '--geometric with no packs' '
> @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' '
>  	)
>  '
>  
> +test_expect_failure '--geometric with pack.packSizeLimit' '
> +	git init pack-rewrite &&
> +	test_when_finished "rm -fr pack-rewrite" &&
> +	(
> +		cd pack-rewrite &&
> +
> +		test-tool genrandom foo 1048576 >foo &&
> +		test-tool genrandom bar 1048576 >bar &&
> +
> +		git add foo bar &&
> +		test_tick &&
> +		git commit -m base &&
> +
> +		git rev-parse HEAD:foo HEAD:bar >p1.objects &&
> +		git rev-parse HEAD HEAD^{tree} >p2.objects &&
> +
> +		# These two packs each contain two objects, so the following
> +		# `--geometric` repack will try to combine them.
> +		p1="$(git pack-objects $packdir/pack <p1.objects)" &&
> +		p2="$(git pack-objects $packdir/pack <p2.objects)" &&
> +
> +		# Remove any loose objects in packs, since we do not want extra
> +		# copies around (which would mask over potential object
> +		# corruption issues).
> +		git prune-packed &&
> +
> +		# Both p1 and p2 will be rolled up, but pack-objects will write
> +		# three packs:
> +		#
> +		#   - one containing object "foo",
> +		#   - another containing object "bar",
> +		#   - a final pack containing the commit and tree objects
> +		#     (identical to p2 above)
> +		git repack --geometric 2 -d --max-pack-size=1048576 &&
> +
> +		# Ensure `repack` can detect that the third pack it wrote
> +		# (containing just the tree and commit objects) was identical to
> +		# one that was below the geometric split, so that we can save it
> +		# from deletion.
> +		#
> +		# If `repack` fails to do that, we will incorrectly delete p2,
> +		# causing object corruption.
> +		git fsck
> +	)
> +'
> +
>  test_done
Taylor Blau May 20, 2022, 11:22 p.m. UTC | #3
On Fri, May 20, 2022 at 12:42:58PM -0700, Victoria Dye wrote:
> > @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' '
> >  	)
> >  '
> >
> > +test_expect_failure '--geometric with pack.packSizeLimit' '
> > +	git init pack-rewrite &&
> > +	test_when_finished "rm -fr pack-rewrite" &&
> > +	(
> > +		cd pack-rewrite &&
> > +
> > +		test-tool genrandom foo 1048576 >foo &&
> > +		test-tool genrandom bar 1048576 >bar &&
> > +
>
> I was a bit worried about this test being flaky in the future (relying on
> particular pseudorandomly-generated file contents and the subsequent
> ordering of hashes on the packs). But, since neither 'genrandom' nor the
> pack hash generation seem likely to change (and I can't come up with an
> alternative to this approach anyway), the test looks good as-is.

Note that the "random" contents aren't so random (though I suspect
you're talking about _how_ genrandom interprets the seed changing), and
that we're really only depending on genrandom here to create a large
amount of data.

We are relying on the pack hashes appearing in a certain order, so in
that sense this test could "break" even if pack-objects reported the
packs it wrote in a different order.

But I agree in the sense that I also cannot come up with a less brittle
approach for writing this test ;).

Thanks,
Taylor
diff mbox series

Patch

diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 91bb2b37a8..2cd1de7295 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -7,6 +7,7 @@  test_description='git repack --geometric works correctly'
 GIT_TEST_MULTI_PACK_INDEX=0
 
 objdir=.git/objects
+packdir=$objdir/pack
 midx=$objdir/pack/multi-pack-index
 
 test_expect_success '--geometric with no packs' '
@@ -230,4 +231,50 @@  test_expect_success '--geometric chooses largest MIDX preferred pack' '
 	)
 '
 
+test_expect_failure '--geometric with pack.packSizeLimit' '
+	git init pack-rewrite &&
+	test_when_finished "rm -fr pack-rewrite" &&
+	(
+		cd pack-rewrite &&
+
+		test-tool genrandom foo 1048576 >foo &&
+		test-tool genrandom bar 1048576 >bar &&
+
+		git add foo bar &&
+		test_tick &&
+		git commit -m base &&
+
+		git rev-parse HEAD:foo HEAD:bar >p1.objects &&
+		git rev-parse HEAD HEAD^{tree} >p2.objects &&
+
+		# These two packs each contain two objects, so the following
+		# `--geometric` repack will try to combine them.
+		p1="$(git pack-objects $packdir/pack <p1.objects)" &&
+		p2="$(git pack-objects $packdir/pack <p2.objects)" &&
+
+		# Remove any loose objects in packs, since we do not want extra
+		# copies around (which would mask over potential object
+		# corruption issues).
+		git prune-packed &&
+
+		# Both p1 and p2 will be rolled up, but pack-objects will write
+		# three packs:
+		#
+		#   - one containing object "foo",
+		#   - another containing object "bar",
+		#   - a final pack containing the commit and tree objects
+		#     (identical to p2 above)
+		git repack --geometric 2 -d --max-pack-size=1048576 &&
+
+		# Ensure `repack` can detect that the third pack it wrote
+		# (containing just the tree and commit objects) was identical to
+		# one that was below the geometric split, so that we can save it
+		# from deletion.
+		#
+		# If `repack` fails to do that, we will incorrectly delete p2,
+		# causing object corruption.
+		git fsck
+	)
+'
+
 test_done