Message ID | 08da02fa74c211ae1019cb0a9f4e30cc239e1ab9.1653073280.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Accepted |
Commit | aab7bea14fee0f185b26413bf27a1cfefbe0114d |
Headers | show |
Series | repack: handle --keep-pack, --max-pack-size for geometric repacks | expand |
Taylor Blau wrote: > When doing a `--geometric=<d>` repack, `git repack` determines a > splitting point among packs ordered by their object count such that: > > - each pack above the split has at least `<d>` times as many objects > as the next-largest pack by object count, and > - the first pack above the split has at least `<d>` times as many > object as the sum of all packs below the split line combined > > `git repack` then creates a pack containing all of the objects contained > in packs below the split line by running `git pack-objects > --stdin-packs` underneath. Once packs are moved into place, then any > packs below the split line are removed, since their objects were just > combined into a new pack. > > But `git repack` tries to be careful to avoid removing a pack that it > just wrote, by checking: > > struct packed_git *p = geometry->pack[i]; > if (string_list_has_string(&names, hash_to_hex(p->hash))) > continue; > > in the `delete_redundant` and `geometric` conditional towards the end of > `cmd_repack`. > > But it's possible to trick `git repack` into not recognizing a pack that > it just wrote when `names` is out-of-order (which violates > `string_list_has_string()`'s assumption that the list is sorted and thus > binary search-able). > > When this happens in just the right circumstances, it is possible to > remove a pack that we just wrote, leading to object corruption. > > Luckily, this is quite difficult to provoke in practice (for a couple of > reasons): > > - we ordinarily write just one pack, so `names` usually contains just > one entry, and is thus sorted > - when we do write more than one pack (e.g., due to `--max-pack-size`) > we have to: (a) write a pack identical to one that already > exists, (b) have that pack be below the split line, and (c) have > the set of packs written by `pack-objects` occur in an order which > tricks `string_list_has_string()`. > > Demonstrate the above scenario in a failing test, which causes `git > repack --geometric` to write a pack which occurs below the split line, > _and_ fail to recognize that it wrote that pack. > > The following patch will fix this bug. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > t/t7703-repack-geometric.sh | 47 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 47 insertions(+) > > diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh > index 91bb2b37a8..2cd1de7295 100755 > --- a/t/t7703-repack-geometric.sh > +++ b/t/t7703-repack-geometric.sh > @@ -7,6 +7,7 @@ test_description='git repack --geometric works correctly' > GIT_TEST_MULTI_PACK_INDEX=0 > > objdir=.git/objects > +packdir=$objdir/pack > midx=$objdir/pack/multi-pack-index > > test_expect_success '--geometric with no packs' ' > @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' ' > ) > ' > > +test_expect_failure '--geometric with pack.packSizeLimit' ' > + git init pack-rewrite && > + test_when_finished "rm -fr pack-rewrite" && > + ( > + cd pack-rewrite && > + > + test-tool genrandom foo 1048576 >foo && > + test-tool genrandom bar 1048576 >bar && > + I was a bit worried about this test being flaky in the future (relying on particular pseudorandomly-generated file contents and the subsequent ordering of hashes on the packs). But, since neither 'genrandom' nor the pack hash generation seem likely to change (and I can't come up with an alternative to this approach anyway), the test looks good as-is. > + git add foo bar && > + test_tick && > + git commit -m base && > + > + git rev-parse HEAD:foo HEAD:bar >p1.objects && > + git rev-parse HEAD HEAD^{tree} >p2.objects && > + > + # These two packs each contain two objects, so the following > + # `--geometric` repack will try to combine them. > + p1="$(git pack-objects $packdir/pack <p1.objects)" && > + p2="$(git pack-objects $packdir/pack <p2.objects)" && > + > + # Remove any loose objects in packs, since we do not want extra > + # copies around (which would mask over potential object > + # corruption issues). > + git prune-packed && > + > + # Both p1 and p2 will be rolled up, but pack-objects will write > + # three packs: > + # > + # - one containing object "foo", > + # - another containing object "bar", > + # - a final pack containing the commit and tree objects > + # (identical to p2 above) > + git repack --geometric 2 -d --max-pack-size=1048576 && > + > + # Ensure `repack` can detect that the third pack it wrote > + # (containing just the tree and commit objects) was identical to > + # one that was below the geometric split, so that we can save it > + # from deletion. > + # > + # If `repack` fails to do that, we will incorrectly delete p2, > + # causing object corruption. > + git fsck > + ) > +' > + > test_done
Taylor Blau <me@ttaylorr.com> writes: > When this happens in just the right circumstances, it is possible to > remove a pack that we just wrote, leading to object corruption. > > Luckily, this is quite difficult to provoke in practice (for a couple of > reasons): I'd call it unlucky that it is hard to trigger, actually ;-) With a system like Git with more than a few handful of users, even an issue that is hard-to-trigger is bound to hit somebody every day, but it it is hard to trigger without the right condition, it is hard to debug. Thanks for finding and fixing (presumably we need a call to keep the list sorted at the right places?) > - we ordinarily write just one pack, so `names` usually contains just > one entry, and is thus sorted > - when we do write more than one pack (e.g., due to `--max-pack-size`) > we have to: (a) write a pack identical to one that already > exists, (b) have that pack be below the split line, and (c) have > the set of packs written by `pack-objects` occur in an order which > tricks `string_list_has_string()`. > > Demonstrate the above scenario in a failing test, which causes `git > repack --geometric` to write a pack which occurs below the split line, > _and_ fail to recognize that it wrote that pack. > > The following patch will fix this bug. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > t/t7703-repack-geometric.sh | 47 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 47 insertions(+) > > diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh > index 91bb2b37a8..2cd1de7295 100755 > --- a/t/t7703-repack-geometric.sh > +++ b/t/t7703-repack-geometric.sh > @@ -7,6 +7,7 @@ test_description='git repack --geometric works correctly' > GIT_TEST_MULTI_PACK_INDEX=0 > > objdir=.git/objects > +packdir=$objdir/pack > midx=$objdir/pack/multi-pack-index > > test_expect_success '--geometric with no packs' ' > @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' ' > ) > ' > > +test_expect_failure '--geometric with pack.packSizeLimit' ' > + git init pack-rewrite && > + test_when_finished "rm -fr pack-rewrite" && > + ( > + cd pack-rewrite && > + > + test-tool genrandom foo 1048576 >foo && > + test-tool genrandom bar 1048576 >bar && > + > + git add foo bar && > + test_tick && > + git commit -m base && > + > + git rev-parse HEAD:foo HEAD:bar >p1.objects && > + git rev-parse HEAD HEAD^{tree} >p2.objects && > + > + # These two packs each contain two objects, so the following > + # `--geometric` repack will try to combine them. > + p1="$(git pack-objects $packdir/pack <p1.objects)" && > + p2="$(git pack-objects $packdir/pack <p2.objects)" && > + > + # Remove any loose objects in packs, since we do not want extra > + # copies around (which would mask over potential object > + # corruption issues). > + git prune-packed && > + > + # Both p1 and p2 will be rolled up, but pack-objects will write > + # three packs: > + # > + # - one containing object "foo", > + # - another containing object "bar", > + # - a final pack containing the commit and tree objects > + # (identical to p2 above) > + git repack --geometric 2 -d --max-pack-size=1048576 && > + > + # Ensure `repack` can detect that the third pack it wrote > + # (containing just the tree and commit objects) was identical to > + # one that was below the geometric split, so that we can save it > + # from deletion. > + # > + # If `repack` fails to do that, we will incorrectly delete p2, > + # causing object corruption. > + git fsck > + ) > +' > + > test_done
On Fri, May 20, 2022 at 12:42:58PM -0700, Victoria Dye wrote: > > @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' ' > > ) > > ' > > > > +test_expect_failure '--geometric with pack.packSizeLimit' ' > > + git init pack-rewrite && > > + test_when_finished "rm -fr pack-rewrite" && > > + ( > > + cd pack-rewrite && > > + > > + test-tool genrandom foo 1048576 >foo && > > + test-tool genrandom bar 1048576 >bar && > > + > > I was a bit worried about this test being flaky in the future (relying on > particular pseudorandomly-generated file contents and the subsequent > ordering of hashes on the packs). But, since neither 'genrandom' nor the > pack hash generation seem likely to change (and I can't come up with an > alternative to this approach anyway), the test looks good as-is. Note that the "random" contents aren't so random (though I suspect you're talking about _how_ genrandom interprets the seed changing), and that we're really only depending on genrandom here to create a large amount of data. We are relying on the pack hashes appearing in a certain order, so in that sense this test could "break" even if pack-objects reported the packs it wrote in a different order. But I agree in the sense that I also cannot come up with a less brittle approach for writing this test ;). Thanks, Taylor
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh index 91bb2b37a8..2cd1de7295 100755 --- a/t/t7703-repack-geometric.sh +++ b/t/t7703-repack-geometric.sh @@ -7,6 +7,7 @@ test_description='git repack --geometric works correctly' GIT_TEST_MULTI_PACK_INDEX=0 objdir=.git/objects +packdir=$objdir/pack midx=$objdir/pack/multi-pack-index test_expect_success '--geometric with no packs' ' @@ -230,4 +231,50 @@ test_expect_success '--geometric chooses largest MIDX preferred pack' ' ) ' +test_expect_failure '--geometric with pack.packSizeLimit' ' + git init pack-rewrite && + test_when_finished "rm -fr pack-rewrite" && + ( + cd pack-rewrite && + + test-tool genrandom foo 1048576 >foo && + test-tool genrandom bar 1048576 >bar && + + git add foo bar && + test_tick && + git commit -m base && + + git rev-parse HEAD:foo HEAD:bar >p1.objects && + git rev-parse HEAD HEAD^{tree} >p2.objects && + + # These two packs each contain two objects, so the following + # `--geometric` repack will try to combine them. + p1="$(git pack-objects $packdir/pack <p1.objects)" && + p2="$(git pack-objects $packdir/pack <p2.objects)" && + + # Remove any loose objects in packs, since we do not want extra + # copies around (which would mask over potential object + # corruption issues). + git prune-packed && + + # Both p1 and p2 will be rolled up, but pack-objects will write + # three packs: + # + # - one containing object "foo", + # - another containing object "bar", + # - a final pack containing the commit and tree objects + # (identical to p2 above) + git repack --geometric 2 -d --max-pack-size=1048576 && + + # Ensure `repack` can detect that the third pack it wrote + # (containing just the tree and commit objects) was identical to + # one that was below the geometric split, so that we can save it + # from deletion. + # + # If `repack` fails to do that, we will incorrectly delete p2, + # causing object corruption. + git fsck + ) +' + test_done
When doing a `--geometric=<d>` repack, `git repack` determines a splitting point among packs ordered by their object count such that: - each pack above the split has at least `<d>` times as many objects as the next-largest pack by object count, and - the first pack above the split has at least `<d>` times as many object as the sum of all packs below the split line combined `git repack` then creates a pack containing all of the objects contained in packs below the split line by running `git pack-objects --stdin-packs` underneath. Once packs are moved into place, then any packs below the split line are removed, since their objects were just combined into a new pack. But `git repack` tries to be careful to avoid removing a pack that it just wrote, by checking: struct packed_git *p = geometry->pack[i]; if (string_list_has_string(&names, hash_to_hex(p->hash))) continue; in the `delete_redundant` and `geometric` conditional towards the end of `cmd_repack`. But it's possible to trick `git repack` into not recognizing a pack that it just wrote when `names` is out-of-order (which violates `string_list_has_string()`'s assumption that the list is sorted and thus binary search-able). When this happens in just the right circumstances, it is possible to remove a pack that we just wrote, leading to object corruption. Luckily, this is quite difficult to provoke in practice (for a couple of reasons): - we ordinarily write just one pack, so `names` usually contains just one entry, and is thus sorted - when we do write more than one pack (e.g., due to `--max-pack-size`) we have to: (a) write a pack identical to one that already exists, (b) have that pack be below the split line, and (c) have the set of packs written by `pack-objects` occur in an order which tricks `string_list_has_string()`. Demonstrate the above scenario in a failing test, which causes `git repack --geometric` to write a pack which occurs below the split line, _and_ fail to recognize that it wrote that pack. The following patch will fix this bug. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- t/t7703-repack-geometric.sh | 47 +++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+)