Message ID | a07ed50feeec4bfc3e9736bf493b9876896bcdd2.1680606445.git.ps@pks.im (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | repack: fix geometric repacking with gitalternates | expand |
On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote: > Both issues have the same underlying root cause, which is that geometric > repacks don't honor whether packfiles are local or not. As a result, > they will try to include packs part of the alternate object directory > and then at a later point fail to locate them as they do not exist in > the object directory of the repository we're about to repack. Interesting. This fix does make sense, but I wonder if it is doing the right thing. When we have an alternated repository and do 'git repack -ad' in the "member" repository, we'll end up creating a new pack containing all objects in that repository, including ones from the alternate. For example, if you do something like: rm -fr shared member git init shared git -C shared commit --allow-empty -m "base" && git -C shared repack -d git clone --shared shared member git -C member repack -ad for dir in shared member do echo "==> $dir" find $dir/.git/objects -type f done Then you'll end up with the output: ==> shared shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack shared/.git/objects/info/packs ==> member member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack member/.git/objects/info/alternates member/.git/objects/info/packs In other words, we end up creating the pack necessary in the member repository necessary to complete the repack. Since we're using `-a` here, that means creating an identical pack as we have in the shared repository. If we instead did something like: git -C member repack -adl # <- `-l` is new here , our output changes to instead contain the following (empty) pack directory in the member repository: ==> member member/.git/objects/info/alternates member/.git/objects/info/packs > Skip over packfiles that aren't local. This will cause geometric repacks > to never include packfiles of its alternates. ...So I wonder if this is the right thing to do in all cases. IOW, should we be creating packs in the "member" repository which may be based off of packs from the shared repository when `-l` is not specified? I think this gets tricky. For one, the geometric repack code creates at most one new pack. So if some of the packs that were above the split line (IOW, the ones that don't need to be squashed together) were in the shared repository, I'm not sure how you'd write a MIDX that covers both without using the MIDX-over-alternates feature. I have no idea how that works with MIDX bitmaps (IIUC, the MIDX/alternates feature is very niche). I think we reasonably could do something like ignoring non-local packs in `init_pack_geometry()` only when `-l` is given. That still runs into problems when trying to write a MIDX or MIDX bitmaps, so we should likely declare the combination "-l --write-midx --write-bitmap-index" as unsupported. For backwards compatibility, I think it would make sense to have "--no-local" be the default when `--geometric` is given (which is already the case, since po_args is zero-initialized). I suspect in practice that nobody cares about what happens when you run "git repack --geometric=<d> --local", but in case they do, I think the above is probably the most reasonable behavior that I can quickly come up with. > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > builtin/repack.c | 6 ++++ > t/t7703-repack-geometric.sh | 59 +++++++++++++++++++++++++++++++++++++ > 2 files changed, 65 insertions(+) > > diff --git a/builtin/repack.c b/builtin/repack.c > index 87f73c8923..c6d12fa4bd 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -333,6 +333,12 @@ static void init_pack_geometry(struct pack_geometry **geometry_p, > geometry = *geometry_p; > > for (p = get_all_packs(the_repository); p; p = p->next) { > + /* > + * We don't want to repack packfiles which are not part of the > + * primary object database. > + */ > + if (!p->pack_local) > + continue; So this would change to something like `if (po_args.local && !p->pack_local)`. > diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh > index 8821fbd2dd..9f8bc663e4 100755 > --- a/t/t7703-repack-geometric.sh > +++ b/t/t7703-repack-geometric.sh > @@ -281,4 +281,63 @@ test_expect_success '--geometric with pack.packSizeLimit' ' > ) > ' > > +packed_objects() { > + git verify-pack -v "$@" >tmp-object-list && > + sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\|tag\).*/\1/p" <tmp-object-list && > + rm -f tmp-object-list > +} Just a nitpick, but I've found the output from index-pack to be a little easier to work with here. I think that you could replace this function with the following and have it work: --- 8< --- diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh index 9f8bc663e4..e3ac5001bd 100755 --- a/t/t7703-repack-geometric.sh +++ b/t/t7703-repack-geometric.sh @@ -282,9 +282,8 @@ test_expect_success '--geometric with pack.packSizeLimit' ' ' packed_objects() { - git verify-pack -v "$@" >tmp-object-list && - sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\|tag\).*/\1/p" <tmp-object-list && - rm -f tmp-object-list + git show-index <"$1" >tmp-object-list && + cut -d' ' -f2 tmp-object-list } test_expect_success '--geometric with no local objects creates no pack' ' --- >8 --- I removed the "&& rm -f tmp-object-list", since we overwrite it anyway whenever we call packed_objects(). > +test_expect_success '--geometric with no local objects creates no pack' ' > + git init shared && > + test_when_finished "rm -fr shared" && > + test_commit -C shared "shared" && > + git -C shared repack -Ad && > + > + # Set up the member repository so that it does not have > + # any objects on its own. > + git clone --shared shared member && > + test_when_finished "rm -fr member" && > + > + git -C member repack --geometric 2 --write-midx && > + find member/.git/objects/pack -type f >actual && > + test_must_be_empty actual Another small nitpick, but I think these last two lines could be replaced with test_dir_is_empty member/.git/objects/pack or even test_dir_is_empty member/$packdir > +test_expect_success '--geometric does not include shared packfiles' ' > + git init shared && > + test_when_finished "rm -fr shared" && > + test_commit -C shared "shared" && > + git -C shared repack -Ad && > + > + git clone --shared shared member && > + test_when_finished "rm -fr member" && > + git -C member commit --allow-empty --message "not-shared" && Any reason to make this commit empty? Could we replace this with test_commit -C member not-shared ? > +test_expect_success '--geometric with same packfile in shared repository' ' > + git init shared && > + test_when_finished "rm -fr shared" && > + test_commit -C shared "shared" && > + git -C shared repack -Ad && > + > + # Prepare the member repository so that it got the exact same packfile > + # as the shared repository and set up gitalternates. > + cp -r shared member && > + test_when_finished "rm -fr member" && > + test-tool path-utils real_path shared/.git/objects >member/.git/objects/info/alternates && > + find shared/.git/objects -type f >expect && This works, though it may be easier to write something like: git clone -s shared member && test_when_finished "rm -fr member" && # ensure that "shared" and "member" have an identical set of packs git -C member repack -a && find shared/.git/objects -type f >expect && > + # After repacking, contents of the member repository should not have > + # changed. > + git -C member repack --geometric 2 --write-midx 2>error && I think that "2>error" is a stray change left over from debugging, and could probably be removed. > + find shared/.git/objects -type f >actual && This and above could be replaced with shared/$packdir. Whenever I'm asserting the contents of a directory are unchanged, I typically like to call the two files under comparison "before" and "after", but "expect" and "actual" are fine here, too. Thanks, Taylor
On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: > I think we reasonably could do something like ignoring non-local packs > in `init_pack_geometry()` only when `-l` is given. That still runs into > problems when trying to write a MIDX or MIDX bitmaps, so we should > likely declare the combination "-l --write-midx --write-bitmap-index" as > unsupported. For backwards compatibility, I think it would make sense to > have "--no-local" be the default when `--geometric` is given (which is > already the case, since po_args is zero-initialized). When summarizing this message to colleagues at GitHub, I came up with this TL;DR, which I figured may be useful to share: TL;DR: I think that this is a (much) more complicated problem than originally anticipated, but the easiest thing to do is to assume that git repack --geometric=<d> means git repack --geometric=<d> --no-local unless otherwise specified, and declare --geometric=<d> --local unsupported when used in conjunction with --write-midx or --write-bitmap-index. Thanks, Taylor
On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: > On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote: > > Both issues have the same underlying root cause, which is that geometric > > repacks don't honor whether packfiles are local or not. As a result, > > they will try to include packs part of the alternate object directory > > and then at a later point fail to locate them as they do not exist in > > the object directory of the repository we're about to repack. > > Interesting. This fix does make sense, but I wonder if it is doing the > right thing. > > When we have an alternated repository and do 'git repack -ad' in the > "member" repository, we'll end up creating a new pack containing all > objects in that repository, including ones from the alternate. > > For example, if you do something like: > > rm -fr shared member > > git init shared > git -C shared commit --allow-empty -m "base" && git -C shared repack -d > > git clone --shared shared member > git -C member repack -ad > > for dir in shared member > do > echo "==> $dir" > find $dir/.git/objects -type f > done > > Then you'll end up with the output: > > ==> shared > shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx > shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack > shared/.git/objects/info/packs > ==> member > member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx > member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack > member/.git/objects/info/alternates > member/.git/objects/info/packs > > In other words, we end up creating the pack necessary in the member > repository necessary to complete the repack. Since we're using `-a` > here, that means creating an identical pack as we have in the shared > repository. > > If we instead did something like: > > git -C member repack -adl # <- `-l` is new here > > , our output changes to instead contain the following (empty) pack > directory in the member repository: > > ==> member > member/.git/objects/info/alternates > member/.git/objects/info/packs Yup, that's fair. > > Skip over packfiles that aren't local. This will cause geometric repacks > > to never include packfiles of its alternates. > > ...So I wonder if this is the right thing to do in all cases. IOW, > should we be creating packs in the "member" repository which may be > based off of packs from the shared repository when `-l` is not > specified? > > I think this gets tricky. For one, the geometric repack code creates at > most one new pack. So if some of the packs that were above the split > line (IOW, the ones that don't need to be squashed together) were in the > shared repository, I'm not sure how you'd write a MIDX that covers both > without using the MIDX-over-alternates feature. I have no idea how that > works with MIDX bitmaps (IIUC, the MIDX/alternates feature is very > niche). I agree, things would become tricky in case geometric repacks are expected to work across alternates. > I think we reasonably could do something like ignoring non-local packs > in `init_pack_geometry()` only when `-l` is given. That still runs into > problems when trying to write a MIDX or MIDX bitmaps, so we should > likely declare the combination "-l --write-midx --write-bitmap-index" as > unsupported. For backwards compatibility, I think it would make sense to > have "--no-local" be the default when `--geometric` is given (which is > already the case, since po_args is zero-initialized). Okay, I agree that it's not all that sensible to allow writing bitmaps in a geometric repack that spans across multiple repositories. These bitmaps would immediately break once the shared repository performs a repack that removes a subset of packfiles that the bitmap depends on, which would make it non-obvious for how to even do repacks in such a shared repository at all. But I'm not yet sure whether I understand why `-l --write-midx` should be prohibited like you summarized in the follow-up mail: On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: > TL;DR: I think that this is a (much) more complicated problem than > originally anticipated, but the easiest thing to do is to assume that > git repack --geometric=<d> means git repack --geometric=<d> --no-local > unless otherwise specified, and declare --geometric=<d> --local > unsupported when used in conjunction with --write-midx or > --write-bitmap-index. The newly written MIDX would of course only span over the local packfiles, but is that even a problem? Ideally, Git would know to handle multiple MIDX files, and in that case it would make sense both for the shared and for the member repository to have one. > I suspect in practice that nobody cares about what happens when you run > "git repack --geometric=<d> --local", but in case they do, I think the > above is probably the most reasonable behavior that I can quickly come > up with. Well, I do as we use alternates to implement fork networks at GitLab and we're in the process of adopting geometric repacking. So what I want to have is that I can perform geometric repacks both in the shared and in the member repository that includes only the local packfiles. And yes, I agree that the above is the most reasonable behaviour, with the exception of disallowing MIDXs when doing a local geometric repack. But that raises the question: what do we do about the currently-broken behaviour when executing `git repack --geometric=<d> --no-local` in a alternated repository? I'd personally be fine to start honoring the `po_args.local` flag so that we skip over any non-local packfiles there while ignoring the larger problem of non-local geometric repacks breaking in certain scenarios. It would at least improve the status quo as users now have a way out in case they ever happen to hit that error. And it allows us to use geometric repacks in alternated repositories. But are we okay with punting on the larger issue for the time being? Thanks for your feedback and this interesting discussion! Patrick
On 4/5/2023 3:08 AM, Patrick Steinhardt wrote: > On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: >> On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote: >>> Both issues have the same underlying root cause, which is that geometric >>> repacks don't honor whether packfiles are local or not. As a result, >>> they will try to include packs part of the alternate object directory >>> and then at a later point fail to locate them as they do not exist in >>> the object directory of the repository we're about to repack. >> >> Interesting. This fix does make sense, but I wonder if it is doing the >> right thing. >> >> When we have an alternated repository and do 'git repack -ad' in the >> "member" repository, we'll end up creating a new pack containing all >> objects in that repository, including ones from the alternate. ... >> I think we reasonably could do something like ignoring non-local packs >> in `init_pack_geometry()` only when `-l` is given. That still runs into >> problems when trying to write a MIDX or MIDX bitmaps, so we should >> likely declare the combination "-l --write-midx --write-bitmap-index" as >> unsupported. For backwards compatibility, I think it would make sense to >> have "--no-local" be the default when `--geometric` is given (which is >> already the case, since po_args is zero-initialized). > > Okay, I agree that it's not all that sensible to allow writing bitmaps > in a geometric repack that spans across multiple repositories. These > bitmaps would immediately break once the shared repository performs a > repack that removes a subset of packfiles that the bitmap depends on, > which would make it non-obvious for how to even do repacks in such a > shared repository at all. We have mechanisms for avoiding writing bitmaps for packs that are not closed under reachability. We should have some protection against writing them for multi-pack-indexes that are not closed under reachability, if only as a check during bitmap_writer_build(). > But I'm not yet sure whether I understand why `-l --write-midx` should > be prohibited like you summarized in the follow-up mail: > > On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: >> TL;DR: I think that this is a (much) more complicated problem than >> originally anticipated, but the easiest thing to do is to assume that >> git repack --geometric=<d> means git repack --geometric=<d> --no-local >> unless otherwise specified, and declare --geometric=<d> --local >> unsupported when used in conjunction with --write-midx or >> --write-bitmap-index. > > The newly written MIDX would of course only span over the local > packfiles, but is that even a problem? Ideally, Git would know to handle > multiple MIDX files, and in that case it would make sense both for the > shared and for the member repository to have one. Yes, each odb is allowed a multi-pack-index, and they chain the same way pack-files do. I agree that this restriction isn't necessary. >> I suspect in practice that nobody cares about what happens when you run >> "git repack --geometric=<d> --local", but in case they do, I think the >> above is probably the most reasonable behavior that I can quickly come >> up with. > > Well, I do as we use alternates to implement fork networks at GitLab and > we're in the process of adopting geometric repacking. So what I want to > have is that I can perform geometric repacks both in the shared and in > the member repository that includes only the local packfiles. > > And yes, I agree that the above is the most reasonable behaviour, with > the exception of disallowing MIDXs when doing a local geometric repack. I think the recommended if (po_args.local && !p->local) continue; approach is a nice _performance improvement_ for the --local case, since it avoids adding a list of objects to be packed (and then thrown away, because those objects exist in an alternate). Of course, we are currently blocked on that part working because of the non-local packs being a problem. > But that raises the question: what do we do about the currently-broken > behaviour when executing `git repack --geometric=<d> --no-local` in a > alternated repository? Much like "git repack -a --no-local" in an alternated repository, I don't think this is something good for users to do, but I agree that it not working is a problem we should fix. Your original message includes error messages like "could not find pack" and "unknown preferred pack" which makes me think the _real_ problem is that we are not respecting the full path name of the pack-file and are somehow localizing packs to the local object dir. The basic reason is that write_midx_included_packs() takes all of the pack-files from the geometry, but does not strip out the pack-files that are not in the same object directory. Perhaps this method could include a step to create a new, "local" geometry containing only the packs within the local object dir. We can then skip the --preferred-pack option and bitmap if this is different than the original geometry (perhaps with a warning message to help users who did this accidentally). > I'd personally be fine to start honoring the `po_args.local` flag so > that we skip over any non-local packfiles there while ignoring the > larger problem of non-local geometric repacks breaking in certain > scenarios. It would at least improve the status quo as users now have a > way out in case they ever happen to hit that error. And it allows us to > use geometric repacks in alternated repositories. But are we okay with > punting on the larger issue for the time being? I think the real bug is isolated in write_midx_included_packs() in how it may specify packs that the multi-pack-index cannot track. It should be worth the time exploring if there is an easy fix there, and then the po_args.local version can be used as a backup/performance tweak on top of that. Thanks, -Stolee
On Wed, Apr 05, 2023 at 09:08:19AM +0200, Patrick Steinhardt wrote: > But I'm not yet sure whether I understand why `-l --write-midx` should > be prohibited like you summarized in the follow-up mail: > > On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: > > TL;DR: I think that this is a (much) more complicated problem than > > originally anticipated, but the easiest thing to do is to assume that > > git repack --geometric=<d> means git repack --geometric=<d> --no-local > > unless otherwise specified, and declare --geometric=<d> --local > > unsupported when used in conjunction with --write-midx or > > --write-bitmap-index. > > The newly written MIDX would of course only span over the local > packfiles, but is that even a problem? Ideally, Git would know to handle > multiple MIDX files, and in that case it would make sense both for the > shared and for the member repository to have one. Right, I don't think that it's a problem. But I don't know how well the MIDX-over-alternates thing works in practice. I know that the feature exists, but I don't think it is as battled-tested as non-alternated MIDXs. So I think you'd have to ban MIDX bitmaps when the MIDX spans over multiple repositories through an alternates file, but otherwise it should be OK. > But that raises the question: what do we do about the currently-broken > behaviour when executing `git repack --geometric=<d> --no-local` in a > alternated repository? > > I'd personally be fine to start honoring the `po_args.local` flag so > that we skip over any non-local packfiles there while ignoring the > larger problem of non-local geometric repacks breaking in certain > scenarios. It would at least improve the status quo as users now have a > way out in case they ever happen to hit that error. And it allows us to > use geometric repacks in alternated repositories. But are we okay with > punting on the larger issue for the time being? I wonder if the larger issue could be simplified into (a) honoring `po_args.local` when doing the geometric rollup, and (b) dropping the non-local packs when writing a MIDX. > Thanks for your feedback and this interesting discussion! Ditto ;-). Thanks, Taylor
On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote: > Your original message includes error messages like "could not find pack" > and "unknown preferred pack" which makes me think the _real_ problem is > that we are not respecting the full path name of the pack-file and are > somehow localizing packs to the local object dir. "could not find pack" comes from pack-objects's --stdin-packs mode when the caller specifies a pack that doesn't exist in the repository. I'm not sure what's going on here, since read_packs_list_from_stdin() searches over the result of calling get_all_packs(), which does include packs in alternate object stores. And indeed, pack-objects can recognize such packs: --- 8< --- #!/bin/sh rm -fr shared member git init shared git -C shared commit --allow-empty -m "base" && git -C shared repack -d git clone --shared shared member basename "$(find shared/.git/objects/pack -type f -name '*.pack')" >input git -C member pack-objects .git/objects/pack/pack --stdin-packs <input for repo in shared member do echo "==> $repo" find $repo/.git/objects/pack -type f done --- >8 --- ^^ the above results in generating the identical pack in the "member" repository as expected: ==> shared shared/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.idx shared/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.pack ==> member member/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.idx member/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.pack > The basic reason is that write_midx_included_packs() takes all of the > pack-files from the geometry, but does not strip out the pack-files > that are not in the same object directory. I agree here; the pack-geometry code should be stripping out non-local packs when writing a MIDX regardless of whether the caller passed --local or not. > Perhaps this method could include a step to create a new, "local" > geometry containing only the packs within the local object dir. We can > then skip the --preferred-pack option and bitmap if this is different > than the original geometry (perhaps with a warning message to help > users who did this accidentally). It would be nice to not have to juggle multiple pack geometry structs, since the logic of what gets repacked, what gets thrown away, and what gets kept is already fairly complicated (at least to me) and pretty fragile. I think if I were sketching this out, I'd start out by doing something like: --- 8< --- diff --git a/builtin/repack.c b/builtin/repack.c index df4d8e0f0b..eab5f58444 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -558,6 +558,10 @@ static void midx_included_packs(struct string_list *include, for (i = geometry->split; i < geometry->pack_nr; i++) { struct packed_git *p = geometry->pack[i]; + /* MIDXs cannot refer to non-local packs */ + if (!p->pack_local) + continue; + strbuf_addstr(&buf, pack_basename(p)); strbuf_strip_suffix(&buf, ".pack"); strbuf_addstr(&buf, ".idx"); --- >8 --- ...and I actually think that might do it, since: - existing_nonkept_packs is populated by calling readdir() on the local repository's $GIT_DIR/objects/pack directory, so it will never contain any non-local packs. - existing_kept_packs is also OK for the same reason - names (which tracks the packs that we just wrote) will never contain a non-local pack, since we never write packs outside of our local pack directory So that would cause you to write a MIDX containing only local packs (as desired) regardless of whether or not the caller passed --[no]-local or not. > > I'd personally be fine to start honoring the `po_args.local` flag so > > that we skip over any non-local packfiles there while ignoring the > > larger problem of non-local geometric repacks breaking in certain > > scenarios. It would at least improve the status quo as users now have a > > way out in case they ever happen to hit that error. And it allows us to > > use geometric repacks in alternated repositories. But are we okay with > > punting on the larger issue for the time being? > > I think the real bug is isolated in write_midx_included_packs() in how > it may specify packs that the multi-pack-index cannot track. It should > be worth the time exploring if there is an easy fix there, and then the > po_args.local version can be used as a backup/performance tweak on top > of that. Yeah, seeing "unknown preferred pack" makes me suspicious that we tried to write a MIDX that contains a pack outside of our repository. I tried to reproduce that case with the following script but couldn't do it (the pack that the MIDX does track is the local one, as expected): --- 8< --- #!/bin/sh rm -fr shared member git init shared git -C shared commit --allow-empty -m "base" && git -C shared repack -d git clone --shared shared member git -C member commit --allow-empty -m "base" && git -C member repack -d { basename "$(find shared/.git/objects/pack -type f -name '*.idx')" basename "$(find member/.git/objects/pack -type f -name '*.idx')" } | git.compile -C member multi-pack-index write --stdin-packs for repo in shared member do echo "==> $repo" find $repo/.git/objects/pack -type f done ( cd member && ~/src/git/t/helper/test-tool read-midx --show-objects .git/objects ) --- >8 --- But the MIDX code that is responsible for collecting the packs specified over --stdin-packs enumerates the possible packs by calling `for_each_file_in_pack_dir()` with our local `object_dir`, so I don't think it's possible to have a MIDX that contains a non-local pack. IOW, I agree with you that the bug is in write_midx_included_packs(), and I think the fix that I outlined would do the trick. Thanks, Taylor
On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote: > On 4/5/2023 3:08 AM, Patrick Steinhardt wrote: > > On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: > >> On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote: [snip] > > I'd personally be fine to start honoring the `po_args.local` flag so > > that we skip over any non-local packfiles there while ignoring the > > larger problem of non-local geometric repacks breaking in certain > > scenarios. It would at least improve the status quo as users now have a > > way out in case they ever happen to hit that error. And it allows us to > > use geometric repacks in alternated repositories. But are we okay with > > punting on the larger issue for the time being? > > I think the real bug is isolated in write_midx_included_packs() in how > it may specify packs that the multi-pack-index cannot track. It should > be worth the time exploring if there is an easy fix there, and then the > po_args.local version can be used as a backup/performance tweak on top > of that. The problems are quite multi-faceted, but taken on their own most of the problems actually seem quite easy to fix. I've got a local patch series that addresses almost all of the pain points I have found until now: - A segfault in git-multi-pack-index(1) when passed no packfiles and a preferred packfile that it cannot find. - The issue that git-repack(1) asks git-multi-pack-index(1) to write an MIDX with packs that it cannot actually track because they are not local. - An issue with git-pack-objects(1) that keeps it from packing objects that are non-local due to the way `--stdin-packs` is only referring to the packfile basenames. - The fact that we don't honor `-l` when doing geometric repacking. I'm still busy polishing it and finding testcases for each of these to demonstrate the actual bugs, but will send out the series later this week. Patrick
On Mon, Apr 10, 2023 at 07:49:01PM -0400, Taylor Blau wrote: > On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote: [snip] > > Perhaps this method could include a step to create a new, "local" > > geometry containing only the packs within the local object dir. We can > > then skip the --preferred-pack option and bitmap if this is different > > than the original geometry (perhaps with a warning message to help > > users who did this accidentally). > > It would be nice to not have to juggle multiple pack geometry structs, > since the logic of what gets repacked, what gets thrown away, and what > gets kept is already fairly complicated (at least to me) and pretty > fragile. > > I think if I were sketching this out, I'd start out by doing something > like: > > --- 8< --- > diff --git a/builtin/repack.c b/builtin/repack.c > index df4d8e0f0b..eab5f58444 100644 > --- a/builtin/repack.c > +++ b/builtin/repack.c > @@ -558,6 +558,10 @@ static void midx_included_packs(struct string_list *include, > for (i = geometry->split; i < geometry->pack_nr; i++) { > struct packed_git *p = geometry->pack[i]; > > + /* MIDXs cannot refer to non-local packs */ > + if (!p->pack_local) > + continue; > + > strbuf_addstr(&buf, pack_basename(p)); > strbuf_strip_suffix(&buf, ".pack"); > strbuf_addstr(&buf, ".idx"); > --- >8 --- > > ...and I actually think that might do it, since: > > - existing_nonkept_packs is populated by calling readdir() on the local > repository's $GIT_DIR/objects/pack directory, so it will never contain > any non-local packs. > > - existing_kept_packs is also OK for the same reason > > - names (which tracks the packs that we just wrote) will never contain a > non-local pack, since we never write packs outside of our local pack > directory > > So that would cause you to write a MIDX containing only local packs (as > desired) regardless of whether or not the caller passed --[no]-local or > not. I think this doesn't actually do anything. The reason is that the `--stdin-packs` option in git-multi-pack-index(1) does not actually treat the provided packs as packs that are to be included in the multi-pack-index. Instead, it uses it as a filter for `get_all_packs()`. And as `get_all_packs()` returns packfiles paths prefixed with the alternate directory's name, but we pass in basenames only, we would already be filtering out any non-local packfiles. So ultimately, we would only ever write an MIDX containing only local packs already. It rather feels like this is only by chance though, so I think it is good to include your patch regardless of whether it actually does something or not. Better be explicit here, also as documentation to the reader. I'll include it as part of my patch series that I'm to send out later this week. Patrick
On Tue, Apr 11, 2023 at 07:06:48PM +0200, Patrick Steinhardt wrote: > On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote: > > On 4/5/2023 3:08 AM, Patrick Steinhardt wrote: > > > On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote: > > >> On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote: > [snip] > > > I'd personally be fine to start honoring the `po_args.local` flag so > > > that we skip over any non-local packfiles there while ignoring the > > > larger problem of non-local geometric repacks breaking in certain > > > scenarios. It would at least improve the status quo as users now have a > > > way out in case they ever happen to hit that error. And it allows us to > > > use geometric repacks in alternated repositories. But are we okay with > > > punting on the larger issue for the time being? > > > > I think the real bug is isolated in write_midx_included_packs() in how > > it may specify packs that the multi-pack-index cannot track. It should > > be worth the time exploring if there is an easy fix there, and then the > > po_args.local version can be used as a backup/performance tweak on top > > of that. > > The problems are quite multi-faceted, but taken on their own most of the > problems actually seem quite easy to fix. I've got a local patch series > that addresses almost all of the pain points I have found until now: > > - A segfault in git-multi-pack-index(1) when passed no packfiles and a > preferred packfile that it cannot find. > > - The issue that git-repack(1) asks git-multi-pack-index(1) to write an > MIDX with packs that it cannot actually track because they are not > local. > > - An issue with git-pack-objects(1) that keeps it from packing objects > that are non-local due to the way `--stdin-packs` is only referring to > the packfile basenames. > > - The fact that we don't honor `-l` when doing geometric repacking. Ah, one more thing I forgot: add a safety mechanism that disables writing bitmaps when we see non-local packs. Patrick
On Tue, Apr 11, 2023 at 07:13:46PM +0200, Patrick Steinhardt wrote: > So ultimately, we would only ever write an MIDX containing only local > packs already. It rather feels like this is only by chance though, so I > think it is good to include your patch regardless of whether it actually > does something or not. Better be explicit here, also as documentation to > the reader. Doh, this is definitely right. I even wrote that code a while ago; shows you how good my memory is ;-). FWIW, I agree that even though it doesn't do anything in this instance, it is a good safeguard to have in place, so I think including it in your series is the right thing to do. Thanks, Taylor
On Tue, Apr 11, 2023 at 07:26:04PM +0200, Patrick Steinhardt wrote: > > The problems are quite multi-faceted, but taken on their own most of the > > problems actually seem quite easy to fix. I've got a local patch series > > that addresses almost all of the pain points I have found until now: > > > > - A segfault in git-multi-pack-index(1) when passed no packfiles and a > > preferred packfile that it cannot find. > > > > - The issue that git-repack(1) asks git-multi-pack-index(1) to write an > > MIDX with packs that it cannot actually track because they are not > > local. > > > > - An issue with git-pack-objects(1) that keeps it from packing objects > > that are non-local due to the way `--stdin-packs` is only referring to > > the packfile basenames. > > > > - The fact that we don't honor `-l` when doing geometric repacking. > > Ah, one more thing I forgot: add a safety mechanism that disables > writing bitmaps when we see non-local packs. Thank you very much for working on this. For what it's worth, I think that this whole thing is pretty cool. Having a couple of different forges use this feature in subtly different ways is proving to be an extremely effective way to shake out subtle bugs in the implementation. Thanks, Taylor
On Tue, Apr 11, 2023 at 05:13:12PM -0400, Taylor Blau wrote: > On Tue, Apr 11, 2023 at 07:13:46PM +0200, Patrick Steinhardt wrote: > > So ultimately, we would only ever write an MIDX containing only local > > packs already. It rather feels like this is only by chance though, so I > > think it is good to include your patch regardless of whether it actually > > does something or not. Better be explicit here, also as documentation to > > the reader. > > Doh, this is definitely right. I even wrote that code a while ago; shows > you how good my memory is ;-). > > FWIW, I agree that even though it doesn't do anything in this instance, > it is a good safeguard to have in place, so I think including it in your > series is the right thing to do. I was wrong, it actually does fix a problem, just not the one we thought it would fix. When all packfiles are non-local, git-multi-pack-index(1) would end up returning an error because it cannot find any of them. So by skipping over these non-local packfiles early on we avoid spawning it altogether and thus avoid the error. Partick
diff --git a/builtin/repack.c b/builtin/repack.c index 87f73c8923..c6d12fa4bd 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -333,6 +333,12 @@ static void init_pack_geometry(struct pack_geometry **geometry_p, geometry = *geometry_p; for (p = get_all_packs(the_repository); p; p = p->next) { + /* + * We don't want to repack packfiles which are not part of the + * primary object database. + */ + if (!p->pack_local) + continue; if (!pack_kept_objects) { /* * Any pack that has its pack_keep bit set will appear diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh index 8821fbd2dd..9f8bc663e4 100755 --- a/t/t7703-repack-geometric.sh +++ b/t/t7703-repack-geometric.sh @@ -281,4 +281,63 @@ test_expect_success '--geometric with pack.packSizeLimit' ' ) ' +packed_objects() { + git verify-pack -v "$@" >tmp-object-list && + sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\|tag\).*/\1/p" <tmp-object-list && + rm -f tmp-object-list +} + +test_expect_success '--geometric with no local objects creates no pack' ' + git init shared && + test_when_finished "rm -fr shared" && + test_commit -C shared "shared" && + git -C shared repack -Ad && + + # Set up the member repository so that it does not have + # any objects on its own. + git clone --shared shared member && + test_when_finished "rm -fr member" && + + git -C member repack --geometric 2 --write-midx && + find member/.git/objects/pack -type f >actual && + test_must_be_empty actual +' + +test_expect_success '--geometric does not include shared packfiles' ' + git init shared && + test_when_finished "rm -fr shared" && + test_commit -C shared "shared" && + git -C shared repack -Ad && + + git clone --shared shared member && + test_when_finished "rm -fr member" && + git -C member commit --allow-empty --message "not-shared" && + git -C member repack --geometric 2 --write-midx && + + # We expect the created packfile to only contain the new commit. + packed_objects member/.git/objects/pack/pack-*.idx >actual && + git -C member rev-parse HEAD >expect && + test_cmp expect actual +' + +test_expect_success '--geometric with same packfile in shared repository' ' + git init shared && + test_when_finished "rm -fr shared" && + test_commit -C shared "shared" && + git -C shared repack -Ad && + + # Prepare the member repository so that it got the exact same packfile + # as the shared repository and set up gitalternates. + cp -r shared member && + test_when_finished "rm -fr member" && + test-tool path-utils real_path shared/.git/objects >member/.git/objects/info/alternates && + find shared/.git/objects -type f >expect && + + # After repacking, contents of the member repository should not have + # changed. + git -C member repack --geometric 2 --write-midx 2>error && + find shared/.git/objects -type f >actual && + test_cmp expect actual +' + test_done
Performing geometric repacking with repositories that have alternate object directories set up is causing errors in different scenarios. - Repacking fails when the repository linked to the object directory does not have any objects on its own. ``` $ git init shared $ git -C shared commit --allow-empty --message something $ git clone --shared shared member $ git -C member repack --geometric=2 --write-midx Nothing new to pack. warning: unknown preferred pack: 'pack-3e1a94a8dc9bb4defb0d98ce2ebf325312d76362.pack' error: multi-pack-index died of signal 7 ``` - Repacking fails when the repository linked to the object directory has the exact same packfile as the linked-to object directory. ``` $ git init shared $ git -C shared commit --allow-empty --message something $ git -C shared repack -Ad $ cp -r shared member $ realpath shared/.git/objects >member/.git/objects/info/alternates $ git -C member repack --geometric 2 --write-midx fatal: could not find pack 'pack-d404037a861afe456e07a1aefb3655150f1299f0.pack' ``` Both issues have the same underlying root cause, which is that geometric repacks don't honor whether packfiles are local or not. As a result, they will try to include packs part of the alternate object directory and then at a later point fail to locate them as they do not exist in the object directory of the repository we're about to repack. Skip over packfiles that aren't local. This will cause geometric repacks to never include packfiles of its alternates. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- builtin/repack.c | 6 ++++ t/t7703-repack-geometric.sh | 59 +++++++++++++++++++++++++++++++++++++ 2 files changed, 65 insertions(+)