diff mbox series

repack: fix geometric repacking with gitalternates

Message ID a07ed50feeec4bfc3e9736bf493b9876896bcdd2.1680606445.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series repack: fix geometric repacking with gitalternates | expand

Commit Message

Patrick Steinhardt April 4, 2023, 11:08 a.m. UTC
Performing geometric repacking with repositories that have alternate
object directories set up is causing errors in different scenarios.

- Repacking fails when the repository linked to the object
  directory does not have any objects on its own.

  ```
  $ git init shared
  $ git -C shared commit --allow-empty --message something
  $ git clone --shared shared member
  $ git -C member repack --geometric=2 --write-midx
  Nothing new to pack.
  warning: unknown preferred pack: 'pack-3e1a94a8dc9bb4defb0d98ce2ebf325312d76362.pack'
  error: multi-pack-index died of signal 7
  ```

- Repacking fails when the repository linked to the object
  directory has the exact same packfile as the linked-to object
  directory.

  ```
  $ git init shared
  $ git -C shared commit --allow-empty --message something
  $ git -C shared repack -Ad
  $ cp -r shared member
  $ realpath shared/.git/objects >member/.git/objects/info/alternates
  $ git -C member repack --geometric 2 --write-midx
  fatal: could not find pack 'pack-d404037a861afe456e07a1aefb3655150f1299f0.pack'
  ```

Both issues have the same underlying root cause, which is that geometric
repacks don't honor whether packfiles are local or not. As a result,
they will try to include packs part of the alternate object directory
and then at a later point fail to locate them as they do not exist in
the object directory of the repository we're about to repack.

Skip over packfiles that aren't local. This will cause geometric repacks
to never include packfiles of its alternates.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/repack.c            |  6 ++++
 t/t7703-repack-geometric.sh | 59 +++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

Comments

Taylor Blau April 4, 2023, 6:55 p.m. UTC | #1
On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote:
> Both issues have the same underlying root cause, which is that geometric
> repacks don't honor whether packfiles are local or not. As a result,
> they will try to include packs part of the alternate object directory
> and then at a later point fail to locate them as they do not exist in
> the object directory of the repository we're about to repack.

Interesting. This fix does make sense, but I wonder if it is doing the
right thing.

When we have an alternated repository and do 'git repack -ad' in the
"member" repository, we'll end up creating a new pack containing all
objects in that repository, including ones from the alternate.

For example, if you do something like:

    rm -fr shared member

    git init shared
    git -C shared commit --allow-empty -m "base" && git -C shared repack -d

    git clone --shared shared member
    git -C member repack -ad

    for dir in shared member
    do
      echo "==> $dir"
      find $dir/.git/objects -type f
    done

Then you'll end up with the output:

    ==> shared
    shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx
    shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack
    shared/.git/objects/info/packs
    ==> member
    member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx
    member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack
    member/.git/objects/info/alternates
    member/.git/objects/info/packs

In other words, we end up creating the pack necessary in the member
repository necessary to complete the repack. Since we're using `-a`
here, that means creating an identical pack as we have in the shared
repository.

If we instead did something like:

    git -C member repack -adl # <- `-l` is new here

, our output changes to instead contain the following (empty) pack
directory in the member repository:

    ==> member
    member/.git/objects/info/alternates
    member/.git/objects/info/packs

> Skip over packfiles that aren't local. This will cause geometric repacks
> to never include packfiles of its alternates.

...So I wonder if this is the right thing to do in all cases. IOW,
should we be creating packs in the "member" repository which may be
based off of packs from the shared repository when `-l` is not
specified?

I think this gets tricky. For one, the geometric repack code creates at
most one new pack. So if some of the packs that were above the split
line (IOW, the ones that don't need to be squashed together) were in the
shared repository, I'm not sure how you'd write a MIDX that covers both
without using the MIDX-over-alternates feature. I have no idea how that
works with MIDX bitmaps (IIUC, the MIDX/alternates feature is very
niche).

I think we reasonably could do something like ignoring non-local packs
in `init_pack_geometry()` only when `-l` is given. That still runs into
problems when trying to write a MIDX or MIDX bitmaps, so we should
likely declare the combination "-l --write-midx --write-bitmap-index" as
unsupported. For backwards compatibility, I think it would make sense to
have "--no-local" be the default when `--geometric` is given (which is
already the case, since po_args is zero-initialized).

I suspect in practice that nobody cares about what happens when you run
"git repack --geometric=<d> --local", but in case they do, I think the
above is probably the most reasonable behavior that I can quickly come
up with.

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  builtin/repack.c            |  6 ++++
>  t/t7703-repack-geometric.sh | 59 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 65 insertions(+)
>
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 87f73c8923..c6d12fa4bd 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -333,6 +333,12 @@ static void init_pack_geometry(struct pack_geometry **geometry_p,
>  	geometry = *geometry_p;
>
>  	for (p = get_all_packs(the_repository); p; p = p->next) {
> +		/*
> +		 * We don't want to repack packfiles which are not part of the
> +		 * primary object database.
> +		 */
> +		if (!p->pack_local)
> +			continue;

So this would change to something like `if (po_args.local &&
!p->pack_local)`.

> diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
> index 8821fbd2dd..9f8bc663e4 100755
> --- a/t/t7703-repack-geometric.sh
> +++ b/t/t7703-repack-geometric.sh
> @@ -281,4 +281,63 @@ test_expect_success '--geometric with pack.packSizeLimit' '
>  	)
>  '
>
> +packed_objects() {
> +	git verify-pack -v "$@" >tmp-object-list &&
> +	sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\|tag\).*/\1/p" <tmp-object-list &&
> +	rm -f tmp-object-list
> +}

Just a nitpick, but I've found the output from index-pack to be a little
easier to work with here. I think that you could replace this function
with the following and have it work:

--- 8< ---
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 9f8bc663e4..e3ac5001bd 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -282,9 +282,8 @@ test_expect_success '--geometric with pack.packSizeLimit' '
 '

 packed_objects() {
-	git verify-pack -v "$@" >tmp-object-list &&
-	sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\|tag\).*/\1/p" <tmp-object-list &&
-	rm -f tmp-object-list
+	git show-index <"$1" >tmp-object-list &&
+	cut -d' ' -f2 tmp-object-list
 }

 test_expect_success '--geometric with no local objects creates no pack' '
--- >8 ---

I removed the "&& rm -f tmp-object-list", since we overwrite it anyway
whenever we call packed_objects().

> +test_expect_success '--geometric with no local objects creates no pack' '
> +	git init shared &&
> +	test_when_finished "rm -fr shared" &&
> +	test_commit -C shared "shared" &&
> +	git -C shared repack -Ad &&
> +
> +	# Set up the member repository so that it does not have
> +	# any objects on its own.
> +	git clone --shared shared member &&
> +	test_when_finished "rm -fr member" &&
> +
> +	git -C member repack --geometric 2 --write-midx &&
> +	find member/.git/objects/pack -type f >actual &&
> +	test_must_be_empty actual

Another small nitpick, but I think these last two lines could be
replaced with

    test_dir_is_empty member/.git/objects/pack

or even

    test_dir_is_empty member/$packdir

> +test_expect_success '--geometric does not include shared packfiles' '
> +	git init shared &&
> +	test_when_finished "rm -fr shared" &&
> +	test_commit -C shared "shared" &&
> +	git -C shared repack -Ad &&
> +
> +	git clone --shared shared member &&
> +	test_when_finished "rm -fr member" &&
> +	git -C member commit --allow-empty --message "not-shared" &&

Any reason to make this commit empty? Could we replace this with

    test_commit -C member not-shared

?

> +test_expect_success '--geometric with same packfile in shared repository' '
> +	git init shared &&
> +	test_when_finished "rm -fr shared" &&
> +	test_commit -C shared "shared" &&
> +	git -C shared repack -Ad &&
> +
> +	# Prepare the member repository so that it got the exact same packfile
> +	# as the shared repository and set up gitalternates.
> +	cp -r shared member &&
> +	test_when_finished "rm -fr member" &&
> +	test-tool path-utils real_path shared/.git/objects >member/.git/objects/info/alternates &&
> +	find shared/.git/objects -type f >expect &&

This works, though it may be easier to write something like:

    git clone -s shared member &&
    test_when_finished "rm -fr member" &&
    # ensure that "shared" and "member" have an identical set of packs
    git -C member repack -a &&
    find shared/.git/objects -type f >expect &&

> +	# After repacking, contents of the member repository should not have
> +	# changed.
> +	git -C member repack --geometric 2 --write-midx 2>error &&

I think that "2>error" is a stray change left over from debugging, and
could probably be removed.

> +	find shared/.git/objects -type f >actual &&

This and above could be replaced with shared/$packdir. Whenever I'm
asserting the contents of a directory are unchanged, I typically like to
call the two files under comparison "before" and "after", but "expect"
and "actual" are fine here, too.

Thanks,
Taylor
Taylor Blau April 4, 2023, 7 p.m. UTC | #2
On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
> I think we reasonably could do something like ignoring non-local packs
> in `init_pack_geometry()` only when `-l` is given. That still runs into
> problems when trying to write a MIDX or MIDX bitmaps, so we should
> likely declare the combination "-l --write-midx --write-bitmap-index" as
> unsupported. For backwards compatibility, I think it would make sense to
> have "--no-local" be the default when `--geometric` is given (which is
> already the case, since po_args is zero-initialized).

When summarizing this message to colleagues at GitHub, I came up with
this TL;DR, which I figured may be useful to share:

  TL;DR: I think that this is a (much) more complicated problem than
  originally anticipated, but the easiest thing to do is to assume that
  git repack --geometric=<d> means git repack --geometric=<d> --no-local
  unless otherwise specified, and declare --geometric=<d> --local
  unsupported when used in conjunction with --write-midx or
  --write-bitmap-index.

Thanks,
Taylor
Patrick Steinhardt April 5, 2023, 7:08 a.m. UTC | #3
On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
> On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote:
> > Both issues have the same underlying root cause, which is that geometric
> > repacks don't honor whether packfiles are local or not. As a result,
> > they will try to include packs part of the alternate object directory
> > and then at a later point fail to locate them as they do not exist in
> > the object directory of the repository we're about to repack.
> 
> Interesting. This fix does make sense, but I wonder if it is doing the
> right thing.
> 
> When we have an alternated repository and do 'git repack -ad' in the
> "member" repository, we'll end up creating a new pack containing all
> objects in that repository, including ones from the alternate.
> 
> For example, if you do something like:
> 
>     rm -fr shared member
> 
>     git init shared
>     git -C shared commit --allow-empty -m "base" && git -C shared repack -d
> 
>     git clone --shared shared member
>     git -C member repack -ad
> 
>     for dir in shared member
>     do
>       echo "==> $dir"
>       find $dir/.git/objects -type f
>     done
> 
> Then you'll end up with the output:
> 
>     ==> shared
>     shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx
>     shared/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack
>     shared/.git/objects/info/packs
>     ==> member
>     member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.idx
>     member/.git/objects/pack/pack-a2f0c663b287c3eeb0207397f8cafb9ae49f8277.pack
>     member/.git/objects/info/alternates
>     member/.git/objects/info/packs
> 
> In other words, we end up creating the pack necessary in the member
> repository necessary to complete the repack. Since we're using `-a`
> here, that means creating an identical pack as we have in the shared
> repository.
> 
> If we instead did something like:
> 
>     git -C member repack -adl # <- `-l` is new here
> 
> , our output changes to instead contain the following (empty) pack
> directory in the member repository:
> 
>     ==> member
>     member/.git/objects/info/alternates
>     member/.git/objects/info/packs

Yup, that's fair.

> > Skip over packfiles that aren't local. This will cause geometric repacks
> > to never include packfiles of its alternates.
> 
> ...So I wonder if this is the right thing to do in all cases. IOW,
> should we be creating packs in the "member" repository which may be
> based off of packs from the shared repository when `-l` is not
> specified?
> 
> I think this gets tricky. For one, the geometric repack code creates at
> most one new pack. So if some of the packs that were above the split
> line (IOW, the ones that don't need to be squashed together) were in the
> shared repository, I'm not sure how you'd write a MIDX that covers both
> without using the MIDX-over-alternates feature. I have no idea how that
> works with MIDX bitmaps (IIUC, the MIDX/alternates feature is very
> niche).

I agree, things would become tricky in case geometric repacks are
expected to work across alternates.

> I think we reasonably could do something like ignoring non-local packs
> in `init_pack_geometry()` only when `-l` is given. That still runs into
> problems when trying to write a MIDX or MIDX bitmaps, so we should
> likely declare the combination "-l --write-midx --write-bitmap-index" as
> unsupported. For backwards compatibility, I think it would make sense to
> have "--no-local" be the default when `--geometric` is given (which is
> already the case, since po_args is zero-initialized).

Okay, I agree that it's not all that sensible to allow writing bitmaps
in a geometric repack that spans across multiple repositories. These
bitmaps would immediately break once the shared repository performs a
repack that removes a subset of packfiles that the bitmap depends on,
which would make it non-obvious for how to even do repacks in such a
shared repository at all.

But I'm not yet sure whether I understand why `-l --write-midx` should
be prohibited like you summarized in the follow-up mail:

On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
> TL;DR: I think that this is a (much) more complicated problem than
> originally anticipated, but the easiest thing to do is to assume that
> git repack --geometric=<d> means git repack --geometric=<d> --no-local
> unless otherwise specified, and declare --geometric=<d> --local
> unsupported when used in conjunction with --write-midx or
> --write-bitmap-index.

The newly written MIDX would of course only span over the local
packfiles, but is that even a problem? Ideally, Git would know to handle
multiple MIDX files, and in that case it would make sense both for the
shared and for the member repository to have one.

> I suspect in practice that nobody cares about what happens when you run
> "git repack --geometric=<d> --local", but in case they do, I think the
> above is probably the most reasonable behavior that I can quickly come
> up with.

Well, I do as we use alternates to implement fork networks at GitLab and
we're in the process of adopting geometric repacking. So what I want to
have is that I can perform geometric repacks both in the shared and in
the member repository that includes only the local packfiles.

And yes, I agree that the above is the most reasonable behaviour, with
the exception of disallowing MIDXs when doing a local geometric repack.
But that raises the question: what do we do about the currently-broken
behaviour when executing `git repack --geometric=<d> --no-local` in a
alternated repository?

I'd personally be fine to start honoring the `po_args.local` flag so
that we skip over any non-local packfiles there while ignoring the
larger problem of non-local geometric repacks breaking in certain
scenarios. It would at least improve the status quo as users now have a
way out in case they ever happen to hit that error. And it allows us to
use geometric repacks in alternated repositories. But are we okay with
punting on the larger issue for the time being?

Thanks for your feedback and this interesting discussion!

Patrick
Derrick Stolee April 10, 2023, 3:06 p.m. UTC | #4
On 4/5/2023 3:08 AM, Patrick Steinhardt wrote:
> On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
>> On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote:
>>> Both issues have the same underlying root cause, which is that geometric
>>> repacks don't honor whether packfiles are local or not. As a result,
>>> they will try to include packs part of the alternate object directory
>>> and then at a later point fail to locate them as they do not exist in
>>> the object directory of the repository we're about to repack.
>>
>> Interesting. This fix does make sense, but I wonder if it is doing the
>> right thing.
>>
>> When we have an alternated repository and do 'git repack -ad' in the
>> "member" repository, we'll end up creating a new pack containing all
>> objects in that repository, including ones from the alternate.
...
>> I think we reasonably could do something like ignoring non-local packs
>> in `init_pack_geometry()` only when `-l` is given. That still runs into
>> problems when trying to write a MIDX or MIDX bitmaps, so we should
>> likely declare the combination "-l --write-midx --write-bitmap-index" as
>> unsupported. For backwards compatibility, I think it would make sense to
>> have "--no-local" be the default when `--geometric` is given (which is
>> already the case, since po_args is zero-initialized).
> 
> Okay, I agree that it's not all that sensible to allow writing bitmaps
> in a geometric repack that spans across multiple repositories. These
> bitmaps would immediately break once the shared repository performs a
> repack that removes a subset of packfiles that the bitmap depends on,
> which would make it non-obvious for how to even do repacks in such a
> shared repository at all.

We have mechanisms for avoiding writing bitmaps for packs that are not
closed under reachability. We should have some protection against writing
them for multi-pack-indexes that are not closed under reachability, if
only as a check during bitmap_writer_build().

> But I'm not yet sure whether I understand why `-l --write-midx` should
> be prohibited like you summarized in the follow-up mail:
> 
> On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
>> TL;DR: I think that this is a (much) more complicated problem than
>> originally anticipated, but the easiest thing to do is to assume that
>> git repack --geometric=<d> means git repack --geometric=<d> --no-local
>> unless otherwise specified, and declare --geometric=<d> --local
>> unsupported when used in conjunction with --write-midx or
>> --write-bitmap-index.
> 
> The newly written MIDX would of course only span over the local
> packfiles, but is that even a problem? Ideally, Git would know to handle
> multiple MIDX files, and in that case it would make sense both for the
> shared and for the member repository to have one.

Yes, each odb is allowed a multi-pack-index, and they chain the same way
pack-files do. I agree that this restriction isn't necessary.

>> I suspect in practice that nobody cares about what happens when you run
>> "git repack --geometric=<d> --local", but in case they do, I think the
>> above is probably the most reasonable behavior that I can quickly come
>> up with.
> 
> Well, I do as we use alternates to implement fork networks at GitLab and
> we're in the process of adopting geometric repacking. So what I want to
> have is that I can perform geometric repacks both in the shared and in
> the member repository that includes only the local packfiles.
> 
> And yes, I agree that the above is the most reasonable behaviour, with
> the exception of disallowing MIDXs when doing a local geometric repack.

I think the recommended

	if (po_args.local && !p->local)
		continue;

approach is a nice _performance improvement_ for the --local case, since
it avoids adding a list of objects to be packed (and then thrown away,
because those objects exist in an alternate). Of course, we are currently
blocked on that part working because of the non-local packs being a
problem.

> But that raises the question: what do we do about the currently-broken
> behaviour when executing `git repack --geometric=<d> --no-local` in a
> alternated repository?

Much like "git repack -a --no-local" in an alternated repository, I
don't think this is something good for users to do, but I agree that it
not working is a problem we should fix.

Your original message includes error messages like "could not find pack"
and "unknown preferred pack" which makes me think the _real_ problem is
that we are not respecting the full path name of the pack-file and are
somehow localizing packs to the local object dir.

The basic reason is that write_midx_included_packs() takes all of the
pack-files from the geometry, but does not strip out the pack-files
that are not in the same object directory.

Perhaps this method could include a step to create a new, "local"
geometry containing only the packs within the local object dir. We can
then skip the --preferred-pack option and bitmap if this is different
than the original geometry (perhaps with a warning message to help
users who did this accidentally).

> I'd personally be fine to start honoring the `po_args.local` flag so
> that we skip over any non-local packfiles there while ignoring the
> larger problem of non-local geometric repacks breaking in certain
> scenarios. It would at least improve the status quo as users now have a
> way out in case they ever happen to hit that error. And it allows us to
> use geometric repacks in alternated repositories. But are we okay with
> punting on the larger issue for the time being?

I think the real bug is isolated in write_midx_included_packs() in how
it may specify packs that the multi-pack-index cannot track. It should
be worth the time exploring if there is an easy fix there, and then the
po_args.local version can be used as a backup/performance tweak on top
of that.

Thanks,
-Stolee
Taylor Blau April 10, 2023, 11:29 p.m. UTC | #5
On Wed, Apr 05, 2023 at 09:08:19AM +0200, Patrick Steinhardt wrote:
> But I'm not yet sure whether I understand why `-l --write-midx` should
> be prohibited like you summarized in the follow-up mail:
>
> On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
> > TL;DR: I think that this is a (much) more complicated problem than
> > originally anticipated, but the easiest thing to do is to assume that
> > git repack --geometric=<d> means git repack --geometric=<d> --no-local
> > unless otherwise specified, and declare --geometric=<d> --local
> > unsupported when used in conjunction with --write-midx or
> > --write-bitmap-index.
>
> The newly written MIDX would of course only span over the local
> packfiles, but is that even a problem? Ideally, Git would know to handle
> multiple MIDX files, and in that case it would make sense both for the
> shared and for the member repository to have one.

Right, I don't think that it's a problem. But I don't know how well the
MIDX-over-alternates thing works in practice. I know that the feature
exists, but I don't think it is as battled-tested as non-alternated
MIDXs.

So I think you'd have to ban MIDX bitmaps when the MIDX spans over
multiple repositories through an alternates file, but otherwise it
should be OK.

> But that raises the question: what do we do about the currently-broken
> behaviour when executing `git repack --geometric=<d> --no-local` in a
> alternated repository?
>
> I'd personally be fine to start honoring the `po_args.local` flag so
> that we skip over any non-local packfiles there while ignoring the
> larger problem of non-local geometric repacks breaking in certain
> scenarios. It would at least improve the status quo as users now have a
> way out in case they ever happen to hit that error. And it allows us to
> use geometric repacks in alternated repositories. But are we okay with
> punting on the larger issue for the time being?

I wonder if the larger issue could be simplified into (a) honoring
`po_args.local` when doing the geometric rollup, and (b) dropping the
non-local packs when writing a MIDX.

> Thanks for your feedback and this interesting discussion!

Ditto ;-).

Thanks,
Taylor
Taylor Blau April 10, 2023, 11:49 p.m. UTC | #6
On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote:
> Your original message includes error messages like "could not find pack"
> and "unknown preferred pack" which makes me think the _real_ problem is
> that we are not respecting the full path name of the pack-file and are
> somehow localizing packs to the local object dir.

"could not find pack" comes from pack-objects's --stdin-packs mode when
the caller specifies a pack that doesn't exist in the repository.

I'm not sure what's going on here, since read_packs_list_from_stdin()
searches over the result of calling get_all_packs(), which does include
packs in alternate object stores. And indeed, pack-objects can recognize
such packs:

--- 8< ---
#!/bin/sh

rm -fr shared member

git init shared
git -C shared commit --allow-empty -m "base" && git -C shared repack -d

git clone --shared shared member

basename "$(find shared/.git/objects/pack -type f -name '*.pack')" >input

git -C member pack-objects .git/objects/pack/pack --stdin-packs <input

for repo in shared member
do
	echo "==> $repo"
	find $repo/.git/objects/pack -type f
done
--- >8 ---

^^ the above results in generating the identical pack in the "member"
repository as expected:

==> shared
shared/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.idx
shared/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.pack
==> member
member/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.idx
member/.git/objects/pack/pack-d51c724345703411d57134d018e9643924900d39.pack

> The basic reason is that write_midx_included_packs() takes all of the
> pack-files from the geometry, but does not strip out the pack-files
> that are not in the same object directory.

I agree here; the pack-geometry code should be stripping out non-local
packs when writing a MIDX regardless of whether the caller passed
--local or not.

> Perhaps this method could include a step to create a new, "local"
> geometry containing only the packs within the local object dir. We can
> then skip the --preferred-pack option and bitmap if this is different
> than the original geometry (perhaps with a warning message to help
> users who did this accidentally).

It would be nice to not have to juggle multiple pack geometry structs,
since the logic of what gets repacked, what gets thrown away, and what
gets kept is already fairly complicated (at least to me) and pretty
fragile.

I think if I were sketching this out, I'd start out by doing something
like:

--- 8< ---
diff --git a/builtin/repack.c b/builtin/repack.c
index df4d8e0f0b..eab5f58444 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -558,6 +558,10 @@ static void midx_included_packs(struct string_list *include,
 		for (i = geometry->split; i < geometry->pack_nr; i++) {
 			struct packed_git *p = geometry->pack[i];

+			/* MIDXs cannot refer to non-local packs */
+			if (!p->pack_local)
+				continue;
+
 			strbuf_addstr(&buf, pack_basename(p));
 			strbuf_strip_suffix(&buf, ".pack");
 			strbuf_addstr(&buf, ".idx");
--- >8 ---

...and I actually think that might do it, since:

	- existing_nonkept_packs is populated by calling readdir() on the local
		repository's $GIT_DIR/objects/pack directory, so it will never contain
		any non-local packs.

	- existing_kept_packs is also OK for the same reason

	- names (which tracks the packs that we just wrote) will never contain a
		non-local pack, since we never write packs outside of our local pack
		directory

So that would cause you to write a MIDX containing only local packs (as
desired) regardless of whether or not the caller passed --[no]-local or
not.

> > I'd personally be fine to start honoring the `po_args.local` flag so
> > that we skip over any non-local packfiles there while ignoring the
> > larger problem of non-local geometric repacks breaking in certain
> > scenarios. It would at least improve the status quo as users now have a
> > way out in case they ever happen to hit that error. And it allows us to
> > use geometric repacks in alternated repositories. But are we okay with
> > punting on the larger issue for the time being?
>
> I think the real bug is isolated in write_midx_included_packs() in how
> it may specify packs that the multi-pack-index cannot track. It should
> be worth the time exploring if there is an easy fix there, and then the
> po_args.local version can be used as a backup/performance tweak on top
> of that.

Yeah, seeing "unknown preferred pack" makes me suspicious that we tried
to write a MIDX that contains a pack outside of our repository. I tried
to reproduce that case with the following script but couldn't do it (the
pack that the MIDX does track is the local one, as expected):

--- 8< ---
#!/bin/sh

rm -fr shared member

git init shared
git -C shared commit --allow-empty -m "base" && git -C shared repack -d

git clone --shared shared member

git -C member commit --allow-empty -m "base" && git -C member repack -d

{
	basename "$(find shared/.git/objects/pack -type f -name '*.idx')"
	basename "$(find member/.git/objects/pack -type f -name '*.idx')"
} |
git.compile -C member multi-pack-index write --stdin-packs

for repo in shared member
do
	echo "==> $repo"
	find $repo/.git/objects/pack -type f
done

( cd member && ~/src/git/t/helper/test-tool read-midx --show-objects .git/objects )
--- >8 ---

But the MIDX code that is responsible for collecting the packs specified
over --stdin-packs enumerates the possible packs by calling
`for_each_file_in_pack_dir()` with our local `object_dir`, so I don't
think it's possible to have a MIDX that contains a non-local pack.

IOW, I agree with you that the bug is in write_midx_included_packs(),
and I think the fix that I outlined would do the trick.

Thanks,
Taylor
Patrick Steinhardt April 11, 2023, 5:06 p.m. UTC | #7
On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote:
> On 4/5/2023 3:08 AM, Patrick Steinhardt wrote:
> > On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
> >> On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote:
[snip]
> > I'd personally be fine to start honoring the `po_args.local` flag so
> > that we skip over any non-local packfiles there while ignoring the
> > larger problem of non-local geometric repacks breaking in certain
> > scenarios. It would at least improve the status quo as users now have a
> > way out in case they ever happen to hit that error. And it allows us to
> > use geometric repacks in alternated repositories. But are we okay with
> > punting on the larger issue for the time being?
> 
> I think the real bug is isolated in write_midx_included_packs() in how
> it may specify packs that the multi-pack-index cannot track. It should
> be worth the time exploring if there is an easy fix there, and then the
> po_args.local version can be used as a backup/performance tweak on top
> of that.

The problems are quite multi-faceted, but taken on their own most of the
problems actually seem quite easy to fix. I've got a local patch series
that addresses almost all of the pain points I have found until now:

- A segfault in git-multi-pack-index(1) when passed no packfiles and a
  preferred packfile that it cannot find.

- The issue that git-repack(1) asks git-multi-pack-index(1) to write an
  MIDX with packs that it cannot actually track because they are not
  local.

- An issue with git-pack-objects(1) that keeps it from packing objects
  that are non-local due to the way `--stdin-packs` is only referring to
  the packfile basenames.

- The fact that we don't honor `-l` when doing geometric repacking.

I'm still busy polishing it and finding testcases for each of these to
demonstrate the actual bugs, but will send out the series later this
week.

Patrick
Patrick Steinhardt April 11, 2023, 5:13 p.m. UTC | #8
On Mon, Apr 10, 2023 at 07:49:01PM -0400, Taylor Blau wrote:
> On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote:
[snip]
> > Perhaps this method could include a step to create a new, "local"
> > geometry containing only the packs within the local object dir. We can
> > then skip the --preferred-pack option and bitmap if this is different
> > than the original geometry (perhaps with a warning message to help
> > users who did this accidentally).
> 
> It would be nice to not have to juggle multiple pack geometry structs,
> since the logic of what gets repacked, what gets thrown away, and what
> gets kept is already fairly complicated (at least to me) and pretty
> fragile.
> 
> I think if I were sketching this out, I'd start out by doing something
> like:
> 
> --- 8< ---
> diff --git a/builtin/repack.c b/builtin/repack.c
> index df4d8e0f0b..eab5f58444 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -558,6 +558,10 @@ static void midx_included_packs(struct string_list *include,
>  		for (i = geometry->split; i < geometry->pack_nr; i++) {
>  			struct packed_git *p = geometry->pack[i];
> 
> +			/* MIDXs cannot refer to non-local packs */
> +			if (!p->pack_local)
> +				continue;
> +
>  			strbuf_addstr(&buf, pack_basename(p));
>  			strbuf_strip_suffix(&buf, ".pack");
>  			strbuf_addstr(&buf, ".idx");
> --- >8 ---
> 
> ...and I actually think that might do it, since:
> 
> 	- existing_nonkept_packs is populated by calling readdir() on the local
> 		repository's $GIT_DIR/objects/pack directory, so it will never contain
> 		any non-local packs.
> 
> 	- existing_kept_packs is also OK for the same reason
> 
> 	- names (which tracks the packs that we just wrote) will never contain a
> 		non-local pack, since we never write packs outside of our local pack
> 		directory
> 
> So that would cause you to write a MIDX containing only local packs (as
> desired) regardless of whether or not the caller passed --[no]-local or
> not.

I think this doesn't actually do anything. The reason is that the
`--stdin-packs` option in git-multi-pack-index(1) does not actually
treat the provided packs as packs that are to be included in the
multi-pack-index. Instead, it uses it as a filter for `get_all_packs()`.
And as `get_all_packs()` returns packfiles paths prefixed with the
alternate directory's name, but we pass in basenames only, we would
already be filtering out any non-local packfiles.

So ultimately, we would only ever write an MIDX containing only local
packs already. It rather feels like this is only by chance though, so I
think it is good to include your patch regardless of whether it actually
does something or not. Better be explicit here, also as documentation to
the reader.

I'll include it as part of my patch series that I'm to send out later
this week.

Patrick
Patrick Steinhardt April 11, 2023, 5:26 p.m. UTC | #9
On Tue, Apr 11, 2023 at 07:06:48PM +0200, Patrick Steinhardt wrote:
> On Mon, Apr 10, 2023 at 11:06:49AM -0400, Derrick Stolee wrote:
> > On 4/5/2023 3:08 AM, Patrick Steinhardt wrote:
> > > On Tue, Apr 04, 2023 at 02:55:50PM -0400, Taylor Blau wrote:
> > >> On Tue, Apr 04, 2023 at 01:08:33PM +0200, Patrick Steinhardt wrote:
> [snip]
> > > I'd personally be fine to start honoring the `po_args.local` flag so
> > > that we skip over any non-local packfiles there while ignoring the
> > > larger problem of non-local geometric repacks breaking in certain
> > > scenarios. It would at least improve the status quo as users now have a
> > > way out in case they ever happen to hit that error. And it allows us to
> > > use geometric repacks in alternated repositories. But are we okay with
> > > punting on the larger issue for the time being?
> > 
> > I think the real bug is isolated in write_midx_included_packs() in how
> > it may specify packs that the multi-pack-index cannot track. It should
> > be worth the time exploring if there is an easy fix there, and then the
> > po_args.local version can be used as a backup/performance tweak on top
> > of that.
> 
> The problems are quite multi-faceted, but taken on their own most of the
> problems actually seem quite easy to fix. I've got a local patch series
> that addresses almost all of the pain points I have found until now:
> 
> - A segfault in git-multi-pack-index(1) when passed no packfiles and a
>   preferred packfile that it cannot find.
> 
> - The issue that git-repack(1) asks git-multi-pack-index(1) to write an
>   MIDX with packs that it cannot actually track because they are not
>   local.
> 
> - An issue with git-pack-objects(1) that keeps it from packing objects
>   that are non-local due to the way `--stdin-packs` is only referring to
>   the packfile basenames.
> 
> - The fact that we don't honor `-l` when doing geometric repacking.

Ah, one more thing I forgot: add a safety mechanism that disables
writing bitmaps when we see non-local packs.

Patrick
Taylor Blau April 11, 2023, 9:13 p.m. UTC | #10
On Tue, Apr 11, 2023 at 07:13:46PM +0200, Patrick Steinhardt wrote:
> So ultimately, we would only ever write an MIDX containing only local
> packs already. It rather feels like this is only by chance though, so I
> think it is good to include your patch regardless of whether it actually
> does something or not. Better be explicit here, also as documentation to
> the reader.

Doh, this is definitely right. I even wrote that code a while ago; shows
you how good my memory is ;-).

FWIW, I agree that even though it doesn't do anything in this instance,
it is a good safeguard to have in place, so I think including it in your
series is the right thing to do.

Thanks,
Taylor
Taylor Blau April 11, 2023, 9:14 p.m. UTC | #11
On Tue, Apr 11, 2023 at 07:26:04PM +0200, Patrick Steinhardt wrote:
> > The problems are quite multi-faceted, but taken on their own most of the
> > problems actually seem quite easy to fix. I've got a local patch series
> > that addresses almost all of the pain points I have found until now:
> >
> > - A segfault in git-multi-pack-index(1) when passed no packfiles and a
> >   preferred packfile that it cannot find.
> >
> > - The issue that git-repack(1) asks git-multi-pack-index(1) to write an
> >   MIDX with packs that it cannot actually track because they are not
> >   local.
> >
> > - An issue with git-pack-objects(1) that keeps it from packing objects
> >   that are non-local due to the way `--stdin-packs` is only referring to
> >   the packfile basenames.
> >
> > - The fact that we don't honor `-l` when doing geometric repacking.
>
> Ah, one more thing I forgot: add a safety mechanism that disables
> writing bitmaps when we see non-local packs.

Thank you very much for working on this.

For what it's worth, I think that this whole thing is pretty cool.
Having a couple of different forges use this feature in subtly different
ways is proving to be an extremely effective way to shake out subtle
bugs in the implementation.

Thanks,
Taylor
Patrick Steinhardt April 12, 2023, 9:37 a.m. UTC | #12
On Tue, Apr 11, 2023 at 05:13:12PM -0400, Taylor Blau wrote:
> On Tue, Apr 11, 2023 at 07:13:46PM +0200, Patrick Steinhardt wrote:
> > So ultimately, we would only ever write an MIDX containing only local
> > packs already. It rather feels like this is only by chance though, so I
> > think it is good to include your patch regardless of whether it actually
> > does something or not. Better be explicit here, also as documentation to
> > the reader.
> 
> Doh, this is definitely right. I even wrote that code a while ago; shows
> you how good my memory is ;-).
> 
> FWIW, I agree that even though it doesn't do anything in this instance,
> it is a good safeguard to have in place, so I think including it in your
> series is the right thing to do.

I was wrong, it actually does fix a problem, just not the one we thought
it would fix. When all packfiles are non-local, git-multi-pack-index(1)
would end up returning an error because it cannot find any of them. So
by skipping over these non-local packfiles early on we avoid spawning
it altogether and thus avoid the error.

Partick
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index 87f73c8923..c6d12fa4bd 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -333,6 +333,12 @@  static void init_pack_geometry(struct pack_geometry **geometry_p,
 	geometry = *geometry_p;
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
+		/*
+		 * We don't want to repack packfiles which are not part of the
+		 * primary object database.
+		 */
+		if (!p->pack_local)
+			continue;
 		if (!pack_kept_objects) {
 			/*
 			 * Any pack that has its pack_keep bit set will appear
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 8821fbd2dd..9f8bc663e4 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -281,4 +281,63 @@  test_expect_success '--geometric with pack.packSizeLimit' '
 	)
 '
 
+packed_objects() {
+	git verify-pack -v "$@" >tmp-object-list &&
+	sed -n -e "s/^\([0-9a-f][0-9a-f]*\).*\(commit\|tree\|blob\|tag\).*/\1/p" <tmp-object-list &&
+	rm -f tmp-object-list
+}
+
+test_expect_success '--geometric with no local objects creates no pack' '
+	git init shared &&
+	test_when_finished "rm -fr shared" &&
+	test_commit -C shared "shared" &&
+	git -C shared repack -Ad &&
+
+	# Set up the member repository so that it does not have
+	# any objects on its own.
+	git clone --shared shared member &&
+	test_when_finished "rm -fr member" &&
+
+	git -C member repack --geometric 2 --write-midx &&
+	find member/.git/objects/pack -type f >actual &&
+	test_must_be_empty actual
+'
+
+test_expect_success '--geometric does not include shared packfiles' '
+	git init shared &&
+	test_when_finished "rm -fr shared" &&
+	test_commit -C shared "shared" &&
+	git -C shared repack -Ad &&
+
+	git clone --shared shared member &&
+	test_when_finished "rm -fr member" &&
+	git -C member commit --allow-empty --message "not-shared" &&
+	git -C member repack --geometric 2 --write-midx &&
+
+	# We expect the created packfile to only contain the new commit.
+	packed_objects member/.git/objects/pack/pack-*.idx >actual &&
+	git -C member rev-parse HEAD >expect &&
+	test_cmp expect actual
+'
+
+test_expect_success '--geometric with same packfile in shared repository' '
+	git init shared &&
+	test_when_finished "rm -fr shared" &&
+	test_commit -C shared "shared" &&
+	git -C shared repack -Ad &&
+
+	# Prepare the member repository so that it got the exact same packfile
+	# as the shared repository and set up gitalternates.
+	cp -r shared member &&
+	test_when_finished "rm -fr member" &&
+	test-tool path-utils real_path shared/.git/objects >member/.git/objects/info/alternates &&
+	find shared/.git/objects -type f >expect &&
+
+	# After repacking, contents of the member repository should not have
+	# changed.
+	git -C member repack --geometric 2 --write-midx 2>error &&
+	find shared/.git/objects -type f >actual &&
+	test_cmp expect actual
+'
+
 test_done