diff mbox series

[v2,2/8] repack: fix trying to use preferred pack in alternates

Message ID 011b08f3b64f264e3abbe8b49ee5338c221badb9.1681294715.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series repack: fix geometric repacking with gitalternates | expand

Commit Message

Patrick Steinhardt April 12, 2023, 10:22 a.m. UTC
When doing a geometric repack with multi-pack-indices, then we ask
git-multi-pack-index(1) to use the largest packfile as the preferred
pack. It can happen though that the largest packfile is not part of the
main object database, but instead part of an alternate object database.
The result is that git-multi-pack-index(1) will not be able to find the
preferred pack and print a warning. It then falls back to use the first
packfile that the multi-pack-index shall reference.

Fix this bug by only considering packfiles as preferred pack that are
local. This is the right thing to do given that a multi-pack-index
should never reference packfiles borrowed from an alternate.

While at it, rename the function `get_largest_active_packfile()` to
`get_preferred_pack()` to better document its intent.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/repack.c            | 21 ++++++++++++++++-----
 t/t7703-repack-geometric.sh | 31 +++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+), 5 deletions(-)

Comments

Taylor Blau April 12, 2023, 6:37 p.m. UTC | #1
On Wed, Apr 12, 2023 at 12:22:35PM +0200, Patrick Steinhardt wrote:
> When doing a geometric repack with multi-pack-indices, then we ask
> git-multi-pack-index(1) to use the largest packfile as the preferred
> pack. It can happen though that the largest packfile is not part of the
> main object database, but instead part of an alternate object database.
> The result is that git-multi-pack-index(1) will not be able to find the
> preferred pack and print a warning. It then falls back to use the first
> packfile that the multi-pack-index shall reference.
>
> Fix this bug by only considering packfiles as preferred pack that are
> local. This is the right thing to do given that a multi-pack-index
> should never reference packfiles borrowed from an alternate.
>
> While at it, rename the function `get_largest_active_packfile()` to
> `get_preferred_pack()` to better document its intent.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>

> @@ -464,7 +466,16 @@ static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry
>  	}
>  	if (geometry->split == geometry->pack_nr)
>  		return NULL;
> -	return geometry->pack[geometry->pack_nr - 1];
> +
> +	for (i = geometry->pack_nr; i > 0; i--)
> +		/*
> +		 * A pack that is not local would never be included in a
> +		 * multi-pack index. We thus skip over any non-local packs.
> +		 */
> +		if (geometry->pack[i - 1]->pack_local)
> +			return geometry->pack[i - 1];
> +
> +	return NULL;
>  }

Looking good, we want to go through this list in reverse order, since
the packs are ordered largest to smallest. I think that you could
slightly rewrite the loop condition to avoid having to always access
`geometry->pack` at `i-1` instead of `i`, like so:

--- 8< ---
diff --git a/builtin/repack.c b/builtin/repack.c
index 9d36dc8b84..ba468ac44e 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -467,13 +467,15 @@ static struct packed_git *get_preferred_pack(struct pack_geometry *geometry)
 	if (geometry->split == geometry->pack_nr)
 		return NULL;

-	for (i = geometry->pack_nr; i > 0; i--)
+	for (i = geometry->pack_nr - 1; i >= 0; i--) {
 		/*
 		 * A pack that is not local would never be included in a
 		 * multi-pack index. We thus skip over any non-local packs.
 		 */
-		if (geometry->pack[i - 1]->pack_local)
-			return geometry->pack[i - 1];
+		struct packed_git *p = geometry->pack[i];
+		if (p->pack_local)
+			return p;
+	}

 	return NULL;
 }
--- >8 ---

but I'm not sure that the loop condition is quite right to begin with.
We don't want to iterate all the way down to the beginning of the pack
list, since some of those packs may be below the "split" line, IOW their
contents are going to be rolled up and the packs destroyed.

So I think the right loop condition would be:

    for (i = geometry->pack_nr - 1; i >= geometry->split; i--)

which I think makes the "if (geometry->split == geometry->pack_nr)"
condition above redundant with this loop, so you could probably drop
that.

I would definitely appreciate a second set of eye here. The pack *at*
the split line is considered frozen (IOW, we create a new pack
consisting of the packs in range [0, geometry->split), and leave the
packs in range [geometry->split, geometry->nr) alone).

So it should be fine to enter that loop when "i == geometry->split",
because it's OK to return that as a potential preferred pack.

> +test_expect_success '--geometric --write-midx with packfiles in main and alternate ODB' '
> +	test_when_finished "rm -fr shared member" &&

Looks familiar ;-).

> +	# Create a shared repository that will serve as the alternate object
> +	# database for the member linked to it. It has got some objects on its
> +	# own that are packed into a single packfile.
> +	git init shared &&
> +	test_commit -C shared common-object &&
> +	git -C shared repack -ad &&
> +
> +	# We create member so that its alternates file points to the shared
> +	# repository. We then create a commit in it so that git-repack(1) has
> +	# something to repack.
> +	# of the shared object database.
> +	git clone --shared shared member &&
> +	test_commit -C member unique-object &&
> +	git -C member repack --geometric=2 --write-midx 2>err &&
> +	test_must_be_empty err &&
> +
> +	# We should see that a new packfile was generated.
> +	find shared/.git/objects/pack -type f -name "*.pack" | sort >packs &&

Do we need a sort here? I don't think we rely on the order anywhere, and
it looks like we only care that there is a single entry.

If that's the case, I think we could replace the wc invocation here
with:

    test_line_count = 1 packs

> +	test $(wc -l <packs) = 1 &&
> +
> +	# We should also see a multi-pack-index. This multi-pack-index should
> +	# never refer to any packfiles in the alternate object database.
> +	# Consequentially, it should be valid even with the alternate ODB
> +	# deleted.
> +	rm -r shared &&
> +	git -C member multi-pack-index verify

To test this even more directly, I think that you could ensure that the
set of packs (which should just be the single entry found in "packs"
above) matches what we expect. That should look something like:

    test-tool read-midx member/.git/objects >packs.midx &&
    grep "^pack-.*\.idx$" packs.midx | sort >actual &&
    xargs -n 1 basename <packs | sort >expect
    test_cmp expect actual

Note that the read-midx test helper outputs packs with the "*.idx"
suffix, so you'd probably want to alter your find invocation a few lines
above accordingly, or rewrite it before writing into "actual".

Alternatively, since we expect there to be just a single pack here, I
think we could just as easily write something like:

    test-tool read-midx member/.git/objects >packs.midx &&
    grep "^pack-.*\.idx$" packs.midx >actual &&
    test_line_count = 1 actual &&
    grep $(cat actual) packs

though I slightly prefer the former since it is less brittle to changing
this test. Either way, though.

Thanks,
Taylor
Patrick Steinhardt April 13, 2023, 9:31 a.m. UTC | #2
On Wed, Apr 12, 2023 at 02:37:53PM -0400, Taylor Blau wrote:
> On Wed, Apr 12, 2023 at 12:22:35PM +0200, Patrick Steinhardt wrote:
> > When doing a geometric repack with multi-pack-indices, then we ask
> > git-multi-pack-index(1) to use the largest packfile as the preferred
> > pack. It can happen though that the largest packfile is not part of the
> > main object database, but instead part of an alternate object database.
> > The result is that git-multi-pack-index(1) will not be able to find the
> > preferred pack and print a warning. It then falls back to use the first
> > packfile that the multi-pack-index shall reference.
> >
> > Fix this bug by only considering packfiles as preferred pack that are
> > local. This is the right thing to do given that a multi-pack-index
> > should never reference packfiles borrowed from an alternate.
> >
> > While at it, rename the function `get_largest_active_packfile()` to
> > `get_preferred_pack()` to better document its intent.
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> 
> > @@ -464,7 +466,16 @@ static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry
> >  	}
> >  	if (geometry->split == geometry->pack_nr)
> >  		return NULL;
> > -	return geometry->pack[geometry->pack_nr - 1];
> > +
> > +	for (i = geometry->pack_nr; i > 0; i--)
> > +		/*
> > +		 * A pack that is not local would never be included in a
> > +		 * multi-pack index. We thus skip over any non-local packs.
> > +		 */
> > +		if (geometry->pack[i - 1]->pack_local)
> > +			return geometry->pack[i - 1];
> > +
> > +	return NULL;
> >  }
> 
> Looking good, we want to go through this list in reverse order, since
> the packs are ordered largest to smallest. I think that you could
> slightly rewrite the loop condition to avoid having to always access
> `geometry->pack` at `i-1` instead of `i`, like so:
> 
> --- 8< ---
> diff --git a/builtin/repack.c b/builtin/repack.c
> index 9d36dc8b84..ba468ac44e 100644
> --- a/builtin/repack.c
> +++ b/builtin/repack.c
> @@ -467,13 +467,15 @@ static struct packed_git *get_preferred_pack(struct pack_geometry *geometry)
>  	if (geometry->split == geometry->pack_nr)
>  		return NULL;
> 
> -	for (i = geometry->pack_nr; i > 0; i--)
> +	for (i = geometry->pack_nr - 1; i >= 0; i--) {
>  		/*
>  		 * A pack that is not local would never be included in a
>  		 * multi-pack index. We thus skip over any non-local packs.
>  		 */
> -		if (geometry->pack[i - 1]->pack_local)
> -			return geometry->pack[i - 1];
> +		struct packed_git *p = geometry->pack[i];
> +		if (p->pack_local)
> +			return p;
> +	}
> 
>  	return NULL;
>  }
> --- >8 ---

There's a gotcha: `i` is an `unit32_t`, so `i >= 0` would always be true
and thus we wrap around and would try to access the pack array at an
out-of-bound index.

> but I'm not sure that the loop condition is quite right to begin with.
> We don't want to iterate all the way down to the beginning of the pack
> list, since some of those packs may be below the "split" line, IOW their
> contents are going to be rolled up and the packs destroyed.
> 
> So I think the right loop condition would be:
> 
>     for (i = geometry->pack_nr - 1; i >= geometry->split; i--)
> 
> which I think makes the "if (geometry->split == geometry->pack_nr)"
> condition above redundant with this loop, so you could probably drop
> that.
> 
> I would definitely appreciate a second set of eye here. The pack *at*
> the split line is considered frozen (IOW, we create a new pack
> consisting of the packs in range [0, geometry->split), and leave the
> packs in range [geometry->split, geometry->nr) alone).
> 
> So it should be fine to enter that loop when "i == geometry->split",
> because it's OK to return that as a potential preferred pack.

That makes sense indeed. Will amend.

[snip]
> > +	test $(wc -l <packs) = 1 &&
> > +
> > +	# We should also see a multi-pack-index. This multi-pack-index should
> > +	# never refer to any packfiles in the alternate object database.
> > +	# Consequentially, it should be valid even with the alternate ODB
> > +	# deleted.
> > +	rm -r shared &&
> > +	git -C member multi-pack-index verify
> 
> To test this even more directly, I think that you could ensure that the
> set of packs (which should just be the single entry found in "packs"
> above) matches what we expect. That should look something like:
> 
>     test-tool read-midx member/.git/objects >packs.midx &&
>     grep "^pack-.*\.idx$" packs.midx | sort >actual &&
>     xargs -n 1 basename <packs | sort >expect
>     test_cmp expect actual
> 
> Note that the read-midx test helper outputs packs with the "*.idx"
> suffix, so you'd probably want to alter your find invocation a few lines
> above accordingly, or rewrite it before writing into "actual".
> 
> Alternatively, since we expect there to be just a single pack here, I
> think we could just as easily write something like:
> 
>     test-tool read-midx member/.git/objects >packs.midx &&
>     grep "^pack-.*\.idx$" packs.midx >actual &&
>     test_line_count = 1 actual &&
>     grep $(cat actual) packs
> 
> though I slightly prefer the former since it is less brittle to changing
> this test. Either way, though.

Thanks, I've adopted that approach with a slight modification.

Patrick
diff mbox series

Patch

diff --git a/builtin/repack.c b/builtin/repack.c
index df4d8e0f0b..9d36dc8b84 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -448,8 +448,10 @@  static void split_pack_geometry(struct pack_geometry *geometry, int factor)
 	geometry->split = split;
 }
 
-static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry)
+static struct packed_git *get_preferred_pack(struct pack_geometry *geometry)
 {
+	uint32_t i;
+
 	if (!geometry) {
 		/*
 		 * No geometry means either an all-into-one repack (in which
@@ -464,7 +466,16 @@  static struct packed_git *get_largest_active_pack(struct pack_geometry *geometry
 	}
 	if (geometry->split == geometry->pack_nr)
 		return NULL;
-	return geometry->pack[geometry->pack_nr - 1];
+
+	for (i = geometry->pack_nr; i > 0; i--)
+		/*
+		 * A pack that is not local would never be included in a
+		 * multi-pack index. We thus skip over any non-local packs.
+		 */
+		if (geometry->pack[i - 1]->pack_local)
+			return geometry->pack[i - 1];
+
+	return NULL;
 }
 
 static void clear_pack_geometry(struct pack_geometry *geometry)
@@ -591,7 +602,7 @@  static int write_midx_included_packs(struct string_list *include,
 {
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	struct string_list_item *item;
-	struct packed_git *largest = get_largest_active_pack(geometry);
+	struct packed_git *preferred = get_preferred_pack(geometry);
 	FILE *in;
 	int ret;
 
@@ -612,9 +623,9 @@  static int write_midx_included_packs(struct string_list *include,
 	if (write_bitmaps)
 		strvec_push(&cmd.args, "--bitmap");
 
-	if (largest)
+	if (preferred)
 		strvec_pushf(&cmd.args, "--preferred-pack=%s",
-			     pack_basename(largest));
+			     pack_basename(preferred));
 
 	if (refs_snapshot)
 		strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 8821fbd2dd..92a1aaa754 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -281,4 +281,35 @@  test_expect_success '--geometric with pack.packSizeLimit' '
 	)
 '
 
+test_expect_success '--geometric --write-midx with packfiles in main and alternate ODB' '
+	test_when_finished "rm -fr shared member" &&
+
+	# Create a shared repository that will serve as the alternate object
+	# database for the member linked to it. It has got some objects on its
+	# own that are packed into a single packfile.
+	git init shared &&
+	test_commit -C shared common-object &&
+	git -C shared repack -ad &&
+
+	# We create member so that its alternates file points to the shared
+	# repository. We then create a commit in it so that git-repack(1) has
+	# something to repack.
+	# of the shared object database.
+	git clone --shared shared member &&
+	test_commit -C member unique-object &&
+	git -C member repack --geometric=2 --write-midx 2>err &&
+	test_must_be_empty err &&
+
+	# We should see that a new packfile was generated.
+	find shared/.git/objects/pack -type f -name "*.pack" | sort >packs &&
+	test $(wc -l <packs) = 1 &&
+
+	# We should also see a multi-pack-index. This multi-pack-index should
+	# never refer to any packfiles in the alternate object database.
+	# Consequentially, it should be valid even with the alternate ODB
+	# deleted.
+	rm -r shared &&
+	git -C member multi-pack-index verify
+'
+
 test_done