diff mbox series

[v2,4/8] pack-objects: fix error when packing same pack twice

Message ID e7d30fd22c65b33defb944bd043a56d0c525f875.1681294715.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series repack: fix geometric repacking with gitalternates | expand

Commit Message

Patrick Steinhardt April 12, 2023, 10:22 a.m. UTC
When passed the same packfile twice via `--stdin-packs` we return an
error that the packfile supposedly was not found. This is because when
reading packs into the list of included or excluded packfiles, we will
happily re-add packfiles even if they are part of the lists already. And
while the list can now contain duplicates, we will only set the `util`
pointer of the first list entry to the `packed_git` structure. We notice
that at a later point when checking that all list entries have their
`util` pointer set and die with an error.

While this is kind of a nonsensical request, this scenario can be hit
when doing geometric repacks. When a repository is connected to an
alternate object directory and both have the exact same packfile then
both would get added to the geometric sequence. And when we then decide
to perform the repack, we will invoke git-pack-objects(1) with the same
packfile twice.

Fix this bug by removing any duplicates from both the included and
excluded packs.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/pack-objects.c        |  2 ++
 t/t5331-pack-objects-stdin.sh | 29 +++++++++++++++++++++++++++++
 t/t7703-repack-geometric.sh   | 25 +++++++++++++++++++++++++
 3 files changed, 56 insertions(+)
 create mode 100755 t/t5331-pack-objects-stdin.sh

Comments

Taylor Blau April 12, 2023, 9:33 p.m. UTC | #1
On Wed, Apr 12, 2023 at 12:22:44PM +0200, Patrick Steinhardt wrote:
> When passed the same packfile twice via `--stdin-packs` we return an
> error that the packfile supposedly was not found. This is because when
> reading packs into the list of included or excluded packfiles, we will
> happily re-add packfiles even if they are part of the lists already. And
> while the list can now contain duplicates, we will only set the `util`
> pointer of the first list entry to the `packed_git` structure. We notice
> that at a later point when checking that all list entries have their
> `util` pointer set and die with an error.

Well explained.

> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  builtin/pack-objects.c        |  2 ++
>  t/t5331-pack-objects-stdin.sh | 29 +++++++++++++++++++++++++++++
>  t/t7703-repack-geometric.sh   | 25 +++++++++++++++++++++++++
>  3 files changed, 56 insertions(+)
>  create mode 100755 t/t5331-pack-objects-stdin.sh
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 77d88f85b0..fdf3f440be 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -3359,7 +3359,9 @@ static void read_packs_list_from_stdin(void)
>  	}
>
>  	string_list_sort(&include_packs);
> +	string_list_remove_duplicates(&include_packs, 0);
>  	string_list_sort(&exclude_packs);
> +	string_list_remove_duplicates(&exclude_packs, 0);

I for some reason thought that string_list_remove_duplicates() sorted
its first argument as a side-effect, but double checking on it, that
isn't the case. So I agree that calling string_list_remove_duplicates()
is the right thing to do here.

> diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
> new file mode 100755
> index 0000000000..ab34cfc729
> --- /dev/null
> +++ b/t/t5331-pack-objects-stdin.sh

Cool, I am glad that we are moving some of these tests out of t5300. We
could probably go a little bit further, since there are a handful of
purely `--stdin-packs`-related tests towards the bottom of that script
that could or should also be moved out.

We could also always do it later, but I wouldn't be sad to see an extra
commit before this one that introduces t5331 and moves the last six or
so `--stdin-packs` tests out of t5300. If nothing else, it would be
nice to be able to run the tests for just that feature without having to
skip a bunch of other unrelated tests.

Luckily, those tests don't depend on any of the earlier ones or setup
tests, so they should be able to move with a pure cut and paste.

> @@ -0,0 +1,29 @@
> +#!/bin/sh
> +
> +test_description='pack-objects --stdin'
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +TEST_PASSES_SANITIZE_LEAK=true
> +. ./test-lib.sh
> +
> +test_expect_success 'pack-objects --stdin with duplicate packfile' '
> +	test_when_finished "rm -fr repo" &&
> +
> +	git init repo &&
> +	(
> +		cd repo &&
> +		test_commit "commit" &&
> +		git repack -ad &&
> +
> +		(
> +			basename .git/objects/pack/pack-*.pack &&
> +			basename .git/objects/pack/pack-*.pack
> +		) >packfiles &&
> +
> +		git pack-objects --stdin-packs generated-pack <packfiles &&
> +		test_cmp generated-pack-*.pack .git/objects/pack/pack-*.pack

I think that in practice that test_cmp'ing these two together will be
just fine, but if you wanted to be extra careful, you could grab their
object lists and compare those two together. That would allow you to
continue passing this test without needing to worry about if you
generated the same logical content with slightly different on-disk
content.

Thanks,
Taylor
diff mbox series

Patch

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 77d88f85b0..fdf3f440be 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -3359,7 +3359,9 @@  static void read_packs_list_from_stdin(void)
 	}
 
 	string_list_sort(&include_packs);
+	string_list_remove_duplicates(&include_packs, 0);
 	string_list_sort(&exclude_packs);
+	string_list_remove_duplicates(&exclude_packs, 0);
 
 	for (p = get_all_packs(the_repository); p; p = p->next) {
 		const char *pack_name = pack_basename(p);
diff --git a/t/t5331-pack-objects-stdin.sh b/t/t5331-pack-objects-stdin.sh
new file mode 100755
index 0000000000..ab34cfc729
--- /dev/null
+++ b/t/t5331-pack-objects-stdin.sh
@@ -0,0 +1,29 @@ 
+#!/bin/sh
+
+test_description='pack-objects --stdin'
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+TEST_PASSES_SANITIZE_LEAK=true
+. ./test-lib.sh
+
+test_expect_success 'pack-objects --stdin with duplicate packfile' '
+	test_when_finished "rm -fr repo" &&
+
+	git init repo &&
+	(
+		cd repo &&
+		test_commit "commit" &&
+		git repack -ad &&
+
+		(
+			basename .git/objects/pack/pack-*.pack &&
+			basename .git/objects/pack/pack-*.pack
+		) >packfiles &&
+
+		git pack-objects --stdin-packs generated-pack <packfiles &&
+		test_cmp generated-pack-*.pack .git/objects/pack/pack-*.pack
+	)
+'
+
+test_done
diff --git a/t/t7703-repack-geometric.sh b/t/t7703-repack-geometric.sh
index 9e95350cbf..0a2f2bd46c 100755
--- a/t/t7703-repack-geometric.sh
+++ b/t/t7703-repack-geometric.sh
@@ -336,4 +336,29 @@  test_expect_success '--geometric --with-midx with no local objects' '
 	test_must_be_empty actual
 '
 
+test_expect_success '--geometric with same pack in main and alternate ODB' '
+	test_when_finished "rm -fr shared member" &&
+
+	# Create a repository with a single packfile that acts as alternate
+	# object database.
+	git init shared &&
+	test_commit -C shared "shared-objects" &&
+	git -C shared repack -ad &&
+
+	# We create the member repository as an exact copy so that it has the
+	# same packfile.
+	cp -r shared member &&
+	test-tool path-utils real_path shared/.git/objects >member/.git/objects/info/alternates &&
+	find shared/.git/objects -type f >expected-files &&
+
+	# Verify that we can repack objects as expected without observing any
+	# error. Having the same packfile in both ODBs used to cause an error
+	# in git-pack-objects(1).
+	git -C member repack --geometric 2 2>err &&
+	test_must_be_empty err &&
+	# Nothing should have changed.
+	find shared/.git/objects -type f >actual-files &&
+	test_cmp expected-files actual-files
+'
+
 test_done