diff mbox series

[v2,7/8] repack: implement `--filter-to` for storing filtered out objects

Message ID 20230705060812.2865188-8-christian.couder@gmail.com (mailing list archive)
State Superseded
Headers show
Series Repack objects into separate packfiles based on a filter | expand

Commit Message

Christian Couder July 5, 2023, 6:08 a.m. UTC
A previous commit has implemented `git repack --filter=<filter-spec>` to
allow users to filter out some objects from the main pack and move them
into a new different pack.

It would be nice if this new different pack could be created in a
different directory than the regular pack. This would make it possible
to move large blobs into a pack on a different kind of storage, for
example cheaper storage. Even in a different directory this pack can be
accessible if, for example, the Git alternates mechanism is used to
point to it.

While at it, as an example to show that `--filter` and `--filter-to`
work well with other options, let's also add a test to check that these
options work well with `--max-pack-size`.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>

repack: add test with --max-pack-size
---
 Documentation/git-repack.txt |  6 ++++
 builtin/repack.c             | 11 +++++-
 t/t7700-repack.sh            | 66 ++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+), 1 deletion(-)

Comments

Junio C Hamano July 5, 2023, 6:26 p.m. UTC | #1
Christian Couder <christian.couder@gmail.com> writes:

> A previous commit has implemented `git repack --filter=<filter-spec>` to
> allow users to filter out some objects from the main pack and move them
> into a new different pack.

OK, this sidesteps the question I had on an earlier step rather
nicely.  Instead of having to find out which ones are to be moved
away, just generating them in a separate location would be more
straight forward.

The implementation does not seem to restrict where --filter-to
directory can be placed, but shouldn't it make sure that it is one
of the already specified alternates directories?  Otherwise the user
will end up corrupting the repository, no?
Christian Couder July 24, 2023, 9 a.m. UTC | #2
On Wed, Jul 5, 2023 at 8:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > A previous commit has implemented `git repack --filter=<filter-spec>` to
> > allow users to filter out some objects from the main pack and move them
> > into a new different pack.
>
> OK, this sidesteps the question I had on an earlier step rather
> nicely.  Instead of having to find out which ones are to be moved
> away, just generating them in a separate location would be more
> straight forward.
>
> The implementation does not seem to restrict where --filter-to
> directory can be placed, but shouldn't it make sure that it is one
> of the already specified alternates directories?  Otherwise the user
> will end up corrupting the repository, no?

I don't think it should make sure that the implementation should
restrict where the --filter-to directory can be placed.

In version 3, that I just sent, I have written the following in the
commit message to explain this:

"
   Even in a different directory, this pack can be accessible if, for
   example, the Git alternates mechanism is used to point to it. In fact
   not using the Git alternates mechanism can corrupt a repo as the
   generated pack containing the filtered objects might not be accessible
   from the repo any more. So setting up the Git alternates mechanism
   should be done before using this feature if the user wants the repo to
   be fully usable while this feature is used.

   In some cases, like when a repo has just been cloned or when there is no
   other activity in the repo, it's Ok to setup the Git alternates
   mechanism afterwards though. It's also Ok to just inspect the generated
   packfile containing the filtered objects and then just move it into the
   '.git/objects/pack/' directory manually. That's why it's not necessary
   for this command to check that the Git alternates mechanism has been
   already setup.
"

I haven't mentioned cases related to promisor remotes, but I think in
some of those cases the feature can be very useful too while there is
no need to check that the Git alternates mechanism has been set up.

In version 3, the doc for the --filter-to option and the corresponding
gc.repackFilterTo config flag look like this:

+--filter-to=<dir>::
+       Write the pack containing filtered out objects to the
+       directory `<dir>`. Only useful with `--filter`. This can be
+       used for putting the pack on a separate object directory that
+       is accessed through the Git alternates mechanism. **WARNING:**
+       If the packfile containing the filtered out objects is not
+       accessible, the repo could be considered corrupt by Git as it
+       migh not be able to access the objects in that packfile. See
+       the `objects` and `objects/info/alternates` sections of
+       linkgit:gitrepository-layout[5].

+gc.repackFilterTo::
+       When repacking and using a filter, see `gc.repackFilter`, the
+       specified location will be used to create the packfile
+       containing the filtered out objects. **WARNING:** The
+       specified location should be accessible, using for example the
+       Git alternates mechanism, otherwise the repo could be
+       considered corrupt by Git as it might not be able to access the
+       objects in that packfile. See the `--filter-to=<dir>` option
+       of linkgit:git-repack[1] and the `objects/info/alternates`
+       section of linkgit:gitrepository-layout[5].

So they warn about possible issues with the feature and link to some
relevant doc.

Now if we think that it's not enough, I would implement a check in the
code that would warn users loudly if the directory specified by those
options is not accessible using the Git alternates mechanism. It would
be annoying I think that it would be too restrictive to error out in
that case though.
Junio C Hamano July 24, 2023, 6:18 p.m. UTC | #3
Christian Couder <christian.couder@gmail.com> writes:

> In version 3, the doc for the --filter-to option and the corresponding
> gc.repackFilterTo config flag look like this:
>
> +--filter-to=<dir>::
> +       Write the pack containing filtered out objects to the
> +       directory `<dir>`. Only useful with `--filter`. This can be
> +       used for putting the pack on a separate object directory that
> +       is accessed through the Git alternates mechanism. **WARNING:**
> +       If the packfile containing the filtered out objects is not
> +       accessible, the repo could be considered corrupt by Git as it

"could be considered" -> "can become".

> +       migh not be able to access the objects in that packfile. See

"migh" -> "might".

> +       the `objects` and `objects/info/alternates` sections of
> +       linkgit:gitrepository-layout[5].
>
> +gc.repackFilterTo::
> +       When repacking and using a filter, see `gc.repackFilter`, the
> +       specified location will be used to create the packfile
> +       containing the filtered out objects. **WARNING:** The
> +       specified location should be accessible, using for example the
> +       Git alternates mechanism, otherwise the repo could be
> +       considered corrupt by Git as it might not be able to access the
> +       objects in that packfile. See the `--filter-to=<dir>` option
> +       of linkgit:git-repack[1] and the `objects/info/alternates`
> +       section of linkgit:gitrepository-layout[5].
>
> So they warn about possible issues with the feature and link to some
> relevant doc.

In all other parts of the system, we tend to avoid such an "unsafe
by default" desgin, especially when the risk is known before there
is an implementation, and instead allow an explicit end-user action
(ranging from command line option to interactive confirmation) to
opt-into more risky behaviour.  Should we consider --filter-to as
such an "always risky and prone to repository corruption" option
(just like "--hard" to "reset" is always loses changes in the
working tree without warning)?

I am OK with that myself, but others may disagree.

Come to think of it, we haven't seen much reviews from those other
than Taylor.  Are folks content with the direction this series is
going in general?

Thanks.
Robert Coup July 25, 2023, 1:41 p.m. UTC | #4
Hi Junio,

On Mon, 24 Jul 2023 at 19:18, Junio C Hamano <gitster@pobox.com> wrote:
> Come to think of it, we haven't seen much reviews from those other
> than Taylor.  Are folks content with the direction this series is
> going in general?

For what it's worth I have a medium-term plan similar to Gitlab's with
respect to moving chunks of repositories onto lower cost storage media
& to promisor remotes. Like others, I wasn't at all sure about the
original approach (and commented at the time). What Christian is
proposing here seems much cleaner, is usable without complex
gymnastics or safety equipment, and provides a better building block
for future work.

Rob :)
Christian Couder July 25, 2023, 3:45 p.m. UTC | #5
On Mon, Jul 24, 2023 at 8:18 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > In version 3, the doc for the --filter-to option and the corresponding
> > gc.repackFilterTo config flag look like this:
> >
> > +--filter-to=<dir>::
> > +       Write the pack containing filtered out objects to the
> > +       directory `<dir>`. Only useful with `--filter`. This can be
> > +       used for putting the pack on a separate object directory that
> > +       is accessed through the Git alternates mechanism. **WARNING:**
> > +       If the packfile containing the filtered out objects is not
> > +       accessible, the repo could be considered corrupt by Git as it
>
> "could be considered" -> "can become".
>
> > +       migh not be able to access the objects in that packfile. See
>
> "migh" -> "might".
>
> > +       the `objects` and `objects/info/alternates` sections of
> > +       linkgit:gitrepository-layout[5].

Thanks for catching these, they are fixed in my current version.
Junio C Hamano July 25, 2023, 4:50 p.m. UTC | #6
Robert Coup <robert.coup@koordinates.com> writes:

> ... Like others, I wasn't at all sure about the
> original approach (and commented at the time). What Christian is
> proposing here seems much cleaner, is usable without complex
> gymnastics or safety equipment, and provides a better building block
> for future work.

Nice to hear a positive feedback [*].

Thanks.

[Footnote]

 * Of course negative ones as long as they are consturctive are also
   welcome;-)
diff mbox series

Patch

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index d702553033..396a91b9ac 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -152,6 +152,12 @@  depth is 4095.
 	this option.  See linkgit:git-rev-list[1] for valid
 	`<filter-spec>` forms.
 
+--filter-to=<dir>::
+	Write the pack containing filtered out objects to the
+	directory `<dir>`. This can be used for putting the pack on a
+	separate object directory that is accessed through the Git
+	alternates mechanism. Only useful with `--filter`.
+
 -b::
 --write-bitmap-index::
 	Write a reachability bitmap index as part of the repack. This
diff --git a/builtin/repack.c b/builtin/repack.c
index e2661b956c..5695f9734d 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -879,6 +879,7 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	int write_midx = 0;
 	const char *cruft_expiration = NULL;
 	const char *expire_to = NULL;
+	const char *filter_to = NULL;
 
 	struct option builtin_repack_options[] = {
 		OPT_BIT('a', NULL, &pack_everything,
@@ -932,6 +933,8 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 			   N_("write a multi-pack index of the resulting packs")),
 		OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
 			   N_("pack prefix to store a pack containing pruned objects")),
+		OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
+			   N_("pack prefix to store a pack containing filtered out objects")),
 		OPT_END()
 	};
 
@@ -1075,6 +1078,9 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 		strvec_push(&cmd.args, "--incremental");
 	}
 
+	if (filter_to && !po_args.filter)
+		die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
+
 	if (geometry)
 		cmd.in = -1;
 	else
@@ -1162,8 +1168,11 @@  int cmd_repack(int argc, const char **argv, const char *prefix)
 	}
 
 	if (po_args.filter) {
+		if (!filter_to)
+			filter_to = packtmp;
+
 		ret = write_filtered_pack(&po_args,
-					  packtmp,
+					  filter_to,
 					  find_pack_prefix(),
 					  &names,
 					  &existing_nonkept_packs,
diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh
index 66589e4217..a96c1635b2 100755
--- a/t/t7700-repack.sh
+++ b/t/t7700-repack.sh
@@ -309,6 +309,72 @@  test_expect_success 'repacking with a filter works' '
 	test "$blob_pack2" = "$blob_pack"
 '
 
+test_expect_success '--filter-to stores filtered out objects' '
+	git -C bare.git repack -a -d &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack &&
+
+	git init --bare filtered.git &&
+	git -C bare.git -c repack.writebitmaps=false repack -a -d \
+		--filter=blob:none \
+		--filter-to=../filtered.git/objects/pack/pack &&
+	test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack &&
+	test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack &&
+
+	commit_pack=$(test-tool -C bare.git find-pack HEAD) &&
+	test -n "$commit_pack" &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -z "$blob_pack" &&
+	blob_hash=$(git -C bare.git rev-parse HEAD:file1) &&
+	test -n "$blob_hash" &&
+	blob_pack=$(test-tool -C filtered.git find-pack $blob_hash) &&
+	test -n "$blob_pack" &&
+
+	echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates &&
+	blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) &&
+	test -n "$blob_pack" &&
+	blob_content=$(git -C bare.git show $blob_hash) &&
+	test "$blob_content" = "content1"
+'
+
+test_expect_success '--filter works with --max-pack-size' '
+	rm -rf filtered.git &&
+	git init --bare filtered.git &&
+	git init max-pack-size &&
+	(
+		cd max-pack-size &&
+		test_commit base &&
+		# two blobs which exceed the maximum pack size
+		test-tool genrandom foo 1048576 >foo &&
+		git hash-object -w foo &&
+		test-tool genrandom bar 1048576 >bar &&
+		git hash-object -w bar &&
+		git add foo bar &&
+		git commit -m "adding foo and bar"
+	) &&
+	git clone --no-local --bare max-pack-size max-pack-size.git &&
+	(
+		cd max-pack-size.git &&
+		git -c repack.writebitmaps=false repack -a -d --filter=blob:none \
+			--max-pack-size=1M \
+			--filter-to=../filtered.git/objects/pack/pack &&
+		echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates &&
+
+		# Check that the 3 blobs are in different packfiles in filtered.git
+		test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack &&
+		test_stdout_line_count = 1 ls objects/pack/pack-*.pack &&
+		foo_pack=$(test-tool find-pack HEAD:foo) &&
+		bar_pack=$(test-tool find-pack HEAD:bar) &&
+		base_pack=$(test-tool find-pack HEAD:base.t) &&
+		test "$foo_pack" != "$bar_pack" &&
+		test "$foo_pack" != "$base_pack" &&
+		test "$bar_pack" != "$base_pack" &&
+		for pack in "$foo_pack" "$bar_pack" "$base_pack"
+		do
+			case "$foo_pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac
+		done
+	)
+'
+
 objdir=.git/objects
 midx=$objdir/pack/multi-pack-index