Message ID | 20230705060812.2865188-8-christian.couder@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Repack objects into separate packfiles based on a filter | expand |
Christian Couder <christian.couder@gmail.com> writes: > A previous commit has implemented `git repack --filter=<filter-spec>` to > allow users to filter out some objects from the main pack and move them > into a new different pack. OK, this sidesteps the question I had on an earlier step rather nicely. Instead of having to find out which ones are to be moved away, just generating them in a separate location would be more straight forward. The implementation does not seem to restrict where --filter-to directory can be placed, but shouldn't it make sure that it is one of the already specified alternates directories? Otherwise the user will end up corrupting the repository, no?
On Wed, Jul 5, 2023 at 8:26 PM Junio C Hamano <gitster@pobox.com> wrote: > > Christian Couder <christian.couder@gmail.com> writes: > > > A previous commit has implemented `git repack --filter=<filter-spec>` to > > allow users to filter out some objects from the main pack and move them > > into a new different pack. > > OK, this sidesteps the question I had on an earlier step rather > nicely. Instead of having to find out which ones are to be moved > away, just generating them in a separate location would be more > straight forward. > > The implementation does not seem to restrict where --filter-to > directory can be placed, but shouldn't it make sure that it is one > of the already specified alternates directories? Otherwise the user > will end up corrupting the repository, no? I don't think it should make sure that the implementation should restrict where the --filter-to directory can be placed. In version 3, that I just sent, I have written the following in the commit message to explain this: " Even in a different directory, this pack can be accessible if, for example, the Git alternates mechanism is used to point to it. In fact not using the Git alternates mechanism can corrupt a repo as the generated pack containing the filtered objects might not be accessible from the repo any more. So setting up the Git alternates mechanism should be done before using this feature if the user wants the repo to be fully usable while this feature is used. In some cases, like when a repo has just been cloned or when there is no other activity in the repo, it's Ok to setup the Git alternates mechanism afterwards though. It's also Ok to just inspect the generated packfile containing the filtered objects and then just move it into the '.git/objects/pack/' directory manually. That's why it's not necessary for this command to check that the Git alternates mechanism has been already setup. " I haven't mentioned cases related to promisor remotes, but I think in some of those cases the feature can be very useful too while there is no need to check that the Git alternates mechanism has been set up. In version 3, the doc for the --filter-to option and the corresponding gc.repackFilterTo config flag look like this: +--filter-to=<dir>:: + Write the pack containing filtered out objects to the + directory `<dir>`. Only useful with `--filter`. This can be + used for putting the pack on a separate object directory that + is accessed through the Git alternates mechanism. **WARNING:** + If the packfile containing the filtered out objects is not + accessible, the repo could be considered corrupt by Git as it + migh not be able to access the objects in that packfile. See + the `objects` and `objects/info/alternates` sections of + linkgit:gitrepository-layout[5]. +gc.repackFilterTo:: + When repacking and using a filter, see `gc.repackFilter`, the + specified location will be used to create the packfile + containing the filtered out objects. **WARNING:** The + specified location should be accessible, using for example the + Git alternates mechanism, otherwise the repo could be + considered corrupt by Git as it might not be able to access the + objects in that packfile. See the `--filter-to=<dir>` option + of linkgit:git-repack[1] and the `objects/info/alternates` + section of linkgit:gitrepository-layout[5]. So they warn about possible issues with the feature and link to some relevant doc. Now if we think that it's not enough, I would implement a check in the code that would warn users loudly if the directory specified by those options is not accessible using the Git alternates mechanism. It would be annoying I think that it would be too restrictive to error out in that case though.
Christian Couder <christian.couder@gmail.com> writes: > In version 3, the doc for the --filter-to option and the corresponding > gc.repackFilterTo config flag look like this: > > +--filter-to=<dir>:: > + Write the pack containing filtered out objects to the > + directory `<dir>`. Only useful with `--filter`. This can be > + used for putting the pack on a separate object directory that > + is accessed through the Git alternates mechanism. **WARNING:** > + If the packfile containing the filtered out objects is not > + accessible, the repo could be considered corrupt by Git as it "could be considered" -> "can become". > + migh not be able to access the objects in that packfile. See "migh" -> "might". > + the `objects` and `objects/info/alternates` sections of > + linkgit:gitrepository-layout[5]. > > +gc.repackFilterTo:: > + When repacking and using a filter, see `gc.repackFilter`, the > + specified location will be used to create the packfile > + containing the filtered out objects. **WARNING:** The > + specified location should be accessible, using for example the > + Git alternates mechanism, otherwise the repo could be > + considered corrupt by Git as it might not be able to access the > + objects in that packfile. See the `--filter-to=<dir>` option > + of linkgit:git-repack[1] and the `objects/info/alternates` > + section of linkgit:gitrepository-layout[5]. > > So they warn about possible issues with the feature and link to some > relevant doc. In all other parts of the system, we tend to avoid such an "unsafe by default" desgin, especially when the risk is known before there is an implementation, and instead allow an explicit end-user action (ranging from command line option to interactive confirmation) to opt-into more risky behaviour. Should we consider --filter-to as such an "always risky and prone to repository corruption" option (just like "--hard" to "reset" is always loses changes in the working tree without warning)? I am OK with that myself, but others may disagree. Come to think of it, we haven't seen much reviews from those other than Taylor. Are folks content with the direction this series is going in general? Thanks.
Hi Junio, On Mon, 24 Jul 2023 at 19:18, Junio C Hamano <gitster@pobox.com> wrote: > Come to think of it, we haven't seen much reviews from those other > than Taylor. Are folks content with the direction this series is > going in general? For what it's worth I have a medium-term plan similar to Gitlab's with respect to moving chunks of repositories onto lower cost storage media & to promisor remotes. Like others, I wasn't at all sure about the original approach (and commented at the time). What Christian is proposing here seems much cleaner, is usable without complex gymnastics or safety equipment, and provides a better building block for future work. Rob :)
On Mon, Jul 24, 2023 at 8:18 PM Junio C Hamano <gitster@pobox.com> wrote: > > Christian Couder <christian.couder@gmail.com> writes: > > > In version 3, the doc for the --filter-to option and the corresponding > > gc.repackFilterTo config flag look like this: > > > > +--filter-to=<dir>:: > > + Write the pack containing filtered out objects to the > > + directory `<dir>`. Only useful with `--filter`. This can be > > + used for putting the pack on a separate object directory that > > + is accessed through the Git alternates mechanism. **WARNING:** > > + If the packfile containing the filtered out objects is not > > + accessible, the repo could be considered corrupt by Git as it > > "could be considered" -> "can become". > > > + migh not be able to access the objects in that packfile. See > > "migh" -> "might". > > > + the `objects` and `objects/info/alternates` sections of > > + linkgit:gitrepository-layout[5]. Thanks for catching these, they are fixed in my current version.
Robert Coup <robert.coup@koordinates.com> writes: > ... Like others, I wasn't at all sure about the > original approach (and commented at the time). What Christian is > proposing here seems much cleaner, is usable without complex > gymnastics or safety equipment, and provides a better building block > for future work. Nice to hear a positive feedback [*]. Thanks. [Footnote] * Of course negative ones as long as they are consturctive are also welcome;-)
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt index d702553033..396a91b9ac 100644 --- a/Documentation/git-repack.txt +++ b/Documentation/git-repack.txt @@ -152,6 +152,12 @@ depth is 4095. this option. See linkgit:git-rev-list[1] for valid `<filter-spec>` forms. +--filter-to=<dir>:: + Write the pack containing filtered out objects to the + directory `<dir>`. This can be used for putting the pack on a + separate object directory that is accessed through the Git + alternates mechanism. Only useful with `--filter`. + -b:: --write-bitmap-index:: Write a reachability bitmap index as part of the repack. This diff --git a/builtin/repack.c b/builtin/repack.c index e2661b956c..5695f9734d 100644 --- a/builtin/repack.c +++ b/builtin/repack.c @@ -879,6 +879,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix) int write_midx = 0; const char *cruft_expiration = NULL; const char *expire_to = NULL; + const char *filter_to = NULL; struct option builtin_repack_options[] = { OPT_BIT('a', NULL, &pack_everything, @@ -932,6 +933,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix) N_("write a multi-pack index of the resulting packs")), OPT_STRING(0, "expire-to", &expire_to, N_("dir"), N_("pack prefix to store a pack containing pruned objects")), + OPT_STRING(0, "filter-to", &filter_to, N_("dir"), + N_("pack prefix to store a pack containing filtered out objects")), OPT_END() }; @@ -1075,6 +1078,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix) strvec_push(&cmd.args, "--incremental"); } + if (filter_to && !po_args.filter) + die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter"); + if (geometry) cmd.in = -1; else @@ -1162,8 +1168,11 @@ int cmd_repack(int argc, const char **argv, const char *prefix) } if (po_args.filter) { + if (!filter_to) + filter_to = packtmp; + ret = write_filtered_pack(&po_args, - packtmp, + filter_to, find_pack_prefix(), &names, &existing_nonkept_packs, diff --git a/t/t7700-repack.sh b/t/t7700-repack.sh index 66589e4217..a96c1635b2 100755 --- a/t/t7700-repack.sh +++ b/t/t7700-repack.sh @@ -309,6 +309,72 @@ test_expect_success 'repacking with a filter works' ' test "$blob_pack2" = "$blob_pack" ' +test_expect_success '--filter-to stores filtered out objects' ' + git -C bare.git repack -a -d && + test_stdout_line_count = 1 ls bare.git/objects/pack/*.pack && + + git init --bare filtered.git && + git -C bare.git -c repack.writebitmaps=false repack -a -d \ + --filter=blob:none \ + --filter-to=../filtered.git/objects/pack/pack && + test_stdout_line_count = 1 ls bare.git/objects/pack/pack-*.pack && + test_stdout_line_count = 1 ls filtered.git/objects/pack/pack-*.pack && + + commit_pack=$(test-tool -C bare.git find-pack HEAD) && + test -n "$commit_pack" && + blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) && + test -z "$blob_pack" && + blob_hash=$(git -C bare.git rev-parse HEAD:file1) && + test -n "$blob_hash" && + blob_pack=$(test-tool -C filtered.git find-pack $blob_hash) && + test -n "$blob_pack" && + + echo $(pwd)/filtered.git/objects >bare.git/objects/info/alternates && + blob_pack=$(test-tool -C bare.git find-pack HEAD:file1) && + test -n "$blob_pack" && + blob_content=$(git -C bare.git show $blob_hash) && + test "$blob_content" = "content1" +' + +test_expect_success '--filter works with --max-pack-size' ' + rm -rf filtered.git && + git init --bare filtered.git && + git init max-pack-size && + ( + cd max-pack-size && + test_commit base && + # two blobs which exceed the maximum pack size + test-tool genrandom foo 1048576 >foo && + git hash-object -w foo && + test-tool genrandom bar 1048576 >bar && + git hash-object -w bar && + git add foo bar && + git commit -m "adding foo and bar" + ) && + git clone --no-local --bare max-pack-size max-pack-size.git && + ( + cd max-pack-size.git && + git -c repack.writebitmaps=false repack -a -d --filter=blob:none \ + --max-pack-size=1M \ + --filter-to=../filtered.git/objects/pack/pack && + echo $(cd .. && pwd)/filtered.git/objects >objects/info/alternates && + + # Check that the 3 blobs are in different packfiles in filtered.git + test_stdout_line_count = 3 ls ../filtered.git/objects/pack/pack-*.pack && + test_stdout_line_count = 1 ls objects/pack/pack-*.pack && + foo_pack=$(test-tool find-pack HEAD:foo) && + bar_pack=$(test-tool find-pack HEAD:bar) && + base_pack=$(test-tool find-pack HEAD:base.t) && + test "$foo_pack" != "$bar_pack" && + test "$foo_pack" != "$base_pack" && + test "$bar_pack" != "$base_pack" && + for pack in "$foo_pack" "$bar_pack" "$base_pack" + do + case "$foo_pack" in */filtered.git/objects/pack/*) true ;; *) return 1 ;; esac + done + ) +' + objdir=.git/objects midx=$objdir/pack/multi-pack-index
A previous commit has implemented `git repack --filter=<filter-spec>` to allow users to filter out some objects from the main pack and move them into a new different pack. It would be nice if this new different pack could be created in a different directory than the regular pack. This would make it possible to move large blobs into a pack on a different kind of storage, for example cheaper storage. Even in a different directory this pack can be accessible if, for example, the Git alternates mechanism is used to point to it. While at it, as an example to show that `--filter` and `--filter-to` work well with other options, let's also add a test to check that these options work well with `--max-pack-size`. Signed-off-by: Christian Couder <chriscool@tuxfamily.org> repack: add test with --max-pack-size --- Documentation/git-repack.txt | 6 ++++ builtin/repack.c | 11 +++++- t/t7700-repack.sh | 66 ++++++++++++++++++++++++++++++++++++ 3 files changed, 82 insertions(+), 1 deletion(-)