[0/2] repack: add --filter=

Message ID pull.1206.git.git.1643248180.gitgitgadget@gmail.com (mailing list archive)

Message

Philippe Blain via GitGitGadget Jan. 27, 2022, 1:49 a.m. UTC
This patch series aims to make partial clones more useful by allowing repack
to create packfiles with promisor objects. The longer vision is to be able
to use partial clones on a git server to offload large blobs to an http
server. We can then store large blobs on said http server, and use a remote
helper to grab these objects when necessary.

This is the first step in allowing a repack to honor a filter spec.
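
To make the idea of a filter spec concrete, here is a minimal sketch that lists the blobs a given spec would leave out of a pack. It only uses rev-list options that already exist (--filter and --filter-print-omitted), not the repack --filter option proposed by this series. The temporary directory, the file names, the committer identity and the 800k limit are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: show which objects a filter spec would omit.
# All paths, file names and the identity below are made up for illustration.
set -e
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

git init -q "$tmp/origin"
# One blob above the 800k limit (900 KiB) and one far below it.
dd if=/dev/zero of="$tmp/origin/large" bs=1024 count=900 2>/dev/null
echo small >"$tmp/origin/small"
git -C "$tmp/origin" add large small
git -C "$tmp/origin" -c user.name=Example -c user.email=example@example.com \
	commit -q -m "large and small blob"
large_oid=$(git -C "$tmp/origin" rev-parse :large)

# --filter-print-omitted prints each omitted object prefixed with "~";
# sed strips the prefix and drops every other line.
omitted=$(git -C "$tmp/origin" rev-list --all --objects \
	--filter=blob:limit=800k --filter-print-omitted | sed -n 's/^~//p')
echo "omitted: $omitted"
```

Only the large blob should show up in the omitted list. A repack that honors the same spec would presumably leave exactly these objects out of the new pack.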

John Cai (2):
  pack-objects: allow --filter without --stdout
  repack: add --filter=<filter-spec> option

 Documentation/git-repack.txt   |   5 +
 builtin/pack-objects.c         |   2 -
 builtin/repack.c               |  10 ++
 t/lib-httpd.sh                 |   2 +
 t/lib-httpd/apache.conf        |   8 ++
 t/lib-httpd/list.sh            |  43 +++++++++
 t/lib-httpd/upload.sh          |  46 +++++++++
 t/t0410-partial-clone.sh       |  52 ++++++++++
 t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
 t/t7700-repack.sh              |  20 ++++
 10 files changed, 356 insertions(+), 2 deletions(-)
 create mode 100644 t/lib-httpd/list.sh
 create mode 100644 t/lib-httpd/upload.sh
 create mode 100755 t/t0410/git-remote-testhttpgit


base-commit: 89bece5c8c96f0b962cfc89e63f82d603fd60bed
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v1
Pull-Request: https://github.com/git/git/pull/1206

Comments

John Cai Feb. 9, 2022, 2:27 a.m. UTC | #1
Hi Johannes

I'm not sure where I went wrong on GGG. Somehow the cc list didn't get translated into
cc fields. Here's the PR: https://github.com/git/git/pull/1206. Thanks!

cc'ing folks I meant to cc for this patch series

On 8 Feb 2022, at 21:10, John Cai via GitGitGadget wrote:

> This patch series makes partial clone more useful by making it possible to
> run repack to remove objects from a repository (replacing them with promisor
> objects). This is useful when we want to offload large blobs from a git
> server onto another git server, or even use an http server through a remote
> helper.
>
> In [A], a --refilter option on fetch and fetch-pack is being discussed where
> either a less restrictive or more restrictive filter can be used. In the
> more restrictive case, the objects that already exist will not be deleted.
> But, one can imagine that users might want the ability to delete objects
> when they apply a more restrictive filter in order to save space, and this
> patch series would also allow that.
>
> There are a couple of things we need to adjust to make this possible. This
> patch series has four parts.
>
>  1. Allow --filter in pack-objects without --stdout
>  2. Add a --filter flag for repack
>  3. Allow missing promisor objects in upload-pack
>  4. Tests that demonstrate the ability to offload objects onto an http
>     remote
>
> cc: Christian Couder <christian.couder@gmail.com>
> cc: Derrick Stolee <stolee@gmail.com>
> cc: Robert Coup <robert@coup.net.nz>
>
> A.
> https://lore.kernel.org/git/pull.1138.git.1643730593.gitgitgadget@gmail.com/
>
> John Cai (4):
>   pack-objects: allow --filter without --stdout
>   repack: add --filter=<filter-spec> option
>   upload-pack: allow missing promisor objects
>   tests for repack --filter mode
>
>  Documentation/git-repack.txt   |   5 +
>  builtin/pack-objects.c         |   2 -
>  builtin/repack.c               |  22 +++--
>  t/lib-httpd.sh                 |   2 +
>  t/lib-httpd/apache.conf        |   8 ++
>  t/lib-httpd/list.sh            |  43 +++++++++
>  t/lib-httpd/upload.sh          |  46 +++++++++
>  t/t0410-partial-clone.sh       |  81 ++++++++++++++++
>  t/t0410/git-remote-testhttpgit | 170 +++++++++++++++++++++++++++++++++
>  t/t7700-repack.sh              |  20 ++++
>  upload-pack.c                  |   5 +
>  11 files changed, 395 insertions(+), 9 deletions(-)
>  create mode 100644 t/lib-httpd/list.sh
>  create mode 100644 t/lib-httpd/upload.sh
>  create mode 100755 t/t0410/git-remote-testhttpgit
>
>
> base-commit: 38062e73e009f27ea192d50481fcb5e7b0e9d6eb
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1206%2Fjohn-cai%2Fjc-repack-filter-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1206/john-cai/jc-repack-filter-v2
> Pull-Request: https://github.com/git/git/pull/1206
>
> Range-diff vs v1:
>
>  1:  0eec9b117da = 1:  f43b76ca650 pack-objects: allow --filter without --stdout
>  -:  ----------- > 2:  6e7c8410b8d repack: add --filter=<filter-spec> option
>  -:  ----------- > 3:  40612b9663b upload-pack: allow missing promisor objects
>  2:  a3166381572 ! 4:  d76faa1f16e repack: add --filter=<filter-spec> option
>      @@ Metadata
>       Author: John Cai <johncai86@gmail.com>
>
>        ## Commit message ##
>      -    repack: add --filter=<filter-spec> option
>      +    tests for repack --filter mode
>
>      -    Currently, repack does not work with partial clones. When repack is run
>      -    on a partially cloned repository, it grabs all missing objects from
>      -    promisor remotes. This also means that when gc is run for repository
>      -    maintenance on a partially cloned repository, it will end up getting
>      -    missing objects, which is not what we want.
>      -
>      -    In order to make repack work with partial clone, teach repack a new
>      -    option --filter, which takes a <filter-spec> argument. repack will skip
>      -    any objects that are matched by <filter-spec> similar to how the clone
>      -    command will skip fetching certain objects.
>      -
>      -    The final goal of this feature, is to be able to store objects on a
>      -    server other than the regular git server itself.
>      +    This patch adds tests to test both repack --filter functionality in
>      +    isolation (in t7700-repack.sh) as well as how it can be used to offload
>      +    large blobs (in t0410-partial-clone.sh)
>
>           There are several scripts added so we can test the process of using a
>      -    remote helper to upload blobs to an http server:
>      +    remote helper to upload blobs to an http server.
>
>           - t/lib-httpd/list.sh lists blobs uploaded to the http server.
>           - t/lib-httpd/upload.sh uploads blobs to the http server.
>      @@ Commit message
>           Based-on-patch-by: Christian Couder <chriscool@tuxfamily.org>
>           Signed-off-by: John Cai <johncai86@gmail.com>
>
>      - ## Documentation/git-repack.txt ##
>      -@@ Documentation/git-repack.txt: depth is 4095.
>      - 	a larger and slower repository; see the discussion in
>      - 	`pack.packSizeLimit`.
>      -
>      -+--filter=<filter-spec>::
>      -+	Omits certain objects (usually blobs) from the resulting
>      -+	packfile. See linkgit:git-rev-list[1] for valid
>      -+	`<filter-spec>` forms.
>      -+
>      - -b::
>      - --write-bitmap-index::
>      - 	Write a reachability bitmap index as part of the repack. This
>      -
>      - ## builtin/repack.c ##
>      -@@ builtin/repack.c: struct pack_objects_args {
>      - 	const char *depth;
>      - 	const char *threads;
>      - 	const char *max_pack_size;
>      -+	const char *filter;
>      - 	int no_reuse_delta;
>      - 	int no_reuse_object;
>      - 	int quiet;
>      -@@ builtin/repack.c: static void prepare_pack_objects(struct child_process *cmd,
>      - 		strvec_pushf(&cmd->args, "--threads=%s", args->threads);
>      - 	if (args->max_pack_size)
>      - 		strvec_pushf(&cmd->args, "--max-pack-size=%s", args->max_pack_size);
>      -+	if (args->filter)
>      -+		strvec_pushf(&cmd->args, "--filter=%s", args->filter);
>      - 	if (args->no_reuse_delta)
>      - 		strvec_pushf(&cmd->args, "--no-reuse-delta");
>      - 	if (args->no_reuse_object)
>      -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
>      - 				N_("limits the maximum number of threads")),
>      - 		OPT_STRING(0, "max-pack-size", &po_args.max_pack_size, N_("bytes"),
>      - 				N_("maximum size of each packfile")),
>      -+		OPT_STRING(0, "filter", &po_args.filter, N_("args"),
>      -+				N_("object filtering")),
>      - 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
>      - 				N_("repack objects in packs marked with .keep")),
>      - 		OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
>      -@@ builtin/repack.c: int cmd_repack(int argc, const char **argv, const char *prefix)
>      - 		if (line.len != the_hash_algo->hexsz)
>      - 			die(_("repack: Expecting full hex object ID lines only from pack-objects."));
>      - 		string_list_append(&names, line.buf);
>      -+		if (po_args.filter) {
>      -+			char *promisor_name = mkpathdup("%s-%s.promisor", packtmp,
>      -+							line.buf);
>      -+			write_promisor_file(promisor_name, NULL, 0);
>      -+		}
>      - 	}
>      - 	fclose(out);
>      - 	ret = finish_command(&cmd);
>      -
>        ## t/lib-httpd.sh ##
>       @@ t/lib-httpd.sh: prepare_httpd() {
>        	install_script error-smart-http.sh
>      @@ t/t0410-partial-clone.sh: test_expect_success 'fetching of missing objects from
>       +	git -C server rev-list --objects --all --missing=print >objects &&
>       +	grep "$sha" objects
>       +'
>      ++
>      ++test_expect_success 'fetch does not cause server to fetch missing objects' '
>      ++	rm -rf origin server client &&
>      ++	test_create_repo origin &&
>      ++	dd if=/dev/zero of=origin/file1 bs=801k count=1 &&
>      ++	git -C origin add file1 &&
>      ++	git -C origin commit -m "large blob" &&
>      ++	sha="$(git -C origin rev-parse :file1)" &&
>      ++	expected="?$(git -C origin rev-parse :file1)" &&
>      ++	git clone --bare --no-local origin server &&
>      ++	git -C server remote add httpremote "testhttpgit::${PWD}/server" &&
>      ++	git -C server config remote.httpremote.promisor true &&
>      ++	git -C server config --remove-section remote.origin &&
>      ++	git -C server rev-list --all --objects --filter-print-omitted \
>      ++		--filter=blob:limit=800k | perl -ne "print if s/^[~]//" \
>      ++		>large_blobs.txt &&
>      ++	upload_blobs_from_stdin server <large_blobs.txt &&
>      ++	git -C server -c repack.writebitmaps=false repack -a -d \
>      ++		--filter=blob:limit=800k &&
>      ++	git -C server config uploadpack.allowmissingpromisor true &&
>      ++	git clone -c remote.httpremote.url="testhttpgit::${PWD}/server" \
>      ++	-c remote.httpremote.fetch='+refs/heads/*:refs/remotes/httpremote/*' \
>      ++	-c remote.httpremote.promisor=true --bare --no-local \
>      ++	--filter=blob:limit=800k server client &&
>      ++	git -C client rev-list --objects --all --missing=print >client_objects &&
>      ++	grep "$expected" client_objects &&
>      ++	git -C server rev-list --objects --all --missing=print >server_objects &&
>      ++	grep "$expected" server_objects
>      ++'
>       +
>        # DO NOT add non-httpd-specific tests here, because the last part of this
>        # test script is only executed when httpd is available and enabled.
>
> -- 
> gitgitgadget
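
The new test above checks for a "?"-prefixed object ID in the output of rev-list --missing=print. A minimal sketch of that check, using only stock git and a plain partial clone, could look as follows. It does not use the testhttpgit remote helper, the proposed repack --filter option, or the uploadpack.allowmissingpromisor setting from this series. The directory names and the committer identity are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: a partial clone reports the filtered blob as missing.
# Directory names and the identity below are made up for illustration.
set -e
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

git init -q "$tmp/origin"
# 900 KiB of zeros, which is above the 800k filter limit.
dd if=/dev/zero of="$tmp/origin/file1" bs=1024 count=900 2>/dev/null
git -C "$tmp/origin" add file1
git -C "$tmp/origin" -c user.name=Example -c user.email=example@example.com \
	commit -q -m "large blob"
sha=$(git -C "$tmp/origin" rev-parse :file1)

# The serving repository has to opt in to filtered fetches.
git -C "$tmp/origin" config uploadpack.allowfilter true
# --bare avoids a checkout, so the large blob is never fetched on demand.
git clone -q --bare --no-local --filter=blob:limit=800k \
	"$tmp/origin" "$tmp/client"

# --missing=print lists absent objects as "?<oid>" instead of fetching them.
git -C "$tmp/client" rev-list --objects --all --missing=print >"$tmp/objects"
missing=$(sed -n 's/^?//p' "$tmp/objects")
echo "missing: $missing"
```

The test in the series runs the same rev-list check against the server repository as well, to confirm that serving the clone did not pull the offloaded blob back in.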