[2/5] backfill: basic functionality and tests

Message ID 5728dd2702195b7ba3a208859f114e40ba2b6bbd.1733515638.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series PATH WALK III: Add 'git backfill' command

Commit Message

Derrick Stolee Dec. 6, 2024, 8:07 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The default behavior of 'git backfill' is to fetch all missing blobs that
are reachable from HEAD. Document and test this behavior.

The implementation is a simple use of the path-walk API: initialize the
revision walk at HEAD so that the path-walk covers all commits reachable
from HEAD. Ignore the object arrays that correspond to tree entries,
assuming that those trees are all present already.

The path-walk API provides lists of objects in batches according to a
common path, but an individual list could be very small. We want to
balance the number of requests to the server against the amount of work
that must be repeated when an interrupted run is restarted.
Based on some experiments (detailed in the next change), a minimum batch
size of 50,000 is selected for the default.

This batch size is a _minimum_. As the path-walk API emits lists of blob
IDs, they are collected into a list of objects for a request to the
server. When that list reaches the minimum batch size, the request for
the new objects is sent to the server. However, the list of blob IDs
from the path-walk API could be much longer than the batch size. At this
moment, it is unclear if there is a benefit to splitting the list when
there are too many objects at the same path.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 Documentation/git-backfill.txt            |  24 +++++
 Documentation/technical/api-path-walk.txt |   3 +-
 builtin/backfill.c                        | 104 +++++++++++++++++++++-
 t/t5620-backfill.sh                       |  94 +++++++++++++++++++
 4 files changed, 221 insertions(+), 4 deletions(-)
 create mode 100755 t/t5620-backfill.sh

Comments

Patrick Steinhardt Dec. 16, 2024, 8:01 a.m. UTC | #1
On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
> index 640144187d3..0e10f066fef 100644
> --- a/Documentation/git-backfill.txt
> +++ b/Documentation/git-backfill.txt
> @@ -14,6 +14,30 @@ SYNOPSIS
>  DESCRIPTION
>  -----------
>  
> +Blobless partial clones are created using `git clone --filter=blob:none`,
> +which also configures the local repository such that the Git client avoids
> +downloading blob objects unless they are required for a local operation.
> +This initially means that the clone and later fetches download reachable
> +commits and trees but no blobs. Later operations that change the `HEAD`
> +pointer, such as `git checkout` or `git merge`, may need to download
> +missing blobs in order to complete their operation.

Okay.

> +In the worst cases, commands that compute blob diffs, such as `git blame`,
> +become very slow as they fetch each missing blob in a separate request
> +at the moment the Git command needs it. This leads to many download
> +requests and no ability for the Git server to provide delta compression
> +across those objects.
> +
> +The `git backfill` command provides a way for the user to request that
> +Git downloads the missing blobs (with optional filters) such that the
> +missing blobs representing historical versions of files can be downloaded
> +in batches. The `backfill` command attempts to optimize the request by
> +grouping blobs that appear at the same path, hopefully leading to good
> +delta compression in the packfile sent by the server.

Hm. So we're asking the user to fix a usability issue of git-blame(1),
aren't we? Ideally, git-blame(1) itself should know to transparently
batch the blobs it requires to compute its output, shouldn't it? That
use case alone doesn't yet convince me that git-backfill(1) is a good
idea, as I'd think we should rather fix the underlying issue.

So are there other use cases for git-backfill(1)? I can imagine that it
might be helpful in the context of scripts that know they'll operate on
a bunch of blobs.

> diff --git a/builtin/backfill.c b/builtin/backfill.c
> index 38e6aaeaa03..e5f2000d5e0 100644
> --- a/builtin/backfill.c
> +++ b/builtin/backfill.c
> @@ -1,16 +1,116 @@
>  #include "builtin.h"
> +#include "git-compat-util.h"
>  #include "config.h"
>  #include "parse-options.h"
>  #include "repository.h"
> +#include "commit.h"
> +#include "hex.h"
> +#include "tree.h"
> +#include "tree-walk.h"
>  #include "object.h"
> +#include "object-store-ll.h"
> +#include "oid-array.h"
> +#include "oidset.h"
> +#include "promisor-remote.h"
> +#include "strmap.h"
> +#include "string-list.h"
> +#include "revision.h"
> +#include "trace2.h"
> +#include "progress.h"
> +#include "packfile.h"
> +#include "path-walk.h"
>  
>  static const char * const builtin_backfill_usage[] = {
>  	N_("git backfill [<options>]"),
>  	NULL
>  };
>  
> +struct backfill_context {
> +	struct repository *repo;
> +	struct oid_array current_batch;
> +	size_t batch_size;
> +};
> +
> +static void clear_backfill_context(struct backfill_context *ctx)
> +{
> +	oid_array_clear(&ctx->current_batch);
> +}

Nit: our style guide says that this should rather be
`backfill_context_clear()`.

> +static void download_batch(struct backfill_context *ctx)
> +{
> +	promisor_remote_get_direct(ctx->repo,
> +				   ctx->current_batch.oid,
> +				   ctx->current_batch.nr);
> +	oid_array_clear(&ctx->current_batch);
> +
> +	/*
> +	 * We likely have a new packfile. Add it to the packed list to
> +	 * avoid possible duplicate downloads of the same objects.
> +	 */
> +	reprepare_packed_git(ctx->repo);
> +}
> +
> +static int fill_missing_blobs(const char *path UNUSED,
> +			      struct oid_array *list,
> +			      enum object_type type,
> +			      void *data)
> +{
> +	struct backfill_context *ctx = data;
> +
> +	if (type != OBJ_BLOB)
> +		return 0;
> +
> +	for (size_t i = 0; i < list->nr; i++) {
> +		off_t size = 0;
> +		struct object_info info = OBJECT_INFO_INIT;
> +		info.disk_sizep = &size;
> +		if (oid_object_info_extended(ctx->repo,
> +					     &list->oid[i],
> +					     &info,
> +					     OBJECT_INFO_FOR_PREFETCH) ||
> +		    !size)
> +			oid_array_append(&ctx->current_batch, &list->oid[i]);
> +	}
> +
> +	if (ctx->current_batch.nr >= ctx->batch_size)
> +		download_batch(ctx);

Okay, so the batch size is just "best effort". If we walk a tree that
makes us exceed the batch size then we wouldn't issue a fetch during the
tree walk. Is there any specific reason for this behaviour?

In any case, as long as this is properly documented I think this should
be fine in general.

> +	return 0;
> +}
> +
> +static int do_backfill(struct backfill_context *ctx)
> +{
> +	struct rev_info revs;
> +	struct path_walk_info info = PATH_WALK_INFO_INIT;
> +	int ret;
> +
> +	repo_init_revisions(ctx->repo, &revs, "");
> +	handle_revision_arg("HEAD", &revs, 0, 0);
> +
> +	info.blobs = 1;
> +	info.tags = info.commits = info.trees = 0;
> +
> +	info.revs = &revs;
> +	info.path_fn = fill_missing_blobs;
> +	info.path_fn_data = ctx;
> +
> +	ret = walk_objects_by_path(&info);
> +
> +	/* Download the objects that did not fill a batch. */
> +	if (!ret)
> +		download_batch(ctx);
> +
> +	clear_backfill_context(ctx);

Are we leaking `revs` and `info`?

> +	return ret;
> +}
> +
>  int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
>  {
> +	struct backfill_context ctx = {
> +		.repo = repo,
> +		.current_batch = OID_ARRAY_INIT,
> +		.batch_size = 50000,
> +	};
>  	struct option options[] = {
>  		OPT_END(),
>  	};
> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
>  
>  	repo_config(repo, git_default_config, NULL);
>  
> -	die(_("not implemented"));
> -
> -	return 0;
> +	return do_backfill(&ctx);
>  }

The current iteration only backfills blobs as far as I can see. Do we
maybe want to keep the door open for future changes in git-backfill(1)
by implementing this via a "blob" subcommand?

Patrick
Derrick Stolee Dec. 18, 2024, 3:03 p.m. UTC | #2
On 12/16/24 3:01 AM, Patrick Steinhardt wrote:
 > On Fri, Dec 06, 2024 at 08:07:15PM +0000, Derrick Stolee via GitGitGadget wrote:

 >> +The `git backfill` command provides a way for the user to request that
 >> +Git downloads the missing blobs (with optional filters) such that the
 >> +missing blobs representing historical versions of files can be downloaded
 >> +in batches. The `backfill` command attempts to optimize the request by
 >> +grouping blobs that appear at the same path, hopefully leading to good
 >> +delta compression in the packfile sent by the server.
 >
 > Hm. So we're asking the user to fix a usability issue of git-blame(1),
 > aren't we? Ideally, git-blame(1) itself should know to transparently
 > batch the blobs it requires to compute its output, shouldn't it? That
 > use case alone doesn't yet convince me that git-backfill(1) is a good
 > idea, as I'd think we should rather fix the underlying issue.

I've looked into making this change for 'git blame' and it is a
nontrivial change. I'm not sure of the timeline on which we could expect
'git blame' to be improved.

But you're right that this is not enough justification on its own. It's
one example of a class of commands with this problem, which also
includes 'git log [-p|-L]'.

 > So are there other use cases for git-backfill(1)? I can imagine that it
 > might be helpful in the context of scripts that know they'll operate on
 > a bunch of blobs.

One major motivation is that it can essentially provide a way to do
resumable clones: get the commits and trees in one go with a blobless
clone, then download the blobs in batches. If something interrupts the
'git backfill' command, then restarting it will only repeat the most
recent batch.

Finally, a later patch shows that the --sparse option allows a user to
operate as if they have a full clone while only including blob data
within their sparse-checkout, reducing the disk space and network time
needed to get to that state.

 >> +	if (ctx->current_batch.nr >= ctx->batch_size)
 >> +		download_batch(ctx);
 >
 > Okay, so the batch size is just "best effort". If we walk a tree that
 > makes us exceed the batch size then we wouldn't issue a fetch during the
 > tree walk. Is there any specific reason for this behaviour?
 >
 > In any case, as long as this is properly documented I think this should
 > be fine in general.

The main reason is that we expect the server to return a packfile that
has many delta relationships within the objects at a given path. If we
split a batch in the middle of one path's list, then we artificially
introduce breaks in the delta chains, which could be wasteful.

However, this batching pattern could be problematic if there are a
million versions of a single file and the batch is too large to download
efficiently. This "absolute max batch size" is currently left as a
future extension.
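
If we did add such a cap, the batched download could issue requests in
chunks. A rough sketch, assuming a hypothetical `max_batch_size` field
on `struct backfill_context` (not part of this series); it deliberately
accepts the delta-chain splits described above:

	static void download_batch_chunked(struct backfill_context *ctx)
	{
		size_t start = 0;

		while (start < ctx->current_batch.nr) {
			size_t nr = ctx->current_batch.nr - start;

			/* Cap each request at the hypothetical maximum. */
			if (nr > ctx->max_batch_size)
				nr = ctx->max_batch_size;
			promisor_remote_get_direct(ctx->repo,
						   ctx->current_batch.oid + start,
						   nr);
			start += nr;
		}

		oid_array_clear(&ctx->current_batch);
		/* As in download_batch(): pick up the new packfile(s). */
		reprepare_packed_git(ctx->repo);
	}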

 >> +	clear_backfill_context(ctx);
 >
 > Are we leaking `revs` and `info`?

At the moment. Will fix.
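
For the record, here is roughly what I have in mind (a sketch:
`release_revisions()` is the existing cleanup helper from revision.h,
and this version of `info` holds only borrowed pointers, so it needs no
cleanup of its own). The tail of do_backfill() would become:

	ret = walk_objects_by_path(&info);

	/* Download the objects that did not fill a batch. */
	if (!ret)
		download_batch(ctx);

	clear_backfill_context(ctx);
	release_revisions(&revs); /* previously leaked */
	return ret;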

 >> +	return ret;
 >> +}
 >> +
 >>   int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
 >>   {
 >> +	struct backfill_context ctx = {
 >> +		.repo = repo,
 >> +		.current_batch = OID_ARRAY_INIT,
 >> +		.batch_size = 50000,
 >> +	};
 >>   	struct option options[] = {
 >>   		OPT_END(),
 >>   	};
 >> @@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 >>
 >>   	repo_config(repo, git_default_config, NULL);
 >>
 >> -	die(_("not implemented"));
 >> -
 >> -	return 0;
 >> +	return do_backfill(&ctx);
 >>   }
 >
 > The current iteration only backfills blobs as far as I can see. Do we
 > maybe want to keep the door open for future changes in git-backfill(1)
 > by implementing this via a "blob" subcommand?

I think that one tricky part is to ask "what could be missing?". With
Git's partial clone, it seems that we could have treeless or depth-
based tree restrictions. Technically, there could also be clones that
restrict to a set of sparse-checkout patterns, but I'm not aware of any
server that will respect those kinds of clones.  At the moment, the tree
walk would fault in any missing trees as they are seen, but this is
extremely inefficient.

I think that the path-walk API could be adjusted to be more careful to
check for the existence of an object before automatically loading it.
That would allow for batched downloads of missing trees, though a
second scan would be required to get the next layer of objects.
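
As a sketch of what that existence check might look like (hypothetical
helper, not part of this series), the walk could test a tree for local
presence without triggering a lazy fetch, reusing the same
OBJECT_INFO_FOR_PREFETCH flag that the blob path already uses:

	static int tree_is_missing(struct repository *repo,
				   const struct object_id *oid)
	{
		struct object_info info = OBJECT_INFO_INIT;

		/*
		 * OBJECT_INFO_FOR_PREFETCH avoids faulting in the object
		 * from the promisor remote; a nonzero result means the
		 * tree is absent locally and could be queued for a
		 * batched download instead.
		 */
		return oid_object_info_extended(repo, oid, &info,
						OBJECT_INFO_FOR_PREFETCH);
	}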

I'm not sure a subcommand is the right way to prepare for this potential
future; instead, we could later adjust the logic to better handle these
treeless or tree-restricted clones.

Thanks,
-Stolee

Patch

diff --git a/Documentation/git-backfill.txt b/Documentation/git-backfill.txt
index 640144187d3..0e10f066fef 100644
--- a/Documentation/git-backfill.txt
+++ b/Documentation/git-backfill.txt
@@ -14,6 +14,30 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
+Blobless partial clones are created using `git clone --filter=blob:none`,
+which also configures the local repository such that the Git client avoids
+downloading blob objects unless they are required for a local operation.
+This initially means that the clone and later fetches download reachable
+commits and trees but no blobs. Later operations that change the `HEAD`
+pointer, such as `git checkout` or `git merge`, may need to download
+missing blobs in order to complete their operation.
+
+In the worst cases, commands that compute blob diffs, such as `git blame`,
+become very slow as they fetch each missing blob in a separate request
+at the moment the Git command needs it. This leads to many download
+requests and no ability for the Git server to provide delta compression
+across those objects.
+
+The `git backfill` command provides a way for the user to request that
+Git downloads the missing blobs (with optional filters) such that the
+missing blobs representing historical versions of files can be downloaded
+in batches. The `backfill` command attempts to optimize the request by
+grouping blobs that appear at the same path, hopefully leading to good
+delta compression in the packfile sent by the server.
+
+By default, `git backfill` downloads all blobs reachable from the `HEAD`
+commit. This set can be restricted or expanded using various options.
+
 SEE ALSO
 --------
 linkgit:git-clone[1].
diff --git a/Documentation/technical/api-path-walk.txt b/Documentation/technical/api-path-walk.txt
index 7075d0d5ab5..1fba0ce04cb 100644
--- a/Documentation/technical/api-path-walk.txt
+++ b/Documentation/technical/api-path-walk.txt
@@ -60,4 +60,5 @@ Examples
 --------
 
 See example usages in:
-	`t/helper/test-path-walk.c`
+	`t/helper/test-path-walk.c`,
+	`builtin/backfill.c`
diff --git a/builtin/backfill.c b/builtin/backfill.c
index 38e6aaeaa03..e5f2000d5e0 100644
--- a/builtin/backfill.c
+++ b/builtin/backfill.c
@@ -1,16 +1,116 @@ 
 #include "builtin.h"
+#include "git-compat-util.h"
 #include "config.h"
 #include "parse-options.h"
 #include "repository.h"
+#include "commit.h"
+#include "hex.h"
+#include "tree.h"
+#include "tree-walk.h"
 #include "object.h"
+#include "object-store-ll.h"
+#include "oid-array.h"
+#include "oidset.h"
+#include "promisor-remote.h"
+#include "strmap.h"
+#include "string-list.h"
+#include "revision.h"
+#include "trace2.h"
+#include "progress.h"
+#include "packfile.h"
+#include "path-walk.h"
 
 static const char * const builtin_backfill_usage[] = {
 	N_("git backfill [<options>]"),
 	NULL
 };
 
+struct backfill_context {
+	struct repository *repo;
+	struct oid_array current_batch;
+	size_t batch_size;
+};
+
+static void clear_backfill_context(struct backfill_context *ctx)
+{
+	oid_array_clear(&ctx->current_batch);
+}
+
+static void download_batch(struct backfill_context *ctx)
+{
+	promisor_remote_get_direct(ctx->repo,
+				   ctx->current_batch.oid,
+				   ctx->current_batch.nr);
+	oid_array_clear(&ctx->current_batch);
+
+	/*
+	 * We likely have a new packfile. Add it to the packed list to
+	 * avoid possible duplicate downloads of the same objects.
+	 */
+	reprepare_packed_git(ctx->repo);
+}
+
+static int fill_missing_blobs(const char *path UNUSED,
+			      struct oid_array *list,
+			      enum object_type type,
+			      void *data)
+{
+	struct backfill_context *ctx = data;
+
+	if (type != OBJ_BLOB)
+		return 0;
+
+	for (size_t i = 0; i < list->nr; i++) {
+		off_t size = 0;
+		struct object_info info = OBJECT_INFO_INIT;
+		info.disk_sizep = &size;
+		if (oid_object_info_extended(ctx->repo,
+					     &list->oid[i],
+					     &info,
+					     OBJECT_INFO_FOR_PREFETCH) ||
+		    !size)
+			oid_array_append(&ctx->current_batch, &list->oid[i]);
+	}
+
+	if (ctx->current_batch.nr >= ctx->batch_size)
+		download_batch(ctx);
+
+	return 0;
+}
+
+static int do_backfill(struct backfill_context *ctx)
+{
+	struct rev_info revs;
+	struct path_walk_info info = PATH_WALK_INFO_INIT;
+	int ret;
+
+	repo_init_revisions(ctx->repo, &revs, "");
+	handle_revision_arg("HEAD", &revs, 0, 0);
+
+	info.blobs = 1;
+	info.tags = info.commits = info.trees = 0;
+
+	info.revs = &revs;
+	info.path_fn = fill_missing_blobs;
+	info.path_fn_data = ctx;
+
+	ret = walk_objects_by_path(&info);
+
+	/* Download the objects that did not fill a batch. */
+	if (!ret)
+		download_batch(ctx);
+
+	clear_backfill_context(ctx);
+	return ret;
+}
+
 int cmd_backfill(int argc, const char **argv, const char *prefix, struct repository *repo)
 {
+	struct backfill_context ctx = {
+		.repo = repo,
+		.current_batch = OID_ARRAY_INIT,
+		.batch_size = 50000,
+	};
 	struct option options[] = {
 		OPT_END(),
 	};
@@ -23,7 +123,5 @@ int cmd_backfill(int argc, const char **argv, const char *prefix, struct reposit
 
 	repo_config(repo, git_default_config, NULL);
 
-	die(_("not implemented"));
-
-	return 0;
+	return do_backfill(&ctx);
 }
diff --git a/t/t5620-backfill.sh b/t/t5620-backfill.sh
new file mode 100755
index 00000000000..64326362d80
--- /dev/null
+++ b/t/t5620-backfill.sh
@@ -0,0 +1,94 @@ 
+#!/bin/sh
+
+test_description='git backfill on partial clones'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./test-lib.sh
+
+# We create objects in the 'src' repo.
+test_expect_success 'setup repo for object creation' '
+	echo "{print \$1}" >print_1.awk &&
+	echo "{print \$2}" >print_2.awk &&
+
+	git init src &&
+
+	mkdir -p src/a/b/c &&
+	mkdir -p src/d/e &&
+
+	for i in 1 2
+	do
+		for n in 1 2 3 4
+		do
+			echo "Version $i of file $n" > src/file.$n.txt &&
+			echo "Version $i of file a/$n" > src/a/file.$n.txt &&
+			echo "Version $i of file a/b/$n" > src/a/b/file.$n.txt &&
+			echo "Version $i of file a/b/c/$n" > src/a/b/c/file.$n.txt &&
+			echo "Version $i of file d/$n" > src/d/file.$n.txt &&
+			echo "Version $i of file d/e/$n" > src/d/e/file.$n.txt &&
+			git -C src add . &&
+			git -C src commit -m "Iteration $n" || return 1
+		done
+	done
+'
+
+# Clone 'src' into 'srv.bare' so we have a bare repo to be our origin
+# server for the partial clone.
+test_expect_success 'setup bare clone for server' '
+	git clone --bare "file://$(pwd)/src" srv.bare &&
+	git -C srv.bare config --local uploadpack.allowfilter 1 &&
+	git -C srv.bare config --local uploadpack.allowanysha1inwant 1
+'
+
+# do basic partial clone from "srv.bare"
+test_expect_success 'do partial clone 1, backfill gets all objects' '
+	git clone --no-checkout --filter=blob:none	\
+		--single-branch --branch=main 		\
+		"file://$(pwd)/srv.bare" backfill1 &&
+
+	# Backfill with no options gets everything reachable from HEAD.
+	GIT_TRACE2_EVENT="$(pwd)/backfill-file-trace" git \
+		-C backfill1 backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-file-trace &&
+
+	# No more missing objects!
+	git -C backfill1 rev-list --quiet --objects --missing=print HEAD >revs2 &&
+	test_line_count = 0 revs2
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'create a partial clone over HTTP' '
+	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+	rm -rf "$SERVER" repo &&
+	git clone --bare "file://$(pwd)/src" "$SERVER" &&
+	test_config -C "$SERVER" uploadpack.allowfilter 1 &&
+	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
+
+	git clone --no-checkout --filter=blob:none \
+		"$HTTPD_URL/smart/server" backfill-http
+'
+
+test_expect_success 'backfilling over HTTP succeeds' '
+	GIT_TRACE2_EVENT="$(pwd)/backfill-http-trace" git \
+		-C backfill-http backfill &&
+
+	# We should have engaged the partial clone machinery
+	test_trace2_data promisor fetch_count 48 <backfill-http-trace &&
+
+	# Confirm all objects are present, none missing.
+	git -C backfill-http rev-list --objects --all >rev-list-out &&
+	awk "{print \$1;}" <rev-list-out >oids &&
+	GIT_TRACE2_EVENT="$(pwd)/walk-trace" git -C backfill-http \
+		cat-file --batch-check <oids >batch-out &&
+	! grep missing batch-out
+'
+
+# DO NOT add non-httpd-specific tests here, because the last part of this
+# test script is only executed when httpd is available and enabled.
+
+test_done