diff mbox series

[10/11] bundle: create filtered bundles

Message ID 5393e74708dfd38e5596d9e877a491e6ed8dda24.1645638911.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Partial bundles | expand

Commit Message

Derrick Stolee Feb. 23, 2022, 5:55 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

A previous change allowed Git to parse bundles with the 'filter'
capability. Now, teach Git to create bundles with this option.

Some rearranging of code is required to get the option parsing in the
correct spot. There are now two reasons why we might need capabilities
(a new hash algorithm or an object filter) so that is pulled out into a
place where we can check both at the same time.

The --filter option is parsed as part of setup_revisions(), but it
expected the --objects flag, too. That flag is somewhat implied by 'git
bundle' because it creates a pack-file walking objects, but there is
also a walk that walks the revision range expecting only commits. Make
this parsing work by setting 'revs.tree_objects' and 'revs.blob_objects'
before the call to setup_revisions().

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle.c               | 56 +++++++++++++++++++++++++++++++++---------
 t/t6020-bundle-misc.sh | 30 ++++++++++++++++++++++
 2 files changed, 75 insertions(+), 11 deletions(-)

Comments

Junio C Hamano March 4, 2022, 11:35 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> A previous change allowed Git to parse bundles with the 'filter'
> capability. Now, teach Git to create bundles with this option.
>
> Some rearranging of code is required to get the option parsing in the
> correct spot. There are now two reasons why we might need capabilities
> (a new hash algorithm or an object filter) so that is pulled out into a
> place where we can check both at the same time.
>
> The --filter option is parsed as part of setup_revisions(), but it
> expected the --objects flag, too. That flag is somewhat implied by 'git
> bundle' because it creates a pack-file walking objects, but there is
> also a walk that walks the revision range expecting only commits. Make
> this parsing work by setting 'revs.tree_objects' and 'revs.blob_objects'
> before the call to setup_revisions().
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---

Now, the gem of the series ;-)

> @@ -334,6 +334,9 @@ static int write_pack_data(int bundle_fd, struct rev_info *revs, struct strvec *
>  		     "--stdout", "--thin", "--delta-base-offset",
>  		     NULL);
>  	strvec_pushv(&pack_objects.args, pack_options->v);
> +	if (revs->filter)
> +		strvec_pushf(&pack_objects.args, "--filter=%s",
> +			     list_objects_filter_spec(revs->filter));
>  	pack_objects.in = -1;
>  	pack_objects.out = bundle_fd;
>  	pack_objects.git_cmd = 1;

Quite expected.

> @@ -507,10 +510,38 @@ int create_bundle(struct repository *r, const char *path,
>  	int bundle_to_stdout;
>  	int ref_count = 0;
>  	struct rev_info revs, revs_copy;
> -	int min_version = the_hash_algo == &hash_algos[GIT_HASH_SHA1] ? 2 : 3;
> +	int min_version = 2;
>  	struct bundle_prerequisites_info bpi;
>  	int i;
>  
> +	/* init revs to list objects for pack-objects later */
> +	save_commit_buffer = 0;
> +	repo_init_revisions(r, &revs, NULL);
> +
> +	/*
> +	 * Pre-initialize the '--objects' flag so we can parse a
> +	 * --filter option successfully.
> +	 */
> +	revs.tree_objects = revs.blob_objects = 1;

Tricky, but true.

> +	argc = setup_revisions(argc, argv, &revs, NULL);
> +
> +	/*
> +	 * Reasons to require version 3:
> +	 *
> +	 * 1. @object-format is required because our hash algorithm is not
> +	 *    SHA1.
> +	 * 2. @filter is required because we parsed an object filter.
> +	 */

OK.

> +	if (the_hash_algo != &hash_algos[GIT_HASH_SHA1] ||
> +	    revs.filter)

Did we need to wrap?  With these on a single line, the line is way
shorter than the line with "because our hash algorithm is not" on
it.

> +		min_version = 3;
> +
> +	if (argc > 1) {
> +		error(_("unrecognized argument: %s"), argv[1]);
> +		goto err;
> +	}
> +

OK.  We are moving original logic around correctly and there is not
much to see here ;-)

> @@ -533,17 +564,14 @@ int create_bundle(struct repository *r, const char *path,
>  		write_or_die(bundle_fd, capability, strlen(capability));
>  		write_or_die(bundle_fd, the_hash_algo->name, strlen(the_hash_algo->name));
>  		write_or_die(bundle_fd, "\n", 1);
> ...
> +		if (revs.filter) {
> +			const char *value = expand_list_objects_filter_spec(revs.filter);
> +			capability = "@filter=";
> +			write_or_die(bundle_fd, capability, strlen(capability));
> +			write_or_die(bundle_fd, value, strlen(value));
> +			write_or_die(bundle_fd, "\n", 1);
> +		}

This block is added at the end of the code to write the v3 preamble
and it adds the @filter= capability.  Looking good.

> @@ -566,6 +594,12 @@ int create_bundle(struct repository *r, const char *path,
>  	bpi.fd = bundle_fd;
>  	bpi.pending = &revs_copy.pending;
>  
> +	/*
> +	 * Nullify the filter here, and any object walking. We only care
> +	 * about commits and tags here. The revs_copy has the right
> +	 * instances of these values.
> +	 */
> +	revs.filter = NULL;
>  	revs.blob_objects = revs.tree_objects = 0;
>  	traverse_commit_list(&revs, write_bundle_prerequisites, ignore_object, &bpi);
>  	object_array_remove_duplicates(&revs_copy.pending);

OK.  We prepare revs, and we save it to revs_copy, because we
perform two traversals, one to determine which bottom commits are
required to unbundle the bundle (which is done with the instance
"revs"), and then later to actually enumerate the objects to place
in the bundle (using "revs_copy").  Is there a reason why we need to
remove .filter in order to perform the first traversal?

This is a tangent, but I wish we could reliably determine when we
can optimize the first traversal away, by inspecting revs.  If there
are any pending objects with UNINTERESTING bit, or members like
max_count, max_age, min_age are set, we'd end up traversing down to
all roots and the prerequisites list would be empty.

> +	test_expect_success 'filtered bundle: $filter' '
> +		test_when_finished rm -rf .git/objects/pack &&
> +		git bundle create partial.bdl \
> +			--all \
> +			--filter=$filter &&
> +
> +		git bundle verify partial.bdl >unfiltered &&
> +		make_user_friendly_and_stable_output <unfiltered >actual &&
> +
> +		cat >expect <<-EOF &&
> +		The bundle contains these 10 refs:
> +		<COMMIT-P> refs/heads/main
> +		<COMMIT-N> refs/heads/release
> +		<COMMIT-D> refs/heads/topic/1
> +		<COMMIT-H> refs/heads/topic/2
> +		<COMMIT-D> refs/pull/1/head
> +		<COMMIT-G> refs/pull/2/head
> +		<TAG-1> refs/tags/v1
> +		<TAG-2> refs/tags/v2
> +		<TAG-3> refs/tags/v3
> +		<COMMIT-P> HEAD
> +		The bundle uses this filter: $filter
> +		The bundle records a complete history.
> +		EOF
> +		test_cmp expect actual
> +	'

OK.

It is somewhat curious why our bundle tests do not unbundle and
check the resulting contents of the repository we unbundle it in.

> +done
> +
>  test_done
Derrick Stolee March 7, 2022, 2:14 p.m. UTC | #2
On 3/4/2022 6:35 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> A previous change allowed Git to parse bundles with the 'filter'
>> capability. Now, teach Git to create bundles with this option.
>>
>> Some rearranging of code is required to get the option parsing in the
>> correct spot. There are now two reasons why we might need capabilities
>> (a new hash algorithm or an object filter) so that is pulled out into a
>> place where we can check both at the same time.
>>
>> The --filter option is parsed as part of setup_revisions(), but it
>> expected the --objects flag, too. That flag is somewhat implied by 'git
>> bundle' because it creates a pack-file walking objects, but there is
>> also a walk that walks the revision range expecting only commits. Make
>> this parsing work by setting 'revs.tree_objects' and 'revs.blob_objects'
>> before the call to setup_revisions().
>>
>> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
>> ---
> 
> Now, the gem of the series ;-)

:D

>> +	if (the_hash_algo != &hash_algos[GIT_HASH_SHA1] ||
>> +	    revs.filter)
> 
> Did we need to wrap?  With these on a single line, the line is way
> shorter than the line with "because our hash algorithm is not" on
> it.

Perhaps I was thinking about having one line per "reason", which might
be extended in the future. But there's no reason to waste space right
now.

>> +	/*
>> +	 * Nullify the filter here, and any object walking. We only care
>> +	 * about commits and tags here. The revs_copy has the right
>> +	 * instances of these values.
>> +	 */
>> +	revs.filter = NULL;
>>  	revs.blob_objects = revs.tree_objects = 0;
>>  	traverse_commit_list(&revs, write_bundle_prerequisites, ignore_object, &bpi);
>>  	object_array_remove_duplicates(&revs_copy.pending);
> 
> OK.  We prepare revs, and we save it to revs_copy, because we
> perform two traversals, one to determine which bottom commits are
> required to unbundle the bundle (which is done with the instance
> "revs"), and then later to actually enumerate the objects to place
> in the bundle (using "revs_copy").  Is there a reason why we need to
> remove .filter in order to perform the first traversal?
> 
> This is a tangent, but I wish we could reliably determine when we
> can optimize the first traversal away, by inspecting revs.  If there
> are any pending objects with UNINTERESTING bit, or members like
> max_count, max_age, min_age are set, we'd end up traversing down to
> all roots and the prerequisites list would be empty.

Noted for potential follow-up.
 
>> +	test_expect_success 'filtered bundle: $filter' '
...
> 
> OK.
> 
> It is somewhat curious why our bundle tests do not unbundle and
> check the resulting contents of the repository we unbundle it in.

I haven't checked your response yet, but hopefully this is answered in
the next patch which teaches Git how to unbundle bundles with this new
capability.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason March 7, 2022, 3:44 p.m. UTC | #3
On Wed, Feb 23 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>

Just something missed by Junio....

> +for filter in "blob:none" "tree:0" "tree:1" "blob:limit=100"
> +do
> +	test_expect_success 'filtered bundle: $filter' '

I think you'll need "" here, the description part of test_expect_success
isn't eval'd.

> +		test_when_finished rm -rf .git/objects/pack &&
> +		git bundle create partial.bdl \
> +			--all \
> +			--filter=$filter &&
> +
> +		git bundle verify partial.bdl >unfiltered &&
> +		make_user_friendly_and_stable_output <unfiltered >actual &&
> +
> +		cat >expect <<-EOF &&
> +		The bundle contains these 10 refs:
> +		<COMMIT-P> refs/heads/main
> +		<COMMIT-N> refs/heads/release
> +		<COMMIT-D> refs/heads/topic/1
> +		<COMMIT-H> refs/heads/topic/2
> +		<COMMIT-D> refs/pull/1/head
> +		<COMMIT-G> refs/pull/2/head
> +		<TAG-1> refs/tags/v1
> +		<TAG-2> refs/tags/v2
> +		<TAG-3> refs/tags/v3
> +		<COMMIT-P> HEAD
> +		The bundle uses this filter: $filter
> +		The bundle records a complete history.
> +		EOF
> +		test_cmp expect actual
> +	'
> +done
> +
>  test_done

I think this needs a corresponding documentation update, now "verify"
just talks about output related to whether the history is complete or
not.
Junio C Hamano March 7, 2022, 4:49 p.m. UTC | #4
Derrick Stolee <derrickstolee@github.com> writes:

> Perhaps I was thinking about having one line per "reason", which might
> be extended in the future.

Makes sense.  It wasn't a serious objection but mostly a curiousity
question anyway.
diff mbox series

Patch

diff --git a/bundle.c b/bundle.c
index 2afced4d991..e284ef63062 100644
--- a/bundle.c
+++ b/bundle.c
@@ -334,6 +334,9 @@  static int write_pack_data(int bundle_fd, struct rev_info *revs, struct strvec *
 		     "--stdout", "--thin", "--delta-base-offset",
 		     NULL);
 	strvec_pushv(&pack_objects.args, pack_options->v);
+	if (revs->filter)
+		strvec_pushf(&pack_objects.args, "--filter=%s",
+			     list_objects_filter_spec(revs->filter));
 	pack_objects.in = -1;
 	pack_objects.out = bundle_fd;
 	pack_objects.git_cmd = 1;
@@ -507,10 +510,38 @@  int create_bundle(struct repository *r, const char *path,
 	int bundle_to_stdout;
 	int ref_count = 0;
 	struct rev_info revs, revs_copy;
-	int min_version = the_hash_algo == &hash_algos[GIT_HASH_SHA1] ? 2 : 3;
+	int min_version = 2;
 	struct bundle_prerequisites_info bpi;
 	int i;
 
+	/* init revs to list objects for pack-objects later */
+	save_commit_buffer = 0;
+	repo_init_revisions(r, &revs, NULL);
+
+	/*
+	 * Pre-initialize the '--objects' flag so we can parse a
+	 * --filter option successfully.
+	 */
+	revs.tree_objects = revs.blob_objects = 1;
+
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	/*
+	 * Reasons to require version 3:
+	 *
+	 * 1. @object-format is required because our hash algorithm is not
+	 *    SHA1.
+	 * 2. @filter is required because we parsed an object filter.
+	 */
+	if (the_hash_algo != &hash_algos[GIT_HASH_SHA1] ||
+	    revs.filter)
+		min_version = 3;
+
+	if (argc > 1) {
+		error(_("unrecognized argument: %s"), argv[1]);
+		goto err;
+	}
+
 	bundle_to_stdout = !strcmp(path, "-");
 	if (bundle_to_stdout)
 		bundle_fd = 1;
@@ -533,17 +564,14 @@  int create_bundle(struct repository *r, const char *path,
 		write_or_die(bundle_fd, capability, strlen(capability));
 		write_or_die(bundle_fd, the_hash_algo->name, strlen(the_hash_algo->name));
 		write_or_die(bundle_fd, "\n", 1);
-	}
-
-	/* init revs to list objects for pack-objects later */
-	save_commit_buffer = 0;
-	repo_init_revisions(r, &revs, NULL);
 
-	argc = setup_revisions(argc, argv, &revs, NULL);
-
-	if (argc > 1) {
-		error(_("unrecognized argument: %s"), argv[1]);
-		goto err;
+		if (revs.filter) {
+			const char *value = expand_list_objects_filter_spec(revs.filter);
+			capability = "@filter=";
+			write_or_die(bundle_fd, capability, strlen(capability));
+			write_or_die(bundle_fd, value, strlen(value));
+			write_or_die(bundle_fd, "\n", 1);
+		}
 	}
 
 	/* save revs.pending in revs_copy for later use */
@@ -566,6 +594,12 @@  int create_bundle(struct repository *r, const char *path,
 	bpi.fd = bundle_fd;
 	bpi.pending = &revs_copy.pending;
 
+	/*
+	 * Nullify the filter here, and any object walking. We only care
+	 * about commits and tags here. The revs_copy has the right
+	 * instances of these values.
+	 */
+	revs.filter = NULL;
 	revs.blob_objects = revs.tree_objects = 0;
 	traverse_commit_list(&revs, write_bundle_prerequisites, ignore_object, &bpi);
 	object_array_remove_duplicates(&revs_copy.pending);
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index 6522401617d..39cfefafb65 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -487,4 +487,34 @@  test_expect_success 'unfiltered bundle with --objects' '
 	test_cmp expect actual
 '
 
+for filter in "blob:none" "tree:0" "tree:1" "blob:limit=100"
+do
+	test_expect_success 'filtered bundle: $filter' '
+		test_when_finished rm -rf .git/objects/pack &&
+		git bundle create partial.bdl \
+			--all \
+			--filter=$filter &&
+
+		git bundle verify partial.bdl >unfiltered &&
+		make_user_friendly_and_stable_output <unfiltered >actual &&
+
+		cat >expect <<-EOF &&
+		The bundle contains these 10 refs:
+		<COMMIT-P> refs/heads/main
+		<COMMIT-N> refs/heads/release
+		<COMMIT-D> refs/heads/topic/1
+		<COMMIT-H> refs/heads/topic/2
+		<COMMIT-D> refs/pull/1/head
+		<COMMIT-G> refs/pull/2/head
+		<TAG-1> refs/tags/v1
+		<TAG-2> refs/tags/v2
+		<TAG-3> refs/tags/v3
+		<COMMIT-P> HEAD
+		The bundle uses this filter: $filter
+		The bundle records a complete history.
+		EOF
+		test_cmp expect actual
+	'
+done
+
 test_done