diff mbox series

[v4,08/11] bundle: add flags to verify_bundle(), skip walk

Message ID 83f2cd893a4c47c947a93fe99d202d67f540e963.1665417859.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Bundle URIs III: Parse and download from bundle lists | expand

Commit Message

Derrick Stolee Oct. 10, 2022, 4:04 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The verify_bundle() method checks if a bundle can be applied to a given
repository. This not only verifies that certain commits exist in the
repository, but Git also checks that these commits are reachable.

This behavior dates back to the original git-bundle builtin written in
2e0afafebd8 (Add git-bundle: move objects and references by archive,
2007-02-22), but the message does not go into detail why the
reachability check is important.

Since verify_bundle() is called from unbundle(), we need to add an
option to pipe the flags through that method.

When unbundling from a list of bundles, Git will create refs that point
to the tips of the latest bundle, which makes this reachability walk
succeed, in theory. However, the loose refs cache does not get
invalidated and hence the reachability walk fails. By disabling the
reachability walk in the bundle URI code, we can get around this
reachability check.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/bundle.c |  5 +++--
 bundle-uri.c     |  8 +++++++-
 bundle.c         | 12 +++++++-----
 bundle.h         | 15 +++++++++++++--
 transport.c      |  2 +-
 5 files changed, 31 insertions(+), 11 deletions(-)

Comments

Junio C Hamano Oct. 10, 2022, 5:27 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> The verify_bundle() method checks if a bundle can be applied to a given
> repository. This not only verifies that certain commits exist in the
> repository, but Git also checks that these commits are reachable.
>
> This behavior dates back to the original git-bundle builtin written in
> 2e0afafebd8 (Add git-bundle: move objects and references by archive,
> 2007-02-22), but the message does not go into detail why the
> reachability check is important.
>
> Since verify_bundle() is called from unbundle(), we need to add an
> option to pipe the flags through that method.

All makes sense.

> When unbundling from a list of bundles, Git will create refs that point
> to the tips of the latest bundle, which makes this reachability walk
> succeed, in theory. However, the loose refs cache does not get
> invalidated and hence the reachability walk fails. By disabling the
> reachability walk in the bundle URI code, we can get around this
> reachability check.

The above makes it sound like the real culprit is that cache goes
out of sync and the presented solution is a workaround; readers are
left in suspense if the "real" solution (as opposed to a workaround)
would come in a later step or in a future series.

> diff --git a/bundle-uri.c b/bundle-uri.c
> index 8a7c11c6393..ad5baabdd94 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -301,7 +301,13 @@ static int unbundle_from_file(struct repository *r, const char *file)
>  	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
>  		return 1;
>  
> -	if ((result = unbundle(r, &header, bundle_fd, NULL)))
> +	/*
> +	 * Skip the reachability walk here, since we will be adding
> +	 * a reachable ref pointing to the new tips, which will reach
> +	 * the prerequisite commits.
> +	 */
> +	if ((result = unbundle(r, &header, bundle_fd, NULL,
> +			       VERIFY_BUNDLE_SKIP_REACHABLE)))
>  		return 1;

This is not a new problem introduced in this new round, but if we
are updating this, can we fix it to omit assignment inside if
condition?

 * result is initialized to 0.

 * when unbundle returns non-zero, it is assigned to result and the
   function returns immediately, discarding whatever was assigned to
   the variable.

 * if unbundle returns zero, it is assigned to result and the
   control continues from here.  We know result is set to 0, but
   then that is what it was initialized earlier.
Derrick Stolee Oct. 10, 2022, 6:13 p.m. UTC | #2
On 10/10/2022 1:27 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>> When unbundling from a list of bundles, Git will create refs that point
>> to the tips of the latest bundle, which makes this reachability walk
>> succeed, in theory. However, the loose refs cache does not get
>> invalidated and hence the reachability walk fails. By disabling the
>> reachability walk in the bundle URI code, we can get around this
>> reachability check.
> 
> The above makes it sound like the real culprit is that cache goes
> out of sync and the presented solution is a workaround; readers are
> left in suspense if the "real" solution (as opposed to a workaround)
> would come in a later step or in a future series.

I've been going over the refs code multiple times today trying to
fix this "real" culprit, with no luck. I can share this interesting
point:

 * The initial loop over the bundles tries to apply each, but the
   prerequisite objects are not present so we never reach the revision
   walk. A refs/bundle/* ref is added via update_ref().

 * The second loop over the bundles tries to apply each, but the only
   bundle with its prerequisites present also finds the commits as
   reachable (this must be where the loose ref cache is populated).
   Then, a refs/bundle/* ref is added via update_ref().

 * The third loop over the bundles finds a bundle whose prerequisites
   are present, but verify_bundle() rejected it because those commits
   were not seen from any ref.

Other than identifying that issue, I was unable to track down exactly
what is happening here or offer a fix. I had considered inserting
more cache frees deep in the refs code, but I wasn't sure what effect
that would have across the wider system.

>> diff --git a/bundle-uri.c b/bundle-uri.c
>> index 8a7c11c6393..ad5baabdd94 100644
>> --- a/bundle-uri.c
>> +++ b/bundle-uri.c
>> @@ -301,7 +301,13 @@ static int unbundle_from_file(struct repository *r, const char *file)
>>  	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
>>  		return 1;
>>  
>> -	if ((result = unbundle(r, &header, bundle_fd, NULL)))
>> +	/*
>> +	 * Skip the reachability walk here, since we will be adding
>> +	 * a reachable ref pointing to the new tips, which will reach
>> +	 * the prerequisite commits.
>> +	 */
>> +	if ((result = unbundle(r, &header, bundle_fd, NULL,
>> +			       VERIFY_BUNDLE_SKIP_REACHABLE)))
>>  		return 1;
> 
> This is not a new problem introduced in this new round, but if we
> are updating this, can we fix it to omit assignment inside if
> condition?
> 
>  * result is initialized to 0.
> 
>  * when unbundle returns non-zero, it is assigned to result and the
>    function returns immediately, discarding whatever was assigned to
>    the variable.
> 
>  * if unbundle returns zero, it is assigned to result and the
>    control continues from here.  We know result is set to 0, but
>    then that is what it was initialized earlier.
 
Since we are not "trusting" the integer result of unbundle, we
can definitely stop this assignment in the if.

Thanks,
-Stolee
Junio C Hamano Oct. 10, 2022, 6:40 p.m. UTC | #3
Derrick Stolee <derrickstolee@github.com> writes:

> I've been going over the refs code multiple times today trying to
> fix this "real" culprit, with no luck. I can share this interesting
> point:
>
>  * The initial loop over the bundles tries to apply each, but the
>    prerequisite objects are not present so we never reach the revision
>    walk. A refs/bundle/* ref is added via update_ref().
>
>  * The second loop over the bundles tries to apply each, but the only
>    bundle with its prerequisites present also finds the commits as
>    reachable (this must be where the loose ref cache is populated).
>    Then, a refs/bundle/* ref is added via update_ref().
>
>  * The third loop over the bundles finds a bundle whose prerequisites
>    are present, but verify_bundle() rejected it because those commits
>    were not seen from any ref.
>
> Other than identifying that issue, I was unable to track down exactly
> what is happening here or offer a fix. I had considered inserting
> more cache frees deep in the refs code, but I wasn't sure what effect
> that would have across the wider system.

OK.  That certainly is understandable.

As a comment in the proposed log message that BUNDLE_SKIP_REACHABLE
bit is a band aid papering over a problem we punted in this series,
to guide future developers, I think what you wrote is sufficient.
We do not want them to think that skipping the check is our
preferred longer term solution and add their own hack to keep
skipping the check when they resolve "the real culprit".

Thanks.
Derrick Stolee Oct. 11, 2022, 7:04 p.m. UTC | #4
On 10/10/2022 2:40 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> I've been going over the refs code multiple times today trying to
>> fix this "real" culprit, with no luck. I can share this interesting
>> point:
>>
>>  * The initial loop over the bundles tries to apply each, but the
>>    prerequisite objects are not present so we never reach the revision
>>    walk. A refs/bundle/* ref is added via update_ref().
>>
>>  * The second loop over the bundles tries to apply each, but the only
>>    bundle with its prerequisites present also finds the commits as
>>    reachable (this must be where the loose ref cache is populated).
>>    Then, a refs/bundle/* ref is added via update_ref().
>>
>>  * The third loop over the bundles finds a bundle whose prerequisites
>>    are present, but verify_bundle() rejected it because those commits
>>    were not seen from any ref.
>>
>> Other than identifying that issue, I was unable to track down exactly
>> what is happening here or offer a fix. I had considered inserting
>> more cache frees deep in the refs code, but I wasn't sure what effect
>> that would have across the wider system.
> 
> OK.  That certainly is understandable.
> 
> As a comment in the proposed log message that BUNDLE_SKIP_REACHABLE
> bit is a band aid papering over a problem we punted in this series,
> to guide future developers, I think what you wrote is sufficient.
> We do not want them to think that skipping the check is our
> preferred longer term solution and add their own hack to keep
> skipping the check when they resolve "the real culprit".

I have discovered the real culprit, and my expectation was incorrect
about the loose ref cache. The key issue was that I was looking at
this loop:

	i = req_nr;
	while (i && (commit = get_revision(&revs)))
		if (commit->object.flags & PREREQ_MARK)
			i--;

and noticing that only one commit was being visited. I was not 
seeing the actually-important commit. But it wasn't the revision
walk's fault. The loop was terminating because "i" was reaching
zero!

It turns out that verify_bundles() is not clearing the PREREQ_MARK
flag, so multiple runs would incorrectly hit this short-circuit
and terminate the walk early.

I'll replace this patch with the correct fix soon.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/builtin/bundle.c b/builtin/bundle.c
index 2adad545a2e..7d983a238f0 100644
--- a/builtin/bundle.c
+++ b/builtin/bundle.c
@@ -119,7 +119,8 @@  static int cmd_bundle_verify(int argc, const char **argv, const char *prefix) {
 		goto cleanup;
 	}
 	close(bundle_fd);
-	if (verify_bundle(the_repository, &header, !quiet)) {
+	if (verify_bundle(the_repository, &header,
+			  quiet ? 0 : VERIFY_BUNDLE_VERBOSE)) {
 		ret = 1;
 		goto cleanup;
 	}
@@ -185,7 +186,7 @@  static int cmd_bundle_unbundle(int argc, const char **argv, const char *prefix)
 		strvec_pushl(&extra_index_pack_args, "-v", "--progress-title",
 			     _("Unbundling objects"), NULL);
 	ret = !!unbundle(the_repository, &header, bundle_fd,
-			 &extra_index_pack_args) ||
+			 &extra_index_pack_args, 0) ||
 		list_bundle_refs(&header, argc, argv);
 	bundle_header_release(&header);
 cleanup:
diff --git a/bundle-uri.c b/bundle-uri.c
index 8a7c11c6393..ad5baabdd94 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -301,7 +301,13 @@  static int unbundle_from_file(struct repository *r, const char *file)
 	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
 		return 1;
 
-	if ((result = unbundle(r, &header, bundle_fd, NULL)))
+	/*
+	 * Skip the reachability walk here, since we will be adding
+	 * a reachable ref pointing to the new tips, which will reach
+	 * the prerequisite commits.
+	 */
+	if ((result = unbundle(r, &header, bundle_fd, NULL,
+			       VERIFY_BUNDLE_SKIP_REACHABLE)))
 		return 1;
 
 	/*
diff --git a/bundle.c b/bundle.c
index 0208e6d90d3..36ffeb1e0eb 100644
--- a/bundle.c
+++ b/bundle.c
@@ -189,7 +189,7 @@  static int list_refs(struct string_list *r, int argc, const char **argv)
 
 int verify_bundle(struct repository *r,
 		  struct bundle_header *header,
-		  int verbose)
+		  enum verify_bundle_flags flags)
 {
 	/*
 	 * Do fast check, then if any prereqs are missing then go line by line
@@ -222,7 +222,8 @@  int verify_bundle(struct repository *r,
 			error("%s", message);
 		error("%s %s", oid_to_hex(oid), name);
 	}
-	if (revs.pending.nr != p->nr)
+	if (revs.pending.nr != p->nr ||
+	    (flags & VERIFY_BUNDLE_SKIP_REACHABLE))
 		goto cleanup;
 	req_nr = revs.pending.nr;
 	setup_revisions(2, argv, &revs, NULL);
@@ -259,7 +260,7 @@  int verify_bundle(struct repository *r,
 			clear_commit_marks(commit, ALL_REV_FLAGS);
 	}
 
-	if (verbose) {
+	if (flags & VERIFY_BUNDLE_VERBOSE) {
 		struct string_list *r;
 
 		r = &header->references;
@@ -620,7 +621,8 @@  err:
 }
 
 int unbundle(struct repository *r, struct bundle_header *header,
-	     int bundle_fd, struct strvec *extra_index_pack_args)
+	     int bundle_fd, struct strvec *extra_index_pack_args,
+	     enum verify_bundle_flags flags)
 {
 	struct child_process ip = CHILD_PROCESS_INIT;
 	strvec_pushl(&ip.args, "index-pack", "--fix-thin", "--stdin", NULL);
@@ -634,7 +636,7 @@  int unbundle(struct repository *r, struct bundle_header *header,
 		strvec_clear(extra_index_pack_args);
 	}
 
-	if (verify_bundle(r, header, 0))
+	if (verify_bundle(r, header, flags))
 		return -1;
 	ip.in = bundle_fd;
 	ip.no_stdout = 1;
diff --git a/bundle.h b/bundle.h
index 0c052f54964..9f798c00d93 100644
--- a/bundle.h
+++ b/bundle.h
@@ -29,7 +29,14 @@  int read_bundle_header_fd(int fd, struct bundle_header *header,
 int create_bundle(struct repository *r, const char *path,
 		  int argc, const char **argv, struct strvec *pack_options,
 		  int version);
-int verify_bundle(struct repository *r, struct bundle_header *header, int verbose);
+
+enum verify_bundle_flags {
+	VERIFY_BUNDLE_VERBOSE = (1 << 0),
+	VERIFY_BUNDLE_SKIP_REACHABLE = (1 << 1)
+};
+
+int verify_bundle(struct repository *r, struct bundle_header *header,
+		  enum verify_bundle_flags flags);
 
 /**
  * Unbundle after reading the header with read_bundle_header().
@@ -40,9 +47,13 @@  int verify_bundle(struct repository *r, struct bundle_header *header, int verbos
  * Provide "extra_index_pack_args" to pass any extra arguments
  * (e.g. "-v" for verbose/progress), NULL otherwise. The provided
  * "extra_index_pack_args" (if any) will be strvec_clear()'d for you.
+ *
+ * Before unbundling, this method will call verify_bundle() with the
+ * given 'flags'.
  */
 int unbundle(struct repository *r, struct bundle_header *header,
-	     int bundle_fd, struct strvec *extra_index_pack_args);
+	     int bundle_fd, struct strvec *extra_index_pack_args,
+	     enum verify_bundle_flags flags);
 int list_bundle_refs(struct bundle_header *header,
 		int argc, const char **argv);
 
diff --git a/transport.c b/transport.c
index 52db7a3cb09..c5d3042731a 100644
--- a/transport.c
+++ b/transport.c
@@ -178,7 +178,7 @@  static int fetch_refs_from_bundle(struct transport *transport,
 	if (!data->get_refs_from_bundle_called)
 		get_refs_from_bundle_inner(transport);
 	ret = unbundle(the_repository, &data->header, data->fd,
-		       &extra_index_pack_args);
+		       &extra_index_pack_args, 0);
 	transport->hash_algo = data->header.hash_algo;
 	return ret;
 }