diff mbox series

[v3,02/11] bundle: verify using check_connected()

Message ID 20c29d37f9c1ba1367145331d25dd27f966312cd.1675171759.git.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit d9fd674c8b26c376b37e02d974b92033acb99732
Headers show
Series Bundle URIs V: creationToken heuristic for incremental fetches | expand

Commit Message

Derrick Stolee Jan. 31, 2023, 1:29 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

When Git verifies a bundle to see if it is safe for unbundling, it first
looks to see if the prerequisite commits are in the object store. This
is an easy way to "fail fast" but it is not a sufficient check for
updating refs that guarantee closure under reachability. There could
still be issues if those commits are not reachable from the repository's
references. The repository only has guarantees that its object store is
closed under reachability for the objects that are reachable from
references.

Thus, the code in verify_bundle() has previously had the additional
check that all prerequisite commits are reachable from repository
references. This is done via a revision walk from all references,
stopping only if all prerequisite commits are discovered or all commits
are walked. This uses a custom walk to verify_bundle().

This check is more strict than what Git applies to fetched pack-files.
In the fetch case, Git guarantees that the new references are closed
under reachability by walking from the new references until walking
commits that are reachable from repository refs. This is done through
the well-used check_connected() method.

To better align with the restrictions required by 'git fetch',
reimplement this check in verify_bundle() to use check_connected(). This
also simplifies the code significantly.

The previous change added a test that verified the behavior of 'git
bundle verify' and 'git bundle unbundle' in this case, and the error
messages looked like this:

  error: Could not read <missing-commit>
  fatal: Failed to traverse parents of commit <extant-commit>

However, by changing the revision walk slightly within check_connected()
and using its quiet mode, we can omit those messages. Instead, we get
only this message, tailored to describing the current state of the
repository:

  error: some prerequisite commits exist in the object store,
         but are not connected to the repository's history

(Line break added here for the commit message formatting, only.)

While this message does not include any object IDs, there is no
guarantee that those object IDs would help the user diagnose what is
going on, as they could be separated from the prerequisite commits by
some distance. At minimum, this situation describes the situation in a
more informative way than the previous error messages.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle.c               | 75 ++++++++++++++++--------------------------
 t/t6020-bundle-misc.sh |  8 ++---
 2 files changed, 33 insertions(+), 50 deletions(-)

Comments

Junio C Hamano Jan. 31, 2023, 5:35 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Thus, the code in verify_bundle() has previously had the additional
> check that all prerequisite commits are reachable from repository
> references. This is done via a revision walk from all references,
> stopping only if all prerequisite commits are discovered or all commits
> are walked. This uses a custom walk to verify_bundle().
>
> This check is more strict than what Git applies to fetched pack-files.
> In the fetch case, Git guarantees that the new references are closed
> under reachability by walking from the new references until walking
> commits that are reachable from repository refs. This is done through
> the well-used check_connected() method.
>
> To better align with the restrictions required by 'git fetch',
> reimplement this check in verify_bundle() to use check_connected(). This
> also simplifies the code significantly.

As I often say, breaking repository faster is not the kind of
performance gain we want to have in Git, and I am in favor of this
iteration compared to the earlier version that mostly punted on
ensuring the correctness (rather, it relied on assumption that "most
of the time this should be OK").

>  bundle.c               | 75 ++++++++++++++++--------------------------
>  t/t6020-bundle-misc.sh |  8 ++---
>  2 files changed, 33 insertions(+), 50 deletions(-)

The diffstat is very pleasing to see.

Let me read the postimage along aloud (preimage omitted).

> diff --git a/bundle.c b/bundle.c
> index 4ef7256aa11..76c3a904898 100644
> --- a/bundle.c
> +++ b/bundle.c
> @@ -187,6 +188,21 @@ static int list_refs(struct string_list *r, int argc, const char **argv)
>  /* Remember to update object flag allocation in object.h */
>  #define PREREQ_MARK (1u<<16)
>  
> +struct string_list_iterator {
> +	struct string_list *list;
> +	size_t cur;
> +};
> +
> +static const struct object_id *iterate_ref_map(void *cb_data)
> +{
> +	struct string_list_iterator *iter = cb_data;
> +
> +	if (iter->cur >= iter->list->nr)
> +		return NULL;
> +
> +	return iter->list->items[iter->cur++].util;
> +}

This is to let check_connected() collect all the prerequisite object
names.  OK.

>  int verify_bundle(struct repository *r,
>  		  struct bundle_header *header,
>  		  enum verify_bundle_flags flags)
>  {
>  	/*
>  	 * Do fast check, then if any prereqs are missing then go line by line
>  	 * to be verbose about the errors
>  	 */
>  	struct string_list *p = &header->prerequisites;
> +	int i, ret = 0;
>  	const char *message = _("Repository lacks these prerequisite commits:");
> +	struct string_list_iterator iter = {
> +		.list = p,
> +	};
> +	struct check_connected_options opts = {
> +		.quiet = 1,
> +	};
>  
>  	if (!r || !r->objects || !r->objects->odb)
>  		return error(_("need a repository to verify a bundle"));
>  
>  	for (i = 0; i < p->nr; i++) {
>  		struct string_list_item *e = p->items + i;
>  		const char *name = e->string;
>  		struct object_id *oid = e->util;
>  		struct object *o = parse_object(r, oid);
> +		if (o)
>  			continue;
>  		ret++;
>  		if (flags & VERIFY_BUNDLE_QUIET)
>  			continue;
>  		if (ret == 1)
>  			error("%s", message);
>  		error("%s %s", oid_to_hex(oid), name);
>  	}
> +	if (ret)
>  		goto cleanup;

The "quick fail" logic as before.  Looking sensible.

>  
> +	if ((ret = check_connected(iterate_ref_map, &iter, &opts)))
> +		error(_("some prerequisite commits exist in the object store, "
> +			"but are not connected to the repository's history"));

And then we let check_connected() to ensure that traversing from
these prerequisite objects down to the DAG formed by existing refs
will not die from missing objects.  Makes sense.

> +	/* TODO: preserve this verbose language. */

I am lost -- aren't we preserving the BUNDLE_VERBOSE code below?

>  	if (flags & VERIFY_BUNDLE_VERBOSE) {



> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
> index 38dbbf89155..7d40994991e 100755
> --- a/t/t6020-bundle-misc.sh
> +++ b/t/t6020-bundle-misc.sh
> @@ -595,14 +595,14 @@ test_expect_success 'verify catches unreachable, broken prerequisites' '
>  		# Verify should fail
>  		test_must_fail git bundle verify \
>  			../clone-from/tip.bundle 2>err &&
> -		grep "Could not read $BAD_OID" err &&
> -		grep "Failed to traverse parents of commit $TIP_OID" err &&
> +		grep "some prerequisite commits .* are not connected" err &&
> +		test_line_count = 1 err &&
>  
>  		# Unbundling should fail
>  		test_must_fail git bundle unbundle \
>  			../clone-from/tip.bundle 2>err &&
> -		grep "Could not read $BAD_OID" err &&
> -		grep "Failed to traverse parents of commit $TIP_OID" err
> +		grep "some prerequisite commits .* are not connected" err &&
> +		test_line_count = 1 err
>  	)
>  '

Especially with the new test added in the previous step, we know we
are not trading correctness off.  Excellent.

I wonder how much the performance hit does this version incur over
the "not safe at all" version and over the "use custom and
stricter-than-needed" version, by the way?

Thanks.
Derrick Stolee Jan. 31, 2023, 7:31 p.m. UTC | #2
On 1/31/2023 12:35 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>> +	/* TODO: preserve this verbose language. */
> 
> I am lost -- aren't we preserving the BUNDLE_VERBOSE code below?

Sorry, I put this in so I wouldn't forget to test that the
verbose message sticks, but I did forget to remove it.

>> diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
>> index 38dbbf89155..7d40994991e 100755
>> --- a/t/t6020-bundle-misc.sh
>> +++ b/t/t6020-bundle-misc.sh
>> @@ -595,14 +595,14 @@ test_expect_success 'verify catches unreachable, broken prerequisites' '
>>  		# Verify should fail
>>  		test_must_fail git bundle verify \
>>  			../clone-from/tip.bundle 2>err &&
>> -		grep "Could not read $BAD_OID" err &&
>> -		grep "Failed to traverse parents of commit $TIP_OID" err &&
>> +		grep "some prerequisite commits .* are not connected" err &&
>> +		test_line_count = 1 err &&
>>  
>>  		# Unbundling should fail
>>  		test_must_fail git bundle unbundle \
>>  			../clone-from/tip.bundle 2>err &&
>> -		grep "Could not read $BAD_OID" err &&
>> -		grep "Failed to traverse parents of commit $TIP_OID" err
>> +		grep "some prerequisite commits .* are not connected" err &&
>> +		test_line_count = 1 err
>>  	)
>>  '
> 
> Especially with the new test added in the previous step, we know we
> are not trading correctness off.  Excellent.
> 
> I wonder how much the performance hit does this version incur over
> the "not safe at all" version and over the "use custom and
> stricter-than-needed" version, by the way?

I was able to simulate this in an extreme situation by taking a clone
of the normal Git fork, creating a ref at every first parent, then
creating a bundle of the difference between git/git and gitster/git.
Finally, I compared the performance of 'git bundle verify' on Git
compiled before this change and after this change:

Benchmark 1: old
  Time (mean ± σ):     324.7 ms ±   7.5 ms    [User: 228.0 ms, System: 95.7 ms]
  Range (min … max):   315.3 ms … 338.0 ms    10 runs
 
Benchmark 2: new
  Time (mean ± σ):     331.2 ms ±  20.2 ms    [User: 230.6 ms, System: 99.7 ms]
  Range (min … max):   302.8 ms … 360.2 ms    10 runs
 
Summary
  'old' ran
    1.02 ± 0.07 times faster than 'new'

So, there is a performance difference in the two situations, but it
is very slight, in what I can detect.

Of course, there is the case where the behavior differs because of
the more permissible behavior, but I expect the old algorithm to be
slower than the new case, because check_connected() can terminate
with success (due to seeing all the prerequisite commits) faster
than the old walk can terminate with failure (due to walking all
reachable commits).

Thanks,
-Stolee
Junio C Hamano Jan. 31, 2023, 7:36 p.m. UTC | #3
Derrick Stolee <derrickstolee@github.com> writes:

> ... but I expect the old algorithm to be
> slower than the new case, because check_connected() can terminate
> with success (due to seeing all the prerequisite commits) faster
> than the old walk can terminate with failure (due to walking all
> reachable commits).

That is good to hear.  As long as we are not making it worse than
before, I am happy as a clam ;-)

Thanks.
diff mbox series

Patch

diff --git a/bundle.c b/bundle.c
index 4ef7256aa11..76c3a904898 100644
--- a/bundle.c
+++ b/bundle.c
@@ -12,6 +12,7 @@ 
 #include "refs.h"
 #include "strvec.h"
 #include "list-objects-filter-options.h"
+#include "connected.h"
 
 static const char v2_bundle_signature[] = "# v2 git bundle\n";
 static const char v3_bundle_signature[] = "# v3 git bundle\n";
@@ -187,6 +188,21 @@  static int list_refs(struct string_list *r, int argc, const char **argv)
 /* Remember to update object flag allocation in object.h */
 #define PREREQ_MARK (1u<<16)
 
+struct string_list_iterator {
+	struct string_list *list;
+	size_t cur;
+};
+
+static const struct object_id *iterate_ref_map(void *cb_data)
+{
+	struct string_list_iterator *iter = cb_data;
+
+	if (iter->cur >= iter->list->nr)
+		return NULL;
+
+	return iter->list->items[iter->cur++].util;
+}
+
 int verify_bundle(struct repository *r,
 		  struct bundle_header *header,
 		  enum verify_bundle_flags flags)
@@ -196,26 +212,25 @@  int verify_bundle(struct repository *r,
 	 * to be verbose about the errors
 	 */
 	struct string_list *p = &header->prerequisites;
-	struct rev_info revs = REV_INFO_INIT;
-	const char *argv[] = {NULL, "--all", NULL};
-	struct commit *commit;
-	int i, ret = 0, req_nr;
+	int i, ret = 0;
 	const char *message = _("Repository lacks these prerequisite commits:");
+	struct string_list_iterator iter = {
+		.list = p,
+	};
+	struct check_connected_options opts = {
+		.quiet = 1,
+	};
 
 	if (!r || !r->objects || !r->objects->odb)
 		return error(_("need a repository to verify a bundle"));
 
-	repo_init_revisions(r, &revs, NULL);
 	for (i = 0; i < p->nr; i++) {
 		struct string_list_item *e = p->items + i;
 		const char *name = e->string;
 		struct object_id *oid = e->util;
 		struct object *o = parse_object(r, oid);
-		if (o) {
-			o->flags |= PREREQ_MARK;
-			add_pending_object(&revs, o, name);
+		if (o)
 			continue;
-		}
 		ret++;
 		if (flags & VERIFY_BUNDLE_QUIET)
 			continue;
@@ -223,37 +238,14 @@  int verify_bundle(struct repository *r,
 			error("%s", message);
 		error("%s %s", oid_to_hex(oid), name);
 	}
-	if (revs.pending.nr != p->nr)
+	if (ret)
 		goto cleanup;
-	req_nr = revs.pending.nr;
-	setup_revisions(2, argv, &revs, NULL);
-
-	list_objects_filter_copy(&revs.filter, &header->filter);
-
-	if (prepare_revision_walk(&revs))
-		die(_("revision walk setup failed"));
 
-	i = req_nr;
-	while (i && (commit = get_revision(&revs)))
-		if (commit->object.flags & PREREQ_MARK)
-			i--;
-
-	for (i = 0; i < p->nr; i++) {
-		struct string_list_item *e = p->items + i;
-		const char *name = e->string;
-		const struct object_id *oid = e->util;
-		struct object *o = parse_object(r, oid);
-		assert(o); /* otherwise we'd have returned early */
-		if (o->flags & SHOWN)
-			continue;
-		ret++;
-		if (flags & VERIFY_BUNDLE_QUIET)
-			continue;
-		if (ret == 1)
-			error("%s", message);
-		error("%s %s", oid_to_hex(oid), name);
-	}
+	if ((ret = check_connected(iterate_ref_map, &iter, &opts)))
+		error(_("some prerequisite commits exist in the object store, "
+			"but are not connected to the repository's history"));
 
+	/* TODO: preserve this verbose language. */
 	if (flags & VERIFY_BUNDLE_VERBOSE) {
 		struct string_list *r;
 
@@ -282,15 +274,6 @@  int verify_bundle(struct repository *r,
 				  list_objects_filter_spec(&header->filter));
 	}
 cleanup:
-	/* Clean up objects used, as they will be reused. */
-	for (i = 0; i < p->nr; i++) {
-		struct string_list_item *e = p->items + i;
-		struct object_id *oid = e->util;
-		commit = lookup_commit_reference_gently(r, oid, 1);
-		if (commit)
-			clear_commit_marks(commit, ALL_REV_FLAGS | PREREQ_MARK);
-	}
-	release_revisions(&revs);
 	return ret;
 }
 
diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index 38dbbf89155..7d40994991e 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -595,14 +595,14 @@  test_expect_success 'verify catches unreachable, broken prerequisites' '
 		# Verify should fail
 		test_must_fail git bundle verify \
 			../clone-from/tip.bundle 2>err &&
-		grep "Could not read $BAD_OID" err &&
-		grep "Failed to traverse parents of commit $TIP_OID" err &&
+		grep "some prerequisite commits .* are not connected" err &&
+		test_line_count = 1 err &&
 
 		# Unbundling should fail
 		test_must_fail git bundle unbundle \
 			../clone-from/tip.bundle 2>err &&
-		grep "Could not read $BAD_OID" err &&
-		grep "Failed to traverse parents of commit $TIP_OID" err
+		grep "some prerequisite commits .* are not connected" err &&
+		test_line_count = 1 err
 	)
 '