diff mbox series

[v2.5,02/11] bundle: verify using connected()

Message ID ecc6b167-f5c4-48ce-3973-461d1659ed40@github.com (mailing list archive)
State New, archived
Headers show
Series [v2.5,01/11] bundle: test unbundling with incomplete history | expand

Commit Message

Derrick Stolee Jan. 24, 2023, 2:16 p.m. UTC
On 1/24/2023 7:27 AM, Derrick Stolee wrote:

> I'll focus on this area today and see what I can learn and how I
> can approach this problem in a different way.

>  2. Find out how to modify verify_bundle() so it can do the more
>     relaxed connectivity check.

And here is the modification to verify_bundle() to depend on
check_connected() instead. This also improves (in my opinion) the
error reporting from this situation, as seen in the test edits.

Again, this is a placeholder before I re-roll this series into
an inevitable v3, so don't bother applying this patch until then.

Thanks,
-Stolee

--- >8 ---

From 6a3d64761e9691994f9310b9ce2338f49aa72d48 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <derrickstolee@github.com>
Date: Tue, 24 Jan 2023 08:47:00 -0500
Subject: [PATCH v2.5 02/11] bundle: verify using connected()

When Git verifies a bundle to see if it is safe for unbundling, it first
looks to see if the prerequisite commits are in the object store. This
is usually a sufficient filter, and those missing commits are indicated
clearly in the error messages. However, if the commits are present in
the object store, then there could still be issues if those commits are
not reachable from the repository's references. The repository only has
guarantees that its object store is closed under reachability for the
objects that are reachable from references.

Thus, the code in verify_bundle() has previously had the additional
check that all prerequisite commits are reachable from repository
references. This is done via a revision walk from all references,
stopping only if all prerequisite commits are discovered or all commits
are walked. This uses a custom walk to verify_bundle().

This check is more strict than what Git applies even to fetched
pack-files. In the fetch case, Git guarantees that the new references
are closed under reachability by walking from the new references until
walking commits that are reachable from repository refs. This is done
through the well-used check_connected() method.

To better align with the restrictions required by 'git fetch',
reimplement this check in verify_bundle() to use check_connected(). This
also simplifies the code significantly.

The previous change added a test that verified the behavior of 'git
bundle verify' and 'git bundle unbundle' in this case, and the error
messages looked like this:

  error: Could not read <missing-commit>
  fatal: Failed to traverse parents of commit <extant-commit>

However, by changing the revision walk slightly within check_connected()
and using its quiet mode, we can omit those messages. Instead, we get
only this message, tailored to describing the current state of the
repository:

  error: some prerequisite commits exist in the object store,
         but are not connected to the repository's history

(Line break added here for the commit message formatting, only.)

While this message does not include any object IDs, there is no
guarantee that those object IDs would help the user diagnose what is
going on, as they could be separated from the prerequisite commits by
some distance. At minimum, this situation describes the situation in a
more informative way than the previous error messages.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle.c               | 75 ++++++++++++++++--------------------------
 t/t6020-bundle-misc.sh |  8 ++---
 2 files changed, 33 insertions(+), 50 deletions(-)

--
2.39.1.vfs.0.0

Comments

Junio C Hamano Jan. 24, 2023, 5:33 p.m. UTC | #1
Derrick Stolee <derrickstolee@github.com> writes:

> When Git verifies a bundle to see if it is safe for unbundling, it first
> looks to see if the prerequisite commits are in the object store. This
> is usually a sufficient filter, and those missing commits are indicated
> clearly in the error messages.

I am not sure if our early check is because "does the prerequisite
even exist?" is sufficient.  It is a short-cut that is cheap and can
be done without preparing the commit traversal.

> However, if the commits are present in
> the object store, then there could still be issues if those commits are
> not reachable from the repository's references. The repository only has
> guarantees that its object store is closed under reachability for the
> objects that are reachable from references.

Correct.

> Thus, the code in verify_bundle() has previously had the additional
> check that all prerequisite commits are reachable from repository
> references. This is done via a revision walk from all references,
> stopping only if all prerequisite commits are discovered or all commits
> are walked. This uses a custom walk to verify_bundle().

Correct.

> This check is more strict than what Git applies even to fetched
> pack-files.

I do not see the need to say "even" here.  In what other situation
do we make connectivity checks, and is there a need to be more
strict than others when checking fetched packfiles?

> In the fetch case, Git guarantees that the new references
> are closed under reachability by walking from the new references until
> walking commits that are reachable from repository refs. This is done
> through the well-used check_connected() method.

Correct and is a good point to make.

> To better align with the restrictions required by 'git fetch',
> reimplement this check in verify_bundle() to use check_connected(). This
> also simplifies the code significantly.

Wonderful.  I never liked the custom check done in unbundle code,
which I am reasonably sure came from scripted hack to unbundle I
wrote eons ago.
Derrick Stolee Jan. 24, 2023, 6:46 p.m. UTC | #2
On 1/24/2023 12:33 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> When Git verifies a bundle to see if it is safe for unbundling, it first
>> looks to see if the prerequisite commits are in the object store. This
>> is usually a sufficient filter, and those missing commits are indicated
>> clearly in the error messages.
> 
> I am not sure if our early check is because "does the prerequisite
> even exist?" is sufficient.  It is a short-cut that is cheap and can
> be done without preparing the commit traversal.

I suppose I should say "Usually, existence in the object store is
correlated with having all reachable objects, but this is not
guaranteed." I'll also mention that it is a short-cut that can fail
faster than the reachability check.

>> This check is more strict than what Git applies even to fetched
>> pack-files.
> 
> I do not see the need to say "even" here.  In what other situation
> do we make connectivity checks, and is there a need to be more
> strict than others when checking fetched packfiles?

I suppose that I was implying that fetches are the more common
operation, and the scrutiny applied to an arbitrary pack-file from
a remote is probably higher there. However, who knows where a
bundle came from, so the scrutiny should be the same.

>> To better align with the restrictions required by 'git fetch',
>> reimplement this check in verify_bundle() to use check_connected(). This
>> also simplifies the code significantly.
> 
> Wonderful.  I never liked the custom check done in unbundle code,
> which I am reasonably sure came from scripted hack to unbundle I
> wrote eons ago.
 
Excellent. Thanks for your feedback.

-Stolee
Junio C Hamano Jan. 24, 2023, 8:41 p.m. UTC | #3
Derrick Stolee <derrickstolee@github.com> writes:

>> I do not see the need to say "even" here.  In what other situation
>> do we make connectivity checks, and is there a need to be more
>> strict than others when checking fetched packfiles?
>
> I suppose that I was implying that fetches are the more common
> operation, and the scrutiny applied to an arbitrary pack-file from
> a remote is probably higher there. However, who knows where a
> bundle came from, so the scrutiny should be the same.

Ah, I see that is where that "even" came from.  And yes, I agree
that unbundling and fetch should be suspicious of their input to the
same degree.
diff mbox series

Patch

diff --git a/bundle.c b/bundle.c
index 4ef7256aa11..76c3a904898 100644
--- a/bundle.c
+++ b/bundle.c
@@ -12,6 +12,7 @@ 
 #include "refs.h"
 #include "strvec.h"
 #include "list-objects-filter-options.h"
+#include "connected.h"

 static const char v2_bundle_signature[] = "# v2 git bundle\n";
 static const char v3_bundle_signature[] = "# v3 git bundle\n";
@@ -187,6 +188,21 @@  static int list_refs(struct string_list *r, int argc, const char **argv)
 /* Remember to update object flag allocation in object.h */
 #define PREREQ_MARK (1u<<16)

+struct string_list_iterator {
+	struct string_list *list;
+	size_t cur;
+};
+
+static const struct object_id *iterate_ref_map(void *cb_data)
+{
+	struct string_list_iterator *iter = cb_data;
+
+	if (iter->cur >= iter->list->nr)
+		return NULL;
+
+	return iter->list->items[iter->cur++].util;
+}
+
 int verify_bundle(struct repository *r,
 		  struct bundle_header *header,
 		  enum verify_bundle_flags flags)
@@ -196,26 +212,25 @@  int verify_bundle(struct repository *r,
 	 * to be verbose about the errors
 	 */
 	struct string_list *p = &header->prerequisites;
-	struct rev_info revs = REV_INFO_INIT;
-	const char *argv[] = {NULL, "--all", NULL};
-	struct commit *commit;
-	int i, ret = 0, req_nr;
+	int i, ret = 0;
 	const char *message = _("Repository lacks these prerequisite commits:");
+	struct string_list_iterator iter = {
+		.list = p,
+	};
+	struct check_connected_options opts = {
+		.quiet = 1,
+	};

 	if (!r || !r->objects || !r->objects->odb)
 		return error(_("need a repository to verify a bundle"));

-	repo_init_revisions(r, &revs, NULL);
 	for (i = 0; i < p->nr; i++) {
 		struct string_list_item *e = p->items + i;
 		const char *name = e->string;
 		struct object_id *oid = e->util;
 		struct object *o = parse_object(r, oid);
-		if (o) {
-			o->flags |= PREREQ_MARK;
-			add_pending_object(&revs, o, name);
+		if (o)
 			continue;
-		}
 		ret++;
 		if (flags & VERIFY_BUNDLE_QUIET)
 			continue;
@@ -223,37 +238,14 @@  int verify_bundle(struct repository *r,
 			error("%s", message);
 		error("%s %s", oid_to_hex(oid), name);
 	}
-	if (revs.pending.nr != p->nr)
+	if (ret)
 		goto cleanup;
-	req_nr = revs.pending.nr;
-	setup_revisions(2, argv, &revs, NULL);
-
-	list_objects_filter_copy(&revs.filter, &header->filter);
-
-	if (prepare_revision_walk(&revs))
-		die(_("revision walk setup failed"));

-	i = req_nr;
-	while (i && (commit = get_revision(&revs)))
-		if (commit->object.flags & PREREQ_MARK)
-			i--;
-
-	for (i = 0; i < p->nr; i++) {
-		struct string_list_item *e = p->items + i;
-		const char *name = e->string;
-		const struct object_id *oid = e->util;
-		struct object *o = parse_object(r, oid);
-		assert(o); /* otherwise we'd have returned early */
-		if (o->flags & SHOWN)
-			continue;
-		ret++;
-		if (flags & VERIFY_BUNDLE_QUIET)
-			continue;
-		if (ret == 1)
-			error("%s", message);
-		error("%s %s", oid_to_hex(oid), name);
-	}
+	if ((ret = check_connected(iterate_ref_map, &iter, &opts)))
+		error(_("some prerequisite commits exist in the object store, "
+			"but are not connected to the repository's history"));

+	/* TODO: preserve this verbose language. */
 	if (flags & VERIFY_BUNDLE_VERBOSE) {
 		struct string_list *r;

@@ -282,15 +274,6 @@  int verify_bundle(struct repository *r,
 				  list_objects_filter_spec(&header->filter));
 	}
 cleanup:
-	/* Clean up objects used, as they will be reused. */
-	for (i = 0; i < p->nr; i++) {
-		struct string_list_item *e = p->items + i;
-		struct object_id *oid = e->util;
-		commit = lookup_commit_reference_gently(r, oid, 1);
-		if (commit)
-			clear_commit_marks(commit, ALL_REV_FLAGS | PREREQ_MARK);
-	}
-	release_revisions(&revs);
 	return ret;
 }

diff --git a/t/t6020-bundle-misc.sh b/t/t6020-bundle-misc.sh
index 38dbbf89155..7d40994991e 100755
--- a/t/t6020-bundle-misc.sh
+++ b/t/t6020-bundle-misc.sh
@@ -595,14 +595,14 @@  test_expect_success 'verify catches unreachable, broken prerequisites' '
 		# Verify should fail
 		test_must_fail git bundle verify \
 			../clone-from/tip.bundle 2>err &&
-		grep "Could not read $BAD_OID" err &&
-		grep "Failed to traverse parents of commit $TIP_OID" err &&
+		grep "some prerequisite commits .* are not connected" err &&
+		test_line_count = 1 err &&

 		# Unbundling should fail
 		test_must_fail git bundle unbundle \
 			../clone-from/tip.bundle 2>err &&
-		grep "Could not read $BAD_OID" err &&
-		grep "Failed to traverse parents of commit $TIP_OID" err
+		grep "some prerequisite commits .* are not connected" err &&
+		test_line_count = 1 err
 	)
 '