diff mbox series

clone: do faster object check for partial clones

Message ID 6de682d5e48186970644569586fc6613763d5caa.1554312374.git.steadmon@google.com (mailing list archive)
State New, archived
Headers show
Series clone: do faster object check for partial clones | expand

Commit Message

Josh Steadmon April 3, 2019, 5:27 p.m. UTC
For partial clones, doing a full connectivity check is wasteful; we skip
promisor objects (which, for a partial clone, is all known objects), and
excluding them all from the connectivity check can take a significant
amount of time on large repos.

At most, we want to make sure that we get the objects referred to by any
wanted refs. For partial clones, just check that these objects were
transferred.

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
 builtin/clone.c |  6 ++++--
 connected.c     | 15 +++++++++++++++
 connected.h     |  8 ++++++++
 3 files changed, 27 insertions(+), 2 deletions(-)

Comments

Jonathan Tan April 3, 2019, 6:58 p.m. UTC | #1
> For partial clones, doing a full connectivity check is wasteful; we skip
> promisor objects (which, for a partial clone, is all known objects), and
> excluding them all from the connectivity check can take a significant
> amount of time on large repos.

Instead of "excluding them all", I would word this as "enumerating them
all so that we can exclude them" - the enumerating is the slow part, not
the excluding (which actually makes things faster).

> +	if (opt->check_refs_only) {
> +		/*
> +		 * For partial clones, we don't want to walk the full commit
> +		 * graph because we're skipping promisor objects anyway. We
> +		 * should just check that objects referenced by wanted refs were
> +		 * transferred.

The enumeration of promisor objects to be excluded is done through
for_each_packed_object() (see is_promisor_object()), not through walking
the commit graph. Maybe reword this comment to be similar to what I've
suggested for the commit message.

> @@ -46,6 +46,14 @@ struct check_connected_options {
>  	 * during a fetch.
>  	 */
>  	unsigned is_deepening_fetch : 1;
> +
> +	/*
> +	 * If non-zero, only check the top-level objects referenced by the
> +	 * wanted refs (passed in as cb_data). This is useful for partial
> +	 * clones, where this can be much faster than excluding all promisor
> +	 * objects prior to walking the commit graph.
> +	 */
> +	unsigned check_refs_only : 1;
>  };

Same enumerating vs excluding comment as before.

Aside from that: thinking from scratch, we want something that tells
check_connected() to avoid anything that enumerates the list of promised
objects, since the objects that we're checking are all promisor objects,
and thus any outgoing links are automatically promised. I would include
some of this explanation in the comment, but in the interest of trying
to avoid a bikeshedding discussion, I don't consider this necessary.
Jeff King April 3, 2019, 7:41 p.m. UTC | #2
On Wed, Apr 03, 2019 at 10:27:21AM -0700, Josh Steadmon wrote:

> For partial clones, doing a full connectivity check is wasteful; we skip
> promisor objects (which, for a partial clone, is all known objects), and
> excluding them all from the connectivity check can take a significant
> amount of time on large repos.
> 
> At most, we want to make sure that we get the objects referred to by any
> wanted refs. For partial clones, just check that these objects were
> transferred.

This isn't strictly true, since we could get objects from elsewhere via
--shared or --reference. Those might not be promisor objects.

Shouldn't we be able to stop a traversal as soon as we see that an
object is in a promisor pack?

I.e., here:

> +	if (opt->check_refs_only) {
> +		/*
> +		 * For partial clones, we don't want to walk the full commit
> +		 * graph because we're skipping promisor objects anyway. We
> +		 * should just check that objects referenced by wanted refs were
> +		 * transferred.
> +		 */
> +		do {
> +			if (!repo_has_object_file(the_repository, &oid))
> +				return 1;
> +		} while (!fn(cb_data, &oid));
> +		return 0;
> +	}

for each object where you ask "do we have this?" we could, for the same
cost, ask "do we have this in a promisor pack?". And the answer would be
yes for each in a partial clone.

But that would also cleanly handle --shared, etc, that I mentioned. And
more importantly, it would _also_ work on fetches. If I fetch from you
and get a promisor pack, then there is no point in me enumerating every
tree you sent. As long as you gave me a tip tree, then you have promised
that you'd give me all the others if I ask.

So it seems like this should be a feature of the child rev-list, to stop
walking the graph at any object that is in a promisor pack.

-Peff
Jonathan Tan April 3, 2019, 8:57 p.m. UTC | #3
> On Wed, Apr 03, 2019 at 10:27:21AM -0700, Josh Steadmon wrote:
> 
> > For partial clones, doing a full connectivity check is wasteful; we skip
> > promisor objects (which, for a partial clone, is all known objects), and
> > excluding them all from the connectivity check can take a significant
> > amount of time on large repos.
> > 
> > At most, we want to make sure that we get the objects referred to by any
> > wanted refs. For partial clones, just check that these objects were
> > transferred.
> 
> This isn't strictly true, since we could get objects from elsewhere via
> --shared or --reference. Those might not be promisor objects.

I don't think local clones (which --shared or --reference implies) can
be partial, but the bigger point is below.

> Shouldn't we be able to stop a traversal as soon as we see that an
> object is in a promisor pack?
> 
> I.e., here:
> 
> > +	if (opt->check_refs_only) {
> > +		/*
> > +		 * For partial clones, we don't want to walk the full commit
> > +		 * graph because we're skipping promisor objects anyway. We
> > +		 * should just check that objects referenced by wanted refs were
> > +		 * transferred.
> > +		 */
> > +		do {
> > +			if (!repo_has_object_file(the_repository, &oid))
> > +				return 1;
> > +		} while (!fn(cb_data, &oid));
> > +		return 0;
> > +	}
> 
> for each object where you ask "do we have this?" we could, for the same
> cost, ask "do we have this in a promisor pack?". And the answer would be
> yes for each in a partial clone.
> 
> But that would also cleanly handle --shared, etc, that I mentioned. And
> more importantly, it would _also_ work on fetches. If I fetch from you
> and get a promisor pack, then there is no point in me enumerating every
> tree you sent. As long as you gave me a tip tree, then you have promised
> that you'd give me all the others if I ask.
> 
> So it seems like this should be a feature of the child rev-list, to stop
> walking the graph at any object that is in a promisor pack.

We currently already do a less optimal version of this - we pass
--exclude-promisor-objects to rev-list, which indeed stops traversal at
any promisor objects (whether in a promisor pack or referenced by one).
As far as I know, the problem is that to do so, we currently enumerate
all the objects in all promisor packs, and all objects that those
objects reference (which means we inflate them too) - so that we have an
oidset that we can check objects against.

A partial solution is for is_promisor_object() to first check if the
given object is in a promisor pack, avoiding generating the set of
promisor objects until necessary. This would work in a blob:none clone
with the refs pointing all to commits or all to blobs, but would not
work in a tree:none clone (or maybe, in this case, the clone would be
small enough that performance is not a concern, hmm).

Maybe the ideal solution is for rev-list to check if an object is in a
promisor pack and if --exclude-promisor-objects is active, we do not
follow any outgoing links.
Josh Steadmon April 4, 2019, 12:21 a.m. UTC | #4
On 2019.04.03 13:57, Jonathan Tan wrote:
> > On Wed, Apr 03, 2019 at 10:27:21AM -0700, Josh Steadmon wrote:
> > 
> > > For partial clones, doing a full connectivity check is wasteful; we skip
> > > promisor objects (which, for a partial clone, is all known objects), and
> > > excluding them all from the connectivity check can take a significant
> > > amount of time on large repos.
> > > 
> > > At most, we want to make sure that we get the objects referred to by any
> > > wanted refs. For partial clones, just check that these objects were
> > > transferred.
> > 
> > This isn't strictly true, since we could get objects from elsewhere via
> > --shared or --reference. Those might not be promisor objects.
> 
> I don't think local clones (which --shared or --reference implies) can
> be partial, but the bigger point is below.
> 
> > Shouldn't we be able to stop a traversal as soon as we see that an
> > object is in a promisor pack?
> > 
> > I.e., here:
> > 
> > > +	if (opt->check_refs_only) {
> > > +		/*
> > > +		 * For partial clones, we don't want to walk the full commit
> > > +		 * graph because we're skipping promisor objects anyway. We
> > > +		 * should just check that objects referenced by wanted refs were
> > > +		 * transferred.
> > > +		 */
> > > +		do {
> > > +			if (!repo_has_object_file(the_repository, &oid))
> > > +				return 1;
> > > +		} while (!fn(cb_data, &oid));
> > > +		return 0;
> > > +	}
> > 
> > for each object where you ask "do we have this?" we could, for the same
> > cost, ask "do we have this in a promisor pack?". And the answer would be
> > yes for each in a partial clone.
> > 
> > But that would also cleanly handle --shared, etc, that I mentioned. And
> > more importantly, it would _also_ work on fetches. If I fetch from you
> > and get a promisor pack, then there is no point in me enumerating every
> > tree you sent. As long as you gave me a tip tree, then you have promised
> > that you'd give me all the others if I ask.
> > 
> > So it seems like this should be a feature of the child rev-list, to stop
> > walking the graph at any object that is in a promisor pack.
> 
> We currently already do a less optimal version of this - we pass
> --exclude-promisor-objects to rev-list, which indeed stops traversal at
> any promisor objects (whether in a promisor pack or referenced by one).
> As far as I know, the problem is that to do so, we currently enumerate
> all the objects in all promisor packs, and all objects that those
> objects reference (which means we inflate them too) - so that we have an
> oidset that we can check objects against.
> 
> A partial solution is for is_promisor_object() to first check if the
> given object is in a promisor pack, avoiding generating the set of
> promisor objects until necessary. This would work in a blob:none clone
> with the refs pointing all to commits or all to blobs, but would not
> work in a tree:none clone (or maybe, in this case, the clone would be
> small enough that performance is not a concern, hmm).
> 
> Maybe the ideal solution is for rev-list to check if an object is in a
> promisor pack and if --exclude-promisor-objects is active, we do not
> follow any outgoing links.

I am not sure that this actually saves us any work vs. the original
method of marking promisor objects as uninteresting. Marking the objects
uninteresting involves calling for_each_packed_object(some_callback_fn, ...,
FOR_EACH_OBJECT_PROMISOR_ONLY). But this is also exactly the setup that
is_promisor_object() runs the first time it's called. In fact, it looks
to me like the callback function used by the latter is more expensive
than the former. Is there some alternative to is_promisor_object() that
doesn't involve this object enumeration?

I've tried implementing this approach anyway (diff against master is
below), but I can't get a version that passes t0410-partial-clone.sh
(cases 17 and 19). In case 17, the rev-list emits one extra line
corresponding to the blob for baz.t. I'm not sure why the original
implementation thinks that baz.t is a promisor object but my WIP version
doesn't.

I'll keep poking at this tomorrow, I just thought I should comment on
what I've found so far in case anyone can spot where I'm going wrong.



diff --git a/list-objects.c b/list-objects.c
index dc77361e11..1cb85f1662 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -326,6 +326,12 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 		struct object *obj = pending->item;
 		const char *name = pending->name;
 		const char *path = pending->path;
+		trace_printf("DEBUG: Examining object: %s\n", oid_to_hex(&obj->oid));
+		if (ctx->revs->exclude_promisor_objects &&
+		    is_promisor_object(&obj->oid)) {
+			trace_printf("DEBUG: Skipping object: %s\n", oid_to_hex(&obj->oid));
+			continue;
+		}
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
 		if (obj->type == OBJ_TAG) {
@@ -356,6 +362,13 @@ static void do_traverse(struct traversal_context *ctx)
 	strbuf_init(&csp, PATH_MAX);
 
 	while ((commit = get_revision(ctx->revs)) != NULL) {
+		trace_printf("DEBUG: Examining commit: %s\n", oid_to_hex(&commit->object.oid));
+		if (ctx->revs->exclude_promisor_objects &&
+		    is_promisor_object(&commit->object.oid)) {
+			trace_printf("DEBUG: Skipping commit: %s\n", oid_to_hex(&commit->object.oid));
+			continue;
+		}
+
 		/*
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
diff --git a/revision.c b/revision.c
index eb8e51bc63..85974e941d 100644
--- a/revision.c
+++ b/revision.c
@@ -3067,17 +3067,6 @@ void reset_revision_walk(void)
 	clear_object_flags(SEEN | ADDED | SHOWN);
 }
 
-static int mark_uninteresting(const struct object_id *oid,
-			      struct packed_git *pack,
-			      uint32_t pos,
-			      void *cb)
-{
-	struct rev_info *revs = cb;
-	struct object *o = parse_object(revs->repo, oid);
-	o->flags |= UNINTERESTING | SEEN;
-	return 0;
-}
-
 define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
@@ -3316,11 +3305,6 @@ int prepare_revision_walk(struct rev_info *revs)
 	    (revs->limited && limiting_can_increase_treesame(revs)))
 		revs->treesame.name = "treesame";
 
-	if (revs->exclude_promisor_objects) {
-		for_each_packed_object(mark_uninteresting, revs,
-				       FOR_EACH_OBJECT_PROMISOR_ONLY);
-	}
-
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index bce02788e6..55ac29f650 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -342,7 +342,7 @@ test_expect_success 'rev-list stops traversal at promisor commit, tree, and blob
 
 	git -C repo config core.repositoryformatversion 1 &&
 	git -C repo config extensions.partialclone "arbitrary string" &&
-	git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
+	GIT_TRACE="$(pwd)/rev_trace" git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
 	! grep $COMMIT out &&
 	! grep $TREE out &&
 	! grep $BLOB out &&
Jeff King April 4, 2019, 1:33 a.m. UTC | #5
On Wed, Apr 03, 2019 at 01:57:48PM -0700, Jonathan Tan wrote:

> > This isn't strictly true, since we could get objects from elsewhere via
> > --shared or --reference. Those might not be promisor objects.
> 
> I don't think local clones (which --shared or --reference implies) can
> be partial, but the bigger point is below.

Yeah, you're right about --shared. But I don't see any reason a
--reference clone could not be partial.

> > So it seems like this should be a feature of the child rev-list, to stop
> > walking the graph at any object that is in a promisor pack.
> 
> We currently already do a less optimal version of this - we pass
> --exclude-promisor-objects to rev-list, which indeed stops traversal at
> any promisor objects (whether in a promisor pack or referenced by one).
> As far as I know, the problem is that to do so, we currently enumerate
> all the objects in all promisor packs, and all objects that those
> objects reference (which means we inflate them too) - so that we have an
> oidset that we can check objects against.
> 
> A partial solution is for is_promisor_object() to first check if the
> given object is in a promisor pack, avoiding generating the set of
> promisor objects until necessary. This would work in a blob:none clone
> with the refs pointing all to commits or all to blobs, but would not
> work in a tree:none clone (or maybe, in this case, the clone would be
> small enough that performance is not a concern, hmm).
> 
> Maybe the ideal solution is for rev-list to check if an object is in a
> promisor pack and if --exclude-promisor-objects is active, we do not
> follow any outgoing links.

I was thinking you could actually check it before even loading the
object. I.e., something like:

  struct object_info oi = OBJECT_INFO_INIT;

  if (!oid_object_info_extended(oid, &oi, 0) &&
      oi->whence = OI_PACKED &&
      oi->u.packed.pack->pack_promisor)) {
          /*
	   * no point in even looking at its links,
	   * since the promisor pack claims that we
	   * can get anything we need later from the
	   * remote
	   */
     return 0; /* or whatever, depending where this goes ;) */
  } else {
    /* not a promisor object, load it and traverse as normal */
  }

That doesn't quite work as an implementation of is_promisor_object(),
because it wouldn't know about items that we _don't_ have that are
promised. But I think it could work as part of the traversal in
list-objects.c, since we'd just be walking down a traversal from which
we presumably have all the objects.

I guess maybe it would be complicated if you had non-promisor objects
that refer indirectly to promisor ones. E.g., imagine ref A points to
object X, which is in a promisor pack pointing to Y (which we don't
have). But we also have ref B pointing to object Z which also refers to
Y, but _isn't_ in a promisor pack. I'm not sure that can actually happen
with the promisor mechanism, though (how did we get a Z without all of
its objects into a non-promisor pack?).

It's also a shame that it would incur an extra object lookup, since if
it _isn't_ in the promisor pack we'd then have to actually look it up
again via parse_object() or whatever. It may not be measurable though.
In an ideal world, we'd have an object access API that lets us open a
handle, ask things about it (like "which pack is this coming from") and
then load it if we want. But I don't think that needs to hold up this
particular topic.

-Peff
diff mbox series

Patch

diff --git a/builtin/clone.c b/builtin/clone.c
index 50bde99618..fdbbd8942a 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -657,7 +657,8 @@  static void update_remote_refs(const struct ref *refs,
 			       const char *branch_top,
 			       const char *msg,
 			       struct transport *transport,
-			       int check_connectivity)
+			       int check_connectivity,
+			       int check_refs_only)
 {
 	const struct ref *rm = mapped_refs;
 
@@ -666,6 +667,7 @@  static void update_remote_refs(const struct ref *refs,
 
 		opt.transport = transport;
 		opt.progress = transport->progress;
+		opt.check_refs_only = !!check_refs_only;
 
 		if (check_connected(iterate_ref_map, &rm, &opt))
 			die(_("remote did not send all necessary objects"));
@@ -1224,7 +1226,7 @@  int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	update_remote_refs(refs, mapped_refs, remote_head_points_at,
 			   branch_top.buf, reflog_msg.buf, transport,
-			   !is_local);
+			   !is_local, filter_options.choice);
 
 	update_head(our_head_points_at, remote_head, reflog_msg.buf);
 
diff --git a/connected.c b/connected.c
index 1bba888eff..c297cdc5ab 100644
--- a/connected.c
+++ b/connected.c
@@ -1,4 +1,5 @@ 
 #include "cache.h"
+#include "object-store.h"
 #include "run-command.h"
 #include "sigchain.h"
 #include "connected.h"
@@ -49,6 +50,20 @@  int check_connected(oid_iterate_fn fn, void *cb_data,
 		strbuf_release(&idx_file);
 	}
 
+	if (opt->check_refs_only) {
+		/*
+		 * For partial clones, we don't want to walk the full commit
+		 * graph because we're skipping promisor objects anyway. We
+		 * should just check that objects referenced by wanted refs were
+		 * transferred.
+		 */
+		do {
+			if (!repo_has_object_file(the_repository, &oid))
+				return 1;
+		} while (!fn(cb_data, &oid));
+		return 0;
+	}
+
 	if (opt->shallow_file) {
 		argv_array_push(&rev_list.args, "--shallow-file");
 		argv_array_push(&rev_list.args, opt->shallow_file);
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..bb4afcb301 100644
--- a/connected.h
+++ b/connected.h
@@ -46,6 +46,14 @@  struct check_connected_options {
 	 * during a fetch.
 	 */
 	unsigned is_deepening_fetch : 1;
+
+	/*
+	 * If non-zero, only check the top-level objects referenced by the
+	 * wanted refs (passed in as cb_data). This is useful for partial
+	 * clones, where this can be much faster than excluding all promisor
+	 * objects prior to walking the commit graph.
+	 */
+	unsigned check_refs_only : 1;
 };
 
 #define CHECK_CONNECTED_INIT { 0 }