diff mbox series

[v4] clone: do faster object check for partial clones

Message ID b4a285e2a199ea059c165aa344d037374797fa40.1555707373.git.steadmon@google.com (mailing list archive)
State New, archived
Headers show
Series [v4] clone: do faster object check for partial clones | expand

Commit Message

Josh Steadmon April 19, 2019, 9 p.m. UTC
For partial clones, doing a full connectivity check is wasteful; we skip
promisor objects (which, for a partial clone, is all known objects), and
enumerating them all to exclude them from the connectivity check can
take a significant amount of time on large repos.

At most, we want to make sure that we get the objects referred to by any
wanted refs. For partial clones, just check that these objects were
transferred.

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
This is an update of the original V1 approach (skipping the fully
connectivity check when doing a partial clone) with updated commit
message and comments to address the review concerns.


Range-diff against v1:
1:  9c29e1ce8d ! 1:  b4a285e2a1 clone: do faster object check for partial clones
    @@ -4,8 +4,8 @@
     
         For partial clones, doing a full connectivity check is wasteful; we skip
         promisor objects (which, for a partial clone, is all known objects), and
    -    excluding them all from the connectivity check can take a significant
    -    amount of time on large repos.
    +    enumerating them all to exclude them from the connectivity check can
    +    take a significant amount of time on large repos.
     
         At most, we want to make sure that we get the objects referred to by any
         wanted refs. For partial clones, just check that these objects were
    @@ -59,10 +59,12 @@
      
     +	if (opt->check_refs_only) {
     +		/*
    -+		 * For partial clones, we don't want to walk the full commit
    -+		 * graph because we're skipping promisor objects anyway. We
    -+		 * should just check that objects referenced by wanted refs were
    -+		 * transferred.
    ++		 * For partial clones, we don't want to have to do a regular
    ++		 * connectivity check because we have to enumerate and exclude
    ++		 * all promisor objects (slow), and then the connectivity check
    ++		 * itself becomes a no-op because in a partial clone every
    ++		 * object is a promisor object. Instead, just make sure we
    ++		 * received the objects pointed to by each wanted ref.
     +		 */
     +		do {
     +			if (!repo_has_object_file(the_repository, &oid))
    @@ -86,8 +88,8 @@
     +	/*
     +	 * If non-zero, only check the top-level objects referenced by the
     +	 * wanted refs (passed in as cb_data). This is useful for partial
    -+	 * clones, where this can be much faster than excluding all promisor
    -+	 * objects prior to walking the commit graph.
    ++	 * clones, where enumerating and excluding all promisor objects is very
    ++	 * slow and the commit-walk itself becomes a no-op.
     +	 */
     +	unsigned check_refs_only : 1;
      };

 builtin/clone.c |  6 ++++--
 connected.c     | 17 +++++++++++++++++
 connected.h     |  8 ++++++++
 3 files changed, 29 insertions(+), 2 deletions(-)

Comments

Jeff King April 22, 2019, 9:31 p.m. UTC | #1
On Fri, Apr 19, 2019 at 02:00:13PM -0700, Josh Steadmon wrote:

> For partial clones, doing a full connectivity check is wasteful; we skip
> promisor objects (which, for a partial clone, is all known objects), and
> enumerating them all to exclude them from the connectivity check can
> take a significant amount of time on large repos.
> 
> At most, we want to make sure that we get the objects referred to by any
> wanted refs. For partial clones, just check that these objects were
> transferred.
> 
> Signed-off-by: Josh Steadmon <steadmon@google.com>
> ---
> This is an update of the original V1 approach (skipping the fully
> connectivity check when doing a partial clone) with updated commit
> message and comments to address the review concerns.

This looks OK to me. Just trying to think of ways it could bite us, the
obvious line of thinking is where "--reference" is used. If we tell the
other side we already have object X, it will not be sent to us, and we
are relying on our local non-promisor alternate to have all of the
required objects. But I think this is OK:

  - for it to be mentioned in a ref, then the server must have been
    advertising it. Which means that it should similarly be on the hook
    for providing it to us via the promisor mechanism. That's a little
    hand-wavy, but then so is the entire promisor concept. We're
    inherently at the server's mercy, so if they're lying to us about
    what they're willing or able to provide, there's not much we can do
    anyway.

  - if we sent it as a "have" to the server, that means that our
    alternate was advertising it from a ref tip. Which means that unless
    it's corrupted, we're fine (which is no different than the
    connectivity promise we'd make for our own refs). I actually think
    that the connectivity check should "--not" any advertised alternate
    tips. I even have a patch to do that, but I need to polish it a
    little.

So I think this optimization will yield correct results. If we later
figure out a better way for rev-list to optimize this case itself, then
we can rip this out.

I think you had dug up some numbers to put in the commit message after
the last discussion? Likewise, is there a t/perf test which shows this
off (and will help us catch regressions)? If not, it might be worth
adding one (AFAICT a simple no-blobs partial-clone would be enough).

-Peff
diff mbox series

Patch

diff --git a/builtin/clone.c b/builtin/clone.c
index 50bde99618..fdbbd8942a 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -657,7 +657,8 @@  static void update_remote_refs(const struct ref *refs,
 			       const char *branch_top,
 			       const char *msg,
 			       struct transport *transport,
-			       int check_connectivity)
+			       int check_connectivity,
+			       int check_refs_only)
 {
 	const struct ref *rm = mapped_refs;
 
@@ -666,6 +667,7 @@  static void update_remote_refs(const struct ref *refs,
 
 		opt.transport = transport;
 		opt.progress = transport->progress;
+		opt.check_refs_only = !!check_refs_only;
 
 		if (check_connected(iterate_ref_map, &rm, &opt))
 			die(_("remote did not send all necessary objects"));
@@ -1224,7 +1226,7 @@  int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	update_remote_refs(refs, mapped_refs, remote_head_points_at,
 			   branch_top.buf, reflog_msg.buf, transport,
-			   !is_local);
+			   !is_local, filter_options.choice);
 
 	update_head(our_head_points_at, remote_head, reflog_msg.buf);
 
diff --git a/connected.c b/connected.c
index 1bba888eff..1ab481fed6 100644
--- a/connected.c
+++ b/connected.c
@@ -1,4 +1,5 @@ 
 #include "cache.h"
+#include "object-store.h"
 #include "run-command.h"
 #include "sigchain.h"
 #include "connected.h"
@@ -49,6 +50,22 @@  int check_connected(oid_iterate_fn fn, void *cb_data,
 		strbuf_release(&idx_file);
 	}
 
+	if (opt->check_refs_only) {
+		/*
+		 * For partial clones, we don't want to have to do a regular
+		 * connectivity check because we have to enumerate and exclude
+		 * all promisor objects (slow), and then the connectivity check
+		 * itself becomes a no-op because in a partial clone every
+		 * object is a promisor object. Instead, just make sure we
+		 * received the objects pointed to by each wanted ref.
+		 */
+		do {
+			if (!repo_has_object_file(the_repository, &oid))
+				return 1;
+		} while (!fn(cb_data, &oid));
+		return 0;
+	}
+
 	if (opt->shallow_file) {
 		argv_array_push(&rev_list.args, "--shallow-file");
 		argv_array_push(&rev_list.args, opt->shallow_file);
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..ce2e7d8f2e 100644
--- a/connected.h
+++ b/connected.h
@@ -46,6 +46,14 @@  struct check_connected_options {
 	 * during a fetch.
 	 */
 	unsigned is_deepening_fetch : 1;
+
+	/*
+	 * If non-zero, only check the top-level objects referenced by the
+	 * wanted refs (passed in as cb_data). This is useful for partial
+	 * clones, where enumerating and excluding all promisor objects is very
+	 * slow and the commit-walk itself becomes a no-op.
+	 */
+	unsigned check_refs_only : 1;
 };
 
 #define CHECK_CONNECTED_INIT { 0 }