diff mbox series

[v2] rev-list: exclude promisor objects at walk time

Message ID 9f327d6d8dc5e71eb0039aef3ac76ea16c2adab3.1554417917.git.steadmon@google.com (mailing list archive)
State New, archived
Headers show
Series [v2] rev-list: exclude promisor objects at walk time | expand

Commit Message

Josh Steadmon April 4, 2019, 10:53 p.m. UTC
For large repositories, enumerating the list of all promisor objects (in
order to exclude them from a rev-list walk) can take a significant
amount of time).

When --exclude-promisor-objects is passed to rev-list, don't enumerate
the promisor objects. Instead, filter them (and any children objects)
during the actual graph walk.

Remove the mark_uninteresting() function as it's not used anywhere else.

Helped-By: Jonathan Tan <jonathantanmy@google.com>
Helped-By: Jeff King <peff@peff.net>
Helped-By: Jonathan Nieder <jrnieder@gmail.com>

Signed-off-by: Josh Steadmon <steadmon@google.com>
---

Re-implemented following Jonathan & Jeff's advice (and also previously
Jonathan Nieder's, although I didn't understand it at the time). Thanks
for the feedback all.


 list-objects.c | 26 ++++++++++++++++++++++++++
 revision.c     | 16 ----------------
 2 files changed, 26 insertions(+), 16 deletions(-)

Comments

Jeff King April 4, 2019, 11:08 p.m. UTC | #1
On Thu, Apr 04, 2019 at 03:53:56PM -0700, Josh Steadmon wrote:

> For large repositories, enumerating the list of all promisor objects (in
> order to exclude them from a rev-list walk) can take a significant
> amount of time).
> 
> When --exclude-promisor-objects is passed to rev-list, don't enumerate
> the promisor objects. Instead, filter them (and any children objects)
> during the actual graph walk.

Yeah, this is definitely the approach I was thinking of.

Did you (or anybody else) have any thoughts on the case where a given
object is referred to both by a promisor and a non-promisor (and we
don't have it)? That's the "shortcut" I think we're taking here: we
would no longer realize that it's available via the promisor when we
traverse to it from the non-promisor. I'm just not clear on whether that
can ever happen.

> Helped-By: Jonathan Tan <jonathantanmy@google.com>
> Helped-By: Jeff King <peff@peff.net>
> Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> 
> Signed-off-by: Josh Steadmon <steadmon@google.com>

Minor nit, but these should all be part of the same block.

> diff --git a/list-objects.c b/list-objects.c
> index dc77361e11..d1eaa0999e 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -30,6 +30,7 @@ static void process_blob(struct traversal_context *ctx,
>  	struct object *obj = &blob->object;
>  	size_t pathlen;
>  	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
> +	struct object_info oi = OBJECT_INFO_INIT;
>  
>  	if (!ctx->revs->blob_objects)
>  		return;
> @@ -37,6 +38,11 @@ static void process_blob(struct traversal_context *ctx,
>  		die("bad blob object");
>  	if (obj->flags & (UNINTERESTING | SEEN))
>  		return;
> +	if (ctx->revs->exclude_promisor_objects &&
> +	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
> +	    oi.whence == OI_PACKED &&
> +	    oi.u.packed.pack->pack_promisor)
> +		return;

This conditional gets repeated a lot in your patch. Perhaps it's worth a
helper so we can say:

  if (skip_promisor_object(&ctx->revs, &obj->oid))
	return;

in each place?

One other possible small optimization: we don't look up the object
unless the caller asked to exclude promisors, which is good. But we
could also keep a single flag for "is there a promisor pack at all?".
When there isn't, we know there's no point in looking for the object.

It might not matter much in practice. The main caller here is going to
be check_connected(), and it only passes --exclude-promisor-objects if
it's in a partial clone.

> [...]

I didn't see any tweaks to the callers, which makes sense; we're already
passing --exclude-promisor-objects as necessary. Which means by itself,
this patch should be making things faster, right? Do you have timings to
show that off?

-Peff
Josh Steadmon April 4, 2019, 11:47 p.m. UTC | #2
On 2019.04.04 19:08, Jeff King wrote:
> On Thu, Apr 04, 2019 at 03:53:56PM -0700, Josh Steadmon wrote:
> 
> > For large repositories, enumerating the list of all promisor objects (in
> > order to exclude them from a rev-list walk) can take a significant
> > amount of time).
> > 
> > When --exclude-promisor-objects is passed to rev-list, don't enumerate
> > the promisor objects. Instead, filter them (and any children objects)
> > during the actual graph walk.
> 
> Yeah, this is definitely the approach I was thinking of.
> 
> Did you (or anybody else) have any thoughts on the case where a given
> object is referred to both by a promisor and a non-promisor (and we
> don't have it)? That's the "shortcut" I think we're taking here: we
> would no longer realize that it's available via the promisor when we
> traverse to it from the non-promisor. I'm just not clear on whether that
> can ever happen.

I am not sure either. In process_blob() and process_tree() there are
additional checks for whether missing blobs/trees are promisor objects
using is_promisor_object()...  but if we call that we undo the
performance gains from this change.


> > Helped-By: Jonathan Tan <jonathantanmy@google.com>
> > Helped-By: Jeff King <peff@peff.net>
> > Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> > 
> > Signed-off-by: Josh Steadmon <steadmon@google.com>
> 
> Minor nit, but these should all be part of the same block.

Will fix in v3.


> > diff --git a/list-objects.c b/list-objects.c
> > index dc77361e11..d1eaa0999e 100644
> > --- a/list-objects.c
> > +++ b/list-objects.c
> > @@ -30,6 +30,7 @@ static void process_blob(struct traversal_context *ctx,
> >  	struct object *obj = &blob->object;
> >  	size_t pathlen;
> >  	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
> > +	struct object_info oi = OBJECT_INFO_INIT;
> >  
> >  	if (!ctx->revs->blob_objects)
> >  		return;
> > @@ -37,6 +38,11 @@ static void process_blob(struct traversal_context *ctx,
> >  		die("bad blob object");
> >  	if (obj->flags & (UNINTERESTING | SEEN))
> >  		return;
> > +	if (ctx->revs->exclude_promisor_objects &&
> > +	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
> > +	    oi.whence == OI_PACKED &&
> > +	    oi.u.packed.pack->pack_promisor)
> > +		return;
> 
> This conditional gets repeated a lot in your patch. Perhaps it's worth a
> helper so we can say:
> 
>   if (skip_promisor_object(&ctx->revs, &obj->oid))
> 	return;
> 
> in each place?

Will fix in v3.


> One other possible small optimization: we don't look up the object
> unless the caller asked to exclude promisors, which is good. But we
> could also keep a single flag for "is there a promisor pack at all?".
> When there isn't, we know there's no point in looking for the object.
> 
> It might not matter much in practice. The main caller here is going to
> be check_connected(), and it only passes --exclude-promisor-objects if
> it's in a partial clone.

I'm not necessarily opposed, but I'm leaning towards the "won't matter
much" side.

Where would such a flag live, in this case, and who would be responsible
for initializing it? I guess it would only matter for rev-list, so we
could initialize it in cmd_rev_list() if --exclude-promisor-objects is
passed?

> > [...]
> 
> I didn't see any tweaks to the callers, which makes sense; we're already
> passing --exclude-promisor-objects as necessary. Which means by itself,
> this patch should be making things faster, right? Do you have timings to
> show that off?

Yeah, for a partial clone of a large-ish Android repo [1], we see the
connectivity check go from >180s to ~7s.

[1]: https://android.googlesource.com/platform/frameworks/base/
Jeff King April 5, 2019, midnight UTC | #3
On Thu, Apr 04, 2019 at 04:47:26PM -0700, Josh Steadmon wrote:

> > Did you (or anybody else) have any thoughts on the case where a given
> > object is referred to both by a promisor and a non-promisor (and we
> > don't have it)? That's the "shortcut" I think we're taking here: we
> > would no longer realize that it's available via the promisor when we
> > traverse to it from the non-promisor. I'm just not clear on whether that
> > can ever happen.
> 
> I am not sure either. In process_blob() and process_tree() there are
> additional checks for whether missing blobs/trees are promisor objects
> using is_promisor_object()...  but if we call that we undo the
> performance gains from this change.

Hmm. That might be a good outcome, though. If it never happens, we're
fast. If it does happen, then our worst case is that we fall back to the
current slower-but-more-thorough check. (And I think that happens with
your patch, without us having to do anything further).

> > One other possible small optimization: we don't look up the object
> > unless the caller asked to exclude promisors, which is good. But we
> > could also keep a single flag for "is there a promisor pack at all?".
> > When there isn't, we know there's no point in looking for the object.
> [...]
> I'm not necessarily opposed, but I'm leaning towards the "won't matter
> much" side.
> 
> Where would such a flag live, in this case, and who would be responsible
> for initializing it? I guess it would only matter for rev-list, so we
> could initialize it in cmd_rev_list() if --exclude-promisor-objects is
> passed?

The check is really something like:

  int have_promisor_pack() {
	for (p = packed_git; p; p = p->next) {
		if (p->pack_promisor)
			return 1;
	}
	return 0;
  }

That could be lazily cached as a single bit, but it would need to be
reset whenever we call reprepare_packed_git().

Let's just punt on it for now. I'm not convinced it would actually yield
any benefit, unless we have a partial-clone repo that doesn't have any
promisor packs (but then, I suspect whatever un-partial'd it should
probably be resetting the partial flag in the config).

> > I didn't see any tweaks to the callers, which makes sense; we're already
> > passing --exclude-promisor-objects as necessary. Which means by itself,
> > this patch should be making things faster, right? Do you have timings to
> > show that off?
> 
> Yeah, for a partial clone of a large-ish Android repo [1], we see the
> connectivity check go from >180s to ~7s.

Those are nice numbers. :) Worth mentioning in the commit message, I
think. How does it compare to your earlier patch? I'd hope they're about
the same.

-Peff
Josh Steadmon April 5, 2019, 12:09 a.m. UTC | #4
On 2019.04.04 20:00, Jeff King wrote:
> On Thu, Apr 04, 2019 at 04:47:26PM -0700, Josh Steadmon wrote:
> 
> > > Did you (or anybody else) have any thoughts on the case where a given
> > > object is referred to both by a promisor and a non-promisor (and we
> > > don't have it)? That's the "shortcut" I think we're taking here: we
> > > would no longer realize that it's available via the promisor when we
> > > traverse to it from the non-promisor. I'm just not clear on whether that
> > > can ever happen.
> > 
> > I am not sure either. In process_blob() and process_tree() there are
> > additional checks for whether missing blobs/trees are promisor objects
> > using is_promisor_object()...  but if we call that we undo the
> > performance gains from this change.
> 
> Hmm. That might be a good outcome, though. If it never happens, we're
> fast. If it does happen, then our worst case is that we fall back to the
> current slower-but-more-thorough check. (And I think that happens with
> your patch, without us having to do anything further).
> 
> > > One other possible small optimization: we don't look up the object
> > > unless the caller asked to exclude promisors, which is good. But we
> > > could also keep a single flag for "is there a promisor pack at all?".
> > > When there isn't, we know there's no point in looking for the object.
> > [...]
> > I'm not necessarily opposed, but I'm leaning towards the "won't matter
> > much" side.
> > 
> > Where would such a flag live, in this case, and who would be responsible
> > for initializing it? I guess it would only matter for rev-list, so we
> > could initialize it in cmd_rev_list() if --exclude-promisor-objects is
> > passed?
> 
> The check is really something like:
> 
>   int have_promisor_pack() {
> 	for (p = packed_git; p; p = p->next) {
> 		if (p->pack_promisor)
> 			return 1;
> 	}
> 	return 0;
>   }
> 
> That could be lazily cached as a single bit, but it would need to be
> reset whenever we call reprepare_packed_git().
> 
> Let's just punt on it for now. I'm not convinced it would actually yield
> any benefit, unless we have a partial-clone repo that doesn't have any
> promisor packs (but then, I suspect whatever un-partial'd it should
> probably be resetting the partial flag in the config).
> 
> > > I didn't see any tweaks to the callers, which makes sense; we're already
> > > passing --exclude-promisor-objects as necessary. Which means by itself,
> > > this patch should be making things faster, right? Do you have timings to
> > > show that off?
> > 
> > Yeah, for a partial clone of a large-ish Android repo [1], we see the
> > connectivity check go from >180s to ~7s.
> 
> Those are nice numbers. :) Worth mentioning in the commit message, I
> think. How does it compare to your earlier patch? I'd hope they're about
> the same.

Thanks, will include them in the v3 commit message. Unfortunately it's
hard to compare against v1, because v1 doesn't call rev-list at all, and
thus we don't have a good measurement in the trace / trace2 output. The
rev-list timing has been pretty consistent at just a bit over 3 minutes,
but the overall clone takes anywhere from 12-20 minutes, so any
difference between v1 and v2 performance just gets lost in the noise. If
I get a chance on Monday I may go back to v1 and add some timing.
Josh Steadmon April 8, 2019, 8:59 p.m. UTC | #5
On 2019.04.04 17:09, Josh Steadmon wrote:
> On 2019.04.04 20:00, Jeff King wrote:
> > On Thu, Apr 04, 2019 at 04:47:26PM -0700, Josh Steadmon wrote:
> > 
> > > > Did you (or anybody else) have any thoughts on the case where a given
> > > > object is referred to both by a promisor and a non-promisor (and we
> > > > don't have it)? That's the "shortcut" I think we're taking here: we
> > > > would no longer realize that it's available via the promisor when we
> > > > traverse to it from the non-promisor. I'm just not clear on whether that
> > > > can ever happen.
> > > 
> > > I am not sure either. In process_blob() and process_tree() there are
> > > additional checks for whether missing blobs/trees are promisor objects
> > > using is_promisor_object()...  but if we call that we undo the
> > > performance gains from this change.
> > 
> > Hmm. That might be a good outcome, though. If it never happens, we're
> > fast. If it does happen, then our worst case is that we fall back to the
> > current slower-but-more-thorough check. (And I think that happens with
> > your patch, without us having to do anything further).
> > 
> > > > One other possible small optimization: we don't look up the object
> > > > unless the caller asked to exclude promisors, which is good. But we
> > > > could also keep a single flag for "is there a promisor pack at all?".
> > > > When there isn't, we know there's no point in looking for the object.
> > > [...]
> > > I'm not necessarily opposed, but I'm leaning towards the "won't matter
> > > much" side.
> > > 
> > > Where would such a flag live, in this case, and who would be responsible
> > > for initializing it? I guess it would only matter for rev-list, so we
> > > could initialize it in cmd_rev_list() if --exclude-promisor-objects is
> > > passed?
> > 
> > The check is really something like:
> > 
> >   int have_promisor_pack() {
> > 	for (p = packed_git; p; p = p->next) {
> > 		if (p->pack_promisor)
> > 			return 1;
> > 	}
> > 	return 0;
> >   }
> > 
> > That could be lazily cached as a single bit, but it would need to be
> > reset whenever we call reprepare_packed_git().
> > 
> > Let's just punt on it for now. I'm not convinced it would actually yield
> > any benefit, unless we have a partial-clone repo that doesn't have any
> > promisor packs (but then, I suspect whatever un-partial'd it should
> > probably be resetting the partial flag in the config).
> > 
> > > > I didn't see any tweaks to the callers, which makes sense; we're already
> > > > passing --exclude-promisor-objects as necessary. Which means by itself,
> > > > this patch should be making things faster, right? Do you have timings to
> > > > show that off?
> > > 
> > > Yeah, for a partial clone of a large-ish Android repo [1], we see the
> > > connectivity check go from >180s to ~7s.
> > 
> > Those are nice numbers. :) Worth mentioning in the commit message, I
> > think. How does it compare to your earlier patch? I'd hope they're about
> > the same.
> 
> Thanks, will include them in the v3 commit message. Unfortunately it's
> hard to compare against v1, because v1 doesn't call rev-list at all, and
> thus we don't have a good measurement in the trace / trace2 output. The
> rev-list timing has been pretty consistent at just a bit over 3 minutes,
> but the overall clone takes anywhere from 12-20 minutes, so any
> difference between v1 and v2 performance just gets lost in the noise. If
> I get a chance on Monday I may go back to v1 and add some timing.

For the record, the V1 abbreviated check takes about 5ms
diff mbox series

Patch

diff --git a/list-objects.c b/list-objects.c
index dc77361e11..d1eaa0999e 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -30,6 +30,7 @@  static void process_blob(struct traversal_context *ctx,
 	struct object *obj = &blob->object;
 	size_t pathlen;
 	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
+	struct object_info oi = OBJECT_INFO_INIT;
 
 	if (!ctx->revs->blob_objects)
 		return;
@@ -37,6 +38,11 @@  static void process_blob(struct traversal_context *ctx,
 		die("bad blob object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (ctx->revs->exclude_promisor_objects &&
+	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
+	    oi.whence == OI_PACKED &&
+	    oi.u.packed.pack->pack_promisor)
+		return;
 
 	/*
 	 * Pre-filter known-missing objects when explicitly requested.
@@ -149,6 +155,7 @@  static void process_tree(struct traversal_context *ctx,
 	int baselen = base->len;
 	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
 	int failed_parse;
+	struct object_info oi = OBJECT_INFO_INIT;
 
 	if (!revs->tree_objects)
 		return;
@@ -156,6 +163,11 @@  static void process_tree(struct traversal_context *ctx,
 		die("bad tree object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (ctx->revs->exclude_promisor_objects &&
+	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
+	    oi.whence == OI_PACKED &&
+	    oi.u.packed.pack->pack_promisor)
+		return;
 
 	failed_parse = parse_tree_gently(tree, 1);
 	if (failed_parse) {
@@ -318,6 +330,7 @@  static void traverse_trees_and_blobs(struct traversal_context *ctx,
 				     struct strbuf *base)
 {
 	int i;
+	struct object_info oi = OBJECT_INFO_INIT;
 
 	assert(base->len == 0);
 
@@ -326,6 +339,12 @@  static void traverse_trees_and_blobs(struct traversal_context *ctx,
 		struct object *obj = pending->item;
 		const char *name = pending->name;
 		const char *path = pending->path;
+		if (ctx->revs->exclude_promisor_objects &&
+		    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
+		    oi.whence == OI_PACKED &&
+		    oi.u.packed.pack->pack_promisor)
+			continue;
+
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
 		if (obj->type == OBJ_TAG) {
@@ -353,9 +372,16 @@  static void do_traverse(struct traversal_context *ctx)
 {
 	struct commit *commit;
 	struct strbuf csp; /* callee's scratch pad */
+	struct object_info oi = OBJECT_INFO_INIT;
 	strbuf_init(&csp, PATH_MAX);
 
 	while ((commit = get_revision(ctx->revs)) != NULL) {
+		if (ctx->revs->exclude_promisor_objects &&
+		    !oid_object_info_extended(the_repository, &commit->object.oid, &oi, 0) &&
+		    oi.whence == OI_PACKED &&
+		    oi.u.packed.pack->pack_promisor)
+			continue;
+
 		/*
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
diff --git a/revision.c b/revision.c
index eb8e51bc63..85974e941d 100644
--- a/revision.c
+++ b/revision.c
@@ -3067,17 +3067,6 @@  void reset_revision_walk(void)
 	clear_object_flags(SEEN | ADDED | SHOWN);
 }
 
-static int mark_uninteresting(const struct object_id *oid,
-			      struct packed_git *pack,
-			      uint32_t pos,
-			      void *cb)
-{
-	struct rev_info *revs = cb;
-	struct object *o = parse_object(revs->repo, oid);
-	o->flags |= UNINTERESTING | SEEN;
-	return 0;
-}
-
 define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
@@ -3316,11 +3305,6 @@  int prepare_revision_walk(struct rev_info *revs)
 	    (revs->limited && limiting_can_increase_treesame(revs)))
 		revs->treesame.name = "treesame";
 
-	if (revs->exclude_promisor_objects) {
-		for_each_packed_object(mark_uninteresting, revs,
-				       FOR_EACH_OBJECT_PROMISOR_ONLY);
-	}
-
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)