[v2,2/2] diff: restrict when prefetching occurs

Message ID	a3322cdedf019126305fcead5918d523a1b2dfbc.1585854639.git.jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=dTtN=5S=vger.kernel.org=git-owner@kernel.org> Date: Thu, 2 Apr 2020 12:19:17 -0700 In-Reply-To: <cover.1585854639.git.jonathantanmy@google.com> Message-Id: <a3322cdedf019126305fcead5918d523a1b2dfbc.1585854639.git.jonathantanmy@google.com> Mime-Version: 1.0 References: <20200331020418.55640-1-jonathantanmy@google.com> <cover.1585854639.git.jonathantanmy@google.com> Subject: [PATCH v2 2/2] diff: restrict when prefetching occurs From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com>, gitster@pobox.com, stolee@gmail.com, Jeff King <peff@peff.net> Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Restrict when prefetcing occurs \| expand [v2,0/2] Restrict when prefetcing occurs [v2,1/2] promisor-remote: accept 0 as oid_nr in function [v2,2/2] diff: restrict when prefetching occurs

Jonathan Tan April 2, 2020, 7:19 p.m. UTC

Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In particular, if
(1) no blob data is included in the output of the diff, (2)
break-rewrite detection is not requested, and (3) no inexact rename
detection is needed, then no blobs are read at all.

Therefore, in such a case, do not prefetch. Change diffcore_std() to
only prefetch if (1) and/or (2) is not true (currently, it always
prefetches); change diffcore_rename() to prefetch if (3) is not true and
no prefetch has yet occurred.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 diff.c                        | 38 +++++++++++++++++++--------
 diffcore-rename.c             | 37 ++++++++++++++++++++++++++-
 diffcore.h                    | 10 +++++++-
 t/t4067-diff-partial-clone.sh | 48 +++++++++++++++++++++++++++++++++++
 4 files changed, 121 insertions(+), 12 deletions(-)

Junio C Hamano April 2, 2020, 8:08 p.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> +	int prefetched = 0;
> +	int output_formats_to_prefetch = DIFF_FORMAT_DIFFSTAT |
> +		DIFF_FORMAT_NUMSTAT |
> +		DIFF_FORMAT_PATCH |
> +		DIFF_FORMAT_SHORTSTAT |
> +		DIFF_FORMAT_DIRSTAT;

Would this want to be a "const int" (or even #define), I wonder.  I
do not care too much between the two, but leaving it as a variable
makes me a bit nervous.

> +	/*
> +	 * Check if the user requested a blob-data-requiring diff output and/or
> +	 * break-rewrite detection (which requires blob data). If yes, prefetch
> +	 * the diff pairs.
> +	 *
> +	 * If no prefetching occurs, diffcore_rename() will prefetch if it
> +	 * decides that it needs inexact rename detection.
> +	 */

Name-only etc. that Derrick mentioned in the other thread would be
relevant only when rename detection is active, and you'd do that in
diffcore_rename().  Good.

> +	if (options->repo == the_repository && has_promisor_remote() &&
> +	    (options->output_format & output_formats_to_prefetch ||
> +	     (!options->found_follow && options->break_opt != -1))) {
>  		int i;
>  		struct diff_queue_struct *q = &diff_queued_diff;
>  		struct oid_array to_fetch = OID_ARRAY_INIT;
>  
>  		for (i = 0; i < q->nr; i++) {
>  			struct diff_filepair *p = q->queue[i];
> -			add_if_missing(options->repo, &to_fetch, p->one);
> -			add_if_missing(options->repo, &to_fetch, p->two);
> +			diff_add_if_missing(options->repo, &to_fetch, p->one);
> +			diff_add_if_missing(options->repo, &to_fetch, p->two);
>  		}
> +
> +		prefetched = 1;
> +

Wouldn't it logically make more sense to do this after calling
promisor_remote_get_direct() and if to_fetch.nr is not 0, ...

>  		/*
>  		 * NEEDSWORK: Consider deduplicating the OIDs sent.
>  		 */
>  		promisor_remote_get_direct(options->repo,
>  					   to_fetch.oid, to_fetch.nr);
> +

... namely, here?

When (q->nr != 0), to_fetch.nr may not be zero, I suspect, but the
original code before [1/2] protected against to_fetch.nr==0 case, so
...?

>  		oid_array_clear(&to_fetch);
>  	}
>  
> @@ -6537,7 +6555,7 @@ void diffcore_std(struct diff_options *options)
>  			diffcore_break(options->repo,
>  				       options->break_opt);
>  		if (options->detect_rename)
> -			diffcore_rename(options);
> +			diffcore_rename(options, prefetched);
>  		if (options->break_opt != -1)
>  			diffcore_merge_broken();
>  	}
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index e189f407af..79ac1b4bee 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -7,6 +7,7 @@
>  #include "object-store.h"
>  #include "hashmap.h"
>  #include "progress.h"
> +#include "promisor-remote.h"
>  
>  /* Table of rename/copy destinations */
>  
> @@ -448,7 +449,7 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
>  	return count;
>  }
>  
> -void diffcore_rename(struct diff_options *options)
> +void diffcore_rename(struct diff_options *options, int prefetched)
>  {
>  	int detect_rename = options->detect_rename;
>  	int minimum_score = options->rename_score;
> @@ -538,6 +539,40 @@ void diffcore_rename(struct diff_options *options)
>  		break;
>  	}
>  
> +	if (!prefetched) {
> +		/*
> +		 * At this point we know there's actual work to do: we have rename
> +		 * destinations that didn't find an exact match, and we have potential
> +		 * sources. So we'll have to do inexact rename detection, which
> +		 * requires looking at the blobs.
> +		 *
> +		 * If we haven't already prefetched, it's worth pre-fetching
> +		 * them as a group now.
> +		 */

This comment makes me wonder if it would be even better to

 - prepare an empty to_fetch OID array in the caller,

 - if the output format is one of the ones that wants prefetch, add
   object names to to_fetch in the caller, BUT not fetch there.

 - pass &to_fetch by the caller to this function, and this code here
   may add even more objects,

 - then do the prefetch here (so a single promisor interaction will
   grab objects the caller would have fetched before calling us and
   the ones we want here), and then clear the to_fetch array.

 - the caller, after seeing this function returns, checks to_fetch
   and if it is not empty, fetches (i.e. the caller prepared list of
   objects based on the output type, we ended up not calling this
   helper, and then finally the caller does the prefetch).

That way, the "unless we have already prefetched" logic can go, and
we can lose one indentation level, no?


> +		int i;
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +
> +		for (i = 0; i < rename_dst_nr; i++) {
> +			if (rename_dst[i].pair)
> +				continue; /* already found exact match */
> +			diff_add_if_missing(options->repo, &to_fetch, rename_dst[i].two);
> +		}
> +		for (i = 0; i < rename_src_nr; i++) {
> +			if (skip_unmodified &&
> +			    diff_unmodified_pair(rename_src[i].p))
> +				/*
> +				 * The "for" loop below will not need these
> +				 * blobs, so skip prefetching.
> +				 */
> +				continue;
> +			diff_add_if_missing(options->repo, &to_fetch, rename_src[i].p->one);
> +		}
> +		if (to_fetch.nr)
> +			promisor_remote_get_direct(options->repo,
> +						   to_fetch.oid, to_fetch.nr);

You no longer need the if(), no?

> +		oid_array_clear(&to_fetch);
> +	}

Jonathan Tan April 2, 2020, 11:09 p.m. UTC | #2

> > +	int output_formats_to_prefetch = DIFF_FORMAT_DIFFSTAT |
> > +		DIFF_FORMAT_NUMSTAT |
> > +		DIFF_FORMAT_PATCH |
> > +		DIFF_FORMAT_SHORTSTAT |
> > +		DIFF_FORMAT_DIRSTAT;
> 
> Would this want to be a "const int" (or even #define), I wonder.  I
> do not care too much between the two, but leaving it as a variable
> makes me a bit nervous.

OK, will switch to "const int".

> > +	if (options->repo == the_repository && has_promisor_remote() &&
> > +	    (options->output_format & output_formats_to_prefetch ||
> > +	     (!options->found_follow && options->break_opt != -1))) {
> >  		int i;
> >  		struct diff_queue_struct *q = &diff_queued_diff;
> >  		struct oid_array to_fetch = OID_ARRAY_INIT;
> >  
> >  		for (i = 0; i < q->nr; i++) {
> >  			struct diff_filepair *p = q->queue[i];
> > -			add_if_missing(options->repo, &to_fetch, p->one);
> > -			add_if_missing(options->repo, &to_fetch, p->two);
> > +			diff_add_if_missing(options->repo, &to_fetch, p->one);
> > +			diff_add_if_missing(options->repo, &to_fetch, p->two);
> >  		}
> > +
> > +		prefetched = 1;
> > +
> 
> Wouldn't it logically make more sense to do this after calling
> promisor_remote_get_direct() and if to_fetch.nr is not 0, ...
> 
> >  		/*
> >  		 * NEEDSWORK: Consider deduplicating the OIDs sent.
> >  		 */
> >  		promisor_remote_get_direct(options->repo,
> >  					   to_fetch.oid, to_fetch.nr);
> > +
> 
> ... namely, here?
> 
> When (q->nr != 0), to_fetch.nr may not be zero, I suspect, but the
> original code before [1/2] protected against to_fetch.nr==0 case, so
> ...?

My idea is that this prefetch is a superset of what diffcore_rebase()
wants to prefetch, so if we have already done the necessary logic here
(even if nothing gets prefetched - which might be the case if we have
all objects), we do not need to do it in diffcore_rebase().

> > +	if (!prefetched) {
> > +		/*
> > +		 * At this point we know there's actual work to do: we have rename
> > +		 * destinations that didn't find an exact match, and we have potential
> > +		 * sources. So we'll have to do inexact rename detection, which
> > +		 * requires looking at the blobs.
> > +		 *
> > +		 * If we haven't already prefetched, it's worth pre-fetching
> > +		 * them as a group now.
> > +		 */
> 
> This comment makes me wonder if it would be even better to
> 
>  - prepare an empty to_fetch OID array in the caller,
> 
>  - if the output format is one of the ones that wants prefetch, add
>    object names to to_fetch in the caller, BUT not fetch there.
> 
>  - pass &to_fetch by the caller to this function, and this code here
>    may add even more objects,
> 
>  - then do the prefetch here (so a single promisor interaction will
>    grab objects the caller would have fetched before calling us and
>    the ones we want here), and then clear the to_fetch array.
> 
>  - the caller, after seeing this function returns, checks to_fetch
>    and if it is not empty, fetches (i.e. the caller prepared list of
>    objects based on the output type, we ended up not calling this
>    helper, and then finally the caller does the prefetch).
> 
> That way, the "unless we have already prefetched" logic can go, and
> we can lose one indentation level, no?

This means that the only prefetch occurs in diffcore_rename()? I don't
think this will work for 2 reasons:

 - diffcore_std() calls diffcore_break() (which also reads blobs) before
   diffcore_rename()
 - (more importantly) there's a code path in diffcore_std() that does
   not call diffcore_rename(), so we would still need some prefetching
   logic in diffcore_std() in case diffcore_rename() is not called

> > +		if (to_fetch.nr)
> > +			promisor_remote_get_direct(options->repo,
> > +						   to_fetch.oid, to_fetch.nr);
> 
> You no longer need the if(), no?

Ah...I'll remove the if().

Junio C Hamano April 2, 2020, 11:25 p.m. UTC | #3

Jonathan Tan <jonathantanmy@google.com> writes:

>> This comment makes me wonder if it would be even better to
>> 
>>  - prepare an empty to_fetch OID array in the caller,
>> 
>>  - if the output format is one of the ones that wants prefetch, add
>>    object names to to_fetch in the caller, BUT not fetch there.
>> 
>>  - pass &to_fetch by the caller to this function, and this code here
>>    may add even more objects,
>> 
>>  - then do the prefetch here (so a single promisor interaction will
>>    grab objects the caller would have fetched before calling us and
>>    the ones we want here), and then clear the to_fetch array.
>> 
>>  - the caller, after seeing this function returns, checks to_fetch
>>    and if it is not empty, fetches (i.e. the caller prepared list of
>>    objects based on the output type, we ended up not calling this
>>    helper, and then finally the caller does the prefetch).
>> 
>> That way, the "unless we have already prefetched" logic can go, and
>> we can lose one indentation level, no?
>
> This means that the only prefetch occurs in diffcore_rename()?

No, but I phrased the last bullet item incorrectly.  "after seeing
this function returns" is wrong, but what is in parentheses (i.e. if
we didn't call diffcore_rename) is correct.

> I don't
> think this will work for 2 reasons:
>
>  - diffcore_std() calls diffcore_break() (which also reads blobs) before
>    diffcore_rename()

Ahh, I missed that part.

>  - (more importantly) there's a code path in diffcore_std() that does
>    not call diffcore_rename(), so we would still need some prefetching
>    logic in diffcore_std() in case diffcore_rename() is not called

That one I think is already covered.

Junio C Hamano April 2, 2020, 11:54 p.m. UTC | #4

Jonathan Tan <jonathantanmy@google.com> writes:

> My idea is that this prefetch is a superset of what diffcore_rebase()
> wants to prefetch, so if we have already done the necessary logic here
> (even if nothing gets prefetched - which might be the case if we have
> all objects), we do not need to do it in diffcore_rebase().

s/rebase/rename/ I presume, but the above reasoning, while it may
happen to hold true right now, feels brittle.  In other words

 - how do we know it would stay to be "a superset"?

 - would it change the picture if we later added a prefetch in
   diffcore_break(), just like you are doing so to diffcore_rename()
   in this patch?

So the revised version of my earlier "wondering" is if it would be
more future-proof (easier to teach different steps to prefetch for
their own needs, without having to make an assumption like "what
this step needs is sufficient for the other step") to arrange the
codepath from diffcore_std() to its helpers like so:

    - prepare an empty to_fetch OID array in the caller,

    - if the output format is one of the ones that wants prefetch,
      add object names to to_fetch in the caller, but not fetch as
      long as the caller does not yet need the contents of the
      blobs.

    - pass &to_fetch from diffcore_std() to the helper functions in
      the diffcore family like diffcore_{break,rename}() have them
      also batch what they (may) want to prefetch in there.  Delay
      fetching until they actually need to look at the blobs, and
      when they fetch, clear &to_fetch for the next helper.

    - diffcore_std() also would need to look at the blob eventually,
      perhaps after all the helpers it may call returns.  Do the
      final prefetch if to_fetch is still not empty before it has to
      look at the blobs.

Thanks.

Jonathan Tan April 3, 2020, 9:35 p.m. UTC | #5

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > My idea is that this prefetch is a superset of what diffcore_rebase()
> > wants to prefetch, so if we have already done the necessary logic here
> > (even if nothing gets prefetched - which might be the case if we have
> > all objects), we do not need to do it in diffcore_rebase().
> 
> s/rebase/rename/ I presume,

Ah, yes.

> but the above reasoning, while it may
> happen to hold true right now, feels brittle.  In other words
> 
>  - how do we know it would stay to be "a superset"?
> 
>  - would it change the picture if we later added a prefetch in
>    diffcore_break(), just like you are doing so to diffcore_rename()
>    in this patch?
> 
> So the revised version of my earlier "wondering" is if it would be
> more future-proof (easier to teach different steps to prefetch for
> their own needs, without having to make an assumption like "what
> this step needs is sufficient for the other step") to arrange the
> codepath from diffcore_std() to its helpers like so:
> 
>     - prepare an empty to_fetch OID array in the caller,
> 
>     - if the output format is one of the ones that wants prefetch,
>       add object names to to_fetch in the caller, but not fetch as
>       long as the caller does not yet need the contents of the
>       blobs.
> 
>     - pass &to_fetch from diffcore_std() to the helper functions in
>       the diffcore family like diffcore_{break,rename}() have them
>       also batch what they (may) want to prefetch in there.  Delay
>       fetching until they actually need to look at the blobs, and
>       when they fetch, clear &to_fetch for the next helper.
> 
>     - diffcore_std() also would need to look at the blob eventually,
>       perhaps after all the helpers it may call returns.  Do the
>       final prefetch if to_fetch is still not empty before it has to
>       look at the blobs.

Ah...that makes sense. Besides the accumulating of prefetch targets
(which makes deduplication more necessary - I might make a
"sort_and_uniq" function on oid_array that updates the oid_array
in-place), looking at the code, it's not only diffcore_break() which
might need prefetching, but diffcore_skip_stat_unmatch() too (not to
speak of the functions that come after diffcore_rename()). The least
brittle way is probably to have diff_populate_filespec() do the
prefetching. I'll take a further look.

Junio C Hamano April 3, 2020, 10:12 p.m. UTC | #6

Jonathan Tan <jonathantanmy@google.com> writes:

> Ah...that makes sense. Besides the accumulating of prefetch targets
> (which makes deduplication more necessary - I might make a
> "sort_and_uniq" function on oid_array that updates the oid_array
> in-place), looking at the code, it's not only diffcore_break() which
> might need prefetching, but diffcore_skip_stat_unmatch() too (not to
> speak of the functions that come after diffcore_rename()). The least
> brittle way is probably to have diff_populate_filespec() do the
> prefetching. I'll take a further look.

It might be a losing battle, unless we can somehow cleanly have a
two pass approach where we ask various codepaths "enumerate blobs
that you think you would need prefetching in this oid list" before
letting any of them actually look at blobs and perform their main
tasks, do a single prefetch and then let the existing age-old code
call the diffcore transformations as if there is no need for it to
worry about prefetching ;-)

Thanks.

[v2,2/2] diff: restrict when prefetching occurs

Commit Message

Comments

Patch