Message ID | a3322cdedf019126305fcead5918d523a1b2dfbc.1585854639.git.jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Restrict when prefetcing occurs | expand |
Jonathan Tan <jonathantanmy@google.com> writes: > + int prefetched = 0; > + int output_formats_to_prefetch = DIFF_FORMAT_DIFFSTAT | > + DIFF_FORMAT_NUMSTAT | > + DIFF_FORMAT_PATCH | > + DIFF_FORMAT_SHORTSTAT | > + DIFF_FORMAT_DIRSTAT; Would this want to be a "const int" (or even #define), I wonder. I do not care too much between the two, but leaving it as a variable makes me a bit nervous. > + /* > + * Check if the user requested a blob-data-requiring diff output and/or > + * break-rewrite detection (which requires blob data). If yes, prefetch > + * the diff pairs. > + * > + * If no prefetching occurs, diffcore_rename() will prefetch if it > + * decides that it needs inexact rename detection. > + */ Name-only etc. that Derrick mentioned in the other thread would be relevant only when rename detection is active, and you'd do that in diffcore_rename(). Good. > + if (options->repo == the_repository && has_promisor_remote() && > + (options->output_format & output_formats_to_prefetch || > + (!options->found_follow && options->break_opt != -1))) { > int i; > struct diff_queue_struct *q = &diff_queued_diff; > struct oid_array to_fetch = OID_ARRAY_INIT; > > for (i = 0; i < q->nr; i++) { > struct diff_filepair *p = q->queue[i]; > - add_if_missing(options->repo, &to_fetch, p->one); > - add_if_missing(options->repo, &to_fetch, p->two); > + diff_add_if_missing(options->repo, &to_fetch, p->one); > + diff_add_if_missing(options->repo, &to_fetch, p->two); > } > + > + prefetched = 1; > + Wouldn't it logically make more sense to do this after calling promisor_remote_get_direct() and if to_fetch.nr is not 0, ... > /* > * NEEDSWORK: Consider deduplicating the OIDs sent. > */ > promisor_remote_get_direct(options->repo, > to_fetch.oid, to_fetch.nr); > + ... namely, here? When (q->nr != 0), to_fetch.nr may not be zero, I suspect, but the original code before [1/2] protected against to_fetch.nr==0 case, so ...? > oid_array_clear(&to_fetch); > } > > @@ -6537,7 +6555,7 @@ void diffcore_std(struct diff_options *options) > diffcore_break(options->repo, > options->break_opt); > if (options->detect_rename) > - diffcore_rename(options); > + diffcore_rename(options, prefetched); > if (options->break_opt != -1) > diffcore_merge_broken(); > } > diff --git a/diffcore-rename.c b/diffcore-rename.c > index e189f407af..79ac1b4bee 100644 > --- a/diffcore-rename.c > +++ b/diffcore-rename.c > @@ -7,6 +7,7 @@ > #include "object-store.h" > #include "hashmap.h" > #include "progress.h" > +#include "promisor-remote.h" > > /* Table of rename/copy destinations */ > > @@ -448,7 +449,7 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i > return count; > } > > -void diffcore_rename(struct diff_options *options) > +void diffcore_rename(struct diff_options *options, int prefetched) > { > int detect_rename = options->detect_rename; > int minimum_score = options->rename_score; > @@ -538,6 +539,40 @@ void diffcore_rename(struct diff_options *options) > break; > } > > + if (!prefetched) { > + /* > + * At this point we know there's actual work to do: we have rename > + * destinations that didn't find an exact match, and we have potential > + * sources. So we'll have to do inexact rename detection, which > + * requires looking at the blobs. > + * > + * If we haven't already prefetched, it's worth pre-fetching > + * them as a group now. > + */ This comment makes me wonder if it would be even better to - prepare an empty to_fetch OID array in the caller, - if the output format is one of the ones that wants prefetch, add object names to to_fetch in the caller, BUT not fetch there. - pass &to_fetch by the caller to this function, and this code here may add even more objects, - then do the prefetch here (so a single promisor interaction will grab objects the caller would have fetched before calling us and the ones we want here), and then clear the to_fetch array. - the caller, after seeing this function returns, checks to_fetch and if it is not empty, fetches (i.e. the caller prepared list of objects based on the output type, we ended up not calling this helper, and then finally the caller does the prefetch). That way, the "unless we have already prefetched" logic can go, and we can lose one indentation level, no? > + int i; > + struct oid_array to_fetch = OID_ARRAY_INIT; > + > + for (i = 0; i < rename_dst_nr; i++) { > + if (rename_dst[i].pair) > + continue; /* already found exact match */ > + diff_add_if_missing(options->repo, &to_fetch, rename_dst[i].two); > + } > + for (i = 0; i < rename_src_nr; i++) { > + if (skip_unmodified && > + diff_unmodified_pair(rename_src[i].p)) > + /* > + * The "for" loop below will not need these > + * blobs, so skip prefetching. > + */ > + continue; > + diff_add_if_missing(options->repo, &to_fetch, rename_src[i].p->one); > + } > + if (to_fetch.nr) > + promisor_remote_get_direct(options->repo, > + to_fetch.oid, to_fetch.nr); You no longer need the if(), no? > + oid_array_clear(&to_fetch); > + }
> > + int output_formats_to_prefetch = DIFF_FORMAT_DIFFSTAT | > > + DIFF_FORMAT_NUMSTAT | > > + DIFF_FORMAT_PATCH | > > + DIFF_FORMAT_SHORTSTAT | > > + DIFF_FORMAT_DIRSTAT; > > Would this want to be a "const int" (or even #define), I wonder. I > do not care too much between the two, but leaving it as a variable > makes me a bit nervous. OK, will switch to "const int". > > + if (options->repo == the_repository && has_promisor_remote() && > > + (options->output_format & output_formats_to_prefetch || > > + (!options->found_follow && options->break_opt != -1))) { > > int i; > > struct diff_queue_struct *q = &diff_queued_diff; > > struct oid_array to_fetch = OID_ARRAY_INIT; > > > > for (i = 0; i < q->nr; i++) { > > struct diff_filepair *p = q->queue[i]; > > - add_if_missing(options->repo, &to_fetch, p->one); > > - add_if_missing(options->repo, &to_fetch, p->two); > > + diff_add_if_missing(options->repo, &to_fetch, p->one); > > + diff_add_if_missing(options->repo, &to_fetch, p->two); > > } > > + > > + prefetched = 1; > > + > > Wouldn't it logically make more sense to do this after calling > promisor_remote_get_direct() and if to_fetch.nr is not 0, ... > > > /* > > * NEEDSWORK: Consider deduplicating the OIDs sent. > > */ > > promisor_remote_get_direct(options->repo, > > to_fetch.oid, to_fetch.nr); > > + > > ... namely, here? > > When (q->nr != 0), to_fetch.nr may not be zero, I suspect, but the > original code before [1/2] protected against to_fetch.nr==0 case, so > ...? My idea is that this prefetch is a superset of what diffcore_rebase() wants to prefetch, so if we have already done the necessary logic here (even if nothing gets prefetched - which might be the case if we have all objects), we do not need to do it in diffcore_rebase(). > > + if (!prefetched) { > > + /* > > + * At this point we know there's actual work to do: we have rename > > + * destinations that didn't find an exact match, and we have potential > > + * sources. So we'll have to do inexact rename detection, which > > + * requires looking at the blobs. > > + * > > + * If we haven't already prefetched, it's worth pre-fetching > > + * them as a group now. > > + */ > > This comment makes me wonder if it would be even better to > > - prepare an empty to_fetch OID array in the caller, > > - if the output format is one of the ones that wants prefetch, add > object names to to_fetch in the caller, BUT not fetch there. > > - pass &to_fetch by the caller to this function, and this code here > may add even more objects, > > - then do the prefetch here (so a single promisor interaction will > grab objects the caller would have fetched before calling us and > the ones we want here), and then clear the to_fetch array. > > - the caller, after seeing this function returns, checks to_fetch > and if it is not empty, fetches (i.e. the caller prepared list of > objects based on the output type, we ended up not calling this > helper, and then finally the caller does the prefetch). > > That way, the "unless we have already prefetched" logic can go, and > we can lose one indentation level, no? This means that the only prefetch occurs in diffcore_rename()? I don't think this will work for 2 reasons: - diffcore_std() calls diffcore_break() (which also reads blobs) before diffcore_rename() - (more importantly) there's a code path in diffcore_std() that does not call diffcore_rename(), so we would still need some prefetching logic in diffcore_std() in case diffcore_rename() is not called > > + if (to_fetch.nr) > > + promisor_remote_get_direct(options->repo, > > + to_fetch.oid, to_fetch.nr); > > You no longer need the if(), no? Ah...I'll remove the if().
Jonathan Tan <jonathantanmy@google.com> writes: >> This comment makes me wonder if it would be even better to >> >> - prepare an empty to_fetch OID array in the caller, >> >> - if the output format is one of the ones that wants prefetch, add >> object names to to_fetch in the caller, BUT not fetch there. >> >> - pass &to_fetch by the caller to this function, and this code here >> may add even more objects, >> >> - then do the prefetch here (so a single promisor interaction will >> grab objects the caller would have fetched before calling us and >> the ones we want here), and then clear the to_fetch array. >> >> - the caller, after seeing this function returns, checks to_fetch >> and if it is not empty, fetches (i.e. the caller prepared list of >> objects based on the output type, we ended up not calling this >> helper, and then finally the caller does the prefetch). >> >> That way, the "unless we have already prefetched" logic can go, and >> we can lose one indentation level, no? > > This means that the only prefetch occurs in diffcore_rename()? No, but I phrased the last bullet item incorrectly. "after seeing this function returns" is wrong, but what is in parentheses (i.e. if we didn't call diffcore_rename) is correct. > I don't > think this will work for 2 reasons: > > - diffcore_std() calls diffcore_break() (which also reads blobs) before > diffcore_rename() Ahh, I missed that part. > - (more importantly) there's a code path in diffcore_std() that does > not call diffcore_rename(), so we would still need some prefetching > logic in diffcore_std() in case diffcore_rename() is not called That one I think is already covered.
Jonathan Tan <jonathantanmy@google.com> writes: > My idea is that this prefetch is a superset of what diffcore_rebase() > wants to prefetch, so if we have already done the necessary logic here > (even if nothing gets prefetched - which might be the case if we have > all objects), we do not need to do it in diffcore_rebase(). s/rebase/rename/ I presume, but the above reasoning, while it may happen to hold true right now, feels brittle. In other words - how do we know it would stay to be "a superset"? - would it change the picture if we later added a prefetch in diffcore_break(), just like you are doing so to diffcore_rename() in this patch? So the revised version of my earlier "wondering" is if it would be more future-proof (easier to teach different steps to prefetch for their own needs, without having to make an assumption like "what this step needs is sufficient for the other step") to arrange the codepath from diffcore_std() to its helpers like so: - prepare an empty to_fetch OID array in the caller, - if the output format is one of the ones that wants prefetch, add object names to to_fetch in the caller, but not fetch as long as the caller does not yet need the contents of the blobs. - pass &to_fetch from diffcore_std() to the helper functions in the diffcore family like diffcore_{break,rename}() have them also batch what they (may) want to prefetch in there. Delay fetching until they actually need to look at the blobs, and when they fetch, clear &to_fetch for the next helper. - diffcore_std() also would need to look at the blob eventually, perhaps after all the helpers it may call returns. Do the final prefetch if to_fetch is still not empty before it has to look at the blobs. Thanks.
> Jonathan Tan <jonathantanmy@google.com> writes: > > > My idea is that this prefetch is a superset of what diffcore_rebase() > > wants to prefetch, so if we have already done the necessary logic here > > (even if nothing gets prefetched - which might be the case if we have > > all objects), we do not need to do it in diffcore_rebase(). > > s/rebase/rename/ I presume, Ah, yes. > but the above reasoning, while it may > happen to hold true right now, feels brittle. In other words > > - how do we know it would stay to be "a superset"? > > - would it change the picture if we later added a prefetch in > diffcore_break(), just like you are doing so to diffcore_rename() > in this patch? > > So the revised version of my earlier "wondering" is if it would be > more future-proof (easier to teach different steps to prefetch for > their own needs, without having to make an assumption like "what > this step needs is sufficient for the other step") to arrange the > codepath from diffcore_std() to its helpers like so: > > - prepare an empty to_fetch OID array in the caller, > > - if the output format is one of the ones that wants prefetch, > add object names to to_fetch in the caller, but not fetch as > long as the caller does not yet need the contents of the > blobs. > > - pass &to_fetch from diffcore_std() to the helper functions in > the diffcore family like diffcore_{break,rename}() have them > also batch what they (may) want to prefetch in there. Delay > fetching until they actually need to look at the blobs, and > when they fetch, clear &to_fetch for the next helper. > > - diffcore_std() also would need to look at the blob eventually, > perhaps after all the helpers it may call returns. Do the > final prefetch if to_fetch is still not empty before it has to > look at the blobs. Ah...that makes sense. Besides the accumulating of prefetch targets (which makes deduplication more necessary - I might make a "sort_and_uniq" function on oid_array that updates the oid_array in-place), looking at the code, it's not only diffcore_break() which might need prefetching, but diffcore_skip_stat_unmatch() too (not to speak of the functions that come after diffcore_rename()). The least brittle way is probably to have diff_populate_filespec() do the prefetching. I'll take a further look.
Jonathan Tan <jonathantanmy@google.com> writes: > Ah...that makes sense. Besides the accumulating of prefetch targets > (which makes deduplication more necessary - I might make a > "sort_and_uniq" function on oid_array that updates the oid_array > in-place), looking at the code, it's not only diffcore_break() which > might need prefetching, but diffcore_skip_stat_unmatch() too (not to > speak of the functions that come after diffcore_rename()). The least > brittle way is probably to have diff_populate_filespec() do the > prefetching. I'll take a further look. It might be a losing battle, unless we can somehow cleanly have a two pass approach where we ask various codepaths "enumerate blobs that you think you would need prefetching in this oid list" before letting any of them actually look at blobs and perform their main tasks, do a single prefetch and then let the existing age-old code call the diffcore transformations as if there is no need for it to worry about prefetching ;-) Thanks.
diff --git a/diff.c b/diff.c index f01b4d91b8..857f02f481 100644 --- a/diff.c +++ b/diff.c @@ -6494,9 +6494,9 @@ void diffcore_fix_diff_index(void) QSORT(q->queue, q->nr, diffnamecmp); } -static void add_if_missing(struct repository *r, - struct oid_array *to_fetch, - const struct diff_filespec *filespec) +void diff_add_if_missing(struct repository *r, + struct oid_array *to_fetch, + const struct diff_filespec *filespec) { if (filespec && filespec->oid_valid && !S_ISGITLINK(filespec->mode) && @@ -6507,24 +6507,42 @@ static void add_if_missing(struct repository *r, void diffcore_std(struct diff_options *options) { - if (options->repo == the_repository && has_promisor_remote()) { - /* - * Prefetch the diff pairs that are about to be flushed. - */ + int prefetched = 0; + int output_formats_to_prefetch = DIFF_FORMAT_DIFFSTAT | + DIFF_FORMAT_NUMSTAT | + DIFF_FORMAT_PATCH | + DIFF_FORMAT_SHORTSTAT | + DIFF_FORMAT_DIRSTAT; + + /* + * Check if the user requested a blob-data-requiring diff output and/or + * break-rewrite detection (which requires blob data). If yes, prefetch + * the diff pairs. + * + * If no prefetching occurs, diffcore_rename() will prefetch if it + * decides that it needs inexact rename detection. + */ + if (options->repo == the_repository && has_promisor_remote() && + (options->output_format & output_formats_to_prefetch || + (!options->found_follow && options->break_opt != -1))) { int i; struct diff_queue_struct *q = &diff_queued_diff; struct oid_array to_fetch = OID_ARRAY_INIT; for (i = 0; i < q->nr; i++) { struct diff_filepair *p = q->queue[i]; - add_if_missing(options->repo, &to_fetch, p->one); - add_if_missing(options->repo, &to_fetch, p->two); + diff_add_if_missing(options->repo, &to_fetch, p->one); + diff_add_if_missing(options->repo, &to_fetch, p->two); } + + prefetched = 1; + /* * NEEDSWORK: Consider deduplicating the OIDs sent. */ promisor_remote_get_direct(options->repo, to_fetch.oid, to_fetch.nr); + oid_array_clear(&to_fetch); } @@ -6537,7 +6555,7 @@ void diffcore_std(struct diff_options *options) diffcore_break(options->repo, options->break_opt); if (options->detect_rename) - diffcore_rename(options); + diffcore_rename(options, prefetched); if (options->break_opt != -1) diffcore_merge_broken(); } diff --git a/diffcore-rename.c b/diffcore-rename.c index e189f407af..79ac1b4bee 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -7,6 +7,7 @@ #include "object-store.h" #include "hashmap.h" #include "progress.h" +#include "promisor-remote.h" /* Table of rename/copy destinations */ @@ -448,7 +449,7 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i return count; } -void diffcore_rename(struct diff_options *options) +void diffcore_rename(struct diff_options *options, int prefetched) { int detect_rename = options->detect_rename; int minimum_score = options->rename_score; @@ -538,6 +539,40 @@ void diffcore_rename(struct diff_options *options) break; } + if (!prefetched) { + /* + * At this point we know there's actual work to do: we have rename + * destinations that didn't find an exact match, and we have potential + * sources. So we'll have to do inexact rename detection, which + * requires looking at the blobs. + * + * If we haven't already prefetched, it's worth pre-fetching + * them as a group now. + */ + int i; + struct oid_array to_fetch = OID_ARRAY_INIT; + + for (i = 0; i < rename_dst_nr; i++) { + if (rename_dst[i].pair) + continue; /* already found exact match */ + diff_add_if_missing(options->repo, &to_fetch, rename_dst[i].two); + } + for (i = 0; i < rename_src_nr; i++) { + if (skip_unmodified && + diff_unmodified_pair(rename_src[i].p)) + /* + * The "for" loop below will not need these + * blobs, so skip prefetching. + */ + continue; + diff_add_if_missing(options->repo, &to_fetch, rename_src[i].p->one); + } + if (to_fetch.nr) + promisor_remote_get_direct(options->repo, + to_fetch.oid, to_fetch.nr); + oid_array_clear(&to_fetch); + } + if (options->show_rename_progress) { progress = start_delayed_progress( _("Performing inexact rename detection"), diff --git a/diffcore.h b/diffcore.h index 7c07347e42..d7af6ab018 100644 --- a/diffcore.h +++ b/diffcore.h @@ -144,7 +144,7 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *, void diff_q(struct diff_queue_struct *, struct diff_filepair *); void diffcore_break(struct repository *, int); -void diffcore_rename(struct diff_options *); +void diffcore_rename(struct diff_options *, int prefetched); void diffcore_merge_broken(void); void diffcore_pickaxe(struct diff_options *); void diffcore_order(const char *orderfile); @@ -182,4 +182,12 @@ int diffcore_count_changes(struct repository *r, unsigned long *src_copied, unsigned long *literal_added); +/* + * If filespec contains an OID and if that object is missing from the given + * repository, add that OID to to_fetch. + */ +void diff_add_if_missing(struct repository *r, + struct oid_array *to_fetch, + const struct diff_filespec *filespec); + #endif diff --git a/t/t4067-diff-partial-clone.sh b/t/t4067-diff-partial-clone.sh index 4831ad35e6..c1ed1c2fc4 100755 --- a/t/t4067-diff-partial-clone.sh +++ b/t/t4067-diff-partial-clone.sh @@ -131,4 +131,52 @@ test_expect_success 'diff with rename detection batches blobs' ' test_line_count = 1 done_lines ' +test_expect_success 'diff does not fetch anything if inexact rename detection is not needed' ' + test_when_finished "rm -rf server client trace" && + + test_create_repo server && + echo a >server/a && + printf "b\nb\nb\nb\nb\n" >server/b && + git -C server add a b && + git -C server commit -m x && + mv server/b server/c && + git -C server add c && + git -C server commit -a -m x && + + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && + + # Ensure no fetches. + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff --raw -M HEAD^ HEAD && + ! test_path_exists trace +' + +test_expect_success 'diff --break-rewrites fetches only if necessary, and batches blobs if it does' ' + test_when_finished "rm -rf server client trace" && + + test_create_repo server && + echo a >server/a && + printf "b\nb\nb\nb\nb\n" >server/b && + git -C server add a b && + git -C server commit -m x && + printf "c\nc\nc\nc\nc\n" >server/b && + git -C server commit -a -m x && + + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && + + # Ensure no fetches. + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff --raw -M HEAD^ HEAD && + ! test_path_exists trace && + + # But with --break-rewrites, ensure that there is exactly 1 negotiation + # by checking that there is only 1 "done" line sent. ("done" marks the + # end of negotiation.) + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff --break-rewrites --raw -M HEAD^ HEAD && + grep "git> done" trace >done_lines && + test_line_count = 1 done_lines +' + test_done
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08) optimized "diff" by prefetching blobs in a partial clone, but there are some cases wherein blobs do not need to be prefetched. In particular, if (1) no blob data is included in the output of the diff, (2) break-rewrite detection is not requested, and (3) no inexact rename detection is needed, then no blobs are read at all. Therefore, in such a case, do not prefetch. Change diffcore_std() to only prefetch if (1) and/or (2) is not true (currently, it always prefetches); change diffcore_rename() to prefetch if (3) is not true and no prefetch has yet occurred. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- diff.c | 38 +++++++++++++++++++-------- diffcore-rename.c | 37 ++++++++++++++++++++++++++- diffcore.h | 10 +++++++- t/t4067-diff-partial-clone.sh | 48 +++++++++++++++++++++++++++++++++++ 4 files changed, 121 insertions(+), 12 deletions(-)