Message ID | 20190405170934.20441-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | fixup! diff: batch fetching of missing blobs | expand |
Hi Jonathan, On Fri, 5 Apr 2019, Jonathan Tan wrote: > This is a fixup on the tip of jt/batch-fetch-blobs-in-diff (571debe1d9). > > I don't know if Junio has already merged this branch to next (he marked > this as "Will merge to 'next'" in the "What's Cooking" email, but when I > fetched, it hasn't been merged yet). If he has, we can use this commit > message: > > diff: propagate options->repo to add_if_missing > > Avoid a usage of the_repository by propagating the configured repository > to add_if_missing(). Also, prefetch only if the repository being diffed > is the_repository (because we do not support lazy fetching for any other > repository anyway). True, and the introduction of `has_promisor_remotes()` should probably help that, by introducing a parameter of type `struct repository *r`. > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> The patch and the commit message look good! Thanks, Dscho > diff --git a/diff.c b/diff.c > index 1eccefb4ef..811afbdfb1 100644 > --- a/diff.c > +++ b/diff.c > @@ -6367,18 +6367,19 @@ void diffcore_fix_diff_index(void) > QSORT(q->queue, q->nr, diffnamecmp); > } > > -static void add_if_missing(struct oid_array *to_fetch, > +static void add_if_missing(struct oid_array *to_fetch, struct repository *r, > const struct diff_filespec *filespec) > { > if (filespec && filespec->oid_valid && > - oid_object_info_extended(the_repository, &filespec->oid, NULL, > + oid_object_info_extended(r, &filespec->oid, NULL, > OBJECT_INFO_FOR_PREFETCH)) > oid_array_append(to_fetch, &filespec->oid); > } > > void diffcore_std(struct diff_options *options) > { > - if (repository_format_partial_clone) { > + if (options->repo == the_repository && > + repository_format_partial_clone) { > /* > * Prefetch the diff pairs that are about to be flushed. > */ > @@ -6388,8 +6389,8 @@ void diffcore_std(struct diff_options *options) > > for (i = 0; i < q->nr; i++) { > struct diff_filepair *p = q->queue[i]; > - add_if_missing(&to_fetch, p->one); > - add_if_missing(&to_fetch, p->two); > + add_if_missing(&to_fetch, options->repo, p->one); > + add_if_missing(&to_fetch, options->repo, p->two); > } > if (to_fetch.nr) > /* > -- > 2.21.0.392.gf8f6787159e-goog > >
On Sat, Apr 6, 2019 at 12:09 AM Jonathan Tan <jonathantanmy@google.com> wrote: > > This is a fixup on the tip of jt/batch-fetch-blobs-in-diff (571debe1d9). > > I don't know if Junio has already merged this branch to next (he marked > this as "Will merge to 'next'" in the "What's Cooking" email, but when I > fetched, it hasn't been merged yet). If he has, we can use this commit > message: > > diff: propagate options->repo to add_if_missing > > Avoid a usage of the_repository by propagating the configured repository > to add_if_missing(). Also, prefetch only if the repository being diffed > is the_repository (because we do not support lazy fetching for any other > repository anyway). > > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > --- > Thanks, Duy, for noticing this. > --- > diff.c | 11 ++++++----- > 1 file changed, 6 insertions(+), 5 deletions(-) > > diff --git a/diff.c b/diff.c > index 1eccefb4ef..811afbdfb1 100644 > --- a/diff.c > +++ b/diff.c > @@ -6367,18 +6367,19 @@ void diffcore_fix_diff_index(void) > QSORT(q->queue, q->nr, diffnamecmp); > } > > -static void add_if_missing(struct oid_array *to_fetch, > +static void add_if_missing(struct oid_array *to_fetch, struct repository *r, One last nit. "struct repository *r" is often the first argument. See https://public-inbox.org/git/xmqqsh2p6l43.fsf@gitster-ct.c.googlers.com/ > const struct diff_filespec *filespec) > { > if (filespec && filespec->oid_valid && > - oid_object_info_extended(the_repository, &filespec->oid, NULL, > + oid_object_info_extended(r, &filespec->oid, NULL, > OBJECT_INFO_FOR_PREFETCH)) > oid_array_append(to_fetch, &filespec->oid); > } > > void diffcore_std(struct diff_options *options) > { > - if (repository_format_partial_clone) { > + if (options->repo == the_repository && > + repository_format_partial_clone) { > /* > * Prefetch the diff pairs that are about to be flushed. > */ > @@ -6388,8 +6389,8 @@ void diffcore_std(struct diff_options *options) > > for (i = 0; i < q->nr; i++) { > struct diff_filepair *p = q->queue[i]; > - add_if_missing(&to_fetch, p->one); > - add_if_missing(&to_fetch, p->two); > + add_if_missing(&to_fetch, options->repo, p->one); > + add_if_missing(&to_fetch, options->repo, p->two); > } > if (to_fetch.nr) > /* > -- > 2.21.0.392.gf8f6787159e-goog >
Duy Nguyen <pclouds@gmail.com> writes: >> -static void add_if_missing(struct oid_array *to_fetch, >> +static void add_if_missing(struct oid_array *to_fetch, struct repository *r, > > One last nit. "struct repository *r" is often the first argument. See Yes, and that's easy enough to tweak locally ;-) Thanks.
Duy Nguyen <pclouds@gmail.com> writes: >> Avoid a usage of the_repository by propagating the configured repository >> to add_if_missing(). Also, prefetch only if the repository being diffed >> is the_repository (because we do not support lazy fetching for any other >> repository anyway). If we are willing to stay limited to the default repository anyway, allowing add_if_missing() to take an arbitrary repository does not really matter, but before the caller of add_if_missing() befcomes ready to work on an arbitrary repository, this change has to happen. To update the caller, it seems to me that fetch_objects() must learn to take an arbitrary repository, but is that the only thing needed? After that, the function that the caller resides in and callchain upwards can learn to take a repository instance if we want to be able to diff inside an arbitrary repository. But. Such a change still would not allow us to compare a tree in one repository against a tree in another repository. It is likely that a caller with such a need would simply make sure that objects in both repositories are available by using the in-core alternate object store mechanism, making it a more-or-less moot point to be able to pass a repository instance through the callchain X-<. We probably should make it, and spell it out somewhere in a long term vision shared among the developers, an explicit goal to get rid of the internal (ab)use of the alternate object store mechanism. With squashing the fix-up commit in, the 2/2 patch has become like so. Thanks, both. -- >8 -- From: Jonathan Tan <jonathantanmy@google.com> Date: Fri, 5 Apr 2019 10:09:34 -0700 Subject: [PATCH] diff: batch fetching of missing blobs When running a command like "git show" or "git diff" in a partial clone, batch all missing blobs to be fetched as one request. This is similar to c0c578b33c ("unpack-trees: batch fetching of missing blobs", 2017-12-08), but for another command. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> --- diff.c | 33 +++++++++++ t/t4067-diff-partial-clone.sh | 103 ++++++++++++++++++++++++++++++++++ 2 files changed, 136 insertions(+) create mode 100755 t/t4067-diff-partial-clone.sh diff --git a/diff.c b/diff.c index ec5c095199..811afbdfb1 100644 --- a/diff.c +++ b/diff.c @@ -25,6 +25,7 @@ #include "packfile.h" #include "parse-options.h" #include "help.h" +#include "fetch-object.h" #ifdef NO_FAST_WORKING_DIRECTORY #define FAST_WORKING_DIRECTORY 0 @@ -6366,8 +6367,40 @@ void diffcore_fix_diff_index(void) QSORT(q->queue, q->nr, diffnamecmp); } +static void add_if_missing(struct oid_array *to_fetch, struct repository *r, + const struct diff_filespec *filespec) +{ + if (filespec && filespec->oid_valid && + oid_object_info_extended(r, &filespec->oid, NULL, + OBJECT_INFO_FOR_PREFETCH)) + oid_array_append(to_fetch, &filespec->oid); +} + void diffcore_std(struct diff_options *options) { + if (options->repo == the_repository && + repository_format_partial_clone) { + /* + * Prefetch the diff pairs that are about to be flushed. + */ + int i; + struct diff_queue_struct *q = &diff_queued_diff; + struct oid_array to_fetch = OID_ARRAY_INIT; + + for (i = 0; i < q->nr; i++) { + struct diff_filepair *p = q->queue[i]; + add_if_missing(&to_fetch, options->repo, p->one); + add_if_missing(&to_fetch, options->repo, p->two); + } + if (to_fetch.nr) + /* + * NEEDSWORK: Consider deduplicating the OIDs sent. + */ + fetch_objects(repository_format_partial_clone, + to_fetch.oid, to_fetch.nr); + oid_array_clear(&to_fetch); + } + /* NOTE please keep the following in sync with diff_tree_combined() */ if (options->skip_stat_unmatch) diffcore_skip_stat_unmatch(options); diff --git a/t/t4067-diff-partial-clone.sh b/t/t4067-diff-partial-clone.sh new file mode 100755 index 0000000000..90c8fb2901 --- /dev/null +++ b/t/t4067-diff-partial-clone.sh @@ -0,0 +1,103 @@ +#!/bin/sh + +test_description='behavior of diff when reading objects in a partial clone' + +. ./test-lib.sh + +test_expect_success 'git show batches blobs' ' + test_when_finished "rm -rf server client trace" && + + test_create_repo server && + echo a >server/a && + echo b >server/b && + git -C server add a b && + git -C server commit -m x && + + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && + + # Ensure that there is exactly 1 negotiation by checking that there is + # only 1 "done" line sent. ("done" marks the end of negotiation.) + GIT_TRACE_PACKET="$(pwd)/trace" git -C client show HEAD && + grep "git> done" trace >done_lines && + test_line_count = 1 done_lines +' + +test_expect_success 'diff batches blobs' ' + test_when_finished "rm -rf server client trace" && + + test_create_repo server && + echo a >server/a && + echo b >server/b && + git -C server add a b && + git -C server commit -m x && + echo c >server/c && + echo d >server/d && + git -C server add c d && + git -C server commit -m x && + + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && + + # Ensure that there is exactly 1 negotiation by checking that there is + # only 1 "done" line sent. ("done" marks the end of negotiation.) + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD && + grep "git> done" trace >done_lines && + test_line_count = 1 done_lines +' + +test_expect_success 'diff skips same-OID blobs' ' + test_when_finished "rm -rf server client trace" && + + test_create_repo server && + echo a >server/a && + echo b >server/b && + git -C server add a b && + git -C server commit -m x && + echo another-a >server/a && + git -C server add a && + git -C server commit -m x && + + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && + + echo a | git hash-object --stdin >hash-old-a && + echo another-a | git hash-object --stdin >hash-new-a && + echo b | git hash-object --stdin >hash-b && + + # Ensure that only a and another-a are fetched. + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD && + grep "want $(cat hash-old-a)" trace && + grep "want $(cat hash-new-a)" trace && + ! grep "want $(cat hash-b)" trace +' + +test_expect_success 'diff with rename detection batches blobs' ' + test_when_finished "rm -rf server client trace" && + + test_create_repo server && + echo a >server/a && + printf "b\nb\nb\nb\nb\n" >server/b && + git -C server add a b && + git -C server commit -m x && + rm server/b && + printf "b\nb\nb\nb\nbX\n" >server/c && + git -C server add c && + git -C server commit -a -m x && + + test_config -C server uploadpack.allowfilter 1 && + test_config -C server uploadpack.allowanysha1inwant 1 && + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && + + # Ensure that there is exactly 1 negotiation by checking that there is + # only 1 "done" line sent. ("done" marks the end of negotiation.) + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff -M HEAD^ HEAD >out && + grep "similarity index" out && + grep "git> done" trace >done_lines && + test_line_count = 1 done_lines +' + +test_done
On Mon, Apr 8, 2019 at 11:06 AM Junio C Hamano <gitster@pobox.com> wrote: > > Duy Nguyen <pclouds@gmail.com> writes: > > >> Avoid a usage of the_repository by propagating the configured repository > >> to add_if_missing(). Also, prefetch only if the repository being diffed > >> is the_repository (because we do not support lazy fetching for any other > >> repository anyway). > > If we are willing to stay limited to the default repository anyway, > allowing add_if_missing() to take an arbitrary repository does not > really matter, but before the caller of add_if_missing() befcomes > ready to work on an arbitrary repository, this change has to happen. > > To update the caller, it seems to me that fetch_objects() must learn > to take an arbitrary repository, but is that the only thing needed? > After that, the function that the caller resides in and callchain > upwards can learn to take a repository instance if we want to be > able to diff inside an arbitrary repository. > > But. Such a change still would not allow us to compare a tree in > one repository against a tree in another repository. I feel lost (and the answer "go read partial clone code!" is perfectly acceptable) but why would we need to diff trees of two different repositories? > It is likely > that a caller with such a need would simply make sure that objects > in both repositories are available by using the in-core alternate > object store mechanism, making it a more-or-less moot point to be > able to pass a repository instance through the callchain X-<. We > probably should make it, and spell it out somewhere in a long term > vision shared among the developers, an explicit goal to get rid of > the internal (ab)use of the alternate object store mechanism. I think submodule code so far is doing this way. Though I don't see any reason we need it for submodule code. Objects are not supposed to be shared between the super- and the sub-repo. > > With squashing the fix-up commit in, the 2/2 patch has become like > so. > > Thanks, both. > > -- >8 -- > From: Jonathan Tan <jonathantanmy@google.com> > Date: Fri, 5 Apr 2019 10:09:34 -0700 > Subject: [PATCH] diff: batch fetching of missing blobs > > When running a command like "git show" or "git diff" in a partial clone, > batch all missing blobs to be fetched as one request. > > This is similar to c0c578b33c ("unpack-trees: batch fetching of missing > blobs", 2017-12-08), but for another command. > > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > Signed-off-by: Junio C Hamano <gitster@pobox.com> > --- > diff.c | 33 +++++++++++ > t/t4067-diff-partial-clone.sh | 103 ++++++++++++++++++++++++++++++++++ > 2 files changed, 136 insertions(+) > create mode 100755 t/t4067-diff-partial-clone.sh > > diff --git a/diff.c b/diff.c > index ec5c095199..811afbdfb1 100644 > --- a/diff.c > +++ b/diff.c > @@ -25,6 +25,7 @@ > #include "packfile.h" > #include "parse-options.h" > #include "help.h" > +#include "fetch-object.h" > > #ifdef NO_FAST_WORKING_DIRECTORY > #define FAST_WORKING_DIRECTORY 0 > @@ -6366,8 +6367,40 @@ void diffcore_fix_diff_index(void) > QSORT(q->queue, q->nr, diffnamecmp); > } > > +static void add_if_missing(struct oid_array *to_fetch, struct repository *r, > + const struct diff_filespec *filespec) > +{ > + if (filespec && filespec->oid_valid && > + oid_object_info_extended(r, &filespec->oid, NULL, > + OBJECT_INFO_FOR_PREFETCH)) > + oid_array_append(to_fetch, &filespec->oid); > +} > + > void diffcore_std(struct diff_options *options) > { > + if (options->repo == the_repository && > + repository_format_partial_clone) { > + /* > + * Prefetch the diff pairs that are about to be flushed. > + */ > + int i; > + struct diff_queue_struct *q = &diff_queued_diff; > + struct oid_array to_fetch = OID_ARRAY_INIT; > + > + for (i = 0; i < q->nr; i++) { > + struct diff_filepair *p = q->queue[i]; > + add_if_missing(&to_fetch, options->repo, p->one); > + add_if_missing(&to_fetch, options->repo, p->two); > + } > + if (to_fetch.nr) > + /* > + * NEEDSWORK: Consider deduplicating the OIDs sent. > + */ > + fetch_objects(repository_format_partial_clone, > + to_fetch.oid, to_fetch.nr); > + oid_array_clear(&to_fetch); > + } > + > /* NOTE please keep the following in sync with diff_tree_combined() */ > if (options->skip_stat_unmatch) > diffcore_skip_stat_unmatch(options); > diff --git a/t/t4067-diff-partial-clone.sh b/t/t4067-diff-partial-clone.sh > new file mode 100755 > index 0000000000..90c8fb2901 > --- /dev/null > +++ b/t/t4067-diff-partial-clone.sh > @@ -0,0 +1,103 @@ > +#!/bin/sh > + > +test_description='behavior of diff when reading objects in a partial clone' > + > +. ./test-lib.sh > + > +test_expect_success 'git show batches blobs' ' > + test_when_finished "rm -rf server client trace" && > + > + test_create_repo server && > + echo a >server/a && > + echo b >server/b && > + git -C server add a b && > + git -C server commit -m x && > + > + test_config -C server uploadpack.allowfilter 1 && > + test_config -C server uploadpack.allowanysha1inwant 1 && > + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && > + > + # Ensure that there is exactly 1 negotiation by checking that there is > + # only 1 "done" line sent. ("done" marks the end of negotiation.) > + GIT_TRACE_PACKET="$(pwd)/trace" git -C client show HEAD && > + grep "git> done" trace >done_lines && > + test_line_count = 1 done_lines > +' > + > +test_expect_success 'diff batches blobs' ' > + test_when_finished "rm -rf server client trace" && > + > + test_create_repo server && > + echo a >server/a && > + echo b >server/b && > + git -C server add a b && > + git -C server commit -m x && > + echo c >server/c && > + echo d >server/d && > + git -C server add c d && > + git -C server commit -m x && > + > + test_config -C server uploadpack.allowfilter 1 && > + test_config -C server uploadpack.allowanysha1inwant 1 && > + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && > + > + # Ensure that there is exactly 1 negotiation by checking that there is > + # only 1 "done" line sent. ("done" marks the end of negotiation.) > + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD && > + grep "git> done" trace >done_lines && > + test_line_count = 1 done_lines > +' > + > +test_expect_success 'diff skips same-OID blobs' ' > + test_when_finished "rm -rf server client trace" && > + > + test_create_repo server && > + echo a >server/a && > + echo b >server/b && > + git -C server add a b && > + git -C server commit -m x && > + echo another-a >server/a && > + git -C server add a && > + git -C server commit -m x && > + > + test_config -C server uploadpack.allowfilter 1 && > + test_config -C server uploadpack.allowanysha1inwant 1 && > + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && > + > + echo a | git hash-object --stdin >hash-old-a && > + echo another-a | git hash-object --stdin >hash-new-a && > + echo b | git hash-object --stdin >hash-b && > + > + # Ensure that only a and another-a are fetched. > + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff HEAD^ HEAD && > + grep "want $(cat hash-old-a)" trace && > + grep "want $(cat hash-new-a)" trace && > + ! grep "want $(cat hash-b)" trace > +' > + > +test_expect_success 'diff with rename detection batches blobs' ' > + test_when_finished "rm -rf server client trace" && > + > + test_create_repo server && > + echo a >server/a && > + printf "b\nb\nb\nb\nb\n" >server/b && > + git -C server add a b && > + git -C server commit -m x && > + rm server/b && > + printf "b\nb\nb\nb\nbX\n" >server/c && > + git -C server add c && > + git -C server commit -a -m x && > + > + test_config -C server uploadpack.allowfilter 1 && > + test_config -C server uploadpack.allowanysha1inwant 1 && > + git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client && > + > + # Ensure that there is exactly 1 negotiation by checking that there is > + # only 1 "done" line sent. ("done" marks the end of negotiation.) > + GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff -M HEAD^ HEAD >out && > + grep "similarity index" out && > + grep "git> done" trace >done_lines && > + test_line_count = 1 done_lines > +' > + > +test_done > -- > 2.21.0-196-g041f5ea1cf >
Duy Nguyen <pclouds@gmail.com> writes: > I feel lost (and the answer "go read partial clone code!" is perfectly > acceptable) but why would we need to diff trees of two different > repositories? It's just a speculation into far future to open possibilities of what we cannot do right now. There is no *NEED*, just like there is no need for a single process of git to be working in multiple repositories at the same time (you can just fork another one into the other repository). >> It is likely >> that a caller with such a need would simply make sure that objects >> in both repositories are available by using the in-core alternate >> object store mechanism, making it a more-or-less moot point to be >> able to pass a repository instance through the callchain X-<. We >> probably should make it, and spell it out somewhere in a long term >> vision shared among the developers, an explicit goal to get rid of >> the internal (ab)use of the alternate object store mechanism. > > I think submodule code so far is doing this way. Though I don't see > any reason we need it for submodule code. Objects are not supposed to > be shared between the super- and the sub-repo. Yup, that is why I made that comment---if we were to pass repository pointer throughout all the codepaths, one of the goal (eh, rather, a useful yardstick to measure how close we are to the goal) should be that we use less and less of that pattern to access objects from random repositories using the in-core alternate mechanism.
diff --git a/diff.c b/diff.c index 1eccefb4ef..811afbdfb1 100644 --- a/diff.c +++ b/diff.c @@ -6367,18 +6367,19 @@ void diffcore_fix_diff_index(void) QSORT(q->queue, q->nr, diffnamecmp); } -static void add_if_missing(struct oid_array *to_fetch, +static void add_if_missing(struct oid_array *to_fetch, struct repository *r, const struct diff_filespec *filespec) { if (filespec && filespec->oid_valid && - oid_object_info_extended(the_repository, &filespec->oid, NULL, + oid_object_info_extended(r, &filespec->oid, NULL, OBJECT_INFO_FOR_PREFETCH)) oid_array_append(to_fetch, &filespec->oid); } void diffcore_std(struct diff_options *options) { - if (repository_format_partial_clone) { + if (options->repo == the_repository && + repository_format_partial_clone) { /* * Prefetch the diff pairs that are about to be flushed. */ @@ -6388,8 +6389,8 @@ void diffcore_std(struct diff_options *options) for (i = 0; i < q->nr; i++) { struct diff_filepair *p = q->queue[i]; - add_if_missing(&to_fetch, p->one); - add_if_missing(&to_fetch, p->two); + add_if_missing(&to_fetch, options->repo, p->one); + add_if_missing(&to_fetch, options->repo, p->two); } if (to_fetch.nr) /*
This is a fixup on the tip of jt/batch-fetch-blobs-in-diff (571debe1d9). I don't know if Junio has already merged this branch to next (he marked this as "Will merge to 'next'" in the "What's Cooking" email, but when I fetched, it hasn't been merged yet). If he has, we can use this commit message: diff: propagate options->repo to add_if_missing Avoid a usage of the_repository by propagating the configured repository to add_if_missing(). Also, prefetch only if the repository being diffed is the_repository (because we do not support lazy fetching for any other repository anyway). Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- Thanks, Duy, for noticing this. --- diff.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-)