diff mbox series

[v3,2/3] builtin/fetch: skip unnecessary tasks when using --negotiate-only

Message ID 20211222001134.28933-3-chooglen@google.com (mailing list archive)
State New, archived
Headers show
Series builtin/fetch: skip unnecessary tasks when using --negotiate-only | expand

Commit Message

Glen Choo Dec. 22, 2021, 12:11 a.m. UTC
cmd_fetch() performs certain tasks with the assumption that objects are
fetched, but `git fetch --negotiate-only` does not fetch objects, as its
name implies. This is results in behavior that is unnecessary at best,
and incorrect at worst:

* Submodules are updated if enabled by recurse.submodules=true, but
  negotiation fetch doesn't actually update the repo, so this doesn't
  make sense (introduced in [1]).
* Commit graphs will be written if enabled by
  fetch.writeCommitGraph=true. But according to
  Documentation/config/fetch.txt [2], this should only be done if a
  pack-file is downloaded.
* gc is run, but according to [3], we only do this because we expect
  `git fetch` to introduce objects.

Instead of disabling these tasks piecemeal, make cmd_fetch() return
early if we know for certain that objects will not be fetched. We can
return early whenever objects are not fetched, but for now this only
considers --negotiate-only.

A similar optimization would be to skip irrelevant tasks in `git fetch
--dry-run`. This optimization was not done in this commit because a dry
run will actually fetch objects; we'd presumably still want to recurse
into submodules and run gc.

[1] 7dce19d374 (fetch/pull: Add the --recurse-submodules option, 2010-11-12)
[2] 50f26bd035 (fetch: add fetch.writeCommitGraph config setting, 2019-09-02)
[3] 131b8fcbfb (fetch: run gc --auto after fetching, 2013-01-26)

Signed-off-by: Glen Choo <chooglen@google.com>
---
 builtin/fetch.c       | 20 ++++++++++++++++++++
 t/t5516-fetch-push.sh | 12 ++++++++++++
 2 files changed, 32 insertions(+)

Comments

Junio C Hamano Dec. 22, 2021, 6:42 a.m. UTC | #1
Glen Choo <chooglen@google.com> writes:

> @@ -2113,6 +2122,17 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  		result = fetch_multiple(&list, max_children);
>  	}
>  
> +	/*
> +	 * Skip irrelevant tasks because we know objects were not
> +	 * fetched.
> +	 *
> +	 * NEEDSWORK: as a future optimization, we can return early
> +	 * whenever objects were not fetched e.g. if we already have all
> +	 * of them.
> +	 */
> +	if (negotiate_only)
> +		goto cleanup;

Sorry if I did not mention this in the review of the earlier round,
but I think the location this patch places the jump is wrong,
especially with the NEEDSWORK comment.

When we are not under negotiate_only, if our earlier call to
transport_fetch_refs() learns to tell us that that we did not add
any new objects, we would be able to jump to the clean-up label,
making the above code to:

	if (negotiate_only || !num_fetched_objects)
		goto cleanup;

But such a future enhancement is made harder by having this jump
here---the optimization the NEEDSWORK comment alludes to has no
reason to be incompatible with "--recurse-submodules".

If the above block is placed _after_ the "if the main fetch was
successful, and we are not told not to recurse into submodules, then
do this" block we see below, then

 (1) this patch still achieves its goal, as we have manually
     and unconditionally turned recursion off;

 (2) such a future enhancement will not be forbidden from working
     with recurse-submodules feature.

>  	if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
>  		struct strvec options = STRVEC_INIT;
>  		int max_children = max_jobs;
> diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
> index 8212ca56dc..732031085e 100755
> --- a/t/t5516-fetch-push.sh
> +++ b/t/t5516-fetch-push.sh
> @@ -229,6 +229,18 @@ test_expect_success 'push with negotiation proceeds anyway even if negotiation f
>  	test_i18ngrep "push negotiation failed" err
>  '
>  
> +test_expect_success 'push with negotiation does not attempt to fetch submodules' '
> +	mk_empty submodule_upstream &&
> +	test_commit -C submodule_upstream submodule_commit &&
> +	git submodule add ./submodule_upstream submodule &&
> +	mk_empty testrepo &&
> +	git push testrepo $the_first_commit:refs/remotes/origin/first_commit &&
> +	test_commit -C testrepo unrelated_commit &&
> +	git -C testrepo config receive.hideRefs refs/remotes/origin/first_commit &&
> +	git -c submodule.recurse=true -c protocol.version=2 -c push.negotiate=1 push testrepo refs/heads/main:refs/remotes/origin/main 2>err &&
> +	! grep "Fetching submodule" err
> +'
> +
>  test_expect_success 'push without wildcard' '
>  	mk_empty testrepo &&
Glen Choo Dec. 22, 2021, 5:28 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> Glen Choo <chooglen@google.com> writes:
>
>> @@ -2113,6 +2122,17 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>>  		result = fetch_multiple(&list, max_children);
>>  	}
>>  
>> +	/*
>> +	 * Skip irrelevant tasks because we know objects were not
>> +	 * fetched.
>> +	 *
>> +	 * NEEDSWORK: as a future optimization, we can return early
>> +	 * whenever objects were not fetched e.g. if we already have all
>> +	 * of them.
>> +	 */
>> +	if (negotiate_only)
>> +		goto cleanup;
>
> Sorry if I did not mention this in the review of the earlier round,
> but I think the location this patch places the jump is wrong,
> especially with the NEEDSWORK comment.
>
> When we are not under negotiate_only, if our earlier call to
> transport_fetch_refs() learns to tell us that that we did not add
> any new objects, we would be able to jump to the clean-up label,
> making the above code to:
>
> 	if (negotiate_only || !num_fetched_objects)
> 		goto cleanup;

Thanks for the clarification. Yes, we agree that the location of the
jump in this patch should be the same as the location of the jump after
the future optimization.

> But such a future enhancement is made harder by having this jump
> here---the optimization the NEEDSWORK comment alludes to has no
> reason to be incompatible with "--recurse-submodules".
>
> If the above block is placed _after_ the "if the main fetch was
> successful, and we are not told not to recurse into submodules, then
> do this" block we see below, then
>
>  (1) this patch still achieves its goal, as we have manually
>      and unconditionally turned recursion off;
>
>  (2) such a future enhancement will not be forbidden from working
>      with recurse-submodules feature.
>

I would have come to same conclusion if I agreed that we should recurse
into submodules even if no objects are fetched. When I first wrote this
patch, I was convinced that "no new objects" implies "no need to update
submodules" (see my response at [1]), but I'm not sure any more and I'd
like to check my understanding.

The way "fetch --recurse-submodules" works is that the changed
submodules are calcuated from the newly updated tips fetched from the
remote. If no objects were fetched, we already have all of the
superproject commits.

In ~99% of the time, no objects were fetched because the remote doesn't
have any info we do not know about - there are no new commits and no
refs were updated. In this scenario, 'git fetch' can avoid recursing
into submodules because there's no need to. But if we choose to recurse,
the worst thing that happens is that we do some file I/O and realize
that there are no changed submodules - essentially a no-op given that
fetch is slow.

(This is where my understanding of objects vs refs needs to be checked)
In the other ~1%, we might already have all commits, but a remote ref
might still have moved, albeit to a known commit. In this case, the
submodule would need to be updated because it might have changed.

If my understanding is correct, then my patch produces the wrong
behavior in that ~1%. But even if my understanding is wrong, and we
don't need to worry about that edge case, I see that there's unnecessary
risk in trying to be too be clever in my reasoning and skipping what is
essentially a no-op.

Is my understanding accurate? At any rate, I'm somewhat convinced to
move the jump to just after the "if main fetch was successful, and we
are not told not to recurse into submodules" block, i.e. before the "if
we should write the commit graph" block.

[1] https://lore.kernel.org/git/kl6ltuf3ysnw.fsf@chooglen-macbookpro.roam.corp.google.com

>>  	if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
>>  		struct strvec options = STRVEC_INIT;
>>  		int max_children = max_jobs;
>> diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
>> index 8212ca56dc..732031085e 100755
>> --- a/t/t5516-fetch-push.sh
>> +++ b/t/t5516-fetch-push.sh
>> @@ -229,6 +229,18 @@ test_expect_success 'push with negotiation proceeds anyway even if negotiation f
>>  	test_i18ngrep "push negotiation failed" err
>>  '
>>  
>> +test_expect_success 'push with negotiation does not attempt to fetch submodules' '
>> +	mk_empty submodule_upstream &&
>> +	test_commit -C submodule_upstream submodule_commit &&
>> +	git submodule add ./submodule_upstream submodule &&
>> +	mk_empty testrepo &&
>> +	git push testrepo $the_first_commit:refs/remotes/origin/first_commit &&
>> +	test_commit -C testrepo unrelated_commit &&
>> +	git -C testrepo config receive.hideRefs refs/remotes/origin/first_commit &&
>> +	git -c submodule.recurse=true -c protocol.version=2 -c push.negotiate=1 push testrepo refs/heads/main:refs/remotes/origin/main 2>err &&
>> +	! grep "Fetching submodule" err
>> +'
>> +
>>  test_expect_success 'push without wildcard' '
>>  	mk_empty testrepo &&
Junio C Hamano Dec. 22, 2021, 7:29 p.m. UTC | #3
Glen Choo <chooglen@google.com> writes:

> I would have come to same conclusion if I agreed that we should recurse
> into submodules even if no objects are fetched. When I first wrote this
> patch, I was convinced that "no new objects" implies "no need to update
> submodules" (see my response at [1]), but I'm not sure any more and I'd
> like to check my understanding.

For example, there is a "everything_local()" optimization in the
fetch-pack codepath where the following steps happen:

 (1) we look at the current value of the remote refs we are fetching
     from their ls_refs output (let's call it "new tips");

 (2) we notice that all of these objects happen to exist in our
     object store.

 (3) we make sure that we do not see any "missing links" if we run a
     reachability traversal that starts from these objects and our
     existing refs, and stops when the traversal intersect.

When the last step finds that all objects necessary to point at
these "new tips" with our refs safely, then we have no reason to
perform physical transfer of objects.  Yet, we'd update our refs
to the "new tips".

This can happen in a number of ways.  

Imagine that you have a clone of https://github.com/git/git/ for
only its 'main' branch (i.e. a single-branch clone).  If you then
say "git fetch origin maint:maint", we'll learn that the tip of
their 'maint' branch points at a commit, we look into our object
store, find that there is no missing object to reach from it to the
part of the object graph that is reachable from our refs (i.e. my
'maint' is always an ancestor of my 'main'), and we find that there
is no reason to transfer any object.  Yet we will carete a new ref
and point at the commit.

Or if you did "git branch -d" locally, making objects unreachable in
your object store, and then fetch from your upstream, which had fast
forwarded to the contents of the branch you just deleted.

Or they rewound and rebuilt their branches since you fetched the
last time, and then they realized their mistake and now their refs
point at a commit that you have already seen but are different from
what your remote-tracking branches point at now.

Or you are using Derrick's "prefetch" (in "git maintenance run") and
a background process already downloaded the objects needed for the
branch you are fetching in the past.

Depending on what happened when these objects were pre-fetched, such
a real fetch that did not have to perform an object transfer may
likely to need to adjust things in the submodule repository.
"prefetch" is designed not to disrupt and to be invisible to the
normal operation as much as possible, so I would expect that it
won't do any priming of the submodules based on what it prefetched
for the superproject, for example.

So in short, physical object transfer can be optimized out, even
when the external world view, i.e. where in the history graph the
refs point at, changes and makes it necessary to check in the
submodule repositories.
Glen Choo Dec. 22, 2021, 8:27 p.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:

> Glen Choo <chooglen@google.com> writes:
>
>> I would have come to same conclusion if I agreed that we should recurse
>> into submodules even if no objects are fetched. When I first wrote this
>> patch, I was convinced that "no new objects" implies "no need to update
>> submodules" (see my response at [1]), but I'm not sure any more and I'd
>> like to check my understanding.
>
> For example, there is a "everything_local()" optimization in the
> fetch-pack codepath where the following steps happen:
>
>  (1) we look at the current value of the remote refs we are fetching
>      from their ls_refs output (let's call it "new tips");
>
>  (2) we notice that all of these objects happen to exist in our
>      object store.
>
>  (3) we make sure that we do not see any "missing links" if we run a
>      reachability traversal that starts from these objects and our
>      existing refs, and stops when the traversal intersect.
>
> When the last step finds that all objects necessary to point at
> these "new tips" with our refs safely, then we have no reason to
> perform physical transfer of objects.  Yet, we'd update our refs
> to the "new tips".
>
> This can happen in a number of ways.  
>
> Imagine that you have a clone of https://github.com/git/git/ for
> only its 'main' branch (i.e. a single-branch clone).  If you then
> say "git fetch origin maint:maint", we'll learn that the tip of
> their 'maint' branch points at a commit, we look into our object
> store, find that there is no missing object to reach from it to the
> part of the object graph that is reachable from our refs (i.e. my
> 'maint' is always an ancestor of my 'main'), and we find that there
> is no reason to transfer any object.  Yet we will carete a new ref
> and point at the commit.
>
> Or if you did "git branch -d" locally, making objects unreachable in
> your object store, and then fetch from your upstream, which had fast
> forwarded to the contents of the branch you just deleted.
>
> Or they rewound and rebuilt their branches since you fetched the
> last time, and then they realized their mistake and now their refs
> point at a commit that you have already seen but are different from
> what your remote-tracking branches point at now.
>
> Or you are using Derrick's "prefetch" (in "git maintenance run") and
> a background process already downloaded the objects needed for the
> branch you are fetching in the past.
>
> Depending on what happened when these objects were pre-fetched, such
> a real fetch that did not have to perform an object transfer may
> likely to need to adjust things in the submodule repository.
> "prefetch" is designed not to disrupt and to be invisible to the
> normal operation as much as possible, so I would expect that it
> won't do any priming of the submodules based on what it prefetched
> for the superproject, for example.

Thanks for providing concrete examples; that's a lot more convincing
than my hypothetical.

> So in short, physical object transfer can be optimized out, even
> when the external world view, i.e. where in the history graph the
> refs point at, changes and makes it necessary to check in the
> submodule repositories.

Yes, you're right. I'll move the jump accordingly :)
diff mbox series

Patch

diff --git a/builtin/fetch.c b/builtin/fetch.c
index eab73d7417..883bb1b10c 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1996,6 +1996,15 @@  int cmd_fetch(int argc, const char **argv, const char *prefix)
 
 	argc = parse_options(argc, argv, prefix,
 			     builtin_fetch_options, builtin_fetch_usage, 0);
+
+	if (negotiate_only) {
+		/*
+		 * --negotiate-only should never recurse into
+		 * submodules, so there is no need to read .gitmodules.
+		 */
+		recurse_submodules = RECURSE_SUBMODULES_OFF;
+	}
+
 	if (recurse_submodules != RECURSE_SUBMODULES_OFF) {
 		int *sfjc = submodule_fetch_jobs_config == -1
 			    ? &submodule_fetch_jobs_config : NULL;
@@ -2113,6 +2122,17 @@  int cmd_fetch(int argc, const char **argv, const char *prefix)
 		result = fetch_multiple(&list, max_children);
 	}
 
+	/*
+	 * Skip irrelevant tasks because we know objects were not
+	 * fetched.
+	 *
+	 * NEEDSWORK: as a future optimization, we can return early
+	 * whenever objects were not fetched e.g. if we already have all
+	 * of them.
+	 */
+	if (negotiate_only)
+		goto cleanup;
+
 	if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
 		struct strvec options = STRVEC_INIT;
 		int max_children = max_jobs;
diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index 8212ca56dc..732031085e 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -229,6 +229,18 @@  test_expect_success 'push with negotiation proceeds anyway even if negotiation f
 	test_i18ngrep "push negotiation failed" err
 '
 
+test_expect_success 'push with negotiation does not attempt to fetch submodules' '
+	mk_empty submodule_upstream &&
+	test_commit -C submodule_upstream submodule_commit &&
+	git submodule add ./submodule_upstream submodule &&
+	mk_empty testrepo &&
+	git push testrepo $the_first_commit:refs/remotes/origin/first_commit &&
+	test_commit -C testrepo unrelated_commit &&
+	git -C testrepo config receive.hideRefs refs/remotes/origin/first_commit &&
+	git -c submodule.recurse=true -c protocol.version=2 -c push.negotiate=1 push testrepo refs/heads/main:refs/remotes/origin/main 2>err &&
+	! grep "Fetching submodule" err
+'
+
 test_expect_success 'push without wildcard' '
 	mk_empty testrepo &&