diff mbox series

[1/6] fetch: speed up lookup of want refs via commit-graph

Message ID 6872979c4557204821d788dc3f5e1c8bef0a773c.1629452412.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series Speed up mirror-fetches with many refs | expand

Commit Message

Patrick Steinhardt Aug. 20, 2021, 10:08 a.m. UTC
When updating our local refs based on the refs fetched from the remote,
we need to iterate through all requested refs and load their respective
commits such that we can determine whether they need to be appended to
FETCH_HEAD or not. In cases where we're fetching from a remote with
exceedingly many refs, resolving these refs can be quite expensive given
that we repeatedly need to unpack object headers for each of the
referenced objects.

Speed this up by opportunistcally trying to resolve object IDs via the
commit graph: more likely than not, they're going to be a commit anyway,
and this lets us avoid having to unpack object headers completely in
case the object is a commit that is part of the commit-graph. This
significantly speeds up mirror-fetches in a real-world repository with
2.3M refs:

    Benchmark #1: HEAD~: git-fetch
      Time (mean ± σ):     56.942 s ±  0.449 s    [User: 53.360 s, System: 5.356 s]
      Range (min … max):   56.372 s … 57.533 s    5 runs

    Benchmark #2: HEAD: git-fetch
      Time (mean ± σ):     33.657 s ±  0.167 s    [User: 30.302 s, System: 5.181 s]
      Range (min … max):   33.454 s … 33.844 s    5 runs

    Summary
      'HEAD: git-fetch' ran
        1.69 ± 0.02 times faster than 'HEAD~: git-fetch'

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 builtin/fetch.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

Derrick Stolee Aug. 20, 2021, 2:27 p.m. UTC | #1
On 8/20/2021 6:08 AM, Patrick Steinhardt wrote:
> When updating our local refs based on the refs fetched from the remote,
> we need to iterate through all requested refs and load their respective
> commits such that we can determine whether they need to be appended to
> FETCH_HEAD or not. In cases where we're fetching from a remote with
> exceedingly many refs, resolving these refs can be quite expensive given
> that we repeatedly need to unpack object headers for each of the
> referenced objects.
> 
> Speed this up by opportunistcally trying to resolve object IDs via the
> commit graph: more likely than not, they're going to be a commit anyway,
> and this lets us avoid having to unpack object headers completely in
> case the object is a commit that is part of the commit-graph. This
> significantly speeds up mirror-fetches in a real-world repository with
> 2.3M refs:
> 
>     Benchmark #1: HEAD~: git-fetch
>       Time (mean ± σ):     56.942 s ±  0.449 s    [User: 53.360 s, System: 5.356 s]
>       Range (min … max):   56.372 s … 57.533 s    5 runs
> 
>     Benchmark #2: HEAD: git-fetch
>       Time (mean ± σ):     33.657 s ±  0.167 s    [User: 30.302 s, System: 5.181 s]
>       Range (min … max):   33.454 s … 33.844 s    5 runs
> 
>     Summary
>       'HEAD: git-fetch' ran
>         1.69 ± 0.02 times faster than 'HEAD~: git-fetch'

These numbers are impressive, and it makes sense that performing a
binary search on the OID lookup chunk of the commit-graph is faster
than doing a binary search on the OIDs across the pack-index(es).

I do worry about the case where annotated tags greatly outnumber
branches, so this binary search is extra overhead and the performance
may degrade. Would it be worth checking the ref to see if it lies
within "refs/heads/" (or even _not_ in "refs/tags/") before doing
this commit-graph check?

> -			commit = lookup_commit_reference_gently(the_repository,
> -								&rm->old_oid,
> -								1);
> -			if (!commit)
> -				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
> +			commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
> +			if (!commit) {
> +				commit = lookup_commit_reference_gently(the_repository,
> +									&rm->old_oid,
> +									1);
> +				if (!commit)
> +					rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;

nit: I wouldn't nest this last "if (!commit)".

Code will work as advertised.

Thanks,
-Stolee
Junio C Hamano Aug. 20, 2021, 5:18 p.m. UTC | #2
Derrick Stolee <stolee@gmail.com> writes:

> I do worry about the case where annotated tags greatly outnumber
> branches, so this binary search is extra overhead and the performance
> may degrade. Would it be worth checking the ref to see if it lies
> within "refs/heads/" (or even _not_ in "refs/tags/") before doing
> this commit-graph check?

Ah, clever.
Patrick Steinhardt Aug. 23, 2021, 6:46 a.m. UTC | #3
On Fri, Aug 20, 2021 at 10:18:22AM -0700, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
> > I do worry about the case where annotated tags greatly outnumber
> > branches, so this binary search is extra overhead and the performance
> > may degrade. Would it be worth checking the ref to see if it lies
> > within "refs/heads/" (or even _not_ in "refs/tags/") before doing
> > this commit-graph check?
> 
> Ah, clever.

Good idea. Benchmarks for my test repository (which definitely isn't
representative, but it's at least some numbers) show that restricting to
"refs/heads/" diminishes almost all the gains, while restricting to
everything but "refs/tags/" performs almost the same (it's a tiny bit
slower, probably because of the added string comparisons):

    Benchmark #1: all refs: git-fetch
      Time (mean ± σ):     32.959 s ±  0.282 s    [User: 29.801 s, System: 5.137 s]
      Range (min … max):   32.760 s … 33.158 s    2 runs

    Benchmark #2: refs/heads: git-fetch
      Time (mean ± σ):     56.955 s ±  0.002 s    [User: 53.447 s, System: 5.362 s]
      Range (min … max):   56.953 s … 56.957 s    2 runs

    Benchmark #3: !refs/tags: git-fetch
      Time (mean ± σ):     33.447 s ±  0.003 s    [User: 30.160 s, System: 5.027 s]
      Range (min … max):   33.444 s … 33.449 s    2 runs

    Summary
      'all refs: git-fetch' ran
        1.01 ± 0.01 times faster than '!refs/tags: git-fetch'
        1.73 ± 0.01 times faster than 'refs/heads: git-fetch'

This is easily explained by the fact that the test repo has most of its
refs neither in "refs/tags/" nor in "refs/heads/", but rather in special
namespaces like "refs/merge-requests/", "refs/environments/" or
"refs/keep-around/".

I like the idea of excluding "refs/tags/" though: as you point out,
chances are high that these don't point to commits but to annotated tags
instead. So I'll go with that, thanks!

Patrick
Derrick Stolee Aug. 25, 2021, 2:12 p.m. UTC | #4
On 8/23/2021 2:46 AM, Patrick Steinhardt wrote:
> On Fri, Aug 20, 2021 at 10:18:22AM -0700, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> I do worry about the case where annotated tags greatly outnumber
>>> branches, so this binary search is extra overhead and the performance
>>> may degrade. Would it be worth checking the ref to see if it lies
>>> within "refs/heads/" (or even _not_ in "refs/tags/") before doing
>>> this commit-graph check?
>>
>> Ah, clever.
> 
> Good idea. Benchmarks for my test repository (which definitely isn't
> representative, but it's at least some numbers) show that restricting to
> "refs/heads/" diminishes almost all the gains, while restricting to
> everything but "refs/tags/" performs almost the same (it's a tiny bit
> slower, probably because of the added string comparisons):
> 
>     Benchmark #1: all refs: git-fetch
>       Time (mean ± σ):     32.959 s ±  0.282 s    [User: 29.801 s, System: 5.137 s]
>       Range (min … max):   32.760 s … 33.158 s    2 runs
> 
>     Benchmark #2: refs/heads: git-fetch
>       Time (mean ± σ):     56.955 s ±  0.002 s    [User: 53.447 s, System: 5.362 s]
>       Range (min … max):   56.953 s … 56.957 s    2 runs
> 
>     Benchmark #3: !refs/tags: git-fetch
>       Time (mean ± σ):     33.447 s ±  0.003 s    [User: 30.160 s, System: 5.027 s]
>       Range (min … max):   33.444 s … 33.449 s    2 runs
> 
>     Summary
>       'all refs: git-fetch' ran
>         1.01 ± 0.01 times faster than '!refs/tags: git-fetch'
>         1.73 ± 0.01 times faster than 'refs/heads: git-fetch'

Thanks for testing both options.

> This is easily explained by the fact that the test repo has most of its
> refs neither in "refs/tags/" nor in "refs/heads/", but rather in special
> namespaces like "refs/merge-requests/", "refs/environments/" or
> "refs/keep-around/".

That makes sense to me. GitHub also stores refs like refs/pull/ so I can
understand not wanting to restrict to refs/heads/.

> I like the idea of excluding "refs/tags/" though: as you point out,
> chances are high that these don't point to commits but to annotated tags
> instead. So I'll go with that, thanks!

Yeah, that makes sense as a good way forward.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 405afe9bdf..73f5b286d5 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1131,11 +1131,14 @@  static int store_updated_refs(const char *raw_url, const char *remote_name,
 				continue;
 			}
 
-			commit = lookup_commit_reference_gently(the_repository,
-								&rm->old_oid,
-								1);
-			if (!commit)
-				rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			commit = lookup_commit_in_graph(the_repository, &rm->old_oid);
+			if (!commit) {
+				commit = lookup_commit_reference_gently(the_repository,
+									&rm->old_oid,
+									1);
+				if (!commit)
+					rm->fetch_head_status = FETCH_HEAD_NOT_FOR_MERGE;
+			}
 
 			if (rm->fetch_head_status != want_status)
 				continue;