diff mbox series

[v3,1/4] connected: do not sort input revisions

Message ID 1fd83f726a04dfb5be27c74cb116618cb76be923.1627896460.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series Speed up connectivity checks | expand

Commit Message

Patrick Steinhardt Aug. 2, 2021, 9:38 a.m. UTC
In order to compute whether objects reachable from a set of tips are all
connected, we do a revision walk with these tips as positive references
and `--not --all`. `--not --all` will cause the revision walk to load
all preexisting references as uninteresting, which can be very expensive
in repositories with many references.

Benchmarking the git-rev-list(1) command highlights that by far the most
expensive single phase is initial sorting of the input revisions: after
all references have been loaded, we first sort commits by author date.
In a real-world repository with about 2.2 million references, it makes
up about 40% of the total runtime of git-rev-list(1).

Ultimately, the connectivity check shouldn't really bother about the
order of input revisions at all. We only care whether we can actually
walk all objects until we hit the cut-off point. So sorting the input is
a complete waste of time.

Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
cause it to not sort the commits and adjust the connectivity check to
always pass the flag. This results in the following speedups, executed
in a clone of gitlab-org/gitlab [1]:

    Benchmark #1: git rev-list  --objects --quiet --not --all --not $(cat newrev)
      Time (mean ± σ):      7.639 s ±  0.065 s    [User: 7.304 s, System: 0.335 s]
      Range (min … max):    7.543 s …  7.742 s    10 runs

    Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
      Time (mean ± σ):      4.995 s ±  0.044 s    [User: 4.657 s, System: 0.337 s]
      Range (min … max):    4.909 s …  5.048 s    10 runs

    Summary
      'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
        1.53 ± 0.02 times faster than 'git rev-list  --objects --quiet --not --all --not $newrev'

[1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs
     are visible to clients.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 connected.c | 1 +
 revision.c  | 4 +++-
 revision.h  | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

Comments

Ævar Arnfjörð Bjarmason Aug. 2, 2021, 12:49 p.m. UTC | #1
On Mon, Aug 02 2021, Patrick Steinhardt wrote:

> [[PGP Signed Part:Undecided]]
> In order to compute whether objects reachable from a set of tips are all
> connected, we do a revision walk with these tips as positive references
> and `--not --all`. `--not --all` will cause the revision walk to load
> all preexisting references as uninteresting, which can be very expensive
> in repositories with many references.
>
> Benchmarking the git-rev-list(1) command highlights that by far the most
> expensive single phase is initial sorting of the input revisions: after
> all references have been loaded, we first sort commits by author date.
> In a real-world repository with about 2.2 million references, it makes
> up about 40% of the total runtime of git-rev-list(1).
>
> Ultimately, the connectivity check shouldn't really bother about the
> order of input revisions at all. We only care whether we can actually
> walk all objects until we hit the cut-off point. So sorting the input is
> a complete waste of time.

Really good results:

> Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
> cause it to not sort the commits and adjust the connectivity check to
> always pass the flag. This results in the following speedups, executed
> in a clone of gitlab-org/gitlab [1]:
>
>     Benchmark #1: git rev-list  --objects --quiet --not --all --not $(cat newrev)
>       Time (mean ± σ):      7.639 s ±  0.065 s    [User: 7.304 s, System: 0.335 s]
>       Range (min … max):    7.543 s …  7.742 s    10 runs
>
>     Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
>       Time (mean ± σ):      4.995 s ±  0.044 s    [User: 4.657 s, System: 0.337 s]
>       Range (min … max):    4.909 s …  5.048 s    10 runs
>
>     Summary
>       'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
>         1.53 ± 0.02 times faster than 'git rev-list  --objects --quiet --not --all --not $newrev'

Just bikeshedding for a potential re-roll, perhaps --unordered-input, so
that it matches/rhymes with the existing "git cat-file --unordered",
which serves the same conceptual purpose (except this one's input, that
one's output).
Junio C Hamano Aug. 2, 2021, 7 p.m. UTC | #2
Patrick Steinhardt <ps@pks.im> writes:

> In order to compute whether objects reachable from a set of tips are all
> connected, we do a revision walk with these tips as positive references
> and `--not --all`. `--not --all` will cause the revision walk to load
> all preexisting references as uninteresting, which can be very expensive
> in repositories with many references.
>
> Benchmarking the git-rev-list(1) command highlights that by far the most
> expensive single phase is initial sorting of the input revisions: after
> all references have been loaded, we first sort commits by author date.
> In a real-world repository with about 2.2 million references, it makes
> up about 40% of the total runtime of git-rev-list(1).

Nice finding.

> Ultimately, the connectivity check shouldn't really bother about the
> order of input revisions at all. We only care whether we can actually
> walk all objects until we hit the cut-off point. So sorting the input is
> a complete waste of time.

Sorting of positive side is done to help both performance and
correctness in regular use of the traversal machinery, especially
when reachability bitmap is not in effect, but on the negative side
I do not think there is any downside to omit sorting offhand.  The
only case that may get affected is when the revision.c::SLOP kicks
in to deal with oddball commits with incorrect committer timestamps,
but then the result of the sorting isn't to be trusted anyway, so...

> Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
> cause it to not sort the commits and adjust the connectivity check to
> always pass the flag. This results in the following speedups, executed
> in a clone of gitlab-org/gitlab [1]:
> ...
> [1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs
>      are visible to clients.

So is this the 2.2 million refs thing?

> @@ -3584,7 +3586,7 @@ int prepare_revision_walk(struct rev_info *revs)
>  
>  	if (!revs->reflog_info)
>  		prepare_to_use_bloom_filter(revs);
> -	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
> +	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input)
>  		commit_list_sort_by_date(&revs->commits);

Looks quite straight-forward.

I however suspect that in the longer term it may be cleaner to get
rid of REVSISION_WALK_NO_WALK_UNSORTED if we do this.  The knob that
controls if we sort the initial traversal tips and the knob that
controls if we walk from these tips used to be tied to --no-walk
only because ca92e59e30b wanted to affect only no-walk case, but
with your new finding, it clearly is not limited to the no-walk case
to want to avoid sorting.
Patrick Steinhardt Aug. 3, 2021, 8:50 a.m. UTC | #3
On Mon, Aug 02, 2021 at 02:49:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Mon, Aug 02 2021, Patrick Steinhardt wrote:
[snip]
> > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
> > cause it to not sort the commits and adjust the connectivity check to
> > always pass the flag. This results in the following speedups, executed
> > in a clone of gitlab-org/gitlab [1]:
> >
> >     Benchmark #1: git rev-list  --objects --quiet --not --all --not $(cat newrev)
> >       Time (mean ± σ):      7.639 s ±  0.065 s    [User: 7.304 s, System: 0.335 s]
> >       Range (min … max):    7.543 s …  7.742 s    10 runs
> >
> >     Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
> >       Time (mean ± σ):      4.995 s ±  0.044 s    [User: 4.657 s, System: 0.337 s]
> >       Range (min … max):    4.909 s …  5.048 s    10 runs
> >
> >     Summary
> >       'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
> >         1.53 ± 0.02 times faster than 'git rev-list  --objects --quiet --not --all --not $newrev'
> 
> Just bikeshedding for a potential re-roll, perhaps --unordered-input, so
> that it matches/rhymes with the existing "git cat-file --unordered",
> which serves the same conceptual purpose (except this one's input, that
> one's output).

Yeah, I wasn't quite sure how to name it myself either. Internally, we
typically use "unsorted" instead of "unordered", and there's also the
preexisting "--no-walk=(sorted|unsorted)" flag for git-rev-list(1). With
the latter in mind, I think that "unsorted" fits a bit better given that
we already use it in the same command, but I don't particularly mind.

For now, I'll keep "--unsorted-input", but please feel free to give me
another shout if you disagree with my reasoning.

Patrick
Patrick Steinhardt Aug. 3, 2021, 8:55 a.m. UTC | #4
On Mon, Aug 02, 2021 at 12:00:02PM -0700, Junio C Hamano wrote:
> Patrick Steinhardt <ps@pks.im> writes:
[snip]
> > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
> > cause it to not sort the commits and adjust the connectivity check to
> > always pass the flag. This results in the following speedups, executed
> > in a clone of gitlab-org/gitlab [1]:
> > ...
> > [1]: https://gitlab.com/gitlab-org/gitlab.git. Note that not all refs
> >      are visible to clients.
> 
> So is this the 2.2 million refs thing?

Yeah, it is. The repo itself got 2.2 million refs, even though only 800k
are publicly visible. It got even more when one considers its alternate,
where it grows to 3.4 million in total.

> > @@ -3584,7 +3586,7 @@ int prepare_revision_walk(struct rev_info *revs)
> >  
> >  	if (!revs->reflog_info)
> >  		prepare_to_use_bloom_filter(revs);
> > -	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
> > +	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input)
> >  		commit_list_sort_by_date(&revs->commits);
> 
> Looks quite straight-forward.
> 
> I however suspect that in the longer term it may be cleaner to get
> rid of REVSISION_WALK_NO_WALK_UNSORTED if we do this.  The knob that
> controls if we sort the initial traversal tips and the knob that
> controls if we walk from these tips used to be tied to --no-walk
> only because ca92e59e30b wanted to affect only no-walk case, but
> with your new finding, it clearly is not limited to the no-walk case
> to want to avoid sorting.

Right. The question also is what to do when the user calls `git rev-list
--no-walk=sorted --unsorted-input`. Do we sort? Don't we? Should we mark
these options as incompatible with each other and bail out? I guess just
bailing out would be the easiest solution for now.

Patrick
Junio C Hamano Aug. 3, 2021, 9:47 p.m. UTC | #5
Patrick Steinhardt <ps@pks.im> writes:

>> I however suspect that in the longer term it may be cleaner to get
>> rid of REVSISION_WALK_NO_WALK_UNSORTED if we do this.  The knob that
>> controls if we sort the initial traversal tips and the knob that
>> controls if we walk from these tips used to be tied to --no-walk
>> only because ca92e59e30b wanted to affect only no-walk case, but
>> with your new finding, it clearly is not limited to the no-walk case
>> to want to avoid sorting.
>
> Right. The question also is what to do when the user calls `git rev-list
> --no-walk=sorted --unsorted-input`. Do we sort? Don't we? Should we mark
> these options as incompatible with each other and bail out? I guess just
> bailing out would be the easiest solution for now.

I'd say so.  Even without the future clean-up to get rid of the
NO_WALK_UNSORTED, the issue already exists with this series, and
when in doubt, it is easiest to start tight and take our time to
figure out what the right behaviour should be while we initially
do not allow both to be used at the same time.
Ævar Arnfjörð Bjarmason Aug. 4, 2021, 11:01 a.m. UTC | #6
On Tue, Aug 03 2021, Patrick Steinhardt wrote:

> [[PGP Signed Part:Undecided]]
> On Mon, Aug 02, 2021 at 02:49:29PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Mon, Aug 02 2021, Patrick Steinhardt wrote:
> [snip]
>> > Introduce a new "--unsorted-input" flag to git-rev-list(1) which will
>> > cause it to not sort the commits and adjust the connectivity check to
>> > always pass the flag. This results in the following speedups, executed
>> > in a clone of gitlab-org/gitlab [1]:
>> >
>> >     Benchmark #1: git rev-list  --objects --quiet --not --all --not $(cat newrev)
>> >       Time (mean ± σ):      7.639 s ±  0.065 s    [User: 7.304 s, System: 0.335 s]
>> >       Range (min … max):    7.543 s …  7.742 s    10 runs
>> >
>> >     Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
>> >       Time (mean ± σ):      4.995 s ±  0.044 s    [User: 4.657 s, System: 0.337 s]
>> >       Range (min … max):    4.909 s …  5.048 s    10 runs
>> >
>> >     Summary
>> >       'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
>> >         1.53 ± 0.02 times faster than 'git rev-list  --objects --quiet --not --all --not $newrev'
>> 
>> Just bikeshedding for a potential re-roll, perhaps --unordered-input, so
>> that it matches/rhymes with the existing "git cat-file --unordered",
>> which serves the same conceptual purpose (except this one's input, that
>> one's output).
>
> Yeah, I wasn't quite sure how to name it myself either. Internally, we
> typically use "unsorted" instead of "unordered", and there's also the
> preexisting "--no-walk=(sorted|unsorted)" flag for git-rev-list(1). With
> the latter in mind, I think that "unsorted" fits a bit better given that
> we already use it in the same command, but I don't particularly mind.
>
> For now, I'll keep "--unsorted-input", but please feel free to give me
> another shout if you disagree with my reasoning.

Sounds good. I didn't mean it as a strong suggestion, unsorted does
indeed sounds better in this context, just a gentle poke to check if you
knew about that similar option...

Thanks.
diff mbox series

Patch

diff --git a/connected.c b/connected.c
index b18299fdf0..b5f9523a5f 100644
--- a/connected.c
+++ b/connected.c
@@ -106,6 +106,7 @@  int check_connected(oid_iterate_fn fn, void *cb_data,
 	if (opt->progress)
 		strvec_pushf(&rev_list.args, "--progress=%s",
 			     _("Checking connectivity"));
+	strvec_push(&rev_list.args, "--unsorted-input");
 
 	rev_list.git_cmd = 1;
 	rev_list.env = opt->env;
diff --git a/revision.c b/revision.c
index cddd0542a6..7eb09009ba 100644
--- a/revision.c
+++ b/revision.c
@@ -2256,6 +2256,8 @@  static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 	} else if (!strcmp(arg, "--author-date-order")) {
 		revs->sort_order = REV_SORT_BY_AUTHOR_DATE;
 		revs->topo_order = 1;
+	} else if (!strcmp(arg, "--unsorted-input")) {
+		revs->unsorted_input = 1;
 	} else if (!strcmp(arg, "--early-output")) {
 		revs->early_output = 100;
 		revs->topo_order = 1;
@@ -3584,7 +3586,7 @@  int prepare_revision_walk(struct rev_info *revs)
 
 	if (!revs->reflog_info)
 		prepare_to_use_bloom_filter(revs);
-	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
+	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED && !revs->unsorted_input)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
 		return 0;
diff --git a/revision.h b/revision.h
index fbb068da9f..ef6998509a 100644
--- a/revision.h
+++ b/revision.h
@@ -134,6 +134,7 @@  struct rev_info {
 			simplify_history:1,
 			show_pulls:1,
 			topo_order:1,
+			unsorted_input:1,
 			simplify_merges:1,
 			simplify_by_decoration:1,
 			single_worktree:1,