[0/3] Make add_missing_tags() linear

Message ID	pull.60.git.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Date: Tue, 30 Oct 2018 07:16:02 -0700 (PDT) Message-Id: <pull.60.git.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Subject: [PATCH 0/3] Make add_missing_tags() linear Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: peff@peff.net, newren@gmail.com, Junio C Hamano <gitster@pobox.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Make add_missing_tags() linear \| expand [0/3] Make add_missing_tags() linear [1/3] commit-reach: implement get_reachable_subset [2/3] test-reach: test get_reachable_subset [3/3] remote: make add_missing_tags() linear

Linus Arver via GitGitGadget Oct. 30, 2018, 2:16 p.m. UTC

As reported earlier [1], the add_missing_tags() method in remote.c has
quadratic performance. Some of that performance is curbed due to the
generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
help users without a commit-graph, and it can still be painful if that
cutoff is sufficiently low compared to the tags we are using for
reachability testing.

Add a new method in commit-reach.c called get_reachable_subset() which does
a many-to-many reachability test. Starting at the 'from' commits, walk until
the generation is below the smallest generation in the 'to' commits, or all
'to' commits have been discovered. This performs only one commit walk for
the entire add_missing_tags() method, giving linear performance in the worst
case.

Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
works independently of its application in add_missing_tags().

Thanks, -Stolee

[1] 
https://public-inbox.org/git/CABPp-BECpSOxudovjbDG_3W9wus102RW+E+qPmd4g3Qyd-QDKQ@mail.gmail.com/

Derrick Stolee (3):
  commit-reach: implement get_reachable_subset
  test-reach: test get_reachable_subset
  remote: make add_missing_tags() linear

 commit-reach.c        | 70 +++++++++++++++++++++++++++++++++++++++++++
 commit-reach.h        | 10 +++++++
 remote.c              | 34 ++++++++++++++++++++-
 t/helper/test-reach.c | 34 ++++++++++++++++++---
 t/t6600-test-reach.sh | 52 ++++++++++++++++++++++++++++++++
 5 files changed, 195 insertions(+), 5 deletions(-)


base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-60/derrickstolee/add-missing-tags-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/60

Junio C Hamano Oct. 31, 2018, 3:05 a.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Add a new method in commit-reach.c called get_reachable_subset() which does
> a many-to-many reachability test. Starting at the 'from' commits, walk until
> the generation is below the smallest generation in the 'to' commits, or all
> 'to' commits have been discovered. This performs only one commit walk for
> the entire add_missing_tags() method, giving linear performance in the worst
> case.

;-)

I think in_merge_bases_many() was an attempt to do one half of this
(i.e. it checks only one 'to' against main 'from' to see if any of
them reach).  I wonder why we didn't extend it to multiple 'to' back
when we invented it.

In any case, good to see this code optimized.

Thanks.

Elijah Newren Oct. 31, 2018, 6:04 a.m. UTC | #2

On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> As reported earlier [1], the add_missing_tags() method in remote.c has
> quadratic performance. Some of that performance is curbed due to the
> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
> help users without a commit-graph, and it can still be painful if that
> cutoff is sufficiently low compared to the tags we are using for
> reachability testing.
>
> Add a new method in commit-reach.c called get_reachable_subset() which does
> a many-to-many reachability test. Starting at the 'from' commits, walk until
> the generation is below the smallest generation in the 'to' commits, or all
> 'to' commits have been discovered. This performs only one commit walk for
> the entire add_missing_tags() method, giving linear performance in the worst
> case.
>
> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
> works independently of its application in add_missing_tags().

On the original repo where the topic was brought up, with commit-graph
NOT turned on and using origin/master, I see:

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real 1m20.081s
user 1m19.688s
sys 0m0.292s

Merging this series in, I now get:

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real 0m2.857s
user 0m2.580s
sys 0m0.328s

which provides a very nice speedup.

Oddly enough, if I _also_ do the following:
$ git config core.commitgraph true
$ git config gc.writecommitgraph true
$ git gc

then my timing actually slows down just slightly:
$ time git push --follow-tags --dry-run /home/newren/repo-mirror
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real 0m3.027s
user 0m2.696s
sys 0m0.400s

(run-to-run variation seems pretty consistent, < .1s variation, so
this difference is just enough to notice.)  I wouldn't be that
surprised if that means there's some really old tags with very small
generation numbers, meaning it's not gaining anything in this special
case from the commit-graph, but it does pay the cost of loading the
commit-graph.


Anyway, looks good in my testing.  Thanks much for working on this!

Derrick Stolee Oct. 31, 2018, 12:05 p.m. UTC | #3

On 10/31/2018 2:04 AM, Elijah Newren wrote:
> On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> As reported earlier [1], the add_missing_tags() method in remote.c has
>> quadratic performance. Some of that performance is curbed due to the
>> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
>> help users without a commit-graph, and it can still be painful if that
>> cutoff is sufficiently low compared to the tags we are using for
>> reachability testing.
>>
>> Add a new method in commit-reach.c called get_reachable_subset() which does
>> a many-to-many reachability test. Starting at the 'from' commits, walk until
>> the generation is below the smallest generation in the 'to' commits, or all
>> 'to' commits have been discovered. This performs only one commit walk for
>> the entire add_missing_tags() method, giving linear performance in the worst
>> case.
>>
>> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
>> works independently of its application in add_missing_tags().
>
> On the original repo where the topic was brought up, with commit-graph
> NOT turned on and using origin/master, I see:
>
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]       test5 -> test5
>
> real 1m20.081s
> user 1m19.688s
> sys 0m0.292s
>
> Merging this series in, I now get:
>
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]       test5 -> test5
>
> real 0m2.857s
> user 0m2.580s
> sys 0m0.328s
>
> which provides a very nice speedup.
>
> Oddly enough, if I _also_ do the following:
> $ git config core.commitgraph true
> $ git config gc.writecommitgraph true
> $ git gc
>
> then my timing actually slows down just slightly:
> $ time git push --follow-tags --dry-run /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]          test5 -> test5
>
> real 0m3.027s
> user 0m2.696s
> sys 0m0.400s

So you say that the commit-graph is off in the 2.8s case, but not here
in the 3.1s case? I would expect _at minimum_ that the cost of parsing
commits would have a speedup in the commit-graph case.  There may be
something else going on here, since you are timing a `push` event that
is doing more than the current walk.

> (run-to-run variation seems pretty consistent, < .1s variation, so
> this difference is just enough to notice.)  I wouldn't be that
> surprised if that means there's some really old tags with very small
> generation numbers, meaning it's not gaining anything in this special
> case from the commit-graph, but it does pay the cost of loading the
> commit-graph.

While you have this test environment, do you mind applying the diff
below and re-running the tests? It will output a count for how many
commits are walked by the algorithm. This should help us determine if
this is another case where generation numbers are worse than commit-date,
or if there is something else going on. Thanks!

-->8--

From 2115e7dcafb2770455b7b4793f90edc2254bad97 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Wed, 31 Oct 2018 11:40:50 +0000
Subject: [PATCH] DO-NOT-MERGE: count commits in get_reachable_subset

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index a98532ecc8..b512461cf7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -700,6 +700,7 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit **from_last = from + nr_from;
 	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
+	int count = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
@@ -729,6 +730,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
 		struct commit_list *parents;
 
+		count++;
+
 		if (current->object.flags & PARENT1) {
 			current->object.flags &= ~PARENT1;
 			current->object.flags |= reachable_flag;
@@ -755,6 +758,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	clear_commit_marks_many(nr_to, to, PARENT1);
 	clear_commit_marks_many(nr_from, from, PARENT2);
 
+	fprintf(stderr, "count: %d\n", count);
+
 	return found_commits;
 }

Elijah Newren Nov. 1, 2018, 6:52 a.m. UTC | #4

On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 10/31/2018 2:04 AM, Elijah Newren wrote:
> > On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> As reported earlier [1], the add_missing_tags() method in remote.c has
> >> quadratic performance. Some of that performance is curbed due to the
> >> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
> >> help users without a commit-graph, and it can still be painful if that
> >> cutoff is sufficiently low compared to the tags we are using for
> >> reachability testing.
> >>
> >> Add a new method in commit-reach.c called get_reachable_subset() which does
> >> a many-to-many reachability test. Starting at the 'from' commits, walk until
> >> the generation is below the smallest generation in the 'to' commits, or all
> >> 'to' commits have been discovered. This performs only one commit walk for
> >> the entire add_missing_tags() method, giving linear performance in the worst
> >> case.
> >>
> >> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
> >> works independently of its application in add_missing_tags().
> >
> > On the original repo where the topic was brought up, with commit-graph
> > NOT turned on and using origin/master, I see:
> >
> > $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> > To /home/newren/repo-mirror
> >  * [new branch]       test5 -> test5
> >
> > real 1m20.081s
> > user 1m19.688s
> > sys 0m0.292s
> >
> > Merging this series in, I now get:
> >
> > $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> > To /home/newren/repo-mirror
> >  * [new branch]       test5 -> test5
> >
> > real 0m2.857s
> > user 0m2.580s
> > sys 0m0.328s
> >
> > which provides a very nice speedup.
> >
> > Oddly enough, if I _also_ do the following:
> > $ git config core.commitgraph true
> > $ git config gc.writecommitgraph true
> > $ git gc
> >
> > then my timing actually slows down just slightly:
> > $ time git push --follow-tags --dry-run /home/newren/repo-mirror
> > To /home/newren/repo-mirror
> >  * [new branch]          test5 -> test5
> >
> > real 0m3.027s
> > user 0m2.696s
> > sys 0m0.400s
>
> So you say that the commit-graph is off in the 2.8s case, but not here
> in the 3.1s case? I would expect _at minimum_ that the cost of parsing
> commits would have a speedup in the commit-graph case.  There may be
> something else going on here, since you are timing a `push` event that
> is doing more than the current walk.
>
> > (run-to-run variation seems pretty consistent, < .1s variation, so
> > this difference is just enough to notice.)  I wouldn't be that
> > surprised if that means there's some really old tags with very small
> > generation numbers, meaning it's not gaining anything in this special
> > case from the commit-graph, but it does pay the cost of loading the
> > commit-graph.
>
> While you have this test environment, do you mind applying the diff
> below and re-running the tests? It will output a count for how many
> commits are walked by the algorithm. This should help us determine if
> this is another case where generation numbers are worse than commit-date,
> or if there is something else going on. Thanks!

I can do that, but wouldn't you want a similar patch for the old
get_merge_bases_many() in order to compare?  Does an absolute number
help by itself?
It's going to have to be tomorrow, though; not enough time tonight.

Derrick Stolee Nov. 1, 2018, 12:32 p.m. UTC | #5

On 11/1/2018 2:52 AM, Elijah Newren wrote:
> On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee <stolee@gmail.com> wrote:
>> On 10/31/2018 2:04 AM, Elijah Newren wrote:
>>>
>>> On the original repo where the topic was brought up, with commit-graph
>>> NOT turned on and using origin/master, I see:
>>>
>>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
>>> To /home/newren/repo-mirror
>>>   * [new branch]       test5 -> test5
>>>
>>> real 1m20.081s
>>> user 1m19.688s
>>> sys 0m0.292s
>>>
>>> Merging this series in, I now get:
>>>
>>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
>>> To /home/newren/repo-mirror
>>>   * [new branch]       test5 -> test5
>>>
>>> real 0m2.857s
>>> user 0m2.580s
>>> sys 0m0.328s
>>>
>>> which provides a very nice speedup.
>>>
>>> Oddly enough, if I _also_ do the following:
>>> $ git config core.commitgraph true
>>> $ git config gc.writecommitgraph true
>>> $ git gc
>>>
>>> then my timing actually slows down just slightly:
>>> $ time git push --follow-tags --dry-run /home/newren/repo-mirror
>>> To /home/newren/repo-mirror
>>>   * [new branch]          test5 -> test5
>>>
>>> real 0m3.027s
>>> user 0m2.696s
>>> sys 0m0.400s
>> So you say that the commit-graph is off in the 2.8s case, but not here
>> in the 3.1s case? I would expect _at minimum_ that the cost of parsing
>> commits would have a speedup in the commit-graph case.  There may be
>> something else going on here, since you are timing a `push` event that
>> is doing more than the current walk.
>>
>>> (run-to-run variation seems pretty consistent, < .1s variation, so
>>> this difference is just enough to notice.)  I wouldn't be that
>>> surprised if that means there's some really old tags with very small
>>> generation numbers, meaning it's not gaining anything in this special
>>> case from the commit-graph, but it does pay the cost of loading the
>>> commit-graph.
>> While you have this test environment, do you mind applying the diff
>> below and re-running the tests? It will output a count for how many
>> commits are walked by the algorithm. This should help us determine if
>> this is another case where generation numbers are worse than commit-date,
>> or if there is something else going on. Thanks!
> I can do that, but wouldn't you want a similar patch for the old
> get_merge_bases_many() in order to compare?  Does an absolute number
> help by itself?
> It's going to have to be tomorrow, though; not enough time tonight.

No rush. I'd just like to understand how removing the commit-graph file
can make the new algorithm faster. Putting a similar count in the old
algorithm would involve giving a count for every call to
in_merge_bases_many(), which would be very noisy.

Thanks!
-Stolee

Elijah Newren Nov. 1, 2018, 6:57 p.m. UTC | #6

On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
> On 11/1/2018 2:52 AM, Elijah Newren wrote:
> > On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee <stolee@gmail.com> wrote:
> >> On 10/31/2018 2:04 AM, Elijah Newren wrote:
> >>>
> >>> On the original repo where the topic was brought up, with commit-graph
> >>> NOT turned on and using origin/master, I see:
> >>>
> >>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> >>> To /home/newren/repo-mirror
> >>>   * [new branch]       test5 -> test5
> >>>
> >>> real 1m20.081s
> >>> user 1m19.688s
> >>> sys 0m0.292s
> >>>
> >>> Merging this series in, I now get:
> >>>
> >>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> >>> To /home/newren/repo-mirror
> >>>   * [new branch]       test5 -> test5
> >>>
> >>> real 0m2.857s
> >>> user 0m2.580s
> >>> sys 0m0.328s
> >>>
> >>> which provides a very nice speedup.
> >>>
> >>> Oddly enough, if I _also_ do the following:
> >>> $ git config core.commitgraph true
> >>> $ git config gc.writecommitgraph true
> >>> $ git gc
> >>>
> >>> then my timing actually slows down just slightly:
> >>> $ time git push --follow-tags --dry-run /home/newren/repo-mirror
> >>> To /home/newren/repo-mirror
> >>>   * [new branch]          test5 -> test5
> >>>
> >>> real 0m3.027s
> >>> user 0m2.696s
> >>> sys 0m0.400s
> >> So you say that the commit-graph is off in the 2.8s case, but not here
> >> in the 3.1s case? I would expect _at minimum_ that the cost of parsing
> >> commits would have a speedup in the commit-graph case.  There may be
> >> something else going on here, since you are timing a `push` event that
> >> is doing more than the current walk.
> >>
> >>> (run-to-run variation seems pretty consistent, < .1s variation, so
> >>> this difference is just enough to notice.)  I wouldn't be that
> >>> surprised if that means there's some really old tags with very small
> >>> generation numbers, meaning it's not gaining anything in this special
> >>> case from the commit-graph, but it does pay the cost of loading the
> >>> commit-graph.
> >> While you have this test environment, do you mind applying the diff
> >> below and re-running the tests? It will output a count for how many
> >> commits are walked by the algorithm. This should help us determine if
> >> this is another case where generation numbers are worse than commit-date,
> >> or if there is something else going on. Thanks!
> > I can do that, but wouldn't you want a similar patch for the old
> > get_merge_bases_many() in order to compare?  Does an absolute number
> > help by itself?
> > It's going to have to be tomorrow, though; not enough time tonight.
>
> No rush. I'd just like to understand how removing the commit-graph file
> can make the new algorithm faster. Putting a similar count in the old
> algorithm would involve giving a count for every call to
> in_merge_bases_many(), which would be very noisy.

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
count: 92912
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real    0m3.024s
user    0m2.752s
sys    0m0.320s


Also:
$ git rev-list --count HEAD
55764
$ git rev-list --count --all
91820

Seems a little odd to me that count is greater than `git rev-list
--count --all`.  However, the fact that they are close in magnitude
isn't surprising since I went digging for the commit with smallest
generation number not found in the upstream repo, and found:
$ git ls-remote /home/newren/repo-mirror/ | grep refs/tags/v0.2.0; echo $?
1
$ git rev-list --count refs/tags/v0.2.0
4
$ git rev-list --count refs/tags/v0.2.0 ^HEAD
4


So, the commit-graph can only help us avoid parsing 3 or so commits,
but we have to parse the 5M .git/objects/info/commit-graph file, and
then for every parse_commit() call we need to bsearch_graph() for the
commit.    My theory is that parsing the commit-graph file and binary
searching it for commits is relatively fast, but that the time is just
big enough to measure and notice, while avoiding parsing 3 more
commits is a negligible time savings.

To me, I'm thinking this is one of those bizarre corner cases where
the commit-graph is almost imperceptibly slower than without the
commit-graph.  (And it is a very weird repo; someone repeatedly
filter-branched lots of small independent repos into a monorepo, but
didn't always push everything and didn't clean out all old stuff.)
But if you still see weird stuff you want to dig into further (maybe
the 92912 > 91820 bit?), I'm happy to try out other stuff.

Derrick Stolee Nov. 1, 2018, 7:02 p.m. UTC | #7

On 11/1/2018 2:57 PM, Elijah Newren wrote:
> On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
>> No rush. I'd just like to understand how removing the commit-graph file
>> can make the new algorithm faster. Putting a similar count in the old
>> algorithm would involve giving a count for every call to
>> in_merge_bases_many(), which would be very noisy.
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> count: 92912
> To /home/newren/repo-mirror
>   * [new branch]              test5 -> test5
>
> real    0m3.024s
> user    0m2.752s
> sys    0m0.320s

Is the above test with or without the commit-graph file? Can you run it 
in the other mode, too? I'd like to see if the "count" value changes 
when the only difference is the presence of a commit-graph file.

Thanks,
-Stolee

Elijah Newren Nov. 2, 2018, 2:58 p.m. UTC | #8

On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/1/2018 2:57 PM, Elijah Newren wrote:
> > On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
> >> No rush. I'd just like to understand how removing the commit-graph file
> >> can make the new algorithm faster. Putting a similar count in the old
> >> algorithm would involve giving a count for every call to
> >> in_merge_bases_many(), which would be very noisy.
> > $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> > count: 92912
> > To /home/newren/repo-mirror
> >   * [new branch]              test5 -> test5
> >
> > real    0m3.024s
> > user    0m2.752s
> > sys    0m0.320s
>
> Is the above test with or without the commit-graph file? Can you run it
> in the other mode, too? I'd like to see if the "count" value changes
> when the only difference is the presence of a commit-graph file.

I apologize for providing misleading information earlier; this was an
apples to oranges comparison.  Here's what I did:

<build a version of git with your fixes>
git clone coworker.bundle coworker-repo
cd coworker-repo
time git push --dry-run --follow-tags /home/newren/repo-mirror
git config core.commitgraph true
git config gc.writecommitgraph true
git gc
time git push --dry-run --follow-tags /home/newren/nucleus-mirror

I figured I had just done a fresh clone, so surely the gc wouldn't do
anything other than write the .git/objects/info/commit-graph file.
However, the original bundle contained many references outside of
refs/heads/ and refs/tags/:

$ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e
refs/tags/ -e HEAD | wc -l
2396

These other refs apparently referred to objects not otherwise
referenced in refs/heads/ and refs/tags/, and caused the gc to explode
lots of loose objects:
$ git count-objects -v
count: 147604
size: 856416
in-pack: 1180692
packs: 1
size-pack: 796143
prune-packable: 0
garbage: 0
size-garbage: 0

The slowdown with commit-graph was entirely due to there being lots of
loose objects (147K of them).  If I add a git-prune before doing the
timing with commit-graph, then the timing with commit-graph is faster
than the run without a commit-graph.

Sorry for the wild goose chase.

And thanks for the fixes; get_reachable_subset() makes things much
faster even without a commit-graph, and the commit-graph just improves
it more.  :-)

Derrick Stolee Nov. 2, 2018, 3:38 p.m. UTC | #9

On 11/2/2018 10:58 AM, Elijah Newren wrote:
> On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>> On 11/1/2018 2:57 PM, Elijah Newren wrote:
>>> On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
>>>> No rush. I'd just like to understand how removing the commit-graph file
>>>> can make the new algorithm faster. Putting a similar count in the old
>>>> algorithm would involve giving a count for every call to
>>>> in_merge_bases_many(), which would be very noisy.
>>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
>>> count: 92912
>>> To /home/newren/repo-mirror
>>>    * [new branch]              test5 -> test5
>>>
>>> real    0m3.024s
>>> user    0m2.752s
>>> sys    0m0.320s
>> Is the above test with or without the commit-graph file? Can you run it
>> in the other mode, too? I'd like to see if the "count" value changes
>> when the only difference is the presence of a commit-graph file.
> I apologize for providing misleading information earlier; this was an
> apples to oranges comparison.  Here's what I did:
>
> <build a version of git with your fixes>
> git clone coworker.bundle coworker-repo
> cd coworker-repo
> time git push --dry-run --follow-tags /home/newren/repo-mirror
> git config core.commitgraph true
> git config gc.writecommitgraph true
> git gc
> time git push --dry-run --follow-tags /home/newren/nucleus-mirror
>
>
> I figured I had just done a fresh clone, so surely the gc wouldn't do
> anything other than write the .git/objects/info/commit-graph file.
> However, the original bundle contained many references outside of
> refs/heads/ and refs/tags/:
>
> $ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e
> refs/tags/ -e HEAD | wc -l
> 2396
>
> These other refs apparently referred to objects not otherwise
> referenced in refs/heads/ and refs/tags/, and caused the gc to explode
> lots of loose objects:
> $ git count-objects -v
> count: 147604
> size: 856416
> in-pack: 1180692
> packs: 1
> size-pack: 796143
> prune-packable: 0
> garbage: 0
> size-garbage: 0
>
>
> The slowdown with commit-graph was entirely due to there being lots of
> loose objects (147K of them).  If I add a git-prune before doing the
> timing with commit-graph, then the timing with commit-graph is faster
> than the run without a commit-graph.
>
> Sorry for the wild goose chase.
>
> And thanks for the fixes; get_reachable_subset() makes things much
> faster even without a commit-graph, and the commit-graph just improves
> it more.  :-)


Thanks for double-checking! It's good to have confidence that this is a good

algorithm to use.


Thanks,

-Stolee

[0/3] Make add_missing_tags() linear

Message

Comments