diff mbox series

[v3,1/2] commit-graph write: add progress output

Message ID 20180917153336.2280-2-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series commit-graph: add progress output | expand

Commit Message

Ævar Arnfjörð Bjarmason Sept. 17, 2018, 3:33 p.m. UTC
Before this change the "commit-graph write" command didn't report any
progress. On my machine this command takes more than 10 seconds to
write the graph for linux.git, and around 1m30s on the
2015-04-03-1M-git.git[1] test repository (a test case for a large
monorepository).

Furthermore, since the gc.writeCommitGraph setting was added in
d5d5d7b641 ("gc: automatically write commit-graph files", 2018-06-27),
there was no indication at all from a "git gc" run that anything was
different. This why one of the progress bars being added here uses
start_progress() instead of start_delayed_progress(), so that it's
guaranteed to be seen. E.g. on my tiny 867 commit dotfiles.git
repository:

    $ git -c gc.writeCommitGraph=true gc
    Enumerating objects: 2821, done.
    [...]
    Computing commit graph generation numbers: 100% (867/867), done.

On larger repositories, such as linux.git the delayed progress bar(s)
will kick in, and we'll show what's going on instead of, as was
previously happening, printing nothing while we write the graph:

    $ git -c gc.writeCommitGraph=true gc
    [...]
    Annotating commits in commit graph: 1565573, done.
    Computing commit graph generation numbers: 100% (782484/782484), done.

Note that here we don't show "Finding commits for commit graph", this
is because under "git gc" we seed the search with the commit
references in the repository, and that set is too small to show any
progress, but would e.g. on a smaller repo such as git.git with
--stdin-commits:

    $ git rev-list --all | git -c gc.writeCommitGraph=true write --stdin-commits
    Finding commits for commit graph: 100% (162576/162576), done.
    Computing commit graph generation numbers: 100% (162576/162576), done.

With --stdin-packs we don't show any estimation of how much is left to
do. This is because we might be processing more than one pack. We
could be less lazy here and show progress, either by detecting that
we're only processing one pack, or by first looping over the packs to
discover how many commits they have. I don't see the point in doing
that work. So instead we get (on 2015-04-03-1M-git.git):

    $ echo pack-<HASH>.idx | git -c gc.writeCommitGraph=true --exec-path=$PWD commit-graph write --stdin-packs
    Finding commits for commit graph: 13064614, done.
    Annotating commits in commit graph: 3001341, done.
    Computing commit graph generation numbers: 100% (1000447/1000447), done.

No GC mode uses --stdin-packs. It's what they use at Microsoft to
manually compute the generation numbers for their collection of large
packs which are never coalesced.

The reason we need a "report_progress" variable passed down from "git
gc" is so that we don't report this output when we're running in the
process "git gc --auto" detaches from the terminal.

Since we write the commit graph from the "git gc" process itself (as
opposed to what we do with say the "git repack" phase), we'd end up
writing the output to .git/gc.log and reporting it to the user next
time as part of the "The last gc run reported the following[...]"
error, see 329e6e8794 ("gc: save log from daemonized gc --auto and
print it next time", 2015-09-19).

So we must keep track of whether or not we're running in that
demonized mode, and if so print no progress.

See [2] and subsequent replies for a discussion of an approach not
taken in compute_generation_numbers(). I.e. we're saying "Computing
commit graph generation numbers", even though on an established
history we're mostly skipping over all the work we did in the
past. This is similar to the white lie we tell in the "Writing
objects" phase (not all are objects being written).

Always showing progress is considered more important than
accuracy. I.e. on a repository like 2015-04-03-1M-git.git we'd hang
for 6 seconds with no output on the second "git gc" if no changes were
made to any objects in the interim if we'd take the approach in [2].

1. https://github.com/avar/2015-04-03-1M-git

2. <c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com>
   (https://public-inbox.org/git/c6960252-c095-fb2b-e0bc-b1e6bb261614@gmail.com/)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/commit-graph.c |  5 ++--
 builtin/gc.c           |  3 ++-
 commit-graph.c         | 60 ++++++++++++++++++++++++++++++++++++------
 commit-graph.h         |  5 ++--
 4 files changed, 60 insertions(+), 13 deletions(-)

Comments

SZEDER Gábor Oct. 10, 2018, 8:37 p.m. UTC | #1
On Mon, Sep 17, 2018 at 03:33:35PM +0000, Ævar Arnfjörð Bjarmason wrote:
>     $ git -c gc.writeCommitGraph=true gc
>     [...]
>     Annotating commits in commit graph: 1565573, done.
>     Computing commit graph generation numbers: 100% (782484/782484), done.

While poking around 'commit-graph.c' in my Bloom filter experiment, I
saw similar numbers like above, and was confused by the much higher
than expected number of annotated commits.  It's about twice as much
as the number of commits in the repository, or the number shown on the
very next line.

> diff --git a/commit-graph.c b/commit-graph.c
> index 8a1bec7b8a..2c5d996194 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> -static void close_reachable(struct packed_oid_list *oids)
> +static void close_reachable(struct packed_oid_list *oids, int report_progress)
>  {
>  	int i;
>  	struct commit *commit;
> +	struct progress *progress = NULL;
> +	int j = 0;
>  
> +	if (report_progress)
> +		progress = start_delayed_progress(
> +			_("Annotating commits in commit graph"), 0);
>  	for (i = 0; i < oids->nr; i++) {
> +		display_progress(progress, ++j);
>  		commit = lookup_commit(the_repository, &oids->list[i]);
>  		if (commit)
>  			commit->object.flags |= UNINTERESTING;
> @@ -604,6 +616,7 @@ static void close_reachable(struct packed_oid_list *oids)
>  	 * closure.
>  	 */
>  	for (i = 0; i < oids->nr; i++) {
> +		display_progress(progress, ++j);
>  		commit = lookup_commit(the_repository, &oids->list[i]);
>  
>  		if (commit && !parse_commit(commit))
> @@ -611,19 +624,28 @@ static void close_reachable(struct packed_oid_list *oids)
>  	}

The above loops have already counted all the commits, and, more
importantly, did all the hard work that takes time and makes the
progress indicator useful.

>  	for (i = 0; i < oids->nr; i++) {
> +		display_progress(progress, ++j);

This display_progress() call, however, doesn't seem to be necessary.
First, it counts all commits for a second time, resulting in the ~2x
difference compared to the actual number of commits, and then causing
my confusion.  Second, all what this loop is doing is setting a flag
in commits that were already looked up and parsed in the above loops.
IOW this loop is very fast, and the progress indicator jumps from
~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
progress indicator at all.

>  		commit = lookup_commit(the_repository, &oids->list[i]);
>  
>  		if (commit)
>  			commit->object.flags &= ~UNINTERESTING;
>  	}
> +	stop_progress(&progress);
>  }
Ævar Arnfjörð Bjarmason Oct. 10, 2018, 9:56 p.m. UTC | #2
On Wed, Oct 10 2018, SZEDER Gábor wrote:

> On Mon, Sep 17, 2018 at 03:33:35PM +0000, Ævar Arnfjörð Bjarmason wrote:
>>     $ git -c gc.writeCommitGraph=true gc
>>     [...]
>>     Annotating commits in commit graph: 1565573, done.
>>     Computing commit graph generation numbers: 100% (782484/782484), done.
>
> While poking around 'commit-graph.c' in my Bloom filter experiment, I
> saw similar numbers like above, and was confused by the much higher
> than expected number of annotated commits.  It's about twice as much
> as the number of commits in the repository, or the number shown on the
> very next line.
>
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 8a1bec7b8a..2c5d996194 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> -static void close_reachable(struct packed_oid_list *oids)
>> +static void close_reachable(struct packed_oid_list *oids, int report_progress)
>>  {
>>  	int i;
>>  	struct commit *commit;
>> +	struct progress *progress = NULL;
>> +	int j = 0;
>>
>> +	if (report_progress)
>> +		progress = start_delayed_progress(
>> +			_("Annotating commits in commit graph"), 0);
>>  	for (i = 0; i < oids->nr; i++) {
>> +		display_progress(progress, ++j);
>>  		commit = lookup_commit(the_repository, &oids->list[i]);
>>  		if (commit)
>>  			commit->object.flags |= UNINTERESTING;
>> @@ -604,6 +616,7 @@ static void close_reachable(struct packed_oid_list *oids)
>>  	 * closure.
>>  	 */
>>  	for (i = 0; i < oids->nr; i++) {
>> +		display_progress(progress, ++j);
>>  		commit = lookup_commit(the_repository, &oids->list[i]);
>>
>>  		if (commit && !parse_commit(commit))
>> @@ -611,19 +624,28 @@ static void close_reachable(struct packed_oid_list *oids)
>>  	}
>
> The above loops have already counted all the commits, and, more
> importantly, did all the hard work that takes time and makes the
> progress indicator useful.
>
>>  	for (i = 0; i < oids->nr; i++) {
>> +		display_progress(progress, ++j);

[...]

> This display_progress() call, however, doesn't seem to be necessary.
> First, it counts all commits for a second time, resulting in the ~2x
> difference compared to the actual number of commits, and then causing
> my confusion.  Second, all what this loop is doing is setting a flag
> in commits that were already looked up and parsed in the above loops.
> IOW this loop is very fast, and the progress indicator jumps from
> ~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
> progress indicator at all.

You're right, I tried this patch on top:

    diff --git a/commit-graph.c b/commit-graph.c
    index a686758603..cccd83de72 100644
    --- a/commit-graph.c
    +++ b/commit-graph.c
    @@ -655,12 +655,16 @@ static void close_reachable(struct packed_oid_list *oids, int report_progress)
     		if (commit)
     			commit->object.flags |= UNINTERESTING;
     	}
    +	stop_progress(&progress); j = 0;

     	/*
     	 * As this loop runs, oids->nr may grow, but not more
     	 * than the number of missing commits in the reachable
     	 * closure.
     	 */
    +	if (report_progress)
    +		progress = start_delayed_progress(
    +			_("Annotating commits in commit graph 2"), 0);
     	for (i = 0; i < oids->nr; i++) {
     		display_progress(progress, ++j);
     		commit = lookup_commit(the_repository, &oids->list[i]);
    @@ -668,7 +672,11 @@ static void close_reachable(struct packed_oid_list *oids, int report_progress)
     		if (commit && !parse_commit(commit))
     			add_missing_parents(oids, commit);
     	}
    +	stop_progress(&progress); j = 0;

    +	if (report_progress)
    +		progress = start_delayed_progress(
    +			_("Annotating commits in commit graph 3"), 0);
     	for (i = 0; i < oids->nr; i++) {
     		display_progress(progress, ++j);
     		commit = lookup_commit(the_repository, &oids->list[i]);

And on a large repo with around 3 million commits the 3rd progress bar
doesn't kick in.

But if I apply this on top:

    diff --git a/progress.c b/progress.c
    index 5a99c9fbf0..89cc705bf7 100644
    --- a/progress.c
    +++ b/progress.c
    @@ -58,8 +58,8 @@ static void set_progress_signal(void)
     	sa.sa_flags = SA_RESTART;
     	sigaction(SIGALRM, &sa, NULL);

    -	v.it_interval.tv_sec = 1;
    -	v.it_interval.tv_usec = 0;
    +	v.it_interval.tv_sec = 0;
    +	v.it_interval.tv_usec = 250000;
     	v.it_value = v.it_interval;
     	setitimer(ITIMER_REAL, &v, NULL);
     }

I.e. start the timer after 1/4 of a second instead of 1 second, I get
that progress bar.

So I'm inclined to keep it. It just needs to be 4x the size before it's
noticeably hanging for 1 second.

That repo isn't all that big compared to what we've heard about out
there, and inner loops like this have a tendency to accumulate some more
code over time without a re-visit of why we weren't monitoring progress
there.

But maybe we can fix the message. We say "Annotating commits in commit
grap", not "Counting" or whatever. I was trying to find something that
didn't imply that we were doing this once. One can annotate a thing more
than once, but maybe ther's a better way to explain this...

We had some more accurate progress reporting in close_reachable(),
discussed in
https://public-inbox.org/git/87efe5qqks.fsf@evledraar.gmail.com/ I still
think the *main* use-case for these things is to just report that we're
not hanging, so maybe the proper solution is to pick up Duy's patch to
display a spinner insted of a numeric progress.

>>  		commit = lookup_commit(the_repository, &oids->list[i]);
>>
>>  		if (commit)
>>  			commit->object.flags &= ~UNINTERESTING;
>>  	}
>> +	stop_progress(&progress);
>>  }
SZEDER Gábor Oct. 10, 2018, 10:19 p.m. UTC | #3
On Wed, Oct 10, 2018 at 11:56:45PM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Oct 10 2018, SZEDER Gábor wrote:

> >>  	for (i = 0; i < oids->nr; i++) {
> >> +		display_progress(progress, ++j);
> 
> [...]
> 
> > This display_progress() call, however, doesn't seem to be necessary.
> > First, it counts all commits for a second time, resulting in the ~2x
> > difference compared to the actual number of commits, and then causing
> > my confusion.  Second, all what this loop is doing is setting a flag
> > in commits that were already looked up and parsed in the above loops.
> > IOW this loop is very fast, and the progress indicator jumps from
> > ~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
> > progress indicator at all.
> 
> You're right, I tried this patch on top:

[...] 

> And on a large repo with around 3 million commits the 3rd progress bar
> doesn't kick in.
> 
> But if I apply this on top:
> 
[...]
> 
> I.e. start the timer after 1/4 of a second instead of 1 second, I get
> that progress bar.
> 
> So I'm inclined to keep it. It just needs to be 4x the size before it's
> noticeably hanging for 1 second.

Just to clarify: are you worried about a 1 second hang in an approx. 12
million commit repository?  If so, then I'm unconvinced, that's not
even a blip on the radar, and the misleading numbers are far worse.

> That repo isn't all that big compared to what we've heard about out
> there, and inner loops like this have a tendency to accumulate some more
> code over time without a re-visit of why we weren't monitoring progress
> there.
> 
> But maybe we can fix the message. We say "Annotating commits in commit
> grap", not "Counting" or whatever. I was trying to find something that
> didn't imply that we were doing this once. One can annotate a thing more
> than once, but maybe ther's a better way to explain this...

IMO just remove it.
Ævar Arnfjörð Bjarmason Oct. 10, 2018, 10:37 p.m. UTC | #4
On Wed, Oct 10 2018, SZEDER Gábor wrote:

> On Wed, Oct 10, 2018 at 11:56:45PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Wed, Oct 10 2018, SZEDER Gábor wrote:
>
>> >>  	for (i = 0; i < oids->nr; i++) {
>> >> +		display_progress(progress, ++j);
>>
>> [...]
>>
>> > This display_progress() call, however, doesn't seem to be necessary.
>> > First, it counts all commits for a second time, resulting in the ~2x
>> > difference compared to the actual number of commits, and then causing
>> > my confusion.  Second, all what this loop is doing is setting a flag
>> > in commits that were already looked up and parsed in the above loops.
>> > IOW this loop is very fast, and the progress indicator jumps from
>> > ~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
>> > progress indicator at all.
>>
>> You're right, I tried this patch on top:
>
> [...]
>
>> And on a large repo with around 3 million commits the 3rd progress bar
>> doesn't kick in.
>>
>> But if I apply this on top:
>>
> [...]
>>
>> I.e. start the timer after 1/4 of a second instead of 1 second, I get
>> that progress bar.
>>
>> So I'm inclined to keep it. It just needs to be 4x the size before it's
>> noticeably hanging for 1 second.
>
> Just to clarify: are you worried about a 1 second hang in an approx. 12
> million commit repository?  If so, then I'm unconvinced, that's not
> even a blip on the radar, and the misleading numbers are far worse.

It's not a blip on the runtime, but the point of these progress bars in
general is so we don't have a UI where there's no UI differnce between
git hanging and just doing work in some tight loop in the background,
and even 1 second when you're watching something is noticeable if it
stalls.

Also it's 1 second on a server where I had 128G of RAM. I think even a
"trivial" flag change like this would very much change if e.g. the
system was under memory pressure or was swapping.

And as noted code like this tends to change over time, that loop might
get more expensive, so let's future proof by having all the loops over N
call the progress code.

When I wrote this the intent was just "report progress". So that it's
counting anything is just an implementation detail of how progress.c
works now.

This was the reference to Duy's patch, i.e. instead of spewing numbers
at the user here let's just render a spinner. Then we no longer need to
make judgement calls about which loop over N is expensive right now, and
which one isn't, and if any of them will result in reporting a 2N number
while the user might be more familiar with or expecting N.

>> That repo isn't all that big compared to what we've heard about out
>> there, and inner loops like this have a tendency to accumulate some more
>> code over time without a re-visit of why we weren't monitoring progress
>> there.
>>
>> But maybe we can fix the message. We say "Annotating commits in commit
>> grap", not "Counting" or whatever. I was trying to find something that
>> didn't imply that we were doing this once. One can annotate a thing more
>> than once, but maybe ther's a better way to explain this...
>
> IMO just remove it.
Ævar Arnfjörð Bjarmason Oct. 11, 2018, 5:52 p.m. UTC | #5
On Wed, Oct 10 2018, Ævar Arnfjörð Bjarmason wrote:

> On Wed, Oct 10 2018, SZEDER Gábor wrote:
>
>> On Wed, Oct 10, 2018 at 11:56:45PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>> On Wed, Oct 10 2018, SZEDER Gábor wrote:
>>
>>> >>  	for (i = 0; i < oids->nr; i++) {
>>> >> +		display_progress(progress, ++j);
>>>
>>> [...]
>>>
>>> > This display_progress() call, however, doesn't seem to be necessary.
>>> > First, it counts all commits for a second time, resulting in the ~2x
>>> > difference compared to the actual number of commits, and then causing
>>> > my confusion.  Second, all what this loop is doing is setting a flag
>>> > in commits that were already looked up and parsed in the above loops.
>>> > IOW this loop is very fast, and the progress indicator jumps from
>>> > ~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
>>> > progress indicator at all.
>>>
>>> You're right, I tried this patch on top:
>>
>> [...]
>>
>>> And on a large repo with around 3 million commits the 3rd progress bar
>>> doesn't kick in.
>>>
>>> But if I apply this on top:
>>>
>> [...]
>>>
>>> I.e. start the timer after 1/4 of a second instead of 1 second, I get
>>> that progress bar.
>>>
>>> So I'm inclined to keep it. It just needs to be 4x the size before it's
>>> noticeably hanging for 1 second.
>>
>> Just to clarify: are you worried about a 1 second hang in an approx. 12
>> million commit repository?  If so, then I'm unconvinced, that's not
>> even a blip on the radar, and the misleading numbers are far worse.
>
> It's not a blip on the runtime, but the point of these progress bars in
> general is so we don't have a UI where there's no UI differnce between
> git hanging and just doing work in some tight loop in the background,
> and even 1 second when you're watching something is noticeable if it
> stalls.
>
> Also it's 1 second on a server where I had 128G of RAM. I think even a
> "trivial" flag change like this would very much change if e.g. the
> system was under memory pressure or was swapping.
>
> And as noted code like this tends to change over time, that loop might
> get more expensive, so let's future proof by having all the loops over N
> call the progress code.
>
> When I wrote this the intent was just "report progress". So that it's
> counting anything is just an implementation detail of how progress.c
> works now.
>
> This was the reference to Duy's patch, i.e. instead of spewing numbers
> at the user here let's just render a spinner. Then we no longer need to
> make judgement calls about which loop over N is expensive right now, and
> which one isn't, and if any of them will result in reporting a 2N number
> while the user might be more familiar with or expecting N.
>
>>> That repo isn't all that big compared to what we've heard about out
>>> there, and inner loops like this have a tendency to accumulate some more
>>> code over time without a re-visit of why we weren't monitoring progress
>>> there.
>>>
>>> But maybe we can fix the message. We say "Annotating commits in commit
>>> grap", not "Counting" or whatever. I was trying to find something that
>>> didn't imply that we were doing this once. One can annotate a thing more
>>> than once, but maybe ther's a better way to explain this...
>>
>> IMO just remove it.

Hrm, actually reading this again your initial post says we end up with a
2x difference v.s. the number of commits, but it's actually 3x. The loop
that has a rather trivial runtime comparatively is the 3x, but the 2x
loop takes a notable amount of time. So e.g. on git.git:

    $ git rev-list --all | wc -l; ~/g/git/git commit-graph write
    166678
    Annotating commits in commit graph: 518463, done.
    Computing commit graph generation numbers: 100% (172685/172685), done.
Junio C Hamano Oct. 12, 2018, 6:09 a.m. UTC | #6
SZEDER Gábor <szeder.dev@gmail.com> writes:

>>  	for (i = 0; i < oids->nr; i++) {
>> +		display_progress(progress, ++j);
>>  		commit = lookup_commit(the_repository, &oids->list[i]);
>>  
>>  		if (commit && !parse_commit(commit))
>> @@ -611,19 +624,28 @@ static void close_reachable(struct packed_oid_list *oids)
>>  	}
>
> The above loops have already counted all the commits, and, more
> importantly, did all the hard work that takes time and makes the
> progress indicator useful.
>
>>  	for (i = 0; i < oids->nr; i++) {
>> +		display_progress(progress, ++j);
>
> This display_progress() call, however, doesn't seem to be necessary.
> First, it counts all commits for a second time, resulting in the ~2x
> difference compared to the actual number of commits, and then causing
> my confusion.  Second, all what this loop is doing is setting a flag
> in commits that were already looked up and parsed in the above loops.
> IOW this loop is very fast, and the progress indicator jumps from
> ~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
> progress indicator at all.

Makes sense.  If this second iteration were also time consuming,
then it probably is a good idea to split these into two separate
phases?  "Counting 1...N" followed by "Inspecting 1...N" or
something like that.  Of course, if the latter does not take much
time, then doing without any progress indicator is also fine.
Ævar Arnfjörð Bjarmason Oct. 12, 2018, 3:07 p.m. UTC | #7
On Fri, Oct 12 2018, Junio C Hamano wrote:

> SZEDER Gábor <szeder.dev@gmail.com> writes:
>
>>>  	for (i = 0; i < oids->nr; i++) {
>>> +		display_progress(progress, ++j);
>>>  		commit = lookup_commit(the_repository, &oids->list[i]);
>>>
>>>  		if (commit && !parse_commit(commit))
>>> @@ -611,19 +624,28 @@ static void close_reachable(struct packed_oid_list *oids)
>>>  	}
>>
>> The above loops have already counted all the commits, and, more
>> importantly, did all the hard work that takes time and makes the
>> progress indicator useful.
>>
>>>  	for (i = 0; i < oids->nr; i++) {
>>> +		display_progress(progress, ++j);
>>
>> This display_progress() call, however, doesn't seem to be necessary.
>> First, it counts all commits for a second time, resulting in the ~2x
>> difference compared to the actual number of commits, and then causing
>> my confusion.  Second, all what this loop is doing is setting a flag
>> in commits that were already looked up and parsed in the above loops.
>> IOW this loop is very fast, and the progress indicator jumps from
>> ~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
>> progress indicator at all.
>
> Makes sense.  If this second iteration were also time consuming,
> then it probably is a good idea to split these into two separate
> phases?  "Counting 1...N" followed by "Inspecting 1...N" or
> something like that.  Of course, if the latter does not take much
> time, then doing without any progress indicator is also fine.

That's a good point. Derrick: If the three loops in close_reachable()
had to be split up into different progress stages and given different
names what do you think they should be? Now it's "Annotating commits in
commit graph" for all of them.
Derrick Stolee Oct. 12, 2018, 3:12 p.m. UTC | #8
On 10/12/2018 11:07 AM, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Oct 12 2018, Junio C Hamano wrote:
>
>> Makes sense. If this second iteration were also time consuming,
>> then it probably is a good idea to split these into two separate
>> phases?  "Counting 1...N" followed by "Inspecting 1...N" or
>> something like that.  Of course, if the latter does not take much
>> time, then doing without any progress indicator is also fine.
> That's a good point. Derrick: If the three loops in close_reachable()
> had to be split up into different progress stages and given different
> names what do you think they should be? Now it's "Annotating commits in
> commit graph" for all of them.

The following is the best I can think of right now:

1. Loading known commits.
2. Expanding reachable commits.
3. Clearing commit marks.

-Stolee
SZEDER Gábor Oct. 15, 2018, 4:05 p.m. UTC | #9
On Thu, Oct 11, 2018 at 07:52:21PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 10 2018, Ævar Arnfjörð Bjarmason wrote:
> 
> > On Wed, Oct 10 2018, SZEDER Gábor wrote:
> >
> >> On Wed, Oct 10, 2018 at 11:56:45PM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>> On Wed, Oct 10 2018, SZEDER Gábor wrote:
> >>
> >>> >>  	for (i = 0; i < oids->nr; i++) {
> >>> >> +		display_progress(progress, ++j);
> >>>
> >>> [...]
> >>>
> >>> > This display_progress() call, however, doesn't seem to be necessary.
> >>> > First, it counts all commits for a second time, resulting in the ~2x
> >>> > difference compared to the actual number of commits, and then causing
> >>> > my confusion.  Second, all what this loop is doing is setting a flag
> >>> > in commits that were already looked up and parsed in the above loops.
> >>> > IOW this loop is very fast, and the progress indicator jumps from
> >>> > ~780k right to 1.5M, even on my tiny laptop, so it doesn't need a
> >>> > progress indicator at all.

> Hrm, actually reading this again your initial post says we end up with a
> 2x difference v.s. the number of commits, but it's actually 3x.

Well, it depends on how you create the commit-graph and on the repo as
well, I guess.  I run 'git commit-graph write --reachable' in a repo
created by 'git clone --single-branch ...', and in that case the
difference is only ~2x (the first loop in close_reachable() has as
many iterations as the number of refs).  If the repo were to contain
twice as many refs as commits, then the difference could be as high as
4x.

However, I think I might have noticed an other progress counting issue
as well, will get back to it later but first I have to get my numbers
straight.

> The loop
> that has a rather trivial runtime comparatively is the 3x, but the 2x
> loop takes a notable amount of time. So e.g. on git.git:
> 
>     $ git rev-list --all | wc -l; ~/g/git/git commit-graph write
>     166678
>     Annotating commits in commit graph: 518463, done.
>     Computing commit graph generation numbers: 100% (172685/172685), done.
SZEDER Gábor Oct. 15, 2018, 4:54 p.m. UTC | #10
On Mon, Sep 17, 2018 at 03:33:35PM +0000, Ævar Arnfjörð Bjarmason wrote:

> @@ -560,6 +563,9 @@ static int add_packed_commits(const struct object_id *oid,
>  	off_t offset = nth_packed_object_offset(pack, pos);
>  	struct object_info oi = OBJECT_INFO_INIT;
>  
> +	if (list->progress)
> +		display_progress(list->progress, ++list->progress_done);

Note that add_packed_commits() is used as a callback function for
for_each_object_in_pack() (with '--stdin-packs') or
for_each_packed_object() (no options), i.e. this will count the number
of objects, not commits:

  $ git rev-list --all |wc -l
  768524
  $ git rev-list --objects --all |wc -l
  6130295
  # '--count --objects' together didn't work as expected.
  $ time ~/src/git/git commit-graph write
  Finding commits for commit graph: 6130295, done.
  Annotating commits in commit graph: 2305572, done.
  Computing commit graph generation numbers: 100% (768524/768524), done.

(Now I also see the 3x difference in the "Annotating commits" counter
that you mentioned.)

I see two options:

  - Provide a different title for this progress counter, e.g.
    "Scanning objects for c-g", or "Processing objects...", or
    something else that says "objects" instead of "commits".

  - Move this condition and display_progress() call to the end of the
    function, so it will only count commits, not any other objects.
    (As far as I understand both for_each_object_in_pack() and
    for_each_packed_object() iterate in pack .idx order, i.e. it's
    essentially random.  This means that commit objects should be
    distributed evenly among other kinds of objects, so we don't have
    to worry about the counter stalling for a long stretch of
    consecutive non-commit objects.  At least in theory.)
SZEDER Gábor Nov. 19, 2018, 4:02 p.m. UTC | #11
Ping?

We are at -rc0, this progress output is a new feature since v2.19.0,
and the numbers shown are still way off.


On Mon, Oct 15, 2018 at 06:54:47PM +0200, SZEDER Gábor wrote:
> On Mon, Sep 17, 2018 at 03:33:35PM +0000, Ævar Arnfjörð Bjarmason wrote:
> 
> > @@ -560,6 +563,9 @@ static int add_packed_commits(const struct object_id *oid,
> >  	off_t offset = nth_packed_object_offset(pack, pos);
> >  	struct object_info oi = OBJECT_INFO_INIT;
> >  
> > +	if (list->progress)
> > +		display_progress(list->progress, ++list->progress_done);
> 
> Note that add_packed_commits() is used as a callback function for
> for_each_object_in_pack() (with '--stdin-packs') or
> for_each_packed_object() (no options), i.e. this will count the number
> of objects, not commits:
> 
>   $ git rev-list --all |wc -l
>   768524
>   $ git rev-list --objects --all |wc -l
>   6130295
>   # '--count --objects' together didn't work as expected.
>   $ time ~/src/git/git commit-graph write
>   Finding commits for commit graph: 6130295, done.
>   Annotating commits in commit graph: 2305572, done.
>   Computing commit graph generation numbers: 100% (768524/768524), done.
> 
> (Now I also see the 3x difference in the "Annotating commits" counter
> that you mentioned.)
> 
> I see two options:
> 
>   - Provide a different title for this progress counter, e.g.
>     "Scanning objects for c-g", or "Processing objects...", or
>     something else that says "objects" instead of "commits".
> 
>   - Move this condition and display_progress() call to the end of the
>     function, so it will only count commits, not any other objects.
>     (As far as I understand both for_each_object_in_pack() and
>     for_each_packed_object() iterate in pack .idx order, i.e. it's
>     essentially random.  This means that commit objects should be
>     distributed evenly among other kinds of objects, so we don't have
>     to worry about the counter stalling for a long stretch of
>     consecutive non-commit objects.  At least in theory.)
> 
> 
>
diff mbox series

Patch

diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 0bf0c48657..bc0fa9ba52 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -151,7 +151,7 @@  static int graph_write(int argc, const char **argv)
 		opts.obj_dir = get_object_directory();
 
 	if (opts.reachable) {
-		write_commit_graph_reachable(opts.obj_dir, opts.append);
+		write_commit_graph_reachable(opts.obj_dir, opts.append, 1);
 		return 0;
 	}
 
@@ -171,7 +171,8 @@  static int graph_write(int argc, const char **argv)
 	write_commit_graph(opts.obj_dir,
 			   pack_indexes,
 			   commit_hex,
-			   opts.append);
+			   opts.append,
+			   1);
 
 	string_list_clear(&lines, 0);
 	return 0;
diff --git a/builtin/gc.c b/builtin/gc.c
index 57069442b0..06ddd3aea5 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -646,7 +646,8 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 		clean_pack_garbage();
 
 	if (gc_write_commit_graph)
-		write_commit_graph_reachable(get_object_directory(), 0);
+		write_commit_graph_reachable(get_object_directory(), 0,
+					     !daemonized);
 
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
diff --git a/commit-graph.c b/commit-graph.c
index 8a1bec7b8a..2c5d996194 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -13,6 +13,7 @@ 
 #include "commit-graph.h"
 #include "object-store.h"
 #include "alloc.h"
+#include "progress.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
 #define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
@@ -548,6 +549,8 @@  struct packed_oid_list {
 	struct object_id *list;
 	int nr;
 	int alloc;
+	struct progress *progress;
+	int progress_done;
 };
 
 static int add_packed_commits(const struct object_id *oid,
@@ -560,6 +563,9 @@  static int add_packed_commits(const struct object_id *oid,
 	off_t offset = nth_packed_object_offset(pack, pos);
 	struct object_info oi = OBJECT_INFO_INIT;
 
+	if (list->progress)
+		display_progress(list->progress, ++list->progress_done);
+
 	oi.typep = &type;
 	if (packed_object_info(the_repository, pack, offset, &oi) < 0)
 		die(_("unable to get type of object %s"), oid_to_hex(oid));
@@ -587,12 +593,18 @@  static void add_missing_parents(struct packed_oid_list *oids, struct commit *com
 	}
 }
 
-static void close_reachable(struct packed_oid_list *oids)
+static void close_reachable(struct packed_oid_list *oids, int report_progress)
 {
 	int i;
 	struct commit *commit;
+	struct progress *progress = NULL;
+	int j = 0;
 
+	if (report_progress)
+		progress = start_delayed_progress(
+			_("Annotating commits in commit graph"), 0);
 	for (i = 0; i < oids->nr; i++) {
+		display_progress(progress, ++j);
 		commit = lookup_commit(the_repository, &oids->list[i]);
 		if (commit)
 			commit->object.flags |= UNINTERESTING;
@@ -604,6 +616,7 @@  static void close_reachable(struct packed_oid_list *oids)
 	 * closure.
 	 */
 	for (i = 0; i < oids->nr; i++) {
+		display_progress(progress, ++j);
 		commit = lookup_commit(the_repository, &oids->list[i]);
 
 		if (commit && !parse_commit(commit))
@@ -611,19 +624,28 @@  static void close_reachable(struct packed_oid_list *oids)
 	}
 
 	for (i = 0; i < oids->nr; i++) {
+		display_progress(progress, ++j);
 		commit = lookup_commit(the_repository, &oids->list[i]);
 
 		if (commit)
 			commit->object.flags &= ~UNINTERESTING;
 	}
+	stop_progress(&progress);
 }
 
-static void compute_generation_numbers(struct packed_commit_list* commits)
+static void compute_generation_numbers(struct packed_commit_list* commits, 
+				       int report_progress)
 {
 	int i;
 	struct commit_list *list = NULL;
+	struct progress *progress = NULL;
 
+	if (report_progress)
+		progress = start_progress(
+			_("Computing commit graph generation numbers"),
+			commits->nr);
 	for (i = 0; i < commits->nr; i++) {
+		display_progress(progress, i + 1);
 		if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY &&
 		    commits->list[i]->generation != GENERATION_NUMBER_ZERO)
 			continue;
@@ -655,6 +677,7 @@  static void compute_generation_numbers(struct packed_commit_list* commits)
 			}
 		}
 	}
+	stop_progress(&progress);
 }
 
 static int add_ref_to_list(const char *refname,
@@ -667,19 +690,20 @@  static int add_ref_to_list(const char *refname,
 	return 0;
 }
 
-void write_commit_graph_reachable(const char *obj_dir, int append)
+void write_commit_graph_reachable(const char *obj_dir, int append,
+				  int report_progress)
 {
 	struct string_list list;
 
 	string_list_init(&list, 1);
 	for_each_ref(add_ref_to_list, &list);
-	write_commit_graph(obj_dir, NULL, &list, append);
+	write_commit_graph(obj_dir, NULL, &list, append, report_progress);
 }
 
 void write_commit_graph(const char *obj_dir,
 			struct string_list *pack_indexes,
 			struct string_list *commit_hex,
-			int append)
+			int append, int report_progress)
 {
 	struct packed_oid_list oids;
 	struct packed_commit_list commits;
@@ -692,9 +716,12 @@  void write_commit_graph(const char *obj_dir,
 	int num_chunks;
 	int num_extra_edges;
 	struct commit_list *parent;
+	struct progress *progress = NULL;
 
 	oids.nr = 0;
 	oids.alloc = approximate_object_count() / 4;
+	oids.progress = NULL;
+	oids.progress_done = 0;
 
 	if (append) {
 		prepare_commit_graph_one(the_repository, obj_dir);
@@ -721,6 +748,11 @@  void write_commit_graph(const char *obj_dir,
 		int dirlen;
 		strbuf_addf(&packname, "%s/pack/", obj_dir);
 		dirlen = packname.len;
+		if (report_progress) {
+			oids.progress = start_delayed_progress(
+				_("Finding commits for commit graph"), 0);
+			oids.progress_done = 0;
+		}
 		for (i = 0; i < pack_indexes->nr; i++) {
 			struct packed_git *p;
 			strbuf_setlen(&packname, dirlen);
@@ -733,15 +765,21 @@  void write_commit_graph(const char *obj_dir,
 			for_each_object_in_pack(p, add_packed_commits, &oids, 0);
 			close_pack(p);
 		}
+		stop_progress(&oids.progress);
 		strbuf_release(&packname);
 	}
 
 	if (commit_hex) {
+		if (report_progress)
+			progress = start_delayed_progress(
+				_("Finding commits for commit graph"),
+				commit_hex->nr);
 		for (i = 0; i < commit_hex->nr; i++) {
 			const char *end;
 			struct object_id oid;
 			struct commit *result;
 
+			display_progress(progress, i + 1);
 			if (commit_hex->items[i].string &&
 			    parse_oid_hex(commit_hex->items[i].string, &oid, &end))
 				continue;
@@ -754,12 +792,18 @@  void write_commit_graph(const char *obj_dir,
 				oids.nr++;
 			}
 		}
+		stop_progress(&progress);
 	}
 
-	if (!pack_indexes && !commit_hex)
+	if (!pack_indexes && !commit_hex) {
+		if (report_progress)
+			oids.progress = start_delayed_progress(
+				_("Finding commits for commit graph"), 0);
 		for_each_packed_object(add_packed_commits, &oids, 0);
+		stop_progress(&oids.progress);
+	}
 
-	close_reachable(&oids);
+	close_reachable(&oids, report_progress);
 
 	QSORT(oids.list, oids.nr, commit_compare);
 
@@ -799,7 +843,7 @@  void write_commit_graph(const char *obj_dir,
 	if (commits.nr >= GRAPH_PARENT_MISSING)
 		die(_("too many commits to write graph"));
 
-	compute_generation_numbers(&commits);
+	compute_generation_numbers(&commits, report_progress);
 
 	graph_name = get_commit_graph_filename(obj_dir);
 	if (safe_create_leading_directories(graph_name))
diff --git a/commit-graph.h b/commit-graph.h
index eea62f8c0e..f50712a973 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -52,11 +52,12 @@  struct commit_graph {
 
 struct commit_graph *load_commit_graph_one(const char *graph_file);
 
-void write_commit_graph_reachable(const char *obj_dir, int append);
+void write_commit_graph_reachable(const char *obj_dir, int append,
+				  int report_progress);
 void write_commit_graph(const char *obj_dir,
 			struct string_list *pack_indexes,
 			struct string_list *commit_hex,
-			int append);
+			int append, int report_progress);
 
 int verify_commit_graph(struct repository *r, struct commit_graph *g);