[RFC] We should add a "git gc --auto" after "git clone" due to commit graph
diff mbox series

Message ID 87in2hgzin.fsf@evledraar.gmail.com
State New
Headers show
Series
  • [RFC] We should add a "git gc --auto" after "git clone" due to commit graph
Related show

Commit Message

Ævar Arnfjörð Bjarmason Oct. 4, 2018, 9:42 p.m. UTC
On Wed, Oct 03 2018, Ævar Arnfjörð Bjarmason wrote:

> Don't have time to patch this now, but thought I'd send a note / RFC
> about this.
>
> Now that we have the commit graph it's nice to be able to set
> e.g. core.commitGraph=true & gc.writeCommitGraph=true in ~/.gitconfig or
> /etc/gitconfig to apply them to all repos.
>
> But when I clone e.g. linux.git stuff like 'tag --contains' will be slow
> until whenever my first "gc" kicks in, which may be quite some time if
> I'm just using it passively.
>
> So we should make "git gc --auto" be run on clone, and change the
> need_to_gc() / cmd_gc() behavior so that we detect that the
> gc.writeCommitGraph=true setting is on, but we have no commit graph, and
> then just generate that without doing a full repack.
>
> As an aside such more granular "gc" would be nice for e.g. pack-refs
> too. It's possible for us to just have one pack, but to have 100k loose
> refs.
>
> It might also be good to have some gc.autoDetachOnClone option and have
> it false by default, so we don't have a race condition where "clone
> linux && git -C linux tag --contains" is slow because the graph hasn't
> been generated yet, and generating the graph initially doesn't take that
> long compared to the time to clone a large repo (and on a small one it
> won't matter either way).
>
> I was going to say "also for midx", but of course after clone we have
> just one pack, so I can't imagine us needing this. But I can see us
> having other such optional side-indexes in the future generated by gc,
> and they'd also benefit from this.

I don't have time to polish this up for submission now, but here's a WIP
patch that implements this, highlights:

 * There's a gc.clone.autoDetach=false default setting which overrides
   gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
   --cloning option to indicate this).

 * A clone of say git.git with gc.writeCommitGraph=true looks like:

   [...]
   Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
   Resolving deltas: 100% (188947/188947), done.
   Computing commit graph generation numbers: 100% (55210/55210), done.

 * The 'git gc --auto' command also knows to (only) run the commit-graph
   (and space is left for future optimization steps) if general GC isn't
   needed, but we need "optimization":

   $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
   Annotating commits in commit graph: 341229, done.
   Computing commit graph generation numbers: 100% (165969/165969), done.
   $

 * The patch to gc.c looks less scary with -w, most of it is indenting
   the existing pack-refs etc. with a "!auto_gc || should_gc" condition.

 * I added a commit_graph_exists() exists function and only care if I
   get ENOENT for the purposes of this gc mode. This would need to be
   tweaked for the incremental mode Derrick talks about, but if we just
   set "should_optimize" that'll also work as far as gc --auto is
   concerned (e.g. on fetch, am etc.)

Comments

Derrick Stolee Oct. 5, 2018, 12:05 p.m. UTC | #1
On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
> I don't have time to polish this up for submission now, but here's a WIP
> patch that implements this, highlights:
>
>   * There's a gc.clone.autoDetach=false default setting which overrides
>     gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>     --cloning option to indicate this).

I'll repeat that it could make sense to do the same thing on clone _and_ 
fetch. Perhaps a "--post-fetch" flag would be good here to communicate 
that we just downloaded a pack from a remote.

>   * A clone of say git.git with gc.writeCommitGraph=true looks like:
>
>     [...]
>     Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
>     Resolving deltas: 100% (188947/188947), done.
>     Computing commit graph generation numbers: 100% (55210/55210), done.

This looks like good UX. Thanks for the progress here!

>   * The 'git gc --auto' command also knows to (only) run the commit-graph
>     (and space is left for future optimization steps) if general GC isn't
>     needed, but we need "optimization":
>
>     $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>     Annotating commits in commit graph: 341229, done.
>     Computing commit graph generation numbers: 100% (165969/165969), done.
>     $

Will this also trigger a full commit-graph rewrite on every 'git commit' 
command? Or is there some way we can compute the staleness of the 
commit-graph in order to only update if we get too far ahead? 
Previously, this was solved by relying on the auto-GC threshold.

>   * The patch to gc.c looks less scary with -w, most of it is indenting
>     the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>
>   * I added a commit_graph_exists() exists function and only care if I
>     get ENOENT for the purposes of this gc mode. This would need to be
>     tweaked for the incremental mode Derrick talks about, but if we just
>     set "should_optimize" that'll also work as far as gc --auto is
>     concerned (e.g. on fetch, am etc.)

The incremental mode would operate the same as split-index, which means 
we will still look for .git/objects/info/commit-graph. That file may 
point us to more files.

> +int commit_graph_exists(const char *graph_file)
> +{
> +	struct stat st;
> +	if (stat(graph_file, &st)) {
> +		if (errno == ENOENT)
> +			return 0;
> +		else
> +			return -1;
> +	}
> +	return 1;
> +}
> +

This method serves a very similar purpose to 
generation_numbers_enabled(), except your method only cares about the 
file existing. It ignores information like `core.commitGraph`, which 
should keep us from doing anything with the commit-graph file if false.

Nothing about your method is specific to the commit-graph file, since 
you provide a filename as a parameter. It could easily be "int 
file_exists(const char *filename)".

Thanks,

-Stolee
Ævar Arnfjörð Bjarmason Oct. 5, 2018, 1:05 p.m. UTC | #2
On Fri, Oct 05 2018, Derrick Stolee wrote:

> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>> I don't have time to polish this up for submission now, but here's a WIP
>> patch that implements this, highlights:
>>
>>   * There's a gc.clone.autoDetach=false default setting which overrides
>>     gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>     --cloning option to indicate this).
>
> I'll repeat that it could make sense to do the same thing on clone
> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
> communicate that we just downloaded a pack from a remote.

I don't think that makes sense, but let's talk about why, because maybe
I've missed something, you're certainly more familiar with the
commit-graph than I am.

The reason to do it on clone as a special-case or when the file is
missing, is because we know the file is desired (via the GC config), and
presumably is expected to help performance, and we have 0% of it. So by
going from 0% to 100% on clone we'll get fast --contains and other
goodies the graph helps with.

But when we're doing a fetch, or really anything else that runs "git gc
--auto" we can safely assume that we have a recent enough graph, because
it will have been run whenever auto-gc kicked in.

I.e.:

    # Slow, if we assume background forked commit-graph generation
    # (which I'm avoiding)
    git clone x && cd x && git tag --contains
    # Fast enough, since we have an existing commit-graph
    cd x && git fetch && git tag --contains

I *do* think it might make sense to in general split off parts of "gc
--auto" that we'd like to be more aggressive about, simply because the
ratio of how long it takes to do, and how much it helps with performance
makes more sense than a full repack, which is what the current heuristic
is based on.

And maybe when we run in that mode we should run in the foreground, but
I don't see why git-fetch should be a special case there, and in this
regard, the gc.clone.autoDetach=false setting I've made doesn't make
much sence. I.e. maybe we should also skip forking to the background in
such a mode when we trigger such a "mini gc" via git-commit or whatever.

>>   * A clone of say git.git with gc.writeCommitGraph=true looks like:
>>
>>     [...]
>>     Receiving objects: 100% (255262/255262), 100.49 MiB | 17.78 MiB/s, done.
>>     Resolving deltas: 100% (188947/188947), done.
>>     Computing commit graph generation numbers: 100% (55210/55210), done.
>
> This looks like good UX. Thanks for the progress here!
>
>>   * The 'git gc --auto' command also knows to (only) run the commit-graph
>>     (and space is left for future optimization steps) if general GC isn't
>>     needed, but we need "optimization":
>>
>>     $ rm .git/objects/info/commit-graph; ~/g/git/git --exec-path=$PWD -c gc.writeCommitGraph=true -c gc.autoDetach=false gc --auto;
>>     Annotating commits in commit graph: 341229, done.
>>     Computing commit graph generation numbers: 100% (165969/165969), done.
>>     $
>
> Will this also trigger a full commit-graph rewrite on every 'git
> commit' command?

Nope, because "git commit" can safely be assumed to have some
commit-graph anyway, and I'm just special casing the case where it
doesn't exist.

But if it doesn't exist and you do a "git commit" then "gc --auto" will
be run, and we'll fork to the background and generate it...

>  Or is there some way we can compute the staleness of
> the commit-graph in order to only update if we get too far ahead?
> Previously, this was solved by relying on the auto-GC threshold.

So re the "I don't think that makes sense..." at the start of my E-Mail,
isn't it fine to rely on the default thresholds here, or should we be
more aggressive?

>>   * The patch to gc.c looks less scary with -w, most of it is indenting
>>     the existing pack-refs etc. with a "!auto_gc || should_gc" condition.
>>
>>   * I added a commit_graph_exists() exists function and only care if I
>>     get ENOENT for the purposes of this gc mode. This would need to be
>>     tweaked for the incremental mode Derrick talks about, but if we just
>>     set "should_optimize" that'll also work as far as gc --auto is
>>     concerned (e.g. on fetch, am etc.)
>
> The incremental mode would operate the same as split-index, which
> means we will still look for .git/objects/info/commit-graph. That file
> may point us to more files.

Ah!

>> +int commit_graph_exists(const char *graph_file)
>> +{
>> +	struct stat st;
>> +	if (stat(graph_file, &st)) {
>> +		if (errno == ENOENT)
>> +			return 0;
>> +		else
>> +			return -1;
>> +	}
>> +	return 1;
>> +}
>> +
>
> This method serves a very similar purpose to
> generation_numbers_enabled(), except your method only cares about the
> file existing. It ignores information like `core.commitGraph`, which
> should keep us from doing anything with the commit-graph file if
> false.
>
> Nothing about your method is specific to the commit-graph file, since
> you provide a filename as a parameter. It could easily be "int
> file_exists(const char *filename)".

I was being paranoid about not doing this if it didn't exist but it was
something else than ENOENT (e.g. permission error?), but in retrospect
that's silly. I'll drop this helper and just use file_exists().
Derrick Stolee Oct. 5, 2018, 1:45 p.m. UTC | #3
On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Oct 05 2018, Derrick Stolee wrote:
>
>> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>> I don't have time to polish this up for submission now, but here's a WIP
>>> patch that implements this, highlights:
>>>
>>>    * There's a gc.clone.autoDetach=false default setting which overrides
>>>      gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>>      --cloning option to indicate this).
>> I'll repeat that it could make sense to do the same thing on clone
>> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
>> communicate that we just downloaded a pack from a remote.
> I don't think that makes sense, but let's talk about why, because maybe
> I've missed something, you're certainly more familiar with the
> commit-graph than I am.
>
> The reason to do it on clone as a special-case or when the file is
> missing, is because we know the file is desired (via the GC config), and
> presumably is expected to help performance, and we have 0% of it. So by
> going from 0% to 100% on clone we'll get fast --contains and other
> goodies the graph helps with.
>
> But when we're doing a fetch, or really anything else that runs "git gc
> --auto" we can safely assume that we have a recent enough graph, because
> it will have been run whenever auto-gc kicked in.
>
> I.e.:
>
>      # Slow, if we assume background forked commit-graph generation
>      # (which I'm avoiding)
>      git clone x && cd x && git tag --contains
>      # Fast enough, since we have an existing commit-graph
>      cd x && git fetch && git tag --contains
>
> I *do* think it might make sense to in general split off parts of "gc
> --auto" that we'd like to be more aggressive about, simply because the
> ratio of how long it takes to do, and how much it helps with performance
> makes more sense than a full repack, which is what the current heuristic
> is based on.
>
> And maybe when we run in that mode we should run in the foreground, but
> I don't see why git-fetch should be a special case there, and in this
> regard, the gc.clone.autoDetach=false setting I've made doesn't make
> much sence. I.e. maybe we should also skip forking to the background in
> such a mode when we trigger such a "mini gc" via git-commit or whatever.

My misunderstanding was that your proposed change to gc computes the 
commit-graph in either of these two cases:

(1) The auto-GC threshold is met.

(2) There is no commit-graph file.

And what I hope to have instead of (2) is (3):

(3) The commit-graph file is "sufficiently behind" the tip refs.

This condition is intentionally vague at the moment. It could be that we 
hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a 
pack, and it probably contains a lot of new commits") or we could create 
some more complicated condition based on counting reachable commits with 
infinite generation number (the number of commits not in the 
commit-graph file).

I like that you are moving forward to make the commit-graph be written 
more frequently, but I'm trying to push us in a direction of writing it 
even more often than your proposed strategy. We should avoid creating 
too many orthogonal conditions that trigger the commit-graph write, 
which is why I'm pushing on your design here.

Anyone else have thoughts on this direction?

Thanks,

-Stolee
Ævar Arnfjörð Bjarmason Oct. 5, 2018, 2:04 p.m. UTC | #4
On Fri, Oct 05 2018, Derrick Stolee wrote:

> On 10/5/2018 9:05 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Oct 05 2018, Derrick Stolee wrote:
>>
>>> On 10/4/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> I don't have time to polish this up for submission now, but here's a WIP
>>>> patch that implements this, highlights:
>>>>
>>>>    * There's a gc.clone.autoDetach=false default setting which overrides
>>>>      gc.autoDetach if 'git gc --auto' is run via git-clone (we just pass a
>>>>      --cloning option to indicate this).
>>> I'll repeat that it could make sense to do the same thing on clone
>>> _and_ fetch. Perhaps a "--post-fetch" flag would be good here to
>>> communicate that we just downloaded a pack from a remote.
>> I don't think that makes sense, but let's talk about why, because maybe
>> I've missed something, you're certainly more familiar with the
>> commit-graph than I am.
>>
>> The reason to do it on clone as a special-case or when the file is
>> missing, is because we know the file is desired (via the GC config), and
>> presumably is expected to help performance, and we have 0% of it. So by
>> going from 0% to 100% on clone we'll get fast --contains and other
>> goodies the graph helps with.
>>
>> But when we're doing a fetch, or really anything else that runs "git gc
>> --auto" we can safely assume that we have a recent enough graph, because
>> it will have been run whenever auto-gc kicked in.
>>
>> I.e.:
>>
>>      # Slow, if we assume background forked commit-graph generation
>>      # (which I'm avoiding)
>>      git clone x && cd x && git tag --contains
>>      # Fast enough, since we have an existing commit-graph
>>      cd x && git fetch && git tag --contains
>>
>> I *do* think it might make sense to in general split off parts of "gc
>> --auto" that we'd like to be more aggressive about, simply because the
>> ratio of how long it takes to do, and how much it helps with performance
>> makes more sense than a full repack, which is what the current heuristic
>> is based on.
>>
>> And maybe when we run in that mode we should run in the foreground, but
>> I don't see why git-fetch should be a special case there, and in this
>> regard, the gc.clone.autoDetach=false setting I've made doesn't make
>> much sence. I.e. maybe we should also skip forking to the background in
>> such a mode when we trigger such a "mini gc" via git-commit or whatever.
>
> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
>
> (1) The auto-GC threshold is met.
>
> (2) There is no commit-graph file.
>
> And what I hope to have instead of (2) is (3):
>
> (3) The commit-graph file is "sufficiently behind" the tip refs.
>
> This condition is intentionally vague at the moment. It could be that
> we hint that (3) holds by saying "--post-fetch" (i.e. "We just
> downloaded a pack, and it probably contains a lot of new commits") or
> we could create some more complicated condition based on counting
> reachable commits with infinite generation number (the number of
> commits not in the commit-graph file).
>
> I like that you are moving forward to make the commit-graph be written
> more frequently, but I'm trying to push us in a direction of writing
> it even more often than your proposed strategy. We should avoid
> creating too many orthogonal conditions that trigger the commit-graph
> write, which is why I'm pushing on your design here.
>
> Anyone else have thoughts on this direction?

Ah. I see. I think #3 makes perfect sense, but probably makes sense to
do as a follow-up, or maybe you'd like to stick a patch on top of the
series I have when I send it. I don't know how to write the "I'm not
quite happy about the commit graph" code :)

What I will do is refactor gc.c a bit and leave it in a state where it's
going to be really easy to change the existing "we have no commit graph,
and thus should do the optimization step" to have some more complex
condition instead of "we have no commit graph", i.e. your "we just
grabbed a lot of data".

Also, I'll drop the gc.clone.autoDetach=false setting and name it
something more general. maybe gc.AutoDetachOnBigOptimization=false?
Anyway something more generic so that "clone" will always pass in some
option saying "expect a large % commit graph update" (100% in its case),
and then in "fetch" we could have some detection of how big what we just
got from the server is, and do the same.

This seems to be to be the most general thing that would make sense, and
could also be extended e.g. to "git commit" and other users of gc
--auto. If I started with a README file in an empty repo, and then made
a commit where I added 1 million files all in one commit, in which case
we'd (depending on that setting) also block in the foreground and
generate the commit-graph.
Jeff King Oct. 5, 2018, 7:21 p.m. UTC | #5
On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:

> My misunderstanding was that your proposed change to gc computes the
> commit-graph in either of these two cases:
> 
> (1) The auto-GC threshold is met.
> 
> (2) There is no commit-graph file.
> 
> And what I hope to have instead of (2) is (3):
> 
> (3) The commit-graph file is "sufficiently behind" the tip refs.
> 
> This condition is intentionally vague at the moment. It could be that we
> hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a
> pack, and it probably contains a lot of new commits") or we could create
> some more complicated condition based on counting reachable commits with
> infinite generation number (the number of commits not in the commit-graph
> file).
> 
> I like that you are moving forward to make the commit-graph be written more
> frequently, but I'm trying to push us in a direction of writing it even more
> often than your proposed strategy. We should avoid creating too many
> orthogonal conditions that trigger the commit-graph write, which is why I'm
> pushing on your design here.
> 
> Anyone else have thoughts on this direction?

Yes, I think measuring "sufficiently behind" is the right thing.
Everything else is a proxy or heuristic, and will run into corner cases.
E.g., I have some small number of objects and then do a huge fetch, and
now my commit-graph only covers 5% of what's available.

We know how many objects are in the graph already. And it's not too
expensive to get the number of objects in the repository. We can do the
same sampling for loose objects that "gc --auto" does, and counting
packed objects just involves opening up the .idx files (that can be slow
if you have a ton of packs, but you'd want to either repack or use a
.midx in that case anyway, either of which would help here).

So can we really just take (total_objects - commit_graph_objects) and
compare it to some threshold?

-Peff
Derrick Stolee Oct. 5, 2018, 7:41 p.m. UTC | #6
On 10/5/2018 3:21 PM, Jeff King wrote:
> On Fri, Oct 05, 2018 at 09:45:47AM -0400, Derrick Stolee wrote:
>
>> My misunderstanding was that your proposed change to gc computes the
>> commit-graph in either of these two cases:
>>
>> (1) The auto-GC threshold is met.
>>
>> (2) There is no commit-graph file.
>>
>> And what I hope to have instead of (2) is (3):
>>
>> (3) The commit-graph file is "sufficiently behind" the tip refs.
>>
>> This condition is intentionally vague at the moment. It could be that we
>> hint that (3) holds by saying "--post-fetch" (i.e. "We just downloaded a
>> pack, and it probably contains a lot of new commits") or we could create
>> some more complicated condition based on counting reachable commits with
>> infinite generation number (the number of commits not in the commit-graph
>> file).
>>
>> I like that you are moving forward to make the commit-graph be written more
>> frequently, but I'm trying to push us in a direction of writing it even more
>> often than your proposed strategy. We should avoid creating too many
>> orthogonal conditions that trigger the commit-graph write, which is why I'm
>> pushing on your design here.
>>
>> Anyone else have thoughts on this direction?
> Yes, I think measuring "sufficiently behind" is the right thing.
> Everything else is a proxy or heuristic, and will run into corner cases.
> E.g., I have some small number of objects and then do a huge fetch, and
> now my commit-graph only covers 5% of what's available.
>
> We know how many objects are in the graph already. And it's not too
> expensive to get the number of objects in the repository. We can do the
> same sampling for loose objects that "gc --auto" does, and counting
> packed objects just involves opening up the .idx files (that can be slow
> if you have a ton of packs, but you'd want to either repack or use a
> .midx in that case anyway, either of which would help here).
>
> So can we really just take (total_objects - commit_graph_objects) and
> compare it to some threshold?

The commit-graph only stores the number of _commits_, not total objects.

Azure Repos' commit-graph does store the total number of objects, and 
that is how we trigger updating the graph, so it is not unreasonable to 
use that as a heuristic.

Thanks,

-Stolee
Jeff King Oct. 5, 2018, 7:47 p.m. UTC | #7
On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:

> > So can we really just take (total_objects - commit_graph_objects) and
> > compare it to some threshold?
> 
> The commit-graph only stores the number of _commits_, not total objects.

Oh, right, of course. That does throw a monkey wrench in that line of
thought. ;)

There's unfortunately not a fast way of doing that. One option would be
to keep a counter of "ungraphed commit objects", and have callers update
it. Anybody admitting a pack via index-pack or unpack-objects can easily
get this information. Commands like fast-import can do likewise, and
"git commit" obviously increments it by one.

I'm not excited about adding a new global on-disk data structure (and
the accompanying lock).

-Peff
Derrick Stolee Oct. 5, 2018, 8 p.m. UTC | #8
On 10/5/2018 3:47 PM, Jeff King wrote:
> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>>> So can we really just take (total_objects - commit_graph_objects) and
>>> compare it to some threshold?
>> The commit-graph only stores the number of _commits_, not total objects.
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers update
> it. Anybody admitting a pack via index-pack or unpack-objects can easily
> get this information. Commands like fast-import can do likewise, and
> "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

If we want, then we can add an optional chunk to the commit-graph file 
that stores the object count.
Ævar Arnfjörð Bjarmason Oct. 5, 2018, 8:01 p.m. UTC | #9
On Fri, Oct 05 2018, Jeff King wrote:

> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>> > So can we really just take (total_objects - commit_graph_objects) and
>> > compare it to some threshold?
>>
>> The commit-graph only stores the number of _commits_, not total objects.
>
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers update
> it. Anybody admitting a pack via index-pack or unpack-objects can easily
> get this information. Commands like fast-import can do likewise, and
> "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

You don't really need a new global datastructure to solve this
problem. It would be sufficient to have git-gc itself write out a 4-line
text file after it runs saying how many tags, commits, trees and blobs
it found on its last run.

You can then fuzzily compare object counts v.s. commit counts for the
purposes of deciding whether something like the commit-graph needs to be
updated, while assuming that whatever new data you have has similar
enough ratios of those as your existing data.

That's an assumption that'll hold well enough for big repos where this
matters the most, and who tend to grow in fairly uniform ways as far as
their object type ratios go.

Databases like MySQL, PostgreSQL etc. pull similar tricks with their
fuzzy table statistics.
Jeff King Oct. 5, 2018, 8:02 p.m. UTC | #10
On Fri, Oct 05, 2018 at 04:00:12PM -0400, Derrick Stolee wrote:

> On 10/5/2018 3:47 PM, Jeff King wrote:
> > On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
> > 
> > > > So can we really just take (total_objects - commit_graph_objects) and
> > > > compare it to some threshold?
> > > The commit-graph only stores the number of _commits_, not total objects.
> > Oh, right, of course. That does throw a monkey wrench in that line of
> > thought. ;)
> > 
> > There's unfortunately not a fast way of doing that. One option would be
> > to keep a counter of "ungraphed commit objects", and have callers update
> > it. Anybody admitting a pack via index-pack or unpack-objects can easily
> > get this information. Commands like fast-import can do likewise, and
> > "git commit" obviously increments it by one.
> > 
> > I'm not excited about adding a new global on-disk data structure (and
> > the accompanying lock).
> 
> If we want, then we can add an optional chunk to the commit-graph file that
> stores the object count.

Yeah, that's probably a saner route, since we have to do the write then
anyway.

-Peff
Jeff King Oct. 5, 2018, 8:09 p.m. UTC | #11
On Fri, Oct 05, 2018 at 10:01:31PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > There's unfortunately not a fast way of doing that. One option would be
> > to keep a counter of "ungraphed commit objects", and have callers update
> > it. Anybody admitting a pack via index-pack or unpack-objects can easily
> > get this information. Commands like fast-import can do likewise, and
> > "git commit" obviously increments it by one.
> >
> > I'm not excited about adding a new global on-disk data structure (and
> > the accompanying lock).
> 
> You don't really need a new global datastructure to solve this
> problem. It would be sufficient to have git-gc itself write out a 4-line
> text file after it runs saying how many tags, commits, trees and blobs
> it found on its last run.
>
> You can then fuzzily compare object counts v.s. commit counts for the
> purposes of deciding whether something like the commit-graph needs to be
> updated, while assuming that whatever new data you have has similar
> enough ratios of those as your existing data.

I think this is basically the same thing as Stolee's suggestion to keep
the total object count in the commit-graph file. The only difference is
here is that we know the actual ratio of commit to blobs for this
particular repository. But I don't think we need to know that. As you
said, this is fuzzy anyway, so a single number for "update the graph
when there are N new objects" is likely enough.

If you had a repository with an unusually large tree, you'd end up
rebuilding the graph more often. But I think it would probably be OK, as
we're primarily trying not to waste time doing a graph rebuild when
we've only done a small amount of other work. But if we just shoved a
ton of objects through index-pack then we did a lot of work, whether
those were commit objects or not.

-Peff

Patch
diff mbox series

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1546833213..5759fbb067 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1621,7 +1621,19 @@  gc.autoPackLimit::

 gc.autoDetach::
 	Make `git gc --auto` return immediately and run in background
-	if the system supports it. Default is true.
+	if the system supports it. Default is true. Overridden by
+	`gc.clone.autoDetach` when running linkgit:git-clone[1].
+
+gc.clone.autoDetach::
+	Make `git gc --auto` return immediately and run in background
+	if the system supports it when run via
+	linkgit:git-clone[1]. Default is false.
++
+The reason this defaults to false is because the only time we'll have
+work to do after a 'git clone' is if something like
+`gc.writeCommitGraph` is true, in that case we'd like to compute the
+optimized file before returning, so that say commands that benefit
+from commit graph aren't slow until it's generated in the background.

 gc.bigPackThreshold::
 	If non-zero, all packs larger than this limit are kept when
diff --git a/builtin/clone.c b/builtin/clone.c
index 15b142d646..824c130ba5 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -897,6 +897,8 @@  int cmd_clone(int argc, const char **argv, const char *prefix)
 	struct remote *remote;
 	int err = 0, complete_refs_before_fetch = 1;
 	int submodule_progress;
+	const char *argv_gc_auto[]       = {"gc", "--auto", "--cloning", NULL};
+	const char *argv_gc_auto_quiet[] = {"gc", "--auto", "--cloning", "--quiet", NULL};

 	struct refspec rs = REFSPEC_INIT_FETCH;
 	struct argv_array ref_prefixes = ARGV_ARRAY_INIT;
@@ -1245,5 +1247,11 @@  int cmd_clone(int argc, const char **argv, const char *prefix)

 	refspec_clear(&rs);
 	argv_array_clear(&ref_prefixes);
+
+	if (0 <= option_verbosity)
+		run_command_v_opt_cd_env(argv_gc_auto, RUN_GIT_CMD, git_dir, NULL);
+	else
+		run_command_v_opt_cd_env(argv_gc_auto_quiet, RUN_GIT_CMD, git_dir, NULL);
+
 	return err;
 }
diff --git a/builtin/gc.c b/builtin/gc.c
index 6591ddbe83..27be03890a 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -43,6 +43,7 @@  static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int gc_write_commit_graph;
 static int detach_auto = 1;
+static int detach_clone_auto = 0;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
 static const char *prune_expire = "2.weeks.ago";
@@ -133,6 +134,7 @@  static void gc_config(void)
 	git_config_get_int("gc.autopacklimit", &gc_auto_pack_limit);
 	git_config_get_bool("gc.writecommitgraph", &gc_write_commit_graph);
 	git_config_get_bool("gc.autodetach", &detach_auto);
+	git_config_get_bool("gc.clone.autodetach", &detach_clone_auto);
 	git_config_get_expiry("gc.pruneexpire", &prune_expire);
 	git_config_get_expiry("gc.worktreepruneexpire", &prune_worktrees_expire);
 	git_config_get_expiry("gc.logexpiry", &gc_log_expire);
@@ -157,9 +159,6 @@  static int too_many_loose_objects(void)
 	int num_loose = 0;
 	int needed = 0;

-	if (gc_auto_threshold <= 0)
-		return 0;
-
 	dir = opendir(git_path("objects/17"));
 	if (!dir)
 		return 0;
@@ -369,10 +368,21 @@  static int need_to_gc(void)
 		return 0;

 	if (run_hook_le(NULL, "pre-auto-gc", NULL))
-		return 0;
+		return -1;
 	return 1;
 }

+static int need_to_optimize(void) {
+	if (gc_write_commit_graph) {
+		char *obj_dir = get_object_directory();
+		char *graph_name = get_commit_graph_filename(obj_dir);
+
+		if (commit_graph_exists(graph_name) == 0) /* ENOENT */
+			return 1;
+	}
+	return 0;
+}
+
 /* return NULL on success, else hostname running the gc */
 static const char *lock_repo_for_gc(int force, pid_t* ret_pid)
 {
@@ -491,6 +501,7 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int aggressive = 0;
 	int auto_gc = 0;
+	int cloning = 0;
 	int quiet = 0;
 	int force = 0;
 	const char *name;
@@ -498,6 +509,8 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 	int daemonized = 0;
 	int keep_base_pack = -1;
 	timestamp_t dummy;
+	int should_gc;
+	int should_optimize;

 	struct option builtin_gc_options[] = {
 		OPT__QUIET(&quiet, N_("suppress progress reporting")),
@@ -507,6 +520,8 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 		OPT_BOOL(0, "aggressive", &aggressive, N_("be more thorough (increased runtime)")),
 		OPT_BOOL_F(0, "auto", &auto_gc, N_("enable auto-gc mode"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_BOOL_F(0, "cloning", &cloning, N_("enable cloning mode"),
+			   PARSE_OPT_NOCOMPLETE),
 		OPT_BOOL_F(0, "force", &force,
 			   N_("force running gc even if there may be another gc running"),
 			   PARSE_OPT_NOCOMPLETE),
@@ -555,22 +570,27 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 		/*
 		 * Auto-gc should be least intrusive as possible.
 		 */
-		if (!need_to_gc())
+		should_gc = need_to_gc();
+		if (should_gc == -1)
+			return 0;
+		should_optimize = need_to_optimize();
+		if (!should_gc && !should_optimize)
 			return 0;
-		if (!quiet) {
+		if (!quiet && should_gc) {
 			if (detach_auto)
 				fprintf(stderr, _("Auto packing the repository in background for optimum performance.\n"));
 			else
 				fprintf(stderr, _("Auto packing the repository for optimum performance.\n"));
 			fprintf(stderr, _("See \"git help gc\" for manual housekeeping.\n"));
 		}
-		if (detach_auto) {
+		if (detach_auto &&
+		    (!cloning || (cloning && detach_clone_auto))) {
 			if (report_last_gc_error())
 				return -1;

 			if (lock_repo_for_gc(force, &pid))
 				return 0;
-			if (gc_before_repack())
+			if (should_gc && gc_before_repack())
 				return -1;
 			delete_tempfile(&pidfile);

@@ -611,45 +631,48 @@  int cmd_gc(int argc, const char **argv, const char *prefix)
 		atexit(process_log_file_at_exit);
 	}

-	if (gc_before_repack())
-		return -1;
-
-	if (!repository_format_precious_objects) {
-		close_all_packs(the_repository->objects);
-		if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, repack.argv[0]);
-
-		if (prune_expire) {
-			argv_array_push(&prune, prune_expire);
-			if (quiet)
-				argv_array_push(&prune, "--no-progress");
-			if (repository_format_partial_clone)
-				argv_array_push(&prune,
-						"--exclude-promisor-objects");
-			if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
-				return error(FAILED_RUN, prune.argv[0]);
+	if (!auto_gc || should_gc) {
+		if (gc_before_repack())
+			return -1;
+
+		if (!repository_format_precious_objects) {
+			close_all_packs(the_repository->objects);
+			if (run_command_v_opt(repack.argv, RUN_GIT_CMD))
+				return error(FAILED_RUN, repack.argv[0]);
+
+			if (prune_expire) {
+				argv_array_push(&prune, prune_expire);
+				if (quiet)
+					argv_array_push(&prune, "--no-progress");
+				if (repository_format_partial_clone)
+					argv_array_push(&prune,
+							"--exclude-promisor-objects");
+				if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
+					return error(FAILED_RUN, prune.argv[0]);
+			}
 		}
-	}

-	if (prune_worktrees_expire) {
-		argv_array_push(&prune_worktrees, prune_worktrees_expire);
-		if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
-			return error(FAILED_RUN, prune_worktrees.argv[0]);
-	}

-	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
-		return error(FAILED_RUN, rerere.argv[0]);
+		if (prune_worktrees_expire) {
+			argv_array_push(&prune_worktrees, prune_worktrees_expire);
+			if (run_command_v_opt(prune_worktrees.argv, RUN_GIT_CMD))
+				return error(FAILED_RUN, prune_worktrees.argv[0]);
+		}

-	report_garbage = report_pack_garbage;
-	reprepare_packed_git(the_repository);
-	if (pack_garbage.nr > 0)
-		clean_pack_garbage();
+		if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
+			return error(FAILED_RUN, rerere.argv[0]);
+
+		report_garbage = report_pack_garbage;
+		reprepare_packed_git(the_repository);
+		if (pack_garbage.nr > 0)
+			clean_pack_garbage();
+	}

 	if (gc_write_commit_graph)
 		write_commit_graph_reachable(get_object_directory(), 0,
 					     !quiet && !daemonized);

-	if (auto_gc && too_many_loose_objects())
+	if (auto_gc && should_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));

diff --git a/commit-graph.c b/commit-graph.c
index 5908bd4e34..a4a7c94cec 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -57,6 +57,18 @@  static struct commit_graph *alloc_commit_graph(void)
 	return g;
 }

+int commit_graph_exists(const char *graph_file)
+{
+	struct stat st;
+	if (stat(graph_file, &st)) {
+		if (errno == ENOENT)
+			return 0;
+		else
+			return -1;
+	}
+	return 1;
+}
+
 struct commit_graph *load_commit_graph_one(const char *graph_file)
 {
 	void *graph_map;
diff --git a/commit-graph.h b/commit-graph.h
index 5678a8f4ca..a251f1bc32 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -11,6 +11,7 @@ 
 struct commit;

 char *get_commit_graph_filename(const char *obj_dir);
+int commit_graph_exists(const char *graph_file);

 /*
  * Given a commit struct, try to fill the commit struct info, including: