diff mbox series

[12/17] Documentation: describe split commit-graphs

Message ID 7bbe8d9150a623ea684c94d129eda1607dd32a79.1557330827.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Commit-graph: Write incremental files | expand

Commit Message

Derrick Stolee via GitGitGadget May 8, 2019, 3:53 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

The design for the split commit-graphs uses file names to force
a "stack" of commit-graph files. This allows incremental writes
without updating the file format.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 142 +++++++++++++++++++++++
 1 file changed, 142 insertions(+)

Comments

SZEDER Gábor May 8, 2019, 5:20 p.m. UTC | #1
On Wed, May 08, 2019 at 08:53:57AM -0700, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The design for the split commit-graphs uses file names to force
> a "stack" of commit-graph files. This allows incremental writes
> without updating the file format.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/commit-graph.txt | 142 +++++++++++++++++++++++
>  1 file changed, 142 insertions(+)
> 
> diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
> index fb53341d5e..ca1661d2d8 100644
> --- a/Documentation/technical/commit-graph.txt
> +++ b/Documentation/technical/commit-graph.txt
> @@ -127,6 +127,148 @@ Design Details
>    helpful for these clones, anyway. The commit-graph will not be read or
>    written when shallow commits are present.
>  
> +Split Commit Graphs
> +-------------------
> +
> +Typically, repos grow with near-constant velocity (commits per day). Over
> +time, the number of commits added by a fetch operation is much smaller than
> +the number of commits in the full history. The split commit-graph feature
> +allows for fast writes of new commit data without rewriting the entire
> +commit history -- at least, most of the time.
> +
> +## File Layout
> +
> +A split commit-graph uses multiple files, and we use a fixed naming
> +convention to organize these files. The base commit-graph file is the
> +same: `$OBJDIR/info/commit-graph`. The rest of the commit-graph files have
> +the format `$OBJDIR/info/commit-graphs/commit-graph-<N>` where N is a
> +positive integer. The integers must start at 1 and grow sequentially
> +to form a stack of files.
> +
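For concreteness, a stack of height two under this convention would
be laid out on disk as follows (taking $OBJDIR to be .git/objects):

    .git/objects/info/commit-graph
    .git/objects/info/commit-graphs/commit-graph-1
    .git/objects/info/commit-graphs/commit-graph-2
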
> +Each `commit-graph-<N>` file has the same format as the `commit-graph`
> +file, including a lexicographic list of commit ids. The only difference
> +is that this list is considered to be concatenated to the list from
> +the lower commit-graphs. As an example, consider this diagram of three
> +files:
> +
> + +-----------------------+
> + |  commit-graph-2       |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |  commit-graph-1       |
> + |                       |
> + +-----------------------+
> +	  |
> + +-----------------------+
> + |                       |
> + |                       |
> + |                       |
> + |  commit-graph         |
> + |                       |
> + |                       |
> + |                       |
> + +-----------------------+
> +
> +Let X0 be the number of commits in `commit-graph`, X1 be the number
> +of commits in commit-graph-1, and X2 be the number of commits in
> +commit-graph-2. If a commit appears in position i in `commit-graph-2`,
> +then we interpret this as being the commit in position (X0 + X1 + i),
> +and that will be used as its "graph position". The commits in
> +commit-graph-2 use these positions to refer to their parents, which
> +may be in commit-graph-1 or commit-graph. We can navigate to an
> +arbitrary commit in position j by checking its containment in the
> +intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + X2).
> +
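To make the arithmetic concrete: with X0 = 1000 and X1 = 100, the
commit at graph position 1050 lives in commit-graph-1 at local
index 50. A minimal C sketch of the lookup (invented names, not
Git's actual structures):

    #include <stdint.h>

    struct graph_level {
        uint32_t num_commits;         /* commits stored in this file */
        uint32_t num_commits_in_base; /* total commits in lower levels */
    };

    /*
     * Given levels ordered base-first and a graph position j, find
     * the level containing j and its local index within that file.
     * Returns the level index, or -1 if j is out of range.
     */
    static int find_level(const struct graph_level *levels, int nr,
                          uint32_t j, uint32_t *local)
    {
        for (int i = 0; i < nr; i++) {
            uint32_t base = levels[i].num_commits_in_base;
            if (j >= base && j < base + levels[i].num_commits) {
                *local = j - base;
                return i;
            }
        }
        return -1;
    }
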
> +When Git reads from these files, it starts by acquiring a read handle
> +on the `commit-graph` file. On success, it continues acquiring read
> +handles on the `commit-graph-<N>` files in increasing order. This
> +order is important for how we replace the files.
> +
> +## Merging commit-graph files
> +
> +If we only added a `commit-graph-<N>` file on every write, we would
> +run into a linear search problem through many commit-graph files.
> +Instead, we use a merge strategy to decide when the stack should
> +collapse some number of levels.
> +
> +The diagram below shows such a collapse. As a set of new commits
> +is added, the merge strategy determines that the files should
> +collapse to `commit-graph-1`. Thus, the new commits, the commits
> +in `commit-graph-2`, and the commits in `commit-graph-1` should be
> +combined into a new `commit-graph-1` file.
> +
> +			    +---------------------+
> +			    |                     |
> +			    |    (new commits)    |
> +			    |                     |
> +			    +---------------------+
> +			    |                     |
> + +-----------------------+  +---------------------+
> + |  commit-graph-2       |->|                     |
> + +-----------------------+  +---------------------+
> +	  |                 |                     |
> + +-----------------------+  +---------------------+
> + |                       |  |                     |
> + |  commit-graph-1       |->|                     |
> + |                       |  |                     |
> + +-----------------------+  +---------------------+
> +	  |                   commit-graph-1.lock
> + +-----------------------+
> + |                       |
> + |                       |
> + |                       |
> + |  commit-graph         |
> + |                       |
> + |                       |
> + |                       |
> + +-----------------------+
> +
> +During this process, the commits to write are combined and sorted,
> +and the contents are written to the `commit-graph-1.lock` file.
> +When the file is flushed and ready to swap to `commit-graph-1`,
> +we first unlink the files above our target file. This unlinking
> +is done from the top of the stack, in the reverse of the order
> +that another process would use to read the stack.
> +
> +During this time window, another process trying to read the
> +commit-graph stack could read `commit-graph-1` before the swap
> +but try to read `commit-graph-2` after it is unlinked. That
> +process would then believe that this stack is complete, but
> +would miss out on the performance benefits of the commits in
> +`commit-graph-2`. For this reason, the stack above the
> +`commit-graph` file should be small.

Consider the following sequence of events:

  1. There are three commit-graph files in the repository.

  2. A git process opens the base commit-graph and commit-graph-1 for
     reading.  It doesn't yet open commit-graph-2, because the (for
     arguments sake not very fair) scheduler takes the CPU away.

  3. Meanwhile, a 'git fetch', well, fetches from a remote, and
     upon noticing that it got a lot of commits it decides to collapse
     commit-graph-1 and -2 and the new commits, writing a brand new
     commit-graph-1.

  4. A second fetch fetches from a second remote, and writes
     commit-graph-2 (no collapsing this time).

  5. Now the crappy scheduler finally decides that it's time to wake
     up the waiting git process from step 2, which then finds the new
     commit-graph-2 file and opens it for reading.

  6. At this point this poor git process has file handles for:
  
     - the base commit-graph file, which is unchanged.

     - the old commit-graph-1 which has since been replaced, and does
       not yet contain info about the old commit-graph-2 or the
       commits received in the first fetch.

     - the new commit-graph-2, containing info only about commits
       received in the second fetch, and whose parents' graph
       positions point either to the base commit-graph (good, since
       unchanged) or to the new commit-graph-1 (uh-oh).

What happens next?  If this process tries to access the parent of a
commit from commit-graph-2, and the metadata about this parent is in
the new commit-graph-1, then I expect all kinds of weird bugs.

But will a git process ever try to access a commit that didn't yet
exist in the repository when it started opening the commit-graph
files?



> +## Merge Strategy
> +
> +When writing a set of commits that do not exist in the
> +commit-graph stack of height N, we default to creating
> +a new file at level N + 1. We then decide to merge
> +with the Nth level if one of two conditions holds:
> +
> +  1. The expected file size for level N + 1 is at
> +     least half the file size for level N.
> +
> +  2. Level N + 1 contains more than MAX_SPLIT_COMMITS
> +     commits (64,000 commits).
> +
> +This decision cascades down the levels: when we
> +merge a level we create a new set of commits that
> +is then compared to the next level.
> +
> +The first condition bounds the number of levels
> +to be logarithmic in the total number of commits.
> +The second condition bounds the total number of
> +commits in the `commit-graph-<N>` files but not in
> +the `commit-graph` file, preventing significant
> +performance issues when the stack merges and another
> +process only partially reads the previous stack.
> +
> +The merge strategy values (2 for the size multiple,
> +64,000 for the maximum number of commits) could be
> +extracted into config settings for full flexibility.
> +
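A rough C sketch of the cascade (hypothetical names and types; the
real write code may differ):

    #include <stdint.h>

    #define MAX_SPLIT_COMMITS 64000

    /*
     * Starting from the newly written top level, keep merging into
     * the level below while either condition above holds. For
     * simplicity this sketch never collapses into the base
     * `commit-graph` file (level 0).
     */
    static int lowest_level_to_rewrite(const uint64_t *file_size,
                                       const uint32_t *num_commits,
                                       int top)
    {
        uint64_t size = file_size[top];
        uint32_t commits = num_commits[top];
        int level = top;

        while (level > 0 &&
               (size >= file_size[level - 1] / 2 ||
                commits > MAX_SPLIT_COMMITS)) {
            level--;
            /* the merged result then compares to the next level */
            size += file_size[level];
            commits += num_commits[level];
        }
        return level;
    }
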
>  Related Links
>  -------------
>  [0] https://bugs.chromium.org/p/git/issues/detail?id=8
> -- 
> gitgitgadget
>
Derrick Stolee May 8, 2019, 7 p.m. UTC | #2
On 5/8/2019 1:20 PM, SZEDER Gábor wrote:
> On Wed, May 08, 2019 at 08:53:57AM -0700, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
> Consider the following sequence of events:
> 
>   1. There are three commit-graph files in the repository.
> 
>   2. A git process opens the base commit-graph and commit-graph-1 for
>      reading.  It doesn't yet open commit-graph-2, because the (for
>      argument's sake not very fair) scheduler takes the CPU away.
> 
>   3. Meanwhile, a 'git fetch', well, fetches from a remote, and
>      upon noticing that it got a lot of commits it decides to collapse
>      commit-graph-1 and -2 and the new commits, writing a brand new
>      commit-graph-1.
> 
>   4. A second fetch fetches from a second remote, and writes
>      commit-graph-2 (no collapsing this time).
> 
>   5. Now the crappy scheduler finally decides that it's time to wake
>      up the waiting git process from step 2, which then finds the new
>      commit-graph-2 file and opens it for reading.
> 
>   6. At this point this poor git process has file handles for:
>   
>      - the base commit-graph file, which is unchanged.
> 
>      - the old commit-graph-1 which has since been replaced, and does
>        not yet contain info about the old commit-graph-2 or the
>        commits received in the first fetch.
> 
>      - the new commit-graph-2, containing info only about commits
>        received in the second fetch, and whose parents' graph
>        positions point either to the base commit-graph (good, since
>        unchanged) or to the new commit-graph-1 (uh-oh).
> 
> What happens next?  If this process tries to access the parent of a
> commit from commit-graph-2, and the metadata about this parent is in
> the new commit-graph-1, then I expect all kinds of weird bugs.
> 
> But will a git process ever try to access a commit that didn't yet
> exist in the repository when it started opening the commit-graph
> files?

I'll ignore the improbability of this turn of events (two writes happening
during the span of trying to read two files) and focus on the fact that
we can prevent issues here using the 4th TODO item in my cover letter:

 4. It would be helpful to add a new optional chunk that contains the
    trailing hash for the lower level of the commit-graph stack. This chunk
    would only be for the commit-graph-N files, and would provide a simple
    way to check that the stack is valid on read, in case we are still
    worried about other processes reading/writing in the wrong order.

If we have this chunk -- you have convinced me that we need it -- then we
could ignore the "new" commit-graph-2 because its base graph hash does not
match. We can continue without dying because we can always parse the "missing"
commits from the packs.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason May 8, 2019, 8:11 p.m. UTC | #3
On Wed, May 08 2019, Derrick Stolee wrote:

> On 5/8/2019 1:20 PM, SZEDER Gábor wrote:
>> On Wed, May 08, 2019 at 08:53:57AM -0700, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>> Consider the following sequence of events:
>>
>>   1. There are three commit-graph files in the repository.
>>
>>   2. A git process opens the base commit-graph and commit-graph-1 for
>>      reading.  It doesn't yet open commit-graph-2, because the (for
>>      argument's sake not very fair) scheduler takes the CPU away.
>>
>>   3. Meanwhile, a 'git fetch', well, fetches from a remote, and
>>      upon noticing that it got a lot of commits it decides to collapse
>>      commit-graph-1 and -2 and the new commits, writing a brand new
>>      commit-graph-1.
>>
>>   4. A second fetch fetches from a second remote, and writes
>>      commit-graph-2 (no collapsing this time).
>>
>>   5. Now the crappy scheduler finally decides that it's time to wake
>>      up the waiting git process from step 2, which then finds the new
>>      commit-graph-2 file and opens it for reading.
>>
>>   6. At this point this poor git process has file handles for:
>>
>>      - the base commit-graph file, which is unchanged.
>>
>>      - the old commit-graph-1 which has since been replaced, and does
>>        not yet contain info about the old commit-graph-2 or the
>>        commits received in the first fetch.
>>
>>      - the new commit-graph-2, containing info only about commits
>>        received in the second fetch, and whose parents' graph
>>        positions point either to the base commit-graph (good, since
>>        unchanged) or to the new commit-graph-1 (uh-oh).
>>
>> What happens next?  If this process tries to access the parent of a
>> commit from commit-graph-2, and the metadata about this parent is in
>> the new commit-graph-1, then I expect all kinds of weird bugs.
>>
>> But will a git process ever try to access a commit that didn't yet
>> exist in the repository when it started opening the commit-graph
>> files?
>
> I'll ignore the improbability of this turn of events (two writes happening
> during the span of trying to read two files) and focus on the fact that
> we can prevent issues here using the 4th TODO item in my cover letter:

FWIW the sort of scenario SZEDER is describing is something I deal with
in production a lot. It doesn't require an unfair scheduler, just that
you have differently nice(1)'d processes accessing the same repo.

So if you have batch "cron" processes, their IO scheduling follows their
nice(1) scheduling. It's not atypical to e.g. have some background
thingy sit for seconds or even minutes on an I/O syscall while the
kernel decides everyone else has right of way, since you nice'd that not
caring if it finishes in 10 seconds or 10 hours.

>  4. It would be helpful to add a new optional chunk that contains the
>     trailing hash for the lower level of the commit-graph stack. This chunk
>     would only be for the commit-graph-N files, and would provide a simple
>     way to check that the stack is valid on read, in case we are still
>     worried about other processes reading/writing in the wrong order.
>
> If we have this chunk -- you have convinced me that we need it -- then we
> could ignore the "new" commit-graph-2 because its base graph hash does not
> match. We can continue without dying because we can always parse the "missing"
> commits from the packs.

Having read the actual code for this & what the format will look like
(last time I didn't have that context[1]) I really don't get why we
don't save ourselves a lot of trouble and do away with this "N" in the
filename idea from the start.

It's just a slight change to the iteration in
prepare_commit_graph_one().

Instead of looping through files N at a time we'd have a discovery step
where we'd need to open() all the files, see which ones say "my parent
hash hash X", and then create a list of those hashes in order to read a
bunch of commit-graph-<HASH> files.
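
Something like this sketch, say, starting from a tip hash learned
from the "info" file idea below (invented struct and stub; none of
this is Git's real API):

    #include <string.h>

    #define HEXSZ 40

    struct sketch_graph {
        char hash[HEXSZ + 1];      /* this file's trailing hash, hex */
        char base_hash[HEXSZ + 1]; /* declared parent; "" at the base */
    };

    /* Stub standing in for opening commit-graphs/commit-graph-<hex>
     * and parsing its chunks; real code would mmap the file. */
    static int open_graph_file(const char *hex, struct sketch_graph *out)
    {
        (void)hex; (void)out;
        return -1;
    }

    /* Follow the "my parent hash is X" pointers downwards from the
     * tip, collecting the chain tip-first. */
    static int load_chain(const char *tip_hex,
                          struct sketch_graph *chain, int max)
    {
        char next[HEXSZ + 1];
        int nr = 0;

        strcpy(next, tip_hex);
        while (nr < max && !open_graph_file(next, &chain[nr])) {
            nr++;
            if (!*chain[nr - 1].base_hash)
                break; /* reached the base of the chain */
            strcpy(next, chain[nr - 1].base_hash);
        }
        return nr; /* 0 means the tip was missing */
    }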

Is that a bit painful? Sure, but way less painful than dealing with the
caveats I'd mentioned in [1] and SZEDER details here.

And the obvious thing would be to save ourselves most of that work every
time we read by writing a .git/objects/commit-graphs/info on
"commit-graph write", which would be the <HASH> of the end of the
"latest" chain. We could also have some side-index listing the whole
"current" chain in order (but I'm more paranoid about locking/updates to
such a thing, maybe we could put it in the last file in a new chunk
....).

If we didn't have such side-indexes then the way the current loading
works, it would need to traverse the files back to front, and *then* load
them front to back (as it does now), so that's a slight pain,
obviously. But not a big deal.

With commit-graph-<HASH> all these unlink() race conditions go away,
partial reads due to concurrent graph writing becomes a non-issue (we'd
just leave the old files, "gc" deals with them later..), no need to
carefully fsync() files/dirs etc as we need to carefully juggle N and
N+1 files.

It also becomes easy to "chain" graphs across repos e.g. via
alternates. Say, in the scenario github/gitlab have, where there's a
"main" repo and other objects on another delta island.

In that case the repo would have a local "tip" file with the last link
in its chain, some of which would then refer back to <HASHes> in other
"parent" alternates.

As long as such a setup has a "gc" process that's not overly eager about
pruning old stuff and considers that constellation of repos as a whole,
that should just work. You can freely optimize and rewrite graphs across
repos, just be careful about unlinking old stuff.

I don't see how it would work with commit-graph-N without a *lot* of
painful orchestration (where e.g. you *must* guarantee that the parent
repo ends in N, all child repos start at N+1).

> This allows incremental writes without updating the file format.

FWIW this is some of what I was talking about in [2]. In
ds/commit-graph-format-v2 I had feedback to the effect[3] that the
particular way in which you proposed to update the format (changing the
header) wouldn't be worth it, since old clients dealt with it so badly.

But as noted in [3] I see zero reason for why we can't update the
existing format, we just add new chunks. That allows us to add any new
data in backwards-compatible ways.

I see nothing wrong with a solution that has split files in principle,
just with the currently proposed commit-graph-N way of doing that.

I just wonder if we're looking at a "Y" solution to an "X-Y" problem
where "X" was unduly dismissed. If updating the format was a non-issue
(which seems to me to be the case), what then?

I imagine we'd still have just a "commit-graph" file; to write a new one
we'd "cp" that one, then munge that existing file to write something new
to the end and "mv" it in-place. It seems to me a (sane) split-file plan
is better, but I'm not in your head; maybe it was much better for
reasons I'm not imagining before I apparently talked you out of changing
the format itself :)

1. https://public-inbox.org/git/87bm0jirjj.fsf@evledraar.gmail.com/
2. https://public-inbox.org/git/87r298hhlc.fsf@evledraar.gmail.com/
3. https://public-inbox.org/git/87lfzprkfc.fsf@evledraar.gmail.com/
Junio C Hamano May 9, 2019, 4:49 a.m. UTC | #4
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> With commit-graph-<HASH> all these unlink() race conditions go away,
> partial reads due to concurrent graph writing becomes a non-issue (we'd
> just leave the old files, "gc" deals with them later..), no need to
> carefully fsync() files/dirs etc as we need to carefully juggle N and
> N+1 files.

The above would give a nice course correction to be in line with the
rest of the system, like how split index knows about and chains to
its base.  Thanks for a dose of sanity.
Derrick Stolee May 9, 2019, 12:25 p.m. UTC | #5
On 5/9/2019 12:49 AM, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
>> With commit-graph-<HASH> all these unlink() race conditions go away,
>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>> just leave the old files, "gc" deals with them later..), no need to
>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>> N+1 files.
> 
> The above would give a nice course correction to be in line with the
> rest of the system, like how split index knows about and chains to
> its base.  Thanks for a dose of sanity.

I'm working on a detailed response to Ævar's ideas, to be sure we are
talking about the same thing, because the original motivation for the
commit-graph format v2 was to allow the 'commit-graph' file to point
to a chain of base files by a list of hashes (like the split index
does). The current proposal was created in response to an unwillingness
to break the file format for the 'commit-graph' file.

-Stolee
Derrick Stolee May 9, 2019, 1:45 p.m. UTC | #6
On 5/8/2019 4:11 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, May 08 2019, Derrick Stolee wrote:
>
>> I'll ignore the improbability of this turn of events (two writes happening
>> during the span of trying to read two files) and focus on the fact that
>> we can prevent issues here using the 4th TODO item in my cover letter:
> 
> FWIW the sort of scenario SZEDER is describing is something I deal with
> in production a lot. It doesn't require an unfair scheduler, just that
> you have differently nice(1)'d processes accessing the same repo.
> 
> So if you have batch "cron" processes, their IO scheduling follows their
> nice(1) scheduling. It's not atypical to e.g. have some background
> thingy sit for seconds or even minutes on an I/O syscall while the
> kernel decides everyone else has right of way, since you nice'd that not
> caring if it finishes in 10 seconds or 10 hours.

Thanks. I learned something today.

I still don't think this is a problem, with the idea below.

>>  4. It would be helpful to add a new optional chunk that contains the
>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>     would only be for the commit-graph-N files, and would provide a simple
>>     way to check that the stack is valid on read, in case we are still
>>     worried about other processes reading/writing in the wrong order.
>>
>> If we have this chunk -- you have convinced me that we need it -- then we
>> could ignore the "new" commit-graph-2 because its base graph hash does not
>> match. We can continue without dying because we can always parse the "missing"
>> commits from the packs.

So, let's set that idea aside. You have other concerns.

> Instead of looping through files N at a time we'd have a discovery step
> where we'd need to open() all the files, see which ones say "my parent
>> hash is X", and then create a list of those hashes in order to read a
> bunch of commit-graph-<HASH> files.
> 
> Is that a bit painful? Sure, but way less painful than dealing with the
> caveats I'd mentioned in [1] and SZEDER details here.

I don't see how this step is less painful than the one I am describing.
You'll need to be a bit more specific to convince me.

I'll try to be specific with a few ideas that have been thrown around,
so we can compare and contrast (and point out where I am misunderstanding
what you are trying to say).

Option 0 (this series): commit-graph-N
--------------------------------------

On read, we look for the 'info/commit-graph' file and acquire a read handle.
We set that commit_graph struct as our tip commit-graph. Then, for each
increasing N (until we fail) we acquire a read handle on
'info/commit-graphs/commit-graph-N' and check that its base hash matches
our current tip commit-graph. If the file doesn't exist, or the base
hash doesn't match, then we stop and continue with our current tip graph.

On write, use a 'mv' to swap our .lock file with whatever level we are
merging, THEN we can unlink() the higher layers in decreasing order. (This
"mv-then-unlink" order is different than what is implemented by this
series, but is enabled by the chunk containing the base graph hash.)
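
In sketch form (invented names; assumes the base-hash chunk described
above):

    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    /* Invented stand-in for Git's commit_graph struct. */
    struct sketch_graph {
        unsigned char trailing_hash[20]; /* checksum of this file */
        unsigned char base_hash[20];     /* from the proposed chunk */
        struct sketch_graph *base;
    };

    /* Stub standing in for opening and parsing one graph file;
     * real code would mmap it and return NULL on failure. */
    static struct sketch_graph *sketch_load(const char *path)
    {
        (void)path;
        return NULL;
    }

    static struct sketch_graph *load_stack(const char *obj_dir,
                                           struct sketch_graph *tip)
    {
        for (int n = 1; ; n++) {
            char path[PATH_MAX];
            struct sketch_graph *g;

            snprintf(path, sizeof(path),
                     "%s/info/commit-graphs/commit-graph-%d",
                     obj_dir, n);
            g = sketch_load(path);
            /* missing file or base-hash mismatch: stop here and
             * continue with the tip we already have */
            if (!g || memcmp(g->base_hash, tip->trailing_hash, 20))
                return tip;
            g->base = tip;
            tip = g;
        }
    }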

Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
--------------------------------------------------------------

On read, we load the 'info/commit-graph' file and inspect the byte saying
how many base files we have. We load their hashes from the base file chunk
and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
to ignore anything "above" the failure in the chain (and specifically this
includes the 'commit-graph' file), or consider reloading the commit-graph
file altogether and hope it works this time. [Note: if an older version
of Git doesn't understand the incremental file format, it will fail to
properly understand the graph positions and either fail with an
"invalid parent position" error, or worse give garbage results.]

On write, if we are creating a new layer to our chain, we need to _copy_
the existing commit-graph file to a graph-{hash}.graph file before renaming
the .lock file. If we are merging layers, then we either (a) clean up the
dangling chains after moving our commit-graph file, or (b) have something
like 'gc' clean up the files later. I think using 'gc' for this is not
a good idea, since I would expect these files to be written and merged
much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
up the dangling chains leads to our concurrency issues. Further, this
'copy the base' no longer keeps our large base file at rest.

Option 2: grovel commit-graphs directory for graph-{hash}.graph
---------------------------------------------------------------

On read, we load the 'info/commit-graph' file and assume it is never
an incremental file. Then, scan the 'info/commit-graphs' directory
for 'graph-{hash}.graph' files and open them _all_ up to construct
a "graph forest" (each graph has a single parent, given by a chunk
specifying its base graph hash). If we don't have an instance of a
graph with a given hash, then ignore any graphs pointing to that hash.
We now have a decision to make: which leaf of this forest should we
use as our tip commit-graph? That could be given by the
commit-graphs/info file. But what happens when we have timing issues
around scanning the directory and the commit-graphs/info file? Do
we fall back to modified time?

On write, if we are not merging, then we just create a new
graph-{hash}.graph file. If we are merging, but still have a base graph,
then create a new graph-{hash}.graph file. Finally, if we are merging
all layers, then we rename our .lock file to 'info/commit-graph'.
To clean up, we need to grovel the directory to look for graph-{hash}.graph
files whose base chains no longer match the new, "best" chain and unlink()
them. This clean-up step can happen at any time.

--[end description of options]--

Did I accurately describe the options we are considering?

Option 1 was the design I was planning, and I think it matches how the
split-index feature works. Please correct me if I am missing something.
It _requires_ updating the file format version. But it also has a flaw
that the other options do not have: the copy of the base file. One
thing I want to enable is for whatever machinery is handling these
file writes to run a 'verify' immediately after, and have that be fast
most of the time. With a model that changes only the "tip" file, we
can verify only the new files and have confidence that the base file
did not change. I think options 0 and 2 both improve in this direction.
 
> With commit-graph-<HASH> all these unlink() race conditions go away,
> partial reads due to concurrent graph writing becomes a non-issue (we'd
> just leave the old files, "gc" deals with them later..), no need to
> carefully fsync() files/dirs etc as we need to carefully juggle N and
> N+1 files.

Calling this a non-issue is an exaggeration, especially if you are
claiming we need to be robust to multi-hour gaps between reading files.

> It also becomes easy to "chain" graphs across repos e.g. via
> alternates. Say, in the scenario github/gitlab have, where there's a
> "main" repo and other objects on another delta island.
> 
> In that case the repo would have a local "tip" file with the last link
> in its chain, some of which would then refer back to <HASHes> in other
> "parent" alternates.
> 
> As long as such a setup has a "gc" process that's not overly eager about
> pruning old stuff and considers that constellation of repos as a whole,
> that should just work. You can freely optimize and rewrite graphs across
> repos, just be careful about unlinking old stuff.
> 
> I don't see how it would work with commit-graph-N without a *lot* of
> painful orchestration (where e.g. you *must* guarantee that the parent
> repo ends in N, all child repos start at N+1).

You're right that Option 0 does not work in this model where some graph
information is stored in an alternate _and_ more information is stored
outside the alternate. My perspective is biased, because I consider the
alternate to be "almost everything" and the local object store to be
small. But in a fork network, this is not always the case. I appreciate
your feedback for this environment, and I've always hoped that someone
with server experience would come and say "this feature is great, but
we need X, Y, and Z to make best use of it in our environment. Here's
a patch that moves us in that direction!" At least you are doing the
next-best thing: stopping me from making mistakes that would block
adoption.

So let's consider how Option 2 would work in this "multi-tip" case.
Each object directory would have some number of graph files, and one
'commit-graphs/info' file pointing to some hash. When we read, we
try to pick the info file that is "closest" to us.

This does create some complications that I don't think you gave enough
attention to. These may be solvable, but they are non-trivial:

* When we 'gc' the "core" repo, we need to enumerate all of the
  "leaf" repos to check their tip commit-graph files and make a
  decision if we should keep their bases around or delete those tips.
  Perhaps I'm over-stating the difficulty here, since we need to do
  something similar to find still-reachable objects, anyway. But if
  we are doing that reachability calculation, then why are we not
  just putting all of the commit-graph data in the core repo? Note
  that we don't get the same value as delta islands because this data
  isn't being shared across the protocol. The issue with storing all
  graph data in the core repo is that the core repo doesn't actually
  have all of the commits, which makes 'verify' on the graph a bit
  annoying.

* If we choose a local tip instead of the "core" tip, then that chain
  of commit-graphs can be far behind the core repo. In the world where
  a fork moves only at the speed of a single developer, but the core
  project moves quickly, then computing a merge base with the core's
  master branch becomes slow as our local chain doesn't contain most
  of the commits.

* We can't take all of the "core" chain _and_ the local chain, because
  the concept of "graph position" no longer makes sense. The only way
  I see out of this is to make the graph position two-dimensional:
  commit -> (index of tip chain, position in that chain). Perhaps this
  is a valuable thing to do in the future? Or perhaps, we shouldn't
  have incremental chains spanning object directories and instead
  introduce "parents-by-ref" where we mark some parents as included
  by object id instead of by graph position. This would allow the
  core repo to gc without caring about the external repos. It also
  wouldn't care about how the graph files are stored (Option 0 would
  work, as graph chains would not cross object store boundaries) and
  more closely resembles the independence of the pack-files in each
  object store. The "parents-by-ref" would require updating the
  file format version.
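
For what it's worth, that two-dimensional position could be as simple
as (purely illustrative):

    #include <stdint.h>

    struct graph_pos_2d {
        uint32_t chain; /* index of the tip chain / object store */
        uint32_t pos;   /* position within that chain */
    };

Parent references in a graph file would then either carry such a
pair, or be marked "by ref" and store an object id directly.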

--[end discussion of incremental files]--

I'll move forward applying your existing feedback on patches 1-11 and
submit as a full series to replace ds/commit-graph-format-v2. We can
work on reviewing that code while we continue to think critically on
the topic of incremental files.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason May 9, 2019, 3:48 p.m. UTC | #7
On Thu, May 09 2019, Derrick Stolee wrote:

Not a very detailed reply since I've gotta run soon, but figured I'd
send something (also given our timezone difference).

> On 5/8/2019 4:11 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Wed, May 08 2019, Derrick Stolee wrote:
>>
>>> I'll ignore the improbability of this turn of events (two writes happening
>>> during the span of trying to read two files) and focus on the fact that
>>> we can prevent issues here using the 4th TODO item in my cover letter:
>>
>> FWIW the sort of scenario SZEDER is describing is something I deal with
>> in production a lot. It doesn't require an unfair scheduler, just that
>> you have differently nice(1)'d processes accessing the same repo.
>>
>> So if you have batch "cron" processes, their IO scheduling follows their
>> nice(1) scheduling. It's not atypical to e.g. have some background
>> thingy sit for seconds or even minutes on an I/O syscall while the
>> kernel decides everyone else has right of way, since you nice'd that not
>> caring if it finishes in 10 seconds or 10 hours.
>
> Thanks. I learned something today.
>
> I still don't think this is a problem, with the idea below.
>
>>>  4. It would be helpful to add a new optional chunk that contains the
>>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>>     would only be for the commit-graph-N files, and would provide a simple
>>>     way to check that the stack is valid on read, in case we are still
>>>     worried about other processes reading/writing in the wrong order.
>>>
>>> If we have this chunk -- you have convinced me that we need it -- then we
>>> could ignore the "new" commit-graph-2 because its base graph hash does not
>>> match. We can continue without dying because we can always parse the "missing"
>>> commits from the packs.
>
> So, let's set that idea aside. You have other concerns.
>
>> Instead of looping through files N at a time we'd have a discovery step
>> where we'd need to open() all the files, see which ones say "my parent
>> hash is X", and then create a list of those hashes in order to read a
>> bunch of commit-graph-<HASH> files.
>>
>> Is that a bit painful? Sure, but way less painful than dealing with the
>> caveats I'd mentioned in [1] and SZEDER details here.
>
> I don't see how this step is less painful than the one I am describing.
> You'll need to be a bit more specific to convince me.
>
> I'll try to be specific with a few ideas that have been thrown around,
> so we can compare and contrast (and point out where I am misunderstanding
> what you are trying to say).
>
> Option 0 (this series): commit-graph-N
> --------------------------------------
>
> On read, we look for the 'info/commit-graph' file and acquire a read handle.
> We set that commit_graph struct as our tip commit-graph. Then, for each
> increasing N (until we fail) we acquire a read handle on
> 'info/commit-graphs/commit-graph-N' and check that its base hash matches
> our current tip commit-graph. If the file doesn't exist, or the base
> hash doesn't match, then we stop and continue with our current tip graph.

This "base hash" is something I may have missed. I *thought* we'd
happy-go-lucky load e.g. commit-graph-1, commit-graph-2 in that order,
and if (e.g. due to fsync/dir entry issues noted earlier) we got the
"wrong" commit-graph-2 we'd be oblivious to that. Is that not the case?

I was trying (and failing) to make the test suite vomit by hacking up
the code to load "commit-graph-1" first, then "commit-graph", so maybe I
ran into that safeguard, whatever it is. I've just seen how we carry the
"num_commits_in_base" forward, not some expected checksum...

> On write, use a 'mv' to swap our .lock file with whatever level we are
> merging, THEN we can unlink() the higher layers in decreasing order. (This
> "mv-then-unlink" order is different than what is implemented by this
> series, but is enabled by the chunk containing the base graph hash.)

So one of the things that make me paranoid about this (as noted
upthread/earlier) is that at least on POSIX filesystems just because you
observe that you create and fsync "foo" and "bar" in that order doesn't
mean that a concurrent reader of that directory will get the updates in
that order. I.e. file update != dir entry update.

It can be made to work, and core.fsyncObjectFiles is a partial solution,
but better to avoid it entirely. Also the failure with
core.fsyncObjectFiles is much more graceful, since in that case we don't
have a "deadbeef[...]" loose object *yet* but will (presumably) see the
*right* thing Real Soon Now since it's content-addressable, whereas with
this design we might see the *wrong* thing. So this would be the first
area of git (that I know of) where we'd be sensitive to a combination of
syncing dir entries and file content in order.

> Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
> --------------------------------------------------------------
>
> On read, we load the 'info/commit-graph' file and inspect the byte saying
> how many base files we have. We load their hashes from the base file chunk
> and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
> to ignore anything "above" the failure in the chain (and specifically this
> includes the 'commit-graph' file), or consider reloading the commit-graph
> file altogether and hope it works this time. [Note: if an older version
> of Git doesn't understand the incremental file format, it will fail to
> properly understand the graph positions and either fail with an
> "invalid parent position" error, or worse give garbage results.]

Right, except as noted before I'd add (at least per my understanding so
far) "s/needs format v2//". I.e. we could stick this data in new v1
backwards-compatible chunks, which also addresses the "older version..."
caveat.

I.e. we'd always guarantee that the "commit-graph" file was
understandable/valid for older versions (even though eventually their
view on it would be "oh, this has no data" (but really it's all in a new
chunk they don't grok)).

> On write, if we are creating a new layer to our chain, we need to _copy_
> the existing commit-graph file to a graph-{hash}.graph file before renaming
> the .lock file. If we are merging layers, then we either (a) clean up the
> dangling chains after moving our commit-graph file, or (b) have something
> like 'gc' clean up the files later. I think using 'gc' for this is not
> a good idea, since I would expect these files to be written and merged
> much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
> up the dangling chains leads to our concurrency issues. Further, this
> 'copy the base' no longer keeps our large base file at rest.

So aside from the specifics of implementation, both this and the
commit-graph-N way of doing it involve juggling a sequence of chunks
that "point upwards", which IMO has worse caveats than "point
downwards". I.e. we'd need to rewrite "base" files to point "upwards" to
new stuff, and there's no (sane) way in option 0 to split commit-graphs
across repos, and no way at all in this scenario (if I understand it
correctly...).

> Option 2: grovel commit-graphs directory for graph-{hash}.graph
> ---------------------------------------------------------------
>
> On read, we load the 'info/commit-graph' file and assume it is never
> an incremental file. Then, scan the 'info/commit-graphs' directory
> for 'graph-{hash}.graph' files and open them _all_ up to construct
> a "graph forest" (each graph has a single parent, given by a chunk
> specifying its base graph hash). If we don't have an instance of a
> graph with a given hash, then ignore any graphs pointing to that hash.
> We now have a decision to make: which leaf of this forest should we
> use as our tip commit-graph? That could be given by the
> commit-graphs/info file. But what happens when we have timing issues
> around scanning the directory and the commit-graphs/info file? Do
> we fall back to modified time?

I think in a worst-case scenario we shrug and just pick whatever chain
looks most recent/largest or whatever, but in practice I think it's a
non-issue.

Critical for that "it's a non-issue" is the suggestion I had on 17/17 of
not doing the incremental write in "fetch", but instead just relying on
"gc --auto". *And*, to address your "I would expect these files to be
written and merged much more frequently", we'd just (and I'm keen to
hack this up) teach "gc --auto" such an incremental mode.

This means that within a single repo all updates to the commit-graph
will go through the gc.lock, whereas with your current 17/17 we'd
potentially have a "commit-graph write" racing a concurrent "gc", with
both aiming to update the commit-graph.

> On write, if we are not merging, then we just create a new
> graph-{hash}.graph file. If we are merging, but still have a base graph,
> then create a new graph-{hash}.graph file. Finally, if we are merging
> all layers, then we rename our .lock file to 'info/commit-graph'.
> To clean up, we need to grovel the directory to look for graph-{hash}.graph
> files whose base chains no longer match the new, "best" chain and unlink()
> them. This clean-up step can happen at any time.
>
> --[end description of options]--
>
> Did I accurately describe the options we are considering?
>
> Option 1 was the design I was planning, and I think it matches how the
> split-index feature works. Please correct me if I am missing something.
> It _requires_ updating the file format version. But it also has a flaw
> that the other options do not have: the copy of the base file. One
> thing I want to enable is for whatever machinery is handling these
> file writes to run a 'verify' immediately after, and have that be fast
> most of the time. With a model that changes only the "tip" file, we
> can verify only the new files and have confidence that the base file
> did not change. I think options 0 and 2 both improve in this direction.
>
>> With commit-graph-<HASH> all these unlink() race conditions go away,
>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>> just leave the old files, "gc" deals with them later..), no need to
>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>> N+1 files.
>
> Calling this a non-issue is an exaggeration, especially if you are
> claiming we need to be robust to multi-hour gaps between reading files.

We always have a race, but they're very different races.

With "N" files (option #0) we'd have races of the type where within the
same few milliseconds of a commit-graph being merged/updated,
e.g. a concurrent "tag --contains" would either be much slower (couldn't
get one of the incremental graphs it expected), or produce the wrong
answer (this may be wrong on my part, see my "base hash" comment above).

Whereas if I run something with "ionice -c 3" I could possibly hang for
however many hours/days/weeks we wait until another "gc" comes along and
unlinks those old files, but if I'm running it like that I'm not
expecting it to be fast, so it's OK if the files went away, and it won't
ever get the wrong file (since the filenames are hash-addressable).

>> It also becomes easy to "chain" graphs across repos e.g. via
>> alternates. Say, in the scenario github/gitlab have, where there's a
>> "main" repo and other objects on another delta island.
>>
>> In that case the repo would have a local "tip" file with the last link
>> in its chain, some of which would then refer back to <HASHes> in other
>> "parent" alternates.
>>
>> As long as such a setup has a "gc" process that's not overly eager about
>> pruning old stuff and considers that constellation of repos as a whole,
>> that should just work. You can freely optimize and rewrite graphs across
>> repos, just be careful about unlinking old stuff.
>>
>> I don't see how it would work with commit-graph-N without a *lot* of
>> painful orchestration (where e.g. you *must* guarantee that the parent
>> repo ends in N, all child repos start at N+1).
>
> You're right that Option 0 does not work in this model where some graph
> information is stored in an alternate _and_ more information is stored
> outside the alternate. My perspective is biased, because I consider the
> alternate to be "almost everything" and the local object store to be
> small. But in a fork network, this is not always the case. I appreciate
> your feedback for this environment, and I've always hoped that someone
> with server experience would come and say "this feature is great, but
> we need X, Y, and Z to make best use of it in our environment. Here's
> a patch that moves us in that direction!" At least you are doing the
> next-best thing: stopping me from making mistakes that would block
> adoption.

I'm happy to write some patches, just want to talk about it first (and
if I'm lucky convince you to write them for me :) ).

One example: the git-for-windows commit-graph is 5.6MB, the git.git
one is 3.1MB. Should GitHub just stick all of that in the one "parent"
graph? Maybe. But it's nice to have the flexibility of stacking them.

There are also more disconnected cases, e.g. I have some "staging" boxes
where I have a cronjob running around re-pointing clones of a big
monorepo to a shared "alternates" store where I guarantee objects are
only ever added, never removed.

It would be nice to have a way to provide a commit-graph there that's
"stable" that clients could point to, and they'd just generate the
difference.

I.e. now I have a shared .git/objects which contains gigabytes, a
crapload of stuff in /home where .git/objects is 10-50MB, and each one
has a commit-graph that's around the same 50-100MB size (since it needs
to contain the metadata for the full set).

> So let's consider how Option 2 would work in this "multi-tip" case.
> Each object directory would have some number of graph files, and one
> 'commit-graphs/info' file pointing to some hash. When we read, we
> try to pick the info file that is "closest" to us.
>
> This does create some complications that I don't think you gave enough
> attention to. These may be solvable, but they are non-trivial:
>
> * When we 'gc' the "core" repo, we need to enumerate all of the
>   "leaf" repos to check their tip commit-graph files and make a
>   decision if we should keep their bases around or delete those tips.
>   Perhaps I'm over-stating the difficulty here, since we need to do
>   something similar to find still-reachable objects, anyway. But if
>   we are doing that reachability calculation, then why are we not
>   just putting all of the commit-graph data in the core repo? Note
>   that we don't get the same value as delta islands because this data
>   isn't being shared across the protocol. The issue with storing all
>   graph data in the core repo is that the core repo doesn't actually
>   have all of the commits, which makes 'verify' on the graph a bit
>   annoying.

Yeah I think similar to "alternates" it would be annoying to have a case
where a given repo has metadata on objects it doesn't have, and there's
cases (see the "staging" case I mentioned above) where that "parent"
repo won't have access to those things.

> * If we choose a local tip instead of the "core" tip, then that chain
>   of commit-graphs can be far behind the core repo. In the world where
>   a fork moves only at the speed of a single developer, but the core
>   project moves quickly, then computing a merge base with the core's
>   master branch becomes slow as our local chain doesn't contain most
>   of the commits.

That's a good point and a case where pointing "upwards" or just having
commit-graphs in the "base" repo is better, i.e. the "fork" has almost
no objects.

But solvable by triggering "gc" on these child projects so their
commit-graph keeps being re-pointed to a later version.

And we'd have the reverse problem with a git-for-windows wouldn't we?
I.e. the fork is "far ahead".

> * We can't take all of the "core" chain _and_ the local chain, because
>   the concept of "graph position" no longer makes sense. The only way
>   I see out of this is to make the graph position two-dimensional:
>   commit -> (index of tip chain, position in that chain). Perhaps this
>   is a valuable thing to do in the future? Or perhaps, we shouldn't
>   have incremental chains spanning object directories and instead
>   introduce "parents-by-ref" where we mark some parents as included
>   by object id instead of by graph position. This would allow the
>   core repo to gc without caring about the external repos. It also
>   wouldn't care about how the graph files are stored (Option 0 would
>   work, as graph chains would not cross object store boundaries) and
>   more closely resembles the independence of the pack-files in each
>   object store. The "parents-by-ref" would require updating the
>   file format version.

The parent repo can "gc" without caring/inspecting "child" repos with
the "point downwards" of option #2, as long as it promises to retain its
old commit-graph files for some retention period of X, and the "child"
repos promise to "gc" (and re-point to new graphs if necessary) at a rate
that's faster than that.

This makes it easy to e.g. say "we retain old commit-graph files for 2
weeks", and "we re-gc everything in cron weekly".

It would work best if we can also pull this trick on the "base"
commit-graph file, which I believe we could do in a backwards-compatible
way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
is the current "base".

> --[end discussion of incremental files]--
>
> I'll move forward applying your existing feedback on patches 1-11 and
> submit as a full series to replace ds/commit-graph-format-v2. We can
> work on reviewing that code while we continue to think critically on
> the topic of incremental files.
>
> Thanks,
> -Stolee
Derrick Stolee May 9, 2019, 5:08 p.m. UTC | #8
On 5/9/2019 11:48 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, May 09 2019, Derrick Stolee wrote:
> [snip]
>>
>> I still don't think this is a problem, with the idea below.
>>
>>>>  4. It would be helpful to add a new optional chunk that contains the
>>>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>>>     would only be for the commit-graph-N files, and would provide a simple
>>>>     way to check that the stack is valid on read, in case we are still
>>>>     worried about other processes reading/writing in the wrong order.
>>>>
>>>> If we have this chunk -- you have convinced me that we need it -- then we
>>>> could ignore the "new" commit-graph-2 because its base graph hash does not
>>>> match. We can continue without dying because we can always parse the "missing"
>>>> commits from the packs.
>>
>> So, let's set that idea aside. You have other concerns.
>>
>>> Instead of looping through files N at a time we'd have a discovery step
>>> where we'd need to open() all the files, see which ones say "my parent
>>> hash is X", and then create a list of those hashes in order to read a
>>> bunch of commit-graph-<HASH> files.
>>>
>>> Is that a bit painful? Sure, but way less painful than dealing with the
>>> caveats I'd mentioned in [1] and SZEDER details here.
>>
>> I don't see how this step is less painful than the one I am describing.
>> You'll need to be a bit more specific to convince me.
>>
>> I'll try to be specific with a few ideas that have been thrown around,
>> so we can compare and contrast (and point out where I am misunderstanding
>> what you are trying to say).
>>
>> Option 0 (this series): commit-graph-N
>> --------------------------------------
>>
>> On read, we look for the 'info/commit-graph' file and acquire a read handle.
>> We set that commit_graph struct as our tip commit-graph. Then, for each
>> increasing N (until we fail) we acquire a read handle on
>> 'info/commit-graphs/commit-graph-N' and check that its base hash matches
>> our current tip commit-graph. If the file doesn't exist, or the base
>> hash doesn't match, then we stop and continue with our current tip graph.
> 
> This "base hash" is something I may have missed. I *thought* we'd
> happy-go-lucky load e.g. commit-graph-1, commit-graph-2 in that order,
> and if (e.g. due to fsync/dir entry issues noted earlier) we got the
> "wrong" commit-graph-2 we'd be oblivious to that. Is that not the case?
> 
> I was trying (and failing) to make the test suite vomit by hacking up
> the code to load "commit-graph-1" first, then "commit-graph", so maybe I
> ran into that safeguard, whatever it is. I've just seen how we carry the
> "num_commits_in_base" forward, not some expected checksum...

It's not implemented in the patch series code, but included as an idea
in the cover letter, and the discussion above convinced me it should be
required. The discussion here assumes that is part of the design.

> 
>> On write, use a 'mv' to swap our .lock file with whatever level we are
>> merging, THEN we can unlink() the higher layers in decreasing order. (This
>> "mv-then-unlink" order is different than what is implemented by this
>> series, but is enabled by the chunk containing the base graph hash.)
> 
> So one of the things that make me paranoid about this (as noted
> upthread/earlier) is that at least on POSIX filesystems just because you
> observe that you create and fsync "foo" and "bar" in that order doesn't
> mean that a concurrent reader of that directory will get the updates in
> that order. I.e. file update != dir entry update.

How far apart can these concurrency issues happen in the file system?
One benefit to Option 0 is that there is only one file _write_ that matters.
The other options require at least two writes.

> It can be made to work, and core.fsyncObjectFiles is a partial solution,
> but better to avoid it entirely. Also the failure with
> core.fsyncObjectFiles is much more graceful, since in that case we don't
> have a "deadbeef[...]" loose object *yet* but will (presumably) see the
> *right* thing Real Soon Now since it's content-addressable, whereas with
> this design we might see the *wrong* thing. So this would be the first
> area of git (that I know of) where we'd be sensitive to a combination of
> syncing dir entries and file content in order.
> 
>> Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
>> --------------------------------------------------------------
>>
>> On read, we load the 'info/commit-graph' file and inspect the byte saying
>> how many base files we have. We load their hashes from the base file chunk
>> and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
>> to ignore anything "above" the failure in the chain (and specifically this
>> includes the 'commit-graph' file), or consider reloading the commit-graph
>> file altogether and hope it works this time. [Note: if an older version
>> of Git doesn't understand the incremental file format, it will fail to
>> properly understand the graph positions and either fail with an
>> "invalid parent position" error, or worse give garbage results.]
> 
> Right, except as noted before I'd add (at least per my understanding so
> far) "s/needs format v2//". I.e. we could stick this data in new v1
> backwards-compatible chunks, which also addresses the "older version..."
> caveat.
> 
> I.e. we'd always guarantee that the "commit-graph" file was
> understandable/valid for older versions (even though eventually their
> view on it would be "oh, this has no data" (but really it's all in a new
> chunk they don't grok)).

I see. Something that I gather from your paragraph above but cannot find
written explicitly anywhere is "write a commit-graph with zero commits,
and it contains extra information that we can use to find the rest of the
commits." In essence, the entire purpose of the file would be to contain
the information of the 'commit-graphs/info' file from Option 2. It would
mean we don't need to "copy then move" because the tip commit-graph file
has no content to persist (except for the first incremental write).

That would prevent breaking old clients, but they would also not have any
commit-graph data to use. Option 0 and Option 2 can leave a valid v1
commit-graph file with the majority of the commit data. This is only a
performance issue for a narrow case, so maybe that's worth ignoring.

(See Ævar's idea about symlinks at the bottom of this message for more.)

>> On write, if we are creating a new layer to our chain, we need to _copy_
>> the existing commit-graph file to a graph-{hash}.graph file before renaming
>> the .lock file. If we are merging layers, then we either (a) clean up the
>> dangling chains after moving our commit-graph file, or (b) have something
>> like 'gc' clean up the files later. I think using 'gc' for this is not
>> a good idea, since I would expect these files to be written and merged
>> much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
>> up the dangling chains leads to our concurrency issues. Further, this
>> 'copy the base' no longer keeps our large base file at rest.
> 
> So aside from the specifics of implementation both this and the
> commit-graph-N way of doing it involves juggling a sequence of chunks
> that "point upwards", which IMO has worse caveats than "point
> downwards". I.e. we'd need to rewrite "base" files to point "upwards" to
> new stuff, and there's no (sane) way in option 0 to split commit-graphs
> across repos, and no way at all in this scenario (if I understand it
> correctly...).

I don't understand the value of rewriting base files to point upwards.
The whole point is to not change the base files.

>> Option 2: grovel commit-graphs directory for graph-{hash}.graph
>> ---------------------------------------------------------------
>>
>> On read, we load the 'info/commit-graph' file and assume it is never
>> an incremental file. Then, scan the 'info/commit-graphs' directory
>> for 'graph-{hash}.graph' files and open them _all_ up to construct
>> a "graph forest" (each graph has a single parent, given by a chunk
>> specifying its base graph hash). If we don't have an instance of a
>> graph with a given hash, then ignore any graphs pointing to that hash.
>> We now have a decision to make: which leaf of this forest should we
>> use as our tip commit-graph? That could be given by the
>> commit-graphs/info file. But what happens when we have timing issues
>> around scanning the directory and the commit-graphs/info file? Do
>> we fall back to modified time?
> 
> I think in a worst-case scenario we shrug and just pick whatever chain
> looks most recent/largest or whatever, but in practice I think it's a
> non-issue.
> 
> Critical for that "it's a non-issue" is the suggestion I had on 17/17 of
> not doing the incremental write in "fetch", but instead just rely on "gc
> --auto" *and* to address your "I would expect these files to be written
> and merged much more frequently" we'd just (and I'm keen to hack this
> up) teach "gc --auto" such an incremental mode.

I look forward to this incremental mode, and its "partial maintenance".

> This means that within a single repo all updates to the commit-graph
> will go through the gc.lock, whereas with your current 17/17 we'd
> potentially have a "commit-graph write" racing a concurrent "gc", with
> both aiming to update the commit-graph.

Except for writes that happen in `git commit-graph write`, of course.

VFS for Git never runs 'gc' and instead maintains the commit-graph directly.

Should the commit-graph builtin take a gc lock on write?

>> On write, if we are not merging, then we just create a new
>> graph-{hash}.graph file. If we are merging, but still have a base graph,
>> then create a new graph-{hash}.graph file. Finally, if we are merging
>> all layers, then we rename our .lock file to 'info/commit-graph'.
>> To clean up, we need to grovel the directory to look for graph-{hash}.graph
>> files whose base chains no longer match the new, "best" chain and unlink()
>> them. This clean-up step can happen at any time.
>>
>> --[end description of options]--
>>
>> Did I accurately describe the options we are considering?
>>
>> Option 1 was the design I was planning, and I think it matches how the
>> split-index feature works. Please correct me if I am missing something.
>> It _requires_ updating the file format version. But it also has a flaw
>> that the other options do not have: the copy of the base file. One
>> thing I want to enable is for whatever machinery is handling these
>> file writes to run a 'verify' immediately after, and have that be fast
>> most of the time. With a model that changes only the "tip" file, we
>> can verify only the new files and have confidence that the base file
>> did not change. I think options 0 and 2 both improve in this direction.
>>
>>> With commit-graph-<HASH> all these unlink() race conditions go away,
>>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>>> just leave the old files, "gc" deals with them later..), no need to
>>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>>> N+1 files.
>>
>> Calling this a non-issue is an exaggeration, especially if you are
>> claiming we need to be robust to multi-hour gaps between reading files.
> 
> We always have a race, but they're very different races.
> 
> With "N" files (option #0) we'd have races of the type where within the
> same few milliseconds of a commit-graph being merged/updated, e.g. a
> concurrent "tag --contains" would either be much slower (couldn't
> get one of the incremental graphs it expected), or produce the wrong
> answer (this may be wrong on my part, see my "base hash" comment above).

With the "base hash" feature (planned, not implemented) we won't be wrong.

With the "max number of commits in the incremental files" feature, we
would limit the performance issues from missing an update. BUT, this also
implies we are rewriting the base commit-graph more often. With a limit
of 64,000 commits, that's still only once every few weeks in the Windows
OS repo.
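
For reference, the merge rule being discussed (spelled out in the patch
at the bottom of this page) boils down to roughly this sketch, with the
proposed defaults; the function name is made up:

    #include <stddef.h>

    /*
     * Merge level N+1 into level N if it has grown to at least half
     * the size of level N, or if it exceeds the 64,000-commit cap.
     * The decision cascades: the merged result is then compared
     * against the next level down in the same way.
     */
    static int should_merge(size_t size_above, size_t size_below,
                            unsigned long commits_above)
    {
        return 2 * size_above >= size_below || commits_above > 64000;
    }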

> Whereas if I run something with "ionice -c 3" I could possibly hang for
> however many hours/days/weeks we wait until another "gc" comes along and
> unlinks those old files, but if I'm running it like that I'm not
> expecting it to be fast, so it's OK if the files went away, and it won't
> ever get the wrong file (since the filenames are hash-addressable).

How do we recover from this situation? Have no commit-graph at all? Or
do we try again?

>>> It also becomes easy to "chain" graphs across repos, e.g. via
>>> alternates. Say, in the scenario github/gitlab have, there's a
>>> "main" repo and other objects on another delta island.
>>>
>>> In that case the repo would have a local "tip" file with the last link
>>> in its chain, some of which would then refer back to <HASHes> in other
>>> "parent" alternates.
>>>
>>> As long as such a setup has a "gc" process that's not overly eager about
>>> pruning old stuff and considers that constellation of repos as a whole
>>> that should just work. You can freely optimize and rewrite graphs across
>>> repos, just be careful about unlinking old stuff.
>>>
>>> I don't see how it would work with commit-graph-N without a *lot* of
>>> painful orchestration (where e.g. you *must* guarantee that the parent
>>> repo ends in N, all child repos start at N+1).
>>
>> You're right that Option 0 does not work in this model where some graph
>> information is stored in an alternate _and_ more information is stored
>> outside the alternate. My perspective is biased, because I consider the
>> alternate to be "almost everything" and the local object store to be
>> small. But in a fork network, this is not always the case. I appreciate
>> your feedback for this environment, and I've always hoped that someone
>> with server experience would come and say "this feature is great, but
>> we need X, Y, and Z to make best use of it in our environment. Here's
>> a patch that moves us in that direction!" At least you are doing the
>> next-best thing: stopping me from making mistakes that would block
>> adoption.
> 
> I'm happy to write some patches, just want to talk about it first (and
> if I'm lucky convince you to write them for me :) ).
> 
> One example is the git-for-windows commit-graph is 5.6MB, the git.git
> one is 3.1MB. Should GitHub just stick that all in the one "parent"
> graph, maybe. But nice to have the flexibility of stacking them.
> 
> There are also more disconnected cases, e.g. I have some "staging" boxes
> where I have a cronjob running around re-pointing clones of a big
> monorepo to a shared "alternates" store where I guarantee objects are
> only ever added, never removed.
> 
> It would be nice to have a way to provide a commit-graph there that's
> "stable" that clients could point to, and they'd just generate the
> difference.
> 
> I.e. now I have a shared .git/objects which contains gigabytes, a
> crapload of stuff in /home where .git/objects is 10-50MB, and each one
> has a commit-graph that's around the same 50-100MB size (since it needs
> to contain the metadata for the full set).
>
>> So let's consider how Option 2 would work in this "multi-tip" case.
>> Each object directory would have some number of graph files, and one
>> 'commit-graphs/info' file pointing to some hash. When we read, we
>> try to pick the info file that is "closest" to us.
>>
>> This does create some complications that I don't think you gave enough
>> attention to. These may be solvable, but they are non-trivial:
>>
>> * When we 'gc' the "core" repo, we need to enumerate all of the
>>   "leaf" repos to check their tip commit-graph files and make a
>>   decision if we should keep their bases around or delete those tips.
>>   Perhaps I'm over-stating the difficulty here, since we need to do
>>   something similar to find still-reachable objects, anyway. But if
>>   we are doing that reachability calculation, then why are we not
>>   just putting all of the commit-graph data in the core repo? Note
>>   that we don't get the same value as delta islands because this data
>>   isn't being shared across the protocol. The issue with storing all
>>   graph data in the core repo is that the core repo doesn't actually
>>   have all of the commits, which makes 'verify' on the graph a bit
>>   annoying.
> 
> Yeah I think similar to "alternates" it would be annoying to have a case
> where a given repo has metadata on objects it doesn't have, and there's
> cases (see the "staging" case I mentioned above) where that "parent"
> repo won't have access to those things.
> 
>> * If we choose a local tip instead of the "core" tip, then that chain
>>   of commit-graphs can be far behind the core repo. In the world where
>>   a fork moves only at the speed of a single developer, but the core
>>   project moves quickly, then computing a merge base with the core's
>>   master branch becomes slow as our local chain doesn't contain most
>>   of the commits.
> 
> That's a good point and a case where pointing "upwards" or just having
> commit-graphs in the "base" repo is better, i.e. the "fork" has almost
> no objects.

Is your idea of "upwards" different than mine? I think of pointing to
a base file as pointing "down", and the opposite would be "up". In the
case of a fork network or multiple repos using an alternate, pointing
upwards would not even be well-defined.

> But solvable by triggering "gc" on these child projects so their
> commit-graph keeps being re-pointed to a later version.
> 
> And we'd have the reverse problem with a git-for-windows, wouldn't we?
> I.e. the fork is "far ahead".

This is the quintessential example for why we can't have a single chain
of commit-graphs long-term. It deviates from most fork networks enough
that we can't say "just take the base repo's commit-graph" but typical
fork networks can't say "just take my local commit-graph chain". The
two-dimensional graph position would be valuable to help both shapes.
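
Purely as an illustration of that idea (nothing like this exists yet),
such a position might look like:

    #include <stdint.h>

    /*
     * Hypothetical: address a commit by the chain it lives in plus its
     * position within that chain, instead of one flat offset that only
     * works for a single linear stack of graph files.
     */
    struct graph_pos {
        uint32_t chain; /* index of the tip chain containing the commit */
        uint32_t pos;   /* lexicographic position within that chain */
    };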

>> * We can't take all of the "core" chain _and_ the local chain, because
>>   the concept of "graph position" no longer makes sense. The only way
>>   I see out of this is to make the graph position two-dimensional:
>>   commit -> (index of tip chain, position in that chain). Perhaps this
>>   is a valuable thing to do in the future? Or perhaps, we shouldn't
>>   have incremental chains spanning object directories and instead
>>   introduce "parents-by-ref" where we mark some parents as included
>>   by object id instead of by graph position. This would allow the
>>   core repo to gc without caring about the external repos. It also
>>   wouldn't care about how the graph files are stored (Option 0 would
>>   work, as graph chains would not cross object store boundaries) and
>>   more closely resembles the independence of the pack-files in each
>>   object store. The "parents-by-ref" would require updating the
>>   file format version.
> 
> The parent repo can "gc" without caring/inspecting "child" repos with
> the "point downwards" of option #2, as long as it promises to retain its
> old commit-graph files for some retention period of X, and the "child"
> repos promise to "gc" (and re-point to new graphs if necessary) at rate
> that's faster than that.
> 
> This makes it easy to e.g. say "we retain old commit-graph files for 2
> weeks", and "we re-gc everything in cron weekly".

Here, I think, is the most crucial point of why Option 2 may be worth the
added complexity over Option 0. Option 0 _requires_ that the files be
replaced immediately on a new write, while Option 2 provides a way to
leave old files around and be cleaned up later.

But how should we actually perform this cleanup? I would imagine a
'git commit-graph gc' subcommand that cleans up old files. A 'git gc'
run would perform the same logic, but we need a way to do this outside
of 'gc'. It needs to use the modified time as an indicator, since we
could run 'git commit-graph write' twice an hour before our two-week
cleanup job and need to keep our hour-old stale file. Perhaps the
'git commit-graph gc' subcommand could take a '--window' parameter that
can be 0, while 'git gc' uses a config setting.

The decision can then be: "is this file not in our graph chain and
older than <window> from now?"
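
A sketch of that decision, assuming the caller already knows whether
the file belongs to the current chain (names are illustrative):

    #include <sys/stat.h>
    #include <time.h>
    #include <unistd.h>

    /*
     * Unlink a graph file only if it is no longer part of the current
     * chain AND its mtime is older than the retention window
     * ('--window', 0 for immediate cleanup right after a write).
     */
    static void maybe_unlink_stale(const char *path, int in_current_chain,
                                   time_t window)
    {
        struct stat st;

        if (in_current_chain)
            return;
        if (stat(path, &st))
            return; /* already gone, or unreadable: nothing to do */
        if (time(NULL) - st.st_mtime > window)
            unlink(path);
    }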

But also, I expect the stale commit-graph files will pile up quickly.
We rebuild the commit-graph file roughly every hour. I would write
our maintenance to call these subcommands in order with no delay:

    "write" -> "verify --shallow" -> "gc --window=0"

(Here, "verify --shallow" would only verify the tip of the
commit-graph chain.)

> It would work best if we can also pull this trick on the "base"
> commit-graph file, which I believe we could do in a backwards-compatible
> way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
> is the current "base".

Could we do this, anyway? Use 'commit-graphs/info' to point to the tip
and let the symlink 'commit-graph' point to the base. Then, old clients
would load a full commit-graph and new clients would get the full chain.
Ævar Arnfjörð Bjarmason May 9, 2019, 9:45 p.m. UTC | #9
On Thu, May 09 2019, Derrick Stolee wrote:

> On 5/9/2019 11:48 AM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Thu, May 09 2019, Derrick Stolee wrote:
>> [snip]
>>>
>>> I still don't think this is a problem, with the idea below.
>>>
>>>>>  4. It would be helpful to add a new optional chunk that contains the
>>>>>     trailing hash for the lower level of the commit-graph stack. This chunk
>>>>>     would only be for the commit-graph-N files, and would provide a simple
>>>>>     way to check that the stack is valid on read, in case we are still
>>>>>     worried about other processes reading/writing in the wrong order.
>>>>>
>>>>> If we have this chunk -- you have convinced me that we need it -- then we
>>>>> could ignore the "new" commit-graph-2 because its base graph hash does not
>>>>> match. We can continue without dying because we can always parse the "missing"
>>>>> commits from the packs.
>>>
>>> So, let's set that idea aside. You have other concerns.
>>>
>>>> Instead of looping through files N at a time we'd have a discovery step
>>>> where we'd need to open() all the files, see which ones say "my parent
>>>> hash hash X", and then create a list of those hashes in order to read a
>>>> bunch of commit-graph-<HASH> files.
>>>>
>>>> Is that a bit painful? Sure, but way less painful than dealing with the
>>>> caveats I'd mentioned in [1] and SZEDER details here.
>>>
>>> I don't see how this step is less painful than the one I am describing.
>>> You'll need to be a bit more specific to convince me.
>>>
>>> I'll try to be specific with a few ideas that have been thrown around,
>>> so we can compare and contrast (and point out where I am misunderstanding
>>> what you are trying to say).
>>>
>>> Option 0 (this series): commit-graph-N
>>> --------------------------------------
>>>
>>> On read, we look for the 'info/commit-graph' file and acquire a read handle.
>>> We set that commit_graph struct as our tip commit-graph. Then, for each
>>> increasing N (until we fail) we acquire a read handle on
>>> 'info/commit-graphs/commit-graph-N' and check that its base hash matches
>>> our current tip commit-graph. If the file doesn't exist, or the base
>>> hash doesn't match, then we stop and continue with our current tip graph.
>>
>> This "base hash" is something I may have missed. I *thought* we'd
>> happy-go-lucky load e.g. commit-graph-1, commit-graph-2 in that order,
>> and if (e.g. due to fsync/dir entry issues noted earlier) if we got the
>> "wrong" commit-graph-2 we'd be oblivious to that. Is that not the case?>
>> I was trying (and failing) to make the test suite vomit by hacking up
>> the code to load "commit-graph-1" first, then "commit-graph", so maybe I
>> ran into that safeguard, whatever it is. I've just seen how we carry the
>> "num_commits_in_base" forward, not some expected checksum...
>
> It's not implemented in the patch series code, but included as an idea
> in the cover letter, and the discussion above convinced me it should be
> required. The discussion here assumes that is part of the design.

Got it. Hashing like that would definitely mitigate the "wrong file"
failure scenario, although as discussed leaving some others...

>>> On write, use a 'mv' to swap our .lock file with whatever level we are
>>> merging, THEN can unlink() the higher layers in decreasing order. (This
>>> "mv-then-unlink" order is different than what is implemented by this
>>> series, but is enabled by the chunk containing the base graph hash.)
>>
>> So one of the things that make me paranoid about this (as noted
>> upthread/earlier) is that at least on POSIX filesystems just because you
>> observe that you create and fsync "foo" and "bar" in that order doesn't
>> mean that a concurrent reader of that directory will get the updates in
>> that order. I.e. file update != dir entry update.
>
> How far apart in time can these concurrency issues happen in the file system?
> One benefit to Option 0 is that there is only one file _write_ that matters.
> The other options require at least two writes.

You need to write() and then fsync()/close() to be guaranteed that the
data was written to the file.

As an aside, I see that the current commit-graph code uses CSUM_FSYNC
(but without CSUM_CLOSE, which probably doesn't matter); I thought it
didn't. Maybe we should remove that unless core.fsyncObjectFiles=true
(or have another "looser" fsync config).

So writing is cheap, but it's asking the OS to sync to disk that
generally hurts. As noted in the discussions when core.fsyncObjectFiles
was introduced, some FSs are really stupid about it, so it's better to
avoid these caveats if you can.

As noted in the fsync(2) manpage on Linux (ditto POSIX) an fsync to a
*file* doesn't guarantee that the directory entry is updated. So we'd
also need to opendir() the containing directory and fsync that FD as we
juggle these renames/replaces.
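
In code, the careful version of that dance is roughly the following (a
sketch of the POSIX caveat, not something git does for commit-graphs
today):

    #include <fcntl.h>
    #include <unistd.h>

    /*
     * fsync(file_fd) makes the file contents durable, but not the
     * directory entry naming the file; for that we also open the
     * containing directory and fsync that descriptor.
     */
    static int sync_file_and_dir(int file_fd, const char *dir_path)
    {
        int dir_fd, ret;

        if (fsync(file_fd))
            return -1;
        dir_fd = open(dir_path, O_RDONLY);
        if (dir_fd < 0)
            return -1;
        ret = fsync(dir_fd);
        close(dir_fd);
        return ret;
    }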

But stepping away from strict POSIX for a second, many users of git use
the object store over something like NFS. It's very handy to be able to
mount such a volume with e.g. lookupcache=positive
(https://linux.die.net/man/5/nfs).

I.e. if you request a commit-graph-<HASH> you don't need a round-trip to
the server to see if it's still there.

So it's not that we can't do it, but rather that there are some usage
patterns that are much friendlier to performance & caching than
others. Just throwing things at the FS and having it flush stuff in its
own time is ideal, and e.g. commercial FS appliances have a lot of
performance knobs you can use if you have access patterns like what we
have in objects/pack/.

I.e. "any file here you can cache forever, they never change (they
disappear, but no hurry!), but new ones might show up".

That sort of thing becomes more true with other dumber FS-alike things,
e.g. if we go for some "grab a graph from the server" implementation
such a thing is way simpler as a list of hashes that can be cached
forever (http proxies et al) than N files sensitive to their update
sequence.

>> It can be made to work, and core.fsyncObjectFiles is a partial solution,
>> but better to avoid it entirely. Also the failure with
>> core.fsyncObjectFiles is much more graceful, since in that case we don't
>> have a "deadbeef[...]" loose object *yet* but will (presumably) see the
>> *right* thing Real Soon Now since it's content-addressable, whereas with
>> this design we might see the *wrong* thing. So this would be the first
>> area of git (that I know of) where we'd be sensitive to a combination of
>> syncing dir entries and file content in order.
>>
>>> Option 1 (needs format v2): commit-graph -> graph-{hash}.graph
>>> --------------------------------------------------------------
>>>
>>> On read, we load the 'info/commit-graph' file and inspect the byte saying
>>> how many base files we have. We load their hashes from the base file chunk
>>> and read 'info/graph-{hash}.graph' for each. If _any_ fail, then we need
>>> to ignore anything "above" the failure in the chain (and specifically this
>>> includes the 'commit-graph' file), or consider reloading the commit-graph
>>> file altogether and hope it works this time. [Note: if an older version
>>> of Git doesn't understand the incremental file format, it will fail to
>>> properly understand the graph positions and either fail with an
>>> "invalid parent position" error, or worse give garbage results.]
>>
>> Right, except as noted before I'd add (at least per my understanding so
>> far) "s/needs format v2//". I.e. we could stick this data in new v1
>> backwards-compatible chunks, which also addresses the "older version..."
>> caveat.
>>
>> I.e. we'd always guarantee that the "commit-graph" file was
>> understandable/valid for older versions (even though eventually their
>> view on it would be "oh, this has no data" (but really it's all in a new
>> chunk they don't grok)).
>
> I see. Something that I gather from your paragraph above but cannot find
> written explicitly anywhere is "write a commit-graph with zero commits,
> and it contains extra information that we can use to find the rest of the
> commits." In essence, the entire purpose of the file would be to contain
> the information of the 'commit-graphs/info' file from Option 2. It would
> mean we don't need to "copy then move" because the tip commit-graph file
> has no content to persist (except for the first incremental write).
>
> That would prevent breaking old clients, but they would also not have any
> commit-graph data to use. Option 0 and Option 2 can leave a valid v1
> commit-graph file with the majority of the commit data. This is only a
> performance issue for a narrow case, so maybe that's worth ignoring.

Indeed. As noted in the thread about the v1->v2 format, if we just
append new chunks we have leeway like that. I.e. it's completely OK if,
from the POV of old clients, we choose to write no data for them; they
aren't helped by the optimization, but that's more graceful than them
dying on a format change.

And yes, we could stick the proposed commit-graphs/info "index" in a
chunk there. I've got a preference for a \n-delimited list similar to
objects/info/packs, but that's mostly for aesthetic reasons.
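
For illustration, reading such a list would be as simple as this sketch
(assuming SHA-1 hex, one hash per line, chain order = file order):

    #include <stdio.h>
    #include <string.h>

    /*
     * Read a \n-delimited commit-graphs/info file in the style of
     * objects/info/packs: one 40-hex hash per line, listed in the
     * order the chain should be read.
     */
    static int read_chain_order(const char *path, char hashes[][41], int max)
    {
        FILE *f = fopen(path, "r");
        char line[64];
        int n = 0;

        if (!f)
            return 0;
        while (n < max && fgets(line, sizeof(line), f)) {
            line[strcspn(line, "\n")] = '\0';
            if (strlen(line) == 40)
                strcpy(hashes[n++], line);
        }
        fclose(f);
        return n;
    }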

> (See Ævar's idea about symlinks at the bottom of this message for more.)
>
>>> On write, if we are creating a new layer to our chain, we need to _copy_
>>> the existing commit-graph file to a graph-{hash}.graph file before renaming
>>> the .lock file. If we are merging layers, then we either (a) clean up the
>>> dangling chains after moving our commit-graph file, or (b) have something
>>> like 'gc' clean up the files later. I think using 'gc' for this is not
>>> a good idea, since I would expect these files to be written and merged
>>> much more frequently (say, after a 'fetch') than a 'gc' is run. Cleaning
>>> up the dangling chains leads to our concurrency issues. Further, this
>>> 'copy the base' no longer keeps our large base file at rest.
>>
>> So aside from the specifics of implementation both this and the
>> commit-graph-N way of doing it involves juggling a sequence of chunks
>> that "point upwards", which IMO has worse caveats than "point
>> downwards". I.e. we'd need to rewrite "base" files to point "upwards" to
>> new stuff, and there's no (sane) way in option 0 to split commit-graphs
>> across repos, and no way at all in this scenario (if I understand it
>> correctly...).
>
> I don't understand the value of rewriting base files to point upwards.
> The whole point is to not change the base files.

Yeah I don't think it makes sense, but maybe I misread the
description...

>>> Option 2: grovel commit-graphs directory for graph-{hash}.graph
>>> ---------------------------------------------------------------
>>>
>>> On read, we load the 'info/commit-graph' file and assume it is never
>>> an incremental file. Then, scan the 'info/commit-graphs' directory
>>> for 'graph-{hash}.graph' files and open them _all_ up to construct
>>> a "graph forest" (each graph has a single parent, given by a chunk
>>> specifying its base graph hash). If we don't have an instance of a
>>> graph with a given hash, then ignore any graphs pointing to that hash.
>>> We now have a decision to make: which leaf of this forest should we
>>> use as our tip commit-graph? That could be given by the
>>> commit-graphs/info file. But what happens when we have timing issues
>>> around scanning the directory and the commit-graphs/info file? Do
>>> we fall back to modified time?
>>
>> I think in a worst-case scenario we shrug and just pick whatever chain
>> looks most recent/largest or whatever, but in practice I think it's a
>> non-issue.
>>
>> Critical for that "it's a non-issue" is the suggestion I had on 17/17 of
>> not doing the incremental write in "fetch", but instead just rely on "gc
>> --auto" *and* to address your "I would expect these files to be written
>> and merged much more frequently" we'd just (and I'm keen to hack this
>> up) teach "gc --auto" such an incremental mode.
>
> I look forward to this incremental mode, and its "partial maintenance".
>
>> This means that within a single repo all updates to the commit-graph
>> will go through the gc.lock, whereas with your current 17/17 we'd
>> potentially have a "commit-graph write" racing a concurrent "gc", with
>> both aiming to update the commit-graph.
>
> Except for writes that happen in `git commit-graph write`, of course.
>
> VFS for Git never runs 'gc' and instead maintains the commit-graph directly.
>
> Should the commit-graph builtin take a gc lock on write?

If we make the file management dumb enough it shouldn't matter: we'd
always have a lock for updating the "meta" file, but even if we didn't
we'd just have some harmless duplicate work.

I think a lock in "gc" is enough; it's our default, so it should be
sensible. People doing custom GC generally do/want their own management
of that, no need for us to hold their hand beyond the point where
e.g. concurrent "pack-refs" might corrupt refs if the ref update itself
isn't locked.

>>> On write, if we are not merging, then we just create a new
>>> graph-{hash}.graph file. If we are merging, but still have a base graph,
>>> then create a new graph-{hash}.graph file. Finally, if we are merging
>>> all layers, then we rename our .lock file to 'info/commit-graph'.
>>> To clean up, we need to grovel the directory to look for graph-{hash}.graph
>>> files whose base chains no longer match the new, "best" chain and unlink()
>>> them. This clean-up step can happen at any time.
>>>
>>> --[end description of options]--
>>>
>>> Did I accurately describe the options we are considering?
>>>
>>> Option 1 was the design I was planning, and I think it matches how the
>>> split-index feature works. Please correct me if I am missing something.
>>> It _requires_ updating the file format version. But it also has a flaw
>>> that the other options do not have: the copy of the base file. One
>>> thing I want to enable is for whatever machinery is handling these
>>> file writes to run a 'verify' immediately after, and have that be fast
>>> most of the time. With a model that changes only the "tip" file, we
>>> can verify only the new files and have confidence that the base file
>>> did not change. I think options 0 and 2 both improve in this direction.
>>>
>>>> With commit-graph-<HASH> all these unlink() race conditions go away,
>>>> partial reads due to concurrent graph writing becomes a non-issue (we'd
>>>> just leave the old files, "gc" deals with them later..), no need to
>>>> carefully fsync() files/dirs etc as we need to carefully juggle N and
>>>> N+1 files.
>>>
>>> Calling this a non-issue is an exaggeration, especially if you are
>>> claiming we need to be robust to multi-hour gaps between reading files.
>>
>> We always have a race, but they're very different races.
>>
>> With "N" files (option #0) we'd have races of the type where within the
>> same few milliseconds of a commit-graph being merged/updated, e.g. a
>> concurrent "tag --contains" would either be much slower (couldn't
>> get one of the incremental graphs it expected), or produce the wrong
>> answer (this may be wrong on my part, see my "base hash" comment above).
>
> With the "base hash" feature (planned, not implemented) we won't be wrong.

*nod*

> With the "max number of commits in the incremental files" feature, we
> would limit the performance issues from missing an update. BUT, this also
> implies we are rewriting the base commit-graph more often. With a limit
> of 64,000 commits, that's still only once every few weeks in the Windows
> OS repo.
>
>> Whereas if I run something with "ionice -c 3" I could possibly hang for
>> however many hours/days/weeks we wait until another "gc" comes along and
>> unlinks those old files, but if I'm running it like that I'm not
>> expecting it to be fast, so it's OK if the files went away, and it won't
>> ever get the wrong file (since the filenames are hash-addressable).
>
> How do we recover from this situation? Have no commit-graph at all? Or
> do we try again?

I think just no commit-graph at all is fine. If it lazily took you N
hours from reading the "commit-graphs/info" to getting around to looking
at now-disappeared files it sucks to be you.

>>>> It also becomes easy to "chain" graphs across repos, e.g. via
>>>> alternates. Say, in the scenario github/gitlab have, there's a
>>>> "main" repo and other objects on another delta island.
>>>>
>>>> In that case the repo would have a local "tip" file with the last link
>>>> in its chain, some of which would then refer back to <HASHes> in other
>>>> "parent" alternates.
>>>>
>>>> As long as such a setup has a "gc" process that's not overly eager about
>>>> pruning old stuff and considers that constellation of repos as a whole
>>>> that should just work. You can freely optimize and rewrite graphs across
>>>> repos, just be careful about unlinking old stuff.
>>>>
>>>> I don't see how it would work with commit-graph-N without a *lot* of
>>>> painful orchestration (where e.g. you *must* guarantee that the parent
>>>> repo ends in N, all child repos start at N+1).
>>>
>>> You're right that Option 0 does not work in this model where some graph
>>> information is stored in an alternate _and_ more information is stored
>>> outside the alternate. My perspective is biased, because I consider the
>>> alternate to be "almost everything" and the local object store to be
>>> small. But in a fork network, this is not always the case. I appreciate
>>> your feedback for this environment, and I've always hoped that someone
>>> with server experience would come and say "this feature is great, but
>>> we need X, Y, and Z to make best use of it in our environment. Here's
>>> a patch that moves us in that direction!" At least you are doing the
>>> next-best thing: stopping me from making mistakes that would block
>>> adoption.
>>
>> I'm happy to write some patches, just want to talk about it first (and
>> if I'm lucky convince you to write them for me :) ).
>>
>> One example is the git-for-windows commit-graph is 5.6MB, the git.git
>> one is 3.1MB. Should GitHub just stick that all in the one "parent"
>> graph, maybe. But nice to have the flexibility of stacking them.
>>
>> There are also more disconnected cases, e.g. I have some "staging" boxes
>> where I have a cronjob running around re-pointing clones of a big
>> monorepo to a shared "alternates" store where I guarantee objects are
>> only ever added, never removed.
>>
>> It would be nice to have a way to provide a commit-graph there that's
>> "stable" that clients could point to, and they'd just generate the
>> difference.
>>
>> I.e. now I have a shared .git/objects which contains gigabytes, a
>> crapload of stuff in /home where .git/objects is 10-50MB, and each one
>> has a commit-graph that's around the same 50-100MB size (since it needs
>> to contain the metadata for the full set).
>>
>>> So let's consider how Option 2 would work in this "multi-tip" case.
>>> Each object directory would have some number of graph files, and one
>>> 'commit-graphs/info' file pointing to some hash. When we read, we
>>> try to pick the info file that is "closest" to us.
>>>
>>> This does create some complications that I don't think you gave enough
>>> attention to. These may be solvable, but they are non-trivial:
>>>
>>> * When we 'gc' the "core" repo, we need to enumerate all of the
>>>   "leaf" repos to check their tip commit-graph files and make a
>>>   decision if we should keep their bases around or delete those tips.
>>>   Perhaps I'm over-stating the difficulty here, since we need to do
>>>   something similar to find still-reachable objects, anyway. But if
>>>   we are doing that reachability calculation, then why are we not
>>>   just putting all of the commit-graph data in the core repo? Note
>>>   that we don't get the same value as delta islands because this data
>>>   isn't being shared across the protocol. The issue with storing all
>>>   graph data in the core repo is that the core repo doesn't actually
>>>   have all of the commits, which makes 'verify' on the graph a bit
>>>   annoying.
>>
>> Yeah I think similar to "alternates" it would be annoying to have a case
>> where a given repo has metadata on objects it doesn't have, and there's
>> cases (see the "staging" case I mentioned above) where that "parent"
>> repo won't have access to those things.
>>
>>> * If we choose a local tip instead of the "core" tip, then that chain
>>>   of commit-graphs can be far behind the core repo. In the world where
>>>   a fork moves only at the speed of a single developer, but the core
>>>   project moves quickly, then computing a merge base with the core's
>>>   master branch becomes slow as our local chain doesn't contain most
>>>   of the commits.
>>
>> That's a good point and a case where pointing "upwards" or just having
>> commit-graphs in the "base" repo is better, i.e. the "fork" has almost
>> no objects.
>
> Is your idea of "upwards" different than mine? I think of pointing to
> a base file as pointing "down", and the opposite would be "up". In the
> case of a fork network or multiple repos using an alternate, pointing
> upwards would not even be well-defined.

We've got the same idea, but as noted above maybe I misunderstood the
"option 1" description.

>> But solvable by triggering "gc" on these child projects so their
>> commit-graph keeps being re-pointed to a later version.
>>
>> And we'd have the reverse problem with a git-for-windows, wouldn't we?
>> I.e. the fork is "far ahead".
>
> This is the quintessential example for why we can't have a single chain
> of commit-graphs long-term. It deviates from most fork networks enough
> that we can't say "just take the base repo's commit-graph" but typical
> fork networks can't say "just take my local commit-graph chain". The
> two-dimensional graph position would be valuable to help both shapes.

Indeed. I just thought about e.g. a [branch|tag] --contains for the
current repo, but for things like "how far ahead of gitster/pu is my GFW
topic?" one graph for the network is needed.

If we can help it, it would be useful not to unduly box the user in, and
to offer them flexibility to choose. I.e. some (e.g. my staging server
use-case) might want base+local repo and never need 2x local repos
vs. each other, whereas github/gitlab might need one giant graph, etc.

>>> * We can't take all of the "core" chain _and_ the local chain, because
>>>   the concept of "graph position" no longer makes sense. The only way
>>>   I see out of this is to make the graph position two-dimensional:
>>>   commit -> (index of tip chain, position in that chain). Perhaps this
>>>   is a valuable thing to do in the future? Or perhaps, we shouldn't
>>>   have incremental chains spanning object directories and instead
>>>   introduce "parents-by-ref" where we mark some parents as included
>>>   by object id instead of by graph position. This would allow the
>>>   core repo to gc without caring about the external repos. It also
>>>   wouldn't care about how the graph files are stored (Option 0 would
>>>   work, as graph chains would not cross object store boundaries) and
>>>   more closely resembles the independence of the pack-files in each
>>>   object store. The "parents-by-ref" would require updating the
>>>   file format version.
>>
>> The parent repo can "gc" without caring/inspecting "child" repos with
>> the "point downwards" of option #2, as long as it promises to retain its
>> old commit-graph files for some retention period of X, and the "child"
>> repos promise to "gc" (and re-point to new graphs if necessary) at rate
>> that's faster than that.
>>
>> This makes it easy to e.g. say "we retain old commit-graph files for 2
>> weeks", and "we re-gc everything in cron weekly".
>
> Here, I think, is the most crucial point of why Option 2 may be worth the
> added complexity over Option 0. Option 0 _requires_ that the files be
> replaced immediately on a new write, while Option 2 provides a way to
> leave old files around and be cleaned up later.
>
> But how should we actually perform this cleanup? I would imagine a
> 'git commit-graph gc' subcommand that cleans up old files. A 'git gc'
> run would perform the same logic, but we need a way to do this outside
> of 'gc'. It needs to use the modified time as an indicator, since we
> could run 'git commit-graph write' twice an hour before our two-week
> cleanup job and need to keep our hour-old stale file. Perhaps the
> 'git commit-graph gc' subcommand could take a '--window' parameter that
> can be 0, while 'git gc' uses a config setting.
>
> The decision can then be: "is this file not in our graph chain and
> older than <window> from now?"
>
> But also, I expect the stale commit-graph files will pile up quickly.
> We rebuild the commit-graph file roughly every hour. I would write
> our maintenance to call these subcommands in order with no delay:
>
>     "write" -> "verify --shallow" -> "gc --window=0"
>
> (Here, "verify --shallow" would only verify the tip of the
> commit-graph chain.)

I think a sane default would be to just unlink() the old ones as soon as
we're done writing the new ones & writing the commit-graphs/info file
saying "here's your current ones".

I.e. as noted in what I said about fsync above, it's not that we need to
keep the old ones around; the point is avoiding the tight dance with N &
N+1 updates, and being friendlier to stuff like lookupcache=positive.

Users with more advanced use-cases (e.g. cross-repo graphs) could then
always increase such an expiry.

>> It would work best if we can also pull this trick on the "base"
>> commit-graph file, which I believe we could do in a backwards-compatible
>> way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
>> is the current "base".
>
> Could we do this, anyway? Use 'commit-graphs/info' to point to the tip
> and let the symlink 'commit-graph' point to the base. Then, old clients
> would load a full commit-graph and new clients would get the full chain.

How's the Windows support for symlinks? We don't symlink anything in
.git/objects/ ourselves now (but see[1]).

On *nix just manually symlinking it works fine (you need to go out of
your way not to support it, which we didn't).

So something like this would be desirable:

    $ tree -a .git/objects/info/
    .git/objects/info/
    ├── commit-graph -> commit-graphs/commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
    ├── commit-graphs
    │   ├── commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
    │   ├── commit-graph-988881adc9fc3655077dc2d4d757d480b5ea0e11
    │   └── info
    └── packs
    $ cat .git/objects/info/commit-graphs/info
    2492e0ef38643d4cb6369f76443e6a814d616258
    988881adc9fc3655077dc2d4d757d480b5ea0e11

I.e. create new ones as needed, and when done say what sequence they
should be read in in the "info" file, and symlink "commit-graph" to
whatever the latest "base" is as a courtesy to old clients (or not, or
eventually don't bother).
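
A sketch of that courtesy step after a successful write (paths and
buffer handling are illustrative; Windows symlink support is the open
question above):

    #include <stdio.h>
    #include <unistd.h>

    /*
     * Re-point the legacy "commit-graph" name at the current base
     * file so old clients still load a full v1 graph. The target is
     * relative to the link's directory, as in the tree output above.
     */
    static int repoint_legacy_name(const char *objdir, const char *base_hex)
    {
        char target[4096], link_path[4096];

        snprintf(target, sizeof(target),
                 "commit-graphs/commit-graph-%s", base_hex);
        snprintf(link_path, sizeof(link_path),
                 "%s/info/commit-graph", objdir);
        unlink(link_path); /* drop any previous link or file */
        return symlink(target, link_path);
    }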

1. https://public-inbox.org/git/20190502144829.4394-1-matheus.bernardino@usp.br/
Derrick Stolee May 10, 2019, 12:44 p.m. UTC | #10
On 5/9/2019 5:45 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, May 09 2019, Derrick Stolee wrote:
> 
>> How far apart in time can these concurrency issues happen in the file system?
>> One benefit to Option 0 is that there is only one file _write_ that matters.
>> The other options require at least two writes.
> 
> You need to write() and then fsync()/close() to be guaranteed that the
> data was written to the file.
> 
> As an aside, I see that the current commit-graph code uses CSUM_FSYNC
> (but without CSUM_CLOSE, which probably doesn't matter); I thought it
> didn't. Maybe we should remove that unless core.fsyncObjectFiles=true
> (or have another "looser" fsync config).
> 
> So writing is cheap, but it's asking the OS to sync to disk that
> generally hurts. As noted in the discussions when core.fsyncObjectFiles
> was introduced, some FSs are really stupid about it, so it's better to
> avoid these caveats if you can.
> 
> As noted in the fsync(2) manpage on Linux (ditto POSIX) an fsync to a
> *file* doesn't guarantee that the directory entry is updated. So we'd
> also need to opendir() the containing directory and fsync that FD as we
> juggle these renames/replaces.

I think any plan should not rely on directory scanning for this reason.
Starting from a known file (info/commit-graphs/head?) and then navigating
directly to files would avoid these issues, right?
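
A sketch of that read path, with made-up helper names (the "head" file
and the graph-<hex>.graph layout are the hypotheticals under
discussion):

    #include <string.h>

    struct graph_file {
        unsigned char base_hash[20]; /* all zero for the base layer */
        /* ... mapped file contents ... */
    };

    static struct graph_file *load_graph_file(const unsigned char *hash)
    {
        /* would open info/commit-graphs/graph-<hex>.graph and mmap it */
        (void)hash;
        return NULL; /* stub for the sketch */
    }

    /*
     * Starting from the tip hash named by a known file, follow each
     * layer's base hash directly; no readdir(), so no dependence on
     * directory-entry ordering or visibility.
     */
    static int resolve_chain(const unsigned char *tip_hash,
                             struct graph_file **chain, int max)
    {
        static const unsigned char zero[20];
        const unsigned char *hash = tip_hash;
        int n = 0;

        while (n < max && memcmp(hash, zero, 20)) {
            chain[n] = load_graph_file(hash);
            if (!chain[n])
                return -1; /* broken chain: fall back to parsing packs */
            hash = chain[n++]->base_hash;
        }
        return n;
    }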

>> That would prevent breaking old clients, but they would also not have any
>> commit-graph data to use. Option 0 and Option 2 can leave a valid v1
>> commit-graph file with the majority of the commit data. This is only a
>> performance issue for a narrow case, so maybe that's worth ignoring.
> 
> Indeed. As noted in the thread about the v1->v2 format, if we just
> append new chunks we have leeway like that. I.e. it's completely OK if,
> from the POV of old clients, we choose to write no data for them; they
> aren't helped by the optimization, but that's more graceful than them
> dying on a format change.
> 
> And yes, we could stick the proposed commit-graphs/info "index" in a
> chunk there. I've got a preference for a \n-delimited list similar to
> objects/info/packs, but that's mostly for aesthetic reasons.

I think a \n-delimited list would be better because using the commit-graph
format with zero commits creates at least a kilobyte of data for a
zero-valued fanout table. If we relax the format to say "the head file
has zero commits, so the fanout, commit oids, and commit data chunks
should not exist" then we would be fine, and get versioning "for free".
But the file serves a different purpose.
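
(For scale, assuming the current v1 layout: the OID fanout chunk alone
is 256 entries x 4 bytes = 1,024 bytes even for a zero-commit graph,
before counting the header, the chunk lookup table, and the trailing
hash.)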

>>> Whereas if I run something with "ionice -c 3" I could possibly hang for
>>> however many hours/days/weeks we wait until another "gc" comes along and
>>> unlinks those old files, but if I'm running it like that I'm not
>>> expecting it to be fast, so it's OK if the files went away, and it won't
>>> ever get the wrong file (since the filenames are hash-addressable).
>>
>> How do we recover from this situation? Have no commit-graph at all? Or
>> do we try again?
> 
> I think just no commit-graph at all is fine. If it lazily took you N
> hours from reading the "commit-graphs/info" to getting around to looking
> at now-disappeared files it sucks to be you.

Sounds good. We can revisit if this actually starts hurting any real
scenarios.

>>> And we'd have the reverse problem with a git-for-windows, wouldn't we?
>>> I.e. the fork is "far ahead".
>>
>> This is the quintessential example for why we can't have a single chain
>> of commit-graphs long-term. It deviates from most fork networks enough
>> that we can't say "just take the base repo's commit-graph" but typical
>> fork networks can't say "just take my local commit-graph chain". The
>> two-dimensional graph position would be valuable to help both shapes.
> 
> Indeed. I just thought about e.g. a [branch|tag] --contains for the
> current repo, but for things like "how far ahead of gitster/pu is my GFW
> topic?" one graph for the network is needed.
> 
> If we can help it, it would be useful not to unduly box the user in, and
> to offer them flexibility to choose. I.e. some (e.g. my staging server
> use-case) might want base+local repo and never need 2x local repos
> vs. each other, whereas github/gitlab might need one giant graph, etc.

I made a note to follow up with the two-dimensional graph position [1].

[1] https://github.com/microsoft/git/issues/138

>>> This makes it easy to e.g. say "we retain old commit-graph files for 2
>>> weeks", and "we re-gc everything in cron weekly".
>>
>> Here, I think, is the most crucial point of why Option 2 may be worth the
>> added complexity over Option 0. Option 0 _requires_ that the files be
>> replaced immediately on a new write, while Option 2 provides a way to
>> leave old files around and be cleaned up later.
>>
>> But how should we actually perform this cleanup? I would imagine a
>> 'git commit-graph gc' subcommand that cleans up old files. A 'git gc'
>> run would perform the same logic, but we need a way to do this outside
>> of 'gc'. It needs to use the modified time as an indicator, since we
>> could run 'git commit-graph write' twice an hour before our two-week
>> cleanup job and need to keep our hour-old stale file. Perhaps the
>> 'git commit-graph gc' subcommand could take a '--window' parameter that
>> can be 0, while 'git gc' uses a config setting.
>>
>> The decision can then be: "is this file not in our graph chain and
>> older than <window> from now?"
>>
>> But also, I expect the stale commit-graph files will pile up quickly.
>> We rebuild the commit-graph file roughly every hour. I would write
>> our maintenance to call these subcommands in order with no delay:
>>
>>     "write" -> "verify --shallow" -> "gc --window=0"
>>
>> (Here, "verify --shallow" would only verify the tip of the
>> commit-graph chain.)
> 
> I think a sane default would be to just unlink() the old ones as soon as
> we're done writing the new ones & writing the commit-graphs/info file
> saying "here's your current ones".
> 
> I.e. as noted in what I said about fsync above, it's not that we need to
> keep the old ones around; the point is avoiding the tight dance with N &
> N+1 updates, and being friendlier to stuff like lookupcache=positive.
> 
> Users with more advanced use-cases (e.g. cross-repo graphs) could then
> always increase such an expiry.

Noted. Thanks.

>>> It would work best if we can also pull this trick on the "base"
>>> commit-graph file, which I believe we could do in a backwards-compatible
>>> way by making "commit-graph" a symlink to whatever "commit-graph-<HASH>"
>>> is the current "base".
>>
>> Could we do this, anyway? Use 'commit-graphs/info' to point to the tip
>> and let the symlink 'commit-graph' point to the base. Then, old clients
>> would load a full commit-graph and new clients would get the full chain.
> 
> How's the Windows support for symlinks? We don't symlink anything in
> .git/objects/ ourselves now (but see[1]).

Maybe this won't work in all scenarios, but it could be left as a future
enhancement if anyone cares. The "support two versions" scenario is rare
enough to avoid building something specifically for it, but often enough
to not _break_ it.

> On *nix just manually symlinking it works fine (you need to go out of
> your way not to support it, which we didn't).
> 
> So something like this would be desirable:
> 
>     $ tree -a .git/objects/info/
>     .git/objects/info/
>     ├── commit-graph -> commit-graphs/commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
>     ├── commit-graphs
>     │   ├── commit-graph-2492e0ef38643d4cb6369f76443e6a814d616258
>     │   ├── commit-graph-988881adc9fc3655077dc2d4d757d480b5ea0e11
>     │   └── info
>     └── packs
>     $ cat .git/objects/info/commit-graphs/info
>     2492e0ef38643d4cb6369f76443e6a814d616258
>     988881adc9fc3655077dc2d4d757d480b5ea0e11
> 
> I.e. create new ones as needed, and when done say what sequence they
> should be read in in the "info" file, and symlink "commit-graph" to
> whatever the latest "base" is as a courtesy to old clients (or not, or
> eventually don't bother).

OK. Glad that the idea would work.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/Documentation/technical/commit-graph.txt b/Documentation/technical/commit-graph.txt
index fb53341d5e..ca1661d2d8 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -127,6 +127,148 @@  Design Details
   helpful for these clones, anyway. The commit-graph will not be read or
   written when shallow commits are present.
 
+Split Commit Graphs
+-------------------
+
+Typically, repos grow with near-constant velocity (commits per day). Over
+time, the number of commits added by a fetch operation is much smaller than
+the number of commits in the full history. The split commit-graph feature
+allows for fast writes of new commit data without rewriting the entire
+commit history -- at least, most of the time.
+
+## File Layout
+
+A split commit-graph uses multiple files, and we use a fixed naming
+convention to organize these files. The base commit-graph file is the
+same: `$OBJDIR/info/commit-graph`. The rest of the commit-graph files have
+the format `$OBJDIR/info/commit-graphs/commit-graph-<N>` where N is a
+positive integer. The integers must start at 1 and grow sequentially
+to form a stack of files.
+
+Each `commit-graph-<N>` file has the same format as the `commit-graph`
+file, including a lexicographic list of commit ids. The only difference
+is that this list is considered to be concatenated to the list from
+the lower commit-graphs. As an example, consider this diagram of three
+files:
+
+ +-----------------------+
+ |  commit-graph-2       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |  commit-graph-1       |
+ |                       |
+ +-----------------------+
+	  |
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  commit-graph         |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+Let X0 be the number of commits in `commit-graph`, X1 be the number
+of commits in commit-graph-1, and X2 be the number of commits in
+commit-graph-2. If a commit appears in position i in `commit-graph-2`,
+then we interpret this as being the commit in position (X0 + X1 + i),
+and that will be used as its "graph position". The commits in
+commit-graph-2 use these positions to refer to their parents, which
+may be in commit-graph-1 or commit-graph. We can navigate to an
+arbitrary commit in position j by checking its containment in the
+intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + X2).
+
+When Git reads from these files, it starts by acquiring a read handle
+on the `commit-graph` file. On success, it continues acquiring read
+handles on the `commit-graph-<N>` files in increasing order. This
+order is important for how we replace the files.
+
+## Merging commit-graph files
+
+If we only added a `commit-graph-<N>` file on every write, we would
+run into a linear search problem through many commit-graph files.
+Instead, we use a merge strategy to decide when the stack should
+collapse some number of levels.
+
+The diagram below shows such a collapse. As a set of new commits
+are added, it is determined by the merge strategy that the files
+should collapse to `commit-graph-1`. Thus, the new commits, the
+commits in `commit-graph-2` and the commits in `commit-graph-1`
+should be combined into a new `commit-graph-1` file.
+
+			    +---------------------+
+			    |                     |
+			    |    (new commits)    |
+			    |                     |
+			    +---------------------+
+			    |                     |
+ +-----------------------+  +---------------------+
+ |  commit-graph-2       |->|                     |
+ +-----------------------+  +---------------------+
+	  |                 |                     |
+ +-----------------------+  +---------------------+
+ |                       |  |                     |
+ |  commit-graph-1       |->|                     |
+ |                       |  |                     |
+ +-----------------------+  +---------------------+
+	  |                   commit-graph-1.lock
+ +-----------------------+
+ |                       |
+ |                       |
+ |                       |
+ |  commit-graph         |
+ |                       |
+ |                       |
+ |                       |
+ +-----------------------+
+
+During this process, the commits to write are combined and sorted,
+and we write the contents to the `commit-graph-1.lock` file.
+When the file is flushed and ready to swap to `commit-graph-1`,
+we first unlink the files above our target file. This unlinking
+is done from the top of the stack, the reverse direction that
+another process would use to read the stack.
+
+During this time window, another process trying to read the
+commit-graph stack could read `commit-graph-1` before the swap
+but try to read `commit-graph-2` after it is unlinked. That
+process would then believe that this stack is complete, but
+will miss out on the performance benefits of the commits in
+`commit-graph-2`. For this reason, the stack above the
+`commit-graph` file should be small.
+
+## Merge Strategy
+
+When writing a set of commits that do not exist in the
+commit-graph stack of height N, we default to creating
+a new file at level N + 1. We then decide to merge
+with the Nth level if one of two conditions holds:
+
+  1. The expected file size for level N + 1 is at
+     least half the file size for level N.
+
+  2. Level N + 1 contains more than MAX_SPLIT_COMMITS
+     commits (64,000 commits).
+
+This decision cascades down the levels: when we
+merge a level, we create a new set of commits that
+is then compared against the next level.
+
+The first condition bounds the number of levels
+to be logarithmic in the total number of commits.
+The second condition bounds the total number of
+commits in a `commit-graph-N` file and not in
+the `commit-graph` file, preventing significant
+performance issues when the stack merges and another
+process only partially reads the previous stack.
+
+The merge strategy values (2 for the size multiple,
+64,000 for the maximum number of commits) could be
+extracted into config settings for full flexibility.
+
 Related Links
 -------------
 [0] https://bugs.chromium.org/p/git/issues/detail?id=8