mbox series

[00/17,RFC] Commit-graph: Write incremental files

Message ID pull.184.git.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Commit-graph: Write incremental files | expand

Message

Linus Arver via GitGitGadget May 8, 2019, 3:53 p.m. UTC
This patch series is marked as RFC quality because it is missing some key
features and tests, but hopefully starts a concrete discussion of how the
incremental commit-graph writes can work. If this is a good direction, then
it would replace ds/commit-graph-format-v2.

The commit-graph is a valuable performance feature for repos with large
commit histories, but suffers from the same problem as git repack: it
rewrites the entire file every time. This can be slow when there are
millions of commits, especially after we stopped reading from the
commit-graph file during a write in 43d3561 (commit-graph write: don't die
if the existing graph is corrupt).

Instead, create a "stack" of commit-graphs where the existing commit-graph 
file is "level 0" and other levels are in files 
$OBJDIR/info/commit-graphs/commit-graph-N for positive N. Each level is
closed under reachability with its lower levels, and the idea of "graph
position" now considers the concatenation of the commit orders from each
level. See PATCH 12 for more details.

When writing, we don't always want to add a new level to the stack. This
would eventually result in performance degradation, especially when
searching for a commit (before we know its graph position). We decide to
merge levels of the stack when the new commits we will write satisfy two
conditions:

 1. The expected size of the new file is more than half the size of the tip
    of the stack.
 2. The new file contains more than 64,000 commits.

The first condition alone would prevent more than a logarithmic number of
levels. The second condition is a stop-gap to prevent performance issues
when another process starts reading the commit-graph stack as we are merging
a large stack of commit-graph files. The reading process could be in a state
where the new file is not ready, but the levels above the new file were
already deleted. Thus, the commits that were merged down must be parsed from
pack-files.

The performance is necessarily amortized across multiple writes, so I tested
by writing commit-graphs from the (non-rc) tags in the Linux repo. My test
included 72 tags, and wrote everything reachable from the tag using 
--stdin-commits. Here are the overall perf numbers:

git commit-graph write --stdin-commits:         8m 12s
git commit-graph write --stdin-commits --split:    45s

The test using --split included at least six full collapses to the full
commit-graph. I believe the commit-graph stack had at most three levels
during this test.

This series is long because I felt the need to refactor write_commit_graph()
before making such a sweeping change to the format.

 * Patches 1-4: these are small changes which either fix issues or just
   provide clean-up. These are mostly borrowed from
   ds/commit-graph-format-v2.
   
   
 * Patches 5-11: these provide a non-functional refactor of
   write_commit_graph() into several methods using a "struct
   write_commit_graph_context" to share across the methods.
   
   
 * Patches 12-16: Implement the split commit-graph feature.
   
   
 * Patch 17: Demonstrate the value by writing a split commit-graph during 
   git fetch when the new config setting fetch.writeCommitGraph is true.
   
   

TODO: There are several things missing that need to be added before this
series is ready for full review and merging:

 1. The documentation for git commit-graph needs updating for the --split 
    option.
    
    
 2. We likely want config settings for the merge strategy. This is mentioned
    in the design doc, and could be saved for later.
    
    
 3. We want to update the git commit-graph verify subcommand to understand
    the commit-graph stack and optionally only verify the tip of the stack.
    This allows faster (amortized) verification if we are verifying
    immediately after writes and trusting the files at rest.
    
    
 4. It would be helpful to add a new optional chunk that contains the
    trailing hash for the lower level of the commit-graph stack. This chunk
    would only be for the commit-graph-N files, and would provide a simple
    way to check that the stack is valid on read, in case we are still
    worried about other processes reading/writing in the wrong order.
    
    
 5. Currently, --split essentially implies --append since we either (a)
    don't change the existing stack and only add commits, or (b) add all
    existing commits while merging files. However, if you would use --append 
    with --split, the append logic will trigger a merge with the current tip
    (at minimum). Some care should be taken to make this more clear.
    
    

Thanks, -Stolee

[1] 
https://github.com/git/git/commit/43d356180556180b4ef6ac232a14498a5bb2b446
commit-graph write: don't die if the existing graph is corrupt

Derrick Stolee (17):
  commit-graph: fix the_repository reference
  commit-graph: return with errors during write
  commit-graph: collapse parameters into flags
  commit-graph: remove Future Work section
  commit-graph: create write_commit_graph_context
  commit-graph: extract fill_oids_from_packs()
  commit-graph: extract fill_oids_from_commit_hex()
  commit-graph: extract fill_oids_from_all_packs()
  commit-graph: extract count_distinct_commits()
  commit-graph: extract copy_oids_to_commits()
  commit-graph: extract write_commit_graph_file()
  Documentation: describe split commit-graphs
  commit-graph: lay groundwork for incremental files
  commit-graph: load split commit-graph files
  commit-graph: write split commit-graph files
  commit-graph: add --split option
  fetch: add fetch.writeCommitGraph config setting

 Documentation/technical/commit-graph.txt | 157 +++-
 builtin/commit-graph.c                   |  31 +-
 builtin/commit.c                         |   5 +-
 builtin/fetch.c                          |  17 +
 builtin/gc.c                             |   7 +-
 commit-graph.c                           | 946 ++++++++++++++++-------
 commit-graph.h                           |  19 +-
 commit.c                                 |   2 +-
 t/t5318-commit-graph.sh                  |  28 +-
 9 files changed, 887 insertions(+), 325 deletions(-)


base-commit: 93b4405ffe4ad9308740e7c1c71383bfc369baaa
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-184%2Fderrickstolee%2Fgraph%2Fincremental-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-184/derrickstolee/graph/incremental-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/184

Comments

Ævar Arnfjörð Bjarmason May 8, 2019, 7:27 p.m. UTC | #1
On Wed, May 08 2019, Derrick Stolee via GitGitGadget wrote:

> This patch series is marked as RFC quality because it is missing some key
> features and tests, but hopefully starts a concrete discussion of how the
> incremental commit-graph writes can work. If this is a good direction, then
> it would replace ds/commit-graph-format-v2.

I have some comments on 12/17 that I'll put there.

I think it would be best to start by submitting 1-11 for inclusion so we
can get minor cleanups/refactoring out of the way. I've only skimmed
those patches, but they seem to be obviously correct, although the diff
move detection (and with -w) doesn't help much with them.

This next bit sounds petty, but I honestly don't mean it that way :)

One minor thing I want to note is 04/17. The change itself I 100% agree
on (in-tree docs are bad places for TODO lists), but the commit message
*still* says that a "verify" is just as slow as "write", even though I
noted a ~40% difference in [1].

Do I care about that tiny isolated thing? Nope. But I *do* think it's
indicative of a general thing that could be improved in these RFC
iterations that I found a bit frustrating in reading through
it. I.e. you're getting some of the "C[comments]", but then there's
re-rolled patches that don't address those things.

What we say in the commit message for 4/17 obviously doesn't matter much
at all. But there's other outstanding feedback on the last iteration
that from reading this one still mostly/entirely applies.

So I'll just leave this reply at "I have a lot of comments", but that
they're still sitting there.

1. https://public-inbox.org/git/87o94mql0a.fsf@evledraar.gmail.com/