[v3,0/4] Speed up connectivity checks

Message ID: cover.1627896460.git.ps@pks.im

Patrick Steinhardt Aug. 2, 2021, 9:37 a.m. UTC
Hi,

I finally found some time again to have another look at my old problem
of slow connectivity checks. After my previous two approaches of using
the quarantine directory and using bitmaps proved to not really be
viable, I've taken a step back yet again. The result is this series,
which speeds up the connectivity checks by optimizing "revision.c". More
specifically, I'm mostly tweaking how we're queueing up references,
which is the most pressing issue we've observed at GitLab when doing
connectivity checks in repos with many references.

The following optimizations are part of this series. All benchmarks were
done on [1], which is a repository with about 2.2 million references
(even though most of them are hidden to public users) with `git rev-list
--objects --quiet --unsorted-input --not --all --not $newrev`.

    1. We used to sort the input references in git-rev-list(1). This is
       moot in the context of connectivity checks, so a new flag
       suppresses this sorting. This improves the command by ~30% from
       7.6s to 4.9s.

    2. We did some busy-work, loading each reference twice via
       `get_reference()`. The redundant load is now gone, resulting in
       a ~8% speedup from 5.0s to 4.6s.

    3. We now load objects more cheaply. Previously, we always called
       `oid_object_info()`, even if we had already loaded the object.
       We now use `lookup_unknown_object()` instead, which is a
       performance-memory tradeoff. This saves us another 7%, going
       from 4.7s to 4.4s, and it's also a prereq for (4).

    4. We now make better use of the commit-graph: we first try to load
       a commit from there and only fall back to the ODB if that
       fails. This is a 40% speedup, going from 4.4s to 2.8s.
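
As a toy illustration of the lookup order described in (4), here is a
rough Python sketch. It is not the C code in "commit-graph.c" or
"revision.c": the names `ToyStore`, `load_commit` and `odb_reads` are
made up, and plain dictionaries stand in for the commit-graph and the
ODB. It only shows the idea of asking the cheap index first and touching
the expensive store on a miss.

```python
class ToyStore:
    """Hypothetical stand-in for a repository; not Git's real API."""

    def __init__(self, commit_graph, odb):
        self.commit_graph = commit_graph  # cheap index: oid -> list of parent oids
        self.odb = odb                    # expensive store: oid -> (type, parents)
        self.odb_reads = 0                # counts how often we hit the slow path

    def load_commit(self, oid):
        """Return the parents of `oid`, preferring the commit-graph."""
        parents = self.commit_graph.get(oid)
        if parents is not None:
            # Fast path: the commit is covered by the commit-graph.
            return parents

        # Slow path: fall back to the object database.
        self.odb_reads += 1
        obj_type, parents = self.odb[oid]
        if obj_type != "commit":
            raise ValueError(f"{oid} is a {obj_type}, not a commit")
        return parents


store = ToyStore(
    commit_graph={"a": ["b"], "b": []},
    odb={
        "a": ("commit", ["b"]),
        "b": ("commit", []),
        "c": ("commit", ["a"]),  # newer commit, not yet in the commit-graph
    },
)

parents_a = store.load_commit("a")  # served from the commit-graph
parents_c = store.load_commit("c")  # falls back to the ODB
print(parents_a, parents_c, store.odb_reads)
```

In a repository where most reference tips are already covered by the
commit-graph, almost all lookups take the fast path in this model, which
is where the saving would come from.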

The result is a speedup of about 65%. The nice thing compared to
previous versions is that this should also be visible when directly
executing git-rev-list(1) or doing a revwalk.
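
For anyone who wants to redo the arithmetic, the following small Python
sketch computes the relative runtime reduction from the timings quoted
in the list above. The helper name `reduction_percent` is made up for
this example, and the overall figure assumes we compare the very first
"before" timing with the very last "after" timing.

```python
def reduction_percent(before_s: float, after_s: float) -> float:
    """Relative runtime reduction in percent."""
    return (before_s - after_s) / before_s * 100.0


# (before, after) timings in seconds, as quoted for steps 1 to 4.
timings = [(7.6, 4.9), (5.0, 4.6), (4.7, 4.4), (4.4, 2.8)]

per_step = [reduction_percent(before, after) for before, after in timings]
overall = reduction_percent(timings[0][0], timings[-1][1])

for (before, after), pct in zip(timings, per_step):
    print(f"{before:.1f}s -> {after:.1f}s: {pct:.1f}% less time")
print(f"overall: {overall:.1f}% less time")
```

Note that the percentages in the text are rounded, and that the "after"
of one step does not always match the "before" of the next, so the
per-step values only roughly compose into the overall one.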

Patch #1 is still missing documentation and will need some polishing if
we agree that this patch series makes sense.

Patrick

[1]: https://gitlab.com/gitlab-org/gitlab.git

Patrick Steinhardt (4):
  connected: do not sort input revisions
  revision: stop retrieving reference twice
  revision: avoid loading object headers multiple times
  revision: avoid hitting packfiles when commits are in commit-graph

 commit-graph.c | 55 +++++++++++++++++++++++++++++++++++++++-----------
 commit-graph.h |  2 ++
 connected.c    |  1 +
 revision.c     | 23 ++++++++++++++++-----
 revision.h     |  1 +
 5 files changed, 65 insertions(+), 17 deletions(-)