mbox series

[00/10] more miscellaneous Bloom filter improvements

Message ID cover.1596480582.git.me@ttaylorr.com (mailing list archive)
Headers show
Series more miscellaneous Bloom filter improvements | expand

Message

Taylor Blau Aug. 3, 2020, 6:57 p.m. UTC
Hi,

Here are some patches that I have been sitting on while rolling out
changed-path Bloom filters at GitHub. I was initially going to send this
as four separate topics, but they really make the most sense when sent
all together.

Here's an overview of what's contained:

  - The first patch fixes a bug where Bloom filters are only read if the
    most-recent layer in a commit-graph-chain has Bloom filters.

  - Patches 2-5 introduce 'commitgraph.readChangedPaths', a new
    configuration option for debugging changed-path Bloom filters. When
    disabled, the commit-graph machinery pretends as if there are no
    BIDX/BDAT chunks.

    This is useful when testing behavior with/without Bloom filters
    without having to regenerate the commit graph.

  - Patches 6-7 introduce a new 'BFXL' chunk which is a bitmap
    indicating which Bloom filters were too large to compute (ie., had
    more than 512 changed paths). This is a prerequisite for the final
    feature, --max-changed-paths.

  - Patches 8-10 introduces '--max-new-filters <n>', which allows
    callers to limit the number of Bloom filters that they are willing
    to compute from scratch when generating commit-graphs with
    --changed-paths.

The BFXL chunk is a prerequisite for '--max-new-filters' because of an
unfortunate overloading where filters with (1) no changed paths, (2) too
many changed paths, and (3) ones that weren't computed at all are all
represented as a length-0 filter by the BIDX chunk.

The problem is that in repositories with many commits that have too many
changed-paths to store Bloom filters, specifying '--max-new-filters'
recomputes those same large filters each commit-graph write, only to
throw them away. Knowing which filters are too-large allows us to skip
over computing filters we know are a waste.

Thanks in advance for your review!

Taylor Blau (10):
  commit-graph: introduce 'get_bloom_filter_settings()'
  commit-graph: pass a 'struct repository *' in more places
  t4216: use an '&&'-chain
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitgraph.readChangedPaths'
  commit-graph.c: sort index into commits list
  commit-graph: add large-filters bitmap chunk
  bloom: split 'get_bloom_filter()' in two
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   4 +
 .../technical/commit-graph-format.txt         |   9 +
 blame.c                                       |   8 +-
 bloom.c                                       |  34 ++-
 bloom.h                                       |   9 +-
 builtin/commit-graph.c                        |  61 +++--
 commit-graph.c                                | 214 +++++++++++++-----
 commit-graph.h                                |  17 +-
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   4 +-
 t/helper/test-bloom.c                         |   2 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t4216-log-bloom.sh                          |  54 ++++-
 t/t5324-split-commit-graph.sh                 |  13 ++
 19 files changed, 357 insertions(+), 96 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

--
2.28.0.rc1.13.ge78abce653