[RFC,0/6] bloom: reuse existing Bloom filters when possible during upgrade

Message ID	cover.1691426160.git.me@ttaylorr.com (mailing list archive)
Headers	show Return-Path: <git-owner@vger.kernel.org> Date: Mon, 7 Aug 2023 12:37:42 -0400 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: Derrick Stolee <derrickstolee@github.com>, Jonathan Tan <jonathantanmy@google.com>, Junio C Hamano <gitster@pobox.com> Subject: [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible during upgrade Message-ID: <cover.1691426160.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit Precedence: bulk
Series	bloom: reuse existing Bloom filters when possible during upgrade \| expand [RFC,0/6] bloom: reuse existing Bloom filters when possible during upgrade [RFC,1/6] bloom: annotate filters with hash version [RFC,2/6] bloom: prepare to discard incompatible Bloom filters [RFC,3/6] t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()` [RFC,4/6] commit-graph.c: unconditionally load Bloom filters [RFC,5/6] object.h: fix mis-aligned flag bits table [RFC,6/6] commit-graph: reuse existing Bloom filters where possible

Message ID

cover.1691426160.git.me@ttaylorr.com (mailing list archive)

Headers

Date: Mon, 7 Aug 2023 12:37:42 -0400
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <derrickstolee@github.com>,
        Jonathan Tan <jonathantanmy@google.com>,
        Junio C Hamano <gitster@pobox.com>
Subject: [RFC PATCH 0/6] bloom: reuse existing Bloom filters when possible
 during upgrade
Message-ID: <cover.1691426160.git.me@ttaylorr.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

bloom: reuse existing Bloom filters when possible during upgrade | expand

Message

Taylor Blau Aug. 7, 2023, 4:37 p.m. UTC

This series is based off of 'jt/path-filter-fix'.

These few patches implement an idea that we discussed in [1], where we
attempt to reuse existing Bloom filters during an upgrade from v1 to v2
Bloom filters while rewriting the commit-graph.

The core idea is that Bloom filters are reusable when there aren't any
non-ASCII paths in a commit's tree-diff against its first parent (or the
empty tree, if none exists). If we assume that a commit's tree-diff
meets those conditions, we can't conclude anything about whether either
tree contains non-ASCII characters, since they could be unmodified on
either side and thus excluded from the tree-diff.

But assuming the RHS (that there aren't any non-ASCII characters present
in the tree's path set) *does* give us that there aren't any such paths
present in the first-parent tree diff, either.

This series checks whether or not commits meet that criteria, and reuses
the existing Bloom filter (if one exists) when possible. In practice, we
end up visiting relatively few trees, since we mark trees we've already
visited.

On both linux.git and git.git, this series gives a significant speed-up
when upgrading Bloom filters from v1 to v2. On linux.git:

    Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
      Range (min … max):   124.621 s … 125.227 s    3 runs

    Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
      Range (min … max):   79.112 s … 79.437 s    3 runs

    Summary
      'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
        1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'

On git.git (where we do have some non-ASCII paths), the change goes from
4.163 seconds to 3.348 seconds, for a 1.24x speed-up.

I'm sending this as an RFC, since we are in the middle of the -rc phase,
and 'jt/path-filter-fix' isn't expected[2] to merge into 'master' until
we're on the other side of 2.42.

The structure of this series is as follows:

  - The first three patches prepare to load the `BDAT` chunk, even when
    the graph's Bloom filter settings are incompatible with the value in
    `commitGraph.changedPathsVersion`.
  - The fourth patch begins loading `BDAT` chunks unconditionally.
  - The fifth patch is a clean-up.
  - The sixth and final patch implements the approach discussed above.

Thanks in advance for your thoughts and review :-).

[1]: https://lore.kernel.org/git/ZMKvsObx+uaKA8zF@nand.local/
[2]: https://lore.kernel.org/git/xmqqy1it6ykm.fsf@gitster.g/

Taylor Blau (6):
  bloom: annotate filters with hash version
  bloom: prepare to discard incompatible Bloom filters
  t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
  commit-graph.c: unconditionally load Bloom filters
  object.h: fix mis-aligned flag bits table
  commit-graph: reuse existing Bloom filters where possible

 bloom.c              | 117 +++++++++++++++++++++++++++++++++++++++++--
 bloom.h              |  22 +++++++-
 commit-graph.c       |  24 +++++----
 object.h             |   3 +-
 t/t4216-log-bloom.sh |  49 ++++++++++++++++--
 5 files changed, 195 insertions(+), 20 deletions(-)

Comments

Jonathan Tan Aug. 11, 2023, 10:13 p.m. UTC | #1

Taylor Blau <me@ttaylorr.com> writes:
> On both linux.git and git.git, this series gives a significant speed-up
> when upgrading Bloom filters from v1 to v2. On linux.git:
> 
>     Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
>       Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
>       Range (min … max):   124.621 s … 125.227 s    3 runs
> 
>     Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
>       Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
>       Range (min … max):   79.112 s … 79.437 s    3 runs
> 
>     Summary
>       'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
>         1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'
> 
> On git.git (where we do have some non-ASCII paths), the change goes from
> 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.

My main concern is that this modifies the code somewhat pervasively
(tracking the version of Bloom filters and removing assumptions about
what Bloom filter versions are in memory) in return for what I think
is a small speedup, when considering that we will perform this
operation only once for a given repo. I'll defer to server operators
on this (or other people handling large numbers of repos), though.

Putting that concern aside, I've reviewed the code and assuming that
we're OK with modifying the code in this way, this patch set looks good
to me, and hopefully my review will be of some help.

Taylor Blau Aug. 21, 2023, 8:46 p.m. UTC | #2

On Fri, Aug 11, 2023 at 03:13:37PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > On both linux.git and git.git, this series gives a significant speed-up
> > when upgrading Bloom filters from v1 to v2. On linux.git:
> >
> >     Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit
> >       Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
> >       Range (min … max):   124.621 s … 125.227 s    3 runs
> >
> >     Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
> >       Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
> >       Range (min … max):   79.112 s … 79.437 s    3 runs
> >
> >     Summary
> >       'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
> >         1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'
> >
> > On git.git (where we do have some non-ASCII paths), the change goes from
> > 4.163 seconds to 3.348 seconds, for a 1.24x speed-up.
>
> My main concern is that this modifies the code somewhat pervasively
> (tracking the version of Bloom filters and removing assumptions about
> what Bloom filter versions are in memory) in return for what I think
> is a small speedup, when considering that we will perform this
> operation only once for a given repo. I'll defer to server operators
> on this (or other people handling large numbers of repos), though.

Yeah, I think that this is certainly on the riskier side of things. But
I have confidence in our test coverage here, and feel better with your
thorough review.

FWIW, I think that the speed-up is worthwhile (though I'm obviously
biased here). When we were rolling out changed-path Bloom filters to
repositories on GitHub.com, it was apparent that computing them from
scratch in a single shot was infeasible. This led to 809e0327f5
(builtin/commit-graph.c: introduce '--max-new-filters=<n>', 2020-09-18).

That would of course still save us here if we didn't have an upgrade
path, since each recomputed filter would count against the maximum
number we can generate in a single run.

We could introduce a configuration knob, but I'd rather avoid it if
possible. That can also be done on top in a later patch, which should be
easy enough to do if we hear of a compelling use-case for wanting to
disable the upgrade path here.

> Putting that concern aside, I've reviewed the code and assuming that
> we're OK with modifying the code in this way, this patch set looks good
> to me, and hopefully my review will be of some help.

Thanks very much -- it was more than helpful :-). I'll collect up this
and your previous patches (as we have discussed off-list) and tie
everything together in a single series for the maintainer.

Thanks,
Taylor