[PoC,00/34] An alternative modified path Bloom filters implementation

Message ID	20200529085038.26008-1-szeder.dev@gmail.com (mailing list archive)
Headers	show Return-Path: <SRS0=mJAO=7L=vger.kernel.org=git-owner@kernel.org> From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com> To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>, Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>, Taylor Blau <me@ttaylorr.com>, =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com> Subject: [PoC PATCH 00/34] An alternative modified path Bloom filters implementation Date: Fri, 29 May 2020 10:50:04 +0200 Message-Id: <20200529085038.26008-1-szeder.dev@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	An alternative modified path Bloom filters implementation \| expand [PoC,00/34] An alternative modified path Bloom filters implementation [01/34] tree-walk.c: don't match submodule entries for 'submod/anything' [02/34] commit-graph: fix parsing the Chunk Lookup table [03/34] commit-graph-format.txt: all multi-byte numbers are in network byte order [04/34] commit-slab: add a function to deep free entries on the slab [05/34] diff.h: drop diff_tree_oid() & friends' return value [06/34] commit-graph: clean up #includes [07/34] commit-graph: simplify parse_commit_graph() #1 [08/34] commit-graph: simplify parse_commit_graph() #2 [09/34] commit-graph: simplify write_commit_graph_file() #1 [10/34] commit-graph: simplify write_commit_graph_file() #2 [11/34] commit-graph: allocate the 'struct chunk_info' array dinamically [12/34] commit-graph: unify the signatures of all write_graph_chunk_*() functions [13/34] commit-graph: simplify write_commit_graph_file() #3 [14/34] commit-graph: check chunk sizes after writing [15/34] commit-graph-format.txt: document the modified path Bloom filter chunks [16/34] Add a generic and minimal Bloom filter implementation [17/34] Import a streaming-capable Murmur3 hash function implementation [18/34] commit-graph: write "empty" Modified Path Bloom Filter Index chunk [19/34] commit-graph: add commit slab for modified path Bloom filters [20/34] commit-graph: fill the Modified Path Bloom Filter Index chunk [21/34] commit-graph: load and use the Modified Path Bloom Filter Index chunk [22/34] commit-graph: write the Modified Path Bloom Filters chunk [23/34] commit-graph: load and use the Modified Path Bloom Filters chunk [24/34] commit-graph: check all leading directories in modified path Bloom filters [25/34] commit-graph: check embedded modified path Bloom filters with a mask [26/34] commit-graph: deduplicate modified path Bloom filters [27/34] commit-graph: load modified path Bloom filters for merge commits [28/34] commit-graph: write Modified Path Bloom Filter Merge Index chunk [29/34] commit-graph: extract init and free write_commit_graph_context [30/34] commit-graph: move write_commit_graph_reachable below write_commit_graph [31/34] t7007-show: make the first test compatible with the next patch [32/34] PoC commit-graph: use revision walk machinery for '--reachable' [33/34] commit-graph: write modified path Bloom filters in "history order" [34/34] commit-graph: use modified path Bloom filters with wildcards, if possible

Message ID

20200529085038.26008-1-szeder.dev@gmail.com (mailing list archive)

Headers

From: =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
 Garima Singh <garima.singh@microsoft.com>, Derrick Stolee <stolee@gmail.com>,
 Jakub Narebski <jnareb@gmail.com>, Jeff King <peff@peff.net>,
 Taylor Blau <me@ttaylorr.com>,
 =?utf-8?q?SZEDER_G=C3=A1bor?= <szeder.dev@gmail.com>
Subject: [PoC PATCH 00/34] An alternative modified path Bloom filters
 implementation
Date: Fri, 29 May 2020 10:50:04 +0200
Message-Id: <20200529085038.26008-1-szeder.dev@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

An alternative modified path Bloom filters implementation | expand

Message

SZEDER Gábor May 29, 2020, 8:50 a.m. UTC

Sigh...  but better late than never, right?

I experimented quite a bit with modified path Bloom filters a year and
more ago, and got quite far...  but my disappointment in the
inadequacy of all double hashing schemes, the arrival of split
commit-graphs, and, well, life in general has put the whole thing on
the back burner, and I haven't touched it for a couple of releases.

Now I finally managed to take a closer look at the current changed
paths Bloom filters implementation, and saw that it has some of the
same issues that I had stumbled upon and that it missed some
optimization opportunities.  Unfortunately, fixing those issues and
performing those optimizations do require a thorough format change.

So here is my proof of concept version, in all its incompleteness,
with the following benefits:

  - Better understanding of the problem it tries to optimize.
  - Better understanding of the issues with many small Bloom filters.
  - Better hashing scheme (though it should be better still).
  - Orders of magnitude lower average false positive rate.
  - Consequently, faster pathspec-limited revision walks.
  - Faster processing of the tree-diff output and lower memory usage
    while computing Bloom filters (from scratch...).
  - Optional support for storing Bloom filters for all parents of
    merge commits.
  - Deduplicates Bloom filters.
  - Supports multiple pathspecs right from the start.
  - Supports some wildcards in pathspecs.
  - Handles as many commits as the commit-graph format can.
  - It has the right name :)  The diff machinery and all its frontends
    report "modified" paths with the letter 'M', not "changed".
  - More cleanups, more bugfixes.
  - Consistent output with and without modified path Bloom filters for
    over 80k random paths in 16 repositories, even with submodules in
    them.  Well, at least on my machine, if nowhere else...

Alas, the drawbacks are significant:

  - No tests whatsoever.
  - Computes all modified path Bloom filters from scratch when
    writing, no matter what.
  - Doesn't work with split commit-graphs.
  - Basically if anything works besides 'git commit-graph write
    --reachable' it's a miracle.
  - Not a single test.
  - Many BUG()s, which should rather be graceful errors...  though I
    have to admit that at this point they are indeed bugs.
  - Many TODOs, both in commit messages and code, some incomplete
    commit messages, crappy subject lines, even missing signoffs.
  - Some ridiculously long variable, function, macro and config
    variable names.
  - It's based on v2.25.0 (no technical reason, but that's the version
    I used to run the baseline benchmarks the last time, which takes
    days...)
  - I'm pretty sure that there are more...
  - Oh, did I mention that there are no tests?


The first 14 patches are preparatory fixes and cleanups:

  01/34 tree-walk.c: don't match submodule entries for 'submod/anything'

This fix or something similar is necessary to have consistent output
with and without modified path Bloom filters for paths crossing
submodule boundary.

  02/34 commit-graph: fix parsing the Chunk Lookup table

The minimal (though not the best) fix for a bug which, I think, is as
old as the commit-graph.  I don't know how to test this.

  03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
  04/34 commit-slab: add a function to deep free entries on the slab
  05/34 diff.h: drop diff_tree_oid() & friends' return value
  06/34 commit-graph: clean up #includes

A couple of minor cleanups.

  07/34 commit-graph: simplify parse_commit_graph() #1
  08/34 commit-graph: simplify parse_commit_graph() #2

These two would be the right, though not minimal fix for the parsing
bug above.

  09/34 commit-graph: simplify write_commit_graph_file() #1
  10/34 commit-graph: simplify write_commit_graph_file() #2
  11/34 commit-graph: allocate the 'struct chunk_info' array dinamically

I think these three cleanup patches are a better alternative of
3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS, 2020-03-30),
because...

  12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
  13/34 commit-graph: simplify write_commit_graph_file() #3
  14/34 commit-graph: check chunk sizes after writing

... they laid the ground work for this patch.

  15/34 commit-graph-format.txt: document the modified path Bloom filter chunks

This is the most important one, specifying and _justifying_ the new
chunk formats.

Do grab a cup or pint of your favourite beverage and get comfy before
reading this one.  You have been warned.

  16/34 Add a generic and minimal Bloom filter implementation
  17/34 Import a streaming-capable Murmur3 hash function implementation
  18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  19/34 commit-graph: add commit slab for modified path Bloom filters
  20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk

This shows a more efficient approach to process the tree-diff output
into Bloom filters.

  21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
  22/34 commit-graph: write the Modified Path Bloom Filters chunk
  23/34 commit-graph: load and use the Modified Path Bloom Filters chunk
  24/34 commit-graph: check all leading directories in modified path Bloom filters

This was a good lightbulb moment.  It is essential to try to maintain
reasonable performance in repositories where the vast majority of
changes are concentrated to a single directory.

  25/34 commit-graph: check embedded modified path Bloom filters with a mask
  26/34 commit-graph: deduplicate modified path Bloom filters
  27/34 commit-graph: load modified path Bloom filters for merge commits
  28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
  29/34 commit-graph: extract init and free write_commit_graph_context
  30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
  31/34 t7007-show: make the first test compatible with the next patch
  32/34 PoC commit-graph: use revision walk machinery for '--reachable'

Once upon a time I thought this was the greatest idea ever, but as
time goes by I get more and more concerned that this is a really dumb
idea, though don't yet know why.

  33/34 commit-graph: write modified path Bloom filters in "history order"
  34/34 commit-graph: use modified path Bloom filters with wildcards, if possible

Finally a cherry on top.



SZEDER Gábor (34):
  tree-walk.c: don't match submodule entries for 'submod/anything'
  commit-graph: fix parsing the Chunk Lookup table
  commit-graph-format.txt: all multi-byte numbers are in network byte
    order
  commit-slab: add a function to deep free entries on the slab
  diff.h: drop diff_tree_oid() & friends' return value
  commit-graph: clean up #includes
  commit-graph: simplify parse_commit_graph() #1
  commit-graph: simplify parse_commit_graph() #2
  commit-graph: simplify write_commit_graph_file() #1
  commit-graph: simplify write_commit_graph_file() #2
  commit-graph: allocate the 'struct chunk_info' array dinamically
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify write_commit_graph_file() #3
  commit-graph: check chunk sizes after writing
  commit-graph-format.txt: document the modified path Bloom filter
    chunks
  Add a generic and minimal Bloom filter implementation
  Import a streaming-capable Murmur3 hash function implementation
  commit-graph: write "empty" Modified Path Bloom Filter Index chunk
  commit-graph: add commit slab for modified path Bloom filters
  commit-graph: fill the Modified Path Bloom Filter Index chunk
  commit-graph: load and use the Modified Path Bloom Filter Index chunk
  commit-graph: write the Modified Path Bloom Filters chunk
  commit-graph: load and use the Modified Path Bloom Filters chunk
  commit-graph: check all leading directories in modified path Bloom
    filters
  commit-graph: check embedded modified path Bloom filters with a mask
  commit-graph: deduplicate modified path Bloom filters
  commit-graph: load modified path Bloom filters for merge commits
  commit-graph: write Modified Path Bloom Filter Merge Index chunk
  commit-graph: extract init and free write_commit_graph_context
  commit-graph: move write_commit_graph_reachable below
    write_commit_graph
  t7007-show: make the first test compatible with the next patch
  PoC commit-graph: use revision walk machinery for '--reachable'
  commit-graph: write modified path Bloom filters in "history order"
  commit-graph: use modified path Bloom filters with wildcards, if
    possible

 Documentation/config/core.txt                 |   19 +
 .../technical/commit-graph-format.txt         |  127 +-
 Makefile                                      |    2 +
 bloom-filter.c                                |   91 ++
 bloom-filter.h                                |   47 +
 commit-graph.c                                | 1239 +++++++++++++++--
 commit-graph.h                                |   24 +-
 commit-slab-decl.h                            |    1 +
 commit-slab-impl.h                            |   13 +
 commit-slab.h                                 |   10 +
 compat/PMurHash.c                             |  291 ++++
 compat/PMurHash.h                             |   62 +
 diff.h                                        |   10 +-
 pathspec.c                                    |   10 +
 pathspec.h                                    |   13 +
 revision.c                                    |   27 +-
 shallow.c                                     |   14 +-
 t/t4010-diff-pathspec.sh                      |    4 +-
 t/t5318-commit-graph.sh                       |    3 +-
 t/t7007-show.sh                               |    7 +-
 tree-diff.c                                   |   21 +-
 tree-walk.c                                   |    9 +-
 22 files changed, 1872 insertions(+), 172 deletions(-)
 create mode 100644 bloom-filter.c
 create mode 100644 bloom-filter.h
 create mode 100644 compat/PMurHash.c
 create mode 100644 compat/PMurHash.h

Comments

Johannes Schindelin May 28, 2020, 10:09 p.m. UTC | #1

Hi Gábor,

On Fri, 29 May 2020, SZEDER Gábor wrote:

> Sigh...  but better late than never, right?

Well, incremental patches on top of what is _already_ in Git are _even_
better.

> I experimented quite a bit with modified path Bloom filters a year and
> more ago, and got quite far...  but my disappointment in the
> inadequacy of all double hashing schemes, the arrival of split
> commit-graphs, and, well, life in general has put the whole thing on
> the back burner, and I haven't touched it for a couple of releases.
>
> Now I finally managed to take a closer look at the current changed
> paths Bloom filters implementation, and saw that it has some of the
> same issues that I had stumbled upon and that it missed some
> optimization opportunities.  Unfortunately, fixing those issues and
> performing those optimizations do require a thorough format change.

Thank you for this background information.

Your claim that there are opportunities to optimize, as well as your claim
that it requires a thorough format change, strike me as best backed up
with evidence by porting your optimizations on top of what is already in
`master` and which has been reviewed _already_ (as opposed to your new
patch series, which essentially threatens to throw away all the time
people spent on reviewing Garima's patches).

> So here is my proof of concept version, in all its incompleteness,
> with the following benefits:
>
>   - Better understanding of the problem it tries to optimize.
>   - Better understanding of the issues with many small Bloom filters.
>   - Better hashing scheme (though it should be better still).
>   - Orders of magnitude lower average false positive rate.
>   - Consequently, faster pathspec-limited revision walks.
>   - Faster processing of the tree-diff output and lower memory usage
>     while computing Bloom filters (from scratch...).
>   - Optional support for storing Bloom filters for all parents of
>     merge commits.
>   - Deduplicates Bloom filters.
>   - Supports multiple pathspecs right from the start.
>   - Supports some wildcards in pathspecs.
>   - Handles as many commits as the commit-graph format can.
>   - It has the right name :)  The diff machinery and all its frontends
>     report "modified" paths with the letter 'M', not "changed".
>   - More cleanups, more bugfixes.
>   - Consistent output with and without modified path Bloom filters for
>     over 80k random paths in 16 repositories, even with submodules in
>     them.  Well, at least on my machine, if nowhere else...

It strikes me as an obvious idea to make all those improvements in an
incremental manner.

> Alas, the drawbacks are significant:
>
>   - No tests whatsoever.
>   - Computes all modified path Bloom filters from scratch when
>     writing, no matter what.
>   - Doesn't work with split commit-graphs.
>   - Basically if anything works besides 'git commit-graph write
>     --reachable' it's a miracle.
>   - Not a single test.

You said that already in the first bullet point. But yes, I get it, there
are no tests.

>   - Many BUG()s, which should rather be graceful errors...  though I
>     have to admit that at this point they are indeed bugs.
>   - Many TODOs, both in commit messages and code, some incomplete
>     commit messages, crappy subject lines, even missing signoffs.
>   - Some ridiculously long variable, function, macro and config
>     variable names.
>   - It's based on v2.25.0 (no technical reason, but that's the version
>     I used to run the baseline benchmarks the last time, which takes
>     days...)
>   - I'm pretty sure that there are more...
>   - Oh, did I mention that there are no tests?

Ah, your patch series has no tests.

Please do understand that it can be perceived as quite frustrating that
an alternative is presented _this_ late, when already such a lot of effort
has gone into not only iterating several times on the patch series but
also into reviewing all of them.

I don't really think that I want to spend the time to review this
alternative (not that I am an expert in the space) because it would imply
that I help invalidating the effort that went into the current
implementation.

All this is to say: I appreciate your efforts to improve the path-based
Bloom filters, at the same time I wish that those efforts would happen in
a collaborative manner instead of simply dismissing other people's work.

Ciao,
Johannes

Derrick Stolee May 29, 2020, 1:59 p.m. UTC | #2

On 5/29/2020 4:50 AM, SZEDER Gábor wrote:
> Sigh...  but better late than never, right?

True. The best time to present these ideas would be while the series
was under review, which was a reasonable amount of time. But there's
no helping that, so what can we learn from your extensive work here?

> I experimented quite a bit with modified path Bloom filters a year and
> more ago, and got quite far...  but my disappointment in the
> inadequacy of all double hashing schemes, the arrival of split
> commit-graphs, and, well, life in general has put the whole thing on
> the back burner, and I haven't touched it for a couple of releases.
> 
> Now I finally managed to take a closer look at the current changed
> paths Bloom filters implementation, and saw that it has some of the
> same issues that I had stumbled upon and that it missed some
> optimization opportunities.  Unfortunately, fixing those issues and
> performing those optimizations do require a thorough format change.

My initial reaction is "it's too late for a format change now" but
that's not quite true. Even if we ship v2.27.0 with the existing
format, the data is in optional chunks. It would not be too difficult
to replace the storage with a different set of optional chunks and
stop reading the old ones. The question remains: exactly _how much
better_ is your implementation, and is it enough to justify the
upheaval?

The other thing that impacts this consideration is how complicated
the format becomes. Bloom filters are already complicated, so how
can we keep things simple when possible?

> So here is my proof of concept version, in all its incompleteness,
> with the following benefits:

These are all very interesting items. Here is my initial categorization:

These are bold claims that I hope you have end-to-end performance
numbers to back them up. Depending on how much the benefit is, then
we can consider a replacing.

>   - Better hashing scheme (though it should be better still).
>   - Orders of magnitude lower average false positive rate.
>   - Consequently, faster pathspec-limited revision walks.

These bullets seem like they may apply to the existing implementation,
since they seem unrelated to the format itself. Is it possible to
contribute these ideas to the existing implementation first?

>   - Faster processing of the tree-diff output and lower memory usage
>     while computing Bloom filters (from scratch...).
>   - Supports multiple pathspecs right from the start.
>   - Supports some wildcards in pathspecs.
>   - More cleanups, more bugfixes.

These items are confusing to me. I either don't know what they mean,
or doubt the value of having them present.

>   - Better understanding of the problem it tries to optimize.
>   - Better understanding of the issues with many small Bloom filters.
>   - Deduplicates Bloom filters.
>   - It has the right name :)  The diff machinery and all its frontends
>     report "modified" paths with the letter 'M', not "changed".
>   - Handles as many commits as the commit-graph format can>   - Optional support for storing Bloom filters for all parents of
>     merge commits.

(This last item in particular is one where I have already communicated
why there is little value in storing filters for diffs against a
second parent. Perhaps you have some concrete numbers on what happens
in terms of storage-vs-performance when adding these diffs.)

> Alas, the drawbacks are significant:
> 
>   - No tests whatsoever.
>   - Computes all modified path Bloom filters from scratch when
>     writing, no matter what.
>   - Doesn't work with split commit-graphs.
>   - Basically if anything works besides 'git commit-graph write
>     --reachable' it's a miracle.
>   - Not a single test.
>   - Many BUG()s, which should rather be graceful errors...  though I
>     have to admit that at this point they are indeed bugs.
>   - Many TODOs, both in commit messages and code, some incomplete
>     commit messages, crappy subject lines, even missing signoffs.
>   - Some ridiculously long variable, function, macro and config
>     variable names.
>   - It's based on v2.25.0 (no technical reason, but that's the version
>     I used to run the baseline benchmarks the last time, which takes
>     days...)
>   - I'm pretty sure that there are more...
>   - Oh, did I mention that there are no tests?

And now you get to the _real_ hard part of this feature. It is easy to
create a prototype that implements the idea, but it's another step to
make it a full contribution. You clearly know this, which I expect is
the reason this was not submitted earlier.

> 
> The first 14 patches are preparatory fixes and cleanups:
> 
>   01/34 tree-walk.c: don't match submodule entries for 'submod/anything'
> 
> This fix or something similar is necessary to have consistent output
> with and without modified path Bloom filters for paths crossing
> submodule boundary.
> 
>   02/34 commit-graph: fix parsing the Chunk Lookup table
> 
> The minimal (though not the best) fix for a bug which, I think, is as
> old as the commit-graph.  I don't know how to test this.
> 
>   03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
>   04/34 commit-slab: add a function to deep free entries on the slab
>   05/34 diff.h: drop diff_tree_oid() & friends' return value
>   06/34 commit-graph: clean up #includes
> 
> A couple of minor cleanups.
> 
>   07/34 commit-graph: simplify parse_commit_graph() #1
>   08/34 commit-graph: simplify parse_commit_graph() #2
> 
> These two would be the right, though not minimal fix for the parsing
> bug above.
> 
>   09/34 commit-graph: simplify write_commit_graph_file() #1
>   10/34 commit-graph: simplify write_commit_graph_file() #2
>   11/34 commit-graph: allocate the 'struct chunk_info' array dinamically
> 
> I think these three cleanup patches are a better alternative of
> 3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS, 2020-03-30),
> because...
> 
>   12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
>   13/34 commit-graph: simplify write_commit_graph_file() #3
>   14/34 commit-graph: check chunk sizes after writing

These 14 patches are probably worth submitting on their own. I hope they
still apply cleanly to a recent master.

> ... they laid the ground work for this patch.
> 
>   15/34 commit-graph-format.txt: document the modified path Bloom filter chunks
> 
> This is the most important one, specifying and _justifying_ the new
> chunk formats.
> 
> Do grab a cup or pint of your favourite beverage and get comfy before
> reading this one.  You have been warned.
> 
>   16/34 Add a generic and minimal Bloom filter implementation
>   17/34 Import a streaming-capable Murmur3 hash function implementation
>   18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
>   19/34 commit-graph: add commit slab for modified path Bloom filters
>   20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk
> 
> This shows a more efficient approach to process the tree-diff output
> into Bloom filters.
> 
>   21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
>   22/34 commit-graph: write the Modified Path Bloom Filters chunk
>   23/34 commit-graph: load and use the Modified Path Bloom Filters chunk

Ok, these implement the new filter format.

>   24/34 commit-graph: check all leading directories in modified path Bloom filters
> 
> This was a good lightbulb moment.  It is essential to try to maintain
> reasonable performance in repositories where the vast majority of
> changes are concentrated to a single directory.

This is really interesting! This could apply to the existing format, so
when you compare formats this patch should either be in both or in neither.

>   25/34 commit-graph: check embedded modified path Bloom filters with a mask
>   26/34 commit-graph: deduplicate modified path Bloom filters
>   27/34 commit-graph: load modified path Bloom filters for merge commits
>   28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
>   29/34 commit-graph: extract init and free write_commit_graph_context
>   30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
>   31/34 t7007-show: make the first test compatible with the next patch
>   32/34 PoC commit-graph: use revision walk machinery for '--reachable'
> 
> Once upon a time I thought this was the greatest idea ever, but as
> time goes by I get more and more concerned that this is a really dumb
> idea, though don't yet know why.

Ok, here are some extra patches that implement your extra features. The
multi-parent filters are a fair thing to compare against the existing
implementation, but we need to compare file size changes, extra computation
time changes, and how much the end-to-end rev-list performance changes.
Are the costs justified?

>   33/34 commit-graph: write modified path Bloom filters in "history order"

I think we've covered this in these two patches:

3d112755056 commit-graph: examine commits by generation number
d21ee7d1110 commit-graph: examine changed-path objects in pack order

>   34/34 commit-graph: use modified path Bloom filters with wildcards, if possible
> 
> Finally a cherry on top.

This is also something that could be interesting for the current
implementation. Perhaps group it with patches 1-14 and 24, if the idea
applies to the existing implementation.

I think it would be great to apply whatever code cleanups and fixes
that you've presented here to the existing implementation _first_,
then we have the optimal way to compare how your proposed format
improves over the existing implementation.

I can't commit more time to this series today, but hopefully the discussion
is just getting started. I'll try to poke around some of the patches
next week.

Thanks,
-Stolee

Taylor Blau June 1, 2020, 11:25 p.m. UTC | #3

On Fri, May 29, 2020 at 10:50:04AM +0200, SZEDER Gábor wrote:
> Sigh...  but better late than never, right?

Yes, indeed. I think that there is a balance here: I'm thrilled that you
are choosing to spend your time working on and improving the
changed-path Bloom filter implementation.

Of course, it couldn't have hurt to have these ideas earlier when the
list was more focused on reviewing Garima's original patches. But,
nothing is set in stone, and it seems like there are some re-usable
ideas and clean-ups below.

I'll respond to the patch descriptions below with a "cut-line" where I
think we could extract and apply what's already there before putting the
rest aside to be worked on more.

> I experimented quite a bit with modified path Bloom filters a year and
> more ago, and got quite far...  but my disappointment in the
> inadequacy of all double hashing schemes, the arrival of split
> commit-graphs, and, well, life in general has put the whole thing on
> the back burner, and I haven't touched it for a couple of releases.
>
> Now I finally managed to take a closer look at the current changed
> paths Bloom filters implementation, and saw that it has some of the
> same issues that I had stumbled upon and that it missed some
> optimization opportunities.  Unfortunately, fixing those issues and
> performing those optimizations do require a thorough format change.
>
> So here is my proof of concept version, in all its incompleteness,
> with the following benefits:
>
>   - Better understanding of the problem it tries to optimize.
>   - Better understanding of the issues with many small Bloom filters.
>   - Better hashing scheme (though it should be better still).
>   - Orders of magnitude lower average false positive rate.
>   - Consequently, faster pathspec-limited revision walks.
>   - Faster processing of the tree-diff output and lower memory usage
>     while computing Bloom filters (from scratch...).
>   - Optional support for storing Bloom filters for all parents of
>     merge commits.
>   - Deduplicates Bloom filters.
>   - Supports multiple pathspecs right from the start.
>   - Supports some wildcards in pathspecs.
>   - Handles as many commits as the commit-graph format can.
>   - It has the right name :)  The diff machinery and all its frontends
>     report "modified" paths with the letter 'M', not "changed".
>   - More cleanups, more bugfixes.
>   - Consistent output with and without modified path Bloom filters for
>     over 80k random paths in 16 repositories, even with submodules in
>     them.  Well, at least on my machine, if nowhere else...
>
> Alas, the drawbacks are significant:
>
>   - No tests whatsoever.

Yes, obviously this needs to be addressed, but I certainly don't fault
you for posting what you have (thanks for adding the 'PoC' prefix to
indicate it as such).

>   - Computes all modified path Bloom filters from scratch when
>     writing, no matter what.
>   - Doesn't work with split commit-graphs.

I originally thought that this would be a non-starter, but it may not
be. GitHub doesn't write Bloom filters at all (yet), but one
consideration of ours is to only generate Bloom filters when we roll-up
all of the split graphs.

Our invocation during this bigger repository maintenance is something
like:

  $ git commit-graph write --reachable --split=replace

But if it turns out to be expensive to run 'git commit-graph write
--split=no-merge --stdin-commits --write --changed-paths' on every push,
we might just run the above with '--changed-paths'.

I can imagine other schemes where it would be desirable to have your
Bloom filter implementation over split graphs, in which case this should
be addressed. But, I don't have a problem with a version that only works
for non-split graphs first, so long as there isn't a compatibility
boundary between that and a version that does work for split graphs.

>   - Basically if anything works besides 'git commit-graph write
>     --reachable' it's a miracle.
>   - Not a single test.
>   - Many BUG()s, which should rather be graceful errors...  though I
>     have to admit that at this point they are indeed bugs.
>   - Many TODOs, both in commit messages and code, some incomplete
>     commit messages, crappy subject lines, even missing signoffs.
>   - Some ridiculously long variable, function, macro and config
>     variable names.
>   - It's based on v2.25.0 (no technical reason, but that's the version
>     I used to run the baseline benchmarks the last time, which takes
>     days...)
>   - I'm pretty sure that there are more...
>   - Oh, did I mention that there are no tests?

...I think so ;-).

>
> The first 14 patches are preparatory fixes and cleanups:
>
>   01/34 tree-walk.c: don't match submodule entries for 'submod/anything'
>
> This fix or something similar is necessary to have consistent output
> with and without modified path Bloom filters for paths crossing
> submodule boundary.
>
>   02/34 commit-graph: fix parsing the Chunk Lookup table
>
> The minimal (though not the best) fix for a bug which, I think, is as
> old as the commit-graph.  I don't know how to test this.
>
>   03/34 commit-graph-format.txt: all multi-byte numbers are in network byte order
>   04/34 commit-slab: add a function to deep free entries on the slab
>   05/34 diff.h: drop diff_tree_oid() & friends' return value
>   06/34 commit-graph: clean up #includes
>
> A couple of minor cleanups.
>
>   07/34 commit-graph: simplify parse_commit_graph() #1
>   08/34 commit-graph: simplify parse_commit_graph() #2
>
> These two would be the right, though not minimal fix for the parsing
> bug above.
>
>   09/34 commit-graph: simplify write_commit_graph_file() #1
>   10/34 commit-graph: simplify write_commit_graph_file() #2
>   11/34 commit-graph: allocate the 'struct chunk_info' array dinamically
>
> I think these three cleanup patches are a better alternative of
> 3be7efcafc (commit-graph: define and use MAX_NUM_CHUNKS, 2020-03-30),
> because...
>
>   12/34 commit-graph: unify the signatures of all write_graph_chunk_*() functions
>   13/34 commit-graph: simplify write_commit_graph_file() #3
>   14/34 commit-graph: check chunk sizes after writing

I think you're right to draw the "laying the ground work" line here.
Could these first fourteen patches be applied cleanly to master? They
all look mostly like improvements to me, especially the second patch.

> ... they laid the ground work for this patch.
>
>   15/34 commit-graph-format.txt: document the modified path Bloom filter chunks
>
> This is the most important one, specifying and _justifying_ the new
> chunk formats.
>
> Do grab a cup or pint of your favourite beverage and get comfy before
> reading this one.  You have been warned.
>
>   16/34 Add a generic and minimal Bloom filter implementation
>   17/34 Import a streaming-capable Murmur3 hash function implementation
>   18/34 commit-graph: write "empty" Modified Path Bloom Filter Index chunk
>   19/34 commit-graph: add commit slab for modified path Bloom filters
>   20/34 commit-graph: fill the Modified Path Bloom Filter Index chunk
>
> This shows a more efficient approach to process the tree-diff output
> into Bloom filters.
>
>   21/34 commit-graph: load and use the Modified Path Bloom Filter Index chunk
>   22/34 commit-graph: write the Modified Path Bloom Filters chunk
>   23/34 commit-graph: load and use the Modified Path Bloom Filters chunk
>   24/34 commit-graph: check all leading directories in modified path Bloom filters
>
> This was a good lightbulb moment.  It is essential to try to maintain
> reasonable performance in repositories where the vast majority of
> changes are concentrated to a single directory.
>
>   25/34 commit-graph: check embedded modified path Bloom filters with a mask
>   26/34 commit-graph: deduplicate modified path Bloom filters
>   27/34 commit-graph: load modified path Bloom filters for merge commits
>   28/34 commit-graph: write Modified Path Bloom Filter Merge Index chunk
>   29/34 commit-graph: extract init and free write_commit_graph_context
>   30/34 commit-graph: move write_commit_graph_reachable below write_commit_graph
>   31/34 t7007-show: make the first test compatible with the next patch
>   32/34 PoC commit-graph: use revision walk machinery for '--reachable'
>
> Once upon a time I thought this was the greatest idea ever, but as
> time goes by I get more and more concerned that this is a really dumb
> idea, though don't yet know why.
>
>   33/34 commit-graph: write modified path Bloom filters in "history order"
>   34/34 commit-graph: use modified path Bloom filters with wildcards, if possible
>
> Finally a cherry on top.

The next twenty patches look like they need some more work. At a
high-level, I think we need to do the following things, in roughly the
following order:

  - Prove that this is a strict performance improvement on the existing
    Bloom filter implementation.

  - Sketch out a way that it would work for cases where '--split' is
    provided, and/or '--reachable' isn't provided.

  - Clean up the existing patches, adding tests which exercise the new
    functionality, and prove that there aren't regressions against the
    existing Bloom filter implementation.

  - Review and repeat.

I guess I'm left wondering whether or not this is something that you
want to continue, given that I outlined a pretty lengthy path for
getting this in a suitable shape to keep going.

Are you planning on keeping working on this?

Thanks,
Taylor

Junio C Hamano June 2, 2020, 5:08 p.m. UTC | #4

Taylor Blau <me@ttaylorr.com> writes:

> On Fri, May 29, 2020 at 10:50:04AM +0200, SZEDER Gábor wrote:
>> Sigh...  but better late than never, right?
>
> Yes, indeed. I think that there is a balance here: I'm thrilled that you
> are choosing to spend your time working on and improving the
> changed-path Bloom filter implementation.
>
> Of course, it couldn't have hurt to have these ideas earlier when the
> list was more focused on reviewing Garima's original patches. But,
> nothing is set in stone, and it seems like there are some re-usable
> ideas and clean-ups below.

Yes, I had the same impression as you did, unlike some folks who
sounded as if they felt offended seeing a comment that sabotages
existing work.  It could have been presented in a more useful ways
(i.e. instead of risking to appear suggesting total replacement,
which I do not think was the intention, massaged to build on top of
what is already there as improvements), though.

> I think you're right to draw the "laying the ground work" line here.
> Could these first fourteen patches be applied cleanly to master? They
> all look mostly like improvements to me, especially the second patch.

Again, I concur.

Thanks.