[0/9,RFC] Changed Paths Bloom Filters

Message ID	pull.497.git.1576879520.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <SRS0=PG55=2K=vger.kernel.org=git-owner@kernel.org> Message-Id: <pull.497.git.1576879520.gitgitgadget@gmail.com> From: "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> Date: Fri, 20 Dec 2019 22:05:11 +0000 Subject: [PATCH 0/9] [RFC] Changed Paths Bloom Filters MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fcc: Sent To: git@vger.kernel.org Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com, jeffhost@microsoft.com, me@ttaylorr.com, peff@peff.net, Junio C Hamano <gitster@pobox.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Changed Paths Bloom Filters \| expand [0/9,RFC] Changed Paths Bloom Filters [1/9] commit-graph: add --changed-paths option to write [2/9] commit-graph: write changed paths bloom filters [3/9] commit-graph: use MAX_NUM_CHUNKS [4/9] commit-graph: document bloom filter format [5/9] commit-graph: write changed path bloom filters to commit-graph file. [6/9] commit-graph: test commit-graph write --changed-paths [7/9] commit-graph: reuse existing bloom filters during write. [8/9] revision.c: use bloom filters to speed up path based revision walks [9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag

Elijah Newren via GitGitGadget Dec. 20, 2019, 10:05 p.m. UTC

Hey! 

The commit graph feature brought in a lot of performance improvements across
multiple commands. However, file based history continues to be a performance
pain point, especially in large repositories. 

Adopting changed path bloom filters has been discussed on the list before,
and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
presents an updated and more polished RFC version of the feature. 

Performance Gains: We tested the performance of git log -- path on the git
repo, the linux repo and some internal large repos, with a variety of paths
of varying depths.

On the git and linux repos: We observed a 2x to 5x speed up.

On a large internal repo with files seated 6-10 levels deep in the tree: We
observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Future Work (not included in the scope of this series):

 1. Supporting multiple path based revision walk
 2. Adopting it in git blame logic. 
 3. Interactions with line log git log -L

This series is intended to start the conversation and many of the commit
messages include specific call outs for suggestions and thoughts. 

Cheers! Garima Singh

[1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
[2] 
https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

Garima Singh (9):
  commit-graph: add --changed-paths option to write
  commit-graph: write changed paths bloom filters
  commit-graph: use MAX_NUM_CHUNKS
  commit-graph: document bloom filter format
  commit-graph: write changed path bloom filters to commit-graph file.
  commit-graph: test commit-graph write --changed-paths
  commit-graph: reuse existing bloom filters during write.
  revision.c: use bloom filters to speed up path based revision walks
  commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag

 Documentation/git-commit-graph.txt            |   5 +
 .../technical/commit-graph-format.txt         |  17 ++
 Makefile                                      |   1 +
 bloom.c                                       | 257 +++++++++++++++++
 bloom.h                                       |  51 ++++
 builtin/commit-graph.c                        |   9 +-
 ci/run-build-and-tests.sh                     |   1 +
 commit-graph.c                                | 116 +++++++-
 commit-graph.h                                |   9 +-
 revision.c                                    |  67 ++++-
 revision.h                                    |   5 +
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   4 +
 t/t4216-log-bloom.sh                          |  77 ++++++
 t/t5318-commit-graph.sh                       |   2 +
 t/t5324-split-commit-graph.sh                 |   1 +
 t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
 17 files changed, 875 insertions(+), 8 deletions(-)
 create mode 100644 bloom.c
 create mode 100644 bloom.h
 create mode 100755 t/t4216-log-bloom.sh
 create mode 100755 t/t5325-commit-graph-bloom.sh


base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/497

Junio C Hamano Dec. 20, 2019, 10:14 p.m. UTC | #1

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature. 

;-)

Christian Couder Dec. 22, 2019, 9:26 a.m. UTC | #2

Hi,

On Fri, Dec 20, 2019 at 11:07 PM Garima Singh via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories.
>
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature.

Thanks for working on this!

> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Very nice!

I have a question though. Are the performance gains only available
with `git log -- path` or are they already available for example when
doing a partial clone and/or a sparse checkout?

> Future Work (not included in the scope of this series):
>
>  1. Supporting multiple path based revision walk
>  2. Adopting it in git blame logic.
>  3. Interactions with line log git log -L

Great!

> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts.

I think Peff said during the Virtual Contributor Summit that he was
interested in using bitmaps to speed up partial clone on the server
side. Would it make sense to use both bitmaps and bloom filters?

Thanks,
Christian.

Jeff King Dec. 22, 2019, 9:30 a.m. UTC | #3

On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:

> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature.

Great to see progress here. I probably won't have time to review this
carefully before the new year, but I did notice some low-hanging fruit
on the generation side.

So here are a few patches to reduce the CPU and memory usage. They could
be squashed in at the appropriate spots, or perhaps taken as inspiration
if there are better solutions (especially for the first one).

I think we could go further still, by actually doing a non-recursive
diff_tree_oid(), and then recursing into sub-trees ourselves. That would
save us having to split apart each path to add the leading paths to the
hashmap (most of which will be duplicates if the commit touched "a/b/c"
and "a/b/d", etc). I doubt it would be that huge a speedup though. We
have to keep a list of the touched paths anyway (since the bloom key
parameters depend on the number of entries), and most of the time is
almost certainly spent inflating the trees in the first place. However
it might be easier to follow the code, and it would make it simpler to
stop traversing at the 512-entry limit, rather than generating a huge
diff only to throw it away.

  [1/3]: commit-graph: examine changed-path objects in pack order
  [2/3]: commit-graph: free large diffs, too
  [3/3]: commit-graph: stop using full rev_info for diffs

 bloom.c        | 18 +++++++++---------
 commit-graph.c | 34 +++++++++++++++++++++++++++++++++-
 2 files changed, 42 insertions(+), 10 deletions(-)

-Peff

Jeff King Dec. 22, 2019, 9:38 a.m. UTC | #4

On Sun, Dec 22, 2019 at 10:26:20AM +0100, Christian Couder wrote:

> I have a question though. Are the performance gains only available
> with `git log -- path` or are they already available for example when
> doing a partial clone and/or a sparse checkout?

From my quick look at the code, anything that feeds a pathspec to a
revision traversal would be helped. I'm not sure if it would help for
partial/sparse traversals, though. There we actually need to know which
blobs correspond to the paths in question, not just whether any
particular commit touched them.

I also took a brief look at adding support to the custom blame-tree
implementation we use at GitHub, and got about a 6x speedup.

> > This series is intended to start the conversation and many of the commit
> > messages include specific call outs for suggestions and thoughts.
> 
> I think Peff said during the Virtual Contributor Summit that he was
> interested in using bitmaps to speed up partial clone on the server
> side. Would it make sense to use both bitmaps and bloom filters?

I think they're orthogonal. For size-based filters on blobs, you'd just
use bitmaps as normal, because you can post-process the result to check
the type and size of each object in the list (and I have patches to do
this, but they need some polishing and we're not yet running them).

For path-based filters like a sparse specification, you can't use
bitmaps at all; you have to do a real traversal. But there you still
generally get all of the commits. I guess if a commit doesn't touch any
path you're interested in, you could avoid walking into its tree at all,
which might help. I haven't given it much thought yet.

-Peff

Derrick Stolee Dec. 26, 2019, 2:21 p.m. UTC | #5

On 12/22/2019 4:30 AM, Jeff King wrote:
> On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:
> 
>> Adopting changed path bloom filters has been discussed on the list before,
>> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
>> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
>> presents an updated and more polished RFC version of the feature.
> 
> Great to see progress here. I probably won't have time to review this
> carefully before the new year, but I did notice some low-hanging fruit
> on the generation side.
> 
> So here are a few patches to reduce the CPU and memory usage. They could
> be squashed in at the appropriate spots, or perhaps taken as inspiration
> if there are better solutions (especially for the first one).
> 
> I think we could go further still, by actually doing a non-recursive
> diff_tree_oid(), and then recursing into sub-trees ourselves. That would
> save us having to split apart each path to add the leading paths to the
> hashmap (most of which will be duplicates if the commit touched "a/b/c"
> and "a/b/d", etc). I doubt it would be that huge a speedup though. We
> have to keep a list of the touched paths anyway (since the bloom key
> parameters depend on the number of entries), and most of the time is
> almost certainly spent inflating the trees in the first place. However
> it might be easier to follow the code, and it would make it simpler to
> stop traversing at the 512-entry limit, rather than generating a huge
> diff only to throw it away.

Thanks for these improvements. This diff machinery is new to us (Garima
and myself).

Here are some recommendations (to Garima) for how to proceed with these
patches. Please let me know if anyone disagrees.

>   [1/3]: commit-graph: examine changed-path objects in pack order

This one is best kept as its own patch, as it shows a clear reason why
we want to do the sort-by-position. It would also be a complicated
patch to include this logic along with the first use of
compute_bloom_filters().

>   [2/3]: commit-graph: free large diffs, too
This one seems best to squash into "commit-graph: write changed paths
bloom filters" with a Helped-by for Peff.

>   [3/3]: commit-graph: stop using full rev_info for diffs

While I appreciate the clear benefit in the commit-message here, it
may be best to also squash this one similarly.

Of course, if we create our own diff logic with the short-circuit
capability, then perhaps these patches become obsolete. I'll spend
a little time playing with options here.

Thanks!
-Stolee

Derrick Stolee Dec. 27, 2019, 4:11 p.m. UTC | #6

On 12/22/2019 4:30 AM, Jeff King wrote:
> On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:
> 
>> Adopting changed path bloom filters has been discussed on the list before,
>> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
>> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
>> presents an updated and more polished RFC version of the feature.
> 
> Great to see progress here. I probably won't have time to review this
> carefully before the new year, but I did notice some low-hanging fruit
> on the generation side.
> 
> So here are a few patches to reduce the CPU and memory usage. They could
> be squashed in at the appropriate spots, or perhaps taken as inspiration
> if there are better solutions (especially for the first one).

I tested these patches with the Linux kernel repo and reported my results
on each patch. However, I wanted to also test on a larger internal repo
(the AzureDevOps repo), which has ~500 commits with more than 512 changes,
and generally has larger diffs than the Linux kernel repo.

| Version  | Time   | Memory |
|----------|--------|--------|
| Garima   | 16m36s | 27.0gb |
| Peff 1   | 6m32s  | 28.0gb |
| Peff 2   | 6m48s  |  5.6gb |
| Peff 3   | 6m14s  |  4.5gb |
| Shortcut | 3m47s  |  4.5gb |

For reference, I found the time and memory information using
"/usr/bin/time --verbose" in a bash script.
 
> I think we could go further still, by actually doing a non-recursive
> diff_tree_oid(), and then recursing into sub-trees ourselves. That would
> save us having to split apart each path to add the leading paths to the
> hashmap (most of which will be duplicates if the commit touched "a/b/c"
> and "a/b/d", etc). I doubt it would be that huge a speedup though. We
> have to keep a list of the touched paths anyway (since the bloom key
> parameters depend on the number of entries), and most of the time is
> almost certainly spent inflating the trees in the first place. However
> it might be easier to follow the code, and it would make it simpler to
> stop traversing at the 512-entry limit, rather than generating a huge
> diff only to throw it away.

By "Shortcut" in the table above, I mean the following patch on top of
Garima's and Peff's changes. It inserts a max_changes option into struct
diff_options to halt the diff early. This seemed like an easier change
than creating a new tree diff algorithm wholesale.

Thanks,
-Stolee

-->8--

From: Derrick Stolee <dstolee@microsoft.com>
Date: Fri, 27 Dec 2019 10:13:48 -0500
Subject: [PATCH] diff: halt tree-diff early after max_changes

When computing the changed-paths bloom filters for the commit-graph,
we limit the size of the filter by restricting the number of paths
in the diff. Instead of computing a large diff and then ignoring the
result, it is better to halt the diff computation early.

Create a new "max_changes" option in struct diff_options. If non-zero,
then halt the diff computation after discovering strictly more changed
paths. This includes paths corresponding to trees that change.

Use this max_changes option in the bloom filter calculations. This
reduces the time taken to compute the filters for the Linux kernel
repo from 2m50s to 2m35s. For a larger repo with more commits changing
many paths, the time reduces from 6 minutes to under 4 minutes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 bloom.c     | 4 +++-
 diff.h      | 5 +++++
 tree-diff.c | 5 +++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/bloom.c b/bloom.c
index ea77631cc2..83dde2378b 100644
--- a/bloom.c
+++ b/bloom.c
@@ -155,6 +155,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 	int i;
 	struct diff_options diffopt;
+	int max_changes = 512;
 
 	filter = bloom_filter_slab_at(&bloom_filters, c);
 
@@ -171,6 +172,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
+	diffopt.max_changes = max_changes;
 	diff_setup_done(&diffopt);
 
 	if (c->parents)
@@ -179,7 +181,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
 
-	if (diff_queued_diff.nr <= 512) {
+	if (diff_queued_diff.nr <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry* e;
 		struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index 6febe7e365..9443dc1b00 100644
--- a/diff.h
+++ b/diff.h
@@ -285,6 +285,11 @@ struct diff_options {
 	/* Number of hexdigits to abbreviate raw format output to. */
 	int abbrev;
 
+	/* If non-zero, then stop computing after this many changes. */
+	int max_changes;
+	/* For internal use only. */
+	int num_changes;
+
 	int ita_invisible_in_index;
 /* white-space error highlighting */
 #define WSEH_NEW (1<<12)
diff --git a/tree-diff.c b/tree-diff.c
index 33ded7f8b3..16a21d9f34 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
+		if (opt->max_changes && opt->num_changes > opt->max_changes)
+			break;
+
 		if (opt->pathspec.nr) {
 			skip_uninteresting(&t, base, opt);
 			for (i = 0; i < nparent; i++)
@@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
+			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
+			opt->num_changes++;
 		}
 	}

Jeff King Dec. 29, 2019, 6:03 a.m. UTC | #7

On Thu, Dec 26, 2019 at 09:21:36AM -0500, Derrick Stolee wrote:

> Here are some recommendations (to Garima) for how to proceed with these
> patches. Please let me know if anyone disagrees.
> 
> >   [1/3]: commit-graph: examine changed-path objects in pack order
> 
> This one is best kept as its own patch, as it shows a clear reason why
> we want to do the sort-by-position. It would also be a complicated
> patch to include this logic along with the first use of
> compute_bloom_filters().

Yeah, I'd agree this one could be a separate patch. It does need more
work, though (as you found out, it does not cover --reachable at all).

The position counter also probably ought to be an unsigned (or even a
uint32_t, which we usually consider a maximum bound for number of
objects).

-Peff

Jeff King Dec. 29, 2019, 6:24 a.m. UTC | #8

On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote:

> > So here are a few patches to reduce the CPU and memory usage. They could
> > be squashed in at the appropriate spots, or perhaps taken as inspiration
> > if there are better solutions (especially for the first one).
> 
> I tested these patches with the Linux kernel repo and reported my results
> on each patch. However, I wanted to also test on a larger internal repo
> (the AzureDevOps repo), which has ~500 commits with more than 512 changes,
> and generally has larger diffs than the Linux kernel repo.
> 
> | Version  | Time   | Memory |
> |----------|--------|--------|
> | Garima   | 16m36s | 27.0gb |
> | Peff 1   | 6m32s  | 28.0gb |
> | Peff 2   | 6m48s  |  5.6gb |
> | Peff 3   | 6m14s  |  4.5gb |
> | Shortcut | 3m47s  |  4.5gb |
> 
> For reference, I found the time and memory information using
> "/usr/bin/time --verbose" in a bash script.

Thanks for giving it more exercise. My heap profiling was done with
massif, which measures the heap directly. Measuring RSS would cover
that, but will also include the mmap'd packfiles. That's probably why
your linux.git numbers were slightly higher than mine.

(massif is a really great tool if you haven't used it, as it also shows
which allocations were using the memory. But it's part of valgrind, so
it definitely doesn't run on native Windows. It might work under WSL,
though. I'm sure there are also other heap profilers on Windows).

> By "Shortcut" in the table above, I mean the following patch on top of
> Garima's and Peff's changes. It inserts a max_changes option into struct
> diff_options to halt the diff early. This seemed like an easier change
> than creating a new tree diff algorithm wholesale.

Yeah, I'm not opposed to a diff feature like this.

But be careful, because...

> diff --git a/diff.h b/diff.h
> index 6febe7e365..9443dc1b00 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
>  	/* Number of hexdigits to abbreviate raw format output to. */
>  	int abbrev;
>  
> +	/* If non-zero, then stop computing after this many changes. */
> +	int max_changes;
> +	/* For internal use only. */
> +	int num_changes;

This is holding internal state in diff_options, but the same
diff_options is often used for multiple diffs (e.g., "git log --raw"
would use the same rev_info.diffopt over and over again).

So it would need to be cleared between diffs. There's a similar feature
in the "has_changes" flag, though it looks like it is cleared manually
by callers. Yuck.

This isn't a problem for commit-graph right now, but:

  - it actually could be using a single diff_options, which would be
    slightly simpler (it doesn't seem to save much CPU, though, because
    the initialization is relatively cheap)

  - it's a bit of a subtle bug to leave hanging around for the next
    person who tries to use the feature

I actually wonder if this could be rolled into the has_changes and
diff_can_quit_early() feature. This really just a generalization of that
feature (which is like setting max_changes to "1").

-Peff

Derrick Stolee Dec. 30, 2019, 4:04 p.m. UTC | #9

On 12/29/2019 1:24 AM, Jeff King wrote:
> On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote:
> 
>>> So here are a few patches to reduce the CPU and memory usage. They could
>>> be squashed in at the appropriate spots, or perhaps taken as inspiration
>>> if there are better solutions (especially for the first one).
>>
>> I tested these patches with the Linux kernel repo and reported my results
>> on each patch. However, I wanted to also test on a larger internal repo
>> (the AzureDevOps repo), which has ~500 commits with more than 512 changes,
>> and generally has larger diffs than the Linux kernel repo.
>>
>> | Version  | Time   | Memory |
>> |----------|--------|--------|
>> | Garima   | 16m36s | 27.0gb |
>> | Peff 1   | 6m32s  | 28.0gb |
>> | Peff 2   | 6m48s  |  5.6gb |
>> | Peff 3   | 6m14s  |  4.5gb |
>> | Shortcut | 3m47s  |  4.5gb |
>>
>> For reference, I found the time and memory information using
>> "/usr/bin/time --verbose" in a bash script.
> 
> Thanks for giving it more exercise. My heap profiling was done with
> massif, which measures the heap directly. Measuring RSS would cover
> that, but will also include the mmap'd packfiles. That's probably why
> your linux.git numbers were slightly higher than mine.

That's interesting. I initially avoided massif because it is so much
slower than /usr/bin/time. However, the inflated numbers could be
explained by that. Also, the distinction between mem_heap and
mem_heap_extra may have interesting implications. Looking online, it
seems that large mem_heap_extra implies the heap is fragmented from
many small allocations.

Here are my findings on the Linux repo:

| Version  | mem_heap | mem_heap_extra |
|----------|----------|----------------|
| Peff 1   |  6,500mb |          913mb |
| Peff 2   |  3,100mb |          286mb |
| Peff 3   |    781mb |          235mb |

These numbers more closely match your numbers (in sum of the two
columns).

> (massif is a really great tool if you haven't used it, as it also shows
> which allocations were using the memory. But it's part of valgrind, so
> it definitely doesn't run on native Windows. It might work under WSL,
> though. I'm sure there are also other heap profilers on Windows).

I am using my Linux machine for my tests. Garima is using her Windows
machine.

>> By "Shortcut" in the table above, I mean the following patch on top of
>> Garima's and Peff's changes. It inserts a max_changes option into struct
>> diff_options to halt the diff early. This seemed like an easier change
>> than creating a new tree diff algorithm wholesale.
> 
> Yeah, I'm not opposed to a diff feature like this.
> 
> But be careful, because...
> 
>> diff --git a/diff.h b/diff.h
>> index 6febe7e365..9443dc1b00 100644
>> --- a/diff.h
>> +++ b/diff.h
>> @@ -285,6 +285,11 @@ struct diff_options {
>>  	/* Number of hexdigits to abbreviate raw format output to. */
>>  	int abbrev;
>>  
>> +	/* If non-zero, then stop computing after this many changes. */
>> +	int max_changes;
>> +	/* For internal use only. */
>> +	int num_changes;
> 
> This is holding internal state in diff_options, but the same
> diff_options is often used for multiple diffs (e.g., "git log --raw"
> would use the same rev_info.diffopt over and over again).
> 
> So it would need to be cleared between diffs. There's a similar feature
> in the "has_changes" flag, though it looks like it is cleared manually
> by callers. Yuck.

You're right about this. What if we initialize it in diff_tree_paths()
before it calls the recursive ll_difF_tree_paths()?

> This isn't a problem for commit-graph right now, but:
> 
>   - it actually could be using a single diff_options, which would be
>     slightly simpler (it doesn't seem to save much CPU, though, because
>     the initialization is relatively cheap)
> 
>   - it's a bit of a subtle bug to leave hanging around for the next
>     person who tries to use the feature
> 
> I actually wonder if this could be rolled into the has_changes and
> diff_can_quit_early() feature. This really just a generalization of that
> feature (which is like setting max_changes to "1").

I thought about this at first, but it only takes a struct diff_options
right now. It does have an internally-mutated member (flags.has_changes)
but it also seems a bit wrong to add a uint32_t of the count in this.
Changing the prototype could be messy, too.

There are also multiple callers, and limiting everything to tree-diff.c
limits the impact.

Thanks,
-Stolee

Junio C Hamano Dec. 30, 2019, 5:02 p.m. UTC | #10

Jeff King <peff@peff.net> writes:

> This is holding internal state in diff_options, but the same
> diff_options is often used for multiple diffs (e.g., "git log --raw"
> would use the same rev_info.diffopt over and over again).
>
> So it would need to be cleared between diffs. There's a similar feature
> in the "has_changes" flag, though it looks like it is cleared manually
> by callers. Yuck.

Do you mean we want reset_per_invocation_part_of_diff_options()
helper or something?

> I actually wonder if this could be rolled into the has_changes and
> diff_can_quit_early() feature. This really just a generalization of that
> feature (which is like setting max_changes to "1").

Yeah, I wondered about the same thing, after seeing the impressive
numbers ;-)

Jakub Narębski Dec. 31, 2019, 4:45 p.m. UTC | #11

"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Hey! 
>
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories. 
>
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature. 

It is nice to have this picked up for upstream, finally.  The proof of
concept works[1][2] were started more than a year ago.

On the other hand slow and steady adoption of commit-graph serialization
and then extending it (generation numbers, topological sort, incremental
update) feels like a good approach.

> Performance Gains: We tested the performance of 'git log -- <path>' on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
>
> On the git and linux repos: We observed a 2x to 5x speed up.
>
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.

Could you provide some more statistics about this internal repository,
such as number of files, number of commits, perhaps also number of all
objects?  Thanks in advance.

I wonder why such large difference in performance 2-5x vs 10-20x.  Is it
about the depth of the file hierarchy?  How would the numbers look for
files seated closer to the root in the same large repository, like 3-5
levels deep in the tree?

> Future Work (not included in the scope of this series):
>
>  1. Supporting multiple path based revision walk

I wonder if it would ever be possible to support globbing, e.g. '*.c'

>  2. Adopting it in git blame logic.

What about 'git log --follow <path>'?

>  3. Interactions with line log git log -L
>
> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts. 
>
> Cheers! Garima Singh
>
> [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
> [2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
>
> Garima Singh (9):
>   commit-graph: add --changed-paths option to write

This summary is not easy to understand on first glance.  Maybe:

    commit-graph: add --changed-paths option to the write subcommand

or

    commit-graph: add --changed-paths option to 'git commit-graph write'

would be better?

>   commit-graph: write changed paths bloom filters
>   commit-graph: use MAX_NUM_CHUNKS
>   commit-graph: document bloom filter format
>   commit-graph: write changed path bloom filters to commit-graph file.
>   commit-graph: test commit-graph write --changed-paths
>   commit-graph: reuse existing bloom filters during write.
>   revision.c: use bloom filters to speed up path based revision walks
>   commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
>
>  Documentation/git-commit-graph.txt            |   5 +
>  .../technical/commit-graph-format.txt         |  17 ++
>  Makefile                                      |   1 +
>  bloom.c                                       | 257 +++++++++++++++++
>  bloom.h                                       |  51 ++++
>  builtin/commit-graph.c                        |   9 +-
>  ci/run-build-and-tests.sh                     |   1 +
>  commit-graph.c                                | 116 +++++++-
>  commit-graph.h                                |   9 +-
>  revision.c                                    |  67 ++++-
>  revision.h                                    |   5 +
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   4 +
>  t/t4216-log-bloom.sh                          |  77 ++++++
>  t/t5318-commit-graph.sh                       |   2 +
>  t/t5324-split-commit-graph.sh                 |   1 +
>  t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
>  17 files changed, 875 insertions(+), 8 deletions(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100755 t/t4216-log-bloom.sh
>  create mode 100755 t/t5325-commit-graph-bloom.sh
>
>
> base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/497

Jakub Narębski Jan. 1, 2020, 12:04 p.m. UTC | #12

Jeff King <peff@peff.net> writes:

> On Sun, Dec 22, 2019 at 10:26:20AM +0100, Christian Couder wrote:
>
>> I have a question though. Are the performance gains only available
>> with `git log -- path` or are they already available for example when
>> doing a partial clone and/or a sparse checkout?
>
> From my quick look at the code, anything that feeds a pathspec to a
> revision traversal would be helped. I'm not sure if it would help for
> partial/sparse traversals, though. There we actually need to know which
> blobs correspond to the paths in question, not just whether any
> particular commit touched them.
>
> I also took a brief look at adding support to the custom blame-tree
> implementation we use at GitHub, and got about a 6x speedup.

Is there any chance of upstreaming the blame-tree algorithm, perhaps as
a separate mode for git-blame (invoked with `git blame <directory>`?
Or is the algorithm too GitHub-specific?

Best,

Garima Singh Jan. 13, 2020, 4:54 p.m. UTC | #13

On 12/31/2019 11:45 AM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> Performance Gains: We tested the performance of 'git log -- <path>' on the git
>> repo, the linux repo and some internal large repos, with a variety of paths
>> of varying depths.
>>
>> On the git and linux repos: We observed a 2x to 5x speed up.
>>
>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
> 
> Could you provide some more statistics about this internal repository,
> such as number of files, number of commits, perhaps also number of all
> objects?  Thanks in advance.
> 
> I wonder why such large difference in performance 2-5x vs 10-20x.  Is it
> about the depth of the file hierarchy?  How would the numbers look for
> files seated closer to the root in the same large repository, like 3-5
> levels deep in the tree?

The internal repository we saw these massive gains on has:
- 413579 commits. 
- 183303 files distributed across 34482 folders
The size on disk is about 17 GiB. 

And yes, the difference is performance gains is mostly because of how 
deep the files were in the hierarchy. How often a file has been touched
also makes a difference. The performance gains are less dramatic if the 
file has a very sparse history even if it is a deep file. 

The numbers from the git and linux repos for instance, are for files 
closer to the root, hence 2x to 5x. 

Thanks! 
Garima Singh

Jakub Narębski Jan. 20, 2020, 1:48 p.m. UTC | #14

Garima Singh <garimasigit@gmail.com> writes:
> On 12/31/2019 11:45 AM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>> Performance Gains: We tested the performance of 'git log -- <path>' on the git
>>> repo, the linux repo and some internal large repos, with a variety of paths
>>> of varying depths.
>>>
>>> On the git and linux repos: We observed a 2x to 5x speed up.
>>>
>>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
>> 
>> Could you provide some more statistics about this internal repository,
>> such as number of files, number of commits, perhaps also number of all
>> objects?  Thanks in advance.
>> 
>> I wonder why such large difference in performance 2-5x vs 10-20x.  Is it
>> about the depth of the file hierarchy?  How would the numbers look for
>> files seated closer to the root in the same large repository, like 3-5
>> levels deep in the tree?
>
> The internal repository we saw these massive gains on has:
> - 413579 commits. 
> - 183303 files distributed across 34482 folders
> The size on disk is about 17 GiB. 

Thank you for the data.  Such information would be important
consideration to help to find out whether enabling Bloom filters in
given repository would be worth it.

> And yes, the difference is performance gains is mostly because of how 
> deep the files were in the hierarchy.

Right, this is understandable.  If files are diep in hierarchy, then we
have to unpack more tree objects to find out if the file was changed in
a given commit (provided that finding differences do not terminate early
thanks to hierarchical structure of tree objects).

>                                      How often a file has been touched
> also makes a difference. The performance gains are less dramatic if the 
> file has a very sparse history even if it is a deep file.

This looks a bit strange (or maybe I don't understand something).

Bloom filter can answer "no" and "maybe" to subset inclusion query.
This means that if file was *not* changed, with great probability the
answer from Bloom filter would be "no", and we would skip diff-ing
trees (which may terminate early, though).

On the other hand if file was changed by the commit, and the answer from
a Bloom filter is "maybe", then we have to perform diffing to make sure.

>
> The numbers from the git and linux repos for instance, are for files 
> closer to the root, hence 2x to 5x. 

That is quite nice speedup, anyway (git repository cannot be even
considered large; medium -- maybe).

P.S. I wonder if it would be worth to create some synthetical repository
to test performance gains of Bloom filters, perhaps in t/perf...

Best,

Garima Singh Jan. 21, 2020, 4:14 p.m. UTC | #15

On 1/20/2020 8:48 AM, Jakub Narebski wrote: >>                                      How often a file has been touched
>> also makes a difference. The performance gains are less dramatic if the 
>> file has a very sparse history even if it is a deep file.
> 
> This looks a bit strange (or maybe I don't understand something).
> 
> Bloom filter can answer "no" and "maybe" to subset inclusion query.
> This means that if file was *not* changed, with great probability the
> answer from Bloom filter would be "no", and we would skip diff-ing
> trees (which may terminate early, though).
> 
> On the other hand if file was changed by the commit, and the answer from
> a Bloom filter is "maybe", then we have to perform diffing to make sure.
>

Yes. What I meant by statement however is that the performance gain i.e. 
difference in performance between using and not using bloom filters, is not 
always as dramatic if the history is sparse and the trees aren't touched 
as often. So it is largely dependent on the shape of the repo and the shape
of the commit graph. 

>>
>> The numbers from the git and linux repos for instance, are for files 
>> closer to the root, hence 2x to 5x. 
> 
> That is quite nice speedup, anyway (git repository cannot be even
> considered large; medium -- maybe).
> 

Yeah. Git and Linux served as nice initial test beds. If you have any 
suggestions for interesting repos it would be worth running performanc 
investigations on, do let me know! 

> 
> P.S. I wonder if it would be worth to create some synthetical repository
> to test performance gains of Bloom filters, perhaps in t/perf...
> 

I will look into this after I get v1 out on the mailing list. 
Thanks! 

Cheers
Garima Singh

Emily Shaffer Jan. 21, 2020, 11:40 p.m. UTC | #16

Hi,

On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote:
> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts. 

Since it's mostly in RFC stage, I'm holding off on line by line comments
for now. I read the series (thanks for your patience) and I'll try to
leave some review thoughts in the diffspec here.

> Garima Singh (9):
>   [1/9] commit-graph: add --changed-paths option to write
I wonder if this can be combined with 2; without 2 actually the
documentation is wrong for this one, right? Although I suppose you also
mentioned 2 perhaps being too long :)

>   [2/9] commit-graph: write changed paths bloom filters
As I understand it, this one always regenerates the bloom filter pieces,
and doesn't write it down in the commit-graph file. How much longer does
that take now than before? I don't have a great feel for how often 'git
commit-graph' is run, or whether I need to be invoking it manually.

>   [3/9] commit-graph: use MAX_NUM_CHUNKS
>   [4/9] commit-graph: document bloom filter format
I suppose I might like to see this commit squashed with 5, but it's a
nit. I'm thinking it'd be handy to say "git blame commit-graph" and see
some nice doc about the format expected in the commit-graph file.

>   [5/9] commit-graph: write changed path bloom filters to commit-graph file.
Ah, so here we finally write down the result from 2 to disk in the
commit-graph file. And without 7, this gets recalculated every time we
call 'git commit-graph' still.

As for a technical doc around here, I'd really appreciate one. But I'm
speaking selfishly - I'd also be happy if I could watch a talk about
this design to make sure I understand it right :)

>   [6/9] commit-graph: test commit-graph write --changed-paths
>   [7/9] commit-graph: reuse existing bloom filters during write.
I saw an option to give up if there wasn't an existing bloom filter, but
I didn't see an option here to force recalculating. Is there a scenario
when that would be useful? What's the mitigation path if:
 - I have a commit-graph with v0 of the bloom index piece, but update to
   Git which uses v1?
 - My commit-graph file is corrupted in a way that the bloom filter
   results are incorrect and I am missing a blob change (and therefore
   not finding it during the walk)?
I think I understand that without this commit, 8 is not much speedup
because we will be recalculating the filter for each commit, rather than
using the written-down commit-graph file.

>   [8/9] revision.c: use bloom filters to speed up path based revision walks
>   [9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag

Speaking of making sure I understand the change right, I will also
summarize my own understanding in the hopes that I can be corrected and
others can learn too ;)

 - The general idea is that we can write down a hint at the tree level
   saying "something I own did change this commit"; when we look for an
   object later, we can skip the commit where it looks like that path
   didn't change.
 - The change is in two pieces: first, to generate the hints per tree
   (which happens during commit-graph); second, to use those hints to
   optimize a rev walk (which happens in revision.c patch 8)
 - When we calculate the hints during commit-graph, we check the diff of
   each tree compared to its recent ancestor to see if there was a
   change; if so we calculate a hash for each path and use that as a key
   for a map from hash to path. After we look through everything changed
   in the diff, we can add it to a cumulative bloom filter list (one
   filter per commit) so we have a handy in-memory idea of which paths
   changed in each commit.
 - When it's time to do the rev walk, we ask for the bloom filter for
   each commit and check if that commit's map contains the path to the
   object we're worried about; if so, then it's OK to unpack the tree
   and check if the path we are interested in actually did get changed
   during that commit.

Thanks.
 - Emily

> 
>  Documentation/git-commit-graph.txt            |   5 +
>  .../technical/commit-graph-format.txt         |  17 ++
>  Makefile                                      |   1 +
>  bloom.c                                       | 257 +++++++++++++++++
>  bloom.h                                       |  51 ++++
>  builtin/commit-graph.c                        |   9 +-
>  ci/run-build-and-tests.sh                     |   1 +
>  commit-graph.c                                | 116 +++++++-
>  commit-graph.h                                |   9 +-
>  revision.c                                    |  67 ++++-
>  revision.h                                    |   5 +
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   4 +
>  t/t4216-log-bloom.sh                          |  77 ++++++
>  t/t5318-commit-graph.sh                       |   2 +
>  t/t5324-split-commit-graph.sh                 |   1 +
>  t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
>  17 files changed, 875 insertions(+), 8 deletions(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100755 t/t4216-log-bloom.sh
>  create mode 100755 t/t5325-commit-graph-bloom.sh
> 
> 
> base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/497
> -- 
> gitgitgadget

Garima Singh Jan. 27, 2020, 6:24 p.m. UTC | #17

On 1/21/2020 6:40 PM, Emily Shaffer wrote:>> Garima Singh (9):
>>   [1/9] commit-graph: add --changed-paths option to write
> I wonder if this can be combined with 2; without 2 actually the
> documentation is wrong for this one, right? Although I suppose you also
> mentioned 2 perhaps being too long :)
> 

True. The ordering of these commits has been a very subjective discussion. 
Leaving this commit isolated the way it is, does help separate the option
from the bloom filter computation and commit graph write step. 

Also for clarity, I have changed the message for 2/9 to:
`commit-graph: compute Bloom filters for changed paths`

>>   [2/9] commit-graph: write changed paths bloom filters
> As I understand it, this one always regenerates the bloom filter pieces,
> and doesn't write it down in the commit-graph file. How much longer does
> that take now than before? I don't have a great feel for how often 'git
> commit-graph' is run, or whether I need to be invoking it manually.
> 

Yes. Computation and writing the bloom filters to the commit-graph file
(2/9 and 5/9) are ideally one time operations per commit. The time taken
depends on the shape and size of the repo since computation involves 
running a full diff. If you look at the discussions and patches exchanged
by Dr. Stolee and Peff: it has improved greatly since this RFC patch. 
My next submission will carry these patches and more concrete numbers for
time and memory. 

Also, `git commit-graph write` is not run very frequently and is ideally
incremental. A full rewrite is usually only done in case there is a 
corruption caught by `git commit-graph verify` in which case, you delete
and rewrite. These are both manual operations. 
See the docs here: 
https://git-scm.com/docs/git-commit-graph

Note: that fetch.writeCommitGraph is now on by default in which case 
new computations happen automatically for newly fetched commits. But 
I am not adding the --changed-paths option to that yet. So computing, 
writing and using bloom filters will still be an opt-in feature 
that users need to manually run. 

>>   [7/9] commit-graph: reuse existing bloom filters during write.
> I saw an option to give up if there wasn't an existing bloom filter, but
> I didn't see an option here to force recalculating. 

The way to force recalculation at the moment would be to delete the commit
graph file and write it again. 

Is there a scenario when that would be useful? 

Yes, if there are algorithm or hash version changes in the changed paths logic
we would need to rewrite. Each of the cases I can think of would involve
triggering a recalculation by deleting and rewriting.

What's the mitigation path if:
>  - I have a commit-graph with v0 of the bloom index piece, but update to
>    Git which uses v1?     

     Take a look at 5/9, when we are parsing the commit graph: if the code 
     is expected to work with a particular version (hash_version = 1 in the 
     current version), and the commit graph has a different version, we just 
     ignore it. In the future, this is where we could extend this to support 
     multiple versions.

>  - My commit-graph file is corrupted in a way that the bloom filter
>    results are incorrect and I am missing a blob change (and therefore
>    not finding it during the walk)?
         The mitigation for any commit graph corruption is to delete and 
     rewrite. 

     If however, we are confident that the bloom filter computation itself
     is wrong, the immediate mitigations would be to deleting the commit
     graph file and rewriting without the --changed-paths option; and ofc
     report the bug so it can be investigated and fixed. :) 

> Speaking of making sure I understand the change right, I will also
> summarize my own understanding in the hopes that I can be corrected and
> others can learn too ;)
> 
>  - The general idea is that we can write down a hint at the tree level
>    saying "something I own did change this commit"; when we look for an
>    object later, we can skip the commit where it looks like that path
>    didn't change.

The hint we are storing is a bloom filter which answers "No" or "Maybe"
to the question "Did file A change in commit c"
If the answer is No, we can ignore walking that commit. Else, we fall
back to the diff algorithm like before to confirm if the file changed or
not. 

>  - The change is in two pieces: first, to generate the hints per tree
>    (which happens during commit-graph); second, to use those hints to
>    optimize a rev walk (which happens in revision.c patch 8)

Yes. 

>  - When we calculate the hints during commit-graph, we check the diff of
>    each tree compared to its recent ancestor to see if there was a
>    change; if so we calculate a hash for each path and use that as a key
>    for a map from hash to path. After we look through everything changed
>    in the diff, we can add it to a cumulative bloom filter list (one
>    filter per commit) so we have a handy in-memory idea of which paths
>    changed in each commit.
>  - When it's time to do the rev walk, we ask for the bloom filter for
>    each commit and check if that commit's map contains the path to the
>    object we're worried about; if so, then it's OK to unpack the tree
>    and check if the path we are interested in actually did get changed
>    during that commit.

Essentially yes. There are a few implementation specifics this description
is glossing over, but I understand that is the intention. 

> Thanks.
>  - Emily

Cheers! 
Garima Singh

Jakub Narębski Feb. 1, 2020, 11:32 p.m. UTC | #18

Emily Shaffer <emilyshaffer@google.com> writes:

[...]
> Speaking of making sure I understand the change right, I will also
> summarize my own understanding in the hopes that I can be corrected and
> others can learn too ;)
>
>  - The general idea is that we can write down a hint at the tree level
>    saying "something I own did change this commit"; when we look for an
>    object later, we can skip the commit where it looks like that path
>    didn't change.

Or, to be more exact, we write hint about all the files and directories
changed in the commit at the commit level.

Say, for example, that changed files are 'README' and 'subdir/file'.
We store hint that 'README', 'subdir/file' and 'subdir' paths have
changes in them.

>  - The change is in two pieces: first, to generate the hints per tree
>    (which happens during commit-graph); second, to use those hints to
>    optimize a rev walk (which happens in revision.c patch 8)

Right.

>  - When we calculate the hints during commit-graph, we check the diff of
>    each tree compared to its recent ancestor to see if there was a
>    change;

s/recent ancestor/first parent/

Right, though commits without any changes with respect to first-parent
(or null tree in case of root i.e. parentless commit) should be rare.

>           if so we calculate a hash for each path and use that as a key
>    for a map from hash to path.

Yes, and no.  Here we enter the details how Bloom filter is constructed.
We don't store paths in Bloom filter -- it would take too much space.

The hashmap is used as an implementation of mathematical set, a
temporary structure used during Bloom filter construction.  We could
have used string_list, but then we could waste time trying to add
intermediate directories multiple times (for example if 'foo/bar' and
'foo/baz' files changed, we need to add 'foo' path only once to Bloom
filter).

You can think of Bloom filter as a compact (and probabilistic, see
below) representation of set of changed paths.

>                                After we look through everything changed
>    in the diff, we can add it to a cumulative bloom filter list (one
>    filter per commit) so we have a handy in-memory idea of which paths
>    changed in each commit.

Yes.

>  - When it's time to do the rev walk, we ask for the bloom filter for
>    each commit and check if that commit's map contains the path to the
>    object we're worried about; if so, then it's OK to unpack the tree
>    and check if the path we are interested in actually did get changed
>    during that commit.

From the point of view of rev walk, we ask for the Bloom filter for each
commit walked, and check if the (sub)set of changed paths includes given
path.

Bloom filter can answer "no" -- then we can skip the commit simplifying
history, or it can answer "maybe" -- then we need to check if file was
actually changed unpacking the trees (there is around 1% probability
that Bloom filter will say "maybe" if the path is not actually changed).

From the point of view of Bloom filter, if the path was actually changed
the filter will always answer "maybe".  If the path was not changed,
then in most cases the filter will answer "no" but there is 1% of chance
that it will answer "maybe".

I hope that helps,
--
Jakub Narębski

Jakub Narębski Feb. 2, 2020, 6:43 p.m. UTC | #19

Garima Singh <garimasigit@gmail.com> writes:
> On 1/20/2020 8:48 AM, Jakub Narebski wrote:

>>>                                      How often a file has been touched
>>> also makes a difference. The performance gains are less dramatic if the 
>>> file has a very sparse history even if it is a deep file.
>> 
>> This looks a bit strange (or maybe I don't understand something).
>> 
>> Bloom filter can answer "no" and "maybe" to subset inclusion query.
>> This means that if file was *not* changed, with great probability the
>> answer from Bloom filter would be "no", and we would skip diff-ing
>> trees (which may terminate early, though).
>> 
>> On the other hand if file was changed by the commit, and the answer from
>> a Bloom filter is "maybe", then we have to perform diffing to make sure.
>
> Yes. What I meant by statement however is that the performance gain i.e. 
> difference in performance between using and not using bloom filters, is not 
> always as dramatic if the history is sparse and the trees aren't touched 
> as often. So it is largely dependent on the shape of the repo and the shape
> of the commit graph. 

It probably depends on the depth of changes in a typical skipped commit,
I think.

If we are getting history for the
core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl
and the change is contained in libs/input/ directory, we have to unpack
only two trees, independent on the depth of the file we are asking
about.

It could be also possible, if Git is smart enough about it, to halt
early if we are checking only if given path changed rather than
calculating a full difftree.  Say, for example, that the change was in
core/java/org/apache/http/conn/ssl/SSLSocketFactory.java file, while
we were getting the history for the following file
core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl
After unpacking two or three threes we know that the second file was not
changed.  But if we compute full diff, we have to unpack 8 trees.
Quite a difference.

All example paths above came from AOSP repository that was used for
testing different proposed generation numbers v2, see
https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh

>>> The numbers from the git and linux repos for instance, are for files 
>>> closer to the root, hence 2x to 5x. 
>> 
>> That is quite nice speedup, anyway (git repository cannot be even
>> considered large; medium -- maybe).
>
> Yeah. Git and Linux served as nice initial test beds. If you have any 
> suggestions for interesting repos it would be worth running performanc 
> investigations on, do let me know! 

If we want repositories with deep path hierarchy, Java projects with
mandated directory structures might be a good choice, for example
Android (AOSP):

  git clone https://android.googlesource.com/platform/frameworks/base/ android-base

It is also quite large repository; in 2019 it had around 874000 commits,
around the same as the Linux kernel repository.

Another large repository is Chromium -- though I don't know if it has
deep filesystem hierarchy.

You can use the list of different large and large-ish repositories from
https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh
Other repositories with large number of commmits not on that list are
LLVM Compiler, GCC (GNU Compiler Collection) -- just converted to Git,
Homebrew, and Ruby on Rails.

>> P.S. I wonder if it would be worth to create some synthetical repository
>> to test performance gains of Bloom filters, perhaps in t/perf...
>> 
>
> I will look into this after I get v1 out on the mailing list. 
> Thanks! 

It would be nice to have, but it can wait.

Keep up the good work!

Garima Singh March 5, 2020, 7:49 p.m. UTC | #20

My apologies that things have been quite on this series for the past
week or so. An unexpected high priority task at work demanded all of 
my attention and will continue to do so through the end of this week. 

Hopefully I will be able to pick this up again early next week and 
have v3 out soon! 

Cheers!
Garima Singh

On 12/20/2019 5:05 PM, Garima Singh via GitGitGadget wrote:
> Hey! 
> 
> The commit graph feature brought in a lot of performance improvements across
> multiple commands. However, file based history continues to be a performance
> pain point, especially in large repositories. 
> 
> Adopting changed path bloom filters has been discussed on the list before,
> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and
> presents an updated and more polished RFC version of the feature. 
> 
> Performance Gains: We tested the performance of git log -- path on the git
> repo, the linux repo and some internal large repos, with a variety of paths
> of varying depths.
> 
> On the git and linux repos: We observed a 2x to 5x speed up.
> 
> On a large internal repo with files seated 6-10 levels deep in the tree: We
> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
> 
> Future Work (not included in the scope of this series):
> 
>  1. Supporting multiple path based revision walk
>  2. Adopting it in git blame logic. 
>  3. Interactions with line log git log -L
> 
> This series is intended to start the conversation and many of the commit
> messages include specific call outs for suggestions and thoughts. 
> 
> Cheers! Garima Singh
> 
> [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/
> [2] 
> https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/
> 
> Garima Singh (9):
>   commit-graph: add --changed-paths option to write
>   commit-graph: write changed paths bloom filters
>   commit-graph: use MAX_NUM_CHUNKS
>   commit-graph: document bloom filter format
>   commit-graph: write changed path bloom filters to commit-graph file.
>   commit-graph: test commit-graph write --changed-paths
>   commit-graph: reuse existing bloom filters during write.
>   revision.c: use bloom filters to speed up path based revision walks
>   commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag
> 
>  Documentation/git-commit-graph.txt            |   5 +
>  .../technical/commit-graph-format.txt         |  17 ++
>  Makefile                                      |   1 +
>  bloom.c                                       | 257 +++++++++++++++++
>  bloom.h                                       |  51 ++++
>  builtin/commit-graph.c                        |   9 +-
>  ci/run-build-and-tests.sh                     |   1 +
>  commit-graph.c                                | 116 +++++++-
>  commit-graph.h                                |   9 +-
>  revision.c                                    |  67 ++++-
>  revision.h                                    |   5 +
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   4 +
>  t/t4216-log-bloom.sh                          |  77 ++++++
>  t/t5318-commit-graph.sh                       |   2 +
>  t/t5324-split-commit-graph.sh                 |   1 +
>  t/t5325-commit-graph-bloom.sh                 | 258 ++++++++++++++++++
>  17 files changed, 875 insertions(+), 8 deletions(-)
>  create mode 100644 bloom.c
>  create mode 100644 bloom.h
>  create mode 100755 t/t4216-log-bloom.sh
>  create mode 100755 t/t5325-commit-graph-bloom.sh
> 
> 
> base-commit: b02fd2accad4d48078671adf38fe5b5976d77304
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/497
>

[0/9,RFC] Changed Paths Bloom Filters

Message

Comments