
[v3,12/14] commit-graph: add large-filters bitmap chunk

Message ID 619e0c619dd12e2bea2b80c7d249ba66fe2a315c.1597178915.git.me@ttaylorr.com (mailing list archive)
State Superseded
Series more miscellaneous Bloom filter improvements

Commit Message

Taylor Blau Aug. 11, 2020, 8:52 p.m. UTC
When a commit has more than a certain number of changed paths (commonly
512), the commit-graph machinery represents it as a zero-length filter.
This is done since having many entries in the Bloom filter has
undesirable effects on the false positive rate.

In addition to these too-large filters, the commit-graph machinery also
represents commits with no filter and commits with no changed paths in
the same way.

When writing a commit-graph that aggregates several incremental
commit-graph layers (eg., with '--split=replace'), the commit-graph
machinery first computes all of the Bloom filters that it wants to write
but does not already know about from existing graph layers. Because we
overload the zero-length filter in the above fashion, this leads to
recomputing large filters over and over again.

This is already undesirable, since it means that we are wasting
considerable effort to discover that a commit has too many changed
paths, only to throw that effort away (and then repeat the process the
next time a roll-up is performed).

In a subsequent patch, we will add a '--max-new-filters=<n>' option,
which specifies an upper-bound on the number of new filters we are
willing to compute from scratch. Suppose that there are 'N' too-large
filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
unlikely that any filters will be generated, since we'll spend most of
our effort on filters that we ultimately throw away. If 'N < M', filters
will trickle in over time, but only at most 'M - N' per-write.
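To make that budget arithmetic concrete, here is a toy model (invented for illustration; this is not code from Git or from this series) of how a fixed '--max-new-filters' budget gets consumed by too-large filters that are recomputed and then discarded:

```c
#include <stddef.h>

/*
 * Toy model: out of 'candidates' commits needing new filters,
 * 'n_too_large' will turn out to be too large and be thrown away.
 * Pessimistically assume those are encountered first. Returns how
 * many useful filters survive within a budget of 'max_new_filters'.
 */
static size_t useful_filters_computed(size_t candidates, size_t n_too_large,
				      size_t max_new_filters)
{
	size_t computed = 0, useful = 0, i;

	for (i = 0; i < candidates && computed < max_new_filters; i++) {
		computed++;		/* each candidate costs one unit of budget */
		if (i >= n_too_large)	/* only later candidates yield usable filters */
			useful++;
	}
	return useful;
}
```

Under this pessimistic ordering, N = 600 too-large filters against a budget of M = 512 yields no useful filters at all, while N = 100 yields only M - N = 412 per write.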

To address this, add a new chunk which encodes a bitmap where the ith
bit is on iff the ith commit has zero or at least 512 changed paths.
Likewise store the maximum number of changed paths we are willing to
store in order to prepare for eventually making this value more easily
customizable. When computing Bloom filters, first consult the relevant
bitmap (in the case that we are rolling up existing layers) to see if
computing the Bloom filter from scratch would be a waste of time.

This patch implements a new chunk instead of extending the existing BIDX
and BDAT chunks because modifying these chunks would confuse old
clients. (Eg., setting the most-significant bit in the BIDX chunk would
confuse old clients and require a version bump).

To allow using the existing bitmap code with 64-bit words, we write the
data in network byte order from the 64-bit words. This means we also
need to read the array from the commit-graph file by translating each
word from network byte order using get_be64() when loading the commit
graph. (Note that this *could* be delayed until first-use, but a later
patch will rely on this being initialized early, so we assume the
up-front cost when parsing instead of delaying initialization).

By avoiding the need to move to new versions of the BDAT and BIDX chunk,
we can give ourselves more time to consider whether or not other
modifications to these chunks are worthwhile without holding up this
change.

Another approach would be to introduce a new BIDX chunk (say, one
identified by 'BID2') which is identical to the existing BIDX chunk,
except the most-significant bit of each offset is interpreted as "this
filter is too big" iff looking at a BID2 chunk. This avoids having to
write a bitmap, but forces older clients to rewrite their commit-graphs
(as well as reduces the theoretical largest Bloom filters we could
write, and forces us to maintain the code necessary to translate BIDX
chunks to BID2 ones). Separately from this patch, I implemented this
alternate approach and did not find it to be advantageous.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         | 12 +++
 bloom.h                                       |  2 +-
 commit-graph.c                                | 96 ++++++++++++++++---
 commit-graph.h                                |  4 +
 t/t4216-log-bloom.sh                          | 25 ++++-
 5 files changed, 124 insertions(+), 15 deletions(-)

Comments

Derrick Stolee Aug. 11, 2020, 9:11 p.m. UTC | #1
On 8/11/20 4:52 PM, Taylor Blau wrote:
> To allow using the existing bitmap code with 64-bit words, we write the
> data in network byte order from the 64-bit words. This means we also
> need to read the array from the commit-graph file by translating each
> word from network byte order using get_be64() when loading the commit
> graph. (Note that this *could* be delayed until first-use, but a later
> patch will rely on this being initialized early, so we assume the
> up-front cost when parsing instead of delaying initialization).

I don't think this is correct to do. This means that every commit-graph
load will load the entire large bloom filter chunk into memory before
allowing a single commit to be read from the graph.

In the case of a very large commit-graph (> 1 million commits), this
would cause a significant initial cost that is not necessarily needed
for anything. For example, "git log -1" will be delayed by this action.

If I understand correctly, this bloom_large bitmap is only used when
writing Bloom filters. At that point, reading the entire bitmap from
disk into memory is inexpensive compared to the time saved by the
feature.

> @@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
>  			}
>  			break;
> +
> +		case GRAPH_CHUNKID_BLOOMLARGE:
> +			if (graph->chunk_bloom_large_filters)
> +				chunk_repeated = 1;
> +			else if (r->settings.commit_graph_read_changed_paths) {

This is guarded against commitGraph.readChangedPaths, which is good,
but that's not enough to claim that we need the bloom_large bitmap
in this process.

> +				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);

If we store this inside the commit_graph struct, we can save this
size for later so...

> +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
> +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);

...this portion can be done only when we are about to read from the
bitmap.

> +				if (alloc) {
> +					size_t j;
> +					graph->bloom_large = bitmap_word_alloc(alloc);
> +
> +					for (j = 0; j < graph->bloom_large->word_alloc; j++)
> +						graph->bloom_large->words[j] = get_be64(
> +							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> +				}



> +			}
> +			break;
>  		}
>  
>  		if (chunk_repeated) {
> @@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  		/* We need both the bloom chunks to exist together. Else ignore the data */
>  		graph->chunk_bloom_indexes = NULL;
>  		graph->chunk_bloom_data = NULL;
> +		graph->chunk_bloom_large_filters = NULL;
>  		FREE_AND_NULL(graph->bloom_filter_settings);
> +		bitmap_free(graph->bloom_large);

Perhaps this bitmap_free() needs a check to see if we
initialized it (with my recommended lazy-loading), but
maybe not?

>  	}
>  
>  	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
> @@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
>  }
>  
> +static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> +					   const struct commit *c)
> +{
> +	uint32_t graph_pos = commit_graph_position(c);
> +	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> +		return 0;
> +
> +	while (g && graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	if (!(g && g->bloom_large))
> +		return 0;

Here is where we can check if the size of the chunk is
non-zero and if the g->bloom_large bitmap is uninitialized.
Since we are caring about this information, it is now time
to do the network-byte transition of the full bitmap.

> +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
> +}
>  
Thanks,
-Stolee
Taylor Blau Aug. 11, 2020, 9:18 p.m. UTC | #2
On Tue, Aug 11, 2020 at 05:11:49PM -0400, Derrick Stolee wrote:
> On 8/11/20 4:52 PM, Taylor Blau wrote:
> > To allow using the existing bitmap code with 64-bit words, we write the
> > data in network byte order from the 64-bit words. This means we also
> > need to read the array from the commit-graph file by translating each
> > word from network byte order using get_be64() when loading the commit
> > graph. (Note that this *could* be delayed until first-use, but a later
> > patch will rely on this being initialized early, so we assume the
> > up-front cost when parsing instead of delaying initialization).
>
> I don't think this is correct to do. This means that every commit-graph
> load will load the entire large bloom filter chunk into memory before
> allowing a single commit to be read from the graph.
>
> In the case of a very large commit-graph (> 1 million commits), this
> would cause a significant initial cost that is not necessarily needed
> for anything. For example, "git log -1" will be delayed by this action.
>
> If I understand correctly, this bloom_large bitmap is only used when
> writing Bloom filters. At that point, reading the entire bitmap from
> disk into memory is inexpensive compared to the time saved by the
> feature.

That's not quite correct. By this point in the patch series, we only use
this bitmap for writing, but in the final patch, we will use it in
'load_bloom_filter_from_graph()' (see that patch for the details of why,
there is an explanatory comment to that effect).

So, the way that this lazy initialization was written was subtly
incorrect, since calling 'load_bloom_filter_from_graph' would return the
wrong value depending on what was or wasn't called before it (namely
whether or not we had initialized the bitmap).

> > @@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
> >  			}
> >  			break;
> > +
> > +		case GRAPH_CHUNKID_BLOOMLARGE:
> > +			if (graph->chunk_bloom_large_filters)
> > +				chunk_repeated = 1;
> > +			else if (r->settings.commit_graph_read_changed_paths) {
>
> This is guarded against commitGraph.readChangedPaths, which is good,
> but that's not enough to claim that we need the bloom_large bitmap
> in this process.
>
> > +				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
>
> If we store this inside the commit_graph struct, we can save this
> size for later so...
>
> > +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
> > +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
>
> ...this portion can be done only when we are about to read from the
> bitmap.

Right, that all is clear.

>
> > +				if (alloc) {
> > +					size_t j;
> > +					graph->bloom_large = bitmap_word_alloc(alloc);
> > +
> > +					for (j = 0; j < graph->bloom_large->word_alloc; j++)
> > +						graph->bloom_large->words[j] = get_be64(
> > +							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> > +				}
>
>
>
> > +			}
> > +			break;
> >  		}
> >
> >  		if (chunk_repeated) {
> > @@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
> >  		/* We need both the bloom chunks to exist together. Else ignore the data */
> >  		graph->chunk_bloom_indexes = NULL;
> >  		graph->chunk_bloom_data = NULL;
> > +		graph->chunk_bloom_large_filters = NULL;
> >  		FREE_AND_NULL(graph->bloom_filter_settings);
> > +		bitmap_free(graph->bloom_large);
>
> Perhaps this bitmap_free() needs a check to see if we
> initialized it (with my recommended lazy-loading), but
> maybe not?

It does not ('bitmap_free()' understands that a NULL argument is a noop,
like 'free()').
>
> >  	}
> >
> >  	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
> > @@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
> >  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
> >  }
> >
> > +static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> > +					   const struct commit *c)
> > +{
> > +	uint32_t graph_pos = commit_graph_position(c);
> > +	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> > +		return 0;
> > +
> > +	while (g && graph_pos < g->num_commits_in_base)
> > +		g = g->base_graph;
> > +
> > +	if (!(g && g->bloom_large))
> > +		return 0;
>
> Here is where we can check if the size of the chunk is
> non-zero and if the g->bloom_large bitmap is uninitialized.
> Since we are caring about this information, it is now time
> to do the network-byte transition of the full bitmap.

Yes, that's right. But again, we'll need to move this out to a
non-static helper function so that we can call it from bloom.c in the
final patch.

I'm admittedly a little unsure of what to do next, since the changes are
scoped to this and the remaining patches (really just this and the final
patch).

I guess I'll try the other approach of replacing these two patches
by sending emails in response so that Junio can take those (I'll avoid
sending a fixup patch).

> > +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
> > +}
> >
> Thanks,
> -Stolee

Thanks,
Taylor
Taylor Blau Aug. 11, 2020, 10:05 p.m. UTC | #3
On Tue, Aug 11, 2020 at 05:18:15PM -0400, Taylor Blau wrote:
> On Tue, Aug 11, 2020 at 05:11:49PM -0400, Derrick Stolee wrote:
> > On 8/11/20 4:52 PM, Taylor Blau wrote:
> > > To allow using the existing bitmap code with 64-bit words, we write the
> > > data in network byte order from the 64-bit words. This means we also
> > > need to read the array from the commit-graph file by translating each
> > > word from network byte order using get_be64() when loading the commit
> > > graph. (Note that this *could* be delayed until first-use, but a later
> > > patch will rely on this being initialized early, so we assume the
> > > up-front cost when parsing instead of delaying initialization).
> >
> > I don't think this is correct to do. This means that every commit-graph
> > load will load the entire large bloom filter chunk into memory before
> > allowing a single commit to be read from the graph.
> >
> > In the case of a very large commit-graph (> 1 million commits), this
> > would cause a significant initial cost that is not necessarily needed
> > for anything. For example, "git log -1" will be delayed by this action.
> >
> > If I understand correctly, this bloom_large bitmap is only used when
> > writing Bloom filters. At that point, reading the entire bitmap from
> > disk into memory is inexpensive compared to the time saved by the
> > feature.
>
> That's not quite correct. By this point in the patch series, we only use
> this bitmap for writing, but in the final patch, we will use it in
> 'load_bloom_filter_from_graph()' (see that patch for the details of why,
> there is an explanatory comment to that effect).
>
> So, the way that this lazy initialization was written was subtly
> incorrect, since calling 'load_bloom_filter_from_graph' would return the
> wrong value depending on what was or wasn't called before it (namely
> whether or not we had initialized the bitmap).

Ack, this is totally wrong. I had the wrong impression of where you were
actually initializing the bitmaps, which is what motivated the change.
Whoops, totally my mistake. Here's a version of the patch that does what
it says.

There is a small adjustment that needs to be made to the final patch, so
I can send a version of that, too, if we decide that this is the
direction we want to go (and that there are no other changes to the
series). The change to the final patch basically boils down to using
'get_bloom_filter_large_in_graph()' instead of manually checking the
bitmap (so that the aforementioned function can do its initialization
ahead of time).

Thanks for catching this subtle issue, I'm glad that we got it resolved
before going further.

--- >8 ---

Subject: [PATCH] commit-graph: add large-filters bitmap chunk

When a commit has more than a certain number of changed paths (commonly
512), the commit-graph machinery represents it as a zero-length filter.
This is done since having many entries in the Bloom filter has
undesirable effects on the false positive rate.

In addition to these too-large filters, the commit-graph machinery also
represents commits with no filter and commits with no changed paths in
the same way.

When writing a commit-graph that aggregates several incremental
commit-graph layers (eg., with '--split=replace'), the commit-graph
machinery first computes all of the Bloom filters that it wants to write
but does not already know about from existing graph layers. Because we
overload the zero-length filter in the above fashion, this leads to
recomputing large filters over and over again.

This is already undesirable, since it means that we are wasting
considerable effort to discover that a commit has too many changed
paths, only to throw that effort away (and then repeat the process the
next time a roll-up is performed).

In a subsequent patch, we will add a '--max-new-filters=<n>' option,
which specifies an upper-bound on the number of new filters we are
willing to compute from scratch. Suppose that there are 'N' too-large
filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
unlikely that any filters will be generated, since we'll spend most of
our effort on filters that we ultimately throw away. If 'N < M', filters
will trickle in over time, but only at most 'M - N' per-write.

To address this, add a new chunk which encodes a bitmap where the ith
bit is on iff the ith commit has zero or at least 512 changed paths.
Likewise store the maximum number of changed paths we are willing to
store in order to prepare for eventually making this value more easily
customizable. When computing Bloom filters, first consult the relevant
bitmap (in the case that we are rolling up existing layers) to see if
computing the Bloom filter from scratch would be a waste of time.

This patch implements a new chunk instead of extending the existing BIDX
and BDAT chunks because modifying these chunks would confuse old
clients. (Eg., setting the most-significant bit in the BIDX chunk would
confuse old clients and require a version bump).

To allow using the existing bitmap code with 64-bit words, we write the
data in network byte order from the 64-bit words. This means we also
need to read the array from the commit-graph file by translating each
word from network byte order using get_be64() when loading the commit
graph. Initialize this bitmap lazily to avoid paying a linear-time cost
upon each commit-graph load even if we do not need the bitmaps
themselves.

By avoiding the need to move to new versions of the BDAT and BIDX chunk,
we can give ourselves more time to consider whether or not other
modifications to these chunks are worthwhile without holding up this
change.

Another approach would be to introduce a new BIDX chunk (say, one
identified by 'BID2') which is identical to the existing BIDX chunk,
except the most-significant bit of each offset is interpreted as "this
filter is too big" iff looking at a BID2 chunk. This avoids having to
write a bitmap, but forces older clients to rewrite their commit-graphs
(as well as reduces the theoretical largest Bloom filters we could
write, and forces us to maintain the code necessary to translate BIDX
chunks to BID2 ones). Separately from this patch, I implemented this
alternate approach and did not find it to be advantageous.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 .../technical/commit-graph-format.txt         |  12 +++
 bloom.h                                       |   2 +-
 commit-graph.c                                | 100 +++++++++++++++---
 commit-graph.h                                |  10 ++
 t/t4216-log-bloom.sh                          |  25 ++++-
 5 files changed, 134 insertions(+), 15 deletions(-)

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..5f2d9ab4d7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -123,6 +123,18 @@ CHUNK DATA:
       of length zero.
     * The BDAT chunk is present if and only if BIDX is present.

+  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
+    * It starts with a 32-bit unsigned integer specifying the maximum number of
+      changed-paths that can be stored in a single Bloom filter.
+    * It then contains a list of 64-bit words (the length of this list is
+      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
+      set exactly when the 'i'th commit in the graph has a changed-path Bloom
+      filter with zero entries (either because the commit is empty, or because
+      it contains more than 512 changed paths).
+    * The BFXL chunk is present only when the BIDX and BDAT chunks are
+      also present.
+
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/bloom.h b/bloom.h
index 3f19e3fca4..464d9b57de 100644
--- a/bloom.h
+++ b/bloom.h
@@ -33,7 +33,7 @@ struct bloom_filter_settings {
 	 * The maximum number of changed paths per commit
 	 * before declaring a Bloom filter to be too-large.
 	 *
-	 * Not written to the commit-graph file.
+	 * Written to the 'BFXL' chunk (instead of 'BDAT').
 	 */
 	uint32_t max_changed_paths;
 };
diff --git a/commit-graph.c b/commit-graph.c
index 8964453433..4ccc7a3e56 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -41,8 +41,9 @@ void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8

 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)

@@ -429,6 +430,16 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
 			}
 			break;
+
+		case GRAPH_CHUNKID_BLOOMLARGE:
+			if (graph->chunk_bloom_large_filters)
+				chunk_repeated = 1;
+			else if (r->settings.commit_graph_read_changed_paths) {
+				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
+				graph->bloom_large_alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
+				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
+			}
+			break;
 		}

 		if (chunk_repeated) {
@@ -443,6 +454,8 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 		/* We need both the bloom chunks to exist together. Else ignore the data */
 		graph->chunk_bloom_indexes = NULL;
 		graph->chunk_bloom_data = NULL;
+		graph->chunk_bloom_large_filters = NULL;
+		graph->bloom_large_alloc = 0;
 		FREE_AND_NULL(graph->bloom_filter_settings);
 	}

@@ -932,6 +945,32 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }

+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c)
+{
+	uint32_t graph_pos = commit_graph_position(c);
+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+		return 0;
+
+	while (g && graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!g)
+		return 0;
+
+	if (!g->bloom_large && g->bloom_large_alloc) {
+		size_t i;
+		g->bloom_large = bitmap_word_alloc(g->bloom_large_alloc);
+
+		for (i = 0; i < g->bloom_large->word_alloc; i++)
+			g->bloom_large->words[i] = get_be64(
+				g->chunk_bloom_large_filters + i * sizeof(eword_t));
+	}
+
+	if (!g->bloom_large)
+		return 0;
+	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
+}

 struct packed_oid_list {
 	struct object_id *list;
@@ -970,8 +1009,10 @@ struct write_commit_graph_context {
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;

+	int count_bloom_filter_known_large;
 	int count_bloom_filter_found_large;
 	int count_bloom_filter_computed;
+	struct bitmap *bloom_large;
 };

 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1235,6 +1276,23 @@ static int write_graph_chunk_bloom_data(struct hashfile *f,
 	return 0;
 }

+static int write_graph_chunk_bloom_large(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
+{
+	size_t i, alloc = ctx->commits.nr / BITS_IN_EWORD;
+	if (ctx->commits.nr % BITS_IN_EWORD)
+		alloc++;
+	if (alloc > ctx->bloom_large->word_alloc)
+		BUG("write_graph_chunk_bloom_large: bitmap not large enough");
+
+	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
+	hashwrite_be32(f, ctx->bloom_settings->max_changed_paths);
+	for (i = 0; i < ctx->bloom_large->word_alloc; i++)
+		hashwrite_be64(f, ctx->bloom_large->words[i]);
+	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1398,6 +1456,8 @@ static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 	struct json_writer jw = JSON_WRITER_INIT;

 	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_known_large",
+			 ctx->count_bloom_filter_known_large);
 	jw_object_intmax(&jw, "filter_found_large",
 			 ctx->count_bloom_filter_found_large);
 	jw_object_intmax(&jw, "filter_computed",
@@ -1416,6 +1476,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int *sorted_commits;

 	init_bloom_filters();
+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);

 	if (ctx->report_progress)
 		progress = start_delayed_progress(
@@ -1430,21 +1491,28 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		&ctx->commits);

 	for (i = 0; i < ctx->commits.nr; i++) {
-		int computed = 0;
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		struct bloom_filter *filter = get_or_compute_bloom_filter(
-			ctx->r,
-			c,
-			1,
-			ctx->bloom_settings,
-			&computed);
-		if (computed) {
-			ctx->count_bloom_filter_computed++;
-			if (filter && !filter->len)
-				ctx->count_bloom_filter_found_large++;
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+			bitmap_set(ctx->bloom_large, pos);
+			ctx->count_bloom_filter_known_large++;
+		} else {
+			int computed = 0;
+			struct bloom_filter *filter = get_or_compute_bloom_filter(
+				ctx->r,
+				c,
+				1,
+				ctx->bloom_settings,
+				&computed);
+			if (computed) {
+				ctx->count_bloom_filter_computed++;
+				if (filter && !filter->len) {
+					bitmap_set(ctx->bloom_large, pos);
+					ctx->count_bloom_filter_found_large++;
+				}
+			}
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}

@@ -1764,6 +1832,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 					  + ctx->total_bloom_filter_data_size;
 		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
+		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc
+					+ sizeof(uint32_t);
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
+		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
@@ -2503,6 +2576,7 @@ void free_commit_graph(struct commit_graph *g)
 	}
 	free(g->filename);
 	free(g->bloom_filter_settings);
+	bitmap_free(g->bloom_large);
 	free(g);
 }

diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..9afb1477d5 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@
 #include "git-compat-util.h"
 #include "object-store.h"
 #include "oidset.h"
+#include "ewah/ewok.h"

 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
@@ -50,6 +51,9 @@ void load_commit_graph_info(struct repository *r, struct commit *item);
 struct tree *get_commit_tree_in_graph(struct repository *r,
 				      const struct commit *c);

+int get_bloom_filter_large_in_graph(struct commit_graph *g,
+				    const struct commit *c);
+
 struct commit_graph {
 	const unsigned char *data;
 	size_t data_len;
@@ -71,6 +75,10 @@ struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	const unsigned char *chunk_bloom_large_filters;
+
+	struct bitmap *bloom_large;
+	size_t bloom_large_alloc;

 	struct bloom_filter_settings *bloom_filter_settings;
 };
@@ -83,6 +91,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size);

+void prepare_commit_graph_bloom_large(struct commit_graph *g);
+
 /*
  * Return 1 if and only if the repository has a commit-graph
  * file and generation numbers are computed in that file.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 21b67677ef..6859d85369 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
@@ -262,5 +262,28 @@ test_expect_success 'correctly report changes over limit' '
 		done
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
+	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation does not recompute too-large filters' '
+	(
+		cd limits &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			git commit-graph write --reachable --changed-paths \
+			--split=replace &&
+		test_commit c1 filter &&
+
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			2 0 1
+	)
+'

 test_done
--
2.28.0.rc1.13.ge78abce653
SZEDER Gábor Aug. 19, 2020, 1:35 p.m. UTC | #4
On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> When a commit has more than a certain number of changed paths (commonly
> 512), the commit-graph machinery represents it as a zero-length filter.
> This is done since having many entries in the Bloom filter has
> undesirable effects on the false positive rate.

This is not the case, the false positive probability depends on the
ratio of the Bloom filter's size and the number of elements it
contains, and we size the filters proportional to the number of
elements they contain, so the number of elements shouldn't affect the 
false positive rate.

On the contrary, it's the small filters, up to around 30-35 bytes, 
that tend to have larger than expected false positive rate when using
double hashing.
SZEDER Gábor Sept. 1, 2020, 2:35 p.m. UTC | #5
On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> When a commit has more than a certain number of changed paths (commonly
> 512), the commit-graph machinery represents it as a zero-length filter.
> This is done since having many entries in the Bloom filter has
> undesirable effects on the false positivity rate.
> 
> In addition to these too-large filters, the commit-graph machinery also
> represents commits with no filter and commits with no changed paths in
> the same way.
> 
> When writing a commit-graph that aggregates several incremental
> commit-graph layers (eg., with '--split=replace'), the commit-graph
> machinery first computes all of the Bloom filters that it wants to write
> but does not already know about from existing graph layers. Because we
> overload the zero-length filter in the above fashion, this leads to
> recomputing large filters over and over again.
> 
> This is already undesirable, since it means that we are wasting
> considerable effort to discover that a commit with too many changed
> paths, only to throw that effort away (and then repeat the process the
> next time a roll-up is performed).

Is this really the case?

If a commit has a corresponding entry in the Bloom filters chunks,
then the commit-graph machinery does know the Bloom filter associated
with that commit.  The size of that filter should not matter, i.e.
even if it is a zero-length filter, the commit-graph machinery should
know about it all the same.  And as far as I can tell it does indeed,
because load_bloom_filter_from_graph() sets a non-NULL 'filter->data'
pointer even if 'filter->len' is zero, which get_bloom_filter() treats
as "we know about this", and returns early without (re)computing the
filter.  Even the test 'Bloom generation does not recompute too-large
filters' added in this patch is expected to succeed, and my
superficial and makeshift testing seems to corroborate this; at least
I couldn't find a combination of commands and options that would
recompute any existing zero-length Bloom filters.

Am I missing something?
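For readers without the source at hand, the convention being described can be modeled like this (a simplified stand-in, not the actual bloom.c code; the real functions are load_bloom_filter_from_graph() and get_bloom_filter()):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model of Git's "struct bloom_filter". */
struct bloom_filter {
	unsigned char *data; /* NULL => never loaded or computed */
	size_t len;          /* may be 0 even when data is non-NULL */
};

/*
 * Mirrors the early return described above: a zero-length filter
 * whose data pointer was set by the loader still counts as "known",
 * so it is not recomputed.
 */
static int filter_needs_computing(const struct bloom_filter *filter)
{
	return !filter->data;
}
```

Under this model a zero-length filter loaded from an existing graph layer is treated as known and skipped, matching the observed behavior.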

> In a subsequent patch, we will add a '--max-new-filters=<n>' option,
> which specifies an upper-bound on the number of new filters we are
> willing to compute from scratch. Suppose that there are 'N' too-large
> filters, and we specify '--max-new-filters=M'. If 'N >= M', it is
> unlikely that any filters will be generated, since we'll spend most of
> our effort on filters that we ultimately throw away. If 'N < M', filters
> will trickle in over time, but only at most 'M - N' per-write.
> 
> To address this, add a new chunk which encodes a bitmap where the ith
> bit is on iff the ith commit has zero or at least 512 changed paths.
> Likewise store the maximum number of changed paths we are willing to
> store in order to prepare for eventually making this value more easily
> customizable.

I don't understand how storing this value would make it any easier to
customize.

> When computing Bloom filters, first consult the relevant
> bitmap (in the case that we are rolling up existing layers) to see if
> computing the Bloom filter from scratch would be a waste of time.
> 
> This patch implements a new chunk instead of extending the existing BIDX
> and BDAT chunks because modifying these chunks would confuse old
> clients. (Eg., setting the most-significant bit in the BIDX chunk would
> confuse old clients and require a version bump).
> 
> To allow using the existing bitmap code with 64-bit words, we write the
> data in network byte order from the 64-bit words. This means we also
> need to read the array from the commit-graph file by translating each
> word from network byte order using get_be64() when loading the commit
> graph. (Note that this *could* be delayed until first-use, but a later
> patch will rely on this being initialized early, so we assume the
> up-front cost when parsing instead of delaying initialization).
> 
> By avoiding the need to move to new versions of the BDAT and BIDX chunk,
> we can give ourselves more time to consider whether or not other
> modifications to these chunks are worthwhile without holding up this
> change.
> 
> Another approach would be to introduce a new BIDX chunk (say, one
> identified by 'BID2') which is identical to the existing BIDX chunk,
> except the most-significant bit of each offset is interpreted as "this
> filter is too big" iff looking at a BID2 chunk. This avoids having to
> write a bitmap, but forces older clients to rewrite their commit-graphs
> (as well as reduces the theoretical largest Bloom filters we couldl

And it reduces the max possible size of the BDAT chunk, and thus the
max number of commits with Bloom filters as well.

s/couldl/could/

> write, and forces us to maintain the code necessary to translate BIDX
> chunks to BID2 ones). Separately from this patch, I implemented this
> alternate approach and did not find it to be advantageous.

Let's take a step back to reconsider what should be stored in this
bitmap for a moment.  Sure, setting a bit for each commit that doesn't
modify any paths or modifies too many makes it possible to reliably
identify commits that don't have Bloom filters yet.  But isn't that a
rather roundabout way?  I think it would be better to directly track
which commits don't have Bloom filters yet.  IOW what you really want
is a, say, BNCY "Bloom filter Not Computed Yet" chunk, where we set
the corresponding bit for each commit which has an entry in the BIDX
chunk but for which a Bloom filter hasn't been computed yet.

  - It's simpler and easier to explain (IMO).

  - This bitmap chunk can easily be made optional: if all Bloom
    filters have been computed, then the bitmap will contain all
    zeros.  So why bother writing it, when we can save a bit of space
    instead?

  - It avoids the unpleasantness of setting a bit in the _Large_ Bloom
    Filters chunks for commits _not_ modifying any paths.

  - Less incentive to spill implementation details to the format
    specification (e.g. 512 modified paths).

Now, let's take another step back: is such a bitmap really necessary?
We could write a single-byte Bloom filter with no bits set for commits
not modifying any paths, and a single-byte Bloom filter with all bits
set for commits modifying too many paths.  This is compatible with the
specs and any existing implementation should do the right thing when
reading such filters.  This would allow us to interpret zero-length
filters as "not computed yet", and if that bitmap chunk won't be
optional, then this would save space as long as less than 1/8 of
commits modify no or too many paths.  Unfortunately, however, all
existing zero-length Bloom filters have to be recomputed.
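The sentinel idea works because of how a Bloom filter answers queries; a toy membership check (a hypothetical helper, not Git's actual query code) makes it visible:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy query: a path "maybe" matches iff every one of its k hashed bit
 * positions is set in the filter. Positions are taken mod the
 * filter's size in bits, as in Git's Bloom filter code.
 */
static int maybe_contains(const unsigned char *filter, size_t len,
			  const size_t *bit_positions, size_t k)
{
	size_t i;
	for (i = 0; i < k; i++) {
		size_t bit = bit_positions[i] % (len * 8);
		if (!(filter[bit / 8] & (1 << (bit % 8))))
			return 0; /* definitely not present */
	}
	return 1; /* maybe present */
}
```

A single 0x00 byte therefore answers "definitely not" for every path (commit modified nothing), while a single 0xff byte answers "maybe" for every path (too many changed paths), and neither requires any format change for existing readers.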


> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  .../technical/commit-graph-format.txt         | 12 +++
>  bloom.h                                       |  2 +-
>  commit-graph.c                                | 96 ++++++++++++++++---
>  commit-graph.h                                |  4 +
>  t/t4216-log-bloom.sh                          | 25 ++++-
>  5 files changed, 124 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 440541045d..5f2d9ab4d7 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -123,6 +123,18 @@ CHUNK DATA:
>        of length zero.
>      * The BDAT chunk is present if and only if BIDX is present.
>  
> +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
> +    * It starts with a 32-bit unsigned integer specifying the maximum number of
> +      changed-paths that can be stored in a single Bloom filter.

Should this value be the same in all elements of a commit-graph chain?
Note that in this case having different values won't affect revision
walks using modified path Bloom filters.

> +    * It then contains a list of 64-bit words (the length of this list is
> +      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
> +      set exactly when the 'i'th commit in the graph has a changed-path Bloom
> +      filter with zero entries (either because the commit is empty, or because
> +      it contains more than 512 changed paths).

Please make clear the byte order of these 64 bit words in the specs as
well.

Furthermore, that 512 path limit is an implementation detail, so it
would be better if it didn't leak into the specification of this new
chunk.
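For reference, the byte-order contract the patch adopts is plain network byte order per 64-bit word; a stand-alone decoder sketch follows (Git's real helper is get_be64() from compat/bswap.h, which the patch already uses):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read a 64-bit big-endian (network byte order) word, which is how
 * the BFXL bitmap words are stored on disk in this patch.
 */
static uint64_t read_be64(const unsigned char *p)
{
	uint64_t v = 0;
	size_t i;
	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}
```

The loader then copies each decoded word into the in-memory bitmap, so bitmap operations behave identically on little- and big-endian hosts.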

> +    * The BFXL chunk is present only when the BIDX and BDAT chunks are
> +      also present.
> +
> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this



> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index 21b67677ef..6859d85369 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -33,7 +33,7 @@ test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
>  	git commit-graph write --reachable --changed-paths
>  '
>  graph_read_expect () {
> -	NUM_CHUNKS=5
> +	NUM_CHUNKS=6
>  	cat >expect <<- EOF
>  	header: 43475048 1 1 $NUM_CHUNKS 0
>  	num_commits: $1
> @@ -262,5 +262,28 @@ test_expect_success 'correctly report changes over limit' '
>  		done
>  	)
>  '
> +test_bloom_filters_computed () {
> +	commit_graph_args=$1
> +	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
> +	rm -f "$TRASH_DIRECTORY/trace.event" &&
> +	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
> +	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
> +}
> +
> +test_expect_success 'Bloom generation does not recompute too-large filters' '
> +	(
> +		cd limits &&
> +
> +		# start from scratch and rebuild
> +		rm -f .git/objects/info/commit-graph &&
> +		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
> +			git commit-graph write --reachable --changed-paths \
> +			--split=replace &&
> +		test_commit c1 filter &&
> +
> +		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
> +			2 0 1
> +	)
> +'
>  
>  test_done
> -- 
> 2.28.0.rc1.13.ge78abce653
>
Taylor Blau Sept. 2, 2020, 8:23 p.m. UTC | #6
(Finally getting back to this review after working on some other topics
for a while... sorry for the late response.)

On Wed, Aug 19, 2020 at 03:35:26PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> > When a commit has more than a certain number of changed paths (commonly
> > 512), the commit-graph machinery represents it as a zero-length filter.
> > This is done since having many entries in the Bloom filter has
> > undesirable effects on the false positivity rate.
>
> This is not the case, the false positive probability depends on the
> ratio of the Bloom filter's size and the number of elements it
> contains, and we size the filters proportional to the number of
> elements they contain, so the number of elements shouldn't affect the
> false positive rate.

I'm not sure that I understand. I agree that the FPR depends on the
ratio between the number of elements in the filter and the filter's
"size". But, consider a Bloom filter that is too small to faithfully
represent all its elements. Such a filter would likely have all its bits
set high, in which case every query would return "maybe", and the FPR
would go up.

> On the contrary, it's the small filters, up to around 30-35 bytes,
> that tend to have larger than expected false positive rate when using
> double hashing.

I agree that small filters suffer from the same, but I think this is an
"in addition" not an "on the contrary".

In either case, I don't think that this is an important detail for the
commit message. What matters is the representation (that we truncate >=
512 elements to a length-zero filter), not why (that can be found in
another commit). I'd have expected to find the rationale in ed591febb4
(bloom.c: core Bloom filter implementation for changed paths.,
2020-03-30), but I couldn't find anything there.

So, I'll drop this sentence entirely to avoid an unimportant detail.

Thanks,
Taylor
Taylor Blau Sept. 2, 2020, 8:40 p.m. UTC | #7
On Tue, Sep 01, 2020 at 04:35:46PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> > This is already undesirable, since it means that we are wasting
> > considerable effort to discover that a commit with too many changed
> > paths, only to throw that effort away (and then repeat the process the
> > next time a roll-up is performed).
>
> Is this really the case?
>
> If a commit has a corresponding entry in the Bloom filters chunks,
> then the commit-graph machinery does know the Bloom filter associated
> with that commit.  The size of that filter should not matter, i.e.
> even if it is a zero-length filter, the commit-graph machinery should
> know about it all the same.  And as far as I can tell it does indeed,
> because load_bloom_filter_from_graph() sets a non-NULL 'filter->data'
> pointer even if 'filter->len' is zero, which get_bloom_filter() treats
> as "we know about this", and returns early without (re)computing the
> filter.  Even the test 'Bloom generation does not recompute too-large
> filters' added in this patch is expected to succeed, and my
> superficial and makeshift testing seems to corroborate this; at least
> I couldn't find a combination of commands and options that would
> recompute any existing zero-length Bloom filters.
>
> Am I missing something?

I'm not sure that this would work, or at least it creates another
pair of cases that we have to disambiguate. Here's what I'm thinking
after reading what you wrote:

  - By this point in the series, we expect every commit in a
    commit-graph layer to have a corresponding entry in the 'BIDX' chunk
    (if one exists, obviously all of this is meaningless without
    specifying '--changed-paths').

  - In a future commit (specifically, "builtin/commit-graph.c: introduce
    '--max-new-filters=<n>'"), this will no longer be the case.

  - So, by that point in the series, we would have two possible reasons
    why a commit did not show up in the BIDX. Either:

      * The commit has too many changed paths, in which case we don't
        want to write it down, or

      * We had already computed too many new filters, and so didn't have
        the budget to compute the filter for some commit(s).

In the first of those two cases, we want to avoid recomputing the
too-large filter. But, in the latter case, we *do* want to compute the
filter from scratch, since we want to "backfill" commits with missing
filters by letting them trickle in over time.

I might be missing something, too, but I think that those two cases are
just as indistinguishable as what's in the commit-graph today (without
the 'BFXL' chunk).
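The distinction being drawn here can be summarized as three states per commit (a schematic, hypothetical enum; the real code infers these from the BIDX/BDAT/BFXL chunks rather than storing them):

```c
#include <assert.h>

/* Hypothetical summary of a commit's filter state once BFXL exists;
 * not actual Git code. */
enum filter_state {
	FILTER_COMPUTED,    /* a real filter exists in BIDX/BDAT */
	FILTER_KNOWN_LARGE, /* BFXL bit set: empty or too many paths; skip */
	FILTER_NOT_COMPUTED /* skipped (e.g. by --max-new-filters); retry later */
};

static enum filter_state classify(int has_filter_data, int bfxl_bit)
{
	if (bfxl_bit)
		return FILTER_KNOWN_LARGE;
	return has_filter_data ? FILTER_COMPUTED : FILTER_NOT_COMPUTED;
}
```

Without the BFXL bit, the second and third states collapse into one, which is exactly the ambiguity described above.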

> > To address this, add a new chunk which encodes a bitmap where the ith
> > bit is on iff the ith commit has zero or at least 512 changed paths.
> > Likewise store the maximum number of changed paths we are willing to
> > store in order to prepare for eventually making this value more easily
> > customizable.
>
> I don't understand how storing this value would make it any easier to
> customize.

It doesn't, but it's also not meant to. The idea is that if the value
ever were easy to customize, then we could keep track of which layers
were written with which threshold. This would allow you to invent fancy
rules like dropping filters when the threshold lowers, or recomputing
ones that you wouldn't have otherwise when it goes up.

> > Another approach would be to introduce a new BIDX chunk (say, one
> > identified by 'BID2') which is identical to the existing BIDX chunk,
> > except the most-significant bit of each offset is interpreted as "this
> > filter is too big" iff looking at a BID2 chunk. This avoids having to
> > write a bitmap, but forces older clients to rewrite their commit-graphs
> > (as well as reduces the theoretical largest Bloom filters we couldl
>
> And it reduces the max possible size of the BDAT chunk, and thus the
> max number of commits with Bloom filters as well.

Fair point.

> s/couldl/could/

Oops, thanks.

> > write, and forces us to maintain the code necessary to translate BIDX
> > chunks to BID2 ones). Separately from this patch, I implemented this
> > alternate approach and did not find it to be advantageous.
>
> Let's take a step back to reconsider what should be stored in this
> bitmap for a moment.  Sure, setting a bit for each commit that doesn't
> modify any paths or modifies too many makes it possible to reliably
> identify commits that don't have Bloom filters yet.  But isn't that a
> rather roundabout way?  I think it would be better to directly track
> which commits don't have Bloom filters yet.  IOW what you really want
> is a, say, BNCY "Bloom filter Not Computed Yet" chunk, where we set
> the corresponding bit for each commit which has an entry in the BIDX
> chunk but for which a Bloom filter hasn't been computed yet.
>
>   - It's simpler and easier to explain (IMO).
>
>   - This bitmap chunk can easily be made optional: if all Bloom
>     filters have been computed, then the bitmap will contain all
>     zeros.  So why bother writing it, when we can save a bit of space
>     instead?

I don't think that this is true. Omitting the chunk (because you have
computed--or tried to compute--filters for every commit in the graph)
isn't distinguishable from what exists today, so we are back at square
one there.

>   - It avoids the unpleasantness of setting a bit in the _Large_ Bloom
>     Filters chunks for commits _not_ modifying any paths.

I agree it's unpleasant, but I also don't think it's a show-stopper.

>   - Less incentive to spill implementation details to the format
>     specification (e.g. 512 modified paths).
>
> Now, let's take another step back: is such a bitmap really necessary?
> We could write a single-byte Bloom filter with no bits set for commits
> not modifying any paths, and a single-byte Bloom filter with all bits
> set for commits modifying too many paths.  This is compatible with the
> specs and any existing implementation should do the right thing when
> reading such filters.  This would allow us to interpret zero-length
> filters as "not computed yet", and if that bitmap chunk won't be
> optional, then this would save space as long as less than 1/8 of
> commits modify no or too many paths.  Unfortunately, however, all
> existing zero-length Bloom filters have to be recomputed.

I think this is a semantic difference: either you store a bitmap, or a
Bloom filter containing the same data. To me, I don't think there's a
huge difference, since we're talking about 1 bit per commit. If we were
really worried, we could store them as EWAH-compressed bitmaps, but I
don't get a sense that such a concern exists.

I do feel strongly about using a non-probabilistic data structure,
though, since the point of this feature is to be able to make tighter
guarantees about the runtime of 'git commit-graph write'.

>
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > ---
> >  .../technical/commit-graph-format.txt         | 12 +++
> >  bloom.h                                       |  2 +-
> >  commit-graph.c                                | 96 ++++++++++++++++---
> >  commit-graph.h                                |  4 +
> >  t/t4216-log-bloom.sh                          | 25 ++++-
> >  5 files changed, 124 insertions(+), 15 deletions(-)
> >
> > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> > index 440541045d..5f2d9ab4d7 100644
> > --- a/Documentation/technical/commit-graph-format.txt
> > +++ b/Documentation/technical/commit-graph-format.txt
> > @@ -123,6 +123,18 @@ CHUNK DATA:
> >        of length zero.
> >      * The BDAT chunk is present if and only if BIDX is present.
> >
> > +  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
> > +    * It starts with a 32-bit unsigned integer specifying the maximum number of
> > +      changed-paths that can be stored in a single Bloom filter.
>
> Should this value be the same in all elements of a commit-graph chain?
> Note that in this case having different values won't affect revision
> walks using modified path Bloom filters.

I don't think it needs to be the same necessarily.

> > +    * It then contains a list of 64-bit words (the length of this list is
> > +      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
> > +      set exactly when the 'i'th commit in the graph has a changed-path Bloom
> > +      filter with zero entries (either because the commit is empty, or because
> > +      it contains more than 512 changed paths).
>
> Please make clear the byte order of these 64 bit words in the specs as
> well.
>
> Furthermore, that 512 path limit is an implementation detail, so it
> would be better if it didn't leak into the specification of this new
> chunk.

Addressed both, thanks.

Thanks,
Taylor

Patch

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 440541045d..5f2d9ab4d7 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -123,6 +123,18 @@  CHUNK DATA:
       of length zero.
     * The BDAT chunk is present if and only if BIDX is present.
 
+  Large Bloom Filters (ID: {'B', 'F', 'X', 'L'}) [Optional]
+    * It starts with a 32-bit unsigned integer specifying the maximum number of
+      changed-paths that can be stored in a single Bloom filter.
+    * It then contains a list of 64-bit words (the length of this list is
+      determined by the width of the chunk) which is a bitmap. The 'i'th bit is
+      set exactly when the 'i'th commit in the graph has a changed-path Bloom
+      filter with zero entries (either because the commit is empty, or because
+      it contains more than 512 changed paths).
+    * The BFXL chunk is present only when the BIDX and BDAT chunks are
+      also present.
+
+
   Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
       This list of H-byte hashes describe a set of B commit-graph files that
       form a commit-graph chain. The graph position for the ith commit in this
diff --git a/bloom.h b/bloom.h
index 3f19e3fca4..464d9b57de 100644
--- a/bloom.h
+++ b/bloom.h
@@ -33,7 +33,7 @@  struct bloom_filter_settings {
 	 * The maximum number of changed paths per commit
 	 * before declaring a Bloom filter to be too-large.
 	 *
-	 * Not written to the commit-graph file.
+	 * Written to the 'BFXL' chunk (instead of 'BDAT').
 	 */
 	uint32_t max_changed_paths;
 };
diff --git a/commit-graph.c b/commit-graph.c
index 8964453433..ea0583298c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -41,8 +41,9 @@  void git_test_write_commit_graph_or_die(void)
 #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
 #define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
 #define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
+#define GRAPH_CHUNKID_BLOOMLARGE 0x4246584c /* "BFXL" */
 #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
-#define MAX_NUM_CHUNKS 7
+#define MAX_NUM_CHUNKS 8
 
 #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
 
@@ -429,6 +430,24 @@  struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
 			}
 			break;
+
+		case GRAPH_CHUNKID_BLOOMLARGE:
+			if (graph->chunk_bloom_large_filters)
+				chunk_repeated = 1;
+			else if (r->settings.commit_graph_read_changed_paths) {
+				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);
+				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
+				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);
+				if (alloc) {
+					size_t j;
+					graph->bloom_large = bitmap_word_alloc(alloc);
+
+					for (j = 0; j < graph->bloom_large->word_alloc; j++)
+						graph->bloom_large->words[j] = get_be64(
+							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
+				}
+			}
+			break;
 		}
 
 		if (chunk_repeated) {
@@ -443,7 +462,9 @@  struct commit_graph *parse_commit_graph(struct repository *r,
 		/* We need both the bloom chunks to exist together. Else ignore the data */
 		graph->chunk_bloom_indexes = NULL;
 		graph->chunk_bloom_data = NULL;
+		graph->chunk_bloom_large_filters = NULL;
 		FREE_AND_NULL(graph->bloom_filter_settings);
+		bitmap_free(graph->bloom_large);
 	}
 
 	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
@@ -932,6 +953,20 @@  struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
 	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
 }
 
+static int get_bloom_filter_large_in_graph(struct commit_graph *g,
+					   const struct commit *c)
+{
+	uint32_t graph_pos = commit_graph_position(c);
+	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
+		return 0;
+
+	while (g && graph_pos < g->num_commits_in_base)
+		g = g->base_graph;
+
+	if (!(g && g->bloom_large))
+		return 0;
+	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
+}
 
 struct packed_oid_list {
 	struct object_id *list;
@@ -970,8 +1005,10 @@  struct write_commit_graph_context {
 	size_t total_bloom_filter_data_size;
 	const struct bloom_filter_settings *bloom_settings;
 
+	int count_bloom_filter_known_large;
 	int count_bloom_filter_found_large;
 	int count_bloom_filter_computed;
+	struct bitmap *bloom_large;
 };
 
 static int write_graph_chunk_fanout(struct hashfile *f,
@@ -1235,6 +1272,23 @@  static int write_graph_chunk_bloom_data(struct hashfile *f,
 	return 0;
 }
 
+static int write_graph_chunk_bloom_large(struct hashfile *f,
+					 struct write_commit_graph_context *ctx)
+{
+	size_t i, alloc = ctx->commits.nr / BITS_IN_EWORD;
+	if (ctx->commits.nr % BITS_IN_EWORD)
+		alloc++;
+	if (alloc > ctx->bloom_large->word_alloc)
+		BUG("write_graph_chunk_bloom_large: bitmap not large enough");
+
+	trace2_region_enter("commit-graph", "bloom_large", ctx->r);
+	hashwrite_be32(f, ctx->bloom_settings->max_changed_paths);
+	for (i = 0; i < ctx->bloom_large->word_alloc; i++)
+		hashwrite_be64(f, ctx->bloom_large->words[i]);
+	trace2_region_leave("commit-graph", "bloom_large", ctx->r);
+	return 0;
+}
+
 static int oid_compare(const void *_a, const void *_b)
 {
 	const struct object_id *a = (const struct object_id *)_a;
@@ -1398,6 +1452,8 @@  static void trace2_bloom_filter_write_statistics(struct write_commit_graph_conte
 	struct json_writer jw = JSON_WRITER_INIT;
 
 	jw_object_begin(&jw, 0);
+	jw_object_intmax(&jw, "filter_known_large",
+			 ctx->count_bloom_filter_known_large);
 	jw_object_intmax(&jw, "filter_found_large",
 			 ctx->count_bloom_filter_found_large);
 	jw_object_intmax(&jw, "filter_computed",
@@ -1416,6 +1472,7 @@  static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 	int *sorted_commits;
 
 	init_bloom_filters();
+	ctx->bloom_large = bitmap_word_alloc(ctx->commits.nr / BITS_IN_EWORD + 1);
 
 	if (ctx->report_progress)
 		progress = start_delayed_progress(
@@ -1430,21 +1487,28 @@  static void compute_bloom_filters(struct write_commit_graph_context *ctx)
 		&ctx->commits);
 
 	for (i = 0; i < ctx->commits.nr; i++) {
-		int computed = 0;
 		int pos = sorted_commits[i];
 		struct commit *c = ctx->commits.list[pos];
-		struct bloom_filter *filter = get_or_compute_bloom_filter(
-			ctx->r,
-			c,
-			1,
-			ctx->bloom_settings,
-			&computed);
-		if (computed) {
-			ctx->count_bloom_filter_computed++;
-			if (filter && !filter->len)
-				ctx->count_bloom_filter_found_large++;
+		if (get_bloom_filter_large_in_graph(ctx->r->objects->commit_graph, c)) {
+			bitmap_set(ctx->bloom_large, pos);
+			ctx->count_bloom_filter_known_large++;
+		} else {
+			int computed = 0;
+			struct bloom_filter *filter = get_or_compute_bloom_filter(
+				ctx->r,
+				c,
+				1,
+				ctx->bloom_settings,
+				&computed);
+			if (computed) {
+				ctx->count_bloom_filter_computed++;
+				if (filter && !filter->len) {
+					bitmap_set(ctx->bloom_large, pos);
+					ctx->count_bloom_filter_found_large++;
+				}
+			}
+			ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		}
-		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
 		display_progress(progress, i + 1);
 	}
 
@@ -1764,6 +1828,11 @@  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 					  + ctx->total_bloom_filter_data_size;
 		chunks[num_chunks].write_fn = write_graph_chunk_bloom_data;
 		num_chunks++;
+		chunks[num_chunks].id = GRAPH_CHUNKID_BLOOMLARGE;
+		chunks[num_chunks].size = sizeof(eword_t) * ctx->bloom_large->word_alloc
+					+ sizeof(uint32_t);
+		chunks[num_chunks].write_fn = write_graph_chunk_bloom_large;
+		num_chunks++;
 	}
 	if (ctx->num_commit_graphs_after > 1) {
 		chunks[num_chunks].id = GRAPH_CHUNKID_BASE;
@@ -2503,6 +2572,7 @@  void free_commit_graph(struct commit_graph *g)
 	}
 	free(g->filename);
 	free(g->bloom_filter_settings);
+	bitmap_free(g->bloom_large);
 	free(g);
 }
 
diff --git a/commit-graph.h b/commit-graph.h
index d9acb22bac..ddbca1b59d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -4,6 +4,7 @@ 
 #include "git-compat-util.h"
 #include "object-store.h"
 #include "oidset.h"
+#include "ewah/ewok.h"
 
 #define GIT_TEST_COMMIT_GRAPH "GIT_TEST_COMMIT_GRAPH"
 #define GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE "GIT_TEST_COMMIT_GRAPH_DIE_ON_PARSE"
@@ -71,6 +72,9 @@  struct commit_graph {
 	const unsigned char *chunk_base_graphs;
 	const unsigned char *chunk_bloom_indexes;
 	const unsigned char *chunk_bloom_data;
+	const unsigned char *chunk_bloom_large_filters;
+
+	struct bitmap *bloom_large;
 
 	struct bloom_filter_settings *bloom_filter_settings;
 };
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 21b67677ef..6859d85369 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -33,7 +33,7 @@  test_expect_success 'setup test - repo, commits, commit graph, log outputs' '
 	git commit-graph write --reachable --changed-paths
 '
 graph_read_expect () {
-	NUM_CHUNKS=5
+	NUM_CHUNKS=6
 	cat >expect <<- EOF
 	header: 43475048 1 1 $NUM_CHUNKS 0
 	num_commits: $1
@@ -262,5 +262,28 @@  test_expect_success 'correctly report changes over limit' '
 		done
 	)
 '
+test_bloom_filters_computed () {
+	commit_graph_args=$1
+	bloom_trace_prefix="{\"filter_known_large\":$2,\"filter_found_large\":$3,\"filter_computed\":$4"
+	rm -f "$TRASH_DIRECTORY/trace.event" &&
+	GIT_TRACE2_EVENT="$TRASH_DIRECTORY/trace.event" git commit-graph write $commit_graph_args &&
+	grep "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.event"
+}
+
+test_expect_success 'Bloom generation does not recompute too-large filters' '
+	(
+		cd limits &&
+
+		# start from scratch and rebuild
+		rm -f .git/objects/info/commit-graph &&
+		GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=10 \
+			git commit-graph write --reachable --changed-paths \
+			--split=replace &&
+		test_commit c1 filter &&
+
+		test_bloom_filters_computed "--reachable --changed-paths --split=replace" \
+			2 0 1
+	)
+'
 
 test_done