mbox series

[v2,00/11] More commit-graph/Bloom filter improvements

Message ID pull.659.v2.git.1592934430.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series More commit-graph/Bloom filter improvements | expand

Message

Philippe Blain via GitGitGadget June 23, 2020, 5:46 p.m. UTC
This builds on sg/commit-graph-cleanups, which took several patches from
Szeder's series [1] and applied them almost directly to a more-recent
version of Git [2].

[1] https://lore.kernel.org/git/20200529085038.26008-1-szeder.dev@gmail.com/
[2] 
https://lore.kernel.org/git/pull.650.git.1591362032.gitgitgadget@gmail.com/

This series adds a few extra improvements, several of which are rooted in
Szeder's original series. I maintained his authorship and sign-off, even
though the patches did not apply or cherry-pick at all.

(In v2, I have removed the range-diff comparison to Szeder's series, so look
at the v1 cover letter for that.)

The patches have been significantly reordered. René pointed out (and Szeder
discovered in the old thread) that we are not re-using the
bloom_filter_settings from the existing commit-graph when writing a new one.

 1. commit-graph: place bloom_settings in context
 2. commit-graph: change test to die on parse, not load

These are mostly the same, except we now use a pointer to the settings in
the commit-graph write context.

 3. bloom: get_bloom_filter() cleanups

This new patch is a subtle change in behavior that will become relevant in
the very next patch. In fact, if we swap patch 3 and 4, then
t4216-log-bloom.sh fails with a segfault due to a NULL filter.

 4. commit-graph: persist existence of changed-paths

This patch is now updated to use the existing changed-path filter settings.

 5. commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
 6. commit-graph: simplify chunk writes into loop
 7. commit-graph: check chunk sizes after writing

These are all the same as before.

 8. revision.c: fix whitespace

This patch is the cleanup part of Taylor's patch.

 9. revision: empty pathspecs should not use Bloom filters

Here is Taylor's fix for empty pathspecs.

 10. commit-graph: check all leading directories in changed path Bloom
     filters
 11. bloom: enforce a minimum size of 8 bytes

Finally, we get these performance patches. Patch 10 is updated to have the
better logic around directory separators and empty paths. Also, the list of
Bloom keys is ordered with the deepest path first. That has some tiny
performance benefits for deep paths since we can short-circuit the multi-key
checks more often. That code path is much faster than the tree parsing, so
it is hard to measure any change.

Thanks, -Stolee

Derrick Stolee (6):
  commit-graph: place bloom_settings in context
  commit-graph: change test to die on parse, not load
  bloom: get_bloom_filter() cleanups
  commit-graph: persist existence of changed-paths
  revision.c: fix whitespace
  bloom: enforce a minimum size of 8 bytes

SZEDER Gábor (4):
  commit-graph: unify the signatures of all write_graph_chunk_*()
    functions
  commit-graph: simplify chunk writes into loop
  commit-graph: check chunk sizes after writing
  commit-graph: check all leading directories in changed path Bloom
    filters

Taylor Blau (1):
  revision: empty pathspecs should not use Bloom filters

 Documentation/git-commit-graph.txt |   5 +-
 bloom.c                            |  19 ++--
 builtin/commit-graph.c             |   5 +-
 commit-graph.c                     | 136 +++++++++++++++++++++--------
 commit-graph.h                     |   3 +-
 revision.c                         |  53 ++++++++---
 revision.h                         |   6 +-
 t/t4216-log-bloom.sh               |  35 +++++++-
 t/t5318-commit-graph.sh            |   2 +-
 9 files changed, 200 insertions(+), 64 deletions(-)


base-commit: 7fbfe07ab4d4e58c0971dac73001b89f180a0af3
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-659%2Fderrickstolee%2Fbloom-2-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-659/derrickstolee/bloom-2-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/659

Range-diff vs v1:

  1:  c966969071 !  1:  57002040bc commit-graph: place bloom_settings in context
     @@ Commit message
          to combine the function prototypes and use function pointers to
          simplify write_commit_graph_file().
      
     +    By using a pointer, we can later replace the settings to match those
     +    that exist in the current commit-graph, in case a future Git version
     +    allows customization of these parameters.
     +
          Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ commit-graph.c: struct write_commit_graph_context {
       
       	const struct split_commit_graph_opts *split_opts;
       	size_t total_bloom_filter_data_size;
     -+	struct bloom_filter_settings bloom_settings;
     ++	const struct bloom_filter_settings *bloom_settings;
       };
       
       static void write_graph_chunk_fanout(struct hashfile *f,
     @@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
      -	hashwrite_be32(f, settings->hash_version);
      -	hashwrite_be32(f, settings->num_hashes);
      -	hashwrite_be32(f, settings->bits_per_entry);
     -+	hashwrite_be32(f, ctx->bloom_settings.hash_version);
     -+	hashwrite_be32(f, ctx->bloom_settings.num_hashes);
     -+	hashwrite_be32(f, ctx->bloom_settings.bits_per_entry);
     ++	hashwrite_be32(f, ctx->bloom_settings->hash_version);
     ++	hashwrite_be32(f, ctx->bloom_settings->num_hashes);
     ++	hashwrite_be32(f, ctx->bloom_settings->bits_per_entry);
       
       	while (list < last) {
       		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
       	struct object_id file_hash;
       	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
       
     -+	ctx->bloom_settings = bloom_settings;
     ++	ctx->bloom_settings = &bloom_settings;
      +
       	if (ctx->split) {
       		struct strbuf tmp_file = STRBUF_INIT;
  7:  60bbc15d24 =  2:  6b63f9bd8a commit-graph: change test to die on parse, not load
  -:  ---------- >  3:  492deaf916 bloom: get_bloom_filter() cleanups
  8:  db5b8fe843 !  4:  8727b25468 commit-graph: persist existence of changed-paths
     @@ Commit message
          property of "my commit-graph has changed-path filters" automatically. A
          user can drop filters using the --no-changed-paths option.
      
     +    In the process, we need to be extremely careful to match the Bloom
     +    filter settings as specified by the commit-graph. This will allow future
     +    versions of Git to customize these settings, and the version with this
     +    change will persist those settings as commit-graphs are rewritten on
     +    top.
     +
     +    Use the trace2 API to signal the settings used during the write, and
     +    check that output in a test after manually adjusting the correct bytes
     +    in the commit-graph file.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## Documentation/git-commit-graph.txt ##
     @@ builtin/commit-graph.c: static int graph_write(int argc, const char **argv)
       
      
       ## commit-graph.c ##
     +@@
     + #include "progress.h"
     + #include "bloom.h"
     + #include "commit-slab.h"
     ++#include "json-writer.h"
     ++#include "trace2.h"
     + 
     + void git_test_write_commit_graph_or_die(void)
     + {
     +@@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
     + 	stop_progress(&progress);
     + }
     + 
     ++static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     ++{
     ++	struct json_writer jw = JSON_WRITER_INIT;
     ++
     ++	jw_object_begin(&jw, 0);
     ++	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
     ++	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
     ++	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
     ++	jw_end(&jw);
     ++
     ++	trace2_data_json("bloom", ctx->r, "settings", &jw);
     ++
     ++	jw_release(&jw);
     ++}
     ++
     + static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 					 struct write_commit_graph_context *ctx)
     + {
     +@@ commit-graph.c: static void write_graph_chunk_bloom_data(struct hashfile *f,
     + 	struct progress *progress = NULL;
     + 	int i = 0;
     + 
     ++	trace2_bloom_filter_settings(ctx);
     ++
     + 	if (ctx->report_progress)
     + 		progress = start_delayed_progress(
     + 			_("Writing changed paths Bloom filters data"),
     +@@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_context *ctx)
     + 	struct object_id file_hash;
     + 	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
     + 
     +-	ctx->bloom_settings = &bloom_settings;
     ++	if (!ctx->bloom_settings)
     ++		ctx->bloom_settings = &bloom_settings;
     + 
     + 	if (ctx->split) {
     + 		struct strbuf tmp_file = STRBUF_INIT;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
       	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
      +	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
      +		ctx->changed_paths = 1;
     -+	else if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     ++	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     ++		struct commit_graph *g;
      +		prepare_commit_graph_one(ctx->r, ctx->odb);
      +
     ++		g = ctx->r->objects->commit_graph;
     ++
      +		/* We have changed-paths already. Keep them in the next graph */
     -+		if (ctx->r->objects->commit_graph &&
     -+		    ctx->r->objects->commit_graph->chunk_bloom_data)
     ++		if (g && g->chunk_bloom_data) {
      +			ctx->changed_paths = 1;
     ++			ctx->bloom_settings = g->bloom_filter_settings;
     ++		}
      +	}
      +
       	if (ctx->split) {
     @@ t/t4216-log-bloom.sh: test_expect_success 'setup - add commit-graph to the chain
       	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
       '
       
     +@@ t/t4216-log-bloom.sh: test_expect_success 'Use Bloom filters if they exist in the latest but not all c
     + 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
     + '
     + 
     ++BASE_BDAT_OFFSET=2240
     ++BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
     ++BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
     ++
     ++corrupt_graph() {
     ++	pos=$1
     ++	data="${2:-\0}"
     ++	grepstr=$3
     ++	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
     ++	zero_pos=${4:-${orig_size}} &&
     ++	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
     ++	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
     ++}
     ++
     ++test_expect_success 'persist filter settings' '
     ++	test_when_finished rm -rf .git/objects/info/commit-graph* &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
     ++	cp .git/objects/info/commit-graph commit-graph-before &&
     ++	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
     ++	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
     ++	cp .git/objects/info/commit-graph commit-graph-after &&
     ++	test_commit c18 A/corrupt &&
     ++	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
     ++	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
     ++'
     ++
     + test_done
     + \ No newline at end of file
  2:  65eb15221c !  5:  244668fec4 commit-graph: unify the signatures of all write_graph_chunk_*() functions
     @@ Commit message
      
       ## commit-graph.c ##
      @@ commit-graph.c: struct write_commit_graph_context {
     - 	struct bloom_filter_settings bloom_settings;
     + 	const struct bloom_filter_settings *bloom_settings;
       };
       
      -static void write_graph_chunk_fanout(struct hashfile *f,
     @@ commit-graph.c: static void write_graph_chunk_bloom_indexes(struct hashfile *f,
      +	return 0;
       }
       
     + static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     +@@ commit-graph.c: static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
     + 	jw_release(&jw);
     + }
     + 
      -static void write_graph_chunk_bloom_data(struct hashfile *f,
      -					 struct write_commit_graph_context *ctx)
      +static int write_graph_chunk_bloom_data(struct hashfile *f,
  3:  3d24b9802d =  6:  8b959f2f37 commit-graph: simplify chunk writes into loop
  4:  bdca834e6d =  7:  3eb10933dc commit-graph: check chunk sizes after writing
  -:  ---------- >  8:  0bcfc1f051 revision.c: fix whitespace
  -:  ---------- >  9:  719c7091a7 revision: empty pathspecs should not use Bloom filters
  5:  9975fc96f1 ! 10:  9c2076b4ce commit-graph: check all leading directories in changed path Bloom filters
     @@ Commit message
          This adjusts the tracing values in t4216-log-bloom.sh, which provides a
          concrete way to notice the improvement.
      
     +    Helped-by: Taylor Blau <me@ttaylorr.com>
     +    Helped-by: René Scharfe <l.s.r@web.de>
          Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
       	int last_index;
      -	int len;
      +	size_t len;
     -+	int path_component_nr = 0, j;
     ++	int path_component_nr = 1;
       
       	if (!revs->commits)
       		return;
      @@ revision.c: static void prepare_to_use_bloom_filter(struct rev_info *revs)
     - 
     - 	len = strlen(path);
     + 		return;
     + 	}
       
      -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
      -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
      +	p = path;
     -+	do {
     -+		p = strchrnul(p + 1, '/');
     -+		path_component_nr++;
     -+	} while (p - path < len);
     ++	while (*p) {
     ++		if (is_dir_sep(*p))
     ++			path_component_nr++;
     ++		p++;
     ++	}
      +
      +	revs->bloom_keys_nr = path_component_nr;
      +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
      +
     -+	p = path;
     -+	for (j = 0; j < revs->bloom_keys_nr; j++) {
     -+		p = strchrnul(p + 1, '/');
     ++	fill_bloom_key(path, len, &revs->bloom_keys[0],
     ++		       revs->bloom_filter_settings);
     ++	path_component_nr = 1;
      +
     -+		fill_bloom_key(path, p - path, &revs->bloom_keys[j],
     -+			       revs->bloom_filter_settings);
     ++	p = path + len - 1;
     ++	while (p > path) {
     ++		if (is_dir_sep(*p))
     ++			fill_bloom_key(path, p - path,
     ++				       &revs->bloom_keys[path_component_nr++],
     ++				       revs->bloom_filter_settings);
     ++		p--;
      +	}
       
       	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
  6:  2a5f1e1752 = 11:  1022c0ad21 bloom: enforce a minimum size of 8 bytes

Comments

Junio C Hamano June 24, 2020, 11:11 p.m. UTC | #1
This does not seem to play well with what is in flight.  Tests seem
to pass with topics up to es/config-hooks merged but not with this
topic merged on top.

    1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
    32169c595c Merge branch 'es/config-hooks' into seen
    ...

$ sh t4216-log-bloom.sh -i -v

ends like so:

ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.

expecting success of 4216.134 'persist filter settings':
        test_when_finished rm -rf .git/objects/info/commit-graph* &&
        GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
        grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
        cp .git/objects/info/commit-graph commit-graph-before &&
        corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
        corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
        cp .git/objects/info/commit-graph commit-graph-after &&
        test_commit c18 A/corrupt &&
        GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
        grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt

not ok 134 - persist filter settings
# ...

Thanks.
Derrick Stolee June 24, 2020, 11:32 p.m. UTC | #2
On 6/24/2020 7:11 PM, Junio C Hamano wrote:
> This does not seem to play well with what is in flight.  Tests seem
> to pass with topics up to es/config-hooks merged but not with this
> topic merged on top.
> 
>     1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
>     32169c595c Merge branch 'es/config-hooks' into seen
>     ...
> 
> $ sh t4216-log-bloom.sh -i -v
> 
> ends like so:
> 
> ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.
> 
> expecting success of 4216.134 'persist filter settings':
>         test_when_finished rm -rf .git/objects/info/commit-graph* &&
>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>         grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
>         cp .git/objects/info/commit-graph commit-graph-before &&
>         corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
>         corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
>         cp .git/objects/info/commit-graph commit-graph-after &&
>         test_commit c18 A/corrupt &&
>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>         grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
> 
> not ok 134 - persist filter settings
> # ...
> 
> Thanks.

Thanks for letting me know. I'll investigate carefully with the
rest of the 'seen' branch. This test is a bit fragile due to
computed values for which bytes to replace, so anything that
could have changed the length or order of chunks would lead to
a failure here.

Sorry for the disruption.

-Stolee
Junio C Hamano June 25, 2020, 12:38 a.m. UTC | #3
Derrick Stolee <stolee@gmail.com> writes:

> On 6/24/2020 7:11 PM, Junio C Hamano wrote:
>> This does not seem to play well with what is in flight.  Tests seem
>> to pass with topics up to es/config-hooks merged but not with this
>> topic merged on top.
>> 
>>     1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
>>     32169c595c Merge branch 'es/config-hooks' into seen
>>     ...
>> 
>> $ sh t4216-log-bloom.sh -i -v
>> 
>> ends like so:
>> 
>> ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.
>> 
>> expecting success of 4216.134 'persist filter settings':
>>         test_when_finished rm -rf .git/objects/info/commit-graph* &&
>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>         grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
>>         cp .git/objects/info/commit-graph commit-graph-before &&
>>         corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
>>         corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
>>         cp .git/objects/info/commit-graph commit-graph-after &&
>>         test_commit c18 A/corrupt &&
>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>         grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
>> 
>> not ok 134 - persist filter settings
>> # ...
>> 
>> Thanks.
>
> Thanks for letting me know. I'll investigate carefully with the
> rest of the 'seen' branch. This test is a bit fragile due to
> computed values for which bytes to replace, so anything that
> could have changed the length or order of chunks would lead to
> a failure here.
>
> Sorry for the disruption.

Oh, not at all.  Thanks for helping.
Derrick Stolee June 25, 2020, 1:38 p.m. UTC | #4
On 6/24/2020 8:38 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 6/24/2020 7:11 PM, Junio C Hamano wrote:
>>> This does not seem to play well with what is in flight.  Tests seem
>>> to pass with topics up to es/config-hooks merged but not with this
>>> topic merged on top.
>>>
>>>     1b5d3d8260 Merge branch 'ds/commit-graph-bloom-updates' into seen
>>>     32169c595c Merge branch 'es/config-hooks' into seen
>>>     ...
>>>
>>> $ sh t4216-log-bloom.sh -i -v
>>>
>>> ends like so:
>>>
>>> ok 133 - Use Bloom filters if they exist in the latest but not all commit graphs in the chain.
>>>
>>> expecting success of 4216.134 'persist filter settings':
>>>         test_when_finished rm -rf .git/objects/info/commit-graph* &&
>>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>>         grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
>>>         cp .git/objects/info/commit-graph commit-graph-before &&
>>>         corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
>>>         corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
>>>         cp .git/objects/info/commit-graph commit-graph-after &&
>>>         test_commit c18 A/corrupt &&
>>>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
>>>         grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
>>>
>>> not ok 134 - persist filter settings
>>> # ...
>>>
>>> Thanks.
>>
>> Thanks for letting me know. I'll investigate carefully with the
>> rest of the 'seen' branch. This test is a bit fragile due to
>> computed values for which bytes to replace, so anything that
>> could have changed the length or order of chunks would lead to
>> a failure here.
>>
>> Sorry for the disruption.
> 
> Oh, not at all.  Thanks for helping
I'll squash the patch into my v3, but here it is now to make 'seen'
pass tests again.

The _real_ reason for the failure was that some changes in trace2
pushed the events out of the nesting limits. I also think it is a
good idea to make the test less brittle. Adding GIT_TEST_* variables
will also help anyone who wants to adjust the Bloom filter settings
for testing.

Question: Should these be GIT_BLOOM_SETTINGS_* instead of GIT_TEST_...?
I ask because this _could_ be a way to allow user customization,
without making it as public as a config option. Or, should I just do
the work and add config settings in this series?

Thanks,
-Stolee

-- >8 --
From 9245d31f0431eceec60f0b7a90900d2825787530 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Thu, 25 Jun 2020 09:19:13 -0400
Subject: [PATCH] fixup! commit-graph: persist existence of changed-paths

The previous version of this test was too fragile to subtle changes
in the commit-graph file size. This version now uses two environment
variables to customize the Bloom filter settings before rewriting
without those environment variables. This demonstrates that we
persist the settings correctly.

The issue with the 'seen' branch is due to es/trace-log-progress
adding trace2 regions in the progress indicators. This pushed the
trace2 data that the test was expecting outside the nesting limit.
Set GIT_TRACE2_EVENT_NESTING to ensure we still record those items.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c       |  9 +++++++--
 t/t4216-log-bloom.sh | 34 +++++++++++-----------------------
 2 files changed, 18 insertions(+), 25 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 1de6800d74..026ec63d38 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1576,10 +1576,15 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	int num_chunks = 3;
 	uint64_t chunk_offset;
 	struct object_id file_hash;
-	const struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
+	struct bloom_filter_settings bloom_settings = DEFAULT_BLOOM_FILTER_SETTINGS;
 
-	if (!ctx->bloom_settings)
+	if (!ctx->bloom_settings) {
+		bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY",
+							      bloom_settings.bits_per_entry);
+		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
+							  bloom_settings.num_hashes);
 		ctx->bloom_settings = &bloom_settings;
+	}
 
 	if (ctx->split) {
 		struct strbuf tmp_file = STRBUF_INIT;
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 84f95972ca..d7dd717347 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -156,31 +156,19 @@ test_expect_success 'Use Bloom filters if they exist in the latest but not all c
 	test_bloom_filters_used_when_some_filters_are_missing "-- A/B"
 '
 
-BASE_BDAT_OFFSET=2240
-BASE_K_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 10))
-BASE_LEN_BYTE_OFFSET=$((BASE_BDAT_OFFSET + 14))
-
-corrupt_graph() {
-	pos=$1
-	data="${2:-\0}"
-	grepstr=$3
-	orig_size=$(wc -c < .git/objects/info/commit-graph) &&
-	zero_pos=${4:-${orig_size}} &&
-	printf "$data" | dd of=".git/objects/info/commit-graph" bs=1 seek="$pos" conv=notrunc &&
-	dd of=".git/objects/info/commit-graph" bs=1 seek="$zero_pos" if=/dev/null
-}
-
 test_expect_success 'persist filter settings' '
 	test_when_finished rm -rf .git/objects/info/commit-graph* &&
-	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":7,\"bits_per_entry\":10}" trace2.txt &&
-	cp .git/objects/info/commit-graph commit-graph-before &&
-	corrupt_graph $BASE_K_BYTE_OFFSET "\09" &&
-	corrupt_graph $BASE_LEN_BYTE_OFFSET "\0F" &&
-	cp .git/objects/info/commit-graph commit-graph-after &&
-	test_commit c18 A/corrupt &&
-	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":57,\"bits_per_entry\":70}" trace2.txt
+	rm -rf .git/objects/info/commit-graph* &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
+		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
+		GIT_TRACE2_EVENT_NESTING=5 \
+		git commit-graph write --reachable --changed-paths &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
 '
 
 test_done
\ No newline at end of file
Junio C Hamano June 25, 2020, 4:34 p.m. UTC | #5
Derrick Stolee <stolee@gmail.com> writes:

> Question: Should these be GIT_BLOOM_SETTINGS_* instead of GIT_TEST_...?
> I ask because this _could_ be a way to allow user customization,
> without making it as public as a config option. Or, should I just do
> the work and add config settings in this series?

Other than when testing and/or debugging, what are the expected
reasons and situations an individual would want to use customized
settings?  Once a decision is made to use one customized setting for
a repository, does it make sense for a setting other than that one
setting for the same repository, or is it something very handy if we
can use different settings on a whim?  

My gut feeling is that it should be added as per-repo configuration
but only after a use case is found, and GIT_TEST_* would be the way
to go.

Thanks.