mbox series

[v2,00/13] more miscellaneous Bloom filter improvements, redux

Message ID cover.1600279373.git.me@ttaylorr.com
Headers show
Series more miscellaneous Bloom filter improvements, redux | expand

Message

Taylor Blau Sept. 16, 2020, 6:06 p.m. UTC
Here's another re-roll of mine and Stolee's series to add a series of
improvements to the Bloom filter machinery, culminating in the
command-line flag '--max-new-filters'.

Much is left the same from [1], with two notable exceptions:

  - I took much of Stolee's feedback from [2], since the suggestions
    were good and we are rerolling, anyway.

  - I dropped the trace2 calls that write JSON from within the
    commit-graph machinery in favor of pure "data" writes, which are
    easier to grep for. This caused some test refactoring and allowed us
    to ultimately drop the 'test_bloom_filters_computed' function.

Hopefully this is it for this series ;). I think that it's in good shape
now, and I couldn't find anything in my own inspection that I wanted to
change. So, if others feel good, too, I think we should focus on
incremental fixes on top of this.

[1]: https://lore.kernel.org/git/cover.1599664389.git.me@ttaylorr.com/
[2]: https://lore.kernel.org/git/134d64a0-abb6-bdc9-2c05-7aded01a906a@gmail.com/

Derrick Stolee (1):
  bloom/diff: properly short-circuit on max_changes

Taylor Blau (12):
  commit-graph: introduce 'get_bloom_filter_settings()'
  t4216: use an '&&'-chain
  commit-graph: pass a 'struct repository *' in more places
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: respect 'commitGraph.readChangedPaths'
  commit-graph.c: store maximum changed paths
  bloom: split 'get_bloom_filter()' in two
  bloom: use provided 'struct bloom_filter_settings'
  bloom: encode out-of-bounds filters as non-empty
  commit-graph: rename 'split_commit_graph_opts'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  commit-graph: introduce 'commitGraph.maxNewFilters'

 Documentation/config.txt                      |   2 +
 Documentation/config/commitgraph.txt          |   8 +
 Documentation/git-commit-graph.txt            |   6 +
 .../technical/commit-graph-format.txt         |   2 +-
 blame.c                                       |   8 +-
 bloom.c                                       |  59 +++--
 bloom.h                                       |  29 ++-
 builtin/commit-graph.c                        |  63 ++++-
 commit-graph.c                                | 140 +++++++---
 commit-graph.h                                |  17 +-
 diff.h                                        |   2 -
 fuzz-commit-graph.c                           |   5 +-
 line-log.c                                    |   2 +-
 repo-settings.c                               |   3 +
 repository.h                                  |   1 +
 revision.c                                    |   7 +-
 t/helper/test-bloom.c                         |   4 +-
 t/helper/test-read-graph.c                    |   3 +-
 t/t0095-bloom.sh                              |   8 +-
 t/t4216-log-bloom.sh                          | 241 ++++++++++++++++--
 t/t5324-split-commit-graph.sh                 |  13 +
 tree-diff.c                                   |   5 +-
 22 files changed, 505 insertions(+), 123 deletions(-)
 create mode 100644 Documentation/config/commitgraph.txt

--
2.28.0.510.g86fdc5f89a

Comments

Derrick Stolee Sept. 16, 2020, 10:51 p.m. UTC | #1
On 9/16/20 2:06 PM, Taylor Blau wrote:
> Hopefully this is it for this series ;). I think that it's in good shape
> now, and I couldn't find anything in my own inspection that I wanted to
> change. So, if others feel good, too, I think we should focus on
> incremental fixes on top of this.

I am also happy with this version. The test updates are
rather nice.

It is worth highlighging the tracing changes, in case
anyone was planning to use that style of JSON data in
production. I think it is better to include this simpler
data model of one variable per event instead of rolling
our own JSON format.

Thanks,
-Stolee
Junio C Hamano Sept. 16, 2020, 11:07 p.m. UTC | #2
Derrick Stolee <stolee@gmail.com> writes:

> On 9/16/20 2:06 PM, Taylor Blau wrote:
>> Hopefully this is it for this series ;). I think that it's in good shape
>> now, and I couldn't find anything in my own inspection that I wanted to
>> change. So, if others feel good, too, I think we should focus on
>> incremental fixes on top of this.
>
> I am also happy with this version. The test updates are
> rather nice.
>
> It is worth highlighging the tracing changes, in case
> anyone was planning to use that style of JSON data in
> production. I think it is better to include this simpler
> data model of one variable per event instead of rolling
> our own JSON format.

Yup, lets merge it down soonish.

Thanks.
Taylor Blau Sept. 17, 2020, 12:45 a.m. UTC | #3
On Wed, Sep 16, 2020 at 04:07:51PM -0700, Junio C Hamano wrote:
> Yup, lets merge it down soonish.

It's me, the bearer of bad news. I noticed an uninitialized read when
running the tests with SANITIZE=address,undefined. I *think* the fix is
as simple as a single replacement, but let me double check before you
merge this.

Sorry this topic has been such a disaster. Assuming the fix is isolated
to a single patch, do you want a new version of that patch, or the whole
series?

> Thanks.

Thanks,
Taylor
Junio C Hamano Sept. 17, 2020, 12:59 a.m. UTC | #4
Taylor Blau <me@ttaylorr.com> writes:

> On Wed, Sep 16, 2020 at 04:07:51PM -0700, Junio C Hamano wrote:
>> Yup, lets merge it down soonish.
>
> It's me, the bearer of bad news. I noticed an uninitialized read when
> running the tests with SANITIZE=address,undefined. I *think* the fix is
> as simple as a single replacement, but let me double check before you
> merge this.
>
> Sorry this topic has been such a disaster. Assuming the fix is isolated
> to a single patch, do you want a new version of that patch, or the whole
> series?

Depends on the timing---if it takes less than 2 days, please expect
that the topic would still be in my cache and "squash this into
patch 07/13" would be sufficient.  More than 10 days, wholesale
replacement would be the easiest.  A single patch replacement in
between.

Thanks.
Taylor Blau Sept. 17, 2020, 1:10 a.m. UTC | #5
On Wed, Sep 16, 2020 at 05:59:26PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > On Wed, Sep 16, 2020 at 04:07:51PM -0700, Junio C Hamano wrote:
> >> Yup, lets merge it down soonish.
> >
> > It's me, the bearer of bad news. I noticed an uninitialized read when
> > running the tests with SANITIZE=address,undefined. I *think* the fix is
> > as simple as a single replacement, but let me double check before you
> > merge this.
> >
> > Sorry this topic has been such a disaster. Assuming the fix is isolated
> > to a single patch, do you want a new version of that patch, or the whole
> > series?
>
> Depends on the timing---if it takes less than 2 days, please expect
> that the topic would still be in my cache and "squash this into
> patch 07/13" would be sufficient.  More than 10 days, wholesale
> replacement would be the easiest.  A single patch replacement in
> between.

I should have the patch in your inbox by the end of tonight, depending
on how fast my workstation can run the ASan-enabled test suite 13 times
;).

> Thanks.

Thanks,
Taylor
Taylor Blau Sept. 17, 2020, 1:34 p.m. UTC | #6
Junio,

On Wed, Sep 16, 2020 at 09:10:49PM -0400, Taylor Blau wrote:
> I should have the patch in your inbox by the end of tonight, depending
> on how fast my workstation can run the ASan-enabled test suite 13 times
> ;).

All finished. This is sufficient to fix the ASan-enabled test suite,
along with fixing a bug where we wouldn't respect the limit on changed
paths when loading an existing commit-graph. This has nothing to do with
the user-specified '--max-new-filters', nor does it mean that we're
storing the limit in the commit-graph file. Instead it's because we're
loading the bloom_filter_settings struct from the graph and
initializing it ourselves, instead of using the default values (which is
the case when we don't load a graph at all).

Anyway, let's use this instead of 6/13. Here's an inter-diff that shows
the fix and test change:

  diff --git a/commit-graph.c b/commit-graph.c
  index 33af6c2430..fc6c6fdc3e 100644
  --- a/commit-graph.c
  +++ b/commit-graph.c
  @@ -424,6 +424,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
          graph->bloom_filter_settings->hash_version = hash_version;
          graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
          graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
  +				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
        }
        break;
      }
  diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
  index dc7d62c778..5bc1627568 100755
  --- a/t/t4216-log-bloom.sh
  +++ b/t/t4216-log-bloom.sh
  @@ -175,11 +175,11 @@ test_expect_success 'persist filter settings' '
      GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
      GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
      git commit-graph write --reachable --changed-paths &&
  -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2.txt &&
  +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
    GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
      GIT_TRACE2_EVENT_NESTING=5 \
      git commit-graph write --reachable --changed-paths &&
  -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15" trace2-auto.txt
  +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
   '

   test_max_changed_paths () {

...and here is the updated patch. Sorry again for all of the trouble.

--- >8 ---

Subject: [PATCH] commit-graph.c: store maximum changed paths

For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold, that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 bloom.h              | 11 ++++++++++-
 commit-graph.c       |  4 ++++
 t/t4216-log-bloom.sh |  4 ++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/bloom.h b/bloom.h
index d8fbb0fbf1..0b9b59a6fe 100644
--- a/bloom.h
+++ b/bloom.h
@@ -28,9 +28,18 @@ struct bloom_filter_settings {
 	 * that contain n*b bits.
 	 */
 	uint32_t bits_per_entry;
+
+	/*
+	 * The maximum number of changed paths per commit
+	 * before declaring a Bloom filter to be too-large.
+	 *
+	 * Not written to the commit-graph file.
+	 */
+	uint32_t max_changed_paths;
 };

-#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
+#define DEFAULT_BLOOM_MAX_CHANGES 512
+#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES }
 #define BITS_PER_WORD 8
 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)

diff --git a/commit-graph.c b/commit-graph.c
index ea54d108b9..ba6d4a4c6c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -424,6 +424,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
 				graph->bloom_filter_settings->hash_version = hash_version;
 				graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
 				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
+				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;
 			}
 			break;
 		}
@@ -1201,6 +1202,7 @@ static void trace2_bloom_filter_settings(struct write_commit_graph_context *ctx)
 	jw_object_intmax(&jw, "hash_version", ctx->bloom_settings->hash_version);
 	jw_object_intmax(&jw, "num_hashes", ctx->bloom_settings->num_hashes);
 	jw_object_intmax(&jw, "bits_per_entry", ctx->bloom_settings->bits_per_entry);
+	jw_object_intmax(&jw, "max_changed_paths", ctx->bloom_settings->max_changed_paths);
 	jw_end(&jw);

 	trace2_data_json("bloom", ctx->r, "settings", &jw);
@@ -1669,6 +1671,8 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 							      bloom_settings.bits_per_entry);
 		bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES",
 							  bloom_settings.num_hashes);
+		bloom_settings.max_changed_paths = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS",
+							  bloom_settings.max_changed_paths);
 		ctx->bloom_settings = &bloom_settings;
 	}

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fc7693806c..593571358d 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -174,11 +174,11 @@ test_expect_success 'persist filter settings' '
 		GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \
 		GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
 		GIT_TRACE2_EVENT_NESTING=5 \
 		git commit-graph write --reachable --changed-paths &&
-	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
+	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt
 '

 test_expect_success 'correctly report changes over limit' '
--
2.28.0.510.g86fdc5f89a
Derrick Stolee Sept. 17, 2020, 1:38 p.m. UTC | #7
On 9/17/2020 9:34 AM, Taylor Blau wrote:
> Junio,
> 
> On Wed, Sep 16, 2020 at 09:10:49PM -0400, Taylor Blau wrote:
>> I should have the patch in your inbox by the end of tonight, depending
>> on how fast my workstation can run the ASan-enabled test suite 13 times
>> ;).
> 
> All finished. This is sufficient to fix the ASan-enabled test suite,
> along with fixing a bug where we wouldn't respect the limit on changed
> paths when loading an existing commit-graph. This has nothing to do with
> the user-specified '--max-new-filters', nor does it mean that we're
> storing the limit in the commit-graph file. Instead it's because we're
> loading the bloom_filter_settings struct from the graph and
> initializing it ourselves, instead of using the default values (which is
> the case when we don't load a graph at all).
> 
> Anyway, let's use this instead of 6/13. Here's an inter-diff that shows
> the fix and test change:
> 
>   diff --git a/commit-graph.c b/commit-graph.c
>   index 33af6c2430..fc6c6fdc3e 100644
>   --- a/commit-graph.c
>   +++ b/commit-graph.c
>   @@ -424,6 +424,7 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>           graph->bloom_filter_settings->hash_version = hash_version;
>           graph->bloom_filter_settings->num_hashes = get_be32(data + chunk_offset + 4);
>           graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
>   +				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;

This whitespace looks strange in the inter-diff...

>  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
> +				graph->bloom_filter_settings->max_changed_paths = DEFAULT_BLOOM_MAX_CHANGES;

...but it is correct in the patch itself.

> -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2.txt &&
> +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt &&
>  	GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \
>  		GIT_TRACE2_EVENT_NESTING=5 \
>  		git commit-graph write --reachable --changed-paths &&
> -	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15}" trace2-auto.txt
> +	grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt

I appreciate the additional tests to guarantee this is set
correctly.

Thanks,
-Stolee