[v6,0/7] Changed path filter hash fix and version bump

Message ID	cover.1689889382.git.jonathantanmy@google.com (mailing list archive)
Headers	show Return-Path: <git-owner@vger.kernel.org> Date: Thu, 20 Jul 2023 14:46:33 -0700 In-Reply-To: <cover.1684790529.git.jonathantanmy@google.com> Mime-Version: 1.0 References: <cover.1684790529.git.jonathantanmy@google.com> Message-ID: <cover.1689889382.git.jonathantanmy@google.com> Subject: [PATCH v6 0/7] Changed path filter hash fix and version bump From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com>, Junio C Hamano <gitster@pobox.com>, Taylor Blau <me@ttaylorr.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	Changed path filter hash fix and version bump \| expand [v6,0/7] Changed path filter hash fix and version bump [v6,1/7] gitformat-commit-graph: describe version 2 of BDAT [v6,2/7] t/helper/test-read-graph.c: extract `dump_graph_info()` [v6,3/7] bloom.h: make `load_bloom_filter_from_graph()` public [v6,4/7] t/helper/test-read-graph: implement `bloom-filters` mode [v6,5/7] t4216: test changed path filters with high bit paths [v6,6/7] repo-settings: introduce commitgraph.changedPathsVersion [v6,7/7] commit-graph: new filter ver. that fixes murmur3

Jonathan Tan July 20, 2023, 9:46 p.m. UTC

Thanks, Junio and Taylor, for your reviews. I have included Taylor's
patches in this patch set.

There seemed to be some merge conflicts when I tried to apply the
patches Taylor provided on the base that I built my patches on (that is,
the base of jt/path-filter-fix, namely, maint-2.40), so I have rebased
all my patches onto latest master.

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

Taylor Blau (3):
  t/helper/test-read-graph.c: extract `dump_graph_info()`
  bloom.h: make `load_bloom_filter_from_graph()` public
  t/helper/test-read-graph: implement `bloom-filters` mode

 Documentation/config/commitgraph.txt     |  26 +++++-
 Documentation/gitformat-commit-graph.txt |   9 +-
 bloom.c                                  |  75 ++++++++++++++--
 bloom.h                                  |  13 ++-
 commit-graph.c                           |  33 +++++--
 oss-fuzz/fuzz-commit-graph.c             |   2 +-
 repo-settings.c                          |   6 +-
 repository.h                             |   2 +-
 t/helper/test-bloom.c                    |   9 +-
 t/helper/test-read-graph.c               |  67 ++++++++++++---
 t/t0095-bloom.sh                         |   8 ++
 t/t4216-log-bloom.sh                     | 104 +++++++++++++++++++++++
 12 files changed, 315 insertions(+), 39 deletions(-)

Range-diff against v5:
1:  efe7f40fed = 1:  3ce6090a4d gitformat-commit-graph: describe version 2 of BDAT
-:  ---------- > 2:  1955734d1f t/helper/test-read-graph.c: extract `dump_graph_info()`
-:  ---------- > 3:  4cf7c2f634 bloom.h: make `load_bloom_filter_from_graph()` public
-:  ---------- > 4:  47b55758e6 t/helper/test-read-graph: implement `bloom-filters` mode
2:  f684d07971 ! 5:  5276e6a90e t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
      	)
      '
      
    -+get_bdat_offset () {
    -+	perl -0777 -ne \
    -+		'print unpack("N", "$1") if /BDAT\0\0\0\0(....)/ or exit 1' \
    -+		.git/objects/info/commit-graph
    -+}
    -+
     +get_first_changed_path_filter () {
    -+	BDAT_OFFSET=$(get_bdat_offset) &&
    -+	perl -0777 -ne \
    -+		'print unpack("H*", substr($_, '$BDAT_OFFSET' + 12, 2))' \
    -+		.git/objects/info/commit-graph
    ++	test-tool read-graph bloom-filters >filters.dat &&
    ++	head -n 1 filters.dat
     +}
     +
     +# chosen to be the same under all Unicode normalization forms
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +
     +test_expect_success 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
    -+		printf "52a9" >expect &&
    ++		echo "52a9" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
3:  2fadd87063 ! 6:  dc3f6d2d4f repo-settings: introduce commitgraph.changedPathsVersion
    @@ Documentation/config/commitgraph.txt: commitGraph.maxNewFilters::
     -	If true, then git will use the changed-path Bloom filters in the
     -	commit-graph file (if it exists, and they are present). Defaults to
     -	true. See linkgit:git-commit-graph[1] for more information.
    -+	Deprecated. Equivalent to changedPathsVersion=-1 if true, and
    -+	changedPathsVersion=0 if false.
    ++	Deprecated. Equivalent to commitGraph.changedPathsVersion=-1 if true, and
    ++	commitGraph.changedPathsVersion=0 if false. (If commitGraph.changedPathVersion
    ++	is also set, commitGraph.changedPathsVersion takes precedence.)
     +
     +commitGraph.changedPathsVersion::
     +	Specifies the version of the changed-path Bloom filters that Git will read and
    -+	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
    -+	match the version set in this config variable will be ignored.
    ++	write. May be -1, 0 or 1.
     ++
     +Defaults to -1.
     ++
     +If -1, Git will use the version of the changed-path Bloom filters in the
     +repository, defaulting to 1 if there are none.
     ++
    -+If 0, git will write version 1 Bloom filters when instructed to write.
    ++If 0, Git will not read any Bloom filters, and will write version 1 Bloom
    ++filters when instructed to write.
    +++
    ++If 1, Git will only read version 1 Bloom filters, and will write version 1
    ++Bloom filters.
     ++
     +See linkgit:git-commit-graph[1] for more information.
     
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
     -	if (s->commit_graph_read_changed_paths) {
    -+	if (s->commit_graph_changed_paths_version != 0) {
    ++	if (s->commit_graph_changed_paths_version) {
      		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
      			   &graph->chunk_bloom_indexes);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
    @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
      	int value;
      	const char *strval;
      	int manyfiles;
    -+	int readChangedPaths;
    ++	int read_changed_paths;
      
      	if (!r->gitdir)
      		BUG("Cannot add settings for uninitialized repository");
    @@ repo-settings.c: void prepare_repo_settings(struct repository *r)
      	repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1);
      	repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2);
     -	repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1);
    -+	repo_cfg_bool(r, "commitgraph.readchangedpaths", &readChangedPaths, 1);
    ++	repo_cfg_bool(r, "commitgraph.readchangedpaths", &read_changed_paths, 1);
     +	repo_cfg_int(r, "commitgraph.changedpathsversion",
     +		     &r->settings.commit_graph_changed_paths_version,
    -+		     readChangedPaths ? -1 : 0);
    ++		     read_changed_paths ? -1 : 0);
      	repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1);
      	repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0);
      
    @@ repository.h: struct repo_settings {
     -	int commit_graph_read_changed_paths;
     +	int commit_graph_changed_paths_version;
      	int gc_write_commit_graph;
    - 	int gc_cruft_packs;
      	int fetch_write_commit_graph;
    + 	int command_requires_full_index;
4:  e31711ae85 ! 7:  6e2d797406 commit-graph: new filter ver. that fixes murmur3
    @@ Commit message
         controllable by a compiler option, and the default signedness of char is
         platform-specific). When a string contains characters with the high bit
         set, this bug causes results that, although internally consistent within
    -    Git, does not accord with other implementations of murmur3 and even with
    -    Git binaries that were compiled with different signedness of char. This
    -    bug affects both how Git writes changed path filters to disk and how Git
    -    interprets changed path filters on disk.
    +    Git, does not accord with other implementations of murmur3 (thus,
    +    the changed path filters wouldn't be readable by other off-the-shelf
    +    implementatios of murmur3) and even with Git binaries that were compiled
    +    with different signedness of char. This bug affects both how Git writes
    +    changed path filters to disk and how Git interprets changed path filters
    +    on disk.
     
         Therefore, introduce a new version (2) of changed path filters that
         corrects this problem. The existing version (1) is still supported and
    @@ Documentation/config/commitgraph.txt: commitGraph.readChangedPaths::
      
      commitGraph.changedPathsVersion::
      	Specifies the version of the changed-path Bloom filters that Git will read and
    --	write. May be -1, 0 or 1. Any changed-path Bloom filters on disk that do not
    -+	write. May be -1, 0, 1, or 2. Any changed-path Bloom filters on disk that do not
    - 	match the version set in this config variable will be ignored.
    +-	write. May be -1, 0 or 1.
    ++	write. May be -1, 0, 1, or 2.
      +
      Defaults to -1.
    + +
    +@@ Documentation/config/commitgraph.txt: filters when instructed to write.
    + If 1, Git will only read version 1 Bloom filters, and will write version 1
    + Bloom filters.
    + +
    ++If 2, Git will only read version 2 Bloom filters, and will write version 2
    ++Bloom filters.
    +++
    + See linkgit:git-commit-graph[1] for more information.
     
      ## bloom.c ##
    -@@ bloom.c: static int load_bloom_filter_from_graph(struct commit_graph *g,
    +@@ bloom.c: int load_bloom_filter_from_graph(struct commit_graph *g,
       * Not considered to be cryptographically secure.
       * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
       */
    @@ bloom.c: void fill_bloom_key(const char *data,
      	const uint32_t seed1 = 0x7e646e2c;
     -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
     -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
    -+	const uint32_t hash0 = (settings->hash_version == 2
    -+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
    -+	const uint32_t hash1 = (settings->hash_version == 2
    -+		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);
    ++	uint32_t hash0, hash1;
    ++	if (settings->hash_version == 2) {
    ++		hash0 = murmur3_seeded_v2(seed0, data, len);
    ++		hash1 = murmur3_seeded_v2(seed1, data, len);
    ++	} else {
    ++		hash0 = murmur3_seeded_v1(seed0, data, len);
    ++		hash1 = murmur3_seeded_v1(seed1, data, len);
    ++	}
      
      	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
      	for (i = 0; i < settings->num_hashes; i++)
     
      ## bloom.h ##
    -@@ bloom.h: struct repository;
    +@@ bloom.h: struct commit_graph;
      struct bloom_filter_settings {
      	/*
      	 * The version of the hashing technique being used.
    @@ bloom.h: struct repository;
      	 */
      	uint32_t hash_version;
      
    -@@ bloom.h: struct bloom_key {
    +@@ bloom.h: int load_bloom_filter_from_graph(struct commit_graph *g,
       * Not considered to be cryptographically secure.
       * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
       */
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      	return 0;
      }
      
    -+struct graph_read_bloom_data_data {
    ++struct graph_read_bloom_data_context {
     +	struct commit_graph *g;
     +	int *commit_graph_changed_paths_version;
     +};
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
      				  size_t chunk_size, void *data)
      {
     -	struct commit_graph *g = data;
    -+	struct graph_read_bloom_data_data *d = data;
    -+	struct commit_graph *g = d->g;
    ++	struct graph_read_bloom_data_context *c = data;
    ++	struct commit_graph *g = c->g;
      	uint32_t hash_version;
      	g->chunk_bloom_data = chunk_start;
      	hash_version = get_be32(chunk_start);
      
     -	if (hash_version != 1)
    -+	if (*d->commit_graph_changed_paths_version == -1) {
    -+		*d->commit_graph_changed_paths_version = hash_version;
    -+	} else if (hash_version != *d->commit_graph_changed_paths_version) {
    - 		return 0;
    +-		return 0;
    ++	if (*c->commit_graph_changed_paths_version == -1) {
    ++		*c->commit_graph_changed_paths_version = hash_version;
    ++	} else if (hash_version != *c->commit_graph_changed_paths_version) {
    ++ 		return 0;
     +	}
      
      	g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
    @@ commit-graph.c: static int graph_read_oid_lookup(const unsigned char *chunk_star
     @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      	}
      
    - 	if (s->commit_graph_changed_paths_version != 0) {
    -+		struct graph_read_bloom_data_data data = {
    + 	if (s->commit_graph_changed_paths_version) {
    ++		struct graph_read_bloom_data_context context = {
     +			.g = graph,
     +			.commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version
     +		};
    @@ commit-graph.c: struct commit_graph *parse_commit_graph(struct repo_settings *s,
      			   &graph->chunk_bloom_indexes);
      		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
     -			   graph_read_bloom_data, graph);
    -+			   graph_read_bloom_data, &data);
    ++			   graph_read_bloom_data, &context);
      	}
      
      	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: get_bdat_offset () {
    - 		.git/objects/info/commit-graph
    - }
    - 
    -+get_changed_path_filter_version () {
    -+	BDAT_OFFSET=$(get_bdat_offset) &&
    -+	perl -0777 -ne \
    -+		'print unpack("H*", substr($_, '$BDAT_OFFSET', 4))' \
    -+		.git/objects/info/commit-graph
    -+}
    -+
    - get_first_changed_path_filter () {
    - 	BDAT_OFFSET=$(get_bdat_offset) &&
    - 	perl -0777 -ne \
     @@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, version 1 changed-path' '
      	git -C highbit1 commit-graph write --reachable --changed-paths
      '
    @@ t/t4216-log-bloom.sh: test_expect_success 'set up repo with high bit path, versi
     -test_expect_success 'setup check value of version 1 changed-path' '
     +test_expect_success 'check value of version 1 changed-path' '
      	(cd highbit1 &&
    - 		printf "52a9" >expect &&
    + 		echo "52a9" >expect &&
      		get_first_changed_path_filter >actual &&
     @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      		test_bloom_filters_used "-- $CENT")
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +	git -C highbit1 commit-graph write --reachable --changed-paths &&
     +	(cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
    -+		printf "00000001" >expect &&
    -+		get_changed_path_filter_version >actual &&
    ++		echo "options: bloom(1,10,7) read_generation_data" >expect &&
    ++		test-tool read-graph >full &&
    ++		grep options full >actual &&
     +		test_cmp expect actual)
     +'
     +
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +
     +test_expect_success 'check value of version 2 changed-path' '
     +	(cd highbit2 &&
    -+		printf "c01f" >expect &&
    ++		echo "c01f" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
    @@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when vers
     +	git -C highbit2 commit-graph write --reachable --changed-paths &&
     +	(cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion -1 &&
    -+		printf "00000002" >expect &&
    -+		get_changed_path_filter_version >actual &&
    ++		echo "options: bloom(2,10,7) read_generation_data" >expect &&
    ++		test-tool read-graph >full &&
    ++		grep options full >actual &&
     +		test_cmp expect actual)
     +'
     +

Junio C Hamano July 25, 2023, 8:52 p.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks, Junio and Taylor, for your reviews. I have included Taylor's
> patches in this patch set.
>
> There seemed to be some merge conflicts when I tried to apply the
> patches Taylor provided on the base that I built my patches on (that is,
> the base of jt/path-filter-fix, namely, maint-2.40), so I have rebased
> all my patches onto latest master.
>
> Jonathan Tan (4):
>   gitformat-commit-graph: describe version 2 of BDAT
>   t4216: test changed path filters with high bit paths
>   repo-settings: introduce commitgraph.changedPathsVersion
>   commit-graph: new filter ver. that fixes murmur3
>
> Taylor Blau (3):
>   t/helper/test-read-graph.c: extract `dump_graph_info()`
>   bloom.h: make `load_bloom_filter_from_graph()` public
>   t/helper/test-read-graph: implement `bloom-filters` mode

Thanks, I seem to have missed this one.  Let's queue this version
and merge it down to 'next', unless there is no other blocking
comments in a few days.

Thanks.

Junio C Hamano July 26, 2023, 8:39 p.m. UTC | #2

Jonathan Tan <jonathantanmy@google.com> writes:

> Jonathan Tan (4):
>   gitformat-commit-graph: describe version 2 of BDAT
>   t4216: test changed path filters with high bit paths
>   repo-settings: introduce commitgraph.changedPathsVersion
>   commit-graph: new filter ver. that fixes murmur3
>
> Taylor Blau (3):
>   t/helper/test-read-graph.c: extract `dump_graph_info()`
>   bloom.h: make `load_bloom_filter_from_graph()` public
>   t/helper/test-read-graph: implement `bloom-filters` mode

After a week, nobody seems to have found anything worth saying about
these patches.  Does the design, especially the migration story, now
look sensible to everybody?  I was contemplating to mark the topic
for 'next' after reading them myself once more.

Thanks.

Taylor Blau July 27, 2023, 12:17 a.m. UTC | #3

On Wed, Jul 26, 2023 at 01:39:14PM -0700, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> > Jonathan Tan (4):
> >   gitformat-commit-graph: describe version 2 of BDAT
> >   t4216: test changed path filters with high bit paths
> >   repo-settings: introduce commitgraph.changedPathsVersion
> >   commit-graph: new filter ver. that fixes murmur3
> >
> > Taylor Blau (3):
> >   t/helper/test-read-graph.c: extract `dump_graph_info()`
> >   bloom.h: make `load_bloom_filter_from_graph()` public
> >   t/helper/test-read-graph: implement `bloom-filters` mode
>
> After a week, nobody seems to have found anything worth saying about
> these patches.  Does the design, especially the migration story, now
> look sensible to everybody?  I was contemplating to mark the topic
> for 'next' after reading them myself once more.

Sorry for not getting to this sooner. I didn't notice anything during my
review, but I think there may be a bug here.

Suppose that we have an existing commit-graph with v1 Bloom filters. If
we then try to rewrite that commit-graph using v2 Bloom filters, we
*should* attempt to recompute the filter from scratch. But AFAICT, that
isn't what happens.

Here's my test setup:

    test_expect_success 'test' '
      test_commit base &&
      git repack -d &&

      git -c commitGraph.changedPathVersion=1 commit-graph write --changed-paths &&
      debug git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths
    '

if you attach a debugger to the second process, and break inside of
get_or_compute_bloom_filter() when compute_if_not_present is set, you'll
see that Git will pass along the existing *v1* Bloom filter, and then
write its contents to the new commit-graph:

    (gdb) b get_or_compute_bloom_filter if compute_if_not_present
    Breakpoint 1 at 0x14340f: file bloom.c, line 260.
    (gdb) r
    Starting program: /home/ttaylorr/src/git/git -c commitGraph.changedPathVersion=2 commit-graph write --changed-paths
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Breakpoint 1, get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0,
        compute_if_not_present=1, settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:260
    260		if (computed)
    (gdb) until 271
    get_or_compute_bloom_filter (r=0x5555559bdc80 <the_repo>, c=0x5555559c8ef0, compute_if_not_present=1,
        settings=0x5555559c6950, computed=0x7fffffffd854) at bloom.c:271
    271				load_bloom_filter_from_graph(r->objects->commit_graph,
    (gdb) p *filter
    $2 = {data = 0x0, len = 0}
    (gdb) n
    275		if (filter->data && filter->len)
    (gdb) p *filter
    $3 = {data = 0x7ffff7fc24a8 "\210\210\322a\267\234\214s}\004J\265\313\201\241\032e\312\034", len = 2}

If I'm parsing this all correctly, Git used the v1 filter corresponding
to that commit, and did not recompute a new one.

I think that this could lead to incorrect results if you use this to
masquerade a v1 Bloom filter as a v2 one. Since they use different
implementations (one correct, one not) of murmur3, that opens us up to
false negatives, at which point all bets are off.

So I think we want to be more careful about when we load the existing
Bloom data or not. I think we probably want to load it in all cases, but
only read it when we have compatible Bloom settings. That should stop us
from exhibiting this kind of bug, but it also gives us a handle on
existing Bloom data if we wanted to copy forward existing results where
we can.

If all of this tracks, I think that there is a gap in our test coverage
that didn't catch this earlier.

Thanks,
Taylor

Junio C Hamano July 27, 2023, 12:49 a.m. UTC | #4

Taylor Blau <me@ttaylorr.com> writes:

> Sorry for not getting to this sooner. I didn't notice anything during my
> review, but I think there may be a bug here.
>
> Suppose that we have an existing commit-graph with v1 Bloom filters. If
> we then try to rewrite that commit-graph using v2 Bloom filters, we
> *should* attempt to recompute the filter from scratch. But AFAICT, that
> isn't what happens.
> ...
> If I'm parsing this all correctly, Git used the v1 filter corresponding
> to that commit, and did not recompute a new one.
>
> I think that this could lead to incorrect results if you use this to
> masquerade a v1 Bloom filter as a v2 one.

That indeed is very bad.  How hard it would be to construct a test
case that fails if filter computed as v1 is marketed as v2?  A test
may be far easier to construct if it does not have to be end-to-end
(e.g. instrument the codepath you followed through with the debugger
with trace2 annotations and see the control takes the right branches
by reading the log).

> So I think we want to be more careful about when we load the existing
> Bloom data or not. I think we probably want to load it in all cases, but
> only read it when we have compatible Bloom settings. That should stop us
> from exhibiting this kind of bug, but it also gives us a handle on
> existing Bloom data if we wanted to copy forward existing results where
> we can.
>
> If all of this tracks, I think that there is a gap in our test coverage
> that didn't catch this earlier.

Yeah, thanks for raising a concern.

Jonathan?

Jonathan Tan July 27, 2023, 5:39 p.m. UTC | #5

Junio C Hamano <gitster@pobox.com> writes:
> Taylor Blau <me@ttaylorr.com> writes:
> > Suppose that we have an existing commit-graph with v1 Bloom filters. If
> > we then try to rewrite that commit-graph using v2 Bloom filters, we
> > *should* attempt to recompute the filter from scratch. But AFAICT, that
> > isn't what happens.
> > ...
> > If I'm parsing this all correctly, Git used the v1 filter corresponding
> > to that commit, and did not recompute a new one.
> >
> > I think that this could lead to incorrect results if you use this to
> > masquerade a v1 Bloom filter as a v2 one.
> 
> That indeed is very bad.  How hard it would be to construct a test
> case that fails if filter computed as v1 is marketed as v2?  A test
> may be far easier to construct if it does not have to be end-to-end
> (e.g. instrument the codepath you followed through with the debugger
> with trace2 annotations and see the control takes the right branches
> by reading the log).

Ah, thanks, Taylor, for so meticulously investigating this. I'll take a look.

A test should be doable - we already have tests (the ones that use
"get_first_changed_path_filter") that check the bytes of the filter
generated, so we should be able to write a test that writes one version,
writes the other, then checks the bytes.

> > So I think we want to be more careful about when we load the existing
> > Bloom data or not. I think we probably want to load it in all cases, but
> > only read it when we have compatible Bloom settings. That should stop us
> > from exhibiting this kind of bug, but it also gives us a handle on
> > existing Bloom data if we wanted to copy forward existing results where
> > we can.

The intention in the current patch set was to not load it at all when we
have incompatible Bloom settings because it appeared quite troublesome
to notate which Bloom filter in memory is of which version. If we want
to copy forward existing results, we can change that, but I don't know
whether it's worth doing that (and if we think it's worth doing, this
should probably go in another patch set).

> > If all of this tracks, I think that there is a gap in our test coverage
> > that didn't catch this earlier.
> 
> Yeah, thanks for raising a concern.
> 
> Jonathan?

I'll take a look. Yes this does seem like a gap in test coverage -
I thought the existing test that checks that Bloom filters are not
used when a different version is requested would be sufficient, but
apparently that's not the case.

Taylor Blau July 27, 2023, 5:56 p.m. UTC | #6

On Thu, Jul 27, 2023 at 10:39:05AM -0700, Jonathan Tan wrote:
> A test should be doable - we already have tests (the ones that use
> "get_first_changed_path_filter") that check the bytes of the filter
> generated, so we should be able to write a test that writes one version,
> writes the other, then checks the bytes.

Thanks for looking into it!

> > > So I think we want to be more careful about when we load the existing
> > > Bloom data or not. I think we probably want to load it in all cases, but
> > > only read it when we have compatible Bloom settings. That should stop us
> > > from exhibiting this kind of bug, but it also gives us a handle on
> > > existing Bloom data if we wanted to copy forward existing results where
> > > we can.
>
> The intention in the current patch set was to not load it at all when we
> have incompatible Bloom settings because it appeared quite troublesome
> to notate which Bloom filter in memory is of which version. If we want
> to copy forward existing results, we can change that, but I don't know
> whether it's worth doing that (and if we think it's worth doing, this
> should probably go in another patch set).

Yeah, I think having Bloom filters accessible from a commit-graph
regardless of whether or not it matches our Bloom filter version is
prerequisite to being able to implement something like this.

I feel like this is important enough to do in the same patch set, or the
same release to avoid surprising operators when their commit-graph write
suddenly recomputes all of its Bloom filters.

Since we already store the Bloom version that we're using in the
commit-graph file itself, shouldn't it be something along the lines of
sticking that value onto the bloom_filter when we read its contents?

Although I don't think that you'd even need to annotate each individual
filter, since you know that every pre-existing Bloom filter you are able
to find has the version given by:

    the_repository->objects->commit_graph->bloom_filter_settings->hash_version

right?

Thanks,
Taylor

Junio C Hamano July 27, 2023, 6:44 p.m. UTC | #7

Junio C Hamano <gitster@pobox.com> writes:

> ...  How hard it would be to construct a test
> case that fails if filter computed as v1 is marketed as v2?

There may be a more effective way for future-proofing, besides
ensuring the test coverage.

Although this series added a way to see which version the on-disk
data is using using the version field, I do not think it touched the
"struct bloom_filter" and "struct bloom_key" that represent the
in-core data.  If we had a member in these structures that
get_or_compute_bloom_filter() can fill in from the on-disk structure
or the version it used when it computed the filter anew, would it
become easier to catch the case where we try to add a version 2
computed key to a filter that was read from version 1 on-disk
structure, presumably at add_key_to_filter()?

Thanks.

Jonathan Tan July 27, 2023, 8:53 p.m. UTC | #8

Taylor Blau <me@ttaylorr.com> writes:
> > The intention in the current patch set was to not load it at all when we
> > have incompatible Bloom settings because it appeared quite troublesome
> > to notate which Bloom filter in memory is of which version. If we want
> > to copy forward existing results, we can change that, but I don't know
> > whether it's worth doing that (and if we think it's worth doing, this
> > should probably go in another patch set).
> 
> Yeah, I think having Bloom filters accessible from a commit-graph
> regardless of whether or not it matches our Bloom filter version is
> prerequisite to being able to implement something like this.
> 
> I feel like this is important enough to do in the same patch set, or the
> same release to avoid surprising operators when their commit-graph write
> suddenly recomputes all of its Bloom filters.

Suddenly reading many (or most) of the repo's trees would be a similar
surprise, right?

Also this would happen only if the server operator explicitly sets a
changed path filter version. If they leave it as-is, commit graphs will
still be written with the same version as the one on disk.

> Since we already store the Bloom version that we're using in the
> commit-graph file itself, shouldn't it be something along the lines of
> sticking that value onto the bloom_filter when we read its contents?
> 
> Although I don't think that you'd even need to annotate each individual
> filter, since you know that every pre-existing Bloom filter you are able
> to find has the version given by:
> 
>     the_repository->objects->commit_graph->bloom_filter_settings->hash_version
> 
> right?
> 
> Thanks,
> Taylor

Regarding consulting commit_graph->bloom_filter_settings->hash_version,
the issue I ran into was that firstly, I didn't know what to do about
commit_graph->base_graph (which also has its own bloom_filter_settings)
and what to do if it had a contradictory hash_version. And even if
we found a way to unify those, it is not true that every Bloom filter
in memory is of that version, since we may have generated some that
correspond to the version we're writing (not the version on disk).
In particular, the Bloom filters we write come from a commit slab
(bloom_filters in bloom.c) and in that slab, both Bloom filters from
disk and Bloom filters that were generated in-process coexist.

I also thought of your other proposal of augmenting struct bloom_filter
to also include the version. The issue I ran into there is if, for a
given commit, there already exists a Bloom filter read from disk with
the wrong version, what should we do? Overwrite it, or store both
versions in memory? (We can't just immediately output the Bloom filter
to disk and forget about the new version, only storing its size so that
we can generate the BIDX, because in the current code, generation and
writing to disk are separate. We could try to refactor it, but I didn't
want to make such a large change to reduce the possibility of bugs.)
Both storing the version number and storing an additional pointer for a
second version would increase memory consumption too, even when
supporting two versions isn't needed, but maybe this isn't a big deal.

Taylor Blau Aug. 1, 2023, 6:08 p.m. UTC | #9

On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > > The intention in the current patch set was to not load it at all when we
> > > have incompatible Bloom settings because it appeared quite troublesome
> > > to notate which Bloom filter in memory is of which version. If we want
> > > to copy forward existing results, we can change that, but I don't know
> > > whether it's worth doing that (and if we think it's worth doing, this
> > > should probably go in another patch set).
> >
> > Yeah, I think having Bloom filters accessible from a commit-graph
> > regardless of whether or not it matches our Bloom filter version is
> > prerequisite to being able to implement something like this.
> >
> > I feel like this is important enough to do in the same patch set, or the
> > same release to avoid surprising operators when their commit-graph write
> > suddenly recomputes all of its Bloom filters.
>
> Suddenly reading many (or most) of the repo's trees would be a similar
> surprise, right?

That's a good point. I think in general I'd expect Git to avoid
recomputing Bloom filters where that work can be avoided, if the work in
order to detect whether or not we need to recompute a filter is cheap
enough to carry out.

> Also this would happen only if the server operator explicitly sets a
> changed path filter version. If they leave it as-is, commit graphs will
> still be written with the same version as the one on disk.

I think that I could live with that if the default is to leave things
as-is.

I still think that it's worth it to have this functionality to propagate
Bloom filters forward should ship in Git, but we can work on that
outside of this series.

> Regarding consulting commit_graph->bloom_filter_settings->hash_version,
> the issue I ran into was that firstly, I didn't know what to do about
> commit_graph->base_graph (which also has its own bloom_filter_settings)
> and what to do if it had a contradictory hash_version. And even if
> we found a way to unify those, it is not true that every Bloom filter
> in memory is of that version, since we may have generated some that
> correspond to the version we're writing (not the version on disk).
> In particular, the Bloom filters we write come from a commit slab
> (bloom_filters in bloom.c) and in that slab, both Bloom filters from
> disk and Bloom filters that were generated in-process coexist.

Would we ever want to have a commit-graph chain with mixed Bloom filter
versions?

We avoid mixing generation number schemes across multiple layers of a
commit-graph chain. But I don't see any reason that we should or need to
have a similar restriction in place for the Bloom filter version. Both
are readable, as long as the user-provided configuration allows them to
be.

We just have to interpret them differently depending on what layer of
the graph (and therefore, what Bloom filter version they are) they come
from.

Sorry for thinking aloud a little there. I think that this means that we
at minimum have to keep in context the commit-graph layer we found the
Bloom filter in so that we can tie that back to its Bloom filter
version. That might just mean that we have to tag each Bloom filter with
the version it was computed under, or maybe we already have the
commit-graph layer in context, in which case we shouldn't need an
additional field.

My gut is telling me that we probably *do* need such a field, since we
don't often have a reference to the particular layer that we found a
Bloom filter in, just the tip of the commit-graph chain that it came
from.

> I also thought of your other proposal of augmenting struct bloom_filter
> to also include the version. The issue I ran into there is if, for a
> given commit, there already exists a Bloom filter read from disk with
> the wrong version, what should we do? Overwrite it, or store both
> versions in memory? (We can't just immediately output the Bloom filter
> to disk and forget about the new version, only storing its size so that
> we can generate the BIDX, because in the current code, generation and
> writing to disk are separate. We could try to refactor it, but I didn't
> want to make such a large change to reduce the possibility of bugs.)
> Both storing the version number and storing an additional pointer for a
> second version would increase memory consumption too, even when
> supporting two versions isn't needed, but maybe this isn't a big deal.

It's likely that I'm missing something here, but what is stopping us
from discarding the old Bloom filter as soon as we generate the new
one? We shouldn't need to load the old filter again out of the commit
slab, right?

Thanks,
Taylor

Jonathan Tan Aug. 1, 2023, 6:52 p.m. UTC | #10

Taylor Blau <me@ttaylorr.com> writes:
> On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> > Suddenly reading many (or most) of the repo's trees would be a similar
> > surprise, right?
> 
> That's a good point. I think in general I'd expect Git to avoid
> recomputing Bloom filters where that work can be avoided, if the work in
> order to detect whether or not we need to recompute a filter is cheap
> enough to carry out.

Makes sense. I just don't think that there is a cheap way to detect if
a filter does not need to be recomputed (the closest way I think we have
is something that will require reading a lot of trees in a repo).

> > Also this would happen only if the server operator explicitly sets a
> > changed path filter version. If they leave it as-is, commit graphs will
> > still be written with the same version as the one on disk.
> 
> I think that I could live with that if the default is to leave things
> as-is.

Ah, thanks.
 
> I still think that it's worth it to have this functionality to propagate
> Bloom filters forward should ship in Git, but we can work on that
> outside of this series.

Makes sense.

> > Regarding consulting commit_graph->bloom_filter_settings->hash_version,
> > the issue I ran into was that firstly, I didn't know what to do about
> > commit_graph->base_graph (which also has its own bloom_filter_settings)
> > and what to do if it had a contradictory hash_version. And even if
> > we found a way to unify those, it is not true that every Bloom filter
> > in memory is of that version, since we may have generated some that
> > correspond to the version we're writing (not the version on disk).
> > In particular, the Bloom filters we write come from a commit slab
> > (bloom_filters in bloom.c) and in that slab, both Bloom filters from
> > disk and Bloom filters that were generated in-process coexist.
> 
> Would we ever want to have a commit-graph chain with mixed Bloom filter
> versions?

Probably not, but I wanted to be robust in case a third-party tool wrote a chain with
mixed versions.

> We avoid mixing generation number schemes across multiple layers of a
> commit-graph chain. But I don't see any reason that we should or need to
> have a similar restriction in place for the Bloom filter version. Both
> are readable, as long as the user-provided configuration allows them to
> be.

Yes, that's true - there is no inherent reason why we can't mix them,
unlike with generation numbers.

> We just have to interpret them differently depending on what layer of
> the graph (and therefore, what Bloom filter version they are) they come
> from.
> 
> Sorry for thinking aloud a little there. I think that this means that we
> at minimum have to keep in context the commit-graph layer we found the
> Bloom filter in so that we can tie that back to its Bloom filter
> version. That might just mean that we have to tag each Bloom filter with
> the version it was computed under, or maybe we already have the
> commit-graph layer in context, in which case we shouldn't need an
> additional field.
> 
> My gut is telling me that we probably *do* need such a field, since we
> don't often have a reference to the particular layer that we found a
> Bloom filter in, just the tip of the commit-graph chain that it came
> from.

We'll need the additional field because we don't know which commit graph
layer it comes from. In fact, we don't even know which *repo* the commit
comes from, since the commit slab is global. (Moving the slab to being
under a repo or under a commit graph layer would fix this.)

But I think there still remains the question of whether we really need
to support multiple versions in one Git invocation.

> > I also thought of your other proposal of augmenting struct bloom_filter
> > to also include the version. The issue I ran into there is if, for a
> > given commit, there already exists a Bloom filter read from disk with
> > the wrong version, what should we do? Overwrite it, or store both
> > versions in memory? (We can't just immediately output the Bloom filter
> > to disk and forget about the new version, only storing its size so that
> > we can generate the BIDX, because in the current code, generation and
> > writing to disk are separate. We could try to refactor it, but I didn't
> > want to make such a large change to reduce the possibility of bugs.)
> > Both storing the version number and storing an additional pointer for a
> > second version would increase memory consumption too, even when
> > supporting two versions isn't needed, but maybe this isn't a big deal.
> 
> It's likely that I'm missing something here, but what is stopping us
> from discarding the old Bloom filter as soon as we generate the new
> one? We shouldn't need to load the old filter again out of the commit
> slab, right?
> 
> Thanks,
> Taylor

I did not look at the code closely enough except to see that there was
a gap between generating the new-version Bloom filters and writing them
to disk, and I was concerned that now, or in the future, there would
be some code in that gap that reads the Bloom filter for a commit and
expects the old-version Bloom filter there.

Taylor Blau Aug. 3, 2023, 12:01 a.m. UTC | #11

On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote:
> On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> > Taylor Blau <me@ttaylorr.com> writes:
> > > > The intention in the current patch set was to not load it at all when we
> > > > have incompatible Bloom settings because it appeared quite troublesome
> > > > to notate which Bloom filter in memory is of which version. If we want
> > > > to copy forward existing results, we can change that, but I don't know
> > > > whether it's worth doing that (and if we think it's worth doing, this
> > > > should probably go in another patch set).
> > >
> > > Yeah, I think having Bloom filters accessible from a commit-graph
> > > regardless of whether or not it matches our Bloom filter version is
> > > prerequisite to being able to implement something like this.
> > >
> > > I feel like this is important enough to do in the same patch set, or the
> > > same release to avoid surprising operators when their commit-graph write
> > > suddenly recomputes all of its Bloom filters.
> >
> > Suddenly reading many (or most) of the repo's trees would be a similar
> > surprise, right?
>
> That's a good point. I think in general I'd expect Git to avoid
> recomputing Bloom filters where that work can be avoided, if the work in
> order to detect whether or not we need to recompute a filter is cheap
> enough to carry out.

I spent some time implementing this (patches are available in the branch
'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom
filter versions is kind of tricky, but do-able without having to
implement too much on top of what's already there.

But I don't think that it's enough to say that we can reuse a commit's
Bloom filter iff that commit's tree has no paths with characters >=
0x80. Suppose we have such a tree, whose Bloom filter we believe to be
reusable. If its first parent *does* have such a path, then that path
would appear as a deletion relative to its first parent. So that path
*would* be in the filter, meaning that it isn't reusable.

So I think the revised condition is something like: a commit's Bloom
filter is reusable when there are no paths with characters >= 0x80 in
a tree-diff against its first parent. I think that ensuring that there
are no such paths in both a commit's root tree, as well as its first
parent's root tree is equivalent, since that removes the possibility of
such a path showing up in its tree-diff.

As long as we aren't generating a commit-graph with --stdin-packs, then
we process commits in generation order, so we will see parents before
their children. I think we could reuse existing filters in that case,
but the condition is slightly more complex than I originally thought.

Thanks,
Taylor

Derrick Stolee Aug. 3, 2023, 1:18 p.m. UTC | #12

On 8/2/2023 8:01 PM, Taylor Blau wrote:
> On Tue, Aug 01, 2023 at 02:08:50PM -0400, Taylor Blau wrote:

>> That's a good point. I think in general I'd expect Git to avoid
>> recomputing Bloom filters where that work can be avoided, if the work in
>> order to detect whether or not we need to recompute a filter is cheap
>> enough to carry out.
> 
> I spent some time implementing this (patches are available in the branch
> 'tb/path-filter-fix-upgrade' from my fork). Handling incompatible Bloom
> filter versions is kind of tricky, but do-able without having to
> implement too much on top of what's already there.
> 
> But I don't think that it's enough to say that we can reuse a commit's
> Bloom filter iff that commit's tree has no paths with characters >=
> 0x80. Suppose we have such a tree, whose Bloom filter we believe to be
> reusable. If its first parent *does* have such a path, then that path
> would appear as a deletion relative to its first parent. So that path
> *would* be in the filter, meaning that it isn't reusable.
> 
> So I think the revised condition is something like: a commit's Bloom
> filter is reusable when there are no paths with characters >= 0x80 in
> a tree-diff against its first parent. I think that ensuring that there
> are no such paths in both a commit's root tree, as well as its first
> parent's root tree is equivalent, since that removes the possibility of
> such a path showing up in its tree-diff.

This condition is exactly "we computed the diff to know which paths were
input to the filter" which is as difficult as recomputing the Bloom filter
from scratch. I don't think there is much room to gain a performance
improvement here.
 
Thanks,
-Stolee

Taylor Blau Aug. 3, 2023, 6:45 p.m. UTC | #13

On Thu, Aug 03, 2023 at 09:18:11AM -0400, Derrick Stolee wrote:
> > So I think the revised condition is something like: a commit's Bloom
> > filter is reusable when there are no paths with characters >= 0x80 in
> > a tree-diff against its first parent. I think that ensuring that there
> > are no such paths in both a commit's root tree, as well as its first
> > parent's root tree is equivalent, since that removes the possibility of
> > such a path showing up in its tree-diff.
>
> This condition is exactly "we computed the diff to know which paths were
> input to the filter" which is as difficult as recomputing the Bloom filter
> from scratch. I don't think there is much room to gain a performance
> improvement here.

I think that's true in the worst case, and certainly for repositories
with many tree entries which have characters >= 0x80.

But note that it's a heuristic in one direction only. If we know that a
commit's root tree (and that of it's first parent, if it has one) is
free of any such paths, then it's impossible for the first parent
tree-diff to contain such an entry, and therefore we can reuse any
existing filter.

Of course, a commit's root tree (and its parent) may both have a path
whose characters >= 0x80 while still not seeing a corresponding entry
show up in the tree-diff if that path is unchanged between a commit and
its first parent.

I think if we were looking at every tree every time only to realize that
we have to go back and compute its changed-path Bloom filter, this would
be a non-starter. But since we "cache" the results of our walk via the
tree object's flags bits, we can skip looking at many trees.

In my testing, this showed a significant improvement on linux.git and
git.git. My setup for testing is something like:

    $ git clone git@github.com:torvalds/linux.git
    $ cd linux
    $ git commit-graph write --reachable --changed-paths
    $ graph=".git/objects/info/commit-graph"
    $ mv $graph{,.bak}

so that .git/objects/info/commit-graph.bak is a commit-graph with v1
changed-path Bloom filters for all commits in generation order.

With that in place, I can do:

    $ git config commitGraph.changedPathsVersion 2
    $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \
        'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths'

, producing the following results on linux.git:

    Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
      Range (min … max):   124.621 s … 125.227 s    3 runs

    Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
      Range (min … max):   79.112 s … 79.437 s    3 runs

    Summary
      'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
        1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'

On git.git, writing a new commit-graph after upgrading from v1 to v2
filters went from taking 4.163 seconds to 3.348 seconds, for a more
modest 1.24x speed-up.

Of course, all of this depends on how much of the tree meets the above
conditions, so we'd expect worse results on repositories with paths that
contain characters >= 0x80. I think we'd want some kind of mechanism
(probably via config, not a GIT_TEST environment variable) to control
whether or not to upgrade existing filters.

Thanks,
Taylor

[v6,0/7] Changed path filter hash fix and version bump

Message

Comments