mbox series

[v4,0/4] Changed path filter hash fix and version bump

Message ID cover.1686677910.git.jonathantanmy@google.com (mailing list archive)
Headers show
Series Changed path filter hash fix and version bump | expand

Message

Jonathan Tan June 13, 2023, 5:39 p.m. UTC
Thanks Ramsay for spotting the errors and mentioning that I can use
octal escapes. Here's an update taking into account their comments.

Jonathan Tan (4):
  gitformat-commit-graph: describe version 2 of BDAT
  t4216: test changed path filters with high bit paths
  repo-settings: introduce commitgraph.changedPathsVersion
  commit-graph: new filter ver. that fixes murmur3

 Documentation/config/commitgraph.txt     | 16 ++++-
 Documentation/gitformat-commit-graph.txt |  9 ++-
 bloom.c                                  | 65 +++++++++++++++++++-
 bloom.h                                  |  8 ++-
 commit-graph.c                           | 29 +++++++--
 oss-fuzz/fuzz-commit-graph.c             |  2 +-
 repo-settings.c                          |  6 +-
 repository.h                             |  2 +-
 t/helper/test-bloom.c                    |  9 ++-
 t/t0095-bloom.sh                         |  8 +++
 t/t4216-log-bloom.sh                     | 77 ++++++++++++++++++++++++
 11 files changed, 210 insertions(+), 21 deletions(-)

Range-diff against v3:
1:  d4b63945f6 ! 1:  a3b52af4c9 gitformat-commit-graph: describe version 2 of BDAT
    @@ Documentation/gitformat-commit-graph.txt: All multi-byte numbers are in network
      	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
     -	in Probabilistic Verification"
     +	in Probabilistic Verification". Version 1 bloom filters have a bug that appears
    -+	when int is signed and the repository has path names that have characters >=
    ++	when char is signed and the repository has path names that have characters >=
     +	0x80; Git supports reading and writing them, but this ability will be removed
     +	in a future version of Git.
            - The number of times a path is hashed and hence the number of bit positions
2:  aa4535776e ! 2:  f095e2b486 t4216: test changed path filters with high bit paths
    @@ t/t4216-log-bloom.sh: test_expect_success 'Bloom generation backfills empty comm
     +}
     +
     +# chosen to be the same under all Unicode normalization forms
    -+CENT=$(printf "\xc2\xa2")
    ++CENT=$(printf "\302\242")
     +
    -+# Some systems (in particular, Linux on the CI running on GitHub at the time of
    -+# writing) store into CENT a literal backslash, then "x", and so on (instead of
    -+# the high-bit characters needed). In these systems, do not run the following
    -+# tests.
    -+if test "$(printf $CENT | perl -0777 -ne 'no utf8; print ord($_)')" = "194"
    -+then
    -+	test_set_prereq HIGH_BIT
    -+fi
    -+
    -+test_expect_success HIGH_BIT 'set up repo with high bit path, version 1 changed-path' '
    ++test_expect_success 'set up repo with high bit path, version 1 changed-path' '
     +	git init highbit1 &&
     +	test_commit -C highbit1 c1 "$CENT" &&
     +	git -C highbit1 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success HIGH_BIT 'setup check value of version 1 changed-path' '
    ++test_expect_success 'setup check value of version 1 changed-path' '
     +	(cd highbit1 &&
     +		printf "52a9" >expect &&
     +		get_first_changed_path_filter >actual)
     +'
     +
    -+# expect will not match actual if int is unsigned by default. Write the test
    ++# expect will not match actual if char is unsigned by default. Write the test
     +# in this way, so that a user running this test script can still see if the two
     +# files match. (It will appear as an ordinary success if they match, and a skip
     +# if not.)
     +if test_cmp highbit1/expect highbit1/actual
     +then
    -+	test_set_prereq SIGNED_INT_BY_DEFAULT
    ++	test_set_prereq SIGNED_CHAR_BY_DEFAULT
     +fi
    -+test_expect_success SIGNED_INT_BY_DEFAULT 'check value of version 1 changed-path' '
    ++test_expect_success SIGNED_CHAR_BY_DEFAULT 'check value of version 1 changed-path' '
     +	# Only the prereq matters for this test.
     +	true
     +'
     +
    -+test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 requested' '
    ++test_expect_success 'version 1 changed-path used when version 1 requested' '
     +	(cd highbit1 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
3:  d6982268a4 = 3:  6adfa53daf repo-settings: introduce commitgraph.changedPathsVersion
4:  e879483c42 ! 4:  5c65bf8a22 commit-graph: new filter ver. that fixes murmur3
    @@ t/t0095-bloom.sh: test_expect_success 'compute unseeded murmur3 hash for test st
      	Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004|
     
      ## t/t4216-log-bloom.sh ##
    -@@ t/t4216-log-bloom.sh: test_expect_success HIGH_BIT 'version 1 changed-path used when version 1 request
    +@@ t/t4216-log-bloom.sh: test_expect_success 'version 1 changed-path used when version 1 requested' '
      		test_bloom_filters_used "-- $CENT")
      '
      
    -+test_expect_success HIGH_BIT 'version 1 changed-path not used when version 2 requested' '
    ++test_expect_success 'version 1 changed-path not used when version 2 requested' '
     +	(cd highbit1 &&
     +		git config --add commitgraph.changedPathsVersion 2 &&
     +		test_bloom_filters_not_used "-- $CENT")
     +'
     +
    -+test_expect_success HIGH_BIT 'set up repo with high bit path, version 2 changed-path' '
    ++test_expect_success 'set up repo with high bit path, version 2 changed-path' '
     +	git init highbit2 &&
     +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
     +	test_commit -C highbit2 c2 "$CENT" &&
     +	git -C highbit2 commit-graph write --reachable --changed-paths
     +'
     +
    -+test_expect_success HIGH_BIT 'check value of version 2 changed-path' '
    ++test_expect_success 'check value of version 2 changed-path' '
     +	(cd highbit2 &&
     +		printf "c01f" >expect &&
     +		get_first_changed_path_filter >actual &&
     +		test_cmp expect actual)
     +'
     +
    -+test_expect_success HIGH_BIT 'version 2 changed-path used when version 2 requested' '
    ++test_expect_success 'version 2 changed-path used when version 2 requested' '
     +	(cd highbit2 &&
     +		test_bloom_filters_used "-- $CENT")
     +'
     +
    -+test_expect_success HIGH_BIT 'version 2 changed-path not used when version 1 requested' '
    ++test_expect_success 'version 2 changed-path not used when version 1 requested' '
     +	(cd highbit2 &&
     +		git config --add commitgraph.changedPathsVersion 1 &&
     +		test_bloom_filters_not_used "-- $CENT")

Comments

Junio C Hamano June 13, 2023, 7:21 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> Thanks Ramsay for spotting the errors and mentioning that I can use
> octal escapes. Here's an update taking into account their comments.

The changes look good.  Will queue.

Stolee, you had comments on an earlier round---how does this one
look?

Thanks.
Derrick Stolee June 20, 2023, 1:43 p.m. UTC | #2
On 6/13/2023 3:21 PM, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
> 
>> Thanks Ramsay for spotting the errors and mentioning that I can use
>> octal escapes. Here's an update taking into account their comments.
> 
> The changes look good.  Will queue.
> 
> Stolee, you had comments on an earlier round---how does this one
> look?

I'm sorry I'm so late to this. I've been meaning to get to it, but
it's been a crazy couple of weeks.

This version is not ready. The backwards compatibility story is
incomplete.

When commitGraph.changedPathsVersion is set, it does not allow
reading a previous filter version, leaving us in a poor performance
state until the commit-graph file can be rewritten.

While I was reviewing, it seemed reasonable to deprecate
commitGraph.readChangedPaths, but this use of "also restrict writes
to this version" didn't make sense to me at the time. Instead, it
would be good to have this clarity between the config options:

 commitGraph.readChangedPaths: should we read and use the filters
 that exist on disk? Defaults to 'true'.

 commitGraph.changedPathsVersion: Which version should we _write_
 when writing a new commit-graph? Defaults to '1' but will default
 to '2' in the next major verion, then '1' will no longer be an
 accepted value in the version after that.

The tricky part is that during the commit-graph write, you will
need to check the existing filter value to see if it matches. If
not, the filters will need to be recomputed from scratch. This
will change patch 4 a bit, but it's the right thing to do.

Thanks,
-Stolee
Jonathan Tan June 20, 2023, 9:56 p.m. UTC | #3
Derrick Stolee <derrickstolee@github.com> writes:
> When commitGraph.changedPathsVersion is set, it does not allow
> reading a previous filter version, leaving us in a poor performance
> state until the commit-graph file can be rewritten.
> 
> While I was reviewing, it seemed reasonable to deprecate
> commitGraph.readChangedPaths, but this use of "also restrict writes
> to this version" didn't make sense to me at the time. Instead, it
> would be good to have this clarity between the config options:
> 
>  commitGraph.readChangedPaths: should we read and use the filters
>  that exist on disk? Defaults to 'true'.
> 
>  commitGraph.changedPathsVersion: Which version should we _write_
>  when writing a new commit-graph? Defaults to '1' but will default
>  to '2' in the next major verion, then '1' will no longer be an
>  accepted value in the version after that.
> 
> The tricky part is that during the commit-graph write, you will
> need to check the existing filter value to see if it matches. If
> not, the filters will need to be recomputed from scratch. This
> will change patch 4 a bit, but it's the right thing to do.
> 
> Thanks,
> -Stolee

OK - this sounds reasonable. I'll take a look.
Taylor Blau June 21, 2023, 12:19 p.m. UTC | #4
On Tue, Jun 20, 2023 at 09:43:46AM -0400, Derrick Stolee wrote:
> This version is not ready. The backwards compatibility story is
> incomplete.

I'm also late to the party, but I agree with Stolee here, having come to
the same conclusion about needing to support reading older (corrupt) Bloom
filters when possible (i.e. when paths contain no bytes which have their
high-order bits set), and assuming the filter contains all paths
otherwise.

>  commitGraph.changedPathsVersion: Which version should we _write_
>  when writing a new commit-graph? Defaults to '1' but will default
>  to '2' in the next major verion, then '1' will no longer be an
>  accepted value in the version after that.

I am not sure if there's a situation where we'd ever want to not write
the newer versions when starting a new commit-graph (or chain) from
scratch.

I think that follows from what you and I are both suggesting w.r.t
backwards compatibility. If that's the case, I think that we could in
theory drop this configuration setting altogether.

Or, at the very least, we should be able to change it change only what
version we *write*, not read. I think this is what you are suggesting
above, but I am not 100% sure, so apologies if I'm just repeating what
you've already suggested.

> The tricky part is that during the commit-graph write, you will
> need to check the existing filter value to see if it matches. If
> not, the filters will need to be recomputed from scratch. This
> will change patch 4 a bit, but it's the right thing to do.

Yup, good suggestion.

Thanks,
Taylor
Derrick Stolee June 21, 2023, 5:53 p.m. UTC | #5
On 6/21/2023 8:19 AM, Taylor Blau wrote:
> On Tue, Jun 20, 2023 at 09:43:46AM -0400, Derrick Stolee wrote:

>>  commitGraph.changedPathsVersion: Which version should we _write_
>>  when writing a new commit-graph? Defaults to '1' but will default
>>  to '2' in the next major verion, then '1' will no longer be an
>>  accepted value in the version after that.
> 
> I am not sure if there's a situation where we'd ever want to not write
> the newer versions when starting a new commit-graph (or chain) from
> scratch.

I'd rather have the choice to start writing the new filter mode be
made by config rather than a change to the Git binary. Makes for a
more gradual rollout to be sure there aren't issues with the new
version.

So please keep the configuration value, but have it indicate the
mode used when writing filters.

Thanks,
-Stolee
Jonathan Tan June 22, 2023, 10:27 p.m. UTC | #6
Derrick Stolee <derrickstolee@github.com> writes:
> On 6/21/2023 8:19 AM, Taylor Blau wrote:
> > On Tue, Jun 20, 2023 at 09:43:46AM -0400, Derrick Stolee wrote:
> 
> >>  commitGraph.changedPathsVersion: Which version should we _write_
> >>  when writing a new commit-graph? Defaults to '1' but will default
> >>  to '2' in the next major verion, then '1' will no longer be an
> >>  accepted value in the version after that.
> > 
> > I am not sure if there's a situation where we'd ever want to not write
> > the newer versions when starting a new commit-graph (or chain) from
> > scratch.
> 
> I'd rather have the choice to start writing the new filter mode be
> made by config rather than a change to the Git binary. Makes for a
> more gradual rollout to be sure there aren't issues with the new
> version.
> 
> So please keep the configuration value, but have it indicate the
> mode used when writing filters.
> 
> Thanks,
> -Stolee

It looks like we can't avoid writing both versions (we need to write
version 1 so that we can reuse existing Bloom filters when writing, if
the repo has version 1 Bloom filters) so a config that tells us which to
write sounds doable. I'll take a look.
Derrick Stolee June 23, 2023, 1:18 p.m. UTC | #7
On 6/22/2023 6:27 PM, Jonathan Tan wrote:

> It looks like we can't avoid writing both versions (we need to write
> version 1 so that we can reuse existing Bloom filters when writing, if
> the repo has version 1 Bloom filters) so a config that tells us which to
> write sounds doable. I'll take a look.

I don't fully understand what you're saying here.

We need to be able to write both versions (not simultaneously, but
toggled via config) so we can roll out this change carefully instead
of suddenly due to the Git executable changing.

But we don't need to be able to write version 1 just so we can reuse
version 1 filters. In fact, we should be able to upgrade to version 2
if the config points at that, but we should _not_ re-use the filters
in that case.

This does present an interesting challenge for the upgrade: we have
the commitGraph.maxNewFilters option to limit the amount of new filters
we write at a given time. When shifting to version 2, we will start
from scratch, so that could have some effect. I will consider how to
handle this, perhaps by raising the number temporarily.

Thanks,
-Stolee