diff mbox series

[v2,10/11] commit-graph: check all leading directories in changed path Bloom filters

Message ID 9c2076b4ce46918fce8f05e609b057611ec56e13.1592934430.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series More commit-graph/Bloom filter improvements | expand

Commit Message

Johannes Schindelin via GitGitGadget June 23, 2020, 5:47 p.m. UTC
From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>

The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well.  Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git rev-list HEAD -- "$path"

before and after this change for 5000+ randomly* selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
  linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
  tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%

*Path selection was done with the following pipeline:

	git ls-tree -r --name-only HEAD | sort -R | head -n 5000

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

This adjusts the tracing values in t4216-log-bloom.sh, which provides a
concrete way to notice the improvement.

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c           | 41 ++++++++++++++++++++++++++++++++---------
 revision.h           |  6 ++++--
 t/t4216-log-bloom.sh |  2 +-
 3 files changed, 37 insertions(+), 12 deletions(-)

Comments

René Scharfe June 25, 2020, 7:25 a.m. UTC | #1
Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>
>
> The file 'dir/subdir/file' can only be modified if its leading
> directories 'dir' and 'dir/subdir' are modified as well.
>
> So when checking modified path Bloom filters looking for commits
> modifying a path with multiple path components, then check not only
> the full path in the Bloom filters, but all its leading directories as
> well.  Take care to check these paths in "deepest first" order,
> because it's the full path that is least likely to be modified, and
> the Bloom filter queries can short circuit sooner.
>
> This can significantly reduce the average false positive rate, by
> about an order of magnitude or three(!), and can further speed up
> pathspec-limited revision walks.  The table below compares the average
> false positive rate and runtime of
>
>   git rev-list HEAD -- "$path"
>
> before and after this change for 5000+ randomly* selected paths from
> each repository:
>
>                     Average false           Average        Average
>                     positive rate           runtime        runtime
>                   before     after     before     after   difference
>   ------------------------------------------------------------------
>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>
> *Path selection was done with the following pipeline:
>
> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>
> The improvements in runtime are much smaller than the improvements in
> average false positive rate, as we are clearly reaching diminishing
> returns here.  However, all these timings depend on that accessing
> tree objects is reasonably fast (warm caches).  If we had a partial
> clone and the tree objects had to be fetched from a promisor remote,
> e.g.:
>
>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>         commit-graph write --reachable
>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>         rev-list HEAD -- "$path"
>
> then checking all leading path component can reduce the runtime from
> over an hour to a few seconds (and this is with the clone and the
> promisor on the same machine).
>
> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
> concrete way to notice the improvement.
>
> Helped-by: Taylor Blau <me@ttaylorr.com>
> Helped-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c           | 41 ++++++++++++++++++++++++++++++++---------
>  revision.h           |  6 ++++--
>  t/t4216-log-bloom.sh |  2 +-
>  3 files changed, 37 insertions(+), 12 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index b53377cd52..077888ee51 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  {
>  	struct pathspec_item *pi;
>  	char *path_alloc = NULL;
> -	const char *path;
> +	const char *path, *p;
>  	int last_index;
> -	int len;
> +	size_t len;
> +	int path_component_nr = 1;
>
>  	if (!revs->commits)
>  		return;
> @@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>  		return;
>  	}
>
> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +	p = path;
> +	while (*p) {
> +		if (is_dir_sep(*p))
> +			path_component_nr++;
> +		p++;
> +	}
> +
> +	revs->bloom_keys_nr = path_component_nr;
> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
> +
> +	fill_bloom_key(path, len, &revs->bloom_keys[0],
> +		       revs->bloom_filter_settings);
> +	path_component_nr = 1;
> +
> +	p = path + len - 1;

len cannot be 0 at this point, as patch 9 made sure, so this is safe.
Good.

> +	while (p > path) {
> +		if (is_dir_sep(*p))
> +			fill_bloom_key(path, p - path,
> +				       &revs->bloom_keys[path_component_nr++],
> +				       revs->bloom_filter_settings);
> +		p--;
> +	}

This walks the directory hierarchy upwards and adds bloom filters for
shorter and shorter paths, ("deepest first").  Good.

And it supports all directory separators.  On Windows that would be
slash (/) and backslash (\).  I assume paths are normalized to use
only slashes when bloom filters are written, correct?  Then the lookup
side needs to normalize a given path to only use slashes as well,
otherwise paths with backslashes cannot be found.  This part seems to
be missing.

>
>  	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
>  		atexit(trace2_bloom_filter_statistics_atexit);
> @@ -724,7 +745,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  						 struct commit *commit)
>  {
>  	struct bloom_filter *filter;
> -	int result;
> +	int result = 1, j;
>
>  	if (!revs->repo->objects->commit_graph)
>  		return -1;
> @@ -744,9 +765,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
>  		return -1;
>  	}
>
> -	result = bloom_filter_contains(filter,
> -				       revs->bloom_key,
> -				       revs->bloom_filter_settings);
> +	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
> +		result = bloom_filter_contains(filter,
> +					       &revs->bloom_keys[j],
> +					       revs->bloom_filter_settings);
> +	}
>
>  	if (result)
>  		count_bloom_filter_maybe++;
> @@ -786,7 +809,7 @@ static int rev_compare_tree(struct rev_info *revs,
>  			return REV_TREE_SAME;
>  	}
>
> -	if (revs->bloom_key && !nth_parent) {
> +	if (revs->bloom_keys_nr && !nth_parent) {
>  		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
>
>  		if (bloom_ret == 0)
> diff --git a/revision.h b/revision.h
> index 7c026fe41f..abbfb4ab59 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -295,8 +295,10 @@ struct rev_info {
>  	struct topo_walk_info *topo_walk_info;
>
>  	/* Commit graph bloom filter fields */
> -	/* The bloom filter key for the pathspec */
> -	struct bloom_key *bloom_key;
> +	/* The bloom filter key(s) for the pathspec */
> +	struct bloom_key *bloom_keys;
> +	int bloom_keys_nr;
> +
>  	/*
>  	 * The bloom filter settings used to generate the key.
>  	 * This is loaded from the commit-graph being used.
> diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
> index f890cc4737..84f95972ca 100755
> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -146,7 +146,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
>
>  test_bloom_filters_used_when_some_filters_are_missing () {
>  	log_args=$1
> -	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
> +	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
>  	setup "$log_args" &&
>  	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
>  	test_cmp log_wo_bloom log_w_bloom
>
Derrick Stolee June 25, 2020, 3:05 p.m. UTC | #2
On 6/25/2020 3:25 AM, René Scharfe wrote:
> Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
>> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@gmail.com>
>>
>> The file 'dir/subdir/file' can only be modified if its leading
>> directories 'dir' and 'dir/subdir' are modified as well.
>>
>> So when checking modified path Bloom filters looking for commits
>> modifying a path with multiple path components, then check not only
>> the full path in the Bloom filters, but all its leading directories as
>> well.  Take care to check these paths in "deepest first" order,
>> because it's the full path that is least likely to be modified, and
>> the Bloom filter queries can short circuit sooner.
>>
>> This can significantly reduce the average false positive rate, by
>> about an order of magnitude or three(!), and can further speed up
>> pathspec-limited revision walks.  The table below compares the average
>> false positive rate and runtime of
>>
>>   git rev-list HEAD -- "$path"
>>
>> before and after this change for 5000+ randomly* selected paths from
>> each repository:
>>
>>                     Average false           Average        Average
>>                     positive rate           runtime        runtime
>>                   before     after     before     after   difference
>>   ------------------------------------------------------------------
>>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>>
>> *Path selection was done with the following pipeline:
>>
>> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>>
>> The improvements in runtime are much smaller than the improvements in
>> average false positive rate, as we are clearly reaching diminishing
>> returns here.  However, all these timings depend on that accessing
>> tree objects is reasonably fast (warm caches).  If we had a partial
>> clone and the tree objects had to be fetched from a promisor remote,
>> e.g.:
>>
>>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>>         commit-graph write --reachable
>>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>>         rev-list HEAD -- "$path"
>>
>> then checking all leading path component can reduce the runtime from
>> over an hour to a few seconds (and this is with the clone and the
>> promisor on the same machine).
>>
>> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
>> concrete way to notice the improvement.
>>
>> Helped-by: Taylor Blau <me@ttaylorr.com>
>> Helped-by: René Scharfe <l.s.r@web.de>
>> Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  revision.c           | 41 ++++++++++++++++++++++++++++++++---------
>>  revision.h           |  6 ++++--
>>  t/t4216-log-bloom.sh |  2 +-
>>  3 files changed, 37 insertions(+), 12 deletions(-)
>>
>> diff --git a/revision.c b/revision.c
>> index b53377cd52..077888ee51 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>>  	char *path_alloc = NULL;
>> -	const char *path;
>> +	const char *path, *p;
>>  	int last_index;
>> -	int len;
>> +	size_t len;
>> +	int path_component_nr = 1;
>>
>>  	if (!revs->commits)
>>  		return;
>> @@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  		return;
>>  	}
>>
>> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> +	p = path;
>> +	while (*p) {
>> +		if (is_dir_sep(*p))
>> +			path_component_nr++;
>> +		p++;
>> +	}
>> +
>> +	revs->bloom_keys_nr = path_component_nr;
>> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>> +
>> +	fill_bloom_key(path, len, &revs->bloom_keys[0],
>> +		       revs->bloom_filter_settings);
>> +	path_component_nr = 1;
>> +
>> +	p = path + len - 1;
> 
> len cannot be 0 at this point, as patch 9 made sure, so this is safe.
> Good.
> 
>> +	while (p > path) {
>> +		if (is_dir_sep(*p))
>> +			fill_bloom_key(path, p - path,
>> +				       &revs->bloom_keys[path_component_nr++],
>> +				       revs->bloom_filter_settings);
>> +		p--;
>> +	}
> 
> This walks the directory hierarchy upwards and adds bloom filters for
> shorter and shorter paths, ("deepest first").  Good.
> 
> And it supports all directory separators.  On Windows that would be
> slash (/) and backslash (\).  I assume paths are normalized to use
> only slashes when bloom filters are written, correct?  Then the lookup
> side needs to normalize a given path to only use slashes as well,
> otherwise paths with backslashes cannot be found.  This part seems to
> be missing.

Yes, that's a good point. We _require_ the paths to be normalized
here to be Unix-style paths or else the Bloom filter keys are
incorrect. Thankfully, they are. Let's make that clear in-code by
using '/' instead of is_dir_sep().

Thanks,
-Stolee
SZEDER Gábor June 26, 2020, 6:34 a.m. UTC | #3
On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote:

> >> +	while (p > path) {
> >> +		if (is_dir_sep(*p))
> >> +			fill_bloom_key(path, p - path,
> >> +				       &revs->bloom_keys[path_component_nr++],
> >> +				       revs->bloom_filter_settings);
> >> +		p--;
> >> +	}
> > 
> > This walks the directory hierarchy upwards and adds bloom filters for
> > shorter and shorter paths, ("deepest first").  Good.
> > 
> > And it supports all directory separators.  On Windows that would be
> > slash (/) and backslash (\).  I assume paths are normalized to use
> > only slashes when bloom filters are written, correct?  Then the lookup
> > side needs to normalize a given path to only use slashes as well,
> > otherwise paths with backslashes cannot be found.  This part seems to
> > be missing.
> 
> Yes, that's a good point. We _require_ the paths to be normalized
> here to be Unix-style paths or else the Bloom filter keys are
> incorrect. Thankfully, they are.

Unfortunately, they aren't always...

Path normalization is done in normalize_path_copy_len(), whose
description says, among other things:

   * Performs the following normalizations on src, storing the result in dst:
   * - Ensures that components are separated by '/' (Windows only)

and the code indeed does:

        if (is_dir_sep(c)) {
                *dst++ = '/';

Now, while parsing pathspecs this function is called via:

  parse_pathspec()
    init_pathspec_item()
      prefix_path_gently()
        normalize_path_copy_len()

Unfortunately, init_pathspec_item() has this chain of conditions:

        /* Create match string which will be used for pathspec matching */
        if (pathspec_prefix >= 0) {
                match = xstrdup(copyfrom);
                prefixlen = pathspec_prefix;
        } else if (magic & PATHSPEC_FROMTOP) {
                match = xstrdup(copyfrom);
                prefixlen = 0;
        } else {
                match = prefix_path_gently(prefix, prefixlen,
                                           &prefixlen, copyfrom);
                if (!match) {
                        const char *hint_path = get_git_work_tree();
                        if (!hint_path)
                                hint_path = get_git_dir();
                        die(_("%s: '%s' is outside repository at '%s'"), elt,
                            copyfrom, absolute_path(hint_path));
                }
        }

which means that it doesn't always calls prefix_path_gently(), which,
in turn, means that 'pathspec_item->match' might remain un-normalized
in case of some unusual pathspecs.

The first condition is supposed to handle the case when one Git
process passes pathspecs to another, and is supposed to be "internal
use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via
PATHSPEC_PREFIX_ORIGIN, 2013-07-14), I haven't even tried to grok what
that might entail.

The second condition handles pathspecs explicitly relative to the root
of the work tree, i.e. ':/path'.  Adding a printf() to show the
original path and the resulting 'pathspec_item->match' does confirm
that no normalization is performed:

  expecting success of 9999.1 'test': 
          mkdir -p dir &&
          >dir/file &&
          git add ":/dir/file" &&
          git add ":(top)dir/file" &&
          test_might_fail git add ":/dir//file" &&
          git add ":(top)dir//file"
  
  orig:  ':/dir/file'
  match: 'dir/file'
  orig:  ':(top)dir/file'
  match: 'dir/file'
  orig:  ':/dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  orig:  ':(top)dir//file'
  match: 'dir//file'
  fatal: oops in prep_exclude
  not ok 1 - test

This is, of course, bad for Bloom filters, because the repeated
slashes are hashed as well and commits will be omitted from the output
of pathspec-limited revision walks, but apparently it also affects
other parts of Git.

And the else branch handles the rest, which, I believe, is by far the
most common case.

> Let's make that clear in-code by
> using '/' instead of is_dir_sep().
> 
> Thanks,
> -Stolee
Derrick Stolee June 26, 2020, 2:42 p.m. UTC | #4
On 6/26/2020 2:34 AM, SZEDER Gábor wrote:
> On Thu, Jun 25, 2020 at 11:05:04AM -0400, Derrick Stolee wrote:
> 
>>>> +	while (p > path) {
>>>> +		if (is_dir_sep(*p))
>>>> +			fill_bloom_key(path, p - path,
>>>> +				       &revs->bloom_keys[path_component_nr++],
>>>> +				       revs->bloom_filter_settings);
>>>> +		p--;
>>>> +	}
>>>
>>> This walks the directory hierarchy upwards and adds bloom filters for
>>> shorter and shorter paths, ("deepest first").  Good.
>>>
>>> And it supports all directory separators.  On Windows that would be
>>> slash (/) and backslash (\).  I assume paths are normalized to use
>>> only slashes when bloom filters are written, correct?  Then the lookup
>>> side needs to normalize a given path to only use slashes as well,
>>> otherwise paths with backslashes cannot be found.  This part seems to
>>> be missing.
>>
>> Yes, that's a good point. We _require_ the paths to be normalized
>> here to be Unix-style paths or else the Bloom filter keys are
>> incorrect. Thankfully, they are.
> 
> Unfortunately, they aren't always...
> 
> Path normalization is done in normalize_path_copy_len(), whose
> description says, among other things:
> 
>    * Performs the following normalizations on src, storing the result in dst:
>    * - Ensures that components are separated by '/' (Windows only)
> 
> and the code indeed does:
> 
>         if (is_dir_sep(c)) {
>                 *dst++ = '/';
> 
> Now, while parsing pathspecs this function is called via:
> 
>   parse_pathspec()
>     init_pathspec_item()
>       prefix_path_gently()
>         normalize_path_copy_len()
> 
> Unfortunately, init_pathspec_item() has this chain of conditions:
> 
>         /* Create match string which will be used for pathspec matching */
>         if (pathspec_prefix >= 0) {
>                 match = xstrdup(copyfrom);
>                 prefixlen = pathspec_prefix;
>         } else if (magic & PATHSPEC_FROMTOP) {
>                 match = xstrdup(copyfrom);
>                 prefixlen = 0;
>         } else {
>                 match = prefix_path_gently(prefix, prefixlen,
>                                            &prefixlen, copyfrom);
>                 if (!match) {
>                         const char *hint_path = get_git_work_tree();
>                         if (!hint_path)
>                                 hint_path = get_git_dir();
>                         die(_("%s: '%s' is outside repository at '%s'"), elt,
>                             copyfrom, absolute_path(hint_path));
>                 }
>         }
> 
> which means that it doesn't always calls prefix_path_gently(), which,
> in turn, means that 'pathspec_item->match' might remain un-normalized
> in case of some unusual pathspecs.
> 
> The first condition is supposed to handle the case when one Git
> process passes pathspecs to another, and is supposed to be "internal
> use only"; see 233c3e6c59 (parse_pathspec: preserve prefix length via
> PATHSPEC_PREFIX_ORIGIN, 2013-07-14), I haven't even tried to grok what
> that might entail.
> 
> The second condition handles pathspecs explicitly relative to the root
> of the work tree, i.e. ':/path'.  Adding a printf() to show the
> original path and the resulting 'pathspec_item->match' does confirm
> that no normalization is performed:
> 
>   expecting success of 9999.1 'test': 
>           mkdir -p dir &&
>           >dir/file &&
>           git add ":/dir/file" &&
>           git add ":(top)dir/file" &&
>           test_might_fail git add ":/dir//file" &&
>           git add ":(top)dir//file"
>   
>   orig:  ':/dir/file'
>   match: 'dir/file'
>   orig:  ':(top)dir/file'
>   match: 'dir/file'
>   orig:  ':/dir//file'
>   match: 'dir//file'
>   fatal: oops in prep_exclude
>   orig:  ':(top)dir//file'
>   match: 'dir//file'
>   fatal: oops in prep_exclude
>   not ok 1 - test
> 
> This is, of course, bad for Bloom filters, because the repeated
> slashes are hashed as well and commits will be omitted from the output
> of pathspec-limited revision walks, but apparently it also affects
> other parts of Git.
> 
> And the else branch handles the rest, which, I believe, is by far the
> most common case.

Thanks for this analysis. Clearly, there is already a bug here
when the input data is not pristine. I didn't see this message
when I submitted my v3, but normalizing the path data before
computing filters can (hopefully) be done as a small patch
before or after my v3 PATCH 10 without much conflict.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/revision.c b/revision.c
index b53377cd52..077888ee51 100644
--- a/revision.c
+++ b/revision.c
@@ -670,9 +670,10 @@  static void prepare_to_use_bloom_filter(struct rev_info *revs)
 {
 	struct pathspec_item *pi;
 	char *path_alloc = NULL;
-	const char *path;
+	const char *path, *p;
 	int last_index;
-	int len;
+	size_t len;
+	int path_component_nr = 1;
 
 	if (!revs->commits)
 		return;
@@ -709,8 +710,28 @@  static void prepare_to_use_bloom_filter(struct rev_info *revs)
 		return;
 	}
 
-	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
-	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
+	p = path;
+	while (*p) {
+		if (is_dir_sep(*p))
+			path_component_nr++;
+		p++;
+	}
+
+	revs->bloom_keys_nr = path_component_nr;
+	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
+
+	fill_bloom_key(path, len, &revs->bloom_keys[0],
+		       revs->bloom_filter_settings);
+	path_component_nr = 1;
+
+	p = path + len - 1;
+	while (p > path) {
+		if (is_dir_sep(*p))
+			fill_bloom_key(path, p - path,
+				       &revs->bloom_keys[path_component_nr++],
+				       revs->bloom_filter_settings);
+		p--;
+	}
 
 	if (trace2_is_enabled() && !bloom_filter_atexit_registered) {
 		atexit(trace2_bloom_filter_statistics_atexit);
@@ -724,7 +745,7 @@  static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 						 struct commit *commit)
 {
 	struct bloom_filter *filter;
-	int result;
+	int result = 1, j;
 
 	if (!revs->repo->objects->commit_graph)
 		return -1;
@@ -744,9 +765,11 @@  static int check_maybe_different_in_bloom_filter(struct rev_info *revs,
 		return -1;
 	}
 
-	result = bloom_filter_contains(filter,
-				       revs->bloom_key,
-				       revs->bloom_filter_settings);
+	for (j = 0; result && j < revs->bloom_keys_nr; j++) {
+		result = bloom_filter_contains(filter,
+					       &revs->bloom_keys[j],
+					       revs->bloom_filter_settings);
+	}
 
 	if (result)
 		count_bloom_filter_maybe++;
@@ -786,7 +809,7 @@  static int rev_compare_tree(struct rev_info *revs,
 			return REV_TREE_SAME;
 	}
 
-	if (revs->bloom_key && !nth_parent) {
+	if (revs->bloom_keys_nr && !nth_parent) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 
 		if (bloom_ret == 0)
diff --git a/revision.h b/revision.h
index 7c026fe41f..abbfb4ab59 100644
--- a/revision.h
+++ b/revision.h
@@ -295,8 +295,10 @@  struct rev_info {
 	struct topo_walk_info *topo_walk_info;
 
 	/* Commit graph bloom filter fields */
-	/* The bloom filter key for the pathspec */
-	struct bloom_key *bloom_key;
+	/* The bloom filter key(s) for the pathspec */
+	struct bloom_key *bloom_keys;
+	int bloom_keys_nr;
+
 	/*
 	 * The bloom filter settings used to generate the key.
 	 * This is loaded from the commit-graph being used.
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index f890cc4737..84f95972ca 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -146,7 +146,7 @@  test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
 
 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom