diff mbox series

[24/34] commit-graph: check all leading directories in modified path Bloom filters

Message ID 20200529085038.26008-25-szeder.dev@gmail.com (mailing list archive)
State New, archived
Headers show
Series An alternative modified path Bloom filters implementation | expand

Commit Message

SZEDER Gábor May 29, 2020, 8:50 a.m. UTC
The file 'dir/subdir/file' can only be modified if its leading
directories 'dir' and 'dir/subdir' are modified as well.

So when checking modified path Bloom filters looking for commits
modifying a path with multiple path components, then check not only
the full path in the Bloom filters, but all its leading directories as
well.  Take care to check these paths in "deepest first" order,
because it's the full path that is least likely to be modified, and
the Bloom filter queries can short circuit sooner.

This can significantly reduce the average false positive rate, by
about an order of magnitude or three(!), and can further speed up
pathspec-limited revision walks.  The table below compares the average
false positive rate and runtime of

  git -c core.modifiedPathBloomFilters=1 rev-list HEAD -- "$path"

before and after this change for 5000+ randomly selected paths from
each repository:

                    Average false           Average        Average
                    positive rate           runtime        runtime
                  before     after     before     after   difference
  ------------------------------------------------------------------
  android-base    0.526%   0.00416%    0.1727s   0.1503s   -13.0%
  cmssw           0.494%   0.00395%    0.0426s   0.0332s   -22.0%
  cpython         0.213%   0.01661%    0.0940s   0.0858s    -8.8%
  elasticsearch   0.679%   0.00325%    0.0191s   0.0147s   -23.2%
  gcc             0.398%   0.00892%    0.3423s   0.2192s   -36.0%
  gecko-dev       0.472%   0.00073%    0.6243s   0.5134s   -17.8%
  git             0.191%   0.06992%    0.0384s   0.0319s   -17.1%
  glibc           0.392%   0.01639%    0.0425s   0.0296s   -30.5%
  go              0.453%   0.01262%    0.0515s   0.0425s   -17.5%
  jdk             0.662%   0.00643%    0.0083s   0.0070s   -15.0%
  linux           0.434%   0.00749%    0.1102s   0.0911s   -17.3%
  llvm-project    0.511%   0.00391%    0.4865s   0.4290s   -11.8%
  rails           0.476%   0.01313%    0.0464s   0.0410s   -11.6%
  rust            0.457%   0.02551%    0.0590s   0.0462s   -21.7%
  tensorflow      0.511%   0.00824%    0.0642s   0.0517s   -19.5%
  webkit          0.661%   0.00101%    0.3315s   0.2576s   -22.3%

The improvements in runtime are much smaller than the improvements in
average false positive rate, as we are clearly reaching diminishing
returns here.  However, all these timings depend on that accessing
tree objects is reasonably fast (warm caches).  If we had a partial
clone and the tree objects had to be fetched from a promisor remote,
e.g.:

  $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
  $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
        commit-graph write --reachable
  $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
  $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
        rev-list HEAD -- "$path"

then checking all leading path component can reduce the runtime from
over an hour to a few seconds (and this is with the clone and the
promisor on the same machine).

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index f9a21ecdfb..1fd1b4f8dd 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -977,8 +977,10 @@  void init_pathspec_bloom_fields(struct repository *r,
 
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item *pi = &pathspec->items[i];
-		const char *path = pi->match;
+		const char *path = pi->match, *p;
 		size_t len = pi->len;
+		int path_component_nr = 0, j;
+		uint32_t *hashes;
 
 		/*
 		 * Pathspec parsing has normalized away any consecutive
@@ -988,13 +990,28 @@  void init_pathspec_bloom_fields(struct repository *r,
 		if (path[len - 1] == '/')
 			len--;
 
-		pi->modified_path_bloom_hashes_nr = graph->num_modified_path_bloom_hashes;
+		p = path;
+		do {
+			p = strchrnul(p + 1, '/');
+			path_component_nr++;
+		} while (p - path < len);
+
+		pi->modified_path_bloom_hashes_nr = path_component_nr * graph->num_modified_path_bloom_hashes;
 		ALLOC_ARRAY(pi->modified_path_bloom_hashes,
 			    pi->modified_path_bloom_hashes_nr);
 
-		compute_modified_path_bloom_hashes_for_path(path, len,
-				graph->num_modified_path_bloom_hashes,
-				pi->modified_path_bloom_hashes);
+		p = path;
+		hashes = pi->modified_path_bloom_hashes +
+			 pi->modified_path_bloom_hashes_nr;
+		for (j = 0; j < path_component_nr; j++) {
+			p = strchrnul(p + 1, '/');
+
+			hashes -= graph->num_modified_path_bloom_hashes;
+			compute_modified_path_bloom_hashes_for_path(path,
+					p - path,
+					graph->num_modified_path_bloom_hashes,
+					hashes);
+		}
 	}
 
 	pathspec->can_use_modified_path_bloom_filters = 1;