Message ID | 77f1c561e8205c0598b57bf572640d21d64757f8.1580943390.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Changed Paths Bloom Filters | expand |
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Garima Singh <garima.singh@microsoft.com> > > Revision walk will now use Bloom filters for commits to speed up revision > walks for a particular path (for computing history for that path), if they > are present in the commit-graph file. Why do we need to turn this feature off for --walk-reflog? Anyway, in my opinion this restriction should be stated explicitly in the commit message, if kept. > > We load the Bloom filters during the prepare_revision_walk step, but only > when dealing with a single pathspec. I would add the qualifier "currently" here, i.e. s/only/currently only/ to make it clear that it is the limitation of current implementation, and not the inherent implementation of the technique. > While comparing trees in > rev_compare_trees(), if the Bloom filter says that the file is not different > between the two trees, we don't need to compute the expensive diff. This is > where we get our performance gains. The other response of the Bloom filter > is `maybe`, in which case we fall back to the full diff calculation to > determine if the path was changed in the commit. All right, looks good. Very minor nitpick: s/`maybe`/'maybe'/ (in my opinion). > > Performance Gains: > We tested the performance of `git log -- <path>` on the git repo, the linux > and some internal large repos, with a variety of paths of varying depths. Another repository that we could test Bloom filters feature would be, as I have written before, Android AOSP frameworks core repository https://android.googlesource.com/platform/frameworks/base/ because being written in Java it has deep path hierarchy, and it also has large number of commits. > > On the git and linux repos: > - we observed a 2x to 5x speed up. It would be nice to have at least one specific and repeatable example: in given repository, starting from given commit or tag, following the history of given path, what are timing results for doing some specific command with and without Bloom filters computed and enabled. One might also want to know the cost of this speedup: how much disk space does it take (i.e. how large is the commit-graph file with and without Bloom filters chunks), and how long does it take to compute (i.e. how much time writing commit-graph takes with and without using --changed-paths options). > > On a large internal repo with files seated 6-10 levels deep in the tree: > - we observed 10x to 20x speed ups, with some paths going up to 28 times > faster. This is good to know. In the future we might want to have procedurally generated synthetic repository, where we would be able to control number of files, depth of filesystem hierarchy, average number of changes per commit, etc. to be used for performance testing. (Just wishful thinking) > > Helped-by: Derrick Stolee <dstolee@microsoft.com > Helped-by: SZEDER Gábor <szeder.dev@gmail.com> > Helped-by: Jonathan Tan <jonathantanmy@google.com> > Signed-off-by: Garima Singh <garima.singh@microsoft.com> > --- > revision.c | 124 +++++++++++++++++++++++++++++++- > revision.h | 11 +++ > t/helper/test-read-graph.c | 4 ++ > t/t4216-log-bloom.sh | 140 +++++++++++++++++++++++++++++++++++++ > 4 files changed, 277 insertions(+), 2 deletions(-) > create mode 100755 t/t4216-log-bloom.sh > > diff --git a/revision.c b/revision.c > index 8136929e23..d1622afa17 100644 > --- a/revision.c > +++ b/revision.c > @@ -29,6 +29,8 @@ > #include "prio-queue.h" > #include "hashmap.h" > #include "utf8.h" > +#include "bloom.h" > +#include "json-writer.h" > > volatile show_early_output_fn_t show_early_output; > > @@ -624,11 +626,114 @@ static void file_change(struct diff_options *options, > options->flags.has_changes = 1; > } > > +static int bloom_filter_atexit_registered; > +static unsigned int count_bloom_filter_maybe; > +static unsigned int count_bloom_filter_definitely_not; > +static unsigned int count_bloom_filter_false_positive; > +static unsigned int count_bloom_filter_not_present; > +static unsigned int count_bloom_filter_length_zero; > + > +static void trace2_bloom_filter_statistics_atexit(void) > +{ > + struct json_writer jw = JSON_WRITER_INIT; > + > + jw_object_begin(&jw, 0); > + jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present); > + jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero); > + jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe); > + jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not); > + jw_end(&jw); > + > + trace2_data_json("bloom", the_repository, "statistics", &jw); > + > + jw_release(&jw); > +} I thought that it would be better to put this part together with tests that absolutely require this functionality in a separate subsequent patch, but now I am not so sure. It is nice to have all or almost all tests created in a single patch. Looks good to me, but I don't know much about trace2 API, so take it with a pinch of salt. > + > +static void prepare_to_use_bloom_filter(struct rev_info *revs) > +{ > + struct pathspec_item *pi; > + char *path_alloc = NULL; > + const char *path; > + int last_index; > + int len; > + > + if (!revs->commits) > + return; I see that we need this because in next command we dereference revs->commits to get revs->commits->item. If I understand it correctly empty pending list may happen with "--all" or "--glob" options, but somebody with more experience in this area of code is needed to state for sure. Should we test `git log --all -- <path>`? > + > + repo_parse_commit(revs->repo, revs->commits->item); Are we calling this function for its side-effects? Wouldn't using prepare_commit_graph(revs->repo) here be a better solution? > + > + if (!revs->repo->objects->commit_graph) > + return; Looks good to me. If there is no commit graph, then there are no Bloom filters to consult. > + > + revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings; Hmmm... is that why bloom_filter_settings is a pointer to struct, and not struct itself? > + if (!revs->bloom_filter_settings) > + return; Looks good to me. If there is no Bloomm filter in the commit-graph file, then there are no Bloom filters to consult. > + > + pi = &revs->pruning.pathspec.items[0]; > + last_index = pi->len - 1; > + It might be a good idea to add a comment explaining what is happening here, for example: + /* remove single trailing slash from path, if needed */ > + if (pi->match[last_index] == '/') { > + path_alloc = xstrdup(pi->match); > + path_alloc[last_index] = '\0'; > + path = path_alloc; > + } else > + path = pi->match; > + > + len = strlen(path); We can avoid computing strlen(path) here, because in first branch of this conditional we have len = last_index, in the second branch we have len = pi->len. > + > + revs->bloom_key = xmalloc(sizeof(struct bloom_key)); > + fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); All right, this is the meat of this function: creating bloom_key for a path. Looks good to me. > + > + if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > + atexit(trace2_bloom_filter_statistics_atexit); > + bloom_filter_atexit_registered = 1; > + } OK, here we register trace2 Bloom filter statistics handler, but only once, and only when needed. > + > + free(path_alloc); OK, path_alloc is either xstrdup-ed string, or NULL, and is no longer needed (after possibly being used to create bloom_key). > +} > + > +static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > + struct commit *commit) > +{ > + struct bloom_filter *filter; > + int result; > + > + if (!revs->repo->objects->commit_graph) > + return -1; > + > + if (commit->generation == GENERATION_NUMBER_INFINITY) > + return -1; Idle thought: would it be useful to gather for trace2 statistics also number of commits encountered that were outside commit-graph? > + > + filter = get_bloom_filter(revs->repo, commit, 0); > + > + if (!filter) { > + count_bloom_filter_not_present++; > + return -1; > + } > + > + if (!filter->len) { > + count_bloom_filter_length_zero++; > + return -1; > + } > + > + result = bloom_filter_contains(filter, > + revs->bloom_key, > + revs->bloom_filter_settings); > + > + if (result) > + count_bloom_filter_maybe++; > + else > + count_bloom_filter_definitely_not++; > + > + return result; > +} The whole check_maybe_different_in_bloom_filter() looks good to me, thanks to designing and building a good API. > + > static int rev_compare_tree(struct rev_info *revs, > - struct commit *parent, struct commit *commit) > + struct commit *parent, struct commit *commit, int nth_parent) > { > struct tree *t1 = get_commit_tree(parent); > struct tree *t2 = get_commit_tree(commit); > + int bloom_ret = 1; I don't understand why it is initialized to 1, and not to 0. > > if (!t1) > return REV_TREE_NEW; > @@ -653,11 +758,23 @@ static int rev_compare_tree(struct rev_info *revs, > return REV_TREE_SAME; > } > > + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) { Shouldn't we check upfront here that revs->bloom_key is not NULL? I don't think we check this down the callchain... Or even better replace the first two checks with it, as revs->bloom_key is set only if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info), see addition to prepare_revision_walk() below. Of course the !nth_parent check needs to be kept, as this changes during the revision walk (it is a limitation of current version of Bloom filter in that only changes with respect to first parent are stored in filter). > + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > + > + if (bloom_ret == 0) > + return REV_TREE_SAME; > + } All right, if we have single pathspec, and we don't walk reflog (?), and we are interested in first parent, then we query the Bloom filter. The Bloom filter can return 'no' or 'maybe'; if it returns 'no' then we can short-circuit and avoid computing the tree diff. > + > tree_difference = REV_TREE_SAME; > revs->pruning.flags.has_changes = 0; > if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "", > &revs->pruning) < 0) > return REV_TREE_DIFFERENT; > + > + if (!nth_parent) Shouldn't this condition be exactly the same as for running check_maybe_different_in_bloom_filter()? Otherwise due to initializing bloom_ret to 1 we would get wrong statistics, isn't it? > + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME) > + count_bloom_filter_false_positive++; > + All right, looks good. > return tree_difference; > } > > @@ -855,7 +972,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit) > die("cannot simplify commit %s (because of %s)", > oid_to_hex(&commit->object.oid), > oid_to_hex(&p->object.oid)); > - switch (rev_compare_tree(revs, p, commit)) { > + switch (rev_compare_tree(revs, p, commit, nth_parent)) { > case REV_TREE_SAME: > if (!revs->simplify_history || !relevant_commit(p)) { > /* Even if a merge with an uninteresting OK, we are just dding new parameter, with the information needed to decide whether Bloom filters can be used or not. > @@ -3362,6 +3479,8 @@ int prepare_revision_walk(struct rev_info *revs) > FOR_EACH_OBJECT_PROMISOR_ONLY); > } > > + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info) > + prepare_to_use_bloom_filter(revs); Well, the limitation that the technique _currently_ works only with a single pathspec is stated explicitly, but the fact that it is turned off for some reason for --walk-reflog is not. Otherwise, looks good to me. > if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED) > commit_list_sort_by_date(&revs->commits); > if (revs->no_walk) > @@ -3379,6 +3498,7 @@ int prepare_revision_walk(struct rev_info *revs) > simplify_merges(revs); > if (revs->children.name) > set_children(revs); > + > return 0; > } Unrelated coding style fixup, but we are doing changes in the neighborhood. All right, I can agree to that. > > diff --git a/revision.h b/revision.h > index 475f048fb6..7c026fe41f 100644 > --- a/revision.h > +++ b/revision.h > @@ -56,6 +56,8 @@ struct repository; > struct rev_info; > struct string_list; > struct saved_parents; > +struct bloom_key; > +struct bloom_filter_settings; > define_shared_commit_slab(revision_sources, char *); > > struct rev_cmdline_info { > @@ -291,6 +293,15 @@ struct rev_info { > struct revision_sources *sources; > > struct topo_walk_info *topo_walk_info; > + > + /* Commit graph bloom filter fields */ > + /* The bloom filter key for the pathspec */ > + struct bloom_key *bloom_key; > + /* > + * The bloom filter settings used to generate the key. > + * This is loaded from the commit-graph being used. > + */ > + struct bloom_filter_settings *bloom_filter_settings; It is nice having those explanatory comments. Sidenote: if I understand it correctly, revs->bloom_key is allocated but never free()d. On the other hand revs->bloom_filter_settings is a weak reference / is set to the value of other pointer, which is allocated and free()d together with commit_graph struct. > }; > > int ref_excluded(struct string_list *, const char *path); > diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c > index d2884efe0a..aff597c7a3 100644 > --- a/t/helper/test-read-graph.c > +++ b/t/helper/test-read-graph.c > @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv) > printf(" commit_metadata"); > if (graph->chunk_extra_edges) > printf(" extra_edges"); > + if (graph->chunk_bloom_indexes) > + printf(" bloom_indexes"); > + if (graph->chunk_bloom_data) > + printf(" bloom_data"); > printf("\n"); This chunk could be moved to the commit adding --changed-paths option... on the other hand if all tests are to be added by this patch, it can be left as is. > > UNLEAK(graph); > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > new file mode 100755 > index 0000000000..19eca1864b > --- /dev/null > +++ b/t/t4216-log-bloom.sh [...] I'll leave reviewing tests of this feature for the next email. Best regards,
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: This is a second part of my response, focusing solely on tests of the Bloom filters feature. [...] > diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c > index d2884efe0a..aff597c7a3 100644 > --- a/t/helper/test-read-graph.c > +++ b/t/helper/test-read-graph.c > @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv) > printf(" commit_metadata"); > if (graph->chunk_extra_edges) > printf(" extra_edges"); > + if (graph->chunk_bloom_indexes) > + printf(" bloom_indexes"); > + if (graph->chunk_bloom_data) > + printf(" bloom_data"); > printf("\n"); > All right, that is simple extension of 'test-helper read-graph'. > UNLEAK(graph); > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > new file mode 100755 > index 0000000000..19eca1864b > --- /dev/null > +++ b/t/t4216-log-bloom.sh > @@ -0,0 +1,140 @@ > +#!/bin/sh > + > +test_description='git log for a path with bloom filters' > +. ./test-lib.sh > + > +test_expect_success 'setup test - repo, commits, commit graph, log outputs' ' > + git init && > + mkdir A A/B A/B/C && > + test_commit c1 A/file1 && > + test_commit c2 A/B/file2 && > + test_commit c3 A/B/C/file3 && > + test_commit c4 A/file1 && > + test_commit c5 A/B/file2 && > + test_commit c6 A/B/C/file3 && > + test_commit c7 A/file1 && > + test_commit c8 A/B/file2 && > + test_commit c9 A/B/C/file3 && > + git checkout -b side HEAD~4 && > + test_commit side-1 file4 && > + git checkout master && > + git merge side && > + test_commit c10 file5 && Unfortunately this might be not enough for Git's heuristic similarity based rename detection, as it creates 'file5' file with content 'c10'. [Checking something]. Well, actually it looks like it works, even with not much contents. I thought you would need to use something like + test_write_lines 1 2 3 4 5 6 7 8 9 >file5 && + git add file5 && + git commit -m c10 && But it turns out that it is, s far as I have checked, not necessary. > + mv file5 file5_renamed && > + git add file5_renamed && > + git commit -m "rename" && > + git commit-graph write --reachable --changed-paths > +' Hmmm... there is no test for file that was present in history but got deleted. Might be important (because of pre-image vs post-image name issues). Very minor issue: following the style used in t/test-lib-functions.sh and the style guide in CodingGuidelines, it should be +graph_read_expect () { and the same for the following functions. https://github.com/git/git/blob/master/Documentation/CodingGuidelines#L144 - We prefer a space between the function name and the parentheses, and no space inside the parentheses. The opening "{" should also be on the same line. (incorrect) my_function(){ ... (correct) my_function () { ... > +graph_read_expect() { > + OPTIONAL="" > + NUM_CHUNKS=5 > + cat >expect <<- EOF > + header: 43475048 1 1 $NUM_CHUNKS 0 > + num_commits: $1 > + chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data Either OPTIONAL remains unused, and should be removed, or we leave it for possible future extension, and we write + chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data$OPTIONAL like in t/t5318-commit-graph.sh. > + EOF > + test-tool read-graph >output && > + test_cmp expect output Why 'output', and not 'actual'? > +} > + > +test_expect_success 'commit-graph write wrote out the bloom chunks' ' > + graph_read_expect 13 > +' All right, that is sanity-checking 'git commit-graph write --changed-paths'. > + > +setup() { I wonder if we can come up with a better name... setup_log(), setup_log_bloom(), log_compare()? > + rm output This shouldn't be here, in this function. Or perhaps it shouldn't even be used at all; having 'output' doesn't hinder anything. > + rm "$TRASH_DIRECTORY/trace.perf" All right, this cleanup is needed. > + git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom > + GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom All right, we prepare for comparing version without Bloom filters (reference) and with Bloom filters, and for checking if Bloom filters were used. > +} This setup() function above is missing the && chain. It should then in my opinion read: +setup () { + rm "$TRASH_DIRECTORY/trace.perf" && + git -c core.commitGraph=false log --format="%s" $1 >log_wo_bloom && + GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" \ + git -c core.commitGraph=true log --format="%s" $1 >log_w_bloom +} Also, perhaps we should add at the beginning of this test file, outside anu test_expect_success block, the following (see t/*trace2*.sh files): # Turn off any inherited trace2 settings for this test. sane_unset GIT_TRACE2 GIT_TRACE2_PERF GIT_TRACE2_EVENT sane_unset GIT_TRACE2_PERF_BRIEF sane_unset GIT_TRACE2_CONFIG_PARAMS > + > +test_bloom_filters_used() { > + log_args=$1 > + bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\"" > + setup "$log_args" Missing && chain. > + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom Why no line break after &&? > +} Ugh, examining JSON output with regexp is in my opinion quite fragile. Though I am not sure if requiring Perl and JSON module installed like t/t0212-trace2-event.sh is any better. > + > +test_bloom_filters_not_used() { > + log_args=$1 > + setup "$log_args" > + !(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom We should also check that "$TRASH_DIRECTORY/trace.perf" file exist with test_path_is_file. Also, testing that something was not found is a bit fragile, but I don't have any better idea on how to do this test without negating grep exit value. > +} > + > +for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed NOTE: file5 is missing from this list! I suspect that adding it might cause the test to fail. > +do > + for option in "" \ > + "--full-history" \ > + "--full-history --simplify-merges" \ > + "--simplify-merges" \ > + "--simplify-by-decoration" \ > + "--follow" \ > + "--first-parent" \ > + "--topo-order" \ > + "--date-order" \ > + "--author-date-order" \ > + "--ancestry-path side..master" > + do > + test_expect_success "git log option: $option for path: $path" ' > + test_bloom_filters_used "$option -- $path" All right, this tests that Bloom filters were used *and* that the command run with Bloom filters and without Bloom filters (without commit-graph) produces the same output. > + ' > + done > +done > + > +test_expect_success 'git log -- folder works with and without the trailing slash' ' > + test_bloom_filters_used "-- A" && > + test_bloom_filters_used "-- A/" > +' All right. I wonder if we should test for insane test case, namely pathname to an ordinary file that ends with slash: + test_bloom_filters_used "-- file4" && + test_bloom_filters_used "-- file4/" The latter should produce no output, being treated as not existing file. > + > +test_expect_success 'git log for path that does not exist. ' ' > + test_bloom_filters_used "-- path_does_not_exist" > +' All right. > + > +test_expect_success 'git log with --walk-reflogs does not use bloom filters' ' > + test_bloom_filters_not_used "--walk-reflogs -- A" > +' All right, but why is it so? > + > +test_expect_success 'git log -- multiple path specs does not use bloom filters' ' > + test_bloom_filters_not_used "-- file4 A/file1" > +' All right, though this is limitation of current code, not limitation of technique, so _maybe_ it would be better to test_expect_failure that for multiple pathspecs bloom_filters_used... > + > +test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' ' > + test_bloom_filters_used "-- *4" && > + test_bloom_filters_used "-- *renamed" > +' > + > +test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' ' > + test_bloom_filters_not_used "-- *" && > + test_bloom_filters_not_used "-- file*" > +' Same here. > + > +test_expect_success 'setup - add commit-graph to the chain without bloom filters' ' > + test_commit c14 A/anotherFile2 && > + test_commit c15 A/B/anotherFile2 && > + test_commit c16 A/B/C/anotherFile2 && > + GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split && > + test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain > +' > + > +test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' ' > + test_bloom_filters_not_used "-- A/B" > +' All right... though I would try to come up with a shorter test name :-) > + > +test_expect_success 'setup - add commit-graph to the chain with bloom filters' ' > + test_commit c17 A/anotherFile3 && > + git commit-graph write --reachable --changed-paths --split && > + test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain > +' > + > +test_bloom_filters_used_when_some_filters_are_missing() { > + log_args=$1 > + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6" Perhaps a better solution would be to use (enhanced) 'test-tool bloom' to check which commits have Bloom filters and which do not. > + setup "$log_args" > + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom > +} Why broken && chain between setup() and the resr, and why && is not followed by line break (as before)? > + > +test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' ' > + test_bloom_filters_used_when_some_filters_are_missing "-- A/B" > +' > + > +test_done All right... though the description of this test is a bit long. Thank you for your work on this series. Best,
diff --git a/revision.c b/revision.c index 8136929e23..d1622afa17 100644 --- a/revision.c +++ b/revision.c @@ -29,6 +29,8 @@ #include "prio-queue.h" #include "hashmap.h" #include "utf8.h" +#include "bloom.h" +#include "json-writer.h" volatile show_early_output_fn_t show_early_output; @@ -624,11 +626,114 @@ static void file_change(struct diff_options *options, options->flags.has_changes = 1; } +static int bloom_filter_atexit_registered; +static unsigned int count_bloom_filter_maybe; +static unsigned int count_bloom_filter_definitely_not; +static unsigned int count_bloom_filter_false_positive; +static unsigned int count_bloom_filter_not_present; +static unsigned int count_bloom_filter_length_zero; + +static void trace2_bloom_filter_statistics_atexit(void) +{ + struct json_writer jw = JSON_WRITER_INIT; + + jw_object_begin(&jw, 0); + jw_object_intmax(&jw, "filter_not_present", count_bloom_filter_not_present); + jw_object_intmax(&jw, "zero_length_filter", count_bloom_filter_length_zero); + jw_object_intmax(&jw, "maybe", count_bloom_filter_maybe); + jw_object_intmax(&jw, "definitely_not", count_bloom_filter_definitely_not); + jw_end(&jw); + + trace2_data_json("bloom", the_repository, "statistics", &jw); + + jw_release(&jw); +} + +static void prepare_to_use_bloom_filter(struct rev_info *revs) +{ + struct pathspec_item *pi; + char *path_alloc = NULL; + const char *path; + int last_index; + int len; + + if (!revs->commits) + return; + + repo_parse_commit(revs->repo, revs->commits->item); + + if (!revs->repo->objects->commit_graph) + return; + + revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings; + if (!revs->bloom_filter_settings) + return; + + pi = &revs->pruning.pathspec.items[0]; + last_index = pi->len - 1; + + if (pi->match[last_index] == '/') { + path_alloc = xstrdup(pi->match); + path_alloc[last_index] = '\0'; + path = path_alloc; + } else + path = pi->match; + + len = strlen(path); + + revs->bloom_key = xmalloc(sizeof(struct bloom_key)); + fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); + + if (trace2_is_enabled() && !bloom_filter_atexit_registered) { + atexit(trace2_bloom_filter_statistics_atexit); + bloom_filter_atexit_registered = 1; + } + + free(path_alloc); +} + +static int check_maybe_different_in_bloom_filter(struct rev_info *revs, + struct commit *commit) +{ + struct bloom_filter *filter; + int result; + + if (!revs->repo->objects->commit_graph) + return -1; + + if (commit->generation == GENERATION_NUMBER_INFINITY) + return -1; + + filter = get_bloom_filter(revs->repo, commit, 0); + + if (!filter) { + count_bloom_filter_not_present++; + return -1; + } + + if (!filter->len) { + count_bloom_filter_length_zero++; + return -1; + } + + result = bloom_filter_contains(filter, + revs->bloom_key, + revs->bloom_filter_settings); + + if (result) + count_bloom_filter_maybe++; + else + count_bloom_filter_definitely_not++; + + return result; +} + static int rev_compare_tree(struct rev_info *revs, - struct commit *parent, struct commit *commit) + struct commit *parent, struct commit *commit, int nth_parent) { struct tree *t1 = get_commit_tree(parent); struct tree *t2 = get_commit_tree(commit); + int bloom_ret = 1; if (!t1) return REV_TREE_NEW; @@ -653,11 +758,23 @@ static int rev_compare_tree(struct rev_info *revs, return REV_TREE_SAME; } + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info && !nth_parent) { + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); + + if (bloom_ret == 0) + return REV_TREE_SAME; + } + tree_difference = REV_TREE_SAME; revs->pruning.flags.has_changes = 0; if (diff_tree_oid(&t1->object.oid, &t2->object.oid, "", &revs->pruning) < 0) return REV_TREE_DIFFERENT; + + if (!nth_parent) + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME) + count_bloom_filter_false_positive++; + return tree_difference; } @@ -855,7 +972,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit) die("cannot simplify commit %s (because of %s)", oid_to_hex(&commit->object.oid), oid_to_hex(&p->object.oid)); - switch (rev_compare_tree(revs, p, commit)) { + switch (rev_compare_tree(revs, p, commit, nth_parent)) { case REV_TREE_SAME: if (!revs->simplify_history || !relevant_commit(p)) { /* Even if a merge with an uninteresting @@ -3362,6 +3479,8 @@ int prepare_revision_walk(struct rev_info *revs) FOR_EACH_OBJECT_PROMISOR_ONLY); } + if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info) + prepare_to_use_bloom_filter(revs); if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED) commit_list_sort_by_date(&revs->commits); if (revs->no_walk) @@ -3379,6 +3498,7 @@ int prepare_revision_walk(struct rev_info *revs) simplify_merges(revs); if (revs->children.name) set_children(revs); + return 0; } diff --git a/revision.h b/revision.h index 475f048fb6..7c026fe41f 100644 --- a/revision.h +++ b/revision.h @@ -56,6 +56,8 @@ struct repository; struct rev_info; struct string_list; struct saved_parents; +struct bloom_key; +struct bloom_filter_settings; define_shared_commit_slab(revision_sources, char *); struct rev_cmdline_info { @@ -291,6 +293,15 @@ struct rev_info { struct revision_sources *sources; struct topo_walk_info *topo_walk_info; + + /* Commit graph bloom filter fields */ + /* The bloom filter key for the pathspec */ + struct bloom_key *bloom_key; + /* + * The bloom filter settings used to generate the key. + * This is loaded from the commit-graph being used. + */ + struct bloom_filter_settings *bloom_filter_settings; }; int ref_excluded(struct string_list *, const char *path); diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c index d2884efe0a..aff597c7a3 100644 --- a/t/helper/test-read-graph.c +++ b/t/helper/test-read-graph.c @@ -45,6 +45,10 @@ int cmd__read_graph(int argc, const char **argv) printf(" commit_metadata"); if (graph->chunk_extra_edges) printf(" extra_edges"); + if (graph->chunk_bloom_indexes) + printf(" bloom_indexes"); + if (graph->chunk_bloom_data) + printf(" bloom_data"); printf("\n"); UNLEAK(graph); diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh new file mode 100755 index 0000000000..19eca1864b --- /dev/null +++ b/t/t4216-log-bloom.sh @@ -0,0 +1,140 @@ +#!/bin/sh + +test_description='git log for a path with bloom filters' +. ./test-lib.sh + +test_expect_success 'setup test - repo, commits, commit graph, log outputs' ' + git init && + mkdir A A/B A/B/C && + test_commit c1 A/file1 && + test_commit c2 A/B/file2 && + test_commit c3 A/B/C/file3 && + test_commit c4 A/file1 && + test_commit c5 A/B/file2 && + test_commit c6 A/B/C/file3 && + test_commit c7 A/file1 && + test_commit c8 A/B/file2 && + test_commit c9 A/B/C/file3 && + git checkout -b side HEAD~4 && + test_commit side-1 file4 && + git checkout master && + git merge side && + test_commit c10 file5 && + mv file5 file5_renamed && + git add file5_renamed && + git commit -m "rename" && + git commit-graph write --reachable --changed-paths +' +graph_read_expect() { + OPTIONAL="" + NUM_CHUNKS=5 + cat >expect <<- EOF + header: 43475048 1 1 $NUM_CHUNKS 0 + num_commits: $1 + chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data + EOF + test-tool read-graph >output && + test_cmp expect output +} + +test_expect_success 'commit-graph write wrote out the bloom chunks' ' + graph_read_expect 13 +' + +setup() { + rm output + rm "$TRASH_DIRECTORY/trace.perf" + git -c core.commitGraph=false log --pretty="format:%s" $1 >log_wo_bloom + GIT_TRACE2_PERF="$TRASH_DIRECTORY/trace.perf" git -c core.commitGraph=true log --pretty="format:%s" $1 >log_w_bloom +} + +test_bloom_filters_used() { + log_args=$1 + bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"zero_length_filter\":0,\"maybe\"" + setup "$log_args" + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom +} + +test_bloom_filters_not_used() { + log_args=$1 + setup "$log_args" + !(grep -q "statistics:{\"filter_not_present\":" "$TRASH_DIRECTORY/trace.perf") && test_cmp log_wo_bloom log_w_bloom +} + +for path in A A/B A/B/C A/file1 A/B/file2 A/B/C/file3 file4 file5_renamed +do + for option in "" \ + "--full-history" \ + "--full-history --simplify-merges" \ + "--simplify-merges" \ + "--simplify-by-decoration" \ + "--follow" \ + "--first-parent" \ + "--topo-order" \ + "--date-order" \ + "--author-date-order" \ + "--ancestry-path side..master" + do + test_expect_success "git log option: $option for path: $path" ' + test_bloom_filters_used "$option -- $path" + ' + done +done + +test_expect_success 'git log -- folder works with and without the trailing slash' ' + test_bloom_filters_used "-- A" && + test_bloom_filters_used "-- A/" +' + +test_expect_success 'git log for path that does not exist. ' ' + test_bloom_filters_used "-- path_does_not_exist" +' + +test_expect_success 'git log with --walk-reflogs does not use bloom filters' ' + test_bloom_filters_not_used "--walk-reflogs -- A" +' + +test_expect_success 'git log -- multiple path specs does not use bloom filters' ' + test_bloom_filters_not_used "-- file4 A/file1" +' + +test_expect_success 'git log with wildcard that resolves to a single path uses bloom filters' ' + test_bloom_filters_used "-- *4" && + test_bloom_filters_used "-- *renamed" +' + +test_expect_success 'git log with wildcard that resolves to a multiple paths does not uses bloom filters' ' + test_bloom_filters_not_used "-- *" && + test_bloom_filters_not_used "-- file*" +' + +test_expect_success 'setup - add commit-graph to the chain without bloom filters' ' + test_commit c14 A/anotherFile2 && + test_commit c15 A/B/anotherFile2 && + test_commit c16 A/B/C/anotherFile2 && + GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0 git commit-graph write --reachable --split && + test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain +' + +test_expect_success 'git log does not use bloom filters if the latest graph does not have bloom filters.' ' + test_bloom_filters_not_used "-- A/B" +' + +test_expect_success 'setup - add commit-graph to the chain with bloom filters' ' + test_commit c17 A/anotherFile3 && + git commit-graph write --reachable --changed-paths --split && + test_line_count = 3 .git/objects/info/commit-graphs/commit-graph-chain +' + +test_bloom_filters_used_when_some_filters_are_missing() { + log_args=$1 + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":6" + setup "$log_args" + grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom +} + +test_expect_success 'git log uses bloom filters if they exist in the latest but not all commit graphs in the chain.' ' + test_bloom_filters_used_when_some_filters_are_missing "-- A/B" +' + +test_done