Message ID | 20230911223157.446269-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] Not computing changed path filter for root commits | expand |
Jonathan Tan <jonathantanmy@google.com> writes: > This is following the discussion about adding a new changed path filter > version due to the current implementation of murmur3 not matching the > algorithm. [1] > > SZEDER Gábor suggested [2] that we change the revision walk to read > changed path filters also for root commits, but I don't think that's > possible - we have to tie reading changed path filters to when we read > trees, and right now, we don't seem to read trees when evaluating root > commits (rev_compare_tree() in revision.c is in the only code path that > uses changed path filters, and it itself is only called per-parent and > thus not called for root commits). The alternative is to not generate > changed path filters for root commits (or what I did in this patch, > which is to generate an all-1 filter), which seems reasonable to me. I know this is a very silly question, but if the filter is not read for root commits at runtime, does it matter if a filter is created for them beforehand (or not)? They will not be read whether if they exist or not, no? One observation in the thread [2] appears in was: In several of the above test cases test_bloom_filters_used is invoked in a repository with only a root commit, so they don't check that the output is the same with and without Bloom filters. i.e. the check would be ineffective with the current system that we know does not use the filter for a root commit even if it existed. But would it be an improvement to add a filter to a root commit and test with the filter enabled and disabled to compare the results, if we know the filter is not used anyway?
On Mon, Sep 11, 2023 at 03:31:56PM -0700, Jonathan Tan wrote: > SZEDER Gábor suggested [2] that we change the revision walk to read > changed path filters also for root commits, but I don't think that's > possible - we have to tie reading changed path filters to when we read > trees, and right now, we don't seem to read trees when evaluating root > commits (rev_compare_tree() in revision.c is in the only code path that > uses changed path filters, and it itself is only called per-parent and > thus not called for root commits). When encountering a root commit during a pathspec-limited revision walk we call rev_same_tree_as_empty() instead of rev_compare_tree(). All that's missing there is checking the Bloom filter and accounting for false positives.
On Fri, Sep 15, 2023 at 10:29:12PM +0200, SZEDER Gábor wrote: > On Mon, Sep 11, 2023 at 03:31:56PM -0700, Jonathan Tan wrote: > > SZEDER Gábor suggested [2] that we change the revision walk to read > > changed path filters also for root commits, but I don't think that's > > possible - we have to tie reading changed path filters to when we read > > trees, and right now, we don't seem to read trees when evaluating root > > commits (rev_compare_tree() in revision.c is in the only code path that > > uses changed path filters, and it itself is only called per-parent and > > thus not called for root commits). > > When encountering a root commit during a pathspec-limited revision > walk we call rev_same_tree_as_empty() instead of rev_compare_tree(). > All that's missing there is checking the Bloom filter and accounting > for false positives. I think that we'd want something like this, though I would definitely appreciate a second set of eyes since I am not 100% confident in my set of changes here: --- 8< --- diff --git a/revision.c b/revision.c index 2f4c53ea20..1d36df49e2 100644 --- a/revision.c +++ b/revision.c @@ -837,14 +837,24 @@ static int rev_compare_tree(struct rev_info *revs, static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) { struct tree *t1 = repo_get_commit_tree(the_repository, commit); + int bloom_ret = 1; if (!t1) return 0; + if (revs->bloom_keys_nr) { + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); + if (!bloom_ret) + return 1; + } + tree_difference = REV_TREE_SAME; revs->pruning.flags.has_changes = 0; diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning); + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME) + count_bloom_filter_false_positive++; + return tree_difference == REV_TREE_SAME; } diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index fa9d32facf..3a45cb997b 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -162,7 +162,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' ' test_bloom_filters_used_when_some_filters_are_missing () { log_args=$1 - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":9" + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":10" setup "$log_args" && grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && test_cmp log_wo_bloom log_w_bloom --- >8 --- Thanks, Taylor
On Mon, Sep 11, 2023 at 03:31:56PM -0700, Jonathan Tan wrote: > SZEDER Gábor suggested [2] that we change the revision walk to read > changed path filters also for root commits, but I don't think that's > possible - we have to tie reading changed path filters to when we read > trees, and right now, we don't seem to read trees when evaluating root > commits (rev_compare_tree() in revision.c is in the only code path that > uses changed path filters, and it itself is only called per-parent and > thus not called for root commits). The alternative is to not generate > changed path filters for root commits (or what I did in this patch, > which is to generate an all-1 filter), which seems reasonable to me. I think between the two, the all-1's filter is the more sensible choice, since not computing a filter is typically reserved for blowing past the `commitGraph.maxNewFilters` setting. But, I agree with Gábor down-thread that we could instead teach `rev_same_tree_as_empty()` to be aware of Bloom filters, which I think would accomplish our goal of reading Bloom filters at the root commit while not having to tweak their generation. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > diff --git a/revision.c b/revision.c > index 2f4c53ea20..1d36df49e2 100644 > --- a/revision.c > +++ b/revision.c > @@ -837,14 +837,24 @@ static int rev_compare_tree(struct rev_info *revs, > static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) > { > struct tree *t1 = repo_get_commit_tree(the_repository, commit); > + int bloom_ret = 1; > > if (!t1) > return 0; > > + if (revs->bloom_keys_nr) { > + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > + if (!bloom_ret) > + return 1; > + } > + > tree_difference = REV_TREE_SAME; > revs->pruning.flags.has_changes = 0; > diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning); > > + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME) > + count_bloom_filter_false_positive++; > + > return tree_difference == REV_TREE_SAME; > } I'll concentrate on getting this patch in, and will look at (and discuss) the other Bloom filter-related emails later. This looks good, possibly except a code path in try_to_simplify_commit() that calls this rev_same_tree_as_empty() function when rev_compare_tree() between a commit and its parent returns REV_TREE_NEW. So there are 2 issues: How can rev_compare_tree() ever return REV_TREE_NEW? And it doesn't seem right to check Bloom filters in this code path, since rev_same_tree_as_empty() was invoked here while we are enumerating through a commit's parents, which necessarily implies that the commit has parents, but here we're using the Bloom filter as if the commit is known to have no parents. As for the first issue, rev_compare_tree() returns REV_TREE_NEW when the parent's tree is NULL. I'm not sure how this can happen - the tree can be NULL if the parent commit is not parsed, but at this point I think that it has been parsed. And I think every commit has a tree. This goes back all the way to 3a5e860815 (revision: make tree comparison functions take commits rather than trees, 2008-11-03) and even beyond that (I didn't dig further). As for the second issue, we can probably solve this by being defensive in rev_same_tree_as_empty() by only using the Bloom filter when the commit has no parents. Not sure if this is being overly defensive, though. There is also the issue that count_bloom_filter_false_positive is incremented even when no Bloom filters are present, but I think this is fine (it matches the behavior of rev_compare_tree()).
On Mon, Oct 02, 2023 at 03:55:46PM -0700, Jonathan Tan wrote: > Taylor Blau <me@ttaylorr.com> writes: > > diff --git a/revision.c b/revision.c > > index 2f4c53ea20..1d36df49e2 100644 > > --- a/revision.c > > +++ b/revision.c > > @@ -837,14 +837,24 @@ static int rev_compare_tree(struct rev_info *revs, > > static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) > > { > > struct tree *t1 = repo_get_commit_tree(the_repository, commit); > > + int bloom_ret = 1; > > > > if (!t1) > > return 0; > > > > + if (revs->bloom_keys_nr) { > > + bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > > + if (!bloom_ret) > > + return 1; > > + } > > + > > tree_difference = REV_TREE_SAME; > > revs->pruning.flags.has_changes = 0; > > diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning); > > > > + if (bloom_ret == 1 && tree_difference == REV_TREE_SAME) > > + count_bloom_filter_false_positive++; > > + > > return tree_difference == REV_TREE_SAME; > > } > > I'll concentrate on getting this patch in, and will look at (and > discuss) the other Bloom filter-related emails later. Sounds good. I know that I have some pending mail from SZEDER as well, so I'll try and apply any feedback from both of you before sending a reroll so that we can get this all done in one shot. > This looks good, possibly except a code path in try_to_simplify_commit() > that calls this rev_same_tree_as_empty() function when > rev_compare_tree() between a commit and its parent returns REV_TREE_NEW. > So there are 2 issues: How can rev_compare_tree() ever return > REV_TREE_NEW? And it doesn't seem right to check Bloom filters in this > code path, since rev_same_tree_as_empty() was invoked here while we are > enumerating through a commit's parents, which necessarily implies that > the commit has parents, but here we're using the Bloom filter as if the > commit is known to have no parents. > As for the first issue, rev_compare_tree() returns REV_TREE_NEW when the > parent's tree is NULL. I'm not sure how this can happen - the tree can > be NULL if the parent commit is not parsed, but at this point I think > that it has been parsed. And I think every commit has a tree. This goes > back all the way to 3a5e860815 (revision: make tree comparison functions > take commits rather than trees, 2008-11-03) and even beyond that (I > didn't dig further). The more I think about this, the more confused I become ;-). I was able to track this all the way back to Linus's 461cf59f89 (rev-list: stop when the file disappears, 2006-01-18), but am convinced that both of these cases (we have an analogous one for when t2 is NULL and we return REV_TREE_OLD) are dead code. I replaced the "if (!t1) return REV_TREE_NEW;" with "if (!t1) BUG("oops")", and was able to get the whole test suite to pass. So... I am pretty sure that this is dead code, but not sure enough to remove it myself ;-). To your point about seeing the tree as NULL before parsing the parent, I don't think that is the case here, since we parse the parent immediately before calling rev_compare_tree() (and indeed Git will refuse to read commit objects which do not list a tree). > As for the second issue, we can probably solve this by being defensive > in rev_same_tree_as_empty() by only using the Bloom filter when the > commit has no parents. Not sure if this is being overly defensive, > though. I am also unsure whether we are being overly defensive here or not. But I agree that it does feel safer to apply something like: --- 8< --- diff --git a/revision.c b/revision.c index 3d78ea6a9a..21b3085465 100644 --- a/revision.c +++ b/revision.c @@ -834,7 +834,8 @@ static int rev_compare_tree(struct rev_info *revs, return tree_difference; } -static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) +static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit, + int use_bloom_filter) { struct tree *t1 = repo_get_commit_tree(the_repository, commit); int bloom_ret = 1; @@ -842,7 +843,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit) if (!t1) return 0; - if (revs->bloom_keys_nr) { + if (use_bloom_filter && revs->bloom_keys_nr) { bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); if (!bloom_ret) return 1; @@ -892,7 +893,7 @@ static int compact_treesame(struct rev_info *revs, struct commit *commit, unsign if (nth_parent != 0) die("compact_treesame %u", nth_parent); old_same = !!(commit->object.flags & TREESAME); - if (rev_same_tree_as_empty(revs, commit)) + if (rev_same_tree_as_empty(revs, commit, 1)) commit->object.flags |= TREESAME; else commit->object.flags &= ~TREESAME; @@ -988,7 +989,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit) return; if (!commit->parents) { - if (rev_same_tree_as_empty(revs, commit)) + if (rev_same_tree_as_empty(revs, commit, 1)) commit->object.flags |= TREESAME; return; } @@ -1069,7 +1070,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit) case REV_TREE_NEW: if (revs->remove_empty_trees && - rev_same_tree_as_empty(revs, p)) { + rev_same_tree_as_empty(revs, p, 0)) { /* We are adding all the specified * paths from this parent, so the * history beyond this parent is not --- >8 --- on top. > There is also the issue that count_bloom_filter_false_positive is > incremented even when no Bloom filters are present, but I think this is > fine (it matches the behavior of rev_compare_tree()). Yep, agreed. Thanks, Taylor
On Mon, Oct 09, 2023 at 01:20:31PM -0400, Taylor Blau wrote: > On Mon, Oct 02, 2023 at 03:55:46PM -0700, Jonathan Tan wrote: > > As for the second issue, we can probably solve this by being defensive > > in rev_same_tree_as_empty() by only using the Bloom filter when the > > commit has no parents. Not sure if this is being overly defensive, > > though. > > I am also unsure whether we are being overly defensive here or not. But > I agree that it does feel safer to apply something like: Never mind, we are being overly defensive there. The goal here is to avoid using a commit's Bloom filter in cases where we are acting as if a commit is at the root of history, but in fact has parents. This only happens when we return REV_TREE_NEW from a call to `rev_compare_tree(revs, p, commit, nth_parent)`. But we'll only get REV_TREE_NEW back if repo_get_commit_tree(the_repository, p); returns NULL. But when we call rev_same_tree_as_empty(revs, p) in the REV_TREE_NEW case, we return early as follows: struct tree *t1 = repo_get_commit_tree(revs, p); if (!t1) return 0; So we won't even consult the Bloom filter in that case, since t1 is NULL for the same reason as what caused rev_compare_tree() to return REV_TREE_NEW in the first place. I am still dumbfounded by how we would ever get REV_TREE_NEW in the first place, but if we did, I think we would be OK here. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > This only happens when we return REV_TREE_NEW from a call to > `rev_compare_tree(revs, p, commit, nth_parent)`. But we'll only get > REV_TREE_NEW back if > > repo_get_commit_tree(the_repository, p); > > returns NULL. But when we call rev_same_tree_as_empty(revs, p) in the > REV_TREE_NEW case, we return early as follows: > > struct tree *t1 = repo_get_commit_tree(revs, p); > if (!t1) > return 0; > > So we won't even consult the Bloom filter in that case, since t1 is NULL > for the same reason as what caused rev_compare_tree() to return > REV_TREE_NEW in the first place. > > I am still dumbfounded by how we would ever get REV_TREE_NEW in the > first place, but if we did, I think we would be OK here. > > Thanks, > Taylor Ah, good point. Your patch in https://lore.kernel.org/git/ZQnmTXUO94%2FQy8mq@nand.local/ looks good to me, then.
On Mon, Oct 09, 2023 at 01:59:25PM -0700, Jonathan Tan wrote: > Taylor Blau <me@ttaylorr.com> writes: > > This only happens when we return REV_TREE_NEW from a call to > > `rev_compare_tree(revs, p, commit, nth_parent)`. But we'll only get > > REV_TREE_NEW back if > > > > repo_get_commit_tree(the_repository, p); > > > > returns NULL. But when we call rev_same_tree_as_empty(revs, p) in the > > REV_TREE_NEW case, we return early as follows: > > > > struct tree *t1 = repo_get_commit_tree(revs, p); > > if (!t1) > > return 0; > > > > So we won't even consult the Bloom filter in that case, since t1 is NULL > > for the same reason as what caused rev_compare_tree() to return > > REV_TREE_NEW in the first place. > > > > I am still dumbfounded by how we would ever get REV_TREE_NEW in the > > first place, but if we did, I think we would be OK here. > > > > Thanks, > > Taylor > > Ah, good point. Your patch in > https://lore.kernel.org/git/ZQnmTXUO94%2FQy8mq@nand.local/ looks good to > me, then. Oops, I made a mistake in the quoted portion, which is that we could get REV_TREE_NEW if the tree-diff itself only adds files. This is the non-trivial case that we get when t1 is non-NULL, and we end up calling `diff_tree_oid()` which sets the static `tree_difference` variable. So I think adding an nth_parent field (like you originally suggested[^1]) makes sense. Thanks, Taylor [^1]: Thanks for being patient with me ;-).
diff --git a/bloom.c b/bloom.c index aef6b5fea2..b21130b236 100644 --- a/bloom.c +++ b/bloom.c @@ -226,7 +226,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, if (c->parents) diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt); else - diff_tree_oid(NULL, &c->object.oid, "", &diffopt); + goto large; diffcore_std(&diffopt); if (diff_queued_diff.nr <= settings->max_changed_paths) { @@ -292,6 +292,7 @@ struct bloom_filter *get_or_compute_bloom_filter(struct repository *r, } else { for (i = 0; i < diff_queued_diff.nr; i++) diff_free_filepair(diff_queued_diff.queue[i]); +large: init_truncated_large_filter(filter); if (computed) diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh index b567383eb8..02a0b41026 100755 --- a/t/t0095-bloom.sh +++ b/t/t0095-bloom.sh @@ -74,7 +74,7 @@ test_expect_success !SANITIZE_LEAK 'get bloom filters for commit with no changes git commit --allow-empty -m "c0" && cat >expect <<-\EOF && Filter_Length:1 - Filter_Data:00| + Filter_Data:ff| EOF test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual && test_cmp expect actual diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index fa9d32facf..d14fe93fb1 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -283,7 +283,6 @@ test_expect_success 'correctly report changes over limit' ' git commit-graph write --reachable --changed-paths && test_max_changed_paths 11 trace-update && test_filter_computed 2 trace-update && - test_filter_trunc_large 0 trace-update && for path in $(git ls-tree -r --name-only HEAD) do @@ -306,9 +305,7 @@ test_expect_success 'correctly report commits with no changed paths' ' GIT_TRACE2_EVENT="$(pwd)/trace.event" \ git commit-graph write --reachable --changed-paths && test_filter_computed 1 trace.event && - test_filter_not_computed 0 trace.event && - test_filter_trunc_empty 1 trace.event && - test_filter_trunc_large 0 trace.event + test_filter_not_computed 0 trace.event ) ' @@ -363,8 +360,7 @@ test_expect_success '--max-new-filters overrides configuration' ' --max-new-filters=1 && test_filter_computed 1 trace.event && test_filter_not_computed 1 trace.event && - test_filter_trunc_empty 0 trace.event && - test_filter_trunc_large 0 trace.event + test_filter_trunc_empty 0 trace.event ) ' @@ -386,9 +382,7 @@ test_expect_success 'Bloom generation backfills empty commits' ' git commit-graph write --reachable \ --changed-paths --max-new-filters=2 && test_filter_computed 2 trace.event && - test_filter_not_computed 4 trace.event && - test_filter_trunc_empty 2 trace.event && - test_filter_trunc_large 0 trace.event || return 1 + test_filter_not_computed 4 trace.event || return 1 done && # Finally, make sure that once all commits have filters, that