diff mbox series

[RFC] Not computing changed path filter for root commits

Message ID 20230911223157.446269-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series [RFC] Not computing changed path filter for root commits | expand

Commit Message

Jonathan Tan Sept. 11, 2023, 10:31 p.m. UTC
This is following the discussion about adding a new changed path filter
version due to the current implementation of murmur3 not matching the
algorithm. [1]

SZEDER Gábor suggested [2] that we change the revision walk to read
changed path filters also for root commits, but I don't think that's
possible - we have to tie reading changed path filters to when we read
trees, and right now, we don't seem to read trees when evaluating root
commits (rev_compare_tree() in revision.c is in the only code path that
uses changed path filters, and it itself is only called per-parent and
thus not called for root commits). The alternative is to not generate
changed path filters for root commits (or what I did in this patch,
which is to generate an all-1 filter), which seems reasonable to me.

Is this a good idea? If yes, there is the follow-up question of how to
report it in traces. I don't know off-hand if it's better to reuse the
"large" statistic, since what we output is the same (all 1s), or if we
should make a new one. Presumably what we need to weigh is the clarity
versus the migration costs, but I don't know how to weigh it.

If people are generally in agreement, I can send an updated patch that
does not goto into an "else" block :) and also with updated tests (t0095
needs to check a non-root commit, and t4216 needs to be updated with
whatever statistics we decide to use).

[1] https://lore.kernel.org/git/20230826150610.GA1928@szeder.dev/
[2] https://lore.kernel.org/git/20230830200218.GA5147@szeder.dev/
---
 bloom.c              |  3 ++-
 t/t0095-bloom.sh     |  2 +-
 t/t4216-log-bloom.sh | 12 +++---------
 3 files changed, 6 insertions(+), 11 deletions(-)

Comments

Junio C Hamano Sept. 15, 2023, 6:25 p.m. UTC | #1
Jonathan Tan <jonathantanmy@google.com> writes:

> This is following the discussion about adding a new changed path filter
> version due to the current implementation of murmur3 not matching the
> algorithm. [1]
>
> SZEDER Gábor suggested [2] that we change the revision walk to read
> changed path filters also for root commits, but I don't think that's
> possible - we have to tie reading changed path filters to when we read
> trees, and right now, we don't seem to read trees when evaluating root
> commits (rev_compare_tree() in revision.c is in the only code path that
> uses changed path filters, and it itself is only called per-parent and
> thus not called for root commits). The alternative is to not generate
> changed path filters for root commits (or what I did in this patch,
> which is to generate an all-1 filter), which seems reasonable to me.

I know this is a very silly question, but if the filter is not read
for root commits at runtime, does it matter if a filter is created
for them beforehand (or not)?  They will not be read whether if they
exist or not, no?  One observation in the thread [2] appears in was:

    In several of the above test cases test_bloom_filters_used is invoked
    in a repository with only a root commit, so they don't check that
    the output is the same with and without Bloom filters.

i.e. the check would be ineffective with the current system that we
know does not use the filter for a root commit even if it existed.
But would it be an improvement to add a filter to a root commit and
test with the filter enabled and disabled to compare the results, if
we know the filter is not used anyway?
SZEDER Gábor Sept. 15, 2023, 8:29 p.m. UTC | #2
On Mon, Sep 11, 2023 at 03:31:56PM -0700, Jonathan Tan wrote:
> SZEDER Gábor suggested [2] that we change the revision walk to read
> changed path filters also for root commits, but I don't think that's
> possible - we have to tie reading changed path filters to when we read
> trees, and right now, we don't seem to read trees when evaluating root
> commits (rev_compare_tree() in revision.c is in the only code path that
> uses changed path filters, and it itself is only called per-parent and
> thus not called for root commits).

When encountering a root commit during a pathspec-limited revision
walk we call rev_same_tree_as_empty() instead of rev_compare_tree().
All that's missing there is checking the Bloom filter and accounting
for false positives.
Taylor Blau Sept. 19, 2023, 6:19 p.m. UTC | #3
On Fri, Sep 15, 2023 at 10:29:12PM +0200, SZEDER Gábor wrote:
> On Mon, Sep 11, 2023 at 03:31:56PM -0700, Jonathan Tan wrote:
> > SZEDER Gábor suggested [2] that we change the revision walk to read
> > changed path filters also for root commits, but I don't think that's
> > possible - we have to tie reading changed path filters to when we read
> > trees, and right now, we don't seem to read trees when evaluating root
> > commits (rev_compare_tree() in revision.c is in the only code path that
> > uses changed path filters, and it itself is only called per-parent and
> > thus not called for root commits).
>
> When encountering a root commit during a pathspec-limited revision
> walk we call rev_same_tree_as_empty() instead of rev_compare_tree().
> All that's missing there is checking the Bloom filter and accounting
> for false positives.

I think that we'd want something like this, though I would definitely
appreciate a second set of eyes since I am not 100% confident in my
set of changes here:

--- 8< ---
diff --git a/revision.c b/revision.c
index 2f4c53ea20..1d36df49e2 100644
--- a/revision.c
+++ b/revision.c
@@ -837,14 +837,24 @@ static int rev_compare_tree(struct rev_info *revs,
 static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 {
 	struct tree *t1 = repo_get_commit_tree(the_repository, commit);
+	int bloom_ret = 1;

 	if (!t1)
 		return 0;

+	if (revs->bloom_keys_nr) {
+		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
+		if (!bloom_ret)
+			return 1;
+	}
+
 	tree_difference = REV_TREE_SAME;
 	revs->pruning.flags.has_changes = 0;
 	diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);

+	if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
+		count_bloom_filter_false_positive++;
+
 	return tree_difference == REV_TREE_SAME;
 }

diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..3a45cb997b 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -162,7 +162,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '

 test_bloom_filters_used_when_some_filters_are_missing () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":9"
+	bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"maybe\":6,\"definitely_not\":10"
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom
--- >8 ---

Thanks,
Taylor
Taylor Blau Sept. 19, 2023, 6:21 p.m. UTC | #4
On Mon, Sep 11, 2023 at 03:31:56PM -0700, Jonathan Tan wrote:
> SZEDER Gábor suggested [2] that we change the revision walk to read
> changed path filters also for root commits, but I don't think that's
> possible - we have to tie reading changed path filters to when we read
> trees, and right now, we don't seem to read trees when evaluating root
> commits (rev_compare_tree() in revision.c is in the only code path that
> uses changed path filters, and it itself is only called per-parent and
> thus not called for root commits). The alternative is to not generate
> changed path filters for root commits (or what I did in this patch,
> which is to generate an all-1 filter), which seems reasonable to me.

I think between the two, the all-1's filter is the more sensible choice,
since not computing a filter is typically reserved for blowing past the
`commitGraph.maxNewFilters` setting.

But, I agree with Gábor down-thread that we could instead teach
`rev_same_tree_as_empty()` to be aware of Bloom filters, which I think
would accomplish our goal of reading Bloom filters at the root commit
while not having to tweak their generation.

Thanks,
Taylor
Jonathan Tan Oct. 2, 2023, 10:55 p.m. UTC | #5
Taylor Blau <me@ttaylorr.com> writes:
> diff --git a/revision.c b/revision.c
> index 2f4c53ea20..1d36df49e2 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -837,14 +837,24 @@ static int rev_compare_tree(struct rev_info *revs,
>  static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
>  {
>  	struct tree *t1 = repo_get_commit_tree(the_repository, commit);
> +	int bloom_ret = 1;
> 
>  	if (!t1)
>  		return 0;
> 
> +	if (revs->bloom_keys_nr) {
> +		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
> +		if (!bloom_ret)
> +			return 1;
> +	}
> +
>  	tree_difference = REV_TREE_SAME;
>  	revs->pruning.flags.has_changes = 0;
>  	diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
> 
> +	if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
> +		count_bloom_filter_false_positive++;
> +
>  	return tree_difference == REV_TREE_SAME;
>  }

I'll concentrate on getting this patch in, and will look at (and
discuss) the other Bloom filter-related emails later.

This looks good, possibly except a code path in try_to_simplify_commit()
that calls this rev_same_tree_as_empty() function when
rev_compare_tree() between a commit and its parent returns REV_TREE_NEW.
So there are 2 issues: How can rev_compare_tree() ever return
REV_TREE_NEW? And it doesn't seem right to check Bloom filters in this
code path, since rev_same_tree_as_empty() was invoked here while we are
enumerating through a commit's parents, which necessarily implies that
the commit has parents, but here we're using the Bloom filter as if the
commit is known to have no parents.

As for the first issue, rev_compare_tree() returns REV_TREE_NEW when the
parent's tree is NULL. I'm not sure how this can happen - the tree can
be NULL if the parent commit is not parsed, but at this point I think
that it has been parsed. And I think every commit has a tree. This goes
back all the way to 3a5e860815 (revision: make tree comparison functions
take commits rather than trees, 2008-11-03) and even beyond that (I
didn't dig further).

As for the second issue, we can probably solve this by being defensive
in rev_same_tree_as_empty() by only using the Bloom filter when the
commit has no parents. Not sure if this is being overly defensive,
though.

There is also the issue that count_bloom_filter_false_positive is
incremented even when no Bloom filters are present, but I think this is
fine (it matches the behavior of rev_compare_tree()).
Taylor Blau Oct. 9, 2023, 5:20 p.m. UTC | #6
On Mon, Oct 02, 2023 at 03:55:46PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > diff --git a/revision.c b/revision.c
> > index 2f4c53ea20..1d36df49e2 100644
> > --- a/revision.c
> > +++ b/revision.c
> > @@ -837,14 +837,24 @@ static int rev_compare_tree(struct rev_info *revs,
> >  static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
> >  {
> >  	struct tree *t1 = repo_get_commit_tree(the_repository, commit);
> > +	int bloom_ret = 1;
> >
> >  	if (!t1)
> >  		return 0;
> >
> > +	if (revs->bloom_keys_nr) {
> > +		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
> > +		if (!bloom_ret)
> > +			return 1;
> > +	}
> > +
> >  	tree_difference = REV_TREE_SAME;
> >  	revs->pruning.flags.has_changes = 0;
> >  	diff_tree_oid(NULL, &t1->object.oid, "", &revs->pruning);
> >
> > +	if (bloom_ret == 1 && tree_difference == REV_TREE_SAME)
> > +		count_bloom_filter_false_positive++;
> > +
> >  	return tree_difference == REV_TREE_SAME;
> >  }
>
> I'll concentrate on getting this patch in, and will look at (and
> discuss) the other Bloom filter-related emails later.

Sounds good. I know that I have some pending mail from SZEDER as well,
so I'll try and apply any feedback from both of you before sending a
reroll so that we can get this all done in one shot.

> This looks good, possibly except a code path in try_to_simplify_commit()
> that calls this rev_same_tree_as_empty() function when
> rev_compare_tree() between a commit and its parent returns REV_TREE_NEW.
> So there are 2 issues: How can rev_compare_tree() ever return
> REV_TREE_NEW? And it doesn't seem right to check Bloom filters in this
> code path, since rev_same_tree_as_empty() was invoked here while we are
> enumerating through a commit's parents, which necessarily implies that
> the commit has parents, but here we're using the Bloom filter as if the
> commit is known to have no parents.

> As for the first issue, rev_compare_tree() returns REV_TREE_NEW when the
> parent's tree is NULL. I'm not sure how this can happen - the tree can
> be NULL if the parent commit is not parsed, but at this point I think
> that it has been parsed. And I think every commit has a tree. This goes
> back all the way to 3a5e860815 (revision: make tree comparison functions
> take commits rather than trees, 2008-11-03) and even beyond that (I
> didn't dig further).

The more I think about this, the more confused I become ;-). I was able
to track this all the way back to Linus's 461cf59f89 (rev-list: stop
when the file disappears, 2006-01-18), but am convinced that both of
these cases (we have an analogous one for when t2 is NULL and we return
REV_TREE_OLD) are dead code.

I replaced the "if (!t1) return REV_TREE_NEW;" with "if (!t1)
BUG("oops")", and was able to get the whole test suite to pass. So... I
am pretty sure that this is dead code, but not sure enough to remove it
myself ;-).

To your point about seeing the tree as NULL before parsing the parent, I
don't think that is the case here, since we parse the parent immediately
before calling rev_compare_tree() (and indeed Git will refuse to read
commit objects which do not list a tree).

> As for the second issue, we can probably solve this by being defensive
> in rev_same_tree_as_empty() by only using the Bloom filter when the
> commit has no parents. Not sure if this is being overly defensive,
> though.

I am also unsure whether we are being overly defensive here or not. But
I agree that it does feel safer to apply something like:

--- 8< ---
diff --git a/revision.c b/revision.c
index 3d78ea6a9a..21b3085465 100644
--- a/revision.c
+++ b/revision.c
@@ -834,7 +834,8 @@ static int rev_compare_tree(struct rev_info *revs,
 	return tree_difference;
 }

-static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
+static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit,
+				  int use_bloom_filter)
 {
 	struct tree *t1 = repo_get_commit_tree(the_repository, commit);
 	int bloom_ret = 1;
@@ -842,7 +843,7 @@ static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 	if (!t1)
 		return 0;

-	if (revs->bloom_keys_nr) {
+	if (use_bloom_filter && revs->bloom_keys_nr) {
 		bloom_ret = check_maybe_different_in_bloom_filter(revs, commit);
 		if (!bloom_ret)
 			return 1;
@@ -892,7 +893,7 @@ static int compact_treesame(struct rev_info *revs, struct commit *commit, unsign
 		if (nth_parent != 0)
 			die("compact_treesame %u", nth_parent);
 		old_same = !!(commit->object.flags & TREESAME);
-		if (rev_same_tree_as_empty(revs, commit))
+		if (rev_same_tree_as_empty(revs, commit, 1))
 			commit->object.flags |= TREESAME;
 		else
 			commit->object.flags &= ~TREESAME;
@@ -988,7 +989,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)
 		return;

 	if (!commit->parents) {
-		if (rev_same_tree_as_empty(revs, commit))
+		if (rev_same_tree_as_empty(revs, commit, 1))
 			commit->object.flags |= TREESAME;
 		return;
 	}
@@ -1069,7 +1070,7 @@ static void try_to_simplify_commit(struct rev_info *revs, struct commit *commit)

 		case REV_TREE_NEW:
 			if (revs->remove_empty_trees &&
-			    rev_same_tree_as_empty(revs, p)) {
+			    rev_same_tree_as_empty(revs, p, 0)) {
 				/* We are adding all the specified
 				 * paths from this parent, so the
 				 * history beyond this parent is not
--- >8 ---

on top.

> There is also the issue that count_bloom_filter_false_positive is
> incremented even when no Bloom filters are present, but I think this is
> fine (it matches the behavior of rev_compare_tree()).

Yep, agreed.

Thanks,
Taylor
Taylor Blau Oct. 9, 2023, 5:26 p.m. UTC | #7
On Mon, Oct 09, 2023 at 01:20:31PM -0400, Taylor Blau wrote:
> On Mon, Oct 02, 2023 at 03:55:46PM -0700, Jonathan Tan wrote:
> > As for the second issue, we can probably solve this by being defensive
> > in rev_same_tree_as_empty() by only using the Bloom filter when the
> > commit has no parents. Not sure if this is being overly defensive,
> > though.
>
> I am also unsure whether we are being overly defensive here or not. But
> I agree that it does feel safer to apply something like:

Never mind, we are being overly defensive there. The goal here is to
avoid using a commit's Bloom filter in cases where we are acting as if a
commit is at the root of history, but in fact has parents.

This only happens when we return REV_TREE_NEW from a call to
`rev_compare_tree(revs, p, commit, nth_parent)`. But we'll only get
REV_TREE_NEW back if

    repo_get_commit_tree(the_repository, p);

returns NULL. But when we call rev_same_tree_as_empty(revs, p) in the
REV_TREE_NEW case, we return early as follows:

    struct tree *t1 = repo_get_commit_tree(revs, p);
    if (!t1)
      return 0;

So we won't even consult the Bloom filter in that case, since t1 is NULL
for the same reason as what caused rev_compare_tree() to return
REV_TREE_NEW in the first place.

I am still dumbfounded by how we would ever get REV_TREE_NEW in the
first place, but if we did, I think we would be OK here.

Thanks,
Taylor
Jonathan Tan Oct. 9, 2023, 8:59 p.m. UTC | #8
Taylor Blau <me@ttaylorr.com> writes:
> This only happens when we return REV_TREE_NEW from a call to
> `rev_compare_tree(revs, p, commit, nth_parent)`. But we'll only get
> REV_TREE_NEW back if
> 
>     repo_get_commit_tree(the_repository, p);
> 
> returns NULL. But when we call rev_same_tree_as_empty(revs, p) in the
> REV_TREE_NEW case, we return early as follows:
> 
>     struct tree *t1 = repo_get_commit_tree(revs, p);
>     if (!t1)
>       return 0;
> 
> So we won't even consult the Bloom filter in that case, since t1 is NULL
> for the same reason as what caused rev_compare_tree() to return
> REV_TREE_NEW in the first place.
> 
> I am still dumbfounded by how we would ever get REV_TREE_NEW in the
> first place, but if we did, I think we would be OK here.
> 
> Thanks,
> Taylor

Ah, good point. Your patch in
https://lore.kernel.org/git/ZQnmTXUO94%2FQy8mq@nand.local/ looks good to
me, then.
Taylor Blau Oct. 10, 2023, 7:47 p.m. UTC | #9
On Mon, Oct 09, 2023 at 01:59:25PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> > This only happens when we return REV_TREE_NEW from a call to
> > `rev_compare_tree(revs, p, commit, nth_parent)`. But we'll only get
> > REV_TREE_NEW back if
> >
> >     repo_get_commit_tree(the_repository, p);
> >
> > returns NULL. But when we call rev_same_tree_as_empty(revs, p) in the
> > REV_TREE_NEW case, we return early as follows:
> >
> >     struct tree *t1 = repo_get_commit_tree(revs, p);
> >     if (!t1)
> >       return 0;
> >
> > So we won't even consult the Bloom filter in that case, since t1 is NULL
> > for the same reason as what caused rev_compare_tree() to return
> > REV_TREE_NEW in the first place.
> >
> > I am still dumbfounded by how we would ever get REV_TREE_NEW in the
> > first place, but if we did, I think we would be OK here.
> >
> > Thanks,
> > Taylor
>
> Ah, good point. Your patch in
> https://lore.kernel.org/git/ZQnmTXUO94%2FQy8mq@nand.local/ looks good to
> me, then.

Oops, I made a mistake in the quoted portion, which is that we could get
REV_TREE_NEW if the tree-diff itself only adds files. This is the
non-trivial case that we get when t1 is non-NULL, and we end up calling
`diff_tree_oid()` which sets the static `tree_difference` variable.

So I think adding an nth_parent field (like you originally
suggested[^1]) makes sense.

Thanks,
Taylor

[^1]: Thanks for being patient with me ;-).
diff mbox series

Patch

diff --git a/bloom.c b/bloom.c
index aef6b5fea2..b21130b236 100644
--- a/bloom.c
+++ b/bloom.c
@@ -226,7 +226,7 @@  struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	if (c->parents)
 		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
 	else
-		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
+		goto large;
 	diffcore_std(&diffopt);
 
 	if (diff_queued_diff.nr <= settings->max_changed_paths) {
@@ -292,6 +292,7 @@  struct bloom_filter *get_or_compute_bloom_filter(struct repository *r,
 	} else {
 		for (i = 0; i < diff_queued_diff.nr; i++)
 			diff_free_filepair(diff_queued_diff.queue[i]);
+large:
 		init_truncated_large_filter(filter);
 
 		if (computed)
diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
index b567383eb8..02a0b41026 100755
--- a/t/t0095-bloom.sh
+++ b/t/t0095-bloom.sh
@@ -74,7 +74,7 @@  test_expect_success !SANITIZE_LEAK 'get bloom filters for commit with no changes
 	git commit --allow-empty -m "c0" &&
 	cat >expect <<-\EOF &&
 	Filter_Length:1
-	Filter_Data:00|
+	Filter_Data:ff|
 	EOF
 	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
 	test_cmp expect actual
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index fa9d32facf..d14fe93fb1 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -283,7 +283,6 @@  test_expect_success 'correctly report changes over limit' '
 			git commit-graph write --reachable --changed-paths &&
 		test_max_changed_paths 11 trace-update &&
 		test_filter_computed 2 trace-update &&
-		test_filter_trunc_large 0 trace-update &&
 
 		for path in $(git ls-tree -r --name-only HEAD)
 		do
@@ -306,9 +305,7 @@  test_expect_success 'correctly report commits with no changed paths' '
 		GIT_TRACE2_EVENT="$(pwd)/trace.event" \
 			git commit-graph write --reachable --changed-paths &&
 		test_filter_computed 1 trace.event &&
-		test_filter_not_computed 0 trace.event &&
-		test_filter_trunc_empty 1 trace.event &&
-		test_filter_trunc_large 0 trace.event
+		test_filter_not_computed 0 trace.event
 	)
 '
 
@@ -363,8 +360,7 @@  test_expect_success '--max-new-filters overrides configuration' '
 				--max-new-filters=1 &&
 		test_filter_computed 1 trace.event &&
 		test_filter_not_computed 1 trace.event &&
-		test_filter_trunc_empty 0 trace.event &&
-		test_filter_trunc_large 0 trace.event
+		test_filter_trunc_empty 0 trace.event
 	)
 '
 
@@ -386,9 +382,7 @@  test_expect_success 'Bloom generation backfills empty commits' '
 				git commit-graph write --reachable \
 					--changed-paths --max-new-filters=2 &&
 			test_filter_computed 2 trace.event &&
-			test_filter_not_computed 4 trace.event &&
-			test_filter_trunc_empty 2 trace.event &&
-			test_filter_trunc_large 0 trace.event || return 1
+			test_filter_not_computed 4 trace.event || return 1
 		done &&
 
 		# Finally, make sure that once all commits have filters, that