diff mbox series

[v3,01/14] commit-graph: introduce 'get_bloom_filter_settings()'

Message ID e714e54240bb339d1ecebcea157b734abf1c14ef.1597178915.git.me@ttaylorr.com (mailing list archive)
State Superseded
Headers show
Series more miscellaneous Bloom filter improvements | expand

Commit Message

Taylor Blau Aug. 11, 2020, 8:51 p.m. UTC
Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed to be the same for any layer in a
chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 blame.c                       |  6 ++++--
 bloom.c                       |  6 +++---
 commit-graph.c                | 11 +++++++++++
 commit-graph.h                |  2 ++
 revision.c                    |  5 +----
 t/t4216-log-bloom.sh          |  9 ++++++---
 t/t5324-split-commit-graph.sh | 13 +++++++++++++
 7 files changed, 40 insertions(+), 12 deletions(-)

Comments

SZEDER Gábor Aug. 11, 2020, 9:18 p.m. UTC | #1
On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> Many places in the code often need a pointer to the commit-graph's
> 'struct bloom_filter_settings', in which case they often take the value
> from the top-most commit-graph.
> 
> In the non-split case, this works as expected. In the split case,
> however, things get a little tricky. Not all layers in a chain of
> incremental commit-graphs are required to themselves have Bloom data,
> and so whether or not some part of the code uses Bloom filters depends
> entirely on whether or not the top-most level of the commit-graph chain
> has Bloom filters.
> 
> This has been the behavior since Bloom filters were introduced, and has
> been codified into the tests since a759bfa9ee (t4216: add end to end
> tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> requires that Bloom filters are not used in exactly the case described
> earlier.
> 
> There is no reason that this needs to be the case, since it is perfectly
> valid for commits in an earlier layer to have Bloom filters when commits
> in a newer layer do not.
> 
> Since Bloom settings are guaranteed to be the same for any layer in a
> chain that has Bloom data,

Is it?  Where is that guaranteed?

> it is sufficient to traverse the
> '->base_graph' pointer until either (1) a non-null 'struct
> bloom_filter_settings *' is found, or (2) until we are at the root of
> the commit-graph chain.
> 
> Introduce a 'get_bloom_filter_settings()' function that does just this,
> and use it instead of purely dereferencing the top-most graph's
> '->bloom_filter_settings' pointer.
> 
> While we're at it, add an additional test in t5324 to guard against code
> in the commit-graph writing machinery that doesn't correctly handle a
> NULL 'struct bloom_filter *'.
> 
> Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
Taylor Blau Aug. 11, 2020, 9:21 p.m. UTC | #2
On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > Many places in the code often need a pointer to the commit-graph's
> > 'struct bloom_filter_settings', in which case they often take the value
> > from the top-most commit-graph.
> >
> > In the non-split case, this works as expected. In the split case,
> > however, things get a little tricky. Not all layers in a chain of
> > incremental commit-graphs are required to themselves have Bloom data,
> > and so whether or not some part of the code uses Bloom filters depends
> > entirely on whether or not the top-most level of the commit-graph chain
> > has Bloom filters.
> >
> > This has been the behavior since Bloom filters were introduced, and has
> > been codified into the tests since a759bfa9ee (t4216: add end to end
> > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > requires that Bloom filters are not used in exactly the case described
> > earlier.
> >
> > There is no reason that this needs to be the case, since it is perfectly
> > valid for commits in an earlier layer to have Bloom filters when commits
> > in a newer layer do not.
> >
> > Since Bloom settings are guaranteed to be the same for any layer in a
> > chain that has Bloom data,
>
> Is it?  Where is that guaranteed?

There is no mechanism whatsoever to customize these settings that is
exposed to the user (except for the undocumented 'GIT_TEST' environment
variables).

> > it is sufficient to traverse the
> > '->base_graph' pointer until either (1) a non-null 'struct
> > bloom_filter_settings *' is found, or (2) until we are at the root of
> > the commit-graph chain.
> >
> > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > and use it instead of purely dereferencing the top-most graph's
> > '->bloom_filter_settings' pointer.
> >
> > While we're at it, add an additional test in t5324 to guard against code
> > in the commit-graph writing machinery that doesn't correctly handle a
> > NULL 'struct bloom_filter *'.
> >
> > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > Signed-off-by: Taylor Blau <me@ttaylorr.com>
Thanks,
Taylor
SZEDER Gábor Aug. 11, 2020, 9:27 p.m. UTC | #3
On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> > On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > > Many places in the code often need a pointer to the commit-graph's
> > > 'struct bloom_filter_settings', in which case they often take the value
> > > from the top-most commit-graph.
> > >
> > > In the non-split case, this works as expected. In the split case,
> > > however, things get a little tricky. Not all layers in a chain of
> > > incremental commit-graphs are required to themselves have Bloom data,
> > > and so whether or not some part of the code uses Bloom filters depends
> > > entirely on whether or not the top-most level of the commit-graph chain
> > > has Bloom filters.
> > >
> > > This has been the behavior since Bloom filters were introduced, and has
> > > been codified into the tests since a759bfa9ee (t4216: add end to end
> > > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > > requires that Bloom filters are not used in exactly the case described
> > > earlier.
> > >
> > > There is no reason that this needs to be the case, since it is perfectly
> > > valid for commits in an earlier layer to have Bloom filters when commits
> > > in a newer layer do not.
> > >
> > > Since Bloom settings are guaranteed to be the same for any layer in a
> > > chain that has Bloom data,
> >
> > Is it?  Where is that guaranteed?
> 
> There is no mechanism whatsoever to customize these settings that is
> exposed to the user (except for the undocumented 'GIT_TEST' environment
> variables).

Let me rephrase it, then: where is it written in the commit-graph
format specification that these must be the same in all layers?

Nowhere.

> > > it is sufficient to traverse the
> > > '->base_graph' pointer until either (1) a non-null 'struct
> > > bloom_filter_settings *' is found, or (2) until we are at the root of
> > > the commit-graph chain.
> > >
> > > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > > and use it instead of purely dereferencing the top-most graph's
> > > '->bloom_filter_settings' pointer.
> > >
> > > While we're at it, add an additional test in t5324 to guard against code
> > > in the commit-graph writing machinery that doesn't correctly handle a
> > > NULL 'struct bloom_filter *'.
> > >
> > > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> Thanks,
> Taylor
Taylor Blau Aug. 11, 2020, 9:34 p.m. UTC | #4
On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> > On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> > > On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > > > Many places in the code often need a pointer to the commit-graph's
> > > > 'struct bloom_filter_settings', in which case they often take the value
> > > > from the top-most commit-graph.
> > > >
> > > > In the non-split case, this works as expected. In the split case,
> > > > however, things get a little tricky. Not all layers in a chain of
> > > > incremental commit-graphs are required to themselves have Bloom data,
> > > > and so whether or not some part of the code uses Bloom filters depends
> > > > entirely on whether or not the top-most level of the commit-graph chain
> > > > has Bloom filters.
> > > >
> > > > This has been the behavior since Bloom filters were introduced, and has
> > > > been codified into the tests since a759bfa9ee (t4216: add end to end
> > > > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > > > requires that Bloom filters are not used in exactly the case described
> > > > earlier.
> > > >
> > > > There is no reason that this needs to be the case, since it is perfectly
> > > > valid for commits in an earlier layer to have Bloom filters when commits
> > > > in a newer layer do not.
> > > >
> > > > Since Bloom settings are guaranteed to be the same for any layer in a
> > > > chain that has Bloom data,
> > >
> > > Is it?  Where is that guaranteed?
> >
> > There is no mechanism whatsoever to customize these settings that is
> > exposed to the user (except for the undocumented 'GIT_TEST' environment
> > variables).
>
> Let me rephrase it, then: where is it written in the commit-graph
> format specification that these must be the same in all layers?
>
> Nowhere.

OK. We can certainly document that this is the case. For this purpose,
all we really care about is that the graph _has_ Bloom filters anywhere.
If you wanted to return the exact matching settings, you could also
provide a commit and return the settings belonging to the graph that
contains that commit.

In the case where we don't have a commit, we could use the default
settings instead.

I think that we are a little bit dealing with a problem that doesn't
exist, since we do not document the sole method by which you would
change these settings. So, maybe we can think more about this, but my
preference would be to leave this patch alone.

Maybe Stolee can chime in, too.

> > > > it is sufficient to traverse the
> > > > '->base_graph' pointer until either (1) a non-null 'struct
> > > > bloom_filter_settings *' is found, or (2) until we are at the root of
> > > > the commit-graph chain.
> > > >
> > > > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > > > and use it instead of purely dereferencing the top-most graph's
> > > > '->bloom_filter_settings' pointer.
> > > >
> > > > While we're at it, add an additional test in t5324 to guard against code
> > > > in the commit-graph writing machinery that doesn't correctly handle a
> > > > NULL 'struct bloom_filter *'.
> > > >
> > > > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > > > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > Thanks,
> > Taylor
Thanks,
Taylor
SZEDER Gábor Aug. 11, 2020, 11:55 p.m. UTC | #5
On Tue, Aug 11, 2020 at 05:34:11PM -0400, Taylor Blau wrote:
> On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
> > On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> > > On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> > > > On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> > > > > Many places in the code often need a pointer to the commit-graph's
> > > > > 'struct bloom_filter_settings', in which case they often take the value
> > > > > from the top-most commit-graph.
> > > > >
> > > > > In the non-split case, this works as expected. In the split case,
> > > > > however, things get a little tricky. Not all layers in a chain of
> > > > > incremental commit-graphs are required to themselves have Bloom data,
> > > > > and so whether or not some part of the code uses Bloom filters depends
> > > > > entirely on whether or not the top-most level of the commit-graph chain
> > > > > has Bloom filters.
> > > > >
> > > > > This has been the behavior since Bloom filters were introduced, and has
> > > > > been codified into the tests since a759bfa9ee (t4216: add end to end
> > > > > tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> > > > > requires that Bloom filters are not used in exactly the case described
> > > > > earlier.
> > > > >
> > > > > There is no reason that this needs to be the case, since it is perfectly
> > > > > valid for commits in an earlier layer to have Bloom filters when commits
> > > > > in a newer layer do not.
> > > > >
> > > > > Since Bloom settings are guaranteed to be the same for any layer in a
> > > > > chain that has Bloom data,
> > > >
> > > > Is it?  Where is that guaranteed?
> > >
> > > There is no mechanism whatsoever to customize these settings that is
> > > exposed to the user (except for the undocumented 'GIT_TEST' environment
> > > variables).
> >
> > Let me rephrase it, then: where is it written in the commit-graph
> > format specification that these must be the same in all layers?
> >
> > Nowhere.
> 
> OK. We can certainly document that this is the case.

IMO we absolutely must document this first; ideally it should have
been carefully considered and documented right from the start.

Some thougths about this:

  https://public-inbox.org/git/20200619140230.GB22200@szeder.dev/

> For this purpose,
> all we really care about is that the graph _has_ Bloom filters anywhere.
> If you wanted to return the exact matching settings, you could also
> provide a commit and return the settings belonging to the graph that
> contains that commit.
> 
> In the case where we don't have a commit, we could use the default
> settings instead.
> 
> I think that we are a little bit dealing with a problem that doesn't
> exist, since we do not document the sole method by which you would
> change these settings. So, maybe we can think more about this, but my
> preference would be to leave this patch alone.

Other implementations can write split commit-graphs with modified path
Bloom filters as well, and at the moment there is nothing in the specs
that tells them not to use different Bloom filter settings in
different layers.

> Maybe Stolee can chime in, too.
> 
> > > > > it is sufficient to traverse the
> > > > > '->base_graph' pointer until either (1) a non-null 'struct
> > > > > bloom_filter_settings *' is found, or (2) until we are at the root of
> > > > > the commit-graph chain.
> > > > >
> > > > > Introduce a 'get_bloom_filter_settings()' function that does just this,
> > > > > and use it instead of purely dereferencing the top-most graph's
> > > > > '->bloom_filter_settings' pointer.
> > > > >
> > > > > While we're at it, add an additional test in t5324 to guard against code
> > > > > in the commit-graph writing machinery that doesn't correctly handle a
> > > > > NULL 'struct bloom_filter *'.
> > > > >
> > > > > Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
> > > > > Signed-off-by: Taylor Blau <me@ttaylorr.com>
> > > Thanks,
> > > Taylor
> Thanks,
> Taylor
Derrick Stolee Aug. 12, 2020, 11:48 a.m. UTC | #6
On 8/11/2020 7:55 PM, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 05:34:11PM -0400, Taylor Blau wrote:
>> On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
>>> On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
>>>> On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
>>>>> On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
>>>>>> Many places in the code often need a pointer to the commit-graph's
>>>>>> 'struct bloom_filter_settings', in which case they often take the value
>>>>>> from the top-most commit-graph.
>>>>>>
>>>>>> In the non-split case, this works as expected. In the split case,
>>>>>> however, things get a little tricky. Not all layers in a chain of
>>>>>> incremental commit-graphs are required to themselves have Bloom data,
>>>>>> and so whether or not some part of the code uses Bloom filters depends
>>>>>> entirely on whether or not the top-most level of the commit-graph chain
>>>>>> has Bloom filters.
>>>>>>
>>>>>> This has been the behavior since Bloom filters were introduced, and has
>>>>>> been codified into the tests since a759bfa9ee (t4216: add end to end
>>>>>> tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
>>>>>> requires that Bloom filters are not used in exactly the case described
>>>>>> earlier.
>>>>>>
>>>>>> There is no reason that this needs to be the case, since it is perfectly
>>>>>> valid for commits in an earlier layer to have Bloom filters when commits
>>>>>> in a newer layer do not.
>>>>>>
>>>>>> Since Bloom settings are guaranteed to be the same for any layer in a
>>>>>> chain that has Bloom data,
>>>>>
>>>>> Is it?  Where is that guaranteed?

Perhaps instead of "guaranteed" we could say "Git never writes
a commit-graph chain with different settings between layers."

>>>> There is no mechanism whatsoever to customize these settings that is
>>>> exposed to the user (except for the undocumented 'GIT_TEST' environment
>>>> variables).
>>>
>>> Let me rephrase it, then: where is it written in the commit-graph
>>> format specification that these must be the same in all layers?
>>>
>>> Nowhere.
>>
>> OK. We can certainly document that this is the case.
> 
> IMO we absolutely must document this first; ideally it should have
> been carefully considered and documented right from the start.

You're right. One of the major difficulties in bringing a Bloom
filter implementation to the commit-graph feature when we did was
that the split commit-graph was introduced between our initial
prototypes and when it was finally prepared for a full submission.

There certainly are gaps in the implementation and documentation.
I think Taylor is doing a great job by addressing one of those gaps
in a focused, thoughtful way.

> Some thougths about this:
> 
>   https://public-inbox.org/git/20200619140230.GB22200@szeder.dev/

I appreciate your attention to detail. Your comments on the existing
implementation do point out some of its shortcomings, and that is a
valuable contribution.

Actually converting those thoughts into patches is a lot of work.

>> For this purpose,
>> all we really care about is that the graph _has_ Bloom filters anywhere.
>> If you wanted to return the exact matching settings, you could also
>> provide a commit and return the settings belonging to the graph that
>> contains that commit.
>>
>> In the case where we don't have a commit, we could use the default
>> settings instead.
>>
>> I think that we are a little bit dealing with a problem that doesn't
>> exist, since we do not document the sole method by which you would
>> change these settings. So, maybe we can think more about this, but my
>> preference would be to leave this patch alone.
> 
> Other implementations can write split commit-graphs with modified path
> Bloom filters as well, and at the moment there is nothing in the specs
> that tells them not to use different Bloom filter settings in
> different layers.

You are 100% correct that there is a gap in documentation. That should
be corrected at some point. (I don't consider it a blocker for this
series.)

But also: Git itself is the true test of a "correct" third-party
implementation. libgit2 and JGit try to match Git, "warts and all".
If another implementation wrote data that results in incorrect
behavior by Git, then that implementation is wrong.

Improving documentation can make those errors less likely.

We also must design with "future Git" in mind, presenting it with
enough flexibility to improve formats. The custom Bloom filter
settings do allow that flexibility, but the requirement that all
layers have identical settings exists for a reason (despite not
being documented). It is important that any commit walk that intends
to use the changed-path Bloom filters can compute the bloom keys
for the test paths only once.

Thanks,
-Stolee
Taylor Blau Aug. 14, 2020, 8:17 p.m. UTC | #7
On Wed, Aug 12, 2020 at 07:48:55AM -0400, Derrick Stolee wrote:
> On 8/11/2020 7:55 PM, SZEDER Gábor wrote:
> > On Tue, Aug 11, 2020 at 05:34:11PM -0400, Taylor Blau wrote:
> >> On Tue, Aug 11, 2020 at 11:27:16PM +0200, SZEDER Gábor wrote:
> >>> On Tue, Aug 11, 2020 at 05:21:18PM -0400, Taylor Blau wrote:
> >>>> On Tue, Aug 11, 2020 at 11:18:30PM +0200, SZEDER Gábor wrote:
> >>>>> On Tue, Aug 11, 2020 at 04:51:19PM -0400, Taylor Blau wrote:
> >>>>>> Many places in the code often need a pointer to the commit-graph's
> >>>>>> 'struct bloom_filter_settings', in which case they often take the value
> >>>>>> from the top-most commit-graph.
> >>>>>>
> >>>>>> In the non-split case, this works as expected. In the split case,
> >>>>>> however, things get a little tricky. Not all layers in a chain of
> >>>>>> incremental commit-graphs are required to themselves have Bloom data,
> >>>>>> and so whether or not some part of the code uses Bloom filters depends
> >>>>>> entirely on whether or not the top-most level of the commit-graph chain
> >>>>>> has Bloom filters.
> >>>>>>
> >>>>>> This has been the behavior since Bloom filters were introduced, and has
> >>>>>> been codified into the tests since a759bfa9ee (t4216: add end to end
> >>>>>> tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
> >>>>>> requires that Bloom filters are not used in exactly the case described
> >>>>>> earlier.
> >>>>>>
> >>>>>> There is no reason that this needs to be the case, since it is perfectly
> >>>>>> valid for commits in an earlier layer to have Bloom filters when commits
> >>>>>> in a newer layer do not.
> >>>>>>
> >>>>>> Since Bloom settings are guaranteed to be the same for any layer in a
> >>>>>> chain that has Bloom data,
> >>>>>
> >>>>> Is it?  Where is that guaranteed?
>
> Perhaps instead of "guaranteed" we could say "Git never writes
> a commit-graph chain with different settings between layers."
>
> >>>> There is no mechanism whatsoever to customize these settings that is
> >>>> exposed to the user (except for the undocumented 'GIT_TEST' environment
> >>>> variables).
> >>>
> >>> Let me rephrase it, then: where is it written in the commit-graph
> >>> format specification that these must be the same in all layers?
> >>>
> >>> Nowhere.
> >>
> >> OK. We can certainly document that this is the case.
> >
> > IMO we absolutely must document this first; ideally it should have
> > been carefully considered and documented right from the start.
>
> You're right. One of the major difficulties in bringing a Bloom
> filter implementation to the commit-graph feature when we did was
> that the split commit-graph was introduced between our initial
> prototypes and when it was finally prepared for a full submission.
>
> There certainly are gaps in the implementation and documentation.
> I think Taylor is doing a great job by addressing one of those gaps
> in a focused, thoughtful way.
>
> > Some thougths about this:
> >
> >   https://public-inbox.org/git/20200619140230.GB22200@szeder.dev/
>
> I appreciate your attention to detail. Your comments on the existing
> implementation do point out some of its shortcomings, and that is a
> valuable contribution.
>
> Actually converting those thoughts into patches is a lot of work.
>
> >> For this purpose,
> >> all we really care about is that the graph _has_ Bloom filters anywhere.
> >> If you wanted to return the exact matching settings, you could also
> >> provide a commit and return the settings belonging to the graph that
> >> contains that commit.
> >>
> >> In the case where we don't have a commit, we could use the default
> >> settings instead.
> >>
> >> I think that we are a little bit dealing with a problem that doesn't
> >> exist, since we do not document the sole method by which you would
> >> change these settings. So, maybe we can think more about this, but my
> >> preference would be to leave this patch alone.
> >
> > Other implementations can write split commit-graphs with modified path
> > Bloom filters as well, and at the moment there is nothing in the specs
> > that tells them not to use different Bloom filter settings in
> > different layers.
>
> You are 100% correct that there is a gap in documentation. That should
> be corrected at some point. (I don't consider it a blocker for this
> series.)
>
> But also: Git itself is the true test of a "correct" third-party
> implementation. libgit2 and JGit try to match Git, "warts and all".
> If another implementation wrote data that results in incorrect
> behavior by Git, then that implementation is wrong.
>
> Improving documentation can make those errors less likely.

I agree with this reasoning. Would anybody object to moving forward with
this series without a change in documentation today (but rather down the
road)?

> We also must design with "future Git" in mind, presenting it with
> enough flexibility to improve formats. The custom Bloom filter
> settings do allow that flexibility, but the requirement that all
> layers have identical settings exists for a reason (despite not
> being documented). It is important that any commit walk that intends
> to use the changed-path Bloom filters can compute the bloom keys
> for the test paths only once.
>
> Thanks,
> -Stolee

Thanks,
Taylor
diff mbox series

Patch

diff --git a/blame.c b/blame.c
index 82fa16d658..3e5f8787bc 100644
--- a/blame.c
+++ b/blame.c
@@ -2891,16 +2891,18 @@  void setup_blame_bloom_data(struct blame_scoreboard *sb,
 			    const char *path)
 {
 	struct blame_bloom_data *bd;
+	struct bloom_filter_settings *bs;
 
 	if (!sb->repo->objects->commit_graph)
 		return;
 
-	if (!sb->repo->objects->commit_graph->bloom_filter_settings)
+	bs = get_bloom_filter_settings(sb->repo);
+	if (!bs)
 		return;
 
 	bd = xmalloc(sizeof(struct blame_bloom_data));
 
-	bd->settings = sb->repo->objects->commit_graph->bloom_filter_settings;
+	bd->settings = bs;
 
 	bd->alloc = 4;
 	bd->nr = 0;
diff --git a/bloom.c b/bloom.c
index 1a573226e7..cd9380ac62 100644
--- a/bloom.c
+++ b/bloom.c
@@ -38,7 +38,7 @@  static int load_bloom_filter_from_graph(struct commit_graph *g,
 	while (graph_pos < g->num_commits_in_base)
 		g = g->base_graph;
 
-	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
+	/* The commit graph commit 'c' lives in doesn't carry Bloom filters. */
 	if (!g->chunk_bloom_indexes)
 		return 0;
 
@@ -195,8 +195,8 @@  struct bloom_filter *get_bloom_filter(struct repository *r,
 	if (!filter->data) {
 		load_commit_graph_info(r, c);
 		if (commit_graph_position(c) != COMMIT_NOT_FROM_GRAPH &&
-			r->objects->commit_graph->chunk_bloom_indexes)
-			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c);
+			load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
+				return filter;
 	}
 
 	if (filter->data)
diff --git a/commit-graph.c b/commit-graph.c
index e51c91dd5b..d4b06811be 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -660,6 +660,17 @@  int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	while (g) {
+		if (g->bloom_filter_settings)
+			return g->bloom_filter_settings;
+		g = g->base_graph;
+	}
+	return NULL;
+}
+
 static void close_commit_graph_one(struct commit_graph *g)
 {
 	if (!g)
diff --git a/commit-graph.h b/commit-graph.h
index 09a97030dc..0677dd1031 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -87,6 +87,8 @@  struct commit_graph *parse_commit_graph(void *graph_map, size_t graph_size);
  */
 int generation_numbers_enabled(struct repository *r);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
 	COMMIT_GRAPH_WRITE_PROGRESS   = (1 << 1),
diff --git a/revision.c b/revision.c
index 3dcf689341..be600186ee 100644
--- a/revision.c
+++ b/revision.c
@@ -680,10 +680,7 @@  static void prepare_to_use_bloom_filter(struct rev_info *revs)
 
 	repo_parse_commit(revs->repo, revs->commits->item);
 
-	if (!revs->repo->objects->commit_graph)
-		return;
-
-	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
+	revs->bloom_filter_settings = get_bloom_filter_settings(revs->repo);
 	if (!revs->bloom_filter_settings)
 		return;
 
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index c21cc160f3..c9f9bdf1ba 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -60,7 +60,7 @@  setup () {
 
 test_bloom_filters_used () {
 	log_args=$1
-	bloom_trace_prefix="statistics:{\"filter_not_present\":0,\"maybe\""
+	bloom_trace_prefix="statistics:{\"filter_not_present\":${2:-0},\"maybe\""
 	setup "$log_args" &&
 	grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" &&
 	test_cmp log_wo_bloom log_w_bloom &&
@@ -134,8 +134,11 @@  test_expect_success 'setup - add commit-graph to the chain without Bloom filters
 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain
 '
 
-test_expect_success 'Do not use Bloom filters if the latest graph does not have Bloom filters.' '
-	test_bloom_filters_not_used "-- A/B"
+test_expect_success 'use Bloom filters even if the latest graph does not have Bloom filters' '
+	# Ensure that the number of empty filters is equal to the number of
+	# filters in the latest graph layer to prove that they are loaded (and
+	# ignored).
+	test_bloom_filters_used "-- A/B" 3
 '
 
 test_expect_success 'setup - add commit-graph to the chain with Bloom filters' '
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 9b850ea907..5bdfd53ef9 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -425,4 +425,17 @@  done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success '--split=replace with partial Bloom data' '
+	rm -rf $graphdir $infodir/commit-graph &&
+	git reset --hard commits/3 &&
+	git rev-list -1 HEAD~2 >a &&
+	git rev-list -1 HEAD~1 >b &&
+	git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
+	git commit-graph write --split=no-merge --stdin-commits <b &&
+	git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 1 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done