
[3/3] commit-graph: respect 'core.useBloomFilters'

Message ID 4cfa086e503e19763a9d581fcb6a2ef818776dfc.1593536481.git.me@ttaylorr.com (mailing list archive)
State New, archived
Series commit-graph: introduce 'core.useBloomFilters'

Commit Message

Taylor Blau June 30, 2020, 5:17 p.m. UTC
Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'core.useBloomFilters' to control whether or not Bloom filters
are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set to false, pretend as if no Bloom data was
specified at all. This avoids adding additional special-casing outside
of the commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/config/core.txt | 5 +++++
 commit-graph.c                | 4 ++--
 repo-settings.c               | 3 +++
 repository.h                  | 1 +
 t/helper/test-read-graph.c    | 3 ++-
 t/t4216-log-bloom.sh          | 4 +++-
 6 files changed, 16 insertions(+), 4 deletions(-)
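
The "pretend as if no Bloom data was specified" approach from the commit
message can be sketched in isolation. This is a minimal model, not Git's
code: every sketch_* name below is hypothetical, and the structs only
stand in for the real repo_settings and commit_graph. The idea is that
the parser declines to record the Bloom chunk pointers when the setting
is off, so later code sees a graph with no filters and needs no special
case of its own.

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of Git's repo_settings. */
struct sketch_settings {
	int use_changed_paths; /* models core.useBloomFilters */
};

/* Hypothetical stand-in for the Bloom-related fields of a commit-graph. */
struct sketch_graph {
	const unsigned char *chunk_bloom_indexes;
	const unsigned char *chunk_bloom_data;
};

enum sketch_chunk_id {
	SKETCH_CHUNK_BLOOM_INDEXES,
	SKETCH_CHUNK_BLOOM_DATA
};

/*
 * Record one chunk while parsing. Returns 1 if the chunk was already
 * recorded (a repeated chunk), 0 otherwise. When the setting is off the
 * pointer is left NULL, as if the file had no such chunk.
 */
static int sketch_parse_chunk(struct sketch_graph *g,
			      const struct sketch_settings *s,
			      enum sketch_chunk_id id,
			      const unsigned char *data)
{
	switch (id) {
	case SKETCH_CHUNK_BLOOM_INDEXES:
		if (g->chunk_bloom_indexes)
			return 1;
		else if (s->use_changed_paths)
			g->chunk_bloom_indexes = data;
		break;
	case SKETCH_CHUNK_BLOOM_DATA:
		if (g->chunk_bloom_data)
			return 1;
		else if (s->use_changed_paths)
			g->chunk_bloom_data = data;
		break;
	}
	return 0;
}

/* Callers only ask whether filters are available; they never see the setting. */
static int sketch_has_filters(const struct sketch_graph *g)
{
	return g->chunk_bloom_indexes != NULL && g->chunk_bloom_data != NULL;
}
```

With the setting on, parsing both chunks makes sketch_has_filters()
return nonzero; with it off, the same input yields a graph that looks
like one written without '--changed-paths'.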

Comments

Jeff King June 30, 2020, 7:18 p.m. UTC | #1
On Tue, Jun 30, 2020 at 01:17:48PM -0400, Taylor Blau wrote:

> Git uses the 'core.commitGraph' configuration value to control whether
> or not the commit graph is used when parsing commits or performing a
> traversal.

I think this is a good thing to have, and the patch itself makes sense
to me (this is actually my first time reviewing it, despite its intended
use within GitHub :) ).

If I may bikeshed for a moment:

> Introduce 'core.useBloomFilters' to control whether or not Bloom filters
> are read. Note that this configuration is independent from both:
> 
>   - 'core.commitGraph', to allow flexibility in using all parts of a
>     commit-graph _except_ for its Bloom filters.
> 
>   - The '--changed-paths' option for 'git commit-graph write', to allow
>     reading and writing Bloom filters to be controlled independently.

Should we avoid exposing the user to the words "Bloom filter"?

The command-line option for writing them was genericized to
"changed-paths", which I think is good. The use of Bloom filters is an
implementation detail. What the user cares about is whether we can
optimize queries of which paths changed in a commit.

When we introduced reachability bitmaps long ago, we made the mistake of
just calling them "bitmaps". That jargon is well understood by people
who work with that code, but it's confusing outside of that (even within
other parts of Git) because bitmaps are just a generic data structure.
You can have a bitmap of just about anything (and indeed we do use other
bitmaps these days). Consistently calling them "reachability bitmaps",
especially in the user facing bits, would have reduced confusion over
the years.

Similarly, Bloom filters are a generic structure we might use elsewhere.
I don't really care if we use the word "Bloom" internally to refer to
this feature, but we'll be stuck with this config option for all time. I
think it's worth picking something more clear.

It might even be worth considering whether "changed paths" needs more
context (or would if we add new features in the future). On a "git
commit-graph write" command-line it is perfectly clear, but would
core.commitGraphChangedPaths be worth it? It's definitely more specific,
but it's also way more ugly. ;)

> diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> index 6d0c962438..5f585a1725 100644
> --- a/t/helper/test-read-graph.c
> +++ b/t/helper/test-read-graph.c
> @@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
>  	setup_git_directory();
>  	odb = the_repository->objects->odb;
>  
> +	prepare_repo_settings(the_repository);
> +
>  	graph = read_commit_graph_one(the_repository, odb);

I wondered why we would need this prepare_repo_settings() now, when it
should already have been needed to cover core.commitGraph. I
strongly suspect the answer is: "test-tool read-graph" never properly
respected core.commitGraph in the first place.

And now presumably it would. If true, I don't think any tests need to be
adjusted, because the only places we set it are:

  - on a "git -c" command line, which wouldn't run a test-tool helper

  - when we do set it, it is always to "true", which is the default
    anyway

>  	if (!graph)
>  		return 1;
>  
> -
>  	printf("header: %08x %d %d %d %d\n",
>  		ntohl(*(uint32_t*)graph->data),
>  		*(unsigned char*)(graph->data + 4),

Oh good, I happened to be looking at this code earlier today for an
unrelated reason and was bothered by this extra newline. :)

-Peff
Taylor Blau June 30, 2020, 7:27 p.m. UTC | #2
On Tue, Jun 30, 2020 at 03:18:34PM -0400, Jeff King wrote:
> On Tue, Jun 30, 2020 at 01:17:48PM -0400, Taylor Blau wrote:
>
> > Git uses the 'core.commitGraph' configuration value to control whether
> > or not the commit graph is used when parsing commits or performing a
> > traversal.
>
> I think this is a good thing to have, and the patch itself makes sense
> to me (this is actually my first time reviewing it, despite its intended
> use within GitHub :) ).
>
> If I may bikeshed for a moment:
>
> > Introduce 'core.useBloomFilters' to control whether or not Bloom filters
> > are read. Note that this configuration is independent from both:
> >
> >   - 'core.commitGraph', to allow flexibility in using all parts of a
> >     commit-graph _except_ for its Bloom filters.
> >
> >   - The '--changed-paths' option for 'git commit-graph write', to allow
> >     reading and writing Bloom filters to be controlled independently.
>
> Should we avoid exposing the user to the words "Bloom filter"?
>
> The command-line option for writing them was genericized to
> "changed-paths", which I think is good. The use of Bloom filters is an
> implementation detail. What the user cares about is whether we can
> optimize queries of which paths changed in a commit.
>
> When we introduced reachability bitmaps long ago, we made the mistake of
> just calling them "bitmaps". That jargon is well understood by people
> who work with that code, but it's confusing outside of that (even within
> other parts of Git) because bitmaps are just a generic data structure.
> You can have a bitmap of just about anything (and indeed we do use other
> bitmaps these days). Consistently calling them "reachability bitmaps",
> especially in the user facing bits, would have reduced confusion over
> the years.
>
> Similarly, Bloom filters are a generic structure we might use elsewhere.
> I don't really care if we use the word "Bloom" internally to refer to
> this feature, but we'll be stuck with this config option for all time. I
> think it's worth picking something more clear.

All good thoughts. I wondered about this, too, when writing the patch,
but ultimately decided to expose the name since this is the only usage
of Bloom filters within Git to date. Whether that will continue to be
true, I'm not sure, so it probably isn't a great idea to lock ourselves
into that decision within the 'core' namespace.

So, I'm certainly open to changing it, although I'm not sure that I'm as
worried about exposing the implementation detail as I am about squatting
on Bloom filters within Git in general. I don't think that this
configuration will end up getting used by folks other than server
administrators and for debugging purposes, so those populations are
already likely to be aware of changed-path Bloom filters beforehand.

But, hiding the implementation detail seems like sane advice either way.

> It might even be worth considering whether "changed paths" needs more
> context (or would if we add new features in the future). On a "git
> commit-graph write" command-line it is perfectly clear, but would
> core.commitGraphChangedPaths be worth it? It's definitely more specific,
> but it's also way more ugly. ;)

Here's a third option: what about 'graph.readChangedPaths'? I think that
Stolee and I discussed a new top-level 'graph' section, since we now
have a few commit-graph-related configuration variables in 'core'.

That's a little shorter, and it adds the verb 'read', which is more
descriptive than 'use' (I touch on this in the third patch, where I say
that this configuration variable _doesn't_ affect the '--changed-paths'
option when writing).

Either way, I'd love to hear your thoughts and others', too, to figure
out what we think the most agreeable configuration name is.

> > diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
> > index 6d0c962438..5f585a1725 100644
> > --- a/t/helper/test-read-graph.c
> > +++ b/t/helper/test-read-graph.c
> > @@ -12,11 +12,12 @@ int cmd__read_graph(int argc, const char **argv)
> >  	setup_git_directory();
> >  	odb = the_repository->objects->odb;
> >
> > +	prepare_repo_settings(the_repository);
> > +
> >  	graph = read_commit_graph_one(the_repository, odb);
>
> I wondered why we would need this prepare_repo_settings() now, when it
> should already have been needed to cover core.commitGraph. I
> strongly suspect the answer is: "test-tool read-graph" never properly
> respected core.commitGraph in the first place.

Yep. Could probably be broken out into a separate patch (or mentioned as
an aside in this one), but you're right: this helper did not respect
any configuration that 'prepare_repo_settings' picks up.

> And now presumably it would. If true, I don't think any tests need to be
> adjusted, because the only places we set it are:
>
>   - on a "git -c" command line, which wouldn't run a test-tool helper
>
>   - when we do set it, it is always to "true", which is the default
>     anyway
>
> >  	if (!graph)
> >  		return 1;
> >
> > -
> >  	printf("header: %08x %d %d %d %d\n",
> >  		ntohl(*(uint32_t*)graph->data),
> >  		*(unsigned char*)(graph->data + 4),
>
> Oh good, I happened to be looking at this code earlier today for an
> unrelated reason and was bothered by this extra newline. :)

I hoped that nobody would mind me sneaking this in ;-).

>
> -Peff
Thanks,
Taylor
Jeff King June 30, 2020, 7:33 p.m. UTC | #3
On Tue, Jun 30, 2020 at 03:27:18PM -0400, Taylor Blau wrote:

> So, I'm certainly open to changing it, although I'm not sure that I'm as
> worried about exposing the implementation detail as I am about squatting
> on Bloom filters within Git in general. I don't think that this
> configuration will end up getting used by folks other than server
> administrators and for debugging purposes, so those populations are
> already likely to be aware of changed-path Bloom filters beforehand.

Yeah, the squatting thing is definitely my bigger concern (having been
through the "bitmaps" version of the same thing).

> > It might even be worth considering whether "changed paths" needs more
> > context (or would if we add new features in the future). On a "git
> > commit-graph write" command-line it is perfectly clear, but would
> > core.commitGraphChangedPaths be worth it? It's definitely more specific,
> > but it's also way more ugly. ;)
> 
> Here's a third option: what about 'graph.readChangedPaths'? I think that
> Stolee and I discussed a new top-level 'graph' section, since we now
> have a few commit-graph-related configuration variables in 'core'.

Yes, I like that even better. Probably "graph" is sufficiently specific
within Git's context, though I guess it _could_ bring to mind "git log
--graph". So many overloaded terms. :)

> That's a little shorter, and it adds the verb 'read', which is more
> descriptive than 'use' (I touch on this in the third patch, where I say
> that this configuration variable _doesn't_ affect the '--changed-paths'
> option when writing).

Yeah, saying "read" specifically is much nicer.

> > > +	prepare_repo_settings(the_repository);
> > > +
> > >  	graph = read_commit_graph_one(the_repository, odb);
> >
> > I wondered why we would need this prepare_repo_settings() now, when it
> > should already have been needed to cover core.commitGraph. I
> > strongly suspect the answer is: "test-tool read-graph" never properly
> > respected core.commitGraph in the first place.
> 
> Yep. Could probably be broken out into a separate patch (or mentioned as
> an aside in this one), but you're right: this helper did not respect
> any configuration that 'prepare_repo_settings' picks up.

I'd probably just note it in the commit message, but I'd be fine with
that or with a separate patch.

-Peff
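
The prepare_repo_settings() point discussed above can be illustrated
with a small model of lazily prepared settings. This is a sketch under
stated assumptions, loosely following the shape of the repo-settings.c
hunk in this patch: all sketch_* names are hypothetical, the config
source is a plain struct rather than Git's config machinery, and the
model assumes an unprepared settings struct is zero-initialized (an
assumption of the sketch, not a claim about Git's sources). Under that
assumption, a caller that skips preparation reads every boolean as
false, whatever the documented default is.

```c
#include <string.h>

/* Hypothetical model of a lazily prepared settings struct. */
struct sketch_settings {
	int initialized;
	int core_commit_graph;
	int core_use_bloom_filters;
};

/* Hypothetical config source: -1 means "not set in any config file". */
struct sketch_config {
	int commit_graph;
	int use_bloom_filters;
};

/* Apply a default only if the field is still marked unset (-1). */
#define SKETCH_DEFAULT_BOOL(field, value) \
	do { if ((field) == -1) (field) = (value); } while (0)

static void sketch_prepare_settings(struct sketch_settings *s,
				    const struct sketch_config *cfg)
{
	if (s->initialized)
		return;

	/* Mark every field "unset" before reading any configuration. */
	memset(s, -1, sizeof(*s));

	/* Explicit configuration wins over defaults. */
	if (cfg->commit_graph != -1)
		s->core_commit_graph = cfg->commit_graph;
	if (cfg->use_bloom_filters != -1)
		s->core_use_bloom_filters = cfg->use_bloom_filters;

	/* Anything still unset falls back to its default of true. */
	SKETCH_DEFAULT_BOOL(s->core_commit_graph, 1);
	SKETCH_DEFAULT_BOOL(s->core_use_bloom_filters, 1);

	s->initialized = 1;
}
```

In this model a helper that reads settings without first calling
sketch_prepare_settings() would see core_use_bloom_filters == 0 and
silently ignore the filters, even though the default is true.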

Patch

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 74619a9c03..b146bf8d34 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -599,6 +599,11 @@  core.commitGraph::
 	to parse the graph structure of commits. Defaults to true. See
 	linkgit:git-commit-graph[1] for more information.
 
+core.useBloomFilters::
+	If true, then git will use the changed-path Bloom filters in the
+	commit-graph file (if it exists, and they are present). Defaults to
+	true. See linkgit:git-commit-graph[1] for more information.
+
 core.useReplaceRefs::
 	If set to `false`, behave as if the `--no-replace-objects`
 	option was given on the command line. See linkgit:git[1] and
diff --git a/commit-graph.c b/commit-graph.c
index fdfb0888f0..03c00415c4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -337,14 +337,14 @@  struct commit_graph *parse_commit_graph(struct repository *r,
 		case GRAPH_CHUNKID_BLOOMINDEXES:
 			if (graph->chunk_bloom_indexes)
 				chunk_repeated = 1;
-			else
+			else if (r->settings.core_use_bloom_filters)
 				graph->chunk_bloom_indexes = data + chunk_offset;
 			break;
 
 		case GRAPH_CHUNKID_BLOOMDATA:
 			if (graph->chunk_bloom_data)
 				chunk_repeated = 1;
-			else {
+			else if (r->settings.core_use_bloom_filters) {
 				uint32_t hash_version;
 				graph->chunk_bloom_data = data + chunk_offset;
 				hash_version = get_be32(data + chunk_offset);
diff --git a/repo-settings.c b/repo-settings.c
index dc6817daa9..d8e3b1c61e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -17,9 +17,12 @@  void prepare_repo_settings(struct repository *r)
 
 	if (!repo_config_get_bool(r, "core.commitgraph", &value))
 		r->settings.core_commit_graph = value;
+	if (!repo_config_get_bool(r, "core.usebloomfilters", &value))
+		r->settings.core_use_bloom_filters = value;
 	if (!repo_config_get_bool(r, "gc.writecommitgraph", &value))
 		r->settings.gc_write_commit_graph = value;
 	UPDATE_DEFAULT_BOOL(r->settings.core_commit_graph, 1);
+	UPDATE_DEFAULT_BOOL(r->settings.core_use_bloom_filters, 1);
 	UPDATE_DEFAULT_BOOL(r->settings.gc_write_commit_graph, 1);
 
 	if (!repo_config_get_int(r, "index.version", &value))
diff --git a/repository.h b/repository.h
index 3c1f7d54bd..cc61533122 100644
--- a/repository.h
+++ b/repository.h
@@ -29,6 +29,7 @@  struct repo_settings {
 	int initialized;
 
 	int core_commit_graph;
+	int core_use_bloom_filters;
 	int gc_write_commit_graph;
 	int fetch_write_commit_graph;
 
diff --git a/t/helper/test-read-graph.c b/t/helper/test-read-graph.c
index 6d0c962438..5f585a1725 100644
--- a/t/helper/test-read-graph.c
+++ b/t/helper/test-read-graph.c
@@ -12,11 +12,12 @@  int cmd__read_graph(int argc, const char **argv)
 	setup_git_directory();
 	odb = the_repository->objects->odb;
 
+	prepare_repo_settings(the_repository);
+
 	graph = read_commit_graph_one(the_repository, odb);
 	if (!graph)
 		return 1;
 
-
 	printf("header: %08x %d %d %d %d\n",
 		ntohl(*(uint32_t*)graph->data),
 		*(unsigned char*)(graph->data + 4),
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 0b4cc4f8d1..b1a247477e 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -90,7 +90,9 @@  do
 		      "--ancestry-path side..master"
 	do
 		test_expect_success "git log option: $option for path: $path" '
-			test_bloom_filters_used "$option -- $path"
+			test_bloom_filters_used "$option -- $path" &&
+			test_config core.useBloomFilters false &&
+			test_bloom_filters_not_used "$option -- $path"
 		'
 	done
 done
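
A note on spelling: the documentation and the test above write
'core.useBloomFilters', while repo-settings.c looks up
"core.usebloomfilters". Both refer to the same variable, because Git
config section and variable names are case-insensitive (subsection
names are not). The sketch below shows one way such a key could be
canonicalized before lookup; sketch_normalize_key() is a hypothetical
helper written for illustration, not a function from Git.

```c
#include <ctype.h>
#include <string.h>

/*
 * Lowercase the section (before the first dot) and the variable name
 * (after the last dot) in place, leaving any subsection in between
 * untouched. Returns 0 on success, -1 if the key contains no dot.
 */
static int sketch_normalize_key(char *key)
{
	char *first_dot = strchr(key, '.');
	char *last_dot = strrchr(key, '.');
	char *p;

	if (!first_dot)
		return -1;

	for (p = key; p < first_dot; p++)
		*p = (char)tolower((unsigned char)*p);
	for (p = last_dot + 1; *p; p++)
		*p = (char)tolower((unsigned char)*p);

	return 0;
}
```

Applied to "core.useBloomFilters" this yields "core.usebloomfilters",
the spelling used in the lookup; a key with a subsection, such as
"remote.MyFork.pushURL", keeps "MyFork" as written.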