[v2,2/4] commit: write commit-graph with Bloom filters

Message ID	7e8f1aed1138ab2a52a8957ac95895ac9effd933.1586789126.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=K77S=55=vger.kernel.org=git-owner@kernel.org> Message-Id: <7e8f1aed1138ab2a52a8957ac95895ac9effd933.1586789126.git.gitgitgadget@gmail.com> In-Reply-To: <pull.609.v2.git.1586789126.gitgitgadget@gmail.com> References: <pull.609.git.1586566981.gitgitgadget@gmail.com> <pull.609.v2.git.1586789126.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Mon, 13 Apr 2020 14:45:24 +0000 Subject: [PATCH v2 2/4] commit: write commit-graph with Bloom filters Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, jnareb@gmail.com, garimasigit@gmail.com, Derrick Stolee <dstolee@microsoft.com>, Derrick Stolee <dstolee@microsoft.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Integrate changed-path Bloom filters with 'git blame' \| expand [v2,0/4] Integrate changed-path Bloom filters with 'git blame' [v2,1/4] revision: complicated pathspecs disable filters [v2,2/4] commit: write commit-graph with Bloom filters [v2,3/4] commit-graph: write commit-graph in more tests [v2,4/4] blame: use changed-path Bloom filters

John Passaro via GitGitGadget April 13, 2020, 2:45 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
graph file whenever "git commit" is run, ensuring that we always
have an updated commit-graph throughout the test suite. The
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
introduced to write the changed-path Bloom filters whenever "git
commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
trick doesn't launch a separate process and instead writes it
directly.

Update the "git commit" builtin to write changed-path Bloom filters
when both GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
are enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 2 ++
 1 file changed, 2 insertions(+)

Taylor Blau April 13, 2020, 4:12 p.m. UTC | #1

On Mon, Apr 13, 2020 at 02:45:24PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
> graph file whenever "git commit" is run, ensuring that we always
> have an updated commit-graph throughout the test suite. The
> GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
> introduced to write the changed-path Bloom filters whenever "git
> commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
> trick doesn't launch a separate process and instead writes it
> directly.
>
> Update the "git commit" builtin to write changed-path Bloom filters
> when both GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
> are enabled.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 77668629e27..3e8f36ce5c8 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1968,6 +1968,8 @@ int write_commit_graph(struct object_directory *odb,
>  	ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
>  	ctx->split_opts = split_opts;
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> +	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
> +		ctx->changed_paths = 1;

Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
environment variables and potentially ignoring its arguments, but given
the discussion we had in v1, I don't feel strongly enough to recommend
that you change this.

For what it's worth, I think that 'write_commit_graph' could behave more
purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
'flags' appropriately from the outside, but I can understand also why
it's preferred to do it in this style, too.

>  	ctx->total_bloom_filter_data_size = 0;
>
>  	if (ctx->split) {
> --
> gitgitgadget
>

Thanks,
Taylor

Junio C Hamano April 13, 2020, 10:21 p.m. UTC | #2

Taylor Blau <me@ttaylorr.com> writes:

> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
> environment variables and potentially ignoring its arguments, but given
> the discussion we had in v1, I don't feel strongly enough to recommend
> that you change this.
>
> For what it's worth, I think that 'write_commit_graph' could behave more
> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
> 'flags' appropriately from the outside,...

Yeah, I agree that it would be a lot cleaner if that is easy to
arrange.  I have a suspicion that Derrick already tried and the
resulting code looked messier and was discarded?

Derrick Stolee April 14, 2020, 3:04 p.m. UTC | #3

On 4/13/2020 6:21 PM, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
> 
>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
>> environment variables and potentially ignoring its arguments, but given
>> the discussion we had in v1, I don't feel strongly enough to recommend
>> that you change this.
>>
>> For what it's worth, I think that 'write_commit_graph' could behave more
>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
>> 'flags' appropriately from the outside,...
> 
> Yeah, I agree that it would be a lot cleaner if that is easy to
> arrange.  I have a suspicion that Derrick already tried and the
> resulting code looked messier and was discarded?

Perhaps I could fix both concerns by

1. Use a macro instead of a library call.

2. Check the _CHANGED_PATHS variable in the macro.

The macro would look like this:


#define git_test_write_commit_graph_or_die() \
	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) { \
		int flags = 0; \
		if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0)) \
			flags = COMMIT_GRAPH_WRITE_BLOOM_FILTERS; \
		if (write_commit_graph_reachable( \
			the_repository->objects->odb, flags, NULL)) \
			die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH"); \
	}

Thanks,
-Stolee

Junio C Hamano April 14, 2020, 5:26 p.m. UTC | #4

Derrick Stolee <stolee@gmail.com> writes:

> On 4/13/2020 6:21 PM, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>> 
>>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
>>> environment variables and potentially ignoring its arguments, but given
>>> the discussion we had in v1, I don't feel strongly enough to recommend
>>> that you change this.
>>>
>>> For what it's worth, I think that 'write_commit_graph' could behave more
>>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
>>> 'flags' appropriately from the outside,...
>> 
>> Yeah, I agree that it would be a lot cleaner if that is easy to
>> arrange.  I have a suspicion that Derrick already tried and the
>> resulting code looked messier and was discarded?
>
> Perhaps I could fix both concerns by
>
> 1. Use a macro instead of a library call.
>
> 2. Check the _CHANGED_PATHS variable in the macro.

I am not sure how use of a macro "fixes" purity, though.  And what
is the other concern?

How widely would this "if we are testing, write out the graph file"
call be sprinkled over the codebase?  I am hoping that it won't be
"everywhere", but only at strategic places (like "just once before
we leave a subcommand that creates one or bunch of commits").  And
how often would they be called?  I am also hoping that it won't be
"inside a tight loop".  In short, I am wondering if we can promise
our codebase that 

 - git_test_write_commit_graph_or_die() calls won't be an eyesore
   (and/or distraction) for developers too much.

 - git_env_bool() call won't have performance impact.


> The macro would look like this:
>
>
> #define git_test_write_commit_graph_or_die() \
> 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) { \
> 		int flags = 0; \
> 		if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0)) \
> 			flags = COMMIT_GRAPH_WRITE_BLOOM_FILTERS; \
> 		if (write_commit_graph_reachable( \
> 			the_repository->objects->odb, flags, NULL)) \
> 			die("failed to write commit-graph under GIT_TEST_COMMIT_GRAPH"); \
> 	}
>
> Thanks,
> -Stolee

Derrick Stolee April 14, 2020, 5:40 p.m. UTC | #5

On 4/14/2020 1:26 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>> On 4/13/2020 6:21 PM, Junio C Hamano wrote:
>>> Taylor Blau <me@ttaylorr.com> writes:
>>>
>>>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
>>>> environment variables and potentially ignoring its arguments, but given
>>>> the discussion we had in v1, I don't feel strongly enough to recommend
>>>> that you change this.
>>>>
>>>> For what it's worth, I think that 'write_commit_graph' could behave more
>>>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
>>>> 'flags' appropriately from the outside,...
>>>
>>> Yeah, I agree that it would be a lot cleaner if that is easy to
>>> arrange.  I have a suspicion that Derrick already tried and the
>>> resulting code looked messier and was discarded?
>>
>> Perhaps I could fix both concerns by
>>
>> 1. Use a macro instead of a library call.
>>
>> 2. Check the _CHANGED_PATHS variable in the macro.
> 
> I am not sure how use of a macro "fixes" purity, though.  And what
> is the other concern?

The concern was (1) checking the environment and (2) die()ing in the
library.

> How widely would this "if we are testing, write out the graph file"
> call be sprinkled over the codebase?  I am hoping that it won't be
> "everywhere", but only at strategic places (like "just once before
> we leave a subcommand that creates one or bunch of commits").  And
> how often would they be called?  I am also hoping that it won't be
> "inside a tight loop".  In short, I am wondering if we can promise
> our codebase that 
> 
>  - git_test_write_commit_graph_or_die() calls won't be an eyesore
>    (and/or distraction) for developers too much.
> 
>  - git_env_bool() call won't have performance impact.

I could add a comment to the header file to say "this is only for
improving coverage of optional features in the test suite. Do not
call this method unless you know what you are doing."

My intention right now is to only include this in 'git commit'
and 'git merge'. Earlier discussion included thoughts about
'git rebase' and similar, but those are more rarely used when
constructing an "example repo" in the test scripts.

-Stolee

Taylor Blau April 15, 2020, 12:17 a.m. UTC | #6

On Tue, Apr 14, 2020 at 01:40:53PM -0400, Derrick Stolee wrote:
> On 4/14/2020 1:26 PM, Junio C Hamano wrote:
> > Derrick Stolee <stolee@gmail.com> writes:
> >
> >> On 4/13/2020 6:21 PM, Junio C Hamano wrote:
> >>> Taylor Blau <me@ttaylorr.com> writes:
> >>>
> >>>> Hmm. I'm not crazy about a library function looking at 'GIT_TEST_*'
> >>>> environment variables and potentially ignoring its arguments, but given
> >>>> the discussion we had in v1, I don't feel strongly enough to recommend
> >>>> that you change this.
> >>>>
> >>>> For what it's worth, I think that 'write_commit_graph' could behave more
> >>>> purely if callers checked 'GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS' and set
> >>>> 'flags' appropriately from the outside,...
> >>>
> >>> Yeah, I agree that it would be a lot cleaner if that is easy to
> >>> arrange.  I have a suspicion that Derrick already tried and the
> >>> resulting code looked messier and was discarded?
> >>
> >> Perhaps I could fix both concerns by
> >>
> >> 1. Use a macro instead of a library call.
> >>
> >> 2. Check the _CHANGED_PATHS variable in the macro.
> >
> > I am not sure how use of a macro "fixes" purity, though.  And what
> > is the other concern?
>
> The concern was (1) checking the environment and (2) die()ing in the
> library.

I think that we might as well use a plain old function here instead of a
macro, especially since the macro is just expanding out to what that
function would be if we had written it that way.

For what it's worth, my concern was more that I don't mind dying, but
I'd rather do so in a calling function, and that by the time we're
calling 'write_commit_graph', that we should be using return values
instead of 'die()' or 'error()'.

> > How widely would this "if we are testing, write out the graph file"
> > call be sprinkled over the codebase?  I am hoping that it won't be
> > "everywhere", but only at strategic places (like "just once before
> > we leave a subcommand that creates one or bunch of commits").  And
> > how often would they be called?  I am also hoping that it won't be
> > "inside a tight loop".  In short, I am wondering if we can promise
> > our codebase that
> >
> >  - git_test_write_commit_graph_or_die() calls won't be an eyesore
> >    (and/or distraction) for developers too much.
> >
> >  - git_env_bool() call won't have performance impact.
>
> I could add a comment to the header file to say "this is only for
> improving coverage of optional features in the test suite. Do not
> call this method unless you know what you are doing."

Works for me.

> My intention right now is to only include this in 'git commit'
> and 'git merge'. Earlier discussion included thoughts about
> 'git rebase' and similar, but those are more rarely used when
> constructing an "example repo" in the test scripts.

Ditto. I think that there's benefit in having it in some places but not
all (as Stolee pointed out earlier in the thread), and that trying to
have it everywhere is a fool's errand.

> -Stolee

Thanks,
Taylor

[v2,2/4] commit: write commit-graph with Bloom filters

Commit Message

Comments

Patch