[2/3] commit: write commit-graph with bloom filters

Message ID	bb5ce39d0283f14095d87dd0645e641ae799f16c.1586566981.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=KYeX=53=vger.kernel.org=git-owner@kernel.org> Message-Id: <bb5ce39d0283f14095d87dd0645e641ae799f16c.1586566981.git.gitgitgadget@gmail.com> In-Reply-To: <pull.609.git.1586566981.gitgitgadget@gmail.com> References: <pull.609.git.1586566981.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Sat, 11 Apr 2020 01:03:00 +0000 Subject: [PATCH 2/3] commit: write commit-graph with bloom filters Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: me@ttaylorr.com, jnareb@gmail.com, garimasigit@gmail.com, Derrick Stolee <dstolee@microsoft.com>, Derrick Stolee <dstolee@microsoft.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Integrate changed-path Bloom filters with 'git blame' \| expand [0/3] Integrate changed-path Bloom filters with 'git blame' [1/3] revision: complicated pathspecs disable filters [2/3] commit: write commit-graph with bloom filters [3/3] blame: use changed-path Bloom filters

Message ID

bb5ce39d0283f14095d87dd0645e641ae799f16c.1586566981.git.gitgitgadget@gmail.com (mailing list archive)

State

New, archived

Headers

Message-Id: 
 <bb5ce39d0283f14095d87dd0645e641ae799f16c.1586566981.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.609.git.1586566981.gitgitgadget@gmail.com>
References: <pull.609.git.1586566981.gitgitgadget@gmail.com>
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Date: Sat, 11 Apr 2020 01:03:00 +0000
Subject: [PATCH 2/3] commit: write commit-graph with bloom filters
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: me@ttaylorr.com, jnareb@gmail.com, garimasigit@gmail.com,
        Derrick Stolee <dstolee@microsoft.com>,
        Derrick Stolee <dstolee@microsoft.com>
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

Integrate changed-path Bloom filters with 'git blame' | expand

Commit Message

John Passaro via GitGitGadget April 11, 2020, 1:03 a.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
graph file whenever "git commit" is run, ensuring that we always
have an updated commit-graph throughout the test suite. The
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
introduced to write the changed-path Bloom filters whenever "git
commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
trick doesn't launch a separate process and instead writes it
directly.

Update the "git commit" builtin to write changed-path Bloom filters
when both GIT_TEST_COMMIT_GRAPH and GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
are enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Junio C Hamano April 11, 2020, 9:57 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/commit.c b/builtin/commit.c
> index d3e7781e658..d2fdfdc4363 100644
> --- a/builtin/commit.c
> +++ b/builtin/commit.c
> @@ -1701,7 +1701,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
>  		      "not exceeded, and then \"git restore --staged :/\" to recover."));
>  
>  	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
> -	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
> +	    write_commit_graph_reachable(the_repository->objects->odb,
> +					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
> +					 NULL))

Yuck.  That is doubly mouthful.

I wonder how much value we are getting by having this call here in
the first place.  This function being cmd_commit(), by definition we
won't be updating the graph file unless the test script does "git
commit".  We won't update the graph file upon "git rebase", "git
merge", "git pull", "git fetch", etc., so it is not like with this
hack, the test coverage for various traversal using the graph file
would magically be perfect.  We'd still need an explicit call to
"commit-graph write" at strategic places in the test scripts, no?

How are we testing with and without bitmaps, and do we have a kludge
like this one for the feature, or do we use a different mechanism
to help writing tests easier to gain better coverage?

In any case, I can see why we want a separate knob specific to the
CHANGED_PATHS until the path filter part becomes as stable as the
rest of the graph file, but after some time, we should remove that
knob, and at that point, we always use the path filter whenever we
use commit-graph, so that we won't have to suffer from the ugliness.

Wait.  I wonder if we can tweak the arrangement a bit so that this
layer does not need to know any more than GIT_TEST_COMMIT_GRAPH?

For example, in commit-graph.c::write_commit_graph(), can we test
the CHANGED_PATHS environment variable and flip the .changed_paths
bit, no matter what the 'flags' argument to the function says?

Thanks.

>  		return 1;
>  
>  	repo_rerere(the_repository, 0);

Taylor Blau April 12, 2020, 8:51 p.m. UTC | #2

On Sat, Apr 11, 2020 at 02:57:18PM -0700, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > diff --git a/builtin/commit.c b/builtin/commit.c
> > index d3e7781e658..d2fdfdc4363 100644
> > --- a/builtin/commit.c
> > +++ b/builtin/commit.c
> > @@ -1701,7 +1701,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
> >  		      "not exceeded, and then \"git restore --staged :/\" to recover."));
> >
> >  	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
> > -	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
> > +	    write_commit_graph_reachable(the_repository->objects->odb,
> > +					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
> > +					 NULL))
>
> Yuck.  That is doubly mouthful.

Yeah, this is quite a mouthful indeed. I think the most conservative fix
would be something like:

  if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) {
    enum commit_graph_write_flags flags = 0;
    if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
      flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;

    if (write_commit_graph_reachable(the_repository->objects->odb,
                                     flags, NULL)) {
      /* ... */
    }
  }

and then ripping this knob out (as Junio suggests below, which I think
is a wise suggestion) out would be straightforward.

> I wonder how much value we are getting by having this call here in
> the first place.  This function being cmd_commit(), by definition we
> won't be updating the graph file unless the test script does "git
> commit".  We won't update the graph file upon "git rebase", "git
> merge", "git pull", "git fetch", etc., so it is not like with this
> hack, the test coverage for various traversal using the graph file
> would magically be perfect.  We'd still need an explicit call to
> "commit-graph write" at strategic places in the test scripts, no?

Yeah. It's definitely not a silver bullet for the reasons you mention,
but I think that it is helping out in some common cases. For example,
if we moved this check out to 'test_commit', then we'd be relying on
tests to use that instead of 'git commit'. On the other hand, this
catches either, since they both wind up in this builtin.

I guess you could push this check down much further to when we're about
to write a new object, and--if that new object is a commit--update the
commit-graph. That sounds awfully slow (and feels to me like a hack, but
I can't justify whether it's more or less of a hack than we already
have), but I think it would be OK, since this isn't the normal way to
run tests.

> How are we testing with and without bitmaps, and do we have a kludge
> like this one for the feature, or do we use a different mechanism
> to help writing tests easier to gain better coverage?

As far as I know after a brief search, we have no similar such mechanism
for bitmaps, so commit-graphs are a unique instance.

> In any case, I can see why we want a separate knob specific to the
> CHANGED_PATHS until the path filter part becomes as stable as the
> rest of the graph file, but after some time, we should remove that
> knob, and at that point, we always use the path filter whenever we
> use commit-graph, so that we won't have to suffer from the ugliness.
>
> Wait.  I wonder if we can tweak the arrangement a bit so that this
> layer does not need to know any more than GIT_TEST_COMMIT_GRAPH?
>
> For example, in commit-graph.c::write_commit_graph(), can we test
> the CHANGED_PATHS environment variable and flip the .changed_paths
> bit, no matter what the 'flags' argument to the function says?

It may be cleaner to have 'GIT_TEST_COMMIT_GRAPH' specify the flags that
it wants as its value, but the additional coupling makes me somewhat
uncomfortable.

> Thanks.
>
> >  		return 1;
> >
> >  	repo_rerere(the_repository, 0);

Thanks,
Taylor

Derrick Stolee April 13, 2020, 12:08 p.m. UTC | #3

On 4/12/2020 4:51 PM, Taylor Blau wrote:
> On Sat, Apr 11, 2020 at 02:57:18PM -0700, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> diff --git a/builtin/commit.c b/builtin/commit.c
>>> index d3e7781e658..d2fdfdc4363 100644
>>> --- a/builtin/commit.c
>>> +++ b/builtin/commit.c
>>> @@ -1701,7 +1701,9 @@ int cmd_commit(int argc, const char **argv, const char *prefix)
>>>  		      "not exceeded, and then \"git restore --staged :/\" to recover."));
>>>
>>>  	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
>>> -	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
>>> +	    write_commit_graph_reachable(the_repository->objects->odb,
>>> +					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
>>> +					 NULL))
>>
>> Yuck.  That is doubly mouthful.
> 
> Yeah, this is quite a mouthful indeed. I think the most conservative fix
> would be something like:
> 
>   if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0)) {
>     enum commit_graph_write_flags flags = 0;
>     if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
>       flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
> 
>     if (write_commit_graph_reachable(the_repository->objects->odb,
>                                      flags, NULL)) {
>       /* ... */
>     }
>   }
> 
> and then ripping this knob out (as Junio suggests below, which I think
> is a wise suggestion) out would be straightforward.
> 
>> I wonder how much value we are getting by having this call here in
>> the first place.  This function being cmd_commit(), by definition we
>> won't be updating the graph file unless the test script does "git
>> commit".  We won't update the graph file upon "git rebase", "git
>> merge", "git pull", "git fetch", etc., so it is not like with this
>> hack, the test coverage for various traversal using the graph file
>> would magically be perfect.  We'd still need an explicit call to
>> "commit-graph write" at strategic places in the test scripts, no?
> 
> Yeah. It's definitely not a silver bullet for the reasons you mention,
> but I think that it is helping out in some common cases. For example,
> if we moved this check out to 'test_commit', then we'd be relying on
> tests to use that instead of 'git commit'. On the other hand, this
> catches either, since they both wind up in this builtin.
> 
> I guess you could push this check down much further to when we're about
> to write a new object, and--if that new object is a commit--update the
> commit-graph. That sounds awfully slow (and feels to me like a hack, but
> I can't justify whether it's more or less of a hack than we already
> have), but I think it would be OK, since this isn't the normal way to
> run tests.

If we keep this simple, or extract the process to a
"write_commit_graph_for_tests()" macro inside builtin.h, then we could
insert a commit-graph write in more places.

However, I think there is value in testing the "not every commit is
included in the commit-graph" case, which our current setup does quite
nicely. The one exception is that 'git merge' could benefit from this
snippet, so we cap off the commit-graph whenever a test script
"constructs" a repository to match a data shape.

>> How are we testing with and without bitmaps, and do we have a kludge
>> like this one for the feature, or do we use a different mechanism
>> to help writing tests easier to gain better coverage?
> 
> As far as I know after a brief search, we have no similar such mechanism
> for bitmaps, so commit-graphs are a unique instance.

I would not be against a similar mechanism for bitmaps, but this is
really important for commit-graphs. The difference lies in which users
have commit-graphs versus bitmaps. Now that commit-graphs are on by
default and written by 'git gc', client users get commit-graphs unless
they opt-out. We need to test as many scenarios as possible to avoid
causing them disruption.

Who uses bitmaps? Primarily Git servers. The scenarios run on those
repos are much more restricted, and those scenarios are tested with
bitmaps explicitly.

>> In any case, I can see why we want a separate knob specific to the
>> CHANGED_PATHS until the path filter part becomes as stable as the
>> rest of the graph file, but after some time, we should remove that
>> knob, and at that point, we always use the path filter whenever we
>> use commit-graph, so that we won't have to suffer from the ugliness.
>>
>> Wait.  I wonder if we can tweak the arrangement a bit so that this
>> layer does not need to know any more than GIT_TEST_COMMIT_GRAPH?
>>
>> For example, in commit-graph.c::write_commit_graph(), can we test
>> the CHANGED_PATHS environment variable and flip the .changed_paths
>> bit, no matter what the 'flags' argument to the function says?
> 
> It may be cleaner to have 'GIT_TEST_COMMIT_GRAPH' specify the flags that
> it wants as its value, but the additional coupling makes me somewhat
> uncomfortable.

It would be easy to insert it here: 

diff --git a/commit-graph.c b/commit-graph.c
index 77668629e2..3e8f36ce5c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1968,6 +1968,8 @@ int write_commit_graph(struct object_directory *odb,
        ctx->check_oids = flags & COMMIT_GRAPH_WRITE_CHECK_OIDS ? 1 : 0;
        ctx->split_opts = split_opts;
        ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
+       if (git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0))
+               ctx->changed_paths = 1;
        ctx->total_bloom_filter_data_size = 0;
 
        if (ctx->split) {

The problem here is that if GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1 and
someone runs "git commit-graph write --no-changed-paths" then the
negation of that option is ignored. But this is a GIT_TEST_* variable,
and any test that requires that check could disable the enviroment
variable first.

Thanks,
-Stolee

Junio C Hamano April 13, 2020, 10:11 p.m. UTC | #4

Derrick Stolee <stolee@gmail.com> writes:

> If we keep this simple, or extract the process to a
> "write_commit_graph_for_tests()" macro inside builtin.h, then we could
> insert a commit-graph write in more places.

As long as the check necessary is cheap enough to realize that we
are in production mode, we should be able to keep the run-time
overhead to the minimum.  Sprinkling such a call all over the place,
however, might add to the overhead of reading code, though.

> However, I think there is value in testing the "not every commit is
> included in the commit-graph" case, which our current setup does quite
> nicely.

Yeah, that is also a good point.

> The one exception is that 'git merge' could benefit from this
> snippet, so we cap off the commit-graph whenever a test script
> "constructs" a repository to match a data shape.

Sounds good.

> The problem here is that if GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=1 and
> someone runs "git commit-graph write --no-changed-paths" then the
> negation of that option is ignored. But this is a GIT_TEST_* variable,
> and any test that requires that check could disable the enviroment
> variable first.

Yeah, that sounds good.

diff --git a/builtin/commit.c b/builtin/commit.c
index d3e7781e658..d2fdfdc4363 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1701,7 +1701,9 @@  int cmd_commit(int argc, const char **argv, const char *prefix)
 		      "not exceeded, and then \"git restore --staged :/\" to recover."));
 
 	if (git_env_bool(GIT_TEST_COMMIT_GRAPH, 0) &&
-	    write_commit_graph_reachable(the_repository->objects->odb, 0, NULL))
+	    write_commit_graph_reachable(the_repository->objects->odb,
+					 git_env_bool(GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS, 0) ? COMMIT_GRAPH_WRITE_BLOOM_FILTERS : 0,
+					 NULL))
 		return 1;
 
 	repo_rerere(the_repository, 0);

[2/3] commit: write commit-graph with bloom filters

Commit Message

Comments

Patch