diff mbox series

[v2,09/11] commit-graph: add --changed-paths option to write subcommand

Message ID 3d7ee0c96955dc15c87d04982d8cdec8b62750b2.1580943390.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Changed Paths Bloom Filters | expand

Commit Message

Linus Arver via GitGitGadget Feb. 5, 2020, 10:56 p.m. UTC
From: Garima Singh <garima.singh@microsoft.com>

Add --changed-paths option to git commit-graph write. This option will
allow users to compute information about the paths that have changed
between a commit and its first parent, and write it into the commit graph
file. If the option is passed to the write subcommand we set the
COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
commit-graph logic.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
---
 Documentation/git-commit-graph.txt | 5 +++++
 builtin/commit-graph.c             | 9 +++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

Comments

Jakub Narębski Feb. 20, 2020, 8:28 p.m. UTC | #1
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.

In the manpage you write that this operation (computing Bloom filters)
can take a while on large repositories.  Could you perhaps provide some
numbers: how much longer does it take to write commit-graph file with
and without '--changed-paths' for example for Linux kernel, or some
other large repository?  Thanks in advance.

>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)

What is missing is some sanity tests: that bloom index and bloom data
chunks are not present without '--changed-paths', and that they are
added with '--changed-paths'.

If possible, maybe also check in a separate test that the size of
bloom_index chunk agrees with the number of commits in the commit graph.


Also, we can now add those tests I have wrote about in my review of
previous patch, that is:

1. If you write commit-graph with --changed-paths, and either add some
   commits later or exclude some commits from the commit graph, then:

   a.) commit(s) in commit-graph have Bloom filter
   b.) commit(s) not in commit-graph do not have Bloom filter

2. If you write commit-graph without --changed-paths as base layer,
   and then write next layer with --changed-paths and --split, then:

   a.) commit(s) in top layer have Bloom filter(s)
   b.) commit(s) in bottom layer don't have Bloom filter(s)

>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++

Should we write about limitation that the topmost layer in the split
commit graph needs to be written with '--changed-paths' for Git to use
this information?  Or perhaps we should try (in the future) to remove
this limitation??

>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>

All right.

> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>  	int split;
>  	int shallow;
>  	int progress;
> +	int enable_changed_paths;

Bikeshed painting: should this field be called enable_changed_paths or
simply changed_paths?

>  } opts;
>  
>  static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
>  			N_("start walk at commits listed by stdin")),
>  		OPT_BOOL(0, "append", &opts.append,
>  			N_("include all commits already in the commit-graph file")),
> +		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +			N_("enable computation for changed paths")),
>  		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>  		OPT_BOOL(0, "split", &opts.split,
>  			N_("allow writing an incremental commit-graph file")),

All right.

> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +	if (opts.enable_changed_paths)
> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>  
>  	read_replace_refs = 0;

All right.  This actually turns on calculation Bloom filters for changed
paths, thanks to

 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;

that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
filters for changed paths" patch.

Though... should this enabling be split into two separate patches like
this?


Best,
Bryan Turner Feb. 20, 2020, 10:10 p.m. UTC | #2
On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Garima Singh <garima.singh@microsoft.com>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can

"its first parent"

(Pardon the grammar nit from the peanut gallery!)

> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++
>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>
>  static char const * const builtin_commit_graph_usage[] = {
>         N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>         NULL
>  };
>
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>
>  static const char * const builtin_commit_graph_write_usage[] = {
> -       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>         NULL
>  };
>
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>         int split;
>         int shallow;
>         int progress;
> +       int enable_changed_paths;
>  } opts;
>
>  static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
>                         N_("start walk at commits listed by stdin")),
>                 OPT_BOOL(0, "append", &opts.append,
>                         N_("include all commits already in the commit-graph file")),
> +               OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +                       N_("enable computation for changed paths")),
>                 OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>                 OPT_BOOL(0, "split", &opts.split,
>                         N_("allow writing an incremental commit-graph file")),
> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>                 flags |= COMMIT_GRAPH_WRITE_SPLIT;
>         if (opts.progress)
>                 flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +       if (opts.enable_changed_paths)
> +               flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>
>         read_replace_refs = 0;
>
> --
> gitgitgadget
>
Garima Singh Feb. 22, 2020, 1:44 a.m. UTC | #3
On 2/20/2020 5:10 PM, Bryan Turner wrote:
> On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>  With the `--append` option, include all commits that are present in the
>>  existing commit-graph file.
>>  +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
> 
> "its first parent"
> 
> (Pardon the grammar nit from the peanut gallery!)
> 

:)
Thank you! Fixed in v3. 

Cheers! 
Garima Singh
Garima Singh Feb. 24, 2020, 9:51 p.m. UTC | #4
On 2/20/2020 3:28 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Add --changed-paths option to git commit-graph write. This option will
>> allow users to compute information about the paths that have changed
>> between a commit and its first parent, and write it into the commit graph
>> file. If the option is passed to the write subcommand we set the
>> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
>> commit-graph logic.
> 
> In the manpage you write that this operation (computing Bloom filters)
> can take a while on large repositories.  Could you perhaps provide some
> numbers: how much longer does it take to write commit-graph file with
> and without '--changed-paths' for example for Linux kernel, or some
> other large repository?  Thanks in advance.
> 

Yes. Will include numbers as appropriate in v3. 

>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>>  Documentation/git-commit-graph.txt | 5 +++++
>>  builtin/commit-graph.c             | 9 +++++++--
>>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> What is missing is some sanity tests: that bloom index and bloom data
> chunks are not present without '--changed-paths', and that they are
> added with '--changed-paths'.
> 
> If possible, maybe also check in a separate test that the size of
> bloom_index chunk agrees with the number of commits in the commit graph.
> 
> 
> Also, we can now add those tests I have wrote about in my review of
> previous patch, that is:
> 
> 1. If you write commit-graph with --changed-paths, and either add some
>    commits later or exclude some commits from the commit graph, then:
> 
>    a.) commit(s) in commit-graph have Bloom filter
>    b.) commit(s) not in commit-graph do not have Bloom filter
> 
> 2. If you write commit-graph without --changed-paths as base layer,
>    and then write next layer with --changed-paths and --split, then:
> 
>    a.) commit(s) in top layer have Bloom filter(s)
>    b.) commit(s) in bottom layer don't have Bloom filter(s)
> 

I will see what more can be done here. 

>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>  With the `--append` option, include all commits that are present in the
>>  existing commit-graph file.
>>  +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
>> +take a while on large repositories. It provides significant performance gains
>> +for getting history of a directory or a file with `git log -- <path>`.
>> ++
> 
> Should we write about limitation that the topmost layer in the split
> commit graph needs to be written with '--changed-paths' for Git to use
> this information?  Or perhaps we should try (in the future) to remove
> this limitation??
> 

Given that this information is going to be used best effort, it would be 
superfluous to describe every case and conditional that decides whether 
this information is being used.
>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>  	if (opts.progress)
>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>> +	if (opts.enable_changed_paths)
>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>  
>>  	read_replace_refs = 0;
> 
> All right.  This actually turns on calculation Bloom filters for changed
> paths, thanks to
> 
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> 
> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
> filters for changed paths" patch.
> 
> Though... should this enabling be split into two separate patches like
> this?
> 

The idea is that in 4/11 We compute only if the flag is set. 
And between that patch and this one: we prepare the foundational code 
that is now ready for that flag to be set via an opt-in by the user. 

> 
> Best,
>
Jakub Narębski Feb. 25, 2020, 12:10 p.m. UTC | #5
Garima Singh <garimasigit@gmail.com> writes:
> On 2/20/2020 3:28 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes:

[...]
>>> --- a/Documentation/git-commit-graph.txt
>>> +++ b/Documentation/git-commit-graph.txt
>>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>>  With the `--append` option, include all commits that are present in the
>>>  existing commit-graph file.
>>>  +
>>> +With the `--changed-paths` option, compute and write information about the
>>> +paths changed between a commit and it's first parent. This operation can
>>> +take a while on large repositories. It provides significant performance gains
>>> +for getting history of a directory or a file with `git log -- <path>`.
>>> ++
>> 
>> Should we write about limitation that the topmost layer in the split
>> commit graph needs to be written with '--changed-paths' for Git to use
>> this information?  Or perhaps we should try (in the future) to remove
>> this limitation?
>
> Given that this information is going to be used best effort, it would be 
> superfluous to describe every case and conditional that decides whether 
> this information is being used.

I can somewhat agree with this reasoning.

However what I would like to avoid is surprising users.  If one creates
base commit-graph with Bloom filters data, but then when creating
new layer of commit-graph (updating it incrementally), it may be
surprising that `git log -- <path>` is now much slower.

On the other hand if one would update commit-graph in a non-incremental
way (rewriting the commit-graph file), loosing the Bloom filter
information and performance of `git log -- <path>` because one forgot to
include `--changed-paths` is not that unexpected.

Anyway, in the future when this mechanism will be controlled by
appropriate config variable, this whole discussion would become somewhat
moot.


Thought for the future: perhaps `git commit-graph verify` could detect
that split graph has Bloom filters only for some layers, and inform the
user?  But that is almost certainly out of scope of this patch series.

>>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>>  	if (opts.progress)
>>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>>> +	if (opts.enable_changed_paths)
>>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>>  
>>>  	read_replace_refs = 0;
>> 
>> All right.  This actually turns on calculation Bloom filters for changed
>> paths, thanks to
>> 
>>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>> 
>> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
>> filters for changed paths" patch.
>> 
>> Though... should this enabling be split into two separate patches like
>> this?
>
> The idea is that in 4/11 We compute only if the flag is set. 
> And between that patch and this one: we prepare the foundational code 
> that is now ready for that flag to be set via an opt-in by the user. 

All right.

Choosing how to split large change into series is not easy.  One one
hand one would want for each change to be small and self contained.  On
the other hand it would be good if each change was testable (test-tool
can help here).

Best,
diff mbox series

Patch

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index bcd85c1976..907d703b30 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -54,6 +54,11 @@  or `--stdin-packs`.)
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
 +
+With the `--changed-paths` option, compute and write information about the
+paths changed between a commit and it's first parent. This operation can
+take a while on large repositories. It provides significant performance gains
+for getting history of a directory or a file with `git log -- <path>`.
++
 With the `--split` option, write the commit-graph as a chain of multiple
 commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
 not already in the commit-graph are added in a new "tip" file. This file
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index e0c6fc4bbf..261dcce091 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@ 
 
 static char const * const builtin_commit_graph_usage[] = {
 	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
 	NULL
 };
 
@@ -19,7 +19,7 @@  static const char * const builtin_commit_graph_verify_usage[] = {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
+	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
 	NULL
 };
 
@@ -32,6 +32,7 @@  static struct opts_commit_graph {
 	int split;
 	int shallow;
 	int progress;
+	int enable_changed_paths;
 } opts;
 
 static int graph_verify(int argc, const char **argv)
@@ -110,6 +111,8 @@  static int graph_write(int argc, const char **argv)
 			N_("start walk at commits listed by stdin")),
 		OPT_BOOL(0, "append", &opts.append,
 			N_("include all commits already in the commit-graph file")),
+		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
+			N_("enable computation for changed paths")),
 		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
 		OPT_BOOL(0, "split", &opts.split,
 			N_("allow writing an incremental commit-graph file")),
@@ -143,6 +146,8 @@  static int graph_write(int argc, const char **argv)
 		flags |= COMMIT_GRAPH_WRITE_SPLIT;
 	if (opts.progress)
 		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
+	if (opts.enable_changed_paths)
+		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
 
 	read_replace_refs = 0;