diff mbox series

bloom: ignore renames when computing changed paths

Message ID pull.601.git.1586363907252.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series bloom: ignore renames when computing changed paths | expand

Commit Message

Linus Arver via GitGitGadget April 8, 2020, 4:38 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

The changed-path Bloom filters record an entry in the filter for
every path that was changed. This includes every add and delete,
regardless of whther a rename was detected. Detecting renames
causes significant performance issues, but also will trigger
downloading missing blobs in partial clone.

The simple fix is to disable rename detection when computing a
changed-path Bloom filter.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    bloom: ignore renames when computing changed paths
    
    I promised [1] I would adapt the commit that was dropped from
    gs/commit-graph-path-filter [2] on top of gs/commit-graph-path-filter
    and jt/avoid-prefetch-when-able-in-diff. However, I noticed that the
    change was extremely simple and has value without basing it on
    jt/avoid-prefetch-when-able-in-diff.
    
    This change applied to gs/commit-graph-path-filter has obvious CPU time
    improvements for computing changed-path Bloom filters (that I did not
    measure). The partial clone improvements require
    jt/avoid-prefetch-when-able-in-diff to be included, too, but the code
    does not depend on it at compile time.
    
    Thanks, -Stolee
    
    [1] 
    https://lore.kernel.org/git/7de2f54b-8704-a0e1-12aa-0ca9d3d70f6f@gmail.com/
    [2] 
    https://lore.kernel.org/git/55824cda89c1dca7756c8c2d831d6e115f4a9ddb.1585528298.git.gitgitgadget@gmail.com/

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-601%2Fderrickstolee%2Fdiff-and-bloom-filters-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-601/derrickstolee/diff-and-bloom-filters-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/601

 bloom.c | 1 +
 1 file changed, 1 insertion(+)


base-commit: d5b873c832d832e44523d1d2a9d29afe2b84c84f

Comments

Junio C Hamano April 8, 2020, 7:11 p.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The changed-path Bloom filters record an entry in the filter for
> every path that was changed. This includes every add and delete,
> regardless of whther a rename was detected. Detecting renames
> causes significant performance issues, but also will trigger
> downloading missing blobs in partial clone.
>
> The simple fix is to disable rename detection when computing a
> changed-path Bloom filter.

Makes perfect sense to me.

> diff --git a/bloom.c b/bloom.c
> index c5b461d1cfe..dd9bab9bbd6 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -189,6 +189,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> +	diffopt.detect_rename = 0;
>  	diffopt.max_changes = max_changes;
>  	diff_setup_done(&diffopt);
>  
>
> base-commit: d5b873c832d832e44523d1d2a9d29afe2b84c84f
Philip Oakley April 8, 2020, 7:13 p.m. UTC | #2
spelling nit.

On 08/04/2020 17:38, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The changed-path Bloom filters record an entry in the filter for
> every path that was changed. This includes every add and delete,
> regardless of whther a rename was detected. Detecting renames
whether
> causes significant performance issues, but also will trigger
> downloading missing blobs in partial clone.
>
> The simple fix is to disable rename detection when computing a
> changed-path Bloom filter.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>     bloom: ignore renames when computing changed paths
>     
>     I promised [1] I would adapt the commit that was dropped from
>     gs/commit-graph-path-filter [2] on top of gs/commit-graph-path-filter
>     and jt/avoid-prefetch-when-able-in-diff. However, I noticed that the
>     change was extremely simple and has value without basing it on
>     jt/avoid-prefetch-when-able-in-diff.
>     
>     This change applied to gs/commit-graph-path-filter has obvious CPU time
>     improvements for computing changed-path Bloom filters (that I did not
>     measure). The partial clone improvements require
>     jt/avoid-prefetch-when-able-in-diff to be included, too, but the code
>     does not depend on it at compile time.
>     
>     Thanks, -Stolee
>     
>     [1] 
>     https://lore.kernel.org/git/7de2f54b-8704-a0e1-12aa-0ca9d3d70f6f@gmail.com/
>     [2] 
>     https://lore.kernel.org/git/55824cda89c1dca7756c8c2d831d6e115f4a9ddb.1585528298.git.gitgitgadget@gmail.com/
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-601%2Fderrickstolee%2Fdiff-and-bloom-filters-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-601/derrickstolee/diff-and-bloom-filters-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/601
>
>  bloom.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/bloom.c b/bloom.c
> index c5b461d1cfe..dd9bab9bbd6 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -189,6 +189,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> +	diffopt.detect_rename = 0;
>  	diffopt.max_changes = max_changes;
>  	diff_setup_done(&diffopt);
>  
>
> base-commit: d5b873c832d832e44523d1d2a9d29afe2b84c84f
Philip
Jeff King April 8, 2020, 10:31 p.m. UTC | #3
On Wed, Apr 08, 2020 at 04:38:27PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The changed-path Bloom filters record an entry in the filter for
> every path that was changed. This includes every add and delete,
> regardless of whther a rename was detected. Detecting renames
> causes significant performance issues, but also will trigger
> downloading missing blobs in partial clone.
> 
> The simple fix is to disable rename detection when computing a
> changed-path Bloom filter.

Yes, we should be doing as simple a tree-diff as possible.

I wonder if this might actually be fixing a potential bug, too. The loop
below this code only looks at the "two" half of each queued diff pair.
With renames enabled, we might see the source path only in the "one"
half of the rename pair.

However, this seems to work already. If I do:

  echo content >file
  git add file
  git commit -m added
  echo change >file
  git commit -am changed
  git mv file other
  git commit -am 'rename away'
  git mv other file
  git commit -am 'rename back'
  git rm file
  git commit -am removed

  git commit-graph write --reachable --changed-paths
  git log --oneline -- file | cat

then I see all of the commits. Instrumenting Git like:

diff --git a/bloom.c b/bloom.c
index c5b461d1cf..fb2a758e1d 100644
--- a/bloom.c
+++ b/bloom.c
@@ -207,6 +207,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 		for (i = 0; i < diff_queued_diff.nr; i++) {
 			const char *path = diff_queued_diff.queue[i]->two->path;
 
+			warning("queuing touched %c path %s for %s",
+				diff_queued_diff.queue[i]->status,
+				path, oid_to_hex(&c->object.oid));
+
 			/*
 			* Add each leading directory of the changed file, i.e. for
 			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so

results in:

  warning: queuing touched A path file for 2346d88b0cb4bca11c38ee545d007a7a14ca472a
  warning: queuing touched M path file for 991cd7f0696ae29fea738ca1b8340c90dae4b201
  warning: queuing touched D path file for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
  warning: queuing touched A path other for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
  warning: queuing touched A path file for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
  warning: queuing touched D path other for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
  warning: queuing touched D path file for 7433b46bd6aa170ab17a651c10658a5b0c10ba4f

So we really aren't detecting renames in the first place! And indeed,
checking diffopt.detect_rename shows that it is unset. So I'm curious if
there is a case where that would not be true. I _think_ it would only be
true in a program which ran init_diff_ui_defaults(), but never in
git-commit-graph.

Even if it does nothing in practice, I'm not at all opposed to having it
in there as an explicit documentation of our expectations/requirements
for the loop below. But it's probably worth saying so in the commit
message.

-Peff
Derrick Stolee April 9, 2020, 11:56 a.m. UTC | #4
On 4/8/2020 6:31 PM, Jeff King wrote:
> On Wed, Apr 08, 2020 at 04:38:27PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The changed-path Bloom filters record an entry in the filter for
>> every path that was changed. This includes every add and delete,
>> regardless of whther a rename was detected. Detecting renames
>> causes significant performance issues, but also will trigger
>> downloading missing blobs in partial clone.
>>
>> The simple fix is to disable rename detection when computing a
>> changed-path Bloom filter.
> 
> Yes, we should be doing as simple a tree-diff as possible.
> 
> I wonder if this might actually be fixing a potential bug, too. The loop
> below this code only looks at the "two" half of each queued diff pair.
> With renames enabled, we might see the source path only in the "one"
> half of the rename pair.
> 
> However, this seems to work already. If I do:
> 
>   echo content >file
>   git add file
>   git commit -m added
>   echo change >file
>   git commit -am changed
>   git mv file other
>   git commit -am 'rename away'
>   git mv other file
>   git commit -am 'rename back'
>   git rm file
>   git commit -am removed
> 
>   git commit-graph write --reachable --changed-paths
>   git log --oneline -- file | cat
> 
> then I see all of the commits. Instrumenting Git like:
> 
> diff --git a/bloom.c b/bloom.c
> index c5b461d1cf..fb2a758e1d 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -207,6 +207,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		for (i = 0; i < diff_queued_diff.nr; i++) {
>  			const char *path = diff_queued_diff.queue[i]->two->path;
>  
> +			warning("queuing touched %c path %s for %s",
> +				diff_queued_diff.queue[i]->status,
> +				path, oid_to_hex(&c->object.oid));
> +
>  			/*
>  			* Add each leading directory of the changed file, i.e. for
>  			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> 
> results in:
> 
>   warning: queuing touched A path file for 2346d88b0cb4bca11c38ee545d007a7a14ca472a
>   warning: queuing touched M path file for 991cd7f0696ae29fea738ca1b8340c90dae4b201
>   warning: queuing touched D path file for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
>   warning: queuing touched A path other for d3642c9fb27459ea09f6c967a1e6ad119e265d6f
>   warning: queuing touched A path file for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
>   warning: queuing touched D path other for bc908eb29e562d97ebb8cf718e41b69d3aa1d834
>   warning: queuing touched D path file for 7433b46bd6aa170ab17a651c10658a5b0c10ba4f
> 
> So we really aren't detecting renames in the first place! And indeed,
> checking diffopt.detect_rename shows that it is unset. So I'm curious if
> there is a case where that would not be true. I _think_ it would only be
> true in a program which ran init_diff_ui_defaults(), but never in
> git-commit-graph.

So our issue was really that the partial clone prefetch logic was just
overly aggressive.

> Even if it does nothing in practice, I'm not at all opposed to having it
> in there as an explicit documentation of our expectations/requirements
> for the loop below. But it's probably worth saying so in the commit
> message.

I will update the message and send a v2. I'll fix the typo that Philip pointed
out, too.

Thanks,
-Stolee
Jeff King April 9, 2020, 1:47 p.m. UTC | #5
On Thu, Apr 09, 2020 at 07:56:43AM -0400, Derrick Stolee wrote:

> > So we really aren't detecting renames in the first place! And indeed,
> > checking diffopt.detect_rename shows that it is unset. So I'm curious if
> > there is a case where that would not be true. I _think_ it would only be
> > true in a program which ran init_diff_ui_defaults(), but never in
> > git-commit-graph.
> 
> So our issue was really that the partial clone prefetch logic was just
> overly aggressive.

Right, but I'm not sure how this patch could ever have helped, since
it's just setting a variable to the value it _should_ have already had.

Or do you just mean that the issue would have gone away with Jonathan's
patch to make the prefetching less aggressive?

-Peff
Derrick Stolee April 9, 2020, 2 p.m. UTC | #6
On 4/9/2020 9:47 AM, Jeff King wrote:
> On Thu, Apr 09, 2020 at 07:56:43AM -0400, Derrick Stolee wrote:
> 
>>> So we really aren't detecting renames in the first place! And indeed,
>>> checking diffopt.detect_rename shows that it is unset. So I'm curious if
>>> there is a case where that would not be true. I _think_ it would only be
>>> true in a program which ran init_diff_ui_defaults(), but never in
>>> git-commit-graph.
>>
>> So our issue was really that the partial clone prefetch logic was just
>> overly aggressive.
> 
> Right, but I'm not sure how this patch could ever have helped, since
> it's just setting a variable to the value it _should_ have already had.
> 
> Or do you just mean that the issue would have gone away with Jonathan's
> patch to make the prefetching less aggressive?

Yes, with Jonathan's patch we stop downloading blobs during Bloom filter
computations. The patch this is "replacing" disabled that download in a
different way, depending only on if detect_renames is false and the diff
output doesn't need file contents. (Jonathan's is better.)

-Stolee
Jeff King April 9, 2020, 2:15 p.m. UTC | #7
On Thu, Apr 09, 2020 at 10:00:46AM -0400, Derrick Stolee wrote:

> >> So our issue was really that the partial clone prefetch logic was just
> >> overly aggressive.
> > 
> > Right, but I'm not sure how this patch could ever have helped, since
> > it's just setting a variable to the value it _should_ have already had.
> > 
> > Or do you just mean that the issue would have gone away with Jonathan's
> > patch to make the prefetching less aggressive?
> 
> Yes, with Jonathan's patch we stop downloading blobs during Bloom filter
> computations. The patch this is "replacing" disabled that download in a
> different way, depending only on if detect_renames is false and the diff
> output doesn't need file contents. (Jonathan's is better.)

OK. I'm willing to leave it there, but I'm still puzzled as to why the
original patch would have done anything based on detect_renames, which
would always be false.

-Peff
diff mbox series

Patch

diff --git a/bloom.c b/bloom.c
index c5b461d1cfe..dd9bab9bbd6 100644
--- a/bloom.c
+++ b/bloom.c
@@ -189,6 +189,7 @@  struct bloom_filter *get_bloom_filter(struct repository *r,
 
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
+	diffopt.detect_rename = 0;
 	diffopt.max_changes = max_changes;
 	diff_setup_done(&diffopt);