Message ID | pull.601.git.1586363907252.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | bloom: ignore renames when computing changed paths | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > The changed-path Bloom filters record an entry in the filter for > every path that was changed. This includes every add and delete, > regardless of whther a rename was detected. Detecting renames > causes significant performance issues, but also will trigger > downloading missing blobs in partial clone. > > The simple fix is to disable rename detection when computing a > changed-path Bloom filter. Makes perfect sense to me. > diff --git a/bloom.c b/bloom.c > index c5b461d1cfe..dd9bab9bbd6 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -189,6 +189,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, > > repo_diff_setup(r, &diffopt); > diffopt.flags.recursive = 1; > + diffopt.detect_rename = 0; > diffopt.max_changes = max_changes; > diff_setup_done(&diffopt); > > > base-commit: d5b873c832d832e44523d1d2a9d29afe2b84c84f
spelling nit. On 08/04/2020 17:38, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <dstolee@microsoft.com> > > The changed-path Bloom filters record an entry in the filter for > every path that was changed. This includes every add and delete, > regardless of whther a rename was detected. Detecting renames whether > causes significant performance issues, but also will trigger > downloading missing blobs in partial clone. > > The simple fix is to disable rename detection when computing a > changed-path Bloom filter. > > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > bloom: ignore renames when computing changed paths > > I promised [1] I would adapt the commit that was dropped from > gs/commit-graph-path-filter [2] on top of gs/commit-graph-path-filter > and jt/avoid-prefetch-when-able-in-diff. However, I noticed that the > change was extremely simple and has value without basing it on > jt/avoid-prefetch-when-able-in-diff. > > This change applied to gs/commit-graph-path-filter has obvious CPU time > improvements for computing changed-path Bloom filters (that I did not > measure). The partial clone improvements require > jt/avoid-prefetch-when-able-in-diff to be included, too, but the code > does not depend on it at compile time. > > Thanks, -Stolee > > [1] > https://lore.kernel.org/git/7de2f54b-8704-a0e1-12aa-0ca9d3d70f6f@gmail.com/ > [2] > https://lore.kernel.org/git/55824cda89c1dca7756c8c2d831d6e115f4a9ddb.1585528298.git.gitgitgadget@gmail.com/ > > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-601%2Fderrickstolee%2Fdiff-and-bloom-filters-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-601/derrickstolee/diff-and-bloom-filters-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/601 > > bloom.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/bloom.c b/bloom.c > index c5b461d1cfe..dd9bab9bbd6 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -189,6 +189,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, > > repo_diff_setup(r, &diffopt); > diffopt.flags.recursive = 1; > + diffopt.detect_rename = 0; > diffopt.max_changes = max_changes; > diff_setup_done(&diffopt); > > > base-commit: d5b873c832d832e44523d1d2a9d29afe2b84c84f Philip
On Wed, Apr 08, 2020 at 04:38:27PM +0000, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <dstolee@microsoft.com> > > The changed-path Bloom filters record an entry in the filter for > every path that was changed. This includes every add and delete, > regardless of whther a rename was detected. Detecting renames > causes significant performance issues, but also will trigger > downloading missing blobs in partial clone. > > The simple fix is to disable rename detection when computing a > changed-path Bloom filter. Yes, we should be doing as simple a tree-diff as possible. I wonder if this might actually be fixing a potential bug, too. The loop below this code only looks at the "two" half of each queued diff pair. With renames enabled, we might see the source path only in the "one" half of the rename pair. However, this seems to work already. If I do: echo content >file git add file git commit -m added echo change >file git commit -am changed git mv file other git commit -am 'rename away' git mv other file git commit -am 'rename back' git rm file git commit -am removed git commit-graph write --reachable --changed-paths git log --oneline -- file | cat then I see all of the commits. Instrumenting Git like: diff --git a/bloom.c b/bloom.c index c5b461d1cf..fb2a758e1d 100644 --- a/bloom.c +++ b/bloom.c @@ -207,6 +207,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r, for (i = 0; i < diff_queued_diff.nr; i++) { const char *path = diff_queued_diff.queue[i]->two->path; + warning("queuing touched %c path %s for %s", + diff_queued_diff.queue[i]->status, + path, oid_to_hex(&c->object.oid)); + /* * Add each leading directory of the changed file, i.e. for * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so results in: warning: queuing touched A path file for 2346d88b0cb4bca11c38ee545d007a7a14ca472a warning: queuing touched M path file for 991cd7f0696ae29fea738ca1b8340c90dae4b201 warning: queuing touched D path file for d3642c9fb27459ea09f6c967a1e6ad119e265d6f warning: queuing touched A path other for d3642c9fb27459ea09f6c967a1e6ad119e265d6f warning: queuing touched A path file for bc908eb29e562d97ebb8cf718e41b69d3aa1d834 warning: queuing touched D path other for bc908eb29e562d97ebb8cf718e41b69d3aa1d834 warning: queuing touched D path file for 7433b46bd6aa170ab17a651c10658a5b0c10ba4f So we really aren't detecting renames in the first place! And indeed, checking diffopt.detect_rename shows that it is unset. So I'm curious if there is a case where that would not be true. I _think_ it would only be true in a program which ran init_diff_ui_defaults(), but never in git-commit-graph. Even if it does nothing in practice, I'm not at all opposed to having it in there as an explicit documentation of our expectations/requirements for the loop below. But it's probably worth saying so in the commit message. -Peff
On 4/8/2020 6:31 PM, Jeff King wrote: > On Wed, Apr 08, 2020 at 04:38:27PM +0000, Derrick Stolee via GitGitGadget wrote: > >> From: Derrick Stolee <dstolee@microsoft.com> >> >> The changed-path Bloom filters record an entry in the filter for >> every path that was changed. This includes every add and delete, >> regardless of whther a rename was detected. Detecting renames >> causes significant performance issues, but also will trigger >> downloading missing blobs in partial clone. >> >> The simple fix is to disable rename detection when computing a >> changed-path Bloom filter. > > Yes, we should be doing as simple a tree-diff as possible. > > I wonder if this might actually be fixing a potential bug, too. The loop > below this code only looks at the "two" half of each queued diff pair. > With renames enabled, we might see the source path only in the "one" > half of the rename pair. > > However, this seems to work already. If I do: > > echo content >file > git add file > git commit -m added > echo change >file > git commit -am changed > git mv file other > git commit -am 'rename away' > git mv other file > git commit -am 'rename back' > git rm file > git commit -am removed > > git commit-graph write --reachable --changed-paths > git log --oneline -- file | cat > > then I see all of the commits. Instrumenting Git like: > > diff --git a/bloom.c b/bloom.c > index c5b461d1cf..fb2a758e1d 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -207,6 +207,10 @@ struct bloom_filter *get_bloom_filter(struct repository *r, > for (i = 0; i < diff_queued_diff.nr; i++) { > const char *path = diff_queued_diff.queue[i]->two->path; > > + warning("queuing touched %c path %s for %s", > + diff_queued_diff.queue[i]->status, > + path, oid_to_hex(&c->object.oid)); > + > /* > * Add each leading directory of the changed file, i.e. for > * 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so > > results in: > > warning: queuing touched A path file for 2346d88b0cb4bca11c38ee545d007a7a14ca472a > warning: queuing touched M path file for 991cd7f0696ae29fea738ca1b8340c90dae4b201 > warning: queuing touched D path file for d3642c9fb27459ea09f6c967a1e6ad119e265d6f > warning: queuing touched A path other for d3642c9fb27459ea09f6c967a1e6ad119e265d6f > warning: queuing touched A path file for bc908eb29e562d97ebb8cf718e41b69d3aa1d834 > warning: queuing touched D path other for bc908eb29e562d97ebb8cf718e41b69d3aa1d834 > warning: queuing touched D path file for 7433b46bd6aa170ab17a651c10658a5b0c10ba4f > > So we really aren't detecting renames in the first place! And indeed, > checking diffopt.detect_rename shows that it is unset. So I'm curious if > there is a case where that would not be true. I _think_ it would only be > true in a program which ran init_diff_ui_defaults(), but never in > git-commit-graph. So our issue was really that the partial clone prefetch logic was just overly aggressive. > Even if it does nothing in practice, I'm not at all opposed to having it > in there as an explicit documentation of our expectations/requirements > for the loop below. But it's probably worth saying so in the commit > message. I will update the message and send a v2. I'll fix the typo that Philip pointed out, too. Thanks, -Stolee
On Thu, Apr 09, 2020 at 07:56:43AM -0400, Derrick Stolee wrote: > > So we really aren't detecting renames in the first place! And indeed, > > checking diffopt.detect_rename shows that it is unset. So I'm curious if > > there is a case where that would not be true. I _think_ it would only be > > true in a program which ran init_diff_ui_defaults(), but never in > > git-commit-graph. > > So our issue was really that the partial clone prefetch logic was just > overly aggressive. Right, but I'm not sure how this patch could ever have helped, since it's just setting a variable to the value it _should_ have already had. Or do you just mean that the issue would have gone away with Jonathan's patch to make the prefetching less aggressive? -Peff
On 4/9/2020 9:47 AM, Jeff King wrote: > On Thu, Apr 09, 2020 at 07:56:43AM -0400, Derrick Stolee wrote: > >>> So we really aren't detecting renames in the first place! And indeed, >>> checking diffopt.detect_rename shows that it is unset. So I'm curious if >>> there is a case where that would not be true. I _think_ it would only be >>> true in a program which ran init_diff_ui_defaults(), but never in >>> git-commit-graph. >> >> So our issue was really that the partial clone prefetch logic was just >> overly aggressive. > > Right, but I'm not sure how this patch could ever have helped, since > it's just setting a variable to the value it _should_ have already had. > > Or do you just mean that the issue would have gone away with Jonathan's > patch to make the prefetching less aggressive? Yes, with Jonathan's patch we stop downloading blobs during Bloom filter computations. The patch this is "replacing" disabled that download in a different way, depending only on if detect_renames is false and the diff output doesn't need file contents. (Jonathan's is better.) -Stolee
On Thu, Apr 09, 2020 at 10:00:46AM -0400, Derrick Stolee wrote: > >> So our issue was really that the partial clone prefetch logic was just > >> overly aggressive. > > > > Right, but I'm not sure how this patch could ever have helped, since > > it's just setting a variable to the value it _should_ have already had. > > > > Or do you just mean that the issue would have gone away with Jonathan's > > patch to make the prefetching less aggressive? > > Yes, with Jonathan's patch we stop downloading blobs during Bloom filter > computations. The patch this is "replacing" disabled that download in a > different way, depending only on if detect_renames is false and the diff > output doesn't need file contents. (Jonathan's is better.) OK. I'm willing to leave it there, but I'm still puzzled as to why the original patch would have done anything based on detect_renames, which would always be false. -Peff
diff --git a/bloom.c b/bloom.c index c5b461d1cfe..dd9bab9bbd6 100644 --- a/bloom.c +++ b/bloom.c @@ -189,6 +189,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, repo_diff_setup(r, &diffopt); diffopt.flags.recursive = 1; + diffopt.detect_rename = 0; diffopt.max_changes = max_changes; diff_setup_done(&diffopt);