Message ID | pull.497.git.1576879520.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Changed Paths Bloom Filters | expand |
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > Adopting changed path bloom filters has been discussed on the list before, > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. > Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and > presents an updated and more polished RFC version of the feature. ;-)
Hi, On Fri, Dec 20, 2019 at 11:07 PM Garima Singh via GitGitGadget <gitgitgadget@gmail.com> wrote: > > The commit graph feature brought in a lot of performance improvements across > multiple commands. However, file based history continues to be a performance > pain point, especially in large repositories. > > Adopting changed path bloom filters has been discussed on the list before, > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. > Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and > presents an updated and more polished RFC version of the feature. Thanks for working on this! > Performance Gains: We tested the performance of git log -- path on the git > repo, the linux repo and some internal large repos, with a variety of paths > of varying depths. > > On the git and linux repos: We observed a 2x to 5x speed up. > > On a large internal repo with files seated 6-10 levels deep in the tree: We > observed 10x to 20x speed ups, with some paths going up to 28 times faster. Very nice! I have a question though. Are the performance gains only available with `git log -- path` or are they already available for example when doing a partial clone and/or a sparse checkout? > Future Work (not included in the scope of this series): > > 1. Supporting multiple path based revision walk > 2. Adopting it in git blame logic. > 3. Interactions with line log git log -L Great! > This series is intended to start the conversation and many of the commit > messages include specific call outs for suggestions and thoughts. I think Peff said during the Virtual Contributor Summit that he was interested in using bitmaps to speed up partial clone on the server side. Would it make sense to use both bitmaps and bloom filters? Thanks, Christian.
On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote: > Adopting changed path bloom filters has been discussed on the list before, > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. > Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and > presents an updated and more polished RFC version of the feature. Great to see progress here. I probably won't have time to review this carefully before the new year, but I did notice some low-hanging fruit on the generation side. So here are a few patches to reduce the CPU and memory usage. They could be squashed in at the appropriate spots, or perhaps taken as inspiration if there are better solutions (especially for the first one). I think we could go further still, by actually doing a non-recursive diff_tree_oid(), and then recursing into sub-trees ourselves. That would save us having to split apart each path to add the leading paths to the hashmap (most of which will be duplicates if the commit touched "a/b/c" and "a/b/d", etc). I doubt it would be that huge a speedup though. We have to keep a list of the touched paths anyway (since the bloom key parameters depend on the number of entries), and most of the time is almost certainly spent inflating the trees in the first place. However it might be easier to follow the code, and it would make it simpler to stop traversing at the 512-entry limit, rather than generating a huge diff only to throw it away. [1/3]: commit-graph: examine changed-path objects in pack order [2/3]: commit-graph: free large diffs, too [3/3]: commit-graph: stop using full rev_info for diffs bloom.c | 18 +++++++++--------- commit-graph.c | 34 +++++++++++++++++++++++++++++++++- 2 files changed, 42 insertions(+), 10 deletions(-) -Peff
On Sun, Dec 22, 2019 at 10:26:20AM +0100, Christian Couder wrote: > I have a question though. Are the performance gains only available > with `git log -- path` or are they already available for example when > doing a partial clone and/or a sparse checkout? From my quick look at the code, anything that feeds a pathspec to a revision traversal would be helped. I'm not sure if it would help for partial/sparse traversals, though. There we actually need to know which blobs correspond to the paths in question, not just whether any particular commit touched them. I also took a brief look at adding support to the custom blame-tree implementation we use at GitHub, and got about a 6x speedup. > > This series is intended to start the conversation and many of the commit > > messages include specific call outs for suggestions and thoughts. > > I think Peff said during the Virtual Contributor Summit that he was > interested in using bitmaps to speed up partial clone on the server > side. Would it make sense to use both bitmaps and bloom filters? I think they're orthogonal. For size-based filters on blobs, you'd just use bitmaps as normal, because you can post-process the result to check the type and size of each object in the list (and I have patches to do this, but they need some polishing and we're not yet running them). For path-based filters like a sparse specification, you can't use bitmaps at all; you have to do a real traversal. But there you still generally get all of the commits. I guess if a commit doesn't touch any path you're interested in, you could avoid walking into its tree at all, which might help. I haven't given it much thought yet. -Peff
On 12/22/2019 4:30 AM, Jeff King wrote: > On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote: > >> Adopting changed path bloom filters has been discussed on the list before, >> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. >> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and >> presents an updated and more polished RFC version of the feature. > > Great to see progress here. I probably won't have time to review this > carefully before the new year, but I did notice some low-hanging fruit > on the generation side. > > So here are a few patches to reduce the CPU and memory usage. They could > be squashed in at the appropriate spots, or perhaps taken as inspiration > if there are better solutions (especially for the first one). > > I think we could go further still, by actually doing a non-recursive > diff_tree_oid(), and then recursing into sub-trees ourselves. That would > save us having to split apart each path to add the leading paths to the > hashmap (most of which will be duplicates if the commit touched "a/b/c" > and "a/b/d", etc). I doubt it would be that huge a speedup though. We > have to keep a list of the touched paths anyway (since the bloom key > parameters depend on the number of entries), and most of the time is > almost certainly spent inflating the trees in the first place. However > it might be easier to follow the code, and it would make it simpler to > stop traversing at the 512-entry limit, rather than generating a huge > diff only to throw it away. Thanks for these improvements. This diff machinery is new to us (Garima and myself). Here are some recommendations (to Garima) for how to proceed with these patches. Please let me know if anyone disagrees. > [1/3]: commit-graph: examine changed-path objects in pack order This one is best kept as its own patch, as it shows a clear reason why we want to do the sort-by-position. It would also be a complicated patch to include this logic along with the first use of compute_bloom_filters(). > [2/3]: commit-graph: free large diffs, too This one seems best to squash into "commit-graph: write changed paths bloom filters" with a Helped-by for Peff. > [3/3]: commit-graph: stop using full rev_info for diffs While I appreciate the clear benefit in the commit-message here, it may be best to also squash this one similarly. Of course, if we create our own diff logic with the short-circuit capability, then perhaps these patches become obsolete. I'll spend a little time playing with options here. Thanks! -Stolee
On 12/22/2019 4:30 AM, Jeff King wrote: > On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote: > >> Adopting changed path bloom filters has been discussed on the list before, >> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. >> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and >> presents an updated and more polished RFC version of the feature. > > Great to see progress here. I probably won't have time to review this > carefully before the new year, but I did notice some low-hanging fruit > on the generation side. > > So here are a few patches to reduce the CPU and memory usage. They could > be squashed in at the appropriate spots, or perhaps taken as inspiration > if there are better solutions (especially for the first one). I tested these patches with the Linux kernel repo and reported my results on each patch. However, I wanted to also test on a larger internal repo (the AzureDevOps repo), which has ~500 commits with more than 512 changes, and generally has larger diffs than the Linux kernel repo. | Version | Time | Memory | |----------|--------|--------| | Garima | 16m36s | 27.0gb | | Peff 1 | 6m32s | 28.0gb | | Peff 2 | 6m48s | 5.6gb | | Peff 3 | 6m14s | 4.5gb | | Shortcut | 3m47s | 4.5gb | For reference, I found the time and memory information using "/usr/bin/time --verbose" in a bash script. > I think we could go further still, by actually doing a non-recursive > diff_tree_oid(), and then recursing into sub-trees ourselves. That would > save us having to split apart each path to add the leading paths to the > hashmap (most of which will be duplicates if the commit touched "a/b/c" > and "a/b/d", etc). I doubt it would be that huge a speedup though. We > have to keep a list of the touched paths anyway (since the bloom key > parameters depend on the number of entries), and most of the time is > almost certainly spent inflating the trees in the first place. However > it might be easier to follow the code, and it would make it simpler to > stop traversing at the 512-entry limit, rather than generating a huge > diff only to throw it away. By "Shortcut" in the table above, I mean the following patch on top of Garima's and Peff's changes. It inserts a max_changes option into struct diff_options to halt the diff early. This seemed like an easier change than creating a new tree diff algorithm wholesale. Thanks, -Stolee -->8-- From: Derrick Stolee <dstolee@microsoft.com> Date: Fri, 27 Dec 2019 10:13:48 -0500 Subject: [PATCH] diff: halt tree-diff early after max_changes When computing the changed-paths bloom filters for the commit-graph, we limit the size of the filter by restricting the number of paths in the diff. Instead of computing a large diff and then ignoring the result, it is better to halt the diff computation early. Create a new "max_changes" option in struct diff_options. If non-zero, then halt the diff computation after discovering strictly more changed paths. This includes paths corresponding to trees that change. Use this max_changes option in the bloom filter calculations. This reduces the time taken to compute the filters for the Linux kernel repo from 2m50s to 2m35s. For a larger repo with more commits changing many paths, the time reduces from 6 minutes to under 4 minutes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- bloom.c | 4 +++- diff.h | 5 +++++ tree-diff.c | 5 +++++ 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/bloom.c b/bloom.c index ea77631cc2..83dde2378b 100644 --- a/bloom.c +++ b/bloom.c @@ -155,6 +155,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; int i; struct diff_options diffopt; + int max_changes = 512; filter = bloom_filter_slab_at(&bloom_filters, c); @@ -171,6 +172,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, repo_diff_setup(r, &diffopt); diffopt.flags.recursive = 1; + diffopt.max_changes = max_changes; diff_setup_done(&diffopt); if (c->parents) @@ -179,7 +181,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, diff_tree_oid(NULL, &c->object.oid, "", &diffopt); diffcore_std(&diffopt); - if (diff_queued_diff.nr <= 512) { + if (diff_queued_diff.nr <= max_changes) { struct hashmap pathmap; struct pathmap_hash_entry* e; struct hashmap_iter iter; diff --git a/diff.h b/diff.h index 6febe7e365..9443dc1b00 100644 --- a/diff.h +++ b/diff.h @@ -285,6 +285,11 @@ struct diff_options { /* Number of hexdigits to abbreviate raw format output to. */ int abbrev; + /* If non-zero, then stop computing after this many changes. */ + int max_changes; + /* For internal use only. */ + int num_changes; + int ita_invisible_in_index; /* white-space error highlighting */ #define WSEH_NEW (1<<12) diff --git a/tree-diff.c b/tree-diff.c index 33ded7f8b3..16a21d9f34 100644 --- a/tree-diff.c +++ b/tree-diff.c @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths( if (diff_can_quit_early(opt)) break; + if (opt->max_changes && opt->num_changes > opt->max_changes) + break; + if (opt->pathspec.nr) { skip_uninteresting(&t, base, opt); for (i = 0; i < nparent; i++) @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths( /* t↓ */ update_tree_entry(&t); + opt->num_changes++; } /* t > p[imin] */ @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths( skip_emit_tp: /* ∀ pi=p[imin] pi↓ */ update_tp_entries(tp, nparent); + opt->num_changes++; } }
On Thu, Dec 26, 2019 at 09:21:36AM -0500, Derrick Stolee wrote: > Here are some recommendations (to Garima) for how to proceed with these > patches. Please let me know if anyone disagrees. > > > [1/3]: commit-graph: examine changed-path objects in pack order > > This one is best kept as its own patch, as it shows a clear reason why > we want to do the sort-by-position. It would also be a complicated > patch to include this logic along with the first use of > compute_bloom_filters(). Yeah, I'd agree this one could be a separate patch. It does need more work, though (as you found out, it does not cover --reachable at all). The position counter also probably ought to be an unsigned (or even a uint32_t, which we usually consider a maximum bound for number of objects). -Peff
On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote: > > So here are a few patches to reduce the CPU and memory usage. They could > > be squashed in at the appropriate spots, or perhaps taken as inspiration > > if there are better solutions (especially for the first one). > > I tested these patches with the Linux kernel repo and reported my results > on each patch. However, I wanted to also test on a larger internal repo > (the AzureDevOps repo), which has ~500 commits with more than 512 changes, > and generally has larger diffs than the Linux kernel repo. > > | Version | Time | Memory | > |----------|--------|--------| > | Garima | 16m36s | 27.0gb | > | Peff 1 | 6m32s | 28.0gb | > | Peff 2 | 6m48s | 5.6gb | > | Peff 3 | 6m14s | 4.5gb | > | Shortcut | 3m47s | 4.5gb | > > For reference, I found the time and memory information using > "/usr/bin/time --verbose" in a bash script. Thanks for giving it more exercise. My heap profiling was done with massif, which measures the heap directly. Measuring RSS would cover that, but will also include the mmap'd packfiles. That's probably why your linux.git numbers were slightly higher than mine. (massif is a really great tool if you haven't used it, as it also shows which allocations were using the memory. But it's part of valgrind, so it definitely doesn't run on native Windows. It might work under WSL, though. I'm sure there are also other heap profilers on Windows). > By "Shortcut" in the table above, I mean the following patch on top of > Garima's and Peff's changes. It inserts a max_changes option into struct > diff_options to halt the diff early. This seemed like an easier change > than creating a new tree diff algorithm wholesale. Yeah, I'm not opposed to a diff feature like this. But be careful, because... > diff --git a/diff.h b/diff.h > index 6febe7e365..9443dc1b00 100644 > --- a/diff.h > +++ b/diff.h > @@ -285,6 +285,11 @@ struct diff_options { > /* Number of hexdigits to abbreviate raw format output to. */ > int abbrev; > > + /* If non-zero, then stop computing after this many changes. */ > + int max_changes; > + /* For internal use only. */ > + int num_changes; This is holding internal state in diff_options, but the same diff_options is often used for multiple diffs (e.g., "git log --raw" would use the same rev_info.diffopt over and over again). So it would need to be cleared between diffs. There's a similar feature in the "has_changes" flag, though it looks like it is cleared manually by callers. Yuck. This isn't a problem for commit-graph right now, but: - it actually could be using a single diff_options, which would be slightly simpler (it doesn't seem to save much CPU, though, because the initialization is relatively cheap) - it's a bit of a subtle bug to leave hanging around for the next person who tries to use the feature I actually wonder if this could be rolled into the has_changes and diff_can_quit_early() feature. This really just a generalization of that feature (which is like setting max_changes to "1"). -Peff
On 12/29/2019 1:24 AM, Jeff King wrote: > On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote: > >>> So here are a few patches to reduce the CPU and memory usage. They could >>> be squashed in at the appropriate spots, or perhaps taken as inspiration >>> if there are better solutions (especially for the first one). >> >> I tested these patches with the Linux kernel repo and reported my results >> on each patch. However, I wanted to also test on a larger internal repo >> (the AzureDevOps repo), which has ~500 commits with more than 512 changes, >> and generally has larger diffs than the Linux kernel repo. >> >> | Version | Time | Memory | >> |----------|--------|--------| >> | Garima | 16m36s | 27.0gb | >> | Peff 1 | 6m32s | 28.0gb | >> | Peff 2 | 6m48s | 5.6gb | >> | Peff 3 | 6m14s | 4.5gb | >> | Shortcut | 3m47s | 4.5gb | >> >> For reference, I found the time and memory information using >> "/usr/bin/time --verbose" in a bash script. > > Thanks for giving it more exercise. My heap profiling was done with > massif, which measures the heap directly. Measuring RSS would cover > that, but will also include the mmap'd packfiles. That's probably why > your linux.git numbers were slightly higher than mine. That's interesting. I initially avoided massif because it is so much slower than /usr/bin/time. However, the inflated numbers could be explained by that. Also, the distinction between mem_heap and mem_heap_extra may have interesting implications. Looking online, it seems that large mem_heap_extra implies the heap is fragmented from many small allocations. Here are my findings on the Linux repo: | Version | mem_heap | mem_heap_extra | |----------|----------|----------------| | Peff 1 | 6,500mb | 913mb | | Peff 2 | 3,100mb | 286mb | | Peff 3 | 781mb | 235mb | These numbers more closely match your numbers (in sum of the two columns). > (massif is a really great tool if you haven't used it, as it also shows > which allocations were using the memory. But it's part of valgrind, so > it definitely doesn't run on native Windows. It might work under WSL, > though. I'm sure there are also other heap profilers on Windows). I am using my Linux machine for my tests. Garima is using her Windows machine. >> By "Shortcut" in the table above, I mean the following patch on top of >> Garima's and Peff's changes. It inserts a max_changes option into struct >> diff_options to halt the diff early. This seemed like an easier change >> than creating a new tree diff algorithm wholesale. > > Yeah, I'm not opposed to a diff feature like this. > > But be careful, because... > >> diff --git a/diff.h b/diff.h >> index 6febe7e365..9443dc1b00 100644 >> --- a/diff.h >> +++ b/diff.h >> @@ -285,6 +285,11 @@ struct diff_options { >> /* Number of hexdigits to abbreviate raw format output to. */ >> int abbrev; >> >> + /* If non-zero, then stop computing after this many changes. */ >> + int max_changes; >> + /* For internal use only. */ >> + int num_changes; > > This is holding internal state in diff_options, but the same > diff_options is often used for multiple diffs (e.g., "git log --raw" > would use the same rev_info.diffopt over and over again). > > So it would need to be cleared between diffs. There's a similar feature > in the "has_changes" flag, though it looks like it is cleared manually > by callers. Yuck. You're right about this. What if we initialize it in diff_tree_paths() before it calls the recursive ll_difF_tree_paths()? > This isn't a problem for commit-graph right now, but: > > - it actually could be using a single diff_options, which would be > slightly simpler (it doesn't seem to save much CPU, though, because > the initialization is relatively cheap) > > - it's a bit of a subtle bug to leave hanging around for the next > person who tries to use the feature > > I actually wonder if this could be rolled into the has_changes and > diff_can_quit_early() feature. This really just a generalization of that > feature (which is like setting max_changes to "1"). I thought about this at first, but it only takes a struct diff_options right now. It does have an internally-mutated member (flags.has_changes) but it also seems a bit wrong to add a uint32_t of the count in this. Changing the prototype could be messy, too. There are also multiple callers, and limiting everything to tree-diff.c limits the impact. Thanks, -Stolee
Jeff King <peff@peff.net> writes: > This is holding internal state in diff_options, but the same > diff_options is often used for multiple diffs (e.g., "git log --raw" > would use the same rev_info.diffopt over and over again). > > So it would need to be cleared between diffs. There's a similar feature > in the "has_changes" flag, though it looks like it is cleared manually > by callers. Yuck. Do you mean we want reset_per_invocation_part_of_diff_options() helper or something? > I actually wonder if this could be rolled into the has_changes and > diff_can_quit_early() feature. This really just a generalization of that > feature (which is like setting max_changes to "1"). Yeah, I wondered about the same thing, after seeing the impressive numbers ;-)
"Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: > Hey! > > The commit graph feature brought in a lot of performance improvements across > multiple commands. However, file based history continues to be a performance > pain point, especially in large repositories. > > Adopting changed path bloom filters has been discussed on the list before, > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. > Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and > presents an updated and more polished RFC version of the feature. It is nice to have this picked up for upstream, finally. The proof of concept works[1][2] were started more than a year ago. On the other hand slow and steady adoption of commit-graph serialization and then extending it (generation numbers, topological sort, incremental update) feels like a good approach. > Performance Gains: We tested the performance of 'git log -- <path>' on the git > repo, the linux repo and some internal large repos, with a variety of paths > of varying depths. > > On the git and linux repos: We observed a 2x to 5x speed up. > > On a large internal repo with files seated 6-10 levels deep in the tree: We > observed 10x to 20x speed ups, with some paths going up to 28 times faster. Could you provide some more statistics about this internal repository, such as number of files, number of commits, perhaps also number of all objects? Thanks in advance. I wonder why such large difference in performance 2-5x vs 10-20x. Is it about the depth of the file hierarchy? How would the numbers look for files seated closer to the root in the same large repository, like 3-5 levels deep in the tree? > Future Work (not included in the scope of this series): > > 1. Supporting multiple path based revision walk I wonder if it would ever be possible to support globbing, e.g. '*.c' > 2. Adopting it in git blame logic. What about 'git log --follow <path>'? > 3. Interactions with line log git log -L > > This series is intended to start the conversation and many of the commit > messages include specific call outs for suggestions and thoughts. > > Cheers! Garima Singh > > [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/ > [2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/ > > Garima Singh (9): > commit-graph: add --changed-paths option to write This summary is not easy to understand on first glance. Maybe: commit-graph: add --changed-paths option to the write subcommand or commit-graph: add --changed-paths option to 'git commit-graph write' would be better? > commit-graph: write changed paths bloom filters > commit-graph: use MAX_NUM_CHUNKS > commit-graph: document bloom filter format > commit-graph: write changed path bloom filters to commit-graph file. > commit-graph: test commit-graph write --changed-paths > commit-graph: reuse existing bloom filters during write. > revision.c: use bloom filters to speed up path based revision walks > commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag > > Documentation/git-commit-graph.txt | 5 + > .../technical/commit-graph-format.txt | 17 ++ > Makefile | 1 + > bloom.c | 257 +++++++++++++++++ > bloom.h | 51 ++++ > builtin/commit-graph.c | 9 +- > ci/run-build-and-tests.sh | 1 + > commit-graph.c | 116 +++++++- > commit-graph.h | 9 +- > revision.c | 67 ++++- > revision.h | 5 + > t/README | 3 + > t/helper/test-read-graph.c | 4 + > t/t4216-log-bloom.sh | 77 ++++++ > t/t5318-commit-graph.sh | 2 + > t/t5324-split-commit-graph.sh | 1 + > t/t5325-commit-graph-bloom.sh | 258 ++++++++++++++++++ > 17 files changed, 875 insertions(+), 8 deletions(-) > create mode 100644 bloom.c > create mode 100644 bloom.h > create mode 100755 t/t4216-log-bloom.sh > create mode 100755 t/t5325-commit-graph-bloom.sh > > > base-commit: b02fd2accad4d48078671adf38fe5b5976d77304 > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/497
Jeff King <peff@peff.net> writes: > On Sun, Dec 22, 2019 at 10:26:20AM +0100, Christian Couder wrote: > >> I have a question though. Are the performance gains only available >> with `git log -- path` or are they already available for example when >> doing a partial clone and/or a sparse checkout? > > From my quick look at the code, anything that feeds a pathspec to a > revision traversal would be helped. I'm not sure if it would help for > partial/sparse traversals, though. There we actually need to know which > blobs correspond to the paths in question, not just whether any > particular commit touched them. > > I also took a brief look at adding support to the custom blame-tree > implementation we use at GitHub, and got about a 6x speedup. Is there any chance of upstreaming the blame-tree algorithm, perhaps as a separate mode for git-blame (invoked with `git blame <directory>`? Or is the algorithm too GitHub-specific? Best,
On 12/31/2019 11:45 AM, Jakub Narebski wrote: > "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: >> Performance Gains: We tested the performance of 'git log -- <path>' on the git >> repo, the linux repo and some internal large repos, with a variety of paths >> of varying depths. >> >> On the git and linux repos: We observed a 2x to 5x speed up. >> >> On a large internal repo with files seated 6-10 levels deep in the tree: We >> observed 10x to 20x speed ups, with some paths going up to 28 times faster. > > Could you provide some more statistics about this internal repository, > such as number of files, number of commits, perhaps also number of all > objects? Thanks in advance. > > I wonder why such large difference in performance 2-5x vs 10-20x. Is it > about the depth of the file hierarchy? How would the numbers look for > files seated closer to the root in the same large repository, like 3-5 > levels deep in the tree? The internal repository we saw these massive gains on has: - 413579 commits. - 183303 files distributed across 34482 folders The size on disk is about 17 GiB. And yes, the difference is performance gains is mostly because of how deep the files were in the hierarchy. How often a file has been touched also makes a difference. The performance gains are less dramatic if the file has a very sparse history even if it is a deep file. The numbers from the git and linux repos for instance, are for files closer to the root, hence 2x to 5x. Thanks! Garima Singh
Garima Singh <garimasigit@gmail.com> writes: > On 12/31/2019 11:45 AM, Jakub Narebski wrote: >> "Garima Singh via GitGitGadget" <gitgitgadget@gmail.com> writes: >>> >>> Performance Gains: We tested the performance of 'git log -- <path>' on the git >>> repo, the linux repo and some internal large repos, with a variety of paths >>> of varying depths. >>> >>> On the git and linux repos: We observed a 2x to 5x speed up. >>> >>> On a large internal repo with files seated 6-10 levels deep in the tree: We >>> observed 10x to 20x speed ups, with some paths going up to 28 times faster. >> >> Could you provide some more statistics about this internal repository, >> such as number of files, number of commits, perhaps also number of all >> objects? Thanks in advance. >> >> I wonder why such large difference in performance 2-5x vs 10-20x. Is it >> about the depth of the file hierarchy? How would the numbers look for >> files seated closer to the root in the same large repository, like 3-5 >> levels deep in the tree? > > The internal repository we saw these massive gains on has: > - 413579 commits. > - 183303 files distributed across 34482 folders > The size on disk is about 17 GiB. Thank you for the data. Such information would be important consideration to help to find out whether enabling Bloom filters in given repository would be worth it. > And yes, the difference is performance gains is mostly because of how > deep the files were in the hierarchy. Right, this is understandable. If files are diep in hierarchy, then we have to unpack more tree objects to find out if the file was changed in a given commit (provided that finding differences do not terminate early thanks to hierarchical structure of tree objects). > How often a file has been touched > also makes a difference. The performance gains are less dramatic if the > file has a very sparse history even if it is a deep file. This looks a bit strange (or maybe I don't understand something). Bloom filter can answer "no" and "maybe" to subset inclusion query. This means that if file was *not* changed, with great probability the answer from Bloom filter would be "no", and we would skip diff-ing trees (which may terminate early, though). On the other hand if file was changed by the commit, and the answer from a Bloom filter is "maybe", then we have to perform diffing to make sure. > > The numbers from the git and linux repos for instance, are for files > closer to the root, hence 2x to 5x. That is quite nice speedup, anyway (git repository cannot be even considered large; medium -- maybe). P.S. I wonder if it would be worth to create some synthetical repository to test performance gains of Bloom filters, perhaps in t/perf... Best,
On 1/20/2020 8:48 AM, Jakub Narebski wrote: >> How often a file has been touched >> also makes a difference. The performance gains are less dramatic if the >> file has a very sparse history even if it is a deep file. > > This looks a bit strange (or maybe I don't understand something). > > Bloom filter can answer "no" and "maybe" to subset inclusion query. > This means that if file was *not* changed, with great probability the > answer from Bloom filter would be "no", and we would skip diff-ing > trees (which may terminate early, though). > > On the other hand if file was changed by the commit, and the answer from > a Bloom filter is "maybe", then we have to perform diffing to make sure. > Yes. What I meant by statement however is that the performance gain i.e. difference in performance between using and not using bloom filters, is not always as dramatic if the history is sparse and the trees aren't touched as often. So it is largely dependent on the shape of the repo and the shape of the commit graph. >> >> The numbers from the git and linux repos for instance, are for files >> closer to the root, hence 2x to 5x. > > That is quite nice speedup, anyway (git repository cannot be even > considered large; medium -- maybe). > Yeah. Git and Linux served as nice initial test beds. If you have any suggestions for interesting repos it would be worth running performanc investigations on, do let me know! > > P.S. I wonder if it would be worth to create some synthetical repository > to test performance gains of Bloom filters, perhaps in t/perf... > I will look into this after I get v1 out on the mailing list. Thanks! Cheers Garima Singh
Hi, On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote: > This series is intended to start the conversation and many of the commit > messages include specific call outs for suggestions and thoughts. Since it's mostly in RFC stage, I'm holding off on line by line comments for now. I read the series (thanks for your patience) and I'll try to leave some review thoughts in the diffspec here. > Garima Singh (9): > [1/9] commit-graph: add --changed-paths option to write I wonder if this can be combined with 2; without 2 actually the documentation is wrong for this one, right? Although I suppose you also mentioned 2 perhaps being too long :) > [2/9] commit-graph: write changed paths bloom filters As I understand it, this one always regenerates the bloom filter pieces, and doesn't write it down in the commit-graph file. How much longer does that take now than before? I don't have a great feel for how often 'git commit-graph' is run, or whether I need to be invoking it manually. > [3/9] commit-graph: use MAX_NUM_CHUNKS > [4/9] commit-graph: document bloom filter format I suppose I might like to see this commit squashed with 5, but it's a nit. I'm thinking it'd be handy to say "git blame commit-graph" and see some nice doc about the format expected in the commit-graph file. > [5/9] commit-graph: write changed path bloom filters to commit-graph file. Ah, so here we finally write down the result from 2 to disk in the commit-graph file. And without 7, this gets recalculated every time we call 'git commit-graph' still. As for a technical doc around here, I'd really appreciate one. But I'm speaking selfishly - I'd also be happy if I could watch a talk about this design to make sure I understand it right :) > [6/9] commit-graph: test commit-graph write --changed-paths > [7/9] commit-graph: reuse existing bloom filters during write. I saw an option to give up if there wasn't an existing bloom filter, but I didn't see an option here to force recalculating. Is there a scenario when that would be useful? What's the mitigation path if: - I have a commit-graph with v0 of the bloom index piece, but update to Git which uses v1? - My commit-graph file is corrupted in a way that the bloom filter results are incorrect and I am missing a blob change (and therefore not finding it during the walk)? I think I understand that without this commit, 8 is not much speedup because we will be recalculating the filter for each commit, rather than using the written-down commit-graph file. > [8/9] revision.c: use bloom filters to speed up path based revision walks > [9/9] commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag Speaking of making sure I understand the change right, I will also summarize my own understanding in the hopes that I can be corrected and others can learn too ;) - The general idea is that we can write down a hint at the tree level saying "something I own did change this commit"; when we look for an object later, we can skip the commit where it looks like that path didn't change. - The change is in two pieces: first, to generate the hints per tree (which happens during commit-graph); second, to use those hints to optimize a rev walk (which happens in revision.c patch 8) - When we calculate the hints during commit-graph, we check the diff of each tree compared to its recent ancestor to see if there was a change; if so we calculate a hash for each path and use that as a key for a map from hash to path. After we look through everything changed in the diff, we can add it to a cumulative bloom filter list (one filter per commit) so we have a handy in-memory idea of which paths changed in each commit. - When it's time to do the rev walk, we ask for the bloom filter for each commit and check if that commit's map contains the path to the object we're worried about; if so, then it's OK to unpack the tree and check if the path we are interested in actually did get changed during that commit. Thanks. - Emily > > Documentation/git-commit-graph.txt | 5 + > .../technical/commit-graph-format.txt | 17 ++ > Makefile | 1 + > bloom.c | 257 +++++++++++++++++ > bloom.h | 51 ++++ > builtin/commit-graph.c | 9 +- > ci/run-build-and-tests.sh | 1 + > commit-graph.c | 116 +++++++- > commit-graph.h | 9 +- > revision.c | 67 ++++- > revision.h | 5 + > t/README | 3 + > t/helper/test-read-graph.c | 4 + > t/t4216-log-bloom.sh | 77 ++++++ > t/t5318-commit-graph.sh | 2 + > t/t5324-split-commit-graph.sh | 1 + > t/t5325-commit-graph-bloom.sh | 258 ++++++++++++++++++ > 17 files changed, 875 insertions(+), 8 deletions(-) > create mode 100644 bloom.c > create mode 100644 bloom.h > create mode 100755 t/t4216-log-bloom.sh > create mode 100755 t/t5325-commit-graph-bloom.sh > > > base-commit: b02fd2accad4d48078671adf38fe5b5976d77304 > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/497 > -- > gitgitgadget
On 1/21/2020 6:40 PM, Emily Shaffer wrote:>> Garima Singh (9): >> [1/9] commit-graph: add --changed-paths option to write > I wonder if this can be combined with 2; without 2 actually the > documentation is wrong for this one, right? Although I suppose you also > mentioned 2 perhaps being too long :) > True. The ordering of these commits has been a very subjective discussion. Leaving this commit isolated the way it is, does help separate the option from the bloom filter computation and commit graph write step. Also for clarity, I have changed the message for 2/9 to: `commit-graph: compute Bloom filters for changed paths` >> [2/9] commit-graph: write changed paths bloom filters > As I understand it, this one always regenerates the bloom filter pieces, > and doesn't write it down in the commit-graph file. How much longer does > that take now than before? I don't have a great feel for how often 'git > commit-graph' is run, or whether I need to be invoking it manually. > Yes. Computation and writing the bloom filters to the commit-graph file (2/9 and 5/9) are ideally one time operations per commit. The time taken depends on the shape and size of the repo since computation involves running a full diff. If you look at the discussions and patches exchanged by Dr. Stolee and Peff: it has improved greatly since this RFC patch. My next submission will carry these patches and more concrete numbers for time and memory. Also, `git commit-graph write` is not run very frequently and is ideally incremental. A full rewrite is usually only done in case there is a corruption caught by `git commit-graph verify` in which case, you delete and rewrite. These are both manual operations. See the docs here: https://git-scm.com/docs/git-commit-graph Note: that fetch.writeCommitGraph is now on by default in which case new computations happen automatically for newly fetched commits. But I am not adding the --changed-paths option to that yet. So computing, writing and using bloom filters will still be an opt-in feature that users need to manually run. >> [7/9] commit-graph: reuse existing bloom filters during write. > I saw an option to give up if there wasn't an existing bloom filter, but > I didn't see an option here to force recalculating. The way to force recalculation at the moment would be to delete the commit graph file and write it again. Is there a scenario when that would be useful? Yes, if there are algorithm or hash version changes in the changed paths logic we would need to rewrite. Each of the cases I can think of would involve triggering a recalculation by deleting and rewriting. What's the mitigation path if: > - I have a commit-graph with v0 of the bloom index piece, but update to > Git which uses v1? Take a look at 5/9, when we are parsing the commit graph: if the code is expected to work with a particular version (hash_version = 1 in the current version), and the commit graph has a different version, we just ignore it. In the future, this is where we could extend this to support multiple versions. > - My commit-graph file is corrupted in a way that the bloom filter > results are incorrect and I am missing a blob change (and therefore > not finding it during the walk)? The mitigation for any commit graph corruption is to delete and rewrite. If however, we are confident that the bloom filter computation itself is wrong, the immediate mitigations would be to deleting the commit graph file and rewriting without the --changed-paths option; and ofc report the bug so it can be investigated and fixed. :) > Speaking of making sure I understand the change right, I will also > summarize my own understanding in the hopes that I can be corrected and > others can learn too ;) > > - The general idea is that we can write down a hint at the tree level > saying "something I own did change this commit"; when we look for an > object later, we can skip the commit where it looks like that path > didn't change. The hint we are storing is a bloom filter which answers "No" or "Maybe" to the question "Did file A change in commit c" If the answer is No, we can ignore walking that commit. Else, we fall back to the diff algorithm like before to confirm if the file changed or not. > - The change is in two pieces: first, to generate the hints per tree > (which happens during commit-graph); second, to use those hints to > optimize a rev walk (which happens in revision.c patch 8) Yes. > - When we calculate the hints during commit-graph, we check the diff of > each tree compared to its recent ancestor to see if there was a > change; if so we calculate a hash for each path and use that as a key > for a map from hash to path. After we look through everything changed > in the diff, we can add it to a cumulative bloom filter list (one > filter per commit) so we have a handy in-memory idea of which paths > changed in each commit. > - When it's time to do the rev walk, we ask for the bloom filter for > each commit and check if that commit's map contains the path to the > object we're worried about; if so, then it's OK to unpack the tree > and check if the path we are interested in actually did get changed > during that commit. Essentially yes. There are a few implementation specifics this description is glossing over, but I understand that is the intention. > Thanks. > - Emily Cheers! Garima Singh
Emily Shaffer <emilyshaffer@google.com> writes: [...] > Speaking of making sure I understand the change right, I will also > summarize my own understanding in the hopes that I can be corrected and > others can learn too ;) > > - The general idea is that we can write down a hint at the tree level > saying "something I own did change this commit"; when we look for an > object later, we can skip the commit where it looks like that path > didn't change. Or, to be more exact, we write hint about all the files and directories changed in the commit at the commit level. Say, for example, that changed files are 'README' and 'subdir/file'. We store hint that 'README', 'subdir/file' and 'subdir' paths have changes in them. > - The change is in two pieces: first, to generate the hints per tree > (which happens during commit-graph); second, to use those hints to > optimize a rev walk (which happens in revision.c patch 8) Right. > - When we calculate the hints during commit-graph, we check the diff of > each tree compared to its recent ancestor to see if there was a > change; s/recent ancestor/first parent/ Right, though commits without any changes with respect to first-parent (or null tree in case of root i.e. parentless commit) should be rare. > if so we calculate a hash for each path and use that as a key > for a map from hash to path. Yes, and no. Here we enter the details how Bloom filter is constructed. We don't store paths in Bloom filter -- it would take too much space. The hashmap is used as an implementation of mathematical set, a temporary structure used during Bloom filter construction. We could have used string_list, but then we could waste time trying to add intermediate directories multiple times (for example if 'foo/bar' and 'foo/baz' files changed, we need to add 'foo' path only once to Bloom filter). You can think of Bloom filter as a compact (and probabilistic, see below) representation of set of changed paths. > After we look through everything changed > in the diff, we can add it to a cumulative bloom filter list (one > filter per commit) so we have a handy in-memory idea of which paths > changed in each commit. Yes. > - When it's time to do the rev walk, we ask for the bloom filter for > each commit and check if that commit's map contains the path to the > object we're worried about; if so, then it's OK to unpack the tree > and check if the path we are interested in actually did get changed > during that commit. From the point of view of rev walk, we ask for the Bloom filter for each commit walked, and check if the (sub)set of changed paths includes given path. Bloom filter can answer "no" -- then we can skip the commit simplifying history, or it can answer "maybe" -- then we need to check if file was actually changed unpacking the trees (there is around 1% probability that Bloom filter will say "maybe" if the path is not actually changed). From the point of view of Bloom filter, if the path was actually changed the filter will always answer "maybe". If the path was not changed, then in most cases the filter will answer "no" but there is 1% of chance that it will answer "maybe". I hope that helps, -- Jakub Narębski
Garima Singh <garimasigit@gmail.com> writes: > On 1/20/2020 8:48 AM, Jakub Narebski wrote: >>> How often a file has been touched >>> also makes a difference. The performance gains are less dramatic if the >>> file has a very sparse history even if it is a deep file. >> >> This looks a bit strange (or maybe I don't understand something). >> >> Bloom filter can answer "no" and "maybe" to subset inclusion query. >> This means that if file was *not* changed, with great probability the >> answer from Bloom filter would be "no", and we would skip diff-ing >> trees (which may terminate early, though). >> >> On the other hand if file was changed by the commit, and the answer from >> a Bloom filter is "maybe", then we have to perform diffing to make sure. > > Yes. What I meant by statement however is that the performance gain i.e. > difference in performance between using and not using bloom filters, is not > always as dramatic if the history is sparse and the trees aren't touched > as often. So it is largely dependent on the shape of the repo and the shape > of the commit graph. It probably depends on the depth of changes in a typical skipped commit, I think. If we are getting history for the core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl and the change is contained in libs/input/ directory, we have to unpack only two trees, independent on the depth of the file we are asking about. It could be also possible, if Git is smart enough about it, to halt early if we are checking only if given path changed rather than calculating a full difftree. Say, for example, that the change was in core/java/org/apache/http/conn/ssl/SSLSocketFactory.java file, while we were getting the history for the following file core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl After unpacking two or three threes we know that the second file was not changed. But if we compute full diff, we have to unpack 8 trees. Quite a difference. All example paths above came from AOSP repository that was used for testing different proposed generation numbers v2, see https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh >>> The numbers from the git and linux repos for instance, are for files >>> closer to the root, hence 2x to 5x. >> >> That is quite nice speedup, anyway (git repository cannot be even >> considered large; medium -- maybe). > > Yeah. Git and Linux served as nice initial test beds. If you have any > suggestions for interesting repos it would be worth running performanc > investigations on, do let me know! If we want repositories with deep path hierarchy, Java projects with mandated directory structures might be a good choice, for example Android (AOSP): git clone https://android.googlesource.com/platform/frameworks/base/ android-base It is also quite large repository; in 2019 it had around 874000 commits, around the same as the Linux kernel repository. Another large repository is Chromium -- though I don't know if it has deep filesystem hierarchy. You can use the list of different large and large-ish repositories from https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh Other repositories with large number of commmits not on that list are LLVM Compiler, GCC (GNU Compiler Collection) -- just converted to Git, Homebrew, and Ruby on Rails. >> P.S. I wonder if it would be worth to create some synthetical repository >> to test performance gains of Bloom filters, perhaps in t/perf... >> > > I will look into this after I get v1 out on the mailing list. > Thanks! It would be nice to have, but it can wait. Keep up the good work!
My apologies that things have been quite on this series for the past week or so. An unexpected high priority task at work demanded all of my attention and will continue to do so through the end of this week. Hopefully I will be able to pick this up again early next week and have v3 out soon! Cheers! Garima Singh On 12/20/2019 5:05 PM, Garima Singh via GitGitGadget wrote: > Hey! > > The commit graph feature brought in a lot of performance improvements across > multiple commands. However, file based history continues to be a performance > pain point, especially in large repositories. > > Adopting changed path bloom filters has been discussed on the list before, > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. > Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and > presents an updated and more polished RFC version of the feature. > > Performance Gains: We tested the performance of git log -- path on the git > repo, the linux repo and some internal large repos, with a variety of paths > of varying depths. > > On the git and linux repos: We observed a 2x to 5x speed up. > > On a large internal repo with files seated 6-10 levels deep in the tree: We > observed 10x to 20x speed ups, with some paths going up to 28 times faster. > > Future Work (not included in the scope of this series): > > 1. Supporting multiple path based revision walk > 2. Adopting it in git blame logic. > 3. Interactions with line log git log -L > > This series is intended to start the conversation and many of the commit > messages include specific call outs for suggestions and thoughts. > > Cheers! Garima Singh > > [1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/ > [2] > https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/ > > Garima Singh (9): > commit-graph: add --changed-paths option to write > commit-graph: write changed paths bloom filters > commit-graph: use MAX_NUM_CHUNKS > commit-graph: document bloom filter format > commit-graph: write changed path bloom filters to commit-graph file. > commit-graph: test commit-graph write --changed-paths > commit-graph: reuse existing bloom filters during write. > revision.c: use bloom filters to speed up path based revision walks > commit-graph: add GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS test flag > > Documentation/git-commit-graph.txt | 5 + > .../technical/commit-graph-format.txt | 17 ++ > Makefile | 1 + > bloom.c | 257 +++++++++++++++++ > bloom.h | 51 ++++ > builtin/commit-graph.c | 9 +- > ci/run-build-and-tests.sh | 1 + > commit-graph.c | 116 +++++++- > commit-graph.h | 9 +- > revision.c | 67 ++++- > revision.h | 5 + > t/README | 3 + > t/helper/test-read-graph.c | 4 + > t/t4216-log-bloom.sh | 77 ++++++ > t/t5318-commit-graph.sh | 2 + > t/t5324-split-commit-graph.sh | 1 + > t/t5325-commit-graph-bloom.sh | 258 ++++++++++++++++++ > 17 files changed, 875 insertions(+), 8 deletions(-) > create mode 100644 bloom.c > create mode 100644 bloom.h > create mode 100755 t/t4216-log-bloom.sh > create mode 100755 t/t5325-commit-graph-bloom.sh > > > base-commit: b02fd2accad4d48078671adf38fe5b5976d77304 > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-497%2Fgarimasi514%2FcoreGit-bloomFilters-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-497/garimasi514/coreGit-bloomFilters-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/497 >