Message ID: pull.676.v5.git.1609154168.gitgitgadget@gmail.com
Series: Implement Corrected Commit Date
On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Abhishek,

Thank you for this version. I appreciate your hard work on this topic,
especially after GSoC ended and you returned to being a full-time student.

My hope was that I could completely approve this series and only provide
forward-fixes from here on out, as necessary. I think there are a few minor
typos that you might want to address, but I was also able to understand your
intention.

I did call out a particular SEGFAULT I hit that I have been unable to
replicate. I saw it both in my copy of torvalds/linux and of
chromium/chromium. I have the file for chromium/chromium that is in a bad
state where a GDAT value includes the bit saying it should be in the long
offsets chunk, but that chunk doesn't exist. Further, that chunk doesn't
exist in a from-scratch write.

I'm now taking backups of my existing commit-graph files before any later
test, but it doesn't repro for my Git repository or any other repo I try on
purpose.

However, I did some performance testing to double-check your numbers. I sent
a patch [1] that helps with some of the hard numbers.

[1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@gmail.com/

The big question is whether the overhead from using a slab to store the
generation values is worth it. I still think it is, for these reasons:

1. Generation number v2 is measurably better than v1 in most user cases.

2. Generation number v2 is slower than using committer date due to the
   overhead, but _guarantees correctness_.

I like to use "git log --graph -<N>" to compare against topological levels
(v1), for various levels of <N>. When <N> is small, we hope to minimize the
amount we need to walk, using the extra commit-date information as an
assistance. Repos like git/git and torvalds/linux follow the philosophy of
"base your changes on the oldest applicable commit" enough that v1 struggles
sometimes.

git/git: N=1000

Benchmark #1: baseline
  Time (mean ± σ):     100.3 ms ±   4.2 ms    [User: 89.0 ms, System: 11.3 ms]
  Range (min … max):    94.5 ms … 105.1 ms    28 runs

Benchmark #2: test
  Time (mean ± σ):      35.8 ms ±   3.1 ms    [User: 29.6 ms, System: 6.2 ms]
  Range (min … max):    29.8 ms …  40.6 ms    81 runs

Summary
  'test' ran
    2.80 ± 0.27 times faster than 'baseline'

This is a dramatic improvement! Using my topo-walk stats commit, I see that
v1 walks 58,805 commits as part of the in-degree walk while v2 only walks
4,335 commits!

torvalds/linux: N=1000 (starting at v5.10)

Benchmark #1: baseline
  Time (mean ± σ):      90.8 ms ±   3.7 ms    [User: 75.2 ms, System: 15.6 ms]
  Range (min … max):    85.2 ms …  96.2 ms    31 runs

Benchmark #2: test
  Time (mean ± σ):      49.2 ms ±   3.5 ms    [User: 36.9 ms, System: 12.3 ms]
  Range (min … max):    42.9 ms …  54.0 ms    61 runs

Summary
  'test' ran
    1.85 ± 0.15 times faster than 'baseline'

Similarly, v1 walked 38,161 commits compared to 4,340 by v2.

If I increase N to something like 10,000, then usually these values get
washed out due to the width of the parallel topics.
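[The "corrected commit date" being compared here bumps each committer date
until it strictly exceeds the values of all parents. A minimal sketch of that
rule, with an illustrative struct and a naive recursion (no memoization, and
none of these names are from the series itself):]

#include <stdint.h>
#include <stddef.h>

struct example_commit {
	int64_t committer_date;            /* seconds since the epoch */
	struct example_commit **parents;   /* NULL-terminated array */
};

/*
 * Corrected commit date: the committer date, bumped until it strictly
 * exceeds every parent's corrected date. This restores the guarantee
 * that if A can reach B, then corrected(A) > corrected(B), even when
 * clocks were skewed. Git computes these values once, during the
 * commit-graph write; memoization is omitted here for brevity.
 */
static int64_t corrected_date(const struct example_commit *c)
{
	int64_t date = c->committer_date;
	size_t i;

	for (i = 0; c->parents && c->parents[i]; i++) {
		int64_t p = corrected_date(c->parents[i]) + 1;
		if (p > date)
			date = p;
	}
	return date;
}

[Because the values usually coincide with real committer dates, a walk cut
off by corrected dates stops about as early as a date heuristic would, while
remaining provably correct.]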
The place we were still using commit-date as a heuristic was
paint_down_to_common, which caused a regression the first time we used v1,
at least for certain cases.

Specifically, computing the merge-base in torvalds/linux between v4.8 and
v4.9 hit a strangeness about a pair of recent commits both based on a very
old commit, where the generation numbers forced walking farther than
necessary. This doesn't happen with v2, but we see the overhead cost of the
slabs:

Benchmark #1: baseline
  Time (mean ± σ):     112.9 ms ±   2.8 ms    [User: 96.5 ms, System: 16.3 ms]
  Range (min … max):   107.7 ms … 118.0 ms    26 runs

Benchmark #2: test
  Time (mean ± σ):     147.1 ms ±   5.2 ms    [User: 132.7 ms, System: 14.3 ms]
  Range (min … max):   141.4 ms … 162.2 ms    18 runs

Summary
  'baseline' ran
    1.30 ± 0.06 times faster than 'test'

The overhead still exists for a more recent pair of versions (v5.0 and v5.1):

Benchmark #1: baseline
  Time (mean ± σ):      25.1 ms ±   3.2 ms    [User: 18.6 ms, System: 6.5 ms]
  Range (min … max):    19.0 ms …  32.8 ms    99 runs

Benchmark #2: test
  Time (mean ± σ):      33.3 ms ±   3.3 ms    [User: 26.5 ms, System: 6.9 ms]
  Range (min … max):    27.0 ms …  38.4 ms    105 runs

Summary
  'baseline' ran
    1.33 ± 0.22 times faster than 'test'

I still think this overhead is worth it. In case not everyone agrees, it
_might_ be worth a command-line option to skip the GDAT chunk. That would,
however, prevent us from eventually weaning entirely off generation number
v1 and letting the commit date take the full 64-bit column (instead of only
34 bits, saving 30 for topo-levels).

Again, such a modification should not be considered required for this series.

> ----------------------------------------------------------------------------
>
> Improvements left for a future series:
>
>  * Save commits with generation data overflow and extra edge commits instead
>    of looping over all commits. cf. 858sbel67n.fsf@gmail.com
>  * Verify both topological levels and corrected commit dates when present.
>    cf. 85pn4tnk8u.fsf@gmail.com

These seem like reasonable things to delay for a later series
or for #leftoverbits

Thanks,
-Stolee
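[The bad state Stolee describes above -- a GDAT entry with its overflow bit
set while the long-offsets chunk is absent -- is exactly the case a reader
can reject instead of crashing on. A defensive sketch; the macro and
function names are illustrative, not the series' actual code:]

#include <stdint.h>
#include <stddef.h>

#define GDAT_OVERFLOW_BIT (1U << 31)

/*
 * Defensive read of a GDAT entry. When the most significant bit is
 * set, the remaining 31 bits index the long-offsets (overflow) chunk.
 * In the corrupt state described above that chunk is absent, so a
 * reader that follows the index anyway dereferences a NULL chunk
 * pointer -- the likely shape of the reported SEGFAULT.
 */
static int read_gdat_offset(uint32_t value,
			    const uint64_t *overflow, /* NULL if chunk absent */
			    size_t overflow_nr,
			    int64_t *out)
{
	if (value & GDAT_OVERFLOW_BIT) {
		uint32_t pos = value & ~GDAT_OVERFLOW_BIT;

		if (!overflow || pos >= overflow_nr)
			return -1; /* corrupt commit-graph: report, don't crash */
		*out = (int64_t)overflow[pos];
	} else {
		*out = value;
	}
	return 0;
}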
On Tue, Dec 29, 2020 at 11:35:56PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
>
> Abhishek,
>
> Thank you for this version. I appreciate your hard work on this topic,
> especially after GSoC ended and you returned to being a full-time student.
>
> My hope was that I could completely approve this series and only provide
> forward-fixes from here on out, as necessary. I think there are a few minor
> typos that you might want to address, but I was also able to understand your
> intention.
>
> I did call out a particular SEGFAULT I hit that I have been unable to
> replicate. I saw it both in my copy of torvalds/linux and of
> chromium/chromium. I have the file for chromium/chromium that is in a bad
> state where a GDAT value includes the bit saying it should be in the long
> offsets chunk, but that chunk doesn't exist. Further, that chunk doesn't
> exist in a from-scratch write.

I hope that validating mixed generation chains while writing, as well, was
enough to fix the SEGFAULT.

> I'm now taking backups of my existing commit-graph files before any later
> test, but it doesn't repro for my Git repository or any other repo I try on
> purpose.
>
> However, I did some performance testing to double-check your numbers. I sent
> a patch [1] that helps with some of the hard numbers.
>
> [1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@gmail.com/
>
> The big question is whether the overhead from using a slab to store the
> generation values is worth it. I still think it is, for these reasons:
>
> 1. Generation number v2 is measurably better than v1 in most user cases.
>
> 2. Generation number v2 is slower than using committer date due to the
>    overhead, but _guarantees correctness_.
>
> I like to use "git log --graph -<N>" to compare against topological levels
> (v1), for various levels of <N>. When <N> is small, we hope to minimize the
> amount we need to walk, using the extra commit-date information as an
> assistance. Repos like git/git and torvalds/linux follow the philosophy of
> "base your changes on the oldest applicable commit" enough that v1 struggles
> sometimes.
>
> git/git: N=1000
>
> Benchmark #1: baseline
>   Time (mean ± σ):     100.3 ms ±   4.2 ms    [User: 89.0 ms, System: 11.3 ms]
>   Range (min … max):    94.5 ms … 105.1 ms    28 runs
>
> Benchmark #2: test
>   Time (mean ± σ):      35.8 ms ±   3.1 ms    [User: 29.6 ms, System: 6.2 ms]
>   Range (min … max):    29.8 ms …  40.6 ms    81 runs
>
> Summary
>   'test' ran
>     2.80 ± 0.27 times faster than 'baseline'
>
> This is a dramatic improvement! Using my topo-walk stats commit, I see that
> v1 walks 58,805 commits as part of the in-degree walk while v2 only walks
> 4,335 commits!
>
> torvalds/linux: N=1000 (starting at v5.10)
>
> Benchmark #1: baseline
>   Time (mean ± σ):      90.8 ms ±   3.7 ms    [User: 75.2 ms, System: 15.6 ms]
>   Range (min … max):    85.2 ms …  96.2 ms    31 runs
>
> Benchmark #2: test
>   Time (mean ± σ):      49.2 ms ±   3.5 ms    [User: 36.9 ms, System: 12.3 ms]
>   Range (min … max):    42.9 ms …  54.0 ms    61 runs
>
> Summary
>   'test' ran
>     1.85 ± 0.15 times faster than 'baseline'
>
> Similarly, v1 walked 38,161 commits compared to 4,340 by v2.
>
> If I increase N to something like 10,000, then usually these values get
> washed out due to the width of the parallel topics.

That's not too bad, as large N would be needed rather infrequently.
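[The slab overhead being weighed in this exchange comes from storing the
generation values out-of-line from struct commit. A toy model of the idea --
deliberately simpler than Git's commit-slab macros, with invented names and
error handling omitted:]

#include <stdlib.h>
#include <stdint.h>

#define SLAB_CHUNK 1024

struct toy_commit {
	uint32_t index;   /* position assigned when the commit is parsed */
};

struct gen_slab {
	int64_t **chunks; /* chunked side arrays holding per-commit values */
	size_t chunk_nr;
};

/*
 * Return a pointer to this commit's slot, growing the slab on demand.
 * Every lookup pays a division, a bounds check, and two pointer hops;
 * that indirection is the per-commit overhead visible in the
 * merge-base benchmarks, the price of not widening struct commit.
 */
static int64_t *gen_slab_at(struct gen_slab *s, const struct toy_commit *c)
{
	size_t chunk = c->index / SLAB_CHUNK;

	if (chunk >= s->chunk_nr) {
		s->chunks = realloc(s->chunks,
				    (chunk + 1) * sizeof(*s->chunks));
		while (s->chunk_nr <= chunk)
			s->chunks[s->chunk_nr++] =
				calloc(SLAB_CHUNK, sizeof(int64_t));
	}
	return &s->chunks[chunk][c->index % SLAB_CHUNK];
}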
> The place we were still using commit-date as a heuristic was
> paint_down_to_common, which caused a regression the first time we used v1,
> at least for certain cases.
>
> Specifically, computing the merge-base in torvalds/linux between v4.8 and
> v4.9 hit a strangeness about a pair of recent commits both based on a very
> old commit, where the generation numbers forced walking farther than
> necessary. This doesn't happen with v2, but we see the overhead cost of the
> slabs:
>
> Benchmark #1: baseline
>   Time (mean ± σ):     112.9 ms ±   2.8 ms    [User: 96.5 ms, System: 16.3 ms]
>   Range (min … max):   107.7 ms … 118.0 ms    26 runs
>
> Benchmark #2: test
>   Time (mean ± σ):     147.1 ms ±   5.2 ms    [User: 132.7 ms, System: 14.3 ms]
>   Range (min … max):   141.4 ms … 162.2 ms    18 runs
>
> Summary
>   'baseline' ran
>     1.30 ± 0.06 times faster than 'test'
>
> The overhead still exists for a more recent pair of versions (v5.0 and v5.1):
>
> Benchmark #1: baseline
>   Time (mean ± σ):      25.1 ms ±   3.2 ms    [User: 18.6 ms, System: 6.5 ms]
>   Range (min … max):    19.0 ms …  32.8 ms    99 runs
>
> Benchmark #2: test
>   Time (mean ± σ):      33.3 ms ±   3.3 ms    [User: 26.5 ms, System: 6.9 ms]
>   Range (min … max):    27.0 ms …  38.4 ms    105 runs
>
> Summary
>   'baseline' ran
>     1.33 ± 0.22 times faster than 'test'
>
> I still think this overhead is worth it. In case not everyone agrees, it
> _might_ be worth a command-line option to skip the GDAT chunk. That would,
> however, prevent us from eventually weaning entirely off generation number
> v1 and letting the commit date take the full 64-bit column (instead of only
> 34 bits, saving 30 for topo-levels).

Thank you for the detailed benchmarking and discussion. I don't think there
is any disagreement on the utility of corrected commit dates so far.

We will not run out of the 34 bits for the commit date until the year 2514,
so I am not exactly worried about weaning off generation number v1 anytime
soon.

> Again, such a modification should not be considered required for this series.
>
> > ----------------------------------------------------------------------------
> >
> > Improvements left for a future series:
> >
> >  * Save commits with generation data overflow and extra edge commits instead
> >    of looping over all commits. cf. 858sbel67n.fsf@gmail.com
> >  * Verify both topological levels and corrected commit dates when present.
> >    cf. 85pn4tnk8u.fsf@gmail.com
>
> These seem like reasonable things to delay for a later series
> or for #leftoverbits
>
> Thanks,
> -Stolee

Thanks
- Abhishek
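[The year-2514 figure is easy to verify: 2^34 seconds is roughly 544 years
past the 1970 epoch. A standalone check, not part of the series:]

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* The CDAT chunk stores committer dates in 34 bits, so the
	 * largest representable time is 2^34 - 1 seconds after the
	 * Unix epoch. */
	uint64_t max_date = (1ULL << 34) - 1;
	double years = (double)max_date / (365.25 * 24 * 60 * 60);

	printf("34-bit dates run out ~%.0f years after 1970, around %.0f\n",
	       years, 1970 + years);   /* ~544 years, i.e. ~2514 */
	return 0;
}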