Message ID | pull.845.v2.git.1615248599.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Optimization batch 9: avoid detecting irrelevant renames | expand |
On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:> The basic idea here is: > > We only need expensive rename detection on the subset of files changed on > both sides of history (for the most part). > > This is because: > > 1. The primary reason for rename detection in merges is enabling three-way > content merges > 2. The purpose of three-way content merges is reconciling changes when > > both sides of history modified some file 3. If a file was only modified by > the side that renamed the file, then detecting the rename is irrelevant; > we'll get the same answer without knowing about the rename. 4. (Well...there > are rare cases where we need the rename for reasons other than three-way > content merges. Patch 5 explains those.) Makes sense. Don't compute information you won't need. I look forward to trying to figure out the special cases here. > Before Series After Series > no-renames: 12.596 s ± 0.061 s 5.680 s ± 0.096 s > mega-renames: 130.465 s ± 0.259 s 13.812 s ± 0.162 s > just-one-mega: 3.958 s ± 0.010 s 506.0 ms ± 3.9 ms These are _very_ impressive numbers for such a "simple" idea. > However, interestingly, if we had ignored the basename-guided rename > detection optimizations[2][3], then this optimization series would have > improved the performance as follows: > > Before Basename Series After Just This Series > no-renames: 13.815 s ± 0.062 s 5.728 s ± 0.104 s > mega-renames: 1799.937 s ± 0.493 s 18.213 s ± 0.139 s > just-one-mega 51.289 s ± 0.019 s 891.9 ms ± 7.0 ms And here it is even more impressive. I see that your optimizations are having combined effects but also are doing valuable things on their own. > We get best results by prioritizing them as follows: > > 1. exact rename detection > 2. skip-because-unnecessary > 3. basename-guided rename detection This makes sense to me, since even the basename-guided rename is doing some non-trivial work. It would be good to reduce that effort. > it means that our remaining optimization potential is > somewhat limited, and subsequent optimization series will have to fight for > much smaller gains. This is a good place to end up. Let the code rest for a bit after we are done here, and maybe we'll find new cases to care about later. We could chase the long tail forever, but these steps are a huge accomplishment! Getting to reading now. -Stolee
On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote: > This series depends textually on ort-perf-batch-8, but semantically it's > almost completely unrelated and can be reviewed without any familiarity with > any of the previous patch series. > > There are no changes since v1; it's just a resend just over a week later to > bump it so it isn't lost. > > === Basic Optimization idea === > > This series determines paths which meet special cases where detection of > renames is irrelevant, where the irrelevance is due to the fact that the > merge machinery will arrive at the same result regardless of whether a > rename is detected for any of those paths. This series represents > "Optimization #2" from my Git Merge 2020 talk[1], though this series has > some improvements on the optimization relative to what I had at that time. > > The basic idea here is: > > We only need expensive rename detection on the subset of files changed on > both sides of history (for the most part). I've taken time this morning to re-read some of the patches. I have a grasp on the idea of the optimization and the code looks well presented and correct. The only issue I have is that there are no additional tests to ensure that these scenarios are being tested and are checked to return the correct results. Naturally, it seems we are not testing the ORT strategy by default, and if I do enable it, that causes test failures still. I wonder how much we should be merging these optimizations before the full test suite can pass with ORT as the default. Then, we can check to see if small mutations can be caught by the test suite. In particular, everything in this optimization seems to revolve around this condition in add_pair(): if (content_relevant || location_relevant) strset_add(&renames->relevant_sources[side], pathname); I'd prefer to have test cases that cover all four options for the two boolean values on this operator. Mostly, I'd like to know that if I delete either side of the || operator (or change it to &&) then we would have a test failure. Thanks, -Stolee
On Mon, Mar 8, 2021 at 4:10 PM Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> wrote: > > This series depends textually on ort-perf-batch-8, but semantically it's > almost completely unrelated and can be reviewed without any familiarity with > any of the previous patch series. > > There are no changes since v1; it's just a resend just over a week later to > bump it so it isn't lost. > > === Basic Optimization idea === > > This series determines paths which meet special cases where detection of > renames is irrelevant, where the irrelevance is due to the fact that the > merge machinery will arrive at the same result regardless of whether a > rename is detected for any of those paths. This series represents > "Optimization #2" from my Git Merge 2020 talk[1], though this series has > some improvements on the optimization relative to what I had at that time. > > The basic idea here is: > > We only need expensive rename detection on the subset of files changed on > both sides of history (for the most part). I know this series was already reviewed and even a subsequent series was reviewed, but I thought I'd insert a bit of history trivia: I first submitted this optimization in late 2017 as an RFC at https://lore.kernel.org/git/20171110222156.23221-9-newren@gmail.com/. Unfortunately I had only handled the "for the most part" piece, not the other special cases. I knew not handling those other cases caused bugs, but didn't readily find a solution for them at the time. I kept mulling it over periodically despite being switched onto other projects; eventually I weaseled my way into being able to work on it again and with some effort was able to work out the other necessary conditions and audit the code to verify I had all the cases covered. Those "other cases" were much more easily expressed in the context of merge-ort's data structures than merge-recursive's, in part because merge-ort's data structures were picked to help me solve this optimization problem and the various known failing testcases that I wanted to fix.