[v2,0/8] Optimization batch 9: avoid detecting irrelevant renames


Message ID

pull.845.v2.git.1615248599.gitgitgadget@gmail.com (mailing list archive)

Headers

Message-Id: <pull.845.v2.git.1615248599.gitgitgadget@gmail.com>
In-Reply-To: <pull.845.git.1614484707.gitgitgadget@gmail.com>
References: <pull.845.git.1614484707.gitgitgadget@gmail.com>
From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
Date: Tue, 09 Mar 2021 00:09:51 +0000
Subject: [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant
 renames
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Fcc: Sent
To: git@vger.kernel.org
Cc: Derrick Stolee <dstolee@microsoft.com>,
 Jonathan Tan <jonathantanmy@google.com>, Taylor Blau <me@ttaylorr.com>,
 Junio C Hamano <gitster@pobox.com>,
 Ævar Arnfjörð Bjarmason <avarab@gmail.com>,
 Elijah Newren <newren@gmail.com>, Elijah Newren <newren@gmail.com>
Precedence: bulk

Series

Optimization batch 9: avoid detecting irrelevant renames

Message

Elijah Newren via GitGitGadget March 9, 2021, 12:09 a.m. UTC

This series depends textually on ort-perf-batch-8, but semantically it's
almost completely unrelated and can be reviewed without any familiarity with
any of the previous patch series.

There are no changes since v1; it's just a resend just over a week later to
bump it so it isn't lost.

=== Basic Optimization idea ===

This series determines paths which meet special cases where detection of
renames is irrelevant, where the irrelevance is due to the fact that the
merge machinery will arrive at the same result regardless of whether a
rename is detected for any of those paths. This series represents
"Optimization #2" from my Git Merge 2020 talk[1], though this series has
some improvements on the optimization relative to what I had at that time.

The basic idea here is:

We only need expensive rename detection on the subset of files changed on
both sides of history (for the most part).

This is because:

 1. The primary reason for rename detection in merges is enabling three-way
    content merges
 2. The purpose of three-way content merges is reconciling changes when
    both sides of history modified some file
 3. If a file was only modified by the side that renamed the file, then
    detecting the rename is irrelevant; we'll get the same answer without
    knowing about the rename.
 4. (Well...there are rare cases where we need the rename for reasons other
    than three-way content merges. Patch 5 explains those.)

Since commits being rebased or cherry-picked tend to only modify a few
files, this optimization tends to be particularly effective for rebases and
cherry-picks.

Basing rename detection on what the other side of history did to a file
means that extra information needs to be fed from merge-ort to
diffcore-rename in order to take advantage of such an optimization.
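As an illustration only (this is not code from the series), the core test can be sketched as follows. The struct, its fields, and both function names are hypothetical, and the sketch deliberately ignores the rare special cases that Patch 5 covers: a path is treated as a potential rename source on one side when that side no longer has it, and it is worth running rename detection for it only if the other side changed its content.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-path summary, relative to the merge base. */
struct path_info {
	const char *path;
	bool missing_on_side1;  /* side 1 no longer has this path (deleted or renamed) */
	bool missing_on_side2;  /* side 2 no longer has this path */
	bool modified_on_side1; /* side 1 kept the path but changed its content */
	bool modified_on_side2; /* side 2 kept the path but changed its content */
};

/*
 * A path is a potential rename source on `side` (1 or 2) when that side no
 * longer has it. Finding its rename only matters for the content merge if
 * the other side also changed the file.
 */
bool content_relevant(const struct path_info *p, int side)
{
	assert(p != NULL);
	assert(side == 1 || side == 2);
	if (side == 1)
		return p->missing_on_side1 && p->modified_on_side2;
	return p->missing_on_side2 && p->modified_on_side1;
}

/* Count the potential rename sources on `side` that still need detection. */
size_t count_relevant_sources(const struct path_info *paths, size_t n, int side)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (content_relevant(&paths[i], side))
			count++;
	return count;
}
```

In a rebase or cherry-pick, where one side typically touches only a few files, a filter of this shape would leave very few sources to examine, which matches the intuition given above.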

=== Comparison to previous series ===

This series differs from my two previous optimizations[2][3] (focusing on
basename-guided rename detection) in two important aspects:

 * there are no behavioral changes (there is no heuristic involved)
 * this optimization is merge specific (it does not help the diff/status/log
   family of commands, just merge/rebase/cherry-pick and such)

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improve the performance as follows:

                     Before Series           After Series
no-renames:       12.596 s ±  0.061 s     5.680 s ±  0.096 s
mega-renames:    130.465 s ±  0.259 s    13.812 s ±  0.162 s
just-one-mega:     3.958 s ±  0.010 s   506.0  ms ±  3.9  ms


However, interestingly, if we had ignored the basename-guided rename
detection optimizations[2][3], then this optimization series would have
improved the performance as follows:

               Before Basename Series   After Just This Series
no-renames:      13.815 s ±  0.062 s      5.728 s ±  0.104 s
mega-renames:  1799.937 s ±  0.493 s     18.213 s ±  0.139 s
just-one-mega:   51.289 s ±  0.019 s    891.9  ms ±  7.0  ms


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Competition between optimizations ===

We now have three major rename-related optimizations:

 * exact rename detection
 * basename-guided rename detection[2][3]
 * skip-because-unnecessary (this series)

It is possible for all three to potentially apply for specific paths (they
do for the majority of renamed paths in our testcases), but we cannot use
more than one for any given path. It turns out that the priority we give
each optimization is very important and can drastically affect performance.
We get best results by prioritizing them as follows:

 1. exact rename detection
 2. skip-because-unnecessary
 3. basename-guided rename detection
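
The ordering above can be pictured as a simple per-path dispatch. The following is a sketch under stated assumptions, not the actual control flow in diffcore-rename.c: the enum, the struct fields, and `pick_strategy()` are all hypothetical names, and `FULL_INEXACT` is added here only to stand for the ordinary inexact rename detection that remains when none of the three optimizations applies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for how a potential rename source gets handled. */
enum rename_strategy {
	EXACT_RENAME,      /* 1. exact rename detection */
	SKIP_UNNECESSARY,  /* 2. skip-because-unnecessary */
	BASENAME_GUIDED,   /* 3. basename-guided rename detection */
	FULL_INEXACT       /* fallback: ordinary inexact detection (assumed) */
};

/* Hypothetical facts about one source path; several may hold at once. */
struct candidate {
	bool has_exact_match;           /* an identical blob exists under a new name */
	bool relevant;                  /* the merge result depends on finding the rename */
	bool has_unique_basename_match; /* a single new file shares the basename */
};

/* Apply at most one optimization per path, in the priority order listed. */
enum rename_strategy pick_strategy(const struct candidate *c)
{
	assert(c != NULL);
	if (c->has_exact_match)
		return EXACT_RENAME;
	if (!c->relevant)
		return SKIP_UNNECESSARY;
	if (c->has_unique_basename_match)
		return BASENAME_GUIDED;
	return FULL_INEXACT;
}
```

Checking `relevant` before the basename match is what lets an irrelevant path avoid even the cheaper basename comparison.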

Some of the commit messages discuss this ordering and another minor variant
of the skip-because-unnecessary optimization that was tried (and resulted in
less effective performance gains than reported here), as well as some of the
preparatory work over the past few years that this series relies on in order
to enable this optimization.

=== Near optimal? ===

You may remember that there was a row labelled "everything else" from the
commit message of 557ac0350d that represented the maximum possible speed-up
from accelerating rename detection alone; as stated in that commit, those
rows represented how fast the code could be if we had somehow infinitely
parallelized the inexact rename detection. However, if you compare those
"maximum speedup" numbers to what we have above, you'll note that the
infinitely parallelized inexact rename detection would have been slightly
slower than the results we have now achieved. (The reason this is possible,
despite the fact that we still spend time in rename detection after our
optimizations, is because we implemented two optimizations outside of
diffcore_rename() along the way.) However, this good news does also come
with a downside -- it means that our remaining optimization potential is
somewhat limited, and subsequent optimization series will have to fight for
much smaller gains.

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf
[2]
https://lore.kernel.org/git/pull.843.git.1612651937.gitgitgadget@gmail.com/
[3]
https://lore.kernel.org/git/pull.844.git.1613289544.gitgitgadget@gmail.com/

Elijah Newren (8):
  diffcore-rename: enable filtering possible rename sources
  merge-ort: precompute subset of sources for which we need rename
    detection
  merge-ort: add data structures for an alternate tree traversal
  merge-ort: introduce wrappers for alternate tree traversal
  merge-ort: precompute whether directory rename detection is needed
  merge-ort: use relevant_sources to filter possible rename sources
  merge-ort: skip rename detection entirely if possible
  diffcore-rename: avoid doing basename comparisons for irrelevant
    sources

 diffcore-rename.c |  63 ++++++++++---
 diffcore.h        |   1 +
 merge-ort.c       | 236 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 285 insertions(+), 15 deletions(-)


base-commit: 4be565c472088d4144063b736308bf2a57331f45
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-845%2Fnewren%2Fort-perf-batch-9-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-845/newren/ort-perf-batch-9-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/845

Range-diff vs v1:

 1:  064fa5de1e20 = 1:  dab8e3c6aee5 diffcore-rename: enable filtering possible rename sources
 2:  69b42a41e83a = 2:  33c231331744 merge-ort: precompute subset of sources for which we need rename detection
 3:  042ce66011ef = 3:  89b43c9f75a0 merge-ort: add data structures for an alternate tree traversal
 4:  7673e4c23bbb = 4:  6497050c0012 merge-ort: introduce wrappers for alternate tree traversal
 5:  8dbf0a452545 = 5:  608d5a4c6ad7 merge-ort: precompute whether directory rename detection is needed
 6:  6b20977a5a81 = 6:  d62fdee45ad3 merge-ort: use relevant_sources to filter possible rename sources
 7:  d5486ab28462 = 7:  cd931286f24d merge-ort: skip rename detection entirely if possible
 8:  8fed92b62f37 = 8:  c443ba8abb89 diffcore-rename: avoid doing basename comparisons for irrelevant sources

Comments

Derrick Stolee March 9, 2021, 10:08 p.m. UTC | #1

On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:
> The basic idea here is:
> 
> We only need expensive rename detection on the subset of files changed on
> both sides of history (for the most part).
> 
> This is because:
> 
>  1. The primary reason for rename detection in merges is enabling three-way
>     content merges
>  2. The purpose of three-way content merges is reconciling changes when
>     both sides of history modified some file
>  3. If a file was only modified by the side that renamed the file, then
>     detecting the rename is irrelevant; we'll get the same answer without
>     knowing about the rename.
>  4. (Well...there are rare cases where we need the rename for reasons other
>     than three-way content merges. Patch 5 explains those.)

Makes sense. Don't compute information you won't need. I look forward to
trying to figure out the special cases here.
 
>                      Before Series           After Series
> no-renames:       12.596 s ±  0.061 s     5.680 s ±  0.096 s
> mega-renames:    130.465 s ±  0.259 s    13.812 s ±  0.162 s
> just-one-mega:     3.958 s ±  0.010 s   506.0  ms ±  3.9  ms

These are _very_ impressive numbers for such a "simple" idea.
 
> However, interestingly, if we had ignored the basename-guided rename
> detection optimizations[2][3], then this optimization series would have
> improved the performance as follows:
> 
>                Before Basename Series   After Just This Series
> no-renames:      13.815 s ±  0.062 s      5.728 s ±  0.104 s
> mega-renames:  1799.937 s ±  0.493 s     18.213 s ±  0.139 s
> just-one-mega:   51.289 s ±  0.019 s    891.9  ms ±  7.0  ms

And here it is even more impressive. I see that your optimizations are
having combined effects but also are doing valuable things on their
own.

> We get best results by prioritizing them as follows:
> 
>  1. exact rename detection
>  2. skip-because-unnecessary
>  3. basename-guided rename detection

This makes sense to me, since even the basename-guided rename is
doing some non-trivial work. It would be good to reduce that
effort.

> it means that our remaining optimization potential is
> somewhat limited, and subsequent optimization series will have to fight for
> much smaller gains.

This is a good place to end up. Let the code rest for a bit after
we are done here, and maybe we'll find new cases to care about
later. We could chase the long tail forever, but these steps are
a huge accomplishment!

Getting to reading now.

-Stolee

Derrick Stolee March 10, 2021, 3:08 p.m. UTC | #2

On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:
> This series depends textually on ort-perf-batch-8, but semantically it's
> almost completely unrelated and can be reviewed without any familiarity with
> any of the previous patch series.
> 
> There are no changes since v1; it's just a resend just over a week later to
> bump it so it isn't lost.
> 
> === Basic Optimization idea ===
> 
> This series determines paths which meet special cases where detection of
> renames is irrelevant, where the irrelevance is due to the fact that the
> merge machinery will arrive at the same result regardless of whether a
> rename is detected for any of those paths. This series represents
> "Optimization #2" from my Git Merge 2020 talk[1], though this series has
> some improvements on the optimization relative to what I had at that time.
> 
> The basic idea here is:
> 
> We only need expensive rename detection on the subset of files changed on
> both sides of history (for the most part).

I've taken time this morning to re-read some of the patches. I have a
grasp on the idea of the optimization and the code looks well presented
and correct.

The only issue I have is that there are no additional tests to ensure that
these scenarios are being tested and are checked to return the correct
results. Naturally, it seems we are not testing the ORT strategy by default,
and if I do enable it, that causes test failures still.

I wonder how much we should be merging these optimizations before the full
test suite can pass with ORT as the default. Then, we can check to see if
small mutations can be caught by the test suite. In particular, everything
in this optimization seems to revolve around this condition in add_pair():

		if (content_relevant || location_relevant)
			strset_add(&renames->relevant_sources[side], pathname);

I'd prefer to have test cases that cover all four options for the two boolean
values on this operator. Mostly, I'd like to know that if I delete either side
of the || operator (or change it to &&) then we would have a test failure.
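
For illustration, here is a standalone sketch of why all four combinations are needed. It is not a test from git's suite, and every function name in it is hypothetical: it models the condition and the three mutations mentioned above as plain boolean functions, then counts the input combinations on which each mutation disagrees with the original.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The condition as quoted from add_pair(), reduced to a pure function. */
bool is_relevant(bool content_relevant, bool location_relevant)
{
	return content_relevant || location_relevant;
}

/* Mutation: || changed to &&. */
bool mutant_and(bool content_relevant, bool location_relevant)
{
	return content_relevant && location_relevant;
}

/* Mutation: right-hand side of || deleted. */
bool mutant_content_only(bool content_relevant, bool location_relevant)
{
	(void)location_relevant;
	return content_relevant;
}

/* Mutation: left-hand side of || deleted. */
bool mutant_location_only(bool content_relevant, bool location_relevant)
{
	(void)content_relevant;
	return location_relevant;
}

/* Count the input combinations (out of four) where `candidate` differs. */
int count_disagreements(bool (*candidate)(bool, bool))
{
	int c, l, differing = 0;

	assert(candidate != NULL);
	for (c = 0; c <= 1; c++)
		for (l = 0; l <= 1; l++)
			if (candidate(c, l) != is_relevant(c, l))
				differing++;
	return differing;
}
```

Each single-sided mutation differs from the original on exactly one combination, so a test suite that lacks either the (true, false) case or the (false, true) case would let one of them through.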

Thanks,
-Stolee

Elijah Newren March 15, 2021, 5:10 p.m. UTC | #3

On Mon, Mar 8, 2021 at 4:10 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This series depends textually on ort-perf-batch-8, but semantically it's
> almost completely unrelated and can be reviewed without any familiarity with
> any of the previous patch series.
>
> There are no changes since v1; it's just a resend just over a week later to
> bump it so it isn't lost.
>
> === Basic Optimization idea ===
>
> This series determines paths which meet special cases where detection of
> renames is irrelevant, where the irrelevance is due to the fact that the
> merge machinery will arrive at the same result regardless of whether a
> rename is detected for any of those paths. This series represents
> "Optimization #2" from my Git Merge 2020 talk[1], though this series has
> some improvements on the optimization relative to what I had at that time.
>
> The basic idea here is:
>
> We only need expensive rename detection on the subset of files changed on
> both sides of history (for the most part).

I know this series was already reviewed and even a subsequent series
was reviewed, but I thought I'd insert a bit of history trivia:

I first submitted this optimization in late 2017 as an RFC at
https://lore.kernel.org/git/20171110222156.23221-9-newren@gmail.com/.
Unfortunately I had only handled the "for the most part" piece, not
the other special cases.  I knew not handling those other cases caused
bugs, but didn't readily find a solution for them at the time.  I kept
mulling it over periodically despite being switched onto other
projects; eventually I weaseled my way into being able to work on it
again and with some effort was able to work out the other necessary
conditions and audit the code to verify I had all the cases covered.
Those "other cases" were much more easily expressed in the context of
merge-ort's data structures than merge-recursive's, in part because
merge-ort's data structures were picked to help me solve this
optimization problem and the various known failing testcases that I
wanted to fix.