[0/7] Optimization batch 11: avoid repeatedly detecting same renames

Message ID	pull.859.git.1616621553.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <pull.859.git.1616621553.gitgitgadget@gmail.com> From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> Date: Wed, 24 Mar 2021 21:32:26 +0000 Subject: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fcc: Sent To: git@vger.kernel.org Cc: Derrick Stolee <dstolee@microsoft.com>, Jonathan Tan <jonathantanmy@google.com>, Taylor Blau <me@ttaylorr.com>, Elijah Newren <newren@gmail.com> Precedence: bulk
Series	Optimization batch 11: avoid repeatedly detecting same renames \| expand [0/7] Optimization batch 11: avoid repeatedly detecting same renames [1/7] merge-ort: add data structures for in-memory caching of rename detection [2/7] merge-ort: populate caches of rename detection results [3/7] merge-ort: add code to check for whether cached renames can be reused [4/7] merge-ort: avoid accidental API mis-use [5/7] merge-ort: preserve cached renames for the appropriate side [6/7] merge-ort: add helper functions for using cached renames [7/7] merge-ort, diffcore-rename: employ cached renames when possible

John Cai via GitGitGadget March 24, 2021, 9:32 p.m. UTC

This series builds on ort-readiness, but it's semantically orthogonal to the
previous series and represents a new independent optimization.

=== Basic Optimization idea ===

This series avoids repeatedly detecting the same renames in a sequence of
merges such as a rebase or cherry-pick of several commits. When there are
many renames between the old base and the new base, traditionally all those
renames are re-detected for every commit that is transplanted. This
optimization avoids redoing that work.

This represents "Optimization #4" from my Git Merge 2020 talk[1]; the
details are a bit more involved than I realized at the time, but the high
level idea is the same.

=== Comparison to previous series ===

I previously noted that we had three major rename-related optimizations:

 * exact rename detection (applies when unmodified on renamed side)
 * skip-because-irrelevant (applies when unmodified on unrenamed side)
 * basename-guided rename detection (applies when basename unchanged)

This one adds a fourth (remember-renames), with some interesting properties:

 * unlike basename-guided rename detection, there are no behavioral changes
   (there is no heuristic involved)[2].

 * like skip-because-irrelevant, this optimization does not apply to all git
   commands using the rename machinery. In fact, this one is even more
   restrictive since it is ONLY useful for rebases and cherry-picks (not
   even merges), and only for second and later commits in a linear series.

 * unlike the three previous optimizations, there are no requirements about
   the types of changes done to the file; it just caches renames on the
   "upstream" side of history for subsequent commit picking.

It's also worth noting despite wording about "remembering" or "caching"
renames, that this optimization does NOT write this cache to disk; it's an
in-memory only cache. When the rebase or cherry-pick completes (or hits a
conflict and stops), the cache is discarded.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:        5.665 s ±  0.129 s     5.624 s ±  0.077 s 
mega-renames:     11.435 s ±  0.158 s    10.213 s ±  0.032 s
just-one-mega:   494.2  ms ±  6.1  ms   497.6  ms ±  5.3  ms


By design, this optimization could not help the just-one-mega testcase. The
gains for the other two testcases may look somewhat smaller than one would
expect given the description (only ~10% for the mega-renames testcase), but
the point was to spend less time detecting renames...and there just wasn't
that much time spent in renames for these testcases before this series for
us to remove. However, if we undid the basename-guided rename detection and
skip-because-unnecessary optimizations, then this series alone would have
improved performance as follows:

                   Before Basename Series   After Just This Series
    no-renames:      13.815 s ±  0.062 s      5.814 s ±  0.094 s
    mega-renames:  1799.937 s ±  0.493 s    303.225 s ±  1.330 s


Showing that this optimization has the ability to improve things when the
other optimizations do not apply. In fact, when I originally implemented
this optimization, it improved the mega-renames testcase by a factor of 2
(at the time, I did not have all the optimizations from ort-perf-batch-7
thru ort-perf-batch-10 in their current shape).

As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Further discussion of results ===

If we change our focus from absolute time taken, to the percentage of
overall time spent on rename detection, then we find the following picture
comparing our starting point at the beginning of the performance work to
what we achieve at the end of this series:

       Percentage of time spent on rename detection
   
                  commit 557ac0350d      After this Series
no-renames:             39.4%                   0.2%
mega-renames:           96.6%                   2.7%
just-one-mega:          95.0%                   9.0%


Since this optimization is only applicable for the first two testcases
(because the third only involves rebasing a single commit), even if this
optimization had removed all the remaining time in rename detection it would
have only sped it up the mega-renames case by ~12% rather than the 10% it
achieved. This table makes it clear that our attempts to accelerate rename
detection have succeeded, and any further work to accelerate merges needs to
start concentrating on other areas.

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

[2] Well, almost no changes. There's technically a very narrow way that this
could change the behavior; see the really long "Technically," bullet point
in patch 3 for discussion of this.

Elijah Newren (7):
  merge-ort: add data structures for in-memory caching of rename
    detection
  merge-ort: populate caches of rename detection results
  merge-ort: add code to check for whether cached renames can be reused
  merge-ort: avoid accidental API mis-use
  merge-ort: preserve cached renames for the appropriate side
  merge-ort: add helper functions for using cached renames
  merge-ort, diffcore-rename: employ cached renames when possible

 diffcore-rename.c |  22 +++-
 diffcore.h        |   3 +-
 merge-ort.c       | 273 +++++++++++++++++++++++++++++++++++++++++++---
 merge-ort.h       |   2 +
 4 files changed, 282 insertions(+), 18 deletions(-)


base-commit: c2d2a1ccaea70b7dc8c0539ba9d3a132f9687040
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-859%2Fnewren%2Fort-perf-batch-11-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-859/newren/ort-perf-batch-11-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/859

Junio C Hamano March 24, 2021, 10:04 p.m. UTC | #1

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> === Basic Optimization idea ===
>
> This series avoids repeatedly detecting the same renames in a sequence of
> merges such as a rebase or cherry-pick of several commits. When there are
> many renames between the old base and the new base, traditionally all those
> renames are re-detected for every commit that is transplanted. This
> optimization avoids redoing that work.

Unless this section is easily understandable, the readers have no
incentive to read on, but the above is a bit too hand wavy.

> This one adds a fourth (remember-renames), with some interesting properties:
>
>  * unlike basename-guided rename detection, there are no behavioral changes
>    (there is no heuristic involved)[2].
>
>  * like skip-because-irrelevant, this optimization does not apply to all git
>    commands using the rename machinery. In fact, this one is even more
>    restrictive since it is ONLY useful for rebases and cherry-picks (not
>    even merges), and only for second and later commits in a linear series.

So, is it correct to understand that one case this would help is
this scenario?

 ---o---o---o---X---o---o---o---O ours
     \
      A---B---C topic

where there is a side branch A--B--C that touched some files, while
on our side, there is a commit X that is unknown to the side branch
that renamed these files.  Now we want to transplant the side topic
to the tip of our history, replaying the changes A--B--C made to
these files under their original name to the corresponding files
that have been renamed.

And each step in this "rebase" is a 3-way merge of commits A, B and
C onto HEAD, using the parent of the commit being cherrk-picked as a
virtual common ancestor.  Which means

 - To transplant A (i.e. the first step), we'd compare the diff of
   A^..O (i.e. what our side did, including the renames done at X)
   and diff of A^..A (i.e. what the first commit did in the range),
   and the former does quite a lot of rename detection.

 - After transplanting B (i.e. the second step), then we'd compare
   the diff of A^..A' (where A' is A cherry-picked on O, i.e. the
   result of the previous step).  If we are lucky, O..A' did not
   rename anything so the renames done in A^..O (i.e. what we
   detected during the first step) and A^..A' (i.e. what we should
   be computing for this second step) should be quite similar.

   If we assume that the "quite similar" is good enough, then we can
   blindly reuse the record of "<path in A^> correspnds to <path in
   O>" as if it were "<path in A^> corresponds to <path in A'>".

 - Do the same for C, pretending that renames discovered between A^
   and O is identical to the renames between A^ and B' (i.e. the
   result of cherry-picking A--B on top of O).

Elijah Newren March 24, 2021, 11:25 p.m. UTC | #2

On Wed, Mar 24, 2021 at 3:04 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > === Basic Optimization idea ===
> >
> > This series avoids repeatedly detecting the same renames in a sequence of
> > merges such as a rebase or cherry-pick of several commits. When there are
> > many renames between the old base and the new base, traditionally all those
> > renames are re-detected for every commit that is transplanted. This
> > optimization avoids redoing that work.
>
> Unless this section is easily understandable, the readers have no
> incentive to read on, but the above is a bit too hand wavy.

Oh, yeah, it's very hand wavy.  I figured the commit messages were the
right place to include the details, and just wanted to give a flavor
of the idea in the cover letter.

> > This one adds a fourth (remember-renames), with some interesting properties:
> >
> >  * unlike basename-guided rename detection, there are no behavioral changes
> >    (there is no heuristic involved)[2].
> >
> >  * like skip-because-irrelevant, this optimization does not apply to all git
> >    commands using the rename machinery. In fact, this one is even more
> >    restrictive since it is ONLY useful for rebases and cherry-picks (not
> >    even merges), and only for second and later commits in a linear series.
>
> So, is it correct to understand that one case this would help is
> this scenario?
>
>  ---o---o---o---X---o---o---o---O ours
>      \
>       A---B---C topic
>
> where there is a side branch A--B--C that touched some files, while
> on our side, there is a commit X that is unknown to the side branch
> that renamed these files.  Now we want to transplant the side topic
> to the tip of our history, replaying the changes A--B--C made to
> these files under their original name to the corresponding files
> that have been renamed.
>
> And each step in this "rebase" is a 3-way merge of commits A, B and
> C onto HEAD, using the parent of the commit being cherrk-picked as a
> virtual common ancestor.  Which means

You generated nearly the same description and diagram I used in the
commit message (the one in 3/7) describing this.  :-)

>  - To transplant A (i.e. the first step), we'd compare the diff of
>    A^..O (i.e. what our side did, including the renames done at X)
>    and diff of A^..A (i.e. what the first commit did in the range),
>    and the former does quite a lot of rename detection.
>
>  - After transplanting B (i.e. the second step), then we'd compare
>    the diff of A^..A' (where A' is A cherry-picked on O, i.e. the

Close, but for transplanting B we do the diff of B^..A', not A^...A'.
(And in this diagram, B^ is A.)  That's critical below...

>    result of the previous step).  If we are lucky, O..A' did not
>    rename anything so the renames done in A^..O (i.e. what we
>    detected during the first step) and A^..A' (i.e. what we should
>    be computing for this second step) should be quite similar.

Again, B^..A' rather than A^..A'.

Luck is not involved here.  If O..A' did rename anything, it's one of
two reasons:

- There were conflicts when trying to transplant A, and when we stop
for conflict resolution, the user added some renames at that point.
- There were renames in A^..A.

In the first case, the presence of conflicts means we drop the cache
and this optimization doesn't try to kick in.  In the second case,
those renames in A' came from A.  Even without this optimization,
since those renames in A' came from A, doing rename detection on A..A'
wouldn't re-detect them and transplanting wouldn't try to reapply
them, so they just aren't relevant anymore -- with or without this
optimization.

>    If we assume that the "quite similar" is good enough, then we can
>    blindly reuse the record of "<path in A^> correspnds to <path in
>    O>" as if it were "<path in A^> corresponds to <path in A'>".

Again, B^ rather than A^ on the last line.

I disagree with the use of the term "blindly" here.  As spelled out in
the third commit message, the transplant of A involved a three-way
content merge of the form:
    A^:oldfile
    O:newfile
    A:oldfile
and produce a new result:
    A':newfile

The point of rename detection is to determine what files are similar
enough to use in a three-way content merge.  In particular, we'd use
rename detection when transplanting B to notice the oldfile -> newfile
rename so that we can do a three-way content merge of the form:
    A:oldfile
    A':newfile
    B:oldfile
and produce a new result:
    B':newfile

But, instead of asking rename detection whether A:oldfile and
A':newfile are similar enough to use together in a three-way content
merge, we could ask ourselves -- do we have any _other_ reason to
believe these files are similar enough to be used in a three-way
content merge?  And the answer that comes back is: these files were
*already* involved in the same three-way content merge -- the one that
A':newfile came from.  It was a three-way content merge with no
conflicts.  (Because when conflicts are triggered we turn this
optimization off.)

>  - Do the same for C, pretending that renames discovered between A^
>    and O is identical to the renames between A^ and B' (i.e. the
>    result of cherry-picking A--B on top of O).

Now you've changed your off-by-one mistake to an off-by-two mistake;
the rename detection is between C^ and B', not A^ and B'.  I think
this error might be critical to why you used terms like "pretend" and
"blindly" and "lucky".  I agree that it would require
luck/blindness/pretending to assume that the renames between A^ and O
are identical to those between A^ and B', but that's not what the
original algorithm would have been using for computing renames; it
would be using C^ and B'.

It's actually quite difficult to generate a case where this
optimization gets a possibly different result.  It requires there were
changes to the content on both sides of history that merge cleanly,
and in particular that need a significant size reduction of the file
by the unrenamed side of history.  If you take the changes on the
*renamed* side of history, which represent <50% changes since it was
detected as a rename, those same changes need to represent a >50%
change when applied to the smaller file.  This is discussed in the
third commit message, as noted in the cover letter:

>> [2] Well, almost no changes. There's technically a very narrow way that this
>> could change the behavior; see the really long "Technically," bullet point
>> in patch 3 for discussion of this.

Junio C Hamano March 25, 2021, 6:59 p.m. UTC | #3

Elijah Newren <newren@gmail.com> writes:

>> And each step in this "rebase" is a 3-way merge of commits A, B and
>> C onto HEAD, using the parent of the commit being cherrk-picked as a
>> virtual common ancestor.  Which means
>
> You generated nearly the same description and diagram I used in the
> commit message (the one in 3/7) describing this.  :-)
>
>>  - To transplant A (i.e. the first step), we'd compare the diff of
>>    A^..O (i.e. what our side did, including the renames done at X)
>>    and diff of A^..A (i.e. what the first commit did in the range),
>>    and the former does quite a lot of rename detection.
>>
>>  - After transplanting B (i.e. the second step), then we'd compare
>>    the diff of A^..A' (where A' is A cherry-picked on O, i.e. the
>
> Close, but for transplanting B we do the diff of B^..A', not A^...A'.
> (And in this diagram, B^ is A.)  That's critical below...

Yes, I upfront said "pretend that the parent of the commit being
picked is the common ancestor and run 3-way merge", but then got
confused by the ancestry graph myself, forgetting that the reason
why A^ is used in the first "pick" is *not* because the it is the
fork point of our history and the side branch, but it is because it
is A's parent.

And if the renames in B^..A' and A^..A' are different that must have
come only from the difference between A..B (which is B^..B), but
that comparison is what we do when cherry-picking B on top of A',
so it is easy to take into account to reuse the renames precisely
without "assuming they are the same".

Thanks.

Elijah Newren March 29, 2021, 10:34 p.m. UTC | #4

On Thu, Mar 25, 2021 at 12:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> And each step in this "rebase" is a 3-way merge of commits A, B and
> >> C onto HEAD, using the parent of the commit being cherrk-picked as a
> >> virtual common ancestor.  Which means
> >
> > You generated nearly the same description and diagram I used in the
> > commit message (the one in 3/7) describing this.  :-)
> >
> >>  - To transplant A (i.e. the first step), we'd compare the diff of
> >>    A^..O (i.e. what our side did, including the renames done at X)
> >>    and diff of A^..A (i.e. what the first commit did in the range),
> >>    and the former does quite a lot of rename detection.
> >>
> >>  - After transplanting B (i.e. the second step), then we'd compare
> >>    the diff of A^..A' (where A' is A cherry-picked on O, i.e. the
> >
> > Close, but for transplanting B we do the diff of B^..A', not A^...A'.
> > (And in this diagram, B^ is A.)  That's critical below...
>
> Yes, I upfront said "pretend that the parent of the commit being
> picked is the common ancestor and run 3-way merge", but then got
> confused by the ancestry graph myself, forgetting that the reason
> why A^ is used in the first "pick" is *not* because the it is the
> fork point of our history and the side branch, but it is because it
> is A's parent.
>
> And if the renames in B^..A' and A^..A' are different that must have
> come only from the difference between A..B (which is B^..B), but
> that comparison is what we do when cherry-picking B on top of A',
> so it is easy to take into account to reuse the renames precisely
> without "assuming they are the same".
>
> Thanks.

After sending the initial series, I decided to type up a more thorough
document that
  * spelled out in more detail how the sequence of cherry-picks work
  * proved why the renames in one pick are always a superset of the
renames in the next
  * proved why the renames in one pick are _almost_ always also a
rename in the next
  * discussed the counterexample cases in more detail, and why the
optimization is still reasonable
I figured the more extended document would be useful in case people
decide to change how things work in the future (e.g. what if someone
wants to turn on break detection?), and wants to be able to check
whether all the conditions and cases still hold.

I then also added details about how things work with directory
renames, in the case that merge.directoryRenames is not the default of
"conflict" (which is trivially handled by stopping and dropping the
cache) but is set to true...and found a case that needed more care due
to interactions with some of the earlier optimizations.  (The earlier
optimizations could result in bypassing directory rename detection in
one merge because there was no file added to the old directory, but
the no-directory-rename would be cached for subsequent rebases.)

So I need to get that fixed up and resubmit this series.

Derrick Stolee March 30, 2021, 12:07 p.m. UTC | #5

On 3/29/2021 6:34 PM, Elijah Newren wrote:...
> After sending the initial series, I decided to type up a more thorough
> document that
>   * spelled out in more detail how the sequence of cherry-picks work
>   * proved why the renames in one pick are always a superset of the
> renames in the next
>   * proved why the renames in one pick are _almost_ always also a
> rename in the next
>   * discussed the counterexample cases in more detail, and why the
> optimization is still reasonable
> I figured the more extended document would be useful in case people
> decide to change how things work in the future (e.g. what if someone
> wants to turn on break detection?), and wants to be able to check
> whether all the conditions and cases still hold.
> 
> I then also added details about how things work with directory
> renames, in the case that merge.directoryRenames is not the default of
> "conflict" (which is trivially handled by stopping and dropping the
> cache) but is set to true...and found a case that needed more care due
> to interactions with some of the earlier optimizations.  (The earlier
> optimizations could result in bypassing directory rename detection in
> one merge because there was no file added to the old directory, but
> the no-directory-rename would be cached for subsequent rebases.)
> 
> So I need to get that fixed up and resubmit this series.

I look forward to that document. I attempted reading this series
yesterday, but did not have the mental energy to convince myself
of the correctness (because of things like not knowing this logic
that you plan to document). Instead, I'll promise to give round 2
a quicker response.

Thanks,
-Stolee

[0/7] Optimization batch 11: avoid repeatedly detecting same renames

Message

Comments