diff mbox series

[v2,2/2] diffcore-rename: filter rename_src list when possible

Message ID 7ae9460d3dba84122c2674b46e4339b9d42bdedd.1612382628.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optimization batch 6: make full use of exact renames | expand

Commit Message

Elijah Newren Feb. 3, 2021, 8:03 p.m. UTC
From: Elijah Newren <newren@gmail.com>

We have to look at each entry in rename_src a total of rename_dst_nr
times.  When we're not detecting copies, any exact renames or ignorable
rename paths will just be skipped over.  While checking that these can
be skipped over is a relatively cheap check, it's still a waste of time
to do that check more than once, let alone rename_dst_nr times.  When
rename_src_nr is a few thousand times bigger than the number of relevant
sources (such as when cherry-picking a commit that only touched a
handful of files, but from a side of history that has different names
for some high level directories), this time can add up.

First make an initial pass over the rename_src array and move all the
relevant entries to the front, so that we can iterate over just those
relevant entries.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.119 s ±  0.101 s    13.815 s ±  0.062 s
    mega-renames:   1802.044 s ±  0.828 s  1799.937 s ±  0.493 s
    just-one-mega:    51.391 s ±  0.028 s    51.289 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 57 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 6 deletions(-)

Comments

Junio C Hamano Feb. 13, 2021, 1:04 a.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +		return; /* culling incompatbile with break detection */

incompatible.

>  	/*
> -	 * Calculate how many renames are left (but all the source
> -	 * files still remain as options for rename/copies!)
> +	 * Calculate how many renames are left
>  	 */

This no longer needs to be a multi-line comment, does it?
Junio C Hamano Feb. 13, 2021, 1:06 a.m. UTC | #2
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> @@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options)
>  			struct diff_filespec *one = rename_src[j].p->one;
>  			struct diff_score this_src;
>  
> -			if (one->rename_used && !want_copies)
> -				continue;
> +			assert(!one->rename_used || want_copies || break_idx);

Doesn't assert becomes a no-op in production builds?  Shouldn't this
be a BUG() instead?
Elijah Newren Feb. 13, 2021, 4:24 a.m. UTC | #3
On Fri, Feb 12, 2021 at 5:04 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +             return; /* culling incompatbile with break detection */
>
> incompatible.

Thanks.

> >       /*
> > -      * Calculate how many renames are left (but all the source
> > -      * files still remain as options for rename/copies!)
> > +      * Calculate how many renames are left
> >        */
>
> This no longer needs to be a multi-line comment, does it?

Indeed.

Will fix both.
Elijah Newren Feb. 13, 2021, 4:43 a.m. UTC | #4
On Fri, Feb 12, 2021 at 5:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > @@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options)
> >                       struct diff_filespec *one = rename_src[j].p->one;
> >                       struct diff_score this_src;
> >
> > -                     if (one->rename_used && !want_copies)
> > -                             continue;
> > +                     assert(!one->rename_used || want_copies || break_idx);
>
> Doesn't assert becomes a no-op in production builds?  Shouldn't this
> be a BUG() instead?

I guess it depends on the reason for the line.  If we're just trying
to help others understand the code by documenting conditions that are
true (and perhaps help them when they are refactoring), then comments
like

    /* The following is true at this point: !one->rename_used ||
want_copies || break_idx */

could also be used, but it's kind of verbose.  The form

    assert(!one->rename_used || want_copies || break_idx);

is shorter, clearer, and is likely to be up-to-date because even if
most builds and users turn off assertions, some folks will build and
run with assertions on.  If it's a refactoring-aid, then the latter
form is also more likely to help the future developer (who may be
future me).


If, however, the purpose is to check for bad input or worries about
programming logic possibly have edge cases, and we're afraid that such
conditions might cause severe and hard to debug problems later in the
code, then we absolutely would rather use a BUG().

Most of my uses of assert() fall in the former category.  I use BUG()
when it's a line meant for the latter category.  There are a few that
perhaps fall in between, where I just have to make a judgement call.
I'd like to say that I'm more likely to use BUG() for those, but the
lack of a BUG_ON() does tend to make it pretty annoying to use and
certainly discourages it.

Granted, that's the way I look at it.  I'm curious if others have a
different view.  It certainly seems like the project likes to use both
forms nearly equally:
$ git grep assert\( origin/master | wc -l
468
$ git grep BUG\( origin/master | wc -l
506

so I'm curious if there are other factors that make folks pick one or the other.
diff mbox series

Patch

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8b118628b4e..74930716e70 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -454,6 +454,54 @@  static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 	return count;
 }
 
+static void remove_unneeded_paths_from_src(int detecting_copies)
+{
+	int i, new_num_src;
+
+	if (detecting_copies)
+		return; /* nothing to remove */
+	if (break_idx)
+		return; /* culling incompatbile with break detection */
+
+	/*
+	 * Note on reasons why we cull unneeded sources but not destinations:
+	 *   1) Pairings are stored in rename_dst (not rename_src), which we
+	 *      need to keep around.  So, we just can't cull rename_dst even
+	 *      if we wanted to.  But doing so wouldn't help because...
+	 *
+	 *   2) There is a matrix pairwise comparison that follows the
+	 *      "Performing inexact rename detection" progress message.
+	 *      Iterating over the destinations is done in the outer loop,
+	 *      hence we only iterate over each of those once and we can
+	 *      easily skip the outer loop early if the destination isn't
+	 *      relevant.  That's only one check per destination path to
+	 *      skip.
+	 *
+	 *      By contrast, the sources are iterated in the inner loop; if
+	 *      we check whether a source can be skipped, then we'll be
+	 *      checking it N separate times, once for each destination.
+	 *      We don't want to have to iterate over known-not-needed
+	 *      sources N times each, so avoid that by removing the sources
+	 *      from rename_src here.
+	 */
+	for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
+		/*
+		 * renames are stored in rename_dst, so if a rename has
+		 * already been detected using this source, we can just
+		 * remove the source knowing rename_dst has its info.
+		 */
+		if (rename_src[i].p->one->rename_used)
+			continue;
+
+		if (new_num_src < i)
+			memcpy(&rename_src[new_num_src], &rename_src[i],
+			       sizeof(struct diff_rename_src));
+		new_num_src++;
+	}
+
+	rename_src_nr = new_num_src;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -530,13 +578,11 @@  void diffcore_rename(struct diff_options *options)
 		goto cleanup;
 
 	/*
-	 * Calculate how many renames are left (but all the source
-	 * files still remain as options for rename/copies!)
+	 * Calculate how many renames are left
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
+	remove_unneeded_paths_from_src(want_copies);
 	num_sources = rename_src_nr;
-	if (!want_copies)
-		num_sources -= rename_count;
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -578,8 +624,7 @@  void diffcore_rename(struct diff_options *options)
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 
-			if (one->rename_used && !want_copies)
-				continue;
+			assert(!one->rename_used || want_copies || break_idx);
 
 			if (skip_unmodified &&
 			    diff_unmodified_pair(rename_src[j].p))