diff mbox series

[v2,6/7] merge-ort: avoid recursing into directories when we don't need to

Message ID 3409a6cd631deb361d3ecb94491c0c297c52fb53.1626204784.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Optimization batch 14: trivial directory resolution | expand

Commit Message

Elijah Newren July 13, 2021, 7:33 p.m. UTC
From: Elijah Newren <newren@gmail.com>

This combines the work of the last several patches, and implements the
conditions when we don't need to recurse into directories.  It's perhaps
easiest to see the logic by separating the fact that a directory might
have both rename sources and rename destinations:

  * rename sources: only files present in the merge base can serve as
    rename sources, and only when one side deletes that file.  When the
    tree on one side matches the merge base, that means every file
    within the subtree matches the merge base.  This means that the
    skip-irrelevant-rename-detection optimization from before kicks in
    and we don't need any of these files as rename sources.

  * rename destinations: the tree that does not match the merge base
    might have newly added and hence unmatched destination files.
    This is what usually prevents us from doing trivial directory
    resolutions in the merge machinery.  However, the fact that we have
    deferred recursing into this directory until the end means we know
    whether there are any unmatched relevant potential rename sources
    elsewhere in this merge.  If there are no unmatched such relevant
    sources anywhere, then there is no need to look for unmatched
    potential rename destinations to match them with.

This informs our algorithm:
  * Search through relevant_sources; if we have entries, they better all
    be reflected in cached_pairs or cached_irrelevant, otherwise they
    represent an unmatched potential rename source (causing the
    optimization to be disallowed).
  * For any relevant_source represented in cached_pairs, we do need to
    to make sure to get the destination for each source, meaning we need
    to recurse into any ancestor directories of those destinations.
  * Once we've recursed into all the rename destinations for any
    relevant_sources in cached_pairs, we can then do the trivial
    directory resolution for the remaining directories.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
    mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
    just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 3 deletions(-)

Comments

Derrick Stolee July 15, 2021, 2:55 p.m. UTC | #1
On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
...
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28),
> this change improves the performance as follows:
> 
>                             Before                  After
>     no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms

Wow! This is quite the savings, and reinforces that when no renames
exist we are likely to hit this optimization.

>     mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s

I'm surprised that this one works so well, too. Clearly, there must
be a lot of directories that can be skipped despite many renames
existing.

>     just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms

And no overhead added to this case. Good.

> +		/* Loop over the set of paths we need to know rename info for */
> +		strset_for_each_entry(&renames->relevant_sources[side],
> +				      &iter, entry) {
> +			char *rename_target, *dir, *dir_marker;
> +			struct strmap_entry *e;
> +
> +			/*
> +			 * if we don't know delete/rename info for this path,
> +			 * then we need to recurse into all trees to get all
> +			 * adds to make sure we have it.
> +			 */

super-nit: s/if/If/

Otherwise, I can't speak to the code being 100% correct because
this area is so dense. I find the code to be well organized,
which will help finding and fixing any potential bugs that might
show up in strange corner cases later.

Thanks,
-Stolee
Elijah Newren July 15, 2021, 4:28 p.m. UTC | #2
On Thu, Jul 15, 2021 at 7:55 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> ...
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28),
> > this change improves the performance as follows:
> >
> >                             Before                  After
> >     no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
>
> Wow! This is quite the savings, and reinforces that when no renames
> exist we are likely to hit this optimization.

Just to clarify, "no-renames" was an unfortunate misnomer.  I should
have called it "few-renames".

But yeah, I was pretty amazed a year ago (when I initially implemented
this optimization) seeing the speedups on this testcase -- a testcase
that I hadn't even been paying much attention to beyond trying to
avoid regressing it.

> >     mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
>
> I'm surprised that this one works so well, too. Clearly, there must
> be a lot of directories that can be skipped despite many renames
> existing.

One small reason it works is that some of the commits don't touch any
paths involved in the big directory rename.  That means that for those
commits, all the renames under that big directory are irrelevant.
This fact makes the optimization work for 4 of the 35 patches.  (Of
course, I did pick that particular series of commits for rebasing
because it touched the directory being renamed a lot, including adding
files to it, which made it a better stress test case for my
optimizations.  But that does make it less likely for the
irrelevant-renames criteria to help us out on this particular
optimization.)

The bigger reason this optimization works so well on this testcase, is
the mere fact that this rebase involved a sequence of 35 commits.  The
optimization doesn't work for the first commit (as noted by the
just-one-mega testcase below).  However, once the first commit is
picked, renames on the upstream side that were relevant in the first
commit are cached, so the optimization does trigger for many of the
subsequent commits.  And whenever the optimization doesn't work on a
subsequent commit -- which happens because the new commit modifies
different files that make other renames become relevant -- then after
detecting the new renames those also get cached.  More cached renames
means a higher likelihood that we can skip recursing into directories
when rebasing subsequent commits.

> >     just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms
>
> And no overhead added to this case. Good.
>
> > +             /* Loop over the set of paths we need to know rename info for */
> > +             strset_for_each_entry(&renames->relevant_sources[side],
> > +                                   &iter, entry) {
> > +                     char *rename_target, *dir, *dir_marker;
> > +                     struct strmap_entry *e;
> > +
> > +                     /*
> > +                      * if we don't know delete/rename info for this path,
> > +                      * then we need to recurse into all trees to get all
> > +                      * adds to make sure we have it.
> > +                      */
>
> super-nit: s/if/If/

Will fix.

> Otherwise, I can't speak to the code being 100% correct because
> this area is so dense. I find the code to be well organized,
> which will help finding and fixing any potential bugs that might
> show up in strange corner cases later.
>
> Thanks,
> -Stolee
diff mbox series

Patch

diff --git a/merge-ort.c b/merge-ort.c
index bf0712d18a0..c9cf7a158c8 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1223,6 +1223,18 @@  static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+static void resolve_trivial_directory_merge(struct conflict_info *ci, int side)
+{
+	VERIFY_CI(ci);
+	assert((side == 1 && ci->match_mask == 5) ||
+	       (side == 2 && ci->match_mask == 3));
+	oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
+	ci->merged.result.mode = ci->stages[side].mode;
+	ci->merged.is_null = is_null_oid(&ci->stages[side].oid);
+	ci->match_mask = 0;
+	ci->merged.clean = 1; /* (ci->filemask == 0); */
+}
+
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1232,9 +1244,71 @@  static int handle_deferred_entries(struct merge_options *opt,
 	int side, ret = 0;
 
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
-		renames->trivial_merges_okay[side] = 0;
-		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
-					 &iter, entry) {
+		unsigned optimization_okay = 1;
+		struct strintmap copy;
+
+		/* Loop over the set of paths we need to know rename info for */
+		strset_for_each_entry(&renames->relevant_sources[side],
+				      &iter, entry) {
+			char *rename_target, *dir, *dir_marker;
+			struct strmap_entry *e;
+
+			/*
+			 * if we don't know delete/rename info for this path,
+			 * then we need to recurse into all trees to get all
+			 * adds to make sure we have it.
+			 */
+			if (strset_contains(&renames->cached_irrelevant[side],
+					    entry->key))
+				continue;
+			e = strmap_get_entry(&renames->cached_pairs[side],
+					     entry->key);
+			if (!e) {
+				optimization_okay = 0;
+				break;
+			}
+
+			/* If this is a delete, we have enough info already */
+			rename_target = e->value;
+			if (!rename_target)
+				continue;
+
+			/* If we already walked the rename target, we're good */
+			if (strmap_contains(&opt->priv->paths, rename_target))
+				continue;
+
+			/*
+			 * Otherwise, we need to get a list of directories that
+			 * will need to be recursed into to get this
+			 * rename_target.
+			 */
+			dir = xstrdup(rename_target);
+			while ((dir_marker = strrchr(dir, '/'))) {
+				*dir_marker = '\0';
+				if (strset_contains(&renames->target_dirs[side],
+						    dir))
+					break;
+				strset_add(&renames->target_dirs[side], dir);
+			}
+			free(dir);
+		}
+		renames->trivial_merges_okay[side] = optimization_okay;
+		/*
+		 * We need to recurse into any directories in
+		 * possible_trivial_merges[side] found in target_dirs[side].
+		 * But when we recurse, we may need to queue up some of the
+		 * subdirectories for possible_trivial_merges[side].  Since
+		 * we can't safely iterate through a hashmap while also adding
+		 * entries, move the entries into 'copy', iterate over 'copy',
+		 * and then we'll also iterate anything added into
+		 * possible_trivial_merges[side] once this loop is done.
+		 */
+		copy = renames->possible_trivial_merges[side];
+		strintmap_init_with_options(&renames->possible_trivial_merges[side],
+					    0,
+					    NULL,
+					    0);
+		strintmap_for_each_entry(&copy, &iter, entry) {
 			const char *path = entry->key;
 			unsigned dir_rename_mask = (intptr_t)entry->value;
 			struct conflict_info *ci;
@@ -1247,6 +1321,13 @@  static int handle_deferred_entries(struct merge_options *opt,
 			VERIFY_CI(ci);
 			dirmask = ci->dirmask;
 
+			if (optimization_okay &&
+			    !strset_contains(&renames->target_dirs[side],
+					     path)) {
+				resolve_trivial_directory_merge(ci, side);
+				continue;
+			}
+
 			info->name = path;
 			info->namelen = strlen(path);
 			info->pathlen = info->namelen + 1;
@@ -1282,6 +1363,20 @@  static int handle_deferred_entries(struct merge_options *opt,
 			if (ret < 0)
 				return ret;
 		}
+		strintmap_clear(&copy);
+		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
+					 &iter, entry) {
+			const char *path = entry->key;
+			struct conflict_info *ci;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+
+			assert(renames->trivial_merges_okay[side] &&
+			       !strset_contains(&renames->target_dirs[side],
+						path));
+			resolve_trivial_directory_merge(ci, side);
+		}
 	}
 	return ret;
 }