diff mbox series

[v2,06/13] merge-ort: add data structures for in-memory caching of rename detection

Message ID 2a9e73de2beef5f51ad76fe1d9aaaed926a5fce8.1620094339.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optimization batch 11: avoid repeatedly detecting same renames | expand

Commit Message

Elijah Newren May 4, 2021, 2:12 a.m. UTC
From: Elijah Newren <newren@gmail.com>

When there are many renames between the old base of a series of commits
and the new base for a series of commits, the sequence of merges
employed to transplant those commits (from a cherry-pick or rebase
operation) will repeatedly detect the exact same renames.  This is
wasted effort.

Add data structures which will be used to cache rename detection
results, along with the initialization and deallocation of these data
structures.  Future commits will populate these caches, detect the
appropriate circumstances when they can be used, and employ them to
avoid re-detecting the same renames repeatedly.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

Comments

Derrick Stolee May 17, 2021, 1:41 p.m. UTC | #1
On 5/3/21 10:12 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> When there are many renames between the old base of a series of commits
> and the new base for a series of commits, the sequence of merges
> employed to transplant those commits (from a cherry-pick or rebase
> operation) will repeatedly detect the exact same renames.  This is
> wasted effort.
> 
> Add data structures which will be used to cache rename detection
> results, along with the initialization and deallocation of these data
> structures.  Future commits will populate these caches, detect the
> appropriate circumstances when they can be used, and employ them to
> avoid re-detecting the same renames repeatedly.

I appreciate the definitions and boilerplate for these data
structures being isolated to their own patch.

> @@ -140,6 +140,37 @@ struct rename_info {
>  	int callback_data_nr, callback_data_alloc;
>  	char *callback_data_traverse_path;
>  
> +	/*
> +	 * cached_pairs: Caching of renames and deletions.
> +	 *
> +	 * These are mappings recording renames and deletions of individual
> +	 * files (not directories).  They are thus a map from an old
> +	 * filename to either NULL (for deletions) or a new filename (for
> +	 * renames).
> +	 */
> +	struct strmap cached_pairs[3];
> +
> +	/*
> +	 * cached_target_names: just the destinations from cached_pairs
> +	 *
> +	 * We sometimes want a fast lookup to determine if a given filename
> +	 * is one of the destinations in cached_pairs.  cached_target_names
> +	 * is thus duplicative information, but it provides a fast lookup.
> +	 */
> +	struct strset cached_target_names[3];

These two work well together. Very clear.

> +	/*
> +	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
> +	 *
> +	 * cached_pairs records both renames and deletes.  Sometimes we
> +	 * do not know if a path is a rename or a delete because we pass
> +	 * RELEVANT_LOCATION to diffcore_rename_extended() and based on
> +	 * various optimizations it returns without detecting whether that
> +	 * path is actually a rename or a delete.  We need to cache such
> +	 * paths too, but separately from cached_pairs.
> +	 */
> +	struct strset cached_irrelevant[3];

I'm having a hard time parsing what these "irrelevant" paths will be.
It seems like diffcore_rename_extended() will report something other
than "rename" or "delete" for some paths. Could we explicitly mark
that state as "irrelevant"?

	/*
	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
	 *
	 * cached_pairs records both renames and deletes.  Sometimes we
	 * do not know if a path is a rename or a delete because we pass
	 * RELEVANT_LOCATION to diffcore_rename_extended() which might
	 * describe a path as "irrelevant" instead of as a "rename" or "delete".
	 *  We need to cache such paths too, but separately from cached_pairs.
	 */

Does this make sense? diffcore_rename_extended() might need an update
to match this extra, explicit state.

The rest of the code looks good.

Thanks,
-Stolee
Elijah Newren May 18, 2021, 3:55 a.m. UTC | #2
On Mon, May 17, 2021 at 6:41 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 5/3/21 10:12 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > When there are many renames between the old base of a series of commits
> > and the new base for a series of commits, the sequence of merges
> > employed to transplant those commits (from a cherry-pick or rebase
> > operation) will repeatedly detect the exact same renames.  This is
> > wasted effort.
> >
> > Add data structures which will be used to cache rename detection
> > results, along with the initialization and deallocation of these data
> > structures.  Future commits will populate these caches, detect the
> > appropriate circumstances when they can be used, and employ them to
> > avoid re-detecting the same renames repeatedly.
>
> I appreciate the definitions and boilerplate for these data
> structures being isolated to their own patch.
>
> > @@ -140,6 +140,37 @@ struct rename_info {
> >       int callback_data_nr, callback_data_alloc;
> >       char *callback_data_traverse_path;
> >
> > +     /*
> > +      * cached_pairs: Caching of renames and deletions.
> > +      *
> > +      * These are mappings recording renames and deletions of individual
> > +      * files (not directories).  They are thus a map from an old
> > +      * filename to either NULL (for deletions) or a new filename (for
> > +      * renames).
> > +      */
> > +     struct strmap cached_pairs[3];
> > +
> > +     /*
> > +      * cached_target_names: just the destinations from cached_pairs
> > +      *
> > +      * We sometimes want a fast lookup to determine if a given filename
> > +      * is one of the destinations in cached_pairs.  cached_target_names
> > +      * is thus duplicative information, but it provides a fast lookup.
> > +      */
> > +     struct strset cached_target_names[3];
>
> These two work well together. Very clear.
>
> > +     /*
> > +      * cached_irrelevant: Caching of rename_sources that aren't relevant.
> > +      *
> > +      * cached_pairs records both renames and deletes.  Sometimes we
> > +      * do not know if a path is a rename or a delete because we pass
> > +      * RELEVANT_LOCATION to diffcore_rename_extended() and based on
> > +      * various optimizations it returns without detecting whether that
> > +      * path is actually a rename or a delete.  We need to cache such
> > +      * paths too, but separately from cached_pairs.
> > +      */
> > +     struct strset cached_irrelevant[3];
>
> I'm having a hard time parsing what these "irrelevant" paths will be.
> It seems like diffcore_rename_extended() will report something other
> than "rename" or "delete" for some paths. Could we explicitly mark
> that state as "irrelevant"?

The state is better known as RELEVANT_NO_MORE, yes.

>         /*
>          * cached_irrelevant: Caching of rename_sources that aren't relevant.
>          *
>          * cached_pairs records both renames and deletes.  Sometimes we
>          * do not know if a path is a rename or a delete because we pass
>          * RELEVANT_LOCATION to diffcore_rename_extended() which might
>          * describe a path as "irrelevant" instead of as a "rename" or "delete".
>          *  We need to cache such paths too, but separately from cached_pairs.
>          */
>
> Does this make sense? diffcore_rename_extended() might need an update
> to match this extra, explicit state.

Hmm, let's flesh out the description a bit and try to be more
explicit.  How about:

    /*
     * cached_irrelevant: Caching of rename_sources that aren't relevant.
     *
     * If we try to detect a rename for a source path and succeed, it's
     * part of a rename.  If we try to detect a rename for a source path
     * and fail, then it's a delete.  If we do not try to detect a rename
     * for a path, then we don't know if it's a rename or a delete.  If
     * merge-ort doesn't think the path is relevant, then we just won't
     * cache anything for that path.  But there's a slight problem in
     * that merge-ort can think a path is RELEVANT_LOCATION, but due to
     * commit 9bd342137e ("diffcore-rename: determine which
     * relevant_sources are no longer relevant", 2021-03-13),
     * diffcore-rename can downgrade the path to RELEVANT_NO_MORE.  To
     * avoid excessive calls to diffcore_rename_extended() we still need
     * to cache such paths, though we cannot record them as either
     * renames or deletes.  So we cache them here as a "turned out to be
     * irrelevant *for this commit*" as they are often also irrelevant
     * for subsequent commits, though we will have to do some extra
     * checking to see whether such paths become relevant for rename
     * detection when cherry-picking/rebasing subsequent commits.
     */
Derrick Stolee May 18, 2021, 1:57 p.m. UTC | #3
On 5/17/2021 11:55 PM, Elijah Newren wrote:
>>         /*
>>          * cached_irrelevant: Caching of rename_sources that aren't relevant.
>>          *
>>          * cached_pairs records both renames and deletes.  Sometimes we
>>          * do not know if a path is a rename or a delete because we pass
>>          * RELEVANT_LOCATION to diffcore_rename_extended() which might
>>          * describe a path as "irrelevant" instead of as a "rename" or "delete".
>>          *  We need to cache such paths too, but separately from cached_pairs.
>>          */
>>
>> Does this make sense? diffcore_rename_extended() might need an update
>> to match this extra, explicit state.
> Hmm, let's flesh out the description a bit and try to be more
> explicit.  How about:
> 
>     /*
>      * cached_irrelevant: Caching of rename_sources that aren't relevant.
>      *
>      * If we try to detect a rename for a source path and succeed, it's
>      * part of a rename.  If we try to detect a rename for a source path
>      * and fail, then it's a delete.  If we do not try to detect a rename
>      * for a path, then we don't know if it's a rename or a delete.  If
>      * merge-ort doesn't think the path is relevant, then we just won't
>      * cache anything for that path.  But there's a slight problem in
>      * that merge-ort can think a path is RELEVANT_LOCATION, but due to
>      * commit 9bd342137e ("diffcore-rename: determine which
>      * relevant_sources are no longer relevant", 2021-03-13),
>      * diffcore-rename can downgrade the path to RELEVANT_NO_MORE.  To
>      * avoid excessive calls to diffcore_rename_extended() we still need
>      * to cache such paths, though we cannot record them as either
>      * renames or deletes.  So we cache them here as a "turned out to be
>      * irrelevant *for this commit*" as they are often also irrelevant
>      * for subsequent commits, though we will have to do some extra
>      * checking to see whether such paths become relevant for rename
>      * detection when cherry-picking/rebasing subsequent commits.
>      */

This is more informative, thanks.
-Stolee
diff mbox series

Patch

diff --git a/merge-ort.c b/merge-ort.c
index b1795d838eaf..8602f88a960c 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -140,6 +140,37 @@  struct rename_info {
 	int callback_data_nr, callback_data_alloc;
 	char *callback_data_traverse_path;
 
+	/*
+	 * cached_pairs: Caching of renames and deletions.
+	 *
+	 * These are mappings recording renames and deletions of individual
+	 * files (not directories).  They are thus a map from an old
+	 * filename to either NULL (for deletions) or a new filename (for
+	 * renames).
+	 */
+	struct strmap cached_pairs[3];
+
+	/*
+	 * cached_target_names: just the destinations from cached_pairs
+	 *
+	 * We sometimes want a fast lookup to determine if a given filename
+	 * is one of the destinations in cached_pairs.  cached_target_names
+	 * is thus duplicative information, but it provides a fast lookup.
+	 */
+	struct strset cached_target_names[3];
+
+	/*
+	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
+	 *
+	 * cached_pairs records both renames and deletes.  Sometimes we
+	 * do not know if a path is a rename or a delete because we pass
+	 * RELEVANT_LOCATION to diffcore_rename_extended() and based on
+	 * various optimizations it returns without detecting whether that
+	 * path is actually a rename or a delete.  We need to cache such
+	 * paths too, but separately from cached_pairs.
+	 */
+	struct strset cached_irrelevant[3];
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -382,6 +413,8 @@  static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		reinitialize ? strmap_partial_clear : strmap_clear;
 	void (*strintmap_func)(struct strintmap *) =
 		reinitialize ? strintmap_partial_clear : strintmap_clear;
+	void (*strset_func)(struct strset *) =
+		reinitialize ? strset_partial_clear : strset_clear;
 
 	/*
 	 * We marked opti->paths with strdup_strings = 0, so that we
@@ -425,6 +458,9 @@  static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_func(&renames->dir_renames[i], 0);
 
 		strintmap_func(&renames->relevant_sources[i]);
+		strset_func(&renames->cached_target_names[i]);
+		strmap_func(&renames->cached_pairs[i], 1);
+		strset_func(&renames->cached_irrelevant[i]);
 	}
 
 	if (!reinitialize) {
@@ -3667,6 +3703,12 @@  static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 0);
 		strintmap_init_with_options(&renames->relevant_sources[i],
 					    0, NULL, 0);
+		strmap_init_with_options(&renames->cached_pairs[i],
+					 NULL, 1);
+		strset_init_with_options(&renames->cached_irrelevant[i],
+					 NULL, 1);
+		strset_init_with_options(&renames->cached_target_names[i],
+					 NULL, 0);
 	}
 
 	/*