diff mbox series

[v2,2/4] diffcore-rename: complete find_basename_matches()

Message ID dcd0175229aa6fba576425e78875b95385acb58d.1612870326.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optimization batch 7: use file basenames to guide rename detection | expand

Commit Message

Elijah Newren Feb. 9, 2021, 11:32 a.m. UTC
From: Elijah Newren <newren@gmail.com>

It is not uncommon in real world repositories for the majority of file
renames to not change the basename of the file; i.e. most "renames" are
just a move of files into different directories.  We can make use of
this to avoid comparing all rename source candidates with all rename
destination candidates, by first comparing sources to destinations with
the same basenames.  If two files with the same basename are
sufficiently similar, we record the rename; if not, we include those
files in the more exhaustive matrix comparison.

This means we are adding a set of preliminary additional comparisons,
but for each file we only compare it with at most one other file.  For
example, if there was a include/media/device.h that was deleted and a
src/module/media/device.h that was added, and there were no other
device.h files added or deleted between the commits being compared,
then these two files would be compared in the preliminary step.

This commit does not yet actually employ this new optimization, it
merely adds a function which can be used for this purpose.  The next
commit will do the necessary plumbing to make use of it.

Note that this optimization might give us different results than without
the optimization, because it's possible that despite files with the same
basename being sufficiently similar to be considered a rename, there's
an even better match between files without the same basename.  I think
that is okay for four reasons: (1) it's easy to explain to the users
what happened if it does ever occur (or even for them to intuitively
figure out), (2) as the next patch will show it provides such a large
performance boost that it's worth the tradeoff, and (3) it's somewhat
unlikely that despite having unique matching basenames that other files
serve as better matches.  Reason (4) takes a full paragraph to
explain...

If the previous three reasons aren't enough, consider what rename
detection already does.  Break detection is not the default, meaning
that if files have the same _fullname_, then they are considered related
even if they are 0% similar.  In fact, in such a case, we don't even
bother comparing the files to see if they are similar let alone
comparing them to all other files to see what they are most similar to.
Basically, we override content similarity based on sufficient filename
similarity.  Without the filename similarity (currently implemented as
an exact match of filename), we swing the pendulum the opposite
direction and say that filename similarity is irrelevant and compare a
full N x M matrix of sources and destinations to find out which have the
most similar contents.  This optimization just adds another form of
filename similarity comparison, but augments it with a file content
similarity check as well.  Basically, if two files have the same
basename and are sufficiently similar to be considered a rename, mark
them as such without comparing the two to all other rename candidates.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 94 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 3 deletions(-)

Comments

Derrick Stolee Feb. 9, 2021, 1:25 p.m. UTC | #1
On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> +	/*
> +	 * When I checked, over 76% of file renames in linux just moved

Perhaps "In late 2020," instead of "When I checked".

> +	 * files to a different directory but kept the same basename.  gcc
> +	 * did that with over 64% of renames, gecko did it with over 79%,
> +	 * and WebKit did it with over 89%.
> +	 *
> +	 * Therefore we can bypass the normal exhaustive NxM matrix
> +	 * comparison of similarities between all potential rename sources
> +	 * and destinations by instead using file basename as a hint, checking
> +	 * for similarity between files with the same basename, and if we
> +	 * find a pair that are sufficiently similar, record the rename
> +	 * pair and exclude those two from the NxM matrix.
> +	 *
> +	 * This *might* cause us to find a less than optimal pairing (if
> +	 * there is another file that we are even more similar to but has a
> +	 * different basename).  Given the huge performance advantage
> +	 * basename matching provides, and given the frequency with which
> +	 * people use the same basename in real world projects, that's a
> +	 * trade-off we are willing to accept when doing just rename
> +	 * detection.  However, if someone wants copy detection that
> +	 * implies they are willing to spend more cycles to find
> +	 * similarities between files, so it may be less likely that this
> +	 * heuristic is wanted.
> +	 */
> +
> +	int i, renames = 0;
>  	struct strintmap sources;
>  	struct strintmap dests; 

...

> +	 * copy detection.  find_basename_matches() is only used when detecting
> +	 * renames, not when detecting copies, so it'll only be used when a file
> +	 * only existed in the source.  Since we already know that the file

There are two "only"s in this sentence. Just awkward, not wrong.

> +	 * won't be unmodified, there's no point checking for it; that's just a
> +	 * waste of resources.  So set skip_unmodified to 0 so that
> +	 * estimate_similarity() and prefetch() won't waste resources checking
> +	 * for something we already know is false.
> +	 */
> +	int skip_unmodified = 0;
> +



> -	/* TODO: Make use of basenames source and destination basenames */
> +	/* Now look for basename matchups and do similarity estimation */
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		char *base = NULL;
> +		intptr_t src_index;
> +		intptr_t dst_index;
> +
> +		/* Get the basename */
> +		base = strrchr(filename, '/');
> +		base = (base ? base+1 : filename);

Here is the third instance of this in the same function. At minimum we should
extract a helper for you to consume.

> +		/* Find out if this basename is unique among sources */
> +		src_index = strintmap_get(&sources, base);
> +		if (src_index == -1)
> +			continue; /* not a unique basename; skip it */
> +		assert(src_index == i);
> +
> +		if (strintmap_contains(&dests, base)) {
> +			struct diff_filespec *one, *two;
> +			int score;
> +
> +			/* Find out if this basename is unique among dests */
> +			dst_index = strintmap_get(&dests, base);
> +			if (dst_index == -1)
> +				continue; /* not a unique basename; skip it */
> +
> +			/* Ignore this dest if already used in a rename */
> +			if (rename_dst[dst_index].is_rename)
> +				continue; /* already used previously */
> +
> +			/* Estimate the similarity */
> +			one = rename_src[src_index].p->one;
> +			two = rename_dst[dst_index].p->two;
> +			score = estimate_similarity(options->repo, one, two,
> +						    minimum_score, skip_unmodified);
> +
> +			/* If sufficiently similar, record as rename pair */
> +			if (score < minimum_score)
> +				continue;
> +			record_rename_pair(dst_index, src_index, score);
> +			renames++;
> +
> +			/*
> +			 * Found a rename so don't need text anymore; if we
> +			 * didn't find a rename, the filespec_blob would get
> +			 * re-used when doing the matrix of comparisons.
> +			 */
> +			diff_free_filespec_blob(one);
> +			diff_free_filespec_blob(two);
> +		}
> +	}

Makes sense to me.

Thanks,
-Stolee
Elijah Newren Feb. 9, 2021, 5:17 p.m. UTC | #2
On Tue, Feb 9, 2021 at 5:25 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> > +     /*
> > +      * When I checked, over 76% of file renames in linux just moved
>
> Perhaps "In late 2020," instead of "When I checked".

In early 2020 (in fact, it might have been 2019, but I have no records
to verify the actual year), but sure I can change that.

> > +      * files to a different directory but kept the same basename.  gcc
> > +      * did that with over 64% of renames, gecko did it with over 79%,
> > +      * and WebKit did it with over 89%.
> > +      *
> > +      * Therefore we can bypass the normal exhaustive NxM matrix
> > +      * comparison of similarities between all potential rename sources
> > +      * and destinations by instead using file basename as a hint, checking
> > +      * for similarity between files with the same basename, and if we
> > +      * find a pair that are sufficiently similar, record the rename
> > +      * pair and exclude those two from the NxM matrix.
> > +      *
> > +      * This *might* cause us to find a less than optimal pairing (if
> > +      * there is another file that we are even more similar to but has a
> > +      * different basename).  Given the huge performance advantage
> > +      * basename matching provides, and given the frequency with which
> > +      * people use the same basename in real world projects, that's a
> > +      * trade-off we are willing to accept when doing just rename
> > +      * detection.  However, if someone wants copy detection that
> > +      * implies they are willing to spend more cycles to find
> > +      * similarities between files, so it may be less likely that this
> > +      * heuristic is wanted.
> > +      */
> > +
> > +     int i, renames = 0;
> >       struct strintmap sources;
> >       struct strintmap dests;
>
> ...
>
> > +      * copy detection.  find_basename_matches() is only used when detecting
> > +      * renames, not when detecting copies, so it'll only be used when a file
> > +      * only existed in the source.  Since we already know that the file
>
> There are two "only"s in this sentence. Just awkward, not wrong.
>
> > +      * won't be unmodified, there's no point checking for it; that's just a
> > +      * waste of resources.  So set skip_unmodified to 0 so that
> > +      * estimate_similarity() and prefetch() won't waste resources checking
> > +      * for something we already know is false.
> > +      */
> > +     int skip_unmodified = 0;
> > +
>
>
>
> > -     /* TODO: Make use of basenames source and destination basenames */
> > +     /* Now look for basename matchups and do similarity estimation */
> > +     for (i = 0; i < num_src; ++i) {
> > +             char *filename = rename_src[i].p->one->path;
> > +             char *base = NULL;
> > +             intptr_t src_index;
> > +             intptr_t dst_index;
> > +
> > +             /* Get the basename */
> > +             base = strrchr(filename, '/');
> > +             base = (base ? base+1 : filename);
>
> Here is the third instance of this in the same function. At minimum we should
> extract a helper for you to consume.

Where by "this" you mean these last two lines, right?

And perhaps explain why I'm not using either basename(3) or
gitbasename() from git-compat-util.h?  (The latter of which I just
learned about while responding to the review of this patch.)

or maybe gitbasename can do the job, but the skip_dos_drive_prefix()
and the munging of the string passed in both worry me.  And the
is_dir_sep() looks inefficient since I know I'm dealing with filenames
as stored in git internally, and thus can only use '/' characters.
Hmm...

Yeah, I think I'll add my own helper in this file, since you want one,
and just use it.

> > +             /* Find out if this basename is unique among sources */
> > +             src_index = strintmap_get(&sources, base);
> > +             if (src_index == -1)
> > +                     continue; /* not a unique basename; skip it */
> > +             assert(src_index == i);
> > +
> > +             if (strintmap_contains(&dests, base)) {
> > +                     struct diff_filespec *one, *two;
> > +                     int score;
> > +
> > +                     /* Find out if this basename is unique among dests */
> > +                     dst_index = strintmap_get(&dests, base);
> > +                     if (dst_index == -1)
> > +                             continue; /* not a unique basename; skip it */
> > +
> > +                     /* Ignore this dest if already used in a rename */
> > +                     if (rename_dst[dst_index].is_rename)
> > +                             continue; /* already used previously */
> > +
> > +                     /* Estimate the similarity */
> > +                     one = rename_src[src_index].p->one;
> > +                     two = rename_dst[dst_index].p->two;
> > +                     score = estimate_similarity(options->repo, one, two,
> > +                                                 minimum_score, skip_unmodified);
> > +
> > +                     /* If sufficiently similar, record as rename pair */
> > +                     if (score < minimum_score)
> > +                             continue;
> > +                     record_rename_pair(dst_index, src_index, score);
> > +                     renames++;
> > +
> > +                     /*
> > +                      * Found a rename so don't need text anymore; if we
> > +                      * didn't find a rename, the filespec_blob would get
> > +                      * re-used when doing the matrix of comparisons.
> > +                      */
> > +                     diff_free_filespec_blob(one);
> > +                     diff_free_filespec_blob(two);
> > +             }
> > +     }
>
> Makes sense to me.
>
> Thanks,
> -Stolee
Derrick Stolee Feb. 9, 2021, 5:34 p.m. UTC | #3
On 2/9/2021 12:17 PM, Elijah Newren wrote:
> On Tue, Feb 9, 2021 at 5:25 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
>>> +             /* Get the basename */
>>> +             base = strrchr(filename, '/');
>>> +             base = (base ? base+1 : filename);
>>
>> Here is the third instance of this in the same function. At minimum we should
>> extract a helper for you to consume.
> 
> Where by "this" you mean these last two lines, right?

Correct. The reason to use a helper is to ease cognitive load when
reading the code. These lines are identical and serve the same
purpose. By making a "get_basename()" helper and using it as

	base = get_basename(filename);

makes it easy to understand what is happening without needing
to think carefully about it. For example, I had to remember
that strrchr() returns NULL when '/' is not found, not the first
character of the string.

> And perhaps explain why I'm not using either basename(3) or
> gitbasename() from git-compat-util.h?  (The latter of which I just
> learned about while responding to the review of this patch.)
> 
> or maybe gitbasename can do the job, but the skip_dos_drive_prefix()
> and the munging of the string passed in both worry me.  And the
> is_dir_sep() looks inefficient since I know I'm dealing with filenames
> as stored in git internally, and thus can only use '/' characters.
> Hmm...
> 
> Yeah, I think I'll add my own helper in this file, since you want one,
> and just use it.

Right. I almost made a point to say "Don't use find_last_dir_sep()"
because it uses platform-specific directory separators. Your helper
is based on in-memory representation that always uses Unix-style paths.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1c52077b04e5..b1dda41de9b1 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -372,10 +372,48 @@  static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 int num_src)
 {
-	int i;
+	/*
+	 * When I checked, over 76% of file renames in linux just moved
+	 * files to a different directory but kept the same basename.  gcc
+	 * did that with over 64% of renames, gecko did it with over 79%,
+	 * and WebKit did it with over 89%.
+	 *
+	 * Therefore we can bypass the normal exhaustive NxM matrix
+	 * comparison of similarities between all potential rename sources
+	 * and destinations by instead using file basename as a hint, checking
+	 * for similarity between files with the same basename, and if we
+	 * find a pair that are sufficiently similar, record the rename
+	 * pair and exclude those two from the NxM matrix.
+	 *
+	 * This *might* cause us to find a less than optimal pairing (if
+	 * there is another file that we are even more similar to but has a
+	 * different basename).  Given the huge performance advantage
+	 * basename matching provides, and given the frequency with which
+	 * people use the same basename in real world projects, that's a
+	 * trade-off we are willing to accept when doing just rename
+	 * detection.  However, if someone wants copy detection that
+	 * implies they are willing to spend more cycles to find
+	 * similarities between files, so it may be less likely that this
+	 * heuristic is wanted.
+	 */
+
+	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
 
+	/*
+	 * The prefeteching stuff wants to know if it can skip prefetching blobs
+	 * that are unmodified.  unmodified blobs are only relevant when doing
+	 * copy detection.  find_basename_matches() is only used when detecting
+	 * renames, not when detecting copies, so it'll only be used when a file
+	 * only existed in the source.  Since we already know that the file
+	 * won't be unmodified, there's no point checking for it; that's just a
+	 * waste of resources.  So set skip_unmodified to 0 so that
+	 * estimate_similarity() and prefetch() won't waste resources checking
+	 * for something we already know is false.
+	 */
+	int skip_unmodified = 0;
+
 	/* Create maps of basename -> fullname(s) for sources and dests */
 	strintmap_init_with_options(&sources, -1, NULL, 0);
 	strintmap_init_with_options(&dests, -1, NULL, 0);
@@ -412,12 +450,62 @@  static int find_basename_matches(struct diff_options *options,
 			strintmap_set(&dests, base, i);
 	}
 
-	/* TODO: Make use of basenames source and destination basenames */
+	/* Now look for basename matchups and do similarity estimation */
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		char *base = NULL;
+		intptr_t src_index;
+		intptr_t dst_index;
+
+		/* Get the basename */
+		base = strrchr(filename, '/');
+		base = (base ? base+1 : filename);
+
+		/* Find out if this basename is unique among sources */
+		src_index = strintmap_get(&sources, base);
+		if (src_index == -1)
+			continue; /* not a unique basename; skip it */
+		assert(src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
+			struct diff_filespec *one, *two;
+			int score;
+
+			/* Find out if this basename is unique among dests */
+			dst_index = strintmap_get(&dests, base);
+			if (dst_index == -1)
+				continue; /* not a unique basename; skip it */
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
+			/* Estimate the similarity */
+			one = rename_src[src_index].p->one;
+			two = rename_dst[dst_index].p->two;
+			score = estimate_similarity(options->repo, one, two,
+						    minimum_score, skip_unmodified);
+
+			/* If sufficiently similar, record as rename pair */
+			if (score < minimum_score)
+				continue;
+			record_rename_pair(dst_index, src_index, score);
+			renames++;
+
+			/*
+			 * Found a rename so don't need text anymore; if we
+			 * didn't find a rename, the filespec_blob would get
+			 * re-used when doing the matrix of comparisons.
+			 */
+			diff_free_filespec_blob(one);
+			diff_free_filespec_blob(two);
+		}
+	}
 
 	strintmap_clear(&sources);
 	strintmap_clear(&dests);
 
-	return 0;
+	return renames;
 }
 
 #define NUM_CANDIDATE_PER_DST 4