diff mbox series

[v3,2/5] diffcore-rename: compute basenames of all source and dest candidates

Message ID 4fff9b1ff57b62587b1cbec2031f96199a214702.1612970140.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optimization batch 7: use file basenames to guide rename detection | expand

Commit Message

Elijah Newren Feb. 10, 2021, 3:15 p.m. UTC
From: Elijah Newren <newren@gmail.com>

We want to make use of unique basenames to help inform rename detection,
so that more likely pairings can be checked first.  (src/moduleA/foo.txt
and source/module/A/foo.txt are likely related if there are no other
'foo.txt' files among the deleted and added files.)  Add a new function,
not yet used, which creates a map of the unique basenames within
rename_src and another within rename_dst, together with the indices
within rename_src/rename_dst where those basenames show up.  Non-unique
basenames still show up in the map, but have an invalid index (-1).

This function was inspired by the fact that in real world repositories,
files are often moved across directories without changing names.  Here
are some sample repositories and the percentage of their historical
renames (as of early 2020) that preserved basenames:
  * linux: 76%
  * gcc: 64%
  * gecko: 79%
  * webkit: 89%
These statistics alone don't prove that an optimization in this area
will help or how much it will help, since there are also unpaired adds
and deletes, restrictions on which basenames we consider, etc., but it
certainly motivated the idea to try something in this area.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

Comments

Junio C Hamano Feb. 13, 2021, 1:32 a.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +MAYBE_UNUSED
> +static int find_basename_matches(struct diff_options *options,
> +				 int minimum_score,
> +				 int num_src)
> +{
> +	int i;
> +	struct strintmap sources;
> +	struct strintmap dests;
> +
> +	/* Create maps of basename -> fullname(s) for sources and dests */
> +	strintmap_init_with_options(&sources, -1, NULL, 0);
> +	strintmap_init_with_options(&dests, -1, NULL, 0);
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		const char *base;
> +
> +		/* exact renames removed in remove_unneeded_paths_from_src() */
> +		assert(!rename_src[i].p->one->rename_used);
> +
> +		/* Record index within rename_src (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&sources, base))
> +			strintmap_set(&sources, base, -1);
> +		else
> +			strintmap_set(&sources, base, i);
> +	}
> +	for (i = 0; i < rename_dst_nr; ++i) {
> +		char *filename = rename_dst[i].p->two->path;
> +		const char *base;
> +
> +		if (rename_dst[i].is_rename)
> +			continue; /* involved in exact match already. */
> +
> +		/* Record index within rename_dst (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&dests, base))
> +			strintmap_set(&dests, base, -1);
> +		else
> +			strintmap_set(&dests, base, i);
> +	}
> +
> +	/* TODO: Make use of basenames source and destination basenames */

;-)

So at this point sources and dests can be used to quickly look up,
given a filename, if there is a single src among all sources, and a
single dst among all dests, that have the filename.

I wonder if the second loop over destinations can be "optimized"
further by using the sources map, though.  The reason you quash
entries with -1 when you see second instance of the same name is
because you intend to limit the heuristics only to a uniquely named
file among the removed files going to a uniquely named file among
the added files, right?  So even if a name is unique among dests,
if that name has duplicates on the source side, there is no point
recording its location.  i.e.

	/* record index within dst if it is unique in both dst and src */
	base = get_basename(filename);
	if (strintmap_contains(&sources, base) ||
	    strintmap_contains(&dests, base))
		strintmap_set(&dests, base, -1);
	else
		strintmap_set(&dests, base, i);

perhaps?

I guess it depends on what actually will be written in this "TODO"
space how effective such a change would be.  Presumably, you'd
iterate over &sources while skipping entries that record -1, to
learn (basename, i), and use the basename found there to consult
&dests to see if it yields a non-negative integer j, to notice that
rename_src[i] is a good candidate to match rename_dst[j].  If that
is the case, then such a change won't help as an optimization at
all, as we'd need to consult &dests map with the basename anyway,
so let's scratch the above idea.

In any case, after we walk over rename_src[] and rename_dst[] once,
the number of entries in &sources would be smaller than rename_src[]
so iterating over &sources, hunting for entries that record
non-negative index into rename_src[] would hopefully be cheaper than
the naive loop we've been working with.  I like the idea of using the
strintmap for this part of the code.

Thanks.


> +	strintmap_clear(&sources);
> +	strintmap_clear(&dests);
> +
> +	return 0;
> +}
> +
>  #define NUM_CANDIDATE_PER_DST 4
>  static void record_if_better(struct diff_score m[], struct diff_score *o)
>  {
diff mbox series

Patch

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 74930716e70d..3eb49a098adf 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,67 @@  static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+static const char *get_basename(const char *filename)
+{
+	/*
+	 * gitbasename() has to worry about special drivers, multiple
+	 * directory separator characters, trailing slashes, NULL or
+	 * empty strings, etc.  We only work on filenames as stored in
+	 * git, and thus get to ignore all those complications.
+	 */
+	const char *base = strrchr(filename, '/');
+	return base ? base + 1 : filename;
+}
+
+MAYBE_UNUSED
+static int find_basename_matches(struct diff_options *options,
+				 int minimum_score,
+				 int num_src)
+{
+	int i;
+	struct strintmap sources;
+	struct strintmap dests;
+
+	/* Create maps of basename -> fullname(s) for sources and dests */
+	strintmap_init_with_options(&sources, -1, NULL, 0);
+	strintmap_init_with_options(&dests, -1, NULL, 0);
+	for (i = 0; i < num_src; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base;
+
+		/* exact renames removed in remove_unneeded_paths_from_src() */
+		assert(!rename_src[i].p->one->rename_used);
+
+		/* Record index within rename_src (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&sources, base))
+			strintmap_set(&sources, base, -1);
+		else
+			strintmap_set(&sources, base, i);
+	}
+	for (i = 0; i < rename_dst_nr; ++i) {
+		char *filename = rename_dst[i].p->two->path;
+		const char *base;
+
+		if (rename_dst[i].is_rename)
+			continue; /* involved in exact match already. */
+
+		/* Record index within rename_dst (i) if basename is unique */
+		base = get_basename(filename);
+		if (strintmap_contains(&dests, base))
+			strintmap_set(&dests, base, -1);
+		else
+			strintmap_set(&dests, base, i);
+	}
+
+	/* TODO: Make use of basenames source and destination basenames */
+
+	strintmap_clear(&sources);
+	strintmap_clear(&dests);
+
+	return 0;
+}
+
 #define NUM_CANDIDATE_PER_DST 4
 static void record_if_better(struct diff_score m[], struct diff_score *o)
 {