Message ID | 4fff9b1ff57b62587b1cbec2031f96199a214702.1612970140.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Optimization batch 7: use file basenames to guide rename detection | expand |
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > +MAYBE_UNUSED > +static int find_basename_matches(struct diff_options *options, > + int minimum_score, > + int num_src) > +{ > + int i; > + struct strintmap sources; > + struct strintmap dests; > + > + /* Create maps of basename -> fullname(s) for sources and dests */ > + strintmap_init_with_options(&sources, -1, NULL, 0); > + strintmap_init_with_options(&dests, -1, NULL, 0); > + for (i = 0; i < num_src; ++i) { > + char *filename = rename_src[i].p->one->path; > + const char *base; > + > + /* exact renames removed in remove_unneeded_paths_from_src() */ > + assert(!rename_src[i].p->one->rename_used); > + > + /* Record index within rename_src (i) if basename is unique */ > + base = get_basename(filename); > + if (strintmap_contains(&sources, base)) > + strintmap_set(&sources, base, -1); > + else > + strintmap_set(&sources, base, i); > + } > + for (i = 0; i < rename_dst_nr; ++i) { > + char *filename = rename_dst[i].p->two->path; > + const char *base; > + > + if (rename_dst[i].is_rename) > + continue; /* involved in exact match already. */ > + > + /* Record index within rename_dst (i) if basename is unique */ > + base = get_basename(filename); > + if (strintmap_contains(&dests, base)) > + strintmap_set(&dests, base, -1); > + else > + strintmap_set(&dests, base, i); > + } > + > + /* TODO: Make use of basenames source and destination basenames */ ;-) So at this point sources and dests can be used to quickly look up, given a filename, if there is a single src among all sources, and a single dst among all dests, that have the filename. I wonder if the second loop over destinations can be "optimized" further by using the sources map, though. The reason you quash entries with -1 when you see second instance of the same name is because you intend to limit the heuristics only to a uniquely named file among the removed files going to a uniquely named file among the added files, right? So even if a name is unique among dests, if that name has duplicates on the source side, there is no point recording its location. i.e. /* record index within dst if it is unique in both dst and src */ base = get_basename(filename); if (strintmap_contains(&sources, base) || strintmap_contains(&dests, base)) strintmap_set(&dests, base, -1); else strintmap_set(&dests, base, i); perhaps? I guess it depends on what actually will be written in this "TODO" space how effective such a change would be. Presumably, you'd iterate over &sources while skipping entries that record -1, to learn (basename, i), and use the basename found there to consult &dests to see if it yields a non-negative integer j, to notice that rename_src[i] is a good candidate to match rename_dst[j]. If that is the case, then such a change won't help as an optimization at all, as we'd need to consult &dests map with the basename anyway, so let's scratch the above idea. In any case, after we walk over rename_src[] and rename_dst[] once, the number of entries in &sources would be smaller than rename_src[] so iterating over &sources, hunting for entries that record non-negative index into rename_src[] would hopefully be cheaper than the naive loop we've been working with. I like the idea of using the strintmap for this part of the code. Thanks. > + strintmap_clear(&sources); > + strintmap_clear(&dests); > + > + return 0; > +} > + > #define NUM_CANDIDATE_PER_DST 4 > static void record_if_better(struct diff_score m[], struct diff_score *o) > {
diff --git a/diffcore-rename.c b/diffcore-rename.c index 74930716e70d..3eb49a098adf 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -367,6 +367,67 @@ static int find_exact_renames(struct diff_options *options) return renames; } +static const char *get_basename(const char *filename) +{ + /* + * gitbasename() has to worry about special drivers, multiple + * directory separator characters, trailing slashes, NULL or + * empty strings, etc. We only work on filenames as stored in + * git, and thus get to ignore all those complications. + */ + const char *base = strrchr(filename, '/'); + return base ? base + 1 : filename; +} + +MAYBE_UNUSED +static int find_basename_matches(struct diff_options *options, + int minimum_score, + int num_src) +{ + int i; + struct strintmap sources; + struct strintmap dests; + + /* Create maps of basename -> fullname(s) for sources and dests */ + strintmap_init_with_options(&sources, -1, NULL, 0); + strintmap_init_with_options(&dests, -1, NULL, 0); + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + const char *base; + + /* exact renames removed in remove_unneeded_paths_from_src() */ + assert(!rename_src[i].p->one->rename_used); + + /* Record index within rename_src (i) if basename is unique */ + base = get_basename(filename); + if (strintmap_contains(&sources, base)) + strintmap_set(&sources, base, -1); + else + strintmap_set(&sources, base, i); + } + for (i = 0; i < rename_dst_nr; ++i) { + char *filename = rename_dst[i].p->two->path; + const char *base; + + if (rename_dst[i].is_rename) + continue; /* involved in exact match already. */ + + /* Record index within rename_dst (i) if basename is unique */ + base = get_basename(filename); + if (strintmap_contains(&dests, base)) + strintmap_set(&dests, base, -1); + else + strintmap_set(&dests, base, i); + } + + /* TODO: Make use of basenames source and destination basenames */ + + strintmap_clear(&sources); + strintmap_clear(&dests); + + return 0; +} + #define NUM_CANDIDATE_PER_DST 4 static void record_if_better(struct diff_score m[], struct diff_score *o) {