Message ID | cbd055ab33998962cb7712906cdad5dff3390660.1614123848.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Series | Optimization batch 8: use file basenames even more |
On 2/23/2021 6:44 PM, Elijah Newren via GitGitGadget wrote:
> +static char *get_dirname(const char *filename)
> +{
> +	char *slash = strrchr(filename, '/');
> +	return slash ? xstrndup(filename, slash-filename) : xstrdup("");

My brain interpreted "slash-filename" as a single token on first
read, which confused me briefly. Inserting spaces would help
readers like me.

> +	 *   (4) Check if applying that directory rename to the original file
> +	 *       would result in a destination filename that is in the
> +	 *       potential rename set.  If so, return the index of the
> +	 *       destination file (the index within rename_dst).

> +	 * This function, idx_possible_rename(), is only responsible for (4).

This helps isolate the important step to care about for the implementation,
while the rest of the context is important, too.

> +	char *old_dir, *new_dir, *new_path;
> +	int idx;
> +
> +	if (!info->setup)
> +		return -1;
> +
> +	old_dir = get_dirname(filename);
> +	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
> +	free(old_dir);
> +	if (!new_dir)
> +		return -1;
> +
> +	new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));

This is running in a loop, so `xstrfmt()` might be overkill compared
to something like

	strbuf_addstr(&new_path, new_dir);
	strbuf_addch(&new_path, '/');
	strbuf_addstr(&new_path, get_basename(filename));

but maybe the difference is too small to notice. (notice the type
change to "struct strbuf new_path = STRBUF_INIT;")

> +
> +	idx = strintmap_get(&info->idx_map, new_path);
> +	free(new_path);
> +	return idx;
> +}

Does what it says it does.

Thanks,
-Stolee
On Wed, Feb 24, 2021 at 9:35 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/23/2021 6:44 PM, Elijah Newren via GitGitGadget wrote:
> > +static char *get_dirname(const char *filename)
> > +{
> > +	char *slash = strrchr(filename, '/');
> > +	return slash ? xstrndup(filename, slash-filename) : xstrdup("");
>
> My brain interpreted "slash-filename" as a single token on first
> read, which confused me briefly. Inserting spaces would help
> readers like me.
>
> > +	 *   (4) Check if applying that directory rename to the original file
> > +	 *       would result in a destination filename that is in the
> > +	 *       potential rename set.  If so, return the index of the
> > +	 *       destination file (the index within rename_dst).
>
> > +	 * This function, idx_possible_rename(), is only responsible for (4).
>
> This helps isolate the important step to care about for the implementation,
> while the rest of the context is important, too.
>
> > +	char *old_dir, *new_dir, *new_path;
> > +	int idx;
> > +
> > +	if (!info->setup)
> > +		return -1;
> > +
> > +	old_dir = get_dirname(filename);
> > +	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
> > +	free(old_dir);
> > +	if (!new_dir)
> > +		return -1;
> > +
> > +	new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));
>
> This is running in a loop, so `xstrfmt()` might be overkill compared
> to something like
>
>	strbuf_addstr(&new_path, new_dir);
>	strbuf_addch(&new_path, '/');
>	strbuf_addstr(&new_path, get_basename(filename));
>
> but maybe the difference is too small to notice. (notice the type
> change to "struct strbuf new_path = STRBUF_INIT;")

Ooh, nice find.

Since this is in a loop over the renames as you point out, this is an
O(N) improvement (with N = number of renames) rather than an O(1)
improvement.  It does turn out to be hard to notice, though.  Since we
still have some O(N^2) code (all the inexact rename detection for
which our exact- and basename-guided detection optimizations can't
handle), with that N^2 actually being multiplied by the average number
of lines in the given files, this improvement does seem to mostly get
lost in the noise.

I tried a bunch of times to measure the performance with these
changes.  After a bunch of runs, it seems that this optimization saves
somewhere between 3-10ms (depending on which testcase, whether at this
point in the series or at the very end, etc.).  It's hard to pin down,
because the savings is less than the standard deviation of any given
sets of runs.  I don't think it's big enough to warrant restating the
performance measurements, but I'm very happy to include this
suggestion in my reroll.

> > +
> > +	idx = strintmap_get(&info->idx_map, new_path);
> > +	free(new_path);
> > +	return idx;
> > +}
>
> Does what it says it does.
>
> Thanks,
> -Stolee
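The append-based path construction discussed in this exchange can be sketched outside of git. This is a minimal standalone illustration and not git's strbuf API: `struct pathbuf`, `pathbuf_add()`, `pathbuf_join()` and `pathbuf_release()` are hypothetical names introduced here. The sketch also assumes the buffer is kept across iterations so its allocation can be reused, which the thread does not state.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; a stand-in for illustration, not git's strbuf. */
struct pathbuf {
	char *buf;
	size_t len;
	size_t alloc;
};

/* Append n bytes, growing the allocation when needed; buf stays NUL-terminated. */
static void pathbuf_add(struct pathbuf *pb, const char *data, size_t n)
{
	if (pb->len + n + 1 > pb->alloc) {
		size_t want = (pb->len + n + 1) * 2;
		char *grown = realloc(pb->buf, want);
		if (!grown)
			abort(); /* sketch only: give up on allocation failure */
		pb->buf = grown;
		pb->alloc = want;
	}
	memcpy(pb->buf + pb->len, data, n);
	pb->len += n;
	pb->buf[pb->len] = '\0';
}

/* Build "<dir>/<base>" with plain appends, without parsing a format string. */
static const char *pathbuf_join(struct pathbuf *pb, const char *dir,
				const char *base)
{
	assert(pb && dir && base);
	pb->len = 0; /* reset the length but keep the earlier allocation */
	pathbuf_add(pb, dir, strlen(dir));
	pathbuf_add(pb, "/", 1);
	pathbuf_add(pb, base, strlen(base));
	return pb->buf;
}

/* Free the buffer and return it to its zero-initialized state. */
static void pathbuf_release(struct pathbuf *pb)
{
	free(pb->buf);
	pb->buf = NULL;
	pb->len = pb->alloc = 0;
}
```

A caller would declare `struct pathbuf pb = {0};` once before the loop, call `pathbuf_join()` for each file, and call `pathbuf_release()` afterwards.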
diff --git a/diffcore-rename.c b/diffcore-rename.c
index d24f104aa81c..1e4a56adde2c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -374,6 +374,12 @@ struct dir_rename_info {
 	unsigned setup;
 };
 
+static char *get_dirname(const char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	return slash ? xstrndup(filename, slash-filename) : xstrdup("");
+}
+
 static void dirname_munge(char *filename)
 {
 	char *slash = strrchr(filename, '/');
@@ -651,6 +657,81 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
+MAYBE_UNUSED
+static int idx_possible_rename(char *filename, struct dir_rename_info *info)
+{
+	/*
+	 * Our comparison of files with the same basename (see
+	 * find_basename_matches() below), is only helpful when after exact
+	 * rename detection we have exactly one file with a given basename
+	 * among the rename sources and also only exactly one file with
+	 * that basename among the rename destinations.  When we have
+	 * multiple files with the same basename in either set, we do not
+	 * know which to compare against.  However, there are some
+	 * filenames that occur in large numbers (particularly
+	 * build-related filenames such as 'Makefile', '.gitignore', or
+	 * 'build.gradle' that potentially exist within every single
+	 * subdirectory), and for performance we want to be able to quickly
+	 * find renames for these files too.
+	 *
+	 * The reason basename comparisons are a useful heuristic was that it
+	 * is common for people to move files across directories while keeping
+	 * their filename the same.  If we had a way of determining or even
+	 * making a good educated guess about which directory these non-unique
+	 * basename files had moved the file to, we could check it.
+	 * Luckily...
+	 *
+	 * When an entire directory is in fact renamed, we have two factors
+	 * helping us out:
+	 *   (a) the original directory disappeared giving us a hint
+	 *       about when we can apply an extra heuristic.
+	 *   (a) we often have several files within that directory and
+	 *       subdirectories that are renamed without changes
+	 * So, rules for a heuristic:
+	 *   (0) If there basename matches are non-unique (the condition under
+	 *       which this function is called) AND
+	 *   (1) the directory in which the file was found has disappeared
+	 *       (i.e. dirs_removed is non-NULL and has a relevant entry) THEN
+	 *   (2) use exact renames of files within the directory to determine
+	 *       where the directory is likely to have been renamed to.  IF
+	 *       there is at least one exact rename from within that
+	 *       directory, we can proceed.
+	 *   (3) If there are multiple places the directory could have been
+	 *       renamed to based on exact renames, ignore all but one of them.
+	 *       Just use the destination with the most renames going to it.
+	 *   (4) Check if applying that directory rename to the original file
+	 *       would result in a destination filename that is in the
+	 *       potential rename set.  If so, return the index of the
+	 *       destination file (the index within rename_dst).
+	 *   (5) Compare the original file and returned destination for
+	 *       similarity, and if they are sufficiently similar, record the
+	 *       rename.
+	 *
+	 * This function, idx_possible_rename(), is only responsible for (4).
+	 * The conditions/steps in (1)-(3) are handled via setting up
+	 * dir_rename_count and dir_rename_guess in
+	 * initialize_dir_rename_info().  Steps (0) and (5) are handled by
+	 * the caller of this function.
+	 */
+	char *old_dir, *new_dir, *new_path;
+	int idx;
+
+	if (!info->setup)
+		return -1;
+
+	old_dir = get_dirname(filename);
+	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
+	free(old_dir);
+	if (!new_dir)
+		return -1;
+
+	new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));
+
+	idx = strintmap_get(&info->idx_map, new_path);
+	free(new_path);
+	return idx;
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
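
For readers who want to try step (4) in isolation, the lookup in idx_possible_rename() can be approximated with plain C and no git internals. This is a rough sketch under stated assumptions: small arrays with linear search stand in for the patch's strmap and strintmap, a fixed-size buffer stands in for xstrfmt(), and `struct dir_guess`, `sketch_dirname()`, `sketch_basename()` and `sketch_possible_rename()` are hypothetical names that do not exist in git.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the patch's dir_rename_guess map. */
struct dir_guess {
	const char *old_dir;
	const char *new_dir;
};

/* Directory part of a path, or "" when there is no slash; caller frees. */
static char *sketch_dirname(const char *filename)
{
	const char *slash;
	size_t len;
	char *dir;

	assert(filename != NULL);
	slash = strrchr(filename, '/');
	len = slash ? (size_t)(slash - filename) : 0;
	dir = malloc(len + 1);
	if (!dir)
		abort(); /* sketch only: give up on allocation failure */
	memcpy(dir, filename, len);
	dir[len] = '\0';
	return dir;
}

/* Final path component; points into filename, nothing to free. */
static const char *sketch_basename(const char *filename)
{
	const char *slash = strrchr(filename, '/');
	return slash ? slash + 1 : filename;
}

/*
 * Apply the guessed directory rename to filename and return the index of
 * the resulting path within dests, or -1 if there is no guess for the
 * directory or no destination with that path.
 */
static int sketch_possible_rename(const char *filename,
				  const struct dir_guess *guesses,
				  size_t nr_guesses,
				  const char *const *dests, size_t nr_dests)
{
	char *old_dir = sketch_dirname(filename);
	const char *new_dir = NULL;
	char new_path[4096];
	size_t i;
	int n;

	for (i = 0; i < nr_guesses; i++) {
		if (!strcmp(guesses[i].old_dir, old_dir)) {
			new_dir = guesses[i].new_dir;
			break;
		}
	}
	free(old_dir);
	if (!new_dir)
		return -1;

	n = snprintf(new_path, sizeof(new_path), "%s/%s", new_dir,
		     sketch_basename(filename));
	if (n < 0 || (size_t)n >= sizeof(new_path))
		return -1; /* path too long for this sketch's fixed buffer */

	for (i = 0; i < nr_dests; i++)
		if (!strcmp(dests[i], new_path))
			return (int)i;
	return -1;
}
```

With a single guess mapping "old/lib" to "new/lib" and destinations "new/lib/util.c" and "new/lib/Makefile", looking up "old/lib/Makefile" yields index 1, while a file from a directory that has no guess yields -1.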