Message ID | 381a45d239bb52a70373c385d8978005c9cb4800.1612870326.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Optimization batch 7: use file basenames to guide rename detection | expand |
On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@gmail.com> > > We want to make use of unique basenames to help inform rename detection, > so that more likely pairings can be checked first. (src/moduleA/foo.txt > and source/module/A/foo.txt are likely related if there are no other > 'foo.txt' files among the deleted and added files.) Add a new function, > not yet used, which creates a map of the unique basenames within > rename_src and another within rename_dst, together with the indices > within rename_src/rename_dst where those basenames show up. Non-unique > basenames still show up in the map, but have an invalid index (-1). > > This function was inspired by the fact that in real world repositories, > most renames often do not involve a basename change. Here are some > sample repositories and the percentage of their historical renames (as of > early 2020) that did not involve a basename change: I found this difficult to parse. Perhaps instead "the percentage of their renames that preserved basenames". We might also need something stronger, though: which percentage of renames preserved the basename but also had no other copy of that basename in the scope of all add/deletes? Is this reproducible from a shell command that could be documented here? > +MAYBE_UNUSED > +static int find_basename_matches(struct diff_options *options, > + int minimum_score, > + int num_src) > +{ > + int i; > + struct strintmap sources; > + struct strintmap dests; > + > + /* Create maps of basename -> fullname(s) for sources and dests */ > + strintmap_init_with_options(&sources, -1, NULL, 0); > + strintmap_init_with_options(&dests, -1, NULL, 0); Initially, I was wondering why we need the map for each side, but we will need to enforce uniqueness in each set, so OK. > + for (i = 0; i < num_src; ++i) { > + char *filename = rename_src[i].p->one->path; > + char *base; > + > + /* exact renames removed in remove_unneeded_paths_from_src() */ > + assert(!rename_src[i].p->one->rename_used); > + > + base = strrchr(filename, '/'); > + base = (base ? base+1 : filename); nit: "base + 1" Also, this is used here and below. Perhaps it's worth pulling out as a helper? I see similar code being duplicated in these existing spots: * diff-no-index.c:append_basename() * help.c:append_similar_ref() * packfile.c:pack_basename() * replace-object.c:register_replace_ref() * setup.c:read_gitfile_gently() * builtin/rebase.c:cmd_rebase() * builtin/stash.c:do_create_stash() * builtin/worktree.c:add_worktree() * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage() * contrib/credential/libsecret/git-credential-libsecret.c:usage() * trace2/tr2_dst.c:tr2_dst_try_auto_path() There are other places that use strchr(_, '/') but they seem to be related to peeling basenames off of paths and using the leading portion of the path. > + /* Record index within rename_src (i) if basename is unique */ > + if (strintmap_contains(&sources, base)) > + strintmap_set(&sources, base, -1); > + else > + strintmap_set(&sources, base, i); > + } > + for (i = 0; i < rename_dst_nr; ++i) { > + char *filename = rename_dst[i].p->two->path; > + char *base; > + > + if (rename_dst[i].is_rename) > + continue; /* involved in exact match already. */ > + > + base = strrchr(filename, '/'); > + base = (base ? base+1 : filename); > + > + /* Record index within rename_dst (i) if basename is unique */ > + if (strintmap_contains(&dests, base)) > + strintmap_set(&dests, base, -1); > + else > + strintmap_set(&dests, base, i); > + } > + > + /* TODO: Make use of basenames source and destination basenames */ > + > + strintmap_clear(&sources); > + strintmap_clear(&dests); > + > + return 0; > +} Thanks, -Stolee
Hi, On Tue, Feb 9, 2021 at 5:17 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote: > > From: Elijah Newren <newren@gmail.com> > > > > We want to make use of unique basenames to help inform rename detection, > > so that more likely pairings can be checked first. (src/moduleA/foo.txt > > and source/module/A/foo.txt are likely related if there are no other > > 'foo.txt' files among the deleted and added files.) Add a new function, > > not yet used, which creates a map of the unique basenames within > > rename_src and another within rename_dst, together with the indices > > within rename_src/rename_dst where those basenames show up. Non-unique > > basenames still show up in the map, but have an invalid index (-1). > > > > This function was inspired by the fact that in real world repositories, > > most renames often do not involve a basename change. Here are some > > sample repositories and the percentage of their historical renames (as of > > early 2020) that did not involve a basename change: > > I found this difficult to parse. Perhaps instead > > "the percentage of their renames that preserved basenames". Ooh, I like it; happy to make that change. > We might also need something stronger, though: which percentage of renames > preserved the basename but also had no other copy of that basename in the > scope of all add/deletes? I don't think it's useful to try to prove that this idea can save time or how much time we can save before we try it; I think the only purpose of these numbers should be to motivate the idea behind why it was worth trying. If we attempt to prove how much we'll save apriori, then what you have is also too weak. We would need "percentage of total adds/deletes that are renames that preserved the basename but also had no other copy of that basename in the scope of all add/deletes". But that is also wrong, actually; we need "for any given two commits that we are likely to diff, what is the average percentage of total adds/deletes between them that are renames that preserved the basename but also had no other copy of that basename in the scope of all add/deletes". In particular, my script did not look at the "any two given likely-to-be-diffed commits" viewpoint, I simply added the number of renames within individual commits that preserved renames, and divided by the total number of renames in individual commits. But even if we could calculate the "any two given likely-to-be-diffed commits" viewpoint in some sane manner, it'd still be misleading. The next series is going to change the "no other copy of that basename in the scope of all adds/deletes" caveat, by adding a way to match up _some_ of those files (when it can find a way to compare any given file to exactly one of the other files with the same basename). And even if you consider all the above and calculated it in order to try to show how much could be saved, you might need to start worrying about details like the fact that the first comparison between files in diffcore-rename.c is _much_ more expensive than subsequent comparisons (due to the fact that the spanhash is cached). Trying to account for all these details and describe them fully is completely beside the point, though; I didn't bother to check any of this before implementing the algorithm -- I just looked up these very rough numbers and felt they provided sufficient motivation that there was an optimization worth trying. > Is this reproducible from a shell command that could be documented here? No, trying to parse log output with full handling of proper quoting in the case of filenames with funny characters is too complex to attempt in shell. I was surprised by how long it turned out to be in python. (And I dread attempting to calculate "something stronger" in any accurate way given how involved just this rough calculation was. That idea seems harder to me than actually implementing this series.) If you're curious, though, and don't care about quickly-hacked-together-script-not-designed-for-reuse: https://github.com/newren/git/blob/ort/rebase-testcase/count-renames.py > > +MAYBE_UNUSED > > +static int find_basename_matches(struct diff_options *options, > > + int minimum_score, > > + int num_src) > > +{ > > + int i; > > + struct strintmap sources; > > + struct strintmap dests; > > + > > + /* Create maps of basename -> fullname(s) for sources and dests */ > > + strintmap_init_with_options(&sources, -1, NULL, 0); > > + strintmap_init_with_options(&dests, -1, NULL, 0); > > Initially, I was wondering why we need the map for each side, but we will need > to enforce uniqueness in each set, so OK. > >> > + for (i = 0; i < num_src; ++i) { > > + char *filename = rename_src[i].p->one->path; > > + char *base; > > + > > + /* exact renames removed in remove_unneeded_paths_from_src() */ > > + assert(!rename_src[i].p->one->rename_used); > > + > > + base = strrchr(filename, '/'); > > + base = (base ? base+1 : filename); > > nit: "base + 1" Will fix. > Also, this is used here and below. Perhaps it's worth pulling out as a > helper? I see similar code being duplicated in these existing spots: > > * diff-no-index.c:append_basename() > * help.c:append_similar_ref() > * packfile.c:pack_basename() > * replace-object.c:register_replace_ref() > * setup.c:read_gitfile_gently() > * builtin/rebase.c:cmd_rebase() > * builtin/stash.c:do_create_stash() > * builtin/worktree.c:add_worktree() > * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage() > * contrib/credential/libsecret/git-credential-libsecret.c:usage() > * trace2/tr2_dst.c:tr2_dst_try_auto_path() Honestly asking: would anyone ever search for such a two-line helper function? I wouldn't have even thought to look, since it seems so simple. However, my real concern here is that this type of change would risk introducing conflicts with unrelated series. This series is the second in what will be a 9-series deep dependency chain of optimizations[1], and the later series are going to be longer than these first two were (the latter ones are 6-11 patches each). We've already discussed previously whether we possibly want to hold the first couple optimization series out of the upcoming git-2.31 release in order to keep the optimizations all together, but that might increase the risk of conflicts with unrelated patches if we try a bigger tree refactor like this. (Junio never commented on that, though.) It might be better to keep the series touching only merge-ort.c & diffcore-rename.c, and then do cleanups like the one you suggest here after the whole series. That said, it's not a difficult initial change, so I'm mostly expressing this concern out of making things harder for Junio. It'd be best to get his opinion -- Junio, your thoughts? [1] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch > There are other places that use strchr(_, '/') but they seem to be related > to peeling basenames off of paths and using the leading portion of the path. > > > + /* Record index within rename_src (i) if basename is unique */ > > + if (strintmap_contains(&sources, base)) > > + strintmap_set(&sources, base, -1); > > + else > > + strintmap_set(&sources, base, i); > > + } > > + for (i = 0; i < rename_dst_nr; ++i) { > > + char *filename = rename_dst[i].p->two->path; > > + char *base; > > + > > + if (rename_dst[i].is_rename) > > + continue; /* involved in exact match already. */ > > + > > + base = strrchr(filename, '/'); > > + base = (base ? base+1 : filename); > > + > > + /* Record index within rename_dst (i) if basename is unique */ > > + if (strintmap_contains(&dests, base)) > > + strintmap_set(&dests, base, -1); > > + else > > + strintmap_set(&dests, base, i); > > + } > > + > > + /* TODO: Make use of basenames source and destination basenames */ > > + > > + strintmap_clear(&sources); > > + strintmap_clear(&dests); > > + > > + return 0; > > +} > > Thanks, > -Stolee Thanks for the review!
On 2/9/2021 11:56 AM, Elijah Newren wrote: >> Also, this is used here and below. Perhaps it's worth pulling out as a >> helper? I see similar code being duplicated in these existing spots: >> >> * diff-no-index.c:append_basename() >> * help.c:append_similar_ref() >> * packfile.c:pack_basename() >> * replace-object.c:register_replace_ref() >> * setup.c:read_gitfile_gently() >> * builtin/rebase.c:cmd_rebase() >> * builtin/stash.c:do_create_stash() >> * builtin/worktree.c:add_worktree() >> * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage() >> * contrib/credential/libsecret/git-credential-libsecret.c:usage() >> * trace2/tr2_dst.c:tr2_dst_try_auto_path() > Honestly asking: would anyone ever search for such a two-line helper > function? I wouldn't have even thought to look, since it seems so > simple. > > However, my real concern here is that this type of change would risk > introducing conflicts with unrelated series. This series is the > second in what will be a 9-series deep dependency chain of > optimizations[1], and the later series are going to be longer than > these first two were (the latter ones are 6-11 patches each). We've > already discussed previously whether we possibly want to hold the > first couple optimization series out of the upcoming git-2.31 release > in order to keep the optimizations all together, but that might > increase the risk of conflicts with unrelated patches if we try a > bigger tree refactor like this. (Junio never commented on that, > though.) It might be better to keep the series touching only > merge-ort.c & diffcore-rename.c, and then do cleanups like the one you > suggest here after the whole series. > > That said, it's not a difficult initial change, so I'm mostly > expressing this concern out of making things harder for Junio. It'd > be best to get his opinion -- Junio, your thoughts? > > [1] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch I don't consider the step of "go put the helper in all these other places" necessary for the current series. However, the "get basename" code appears a total of three times in this series, so it would be good to at least extract it to a static inline method to reduce the duplication isolated to this change. Thanks, -Stolee
On Tue, Feb 9, 2021 at 9:02 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 2/9/2021 11:56 AM, Elijah Newren wrote: > >> Also, this is used here and below. Perhaps it's worth pulling out as a > >> helper? I see similar code being duplicated in these existing spots: > >> > >> * diff-no-index.c:append_basename() > >> * help.c:append_similar_ref() > >> * packfile.c:pack_basename() > >> * replace-object.c:register_replace_ref() > >> * setup.c:read_gitfile_gently() > >> * builtin/rebase.c:cmd_rebase() > >> * builtin/stash.c:do_create_stash() > >> * builtin/worktree.c:add_worktree() > >> * contrib/credential/gnome-keyring/git-credential-gnome-keyring.c:usage() > >> * contrib/credential/libsecret/git-credential-libsecret.c:usage() > >> * trace2/tr2_dst.c:tr2_dst_try_auto_path() > > Honestly asking: would anyone ever search for such a two-line helper > > function? I wouldn't have even thought to look, since it seems so > > simple. > > > > However, my real concern here is that this type of change would risk > > introducing conflicts with unrelated series. This series is the > > second in what will be a 9-series deep dependency chain of > > optimizations[1], and the later series are going to be longer than > > these first two were (the latter ones are 6-11 patches each). We've > > already discussed previously whether we possibly want to hold the > > first couple optimization series out of the upcoming git-2.31 release > > in order to keep the optimizations all together, but that might > > increase the risk of conflicts with unrelated patches if we try a > > bigger tree refactor like this. (Junio never commented on that, > > though.) It might be better to keep the series touching only > > merge-ort.c & diffcore-rename.c, and then do cleanups like the one you > > suggest here after the whole series. > > > > That said, it's not a difficult initial change, so I'm mostly > > expressing this concern out of making things harder for Junio. It'd > > be best to get his opinion -- Junio, your thoughts? > > > > [1] https://github.com/gitgitgadget/git/pulls?q=is%3Apr+author%3Anewren+Optimization+batch > > I don't consider the step of "go put the helper in all these other > places" necessary for the current series. However, the "get basename" > code appears a total of three times in this series, so it would be > good to at least extract it to a static inline method to reduce > the duplication isolated to this change. Sounds good; will do.
diff --git a/diffcore-rename.c b/diffcore-rename.c index 74930716e70d..1c52077b04e5 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -367,6 +367,59 @@ static int find_exact_renames(struct diff_options *options) return renames; } +MAYBE_UNUSED +static int find_basename_matches(struct diff_options *options, + int minimum_score, + int num_src) +{ + int i; + struct strintmap sources; + struct strintmap dests; + + /* Create maps of basename -> fullname(s) for sources and dests */ + strintmap_init_with_options(&sources, -1, NULL, 0); + strintmap_init_with_options(&dests, -1, NULL, 0); + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + char *base; + + /* exact renames removed in remove_unneeded_paths_from_src() */ + assert(!rename_src[i].p->one->rename_used); + + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Record index within rename_src (i) if basename is unique */ + if (strintmap_contains(&sources, base)) + strintmap_set(&sources, base, -1); + else + strintmap_set(&sources, base, i); + } + for (i = 0; i < rename_dst_nr; ++i) { + char *filename = rename_dst[i].p->two->path; + char *base; + + if (rename_dst[i].is_rename) + continue; /* involved in exact match already. */ + + base = strrchr(filename, '/'); + base = (base ? base+1 : filename); + + /* Record index within rename_dst (i) if basename is unique */ + if (strintmap_contains(&dests, base)) + strintmap_set(&dests, base, -1); + else + strintmap_set(&dests, base, i); + } + + /* TODO: Make use of basenames source and destination basenames */ + + strintmap_clear(&sources); + strintmap_clear(&dests); + + return 0; +} + #define NUM_CANDIDATE_PER_DST 4 static void record_if_better(struct diff_score m[], struct diff_score *o) {