Message ID | dc26881e4ed3447c6efdcd492463be294f99b8c4.1612970140.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Optimization batch 7: use file basenames to guide rename detection | expand |
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > + /* Now look for basename matchups and do similarity estimation */ > + for (i = 0; i < num_src; ++i) { > + char *filename = rename_src[i].p->one->path; > + const char *base = NULL; > + intptr_t src_index; > + intptr_t dst_index; > + > + /* Find out if this basename is unique among sources */ > + base = get_basename(filename); > + src_index = strintmap_get(&sources, base); > + if (src_index == -1) > + continue; /* not a unique basename; skip it */ > + assert(src_index == i); > + > + if (strintmap_contains(&dests, base)) { > + struct diff_filespec *one, *two; > + int score; > + > + /* Find out if this basename is unique among dests */ > + dst_index = strintmap_get(&dests, base); > + if (dst_index == -1) > + continue; /* not a unique basename; skip it */ It would be a lot easier to read if "we must have the same singleton in dests" in a single if condition, I suspect. I.e. if (strintmap_contains(&dests, base) && 0 <= (dst_index = (strintmap_get(&dests, base)))) { It is a bit sad that we iterate over rename_src[] array, even though we now have a map that presumably have fewer number of entries than the original array, though. > + /* Ignore this dest if already used in a rename */ > + if (rename_dst[dst_index].is_rename) > + continue; /* already used previously */ Since we will only be matching between unique entries in src and dst, this "this has been used, so we cannot use it" will not change during this loop. I wonder if the preparation done in the previous step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e. a dst that has already been used (in the previous "exact" step) would not even have to be in &dests map, so that the strintmap_contains() check can reject it much earlier. Stepping back a bit, it appears to me that [2/5] and [3/5] considers a source file having unique basename among the sources even if there are many such files with the same basename, as long as all the other files with the same basename have been matched in the previous "exact" phase. It probably does the same thing for destination side. Intended? It feels incompatible with the spirit of these two steps aim for (i.e. only use this optimization on a pair of src/dst with UNIQUE basenames). For the purpose of "we only handle unique ones", the paths that already have matched should participate in deciding if the files that survived "exact" phase have unique basename among the original inpu? Thanks.
On Fri, Feb 12, 2021 at 5:48 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > + /* Now look for basename matchups and do similarity estimation */ > > + for (i = 0; i < num_src; ++i) { > > + char *filename = rename_src[i].p->one->path; > > + const char *base = NULL; > > + intptr_t src_index; > > + intptr_t dst_index; > > + > > + /* Find out if this basename is unique among sources */ > > + base = get_basename(filename); > > + src_index = strintmap_get(&sources, base); > > + if (src_index == -1) > > + continue; /* not a unique basename; skip it */ > > + assert(src_index == i); > > + > > + if (strintmap_contains(&dests, base)) { > > + struct diff_filespec *one, *two; > > + int score; > > + > > + /* Find out if this basename is unique among dests */ > > + dst_index = strintmap_get(&dests, base); > > + if (dst_index == -1) > > + continue; /* not a unique basename; skip it */ > > It would be a lot easier to read if "we must have the same singleton > in dests" in a single if condition, I suspect. I.e. > > if (strintmap_contains(&dests, base) && > 0 <= (dst_index = (strintmap_get(&dests, base)))) { I can change that. I can also simplify it further to if (0 <= (dst_index = (strintmap_get(&dests, base)))) { since dests uses a default value of -1. That will decrease the number of strmap lookups here from 2 to 1. > It is a bit sad that we iterate over rename_src[] array, even though > we now have a map that presumably have fewer number of entries than > the original array, though. Oh, interesting; I forgot all about that. I just looked up my original implementation from February of last year and indeed I had done exactly that (https://github.com/newren/git/commit/43eaec6007c92b6af05e0ef0fcc047c1d1ba1de8). However, when I added a later optimization that pairs up non-unique basenames, I had to switch to looping over rename_src. For various reasons (mostly starting with the fact that I had lots of experimental ideas that were tried and thrown out but with pieces kept around for ideas), I wasn't even close to having a clean history in my original implementation of merge-ort and the diffcore-rename optimizations. And it was far, far easier to achieve the goal of a clean history by picking out chunks of code from the end-state and creating entirely new commits than attempting to use my existing history. But, of course, that method made me lose this intermediate state. > > > + /* Ignore this dest if already used in a rename */ > > + if (rename_dst[dst_index].is_rename) > > + continue; /* already used previously */ > > Since we will only be matching between unique entries in src and > dst, this "this has been used, so we cannot use it" will not change > during this loop. I wonder if the preparation done in the previous > step, i.e. [PATCH v3 2/5], can take advantage of this fact, i.e. a > dst that has already been used (in the previous "exact" step) would > not even have to be in &dests map, so that the strintmap_contains() > check can reject it much earlier. Good, catch again. The previous step (v4 2/5) actually did already check this, so this if-condition will always be false at this point. Looking at the link above, this if-condition check wasn't there in the original, but again was added due to altered state introduced by a later optimization. So, I should pull this check out of this patch and add it back in to the later patch. > Stepping back a bit, it appears to me that [2/5] and [3/5] considers > a source file having unique basename among the sources even if there > are many such files with the same basename, as long as all the other > files with the same basename have been matched in the previous > "exact" phase. It probably does the same thing for destination > side. > > Intended? > > It feels incompatible with the spirit of these two steps aim for > (i.e. only use this optimization on a pair of src/dst with UNIQUE > basenames). For the purpose of "we only handle unique ones", the > paths that already have matched should participate in deciding if > the files that survived "exact" phase have unique basename among > the original inpu? Yeah, I should have been more careful with my wording. Stated a different way, what confidence can we associate with an exact rename? Obviously, the confidence is high as we mark them as renames. But if the confidence is less than 100%, and enough less than 100% that it casts a doubt on "related" inexact renames, then yes the basenames of the exact renames should also be computed so that we can determine what basenames are truly unique. By the exact same argument, you could take this a step further and say that we should calculate the basenames of *all* files in the tree, not just add/delete pairs, and only match up the ones via basename that are *truly* unique. After all, break detection exists, so perhaps we don't have full confidence that files with an unchanged fullname are actually related. From my view, though, both are too cautious and throwing out valuable heuristics for common cases. Starting with break detection, it is off for a reason: we think unchanged filename is a strong enough heuristic to just match up those files and consider the confidence of the match in effect 100%. Similarly, we put a lot of confidence in exact rename detection. If there are multiple adds/deletes with the same basename, and all but one on each side are paired up by exact rename detection, aren't the remaining two files a (very) likely rename pair? I think so, and believe they're worth including in the basename-based rename detection step. We do require basename-based matches to meet a much higher similarity scoring threshold now, which I feel already adequately adjusts for not doing full content similarity against all other files. Also, in the next series, I find an additional way to match up files by basename when basenames are not unique, and which doesn't involve pairwise comparing all the files with the same basename. I only pick at most one other file to compare to (and the selection is not random). So, my overall strategy for these two series is "find which basenames are likely matches" even if I didn't word it very well. I do agree, though, that I should add some more careful wording about this in the series. I'll include it in a re-roll.
Elijah Newren <newren@gmail.com> writes: > I can change that. I can also simplify it further to > > if (0 <= (dst_index = (strintmap_get(&dests, base)))) { > > since dests uses a default value of -1. That will decrease the number > of strmap lookups here from 2 to 1. Which would be a real win, unlike what I said in the message you are responding to. >> It feels incompatible with the spirit of these two steps aim for >> (i.e. only use this optimization on a pair of src/dst with UNIQUE >> basenames). For the purpose of "we only handle unique ones", the >> paths that already have matched should participate in deciding if >> the files that survived "exact" phase have unique basename among >> the original inpu? > > Yeah, I should have been more careful with my wording. Stated a > different way, what confidence can we associate with an exact rename? Suppose you start with a/Makefile, b/Makefile and c/Makefile and then they all disappeared while a1/Makefile, b1/Makefile and c1/Makefile now are in the tree. The contents a/Makefile used to have appears without any difference in a1/Makefile, the same for b and b1, but c/Makefile and c1/Makefile are different. The c vs c1 pair may worth investigating, so it goes through the "same basename" phase. Now, in a slightly different situation, a vs a1 are still identical, but b vs b1 have only one blank line removal but without any other change. It looks odd that such a change has to pessimize c vs c1 optimization opportunity, but an interesting part of the story is that we can only say "such a change", not "such a miniscule change", because we have just finished the "exact" phase, and we do not know how big a difference b vs b1 pair actually had. That makes me feel that this whole "we must treat unique one that remains specially" is being incoherent. If "because we have only small number of removed and added Makefiles spread across the trees, first full-matrix matching among them without anything else with higher bar may be worth an optimization" were the optimization, then I would understand and support the design to omit those that have already been matched in the "exact" phase. But IIRC, limiting this "same basename" phase to unique add/del pair was sold as a way to make it less likely for the heuristics to make mistakes, yet the definition of "unique", as shown above, is not all that solid. That I find it rather unsatisfactory. In other words, it is not "what confidence do we have in exact phase?" "exact" matching may have found perfect matching pair. But the found pair should be happy just between themselves, and should not have undue effect on how _other_ pairs are compared. Stopping the "exact" pair from participating in the "uniqueness" definition is placing "exact" phase too much weight to affect how other filepairs are found. > By the exact same argument, you > could take this a step further and say that we should calculate the > basenames of *all* files in the tree, not just add/delete pairs, and > only match up the ones via basename that are *truly* unique. After > all, break detection exists, so perhaps we don't have full confidence > that files with an unchanged fullname are actually related. Sorry, but you are not making sense. These optimizations are done only when we are not using copies and breaks, no? What _other_ changes that kept the paths the same, or modified in place, have any effect on matching added and deleted pairs?
On Sat, Feb 13, 2021 at 3:55 PM Junio C Hamano <gitster@pobox.com> wrote: > > Elijah Newren <newren@gmail.com> writes: > > > I can change that. I can also simplify it further to > > > > if (0 <= (dst_index = (strintmap_get(&dests, base)))) { > > > > since dests uses a default value of -1. That will decrease the number > > of strmap lookups here from 2 to 1. > > Which would be a real win, unlike what I said in the message you are > responding to. Sadly, while it's a real win, it's very temporary. The next series I'll submit needs to separate the two checks back out for other reasons. > >> It feels incompatible with the spirit of these two steps aim for > >> (i.e. only use this optimization on a pair of src/dst with UNIQUE > >> basenames). For the purpose of "we only handle unique ones", the > >> paths that already have matched should participate in deciding if > >> the files that survived "exact" phase have unique basename among > >> the original inpu? > > > > Yeah, I should have been more careful with my wording. Stated a > > different way, what confidence can we associate with an exact rename? > > Suppose you start with a/Makefile, b/Makefile and c/Makefile and > then they all disappeared while a1/Makefile, b1/Makefile and > c1/Makefile now are in the tree. The contents a/Makefile used to > have appears without any difference in a1/Makefile, the same for b > and b1, but c/Makefile and c1/Makefile are different. The c vs c1 > pair may worth investigating, so it goes through the "same basename" > phase. > > Now, in a slightly different situation, a vs a1 are still identical, > but b vs b1 have only one blank line removal but without any other > change. It looks odd that such a change has to pessimize c vs c1 > optimization opportunity, but an interesting part of the story is > that we can only say "such a change", not "such a miniscule change", > because we have just finished the "exact" phase, and we do not know > how big a difference b vs b1 pair actually had. > > That makes me feel that this whole "we must treat unique one that > remains specially" is being incoherent. It's really not that special; the pessimization is not in my mind due to correctness reasons, but performance reasons. I need to only compare any given file to at most one other file in the preliminary steps. When there are multiple remaining possibilities to compare, I need a method for selecting which ones to compare. I have such a method, but it's a lot more code. It was easier to submit a series that was only 3 patches long and only considered the pairs that just happened to uniquely match up so we could talk about the general idea of basename matching. The next series finds ways to match up more files with similar basenames. > If "because we have only > small number of removed and added Makefiles spread across the trees, > first full-matrix matching among them without anything else with > higher bar may be worth an optimization" were the optimization, then This optimization was indeed considered...and fully implemented. Let's give it a name, so I can refer to it more below. How about the "preliminary-matrix-of-basenames" optimization? > I would understand and support the design to omit those that have > already been matched in the "exact" phase. > > But IIRC, limiting this "same basename" phase to unique add/del pair > was sold as a way to make it less likely for the heuristics to make > mistakes, yet the definition of "unique", as shown above, is not all > that solid. That I find it rather unsatisfactory. No, I never sold it as a way to make it less likely for the heuristics to make mistakes. If I implied that anywhere, it was on accident. I certainly emphasized only doing one comparison per file, but not for that reason. I had three reasons for mentioning one-comparison-per-file: (1) I was trying to contrast with Stolee's original assumption about what this series was doing, to try to avoid a repeat of the misunderstandings about the current optimization being suggested. (2) The preliminary-matrix-of-basenames optimization has worst-case performance nearly twice as bad as without such an optimization. (For example, with preliminary-matrix-of-basenames, if nearly all unmatched files have the same basename, we end up basically doing inexact rename detection on all files twice). I believe Stolee's original assumption of what was being proposed also has such twice-as-slow-as-normal worst-case performance behavior. Even though the worst case performance would be fairly rare, making an algorithm twice as slow by introducing an optimization felt like something I should avoid. (3) Despite the theoretical problems with worst-case performance, I implemented the preliminary-matrix-of-basenames optimization anyway. I threw the code away, because even in cases with a wide variety of basenames, it slowed things down when other optimizations were also involved. The one clear way to work well with other optimizations I was working with was to only allow the preliminary step to compare any given file to at most one other file. > In other words, it is not "what confidence do we have in exact > phase?" "exact" matching may have found perfect matching pair. But > the found pair should be happy just between themselves, and should > not have undue effect on how _other_ pairs are compared. Stopping > the "exact" pair from participating in the "uniqueness" definition > is placing "exact" phase too much weight to affect how other filepairs > are found. I guess I look at this quite a bit differently. Here's my view: * If we have a reasonable and cheap way to determine that two particular files are likely potential rename pairs, * AND checking their similarity confirms they are sufficiently similar (perhaps with a higher bar) * then we've found a way to avoid quadratic comparisons. We will give up "optimal" matches, but as long as what we provide are "reasonable" matches I think that should suffice. I personally believe "reasonable" at O(N) cost trumps "optimal" at O(N^2). There are several different ways to find "likely potential rename pairs": * The preliminary-matrix-of-basenames is one that I tried (but interacts badly performance-wise with other optimizations). * https://github.com/gitgitgadget/git/issues/519 has multiple ideas. * Stolee's misunderstanding of my series is another * unique basenames among remaining pairs after exact renames is a really simple one that lets me introduce "reasonable" matches so we can discuss * my next series adds another That leaves us with a big question. Are we happy with higher sufficient similarity bar being enough of a constraint for "reasonable" matches? If so, each of the above ideas might be able to help us. If not, we may be able to rule some of them out apriori and avoid working on them (well, working on them any more; I've already implemented three, and we have an intern who picked a project to look at one) > > By the exact same argument, you > > could take this a step further and say that we should calculate the > > basenames of *all* files in the tree, not just add/delete pairs, and > > only match up the ones via basename that are *truly* unique. After > > all, break detection exists, so perhaps we don't have full confidence > > that files with an unchanged fullname are actually related. > > Sorry, but you are not making sense. These optimizations are done > only when we are not using copies and breaks, no? What _other_ > changes that kept the paths the same, or modified in place, have any > effect on matching added and deleted pairs? If the optimization is presented to users as "only compare basenames in a preliminary step when they are unique", which is what I was understanding you to say, and if the user has a/Makefile and d/Makefile in the source tree, and a1/Makefile and d/Makefile in the destination tree, then a/Makefile is not the unique "Makefile" in the source tree. I think you're trying to make an argument about uniqueness and why it matters for correctness, but I'm not following it. The only reason uniqueness is important to me is because I was using it with future optimizations in mind, and knew it to be related to an important performance criteria. I tried to avoid mentioning uniqueness at all in the user-facing documentation, though I did try to explain why some files with the same basename might not be matched up by that step (and my next series modifies those docs a bit.)
diff --git a/diffcore-rename.c b/diffcore-rename.c index 3eb49a098adf..001645624e71 100644 --- a/diffcore-rename.c +++ b/diffcore-rename.c @@ -384,10 +384,52 @@ static int find_basename_matches(struct diff_options *options, int minimum_score, int num_src) { - int i; + /* + * When I checked in early 2020, over 76% of file renames in linux + * just moved files to a different directory but kept the same + * basename. gcc did that with over 64% of renames, gecko did it + * with over 79%, and WebKit did it with over 89%. + * + * Therefore we can bypass the normal exhaustive NxM matrix + * comparison of similarities between all potential rename sources + * and destinations by instead using file basename as a hint (i.e. + * the portion of the filename after the last '/'), checking for + * similarity between files with the same basename, and if we find + * a pair that are sufficiently similar, record the rename pair and + * exclude those two from the NxM matrix. + * + * This *might* cause us to find a less than optimal pairing (if + * there is another file that we are even more similar to but has a + * different basename). Given the huge performance advantage + * basename matching provides, and given the frequency with which + * people use the same basename in real world projects, that's a + * trade-off we are willing to accept when doing just rename + * detection. + * + * If someone wants copy detection that implies they are willing to + * spend more cycles to find similarities between files, so it may + * be less likely that this heuristic is wanted. If someone is + * doing break detection, that means they do not want filename + * similarity to imply any form of content similiarity, and thus + * this heuristic would definitely be incompatible. + */ + + int i, renames = 0; struct strintmap sources; struct strintmap dests; + /* + * The prefeteching stuff wants to know if it can skip prefetching + * blobs that are unmodified...and will then do a little extra work + * to verify that the oids are indeed different before prefetching. + * Unmodified blobs are only relevant when doing copy detection; + * when limiting to rename detection, diffcore_rename[_extended]() + * will never be called with unmodified source paths fed to us, so + * the extra work necessary to check if rename_src entries are + * unmodified would be a small waste. + */ + int skip_unmodified = 0; + /* Create maps of basename -> fullname(s) for sources and dests */ strintmap_init_with_options(&sources, -1, NULL, 0); strintmap_init_with_options(&dests, -1, NULL, 0); @@ -420,12 +462,59 @@ static int find_basename_matches(struct diff_options *options, strintmap_set(&dests, base, i); } - /* TODO: Make use of basenames source and destination basenames */ + /* Now look for basename matchups and do similarity estimation */ + for (i = 0; i < num_src; ++i) { + char *filename = rename_src[i].p->one->path; + const char *base = NULL; + intptr_t src_index; + intptr_t dst_index; + + /* Find out if this basename is unique among sources */ + base = get_basename(filename); + src_index = strintmap_get(&sources, base); + if (src_index == -1) + continue; /* not a unique basename; skip it */ + assert(src_index == i); + + if (strintmap_contains(&dests, base)) { + struct diff_filespec *one, *two; + int score; + + /* Find out if this basename is unique among dests */ + dst_index = strintmap_get(&dests, base); + if (dst_index == -1) + continue; /* not a unique basename; skip it */ + + /* Ignore this dest if already used in a rename */ + if (rename_dst[dst_index].is_rename) + continue; /* already used previously */ + + /* Estimate the similarity */ + one = rename_src[src_index].p->one; + two = rename_dst[dst_index].p->two; + score = estimate_similarity(options->repo, one, two, + minimum_score, skip_unmodified); + + /* If sufficiently similar, record as rename pair */ + if (score < minimum_score) + continue; + record_rename_pair(dst_index, src_index, score); + renames++; + + /* + * Found a rename so don't need text anymore; if we + * didn't find a rename, the filespec_blob would get + * re-used when doing the matrix of comparisons. + */ + diff_free_filespec_blob(one); + diff_free_filespec_blob(two); + } + } strintmap_clear(&sources); strintmap_clear(&dests); - return 0; + return renames; } #define NUM_CANDIDATE_PER_DST 4