Message ID | 3e6af929d135ef2dc239e2f47f92a7e2e91cbd17.1612970140.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Optimization batch 7: use file basenames to guide rename detection | expand |
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Elijah Newren <newren@gmail.com> > > Add a simple test where a removed file is similar to two different added > files; one of them has the same basename, and the other has a slightly > higher content similarity. Without break detection, filename similarity > of 100% trumps content similarity for pairing up related files. For > any filename similarity less than 100%, the opposite is true -- content > similarity is all that matters. Add a testcase that documents this. I am not sure why it is the "opposite". When contents are similar to the same degree of 100%, we tiebreak with the filename. We never favor a pair between the same filename over a pair between different filenames with better content similarity. And when contents are similar to the same degree of less than 100%, we do not favor a pair between the same filename over a pair between different filenames, as long as they are similar to the same degree. So, I do not think "opposite" is helping readers to understand what is going on. > +test_expect_success 'basename similarity vs best similarity' ' > + mkdir subdir && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 line9 line10 >subdir/file.txt && > + git add subdir/file.txt && > + git commit -m "base txt" && > + > + git rm subdir/file.txt && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 >file.txt && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 line9 >file.md && > + git add file.txt file.md && > + git commit -a -m "rename" && > + git diff-tree -r -M --name-status HEAD^ HEAD >actual && > + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, > + # but since same basenames are checked first... I am not sure what the second line of this comment wants to imply with the ellipses here. Care to finish the sentence? Or was the second line planned to be added when we start applying the "check only the same filename first and see if we find a better-than-reasonable match" heuristics but somehow survived "rebase -i" and ended up here? > + cat >expected <<-\EOF && > + R088 subdir/file.txt file.md > + A file.txt > + EOF > + test_cmp expected actual Thanks.
On Fri, Feb 12, 2021 at 5:15 PM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > From: Elijah Newren <newren@gmail.com> > > > > Add a simple test where a removed file is similar to two different added > > files; one of them has the same basename, and the other has a slightly > > higher content similarity. Without break detection, filename similarity > > of 100% trumps content similarity for pairing up related files. For > > any filename similarity less than 100%, the opposite is true -- content > > similarity is all that matters. Add a testcase that documents this. > > I am not sure why it is the "opposite". When contents are similar > to the same degree of 100%, we tiebreak with the filename. We never > favor a pair between the same filename over a pair between different > filenames with better content similarity. This is not true. If src/main.c is 99% similar to src/foo.c, and is 0% similar to the src/main.c in the new commit, we match the old src/main.c to the new src/main.c despite being far more similar src/foo.c. Unless break detection is turned on, we do not allow content similarity to trump (full) filename equality. > And when contents are similar to the same degree of less than 100%, > we do not favor a pair between the same filename over a pair between > different filenames, as long as they are similar to the same degree. This is also not true; we tiebreak with filenames for inexact renames just like we do for exact renames (note that basename_same() is called both from find_identical_files() and from the nested loop where inexact rename detection is done). > So, I do not think "opposite" is helping readers to understand what > is going on. > > > +test_expect_success 'basename similarity vs best similarity' ' > > + mkdir subdir && > > + test_write_lines line1 line2 line3 line4 line5 \ > > + line6 line7 line8 line9 line10 >subdir/file.txt && > > + git add subdir/file.txt && > > + git commit -m "base txt" && > > + > > + git rm subdir/file.txt && > > + test_write_lines line1 line2 line3 line4 line5 \ > > + line6 line7 line8 >file.txt && > > + test_write_lines line1 line2 line3 line4 line5 \ > > + line6 line7 line8 line9 >file.md && > > + git add file.txt file.md && > > + git commit -a -m "rename" && > > + git diff-tree -r -M --name-status HEAD^ HEAD >actual && > > + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, > > + # but since same basenames are checked first... > > I am not sure what the second line of this comment wants to imply > with the ellipses here. Care to finish the sentence? > > Or was the second line planned to be added when we start applying > the "check only the same filename first and see if we find a > better-than-reasonable match" heuristics but somehow survived > "rebase -i" and ended up here? Oops, indeed; that is precisely what happened. Will fix. > > + cat >expected <<-\EOF && > > + R088 subdir/file.txt file.md > > + A file.txt > > + EOF > > + test_cmp expected actual > > Thanks.
Elijah Newren <newren@gmail.com> writes: > This is not true. If src/main.c is 99% similar to src/foo.c, and is > 0% similar to the src/main.c in the new commit, we match the old > src/main.c to the new src/main.c despite being far more similar > src/foo.c. Unless break detection is turned on, we do not allow > content similarity to trump (full) filename equality. Absolutely. And we are talking about a new optimization that kicks in only when there is no break or no copy detection going on, no?
On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > Elijah Newren <newren@gmail.com> writes: > > > This is not true. If src/main.c is 99% similar to src/foo.c, and is > > 0% similar to the src/main.c in the new commit, we match the old > > src/main.c to the new src/main.c despite being far more similar > > src/foo.c. Unless break detection is turned on, we do not allow > > content similarity to trump (full) filename equality. > > Absolutely. And we are talking about a new optimization that kicks > in only when there is no break or no copy detection going on, no? Yes, precisely, we are only considering cases without break detection...and thus we are considering cases where for the last 15 years or more, sufficiently large filename similarity (an exact fullname match) trumps any level of content similarity. I think it is useful to note that while my optimization is adding more considerations that can overrule maximal content similarity, it is not the first such code choice to do that. But let me back up a bit... When I submitted the series, you and Stolee went into a long discussion about an optimization that I didn't submit, one that feels looser on "matching" than anything I submitted, and which I think might counter-intuitively reduce performance rather than aid it. (The performance side only comes into view in combination with later series, but it was why I harped so much since then on only comparing against at most one other file in the steps before full inexact rename detection.) I was quite surprised by the diversion, but it made it clear to me that my descriptions and commit messages were far too vague and could be read to imply a completely different algorithm than I intended. So, I tried to be far more careful in subsequent iterations by adding wider context and contrasts. Further, after I wrote various things to try to clarify the misunderstandings, I noticed that Stolee picked out one thing and stated that "This idea of optimizing first for 100% filename similarity is a good perspective on Git's rename detection algorithm." (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/) So, that particular point seemed to help him understand more, and thus might be useful extra context for others reading along now or in the future. Given all the above, I was trying to address earlier misunderstandings and provide more context. Perhaps I swung the pendulum too far and talked too much about other cases, or perhaps I just worded things poorly again. All I was attempting to do in the commit message was point out the multiple basic rules with filename and content similarity, to lay the groundwork for new rules that do alternative weightings. Anyway, I've added a few more tweaks to try to improve the wording for the next round I'll submit today. Given my track record so far, it would not be surprising if it still needed more tweaks.
I do not consider "the same file changed in place" the same as "we seem to have lost a file in the old tree, ah, we found one that has the same basename in a different directory" at all, so your argument still does not make any sense to me, sorry. 2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>: > > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > Elijah Newren <newren@gmail.com> writes: > > > > > This is not true. If src/main.c is 99% similar to src/foo.c, and is > > > 0% similar to the src/main.c in the new commit, we match the old > > > src/main.c to the new src/main.c despite being far more similar > > > src/foo.c. Unless break detection is turned on, we do not allow > > > content similarity to trump (full) filename equality. > > > > Absolutely. And we are talking about a new optimization that kicks > > in only when there is no break or no copy detection going on, no? > > Yes, precisely, we are only considering cases without break > detection...and thus we are considering cases where for the last 15 > years or more, sufficiently large filename similarity (an exact > fullname match) trumps any level of content similarity. I think it is > useful to note that while my optimization is adding more > considerations that can overrule maximal content similarity, it is not > the first such code choice to do that. > > But let me back up a bit... > > When I submitted the series, you and Stolee went into a long > discussion about an optimization that I didn't submit, one that feels > looser on "matching" than anything I submitted, and which I think > might counter-intuitively reduce performance rather than aid it. (The > performance side only comes into view in combination with later > series, but it was why I harped so much since then on only comparing > against at most one other file in the steps before full inexact rename > detection.) I was quite surprised by the diversion, but it made it > clear to me that my descriptions and commit messages were far too > vague and could be read to imply a completely different algorithm than > I intended. So, I tried to be far more careful in subsequent > iterations by adding wider context and contrasts. > > Further, after I wrote various things to try to clarify the > misunderstandings, I noticed that Stolee picked out one thing and > stated that "This idea of optimizing first for 100% filename > similarity is a good perspective on Git's rename detection algorithm." > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/) > So, that particular point seemed to help him understand more, and > thus might be useful extra context for others reading along now or in > the future. > > Given all the above, I was trying to address earlier misunderstandings > and provide more context. Perhaps I swung the pendulum too far and > talked too much about other cases, or perhaps I just worded things > poorly again. All I was attempting to do in the commit message was > point out the multiple basic rules with filename and content > similarity, to lay the groundwork for new rules that do alternative > weightings. > > Anyway, I've added a few more tweaks to try to improve the wording for > the next round I'll submit today. Given my track record so far, it > would not be surprising if it still needed more tweaks.
On Sat, Feb 13, 2021 at 5:32 PM Junio C Hamano <gitster@pobox.com> wrote: > > I do not consider "the same file changed in place" the same as "we > seem to have lost a file in the old tree, ah, we found one that has > the same basename in a different directory" at all, so your argument > still does not make any sense to me, sorry. I'm not set on the commit message wording, you asked why I had used the terms I did and I tried to explain. I also explained how the wording seemed to have helped Stolee understand. If you'd like to suggest an alternative commit message, I'm happy to take it. > 2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>: > > > > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > > > Elijah Newren <newren@gmail.com> writes: > > > > > > > This is not true. If src/main.c is 99% similar to src/foo.c, and is > > > > 0% similar to the src/main.c in the new commit, we match the old > > > > src/main.c to the new src/main.c despite being far more similar > > > > src/foo.c. Unless break detection is turned on, we do not allow > > > > content similarity to trump (full) filename equality. > > > > > > Absolutely. And we are talking about a new optimization that kicks > > > in only when there is no break or no copy detection going on, no? > > > > Yes, precisely, we are only considering cases without break > > detection...and thus we are considering cases where for the last 15 > > years or more, sufficiently large filename similarity (an exact > > fullname match) trumps any level of content similarity. I think it is > > useful to note that while my optimization is adding more > > considerations that can overrule maximal content similarity, it is not > > the first such code choice to do that. > > > > But let me back up a bit... > > > > When I submitted the series, you and Stolee went into a long > > discussion about an optimization that I didn't submit, one that feels > > looser on "matching" than anything I submitted, and which I think > > might counter-intuitively reduce performance rather than aid it. (The > > performance side only comes into view in combination with later > > series, but it was why I harped so much since then on only comparing > > against at most one other file in the steps before full inexact rename > > detection.) I was quite surprised by the diversion, but it made it > > clear to me that my descriptions and commit messages were far too > > vague and could be read to imply a completely different algorithm than > > I intended. So, I tried to be far more careful in subsequent > > iterations by adding wider context and contrasts. > > > > Further, after I wrote various things to try to clarify the > > misunderstandings, I noticed that Stolee picked out one thing and > > stated that "This idea of optimizing first for 100% filename > > similarity is a good perspective on Git's rename detection algorithm." > > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/) > > So, that particular point seemed to help him understand more, and > > thus might be useful extra context for others reading along now or in > > the future. > > > > Given all the above, I was trying to address earlier misunderstandings > > and provide more context. Perhaps I swung the pendulum too far and > > talked too much about other cases, or perhaps I just worded things > > poorly again. All I was attempting to do in the commit message was > > point out the multiple basic rules with filename and content > > similarity, to lay the groundwork for new rules that do alternative > > weightings. > > > > Anyway, I've added a few more tweaks to try to improve the wording for > > the next round I'll submit today. Given my track record so far, it > > would not be surprising if it still needed more tweaks.
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index c16486a9d41a..797343b38106 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' ' grep "myotherfile.*myfile" actual ' +test_expect_success 'basename similarity vs best similarity' ' + mkdir subdir && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 line10 >subdir/file.txt && + git add subdir/file.txt && + git commit -m "base txt" && + + git rm subdir/file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 >file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 >file.md && + git add file.txt file.md && + git commit -a -m "rename" && + git diff-tree -r -M --name-status HEAD^ HEAD >actual && + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, + # but since same basenames are checked first... + cat >expected <<-\EOF && + R088 subdir/file.txt file.md + A file.txt + EOF + test_cmp expected actual +' + test_done