diff mbox series

[v3,1/5] t4001: add a test comparing basename similarity and content similarity

Message ID 3e6af929d135ef2dc239e2f47f92a7e2e91cbd17.1612970140.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optimization batch 7: use file basenames to guide rename detection | expand

Commit Message

Elijah Newren Feb. 10, 2021, 3:15 p.m. UTC
From: Elijah Newren <newren@gmail.com>

Add a simple test where a removed file is similar to two different added
files; one of them has the same basename, and the other has a slightly
higher content similarity.  Without break detection, filename similarity
of 100% trumps content similarity for pairing up related files.  For
any filename similarity less than 100%, the opposite is true -- content
similarity is all that matters.  Add a testcase that documents this.

Subsequent commits will add a new rule that includes an inbetween state,
where a mixture of filename similarity and content similarity are
weighed, and which will change the outcome of this testcase.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t4001-diff-rename.sh | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

Comments

Junio C Hamano Feb. 13, 2021, 1:15 a.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> Add a simple test where a removed file is similar to two different added
> files; one of them has the same basename, and the other has a slightly
> higher content similarity.  Without break detection, filename similarity
> of 100% trumps content similarity for pairing up related files.  For
> any filename similarity less than 100%, the opposite is true -- content
> similarity is all that matters.  Add a testcase that documents this.

I am not sure why it is the "opposite".  When contents are similar
to the same degree of 100%, we tiebreak with the filename.  We never
favor a pair between the same filename over a pair between different
filenames with better content similarity.

And when contents are similar to the same degree of less than 100%,
we do not favor a pair between the same filename over a pair between
different filenames, as long as they are similar to the same degree.

So, I do not think "opposite" is helping readers to understand what
is going on.

> +test_expect_success 'basename similarity vs best similarity' '
> +	mkdir subdir &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			 line6 line7 line8 line9 line10 >subdir/file.txt &&
> +	git add subdir/file.txt &&
> +	git commit -m "base txt" &&
> +
> +	git rm subdir/file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 >file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 line9 >file.md &&
> +	git add file.txt file.md &&
> +	git commit -a -m "rename" &&
> +	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> +	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> +	# but since same basenames are checked first...

I am not sure what the second line of this comment wants to imply
with the ellipses here.  Care to finish the sentence?

Or was the second line planned to be added when we start applying
the "check only the same filename first and see if we find a
better-than-reasonable match" heuristics but somehow survived
"rebase -i" and ended up here?

> +	cat >expected <<-\EOF &&
> +	R088	subdir/file.txt	file.md
> +	A	file.txt
> +	EOF
> +	test_cmp expected actual

Thanks.
Elijah Newren Feb. 13, 2021, 4:50 a.m. UTC | #2
On Fri, Feb 12, 2021 at 5:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > Add a simple test where a removed file is similar to two different added
> > files; one of them has the same basename, and the other has a slightly
> > higher content similarity.  Without break detection, filename similarity
> > of 100% trumps content similarity for pairing up related files.  For
> > any filename similarity less than 100%, the opposite is true -- content
> > similarity is all that matters.  Add a testcase that documents this.
>
> I am not sure why it is the "opposite".  When contents are similar
> to the same degree of 100%, we tiebreak with the filename.  We never
> favor a pair between the same filename over a pair between different
> filenames with better content similarity.

This is not true.  If src/main.c is 99% similar to src/foo.c, and is
0% similar to the src/main.c in the new commit, we match the old
src/main.c to the new src/main.c despite being far more similar
src/foo.c.  Unless break detection is turned on, we do not allow
content similarity to trump (full) filename equality.

> And when contents are similar to the same degree of less than 100%,
> we do not favor a pair between the same filename over a pair between
> different filenames, as long as they are similar to the same degree.

This is also not true; we tiebreak with filenames for inexact renames
just like we do for exact renames (note that basename_same() is called
both from find_identical_files() and from the nested loop where
inexact rename detection is done).

> So, I do not think "opposite" is helping readers to understand what
> is going on.
>
> > +test_expect_success 'basename similarity vs best similarity' '
> > +     mkdir subdir &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                      line6 line7 line8 line9 line10 >subdir/file.txt &&
> > +     git add subdir/file.txt &&
> > +     git commit -m "base txt" &&
> > +
> > +     git rm subdir/file.txt &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                       line6 line7 line8 >file.txt &&
> > +     test_write_lines line1 line2 line3 line4 line5 \
> > +                       line6 line7 line8 line9 >file.md &&
> > +     git add file.txt file.md &&
> > +     git commit -a -m "rename" &&
> > +     git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> > +     # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> > +     # but since same basenames are checked first...
>
> I am not sure what the second line of this comment wants to imply
> with the ellipses here.  Care to finish the sentence?
>
> Or was the second line planned to be added when we start applying
> the "check only the same filename first and see if we find a
> better-than-reasonable match" heuristics but somehow survived
> "rebase -i" and ended up here?

Oops, indeed; that is precisely what happened.  Will fix.

> > +     cat >expected <<-\EOF &&
> > +     R088    subdir/file.txt file.md
> > +     A       file.txt
> > +     EOF
> > +     test_cmp expected actual
>
> Thanks.
Junio C Hamano Feb. 13, 2021, 11:56 p.m. UTC | #3
Elijah Newren <newren@gmail.com> writes:

> This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> 0% similar to the src/main.c in the new commit, we match the old
> src/main.c to the new src/main.c despite being far more similar
> src/foo.c.  Unless break detection is turned on, we do not allow
> content similarity to trump (full) filename equality.

Absolutely.  And we are talking about a new optimization that kicks
in only when there is no break or no copy detection going on, no?
Elijah Newren Feb. 14, 2021, 1:24 a.m. UTC | #4
On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > 0% similar to the src/main.c in the new commit, we match the old
> > src/main.c to the new src/main.c despite being far more similar
> > src/foo.c.  Unless break detection is turned on, we do not allow
> > content similarity to trump (full) filename equality.
>
> Absolutely.  And we are talking about a new optimization that kicks
> in only when there is no break or no copy detection going on, no?

Yes, precisely, we are only considering cases without break
detection...and thus we are considering cases where for the last 15
years or more, sufficiently large filename similarity (an exact
fullname match) trumps any level of content similarity.  I think it is
useful to note that while my optimization is adding more
considerations that can overrule maximal content similarity, it is not
the first such code choice to do that.

But let me back up a bit...

When I submitted the series, you and Stolee went into a long
discussion about an optimization that I didn't submit, one that feels
looser on "matching" than anything I submitted, and which I think
might counter-intuitively reduce performance rather than aid it.  (The
performance side only comes into view in combination with later
series, but it was why I harped so much since then on only comparing
against at most one other file in the steps before full inexact rename
detection.)  I was quite surprised by the diversion, but it made it
clear to me that my descriptions and commit messages were far too
vague and could be read to imply a completely different algorithm than
I intended.  So, I tried to be far more careful in subsequent
iterations by adding wider context and contrasts.

Further, after I wrote various things to try to clarify the
misunderstandings, I noticed that Stolee picked out one thing and
stated that "This idea of optimizing first for 100% filename
similarity is a good perspective on Git's rename detection algorithm."
(see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
 So, that particular point seemed to help him understand more, and
thus might be useful extra context for others reading along now or in
the future.

Given all the above, I was trying to address earlier misunderstandings
and provide more context.  Perhaps I swung the pendulum too far and
talked too much about other cases, or perhaps I just worded things
poorly again.  All I was attempting to do in the commit message was
point out the multiple basic rules with filename and content
similarity, to lay the groundwork for new rules that do alternative
weightings.

Anyway, I've added a few more tweaks to try to improve the wording for
the next round I'll submit today.  Given my track record so far, it
would not be surprising if it still needed more tweaks.
Junio C Hamano Feb. 14, 2021, 1:32 a.m. UTC | #5
I do not consider "the same file changed in place" the same as "we
seem to have lost a file in the old tree, ah, we found one that has
the same basename in a different directory" at all, so your argument
still does not make any sense to me, sorry.

2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>:
>
> On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > 0% similar to the src/main.c in the new commit, we match the old
> > > src/main.c to the new src/main.c despite being far more similar
> > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > content similarity to trump (full) filename equality.
> >
> > Absolutely.  And we are talking about a new optimization that kicks
> > in only when there is no break or no copy detection going on, no?
>
> Yes, precisely, we are only considering cases without break
> detection...and thus we are considering cases where for the last 15
> years or more, sufficiently large filename similarity (an exact
> fullname match) trumps any level of content similarity.  I think it is
> useful to note that while my optimization is adding more
> considerations that can overrule maximal content similarity, it is not
> the first such code choice to do that.
>
> But let me back up a bit...
>
> When I submitted the series, you and Stolee went into a long
> discussion about an optimization that I didn't submit, one that feels
> looser on "matching" than anything I submitted, and which I think
> might counter-intuitively reduce performance rather than aid it.  (The
> performance side only comes into view in combination with later
> series, but it was why I harped so much since then on only comparing
> against at most one other file in the steps before full inexact rename
> detection.)  I was quite surprised by the diversion, but it made it
> clear to me that my descriptions and commit messages were far too
> vague and could be read to imply a completely different algorithm than
> I intended.  So, I tried to be far more careful in subsequent
> iterations by adding wider context and contrasts.
>
> Further, after I wrote various things to try to clarify the
> misunderstandings, I noticed that Stolee picked out one thing and
> stated that "This idea of optimizing first for 100% filename
> similarity is a good perspective on Git's rename detection algorithm."
> (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
>  So, that particular point seemed to help him understand more, and
> thus might be useful extra context for others reading along now or in
> the future.
>
> Given all the above, I was trying to address earlier misunderstandings
> and provide more context.  Perhaps I swung the pendulum too far and
> talked too much about other cases, or perhaps I just worded things
> poorly again.  All I was attempting to do in the commit message was
> point out the multiple basic rules with filename and content
> similarity, to lay the groundwork for new rules that do alternative
> weightings.
>
> Anyway, I've added a few more tweaks to try to improve the wording for
> the next round I'll submit today.  Given my track record so far, it
> would not be surprising if it still needed more tweaks.
Elijah Newren Feb. 14, 2021, 3:14 a.m. UTC | #6
On Sat, Feb 13, 2021 at 5:32 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> I do not consider "the same file changed in place" the same as "we
> seem to have lost a file in the old tree, ah, we found one that has
> the same basename in a different directory" at all, so your argument
> still does not make any sense to me, sorry.

I'm not set on the commit message wording, you asked why I had used
the terms I did and I tried to explain.  I also explained how the
wording seemed to have helped Stolee understand.

If you'd like to suggest an alternative commit message, I'm happy to take it.

> 2021年2月13日(土) 17:25 Elijah Newren <newren@gmail.com>:
> >
> > On Sat, Feb 13, 2021 at 3:56 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Elijah Newren <newren@gmail.com> writes:
> > >
> > > > This is not true.  If src/main.c is 99% similar to src/foo.c, and is
> > > > 0% similar to the src/main.c in the new commit, we match the old
> > > > src/main.c to the new src/main.c despite being far more similar
> > > > src/foo.c.  Unless break detection is turned on, we do not allow
> > > > content similarity to trump (full) filename equality.
> > >
> > > Absolutely.  And we are talking about a new optimization that kicks
> > > in only when there is no break or no copy detection going on, no?
> >
> > Yes, precisely, we are only considering cases without break
> > detection...and thus we are considering cases where for the last 15
> > years or more, sufficiently large filename similarity (an exact
> > fullname match) trumps any level of content similarity.  I think it is
> > useful to note that while my optimization is adding more
> > considerations that can overrule maximal content similarity, it is not
> > the first such code choice to do that.
> >
> > But let me back up a bit...
> >
> > When I submitted the series, you and Stolee went into a long
> > discussion about an optimization that I didn't submit, one that feels
> > looser on "matching" than anything I submitted, and which I think
> > might counter-intuitively reduce performance rather than aid it.  (The
> > performance side only comes into view in combination with later
> > series, but it was why I harped so much since then on only comparing
> > against at most one other file in the steps before full inexact rename
> > detection.)  I was quite surprised by the diversion, but it made it
> > clear to me that my descriptions and commit messages were far too
> > vague and could be read to imply a completely different algorithm than
> > I intended.  So, I tried to be far more careful in subsequent
> > iterations by adding wider context and contrasts.
> >
> > Further, after I wrote various things to try to clarify the
> > misunderstandings, I noticed that Stolee picked out one thing and
> > stated that "This idea of optimizing first for 100% filename
> > similarity is a good perspective on Git's rename detection algorithm."
> > (see https://lore.kernel.org/git/57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com/)
> >  So, that particular point seemed to help him understand more, and
> > thus might be useful extra context for others reading along now or in
> > the future.
> >
> > Given all the above, I was trying to address earlier misunderstandings
> > and provide more context.  Perhaps I swung the pendulum too far and
> > talked too much about other cases, or perhaps I just worded things
> > poorly again.  All I was attempting to do in the commit message was
> > point out the multiple basic rules with filename and content
> > similarity, to lay the groundwork for new rules that do alternative
> > weightings.
> >
> > Anyway, I've added a few more tweaks to try to improve the wording for
> > the next round I'll submit today.  Given my track record so far, it
> > would not be surprising if it still needed more tweaks.
diff mbox series

Patch

diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41a..797343b38106 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,28 @@  test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'basename similarity vs best similarity' '
+	mkdir subdir &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >subdir/file.txt &&
+	git add subdir/file.txt &&
+	git commit -m "base txt" &&
+
+	git rm subdir/file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >file.md &&
+	git add file.txt file.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
+	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
+	# but since same basenames are checked first...
+	cat >expected <<-\EOF &&
+	R088	subdir/file.txt	file.md
+	A	file.txt
+	EOF
+	test_cmp expected actual
+'
+
 test_done