diff mbox series

[v2,4/4] gitdiffcore doc: mention new preliminary step for rename detection

Message ID a0e75d8cd6bd32fb1ab2a209bc2079c30995b257.1612870326.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optimization batch 7: use file basenames to guide rename detection | expand

Commit Message

Elijah Newren Feb. 9, 2021, 11:32 a.m. UTC
From: Elijah Newren <newren@gmail.com>

The last few patches have introduced a new preliminary step when rename
detection is on but both break detection and copy detection are off.
Document this new step.  While we're at it, add a testcase that checks
the new behavior as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt | 15 +++++++++++++++
 t/t4001-diff-rename.sh        | 24 ++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

Comments

Derrick Stolee Feb. 9, 2021, 12:59 p.m. UTC | #1
On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> The last few patches have introduced a new preliminary step when rename
> detection is on but both break detection and copy detection are off.
> Document this new step.  While we're at it, add a testcase that checks
> the new behavior as well.

Thanks for adding this documentation and test.

> +Note that when rename detection is on but both copy and break
> +detection are off, rename detection adds a preliminary step that first
> +checks files with the same basename.  If files with the same basename

I find myself wanting a definition of 'basename' here, but perhaps I'm
just being pedantic. A quick search clarifies this as a standard term [1]
of which I was just ignorant.

[1] https://man7.org/linux/man-pages/man3/basename.3.html

> +are sufficiently similar, it will mark them as renames and exclude
> +them from the later quadratic step (the one that pairwise compares all
> +unmatched files to find the "best" matches, determined by the highest
> +content similarity).  So, for example, if docs/extensions.txt and
> +docs/config/extensions.txt have similar content, then they will be
> +marked as a rename even if it turns out that docs/extensions.txt was
> +more similar to src/extension-checks.c.  At most, one comparison is
> +done per file in this preliminary pass; so if there are several
> +extensions.txt files throughout the directory hierarchy that were
> +added and deleted, this preliminary step will be skipped for those
> +files.

> +test_expect_success 'basename similarity vs best similarity' '
> +	mkdir subdir &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			 line6 line7 line8 line9 line10 >subdir/file.txt &&
> +	git add subdir/file.txt &&
> +	git commit -m "base txt" &&
> +
> +	git rm subdir/file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 >file.txt &&
> +	test_write_lines line1 line2 line3 line4 line5 \
> +			  line6 line7 line8 line9 >file.md &&
> +	git add file.txt file.md &&
> +	git commit -a -m "rename" &&
> +	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
> +	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
> +	# but since same basenames are checked first...
> +	cat >expected <<-\EOF &&
> +	A	file.md
> +	R078	subdir/file.txt	file.txt
> +	EOF
> +	test_cmp expected actual
> +'
> +

I appreciate the additional comments in this test to make it clear
what you are testing. A minor nit is that the test could have been
added at the start of the series to document the _old_ behavior.
The 'expected' file would have this content:

+	cat >expected <<-\EOF &&
+	A	file.txt
+	R078	subdir/file.txt	file.md
+	EOF

Then, this test case would change the expected output in the same
patch that introduces the behavior change.

Thanks,
-Stolee
Junio C Hamano Feb. 9, 2021, 5:03 p.m. UTC | #2
Derrick Stolee <stolee@gmail.com> writes:

>> +Note that when rename detection is on but both copy and break
>> +detection are off, rename detection adds a preliminary step that first
>> +checks files with the same basename.  If files with the same basename
>
> I find myself wanting a definition of 'basename' here, but perhaps I'm
> just being pedantic. A quick search clarifies this as a standard term [1]
> of which I was just ignorant.
>
> [1] https://man7.org/linux/man-pages/man3/basename.3.html
>
>> +are sufficiently similar, it will mark them as renames and exclude
>> +them from the later quadratic step (the one that pairwise compares all
>> +unmatched files to find the "best" matches, determined by the highest
>> +content similarity).

While I do not think `basename` is unacceptably bad, we should aim
to do better.  For "direc/tory/hello.txt", both "hello.txt" or
"hello" are what would come up to people's mind with the technical
term "basename" (i.e. basename as opposed to dirname, vs basename as
opposed to filename with .extension).

Avoiding this ambiguity and using a word understandable by those not
versed well with UNIX/POSIX lingo may be done at the same time,
hopefully.

For example, can we frame the description around this key sentence:

    The heuristics is based on an observation that a file is often
    moved across directories while keeping its filename the same.

The term "filename" alone can be ambiguous (i.e. both "hello.txt"
and "direc/tory/hello.txt" are valid interpretations in the earlier
example), but in the context of a sentence that talks about "moved
across directories", the former would become the only valid one.  We
can even say just "name" and there is no ambiguity in the above "key
sentence".

Then keeping that in mind, we can rewrite the above you quoted like
so without going technical and without risking ambiguity, like this:

    ... a preliminary step that checks if files are moved across
    directories while keeping their filenames the same.  If there is
    a file added to a directory whose contents is sufficiently
    similar to a file with the same name that got deleted from a
    different directory, ...
Elijah Newren Feb. 9, 2021, 5:44 p.m. UTC | #3
On Tue, Feb 9, 2021 at 9:03 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Derrick Stolee <stolee@gmail.com> writes:
>
> >> +Note that when rename detection is on but both copy and break
> >> +detection are off, rename detection adds a preliminary step that first
> >> +checks files with the same basename.  If files with the same basename
> >
> > I find myself wanting a definition of 'basename' here, but perhaps I'm
> > just being pedantic. A quick search clarifies this as a standard term [1]
> > of which I was just ignorant.
> >
> > [1] https://man7.org/linux/man-pages/man3/basename.3.html
> >
> >> +are sufficiently similar, it will mark them as renames and exclude
> >> +them from the later quadratic step (the one that pairwise compares all
> >> +unmatched files to find the "best" matches, determined by the highest
> >> +content similarity).
>
> While I do not think `basename` is unacceptably bad, we should aim
> to do better.  For "direc/tory/hello.txt", both "hello.txt" or
> "hello" are what would come up to people's mind with the technical
> term "basename" (i.e. basename as opposed to dirname, vs basename as
> opposed to filename with .extension).
>
> Avoiding this ambiguity and using a word understandable by those not
> versed well with UNIX/POSIX lingo may be done at the same time,
> hopefully.
>
> For example, can we frame the description around this key sentence:
>
>     The heuristics is based on an observation that a file is often
>     moved across directories while keeping its filename the same.
>
> The term "filename" alone can be ambiguous (i.e. both "hello.txt"
> and "direc/tory/hello.txt" are valid interpretations in the earlier
> example), but in the context of a sentence that talks about "moved
> across directories", the former would become the only valid one.  We
> can even say just "name" and there is no ambiguity in the above "key
> sentence".
>
> Then keeping that in mind, we can rewrite the above you quoted like
> so without going technical and without risking ambiguity, like this:
>
>     ... a preliminary step that checks if files are moved across
>     directories while keeping their filenames the same.  If there is
>     a file added to a directory whose contents is sufficiently
>     similar to a file with the same name that got deleted from a
>     different directory, ...

Nice, I like it!
diff mbox series

Patch

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c970d9fe438a..954ae3ef1082 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -168,6 +168,21 @@  a similarity score different from the default of 50% by giving a
 number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
 8/10 = 80%).
 
+Note that when rename detection is on but both copy and break
+detection are off, rename detection adds a preliminary step that first
+checks files with the same basename.  If files with the same basename
+are sufficiently similar, it will mark them as renames and exclude
+them from the later quadratic step (the one that pairwise compares all
+unmatched files to find the "best" matches, determined by the highest
+content similarity).  So, for example, if docs/extensions.txt and
+docs/config/extensions.txt have similar content, then they will be
+marked as a rename even if it turns out that docs/extensions.txt was
+more similar to src/extension-checks.c.  At most, one comparison is
+done per file in this preliminary pass; so if there are several
+extensions.txt files throughout the directory hierarchy that were
+added and deleted, this preliminary step will be skipped for those
+files.
+
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 diffcore mechanism as well as modified ones.  This lets the copy
diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
index c16486a9d41a..bf62537c29a0 100755
--- a/t/t4001-diff-rename.sh
+++ b/t/t4001-diff-rename.sh
@@ -262,4 +262,28 @@  test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
 	grep "myotherfile.*myfile" actual
 '
 
+test_expect_success 'basename similarity vs best similarity' '
+	mkdir subdir &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			 line6 line7 line8 line9 line10 >subdir/file.txt &&
+	git add subdir/file.txt &&
+	git commit -m "base txt" &&
+
+	git rm subdir/file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 >file.txt &&
+	test_write_lines line1 line2 line3 line4 line5 \
+			  line6 line7 line8 line9 >file.md &&
+	git add file.txt file.md &&
+	git commit -a -m "rename" &&
+	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
+	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
+	# but since same basenames are checked first...
+	cat >expected <<-\EOF &&
+	A	file.md
+	R078	subdir/file.txt	file.txt
+	EOF
+	test_cmp expected actual
+'
+
 test_done