Message ID | a0e75d8cd6bd32fb1ab2a209bc2079c30995b257.1612870326.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Optimization batch 7: use file basenames to guide rename detection | expand |
On 2/9/2021 6:32 AM, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@gmail.com> > > The last few patches have introduced a new preliminary step when rename > detection is on but both break detection and copy detection are off. > Document this new step. While we're at it, add a testcase that checks > the new behavior as well. Thanks for adding this documentation and test. > +Note that when rename detection is on but both copy and break > +detection are off, rename detection adds a preliminary step that first > +checks files with the same basename. If files with the same basename I find myself wanting a definition of 'basename' here, but perhaps I'm just being pedantic. A quick search clarifies this as a standard term [1] of which I was just ignorant. [1] https://man7.org/linux/man-pages/man3/basename.3.html > +are sufficiently similar, it will mark them as renames and exclude > +them from the later quadratic step (the one that pairwise compares all > +unmatched files to find the "best" matches, determined by the highest > +content similarity). So, for example, if docs/extensions.txt and > +docs/config/extensions.txt have similar content, then they will be > +marked as a rename even if it turns out that docs/extensions.txt was > +more similar to src/extension-checks.c. At most, one comparison is > +done per file in this preliminary pass; so if there are several > +extensions.txt files throughout the directory hierarchy that were > +added and deleted, this preliminary step will be skipped for those > +files. > +test_expect_success 'basename similarity vs best similarity' ' > + mkdir subdir && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 line9 line10 >subdir/file.txt && > + git add subdir/file.txt && > + git commit -m "base txt" && > + > + git rm subdir/file.txt && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 >file.txt && > + test_write_lines line1 line2 line3 line4 line5 \ > + line6 line7 line8 line9 >file.md && > + git add file.txt file.md && > + git commit -a -m "rename" && > + git diff-tree -r -M --name-status HEAD^ HEAD >actual && > + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, > + # but since same basenames are checked first... > + cat >expected <<-\EOF && > + A file.md > + R078 subdir/file.txt file.txt > + EOF > + test_cmp expected actual > +' > + I appreciate the additional comments in this test to make it clear what you are testing. A minor nit is that the test could have been added at the start of the series to document the _old_ behavior. The 'expected' file would have this content: + cat >expected <<-\EOF && + A file.txt + R078 subdir/file.txt file.md + EOF Then, this test case would change the expected output in the same patch that introduces the behavior change. Thanks, -Stolee
Derrick Stolee <stolee@gmail.com> writes: >> +Note that when rename detection is on but both copy and break >> +detection are off, rename detection adds a preliminary step that first >> +checks files with the same basename. If files with the same basename > > I find myself wanting a definition of 'basename' here, but perhaps I'm > just being pedantic. A quick search clarifies this as a standard term [1] > of which I was just ignorant. > > [1] https://man7.org/linux/man-pages/man3/basename.3.html > >> +are sufficiently similar, it will mark them as renames and exclude >> +them from the later quadratic step (the one that pairwise compares all >> +unmatched files to find the "best" matches, determined by the highest >> +content similarity). While I do not think `basename` is unacceptably bad, we should aim to do better. For "direc/tory/hello.txt", both "hello.txt" or "hello" are what would come up to people's mind with the technical term "basename" (i.e. basename as opposed to dirname, vs basename as opposed to filename with .extension). Avoiding this ambiguity and using a word understandable by those not versed well with UNIX/POSIX lingo may be done at the same time, hopefully. For example, can we frame the description around this key sentence: The heuristics is based on an observation that a file is often moved across directories while keeping its filename the same. The term "filename" alone can be ambiguous (i.e. both "hello.txt" and "direc/tory/hello.txt" are valid interpretations in the earlier example), but in the context of a sentence that talks about "moved across directories", the former would become the only valid one. We can even say just "name" and there is no ambiguity in the above "key sentence". Then keeping that in mind, we can rewrite the above you quoted like so without going technical and without risking ambiguity, like this: ... a preliminary step that checks if files are moved across directories while keeping their filenames the same. If there is a file added to a directory whose contents is sufficiently similar to a file with the same name that got deleted from a different directory, ...
On Tue, Feb 9, 2021 at 9:03 AM Junio C Hamano <gitster@pobox.com> wrote: > > Derrick Stolee <stolee@gmail.com> writes: > > >> +Note that when rename detection is on but both copy and break > >> +detection are off, rename detection adds a preliminary step that first > >> +checks files with the same basename. If files with the same basename > > > > I find myself wanting a definition of 'basename' here, but perhaps I'm > > just being pedantic. A quick search clarifies this as a standard term [1] > > of which I was just ignorant. > > > > [1] https://man7.org/linux/man-pages/man3/basename.3.html > > > >> +are sufficiently similar, it will mark them as renames and exclude > >> +them from the later quadratic step (the one that pairwise compares all > >> +unmatched files to find the "best" matches, determined by the highest > >> +content similarity). > > While I do not think `basename` is unacceptably bad, we should aim > to do better. For "direc/tory/hello.txt", both "hello.txt" or > "hello" are what would come up to people's mind with the technical > term "basename" (i.e. basename as opposed to dirname, vs basename as > opposed to filename with .extension). > > Avoiding this ambiguity and using a word understandable by those not > versed well with UNIX/POSIX lingo may be done at the same time, > hopefully. > > For example, can we frame the description around this key sentence: > > The heuristics is based on an observation that a file is often > moved across directories while keeping its filename the same. > > The term "filename" alone can be ambiguous (i.e. both "hello.txt" > and "direc/tory/hello.txt" are valid interpretations in the earlier > example), but in the context of a sentence that talks about "moved > across directories", the former would become the only valid one. We > can even say just "name" and there is no ambiguity in the above "key > sentence". > > Then keeping that in mind, we can rewrite the above you quoted like > so without going technical and without risking ambiguity, like this: > > ... a preliminary step that checks if files are moved across > directories while keeping their filenames the same. If there is > a file added to a directory whose contents is sufficiently > similar to a file with the same name that got deleted from a > different directory, ... Nice, I like it!
diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index c970d9fe438a..954ae3ef1082 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -168,6 +168,21 @@ a similarity score different from the default of 50% by giving a number after the "-M" or "-C" option (e.g. "-M8" to tell it to use 8/10 = 80%). +Note that when rename detection is on but both copy and break +detection are off, rename detection adds a preliminary step that first +checks files with the same basename. If files with the same basename +are sufficiently similar, it will mark them as renames and exclude +them from the later quadratic step (the one that pairwise compares all +unmatched files to find the "best" matches, determined by the highest +content similarity). So, for example, if docs/extensions.txt and +docs/config/extensions.txt have similar content, then they will be +marked as a rename even if it turns out that docs/extensions.txt was +more similar to src/extension-checks.c. At most, one comparison is +done per file in this preliminary pass; so if there are several +extensions.txt files throughout the directory hierarchy that were +added and deleted, this preliminary step will be skipped for those +files. + Note. When the "-C" option is used with `--find-copies-harder` option, 'git diff-{asterisk}' commands feed unmodified filepairs to diffcore mechanism as well as modified ones. This lets the copy diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh index c16486a9d41a..bf62537c29a0 100755 --- a/t/t4001-diff-rename.sh +++ b/t/t4001-diff-rename.sh @@ -262,4 +262,28 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' ' grep "myotherfile.*myfile" actual ' +test_expect_success 'basename similarity vs best similarity' ' + mkdir subdir && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 line10 >subdir/file.txt && + git add subdir/file.txt && + git commit -m "base txt" && + + git rm subdir/file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 >file.txt && + test_write_lines line1 line2 line3 line4 line5 \ + line6 line7 line8 line9 >file.md && + git add file.txt file.md && + git commit -a -m "rename" && + git diff-tree -r -M --name-status HEAD^ HEAD >actual && + # subdir/file.txt is 89% similar to file.md, 78% similar to file.txt, + # but since same basenames are checked first... + cat >expected <<-\EOF && + A file.md + R078 subdir/file.txt file.txt + EOF + test_cmp expected actual +' + test_done