Message ID | fc72d24a3358b1c5cc2753b5f07ac60174e6452b.1612970140.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Optimization batch 7: use file basenames to guide rename detection | expand |
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Elijah Newren <newren@gmail.com> > > The last few patches have introduced a new preliminary step when rename > detection is on but both break detection and copy detection are off. > Document this new step. While we're at it, add a testcase that checks > the new behavior as well. > > Signed-off-by: Elijah Newren <newren@gmail.com> > --- > Documentation/gitdiffcore.txt | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt > index c970d9fe438a..36ebe364d874 100644 > --- a/Documentation/gitdiffcore.txt > +++ b/Documentation/gitdiffcore.txt > @@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a > number after the "-M" or "-C" option (e.g. "-M8" to tell it to use > 8/10 = 80%). > > +Note that when rename detection is on but both copy and break > +detection are off, rename detection adds a preliminary step that first > +checks if files are moved across directories while keeping their > +filename the same. If there is a file added to a directory whose > +contents is sufficiently similar to a file with the same name that got > +deleted from a different directory, it will mark them as renames and > +exclude them from the later quadratic step (the one that pairwise > +compares all unmatched files to find the "best" matches, determined by > +the highest content similarity). So, for example, if > +docs/extensions.txt and docs/config/extensions.txt have similar > +content, then they will be marked as a rename even if it turns out > +that docs/extensions.txt was more similar to src/extension-checks.c. I'd rather use docs/extensions.md instead of src/extension-checks.c; it would be more realistic for .md to be similar to .txt than .c. With a raised bar for this step, the equation changes a bit, no? So, for example, if a deleted docs/ext.txt and an added docs/config/ext.txt are similar enough, they will be marked as a rename and prevent an added docs/ext.md that may be even similar to the deleted docs/ext.txt from being considered as the rename destination in the later step. For this reason, the preliminary "match same filename" step uses a bit higher threshold to mark a file pair as a rename and stop considering other candidates for better matches. or something? > +At most, one comparison is done per file in this preliminary pass; so > +if there are several extensions.txt files throughout the directory > +hierarchy that were added and deleted, this preliminary step will be > +skipped for those files. Other than that, the whole series looked sensible to my cursory read. Thanks.
On Wed, Feb 10, 2021 at 8:41 AM Junio C Hamano <gitster@pobox.com> wrote: > > "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > From: Elijah Newren <newren@gmail.com> > > > > The last few patches have introduced a new preliminary step when rename > > detection is on but both break detection and copy detection are off. > > Document this new step. While we're at it, add a testcase that checks > > the new behavior as well. > > > > Signed-off-by: Elijah Newren <newren@gmail.com> > > --- > > Documentation/gitdiffcore.txt | 17 +++++++++++++++++ > > 1 file changed, 17 insertions(+) > > > > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt > > index c970d9fe438a..36ebe364d874 100644 > > --- a/Documentation/gitdiffcore.txt > > +++ b/Documentation/gitdiffcore.txt > > @@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a > > number after the "-M" or "-C" option (e.g. "-M8" to tell it to use > > 8/10 = 80%). > > > > +Note that when rename detection is on but both copy and break > > +detection are off, rename detection adds a preliminary step that first > > +checks if files are moved across directories while keeping their > > +filename the same. If there is a file added to a directory whose > > +contents is sufficiently similar to a file with the same name that got > > +deleted from a different directory, it will mark them as renames and > > +exclude them from the later quadratic step (the one that pairwise > > +compares all unmatched files to find the "best" matches, determined by > > +the highest content similarity). So, for example, if > > +docs/extensions.txt and docs/config/extensions.txt have similar > > +content, then they will be marked as a rename even if it turns out > > +that docs/extensions.txt was more similar to src/extension-checks.c. > > I'd rather use docs/extensions.md instead of src/extension-checks.c; > it would be more realistic for .md to be similar to .txt than .c. > > With a raised bar for this step, the equation changes a bit, no? > > So, for example, if a deleted docs/ext.txt and an added > docs/config/ext.txt are similar enough, they will be marked as a > rename and prevent an added docs/ext.md that may be even similar > to the deleted docs/ext.txt from being considered as the rename > destination in the later step. For this reason, the preliminary > "match same filename" step uses a bit higher threshold to mark a > file pair as a rename and stop considering other candidates for > better matches. > > or something? Good points; I've updated the docs locally to reflect your suggestions, I'll wait a bit for any other feedback and then send out a new round with this update. > > +At most, one comparison is done per file in this preliminary pass; so > > +if there are several extensions.txt files throughout the directory > > +hierarchy that were added and deleted, this preliminary step will be > > +skipped for those files. > > Other than that, the whole series looked sensible to my cursory > read. Thanks.
diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt index c970d9fe438a..36ebe364d874 100644 --- a/Documentation/gitdiffcore.txt +++ b/Documentation/gitdiffcore.txt @@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a number after the "-M" or "-C" option (e.g. "-M8" to tell it to use 8/10 = 80%). +Note that when rename detection is on but both copy and break +detection are off, rename detection adds a preliminary step that first +checks if files are moved across directories while keeping their +filename the same. If there is a file added to a directory whose +contents is sufficiently similar to a file with the same name that got +deleted from a different directory, it will mark them as renames and +exclude them from the later quadratic step (the one that pairwise +compares all unmatched files to find the "best" matches, determined by +the highest content similarity). So, for example, if +docs/extensions.txt and docs/config/extensions.txt have similar +content, then they will be marked as a rename even if it turns out +that docs/extensions.txt was more similar to src/extension-checks.c. +At most, one comparison is done per file in this preliminary pass; so +if there are several extensions.txt files throughout the directory +hierarchy that were added and deleted, this preliminary step will be +skipped for those files. + Note. When the "-C" option is used with `--find-copies-harder` option, 'git diff-{asterisk}' commands feed unmodified filepairs to diffcore mechanism as well as modified ones. This lets the copy