diff mbox series

[v3,5/5] gitdiffcore doc: mention new preliminary step for rename detection

Message ID fc72d24a3358b1c5cc2753b5f07ac60174e6452b.1612970140.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Optimization batch 7: use file basenames to guide rename detection | expand

Commit Message

Elijah Newren Feb. 10, 2021, 3:15 p.m. UTC
From: Elijah Newren <newren@gmail.com>

The last few patches have introduced a new preliminary step when rename
detection is on but both break detection and copy detection are off.
Document this new step.  While we're at it, add a testcase that checks
the new behavior as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Junio C Hamano Feb. 10, 2021, 4:41 p.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> The last few patches have introduced a new preliminary step when rename
> detection is on but both break detection and copy detection are off.
> Document this new step.  While we're at it, add a testcase that checks
> the new behavior as well.
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  Documentation/gitdiffcore.txt | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> index c970d9fe438a..36ebe364d874 100644
> --- a/Documentation/gitdiffcore.txt
> +++ b/Documentation/gitdiffcore.txt
> @@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a
>  number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
>  8/10 = 80%).
>  
> +Note that when rename detection is on but both copy and break
> +detection are off, rename detection adds a preliminary step that first
> +checks if files are moved across directories while keeping their
> +filename the same.  If there is a file added to a directory whose
> +contents is sufficiently similar to a file with the same name that got
> +deleted from a different directory, it will mark them as renames and
> +exclude them from the later quadratic step (the one that pairwise
> +compares all unmatched files to find the "best" matches, determined by
> +the highest content similarity).  So, for example, if
> +docs/extensions.txt and docs/config/extensions.txt have similar
> +content, then they will be marked as a rename even if it turns out
> +that docs/extensions.txt was more similar to src/extension-checks.c.

I'd rather use docs/extensions.md instead of src/extension-checks.c;
it would be more realistic for .md to be similar to .txt than .c.

With a raised bar for this step, the equation changes a bit, no?  

    So, for example, if a deleted docs/ext.txt and an added
    docs/config/ext.txt are similar enough, they will be marked as a
    rename and prevent an added docs/ext.md that may be even similar
    to the deleted docs/ext.txt from being considered as the rename
    destination in the later step.  For this reason, the preliminary
    "match same filename" step uses a bit higher threshold to mark a
    file pair as a rename and stop considering other candidates for
    better matches.

or something?

> +At most, one comparison is done per file in this preliminary pass; so
> +if there are several extensions.txt files throughout the directory
> +hierarchy that were added and deleted, this preliminary step will be
> +skipped for those files.

Other than that, the whole series looked sensible to my cursory
read.

Thanks.
Elijah Newren Feb. 10, 2021, 5:20 p.m. UTC | #2
On Wed, Feb 10, 2021 at 8:41 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > The last few patches have introduced a new preliminary step when rename
> > detection is on but both break detection and copy detection are off.
> > Document this new step.  While we're at it, add a testcase that checks
> > the new behavior as well.
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  Documentation/gitdiffcore.txt | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> >
> > diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
> > index c970d9fe438a..36ebe364d874 100644
> > --- a/Documentation/gitdiffcore.txt
> > +++ b/Documentation/gitdiffcore.txt
> > @@ -168,6 +168,23 @@ a similarity score different from the default of 50% by giving a
> >  number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
> >  8/10 = 80%).
> >
> > +Note that when rename detection is on but both copy and break
> > +detection are off, rename detection adds a preliminary step that first
> > +checks if files are moved across directories while keeping their
> > +filename the same.  If there is a file added to a directory whose
> > +contents is sufficiently similar to a file with the same name that got
> > +deleted from a different directory, it will mark them as renames and
> > +exclude them from the later quadratic step (the one that pairwise
> > +compares all unmatched files to find the "best" matches, determined by
> > +the highest content similarity).  So, for example, if
> > +docs/extensions.txt and docs/config/extensions.txt have similar
> > +content, then they will be marked as a rename even if it turns out
> > +that docs/extensions.txt was more similar to src/extension-checks.c.
>
> I'd rather use docs/extensions.md instead of src/extension-checks.c;
> it would be more realistic for .md to be similar to .txt than .c.
>
> With a raised bar for this step, the equation changes a bit, no?
>
>     So, for example, if a deleted docs/ext.txt and an added
>     docs/config/ext.txt are similar enough, they will be marked as a
>     rename and prevent an added docs/ext.md that may be even similar
>     to the deleted docs/ext.txt from being considered as the rename
>     destination in the later step.  For this reason, the preliminary
>     "match same filename" step uses a bit higher threshold to mark a
>     file pair as a rename and stop considering other candidates for
>     better matches.
>
> or something?

Good points; I've updated the docs locally to reflect your
suggestions, I'll wait a bit for any other feedback and then send out
a new round with this update.

> > +At most, one comparison is done per file in this preliminary pass; so
> > +if there are several extensions.txt files throughout the directory
> > +hierarchy that were added and deleted, this preliminary step will be
> > +skipped for those files.
>
> Other than that, the whole series looked sensible to my cursory
> read.

Thanks.
diff mbox series

Patch

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index c970d9fe438a..36ebe364d874 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -168,6 +168,23 @@  a similarity score different from the default of 50% by giving a
 number after the "-M" or "-C" option (e.g. "-M8" to tell it to use
 8/10 = 80%).
 
+Note that when rename detection is on but both copy and break
+detection are off, rename detection adds a preliminary step that first
+checks if files are moved across directories while keeping their
+filename the same.  If there is a file added to a directory whose
+contents is sufficiently similar to a file with the same name that got
+deleted from a different directory, it will mark them as renames and
+exclude them from the later quadratic step (the one that pairwise
+compares all unmatched files to find the "best" matches, determined by
+the highest content similarity).  So, for example, if
+docs/extensions.txt and docs/config/extensions.txt have similar
+content, then they will be marked as a rename even if it turns out
+that docs/extensions.txt was more similar to src/extension-checks.c.
+At most, one comparison is done per file in this preliminary pass; so
+if there are several extensions.txt files throughout the directory
+hierarchy that were added and deleted, this preliminary step will be
+skipped for those files.
+
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 diffcore mechanism as well as modified ones.  This lets the copy