diff mbox series

similarity index vs. whitespaces

Message ID 20230413015640.h6npzp47tnob7bq2@glandium.org (mailing list archive)
State New, archived
Headers show
Series similarity index vs. whitespaces | expand

Commit Message

Mike Hommey April 13, 2023, 1:56 a.m. UTC
Hi,

I hit an interesting case of copy-detection "failure" after renaming a
python script and changing indentation in it at the same time: git show
-C wouldn't detect it as a rename unless I lowered the similarity index
significantly. But from a human perspective, the similarity index feels
wrong.

Here's a small illustrative example:
$ git init qux
$ cd qux
$ cat > qux <<EOF
a
b
c
d
e
f
EOF
$ git add qux
$ git commit -a -m qux
$ git mv qux hoge
$ sed -i 's/[cde]/  \0/' hoge
$ git commit -a -m hoge
$ git diff -C HEAD^!

From a human perspective 33% similarity feels way too low here. I know
it's essentially counting lines in the diff, but that feels limited.

What do you think?

Mike

Comments

Chris Torek April 13, 2023, 2:35 a.m. UTC | #1
On Wed, Apr 12, 2023 at 7:01 PM Mike Hommey <mh@glandium.org> wrote:
[example of Python script diff snipped]
> From a human perspective 33% similarity feels way too low here. I know
> it's essentially counting lines in the diff, but that feels limited.

Technically, the similarity index isn't based on lines.  Instead
it's based on byte-by-byte matching, broken into segments (these
do use LF or CRLF to break up segments as appropriate but very
*long* lines get broken up without such line terminators).  This
shares some code with the delta compression algorithm used to
pack objects.  One of the goals here is to consider a lot of
*moved* lines to be "very similar", despite such moves generally
producing large-ish diffs.

CRs are, if I recall correctly, discarded during similarty index
computation.  It would be somewhat easy to discard leading white
space -- and even easier to discard *all* white space, but I'd
suggest that would be wrong -- due to the special case already
in place for line terminators here.

It's not clear to me that discarding leading white space would
be correct in *all* cases, but it does seem very appropriate for
Python code.  Whether the earlier discussion about diff algorithms
being adjusted based on .gitattributes entries might apply here,
I'll leave for others to argue about. :-)  (That is, I'm just
tossing it out as an idea for the mix.)

Chris
diff mbox series

Patch

diff --git a/hoge b/hoge
new file mode 100644
index 0000000..9ab6fcc
--- /dev/null
+++ b/hoge
@@ -0,0 +1,6 @@ 
+a
+b
+  c
+  d
+  e
+f
diff --git a/qux b/qux
deleted file mode 100644
index 0fdf397..0000000
--- a/qux
+++ /dev/null
@@ -1,6 +0,0 @@ 
-a
-b
-c
-d
-e
-f

$ git diff -C10% HEAD^!
diff --git a/qux b/hoge
similarity index 33%
rename from qux
rename to hoge
index 0fdf397..9ab6fcc 100644
--- a/qux
+++ b/hoge
@@ -1,6 +1,6 @@ 
 a
 b
-c
-d
-e
+  c
+  d
+  e
 f