Git diff misattributes the first word of a line to the previous line

Message ID CABwTF4U-KXHF7=8RWY7Ecbspz205Msa3syZFiWYDg3XmZsNGVw@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Gurjeet Singh Oct. 13, 2022, 5:51 a.m. UTC
Git diff seems to get confused about word boundaries, and includes the
first word from the next line.

In the output below, you'd be forgiven for assuming that the word 'ab'
was removed from the first line and replaced with the word 'opt1'.
But as you can see in the contents of file '1.txt', that word was on
the _second_ line.

It seems that the first word of a line gets attributed to the previous
line, ignoring the fact that there's an intervening newline before the
word.

This confusion is exhibited by the --color-words option, too, which is
where I first discovered it. But when trying to create a reproducible
test case, I discovered that this may be a problem with word-boundary
identification, and that --word-diff=plain is a better way to
demonstrate the bug.

I have also eliminated the possibility that this may be due to some
misconfiguration in my Git config files, by setting some environment
variables, as seen in the last command.

$ git diff --word-diff=plain /tmp/1.txt /tmp/2.txt

Best regards,
Gurjeet
http://Gurje.et

Comments

Johannes Sixt Oct. 13, 2022, 6:45 a.m. UTC | #1
Am 13.10.22 um 07:51 schrieb Gurjeet Singh:
> Git diff seems to get confused about word boundaries, and includes the
> first word from the next line.

No, that would misattribute the perceived malfunction.

> It seems that the first word of a line gets attributed to the previous
> line, ignoring the fact that there's an intervening newline before the
> word.
> [...]
> $ git diff --word-diff=plain /tmp/1.txt /tmp/2.txt
> diff --git a/tmp/1.txt b/tmp/2.txt
> index 8239f93..099fb80 100644
> --- a/tmp/1.txt
> +++ b/tmp/2.txt
> @@ -1,2 +1,2 @@
>     x = yz [-ab-]{+opt1+}
> {+    ac+} = [-cd ef-]{+pq opt2+}
> 
> $ cat /tmp/1.txt
>     x = yz
>     ab = cd ef
> 
> $ cat /tmp/2.txt
>     x = yz opt1
>     ac = pq opt2

The reason for this is that the implementation of word-diff does not
treat newline characters in any special way. They are treated as
"whitespace" like any other character that is not captured by the
word-diff patterns. Whitespace characters following each word are
recorded, but are disregarded when the word-diff is computed. When the
text is reconstructed in the output, these recorded space characters are
printed only for unchanged and added words, but are not printed for
removed words (IIRC). Combine this with the fact that when there is a
change, i.e., a combination of removal and addition, then the removal is
printed before the addition, and you get the observed output.

I don't see an easy solution for this without completely rewriting the
implementation.
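
The mechanism described above can be sketched in a few lines of Python. This is a rough model under stated assumptions, not Git's actual code: it assumes a word is any run of non-whitespace characters, uses Python's difflib in place of Git's own diff engine, and the names word_diff_plain and mark are hypothetical. Only the words are compared, whitespace (newlines included) is copied from the new file, and within each change the removal is printed before the addition.

```python
import difflib
import re

# Assumption: a "word" is any run of non-whitespace characters.
WORD = re.compile(r"\S+")


def mark(text, prefix, suffix):
    """Wrap each line of text separately so a marker never spans a newline."""
    return "\n".join(prefix + part + suffix if part else part
                     for part in text.split("\n"))


def word_diff_plain(old, new):
    """Hypothetical model of --word-diff=plain output for two strings."""
    old_words = list(WORD.finditer(old))
    new_words = list(WORD.finditer(new))
    # Only the words are compared; whitespace (newlines included) is ignored.
    matcher = difflib.SequenceMatcher(
        None,
        [m.group() for m in old_words],
        [m.group() for m in new_words],
        autojunk=False,
    )
    out = []
    pos = 0  # offset in `new` up to which text has been emitted
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if j1 < j2:
            start, end = new_words[j1].start(), new_words[j2 - 1].end()
        else:
            # Pure removal: anchor it right after the preceding new word.
            start = end = new_words[j1 - 1].end() if j1 else 0
        # Unchanged text, with its whitespace taken from the new file.
        out.append(new[pos:start])
        if i1 < i2:
            removed = old[old_words[i1].start():old_words[i2 - 1].end()]
            out.append(mark(removed, "[-", "-]"))  # removal is printed first
        if j1 < j2:
            out.append(mark(new[start:end], "{+", "+}"))
        pos = end
    out.append(new[pos:])
    return "".join(out)


old = "    x = yz\n    ab = cd ef\n"
new = "    x = yz opt1\n    ac = pq opt2\n"
print(word_diff_plain(old, new), end="")
```

For the two files from the report, this model prints the same two lines as the reported output: 'ab' and the pair 'opt1', 'ac' form a single change, so '[-ab-]' is emitted before '{+opt1+}', and the newline that preceded 'ab' in the old file is never emitted.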

-- Hannes
Philip Oakley Oct. 13, 2022, 11:30 a.m. UTC | #2
Hi Gurjeet,

On 13/10/2022 07:45, Johannes Sixt wrote:
> Am 13.10.22 um 07:51 schrieb Gurjeet Singh:
>> Git diff seems to get confused about word boundaries, and includes the
>> first word from the next line.
> No, that would misattribute the perceived malfunction.
>
>> It seems that the first word of a line gets attributed to the previous
>> line, ignoring the fact that there's an intervening newline before the
>> word.

Given that this effect is a part of the design (LF => whitespace), are
there any changes to the *documentation* that could be made to help
clarify this? E.g. looking back (you did check the manual? ;-) did you
miss some aspect in the man pages that could have been more prominent,
placed earlier, or clarified?

Why is this way of reporting even expected (e.g. confusion between
flowed text and line-oriented code, without a mode change)?

Any other `retrospective` thoughts that could help?

--
Philip

>> [...]
>> $ git diff --word-diff=plain /tmp/1.txt /tmp/2.txt
>> diff --git a/tmp/1.txt b/tmp/2.txt
>> index 8239f93..099fb80 100644
>> --- a/tmp/1.txt
>> +++ b/tmp/2.txt
>> @@ -1,2 +1,2 @@
>>     x = yz [-ab-]{+opt1+}
>> {+    ac+} = [-cd ef-]{+pq opt2+}
>>
>> $ cat /tmp/1.txt
>>     x = yz
>>     ab = cd ef
>>
>> $ cat /tmp/2.txt
>>     x = yz opt1
>>     ac = pq opt2
> The reason for this is that the implementation of word-diff does not
> treat newline characters in any special way. They are treated as
> "whitespace" like any other character that is not captured by the
> word-diff patterns. Whitespace characters following each word are
> recorded, but are disregarded when the word-diff is computed. When the
> text is reconstructed in the output, these recorded space characters are
> printed only for unchanged and added words, but are not printed for
> removed words (IIRC). Combine this with the fact that when there is a
> change, i.e., a combination of removal and addition, then the removal is
> printed before the addition, and you get the observed output.
>
> I don't see an easy solution for this without completely rewriting the
> implementation.
>
> -- Hannes
>
Bagas Sanjaya Oct. 13, 2022, 12:16 p.m. UTC | #3
On 10/13/22 12:51, Gurjeet Singh wrote:
> # **** Expected **** output
> $ git diff --word-diff=plain /tmp/1.txt /tmp/2.txt
> diff --git a/tmp/1.txt b/tmp/2.txt
> index 8239f93..099fb80 100644
> --- a/tmp/1.txt
> +++ b/tmp/2.txt
> @@ -1,2 +1,2 @@
>     x = yz {+opt1+}
>     [-ab-]{+ac+} = [-cd ef-]{+pq opt2+}
> 

What Git version (and on what system) did you use to produce the expected diff above?

Patch

diff --git a/tmp/1.txt b/tmp/2.txt
index 8239f93..099fb80 100644
--- a/tmp/1.txt
+++ b/tmp/2.txt
@@ -1,2 +1,2 @@ 
    x = yz [-ab-]{+opt1+}
{+    ac+} = [-cd ef-]{+pq opt2+}

$ cat /tmp/1.txt
    x = yz
    ab = cd ef

$ cat /tmp/2.txt
    x = yz opt1
    ac = pq opt2

# Git installed on macOS, via Nixpkgs
$ git --version
git version 2.35.1

# Also tested on Git installed via Homebrew
$ /usr/local/bin/git --version
git version 2.38.0

# Try to run with a clean environment.
$ export GIT_CONFIG_GLOBAL=/dev/null
$ export GIT_CONFIG_SYSTEM=/dev/null
$ export GIT_CONFIG_NOSYSTEM=yes
$ git diff --word-diff=plain /tmp/1.txt /tmp/2.txt # Same buggy output

# **** Expected **** output
$ git diff --word-diff=plain /tmp/1.txt /tmp/2.txt
diff --git a/tmp/1.txt b/tmp/2.txt
index 8239f93..099fb80 100644
--- a/tmp/1.txt
+++ b/tmp/2.txt
@@ -1,2 +1,2 @@ 
    x = yz {+opt1+}
    [-ab-]{+ac+} = [-cd ef-]{+pq opt2+}
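
For comparison, here is a hedged sketch of one tokenization under which the expected output above would come out: each newline is treated as a word of its own, so it can act as an anchor that changes do not cross. This is an illustrative Python model only, not a Git option or Git's implementation; it uses difflib in place of Git's diff engine, and the names word_diff_newline_aware and mark are hypothetical.

```python
import difflib
import re

# Assumption: a token is a run of non-whitespace characters OR a single newline.
TOKEN = re.compile(r"\S+|\n")


def mark(text, prefix, suffix):
    """Wrap each line of text separately so a marker never spans a newline."""
    return "\n".join(prefix + part + suffix if part else part
                     for part in text.split("\n"))


def word_diff_newline_aware(old, new):
    """Hypothetical word diff in which newlines take part in the comparison."""
    old_toks = list(TOKEN.finditer(old))
    new_toks = list(TOKEN.finditer(new))
    matcher = difflib.SequenceMatcher(
        None,
        [m.group() for m in old_toks],
        [m.group() for m in new_toks],
        autojunk=False,
    )
    out = []
    pos = 0  # offset in `new` up to which text has been emitted
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if j1 < j2:
            start, end = new_toks[j1].start(), new_toks[j2 - 1].end()
        else:
            # Pure removal: anchor it right after the preceding new token.
            start = end = new_toks[j1 - 1].end() if j1 else 0
        out.append(new[pos:start])  # unchanged text, copied from the new file
        if i1 < i2:
            removed = old[old_toks[i1].start():old_toks[i2 - 1].end()]
            out.append(mark(removed, "[-", "-]"))
        if j1 < j2:
            out.append(mark(new[start:end], "{+", "+}"))
        pos = end
    out.append(new[pos:])
    return "".join(out)


old = "    x = yz\n    ab = cd ef\n"
new = "    x = yz opt1\n    ac = pq opt2\n"
print(word_diff_newline_aware(old, new), end="")
```

With the newline matched as a token on both sides, 'opt1' becomes a pure addition at the end of the first line and 'ab' is paired with 'ac' on the second line.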