word-diff-regex=. sometimes ignores newlines

Message ID CABiJAjbEpYkcrxj82uQ=O27tR9fKoUFH0=MOCobDfa9cWsbdAA@mail.gmail.com (mailing list archive)
State New

Commit Message

高橋全 (Tamo) Sept. 3, 2024, 12:30 p.m. UTC
Thank you for filling out a Git bug report!
Please answer the following questions to help us understand your issue.

What did you do before the bug happened? (Steps to reproduce your issue)

mkdir test
cd test
git init

cat >a.txt <<EOF
NRZ /NZRQ/NBRQ/
NRZ(C) /NZRCQ/
NRZ(M) /NZRMQ/
EOF

git add a.txt
git commit -m 1

cat >a.txt <<EOF
NRZ /NZRMQ/NZRCQ/NZRQ/NBRQ/
EOF

git diff --word-diff-regex=.


What did you expect to happen? (Expected behavior)

~
-NRZ(C) /N
 ZRCQ/N
-R
 Z
-(M)
+RQ
 /N
-Z
+B
 R
-M
 Q/
~
```

this should emit 3 tildes




Please review the rest of the bug report below.
You can delete any lines you don't wish to share.


[System Info]
git version:
git version 2.46.0
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
libcurl: 7.68.0
zlib: 1.2.11
uname: Linux 5.15.153.1-microsoft-standard-WSL2 #1 SMP Fri Mar 29 23:14:13 UTC 2024 x86_64
compiler info: gnuc: 9.4
libc info: glibc: 2.31
$SHELL (typically, interactive shell): /bin/bash


[Enabled Hooks]

Comments

Johannes Schindelin Sept. 6, 2024, 10:08 a.m. UTC | #1
Hi Tamo,

On Tue, 3 Sep 2024, 高橋全 (Tamo) wrote:

> What did you do before the bug happened? (Steps to reproduce your issue)
>
> mkdir test
> cd test
> git init
>
> cat >a.txt <<EOF
> NRZ /NZRQ/NBRQ/
> NRZ(C) /NZRCQ/
> NRZ(M) /NZRMQ/
> EOF
>
> git add a.txt
> git commit -m 1
>
> cat >a.txt <<EOF
> NRZ /NZRMQ/NZRCQ/NZRQ/NBRQ/
> EOF
>
> git diff --word-diff-regex=.
>
>
> What did you expect to happen? (Expected behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]{+ZRCQ/NZRQ/NBRQ/+}
> [-NRZ(C) /NZRCQ/-]
> [-NRZ(M) /NZRMQ/-]
>
> or anything whose hunk has three lines
>
>
> What happened instead? (Actual behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]
> [-NRZ(C) /N-]ZRCQ/N[-R-]Z[-(M) -]{+RQ+}/N[-Z-]{+B+}R[-M-]Q/
>
>
>
> What's different between what you expected and what actually happened?
>
> some newlines are ignored
> and the length of the hunk is wrong;
> git says "@@ -1,3 +1 @@" but the hunk has only 2 lines

The reason is the regular expression, which does not match newlines. See
https://github.com/git/git/blob/v2.46.0/diff.c#L2268-L2270, which shows
how the regular expression is compiled:

		if (regcomp(ecbdata->diff_words->word_regex,
			    o->word_regex,
			    REG_EXTENDED | REG_NEWLINE))

Note the flag `REG_NEWLINE`, described in detail at
https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html:

	If REG_NEWLINE is set, then <newline> shall be treated as an
	ordinary character except as follows:

	1. A <newline> in string shall not be matched by a <period>
	   outside a bracket expression or by any form of a non-matching
	   list (see XBD Regular Expressions).

You will note that you can see three lines in the output when using
`--word-diff-regex='[^ \t\n]+|[ \t\n]+'`:

	$ git diff --word-diff-regex='[^ \t\n]+|[ \t\n]+'
	diff --git a/a.txt b/a.txt
	index 278ea76..7e6f42f 100644
	--- a/a.txt
	+++ b/a.txt
	@@ -1,3 +1 @@
	NRZ [-/NZRQ/NBRQ/-]
	[-NRZ(C) /NZRCQ/-]
	[-NRZ(M) /NZRMQ/-]{+/NZRMQ/NZRCQ/NZRQ/NBRQ/+}

However, when including the slash in the boundary characters, the newlines
are suppressed again:

	$ git diff --word-diff-regex='[^/ \t\n]+|[/ \t\n]+'
	diff --git a/a.txt b/a.txt
	index 278ea76..7e6f42f 100644
	--- a/a.txt
	+++ b/a.txt
	@@ -1,3 +1 @@
	NRZ /[-NZRQ/NBRQ-]{+NZRMQ+}/[-NRZ(C) /-]NZRCQ/[-NRZ(M) /NZRMQ-]{+NZRQ/NBRQ+}/

I am fairly convinced that the reason for this behavior is that the word
diff machinery special-cases newlines and _never_ makes them part of the
"words", see https://github.com/git/git/blob/v2.46.0/diff.c#L2072-L2074
for the code implementing that logic.

Now, is this a bug? I can't really say. From my perspective, it is not:
When I implemented the original version of the word diff code, my use case
was LaTeX-formatted scientific articles, which traditionally do not
contain newline characters within paragraphs. I still have a hard time
wrapping my head around use cases where any pattern that includes a
newline would match what is considered a word.

I do remember how I struggled with (and punted on) the question of how to
display newlines in word diffs. There just is no good way to do it that would
address all valid scenarios.

Ciao,
Johannes
Patch

diff --git a/a.txt b/a.txt
index 278ea76..7e6f42f 100644
--- a/a.txt
+++ b/a.txt
@@ -1,3 +1 @@ 
NRZ /NZR{+M+}Q/N[-BRQ/-]{+ZRCQ/NZRQ/NBRQ/+}
[-NRZ(C) /NZRCQ/-]
[-NRZ(M) /NZRMQ/-]

or anything whose hunk has three lines


What happened instead? (Actual behavior)

diff --git a/a.txt b/a.txt
index 278ea76..7e6f42f 100644
--- a/a.txt
+++ b/a.txt
@@ -1,3 +1 @@ 
NRZ /NZR{+M+}Q/N[-BRQ/-]
[-NRZ(C) /N-]ZRCQ/N[-R-]Z[-(M) -]{+RQ+}/N[-Z-]{+B+}R[-M-]Q/



What's different between what you expected and what actually happened?

some newlines are ignored
and the length of the hunk is wrong;
git says "@@ -1,3 +1 @@" but the hunk has only 2 lines



Anything else you want to add:

this may be clearer with --word-diff=porcelain
because it doesn't emit enough "~"s


```
$ git diff --word-diff-regex=. --word-diff=porcelain
diff --git a/a.txt b/a.txt
index 278ea76..7e6f42f 100644
--- a/a.txt
+++ b/a.txt
@@ -1,3 +1 @@ 
 NRZ /NZR
+M
 Q/N
-BRQ/