Message ID | CABiJAjbEpYkcrxj82uQ=O27tR9fKoUFH0=MOCobDfa9cWsbdAA@mail.gmail.com (mailing list archive) |
---|---|
State | New |
Series | word-diff-regex=. sometimes ignores newlines |
Hi Tamo,

On Tue, 3 Sep 2024, 高橋全 (Tamo) wrote:

> What did you do before the bug happened? (Steps to reproduce your issue)
>
> mkdir test
> cd test
> git init
>
> cat >a.txt <<EOF
> NRZ /NZRQ/NBRQ/
> NRZ(C) /NZRCQ/
> NRZ(M) /NZRMQ/
> EOF
>
> git add a.txt
> git commit -m 1
>
> cat >a.txt <<EOF
> NRZ /NZRMQ/NZRCQ/NZRQ/NBRQ/
> EOF
>
> git diff --word-diff-regex=.
>
>
> What did you expect to happen? (Expected behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]{+ZRCQ/NZRQ/NBRQ/+}
> [-NRZ(C) /NZRCQ/-]
> [-NRZ(M) /NZRMQ/-]
>
> or anything whose hunk has three lines
>
>
> What happened instead? (Actual behavior)
>
> diff --git a/a.txt b/a.txt
> index 278ea76..7e6f42f 100644
> --- a/a.txt
> +++ b/a.txt
> @@ -1,3 +1 @@
> NRZ /NZR{+M+}Q/N[-BRQ/-]
> [-NRZ(C) /N-]ZRCQ/N[-R-]Z[-(M) -]{+RQ+}/N[-Z-]{+B+}R[-M-]Q/
>
>
>
> What's different between what you expected and what actually happened?
>
> some newlines are ignored
> and the length of the hunk is wrong;
> git says "@@ -1,3 +1 @@" but the hunk has only 2 lines

The reason is the regular expression, which does not match newlines. See
https://github.com/git/git/blob/v2.46.0/diff.c#L2268-L2270, which shows how
the regular expression is compiled:

    if (regcomp(ecbdata->diff_words->word_regex,
                o->word_regex,
                REG_EXTENDED | REG_NEWLINE))

Note the flag `REG_NEWLINE`, described in detail at
https://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html:

    If REG_NEWLINE is set, then <newline> shall be treated as an
    ordinary character except as follows:

    1. A <newline> in string shall not be matched by a <period>
       outside a bracket expression or by any form of a non-matching
       list (see XBD Regular Expressions).
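The effect of `REG_NEWLINE` on `.` can be checked in isolation. What follows is a minimal sketch that uses only the POSIX `regcomp()`/`regexec()` calls; the helper `count_matches()` is a hypothetical name introduced for illustration and is not part of Git.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not Git code): count the non-empty,
 * non-overlapping matches of `pattern` in `text`, scanning left to
 * right. Returns -1 if the pattern does not compile.
 */
static int count_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    regmatch_t m;
    int count = 0;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    while (*text != '\0' && regexec(&re, text, 1, &m, 0) == 0) {
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step past it so the loop cannot stall. */
            text += m.rm_eo;
            if (*text == '\0')
                break;
            text++;
            continue;
        }
        count++;
        text += m.rm_eo; /* resume right after this match */
    }

    regfree(&re);
    return count;
}
```

For the input `"ab\ncd"`, `count_matches(".", "ab\ncd", REG_EXTENDED | REG_NEWLINE)` yields 4, because the newline is skipped, while the same call with `REG_EXTENDED` alone yields 5. The same skipping applies to a non-matching list such as `[^x]`, as the quoted POSIX text says.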
You will note that you can see three lines in the output when using
`--word-diff-regex='[^ \t\n]+|[ \t\n]+'`:

    $ git diff --word-diff-regex='[^ \t\n]+|[ \t\n]+'
    diff --git a/a.txt b/a.txt
    index 278ea76..7e6f42f 100644
    --- a/a.txt
    +++ b/a.txt
    @@ -1,3 +1 @@
    NRZ [-/NZRQ/NBRQ/-]
    [-NRZ(C) /NZRCQ/-]
    [-NRZ(M) /NZRMQ/-]{+/NZRMQ/NZRCQ/NZRQ/NBRQ/+}

However, when including the slash in the boundary characters, the newlines
are suppressed again:

    $ git diff --word-diff-regex='[^/ \t\n]+|[/ \t\n]+'
    diff --git a/a.txt b/a.txt
    index 278ea76..7e6f42f 100644
    --- a/a.txt
    +++ b/a.txt
    @@ -1,3 +1 @@
    NRZ /[-NZRQ/NBRQ-]{+NZRMQ+}/[-NRZ(C) /-]NZRCQ/[-NRZ(M) /NZRMQ-]{+NZRQ/NBRQ+}/

I am fairly convinced that the reason for this behavior is that the word
diff machinery special-cases newlines and _never_ makes them part of the
"words", see https://github.com/git/git/blob/v2.46.0/diff.c#L2072-L2074 for
the code implementing that logic.

Now, is this a bug? I can't really say. From my perspective, it is not:
When I implemented the original version of the word diff code, my use case
was LaTeX-formatted scientific articles, which traditionally do not contain
newline characters within paragraphs. I still have a hard time wrapping my
head around use cases where any pattern that includes a newline would match
what is considered a word.

I do remember how I struggled (and punted) on the question of how to
display newlines in word diffs. There just is no good way to do it that
would address all valid scenarios.

Ciao,
Johannes
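The newline special-casing described in the reply can be sketched as follows. This is a simplified illustration, assuming the tokenizer simply cuts a regex match at the first newline it contains; the helper `next_word()` is hypothetical, and Git's actual code at the linked lines may differ in detail.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not Git code): find the next "word" in `text`.
 * A match that spans a newline is cut at that newline, so the newline
 * never becomes part of the word. Returns 1 and sets *start and *len
 * on success, 0 if the regex does not match.
 */
static int next_word(const regex_t *re, const char *text,
                     size_t *start, size_t *len)
{
    regmatch_t m;
    const char *begin, *nl;
    size_t matched;

    if (regexec(re, text, 1, &m, 0) != 0)
        return 0;

    begin = text + m.rm_so;
    matched = (size_t)(m.rm_eo - m.rm_so);

    /* Look for a newline inside the matched span only. */
    nl = memchr(begin, '\n', matched);

    *start = (size_t)m.rm_so;
    *len = nl ? (size_t)(nl - begin) : matched;
    return 1;
}
```

With a pattern whose bracket expression explicitly lists a newline, compiled with `REG_EXTENDED | REG_NEWLINE`, a separator run such as a slash followed by a newline is matched as two characters by the regex but is reported as a one-character word under this assumption, which would be consistent with line breaks disappearing from the word-diff output.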
diff --git a/a.txt b/a.txt
index 278ea76..7e6f42f 100644
--- a/a.txt
+++ b/a.txt
@@ -1,3 +1 @@
NRZ /NZR{+M+}Q/N[-BRQ/-]{+ZRCQ/NZRQ/NBRQ/+}
[-NRZ(C) /NZRCQ/-]
[-NRZ(M) /NZRMQ/-]

or anything whose hunk has three lines


What happened instead? (Actual behavior)

diff --git a/a.txt b/a.txt
index 278ea76..7e6f42f 100644
--- a/a.txt
+++ b/a.txt
@@ -1,3 +1 @@
NRZ /NZR{+M+}Q/N[-BRQ/-]
[-NRZ(C) /N-]ZRCQ/N[-R-]Z[-(M) -]{+RQ+}/N[-Z-]{+B+}R[-M-]Q/

What's different between what you expected and what actually happened?

some newlines are ignored
and the length of the hunk is wrong;
git says "@@ -1,3 +1 @@" but the hunk has only 2 lines

Anything else you want to add:

this may be clearer with --word-diff=porcelain
because it doesn't emit enough number of "~"s

```
$ git diff --word-diff-regex=. --word-diff=porcelain
diff --git a/a.txt b/a.txt
index 278ea76..7e6f42f 100644
--- a/a.txt
+++ b/a.txt
@@ -1,3 +1 @@
 NRZ /NZR
+M
 Q/N
-BRQ/