diff mbox series

[v13,3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

Message ID 20211015161356.3372-3-someguy@effective-light.com (mailing list archive)
State Accepted
Commit ae39ba431ab861548eb60b4bd2e1d8b8813db76f
Headers show
Series [v13,1/3] grep: refactor next_match() and match_one_pattern() for external use | expand

Commit Message

Hamza Mahfooz Oct. 15, 2021, 4:13 p.m. UTC
If we attempt to grep non-ascii log message text with an ascii pattern, we
run into the following issue:

    $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
    grep: (standard input): binary file matches

So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
output is encoded in UTF-8.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
---
v12: get rid of utf8_all_the_things and fix an issue with one of the unit
     tests.

v13: make some clarifications in the commit message.
---
 grep.c                          |  6 +++--
 t/t7812-grep-icase-non-ascii.sh | 48 +++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 2 deletions(-)

Comments

Junio C Hamano Oct. 15, 2021, 8:03 p.m. UTC | #1
Hamza Mahfooz <someguy@effective-light.com> writes:

> If we attempt to grep non-ascii log message text with an ascii pattern, we

"with an ascii pattern, when Git is built with and told to use pcre2, we"

> run into the following issue:
>
>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>     grep: (standard input): binary file matches
>
> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
> output is encoded in UTF-8.

> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
> +	    (!opt->ignore_locale && is_utf8_locale() &&
> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
> +					    (p->fixed || p->is_fixed))))

That's a mouthful.  It is not obvious what new condition is being
added.  I had to flip the order to see the only difference is, that

> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))

... in addition to the case where the original condition holds, if
we do not say "ignore locale" and the pattern is ascii-only, we
apply these two option flags.  And that matches what the proposed
log message explained as the condition the problem appears.

So,... looks good, I guess.

Thanks, will queue.


Addendum.

If we were reordering pieces in the condition, I wonder if there is
a better way to reorganize it, though.  The original is already
barely explainable with words, and with this new condition added, I
am not sure if anybody can phrase the condition in simple words to
others after staring it for a few minutes.  I can't.

But straightening it out is best left as a future clean-up patch,
separate from this series.
René Scharfe Oct. 16, 2021, 4:25 p.m. UTC | #2
Am 15.10.21 um 22:03 schrieb Junio C Hamano:
> Hamza Mahfooz <someguy@effective-light.com> writes:
>
>> If we attempt to grep non-ascii log message text with an ascii pattern, we
>
> "with an ascii pattern, when Git is built with and told to use pcre2, we"
>
>> run into the following issue:
>>
>>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>>     grep: (standard input): binary file matches

I get no error message on macOS 11.6, but this result, with the underlined
part in red:

   Author: ??var Arnfjörð Bjarmason <avarab@gmail.com>
            ^^^^^^^^^^^^^^^^^^

So the pattern matches the second byte of a two-byte character, inserts a
color code in the middle and thus produces invalid output in this case.

>>
>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
>> output is encoded in UTF-8.
>
>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>> +	    (!opt->ignore_locale && is_utf8_locale() &&
>> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>> +					    (p->fixed || p->is_fixed))))
>
> That's a mouthful.  It is not obvious what new condition is being
> added.  I had to flip the order to see the only difference is, that
>
>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
>> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))
>
> ... in addition to the case where the original condition holds, if
> we do not say "ignore locale" and the pattern is ascii-only, we
> apply these two option flags.  And that matches what the proposed
> log message explained as the condition the problem appears.
>
> So,... looks good, I guess.
>
> Thanks, will queue.
>
>
> Addendum.
>
> If we were reordering pieces in the condition, I wonder if there is
> a better way to reorganize it, though.  The original is already
> barely explainable with words, and with this new condition added, I
> am not sure if anybody can phrase the condition in simple words to
> others after staring it for a few minutes.  I can't.
>
> But straightening it out is best left as a future clean-up patch,
> separate from this series.
>

It can be written as:

	literal = !opt->ignore_case && (p->fixed || p->is_fixed);
	if (!opt->ignore_locale) {
		if (!has_non_ascii(p->pattern) ||
		    (is_utf8_locale() && !literal))
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

Literal patterns are those that don't use any wildcards or case-folding.
If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
pattern only consists of ASCII characters, or if the pattern is encoded
in UTF-8 and is not just a literal pattern.

Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
ASCII chars?

The old condition was (reformatted to better match the new one):

	if (!opt->ignore_locale) {
		if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal)
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

Intuitively I'd say the condition should be:

	if (!opt->ignore_locale && is_utf8_locale()) {
		if (has_non_ascii(p->pattern) || !literal)
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we
have to match non-ASCII characters or do more than just literal
matching.

For literal patterns that consist only of ASCII characters we don't need
the cost and complication of PCRE2_UTF.

Makes sense?

René
Ævar Arnfjörð Bjarmason Oct. 16, 2021, 5:12 p.m. UTC | #3
On Sat, Oct 16 2021, René Scharfe wrote:

> Am 15.10.21 um 22:03 schrieb Junio C Hamano:
>> Hamza Mahfooz <someguy@effective-light.com> writes:
>>
>>> If we attempt to grep non-ascii log message text with an ascii pattern, we
>>
>> "with an ascii pattern, when Git is built with and told to use pcre2, we"
>>
>>> run into the following issue:
>>>
>>>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>>>     grep: (standard input): binary file matches
>
> I get no error message on macOS 11.6, but this result, with the underlined
> part in red:
>
>    Author: ??var Arnfjörð Bjarmason <avarab@gmail.com>
>             ^^^^^^^^^^^^^^^^^^
>
> So the pattern matches the second byte of a two-byte character, inserts a
> color code in the middle and thus produces invalid output in this case.

Thanks for digging into these edge cases...

>>>
>>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
>>> output is encoded in UTF-8.
>>
>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>>> +	    (!opt->ignore_locale && is_utf8_locale() &&
>>> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>>> +					    (p->fixed || p->is_fixed))))
>>
>> That's a mouthful.  It is not obvious what new condition is being
>> added.  I had to flip the order to see the only difference is, that
>>
>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
>>> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))
>>
>> ... in addition to the case where the original condition holds, if
>> we do not say "ignore locale" and the pattern is ascii-only, we
>> apply these two option flags.  And that matches what the proposed
>> log message explained as the condition the problem appears.
>>
>> So,... looks good, I guess.
>>
>> Thanks, will queue.
>>
>>
>> Addendum.
>>
>> If we were reordering pieces in the condition, I wonder if there is
>> a better way to reorganize it, though.  The original is already
>> barely explainable with words, and with this new condition added, I
>> am not sure if anybody can phrase the condition in simple words to
>> others after staring it for a few minutes.  I can't.
>>
>> But straightening it out is best left as a future clean-up patch,
>> separate from this series.
>>
>
> It can be written as:
>
> 	literal = !opt->ignore_case && (p->fixed || p->is_fixed);
> 	if (!opt->ignore_locale) {
> 		if (!has_non_ascii(p->pattern) ||
> 		    (is_utf8_locale() && !literal))
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}

Whatever we go from here I'm very much for untangling that condition,
but I guess it can be done as a follow-up too, I'll defer to Hamza
there...

> Literal patterns are those that don't use any wildcards or case-folding.
> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
> pattern only consists of ASCII characters, or if the pattern is encoded
> in UTF-8 and is not just a literal pattern.
>
> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
> ASCII chars?
>
> The old condition was (reformatted to better match the new one):
>
> 	if (!opt->ignore_locale) {
> 		if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal)
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}
>
> Intuitively I'd say the condition should be:
>
> 	if (!opt->ignore_locale && is_utf8_locale()) {
> 		if (has_non_ascii(p->pattern) || !literal)
> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> 	}
>
> If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we
> have to match non-ASCII characters or do more than just literal
> matching.
>
> For literal patterns that consist only of ASCII characters we don't need
> the cost and complication of PCRE2_UTF.
>
> Makes sense?

    echo 'René Scharfe' >f &&
    $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
    1
    $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
    f:René Scharfe
    0

So it's a choose-your-own adventure where you can pick if you're
right. I.e. do you want the "." metacharacter to match your "é" or not?

These sorts of patterns demonstrate nicely that the relationship between your
pattern being ASCII and wanting or not wanting UTF-8 matching semantics
isn't what you might imagine it to be.

If you look at:

    git log --reverse -p -Gis_utf8_locale -- grep.c

And my earlier replies in this thread-at-large (not digging up a
reference now, sorry), you'll see that the current behavior in grep.c is
really a compromise based on some intersection of user patterns, us
potentially grepping arbitrary binary data and/or "valid" encodings, and
what PCRE does and doesn't puke on.

It's not "right" by any sense, but we sort of limp along with it, mostly
because unlike say Perl's regex engine which really is used for serious
production work where Unicode-correctness matters I don't think anyone
is using "git grep" for anything like that, it's mostly for people's
ad-hoc greps.

Right now I can't remember if the only reason we can't "fix this" is
because we'd be too aggressive with the PCRE version dependency, see
95ca1f987ed (grep/pcre2: better support invalid UTF-8 haystacks,
2021-01-24).
René Scharfe Oct. 16, 2021, 7:44 p.m. UTC | #4
Am 16.10.21 um 19:12 schrieb Ævar Arnfjörð Bjarmason:
>
> On Sat, Oct 16 2021, René Scharfe wrote:
>
>> Am 15.10.21 um 22:03 schrieb Junio C Hamano:
>>> Hamza Mahfooz <someguy@effective-light.com> writes:
>>>
>>>> If we attempt to grep non-ascii log message text with an ascii pattern, we
>>>
>>> "with an ascii pattern, when Git is built with and told to use pcre2, we"
>>>
>>>> run into the following issue:
>>>>
>>>>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>>>>     grep: (standard input): binary file matches
>>
>> I get no error message on macOS 11.6, but this result, with the underlined
>> part in red:
>>
>>    Author: ??var Arnfjörð Bjarmason <avarab@gmail.com>
>>             ^^^^^^^^^^^^^^^^^^
>>
>> So the pattern matches the second byte of a two-byte character, inserts a
>> color code in the middle and thus produces invalid output in this case.
>
> Thanks for digging into these edge cases...
>
>>>>
>>>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
>>>> output is encoded in UTF-8.
>>>
>>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>>> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>>>> +	    (!opt->ignore_locale && is_utf8_locale() &&
>>>> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>>>> +					    (p->fixed || p->is_fixed))))
>>>
>>> That's a mouthful.  It is not obvious what new condition is being
>>> added.  I had to flip the order to see the only difference is, that
>>>
>>>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>>> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>>> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
>>>> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))
>>>
>>> ... in addition to the case where the original condition holds, if
>>> we do not say "ignore locale" and the pattern is ascii-only, we
>>> apply these two option flags.  And that matches what the proposed
>>> log message explained as the condition the problem appears.
>>>
>>> So,... looks good, I guess.
>>>
>>> Thanks, will queue.
>>>
>>>
>>> Addendum.
>>>
>>> If we were reordering pieces in the condition, I wonder if there is
>>> a better way to reorganize it, though.  The original is already
>>> barely explainable with words, and with this new condition added, I
>>> am not sure if anybody can phrase the condition in simple words to
>>> others after staring it for a few minutes.  I can't.
>>>
>>> But straightening it out is best left as a future clean-up patch,
>>> separate from this series.
>>>
>>
>> It can be written as:
>>
>> 	literal = !opt->ignore_case && (p->fixed || p->is_fixed);
>> 	if (!opt->ignore_locale) {
>> 		if (!has_non_ascii(p->pattern) ||
>> 		    (is_utf8_locale() && !literal))
>> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>> 	}
>
> Whatever we go from here I'm very much for untangling that condition,
> but I guess it can be done as a follow-up too, I'll defer to Hamza
> there...
>
>> Literal patterns are those that don't use any wildcards or case-folding.
>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>> pattern only consists of ASCII characters, or if the pattern is encoded
>> in UTF-8 and is not just a literal pattern.
>>
>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>> ASCII chars?
>>
>> The old condition was (reformatted to better match the new one):
>>
>> 	if (!opt->ignore_locale) {
>> 		if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal)
>> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>> 	}
>>
>> Intuitively I'd say the condition should be:
>>
>> 	if (!opt->ignore_locale && is_utf8_locale()) {
>> 		if (has_non_ascii(p->pattern) || !literal)
>> 			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>> 	}
>>
>> If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we
>> have to match non-ASCII characters or do more than just literal
>> matching.
>>
>> For literal patterns that consist only of ASCII characters we don't need
>> the cost and complication of PCRE2_UTF.
>>
>> Makes sense?
>
>     echo 'René Scharfe' >f &&
>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>     1
>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>     f:René Scharfe
>     0
>
> So it's a choose-your-own adventure where you can pick if you're
> right. I.e. do you want the "." metacharacter to match your "é" or not?

Yes, I do, and it's what Hamza's patch is fixing.

> These sorts of patterns demonstrate nicely that the relationship between your
> pattern being ASCII and wanting or not wanting UTF-8 matching semantics
> isn't what you might imagine it to be.

Differences are:

o: opt->ignore_locale
h: has_non_ascii(p->pattern)
i: is_utf8_locale()
l: literal

o h i l master hamza rene
0 0 0 0      0     1    0
0 0 0 1      0     1    0
0 0 1 0      0     1    1   <== your first example
0 0 1 1      0     1    0
0 1 1 1      0     0    1

Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two
lines).

Turning on PCRE2_UTF for literal matching (fourth line) goes against
870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
2019-07-26).

Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth
line) also goes against that, so my intuition betrayed me.  When I
adjust it, I get:

	if (!opt->ignore_locale && is_utf8_locale() && !literal)
		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

That looks deceptively simple -- just drop has_non_ascii(p->pattern)
from the original condition.

Your second example is handle the same by all versions btw.:

o h i l master hamza rene
0 1 1 0      1     1    1

René
Junio C Hamano Oct. 17, 2021, 6 a.m. UTC | #5
René Scharfe <l.s.r@web.de> writes:

>>> Literal patterns are those that don't use any wildcards or case-folding.
>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>> in UTF-8 and is not just a literal pattern.
>>>
>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>> ASCII chars?
>>> ...
>>     echo 'René Scharfe' >f &&
>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>     1
>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>     f:René Scharfe
>>     0
>>
>> So it's a choose-your-own adventure where you can pick if you're
>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>
> Yes, I do, and it's what Hamza's patch is fixing.

That may be correct but is this discussion still about "Why enable
... for literal patterns that consist of only ASCII"?  Calling "." a
"metacharacter" and wanting it to match anything other than a single
dot would mean the pattern we are discussing is no longer "literal",
isn't it?  I am puzzled.

>> These sorts of patterns demonstrate nicely that the relationship between your
>> pattern being ASCII and wanting or not wanting UTF-8 matching semantics
>> isn't what you might imagine it to be.
>
> Differences are:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: literal
>
> o h i l master hamza rene
> 0 0 0 0      0     1    0
> 0 0 0 1      0     1    0
> 0 0 1 0      0     1    1   <== your first example
> 0 0 1 1      0     1    0
> 0 1 1 1      0     0    1
>
> Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two
> lines).
>
> Turning on PCRE2_UTF for literal matching (fourth line) goes against
> 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26).
>
> Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth
> line) also goes against that, so my intuition betrayed me.  When I
> adjust it, I get:
>
> 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> That looks deceptively simple -- just drop has_non_ascii(p->pattern)
> from the original condition.
>
> Your second example is handle the same by all versions btw.:
>
> o h i l master hamza rene
> 0 1 1 0      1     1    1
>
> René
René Scharfe Oct. 17, 2021, 6:55 a.m. UTC | #6
Am 17.10.21 um 08:00 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>>>> Literal patterns are those that don't use any wildcards or case-folding.
>>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>>> in UTF-8 and is not just a literal pattern.
>>>>
>>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>>> ASCII chars?
>>>> ...
>>>     echo 'René Scharfe' >f &&
>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>>     1
>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>>     f:René Scharfe
>>>     0
>>>
>>> So it's a choose-your-own adventure where you can pick if you're
>>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>>
>> Yes, I do, and it's what Hamza's patch is fixing.
>
> That may be correct but is this discussion still about "Why enable
> ... for literal patterns that consist of only ASCII"?  Calling "." a
> "metacharacter" and wanting it to match anything other than a single
> dot would mean the pattern we are discussing is no longer "literal",
> isn't it?  I am puzzled.

Right, Ævar's comment is not about my question, but highlights an
inconsistency in master that is fixed by Hamza's patch.

I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for
non-ASCII patterns even if they are literal or our locale is not UTF-8.
The following change would fix the edge case mentioned in its commit
message without these side-effects.  Am I correct?

diff --git a/grep.c b/grep.c
index fe847a0111..5badb6d851 100644
--- a/grep.c
+++ b/grep.c
@@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	if (!opt->ignore_locale && is_utf8_locale() &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
Ævar Arnfjörð Bjarmason Oct. 17, 2021, 9:44 a.m. UTC | #7
On Sun, Oct 17 2021, René Scharfe wrote:

> Am 17.10.21 um 08:00 schrieb Junio C Hamano:
>> René Scharfe <l.s.r@web.de> writes:
>>
>>>>> Literal patterns are those that don't use any wildcards or case-folding.
>>>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>>>> in UTF-8 and is not just a literal pattern.
>>>>>
>>>>> Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
>>>>> ASCII chars?
>>>>> ...
>>>>     echo 'René Scharfe' >f &&
>>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>>>>     1
>>>>     $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>>>>     f:René Scharfe
>>>>     0
>>>>
>>>> So it's a choose-your-own adventure where you can pick if you're
>>>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>>>
>>> Yes, I do, and it's what Hamza's patch is fixing.
>>
>> That may be correct but is this discussion still about "Why enable
>> ... for literal patterns that consist of only ASCII"?  Calling "." a
>> "metacharacter" and wanting it to match anything other than a single
>> dot would mean the pattern we are discussing is no longer "literal",
>> isn't it?  I am puzzled.
>
> Right, Ævar's comment is not about my question, but highlights an
> inconsistency in master that is fixed by Hamza's patch.

Yes, sorry about that. Just generally about the messy semantics.

> I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for
> non-ASCII patterns even if they are literal or our locale is not UTF-8.
> The following change would fix the edge case mentioned in its commit
> message without these side-effects.  Am I correct?
>
> diff --git a/grep.c b/grep.c
> index fe847a0111..5badb6d851 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +	if (!opt->ignore_locale && is_utf8_locale() &&
>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

I haven't had time to carefully look into this, but one caveat to check
out is if it works with older PCRE and whether you need e.g. the
GIT_PCRE2_VERSION_10_36_OR_HIGHER.

I tried your suggestion on top of Hamza's series, compiled PCRE v2
10.23, tested, and also tried manually removing the
PCRE2_MATCH_INVALID_UTF flag and tested again.

We pass all tests with both, so maybe this is safe to do (or maybe we're
missing some test we haven't thought of yet...).

One thing that makes me nervous is that we had breakages in the past
once the patches escaped into the wild, particularly because the code
being modified here has is_utf8_locale(), and our tests run under
LANG=c LC_ALL=C.

I tried running all the tests with a non-C locale, there's a lot of
failures, but none new with this change. As an aside the below patch
makes all but one shortlog test pass for me. I wonder if we shouldn't do
this for real to smoke out any $LANG or $LC_ALL dependencies.

I.e. almost all of the failures were due to relying on the sort order of
sort(1), and in one case comm(1), the first hunk here is also redundant
to defining our own ls(1) wrapper....

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index bd17f308b38..738ca6ef587 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -693,12 +693,12 @@ test_expect_success 'expire removes unreferenced packs' '
 		^refs/heads/C
 		EOF
 		git multi-pack-index write &&
-		ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
+		ls .git/objects/pack | sort | grep -v -e pack-[AB] >expect &&
 		git multi-pack-index expire &&
-		ls .git/objects/pack >actual &&
+		ls .git/objects/pack | sort >actual &&
 		test_cmp expect actual &&
-		ls .git/objects/pack/ | grep idx >expect-idx &&
-		test-tool read-midx .git/objects | grep idx >actual-midx &&
+		ls .git/objects/pack/ | sort | grep idx >expect-idx &&
+		test-tool read-midx .git/objects | sort | grep idx >actual-midx &&
 		test_cmp expect-idx actual-midx &&
 		git multi-pack-index verify &&
 		git fsck
@@ -802,7 +802,7 @@ test_expect_success 'expire works when adding new packs' '
 		refs/heads/E
 		EOF
 		git multi-pack-index expire &&
-		ls .git/objects/pack/ | grep idx >expect &&
+		ls .git/objects/pack/ | sort | grep idx >expect &&
 		test-tool read-midx .git/objects | grep idx >actual &&
 		test_cmp expect actual &&
 		git multi-pack-index verify
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 8361b5c1c57..f4f9d231f28 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -417,14 +417,22 @@ test -n "$BASH_VERSION" && shopt -u checkwinsize 2>/dev/null
 
 # For repeatability, reset the environment to known value.
 # TERM is sanitized below, after saving color control sequences.
-LANG=C
-LC_ALL=C
+LANG=en_US.UTF-8
+LC_ALL=en_US.UTF-8
 PAGER=cat
 TZ=UTC
 COLUMNS=80
 export LANG LC_ALL PAGER TZ COLUMNS
 EDITOR=:
 
+sort () {
+	LC_ALL=C LANG=C command sort "$@"
+}
+
+comm () {
+	LC_ALL=C LANG=C command comm "$@"
+}
+
 # A call to "unset" with no arguments causes at least Solaris 10
 # /usr/xpg4/bin/sh and /bin/ksh to bail out.  So keep the unsets
 # deriving from the command substitution clustered with the other
Andreas Schwab Nov. 15, 2021, 8:43 p.m. UTC | #8
This breaks 'PCRE v2: grep ASCII from invalid UTF-8 data'.

Andreas.
Ævar Arnfjörð Bjarmason Nov. 15, 2021, 10:41 p.m. UTC | #9
On Mon, Nov 15 2021, Andreas Schwab wrote:

> This breaks 'PCRE v2: grep ASCII from invalid UTF-8 data'.

What PCREv2 version are you using? I wonder if it's got to do with the
10.36 boundary, as noted in 95ca1f987ed (grep/pcre2: better support
invalid UTF-8 haystacks, 2021-01-24). One of the two tests matching your
description tests for that bug.

I can't remember much of the details of the bug, and it seems PCREv2's
bug tracker went private, or at least I can't view
https://bugs.exim.org/show_bug.cgi?id=2642 anymore.
Carlo Marcelo Arenas Belón Nov. 16, 2021, 2:12 a.m. UTC | #10
> I can't remember much of the details of the bug, and it seems PCREv2's
> bug tracker went private, or at least I can't view
> https://bugs.exim.org/show_bug.cgi?id=2642 anymore.

It was shutdown; all new development is now done in github[1], old
closed bugs were not migrated over though but the fix is pretty
straight forward if it needs backporting[2]

I would have expected the less risky change from René[3] to be
released though, but might be a good opportunity to take a deeper look
at the underlying problem and also have a likely solution to both the
regression and the original bug.

Carlo

[1] https://github.com/PhilipHazel/pcre2
[2] https://github.com/PhilipHazel/pcre2/commit/f8cbb1f58df05f175a6898f35dc18e26be6362d0
[3] https://lore.kernel.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
Andreas Schwab Nov. 16, 2021, 8:41 a.m. UTC | #11
On Nov 15 2021, Ævar Arnfjörð Bjarmason wrote:

> What PCREv2 version are you using?

10.31

https://build.opensuse.org/package/show/openSUSE:Leap:15.3:Update/pcre2

Andreas.
Carlo Marcelo Arenas Belón Nov. 16, 2021, 9:06 a.m. UTC | #12
On Tue, Nov 16, 2021 at 12:42 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> On Nov 15 2021, Ævar Arnfjörð Bjarmason wrote:
>
> > What PCREv2 version are you using?
>
> 10.31
>
> https://build.opensuse.org/package/show/openSUSE:Leap:15.3:Update/pcre2

interesting; that version doesn't need the workaround because it
doesn't have MATCH_INVALID_UTF that was released with 10.34, but it
has a recently backported fix for a similar issue that might be
confusing git and it is somehow related to it.

a full log (run it with -x -i, and get the last lines) would be useful.

curious also if you see other problems using `git log --grep` and
which locale do you have configured?

Carlo
Andreas Schwab Nov. 16, 2021, 9:18 a.m. UTC | #13
On Nov 16 2021, Carlo Arenas wrote:

> curious also if you see other problems using `git log --grep` and

Not sure what you are asking for.

> which locale do you have configured?

There is no locale configured during build.

Andreas.
Andreas Schwab Nov. 16, 2021, 9:29 a.m. UTC | #14
expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data': 
        git grep -h "var" invalid-0x80 >actual &&
        test_cmp expected actual &&
        git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
        test_cmp expected actual

++ git grep -h var invalid-0x80
++ test_cmp expected actual
++ test 2 -ne 2
++ eval 'diff -u' '"$@"'
+++ diff -u expected actual
++ git grep -h '(*NO_JIT)var' invalid-0x80
fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
error: last command exited with $?=128
not ok 13 - PCRE v2: grep ASCII from invalid UTF-8 data
#
#               git grep -h "var" invalid-0x80 >actual &&
#               test_cmp expected actual &&
#               git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
#               test_cmp expected actual
#

Andreas.
Carlo Marcelo Arenas Belón Nov. 16, 2021, 9:38 a.m. UTC | #15
On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>
> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
>         git grep -h "var" invalid-0x80 >actual &&
>         test_cmp expected actual &&
>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
>         test_cmp expected actual
>
> ++ git grep -h var invalid-0x80
> ++ test_cmp expected actual
> ++ test 2 -ne 2
> ++ eval 'diff -u' '"$@"'
> +++ diff -u expected actual
> ++ git grep -h '(*NO_JIT)var' invalid-0x80
> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set

That is exactly what I was worried about, this is not failing one
test, but making `git grep` unusable in any repository that has any
binary files that might be reachable by it, and it is likely affecting
anyone using PCRE older than 10.34

Reverting this specific commit might fix it and is unlikely to
introduce other issues, did you try it?

Carlo
Andreas Schwab Nov. 16, 2021, 9:55 a.m. UTC | #16
On Nov 16 2021, Carlo Arenas wrote:

> Reverting this specific commit might fix it and is unlikely to
> introduce other issues, did you try it?

Yes, of course.

Andreas.
René Scharfe Nov. 17, 2021, 6:46 p.m. UTC | #17
Am 16.11.21 um 10:38 schrieb Carlo Arenas:
> On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>>
>> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
>>         git grep -h "var" invalid-0x80 >actual &&
>>         test_cmp expected actual &&
>>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
>>         test_cmp expected actual
>>
>> ++ git grep -h var invalid-0x80
>> ++ test_cmp expected actual
>> ++ test 2 -ne 2
>> ++ eval 'diff -u' '"$@"'
>> +++ diff -u expected actual
>> ++ git grep -h '(*NO_JIT)var' invalid-0x80
>> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
>
> That is exactly what I was worried about, this is not failing one
> test, but making `git grep` unusable in any repository that has any
> binary files that might be reachable by it, and it is likely affecting
> anyone using PCRE older than 10.34

Let's have a look at the map.  Here are the differences between the
versions regarding use of PCRE2_UTF:

o: opt->ignore_locale
h: has_non_ascii(p->pattern)
i: is_utf8_locale()
l: !opt->ignore_case && (p->fixed || p->is_fixed)

o h i l master hamza rene2
0 0 0 0      0     1     0
0 0 0 1      0     1     0
0 0 1 0      0     1     1
0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging

So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
should not have this breakage, because it doesn't enable PCRE2_UTF for
literal patterns.

René
Carlo Marcelo Arenas Belón Nov. 17, 2021, 7:56 p.m. UTC | #18
On Wed, Nov 17, 2021 at 10:46 AM René Scharfe <l.s.r@web.de> wrote:
>
> Am 16.11.21 um 10:38 schrieb Carlo Arenas:
> > On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
> >>
> >> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
> >>         git grep -h "var" invalid-0x80 >actual &&
> >>         test_cmp expected actual &&
> >>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
> >>         test_cmp expected actual
> >>
> >> ++ git grep -h var invalid-0x80
> >> ++ test_cmp expected actual
> >> ++ test 2 -ne 2
> >> ++ eval 'diff -u' '"$@"'
> >> +++ diff -u expected actual
> >> ++ git grep -h '(*NO_JIT)var' invalid-0x80
> >> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
> >
> > That is exactly what I was worried about, this is not failing one
> > test, but making `git grep` unusable in any repository that has any
> > binary files that might be reachable by it, and it is likely affecting
> > anyone using PCRE older than 10.34
>
> Let's have a look at the map.  Here are the differences between the
> versions regarding use of PCRE2_UTF:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>
> o h i l master hamza rene2
> 0 0 0 0      0     1     0
> 0 0 0 1      0     1     0
> 0 0 1 0      0     1     1
> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>
> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
> should not have this breakage, because it doesn't enable PCRE2_UTF for
> literal patterns.

Correct, and indeed I had your expression instead of the ugly one in
my draft for v2[1] but then I found the tests were even more broken
than reported originally and got worried it might introduce yet
another issue because of the brittleness of the rest of the code
around it.  I am hoping it will be introduced in a follow up series
though, and unless this code gets thrown away in the promised
refactoring by Ævar which I haven't seen yet and once this is in
maint.

IMHO and in line with the previous warnings you raised[2] during the
development of this feature, it might be worth looking at it more
deeply, but at least the proposed patch keeps the feature working in
practice (the only case that won't work as expected will be when the
expression includes "." with the intention of matching a UTF-8
character and while using PCRE2 older than 10.34) and more importantly
fixes the regression that it introduced.

Carlo

[1] https://lore.kernel.org/git/20211117102329.95456-1-carenas@gmail.com/
[2] https://lore.kernel.org/git/fc7eb9fc-9521-5484-b05f-9c20086fd485@web.de/
Ævar Arnfjörð Bjarmason Nov. 17, 2021, 8:59 p.m. UTC | #19
On Wed, Nov 17 2021, René Scharfe wrote:

> Am 16.11.21 um 10:38 schrieb Carlo Arenas:
>> On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote:
>>>
>>> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
>>>         git grep -h "var" invalid-0x80 >actual &&
>>>         test_cmp expected actual &&
>>>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
>>>         test_cmp expected actual
>>>
>>> ++ git grep -h var invalid-0x80
>>> ++ test_cmp expected actual
>>> ++ test 2 -ne 2
>>> ++ eval 'diff -u' '"$@"'
>>> +++ diff -u expected actual
>>> ++ git grep -h '(*NO_JIT)var' invalid-0x80
>>> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
>>
>> That is exactly what I was worried about, this is not failing one
>> test, but making `git grep` unusable in any repository that has any
>> binary files that might be reachable by it, and it is likely affecting
>> anyone using PCRE older than 10.34
>
> Let's have a look at the map.  Here are the differences between the
> versions regarding use of PCRE2_UTF:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>
> o h i l master hamza rene2
> 0 0 0 0      0     1     0
> 0 0 0 1      0     1     0
> 0 0 1 0      0     1     1
> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>
> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
> should not have this breakage, because it doesn't enable PCRE2_UTF for
> literal patterns.

PCRE2_UTF will also matter for literal patterns. Try to peel apart the
two bytes in "é" and match them under -i with/without PCRE_UTF.

I don't know what the right behavior should be here (haven't had time to
dig), but it matters for matching.
Carlo Marcelo Arenas Belón Nov. 17, 2021, 9:53 p.m. UTC | #20
On Wed, Nov 17, 2021 at 1:01 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
> two bytes in "é" and match them under -i with/without PCRE_UTF.

Is there a real use case for why someone would do that? and how is
that "literal" valid UTF to warrant setting PCRE2_UTF?
I would expect that someone including random bytes in the expression
is really more interested in binary matching anyway and the use of -i
with it probably should be an error.

Indeed I suspect the fact that pcre2_compile lets it through might be
a bug in PCRE2

$ pcre2test
PCRE2 version 10.39 2021-10-29
  re> /\303/utf,caseless
data> \303
 0: \x{c3}
data> é
No match

Carlo
Ævar Arnfjörð Bjarmason Nov. 18, 2021, midnight UTC | #21
On Wed, Nov 17 2021, Carlo Arenas wrote:

> On Wed, Nov 17, 2021 at 1:01 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
>> two bytes in "é" and match them under -i with/without PCRE_UTF.
>
> Is there a real use case for why someone would do that? and how is
> that "literal" valid UTF to warrant setting PCRE2_UTF?

We don't check the validity of the input pattern as correct UTF-8 before
feeding it to PCRE. We don't even check if you're under a UTF-8 locale,
a call to is_utf8_locale() is one of the conditions we'll try (but check
how loose we are under NO_GETTEXT=Y even then), but it's not a necessary
one.

So in practice we'll likely have users pasting say Big5, UTF-32, JIS
encoding or whatever into "git grep" and then grepping arbitrary binary
data.

I don't really have a specific use-case in mind. I'm just trying to
counter the apparent recurring misconception that's popped up more than
once in these recent threads that the UTF-8 *mode* is only something we
need to worry about if the pattern contains non-ASCII.

The implications of UTF-8 as a matching mode go much deeper than that in
Perl's and PCRE's regex engines, e.g. in this case what's considered a
character boundary.

> I would expect that someone including random bytes in the expression
> is really more interested in binary matching anyway and the use of -i
> with it probably should be an error.
>
> Indeed I suspect the fact that pcre2_compile lets it through might be
> a bug in PCRE2
>
> $ pcre2test
> PCRE2 version 10.39 2021-10-29
>   re> /\303/utf,caseless
> data> \303
>  0: \x{c3}
> data> é
> No match

It's really not "random bytes".

We're not talking about someone doing a git grep where the argument is
fed from a "cat" of /dev/urandom. Just plain old boring natural language
encodings in common use can and will conflict with what's considered a
character boundary in UTF-8,

Anyway, I haven't had time to re-page this topic at large into my
wetware and really don't know what we should be doing exactly at this
point to move forward, except my previous suggestion to either revert
the recent changes, or at least to narrow them down so that we get the
old behavior for everything except the revision.c "git log" caller.

I think a really good next step aside from dealing with that immediate
problem would be to harden some of our test data in a way similar to
what we've got in t7816-grep-binary-pattern.sh, but for partial matches,
mixtures of valid/invalid/partial character patterns and subjects etc.

We might be able to lift some tests from existing test suites,
perl.git's t/re/re_tests and pcre2.git's RunGrepTest (And testdata/
directory) are probably good places to start looking.

We don't need to test e.g. PCRE itself independently, but it is worth
testing how whatever special-sauce we add on top (if and when to turn on
its flags based on heuristics) impacts things.

We've also got the problem that whatever code we write will target
multiple PCREv2 versions, which isn't something PCREv2 itself needs to
deal with.
Junio C Hamano Nov. 18, 2021, 6:17 p.m. UTC | #22
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> Let's have a look at the map.  Here are the differences between the
>> versions regarding use of PCRE2_UTF:
>>
>> o: opt->ignore_locale
>> h: has_non_ascii(p->pattern)
>> i: is_utf8_locale()
>> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>>
>> o h i l master hamza rene2
>> 0 0 0 0      0     1     0
>> 0 0 0 1      0     1     0
>> 0 0 1 0      0     1     1
>> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>>
>> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
>> should not have this breakage, because it doesn't enable PCRE2_UTF for
>> literal patterns.
>
> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
> two bytes in "é" and match them under -i with/without PCRE_UTF.

Sorry for being late to the party, but doesn't "literal" in the
context of this thread mean the column "l" above, i.e. we are not
ignoring case and fixed or is_fixed member is set?  So "under -i"
disqualifies as an example for "will also matter for literal",
doesn't it?

In hindsight, I guess we could have pushed a bit harder when René's

-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	if (!opt->ignore_locale && is_utf8_locale() &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
(is that what is called 'rene2' above?) was raised on Oct 17th to
amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from
having this breakage?

Carlo, in your [PATCH v2] in <20211117102329.95456-1-carenas@gmail.com>,
I see that the #else side for older PCREv2 users essentially reverts
what Hamza's [PATCH v13 3/3] did to this area.

+#else
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+		options |= PCRE2_UTF;
+#endif

I guess this is a lot of change in the amount of text involved but
the least amount of actual change in the behaviour.  For those with
newer PCREv2, the behaviour would be the same as v2.34.0, and for
others, the behaviour would be the same as v2.33.0.

Having said all that, because the consensus seems to be that the
whole "when we should match in UTF mode" may need to be rethought, I
think reverting Hamza's [v13 3/3] would be the simplest way to clean
up the mess for v2.34.1 that will give us a cleaner slate to later
build on, than applying this patch.

So, I dunno.  Comments from those involved in the discussion?

Thanks.
René Scharfe Nov. 18, 2021, 8:57 p.m. UTC | #23
Am 18.11.21 um 19:17 schrieb Junio C Hamano:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> Let's have a look at the map.  Here are the differences between the
>>> versions regarding use of PCRE2_UTF:
>>>
>>> o: opt->ignore_locale
>>> h: has_non_ascii(p->pattern)
>>> i: is_utf8_locale()
>>> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>>>
>>> o h i l master hamza rene2
>>> 0 0 0 0      0     1     0
>>> 0 0 0 1      0     1     0
>>> 0 0 1 0      0     1     1
>>> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>>>
>>> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
>>> should not have this breakage, because it doesn't enable PCRE2_UTF for
>>> literal patterns.
>>
>> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
>> two bytes in "é" and match them under -i with/without PCRE_UTF.
>
> Sorry for being late to the party, but doesn't "literal" in the
> context of this thread mean the column "l" above, i.e. we are not
> ignoring case and fixed or is_fixed member is set?  So "under -i"
> disqualifies as an example for "will also matter for literal",
> doesn't it?

Correct.

> In hindsight, I guess we could have pushed a bit harder when René's
>
> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +	if (!opt->ignore_locale && is_utf8_locale() &&
>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
> (is that what is called 'rene2' above?) was raised on Oct 17th to
> amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from
> having this breakage?

Yes, that the change I meant with "rene2".

> Carlo, in your [PATCH v2] in <20211117102329.95456-1-carenas@gmail.com>,
> I see that the #else side for older PCREv2 users essentially reverts
> what Hamza's [PATCH v13 3/3] did to this area.
>
> +#else
> +	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> +		options |= PCRE2_UTF;
> +#endif
>
> I guess this is a lot of change in the amount of text involved but
> the least amount of actual change in the behaviour.  For those with
> newer PCREv2, the behaviour would be the same as v2.34.0, and for
> others, the behaviour would be the same as v2.33.0.
>
> Having said all that, because the consensus seems to be that the
> whole "when we should match in UTF mode" may need to be rethought, I
> think reverting Hamza's [v13 3/3] would be the simplest way to clean
> up the mess for v2.34.1 that will give us a cleaner slate to later
> build on, than applying this patch.

Makes sense to me.  It gives a better starting point to solve the issue
afresh without getting entangled in mind-melting boolean expressions.

René
Ævar Arnfjörð Bjarmason Nov. 19, 2021, 7 a.m. UTC | #24
On Thu, Nov 18 2021, René Scharfe wrote:

> Am 18.11.21 um 19:17 schrieb Junio C Hamano:
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> [...]
>> I guess this is a lot of change in the amount of text involved but
>> the least amount of actual change in the behaviour.  For those with
>> newer PCREv2, the behaviour would be the same as v2.34.0, and for
>> others, the behaviour would be the same as v2.33.0.
>>
>> Having said all that, because the consensus seems to be that the
>> whole "when we should match in UTF mode" may need to be rethought, I
>> think reverting Hamza's [v13 3/3] would be the simplest way to clean
>> up the mess for v2.34.1 that will give us a cleaner slate to later
>> build on, than applying this patch.
>
> Makes sense to me.  It gives a better starting point to solve the issue
> afresh without getting entangled in mind-melting boolean expressions.

Yes, agreed. As noted I haven't had time to dig deeply into this, but
from what I've seen so far there doesn't seem to be any obvious way
forward in terms of a quick fix.

I thought perhaps your patch would be that (but I haven't looked into it
carefully enough), but since you're on-board with reverting & retrying.
René Scharfe Nov. 19, 2021, 4:08 p.m. UTC | #25
Am 19.11.21 um 08:00 schrieb Ævar Arnfjörð Bjarmason:
>
> On Thu, Nov 18 2021, René Scharfe wrote:
>
>> Am 18.11.21 um 19:17 schrieb Junio C Hamano:
>>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>> [...]
>>> I guess this is a lot of change in the amount of text involved but
>>> the least amount of actual change in the behaviour.  For those with
>>> newer PCREv2, the behaviour would be the same as v2.34.0, and for
>>> others, the behaviour would be the same as v2.33.0.
>>>
>>> Having said all that, because the consensus seems to be that the
>>> whole "when we should match in UTF mode" may need to be rethought, I
>>> think reverting Hamza's [v13 3/3] would be the simplest way to clean
>>> up the mess for v2.34.1 that will give us a cleaner slate to later
>>> build on, than applying this patch.
>>
>> Makes sense to me.  It gives a better starting point to solve the issue
>> afresh without getting entangled in mind-melting boolean expressions.
>
> Yes, agreed. As noted I haven't had time to dig deeply into this, but
> from what I've seen so far there doesn't seem to be any obvious way
> forward in terms of a quick fix.
>
> I thought perhaps your patch would be that (but I haven't looked into it
> carefully enough), but since you're on-board with reverting & retrying.

That patch should fix the edge case without any side-effects -- at least
I haven't seen any reports of ill effects that would apply to it.  It's
easier to understand and reason about when applied after reverting, I
think.  But it's only for grep.c and I don't know the situation in t/.

René
Junio C Hamano Nov. 19, 2021, 5:11 p.m. UTC | #26
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Nov 18 2021, René Scharfe wrote:
>>
>> Makes sense to me.  It gives a better starting point to solve the issue
>> afresh without getting entangled in mind-melting boolean expressions.
>
> Yes, agreed. As noted I haven't had time to dig deeply into this, but
> from what I've seen so far there doesn't seem to be any obvious way
> forward in terms of a quick fix.
>
> I thought perhaps your patch would be that (but I haven't looked into it
> carefully enough), but since you're on-board with reverting & retrying.

OK.  Here is what I'll queue 'master', in the hope of later merging
down and be part of v2.34.1.

Thanks, all.

----- >8 --------- >8 --------- >8 --------- >8 -----
Subject: [PATCH] Revert "grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data"

This reverts commit ae39ba431ab861548eb60b4bd2e1d8b8813db76f, as it
breaks "grep" when looking for a string in non UTF-8 haystack, when
linked with certain versions of PCREv2 library.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 grep.c                          |  6 ++---
 t/t7812-grep-icase-non-ascii.sh | 48 ---------------------------------
 2 files changed, 2 insertions(+), 52 deletions(-)

diff --git a/grep.c b/grep.c
index f6e113e9f0..fe847a0111 100644
--- a/grep.c
+++ b/grep.c
@@ -382,10 +382,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
-	    (!opt->ignore_locale && is_utf8_locale() &&
-	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
-					    (p->fixed || p->is_fixed))))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 22487d90fd..e5d1e4ea68 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -53,54 +53,6 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
 	test_cmp expected actual
 '
 
-test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on UTF-8 data' '
-	cat >expected <<-\EOF &&
-	Author: <BOLD;RED>À Ú Thor<RESET> <author@example.com>
-	EOF
-	test_write_lines "forth" >file4 &&
-	git add file4 &&
-	git commit --author="À Ú Thor <author@example.com>" -m sécond &&
-	git log -1 --color=always --perl-regexp --author=".*Thor" >log &&
-	grep Author log >actual.raw &&
-	test_decode_color <actual.raw >actual &&
-	test_cmp expected actual
-'
-
-test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern on ISO-8859-1 data' '
-	cat >expected <<-\EOF &&
-	Commit:     Ç<BOLD;RED> O Mîtter <committer@example.com><RESET>
-	EOF
-	test_write_lines "fifth" >file5 &&
-	git add file5 &&
-	GIT_COMMITTER_NAME="Ç O Mîtter" &&
-	GIT_COMMITTER_EMAIL="committer@example.com" &&
-	git -c i18n.commitEncoding=latin1 commit -m thïrd &&
-	git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
-	grep Commit: log >actual.raw &&
-	test_decode_color <actual.raw >actual &&
-	test_cmp expected actual
-'
-
-test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on UTF-8 data' '
-	cat >expected <<-\EOF &&
-	    sé<BOLD;RED>con<RESET>d
-	EOF
-	git log -1 --color=always --perl-regexp --grep="con" >log &&
-	grep con log >actual.raw &&
-	test_decode_color <actual.raw >actual &&
-	test_cmp expected actual
-'
-
-test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on ISO-8859-1 data' '
-	cat >expected <<-\EOF &&
-	    <BOLD;RED>thïrd<RESET>
-	EOF
-	git -c i18n.logOutputEncoding=latin1 log -1 --color=always --perl-regexp --grep="th.*rd" >log &&
-	grep "th.*rd" log >actual.raw &&
-	test_decode_color <actual.raw >actual &&
-	test_cmp expected actual
-'
-
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
 	printf "\\200\\n" >invalid-0x80 &&
 	echo "ævar" >expected &&
Carlo Marcelo Arenas Belón Nov. 19, 2021, 5:33 p.m. UTC | #27
On Fri, Nov 19, 2021 at 8:08 AM René Scharfe <l.s.r@web.de> wrote:
>
> Am 19.11.21 um 08:00 schrieb Ævar Arnfjörð Bjarmason:
> >
> > On Thu, Nov 18 2021, René Scharfe wrote:
> >
> >> Am 18.11.21 um 19:17 schrieb Junio C Hamano:
> >>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> >> [...]
> >>> I guess this is a lot of change in the amount of text involved but
> >>> the least amount of actual change in the behaviour.  For those with
> >>> newer PCREv2, the behaviour would be the same as v2.34.0, and for
> >>> others, the behaviour would be the same as v2.33.0.
> >>>
> >>> Having said all that, because the consensus seems to be that the
> >>> whole "when we should match in UTF mode" may need to be rethought, I
> >>> think reverting Hamza's [v13 3/3] would be the simplest way to clean
> >>> up the mess for v2.34.1 that will give us a cleaner slate to later
> >>> build on, than applying this patch.
> >>
> >> Makes sense to me.  It gives a better starting point to solve the issue
> >> afresh without getting entangled in mind-melting boolean expressions.
> >
> > Yes, agreed. As noted I haven't had time to dig deeply into this, but
> > from what I've seen so far there doesn't seem to be any obvious way
> > forward in terms of a quick fix.
> >
> > I thought perhaps your patch would be that (but I haven't looked into it
> > carefully enough), but since you're on-board with reverting & retrying.
>
> That patch should fix the edge case without any side-effects -- at least
> I haven't seen any reports of ill effects that would apply to it.

Since it isn't restricted to log, it will still cause a regression to
the `git grep` case with binary data for versions of PCRE2 older than
10.34 and unlike the previous one it might not trigger an error in the
testsuite just because we are missing a test for it.

> It's
> easier to understand and reason about when applied after reverting, I
> think.  But it's only for grep.c and I don't know the situation in t/.

We had been focusing in PCRE in this discussion, but I see the strict
behaviour of older PCRE2 as just a "coal mine canary" to point to the
bigger problem that we are expecting git's regex to handle safely and
correctly from the point of UTF, what is technically a binary match
with --color making the mismatch obvious.

The issue is not unique to PCRE, and seems Ævar also acknowledges[1]
that by seeing the same bug this was attempting to fix with probably
some version of glibc's ERE.  I suspect FreeBSD's (and derivatives) is
also broken and might be throwing REGILLSEQ errors as well, so I think
that it is better to revert the whole thing.

Carlo

[1] https://lore.kernel.org/git/211119.86r1bc4om5.gmgdl@evledraar.gmail.com/
diff mbox series

Patch

diff --git a/grep.c b/grep.c
index fe847a0111..f6e113e9f0 100644
--- a/grep.c
+++ b/grep.c
@@ -382,8 +382,10 @@  static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
-	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
+	    (!opt->ignore_locale && is_utf8_locale() &&
+	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
+					    (p->fixed || p->is_fixed))))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index e5d1e4ea68..22487d90fd 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -53,6 +53,54 @@  test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on UTF-8 data' '
+	cat >expected <<-\EOF &&
+	Author: <BOLD;RED>À Ú Thor<RESET> <author@example.com>
+	EOF
+	test_write_lines "forth" >file4 &&
+	git add file4 &&
+	git commit --author="À Ú Thor <author@example.com>" -m sécond &&
+	git log -1 --color=always --perl-regexp --author=".*Thor" >log &&
+	grep Author log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern on ISO-8859-1 data' '
+	cat >expected <<-\EOF &&
+	Commit:     Ç<BOLD;RED> O Mîtter <committer@example.com><RESET>
+	EOF
+	test_write_lines "fifth" >file5 &&
+	git add file5 &&
+	GIT_COMMITTER_NAME="Ç O Mîtter" &&
+	GIT_COMMITTER_EMAIL="committer@example.com" &&
+	git -c i18n.commitEncoding=latin1 commit -m thïrd &&
+	git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
+	grep Commit: log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on UTF-8 data' '
+	cat >expected <<-\EOF &&
+	    sé<BOLD;RED>con<RESET>d
+	EOF
+	git log -1 --color=always --perl-regexp --grep="con" >log &&
+	grep con log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on ISO-8859-1 data' '
+	cat >expected <<-\EOF &&
+	    <BOLD;RED>thïrd<RESET>
+	EOF
+	git -c i18n.logOutputEncoding=latin1 log -1 --color=always --perl-regexp --grep="th.*rd" >log &&
+	grep "th.*rd" log >actual.raw &&
+	test_decode_color <actual.raw >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
 	printf "\\200\\n" >invalid-0x80 &&
 	echo "ævar" >expected &&