diff mbox series

grep: avoid setting UTF mode when not needed

Message ID 20211116110035.22140-1-carenas@gmail.com (mailing list archive)
State New, archived
Headers show
Series grep: avoid setting UTF mode when not needed | expand

Commit Message

Carlo Marcelo Arenas Belón Nov. 16, 2021, 11 a.m. UTC
Since ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns
and UTF-8 data, 2021-10-15), PCRE_UTF mode is enabled in cases where it
will fail because of UTF-8 validation, which is needed for versions of
PCRE2 older than 10.34.

Revert the change on logic to avoid failures that were reported from the
test cases, but that should also reflect in normal use when JIT is enabled
and could result in crashes (or worse), as UTF-8 validation is skipped.

Keeping the tests, as they pass even without the fix as replicated locally
in Debian 10 and the CI.

Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Nov. 16, 2021, 12:32 p.m. UTC | #1
On Tue, Nov 16 2021, Carlo Marcelo Arenas Belón wrote:

> Since ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns
> and UTF-8 data, 2021-10-15), PCRE_UTF mode is enabled in cases where it
> will fail because of UTF-8 validation, which is needed for versions of
> PCRE2 older than 10.34.
>
> Revert the change on logic to avoid failures that were reported from the
> test cases, but that should also reflect in normal use when JIT is enabled
> and could result in crashes (or worse), as UTF-8 validation is skipped.
>
> Keeping the tests, as they pass even without the fix as replicated locally
> in Debian 10 and the CI.
>
> Reported-by: Andreas Schwab <schwab@linux-m68k.org>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
>  grep.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index f6e113e9f0..fe847a0111 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -382,10 +382,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
> -	    (!opt->ignore_locale && is_utf8_locale() &&
> -	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
> -					    (p->fixed || p->is_fixed))))
> +	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>  
>  #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER

Hrm.

A few things:

First, if we've got a post-PCREv2 version whatever fix let's guard that
with an ifdef, see thep GIT_PCRE2_VERSION_*_HIGHER at the top of grep.h.

It really helps to have those, both to know to test on those older
versions, and also so that we can at some point in the future bump the
required version and drop these workarounds entirely (as we do with
git-curl-compat.h).

Secondly, whatever do here let's first fix the test added in ae39ba431a,
so we're not groping around in the dark even more.

I didn't spot this at the time but the test that Hamza added in that
based on my initial report[1] is broken & doesn't test anything
meaningful. It needs to have this applied:
	
	diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
	index 22487d90fdc..1da6b07a579 100755
	--- a/t/t7812-grep-icase-non-ascii.sh
	+++ b/t/t7812-grep-icase-non-ascii.sh
	@@ -60,7 +60,7 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on U
	        test_write_lines "forth" >file4 &&
	        git add file4 &&
	        git commit --author="À Ú Thor <author@example.com>" -m sécond &&
	-       git log -1 --color=always --perl-regexp --author=".*Thor" >log &&
	+       git log -1 --color=always --perl-regexp --author=". . Thor" >log &&
	        grep Author log >actual.raw &&
	        test_decode_color <actual.raw >actual &&
	        test_cmp expected actual

I.e. the whole point of using the color output to test this is to
discover where PCRE2 is going to consider a character boundary to be,
using .* means that it won't be tested at add, since .* will happily eat
up whatever arbitrary data it finds with or without UTF-8 mode.

Other tests added in that & adjacent (if any?) commits may have the same
issue, I haven't dug into it.

If we lead with that patch we'll get the test passing on master as
before, but with your patch above it'll break. I.e. the "when not
needed' in the $subject isn't true, it's just that the test is
completely broken.

In the context of this being a pretty urgent post-release fix (but I
don't know if Junio would consider a point-release, so perhaphs it's
not) I'd be OK with either of:

 A. Let's back out this new log grep color thing entirely while we
    reconsider this. The gitster/hm/paint-hits-in-log-grep topic
    currently reverts cleanly.

 B. Don't break the new log grep color thing, and also fix the 'grepping
    binary' regression (which is much more important than having A)

But let's not go for some in-between where we break the new feature to
the point of it being worse than the state of not having it at all in
v2.33.0.

I.e. without the that log grep color feature we wouldn't screw up the
display of non-ASCII characters in log output (yay), in v2.34.0 we
don't, but also color the match (yay), but we broke grepping binary
*files* (boo!).

I think the approach I suggested in [2] is a much more viable way
forward, i.e. let's stop fiddling with this giant nested if statement
that's mainly meant for the grep-a-file-case, revert to the
pre-log-grep-color state, and have the log-grep-color mode pass in a
"yes, I'd like the UTF-8 mode, please".

Having said that that's probably also broken, just in ways we're less
likely to spot (or maybe some of the log encoding settings/reencoding
saves us there?), we may need to have that ifdef'd to 10.34 and higher,
or have some opt-in setting for this.

1. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
Hamza Mahfooz Nov. 16, 2021, 1:35 p.m. UTC | #2
Does anyone know if René's patch causes older PCRE versions to break? 
If it doesn't I think René's patch plus Ævar's fix is the way to go.

On Tue, Nov 16 2021 at 01:32:17 PM +0100, Ævar Arnfjörð Bjarmason 
<avarab@gmail.com> wrote:
> 
> On Tue, Nov 16 2021, Carlo Marcelo Arenas Belón wrote:
> 
>>  Since ae39ba431a (grep/pcre2: fix an edge case concerning ascii 
>> patterns
>>  and UTF-8 data, 2021-10-15), PCRE_UTF mode is enabled in cases 
>> where it
>>  will fail because of UTF-8 validation, which is needed for versions 
>> of
>>  PCRE2 older than 10.34.
>> 
>>  Revert the change on logic to avoid failures that were reported 
>> from the
>>  test cases, but that should also reflect in normal use when JIT is 
>> enabled
>>  and could result in crashes (or worse), as UTF-8 validation is 
>> skipped.
>> 
>>  Keeping the tests, as they pass even without the fix as replicated 
>> locally
>>  in Debian 10 and the CI.
>> 
>>  Reported-by: Andreas Schwab <schwab@linux-m68k.org>
>>  Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
>>  ---
>>   grep.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>> 
>>  diff --git a/grep.c b/grep.c
>>  index f6e113e9f0..fe847a0111 100644
>>  --- a/grep.c
>>  +++ b/grep.c
>>  @@ -382,10 +382,8 @@ static void compile_pcre2_pattern(struct 
>> grep_pat *p, const struct grep_opt *opt
>>   		}
>>   		options |= PCRE2_CASELESS;
>>   	}
>>  -	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>>  -	    (!opt->ignore_locale && is_utf8_locale() &&
>>  -	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>>  -					    (p->fixed || p->is_fixed))))
>>  +	if (!opt->ignore_locale && is_utf8_locale() && 
>> has_non_ascii(p->pattern) &&
>>  +	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>   		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>> 
>>   #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER
> 
> Hrm.
> 
> A few things:
> 
> First, if we've got a post-PCREv2 version whatever fix let's guard 
> that
> with an ifdef, see thep GIT_PCRE2_VERSION_*_HIGHER at the top of 
> grep.h.
> 
> It really helps to have those, both to know to test on those older
> versions, and also so that we can at some point in the future bump the
> required version and drop these workarounds entirely (as we do with
> git-curl-compat.h).
> 
> Secondly, whatever do here let's first fix the test added in 
> ae39ba431a,
> so we're not groping around in the dark even more.
> 
> I didn't spot this at the time but the test that Hamza added in that
> based on my initial report[1] is broken & doesn't test anything
> meaningful. It needs to have this applied:
> 
> 	diff --git a/t/t7812-grep-icase-non-ascii.sh 
> b/t/t7812-grep-icase-non-ascii.sh
> 	index 22487d90fdc..1da6b07a579 100755
> 	--- a/t/t7812-grep-icase-non-ascii.sh
> 	+++ b/t/t7812-grep-icase-non-ascii.sh
> 	@@ -60,7 +60,7 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log 
> --author with an ascii pattern on U
> 	        test_write_lines "forth" >file4 &&
> 	        git add file4 &&
> 	        git commit --author="À Ú Thor <author@example.com>" -m 
> sécond &&
> 	-       git log -1 --color=always --perl-regexp --author=".*Thor" 
> >log &&
> 	+       git log -1 --color=always --perl-regexp --author=". . Thor" 
> >log &&
> 	        grep Author log >actual.raw &&
> 	        test_decode_color <actual.raw >actual &&
> 	        test_cmp expected actual
> 
> I.e. the whole point of using the color output to test this is to
> discover where PCRE2 is going to consider a character boundary to be,
> using .* means that it won't be tested at add, since .* will happily 
> eat
> up whatever arbitrary data it finds with or without UTF-8 mode.
> 
> Other tests added in that & adjacent (if any?) commits may have the 
> same
> issue, I haven't dug into it.
> 
> If we lead with that patch we'll get the test passing on master as
> before, but with your patch above it'll break. I.e. the "when not
> needed' in the $subject isn't true, it's just that the test is
> completely broken.
> 
> In the context of this being a pretty urgent post-release fix (but I
> don't know if Junio would consider a point-release, so perhaphs it's
> not) I'd be OK with either of:
> 
>  A. Let's back out this new log grep color thing entirely while we
>     reconsider this. The gitster/hm/paint-hits-in-log-grep topic
>     currently reverts cleanly.
> 
>  B. Don't break the new log grep color thing, and also fix the 
> 'grepping
>     binary' regression (which is much more important than having A)
> 
> But let's not go for some in-between where we break the new feature to
> the point of it being worse than the state of not having it at all in
> v2.33.0.
> 
> I.e. without the that log grep color feature we wouldn't screw up the
> display of non-ASCII characters in log output (yay), in v2.34.0 we
> don't, but also color the match (yay), but we broke grepping binary
> *files* (boo!).
> 
> I think the approach I suggested in [2] is a much more viable way
> forward, i.e. let's stop fiddling with this giant nested if statement
> that's mainly meant for the grep-a-file-case, revert to the
> pre-log-grep-color state, and have the log-grep-color mode pass in a
> "yes, I'd like the UTF-8 mode, please".
> 
> Having said that that's probably also broken, just in ways we're less
> likely to spot (or maybe some of the log encoding settings/reencoding
> saves us there?), we may need to have that ifdef'd to 10.34 and 
> higher,
> or have some opt-in setting for this.
> 
> 1. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
Carlo Marcelo Arenas Belón Nov. 16, 2021, 6:22 p.m. UTC | #3
On Tue, Nov 16, 2021 at 4:50 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> First, if we've got a post-PCREv2 version whatever fix let's guard that
> with an ifdef, see thep GIT_PCRE2_VERSION_*_HIGHER at the top of grep.h.

The availability of PCRE2_MATCH_INVALID_UTF does that implicitly, but
adding one makes sense to make it explicit; specially if it leads to
better testing.

> In the context of this being a pretty urgent post-release fix (but I
> don't know if Junio would consider a point-release, so perhaphs it's
> not) I'd be OK with either of:
>
>  A. Let's back out this new log grep color thing entirely while we
>     reconsider this. The gitster/hm/paint-hits-in-log-grep topic
>     currently reverts cleanly.

Agree that reverting the whole feature makes more sense, but was
aiming to the minimum change required to allow for a "brown bag"
release with this.

The way the PCRE2 integration works, uses the fast path that presumes
valid UTF-8 and is documented[1] to have "Undefined Behaviour" when
the subject that is being searched on is not.

That has been shown before to lead to crashes or infinite loops

Carlo

[1] https://www.pcre.org/current/doc/html/pcre2jit.html#SEC4
Junio C Hamano Nov. 16, 2021, 6:48 p.m. UTC | #4
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> In the context of this being a pretty urgent post-release fix (but I
> don't know if Junio would consider a point-release, so perhaphs it's
> not) I'd be OK with either of:
>
>  A. Let's back out this new log grep color thing entirely while we
>     reconsider this. The gitster/hm/paint-hits-in-log-grep topic
>     currently reverts cleanly.
>
>  B. Don't break the new log grep color thing, and also fix the 'grepping
>     binary' regression (which is much more important than having A)
>
> But let's not go for some in-between where we break the new feature to
> the point of it being worse than the state of not having it at all in
> v2.33.0.
>
> I.e. without the that log grep color feature we wouldn't screw up the
> display of non-ASCII characters in log output (yay), in v2.34.0 we
> don't, but also color the match (yay), but we broke grepping binary
> *files* (boo!).

Sorry, but with too many new patches on the list that are not
urgent, while we wanted to see us work solely on post-release
regression fixes, I do not seem to be able to locate the reports of
this "breakage" and the other binary-file breakage.

But in any case, yes, since the 2.33.0 cycle was run deliberately
loosely to take undercooked topics to 'next' without much reviews
(no, "I looked at some part and they looked OK" is not a review),
and because any topic in 'next', by default, graduates to 'master'
solely on time basis, I am fully expecting that we'd have to issue
2.33.1 (and possibly .2) during the first two or three weeks of this
cycle.  So let's make sure we fix any iffy ones.

Downthread Carlo seem to agree with you that it would be better to
revert the paint-hits-in-log-grep topic wholesale; I have not yet
formed an opinion, as I haven't seen the breakage reports as I said.

As I will be offline most of the day and perhaps tomorrow, nothing
may happen in the meantime on my end, but hopefully we'll see a
reviewed and ready-to-be-applied patchset that people are happy with
by the time I look at the list again ;-)?

Thanks.
Ævar Arnfjörð Bjarmason Nov. 17, 2021, 2:31 p.m. UTC | #5
On Tue, Nov 16 2021, Hamza Mahfooz wrote:

> Does anyone know if René's patch causes older PCRE versions to break?
> If it doesn't I think René's patch plus Ævar's fix is the way to go.

I haven't tested, but I think a very good thing to do/test (hint hint!)
is to add a CI job where we build/test git in combination with various
PCRE versions.

PCRE2 used to be in SVN, but nowadays it's in a git repository on
github: https://github.com/PhilipHazel/pcre2

It should be a small matter of copying the existing template to set up a
"matrix" where we test various known older pcre versions, they're
available as tags in the git repository.

"--enable-jit" and "--disable-jit" should be part of that matrix too,
and "--enable-utf" and "--disable-utf".

See the "microsoft/vcpkg" and "regular:" parts of
.github/workflows/main.git for relevant examples, i.e. this would
e.g. clone pcre2.git into compat/pcre2-repo, check out a given version
from the matrix. Then:

    cd compat/pcre2-repo &&
    ./autogen.sh &&
    ./configure --prefix="$PWD/inst" &&
    make install

Then at the top-level:

    make USE_LIBPCRE=Y CFLAGS=-O3 LIBPCREDIR="$PWD/compat/pcre2-repo/inst"

Then run "make test", for saving some CPU & speeding up the runs it
should be more than sufficient to only run those tests that match
*{grep,log,pickaxe}*
diff mbox series

Patch

diff --git a/grep.c b/grep.c
index f6e113e9f0..fe847a0111 100644
--- a/grep.c
+++ b/grep.c
@@ -382,10 +382,8 @@  static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
-	    (!opt->ignore_locale && is_utf8_locale() &&
-	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
-					    (p->fixed || p->is_fixed))))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER