Message ID | 20230117105123.58328-1-carenas@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] grep: correctly identify utf-8 characters with \{b,w} in -P | expand |
On Tue, Jan 17 2023, Carlo Marcelo Arenas Belón wrote: > When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used > by the pcre2_compile() call, but that would only allow for the > use of Unicode character properties when caseless is required > but not to include the additional UTF characters for all other > class matches. > > This would result in failed matches for expressions that rely > on those properties, for ex: > > $ git grep -P '\bÆvar' > > Add a configuration that could be used to enable the PCRE2_UCP > flag to correctly match those cases, when required. > > The use of this has an impact on performance that has been estimated > to be significant. > > Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> > --- > Changes since v2: > * make setting UCP and opt-in as suggested by Ævar To argue with myself here, I'm not so sure that just making this the default isn't the right move, especially as the GNU grep maintainer seems to be convinced that that's the right thing for grep(1). We've usually just followed GNU grep semantics, so if it's doing X and we're doing Y after this it's probably better to unify our behavior with theirs. I was mainly concerned with the behavior change sneaking in as a mere bugfix, I think it's OK if we change the behavior, as long as we're going into it with our eyes open... > * remove performance test and instead add a test ...I didn't follow the thread(s) where this may have been discussed, but I for one would like to see a perf test with this, but maybe it was removed for a good reason that I'm not aware of... > Documentation/config/grep.txt | 6 ++++++ > grep.c | 11 ++++++++++- > grep.h | 1 + > t/t7810-grep.sh | 13 +++++++++++++ > 4 files changed, 30 insertions(+), 1 deletion(-) > > diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt > index e521f20390..8848db7311 100644 > --- a/Documentation/config/grep.txt > +++ b/Documentation/config/grep.txt > @@ -26,3 +26,9 @@ grep.fullName:: > grep.fallbackToNoIndex:: > If set to true, fall back to git grep --no-index if git grep > is executed outside of a git repository. Defaults to false. > + > +pcre.ucp:: > + If set to true, will use all Unicode Character Properties when matching > + `\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used > + as the underlying engine. If PCRE is not being used it is ignored. > + Defaults to false There's a couple of exceptions to this, but we tend to stick config docs in their corresponding Documentation/config/<namespace>.txt, so this should be in a new Documentation/config/pcre.txt if we're adding this name. But I'd rather that we don't expose the implementation detail that we're using PCRE, which we haven't done so far. We just have a "grep.patternType=perl", which (and that's another issue, in any case) we should explicitly mention here, i.e. that this is for use with that config (and corresponding option(s)). I think calling this e.g.: grep.perl.Unicode=<bool> grep.patternTypePerl.Unicode=<bool> Or even: grep.patternTypePerl.Flags=u Would be better, i.e. PCRE's C API is really just mapping to the flags you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In this case the /u flag maps to the "PCRE2_UCP" API flag. That we happen to use PCRE to give ourselves "Perl" semantics is an implementation detail we should avoid exposing, so we could either give our config generic names, or literally map to the perl /flags/. For now we could just die on any "Flags" value that isn't "u". Of course all of this is predicated on us wanting to leave this as an opt-in, which I'm not so sure about. If it's opt-out we'll avoid this entire question, > diff --git a/grep.c b/grep.c > index 06eed69493..ceafb8937d 100644 > --- a/grep.c > +++ b/grep.c > @@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb) > return config_error_nonbool(var); > return color_parse(value, color); > } > + > + if (!strcmp(var, "pcre.ucp")) { > + opt->pcre_ucp = git_config_bool(var, value); > + return 0; > + } > + > return 0; > } > > @@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > } > options |= PCRE2_CASELESS; > } > - if (!opt->ignore_locale && is_utf8_locale() && !literal) > + if (!opt->ignore_locale && is_utf8_locale() && !literal) { > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > + if (opt->pcre_ucp) > + options |= PCRE2_UCP; > + } This interaction with locale settings etc. is probably correct, but if we're keeping the config etc. we should really document how this interacts with those. I.e. you might expect "-c grep.patternType=perl -c <whatever_the_setting_is>=true" to give you UCP semantics, but we'll ignore it based on these other criteria. > #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER > /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */ > diff --git a/grep.h b/grep.h > index 6075f997e6..082bd3a0c7 100644 > --- a/grep.h > +++ b/grep.h > @@ -171,6 +171,7 @@ struct grep_opt { > int file_break; > int heading; > int max_count; > + int pcre_ucp; > void *priv; The reason for why we have some "bool"-like settings (like "int ignore_case") as an "int" as opposed to an "unsigned int <name>:1" bitfield is because we need to take their address via the parse_options() API. But in this case it's a purely internal field, so (and again, if we're keeping the option) let's use "unsigned int ...:1" here instead? Unless that is, we're expecting a corresponding command-line option. > > void (*output)(struct grep_opt *opt, const void *data, size_t size); > diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh > index 8eded6ab27..a99a967060 100755 > --- a/t/t7810-grep.sh > +++ b/t/t7810-grep.sh > @@ -95,6 +95,7 @@ test_expect_success setup ' > then > echo "¿" >reverse-question-mark > fi && > + echo "émotion" >ucp && Here we carry this file through the entirety ouf our tests... > git add . && > test_tick && > git commit -m initial > @@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE > test_cmp hello_world actual > ' > > +test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' ' > + cat >expected <<-\EOF && > + ucp:émotion ...only to use it in this one test, it clearly didn't harm anything (or rather, I didn't run this, but I expect you did and it passed), but how about avoiding more global state here doing: test_when_finished "rm -rf repo" && git init repo && test_commit -C repo msg ucp émotion && [...] > + EOF > + cat >pattern <<-\EOF && > + \bémotion > + EOF > + LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual && > + test_cmp expected actual && > + LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern This will break on platforms that don't have en_US.UTF-8 (and that's not hypothetical, some systems will skip installing locales for various reasons). I see we have some almost recent breakage 1819ad327b7 (grep: fix multibyte regex handling under macOS, 2022-08-26), but that one adds a new MB_REGEX prereq, which presumably fails if we don't have this locale. You can just use the existing "GETTEXT_LOCALE", which will piggy-back on our existing locale tests, or we could add a corresponding one for en_US.UTF-8 and refactor the existing MB_REGEX to be in lib-gettext.sh where it arguably belongs... > +' > + > test_expect_success 'grep -G invalidpattern properly dies ' ' > test_must_fail git grep -G "a[" > '
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > To argue with myself here, I'm not so sure that just making this the > default isn't the right move, especially as the GNU grep maintainer > seems to be convinced that that's the right thing for grep(1). OK. > I think calling this e.g.: > > grep.perl.Unicode=<bool> > grep.patternTypePerl.Unicode=<bool> > > Or even: > > grep.patternTypePerl.Flags=u > > Would be better, i.e. PCRE's C API is really just mapping to the flags > you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In > this case the /u flag maps to the "PCRE2_UCP" API flag. > > That we happen to use PCRE to give ourselves "Perl" semantics is an > implementation detail we should avoid exposing, so we could either give > our config generic names, or literally map to the perl /flags/. > > For now we could just die on any "Flags" value that isn't "u". > > Of course all of this is predicated on us wanting to leave this as an > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this > entire question, Making it opt-out would also require a similar knob to turn the "flag" off, be it a configuration variable or a command line option, wouldn't it? I tend to agree with you that it makes sense to make it a goal to take us closer to "grep -P" from GNU---do they have such an opt-out knob? If not, let's make it simple by turning it always on, which would be the simplest ;-) Again, thanks for a careful review with concrete points.
On Tue, Jan 17, 2023 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote: > > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > > > To argue with myself here, I'm not so sure that just making this the > > default isn't the right move, especially as the GNU grep maintainer > > seems to be convinced that that's the right thing for grep(1). > > OK. I think that is definitely the right thing to do for grep, because the current behaviour can only be described as a bug (and a bad one at it), but after all the push back and performance testing, I am also not convinced anymore it needs to be the default for git, because the negatives outweigh the positives. First there is the performance hit, which is inevitable because there are just a lot more characters to match when UCP tables are being used, and second there is the fact that PCRE2_UCP itself might not be what you want when matching code, because for example numbers are never going to be using digits outside what ASCII provides, and identifiers have a narrow set of characters as valid than what you would expect from all written human languages in history. Lastly, even with PCRE2_UCP enabled, our current logic for word matches is still broken, because the current code still uses a definition of word that was done outside what the regex engines provide and that roughly matches what you would expect of identifiers from C in the ASCII times. > > Of course all of this is predicated on us wanting to leave this as an > > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this > > entire question, > > Making it opt-out would also require a similar knob to turn the > "flag" off, be it a configuration variable or a command line option, > wouldn't it? I tend to agree with you that it makes sense to make > it a goal to take us closer to "grep -P" from GNU---do they have > such an opt-out knob? If not, let's make it simple by turning it > always on, which would be the simplest ;-) GNU grep -P has no knob and would likely never have one. So for now, I think we should acknowledge the bug, provide an option for people that might need the fix, and fix all other problems we have, which will include changes in PCRE2 as well to better fit our use case. Carlo
On Tue, Jan 17 2023, Carlo Arenas wrote: > On Tue, Jan 17, 2023 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote: >> >> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> >> > To argue with myself here, I'm not so sure that just making this the >> > default isn't the right move, especially as the GNU grep maintainer >> > seems to be convinced that that's the right thing for grep(1). >> >> OK. > > I think that is definitely the right thing to do for grep, because the > current behaviour can only be described as a bug (and a bad one at > it), but after all the push back and performance testing, I am also > not convinced anymore it needs to be the default for git, because the > negatives outweigh the positives. > > First there is the performance hit, which is inevitable because there > are just a lot more characters to match when UCP tables are being > used,[...] I'm less concerned about the performance, we should aim for correctness first. We can always provide an opt-out (and the locale setting is already that opt-out). > and second there is the fact that PCRE2_UCP itself might not be > what you want when matching code, because for example numbers are > never going to be using digits outside what ASCII provides, and > identifiers have a narrow set of characters as valid than what you > would expect from all written human languages in history. [0-9] will be ASCII, but \d will use [^0-9] Unicode numbers. I agree it might not be expected by some, but I can't really square that view in my mind with the desire to match "\bÆvar" :). After all that "Æ" is also arbitrary byte garbage in the ASCII-view of the world. I can see how it might be more practical in some cases to have "\b" have Unicode semantics, but to specifically make "\d" an exception. But the ship has sailed on that in Perl & PCRE land years (or more than a decade ago). I think us coming up with some exception to that would probably suck more than going with their behavior. > Lastly, even with PCRE2_UCP enabled, our current logic for word > matches is still broken, because the current code still uses a > definition of word that was done outside what the regex engines > provide and that roughly matches what you would expect of identifiers > from C in the ASCII times. Yes, FWIW I have some WIP patches somewhere to get rid of that bit of grep.c if we're using PCRE. I.e. the "-w" should be powered by just adding "\b" to the start/end of the provided string. That'll then be correct, and faster. I can't remember if there were some subtle bugs in that, or why I didn't finish that... >> > Of course all of this is predicated on us wanting to leave this as an >> > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this >> > entire question, >> >> Making it opt-out would also require a similar knob to turn the >> "flag" off, be it a configuration variable or a command line option, >> wouldn't it? I tend to agree with you that it makes sense to make >> it a goal to take us closer to "grep -P" from GNU---do they have >> such an opt-out knob? If not, let's make it simple by turning it >> always on, which would be the simplest ;-) > > GNU grep -P has no knob and would likely never have one. I think the general knob in not just GNU grep but GNU utils and the wider *nix landscape is "tweak your LC_ALL and/or other locale varibales". Which works for it, and will work for us once we're using PCRE2_UCP too. > So for now, I think we should acknowledge the bug, provide an option > for people that might need the fix, and fix all other problems we > have, which will include changes in PCRE2 as well to better fit our > use case. Hrm, what are those PCRE2 changes? The one I saw so far (or was it a proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU grep is now doing in its unreleased version in its git repo...
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> GNU grep -P has no knob and would likely never have one. > > I think the general knob in not just GNU grep but GNU utils and the > wider *nix landscape is "tweak your LC_ALL and/or other locale > varibales". > > Which works for it, and will work for us once we're using PCRE2_UCP too. > >> So for now, I think we should acknowledge the bug, provide an option >> for people that might need the fix, and fix all other problems we >> have, which will include changes in PCRE2 as well to better fit our >> use case. > > Hrm, what are those PCRE2 changes? The one I saw so far (or was it a > proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU > grep is now doing in its unreleased version in its git repo... Yeah, I didn't understand Carlo's comment in that paragraph at all. In short, it sounds to me that the earlier one that added PCRE2_UCP unconditionally would be the best alternative among those that have been discussed.
On Wed, Jan 18 2023, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >>> GNU grep -P has no knob and would likely never have one. >> >> I think the general knob in not just GNU grep but GNU utils and the >> wider *nix landscape is "tweak your LC_ALL and/or other locale >> varibales". >> >> Which works for it, and will work for us once we're using PCRE2_UCP too. >> >>> So for now, I think we should acknowledge the bug, provide an option >>> for people that might need the fix, and fix all other problems we >>> have, which will include changes in PCRE2 as well to better fit our >>> use case. >> >> Hrm, what are those PCRE2 changes? The one I saw so far (or was it a >> proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU >> grep is now doing in its unreleased version in its git repo... > > Yeah, I didn't understand Carlo's comment in that paragraph at all. > > In short, it sounds to me that the earlier one that added PCRE2_UCP > unconditionally would be the best alternative among those that have > been discussed. I agree.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> In short, it sounds to me that the earlier one that added PCRE2_UCP >> unconditionally would be the best alternative among those that have >> been discussed. > > I agree. So, ... we'll mark cb/grep-pcre-ucp, ea8bc435 (grep: correctly identify utf-8 characters with \{b,w} in -P, 2023-01-08), to be merged to 'next' and see what happens. I'll add your Acked-by while at it, if you do not mind. ---- >8 --------- >8 --------- >8 --------- >8 ------ From: Carlo Marcelo Arenas Belón <carenas@gmail.com> Date: Sun, 8 Jan 2023 07:52:17 -0800 Subject: [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P When UTF is enabled for a PCRE match, the corresponding flags are added to the pcre2_compile() call, but PCRE2_UCP wasn't included. This prevents extending the meaning of the character classes to include those new valid characters and therefore result in failed matches for expressions that rely on that extention, for ex: $ git grep -P '\bÆvar' Add PCRE2_UCP so that \w will include Æ and therefore \b could correctly match the beginning of that word. This has an impact on performance that has been estimated to be between 20% to 40% and that is shown through the added performance test. Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com> --- grep.c | 2 +- t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++ 2 files changed, 43 insertions(+), 1 deletion(-) create mode 100755 t/perf/p7822-grep-perl-character.sh diff --git a/grep.c b/grep.c index 06eed69493..1687f65b64 100644 --- a/grep.c +++ b/grep.c @@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt options |= PCRE2_CASELESS; } if (!opt->ignore_locale && is_utf8_locale() && !literal) - options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); + options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */ diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh new file mode 100755 index 0000000000..87009c60df --- /dev/null +++ b/t/perf/p7822-grep-perl-character.sh @@ -0,0 +1,42 @@ +#!/bin/sh + +test_description="git-grep's perl regex + +If GIT_PERF_GREP_THREADS is set to a list of threads (e.g. '1 4 8' +etc.) we will test the patterns under those numbers of threads. +" + +. ./perf-lib.sh + +test_perf_large_repo +test_checkout_worktree + +if test -n "$GIT_PERF_GREP_THREADS" +then + test_set_prereq PERF_GREP_ENGINES_THREADS +fi + +for pattern in \ + '\\bhow' \ + '\\bÆvar' \ + '\\d+ \\bÆvar' \ + '\\bBelón\\b' \ + '\\w{12}\\b' +do + echo '$pattern' >pat + if ! test_have_prereq PERF_GREP_ENGINES_THREADS + then + test_perf "grep -P '$pattern'" --prereq PCRE " + git -P grep -f pat || : + " + else + for threads in $GIT_PERF_GREP_THREADS + do + test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE " + git -c grep.threads=$threads -P grep -f pat || : + " + done + fi +done + +test_done
diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt index e521f20390..8848db7311 100644 --- a/Documentation/config/grep.txt +++ b/Documentation/config/grep.txt @@ -26,3 +26,9 @@ grep.fullName:: grep.fallbackToNoIndex:: If set to true, fall back to git grep --no-index if git grep is executed outside of a git repository. Defaults to false. + +pcre.ucp:: + If set to true, will use all Unicode Character Properties when matching + `\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used + as the underlying engine. If PCRE is not being used it is ignored. + Defaults to false diff --git a/grep.c b/grep.c index 06eed69493..ceafb8937d 100644 --- a/grep.c +++ b/grep.c @@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb) return config_error_nonbool(var); return color_parse(value, color); } + + if (!strcmp(var, "pcre.ucp")) { + opt->pcre_ucp = git_config_bool(var, value); + return 0; + } + return 0; } @@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && !literal) + if (!opt->ignore_locale && is_utf8_locale() && !literal) { options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); + if (opt->pcre_ucp) + options |= PCRE2_UCP; + } #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */ diff --git a/grep.h b/grep.h index 6075f997e6..082bd3a0c7 100644 --- a/grep.h +++ b/grep.h @@ -171,6 +171,7 @@ struct grep_opt { int file_break; int heading; int max_count; + int pcre_ucp; void *priv; void (*output)(struct grep_opt *opt, const void *data, size_t size); diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh index 8eded6ab27..a99a967060 100755 --- a/t/t7810-grep.sh +++ b/t/t7810-grep.sh @@ -95,6 +95,7 @@ test_expect_success setup ' then echo "¿" >reverse-question-mark fi && + echo "émotion" >ucp && git add . && test_tick && git commit -m initial @@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE test_cmp hello_world actual ' +test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' ' + cat >expected <<-\EOF && + ucp:émotion + EOF + cat >pattern <<-\EOF && + \bémotion + EOF + LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual && + test_cmp expected actual && + LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern +' + test_expect_success 'grep -G invalidpattern properly dies ' ' test_must_fail git grep -G "a[" '
When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used by the pcre2_compile() call, but that would only allow for the use of Unicode character properties when caseless is required but not to include the additional UTF characters for all other class matches. This would result in failed matches for expressions that rely on those properties, for ex: $ git grep -P '\bÆvar' Add a configuration that could be used to enable the PCRE2_UCP flag to correctly match those cases, when required. The use of this has an impact on performance that has been estimated to be significant. Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com> --- Changes since v2: * make setting UCP and opt-in as suggested by Ævar * remove performance test and instead add a test Documentation/config/grep.txt | 6 ++++++ grep.c | 11 ++++++++++- grep.h | 1 + t/t7810-grep.sh | 13 +++++++++++++ 4 files changed, 30 insertions(+), 1 deletion(-)