Message ID | 20211015161356.3372-3-someguy@effective-light.com (mailing list archive) |
---|---|
State | Accepted |
Commit | ae39ba431ab861548eb60b4bd2e1d8b8813db76f |
Headers | show |
Series | [v13,1/3] grep: refactor next_match() and match_one_pattern() for external use | expand |
Hamza Mahfooz <someguy@effective-light.com> writes: > If we attempt to grep non-ascii log message text with an ascii pattern, we "with an ascii pattern, when Git is built with and told to use pcre2, we" > run into the following issue: > > $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author > grep: (standard input): binary file matches > > So, to fix this teach the grep code to use PCRE2_UTF, as long as the log > output is encoded in UTF-8. > - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > - !(!opt->ignore_case && (p->fixed || p->is_fixed))) > + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || > + (!opt->ignore_locale && is_utf8_locale() && > + has_non_ascii(p->pattern) && !(!opt->ignore_case && > + (p->fixed || p->is_fixed)))) That's a mouthful. It is not obvious what new condition is being added. I had to flip the order to see the only difference is, that > - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > - !(!opt->ignore_case && (p->fixed || p->is_fixed))) > + if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > + !(!opt->ignore_case && (p->fixed || p->is_fixed))) || > + (!opt->ignore_locale && !has_non_ascii(p->pattern))) ... in addition to the case where the original condition holds, if we do not say "ignore locale" and the pattern is ascii-only, we apply these two option flags. And that matches what the proposed log message explained as the condition the problem appears. So,... looks good, I guess. Thanks, will queue. Addendum. If we were reordering pieces in the condition, I wonder if there is a better way to reorganize it, though. The original is already barely explainable with words, and with this new condition added, I am not sure if anybody can phrase the condition in simple words to others after staring it for a few minutes. I can't. But straightening it out is best left as a future clean-up patch, separate from this series.
Am 15.10.21 um 22:03 schrieb Junio C Hamano: > Hamza Mahfooz <someguy@effective-light.com> writes: > >> If we attempt to grep non-ascii log message text with an ascii pattern, we > > "with an ascii pattern, when Git is built with and told to use pcre2, we" > >> run into the following issue: >> >> $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author >> grep: (standard input): binary file matches I get no error message on macOS 11.6, but this result, with the underlined part in red: Author: ??var Arnfjörð Bjarmason <avarab@gmail.com> ^^^^^^^^^^^^^^^^^^ So the pattern matches the second byte of a two-byte character, inserts a color code in the middle and thus produces invalid output in this case. >> >> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log >> output is encoded in UTF-8. > >> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >> + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || >> + (!opt->ignore_locale && is_utf8_locale() && >> + has_non_ascii(p->pattern) && !(!opt->ignore_case && >> + (p->fixed || p->is_fixed)))) > > That's a mouthful. It is not obvious what new condition is being > added. I had to flip the order to see the only difference is, that > >> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >> + if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >> + !(!opt->ignore_case && (p->fixed || p->is_fixed))) || >> + (!opt->ignore_locale && !has_non_ascii(p->pattern))) > > ... in addition to the case where the original condition holds, if > we do not say "ignore locale" and the pattern is ascii-only, we > apply these two option flags. And that matches what the proposed > log message explained as the condition the problem appears. > > So,... looks good, I guess. > > Thanks, will queue. > > > Addendum. > > If we were reordering pieces in the condition, I wonder if there is > a better way to reorganize it, though. The original is already > barely explainable with words, and with this new condition added, I > am not sure if anybody can phrase the condition in simple words to > others after staring it for a few minutes. I can't. > > But straightening it out is best left as a future clean-up patch, > separate from this series. > It can be written as: literal = !opt->ignore_case && (p->fixed || p->is_fixed); if (!opt->ignore_locale) { if (!has_non_ascii(p->pattern) || (is_utf8_locale() && !literal)) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); } Literal patterns are those that don't use any wildcards or case-folding. If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the pattern only consists of ASCII characters, or if the pattern is encoded in UTF-8 and is not just a literal pattern. Hmm. Why enable PCRE2_UTF for literal patterns that consist of only ASCII chars? The old condition was (reformatted to better match the new one): if (!opt->ignore_locale) { if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); } Intuitively I'd say the condition should be: if (!opt->ignore_locale && is_utf8_locale()) { if (has_non_ascii(p->pattern) || !literal) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); } If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we have to match non-ASCII characters or do more than just literal matching. For literal patterns that consist only of ASCII characters we don't need the cost and complication of PCRE2_UTF. Makes sense? René
On Sat, Oct 16 2021, René Scharfe wrote: > Am 15.10.21 um 22:03 schrieb Junio C Hamano: >> Hamza Mahfooz <someguy@effective-light.com> writes: >> >>> If we attempt to grep non-ascii log message text with an ascii pattern, we >> >> "with an ascii pattern, when Git is built with and told to use pcre2, we" >> >>> run into the following issue: >>> >>> $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author >>> grep: (standard input): binary file matches > > I get no error message on macOS 11.6, but this result, with the underlined > part in red: > > Author: ??var Arnfjörð Bjarmason <avarab@gmail.com> > ^^^^^^^^^^^^^^^^^^ > > So the pattern matches the second byte of a two-byte character, inserts a > color code in the middle and thus produces invalid output in this case. Thanks for digging into these edge cases... >>> >>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log >>> output is encoded in UTF-8. >> >>> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >>> + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || >>> + (!opt->ignore_locale && is_utf8_locale() && >>> + has_non_ascii(p->pattern) && !(!opt->ignore_case && >>> + (p->fixed || p->is_fixed)))) >> >> That's a mouthful. It is not obvious what new condition is being >> added. I had to flip the order to see the only difference is, that >> >>> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >>> + if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>> + !(!opt->ignore_case && (p->fixed || p->is_fixed))) || >>> + (!opt->ignore_locale && !has_non_ascii(p->pattern))) >> >> ... in addition to the case where the original condition holds, if >> we do not say "ignore locale" and the pattern is ascii-only, we >> apply these two option flags. And that matches what the proposed >> log message explained as the condition the problem appears. >> >> So,... looks good, I guess. >> >> Thanks, will queue. >> >> >> Addendum. >> >> If we were reordering pieces in the condition, I wonder if there is >> a better way to reorganize it, though. The original is already >> barely explainable with words, and with this new condition added, I >> am not sure if anybody can phrase the condition in simple words to >> others after staring it for a few minutes. I can't. >> >> But straightening it out is best left as a future clean-up patch, >> separate from this series. >> > > It can be written as: > > literal = !opt->ignore_case && (p->fixed || p->is_fixed); > if (!opt->ignore_locale) { > if (!has_non_ascii(p->pattern) || > (is_utf8_locale() && !literal)) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > } Whatever we go from here I'm very much for untangling that condition, but I guess it can be done as a follow-up too, I'll defer to Hamza there... > Literal patterns are those that don't use any wildcards or case-folding. > If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the > pattern only consists of ASCII characters, or if the pattern is encoded > in UTF-8 and is not just a literal pattern. > > Hmm. Why enable PCRE2_UTF for literal patterns that consist of only > ASCII chars? > > The old condition was (reformatted to better match the new one): > > if (!opt->ignore_locale) { > if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > } > > Intuitively I'd say the condition should be: > > if (!opt->ignore_locale && is_utf8_locale()) { > if (has_non_ascii(p->pattern) || !literal) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > } > > If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we > have to match non-ASCII characters or do more than just literal > matching. > > For literal patterns that consist only of ASCII characters we don't need > the cost and complication of PCRE2_UTF. > > Makes sense? echo 'René Scharfe' >f && $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $? 1 $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $? f:René Scharfe 0 So it's a choose-your-own adventure where you can pick if you're right. I.e. do you want the "." metacharacter to match your "é" or not? These sorts of patterns demonstrate nicely that the relationship between your pattern being ASCII and wanting or not wanting UTF-8 matching semantics isn't what you might imagine it to be. If you look at: git log --reverse -p -Gis_utf8_locale -- grep.c And my earlier replies in this thread-at-large (not digging up a reference now, sorry), you'll see that the current behavior in grep.c is really a compromise based on some intersection of user patterns, us potentially grepping arbitrary binary data and/or "valid" encodings, and what PCRE does and doesn't puke on. It's not "right" by any sense, but we sort of limp along with it, mostly because unlike say Perl's regex engine which really is used for serious production work where Unicode-correctness matters I don't think anyone is using "git grep" for anything like that, it's mostly for people's ad-hoc greps. Right now I can't remember if the only reason we can't "fix this" is because we'd be too aggressive with the PCRE version dependency, see 95ca1f987ed (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24).
Am 16.10.21 um 19:12 schrieb Ævar Arnfjörð Bjarmason: > > On Sat, Oct 16 2021, René Scharfe wrote: > >> Am 15.10.21 um 22:03 schrieb Junio C Hamano: >>> Hamza Mahfooz <someguy@effective-light.com> writes: >>> >>>> If we attempt to grep non-ascii log message text with an ascii pattern, we >>> >>> "with an ascii pattern, when Git is built with and told to use pcre2, we" >>> >>>> run into the following issue: >>>> >>>> $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author >>>> grep: (standard input): binary file matches >> >> I get no error message on macOS 11.6, but this result, with the underlined >> part in red: >> >> Author: ??var Arnfjörð Bjarmason <avarab@gmail.com> >> ^^^^^^^^^^^^^^^^^^ >> >> So the pattern matches the second byte of a two-byte character, inserts a >> color code in the middle and thus produces invalid output in this case. > > Thanks for digging into these edge cases... > >>>> >>>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log >>>> output is encoded in UTF-8. >>> >>>> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>>> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >>>> + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || >>>> + (!opt->ignore_locale && is_utf8_locale() && >>>> + has_non_ascii(p->pattern) && !(!opt->ignore_case && >>>> + (p->fixed || p->is_fixed)))) >>> >>> That's a mouthful. It is not obvious what new condition is being >>> added. I had to flip the order to see the only difference is, that >>> >>>> - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>>> - !(!opt->ignore_case && (p->fixed || p->is_fixed))) >>>> + if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && >>>> + !(!opt->ignore_case && (p->fixed || p->is_fixed))) || >>>> + (!opt->ignore_locale && !has_non_ascii(p->pattern))) >>> >>> ... in addition to the case where the original condition holds, if >>> we do not say "ignore locale" and the pattern is ascii-only, we >>> apply these two option flags. And that matches what the proposed >>> log message explained as the condition the problem appears. >>> >>> So,... looks good, I guess. >>> >>> Thanks, will queue. >>> >>> >>> Addendum. >>> >>> If we were reordering pieces in the condition, I wonder if there is >>> a better way to reorganize it, though. The original is already >>> barely explainable with words, and with this new condition added, I >>> am not sure if anybody can phrase the condition in simple words to >>> others after staring it for a few minutes. I can't. >>> >>> But straightening it out is best left as a future clean-up patch, >>> separate from this series. >>> >> >> It can be written as: >> >> literal = !opt->ignore_case && (p->fixed || p->is_fixed); >> if (!opt->ignore_locale) { >> if (!has_non_ascii(p->pattern) || >> (is_utf8_locale() && !literal)) >> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >> } > > Whatever we go from here I'm very much for untangling that condition, > but I guess it can be done as a follow-up too, I'll defer to Hamza > there... > >> Literal patterns are those that don't use any wildcards or case-folding. >> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the >> pattern only consists of ASCII characters, or if the pattern is encoded >> in UTF-8 and is not just a literal pattern. >> >> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only >> ASCII chars? >> >> The old condition was (reformatted to better match the new one): >> >> if (!opt->ignore_locale) { >> if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal) >> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >> } >> >> Intuitively I'd say the condition should be: >> >> if (!opt->ignore_locale && is_utf8_locale()) { >> if (has_non_ascii(p->pattern) || !literal) >> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >> } >> >> If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we >> have to match non-ASCII characters or do more than just literal >> matching. >> >> For literal patterns that consist only of ASCII characters we don't need >> the cost and complication of PCRE2_UTF. >> >> Makes sense? > > echo 'René Scharfe' >f && > $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $? > 1 > $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $? > f:René Scharfe > 0 > > So it's a choose-your-own adventure where you can pick if you're > right. I.e. do you want the "." metacharacter to match your "é" or not? Yes, I do, and it's what Hamza's patch is fixing. > These sorts of patterns demonstrate nicely that the relationship between your > pattern being ASCII and wanting or not wanting UTF-8 matching semantics > isn't what you might imagine it to be. Differences are: o: opt->ignore_locale h: has_non_ascii(p->pattern) i: is_utf8_locale() l: literal o h i l master hamza rene 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 <== your first example 0 0 1 1 0 1 0 0 1 1 1 0 0 1 Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two lines). Turning on PCRE2_UTF for literal matching (fourth line) goes against 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching, 2019-07-26). Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth line) also goes against that, so my intuition betrayed me. When I adjust it, I get: if (!opt->ignore_locale && is_utf8_locale() && !literal) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); That looks deceptively simple -- just drop has_non_ascii(p->pattern) from the original condition. Your second example is handle the same by all versions btw.: o h i l master hamza rene 0 1 1 0 1 1 1 René
René Scharfe <l.s.r@web.de> writes: >>> Literal patterns are those that don't use any wildcards or case-folding. >>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the >>> pattern only consists of ASCII characters, or if the pattern is encoded >>> in UTF-8 and is not just a literal pattern. >>> >>> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only >>> ASCII chars? >>> ... >> echo 'René Scharfe' >f && >> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $? >> 1 >> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $? >> f:René Scharfe >> 0 >> >> So it's a choose-your-own adventure where you can pick if you're >> right. I.e. do you want the "." metacharacter to match your "é" or not? > > Yes, I do, and it's what Hamza's patch is fixing. That may be correct but is this discussion still about "Why enable ... for literal patterns that consist of only ASCII"? Calling "." a "metacharacter" and wanting it to match anything other than a single dot would mean the pattern we are discussing is no longer "literal", isn't it? I am puzzled. >> These sorts of patterns demonstrate nicely that the relationship between your >> pattern being ASCII and wanting or not wanting UTF-8 matching semantics >> isn't what you might imagine it to be. > > Differences are: > > o: opt->ignore_locale > h: has_non_ascii(p->pattern) > i: is_utf8_locale() > l: literal > > o h i l master hamza rene > 0 0 0 0 0 1 0 > 0 0 0 1 0 1 0 > 0 0 1 0 0 1 1 <== your first example > 0 0 1 1 0 1 0 > 0 1 1 1 0 0 1 > > Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two > lines). > > Turning on PCRE2_UTF for literal matching (fourth line) goes against > 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching, > 2019-07-26). > > Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth > line) also goes against that, so my intuition betrayed me. When I > adjust it, I get: > > if (!opt->ignore_locale && is_utf8_locale() && !literal) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > > That looks deceptively simple -- just drop has_non_ascii(p->pattern) > from the original condition. > > Your second example is handle the same by all versions btw.: > > o h i l master hamza rene > 0 1 1 0 1 1 1 > > René
Am 17.10.21 um 08:00 schrieb Junio C Hamano: > René Scharfe <l.s.r@web.de> writes: > >>>> Literal patterns are those that don't use any wildcards or case-folding. >>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the >>>> pattern only consists of ASCII characters, or if the pattern is encoded >>>> in UTF-8 and is not just a literal pattern. >>>> >>>> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only >>>> ASCII chars? >>>> ... >>> echo 'René Scharfe' >f && >>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $? >>> 1 >>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $? >>> f:René Scharfe >>> 0 >>> >>> So it's a choose-your-own adventure where you can pick if you're >>> right. I.e. do you want the "." metacharacter to match your "é" or not? >> >> Yes, I do, and it's what Hamza's patch is fixing. > > That may be correct but is this discussion still about "Why enable > ... for literal patterns that consist of only ASCII"? Calling "." a > "metacharacter" and wanting it to match anything other than a single > dot would mean the pattern we are discussing is no longer "literal", > isn't it? I am puzzled. Right, Ævar's comment is not about my question, but highlights an inconsistency in master that is fixed by Hamza's patch. I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for non-ASCII patterns even if they are literal or our locale is not UTF-8. The following change would fix the edge case mentioned in its commit message without these side-effects. Am I correct? diff --git a/grep.c b/grep.c index fe847a0111..5badb6d851 100644 --- a/grep.c +++ b/grep.c @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && + if (!opt->ignore_locale && is_utf8_locale() && !(!opt->ignore_case && (p->fixed || p->is_fixed))) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
On Sun, Oct 17 2021, René Scharfe wrote: > Am 17.10.21 um 08:00 schrieb Junio C Hamano: >> René Scharfe <l.s.r@web.de> writes: >> >>>>> Literal patterns are those that don't use any wildcards or case-folding. >>>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the >>>>> pattern only consists of ASCII characters, or if the pattern is encoded >>>>> in UTF-8 and is not just a literal pattern. >>>>> >>>>> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only >>>>> ASCII chars? >>>>> ... >>>> echo 'René Scharfe' >f && >>>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $? >>>> 1 >>>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $? >>>> f:René Scharfe >>>> 0 >>>> >>>> So it's a choose-your-own adventure where you can pick if you're >>>> right. I.e. do you want the "." metacharacter to match your "é" or not? >>> >>> Yes, I do, and it's what Hamza's patch is fixing. >> >> That may be correct but is this discussion still about "Why enable >> ... for literal patterns that consist of only ASCII"? Calling "." a >> "metacharacter" and wanting it to match anything other than a single >> dot would mean the pattern we are discussing is no longer "literal", >> isn't it? I am puzzled. > > Right, Ævar's comment is not about my question, but highlights an > inconsistency in master that is fixed by Hamza's patch. Yes, sorry about that. Just generally about the messy semantics. > I'll repeat and extend my question: Hamza's patch enables PCRE2_UTF for > non-ASCII patterns even if they are literal or our locale is not UTF-8. > The following change would fix the edge case mentioned in its commit > message without these side-effects. Am I correct? > > diff --git a/grep.c b/grep.c > index fe847a0111..5badb6d851 100644 > --- a/grep.c > +++ b/grep.c > @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > } > options |= PCRE2_CASELESS; > } > - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > + if (!opt->ignore_locale && is_utf8_locale() && > !(!opt->ignore_case && (p->fixed || p->is_fixed))) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); I haven't had time to carefully look into this, but one caveat to check out is if it works with older PCRE and whether you need e.g. the GIT_PCRE2_VERSION_10_36_OR_HIGHER. I tried your suggestion on top of Hamza's series, compiled PCRE v2 10.23, tested, and also tried manually removing the PCRE2_MATCH_INVALID_UTF flag and tested again. We pass all tests with both, so maybe this is safe to do (or maybe we're missing some test we haven't thought of yet...). One thing that makes me nervous is that we had breakages in the past once the patches escaped into the wild, particularly because the code being modified here has is_utf8_locale(), and our tests run under LANG=c LC_ALL=C. I tried running all the tests with a non-C locale, there's a lot of failures, but none new with this change. As an aside the below patch makes all but one shortlog test pass for me. I wonder if we shouldn't do this for real to smoke out any $LANG or $LC_ALL dependencies. I.e. almost all of the failures were due to relying on the sort order of sort(1), and in one case comm(1), the first hunk here is also redundant to defining our own ls(1) wrapper.... diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh index bd17f308b38..738ca6ef587 100755 --- a/t/t5319-multi-pack-index.sh +++ b/t/t5319-multi-pack-index.sh @@ -693,12 +693,12 @@ test_expect_success 'expire removes unreferenced packs' ' ^refs/heads/C EOF git multi-pack-index write && - ls .git/objects/pack | grep -v -e pack-[AB] >expect && + ls .git/objects/pack | sort | grep -v -e pack-[AB] >expect && git multi-pack-index expire && - ls .git/objects/pack >actual && + ls .git/objects/pack | sort >actual && test_cmp expect actual && - ls .git/objects/pack/ | grep idx >expect-idx && - test-tool read-midx .git/objects | grep idx >actual-midx && + ls .git/objects/pack/ | sort | grep idx >expect-idx && + test-tool read-midx .git/objects | sort | grep idx >actual-midx && test_cmp expect-idx actual-midx && git multi-pack-index verify && git fsck @@ -802,7 +802,7 @@ test_expect_success 'expire works when adding new packs' ' refs/heads/E EOF git multi-pack-index expire && - ls .git/objects/pack/ | grep idx >expect && + ls .git/objects/pack/ | sort | grep idx >expect && test-tool read-midx .git/objects | grep idx >actual && test_cmp expect actual && git multi-pack-index verify diff --git a/t/test-lib.sh b/t/test-lib.sh index 8361b5c1c57..f4f9d231f28 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -417,14 +417,22 @@ test -n "$BASH_VERSION" && shopt -u checkwinsize 2>/dev/null # For repeatability, reset the environment to known value. # TERM is sanitized below, after saving color control sequences. -LANG=C -LC_ALL=C +LANG=en_US.UTF-8 +LC_ALL=en_US.UTF-8 PAGER=cat TZ=UTC COLUMNS=80 export LANG LC_ALL PAGER TZ COLUMNS EDITOR=: +sort () { + LC_ALL=C LANG=C command sort "$@" +} + +comm () { + LC_ALL=C LANG=C command comm "$@" +} + # A call to "unset" with no arguments causes at least Solaris 10 # /usr/xpg4/bin/sh and /bin/ksh to bail out. So keep the unsets # deriving from the command substitution clustered with the other
This breaks 'PCRE v2: grep ASCII from invalid UTF-8 data'. Andreas.
On Mon, Nov 15 2021, Andreas Schwab wrote:
> This breaks 'PCRE v2: grep ASCII from invalid UTF-8 data'.
What PCREv2 version are you using? I wonder if it's got to do with the
10.36 boundary, as noted in 95ca1f987ed (grep/pcre2: better support
invalid UTF-8 haystacks, 2021-01-24). One of the two tests matching your
description tests for that bug.
I can't remember much of the details of the bug, and it seems PCREv2's
bug tracker went private, or at least I can't view
https://bugs.exim.org/show_bug.cgi?id=2642 anymore.
> I can't remember much of the details of the bug, and it seems PCREv2's > bug tracker went private, or at least I can't view > https://bugs.exim.org/show_bug.cgi?id=2642 anymore. It was shutdown; all new development is now done in github[1], old closed bugs were not migrated over though but the fix is pretty straight forward if it needs backporting[2] I would have expected the less risky change from René[3] to be released though, but might be a good opportunity to take a deeper look at the underlying problem and also have a likely solution to both the regression and the original bug. Carlo [1] https://github.com/PhilipHazel/pcre2 [2] https://github.com/PhilipHazel/pcre2/commit/f8cbb1f58df05f175a6898f35dc18e26be6362d0 [3] https://lore.kernel.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/
On Nov 15 2021, Ævar Arnfjörð Bjarmason wrote:
> What PCREv2 version are you using?
10.31
https://build.opensuse.org/package/show/openSUSE:Leap:15.3:Update/pcre2
Andreas.
On Tue, Nov 16, 2021 at 12:42 AM Andreas Schwab <schwab@linux-m68k.org> wrote: > > On Nov 15 2021, Ævar Arnfjörð Bjarmason wrote: > > > What PCREv2 version are you using? > > 10.31 > > https://build.opensuse.org/package/show/openSUSE:Leap:15.3:Update/pcre2 interesting; that version doesn't need the workaround because it doesn't have MATCH_INVALID_UTF that was released with 10.34, but it has a recently backported fix for a similar issue that might be confusing git and it is somehow related to it. a full log (run it with -x -i, and get the last lines) would be useful. curious also if you see other problems using `git log --grep` and which locale do you have configured? Carlo
On Nov 16 2021, Carlo Arenas wrote: > curious also if you see other problems using `git log --grep` and Not sure what you are asking for. > which locale do you have configured? There is no locale configured during build. Andreas.
expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data': git grep -h "var" invalid-0x80 >actual && test_cmp expected actual && git grep -h "(*NO_JIT)var" invalid-0x80 >actual && test_cmp expected actual ++ git grep -h var invalid-0x80 ++ test_cmp expected actual ++ test 2 -ne 2 ++ eval 'diff -u' '"$@"' +++ diff -u expected actual ++ git grep -h '(*NO_JIT)var' invalid-0x80 fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set error: last command exited with $?=128 not ok 13 - PCRE v2: grep ASCII from invalid UTF-8 data # # git grep -h "var" invalid-0x80 >actual && # test_cmp expected actual && # git grep -h "(*NO_JIT)var" invalid-0x80 >actual && # test_cmp expected actual # Andreas.
On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote: > > expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data': > git grep -h "var" invalid-0x80 >actual && > test_cmp expected actual && > git grep -h "(*NO_JIT)var" invalid-0x80 >actual && > test_cmp expected actual > > ++ git grep -h var invalid-0x80 > ++ test_cmp expected actual > ++ test 2 -ne 2 > ++ eval 'diff -u' '"$@"' > +++ diff -u expected actual > ++ git grep -h '(*NO_JIT)var' invalid-0x80 > fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set That is exactly what I was worried about, this is not failing one test, but making `git grep` unusable in any repository that has any binary files that might be reachable by it, and it is likely affecting anyone using PCRE older than 10.34 Reverting this specific commit might fix it and is unlikely to introduce other issues, did you try it? Carlo
On Nov 16 2021, Carlo Arenas wrote: > Reverting this specific commit might fix it and is unlikely to > introduce other issues, did you try it? Yes, of course. Andreas.
Am 16.11.21 um 10:38 schrieb Carlo Arenas: > On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote: >> >> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data': >> git grep -h "var" invalid-0x80 >actual && >> test_cmp expected actual && >> git grep -h "(*NO_JIT)var" invalid-0x80 >actual && >> test_cmp expected actual >> >> ++ git grep -h var invalid-0x80 >> ++ test_cmp expected actual >> ++ test 2 -ne 2 >> ++ eval 'diff -u' '"$@"' >> +++ diff -u expected actual >> ++ git grep -h '(*NO_JIT)var' invalid-0x80 >> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set > > That is exactly what I was worried about, this is not failing one > test, but making `git grep` unusable in any repository that has any > binary files that might be reachable by it, and it is likely affecting > anyone using PCRE older than 10.34 Let's have a look at the map. Here are the differences between the versions regarding use of PCRE2_UTF: o: opt->ignore_locale h: has_non_ascii(p->pattern) i: is_utf8_locale() l: !opt->ignore_case && (p->fixed || p->is_fixed) o h i l master hamza rene2 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 1 1 0 1 0 <== 7812.13, confirmed using fprint() debugging So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/ should not have this breakage, because it doesn't enable PCRE2_UTF for literal patterns. René
On Wed, Nov 17, 2021 at 10:46 AM René Scharfe <l.s.r@web.de> wrote: > > Am 16.11.21 um 10:38 schrieb Carlo Arenas: > > On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote: > >> > >> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data': > >> git grep -h "var" invalid-0x80 >actual && > >> test_cmp expected actual && > >> git grep -h "(*NO_JIT)var" invalid-0x80 >actual && > >> test_cmp expected actual > >> > >> ++ git grep -h var invalid-0x80 > >> ++ test_cmp expected actual > >> ++ test 2 -ne 2 > >> ++ eval 'diff -u' '"$@"' > >> +++ diff -u expected actual > >> ++ git grep -h '(*NO_JIT)var' invalid-0x80 > >> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set > > > > That is exactly what I was worried about, this is not failing one > > test, but making `git grep` unusable in any repository that has any > > binary files that might be reachable by it, and it is likely affecting > > anyone using PCRE older than 10.34 > > Let's have a look at the map. Here are the differences between the > versions regarding use of PCRE2_UTF: > > o: opt->ignore_locale > h: has_non_ascii(p->pattern) > i: is_utf8_locale() > l: !opt->ignore_case && (p->fixed || p->is_fixed) > > o h i l master hamza rene2 > 0 0 0 0 0 1 0 > 0 0 0 1 0 1 0 > 0 0 1 0 0 1 1 > 0 0 1 1 0 1 0 <== 7812.13, confirmed using fprint() debugging > > So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/ > should not have this breakage, because it doesn't enable PCRE2_UTF for > literal patterns. Correct, and indeed I had your expression instead of the ugly one in my draft for v2[1] but then I found the tests were even more broken than reported originally and got worried it might introduce yet another issue because of the brittleness of the rest of the code around it. I am hoping it will be introduced in a follow up series though, and unless this code gets thrown away in the promised refactoring by Ævar which I haven't seen yet and once this is in maint. IMHO and in line with the previous warnings you raised[2] during the development of this feature, it might be worth looking at it more deeply, but at least the proposed patch keeps the feature working in practice (the only case that won't work as expected will be when the expression includes "." with the intention of matching a UTF-8 character and while using PCRE2 older than 10.34) and more importantly fixes the regression that it introduced. Carlo [1] https://lore.kernel.org/git/20211117102329.95456-1-carenas@gmail.com/ [2] https://lore.kernel.org/git/fc7eb9fc-9521-5484-b05f-9c20086fd485@web.de/
On Wed, Nov 17 2021, René Scharfe wrote: > Am 16.11.21 um 10:38 schrieb Carlo Arenas: >> On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@linux-m68k.org> wrote: >>> >>> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data': >>> git grep -h "var" invalid-0x80 >actual && >>> test_cmp expected actual && >>> git grep -h "(*NO_JIT)var" invalid-0x80 >actual && >>> test_cmp expected actual >>> >>> ++ git grep -h var invalid-0x80 >>> ++ test_cmp expected actual >>> ++ test 2 -ne 2 >>> ++ eval 'diff -u' '"$@"' >>> +++ diff -u expected actual >>> ++ git grep -h '(*NO_JIT)var' invalid-0x80 >>> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set >> >> That is exactly what I was worried about, this is not failing one >> test, but making `git grep` unusable in any repository that has any >> binary files that might be reachable by it, and it is likely affecting >> anyone using PCRE older than 10.34 > > Let's have a look at the map. Here are the differences between the > versions regarding use of PCRE2_UTF: > > o: opt->ignore_locale > h: has_non_ascii(p->pattern) > i: is_utf8_locale() > l: !opt->ignore_case && (p->fixed || p->is_fixed) > > o h i l master hamza rene2 > 0 0 0 0 0 1 0 > 0 0 0 1 0 1 0 > 0 0 1 0 0 1 1 > 0 0 1 1 0 1 0 <== 7812.13, confirmed using fprint() debugging > > So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/ > should not have this breakage, because it doesn't enable PCRE2_UTF for > literal patterns. PCRE2_UTF will also matter for literal patterns. Try to peel apart the two bytes in "é" and match them under -i with/without PCRE_UTF. I don't know what the right behavior should be here (haven't had time to dig), but it matters for matching.
On Wed, Nov 17, 2021 at 1:01 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > PCRE2_UTF will also matter for literal patterns. Try to peel apart the > two bytes in "é" and match them under -i with/without PCRE_UTF. Is there a real use case for why someone would do that? and how is that "literal" valid UTF to warrant setting PCRE2_UTF? I would expect that someone including random bytes in the expression is really more interested in binary matching anyway and the use of -i with it probably should be an error. Indeed I suspect the fact that pcre2_compile lets it through might be a bug in PCRE2 $ pcre2test PCRE2 version 10.39 2021-10-29 re> /\303/utf,caseless data> \303 0: \x{c3} data> é No match Carlo
On Wed, Nov 17 2021, Carlo Arenas wrote: > On Wed, Nov 17, 2021 at 1:01 PM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> PCRE2_UTF will also matter for literal patterns. Try to peel apart the >> two bytes in "é" and match them under -i with/without PCRE_UTF. > > Is there a real use case for why someone would do that? and how is > that "literal" valid UTF to warrant setting PCRE2_UTF? We don't check the validity of the input pattern as correct UTF-8 before feeding it to PCRE. We don't even check if you're under a UTF-8 locale, a call to is_utf8_locale() is one of the conditions we'll try (but check how loose we are under NO_GETTEXT=Y even then), but it's not a necessary one. So in practice we'll likely have users pasting say Big5, UTF-32, JIS encoding or whatever into "git grep" and then grepping arbitrary binary data. I don't really have a specific use-case in mind. I'm just trying to counter the apparent recurring misconception that's popped up more than once in these recent threads that the UTF-8 *mode* is only something we need to worry about if the pattern contains non-ASCII. The implications of UTF-8 as a matching mode go much deeper than that in Perl's and PCRE's regex engines, e.g. in this case what's considered a character boundary. > I would expect that someone including random bytes in the expression > is really more interested in binary matching anyway and the use of -i > with it probably should be an error. > > Indeed I suspect the fact that pcre2_compile lets it through might be > a bug in PCRE2 > > $ pcre2test > PCRE2 version 10.39 2021-10-29 > re> /\303/utf,caseless > data> \303 > 0: \x{c3} > data> é > No match It's really not "random bytes". We're not talking about someone doing a git grep where the argument is fed from a "cat" of /dev/urandom. Just plain old boring natural language encodings in common use can and will conflict with what's considered a character boundary in UTF-8, Anyway, I haven't had time to re-page this topic at large into my wetware and really don't know what we should be doing exactly at this point to move forward, except my previous suggestion to either revert the recent changes, or at least to narrow them down so that we get the old behavior for everything except the revision.c "git log" caller. I think a really good next step aside from dealing with that immediate problem would be to harden some of our test data in a way similar to what we've got in t7816-grep-binary-pattern.sh, but for partial matches, mixtures of valid/invalid/partial character patterns and subjects etc. We might be able to lift some tests from existing test suites, perl.git's t/re/re_tests and pcre2.git's RunGrepTest (And testdata/ directory) are probably good places to start looking. We don't need to test e.g. PCRE itself independently, but it is worth testing how whatever special-sauce we add on top (if and when to turn on its flags based on heuristics) impacts things. We've also got the problem that whatever code we write will target multiple PCREv2 versions, which isn't something PCREv2 itself needs to deal with.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> Let's have a look at the map. Here are the differences between the >> versions regarding use of PCRE2_UTF: >> >> o: opt->ignore_locale >> h: has_non_ascii(p->pattern) >> i: is_utf8_locale() >> l: !opt->ignore_case && (p->fixed || p->is_fixed) >> >> o h i l master hamza rene2 >> 0 0 0 0 0 1 0 >> 0 0 0 1 0 1 0 >> 0 0 1 0 0 1 1 >> 0 0 1 1 0 1 0 <== 7812.13, confirmed using fprint() debugging >> >> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/ >> should not have this breakage, because it doesn't enable PCRE2_UTF for >> literal patterns. > > PCRE2_UTF will also matter for literal patterns. Try to peel apart the > two bytes in "é" and match them under -i with/without PCRE_UTF. Sorry for being late to the party, but doesn't "literal" in the context of this thread mean the column "l" above, i.e. we are not ignoring case and fixed or is_fixed member is set? So "under -i" disqualifies as an example for "will also matter for literal", doesn't it? In hindsight, I guess we could have pushed a bit harder when René's - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && + if (!opt->ignore_locale && is_utf8_locale() && !(!opt->ignore_case && (p->fixed || p->is_fixed))) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/ (is that what is called 'rene2' above?) was raised on Oct 17th to amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from having this breakage? Carlo, in your [PATCH v2] in <20211117102329.95456-1-carenas@gmail.com>, I see that the #else side for older PCREv2 users essentially reverts what Hamza's [PATCH v13 3/3] did to this area. +#else + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && + !(!opt->ignore_case && (p->fixed || p->is_fixed))) + options |= PCRE2_UTF; +#endif I guess this is a lot of change in the amount of text involved but the least amount of actual change in the behaviour. For those with newer PCREv2, the behaviour would be the same as v2.34.0, and for others, the behaviour would be the same as v2.33.0. Having said all that, because the consensus seems to be that the whole "when we should match in UTF mode" may need to be rethought, I think reverting Hamza's [v13 3/3] would be the simplest way to clean up the mess for v2.34.1 that will give us a cleaner slate to later build on, than applying this patch. So, I dunno. Comments from those involved in the discussion? Thanks.
Am 18.11.21 um 19:17 schrieb Junio C Hamano: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >>> Let's have a look at the map. Here are the differences between the >>> versions regarding use of PCRE2_UTF: >>> >>> o: opt->ignore_locale >>> h: has_non_ascii(p->pattern) >>> i: is_utf8_locale() >>> l: !opt->ignore_case && (p->fixed || p->is_fixed) >>> >>> o h i l master hamza rene2 >>> 0 0 0 0 0 1 0 >>> 0 0 0 1 0 1 0 >>> 0 0 1 0 0 1 1 >>> 0 0 1 1 0 1 0 <== 7812.13, confirmed using fprint() debugging >>> >>> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/ >>> should not have this breakage, because it doesn't enable PCRE2_UTF for >>> literal patterns. >> >> PCRE2_UTF will also matter for literal patterns. Try to peel apart the >> two bytes in "é" and match them under -i with/without PCRE_UTF. > > Sorry for being late to the party, but doesn't "literal" in the > context of this thread mean the column "l" above, i.e. we are not > ignoring case and fixed or is_fixed member is set? So "under -i" > disqualifies as an example for "will also matter for literal", > doesn't it? Correct. > In hindsight, I guess we could have pushed a bit harder when René's > > - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > + if (!opt->ignore_locale && is_utf8_locale() && > !(!opt->ignore_case && (p->fixed || p->is_fixed))) > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > > in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@web.de/ > (is that what is called 'rene2' above?) was raised on Oct 17th to > amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from > having this breakage? Yes, that the change I meant with "rene2". > Carlo, in your [PATCH v2] in <20211117102329.95456-1-carenas@gmail.com>, > I see that the #else side for older PCREv2 users essentially reverts > what Hamza's [PATCH v13 3/3] did to this area. > > +#else > + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && > + !(!opt->ignore_case && (p->fixed || p->is_fixed))) > + options |= PCRE2_UTF; > +#endif > > I guess this is a lot of change in the amount of text involved but > the least amount of actual change in the behaviour. For those with > newer PCREv2, the behaviour would be the same as v2.34.0, and for > others, the behaviour would be the same as v2.33.0. > > Having said all that, because the consensus seems to be that the > whole "when we should match in UTF mode" may need to be rethought, I > think reverting Hamza's [v13 3/3] would be the simplest way to clean > up the mess for v2.34.1 that will give us a cleaner slate to later > build on, than applying this patch. Makes sense to me. It gives a better starting point to solve the issue afresh without getting entangled in mind-melting boolean expressions. René
On Thu, Nov 18 2021, René Scharfe wrote: > Am 18.11.21 um 19:17 schrieb Junio C Hamano: >> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > [...] >> I guess this is a lot of change in the amount of text involved but >> the least amount of actual change in the behaviour. For those with >> newer PCREv2, the behaviour would be the same as v2.34.0, and for >> others, the behaviour would be the same as v2.33.0. >> >> Having said all that, because the consensus seems to be that the >> whole "when we should match in UTF mode" may need to be rethought, I >> think reverting Hamza's [v13 3/3] would be the simplest way to clean >> up the mess for v2.34.1 that will give us a cleaner slate to later >> build on, than applying this patch. > > Makes sense to me. It gives a better starting point to solve the issue > afresh without getting entangled in mind-melting boolean expressions. Yes, agreed. As noted I haven't had time to dig deeply into this, but from what I've seen so far there doesn't seem to be any obvious way forward in terms of a quick fix. I thought perhaps your patch would be that (but I haven't looked into it carefully enough), but since you're on-board with reverting & retrying.
Am 19.11.21 um 08:00 schrieb Ævar Arnfjörð Bjarmason: > > On Thu, Nov 18 2021, René Scharfe wrote: > >> Am 18.11.21 um 19:17 schrieb Junio C Hamano: >>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> [...] >>> I guess this is a lot of change in the amount of text involved but >>> the least amount of actual change in the behaviour. For those with >>> newer PCREv2, the behaviour would be the same as v2.34.0, and for >>> others, the behaviour would be the same as v2.33.0. >>> >>> Having said all that, because the consensus seems to be that the >>> whole "when we should match in UTF mode" may need to be rethought, I >>> think reverting Hamza's [v13 3/3] would be the simplest way to clean >>> up the mess for v2.34.1 that will give us a cleaner slate to later >>> build on, than applying this patch. >> >> Makes sense to me. It gives a better starting point to solve the issue >> afresh without getting entangled in mind-melting boolean expressions. > > Yes, agreed. As noted I haven't had time to dig deeply into this, but > from what I've seen so far there doesn't seem to be any obvious way > forward in terms of a quick fix. > > I thought perhaps your patch would be that (but I haven't looked into it > carefully enough), but since you're on-board with reverting & retrying. That patch should fix the edge case without any side-effects -- at least I haven't seen any reports of ill effects that would apply to it. It's easier to understand and reason about when applied after reverting, I think. But it's only for grep.c and I don't know the situation in t/. René
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > On Thu, Nov 18 2021, René Scharfe wrote: >> >> Makes sense to me. It gives a better starting point to solve the issue >> afresh without getting entangled in mind-melting boolean expressions. > > Yes, agreed. As noted I haven't had time to dig deeply into this, but > from what I've seen so far there doesn't seem to be any obvious way > forward in terms of a quick fix. > > I thought perhaps your patch would be that (but I haven't looked into it > carefully enough), but since you're on-board with reverting & retrying. OK. Here is what I'll queue 'master', in the hope of later merging down and be part of v2.34.1. Thanks, all. ----- >8 --------- >8 --------- >8 --------- >8 ----- Subject: [PATCH] Revert "grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data" This reverts commit ae39ba431ab861548eb60b4bd2e1d8b8813db76f, as it breaks "grep" when looking for a string in non UTF-8 haystack, when linked with certain versions of PCREv2 library. Signed-off-by: Junio C Hamano <gitster@pobox.com> --- grep.c | 6 ++--- t/t7812-grep-icase-non-ascii.sh | 48 --------------------------------- 2 files changed, 2 insertions(+), 52 deletions(-) diff --git a/grep.c b/grep.c index f6e113e9f0..fe847a0111 100644 --- a/grep.c +++ b/grep.c @@ -382,10 +382,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || - (!opt->ignore_locale && is_utf8_locale() && - has_non_ascii(p->pattern) && !(!opt->ignore_case && - (p->fixed || p->is_fixed)))) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && + !(!opt->ignore_case && (p->fixed || p->is_fixed))) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 22487d90fd..e5d1e4ea68 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -53,54 +53,6 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' ' test_cmp expected actual ' -test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on UTF-8 data' ' - cat >expected <<-\EOF && - Author: <BOLD;RED>À Ú Thor<RESET> <author@example.com> - EOF - test_write_lines "forth" >file4 && - git add file4 && - git commit --author="À Ú Thor <author@example.com>" -m sécond && - git log -1 --color=always --perl-regexp --author=".*Thor" >log && - grep Author log >actual.raw && - test_decode_color <actual.raw >actual && - test_cmp expected actual -' - -test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern on ISO-8859-1 data' ' - cat >expected <<-\EOF && - Commit: Ç<BOLD;RED> O Mîtter <committer@example.com><RESET> - EOF - test_write_lines "fifth" >file5 && - git add file5 && - GIT_COMMITTER_NAME="Ç O Mîtter" && - GIT_COMMITTER_EMAIL="committer@example.com" && - git -c i18n.commitEncoding=latin1 commit -m thïrd && - git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log && - grep Commit: log >actual.raw && - test_decode_color <actual.raw >actual && - test_cmp expected actual -' - -test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on UTF-8 data' ' - cat >expected <<-\EOF && - sé<BOLD;RED>con<RESET>d - EOF - git log -1 --color=always --perl-regexp --grep="con" >log && - grep con log >actual.raw && - test_decode_color <actual.raw >actual && - test_cmp expected actual -' - -test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on ISO-8859-1 data' ' - cat >expected <<-\EOF && - <BOLD;RED>thïrd<RESET> - EOF - git -c i18n.logOutputEncoding=latin1 log -1 --color=always --perl-regexp --grep="th.*rd" >log && - grep "th.*rd" log >actual.raw && - test_decode_color <actual.raw >actual && - test_cmp expected actual -' - test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' ' printf "\\200\\n" >invalid-0x80 && echo "ævar" >expected &&
On Fri, Nov 19, 2021 at 8:08 AM René Scharfe <l.s.r@web.de> wrote: > > Am 19.11.21 um 08:00 schrieb Ævar Arnfjörð Bjarmason: > > > > On Thu, Nov 18 2021, René Scharfe wrote: > > > >> Am 18.11.21 um 19:17 schrieb Junio C Hamano: > >>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> [...] > >>> I guess this is a lot of change in the amount of text involved but > >>> the least amount of actual change in the behaviour. For those with > >>> newer PCREv2, the behaviour would be the same as v2.34.0, and for > >>> others, the behaviour would be the same as v2.33.0. > >>> > >>> Having said all that, because the consensus seems to be that the > >>> whole "when we should match in UTF mode" may need to be rethought, I > >>> think reverting Hamza's [v13 3/3] would be the simplest way to clean > >>> up the mess for v2.34.1 that will give us a cleaner slate to later > >>> build on, than applying this patch. > >> > >> Makes sense to me. It gives a better starting point to solve the issue > >> afresh without getting entangled in mind-melting boolean expressions. > > > > Yes, agreed. As noted I haven't had time to dig deeply into this, but > > from what I've seen so far there doesn't seem to be any obvious way > > forward in terms of a quick fix. > > > > I thought perhaps your patch would be that (but I haven't looked into it > > carefully enough), but since you're on-board with reverting & retrying. > > That patch should fix the edge case without any side-effects -- at least > I haven't seen any reports of ill effects that would apply to it. Since it isn't restricted to log, it will still cause a regression to the `git grep` case with binary data for versions of PCRE2 older than 10.34 and unlike the previous one it might not trigger an error in the testsuite just because we are missing a test for it. > It's > easier to understand and reason about when applied after reverting, I > think. But it's only for grep.c and I don't know the situation in t/. We had been focusing in PCRE in this discussion, but I see the strict behaviour of older PCRE2 as just a "coal mine canary" to point to the bigger problem that we are expecting git's regex to handle safely and correctly from the point of UTF, what is technically a binary match with --color making the mismatch obvious. The issue is not unique to PCRE, and seems Ævar also acknowledges[1] that by seeing the same bug this was attempting to fix with probably some version of glibc's ERE. I suspect FreeBSD's (and derivatives) is also broken and might be throwing REGILLSEQ errors as well, so I think that it is better to revert the whole thing. Carlo [1] https://lore.kernel.org/git/211119.86r1bc4om5.gmgdl@evledraar.gmail.com/
diff --git a/grep.c b/grep.c index fe847a0111..f6e113e9f0 100644 --- a/grep.c +++ b/grep.c @@ -382,8 +382,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && - !(!opt->ignore_case && (p->fixed || p->is_fixed))) + if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) || + (!opt->ignore_locale && is_utf8_locale() && + has_non_ascii(p->pattern) && !(!opt->ignore_case && + (p->fixed || p->is_fixed)))) options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index e5d1e4ea68..22487d90fd 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -53,6 +53,54 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' ' test_cmp expected actual ' +test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on UTF-8 data' ' + cat >expected <<-\EOF && + Author: <BOLD;RED>À Ú Thor<RESET> <author@example.com> + EOF + test_write_lines "forth" >file4 && + git add file4 && + git commit --author="À Ú Thor <author@example.com>" -m sécond && + git log -1 --color=always --perl-regexp --author=".*Thor" >log && + grep Author log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern on ISO-8859-1 data' ' + cat >expected <<-\EOF && + Commit: Ç<BOLD;RED> O Mîtter <committer@example.com><RESET> + EOF + test_write_lines "fifth" >file5 && + git add file5 && + GIT_COMMITTER_NAME="Ç O Mîtter" && + GIT_COMMITTER_EMAIL="committer@example.com" && + git -c i18n.commitEncoding=latin1 commit -m thïrd && + git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log && + grep Commit: log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on UTF-8 data' ' + cat >expected <<-\EOF && + sé<BOLD;RED>con<RESET>d + EOF + git log -1 --color=always --perl-regexp --grep="con" >log && + grep con log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,PCRE 'log --grep with an ascii pattern on ISO-8859-1 data' ' + cat >expected <<-\EOF && + <BOLD;RED>thïrd<RESET> + EOF + git -c i18n.logOutputEncoding=latin1 log -1 --color=always --perl-regexp --grep="th.*rd" >log && + grep "th.*rd" log >actual.raw && + test_decode_color <actual.raw >actual && + test_cmp expected actual +' + test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' ' printf "\\200\\n" >invalid-0x80 && echo "ævar" >expected &&