Message ID | 508ca102-63a9-6334-fee8-7a1ae84c7a23@cs.ucla.edu (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Compatibility between GNU and Git grep -P | expand |
On Fri, Apr 21, 2023 at 2:11 PM Paul Eggert <eggert@cs.ucla.edu> wrote: > > In <https://lists.gnu.org/r/grep-devel/2023-04/msg00017.html> Carlo > Marcelo Arenas Belón wrote: > > > After using this for a while think the following will be better suited > > for a release because: > > > > * the unreleased PCRE2 code is still changing and is unlikely to be released > > for a couple of months. > > * the current way to configure PCRE2 make it difficult to link with the > > unreleased code (this might be an independent bug), but it is likely that > > the wrong headers might be used by mistake. > > * the tests and documentation were not completely accurate. Just to clarify; those points were made about the GNU grep codebase, hence are not really relevant about git's which had an independent thread[1] and that will be better to use instead to avoid more confusion. > Thanks for looking into this. I'm concerned about the resulting patches, > though, because I see recent activity in on the Git grep -P side here: > > https://lore.kernel.org/git/xmqqzgaf2zpt.fsf@gitster.g/ This is really not that recent, and has been released already with git 2.40, so at least at that point in time git and grep 3.9 were consistent. That was changed with grep 3.10 though. FWIW, it doesn't seem git had any issues (other than the crasher with PCRE2 10.34) with the transition to matching multibyte digits with '\d' and which is what perl (and therefore PCRE2) does, but as I explained in the other thread I think it might be wise (on the context of what is usually matched against with git) to not do that in the long term, and was therefore working on adding the necessary features to PCRE2 to be able to do so. Note that no decision has been made though, which is why I didn't even bother sending an RFC patch. > Given Jim's strong desire that \d should match only ASCII digits, I > doubt whether GNU grep will simply use PCRE2_UCP without > PCRE2_EXTRA_ASCII_BSD. My assumption is that you would also need PCRE2_EXTRA_ASCII_DIGIT, and indeed bleeding edge pcre2grep[2] had a compatibility option added assuming as much. > Either way, we should see what the Git folks say about this. The proposed patch for git would IMHO just cause the same risk I was trying to prevent with my proposed change to GNU grep. There are no plans to release PCRE2 10.43 and based on its regular cadence wouldn't be for another couple of months, so this code is a little premature and will need updating eitherway. Carlo [1] https://lore.kernel.org/git/2554712d-e386-3bab-bc6c-1f0e85d999db@cs.ucla.edu/ [2] https://github.com/PCRE2Project/pcre2/commit/3bbdb6dd713b39868934fdc978ba61b81da6d1c5
On 2023-04-21 15:05, Carlo Arenas wrote: > This is really not that recent, and has been released already with git > 2.40, so at least at that point in time git and grep 3.9 were > consistent. That was changed with grep 3.10 though. OK, so currently GNU grep -P and git grep -P treat \d differently (the former matches only ASCII digits; the latter matches all decimal digits). And the next GNU grep -P (i.e., the one currently on Savannah master), when combined with a future PCRE2 release, is planned to change behavior in other areas, but the two programs will continue to treat \d differently. Not good, obviously. > FWIW, it doesn't seem git had any issues (other than the crasher with > PCRE2 10.34) with the transition to matching multibyte digits with > '\d' and which is what perl (and therefore PCRE2) does I suspect the issues that Jim is worried about have more to do with people attacking grep-using programs with data that unexpectedly match \d. This is more likely to happen with GNU grep -P (which gets all sorts of weird junk thrown at it) than with Git grep -P (which tends to lead a more cloistered life). If my suspicion is right, then even if Git users don't have issues with PCRE2_UCP and \d, that doesn't mean GNU grep users would be free of such issues. > My assumption is that you would also need PCRE2_EXTRA_ASCII_DIGIT, and > indeed bleeding edge pcre2grep[2] had a compatibility option added > assuming as much. Although I wasn't aware of that future PCRE2 option, I am not sure GNU grep -P should use it. As things stand, the next GNU grep -P release, when combined with the next PCRE2 release, will have \d match ASCII digits only, and will have [[:digit:]] match all decimal digits. This is compatible with how plain GNU grep [[:digit:]] works (plain GNU grep lacks \d of course, so there's no compatibility issue there). Quite possibly GNU grep -P should retain [[:digit:]] compatibility with plain grep by not using PCRE2_EXTRA_ASCII_DIGIT. Though it'd be unfortunate that \d would not mean the same thing as [[:digit:]], that might be better than the alternative of having [[:digit:]] mean something different with -P than without -P. > The proposed patch for git would IMHO just cause the same risk I was > trying to prevent with my proposed change to GNU grep. It sounds like this ship has already sailed. At best we can now try to repair it. For now I installed the attached documentation patch. I left the grep code alone as we are so close to a release.
From 5f5e54157a01c540bde02c305c8ee5e1a39d4f1c Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@cs.ucla.edu> Date: Fri, 21 Apr 2023 14:06:25 -0700 Subject: [PATCH] grep: be compatible with GNU grep -P Use PCRE2_UCP only when PCRE2_EXTRA_ASCII_BSD is defined, for compatibility with GNU grep. --- grep.c | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/grep.c b/grep.c index 073559f2cd..e9dc8dc0bc 100644 --- a/grep.c +++ b/grep.c @@ -320,8 +320,13 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && !literal) - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); + if (!opt->ignore_locale && is_utf8_locale() && !literal) { + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); +#ifdef PCRE2_EXTRA_ASCII_BSD + /* Be compatible with GNU grep -P '\d'. */ + options |= (PCRE2_UCP | PCRE2_EXTRA_ASCII_BSD); +#endif + } #ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER /* -- 2.39.2