grep: correctly identify utf-8 characters with \{b,w} in -P

Message ID 20230108062335.72114-1-carenas@gmail.com (mailing list archive)
State Superseded
Series grep: correctly identify utf-8 characters with \{b,w} in -P

Commit Message

Carlo Marcelo Arenas Belón Jan. 8, 2023, 6:23 a.m. UTC
When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents the character classes from being extended to include
those new valid characters, and therefore results in failed matches
for expressions that rely on that extension, for example:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b can
correctly match the beginning of that word.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
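
To see why the boundary assertion fails without PCRE2_UCP, here is a
toy model of \b in plain C. It is not PCRE2's implementation. It
assumes \b holds exactly when the characters on either side differ in
\w-ness, and it approximates a Unicode-aware \w with ASCII plus the
Latin-1 letters only. The names is_word_ascii, is_word_ucp and
at_word_boundary are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII-only word character: [A-Za-z0-9_]. */
bool is_word_ascii(uint32_t cp)
{
	return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z') ||
	       (cp >= '0' && cp <= '9') || cp == '_';
}

/*
 * Hypothetical stand-in for a Unicode-aware \w: ASCII word characters
 * plus U+00C0..U+00FF, minus the multiplication (U+00D7) and division
 * (U+00F7) signs. Real Unicode properties cover far more than this.
 */
bool is_word_ucp(uint32_t cp)
{
	if (is_word_ascii(cp))
		return true;
	return cp >= 0xC0 && cp <= 0xFF && cp != 0xD7 && cp != 0xF7;
}

/*
 * Model of \b at offset i in the code point array cps[0..n): a
 * boundary exists when exactly one neighbour is a word character.
 * Positions outside the subject count as non-word.
 */
bool at_word_boundary(const uint32_t *cps, size_t n, size_t i,
		      bool (*is_word)(uint32_t))
{
	bool before = i > 0 && is_word(cps[i - 1]);
	bool after = i < n && is_word(cps[i]);

	return before != after;
}
```

With the subject Ævar given as the code points {0xC6, 'v', 'a', 'r'},
the ASCII classifier sees no boundary at offset 0, because the start
of the subject and Æ both count as non-word, so '\bÆvar' cannot match
there. It does see one at offset 1, between Æ and v. The Latin-1 aware
classifier gives the opposite answer at both offsets.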

Comments

Junio C Hamano Jan. 8, 2023, 6:39 a.m. UTC | #1
Carlo Marcelo Arenas Belón  <carenas@gmail.com> writes:

> When UTF is enabled for a PCRE match, the corresponding flags are
> added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

Would the same performance concern as

https://discourse.julialang.org/t/regex-pcre2-and-the-pcre2-ucp-ucp-flag/10930

apply to us as well?



>  	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> -		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>  
>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */

Patch

diff --git a/grep.c b/grep.c
index 06eed69493..1687f65b64 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@  static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */