diff mbox series

[v3] grep: correctly identify utf-8 characters with \{b,w} in -P

Message ID 20230117105123.58328-1-carenas@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v3] grep: correctly identify utf-8 characters with \{b,w} in -P | expand

Commit Message

Carlo Marcelo Arenas Belón Jan. 17, 2023, 10:51 a.m. UTC
When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used
by the pcre2_compile() call, but that would only allow for the
use of Unicode character properties when caseless is required
but not to include the additional UTF characters for all other
class matches.

This would result in failed matches for expressions that rely
on those properties, for ex:

  $ git grep -P '\bÆvar'

Add a configuration that could be used to enable the PCRE2_UCP
flag to correctly match those cases, when required.

The use of this has an impact on performance that has been estimated
to be significant.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
Changes since v2:
* make setting UCP and opt-in as suggested by Ævar
* remove performance test and instead add a test

 Documentation/config/grep.txt |  6 ++++++
 grep.c                        | 11 ++++++++++-
 grep.h                        |  1 +
 t/t7810-grep.sh               | 13 +++++++++++++
 4 files changed, 30 insertions(+), 1 deletion(-)

Comments

Ævar Arnfjörð Bjarmason Jan. 17, 2023, 12:38 p.m. UTC | #1
On Tue, Jan 17 2023, Carlo Marcelo Arenas Belón wrote:

> When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used
> by the pcre2_compile() call, but that would only allow for the
> use of Unicode character properties when caseless is required
> but not to include the additional UTF characters for all other
> class matches.
>
> This would result in failed matches for expressions that rely
> on those properties, for ex:
>
>   $ git grep -P '\bÆvar'
>
> Add a configuration that could be used to enable the PCRE2_UCP
> flag to correctly match those cases, when required.
>
> The use of this has an impact on performance that has been estimated
> to be significant.
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
> Changes since v2:
> * make setting UCP and opt-in as suggested by Ævar

To argue with myself here, I'm not so sure that just making this the
default isn't the right move, especially as the GNU grep maintainer
seems to be convinced that that's the right thing for grep(1).

We've usually just followed GNU grep semantics, so if it's doing X and
we're doing Y after this it's probably better to unify our behavior with
theirs.
        
I was mainly concerned with the behavior change sneaking in as a mere
bugfix, I think it's OK if we change the behavior, as long as we're
going into it with our eyes open...

> * remove performance test and instead add a test

...I didn't follow the thread(s) where this may have been discussed, but
I for one would like to see a perf test with this, but maybe it was
removed for a good reason that I'm not aware of...



>  Documentation/config/grep.txt |  6 ++++++
>  grep.c                        | 11 ++++++++++-
>  grep.h                        |  1 +
>  t/t7810-grep.sh               | 13 +++++++++++++
>  4 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index e521f20390..8848db7311 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -26,3 +26,9 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>  	If set to true, fall back to git grep --no-index if git grep
>  	is executed outside of a git repository.  Defaults to false.
> +
> +pcre.ucp::
> +	If set to true, will use all Unicode Character Properties when matching
> +	`\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used
> +	as the underlying engine. If PCRE is not being used it is ignored.
> +	Defaults to false

There's a couple of exceptions to this, but we tend to stick config docs
in their corresponding Documentation/config/<namespace>.txt, so this
should be in a new Documentation/config/pcre.txt if we're adding this
name.

But I'd rather that we don't expose the implementation detail that we're
using PCRE, which we haven't done so far. We just have a
"grep.patternType=perl", which (and that's another issue, in any case)
we should explicitly mention here, i.e. that this is for use with that
config (and corresponding option(s)).

I think calling this e.g.:

	grep.perl.Unicode=<bool>
	grep.patternTypePerl.Unicode=<bool>

Or even:

	grep.patternTypePerl.Flags=u

Would be better, i.e. PCRE's C API is really just mapping to the flags
you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In
this case the /u flag maps to the "PCRE2_UCP" API flag.

That we happen to use PCRE to give ourselves "Perl" semantics is an
implementation detail we should avoid exposing, so we could either give
our config generic names, or literally map to the perl /flags/.

For now we could just die on any "Flags" value that isn't "u".

Of course all of this is predicated on us wanting to leave this as an
opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
entire question,

> diff --git a/grep.c b/grep.c
> index 06eed69493..ceafb8937d 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb)
>  			return config_error_nonbool(var);
>  		return color_parse(value, color);
>  	}
> +
> +	if (!strcmp(var, "pcre.ucp")) {
> +		opt->pcre_ucp = git_config_bool(var, value);
> +		return 0;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> +	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +		if (opt->pcre_ucp)
> +			options |= PCRE2_UCP;
> +	}

This interaction with locale settings etc. is probably correct, but if
we're keeping the config etc. we should really document how this
interacts with those.

I.e. you might expect "-c grep.patternType=perl -c
<whatever_the_setting_is>=true" to give you UCP semantics, but we'll
ignore it based on these other criteria.

>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> diff --git a/grep.h b/grep.h
> index 6075f997e6..082bd3a0c7 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -171,6 +171,7 @@ struct grep_opt {
>  	int file_break;
>  	int heading;
>  	int max_count;
> +	int pcre_ucp;
>  	void *priv;

The reason for why we have some "bool"-like settings (like "int
ignore_case") as an "int" as opposed to an "unsigned int <name>:1"
bitfield is because we need to take their address via the
parse_options() API.

But in this case it's a purely internal field, so (and again, if we're
keeping the option) let's use "unsigned int ...:1" here instead?

Unless that is, we're expecting a corresponding command-line option.

>  
>  	void (*output)(struct grep_opt *opt, const void *data, size_t size);
> diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
> index 8eded6ab27..a99a967060 100755
> --- a/t/t7810-grep.sh
> +++ b/t/t7810-grep.sh
> @@ -95,6 +95,7 @@ test_expect_success setup '
>  	then
>  		echo "¿" >reverse-question-mark
>  	fi &&
> +	echo "émotion" >ucp &&

Here we carry this file through the entirety ouf our tests...

>  	git add . &&
>  	test_tick &&
>  	git commit -m initial
> @@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE
>  	test_cmp hello_world actual
>  '
>  
> +test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' '
> +	cat >expected <<-\EOF &&
> +	ucp:émotion

...only to use it in this one test, it clearly didn't harm anything (or
rather, I didn't run this, but I expect you did and it passed), but how
about avoiding more global state here doing:

	test_when_finished "rm -rf repo" &&
	git init repo &&
	test_commit -C repo msg ucp émotion &&
	[...]

> +	EOF
> +	cat >pattern <<-\EOF &&
> +	\bémotion
> +	EOF
> +	LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual &&
> +	test_cmp expected actual &&
> +	LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern

This will break on platforms that don't have en_US.UTF-8 (and that's not
hypothetical, some systems will skip installing locales for various
reasons).

I see we have some almost recent breakage 1819ad327b7 (grep: fix
multibyte regex handling under macOS, 2022-08-26), but that one adds a
new MB_REGEX prereq, which presumably fails if we don't have this
locale.

You can just use the existing "GETTEXT_LOCALE", which will piggy-back on
our existing locale tests, or we could add a corresponding one for
en_US.UTF-8 and refactor the existing MB_REGEX to be in lib-gettext.sh
where it arguably belongs...

> +'
> +
>  test_expect_success 'grep -G invalidpattern properly dies ' '
>  	test_must_fail git grep -G "a["
>  '
Junio C Hamano Jan. 17, 2023, 3:19 p.m. UTC | #2
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> To argue with myself here, I'm not so sure that just making this the
> default isn't the right move, especially as the GNU grep maintainer
> seems to be convinced that that's the right thing for grep(1).

OK.

> I think calling this e.g.:
>
> 	grep.perl.Unicode=<bool>
> 	grep.patternTypePerl.Unicode=<bool>
>
> Or even:
>
> 	grep.patternTypePerl.Flags=u
>
> Would be better, i.e. PCRE's C API is really just mapping to the flags
> you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In
> this case the /u flag maps to the "PCRE2_UCP" API flag.
>
> That we happen to use PCRE to give ourselves "Perl" semantics is an
> implementation detail we should avoid exposing, so we could either give
> our config generic names, or literally map to the perl /flags/.
>
> For now we could just die on any "Flags" value that isn't "u".
>
> Of course all of this is predicated on us wanting to leave this as an
> opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
> entire question,

Making it opt-out would also require a similar knob to turn the
"flag" off, be it a configuration variable or a command line option,
wouldn't it?  I tend to agree with you that it makes sense to make
it a goal to take us closer to "grep -P" from GNU---do they have
such an opt-out knob?  If not, let's make it simple by turning it
always on, which would be the simplest ;-)

Again, thanks for a careful review with concrete points.
Carlo Marcelo Arenas Belón Jan. 18, 2023, 7:35 a.m. UTC | #3
On Tue, Jan 17, 2023 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
> > To argue with myself here, I'm not so sure that just making this the
> > default isn't the right move, especially as the GNU grep maintainer
> > seems to be convinced that that's the right thing for grep(1).
>
> OK.

I think that is definitely the right thing to do for grep, because the
current behaviour can only be described as a bug (and a bad one at
it), but after all the push back and performance testing, I am also
not convinced anymore it needs to be the default for git, because the
negatives outweigh the positives.

First there is the performance hit, which is inevitable because there
are just a lot more characters to match when UCP tables are being
used, and second there is the fact that PCRE2_UCP itself might not be
what you want when matching code, because for example numbers are
never going to be using digits outside what ASCII provides, and
identifiers have a narrow set of characters as valid than what you
would expect from all written human languages in history.

Lastly, even with PCRE2_UCP enabled, our current logic for word
matches is still broken, because the current code still uses a
definition of word that was done outside what the regex engines
provide and that roughly matches what you would expect of identifiers
from C in the ASCII times.

> > Of course all of this is predicated on us wanting to leave this as an
> > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
> > entire question,
>
> Making it opt-out would also require a similar knob to turn the
> "flag" off, be it a configuration variable or a command line option,
> wouldn't it?  I tend to agree with you that it makes sense to make
> it a goal to take us closer to "grep -P" from GNU---do they have
> such an opt-out knob?  If not, let's make it simple by turning it
> always on, which would be the simplest ;-)

GNU grep -P has no knob and would likely never have one.

So for now, I think we should acknowledge the bug, provide an option
for people that might need the fix, and fix all other problems we
have, which will include changes in PCRE2 as well to better fit our
use case.

Carlo
Ævar Arnfjörð Bjarmason Jan. 18, 2023, 11:49 a.m. UTC | #4
On Tue, Jan 17 2023, Carlo Arenas wrote:

> On Tue, Jan 17, 2023 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>
>> > To argue with myself here, I'm not so sure that just making this the
>> > default isn't the right move, especially as the GNU grep maintainer
>> > seems to be convinced that that's the right thing for grep(1).
>>
>> OK.
>
> I think that is definitely the right thing to do for grep, because the
> current behaviour can only be described as a bug (and a bad one at
> it), but after all the push back and performance testing, I am also
> not convinced anymore it needs to be the default for git, because the
> negatives outweigh the positives.
>
> First there is the performance hit, which is inevitable because there
> are just a lot more characters to match when UCP tables are being
> used,[...]

I'm less concerned about the performance, we should aim for correctness
first. We can always provide an opt-out (and the locale setting is
already that opt-out).

> and second there is the fact that PCRE2_UCP itself might not be
> what you want when matching code, because for example numbers are
> never going to be using digits outside what ASCII provides, and
> identifiers have a narrow set of characters as valid than what you
> would expect from all written human languages in history.

[0-9] will be ASCII, but \d will use [^0-9] Unicode numbers.

I agree it might not be expected by some, but I can't really square that
view in my mind with the desire to match "\bÆvar" :). After all that "Æ"
is also arbitrary byte garbage in the ASCII-view of the world.

I can see how it might be more practical in some cases to have "\b" have
Unicode semantics, but to specifically make "\d" an exception. But the
ship has sailed on that in Perl & PCRE land years (or more than a decade
ago). I think us coming up with some exception to that would probably
suck more than going with their behavior.

> Lastly, even with PCRE2_UCP enabled, our current logic for word
> matches is still broken, because the current code still uses a
> definition of word that was done outside what the regex engines
> provide and that roughly matches what you would expect of identifiers
> from C in the ASCII times.

Yes, FWIW I have some WIP patches somewhere to get rid of that bit of
grep.c if we're using PCRE. I.e. the "-w" should be powered by just
adding "\b" to the start/end of the provided string.

That'll then be correct, and faster.

I can't remember if there were some subtle bugs in that, or why I didn't
finish that...

>> > Of course all of this is predicated on us wanting to leave this as an
>> > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
>> > entire question,
>>
>> Making it opt-out would also require a similar knob to turn the
>> "flag" off, be it a configuration variable or a command line option,
>> wouldn't it?  I tend to agree with you that it makes sense to make
>> it a goal to take us closer to "grep -P" from GNU---do they have
>> such an opt-out knob?  If not, let's make it simple by turning it
>> always on, which would be the simplest ;-)
>
> GNU grep -P has no knob and would likely never have one.

I think the general knob in not just GNU grep but GNU utils and the
wider *nix landscape is "tweak your LC_ALL and/or other locale
varibales".

Which works for it, and will work for us once we're using PCRE2_UCP too.

> So for now, I think we should acknowledge the bug, provide an option
> for people that might need the fix, and fix all other problems we
> have, which will include changes in PCRE2 as well to better fit our
> use case.

Hrm, what are those PCRE2 changes? The one I saw so far (or was it a
proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU
grep is now doing in its unreleased version in its git repo...
Junio C Hamano Jan. 18, 2023, 4:20 p.m. UTC | #5
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> GNU grep -P has no knob and would likely never have one.
>
> I think the general knob in not just GNU grep but GNU utils and the
> wider *nix landscape is "tweak your LC_ALL and/or other locale
> varibales".
>
> Which works for it, and will work for us once we're using PCRE2_UCP too.
>
>> So for now, I think we should acknowledge the bug, provide an option
>> for people that might need the fix, and fix all other problems we
>> have, which will include changes in PCRE2 as well to better fit our
>> use case.
>
> Hrm, what are those PCRE2 changes? The one I saw so far (or was it a
> proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU
> grep is now doing in its unreleased version in its git repo...

Yeah, I didn't understand Carlo's comment in that paragraph at all.

In short, it sounds to me that the earlier one that added PCRE2_UCP
unconditionally would be the best alternative among those that have
been discussed.
Ævar Arnfjörð Bjarmason Jan. 18, 2023, 11:06 p.m. UTC | #6
On Wed, Jan 18 2023, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> GNU grep -P has no knob and would likely never have one.
>>
>> I think the general knob in not just GNU grep but GNU utils and the
>> wider *nix landscape is "tweak your LC_ALL and/or other locale
>> varibales".
>>
>> Which works for it, and will work for us once we're using PCRE2_UCP too.
>>
>>> So for now, I think we should acknowledge the bug, provide an option
>>> for people that might need the fix, and fix all other problems we
>>> have, which will include changes in PCRE2 as well to better fit our
>>> use case.
>>
>> Hrm, what are those PCRE2 changes? The one I saw so far (or was it a
>> proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU
>> grep is now doing in its unreleased version in its git repo...
>
> Yeah, I didn't understand Carlo's comment in that paragraph at all.
>
> In short, it sounds to me that the earlier one that added PCRE2_UCP
> unconditionally would be the best alternative among those that have
> been discussed.

I agree.
Junio C Hamano Jan. 18, 2023, 11:24 p.m. UTC | #7
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> In short, it sounds to me that the earlier one that added PCRE2_UCP
>> unconditionally would be the best alternative among those that have
>> been discussed.
>
> I agree.

So, ... we'll mark cb/grep-pcre-ucp, ea8bc435 (grep: correctly
identify utf-8 characters with \{b,w} in -P, 2023-01-08), to be
merged to 'next' and see what happens.  I'll add your Acked-by
while at it, if you do not mind.

---- >8 --------- >8 --------- >8 --------- >8 ------
From: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Date: Sun, 8 Jan 2023 07:52:17 -0800
Subject: [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P

When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to
include those new valid characters and therefore result in failed
matches for expressions that rely on that extention, for ex:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b could
correctly match the beginning of that word.

This has an impact on performance that has been estimated to be
between 20% to 40% and that is shown through the added performance
test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 grep.c                              |  2 +-
 t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)
 create mode 100755 t/perf/p7822-grep-perl-character.sh

diff --git a/grep.c b/grep.c
index 06eed69493..1687f65b64 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh
new file mode 100755
index 0000000000..87009c60df
--- /dev/null
+++ b/t/perf/p7822-grep-perl-character.sh
@@ -0,0 +1,42 @@
+#!/bin/sh
+
+test_description="git-grep's perl regex
+
+If GIT_PERF_GREP_THREADS is set to a list of threads (e.g. '1 4 8'
+etc.) we will test the patterns under those numbers of threads.
+"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+if test -n "$GIT_PERF_GREP_THREADS"
+then
+	test_set_prereq PERF_GREP_ENGINES_THREADS
+fi
+
+for pattern in \
+	'\\bhow' \
+	'\\bÆvar' \
+	'\\d+ \\bÆvar' \
+	'\\bBelón\\b' \
+	'\\w{12}\\b'
+do
+	echo '$pattern' >pat
+	if ! test_have_prereq PERF_GREP_ENGINES_THREADS
+	then
+		test_perf "grep -P '$pattern'" --prereq PCRE "
+			git -P grep -f pat || :
+		"
+	else
+		for threads in $GIT_PERF_GREP_THREADS
+		do
+			test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
+				git -c grep.threads=$threads -P grep -f pat || :
+			"
+		done
+	fi
+done
+
+test_done
diff mbox series

Patch

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index e521f20390..8848db7311 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -26,3 +26,9 @@  grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+pcre.ucp::
+	If set to true, will use all Unicode Character Properties when matching
+	`\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used
+	as the underlying engine. If PCRE is not being used it is ignored.
+	Defaults to false
diff --git a/grep.c b/grep.c
index 06eed69493..ceafb8937d 100644
--- a/grep.c
+++ b/grep.c
@@ -102,6 +102,12 @@  int grep_config(const char *var, const char *value, void *cb)
 			return config_error_nonbool(var);
 		return color_parse(value, color);
 	}
+
+	if (!strcmp(var, "pcre.ucp")) {
+		opt->pcre_ucp = git_config_bool(var, value);
+		return 0;
+	}
+
 	return 0;
 }
 
@@ -292,8 +298,11 @@  static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && !literal)
+	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		if (opt->pcre_ucp)
+			options |= PCRE2_UCP;
+	}
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/grep.h b/grep.h
index 6075f997e6..082bd3a0c7 100644
--- a/grep.h
+++ b/grep.h
@@ -171,6 +171,7 @@  struct grep_opt {
 	int file_break;
 	int heading;
 	int max_count;
+	int pcre_ucp;
 	void *priv;
 
 	void (*output)(struct grep_opt *opt, const void *data, size_t size);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 8eded6ab27..a99a967060 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -95,6 +95,7 @@  test_expect_success setup '
 	then
 		echo "¿" >reverse-question-mark
 	fi &&
+	echo "émotion" >ucp &&
 	git add . &&
 	test_tick &&
 	git commit -m initial
@@ -1474,6 +1475,18 @@  test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE
 	test_cmp hello_world actual
 '
 
+test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' '
+	cat >expected <<-\EOF &&
+	ucp:émotion
+	EOF
+	cat >pattern <<-\EOF &&
+	\bémotion
+	EOF
+	LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual &&
+	test_cmp expected actual &&
+	LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern
+'
+
 test_expect_success 'grep -G invalidpattern properly dies ' '
 	test_must_fail git grep -G "a["
 '