diff mbox series

[1/2] grep/pcre2: limit the instances in which UTF mode is enabled

Message ID 20211118084143.279174-1-someguy@effective-light.com (mailing list archive)
State New, archived
Headers show
Series [1/2] grep/pcre2: limit the instances in which UTF mode is enabled | expand

Commit Message

Hamza Mahfooz Nov. 18, 2021, 8:41 a.m. UTC
UTF mode is enabled for cases that cause older versions of PCRE to break.
This is primarily due to the fact that we can't make as many assumptions on
the kind of data that is fed to "git grep." So, limit when UTF mode can be
enabled by introducing "is_log" to struct grep_opt, checking to see if it's
a non-zero value in compile_pcre2_pattern() and only mutating it in
cmd_log() so that we know "git log" was invoked if it's set to a non-zero
value.

Fixes: ae39ba431a (grep/pcre2: fix an edge case concerning ascii patterns
and UTF-8 data, 2021-10-15)
Suggested-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
---
 builtin/log.c                   | 1 +
 grep.c                          | 2 +-
 grep.h                          | 1 +
 t/t7812-grep-icase-non-ascii.sh | 2 +-
 4 files changed, 4 insertions(+), 2 deletions(-)

Comments

Carlo Marcelo Arenas Belón Nov. 18, 2021, 10:04 a.m. UTC | #1
On Thu, Nov 18, 2021 at 12:42 AM Hamza Mahfooz
<someguy@effective-light.com> wrote:
>
> UTF mode is enabled for cases that cause older versions of PCRE to break.

Not really; what is broken is our implementation of how PCRE gets
called and that ignores the fact that giving it invalid UTF-8 (which
might be valid LATIN-1 text for example) and telling it to do a match
using UTF, will fail (if we are lucky even with an error) or might
even crash (and obviously don't match) if we also tell it to not do
the validation, and which is something we do when JIT is enabled.

> This is primarily due to the fact that we can't make as many assumptions on
> the kind of data that is fed to "git grep." So, limit when UTF mode can be
> enabled by introducing "is_log" to struct grep_opt, checking to see if it's
> a non-zero value in compile_pcre2_pattern() and only mutating it in
> cmd_log() so that we know "git log" was invoked if it's set to a non-zero
> value.

I haven't tested it, but I think that for this to work with the log,
we also need to make sure that all log entries that might not be UTF-8
get first iconv() which is why probably Æevar mentioned[1]
i18n.commitEncoding in his old email.

Of course doing that translation only makes sense if the log output is
meant to be UTF-8 which is why there is all that logic about being in
an UTF-8 locale or not which probably needs to be adjusted as well.

Carlo

[1] https://lore.kernel.org/git/87v92bju64.fsf@evledraar.gmail.com/
Carlo Marcelo Arenas Belón Nov. 18, 2021, 7:40 p.m. UTC | #2
On Thu, Nov 18, 2021 at 02:04:08AM -0800, Carlo Arenas wrote:
> 
> I haven't tested it, but I think that for this to work with the log,
> we also need to make sure that all log entries that might not be UTF-8
> get first iconv() which is why probably Æevar mentioned[1]
> i18n.commitEncoding in his old email.

Having tested it, it seems this might work, but needs at least the
following to make the test for it meaningful:

diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 1da6b07a57..e1fbdc0f80 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -47,6 +47,8 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 '
 
 test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
+	GIT_COMMITTER_NAME="C O Mitter" &&
+	GIT_COMMITTER_EMAIL="committer@example.com" &&
 	git commit -m first &&
 	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
 	echo first >expected &&
@@ -72,10 +74,10 @@ test_expect_success GETTEXT_LOCALE,PCRE 'log --committer with an ascii pattern o
 	EOF
 	test_write_lines "fifth" >file5 &&
 	git add file5 &&
-	GIT_COMMITTER_NAME="Ç O Mîtter" &&
+	GIT_COMMITTER_NAME=$(echo "Ç O Mîtter" | iconv -t ISO-8859-1) &&
 	GIT_COMMITTER_EMAIL="committer@example.com" &&
 	git -c i18n.commitEncoding=latin1 commit -m thïrd &&
-	git -c i18n.logOutputEncoding=latin1 log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
+	git log -1 --pretty=fuller --color=always --perl-regexp --committer=" O.*" >log &&
 	grep Commit: log >actual.raw &&
 	test_decode_color <actual.raw >actual &&
 	test_cmp expected actual

The first hunk is a nice to have and mimics what is done in the previous
test where a non UTF match will match instead an equivalent ASCII text,
but is not strictly needed unless the expression is also tightened.

Note the second hunk is incomplete as the message is still not really
encoded as latin1 and will need an equivalent processing, but left it
out so it can be done together with fixing the corresponding test.

I have to admit that adding to a complex condition might be obscuring
some other edge case, and indeed the fact the test passed even without
this fix is concerning.

From my reading of what Ævar suggested originally[1], it would seem that
the new is_log condition should be on an branch of its own, and more code
should be needed to make sure the contents we are going to use are indeed
utf8 by the time it hits pcre2_*match() before setting it.

Carlo

[1] https://lore.kernel.org/git/211116.86tugcp8mg.gmgdl@evledraar.gmail.com/
Junio C Hamano Nov. 18, 2021, 8:53 p.m. UTC | #3
Carlo Marcelo Arenas Belón <carenas@gmail.com> writes:

> I have to admit that adding to a complex condition might be obscuring
> some other edge case, and indeed the fact the test passed even without
> this fix is concerning.
>
> From my reading of what Ævar suggested originally[1], it would seem that
> the new is_log condition should be on an branch of its own, and more code
> should be needed to make sure the contents we are going to use are indeed
> utf8 by the time it hits pcre2_*match() before setting it.

At least, let's not consider pursuing this approach to pile on more
iffy code on top of broken code in the release.  Unless I hear
otherwise, I am leaning to revert the whole three patches of the
original series and call it a day for both 'maint' (for 2.34.1) and
'master'.

Thanks.
diff mbox series

Patch

diff --git a/builtin/log.c b/builtin/log.c
index f75d87e8d7..040b0b533f 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -751,6 +751,7 @@  int cmd_log(int argc, const char **argv, const char *prefix)
 	git_config(git_log_config, NULL);
 
 	repo_init_revisions(the_repository, &rev, prefix);
+	rev.grep_filter.is_log = 1;
 	rev.always_show_header = 1;
 	memset(&opt, 0, sizeof(opt));
 	opt.def = "HEAD";
diff --git a/grep.c b/grep.c
index f6e113e9f0..665d86f007 100644
--- a/grep.c
+++ b/grep.c
@@ -382,7 +382,7 @@  static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
+	if ((opt->is_log && !opt->ignore_locale && !has_non_ascii(p->pattern)) ||
 	    (!opt->ignore_locale && is_utf8_locale() &&
 	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
 					    (p->fixed || p->is_fixed))))
diff --git a/grep.h b/grep.h
index 3e8815c347..64634c6a3f 100644
--- a/grep.h
+++ b/grep.h
@@ -167,6 +167,7 @@  struct grep_opt {
 	int extended_regexp_option;
 	int pattern_type_option;
 	int ignore_locale;
+	int is_log;
 	char colors[NR_GREP_COLORS][COLOR_MAXLEN];
 	unsigned pre_context;
 	unsigned post_context;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 22487d90fd..1da6b07a57 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -60,7 +60,7 @@  test_expect_success GETTEXT_LOCALE,PCRE 'log --author with an ascii pattern on U
 	test_write_lines "forth" >file4 &&
 	git add file4 &&
 	git commit --author="À Ú Thor <author@example.com>" -m sécond &&
-	git log -1 --color=always --perl-regexp --author=".*Thor" >log &&
+	git log -1 --color=always --perl-regexp --author=". . Thor" >log &&
 	grep Author log >actual.raw &&
 	test_decode_color <actual.raw >actual &&
 	test_cmp expected actual