diff mbox series

[v2] grep: skip UTF8 checks explicitly

Message ID 20190828145444.31588-1-carenas@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v2] grep: skip UTF8 checks explicitly | expand

Commit Message

Carlo Marcelo Arenas Belón Aug. 28, 2019, 2:54 p.m. UTC
18547aacf5 ("grep/pcre: support utf-8", 2016-06-25) that was released
with git 2.10 added the PCRE_UTF8 flag to PCRE1 matching including a
call to has_non_ascii() to try to avoid breakage if there was non-utf8
encoded content in the haystack.

Usually PCRE is compiled with JIT support (even if is not the default),
and therefore the codepath used includes calling pcre_jit_exec, which
skips UTF-8 validation by design (which might result in crashes or hangs)
but when JIT support wasn't compiled we use pcre_exec instead with the
posibility that grep might be aborted if invalid UTF-8 is found in the
haystack.

PCRE1 provides a flag since Mar 5, 2007 that could be used to skip the
checks explicitly so use that to make both codepaths equivalent (the
flag is ignored by pcre1_jit_exec)

this fix is only implemented for PCRE1 because PCRE2 is likely to have
a better solution (without the risks) instead in the future

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
V2:
* drop PCRE2 support
* add backward compatibility define

 grep.c | 4 ++--
 grep.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

Comments

Carlo Marcelo Arenas Belón Aug. 28, 2019, 4:57 p.m. UTC | #1
FWIW, the changes in grep.h are IMHO optional and hadn't been really
tested as I couldn't find a version of PCRE1 old enough to trigger it
but I am hoping are simple enough to work.

As in my original proposal, there is no test because this is going to
(as documented) trigger undefined behaviour and so I don't see how we
can reliably test that.  Goes without saying tthough, that the same is
triggered when JIT is enabled and the JIT fast path is being used, so
this is not a regression and will be more visible once
ab/pcre-jit-fixes gets merged because we are moving out of the JIT
fast path with a patch[0] in that series

While Ævar made a point[1] that this wasn't a solution to the problem,
it was because PCRE2 could have a better one (still missing but based
on ab/pcre-jit-fixes), and it was expected that PCRE2 will be used a
lot more and so it wasn't a good idea for it (something that doesn't
apply to PCRE1)

the patch was based on maint (like all bugfixes) but applies cleanly
to master and next, it will conflict with pu but for easy of merge I'd
applied it on top of cb/pcre1-cleanup and make it available in
GitHub[2]; that branch could be used as well as a reroll for that
topic (if that is preferred)

the error message hasn't been changed on this patch, as it might make
sense to improve it as well for PCRE2, but at least shouldn't be
triggered anymore (ex, from running a freshly built git without the
patch and linked against a non JIT enabled PCRE1):

$ ./git-grep -P 'Nguyễn Thái.Ngọc'
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
fatal: pcre_exec failed with error code -10

Carlo

[0] 685668faaa (grep: stop using a custom JIT stack with PCRE v1, 2019-07-26)
[1] https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/
[2] https://github.com/carenas/git/tree/pcre1-cleanup
Carlo Marcelo Arenas Belón Sept. 9, 2019, 3:07 p.m. UTC | #2
ping

any feedback on code/approach highly appreciated

Carlo

On Wed, Aug 28, 2019 at 9:57 AM Carlo Arenas <carenas@gmail.com> wrote:
>
> FWIW, the changes in grep.h are IMHO optional and hadn't been really
> tested as I couldn't find a version of PCRE1 old enough to trigger it
> but I am hoping are simple enough to work.
>
> As in my original proposal, there is no test because this is going to
> (as documented) trigger undefined behaviour and so I don't see how we
> can reliably test that.  Goes without saying tthough, that the same is
> triggered when JIT is enabled and the JIT fast path is being used, so
> this is not a regression and will be more visible once
> ab/pcre-jit-fixes gets merged because we are moving out of the JIT
> fast path with a patch[0] in that series
>
> While Ævar made a point[1] that this wasn't a solution to the problem,
> it was because PCRE2 could have a better one (still missing but based
> on ab/pcre-jit-fixes), and it was expected that PCRE2 will be used a
> lot more and so it wasn't a good idea for it (something that doesn't
> apply to PCRE1)
>
> the patch was based on maint (like all bugfixes) but applies cleanly
> to master and next, it will conflict with pu but for easy of merge I'd
> applied it on top of cb/pcre1-cleanup and make it available in
> GitHub[2]; that branch could be used as well as a reroll for that
> topic (if that is preferred)
>
> the error message hasn't been changed on this patch, as it might make
> sense to improve it as well for PCRE2, but at least shouldn't be
> triggered anymore (ex, from running a freshly built git without the
> patch and linked against a non JIT enabled PCRE1):
>
> $ ./git-grep -P 'Nguyễn Thái.Ngọc'
> .mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> fatal: pcre_exec failed with error code -10
>
> Carlo
>
> [0] 685668faaa (grep: stop using a custom JIT stack with PCRE v1, 2019-07-26)
> [1] https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/
> [2] https://github.com/carenas/git/tree/pcre1-cleanup
Junio C Hamano Sept. 9, 2019, 6:49 p.m. UTC | #3
Carlo Arenas <carenas@gmail.com> writes:

> ping
>
> any feedback on code/approach highly appreciated

I'd prefer to see others weigh in before starting to merge the pcre
stuff to 'next'.

I do not mind taking this updated one that limits the scope to pcre1
as a pure regressino fix, if others agree.  Thanks for keeping a tab
on these pcre issues.

>> While Ævar made a point[1] that this wasn't a solution to the problem,
>> it was because PCRE2 could have a better one (still missing but based
>> on ab/pcre-jit-fixes), and it was expected that PCRE2 will be used a
>> lot more and so it wasn't a good idea for it (something that doesn't
>> apply to PCRE1)
>>
>> the patch was based on maint (like all bugfixes) but applies cleanly
>> to master and next, it will conflict with pu but for easy of merge I'd
>> applied it on top of cb/pcre1-cleanup and make it available in
>> GitHub[2]; that branch could be used as well as a reroll for that
>> topic (if that is preferred)
>>
>> the error message hasn't been changed on this patch, as it might make
>> sense to improve it as well for PCRE2, but at least shouldn't be
>> triggered anymore (ex, from running a freshly built git without the
>> patch and linked against a non JIT enabled PCRE1):
>>
>> $ ./git-grep -P 'Nguyễn Thái.Ngọc'
>> .mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> fatal: pcre_exec failed with error code -10
>>
>> Carlo
>>
>> [0] 685668faaa (grep: stop using a custom JIT stack with PCRE v1, 2019-07-26)
>> [1] https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/
>> [2] https://github.com/carenas/git/tree/pcre1-cleanup
diff mbox series

Patch

diff --git a/grep.c b/grep.c
index f7c3a5803e..69ef69516e 100644
--- a/grep.c
+++ b/grep.c
@@ -421,7 +421,7 @@  static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 		regmatch_t *match, int eflags)
 {
-	int ovector[30], ret, flags = 0;
+	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
 
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
diff --git a/grep.h b/grep.h
index 1875880f37..9c8797a017 100644
--- a/grep.h
+++ b/grep.h
@@ -3,6 +3,9 @@ 
 #include "color.h"
 #ifdef USE_LIBPCRE1
 #include <pcre.h>
+#ifndef PCRE_NO_UTF8_CHECK
+#define PCRE_NO_UTF8_CHECK 0
+#endif
 #ifdef PCRE_CONFIG_JIT
 #if PCRE_MAJOR >= 8 && PCRE_MINOR >= 32
 #ifndef NO_LIBPCRE1_JIT