mbox series

[RFC/PATCH,0/7] grep: move from kwset to optional PCRE v2

Message ID 20190626000329.32475-1-avarab@gmail.com (mailing list archive)
Headers show
Series grep: move from kwset to optional PCRE v2 | expand

Message

Ævar Arnfjörð Bjarmason June 26, 2019, 12:03 a.m. UTC
This speeds things up a lot, but as shown in the patches & tests
changed modifies the behavior where we have \0 in *patterns* (only
possible with 'grep -f <file>').

I'd like to go down this route because it makes dropping kwset a lot
easier, and I don't think bending over backwards to support these \0
patterns is worth it.

But maybe others disagree, so I wanted to send what I had before I
tried tackling the pickaxe code. There I figured I'd just make -G's
ERE be a PCRE if we had the PCRE v2 backend, since unlike "grep"'s
default BRE the ERE syntax is mostly a subset of PCRE, but again
others might thing that's too aggressive and would prefer to keep the
distinction, only using PCRE there in place of our current use of
kwset.

Ævar Arnfjörð Bjarmason (7):
  grep: inline the return value of a function call used only once
  grep tests: move "grep binary" alongside the rest
  grep tests: move binary pattern tests into their own file
  grep: make the behavior for \0 in patterns sane
  grep: drop support for \0 in --fixed-strings <pattern>
  grep: remove the kwset optimization
  grep: use PCRE v2 for optimized fixed-string search

 Documentation/git-grep.txt                    |  17 +++
 grep.c                                        | 103 ++++++--------
 grep.h                                        |   2 -
 ...a1.sh => t7008-filter-branch-null-sha1.sh} |   0
 ...08-grep-binary.sh => t7815-grep-binary.sh} | 101 --------------
 t/t7816-grep-binary-pattern.sh                | 127 ++++++++++++++++++
 6 files changed, 183 insertions(+), 167 deletions(-)
 rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%)
 rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (55%)
 create mode 100755 t/t7816-grep-binary-pattern.sh

Comments

Johannes Schindelin June 26, 2019, 2:02 p.m. UTC | #1
Hi Ævar,

On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:

> This speeds things up a lot, but as shown in the patches & tests
> changed modifies the behavior where we have \0 in *patterns* (only
> possible with 'grep -f <file>').

I agree that it is not worth a lot to care about NULs in search patterns.

So I am in favor of the goal of this patch series.

Ciao,
Dscho
Johannes Schindelin June 27, 2019, 9:16 a.m. UTC | #2
Hi Ævar,

On Wed, 26 Jun 2019, Johannes Schindelin wrote:

> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
>
> > This speeds things up a lot, but as shown in the patches & tests
> > changed modifies the behavior where we have \0 in *patterns* (only
> > possible with 'grep -f <file>').
>
> I agree that it is not worth a lot to care about NULs in search patterns.
>
> So I am in favor of the goal of this patch series.

There seems to be a Windows-specific test failure:
https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug

The output is this:

-- snip --
not ok 5 - log --grep does not find non-reencoded values (latin1)

expecting success:
	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual
&&
	test_must_be_empty actual

++ git log --encoding=ISO-8859-1 --format=%s --grep=é
fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits
not 0x80
-- snap --

Any quick ideas? (I _could_ imagine that it is yet another case of passing
non-UTF-8-encoded stuff via command-line vs via file, which does not work
on Windows.)

Ciao,
Dscho
Ævar Arnfjörð Bjarmason June 27, 2019, 4:27 p.m. UTC | #3
On Thu, Jun 27 2019, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Wed, 26 Jun 2019, Johannes Schindelin wrote:
>
>> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
>>
>> > This speeds things up a lot, but as shown in the patches & tests
>> > changed modifies the behavior where we have \0 in *patterns* (only
>> > possible with 'grep -f <file>').
>>
>> I agree that it is not worth a lot to care about NULs in search patterns.
>>
>> So I am in favor of the goal of this patch series.
>
> There seems to be a Windows-specific test failure:
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug
>
> The output is this:
>
> -- snip --
> not ok 5 - log --grep does not find non-reencoded values (latin1)
>
> expecting success:
> 	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual
> &&
> 	test_must_be_empty actual
>
> ++ git log --encoding=ISO-8859-1 --format=%s --grep=é
> fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits
> not 0x80
> -- snap --
>
> Any quick ideas? (I _could_ imagine that it is yet another case of passing
> non-UTF-8-encoded stuff via command-line vs via file, which does not work
> on Windows.)

This is an existing issue that my patches just happen to uncover. I'm
working on a v2 which'll fix it.
Johannes Schindelin June 27, 2019, 6:21 p.m. UTC | #4
Hi Ævar,

On Thu, 27 Jun 2019, Ævar Arnfjörð Bjarmason wrote:

>
> On Thu, Jun 27 2019, Johannes Schindelin wrote:
>
> > On Wed, 26 Jun 2019, Johannes Schindelin wrote:
> >
> >> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> > This speeds things up a lot, but as shown in the patches & tests
> >> > changed modifies the behavior where we have \0 in *patterns* (only
> >> > possible with 'grep -f <file>').
> >>
> >> I agree that it is not worth a lot to care about NULs in search patterns.
> >>
> >> So I am in favor of the goal of this patch series.
> >
> > There seems to be a Windows-specific test failure:
> > https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug
> >
> > The output is this:
> >
> > -- snip --
> > not ok 5 - log --grep does not find non-reencoded values (latin1)
> >
> > expecting success:
> > 	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual
> > &&
> > 	test_must_be_empty actual
> >
> > ++ git log --encoding=ISO-8859-1 --format=%s --grep=é
> > fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits
> > not 0x80
> > -- snap --
> >
> > Any quick ideas? (I _could_ imagine that it is yet another case of passing
> > non-UTF-8-encoded stuff via command-line vs via file, which does not work
> > on Windows.)
>
> This is an existing issue that my patches just happen to uncover. I'm
> working on a v2 which'll fix it.

I just found yet another problem. When using a libpcre2 _without_ JIT
support, I get this:

-- snip --
$ sh t4210-log-i18n.sh -i -V -x || tail -20
test-results/t4210-log-i18n.out
ok 1 - create commits in different encodings
ok 2 - log --grep searches in log output encoding (utf8)
ok 3 # skip log --grep searches in log output encoding (latin1) (missing !MINGW)
ok 4 # skip log --grep does not find non-reencoded values (utf8) (missing !MINGW)
not ok 5 - log --grep does not find non-reencoded values (latin1)
#
#               git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e
#               >actual &&
#               test_must_be_empty actual
#
ok 3 # skip log --grep searches in log output encoding (latin1) (missing !MINGW)

skipping test: log --grep does not find non-reencoded values (utf8)
        git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
        test_must_be_empty actual

ok 4 # skip log --grep does not find non-reencoded values (utf8) (missing !MINGW)

expecting success:
        git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
        test_must_be_empty actual

++ git log --encoding=ISO-8859-1 --format=%s --grep=é
fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits not 0x80
error: last command exited with $?=128
not ok 5 - log --grep does not find non-reencoded values (latin1)
#
#               git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e
#               >actual &&
#               test_must_be_empty actual
#
-- snap --

This is actually a correct error, as we specifically feed non-UTF-8 text
to PCRE2, but we do turn on the PCRE2_UTF flag.

Funnily enough, this error only occurs when `pcre2_jit_on == 0`, i.e. when
we hit the code path that calls `pcre2_match()`. When the alternative code
path is used, `pcre2_jit_match()` is called, and it does _not_ print that
error.

Whatever the bug in libpcre2 that causes the JIT code path to fail on the
Unicode validation, it points to the problem in this code in
`compile_pcre2_pattern()`:

-- snip --
        if (is_utf8_locale() && has_non_ascii(p->pattern))
                options |= PCRE2_UTF;
-- snap --

It only asks whether there is a non-ASCII character in pattern, but we
never bother to see whether the haystack is also encoded in UTF-8. In this
case, it is not...

Ciao,
Dscho

P.S.: Yes, yes, I know, we should run PCRE2 with JIT, and I am trying to
figure out why it is not enabled on Windows when I specifically asked
`./configure` to enable it... Investigating now.

-