diff mbox series

[14/25] pickaxe -S: remove redundant "sz" check in while-loop

Message ID 20210203032811.14979-15-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series grep: PCREv2 fixes, remove kwset.[ch] | expand

Commit Message

Ævar Arnfjörð Bjarmason Feb. 3, 2021, 3:28 a.m. UTC
If we walk to the end of the string we just won't match the rest of
the regex. This removes an optimization for simplicity's sake. In
subsequent commits we'll alter this code more, and not having to think
about this condition makes it easier to read.

If we look at the context of what we're doing here the last thing we
need to be worried about is one extra regex match. The real problem is
that we keep matching after it's clear that the number of contains()
for "A" and "B" is different. So we could be much smarter here.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 diffcore-pickaxe.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

René Scharfe Feb. 4, 2021, 4:16 p.m. UTC | #1
Am 03.02.21 um 04:28 schrieb Ævar Arnfjörð Bjarmason:
> If we walk to the end of the string we just won't match the rest of
> the regex. This removes an optimization for simplicity's sake. In
> subsequent commits we'll alter this code more, and not having to think
> about this condition makes it easier to read.
>
> If we look at the context of what we're doing here the last thing we
> need to be worried about is one extra regex match. The real problem is
> that we keep matching after it's clear that the number of contains()
> for "A" and "B" is different. So we could be much smarter here.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  diffcore-pickaxe.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
> index 208177bb40..8df76afb6e 100644
> --- a/diffcore-pickaxe.c
> +++ b/diffcore-pickaxe.c
> @@ -82,12 +82,11 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
>  		regmatch_t regmatch;
>  		int flags = 0;
>
> -		while (sz &&
> -		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
> +		while (!regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {

This will loop forever for regexes that match an empty string.  An
example would be /$/.  Silly, perhaps, but still I understand this check
less as an optimization and more as a correctness/robustness thing.

>  			flags |= REG_NOTBOL;
>  			data += regmatch.rm_eo;
>  			sz -= regmatch.rm_eo;
> -			if (sz && regmatch.rm_so == regmatch.rm_eo) {
> +			if (regmatch.rm_so == regmatch.rm_eo) {
>  				data++;
>  				sz--;
>  			}

Before, if the match was an empty string and there was more data after
it, then the code would consume a character anyway, in order to avoid
matching the same empty string again.  With the patch, that character
is consumed even if there is no more data.  This leaves 'data'
pointing beyond the buffer and 'sz' rolls over to ULONG_MAX.  Oops. :(

René
Junio C Hamano Feb. 4, 2021, 5:56 p.m. UTC | #2
René Scharfe <l.s.r@web.de> writes:

>> -		while (sz &&
>> -		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
>> +		while (!regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
>
> This will loop forever for regexes that match an empty string.  An
> example would be /$/.  Silly, perhaps, but still I understand this check
> less as an optimization and more as a correctness/robustness thing.
>
>>  			flags |= REG_NOTBOL;
>>  			data += regmatch.rm_eo;
>>  			sz -= regmatch.rm_eo;
>> -			if (sz && regmatch.rm_so == regmatch.rm_eo) {
>> +			if (regmatch.rm_so == regmatch.rm_eo) {
>>  				data++;
>>  				sz--;
>>  			}
>
> Before, if the match was an empty string and there was more data after
> it, then the code would consume a character anyway, in order to avoid
> matching the same empty string again.  With the patch, that character
> is consumed even if there is no more data.  This leaves 'data'
> pointing beyond the buffer and 'sz' rolls over to ULONG_MAX.  Oops. :(

While I do not care too much about NUL in the haystack, I do not
mind [13/25] either.  But this is bad.

This whole thing reminds me of f53c5de2 (pickaxe: fix segfault with
'-S<...> --pickaxe-regex', 2017-03-18), by the way.

Thanks.
Ævar Arnfjörð Bjarmason Feb. 4, 2021, 9:13 p.m. UTC | #3
On Thu, Feb 04 2021, Junio C Hamano wrote:

> René Scharfe <l.s.r@web.de> writes:
>
>>> -		while (sz &&
>>> -		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
>>> +		while (!regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
>>
>> This will loop forever for regexes that match an empty string.  An
>> example would be /$/.  Silly, perhaps, but still I understand this check
>> less as an optimization and more as a correctness/robustness thing.
>>
>>>  			flags |= REG_NOTBOL;
>>>  			data += regmatch.rm_eo;
>>>  			sz -= regmatch.rm_eo;
>>> -			if (sz && regmatch.rm_so == regmatch.rm_eo) {
>>> +			if (regmatch.rm_so == regmatch.rm_eo) {
>>>  				data++;
>>>  				sz--;
>>>  			}
>>
>> Before, if the match was an empty string and there was more data after
>> it, then the code would consume a character anyway, in order to avoid
>> matching the same empty string again.  With the patch, that character
>> is consumed even if there is no more data.  This leaves 'data'
>> pointing beyond the buffer and 'sz' rolls over to ULONG_MAX.  Oops. :(
>
> While I do not care too much about NUL in the haystack, I do not
> mind [13/25] either.  But this is bad.
>
> This whole thing reminds me of f53c5de2 (pickaxe: fix segfault with
> '-S<...> --pickaxe-regex', 2017-03-18), by the way.

René: Well spotted, thanks, and Oops.

I've just sent a separate series with 01-10 of this one. I'm sitting on
the diffcore-pickaxe patches for a while. I've got local fixes for a lot
of issues in it, will fix this one too.

I've optimized the PCRE v2 codepath a lot more in my local
version. Current results are:

    4209.1: git log -S'int main' <limit-rev>..                                0.38(0.36+0.01)   0.37(0.33+0.04) -2.6%
    4209.2: git log -S'æ' <limit-rev>..                                       0.51(0.47+0.04)   0.32(0.27+0.05) -37.3%
    4209.3: git log --pickaxe-regex -S'(int|void|null)' <limit-rev>..         0.72(0.68+0.03)   0.57(0.54+0.03) -20.8%
    4209.4: git log --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..          0.60(0.55+0.02)   0.39(0.34+0.05) -35.0%
    4209.5: git log --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..       0.43(0.40+0.03)   0.50(0.44+0.06) +16.3%
    4209.6: git log -G'(int|void|null)' <limit-rev>..                         0.64(0.55+0.09)   0.63(0.56+0.05) -1.6%
    4209.7: git log -G'if *\([^ ]+ & ' <limit-rev>..                          0.64(0.59+0.05)   0.63(0.56+0.06) -1.6%
    4209.8: git log -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                       0.63(0.54+0.08)   0.62(0.55+0.06) -1.6%
    4209.9: git log -i -S'int main' <limit-rev>..                             0.39(0.35+0.03)   0.38(0.35+0.02) -2.6%
    4209.10: git log -i -S'æ' <limit-rev>..                                   0.39(0.33+0.06)   0.32(0.28+0.04) -17.9%
    4209.11: git log -i --pickaxe-regex -S'(int|void|null)' <limit-rev>..     0.90(0.84+0.05)   0.58(0.53+0.04) -35.6%
    4209.12: git log -i --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..      0.71(0.64+0.06)   0.40(0.37+0.03) -43.7%
    4209.13: git log -i --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..   0.43(0.40+0.03)   0.50(0.46+0.04) +16.3%
    4209.14: git log -i -G'(int|void|null)' <limit-rev>..                     0.64(0.57+0.06)   0.62(0.56+0.05) -3.1%
    4209.15: git log -i -G'if *\([^ ]+ & ' <limit-rev>..                      0.65(0.59+0.06)   0.63(0.54+0.08) -3.1%
    4209.16: git log -i -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                   0.63(0.55+0.08)   0.62(0.56+0.05) -1.6%

The main optimization was just moving the compilation of the pattern up
the stack into the diff_options struct, the current version in this
thread re-compiles it every time.
diff mbox series

Patch

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 208177bb40..8df76afb6e 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -82,12 +82,11 @@  static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws)
 		regmatch_t regmatch;
 		int flags = 0;
 
-		while (sz &&
-		       !regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
+		while (!regexec_buf(regexp, data, sz, 1, &regmatch, flags)) {
 			flags |= REG_NOTBOL;
 			data += regmatch.rm_eo;
 			sz -= regmatch.rm_eo;
-			if (sz && regmatch.rm_so == regmatch.rm_eo) {
+			if (regmatch.rm_so == regmatch.rm_eo) {
 				data++;
 				sz--;
 			}