Message ID | 20210203032811.14979-15-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | grep: PCREv2 fixes, remove kwset.[ch] | expand |
Am 03.02.21 um 04:28 schrieb Ævar Arnfjörð Bjarmason: > If we walk to the end of the string we just won't match the rest of > the regex. This removes an optimization for simplicity's sake. In > subsequent commits we'll alter this code more, and not having to think > about this condition makes it easier to read. > > If we look at the context of what we're doing here the last thing we > need to be worried about is one extra regex match. The real problem is > that we keep matching after it's clear that the number of contains() > for "A" and "B" is different. So we could be much smarter here. > > Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> > --- > diffcore-pickaxe.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c > index 208177bb40..8df76afb6e 100644 > --- a/diffcore-pickaxe.c > +++ b/diffcore-pickaxe.c > @@ -82,12 +82,11 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws) > regmatch_t regmatch; > int flags = 0; > > - while (sz && > - !regexec_buf(regexp, data, sz, 1, ®match, flags)) { > + while (!regexec_buf(regexp, data, sz, 1, ®match, flags)) { This will loop forever for regexes that match an empty string. An example would be /$/. Silly, perhaps, but still I understand this check less as an optimization and more as a correctness/robustness thing. > flags |= REG_NOTBOL; > data += regmatch.rm_eo; > sz -= regmatch.rm_eo; > - if (sz && regmatch.rm_so == regmatch.rm_eo) { > + if (regmatch.rm_so == regmatch.rm_eo) { > data++; > sz--; > } Before, if the match was an empty string and there was more data after it, then the code would consume a character anyway, in order to avoid matching the same empty string again. With the patch, that character is consumed even if there is no more data. This leaves 'data' pointing beyond the buffer and 'sz' rolls over to ULONG_MAX. Oops. :( René
René Scharfe <l.s.r@web.de> writes: >> - while (sz && >> - !regexec_buf(regexp, data, sz, 1, ®match, flags)) { >> + while (!regexec_buf(regexp, data, sz, 1, ®match, flags)) { > > This will loop forever for regexes that match an empty string. An > example would be /$/. Silly, perhaps, but still I understand this check > less as an optimization and more as a correctness/robustness thing. > >> flags |= REG_NOTBOL; >> data += regmatch.rm_eo; >> sz -= regmatch.rm_eo; >> - if (sz && regmatch.rm_so == regmatch.rm_eo) { >> + if (regmatch.rm_so == regmatch.rm_eo) { >> data++; >> sz--; >> } > > Before, if the match was an empty string and there was more data after > it, then the code would consume a character anyway, in order to avoid > matching the same empty string again. With the patch, that character > is consumed even if there is no more data. This leaves 'data' > pointing beyond the buffer and 'sz' rolls over to ULONG_MAX. Oops. :( While I do not care too much about NUL in the haystack, I do not mind [13/25] either. But this is bad. This whole thing reminds me of f53c5de2 (pickaxe: fix segfault with '-S<...> --pickaxe-regex', 2017-03-18), by the way. Thanks.
On Thu, Feb 04 2021, Junio C Hamano wrote: > René Scharfe <l.s.r@web.de> writes: > >>> - while (sz && >>> - !regexec_buf(regexp, data, sz, 1, ®match, flags)) { >>> + while (!regexec_buf(regexp, data, sz, 1, ®match, flags)) { >> >> This will loop forever for regexes that match an empty string. An >> example would be /$/. Silly, perhaps, but still I understand this check >> less as an optimization and more as a correctness/robustness thing. >> >>> flags |= REG_NOTBOL; >>> data += regmatch.rm_eo; >>> sz -= regmatch.rm_eo; >>> - if (sz && regmatch.rm_so == regmatch.rm_eo) { >>> + if (regmatch.rm_so == regmatch.rm_eo) { >>> data++; >>> sz--; >>> } >> >> Before, if the match was an empty string and there was more data after >> it, then the code would consume a character anyway, in order to avoid >> matching the same empty string again. With the patch, that character >> is consumed even if there is no more data. This leaves 'data' >> pointing beyond the buffer and 'sz' rolls over to ULONG_MAX. Oops. :( > > While I do not care too much about NUL in the haystack, I do not > mind [13/25] either. But this is bad. > > This whole thing reminds me of f53c5de2 (pickaxe: fix segfault with > '-S<...> --pickaxe-regex', 2017-03-18), by the way. René: Well spotted, thanks, and Oops. I've just sent a separate series with 01-10 of this one. I'm sitting on the diffcore-pickaxe patches for a while. I've got local fixes for a lot of issues in it, will fix this one too. I've optimized the PCRE v2 codepath a lot more in my local version. Current results are: 4209.1: git log -S'int main' <limit-rev>.. 0.38(0.36+0.01) 0.37(0.33+0.04) -2.6% 4209.2: git log -S'æ' <limit-rev>.. 0.51(0.47+0.04) 0.32(0.27+0.05) -37.3% 4209.3: git log --pickaxe-regex -S'(int|void|null)' <limit-rev>.. 0.72(0.68+0.03) 0.57(0.54+0.03) -20.8% 4209.4: git log --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>.. 0.60(0.55+0.02) 0.39(0.34+0.05) -35.0% 4209.5: git log --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.43(0.40+0.03) 0.50(0.44+0.06) +16.3% 4209.6: git log -G'(int|void|null)' <limit-rev>.. 0.64(0.55+0.09) 0.63(0.56+0.05) -1.6% 4209.7: git log -G'if *\([^ ]+ & ' <limit-rev>.. 0.64(0.59+0.05) 0.63(0.56+0.06) -1.6% 4209.8: git log -G'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.63(0.54+0.08) 0.62(0.55+0.06) -1.6% 4209.9: git log -i -S'int main' <limit-rev>.. 0.39(0.35+0.03) 0.38(0.35+0.02) -2.6% 4209.10: git log -i -S'æ' <limit-rev>.. 0.39(0.33+0.06) 0.32(0.28+0.04) -17.9% 4209.11: git log -i --pickaxe-regex -S'(int|void|null)' <limit-rev>.. 0.90(0.84+0.05) 0.58(0.53+0.04) -35.6% 4209.12: git log -i --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>.. 0.71(0.64+0.06) 0.40(0.37+0.03) -43.7% 4209.13: git log -i --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.43(0.40+0.03) 0.50(0.46+0.04) +16.3% 4209.14: git log -i -G'(int|void|null)' <limit-rev>.. 0.64(0.57+0.06) 0.62(0.56+0.05) -3.1% 4209.15: git log -i -G'if *\([^ ]+ & ' <limit-rev>.. 0.65(0.59+0.06) 0.63(0.54+0.08) -3.1% 4209.16: git log -i -G'[àáâãäåæñøùúûüýþ]' <limit-rev>.. 0.63(0.55+0.08) 0.62(0.56+0.05) -1.6% The main optimization was just moving the compilation of the pattern up the stack into the diff_options struct, the current version in this thread re-compiles it every time.
diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c index 208177bb40..8df76afb6e 100644 --- a/diffcore-pickaxe.c +++ b/diffcore-pickaxe.c @@ -82,12 +82,11 @@ static unsigned int contains(mmfile_t *mf, regex_t *regexp, kwset_t kws) regmatch_t regmatch; int flags = 0; - while (sz && - !regexec_buf(regexp, data, sz, 1, ®match, flags)) { + while (!regexec_buf(regexp, data, sz, 1, ®match, flags)) { flags |= REG_NOTBOL; data += regmatch.rm_eo; sz -= regmatch.rm_eo; - if (sz && regmatch.rm_so == regmatch.rm_eo) { + if (regmatch.rm_so == regmatch.rm_eo) { data++; sz--; }
If we walk to the end of the string we just won't match the rest of the regex. This removes an optimization for simplicity's sake. In subsequent commits we'll alter this code more, and not having to think about this condition makes it easier to read. If we look at the context of what we're doing here the last thing we need to be worried about is one extra regex match. The real problem is that we keep matching after it's clear that the number of contains() for "A" and "B" is different. So we could be much smarter here. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- diffcore-pickaxe.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)