Message ID | 20190809030210.18353-1-carenas@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | grep: almost no more leaks, hopefully no crashes | expand |
disregard this version, it still broken (and wouldn't even build on some cases), the reasons why are still unclear though but at least it might seem from the last known run in windows that segfaults were prevented at last and something was still off enough to trigger a BUG (shouldn't be a concern with V6 or later that do PCRE2 migration to NED fully, as agreed) Thanks to Dscho's github integration to Azure pipelines got finally a RC[1] for V6 that at least passes all tests and will need to get in shape for submission. as René pointed out in three my performance concerns might be a red herring as well, but would be nice if someone with windows would definitely get some numbers so we can't be sure there were truly no regressions Carlo [1] https://github.com/carenas/git/tree/pcre2-chartables-leakfix
Am 09.08.19 um 13:24 schrieb Carlo Arenas: > disregard this version, it still broken (and wouldn't even build on > some cases), the reasons why are still unclear though but at least it > might > seem from the last known run in windows that segfaults were prevented > at last and something was still off enough to trigger a BUG (shouldn't > be a concern with V6 or later that do PCRE2 migration to NED fully, as > agreed) So how about starting stupidly simple? You can test it everywhere, it just needs libpcre2. It works without that library as well (tested with "make USE_LIBPCRE= USE_LIBPCRE2= test"), but doesn't allocate anything in that case, of course. The character tables leak fix should be safe on top. If you detect performance issues then we can address them in additional patches. -- >8 -- Subject: [PATCH] grep: use xmalloc() for all PCRE2 allocations Build a PCRE2 global custom context when compiling a pattern and use it to tell the library to use xmalloc() for allocations. This provides consistent out-of-memory handling and makes sure it uses a custom allocator, e.g. with USE_NED_ALLOCATOR. Signed-off-by: René Scharfe <l.s.r@web.de> --- The function names are ridiculously long, I tried to stay within 80 characters per line but gave up in the end and just kept going without line breaks. Fits the "stupidly simple" approach.. grep.c | 23 +++++++++++++++++------ grep.h | 2 ++ 2 files changed, 19 insertions(+), 6 deletions(-) diff --git a/grep.c b/grep.c index cd952ef5d3..44f4e38657 100644 --- a/grep.c +++ b/grep.c @@ -482,6 +482,16 @@ static void free_pcre1_regexp(struct grep_pat *p) #endif /* !USE_LIBPCRE1 */ #ifdef USE_LIBPCRE2 +static void *pcre2_malloc(PCRE2_SIZE size, MAYBE_UNUSED void *memory_data) +{ + return xmalloc(size); +} + +static void pcre2_free(void *pointer, MAYBE_UNUSED void *memory_data) +{ + free(pointer); +} + static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt) { int error; @@ -495,12 +505,12 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt assert(opt->pcre2); - p->pcre2_compile_context = NULL; + p->pcre2_general_context = pcre2_general_context_create(pcre2_malloc, pcre2_free, NULL); + p->pcre2_compile_context = pcre2_compile_context_create(p->pcre2_general_context); if (opt->ignore_case) { if (has_non_ascii(p->pattern)) { - character_tables = pcre2_maketables(NULL); - p->pcre2_compile_context = pcre2_compile_context_create(NULL); + character_tables = pcre2_maketables(p->pcre2_general_context); pcre2_set_character_tables(p->pcre2_compile_context, character_tables); } options |= PCRE2_CASELESS; @@ -513,7 +523,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt p->pcre2_compile_context); if (p->pcre2_pattern) { - p->pcre2_match_data = pcre2_match_data_create_from_pattern(p->pcre2_pattern, NULL); + p->pcre2_match_data = pcre2_match_data_create_from_pattern(p->pcre2_pattern, p->pcre2_general_context); if (!p->pcre2_match_data) die("Couldn't allocate PCRE2 match data"); } else { @@ -550,10 +560,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt return; } - p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL); + p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, p->pcre2_general_context); if (!p->pcre2_jit_stack) die("Couldn't allocate PCRE2 JIT stack"); - p->pcre2_match_context = pcre2_match_context_create(NULL); + p->pcre2_match_context = pcre2_match_context_create(p->pcre2_general_context); if (!p->pcre2_match_context) die("Couldn't allocate PCRE2 match context"); pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack); @@ -605,6 +615,7 @@ static void free_pcre2_pattern(struct grep_pat *p) pcre2_match_data_free(p->pcre2_match_data); pcre2_jit_stack_free(p->pcre2_jit_stack); pcre2_match_context_free(p->pcre2_match_context); + pcre2_general_context_free(p->pcre2_general_context); } #else /* !USE_LIBPCRE2 */ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt) diff --git a/grep.h b/grep.h index 1875880f37..73b8b87a3a 100644 --- a/grep.h +++ b/grep.h @@ -28,6 +28,7 @@ typedef int pcre_jit_stack; #else typedef int pcre2_code; typedef int pcre2_match_data; +typedef int pcre2_general_context; typedef int pcre2_compile_context; typedef int pcre2_match_context; typedef int pcre2_jit_stack; @@ -93,6 +94,7 @@ struct grep_pat { int pcre1_jit_on; pcre2_code *pcre2_pattern; pcre2_match_data *pcre2_match_data; + pcre2_general_context *pcre2_general_context; pcre2_compile_context *pcre2_compile_context; pcre2_match_context *pcre2_match_context; pcre2_jit_stack *pcre2_jit_stack; -- 2.22.0
René Scharfe <l.s.r@web.de> writes: > So how about starting stupidly simple? You can test it everywhere, it > just needs libpcre2. It works without that library as well (tested with > "make USE_LIBPCRE= USE_LIBPCRE2= test"), but doesn't allocate anything > in that case, of course. The character tables leak fix should be safe > on top. If you detect performance issues then we can address them in > additional patches. > > -- >8 -- > Subject: [PATCH] grep: use xmalloc() for all PCRE2 allocations > > Build a PCRE2 global custom context when compiling a pattern and use it > to tell the library to use xmalloc() for allocations. This provides > consistent out-of-memory handling and makes sure it uses a custom > allocator, e.g. with USE_NED_ALLOCATOR. > > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > The function names are ridiculously long, I tried to stay within 80 > characters per line but gave up in the end and just kept going without > line breaks. Fits the "stupidly simple" approach.. ;-) Thanks for keeping the conversation going.
Hi, On Fri, 9 Aug 2019, René Scharfe wrote: > Am 09.08.19 um 13:24 schrieb Carlo Arenas: > > disregard this version, it still broken (and wouldn't even build on > > some cases), the reasons why are still unclear though but at least it > > might > > seem from the last known run in windows that segfaults were prevented > > at last and something was still off enough to trigger a BUG (shouldn't > > be a concern with V6 or later that do PCRE2 migration to NED fully, as > > agreed) > > So how about starting stupidly simple? FWIW I am very much in favor of this approach. Thanks, Dscho > You can test it everywhere, it just needs libpcre2. It works without > that library as well (tested with "make USE_LIBPCRE= USE_LIBPCRE2= > test"), but doesn't allocate anything in that case, of course. The > character tables leak fix should be safe on top. If you detect > performance issues then we can address them in additional patches. > > -- >8 -- > Subject: [PATCH] grep: use xmalloc() for all PCRE2 allocations > > Build a PCRE2 global custom context when compiling a pattern and use it > to tell the library to use xmalloc() for allocations. This provides > consistent out-of-memory handling and makes sure it uses a custom > allocator, e.g. with USE_NED_ALLOCATOR. > > Signed-off-by: René Scharfe <l.s.r@web.de> > --- > The function names are ridiculously long, I tried to stay within 80 > characters per line but gave up in the end and just kept going without > line breaks. Fits the "stupidly simple" approach.. > > grep.c | 23 +++++++++++++++++------ > grep.h | 2 ++ > 2 files changed, 19 insertions(+), 6 deletions(-) > > diff --git a/grep.c b/grep.c > index cd952ef5d3..44f4e38657 100644 > --- a/grep.c > +++ b/grep.c > @@ -482,6 +482,16 @@ static void free_pcre1_regexp(struct grep_pat *p) > #endif /* !USE_LIBPCRE1 */ > > #ifdef USE_LIBPCRE2 > +static void *pcre2_malloc(PCRE2_SIZE size, MAYBE_UNUSED void *memory_data) > +{ > + return xmalloc(size); > +} > + > +static void pcre2_free(void *pointer, MAYBE_UNUSED void *memory_data) > +{ > + free(pointer); > +} > + > static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt) > { > int error; > @@ -495,12 +505,12 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > > assert(opt->pcre2); > > - p->pcre2_compile_context = NULL; > + p->pcre2_general_context = pcre2_general_context_create(pcre2_malloc, pcre2_free, NULL); > + p->pcre2_compile_context = pcre2_compile_context_create(p->pcre2_general_context); > > if (opt->ignore_case) { > if (has_non_ascii(p->pattern)) { > - character_tables = pcre2_maketables(NULL); > - p->pcre2_compile_context = pcre2_compile_context_create(NULL); > + character_tables = pcre2_maketables(p->pcre2_general_context); > pcre2_set_character_tables(p->pcre2_compile_context, character_tables); > } > options |= PCRE2_CASELESS; > @@ -513,7 +523,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > p->pcre2_compile_context); > > if (p->pcre2_pattern) { > - p->pcre2_match_data = pcre2_match_data_create_from_pattern(p->pcre2_pattern, NULL); > + p->pcre2_match_data = pcre2_match_data_create_from_pattern(p->pcre2_pattern, p->pcre2_general_context); > if (!p->pcre2_match_data) > die("Couldn't allocate PCRE2 match data"); > } else { > @@ -550,10 +560,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > return; > } > > - p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL); > + p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, p->pcre2_general_context); > if (!p->pcre2_jit_stack) > die("Couldn't allocate PCRE2 JIT stack"); > - p->pcre2_match_context = pcre2_match_context_create(NULL); > + p->pcre2_match_context = pcre2_match_context_create(p->pcre2_general_context); > if (!p->pcre2_match_context) > die("Couldn't allocate PCRE2 match context"); > pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack); > @@ -605,6 +615,7 @@ static void free_pcre2_pattern(struct grep_pat *p) > pcre2_match_data_free(p->pcre2_match_data); > pcre2_jit_stack_free(p->pcre2_jit_stack); > pcre2_match_context_free(p->pcre2_match_context); > + pcre2_general_context_free(p->pcre2_general_context); > } > #else /* !USE_LIBPCRE2 */ > static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt) > diff --git a/grep.h b/grep.h > index 1875880f37..73b8b87a3a 100644 > --- a/grep.h > +++ b/grep.h > @@ -28,6 +28,7 @@ typedef int pcre_jit_stack; > #else > typedef int pcre2_code; > typedef int pcre2_match_data; > +typedef int pcre2_general_context; > typedef int pcre2_compile_context; > typedef int pcre2_match_context; > typedef int pcre2_jit_stack; > @@ -93,6 +94,7 @@ struct grep_pat { > int pcre1_jit_on; > pcre2_code *pcre2_pattern; > pcre2_match_data *pcre2_match_data; > + pcre2_general_context *pcre2_general_context; > pcre2_compile_context *pcre2_compile_context; > pcre2_match_context *pcre2_match_context; > pcre2_jit_stack *pcre2_jit_stack; > -- > 2.22.0 >
On Fri, Aug 9, 2019 at 2:26 PM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > FWIW I am very much in favor of this approach. FWIW this is (almost) exactly what I had in mind with my first reply, except I wanted to make setting of the global context be conditioned to having NED enabled to avoid possible performance issues in environments were NED is not even usable. in macOS (obviously testing without NED) the following is the output of (a hacked version) of p7801 for maint (against chromium's repository), with René's patch on top Test HEAD^ HEAD -------------------------------------------------------------------------------------- 7820.1: perl grep 'how.to' 0.51(0.35+1.11) 0.48(0.33+1.16) -5.9% 7820.2: perl grep '^how to' 0.47(0.33+1.08) 0.45(0.34+1.11) -4.3% 7820.3: perl grep '[how] to' 0.49(0.40+1.11) 0.53(0.41+1.13) +8.2% 7820.4: perl grep '(e.t[^ ]*|v.ry) rare' 68.72(68.77+1.14) 72.10(72.15+1.20) +4.9% 7820.5: perl grep 'm(ú|u)lt.b(æ|y)te' 0.48(0.35+1.12) 0.50(0.40+1.23) +4.2% and this is with my squashed[2] changed on top of that : Test HEAD^ HEAD -------------------------------------------------------------------------------------- 7820.1: perl grep 'how.to' 0.48(0.36+1.16) 0.46(0.33+1.09) -4.2% 7820.2: perl grep '^how to' 0.45(0.34+1.12) 0.42(0.29+0.99) -6.7% 7820.3: perl grep '[how] to' 0.48(0.40+1.13) 0.52(0.43+1.16) +8.3% 7820.4: perl grep '(e.t[^ ]*|v.ry) rare' 69.12(69.10+1.07) 69.19(69.19+1.18) +0.1% 7820.5: perl grep 'm(ú|u)lt.b(æ|y)te' 0.49(0.38+1.17) 0.46(0.35+1.13) -6.1% the degenerate case is not something we can't fix anyway, since it is likely a locking issue inside PCRE2 (I see at most 1 CPU doing work), and the numbers are noisy because of the other problems I mentioned before (hardcoded to 8 threads, running in a laptop with low number of cores), which is why testing for performance regressions in windows is strongly encouraged, regardless Carlo [1] https://public-inbox.org/git/CAPUEspgH1v1zo7smzQWCV4rX9pKVKLV84gDSfCPdT7LffQxUWw@mail.gmail.com/ [2] https://public-inbox.org/git/20190810030315.7519-1-carenas@gmail.com/
Am 10.08.19 um 05:05 schrieb Carlo Arenas: > in macOS (obviously testing without NED) the following is the output > of (a hacked version) of p7801 for maint (against chromium's > repository), with René's patch on top Do you mean p7820? And what did you change? Looking at the results you removed basic and extended from the list of regex engines, right? Ugh, cloning https://chromium.googlesource.com/chromium/src.git sends more than 16GB across the wire. Is that even the right repository? Not sure if my machine can keep the relevant parts cached while grepping -- I/O times could drown out any difference due to context allocation and memory allocator choice. Let's see... > > Test HEAD^ HEAD > -------------------------------------------------------------------------------------- > 7820.1: perl grep 'how.to' 0.51(0.35+1.11) > 0.48(0.33+1.16) -5.9% > 7820.2: perl grep '^how to' 0.47(0.33+1.08) > 0.45(0.34+1.11) -4.3% > 7820.3: perl grep '[how] to' 0.49(0.40+1.11) > 0.53(0.41+1.13) +8.2% > 7820.4: perl grep '(e.t[^ ]*|v.ry) rare' 68.72(68.77+1.14) > 72.10(72.15+1.20) +4.9% > 7820.5: perl grep 'm(ú|u)lt.b(æ|y)te' 0.48(0.35+1.12) > 0.50(0.40+1.23) +4.2% > > and this is with my squashed[2] changed on top of that : > > Test HEAD^ HEAD > -------------------------------------------------------------------------------------- > 7820.1: perl grep 'how.to' 0.48(0.36+1.16) > 0.46(0.33+1.09) -4.2% > 7820.2: perl grep '^how to' 0.45(0.34+1.12) > 0.42(0.29+0.99) -6.7% > 7820.3: perl grep '[how] to' 0.48(0.40+1.13) > 0.52(0.43+1.16) +8.3% > 7820.4: perl grep '(e.t[^ ]*|v.ry) rare' 69.12(69.10+1.07) > 69.19(69.19+1.18) +0.1% > 7820.5: perl grep 'm(ú|u)lt.b(æ|y)te' 0.49(0.38+1.17) > 0.46(0.35+1.13) -6.1% > > the degenerate case is not something we can't fix anyway, since it is > likely a locking issue inside PCRE2 (I see at most 1 CPU doing work), > and the numbers are noisy because of the other problems I mentioned > before (hardcoded to 8 threads, running in a laptop with low number of > cores), which is why testing for performance regressions in windows is > strongly encouraged, regardless > > Carlo > > [1] https://public-inbox.org/git/CAPUEspgH1v1zo7smzQWCV4rX9pKVKLV84gDSfCPdT7LffQxUWw@mail.gmail.com/ > [2] https://public-inbox.org/git/20190810030315.7519-1-carenas@gmail.com/ > So I pointed GIT_PERF_LARGE_REPO to the monster mentioned above, ran the test once for warmup and here are the results of the second run: Test origin/master pcre2-xmalloc pcre2-xmalloc+nedmalloc --------------------------------------------------------------------------------------------------------------------- 7820.1: basic grep 'how.to' 1.59(2.93+1.75) 1.60(3.04+1.74) +0.6% 1.64(2.87+1.90) +3.1% 7820.2: extended grep 'how.to' 1.59(2.98+1.66) 1.55(2.83+1.76) -2.5% 1.67(3.15+1.70) +5.0% 7820.3: perl grep 'how.to' 1.25(1.21+2.13) 1.25(1.24+2.08) +0.0% 1.29(1.32+2.08) +3.2% 7820.5: basic grep '^how to' 1.52(2.82+1.66) 1.51(2.68+1.77) -0.7% 1.64(3.07+1.69) +7.9% 7820.6: extended grep '^how to' 1.57(2.84+1.75) 1.51(2.76+1.73) -3.8% 1.61(2.95+1.75) +2.5% 7820.7: perl grep '^how to' 1.21(1.15+2.10) 1.22(1.26+1.98) +0.8% 1.27(1.22+2.09) +5.0% 7820.9: basic grep '[how] to' 1.95(4.51+1.68) 1.96(4.48+1.69) +0.5% 2.00(4.66+1.65) +2.6% 7820.10: extended grep '[how] to' 1.96(4.54+1.65) 1.94(4.46+1.70) -1.0% 2.04(4.78+1.65) +4.1% 7820.11: perl grep '[how] to' 1.29(1.58+1.88) 1.28(1.50+1.94) -0.8% 1.34(1.51+2.06) +3.9% 7820.13: basic grep '\(e.t[^ ]*\|v.ry\) rare' 8.17(13.18+1.50) 8.29(13.36+1.37) +1.5% 8.31(13.33+1.60) +1.7% 7820.14: extended grep '(e.t[^ ]*|v.ry) rare' 8.13(13.03+1.59) 8.14(13.12+1.47) +0.1% 8.30(13.35+1.56) +2.1% 7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 34.96(35.80+1.68) 34.99(35.60+1.91) +0.1% 35.18(35.83+1.90) +0.6% 7820.17: basic grep 'm\(ú\|u\)lt.b\(æ\|y\)te' 1.57(3.03+1.64) 1.53(2.76+1.75) -2.5% 1.60(2.89+1.77) +1.9% 7820.18: extended grep 'm(ú|u)lt.b(æ|y)te' 1.52(2.83+1.69) 1.52(2.89+1.63) +0.0% 1.58(2.80+1.84) +3.9% 7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 1.20(1.25+2.02) 1.21(1.30+1.96) +0.8% 1.25(1.22+2.11) +4.2% pcre2-xmalloc is my patch on top of master, +nedmalloc has the warning fixes I sent earlier and sets USE_NED_MALLOC. I don't understand why my performance is lower by factor 2.5 than yours for all perl regex tests except 7820.15 (your 7820.4), where my system is two times faster. Debian Testing, GCC 9.1.0, i5-9600K, 16GB RAM. Anyway, nedmalloc is slower across the board, but the impact of my patch is in the noise. Right? René
On Sat, Aug 10, 2019 at 12:57 AM René Scharfe <l.s.r@web.de> wrote: > > Am 10.08.19 um 05:05 schrieb Carlo Arenas: > > in macOS (obviously testing without NED) the following is the output > > of (a hacked version) of p7801 for maint (against chromium's > > repository), with René's patch on top > > Do you mean p7820? And what did you change? Looking at the results you > removed basic and extended from the list of regex engines, right? correct, that was a weird typo there; apologize for the confusion. you were correct about my changes; ironically, I didn't spell those changes out to avoid confusion. > Ugh, cloning https://chromium.googlesource.com/chromium/src.git sends > more than 16GB across the wire. Is that even the right repository? noticed it was mentioned before by people[1] doing performance testing on grep specifically and thought it was better than try to come with another one, the linux kernel wouldn't work in macOS because it breaks "run" as it fails to checkout in a case insensitive filesystem. > Not > sure if my machine can keep the relevant parts cached while grepping -- > I/O times could drown out any difference due to context allocation and > memory allocator choice. Let's see... ;), try setting /proc/sys/vm/swappiness to 0 or get more RAM > I don't understand why my performance is lower by factor 2.5 than yours > for all perl regex tests except 7820.15 (your 7820.4), where my system > is two times faster. Debian Testing, GCC 9.1.0, i5-9600K, 16GB RAM. interesting; did you also see at most 100% of one CPU being used? yours seem to be faster than mine so this might be representative of single threaded performance. > Anyway, nedmalloc is slower across the board, but the impact of my > patch is in the noise. Right? yes, and there is a lot of noise. Carlo [1] https://public-inbox.org/git/e72330c6747218545cce1b6b1edfd1e448141a8f.1563570204.git.matheus.bernardino@usp.br/
Am 10.08.19 um 14:40 schrieb Carlo Arenas: > On Sat, Aug 10, 2019 at 12:57 AM René Scharfe <l.s.r@web.de> wrote: >> I don't understand why my performance is lower by factor 2.5 than yours >> for all perl regex tests except 7820.15 (your 7820.4), where my system >> is two times faster. Debian Testing, GCC 9.1.0, i5-9600K, 16GB RAM. > > interesting; did you also see at most 100% of one CPU being used? For 7820.15 yes -- 100% of a single core, the rest idling. The other tests had only brief sequential phases, if at all, and used all cores. René