From patchwork Fri Jul 26 15:08:11 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Ævar Arnfjörð Bjarmason
X-Patchwork-Id: 11061211
From: Ævar Arnfjörð Bjarmason
To: git@vger.kernel.org
Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
 Johannes Schindelin, Ævar Arnfjörð Bjarmason
Subject: [PATCH v2 1/8] grep: remove overly paranoid BUG(...) code
Date: Fri, 26 Jul 2019 17:08:11 +0200
Message-Id: <20190726150818.6373-2-avarab@gmail.com>
X-Mailer: git-send-email 2.22.0.455.g172b71a6c5
In-Reply-To: <20190724151415.3698-1-avarab@gmail.com>
References: <20190724151415.3698-1-avarab@gmail.com>
Sender: git-owner@vger.kernel.org
X-Mailing-List: git@vger.kernel.org

Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.

I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then, as noted in f002532784 ("grep:
print the pcre2_jit_on value", 2019-07-22), incorrectly copy/pasted
some of it in 94da9193a6 ("grep: add support for PCRE v2",
2017-06-01).

Let's just remove this code. Being this paranoid about the
pcre2?_config() function itself being broken is crossing the line
into unreasonable paranoia.

Reported-by: Beat Bolli
Signed-off-by: Ævar Arnfjörð Bjarmason
---
 grep.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index 0937c5bfff..95af88cb74 100644
--- a/grep.c
+++ b/grep.c
@@ -394,14 +394,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on == 1) {
+	if (p->pcre1_jit_on) {
 		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
 		if (!p->pcre1_jit_stack)
 			die("Couldn't allocate PCRE JIT stack");
 		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	} else if (p->pcre1_jit_on != 0) {
-		BUG("The pcre1_jit_on variable should be 0 or 1, not %d",
-		    p->pcre1_jit_on);
 	}
 #endif
 }
@@ -510,7 +507,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 
 	pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
-	if (p->pcre2_jit_on == 1) {
+	if (p->pcre2_jit_on) {
 		jitret = pcre2_jit_compile(p->pcre2_pattern, PCRE2_JIT_COMPLETE);
 		if (jitret)
 			die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", p->pattern, jitret);
@@ -545,9 +542,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		if (!p->pcre2_match_context)
 			die("Couldn't allocate PCRE2 match context");
 		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
-	} else if (p->pcre2_jit_on != 0) {
-		BUG("The pcre2_jit_on variable should be 0 or 1, not %d",
-		    p->pcre2_jit_on);
 	}
 }
 

From patchwork Fri Jul 26 15:08:12 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Ævar Arnfjörð Bjarmason
X-Patchwork-Id: 11061215
From: Ævar Arnfjörð Bjarmason
To: git@vger.kernel.org
Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
 Johannes Schindelin, Ævar Arnfjörð Bjarmason
Subject: [PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2
Date: Fri, 26 Jul 2019 17:08:12 +0200
Message-Id: <20190726150818.6373-3-avarab@gmail.com>
X-Mailer: git-send-email 2.22.0.455.g172b71a6c5
In-Reply-To: <20190724151415.3698-1-avarab@gmail.com>
References: <20190724151415.3698-1-avarab@gmail.com>
Sender: git-owner@vger.kernel.org
X-Mailing-List: git@vger.kernel.org

As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.

Thus our intention of starting with 1 byte of stack at a maximum of 1
MB didn't happen; we'd always use the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2].

The reason I allocated a custom stack at all was this advice in
pcrejit(3) (same in pcre2jit(3)):

    "By default, it uses 32KiB on the machine stack. However, some
    large or complicated patterns need more than this"

Since we haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code.

This won't change with the wider use of PCRE v2 in ed0479ce3d ("Merge
branch 'ab/no-kwset' into next", 2019-07-15), since a fixed string
search is not a "large or complicated pattern".
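
To make the "start small, grow up to a maximum" idea behind a custom
JIT stack concrete, here is a minimal standalone sketch in plain C. It
is an illustration only and not PCRE's implementation: struct
grow_stack and the grow_stack_*() helpers are hypothetical names, and
the doubling policy is an assumption made for this example. A request
that would exceed the maximum fails, which is loosely analogous to
hitting a stack limit error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical illustration; none of these names exist in PCRE or git. */
struct grow_stack {
	unsigned char *buf;
	size_t size;     /* bytes currently allocated */
	size_t max_size; /* hard upper bound, never exceeded */
};

/* Allocate start_size bytes up front; remember max_size as the limit. */
static int grow_stack_init(struct grow_stack *s, size_t start_size,
			   size_t max_size)
{
	if (!start_size || start_size > max_size)
		return -1;
	s->buf = malloc(start_size);
	if (!s->buf)
		return -1;
	s->size = start_size;
	s->max_size = max_size;
	return 0;
}

/*
 * Ensure at least `needed` bytes are allocated, doubling the current
 * size until it fits (clamped to max_size). Returns 0 on success and
 * -1 if `needed` exceeds max_size or the reallocation fails.
 */
static int grow_stack_reserve(struct grow_stack *s, size_t needed)
{
	size_t new_size = s->size;
	unsigned char *new_buf;

	if (needed > s->max_size)
		return -1;
	if (needed <= s->size)
		return 0;
	while (new_size < needed)
		new_size = (new_size > s->max_size / 2) ?
			s->max_size : new_size * 2;
	new_buf = realloc(s->buf, new_size);
	if (!new_buf)
		return -1;
	s->buf = new_buf;
	s->size = new_size;
	assert(s->size <= s->max_size);
	return 0;
}

/* Free the buffer and reset the bookkeeping. */
static void grow_stack_release(struct grow_stack *s)
{
	free(s->buf);
	s->buf = NULL;
	s->size = 0;
}
```

With a 32 KB start and a 1 MB maximum, a request for 100 KB would grow
this sketch's buffer to 128 KB, while a request for 2 MB would be
refused and leave the buffer untouched.
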
For good measure I ran the performance test noted in 94da9193a6, although the command is simpler now due to my 0f50c8e32c ("Makefile: remove the NO_R_TO_GCC_LINKER flag", 2019-05-17): GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD p7820-grep-engines.sh Just the /perl/ results are: Test HEAD~ HEAD --------------------------------------------------------------------------------------- 7820.3: perl grep 'how.to' 0.17(0.27+0.65) 0.17(0.24+0.68) +0.0% 7820.7: perl grep '^how to' 0.16(0.23+0.66) 0.16(0.23+0.67) +0.0% 7820.11: perl grep '[how] to' 0.18(0.35+0.62) 0.18(0.33+0.65) +0.0% 7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 0.17(0.45+0.54) 0.17(0.49+0.50) +0.0% 7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.16(0.33+0.58) 0.16(0.29+0.62) +0.0% So, as expected there's no change, and running with valgrind reveals that we have fewer allocations now. As noted in [3] there are known regexes that will fail with the lower stack limit, the way GNU grep fixed it is interesting, although I believe the implementation is overly verbose, they could make PCRE v2 handle that gradual re-allocation, that's what min/max memory is for. So we might end up bringing this back, I'm more inclined to just kick such cases upstairs to PCRE maintainers as a bug, perhaps they'll add some overall "just allocate more then" flag to make this easier. In any case there's no functional change here, we didn't have a custom stack, so let's apply this first, we can always revert it later. 1. https://public-inbox.org/git/20190721194052.15440-1-carenas@gmail.com/ 2. I didn't really intend to start with 1 byte, looking at the PCRE v2 code again what happened is that I cargo-culted some of PCRE v2's own test code which was meant to test re-allocations. It's more sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c does. 3. 
https://public-inbox.org/git/CAPUEspjj+fG8QDmf=bZXktfpLgkgiu34HTjKLhm-cmEE04FE-A@mail.gmail.com/ Reported-by: Carlo Marcelo Arenas Belón Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 10 ---------- grep.h | 4 ---- 2 files changed, 14 deletions(-) diff --git a/grep.c b/grep.c index 95af88cb74..4b1e917ac5 100644 --- a/grep.c +++ b/grep.c @@ -534,14 +534,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt p->pcre2_jit_on = 0; return; } - - p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL); - if (!p->pcre2_jit_stack) - die("Couldn't allocate PCRE2 JIT stack"); - p->pcre2_match_context = pcre2_match_context_create(NULL); - if (!p->pcre2_match_context) - die("Couldn't allocate PCRE2 match context"); - pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack); } } @@ -585,8 +577,6 @@ static void free_pcre2_pattern(struct grep_pat *p) pcre2_compile_context_free(p->pcre2_compile_context); pcre2_code_free(p->pcre2_pattern); pcre2_match_data_free(p->pcre2_match_data); - pcre2_jit_stack_free(p->pcre2_jit_stack); - pcre2_match_context_free(p->pcre2_match_context); } #else /* !USE_LIBPCRE2 */ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt) diff --git a/grep.h b/grep.h index d35a137fcb..4d8e300175 100644 --- a/grep.h +++ b/grep.h @@ -29,8 +29,6 @@ typedef int pcre_jit_stack; typedef int pcre2_code; typedef int pcre2_match_data; typedef int pcre2_compile_context; -typedef int pcre2_match_context; -typedef int pcre2_jit_stack; #endif #include "thread-utils.h" #include "userdiff.h" @@ -93,8 +91,6 @@ struct grep_pat { pcre2_code *pcre2_pattern; pcre2_match_data *pcre2_match_data; pcre2_compile_context *pcre2_compile_context; - pcre2_match_context *pcre2_match_context; - pcre2_jit_stack *pcre2_jit_stack; uint32_t pcre2_jit_on; unsigned fixed:1; unsigned ignore_case:1; From patchwork Fri Jul 26 15:08:13 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11061213 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 487AA13A0 for ; Fri, 26 Jul 2019 15:09:05 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 38BFA26E54 for ; Fri, 26 Jul 2019 15:09:05 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 2D06D28B3B; Fri, 26 Jul 2019 15:09:05 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7A52F26E54 for ; Fri, 26 Jul 2019 15:09:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387754AbfGZPJD (ORCPT ); Fri, 26 Jul 2019 11:09:03 -0400 Received: from mail-wm1-f68.google.com ([209.85.128.68]:50355 "EHLO mail-wm1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727679AbfGZPJD (ORCPT ); Fri, 26 Jul 2019 11:09:03 -0400 Received: by mail-wm1-f68.google.com with SMTP id v15so48330423wml.0 for ; Fri, 26 Jul 2019 08:09:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=gIZJUwhkVOxwhAlp0dkrA4dJpBmqoq8e6MyfhV+j5PI=; b=WhzjbONTDmZfq2WFZTmYEdQdtw5gLRBm9M62+JehzqYwsuxnNTFfy0bxhKan4ctknu u7zjlswrDpELlsF4TK23/0PShgTc0mAihUxr2JyJ5udUwC5XAp872nZZOHTJSAxO2t/6 
w0lsX+E/bpmXHltBJRkxIlbmQofZA4KdvEtlIZsXB16CErXmUOrULjpshma1gJon7K7c p3LsaglKRm3Fso/cXAvsDkidzluDqLHLE+9ci0NV7JWCG/pkQc9iNLfhPxoCUryDmsKh 71OgEULytSPOyuqf6bX5alt5IsuBtf25+DaN52lZEPIEyiSD0+rvTTMa7iVIgyx+k206 GMkw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=gIZJUwhkVOxwhAlp0dkrA4dJpBmqoq8e6MyfhV+j5PI=; b=ZicwsqgDl0SBhELV4nSWRq9tdbQgHT42k1pdbQngr9VjE5hfHMiQMea8My+XxAq8Bb jD1Qq/Lm/y1znBlOLHgGaqmXUZAI8mCqrFTwTSOPfNjF0dolAqDK/fD91p6KB3DE7tmy YTHXP1nBjqWHwMDbMJpIjtFAD2dRO2rhjXAN3dhsYN1pQjMj3/Gm2hkp0sXa2DJbX21H olwa7MOeRSX7tMuWco1p2wqPqNLB4Gs9rJb9sSQg7lmwNwNih0NCYTTNfJEV1ij4wiZk U3HmhQbNYCOxwv01t0Nkg6zAtcsFYx3AGtBWBsYYXUdIKz5ZX02/gAp1YeCbj8u8Fytp S/rg== X-Gm-Message-State: APjAAAVLuM4iFIrMaezKxMugUUWDGvTObQDemICjXgU5eR95Go/b2pXu BLU/ypZyPUKDQTzUSlhjNxspyiU02hY= X-Google-Smtp-Source: APXvYqwigWF8q9D/+zBZHvBGU9k4OzFNTgzLPWSZ9rEWgiCZ3XAhaOWQI7efT59Xj2HVgZ3NxdAuzA== X-Received: by 2002:a1c:9c8a:: with SMTP id f132mr85902565wme.29.1564153740558; Fri, 26 Jul 2019 08:09:00 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id p63sm4814341wmp.45.2019.07.26.08.08.59 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Fri, 26 Jul 2019 08:08:59 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Beat Bolli , Johannes Schindelin , =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1 Date: Fri, 26 Jul 2019 17:08:13 +0200 Message-Id: <20190726150818.6373-4-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190724151415.3698-1-avarab@gmail.com> References: <20190724151415.3698-1-avarab@gmail.com> MIME-Version: 1.0 Sender: 
git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code in the last commit. Unlike with v2 we actually used the custom stack in v1, but let's use PCRE's built-in 32 KB one instead, since experience with v2 shows that's enough. Most distros are already using v2 as a default, and the underlying sljit code is the same. Unfortunately we can't just pass a NULL to pcre_jit_exec() as with pcre2_jit_match(). Unlike the v2 function it doesn't support that. Instead we need to use the fatter pcre_exec() if we'd like the same behavior. This will make things slightly slower than on the fast-path function, but it's OK since we care less about v1 performance these days since we have and recommend v2. Running a similar performance test as what I ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25) via: GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh Gives us this, just the /perl/ results: Test HEAD~ HEAD --------------------------------------------------------------------------------------- 7820.3: perl grep 'how.to' 0.19(0.67+0.52) 0.19(0.65+0.52) +0.0% 7820.7: perl grep '^how to' 0.19(0.78+0.44) 0.19(0.72+0.49) +0.0% 7820.11: perl grep '[how] to' 0.39(2.13+0.43) 0.40(2.10+0.46) +2.6% 7820.15: perl grep '(e.t[^ ]*|v.ry) rare' 0.44(2.55+0.37) 0.45(2.47+0.41) +2.3% 7820.19: perl grep 'm(ú|u)lt.b(æ|y)te' 0.23(1.06+0.42) 0.22(1.03+0.43) -4.3% It will also implicitly re-enable UTF-8 validation for PCRE v1. As noted in [1] we now have cases as a result where PCRE v1 is more eager to error out. Subsequent patches will fix that for v2, and I think it's fair to tell v1 users "just upgrade" and not worry about that edge case for v1. 1. 
https://public-inbox.org/git/CAPUEsphZJ_Uv9o1-yDpjNLA_q-f7gWXz9g1gCY2pYAYN8ri40g@mail.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 28 +++++----------------------- grep.h | 5 ----- 2 files changed, 5 insertions(+), 28 deletions(-) diff --git a/grep.c b/grep.c index 4b1e917ac5..9c2b259771 100644 --- a/grep.c +++ b/grep.c @@ -394,12 +394,6 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) #ifdef GIT_PCRE1_USE_JIT pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on); - if (p->pcre1_jit_on) { - p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024); - if (!p->pcre1_jit_stack) - die("Couldn't allocate PCRE JIT stack"); - pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack); - } #endif } @@ -411,18 +405,9 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol, if (eflags & REG_NOTBOL) flags |= PCRE_NOTBOL; -#ifdef GIT_PCRE1_USE_JIT - if (p->pcre1_jit_on) { - ret = pcre_jit_exec(p->pcre1_regexp, p->pcre1_extra_info, line, - eol - line, 0, flags, ovector, - ARRAY_SIZE(ovector), p->pcre1_jit_stack); - } else -#endif - { - ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line, - eol - line, 0, flags, ovector, - ARRAY_SIZE(ovector)); - } + ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line, + eol - line, 0, flags, ovector, + ARRAY_SIZE(ovector)); if (ret < 0 && ret != PCRE_ERROR_NOMATCH) die("pcre_exec failed with error code %d", ret); @@ -439,14 +424,11 @@ static void free_pcre1_regexp(struct grep_pat *p) { pcre_free(p->pcre1_regexp); #ifdef GIT_PCRE1_USE_JIT - if (p->pcre1_jit_on) { + if (p->pcre1_jit_on) pcre_free_study(p->pcre1_extra_info); - pcre_jit_stack_free(p->pcre1_jit_stack); - } else + else #endif - { pcre_free(p->pcre1_extra_info); - } pcre_free((void *)p->pcre1_tables); } #else /* !USE_LIBPCRE1 */ diff --git a/grep.h b/grep.h index 4d8e300175..ce2d72571f 100644 --- a/grep.h +++ b/grep.h @@ -14,13 +14,9 @@ #ifndef GIT_PCRE_STUDY_JIT_COMPILE #define 
GIT_PCRE_STUDY_JIT_COMPILE 0 #endif -#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20 -typedef int pcre_jit_stack; -#endif #else typedef int pcre; typedef int pcre_extra; -typedef int pcre_jit_stack; #endif #ifdef USE_LIBPCRE2 #define PCRE2_CODE_UNIT_WIDTH 8 @@ -85,7 +81,6 @@ struct grep_pat { regex_t regexp; pcre *pcre1_regexp; pcre_extra *pcre1_extra_info; - pcre_jit_stack *pcre1_jit_stack; const unsigned char *pcre1_tables; int pcre1_jit_on; pcre2_code *pcre2_pattern; From patchwork Fri Jul 26 15:08:14 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11061217 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 17A8F746 for ; Fri, 26 Jul 2019 15:09:08 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0874928B3B for ; Fri, 26 Jul 2019 15:09:08 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F13AE28B36; Fri, 26 Jul 2019 15:09:07 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9CE4B28B38 for ; Fri, 26 Jul 2019 15:09:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387765AbfGZPJG (ORCPT ); Fri, 26 Jul 2019 11:09:06 -0400 Received: from mail-wm1-f68.google.com ([209.85.128.68]:39865 "EHLO mail-wm1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387749AbfGZPJE (ORCPT ); Fri, 26 Jul 
2019 11:09:04 -0400 Received: by mail-wm1-f68.google.com with SMTP id u25so37617407wmc.4 for ; Fri, 26 Jul 2019 08:09:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=5VK7GSe3vK933aCGBev1CIXdxyB0Q+tZ+wQj2qxGq3Q=; b=CijPiRh+YXjQU0DB/QudEjnJy7FmDd3uAhCyLTv7BhwWvUbNZOh4jquekBgWy7WGKy OfBoaVJyj5kqdpuw6faV6fqBCnzAnQh/ZkT31sRXD6INJrle7nmSfOMd6BmiXpzfYWaL rkrMgsXisZMyn4duJ41vtsKBYXGTBmtKdLzkytQmnAV7N0bhS+gofNS/ov1TPIR8x1Ae 2MK3LMWr4gVdawrgrLyshz4mvM/WgL6f1ip7FuUZ81OnBqglkAoSCjcFW3z+SaHf4Rtj 9C05xoMBwZKNb1VPILQY93e+uiYq3UcF+RGDtk1PI3l0byuhamptgoU+SHWlwmIohbMA pwFQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=5VK7GSe3vK933aCGBev1CIXdxyB0Q+tZ+wQj2qxGq3Q=; b=PdbpRJwBNP0bGJ/YZ+fwNXFZjSlQgGxaNHZBIDNRxBTJIDYRAG1DTiukKlc1dpQ+jN yisQjDEe4KUNHxGSNm2etN7et+X7rYsspMuZMBdbtMpyzvugvOM0Fc+Jorq+nWxnRDRe i/CnI7vGWgKHiJn7In0cdUsuv+CxiovmfuosRu+zrfWEOF4owS/qgjezjHElsppRxScw vUCMzXsngiuCsQFskCXgvEpQAIDvvyfNYb3kZZOgzswEjgkk/pSEPRxyhPdVhZdb09ls plky1jfzF7jm0MN6V2pEja/nVMRE2ayQFdkl7BfrjUJH+tk0Aymqdl/6jxU4vxbv6bYh yGoA== X-Gm-Message-State: APjAAAUjp1GhcQ5lw1JHB8C1J2u2QIz/Hpm166664+fP48UKRUPSBM69 S0TotZQyrTQ3GBjxCCgwGQMefZiEPKQ= X-Google-Smtp-Source: APXvYqyQEUNLIG+KXEI3fmPRlp+mRw8bilYEnHxZs2YLtxUUysiW7Grm0/fAUFn6xrwK7n8Z2sZnCg== X-Received: by 2002:a1c:2ec6:: with SMTP id u189mr33263760wmu.67.1564153741848; Fri, 26 Jul 2019 08:09:01 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id p63sm4814341wmp.45.2019.07.26.08.09.00 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Fri, 26 Jul 2019 08:09:00 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , 
=?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Beat Bolli , Johannes Schindelin , =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp() Date: Fri, 26 Jul 2019 17:08:14 +0200 Message-Id: <20190726150818.6373-5-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190724151415.3698-1-avarab@gmail.com> References: <20190724151415.3698-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP At the start of this function we do: p->fixed = opt->fixed; It's less confusing to use that variable consistently that switch back & forth between the two. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/grep.c b/grep.c index 9c2b259771..b94e998680 100644 --- a/grep.c +++ b/grep.c @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); pat_is_fixed = is_fixed(p->pattern, p->patternlen); - if (opt->fixed || pat_is_fixed) { + if (p->fixed || pat_is_fixed) { #ifdef USE_LIBPCRE2 opt->pcre2 = 1; if (pat_is_fixed) { From patchwork Fri Jul 26 15:08:15 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11061219 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 87F0E746 for ; Fri, 26 Jul 2019 15:09:09 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7674B28B3E for ; Fri, 26 Jul 2019 15:09:09 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6A22328B47; Fri, 26 Jul 2019 15:09:09 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0CE7D28B3C for ; Fri, 26 Jul 2019 15:09:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387763AbfGZPJG (ORCPT ); Fri, 26 Jul 2019 11:09:06 -0400 Received: from mail-wr1-f65.google.com ([209.85.221.65]:44932 "EHLO mail-wr1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387756AbfGZPJE (ORCPT ); Fri, 26 Jul 2019 11:09:04 -0400 Received: by mail-wr1-f65.google.com with SMTP id p17so54778741wrf.11 for ; Fri, 26 Jul 2019 08:09:04 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; 
h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=l7b/H9pWjz7ysPsDqIE/RUWlARK66YOiOP7x8FPQTDw=; b=om3tFF5iSu+QXOxWzS8EaF6GvFjO0tFx6/573LRyqYGLFXfDzuU4hcXmUKzUM1RlRJ 7u6jJrndnePbkeNzzyZyXqeikkI0PNEz6A9Z0HEebx4mfZ6/qNEBkIAtTBkIahbE1+MW qcHnB3uJyz5fnySdahKTRtMwCKEygIpZgur6pNIzj81UzOilAA7c/t+TR9t6/O0Aa3ro wqoaml0vjXD5QNtE/uoIv18k9ZHZXyiibTL6gNLbDVVRHmC+1Ml8xswJnMJEOOhuHh1+ 0fzP8klUHS+6X390nidKo3KN5rDlU6CDMW1JXBCpS5rF+TkPSQFRBl47Ab5VC/wAroyQ UbWQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=l7b/H9pWjz7ysPsDqIE/RUWlARK66YOiOP7x8FPQTDw=; b=tY6jzG/Y3vluxUCvzePX7VfwnysyrtaZn24yB/heQmY73sO+0TowOfBvaizMYckaC1 8lqChCFCdqMHfg8oj/C+zU3emBBlcy5cErm8qjqWxIMVVLSf1skdoT+jrG32+zW+3X6a LTBJuC8CXc5MuRRhGnGazgDNgYAue28lUAhtXZngnbd78BYMlqbGDgTc0HLW+osmwklh VsPN2HSqqxjb6ml6nJAwN+KSTLc9lXx1FYZrg/Nt9Jwq7S1r7v/8EieQFIKqAHMpW0D8 JjaR/POK2HP3v6E4jGMy4DmV4WkO7/fDiBc5mtIT5Il/FDYYGomRBBXVnZtTxfnR08l7 QXzg== X-Gm-Message-State: APjAAAVxT77iU6uJHYMe64xH7F29vn2cGFtuW9EoZWddFX3T+SNtQy6Y Tio9QKAwh8tBe5vBbv2LiA3Bi/UW+1M= X-Google-Smtp-Source: APXvYqwIPwKvni9GDGXuUCvzn3IQ+llnRmiEmqyPwEWEWjy4AklodaxyLw43wHmFAQ04j58aem728w== X-Received: by 2002:adf:e483:: with SMTP id i3mr61545662wrm.210.1564153743071; Fri, 26 Jul 2019 08:09:03 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id p63sm4814341wmp.45.2019.07.26.08.09.01 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Fri, 26 Jul 2019 08:09:02 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Beat Bolli , Johannes Schindelin , =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v2 5/8] grep: create a "is_fixed" member in "grep_pat" Date: 
Fri, 26 Jul 2019 17:08:15 +0200 Message-Id: <20190726150818.6373-6-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190724151415.3698-1-avarab@gmail.com> References: <20190724151415.3698-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP This change paves the way for later using this value in the regex compile functions themselves. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 7 +++---- grep.h | 1 + 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/grep.c b/grep.c index b94e998680..6d60e2e557 100644 --- a/grep.c +++ b/grep.c @@ -606,7 +606,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) { int err; int regflags = REG_NEWLINE; - int pat_is_fixed; p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; @@ -615,11 +614,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). 
This is only supported with -P under PCRE v2")); - pat_is_fixed = is_fixed(p->pattern, p->patternlen); - if (p->fixed || pat_is_fixed) { + p->is_fixed = is_fixed(p->pattern, p->patternlen); + if (p->fixed || p->is_fixed) { #ifdef USE_LIBPCRE2 opt->pcre2 = 1; - if (pat_is_fixed) { + if (p->is_fixed) { compile_pcre2_pattern(p, opt); } else { /* diff --git a/grep.h b/grep.h index ce2d72571f..c0c71eb4a9 100644 --- a/grep.h +++ b/grep.h @@ -88,6 +88,7 @@ struct grep_pat { pcre2_compile_context *pcre2_compile_context; uint32_t pcre2_jit_on; unsigned fixed:1; + unsigned is_fixed:1; unsigned ignore_case:1; unsigned word_regexp:1; }; From patchwork Fri Jul 26 15:08:16 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11061225 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9AE96746 for ; Fri, 26 Jul 2019 15:09:12 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 89AE728B26 for ; Fri, 26 Jul 2019 15:09:12 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7D99928B45; Fri, 26 Jul 2019 15:09:12 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id EA88428B26 for ; Fri, 26 Jul 2019 15:09:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387760AbfGZPJK (ORCPT ); Fri, 26 Jul 2019 11:09:10 -0400 Received: from 
mail-wr1-f65.google.com ([209.85.221.65]:39623 "EHLO mail-wr1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387587AbfGZPJH (ORCPT ); Fri, 26 Jul 2019 11:09:07 -0400 Received: by mail-wr1-f65.google.com with SMTP id x4so1650508wrt.6 for ; Fri, 26 Jul 2019 08:09:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=eZ/w1lSHdkPbdWhdGQSuekUaVPK1hxxgQnFp3e4/tWc=; b=QqQaCy43hmgV2+NGUm6TbF/ZMJyg22D0cjH3wCrks1GAGOGesdJ97Tlzg0wLlki8Ub krPk1Jg6TXz+gTXXM+38S0gfkc5mLhTEEu78e4pIDNgnh15vZQViJrry88f7iNsqo9H3 EVMG+h25vva6YKkeeDPHd7C5KhhJcZyaeuKelr9xfQFHF+5/wooMCGb3aem81NlyBNzh 5LQQC6RRvwu1CPnfbz2eCdXoX+VPhhMA5tyc/xN1CR8yBzizO3Ml9yTeRVIbW/ZieiPy jVw05wLxl56j3IVEWzNven5NVBU+s0eeEBizPpZf9daAep0gfvyC2RdRM6aKI0YBIZnr 45nw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=eZ/w1lSHdkPbdWhdGQSuekUaVPK1hxxgQnFp3e4/tWc=; b=sX5bS6eXuP1cgRfVV2quOBnFSeJpJtyas7GMVucQ8HMDpDvR8ydO3+ACYJIuhKJo2F iTYKBP4vodquV15gtWWt6zWx4oRZvE4OrJ/gg97BKnAB1sdvwR5N9fw40ri0DOpli/2/ pRcT+lj+0Ty7eiRbSR7ZfxHTYeX7V2xTjhI7haI9ClflfdM8RN1kFj0tmxGl0shmLG/e WjJjAnqjQtAfbgWsJlc9fIpXlnKp8R9ElJlHkfehvKTcYci4Qc45lf9jkgyuKucwNMlY gOTnVpi95/Bu48U23UnRgOBq89fSVZOzV1TRVygJ2Ym5uFn30NhTyLuFylJFvBUJSqJ7 edVA== X-Gm-Message-State: APjAAAVEiQGkaPU8oH3mqSDoqsEkDVCcTFZUd8GUfwL41eL8ybYePsO8 2zCYzWE0XPEzhY5Xdl1ZT1yMAqD/jzs= X-Google-Smtp-Source: APXvYqyibzje7DvZ0txyy1/lCINh22bTZg9Qe/peWFe+7wbAMMHRN+dIeiTCWu1OmRXdppXNa2Qomg== X-Received: by 2002:adf:f601:: with SMTP id t1mr6170341wrp.337.1564153744266; Fri, 26 Jul 2019 08:09:04 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id p63sm4814341wmp.45.2019.07.26.08.09.03 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 
bits=256/256); Fri, 26 Jul 2019 08:09:03 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Beat Bolli , Johannes Schindelin , =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data Date: Fri, 26 Jul 2019 17:08:16 +0200 Message-Id: <20190726150818.6373-7-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190724151415.3698-1-avarab@gmail.com> References: <20190724151415.3698-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Since my b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01) we've been dying on invalid UTF-8 data when grepping for fixed strings if the following are all true: * The subject string is non-ASCII (e.g. "ævar") * We're in a locale for which is_utf8_locale() is true, e.g. "en_US.UTF-8", not "C" * We compiled with PCRE v2 * That PCRE v2 did not have JIT support The last of those is why this wasn't caught earlier. Per pcre2jit(3): "unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined." I.e. the subject being matched against our pattern was invalid; we were lucky and got away with it on the JIT path, but the non-JIT path is stricter. This patch does nothing to fix that. Instead we sneak in support for fixed patterns starting with "(*NO_JIT)", which disables the PCRE v2 JIT under implicit fixed-string matching for testing; see pcre2syntax(3) for the syntax. This is technically a change in behavior, but it's so obscure that I figured it was OK. 
We'd previously have considered this an invalid regular expression, as regcomp() would die on it; now we feed it to the PCRE v2 fixed-string path. I thought this was better than introducing yet another GIT_TEST_* environment variable. We're also relying on a behavior of PCRE v2 that technically could change, but I think the test coverage is worth dipping our toe into somewhat undefined behavior. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 10 ++++++++++ t/t7812-grep-icase-non-ascii.sh | 28 ++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/grep.c b/grep.c index 6d60e2e557..5bc0f4f32a 100644 --- a/grep.c +++ b/grep.c @@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); p->is_fixed = is_fixed(p->pattern, p->patternlen); +#ifdef USE_LIBPCRE2 + if (!p->fixed && !p->is_fixed) { + const char *no_jit = "(*NO_JIT)"; + const int no_jit_len = strlen(no_jit); + if (starts_with(p->pattern, no_jit) && + is_fixed(p->pattern + no_jit_len, + p->patternlen - no_jit_len)) + p->is_fixed = 1; + } +#endif if (p->fixed || p->is_fixed) { #ifdef USE_LIBPCRE2 opt->pcre2 = 1; diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 0c685d3598..96c3572056 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -53,4 +53,32 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' ' test_cmp expected actual ' +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' ' + printf "\\200\\n" >invalid-0x80 && + echo "ævar" >expected && + cat expected >>invalid-0x80 && + git add invalid-0x80 +' + +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' ' + git grep -h "var" invalid-0x80 >actual && + test_cmp expected actual && + git grep -h "(*NO_JIT)var" invalid-0x80 >actual && + test_cmp expected actual +' + 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' ' + test_might_fail git grep -h "æ" invalid-0x80 >actual && + test_cmp expected actual && + test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 && + test_cmp expected actual +' + +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' ' + test_might_fail git grep -hi "Æ" invalid-0x80 >actual && + test_cmp expected actual && + test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 && + test_cmp expected actual +' + test_done From patchwork Fri Jul 26 15:08:17 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11061221 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9A5CC13A0 for ; Fri, 26 Jul 2019 15:09:10 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8999E28B30 for ; Fri, 26 Jul 2019 15:09:10 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 7C32128B44; Fri, 26 Jul 2019 15:09:10 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DC35E28B3E for ; Fri, 26 Jul 2019 15:09:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387770AbfGZPJJ (ORCPT ); Fri, 26 Jul 2019 11:09:09 -0400 Received: from mail-wr1-f66.google.com ([209.85.221.66]:38820 "EHLO 
mail-wr1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387760AbfGZPJH (ORCPT ); Fri, 26 Jul 2019 11:09:07 -0400 Received: by mail-wr1-f66.google.com with SMTP id g17so54810536wrr.5 for ; Fri, 26 Jul 2019 08:09:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=UCbaawn9qITQQVmcx7WGR7ndms2+0yVqXlVziRzXhuY=; b=mCDP9L21Wzx+UC08PgO210NZER6dv0lscsGh1riUd3QcYXmQ/6uSRTeVK0o0u3eKyz UMrxbdGIXYVSv8cS3j3Rj5PBZe6iQtw443O6CiH657YKU7gmIt+AAqEUbkNtXTaNKb/+ SnvtCncnXVa9ACR+mJJXc69LkAyC4qDKH4bijVwLZWfvFk5mnRku8PQiJrABmQ+iPZkY RVp5UVLF/lXEHr+Y8WX9nTZ+laevUFs9efN9jAKCiZNDqiUpvzmHdPBFLXqO+r8fSJza jlr/wmMj4wqjgHVVC1lymSkZKu6YWLdDAqcJFk9ns17Kcrce2Hq6ZCbHAiy21qy9w96j kqUQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=UCbaawn9qITQQVmcx7WGR7ndms2+0yVqXlVziRzXhuY=; b=HboD/mf1WZ74Oy89El4IrGejsLnv5eylINz1rpW6RQUhYRPQ6HDHxHrU6/VNVr82M0 KjBrem5NDpzeKtaSqsqzgtrpchhj+siYPpdsPA87J6nttU7K2MSCG8mPnUj150G53DRq DoHvdXuY0fZnIa52dwHwJvuJNN23MFaRSNi0Cuy6tBkN4dqT3O3HEFSi4xfoRFHMsBVU vp3DSl3wf2dZBt4zp4MGDhDfj5Et8L+M+ilOWVVyQ4UjHytHz9aka0qV9bVJU8VtwPIn cwuvLDZnOFUo7oAjwjTjEUvfeQ5Bxn8kouVdDXKCzkbmn+To6WHdoe24c/y1+jTyA/RQ yLIQ== X-Gm-Message-State: APjAAAVUoVnvbQTGR3lu86fj9pqP12E4Wt+5yJEyOWr91f3o2QiIr5B/ 8c3ebWd8Jlti1OJHLZM1aKHoFPEM0KU= X-Google-Smtp-Source: APXvYqxfQmA78N0V0cssm9jUpdOaJM8DsupAWi7OZBvalngkLNI5yzCK7bPi8b0vphh3f5EOADfKqA== X-Received: by 2002:adf:d4c6:: with SMTP id w6mr104634014wrk.98.1564153745614; Fri, 26 Jul 2019 08:09:05 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id p63sm4814341wmp.45.2019.07.26.08.09.04 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Fri, 26 Jul 2019 08:09:04 -0700 (PDT) 
From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Beat Bolli , Johannes Schindelin , =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching Date: Fri, 26 Jul 2019 17:08:17 +0200 Message-Id: <20190726150818.6373-8-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190724151415.3698-1-avarab@gmail.com> References: <20190724151415.3698-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP As discussed in the previous commit, partially fix a bug introduced in b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01). Because PCRE v2, unlike kwset, validates its UTF-8 input we'd die on e.g.: fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set when grepping a non-ASCII fixed string. This is a more general problem that's hard to fix, but we can at least fix the most common case of grepping for a fixed string without "-i". I can't think of a reason why we'd turn on PCRE2_UTF when matching byte-for-byte like that. 
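To make the resulting condition concrete, here is a minimal standalone sketch of the decision described above. It is an illustration only, not the code in grep.c: the function name wants_pcre2_utf() and its boolean parameters are hypothetical stand-ins for opt->ignore_locale, is_utf8_locale(), has_non_ascii(p->pattern), opt->ignore_case and (p->fixed || p->is_fixed).

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of git): decide whether a pattern
 * should be compiled with PCRE2_UTF, following the rule described
 * in the commit message.
 */
static bool wants_pcre2_utf(bool ignore_locale, bool utf8_locale,
			    bool pattern_non_ascii, bool ignore_case,
			    bool fixed)
{
	/* UTF mode is only considered for non-ASCII patterns in a UTF-8 locale. */
	if (ignore_locale || !utf8_locale || !pattern_non_ascii)
		return false;

	/*
	 * A case-sensitive fixed string is matched byte-for-byte, so
	 * UTF mode (and with it UTF-8 validation of the subject) is
	 * not needed.
	 */
	if (!ignore_case && fixed)
		return false;

	return true;
}
```

With ignore_case false and fixed true the sketch returns false, which is the case this patch addresses; a fixed string combined with "-i" still returns true, so that case remains open here.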
Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 3 ++- t/t7812-grep-icase-non-ascii.sh | 4 ++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/grep.c b/grep.c index 5bc0f4f32a..c7c06ae08d 100644 --- a/grep.c +++ b/grep.c @@ -472,7 +472,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } options |= PCRE2_CASELESS; } - if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern)) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && + !(!opt->ignore_case && (p->fixed || p->is_fixed))) options |= PCRE2_UTF; p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 96c3572056..531eb59d57 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -68,9 +68,9 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UT ' test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' ' - test_might_fail git grep -h "æ" invalid-0x80 >actual && + git grep -h "æ" invalid-0x80 >actual && test_cmp expected actual && - test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 && + git grep -h "(*NO_JIT)æ" invalid-0x80 && test_cmp expected actual ' From patchwork Fri Jul 26 15:08:18 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11061229 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 02F7414E5 for ; Fri, 26 Jul 2019 15:09:16 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E720728B26 for ; Fri, 26 Jul 2019 15:09:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 
486) id DB37F28B38; Fri, 26 Jul 2019 15:09:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4780528B27 for ; Fri, 26 Jul 2019 15:09:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387783AbfGZPJO (ORCPT ); Fri, 26 Jul 2019 11:09:14 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:55545 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387757AbfGZPJK (ORCPT ); Fri, 26 Jul 2019 11:09:10 -0400 Received: by mail-wm1-f66.google.com with SMTP id a15so48296999wmj.5 for ; Fri, 26 Jul 2019 08:09:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=OBkWT95Uwgo8haPGhHLiGW0JxysHkMuA1iAqHf/Uss4=; b=Yug0/bi2t+3fsWAQEvKsS4SULX26Ms3WxmdMXf5Bdm3FnapFuy7aCEsS7D4sQXquMW MxYm6OzZbctyQbfEe3+CmgJhKtOd6GqD3XU3vnmLpfFLfpKY4FlzDXXS3liyzoKspCri TpUDlx1X4GlYxcpR8RdQhXzvCtVWGu/uMFkYAPppxz5ICxly1jqv5zF2qHmWSWLZvsct ZpWCvarMg2WQKfOdVRNmXy16YYI70bbKhmmysFWEZvudyPA0Ps6TlJ3TBKXnSdsYmZr5 Mh+uuwQ9IBCAkFu8zs6iycIRZFXr4zZ4ywCv3ft/BKTKzqdF2mY4nKLdrKLHnvlytf4Z XUrw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=OBkWT95Uwgo8haPGhHLiGW0JxysHkMuA1iAqHf/Uss4=; b=rHGK5dcZHw+Qkg/lUmgVZ8Kr1NHdsKZP/NxmWuIb52ym8RJqbDhUJngQ/NtE/iwa0z 5hx2lk8nP6HXP5I2BCNXL3VCMfebjNNhWuP/puoB/FghzG7bkYRBbrmwQTApn6hmlHIf 
muhYxghmuB6l7XFkJ9EiU4ojPBpN8ZpW+TJKT3HUrGVIx3ruboET+bZ497Z6mSLQNBi3 aCvtqiUc01KhDPF3WgjDdEk06T+AyFGbgzmaUm5u6jrrLK6+8lD/0uoGLy1b4CmYxNDT I28GDfU4SLnoEaKUXyy+V+81MWOCtM2Xq+J2txSi5eW3cpzuDSz1pcUdw2fm9GMwI7RH BhrA== X-Gm-Message-State: APjAAAVTi9fFvMNDuYFPhxwI53UndD8E2XCFbPRpx2/BvaRaxNWNxt4V 8+Q6noQxQ5eyz6VVbFe1EJG5E/cWexc= X-Google-Smtp-Source: APXvYqzI2fVsqaJWCpzpnFX6Kr5zgigMPaSv8EkyXkVAiPvY2VJOpaSeOlIche6f0DERHXEEzaFivg== X-Received: by 2002:a7b:c651:: with SMTP id q17mr79718213wmk.136.1564153746982; Fri, 26 Jul 2019 08:09:06 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id p63sm4814341wmp.45.2019.07.26.08.09.05 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Fri, 26 Jul 2019 08:09:05 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: Junio C Hamano , =?utf-8?q?Carlo_Marcelo_Arenas_Bel?= =?utf-8?q?=C3=B3n?= , Beat Bolli , Johannes Schindelin , =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF Date: Fri, 26 Jul 2019 17:08:18 +0200 Message-Id: <20190726150818.6373-9-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190724151415.3698-1-avarab@gmail.com> References: <20190724151415.3698-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP As discussed in the "grep: stess test PCRE v2 on invalid UTF-8 data" commit leading up to this one there's a regression in b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01) when matching UTF-8 data. This ultimately isn't straightforward to just "fix", because the kwset backend was so dumb about icase matching that we'd skip it entirely on non-ASCII. See the code removed in 48de2a768c ("grep: remove the kwset optimization", 2019-07-01). 
Just going back to the C library for those isn't ideal, since it's likely to be even dumber about these mixed-encoding cases. So let's support this "properly" using the PCRE2_MATCH_INVALID_UTF flag. This is new code that's not in any released PCRE v2 version, so we might need a fix that emulates it somehow. I figure that with the non-icase bug out of the way, this case is obscure enough to tell people "upgrade your PCRE v2 too!". It'll likely be released by the time we release the git version this commit is part of. We can't just use PCRE2_NO_UTF_CHECK instead for the reasons discussed in [1]. 1. https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- Makefile | 1 + grep.c | 2 +- grep.h | 3 +++ t/helper/test-pcre2-config.c | 12 ++++++++++++ t/helper/test-tool.c | 1 + t/helper/test-tool.h | 1 + t/t7812-grep-icase-non-ascii.sh | 13 ++++++++++++- 7 files changed, 31 insertions(+), 2 deletions(-) create mode 100644 t/helper/test-pcre2-config.c diff --git a/Makefile b/Makefile index bd246f2989..dd38d5e527 100644 --- a/Makefile +++ b/Makefile @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o TEST_BUILTINS_OBJS += test-online-cpus.o TEST_BUILTINS_OBJS += test-parse-options.o TEST_BUILTINS_OBJS += test-path-utils.o +TEST_BUILTINS_OBJS += test-pcre2-config.o TEST_BUILTINS_OBJS += test-pkt-line.o TEST_BUILTINS_OBJS += test-prio-queue.o TEST_BUILTINS_OBJS += test-reach.o diff --git a/grep.c b/grep.c index c7c06ae08d..8b8b9efe12 100644 --- a/grep.c +++ b/grep.c @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt } if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) && !(!opt->ignore_case && (p->fixed || p->is_fixed))) - options |= PCRE2_UTF; + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, p->patternlen, options, &error, &erroffset, diff --git a/grep.h b/grep.h index 
c0c71eb4a9..506f05b97b 100644 --- a/grep.h +++ b/grep.h @@ -21,6 +21,9 @@ typedef int pcre_extra; #ifdef USE_LIBPCRE2 #define PCRE2_CODE_UNIT_WIDTH 8 #include +#ifndef PCRE2_MATCH_INVALID_UTF +#define PCRE2_MATCH_INVALID_UTF 0 +#endif #else typedef int pcre2_code; typedef int pcre2_match_data; diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c new file mode 100644 index 0000000000..5258fdddba --- /dev/null +++ b/t/helper/test-pcre2-config.c @@ -0,0 +1,12 @@ +#include "test-tool.h" +#include "cache.h" +#include "grep.h" + +int cmd__pcre2_config(int argc, const char **argv) +{ + if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) { + int value = PCRE2_MATCH_INVALID_UTF; + return !value; + } + return 1; +} diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index ce7e89028c..e022ce0e48 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -40,6 +40,7 @@ static struct test_cmd cmds[] = { { "online-cpus", cmd__online_cpus }, { "parse-options", cmd__parse_options }, { "path-utils", cmd__path_utils }, + { "pcre2-config", cmd__pcre2_config }, { "pkt-line", cmd__pkt_line }, { "prio-queue", cmd__prio_queue }, { "reach", cmd__reach }, diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h index f805bb39ae..acd8af2a9d 100644 --- a/t/helper/test-tool.h +++ b/t/helper/test-tool.h @@ -30,6 +30,7 @@ int cmd__oidmap(int argc, const char **argv); int cmd__online_cpus(int argc, const char **argv); int cmd__parse_options(int argc, const char **argv); int cmd__path_utils(int argc, const char **argv); +int cmd__pcre2_config(int argc, const char **argv); int cmd__pkt_line(int argc, const char **argv); int cmd__prio_queue(int argc, const char **argv); int cmd__reach(int argc, const char **argv); diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh index 531eb59d57..848d46e4f9 100755 --- a/t/t7812-grep-icase-non-ascii.sh +++ b/t/t7812-grep-icase-non-ascii.sh @@ -74,11 +74,22 @@ test_expect_success 
GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invali test_cmp expected actual ' -test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' ' +test_lazy_prereq PCRE2_MATCH_INVALID_UTF ' + test-tool pcre2-config has-PCRE2_MATCH_INVALID_UTF +' + +test_expect_success GETTEXT_LOCALE,LIBPCRE2,!PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' ' test_might_fail git grep -hi "Æ" invalid-0x80 >actual && test_cmp expected actual && test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 && test_cmp expected actual ' +test_expect_success GETTEXT_LOCALE,LIBPCRE2,PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' ' + git grep -hi "Æ" invalid-0x80 >actual && + test_cmp expected actual && + git grep -hi "(*NO_JIT)Æ" invalid-0x80 && + test_cmp expected actual +' + test_done
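The feature detection used in this last patch can be illustrated in isolation. The following is a minimal sketch under stated assumptions, not the helper added by the patch: it deliberately does not include the PCRE v2 header, so the fallback definition always applies here, and pcre2_config_status() is a hypothetical stand-in for the test-tool subcommand. The return value follows the exit-status convention used by the prerequisite check, where 0 means "supported".

```c
#include <string.h>

/*
 * Assumption: no PCRE v2 header is included in this sketch, so the
 * fallback below always takes effect. With a PCRE v2 that defines
 * the flag, the #ifndef is skipped and the library's nonzero value
 * is used instead.
 */
#ifndef PCRE2_MATCH_INVALID_UTF
#define PCRE2_MATCH_INVALID_UTF 0
#endif

/*
 * Hypothetical stand-in for the test helper: returns 0 ("success")
 * only when the flag has a real, nonzero value.
 */
static int pcre2_config_status(const char *query)
{
	if (!strcmp(query, "has-PCRE2_MATCH_INVALID_UTF")) {
		int value = PCRE2_MATCH_INVALID_UTF;
		return !value;
	}
	return 1;
}
```

Defining the missing flag as 0 means OR-ing it into the compile options is a no-op on older libraries, so the call site needs no #ifdef of its own.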