From patchwork Mon Jul 1 21:20:51 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026733 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 26B661580 for ; Mon, 1 Jul 2019 21:21:16 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1A3312870C for ; Mon, 1 Jul 2019 21:21:16 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 0E3742871E; Mon, 1 Jul 2019 21:21:16 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 91CBC2870C for ; Mon, 1 Jul 2019 21:21:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726798AbfGAVVO (ORCPT ); Mon, 1 Jul 2019 17:21:14 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:35893 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726586AbfGAVVN (ORCPT ); Mon, 1 Jul 2019 17:21:13 -0400 Received: by mail-wm1-f66.google.com with SMTP id u8so1020775wmm.1 for ; Mon, 01 Jul 2019 14:21:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=jGETbHwR8FkQ1GAWFD9sYLqQAWVHiYH/Mnhtx31OTo0=; b=JrRGk1zlzDh6DJoLWvm0QG6wufkswfG6Tn3QUCf3ev2lViSz6is8glTkVcHURCK1vB nhU22AY+SLAqf4kDKVtPWWJMXvTDdrRVypYgvPOQfEsPXvIZJgnwWvfRljai6/QLflIk /yiWdNI76hOCo900hKuiiF2PHjxLVSOZUwfR/rlK3koQyDVQICG2e1Y5DFjM6YQ+IB0x 6Ul6NZc2pKi2c8hsIydVM0T+jMb/wMGDtqEw0PqIbDi7NrdQ1iOoSG5awlU70337MI93 U1YbXNS1/RIA0qmO9EY7ctsxZslIxVH0Cl6RoGesWO8SFM9iJc8rezPABmyDcjP0s2tK Da0w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=jGETbHwR8FkQ1GAWFD9sYLqQAWVHiYH/Mnhtx31OTo0=; b=TLv5VE6HPiAFYGzIga/8/07F8ItFGsjPidR7+5utA663CiQI/R1rgZ0fsFG7PCb8YL 920K63lVuniAztbCC5EAeSSMFTrQRyFBrmzAXaIjYfbhgE2lwi3NbaHFglNPEzHJ37Wo cDN41PQhFaOxq/5FLW1PAGu2Nkb1hxn8Ec7tXb/0S4C21lmAvAiZ/SqWeM+8SnmL1oKM 00EvkmsoMJkyi8bBDvMZMONJeEePX3NnGKoIyFuz7SJaqbFynI3kcBM8Qv97vYd1Orm5 +a1kRr6N0yU+WTTKcCnJx+H7hHujifc3A8g9rjhphi2jmcBNbLyv9o8PAaXksIDgG58u ZvBg== X-Gm-Message-State: APjAAAXnaTT4UwdoSbPoOx98bCZkBrgEZkDBCWc1vdWum56fsEoovDBO Q0oxz9PJdlrPyeXqp/VpSGoY7vkypFk= X-Google-Smtp-Source: APXvYqxEDJA2uPA+gZW6d05uXStjFicB/L9nFjKvo8U/4w12Jyuw1WwrdkFDzhnjKnkVTKS03BTuzQ== X-Received: by 2002:a1c:7a01:: with SMTP id v1mr742462wmc.10.1562016070486; Mon, 01 Jul 2019 14:21:10 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.09 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:09 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 01/10] log tests: test regex backends in "--encode=" tests Date: Mon, 1 Jul 2019 23:20:51 +0200 Message-Id: <20190701212100.27850-2-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Improve the tests added in 04deccda11 ("log: re-encode commit messages before grepping", 2013-02-11) to test the regex backends. Those tests never worked as advertised, due to the is_fixed() optimization in grep.c (which was in place at the time), and the needle in the tests being a fixed string. We'd thus always use the "fixed" backend during the tests, which would use the kwset() backend. This backend liberally accepts any garbage input, so invalid encodings would be silently accepted. In a follow-up commit we'll fix this bug, this test just demonstrates the existing issue. In practice this issue happened on Windows, see [1], but due to the structure of the existing tests & how liberal the kwset code is about garbage we missed this. Cover this blind spot by testing all our regex engines. The PCRE backend will spot these invalid encodings. It's possible that this test breaks the "basic" and "extended" backends on some systems that are more anal than glibc about the encoding of locale issues with POSIX functions that I can remember, but PCRE is more careful about the validation. 1. https://public-inbox.org/git/nycvar.QRO.7.76.6.1906271113090.44@tvgsbejvaqbjf.bet/ Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t4210-log-i18n.sh | 41 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index 7c519436ef..86d22c1d4c 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -1,12 +1,15 @@ #!/bin/sh test_description='test log with i18n features' -. ./test-lib.sh +. ./lib-gettext.sh # two forms of é utf8_e=$(printf '\303\251') latin1_e=$(printf '\351') +# invalid UTF-8 +invalid_e=$(printf '\303\50)') # ")" at end to close opening "(" + test_expect_success 'create commits in different encodings' ' test_tick && cat >msg <<-EOF && @@ -53,4 +56,40 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' ' test_must_be_empty actual ' +for engine in fixed basic extended perl +do + prereq= + result=success + if test $engine = "perl" + then + result=failure + prereq="PCRE" + else + prereq="" + fi + force_regex= + if test $engine != "fixed" + then + force_regex=.* + fi + test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + cat >expect <<-\EOF && + latin1 + utf8 + EOF + LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual && + test_cmp expect actual + " + + test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual && + test_must_be_empty actual + " + + test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" " + LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual && + test_must_be_empty actual + " +done + test_done From patchwork Mon Jul 1 21:20:52 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026735 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C126D13A4 for ; Mon, 1 Jul 2019 21:21:16 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B52A42870C for ; Mon, 1 Jul 2019 21:21:16 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A951828718; Mon, 1 Jul 2019 21:21:16 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 229F22871F for ; Mon, 1 Jul 2019 21:21:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726829AbfGAVVP (ORCPT ); Mon, 1 Jul 2019 17:21:15 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:38613 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726509AbfGAVVO (ORCPT ); Mon, 1 Jul 2019 17:21:14 -0400 Received: by mail-wm1-f66.google.com with SMTP id s15so1005226wmj.3 for ; Mon, 01 Jul 2019 14:21:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=fHxDvc+/06Sn8DHuRS+mgUCZrNHULKEzqugU8V1/wIE=; b=KiU95tO+8p11zMzIAv3xbsxArVMubBZv7tA9RF9MrwdpCqbL42ocAZ3H/wDQg8R23v olv4gr99tv0yM4USSXxwLvl+3fTqzMaQUhRckZbtglYCXIpVT2lAGIDI/MuoRB+9J5ix Ym6auRNdSLHk9M3CPCk3MTBsqyKz7q7/R+RAcVOO3z2iuq1P2+Ehqn6f5pmFvVvVGkUr mDVM3eOruRhPbE2jJGzIieNNzKvzC+cTrqJwqPIdM0mEELIDtAfbxCM3aEoAk/JqLSI8 KtVH9upEtoWOEnk5+A7saepg7sIhQhhOZaPyagWFbDQ7ycGthMVnu5GutFrTvFMkY+jG Ui9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=fHxDvc+/06Sn8DHuRS+mgUCZrNHULKEzqugU8V1/wIE=; b=j7fB935QSR9VkJPcbhSN5EcqOgZE2+z6e6IKTq57rGjhNFj5jgnfHxuOCjaeAhmmnb +UY8VzaRTaOdesJ+HTui1M5q6RRmfl8GfaDo0m8tq1H/29GmkOB3WageMKs27tPiTxWi 2Qlo/JFNgMuJH3BLDAuCgF9t/7M3hykpgvFAAUVb0eO/8EuAidoJiIaRNcAT6FENqUm2 CutDEmOkY6bj254Z9p8tBLHh1vUvx5SbjzTY+g1rbep5SppOqpaRTL6vMOawClKM993A pPzrX6yeqpISAX6WMtVpGPe9Jzcouf1w8MDF4yk0j40z7ZXwQOjorHNLsJ+uGv3nn2kl atCw== X-Gm-Message-State: APjAAAU2bbEFUmagC9qDI00Lvzgd8fYOZUteTTqqRWO7/Xynhvou6fKv Y573Dk/u8ydT5j/x6fwGArrYT9wDYH4= X-Google-Smtp-Source: APXvYqyrCrw2a1cmYrQ8hp9APUqpoM/wb9WJ8bNISN7I0KWIDLOwEIuGJsCDH5LYQ7/a774DBOfYEw== X-Received: by 2002:a7b:c081:: with SMTP id r1mr747167wmh.76.1562016072794; Mon, 01 Jul 2019 14:21:12 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.10 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:10 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 02/10] grep: don't use PCRE2?_UTF8 with "log --encoding=" Date: Mon, 1 Jul 2019 23:20:52 +0200 Message-Id: <20190701212100.27850-3-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Fix a bug introduced in 18547aacf5 ("grep/pcre: support utf-8", 2016-06-25) that was missed due to a blindspot in our tests, as discussed in the previous commit. I then blindly copied the same bug in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) when adding the PCRE v2 code. We should not tell PCRE that we're processing UTF-8 just because we're dealing with non-ASCII. In the case of e.g. "log --encoding=<...>" under is_utf8_locale() the haystack might be in ISO-8859-1, and the needle might be in a non-UTF-8 encoding. Maybe we should be more strict here and die earlier? Should we also be converting the needle to the encoding in question, and failing if it's not a string that's valid in that encoding? Maybe. But for now matching this as non-UTF8 at least has some hope of producing sensible results, since we know that our default heuristic of assuming the text to be matched is in the user locale encoding isn't true when we've explicitly encoded it to be in a different encoding. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 8 ++++---- grep.h | 1 + revision.c | 3 +++ t/t4210-log-i18n.sh | 6 ++---- 4 files changed, 10 insertions(+), 8 deletions(-) diff --git a/grep.c b/grep.c index f7c3a5803e..1de4ab49c0 100644 --- a/grep.c +++ b/grep.c @@ -388,11 +388,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) int options = PCRE_MULTILINE; if (opt->ignore_case) { - if (has_non_ascii(p->pattern)) + if (!opt->ignore_locale && has_non_ascii(p->pattern)) p->pcre1_tables = pcre_maketables(); options |= PCRE_CASELESS; } - if (is_utf8_locale() && has_non_ascii(p->pattern)) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern)) options |= PCRE_UTF8; p->pcre1_regexp = pcre_compile(p->pattern, options, &error, &erroffset, @@ -498,14 +498,14 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt p->pcre2_compile_context = NULL; if (opt->ignore_case) { - if (has_non_ascii(p->pattern)) { + if (!opt->ignore_locale && has_non_ascii(p->pattern)) { character_tables = pcre2_maketables(NULL); p->pcre2_compile_context = pcre2_compile_context_create(NULL); pcre2_set_character_tables(p->pcre2_compile_context, character_tables); } options |= PCRE2_CASELESS; } - if (is_utf8_locale() && has_non_ascii(p->pattern)) + if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern)) options |= PCRE2_UTF; p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern, diff --git a/grep.h b/grep.h index 1875880f37..4bb8a79d93 100644 --- a/grep.h +++ b/grep.h @@ -173,6 +173,7 @@ struct grep_opt { int funcbody; int extended_regexp_option; int pattern_type_option; + int ignore_locale; char colors[NR_GREP_COLORS][COLOR_MAXLEN]; unsigned pre_context; unsigned post_context; diff --git a/revision.c b/revision.c index 621feb9df7..a842fb158a 100644 --- a/revision.c +++ b/revision.c @@ -28,6 +28,7 @@ #include "commit-graph.h" #include "prio-queue.h" #include "hashmap.h" +#include "utf8.h" volatile show_early_output_fn_t show_early_output; @@ -2655,6 +2656,8 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED, &revs->grep_filter); + if (!is_encoding_utf8(get_log_output_encoding())) + revs->grep_filter.ignore_locale = 1; compile_grep_patterns(&revs->grep_filter); if (revs->reverse && revs->reflog_info) diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index 86d22c1d4c..515bcb7ce1 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -59,10 +59,8 @@ test_expect_success 'log --grep does not find non-reencoded values (latin1)' ' for engine in fixed basic extended perl do prereq= - result=success if test $engine = "perl" then - result=failure prereq="PCRE" else prereq="" @@ -72,7 +70,7 @@ do then force_regex=.* fi - test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " cat >expect <<-\EOF && latin1 utf8 @@ -86,7 +84,7 @@ do test_must_be_empty actual " - test_expect_$result GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" " + test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" " LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual && test_must_be_empty actual " From patchwork Mon Jul 1 21:20:53 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026739 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BB7DD13A4 for ; Mon, 1 Jul 2019 21:21:18 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B16972870C for ; Mon, 1 Jul 2019 21:21:18 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A5E5528726; Mon, 1 Jul 2019 21:21:18 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 41C052870C for ; Mon, 1 Jul 2019 21:21:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726838AbfGAVVR (ORCPT ); Mon, 1 Jul 2019 17:21:17 -0400 Received: from mail-wm1-f68.google.com ([209.85.128.68]:56147 "EHLO mail-wm1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726586AbfGAVVQ (ORCPT ); Mon, 1 Jul 2019 17:21:16 -0400 Received: by mail-wm1-f68.google.com with SMTP id a15so872903wmj.5 for ; Mon, 01 Jul 2019 14:21:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=eXnMdLmuMsCvL2a+Rs3FbD+F0m7k0TY93UGAJ8b4tjo=; b=VLstUddyi1x2sHCke0f5lzzUURH+cnJpprdwUhOf0WqkXXuSX9dNb9cLbMtUyvDNBP m8McpECvsel/9JubjITJR3u2RT8mtNbthrcj7JZVSWV+dDvwQTKEJj2xLMgj4+xWpUQj e7r64gOl5tGQTGcKk2hpuT39qO7GHAmQfIYfAH16MPY6S8+BrYIYij9i5HED83HNkTK8 XRIvibq0bhSn/2qjXZx4xR4unjxmSjI/VgPK4nhaJtPyGgm49FMPIW6UIpFj86aa9C97 XOLjkiKxd//n4BzlE2ZeqoePJO9eCycLQAsBHydhUoIu7PJ7agLyOjUDSDottPt9rQPK E5UQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=eXnMdLmuMsCvL2a+Rs3FbD+F0m7k0TY93UGAJ8b4tjo=; b=I5JLDopDwR0pH3wyapm7oAg3FVonovy/kfGDF0MJrA3BQi/JhBFj1gpZQ7ZQNxbZ2V pjewgt+aV3MbC2z7HXDn3dUOXWgVQ+LnPAZy8qGwD9MK7BxXTm7OuBavKlmnmuKyiHmm CvY7C1qj039UG+iG/2F7qHWl7KGR6VNgFY8lUgA9rDoOqbbu+vfOfzmgXHH4syylsYgV FbzbpgohqN4fcf5QIvirclLjqEIHfDxChpVSZhG+FSlpU5wDmOpXOXrrazMKI4TgwkMV Y2WfzBcCPla7bmwVkym6GCd0bPOvhpXtzXmpeuG5IAlZN/qo/c2zeBRtHLtVXIVWBItM TXaQ== X-Gm-Message-State: APjAAAX/3T1CnylyRiJzFIWzEyXq1O7oWFmL8Hol2jGS+xVD+R040OtQ DDZWimZQ0dAdnmYgv3C+ZZhYb3P7TE0= X-Google-Smtp-Source: APXvYqxGDApdQV7oYE4kRdnh1pebIH4DQsTmSojL+UAck67+W7PxPO2Z/1rUUE+NmpP6FYPEVdlqDw== X-Received: by 2002:a1c:c289:: with SMTP id s131mr646718wmf.115.1562016074068; Mon, 01 Jul 2019 14:21:14 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.12 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:13 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 03/10] t4210: skip more command-line encoding tests on MinGW Date: Mon, 1 Jul 2019 23:20:53 +0200 Message-Id: <20190701212100.27850-4-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP In 5212f91deb ("t4210: skip command-line encoding tests on mingw", 2014-07-17) the positive tests in this file were skipped. That left the negative tests that don't produce a match. An upcoming change to migrate the "fixed" backend of grep to PCRE v2 will cause these "log" commands to produce an error instead on MinGW. This is because the command-line on that platform implicitly has its encoding changed before being passed to git. See [1]. 1. https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@tvgsbejvaqbjf.bet/ Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t4210-log-i18n.sh | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/t/t4210-log-i18n.sh b/t/t4210-log-i18n.sh index 515bcb7ce1..6e61f57f09 100755 --- a/t/t4210-log-i18n.sh +++ b/t/t4210-log-i18n.sh @@ -51,7 +51,7 @@ test_expect_success !MINGW 'log --grep does not find non-reencoded values (utf8) test_must_be_empty actual ' -test_expect_success 'log --grep does not find non-reencoded values (latin1)' ' +test_expect_success !MINGW 'log --grep does not find non-reencoded values (latin1)' ' git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual && test_must_be_empty actual ' @@ -70,7 +70,7 @@ do then force_regex=.* fi - test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " cat >expect <<-\EOF && latin1 utf8 @@ -79,12 +79,12 @@ do test_cmp expect actual " - test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " + test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" " LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual && test_must_be_empty actual " - test_expect_success GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" " + test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" " LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual && test_must_be_empty actual " From patchwork Mon Jul 1 21:20:54 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026741 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0BCD81580 for ; Mon, 1 Jul 2019 21:21:19 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 00BAA2870C for ; Mon, 1 Jul 2019 21:21:19 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E94462870F; Mon, 1 Jul 2019 21:21:18 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 879ED28718 for ; Mon, 1 Jul 2019 21:21:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726865AbfGAVVR (ORCPT ); Mon, 1 Jul 2019 17:21:17 -0400 Received: from mail-wm1-f65.google.com ([209.85.128.65]:38617 "EHLO mail-wm1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726509AbfGAVVR (ORCPT ); Mon, 1 Jul 2019 17:21:17 -0400 Received: by mail-wm1-f65.google.com with SMTP id s15so1005324wmj.3 for ; Mon, 01 Jul 2019 14:21:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=bUiqybTOopFcwW4EfTGYqjTSRi5yZxqzQZSqNSqKeew=; b=B5BFmQwZwJlgt4lKd8Az2Bcq8YEtctpoiGYUe938Cz2OJf29J+LQCPtqFiVPmsqAFo Ne6XLaJ/8aDcGrriL0yvZ0DgF/6d2DZEuFotrbp4dI1HYLHn8CkYDn9FUesbbiCmXsas BHQdXqa2IvG5+PmOpPrrR930Uq5Mb1LQmK/RAfDxSE7rlSo0CtY92hw0meNVm6pMJncq ayRahwxuC6NMsTL4OO3pyZiW9e9lDaLucD94rvucLZYs10hJM7YRqHHS8sAX0EM2n9uZ Io/WVSfhhjHnKdvHIdn0BDuNYGrCfxpxSXgfvU1OQGmTa8s88CDQvm3s98yl9zesQ13i 5bvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=bUiqybTOopFcwW4EfTGYqjTSRi5yZxqzQZSqNSqKeew=; b=aryKgvo3hTJjJcXj+RPrhJSJjHrt7YyA1A04PmQp+SoaJ1kIAWzclaJFj+eaprd8jp ggsP6vbhhVD2queAoT/EaXbdI1f8zftEvGtLZVQ14lgrtX4NaqINTCcHeTnqGROQ5ksC jHymQr+Z0Fa5Pc3Na/BxXAnMDkZT+PPbBr+iF3+GE14oGd8sqi3013TJPn2NtZHNymws D7Ll0rJ7ygQ32fTYu71HdaotbkXdM2OJlm0LfGEPBr6oXHaiRukBmdjbbjO9Is3uWSmg Oyf5kphWRNfeTzYC7+9HQ5F5DtXQK46UEM/PfUFPKGBkhvisfx+lcgZee/SFyW8vKOby zKdA== X-Gm-Message-State: APjAAAWxRHIeEs6wyM+vlF7YHnCUc9vkTxxR7kM+4gnVk68T4jd8XtXY ePfZlze77Udl21Wcmw9vLdpcBr6hulM= X-Google-Smtp-Source: APXvYqysdHzrlQeofC8reb9+GPI2R4l/SEK17CBHfCLPT394544IMDLe4eGy/RteCybHYjPvIdW3+A== X-Received: by 2002:a1c:9dc5:: with SMTP id g188mr704187wme.93.1562016075259; Mon, 01 Jul 2019 14:21:15 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.14 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:14 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 04/10] grep: inline the return value of a function call used only once Date: Mon, 1 Jul 2019 23:20:54 +0200 Message-Id: <20190701212100.27850-5-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Since e944d9d932 ("grep: rewrite an if/else condition to avoid duplicate expression", 2016-06-25) the "ascii_only" variable has only been used once in compile_regexp(), let's just inline it there. This makes the code easier to read, and might make it marginally faster depending on compiler optimizations. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/grep.c b/grep.c index 1de4ab49c0..4e8d0645a8 100644 --- a/grep.c +++ b/grep.c @@ -650,13 +650,11 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) { - int ascii_only; int err; int regflags = REG_NEWLINE; p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; - ascii_only = !has_non_ascii(p->pattern); /* * Even when -F (fixed) asks us to do a non-regexp search, we @@ -673,7 +671,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) if (opt->fixed || has_null(p->pattern, p->patternlen) || is_fixed(p->pattern, p->patternlen)) - p->fixed = !p->ignore_case || ascii_only; + p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); if (p->fixed) { p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL); From patchwork Mon Jul 1 21:20:55 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026743 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5A2F913A4 for ; Mon, 1 Jul 2019 21:21:21 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4F1D12870C for ; Mon, 1 Jul 2019 21:21:21 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4369B28718; Mon, 1 Jul 2019 21:21:21 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E746C2870C for ; Mon, 1 Jul 2019 21:21:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726879AbfGAVVU (ORCPT ); Mon, 1 Jul 2019 17:21:20 -0400 Received: from mail-wr1-f66.google.com ([209.85.221.66]:42529 "EHLO mail-wr1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726586AbfGAVVS (ORCPT ); Mon, 1 Jul 2019 17:21:18 -0400 Received: by mail-wr1-f66.google.com with SMTP id x17so15331752wrl.9 for ; Mon, 01 Jul 2019 14:21:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=4ztwW1p0aQ2mCbzMuBeWwugeB5xsSD7QoczddYbD4PI=; b=sehieitpg5I7HYawtF34csBWIpyvxZJuWvZHlMqO8b7K2Ocp3Olo40xK3COxf76NsO xzT6EES0m1HzeycD2PSFvU++R1r0UavICX8ACooMZ1KBLQsQezcsU3DbkpfXiSwBzfgQ xFu8iaDZ9jBJ9+sbwqAJWSzhHI+Y/+UkCZGWjyw/kQnHV+t7TS73CXPf5H/39DfNfITE zJYgeWkaw41p2MW9xewPJjoH+I+cjp9X0/lPrgCrG2XbjmJ4vDhABZ8mpbxdTHXz9qQ4 3dQ/OumrRMW350tFk0J1vfRiRJasIj7B7BwJf0f9IVWh/DCLgSK0AZCDUPYkxQIjCKDN Bu6g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=4ztwW1p0aQ2mCbzMuBeWwugeB5xsSD7QoczddYbD4PI=; b=GfnmToHM1XBsQ921GYdE3a0SixGU7l+U25+3mhsw6aJAGO73osKGfGLDR2mTqvTRJf +fIH22s0RoZh2JHdf0INhkjhRw7HzYh9eEWRaIrxlz61XfcKDISL5U2dtbRz6OuoIcAR rDdMBn5rBto6W4TCU9HtuTVUiivaC7yEQCWOIVYUXNtp1BDaGQwR21vn0Ij2WptPZrEn Bwt9HITrGo/JwjXrsMFSetlu31dPmbhr+9DpDsm3GHncMjVm5OkI9ATcIi5EBSNdLBLL K1yy85jLBjV7KsdmYqXCKJZAvqahaWBxSJjqAKvHbvwaB2uv9HOFpYs6rS9h7gHMogbc beHg== X-Gm-Message-State: APjAAAWKxlWkWgckkS/vZF6CmEzwvWiojVYp9pzVMt0VxYqmI7ebVQ/E zgfnkT6FxElV89qPT6WLIYGlQ2NFeiE= X-Google-Smtp-Source: APXvYqyBL9QZazlvYYD6ylVXOVFGRblWGajQuHrC8oDqWx7uf/G9pvP8qP9+tTQsWWn84UHPgPQBPA== X-Received: by 2002:adf:e3cc:: with SMTP id k12mr1428811wrm.284.1562016076318; Mon, 01 Jul 2019 14:21:16 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.15 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:15 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 05/10] grep tests: move "grep binary" alongside the rest Date: Mon, 1 Jul 2019 23:20:55 +0200 Message-Id: <20190701212100.27850-6-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Move the "grep binary" test case added in aca20dd558 ("grep: add test script for binary file handling", 2010-05-22) so that it lives alongside the rest of the "grep" tests in t781*. This would have left a gap in the t/700* namespace, so move a "filter-branch" test down, leaving the "t7010-setup.sh" test as the next one after that. Signed-off-by: Ævar Arnfjörð Bjarmason --- ...ilter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} | 0 t/{t7008-grep-binary.sh => t7815-grep-binary.sh} | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename t/{t7009-filter-branch-null-sha1.sh => t7008-filter-branch-null-sha1.sh} (100%) rename t/{t7008-grep-binary.sh => t7815-grep-binary.sh} (100%) diff --git a/t/t7009-filter-branch-null-sha1.sh b/t/t7008-filter-branch-null-sha1.sh similarity index 100% rename from t/t7009-filter-branch-null-sha1.sh rename to t/t7008-filter-branch-null-sha1.sh diff --git a/t/t7008-grep-binary.sh b/t/t7815-grep-binary.sh similarity index 100% rename from t/t7008-grep-binary.sh rename to t/t7815-grep-binary.sh From patchwork Mon Jul 1 21:20:56 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026749 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CC35814F6 for ; Mon, 1 Jul 2019 21:21:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C0E402870F for ; Mon, 1 Jul 2019 21:21:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B569E28718; Mon, 1 Jul 2019 21:21:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 07EAB2870F for ; Mon, 1 Jul 2019 21:21:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726927AbfGAVVX (ORCPT ); Mon, 1 Jul 2019 17:21:23 -0400 Received: from mail-wm1-f44.google.com ([209.85.128.44]:54131 "EHLO mail-wm1-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726509AbfGAVVV (ORCPT ); Mon, 1 Jul 2019 17:21:21 -0400 Received: by mail-wm1-f44.google.com with SMTP id x15so885025wmj.3 for ; Mon, 01 Jul 2019 14:21:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=x+qhYDvPBakHIP3UC2hedMaJ5fEjJCaLZvim4xmFh3k=; b=NiX3ROCLi7BB8KOPDrTXG+SgZSV8LenCF6MfssEHagW2mfdb9CZj9okiDVTncFXEhY 3bYSH9GjYAFUtVzEGOemKXwXYV5Iaex9ip7XXUYEEcs0i/dq8jOnIKR+SdkoIGVqDGX7 p/zR6Qc6NKemozffR6te9MSLC4meLspJ55OU1At/AKkC2Dk60GCV2pDSAhl5puJb9r16 Np7t5x1w/i6Dq++/9LkYjyO0S2HhpQ4nW+Yiu31ECG4SHNkqwYz+xd0ylm2HcGTFy23h wmqhu042G5ItFDjo067+gDuZd+1JKxdb/Ub/HSJ3ZYRZvIpVk76+uba2RXg9BVJlXUDM 67Tg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=x+qhYDvPBakHIP3UC2hedMaJ5fEjJCaLZvim4xmFh3k=; b=B5tw0nC2lJ8IBwgZgv5skdccBdIBvDXceQ6NAB+0g2I5+Y2IBN9SKi3s3ulChFVSBN /xoRcUTLEaaRupz9sQWYqKfCJl32EksV3W1AfD5SuqEB5opwDesOnfzIlo782w0LGokT ghAFcAfgHdhYMkq3uu2Q3Eby8JFISttHxPXgf23bHTxIi0i/q1+9yGA+qRGBoV8lODt9 dV96wG0g5oMccyDimsHEGa6iEWom2sQFbv5Q5pBut0//jdxGqI+hoBjGhDJqPObLFMcY G8EWyRdVRQEn80Jl5DBbMNA//VkhWPRltfGQ80p7rF+89NekNt0nwXWGL00gh7tAKIPL fTBw== X-Gm-Message-State: APjAAAUQZ6LRqi/pA2VncKTY4bUplXhfHm24fUxBCrQaM5aQLiefwVsU +o24SKjwFEW+qkJjWj72fNXTF7PomrU= X-Google-Smtp-Source: APXvYqw8Xtl90q1qxtEteel37AzLZvPctsYSwBwx5KCfW350PxyfqeQFOZmnqYP94LCe2CCs4vK9Fg== X-Received: by 2002:a05:600c:2549:: with SMTP id e9mr729122wma.46.1562016077824; Mon, 01 Jul 2019 14:21:17 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.16 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:17 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 06/10] grep tests: move binary pattern tests into their own file Date: Mon, 1 Jul 2019 23:20:56 +0200 Message-Id: <20190701212100.27850-7-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Move the tests for "-f " where "" contains a NUL byte pattern into their own file. I added most of these tests in 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20). Whether a regex engine supports matching binary content is very different from whether it matches binary patterns. Since 2f8952250a ("regex: add regexec_buf() that can work on a non NUL-terminated string", 2016-09-21) we've required REG_STARTEND of our regex engines so we can match binary content, but only the PCRE v2 engine can sensibly match binary patterns. Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) we've been punting patterns containing NUL-byte and considering them fixed, except in cases where "--ignore-case" is provided and they're non-ASCII, see 5c1ebcca4d ("grep/icase: avoid kwsset on literal non-ascii strings", 2016-06-25). Subsequent commits will change this behavior. Signed-off-by: Ævar Arnfjörð Bjarmason --- t/t7815-grep-binary.sh | 101 ----------------------------- t/t7816-grep-binary-pattern.sh | 114 +++++++++++++++++++++++++++++++++ 2 files changed, 114 insertions(+), 101 deletions(-) create mode 100755 t/t7816-grep-binary-pattern.sh diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh index 2d87c49b75..90ebb64f46 100755 --- a/t/t7815-grep-binary.sh +++ b/t/t7815-grep-binary.sh @@ -4,41 +4,6 @@ test_description='git grep in binary files' . ./test-lib.sh -nul_match () { - matches=$1 - flags=$2 - pattern=$3 - pattern_human=$(echo "$pattern" | sed 's/Q//g') - - if test "$matches" = 1 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = 0 - then - test_expect_success "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - elif test "$matches" = T1 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = T0 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a - " - else - test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false' - fi -} - test_expect_success 'setup' " echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a && git add a && @@ -102,72 +67,6 @@ test_expect_failure 'git grep .fi a' ' git grep .fi a ' -nul_match 1 '-F' 'yQf' -nul_match 0 '-F' 'yQx' -nul_match 1 '-Fi' 'YQf' -nul_match 0 '-Fi' 'YQx' -nul_match 1 '' 'yQf' -nul_match 0 '' 'yQx' -nul_match 1 '' 'æQð' -nul_match 1 '-F' 'eQm[*]c' -nul_match 1 '-Fi' 'EQM[*]C' - -# Regex patterns that would match but shouldn't with -F -nul_match 0 '-F' 'yQ[f]' -nul_match 0 '-F' '[y]Qf' -nul_match 0 '-Fi' 'YQ[F]' -nul_match 0 '-Fi' '[Y]QF' -nul_match 0 '-F' 'æQ[ð]' -nul_match 0 '-F' '[æ]Qð' -nul_match 0 '-Fi' 'ÆQ[Ð]' -nul_match 0 '-Fi' '[Æ]QÐ' - -# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0 -# patterns case-insensitively. -nul_match T1 '-i' 'ÆQÐ' - -# \0 implicitly disables regexes. This is an undocumented internal -# limitation. -nul_match T1 '' 'yQ[f]' -nul_match T1 '' '[y]Qf' -nul_match T1 '-i' 'YQ[F]' -nul_match T1 '-i' '[Y]Qf' -nul_match T1 '' 'æQ[ð]' -nul_match T1 '' '[æ]Qð' -nul_match T1 '-i' 'ÆQ[Ð]' - -# ... because of \0 implicitly disabling regexes regexes that -# should/shouldn't match don't do the right thing. -nul_match T1 '' 'eQm.*cQ' -nul_match T1 '-i' 'EQM.*cQ' -nul_match T0 '' 'eQm[*]c' -nul_match T0 '-i' 'EQM[*]C' - -# Due to the REG_STARTEND extension when kwset() is disabled on -i & -# non-ASCII the string will be matched in its entirety, but the -# pattern will be cut off at the first \0. -nul_match 0 '-i' 'NOMATCHQð' -nul_match T0 '-i' '[Æ]QNOMATCH' -nul_match T0 '-i' '[æ]QNOMATCH' -# Matches, but for the wrong reasons, just stops at [æ] -nul_match 1 '-i' '[Æ]Qð' -nul_match 1 '-i' '[æ]Qð' - -# Ensure that the matcher doesn't regress to something that stops at -# \0 -nul_match 0 '-F' 'yQ[f]' -nul_match 0 '-Fi' 'YQ[F]' -nul_match 0 '' 'yQNOMATCH' -nul_match 0 '' 'QNOMATCH' -nul_match 0 '-i' 'YQNOMATCH' -nul_match 0 '-i' 'QNOMATCH' -nul_match 0 '-F' 'æQ[ð]' -nul_match 0 '-Fi' 'ÆQ[Ð]' -nul_match 0 '' 'yQNÓMATCH' -nul_match 0 '' 'QNÓMATCH' -nul_match 0 '-i' 'YQNÓMATCH' -nul_match 0 '-i' 'QNÓMATCH' - test_expect_success 'grep respects binary diff attribute' ' echo text >t && git add t && diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh new file mode 100755 index 0000000000..4060dbd679 --- /dev/null +++ b/t/t7816-grep-binary-pattern.sh @@ -0,0 +1,114 @@ +#!/bin/sh + +test_description='git grep with a binary pattern files' + +. ./test-lib.sh + +nul_match () { + matches=$1 + flags=$2 + pattern=$3 + pattern_human=$(echo "$pattern" | sed 's/Q//g') + + if test "$matches" = 1 + then + test_expect_success "git grep -f f $flags '$pattern_human' a" " + printf '$pattern' | q_to_nul >f && + git grep -f f $flags a + " + elif test "$matches" = 0 + then + test_expect_success "git grep -f f $flags '$pattern_human' a" " + printf '$pattern' | q_to_nul >f && + test_must_fail git grep -f f $flags a + " + elif test "$matches" = T1 + then + test_expect_failure "git grep -f f $flags '$pattern_human' a" " + printf '$pattern' | q_to_nul >f && + git grep -f f $flags a + " + elif test "$matches" = T0 + then + test_expect_failure "git grep -f f $flags '$pattern_human' a" " + printf '$pattern' | q_to_nul >f && + test_must_fail git grep -f f $flags a + " + else + test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false' + fi +} + +test_expect_success 'setup' " + echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a && + git add a && + git commit -m. +" + +nul_match 1 '-F' 'yQf' +nul_match 0 '-F' 'yQx' +nul_match 1 '-Fi' 'YQf' +nul_match 0 '-Fi' 'YQx' +nul_match 1 '' 'yQf' +nul_match 0 '' 'yQx' +nul_match 1 '' 'æQð' +nul_match 1 '-F' 'eQm[*]c' +nul_match 1 '-Fi' 'EQM[*]C' + +# Regex patterns that would match but shouldn't with -F +nul_match 0 '-F' 'yQ[f]' +nul_match 0 '-F' '[y]Qf' +nul_match 0 '-Fi' 'YQ[F]' +nul_match 0 '-Fi' '[Y]QF' +nul_match 0 '-F' 'æQ[ð]' +nul_match 0 '-F' '[æ]Qð' +nul_match 0 '-Fi' 'ÆQ[Ð]' +nul_match 0 '-Fi' '[Æ]QÐ' + +# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0 +# patterns case-insensitively. +nul_match T1 '-i' 'ÆQÐ' + +# \0 implicitly disables regexes. This is an undocumented internal +# limitation. +nul_match T1 '' 'yQ[f]' +nul_match T1 '' '[y]Qf' +nul_match T1 '-i' 'YQ[F]' +nul_match T1 '-i' '[Y]Qf' +nul_match T1 '' 'æQ[ð]' +nul_match T1 '' '[æ]Qð' +nul_match T1 '-i' 'ÆQ[Ð]' + +# ... because of \0 implicitly disabling regexes regexes that +# should/shouldn't match don't do the right thing. +nul_match T1 '' 'eQm.*cQ' +nul_match T1 '-i' 'EQM.*cQ' +nul_match T0 '' 'eQm[*]c' +nul_match T0 '-i' 'EQM[*]C' + +# Due to the REG_STARTEND extension when kwset() is disabled on -i & +# non-ASCII the string will be matched in its entirety, but the +# pattern will be cut off at the first \0. +nul_match 0 '-i' 'NOMATCHQð' +nul_match T0 '-i' '[Æ]QNOMATCH' +nul_match T0 '-i' '[æ]QNOMATCH' +# Matches, but for the wrong reasons, just stops at [æ] +nul_match 1 '-i' '[Æ]Qð' +nul_match 1 '-i' '[æ]Qð' + +# Ensure that the matcher doesn't regress to something that stops at +# \0 +nul_match 0 '-F' 'yQ[f]' +nul_match 0 '-Fi' 'YQ[F]' +nul_match 0 '' 'yQNOMATCH' +nul_match 0 '' 'QNOMATCH' +nul_match 0 '-i' 'YQNOMATCH' +nul_match 0 '-i' 'QNOMATCH' +nul_match 0 '-F' 'æQ[ð]' +nul_match 0 '-Fi' 'ÆQ[Ð]' +nul_match 0 '' 'yQNÓMATCH' +nul_match 0 '' 'QNÓMATCH' +nul_match 0 '-i' 'YQNÓMATCH' +nul_match 0 '-i' 'QNÓMATCH' + +test_done From patchwork Mon Jul 1 21:20:57 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026747 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9ACF11580 for ; Mon, 1 Jul 2019 21:21:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8ED212870F for ; Mon, 1 Jul 2019 21:21:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 828622871E; Mon, 1 Jul 2019 21:21:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9F1C928718 for ; Mon, 1 Jul 2019 21:21:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726934AbfGAVVX (ORCPT ); Mon, 1 Jul 2019 17:21:23 -0400 Received: from mail-wr1-f65.google.com ([209.85.221.65]:44076 "EHLO mail-wr1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726871AbfGAVVV (ORCPT ); Mon, 1 Jul 2019 17:21:21 -0400 Received: by mail-wr1-f65.google.com with SMTP id e3so5812157wrs.11 for ; Mon, 01 Jul 2019 14:21:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=uSzdJGlC+7CuGH0kChg3GSifKwVp2uSbk5Z+EiJb4m8=; b=K+mfe/pvttbHV/6KxcNM50MCxnY/KyGf2TwY+Z7Y18q8eNeJDCjcn2Avt2UZYeepkw 7lVQFAssOIWUYEw5Jo1qQy730ji7TYEmbl49NAo3AN9GhcTfPMzbOimCsosMa4J0+0Pp 4cRBExLi0hjTLxbD0k2Fo0mcxAsF4jPEaETKJ7hXzWr1aqNLlIOKe9Vjk0mg4CRwQ93G n0KRD0XnF1agKeg/Hr5cgVGEF69NZpGVPR36fkrkMq+kAuSbGhWlq8o8zEzfyny/404j b+LRlfdHMx+vX63QVnUGjZaj0Jel7NrfDTc4sU90MXMNpV+1/pC95k8Cys9z0441WKiy VMvA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=uSzdJGlC+7CuGH0kChg3GSifKwVp2uSbk5Z+EiJb4m8=; b=ZUZLV/9hnJksSh8qA2grXGCML9yalzl7EIZHhNP4uTiE+HUkjMcOpESuWNFJIoItnL SiWSQpZbFWM75taKQZlYSRnmPBrXrfr13wwiArvARMbdYJvpFe+NMDU3hdtE+dU2wcH6 92iAdtBEbjgmvizJU4a788emARlyGmy21DqguZrKUcGqVdHV+LmRZOHm9vJoZ6sOR9F2 60aZ/SEJiasmrUEdK3wQSFUeHrJUiLN/6ffTx90s7rlV8I4SbJzrRpsANys7YttXydmj +0l0i0g4mSKDhsPPeRqL4Pyvw4v6x2I4WbIKMbJjZ6VSG+xZJRZmC5HX/lj8QgK/w/9b 8tDQ== X-Gm-Message-State: APjAAAWNfq622H1DPNh8ntnVK8uHOjOsQMNvwXTHZrn3jG/LZrfXyXOM +IRIrLgT2hP+pN9H0zw88Vvwu6RFRXc= X-Google-Smtp-Source: APXvYqziu+TSPyomMTIV2ovEAZQpOB/NV94U2XnlPgmBgDpVFIk4Qqctr+wlX4uZue41oeZU5S4jUA== X-Received: by 2002:a5d:4849:: with SMTP id n9mr20091426wrs.139.1562016078979; Mon, 01 Jul 2019 14:21:18 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.17 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:18 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 07/10] grep: make the behavior for NUL-byte in patterns sane Date: Mon, 1 Jul 2019 23:20:57 +0200 Message-Id: <20190701212100.27850-8-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The behavior of "grep" when patterns contained a NUL-byte has always been haphazard, and has served the vagaries of the implementation more than anything else. A pattern containing a NUL-byte can only be provided via "-f ". Since pickaxe (log search) has no such flag the NUL-byte in patterns has only ever been supported by "grep" (and not "log --grep"). Since 9eceddeec6 ("Use kwset in grep", 2011-08-21) patterns containing "\0" were considered fixed. In 966be95549 ("grep: add tests to fix blind spots with \0 patterns", 2017-05-20) I added tests for this behavior. Change the behavior to do the obvious thing, i.e. don't silently discard a regex pattern and make it implicitly fixed just because they contain a NUL-byte. Instead die if the backend in question can't handle them, e.g. --basic-regexp is combined with such a pattern. This is desired because from a user's point of view it's the obvious thing to do. Whether we support BRE/ERE/Perl syntax is different from whether our implementation is limited by C-strings. These patterns are obscure enough that I think this behavior change is OK, especially since we never documented the old behavior. Doing this also makes it easier to replace the kwset backend with something else, since we'll no longer strictly need it for anything we can't easily use another fixed-string backend for. Signed-off-by: Ævar Arnfjörð Bjarmason --- Documentation/git-grep.txt | 17 ++++ grep.c | 23 ++--- t/t7816-grep-binary-pattern.sh | 159 ++++++++++++++++++--------------- 3 files changed, 110 insertions(+), 89 deletions(-) diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt index 2d27969057..c89fb569e3 100644 --- a/Documentation/git-grep.txt +++ b/Documentation/git-grep.txt @@ -271,6 +271,23 @@ providing this option will cause it to die. -f :: Read patterns from , one per line. ++ +Passing the pattern via allows for providing a search pattern +containing a \0. ++ +Not all pattern types support patterns containing \0. Git will error +out if a given pattern type can't support such a pattern. The +`--perl-regexp` pattern type when compiled against the PCRE v2 backend +has the widest support for these types of patterns. ++ +In versions of Git before 2.23.0 patterns containing \0 would be +silently considered fixed. This was never documented, there were also +odd and undocumented interactions between e.g. non-ASCII patterns +containing \0 and `--ignore-case`. ++ +In future versions we may learn to support patterns containing \0 for +more search backends, until then we'll die when the pattern type in +question doesn't support them. -e:: The next parameter is the pattern. This option has to be diff --git a/grep.c b/grep.c index 4e8d0645a8..d6603bc950 100644 --- a/grep.c +++ b/grep.c @@ -368,18 +368,6 @@ static int is_fixed(const char *s, size_t len) return 1; } -static int has_null(const char *s, size_t len) -{ - /* - * regcomp cannot accept patterns with NULs so when using it - * we consider any pattern containing a NUL fixed. - */ - if (memchr(s, 0, len)) - return 1; - - return 0; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -668,9 +656,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) * simple string match using kws. p->fixed tells us if we * want to use kws. */ - if (opt->fixed || - has_null(p->pattern, p->patternlen) || - is_fixed(p->pattern, p->patternlen)) + if (opt->fixed || is_fixed(p->pattern, p->patternlen)) p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); if (p->fixed) { @@ -678,7 +664,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) kwsincr(p->kws, p->pattern, p->patternlen); kwsprep(p->kws); return; - } else if (opt->fixed) { + } + + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); + + if (opt->fixed) { /* * We come here when the pattern has the non-ascii * characters we cannot case-fold, and asked to diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 4060dbd679..9e09bd5d6a 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -2,113 +2,126 @@ test_description='git grep with a binary pattern files' -. ./test-lib.sh +. ./lib-gettext.sh -nul_match () { +nul_match_internal () { matches=$1 - flags=$2 - pattern=$3 + prereqs=$2 + lc_all=$3 + extra_flags=$4 + flags=$5 + pattern=$6 pattern_human=$(echo "$pattern" | sed 's/Q//g') if test "$matches" = 1 then - test_expect_success "git grep -f f $flags '$pattern_human' a" " + test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" " printf '$pattern' | q_to_nul >f && - git grep -f f $flags a + LC_ALL='$lc_all' git grep $extra_flags -f f $flags a " elif test "$matches" = 0 then - test_expect_success "git grep -f f $flags '$pattern_human' a" " + test_expect_success $prereqs "LC_ALL='$lc_all' git grep $extra_flags -f f $flags '$pattern_human' a" " + >stderr && printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a + test_must_fail env LC_ALL=\"$lc_all\" git grep $extra_flags -f f $flags a 2>stderr && + test_i18ngrep ! 'This is only supported with -P under PCRE v2' stderr " - elif test "$matches" = T1 + elif test "$matches" = P then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " + test_expect_success $prereqs "error, PCRE v2 only: LC_ALL='$lc_all' git grep -f f $flags '$pattern_human' a" " + >stderr && printf '$pattern' | q_to_nul >f && - git grep -f f $flags a - " - elif test "$matches" = T0 - then - test_expect_failure "git grep -f f $flags '$pattern_human' a" " - printf '$pattern' | q_to_nul >f && - test_must_fail git grep -f f $flags a + test_must_fail env LC_ALL=\"$lc_all\" git grep -f f $flags a 2>stderr && + test_i18ngrep 'This is only supported with -P under PCRE v2' stderr " else test_expect_success "PANIC: Test framework error. Unknown matches value $matches" 'false' fi } +nul_match () { + matches=$1 + matches_pcre2=$2 + matches_pcre2_locale=$3 + flags=$4 + pattern=$5 + pattern_human=$(echo "$pattern" | sed 's/Q//g') + + nul_match_internal "$matches" "" "C" "" "$flags" "$pattern" + nul_match_internal "$matches_pcre2" "LIBPCRE2" "C" "-P" "$flags" "$pattern" + nul_match_internal "$matches_pcre2_locale" "LIBPCRE2,GETTEXT_LOCALE" "$is_IS_locale" "-P" "$flags" "$pattern" +} + test_expect_success 'setup' " echo 'binaryQfileQm[*]cQ*æQð' | q_to_nul >a && git add a && git commit -m. " -nul_match 1 '-F' 'yQf' -nul_match 0 '-F' 'yQx' -nul_match 1 '-Fi' 'YQf' -nul_match 0 '-Fi' 'YQx' -nul_match 1 '' 'yQf' -nul_match 0 '' 'yQx' -nul_match 1 '' 'æQð' -nul_match 1 '-F' 'eQm[*]c' -nul_match 1 '-Fi' 'EQM[*]C' +# Simple fixed-string matching that can use kwset (no -i && non-ASCII) +nul_match 1 1 1 '-F' 'yQf' +nul_match 0 0 0 '-F' 'yQx' +nul_match 1 1 1 '-Fi' 'YQf' +nul_match 0 0 0 '-Fi' 'YQx' +nul_match 1 1 1 '' 'yQf' +nul_match 0 0 0 '' 'yQx' +nul_match 1 1 1 '' 'æQð' +nul_match 1 1 1 '-F' 'eQm[*]c' +nul_match 1 1 1 '-Fi' 'EQM[*]C' # Regex patterns that would match but shouldn't with -F -nul_match 0 '-F' 'yQ[f]' -nul_match 0 '-F' '[y]Qf' -nul_match 0 '-Fi' 'YQ[F]' -nul_match 0 '-Fi' '[Y]QF' -nul_match 0 '-F' 'æQ[ð]' -nul_match 0 '-F' '[æ]Qð' -nul_match 0 '-Fi' 'ÆQ[Ð]' -nul_match 0 '-Fi' '[Æ]QÐ' +nul_match 0 0 0 '-F' 'yQ[f]' +nul_match 0 0 0 '-F' '[y]Qf' +nul_match 0 0 0 '-Fi' 'YQ[F]' +nul_match 0 0 0 '-Fi' '[Y]QF' +nul_match 0 0 0 '-F' 'æQ[ð]' +nul_match 0 0 0 '-F' '[æ]Qð' -# kwset is disabled on -i & non-ASCII. No way to match non-ASCII \0 -# patterns case-insensitively. -nul_match T1 '-i' 'ÆQÐ' +# The -F kwset codepath can't handle -i && non-ASCII... +nul_match P 1 1 '-i' '[æ]Qð' -# \0 implicitly disables regexes. This is an undocumented internal -# limitation. -nul_match T1 '' 'yQ[f]' -nul_match T1 '' '[y]Qf' -nul_match T1 '-i' 'YQ[F]' -nul_match T1 '-i' '[Y]Qf' -nul_match T1 '' 'æQ[ð]' -nul_match T1 '' '[æ]Qð' -nul_match T1 '-i' 'ÆQ[Ð]' +# ...PCRE v2 only matches non-ASCII with -i casefolding under UTF-8 +# semantics +nul_match P P P '-Fi' 'ÆQ[Ð]' +nul_match P 0 1 '-i' 'ÆQ[Ð]' +nul_match P 0 1 '-i' '[Æ]QÐ' +nul_match P 0 1 '-i' '[Æ]Qð' +nul_match P 0 1 '-i' 'ÆQÐ' -# ... because of \0 implicitly disabling regexes regexes that -# should/shouldn't match don't do the right thing. -nul_match T1 '' 'eQm.*cQ' -nul_match T1 '-i' 'EQM.*cQ' -nul_match T0 '' 'eQm[*]c' -nul_match T0 '-i' 'EQM[*]C' +# \0 in regexes can only work with -P & PCRE v2 +nul_match P 1 1 '' 'yQ[f]' +nul_match P 1 1 '' '[y]Qf' +nul_match P 1 1 '-i' 'YQ[F]' +nul_match P 1 1 '-i' '[Y]Qf' +nul_match P 1 1 '' 'æQ[ð]' +nul_match P 1 1 '' '[æ]Qð' +nul_match P 0 1 '-i' 'ÆQ[Ð]' +nul_match P 1 1 '' 'eQm.*cQ' +nul_match P 1 1 '-i' 'EQM.*cQ' +nul_match P 0 0 '' 'eQm[*]c' +nul_match P 0 0 '-i' 'EQM[*]C' -# Due to the REG_STARTEND extension when kwset() is disabled on -i & -# non-ASCII the string will be matched in its entirety, but the -# pattern will be cut off at the first \0. -nul_match 0 '-i' 'NOMATCHQð' -nul_match T0 '-i' '[Æ]QNOMATCH' -nul_match T0 '-i' '[æ]QNOMATCH' -# Matches, but for the wrong reasons, just stops at [æ] -nul_match 1 '-i' '[Æ]Qð' -nul_match 1 '-i' '[æ]Qð' +# Assert that we're using REG_STARTEND and the pattern doesn't match +# just because it's cut off at the first \0. +nul_match 0 0 0 '-i' 'NOMATCHQð' +nul_match P 0 0 '-i' '[Æ]QNOMATCH' +nul_match P 0 0 '-i' '[æ]QNOMATCH' # Ensure that the matcher doesn't regress to something that stops at # \0 -nul_match 0 '-F' 'yQ[f]' -nul_match 0 '-Fi' 'YQ[F]' -nul_match 0 '' 'yQNOMATCH' -nul_match 0 '' 'QNOMATCH' -nul_match 0 '-i' 'YQNOMATCH' -nul_match 0 '-i' 'QNOMATCH' -nul_match 0 '-F' 'æQ[ð]' -nul_match 0 '-Fi' 'ÆQ[Ð]' -nul_match 0 '' 'yQNÓMATCH' -nul_match 0 '' 'QNÓMATCH' -nul_match 0 '-i' 'YQNÓMATCH' -nul_match 0 '-i' 'QNÓMATCH' +nul_match 0 0 0 '-F' 'yQ[f]' +nul_match 0 0 0 '-Fi' 'YQ[F]' +nul_match 0 0 0 '' 'yQNOMATCH' +nul_match 0 0 0 '' 'QNOMATCH' +nul_match 0 0 0 '-i' 'YQNOMATCH' +nul_match 0 0 0 '-i' 'QNOMATCH' +nul_match 0 0 0 '-F' 'æQ[ð]' +nul_match P P P '-Fi' 'ÆQ[Ð]' +nul_match P 0 1 '-i' 'ÆQ[Ð]' +nul_match 0 0 0 '' 'yQNÓMATCH' +nul_match 0 0 0 '' 'QNÓMATCH' +nul_match 0 0 0 '-i' 'YQNÓMATCH' +nul_match 0 0 0 '-i' 'QNÓMATCH' test_done From patchwork Mon Jul 1 21:20:58 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026745 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1532813A4 for ; Mon, 1 Jul 2019 21:21:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 095542870F for ; Mon, 1 Jul 2019 21:21:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F114E28726; Mon, 1 Jul 2019 21:21:24 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 656112870F for ; Mon, 1 Jul 2019 21:21:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726939AbfGAVVX (ORCPT ); Mon, 1 Jul 2019 17:21:23 -0400 Received: from mail-wm1-f65.google.com ([209.85.128.65]:33654 "EHLO mail-wm1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726586AbfGAVVW (ORCPT ); Mon, 1 Jul 2019 17:21:22 -0400 Received: by mail-wm1-f65.google.com with SMTP id h19so926442wme.0 for ; Mon, 01 Jul 2019 14:21:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=/8/hF+vDfQwR9XYatXRcM5x39X44gHxyP2myd+s9iBw=; b=f/d2VOxYET6Kfh31iEcoyeOdR3IcZGLJY6ueWX0Zjs1jdj1ARPRjXGp8akxTlDoKaZ EfshztyLSEXKSOaIFgU/0MrEn9Az7zK1M0TSUwJzEaJEiFxEbok+lDoE3tHRonL02XF+ XlWB/9T2Jng3bQrWuJ+ZPbS2aXfZ2+eX86Xfkx9ZiZL+UTLX4NN0NaHtwx5g99/HYGZj JWDbXYLvXpMa1wn1GB69u4pXDwdd5u53ul4Qo8FQnAWmiHAntXiV4ik485v0rQp40/Go YVcg8/Y/lxgc/MNPv5+NiRkVc4pc7/Z2vHp6UzAIbd5JH6n9dNHAFjToFbY+U9rtlVdl iK+w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=/8/hF+vDfQwR9XYatXRcM5x39X44gHxyP2myd+s9iBw=; b=KiEuoI3yqNpVX175D7kOnsG6v1mx9gii7rWOuEjztqWoJz5wpulzt7vws+sswY/IMI j0MAqUJ8uA2BTbNZE/m6W4ByIE4aX1OpOn6fBR2LSq9Z1OHGz+W0lW0AB/7zdvd5PePd 1boRIY2ffWzFd8AzkYuyn3gbby21wS/jrlnSbIwhFckU/InEIpz3CTvASa6i83iuaWLw N+O1BtTDXF8jF8QliOiFqSz+s3QwQ6zHBehiqY5rAwhtwzf6IjC3W2g4WOBp/+l/A0BR GlPdRq5ovhsZCGFbwShqjXlweX41a9rmPtlNMoM2A6UmJk0FoMLgl/FsuQRIe1zsee7s pyuA== X-Gm-Message-State: APjAAAWf+CsOE7VfuijwqZt9VXBAEtp7o30gHI9Ws2RiPnJtS1xu7anA B+jVwruqenIu2nqfsHZeWeBbvQNxF3c= X-Google-Smtp-Source: APXvYqzk+laz0I/HVV76MGHAj3eKy4x2xhavjJXiMhuPKSuuCExeClw/YJ9cgUJyDe9duAJ9a6LFVA== X-Received: by 2002:a7b:cd84:: with SMTP id y4mr660525wmj.79.1562016080194; Mon, 01 Jul 2019 14:21:20 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.19 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:19 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 08/10] grep: drop support for \0 in --fixed-strings Date: Mon, 1 Jul 2019 23:20:58 +0200 Message-Id: <20190701212100.27850-9-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Change "-f " to not support patterns with a NUL-byte in them under --fixed-strings. We'll now only support these under "--perl-regexp" with PCRE v2. A previous change to grep's documentation changed the description of "-f " to be vague enough as to not promise that this would work. By dropping support for this we make it a whole lot easier to move away from the kwset backend, which we'll do in a subsequent change. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 6 +-- t/t7816-grep-binary-pattern.sh | 82 +++++++++++++++++----------------- 2 files changed, 44 insertions(+), 44 deletions(-) diff --git a/grep.c b/grep.c index d6603bc950..8d0fff316c 100644 --- a/grep.c +++ b/grep.c @@ -644,6 +644,9 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) + die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); + /* * Even when -F (fixed) asks us to do a non-regexp search, we * may not be able to correctly case-fold when -i @@ -666,9 +669,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) return; } - if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) - die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); - if (opt->fixed) { /* * We come here when the pattern has the non-ascii diff --git a/t/t7816-grep-binary-pattern.sh b/t/t7816-grep-binary-pattern.sh index 9e09bd5d6a..60bab291e4 100755 --- a/t/t7816-grep-binary-pattern.sh +++ b/t/t7816-grep-binary-pattern.sh @@ -60,23 +60,23 @@ test_expect_success 'setup' " " # Simple fixed-string matching that can use kwset (no -i && non-ASCII) -nul_match 1 1 1 '-F' 'yQf' -nul_match 0 0 0 '-F' 'yQx' -nul_match 1 1 1 '-Fi' 'YQf' -nul_match 0 0 0 '-Fi' 'YQx' -nul_match 1 1 1 '' 'yQf' -nul_match 0 0 0 '' 'yQx' -nul_match 1 1 1 '' 'æQð' -nul_match 1 1 1 '-F' 'eQm[*]c' -nul_match 1 1 1 '-Fi' 'EQM[*]C' +nul_match P P P '-F' 'yQf' +nul_match P P P '-F' 'yQx' +nul_match P P P '-Fi' 'YQf' +nul_match P P P '-Fi' 'YQx' +nul_match P P 1 '' 'yQf' +nul_match P P 0 '' 'yQx' +nul_match P P 1 '' 'æQð' +nul_match P P P '-F' 'eQm[*]c' +nul_match P P P '-Fi' 'EQM[*]C' # Regex patterns that would match but shouldn't with -F -nul_match 0 0 0 '-F' 'yQ[f]' -nul_match 0 0 0 '-F' '[y]Qf' -nul_match 0 0 0 '-Fi' 'YQ[F]' -nul_match 0 0 0 '-Fi' '[Y]QF' -nul_match 0 0 0 '-F' 'æQ[ð]' -nul_match 0 0 0 '-F' '[æ]Qð' +nul_match P P P '-F' 'yQ[f]' +nul_match P P P '-F' '[y]Qf' +nul_match P P P '-Fi' 'YQ[F]' +nul_match P P P '-Fi' '[Y]QF' +nul_match P P P '-F' 'æQ[ð]' +nul_match P P P '-F' '[æ]Qð' # The -F kwset codepath can't handle -i && non-ASCII... nul_match P 1 1 '-i' '[æ]Qð' @@ -90,38 +90,38 @@ nul_match P 0 1 '-i' '[Æ]Qð' nul_match P 0 1 '-i' 'ÆQÐ' # \0 in regexes can only work with -P & PCRE v2 -nul_match P 1 1 '' 'yQ[f]' -nul_match P 1 1 '' '[y]Qf' -nul_match P 1 1 '-i' 'YQ[F]' -nul_match P 1 1 '-i' '[Y]Qf' -nul_match P 1 1 '' 'æQ[ð]' -nul_match P 1 1 '' '[æ]Qð' -nul_match P 0 1 '-i' 'ÆQ[Ð]' -nul_match P 1 1 '' 'eQm.*cQ' -nul_match P 1 1 '-i' 'EQM.*cQ' -nul_match P 0 0 '' 'eQm[*]c' -nul_match P 0 0 '-i' 'EQM[*]C' +nul_match P P 1 '' 'yQ[f]' +nul_match P P 1 '' '[y]Qf' +nul_match P P 1 '-i' 'YQ[F]' +nul_match P P 1 '-i' '[Y]Qf' +nul_match P P 1 '' 'æQ[ð]' +nul_match P P 1 '' '[æ]Qð' +nul_match P P 1 '-i' 'ÆQ[Ð]' +nul_match P P 1 '' 'eQm.*cQ' +nul_match P P 1 '-i' 'EQM.*cQ' +nul_match P P 0 '' 'eQm[*]c' +nul_match P P 0 '-i' 'EQM[*]C' # Assert that we're using REG_STARTEND and the pattern doesn't match # just because it's cut off at the first \0. -nul_match 0 0 0 '-i' 'NOMATCHQð' -nul_match P 0 0 '-i' '[Æ]QNOMATCH' -nul_match P 0 0 '-i' '[æ]QNOMATCH' +nul_match P P 0 '-i' 'NOMATCHQð' +nul_match P P 0 '-i' '[Æ]QNOMATCH' +nul_match P P 0 '-i' '[æ]QNOMATCH' # Ensure that the matcher doesn't regress to something that stops at # \0 -nul_match 0 0 0 '-F' 'yQ[f]' -nul_match 0 0 0 '-Fi' 'YQ[F]' -nul_match 0 0 0 '' 'yQNOMATCH' -nul_match 0 0 0 '' 'QNOMATCH' -nul_match 0 0 0 '-i' 'YQNOMATCH' -nul_match 0 0 0 '-i' 'QNOMATCH' -nul_match 0 0 0 '-F' 'æQ[ð]' +nul_match P P P '-F' 'yQ[f]' +nul_match P P P '-Fi' 'YQ[F]' +nul_match P P 0 '' 'yQNOMATCH' +nul_match P P 0 '' 'QNOMATCH' +nul_match P P 0 '-i' 'YQNOMATCH' +nul_match P P 0 '-i' 'QNOMATCH' +nul_match P P P '-F' 'æQ[ð]' nul_match P P P '-Fi' 'ÆQ[Ð]' -nul_match P 0 1 '-i' 'ÆQ[Ð]' -nul_match 0 0 0 '' 'yQNÓMATCH' -nul_match 0 0 0 '' 'QNÓMATCH' -nul_match 0 0 0 '-i' 'YQNÓMATCH' -nul_match 0 0 0 '-i' 'QNÓMATCH' +nul_match P P 1 '-i' 'ÆQ[Ð]' +nul_match P P 0 '' 'yQNÓMATCH' +nul_match P P 0 '' 'QNÓMATCH' +nul_match P P 0 '-i' 'YQNÓMATCH' +nul_match P P 0 '-i' 'QNÓMATCH' test_done From patchwork Mon Jul 1 21:20:59 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026751 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AAF3E14F6 for ; Mon, 1 Jul 2019 21:21:27 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9FEFF2870C for ; Mon, 1 Jul 2019 21:21:27 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 942562871E; Mon, 1 Jul 2019 21:21:27 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B8ACD2870C for ; Mon, 1 Jul 2019 21:21:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726958AbfGAVVZ (ORCPT ); Mon, 1 Jul 2019 17:21:25 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:52562 "EHLO mail-wm1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726895AbfGAVVY (ORCPT ); Mon, 1 Jul 2019 17:21:24 -0400 Received: by mail-wm1-f66.google.com with SMTP id s3so892735wms.2 for ; Mon, 01 Jul 2019 14:21:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=DBT2tIPVN7eZKTg9IYDlb3i5h0eYIWulMAz9iB0EOgk=; b=mhjajJa9+qoa0m29HLavyNdhwOQOKQkXi1/tiJED9GUkqDSevXDMMQGoPzxLJ7i6WP g3oqRLio1ajs00wWI4o2oKkwu1C4ZPSE1sPS2yfROIislNJUvIFut6df5V1xf3sx5Mtd U7PzPWNnC9cSkq0bQMUYTxodyk/ly5/eHkSjPUpWwvdcSbJ/XEuQU0q5aPMgb6Mgp+Zg 42nP70hoI9FuyeJjDBYQ4wxZC8nMIze7tBPYYozXXy3LI7E4B4Sbv8LPRAYNqQliusVN RDpFyBmv/acpUSLiuQzpj/yzteD0tvhez1O47U0CZVZ2bewg/X8HHRTk9nBC4F6IVb9z YluA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=DBT2tIPVN7eZKTg9IYDlb3i5h0eYIWulMAz9iB0EOgk=; b=fKjP9fxEkt6IRBgtiXoTM5H4e8qhFSsHhEtxWXgl6TwyKodyevnHXVjoMFzayTYLc4 Bl+DVniMeeeM52ZQ07LbsbJ5ipY9IdoLDY2FCVh8+KFjJTJ5p7UZQh4EzWrDdSp+igfJ iR/W3qTQ7NIsT2dCHjTUieJZ9q9kxi22C3naXqAMHjTnd9wwNPXTVmXPdKL9x5qdOuCL bY300ram0Xg4F5mSQg8tzWLQdZiFXoRfNhwWfZ3ba2z2sJOjpZs4iVd2aiRnoEOf8QPt RJBnH/qXznfwRKFDRyZpzx7F0Scg2r+KGs82ALll0AsHsZhhTpkB/qVqP9i+8UN2t1pr lyGA== X-Gm-Message-State: APjAAAXjFPxU08hrFrgvVWWDoYSmtzHxiHUO4dRQywyu71Jth1UL01yz 5b5V11BhKv/r4KoTPdc2XXwbeCb2eEc= X-Google-Smtp-Source: APXvYqwmd7vxyj906G/swhhyHRR5HFs0B25Co8GcmM0psTX3PKj1Vn3p1fUmtoDAmMir/qeRk3ILLQ== X-Received: by 2002:a1c:eb16:: with SMTP id j22mr655878wmh.140.1562016081288; Mon, 01 Jul 2019 14:21:21 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.20 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:20 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 09/10] grep: remove the kwset optimization Date: Mon, 1 Jul 2019 23:20:59 +0200 Message-Id: <20190701212100.27850-10-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP A later change will replace this optimization with optimistic use of PCRE v2. I'm completely removing it as an intermediate step, as opposed to replacing it with PCRE v2, to demonstrate that no grep semantics depend on this (or any other) optimization for the fixed backend anymore. For now this is mostly (but not entirely) a performance regression, as shown by this hacky one-liner: for opt in '' ' -i' do GIT_PERF_7821_GREP_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p7821-grep-engines-fixed.sh done && for opt in '' ' -i' do GIT_PERF_4221_LOG_OPTS=$opt GIT_PERF_REPEAT_COUNT=10 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3 USE_LIBPCRE=YesPlease' ./run origin/master HEAD -- p4221-log-grep-engines-fixed.sh done Which produces: plain grep: Test origin/master HEAD ------------------------------------------------------------------------- 7821.1: fixed grep int 0.55(1.60+0.63) 0.82(3.11+0.51) +49.1% 7821.2: basic grep int 0.62(1.68+0.49) 0.85(3.02+0.52) +37.1% 7821.3: extended grep int 0.61(1.63+0.53) 0.91(3.09+0.44) +49.2% 7821.4: perl grep int 0.55(1.60+0.57) 0.41(0.93+0.57) -25.5% 7821.6: fixed grep uncommon 0.20(0.50+0.44) 0.35(1.27+0.42) +75.0% 7821.7: basic grep uncommon 0.20(0.49+0.45) 0.35(1.29+0.41) +75.0% 7821.8: extended grep uncommon 0.20(0.45+0.48) 0.35(1.25+0.44) +75.0% 7821.9: perl grep uncommon 0.20(0.53+0.41) 0.16(0.24+0.49) -20.0% 7821.11: fixed grep æ 0.35(1.27+0.40) 0.25(0.82+0.39) -28.6% 7821.12: basic grep æ 0.35(1.28+0.38) 0.25(0.75+0.44) -28.6% 7821.13: extended grep æ 0.36(1.21+0.46) 0.25(0.86+0.35) -30.6% 7821.14: perl grep æ 0.35(1.33+0.34) 0.16(0.26+0.47) -54.3% grep with -i: Test origin/master HEAD ----------------------------------------------------------------------------- 7821.1: fixed grep -i int 0.61(1.84+0.64) 1.11(4.12+0.64) +82.0% 7821.2: basic grep -i int 0.72(1.86+0.57) 1.15(4.48+0.49) +59.7% 7821.3: extended grep -i int 0.94(1.83+0.60) 1.53(4.12+0.58) +62.8% 7821.4: perl grep -i int 0.66(1.82+0.59) 0.55(1.08+0.58) -16.7% 7821.6: fixed grep -i uncommon 0.21(0.51+0.44) 0.44(1.74+0.34) +109.5% 7821.7: basic grep -i uncommon 0.21(0.55+0.41) 0.44(1.72+0.40) +109.5% 7821.8: extended grep -i uncommon 0.21(0.57+0.39) 0.42(1.64+0.45) +100.0% 7821.9: perl grep -i uncommon 0.21(0.48+0.48) 0.17(0.30+0.45) -19.0% 7821.11: fixed grep -i æ 0.25(0.73+0.45) 0.25(0.75+0.45) +0.0% 7821.12: basic grep -i æ 0.25(0.71+0.49) 0.26(0.77+0.44) +4.0% 7821.13: extended grep -i æ 0.25(0.75+0.44) 0.25(0.74+0.46) +0.0% 7821.14: perl grep -i æ 0.17(0.26+0.48) 0.16(0.20+0.52) -5.9% plain log: Test origin/master HEAD --------------------------------------------------------------------------------- 4221.1: fixed log --grep='int' 7.31(7.06+0.21) 8.11(7.85+0.20) +10.9% 4221.2: basic log --grep='int' 7.30(6.94+0.27) 8.16(7.89+0.19) +11.8% 4221.3: extended log --grep='int' 7.34(7.05+0.21) 8.08(7.76+0.25) +10.1% 4221.4: perl log --grep='int' 7.27(6.94+0.24) 7.05(6.76+0.25) -3.0% 4221.6: fixed log --grep='uncommon' 6.97(6.62+0.32) 7.86(7.51+0.30) +12.8% 4221.7: basic log --grep='uncommon' 7.05(6.69+0.29) 7.89(7.60+0.28) +11.9% 4221.8: extended log --grep='uncommon' 6.89(6.56+0.32) 7.99(7.66+0.24) +16.0% 4221.9: perl log --grep='uncommon' 7.02(6.66+0.33) 6.97(6.54+0.36) -0.7% 4221.11: fixed log --grep='æ' 7.37(7.03+0.33) 7.67(7.30+0.31) +4.1% 4221.12: basic log --grep='æ' 7.41(7.00+0.31) 7.60(7.28+0.26) +2.6% 4221.13: extended log --grep='æ' 7.35(6.96+0.38) 7.73(7.31+0.34) +5.2% 4221.14: perl log --grep='æ' 7.43(7.10+0.32) 6.95(6.61+0.27) -6.5% log with -i: Test origin/master HEAD ------------------------------------------------------------------------------------ 4221.1: fixed log -i --grep='int' 7.40(7.05+0.23) 8.66(8.38+0.20) +17.0% 4221.2: basic log -i --grep='int' 7.39(7.09+0.23) 8.67(8.39+0.20) +17.3% 4221.3: extended log -i --grep='int' 7.29(6.99+0.26) 8.69(8.31+0.26) +19.2% 4221.4: perl log -i --grep='int' 7.42(7.16+0.21) 7.14(6.80+0.24) -3.8% 4221.6: fixed log -i --grep='uncommon' 6.94(6.58+0.35) 8.43(8.04+0.30) +21.5% 4221.7: basic log -i --grep='uncommon' 6.95(6.62+0.31) 8.34(7.93+0.32) +20.0% 4221.8: extended log -i --grep='uncommon' 7.06(6.75+0.25) 8.32(7.98+0.31) +17.8% 4221.9: perl log -i --grep='uncommon' 6.96(6.69+0.26) 7.04(6.64+0.32) +1.1% 4221.11: fixed log -i --grep='æ' 7.92(7.55+0.33) 7.86(7.44+0.34) -0.8% 4221.12: basic log -i --grep='æ' 7.88(7.49+0.32) 7.84(7.46+0.34) -0.5% 4221.13: extended log -i --grep='æ' 7.91(7.51+0.32) 7.87(7.48+0.32) -0.5% 4221.14: perl log -i --grep='æ' 7.01(6.59+0.35) 6.99(6.64+0.28) -0.3% Some of those, as noted in [1] are because PCRE is faster at finding fixed strings. This looks bad for some engines, but in the next change we'll optimistically use PCRE v2 for all of these, so it'll look better. 1. https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 63 +++------------------------------------------------------- grep.h | 2 -- 2 files changed, 3 insertions(+), 62 deletions(-) diff --git a/grep.c b/grep.c index 8d0fff316c..4468519d5c 100644 --- a/grep.c +++ b/grep.c @@ -356,18 +356,6 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p, die("%s'%s': %s", where, p->pattern, error); } -static int is_fixed(const char *s, size_t len) -{ - size_t i; - - for (i = 0; i < len; i++) { - if (is_regex_special(s[i])) - return 0; - } - - return 1; -} - #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -643,38 +631,12 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; + p->fixed = opt->fixed; if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); - /* - * Even when -F (fixed) asks us to do a non-regexp search, we - * may not be able to correctly case-fold when -i - * (ignore-case) is asked (in which case, we'll synthesize a - * regexp to match the pattern that matches regexp special - * characters literally, while ignoring case differences). On - * the other hand, even without -F, if the pattern does not - * have any regexp special characters and there is no need for - * case-folding search, we can internally turn it into a - * simple string match using kws. p->fixed tells us if we - * want to use kws. - */ - if (opt->fixed || is_fixed(p->pattern, p->patternlen)) - p->fixed = !p->ignore_case || !has_non_ascii(p->pattern); - - if (p->fixed) { - p->kws = kwsalloc(p->ignore_case ? tolower_trans_tbl : NULL); - kwsincr(p->kws, p->pattern, p->patternlen); - kwsprep(p->kws); - return; - } - if (opt->fixed) { - /* - * We come here when the pattern has the non-ascii - * characters we cannot case-fold, and asked to - * ignore-case. - */ compile_fixed_regexp(p, opt); return; } @@ -1042,9 +1004,7 @@ void free_grep_patterns(struct grep_opt *opt) case GREP_PATTERN: /* atom */ case GREP_PATTERN_HEAD: case GREP_PATTERN_BODY: - if (p->kws) - kwsfree(p->kws); - else if (p->pcre1_regexp) + if (p->pcre1_regexp) free_pcre1_regexp(p); else if (p->pcre2_pattern) free_pcre2_pattern(p); @@ -1104,29 +1064,12 @@ static void show_name(struct grep_opt *opt, const char *name) opt->output(opt, opt->null_following_name ? "\0" : "\n", 1); } -static int fixmatch(struct grep_pat *p, char *line, char *eol, - regmatch_t *match) -{ - struct kwsmatch kwsm; - size_t offset = kwsexec(p->kws, line, eol - line, &kwsm); - if (offset == -1) { - match->rm_so = match->rm_eo = -1; - return REG_NOMATCH; - } else { - match->rm_so = offset; - match->rm_eo = match->rm_so + kwsm.size[0]; - return 0; - } -} - static int patmatch(struct grep_pat *p, char *line, char *eol, regmatch_t *match, int eflags) { int hit; - if (p->fixed) - hit = !fixmatch(p, line, eol, match); - else if (p->pcre1_regexp) + if (p->pcre1_regexp) hit = !pcre1match(p, line, eol, match, eflags); else if (p->pcre2_pattern) hit = !pcre2match(p, line, eol, match, eflags); diff --git a/grep.h b/grep.h index 4bb8a79d93..d35a137fcb 100644 --- a/grep.h +++ b/grep.h @@ -32,7 +32,6 @@ typedef int pcre2_compile_context; typedef int pcre2_match_context; typedef int pcre2_jit_stack; #endif -#include "kwset.h" #include "thread-utils.h" #include "userdiff.h" @@ -97,7 +96,6 @@ struct grep_pat { pcre2_match_context *pcre2_match_context; pcre2_jit_stack *pcre2_jit_stack; uint32_t pcre2_jit_on; - kwset_t kws; unsigned fixed:1; unsigned ignore_case:1; unsigned word_regexp:1; From patchwork Mon Jul 1 21:21:00 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= X-Patchwork-Id: 11026753 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 47A3D13A4 for ; Mon, 1 Jul 2019 21:21:28 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3C1822870C for ; Mon, 1 Jul 2019 21:21:28 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 3077028718; Mon, 1 Jul 2019 21:21:28 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5B2BA2870F for ; Mon, 1 Jul 2019 21:21:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726961AbfGAVV0 (ORCPT ); Mon, 1 Jul 2019 17:21:26 -0400 Received: from mail-wr1-f66.google.com ([209.85.221.66]:34845 "EHLO mail-wr1-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726586AbfGAVVZ (ORCPT ); Mon, 1 Jul 2019 17:21:25 -0400 Received: by mail-wr1-f66.google.com with SMTP id c27so7654515wrb.2 for ; Mon, 01 Jul 2019 14:21:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=LKD+OlAHzq6XbVAiOK/4PPFfIewNGcJHpWW9mariSAs=; b=KZHcpik4nAIxOVND9IZSZOEHe9Udlf3ikYbW2uOSGYAzxNna0gtc3lawsdqCAe9NC2 SHQaWlKx0JM2ZSypdovjhtOGJrmOr5MAmE+meyBrRL5umvuDAnopfEMgwWsTtaP7w3mO dEQ3bdWSidBG+OMt9ZWMeQ12EKwK9Svr2wk7EgsrWpZLhnQy6R+VTCxQJOV2guqW6kmN COoNHlvbpX18m2JRkUz8f2hAkR8TQSE+r9Q89arL06XKUQ6UPjRAOEPDQ+ij/NtHgHLh GLWjzZsvUIBxFRWwbzEMOM1JmxpXapok65gO7CULTIMSHyp0MIDv86KfGMbOpaL2lY6A dstQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=LKD+OlAHzq6XbVAiOK/4PPFfIewNGcJHpWW9mariSAs=; b=JDOlYO3H4P/FVX/zZ3kWKomOf4YMtLJWWxxdl528IIVepUz9VURhpS+AgsX83tIoKf kZCxdTnQXQ7yoQg+wUig4kaU3SwPbrbZ6uX8x/GLZbsjK1HhO/Ql/fjo+dK+Yr69e+cA 3BGGM01ud2oxIPiaHjYcmeOmZbT9VaJbv32I5eBltCxoG44x/xpzBXVxPr1NdZt59bUp Iy7hBnmL+O1Ix1DH1XJ3hEf4pxQPIdgaDnSgymbrh5lbKyyuTNFfTp/CPmOr1TR+ze9N 7+6U5cVYUikYV7bGJnKuDbheG8qTL5bx/wiqZYZ5HzWW0Ad7A7RIBPxMRLZgQXlLmr7m JfxQ== X-Gm-Message-State: APjAAAXgyuiHSDKRcx4KHGGYTH70SiWwWN7PLbpRFSQmd5bw+f1LYJdh e/7xZsZpJ1auEoxsmPUa3zAi+8i7f3I= X-Google-Smtp-Source: APXvYqxK1RuS0PPrqn7VF5vV2rNix6IHGff8XJijZLPr7LwJqzLDVI3+o0Nt6CitH2akrzmlJXwngQ== X-Received: by 2002:adf:eb89:: with SMTP id t9mr19632892wrn.120.1562016082483; Mon, 01 Jul 2019 14:21:22 -0700 (PDT) Received: from vm.nix.is ([2a01:4f8:120:2468::2]) by smtp.gmail.com with ESMTPSA id s2sm466824wmj.33.2019.07.01.14.21.21 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Mon, 01 Jul 2019 14:21:21 -0700 (PDT) From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= To: git@vger.kernel.org Cc: git-packagers@googlegroups.com, gitgitgadget@gmail.com, gitster@pobox.com, johannes.schindelin@gmx.de, peff@peff.net, sandals@crustytoothpaste.net, szeder.dev@gmail.com, =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= Subject: [PATCH v3 10/10] grep: use PCRE v2 for optimized fixed-string search Date: Mon, 1 Jul 2019 23:21:00 +0200 Message-Id: <20190701212100.27850-11-avarab@gmail.com> X-Mailer: git-send-email 2.22.0.455.g172b71a6c5 In-Reply-To: <20190627233912.7117-1-avarab@gmail.com> References: <20190627233912.7117-1-avarab@gmail.com> MIME-Version: 1.0 Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Bring back optimized fixed-string search for "grep", this time with PCRE v2 as an optional backend. As noted in [1] with kwset we were slower than PCRE v1 and v2 JIT with the kwset backend, so that optimization was counterproductive. This brings back the optimization for "--fixed-strings", without changing the semantics of having a NUL-byte in patterns. As seen in previous commits in this series we could support it now, but I'd rather just leave that edge-case aside so we don't have one behavior or the other depending what "--fixed-strings" backend we're using. It makes the behavior harder to understand and document, and makes tests for the different backends more painful. This does change the behavior under non-C locales when "log"'s "--encoding" option is used and the heystack/needle in the content/command-line doesn't have a matching encoding. See the recent change in "t4210: skip more command-line encoding tests on MinGW" in this series. I think that's OK. We did nothing sensible before then (just compared raw bytes that had no hope of matching). At least now the user will get some idea why their grep/log never matches in that edge case. I could also support the PCRE v1 backend here, but that would make the code more complex. I'd rather aim for simplicity here and in future changes to the diffcore. We're not going to have someone who absolutely must have faster search, but for whom building PCRE v2 isn't acceptable. The difference between this series of commits and the current "master" is, using the same t/perf commands shown in the last commit: plain grep: Test origin/master HEAD ------------------------------------------------------------------------- 7821.1: fixed grep int 0.55(1.67+0.56) 0.41(0.98+0.60) -25.5% 7821.2: basic grep int 0.58(1.65+0.52) 0.41(0.96+0.57) -29.3% 7821.3: extended grep int 0.57(1.66+0.49) 0.42(0.93+0.60) -26.3% 7821.4: perl grep int 0.54(1.67+0.50) 0.43(0.88+0.65) -20.4% 7821.6: fixed grep uncommon 0.21(0.52+0.42) 0.16(0.24+0.51) -23.8% 7821.7: basic grep uncommon 0.20(0.49+0.45) 0.17(0.28+0.47) -15.0% 7821.8: extended grep uncommon 0.20(0.54+0.39) 0.16(0.25+0.50) -20.0% 7821.9: perl grep uncommon 0.20(0.58+0.36) 0.16(0.23+0.50) -20.0% 7821.11: fixed grep æ 0.35(1.24+0.43) 0.16(0.23+0.50) -54.3% 7821.12: basic grep æ 0.36(1.29+0.38) 0.16(0.20+0.54) -55.6% 7821.13: extended grep æ 0.35(1.23+0.44) 0.16(0.24+0.50) -54.3% 7821.14: perl grep æ 0.35(1.33+0.34) 0.16(0.28+0.46) -54.3% grep with -i: Test origin/master HEAD ---------------------------------------------------------------------------- 7821.1: fixed grep -i int 0.62(1.81+0.70) 0.47(1.11+0.64) -24.2% 7821.2: basic grep -i int 0.67(1.90+0.53) 0.46(1.07+0.62) -31.3% 7821.3: extended grep -i int 0.62(1.92+0.53) 0.53(1.12+0.58) -14.5% 7821.4: perl grep -i int 0.66(1.85+0.58) 0.45(1.10+0.59) -31.8% 7821.6: fixed grep -i uncommon 0.21(0.54+0.43) 0.17(0.20+0.55) -19.0% 7821.7: basic grep -i uncommon 0.20(0.52+0.45) 0.17(0.29+0.48) -15.0% 7821.8: extended grep -i uncommon 0.21(0.52+0.44) 0.17(0.26+0.50) -19.0% 7821.9: perl grep -i uncommon 0.21(0.53+0.44) 0.17(0.20+0.56) -19.0% 7821.11: fixed grep -i æ 0.26(0.79+0.44) 0.16(0.29+0.46) -38.5% 7821.12: basic grep -i æ 0.26(0.79+0.42) 0.16(0.20+0.54) -38.5% 7821.13: extended grep -i æ 0.26(0.84+0.39) 0.16(0.24+0.50) -38.5% 7821.14: perl grep -i æ 0.16(0.24+0.49) 0.17(0.25+0.51) +6.3% plain log: Test origin/master HEAD -------------------------------------------------------------------------------- 4221.1: fixed log --grep='int' 7.24(6.95+0.28) 7.20(6.95+0.18) -0.6% 4221.2: basic log --grep='int' 7.31(6.97+0.22) 7.20(6.93+0.21) -1.5% 4221.3: extended log --grep='int' 7.37(7.04+0.24) 7.22(6.91+0.25) -2.0% 4221.4: perl log --grep='int' 7.31(7.04+0.21) 7.19(6.89+0.21) -1.6% 4221.6: fixed log --grep='uncommon' 6.93(6.59+0.32) 7.04(6.66+0.37) +1.6% 4221.7: basic log --grep='uncommon' 6.92(6.58+0.29) 7.08(6.75+0.29) +2.3% 4221.8: extended log --grep='uncommon' 6.92(6.55+0.31) 7.00(6.68+0.31) +1.2% 4221.9: perl log --grep='uncommon' 7.03(6.59+0.33) 7.12(6.73+0.34) +1.3% 4221.11: fixed log --grep='æ' 7.41(7.08+0.28) 7.05(6.76+0.29) -4.9% 4221.12: basic log --grep='æ' 7.39(6.99+0.33) 7.00(6.68+0.25) -5.3% 4221.13: extended log --grep='æ' 7.34(7.00+0.25) 7.15(6.81+0.31) -2.6% 4221.14: perl log --grep='æ' 7.43(7.13+0.26) 7.01(6.60+0.36) -5.7% log with -i: Test origin/master HEAD ------------------------------------------------------------------------------------ 4221.1: fixed log -i --grep='int' 7.31(7.07+0.24) 7.23(7.00+0.22) -1.1% 4221.2: basic log -i --grep='int' 7.40(7.08+0.28) 7.19(6.92+0.20) -2.8% 4221.3: extended log -i --grep='int' 7.43(7.13+0.25) 7.27(6.99+0.21) -2.2% 4221.4: perl log -i --grep='int' 7.34(7.10+0.24) 7.10(6.90+0.19) -3.3% 4221.6: fixed log -i --grep='uncommon' 7.07(6.71+0.32) 7.11(6.77+0.28) +0.6% 4221.7: basic log -i --grep='uncommon' 6.99(6.64+0.28) 7.12(6.69+0.38) +1.9% 4221.8: extended log -i --grep='uncommon' 7.11(6.74+0.32) 7.10(6.77+0.27) -0.1% 4221.9: perl log -i --grep='uncommon' 6.98(6.60+0.29) 7.05(6.64+0.34) +1.0% 4221.11: fixed log -i --grep='æ' 7.85(7.45+0.34) 7.03(6.68+0.32) -10.4% 4221.12: basic log -i --grep='æ' 7.87(7.49+0.29) 7.06(6.69+0.31) -10.3% 4221.13: extended log -i --grep='æ' 7.87(7.54+0.31) 7.09(6.69+0.31) -9.9% 4221.14: perl log -i --grep='æ' 7.06(6.77+0.28) 6.91(6.57+0.31) -2.1% So as with e05b027627 ("grep: use PCRE v2 for optimized fixed-string search", 2019-06-26) there's a huge improvement in performance for "grep", but in "log" most of our time is spent elsewhere, so we don't notice it that much. Signed-off-by: Ævar Arnfjörð Bjarmason --- grep.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 49 insertions(+), 2 deletions(-) diff --git a/grep.c b/grep.c index 4468519d5c..fc0ed73ef3 100644 --- a/grep.c +++ b/grep.c @@ -356,6 +356,18 @@ static NORETURN void compile_regexp_failed(const struct grep_pat *p, die("%s'%s': %s", where, p->pattern, error); } +static int is_fixed(const char *s, size_t len) +{ + size_t i; + + for (i = 0; i < len; i++) { + if (is_regex_special(s[i])) + return 0; + } + + return 1; +} + #ifdef USE_LIBPCRE1 static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt) { @@ -602,7 +614,6 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol, static void free_pcre2_pattern(struct grep_pat *p) { } -#endif /* !USE_LIBPCRE2 */ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) { @@ -623,11 +634,13 @@ static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt) compile_regexp_failed(p, errbuf); } } +#endif /* !USE_LIBPCRE2 */ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) { int err; int regflags = REG_NEWLINE; + int pat_is_fixed; p->word_regexp = opt->word_regexp; p->ignore_case = opt->ignore_case; @@ -636,8 +649,42 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt) if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2) die(_("given pattern contains NULL byte (via -f ). This is only supported with -P under PCRE v2")); - if (opt->fixed) { + pat_is_fixed = is_fixed(p->pattern, p->patternlen); + if (opt->fixed || pat_is_fixed) { +#ifdef USE_LIBPCRE2 + opt->pcre2 = 1; + if (pat_is_fixed) { + compile_pcre2_pattern(p, opt); + } else { + /* + * E.g. t7811-grep-open.sh relies on the + * pattern being restored. + */ + char *old_pattern = p->pattern; + size_t old_patternlen = p->patternlen; + struct strbuf sb = STRBUF_INIT; + + /* + * There is the PCRE2_LITERAL flag, but it's + * only in PCRE v2 10.30 and later. Needing to + * ifdef our way around that and dealing with + * it + PCRE2_MULTILINE being an error is more + * complex than just quoting this ourselves. + */ + strbuf_add(&sb, "\\Q", 2); + strbuf_add(&sb, p->pattern, p->patternlen); + strbuf_add(&sb, "\\E", 2); + + p->pattern = sb.buf; + p->patternlen = sb.len; + compile_pcre2_pattern(p, opt); + p->pattern = old_pattern; + p->patternlen = old_patternlen; + strbuf_release(&sb); + } +#else /* !USE_LIBPCRE2 */ compile_fixed_regexp(p, opt); +#endif /* !USE_LIBPCRE2 */ return; }