From patchwork Sat Dec 18 19:50:02 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: René Scharfe
X-Patchwork-Id: 12686199
Message-ID: <5fa6962e-3c1c-6dbc-f6d7-589151a9baec@web.de>
Date: Sat, 18 Dec 2021 20:50:02 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.4.0
Content-Language: en-US
To: Git List
Cc: Hamza Mahfooz, Junio C Hamano, Ævar Arnfjörð Bjarmason, Carlo Marcelo Arenas Belón, Andreas Schwab
From: René Scharfe
Subject: [PATCH 1/2] grep/pcre2: use PCRE2_UTF even with ASCII patterns
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

compile_pcre2_pattern() currently uses the option PCRE2_UTF only for
patterns with non-ASCII characters.  ASCII-only patterns that contain
wildcards can still match non-ASCII strings, though.  Without that
option PCRE2 treats UTF-8 input as a sequence of single bytes, so a
wildcard can match just part of a multi-byte character.  Fix that by
using PCRE2_UTF even for ASCII-only patterns.

This is a remake of the reverted ae39ba431a (grep/pcre2: fix an edge
case concerning ascii patterns and UTF-8 data, 2021-10-15).
Compared to that commit, the change to the condition and the test are
simpler and more targeted.

Original-patch-by: Hamza Mahfooz
Signed-off-by: René Scharfe
Reported-by: SZEDER Gábor
Signed-off-by: René Scharfe
---
 grep.c                          | 2 +-
 t/t7812-grep-icase-non-ascii.sh | 6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

--
2.34.0

diff --git a/grep.c b/grep.c
index fe847a0111..5badb6d851 100644
--- a/grep.c
+++ b/grep.c
@@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	if (!opt->ignore_locale && is_utf8_locale() &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index e5d1e4ea68..ca3f24f807 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -123,4 +123,10 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2,PCRE2_MATCH_INVALID_UTF 'PCRE v2: gr
 	test_cmp invalid-0xe5 actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-literal ASCII from UTF-8' '
+	git grep --perl-regexp -h -o -e ll. file >actual &&
+	echo "lló" >expected &&
+	test_cmp expected actual
+'
+
 test_done

From patchwork Sat Dec 18 19:53:15 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: René Scharfe
X-Patchwork-Id: 12686201
Date: Sat, 18 Dec 2021 20:53:15 +0100
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.4.0
Subject: [PATCH 2/2] grep/pcre2: factor out literal variable
Content-Language: en-US
From: René Scharfe
To: Git List
Cc: Hamza Mahfooz, Junio C Hamano, Ævar Arnfjörð Bjarmason, Carlo Marcelo Arenas Belón, Andreas Schwab
References: <5fa6962e-3c1c-6dbc-f6d7-589151a9baec@web.de>
In-Reply-To: <5fa6962e-3c1c-6dbc-f6d7-589151a9baec@web.de>
Precedence: bulk
X-Mailing-List: git@vger.kernel.org

Patterns that contain no wildcards and don't have to be case-folded are
literal.  Give this condition a name to increase the readability of the
boolean expression for enabling the option PCRE2_UTF.
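
For illustration only (this is not part of the patch): a standalone
sketch that checks the named condition against the expression it
replaces.  The helpers utf_wanted_old(), utf_wanted_new() and
check_equivalence() are hypothetical; plain int flags stand in for
opt->ignore_locale, is_utf8_locale(), opt->ignore_case, p->fixed and
p->is_fixed.

```c
#include <assert.h>

/* Condition as written before this patch; flags are plain ints here. */
int utf_wanted_old(int ignore_locale, int utf8_locale,
		   int ignore_case, int fixed, int is_fixed)
{
	return !ignore_locale && utf8_locale &&
	       !(!ignore_case && (fixed || is_fixed));
}

/* Same condition with the literal check given a name. */
int utf_wanted_new(int ignore_locale, int utf8_locale,
		   int ignore_case, int fixed, int is_fixed)
{
	int literal = !ignore_case && (fixed || is_fixed);

	return !ignore_locale && utf8_locale && !literal;
}

/* Assert that both forms agree for all 32 combinations of the flags. */
void check_equivalence(void)
{
	int bits;

	for (bits = 0; bits < 32; bits++) {
		int a = bits & 1, b = (bits >> 1) & 1, c = (bits >> 2) & 1;
		int d = (bits >> 3) & 1, e = (bits >> 4) & 1;

		assert(utf_wanted_old(a, b, c, d, e) ==
		       utf_wanted_new(a, b, c, d, e));
	}
}
```

With these stand-ins, a case-sensitive fixed-string search in a UTF-8
locale (ignore_case = 0, fixed = 1) leaves the UTF options off, while
the same search with ignore_case = 1 turns them on.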

Signed-off-by: René Scharfe
---
 grep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--
2.34.0

diff --git a/grep.c b/grep.c
index 5badb6d851..2b6ac3205d 100644
--- a/grep.c
+++ b/grep.c
@@ -362,6 +362,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	int jitret;
 	int patinforet;
 	size_t jitsizearg;
+	int literal = !opt->ignore_case && (p->fixed || p->is_fixed);
 
 	/*
 	 * Call pcre2_general_context_create() before calling any
@@ -382,8 +383,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() &&
-	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+	if (!opt->ignore_locale && is_utf8_locale() && !literal)
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifdef GIT_PCRE2_VERSION_10_36_OR_HIGHER