diff mbox series

[v2,7/8] grep: simplify config parsing, change grep.<rx config> interaction

Message ID patch-v2-7.8-140a7416223-20211110T013632Z-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series grep: simplify & delete code by changing obscure cfg variable behavior | expand

Commit Message

Ævar Arnfjörð Bjarmason Nov. 10, 2021, 1:43 a.m. UTC
Change the interaction between "grep.patternType=default" and
"grep.extendedRegexp=true" to make setting "grep.extendedRegexp=true"
synonymous with setting "grep.patternType=extended".

This changes our existing config parsing behavior as detailed below,
but in a way that's consistent with how we parse other
configuration.

We are breaking past promises here, but I doubt that this will impact
anyone in practice. The reduction in complexity and resulting
consistency with other default config behavior is worth it.

When "grep.patternType" was introduced in 84befcd0a4a (grep: add a
grep.patternType configuration setting, 2012-08-03) we made two
seemingly contradictory promises:

 1. You can set "grep.patternType", and "[setting it to] 'default'
    will return to the default matching behavior".

 2. Support the existing "grep.extendedRegexp" option, but ignore it
    when the new "grep.patternType" is set, *except* "when the
    `grep.patternType` option is set. to a value other than 'default'".

I think that 84befcd0a4a probably didn't intend this behavior, but
instead ended up conflating our internal "unspecified" state with a
user's explicit desire to set the configuration back to the
default.

I.e. a user would correctly expect this to keep working:

    # ERE grep
    git -c grep.extendedRegexp=true grep <pattern>

And likewise for "grep.patternType=default" to take precedence over
the disfavored "grep.extendedRegexp" option, i.e. the usual "last set
wins" semantics.

    # BRE grep
    git -c grep.extendedRegexp=true -c grep.patternType=basic grep <pattern>

But probably not for this to ignore the favored "grep.patternType"
option entirely, say if /etc/gitconfig was still setting
"grep.extendedRegexp", but "~/.gitconfig" used the new
"grep.patternType" (and wanted to use the "default" value):

    # Was ERE, now BRE
    git -c grep.extendedRegexp=true grep.patternType=default grep <pattern>

I think that in practice nobody or almost nobody is going to be
relying on this obscure interaction, and as shown here it makes the
config parsing much simpler. We no longer have to carry a complex
state machine in "grep_commit_pattern_type()" and
"grep_set_pattern_type_option()".

We can also do away with the "int fixed" and "int pcre2" members in
favor of using "pattern_type_option" directly in "grep.c", as well as
dropping the "pattern_type_arg" variable in "builtin/grep.c" in favor
of using the "pattern_type_option" member directly.

See my 07a3d411739 (grep: remove regflags from the public grep_opt
API, 2017-06-29) for addition of the two comments being removed here,
i.e. the complexity noted in that commit is now going away.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/config/grep.txt |  3 +-
 builtin/grep.c                | 10 ++---
 grep.c                        | 71 +++++------------------------------
 grep.h                        |  6 +--
 revision.c                    |  2 -
 t/t7810-grep.sh               |  2 +-
 6 files changed, 16 insertions(+), 78 deletions(-)

Comments

Junio C Hamano Nov. 12, 2021, 7:19 p.m. UTC | #1
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Change the interaction between "grep.patternType=default" and
> "grep.extendedRegexp=true" to make setting "grep.extendedRegexp=true"
> synonymous with setting "grep.patternType=extended".

This description alone is not quite understandable.  It is not
saying much more than the single line title, and presense of it does
not seem to improve the understanding by the readers.

> When "grep.patternType" was introduced in 84befcd0a4a (grep: add a
> grep.patternType configuration setting, 2012-08-03) we made two
> seemingly contradictory promises:
>
>  1. You can set "grep.patternType", and "[setting it to] 'default'
>     will return to the default matching behavior".
>
>  2. Support the existing "grep.extendedRegexp" option, but ignore it
>     when the new "grep.patternType" is set, *except* "when the
>     `grep.patternType` option is set. to a value other than 'default'".

OK, so setting grep.patternType=default makes grep.extendedRegexp to
be taken into account.  By grep.patternType to something else, the
other one is ignored.  2. is a very explicit way to say so.  Where
did you get 1. from?  If you have this paragraph in the log message
in mind, I agree that it is less than ideally phrased, but ...

    Rather than adding an additional setting for grep.fooRegexp for
    current and future pattern matching options, add a
    grep.patternType setting that can accept appropriate values for
    modifying the default grep pattern matching behavior. The
    current values are "basic", "extended", "fixed", "perl" and
    "default" for setting -G, -E, -F, -P and the default behavior
    respectively.

... with the understanding of 2. (which is in what the commit adds
to Documentation/config.txt), it is reasonable to understand that
"the default behaviour" is "use BRE or ERE, depending on the setting
of grep.extendedRegexp".

Doesn't the code behave that way?  I think the above is exactly how
the commit wanted to make the code behave.

> I think that 84befcd0a4a probably didn't intend this behavior, but
> instead ended up conflating our internal "unspecified" state with a
> user's explicit desire to set the configuration back to the
> default.

I am not sure where that comes from, but if I imagine somebody
confuses between "default" and "basic" and considers "default" a
synonym for "basic", I can sort-of understand it.  Is it what is
happening here?

But it is not what the original .patternType patch wanted to do back
then, and it is not what we want to see now.

> I.e. a user would correctly expect this to keep working:
>
>     # ERE grep
>     git -c grep.extendedRegexp=true grep <pattern>

This makes sense.

> And likewise for "grep.patternType=default" to take precedence over
> the disfavored "grep.extendedRegexp" option, i.e. the usual "last set
> wins" semantics.
>
>     # BRE grep
>     git -c grep.extendedRegexp=true -c grep.patternType=basic grep <pattern>

This makes sense, too.

Do either of the above two not work as you expect (i.e. the first
use ERE and the second use BRE)?

What I have trouble with is that it is unclear if you are describing
what should happen (in the above, I said "makes sense", to show my
agreement, assuming that it is the case), or if you are describing
what does happen that you disagree with. 

Another thing I have trouble with is your mention of "keep working".
Are you proposing to deliberately break what is working as users
correctly expect?  Why?

> But probably not for this to ignore the favored "grep.patternType"
> option entirely, say if /etc/gitconfig was still setting
> "grep.extendedRegexp", but "~/.gitconfig" used the new
> "grep.patternType" (and wanted to use the "default" value):
>
>     # Was ERE, now BRE
>     git -c grep.extendedRegexp=true grep.patternType=default grep <pattern>

I do not quite get your "Was X, now Y" label.  What did you want to
say with that?

Also I am not sure what you exactly mean when you say "and wanted to
use the 'default' value".  There is no single "THE" default value.
If patternType=default is the last patterntype (it may be set in
many places, but the last one should win), the user is telling that
the last-one-wins setting of extendedRegexp is to be honored.  So,
if grep.extendedRegexp in /etc/gitconfig is the last one defined, we
would choose between BRE and ERE depending on that setting.

Isn't that what is happening in the current code?

Or are all the above explanation result of simple misunderstanding
that setting to "default" means setting to "basic"?
Ævar Arnfjörð Bjarmason Nov. 13, 2021, 9:55 a.m. UTC | #2
On Fri, Nov 12 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
> [...]

I'll try to reply to all the rest of the feedback, just really quick on
this, because I think it might represent a bit of a gordian knot.

> Another thing I have trouble with is your mention of "keep working".
> Are you proposing to deliberately break what is working as users
> correctly expect?  Why?

Yes, I'd like to change the behavior, because it makes the grep API much
easier to deal with, and beacuse I think it impacts nobody in practice.

The real goal for this series is that I've got pending patches to
speedup diffcore-pickaxe massively by moving it over to PCRE & drop the
kwset.c code. An old perf test I dug up for that is in [1].

To do that I needed to re-use the bits of grep.c machinery that deal
with setting up patterns, dealing with BRE,ERE,PCRE etc. elsewhere.

I *can* do that in a different way, but it's going to be much easier if
we can gradually evolve the already working grep API to become an
internal textual pattern matching API. Eventually I'd like to move all
of regcomp()/regexec() over to such a thing, because we can for any
other ranodm thing we use regexes for get speedups by using PCRE (and
optionally use its interface to understand BRE/ERE syntax).

The alternative is to split that part off from grep.c, which is a bit
more painful, or to have the init bits etc. take some "no config doesn't
go first", "no it goes first" flags just to support this one API user.

So it would be generally useful to know if you're at all open to
that. Reading between the lines in some other comments I fear that it
may be a "no" except if we mark it as deprecated, wait some years, maybe
remove/change it then etc.

1.

    GIT_TEST_LONG= GIT_PERF_REPEAT_COUNT=10 GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE=1 CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run origin/next HEAD -- p4209-pickaxe.sh
    Test                                                                      origin/next       HEAD
    ------------------------------------------------------------------------------------------------------------------
    4209.1: git log -S'int main' <limit-rev>..                                0.38(0.36+0.01)   0.37(0.33+0.04) -2.6%
    4209.2: git log -S'æ' <limit-rev>..                                       0.51(0.47+0.04)   0.32(0.27+0.05) -37.3%
    4209.3: git log --pickaxe-regex -S'(int|void|null)' <limit-rev>..         0.72(0.68+0.03)   0.57(0.54+0.03) -20.8%
    4209.4: git log --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..          0.60(0.55+0.02)   0.39(0.34+0.05) -35.0%
    4209.5: git log --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..       0.43(0.40+0.03)   0.50(0.44+0.06) +16.3%
    4209.6: git log -G'(int|void|null)' <limit-rev>..                         0.64(0.55+0.09)   0.63(0.56+0.05) -1.6%
    4209.7: git log -G'if *\([^ ]+ & ' <limit-rev>..                          0.64(0.59+0.05)   0.63(0.56+0.06) -1.6%
    4209.8: git log -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                       0.63(0.54+0.08)   0.62(0.55+0.06) -1.6%
    4209.9: git log -i -S'int main' <limit-rev>..                             0.39(0.35+0.03)   0.38(0.35+0.02) -2.6%
    4209.10: git log -i -S'æ' <limit-rev>..                                   0.39(0.33+0.06)   0.32(0.28+0.04) -17.9%
    4209.11: git log -i --pickaxe-regex -S'(int|void|null)' <limit-rev>..     0.90(0.84+0.05)   0.58(0.53+0.04) -35.6%
    4209.12: git log -i --pickaxe-regex -S'if *\([^ ]+ & ' <limit-rev>..      0.71(0.64+0.06)   0.40(0.37+0.03) -43.7%
    4209.13: git log -i --pickaxe-regex -S'[àáâãäåæñøùúûüýþ]' <limit-rev>..   0.43(0.40+0.03)   0.50(0.46+0.04) +16.3%
    4209.14: git log -i -G'(int|void|null)' <limit-rev>..                     0.64(0.57+0.06)   0.62(0.56+0.05) -3.1%
    4209.15: git log -i -G'if *\([^ ]+ & ' <limit-rev>..                      0.65(0.59+0.06)   0.63(0.54+0.08) -3.1%
    4209.16: git log -i -G'[àáâãäåæñøùúûüýþ]' <limit-rev>..                   0.63(0.55+0.08)   0.62(0.56+0.05) -1.6%
diff mbox series

Patch

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index ae51f2d91c8..f4b7d3041fb 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -12,8 +12,7 @@  grep.patternType::
 
 grep.extendedRegexp::
 	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
+	option is ignored when the `grep.patternType` option is set.
 
 grep.threads::
 	Number of grep worker threads to use. If unset (or set to 0), Git will
diff --git a/builtin/grep.c b/builtin/grep.c
index 0ea124321b6..942c4b25077 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -845,7 +845,6 @@  int cmd_grep(int argc, const char **argv, const char *prefix)
 	int i;
 	int dummy;
 	int use_index = 1;
-	int pattern_type_arg = GREP_PATTERN_TYPE_UNSPECIFIED;
 	int allow_revs;
 
 	struct option options[] = {
@@ -879,16 +878,16 @@  int cmd_grep(int argc, const char **argv, const char *prefix)
 			N_("descend at most <depth> levels"), PARSE_OPT_NONEG,
 			NULL, 1 },
 		OPT_GROUP(""),
-		OPT_SET_INT('E', "extended-regexp", &pattern_type_arg,
+		OPT_SET_INT('E', "extended-regexp", &opt.pattern_type_option,
 			    N_("use extended POSIX regular expressions"),
 			    GREP_PATTERN_TYPE_ERE),
-		OPT_SET_INT('G', "basic-regexp", &pattern_type_arg,
+		OPT_SET_INT('G', "basic-regexp", &opt.pattern_type_option,
 			    N_("use basic POSIX regular expressions (default)"),
 			    GREP_PATTERN_TYPE_BRE),
-		OPT_SET_INT('F', "fixed-strings", &pattern_type_arg,
+		OPT_SET_INT('F', "fixed-strings", &opt.pattern_type_option,
 			    N_("interpret patterns as fixed strings"),
 			    GREP_PATTERN_TYPE_FIXED),
-		OPT_SET_INT('P', "perl-regexp", &pattern_type_arg,
+		OPT_SET_INT('P', "perl-regexp", &opt.pattern_type_option,
 			    N_("use Perl-compatible regular expressions"),
 			    GREP_PATTERN_TYPE_PCRE),
 		OPT_GROUP(""),
@@ -982,7 +981,6 @@  int cmd_grep(int argc, const char **argv, const char *prefix)
 	argc = parse_options(argc, argv, prefix, options, grep_usage,
 			     PARSE_OPT_KEEP_DASHDASH |
 			     PARSE_OPT_STOP_AT_NON_OPTION);
-	grep_commit_pattern_type(pattern_type_arg, &opt);
 
 	if (use_index && !startup_info->have_repository) {
 		int fallback = 0;
diff --git a/grep.c b/grep.c
index fb3f63c63ef..dda8e536fe3 100644
--- a/grep.c
+++ b/grep.c
@@ -60,8 +60,10 @@  int grep_config(const char *var, const char *value, void *cb)
 	if (userdiff_config(var, value) < 0)
 		return -1;
 
-	if (!strcmp(var, "grep.extendedregexp")) {
-		opt->extended_regexp_option = git_config_bool(var, value);
+	if (opt->pattern_type_option == GREP_PATTERN_TYPE_UNSPECIFIED &&
+	    !strcmp(var, "grep.extendedregexp") &&
+	    git_config_bool(var, value)) {
+		opt->pattern_type_option = GREP_PATTERN_TYPE_ERE;
 		return 0;
 	}
 
@@ -115,62 +117,6 @@  void grep_init(struct grep_opt *opt, struct repository *repo)
 	opt->header_tail = &opt->header_list;
 }
 
-static void grep_set_pattern_type_option(enum grep_pattern_type pattern_type, struct grep_opt *opt)
-{
-	/*
-	 * When committing to the pattern type by setting the relevant
-	 * fields in grep_opt it's generally not necessary to zero out
-	 * the fields we're not choosing, since they won't have been
-	 * set by anything. The extended_regexp_option field is the
-	 * only exception to this.
-	 *
-	 * This is because in the process of parsing grep.patternType
-	 * & grep.extendedRegexp we set opt->pattern_type_option and
-	 * opt->extended_regexp_option, respectively. We then
-	 * internally use opt->extended_regexp_option to see if we're
-	 * compiling an ERE. It must be unset if that's not actually
-	 * the case.
-	 */
-	if (pattern_type != GREP_PATTERN_TYPE_ERE &&
-	    opt->extended_regexp_option)
-		opt->extended_regexp_option = 0;
-
-	switch (pattern_type) {
-	case GREP_PATTERN_TYPE_UNSPECIFIED:
-		/* fall through */
-
-	case GREP_PATTERN_TYPE_BRE:
-		break;
-
-	case GREP_PATTERN_TYPE_ERE:
-		opt->extended_regexp_option = 1;
-		break;
-
-	case GREP_PATTERN_TYPE_FIXED:
-		opt->fixed = 1;
-		break;
-
-	case GREP_PATTERN_TYPE_PCRE:
-		opt->pcre2 = 1;
-		break;
-	}
-}
-
-void grep_commit_pattern_type(enum grep_pattern_type pattern_type, struct grep_opt *opt)
-{
-	if (pattern_type != GREP_PATTERN_TYPE_UNSPECIFIED)
-		grep_set_pattern_type_option(pattern_type, opt);
-	else if (opt->pattern_type_option != GREP_PATTERN_TYPE_UNSPECIFIED)
-		grep_set_pattern_type_option(opt->pattern_type_option, opt);
-	else if (opt->extended_regexp_option)
-		/*
-		 * This branch *must* happen after setting from the
-		 * opt->pattern_type_option above, we don't want
-		 * grep.extendedRegexp to override grep.patternType!
-		 */
-		grep_set_pattern_type_option(GREP_PATTERN_TYPE_ERE, opt);
-}
-
 static struct grep_pat *create_grep_pat(const char *pat, size_t patlen,
 					const char *origin, int no,
 					enum grep_pat_token t,
@@ -492,9 +438,10 @@  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
-	p->fixed = opt->fixed;
+	p->fixed = opt->pattern_type_option == GREP_PATTERN_TYPE_FIXED;
 
-	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
+	if (opt->pattern_type_option != GREP_PATTERN_TYPE_PCRE &&
+	    memchr(p->pattern, 0, p->patternlen))
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
 	p->is_fixed = is_fixed(p->pattern, p->patternlen);
@@ -545,14 +492,14 @@  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		return;
 	}
 
-	if (opt->pcre2) {
+	if (opt->pattern_type_option == GREP_PATTERN_TYPE_PCRE) {
 		compile_pcre2_pattern(p, opt);
 		return;
 	}
 
 	if (p->ignore_case)
 		regflags |= REG_ICASE;
-	if (opt->extended_regexp_option)
+	if (opt->pattern_type_option == GREP_PATTERN_TYPE_ERE)
 		regflags |= REG_EXTENDED;
 	err = regcomp(&p->regexp, p->pattern, regflags);
 	if (err) {
diff --git a/grep.h b/grep.h
index 30a7dfd3294..e4e548aed90 100644
--- a/grep.h
+++ b/grep.h
@@ -143,7 +143,6 @@  struct grep_opt {
 	int unmatch_name_only;
 	int count;
 	int word_regexp;
-	int fixed;
 	int all_match;
 #define GREP_BINARY_DEFAULT	0
 #define GREP_BINARY_NOMATCH	1
@@ -152,7 +151,6 @@  struct grep_opt {
 	int allow_textconv;
 	int extended;
 	int use_reflog_filter;
-	int pcre2;
 	int relative;
 	int pathname;
 	int null_following_name;
@@ -161,8 +159,7 @@  struct grep_opt {
 	int max_depth;
 	int funcname;
 	int funcbody;
-	int extended_regexp_option;
-	int pattern_type_option;
+	enum grep_pattern_type pattern_type_option;
 	int ignore_locale;
 	char colors[NR_GREP_COLORS][COLOR_MAXLEN];
 	unsigned pre_context;
@@ -201,7 +198,6 @@  struct grep_opt {
 
 int grep_config(const char *var, const char *value, void *);
 void grep_init(struct grep_opt *, struct repository *repo);
-void grep_commit_pattern_type(enum grep_pattern_type, struct grep_opt *opt);
 
 void append_grep_pat(struct grep_opt *opt, const char *pat, size_t patlen, const char *origin, int no, enum grep_pat_token t);
 void append_grep_pattern(struct grep_opt *opt, const char *pat, const char *origin, int no, enum grep_pat_token t);
diff --git a/revision.c b/revision.c
index 9f9b0d2429e..ed29d245c89 100644
--- a/revision.c
+++ b/revision.c
@@ -2864,8 +2864,6 @@  int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 
 	diff_setup_done(&revs->diffopt);
 
-	grep_commit_pattern_type(GREP_PATTERN_TYPE_UNSPECIFIED,
-				 &revs->grep_filter);
 	if (!is_encoding_utf8(get_log_output_encoding()))
 		revs->grep_filter.ignore_locale = 1;
 	compile_grep_patterns(&revs->grep_filter);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 6b6423a07c3..a59a9726357 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -443,7 +443,7 @@  do
 	'
 
 	test_expect_success "grep $L with grep.extendedRegexp=true and grep.patternType=default" '
-		echo "${HC}ab:abc" >expected &&
+		echo "${HC}ab:a+bc" >expected &&
 		git \
 			-c grep.extendedRegexp=true \
 			-c grep.patternType=default \