diff mbox series

grep: add --max-count command line option

Message ID pull.1264.git.git.1652361610103.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series grep: add --max-count command line option | expand

Commit Message

Carlos López May 12, 2022, 1:20 p.m. UTC
From: =?UTF-8?q?Carlos=20L=C3=B3pez?= <00xc@protonmail.com>

This patch adds a command line option analogous to that of GNU
grep(1)'s -m / --max-count, which users might already be used to.
This makes it possible to limit the amount of matches shown in the
output while keeping the functionality of other options such as -C
(show code context) or -p (show containing function), which would be
difficult to do with a shell pipeline (e.g. head(1)).

Signed-off-by: Carlos López <00xc@protonmail.com>
---
    grep: add --max-count command line option
    
    This patch adds a command line option analogous to that of GNU grep(1)'s
    -m / --max-count, which users might already be used to. This makes it
    possible to limit the amount of matches shown in the output while
    keeping the functionality of other options such as -C (show code
    context) or -p (show containing function), which would be difficult to
    do with a shell pipeline (e.g. head(1)).
    
    Signed-off-by: Carlos López 00xc@protonmail.com

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1264%2F00xc%2Fmaster-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1264/00xc/master-v1
Pull-Request: https://github.com/git/git/pull/1264

 Documentation/git-grep.txt | 7 +++++++
 builtin/grep.c             | 2 ++
 grep.c                     | 2 ++
 grep.h                     | 1 +
 4 files changed, 12 insertions(+)


base-commit: 277cf0bc36094f6dc4297d8c9cef79df045b735d

Comments

Martin Ågren May 14, 2022, 6:16 p.m. UTC | #1
Hi Carlos,

Welcome to the mailing list. :-)

On Thu, 12 May 2022 at 21:13, Carlos L. via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: =?UTF-8?q?Carlos=20L=C3=B3pez?= <00xc@protonmail.com>
>
> This patch adds a command line option analogous to that of GNU
> grep(1)'s -m / --max-count, which users might already be used to.
> This makes it possible to limit the amount of matches shown in the
> output while keeping the functionality of other options such as -C
> (show code context) or -p (show containing function), which would be
> difficult to do with a shell pipeline (e.g. head(1)).

Makes sense to me.

> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 3d393fbac1b..02b36046475 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -23,6 +23,7 @@ SYNOPSIS
>            [--break] [--heading] [-p | --show-function]
>            [-A <post-context>] [-B <pre-context>] [-C <context>]
>            [-W | --function-context]
> +          [-m | --max-count <num>]

I think this should be

    [(-m | --max-count) <num>]

since the short form "-m" also wants to take "<num>".

> +-m <num>::
> +--max-count <num>::
> +       Limit the amount of matches per file. When using the -v or
> +       --invert-match option, the search stops after the specified
> +       number of non-matches. Setting this option to 0 has no effect.

Please use `backticks` with `-v` and `--invert-match` so that they are
set in monospace.

Regarding the special value 0, it's a bit unclear what "has no effect"
means. In particular, it can have an effect in the sense that when it
is used like

  git grep -m 1 -m 0 foo

it undoes the `-m 1`.

But also, that's not how my grep(1) behaves: with `-m 0`, it limits the
number of matches to zero. I don't know how useful that is (can that
zero-case be optimized by exiting with 1 before even trying to find the
needle!?), or if maybe different variants of grep handle this
differently?  If all grep implementations handle 0 by actually only
emitting zero hits, I think it would be wise for us to handle 0 the same
way.

As for overriding an earlier `-m <foo>`, which could be useful, it seems
to me like `--no-max-count` would make sense.

All in all, I would suggest the following documentation:

    -m <num>::
    --max-count <num>::
           Limit the amount of matches per file. When using the `-v` or
           `--invert-match` option, the search stops after the specified
           number of non-matches. Use `--no-max-count` to countermand an
           earlier `--max-count` option on the command line.

... and of course the matching implementation. :-) Maybe you could
achieve that by using -1 to signal that there's no max-count in play?
How does that sound to you?

Even if we want to handle the zero just like you do, I think this patch
needs a few tests. We should make sure to test the 0-case (whatever we
end up wanting it to behave like), and probably the "suppress an earlier
-m by giving --no-max-count" case. It also seems wise to set up some
test scenario where there are several files involved so that we can see
that we don't just print the first m matches *globally*, but that the
counter is really handled *per file*.

I think this `-m` flag would be a nice addition. I know that I've been
missing something like it a few times. As you wrote in your commit
message, `| head -3` can work for some use-cases, but definitely not for
others. This `-m` is a lot more granular than `-l` which can be seen as
a crude `-m 1`. Thanks for posting this patch! I hope you find my
comments useful.

Martin
Junio C Hamano May 16, 2022, 5:57 a.m. UTC | #2
"Carlos L. via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: =?UTF-8?q?Carlos=20L=C3=B3pez?= <00xc@protonmail.com>

Offtopic, but I wonder why this line is encoded like so?  The
"Signed-off-by:" line is not, and it is safely transmitted, so
it feels like we do not need to encode the in-body header that
is added only for e-mail but not in the original commit...

> This patch adds a command line option analogous to that of GNU
> grep(1)'s -m / --max-count, which users might already be used to.
> This makes it possible to limit the amount of matches shown in the
> output while keeping the functionality of other options such as -C
> (show code context) or -p (show containing function), which would be
> difficult to do with a shell pipeline (e.g. head(1)).
>
> Signed-off-by: Carlos López <00xc@protonmail.com>
> ---
> ...
> +-m <num>::
> +--max-count <num>::
> +	Limit the amount of matches per file. When using the -v or
> +	--invert-match option, the search stops after the specified
> +	number of non-matches. Setting this option to 0 has no effect.
> +

Good thing that this is defined as "per-file" limit.  If it were a
global limit, the interaction between this one and "--threads=<num>"
would have been interesting.  Perhaps add a test to make sure the
feature continues to work with "--threads=2" (I am assuming that you
have already tested this implementation works with the option).

Martin already commented on the wording "no effect"; I agree it is a
poor choice of words from the point of view of "overriding with 0".

It indeed is curious why GNU grep chose to immediately exit with 1
when "-m 0" was given, but that was decision made more than 20 years
ago (http://gnu.ist.utl.pt/software/grep/changes.html and look for
"2000-03-17").  Between "being consistent even with a seemingly
useless design choice made by somebody else" and "choose to be
different in a corner case where nobody should care and allow us to
be more useful", I am slightly in favor in this particular case.

What "git grep -m -1" should do?  IIRC, OPT_INTEGER is for signed
integer but the new .max_count member, as well as the existing
"count" that is compared with it, are of "unsigned" type.  Either
erroring out or treating it as unlimited is probably fine, but
whatever we do, we should document and have a test for it.

Thanks.
Paul Eggert May 16, 2022, 7:28 a.m. UTC | #3
On 5/15/22 22:57, Junio C Hamano wrote:

> It indeed is curious why GNU grep chose to immediately exit with 1
> when "-m 0" was given,

As I vaguely recall, if "-m 1" stops before "-m 2" does, then the idea 
was that it's reasonable for "-m 0" to stop before "-m 1" does, and the 
logical place to stop is right at the start, before any matches are 
found (i.e., exit with status 1).

What would be more useful for 'grep -m 0' to do? (Sorry, I came into 
this conversation just now.) Perhaps GNU 'grep -m 0' should change, if 
there's something better for it to do.


> What "git grep -m -1" should do?  IIRC, OPT_INTEGER is for signed
> integer but the new .max_count member, as well as the existing
> "count" that is compared with it, are of "unsigned" type.  Either
> erroring out or treating it as unlimited is probably fine, but
> whatever we do, we should document and have a test for it.

'grep -m -1' treats the count as being unlimited, but this isn't 
documented and (from the code) appears to be accidental. It'd make sense 
for it to be documented.
Carlos López May 16, 2022, 8:38 a.m. UTC | #4
Hi list,

Thanks to everyone who provided feedback :)

On Saturday, May 14th, 2022 at 20:16, Martin Ågren <martin.agren@gmail.com> wrote:
> I think this should be
>
> [(-m | --max-count) <num>]

> Please use `backticks` with `-v` and `--invert-match` so that they are
> set in monospace.

I will add these suggestions to my patch.

> Regarding the special value 0, it's a bit unclear what "has no effect"
> means. In particular, it can have an effect in the sense that when it
> is used like
>
> git grep -m 1 -m 0 foo
>
> it undoes the `-m 1`.
>
> But also, that's not how my grep(1) behaves: with `-m 0`, it limits the
> number of matches to zero. I don't know how useful that is (can that
> zero-case be optimized by exiting with 1 before even trying to find the
> needle!?), or if maybe different variants of grep handle this
> differently? If all grep implementations handle 0 by actually only
> emitting zero hits, I think it would be wise for us to handle 0 the same
> way.

I agree the wording is not clear. I did not see a good use case for GNU's `-m 0`, which is why I used that value as unlimited. I am not sold on using `--no-max-count` or -1 *just* for consistency, but if someone can point to a good use case of GNU's `-m 0` (especially in git grep), I will gladly concede.

> Even if we want to handle the zero just like you do, I think this patch
> needs a few tests. We should make sure to test the 0-case (whatever we
> end up wanting it to behave like), and probably the "suppress an earlier
> -m by giving --no-max-count" case. It also seems wise to set up some
> test scenario where there are several files involved so that we can see
> that we don't just print the first m matches globally, but that the
> counter is really handled per file.

This seems sound. Is there any documentation on how to write tests for git?

On Monday, May 16th, 2022 at 07:57, Junio C Hamano <gitster@pobox.com> wrote:
> Good thing that this is defined as "per-file" limit. If it were a
> global limit, the interaction between this one and "--threads=<num>"
> would have been interesting. Perhaps add a test to make sure the
> feature continues to work with "--threads=2" (I am assuming that you
> have already tested this implementation works with the option).

I did and I found no unexpected behavior.

> What "git grep -m -1" should do? IIRC, OPT_INTEGER is for signed
> integer but the new .max_count member, as well as the existing
> "count" that is compared with it, are of "unsigned" type. Either
> erroring out or treating it as unlimited is probably fine, but
> whatever we do, we should document and have a test for it.

I would favor treating it as an error. As mentioned above, using 0 to describe "unlimited matches" (e.g. the default) is my preference, but I am willing to concede if someone can think of a good use for `-m 0`. Also, from the implementation side (although not as important) it looks better: if we allow negative values, we need to distinguish between -1 (unlimited) and -4 (display error to user, probably) - the patch is much simpler right now. And just as a side note, we avoid an issue in the pretty much insignificant use case of giving a very big value (UINT_MAX) for `-m` and it overflowing into -1, thus not properly limiting the number of matches.

On Monday, May 16th, 2022 at 09:28, Paul Eggert <eggert@cs.ucla.edu> wrote:
> As I vaguely recall, if "-m 1" stops before "-m 2" does, then the idea
> was that it's reasonable for "-m 0" to stop before "-m 1" does, and the
> logical place to stop is right at the start, before any matches are
> found (i.e., exit with status 1).

As I mentioned above, I do not see what this `-m 0` behavior is useful for, but if someone could show me an use for it I would appreciate it.

Again, thank you everyone for your comments.
Junio C Hamano May 16, 2022, 3:18 p.m. UTC | #5
Paul Eggert <eggert@cs.ucla.edu> writes:

> On 5/15/22 22:57, Junio C Hamano wrote:
>
>> It indeed is curious why GNU grep chose to immediately exit with 1
>> when "-m 0" was given,
>
> As I vaguely recall, if "-m 1" stops before "-m 2" does, then the idea
> was that it's reasonable for "-m 0" to stop before "-m 1" does, and
> the logical place to stop is right at the start, before any matches
> are found (i.e., exit with status 1).
>
> What would be more useful for 'grep -m 0' to do? (Sorry, I came into
> this conversation just now.) Perhaps GNU 'grep -m 0' should change, if 
> there's something better for it to do.

"grep -m 0" that declares a failure upfront because it is asked to
stop before finding any match, combined with the fact that the
command is expected to signal a failure after finding no matches, is
an optimization that is mathmatically correct ;-)

It was asked as a part of discussion on a proposed patch to teach
the same "-m <max-number-of-hits>" option to "git grep" what it
ought to mean to give "-m 0".  As we are too accustomed to the "last
command line option wins" behaviour, I initially did not find the
behaviour of the proposed patch, where 0 (or negative) stood for
"unlimited", quite natural and useful (e.g. it allows overriding a
hardcoded default option in aliases, "[alias] gg = grep -m 4"), and
then was surprised by the "'-m 0' is an immediate failure" in GNU
grep.  I would call it mathematically pure and correct but of
dubious utility.

Sorry for not providing enough context.  Full discussion is seen at 
https://lore.kernel.org/git/pull.1264.git.git.1652361610103.gitgitgadget@gmail.com/

>> What "git grep -m -1" should do?  IIRC, OPT_INTEGER is for signed
>> integer but the new .max_count member, as well as the existing
>> "count" that is compared with it, are of "unsigned" type.  Either
>> erroring out or treating it as unlimited is probably fine, but
>> whatever we do, we should document and have a test for it.
>
> 'grep -m -1' treats the count as being unlimited, but this isn't
> documented and (from the code) appears to be accidental. It'd make
> sense for it to be documented.

Thanks.  The question was asked for the proposed addition to "git
grep", but it is funny to see it apply equally well to GNU grep ;-).
Junio C Hamano May 16, 2022, 3:36 p.m. UTC | #6
"Carlos L." <00xc@protonmail.com> writes:

>> Even if we want to handle the zero just like you do, I think this patch
>> needs a few tests. We should make sure to test the 0-case (whatever we
>> end up wanting it to behave like), and probably the "suppress an earlier
>> -m by giving --no-max-count" case. It also seems wise to set up some
>> test scenario where there are several files involved so that we can see
>> that we don't just print the first m matches globally, but that the
>> counter is really handled per file.
>
> This seems sound. Is there any documentation on how to write tests for git?

t/README and Documentation/MyFirstContribution would be two good
places to start.

>> What "git grep -m -1" should do? IIRC, OPT_INTEGER is for signed
>> integer but the new .max_count member, as well as the existing
>> "count" that is compared with it, are of "unsigned" type. Either
>> erroring out or treating it as unlimited is probably fine, but
>> whatever we do, we should document and have a test for it.
>
> I would favor treating it as an error. As mentioned above, using 0
> to describe "unlimited matches" (e.g. the default) is my
> preference, but I am willing to concede if someone can think of a
> good use for `-m 0`.

With Devil's advocate hat on.

"GNU grep has been doing so for the past 20 years and existing users
of the command expects '-m 0' to behave that way" is a good enough
reason, especially if '-m 0' is not the only possible way to say
"unlimited".

> Also, from the implementation side (although
> not as important) it looks better: if we allow negative values, we
> need to distinguish between -1 (unlimited) and -4 (display error
> to user, probably)

If we are going to document "you can pass a negative value to
explicitly say 'unlimited', which is a useful way to countermand
another `-m <num>` that appear earlier on the command line", then -1
and -4 would equally be 'unlimited' and there is no need to
distinguish anything.

Devil's advocate hat off.

I personally do not mind if "-m <non-positive>" means "unlimited",
as long as that is clearly documented and tested, but for long time
"GNU grep" users "-m 0" might appear surprising (not necessarily
because they would find that the "-m 0" that immediately fails is
useful, but because the behaviour is deliberately made different).

Thanks.
Paul Eggert May 17, 2022, 5:53 a.m. UTC | #7
On 5/16/22 08:36, Junio C Hamano wrote:
> "GNU grep has been doing so for the past 20 years and existing users
> of the command expects '-m 0' to behave that way" is a good enough
> reason, especially if '-m 0' is not the only possible way to say
> "unlimited".

Yes, I'm inclined in the same direction, now that I see more of the 
context. That is, GNU grep can continue what it's long been doing, with 
the only change being to the documentation so that we document -m-1 as 
meaning "unlimited". This minimizes possible disruption to existing 
scripts and satisfies the use case of having a way to turn off any 
previously-appearing -m option.

I installed the attached to the GNU grep master doc to do that. Hope 
this works for you.
diff mbox series

Patch

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 3d393fbac1b..02b36046475 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -23,6 +23,7 @@  SYNOPSIS
 	   [--break] [--heading] [-p | --show-function]
 	   [-A <post-context>] [-B <pre-context>] [-C <context>]
 	   [-W | --function-context]
+	   [-m | --max-count <num>]
 	   [--threads <num>]
 	   [-f <file>] [-e] <pattern>
 	   [--and|--or|--not|(|)|-e <pattern>...]
@@ -238,6 +239,12 @@  providing this option will cause it to die.
 	`git diff` works out patch hunk headers (see 'Defining a
 	custom hunk-header' in linkgit:gitattributes[5]).
 
+-m <num>::
+--max-count <num>::
+	Limit the amount of matches per file. When using the -v or
+	--invert-match option, the search stops after the specified
+	number of non-matches. Setting this option to 0 has no effect.
+
 --threads <num>::
 	Number of grep worker threads to use.
 	See `grep.threads` in 'CONFIGURATION' for more information.
diff --git a/builtin/grep.c b/builtin/grep.c
index bcb07ea7f75..ba1894d5675 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -961,6 +961,8 @@  int cmd_grep(int argc, const char **argv, const char *prefix)
 		OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
 			   N_("allow calling of grep(1) (ignored by this build)"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_INTEGER('m', "max-count", &opt.max_count,
+			N_("maximum number of results per file (default: 0, no limit)")),
 		OPT_END()
 	};
 	grep_prefix = prefix;
diff --git a/grep.c b/grep.c
index 82eb7da1022..173b6c27b6e 100644
--- a/grep.c
+++ b/grep.c
@@ -1686,6 +1686,8 @@  static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
 		bol = eol + 1;
 		if (!left)
 			break;
+		if (opt->max_count && count == opt->max_count)
+			break;
 		left--;
 		lno++;
 	}
diff --git a/grep.h b/grep.h
index c722d25ed9d..25836f34314 100644
--- a/grep.h
+++ b/grep.h
@@ -171,6 +171,7 @@  struct grep_opt {
 	int show_hunk_mark;
 	int file_break;
 	int heading;
+	unsigned max_count;
 	void *priv;
 
 	void (*output)(struct grep_opt *opt, const void *data, size_t size);