diff mbox series

[1/2] shortlog: introduce `--group-filter` to restrict output

Message ID 5cec04b65d350ca8b482ca14260ef118341e4039.1686178917.git.me@ttaylorr.com (mailing list archive)
State New, archived
Headers show
Series shortlog: introduce --email-only, --group-filter options | expand

Commit Message

Taylor Blau June 7, 2023, 11:02 p.m. UTC
Sometimes it is useful to ask, "how many commits have I authored or been
mentioned in via the Co-authored-by trailer"? `git shortlog` is a
reasonable tool for this, and the answer can easily be obtained by
running:

    $ git shortlog -ns --group=author --group=trailer:Co-authored-by

and reading off the corresponding value from its output. But what if you
want to know which commits contribute to the above count? You can drop
the `-s` option and parse through the results, but you'll have to skip
past everything you don't want (and stop reading after matching
everything you do want).

That is a script-able task, but it is cumbersome, and potentially very
slow, if there is a large amount of output that does not match the given
query.

Instead, this patch introduces `--group-filter` in order to restrict the
output of `git shortlog` to only matching group(s). Items match if they
are in a group which is strictly equal to one of the specified filters.

This means that you could easily view the hashes of all commits you
either wrote or co-authored with something like:

    $ git shortlog -n --group=author --group=trailer:Co-authored-by \
        --group-filter="$(git config user.name)"

When filtering just by trailers, it is tempting to want to introduce a
new grep mode for matching a given trailer, like `--author=<pattern>`
for matching the author header. But this would not be suitable for the
above, since we want commits which match either the author or the
Co-authored-by trailer, not ones which match both.

An alternative approach might be to implement trailer filtering as
above in revision.c, and show commits matching either the `--author`
value or some hypothetical `--trailer` filter. That would give shortlog
fewer commits, which may improve its performance. But it would restrict
the interpretation of these options to be an OR (i.e. show commits
matching either the `--author` or `--trailer` field).

In fact, this is already not possible to do, since the `--author` and
`--committer` options are documented as:

> Limit the commits output to ones with author/committer header lines
> that match the specified pattern

So introducing another option which changes the behavior of existing
ones is a non-starter.

Instead, `git shortlog` will process more commits than necessary. But
this is a marginal cost, since implementing the hypothetical revision
options from above as an OR would mean that revision.c has to process
every commit anyway.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-shortlog.txt |  5 ++++
 builtin/shortlog.c             | 20 ++++++++++++-
 shortlog.h                     |  2 ++
 t/t4201-shortlog.sh            | 54 ++++++++++++++++++++++++++++++++++
 4 files changed, 80 insertions(+), 1 deletion(-)

Comments

Derrick Stolee June 8, 2023, 2:34 p.m. UTC | #1
On 6/7/2023 7:02 PM, Taylor Blau wrote:

> This means that you could easily view the hashes of all commits you
> either wrote or co-authored with something like:
> 
>     $ git shortlog -n --group=author --group=trailer:Co-authored-by \
>         --group-filter="$(git config user.name)"
> 
> When filtering just by trailers, it is tempting to want to introduce a
> new grep mode for matching a given trailer, like `--author=<pattern>`
> for matching the author header. But this would not be suitable for the
> above, since we want commits which match either the author or the
> Co-authored-by trailer, not ones which match both.

One thing that is not immediately obvious from reading the patch, but
becomes clearer in patch 2, is that your --group-filter is an exact
string match. This differs from the --author filter in 'git log' and
similar, which is actually a case-insensitive substring match.

> +static int want_shortlog_group(struct shortlog *log, const char *group)
> +{
> +	if (!log->group_filter.nr)
> +		return 1;
> +	return string_list_has_string(&log->group_filter, group);
> +}
> +

This is the critical piece of code for this issue. Replacing it with

static int want_shortlog_group(struct shortlog *log, const char *group)
{
	struct string_list_item *item;
	if (!log->group_filter.nr)
		return 1;

	for_each_string_list_item(item, &log->group_filter) {
		if (strcasestr(group, item->string))
			return 1;
	}

	return 0;
}

Results in the case-insensitive substring search that I would expect
from this parameter.

This would also solve the problem from Patch 2 where we want to search
by email address. Using '-e --group-filter="my@email.com"' works, though
it will catch users with 'tammy@email.com' emails, as well.

Something to consider at a high level before committing to this CLI.

Thanks,
-Stolee
Taylor Blau June 8, 2023, 4:22 p.m. UTC | #2
On Thu, Jun 08, 2023 at 10:34:02AM -0400, Derrick Stolee wrote:
> On 6/7/2023 7:02 PM, Taylor Blau wrote:
>
> > This means that you could easily view the hashes of all commits you
> > either wrote or co-authored with something like:
> >
> >     $ git shortlog -n --group=author --group=trailer:Co-authored-by \
> >         --group-filter="$(git config user.name)"
> >
> > When filtering just by trailers, it is tempting to want to introduce a
> > new grep mode for matching a given trailer, like `--author=<pattern>`
> > for matching the author header. But this would not be suitable for the
> > above, since we want commits which match either the author or the
> > Co-authored-by trailer, not ones which match both.
>
> One thing that is not immediately obvious from reading the patch, but
> becomes clearer in patch 2, is that your --group-filter is an exact
> string match. This differs from the --author filter in 'git log' and
> similar, which is actually a case-insensitive substring match.

Yeah, funnily enough I originally thought about adding a `--trailer`
flag to the revision machinery, but realized that it was a non-starter
since `--author` is documented as only showing commits matching that
author.

So that doesn't solve for the case of finding a commit which has the
identity you want buried in some trailer, but is written by a different
author. shortlog needs to see all commits along the traversal, so I went
with the filtering approach instead.

Thanks for the pointer on beefing up the documentation to indicate that
this is an exact search.

> This is the critical piece of code for this issue. Replacing it with
>
> static int want_shortlog_group(struct shortlog *log, const char *group)
> {
> 	struct string_list_item *item;
> 	if (!log->group_filter.nr)
> 		return 1;
>
> 	for_each_string_list_item(item, &log->group_filter) {
> 		if (strcasestr(group, item->string))
> 			return 1;
> 	}
>
> 	return 0;
> }
>
> Results in the case-insensitive substring search that I would expect
> from this parameter.
>
> This would also solve the problem from Patch 2 where we want to search
> by email address. Using '-e --group-filter="my@email.com"' works, though
> it will catch users with 'tammy@email.com' emails, as well.

Yeah, thanks for raising it. I wonder if there are other semantics that
don't incorrectly return substring matches. You could conceivably extend
the --group-filter argument to take an extended regular expression, and
then just match on whatever's inside the last "<>".

That approach was one that I considered, but feels a little heavyweight
for what we're trying to do here. One potential approach is to keep
things as-is, and then consider extending `--group-filter` to mean
"extended regular expression" in the future. Although I'm not sure that
would work, either, since fixed-string matches would suddenly match
anywhere in the string, breaking backwards compatibility.

You could add another option that specifies the behavior of
`--group-filter`, or add some extra bit of information there like
`--group-filter=egrep:<pattern>` or something.

At least there's no shortage of options ;-).

Thanks,
Taylor
diff mbox series

Patch

diff --git a/Documentation/git-shortlog.txt b/Documentation/git-shortlog.txt
index 7d0277d033..dab6d09648 100644
--- a/Documentation/git-shortlog.txt
+++ b/Documentation/git-shortlog.txt
@@ -97,6 +97,11 @@  counts both authors and co-authors.
 If width is `0` (zero) then indent the lines of the output without wrapping
 them.
 
+--group-filter=<group>::
+	Only show output from the given group. If given more than once,
+	show output from any of the previously specified groups. May be
+	cleared with `--no-group-filter`.
+
 <revision-range>::
 	Show only commits in the specified revision range.  When no
 	<revision-range> is specified, it defaults to `HEAD` (i.e. the
diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 46f4e0832a..679db22c57 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -365,6 +365,7 @@  void shortlog_init(struct shortlog *log)
 	log->trailers.strdup_strings = 1;
 	log->trailers.cmp = strcasecmp;
 	log->format.strdup_strings = 1;
+	log->group_filter.strdup_strings = 1;
 }
 
 void shortlog_finish_setup(struct shortlog *log)
@@ -377,6 +378,7 @@  void shortlog_finish_setup(struct shortlog *log)
 				   log->email ? "%cN <%cE>" : "%cN");
 
 	string_list_sort(&log->trailers);
+	string_list_sort(&log->group_filter);
 }
 
 int cmd_shortlog(int argc, const char **argv, const char *prefix)
@@ -400,6 +402,8 @@  int cmd_shortlog(int argc, const char **argv, const char *prefix)
 			&parse_wrap_args),
 		OPT_CALLBACK(0, "group", &log, N_("field"),
 			N_("group by field"), parse_group_option),
+		OPT_STRING_LIST(0, "group-filter", &log.group_filter,
+				N_("group"), N_("only show matching groups")),
 		OPT_END(),
 	};
 
@@ -476,6 +480,13 @@  static void add_wrapped_shortlog_msg(struct strbuf *sb, const char *s,
 	strbuf_addch(sb, '\n');
 }
 
+static int want_shortlog_group(struct shortlog *log, const char *group)
+{
+	if (!log->group_filter.nr)
+		return 1;
+	return string_list_has_string(&log->group_filter, group);
+}
+
 void shortlog_output(struct shortlog *log)
 {
 	size_t i, j;
@@ -486,6 +497,9 @@  void shortlog_output(struct shortlog *log)
 		      log->summary ? compare_by_counter : compare_by_list);
 	for (i = 0; i < log->list.nr; i++) {
 		const struct string_list_item *item = &log->list.items[i];
+		if (!want_shortlog_group(log, item->string))
+			goto next;
+
 		if (log->summary) {
 			fprintf(log->file, "%6d\t%s\n",
 				(int)UTIL_TO_INT(item), item->string);
@@ -505,11 +519,15 @@  void shortlog_output(struct shortlog *log)
 					fprintf(log->file, "      %s\n", msg);
 			}
 			putc('\n', log->file);
+		}
+
+next:
+		if (!log->summary) {
+			struct string_list *onelines = item->util;
 			onelines->strdup_strings = 1;
 			string_list_clear(onelines, 0);
 			free(onelines);
 		}
-
 		log->list.items[i].util = NULL;
 	}
 
diff --git a/shortlog.h b/shortlog.h
index 28d04f951a..8ebee0e2d6 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -18,6 +18,8 @@  struct shortlog {
 	int abbrev;
 	struct date_mode date_mode;
 
+	struct string_list group_filter;
+
 	enum {
 		SHORTLOG_GROUP_AUTHOR = (1 << 0),
 		SHORTLOG_GROUP_COMMITTER = (1 << 1),
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index 8e4effebdb..0695c42ca8 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -346,6 +346,60 @@  test_expect_success 'shortlog can match multiple groups' '
 	test_cmp expect actual
 '
 
+test_expect_success '--group-filter shows only matching groups (single groups)' '
+	cat >expect <<-\EOF &&
+	     1	A U Thor
+	EOF
+	git shortlog -ns \
+		--group=trailer:another-trailer \
+		--group-filter="A U Thor" \
+		-2 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '--group-filter shows only matching groups (multiple groups)' '
+	cat >expect <<-\EOF &&
+	     2	A U Thor
+	EOF
+	git shortlog -ns \
+		--group=author \
+		--group=trailer:some-trailer \
+		--group=trailer:another-trailer \
+		--group-filter="A U Thor" \
+		-2 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '--group-filter can be specified more than once' '
+	cat >expect <<-\EOF &&
+	     2	User B
+	     1	User A
+	EOF
+	git shortlog -ns \
+		--group=author \
+		--group=trailer:some-trailer \
+		--group=trailer:another-trailer \
+		--group-filter="User A" \
+		--group-filter="User B" \
+		-2 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '--no-group-filter reset group filters' '
+	cat >expect <<-\EOF &&
+	     2	A U Thor
+	     2	User B
+	     1	User A
+	EOF
+	git shortlog -ns \
+		--group=author \
+		--group=trailer:some-trailer \
+		--group=trailer:another-trailer \
+		--group-filter="A U Thor" --no-group-filter \
+		-2 HEAD >actual &&
+	test_cmp expect actual
+'
+
 test_expect_success 'shortlog can match multiple format groups' '
 	GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" \
 		git commit --allow-empty -m "identical names" &&