diff mbox series

[v2,8/8] shortlog: allow multiple groups to be specified

Message ID 20200927084015.GH2465761@coredump.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series parsing trailers with shortlog | expand

Commit Message

Jeff King Sept. 27, 2020, 8:40 a.m. UTC
Now that shortlog supports reading from trailers, it can be useful to
combine counts from multiple trailers, or between trailers and authors.
This can be done manually by post-processing the output from multiple
runs, but it's non-trivial to make sure that each name/commit pair is
counted only once.

This patch teaches shortlog to accept multiple --group options on the
command line, and pull data from all of them. That makes it possible to
run:

  git shortlog -ns --group=author --group=trailer:co-authored-by

to get a shortlog that counts authors and co-authors equally.

The implementation is mostly straightforward. The "group" enum becomes a
bitfield, and the trailer key becomes a list. I didn't bother
implementing the multi-group semantics for reading from stdin. It would
be possible to do, but the existing matching code makes it awkward, and
I doubt anybody cares.

The duplicate suppression we used for trailers now covers authors and
committers as well (though in non-trailer single-group mode we can skip
the hash insertion and lookup, since we only see one value per commit).

There is one subtlety: we now care about the case when no group bit is
set (in which case we default to showing the author). The caller in
builtin/log.c needs to be adapted to ask explicitly for authors, rather
than relying on shortlog_init(). It would be possible with some
gymnastics to make this keep working as-is, but it's not worth it for a
single caller.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-shortlog.txt |  7 ++++
 builtin/log.c                  |  1 +
 builtin/shortlog.c             | 64 ++++++++++++++++++-----------
 shortlog.h                     | 10 ++---
 t/t4201-shortlog.sh            | 74 ++++++++++++++++++++++++++++++++++
 5 files changed, 127 insertions(+), 29 deletions(-)

Comments

Junio C Hamano Sept. 27, 2020, 9:18 p.m. UTC | #1
Jeff King <peff@peff.net> writes:

> -		if (strcasecmp(iter.key.buf, log->trailer))
> +		if (!string_list_has_string(&log->trailers, iter.key.buf))
>  			continue;
...
> +	if (!log.groups)
> +		log.groups = SHORTLOG_GROUP_AUTHOR;
> +	string_list_sort(&log.trailers);

I initially reacted with "oh, why sort?" to this line, before
realizing that the list is used as a look-up table, which, if I
recall correctly, you said you want to see us move off of in the
longer term.  

As we already have the string-set in this series, I am wondering
why we are not using it.  It's not like the code at some point
needs to iterate over log.trailers in some stable order, right?

I do realize that going to hashmap might be overkill, but once we
have an easy-to-use wrapper around it, between one "table of
strings" API and another "table of strings" API, I do not see a
reason why we want to choose the string_list.

Thanks.
Jeff King Sept. 28, 2020, 3:25 a.m. UTC | #2
On Sun, Sep 27, 2020 at 02:18:10PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > -		if (strcasecmp(iter.key.buf, log->trailer))
> > +		if (!string_list_has_string(&log->trailers, iter.key.buf))
> >  			continue;
> ...
> > +	if (!log.groups)
> > +		log.groups = SHORTLOG_GROUP_AUTHOR;
> > +	string_list_sort(&log.trailers);
> 
> I initially reacted with "oh, why sort?" to this line, before
> realizing that the list is used as a look-up table, which, if I
> recall correctly, you said you want to see us move off of in the
> longer term.
> 
> As we already have the string-set in this series, I am wondering
> why we are not using it.  It's not like the code at some point
> needs to iterate over log.trailers in some stable order, right?

Right, it's purely a lookup, and the new strset or any hashmap would
work.  However, the trailers list is part of the public shortlog.h[1]
interface, so we'd have to move strset to be a public thing, too. I
think that may be where we want to go eventually, but I thought I'd
trial it as something strictly internal to shortlog.c for now.

I'm also not _too_ bothered by this particular use of string-list as a
lookup table. It's one of the "easy" cases: we build up the list in one
phase, and then do lookups in the other, so sorting in between is no big
deal (though I do think a strset is even easier, as you do not have to
remember to sort between the phases).

-Peff
Junio C Hamano Dec. 28, 2020, 11:29 a.m. UTC | #3
Jeff King <peff@peff.net> writes:

> Now that shortlog supports reading from trailers, it can be useful to
> combine counts from multiple trailers, or between trailers and authors.
> This can be done manually by post-processing the output from multiple
> runs, but it's non-trivial to make sure that each name/commit pair is
> counted only once.
>
> This patch teaches shortlog to accept multiple --group options on the
> command line, and pull data from all of them. That makes it possible to
> run:
>
>   git shortlog -ns --group=author --group=trailer:co-authored-by
>
> to get a shortlog that counts authors and co-authors equally.

As I recently had a chance to revisit an old thread back in v2.3.3
days, I recalled that I wished something like this was available,

  https://lore.kernel.org/git/xmqqd24g6uf1.fsf@gitster.dls.corp.google.com/

where I wished to have a way to count non-author contributions
easily to list them.

I am not comfortable with the idea of changing the release
announcement script immediately before the new release, so I'll do
v2.30.0 with the old "authorship only" script, but I'll play with
the new shortlog feature plus ideas from the old thread to see if we
can give more credit to non author contributors.
Junio C Hamano Feb. 4, 2021, 6:44 a.m. UTC | #4
Junio C Hamano <gitster@pobox.com> writes:

> Jeff King <peff@peff.net> writes:
>
>> Now that shortlog supports reading from trailers, it can be useful to
>> combine counts from multiple trailers, or between trailers and authors.
>> This can be done manually by post-processing the output from multiple
>> runs, but it's non-trivial to make sure that each name/commit pair is
>> counted only once.
>>
>> This patch teaches shortlog to accept multiple --group options on the
>> command line, and pull data from all of them. That makes it possible to
>> run:
>>
>>   git shortlog -ns --group=author --group=trailer:co-authored-by
>>
>> to get a shortlog that counts authors and co-authors equally.
>
> As I recently had a chance to revisit an old thread back in v2.3.3
> days, I recalled that I wished something like this was available,
>
>   https://lore.kernel.org/git/xmqqd24g6uf1.fsf@gitster.dls.corp.google.com/
>
> where I wished to have a way to count non-author contributions
> easily to list them.
>
> I am not comfortable with the idea of changing the release
> announcement script immediately before the new release, so I'll do
> v2.30.0 with the old "authorship only" script, but I'll play with
> the new shortlog feature plus ideas from the old thread to see if we
> can give more credit to non author contributors.

But now it is a good time to do so, as it is not immediately before
a new release.

Here is what I am planning to use for the next release due in mid
March.  We list those who had contribution in the part of the
history before the last release and in the part of the history since
the last release, and run "comm" on these two sets to derive "new
folks" vs "continued supporters".  The list of contributors were
only counting the commit authors in the previous announcement, but
with the "shortlog --group" feature, we can give folks credit for
all kinds of *-by roles.

diff --git a/Announce b/Announce
index 491516b..d6ec20c 100755
--- a/Announce
+++ b/Announce
@@ -44,8 +44,22 @@ esac
 
 vername=$(echo "$vername" | tr "-" ".")
 
-git log --use-mailmap --format='%aN,' "$previous" | sort -u >"$tmpbase.prev"
-git log --use-mailmap --format='%aN,' "$previous..$commit" | sort -u >"$tmpbase.this"
+people () {
+	git shortlog -s --no-merges \
+		--group=author \
+		--group=trailer:co-authored-by \
+		--group=trailer:reviewed-by \
+		--group=trailer:mentored-by \
+		--group=trailer:helped-by \
+		--group=trailer:reported-by \
+		"$@" |
+	sed -e 's/^[ 	0-9]*[ 	]//' -e 's/$/,/' |
+	sort -u
+}
+
+people "$previous" >"$tmpbase.prev"
+people "$previous..$commit" >"$tmpbase.this"
+
 comm -12 "$tmpbase.prev" "$tmpbase.this" >"$tmpbase.old"
 comm -13 "$tmpbase.prev" "$tmpbase.this" >"$tmpbase.new"
diff mbox series

Patch

diff --git a/Documentation/git-shortlog.txt b/Documentation/git-shortlog.txt
index 3db0db2da0..fd93cd41e9 100644
--- a/Documentation/git-shortlog.txt
+++ b/Documentation/git-shortlog.txt
@@ -51,6 +51,7 @@  OPTIONS
 	Group commits based on `<type>`. If no `--group` option is
 	specified, the default is `author`. `<type>` is one of:
 +
+--
  - `author`, commits are grouped by author
  - `committer`, commits are grouped by committer (the same as `-c`)
  - `trailer:<field>`, the `<field>` is interpreted as a case-insensitive
@@ -68,6 +69,12 @@  Shortlog will attempt to parse each trailer value as a `name <email>`
 identity. If successful, the mailmap is applied and the email is omitted
 unless the `--email` option is specified. If the value cannot be parsed
 as an identity, it will be taken literally and completely.
+--
++
+If `--group` is specified multiple times, commits are counted under each
+value (but again, only once per unique value in that commit). For
+example, `git shortlog --group=author --group=trailer:co-authored-by`
+counts both authors and co-authors.
 
 -c::
 --committer::
diff --git a/builtin/log.c b/builtin/log.c
index b8824d898f..7f27e9eca1 100644
--- a/builtin/log.c
+++ b/builtin/log.c
@@ -1195,6 +1195,7 @@  static void make_cover_letter(struct rev_info *rev, int use_stdout,
 	log.in1 = 2;
 	log.in2 = 4;
 	log.file = rev->diffopt.file;
+	log.groups = SHORTLOG_GROUP_AUTHOR;
 	for (i = 0; i < nr; i++)
 		shortlog_add_commit(&log, list[i]);
 
diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 28133aec68..0a5c4968f6 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -130,7 +130,10 @@  static void read_from_stdin(struct shortlog *log)
 	static const char *committer_match[2] = { "Commit: ", "committer " };
 	const char **match;
 
-	switch (log->group) {
+	if (HAS_MULTI_BITS(log->groups))
+		die(_("using multiple --group options with stdin is not supported"));
+
+	switch (log->groups) {
 	case SHORTLOG_GROUP_AUTHOR:
 		match = author_match;
 		break;
@@ -221,13 +224,13 @@  static void strset_clear(struct strset *ss)
 }
 
 static void insert_records_from_trailers(struct shortlog *log,
+					 struct strset *dups,
 					 struct commit *commit,
 					 struct pretty_print_context *ctx,
 					 const char *oneline)
 {
 	struct trailer_iterator iter;
 	const char *commit_buffer, *body;
-	struct strset dups = STRSET_INIT;
 	struct strbuf ident = STRBUF_INIT;
 
 	/*
@@ -243,28 +246,28 @@  static void insert_records_from_trailers(struct shortlog *log,
 	while (trailer_iterator_advance(&iter)) {
 		const char *value = iter.val.buf;
 
-		if (strcasecmp(iter.key.buf, log->trailer))
+		if (!string_list_has_string(&log->trailers, iter.key.buf))
 			continue;
 
 		strbuf_reset(&ident);
 		if (!parse_ident(log, &ident, value))
 			value = ident.buf;
 
-		if (strset_check_and_add(&dups, value))
+		if (strset_check_and_add(dups, value))
 			continue;
 		insert_one_record(log, value, oneline);
 	}
 	trailer_iterator_release(&iter);
 
 	strbuf_release(&ident);
-	strset_clear(&dups);
 	unuse_commit_buffer(commit, commit_buffer);
 }
 
 void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 {
 	struct strbuf ident = STRBUF_INIT;
 	struct strbuf oneline = STRBUF_INIT;
+	struct strset dups = STRSET_INIT;
 	struct pretty_print_context ctx = {0};
 	const char *oneline_str;
 
@@ -282,24 +285,29 @@  void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	}
 	oneline_str = oneline.len ? oneline.buf : "<none>";
 
-	switch (log->group) {
-	case SHORTLOG_GROUP_AUTHOR:
+	if (log->groups & SHORTLOG_GROUP_AUTHOR) {
+		strbuf_reset(&ident);
 		format_commit_message(commit,
 				      log->email ? "%aN <%aE>" : "%aN",
 				      &ident, &ctx);
-		insert_one_record(log, ident.buf, oneline_str);
-		break;
-	case SHORTLOG_GROUP_COMMITTER:
+		if (!HAS_MULTI_BITS(log->groups) ||
+		    !strset_check_and_add(&dups, ident.buf))
+			insert_one_record(log, ident.buf, oneline_str);
+	}
+	if (log->groups & SHORTLOG_GROUP_COMMITTER) {
+		strbuf_reset(&ident);
 		format_commit_message(commit,
 				      log->email ? "%cN <%cE>" : "%cN",
 				      &ident, &ctx);
-		insert_one_record(log, ident.buf, oneline_str);
-		break;
-	case SHORTLOG_GROUP_TRAILER:
-		insert_records_from_trailers(log, commit, &ctx, oneline_str);
-		break;
+		if (!HAS_MULTI_BITS(log->groups) ||
+		    !strset_check_and_add(&dups, ident.buf))
+			insert_one_record(log, ident.buf, oneline_str);
+	}
+	if (log->groups & SHORTLOG_GROUP_TRAILER) {
+		insert_records_from_trailers(log, &dups, commit, &ctx, oneline_str);
 	}
 
+	strset_clear(&dups);
 	strbuf_release(&ident);
 	strbuf_release(&oneline);
 }
@@ -366,14 +374,16 @@  static int parse_group_option(const struct option *opt, const char *arg, int uns
 	struct shortlog *log = opt->value;
 	const char *field;
 
-	if (unset || !strcasecmp(arg, "author"))
-		log->group = SHORTLOG_GROUP_AUTHOR;
+	if (unset) {
+		log->groups = 0;
+		string_list_clear(&log->trailers, 0);
+	} else if (!strcasecmp(arg, "author"))
+		log->groups |= SHORTLOG_GROUP_AUTHOR;
 	else if (!strcasecmp(arg, "committer"))
-		log->group = SHORTLOG_GROUP_COMMITTER;
+		log->groups |= SHORTLOG_GROUP_COMMITTER;
 	else if (skip_prefix(arg, "trailer:", &field)) {
-		log->group = SHORTLOG_GROUP_TRAILER;
-		free(log->trailer);
-		log->trailer = xstrdup(field);
+		log->groups |= SHORTLOG_GROUP_TRAILER;
+		string_list_append(&log->trailers, field);
 	} else
 		return error(_("unknown group type: %s"), arg);
 
@@ -391,6 +401,8 @@  void shortlog_init(struct shortlog *log)
 	log->wrap = DEFAULT_WRAPLEN;
 	log->in1 = DEFAULT_INDENT1;
 	log->in2 = DEFAULT_INDENT2;
+	log->trailers.strdup_strings = 1;
+	log->trailers.cmp = strcasecmp;
 }
 
 int cmd_shortlog(int argc, const char **argv, const char *prefix)
@@ -400,9 +412,9 @@  int cmd_shortlog(int argc, const char **argv, const char *prefix)
 	int nongit = !startup_info->have_repository;
 
 	const struct option options[] = {
-		OPT_SET_INT('c', "committer", &log.group,
-			    N_("Group by committer rather than author"),
-			    SHORTLOG_GROUP_COMMITTER),
+		OPT_BIT('c', "committer", &log.groups,
+			N_("Group by committer rather than author"),
+			SHORTLOG_GROUP_COMMITTER),
 		OPT_BOOL('n', "numbered", &log.sort_by_number,
 			 N_("sort output according to the number of commits per author")),
 		OPT_BOOL('s', "summary", &log.summary,
@@ -454,6 +466,10 @@  int cmd_shortlog(int argc, const char **argv, const char *prefix)
 	log.abbrev = rev.abbrev;
 	log.file = rev.diffopt.file;
 
+	if (!log.groups)
+		log.groups = SHORTLOG_GROUP_AUTHOR;
+	string_list_sort(&log.trailers);
+
 	/* assume HEAD if from a tty */
 	if (!nongit && !rev.pending.nr && isatty(0))
 		add_head_to_pending(&rev);
diff --git a/shortlog.h b/shortlog.h
index 89c2dbc5e6..64be879b24 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -17,11 +17,11 @@  struct shortlog {
 	int abbrev;
 
 	enum {
-		SHORTLOG_GROUP_AUTHOR = 0,
-		SHORTLOG_GROUP_COMMITTER,
-		SHORTLOG_GROUP_TRAILER,
-	} group;
-	char *trailer;
+		SHORTLOG_GROUP_AUTHOR = (1 << 0),
+		SHORTLOG_GROUP_COMMITTER = (1 << 1),
+		SHORTLOG_GROUP_TRAILER = (1 << 2),
+	} groups;
+	struct string_list trailers;
 
 	char *common_repo_prefix;
 	int email;
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index a62ee9ed55..3d5c4a2086 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -282,4 +282,78 @@  test_expect_success 'shortlog de-duplicates trailers in a single commit' '
 	test_cmp expect actual
 '
 
+test_expect_success 'shortlog can match multiple groups' '
+	git commit --allow-empty -F - <<-\EOF &&
+	subject one
+
+	this has two trailers that are distinct from the author; it will count
+	3 times in the output
+
+	Some-trailer: User A <a@example.com>
+	Another-trailer: User B <b@example.com>
+	EOF
+
+	git commit --allow-empty -F - <<-\EOF &&
+	subject two
+
+	this one has two trailers, one of which is a duplicate with the author;
+	it will only be counted once for them
+
+	Another-trailer: A U Thor <author@example.com>
+	Some-trailer: User B <b@example.com>
+	EOF
+
+	cat >expect <<-\EOF &&
+	     2	A U Thor
+	     2	User B
+	     1	User A
+	EOF
+	git shortlog -ns \
+		--group=author \
+		--group=trailer:some-trailer \
+		--group=trailer:another-trailer \
+		-2 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'set up option selection tests' '
+	git commit --allow-empty -F - <<-\EOF
+	subject
+
+	body
+
+	Trailer-one: value-one
+	Trailer-two: value-two
+	EOF
+'
+
+test_expect_success '--no-group resets group list to author' '
+	cat >expect <<-\EOF &&
+	     1	A U Thor
+	EOF
+	git shortlog -ns \
+		--group=committer \
+		--group=trailer:trailer-one \
+		--no-group \
+		-1 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success '--no-group resets trailer list' '
+	cat >expect <<-\EOF &&
+	     1	value-two
+	EOF
+	git shortlog -ns \
+		--group=trailer:trailer-one \
+		--no-group \
+		--group=trailer:trailer-two \
+		-1 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'stdin with multiple groups reports error' '
+	git log >log &&
+	test_must_fail git shortlog --group=author --group=committer <log
+'
+
 test_done