diff mbox series

[RFC,v2] shortlog: add group-by options for year and month

Message ID 20220922232536.40807-1-jacob@initialcommit.io (mailing list archive)
State New, archived
Headers show
Series [RFC,v2] shortlog: add group-by options for year and month | expand

Commit Message

Jacob Stopak Sept. 22, 2022, 11:25 p.m. UTC
It can be useful to group commits using time-based attributes in
addition to author/committer. Currently, this can somewhat be
done using "git shortlog --since=x --until=y", however all commits
will be displayed in that single time chunk, grouped by author.

However, much more versatile time groupings can be achieved by adding
options to group by year or month. This can lead to more interesting
commit summaries breaking down the commits an author made during each
year or month, using something like:

"git shortlog --group=month --group=author --author=Stopak"

Shorthand flags added for month grouping are "-m" or "--month", and for
year groupings are "-y" or "--year".

If grouped _only_ by month or year (ie no "--group=author" option),
shortlog will group commits made by ALL authors during each time period.

It turns out that combining these with existing flags "-s" or "-n" or
both leads to various useful grouped commit summaries which can be
ordered chronologically (default) or based on number of commits during
each time period (when the "-n" flag is added), as in:

"git shortlog -nsy"

Furthermore, these new groupings can be combined with "--since" or
"--until" to generate yearly or monthly groupings within those
overarching time slices.

Note that (at least for now) using the year and month groupings
are not supported when shortlog is reading from stdin, since the
default log output for date might require some assumptions to reformat
during parsing, mainly regarding the first 2 digits of the year.

Since the year and/or month part used for grouping comes directly
from each commit, and commits are already being parsed by the existing
shortlog logic, I don't think adding these new flags should have a
noticeable performance impact. The only added time should be to format
the month or year into the shortlog messages. The ordering was already
handled by the existing shortlog output logic.

Signed-off-by: Jacob Stopak <jacob@initialcommit.io>
---
Thx a lot for the feedback! Added v2 patch here.

> I would actually skip using/testing `git help` and just go
> straight for the rendered page using, e.g, something like ...

Cool I was able to use your steps to get the local man page working.

> One of the nice things about `--group` is that we can potentially
> have many groupings without having to carry correspondingly many `--option`s.
>
> In particular, it might be wise to wait with implementing `-y` and `-m`
> until we know that your new feature turns out to be so hugely successful
> that people start craving `-m` as a short form for `--group=month`. ;-)

Haha yes I might have gotten a bit excited with the shorthand flags, BUT
let me give my pitch for why I think it makes sense to keep them :)

Adding these new time-based groups makes it more likely that folks will
specify multiple groups at once, (like pairing with "--group=author")
which makes it painful to write "--group=..." over & over again,
especially when running/editing the command many times.

Also, having shorthands -y and -m pairs very nicely when using other
shorthand flags -s, -n, and -c, for stuff like "git shortlog -nsy".

To justify why the new year/month groups should have shorthand flags
while "--group=trailer:value" does not, I'd say that the fact that
trailer requires a custom value would make the shorthand version clunky,
and it wouldn't fit in well with other shorthand options like -n or -s.
Since year/month have no custom value, the flags make a bit more sense
and would match up with how "-c, --committer" currently works.

So overall the shorthand flags can be more convenient in several ways,
which increases the odds folks will use the new feature often :D. Thoughts?

> This trips up `-Werror=declaration-after-statement`. If you build with
> `DEVELOPER=Yes`, you should see the same thing.

Hm, I tried setting the DEVELOPER flag at the top of Makefile, and also
passing as argument to "make DEVELOPER=Yes git-shortlog", but didn't see
those warnings - I'm on a mac fwiw. Anyway I moved the statement after
the declarations so I think it should be fixed.

> I can easily imagine going even more granular with this
> (`--group=week`?), but that can wait for some other time. :-)

I'd love to add a week option in the future if this gets accepted...

> BTW, I got this when `git am`-ing your patch: ...

Fixed those pesky whitespace issues!

 Documentation/git-shortlog.txt | 10 ++++
 builtin/shortlog.c             | 83 ++++++++++++++++++++++++++++++----
 shortlog.h                     |  2 +
 t/t4201-shortlog.sh            | 42 +++++++++++++++++
 4 files changed, 127 insertions(+), 10 deletions(-)

Comments

Junio C Hamano Sept. 23, 2022, 4:17 p.m. UTC | #1
Jacob Stopak <jacob@initialcommit.io> writes:

> To justify why the new year/month groups should have shorthand flags
> while "--group=trailer:value" does not, I'd say that the fact that
> trailer requires a custom value would make the shorthand version clunky,
> and it wouldn't fit in well with other shorthand options like -n or -s.
> Since year/month have no custom value, the flags make a bit more sense
> and would match up with how "-c, --committer" currently works.

It is an explanation why it is easier to implement and design --year
etc., and does not justify why adding --year etc. is a good idea at
all.  Let's not add the "aliases" in the same patch.

>   - `author`, commits are grouped by author
>   - `committer`, commits are grouped by committer (the same as `-c`)
> + - `month`, commits are grouped by month (the same as `-m`)
> + - `year`, commits are grouped by year (the same as `-y`)

It is unclear what timestamp is used, how a "month" is defined, etc.
As "git shortlog --since=2.years" cuts off based on the committer
timestamp, I would expect that the committer timestamps are used for
this grouping as well?  If I make a commit on the first day of the
month in my timezone, but that instant happens to be still on the
last day of the previous month in your timezone, which month would
your invocation of "git shortlog --group=month" would the commit be
attributed?  My month, or your month?

Does it make sense to even say "group by month and year"?  I expect
that it would mean the same thing as "group by month", and if that
is the case, the command probably should error out or at least warn
if both are given.  An alternative interpretation could be, when
told to "group by month", group the commits made in September 2022
into the same group as the group for commits made in September in
all other years, but I do not know how useful it would be.

Not a suggestion to use a different implementation or add a new
feature on top of this --group-by-time-range idea, but I wonder if
it is a more flexible and generalizeable approach to say "formulate
this value given by the --format=<format> string, apply this regular
expression match, and group by the subexpression value".  E.g.

    git shortlog \
	--group-by-value="%cI" \
	--group-by-regexp="^(\d{4}-\d{2})"

would "formulate the 'committer date in ISO format' value, and apply
the 'grab leading 4 digits followed by a dash followed by 2 digits'
regexp, and group by the matched part".

That's a better way to implement "group by month" internally, and
allow more flexibility.  If a project is well disciplined and its
commit titles follow the "<area>: <description>" convention, you
probably could do

    git shortlog --no-merges \
	--group-by-value="%s" \
	--group-by-regexp="^([^:]+):"

and group by <area> each commit touches.  Of course, existing
--committer and --author can also be internally reimplemented using
the same mechanism.

> @@ -80,6 +82,14 @@ counts both authors and co-authors.
>  --committer::
>  	This is an alias for `--group=committer`.
>  
> +-m::
> +--month::
> +	This is an alias for `--group=month`.
> +
> +-y::
> +--year::
> +	This is an alias for `--group=year`.
> +

Let's not add this in the same patch.  I am fairly negative on
adding more, outside "--group".  Besides, we do not have a good
answer to those who want to group by week.  -w is already taken.
Jacob Stopak Sept. 23, 2022, 9:19 p.m. UTC | #2
On Fri, Sep 23, 2022 at 09:17:10AM -0700, Junio C Hamano wrote:

> It is unclear what timestamp is used, how a "month" is defined, etc.
> As "git shortlog --since=2.years" cuts off based on the committer
> timestamp, I would expect that the committer timestamps are used for
> this grouping as well?

It uses the "commit->date" member from the commit struct in commit.h,
which I assumed was the committer timestamp, but I'll confirm that's
what actually gets populated in there since I don't see a separate
member for the author timestamp...

> If I make a commit on the first day of the
> month in my timezone, but that instant happens to be still on the
> last day of the previous month in your timezone, which month would
> your invocation of "git shortlog --group=month" would the commit be
> attributed?  My month, or your month?

I need to look into how Git typically handles these timezone differences
and will try to apply similar behavior for these time-based groupings.

> Does it make sense to even say "group by month and year"?  I expect
> that it would mean the same thing as "group by month", and if that
> is the case, the command probably should error out or at least warn
> if both are given.

Yes, "group by month and year" and "group by month" means the same
thing the way I implemented. If both groups are supplied, it will
just ignore the year group and group by month like you mentioned,
by flipping off the YEAR bit as follows:

if ((log->groups & SHORTLOG_GROUP_MONTH) &&
    (log->groups & SHORTLOG_GROUP_YEAR))
	log->groups ^= SHORTLOG_GROUP_YEAR;

I can add a warning message to make this more clear to the user.

> An alternative interpretation could be, when
> told to "group by month", group the commits made in September 2022
> into the same group as the group for commits made in September in
> all other years, but I do not know how useful it would be.

In data analytics terms this is usually referred to as "month of year",
and personally I see it less useful in Git's shortlog context because I
envision more folks would find output useful for single or consecutive
time periods. However, adding a "month of year" grouping could be useful
to answer questions like "what periods throughout the year are contributors
most active?". If we decide to add a "month of year" grouping option as
well, it would be trivial to include.
 
> Not a suggestion to use a different implementation or add a new
> feature on top of this --group-by-time-range idea, but I wonder if
> it is a more flexible and generalizeable approach to say "formulate
> this value given by the --format=<format> string, apply this regular
> expression match, and group by the subexpression value".  E.g.
> 
>     git shortlog \
> 	--group-by-value="%cI" \
> 	--group-by-regexp="^(\d{4}-\d{2})"
> 
> would "formulate the 'committer date in ISO format' value, and apply
> the 'grab leading 4 digits followed by a dash followed by 2 digits'
> regexp, and group by the matched part".
> 
> That's a better way to implement "group by month" internally, and
> allow more flexibility.  If a project is well disciplined and its
> commit titles follow the "<area>: <description>" convention, you
> probably could do
> 
>     git shortlog --no-merges \
> 	--group-by-value="%s" \
> 	--group-by-regexp="^([^:]+):"
> 
> and group by <area> each commit touches.  Of course, existing
> --committer and --author can also be internally reimplemented using
> the same mechanism.

At first look this sounds very flexible and appealing, and I would be 
interested in exploring a refactor to this in the future. I think the
rub is that supplying custom patterns wouldn't necessarily stack up
neatly into good groups, which could lead to confusing results for the
user in terms of both grouping and sorting. But like you mentioned
it could be really cool if used judiciously for a consistent history
like Git's. And the generalized re-implementation of the current
shortlog groups would be a nice bonus.

> > @@ -80,6 +82,14 @@ counts both authors and co-authors.
> >  --committer::
> >  	This is an alias for `--group=committer`.
> >  
> > +-m::
> > +--month::
> > +	This is an alias for `--group=month`.
> > +
> > +-y::
> > +--year::
> > +	This is an alias for `--group=year`.
> > +
> 
> Let's not add this in the same patch.  I am fairly negative on
> adding more, outside "--group".  Besides, we do not have a good
> answer to those who want to group by week.  -w is already taken.

No worries - I'll remove the shorthand flags for v3.
Jeff King Sept. 23, 2022, 9:58 p.m. UTC | #3
On Fri, Sep 23, 2022 at 09:17:10AM -0700, Junio C Hamano wrote:

> Not a suggestion to use a different implementation or add a new
> feature on top of this --group-by-time-range idea, but I wonder if
> it is a more flexible and generalizeable approach to say "formulate
> this value given by the --format=<format> string, apply this regular
> expression match, and group by the subexpression value".  E.g.
> 
>     git shortlog \
> 	--group-by-value="%cI" \
> 	--group-by-regexp="^(\d{4}-\d{2})"

Heh, I was about to make the exact same suggestion. The existing
"--group=author" could really just be "--group='%an <%ae>'" (or variants
depending on the "-e" flag).

I don't think you even really need the regexp. If we respect --date,
then you should be able to ask for --date=format:%Y-%m. Unfortunately
there's no way to specify the format as part of the placeholder. The
for-each-ref formatter understands this, like:

  %(authordate:format:%Y-%m)

I wouldn't be opposed to teaching the git-log formatter something
similar.

> That's a better way to implement "group by month" internally, and
> allow more flexibility.  If a project is well disciplined and its
> commit titles follow the "<area>: <description>" convention, you
> probably could do
> 
>     git shortlog --no-merges \
> 	--group-by-value="%s" \
> 	--group-by-regexp="^([^:]+):"
> 
> and group by <area> each commit touches.  Of course, existing
> --committer and --author can also be internally reimplemented using
> the same mechanism.

This example makes the regex feature more interesting, because it's not
something we'd likely have a unique placeholder for (or maybe we
should?).

But there's something else interesting going on in Jack's patch, which
is that he's not just introducing the date-sorting, but also that it's
used in conjunction with other sorting. So really the intended use is
something like:

  git shortlog --group:author --group:%Y-%m

I think we'd want to allow the general form to be a series of groupings.
In the output from his patch it looks like:

  2022-09 Jeff King
     some commit message
     another commit message

I.e., the groups are collapsed into a single string, and unique strings
become their own groups (and are sorted in the usual way).

If you give up the regex thing, then that naturally falls out as
(imagining we learn about authordate as a placeholder):

  git shortlog --group='%(authordate:format=%Y-%n) %an'

without having to implement multiple groupings as a specific feature
(which is both more code, but also has user-facing confusion about when
--group overrides versus appends). That also skips the question of which
--group-by-regex applies to which --group-by-value.

I do agree the regex thing is more flexible, but if we can't come up
with a case more compelling than subsystem matching, I'd just as soon
add %(subject:subsystem) or similar. :)

-Peff
Junio C Hamano Sept. 23, 2022, 10:06 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> If you give up the regex thing, then that naturally falls out as
> (imagining we learn about authordate as a placeholder):
>
>   git shortlog --group='%(authordate:format=%Y-%n) %an'
>
> without having to implement multiple groupings as a specific feature
> (which is both more code, but also has user-facing confusion about when
> --group overrides versus appends). That also skips the question of which
> --group-by-regex applies to which --group-by-value.
>
> I do agree the regex thing is more flexible, but if we can't come up
> with a case more compelling than subsystem matching, I'd just as soon
> add %(subject:subsystem) or similar. :)

;-)  I like that as a general direction.
Jacob Stopak Sept. 24, 2022, 4:38 a.m. UTC | #5
On Fri, Sep 23, 2022 at 05:58:56PM -0400, Jeff King wrote:
> I don't think you even really need the regexp. If we respect --date,
> then you should be able to ask for --date=format:%Y-%m.

Hmm I tried passing in --date=format:... to my patched shortlog command
along with setting some date placeholder like "... %cd ..." in the code,
but it's not picking up on the format. Do you know how the date format
can be wedged into the format_commit_message(...) "format" argument?

> Unfortunately there's no way to specify the format as part of the
> placeholder. The for-each-ref formatter understands this, like:
> 
>   %(authordate:format:%Y-%m)
>
> I wouldn't be opposed to teaching the git-log formatter something
> similar.

Oh that would solve my problem... Would it be a hefty effort to teach
this to the git-log formatter?

> But there's something else interesting going on in Jack's patch, which
> is that he's not just introducing the date-sorting, but also that it's
> used in conjunction with other sorting. So really the intended use is
> something like:
> 
>   git shortlog --group:author --group:%Y-%m

Yes I sort of stumbled on this and realized that this way I wouldn't have
to touch the actual sorting or grouping functionality at all, which was
already working properly. I just needed to reformat the shortlog message to
include the year and/or month in a way that kept things consistent.

> I think we'd want to allow the general form to be a series of groupings.
> In the output from his patch it looks like:
> 
>   2022-09 Jeff King
>      some commit message
>      another commit message
> 
> I.e., the groups are collapsed into a single string, and unique strings
> become their own groups (and are sorted in the usual way).
> 
> If you give up the regex thing, then that naturally falls out as
> (imagining we learn about authordate as a placeholder):
> 
>   git shortlog --group='%(authordate:format=%Y-%n) %an'
> 
> without having to implement multiple groupings as a specific feature
> (which is both more code, but also has user-facing confusion about when
> --group overrides versus appends). That also skips the question of which
> --group-by-regex applies to which --group-by-value.
> 
> I do agree the regex thing is more flexible, but if we can't come up
> with a case more compelling than subsystem matching, I'd just as soon
> add %(subject:subsystem) or similar. :)
> 
> -Peff

I like this idea too. Since it requires a larger re-implementation,
maybe I can pursue this going forward. I assume if we did this we would
keep the existing group options like "--group=author" as shortcuts, and
refactor them behind the scenes to use the new method. If so it may be
useful to add my originally suggested options of "--group=year" and
"--group=month" as well for convenient default time-based groupings.

How do you feel about me submitting a v3 patch of my initial suggested
implementation of new group options for year and month? Then going forward
I can work on generalizing the grouping feature the way Peff suggested.

-Jack
Taylor Blau Oct. 5, 2022, 9:43 p.m. UTC | #6
On Fri, Sep 23, 2022 at 05:58:56PM -0400, Jeff King wrote:
> On Fri, Sep 23, 2022 at 09:17:10AM -0700, Junio C Hamano wrote:
>
> > Not a suggestion to use a different implementation or add a new
> > feature on top of this --group-by-time-range idea, but I wonder if
> > it is a more flexible and generalizeable approach to say "formulate
> > this value given by the --format=<format> string, apply this regular
> > expression match, and group by the subexpression value".  E.g.
> >
> >     git shortlog \
> > 	--group-by-value="%cI" \
> > 	--group-by-regexp="^(\d{4}-\d{2})"
>
> Heh, I was about to make the exact same suggestion. The existing
> "--group=author" could really just be "--group='%an <%ae>'" (or variants
> depending on the "-e" flag).

This caught my attention, so I wanted to see how hard it would be to
implement. It actually is quite straightforward, and gets us most of the
way to being able to get the same functionality as in Jacob's patch
(minus being able to do the for-each-ref-style sub-selectors, like
`%(authordate:format=%Y-%m)`).

Here's the patch:

--- >8 ---

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 7a1e1fe7c0..68880e8867 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -200,6 +200,29 @@ static void insert_records_from_trailers(struct shortlog *log,
 	unuse_commit_buffer(commit, commit_buffer);
 }

+static void insert_record_from_pretty(struct shortlog *log,
+				      struct strset *dups,
+				      struct commit *commit,
+				      struct pretty_print_context *ctx,
+				      const char *oneline)
+{
+	struct strbuf ident = STRBUF_INIT;
+	size_t i;
+
+	for (i = 0; i < log->pretty.nr; i++) {
+		if (i)
+			strbuf_addch(&ident, ' ');
+
+		format_commit_message(commit, log->pretty.items[i].string,
+				      &ident, ctx);
+	}
+
+	if (strset_add(dups, ident.buf))
+		insert_one_record(log, ident.buf, oneline);
+
+	strbuf_release(&ident);
+}
+
 void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 {
 	struct strbuf ident = STRBUF_INIT;
@@ -243,6 +266,8 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	if (log->groups & SHORTLOG_GROUP_TRAILER) {
 		insert_records_from_trailers(log, &dups, commit, &ctx, oneline_str);
 	}
+	if (log->groups & SHORTLOG_GROUP_PRETTY)
+		insert_record_from_pretty(log, &dups, commit, &ctx, oneline_str);

 	strset_clear(&dups);
 	strbuf_release(&ident);
@@ -321,8 +346,10 @@ static int parse_group_option(const struct option *opt, const char *arg, int uns
 	else if (skip_prefix(arg, "trailer:", &field)) {
 		log->groups |= SHORTLOG_GROUP_TRAILER;
 		string_list_append(&log->trailers, field);
-	} else
-		return error(_("unknown group type: %s"), arg);
+	} else {
+		log->groups |= SHORTLOG_GROUP_PRETTY;
+		string_list_append(&log->pretty, arg);
+	}

 	return 0;
 }
@@ -340,6 +367,7 @@ void shortlog_init(struct shortlog *log)
 	log->in2 = DEFAULT_INDENT2;
 	log->trailers.strdup_strings = 1;
 	log->trailers.cmp = strcasecmp;
+	log->pretty.strdup_strings = 1;
 }

 int cmd_shortlog(int argc, const char **argv, const char *prefix)
diff --git a/shortlog.h b/shortlog.h
index 3f7e9aabca..d7caecb76f 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -20,8 +20,10 @@ struct shortlog {
 		SHORTLOG_GROUP_AUTHOR = (1 << 0),
 		SHORTLOG_GROUP_COMMITTER = (1 << 1),
 		SHORTLOG_GROUP_TRAILER = (1 << 2),
+		SHORTLOG_GROUP_PRETTY = (1 << 3),
 	} groups;
 	struct string_list trailers;
+	struct string_list pretty;

 	int email;
 	struct string_list mailmap;

--- 8< ---

> I don't think you even really need the regexp. If we respect --date,
> then you should be able to ask for --date=format:%Y-%m. Unfortunately
> there's no way to specify the format as part of the placeholder. The
> for-each-ref formatter understands this, like:
>
>   %(authordate:format:%Y-%m)
>
> I wouldn't be opposed to teaching the git-log formatter something
> similar.

Yeah, I think having a similar mechanism there would be useful, and
certainly a prerequisite to being able to achieve what Jacob has done
here with the more general approach.

I think you could also do some cleanup on top, like replacing the
SHORTLOG_GROUP_AUTHOR mode with adding either "%aN <%aE>" (or "%aN",
without --email) as an entry in the `pretty` string_list.

Thanks,
Taylor
Jeff King Oct. 5, 2022, 10:14 p.m. UTC | #7
On Fri, Sep 23, 2022 at 09:38:29PM -0700, Jacob Stopak wrote:

> Hmm I tried passing in --date=format:... to my patched shortlog command
> along with setting some date placeholder like "... %cd ..." in the code,
> but it's not picking up on the format. Do you know how the date format
> can be wedged into the format_commit_message(...) "format" argument?

It comes to the format code via the pretty_print_context. And we pick up
the --date command via setup_revisions(), where it ends up in
rev_info.date_mode.

In a normal git-log, I think that data gets shuffled across by
show_log(). But shortlog has its own traversal.

I think something like this:

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 7a1e1fe7c0..53c379a51d 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -211,7 +211,7 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	ctx.fmt = CMIT_FMT_USERFORMAT;
 	ctx.abbrev = log->abbrev;
 	ctx.print_email_subject = 1;
-	ctx.date_mode.type = DATE_NORMAL;
+	ctx.date_mode = log->date_mode;
 	ctx.output_encoding = get_log_output_encoding();
 
 	if (!log->summary) {
@@ -407,6 +407,7 @@ int cmd_shortlog(int argc, const char **argv, const char *prefix)
 	log.user_format = rev.commit_format == CMIT_FMT_USERFORMAT;
 	log.abbrev = rev.abbrev;
 	log.file = rev.diffopt.file;
+	log.date_mode = rev.date_mode;
 
 	if (!log.groups)
 		log.groups = SHORTLOG_GROUP_AUTHOR;
diff --git a/shortlog.h b/shortlog.h
index 3f7e9aabca..ef3a3dbc65 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -15,6 +15,7 @@ struct shortlog {
 	int in2;
 	int user_format;
 	int abbrev;
+	struct date_mode date_mode;
 
 	enum {
 		SHORTLOG_GROUP_AUTHOR = (1 << 0),

is enough. At least it allows:

  git shortlog --format='%ad %s' --date=format:%Y-%m

to work as you'd expect (but of course that's just the output for each
commit that we show, not the actual grouping).

> > Unfortunately there's no way to specify the format as part of the
> > placeholder. The for-each-ref formatter understands this, like:
> > 
> >   %(authordate:format:%Y-%m)
> >
> > I wouldn't be opposed to teaching the git-log formatter something
> > similar.
> 
> Oh that would solve my problem... Would it be a hefty effort to teach
> this to the git-log formatter?

Probably not a huge amount of work. But it puts us in a weird in-between
situation where we support _one_ of the more advanced ref-filter
placeholders, but not the others. And of course no code is shared.

That might be OK, as long as the syntax and semantics are identical to
what ref-filter can do. Then in the long run, if we eventually merge the
two implementations, there's no compatibility problem.

That said, I think it may just be easier to respect --date, as above.
It's not quite as flexible, but it's probably flexible enough.

-Peff
Jeff King Oct. 5, 2022, 10:26 p.m. UTC | #8
On Wed, Oct 05, 2022 at 05:43:20PM -0400, Taylor Blau wrote:

> > Heh, I was about to make the exact same suggestion. The existing
> > "--group=author" could really just be "--group='%an <%ae>'" (or variants
> > depending on the "-e" flag).
> 
> This caught my attention, so I wanted to see how hard it would be to
> implement. It actually is quite straightforward, and gets us most of the
> way to being able to get the same functionality as in Jacob's patch
> (minus being able to do the for-each-ref-style sub-selectors, like
> `%(authordate:format=%Y-%m)`).

Yeah, your patch is about what I'd expect.

The date thing I think can be done with --date; I just sent a sketch in
another part of the thread.

> +static void insert_record_from_pretty(struct shortlog *log,
> +				      struct strset *dups,
> +				      struct commit *commit,
> +				      struct pretty_print_context *ctx,
> +				      const char *oneline)
> +{
> +	struct strbuf ident = STRBUF_INIT;
> +	size_t i;
> +
> +	for (i = 0; i < log->pretty.nr; i++) {
> +		if (i)
> +			strbuf_addch(&ident, ' ');
> +
> +		format_commit_message(commit, log->pretty.items[i].string,
> +				      &ident, ctx);
> +	}

So here you're allowing multiple pretty options. But really, once we
allow the user an arbitrary format, is there any reason for them to do:

  git shortlog --group=%an --group=%ad

versus just:

  git shortlog --group='%an %ad'

?

>  void shortlog_add_commit(struct shortlog *log, struct commit *commit)
>  {
>  	struct strbuf ident = STRBUF_INIT;
> @@ -243,6 +266,8 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
>  	if (log->groups & SHORTLOG_GROUP_TRAILER) {
>  		insert_records_from_trailers(log, &dups, commit, &ctx, oneline_str);
>  	}
> +	if (log->groups & SHORTLOG_GROUP_PRETTY)
> +		insert_record_from_pretty(log, &dups, commit, &ctx, oneline_str);

I was puzzled at first that this was a bitwise check. But I forgot that
we added support for --group options already, in 63d24fa0b0 (shortlog:
allow multiple groups to be specified, 2020-09-27).

So a plan like:

  git shortlog --group=author --group=date

(as in the original patch in this thread) doesn't quite work, I think.
Because the semantics for multiple --group lines are that the commit is
credited individually to each ident. That's what lets you do:

  git shortlog -ns --group=author --group=trailer:co-authored-by

and credit authors and co-authors equally. So likewise, I think multiple
group-format options don't really make sense (or at least, do not make
sense to concatenate; you'd put each key in its own single format).

> @@ -321,8 +346,10 @@ static int parse_group_option(const struct option *opt, const char *arg, int uns
>  	else if (skip_prefix(arg, "trailer:", &field)) {
>  		log->groups |= SHORTLOG_GROUP_TRAILER;
>  		string_list_append(&log->trailers, field);
> -	} else
> -		return error(_("unknown group type: %s"), arg);
> +	} else {
> +		log->groups |= SHORTLOG_GROUP_PRETTY;
> +		string_list_append(&log->pretty, arg);
> +	}

We probably want to insist that the format contains a "%" sign, and/or
git it a keyword like "format:". Otherwise a typo like:

  git shortlog --format=autor

stops being an error we detect, and just returns nonsense results
(every commit has the same ident).

I think you'd want to detect SHORTLOG_GROUP_PRETTY in the
read_from_stdin() path, too. And probably just die() with "not
supported", like we do for trailers.

> I think you could also do some cleanup on top, like replacing the
> SHORTLOG_GROUP_AUTHOR mode with adding either "%aN <%aE>" (or "%aN",
> without --email) as an entry in the `pretty` string_list.

Yeah, that would be a nice cleanup. I think might even be a good idea to
explain the various options to the users in terms of "--author is
equivalent to %aN <%aE>". It may help them understand how the tool
works.

-Peff
Jacob Stopak Oct. 7, 2022, 12:48 a.m. UTC | #9
On Wed, Oct 05, 2022 at 06:26:57PM -0400, Jeff King wrote:
> On Wed, Oct 05, 2022 at 05:43:20PM -0400, Taylor Blau wrote:
> 
> > This caught my attention, so I wanted to see how hard it would be to
> > implement. It actually is quite straightforward, and gets us most of the
> > way to being able to get the same functionality as in Jacob's patch
> > (minus being able to do the for-each-ref-style sub-selectors, like
> > `%(authordate:format=%Y-%m)`).

Thanks Taylor!! This looks awesome and helped me understand how the pretty
context stuff works. I was able to apply your patch locally and test,
and plan to continue working off of this :D. Like Peff mentioned seems to
be a few usage details to hammer out.

> The date thing I think can be done with --date; I just sent a sketch in
> another part of the thread.

Peff - I applied your --date sketch onto Taylor's patch and it worked first try.

> So here you're allowing multiple pretty options. But really, once we
> allow the user an arbitrary format, is there any reason for them to do:
> 
>   git shortlog --group=%an --group=%ad
> 
> versus just:
> 
>   git shortlog --group='%an %ad'
> 
> ?

Yes I can't think of an advantage of having multiple custom-formatted group
fields. Also see my note on this below related to your comment on specifying
multiple groups.

> >  void shortlog_add_commit(struct shortlog *log, struct commit *commit)
> >  {
> >  	struct strbuf ident = STRBUF_INIT;
> > @@ -243,6 +266,8 @@ void shortlog_add_commit(struct shortlog *log, struct commit *commit)
> >  	if (log->groups & SHORTLOG_GROUP_TRAILER) {
> >  		insert_records_from_trailers(log, &dups, commit, &ctx, oneline_str);
> >  	}
> > +	if (log->groups & SHORTLOG_GROUP_PRETTY)
> > +		insert_record_from_pretty(log, &dups, commit, &ctx, oneline_str);
> 
> I was puzzled at first that this was a bitwise check. But I forgot that
> we added support for --group options already, in 63d24fa0b0 (shortlog:
> allow multiple groups to be specified, 2020-09-27).
> 
> So a plan like:
> 
>   git shortlog --group=author --group=date
> 
> (as in the original patch in this thread) doesn't quite work, I think.

My first patch addressed this by specifically handling cases where the new
grouping options were passed in-tandem with existing options, and making sure
only a single shortlog group was generated. But if we're generalizing the custom
group format then it might be unecessary to even allow the custom group in tandem
with most other options (like 'author' and 'committer'), since those options can be
included in the custom group format. The trailer option might be an exception, but
that could possibly just be handled as a special case.

> We probably want to insist that the format contains a "%" sign, and/or
> git it a keyword like "format:". Otherwise a typo like:
> 
>   git shortlog --format=autor
> 
> stops being an error we detect, and just returns nonsense results
> (every commit has the same ident).

Small aside: I like how Taylor re-used the --group option for the custom format.
IMO it hammers home that this is a grouping option and not just formatting or
filtering which can be confusing to users sometimes when doing data analytics.

But your points here all still apply. Maybe detecting a "%" sign can be the way
that git identifies the custom format being passed in. In conjuction with the %
identifiers, this would still enable users to add in some arbitrary constant label
like --group='Year: %cd' --date='%Y', without affecting the grouped/sorted results
since all entries would then include the "Years: " prefix (or postfix or however
they decide to write it).

The one case that should probably be handled is when no % sign is used and no other
matching flags like --author or --committer are either, because currently that will
just group all commits under 1 nonsensical group name like "autor" as you mentioned.

> I think you'd want to detect SHORTLOG_GROUP_PRETTY in the
> read_from_stdin() path, too. And probably just die() with "not
> supported", like we do for trailers.

Glad you said this because I applied this in my original patch with the explicit
--year and --month groups. Didn't seem to be an obvious use-case with the stdin 
even though my original year and month values could possibly be read from the git
log output supplied as stdin. But for a generalized group format seems even more
far-fetched to try and make it jive with the stdin version of the command.

-Jack
Taylor Blau Oct. 7, 2022, 9:59 p.m. UTC | #10
On Thu, Oct 06, 2022 at 05:48:06PM -0700, Jacob Stopak wrote:
> > The date thing I think can be done with --date; I just sent a sketch in
> > another part of the thread.
>
> Peff - I applied your --date sketch onto Taylor's patch and it worked
> first try.

Yeah; I think that allowing the `--date` argument (similar to how `git
log` treats an option by the same name) is sensible, albeit slightly
less flexible than the proposed `%(authordate:format=%Y-%m)`.

For what it's worth, I don't have anything against the latter (other
than that it is slightly redundant in many cases with `--date`), but it
does seem easier and more worthwhile to repurpose `--date` here instead.

Thanks,
Taylor
Taylor Blau Oct. 7, 2022, 10:24 p.m. UTC | #11
On Wed, Oct 05, 2022 at 06:26:57PM -0400, Jeff King wrote:
> > +static void insert_record_from_pretty(struct shortlog *log,
> > +				      struct strset *dups,
> > +				      struct commit *commit,
> > +				      struct pretty_print_context *ctx,
> > +				      const char *oneline)
> > +{
> > +	struct strbuf ident = STRBUF_INIT;
> > +	size_t i;
> > +
> > +	for (i = 0; i < log->pretty.nr; i++) {
> > +		if (i)
> > +			strbuf_addch(&ident, ' ');
> > +
> > +		format_commit_message(commit, log->pretty.items[i].string,
> > +				      &ident, ctx);
> > +	}
>
> So here you're allowing multiple pretty options. But really, once we
> allow the user an arbitrary format, is there any reason for them to do:
>
>   git shortlog --group=%an --group=%ad
>
> versus just:
>
>   git shortlog --group='%an %ad'
>
> ?

I think that if you want to unify `--group=author` into the new format
group implementation, you would have to allow multiple `--group`
options, but each such option would generate its own shortlog identity
instead of getting concatenated together.

Thanks,
Taylor
Jeff King Oct. 11, 2022, 12:59 a.m. UTC | #12
On Thu, Oct 06, 2022 at 05:48:06PM -0700, Jacob Stopak wrote:

> > We probably want to insist that the format contains a "%" sign, and/or
> > git it a keyword like "format:". Otherwise a typo like:
> > 
> >   git shortlog --format=autor
> > 
> > stops being an error we detect, and just returns nonsense results
> > (every commit has the same ident).
> 
> Small aside: I like how Taylor re-used the --group option for the custom format.
> IMO it hammers home that this is a grouping option and not just formatting or
> filtering which can be confusing to users sometimes when doing data analytics.

Yeah, sorry this was just a typo/thinko on my part. It absolutely should
be --group, as --format already does something else.

-Peff
Jeff King Oct. 11, 2022, 1 a.m. UTC | #13
On Fri, Oct 07, 2022 at 06:24:36PM -0400, Taylor Blau wrote:

> > So here you're allowing multiple pretty options. But really, once we
> > allow the user an arbitrary format, is there any reason for them to do:
> >
> >   git shortlog --group=%an --group=%ad
> >
> > versus just:
> >
> >   git shortlog --group='%an %ad'
> >
> > ?
> 
> I think that if you want to unify `--group=author` into the new format
> group implementation, you would have to allow multiple `--group`
> options, but each such option would generate its own shortlog identity
> instead of getting concatenated together.

Exactly, and I think we have to do that anyway to match the existing
multiple-option behavior.

-Peff
diff mbox series

Patch

diff --git a/Documentation/git-shortlog.txt b/Documentation/git-shortlog.txt
index f64e77047b..ab68b287d8 100644
--- a/Documentation/git-shortlog.txt
+++ b/Documentation/git-shortlog.txt
@@ -54,6 +54,8 @@  OPTIONS
 --
  - `author`, commits are grouped by author
  - `committer`, commits are grouped by committer (the same as `-c`)
+ - `month`, commits are grouped by month (the same as `-m`)
+ - `year`, commits are grouped by year (the same as `-y`)
  - `trailer:<field>`, the `<field>` is interpreted as a case-insensitive
    commit message trailer (see linkgit:git-interpret-trailers[1]). For
    example, if your project uses `Reviewed-by` trailers, you might want
@@ -80,6 +82,14 @@  counts both authors and co-authors.
 --committer::
 	This is an alias for `--group=committer`.
 
+-m::
+--month::
+	This is an alias for `--group=month`.
+
+-y::
+--year::
+	This is an alias for `--group=year`.
+
 -w[<width>[,<indent1>[,<indent2>]]]::
 	Linewrap the output by wrapping each line at `width`.  The first
 	line of each entry is indented by `indent1` spaces, and the second
diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 7a1e1fe7c0..1beba9b91c 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -133,6 +133,10 @@  static void read_from_stdin(struct shortlog *log)
 		break;
 	case SHORTLOG_GROUP_TRAILER:
 		die(_("using --group=trailer with stdin is not supported"));
+	case SHORTLOG_GROUP_YEAR:
+		die(_("using --group=year with stdin is not supported"));
+	case SHORTLOG_GROUP_MONTH:
+		die(_("using --group=month with stdin is not supported"));
 	default:
 		BUG("unhandled shortlog group");
 	}
@@ -200,10 +204,28 @@  static void insert_records_from_trailers(struct shortlog *log,
 	unuse_commit_buffer(commit, commit_buffer);
 }
 
+static void format_commit_date(struct commit *commit, struct strbuf *sb,
+			       char *format, struct shortlog *log)
+{
+	time_t t = (time_t) commit->date;
+	struct tm commit_date;
+	localtime_r(&t, &commit_date);
+
+	if (log->groups & SHORTLOG_GROUP_MONTH) {
+		strftime(sb->buf, strbuf_avail(sb), "%Y/%m", &commit_date);
+		snprintf(sb->buf+7, strbuf_avail(sb), "%s", format);
+	} else if (log->groups & SHORTLOG_GROUP_YEAR) {
+		strftime(sb->buf, strbuf_avail(sb), "%Y", &commit_date);
+		snprintf(sb->buf+4, strbuf_avail(sb), "%s", format);
+	}
+}
+
 void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 {
 	struct strbuf ident = STRBUF_INIT;
 	struct strbuf oneline = STRBUF_INIT;
+	struct strbuf buffer;
+
 	struct strset dups = STRSET_INIT;
 	struct pretty_print_context ctx = {0};
 	const char *oneline_str;
@@ -214,6 +236,8 @@  void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	ctx.date_mode.type = DATE_NORMAL;
 	ctx.output_encoding = get_log_output_encoding();
 
+	strbuf_init(&buffer, 100);
+
 	if (!log->summary) {
 		if (log->user_format)
 			pretty_print_commit(&ctx, commit, &oneline);
@@ -222,20 +246,47 @@  void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	}
 	oneline_str = oneline.len ? oneline.buf : "<none>";
 
-	if (log->groups & SHORTLOG_GROUP_AUTHOR) {
-		strbuf_reset(&ident);
+	if ((log->groups & SHORTLOG_GROUP_MONTH) && (log->groups & SHORTLOG_GROUP_YEAR))
+		log->groups ^= SHORTLOG_GROUP_YEAR;
+
+	if (((log->groups & SHORTLOG_GROUP_MONTH) || (log->groups & SHORTLOG_GROUP_YEAR))
+	      && !HAS_MULTI_BITS(log->groups)) {
+		format_commit_date(commit, &buffer, "", log);
 		format_commit_message(commit,
-				      log->email ? "%aN <%aE>" : "%aN",
+				      buffer.buf,
 				      &ident, &ctx);
+
+		if (strset_add(&dups, ident.buf))
+			insert_one_record(log, ident.buf, oneline_str);
+	}
+	if (log->groups & SHORTLOG_GROUP_AUTHOR) {
+		strbuf_reset(&ident);
+		if ((log->groups & SHORTLOG_GROUP_MONTH) || (log->groups & SHORTLOG_GROUP_YEAR)) {
+			format_commit_date(commit, &buffer, log->email ? " %aN <%aE>" : " %aN", log);
+			format_commit_message(commit,
+					      buffer.buf,
+					      &ident, &ctx);
+		} else {
+			format_commit_message(commit,
+					      log->email ? "%aN <%aE>" : "%aN",
+					      &ident, &ctx);
+		}
 		if (!HAS_MULTI_BITS(log->groups) ||
 		    strset_add(&dups, ident.buf))
 			insert_one_record(log, ident.buf, oneline_str);
 	}
 	if (log->groups & SHORTLOG_GROUP_COMMITTER) {
 		strbuf_reset(&ident);
-		format_commit_message(commit,
-				      log->email ? "%cN <%cE>" : "%cN",
-				      &ident, &ctx);
+		if ((log->groups & SHORTLOG_GROUP_MONTH) || (log->groups & SHORTLOG_GROUP_YEAR)) {
+			format_commit_date(commit, &buffer, log->email ? " %cN <%cE>" : " %cN", log);
+			format_commit_message(commit,
+					      buffer.buf,
+					      &ident, &ctx);
+		} else {
+			format_commit_message(commit,
+					      log->email ? "%cN <%cE>" : "%cN",
+					      &ident, &ctx);
+		}
 		if (!HAS_MULTI_BITS(log->groups) ||
 		    strset_add(&dups, ident.buf))
 			insert_one_record(log, ident.buf, oneline_str);
@@ -247,6 +298,7 @@  void shortlog_add_commit(struct shortlog *log, struct commit *commit)
 	strset_clear(&dups);
 	strbuf_release(&ident);
 	strbuf_release(&oneline);
+	strbuf_release(&buffer);
 }
 
 static void get_from_rev(struct rev_info *rev, struct shortlog *log)
@@ -314,15 +366,20 @@  static int parse_group_option(const struct option *opt, const char *arg, int uns
 	if (unset) {
 		log->groups = 0;
 		string_list_clear(&log->trailers, 0);
-	} else if (!strcasecmp(arg, "author"))
+	} else if (!strcasecmp(arg, "author")) {
 		log->groups |= SHORTLOG_GROUP_AUTHOR;
-	else if (!strcasecmp(arg, "committer"))
+	} else if (!strcasecmp(arg, "committer")) {
 		log->groups |= SHORTLOG_GROUP_COMMITTER;
-	else if (skip_prefix(arg, "trailer:", &field)) {
+	} else if (skip_prefix(arg, "trailer:", &field)) {
 		log->groups |= SHORTLOG_GROUP_TRAILER;
 		string_list_append(&log->trailers, field);
-	} else
+	} else if (!strcasecmp(arg, "month")) {
+		log->groups |= SHORTLOG_GROUP_MONTH;
+	} else if (!strcasecmp(arg, "year")) {
+		log->groups |= SHORTLOG_GROUP_YEAR;
+	} else {
 		return error(_("unknown group type: %s"), arg);
+	}
 
 	return 0;
 }
@@ -363,6 +420,12 @@  int cmd_shortlog(int argc, const char **argv, const char *prefix)
 			&parse_wrap_args),
 		OPT_CALLBACK(0, "group", &log, N_("field"),
 			N_("group by field"), parse_group_option),
+		OPT_BIT('m', "month", &log.groups,
+			N_("group by month rather than author"),
+			SHORTLOG_GROUP_MONTH),
+		OPT_BIT('y', "year", &log.groups,
+			N_("group by year rather than author"),
+			SHORTLOG_GROUP_YEAR),
 		OPT_END(),
 	};
 
diff --git a/shortlog.h b/shortlog.h
index 3f7e9aabca..45b5efb6dc 100644
--- a/shortlog.h
+++ b/shortlog.h
@@ -20,6 +20,8 @@  struct shortlog {
 		SHORTLOG_GROUP_AUTHOR = (1 << 0),
 		SHORTLOG_GROUP_COMMITTER = (1 << 1),
 		SHORTLOG_GROUP_TRAILER = (1 << 2),
+		SHORTLOG_GROUP_MONTH = (1 << 3),
+		SHORTLOG_GROUP_YEAR = (1 << 4),
 	} groups;
 	struct string_list trailers;
 
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index 3095b1b2ff..981f45f732 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -359,4 +359,46 @@  test_expect_success 'stdin with multiple groups reports error' '
 	test_must_fail git shortlog --group=author --group=committer <log
 '
 
+test_expect_success '--group=year groups output by year' '
+	git commit --allow-empty -m "git shortlog --group=year test" &&
+	cat >expect <<-\EOF &&
+	     1	2005
+	EOF
+	git shortlog -ns \
+		--group=year \
+		-1 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'stdin with --group=year reports error' '
+	test_must_fail git shortlog --group=year
+'
+
+test_expect_success '--group=month groups output by month' '
+	git commit --allow-empty -m "git shortlog --group=month test" &&
+	cat >expect <<-\EOF &&
+	     1	2005/04
+	EOF
+	git shortlog -ns \
+		--group=month \
+		-1 HEAD >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'stdin with --group=month reports error' '
+	test_must_fail git shortlog --group=month
+'
+
+test_expect_success '--group=month and --group=year defaults to month' '
+	git commit --allow-empty -m "git shortlog --group=month --group=year test" &&
+	cat >expect <<-\EOF &&
+	     1	2005/04
+	EOF
+	git shortlog -ns \
+		--group=month \
+		--group=year \
+		-1 HEAD >actual &&
+	test_cmp expect actual
+'
+
 test_done