diff mbox series

[RFC,v2] builtin/shortlog: explicitly set hash algo when there is no repo

Message ID 20241015114826.715158-1-wolf@oriole.systems (mailing list archive)
State Superseded
Headers show
Series [RFC,v2] builtin/shortlog: explicitly set hash algo when there is no repo | expand

Commit Message

Wolfgang Müller Oct. 15, 2024, 11:48 a.m. UTC
Whilst git-shortlog(1) does not explicitly need any repository
information when run without reference to one, it still parses some of
its arguments with parse_revision_opt() which assumes that the hash
algorithm is set. However, in c8aed5e8da (repository: stop setting SHA1
as the default object hash, 2024-05-07) we stopped setting up a default
hash algorithm and instead require commands to set it up explicitly.

This was done for most other commands like in ab274909d4 (builtin/diff:
explicitly set hash algo when there is no repo, 2024-05-07) but was
missed for builtin/shortlog, making git-shortlog(1) segfault outside of
a repository when given arguments like --author that trigger a call to
parse_revision_opt().

Fix this for now by explicitly setting the hash algorithm to SHA1. Also
add a regression test for the segfault.

Signed-off-by: Wolfgang Müller <wolf@oriole.systems>
---
 builtin/shortlog.c  | 12 ++++++++++++
 t/t4201-shortlog.sh |  5 +++++
 2 files changed, 17 insertions(+)


Range-diff against v1:
1:  42516cc02d ! 1:  d3047a0291 builtin/shortlog: explicitly set hash algo when there is no repo
    @@ Commit message
         a repository when given arguments like --author that trigger a call to
         parse_revision_opt().
     
    -    Fix this for now by explicitly setting the hash algorithm to SHA1.
    +    Fix this for now by explicitly setting the hash algorithm to SHA1. Also
    +    add a regression test for the segfault.
     
         Signed-off-by: Wolfgang Müller <wolf@oriole.systems>
     
    @@ builtin/shortlog.c: int cmd_shortlog(int argc,
      	int nongit = !startup_info->have_repository;
      
     +	/*
    -+	 * Later on we'll call parse_revision_opt which relies on the hash
    -+	 * algorithm being set but since we are operating outside of a Git
    -+	 * repository we cannot determine one. For now default to SHA1.
    ++	 * NEEDSWORK: Later on we'll call parse_revision_opt which relies on
    ++	 * the hash algorithm being set but since we are operating outside of a
    ++	 * Git repository we cannot determine one. This is only needed because
    ++	 * parse_revision_opt expects hexsz for --abbrev which is irrelevant
    ++	 * for shortlog outside of a git repository. For now explicitly set
    ++	 * SHA1, but ideally the parsing machinery would be split between
    ++	 * git/nongit so that we do not have to do this.
     +	 */
     +	if (nongit && !the_hash_algo)
     +		repo_set_hash_algo(the_repository, GIT_HASH_SHA1);
    @@ builtin/shortlog.c: int cmd_shortlog(int argc,
      	const struct option options[] = {
      		OPT_BIT('c', "committer", &log.groups,
      			N_("group by committer rather than author"),
    +
    + ## t/t4201-shortlog.sh ##
    +@@ t/t4201-shortlog.sh: fuzz()
    + 	test_grep "too many arguments" out
    + '
    + 
    ++test_expect_success 'shortlog --author from non-git directory does not segfault' '
    ++	git log --no-expand-tabs HEAD >log &&
    ++	env GIT_DIR=non-existing git shortlog --author=author <log 2>out
    ++'
    ++
    + test_expect_success 'shortlog should add newline when input line matches wraplen' '
    + 	cat >expect <<\EOF &&
    + A U Thor (2):

Comments

Eric Sunshine Oct. 15, 2024, 5:20 p.m. UTC | #1
On Tue, Oct 15, 2024 at 7:48 AM Wolfgang Müller <wolf@oriole.systems> wrote:
> Whilst git-shortlog(1) does not explicitly need any repository
> information when run without reference to one, it still parses some of
> its arguments with parse_revision_opt() which assumes that the hash
> algorithm is set. However, in c8aed5e8da (repository: stop setting SHA1
> as the default object hash, 2024-05-07) we stopped setting up a default
> hash algorithm and instead require commands to set it up explicitly.
>
> This was done for most other commands like in ab274909d4 (builtin/diff:
> explicitly set hash algo when there is no repo, 2024-05-07) but was
> missed for builtin/shortlog, making git-shortlog(1) segfault outside of
> a repository when given arguments like --author that trigger a call to
> parse_revision_opt().
>
> Fix this for now by explicitly setting the hash algorithm to SHA1. Also
> add a regression test for the segfault.
>
> Signed-off-by: Wolfgang Müller <wolf@oriole.systems>

Cc:'ing Patrick since I believe he's been patching similar cases.

> diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
> @@ -143,6 +143,11 @@ fuzz()
> +test_expect_success 'shortlog --author from non-git directory does not segfault' '
> +       git log --no-expand-tabs HEAD >log &&
> +       env GIT_DIR=non-existing git shortlog --author=author <log 2>out
> +'

t/test-lib-functions.sh has a handy nongit() function for running a
command in a non-Git directory:

    nongit git shortlog --author=author <log 2>out

By the way, what is the purpose of capturing output to file "out"?
That file is never consulted.

Also, can the original crash be reproduced without having to invoke
the additional `git log`? In my tests, this is sufficient:

    echo | nongit git shortlog --author=author
Wolfgang Müller Oct. 15, 2024, 5:51 p.m. UTC | #2
On 2024-10-15 13:20, Eric Sunshine wrote:
> t/test-lib-functions.sh has a handy nongit() function for running a
> command in a non-Git directory:
> 
>     nongit git shortlog --author=author <log 2>out

Ah, nice, I'll be using that instead, then.

> By the way, what is the purpose of capturing output to file "out"?
> That file is never consulted.

Oh, my bad, that was a leftover from local testing.

> Also, can the original crash be reproduced without having to invoke
> the additional `git log`? In my tests, this is sufficient:
> 
>     echo | nongit git shortlog --author=author

Yeah, same as above. I'll take that into account for the next version,
thanks!

Thinking more about this, any opinion on the specific test for --author?
It is a succinct test but also relies on the fact that it "just happens"
to be one of the flags handled in parse_revision_opt(). In fact, any
such flag (even non-existing ones, like "--foo") would crash shortlog
right now. I'm wondering whether there's a more universal way of testing
this.
Patrick Steinhardt Oct. 16, 2024, 5:32 a.m. UTC | #3
On Tue, Oct 15, 2024 at 01:48:26PM +0200, Wolfgang Müller wrote:
> +	/*
> +	 * NEEDSWORK: Later on we'll call parse_revision_opt which relies on
> +	 * the hash algorithm being set but since we are operating outside of a
> +	 * Git repository we cannot determine one. This is only needed because
> +	 * parse_revision_opt expects hexsz for --abbrev which is irrelevant
> +	 * for shortlog outside of a git repository. For now explicitly set
> +	 * SHA1, but ideally the parsing machinery would be split between
> +	 * git/nongit so that we do not have to do this.
> +	 */
> +	if (nongit && !the_hash_algo)
> +		repo_set_hash_algo(the_repository, GIT_HASH_SHA1);
> +

Makes sense. As you say, git-shortlog(1) should ideally recognize the
hash function used by the input. The next-best thing would be to provide
a command line option to switch it. But punting on that should be fine
for now.

>  	const struct option options[] = {
>  		OPT_BIT('c', "committer", &log.groups,
>  			N_("group by committer rather than author"),
> diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
> index c20c885724..ed39c67ba1 100755
> --- a/t/t4201-shortlog.sh
> +++ b/t/t4201-shortlog.sh
> @@ -143,6 +143,11 @@ fuzz()
>  	test_grep "too many arguments" out
>  '
>  
> +test_expect_success 'shortlog --author from non-git directory does not segfault' '
> +	git log --no-expand-tabs HEAD >log &&
> +	env GIT_DIR=non-existing git shortlog --author=author <log 2>out
> +'
> +

I'd like to see another testcase added that exercises behaviour when
git-shortlog(1) is passed SHA256 output outside of a repo, as I'm
curious how it'll behave in that scenario.

Thanks!

Patrick
Wolfgang Müller Oct. 16, 2024, 8:47 a.m. UTC | #4
On 2024-10-16 07:32, Patrick Steinhardt wrote:
> On Tue, Oct 15, 2024 at 01:48:26PM +0200, Wolfgang Müller wrote:
> > diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
> > index c20c885724..ed39c67ba1 100755
> > --- a/t/t4201-shortlog.sh
> > +++ b/t/t4201-shortlog.sh
> > @@ -143,6 +143,11 @@ fuzz()
> >  	test_grep "too many arguments" out
> >  '
> >  
> > +test_expect_success 'shortlog --author from non-git directory does not segfault' '
> > +	git log --no-expand-tabs HEAD >log &&
> > +	env GIT_DIR=non-existing git shortlog --author=author <log 2>out
> > +'
> > +
> 
> I'd like to see another testcase added that exercises behaviour when
> git-shortlog(1) is passed SHA256 output outside of a repo, as I'm
> curious how it'll behave in that scenario.

I had a look at this in builtin/shortlog.c's read_from_stdin() and am
pretty sure that git-shortlog(1), when reading from stdin, simply
ignores anything but either the "Author:" or "Commit:" lines, depending
on the value given by --group. The --group=format: option is not
supported when reading from stdin. Neither is --format handled at all.

So I don't think there is actually a way to make git-shortlog(1)
encounter and handle a commit hash when reading from stdin; the hash
algorithm seems completely meaningless for its user-facing behaviour. As
far as I have seen the closest it comes to getting into contact with a
hash (or more specifically, hexsz) is when cmd_shortlog() sets:

	log.abbrev = rev.abbrev;

This relies on the parsing machinery in parse_revision_opt() - the one
this patch is for. Technically --abbrev is honored by git-shortlog(1)
when reading from stdin, but its value goes unused because of the
difference in code paths when reading from stdin. Do take this with a
grain of salt, however, this is my first look at the inner workings of
git-shortlog(1).

As for the test, I'd be happy to provide one if this is still deemed
necessary after considering the above. There's two questions I have:

1) Is this already covered by GIT_TEST_DEFAULT_HASH=sha256? Running the
t4201-shortlog testsuite with that passes.

2) I've already experimented with setting up a test for this and am
unsure how to cleanly set up a sha256 repository. Ordinarily it should
be a simple init/add (perhaps with test_commit), but t4201-shortlog is
already running within a git repository if I understand the setup step
correctly. Is there a clean way to escape from there, or would it simply
be fine creating another repository within that one?

Thanks!
Patrick Steinhardt Oct. 16, 2024, 8:57 a.m. UTC | #5
On Wed, Oct 16, 2024 at 10:47:03AM +0200, Wolfgang Müller wrote:
> On 2024-10-16 07:32, Patrick Steinhardt wrote:
> > On Tue, Oct 15, 2024 at 01:48:26PM +0200, Wolfgang Müller wrote:
> > > diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
> > > index c20c885724..ed39c67ba1 100755
> > > --- a/t/t4201-shortlog.sh
> > > +++ b/t/t4201-shortlog.sh
> > > @@ -143,6 +143,11 @@ fuzz()
> > >  	test_grep "too many arguments" out
> > >  '
> > >  
> > > +test_expect_success 'shortlog --author from non-git directory does not segfault' '
> > > +	git log --no-expand-tabs HEAD >log &&
> > > +	env GIT_DIR=non-existing git shortlog --author=author <log 2>out
> > > +'
> > > +
> > 
> > I'd like to see another testcase added that exercises behaviour when
> > git-shortlog(1) is passed SHA256 output outside of a repo, as I'm
> > curious how it'll behave in that scenario.
> 
> I had a look at this in builtin/shortlog.c's read_from_stdin() and am
> pretty sure that git-shortlog(1), when reading from stdin, simply
> ignores anything but either the "Author:" or "Commit:" lines, depending
> on the value given by --group. The --group=format: option is not
> supported when reading from stdin. Neither is --format handled at all.
> 
> So I don't think there is actually a way to make git-shortlog(1)
> encounter and handle a commit hash when reading from stdin; the hash
> algorithm seems completely meaningless for its user-facing behaviour. As
> far as I have seen the closest it comes to getting into contact with a
> hash (or more specifically, hexsz) is when cmd_shortlog() sets:
> 
> 	log.abbrev = rev.abbrev;
> 
> This relies on the parsing machinery in parse_revision_opt() - the one
> this patch is for. Technically --abbrev is honored by git-shortlog(1)
> when reading from stdin, but its value goes unused because of the
> difference in code paths when reading from stdin. Do take this with a
> grain of salt, however, this is my first look at the inner workings of
> git-shortlog(1).

Okay, thanks for the explanation.

Given that we do set `log.abbrev` I think we should be hitting code
paths in git-shortlog(1) that use it. `git shortlog --format=%h` for
example would use `log.abbrev`, wouldn't it? It would be nice to figure
out whether this can be made to misbehave based on which object hash we
have in the input.

> As for the test, I'd be happy to provide one if this is still deemed
> necessary after considering the above. There's two questions I have:
> 
> 1) Is this already covered by GIT_TEST_DEFAULT_HASH=sha256? Running the
> t4201-shortlog testsuite with that passes.

I think it doesn't hurt to have an explicit test for this scenario, even
if it just demonstrates that things don't crash or behave in weird ways.

> 2) I've already experimented with setting up a test for this and am
> unsure how to cleanly set up a sha256 repository. Ordinarily it should
> be a simple init/add (perhaps with test_commit), but t4201-shortlog is
> already running within a git repository if I understand the setup step
> correctly. Is there a clean way to escape from there, or would it simply
> be fine creating another repository within that one?

You can take e.g. b2dbf97f47 (builtin/index-pack: fix segfaults when
running outside of a repo, 2024-09-04) as inspiration for how to achieve
this.

Patrick
Wolfgang Müller Oct. 16, 2024, 9:07 a.m. UTC | #6
On 2024-10-16 10:57, Patrick Steinhardt wrote:
> Given that we do set `log.abbrev` I think we should be hitting code
> paths in git-shortlog(1) that use it. `git shortlog --format=%h` for
> example would use `log.abbrev`, wouldn't it? It would be nice to figure
> out whether this can be made to misbehave based on which object hash we
> have in the input.

I did try this, but like I said, --format seems not supported when
git-shortlog(1) is reading from stdin. It always outputs the same
summary format, either grouped on authors or committers. This is not
explicitly documented anywhere - the manual only says that "git shortlog
will output a summary of the log read from standard input" and then goes
on to document all the options with no mention of a difference in
behaviour when reading from stdin. So I'm still not entirely convinced
that this is impossible to trigger (also given the complexity of the
argument parsing machinery), but I have not been able to find a way.

> I think it doesn't hurt to have an explicit test for this scenario, even
> if it just demonstrates that things don't crash or behave in weird ways.

Sure, that makes sense.

> You can take e.g. b2dbf97f47 (builtin/index-pack: fix segfaults when
> running outside of a repo, 2024-09-04) as inspiration for how to achieve
> this.

Thanks much for the pointer, I'll have a look!
Wolfgang Müller Oct. 16, 2024, 9:48 a.m. UTC | #7
On 2024-10-16 10:57, Patrick Steinhardt wrote:
> Given that we do set `log.abbrev` I think we should be hitting code
> paths in git-shortlog(1) that use it. `git shortlog --format=%h` for
> example would use `log.abbrev`, wouldn't it? It would be nice to
> figure out whether this can be made to misbehave based on which object
> hash we have in the input.

I dove into the code again and now I'm fairly sure custom formatting is
only ever done when in a repository. shortlog_output() itself, called at
the end of cmd_shortlog(), doesn't do any formatting, only possibly
wrapping the lines already present in the shortlog struct.

That struct is filled either by read_from_stdin() or get_from_rev(). The
latter is only ever called when in a repository:

>	if (!nongit && !rev.pending.nr && isatty(0))
>		add_head_to_pending(&rev);
>	if (rev.pending.nr == 0) {
>		if (isatty(0))
>			fprintf(stderr, _("(reading log message from standard input)\n"));
>		read_from_stdin(&log);
>	}
>	else
>		get_from_rev(&rev, &log);

get_from_rev() handles formatting through shortlog_add_commit(),
directly formatting the commit into the oneline buffer that is later
added to the shortlog struct:

> 	if (!log->summary) {
> 		if (log->user_format)
> 			pretty_print_commit(&ctx, commit, &oneline);
> 		else
> 			repo_format_commit_message(the_repository, commit,
> 						   "%s", &oneline, &ctx);
> 	}
>	oneline_str = oneline.len ? oneline.buf : "<none>";
>
>	insert_records_from_trailers(log, &dups, commit, &ctx, oneline_str);
>	insert_records_from_format(log, &dups, commit, &ctx, oneline_str);

read_from_stdin(), in comparison, skips all lines from stdin not
matching either "Author: " or "Commit: ", and then adds only the
author/committer name and the subject to its log:

>		const char *v;
>		if (!skip_prefix(ident.buf, match[0], &v) &&
>		    !skip_prefix(ident.buf, match[1], &v))
>			continue;
>		while (strbuf_getline_lf(&oneline, stdin) != EOF &&
>		       oneline.len)
>			; /* discard headers */
>		while (strbuf_getline_lf(&oneline, stdin) != EOF &&
>		       !oneline.len)
>			; /* discard blanks */
>
>		strbuf_reset(&mapped_ident);
>		if (parse_ident(log, &mapped_ident, v) < 0)
>			continue;
>
>		insert_one_record(log, mapped_ident.buf, oneline.buf);

So whilst we parse all the relevant options like --abbrev and --format,
we take a shortcut through read_from_stdin() and never get to apply a
custom format. Commit hashes from stdin are discarded.

I'm not sure a test case for different hash algorithms would test
anything meaningful here, unless the plan in the future is to have
git-shortlog(1) support formatting when reading from stdin.
Taylor Blau Oct. 16, 2024, 6:52 p.m. UTC | #8
On Wed, Oct 16, 2024 at 11:07:02AM +0200, Wolfgang Müller wrote:
> On 2024-10-16 10:57, Patrick Steinhardt wrote:
> > Given that we do set `log.abbrev` I think we should be hitting code
> > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for
> > example would use `log.abbrev`, wouldn't it? It would be nice to figure
> > out whether this can be made to misbehave based on which object hash we
> > have in the input.
>
> I did try this, but like I said, --format seems not supported when
> git-shortlog(1) is reading from stdin. It always outputs the same
> summary format, either grouped on authors or committers. This is not
> explicitly documented anywhere - the manual only says that "git shortlog
> will output a summary of the log read from standard input" and then goes
> on to document all the options with no mention of a difference in
> behaviour when reading from stdin. So I'm still not entirely convinced
> that this is impossible to trigger (also given the complexity of the
> argument parsing machinery), but I have not been able to find a way.

Yeah, I think this is correct. For the purposes of this series, I think
that what Woflgang has done is sufficient on the testing front.

As a nice piece of #leftoverbits, it would be appreciated to have a
patch that amends git-shortlog(1) to indicate that '--format' is ignored
when reading input from stdin.

Thanks,
Taylor
Wolfgang Müller Oct. 16, 2024, 7:01 p.m. UTC | #9
On 2024-10-16 14:52, Taylor Blau wrote:
> As a nice piece of #leftoverbits, it would be appreciated to have a
> patch that amends git-shortlog(1) to indicate that '--format' is
> ignored when reading input from stdin.

Think I shall give that a look in the coming days, then, this is
something I'd like to see fixed as well. Immediately what comes to my
mind is that not only --format is ignored (--author is too if I'm not
mistaken), so I'll see whether there's a nice way to indicate which
specific options are valid when git-shortlog(1) reads from stdin.
Taylor Blau Oct. 16, 2024, 7:01 p.m. UTC | #10
On Wed, Oct 16, 2024 at 11:48:24AM +0200, Wolfgang Müller wrote:
> On 2024-10-16 10:57, Patrick Steinhardt wrote:
> > Given that we do set `log.abbrev` I think we should be hitting code
> > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for
> > example would use `log.abbrev`, wouldn't it? It would be nice to
> > figure out whether this can be made to misbehave based on which object
> > hash we have in the input.
>
> I dove into the code again and now I'm fairly sure custom formatting is
> only ever done when in a repository. shortlog_output() itself, called at
> the end of cmd_shortlog(), doesn't do any formatting, only possibly
> wrapping the lines already present in the shortlog struct.
>
> That struct is filled either by read_from_stdin() or get_from_rev(). The
> latter is only ever called when in a repository:
>
> [...]

Thanks; I agree with your analysis here.

> So whilst we parse all the relevant options like --abbrev and --format,
> we take a shortcut through read_from_stdin() and never get to apply a
> custom format. Commit hashes from stdin are discarded.
>
> I'm not sure a test case for different hash algorithms would test
> anything meaningful here, unless the plan in the future is to have
> git-shortlog(1) support formatting when reading from stdin.

I think that in general it would be difficult to support the full range
of --format specifiers when operating outside of a repository, because
we don't have all of the information necessary to assemble all of the
possible formatting options.

For instance, let's say I want to take Patrick's example to test 'git
shortlog' with '--format="%H"' outside of the repository. There's no way
to disambiguate whether, say, a SHA-256 hash is either (a) a correctly
formatted SHA-256 hash, or (b) a corrupted / too-long SHA-1 hash.

So that means that '%H', '%h', '%T', and '%t' are off the table. '%an'
and '%ae' seem reasonable to implement, but '%aN' and '%aE' less so,
because we don't have a .mailmap file to read. The same goes for the
committer variants of all of those.

I don't think there is any reasonable interpretation of '%d'/'%D', and
likewise for '%(decorate)' as well as '%(describe)'. We could probably
go on, but I am getting tired of looking through the 'PRETTY FORMATS'
section of git-log(1) and trying to figure out how they'd work (or not)
without a repository ;-).

In any event, my feeling is that while we could probably implement a
handful of these formatting options, that it would likely be kind of
awkward to do so. Not to mention the user-visible awkwardness of
supporting some '--format' specifiers but not others[^1].

So I think that the best course of action would be to document the
limitation and move on ;-).

Thanks,
Taylor

[^1]: Playing devil's advocate, though, perhaps it is OK to document
  well which formatting options do and don't work, and accept that a
  user asking for '--format="%(describe)"' (etc.) outside of a
  repository is nonsensical and warn / return nothing appropriately.
Wolfgang Müller Oct. 16, 2024, 7:14 p.m. UTC | #11
On 2024-10-16 15:01, Taylor Blau wrote:
> So that means that '%H', '%h', '%T', and '%t' are off the table. '%an'
> and '%ae' seem reasonable to implement, but '%aN' and '%aE' less so,
> because we don't have a .mailmap file to read. The same goes for the
> committer variants of all of those.

An interesting tidbit here is that git-shortlog(1) actually does read a
.mailmap file when processing stdin, but it takes it from the current
directory instead. Having read the full manual today, it's one of the
only times it mentions a difference in behaviour regarding standard
input.
Patrick Steinhardt Oct. 17, 2024, 5:04 a.m. UTC | #12
On Wed, Oct 16, 2024 at 02:52:56PM -0400, Taylor Blau wrote:
> On Wed, Oct 16, 2024 at 11:07:02AM +0200, Wolfgang Müller wrote:
> > On 2024-10-16 10:57, Patrick Steinhardt wrote:
> > > Given that we do set `log.abbrev` I think we should be hitting code
> > > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for
> > > example would use `log.abbrev`, wouldn't it? It would be nice to figure
> > > out whether this can be made to misbehave based on which object hash we
> > > have in the input.
> >
> > I did try this, but like I said, --format seems not supported when
> > git-shortlog(1) is reading from stdin. It always outputs the same
> > summary format, either grouped on authors or committers. This is not
> > explicitly documented anywhere - the manual only says that "git shortlog
> > will output a summary of the log read from standard input" and then goes
> > on to document all the options with no mention of a difference in
> > behaviour when reading from stdin. So I'm still not entirely convinced
> > that this is impossible to trigger (also given the complexity of the
> > argument parsing machinery), but I have not been able to find a way.
> 
> Yeah, I think this is correct. For the purposes of this series, I think
> that what Woflgang has done is sufficient on the testing front.
> 
> As a nice piece of #leftoverbits, it would be appreciated to have a
> patch that amends git-shortlog(1) to indicate that '--format' is ignored
> when reading input from stdin.

Yeah, I'm happy with the in-depth explanation that he gave and am fine
with not adding a test.

Patrick
diff mbox series

Patch

diff --git a/builtin/shortlog.c b/builtin/shortlog.c
index 3ed5c46078..0fa35202ed 100644
--- a/builtin/shortlog.c
+++ b/builtin/shortlog.c
@@ -387,6 +387,18 @@  int cmd_shortlog(int argc,
 	struct rev_info rev;
 	int nongit = !startup_info->have_repository;
 
+	/*
+	 * NEEDSWORK: Later on we'll call parse_revision_opt which relies on
+	 * the hash algorithm being set but since we are operating outside of a
+	 * Git repository we cannot determine one. This is only needed because
+	 * parse_revision_opt expects hexsz for --abbrev which is irrelevant
+	 * for shortlog outside of a git repository. For now explicitly set
+	 * SHA1, but ideally the parsing machinery would be split between
+	 * git/nongit so that we do not have to do this.
+	 */
+	if (nongit && !the_hash_algo)
+		repo_set_hash_algo(the_repository, GIT_HASH_SHA1);
+
 	const struct option options[] = {
 		OPT_BIT('c', "committer", &log.groups,
 			N_("group by committer rather than author"),
diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh
index c20c885724..ed39c67ba1 100755
--- a/t/t4201-shortlog.sh
+++ b/t/t4201-shortlog.sh
@@ -143,6 +143,11 @@  fuzz()
 	test_grep "too many arguments" out
 '
 
+test_expect_success 'shortlog --author from non-git directory does not segfault' '
+	git log --no-expand-tabs HEAD >log &&
+	env GIT_DIR=non-existing git shortlog --author=author <log 2>out
+'
+
 test_expect_success 'shortlog should add newline when input line matches wraplen' '
 	cat >expect <<\EOF &&
 A U Thor (2):