Message ID | 20241015114826.715158-1-wolf@oriole.systems (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [RFC,v2] builtin/shortlog: explicitly set hash algo when there is no repo | expand |
On Tue, Oct 15, 2024 at 7:48 AM Wolfgang Müller <wolf@oriole.systems> wrote: > Whilst git-shortlog(1) does not explicitly need any repository > information when run without reference to one, it still parses some of > its arguments with parse_revision_opt() which assumes that the hash > algorithm is set. However, in c8aed5e8da (repository: stop setting SHA1 > as the default object hash, 2024-05-07) we stopped setting up a default > hash algorithm and instead require commands to set it up explicitly. > > This was done for most other commands like in ab274909d4 (builtin/diff: > explicitly set hash algo when there is no repo, 2024-05-07) but was > missed for builtin/shortlog, making git-shortlog(1) segfault outside of > a repository when given arguments like --author that trigger a call to > parse_revision_opt(). > > Fix this for now by explicitly setting the hash algorithm to SHA1. Also > add a regression test for the segfault. > > Signed-off-by: Wolfgang Müller <wolf@oriole.systems> Cc:'ing Patrick since I believe he's been patching similar cases. > diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh > @@ -143,6 +143,11 @@ fuzz() > +test_expect_success 'shortlog --author from non-git directory does not segfault' ' > + git log --no-expand-tabs HEAD >log && > + env GIT_DIR=non-existing git shortlog --author=author <log 2>out > +' t/test-lib-functions.sh has a handy nongit() function for running a command in a non-Git directory: nongit git shortlog --author=author <log 2>out By the way, what is the purpose of capturing output to file "out"? That file is never consulted. Also, can the original crash be reproduced without having to invoke the additional `git log`? In my tests, this is sufficient: echo | nongit git shortlog --author=author
On 2024-10-15 13:20, Eric Sunshine wrote: > t/test-lib-functions.sh has a handy nongit() function for running a > command in a non-Git directory: > > nongit git shortlog --author=author <log 2>out Ah, nice, I'll be using that instead, then. > By the way, what is the purpose of capturing output to file "out"? > That file is never consulted. Oh, my bad, that was a leftover from local testing. > Also, can the original crash be reproduced without having to invoke > the additional `git log`? In my tests, this is sufficient: > > echo | nongit git shortlog --author=author Yeah, same as above. I'll take that into account for the next version, thanks! Thinking more about this, any opinion on the specific test for --author? It is a succinct test but also relies on the fact that it "just happens" to be one of the flags handled in parse_revision_opt(). In fact, any such flag (even non-existing ones, like "--foo") would crash shortlog right now. I'm wondering whether there's a more universal way of testing this.
On Tue, Oct 15, 2024 at 01:48:26PM +0200, Wolfgang Müller wrote: > + /* > + * NEEDSWORK: Later on we'll call parse_revision_opt which relies on > + * the hash algorithm being set but since we are operating outside of a > + * Git repository we cannot determine one. This is only needed because > + * parse_revision_opt expects hexsz for --abbrev which is irrelevant > + * for shortlog outside of a git repository. For now explicitly set > + * SHA1, but ideally the parsing machinery would be split between > + * git/nongit so that we do not have to do this. > + */ > + if (nongit && !the_hash_algo) > + repo_set_hash_algo(the_repository, GIT_HASH_SHA1); > + Makes sense. As you say, git-shortlog(1) should ideally recognize the hash function used by the input. The next-best thing would be to provide a command line option to switch it. But punting on that should be fine for now. > const struct option options[] = { > OPT_BIT('c', "committer", &log.groups, > N_("group by committer rather than author"), > diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh > index c20c885724..ed39c67ba1 100755 > --- a/t/t4201-shortlog.sh > +++ b/t/t4201-shortlog.sh > @@ -143,6 +143,11 @@ fuzz() > test_grep "too many arguments" out > ' > > +test_expect_success 'shortlog --author from non-git directory does not segfault' ' > + git log --no-expand-tabs HEAD >log && > + env GIT_DIR=non-existing git shortlog --author=author <log 2>out > +' > + I'd like to see another testcase added that exercises behaviour when git-shortlog(1) is passed SHA256 output outside of a repo, as I'm curious how it'll behave in that scenario. Thanks! Patrick
On 2024-10-16 07:32, Patrick Steinhardt wrote: > On Tue, Oct 15, 2024 at 01:48:26PM +0200, Wolfgang Müller wrote: > > diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh > > index c20c885724..ed39c67ba1 100755 > > --- a/t/t4201-shortlog.sh > > +++ b/t/t4201-shortlog.sh > > @@ -143,6 +143,11 @@ fuzz() > > test_grep "too many arguments" out > > ' > > > > +test_expect_success 'shortlog --author from non-git directory does not segfault' ' > > + git log --no-expand-tabs HEAD >log && > > + env GIT_DIR=non-existing git shortlog --author=author <log 2>out > > +' > > + > > I'd like to see another testcase added that exercises behaviour when > git-shortlog(1) is passed SHA256 output outside of a repo, as I'm > curious how it'll behave in that scenario. I had a look at this in builtin/shortlog.c's read_from_stdin() and am pretty sure that git-shortlog(1), when reading from stdin, simply ignores anything but either the "Author:" or "Commit:" lines, depending on the value given by --group. The --group=format: option is not supported when reading from stdin. Neither is --format handled at all. So I don't think there is actually a way to make git-shortlog(1) encounter and handle a commit hash when reading from stdin; the hash algorithm seems completely meaningless for its user-facing behaviour. As far as I have seen the closest it comes to getting into contact with a hash (or more specifically, hexsz) is when cmd_shortlog() sets: log.abbrev = rev.abbrev; This relies on the parsing machinery in parse_revision_opt() - the one this patch is for. Technically --abbrev is honored by git-shortlog(1) when reading from stdin, but its value goes unused because of the difference in code paths when reading from stdin. Do take this with a grain of salt, however, this is my first look at the inner workings of git-shortlog(1). As for the test, I'd be happy to provide one if this is still deemed necessary after considering the above. There's two questions I have: 1) Is this already covered by GIT_TEST_DEFAULT_HASH=sha256? Running the t4201-shortlog testsuite with that passes. 2) I've already experimented with setting up a test for this and am unsure how to cleanly set up a sha256 repository. Ordinarily it should be a simple init/add (perhaps with test_commit), but t4201-shortlog is already running within a git repository if I understand the setup step correctly. Is there a clean way to escape from there, or would it simply be fine creating another repository within that one? Thanks!
On Wed, Oct 16, 2024 at 10:47:03AM +0200, Wolfgang Müller wrote: > On 2024-10-16 07:32, Patrick Steinhardt wrote: > > On Tue, Oct 15, 2024 at 01:48:26PM +0200, Wolfgang Müller wrote: > > > diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh > > > index c20c885724..ed39c67ba1 100755 > > > --- a/t/t4201-shortlog.sh > > > +++ b/t/t4201-shortlog.sh > > > @@ -143,6 +143,11 @@ fuzz() > > > test_grep "too many arguments" out > > > ' > > > > > > +test_expect_success 'shortlog --author from non-git directory does not segfault' ' > > > + git log --no-expand-tabs HEAD >log && > > > + env GIT_DIR=non-existing git shortlog --author=author <log 2>out > > > +' > > > + > > > > I'd like to see another testcase added that exercises behaviour when > > git-shortlog(1) is passed SHA256 output outside of a repo, as I'm > > curious how it'll behave in that scenario. > > I had a look at this in builtin/shortlog.c's read_from_stdin() and am > pretty sure that git-shortlog(1), when reading from stdin, simply > ignores anything but either the "Author:" or "Commit:" lines, depending > on the value given by --group. The --group=format: option is not > supported when reading from stdin. Neither is --format handled at all. > > So I don't think there is actually a way to make git-shortlog(1) > encounter and handle a commit hash when reading from stdin; the hash > algorithm seems completely meaningless for its user-facing behaviour. As > far as I have seen the closest it comes to getting into contact with a > hash (or more specifically, hexsz) is when cmd_shortlog() sets: > > log.abbrev = rev.abbrev; > > This relies on the parsing machinery in parse_revision_opt() - the one > this patch is for. Technically --abbrev is honored by git-shortlog(1) > when reading from stdin, but its value goes unused because of the > difference in code paths when reading from stdin. Do take this with a > grain of salt, however, this is my first look at the inner workings of > git-shortlog(1). Okay, thanks for the explanation. Given that we do set `log.abbrev` I think we should be hitting code paths in git-shortlog(1) that use it. `git shortlog --format=%h` for example would use `log.abbrev`, wouldn't it? It would be nice to figure out whether this can be made to misbehave based on which object hash we have in the input. > As for the test, I'd be happy to provide one if this is still deemed > necessary after considering the above. There's two questions I have: > > 1) Is this already covered by GIT_TEST_DEFAULT_HASH=sha256? Running the > t4201-shortlog testsuite with that passes. I think it doesn't hurt to have an explicit test for this scenario, even if it just demonstrates that things don't crash or behave in weird ways. > 2) I've already experimented with setting up a test for this and am > unsure how to cleanly set up a sha256 repository. Ordinarily it should > be a simple init/add (perhaps with test_commit), but t4201-shortlog is > already running within a git repository if I understand the setup step > correctly. Is there a clean way to escape from there, or would it simply > be fine creating another repository within that one? You can take e.g. b2dbf97f47 (builtin/index-pack: fix segfaults when running outside of a repo, 2024-09-04) as inspiration for how to achieve this. Patrick
On 2024-10-16 10:57, Patrick Steinhardt wrote: > Given that we do set `log.abbrev` I think we should be hitting code > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for > example would use `log.abbrev`, wouldn't it? It would be nice to figure > out whether this can be made to misbehave based on which object hash we > have in the input. I did try this, but like I said, --format seems not supported when git-shortlog(1) is reading from stdin. It always outputs the same summary format, either grouped on authors or committers. This is not explicitly documented anywhere - the manual only says that "git shortlog will output a summary of the log read from standard input" and then goes on to document all the options with no mention of a difference in behaviour when reading from stdin. So I'm still not entirely convinced that this is impossible to trigger (also given the complexity of the argument parsing machinery), but I have not been able to find a way. > I think it doesn't hurt to have an explicit test for this scenario, even > if it just demonstrates that things don't crash or behave in weird ways. Sure, that makes sense. > You can take e.g. b2dbf97f47 (builtin/index-pack: fix segfaults when > running outside of a repo, 2024-09-04) as inspiration for how to achieve > this. Thanks much for the pointer, I'll have a look!
On 2024-10-16 10:57, Patrick Steinhardt wrote: > Given that we do set `log.abbrev` I think we should be hitting code > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for > example would use `log.abbrev`, wouldn't it? It would be nice to > figure out whether this can be made to misbehave based on which object > hash we have in the input. I dove into the code again and now I'm fairly sure custom formatting is only ever done when in a repository. shortlog_output() itself, called at the end of cmd_shortlog(), doesn't do any formatting, only possibly wrapping the lines already present in the shortlog struct. That struct is filled either by read_from_stdin() or get_from_rev(). The latter is only ever called when in a repository: > if (!nongit && !rev.pending.nr && isatty(0)) > add_head_to_pending(&rev); > if (rev.pending.nr == 0) { > if (isatty(0)) > fprintf(stderr, _("(reading log message from standard input)\n")); > read_from_stdin(&log); > } > else > get_from_rev(&rev, &log); get_from_rev() handles formatting through shortlog_add_commit(), directly formatting the commit into the oneline buffer that is later added to the shortlog struct: > if (!log->summary) { > if (log->user_format) > pretty_print_commit(&ctx, commit, &oneline); > else > repo_format_commit_message(the_repository, commit, > "%s", &oneline, &ctx); > } > oneline_str = oneline.len ? oneline.buf : "<none>"; > > insert_records_from_trailers(log, &dups, commit, &ctx, oneline_str); > insert_records_from_format(log, &dups, commit, &ctx, oneline_str); read_from_stdin(), in comparison, skips all lines from stdin not matching either "Author: " or "Commit: ", and then adds only the author/committer name and the subject to its log: > const char *v; > if (!skip_prefix(ident.buf, match[0], &v) && > !skip_prefix(ident.buf, match[1], &v)) > continue; > while (strbuf_getline_lf(&oneline, stdin) != EOF && > oneline.len) > ; /* discard headers */ > while (strbuf_getline_lf(&oneline, stdin) != EOF && > !oneline.len) > ; /* discard blanks */ > > strbuf_reset(&mapped_ident); > if (parse_ident(log, &mapped_ident, v) < 0) > continue; > > insert_one_record(log, mapped_ident.buf, oneline.buf); So whilst we parse all the relevant options like --abbrev and --format, we take a shortcut through read_from_stdin() and never get to apply a custom format. Commit hashes from stdin are discarded. I'm not sure a test case for different hash algorithms would test anything meaningful here, unless the plan in the future is to have git-shortlog(1) support formatting when reading from stdin.
On Wed, Oct 16, 2024 at 11:07:02AM +0200, Wolfgang Müller wrote: > On 2024-10-16 10:57, Patrick Steinhardt wrote: > > Given that we do set `log.abbrev` I think we should be hitting code > > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for > > example would use `log.abbrev`, wouldn't it? It would be nice to figure > > out whether this can be made to misbehave based on which object hash we > > have in the input. > > I did try this, but like I said, --format seems not supported when > git-shortlog(1) is reading from stdin. It always outputs the same > summary format, either grouped on authors or committers. This is not > explicitly documented anywhere - the manual only says that "git shortlog > will output a summary of the log read from standard input" and then goes > on to document all the options with no mention of a difference in > behaviour when reading from stdin. So I'm still not entirely convinced > that this is impossible to trigger (also given the complexity of the > argument parsing machinery), but I have not been able to find a way. Yeah, I think this is correct. For the purposes of this series, I think that what Woflgang has done is sufficient on the testing front. As a nice piece of #leftoverbits, it would be appreciated to have a patch that amends git-shortlog(1) to indicate that '--format' is ignored when reading input from stdin. Thanks, Taylor
On 2024-10-16 14:52, Taylor Blau wrote: > As a nice piece of #leftoverbits, it would be appreciated to have a > patch that amends git-shortlog(1) to indicate that '--format' is > ignored when reading input from stdin. Think I shall give that a look in the coming days, then, this is something I'd like to see fixed as well. Immediately what comes to my mind is that not only --format is ignored (--author is too if I'm not mistaken), so I'll see whether there's a nice way to indicate which specific options are valid when git-shortlog(1) reads from stdin.
On Wed, Oct 16, 2024 at 11:48:24AM +0200, Wolfgang Müller wrote: > On 2024-10-16 10:57, Patrick Steinhardt wrote: > > Given that we do set `log.abbrev` I think we should be hitting code > > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for > > example would use `log.abbrev`, wouldn't it? It would be nice to > > figure out whether this can be made to misbehave based on which object > > hash we have in the input. > > I dove into the code again and now I'm fairly sure custom formatting is > only ever done when in a repository. shortlog_output() itself, called at > the end of cmd_shortlog(), doesn't do any formatting, only possibly > wrapping the lines already present in the shortlog struct. > > That struct is filled either by read_from_stdin() or get_from_rev(). The > latter is only ever called when in a repository: > > [...] Thanks; I agree with your analysis here. > So whilst we parse all the relevant options like --abbrev and --format, > we take a shortcut through read_from_stdin() and never get to apply a > custom format. Commit hashes from stdin are discarded. > > I'm not sure a test case for different hash algorithms would test > anything meaningful here, unless the plan in the future is to have > git-shortlog(1) support formatting when reading from stdin. I think that in general it would be difficult to support the full range of --format specifiers when operating outside of a repository, because we don't have all of the information necessary to assemble all of the possible formatting options. For instance, let's say I want to take Patrick's example to test 'git shortlog' with '--format="%H"' outside of the repository. There's no way to disambiguate whether, say, a SHA-256 hash is either (a) a correctly formatted SHA-256 hash, or (b) a corrupted / too-long SHA-1 hash. So that means that '%H', '%h', '%T', and '%t' are off the table. '%an' and '%ae' seem reasonable to implement, but '%aN' and '%aE' less so, because we don't have a .mailmap file to read. The same goes for the committer variants of all of those. I don't think there is any reasonable interpretation of '%d'/'%D', and likewise for '%(decorate)' as well as '%(describe)'. We could probably go on, but I am getting tired of looking through the 'PRETTY FORMATS' section of git-log(1) and trying to figure out how they'd work (or not) without a repository ;-). In any event, my feeling is that while we could probably implement a handful of these formatting options, that it would likely be kind of awkward to do so. Not to mention the user-visible awkwardness of supporting some '--format' specifiers but not others[^1]. So I think that the best course of action would be to document the limitation and move on ;-). Thanks, Taylor [^1]: Playing devil's advocate, though, perhaps it is OK to document well which formatting options do and don't work, and accept that a user asking for '--format="%(describe)"' (etc.) outside of a repository is nonsensical and warn / return nothing appropriately.
On 2024-10-16 15:01, Taylor Blau wrote: > So that means that '%H', '%h', '%T', and '%t' are off the table. '%an' > and '%ae' seem reasonable to implement, but '%aN' and '%aE' less so, > because we don't have a .mailmap file to read. The same goes for the > committer variants of all of those. An interesting tidbit here is that git-shortlog(1) actually does read a .mailmap file when processing stdin, but it takes it from the current directory instead. Having read the full manual today, it's one of the only times it mentions a difference in behaviour regarding standard input.
On Wed, Oct 16, 2024 at 02:52:56PM -0400, Taylor Blau wrote: > On Wed, Oct 16, 2024 at 11:07:02AM +0200, Wolfgang Müller wrote: > > On 2024-10-16 10:57, Patrick Steinhardt wrote: > > > Given that we do set `log.abbrev` I think we should be hitting code > > > paths in git-shortlog(1) that use it. `git shortlog --format=%h` for > > > example would use `log.abbrev`, wouldn't it? It would be nice to figure > > > out whether this can be made to misbehave based on which object hash we > > > have in the input. > > > > I did try this, but like I said, --format seems not supported when > > git-shortlog(1) is reading from stdin. It always outputs the same > > summary format, either grouped on authors or committers. This is not > > explicitly documented anywhere - the manual only says that "git shortlog > > will output a summary of the log read from standard input" and then goes > > on to document all the options with no mention of a difference in > > behaviour when reading from stdin. So I'm still not entirely convinced > > that this is impossible to trigger (also given the complexity of the > > argument parsing machinery), but I have not been able to find a way. > > Yeah, I think this is correct. For the purposes of this series, I think > that what Woflgang has done is sufficient on the testing front. > > As a nice piece of #leftoverbits, it would be appreciated to have a > patch that amends git-shortlog(1) to indicate that '--format' is ignored > when reading input from stdin. Yeah, I'm happy with the in-depth explanation that he gave and am fine with not adding a test. Patrick
diff --git a/builtin/shortlog.c b/builtin/shortlog.c index 3ed5c46078..0fa35202ed 100644 --- a/builtin/shortlog.c +++ b/builtin/shortlog.c @@ -387,6 +387,18 @@ int cmd_shortlog(int argc, struct rev_info rev; int nongit = !startup_info->have_repository; + /* + * NEEDSWORK: Later on we'll call parse_revision_opt which relies on + * the hash algorithm being set but since we are operating outside of a + * Git repository we cannot determine one. This is only needed because + * parse_revision_opt expects hexsz for --abbrev which is irrelevant + * for shortlog outside of a git repository. For now explicitly set + * SHA1, but ideally the parsing machinery would be split between + * git/nongit so that we do not have to do this. + */ + if (nongit && !the_hash_algo) + repo_set_hash_algo(the_repository, GIT_HASH_SHA1); + const struct option options[] = { OPT_BIT('c', "committer", &log.groups, N_("group by committer rather than author"), diff --git a/t/t4201-shortlog.sh b/t/t4201-shortlog.sh index c20c885724..ed39c67ba1 100755 --- a/t/t4201-shortlog.sh +++ b/t/t4201-shortlog.sh @@ -143,6 +143,11 @@ fuzz() test_grep "too many arguments" out ' +test_expect_success 'shortlog --author from non-git directory does not segfault' ' + git log --no-expand-tabs HEAD >log && + env GIT_DIR=non-existing git shortlog --author=author <log 2>out +' + test_expect_success 'shortlog should add newline when input line matches wraplen' ' cat >expect <<\EOF && A U Thor (2):