[v3,3/4] parse_commit(): handle broken whitespace-only timestamp

Message ID 20230427081715.GA1478467@coredump.intra.peff.net
State Accepted
Commit 089d9adff6408b8f3406e2f46179501337715ae8
Series fixing some parse_commit() timestamp corner cases

Commit Message

Jeff King April 27, 2023, 8:17 a.m. UTC
The comment in parse_commit_date() claims that parse_timestamp() will
not walk past the end of the buffer we've been given, since it will hit
the newline at "eol" and stop. This is usually true, when dateptr
contains actual numbers to parse. But with a line like:

   committer name <email>   \n

with just whitespace, and no numbers, parse_timestamp() will consume
that newline as part of the leading whitespace, and we may walk past our
"tail" pointer (which itself is set from the "size" parameter passed in
to parse_commit_buffer()).

In practice this can't cause us to walk off the end of an array, because
we always add an extra NUL byte to the end of objects we load from disk
(as a defense against exactly this kind of bug). However, you can see
the behavior in action when "committer" is the final header (which it
usually is, unless there's an encoding) and the subject line can be
parsed as an integer. We walk right past the newline on the committer
line, as well as the "\n\n" separator, and mistake the subject for the
timestamp.
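
The failure mode described above can be reproduced outside of git with a
minimal sketch. It assumes parse_timestamp() behaves like strtoumax()
(as the test comment in this patch suggests); naive_commit_date() is a
hypothetical stand-in for the pre-patch logic, not the real function:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the pre-patch logic: find the end of the
 * committer line, back up to just past '>', and hand the rest to
 * strtoumax() (assumed here to be what parse_timestamp() wraps).
 */
static uintmax_t naive_commit_date(const char *buf)
{
	const char *eol, *dateptr;

	assert(buf != NULL);
	eol = strchr(buf, '\n');
	if (!eol)
		return 0;
	dateptr = eol;
	while (dateptr > buf && dateptr[-1] != '>')
		dateptr--;
	if (dateptr == buf || dateptr == eol)
		return 0;
	/* strtoumax() skips leading whitespace, including '\n' */
	return strtoumax(dateptr, NULL, 10);
}
```

Called on "committer name <email>   \n\n1234567890\n", this sketch
returns 1234567890: the parser walks past eol and the blank separator
line, and reads the subject as the date.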

We can solve this by trimming the whitespace ourselves, making sure that
it has some non-whitespace to parse. Note that we need to be a bit
careful about the definition of "whitespace" here, as our isspace()
doesn't match exotic characters like vertical tab or formfeed. We can
work around that by checking for an actual number (see the in-code
comment). This is slightly more restrictive than the current code, but
in practice the results are either the same (we reject "foo" as "0", but
so would parse_timestamp()) or extremely unlikely even for broken
commits (parse_timestamp() would allow "\v123" as "123", but we'll now
make it "0").
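
As a rough standalone illustration of the approach described above
(not the patch itself), the sketch below trims whitespace up to eol and
then insists on a digit or dash. narrow_isspace() is an assumption
meant to mimic git's locale-independent isspace(), which does not match
vertical tab or formfeed; checked_commit_date() is a hypothetical name:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <string.h>

/* Assumption: approximates git's isspace(); no '\v' or '\f'. */
static int narrow_isspace(char c)
{
	return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

static int ascii_isdigit(char c)
{
	return c >= '0' && c <= '9';
}

/* Hypothetical stand-in for the post-patch logic. */
static uintmax_t checked_commit_date(const char *buf)
{
	const char *eol, *dateptr;

	assert(buf != NULL);
	eol = strchr(buf, '\n');
	if (!eol)
		return 0;
	dateptr = eol;
	while (dateptr > buf && dateptr[-1] != '>')
		dateptr--;
	if (dateptr == buf)
		return 0;
	/* Trim whitespace ourselves, never moving past eol. */
	while (dateptr < eol && narrow_isspace(*dateptr))
		dateptr++;
	/* Anything that does not start a number is treated as 0. */
	if (!ascii_isdigit(*dateptr) && *dateptr != '-')
		return 0;
	return strtoumax(dateptr, NULL, 10);
}
```

With this check, both the spaces-only line and the "\v" line yield 0,
even when the subject that follows is numeric, while an ordinary
"42 +0000" date still parses as 42.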

I did also allow "-" here, which may be controversial, as we don't
currently support negative timestamps. My reasoning was two-fold. One,
the design of parse_timestamp() is such that we should be able to easily
switch it to handling signed values, and this otherwise creates a
hard-to-find gotcha that anybody doing that work would get tripped up
on. And two, the status quo is that we currently parse them, though the
result of course ends up as a very large unsigned value (which is likely
to just get clamped to "0" for display anyway, since our date routines
can't handle it).
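
The "very large unsigned value" can be seen with a small sketch, again
assuming parse_timestamp() behaves like strtoumax(); parse_unsigned()
is a hypothetical wrapper used only for illustration:

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>

/*
 * Hypothetical wrapper: strtoumax() accepts a leading '-' and negates
 * the result in the unsigned return type, so the value wraps around.
 */
static uintmax_t parse_unsigned(const char *s)
{
	assert(s != NULL);
	return strtoumax(s, NULL, 10);
}
```

Here parse_unsigned("-1") yields UINTMAX_MAX, and parse_unsigned("-100")
yields UINTMAX_MAX - 99.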

The new test checks the commit parser (via "--until") for both vanilla
spaces and the vertical-tab case. I also added a test to check these
against the pretty-print formatter, which uses split_ident_line().  It's
not subject to the same bug, because it already insists that there be
one or more digits in the timestamp.

Helped-by: Phillip Wood <phillip.wood123@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c               | 28 ++++++++++++++++++++++++++--
 t/t4212-log-corrupt.sh | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+), 2 deletions(-)

Comments

Phillip Wood April 27, 2023, 10:11 a.m. UTC | #1
On 27/04/2023 09:17, Jeff King wrote:
> The comment in parse_commit_date() claims that parse_timestamp() will
> not walk past the end of the buffer we've been given, since it will hit
> the newline at "eol" and stop. This is usually true, when dateptr
> contains actual numbers to parse. But with a line like:
> 
>     committer name <email>   \n
> 
> with just whitespace, and no numbers, parse_timestamp() will consume
> that newline as part of the leading whitespace, and we may walk past our
> "tail" pointer (which itself is set from the "size" parameter passed in
> to parse_commit_buffer()).
> 
> In practice this can't cause us to walk off the end of an array, because
> we always add an extra NUL byte to the end of objects we load from disk
> (as a defense against exactly this kind of bug). However, you can see
> the behavior in action when "committer" is the final header (which it
> usually is, unless there's an encoding) and the subject line can be
> parsed as an integer. We walk right past the newline on the committer
> line, as well as the "\n\n" separator, and mistake the subject for the
> timestamp.
> 
> We can solve this by trimming the whitespace ourselves, making sure that
> it has some non-whitespace to parse. Note that we need to be a bit
> careful about the definition of "whitespace" here, as our isspace()
> doesn't match exotic characters like vertical tab or formfeed. We can
> work around that by checking for an actual number (see the in-code
> comment). This is slightly more restrictive than the current code, but
> in practice the results are either the same (we reject "foo" as "0", but
> so would parse_timestamp()) or extremely unlikely even for broken
> commits (parse_timestamp() would allow "\v123" as "123", but we'll now
> make it "0").
> 
> I did also allow "-" here, which may be controversial, as we don't
> currently support negative timestamps. My reasoning was two-fold. One,
> the design of parse_timestamp() is such that we should be able to easily
> switch it to handling signed values, and this otherwise creates a
> hard-to-find gotcha that anybody doing that work would get tripped up
> on. And two, the status quo is that we currently parse them, though the
> result of course ends up as a very large unsigned value (which is likely
> to just get clamped to "0" for display anyway, since our date routines
> can't handle it).

I think this makes a good case for accepting '-'. The commit message is 
well explained as always :-) This all looks good to me apart from a 
query about one of the tests.

> The new test checks the commit parser (via "--until") for both vanilla
> spaces and the vertical-tab case. I also added a test to check these
> against the pretty-print formatter, which uses split_ident_line().  It's
> not subject to the same bug, because it already insists that there be
> one or more digits in the timestamp.
> 
> Helped-by: Phillip Wood <phillip.wood123@gmail.com>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   commit.c               | 28 ++++++++++++++++++++++++++--
>   t/t4212-log-corrupt.sh | 41 +++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 67 insertions(+), 2 deletions(-)
> 

> +test_expect_success 'create commits with whitespace committer dates' '
> +	# It is important that this subject line is numeric, since we want to
> +	# be sure we are not confused by skipping whitespace and accidentally
> +	# parsing the subject as a timestamp.
> +	#
> +	# Do not use munge_author_date here. Besides not hitting the committer
> +	# line, it leaves the timezone intact, and we want nothing but
> +	# whitespace.
> +	#
> +	# We will make two munged commits here. The first, ws_commit, will
> +	# be purely spaces. The second contains a vertical tab, which is
> +	# considered a space by strtoumax(), but not by our isspace().

This comment is really helpful to explain what's going on and testing 
'\v' as well as ' ' is a good idea.

> +	test_commit 1234567890 &&
> +	git cat-file commit HEAD >commit.orig &&
> +	sed "s/>.*/>    /" <commit.orig >commit.munge &&
> +	ws_commit=$(git hash-object --literally -w -t commit commit.munge) &&
> +	sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&

Does the shell eat the '\v' when it trims trailing whitespace from the 
command substitution (I can't remember the rules off the top of my head)?

Best Wishes

Phillip

> +	vt_commit=$(git hash-object --literally -w -t commit commit.munge)
> +'
> +
> +test_expect_success '--until treats whitespace date as sentinel' '
> +	echo $ws_commit >expect &&
> +	git rev-list --until=1980-01-01 $ws_commit >actual &&
> +	test_cmp expect actual &&
> +
> +	echo $vt_commit >expect &&
> +	git rev-list --until=1980-01-01 $vt_commit >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'pretty-printer handles whitespace date' '
> +	# as with the %ad test above, we will show these as the empty string,
> +	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
> +	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
> +	echo : >expect &&
> +	git log -1 --format="%at:%ct" $ws_commit >actual &&
> +	test_cmp expect actual &&
> +	git log -1 --format="%at:%ct" $vt_commit >actual &&
> +	test_cmp expect actual
> +'
> +
>   test_done
Phillip Wood April 27, 2023, 11:55 a.m. UTC | #2
On 27/04/2023 11:11, Phillip Wood wrote:

>> +    test_commit 1234567890 &&
>> +    git cat-file commit HEAD >commit.orig &&
>> +    sed "s/>.*/>    /" <commit.orig >commit.munge &&
>> +    ws_commit=$(git hash-object --literally -w -t commit 
>> commit.munge) &&
>> +    sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
> 
> Does the shell eat the '\v' when it trims trailing whitespace from the 
> command substitution (I can't remember the rules off the top of my head)?

Having looked it up, the shell trims newlines but not other whitespace 
so this should be fine.

Best Wishes

Phillip

> 
> Best Wishes
> 
> Phillip
> 
>> +    vt_commit=$(git hash-object --literally -w -t commit commit.munge)
>> +'
>> +
>> +test_expect_success '--until treats whitespace date as sentinel' '
>> +    echo $ws_commit >expect &&
>> +    git rev-list --until=1980-01-01 $ws_commit >actual &&
>> +    test_cmp expect actual &&
>> +
>> +    echo $vt_commit >expect &&
>> +    git rev-list --until=1980-01-01 $vt_commit >actual &&
>> +    test_cmp expect actual
>> +'
>> +
>> +test_expect_success 'pretty-printer handles whitespace date' '
>> +    # as with the %ad test above, we will show these as the empty 
>> string,
>> +    # not the 1970 epoch date. This is intentional; see 7d9a281941 
>> (t4212:
>> +    # test bogus timestamps with git-log, 2014-02-24) for more 
>> discussion.
>> +    echo : >expect &&
>> +    git log -1 --format="%at:%ct" $ws_commit >actual &&
>> +    test_cmp expect actual &&
>> +    git log -1 --format="%at:%ct" $vt_commit >actual &&
>> +    test_cmp expect actual
>> +'
>> +
>>   test_done
>
Junio C Hamano April 27, 2023, 4:20 p.m. UTC | #3
Phillip Wood <phillip.wood123@gmail.com> writes:

>> I did also allow "-" here, which may be controversial, as we don't
>> currently support negative timestamps. My reasoning was two-fold. One,
>> the design of parse_timestamp() is such that we should be able to easily
>> switch it to handling signed values, and this otherwise creates a
>> hard-to-find gotcha that anybody doing that work would get tripped up
>> on. And two, the status quo is that we currently parse them, though the
>> result of course ends up as a very large unsigned value (which is likely
>> to just get clamped to "0" for display anyway, since our date routines
>> can't handle it).
>
> I think this makes a good case for accepting '-'. The commit message
> is well explained as always :-) This all looks good to me apart from a
> query about one of the tests.

I agree.  I was somewhat surprised that the big comment before that
code did not mention it, but hopefully those who would be tempted to
remove the check for '-' would either be careful enough themselves
or be stopped by reviewers who are careful enough to go back to the
log message of the commit that added the check in the first place,
so it is OK.
Junio C Hamano April 27, 2023, 4:25 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> In practice this can't cause us to walk off the end of an array, because
> we always add an extra NUL byte to the end of objects we load from disk
> (as a defense against exactly this kind of bug). However, you can see
> the behavior in action when "committer" is the final header (which it
> usually is, unless there's an encoding ...

... or it is a signed commit or a commit that merges a signed tag.

There is no need for us to be exhaustive here, but I just wondered
which one of these three commit object headers is more common.  I
guess the reason "encoding" came to your mind first is because it is
the oldest among the three.
Jeff King April 27, 2023, 4:46 p.m. UTC | #5
On Thu, Apr 27, 2023 at 12:55:26PM +0100, Phillip Wood wrote:

> On 27/04/2023 11:11, Phillip Wood wrote:
> 
> > > +    test_commit 1234567890 &&
> > > +    git cat-file commit HEAD >commit.orig &&
> > > +    sed "s/>.*/>    /" <commit.orig >commit.munge &&
> > > +    ws_commit=$(git hash-object --literally -w -t commit
> > > commit.munge) &&
> > > +    sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
> > 
> > Does the shell eat the '\v' when it trims trailing whitespace from the
> > command substitution (I can't remember the rules off the top of my
> > head)?
> 
> Having looked it up, the shell trims newlines but not other whitespace so
> this should be fine.

Yep. I also wondered if some sed versions might complain. But it's
really not that much more exotic than a tab, so it's probably OK.

-Peff
Jeff King April 27, 2023, 4:55 p.m. UTC | #6
On Thu, Apr 27, 2023 at 09:20:53AM -0700, Junio C Hamano wrote:

> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
> >> I did also allow "-" here, which may be controversial, as we don't
> >> currently support negative timestamps. My reasoning was two-fold. One,
> >> the design of parse_timestamp() is such that we should be able to easily
> >> switch it to handling signed values, and this otherwise creates a
> >> hard-to-find gotcha that anybody doing that work would get tripped up
> >> on. And two, the status quo is that we currently parse them, though the
> >> result of course ends up as a very large unsigned value (which is likely
> >> to just get clamped to "0" for display anyway, since our date routines
> >> can't handle it).
> >
> > I think this makes a good case for accepting '-'. The commit message
> > is well explained as always :-) This all looks good to me apart from a
> > query about one of the tests.
> 
> I agree.  I was somewhat surprised that the big comment before that
> code did not mention it, but hopefully those who would be tempted to
> remove the check for '-' would either be careful enough themselves
> or be stopped by reviewers who are careful enough to go back to the
> log message of the commit that added the check in the first place,
> so it is OK.

Hmph, I thought I did write something in that comment. But I think in
the end I just migrated it to the commit message (and skimming my reflog
I don't think it even made it as far as a commit).

I think it's OK. I was mostly trying to help out Future Peff, who has a
series almost completed to handle negative timestamps. But I think the
worst case is that the tests in that new series would fail, and I'd
figure it out. :)

-Peff
Jeff King April 27, 2023, 4:57 p.m. UTC | #7
On Thu, Apr 27, 2023 at 09:25:02AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > In practice this can't cause us to walk off the end of an array, because
> > we always add an extra NUL byte to the end of objects we load from disk
> > (as a defense against exactly this kind of bug). However, you can see
> > the behavior in action when "committer" is the final header (which it
> > usually is, unless there's an encoding ...
> 
> ... or it is a signed commit or a commit that merges a signed tag.
> 
> There is no need for us to be exhaustive here, but I just wondered
> which one of these three commit object headers is more common.  I
> guess the reason "encoding" came to your mind first is because it is
> the oldest among the three.

Mostly the others did not occur to me at all. :)

I expect that "gpgsig" lines are probably the most common these days,
but that may be my biased view (I guess in the kernel workflow it is
probably signed tags).

-Peff
Patch

diff --git a/commit.c b/commit.c
index 04c20d9cc6..8dfe92cf37 100644
--- a/commit.c
+++ b/commit.c
@@ -121,10 +121,34 @@  static timestamp_t parse_commit_date(const char *buf, const char *tail)
 	dateptr = eol;
 	while (dateptr > buf && dateptr[-1] != '>')
 		dateptr--;
-	if (dateptr == buf || dateptr == eol)
+	if (dateptr == buf)
 		return 0;
 
-	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
+	/*
+	 * Trim leading whitespace, but make sure we have at least one
+	 * non-whitespace character, as parse_timestamp() will otherwise walk
+	 * right past the newline we found in "eol" when skipping whitespace
+	 * itself.
+	 *
+	 * In theory it would be sufficient to allow any character not matched
+	 * by isspace(), but there's a catch: our isspace() does not
+	 * necessarily match the behavior of parse_timestamp(), as the latter
+	 * is implemented by system routines which match more exotic control
+	 * codes, or even locale-dependent sequences.
+	 *
+	 * Since we expect the timestamp to be a number, we can check for that.
+	 * Anything else (e.g., a non-numeric token like "foo") would just
+	 * cause parse_timestamp() to return 0 anyway.
+	 */
+	while (dateptr < eol && isspace(*dateptr))
+		dateptr++;
+	if (!isdigit(*dateptr) && *dateptr != '-')
+		return 0;
+
+	/*
+	 * We know there is at least one digit (or dash), so we'll begin
+	 * parsing there and stop at worst case at eol.
+	 */
 	return parse_timestamp(dateptr, NULL, 10);
 }
 
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index af4b35ff56..85e90acb09 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -92,4 +92,45 @@  test_expect_success 'absurdly far-in-future date' '
 	git log -1 --format=%ad $commit
 '
 
+test_expect_success 'create commits with whitespace committer dates' '
+	# It is important that this subject line is numeric, since we want to
+	# be sure we are not confused by skipping whitespace and accidentally
+	# parsing the subject as a timestamp.
+	#
+	# Do not use munge_author_date here. Besides not hitting the committer
+	# line, it leaves the timezone intact, and we want nothing but
+	# whitespace.
+	#
+	# We will make two munged commits here. The first, ws_commit, will
+	# be purely spaces. The second contains a vertical tab, which is
+	# considered a space by strtoumax(), but not by our isspace().
+	test_commit 1234567890 &&
+	git cat-file commit HEAD >commit.orig &&
+	sed "s/>.*/>    /" <commit.orig >commit.munge &&
+	ws_commit=$(git hash-object --literally -w -t commit commit.munge) &&
+	sed "s/>.*/>   $(printf "\013")/" <commit.orig >commit.munge &&
+	vt_commit=$(git hash-object --literally -w -t commit commit.munge)
+'
+
+test_expect_success '--until treats whitespace date as sentinel' '
+	echo $ws_commit >expect &&
+	git rev-list --until=1980-01-01 $ws_commit >actual &&
+	test_cmp expect actual &&
+
+	echo $vt_commit >expect &&
+	git rev-list --until=1980-01-01 $vt_commit >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'pretty-printer handles whitespace date' '
+	# as with the %ad test above, we will show these as the empty string,
+	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
+	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
+	echo : >expect &&
+	git log -1 --format="%at:%ct" $ws_commit >actual &&
+	test_cmp expect actual &&
+	git log -1 --format="%at:%ct" $vt_commit >actual &&
+	test_cmp expect actual
+'
+
 test_done