
[v2,3/4] parse_commit(): handle broken whitespace-only timestamp

Message ID 20230425055458.GC4015649@coredump.intra.peff.net (mailing list archive)
State Superseded
Series fixing some parse_commit() timestamp corner cases

Commit Message

Jeff King April 25, 2023, 5:54 a.m. UTC
The comment in parse_commit_date() claims that parse_timestamp() will
not walk past the end of the buffer we've been given, since it will hit
the newline at "eol" and stop. This is usually true, when dateptr
contains actual numbers to parse. But with a line like:

   committer name <email>   \n

with just whitespace, and no numbers, parse_timestamp() will consume
that newline as part of the leading whitespace, and we may walk past our
"tail" pointer (which itself is set from the "size" parameter passed in
to parse_commit_buffer()).

In practice this can't cause us to walk off the end of an array, because
we always add an extra NUL byte to the end of objects we load from disk
(as a defense against exactly this kind of bug). However, you can see
the behavior in action when "committer" is the final header (which it
usually is, unless there's an encoding) and the subject line can be
parsed as an integer. We walk right past the newline on the committer
line, as well as the "\n\n" separator, and mistake the subject for the
timestamp.

The new test demonstrates such a case. I also added a test to check this
case against the pretty-print formatter, which uses split_ident_line().
It's not subject to the same bug, because it insists that there be one
or more digits in the timestamp.

Signed-off-by: Jeff King <peff@peff.net>
---
 commit.c               | 17 +++++++++++++++--
 t/t4212-log-corrupt.sh | 29 +++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 2 deletions(-)
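
The failure mode described in the commit message can be reproduced outside of git with a few lines of C. The following is only a sketch: the helper names `naive_parse_date()` and `demo_whitespace_only()` and the sample buffer are invented for illustration, and the only real API used is the standard strtoumax(), which the thread says parse_timestamp() maps to.

```c
#include <inttypes.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the pre-patch behavior: take a pointer just
 * past the '>' of an ident line and hand it straight to strtoumax().
 */
static uintmax_t naive_parse_date(const char *dateptr)
{
	/* strtoumax() skips leading whitespace, and '\n' is whitespace */
	return strtoumax(dateptr, NULL, 10);
}

/* Returns what the naive parser sees for a whitespace-only date field. */
static uintmax_t demo_whitespace_only(void)
{
	/* made-up end of a commit object: blank date, then the subject */
	static const char buf[] =
		"committer name <email>   \n\n1234567890\n";
	const char *dateptr = strchr(buf, '>') + 1;

	return naive_parse_date(dateptr);
}
```

Under these assumptions `demo_whitespace_only()` yields 1234567890: the spaces, the newline that ends the committer line, and the blank separator line are all skipped as leading whitespace, so the numeric subject is read as the timestamp.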

Comments

Phillip Wood April 25, 2023, 10:11 a.m. UTC | #1
Hi Peff

On 25/04/2023 06:54, Jeff King wrote:
> The comment in parse_commit_date() claims that parse_timestamp() will
> not walk past the end of the buffer we've been given, since it will hit
> the newline at "eol" and stop. This is usually true, when dateptr
> contains actual numbers to parse. But with a line like:
> 
>     committer name <email>   \n
> 
> with just whitespace, and no numbers, parse_timestamp() will consume
> that newline as part of the leading whitespace, and we may walk past our
> "tail" pointer (which itself is set from the "size" parameter passed in
> to parse_commit_buffer()).
> 
> In practice this can't cause us to walk off the end of an array, because
> we always add an extra NUL byte to the end of objects we load from disk
> (as a defense against exactly this kind of bug). However, you can see
> the behavior in action when "committer" is the final header (which it
> usually is, unless there's an encoding) and the subject line can be
> parsed as an integer. We walk right past the newline on the committer
> line, as well as the "\n\n" separator, and mistake the subject for the
> timestamp.
> 
> The new test demonstrates such a case. I also added a test to check this
> case against the pretty-print formatter, which uses split_ident_line().
> It's not subject to the same bug, because it insists that there be one
> or more digits in the timestamp.
> 
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>   commit.c               | 17 +++++++++++++++--
>   t/t4212-log-corrupt.sh | 29 +++++++++++++++++++++++++++++
>   2 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/commit.c b/commit.c
> index bb340f66fa..2f1b5d505b 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -120,10 +120,23 @@ static timestamp_t parse_commit_date(const char *buf, const char *tail)
>   	dateptr = eol;
>   	while (dateptr > buf && dateptr[-1] != '>')
>   		dateptr--;
> -	if (dateptr == buf || dateptr == eol)
> +	if (dateptr == buf)
>   		return 0;
>   
> -	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
> +	/*
> +	 * Trim leading whitespace; parse_timestamp() will do this itself, but
> +	 * if we have _only_ whitespace, it will walk right past the newline
> +	 * while doing so.
> +	 */
> +	while (dateptr < eol && isspace(*dateptr))
> +		dateptr++;
> +	if (dateptr == eol)
> +		return 0;
> +
> +	/*
> +	 * We know there is at least one non-whitespace character, so we'll
> +	 * begin parsing there and stop at worst case at eol.
> +	 */

This probably doesn't matter in practice but we define our own isspace() 
that does not treat '\v' and '\f' as whitespace. However 
parse_timestamp() (which is just strtoumax()) uses the standard 
library's isspace() which does treat those characters as whitespace and 
is locale dependent. This means we can potentially stop at a character 
that parse_timestamp() treats as whitespace and if there are no digits 
after it we'll still walk past the end of the line. Using Rene's 
suggestion of testing the character with isdigit() would fix that. It 
would also avoid parsing negative timestamps as positive numbers and 
reject any timestamps that begin with a locale dependent digit.
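
The mismatch described above can be sketched in standalone C. This is not git code: `narrow_isspace()` is an assumed model of a whitespace test that ignores '\v' and '\f', as the paragraph above describes, and `v2_style_parse()` is a hypothetical imitation of the trimming loop in this patch.

```c
#include <ctype.h>
#include <inttypes.h>
#include <stddef.h>

/* Assumed model of a narrow isspace(): '\v' and '\f' do not count. */
static int narrow_isspace(unsigned char c)
{
	return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

/*
 * Hypothetical imitation of the v2 approach: trim with the narrow
 * whitespace test, bail out if nothing is left before eol, and
 * otherwise let strtoumax() take over.
 */
static uintmax_t v2_style_parse(const char *dateptr, const char *eol)
{
	while (dateptr < eol && narrow_isspace((unsigned char)*dateptr))
		dateptr++;
	if (dateptr == eol)
		return 0;
	/* strtoumax() uses the C library's wider notion of whitespace */
	return strtoumax(dateptr, NULL, 10);
}
```

With a date field of `" \v"` followed by a blank line and a subject beginning with `123456`, the trimming loop stops at '\v' because the narrow test does not treat it as whitespace. strtoumax() then skips '\v' and both newlines, so the call returns 123456 instead of 0.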

I'm not familiar with this code, but would it be worth changing 
parse_timestamp() to stop parsing if it sees a newline?

Best Wishes

Phillip

>   	return parse_timestamp(dateptr, NULL, 10);
>   }
>   
> diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
> index af4b35ff56..d4ef48d646 100755
> --- a/t/t4212-log-corrupt.sh
> +++ b/t/t4212-log-corrupt.sh
> @@ -92,4 +92,33 @@ test_expect_success 'absurdly far-in-future date' '
>   	git log -1 --format=%ad $commit
>   '
>   
> +test_expect_success 'create commit with whitespace committer date' '
> +	# It is important that this subject line is numeric, since we want to
> +	# be sure we are not confused by skipping whitespace and accidentally
> +	# parsing the subject as a timestamp.
> +	#
> +	# Do not use munge_author_date here. Besides not hitting the committer
> +	# line, it leaves the timezone intact, and we want nothing but
> +	# whitespace.
> +	test_commit 1234567890 &&
> +	git cat-file commit HEAD >commit.orig &&
> +	sed "s/>.*/>    /" <commit.orig >commit.munge &&
> +	ws_commit=$(git hash-object --literally -w -t commit commit.munge)
> +'
> +
> +test_expect_success '--until treats whitespace date as sentinel' '
> +	echo $ws_commit >expect &&
> +	git rev-list --until=1980-01-01 $ws_commit >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'pretty-printer handles whitespace date' '
> +	# as with the %ad test above, we will show these as the empty string,
> +	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
> +	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
> +	echo : >expect &&
> +	git log -1 --format="%at:%ct" $ws_commit >actual &&
> +	test_cmp expect actual
> +'
> +
>   test_done
Junio C Hamano April 25, 2023, 4:06 p.m. UTC | #2
Phillip Wood <phillip.wood123@gmail.com> writes:

> This probably doesn't matter in practice but we define our own
> isspace() that does not treat '\v' and '\f' as whitespace. However
> parse_timestamp() (which is just strtoumax()) uses the standard
> library's isspace() which does treat those characters as whitespace
> and is locale dependent. This means we can potentially stop at a
> character that parse_timestamp() treats as whitespace and if there are
> no digits after it we'll still walk past the end of the line. Using
> Rene's suggestion of testing the character with isdigit() would fix
> that. It would also avoid parsing negative timestamps as positive
> numbers and reject any timestamps that begin with a locale dependent
> digit.

A very interesting observation.  I wonder if a curious person can
craft a malformed timestamp with "hash-object --literally" to do
more than DoS themselves?

We are not going to put anything other than [ 0-9+-] after the '>'
we scan for, and making sure '>' is followed by SP and then [0-9]
would be sufficient to ensure strtoumax() to stop before the '\n'
but does not ensure that the "signal a bad timestamp with 0"
happens.  Perhaps that would be sufficient.  I dunno.

> I'm not familiar with this code, but would it be worth changing
> parse_timestamp() to stop parsing if it sees a newline?

Meaning replace or write our own strtoumax() equivalent?
Jeff King April 26, 2023, 11:36 a.m. UTC | #3
On Tue, Apr 25, 2023 at 09:06:47AM -0700, Junio C Hamano wrote:

> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
> > This probably doesn't matter in practice but we define our own
> > isspace() that does not treat '\v' and '\f' as whitespace. However
> > parse_timestamp() (which is just strtoumax()) uses the standard
> > library's isspace() which does treat those characters as whitespace
> > and is locale dependent. This means we can potentially stop at a
> > character that parse_timestamp() treats as whitespace and if there are
> > no digits after it we'll still walk past the end of the line. Using
> > Rene's suggestion of testing the character with isdigit() would fix
> > that. It would also avoid parsing negative timestamps as positive
> > numbers and reject any timestamps that begin with a locale dependent
> > digit.
> 
> A very interesting observation.  I wonder if a curious person can
> craft a malformed timestamp with "hash-object --literally" to do
> more than DoS themselves?

I think the answer is no, because the worst case is that they read to
the trailing NUL that we stick after any object content we read into
memory. So we'd mis-parse:

  committer name <email> \v\n

  123456 in the subject line

to read "123456" as the commit timestamp (so basically the same bug my
patch was trying to fix). But we'd never read out-of-bounds memory.
Still, it does not give me warm fuzzies, and I think is worth fixing.

> We are not going to put anything other than [ 0-9+-] after the '>'
> we scan for, and making sure '>' is followed by SP and then [0-9]
> would be sufficient to ensure strtoumax() to stop before the '\n'
> but does not ensure that the "signal a bad timestamp with 0"
> happens.  Perhaps that would be sufficient.  I dunno.

Any single non-whitespace character at all would be sufficient to avoid
the problem. And that's what the current iteration of the patch is
trying to do. It's just that our definition of "whitespace" has to agree
with strtoumax()'s for it to work. And as Phillip notes, that may even
include locale dependent characters. So I don't think we want to get
into trying to match them all (i.e., an "allow known" strategy).

Instead, we should go back to what the original iteration of the series
was doing, and make sure there is at least one digit (i.e., a "forbid
unknown" strategy). Assuming that there is no locale where ascii "1" is
considered whitespace. ;)

Note that will exclude a few cases that we do allow now, like:

  committer name <email> \v123456 +0000\n

Right now that parses as "123456", but we'd reject it as "0" after such
a patch.

The alternative is to check _all_ of the characters between ">" and the
newline and make sure there is some digit somewhere, which would be
sufficient to prevent strtoumax() from walking past the newline.

I guess it's not even any more expensive in the normal case (since the
very first non-whitespace entry should be a digit!). I'm not sure it's
worth caring about too much either way. Garbage making it into
name/email is an easy mistake to make (for users and implementations).
Putting whitespace control codes into your timestamp is not, and marking
them as "0" is an OK outcome.
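
The two options discussed above might look roughly like this. Both functions are illustrative sketches with invented names (`parse_strict()`, `parse_lenient()`), not the code that ended up in commit.c; each takes a pointer just past the '>' and a pointer to the newline that ends the line.

```c
#include <inttypes.h>
#include <stddef.h>

/* Locale-independent ASCII digit test. */
static int is_ascii_digit(char c)
{
	return c >= '0' && c <= '9';
}

/* Option 1: after spaces and tabs, the next byte must be a digit. */
static uintmax_t parse_strict(const char *dateptr, const char *eol)
{
	while (dateptr < eol && (*dateptr == ' ' || *dateptr == '\t'))
		dateptr++;
	if (dateptr == eol || !is_ascii_digit(*dateptr))
		return 0;
	return strtoumax(dateptr, NULL, 10);
}

/* Option 2: accept the line if any byte before eol is a digit. */
static uintmax_t parse_lenient(const char *dateptr, const char *eol)
{
	const char *p;

	for (p = dateptr; p < eol; p++) {
		if (is_ascii_digit(*p))
			/* a digit sits before eol, so parsing stops there */
			return strtoumax(dateptr, NULL, 10);
	}
	return 0;
}
```

For a well-formed field such as `" 1234567890 +0000"` both variants agree. They differ on `" \v123456 +0000"`: the strict variant returns 0, while the lenient one still lets strtoumax() skip the '\v' and returns 123456. Neither can run past the newline, since both refuse a field that has no digit before eol.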

-Peff
Phillip Wood April 26, 2023, 2:06 p.m. UTC | #4
On 25/04/2023 17:06, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> This probably doesn't matter in practice but we define our own
>> isspace() that does not treat '\v' and '\f' as whitespace. However
>> parse_timestamp() (which is just strtoumax()) uses the standard
>> library's isspace() which does treat those characters as whitespace
>> and is locale dependent. This means we can potentially stop at a
>> character that parse_timestamp() treats as whitespace and if there are
>> no digits after it we'll still walk past the end of the line. Using
>> Rene's suggestion of testing the character with isdigit() would fix
>> that. It would also avoid parsing negative timestamps as positive
>> numbers 

>> and reject any timestamps that begin with a locale dependent
>> digit.

Sorry, that bit is not correct, I've since checked the C standard and I 
think strtoul() and friends expect ascii digits (isdigit() and 
isxdigit() are also locale independent unlike isspace(), isalpha() etc.)

> A very interesting observation.  I wonder if a curious person can
> craft a malformed timestamp with "hash-object --literally" to do
> more than DoS themselves?
> 
> We are not going to put anything other than [ 0-9+-] after the '>'
> we scan for, and making sure '>' is followed by SP and then [0-9]
> would be sufficient to ensure strtoumax() to stop before the '\n'
> but does not ensure that the "signal a bad timestamp with 0"
> happens.  Perhaps that would be sufficient.  I dunno.
> 
>> I'm not familiar with this code, but would it be worth changing
>> parse_timestamp() to stop parsing if it sees a newline?
> 
> Meaning replace or write our own strtoumax() equivalent?

I was thinking of a wrapper around strtoumax() that skipped the leading 
whitespace itself and returned 0 if it saw '\n' or the first 
non-whitespace character was not a digit. It would help other callers 
avoid the problem with missing timestamps that is being fixed in this 
series. I was surprised to see that callers are expected to pass a base 
to parse_timestamp(). All of them seem to pass "10" apart from a caller 
in upload-pack.c that passes "0" when parsing the argument to 
"deepen-since" - do we really want to support octal and hexadecimal 
timestamps there?
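
A wrapper along these lines could be sketched as follows. The names `parse_timestamp_strict()` and `parse_any_base()` are hypothetical and exist only for this illustration; the second function simply shows what a base of 0 means for the standard strtoumax().

```c
#include <inttypes.h>
#include <stddef.h>

/*
 * Hypothetical wrapper: skip spaces and tabs ourselves, then insist on
 * an ASCII digit before handing the string to strtoumax().
 */
static uintmax_t parse_timestamp_strict(const char *s)
{
	while (*s == ' ' || *s == '\t')
		s++;
	if (*s < '0' || *s > '9')
		return 0; /* covers '\n', NUL, signs and other garbage */
	return strtoumax(s, NULL, 10);
}

/* With base 0, strtoumax() picks the radix from the prefix. */
static uintmax_t parse_any_base(const char *s)
{
	return strtoumax(s, NULL, 0);
}
```

With base 0, `"0x10"` parses as 16 and `"010"` as 8, which is the octal and hexadecimal behavior the question above is about. The strict wrapper returns 0 for a whitespace-only field followed by a numeric line, and also for a leading '-', so negative input is not silently converted to a large unsigned value.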

Best Wishes

Phillip
Andreas Schwab April 26, 2023, 2:31 p.m. UTC | #5
On Apr 26 2023, Phillip Wood wrote:

> On 25/04/2023 17:06, Junio C Hamano wrote:
>> Phillip Wood <phillip.wood123@gmail.com> writes:
>> 
>>> This probably doesn't matter in practice but we define our own
>>> isspace() that does not treat '\v' and '\f' as whitespace. However
>>> parse_timestamp() (which is just strtoumax()) uses the standard
>>> library's isspace() which does treat those characters as whitespace
>>> and is locale dependent. This means we can potentially stop at a
>>> character that parse_timestamp() treats as whitespace and if there are
>>> no digits after it we'll still walk past the end of the line. Using
>>> Rene's suggestion of testing the character with isdigit() would fix
>>> that. It would also avoid parsing negative timestamps as positive
>>> numbers 
>
>>> and reject any timestamps that begin with a locale dependent
>>> digit.
>
> Sorry, that bit is not correct, I've since checked the C standard and I
> think strtoul() and friends expect ascii digits (isdigit() and isxdigit()
> are also locale independent unlike isspace(), isalpha() etc.)

The standard says:

    In other than the "C" locale, additional locale-specific subject
    sequence forms may be accepted.
Phillip Wood April 26, 2023, 2:44 p.m. UTC | #6
Hi Andreas

On 26/04/2023 15:31, Andreas Schwab wrote:
> On Apr 26 2023, Phillip Wood wrote:
> 
>> On 25/04/2023 17:06, Junio C Hamano wrote:
>>> Phillip Wood <phillip.wood123@gmail.com> writes:
>>>
>>>> This probably doesn't matter in practice but we define our own
>>>> isspace() that does not treat '\v' and '\f' as whitespace. However
>>>> parse_timestamp() (which is just strtoumax()) uses the standard
>>>> library's isspace() which does treat those characters as whitespace
>>>> and is locale dependent. This means we can potentially stop at a
>>>> character that parse_timestamp() treats as whitespace and if there are
>>>> no digits after it we'll still walk past the end of the line. Using
>>>> Rene's suggestion of testing the character with isdigit() would fix
>>>> that. It would also avoid parsing negative timestamps as positive
>>>> numbers
>>
>>>> and reject any timestamps that begin with a locale dependent
>>>> digit.
>>
>> Sorry, that bit is not correct, I've since checked the C standard and I
>> think strtoul() and friends expect ascii digits (isdigit() and isxdigit()
>> are also locale independent unlike isspace(), isalpha() etc.)
> 
> The standard says:
> 
>      In other than the "C" locale, additional locale-specific subject
>      sequence forms may be accepted.

Thanks, looking at the standard again I don't know how I managed to miss 
that, my initial recollection was correct after all.

Best Wishes

Phillip
Junio C Hamano April 26, 2023, 3:32 p.m. UTC | #7
Jeff King <peff@peff.net> writes:

> Instead, we should go back to what the original iteration of the series
> was doing, and make sure there is at least one digit (i.e., a "forbid
> unknown" strategy). Assuming that there is no locale where ascii "1" is
> considered whitespace. ;)
>
> Note that will exclude a few cases that we do allow now, like:
>
>   committer name <email> \v123456 +0000\n
>
> Right now that parses as "123456", but we'd reject it as "0" after such
> a patch.

I would say that it is a good thing.

The only (somewhat) end-user controlled things on the line are the
name and email, and even there name is sanitized to remove "crud".
The user-supplied timestamp goes through date.c::parse_date(),
ending up with what date.c::date_string() formats, so there will not
be syntactically incorrect timestamp there.  So we can be strict
format-wise on the timestamp field, once we identify where it begins,
which is the point of scanning backwards for '>'.

Unless the user does "hash-object" and deliberately creates a
malformed commit object---they can keep both halves just fine in
such a case as long as we do reject such a timestamp correctly.

> The alternative is to check _all_ of the characters between ">" and the
> newline and make sure there is some digit somewhere, which would be
> sufficient to prevent strtoumax() from walking past the newline.
>
> I guess it's not even any more expensive in the normal case (since the
> very first non-whitespace entry should be a digit!). I'm not sure it's
> worth caring about too much either way. Garbage making it into
> name/email is an easy mistake to make (for users and implementations).
> Putting whitespace control codes into your timestamp is not, and marking
> them as "0" is an OK outcome.

Yeah, I think it is fine either way.

Thanks
Patch

diff --git a/commit.c b/commit.c
index bb340f66fa..2f1b5d505b 100644
--- a/commit.c
+++ b/commit.c
@@ -120,10 +120,23 @@  static timestamp_t parse_commit_date(const char *buf, const char *tail)
 	dateptr = eol;
 	while (dateptr > buf && dateptr[-1] != '>')
 		dateptr--;
-	if (dateptr == buf || dateptr == eol)
+	if (dateptr == buf)
 		return 0;
 
-	/* dateptr < eol && *eol == '\n', so parsing will stop at eol */
+	/*
+	 * Trim leading whitespace; parse_timestamp() will do this itself, but
+	 * if we have _only_ whitespace, it will walk right past the newline
+	 * while doing so.
+	 */
+	while (dateptr < eol && isspace(*dateptr))
+		dateptr++;
+	if (dateptr == eol)
+		return 0;
+
+	/*
+	 * We know there is at least one non-whitespace character, so we'll
+	 * begin parsing there and stop at worst case at eol.
+	 */
 	return parse_timestamp(dateptr, NULL, 10);
 }
 
diff --git a/t/t4212-log-corrupt.sh b/t/t4212-log-corrupt.sh
index af4b35ff56..d4ef48d646 100755
--- a/t/t4212-log-corrupt.sh
+++ b/t/t4212-log-corrupt.sh
@@ -92,4 +92,33 @@  test_expect_success 'absurdly far-in-future date' '
 	git log -1 --format=%ad $commit
 '
 
+test_expect_success 'create commit with whitespace committer date' '
+	# It is important that this subject line is numeric, since we want to
+	# be sure we are not confused by skipping whitespace and accidentally
+	# parsing the subject as a timestamp.
+	#
+	# Do not use munge_author_date here. Besides not hitting the committer
+	# line, it leaves the timezone intact, and we want nothing but
+	# whitespace.
+	test_commit 1234567890 &&
+	git cat-file commit HEAD >commit.orig &&
+	sed "s/>.*/>    /" <commit.orig >commit.munge &&
+	ws_commit=$(git hash-object --literally -w -t commit commit.munge)
+'
+
+test_expect_success '--until treats whitespace date as sentinel' '
+	echo $ws_commit >expect &&
+	git rev-list --until=1980-01-01 $ws_commit >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'pretty-printer handles whitespace date' '
+	# as with the %ad test above, we will show these as the empty string,
+	# not the 1970 epoch date. This is intentional; see 7d9a281941 (t4212:
+	# test bogus timestamps with git-log, 2014-02-24) for more discussion.
+	echo : >expect &&
+	git log -1 --format="%at:%ct" $ws_commit >actual &&
+	test_cmp expect actual
+'
+
 test_done