[RFC/PATCH,0/6] hash-object: use fsck to check objects

Message ID: Y8hX+pIZUKXsyYj5@coredump.intra.peff.net

Message

Jeff King Jan. 18, 2023, 8:35 p.m. UTC
Right now "git hash-object" will do some basic sanity checks of the
input using the usual parser code. This series teaches it to use the
fsck code instead, which should catch more things. See patch 6 for some
discussion of the implications.

The reason this is marked as an RFC is that at the end, compiling with
SANITIZE=address will provoke a failure in t3800. The issue is that
fsck_tag_standalone(), when fed a buffer/size combo, will look for a NUL
at the end of the headers, which might be buffer[size]. This is usually
OK for objects we've loaded from the odb, because we intentionally stick
an extra NUL at the end for safety. But here index_mem() may get an
arbitrary buffer.

I'm not sure yet of the right path forward. It's not too hard to add an
extra NUL in most cases, but one code path will mmap a file on disk. And
sticking a NUL there is hard (we already went down that road trying to
avoid REG_STARTEND for grep, and there wasn't a good solution).
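
As an illustration only (not part of the series), the "extra NUL in
most cases" option more or less boils down to fsck-ing a defensive
copy; fsck_buffer() below is just a stand-in for whatever entry point
patch 5 ends up exposing, and copying obviously does nothing useful for
the mmap path:

  /*
   * Sketch only: guarantee that one byte past "size" exists and is NUL
   * by copying the input; xmemdupz() allocates size + 1 bytes and
   * NUL-terminates for us.
   */
  static int fsck_nul_terminated(const struct object_id *oid,
                                 enum object_type type,
                                 const char *buf, unsigned long size,
                                 struct fsck_options *options)
  {
          char *copy = xmemdupz(buf, size);
          int ret = fsck_buffer(oid, type, copy, size, options);

          free(copy);
          return ret;
  }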

The other option is having the fsck code avoid looking past the size it
was given. I think the intent is that this should work, from commits
like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the
buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which
won't respect the size, but I think[1] that's OK because we'll have
parsed up to the end-of-header beforehand (and those functions would
never match past there).

Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body
\n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of
custom verify_tag(), 2021-01-05) were buggy, and we can just fix them.

[1] But I said "I think" above because it can get pretty subtle. There's
    some more discussion in this thread:

      https://lore.kernel.org/git/20150625155128.C3E9738005C@gemini.denx.de/

    but I haven't yet convinced myself it's safe. This is exactly the
    kind of analysis I wish I had the power to nerd-snipe René into.

Anyway, here are the patches in the meantime. I do think this is a good
direction overall, modulo addressing the NUL-terminator question.

  [1/6]: t1007: modernize malformed object tests
  [2/6]: t1006: stop using 0-padded timestamps
  [3/6]: t7030: stop using invalid tag name
  [4/6]: t: use hash-object --literally when creating malformed objects
  [5/6]: fsck: provide a function to fsck buffer without object struct
  [6/6]: hash-object: use fsck for object checks

 fsck.c                           | 29 ++++++++++-------
 fsck.h                           |  8 +++++
 object-file.c                    | 55 +++++++++++++-------------------
 t/t1006-cat-file.sh              |  6 ++--
 t/t1007-hash-object.sh           | 29 +++++++++++------
 t/t1450-fsck.sh                  | 28 ++++++++--------
 t/t4054-diff-bogus-tree.sh       |  2 +-
 t/t4058-diff-duplicates.sh       |  2 +-
 t/t4212-log-corrupt.sh           |  4 +--
 t/t5302-pack-index.sh            |  2 +-
 t/t5504-fetch-receive-strict.sh  |  2 +-
 t/t5702-protocol-v2.sh           |  2 +-
 t/t6300-for-each-ref.sh          |  2 +-
 t/t7030-verify-tag.sh            |  2 +-
 t/t7031-verify-tag-signed-ssh.sh |  2 +-
 t/t7509-commit-authorship.sh     |  2 +-
 t/t7510-signed-commit.sh         |  2 +-
 t/t7528-signed-commit-ssh.sh     |  2 +-
 t/t8003-blame-corner-cases.sh    |  2 +-
 t/t9350-fast-export.sh           |  2 +-
 20 files changed, 101 insertions(+), 84 deletions(-)

-Peff

Comments

Jeff King Jan. 18, 2023, 8:46 p.m. UTC | #1
On Wed, Jan 18, 2023 at 03:35:06PM -0500, Jeff King wrote:

> The other option is having the fsck code avoid looking past the size it
> was given. I think the intent is that this should work, from commits
> like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the
> buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which
> won't respect the size, but I think[1] that's OK because we'll have
> parsed up to the end-of-header beforehand (and those functions would
> never match past there).
> 
> Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body
> \n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of
> custom verify_tag(), 2021-01-05) were buggy, and we can just fix them.

That would look something like this:

diff --git a/fsck.c b/fsck.c
index c2c8facd2d..d220276bcb 100644
--- a/fsck.c
+++ b/fsck.c
@@ -898,6 +898,7 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 {
 	int ret = 0;
 	char *eol;
+	const char *eob = buffer + size;
 	struct strbuf sb = STRBUF_INIT;
 	const char *p;
 
@@ -960,10 +961,8 @@ int fsck_tag_standalone(const struct object_id *oid, const char *buffer,
 	}
 	else
 		ret = fsck_ident(&buffer, oid, OBJ_TAG, options);
-	if (!*buffer)
-		goto done;
 
-	if (!starts_with(buffer, "\n")) {
+	if (buffer != eob && *buffer != '\n') {
 		/*
 		 * The verify_headers() check will allow
 		 * e.g. "[...]tagger <tagger>\nsome

Changing the starts_with() is not strictly necessary, but I think it
makes it more clear that we are only going to look at the one character
we confirmed is still valid inside the buffer.

This is enough to have the whole test suite pass with ASan/UBSan after
my series. But as I said earlier, I'd want to look carefully at the rest
of the fsck code to make sure there aren't any other possible inputs
that could look past the end of the buffer.

-Peff

Junio C Hamano Jan. 18, 2023, 8:59 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

>   [1/6]: t1007: modernize malformed object tests

Obviously good.

>   [2/6]: t1006: stop using 0-padded timestamps
>   [3/6]: t7030: stop using invalid tag name

These two are pleasant to see, and revealed what was being "accepted"
by mistake, quite surprisingly.

>   [4/6]: t: use hash-object --literally when creating malformed objects

The --literally option was invented initially primarily to allow a
bogus type of object (e.g. "hash-object -t xyzzy --literally") but I
am happy to see that we are finding different uses.  I wonder if
these objects of known types but with syntactically bad contents can
be "repack"ed from loose into packed?

>   [5/6]: fsck: provide a function to fsck buffer without object struct

Obvious, clean and very nice.

>   [6/6]: hash-object: use fsck for object checks

Taylor Blau Jan. 18, 2023, 9:38 p.m. UTC | #3
On Wed, Jan 18, 2023 at 12:59:24PM -0800, Junio C Hamano wrote:
> The --literally option was invented initially primarily to allow a
> bogus type of object (e.g. "hash-object -t xyzzy --literally") but I
> am happy to see that we are finding different uses.  I wonder if
> these objects of known types but with syntactically bad contents can
> be "repack"ed from loose into packed?
>
> >   [5/6]: fsck: provide a function to fsck buffer without object struct

It is indeed possible:

--- >8 ---
Initialized empty Git repository in /home/ttaylorr/src/git/t/trash directory.t9999-test/.git/
expecting success of 9999.1 'repacking corrupt loose object into packed':
	name=$(echo $ZERO_OID | sed -e "s/00/Q/g") &&
	printf "100644 fooQ$name" | q_to_nul |
		git hash-object -w --stdin -t tree >in &&

	git pack-objects .git/objects/pack/pack <in

Enumerating objects: 1, done.
Counting objects: 100% (1/1), done.
06146c77fd19c096858d6459d602be0fdf10891b
Writing objects: 100% (1/1), done.
Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
ok 1 - repacking corrupt loose object into packed
--- 8< ---

Thanks,
Taylor

Jeff King Jan. 19, 2023, 1:39 a.m. UTC | #4
On Wed, Jan 18, 2023 at 03:35:06PM -0500, Jeff King wrote:

> The other option is having the fsck code avoid looking past the size it
> was given. I think the intent is that this should work, from commits
> like 4d0d89755e (Make sure fsck_commit_buffer() does not run out of the
> buffer, 2014-09-11). We do use skip_prefix() and parse_oid_hex(), which
> won't respect the size, but I think[1] that's OK because we'll have
> parsed up to the end-of-header beforehand (and those functions would
> never match past there).
> 
> Which would mean that 9a1a3a4d4c (mktag: allow omitting the header/body
> \n separator, 2021-01-05) and acf9de4c94 (mktag: use fsck instead of
> custom verify_tag(), 2021-01-05) were buggy, and we can just fix them.
> 
> [1] But I said "I think" above because it can get pretty subtle. There's
>     some more discussion in this thread:
> 
>       https://lore.kernel.org/git/20150625155128.C3E9738005C@gemini.denx.de/
> 
>     but I haven't yet convinced myself it's safe. This is exactly the
>     kind of analysis I wish I had the power to nerd-snipe René into.

I poked at this a bit more, and it definitely isn't safe. I think the
use of skip_prefix(), etc, is OK, because they'd always stop at an
unexpected newline. But verify_headers() is only confirming that we have
a series of complete lines, and we might end with no "\n\n" (and hence
no commit/tag message). And so the obvious case that fools us is one
where the data simply ends at a newline, but we are missing one or more
headers. So a truncated commit like:

  tree 1234567890123456789012345678901234567890

(with the newline at the end of the "tree" line, but nothing else) will
cause fsck_commit() to look past the "size" we pass it. With all of the
current callers, that means it sees a NUL and bails. So it's not
currently a bug, but it becomes one if we can feed it arbitrary buffers.

Fixing it isn't _too_ bad, and could look something like this:

diff --git a/fsck.c b/fsck.c
index c2c8facd2d..3f0bb3e350 100644
--- a/fsck.c
+++ b/fsck.c
@@ -834,6 +834,7 @@ static int fsck_commit(const struct object_id *oid,
 	unsigned author_count;
 	int err;
 	const char *buffer_begin = buffer;
+	const char *buffer_end = buffer + size;
 	const char *p;
 
 	if (verify_headers(buffer, size, oid, OBJ_COMMIT, options))
@@ -847,7 +848,7 @@ static int fsck_commit(const struct object_id *oid,
 			return err;
 	}
 	buffer = p + 1;
-	while (skip_prefix(buffer, "parent ", &buffer)) {
+	while (buffer < buffer_end && skip_prefix(buffer, "parent ", &buffer)) {
 		if (parse_oid_hex(buffer, &parent_oid, &p) || *p != '\n') {
 			err = report(options, oid, OBJ_COMMIT, FSCK_MSG_BAD_PARENT_SHA1, "invalid 'parent' line format - bad sha1");
 			if (err)
@@ -856,7 +857,7 @@ static int fsck_commit(const struct object_id *oid,
 		buffer = p + 1;
 	}
 	author_count = 0;
-	while (skip_prefix(buffer, "author ", &buffer)) {
+	while (buffer < buffer_end && skip_prefix(buffer, "author ", &buffer)) {
 		author_count++;
 		err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 		if (err)
@@ -868,7 +869,7 @@ static int fsck_commit(const struct object_id *oid,
 		err = report(options, oid, OBJ_COMMIT, FSCK_MSG_MULTIPLE_AUTHORS, "invalid format - multiple 'author' lines");
 	if (err)
 		return err;
-	if (!skip_prefix(buffer, "committer ", &buffer))
+	if (buffer >= buffer_end || !skip_prefix(buffer, "committer ", &buffer))
 		return report(options, oid, OBJ_COMMIT, FSCK_MSG_MISSING_COMMITTER, "invalid format - expected 'committer' line");
 	err = fsck_ident(&buffer, oid, OBJ_COMMIT, options);
 	if (err)

And then the tag side would need something similar. I'd probably also
sprinkle some comments in verify_headers() and its callers documenting
our assumptions and what's OK to do (string-like parsing functions work
as long as they stop when they hit a newline). That, plus a few tests
covering the problematic cases to avoid regressions, would probably be
OK.
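
The comments I have in mind would say something roughly like this
(wording made up on the spot, not from an actual patch):

  /*
   * verify_headers() only guarantees that [buffer, buffer + size) is a
   * sequence of complete, newline-terminated lines.  It does NOT
   * guarantee a trailing NUL, nor that a blank line separates the
   * headers from the body.  Callers may therefore use string-like
   * parsers (skip_prefix(), parse_oid_hex(), ...) only where they are
   * sure to stop at a newline, and must check buffer < buffer_end
   * before looking for the next header.
   */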

I think fsck_tree() is mostly fine, as the tree-iterating code detects
truncation. Though I do find the use of strlen() in decode_tree_entry()
a little suspicious (and that would be true of the current code, as
well, since it powers hash-object's existing parsing check).

-Peff

Jeff King Jan. 19, 2023, 2:03 a.m. UTC | #5
On Wed, Jan 18, 2023 at 04:38:40PM -0500, Taylor Blau wrote:

> On Wed, Jan 18, 2023 at 12:59:24PM -0800, Junio C Hamano wrote:
> > The --literally option was invented initially primarily to allow a
> > bogus type of object (e.g. "hash-object -t xyzzy --literally") but I
> > am happy to see that we are finding different uses.  I wonder if
> > these objects of known types but with syntactically bad contents can
> > be "repack"ed from loose into packed?
> >
> > >   [5/6]: fsck: provide a function to fsck buffer without object struct
> 
> It is indeed possible:
> 
> --- >8 ---
> Initialized empty Git repository in /home/ttaylorr/src/git/t/trash directory.t9999-test/.git/
> expecting success of 9999.1 'repacking corrupt loose object into packed':
> 	name=$(echo $ZERO_OID | sed -e "s/00/Q/g") &&
> 	printf "100644 fooQ$name" | q_to_nul |
> 		git hash-object -w --stdin -t tree >in &&
> 
> 	git pack-objects .git/objects/pack/pack <in
> 
> Enumerating objects: 1, done.
> Counting objects: 100% (1/1), done.
> 06146c77fd19c096858d6459d602be0fdf10891b
> Writing objects: 100% (1/1), done.
> Total 1 (delta 0), reused 0 (delta 0), pack-reused 0
> ok 1 - repacking corrupt loose object into packed
> --- 8< ---

Right, we don't do any fsck-ing when packing objects. Nor should we, I
think. We should be checking objects when they come into the repository
(via index-pack/unpack-objects) or when they're created (hash-object),
but there's little need to do so when they migrate between storage
formats.

The fact that "--literally" manually writes a loose object is mostly an
implementation detail. I think if we are not writing an object with an
esoteric type, that it could even just hit the regular index_fd() code
path (and drop the HASH_FORMAT_CHECK flag).

If you do write one with "-t xyzzy", I think pack-objects would barf,
but not because of fsck checks. It just couldn't represent that type
(which really makes such objects pretty pointless; you cannot ever fetch
or push them!).

-Peff

René Scharfe Jan. 21, 2023, 9:36 a.m. UTC | #6
On 19.01.23 02:39, Jeff King wrote:
>
> Though I do find the use of strlen() in decode_tree_entry()
> a little suspicious (and that would be true of the current code, as
> well, since it powers hash-object's existing parsing check).

strlen() won't overrun the buffer because the first check in
decode_tree_entry() makes sure there is a NUL in the buffer ahead.
If get_mode() crosses it then we exit early.

Storing the result in an unsigned int can overflow on platforms where
size_t is bigger.  That would result in pathlen values being too short
for really long paths, but no out-of-bounds access.  They are then
stored as signed int in struct name_entry and used as such in many
places -- that seems like a bad idea, but I didn't actually check them
thoroughly.

get_mode() can overflow "mode" if there are too many octal digits.  Do
we need to accept more than two handfuls in the first place?  I'll send
a patch for at least rejecting overflow.
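
Something along these lines, perhaps (only a sketch of the overflow
check itself, not the actual patch; UINT_MAX comes in via
git-compat-util.h):

  static const char *get_mode_checked(const char *str, unsigned int *modep)
  {
          unsigned int mode = 0;

          if (*str == ' ')
                  return NULL;            /* empty mode field */
          while (*str != ' ') {
                  if (*str < '0' || *str > '7')
                          return NULL;    /* not octal (also stops at NUL) */
                  if (mode > (UINT_MAX - (*str - '0')) / 8)
                          return NULL;    /* "mode" would overflow */
                  mode = mode * 8 + (*str++ - '0');
          }
          *modep = mode;
          return str + 1;                 /* skip the space after the mode */
  }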

Hmm, what would be the performance impact of trees with mode fields
zero-padded to silly lengths?

René

Jeff King Jan. 22, 2023, 7:48 a.m. UTC | #7
On Sat, Jan 21, 2023 at 10:36:08AM +0100, René Scharfe wrote:

> On 19.01.23 02:39, Jeff King wrote:
> >
> > Though I do find the use of strlen() in decode_tree_entry()
> > a little suspicious (and that would be true of the current code, as
> > well, since it powers hash-object's existing parsing check).
> 
> strlen() won't overrun the buffer because the first check in
> decode_tree_entry() makes sure there is a NUL in the buffer ahead.
> If get_mode() crosses it then we exit early.

Yeah, that was what I found after digging deeper (see my patch 7).

> Storing the result in an unsigned int can overflow on platforms where
> size_t is bigger.  That would result in pathlen values being too short
> for really long paths, but no out-of-bounds access.  They are then
> stored as signed int in struct name_entry and used as such in many
> places -- that seems like a bad idea, but I didn't actually check them
> thoroughly.

Yeah, I agree that the use of a signed int there looks questionable. I
do think it's orthogonal to my series here, as that tree-decoding is
used by the existing hash-object checks.

But it probably bears further examination, especially because we use it
for the fsck checks on incoming objects via receive-pack, etc, which are
meant to be the first line of defense for hosters who might receive
malicious garbage from users.

We probably ought to reject trees with enormous names via fsck anyway. I
actually have a patch to do that, but of course it depends on
decode_tree_entry() to get the length, so there's a bit of
chicken-and-egg. We probably also should outright reject gigantic trees,
which closes out a whole class of integer truncation problems. I know
GitHub has rejected trees over 100MB for years for this reason.

> get_mode() can overflow "mode" if there are too many octal digits.  Do
> we need to accept more than two handfuls in the first place?  I'll send
> a patch for at least rejecting overflow.

Seems reasonable. I doubt there's an interesting attack here, just
because the mode isn't used to index anything. If you feed a garbage
mode that happens to overflow to something useful, you could just as
easily have sent the useful mode in the first place.

> Hmm, what would be the performance impact of trees with mode fields
> zero-padded to silly lengths?

Certainly it would waste some time parsing the tree, but you could do
that already with a long pathname. Or just having a lot of paths in a
tree. Or a lot of trees.

-Peff

René Scharfe Jan. 22, 2023, 11:39 a.m. UTC | #8
On 22.01.23 08:48, Jeff King wrote:
> On Sat, Jan 21, 2023 at 10:36:08AM +0100, René Scharfe wrote:
>
>> On 19.01.23 02:39, Jeff King wrote:
>>>
>>> Though I do find the use of strlen() in decode_tree_entry()
>>> a little suspicious (and that would be true of the current code, as
>>> well, since it powers hash-object's existing parsing check).
>>
>> strlen() won't overrun the buffer because the first check in
>> decode_tree_entry() makes sure there is a NUL in the buffer ahead.
>> If get_mode() crosses it then we exit early.
>
> Yeah, that was what I found after digging deeper (see my patch 7).
>
>> Storing the result in an unsigned int can overflow on platforms where
>> size_t is bigger.  That would result in pathlen values being too short
>> for really long paths, but no out-of-bounds access.  They are then
>> stored as signed int in struct name_entry and used as such in many
>> places -- that seems like a bad idea, but I didn't actually check them
>> thoroughly.
>
> Yeah, I agree that the use of a signed int there looks questionable. I
> do think it's orthogonal to my series here, as that tree-decoding is
> used by the existing hash-object checks.

Sure.

> But it probably bears further examination, especially because we use it
> for the fsck checks on incoming objects via receive-pack, etc, which are
> meant to be the first line of defense for hosters who might receive
> malicious garbage from users.
>
> We probably ought to reject trees with enormous names via fsck anyway. I
> actually have a patch to do that, but of course it depends on
> decode_tree_entry() to get the length, so there's a bit of
> chicken-and-egg.

Solvable by limiting the search for the end of the string in
decode_tree_entry() by using strnlen(3) or memchr(3) instead of
strlen(3).  You just need to define some (configurable?) limit.
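
Roughly like this, say (sketch only; the limit value and how the error
gets reported are placeholders):

  #define MAX_TREE_ENTRY_PATH 4096        /* made-up limit */

  static int entry_path_len(const char *path, size_t remaining, size_t *len)
  {
          size_t limit = remaining < MAX_TREE_ENTRY_PATH ?
                         remaining : MAX_TREE_ENTRY_PATH;
          const char *nul = memchr(path, '\0', limit);

          if (!nul)
                  return -1;      /* truncated entry or over-long path */
          *len = nul - path;
          return 0;
  }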

> We probably also should outright reject gigantic trees,
> which closes out a whole class of integer truncation problems. I know
> GitHub has rejected trees over 100MB for years for this reason.

Makes sense.

>> get_mode() can overflow "mode" if there are too many octal digits.  Do
>> we need to accept more than two handfuls in the first place?  I'll send
>> a patch for at least rejecting overflow.
>
> Seems reasonable. I doubt there's an interesting attack here, just
> because the mode isn't used to index anything. If you feed a garbage
> mode that happens to overflow to something useful, you could just as
> easily have sent the useful mode in the first place.
>
>> Hmm, what would be the performance impact of trees with mode fields
>> zero-padded to silly lengths?
>
> Certainly it would waste some time parsing the tree, but you could do
> that already with a long pathname. Or just having a lot of paths in a
> tree. Or a lot of trees.

That's a cup half full/empty thing, perhaps.  What's the harm in leading
zeros? vs. Why allow leading zeros?

René

Ævar Arnfjörð Bjarmason Feb. 1, 2023, 2:06 p.m. UTC | #9
On Sun, Jan 22 2023, René Scharfe wrote:

> On 22.01.23 08:48, Jeff King wrote:
>> We probably also should outright reject gigantic trees,
>> which closes out a whole class of integer truncation problems. I know
>> GitHub has rejected trees over 100MB for years for this reason.
>
> Makes sense.

I really don't think it does, let's not forever encode arbitrary limits
in the formats because of transitory implementation details.

Those sorts of arbitrary limits are understandable for hosting
providers, and a sensible trade-off on that front.

But for git as a general tool? I'd like to be able to throw whatever
garbage I've got locally at it, and not have it complain.

We already have a deluge of int vs. unsigned int vs. size_t warnings
that we could address, we're just choosing to suppress them
currently. Instead we have hacks like cast_size_t_to_int().

Those sorts of hacks are understandable as band-aid fixes, but let's
work on fixing the real causes.