[1/2] fast-import: duplicate parsed encoding string

Message ID	20190825080821.GA31824@sigill.intra.peff.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=+9GI=WV=vger.kernel.org=git-owner@kernel.org> Date: Sun, 25 Aug 2019 04:08:21 -0400 From: Jeff King <peff@peff.net> To: Mike Hommey <mh@glandium.org> Cc: Elijah Newren <newren@gmail.com>, git@vger.kernel.org, gitster@pobox.com Subject: [PATCH 1/2] fast-import: duplicate parsed encoding string Message-ID: <20190825080821.GA31824@sigill.intra.peff.net> References: <20190825080640.GA31453@sigill.intra.peff.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20190825080640.GA31453@sigill.intra.peff.net> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	fast-import input string handling bugs \| expand [0/2] fast-import input string handling bugs [1/2] fast-import: duplicate parsed encoding string [2/2] fast-import: duplicate into history rather than passing ownership

Message ID

20190825080821.GA31824@sigill.intra.peff.net (mailing list archive)

State

New, archived

Headers

Date: Sun, 25 Aug 2019 04:08:21 -0400
From: Jeff King <peff@peff.net>
To: Mike Hommey <mh@glandium.org>
Cc: Elijah Newren <newren@gmail.com>, git@vger.kernel.org,
        gitster@pobox.com
Subject: [PATCH 1/2] fast-import: duplicate parsed encoding string
Message-ID: <20190825080821.GA31824@sigill.intra.peff.net>
References: <20190825080640.GA31453@sigill.intra.peff.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20190825080640.GA31453@sigill.intra.peff.net>
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

fast-import input string handling bugs | expand

Commit Message

Jeff King Aug. 25, 2019, 8:08 a.m. UTC

We read each line of the fast-import stream into the command_buf strbuf.
When reading a commit, we parse a line like "encoding foo" by storing a
pointer to "foo", but not making a copy. We may then read an unbounded
number of other lines (e.g., one for each modified file in the commit),
each of which writes into command_buf.

This works out in practice for small cases, because we hand off
ownership of the heap buffer from command_buf to the cmd_hist array, and
read new commands into a fresh heap buffer. And thus the pointer to
"foo" remains valid as long as there aren't so many intermediate lines
that we end up dropping the original "encoding" line from the history.

But as the test modification shows, if we go over our default of 100
lines, we end up with our encoding string pointing into freed heap
memory. This seems to fail reliably by writing garbage into the output,
but running under ASan definitely detects this as a user-after-free.

We can fix it by duplicating the encoding value, just as we do for other
parsed lines (e.g., an author line ends up in parse_ident, which copies
it to a new string).

Signed-off-by: Jeff King <peff@peff.net>
---
 fast-import.c          | 7 +++++--
 t/t9300-fast-import.sh | 5 +++++
 2 files changed, 10 insertions(+), 2 deletions(-)

Comments

Elijah Newren Aug. 26, 2019, 6:28 p.m. UTC | #1

On Sun, Aug 25, 2019 at 1:08 AM Jeff King <peff@peff.net> wrote:
>
> We read each line of the fast-import stream into the command_buf strbuf.
> When reading a commit, we parse a line like "encoding foo" by storing a
> pointer to "foo", but not making a copy. We may then read an unbounded
> number of other lines (e.g., one for each modified file in the commit),
> each of which writes into command_buf.
>
> This works out in practice for small cases, because we hand off
> ownership of the heap buffer from command_buf to the cmd_hist array, and
> read new commands into a fresh heap buffer. And thus the pointer to
> "foo" remains valid as long as there aren't so many intermediate lines
> that we end up dropping the original "encoding" line from the history.
>
> But as the test modification shows, if we go over our default of 100
> lines, we end up with our encoding string pointing into freed heap
> memory. This seems to fail reliably by writing garbage into the output,
> but running under ASan definitely detects this as a user-after-free.

s/user-after-free/use-after-free/

> We can fix it by duplicating the encoding value, just as we do for other
> parsed lines (e.g., an author line ends up in parse_ident, which copies
> it to a new string).

Eek!  Thanks for fixing this up for me; patch looks good.

Jeff King Aug. 26, 2019, 6:44 p.m. UTC | #2

On Mon, Aug 26, 2019 at 11:28:58AM -0700, Elijah Newren wrote:

> On Sun, Aug 25, 2019 at 1:08 AM Jeff King <peff@peff.net> wrote:
> >
> > We read each line of the fast-import stream into the command_buf strbuf.
> > When reading a commit, we parse a line like "encoding foo" by storing a
> > pointer to "foo", but not making a copy. We may then read an unbounded
> > number of other lines (e.g., one for each modified file in the commit),
> > each of which writes into command_buf.
> >
> > This works out in practice for small cases, because we hand off
> > ownership of the heap buffer from command_buf to the cmd_hist array, and
> > read new commands into a fresh heap buffer. And thus the pointer to
> > "foo" remains valid as long as there aren't so many intermediate lines
> > that we end up dropping the original "encoding" line from the history.
> >
> > But as the test modification shows, if we go over our default of 100
> > lines, we end up with our encoding string pointing into freed heap
> > memory. This seems to fail reliably by writing garbage into the output,
> > but running under ASan definitely detects this as a user-after-free.
> 
> s/user-after-free/use-after-free/

Wow. I self-corrected "user-after-free" at least three other times while
writing this thread. I clearly have a problem. :)

> > We can fix it by duplicating the encoding value, just as we do for other
> > parsed lines (e.g., an author line ends up in parse_ident, which copies
> > it to a new string).
> 
> Eek!  Thanks for fixing this up for me; patch looks good.

No problem. On the plus side, finding your bug made me think much harder
about the implications of patch 2 (because it's quite subtle and an easy
mistake to make).

-Peff

diff --git a/fast-import.c b/fast-import.c
index b44d6a467e..ee7258037a 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2588,7 +2588,7 @@  static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
-	const char *encoding = NULL;
+	char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2611,8 +2611,10 @@  static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
-	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+	if (skip_prefix(command_buf.buf, "encoding ", &v)) {
+		encoding = xstrdup(v);
 		read_next_command();
+	}
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2686,6 +2688,7 @@  static void parse_new_commit(const char *arg)
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
+	free(encoding);
 
 	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
 		b->pack_id = pack_id;
diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 141b7fa35e..cf66b40ebc 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3314,6 +3314,11 @@  test_expect_success 'X: handling encoding' '
 
 	printf "Pi: \360\nCOMMIT\n" >>input &&
 
+	for i in $(test_seq 100)
+	do
+		echo "M 644 $EMPTY_BLOB file-$i"
+	done >>input &&
+
 	git fast-import <input &&
 	git cat-file -p encoding | grep $(printf "\360") &&
 	git log -1 --format=%B encoding | grep $(printf "\317\200")

[1/2] fast-import: duplicate parsed encoding string

Commit Message

Comments

Patch