diff mbox series

fast-import: Reinitialize command_buf rather than detach it.

Message ID 20190825041348.31835-1-mh@glandium.org (mailing list archive)
State New, archived
Headers show
Series fast-import: Reinitialize command_buf rather than detach it. | expand

Commit Message

Mike Hommey Aug. 25, 2019, 4:13 a.m. UTC
command_buf.buf is also stored in cmd_hist, so instead of
strbuf_release, the current code uses strbuf_detach in order to
"leak" the buffer as far as the strbuf is concerned.

However, strbuf_detach does more than "leak" the strbuf buffer: it
possibly reallocates it to ensure a terminating nul character. And when
that happens, what is already stored in cmd_hist may now be a free()d
buffer.

In practice, though, command_buf.buf is most of the time a nul
terminated string, meaning command_buf.len < command_buf.alloc, and
strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
allocate a 1 byte buffer to store a nul character in it, which is then
leaked.

Since the code using strbuf_detach is assuming it does nothing to
command_buf.buf, it's overall safer to use strbuf_init, which has
the same practical effect in the usual case, and works appropriately
when command_buf is empty.
---
 fast-import.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Jeff King Aug. 25, 2019, 6:57 a.m. UTC | #1
On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote:

> command_buf.buf is also stored in cmd_hist, so instead of
> strbuf_release, the current code uses strbuf_detach in order to
> "leak" the buffer as far as the strbuf is concerned.
> 
> However, strbuf_detach does more than "leak" the strbuf buffer: it
> possibly reallocates it to ensure a terminating nul character. And when
> that happens, what is already stored in cmd_hist may now be a free()d
> buffer.
> 
> In practice, though, command_buf.buf is most of the time a nul
> terminated string, meaning command_buf.len < command_buf.alloc, and
> strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
> command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
> allocate a 1 byte buffer to store a nul character in it, which is then
> leaked.

I think this is stronger than just "most of the time". It's an invariant
for strbufs to have a NUL, so the only case where detaching isn't a noop
is the empty slopbuf case you mention.

Splitting hairs, perhaps, but I think with that explanation, we could
probably argue that this case will never come up: strbuf_getline will
either have allocated a buffer or will have returned EOF.

That said, I do think it's quite confusing and is worth fixing, both for
readability and for future-proofing. But...

> diff --git a/fast-import.c b/fast-import.c
> index b44d6a467e..b1d07efe8c 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1763,7 +1763,9 @@ static int read_next_command(void)
>  		} else {
>  			struct recent_command *rc;
>  
> -			strbuf_detach(&command_buf, NULL);
> +			// command_buf is enther empty or also stored in cmd_hist,
> +			// reinitialize it.
> +			strbuf_init(&command_buf, 0);

This whole thing is a sign that the code is Doing It Wrong. Whoever is
taking ownership of the buffer should be calling strbuf_detach() at that
point.

It's a bit tricky, though, because we take ownership and then expect
people still look at command_buf.buf. Which makes me concerned that
there are other operations which might modify the buffer and have the
same issue.

I think it would be much easier to follow if we simply used the same
command_buf over and over, and just copied into the history. The cost is
about the same (we still alloc once per line, though we do an extra
memcpy now). So I thought something like this would work (we don't need
to convert those detaches into a strbuf_reset() calls because
strbuf_getline does so automatically):

diff --git a/fast-import.c b/fast-import.c
index b44d6a467e..31207acd61 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1763,7 +1763,6 @@ static int read_next_command(void)
 		} else {
 			struct recent_command *rc;
 
-			strbuf_detach(&command_buf, NULL);
 			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
 			if (stdin_eof)
 				return EOF;
@@ -1784,7 +1783,7 @@ static int read_next_command(void)
 				free(rc->buf);
 			}
 
-			rc->buf = command_buf.buf;
+			rc->buf = xstrdup(command_buf.buf);
 			rc->prev = cmd_tail;
 			rc->next = cmd_hist.prev;
 			rc->prev->next = rc;
@@ -1833,7 +1832,6 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 		char *term = xstrdup(data);
 		size_t term_len = command_buf.len - (data - command_buf.buf);
 
-		strbuf_detach(&command_buf, NULL);
 		for (;;) {
 			if (strbuf_getline_lf(&command_buf, stdin) == EOF)
 				die("EOF in data (terminator '%s' not found)", term);

But it doesn't! It turns out that there are other places in the code
which assume they can call read_next_command() while hanging onto the
existing buffer. Which only works in the old code because the history
feature happens to hold on to the old pointer.

If I do this on top, then all tests pass:

diff --git a/fast-import.c b/fast-import.c
index 31207acd61..1f9160b645 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -2586,7 +2586,7 @@ static void parse_new_commit(const char *arg)
 	struct branch *b;
 	char *author = NULL;
 	char *committer = NULL;
-	const char *encoding = NULL;
+	char *encoding = NULL;
 	struct hash_list *merge_list = NULL;
 	unsigned int merge_count;
 	unsigned char prev_fanout, new_fanout;
@@ -2609,8 +2609,10 @@ static void parse_new_commit(const char *arg)
 	}
 	if (!committer)
 		die("Expected committer but didn't get one");
-	if (skip_prefix(command_buf.buf, "encoding ", &encoding))
+	if (skip_prefix(command_buf.buf, "encoding ", &v)) {
+		encoding = xstrdup(v);
 		read_next_command();
+	}
 	parse_data(&msg, 0, NULL);
 	read_next_command();
 	parse_from(b);
@@ -2684,6 +2686,7 @@ static void parse_new_commit(const char *arg)
 	strbuf_addbuf(&new_data, &msg);
 	free(author);
 	free(committer);
+	free(encoding);
 
 	if (!store_object(OBJ_COMMIT, &new_data, NULL, &b->oid, next_mark))
 		b->pack_id = pack_id;

And I think this is actually a real bug in the current code! We keep a
pointer to the encoding string, which survives because of the history.
But that history is bounded, and we could have an indefinite number of
changed files in the middle. If I modify t9300 like this:

diff --git a/t/t9300-fast-import.sh b/t/t9300-fast-import.sh
index 141b7fa35e..d4bbe630d5 100755
--- a/t/t9300-fast-import.sh
+++ b/t/t9300-fast-import.sh
@@ -3314,6 +3314,11 @@ test_expect_success 'X: handling encoding' '
 
 	printf "Pi: \360\nCOMMIT\n" >>input &&
 
+	for i in $(test_seq 200)
+	do
+		echo "M 644 $EMPTY_BLOB file-$i"
+	done >>input &&
+
 	git fast-import <input &&
 	git cat-file -p encoding | grep $(printf "\360") &&
 	git log -1 --format=%B encoding | grep $(printf "\317\200")

and run the test suite (using an unmodified git, without the earlier
patches I showed) then the test fails, putting garbage into the encoding
header (and when compiled with ASan, reports a use-after-free).

So I think the way the string handling is currently done papers over
such problems. You only see the problem if you have a hundred or more
modified files, so it works _most_ of the time.

That implies to me it's worth following a fix like the one I showed
above. I am slightly concerned there are other similar cases to the
"encoding" one lurking (and they _might_ not be bugs already, if there's
a fixed number of reads between the pointer being saved and being used),
but it seems worth the risk.

-Peff
Mike Hommey Aug. 25, 2019, 7:20 a.m. UTC | #2
On Sun, Aug 25, 2019 at 02:57:48AM -0400, Jeff King wrote:
> On Sun, Aug 25, 2019 at 01:13:48PM +0900, Mike Hommey wrote:
> 
> > command_buf.buf is also stored in cmd_hist, so instead of
> > strbuf_release, the current code uses strbuf_detach in order to
> > "leak" the buffer as far as the strbuf is concerned.
> > 
> > However, strbuf_detach does more than "leak" the strbuf buffer: it
> > possibly reallocates it to ensure a terminating nul character. And when
> > that happens, what is already stored in cmd_hist may now be a free()d
> > buffer.
> > 
> > In practice, though, command_buf.buf is most of the time a nul
> > terminated string, meaning command_buf.len < command_buf.alloc, and
> > strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
> > command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
> > allocate a 1 byte buffer to store a nul character in it, which is then
> > leaked.
> 
> I think this is stronger than just "most of the time". It's an invariant
> for strbufs to have a NUL, so the only case where detaching isn't a noop
> is the empty slopbuf case you mention.

If it's an invariant, why does detach actively tries to realloc and set
the nul terminator as if it can happen in more cases than when using the
slopbuf?

> Splitting hairs, perhaps, but I think with that explanation, we could
> probably argue that this case will never come up: strbuf_getline will
> either have allocated a buffer or will have returned EOF.

Note that the slopbuf case _does_ come up, and we always leak a 1 byte
buffer.

> That said, I do think it's quite confusing and is worth fixing, both for
> readability and for future-proofing. But...
(...)

I do agree the way fast-import works between cmd_hist and command_buf is
very brittle, as you've shown. I didn't feel like digging into it
though. Thanks for having gone further than I did.

Mike
Jeff King Aug. 25, 2019, 7:28 a.m. UTC | #3
On Sun, Aug 25, 2019 at 04:20:31PM +0900, Mike Hommey wrote:

> > I think this is stronger than just "most of the time". It's an invariant
> > for strbufs to have a NUL, so the only case where detaching isn't a noop
> > is the empty slopbuf case you mention.
> 
> If it's an invariant, why does detach actively tries to realloc and set
> the nul terminator as if it can happen in more cases than when using the
> slopbuf?

It calls strbuf_grow() to handle the slopbuf case (we can't hand off the
slopbuf, since the caller expects an allocated buffer). It just doesn't
bother to distinguish that case itself, and lets strbuf_grow() handle
it.

I think it would be equally correct for strbuf_detach() to do:

  if (!sb->alloc)
	strbuf_grow(0);

> > Splitting hairs, perhaps, but I think with that explanation, we could
> > probably argue that this case will never come up: strbuf_getline will
> > either have allocated a buffer or will have returned EOF.
> 
> Note that the slopbuf case _does_ come up, and we always leak a 1 byte
> buffer.

Hmm, I suppose so, on the very first call before we've read anything
(and likewise if parse_data() reset it then got an EOF, and we then
tried to read another command).

> I do agree the way fast-import works between cmd_hist and command_buf is
> very brittle, as you've shown. I didn't feel like digging into it
> though. Thanks for having gone further than I did.

I'll see if I can shape my rambling into a patch.

-Peff
René Scharfe Aug. 25, 2019, 12:35 p.m. UTC | #4
Am 25.08.19 um 06:13 schrieb Mike Hommey:
> command_buf.buf is also stored in cmd_hist, so instead of
> strbuf_release, the current code uses strbuf_detach in order to
> "leak" the buffer as far as the strbuf is concerned.

Hmm.

> However, strbuf_detach does more than "leak" the strbuf buffer: it
> possibly reallocates it to ensure a terminating nul character. And when
> that happens, what is already stored in cmd_hist may now be a free()d
> buffer.
>
> In practice, though, command_buf.buf is most of the time a nul
> terminated string, meaning command_buf.len < command_buf.alloc, and
> strbuf_detach is a no-op. BUT, when it's not (e.g. on the first call),
> command_buf.buf is &strbuf_slopbuf. In that case, strbuf_detach does
> allocate a 1 byte buffer to store a nul character in it, which is then
> leaked.

And that's why a pointer to a strbuf buf is valid until the next strbuf_
function call.

>
> Since the code using strbuf_detach is assuming it does nothing to
> command_buf.buf, it's overall safer to use strbuf_init, which has
> the same practical effect in the usual case, and works appropriately
> when command_buf is empty.
> ---
>  fast-import.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/fast-import.c b/fast-import.c
> index b44d6a467e..b1d07efe8c 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -1763,7 +1763,9 @@ static int read_next_command(void)
>  		} else {
>  			struct recent_command *rc;
>
> -			strbuf_detach(&command_buf, NULL);
> +			// command_buf is enther empty or also stored in cmd_hist,
> +			// reinitialize it.
> +			strbuf_init(&command_buf, 0);

This is a no-op; strbuf_detach already re-initializes the strbuf.

(And double-slash comments are avoided in Git code..)

((s/enther/either/))

>  			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
>  			if (stdin_eof)
>  				return EOF;
> @@ -1833,7 +1835,9 @@ static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
>  		char *term = xstrdup(data);
>  		size_t term_len = command_buf.len - (data - command_buf.buf);
>
> -		strbuf_detach(&command_buf, NULL);
> +		// command_buf is enther empty or also stored in cmd_hist,
> +		// reinitialize it.
> +		strbuf_init(&command_buf, 0);

Same here.

>  		for (;;) {
>  			if (strbuf_getline_lf(&command_buf, stdin) == EOF)
>  				die("EOF in data (terminator '%s' not found)", term);
>

strbuf_detach() is handing over ownership of the buffer.  Its return
value should be stored; grabbing the pointer beforehand is naughty
because it can get stale, as you noted.  I doubt there is a good reason
for ignoring the return value of strbuf_detach(), ever.

René
diff mbox series

Patch

diff --git a/fast-import.c b/fast-import.c
index b44d6a467e..b1d07efe8c 100644
--- a/fast-import.c
+++ b/fast-import.c
@@ -1763,7 +1763,9 @@  static int read_next_command(void)
 		} else {
 			struct recent_command *rc;
 
-			strbuf_detach(&command_buf, NULL);
+			// command_buf is enther empty or also stored in cmd_hist,
+			// reinitialize it.
+			strbuf_init(&command_buf, 0);
 			stdin_eof = strbuf_getline_lf(&command_buf, stdin);
 			if (stdin_eof)
 				return EOF;
@@ -1833,7 +1835,9 @@  static int parse_data(struct strbuf *sb, uintmax_t limit, uintmax_t *len_res)
 		char *term = xstrdup(data);
 		size_t term_len = command_buf.len - (data - command_buf.buf);
 
-		strbuf_detach(&command_buf, NULL);
+		// command_buf is enther empty or also stored in cmd_hist,
+		// reinitialize it.
+		strbuf_init(&command_buf, 0);
 		for (;;) {
 			if (strbuf_getline_lf(&command_buf, stdin) == EOF)
 				die("EOF in data (terminator '%s' not found)", term);