[1/1] upload-pack: fix race condition in error messages

Message ID 5c313aba7e97cb93e7d07f6d5dfaf0febe8a2f8b.1566956608.git.gitgitgadget@gmail.com
State New
Series
  • upload-pack: fix race condition in t5516

Commit Message

Utsav Shah via GitGitGadget Aug. 28, 2019, 1:43 a.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

Test t5516-fetch-push.sh has a test 'deny fetch unreachable SHA1,
allowtipsha1inwant=true' that checks stderr for a specific error
string from the remote. In some build environments the error sent
over the remote connection gets mingled with the error from the
die() statement. Since both signals are being output to the same
file descriptor (but from parent and child processes), the output
we are matching with grep gets split.

To reduce the risk of this failure, follow this process instead:

1. Write an error message to stderr.
2. Write an error message across the connection.
3. exit(1).

This reorders the events so the error is written entirely before
the client receives a message from the remote, removing the race
condition.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 upload-pack.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

SZEDER Gábor Aug. 28, 2019, 9:34 a.m. UTC | #1
On Tue, Aug 27, 2019 at 06:43:29PM -0700, Derrick Stolee via GitGitGadget wrote:
> Test t5516-fetch-push.sh has a test 'deny fetch unreachable SHA1,
> allowtipsha1inwant=true' that checks stderr for a specific error
> string from the remote. In some build environments the error sent
> over the remote connection gets mingled with the error from the
> die() statement. Since both signals are being output to the same
> file descriptor (but from parent and child processes), the output
> we are matching with grep gets split.

In the spirit of "An error message is worth a thousand words", I think
it's worth including the error message causing the failure:

  error: 'grep not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886 err' didn't find a match in:
  fatal: git upload-pack: not our ref 64ea4c13fatal: remote error: upload-pack: not our ref 63d59fa98e86a771eda009872d6ab2886
  4ea4c133d59fa98e86a771eda009872d6ab2886
  error: last command exited with $?=1

I tried to reproduce this specific error on Linux and macOS, but
couldn't yet.

> To reduce the risk of this failure, follow this process instead:

Here you talk about reducing the risk ...

> 1. Write an error message to stderr.
> 2. Write an error message across the connection.
> 3. exit(1).
> 
> This reorders the events so the error is written entirely before
> the client receives a message from the remote, removing the race
> condition.

... but here you talk about removing the race condition.

I don't understand how this change would remove the race condition.
After all, the occasional failure is caused by two messages racing
through different file descriptors, and merely sending them in reverse
order doesn't change that they would still be racing.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  upload-pack.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/upload-pack.c b/upload-pack.c
> index 222cd3ad89..b0d3e028d1 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -613,11 +613,12 @@ static void check_non_tip(struct object_array *want_obj,
>  	for (i = 0; i < want_obj->nr; i++) {
>  		struct object *o = want_obj->objects[i].item;
>  		if (!is_our_ref(o)) {
> +			warning("git upload-pack: not our ref %s",
> +				oid_to_hex(&o->oid));
>  			packet_writer_error(writer,
>  					    "upload-pack: not our ref %s",
>  					    oid_to_hex(&o->oid));
> -			die("git upload-pack: not our ref %s",
> -			    oid_to_hex(&o->oid));
> +			exit(1);


So, the error coming from the 'git fetch' command in question
currently looks like this:

  fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886

Changing die() to a warning() changes the prefix of the message from
"fatal:" to "warning:", i.e. with this patch I got this:

  warning: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886

I don't think that "demoting" that message from fatal to warning
matters much to users, because they would see the (arguably redundant)
second line starting with "fatal:".

As for the problematic test, it checks this error with:

  test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err

so changing that prefix shouldn't affect the test, either.


Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
--stress' to try to reproduce a failure caused by those mingled
messages, the same check only failed for a different reason so far
(both on Linux and macOS (on Travis CI)):

  error: 'grep remote error:.*not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886$ err' didn't find a match in:
  fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: unable to write to remote: Broken pipe
  error: last command exited with $?=1

And with this patch:

  error: 'grep remote error:.*not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886$ err' didn't find a match in:
  warning: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: unable to write to remote: Broken pipe
  error: last command exited with $?=1

We could make the test pass by relaxing the 'grep' pattern to look
only for 'not our ref.*<SHA...>', but I doubt that ignoring a broken
pipe is such a good idea.
Jeff King Aug. 28, 2019, 2:54 p.m. UTC | #2
On Wed, Aug 28, 2019 at 11:34:08AM +0200, SZEDER Gábor wrote:

> On Tue, Aug 27, 2019 at 06:43:29PM -0700, Derrick Stolee via GitGitGadget wrote:
> > Test t5516-fetch-push.sh has a test 'deny fetch unreachable SHA1,
> > allowtipsha1inwant=true' that checks stderr for a specific error
> > string from the remote. In some build environments the error sent
> > over the remote connection gets mingled with the error from the
> > die() statement. Since both signals are being output to the same
> > file descriptor (but from parent and child processes), the output
> > we are matching with grep gets split.
> 
> In the spirit of "An error message is worth a thousand words", I think
> it's worth including the error message causing the failure:
> 
>   error: 'grep not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886 err' didn't find a match in:
>   fatal: git upload-pack: not our ref 64ea4c13fatal: remote error: upload-pack: not our ref 63d59fa98e86a771eda009872d6ab2886
>   4ea4c133d59fa98e86a771eda009872d6ab2886
>   error: last command exited with $?=1
> 
> I tried to reproduce this specific error on Linux and macOS, but
> couldn't yet.

I suspect it depends on write() to a file not being atomic, since we
should be able to get the full message out in a single write in both
cases. It's _possible_ that stderr is fully buffered on Windows, but it
generally shouldn't be. If it is, then this might help:

diff --git a/usage.c b/usage.c
index 2fdb20086b..d6df31bc5b 100644
--- a/usage.c
+++ b/usage.c
@@ -10,13 +10,19 @@ void vreportf(const char *prefix, const char *err, va_list params)
 {
 	char msg[4096];
 	char *p;
-
-	vsnprintf(msg, sizeof(msg), err, params);
+	size_t len;
+
+	/* truncation is OK here, but make sure we have a newline */
+	len = xsnprintf(msg, sizeof(msg), "%s", prefix);
+	len += vsnprintf(msg + len, sizeof(msg) - len, err, params);
+	if (len >= sizeof(msg) - 1)
+		len--;
+	len += xsnprintf(msg + len, sizeof(msg) - len, "\n");
 	for (p = msg; *p; p++) {
 		if (iscntrl(*p) && *p != '\t' && *p != '\n')
 			*p = '?';
 	}
-	fprintf(stderr, "%s%s\n", prefix, msg);
+	write(2, msg, len);
 }
 
 static NORETURN void usage_builtin(const char *err, va_list params)

But again, I'm doubtful.

> Here you talk about reducing the risk ...
> 
> > 1. Write an error message to stderr.
> > 2. Write an error message across the connection.
> > 3. exit(1).
> > 
> > This reorders the events so the error is written entirely before
> > the client receives a message from the remote, removing the race
> > condition.
> 
> ... but here you talk about removing the race condition.
> 
> I don't understand how this change would remove the race condition.
> After all, the occasional failure is caused by two messages racing
> through different file descriptors, and merely sending them in reverse
> order doesn't change that they would still be racing.

I had the same puzzlement. I think the answer might be that in the
original, we can have two write()s happening simultaneously:

  1. upload-pack sends the packet to the client

  2. client receives packet

  3a. upload-pack then writes to stderr

  3b. simultaneously, the client writes to stderr

But by reordering, step 3a is completed before step 1.

> > +			warning("git upload-pack: not our ref %s",
> > +				oid_to_hex(&o->oid));
> >  			packet_writer_error(writer,
> >  					    "upload-pack: not our ref %s",
> >  					    oid_to_hex(&o->oid));
> > -			die("git upload-pack: not our ref %s",
> > -			    oid_to_hex(&o->oid));
> > +			exit(1);
> 
> 
> So, the error coming from the 'git fetch' command in question
> currently looks like this:
> 
>   fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>   fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
> 
> Changing die() to a warning() changes the prefix of the message from
> "fatal:" to "warning:", i.e. with this patch I got this:
> 
>   warning: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>   fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
> 
> I don't think that "demoting" that message from fatal to warning
> matters much to users, because they would see the (arguably redundant)
> second line starting with "fatal:".

An interesting thing about this message is that many users won't see it
at all! It requires that they be connected to the stderr of upload-pack,
which happens only for local-filesystem clones, and for git-over-ssh to
a vanilla ssh session.

It won't be seen for git:// nor for https://, which go to the stderr of
git-daemon and the webserver respectively. Nor would it be seen on any
hosting site which uses something other than vanilla sshd to terminate
the connection (at GitHub for example, ssh terminates in a proxy layer
which then uses a different protocol to run upload-pack on the backend).

That's why we've been adding these sideband-level errors: so everybody
can see them. So unless the client is a 2005-era version of Git that
lacks sideband, the die message is either invisible to the user, or
redundant.

I'd almost suggest it is worth dropping the die() message altogether
here, but it does carry some value to people collecting them on the
server side[1].

Another solution is for Git to redirect the stderr of the child
upload-pack it runs, which puts all invocations on the same footing,
regardless of protocol. But that risks losing useful messages from
before the sideband starts up (e.g., ssh failures).

[1] At GitHub, we record the exit status of every Git process, and we've
    hooked die() to record the message, which is often helpful for
    post-facto debugging. I'd hate to lose it, but we could certainly
    work around it for this case, perhaps by hooking
    packet_writer_error() in a similar way.

> Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
> --stress' to try to reproduce a failure caused by those mingled
> messages, the same check only failed for a different reason so far
> (both on Linux and macOS (on Travis CI)):

There's some hand-waving argument that this should be race-free in
014ade7484 (upload-pack: send ERR packet for non-tip objects,
2019-04-13), but I am not too surprised if there is a flaw in that
logic.

This kind of race is a problem anytime upload-pack dies. And it's not
just a test issue, but a real problem for users: they might see EPIPE
instead of the error message from the server (it used to be SIGPIPE but
we recently relaxed that).

There are details and a possible path forward discussed in:

  https://public-inbox.org/git/20190305041139.GA19800@sigill.intra.peff.net/

-Peff
Jeff King Aug. 28, 2019, 3:39 p.m. UTC | #3
On Wed, Aug 28, 2019 at 10:54:12AM -0400, Jeff King wrote:

> > Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
> > --stress' to try to reproduce a failure caused by those mingled
> > messages, the same check only failed for a different reason so far
> > (both on Linux and macOS (on Travis CI)):
> 
> There's some hand-waving argument that this should be race-free in
> 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> 2019-04-13), but I am not too surprised if there is a flaw in that
> logic.

By the way, I've not been able to reproduce this locally after ~10
minutes of running "./t5516-fetch-push.sh -r 1,79 --stress" on my Linux
box. I wonder what's different.

Are you running the tip of master?

-Peff
SZEDER Gábor Aug. 28, 2019, 4:15 p.m. UTC | #4
On Wed, Aug 28, 2019 at 11:39:44AM -0400, Jeff King wrote:
> On Wed, Aug 28, 2019 at 10:54:12AM -0400, Jeff King wrote:
> 
> > > Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
> > > --stress' to try to reproduce a failure caused by those mingled
> > > messages, the same check only failed for a different reason so far
> > > (both on Linux and macOS (on Travis CI)):
> > 
> > There's some hand-waving argument that this should be race-free in
> > 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > 2019-04-13), but I am not too surprised if there is a flaw in that
> > logic.
> 
> By the way, I've not been able to reproduce this locally after ~10
> minutes of running "./t5516-fetch-push.sh -r 1,79 --stress" on my Linux
> box. I wonder what's different.
> 
> Are you running the tip of master?

Yeah, but this seems to be one of those "you have to be really lucky,
even with --stress" cases.

So...  I was away from keyboard for over an hour and let it run on
'master', but it didn't fail.  Then I figured I'd give it a try
with Derrick's patch, because, well, why not, and then I got this
broken pipe error in ~150 repetitions.  Ran it again, same error after
~200 reps.  However, I didn't understand how that patch could lead to
a broken pipe, so went back to stressing master...  nothing.  So I
started writing the reply to that patch saying that it seems to cause
some racy failures on Linux, and was already proofreading before
sending when the damn thing finally did fail.  Oh, well.

Then I tried it on macOS, and it failed fairly quickly.  For lack of
better options I used Travis CI's debug shell to access a mac VM, and
could reproduce the failure both with and without the patch before the
session timed out.
Derrick Stolee Aug. 29, 2019, 12:58 p.m. UTC | #5
On 8/28/2019 12:15 PM, SZEDER Gábor wrote:
> On Wed, Aug 28, 2019 at 11:39:44AM -0400, Jeff King wrote:
>> On Wed, Aug 28, 2019 at 10:54:12AM -0400, Jeff King wrote:
>>
>>>> Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
>>>> --stress' to try to reproduce a failure caused by those mingled
>>>> messages, the same check only failed for a different reason so far
>>>> (both on Linux and macOS (on Travis CI)):
>>>
>>> There's some hand-waving argument that this should be race-free in
>>> 014ade7484 (upload-pack: send ERR packet for non-tip objects,
>>> 2019-04-13), but I am not too surprised if there is a flaw in that
>>> logic.
>>
>> By the way, I've not been able to reproduce this locally after ~10
>> minutes of running "./t5516-fetch-push.sh -r 1,79 --stress" on my Linux
>> box. I wonder what's different.
>>
>> Are you running the tip of master?
> 
> Yeah, but this seems to be one of those "you have to be really lucky,
> even with --stress" cases.
> 
> So...  I was away from keyboard for over an hour and let it run on
> 'master', but it didn't fail.  Then I figured I'd give it a try
> with Derrick's patch, because, well, why not, and then I got this
> broken pipe error in ~150 repetitions.  Ran it again, same error after
> ~200 reps.  However, I didn't understand how that patch could lead to
> a broken pipe, so went back to stressing master...  nothing.  So I
> started writing the reply to that patch saying that it seems to cause
> some racy failures on Linux, and was already proofreading before
> sending when the damn thing finally did fail.  Oh, well.
> 
> Then I tried it on macOS, and it failed fairly quickly.  For lack of
> better options I used Travis CI's debug shell to access a mac VM, and
> could reproduce the failure both with and without the patch before the
> session timed out.

I'm running these tests under --stress now, but not seeing the error
you saw.

However, I do have a theory: the process exits before flushing the
packet line. Adding this line before exit(1) should fix it:

	packet_writer_flush(writer);

I can send this in a v2, but it would be nice if you could test this
in your environment that already demonstrated the failure.

Thanks,
-Stolee
Jeff King Aug. 29, 2019, 2:13 p.m. UTC | #6
On Thu, Aug 29, 2019 at 08:58:55AM -0400, Derrick Stolee wrote:

> However, I do have a theory: the process exits before flushing the
> packet line. Adding this line before exit(1) should fix it:
> 
> 	packet_writer_flush(writer);
> 
> I can send this in a v2, but it would be nice if you could test this
> in your environment that already demonstrated the failure.

I don't think we should need such a call. For one thing, if it were
necessary, that would mean we're not writing out the packet at all. But
your whole problem is that we're writing the message twice, one of which
comes from the packet.

Second is that this is not "flush the output stream", but "write a flush
packet". The packet_writer_error() function immediately calls write()
without buffering. And no matter where we are in the conversation, a
flush packet would not be necessary, because the error packet we send
would be interpreted immediately by the client as aborting the
connection.

-Peff
Derrick Stolee Aug. 29, 2019, 2:27 p.m. UTC | #7
On 8/29/2019 10:13 AM, Jeff King wrote:
> On Thu, Aug 29, 2019 at 08:58:55AM -0400, Derrick Stolee wrote:
> 
>> However, I do have a theory: the process exits before flushing the
>> packet line. Adding this line before exit(1) should fix it:
>>
>> 	packet_writer_flush(writer);
>>
>> I can send this in a v2, but it would be nice if you could test this
>> in your environment that already demonstrated the failure.
> 
> I don't think we should need such a call. For one thing, if it were
> necessary, that would mean we're not writing out the packet at all. But
> your whole problem is that we're writing the message twice, one of which
> comes from the packet.

The problem the flush() was trying to solve was the new "Broken pipe" error,
which I had assumed was due to a communication race. (Looking at the message
more closely now, I see that Szeder was able to repro this broken pipe both
with and without my change. I am still unable to repro the broken pipe.)

> Second is that this is not "flush the output stream", but "write a flush
> packet". The packet_writer_error() function immediately calls write()
> without buffering. And no matter where we are in the conversation, a
> flush packet would not be necessary, because the error packet we send
> would be interpreted immediately by the client as aborting the
> connection.

This clearly shows that my proposed solution is absolutely wrong.

Thanks,
-Stolee
Jeff King Aug. 29, 2019, 2:38 p.m. UTC | #8
On Thu, Aug 29, 2019 at 10:27:16AM -0400, Derrick Stolee wrote:

> > I don't think we should need such a call. For one thing, if it were
> > necessary, that would mean we're not writing out the packet at all. But
> > your whole problem is that we're writing the message twice, one of which
> > comes from the packet.
> 
> The problem the flush() was trying to solve was the new "Broken pipe" error,
> which I had assumed was due to a communication race. (Looking at the message
> more closely now, I see that Szeder was able to repro this broken pipe both
> with and without my change. I am still unable to repro the broken pipe.)

I think the broken pipe is coming the _other_ way. We do send the packet
from the server to the client, but since the client is still writing
when the server has hung up, we get a write error instead of seeing the
error packet.

So any fixes there have to happen on the client side. I am still
confused about why the client is writing in this case, per the argument
in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
trying to write, but I still haven't been able to reproduce the issue.

-Peff
SZEDER Gábor Aug. 29, 2019, 9:58 p.m. UTC | #9
On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> So any fixes there have to happen on the client side. I am still
> confused about why the client is writing in this case, per the argument
> in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> trying to write, but I still haven't been able to reproduce the issue.

It's the "done" line:

  + cat trace-packet
  packet:  upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
  packet:  upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
  packet:  upload-pack> 0000
  packet:        fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
  packet:        fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
  packet:        fetch< 0000
  packet:        fetch> want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
  packet:        fetch> 0000
  packet:  upload-pack< want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
  packet:  upload-pack< 0000
  packet:  upload-pack> ERR upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  packet:  upload-pack> 0000
  packet:        fetch> done

In the average successful run that "fetch> done" pkt-line is
immediately after the "fetch> 0000".

The above is on my Linux box; here is a failure on macOS, essentially
the same:

  https://travis-ci.org/szeder/git/jobs/578555606#L3453
SZEDER Gábor Aug. 29, 2019, 10:06 p.m. UTC | #10
On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > So any fixes there have to happen on the client side. I am still
> > confused about why the client is writing in this case, per the argument
> > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > trying to write, but I still haven't been able to reproduce the issue.
> 
> It's the "done" line:
> 
>   + cat trace-packet
>   packet:  upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
>   packet:  upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
>   packet:  upload-pack> 0000
>   packet:        fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
>   packet:        fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
>   packet:        fetch< 0000
>   packet:        fetch> want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
>   packet:        fetch> 0000
>   packet:  upload-pack< want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
>   packet:  upload-pack< 0000
>   packet:  upload-pack> ERR upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>   packet:  upload-pack> 0000
>   packet:        fetch> done
> 
> > In the average successful run that "fetch> done" pkt-line is
> immediately after the "fetch> 0000".
> 
> The above is on my Linux box; here is a failure on macOS, essentially
> the same:
> 
>   https://travis-ci.org/szeder/git/jobs/578555606#L3453
> 

Try this patch to reliably reproduce this failure:

diff --git a/fetch-pack.c b/fetch-pack.c
index 65be043f2a..e18864458b 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -471,6 +471,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 	}
 done:
 	if (!got_ready || !no_done) {
+		sleep(1);
 		packet_buf_write(&req_buf, "done\n");
 		send_request(args, fd[1], &req_buf);
 	}
diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index c81ca360ac..ebecc81c89 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -1252,7 +1252,9 @@ do
 			git fetch ../testrepo/.git $SHA1_2 &&
 			git cat-file commit $SHA1_2 &&
 			test_must_fail env GIT_TEST_PROTOCOL_VERSION= \
+				GIT_TRACE_PACKET="$(pwd)/trace-packet" \
 				git fetch ../testrepo/.git $SHA1_3 2>err &&
+			cat trace-packet &&
 			test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err
 		)
 	'
SZEDER Gábor Aug. 30, 2019, 12:10 p.m. UTC | #11
On Fri, Aug 30, 2019 at 12:06:30AM +0200, SZEDER Gábor wrote:
> On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> > On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > > So any fixes there have to happen on the client side. I am still
> > > confused about why the client is writing in this case, per the argument
> > > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > > trying to write, but I still haven't been able to reproduce the issue.
> > 
> > It's the "done" line:
> > 
> >   + cat trace-packet
[...]
> >   packet:  upload-pack> 0000
> >   packet:        fetch> done
> > 
> > In the average successful run that "fetch> done" pkt-line is
> > immediately after the "fetch> 0000".

So instead of immediately die()ing after write_in_full() returned an
error, fetch should first try to read all incoming packets in the hope
that the remote did send an ERR packet before it died, and then die
with the error in that packet, or fall back to the current generic
error message if there is no ERR packet (e.g. remote segfaulted or
something similarly horrible).  This fixes the test failure with that
strategically-placed sleep() in 'fetch-pack.c'.

  https://travis-ci.org/szeder/git/jobs/578778749#L2689

Alas, passing a 'reader' to a function called send_request() doesn't
look quite right, does it...  And I'm not sure about the stateless
communication, it still uses write_or_die().


diff --git a/fetch-pack.c b/fetch-pack.c
index e18864458b..773d9c7904 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -186,14 +186,27 @@ static enum ack_type get_ack(struct packet_reader *reader,
 }
 
 static void send_request(struct fetch_pack_args *args,
-			 int fd, struct strbuf *buf)
+			 int fd, struct strbuf *buf,
+			 struct packet_reader *reader)
 {
 	if (args->stateless_rpc) {
 		send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
 		packet_flush(fd);
 	} else {
-		if (write_in_full(fd, buf->buf, buf->len) < 0)
+		if (write_in_full(fd, buf->buf, buf->len) < 0) {
+			int save_errno = errno;
+			/*
+			 * Read everything the remote has sent to us.
+			 * If there is an ERR packet, then the loop die()s
+			 * with the received error message.
+			 * If we reach EOF without seeing an ERR, then die()
+			 * with a generic error message, most likely "Broken
+			 * pipe".
+			 */
+			while (packet_reader_read(reader) != PACKET_READ_EOF);
+			errno = save_errno;
 			die_errno(_("unable to write to remote"));
+		}
 	}
 }
 
@@ -353,7 +366,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 		const char *arg;
 		struct object_id oid;
 
-		send_request(args, fd[1], &req_buf);
+		send_request(args, fd[1], &req_buf, &reader);
 		while (packet_reader_read(&reader) == PACKET_READ_NORMAL) {
 			if (skip_prefix(reader.line, "shallow ", &arg)) {
 				if (get_oid_hex(arg, &oid))
@@ -376,7 +389,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 			die(_("expected shallow/unshallow, got %s"), reader.line);
 		}
 	} else if (!args->stateless_rpc)
-		send_request(args, fd[1], &req_buf);
+		send_request(args, fd[1], &req_buf, &reader);
 
 	if (!args->stateless_rpc) {
 		/* If we aren't using the stateless-rpc interface
@@ -398,7 +411,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 			int ack;
 
 			packet_buf_flush(&req_buf);
-			send_request(args, fd[1], &req_buf);
+			send_request(args, fd[1], &req_buf, &reader);
 			strbuf_setlen(&req_buf, state_len);
 			flushes++;
 			flush_at = next_flush(args->stateless_rpc, count);
@@ -473,7 +486,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 	if (!got_ready || !no_done) {
 		sleep(1);
 		packet_buf_write(&req_buf, "done\n");
-		send_request(args, fd[1], &req_buf);
+		send_request(args, fd[1], &req_buf, &reader);
 	}
 	print_verbose(args, _("done"));
 	if (retval != 0) {
Jeff King Sept. 4, 2019, 5:04 a.m. UTC | #12
On Fri, Aug 30, 2019 at 02:10:05PM +0200, SZEDER Gábor wrote:

> On Fri, Aug 30, 2019 at 12:06:30AM +0200, SZEDER Gábor wrote:
> > On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> > > On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > > > So any fixes there have to happen on the client side. I am still
> > > > confused about why the client is writing in this case, per the argument
> > > > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > > > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > > > trying to write, but I still haven't been able to reproduce the issue.
> > > 
> > > It's the "done" line:
> > > 
> > >   + cat trace-packet
> [...]
> > >   packet:  upload-pack> 0000
> > >   packet:        fetch> done
> > > 
> > > In the avarage successful run that "fetch> done" pkt-line is
> > > immediately after the "fetch> 0000".

Thanks for all of your persistent digging on this. I had forgotten about
the "done" packet, but it explains all of the symptoms we've seen.

> So instead of immediately die()ing after write_in_full() returned an
> error, fetch should first try to read all incoming packets in the hope
> that the remote did send an ERR packet before it died, and then die
> with the error in that packet, or fall back to the current generic
> error message if there is no ERR packet (e.g. remote segfaulted or
> something similarly horrible).  This fixes the test failure with that
> strategically-placed sleep() in 'fetch-pack.c'.
> 
>   https://travis-ci.org/szeder/git/jobs/578778749#L2689
> 
> Alas, passing a 'reader' to a function called send_request() doesn't
> look quite right, does it...  And I'm not sure about the stateless
> communication, it still uses write_or_die().

And thank you for putting this patch together. I had taken a stab at it
a while ago, but got discouraged by figuring out at which layer to add
the "reader" info (I had envisioned it much lower in packet_write(), but
it is clear from your patch that fetch-pack does most of its own
writing).

I agree passing around the reader is a bit weird; I wonder if we should
be representing the full-duplex connection more clearly as a single
struct. But I suspect that creates other headaches, and what you have
here doesn't look _too_ bad. As you note, it probably doesn't cover all
code paths, but it at least fixes some of them, and gives us a template
for addressing the others.
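
To make the "single struct" idea above concrete, here is a minimal sketch (hypothetical names, not git's API): bundling the write fd with the packet reader into one connection handle would let any failing write consult the read side for an ERR packet without threading an extra `reader` parameter through every call. The `struct packet_reader` stand-in below is a stub for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for git's struct packet_reader, just for this sketch. */
struct packet_reader {
	const char *pending;	/* bytes the remote already sent us */
};

/*
 * Hypothetical bundled full-duplex connection: one handle carrying
 * both the outgoing fd and the incoming packet reader.
 */
struct pkt_connection {
	int out_fd;			/* where requests are written */
	struct packet_reader reader;	/* where ERR packets would arrive */
};

/*
 * With the bundle, an error path can always ask "did the remote send
 * ERR?" from the same handle it was writing to.
 */
static int conn_saw_err(const struct pkt_connection *conn)
{
	return strstr(conn->reader.pending, "ERR ") != NULL;
}
```

This trades a wider refactor for cleaner call signatures, which is presumably the "other headaches" trade-off mentioned above.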

>  	} else {
> -		if (write_in_full(fd, buf->buf, buf->len) < 0)
> +		if (write_in_full(fd, buf->buf, buf->len) < 0) {
> +			int save_errno = errno;
> +			/*
> +			 * Read everything the remote has sent to us.
> +			 * If there is an ERR packet, then the loop die()s
> +			 * with the received error message.
> +			 * If we reach EOF without seeing an ERR, then die()
> +			 * with a generic error message, most likely "Broken
> +			 * pipe".
> +			 */
> +			while (packet_reader_read(reader) != PACKET_READ_EOF);
> +			errno = save_errno;
>  			die_errno(_("unable to write to remote"));
> +		}

One unfortunate thing here is that we could block indefinitely in
packet_reader_read(). That shouldn't happen, I don't think, but since
this is an error case where we've been cut off, anything's possible.

We maybe could get away with using non-blocking I/O. We're looking for
an ERR packet the other side sent us _before_ it hung up, so in theory
we'd have received the data before any FIN packet (or EOF on a pipe).
But I'm wary of introducing new races there.

It might be enough to put in an actual timer, waiting for an ERR packet,
EOF, or something like 5 seconds. Or maybe I'm just being overly
paranoid.
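
The timer idea above could look roughly like this sketch (hypothetical helper, not git's code): after a failed write, drain the read side with poll() so we wait at most `timeout_ms` for a possible ERR packet instead of blocking indefinitely. Scanning the raw bytes for "ERR " is a simplification here; real code would parse pkt-lines and handle the marker being split across reads.

```c
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Drain fd looking for an ERR packet, waiting at most timeout_ms for
 * each chunk. Returns 1 if "ERR " was seen, 0 on EOF without one, and
 * -1 on timeout or read/poll error.
 */
static int drain_for_err(int fd, int timeout_ms)
{
	char buf[4096];

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		ssize_t n;

		switch (poll(&pfd, 1, timeout_ms)) {
		case -1:
			return -1;	/* poll error */
		case 0:
			return -1;	/* timed out: stop waiting for ERR */
		}
		n = read(fd, buf, sizeof(buf) - 1);
		if (n < 0)
			return -1;
		if (n == 0)
			return 0;	/* EOF, remote sent no ERR packet */
		buf[n] = '\0';
		if (strstr(buf, "ERR "))
			return 1;	/* remote did explain before dying */
	}
}
```

On timeout the caller would fall back to the generic "unable to write to remote" message, so the worst case is the status quo plus a bounded delay.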

-Peff
SZEDER Gábor Sept. 10, 2019, 12:08 p.m. UTC | #13
On Wed, Sep 04, 2019 at 01:04:42AM -0400, Jeff King wrote:
> On Fri, Aug 30, 2019 at 02:10:05PM +0200, SZEDER Gábor wrote:
> 
> > On Fri, Aug 30, 2019 at 12:06:30AM +0200, SZEDER Gábor wrote:
> > > On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> > > > On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > > > > So any fixes there have to happen on the client side. I am still
> > > > > confused about why the client is writing in this case, per the argument
> > > > > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > > > > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > > > > trying to write, but I still haven't been able to reproduce the issue.
> > > > 
> > > > It's the "done" line:
> > > > 
> > > >   + cat trace-packet
> > [...]
> > > >   packet:  upload-pack> 0000
> > > >   packet:        fetch> done
> > > > 
> > > >   In the average successful run that "fetch> done" pkt-line is
> > > >   immediately after the "fetch> 0000".
> 
> Thanks for all of your persistent digging on this.

Yeah, I can be easily distracted by an interesting looking bug... was
told it's a character flaw ;)

> I had forgotten about
> the "done" packet, but it explains all of the symptoms we've seen.
> 
> > So instead of immediately die()ing after write_in_full() returned an
> > error, fetch should first try to read all incoming packets in the hope
> > that the remote did send an ERR packet before it died, and then die
> > with the error in that packet, or fall back to the current generic
> > error message if there is no ERR packet (e.g. remote segfaulted or
> > something similarly horrible).  This fixes the test failure with that
> > strategically-placed sleep() in 'fetch-pack.c'.
> > 
> >   https://travis-ci.org/szeder/git/jobs/578778749#L2689
> > 
> > Alas, passing a 'reader' to a function called send_request() doesn't
> > look quite right, does it...  And I'm not sure about the stateless
> > communication, it still uses write_or_die().
> 
> And thank you for putting this patch together. I had taken a stab at it
> a while ago, but got discouraged by figuring out at which layer to add
> the "reader" info (I had envisioned it much lower in packet_write(), but
> it is clear from your patch that fetch-pack does most of its own
> writing).
> 
> I agree passing around the reader is a bit weird;

I considered renaming send_request() so that 'reader' won't look that
out of place among its parameters, but all my ideas were ridiculous,
e.g. send_request_and_process_ERR_pkt_on_error()...

> I wonder if we should
> be representing the full-duplex connection more clearly as a single
> struct. But I suspect that creates other headaches, and what you have
> here doesn't look _too_ bad. As you note, it probably doesn't cover all
> code paths, but it at least fixes some of them, and gives us a template
> for addressing the others.
> 
> >  	} else {
> > -		if (write_in_full(fd, buf->buf, buf->len) < 0)
> > +		if (write_in_full(fd, buf->buf, buf->len) < 0) {
> > +			int save_errno = errno;
> > +			/*
> > +			 * Read everything the remote has sent to us.
> > +			 * If there is an ERR packet, then the loop die()s
> > +			 * with the received error message.
> > +			 * If we reach EOF without seeing an ERR, then die()
> > +			 * with a generic error message, most likely "Broken
> > +			 * pipe".
> > +			 */
> > +			while (packet_reader_read(reader) != PACKET_READ_EOF);
> > +			errno = save_errno;
> >  			die_errno(_("unable to write to remote"));
> > +		}
> 
> One unfortunate thing here is that we could block indefinitely in
> packet_reader_read(). That shouldn't happen, I don't think, but since
> this is an error case where we've been cutoff, anything's possible.

Yeah, when we use different file descriptors for reading and writing,
then any error on the writing fd doesn't necessarily mean that there
is an error on the reading fd as well.  I mean, we could get an
EBADF or EFAULT as well, but those would rather indicate a bug in Git
than an error with the connection itself.

I wondered whether we could avoid blocking indefinitely by looking for
an ERR packet only if the write() resulted in ECONNRESET or EPIPE,
i.e. that indicate a connection error.  I suppose 'git upload-pack'
(or an alternative implementation) could be buggy and could
inadvertently close() the other end of the fd that fetch-pack writes
to but not the fd where it reads from, so the write() would get
ECONNRESET, but then packet_reader_read() could still hang on the
still-open read fd.  I'm not sure whether it's worth worrying about; I
mean a buggy 'git upload-pack' can do all kinds of weird things that
would lead to a hang.

Anyway, after write_in_full() returns with error we could first print
an error message with error_errno() and then go on to look for an ERR
packet.  So even if packet_reader_read() hangs, at least there will be
an error message for the user to see.  It wouldn't help automation,
though.
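
That ordering can be sketched as follows (hypothetical helper, not git's code, and again scanning raw bytes rather than parsing pkt-lines): the local write error is reported before we start draining, so the user sees a message even if the drain were to block on a still-open read fd.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Report the saved write error first, then drain read_fd until EOF
 * looking for an ERR packet. Returns 1 if "ERR " was seen, else 0.
 */
static int report_then_drain(int saved_errno, int read_fd)
{
	char buf[4096];
	ssize_t n;

	/* Step 1: the local error message reaches the user first. */
	fprintf(stderr, "error: unable to write to remote: %s\n",
		strerror(saved_errno));

	/* Step 2: drain until EOF, watching for an ERR packet. */
	while ((n = read(read_fd, buf, sizeof(buf) - 1)) > 0) {
		buf[n] = '\0';
		if (strstr(buf, "ERR "))
			return 1;
	}
	return 0;
}
```

As noted, this only helps a human watching stderr; a script waiting on the process would still hang until the read side returns.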

> We maybe could get away with using non-blocking I/O. We're looking for
> an ERR packet the other side sent us _before_ it hung up, so in theory
> we'd have received the data before any FIN packet (or EOF on a pipe).
> But I'm wary of introducing new races there.
> 
> It might be enough to put in an actual timer, waiting for an ERR packet,
> EOF, or something like 5 seconds. Or maybe I'm just being overly
> paranoid.

I think the timeout would be the safest bet, but then we would have to
pass that timeout parameter through a couple of function calls...

Patch
diff mbox series

diff --git a/upload-pack.c b/upload-pack.c
index 222cd3ad89..b0d3e028d1 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -613,11 +613,12 @@  static void check_non_tip(struct object_array *want_obj,
 	for (i = 0; i < want_obj->nr; i++) {
 		struct object *o = want_obj->objects[i].item;
 		if (!is_our_ref(o)) {
+			warning("git upload-pack: not our ref %s",
+				oid_to_hex(&o->oid));
 			packet_writer_error(writer,
 					    "upload-pack: not our ref %s",
 					    oid_to_hex(&o->oid));
-			die("git upload-pack: not our ref %s",
-			    oid_to_hex(&o->oid));
+			exit(1);
 		}
 	}
 }