Message ID: 5c313aba7e97cb93e7d07f6d5dfaf0febe8a2f8b.1566956608.git.gitgitgadget@gmail.com (mailing list archive)
State: New, archived
Series: upload-pack: fix race condition in t5516
On Tue, Aug 27, 2019 at 06:43:29PM -0700, Derrick Stolee via GitGitGadget wrote:

> Test t5516-fetch-push.sh has a test 'deny fetch unreachable SHA1,
> allowtipsha1inwant=true' that checks stderr for a specific error
> string from the remote. In some build environments the error sent
> over the remote connection gets mingled with the error from the
> die() statement. Since both signals are being output to the same
> file descriptor (but from parent and child processes), the output
> we are matching with grep gets split.

In the spirit of "An error message is worth a thousand words", I think
it's worth including the error message causing the failure:

  error: 'grep not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886 err' didn't find a match in:
  fatal: git upload-pack: not our ref 64ea4c13fatal: remote error: upload-pack: not our ref 63d59fa98e86a771eda009872d6ab2886
  4ea4c133d59fa98e86a771eda009872d6ab2886
  error: last command exited with $?=1

I tried to reproduce this specific error on Linux and macOS, but
couldn't yet.

> To reduce the risk of this failure, follow this process instead:

Here you talk about reducing the risk ...

> 1. Write an error message to stderr.
> 2. Write an error message across the connection.
> 3. exit(1).
>
> This reorders the events so the error is written entirely before
> the client receives a message from the remote, removing the race
> condition.

... but here you talk about removing the race condition.

I don't understand how this change would remove the race condition.
After all, the occasional failure is caused by two messages racing
through different file descriptors, and merely sending them in reverse
order doesn't change that they would still be racing.
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  upload-pack.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/upload-pack.c b/upload-pack.c
> index 222cd3ad89..b0d3e028d1 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -613,11 +613,12 @@ static void check_non_tip(struct object_array *want_obj,
>  	for (i = 0; i < want_obj->nr; i++) {
>  		struct object *o = want_obj->objects[i].item;
>  		if (!is_our_ref(o)) {
> +			warning("git upload-pack: not our ref %s",
> +				oid_to_hex(&o->oid));
>  			packet_writer_error(writer,
>  					    "upload-pack: not our ref %s",
>  					    oid_to_hex(&o->oid));
> -			die("git upload-pack: not our ref %s",
> -			    oid_to_hex(&o->oid));
> +			exit(1);

So, the error coming from the 'git fetch' command in question
currently looks like this:

  fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886

Changing die() to a warning() changes the prefix of the message from
"fatal:" to "warning:", i.e. with this patch I got this:

  warning: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886

I don't think that "demoting" that message from fatal to warning
matters much to users, because they would see the (arguably redundant)
second line starting with "fatal:".

As for the problematic test, it checks this error with:

  test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err

so changing that prefix shouldn't affect the test, either.
Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
--stress' to try to reproduce a failure caused by those mingled
messages, the same check only failed for a different reason so far
(both on Linux and macOS (on Travis CI)):

  error: 'grep remote error:.*not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886$ err' didn't find a match in:
  fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: unable to write to remote: Broken pipe
  error: last command exited with $?=1

And with this patch:

  error: 'grep remote error:.*not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886$ err' didn't find a match in:
  warning: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
  fatal: unable to write to remote: Broken pipe
  error: last command exited with $?=1

We could make the test pass by relaxing the 'grep' pattern to look
only for 'not our ref.*<SHA...>', but I doubt that ignoring a broken
pipe is such a good idea.
On Wed, Aug 28, 2019 at 11:34:08AM +0200, SZEDER Gábor wrote:
> On Tue, Aug 27, 2019 at 06:43:29PM -0700, Derrick Stolee via GitGitGadget wrote:
> > Test t5516-fetch-push.sh has a test 'deny fetch unreachable SHA1,
> > allowtipsha1inwant=true' that checks stderr for a specific error
> > string from the remote. In some build environments the error sent
> > over the remote connection gets mingled with the error from the
> > die() statement. Since both signals are being output to the same
> > file descriptor (but from parent and child processes), the output
> > we are matching with grep gets split.
>
> In the spirit of "An error message is worth a thousand words", I think
> it's worth including the error message causing the failure:
>
>   error: 'grep not our ref.*64ea4c133d59fa98e86a771eda009872d6ab2886 err' didn't find a match in:
>   fatal: git upload-pack: not our ref 64ea4c13fatal: remote error: upload-pack: not our ref 63d59fa98e86a771eda009872d6ab2886
>   4ea4c133d59fa98e86a771eda009872d6ab2886
>   error: last command exited with $?=1
>
> I tried to reproduce this specific error on Linux and macOS, but
> couldn't yet.

I suspect it depends on write() to a file not being atomic, since we
should be able to get the full message out in a single write in both
cases. It's _possible_ that stderr is fully buffered on Windows, but
it generally shouldn't be.
If it is, then this might help:

diff --git a/usage.c b/usage.c
index 2fdb20086b..d6df31bc5b 100644
--- a/usage.c
+++ b/usage.c
@@ -10,13 +10,19 @@ void vreportf(const char *prefix, const char *err, va_list params)
 {
 	char msg[4096];
 	char *p;
-
-	vsnprintf(msg, sizeof(msg), err, params);
+	size_t len;
+
+	/* truncation is OK here, but make sure we have a newline */
+	len = xsnprintf(msg, sizeof(msg), "%s", prefix);
+	len += vsnprintf(msg + len, sizeof(msg) - len, err, params);
+	if (len >= sizeof(msg) - 1)
+		len--;
+	len += xsnprintf(msg + len, sizeof(msg) - len, "\n");
 	for (p = msg; *p; p++) {
 		if (iscntrl(*p) && *p != '\t' && *p != '\n')
 			*p = '?';
 	}
-	fprintf(stderr, "%s%s\n", prefix, msg);
+	write(2, msg, len);
 }

 static NORETURN void usage_builtin(const char *err, va_list params)

But again, I'm doubtful.

> Here you talk about reducing the risk ...
>
> > 1. Write an error message to stderr.
> > 2. Write an error message across the connection.
> > 3. exit(1).
> >
> > This reorders the events so the error is written entirely before
> > the client receives a message from the remote, removing the race
> > condition.
>
> ... but here you talk about removing the race condition.
>
> I don't understand how this change would remove the race condition.
> After all, the occasional failure is caused by two messages racing
> through different file descriptors, and merely sending them in reverse
> order doesn't change that they would still be racing.

I had the same puzzlement. I think the answer might be that in the
original, we can have two write()s happening simultaneously:

  1. upload-pack sends the packet to the client
  2. client receives packet
  3a. upload-pack then writes to stderr
  3b. simultaneously, the client writes to stderr

But by reordering, step 3a is completed before step 1.
> > +			warning("git upload-pack: not our ref %s",
> > +				oid_to_hex(&o->oid));
> >  			packet_writer_error(writer,
> >  					    "upload-pack: not our ref %s",
> >  					    oid_to_hex(&o->oid));
> > -			die("git upload-pack: not our ref %s",
> > -			    oid_to_hex(&o->oid));
> > +			exit(1);
>
> So, the error coming from the 'git fetch' command in question
> currently looks like this:
>
>   fatal: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>   fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>
> Changing die() to a warning() changes the prefix of the message from
> "fatal:" to "warning:", i.e. with this patch I got this:
>
>   warning: git upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>   fatal: remote error: upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
>
> I don't think that "demoting" that message from fatal to warning
> matters much to users, because they would see the (arguably redundant)
> second line starting with "fatal:".

An interesting thing about this message is that many users won't see
it at all! It requires that they be connected to the stderr of
upload-pack, which happens only for local-filesystem clones, and for
git-over-ssh to a vanilla ssh session. It won't be seen for git:// nor
for https://, which go to the stderr of git-daemon and the webserver
respectively. Nor would it be seen on any hosting site which uses
something other than vanilla sshd to terminate the connection (at
GitHub for example, ssh terminates in a proxy layer which then uses a
different protocol to run upload-pack on the backend). That's why
we've been adding these sideband-level errors: so everybody can see
them.

So unless the client is a 2005-era version of Git that lacks sideband,
the die() message is either invisible to the user, or redundant. I'd
almost suggest it is worth dropping the die() message altogether here,
but it does carry some value to people collecting them on the server
side[1].
Another solution is for Git to redirect the stderr of the child
upload-pack it runs, which puts all invocations on the same footing,
regardless of protocol. But that risks losing useful messages from
before the sideband starts up (e.g., ssh failures).

[1] At GitHub, we record the exit status of every Git process, and
    we've hooked die() to record the message, which is often helpful
    for post-facto debugging. I'd hate to lose it, but we could
    certainly work around it for this case, perhaps by hooking
    packet_writer_error() in a similar way.

> Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
> --stress' to try to reproduce a failure caused by those mingled
> messages, the same check only failed for a different reason so far
> (both on Linux and macOS (on Travis CI)):

There's some hand-waving argument that this should be race-free in
014ade7484 (upload-pack: send ERR packet for non-tip objects,
2019-04-13), but I am not too surprised if there is a flaw in that
logic. This kind of race is a problem anytime upload-pack dies. And
it's not just a test issue, but a real problem for users: they might
see EPIPE instead of the error message from the server (it used to be
SIGPIPE but we recently relaxed that).

There's details and a possible path forward discussed in:

  https://public-inbox.org/git/20190305041139.GA19800@sigill.intra.peff.net/

-Peff
On Wed, Aug 28, 2019 at 10:54:12AM -0400, Jeff King wrote:

> > Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
> > --stress' to try to reproduce a failure caused by those mingled
> > messages, the same check only failed for a different reason so far
> > (both on Linux and macOS (on Travis CI)):
>
> There's some hand-waving argument that this should be race-free in
> 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> 2019-04-13), but I am not too surprised if there is a flaw in that
> logic.

By the way, I've not been able to reproduce this locally after ~10
minutes of running "./t5516-fetch-push.sh -r 1,79 --stress" on my
Linux box. I wonder what's different.

Are you running the tip of master?

-Peff
On Wed, Aug 28, 2019 at 11:39:44AM -0400, Jeff King wrote:
> On Wed, Aug 28, 2019 at 10:54:12AM -0400, Jeff King wrote:
>
> > > Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
> > > --stress' to try to reproduce a failure caused by those mingled
> > > messages, the same check only failed for a different reason so far
> > > (both on Linux and macOS (on Travis CI)):
> >
> > There's some hand-waving argument that this should be race-free in
> > 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > 2019-04-13), but I am not too surprised if there is a flaw in that
> > logic.
>
> By the way, I've not been able to reproduce this locally after ~10
> minutes of running "./t5516-fetch-push.sh -r 1,79 --stress" on my Linux
> box. I wonder what's different.
>
> Are you running the tip of master?

Yeah, but this seems to be one of those "you have to be really lucky,
even with --stress" cases.

So... I was away from the keyboard for over an hour and let it run on
'master', but it didn't fail. Then I figured I'd give it a try with
Derrick's patch, because, well, why not, and then I got this broken
pipe error in ~150 repetitions. Run it again, same error after ~200
reps. However, I didn't understand how that patch could lead to a
broken pipe, so went back to stressing master... nothing. So I started
writing the reply to that patch saying that it seems to cause some
racy failures on Linux, and was already proofreading before sending
when the damn thing finally did fail. Oh, well.

Then tried it on macOS, and it failed fairly quickly. For lack of
better options I used Travis CI's debug shell to access a mac VM, and
could reproduce the failure both with and without the patch before it
timed out.
On 8/28/2019 12:15 PM, SZEDER Gábor wrote:
> On Wed, Aug 28, 2019 at 11:39:44AM -0400, Jeff King wrote:
>> On Wed, Aug 28, 2019 at 10:54:12AM -0400, Jeff King wrote:
>>
>>>> Unfortunately, however, while running './t5516-fetch-push.sh -r 1,79
>>>> --stress' to try to reproduce a failure caused by those mingled
>>>> messages, the same check only failed for a different reason so far
>>>> (both on Linux and macOS (on Travis CI)):
>>>
>>> There's some hand-waving argument that this should be race-free in
>>> 014ade7484 (upload-pack: send ERR packet for non-tip objects,
>>> 2019-04-13), but I am not too surprised if there is a flaw in that
>>> logic.
>>
>> By the way, I've not been able to reproduce this locally after ~10
>> minutes of running "./t5516-fetch-push.sh -r 1,79 --stress" on my Linux
>> box. I wonder what's different.
>>
>> Are you running the tip of master?
>
> Yeah, but this seems to be one of those "you have to be really lucky,
> even with --stress" cases.
>
> So... I was away from the keyboard for over an hour and let it run on
> 'master', but it didn't fail. Then I figured I'd give it a try with
> Derrick's patch, because, well, why not, and then I got this broken
> pipe error in ~150 repetitions. Run it again, same error after ~200
> reps. However, I didn't understand how that patch could lead to a
> broken pipe, so went back to stressing master... nothing. So I started
> writing the reply to that patch saying that it seems to cause some
> racy failures on Linux, and was already proofreading before sending
> when the damn thing finally did fail. Oh, well.
>
> Then tried it on macOS, and it failed fairly quickly. For lack of
> better options I used Travis CI's debug shell to access a mac VM, and
> could reproduce the failure both with and without the patch before it
> timed out.

I'm running these tests under --stress now, but not seeing the error
you saw. However, I do have a theory: the process exits before
flushing the packet line.
Adding this line before exit(1) should fix it:

	packet_writer_flush(writer);

I can send this in a v2, but it would be nice if you could test this
in your environment that already demonstrated the failure.

Thanks,
-Stolee
On Thu, Aug 29, 2019 at 08:58:55AM -0400, Derrick Stolee wrote:

> However, I do have a theory: the process exits before flushing the
> packet line. Adding this line before exit(1) should fix it:
>
> 	packet_writer_flush(writer);
>
> I can send this in a v2, but it would be nice if you could test this
> in your environment that already demonstrated the failure.

I don't think we should need such a call. For one thing, if it were
necessary, that would mean we're not writing out the packet at all.
But your whole problem is that we're writing the message twice, one of
which comes from the packet.

Second is that this is not "flush the output stream", but "write a
flush packet". The packet_writer_error() function immediately calls
write() without buffering. And no matter where we are in the
conversation, a flush packet would not be necessary, because the error
packet we send would be interpreted immediately by the client as
aborting the connection.

-Peff
On 8/29/2019 10:13 AM, Jeff King wrote:
> On Thu, Aug 29, 2019 at 08:58:55AM -0400, Derrick Stolee wrote:
>
>> However, I do have a theory: the process exits before flushing the
>> packet line. Adding this line before exit(1) should fix it:
>>
>> 	packet_writer_flush(writer);
>>
>> I can send this in a v2, but it would be nice if you could test this
>> in your environment that already demonstrated the failure.
>
> I don't think we should need such a call. For one thing, if it were
> necessary, that would mean we're not writing out the packet at all. But
> your whole problem is that we're writing the message twice, one of which
> comes from the packet.

The problem the flush() was trying to solve was the new "Broken pipe"
error, which I had assumed was due to a communication race. (Looking
at the message more closely now, I see that Szeder was able to repro
this broken pipe both with and without my change. I am still unable to
repro the broken pipe.)

> Second is that this is not "flush the output stream", but "write a flush
> packet". The packet_writer_error() function immediately calls write()
> without buffering. And no matter where we are in the conversation, a
> flush packet would not be necessary, because the error packet we send
> would be interpreted immediately by the client as aborting the
> connection.

This clearly shows that my proposed solution is absolutely wrong.

Thanks,
-Stolee
On Thu, Aug 29, 2019 at 10:27:16AM -0400, Derrick Stolee wrote:

> > I don't think we should need such a call. For one thing, if it were
> > necessary, that would mean we're not writing out the packet at all. But
> > your whole problem is that we're writing the message twice, one of which
> > comes from the packet.
>
> The problem the flush() was trying to solve was the new "Broken pipe" error,
> which I had assumed was due to a communication race. (Looking at the message
> more closely now, I see that Szeder was able to repro this broken pipe both
> with and without my change. I am still unable to repro the broken pipe.)

I think the broken pipe is coming the _other_ way. We do send the
packet from the server to the client, but since the client is still
writing when the server has hung up, we get a write error instead of
seeing the error packet.

So any fixes there have to happen on the client side. I am still
confused about why the client is writing in this case, per the
argument in 014ade7484 (upload-pack: send ERR packet for non-tip
objects, 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see
what it's trying to write, but I still haven't been able to reproduce
the issue.

-Peff
On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> So any fixes there have to happen on the client side. I am still
> confused about why the client is writing in this case, per the argument
> in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> trying to write, but I still haven't been able to reproduce the issue.

It's the "done" line:

+ cat trace-packet
packet: upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
packet: upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
packet: upload-pack> 0000
packet: fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
packet: fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
packet: fetch< 0000
packet: fetch> want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
packet: fetch> 0000
packet: upload-pack< want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
packet: upload-pack< 0000
packet: upload-pack> ERR upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
packet: upload-pack> 0000
packet: fetch> done

In the average successful run that "fetch> done" pkt-line is
immediately after the "fetch> 0000".
The above is on my Linux box; here is a failure on macOS, essentially the same: https://travis-ci.org/szeder/git/jobs/578555606#L3453
On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > So any fixes there have to happen on the client side. I am still
> > confused about why the client is writing in this case, per the argument
> > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > trying to write, but I still haven't been able to reproduce the issue.
>
> It's the "done" line:
>
> + cat trace-packet
> packet: upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
> packet: upload-pack> 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
> packet: upload-pack> 0000
> packet: fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad HEAD\0multi_ack thin-pack side-band side-band-64k ofs-delta shallow deepen-since deepen-not deepen-relative no-progress include-tag multi_ack_detailed allow-tip-sha1-in-want allow-reachable-sha1-in-want symref=HEAD:refs/heads/master agent=git/2.23.0.40.g4d780d4382.dirty
> packet: fetch< 5e26403b4485d7a44fd8b65dc3f71e21c0dd6fad refs/heads/master
> packet: fetch< 0000
> packet: fetch> want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
> packet: fetch> 0000
> packet: upload-pack< want 64ea4c133d59fa98e86a771eda009872d6ab2886 multi_ack_detailed side-band-64k thin-pack no-progress ofs-delta deepen-since deepen-not agent=git/2.23.0.40.g4d780d4382.dirty
> packet: upload-pack< 0000
> packet: upload-pack> ERR upload-pack: not our ref 64ea4c133d59fa98e86a771eda009872d6ab2886
> packet: upload-pack> 0000
> packet: fetch> done
>
> In the average successful run that "fetch> done" pkt-line is
> immediately after the "fetch> 0000".
>
> The above is on my Linux box; here is a failure on macOS, essentially
> the same:
>
> https://travis-ci.org/szeder/git/jobs/578555606#L3453

Try this patch to reliably reproduce this failure:

diff --git a/fetch-pack.c b/fetch-pack.c
index 65be043f2a..e18864458b 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -471,6 +471,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 	}
 done:
 	if (!got_ready || !no_done) {
+		sleep(1);
 		packet_buf_write(&req_buf, "done\n");
 		send_request(args, fd[1], &req_buf);
 	}
diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index c81ca360ac..ebecc81c89 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -1252,7 +1252,9 @@ do
 		git fetch ../testrepo/.git $SHA1_2 &&
 		git cat-file commit $SHA1_2 &&
 		test_must_fail env GIT_TEST_PROTOCOL_VERSION= \
+			GIT_TRACE_PACKET="$(pwd)/trace-packet" \
 			git fetch ../testrepo/.git $SHA1_3 2>err &&
+		cat trace-packet &&
 		test_i18ngrep "remote error:.*not our ref.*$SHA1_3\$" err
 	)
 '
On Fri, Aug 30, 2019 at 12:06:30AM +0200, SZEDER Gábor wrote:
> On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> > On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > > So any fixes there have to happen on the client side. I am still
> > > confused about why the client is writing in this case, per the argument
> > > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > > trying to write, but I still haven't been able to reproduce the issue.
> >
> > It's the "done" line:
> >
> > + cat trace-packet
[...]
> > packet: upload-pack> 0000
> > packet: fetch> done
> >
> > In the average successful run that "fetch> done" pkt-line is
> > immediately after the "fetch> 0000".

So instead of immediately die()ing after write_in_full() returned an
error, fetch should first try to read all incoming packets in the hope
that the remote did send an ERR packet before it died, and then die
with the error in that packet, or fall back to the current generic
error message if there is no ERR packet (e.g. remote segfaulted or
something similarly horrible). This fixes the test failure with that
strategically-placed sleep() in 'fetch-pack.c'.

https://travis-ci.org/szeder/git/jobs/578778749#L2689

Alas, passing a 'reader' to a function called send_request() doesn't
look quite right, does it... And I'm not sure about the stateless
communication, it still uses write_or_die().
diff --git a/fetch-pack.c b/fetch-pack.c
index e18864458b..773d9c7904 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -186,14 +186,27 @@ static enum ack_type get_ack(struct packet_reader *reader,
 }
 
 static void send_request(struct fetch_pack_args *args,
-			 int fd, struct strbuf *buf)
+			 int fd, struct strbuf *buf,
+			 struct packet_reader *reader)
 {
 	if (args->stateless_rpc) {
 		send_sideband(fd, -1, buf->buf, buf->len, LARGE_PACKET_MAX);
 		packet_flush(fd);
 	} else {
-		if (write_in_full(fd, buf->buf, buf->len) < 0)
+		if (write_in_full(fd, buf->buf, buf->len) < 0) {
+			int save_errno = errno;
+			/*
+			 * Read everything the remote has sent to us.
+			 * If there is an ERR packet, then the loop die()s
+			 * with the received error message.
+			 * If we reach EOF without seeing an ERR, then die()
+			 * with a generic error message, most likely "Broken
+			 * pipe".
+			 */
+			while (packet_reader_read(reader) != PACKET_READ_EOF);
+			errno = save_errno;
 			die_errno(_("unable to write to remote"));
+		}
 	}
 }
 
@@ -353,7 +366,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 		const char *arg;
 		struct object_id oid;
 
-		send_request(args, fd[1], &req_buf);
+		send_request(args, fd[1], &req_buf, &reader);
 		while (packet_reader_read(&reader) == PACKET_READ_NORMAL) {
 			if (skip_prefix(reader.line, "shallow ", &arg)) {
 				if (get_oid_hex(arg, &oid))
@@ -376,7 +389,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 				die(_("expected shallow/unshallow, got %s"), reader.line);
 		}
 	} else if (!args->stateless_rpc)
-		send_request(args, fd[1], &req_buf);
+		send_request(args, fd[1], &req_buf, &reader);
 
 	if (!args->stateless_rpc) {
 		/* If we aren't using the stateless-rpc interface
@@ -398,7 +411,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 			int ack;
 
 			packet_buf_flush(&req_buf);
-			send_request(args, fd[1], &req_buf);
+			send_request(args, fd[1], &req_buf, &reader);
 			strbuf_setlen(&req_buf, state_len);
 			flushes++;
 			flush_at = next_flush(args->stateless_rpc, count);
@@ -473,7 +486,7 @@ static int find_common(struct fetch_negotiator *negotiator,
 	if (!got_ready || !no_done) {
 		sleep(1);
 		packet_buf_write(&req_buf, "done\n");
-		send_request(args, fd[1], &req_buf);
+		send_request(args, fd[1], &req_buf, &reader);
 	}
 	print_verbose(args, _("done"));
 	if (retval != 0) {
On Fri, Aug 30, 2019 at 02:10:05PM +0200, SZEDER Gábor wrote:
> On Fri, Aug 30, 2019 at 12:06:30AM +0200, SZEDER Gábor wrote:
> > On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> > > On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > > > So any fixes there have to happen on the client side. I am still
> > > > confused about why the client is writing in this case, per the argument
> > > > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > > > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > > > trying to write, but I still haven't been able to reproduce the issue.
> > >
> > > It's the "done" line:
> > >
> > > + cat trace-packet
> [...]
> > > packet: upload-pack> 0000
> > > packet: fetch> done
> > >
> > > In the average successful run that "fetch> done" pkt-line is
> > > immediately after the "fetch> 0000".

Thanks for all of your persistent digging on this. I had forgotten
about the "done" packet, but it explains all of the symptoms we've
seen.

> So instead of immediately die()ing after write_in_full() returned an
> error, fetch should first try to read all incoming packets in the hope
> that the remote did send an ERR packet before it died, and then die
> with the error in that packet, or fall back to the current generic
> error message if there is no ERR packet (e.g. remote segfaulted or
> something similarly horrible). This fixes the test failure with that
> strategically-placed sleep() in 'fetch-pack.c'.
>
> https://travis-ci.org/szeder/git/jobs/578778749#L2689
>
> Alas, passing a 'reader' to a function called send_request() doesn't
> look quite right, does it... And I'm not sure about the stateless
> communication, it still uses write_or_die().

And thank you for putting this patch together.
I had taken a stab at it a while ago, but got discouraged by figuring
out at which layer to add the "reader" info (I had envisioned it much
lower in packet_write(), but it is clear from your patch that
fetch-pack does most of its own writing).

I agree passing around the reader is a bit weird; I wonder if we
should be representing the full-duplex connection more clearly as a
single struct. But I suspect that creates other headaches, and what
you have here doesn't look _too_ bad. As you note, it probably doesn't
cover all code paths, but it at least fixes some of them, and gives us
a template for addressing the others.

> 	} else {
> -		if (write_in_full(fd, buf->buf, buf->len) < 0)
> +		if (write_in_full(fd, buf->buf, buf->len) < 0) {
> +			int save_errno = errno;
> +			/*
> +			 * Read everything the remote has sent to us.
> +			 * If there is an ERR packet, then the loop die()s
> +			 * with the received error message.
> +			 * If we reach EOF without seeing an ERR, then die()
> +			 * with a generic error message, most likely "Broken
> +			 * pipe".
> +			 */
> +			while (packet_reader_read(reader) != PACKET_READ_EOF);
> +			errno = save_errno;
> 			die_errno(_("unable to write to remote"));
> +		}

One unfortunate thing here is that we could block indefinitely in
packet_reader_read(). That shouldn't happen, I don't think, but since
this is an error case where we've been cut off, anything's possible.

We maybe could get away with using non-blocking I/O. We're looking for
an ERR packet the other side sent us _before_ it hung up, so in theory
we'd have received the data before any FIN packet (or EOF on a pipe).
But I'm wary of introducing new races there.

It might be enough to put in an actual timer, waiting for an ERR
packet, EOF, or something like 5 seconds. Or maybe I'm just being
overly paranoid.

-Peff
On Wed, Sep 04, 2019 at 01:04:42AM -0400, Jeff King wrote:
> On Fri, Aug 30, 2019 at 02:10:05PM +0200, SZEDER Gábor wrote:
>
> > On Fri, Aug 30, 2019 at 12:06:30AM +0200, SZEDER Gábor wrote:
> > > On Thu, Aug 29, 2019 at 11:58:18PM +0200, SZEDER Gábor wrote:
> > > > On Thu, Aug 29, 2019 at 10:38:05AM -0400, Jeff King wrote:
> > > > > So any fixes there have to happen on the client side. I am still
> > > > > confused about why the client is writing in this case, per the argument
> > > > > in 014ade7484 (upload-pack: send ERR packet for non-tip objects,
> > > > > 2019-04-13). It would be nice to use GIT_TRACE_PACKET to see what it's
> > > > > trying to write, but I still haven't been able to reproduce the issue.
> > > >
> > > > It's the "done" line:
> > > >
> > > > + cat trace-packet
> > [...]
> > > > packet: upload-pack> 0000
> > > > packet: fetch> done
> > > >
> > > > In the average successful run that "fetch> done" pkt-line is
> > > > immediately after the "fetch> 0000".
>
> Thanks for all of your persistent digging on this.

Yeah, I can be easily distracted by an interesting looking bug... was
told it's a character flaw ;)

> I had forgotten about
> the "done" packet, but it explains all of the symptoms we've seen.
>
> > So instead of immediately die()ing after write_in_full() returned an
> > error, fetch should first try to read all incoming packets in the hope
> > that the remote did send an ERR packet before it died, and then die
> > with the error in that packet, or fall back to the current generic
> > error message if there is no ERR packet (e.g. remote segfaulted or
> > something similarly horrible). This fixes the test failure with that
> > strategically-placed sleep() in 'fetch-pack.c'.
> >
> > https://travis-ci.org/szeder/git/jobs/578778749#L2689
> >
> > Alas, passing a 'reader' to a function called send_request() doesn't
> > look quite right, does it... And I'm not sure about the stateless
> > communication, it still uses write_or_die().
>
> And thank you for putting this patch together. I had taken a stab at it
> a while ago, but got discouraged by figuring out at which layer to add
> the "reader" info (I had envisioned it much lower in packet_write(), but
> it is clear from your patch that fetch-pack does most of its own
> writing).
>
> I agree passing around the reader is a bit weird;

I considered renaming send_request() so that 'reader' won't look that
out of place among its parameters, but all my ideas were ridiculous,
e.g. send_request_and_process_ERR_pkt_on_error()...

> I wonder if we should
> be representing the full-duplex connection more clearly as a single
> struct. But I suspect that creates other headaches, and what you have
> here doesn't look _too_ bad. As you note, it probably doesn't cover all
> code paths, but it at least fixes some of them, and gives us a template
> for addressing the others.
>
> > 	} else {
> > -		if (write_in_full(fd, buf->buf, buf->len) < 0)
> > +		if (write_in_full(fd, buf->buf, buf->len) < 0) {
> > +			int save_errno = errno;
> > +			/*
> > +			 * Read everything the remote has sent to us.
> > +			 * If there is an ERR packet, then the loop die()s
> > +			 * with the received error message.
> > +			 * If we reach EOF without seeing an ERR, then die()
> > +			 * with a generic error message, most likely "Broken
> > +			 * pipe".
> > +			 */
> > +			while (packet_reader_read(reader) != PACKET_READ_EOF);
> > +			errno = save_errno;
> > 			die_errno(_("unable to write to remote"));
> > +		}
>
> One unfortunate thing here is that we could block indefinitely in
> packet_reader_read(). That shouldn't happen, I don't think, but since
> this is an error case where we've been cut off, anything's possible.

Yeah, when we use different file descriptors for reading and writing,
any error on the writing fd doesn't necessarily mean that there is an
error on the reading fd as well.
I mean, we could get an EBADF or EFAULT as well, but those would
rather indicate a bug in Git than an error with the connection itself.
I wondered whether we could avoid blocking indefinitely by looking for
an ERR packet only if the write() resulted in ECONNRESET or EPIPE,
i.e. errors that indicate a connection problem.

I suppose 'git upload-pack' (or an alternative implementation) could
be buggy and could inadvertently close() the other end of the fd that
fetch-pack writes to but not the fd where it reads from, so the
write() would get ECONNRESET, but then packet_reader_read() could
still hang on the still-open read fd. I'm not sure whether it's worth
worrying about; I mean, a buggy 'git upload-pack' can do all kinds of
weird things that would lead to a hang.

Anyway, after write_in_full() returns with an error we could first
print an error message with error_errno() and then go on to look for
an ERR packet. So even if packet_reader_read() hangs, at least there
will be an error message for the user to see. It wouldn't help
automation, though.

> We maybe could get away with using non-blocking I/O. We're looking for
> an ERR packet the other side sent us _before_ it hung up, so in theory
> we'd have received the data before any FIN packet (or EOF on a pipe).
> But I'm wary of introducing new races there.
>
> It might be enough to put in an actual timer, waiting for an ERR packet,
> EOF, or something like 5 seconds. Or maybe I'm just being overly
> paranoid.

I think the timeout would be the safest bet, but then we would have to
pass that timeout parameter through a couple of function calls...
diff --git a/upload-pack.c b/upload-pack.c
index 222cd3ad89..b0d3e028d1 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -613,11 +613,12 @@ static void check_non_tip(struct object_array *want_obj,
 	for (i = 0; i < want_obj->nr; i++) {
 		struct object *o = want_obj->objects[i].item;
 		if (!is_our_ref(o)) {
+			warning("git upload-pack: not our ref %s",
+				oid_to_hex(&o->oid));
 			packet_writer_error(writer,
 					    "upload-pack: not our ref %s",
 					    oid_to_hex(&o->oid));
-			die("git upload-pack: not our ref %s",
-			    oid_to_hex(&o->oid));
+			exit(1);
 		}
 	}
 }