Message ID | xmqqmskbwe1a.fsf@gitster.g (mailing list archive) |
---|---|
State | Accepted |
Commit | e1e0d305c417448753162129f694afa9f183c03e |
Headers | show |
Series | t5512.40 sometimes dies by SIGPIPE | expand |
On Fri, Sep 13, 2024 at 3:26 PM Junio C Hamano <gitster@pobox.com> wrote: > The last test in t5512 we recently added seems to be flaky. > Running > > $ make && cd t && sh ./t5512-ls-remote.sh --stress > > shows that "git ls-remote foo::bar" exited with status 141, which > means we got a SIGPIPE. This test piece was introduced by 9e89dcb6 > (builtin/ls-remote: fall back to SHA1 outside of a repo, 2024-08-02) > and is pretty much independent from all other tests in the script > (it can even run standalone with everything before it removed). > > The transport-helper.c:get_helper() function tries to write to the > helper. As we can see the helper script is very short and can exit > even before it reads anything, when get_helper() tries to give the > first command, "capabilities", the helper may already be gone. > > A trivial fix, presented here, os to make sure that the helper reads s/os/is/ > the first command it is given, as what it writes later is a response > to that command. > > I however would wonder if the interactions with the helper initiated > by get_helper() should be done on a non-blocking I/O (we do check > the return value from our write(2) system calls, do we?). > > Signed-off-by: Junio C Hamano <gitster@pobox.com>
Eric Sunshine <sunshine@sunshineco.com> writes: >> A trivial fix, presented here, os to make sure that the helper reads > > s/os/is/ Thanks.
On Fri, Sep 13, 2024 at 12:26:41PM -0700, Junio C Hamano wrote: > The last test in t5512 we recently added seems to be flaky. > Running > > $ make && cd t && sh ./t5512-ls-remote.sh --stress > > shows that "git ls-remote foo::bar" exited with status 141, which > means we got a SIGPIPE. This test piece was introduced by 9e89dcb6 > (builtin/ls-remote: fall back to SHA1 outside of a repo, 2024-08-02) > and is pretty much independent from all other tests in the script > (it can even run standalone with everything before it removed). > > The transport-helper.c:get_helper() function tries to write to the > helper. As we can see the helper script is very short and can exit > even before it reads anything, when get_helper() tries to give the > first command, "capabilities", the helper may already be gone. > > A trivial fix, presented here, os to make sure that the helper reads > the first command it is given, as what it writes later is a response > to that command. This analysis looks right, and I think the test change here is a good minimal fix. It does leave "real world" cases where we'd hit SIGPIPE vulnerable to the same problem, but I don't think that happens all that often (any helper that doesn't speak the protocol correctly and causes spurious pipe failures will annoy users and get fixed). So I think we should apply it as-is. But see below for some interesting historical context. > I however would wonder if the interactions with the helper initiated > by get_helper() should be done on a non-blocking I/O (we do check > the return value from our write(2) system calls, do we?). You still get SIGPIPE with non-blocking I/O. I assume you meant ignoring SIGPIPE? That's certainly an option, but unfortunately it does mean we'd ignore SIGPIPE everywhere. We do generally check the return value of write(), but not always for buffered I/O. So in something like: git ls-remote | head -1 we'd probably do a bunch of useless writes to the closed pipe. That's not the end of the world, given that we've done most of the work before generating the first line of output anyway, but that's the case that SIGPIPE is really supposed to be helping with. It's much worse on something like git-log, but if we are just talking about ignoring SIGPIPE in users of transport-helper.c, that's much more limited (though a confusing concept when you start thinking about libification, which may do many things in one process). I really wish there was a way to ignore SIGPIPE per-descriptor (or even tell which descriptor triggered it in a signal handler), but I don't think there is. Some environments like "go" claim to do this, but I think it's really that they ignore SIGPIPE, and then assume all writes are going through their wrapper, which converts EPIPE into the signal. That's a bit harder for us to do (even if we got xwrite(), you have to similarly wrap fprintf, and so on). Another option is to selectively disable/re-enable SIGPIPE for individual write() calls. That's not thread-safe, and I imagine may have some performance penalty in practice due to the extra syscalls. But it might make sense to do it selectively. And now we get into the historical bits. A long time ago, we noticed that a totally broken helper (i.e., one that exits immediately) would cause Git to exit silently, which could be confusing. So we added an error message when we saw an unexpected close of the pipe. But people found that annoying, because most of the time the helper already said something useful to stderr, and the user did not even care that a helper was in use. So we got rid of it. But we discussed the concept of complaining only about the initial "capabilities" request, which would catch helpers that failed to even start. There's some discussion in this thread: https://lore.kernel.org/git/20140117201325.GB775@sigill.intra.peff.net/ But the most interesting part to me is that on that same day over 10 years ago I wrote the patch below, but apparently never sent it. And I've been rebasing it forward ever since (I have several dozen of these "to look at and think about again" topic branches, and I guess I let this one slide for quite a while). Anyway, the interesting thing is that it does the exact "turn off SIGPIPE only for this call" strategy discussed above. And so I did not even see the race in t5512, because it was "fixed" in my fork. I use quotation marks there, though, because the test was sometimes racily succeeding for the wrong reason (because we got EPIPE, not because of the forbidden capability response). So even if we took the patch below, I think your test fix is the right thing to do. -- >8 -- From: Jeff King <peff@peff.net> Date: Fri, 17 Jan 2014 15:36:21 -0500 Subject: [PATCH] print an error when remote helpers die during capabilities The transport-helper code generally relies on the remote-helper to provide an informative message to the user when it encounters an error. In the rare cases where the helper does not do so, the output can be quite confusing. E.g.: $ git clone https://example.com/foo.git Cloning into 'foo'... $ echo $? 128 $ ls foo /bin/ls: cannot access foo: No such file or directory We tried to address this with 81d340d (transport-helper: report errors properly, 2013-04-10). But that makes the common case much more confusing. The remote helper protocol's method for signaling normal errors is to simply hang up. So when the helper does encounter a routine error and prints something to stderr, the extra error message is redundant and misleading. So we dropped it again in 266f1fd (transport-helper: be quiet on read errors from helpers, 2013-06-21). This puts the uncommon case right back where it started. We may be able to do a little better, though. It is common for the helper to die during a "real" command, like fetching the list of remote refs. It is not common for it to die during the initial "capabilities" negotiation, right after we start. Reporting failure here is likely to catch fundamental problems that prevent the helper from running (and reporting errors) at all. Anything after that is the responsibility of the helper itself to report. Signed-off-by: Jeff King <peff@peff.net> --- t/t5801-remote-helpers.sh | 11 +++++++++++ transport-helper.c | 16 +++++++++++++--- 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/t/t5801-remote-helpers.sh b/t/t5801-remote-helpers.sh index 20f43f7b7d..d21877150e 100755 --- a/t/t5801-remote-helpers.sh +++ b/t/t5801-remote-helpers.sh @@ -344,4 +344,15 @@ test_expect_success 'fetch tag' ' compare_refs local v1.0 server v1.0 ' +test_expect_success 'totally broken helper reports failure message' ' + write_script git-remote-broken <<-\EOF && + read cap_cmd + exit 1 + EOF + test_must_fail \ + env PATH="$PWD:$PATH" \ + git clone broken://example.com/foo.git 2>stderr && + grep aborted stderr +' + test_done diff --git a/transport-helper.c b/transport-helper.c index 09b3560ffd..98dbeb6d3d 100644 --- a/transport-helper.c +++ b/transport-helper.c @@ -89,11 +89,18 @@ static int recvline(struct helper_data *helper, struct strbuf *buffer) return recvline_fh(helper->out, buffer); } -static void write_constant(int fd, const char *str) +static int write_constant_gently(int fd, const char *str) { if (debug) fprintf(stderr, "Debug: Remote helper: -> %s", str); if (write_in_full(fd, str, strlen(str)) < 0) + return -1; + return 0; +} + +static void write_constant(int fd, const char *str) +{ + if (write_constant_gently(fd, str) < 0) die_errno(_("full write to remote helper failed")); } @@ -168,13 +175,16 @@ static struct child_process *get_helper(struct transport *transport) die_errno(_("can't dup helper output fd")); data->out = xfdopen(duped, "r"); - write_constant(helper->in, "capabilities\n"); + sigchain_push(SIGPIPE, SIG_IGN); + if (write_constant_gently(helper->in, "capabilities\n") < 0) + die("remote helper '%s' aborted session", data->name); + sigchain_pop(SIGPIPE); while (1) { const char *capname, *arg; int mandatory = 0; if (recvline(data, &buf)) - exit(128); + die("remote helper '%s' aborted session", data->name); if (!*buf.buf) break;
Jeff King <peff@peff.net> writes: > I really wish there was a way to ignore SIGPIPE per-descriptor (or even > tell which descriptor triggered it in a signal handler), but I don't > think there is. > ... > Another option is to selectively disable/re-enable SIGPIPE for > individual write() calls. That's not thread-safe, and I imagine may have > some performance penalty in practice due to the extra syscalls. But it > might make sense to do it selectively. All true. > +static int write_constant_gently(int fd, const char *str) > { > if (debug) > fprintf(stderr, "Debug: Remote helper: -> %s", str); > if (write_in_full(fd, str, strlen(str)) < 0) > + return -1; > + return 0; > +} > + > +static void write_constant(int fd, const char *str) > +{ > + if (write_constant_gently(fd, str) < 0) > die_errno(_("full write to remote helper failed")); > } The _gently variant is meant to be (optionally) used while SIGPIPE is ignored, and this one is meant to be always used while SIGPIPE is not. If the reading end closed the fd, the underlying write would fail and trigger SIGPIPE. This die_errno() will not trigger in such a case. But for other kind of errors, we die just like we used to in the original code. OK. > @@ -168,13 +175,16 @@ static struct child_process *get_helper(struct transport *transport) > die_errno(_("can't dup helper output fd")); > data->out = xfdopen(duped, "r"); > > - write_constant(helper->in, "capabilities\n"); > + sigchain_push(SIGPIPE, SIG_IGN); > + if (write_constant_gently(helper->in, "capabilities\n") < 0) > + die("remote helper '%s' aborted session", data->name); > + sigchain_pop(SIGPIPE); And the only caller of the _gently variant with SIGPIPE ignored supplies its own error message. It is easier for the end-user to identify than a generic "Broken pipe". Nice. We identified a helper that closes the connection even before it hears from us, so we say "aborted". > while (1) { > const char *capname, *arg; > int mandatory = 0; > if (recvline(data, &buf)) > - exit(128); > + die("remote helper '%s' aborted session", data->name); Here, we found a helper that failed to talk back to us (they may have actually read what we wrote initially, or what we wrote may be hanging in the pipe buffer without being read). It depends on the timing between which of the above two points we detect such a helper, so using the same error message does make sense. > diff --git a/t/t5801-remote-helpers.sh b/t/t5801-remote-helpers.sh > index 20f43f7b7d..d21877150e 100755 > --- a/t/t5801-remote-helpers.sh > +++ b/t/t5801-remote-helpers.sh > @@ -344,4 +344,15 @@ test_expect_success 'fetch tag' ' > compare_refs local v1.0 server v1.0 > ' > > +test_expect_success 'totally broken helper reports failure message' ' > + write_script git-remote-broken <<-\EOF && > + read cap_cmd > + exit 1 > + EOF > + test_must_fail \ > + env PATH="$PWD:$PATH" \ > + git clone broken://example.com/foo.git 2>stderr && > + grep aborted stderr > +' > + > test_done The code change covers both a helper that disconnects before we write the first greeting and a helper that disconnects while we are still expecting it to talk to us. The test explicitly covers the latter by reading our greeting (in other words, it does not die before we do our initial "write"---no risk of the SIGPIPE) and then stopping without writing anything. As we are waiting to read from the helper, we will see an error in recvline(). Nice. If the test loses the initial "read the greeting", then some of the time our greeting would be hanging in the pipe, we wait in read, and notice that the helper died, to trigger the "recvline() failed" code path. But other times, the helper would die even before we write the greeting, so we'd see an error from write_constant_gently() and die with an identical message. Such a test won't suffer from any flakyness but makes me feel somewhat dirty, so the above is good ;-). Thanks.
Just a quick note for background: On Fri, Sep 13, 2024 at 11:41 PM Jeff King <peff@peff.net> wrote: > I really wish there was a way to ignore SIGPIPE per-descriptor (or even > tell which descriptor triggered it in a signal handler), but I don't > think there is. Some environments like "go" claim to do this, but I > think it's really that they ignore SIGPIPE, and then assume all writes > are going through their wrapper, which converts EPIPE into the signal. It's more complicated than that (and varies on MacOS), but it's true that the Go runtime wraps all OS-level write() operations. The reason for that is that Go has its own internal "thread" (g/m/p things, really) scheduler and does *all* I/O as non-blocking calls. Anything that potentially blocks an actual OS-level thread has to be handled specially. It's quite messy and You Don't Want To Go There (no pun intended) if you don't have to. (The Go scheduler is especially complex since it's doing M:N multiplexing of goroutines to OS-level threads within a single OS-level process. It also has to allow calls in and out via the Go scheduler and foreign-function interfaces. When making a blocking system call, it will create a new OS thread, increasing past the normal GOMAXPROCS thread limit.) (Also: I'm not steeped in the internals of Git's transport code but Peff's patch looks good to me.) Chris
Chris Torek <chris.torek@gmail.com> writes: > It's more complicated than that (and varies on MacOS), but it's true > that the Go runtime wraps all OS-level write() operations. The reason > for that is that Go has its own internal "thread" (g/m/p things, really) > scheduler and does *all* I/O as non-blocking calls. Anything that > potentially blocks an actual OS-level thread has to be handled > specially. It's quite messy and You Don't Want To Go There (no > pun intended) if you don't have to. That sounds like a fun hack ;-)
On Sat, Sep 14, 2024 at 10:09:52AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > I really wish there was a way to ignore SIGPIPE per-descriptor (or even > > tell which descriptor triggered it in a signal handler), but I don't > > think there is. > > ... > > Another option is to selectively disable/re-enable SIGPIPE for > > individual write() calls. That's not thread-safe, and I imagine may have > > some performance penalty in practice due to the extra syscalls. But it > > might make sense to do it selectively. > > All true. > > > +static int write_constant_gently(int fd, const char *str) > > { > > if (debug) > > fprintf(stderr, "Debug: Remote helper: -> %s", str); > > if (write_in_full(fd, str, strlen(str)) < 0) > > + return -1; > > + return 0; > > +} > > + > > +static void write_constant(int fd, const char *str) > > +{ > > + if (write_constant_gently(fd, str) < 0) > > die_errno(_("full write to remote helper failed")); > > } > > The _gently variant is meant to be (optionally) used while SIGPIPE > is ignored, and this one is meant to be always used while SIGPIPE is > not. If the reading end closed the fd, the underlying write would > fail and trigger SIGPIPE. This die_errno() will not trigger in such > a case. But for other kind of errors, we die just like we used to > in the original code. OK. > > > @@ -168,13 +175,16 @@ static struct child_process *get_helper(struct transport *transport) > > die_errno(_("can't dup helper output fd")); > > data->out = xfdopen(duped, "r"); > > > > - write_constant(helper->in, "capabilities\n"); > > + sigchain_push(SIGPIPE, SIG_IGN); > > + if (write_constant_gently(helper->in, "capabilities\n") < 0) > > + die("remote helper '%s' aborted session", data->name); > > + sigchain_pop(SIGPIPE); > > And the only caller of the _gently variant with SIGPIPE ignored > supplies its own error message. It is easier for the end-user to > identify than a generic "Broken pipe". Nice. > > We identified a helper that closes the connection even before it > hears from us, so we say "aborted". > > > while (1) { > > const char *capname, *arg; > > int mandatory = 0; > > if (recvline(data, &buf)) > > - exit(128); > > + die("remote helper '%s' aborted session", data->name); > > Here, we found a helper that failed to talk back to us (they may > have actually read what we wrote initially, or what we wrote may be > hanging in the pipe buffer without being read). It depends on the > timing between which of the above two points we detect such a > helper, so using the same error message does make sense. > > > diff --git a/t/t5801-remote-helpers.sh b/t/t5801-remote-helpers.sh > > index 20f43f7b7d..d21877150e 100755 > > --- a/t/t5801-remote-helpers.sh > > +++ b/t/t5801-remote-helpers.sh > > @@ -344,4 +344,15 @@ test_expect_success 'fetch tag' ' > > compare_refs local v1.0 server v1.0 > > ' > > > > +test_expect_success 'totally broken helper reports failure message' ' > > + write_script git-remote-broken <<-\EOF && > > + read cap_cmd > > + exit 1 > > + EOF > > + test_must_fail \ > > + env PATH="$PWD:$PATH" \ > > + git clone broken://example.com/foo.git 2>stderr && > > + grep aborted stderr > > +' > > + > > test_done > > The code change covers both a helper that disconnects before we > write the first greeting and a helper that disconnects while we are > still expecting it to talk to us. The test explicitly covers the > latter by reading our greeting (in other words, it does not die > before we do our initial "write"---no risk of the SIGPIPE) and then > stopping without writing anything. As we are waiting to read from > the helper, we will see an error in recvline(). > > Nice. > > If the test loses the initial "read the greeting", then some of the > time our greeting would be hanging in the pipe, we wait in read, and > notice that the helper died, to trigger the "recvline() failed" code > path. But other times, the helper would die even before we write > the greeting, so we'd see an error from write_constant_gently() and > die with an identical message. Such a test won't suffer from any > flakyness but makes me feel somewhat dirty, so the above is good > ;-). Thanks to both of you for catching and addressing this issue! Both patches look sensible to me. Patrick
diff --git c/t/t5512-ls-remote.sh w/t/t5512-ls-remote.sh index d64b40e408..64b3491e4e 100755 --- c/t/t5512-ls-remote.sh +++ w/t/t5512-ls-remote.sh @@ -406,6 +406,7 @@ test_expect_success 'v0 clients can handle multiple symrefs' ' test_expect_success 'helper with refspec capability fails gracefully' ' mkdir test-bin && write_script test-bin/git-remote-foo <<-EOF && + read capabilities echo import echo refspec ${SQ}*:*${SQ} EOF
The last test in t5512 we recently added seems to be flaky. Running $ make && cd t && sh ./t5512-ls-remote.sh --stress shows that "git ls-remote foo::bar" exited with status 141, which means we got a SIGPIPE. This test piece was introduced by 9e89dcb6 (builtin/ls-remote: fall back to SHA1 outside of a repo, 2024-08-02) and is pretty much independent from all other tests in the script (it can even run standalone with everything before it removed). The transport-helper.c:get_helper() function tries to write to the helper. As we can see the helper script is very short and can exit even before it reads anything, when get_helper() tries to give the first command, "capabilities", the helper may already be gone. A trivial fix, presented here, os to make sure that the helper reads the first command it is given, as what it writes later is a response to that command. I however would wonder if the interactions with the helper initiated by get_helper() should be done on a non-blocking I/O (we do check the return value from our write(2) system calls, do we?). Signed-off-by: Junio C Hamano <gitster@pobox.com> --- t/t5512-ls-remote.sh | 1 + 1 file changed, 1 insertion(+)