diff mbox series

t5512.40 sometimes dies by SIGPIPE

Message ID xmqqmskbwe1a.fsf@gitster.g (mailing list archive)
State Accepted
Commit e1e0d305c417448753162129f694afa9f183c03e
Headers show
Series t5512.40 sometimes dies by SIGPIPE | expand

Commit Message

Junio C Hamano Sept. 13, 2024, 7:26 p.m. UTC
The last test in t5512 we recently added seems to be flaky.
Running

    $ make && cd t && sh ./t5512-ls-remote.sh --stress

shows that "git ls-remote foo::bar" exited with status 141, which
means we got a SIGPIPE.  This test piece was introduced by 9e89dcb6
(builtin/ls-remote: fall back to SHA1 outside of a repo, 2024-08-02)
and is pretty much independent from all other tests in the script
(it can even run standalone with everything before it removed).

The transport-helper.c:get_helper() function tries to write to the
helper.  As we can see the helper script is very short and can exit
even before it reads anything, when get_helper() tries to give the
first command, "capabilities", the helper may already be gone.

A trivial fix, presented here, os to make sure that the helper reads
the first command it is given, as what it writes later is a response
to that command.

I however would wonder if the interactions with the helper initiated
by get_helper() should be done on a non-blocking I/O (we do check
the return value from our write(2) system calls, do we?).

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 t/t5512-ls-remote.sh | 1 +
 1 file changed, 1 insertion(+)

Comments

Eric Sunshine Sept. 13, 2024, 7:44 p.m. UTC | #1
On Fri, Sep 13, 2024 at 3:26 PM Junio C Hamano <gitster@pobox.com> wrote:
> The last test in t5512 we recently added seems to be flaky.
> Running
>
>     $ make && cd t && sh ./t5512-ls-remote.sh --stress
>
> shows that "git ls-remote foo::bar" exited with status 141, which
> means we got a SIGPIPE.  This test piece was introduced by 9e89dcb6
> (builtin/ls-remote: fall back to SHA1 outside of a repo, 2024-08-02)
> and is pretty much independent from all other tests in the script
> (it can even run standalone with everything before it removed).
>
> The transport-helper.c:get_helper() function tries to write to the
> helper.  As we can see the helper script is very short and can exit
> even before it reads anything, when get_helper() tries to give the
> first command, "capabilities", the helper may already be gone.
>
> A trivial fix, presented here, os to make sure that the helper reads

s/os/is/

> the first command it is given, as what it writes later is a response
> to that command.
>
> I however would wonder if the interactions with the helper initiated
> by get_helper() should be done on a non-blocking I/O (we do check
> the return value from our write(2) system calls, do we?).
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Junio C Hamano Sept. 13, 2024, 8:48 p.m. UTC | #2
Eric Sunshine <sunshine@sunshineco.com> writes:

>> A trivial fix, presented here, os to make sure that the helper reads
>
> s/os/is/

Thanks.
Jeff King Sept. 14, 2024, 6:41 a.m. UTC | #3
On Fri, Sep 13, 2024 at 12:26:41PM -0700, Junio C Hamano wrote:

> The last test in t5512 we recently added seems to be flaky.
> Running
> 
>     $ make && cd t && sh ./t5512-ls-remote.sh --stress
> 
> shows that "git ls-remote foo::bar" exited with status 141, which
> means we got a SIGPIPE.  This test piece was introduced by 9e89dcb6
> (builtin/ls-remote: fall back to SHA1 outside of a repo, 2024-08-02)
> and is pretty much independent from all other tests in the script
> (it can even run standalone with everything before it removed).
> 
> The transport-helper.c:get_helper() function tries to write to the
> helper.  As we can see the helper script is very short and can exit
> even before it reads anything, when get_helper() tries to give the
> first command, "capabilities", the helper may already be gone.
> 
> A trivial fix, presented here, os to make sure that the helper reads
> the first command it is given, as what it writes later is a response
> to that command.

This analysis looks right, and I think the test change here is a good
minimal fix. It does leave "real world" cases where we'd hit SIGPIPE
vulnerable to the same problem, but I don't think that happens all that
often (any helper that doesn't speak the protocol correctly and causes
spurious pipe failures will annoy users and get fixed).

So I think we should apply it as-is. But see below for some interesting
historical context.

> I however would wonder if the interactions with the helper initiated
> by get_helper() should be done on a non-blocking I/O (we do check
> the return value from our write(2) system calls, do we?).

You still get SIGPIPE with non-blocking I/O. I assume you meant ignoring
SIGPIPE?

That's certainly an option, but unfortunately it does mean we'd ignore
SIGPIPE everywhere. We do generally check the return value of write(),
but not always for buffered I/O. So in something like:

  git ls-remote | head -1

we'd probably do a bunch of useless writes to the closed pipe. That's
not the end of the world, given that we've done most of the work before
generating the first line of output anyway, but that's the case that
SIGPIPE is really supposed to be helping with. It's much worse on
something like git-log, but if we are just talking about ignoring
SIGPIPE in users of transport-helper.c, that's much more limited (though
a confusing concept when you start thinking about libification, which
may do many things in one process).

I really wish there was a way to ignore SIGPIPE per-descriptor (or even
tell which descriptor triggered it in a signal handler), but I don't
think there is. Some environments like "go" claim to do this, but I
think it's really that they ignore SIGPIPE, and then assume all writes
are going through their wrapper, which converts EPIPE into the signal.
That's a bit harder for us to do (even if we got xwrite(), you have to
similarly wrap fprintf, and so on).

Another option is to selectively disable/re-enable SIGPIPE for
individual write() calls. That's not thread-safe, and I imagine may have
some performance penalty in practice due to the extra syscalls. But it
might make sense to do it selectively.

And now we get into the historical bits.

A long time ago, we noticed that a totally broken helper (i.e., one that
exits immediately) would cause Git to exit silently, which could be
confusing. So we added an error message when we saw an unexpected close
of the pipe. But people found that annoying, because most of the time
the helper already said something useful to stderr, and the user did not
even care that a helper was in use. So we got rid of it.

But we discussed the concept of complaining only about the initial
"capabilities" request, which would catch helpers that failed to even
start. There's some discussion in this thread:

  https://lore.kernel.org/git/20140117201325.GB775@sigill.intra.peff.net/

But the most interesting part to me is that on that same day over 10
years ago I wrote the patch below, but apparently never sent it. And
I've been rebasing it forward ever since (I have several dozen of these
"to look at and think about again" topic branches, and I guess I let
this one slide for quite a while).

Anyway, the interesting thing is that it does the exact "turn off
SIGPIPE only for this call" strategy discussed above. And so I did not
even see the race in t5512, because it was "fixed" in my fork. I use
quotation marks there, though, because the test was sometimes racily
succeeding for the wrong reason (because we got EPIPE, not because of
the forbidden capability response). So even if we took the patch below,
I think your test fix is the right thing to do.

-- >8 --
From: Jeff King <peff@peff.net>
Date: Fri, 17 Jan 2014 15:36:21 -0500
Subject: [PATCH] print an error when remote helpers die during capabilities

The transport-helper code generally relies on the
remote-helper to provide an informative message to the user
when it encounters an error. In the rare cases where the
helper does not do so, the output can be quite confusing.
E.g.:

  $ git clone https://example.com/foo.git
  Cloning into 'foo'...
  $ echo $?
  128
  $ ls foo
  /bin/ls: cannot access foo: No such file or directory

We tried to address this with 81d340d (transport-helper:
report errors properly, 2013-04-10).

But that makes the common case much more confusing. The
remote helper protocol's method for signaling normal errors
is to simply hang up. So when the helper does encounter a
routine error and prints something to stderr, the extra
error message is redundant and misleading. So we dropped it
again in 266f1fd (transport-helper: be quiet on read errors
from helpers, 2013-06-21).

This puts the uncommon case right back where it started. We
may be able to do a little better, though. It is common for
the helper to die during a "real" command, like fetching the
list of remote refs. It is not common for it to die during
the initial "capabilities" negotiation, right after we
start. Reporting failure here is likely to catch fundamental
problems that prevent the helper from running (and reporting
errors) at all. Anything after that is the responsibility of
the helper itself to report.

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t5801-remote-helpers.sh | 11 +++++++++++
 transport-helper.c        | 16 +++++++++++++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/t/t5801-remote-helpers.sh b/t/t5801-remote-helpers.sh
index 20f43f7b7d..d21877150e 100755
--- a/t/t5801-remote-helpers.sh
+++ b/t/t5801-remote-helpers.sh
@@ -344,4 +344,15 @@ test_expect_success 'fetch tag' '
 	compare_refs local v1.0 server v1.0
 '
 
+test_expect_success 'totally broken helper reports failure message' '
+	write_script git-remote-broken <<-\EOF &&
+	read cap_cmd
+	exit 1
+	EOF
+	test_must_fail \
+		env PATH="$PWD:$PATH" \
+		git clone broken://example.com/foo.git 2>stderr &&
+	grep aborted stderr
+'
+
 test_done
diff --git a/transport-helper.c b/transport-helper.c
index 09b3560ffd..98dbeb6d3d 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -89,11 +89,18 @@ static int recvline(struct helper_data *helper, struct strbuf *buffer)
 	return recvline_fh(helper->out, buffer);
 }
 
-static void write_constant(int fd, const char *str)
+static int write_constant_gently(int fd, const char *str)
 {
 	if (debug)
 		fprintf(stderr, "Debug: Remote helper: -> %s", str);
 	if (write_in_full(fd, str, strlen(str)) < 0)
+		return -1;
+	return 0;
+}
+
+static void write_constant(int fd, const char *str)
+{
+	if (write_constant_gently(fd, str) < 0)
 		die_errno(_("full write to remote helper failed"));
 }
 
@@ -168,13 +175,16 @@ static struct child_process *get_helper(struct transport *transport)
 		die_errno(_("can't dup helper output fd"));
 	data->out = xfdopen(duped, "r");
 
-	write_constant(helper->in, "capabilities\n");
+	sigchain_push(SIGPIPE, SIG_IGN);
+	if (write_constant_gently(helper->in, "capabilities\n") < 0)
+		die("remote helper '%s' aborted session", data->name);
+	sigchain_pop(SIGPIPE);
 
 	while (1) {
 		const char *capname, *arg;
 		int mandatory = 0;
 		if (recvline(data, &buf))
-			exit(128);
+			die("remote helper '%s' aborted session", data->name);
 
 		if (!*buf.buf)
 			break;
Junio C Hamano Sept. 14, 2024, 5:09 p.m. UTC | #4
Jeff King <peff@peff.net> writes:

> I really wish there was a way to ignore SIGPIPE per-descriptor (or even
> tell which descriptor triggered it in a signal handler), but I don't
> think there is.
> ...
> Another option is to selectively disable/re-enable SIGPIPE for
> individual write() calls. That's not thread-safe, and I imagine may have
> some performance penalty in practice due to the extra syscalls. But it
> might make sense to do it selectively.

All true.

> +static int write_constant_gently(int fd, const char *str)
>  {
>  	if (debug)
>  		fprintf(stderr, "Debug: Remote helper: -> %s", str);
>  	if (write_in_full(fd, str, strlen(str)) < 0)
> +		return -1;
> +	return 0;
> +}
> +
> +static void write_constant(int fd, const char *str)
> +{
> +	if (write_constant_gently(fd, str) < 0)
>  		die_errno(_("full write to remote helper failed"));
>  }

The _gently variant is meant to be (optionally) used while SIGPIPE
is ignored, and this one is meant to be always used while SIGPIPE is
not.  If the reading end closed the fd, the underlying write would
fail and trigger SIGPIPE.  This die_errno() will not trigger in such
a case.  But for other kind of errors, we die just like we used to
in the original code.  OK.

> @@ -168,13 +175,16 @@ static struct child_process *get_helper(struct transport *transport)
>  		die_errno(_("can't dup helper output fd"));
>  	data->out = xfdopen(duped, "r");
>  
> -	write_constant(helper->in, "capabilities\n");
> +	sigchain_push(SIGPIPE, SIG_IGN);
> +	if (write_constant_gently(helper->in, "capabilities\n") < 0)
> +		die("remote helper '%s' aborted session", data->name);
> +	sigchain_pop(SIGPIPE);

And the only caller of the _gently variant with SIGPIPE ignored
supplies its own error message.  It is easier for the end-user to
identify than a generic "Broken pipe".  Nice.

We identified a helper that closes the connection even before it
hears from us, so we say "aborted".

>  	while (1) {
>  		const char *capname, *arg;
>  		int mandatory = 0;
>  		if (recvline(data, &buf))
> -			exit(128);
> +			die("remote helper '%s' aborted session", data->name);

Here, we found a helper that failed to talk back to us (they may
have actually read what we wrote initially, or what we wrote may be
hanging in the pipe buffer without being read).  It depends on the
timing between which of the above two points we detect such a
helper, so using the same error message does make sense.

> diff --git a/t/t5801-remote-helpers.sh b/t/t5801-remote-helpers.sh
> index 20f43f7b7d..d21877150e 100755
> --- a/t/t5801-remote-helpers.sh
> +++ b/t/t5801-remote-helpers.sh
> @@ -344,4 +344,15 @@ test_expect_success 'fetch tag' '
>  	compare_refs local v1.0 server v1.0
>  '
>  
> +test_expect_success 'totally broken helper reports failure message' '
> +	write_script git-remote-broken <<-\EOF &&
> +	read cap_cmd
> +	exit 1
> +	EOF
> +	test_must_fail \
> +		env PATH="$PWD:$PATH" \
> +		git clone broken://example.com/foo.git 2>stderr &&
> +	grep aborted stderr
> +'
> +
>  test_done

The code change covers both a helper that disconnects before we
write the first greeting and a helper that disconnects while we are
still expecting it to talk to us.  The test explicitly covers the
latter by reading our greeting (in other words, it does not die
before we do our initial "write"---no risk of the SIGPIPE) and then
stopping without writing anything.  As we are waiting to read from
the helper, we will see an error in recvline().

Nice.

If the test loses the initial "read the greeting", then some of the
time our greeting would be hanging in the pipe, we wait in read, and
notice that the helper died, to trigger the "recvline() failed" code
path.  But other times, the helper would die even before we write
the greeting, so we'd see an error from write_constant_gently() and
die with an identical message.  Such a test won't suffer from any
flakyness but makes me feel somewhat dirty, so the above is good
;-).

Thanks.
Chris Torek Sept. 14, 2024, 11:27 p.m. UTC | #5
Just a quick note for background:

On Fri, Sep 13, 2024 at 11:41 PM Jeff King <peff@peff.net> wrote:
> I really wish there was a way to ignore SIGPIPE per-descriptor (or even
> tell which descriptor triggered it in a signal handler), but I don't
> think there is. Some environments like "go" claim to do this, but I
> think it's really that they ignore SIGPIPE, and then assume all writes
> are going through their wrapper, which converts EPIPE into the signal.

It's more complicated than that (and varies on MacOS), but it's true
that the Go runtime wraps all OS-level write() operations. The reason
for that is that Go has its own internal "thread" (g/m/p things, really)
scheduler and does *all* I/O as non-blocking calls. Anything that
potentially blocks an actual OS-level thread has to be handled
specially. It's quite messy and You Don't Want To Go There (no
pun intended) if you don't have to.

(The Go scheduler is especially complex since it's doing M:N
multiplexing of goroutines to OS-level threads within a single
OS-level process. It also has to allow calls in and out via the
Go scheduler and foreign-function interfaces. When making
a blocking system call, it will create a new OS thread, increasing
past the normal GOMAXPROCS thread limit.)

(Also: I'm not steeped in the internals of Git's transport code
but Peff's patch looks good to me.)

Chris
Junio C Hamano Sept. 15, 2024, 4:13 p.m. UTC | #6
Chris Torek <chris.torek@gmail.com> writes:

> It's more complicated than that (and varies on MacOS), but it's true
> that the Go runtime wraps all OS-level write() operations. The reason
> for that is that Go has its own internal "thread" (g/m/p things, really)
> scheduler and does *all* I/O as non-blocking calls. Anything that
> potentially blocks an actual OS-level thread has to be handled
> specially. It's quite messy and You Don't Want To Go There (no
> pun intended) if you don't have to.

That sounds like a fun hack ;-)
Patrick Steinhardt Sept. 16, 2024, 8:07 a.m. UTC | #7
On Sat, Sep 14, 2024 at 10:09:52AM -0700, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > I really wish there was a way to ignore SIGPIPE per-descriptor (or even
> > tell which descriptor triggered it in a signal handler), but I don't
> > think there is.
> > ...
> > Another option is to selectively disable/re-enable SIGPIPE for
> > individual write() calls. That's not thread-safe, and I imagine may have
> > some performance penalty in practice due to the extra syscalls. But it
> > might make sense to do it selectively.
> 
> All true.
> 
> > +static int write_constant_gently(int fd, const char *str)
> >  {
> >  	if (debug)
> >  		fprintf(stderr, "Debug: Remote helper: -> %s", str);
> >  	if (write_in_full(fd, str, strlen(str)) < 0)
> > +		return -1;
> > +	return 0;
> > +}
> > +
> > +static void write_constant(int fd, const char *str)
> > +{
> > +	if (write_constant_gently(fd, str) < 0)
> >  		die_errno(_("full write to remote helper failed"));
> >  }
> 
> The _gently variant is meant to be (optionally) used while SIGPIPE
> is ignored, and this one is meant to be always used while SIGPIPE is
> not.  If the reading end closed the fd, the underlying write would
> fail and trigger SIGPIPE.  This die_errno() will not trigger in such
> a case.  But for other kind of errors, we die just like we used to
> in the original code.  OK.
> 
> > @@ -168,13 +175,16 @@ static struct child_process *get_helper(struct transport *transport)
> >  		die_errno(_("can't dup helper output fd"));
> >  	data->out = xfdopen(duped, "r");
> >  
> > -	write_constant(helper->in, "capabilities\n");
> > +	sigchain_push(SIGPIPE, SIG_IGN);
> > +	if (write_constant_gently(helper->in, "capabilities\n") < 0)
> > +		die("remote helper '%s' aborted session", data->name);
> > +	sigchain_pop(SIGPIPE);
> 
> And the only caller of the _gently variant with SIGPIPE ignored
> supplies its own error message.  It is easier for the end-user to
> identify than a generic "Broken pipe".  Nice.
> 
> We identified a helper that closes the connection even before it
> hears from us, so we say "aborted".
> 
> >  	while (1) {
> >  		const char *capname, *arg;
> >  		int mandatory = 0;
> >  		if (recvline(data, &buf))
> > -			exit(128);
> > +			die("remote helper '%s' aborted session", data->name);
> 
> Here, we found a helper that failed to talk back to us (they may
> have actually read what we wrote initially, or what we wrote may be
> hanging in the pipe buffer without being read).  It depends on the
> timing between which of the above two points we detect such a
> helper, so using the same error message does make sense.
> 
> > diff --git a/t/t5801-remote-helpers.sh b/t/t5801-remote-helpers.sh
> > index 20f43f7b7d..d21877150e 100755
> > --- a/t/t5801-remote-helpers.sh
> > +++ b/t/t5801-remote-helpers.sh
> > @@ -344,4 +344,15 @@ test_expect_success 'fetch tag' '
> >  	compare_refs local v1.0 server v1.0
> >  '
> >  
> > +test_expect_success 'totally broken helper reports failure message' '
> > +	write_script git-remote-broken <<-\EOF &&
> > +	read cap_cmd
> > +	exit 1
> > +	EOF
> > +	test_must_fail \
> > +		env PATH="$PWD:$PATH" \
> > +		git clone broken://example.com/foo.git 2>stderr &&
> > +	grep aborted stderr
> > +'
> > +
> >  test_done
> 
> The code change covers both a helper that disconnects before we
> write the first greeting and a helper that disconnects while we are
> still expecting it to talk to us.  The test explicitly covers the
> latter by reading our greeting (in other words, it does not die
> before we do our initial "write"---no risk of the SIGPIPE) and then
> stopping without writing anything.  As we are waiting to read from
> the helper, we will see an error in recvline().
> 
> Nice.
> 
> If the test loses the initial "read the greeting", then some of the
> time our greeting would be hanging in the pipe, we wait in read, and
> notice that the helper died, to trigger the "recvline() failed" code
> path.  But other times, the helper would die even before we write
> the greeting, so we'd see an error from write_constant_gently() and
> die with an identical message.  Such a test won't suffer from any
> flakyness but makes me feel somewhat dirty, so the above is good
> ;-).

Thanks to both of you for catching and addressing this issue! Both
patches look sensible to me.

Patrick
diff mbox series

Patch

diff --git c/t/t5512-ls-remote.sh w/t/t5512-ls-remote.sh
index d64b40e408..64b3491e4e 100755
--- c/t/t5512-ls-remote.sh
+++ w/t/t5512-ls-remote.sh
@@ -406,6 +406,7 @@  test_expect_success 'v0 clients can handle multiple symrefs' '
 test_expect_success 'helper with refspec capability fails gracefully' '
 	mkdir test-bin &&
 	write_script test-bin/git-remote-foo <<-EOF &&
+	read capabilities
 	echo import
 	echo refspec ${SQ}*:*${SQ}
 	EOF