trace2: intercept all common signals

Message ID	20240510172243.3529851-1-emilyshaffer@google.com (mailing list archive)
State	New
Headers	show Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9269417580 for <git@vger.kernel.org>; Fri, 10 May 2024 17:22:49 +0000 (UTC) Date: Fri, 10 May 2024 10:22:43 -0700 Precedence: bulk Mime-Version: 1.0 Message-ID: <20240510172243.3529851-1-emilyshaffer@google.com> Subject: [PATCH] trace2: intercept all common signals From: Emily Shaffer <emilyshaffer@google.com> To: git@vger.kernel.org Cc: git@jeffhostetler.com, Emily Shaffer <nasamuffin@google.com> Content-Type: text/plain; charset="UTF-8"
Series	trace2: intercept all common signals \| expand trace2: intercept all common signals

Emily Shaffer May 10, 2024, 5:22 p.m. UTC

From: Emily Shaffer <nasamuffin@google.com>

We already use trace2 to find out about unexpected pipe breakages, which
is nice for detecting bugs or system problems, by adding a handler for
SIGPIPE which simply writes a trace2 line. However, there are a handful
of other common signals which we might want to snoop on:

 - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
   frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
 - SIGHUP, when the network closes unexpectedly (indicating there may be
   a problem to solve)

There are lots more signals which we might find useful later, but at
least let's teach trace2 to report these egregious ones. Conveniently,
they're also already covered by the `_common` variants in sigchain.[ch].

Sigchain itself is already tested via helper/test-sigchain.c, and trace2
is tested in a number of places - let's also add tests demonstrating
that sigchain + trace2 works correctly.

Signed-off-by: Emily Shaffer <nasamuffin@google.com>
---
 t/helper/test-trace2.c   | 17 +++++++++++++++++
 t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++
 trace2.c                 |  2 +-
 3 files changed, 40 insertions(+), 1 deletion(-)

Emily Shaffer May 10, 2024, 5:57 p.m. UTC | #1

On Fri, May 10, 2024 at 10:22 AM Emily Shaffer <emilyshaffer@google.com> wrote:
>
> From: Emily Shaffer <nasamuffin@google.com>
>
> We already use trace2 to find out about unexpected pipe breakages, which
> is nice for detecting bugs or system problems, by adding a handler for
> SIGPIPE which simply writes a trace2 line. However, there are a handful
> of other common signals which we might want to snoop on:
>
>  - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
>    frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
>  - SIGHUP, when the network closes unexpectedly (indicating there may be
>    a problem to solve)
>
> There are lots more signals which we might find useful later, but at
> least let's teach trace2 to report these egregious ones. Conveniently,
> they're also already covered by the `_common` variants in sigchain.[ch].
>
> Sigchain itself is already tested via helper/test-sigchain.c, and trace2
> is tested in a number of places - let's also add tests demonstrating
> that sigchain + trace2 works correctly.
>
> Signed-off-by: Emily Shaffer <nasamuffin@google.com>
> ---

Missed including the CI results. They're passing[1] with the exception
of the osx-gcc run, which seems to also be failing on the latest
'master'[2] and looks to be failing in setup rather than in test run.

1: https://github.com/nasamuffin/git/actions/runs/9035666915
2: https://github.com/nasamuffin/git/actions/runs/9036080205/job/24832209129

>  t/helper/test-trace2.c   | 17 +++++++++++++++++
>  t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++
>  trace2.c                 |  2 +-
>  3 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c
> index 1adac29a57..8970956ea8 100644
> --- a/t/helper/test-trace2.c
> +++ b/t/helper/test-trace2.c
> @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED)
>         BUG("a %s message", "BUG");
>  }
>
> +static int ut_011signal(int argc, const char **argv)
> +{
> +       const char *usage_error = "expect <bool common>";
> +       int common = 0;
> +
> +       if (argc != 1 || get_i(&common, argv[0]))
> +               die("%s", usage_error);
> +
> +       /*
> +        * There is no strong reason SIGSEGV is ignored by trace2 - it's just
> +        * not included by sigchain_push_common().
> +        */
> +       raise(common ? SIGINT : SIGSEGV);
> +       return 0; /*unreachable*/
> +}
> +
>  /*
>   * Single-threaded timer test.  Create several intervals using the
>   * TEST1 timer.  The test script can verify that an aggregate Trace2
> @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = {
>         { ut_008bug,      "008bug",    "" },
>         { ut_009bug_BUG,  "009bug_BUG","" },
>         { ut_010bug_BUG,  "010bug_BUG","" },
> +       { ut_011signal,   "011signal","" },
>
>         { ut_100timer,    "100timer",  "<count> <ms_delay>" },
>         { ut_101timer,    "101timer",  "<count> <ms_delay> <threads>" },
> diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh
> index c312657a12..c34ccc518c 100755
> --- a/t/t0210-trace2-normal.sh
> +++ b/t/t0210-trace2-normal.sh
> @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' '
>         test_cmp expect actual
>  '
>
> +test_expect_success 'trace2 reports common signals' '
> +       test_when_finished "rm trace.normal actual" &&
> +
> +       # signals are fatal, so expect this to fail
> +       ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 &&
> +
> +       perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual &&
> +
> +       test_grep "signal elapsed:" actual
> +'
> +
> +test_expect_success 'trace2 ignores uncommon signals' '
> +       test_when_finished "rm trace.normal actual" &&
> +
> +       # signals are fatal, so expect this to fail
> +       ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 0 &&
> +
> +       perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual &&
> +
> +       ! test_grep "signal elapsed:" actual
> +'
> +
>  sane_unset GIT_TRACE2_BRIEF
>
>  # Now test without environment variables and get all Trace2 settings
> diff --git a/trace2.c b/trace2.c
> index f894532d05..3692010f5d 100644
> --- a/trace2.c
> +++ b/trace2.c
> @@ -231,7 +231,7 @@ void trace2_initialize_fl(const char *file, int line)
>         tr2_sid_get();
>
>         atexit(tr2main_atexit_handler);
> -       sigchain_push(SIGPIPE, tr2main_signal_handler);
> +       sigchain_push_common(tr2main_signal_handler);
>         tr2tls_init();
>
>         /*
> --
> 2.45.0.118.g7fe29c98d7-goog
>

Junio C Hamano May 10, 2024, 6:47 p.m. UTC | #2

Emily Shaffer <emilyshaffer@google.com> writes:

> From: Emily Shaffer <nasamuffin@google.com>
>
> We already use trace2 to find out about unexpected pipe breakages, which
> is nice for detecting bugs or system problems, by adding a handler for
> SIGPIPE which simply writes a trace2 line. However, there are a handful
> of other common signals which we might want to snoop on:
>
>  - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
>    frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
>  - SIGHUP, when the network closes unexpectedly (indicating there may be
>    a problem to solve)
>
> There are lots more signals which we might find useful later, but at
> least let's teach trace2 to report these egregious ones. Conveniently,
> they're also already covered by the `_common` variants in sigchain.[ch].
>
> Sigchain itself is already tested via helper/test-sigchain.c, and trace2
> is tested in a number of places - let's also add tests demonstrating
> that sigchain + trace2 works correctly.
>
> Signed-off-by: Emily Shaffer <nasamuffin@google.com>
> ---
>  t/helper/test-trace2.c   | 17 +++++++++++++++++
>  t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++
>  trace2.c                 |  2 +-
>  3 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c
> index 1adac29a57..8970956ea8 100644
> --- a/t/helper/test-trace2.c
> +++ b/t/helper/test-trace2.c
> @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED)
>  	BUG("a %s message", "BUG");
>  }
>  
> +static int ut_011signal(int argc, const char **argv)
> +{
> +	const char *usage_error = "expect <bool common>";
> +	int common = 0;
> +
> +	if (argc != 1 || get_i(&common, argv[0]))
> +		die("%s", usage_error);
> +
> +	/*
> +	 * There is no strong reason SIGSEGV is ignored by trace2 - it's just
> +	 * not included by sigchain_push_common().
> +	 */
> +	raise(common ? SIGINT : SIGSEGV);
> +	return 0; /*unreachable*/
> +}
> +
>  /*
>   * Single-threaded timer test.  Create several intervals using the
>   * TEST1 timer.  The test script can verify that an aggregate Trace2
> @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = {
>  	{ ut_008bug,      "008bug",    "" },
>  	{ ut_009bug_BUG,  "009bug_BUG","" },
>  	{ ut_010bug_BUG,  "010bug_BUG","" },
> +	{ ut_011signal,   "011signal","" },
>  
>  	{ ut_100timer,    "100timer",  "<count> <ms_delay>" },
>  	{ ut_101timer,    "101timer",  "<count> <ms_delay> <threads>" },
> diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh
> index c312657a12..c34ccc518c 100755
> --- a/t/t0210-trace2-normal.sh
> +++ b/t/t0210-trace2-normal.sh
> @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'trace2 reports common signals' '
> +	test_when_finished "rm trace.normal actual" &&
> +
> +	# signals are fatal, so expect this to fail
> +	! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 &&

Is it deliberate that this does not use test_must_fail or is it an
oversight?  The same comment applies to all other uses of "! env".

We often see the use of "env" in conjunction with a test that is
expected to fail because

    test_must_fail VAR=VAL cmd

simply does not work.  If you are not using test_expect_fail, then

    ! VAR=VAL cmd

should be sufficient, but it would mean that you will be happy even
if the way "cmd" dies is not in a controlled way (e.g. due to
receiving a signal).

Ah, perhaps that is it?  Is "test-tool trace2 011signal 1" raise a
signal to kill itself and after showing the event in the trace
stream it is expected to die the narual death of receiving the same
signal by re-raising it?

If that is what is happening here, not using test_must_fail is
absolutely the right thing to do, but then I doubt you need "env"
there.  Also, if we know what signal is raised, then we should also
know the exit status from this (i.e. signal number plus 128 or
something) that we want to validate?  I dunno.


> diff --git a/trace2.c b/trace2.c
> index f894532d05..3692010f5d 100644
> --- a/trace2.c
> +++ b/trace2.c
> @@ -231,7 +231,7 @@ void trace2_initialize_fl(const char *file, int line)
>  	tr2_sid_get();
>  
>  	atexit(tr2main_atexit_handler);
> -	sigchain_push(SIGPIPE, tr2main_signal_handler);
> +	sigchain_push_common(tr2main_signal_handler);
>  	tr2tls_init();
>  
>  	/*

Emily Shaffer May 10, 2024, 7:34 p.m. UTC | #3

On Fri, May 10, 2024 at 11:47 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Emily Shaffer <emilyshaffer@google.com> writes:
>
> > From: Emily Shaffer <nasamuffin@google.com>
> >
> > We already use trace2 to find out about unexpected pipe breakages, which
> > is nice for detecting bugs or system problems, by adding a handler for
> > SIGPIPE which simply writes a trace2 line. However, there are a handful
> > of other common signals which we might want to snoop on:
> >
> >  - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
> >    frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
> >  - SIGHUP, when the network closes unexpectedly (indicating there may be
> >    a problem to solve)
> >
> > There are lots more signals which we might find useful later, but at
> > least let's teach trace2 to report these egregious ones. Conveniently,
> > they're also already covered by the `_common` variants in sigchain.[ch].
> >
> > Sigchain itself is already tested via helper/test-sigchain.c, and trace2
> > is tested in a number of places - let's also add tests demonstrating
> > that sigchain + trace2 works correctly.
> >
> > Signed-off-by: Emily Shaffer <nasamuffin@google.com>
> > ---
> >  t/helper/test-trace2.c   | 17 +++++++++++++++++
> >  t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++
> >  trace2.c                 |  2 +-
> >  3 files changed, 40 insertions(+), 1 deletion(-)
> >
> > diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c
> > index 1adac29a57..8970956ea8 100644
> > --- a/t/helper/test-trace2.c
> > +++ b/t/helper/test-trace2.c
> > @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED)
> >       BUG("a %s message", "BUG");
> >  }
> >
> > +static int ut_011signal(int argc, const char **argv)
> > +{
> > +     const char *usage_error = "expect <bool common>";
> > +     int common = 0;
> > +
> > +     if (argc != 1 || get_i(&common, argv[0]))
> > +             die("%s", usage_error);
> > +
> > +     /*
> > +      * There is no strong reason SIGSEGV is ignored by trace2 - it's just
> > +      * not included by sigchain_push_common().
> > +      */
> > +     raise(common ? SIGINT : SIGSEGV);
> > +     return 0; /*unreachable*/
> > +}
> > +
> >  /*
> >   * Single-threaded timer test.  Create several intervals using the
> >   * TEST1 timer.  The test script can verify that an aggregate Trace2
> > @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = {
> >       { ut_008bug,      "008bug",    "" },
> >       { ut_009bug_BUG,  "009bug_BUG","" },
> >       { ut_010bug_BUG,  "010bug_BUG","" },
> > +     { ut_011signal,   "011signal","" },
> >
> >       { ut_100timer,    "100timer",  "<count> <ms_delay>" },
> >       { ut_101timer,    "101timer",  "<count> <ms_delay> <threads>" },
> > diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh
> > index c312657a12..c34ccc518c 100755
> > --- a/t/t0210-trace2-normal.sh
> > +++ b/t/t0210-trace2-normal.sh
> > @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' '
> >       test_cmp expect actual
> >  '
> >
> > +test_expect_success 'trace2 reports common signals' '
> > +     test_when_finished "rm trace.normal actual" &&
> > +
> > +     # signals are fatal, so expect this to fail
> > +     ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 &&
>
> Is it deliberate that this does not use test_must_fail or is it an
> oversight?  The same comment applies to all other uses of "! env".
>
> We often see the use of "env" in conjunction with a test that is
> expected to fail because
>
>     test_must_fail VAR=VAL cmd
>
> simply does not work.  If you are not using test_expect_fail, then
>
>     ! VAR=VAL cmd
>
> should be sufficient, but it would mean that you will be happy even
> if the way "cmd" dies is not in a controlled way (e.g. due to
> receiving a signal).
>
> Ah, perhaps that is it?  Is "test-tool trace2 011signal 1" raise a
> signal to kill itself and after showing the event in the trace
> stream it is expected to die the narual death of receiving the same
> signal by re-raising it?

Yes, it is because test_must_fail expects "natural" death. You can
tell test_must_fail which signal you'd expect to receive, in theory,
but I didn't get it to work (and it will be tricky to provide the
correct signal in shell - I had originally hardcoded signal ints in
sh, but then moved the signal enum->int resolution into
helper/test-trace2.c because the alternative is doing some nasty
grepping on other shell utility outputs, since the signal codes aren't
platform/arch consistent).

Anyway, I will try it without `env`.

>
> If that is what is happening here, not using test_must_fail is
> absolutely the right thing to do, but then I doubt you need "env"
> there.  Also, if we know what signal is raised, then we should also
> know the exit status from this (i.e. signal number plus 128 or
> something) that we want to validate?  I dunno.

We could? But I don't feel strongly about it. If I specify the exit
status, the test will be brittle if we change the exit codes later
(for example, 128 or -1 for all error exits is kind of an antipattern;
it might be nice to ask Git to return meaningful error codes depending
on what went wrong, in the future). We are already checking later on
during the test_grep that we exited due to a fatal signal.

Thanks for the feedback - I'll get going on a v2 and aim to send it
later today, since I don't hear you saying that the patch's overall
goal is objectionable.

While I'm at it, since you pointed out ! instead of test_must_fail, I
wondered if I should change "! test_grep" as well - but when I grep t/
it looks like it's not usual to use `test_must_fail test_grep`, but
instead to use `test_grep ! <omitted pattern> <file>`. I'll change
that too.

I also wonder - do we want to capture SIGKILL as well? Some people may
have muscle memory for `kill -9` (I do, for better or worse) instead
of gentler `kill`. My intent was to notice any indication of user
frustration resulting in manual termination, which would include `kill
-9` too...

 - Emily

Jeff King May 10, 2024, 7:41 p.m. UTC | #4

On Fri, May 10, 2024 at 10:22:43AM -0700, Emily Shaffer wrote:

> From: Emily Shaffer <nasamuffin@google.com>
> 
> We already use trace2 to find out about unexpected pipe breakages, which
> is nice for detecting bugs or system problems, by adding a handler for
> SIGPIPE which simply writes a trace2 line. However, there are a handful
> of other common signals which we might want to snoop on:
> 
>  - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
>    frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
>  - SIGHUP, when the network closes unexpectedly (indicating there may be
>    a problem to solve)
> 
> There are lots more signals which we might find useful later, but at
> least let's teach trace2 to report these egregious ones. Conveniently,
> they're also already covered by the `_common` variants in sigchain.[ch].

I think this would be a useful thing to have, but having looked at the
trace2 signal code, this is almost certain to cause racy deadlocks.

The exact details depend on the specific trace2 target backend, but
looking at the various fn_signal() methods, they rely on allocations via
strbufs. This is a problem in signal handlers because we can get a
signal at any time, including when other code is inside malloc() holding
a lock. And then further calls to malloc() will block forever on that
lock.

We should be able to do a quick experiment. Try this snippet, which
repeatedly kills "git log -p" (which is likely to be allocating memory)
and waits for it to exit. Eventually each invocation will stall on a
deadlock:

-- >8 --
doit() {
	me=$1
	i=0
	while true; do
		GIT_TRACE2=1 ./git log -p >/dev/null 2>&1 &
		sleep 0.1
		kill $!
		wait $! 2>/dev/null
		i=$((i+1))
		echo $me:$i
	done
}

for i in $(seq 1 64); do
	doit $i &
done
-- >8 --

I didn't have the patience to wait for all of them to stall, but if you
let it run for a bit and check "ps", you'll see some git processes which
are hanging. Stracing shows them stuck on a lock, like:

  $ strace -p 1838693
  strace: Process 1838693 attached
  futex(0x7facf02df3e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1838693 detached
   <detached ...>

This problem existed before your patch. I imagine it was much less
likely (or perhaps even impossible) with SIGPIPE though, because we'd
see that signal only when in a write() syscall, which implies we're not
in malloc(). Whereas we can get SIGTERM, etc, any time.

Obviously the script above is meant to exacerbate the situation, and
most runs would be fine. But over the course of normal use across many
users and many runs, I think we would see this in practice. I think your
test won't because it triggers the signal only from raise().

So I think before doing this, we'd need to clean up the trace2 signal
code to avoid any allocations.

-Peff

Jeff King May 10, 2024, 7:46 p.m. UTC | #5

On Fri, May 10, 2024 at 12:34:07PM -0700, Emily Shaffer wrote:

> Yes, it is because test_must_fail expects "natural" death. You can
> tell test_must_fail which signal you'd expect to receive, in theory,
> but I didn't get it to work (and it will be tricky to provide the
> correct signal in shell - I had originally hardcoded signal ints in
> sh, but then moved the signal enum->int resolution into
> helper/test-trace2.c because the alternative is doing some nasty
> grepping on other shell utility outputs, since the signal codes aren't
> platform/arch consistent).

We have test_match_signal(). Unfortunately it's not integrated with
test_expect_code(), so you have to do:

  { thing_which_fails; OUT=$?; } &&
  test_match_signal 15 "$OUT"

See 5263e22cba (t7006: simplify exit-code checks for sigpipe tests,
2021-11-21) for an example.

> I also wonder - do we want to capture SIGKILL as well? Some people may
> have muscle memory for `kill -9` (I do, for better or worse) instead
> of gentler `kill`. My intent was to notice any indication of user
> frustration resulting in manual termination, which would include `kill
> -9` too...

You can't catch SIGKILL; its whole purpose is to be un-catchable.

-Peff

Emily Shaffer May 10, 2024, 7:49 p.m. UTC | #6

On Fri, May 10, 2024 at 12:46 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, May 10, 2024 at 12:34:07PM -0700, Emily Shaffer wrote:
>
> > Yes, it is because test_must_fail expects "natural" death. You can
> > tell test_must_fail which signal you'd expect to receive, in theory,
> > but I didn't get it to work (and it will be tricky to provide the
> > correct signal in shell - I had originally hardcoded signal ints in
> > sh, but then moved the signal enum->int resolution into
> > helper/test-trace2.c because the alternative is doing some nasty
> > grepping on other shell utility outputs, since the signal codes aren't
> > platform/arch consistent).
>
> We have test_match_signal(). Unfortunately it's not integrated with
> test_expect_code(), so you have to do:
>
>   { thing_which_fails; OUT=$?; } &&
>   test_match_signal 15 "$OUT"

Right, what I meant above is that `15` isn't portable, I'd have to get
the correct int value of SIGINT/SIGSEGV from some other shell utility
at test time.

>
> See 5263e22cba (t7006: simplify exit-code checks for sigpipe tests,
> 2021-11-21) for an example.
>
> > I also wonder - do we want to capture SIGKILL as well? Some people may
> > have muscle memory for `kill -9` (I do, for better or worse) instead
> > of gentler `kill`. My intent was to notice any indication of user
> > frustration resulting in manual termination, which would include `kill
> > -9` too...
>
> You can't catch SIGKILL; its whole purpose is to be un-catchable.

Duh :') Thanks.

>
> -Peff

Jeff King May 10, 2024, 8:05 p.m. UTC | #7

On Fri, May 10, 2024 at 12:49:06PM -0700, Emily Shaffer wrote:

> > We have test_match_signal(). Unfortunately it's not integrated with
> > test_expect_code(), so you have to do:
> >
> >   { thing_which_fails; OUT=$?; } &&
> >   test_match_signal 15 "$OUT"
> 
> Right, what I meant above is that `15` isn't portable, I'd have to get
> the correct int value of SIGINT/SIGSEGV from some other shell utility
> at test time.

Yes, but we already rely on it elsewhere (like t0005), and the idea of
test_match_signal is that it would convert from "standard" numbers to
something platform specific. Though aside from Windows (where the signal
number is sometimes lost entirely) we've never had to actually do such
conversion so far; "15" really is standard.

If your primary concern, though, is the trace2 output and not the exit
code of the program, then it may not be worth worrying too much about.

-Peff

Junio C Hamano May 10, 2024, 8:34 p.m. UTC | #8

Emily Shaffer <nasamuffin@google.com> writes:

> While I'm at it, since you pointed out ! instead of test_must_fail, I
> wondered if I should change "! test_grep" as well - but when I grep t/
> it looks like it's not usual to use `test_must_fail test_grep`, but
> instead to use `test_grep ! <omitted pattern> <file>`. I'll change
> that too.

"! test_grep" is an anti-pattern.  We should have a documentation
somewhere in t/README or nearby (if we don't, somebody please add
one).

The point of test_grep is "when we expect to see hits, we do show
them to the standard output even if we just used a bare 'grep', but
when such a test fails, we can easily miss the failure, because the
failure is signalled only by $? and no output---hence, test_grep
helper loudly says that we expected to find something but we did not
see any".  Using "! test_grep" will make "! grep" louder in a wrong
case.  That is the whole reason why "test_grep !" exists.

> I also wonder - do we want to capture SIGKILL as well?

An eternally interesting question is "How would you catch an
uncatchable signal?" ;-)

Jeff King May 10, 2024, 10:20 p.m. UTC | #9

On Fri, May 10, 2024 at 01:34:42PM -0700, Junio C Hamano wrote:

> Emily Shaffer <nasamuffin@google.com> writes:
> 
> > While I'm at it, since you pointed out ! instead of test_must_fail, I
> > wondered if I should change "! test_grep" as well - but when I grep t/
> > it looks like it's not usual to use `test_must_fail test_grep`, but
> > instead to use `test_grep ! <omitted pattern> <file>`. I'll change
> > that too.
> 
> "! test_grep" is an anti-pattern.  We should have a documentation
> somewhere in t/README or nearby (if we don't, somebody please add
> one).

Better than documentation, maybe something like:

diff --git a/t/check-non-portable-shell.pl b/t/check-non-portable-shell.pl
index b2b28c2ced..7de2c30aa0 100755
--- a/t/check-non-portable-shell.pl
+++ b/t/check-non-portable-shell.pl
@@ -51,6 +51,7 @@ sub err {
 		err q(quote "$val" in 'local var=$val');
 	/^\s*([A-Z0-9_]+=(\w*|(["']).*?\3)\s+)+(\w+)/ and exists($func{$4}) and
 		err '"FOO=bar shell_func" assignment extends beyond "shell_func"';
+	/! test_grep/ and err 'do not invert test_* functions';
 	$line = '';
 	# this resets our $. for each file
 	close ARGV if eof;

There's at least one other case already. If you shorten it to just "!
test_" to catch more functions, you can see there are a lot of wrong
test_cmp invocations, too (maybe not quite as bad because we don't
produce a specific message, but we'd yield a confusing diff output).

But I think we can't cover everything; there are some like
test_have_prereq which obviously are invertible.

-Peff

Emily Shaffer May 13, 2024, 4:21 p.m. UTC | #10

On Fri, May 10, 2024 at 12:41 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, May 10, 2024 at 10:22:43AM -0700, Emily Shaffer wrote:
>
> > From: Emily Shaffer <nasamuffin@google.com>
> >
> > We already use trace2 to find out about unexpected pipe breakages, which
> > is nice for detecting bugs or system problems, by adding a handler for
> > SIGPIPE which simply writes a trace2 line. However, there are a handful
> > of other common signals which we might want to snoop on:
> >
> >  - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
> >    frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
> >  - SIGHUP, when the network closes unexpectedly (indicating there may be
> >    a problem to solve)
> >
> > There are lots more signals which we might find useful later, but at
> > least let's teach trace2 to report these egregious ones. Conveniently,
> > they're also already covered by the `_common` variants in sigchain.[ch].
>
> I think this would be a useful thing to have, but having looked at the
> trace2 signal code, this is almost certain to cause racy deadlocks.
>
> The exact details depend on the specific trace2 target backend, but
> looking at the various fn_signal() methods, they rely on allocations via
> strbufs. This is a problem in signal handlers because we can get a
> signal at any time, including when other code is inside malloc() holding
> a lock. And then further calls to malloc() will block forever on that
> lock.
>
> We should be able to do a quick experiment. Try this snippet, which
> repeatedly kills "git log -p" (which is likely to be allocating memory)
> and waits for it to exit. Eventually each invocation will stall on a
> deadlock:
>
> -- >8 --
> doit() {
>         me=$1
>         i=0
>         while true; do
>                 GIT_TRACE2=1 ./git log -p >/dev/null 2>&1 &
>                 sleep 0.1
>                 kill $!
>                 wait $! 2>/dev/null
>                 i=$((i+1))
>                 echo $me:$i
>         done
> }
>
> for i in $(seq 1 64); do
>         doit $i &
> done
> -- >8 --
>
> I didn't have the patience to wait for all of them to stall, but if you
> let it run for a bit and check "ps", you'll see some git processes which
> are hanging. Stracing shows them stuck on a lock, like:
>
>   $ strace -p 1838693
>   strace: Process 1838693 attached
>   futex(0x7facf02df3e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1838693 detached
>    <detached ...>
>
> This problem existed before your patch. I imagine it was much less
> likely (or perhaps even impossible) with SIGPIPE though, because we'd
> see that signal only when in a write() syscall, which implies we're not
> in malloc(). Whereas we can get SIGTERM, etc, any time.
>
> Obviously the script above is meant to exacerbate the situation, and
> most runs would be fine. But over the course of normal use across many
> users and many runs, I think we would see this in practice. I think your
> test won't because it triggers the signal only from raise().
>
> So I think before doing this, we'd need to clean up the trace2 signal
> code to avoid any allocations.

I started to look into doing this, and it's actually really tricky.
I've got a sample diff here
(https://github.com/git/git/commit/bf8a5084ede2b9c476e0cf90b7f198c52573fba7);
I'll need to do it for the other two trace formats as well. But, the
entire trace2 library relies heavily on strbuf, which doesn't have a
stack-allocated form. I'm also not sure how we can guarantee the
no-alloc-ness of these - maybe there's some flag we can give to one of
the analyzers or something? - so I'm worried about backsliding in the
future.

Anyway, I won't have time to work on these again until the end of next
week. If this looks like a reasonable direction I'll pick it up again
then; otherwise, maybe it makes sense for the fn_signal() dispatcher
to just time out if the handler process doesn't terminate in, say, 1s?

 - Emily

>
> -Peff

Jeff King May 16, 2024, 7:11 a.m. UTC | #11

On Mon, May 13, 2024 at 09:21:54AM -0700, Emily Shaffer wrote:

> I started to look into doing this, and it's actually really tricky.
> I've got a sample diff here
> (https://github.com/git/git/commit/bf8a5084ede2b9c476e0cf90b7f198c52573fba7);
> I'll need to do it for the other two trace formats as well. But, the
> entire trace2 library relies heavily on strbuf, which doesn't have a
> stack-allocated form. I'm also not sure how we can guarantee the
> no-alloc-ness of these - maybe there's some flag we can give to one of
> the analyzers or something? - so I'm worried about backsliding in the
> future.

Looking briefly over that patch, a few thoughts:

  - rather than try to make the generic trace2 printing functions handle
    both the alloc and fixed-buffer cases, if the signal handlers only
    need a limited set of functions, it might be easier to just let them
    live in a totally parallel universe. For the simple printing case
    that's not too much extra code, and then the complications are
    limited to the signal-handling functions themselves. It's a bit more
    tricky with json, but we might be able to get away with just
    hand-forming it into a buffer, given the relative simplicity of it.

    In some cases you might need to precompute and stash buffers ahead
    of time that could be used by the signal handler (e.g., the whole
    tr2_sid thing).

  - the opposite approach might be: stop using any allocating functions
    in the trace2 code. There's a certain simplicity there, even for
    non-signal functions, that we know we're just touching a few
    fixed-size buffers, and you can never create a weird DoS by tweaking
    the tracing code. But it would mean rewriting a lot of it (including
    json formatting stuff) without many of our usual strbuf niceties.

    This is more or less the approach we take with error(), die(), etc,
    which are built on vreportf() and its fixed buffer.

  - you probably don't want to use xsnprintf(). That function is for
    cases where it's a bug to truncate, and we're just asserting that
    everything fit as expected. For your purposes, you probably want
    regular snprintf(). Again, see vreportf().

I don't think there's an easy static analysis solution here. It's more
than just allocation, too. Anything that holds a lock is a potential
problem (e.g., stdio streams), and likewise anything that looks at
global state that might be in the middle of being mutated.

So overall it is a pretty thorny problem, and for the most part we've
just tried to keep what we do inside signal handlers to a minimum
(usually cleanup, but even there we have to be careful not to do things
like build up allocated paths for recursive removal).

> Anyway, I won't have time to work on these again until the end of next
> week. If this looks like a reasonable direction I'll pick it up again
> then; otherwise, maybe it makes sense for the fn_signal() dispatcher
> to just time out if the handler process doesn't terminate in, say, 1s?

The timeout would help with locks, but not other weird logic bugs you
can get into. Fundamentally you really want to do as little as possible
from a signal handler.

-Peff

Junio C Hamano May 16, 2024, 4:32 p.m. UTC | #12

Jeff King <peff@peff.net> writes:

>   - the opposite approach might be: stop using any allocating functions
>     in the trace2 code. There's a certain simplicity there, even for
>     non-signal functions, that we know we're just touching a few
>     fixed-size buffers, and you can never create a weird DoS by tweaking
>     the tracing code. But it would mean rewriting a lot of it (including
>     json formatting stuff) without many of our usual strbuf niceties.
>
>     This is more or less the approach we take with error(), die(), etc,
>     which are built on vreportf() and its fixed buffer.

Would another approach be to add various trace2 functions that use
strbuf() allocation a way to tell if they are called from a signal
handing codepath, and punt (by doing nothing if needed, but
hopefully we have enough slop in the buffer to say "hey we got
interrupted so no more detailed report for you, sorry") if that is
the case?

> So overall it is a pretty thorny problem, and for the most part we've
> just tried to keep what we do inside signal handlers to a minimum
> (usually cleanup, but even there we have to be careful not to do things
> like build up allocated paths for recursive removal).

Yes, I agree that it is the right approach to do very little in a
signal handler.

trace2: intercept all common signals

Commit Message

Comments

Patch