diff mbox series

[RFC,1/3] test-tool: don't fake up BUG() exits as code 99

Message ID RFC-patch-1.3-78431bdc8f0-20220525T234908Z-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series trace2: log "signal" end events if we invoke BUG() | expand

Commit Message

Ævar Arnfjörð Bjarmason May 26, 2022, 12:30 a.m. UTC
Change the BUG() function invoked be "test-tool" to be the "real" one,
instead of one that avoids producing core files. In
a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02)
to test the (then recently added) BUG() function we faked up the
abort() in favor of an exit with code 99.

However, in doing so we've been fooling ourselves when it comes to
what trace2 events we log. The events tested for in
0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the
real ones, but those that we emit only from the "test-tool".

Let's just stop faking it, and call abort(). As a86303cb5d5 notes this
will produce core files on some OS's, but as the default behavior for
that is to dump them in the current directory they'll be placed in the
trash directory that we'll shortly me "rm -rf"-ing.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

There's a CI problem with the test_must_BUG wrapper here on Windows:
https://github.com/avar/git/runs/6602059183?check_suite_focus=true#step:6:1361

 t/helper/test-tool.c           |  1 -
 t/t0210-trace2-normal.sh       |  4 +---
 t/t1406-submodule-ref-store.sh | 10 +++++-----
 t/test-lib-functions.sh        | 13 +++++++++++++
 usage.c                        |  5 -----
 5 files changed, 19 insertions(+), 14 deletions(-)

Comments

Junio C Hamano June 3, 2022, 7:25 p.m. UTC | #1
Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Change the BUG() function invoked be "test-tool" to be the "real" one,
> instead of one that avoids producing core files. In
> a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02)
> to test the (then recently added) BUG() function we faked up the
> abort() in favor of an exit with code 99.
>
> However, in doing so we've been fooling ourselves when it comes to
> what trace2 events we log. The events tested for in
> 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the
> real ones, but those that we emit only from the "test-tool".

I can fully agree with the above reasoning, i.e. let's test what we
do use in production, instead of something nobody uses for real, if
we were adding a test for BUG() in vacuum, but why did we have to
"fake" it in the first place?

> Let's just stop faking it, and call abort(). As a86303cb5d5 notes this
> will produce core files on some OS's, but as the default behavior for
> that is to dump them in the current directory they'll be placed in the
> trash directory that we'll shortly me "rm -rf"-ing.

Are we sure that the reason no longer applies?  How do we know?  We
would want to explain that to future developers in the proposed log
message, I would think.

> +	elif test_match_signal 6 $exit_code && list_contains "$_test_ok" sigabrt
> +	then
> +		return 0
>  	elif test_match_signal 13 $exit_code && list_contains "$_test_ok" sigpipe

Not a new problem, but these numberings are probably not very
portable.  I am willing to take this as-is and let people on
minority platforms complain ;-)
Junio C Hamano June 3, 2022, 9:05 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

>> However, in doing so we've been fooling ourselves when it comes to
>> what trace2 events we log. The events tested for in
>> 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the
>> real ones, but those that we emit only from the "test-tool".
>
> I can fully agree with the above reasoning, i.e. let's test what we
> do use in production, instead of something nobody uses for real, if
> we were adding a test for BUG() in vacuum, but why did we have to
> "fake" it in the first place?
> ...
> Are we sure that the reason no longer applies?  How do we know?  We
> would want to explain that to future developers in the proposed log
> message, I would think.

We can flip it the other way around.  

I do not think I ever saw anybody asked anybody on this list who got
a BUG() message to use the coredump to do something useful.  Don't
modern distros ship with "ulimit -c 0" these days?

It might be possible that a better direction is to introduce
GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that
chooses between abort() and exit(99), or something like that, and
then we switch to use the latter by default over time?
Ævar Arnfjörð Bjarmason June 3, 2022, 11:03 p.m. UTC | #3
On Fri, Jun 03 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> Change the BUG() function invoked be "test-tool" to be the "real" one,
>> instead of one that avoids producing core files. In
>> a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02)
>> to test the (then recently added) BUG() function we faked up the
>> abort() in favor of an exit with code 99.
>>
>> However, in doing so we've been fooling ourselves when it comes to
>> what trace2 events we log. The events tested for in
>> 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the
>> real ones, but those that we emit only from the "test-tool".
>
> I can fully agree with the above reasoning, i.e. let's test what we
> do use in production, instead of something nobody uses for real, if
> we were adding a test for BUG() in vacuum, but why did we have to
> "fake" it in the first place?

Per a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02)
I think it was to get rid of coredump clutter outside of trash
directories (which would require some non-standard config), which
doesn't seem worth it.

>> Let's just stop faking it, and call abort(). As a86303cb5d5 notes this
>> will produce core files on some OS's, but as the default behavior for
>> that is to dump them in the current directory they'll be placed in the
>> trash directory that we'll shortly me "rm -rf"-ing.
>
> Are we sure that the reason no longer applies?  How do we know?  We
> would want to explain that to future developers in the proposed log
> message, I would think.

I was trying to get the above across, i.e. that the reasoning still
applies, but that the fakery isn't worth it.

>> +	elif test_match_signal 6 $exit_code && list_contains "$_test_ok" sigabrt
>> +	then
>> +		return 0
>>  	elif test_match_signal 13 $exit_code && list_contains "$_test_ok" sigpipe
>
> Not a new problem, but these numberings are probably not very
> portable.  I am willing to take this as-is and let people on
> minority platforms complain ;-)

I'll test it better for portability before a non-RFC.
Ævar Arnfjörð Bjarmason June 3, 2022, 11:05 p.m. UTC | #4
On Fri, Jun 03 2022, Junio C Hamano wrote:

> Junio C Hamano <gitster@pobox.com> writes:
>
>>> However, in doing so we've been fooling ourselves when it comes to
>>> what trace2 events we log. The events tested for in
>>> 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the
>>> real ones, but those that we emit only from the "test-tool".
>>
>> I can fully agree with the above reasoning, i.e. let's test what we
>> do use in production, instead of something nobody uses for real, if
>> we were adding a test for BUG() in vacuum, but why did we have to
>> "fake" it in the first place?
>> ...
>> Are we sure that the reason no longer applies?  How do we know?  We
>> would want to explain that to future developers in the proposed log
>> message, I would think.
>
> We can flip it the other way around.  
>
> I do not think I ever saw anybody asked anybody on this list who got
> a BUG() message to use the coredump to do something useful.  Don't
> modern distros ship with "ulimit -c 0" these days?

I think that part of it was just a side-effect of SIGABRT.

> It might be possible that a better direction is to introduce
> GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that
> chooses between abort() and exit(99), or something like that, and
> then we switch to use the latter by default over time?

I think the reason for abort() was what's covered in raise(3), i.e. it's
a one-stop-shop to getting "stop it" behavior both under threading and
non-threading, which as e.g. exit(3) discusses wouldn't be thread-safe
with it.

But perhaps that was all premature worrying, we're mostly running
non-threaded, and to the extent that we ever BUG() running into exit(3)
threading issues is probably the least of our worries.

So perhaps we should drop abort() entirely, I don't know. These proposed
patches tried not to do that, but just to log things when we did so.
Jeff King June 8, 2022, 7:17 p.m. UTC | #5
On Fri, Jun 03, 2022 at 02:05:49PM -0700, Junio C Hamano wrote:

> > Are we sure that the reason no longer applies?  How do we know?  We
> > would want to explain that to future developers in the proposed log
> > message, I would think.
> 
> We can flip it the other way around.  
> 
> I do not think I ever saw anybody asked anybody on this list who got
> a BUG() message to use the coredump to do something useful.  Don't
> modern distros ship with "ulimit -c 0" these days?

Certainly I have found the coredumps useful, though I would expect most
Git developers to be able to run under a debugger and stop at BUG()
anyway. So we could probably live without that, but...

> It might be possible that a better direction is to introduce
> GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that
> chooses between abort() and exit(99), or something like that, and
> then we switch to use the latter by default over time?

It really should continue to die with a signal (or an exit code that
pretends to be one) to continue triggering even under test_must_fail().

IMHO the whole "let's trigger BUG() intentionally via test-tool" stuff
in t1406 is misguided. It's literally introducing broken code that is
not run in the normal Git binary in order to make sure that it's broken.
If that were the only spot, I'd suggest just getting rid of it entirely.
But the code in t0210 that checks the handling of trace2 and BUG() is
probably worth keeping.

I do agree that an environment variable would be a better selector than
"this code is linked against test-tool". I thought so even back in:

  https://lore.kernel.org/git/20180507090109.GA367@sigill.intra.peff.net/

:) That message also covers the flip-side case discussed earlier in this
thread: why calling abort() unconditionally in the test suite can be a
pain.

-Peff
Junio C Hamano June 8, 2022, 9:03 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

> I do agree that an environment variable would be a better selector than
> "this code is linked against test-tool". I thought so even back in:
>
>   https://lore.kernel.org/git/20180507090109.GA367@sigill.intra.peff.net/
>
> :) That message also covers the flip-side case discussed earlier in this
> thread: why calling abort() unconditionally in the test suite can be a
> pain.

Nice.  Thanks.
Ævar Arnfjörð Bjarmason June 9, 2022, 8:09 a.m. UTC | #7
On Wed, Jun 08 2022, Jeff King wrote:

> On Fri, Jun 03, 2022 at 02:05:49PM -0700, Junio C Hamano wrote:
>
>> > Are we sure that the reason no longer applies?  How do we know?  We
>> > would want to explain that to future developers in the proposed log
>> > message, I would think.
>> 
>> We can flip it the other way around.  
>> 
>> I do not think I ever saw anybody asked anybody on this list who got
>> a BUG() message to use the coredump to do something useful.  Don't
>> modern distros ship with "ulimit -c 0" these days?
>
> Certainly I have found the coredumps useful, though I would expect most
> Git developers to be able to run under a debugger and stop at BUG()
> anyway. So we could probably live without that, but...
>
>> It might be possible that a better direction is to introduce
>> GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that
>> chooses between abort() and exit(99), or something like that, and
>> then we switch to use the latter by default over time?
>
> It really should continue to die with a signal (or an exit code that
> pretends to be one) to continue triggering even under test_must_fail().
>
> IMHO the whole "let's trigger BUG() intentionally via test-tool" stuff
> in t1406 is misguided. It's literally introducing broken code that is
> not run in the normal Git binary in order to make sure that it's broken.

I haven't looked at that ref code in particular, but in general adding
coverage for the case we actually expect to defend against with BUG()
via a test-tool is something I think we could use in more cases.

E.g. parse-options.c will BUG() out on various bad options structs,
there it makes sense to feed those bad structs in to make sure our
assertion code is doing what we still expect, but we don't have those
tests.

> If that were the only spot, I'd suggest just getting rid of it entirely.
> But the code in t0210 that checks the handling of trace2 and BUG() is
> probably worth keeping.

Yes, we definitely want to test what *actually* happens as far as 

> I do agree that an environment variable would be a better selector than
> "this code is linked against test-tool". I thought so even back in:
>
>   https://lore.kernel.org/git/20180507090109.GA367@sigill.intra.peff.net/
>
> :) That message also covers the flip-side case discussed earlier in this
> thread: why calling abort() unconditionally in the test suite can be a
> pain.

I didn't know about that Jenkins case, but in any case I think we can
probably stop worrying about this given that we haven't had complaints
about ac14de13b22 (t4058: explore duplicate tree entry handling in a bit
more detail, 2020-12-11) (although I've noticed in in "dmesg" before).

I.e. since v2.31.0 running the test suite has produced at least 2
segfaults per run:

    $ (sudo dmesg -c; ./t4058-diff-duplicates.sh) >/dev/null; sudo dmesg | grep segfault
    [17687265.247252] git[11336]: segfault at 40 ip 0000000000700916 sp 00007ffc10d5dda0 error 4 in git[405000+33d000]
    [17687265.297027] git[11397]: segfault at 40 ip 0000000000700916 sp 00007ffd7dfd5310 error 4 in git[405000+33d000]

Perhaps those tests aren't valuable, but I think that shows that the
GIT_BUG_ABORT approach you suggested should probably be tweaked to
instead be some test prereq like SEGFAULT_AND_ABORT_OK.

On the other hand those segfaults in t4058 should probably be converted
to a BUG()...
Jeff King June 9, 2022, 3:23 p.m. UTC | #8
On Thu, Jun 09, 2022 at 10:09:25AM +0200, Ævar Arnfjörð Bjarmason wrote:

> > IMHO the whole "let's trigger BUG() intentionally via test-tool" stuff
> > in t1406 is misguided. It's literally introducing broken code that is
> > not run in the normal Git binary in order to make sure that it's broken.
> 
> I haven't looked at that ref code in particular, but in general adding
> coverage for the case we actually expect to defend against with BUG()
> via a test-tool is something I think we could use in more cases.
> 
> E.g. parse-options.c will BUG() out on various bad options structs,
> there it makes sense to feed those bad structs in to make sure our
> assertion code is doing what we still expect, but we don't have those
> tests.

These seem like low-value tests to me. Once you've identified a case in
parse-options.c that should BUG(), it's not very many lines of code to
detect it and call the function. But to then create a test case,
including a harness that triggers the internal code in such a way that
is designed never to happen in the production code, seems much more
error prone to me.

I.e., it is very unlikely to me that such a test case will find an
actual problem, and it carries a lot of scaffolding cost.

Meanwhile the much more likely problem is that a BUG() can be triggered,
but we simply don't have test coverage of the right cases. Once you've
identified the case, the problem can be demonstrated (and fixed) in the
production code.

> I didn't know about that Jenkins case, but in any case I think we can
> probably stop worrying about this given that we haven't had complaints
> about ac14de13b22 (t4058: explore duplicate tree entry handling in a bit
> more detail, 2020-12-11) (although I've noticed in in "dmesg" before).

I didn't notice those ones. It's probable that the CI instance I was
using at GitHub changed under the hood. And GitHub is more likely than
most people to run git's test suite in a CI environment. :) So yeah, it
may not matter that much. I do still think dumping cores around the
filesystem (remembering that they do not always go into the trash
directory) and polluting the kernel log are a bit unfriendly. If we can
avoid it easily, it might be worth doing.

> On the other hand those segfaults in t4058 should probably be converted
> to a BUG()...

Agreed. Those stack-exhaustion ones (which were what originally alerted
me to the Jenkins issue) are hard to catch.

-Peff
diff mbox series

Patch

diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 0424f7adf5d..99a10f294f5 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -110,7 +110,6 @@  int cmd_main(int argc, const char **argv)
 		OPT_END()
 	};
 
-	BUG_exit_code = 99;
 	argc = parse_options(argc, argv, NULL, options, test_tool_usage,
 			     PARSE_OPT_STOP_AT_NON_OPTION |
 			     PARSE_OPT_KEEP_ARGV0);
diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh
index 37c359bd5a2..910a6af8058 100755
--- a/t/t0210-trace2-normal.sh
+++ b/t/t0210-trace2-normal.sh
@@ -155,15 +155,13 @@  test_expect_success 'normal stream, error event' '
 
 test_expect_success 'BUG messages are written to trace2' '
 	test_when_finished "rm trace.normal actual expect" &&
-	test_must_fail env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 007bug &&
+	test_must_BUG env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 007bug &&
 	perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual &&
 	cat >expect <<-EOF &&
 		version $V
 		start _EXE_ trace2 007bug
 		cmd_name trace2 (trace2)
 		error the bug message
-		exit elapsed:_TIME_ code:99
-		atexit elapsed:_TIME_ code:99
 	EOF
 	test_cmp expect actual
 '
diff --git a/t/t1406-submodule-ref-store.sh b/t/t1406-submodule-ref-store.sh
index e6a7f7334b6..6f9a16b7355 100755
--- a/t/t1406-submodule-ref-store.sh
+++ b/t/t1406-submodule-ref-store.sh
@@ -25,15 +25,15 @@  test_expect_success 'pack_refs() not allowed' '
 '
 
 test_expect_success 'create_symref() not allowed' '
-	test_must_fail $RUN create-symref FOO refs/heads/main nothing
+	test_must_BUG $RUN create-symref FOO refs/heads/main nothing
 '
 
 test_expect_success 'delete_refs() not allowed' '
-	test_must_fail $RUN delete-refs 0 nothing FOO refs/tags/new-tag
+	test_must_BUG $RUN delete-refs 0 nothing FOO refs/tags/new-tag
 '
 
 test_expect_success 'rename_refs() not allowed' '
-	test_must_fail $RUN rename-ref refs/heads/main refs/heads/new-main
+	test_must_BUG $RUN rename-ref refs/heads/main refs/heads/new-main
 '
 
 test_expect_success 'for_each_ref(refs/heads/)' '
@@ -89,11 +89,11 @@  test_expect_success 'reflog_exists(HEAD)' '
 '
 
 test_expect_success 'delete_reflog() not allowed' '
-	test_must_fail $RUN delete-reflog HEAD
+	test_must_BUG $RUN delete-reflog HEAD
 '
 
 test_expect_success 'create-reflog() not allowed' '
-	test_must_fail $RUN create-reflog HEAD
+	test_must_BUG $RUN create-reflog HEAD
 '
 
 test_done
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 93c03380d44..61f1f03c099 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1167,6 +1167,9 @@  test_must_fail () {
 	then
 		echo >&4 "test_must_fail: command succeeded: $*"
 		return 1
+	elif test_match_signal 6 $exit_code && list_contains "$_test_ok" sigabrt
+	then
+		return 0
 	elif test_match_signal 13 $exit_code && list_contains "$_test_ok" sigpipe
 	then
 		return 0
@@ -1186,6 +1189,16 @@  test_must_fail () {
 	return 0
 } 7>&2 2>&4
 
+# The test_must_BUG() function is a wrapper for test_must_fail which
+# checks that we BUG() out.
+#
+# Currently this checks that we exit with abort(), but in the future
+# we might e.g. check the trace2 logging itself, or otherwise make
+# sure that we used BUG() in particular.
+test_must_BUG () {
+	test_must_fail ok=sigabrt "$@"
+} 7>&2 2>&4
+
 # Similar to test_must_fail, but tolerates success, too.  This is
 # meant to be used in contexts like:
 #
diff --git a/usage.c b/usage.c
index b738dd178b3..bf868b5be1f 100644
--- a/usage.c
+++ b/usage.c
@@ -287,9 +287,6 @@  void warning(const char *warn, ...)
 	va_end(params);
 }
 
-/* Only set this, ever, from t/helper/, when verifying that bugs are caught. */
-int BUG_exit_code;
-
 static NORETURN void BUG_vfl(const char *file, int line, const char *fmt, va_list params)
 {
 	char prefix[256];
@@ -309,8 +306,6 @@  static NORETURN void BUG_vfl(const char *file, int line, const char *fmt, va_lis
 
 	trace2_cmd_error_va(fmt, params_copy);
 
-	if (BUG_exit_code)
-		exit(BUG_exit_code);
 	abort();
 }