Message ID | RFC-patch-1.3-78431bdc8f0-20220525T234908Z-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | trace2: log "signal" end events if we invoke BUG() | expand |
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > Change the BUG() function invoked be "test-tool" to be the "real" one, > instead of one that avoids producing core files. In > a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02) > to test the (then recently added) BUG() function we faked up the > abort() in favor of an exit with code 99. > > However, in doing so we've been fooling ourselves when it comes to > what trace2 events we log. The events tested for in > 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the > real ones, but those that we emit only from the "test-tool". I can fully agree with the above reasoning, i.e. let's test what we do use in production, instead of something nobody uses for real, if we were adding a test for BUG() in vacuum, but why did we have to "fake" it in the first place? > Let's just stop faking it, and call abort(). As a86303cb5d5 notes this > will produce core files on some OS's, but as the default behavior for > that is to dump them in the current directory they'll be placed in the > trash directory that we'll shortly me "rm -rf"-ing. Are we sure that the reason no longer applies? How do we know? We would want to explain that to future developers in the proposed log message, I would think. > + elif test_match_signal 6 $exit_code && list_contains "$_test_ok" sigabrt > + then > + return 0 > elif test_match_signal 13 $exit_code && list_contains "$_test_ok" sigpipe Not a new problem, but these numberings are probably not very portable. I am willing to take this as-is and let people on minority platforms complain ;-)
Junio C Hamano <gitster@pobox.com> writes: >> However, in doing so we've been fooling ourselves when it comes to >> what trace2 events we log. The events tested for in >> 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the >> real ones, but those that we emit only from the "test-tool". > > I can fully agree with the above reasoning, i.e. let's test what we > do use in production, instead of something nobody uses for real, if > we were adding a test for BUG() in vacuum, but why did we have to > "fake" it in the first place? > ... > Are we sure that the reason no longer applies? How do we know? We > would want to explain that to future developers in the proposed log > message, I would think. We can flip it the other way around. I do not think I ever saw anybody asked anybody on this list who got a BUG() message to use the coredump to do something useful. Don't modern distros ship with "ulimit -c 0" these days? It might be possible that a better direction is to introduce GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that chooses between abort() and exit(99), or something like that, and then we switch to use the latter by default over time?
On Fri, Jun 03 2022, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> Change the BUG() function invoked be "test-tool" to be the "real" one, >> instead of one that avoids producing core files. In >> a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02) >> to test the (then recently added) BUG() function we faked up the >> abort() in favor of an exit with code 99. >> >> However, in doing so we've been fooling ourselves when it comes to >> what trace2 events we log. The events tested for in >> 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the >> real ones, but those that we emit only from the "test-tool". > > I can fully agree with the above reasoning, i.e. let's test what we > do use in production, instead of something nobody uses for real, if > we were adding a test for BUG() in vacuum, but why did we have to > "fake" it in the first place? Per a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02) I think it was to get rid of coredump clutter outside of trash directories (which would require some non-standard config), which doesn't seem worth it. >> Let's just stop faking it, and call abort(). As a86303cb5d5 notes this >> will produce core files on some OS's, but as the default behavior for >> that is to dump them in the current directory they'll be placed in the >> trash directory that we'll shortly me "rm -rf"-ing. > > Are we sure that the reason no longer applies? How do we know? We > would want to explain that to future developers in the proposed log > message, I would think. I was trying to get the above across, i.e. that the reasoning still applies, but that the fakery isn't worth it. >> + elif test_match_signal 6 $exit_code && list_contains "$_test_ok" sigabrt >> + then >> + return 0 >> elif test_match_signal 13 $exit_code && list_contains "$_test_ok" sigpipe > > Not a new problem, but these numberings are probably not very > portable. I am willing to take this as-is and let people on > minority platforms complain ;-) I'll test it better for portability before a non-RFC.
On Fri, Jun 03 2022, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > >>> However, in doing so we've been fooling ourselves when it comes to >>> what trace2 events we log. The events tested for in >>> 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the >>> real ones, but those that we emit only from the "test-tool". >> >> I can fully agree with the above reasoning, i.e. let's test what we >> do use in production, instead of something nobody uses for real, if >> we were adding a test for BUG() in vacuum, but why did we have to >> "fake" it in the first place? >> ... >> Are we sure that the reason no longer applies? How do we know? We >> would want to explain that to future developers in the proposed log >> message, I would think. > > We can flip it the other way around. > > I do not think I ever saw anybody asked anybody on this list who got > a BUG() message to use the coredump to do something useful. Don't > modern distros ship with "ulimit -c 0" these days? I think that part of it was just a side-effect of SIGABRT. > It might be possible that a better direction is to introduce > GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that > chooses between abort() and exit(99), or something like that, and > then we switch to use the latter by default over time? I think the reason for abort() was what's covered in raise(3), i.e. it's a one-stop-shop to getting "stop it" behavior both under threading and non-threading, which as e.g. exit(3) discusses wouldn't be thread-safe with it. But perhaps that was all premature worrying, we're mostly running non-threaded, and to the extent that we ever BUG() running into exit(3) threading issues is probably the least of our worries. So perhaps we should drop abort() entirely, I don't know. These proposed patches tried not to do that, but just to log things when we did so.
On Fri, Jun 03, 2022 at 02:05:49PM -0700, Junio C Hamano wrote: > > Are we sure that the reason no longer applies? How do we know? We > > would want to explain that to future developers in the proposed log > > message, I would think. > > We can flip it the other way around. > > I do not think I ever saw anybody asked anybody on this list who got > a BUG() message to use the coredump to do something useful. Don't > modern distros ship with "ulimit -c 0" these days? Certainly I have found the coredumps useful, though I would expect most Git developers to be able to run under a debugger and stop at BUG() anyway. So we could probably live without that, but... > It might be possible that a better direction is to introduce > GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that > chooses between abort() and exit(99), or something like that, and > then we switch to use the latter by default over time? It really should continue to die with a signal (or an exit code that pretends to be one) to continue triggering even under test_must_fail(). IMHO the whole "let's trigger BUG() intentionally via test-tool" stuff in t1406 is misguided. It's literally introducing broken code that is not run in the normal Git binary in order to make sure that it's broken. If that were the only spot, I'd suggest just getting rid of it entirely. But the code in t0210 that checks the handling of trace2 and BUG() is probably worth keeping. I do agree that an environment variable would be a better selector than "this code is linked against test-tool". I thought so even back in: https://lore.kernel.org/git/20180507090109.GA367@sigill.intra.peff.net/ :) That message also covers the flip-side case discussed earlier in this thread: why calling abort() unconditionally in the test suite can be a pain. -Peff
Jeff King <peff@peff.net> writes: > I do agree that an environment variable would be a better selector than > "this code is linked against test-tool". I thought so even back in: > > https://lore.kernel.org/git/20180507090109.GA367@sigill.intra.peff.net/ > > :) That message also covers the flip-side case discussed earlier in this > thread: why calling abort() unconditionally in the test suite can be a > pain. Nice. Thanks.
On Wed, Jun 08 2022, Jeff King wrote: > On Fri, Jun 03, 2022 at 02:05:49PM -0700, Junio C Hamano wrote: > >> > Are we sure that the reason no longer applies? How do we know? We >> > would want to explain that to future developers in the proposed log >> > message, I would think. >> >> We can flip it the other way around. >> >> I do not think I ever saw anybody asked anybody on this list who got >> a BUG() message to use the coredump to do something useful. Don't >> modern distros ship with "ulimit -c 0" these days? > > Certainly I have found the coredumps useful, though I would expect most > Git developers to be able to run under a debugger and stop at BUG() > anyway. So we could probably live without that, but... > >> It might be possible that a better direction is to introduce >> GIT_ABORT_ON_BUG environment or core.abortOnBUG configuration that >> chooses between abort() and exit(99), or something like that, and >> then we switch to use the latter by default over time? > > It really should continue to die with a signal (or an exit code that > pretends to be one) to continue triggering even under test_must_fail(). > > IMHO the whole "let's trigger BUG() intentionally via test-tool" stuff > in t1406 is misguided. It's literally introducing broken code that is > not run in the normal Git binary in order to make sure that it's broken. I haven't looked at that ref code in particular, but in general adding coverage for the case we actually expect to defend against with BUG() via a test-tool is something I think we could use in more cases. E.g. parse-options.c will BUG() out on various bad options structs, there it makes sense to feed those bad structs in to make sure our assertion code is doing what we still expect, but we don't have those tests. > If that were the only spot, I'd suggest just getting rid of it entirely. > But the code in t0210 that checks the handling of trace2 and BUG() is > probably worth keeping. Yes, we definitely want to test what *actually* happens as far as > I do agree that an environment variable would be a better selector than > "this code is linked against test-tool". I thought so even back in: > > https://lore.kernel.org/git/20180507090109.GA367@sigill.intra.peff.net/ > > :) That message also covers the flip-side case discussed earlier in this > thread: why calling abort() unconditionally in the test suite can be a > pain. I didn't know about that Jenkins case, but in any case I think we can probably stop worrying about this given that we haven't had complaints about ac14de13b22 (t4058: explore duplicate tree entry handling in a bit more detail, 2020-12-11) (although I've noticed in in "dmesg" before). I.e. since v2.31.0 running the test suite has produced at least 2 segfaults per run: $ (sudo dmesg -c; ./t4058-diff-duplicates.sh) >/dev/null; sudo dmesg | grep segfault [17687265.247252] git[11336]: segfault at 40 ip 0000000000700916 sp 00007ffc10d5dda0 error 4 in git[405000+33d000] [17687265.297027] git[11397]: segfault at 40 ip 0000000000700916 sp 00007ffd7dfd5310 error 4 in git[405000+33d000] Perhaps those tests aren't valuable, but I think that shows that the GIT_BUG_ABORT approach you suggested should probably be tweaked to instead be some test prereq like SEGFAULT_AND_ABORT_OK. On the other hand those segfaults in t4058 should probably be converted to a BUG()...
On Thu, Jun 09, 2022 at 10:09:25AM +0200, Ævar Arnfjörð Bjarmason wrote: > > IMHO the whole "let's trigger BUG() intentionally via test-tool" stuff > > in t1406 is misguided. It's literally introducing broken code that is > > not run in the normal Git binary in order to make sure that it's broken. > > I haven't looked at that ref code in particular, but in general adding > coverage for the case we actually expect to defend against with BUG() > via a test-tool is something I think we could use in more cases. > > E.g. parse-options.c will BUG() out on various bad options structs, > there it makes sense to feed those bad structs in to make sure our > assertion code is doing what we still expect, but we don't have those > tests. These seem like low-value tests to me. Once you've identified a case in parse-options.c that should BUG(), it's not very many lines of code to detect it and call the function. But to then create a test case, including a harness that triggers the internal code in such a way that is designed never to happen in the production code, seems much more error prone to me. I.e., it is very unlikely to me that such a test case will find an actual problem, and it carries a lot of scaffolding cost. Meanwhile the much more likely problem is that a BUG() can be triggered, but we simply don't have test coverage of the right cases. Once you've identified the case, the problem can be demonstrated (and fixed) in the production code. > I didn't know about that Jenkins case, but in any case I think we can > probably stop worrying about this given that we haven't had complaints > about ac14de13b22 (t4058: explore duplicate tree entry handling in a bit > more detail, 2020-12-11) (although I've noticed in in "dmesg" before). I didn't notice those ones. It's probable that the CI instance I was using at GitHub changed under the hood. And GitHub is more likely than most people to run git's test suite in a CI environment. :) So yeah, it may not matter that much. I do still think dumping cores around the filesystem (remembering that they do not always go into the trash directory) and polluting the kernel log are a bit unfriendly. If we can avoid it easily, it might be worth doing. > On the other hand those segfaults in t4058 should probably be converted > to a BUG()... Agreed. Those stack-exhaustion ones (which were what originally alerted me to the Jenkins issue) are hard to catch. -Peff
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index 0424f7adf5d..99a10f294f5 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -110,7 +110,6 @@ int cmd_main(int argc, const char **argv) OPT_END() }; - BUG_exit_code = 99; argc = parse_options(argc, argv, NULL, options, test_tool_usage, PARSE_OPT_STOP_AT_NON_OPTION | PARSE_OPT_KEEP_ARGV0); diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh index 37c359bd5a2..910a6af8058 100755 --- a/t/t0210-trace2-normal.sh +++ b/t/t0210-trace2-normal.sh @@ -155,15 +155,13 @@ test_expect_success 'normal stream, error event' ' test_expect_success 'BUG messages are written to trace2' ' test_when_finished "rm trace.normal actual expect" && - test_must_fail env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 007bug && + test_must_BUG env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 007bug && perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual && cat >expect <<-EOF && version $V start _EXE_ trace2 007bug cmd_name trace2 (trace2) error the bug message - exit elapsed:_TIME_ code:99 - atexit elapsed:_TIME_ code:99 EOF test_cmp expect actual ' diff --git a/t/t1406-submodule-ref-store.sh b/t/t1406-submodule-ref-store.sh index e6a7f7334b6..6f9a16b7355 100755 --- a/t/t1406-submodule-ref-store.sh +++ b/t/t1406-submodule-ref-store.sh @@ -25,15 +25,15 @@ test_expect_success 'pack_refs() not allowed' ' ' test_expect_success 'create_symref() not allowed' ' - test_must_fail $RUN create-symref FOO refs/heads/main nothing + test_must_BUG $RUN create-symref FOO refs/heads/main nothing ' test_expect_success 'delete_refs() not allowed' ' - test_must_fail $RUN delete-refs 0 nothing FOO refs/tags/new-tag + test_must_BUG $RUN delete-refs 0 nothing FOO refs/tags/new-tag ' test_expect_success 'rename_refs() not allowed' ' - test_must_fail $RUN rename-ref refs/heads/main refs/heads/new-main + test_must_BUG $RUN rename-ref refs/heads/main refs/heads/new-main ' test_expect_success 'for_each_ref(refs/heads/)' ' @@ -89,11 +89,11 @@ test_expect_success 'reflog_exists(HEAD)' ' ' test_expect_success 'delete_reflog() not allowed' ' - test_must_fail $RUN delete-reflog HEAD + test_must_BUG $RUN delete-reflog HEAD ' test_expect_success 'create-reflog() not allowed' ' - test_must_fail $RUN create-reflog HEAD + test_must_BUG $RUN create-reflog HEAD ' test_done diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 93c03380d44..61f1f03c099 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -1167,6 +1167,9 @@ test_must_fail () { then echo >&4 "test_must_fail: command succeeded: $*" return 1 + elif test_match_signal 6 $exit_code && list_contains "$_test_ok" sigabrt + then + return 0 elif test_match_signal 13 $exit_code && list_contains "$_test_ok" sigpipe then return 0 @@ -1186,6 +1189,16 @@ test_must_fail () { return 0 } 7>&2 2>&4 +# The test_must_BUG() function is a wrapper for test_must_fail which +# checks that we BUG() out. +# +# Currently this checks that we exit with abort(), but in the future +# we might e.g. check the trace2 logging itself, or otherwise make +# sure that we used BUG() in particular. +test_must_BUG () { + test_must_fail ok=sigabrt "$@" +} 7>&2 2>&4 + # Similar to test_must_fail, but tolerates success, too. This is # meant to be used in contexts like: # diff --git a/usage.c b/usage.c index b738dd178b3..bf868b5be1f 100644 --- a/usage.c +++ b/usage.c @@ -287,9 +287,6 @@ void warning(const char *warn, ...) va_end(params); } -/* Only set this, ever, from t/helper/, when verifying that bugs are caught. */ -int BUG_exit_code; - static NORETURN void BUG_vfl(const char *file, int line, const char *fmt, va_list params) { char prefix[256]; @@ -309,8 +306,6 @@ static NORETURN void BUG_vfl(const char *file, int line, const char *fmt, va_lis trace2_cmd_error_va(fmt, params_copy); - if (BUG_exit_code) - exit(BUG_exit_code); abort(); }
Change the BUG() function invoked be "test-tool" to be the "real" one, instead of one that avoids producing core files. In a86303cb5d5 (test-tool: help verifying BUG() code paths, 2018-05-02) to test the (then recently added) BUG() function we faked up the abort() in favor of an exit with code 99. However, in doing so we've been fooling ourselves when it comes to what trace2 events we log. The events tested for in 0a9dde4a04c (usage: trace2 BUG() invocations, 2021-02-05) are not the real ones, but those that we emit only from the "test-tool". Let's just stop faking it, and call abort(). As a86303cb5d5 notes this will produce core files on some OS's, but as the default behavior for that is to dump them in the current directory they'll be placed in the trash directory that we'll shortly me "rm -rf"-ing. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- There's a CI problem with the test_must_BUG wrapper here on Windows: https://github.com/avar/git/runs/6602059183?check_suite_focus=true#step:6:1361 t/helper/test-tool.c | 1 - t/t0210-trace2-normal.sh | 4 +--- t/t1406-submodule-ref-store.sh | 10 +++++----- t/test-lib-functions.sh | 13 +++++++++++++ usage.c | 5 ----- 5 files changed, 19 insertions(+), 14 deletions(-)