Message ID | patch-1.1-47b236139e6-20220323T204410Z-avarab@gmail.com (mailing list archive) |
---|---|
State | Accepted |
Commit | bbfbcd25b39e9020ce98467cfcf60dfce7e9b484 |
Headers | show |
Series | test-lib: have --immediate emit valid TAP on failure | expand |
Hi Ævar, On Wed, 23 Mar 2022, Ævar Arnfjörð Bjarmason wrote: > Change the "--immediate" option so that it emits valid TAP on > failure. Before this it would omit the required plan at the end, > e.g. under SANITIZE=leak we'd show a "No plan found in TAP output" > error from "prove": > > $ prove t0006-date.sh :: --immediate > t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100) > Failed 1/22 subtests > > Test Summary Report > ------------------- > t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1) > Failed test: 22 > Non-zero exit status: 1 > Parse errors: No plan found in TAP output > Files=1, Tests=22, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.18 cusr 0.06 csys = 0.27 CPU) > Result: FAIL > > Now we'll emit output that doesn't result in TAP parsing failures: > > $ prove t0006-date.sh :: --immediate > t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100) > Failed 1/22 subtests > > Test Summary Report > ------------------- > t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1) > Failed test: 22 > Non-zero exit status: 1 > Files=1, Tests=22, 0 wallclock secs ( 0.02 usr 0.00 sys + 0.19 cusr 0.05 csys = 0.26 CPU) > Result: FAIL The commit message is strong on the what, very strong in giving verbose output that might or might not clarify the intention, and a little weak in the why and the greater context. In the broader story arc, I wonder why we're even bothering with TAP anymore because it seems that the world has moved on to more powerful file formats to represent test output, such as JUnit XML, that can be parsed, rendered and analyzed by powerful special-purpose applications. Apart from taking reviewer time away from such more powerful avenues to make sense of our failing tests, the diff does not really hurt, so I have no objections against integrating this patch. Ciao, Johannes > diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh > index 9dcbf518a78..17a268ccd1b 100755 > --- a/t/t0000-basic.sh > +++ b/t/t0000-basic.sh > @@ -101,6 +101,19 @@ test_expect_success 'subtest: 2/3 tests passing' ' > EOF > ' > > +test_expect_success 'subtest: --immediate' ' > + run_sub_test_lib_test_err partial-pass \ > + --immediate && > + check_sub_test_lib_test_err partial-pass \ > + <<-\EOF_OUT 3<<-EOF_ERR > + > ok 1 - passing test #1 > + > not ok 2 - failing test #2 > + > # false > + > 1..2 > + EOF_OUT > + EOF_ERR > +' > + > test_expect_success 'subtest: a failing TODO test' ' > write_and_run_sub_test_lib_test failing-todo <<-\EOF && > test_expect_success "passing test" "true" > diff --git a/t/test-lib.sh b/t/test-lib.sh > index 515b1af7ed4..4373f7d70b5 100644 > --- a/t/test-lib.sh > +++ b/t/test-lib.sh > @@ -806,7 +806,11 @@ test_failure_ () { > say_color error "not ok $test_count - $1" > shift > printf '%s\n' "$*" | sed -e 's/^/# /' > - test "$immediate" = "" || _error_exit > + if test -n "$immediate" > + then > + say_color error "1..$test_count" > + _error_exit > + fi > } > > test_known_broken_ok_ () { > -- > 2.35.1.1452.ga7cfc89151f > >
On Thu, Mar 24 2022, Johannes Schindelin wrote: > Hi Ævar, > > On Wed, 23 Mar 2022, Ævar Arnfjörð Bjarmason wrote: > >> Change the "--immediate" option so that it emits valid TAP on >> failure. Before this it would omit the required plan at the end, >> e.g. under SANITIZE=leak we'd show a "No plan found in TAP output" >> error from "prove": >> >> $ prove t0006-date.sh :: --immediate >> t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100) >> Failed 1/22 subtests >> >> Test Summary Report >> ------------------- >> t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1) >> Failed test: 22 >> Non-zero exit status: 1 >> Parse errors: No plan found in TAP output >> Files=1, Tests=22, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.18 cusr 0.06 csys = 0.27 CPU) >> Result: FAIL >> >> Now we'll emit output that doesn't result in TAP parsing failures: >> >> $ prove t0006-date.sh :: --immediate >> t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100) >> Failed 1/22 subtests >> >> Test Summary Report >> ------------------- >> t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1) >> Failed test: 22 >> Non-zero exit status: 1 >> Files=1, Tests=22, 0 wallclock secs ( 0.02 usr 0.00 sys + 0.19 cusr 0.05 csys = 0.26 CPU) >> Result: FAIL > > The commit message is strong on the what, very strong in giving verbose > output that might or might not clarify the intention, and a little weak in > the why and the greater context. I thought "so that it emits valid TAP" was sufficiently self-explaining. I.e. we emit this machine-readable format, but in this edge case our output is invalid TAP, now it's valid. Re the verbose output: The last time I submitted patch for tweaking the TAP output I got the feedback (from Junio in https://lore.kernel.org/git/xmqq7dad10oy.fsf@gitster.g/) that a more explicit example was wanted. So I included it this time around. > In the broader story arc, I wonder why we're even bothering with TAP > anymore because it seems that the world has moved on to more powerful file > formats to represent test output, such as JUnit XML, that can be parsed, > rendered and analyzed by powerful special-purpose applications. [I guess that warrants a $subject change] A few things, grouped by heading: == It "can be parsed" First, the "[TAP ]can['t] be parsed" part of that is wrong. As your own CI patch series[1] shows we currently parse the entire TAP output of our test suite on every CI run. Otherwise prove(1) couldn't have come up with this (quoting from your [1]): Test Summary Report ------------------- t1092-sparse-checkout-compatibility.sh (Wstat: 256 Tests: 53 Failed: 1) Failed test: 49 Non-zero exit status: 1 t3701-add-interactive.sh (Wstat: 0 Tests: 71 Failed: 0) TODO passed: 45, 47 That's the result of a compliant TAP parser parsing all of our test output, which has worked for all our tests (sans edge cases like the one being fixed here) for a long time. Let's not go into how you don't like that summary output, that's not the point. That's a UX layer presenting something from the fully parsed output, the point is that it *is* not only parsable, but it's currently *being parsed* by our toolchain on every CI run. == Human readable Unless you're expecting us to eyeball something like this verbose XML on ad-hoc runs like: $ ./t0013-sha1dc.sh --write-junit-xml --tee ok 1 - test-sha1 detects shattered pdf # passed all 1 test(s) 1..1 Which for JUnit is: $ cat out/TEST-t0013-sha1dc.xml <testsuites> <testsuite name="t0013-sha1dc" timestamp="2022-03-24T14:20:27" time="0.065762"> <testcase name="t0013.1 test-sha1 detects shattered pdf
" classname="t0013" time="0.029818"> </testcase> </testsuite> </testsuites> You're going to end up with 2 output formats, a human readable one and some secondary also-machine-readable output. TAP isn't perfect, but it's uniquely suitable as a format for the sort of testing we do, because it's explicitly machine-readable format that's designed to also be human readable. Which isn't a goal with JUnit. Perhaps some XML enthusiasts would argue that point. Sure, you *can* read it, but it's not exactly pleasing nor aiming to reduce verbosity. The JUnit we currently emit is >2x the verbosity of TAP by lines, and ~3x as much by byte count (all of which a human needs to occasionally skim): $ rm -rf out; ./t0001-init.sh --write-junit-xml --tee |wc ; wc <out/*xml 63 488 2553 130 615 7592 == It's trivial to produce Most output formats are way more picky about what constitutes a valid payload than TAP is, which is why for the TAP v.s. JUnit output the number of syscalls we need shows this drastic difference: $ strace --summary-only -f ./t0001-init.sh --tee 2>&1 | tail -n 1 100.00 3.703689 21 171581 13587 total $ strace --summary-only -f ./t0001-init.sh --tee --write-junit-xml 2>&1 | tail -n 1 100.00 6.424080 22 285903 21097 total I.e. the implementation of it just needs an "echo" or whatever, whereas the XML one is shelling out for pretty much every line it prints to escape things. I'm you might argue about the hatefulness about shellscripting there and how we should rewrite everything in C. I don't disagree with that point in the abstract, but realistically we'll be needing to deal with the shellscript-based test suite for the foreseeable future. == It's more flexible The JUnit we write out is just the summary output, but when you run a test with e.g. "--verbose -x" the result is valid TAP. I.e. TAP parser will take this: $ ./t0013-sha1dc.sh -vx 2>&1|wc 38 165 1104 And grab the lines matching "ok" "not ok" etc., and *ignore everything else*. This means that we're able to use it for things that you just couldn't use JUnit for, e.g. the POC I have here where we pair up our "--verbose -x" output with the failure summary, i.e. show what specific test things belongs t [2]: == It doesn't matter if nobody else uses it Or: "we don't need a 2nd machine-parsable format". I think your "the world has moved on to more powerful file formats to represent test output" is a valid concern. I.e. we don't want to have a test suite output that we can't pretty render on whatever the latest fancy test CI renderer is is a completely valid concern, whether that's Azure or GitHub CI. But the implicit assumption there seems to be that TAP isn't rather trivially machine-parsable, and that we aren't already running a parser for it. It is, and we are. Because of that we can just have a post-transformation step of changing TAP into whatever other format you'd like. So you can have JUnit, GitHub CI Markdown or whatever, it should just be produced from the TAP output. I think that your addition of JUnit & GitHub CI Markdown (in [1]) is much better done like that. It avoids the complexity of N output formats in test-lib.sh. A transformer for your desired format probably exists already (or could trivially be adjusted for a new format): https://metacpan.org/search?q=TAP%3A%3AFormatter == Everyone uses it Well, <citation needed>, but while per the above I don't think wide use of TAP even matters I think being concerned about lack of tooling support for it is strange. I'm fairly sure it's the single most widely supported test output format of any kind. Because of how trivial it is to parse and implement I'm pretty sure it's the most widely-supported machine-readable test output format of any kind. *Maybe* something like JUnit has (near) parity if we're counting format transformers-to-transformers, but I'm fairly sure not if it wouldn't be allowed to "cheat" by making use of TAP as an intermediate format. == Cases where TAP isn't good enough currently Someone familiar with our test tooling might thing "wait, TAP with --verbose is valid TAP?" because since 614fe015212 (test-lib: bail out when "-v" used under "prove", 2016-10-22) we have been refusing to produce it in that case: $ prove t0013-sha1dc.sh :: --verbose Bailout called. Further testing stopped: verbose mode forbidden under TAP harness; try --verbose-log FAILED--Further testing stopped: verbose mode forbidden under TAP harness; try --verbose-log Fixing that is needed for the approach in [2] to work 100% of the time, I have working patches to that (which as it happens, are currently blocking on conflicts with your [1]). I.e. the problem is that if we're expecting a line starting with "ok" or whatever the firehose --verbose output from the test itself might emit such a line, and confuse the parser. The solution was rather trivial in the case of TAP, just have test-lib.sh emit a configurable prefix string before any TAP syntax it emits, then have a "tee" tool we ship as a test helper munge that after the fact: https://github.com/avar/git/commit/2358ff8c3a7 https://github.com/avar/git/commit/2ae430bfce7 == Handling the long-tail of machine-readability Continuing the above: I don't think a solution to that sort of thing is really viable with anything that doesn't behave like TAP. We have tests doing things like starting a background daemon that will racily emit output its output *in the middle of another* test. As in not theoretically, it's a case I actually had to debug & test to get the above commits working. The JUnit emitter punts on it entirely by not including the "--verbose -x" output at all. The rather easy solution for TAP explicitly relies on its "non-syntax" of considering every line it doesn't explicitly classify as syntax as something it'll pass through as "output". Those lines thenh need no further escaping. They can pretty much be latin-1, UTF-8, your favorite bits from /dev/urandom etc.. It doesn't matter. Whereas for e.g. any XML, JSON etc. format where things are syntax by default you'd need to intercept all that output to strip "<" and the like from it. Even if you did that you probably wouldn't have anywhere sensible to stick that output, you probably already emitted the beginning/end markers for "here's the verbose output for test N" already, which would require complexity a TAP emitter wouldn't need. > Apart from taking reviewer time away from such more powerful avenues to > make sense of our failing tests, the diff does not really hurt, so I have > no objections against integrating this patch. I hope the above will summarize why those suggested avenues aren't "more powerful", quite the opposite. I think if you did the legwork of trying to make those formats represent our "--verbose -x" output in a machine-readable way and tried to roundtrip that parsed output (which I have done for TAP) that you'd find it somewhere between "much, much harder" and "impossible" to do the same for those other formats in the context of our test suite. So yes, I agree that there's a lack of focus here, and we should put more wood behind fewer arrows in terms of making our test output machine-parsable. The [2] I mentioned above shows the sorts of wins we can get from that, i.e. emitting *only* the lines of test output relevant to the specific failures we had. That's really useful, and something you inherently can't do with the sort of approach you're going for in your [1] series. At least not without buffering the whole thing & parsing it, at which point you're back to square #1 in terms of fixing the sorts of issues noted in "Handling the long-tail of machine-readability" above. 1. https://lore.kernel.org/git/pull.1117.git.1643050574.gitgitgadget@gmail.com/ 2. https://lore.kernel.org/git/220302.86mti87cj2.gmgdl@evledraar.gmail.com/
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > == It "can be parsed" > == Human readable > == It's trivial to produce > == It's more flexible > == It doesn't matter if nobody else uses it > == Everyone uses it I agree with all of what you said above. But do non-human readers really want to read output from an "--immediate" session? Don't they rather see the whole thing? The only "Huh?" I had with the original patch that started the thread was that TAP output currently does not work well with the "--verbose" option, and I've never run our tests with the "--immediate" option without the "--verbose" option to see where and how the first breakage happens and what is left behind in the trash directory, i.e. to bootstrap an interactive debugging session. But not being useful for the use case of human reader running interactively to get the leftover state does not mean improvement for other use cases is not welcome. > I think if you did the legwork of trying to make those formats represent > our "--verbose -x" output in a machine-readable way and tried to > roundtrip that parsed output (which I have done for TAP) that you'd find > it somewhere between "much, much harder" and "impossible" to do the same > for those other formats in the context of our test suite. > > So yes, I agree that there's a lack of focus here, and we should put > more wood behind fewer arrows in terms of making our test output > machine-parsable. > > The [2] I mentioned above shows the sorts of wins we can get from that, > i.e. emitting *only* the lines of test output relevant to the specific > failures we had. > > That's really useful, and something you inherently can't do with the > sort of approach you're going for in your [1] series. > > At least not without buffering the whole thing & parsing it, at which > point you're back to square #1 in terms of fixing the sorts of issues > noted in "Handling the long-tail of machine-readability" above. > > 1. https://lore.kernel.org/git/pull.1117.git.1643050574.gitgitgadget@gmail.com/ > 2. https://lore.kernel.org/git/220302.86mti87cj2.gmgdl@evledraar.gmail.com/
On Thu, Mar 24, 2022 at 02:48:42PM +0100, Ævar Arnfjörð Bjarmason wrote: > > The commit message is strong on the what, very strong in giving verbose > > output that might or might not clarify the intention, and a little weak in > > the why and the greater context. > > I thought "so that it emits valid TAP" was sufficiently > self-explaining. I.e. we emit this machine-readable format, but in this > edge case our output is invalid TAP, now it's valid. I agree; if the justification is "something we use not-infrequently is broken" and the rest is "and this patch un-breaks it", I do not think we should devote much space to justifying why we use that thing in the first place. Our TAP output meets the bar (at least for me, personally) of not needing to be rehashed anytime we change it, so I don't have any complaints about Ævar's patch message here. Of course, we should be careful to avoid following that guidance _too_ much, since if it leaves us in a spot where we never question any past decisions, then I think we have gone too far. Thanks, Taylor
On Thu, Mar 24 2022, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > >> == It "can be parsed" >> == Human readable >> == It's trivial to produce >> == It's more flexible >> == It doesn't matter if nobody else uses it >> == Everyone uses it > > I agree with all of what you said above. > > But do non-human readers really want to read output from an > "--immediate" session? Don't they rather see the whole thing? That test run will still take longer, especially if you're expecting widespread failures. The immediate use-case I have for this is that I have unsubmitted patches to implement a "check" mode for SANITIZE=leak, where it runs with --immediate and "fails" tests that pass under SANITIZE=leak but aren't marked with "TEST_PASSES_SANITIZE_LEAK=true", and passes (and runs to completion completely) those tests that do have "TEST_PASSES_SANITIZE_LEAK=true" but pass with SANITIZE=leak (and fails those that would fail there, no "fail inversion"). That sort of thing is very useful to for any tests where we mark certain other tests as being OK under a given mode, and except there to be a 1=1 correspondance. The reason for the --immediate there is an optimization, it takes a long time to run with SANITIZE=leak, and if any single failure is enough to declare the test bad --immediate is useful. Now, the TAP tooling is quite tolerant of such bad output, another point in its favor over something like JUnit (I'm assuming most libraries using it would pass it through a generic XML parser, whose error handling is either "ok" or "ARGHL!" :). I.e. it'll give you a full parse, but just have a minor complaint about the lacking end marker, getting rid of that output & parse nit is nice. > The only "Huh?" I had with the original patch that started the > thread was that TAP output currently does not work well with the > "--verbose" option, and I've never run our tests with the > "--immediate" option without the "--verbose" option to see where and > how the first breakage happens and what is left behind in the trash > directory, i.e. to bootstrap an interactive debugging session. > > But not being useful for the use case of human reader running > interactively to get the leftover state does not mean improvement > for other use cases is not welcome. I was sad to see that ability go away, without the patches I referenced running through a TAP parser with --verbose works for almost all ouf our test suite, so IMO we were a bit overzelous in adding that early abort for it. But I do have patches to make it pass 100% guaranteed, so depending on your interest ... :) I was planning to queue that behind some of the outstanding test lib changes it would conflict with, e.g. this one & the outstanding CI topics.
diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh index 9dcbf518a78..17a268ccd1b 100755 --- a/t/t0000-basic.sh +++ b/t/t0000-basic.sh @@ -101,6 +101,19 @@ test_expect_success 'subtest: 2/3 tests passing' ' EOF ' +test_expect_success 'subtest: --immediate' ' + run_sub_test_lib_test_err partial-pass \ + --immediate && + check_sub_test_lib_test_err partial-pass \ + <<-\EOF_OUT 3<<-EOF_ERR + > ok 1 - passing test #1 + > not ok 2 - failing test #2 + > # false + > 1..2 + EOF_OUT + EOF_ERR +' + test_expect_success 'subtest: a failing TODO test' ' write_and_run_sub_test_lib_test failing-todo <<-\EOF && test_expect_success "passing test" "true" diff --git a/t/test-lib.sh b/t/test-lib.sh index 515b1af7ed4..4373f7d70b5 100644 --- a/t/test-lib.sh +++ b/t/test-lib.sh @@ -806,7 +806,11 @@ test_failure_ () { say_color error "not ok $test_count - $1" shift printf '%s\n' "$*" | sed -e 's/^/# /' - test "$immediate" = "" || _error_exit + if test -n "$immediate" + then + say_color error "1..$test_count" + _error_exit + fi } test_known_broken_ok_ () {