test-lib: have --immediate emit valid TAP on failure

Message ID	patch-1.1-47b236139e6-20220323T204410Z-avarab@gmail.com (mailing list archive)
State	Accepted
Commit	bbfbcd25b39e9020ce98467cfcf60dfce7e9b484
Headers	show Return-Path: <git-owner@kernel.org> From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsCBCamFybWFzb24=?= <avarab@gmail.com> To: git@vger.kernel.org Cc: Junio C Hamano <gitster@pobox.com>, Johannes Schindelin <Johannes.Schindelin@gmx.de>, =?utf-8?b?w4Z2YXIgQXJu?= =?utf-8?b?ZmrDtnLDsCBCamFybWFzb24=?= <avarab@gmail.com> Subject: [PATCH] test-lib: have --immediate emit valid TAP on failure Date: Wed, 23 Mar 2022 21:51:31 +0100 Message-Id: <patch-1.1-47b236139e6-20220323T204410Z-avarab@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	test-lib: have --immediate emit valid TAP on failure \| expand test-lib: have --immediate emit valid TAP on failure

Ævar Arnfjörð Bjarmason March 23, 2022, 8:51 p.m. UTC

Change the "--immediate" option so that it emits valid TAP on
failure. Before this it would omit the required plan at the end,
e.g. under SANITIZE=leak we'd show a "No plan found in TAP output"
error from "prove":

    $ prove t0006-date.sh ::  --immediate
    t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100)
    Failed 1/22 subtests

    Test Summary Report
    -------------------
    t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1)
      Failed test:  22
      Non-zero exit status: 1
      Parse errors: No plan found in TAP output
    Files=1, Tests=22,  0 wallclock secs ( 0.02 usr  0.01 sys +  0.18 cusr  0.06 csys =  0.27 CPU)
    Result: FAIL

Now we'll emit output that doesn't result in TAP parsing failures:

    $ prove t0006-date.sh ::  --immediate
    t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100)
    Failed 1/22 subtests

    Test Summary Report
    -------------------
    t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1)
      Failed test:  22
      Non-zero exit status: 1
    Files=1, Tests=22,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.19 cusr  0.05 csys =  0.26 CPU)
    Result: FAIL

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

Junio: When this is merged to "seen" there's a conflict like:

	<<<<<<< HEAD
		if test -n "$immediate"
		then
			say_color error "1..$test_count"
			_error_exit
		fi
	=======
		test "$immediate" = "" || _error_exit
		finalize_test_case_output failure "$failure_label" "$@"
	>>>>>>> origin/seen

The "solution" is to end up with:

	test_failure_ () {
		failure_label=$1
		test_failure=$(($test_failure + 1))
		say_color error "not ok $test_count - $1"
		shift
		printf '%s\n' "$*" | sed -e 's/^/#	/'
		if test -n "$immediate"
		then
			say_color error "1..$test_count"
			_error_exit
		fi
		finalize_test_case_output failure "$failure_label" "$@"
	}

Those scare quotes being because the "finalize_test_case_output" is in
the wrong place both before and after, but that's an existing bug in
eb53a5b8047 (ci: call `finalize_test_case_output` a little later,
2022-03-01), i.e. that alternate output mechanism wasn't doing the
right thing on --immediate output.

But this change and that merge resolution doesn't make that any worse,
so hopefully this can be queued anyway. Thanks!
		
 t/t0000-basic.sh | 13 +++++++++++++
 t/test-lib.sh    |  6 +++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

Johannes Schindelin March 24, 2022, 1:38 p.m. UTC | #1

Hi Ævar,

On Wed, 23 Mar 2022, Ævar Arnfjörð Bjarmason wrote:

> Change the "--immediate" option so that it emits valid TAP on
> failure. Before this it would omit the required plan at the end,
> e.g. under SANITIZE=leak we'd show a "No plan found in TAP output"
> error from "prove":
>
>     $ prove t0006-date.sh ::  --immediate
>     t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100)
>     Failed 1/22 subtests
>
>     Test Summary Report
>     -------------------
>     t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1)
>       Failed test:  22
>       Non-zero exit status: 1
>       Parse errors: No plan found in TAP output
>     Files=1, Tests=22,  0 wallclock secs ( 0.02 usr  0.01 sys +  0.18 cusr  0.06 csys =  0.27 CPU)
>     Result: FAIL
>
> Now we'll emit output that doesn't result in TAP parsing failures:
>
>     $ prove t0006-date.sh ::  --immediate
>     t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100)
>     Failed 1/22 subtests
>
>     Test Summary Report
>     -------------------
>     t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1)
>       Failed test:  22
>       Non-zero exit status: 1
>     Files=1, Tests=22,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.19 cusr  0.05 csys =  0.26 CPU)
>     Result: FAIL

The commit message is strong on the what, very strong in giving verbose
output that might or might not clarify the intention, and a little weak in
the why and the greater context.

In the broader story arc, I wonder why we're even bothering with TAP
anymore because it seems that the world has moved on to more powerful file
formats to represent test output, such as JUnit XML, that can be parsed,
rendered and analyzed by powerful special-purpose applications.

Apart from taking reviewer time away from such more powerful avenues to
make sense of our failing tests, the diff does not really hurt, so I have
no objections against integrating this patch.

Ciao,
Johannes

> diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh
> index 9dcbf518a78..17a268ccd1b 100755
> --- a/t/t0000-basic.sh
> +++ b/t/t0000-basic.sh
> @@ -101,6 +101,19 @@ test_expect_success 'subtest: 2/3 tests passing' '
>  	EOF
>  '
>
> +test_expect_success 'subtest: --immediate' '
> +	run_sub_test_lib_test_err partial-pass \
> +		--immediate &&
> +	check_sub_test_lib_test_err partial-pass \
> +		<<-\EOF_OUT 3<<-EOF_ERR
> +	> ok 1 - passing test #1
> +	> not ok 2 - failing test #2
> +	> #	false
> +	> 1..2
> +	EOF_OUT
> +	EOF_ERR
> +'
> +
>  test_expect_success 'subtest: a failing TODO test' '
>  	write_and_run_sub_test_lib_test failing-todo <<-\EOF &&
>  	test_expect_success "passing test" "true"
> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index 515b1af7ed4..4373f7d70b5 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -806,7 +806,11 @@ test_failure_ () {
>  	say_color error "not ok $test_count - $1"
>  	shift
>  	printf '%s\n' "$*" | sed -e 's/^/#	/'
> -	test "$immediate" = "" || _error_exit
> +	if test -n "$immediate"
> +	then
> +		say_color error "1..$test_count"
> +		_error_exit
> +	fi
>  }
>
>  test_known_broken_ok_ () {
> --
> 2.35.1.1452.ga7cfc89151f
>
>

Ævar Arnfjörð Bjarmason March 24, 2022, 1:48 p.m. UTC | #2

On Thu, Mar 24 2022, Johannes Schindelin wrote:

> Hi Ævar,
>
> On Wed, 23 Mar 2022, Ævar Arnfjörð Bjarmason wrote:
>
>> Change the "--immediate" option so that it emits valid TAP on
>> failure. Before this it would omit the required plan at the end,
>> e.g. under SANITIZE=leak we'd show a "No plan found in TAP output"
>> error from "prove":
>>
>>     $ prove t0006-date.sh ::  --immediate
>>     t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100)
>>     Failed 1/22 subtests
>>
>>     Test Summary Report
>>     -------------------
>>     t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1)
>>       Failed test:  22
>>       Non-zero exit status: 1
>>       Parse errors: No plan found in TAP output
>>     Files=1, Tests=22,  0 wallclock secs ( 0.02 usr  0.01 sys +  0.18 cusr  0.06 csys =  0.27 CPU)
>>     Result: FAIL
>>
>> Now we'll emit output that doesn't result in TAP parsing failures:
>>
>>     $ prove t0006-date.sh ::  --immediate
>>     t0006-date.sh .. Dubious, test returned 1 (wstat 256, 0x100)
>>     Failed 1/22 subtests
>>
>>     Test Summary Report
>>     -------------------
>>     t0006-date.sh (Wstat: 256 Tests: 22 Failed: 1)
>>       Failed test:  22
>>       Non-zero exit status: 1
>>     Files=1, Tests=22,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.19 cusr  0.05 csys =  0.26 CPU)
>>     Result: FAIL
>
> The commit message is strong on the what, very strong in giving verbose
> output that might or might not clarify the intention, and a little weak in
> the why and the greater context.

I thought "so that it emits valid TAP" was sufficiently
self-explaining. I.e. we emit this machine-readable format, but in this
edge case our output is invalid TAP, now it's valid.

Re the verbose output: The last time I submitted patch for tweaking the
TAP output I got the feedback (from Junio in
https://lore.kernel.org/git/xmqq7dad10oy.fsf@gitster.g/) that a more
explicit example was wanted. So I included it this time around.

> In the broader story arc, I wonder why we're even bothering with TAP
> anymore because it seems that the world has moved on to more powerful file
> formats to represent test output, such as JUnit XML, that can be parsed,
> rendered and analyzed by powerful special-purpose applications.

[I guess that warrants a $subject change]

A few things, grouped by heading:

== It "can be parsed"

First, the "[TAP ]can['t] be parsed" part of that is wrong.

As your own CI patch series[1] shows we currently parse the entire TAP
output of our test suite on every CI run. Otherwise prove(1) couldn't
have come up with this (quoting from your [1]):

	Test Summary Report
	-------------------
	t1092-sparse-checkout-compatibility.sh           (Wstat: 256 Tests: 53 Failed: 1)
	  Failed test:  49
	  Non-zero exit status: 1
	t3701-add-interactive.sh                         (Wstat: 0 Tests: 71 Failed: 0)
	  TODO passed:   45, 47

That's the result of a compliant TAP parser parsing all of our test
output, which has worked for all our tests (sans edge cases like the one
being fixed here) for a long time.

Let's not go into how you don't like that summary output, that's not the
point.

That's a UX layer presenting something from the fully parsed output, the
point is that it *is* not only parsable, but it's currently *being
parsed* by our toolchain on every CI run.

== Human readable

Unless you're expecting us to eyeball something like this verbose XML on
ad-hoc runs like:

	$ ./t0013-sha1dc.sh --write-junit-xml --tee
	ok 1 - test-sha1 detects shattered pdf
	# passed all 1 test(s)
	1..1

Which for JUnit is:

	$ cat out/TEST-t0013-sha1dc.xml 
	<testsuites>
	  <testsuite name="t0013-sha1dc" timestamp="2022-03-24T14:20:27" time="0.065762">
	    <testcase name="t0013.1 test-sha1 detects shattered pdf&#x0a;" classname="t0013" time="0.029818">
	    </testcase>
	  </testsuite>
	</testsuites>

You're going to end up with 2 output formats, a human readable one and
some secondary also-machine-readable output.

TAP isn't perfect, but it's uniquely suitable as a format for the sort
of testing we do, because it's explicitly machine-readable format that's
designed to also be human readable.

Which isn't a goal with JUnit. Perhaps some XML enthusiasts would argue
that point. Sure, you *can* read it, but it's not exactly pleasing nor
aiming to reduce verbosity.

The JUnit we currently emit is >2x the verbosity of TAP by lines, and
~3x as much by byte count (all of which a human needs to occasionally
skim):

	$ rm -rf out; ./t0001-init.sh --write-junit-xml --tee |wc ; wc <out/*xml
	 63   488  2553
	 130  615  7592

== It's trivial to produce

Most output formats are way more picky about what constitutes a valid
payload than TAP is, which is why for the TAP v.s. JUnit output the
number of syscalls we need shows this drastic difference:

	$ strace --summary-only -f ./t0001-init.sh --tee 2>&1 | tail -n 1
	100.00    3.703689          21    171581     13587 total
	$ strace --summary-only -f ./t0001-init.sh --tee --write-junit-xml 2>&1 | tail -n 1
	100.00    6.424080          22    285903     21097 total

I.e. the implementation of it just needs an "echo" or whatever, whereas
the XML one is shelling out for pretty much every line it prints to
escape things.

I'm you might argue about the hatefulness about shellscripting there and
how we should rewrite everything in C.

I don't disagree with that point in the abstract, but realistically
we'll be needing to deal with the shellscript-based test suite for the
foreseeable future.

== It's more flexible

The JUnit we write out is just the summary output, but when you run a
test with e.g. "--verbose -x" the result is valid TAP. I.e. TAP parser
will take this:

	$ ./t0013-sha1dc.sh -vx 2>&1|wc
	38     165    1104

And grab the lines matching "ok" "not ok" etc., and *ignore everything
else*.

This means that we're able to use it for things that you just couldn't
use JUnit for, e.g. the POC I have here where we pair up our "--verbose
-x" output with the failure summary, i.e. show what specific test things
belongs t [2]:

== It doesn't matter if nobody else uses it

Or: "we don't need a 2nd machine-parsable format".

I think your "the world has moved on to more powerful file formats to
represent test output" is a valid concern.

I.e. we don't want to have a test suite output that we can't pretty
render on whatever the latest fancy test CI renderer is is a completely
valid concern, whether that's Azure or GitHub CI.

But the implicit assumption there seems to be that TAP isn't rather
trivially machine-parsable, and that we aren't already running a parser
for it. It is, and we are.

Because of that we can just have a post-transformation step of changing
TAP into whatever other format you'd like.

So you can have JUnit, GitHub CI Markdown or whatever, it should just be
produced from the TAP output.

I think that your addition of JUnit & GitHub CI Markdown (in [1]) is
much better done like that. It avoids the complexity of N output formats
in test-lib.sh. A transformer for your desired format probably exists
already (or could trivially be adjusted for a new format):

    https://metacpan.org/search?q=TAP%3A%3AFormatter

== Everyone uses it

Well, <citation needed>, but while per the above I don't think wide use
of TAP even matters I think being concerned about lack of tooling
support for it is strange.

I'm fairly sure it's the single most widely supported test output format
of any kind. Because of how trivial it is to parse and implement I'm
pretty sure it's the most widely-supported machine-readable test output
format of any kind.

*Maybe* something like JUnit has (near) parity if we're counting format
transformers-to-transformers, but I'm fairly sure not if it wouldn't be
allowed to "cheat" by making use of TAP as an intermediate format.

== Cases where TAP isn't good enough currently

Someone familiar with our test tooling might thing "wait, TAP with
--verbose is valid TAP?" because since 614fe015212 (test-lib: bail out
when "-v" used under "prove", 2016-10-22) we have been refusing to
produce it in that case:

    $ prove t0013-sha1dc.sh :: --verbose
    Bailout called.  Further testing stopped:  verbose mode forbidden under TAP harness; try --verbose-log
    FAILED--Further testing stopped: verbose mode forbidden under TAP harness; try --verbose-log

Fixing that is needed for the approach in [2] to work 100% of the time,
I have working patches to that (which as it happens, are currently
blocking on conflicts with your [1]).

I.e. the problem is that if we're expecting a line starting with "ok" or
whatever the firehose --verbose output from the test itself might emit
such a line, and confuse the parser.

The solution was rather trivial in the case of TAP, just have
test-lib.sh emit a configurable prefix string before any TAP syntax it
emits, then have a "tee" tool we ship as a test helper munge that after
the fact:

    https://github.com/avar/git/commit/2358ff8c3a7
    https://github.com/avar/git/commit/2ae430bfce7

== Handling the long-tail of machine-readability

Continuing the above: I don't think a solution to that sort of thing is
really viable with anything that doesn't behave like TAP.

We have tests doing things like starting a background daemon that will
racily emit output its output *in the middle of another* test.

As in not theoretically, it's a case I actually had to debug & test to
get the above commits working. The JUnit emitter punts on it entirely by
not including the "--verbose -x" output at all.

The rather easy solution for TAP explicitly relies on its "non-syntax"
of considering every line it doesn't explicitly classify as syntax as
something it'll pass through as "output".

Those lines thenh need no further escaping. They can pretty much be
latin-1, UTF-8, your favorite bits from /dev/urandom etc.. It doesn't
matter.

Whereas for e.g. any XML, JSON etc. format where things are syntax by
default you'd need to intercept all that output to strip "<" and the
like from it.

Even if you did that you probably wouldn't have anywhere sensible to
stick that output, you probably already emitted the beginning/end
markers for "here's the verbose output for test N" already, which would
require complexity a TAP emitter wouldn't need.

> Apart from taking reviewer time away from such more powerful avenues to
> make sense of our failing tests, the diff does not really hurt, so I have
> no objections against integrating this patch.

I hope the above will summarize why those suggested avenues aren't "more
powerful", quite the opposite.

I think if you did the legwork of trying to make those formats represent
our "--verbose -x" output in a machine-readable way and tried to
roundtrip that parsed output (which I have done for TAP) that you'd find
it somewhere between "much, much harder" and "impossible" to do the same
for those other formats in the context of our test suite.

So yes, I agree that there's a lack of focus here, and we should put
more wood behind fewer arrows in terms of making our test output
machine-parsable.

The [2] I mentioned above shows the sorts of wins we can get from that,
i.e. emitting *only* the lines of test output relevant to the specific
failures we had.

That's really useful, and something you inherently can't do with the
sort of approach you're going for in your [1] series.

At least not without buffering the whole thing & parsing it, at which
point you're back to square #1 in terms of fixing the sorts of issues
noted in "Handling the long-tail of machine-readability" above.

1. https://lore.kernel.org/git/pull.1117.git.1643050574.gitgitgadget@gmail.com/
2. https://lore.kernel.org/git/220302.86mti87cj2.gmgdl@evledraar.gmail.com/

Junio C Hamano March 24, 2022, 5:26 p.m. UTC | #3

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> == It "can be parsed"
> == Human readable
> == It's trivial to produce
> == It's more flexible
> == It doesn't matter if nobody else uses it
> == Everyone uses it

I agree with all of what you said above.

But do non-human readers really want to read output from an
"--immediate" session?  Don't they rather see the whole thing?

The only "Huh?" I had with the original patch that started the
thread was that TAP output currently does not work well with the
"--verbose" option, and I've never run our tests with the
"--immediate" option without the "--verbose" option to see where and
how the first breakage happens and what is left behind in the trash
directory, i.e. to bootstrap an interactive debugging session.

But not being useful for the use case of human reader running
interactively to get the leftover state does not mean improvement
for other use cases is not welcome.

> I think if you did the legwork of trying to make those formats represent
> our "--verbose -x" output in a machine-readable way and tried to
> roundtrip that parsed output (which I have done for TAP) that you'd find
> it somewhere between "much, much harder" and "impossible" to do the same
> for those other formats in the context of our test suite.
>
> So yes, I agree that there's a lack of focus here, and we should put
> more wood behind fewer arrows in terms of making our test output
> machine-parsable.
>
> The [2] I mentioned above shows the sorts of wins we can get from that,
> i.e. emitting *only* the lines of test output relevant to the specific
> failures we had.
>
> That's really useful, and something you inherently can't do with the
> sort of approach you're going for in your [1] series.
>
> At least not without buffering the whole thing & parsing it, at which
> point you're back to square #1 in terms of fixing the sorts of issues
> noted in "Handling the long-tail of machine-readability" above.
>
> 1. https://lore.kernel.org/git/pull.1117.git.1643050574.gitgitgadget@gmail.com/
> 2. https://lore.kernel.org/git/220302.86mti87cj2.gmgdl@evledraar.gmail.com/

Taylor Blau March 24, 2022, 9:57 p.m. UTC | #4

On Thu, Mar 24, 2022 at 02:48:42PM +0100, Ævar Arnfjörð Bjarmason wrote:
> > The commit message is strong on the what, very strong in giving verbose
> > output that might or might not clarify the intention, and a little weak in
> > the why and the greater context.
>
> I thought "so that it emits valid TAP" was sufficiently
> self-explaining. I.e. we emit this machine-readable format, but in this
> edge case our output is invalid TAP, now it's valid.

I agree; if the justification is "something we use not-infrequently is
broken" and the rest is "and this patch un-breaks it", I do not think we
should devote much space to justifying why we use that thing in the
first place.

Our TAP output meets the bar (at least for me, personally) of not
needing to be rehashed anytime we change it, so I don't have any
complaints about Ævar's patch message here.

Of course, we should be careful to avoid following that guidance _too_
much, since if it leaves us in a spot where we never question any past
decisions, then I think we have gone too far.

Thanks,
Taylor

Ævar Arnfjörð Bjarmason March 28, 2022, 3:50 p.m. UTC | #5

On Thu, Mar 24 2022, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> == It "can be parsed"
>> == Human readable
>> == It's trivial to produce
>> == It's more flexible
>> == It doesn't matter if nobody else uses it
>> == Everyone uses it
>
> I agree with all of what you said above.
>
> But do non-human readers really want to read output from an
> "--immediate" session?  Don't they rather see the whole thing?

That test run will still take longer, especially if you're expecting
widespread failures.

The immediate use-case I have for this is that I have unsubmitted
patches to implement a "check" mode for SANITIZE=leak, where it runs
with --immediate and "fails" tests that pass under SANITIZE=leak but
aren't marked with "TEST_PASSES_SANITIZE_LEAK=true", and passes (and
runs to completion completely) those tests that do have
"TEST_PASSES_SANITIZE_LEAK=true" but pass with SANITIZE=leak (and fails
those that would fail there, no "fail inversion").

That sort of thing is very useful to for any tests where we mark certain
other tests as being OK under a given mode, and except there to be a 1=1
correspondance.

The reason for the --immediate there is an optimization, it takes a long
time to run with SANITIZE=leak, and if any single failure is enough to
declare the test bad --immediate is useful.

Now, the TAP tooling is quite tolerant of such bad output, another point
in its favor over something like JUnit (I'm assuming most libraries
using it would pass it through a generic XML parser, whose error
handling is either "ok" or "ARGHL!" :).

I.e. it'll give you a full parse, but just have a minor complaint about
the lacking end marker, getting rid of that output & parse nit is nice.

> The only "Huh?" I had with the original patch that started the
> thread was that TAP output currently does not work well with the
> "--verbose" option, and I've never run our tests with the
> "--immediate" option without the "--verbose" option to see where and
> how the first breakage happens and what is left behind in the trash
> directory, i.e. to bootstrap an interactive debugging session.
>
> But not being useful for the use case of human reader running
> interactively to get the leftover state does not mean improvement
> for other use cases is not welcome.

I was sad to see that ability go away, without the patches I referenced
running through a TAP parser with --verbose works for almost all ouf our
test suite, so IMO we were a bit overzelous in adding that early abort
for it.

But I do have patches to make it pass 100% guaranteed, so depending on
your interest ... :)

I was planning to queue that behind some of the outstanding test lib
changes it would conflict with, e.g. this one & the outstanding CI
topics.

test-lib: have --immediate emit valid TAP on failure

Commit Message

Comments

Patch