test-lib.sh musings: test_expect_failure considered harmful

Message ID	87tuhmk19c.fsf@evledraar.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> From: =?utf-8?b?w4Z2YXIgQXJuZmrDtnLDsA==?= Bjarmason <avarab@gmail.com> To: Junio C Hamano <gitster@pobox.com> Cc: git@vger.kernel.org, David Turner <novalis@novalis.org>, Elijah Newren <newren@gmail.com>, Matheus Tavares <matheus.bernardino@usp.br>, Jeff King <peff@peff.net>, Derrick Stolee <derrickstolee@github.com>, =?utf-8?b?xJBvw6BuIFRy4bqnbiBD?= =?utf-8?b?w7RuZw==?= Danh <congdanhqx@gmail.com> Subject: test-lib.sh musings: test_expect_failure considered harmful Date: Tue, 12 Oct 2021 11:23:40 +0200 User-agent: Debian GNU/Linux bookworm/sid; Emacs 27.1; mu4e 1.7.0 Message-ID: <87tuhmk19c.fsf@evledraar.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk
Series	test-lib.sh musings: test_expect_failure considered harmful \| expand test-lib.sh musings: test_expect_failure considered harmful

Ævar Arnfjörð Bjarmason Oct. 12, 2021, 9:23 a.m. UTC

On Mon, Oct 11 2021, Junio C Hamano wrote:

[Removed "In-reply-to: <xmqq5yu3b80j.fsf@gitster.g>" with the Subject
change]

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>  test_expect_success POSIXPERM,SANITY 'commit should notice unwritable repository' '
>>  	test_when_finished "chmod 775 .git/objects .git/objects/??" &&
>>  	chmod a-w .git/objects .git/objects/?? &&
>> -	test_must_fail git commit -m second
>> +
>> +	cat >expect <<-\EOF &&
>> +	error: insufficient permission for adding an object to repository database .git/objects
>> +	error: insufficient permission for adding an object to repository database .git/objects
>> +	error: Error building trees
>> +	EOF
>
> This is odd.  Shouldn't the test expect one message from write-tree
> and be marked as expecting a failure until the bug gets fixed?

Presumably with test_expect_failure.

I'll change it, in this case we'd end up with a test_expect_success at
the end, so it doesn't matter much & I don't care.

But FWIW $subject, or at least s/harmful/running with scissors/g :)

[CC'd some recent-ish users of test_expect_failure, and I'm no innocent
in that department :)]

In the Perl world (Test::More et al) the "#TODO" keyword we map
test_expect_failure to (and yeah, I know the latter pre-dates the
former...) doesn't generally lead to subtle breakages and mismatched
expectations, i.e. you do:

    TODO: {
        local $TODO = "not implemented yet";
        is($a, $b, "this is why this in particular fails");
    }

So you generally mark the *specific* thing that fails, as separate from
your test setup itself.

But our test-lib.sh API for it is the equivalent of marking an entire
logical test block and its setup as a TODO.

So the diff below "passes". But did we intend for the test_cmp to fail,
for the thing to segfault or hit a BUG?

Any of those conditions being hit will have the TODO test pass. So will
all of it succeeding.

=== snip ===

=== snip ===

(Although for the "suceeding" case we'll print out a summary from
"prove", but unless you're carefully eyeballing that...).

So I think "test_expect_failure" should be avoided, the only useful way
of holding it which works in combination with other test-lib.sh features
that I've come up with is:

	test_expect_success 'setup flaky failure' '
		[multi-line test code that passes here] &&
		>setup-todo
	'

	if test -e setup-todo
	then
		test_expect_failure 'flaky failure due to XYZ' '
			test_cmp todo.expect todo.actual
		'
	fi

I.e.:

 * Don't say that the failure of your passing test setup is OK too.
 * In doing that don't break --run=N, so that "test -e setup-todo" test
   (or equivalent) is needed, in case the "setup" is skipped.
 * Have *just* the "test_cmp" (or other specific failure test) in the
   "test_expect_failure"

But it's only useful if you can't make that a "! test_cmp" (or rather, a
more specific positive & passing "test_cmp".

I.e. it's flaky, or the output/end state is otherwise unknown (but we
expect it to be once bugs are fixed).

We have ~150 uses of test_expect_failure in the test suite, I'm pretty
sure that <20 of them at most are "correct" under the above
criteria. E.g. this is ok-ish:

    t7815-grep-binary.sh-# This test actually passes on platforms where regexec() supports the
    t7815-grep-binary.sh-# flag REG_STARTEND.
    t7815-grep-binary.sh-test_expect_success 'git grep ile a' '
    t7815-grep-binary.sh-   git grep ile a
    t7815-grep-binary.sh-'

Although in that case we should make it a test_expect_success if we can
get a "REG_STARTEND" build flag exported to the test suite. Skimming the
grep hits *maybe* some of the ones in "t9602-cvsimport-branches-tags.sh"
(I haven't looked carefully).

But most of them I Consider Harmful, i.e. they're a bunch of setup code
that could be hiding an unexpected bug/segfault. Running into that with
some past WIP work (a thing I considered to "just fail a test_cmp"
started segfaulting) is why I try to avoid it.

Junio C Hamano Oct. 12, 2021, 4:45 p.m. UTC | #1

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Mon, Oct 11 2021, Junio C Hamano wrote:
>
> [Removed "In-reply-to: <xmqq5yu3b80j.fsf@gitster.g>" with the Subject
> change]

Please do not do the former, although it is welcome to change Subject.

> Presumably with test_expect_failure.
>
> I'll change it, in this case we'd end up with a test_expect_success at
> the end, so it doesn't matter much & I don't care.

I do agree with you that compared to expect_success, which requires
_all_ steps to succeed, so an failure in any of its steps is
immediately noticeable, it is harder to write and keep
expect_failure useful, because it is not like we are happy to see
any failure in any step.  We do not expect a failure in many
preparation and conclusion steps in the &&-chain in expect_failure
block, and we consider it is an error if these steps fail.  We only
want to mark only a single step to exhibit an expected but undesirable
behaviour.

But even with the shortcomings of expect_failure, it still is much
better than claiming that we expect a bogus outcome.

Improving the shortcomings of expect_failure would be a much better
use of our time than advocating an abuse of expect_sucess, I would
think.

Ævar Arnfjörð Bjarmason Oct. 13, 2021, 10:10 a.m. UTC | #2

On Tue, Oct 12 2021, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> On Mon, Oct 11 2021, Junio C Hamano wrote:
> [...]
>> Presumably with test_expect_failure.
>>
>> I'll change it, in this case we'd end up with a test_expect_success at
>> the end, so it doesn't matter much & I don't care.
>
> I do agree with you that compared to expect_success, which requires
> _all_ steps to succeed, so an failure in any of its steps is
> immediately noticeable, it is harder to write and keep
> expect_failure useful, because it is not like we are happy to see
> any failure in any step.  We do not expect a failure in many
> preparation and conclusion steps in the &&-chain in expect_failure
> block, and we consider it is an error if these steps fail.  We only
> want to mark only a single step to exhibit an expected but undesirable
> behaviour.
>
> But even with the shortcomings of expect_failure, it still is much
> better than claiming that we expect a bogus outcome.
>
> Improving the shortcomings of expect_failure would be a much better
> use of our time than advocating an abuse of expect_sucess, I would
> think.

I'd like to improve it, but I'll have to get any patch in this are past
you :)

My reading of your opinion from past exchanges is that you find it
objectionable to say "this is a success" when it's not the /desired/
behavior, whereas I think it's valuable to just test for and document
the exact existing behavior, even if it's not desirable. So you don't
really need a function different from test_expect_success, just a
comment saying "this should change", or add a ("non-hash so it's not TAP
syntax") "TODO" to the description of the test.

But if you agree that we shouldn't conflate failures in the different
steps I think we're getting somewhere, so to begin with what do you
think about the hack in the v2 of my series?
https://lore.kernel.org/git/cover-v2-0.2-00000000000-20211012T142950Z-avarab@gmail.com/

If we were to prompote those semantics to something that
test_expect_failure would use it would be the below, which I think is
the only sensible way to use it.

But that would mean changing all existing test_expect_failure uses in
the test suite, so it would need either a pretty large patch, or some
incremental steps to get there:

But it will mean we can't use it for any test that's actually flaky, so
we'll need a test_expect_flaky, or have some test-specific workarounds
in those areas.

diff --git a/t/t7815-grep-binary.sh b/t/t7815-grep-binary.sh
index 90ebb64f46e..9a95c9e7d69 100755
--- a/t/t7815-grep-binary.sh
+++ b/t/t7815-grep-binary.sh
@@ -64,7 +64,7 @@ test_expect_success 'git grep ile a' '
 '
 
 test_expect_failure 'git grep .fi a' '
-	git grep .fi a
+	test_must_fail git grep .fi a
 '
 
 test_expect_success 'grep respects binary diff attribute' '
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 8361b5c1c57..6d9291b7ead 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -728,8 +728,8 @@ test_known_broken_ok_ () {
 	then
 		write_junit_xml_testcase "$* (breakage fixed)"
 	fi
-	test_fixed=$(($test_fixed+1))
-	say_color error "ok $test_count - $@ # TODO known breakage vanished"
+	test_broken=$(($test_broken+1))
+	say_color warn "not ok $test_count - $@ # TODO known breakage"
 }
 
 test_known_broken_failure_ () {
@@ -737,8 +737,8 @@ test_known_broken_failure_ () {
 	then
 		write_junit_xml_testcase "$* (known breakage)"
 	fi
-	test_broken=$(($test_broken+1))
-	say_color warn "not ok $test_count - $@ # TODO known breakage"
+	test_fixed=$(($test_fixed+1))
+	say_color error "not ok $test_count - $@ # TODO a 'known breakage' changed behavior!"
 }
 
 test_debug () {

Derrick Stolee Oct. 13, 2021, 1:05 p.m. UTC | #3

On 10/12/2021 12:45 PM, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
>> On Mon, Oct 11 2021, Junio C Hamano wrote:
>>
>> [Removed "In-reply-to: <xmqq5yu3b80j.fsf@gitster.g>" with the Subject
>> change]
> 
> Please do not do the former, although it is welcome to change Subject.
> 
>> Presumably with test_expect_failure.
>>
>> I'll change it, in this case we'd end up with a test_expect_success at
>> the end, so it doesn't matter much & I don't care.
> 
> I do agree with you that compared to expect_success, which requires
> _all_ steps to succeed, so an failure in any of its steps is
> immediately noticeable, it is harder to write and keep
> expect_failure useful, because it is not like we are happy to see
> any failure in any step.  We do not expect a failure in many
> preparation and conclusion steps in the &&-chain in expect_failure
> block, and we consider it is an error if these steps fail.  We only
> want to mark only a single step to exhibit an expected but undesirable
> behaviour.
> 
> But even with the shortcomings of expect_failure, it still is much
> better than claiming that we expect a bogus outcome.
> 
> Improving the shortcomings of expect_failure would be a much better
> use of our time than advocating an abuse of expect_sucess, I would
> think.

I agree that test_expect_failure has these drawbacks. I've recently
been using _expect_success to document "bad" behavior so we can verify
that behavior changes when that behavior is fixed. But it does have
the drawback of looking like we claim the result is by design.

One possible way to correct this is to create a "test_expected_failure"
helper that could be placed on the step(s) of the &&-chain that are
expected to fail. The helper could set some variable to true if the
failure is hit, and false otherwise. It can also convert a failure
into a positive result. Then, test_expect_failure could look for that
variable's value (after verifying that the &&-chain returns success)
to show that all expected failures completed correctly.

This could have the side-effect of having a "fixed" test_expect_failure
show as a failed test, not a "TODO" message.

Thanks,
-Stolee

Junio C Hamano Oct. 13, 2021, 5:16 p.m. UTC | #4

Derrick Stolee <stolee@gmail.com> writes:

>> But even with the shortcomings of expect_failure, it still is much
>> better than claiming that we expect a bogus outcome.
>> 
>> Improving the shortcomings of expect_failure would be a much better
>> use of our time than advocating an abuse of expect_sucess, I would
>> think.
>
> I agree that test_expect_failure has these drawbacks. I've recently
> been using _expect_success to document "bad" behavior so we can verify
> that behavior changes when that behavior is fixed. But it does have
> the drawback of looking like we claim the result is by design.

Yeah, I think I saw (and I think I used the same technique myself)
people expect a bad output with test_expect_success with an in-code
(not in-log) comment that explicitly says "This documents the
current behaviour, which is wrong", and that is a very acceptable
solution, I would think.

> One possible way to correct this is to create a "test_expected_failure"
> helper that could be placed on the step(s) of the &&-chain that are
> expected to fail. The helper could set some variable to true if the
> failure is hit, and false otherwise. It can also convert a failure
> into a positive result. Then, test_expect_failure could look for that
> variable's value (after verifying that the &&-chain returns success)
> to show that all expected failures completed correctly.

Yup, I would very much like the direction, and further imagine that
the above approach can be extended to ...

> This could have the side-effect of having a "fixed" test_expect_failure
> show as a failed test, not a "TODO" message.

... avoid such downside.  Perhaps call that magic "we know this step
fails currently" test_known_breakage and declare that we deprecate
the use of test_expect_failure in new tests.  Such a test might look
like this:

    test_expect_success 'commit error message should not duplicate' '
	test_when_finished "chmod -R u+rwx ." &&
	chmod u-rwx .git/objects/ &&
	orig_head=$(git rev-parse HEAD) &&
	test_must_fail git commit --allow-empty -m "read-only" 2>rawerr &&
	grep "insufficient permission" rawerr >err &&
	test_known_breakage test_line_count = 1 err &&
	new_head=$(git rev-parse HEAD) &&
	test "$orig_head" = "$new_head"
    '

which may use your trick to turn both failure and success to OK (to
let the remainder of the test to continue) but signal the
surrounding test_expect_success to say either "TODO know breakage"
or "Fixed".

Thanks.

Ævar Arnfjörð Bjarmason Oct. 14, 2021, 5:11 p.m. UTC | #5

On Wed, Oct 13 2021, Junio C Hamano wrote:

> Derrick Stolee <stolee@gmail.com> writes:
>
>>> But even with the shortcomings of expect_failure, it still is much
>>> better than claiming that we expect a bogus outcome.
>>> 
>>> Improving the shortcomings of expect_failure would be a much better
>>> use of our time than advocating an abuse of expect_sucess, I would
>>> think.
>>
>> I agree that test_expect_failure has these drawbacks. I've recently
>> been using _expect_success to document "bad" behavior so we can verify
>> that behavior changes when that behavior is fixed. But it does have
>> the drawback of looking like we claim the result is by design.
>
> Yeah, I think I saw (and I think I used the same technique myself)
> people expect a bad output with test_expect_success with an in-code
> (not in-log) comment that explicitly says "This documents the
> current behaviour, which is wrong", and that is a very acceptable
> solution, I would think.
>
>> One possible way to correct this is to create a "test_expected_failure"
>> helper that could be placed on the step(s) of the &&-chain that are
>> expected to fail. The helper could set some variable to true if the
>> failure is hit, and false otherwise. It can also convert a failure
>> into a positive result. Then, test_expect_failure could look for that
>> variable's value (after verifying that the &&-chain returns success)
>> to show that all expected failures completed correctly.
>
> Yup, I would very much like the direction, and further imagine that
> the above approach can be extended to ...
>
>> This could have the side-effect of having a "fixed" test_expect_failure
>> show as a failed test, not a "TODO" message.
>
> ... avoid such downside.  Perhaps call that magic "we know this step
> fails currently" test_known_breakage and declare that we deprecate
> the use of test_expect_failure in new tests.  Such a test might look
> like this:
>
>     test_expect_success 'commit error message should not duplicate' '
> 	test_when_finished "chmod -R u+rwx ." &&
> 	chmod u-rwx .git/objects/ &&
> 	orig_head=$(git rev-parse HEAD) &&
> 	test_must_fail git commit --allow-empty -m "read-only" 2>rawerr &&
> 	grep "insufficient permission" rawerr >err &&
> 	test_known_breakage test_line_count = 1 err &&
> 	new_head=$(git rev-parse HEAD) &&
> 	test "$orig_head" = "$new_head"
>     '
>
> which may use your trick to turn both failure and success to OK (to
> let the remainder of the test to continue) but signal the
> surrounding test_expect_success to say either "TODO know breakage"
> or "Fixed".

I don't see how it's a downside. Considering the behavior bad now
shouldn't entail that we should be fuzzy about testing what exactly *is*
happening right now. If one bad but expected state turns into another
unexpected bad state the test should fail.

In this case this thread spawned off a fix where we print an error
twice, instead of once:
https://lore.kernel.org/git/cover-v2-0.2-00000000000-20211012T142950Z-avarab@gmail.com/#t

That sucks a bit, but printing pages full of such errors in a loop would
be way worse, which we'll hide if we insist on not testing the exact
emitted output, or on such "test_known_breakage" helpers.

It's just a downside because when we fix bugs we'll need to go through
the "expected failure" tests and adjust them, but that seems like a
feature to me.

Now I can submit a patch that fixes a known bug with no test suite
changes, and I might not even notice that I fixed one.

We may want to have tests for something that really is nondeterministic,
e.g. for the code I added in 2d3c02f5db6 (die(): stop hiding errors due
to overzealous recursion guard, 2017-06-21).

But that'll only be the case for some tiny minority (or none) of the
existing callers of "test_expect_failure".

1. https://lore.kernel.org/git/87tuhmk19c.fsf@evledraar.gmail.com/

test-lib.sh musings: test_expect_failure considered harmful

Commit Message

Comments

Patch