test-lib: make '--stress' more bisect-friendly

Message ID	20190208115045.13256-1-szeder.dev@gmail.com (mailing list archive)
State	New, archived
Headers
  Return-Path: <git-owner@kernel.org>
  From: SZEDER Gábor <szeder.dev@gmail.com>
  To: Junio C Hamano <gitster@pobox.com>
  Cc: Jeff King <peff@peff.net>, git@vger.kernel.org, SZEDER Gábor <szeder.dev@gmail.com>
  Subject: [PATCH] test-lib: make '--stress' more bisect-friendly
  Date: Fri, 8 Feb 2019 12:50:45 +0100
  Message-Id: <20190208115045.13256-1-szeder.dev@gmail.com>
  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
  Sender: git-owner@vger.kernel.org
  Precedence: bulk

SZEDER Gábor Feb. 8, 2019, 11:50 a.m. UTC

Let's suppose that a test somehow becomes flaky between 'master' and
'pu', and tends to fail within the first 50 repetitions when run with
'--stress'.  In such a case we could use 'git bisect' to find the
culprit: if the test script fails with '--stress', then the commit is
definitely bad, but if it survives, say, 300 repetitions, then we could
consider it good with reasonable confidence.

Unfortunately, all this could only be done manually, because
'--stress' would run the test script repeatedly for all eternity on a
good commit, and it would exit with success even when it found a
failure on a bad commit.

So let's make '--stress' usable with 'git bisect run':

  - Make it exit with failure if a failure is found.

  - Add the '--stress-limit=<N>' option to repeat the test script
    at most N times in each of the parallel jobs, and exit with
    success when the limit is reached.
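
As a rough illustration of the two points above (not the actual
test-lib.sh code), a bounded repeat loop that reports failure through its
exit status could look like this.  The names 'run_once', 'fail_at' and
'stress_with_limit' are made up for this sketch:

```shell
# Hypothetical stand-in for one run of the test script: it fails on
# the repetition whose number equals $fail_at (empty: never fails).
run_once () {
	test -z "$fail_at" || test "$1" -ne "$fail_at"
}

# Repeat run_once at most $1 times.  Return 1 as soon as a run fails,
# and 0 if the limit is reached without a failure.
stress_with_limit () {
	limit=$1
	cnt=1
	while test "$cnt" -le "$limit"
	do
		run_once "$cnt" || return 1
		cnt=$((cnt + 1))
	done
	return 0
}
```

With 'fail_at=3', 'stress_with_limit 5' returns 1; with 'fail_at' empty it
returns 0 after five repetitions, which is the pair of outcomes that
'git bisect run' needs to tell bad commits from good ones.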

And then we could simply run something like:

  $ git bisect start origin/pu master
  $ git bisect run sh -c 'make && cd t &&
                          ./t1234-foo.sh --stress --stress-limit=300'
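
One caveat with a one-liner like this: 'git bisect run' treats exit
code 0 as good, 125 as "skip this commit", and other codes from 1 to
127 as bad, so a commit that fails to build would be marked bad.  A
possible variant, sketched here with a hypothetical 'bisect_step' helper
that is not part of this patch, maps build failures to 125:

```shell
# Hypothetical helper for 'git bisect run'.
#   $1: build command, $2: test command (both run through 'sh -c').
bisect_step () {
	build_cmd=$1
	test_cmd=$2
	# 125 asks 'git bisect run' to skip a commit that cannot be tested.
	sh -c "$build_cmd" || return 125
	# Otherwise the exit status of the test decides good or bad.
	sh -c "$test_cmd"
}
```

It could be saved in a script that ends with something like
'bisect_step make "cd t && ./t1234-foo.sh --stress --stress-limit=300"'
and passed to 'git bisect run'.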

Sure, as a brand new feature it won't be of much use right now, but in
a release or three most cooking topics will already contain this, so
we could automatically bisect at least newly introduced flakiness.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---

This is a case when an external stress script works better, as it can
easily check commits in the past...  if someone has such a script,
that is.

Anyway, the approach works:

  https://public-inbox.org/git/20190129213533.GE13764@szeder.dev/
  https://public-inbox.org/git/20190208113059.GV10587@szeder.dev/

 t/README      |  5 +++++
 t/test-lib.sh | 18 ++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

Jeff King Feb. 8, 2019, 4:47 p.m. UTC | #1

On Fri, Feb 08, 2019 at 12:50:45PM +0100, SZEDER Gábor wrote:

>   - Make it exit with failure if a failure is found.
> 
>   - Add the '--stress-limit=<N>' option to repeat the test script
>     at most N times in each of the parallel jobs, and exit with
>     success when the limit is reached.
> [...]
> 
> This is a case when an external stress script works better, as it can
> easily check commits in the past...  if someone has such a script,
> that is.

Heh, I literally just implemented this kind of max-count in my own
"stress" script[1] to handle this recent t0025 testing. So certainly I
think it is a good idea.

Picking an <N> is tough. Too low and you get a false negative, too high
and you can wait forever, especially if the script is long. But I don't
think there's any real way to auto-scale it, except by seeing a few of
the failing cases and watching how long they take.

>  t/README      |  5 +++++
>  t/test-lib.sh | 18 ++++++++++++++++--
>  2 files changed, 21 insertions(+), 2 deletions(-)

Patch looks good. A few observations:

> @@ -237,8 +248,10 @@ then
>  				exit 1
>  			' TERM INT
>  
> -			cnt=0
> -			while ! test -e "$stressfail"
> +			cnt=1
> +			while ! test -e "$stressfail" &&
> +			      { test -z "$stress_limit" ||
> +				test $cnt -le $stress_limit ; }
>  			do
>  				$TEST_SHELL_PATH "$0" "$@" >"$TEST_RESULTS_BASE.stress-$job_nr.out" 2>&1 &
>  				test_pid=$!

You switch to 1-indexing the counts here. I think that makes sense,
since otherwise --stress-limit=300 would end at "1.299", etc.
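
To make the off-by-one concrete, here is a small sketch, assuming run
labels of the form "<job>.<count>" as the "1.299" above suggests.  The
'last_label' helper is hypothetical and only prints the label of the
final repetition in job 1:

```shell
# Print the label of the last repetition in job 1 when counting
# starts at $1 and $2 repetitions are run in total.
last_label () {
	cnt=$1
	runs=$2
	done_runs=0
	label=
	while test "$done_runs" -lt "$runs"
	do
		label="1.$cnt"
		cnt=$((cnt + 1))
		done_runs=$((done_runs + 1))
	done
	echo "$label"
}

last_label 0 300	# prints 1.299
last_label 1 300	# prints 1.300
```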

> @@ -261,6 +274,7 @@ then
>  
>  	if test -f "$stressfail"
>  	then
> +		stress_exit=1
>  		echo "Log(s) of failed test run(s):"
>  		for failed_job_nr in $(sort -n "$stressfail")
>  		do

I think I'd argue that this missing stress_exit is a bug in the original
script, and somewhat orthogonal to the limit counter. But I don't think
it's worth the trouble to split it out (and certainly the theme of "now
you can run this via bisect" unifies the two changes).

-Peff

Jeff King Feb. 8, 2019, 4:49 p.m. UTC | #2

On Fri, Feb 08, 2019 at 11:47:33AM -0500, Jeff King wrote:

> > This is a case when an external stress script works better, as it can
> > easily check commits in the past...  if someone has such a script,
> > that is.
> 
> Heh, I literally just implemented this kind of max-count in my own
> "stress" script[1] to handle this recent t0025 testing. So certainly I
> think it is a good idea.

As usual, I forgot my footnote. ;)

It was:

  I've actually mostly given up my stress script in favor of --stress. I
  only used it here because of the bisection issue you mention.

  One other thing I've noticed with it: I forget to add my custom
  --root=/var/ram/git-tests when I invoke it, so my hard disk goes
  crazy (and the tests often run slower!). I'm not sure if there's a
  convenient fix.

-Peff

SZEDER Gábor Feb. 8, 2019, 6:23 p.m. UTC | #3

On Fri, Feb 08, 2019 at 11:47:33AM -0500, Jeff King wrote:
> On Fri, Feb 08, 2019 at 12:50:45PM +0100, SZEDER Gábor wrote:
> 
> >   - Make it exit with failure if a failure is found.
> > 
> >   - Add the '--stress-limit=<N>' option to repeat the test script
> >     at most N times in each of the parallel jobs, and exit with
> >     success when the limit is reached.
> > [...]
> > 
> > This is a case when an external stress script works better, as it can
> > easily check commits in the past...  if someone has such a script,
> > that is.
> 
> Heh, I literally just implemented this kind of max-count in my own
> "stress" script[1] to handle this recent t0025 testing. So certainly I
> think it is a good idea.
> 
> Picking an <N> is tough. Too low and you get a false negative, too high
> and you can wait forever, especially if the script is long. But I don't
> think there's any real way to auto-scale it, except by seeing a few of
> the failing cases and watching how long they take.

So far I've chosen <N> like this: run the test script with --stress
3-5 times to trigger the failure, take the highest repetition count
that was necessary for the failure, multiply it by 4-6 to get a round
number, and that's a good ballpark for <N>.  And once bisect came up
with the suspect commit, I double checked it by letting the test
script run with --stress on its parent commit for at least 5-10x <N>
repetitions.
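
The arithmetic of that rule of thumb could be sketched as follows.  The
'suggest_limit' helper and the factor of 5 (picked from the 4-6 range)
are assumptions for illustration, and the result is still meant to be
rounded by hand:

```shell
# Hypothetical helper: given the repetition counts at which a few
# --stress runs failed, print five times the highest count.
suggest_limit () {
	max=0
	for n in "$@"
	do
		if test "$n" -gt "$max"
		then
			max=$n
		fi
	done
	echo $((max * 5))
}

suggest_limit 12 47 31	# prints 235, which one might round to 250
```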

Anyway, I doubt that auto-scaling <N> is worth the effort.

> >  t/README      |  5 +++++
> >  t/test-lib.sh | 18 ++++++++++++++++--
> >  2 files changed, 21 insertions(+), 2 deletions(-)
> 
> Patch looks good. A few observations:
> 
> > @@ -237,8 +248,10 @@ then
> >  				exit 1
> >  			' TERM INT
> >  
> > -			cnt=0
> > -			while ! test -e "$stressfail"
> > +			cnt=1
> > +			while ! test -e "$stressfail" &&
> > +			      { test -z "$stress_limit" ||
> > +				test $cnt -le $stress_limit ; }
> >  			do
> >  				$TEST_SHELL_PATH "$0" "$@" >"$TEST_RESULTS_BASE.stress-$job_nr.out" 2>&1 &
> >  				test_pid=$!
> 
> You switch to 1-indexing the counts here. I think that makes sense,
> since otherwise --stress-limit=300 would end at "1.299", etc.

Yeah, that's exactly why I did it.

> 
> > @@ -261,6 +274,7 @@ then
> >  
> >  	if test -f "$stressfail"
> >  	then
> > +		stress_exit=1
> >  		echo "Log(s) of failed test run(s):"
> >  		for failed_job_nr in $(sort -n "$stressfail")
> >  		do
> 
> I think I'd argue that this missing stress_exit is a bug in the original
> script,

Well, yes, indeed.

Though being able to trigger an elusive test failure is a success in
my book ;)

> and somewhat orthogonal to the limit counter. But I don't think
> it's worth the trouble to split it out (and certainly the theme of "now
> you can run this via bisect" unifies the two changes).
> 
> -Peff

SZEDER Gábor Feb. 8, 2019, 6:33 p.m. UTC | #4

On Fri, Feb 08, 2019 at 11:49:37AM -0500, Jeff King wrote:
>   One other thing I've noticed with it: I forget to add my custom
>   --root=/var/ram/git-tests when I invoke it, so my hard disk goes
>   crazy (and the tests often run slower!). I'm not sure if there's a
>   convenient fix.

OTOH, that could introduce more variance in the timing of the test's
commands, thus potentially increasing the chances of a failure.  I
dunno.

Maybe ./t1234-foo.sh should learn to respect DEFAULT_TEST_OPTS
somehow?

Jeff King Feb. 8, 2019, 7:11 p.m. UTC | #5

On Fri, Feb 08, 2019 at 07:23:19PM +0100, SZEDER Gábor wrote:

> > Picking an <N> is tough. Too low and you get a false negative, too high
> > and you can wait forever, especially if the script is long. But I don't
> > think there's any real way to auto-scale it, except by seeing a few of
> > the failing cases and watching how long they take.
> 
> So far I've chosen <N> like this: run the test script with --stress
> 3-5 times to trigger the failure, take the highest repetition count
> that was necessary for the failure, multiply it by 4-6 to get a round
> number, and that's a good ballpark for <N>.  And once bisect came up
> with the suspect commit, I double checked it by letting the test
> script run with --stress on its parent commit for at least 5-10x <N>
> repetitions.

Heh. That's exactly my process, too. :)

> Anyway, I doubt that auto-scaling <N> is worth the effort.

Yeah, especially because as a concept it exists outside of the script
itself (i.e., you have to check out a failing version and then run the
script a bunch of times; that's not something that test-lib.sh should
even know about).

So let's go with this for now. It's already a much nicer tool than we
had yesterday, so we can take some time to get used to it.

-Peff

Jeff King Feb. 8, 2019, 7:12 p.m. UTC | #6

On Fri, Feb 08, 2019 at 07:33:07PM +0100, SZEDER Gábor wrote:

> On Fri, Feb 08, 2019 at 11:49:37AM -0500, Jeff King wrote:
> >   One other thing I've noticed with it: I forget to add my custom
> >   --root=/var/ram/git-tests when I invoke it, so my hard disk goes
> >   crazy (and the tests often run slower!). I'm not sure if there's a
> >   convenient fix.
> 
> OTOH, that could introduce more variance in the timing of the test's
> commands, thus potentially increasing the chances of a failure.  I
> dunno.
> 
> Maybe ./t1234-foo.sh should learn to respect DEFAULT_TEST_OPTS
> somehow?

Yeah, that was what I was thinking. On the other hand, I'd actually find
that a little bit annoying for the non-stress case. I commonly do
"./t1234-foo.sh" in order to dig into a specific breakage, and having
the failing trash directory right there is convenient (and I don't care
as much about speed, since I'm just running it once).

I may just gut my "stress" script and make it a wrapper for calling
the script with "--stress --root=...". :)
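
Such a wrapper might be as small as the sketch below.  The function
name 'stress' and the STRESS_ROOT and STRESS_DRY_RUN variables are
hypothetical; only '--stress', '--root' and the /var/ram/git-tests path
come from this thread:

```shell
# Hypothetical wrapper: run a test script with --stress and a
# RAM-backed --root, passing any further options through.
#   STRESS_ROOT     overrides the default root directory.
#   STRESS_DRY_RUN  if non-empty, print the command instead of running it.
stress () {
	script=$1
	shift
	${STRESS_DRY_RUN:+echo} "./$script" --stress \
		--root="${STRESS_ROOT:-/var/ram/git-tests}" "$@"
}
```

From the 't' directory, 'stress t1234-foo.sh --stress-limit=300' would
then always pick up the custom root, while a plain './t1234-foo.sh' keeps
its trash directory right there for debugging.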

-Peff
