[RFC,3/3] test-lib: add the '--stress' option to run a test repeatedly under load
diff mbox series

Message ID 20181204163457.15717-4-szeder.dev@gmail.com
State New
Headers show
Series
  • test-lib: add the '--stress' option to help reproduce occasional failures in flaky tests
Related show

Commit Message

SZEDER Gábor Dec. 4, 2018, 4:34 p.m. UTC
Unfortunately, we have a few flaky tests, whose failures tend to be
hard to reproduce.  We've found that the best we can do to reproduce
such a failure is to run the test repeatedly while the machine is
under load, and wait in the hope that the load creates enough variance
in the timing of the test's commands that a failure is evenually
triggered.  I have a command to do that, and I noticed that two other
contributors have rolled their own scripts to do the same, all
choosing slightly different approaches.

To help reproduce failures in flaky tests, introduce the '--stress'
option to run a test script repeatedly in multiple parallel
invocations until one of them fails, thereby using the test script
itself to increase the load on the machine.

The number of parallel invocations is determined by, in order of
precedence: the number specified as '--stress=<N>', or the value of
the GIT_TEST_STRESS_LOAD environment variable, or twice the number of
available processors in '/proc/cpuinfo', or 8.

To prevent the several parallel invocations of the same test from
interfering with each other:

  - Include the parallel job's number in the name of the trash
    directory and the various output files under 't/test-results/' as
    a '.stress-<Nr>' suffix.

  - Add the parallel job's number to the port number specified by the
    user or to the test number, so even tests involving daemons
    listening on a TCP socket can be stressed.

  - Make '--stress' imply '--verbose-log' and discard the test's
    standard ouput and error; dumping the output of several parallel
    tests to the terminal would create a big ugly mess.

'wait' for all parallel jobs before exiting (either because a failure
was found or because the user lost patience and aborted the stress
test), allowing the still running tests to finish.  Otherwise the "OK
X.Y" progress output from the last iteration would likely arrive after
the user got back the shell prompt, interfering with typing in the
next command.  OTOH, this waiting might induce a considerable delay
between hitting ctrl-C and the test actually exiting; I'm not sure
this is the right tradeoff.

Based on Jeff King's 'stress' script.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 t/README                | 13 ++++++-
 t/test-lib-functions.sh |  7 +++-
 t/test-lib.sh           | 82 +++++++++++++++++++++++++++++++++++++++--
 3 files changed, 96 insertions(+), 6 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Dec. 4, 2018, 5:04 p.m. UTC | #1
On Tue, Dec 04 2018, SZEDER Gábor wrote:

> The number of parallel invocations is determined by, in order of
> precedence: the number specified as '--stress=<N>', or the value of
> the GIT_TEST_STRESS_LOAD environment variable, or twice the number of
> available processors in '/proc/cpuinfo', or 8.

With this series we have at least 3 ways to get this number. First
online_cpus() in the C code, then Peff's recent `getconf
_NPROCESSORS_ONLN` in doc-diff, and now this /proc/cpuinfo parsing.

Perhaps it makes sense to split online_cpus() into a helper to use from
the shellscripts instead?
SZEDER Gábor Dec. 4, 2018, 5:37 p.m. UTC | #2
On Tue, Dec 04, 2018 at 06:04:14PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Dec 04 2018, SZEDER Gábor wrote:
> 
> > The number of parallel invocations is determined by, in order of
> > precedence: the number specified as '--stress=<N>', or the value of
> > the GIT_TEST_STRESS_LOAD environment variable, or twice the number of
> > available processors in '/proc/cpuinfo', or 8.
> 
> With this series we have at least 3 ways to get this number. First
> online_cpus() in the C code, then Peff's recent `getconf
> _NPROCESSORS_ONLN` in doc-diff, and now this /proc/cpuinfo parsing.
> 
> Perhaps it makes sense to split online_cpus() into a helper to use from
> the shellscripts instead?

I don't think so, but I will update this patch to use 'getconf'
instead.
Ævar Arnfjörð Bjarmason Dec. 4, 2018, 6:11 p.m. UTC | #3
On Tue, Dec 04 2018, SZEDER Gábor wrote:

> Unfortunately, we have a few flaky tests, whose failures tend to be
> hard to reproduce.  We've found that the best we can do to reproduce
> such a failure is to run the test repeatedly while the machine is
> under load, and wait in the hope that the load creates enough variance
> in the timing of the test's commands that a failure is evenually
> triggered.  I have a command to do that, and I noticed that two other
> contributors have rolled their own scripts to do the same, all
> choosing slightly different approaches.
>
> To help reproduce failures in flaky tests, introduce the '--stress'
> option to run a test script repeatedly in multiple parallel
> invocations until one of them fails, thereby using the test script
> itself to increase the load on the machine.
>
> The number of parallel invocations is determined by, in order of
> precedence: the number specified as '--stress=<N>', or the value of
> the GIT_TEST_STRESS_LOAD environment variable, or twice the number of
> available processors in '/proc/cpuinfo', or 8.
>
> To prevent the several parallel invocations of the same test from
> interfering with each other:
>
>   - Include the parallel job's number in the name of the trash
>     directory and the various output files under 't/test-results/' as
>     a '.stress-<Nr>' suffix.
>
>   - Add the parallel job's number to the port number specified by the
>     user or to the test number, so even tests involving daemons
>     listening on a TCP socket can be stressed.
>
>   - Make '--stress' imply '--verbose-log' and discard the test's
>     standard ouput and error; dumping the output of several parallel
>     tests to the terminal would create a big ugly mess.
>
> 'wait' for all parallel jobs before exiting (either because a failure
> was found or because the user lost patience and aborted the stress
> test), allowing the still running tests to finish.  Otherwise the "OK
> X.Y" progress output from the last iteration would likely arrive after
> the user got back the shell prompt, interfering with typing in the
> next command.  OTOH, this waiting might induce a considerable delay
> between hitting ctrl-C and the test actually exiting; I'm not sure
> this is the right tradeoff.

I think it makes sense to generalize this and split it up into two
features.

It's a frequent annoyance of mine in the test suite that I'm
e.g. running t*.sh with some parallel "prove" in one screen, and then I
run tABCD*.sh manually, and get unlucky because they use the same trash
dir, and both tests go boom.

You can fix that with --root, which is much of what this patch does. My
one-liner for doing --stress has been something like:

    perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh --root=trash-{} -v'

But it would be great if I didn't have to worry about that and could
just run two concurrent:

    ./t0000-basic.sh

So I think we could just set some env variable where instead of having
the predictable trash directory we have a $TRASHDIR.$N as this patch
does, except we pick $N by locking some test-runs/tABCD.Nth file with
flock() during the run.

Then a stress mode like this would just be:

    GIT_TEST_FLOCKED_TRASH_DIRS=1 perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh'

And sure, we could ship a --stress option too, but it wouldn't be
magical in any way, just another "spawn N in a loop" implementation, but
you could also e.g. use GNU parallel to drive it, and without needing to
decide to stress test in advance, since we'd either flock() the trash
dir, or just mktemp(1)-it.
Jeff King Dec. 5, 2018, 5:44 a.m. UTC | #4
On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:

> To prevent the several parallel invocations of the same test from
> interfering with each other:
> 
>   - Include the parallel job's number in the name of the trash
>     directory and the various output files under 't/test-results/' as
>     a '.stress-<Nr>' suffix.

Makes sense.

>   - Add the parallel job's number to the port number specified by the
>     user or to the test number, so even tests involving daemons
>     listening on a TCP socket can be stressed.

In my script I just use an arbitrary sequence of ports. That means two
stress runs may stomp on each other, but stress runs tend not to
interfere with regular runs (whereas with your scheme, a stress run of
t5561 is going to stomp on t5562). It probably doesn't matter much
either way, as I tend not to do too much with the machine during a
stress run.

>   - Make '--stress' imply '--verbose-log' and discard the test's
>     standard ouput and error; dumping the output of several parallel
>     tests to the terminal would create a big ugly mess.

Makes sense. My script just redirects the output to a file, but it
predates --verbose-log.

My script always runs with "-x". I guess it's not that hard to add it if
you want, but it is annoying to finally get a failure after several
minutes only to realize that your are missing some important
information. ;)

Ditto for "-i", which leaves more readable output (you know the
interesting part is at the end of the log), and leaves a trash directory
state that is more likely to be useful for inspecting.

My script also implies "--root". That's not strictly necessary, though I
suspect it is much more likely to catch races when it's run on a ram
disk (simply because it leaves the CPU as the main bottleneck).

> 'wait' for all parallel jobs before exiting (either because a failure
> was found or because the user lost patience and aborted the stress
> test), allowing the still running tests to finish.  Otherwise the "OK
> X.Y" progress output from the last iteration would likely arrive after
> the user got back the shell prompt, interfering with typing in the
> next command.  OTOH, this waiting might induce a considerable delay
> between hitting ctrl-C and the test actually exiting; I'm not sure
> this is the right tradeoff.

If there is a delay, it's actually quite annoying to _not_ wait; you
start doing something in the terminal, and then a bunch of extra test
statuses print at a random time.

> +	job_nr=0
> +	while test $job_nr -lt "$job_count"
> +	do
> +		(
> +			GIT_TEST_STRESS_STARTED=done
> +			GIT_TEST_STRESS_JOB_NR=$job_nr
> +			export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
> +
> +			cnt=0
> +			while ! test -e "$stressfail"
> +			do
> +				if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
> +				then
> +					printf >&2 "OK   %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt

OK, this adds a counter for each job number (compared to my script).
Seems helpful.

> +				elif test -f "$stressfail" &&
> +				     test "$(cat "$stressfail")" = "aborted"
> +				then
> +					printf >&2 "ABORTED %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> +				else
> +					printf >&2 "FAIL %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> +					echo $GIT_TEST_STRESS_JOB_NR >>"$stressfail"
> +				fi

Hmm. What happens if we see a failure _and_ there's an abort? That's
actually pretty plausible if you see a failure whiz by, and you want to
abort the remaining scripts because they take a long time to run.

If the abort happens first, then the goal is to assume any errors are
due to the abort. But there's a race between the individual jobs reading
$stressfail and the parent signal handler writing it. So you may end up
with "aborted\n5" or similar (after which any other jobs would fail to
match "aborted" and declare themselves failures, as well).  I think my
script probably also has this race, too (it doesn't check the abort case
explicitly, but it would race in checking "test -e fail").

If the fail happens first (which I think is the more likely case), then
the abort handler will overwrite the file with "aborted", and we'll
forget that there was a real failure. This works in my script because of
the symlinking (if a real failure symlinked $fail to a directory, then
the attempt to write "aborted" will just be a noop).

> +	job_nr=0
> +	while test $job_nr -lt "$job_count"
> +	do
> +		wait
> +		job_nr=$(($job_nr + 1))
> +	done

Do we need to loop? Calling "wait" with no arguments should wait for all
children.

> +	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
> +	then
> +		echo "Log(s) of failed test run(s) be found in:"
> +		for f in $(cat "$stressfail")
> +		do
> +			echo "  $TEST_RESULTS_BASE.stress-$f.out"
> +		done
> +	fi

In my experience, outputting the failed log saves a step (especially
with "-i"), since seeing the failed test and its output is often
sufficient.

I'm also sad to lose the symlink to the failed trash directory, which
saves cutting and pasting when you have to inspect it. But I guess we
can't rely on symlinks everywhere.

-Peff
Jeff King Dec. 5, 2018, 5:46 a.m. UTC | #5
On Tue, Dec 04, 2018 at 06:04:14PM +0100, Ævar Arnfjörð Bjarmason wrote:

> 
> On Tue, Dec 04 2018, SZEDER Gábor wrote:
> 
> > The number of parallel invocations is determined by, in order of
> > precedence: the number specified as '--stress=<N>', or the value of
> > the GIT_TEST_STRESS_LOAD environment variable, or twice the number of
> > available processors in '/proc/cpuinfo', or 8.
> 
> With this series we have at least 3 ways to get this number. First
> online_cpus() in the C code, then Peff's recent `getconf
> _NPROCESSORS_ONLN` in doc-diff, and now this /proc/cpuinfo parsing.

To be fair, I only added the "getconf" thing because somebody complained
that I was parsing /proc/cpuinfo, and suggested it instead. ;)

I don't think it's especially portable, but it should work on Linux and
modern BSD/macOS, which may be enough (unlike doc-diff, I'd be a little
more inclined to ask somebody on a random platform to stress test if
they're hitting a bug).

> Perhaps it makes sense to split online_cpus() into a helper to use from
> the shellscripts instead?

I think somebody proposed that in a recent thread for other reasons,
too. I'd be OK with that, but probably just using getconf is reasonable
in the meantime.

-Peff
Jeff King Dec. 5, 2018, 5:50 a.m. UTC | #6
On Tue, Dec 04, 2018 at 07:11:08PM +0100, Ævar Arnfjörð Bjarmason wrote:

> It's a frequent annoyance of mine in the test suite that I'm
> e.g. running t*.sh with some parallel "prove" in one screen, and then I
> run tABCD*.sh manually, and get unlucky because they use the same trash
> dir, and both tests go boom.
> 
> You can fix that with --root, which is much of what this patch does. My
> one-liner for doing --stress has been something like:
> 
>     perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh --root=trash-{} -v'
> 
> But it would be great if I didn't have to worry about that and could
> just run two concurrent:
> 
>     ./t0000-basic.sh
> 
> So I think we could just set some env variable where instead of having
> the predictable trash directory we have a $TRASHDIR.$N as this patch
> does, except we pick $N by locking some test-runs/tABCD.Nth file with
> flock() during the run.

That actually sounds kind of annoying when doing a single run, since now
you have this extra ".N". I guess it would at least be predictable, and
I tend to tab-complete the trash dirs anyway.

I accomplish the same thing by doing my "big" runs using --root
specified in config.mak (which points to a RAM disk -- if you're not
using one to run the tests, you really should look into it, as it's way
faster). And then for one-offs, investigating failures, etc, I do "cd t
&& ./t0000-basic.sh -v -i", which naturally runs in the local directory.

-Peff
SZEDER Gábor Dec. 5, 2018, 10:34 a.m. UTC | #7
Just a quick reply to this one point for now:

On Wed, Dec 05, 2018 at 12:44:09AM -0500, Jeff King wrote:
> On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:
> > +	job_nr=0
> > +	while test $job_nr -lt "$job_count"
> > +	do
> > +		wait
> > +		job_nr=$(($job_nr + 1))
> > +	done
> 
> Do we need to loop? Calling "wait" with no arguments should wait for all
> children.

It should, but in dash, ksh, ksh93, Bash v4.2 or older it doesn't seem
to do so, at least not on my machine, and I get back my shell prompt
right after hitting ctrl-C or the first failed test, and then get the
progress output from the remaining jobs later.

Bash 4.3 or later are strange: I get back the shell prompt immediately
after ctrl-C as well, so it doesn't appear to be waiting for all
remaining jobs to finish either, but! I don't get any of the progress
output from those jobs to mess up my next command.

And mksh and zsh can't run our tests, and I don't have any more shells
at hand to try.
SZEDER Gábor Dec. 5, 2018, 12:07 p.m. UTC | #8
On Tue, Dec 04, 2018 at 07:11:08PM +0100, Ævar Arnfjörð Bjarmason wrote:
> It's a frequent annoyance of mine in the test suite that I'm
> e.g. running t*.sh with some parallel "prove" in one screen, and then I
> run tABCD*.sh manually, and get unlucky because they use the same trash
> dir, and both tests go boom.
> 
> You can fix that with --root, which is much of what this patch does. My
> one-liner for doing --stress has been something like:
> 
>     perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh --root=trash-{} -v'
> 
> But it would be great if I didn't have to worry about that and could
> just run two concurrent:
> 
>     ./t0000-basic.sh
> 
> So I think we could just set some env variable where instead of having
> the predictable trash directory we have a $TRASHDIR.$N as this patch
> does, except we pick $N by locking some test-runs/tABCD.Nth file with
> flock() during the run.

Is 'flock' portable?  It doesn't appear so.

> Then a stress mode like this would just be:
> 
>     GIT_TEST_FLOCKED_TRASH_DIRS=1 perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh'

This doesn't address the issue of TCP ports for various daemons.

> And sure, we could ship a --stress option too, but it wouldn't be
> magical in any way, just another "spawn N in a loop" implementation, but
> you could also e.g. use GNU parallel to drive it, and without needing to

GNU 'parallel' may or may not be available on the platform, that's why
I stuck with the barebones shell-loops-in-the-background approach.

> decide to stress test in advance, since we'd either flock() the trash
> dir, or just mktemp(1)-it.

While 'mktemp' seems to be more portable than 'flock', it doesn't seem
to be portable enough; at least it's not in POSIX.

And in general I'd prefer deterministic trash directory names.  If I
re-run a failed test after fixing the bug, then currently the trash
directory will be overwritten, and that's good.  With 'mktemp's junk
in the trash direcory's name it won't be overwritten, and if my bugfix
doesn't work, then I'll start accumulating trash directories and won't
even know which one is from the last run.
SZEDER Gábor Dec. 5, 2018, 2:01 p.m. UTC | #9
On Wed, Dec 05, 2018 at 12:44:09AM -0500, Jeff King wrote:
> On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:
> 
> > To prevent the several parallel invocations of the same test from
> > interfering with each other:
> > 
> >   - Include the parallel job's number in the name of the trash
> >     directory and the various output files under 't/test-results/' as
> >     a '.stress-<Nr>' suffix.
> 
> Makes sense.
> 
> >   - Add the parallel job's number to the port number specified by the
> >     user or to the test number, so even tests involving daemons
> >     listening on a TCP socket can be stressed.
> 
> In my script I just use an arbitrary sequence of ports. That means two
> stress runs may stomp on each other, but stress runs tend not to
> interfere with regular runs (whereas with your scheme, a stress run of
> t5561 is going to stomp on t5562). It probably doesn't matter much
> either way, as I tend not to do too much with the machine during a
> stress run.

A custom port can always be set via environment variables, though the
user has to remember to set it (I doubt that I would remember it until
reminded by test failures...)

> >   - Make '--stress' imply '--verbose-log' and discard the test's
> >     standard ouput and error; dumping the output of several parallel
> >     tests to the terminal would create a big ugly mess.
> 
> Makes sense. My script just redirects the output to a file, but it
> predates --verbose-log.
> 
> My script always runs with "-x". I guess it's not that hard to add it if
> you want, but it is annoying to finally get a failure after several
> minutes only to realize that your are missing some important
> information. ;)
> 
> Ditto for "-i", which leaves more readable output (you know the
> interesting part is at the end of the log), and leaves a trash directory
> state that is more likely to be useful for inspecting.

I wanted to imply only the bare minimum of options that are necessary
to prevent the tests from stomping on each other.  Other than that I
agree and wouldn't run '--stress' without '-x -i' myself, and at one
point considered to recommend '-x -i' in the description of
'--stress'.

> My script also implies "--root". That's not strictly necessary, though I
> suspect it is much more likely to catch races when it's run on a ram
> disk (simply because it leaves the CPU as the main bottleneck).
> 
> > 'wait' for all parallel jobs before exiting (either because a failure
> > was found or because the user lost patience and aborted the stress
> > test), allowing the still running tests to finish.  Otherwise the "OK
> > X.Y" progress output from the last iteration would likely arrive after
> > the user got back the shell prompt, interfering with typing in the
> > next command.  OTOH, this waiting might induce a considerable delay
> > between hitting ctrl-C and the test actually exiting; I'm not sure
> > this is the right tradeoff.
> 
> If there is a delay, it's actually quite annoying to _not_ wait; you
> start doing something in the terminal, and then a bunch of extra test
> statuses print at a random time.

At least the INT/TERM/etc. handler should say something like "Waiting
for background jobs to finish", to let the user know why it doesn't
exit right away.

An alternative would be to exit right away, and make the background
loops a tad bit smarter to not print these progress lines when
aborting.

> > +	job_nr=0
> > +	while test $job_nr -lt "$job_count"
> > +	do
> > +		(
> > +			GIT_TEST_STRESS_STARTED=done
> > +			GIT_TEST_STRESS_JOB_NR=$job_nr
> > +			export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
> > +
> > +			cnt=0
> > +			while ! test -e "$stressfail"
> > +			do
> > +				if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
> > +				then
> > +					printf >&2 "OK   %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> 
> OK, this adds a counter for each job number (compared to my script).
> Seems helpful.
> 
> > +				elif test -f "$stressfail" &&
> > +				     test "$(cat "$stressfail")" = "aborted"
> > +				then
> > +					printf >&2 "ABORTED %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > +				else
> > +					printf >&2 "FAIL %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > +					echo $GIT_TEST_STRESS_JOB_NR >>"$stressfail"
> > +				fi
> 
> Hmm. What happens if we see a failure _and_ there's an abort? That's
> actually pretty plausible if you see a failure whiz by, and you want to
> abort the remaining scripts because they take a long time to run.
> 
> If the abort happens first, then the goal is to assume any errors are
> due to the abort.

Yeah, that's why I added this ABORTED case.  When I hit ctrl-C, a
couple of the tests tend to fail, but that's not the rare failure we
are hoping to find when stress testing, so printing FAIL would be
misleading.  The test script exits with 1 all the same, so we can't
tell the difference, but since rare failures are, well, rare, this
failure is most likely due to the abort.

> But there's a race between the individual jobs reading
> $stressfail and the parent signal handler writing it. So you may end up
> with "aborted\n5" or similar (after which any other jobs would fail to
> match "aborted" and declare themselves failures, as well).  I think my
> script probably also has this race, too (it doesn't check the abort case
> explicitly, but it would race in checking "test -e fail").
>
> If the fail happens first (which I think is the more likely case), then
> the abort handler will overwrite the file with "aborted", and we'll
> forget that there was a real failure. This works in my script because of
> the symlinking (if a real failure symlinked $fail to a directory, then
> the attempt to write "aborted" will just be a noop).

OK, I think the abort handler should append, not overwrite, and then
the file's contents should be checked e.g. with a case statement
looking for *aborted*.

Or use two separate files for signaling abort and test failure.

In any case, you could always grab the job number from the "FAIL X.Y"
line, and then look at its logs and trash direcory.

> > +	job_nr=0
> > +	while test $job_nr -lt "$job_count"
> > +	do
> > +		wait
> > +		job_nr=$(($job_nr + 1))
> > +	done
> 
> Do we need to loop? Calling "wait" with no arguments should wait for all
> children.
> 
> > +	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
> > +	then
> > +		echo "Log(s) of failed test run(s) be found in:"
> > +		for f in $(cat "$stressfail")
> > +		do
> > +			echo "  $TEST_RESULTS_BASE.stress-$f.out"
> > +		done
> > +	fi
> 
> In my experience, outputting the failed log saves a step (especially
> with "-i"), since seeing the failed test and its output is often
> sufficient.

I just don't like when test output, especially with '-x', is dumped on
my terminal :)

> I'm also sad to lose the symlink to the failed trash directory, which
> saves cutting and pasting when you have to inspect it. But I guess we
> can't rely on symlinks everywhere.

You can tab-complete most of the trash directory, and then there are
only those 1-2-3 digits of the job number that you have to type.  That
worked out well enough for me so far (but I've only used it for a few
days, so...).

We could rename the trash directory of the failed test to something
short, sweet, and common, but that could be racy in the unlikely case
that two tests fail at the same time.  And I like the 'trash
directory.' prefix, because then 'make clean' will delete that one,
too, without modifying 't/Makefile' (which is also the reason for
adding the 'stress-N' suffix instead of a prefix, and putting the fail
file into the 'test-results' directory).
Ævar Arnfjörð Bjarmason Dec. 5, 2018, 2:01 p.m. UTC | #10
On Wed, Dec 05 2018, SZEDER Gábor wrote:

> On Tue, Dec 04, 2018 at 07:11:08PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> It's a frequent annoyance of mine in the test suite that I'm
>> e.g. running t*.sh with some parallel "prove" in one screen, and then I
>> run tABCD*.sh manually, and get unlucky because they use the same trash
>> dir, and both tests go boom.
>>
>> You can fix that with --root, which is much of what this patch does. My
>> one-liner for doing --stress has been something like:
>>
>>     perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh --root=trash-{} -v'
>>
>> But it would be great if I didn't have to worry about that and could
>> just run two concurrent:
>>
>>     ./t0000-basic.sh
>>
>> So I think we could just set some env variable where instead of having
>> the predictable trash directory we have a $TRASHDIR.$N as this patch
>> does, except we pick $N by locking some test-runs/tABCD.Nth file with
>> flock() during the run.
>
> Is 'flock' portable?  It doesn't appear so.

Portable enough, and since it's just an alias for "grab lock atomically"
it can devolve into other more basic FS functions on systems that don't
have it.

>> Then a stress mode like this would just be:
>>
>>     GIT_TEST_FLOCKED_TRASH_DIRS=1 perl -E 'say ++$_ while 1' | parallel --jobs=100% --halt-on-error soon,fail=1 './t0000-basic.sh'
>
> This doesn't address the issue of TCP ports for various daemons.

Once we have a test ABCD and slot offset N (for a fixed size of N) we
can pick a port from that.

>> And sure, we could ship a --stress option too, but it wouldn't be
>> magical in any way, just another "spawn N in a loop" implementation, but
>> you could also e.g. use GNU parallel to drive it, and without needing to
>
> GNU 'parallel' may or may not be available on the platform, that's why
> I stuck with the barebones shell-loops-in-the-background approach.

Yes, my point is not that you should drop that part of your patch for
using GNU parallel, but just to demonstrate that we can get pretty close
to this for most tests with an external tool, and that it would make
this sort of thing work for concurrent tests without needing to decide
to concurrently test in advance.

>> decide to stress test in advance, since we'd either flock() the trash
>> dir, or just mktemp(1)-it.
>
> While 'mktemp' seems to be more portable than 'flock', it doesn't seem
> to be portable enough; at least it's not in POSIX.>

We are already relying on stuff like mktemp() being reliable for
e.g. the split index.

> And in general I'd prefer deterministic trash directory names.  If I
> re-run a failed test after fixing the bug, then currently the trash
> directory will be overwritten, and that's good.  With 'mktemp's junk
> in the trash direcory's name it won't be overwritten, and if my bugfix
> doesn't work, then I'll start accumulating trash directories and won't
> even know which one is from the last run.

In general you wouldn't need to set GIT_TEST_FLOCKED_TRASH_DIRS=1 and
would get the same monolithic trash names as now, and for something like
the flock() version it could use a job number as a suffix like your
patch does.
SZEDER Gábor Dec. 5, 2018, 2:39 p.m. UTC | #11
On Wed, Dec 05, 2018 at 03:01:41PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >> decide to stress test in advance, since we'd either flock() the trash
> >> dir, or just mktemp(1)-it.
> >
> > While 'mktemp' seems to be more portable than 'flock', it doesn't seem
> > to be portable enough; at least it's not in POSIX.>
> 
> We are already relying on stuff like mktemp() being reliable for
> e.g. the split index.

That's the mktemp() function from libc; I meant the 'mktemp' command
that we would use this early in 'test-lib.sh', where PATH has not been
set up for testing yet.
Ævar Arnfjörð Bjarmason Dec. 5, 2018, 7:59 p.m. UTC | #12
On Wed, Dec 05 2018, SZEDER Gábor wrote:

> On Wed, Dec 05, 2018 at 03:01:41PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> >> decide to stress test in advance, since we'd either flock() the trash
>> >> dir, or just mktemp(1)-it.
>> >
>> > While 'mktemp' seems to be more portable than 'flock', it doesn't seem
>> > to be portable enough; at least it's not in POSIX.>
>>
>> We are already relying on stuff like mktemp() being reliable for
>> e.g. the split index.
>
> That's the mktemp() function from libc; I meant the 'mktemp' command
> that we would use this early in 'test-lib.sh', where PATH has not been
> set up for testing yet.

Ah, if that ever becomes an issue it's a one-line test-tool wrapper.
Jeff King Dec. 5, 2018, 9:36 p.m. UTC | #13
On Wed, Dec 05, 2018 at 11:34:54AM +0100, SZEDER Gábor wrote:

> 
> Just a quick reply to this one point for now:
> 
> On Wed, Dec 05, 2018 at 12:44:09AM -0500, Jeff King wrote:
> > On Tue, Dec 04, 2018 at 05:34:57PM +0100, SZEDER Gábor wrote:
> > > +	job_nr=0
> > > +	while test $job_nr -lt "$job_count"
> > > +	do
> > > +		wait
> > > +		job_nr=$(($job_nr + 1))
> > > +	done
> > 
> > Do we need to loop? Calling "wait" with no arguments should wait for all
> > children.
> 
> It should, but in dash, ksh, ksh93, Bash v4.2 or older it doesn't seem
> to do so, at least not on my machine, and I get back my shell prompt
> right after hitting ctrl-C or the first failed test, and then get the
> progress output from the remaining jobs later.

That's weird. I run my stress script under dash with a single "wait",
and it handles the failing case just fine. And switching to a single
"wait" in your script works for me.

But the ^C case is interesting. Try running your script under "sh -x"
and hitting ^C. The signal interrupts the first wait. In my script (or
yours modified to use a single wait), we then proceed immediately to the
"exit", and get output from the subshells after we've exited.

But in your script with the loop, the first wait is interrupted, and
then the _second_ wait (in the second loop iteration) picks up all of
the exiting processes. The subsequent waits are all noops that return
immediately, because there are no processes left.

So the right number of waits is either "1" or "2". Looping means we do
too many (which is mostly a harmless noop) or too few (in the off chance
that you have only a single job ;) ). So it works out in practice.

But I think a more clear way to do it is to simply wait in the signal
trap, to cover the case when our original wait was aborted.

I.e., something like this:

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 9b7f687396..5e0c41626f 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -98,7 +98,7 @@ done,*)
 	mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
 	stressfail="$TEST_RESULTS_BASE.stress-failed"
 	rm -f "$stressfail"
-	trap 'echo aborted >"$stressfail"' TERM INT HUP
+	trap 'echo aborted >"$stressfail"; wait' TERM INT HUP
 
 	job_nr=0
 	while test $job_nr -lt "$job_count"
@@ -128,12 +128,7 @@ done,*)
 		job_nr=$(($job_nr + 1))
 	done
 
-	job_nr=0
-	while test $job_nr -lt "$job_count"
-	do
-		wait
-		job_nr=$(($job_nr + 1))
-	done
+	wait
 
 	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
 	then

which behaves well for me in all cases.

> Bash 4.3 or later are strange: I get back the shell prompt immediately
> after ctrl-C as well, so it doesn't appear to be waiting for all
> remaining jobs to finish either, but! I don't get any of the progress
> output from those jobs to mess up my next command.

Interesting. My bash 4.4 seems to behave the same as dash. It almost
sounds like the SIGINT is getting passed to the subshells for you.
Probably not really worth digging into, though.

In case anybody is playing with this and needed to simulate a stress
failure, here's what I used:

diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh
index b6566003dd..a370cd9977 100755
--- a/t/t0000-basic.sh
+++ b/t/t0000-basic.sh
@@ -1171,4 +1171,11 @@ test_expect_success 'very long name in the index handled sanely' '
 	test $len = 4098
 '
 
+test_expect_success 'roll those dice' '
+	case "$(openssl rand -base64 1)" in
+	z*)
+		return 1
+	esac
+'
+
 test_done
Jeff King Dec. 5, 2018, 9:56 p.m. UTC | #14
On Wed, Dec 05, 2018 at 03:01:06PM +0100, SZEDER Gábor wrote:

> > >   - Make '--stress' imply '--verbose-log' and discard the test's
> > >     standard ouput and error; dumping the output of several parallel
> > >     tests to the terminal would create a big ugly mess.
> > 
> > Makes sense. My script just redirects the output to a file, but it
> > predates --verbose-log.
> > 
> > My script always runs with "-x". I guess it's not that hard to add it if
> > you want, but it is annoying to finally get a failure after several
> > minutes only to realize that your are missing some important
> > information. ;)
> > 
> > Ditto for "-i", which leaves more readable output (you know the
> > interesting part is at the end of the log), and leaves a trash directory
> > state that is more likely to be useful for inspecting.
> 
> I wanted to imply only the bare minimum of options that are necessary
> to prevent the tests from stomping on each other.  Other than that I
> agree and wouldn't run '--stress' without '-x -i' myself, and at one
> point considered to recommend '-x -i' in the description of
> '--stress'.

Yeah, it probably is the right thing to make the options as orthogonal
as possible, and let the user decide what they want. I was just
wondering if I could replace my "stress" script with this. But I think
I'd still want it to remember to use a sane set of options (including
"--root" in my case).

> > > 'wait' for all parallel jobs before exiting (either because a failure
> > > was found or because the user lost patience and aborted the stress
> > > test), allowing the still running tests to finish.  Otherwise the "OK
> > > X.Y" progress output from the last iteration would likely arrive after
> > > the user got back the shell prompt, interfering with typing in the
> > > next command.  OTOH, this waiting might induce a considerable delay
> > > between hitting ctrl-C and the test actually exiting; I'm not sure
> > > this is the right tradeoff.
> > 
> > If there is a delay, it's actually quite annoying to _not_ wait; you
> > start doing something in the terminal, and then a bunch of extra test
> > statuses print at a random time.
> 
> At least the INT/TERM/etc. handler should say something like "Waiting
> for background jobs to finish", to let the user know why it doesn't
> exit right away.

That seems reasonable. There's also no point in waiting for them to
finish if we know we're aborting anyway. Could we just kill them all?

I guess it's a little tricky, because $! is going to give us the pid of
each subshell. We actually want to kill the test sub-process. That takes
a few contortions, but the output is nice (every sub-job immediately
says "ABORTED ...", and the final process does not exit until the whole
tree is done):

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 9b7f687396..357dead3ff 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -98,8 +98,9 @@ done,*)
 	mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
 	stressfail="$TEST_RESULTS_BASE.stress-failed"
 	rm -f "$stressfail"
-	trap 'echo aborted >"$stressfail"' TERM INT HUP
+	trap 'echo aborted >"$stressfail"; kill $job_pids 2>/dev/null; wait' TERM INT HUP
 
+	job_pids=
 	job_nr=0
 	while test $job_nr -lt "$job_count"
 	do
@@ -108,10 +109,13 @@ done,*)
 			GIT_TEST_STRESS_JOB_NR=$job_nr
 			export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
 
+			trap 'kill $test_pid 2>/dev/null' TERM
+
 			cnt=0
 			while ! test -e "$stressfail"
 			do
-				if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
+				$TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1 & test_pid=$!
+				if wait
 				then
 					printf >&2 "OK   %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
 				elif test -f "$stressfail" &&
@@ -124,16 +128,11 @@ done,*)
 				fi
 				cnt=$(($cnt + 1))
 			done
-		) &
+		) & job_pids="$job_pids $!"
 		job_nr=$(($job_nr + 1))
 	done
 
-	job_nr=0
-	while test $job_nr -lt "$job_count"
-	do
-		wait
-		job_nr=$(($job_nr + 1))
-	done
+	wait
 
 	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
 	then

> > > +				elif test -f "$stressfail" &&
> > > +				     test "$(cat "$stressfail")" = "aborted"
> > > +				then
> > > +					printf >&2 "ABORTED %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > > +				else
> > > +					printf >&2 "FAIL %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
> > > +					echo $GIT_TEST_STRESS_JOB_NR >>"$stressfail"
> > > +				fi
> > 
> > Hmm. What happens if we see a failure _and_ there's an abort? That's
> > actually pretty plausible if you see a failure whiz by, and you want to
> > abort the remaining scripts because they take a long time to run.
> > 
> > If the abort happens first, then the goal is to assume any errors are
> > due to the abort.
> 
> Yeah, that's why I added this ABORTED case.  When I hit ctrl-C, a
> couple of the tests tend to fail, but that's not the rare failure we
> are hoping to find when stress testing, so printing FAIL would be
> misleading.  The test script exits with 1 all the same, so we can't
> tell the difference, but since rare failures are, well, rare, this
> failure is most likely due to the abort.

With the patch as posted I actually see most of them still say "OK",
because the SIGINT does not make it to them (it would be a different
story if I killed the whole process group).

But with the modification above, I get "ABORTED" for most, which is what
I'd want. And it should be pretty race-free, since the "aborted" file is
put into place before triggering the signal cascade.

> > If the fail happens first (which I think is the more likely case), then
> > the abort handler will overwrite the file with "aborted", and we'll
> > forget that there was a real failure. This works in my script because of
> > the symlinking (if a real failure symlinked $fail to a directory, then
> > the attempt to write "aborted" will just be a noop).
> 
> OK, I think the abort handler should append, not overwrite, and then
> the file's contents should be checked e.g. with a case statement
> looking for *aborted*.
> 
> Or use two separate files for signaling abort and test failure.

Two separate files seems a lot simpler to me.

> > In my experience, outputting the failed log saves a step (especially
> > with "-i"), since seeing the failed test and its output is often
> > sufficient.
> 
> I just don't like when test output, especially with '-x', is dumped on
> my terminal :)

Don't you then use your terminal to display the file? ;) (I'm kidding, I
assume you load it in "less" or whatever, which is distinct).

I wish this could be optional somehow, though. I really prefer the dump.
I guess if I continue to use my script as a wrapper, it can dump for me
(after digging in the $stressfail file, I guess).

> > I'm also sad to lose the symlink to the failed trash directory, which
> > saves cutting and pasting when you have to inspect it. But I guess we
> > can't rely on symlinks everywhere.
> 
> You can tab-complete most of the trash directory, and then there are
> only those 1-2-3 digits of the job number that you have to type.  That
> worked out well enough for me so far (but I've only used it for a few
> days, so...).
> 
> We could rename the trash directory of the failed test to something
> short, sweet, and common, but that could be racy in the unlikely case
> that two tests fail at the same time.  And I like the 'trash
> directory.' prefix, because then 'make clean' will delete that one,
> too, without modifying 't/Makefile' (which is also the reason for
> adding the 'stress-N' suffix instead of a prefix, and putting the fail
> file into the 'test-results' directory).

We could rename it to "trash directory.xxx.failed", which is at least
predictable. That can even be done atomically, though I think it's a
little tricky to do portably with shell tools. I guess I can continue to
do the symlink thing in my wrapper script.

-Peff
Junio C Hamano Dec. 6, 2018, 12:22 a.m. UTC | #15
Jeff King <peff@peff.net> writes:

> But the ^C case is interesting. Try running your script under "sh -x"
> and hitting ^C. The signal interrupts the first wait. In my script (or
> yours modified to use a single wait), we then proceed immediately to the
> "exit", and get output from the subshells after we've exited.
>
> But in your script with the loop, the first wait is interrupted, and
> then the _second_ wait (in the second loop iteration) picks up all of
> the exiting processes. The subsequent waits are all noops that return
> immediately, because there are no processes left.
>
> So the right number of waits is either "1" or "2". Looping means we do
> too many (which is mostly a harmless noop) or too few (in the off chance
> that you have only a single job ;) ). So it works out in practice.

Well, if you time your ^C perfectly, no number of waits is right, I
am afraid.  You spawn N processes and while looping N times waiting
for them, you can ^C out of wait before these N processes all die,
no?

> But I think a more clear way to do it is to simply wait in the signal
> trap, to cover the case when our original wait was aborted.

Sounds good.
Jeff King Dec. 6, 2018, 5:35 a.m. UTC | #16
On Thu, Dec 06, 2018 at 09:22:23AM +0900, Junio C Hamano wrote:

> > So the right number of waits is either "1" or "2". Looping means we do
> > too many (which is mostly a harmless noop) or too few (in the off chance
> > that you have only a single job ;) ). So it works out in practice.
> 
> Well, if you time your ^C perfectly, no number of waits is right, I
> am afraid.  You spawn N processes and while looping N times waiting
> for them, you can ^C out of wait before these N processes all die,
> no?

Each "wait" will try to collect all processes, but may be interrupted by
a signal. So the correct number is actually "1 plus the number of times
the user hits ^C". I had assumed the user was just hitting it once,
though putting the wait into the trap means we do that "1 plus" thing
anyway.

I could also see an argument that subsequent ^C's should exit
immediately, but I think we are getting well into the realm of
over-engineering.

-Peff
Junio C Hamano Dec. 6, 2018, 6:41 a.m. UTC | #17
Jeff King <peff@peff.net> writes:

> Each "wait" will try to collect all processes, but may be interrupted by
> a signal. So the correct number is actually "1 plus the number of times
> the user hits ^C".

Yeah and that is not bounded.  It is OK not to catch multiple ^C
that races with what we do, so having ane extra wait in the clean-up
procedure after receiving a signal like you suggested would be both
good enough and the cleanest solution, I think.

Thanks.
SZEDER Gábor Dec. 6, 2018, 10:56 p.m. UTC | #18
On Wed, Dec 05, 2018 at 04:36:26PM -0500, Jeff King wrote:
> The signal interrupts the first wait.

Ah, of course.  I'm ashamed to say that this is not the first time I
forget about that...

> > Bash 4.3 or later are strange: I get back the shell prompt immediately
> > after ctrl-C as well, so it doesn't appear to be waiting for all
> > remaining jobs to finish either, but! I don't get any of the progress
> > output from those jobs to mess up my next command.
> 
> Interesting. My bash 4.4 seems to behave the same as dash. It almost
> sounds like the SIGINT is getting passed to the subshells for you.
> Probably not really worth digging into, though.

The subshell does indeed get SIGINT.  I don't know why that happens,
to my understanding that should not happen.

> In case anybody is playing with this and needed to simulate a stress
> failure, here's what I used:
> 
> diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh
> index b6566003dd..a370cd9977 100755
> --- a/t/t0000-basic.sh
> +++ b/t/t0000-basic.sh
> @@ -1171,4 +1171,11 @@ test_expect_success 'very long name in the index handled sanely' '
>  	test $len = 4098
>  '
>  
> +test_expect_success 'roll those dice' '
> +	case "$(openssl rand -base64 1)" in
> +	z*)
> +		return 1
> +	esac
> +'

Wasteful :)

  test $(($$ % 42)) -ne 0
SZEDER Gábor Dec. 6, 2018, 11:10 p.m. UTC | #19
On Wed, Dec 05, 2018 at 04:56:21PM -0500, Jeff King wrote:
> Could we just kill them all?
> 
> I guess it's a little tricky, because $! is going to give us the pid of
> each subshell. We actually want to kill the test sub-process. That takes
> a few contortions, but the output is nice (every sub-job immediately
> says "ABORTED ...", and the final process does not exit until the whole
> tree is done):

It's trickier than that:

  - Nowadays our test lib translates SIGINT to exit, but not TERM (or
    HUP, in case a user decides to close the terminal window), which
    means that if the test runs any daemons in the background, then
    those won't be killed when the stress test is aborted.

    This is easy enough to address, and I've been meaning to do this
    in an other patch series anyway.

  - With the (implied) '--verbose-log' the process killed in the
    background subshell's SIGTERM handler is not the actual test
    process, because '--verbose-log' runs the test again in a piped
    subshell to save its output:
    
      (GIT_TEST_TEE_STARTED=done ${TEST_SHELL_PATH} "$0" "$@" 2>&1;
       echo $? >"$TEST_RESULTS_BASE.exit") | tee -a "$GIT_TEST_TEE_OUTPUT_FILE"

    That 'kill' kills the "outer" shell process.
    And though I get "ABORTED..." immediately and I get back my
    prompt right away, the commands involved in the above construct
    are still running in the background:

      $ ./t3903-stash.sh --stress=1
      ^CABORTED  0.0
      $ ps a |grep t3903
      1280 pts/17   S      0:00 /bin/sh ./t3903-stash.sh --stress=1
      1281 pts/17   S      0:00 tee -a <...>/test-results/t3903-stash.stress-0.out
      1282 pts/17   S      0:00 /bin/sh ./t3903-stash.sh --stress=1
      4173 pts/17   S+     0:00 grep t3903

    I see this behavior with all shells I tried.
    I haven't yet started to think it through why this happens :)

    Not implying '--verbose-log' but redirecting the test script's
    output, like you did in your 'stress' script, seems to work in
    dash, ksh, and ks93, but not in Bash v4.3 or later, where, for
    whatever reason, the subshell get SIGINT before the SIGTERM would
    arrive.
    While we could easily handle SIGINT in the subshell with the same
    signal handler as SIGTERM, it bothers me that something
    fundamental is wrong here.
    Furthermore, then we should perhaps forbid '--stress' in
    combination with '--verbose-log' or '--tee'.
    

> diff --git a/t/test-lib.sh b/t/test-lib.sh
> index 9b7f687396..357dead3ff 100644
> --- a/t/test-lib.sh
> +++ b/t/test-lib.sh
> @@ -98,8 +98,9 @@ done,*)
>  	mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
>  	stressfail="$TEST_RESULTS_BASE.stress-failed"
>  	rm -f "$stressfail"
> -	trap 'echo aborted >"$stressfail"' TERM INT HUP
> +	trap 'echo aborted >"$stressfail"; kill $job_pids 2>/dev/null; wait' TERM INT HUP
>  
> +	job_pids=
>  	job_nr=0
>  	while test $job_nr -lt "$job_count"
>  	do
> @@ -108,10 +109,13 @@ done,*)
>  			GIT_TEST_STRESS_JOB_NR=$job_nr
>  			export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
>  
> +			trap 'kill $test_pid 2>/dev/null' TERM
> +
>  			cnt=0
>  			while ! test -e "$stressfail"
>  			do
> -				if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
> +				$TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1 & test_pid=$!
> +				if wait
>  				then
>  					printf >&2 "OK   %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
>  				elif test -f "$stressfail" &&
> @@ -124,16 +128,11 @@ done,*)
>  				fi
>  				cnt=$(($cnt + 1))
>  			done
> -		) &
> +		) & job_pids="$job_pids $!"
>  		job_nr=$(($job_nr + 1))
>  	done
>  
> -	job_nr=0
> -	while test $job_nr -lt "$job_count"
> -	do
> -		wait
> -		job_nr=$(($job_nr + 1))
> -	done
> +	wait
>  
>  	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
>  	then
>
Jeff King Dec. 7, 2018, 1:03 a.m. UTC | #20
On Thu, Dec 06, 2018 at 11:56:01PM +0100, SZEDER Gábor wrote:

> > +test_expect_success 'roll those dice' '
> > +	case "$(openssl rand -base64 1)" in
> > +	z*)
> > +		return 1
> > +	esac
> > +'
> 
> Wasteful :)
> 
>   test $(($$ % 42)) -ne 0

Oh, indeed, that is much more clever. :)

-Peff
Jeff King Dec. 7, 2018, 1:14 a.m. UTC | #21
On Fri, Dec 07, 2018 at 12:10:05AM +0100, SZEDER Gábor wrote:

> On Wed, Dec 05, 2018 at 04:56:21PM -0500, Jeff King wrote:
> > Could we just kill them all?
> > 
> > I guess it's a little tricky, because $! is going to give us the pid of
> > each subshell. We actually want to kill the test sub-process. That takes
> > a few contortions, but the output is nice (every sub-job immediately
> > says "ABORTED ...", and the final process does not exit until the whole
> > tree is done):
> 
> It's trickier than that:
> 
>   - Nowadays our test lib translates SIGINT to exit, but not TERM (or
>     HUP, in case a user decides to close the terminal window), which
>     means that if the test runs any daemons in the background, then
>     those won't be killed when the stress test is aborted.
> 
>     This is easy enough to address, and I've been meaning to do this
>     in an other patch series anyway.

Yeah, trapping on more signals seems reasonable (or just propagating INT
instead of TERM). I'm more worried about this one:

>   - With the (implied) '--verbose-log' the process killed in the
>     background subshell's SIGTERM handler is not the actual test
>     process, because '--verbose-log' runs the test again in a piped
>     subshell to save its output:

Hmm, yeah. The patch I sent earlier already propagates through the
subshell, so this is really just another layer there. I.e., would it be
reasonable when we are using --verbose-log (or --tee) for the parent
process to propagate signals down to child it spawned?

That would fix this case, and it would make things more predictable in
general (e.g., a user who manually runs kill).

It does feel like a lot of boilerplate to be propagating these signals
manually. I think the "right" Unix-y solution here is process groups:
each sub-job should get its own group, and then the top-level runner
script can kill the whole group. We may run into portability issues,
though (setsid is in util-linux, but I don't know if there's an
equivalent elsewhere; a small perl or C helper could do it, but I have
no idea what that would mean on Windows).

>       (GIT_TEST_TEE_STARTED=done ${TEST_SHELL_PATH} "$0" "$@" 2>&1;
>        echo $? >"$TEST_RESULTS_BASE.exit") | tee -a "$GIT_TEST_TEE_OUTPUT_FILE"
> 
>     That 'kill' kills the "outer" shell process.
>     And though I get "ABORTED..." immediately and I get back my
>     prompt right away, the commands involved in the above construct
>     are still running in the background:
> 
>       $ ./t3903-stash.sh --stress=1
>       ^CABORTED  0.0
>       $ ps a |grep t3903
>       1280 pts/17   S      0:00 /bin/sh ./t3903-stash.sh --stress=1
>       1281 pts/17   S      0:00 tee -a <...>/test-results/t3903-stash.stress-0.out
>       1282 pts/17   S      0:00 /bin/sh ./t3903-stash.sh --stress=1
>       4173 pts/17   S+     0:00 grep t3903
> 
>     I see this behavior with all shells I tried.
>     I haven't yet started to think it through why this happens :)

I think you get ABORTED because we are just watching for the
parent-process to exit (and reporting its status). The fact that that
the subshell (and the tee) are still running is not important. That all
makes sense to me.

>     Not implying '--verbose-log' but redirecting the test script's
>     output, like you did in your 'stress' script, seems to work in
>     dash, ksh, and ks93, but not in Bash v4.3 or later, where, for
>     whatever reason, the subshell get SIGINT before the SIGTERM would
>     arrive.
>     While we could easily handle SIGINT in the subshell with the same
>     signal handler as SIGTERM, it bothers me that something
>     fundamental is wrong here.

Yeah, I don't understand the SIGINT behavior you're seeing with bash. I
can't replicate it with bash 4.4.23 (from Debian unstable).

-Peff

Patch
diff mbox series

diff --git a/t/README b/t/README
index 28711cc508..9851de25c2 100644
--- a/t/README
+++ b/t/README
@@ -186,6 +186,16 @@  appropriately before running "make".
 	this feature by setting the GIT_TEST_CHAIN_LINT environment
 	variable to "1" or "0", respectively.
 
+--stress::
+--stress=<N>::
+	Run the test script repeatedly in multiple parallel
+	invocations until one of them fails.  Useful for reproducing
+	rare failures in flaky tests.  The number of parallel
+	invocations is, in order of precedence: <N>, or the value of
+	the GIT_TEST_STRESS_LOAD environment variable, or twice the
+	number of available processors in '/proc/cpuinfo', or 8.
+	Implies `--verbose-log`.
+
 You can also set the GIT_TEST_INSTALLED environment variable to
 the bindir of an existing git installation to test that installation.
 You still need to have built this git sandbox, from which various
@@ -425,7 +435,8 @@  This test harness library does the following things:
  - Creates an empty test directory with an empty .git/objects database
    and chdir(2) into it.  This directory is 't/trash
    directory.$test_name_without_dotsh', with t/ subject to change by
-   the --root option documented above.
+   the --root option documented above, and a '.stress-<N>' suffix
+   appended by the --stress option.
 
  - Defines standard test helper functions for your scripts to
    use.  These functions are designed to make all scripts behave
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index d9a602cd0f..9af11e3eed 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1288,8 +1288,6 @@  test_set_port () {
 			# root-only port, use a larger one instead.
 			port=$(($port + 10000))
 		fi
-
-		eval $var=$port
 		;;
 	*[^0-9]*)
 		error >&7 "invalid port number: $port"
@@ -1298,4 +1296,9 @@  test_set_port () {
 		# The user has specified the port.
 		;;
 	esac
+
+	# Make sure that parallel '--stress' test jobs get different
+	# ports.
+	port=$(($port + ${GIT_TEST_STRESS_JOB_NR:-0}))
+	eval $var=$port
 }
diff --git a/t/test-lib.sh b/t/test-lib.sh
index 49e4563405..9b7f687396 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -71,8 +71,81 @@  then
 	exit 1
 fi
 
+TEST_STRESS_SFX="${GIT_TEST_STRESS_JOB_NR:+.stress-$GIT_TEST_STRESS_JOB_NR}"
 TEST_NAME="$(basename "$0" .sh)"
-TEST_RESULTS_BASE="$TEST_OUTPUT_DIRECTORY/test-results/$TEST_NAME"
+TEST_RESULTS_BASE="$TEST_OUTPUT_DIRECTORY/test-results/$TEST_NAME$TEST_STRESS_SFX"
+
+# If --stress was passed, run this test repeatedly in several parallel loops.
+case "$GIT_TEST_STRESS_STARTED, $* " in
+done,*)
+	# Don't stress test again.
+	;;
+*' --stress '*|*' '--stress=*' '*)
+	job_count=${*##*--stress=}
+	if test "$job_count" != "$*"
+	then
+		job_count=${job_count%% *}
+	elif test -n "$GIT_TEST_STRESS_LOAD"
+	then
+		job_count="$GIT_TEST_STRESS_LOAD"
+	elif test -r /proc/cpuinfo
+	then
+		job_count=$((2 * $(grep -c ^processor /proc/cpuinfo)))
+	else
+		job_count=8
+	fi
+
+	mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
+	stressfail="$TEST_RESULTS_BASE.stress-failed"
+	rm -f "$stressfail"
+	trap 'echo aborted >"$stressfail"' TERM INT HUP
+
+	job_nr=0
+	while test $job_nr -lt "$job_count"
+	do
+		(
+			GIT_TEST_STRESS_STARTED=done
+			GIT_TEST_STRESS_JOB_NR=$job_nr
+			export GIT_TEST_STRESS_STARTED GIT_TEST_STRESS_JOB_NR
+
+			cnt=0
+			while ! test -e "$stressfail"
+			do
+				if $TEST_SHELL_PATH "$0" "$@" >/dev/null 2>&1
+				then
+					printf >&2 "OK   %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
+				elif test -f "$stressfail" &&
+				     test "$(cat "$stressfail")" = "aborted"
+				then
+					printf >&2 "ABORTED %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
+				else
+					printf >&2 "FAIL %2d.%d\n" $GIT_TEST_STRESS_JOB_NR $cnt
+					echo $GIT_TEST_STRESS_JOB_NR >>"$stressfail"
+				fi
+				cnt=$(($cnt + 1))
+			done
+		) &
+		job_nr=$(($job_nr + 1))
+	done
+
+	job_nr=0
+	while test $job_nr -lt "$job_count"
+	do
+		wait
+		job_nr=$(($job_nr + 1))
+	done
+
+	if test -f "$stressfail" && test "$(cat "$stressfail")" != "aborted"
+	then
+		echo "Log(s) of failed test run(s) be found in:"
+		for f in $(cat "$stressfail")
+		do
+			echo "  $TEST_RESULTS_BASE.stress-$f.out"
+		done
+	fi
+	exit
+	;;
+esac
 
 # if --tee was passed, write the output not only to the terminal, but
 # additionally to the file test-results/$BASENAME.out, too.
@@ -80,7 +153,7 @@  case "$GIT_TEST_TEE_STARTED, $* " in
 done,*)
 	# do not redirect again
 	;;
-*' --tee '*|*' --va'*|*' -V '*|*' --verbose-log '*)
+*' --tee '*|*' --va'*|*' -V '*|*' --verbose-log '*|*' --stress '*|*' '--stress=*' '*)
 	mkdir -p "$(dirname "$TEST_RESULTS_BASE")"
 
 	# Make this filename available to the sub-process in case it is using
@@ -341,6 +414,9 @@  do
 	-V|--verbose-log)
 		verbose_log=t
 		shift ;;
+	--stress|--stress=*)
+		verbose_log=t
+		shift ;;
 	*)
 		echo "error: unknown test option '$1'" >&2; exit 1 ;;
 	esac
@@ -1028,7 +1104,7 @@  then
 fi
 
 # Test repository
-TRASH_DIRECTORY="trash directory.$TEST_NAME"
+TRASH_DIRECTORY="trash directory.$TEST_NAME$TEST_STRESS_SFX"
 test -n "$root" && TRASH_DIRECTORY="$root/$TRASH_DIRECTORY"
 case "$TRASH_DIRECTORY" in
 /*) ;; # absolute path is good