
[06/18] chainlint.pl: validate test scripts in parallel

Message ID 62fc652eb47a4df83d88a197e376f28dbbab3b52.1661992197.git.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 29fb2ec384a867ca577335a12f4b45c184e7b642
Series make test "linting" more comprehensive

Commit Message

Eric Sunshine Sept. 1, 2022, 12:29 a.m. UTC
From: Eric Sunshine <sunshine@sunshineco.com>

Although chainlint.pl has undergone a good deal of optimization during
its development -- increasing in speed significantly -- parsing and
validating 1050+ scripts and 16500+ tests via Perl is not exactly
instantaneous. However, perceived performance can be improved by taking
advantage of the fact that there is no interdependence between test
scripts or test definitions, thus parsing and validating can be done in
parallel. The number of available cores is determined automatically but
can be overridden via the --jobs option.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
---
 t/chainlint.pl | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)
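
Example invocations, for illustration only (the file arguments are just an example):

	$ perl chainlint.pl t[0-9]*.sh              # worker count auto-detected
	$ perl chainlint.pl --jobs=2 t[0-9]*.sh     # cap at two workers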

Comments

Ævar Arnfjörð Bjarmason Sept. 1, 2022, 12:36 p.m. UTC | #1
On Thu, Sep 01 2022, Eric Sunshine via GitGitGadget wrote:

> From: Eric Sunshine <sunshine@sunshineco.com>
>
> Although chainlint.pl has undergone a good deal of optimization during
> its development -- increasing in speed significantly -- parsing and
> validating 1050+ scripts and 16500+ tests via Perl is not exactly
> instantaneous. However, perceived performance can be improved by taking
> advantage of the fact that there is no interdependence between test
> scripts or test definitions, thus parsing and validating can be done in
> parallel. The number of available cores is determined automatically but
> can be overridden via the --jobs option.

Per your CL:
	
	Ævar offered some sensible comments[2,3] about optimizing the Makefile rules
	related to chainlint, but those optimizations are not tackled here for a few
	reasons: (1) this series is already quite long, (2) I'd like to keep the
	series focused on its primary goal of installing a new and improved linter,
	(3) these patches do not make the Makefile situation any worse[4], and (4)
	those optimizations can easily be done atop this series[5].

I have been running with those t/Makefile changes locally, but didn't
submit them. FWIW that's here:

	https://github.com/git/git/compare/master...avar:git:avar/t-Makefile-use-dependency-graph-for-check-chainlint

Which I'm not entirely sure I'm happy about, and it's just about the
chainlint tests, but...

> +sub ncores {
> +	# Windows
> +	return $ENV{NUMBER_OF_PROCESSORS} if exists($ENV{NUMBER_OF_PROCESSORS});
> +	# Linux / MSYS2 / Cygwin / WSL
> +	do { local @ARGV='/proc/cpuinfo'; return scalar(grep(/^processor\s*:/, <>)); } if -r '/proc/cpuinfo';
> +	# macOS & BSD
> +	return qx/sysctl -n hw.ncpu/ if $^O =~ /(?:^darwin$|bsd)/;
> +	return 1;
> +}
> +
>  sub show_stats {
>  	my ($start_time, $stats) = @_;
>  	my $walltime = $interval->($start_time);
> @@ -621,7 +633,9 @@ sub exit_code {
>  Getopt::Long::Configure(qw{bundling});
>  GetOptions(
>  	"emit-all!" => \$emit_all,
> +	"jobs|j=i" => \$jobs,
>  	"stats|show-stats!" => \$show_stats) or die("option error\n");
> +$jobs = ncores() if $jobs < 1;
>  
>  my $start_time = $getnow->();
>  my @stats;
> @@ -633,6 +647,40 @@ unless (@scripts) {
>  	exit;
>  }
>  
> -push(@stats, check_script(1, sub { shift(@scripts); }, sub { print(@_); }));
> +unless ($Config{useithreads} && eval {
> +	require threads; threads->import();
> +	require Thread::Queue; Thread::Queue->import();
> +	1;
> +	}) {
> +	push(@stats, check_script(1, sub { shift(@scripts); }, sub { print(@_); }));
> +	show_stats($start_time, \@stats) if $show_stats;
> +	exit(exit_code(\@stats));
> +}
> +
> +my $script_queue = Thread::Queue->new();
> +my $output_queue = Thread::Queue->new();
> +
> +sub next_script { return $script_queue->dequeue(); }
> +sub emit { $output_queue->enqueue(@_); }
> +
> +sub monitor {
> +	while (my $s = $output_queue->dequeue()) {
> +		print($s);
> +	}
> +}
> +
> +my $mon = threads->create({'context' => 'void'}, \&monitor);
> +threads->create({'context' => 'list'}, \&check_script, $_, \&next_script, \&emit) for 1..$jobs;
> +
> +$script_queue->enqueue(@scripts);
> +$script_queue->end();
> +
> +for (threads->list()) {
> +	push(@stats, $_->join()) unless $_ == $mon;
> +}
> +
> +$output_queue->end();
> +$mon->join();

Maybe I'm misunderstanding this whole thing, but this really seems like
the wrong direction in an otherwise fantastic direction of a series.

I.e. it's *great* that we can do chain-lint without needing to actually
execute the *.sh file, this series adds a lint parser that can parse
those *.sh "at rest".

But in your 16/18 you then do:
	
	+if test "${GIT_TEST_CHAIN_LINT:-1}" != 0
	+then
	+	"$PERL_PATH" "$TEST_DIRECTORY/chainlint.pl" "$0" ||
	+		BUG "lint error (see '?!...!? annotations above)"
	+fi
	
I may just be missing something here, but why not instead just borrow
what I did for "lint-docs" in 8650c6298c1 (doc lint: make "lint-docs"
non-.PHONY, 2021-10-15)?

I.e. if we can run against t0001-init.sh or whatever *once* to see if it
chain-lints OK then surely we could have a rule like:

	t0001-init.sh.chainlint-ok: t0001-init.sh
		perl chainlint.pl $< >$@

Then whenever you change t0001-init.sh we refresh that
t0001-init.sh.chainlint-ok, if the chainlint.pl exits non-zero we'll
fail to make it, and will unlink that t0001-init.sh.chainlint-ok.

That way you wouldn't need any parallelism in the Perl script, because
you'd have "make" take care of it, and the common case of re-testing
where the speed matters would be that we wouldn't need to run this at
all, or would only re-run it for the test scripts that changed.

(Obviously a "real" implementation would want to create that ".ok" file
in t/.build/chainlint" or whatever)
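
A rough sketch of how that might look -- the layout and names below are
illustrative only, not something this series adds:

	T = $(wildcard t[0-9][0-9][0-9][0-9]-*.sh)
	CHAINLINT_OK = $(patsubst %.sh,.build/chainlint/%.sh.ok,$(T))

	.build/chainlint/%.sh.ok: %.sh chainlint.pl
		@mkdir -p $(dir $@)
		perl chainlint.pl $< && touch $@

	check-chainlint: $(CHAINLINT_OK)

With chainlint.pl itself listed as a prerequisite, touching the linter
also invalidates all the stamps.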

A drawback is that you'd probably be slower on the initial run, as you'd
spawn N chainlint.pl. You could use $? instead of $< to get around that,
but that requires some re-structuring, and I've found it to generally
not be worth it.

It would also have the drawback that a:

	./t0001-init.sh

wouldn't run the chain-lint, but this would:

	make T=t0001-init.sh

But if we want the former to work we could carry some
"GIT_TEST_VIA_MAKEFILE" variable or whatever, and only run the
test-via-test-lib.sh if it isn't set.
Eric Sunshine Sept. 3, 2022, 7:51 a.m. UTC | #2
On Thu, Sep 1, 2022 at 8:47 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Thu, Sep 01 2022, Eric Sunshine via GitGitGadget wrote:
> > Although chainlint.pl has undergone a good deal of optimization during
> > its development -- increasing in speed significantly -- parsing and
> > validating 1050+ scripts and 16500+ tests via Perl is not exactly
> > instantaneous. However, perceived performance can be improved by taking
> > advantage of the fact that there is no interdependence between test
> > scripts or test definitions, thus parsing and validating can be done in
> > parallel. The number of available cores is determined automatically but
> > can be overridden via the --jobs option.
>
> Per your CL:
>
>         Ævar offered some sensible comments[2,3] about optimizing the Makefile rules
>         related to chainlint, but those optimizations are not tackled here for a few
>         reasons: (1) this series is already quite long, (2) I'd like to keep the
>         series focused on its primary goal of installing a new and improved linter,
>         (3) these patches do not make the Makefile situation any worse[4], and (4)
>         those optimizations can easily be done atop this series[5].
>
> I have been running with those t/Makefile changes locally, but didn't
> submit them. FWIW that's here:
>
>         https://github.com/git/git/compare/master...avar:git:avar/t-Makefile-use-dependency-graph-for-check-chainlint

Thanks for the link. It's nice to see an actual implementation. I
think most of what you wrote in the commit message and the patch
itself is still meaningful following this series.

> > +my $script_queue = Thread::Queue->new();
> > +my $output_queue = Thread::Queue->new();
> > +
> > +my $mon = threads->create({'context' => 'void'}, \&monitor);
> > +threads->create({'context' => 'list'}, \&check_script, $_, \&next_script, \&emit) for 1..$jobs;
>
> Maybe I'm misunderstanding this whole thing, but this really seems like
> the wrong direction in an otherwise fantastic direction of a series.
>
> I.e. it's *great* that we can do chain-lint without needing to actually
> execute the *.sh file, this series adds a lint parser that can parse
> those *.sh "at rest".
>
> But in your 16/18 you then do:
>
>         +if test "${GIT_TEST_CHAIN_LINT:-1}" != 0
>         +then
>         +       "$PERL_PATH" "$TEST_DIRECTORY/chainlint.pl" "$0" ||
>         +               BUG "lint error (see '?!...!? annotations above)"
>         +fi
>
> I may just be missing something here, but why not instead just borrow
> what I did for "lint-docs" in 8650c6298c1 (doc lint: make "lint-docs"
> non-.PHONY, 2021-10-15)?

I may be misunderstanding, but regarding patch [16/18], I think you
answered your own question at the end of your response when you
pointed out the drawback that you wouldn't get linting when running
the test script manually (i.e. `./t1234-test-stuff.sh`). Ensuring that
the linter is invoked when running a test script manually is important
(at least to me) since it's a frequent step when developing a new test
or modifying an existing test. [16/18] is present to ensure that we
still get that behavior.

> I.e. if we can run against t0001-init.sh or whatever *once* to see if it
> chain-lints OK then surely we could have a rule like:
>
>         t0001-init.sh.chainlint-ok: t0001-init.sh
>                 perl chainlint.pl $< >$@
>
> Then whenever you change t0001-init.sh we refresh that
> t0001-init.sh.chainlint-ok, if the chainlint.pl exits non-zero we'll
> fail to make it, and will unlink that t0001-init.sh.chainlint-ok.
>
> That way you wouldn't need any parallelism in the Perl script, because
> you'd have "make" take care of it, and the common case of re-testing
> where the speed matters would be that we wouldn't need to run this at
> all, or would only re-run it for the test scripts that changed.

A couple comments regarding parallelism: (1) as mentioned in another
response, when developing the script, I had in mind that it might be
useful for other projects (e.g. `sharness`) and thus should be able to
stand on its own without advanced Makefile support, and (2) process
creation on Microsoft Windows is _very_ expensive and slow, so on that
platform, being able to lint all tests in all scripts with a single
invocation is a big win over running the linter 1050+ times, once for
each test script.

That's not to discredit any of your points... I'm just conveying some
of my thought process.

> (Obviously a "real" implementation would want to create that ".ok" file
> in t/.build/chainlint" or whatever)
>
> A drawback is that you'd probably be slower on the initial run, as you'd
> spawn N chainlint.pl. You could use $? instead of $< to get around that,
> but that requires some re-structuring, and I've found it to generally
> not be worth it.

The $? trick might be something Windows folk would appreciate, and
even those of us in macOS land (at least those of us with old hardware
and OS).

> It would also have the drawback that a:
>
>         ./t0001-init.sh
>
> wouldn't run the chain-lint, but this would:
>
>         make T=t0001-init.sh
>
> But if we want the former to work we could carry some
> "GIT_TEST_VIA_MAKEFILE" variable or whatever, and only run the
> test-via-test-lib.sh if it isn't set.

I may be misunderstanding, but isn't the GIT_TEST_CHAIN_LINT variable
useful for this already, as in [16/18]?

Regarding your observations as a whole, I think the extract from the
cover letter which you cited above is relevant to my response. I don't
disagree with your points about using the Makefile to optimize away
unnecessary invocations of the linter, or that doing so can be a
useful future direction. As mentioned in the cover letter, though, I
think that such optimizations are outside the scope of this series
which -- aside from installing an improved linter -- aims to maintain
the status quo; in particular, this series ensures that (1) tests get
linted as they are being written/modified when the developer runs the
script manually `./t1234-test-stuff.sh`, and (2) all tests get linted
upon `make test`.

(The other reason why I'd prefer to see such optimizations applied
atop this series is that I simply don't have the time these days to
devote to major changes of direction in this series, which I think
meets its stated goals without making the situation any worse or
making it any more difficult to apply the optimizations you describe.
And the new linter has been languishing on my computer for far too
long; the implementation has been complete for well over a year, but
it took me this long to finish polishing the patch series. I'd like to
see the new linter make it into the toolchest of other developers
since it can be beneficial; it has already found scores or hundreds[1]
of possible hiding places for bugs due to broken &&-chain or missing
`|| return`, and has sniffed out some actual broken tests[2,3].)

[1]: https://lore.kernel.org/git/20211209051115.52629-1-sunshine@sunshineco.com/
[2]: https://lore.kernel.org/git/20211209051115.52629-3-sunshine@sunshineco.com/
[3]: https://lore.kernel.org/git/7b0784056f3cc0c96e9543ae44d0f5a7b0bf85fa.1661192802.git.gitgitgadget@gmail.com/
Eric Wong Sept. 6, 2022, 10:35 p.m. UTC | #3
Eric Sunshine via GitGitGadget <gitgitgadget@gmail.com> wrote:
> +unless ($Config{useithreads} && eval {
> +	require threads; threads->import();

Fwiw, the threads(3perl) manpage has this since 2014:

       The use of interpreter-based threads in perl is officially discouraged.

I was bummed, too :<  but I've decided it wasn't worth the
effort to deal with the problems threads could cause down the
line in future Perl versions.  For example, common libraries
like File::Temp will chdir behind-the-scenes which is
thread-unsafe.

(of course I only care about *BSD and Linux on MMU hardware,
so I use SOCK_SEQPACKET and fork() freely :>)
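
For what it's worth, a minimal sketch of what a fork()-based worker loop
could look like in place of the ithreads one -- purely illustrative,
assuming chainlint.pl's existing check_script() and exit_code() helpers
and letting each child print its own report:

	my @pids;
	for my $i (1 .. $jobs) {
		# each child takes every $jobs-th script and lints it in-process
		my @mine = @scripts[grep { $_ % $jobs == $i - 1 } 0 .. $#scripts];
		defined(my $pid = fork()) or die("fork: $!\n");
		if (!$pid) {
			my @stats = check_script($i, sub { shift(@mine); }, sub { print(@_); });
			exit(exit_code(\@stats));
		}
		push(@pids, $pid);
	}
	my $failed = 0;
	for my $pid (@pids) {
		waitpid($pid, 0);
		$failed = 1 if $? != 0;
	}
	exit($failed);

Output from different children can interleave and the per-script stats
never make it back to the parent, so it is not a drop-in replacement --
and on Windows fork() is itself emulated with ithreads -- but it
sidesteps interpreter threads entirely on *BSD and Linux.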
Eric Sunshine Sept. 6, 2022, 10:52 p.m. UTC | #4
On Tue, Sep 6, 2022 at 6:35 PM Eric Wong <e@80x24.org> wrote:
> Eric Sunshine via GitGitGadget <gitgitgadget@gmail.com> wrote:
> > +unless ($Config{useithreads} && eval {
> > +     require threads; threads->import();
>
> Fwiw, the threads(3perl) manpage has this since 2014:
>
>        The use of interpreter-based threads in perl is officially discouraged.

Thanks for pointing this out. I did see that, but as no better
alternative was offered, and since I did want this to work on Windows,
I went with it.

> I was bummed, too :<  but I've decided it wasn't worth the
> effort to deal with the problems threads could cause down the
> line in future Perl versions.  For example, common libraries
> like File::Temp will chdir behind-the-scenes which is
> thread-unsafe.
>
> (of course I only care about *BSD and Linux on MMU hardware,
> so I use SOCK_SEQPACKET and fork() freely :>)

I'm not overly worried about the deprecation at the moment since (1)
chainlint.pl isn't a widely used script -- its audience is very
narrow; (2) the `$Config{useithreads}` conditional can be seen as an
automatic escape-hatch, and (if need be) I can even make `--jobs=1` be
an explicit escape hatch, and there's already --no-chain-lint for an
extreme escape-hatch; (3) the script is pretty much standalone -- it
doesn't rely upon any libraries like File::Temp or others; (4) Ævar
has ideas for using the Makefile for parallelism instead; (5) we can
cross the deprecation-bridge when/if it actually does become a
problem, either by dropping parallelism from chainlint.pl or by
dropping chainlint.pl itself.
Jeff King Sept. 6, 2022, 11:26 p.m. UTC | #5
On Tue, Sep 06, 2022 at 06:52:26PM -0400, Eric Sunshine wrote:

> On Tue, Sep 6, 2022 at 6:35 PM Eric Wong <e@80x24.org> wrote:
> > Eric Sunshine via GitGitGadget <gitgitgadget@gmail.com> wrote:
> > > +unless ($Config{useithreads} && eval {
> > > +     require threads; threads->import();
> >
> > Fwiw, the threads(3perl) manpage has this since 2014:
> >
> >        The use of interpreter-based threads in perl is officially discouraged.
> 
> Thanks for pointing this out. I did see that, but as no better
> alternative was offered, and since I did want this to work on Windows,
> I went with it.

I did some timings the other night, and I found something quite curious
with the thread stuff.

Here's a hyperfine run of "make" in the t/ directory before any of your
patches. It uses "prove" to do parallelism under the hood:

  Benchmark 1: make
    Time (mean ± σ):     68.895 s ±  0.840 s    [User: 620.914 s, System: 428.498 s]
    Range (min … max):   67.943 s … 69.531 s    3 runs

So that gives us a baseline. Now the first thing I wondered is how bad
it would be to just run chainlint.pl once per script. So I applied up to
that patch:

  Benchmark 1: make
    Time (mean ± σ):     71.289 s ±  1.302 s    [User: 673.300 s, System: 417.912 s]
    Range (min … max):   69.788 s … 72.120 s    3 runs

I was quite surprised that it made things slower! It's nice that we're
only calling it once per script instead of once per test, but it seems
the startup overhead of the script is really high.

And since in this mode we're only feeding it one script at a time, I
tried reverting the "chainlint.pl: validate test scripts in parallel"
commit. And indeed, now things are much faster:

  Benchmark 1: make
    Time (mean ± σ):     61.544 s ±  3.364 s    [User: 556.486 s, System: 384.001 s]
    Range (min … max):   57.660 s … 63.490 s    3 runs

And you can see the same thing just running chainlint by itself:

  $ time perl chainlint.pl /dev/null
  real	0m0.069s
  user	0m0.042s
  sys	0m0.020s

  $ git revert HEAD^{/validate.test.scripts.in.parallel}
  $ time perl chainlint.pl /dev/null
  real	0m0.014s
  user	0m0.010s
  sys	0m0.004s

I didn't track down the source of the slowness. Maybe it's loading extra
modules, or maybe it's opening /proc/cpuinfo, or maybe it's the thread
setup. But it's a surprising slowdown.

Now of course your intent is to do a single repo-wide invocation. And
that is indeed a bit faster. Here it is without the parallel code:

  Benchmark 1: make
    Time (mean ± σ):     61.727 s ±  2.140 s    [User: 507.712 s, System: 377.753 s]
    Range (min … max):   59.259 s … 63.074 s    3 runs

The wall-clock time didn't improve much, but the CPU time did. Restoring
the parallel code does improve the wall-clock time a bit, but at the
cost of some extra CPU:

  Benchmark 1: make
    Time (mean ± σ):     59.029 s ±  2.851 s    [User: 515.690 s, System: 380.369 s]
    Range (min … max):   55.736 s … 60.693 s    3 runs

which makes sense. If I do a with/without of just "make test-chainlint",
the parallelism is buying a few seconds of wall-clock:

  Benchmark 1: make test-chainlint
    Time (mean ± σ):     900.1 ms ± 102.9 ms    [User: 12049.8 ms, System: 79.7 ms]
    Range (min … max):   704.2 ms … 994.4 ms    10 runs

  Benchmark 1: make test-chainlint
    Time (mean ± σ):      3.778 s ±  0.042 s    [User: 3.756 s, System: 0.023 s]
    Range (min … max):    3.706 s …  3.833 s    10 runs

I'm not sure what it all means. For Linux, I think I'd be just as happy
with a single non-parallelized test-chainlint run for each file. But
maybe on Windows the startup overhead is worse? OTOH, the whole test run
is so much worse there. One process per script is not going to be that
much in relative terms either way.

And if we did cache the results and avoid extra invocations via "make",
then we'd want all the parallelism to move to there anyway.

Maybe that gives you more food for thought about whether perl's "use
threads" is worth having.

-Peff
Eric Sunshine Nov. 21, 2022, 4:02 a.m. UTC | #6
On Tue, Sep 6, 2022 at 7:27 PM Jeff King <peff@peff.net> wrote:
> I did some timings the other night, and I found something quite curious
> with the thread stuff.
>
> I was quite surprised that it made things slower! It's nice that we're
> only calling it once per script instead of once per test, but it seems
> the startup overhead of the script is really high.
>
> And since in this mode we're only feeding it one script at a time, I
> tried reverting the "chainlint.pl: validate test scripts in parallel"
> commit. And indeed, now things are much faster:
>
>   Benchmark 1: make
>     Time (mean ± σ):     61.544 s ±  3.364 s    [User: 556.486 s, System: 384.001 s]
>     Range (min … max):   57.660 s … 63.490 s    3 runs
>
> And you can see the same thing just running chainlint by itself:
>
>   $ time perl chainlint.pl /dev/null
>   real  0m0.069s
>   user  0m0.042s
>   sys   0m0.020s
>
>   $ git revert HEAD^{/validate.test.scripts.in.parallel}
>   $ time perl chainlint.pl /dev/null
>   real  0m0.014s
>   user  0m0.010s
>   sys   0m0.004s
>
> I didn't track down the source of the slowness. Maybe it's loading extra
> modules, or maybe it's opening /proc/cpuinfo, or maybe it's the thread
> setup. But it's a surprising slowdown.

It is surprising, and unfortunate. Ditching "ithreads" would probably
be a good idea. (more on that below)

> Now of course your intent is to do a single repo-wide invocation. And
> that is indeed a bit faster. Here it is without the parallel code:
>
>   Benchmark 1: make
>     Time (mean ± σ):     61.727 s ±  2.140 s    [User: 507.712 s, System: 377.753 s]
>     Range (min … max):   59.259 s … 63.074 s    3 runs
>
> The wall-clock time didn't improve much, but the CPU time did. Restoring
> the parallel code does improve the wall-clock time a bit, but at the
> cost of some extra CPU:
>
>   Benchmark 1: make
>     Time (mean ± σ):     59.029 s ±  2.851 s    [User: 515.690 s, System: 380.369 s]
>     Range (min … max):   55.736 s … 60.693 s    3 runs
>
> which makes sense. If I do a with/without of just "make test-chainlint",
> the parallelism is buying a few seconds of wall-clock:
>
>   Benchmark 1: make test-chainlint
>     Time (mean ± σ):     900.1 ms ± 102.9 ms    [User: 12049.8 ms, System: 79.7 ms]
>     Range (min … max):   704.2 ms … 994.4 ms    10 runs
>
>   Benchmark 1: make test-chainlint
>     Time (mean ± σ):      3.778 s ±  0.042 s    [User: 3.756 s, System: 0.023 s]
>     Range (min … max):    3.706 s …  3.833 s    10 runs
>
> I'm not sure what it all means. For Linux, I think I'd be just as happy
> with a single non-parallelized test-chainlint run for each file. But
> maybe on Windows the startup overhead is worse? OTOH, the whole test run
> is so much worse there. One process per script is not going to be that
> much in relative terms either way.

Somehow Windows manages to be unbelievably slow no matter what. I
mentioned elsewhere (after you sent this) that I tested on a five or
six year old 8-core dual-boot machine. Booted to Linux, running a
single chainlint.pl invocation using all 8 cores to check all scripts
in the project took under 1 second walltime. The same machine booted
to Windows using all 8 cores took just under two minutes(!) walltime
for the single Perl invocation to check all scripts in the project.

So, at this point, I have no hope for making linting fast on Windows;
it seems to be a lost cause.

> And if we did cache the results and avoid extra invocations via "make",
> then we'd want all the parallelism to move to there anyway.
>
> Maybe that gives you more food for thought about whether perl's "use
> threads" is worth having.

I'm not especially happy about the significant overhead of "ithreads";
on my (old) machine, although it does improve perceived time
significantly, it eats up quite a bit of additional user-time. As
such, I would not be unhappy to see "ithreads" go away, especially
since fast linting on Windows seems unattainable (at least with Perl).

Overall, I think Ævar's plan to parallelize linting via "make" is
probably the way to go.
Ævar Arnfjörð Bjarmason Nov. 21, 2022, 1:28 p.m. UTC | #7
On Sun, Nov 20 2022, Eric Sunshine wrote:

> On Tue, Sep 6, 2022 at 7:27 PM Jeff King <peff@peff.net> wrote:
>> I did some timings the other night, and I found something quite curious
>> with the thread stuff.
>>
>> I was quite surprised that it made things slower! It's nice that we're
>> only calling it once per script instead of once per test, but it seems
>> the startup overhead of the script is really high.
>>
>> And since in this mode we're only feeding it one script at a time, I
>> tried reverting the "chainlint.pl: validate test scripts in parallel"
>> commit. And indeed, now things are much faster:
>>
>>   Benchmark 1: make
>>     Time (mean ± σ):     61.544 s ±  3.364 s    [User: 556.486 s, System: 384.001 s]
>>     Range (min … max):   57.660 s … 63.490 s    3 runs
>>
>> And you can see the same thing just running chainlint by itself:
>>
>>   $ time perl chainlint.pl /dev/null
>>   real  0m0.069s
>>   user  0m0.042s
>>   sys   0m0.020s
>>
>>   $ git revert HEAD^{/validate.test.scripts.in.parallel}
>>   $ time perl chainlint.pl /dev/null
>>   real  0m0.014s
>>   user  0m0.010s
>>   sys   0m0.004s
>>
>> I didn't track down the source of the slowness. Maybe it's loading extra
>> modules, or maybe it's opening /proc/cpuinfo, or maybe it's the thread
>> setup. But it's a surprising slowdown.
>
> It is surprising, and unfortunate. Ditching "ithreads" would probably
> be a good idea. (more on that below)
>
>> Now of course your intent is to do a single repo-wide invocation. And
>> that is indeed a bit faster. Here it is without the parallel code:
>>
>>   Benchmark 1: make
>>     Time (mean ± σ):     61.727 s ±  2.140 s    [User: 507.712 s, System: 377.753 s]
>>     Range (min … max):   59.259 s … 63.074 s    3 runs
>>
>> The wall-clock time didn't improve much, but the CPU time did. Restoring
>> the parallel code does improve the wall-clock time a bit, but at the
>> cost of some extra CPU:
>>
>>   Benchmark 1: make
>>     Time (mean ± σ):     59.029 s ±  2.851 s    [User: 515.690 s, System: 380.369 s]
>>     Range (min … max):   55.736 s … 60.693 s    3 runs
>>
>> which makes sense. If I do a with/without of just "make test-chainlint",
>> the parallelism is buying a few seconds of wall-clock:
>>
>>   Benchmark 1: make test-chainlint
>>     Time (mean ± σ):     900.1 ms ± 102.9 ms    [User: 12049.8 ms, System: 79.7 ms]
>>     Range (min … max):   704.2 ms … 994.4 ms    10 runs
>>
>>   Benchmark 1: make test-chainlint
>>     Time (mean ± σ):      3.778 s ±  0.042 s    [User: 3.756 s, System: 0.023 s]
>>     Range (min … max):    3.706 s …  3.833 s    10 runs
>>
>> I'm not sure what it all means. For Linux, I think I'd be just as happy
>> with a single non-parallelized test-chainlint run for each file. But
>> maybe on Windows the startup overhead is worse? OTOH, the whole test run
>> is so much worse there. One process per script is not going to be that
>> much in relative terms either way.
>
> Somehow Windows manages to be unbelievably slow no matter what. I
> mentioned elsewhere (after you sent this) that I tested on a five or
> six year old 8-core dual-boot machine. Booted to Linux, running a
> single chainlint.pl invocation using all 8 cores to check all scripts
> in the project took under 1 second walltime. The same machine booted
> to Windows using all 8 cores took just under two minutes(!) walltime
> for the single Perl invocation to check all scripts in the project.
>
> So, at this point, I have no hope for making linting fast on Windows;
> it seems to be a lost cause.

I'd be really interested in seeing e.g. the NYTProf output for that run,
compared with that on *nix (if you could upload the HTML versions of
both somewhere, even better).

Maybe "chainlint.pl" is doing something odd, but this goes against the
usual wisdom about what is and isn't slow in Perl on windows, as I
understand it.

I.e. process start-up etc. is slow there, and I/O's a bit slower, but
once you're started up and e.g. slurping up all of those files & parsing
them you're just running "perl-native" code.

Which shouldn't be much slower at all. A perl compiled with ithreads is
(last I checked) around 10-20% slower, and the Windows version is always
compiled with that (it's needed for "fork" emulation).

But most *nix versions are compiled with that too, and certainly the one
you're using with "threads", so that's not the difference.

So I suspect something odd's going on...

>> And if we did cache the results and avoid extra invocations via "make",
>> then we'd want all the parallelism to move to there anyway.
>>
>> Maybe that gives you more food for thought about whether perl's "use
>> threads" is worth having.
>
> I'm not especially happy about the significant overhead of "ithreads";
> on my (old) machine, although it does improve perceived time
> significantly, it eats up quite a bit of additional user-time. As
> such, I would not be unhappy to see "ithreads" go away, especially
> since fast linting on Windows seems unattainable (at least with Perl).
>
> Overall, I think Ævar's plan to parallelize linting via "make" is
> probably the way to go.

Yeah, but that seems to me to be orthogonal to why it's this slow on
Windows, and if it is, that wouldn't help much, except for incremental
re-runs.
Eric Sunshine Nov. 21, 2022, 2:07 p.m. UTC | #8
On Mon, Nov 21, 2022 at 8:32 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Sun, Nov 20 2022, Eric Sunshine wrote:
> > Somehow Windows manages to be unbelievably slow no matter what. I
> > mentioned elsewhere (after you sent this) that I tested on a five or
> > six year old 8-core dual-boot machine. Booted to Linux, running a
> > single chainlint.pl invocation using all 8 cores to check all scripts
> > in the project took under 1 second walltime. The same machine booted
> > to Windows using all 8 cores took just under two minutes(!) walltime
> > for the single Perl invocation to check all scripts in the project.
>
> I'd be really interested in seeing e.g. the NYTProf output for that run,
> compared with that on *nix (if you could upload the HTML versions of
> both somewhere, even better).

Unfortunately, I no longer have access to that machine, or usable
Windows in general. Of course, someone else with access to a dual-boot
machine could generate such a report, but whether anyone will offer to
do so is a different matter.

> Maybe "chainlint.pl" is doing something odd, but this goes against the
> usual wisdom about what is and isn't slow in Perl on windows, as I
> understand it.
>
> I.e. process start-up etc. is slow there, and I/O's a bit slower, but
> once you're started up and e.g. slurping up all of those files & parsing
> them you're just running "perl-native" code.
>
> Which shouldn't be much slower at all. A perl compiled with ithreads is
> (last I checked) around 10-20% slower, and the Windows version is always
> compiled with that (it's needed for "fork" emulation).
>
> But most *nix versions are compiled with that too, and certainly the one
> you're using with "threads", so that's not the difference.
>
> So I suspect something odd's going on...

This is all my understanding, as well, which is why I was so surprised
by the difference in speed. Aside from suspecting Windows I/O as the
culprit, another obvious possible culprit would be whatever
mechanism/primitives "ithreads" is using on Windows for
locking/synchronizing and passing messages between threads. I wouldn't
be surprised to learn that those mechanisms/primitives have very high
overhead on that platform.

> > Overall, I think Ævar's plan to parallelize linting via "make" is
> > probably the way to go.
>
> Yeah, but that seems to me to be orthogonal to why it's this slow on
> Windows, and if it is, that wouldn't help much, except for incremental
> re-runs.

Oh, I didn't at all mean that `make` parallelism would be helpful on
Windows; I can't imagine that it ever would be (though I could once
again be wrong). What I meant was that `make` parallelism would be a
nice improvement and simplification (of sorts), in general,
considering that I've given up hope of ever seeing linting be speedy
on Windows.
Ævar Arnfjörð Bjarmason Nov. 21, 2022, 2:18 p.m. UTC | #9
On Mon, Nov 21 2022, Eric Sunshine wrote:

> On Mon, Nov 21, 2022 at 8:32 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> On Sun, Nov 20 2022, Eric Sunshine wrote:
>> > Somehow Windows manages to be unbelievably slow no matter what. I
>> > mentioned elsewhere (after you sent this) that I tested on a five or
>> > six year old 8-core dual-boot machine. Booted to Linux, running a
>> > single chainlint.pl invocation using all 8 cores to check all scripts
>> > in the project took under 1 second walltime. The same machine booted
>> > to Windows using all 8 cores took just under two minutes(!) walltime
>> > for the single Perl invocation to check all scripts in the project.
>>
>> I'd be really interested in seeing e.g. the NYTProf output for that run,
>> compared with that on *nix (if you could upload the HTML versions of
>> both somewhere, even better).
>
> Unfortunately, I no longer have access to that machine, or usable
> Windows in general. Of course, someone else with access to a dual-boot
> machine could generate such a report, but whether anyone will offer to
> do so is a different matter.

:(

>> Maybe "chainlint.pl" is doing something odd, but this goes against the
>> usual wisdom about what is and isn't slow in Perl on windows, as I
>> understand it.
>>
>> I.e. process start-up etc. is slow there, and I/O's a bit slower, but
>> once you're started up and e.g. slurping up all of those files & parsing
>> them you're just running "perl-native" code.
>>
>> Which shouldn't be much slower at all. A perl compiled with ithreads is
>> (last I checked) around 10-20% slower, and the Windows version is always
>> compiled with that (it's needed for "fork" emulation).
>>
>> But most *nix versions are compiled with that too, and certainly the one
>> you're using with "threads", so that's not the difference.
>>
>> So I suspect something odd's going on...
>
> This is all my understanding, as well, which is why I was so surprised
> by the difference in speed. Aside from suspecting Windows I/O as the
> culprit, another obvious possible culprit would be whatever
> mechanism/primitives "ithreads" is using on Windows for
> locking/synchronizing and passing messages between threads. I wouldn't
> be surprised to learn that those mechanisms/primitives have very high
> overhead on that platform.

Yeah, that could be, but then...

>> > Overall, I think Ævar's plan to parallelize linting via "make" is
>> > probably the way to go.
>>
>> Yeah, but that seems to me to be orthogonal to why it's this slow on
>> Windows, and if it is, that wouldn't help much, except for incremental
>> re-runs.
>
> Oh, I didn't at all mean that `make` parallelism would be helpful on
> Windows; I can't imagine that it ever would be (though I could once
> again be wrong). What I meant was that `make` parallelism would be a
> nice improvement and simplification (of sorts), in general,
> considering that I've given up hope of ever seeing linting be speedy
> on Windows.

...that parallelism probably wouldn't be helpful, as it'll run into
another thing that's slow.

But just ditching the "ithreads" commit from chainlint.pl should make it
much faster, as sequentially parsing all the files isn't that slow, and
as that won't use threads, it should be much faster then.
Eric Sunshine Nov. 21, 2022, 2:48 p.m. UTC | #10
On Mon, Nov 21, 2022 at 9:20 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Mon, Nov 21 2022, Eric Sunshine wrote:
> > Oh, I didn't at all mean that `make` parallelism would be helpful on
> > Windows; I can't imagine that it ever would be (though I could once
> > again be wrong). What I meant was that `make` parallelism would be a
> > nice improvement and simplification (of sorts), in general,
> > considering that I've given up hope of ever seeing linting be speedy
> > on Windows.
>
> But just ditching the "ithreads" commit from chainlint.pl should make it
> much faster, as sequentially parsing all the files isn't that slow, and
> as that won't use threads should be much faster then.

On my (old) machine (with spinning hard drive), `make test-chainlint`
with "ithreads" and warm filesystem cache takes about 3.8 seconds
walltime. Without "ithreads", it takes about 11.3 seconds. So, the
improvement in perceived time is significant. As such, I'm somewhat
hesitant to see "ithreads" dropped from chainlint.pl before `make`
parallelism is implemented. (I can easily see "drop ithreads" as the
final patch of a series which adds `make` parallelism.)

But perhaps I'm focussing too much on my own experience with my old
machine. Maybe linting without "ithreads" and without `make`
parallelism would be "fast enough" for developers using beefier modern
machines... (genuine question/thought since I don't have access to any
beefy modern hardware).
Jeff King Nov. 21, 2022, 6:04 p.m. UTC | #11
On Sun, Nov 20, 2022 at 11:02:54PM -0500, Eric Sunshine wrote:

> > And if we did cache the results and avoid extra invocations via "make",
> > then we'd want all the parallelism to move to there anyway.
> >
> > Maybe that gives you more food for thought about whether perl's "use
> > threads" is worth having.
> 
> I'm not especially happy about the significant overhead of "ithreads";
> on my (old) machine, although it does improve perceived time
> significantly, it eats up quite a bit of additional user-time. As
> such, I would not be unhappy to see "ithreads" go away, especially
> since fast linting on Windows seems unattainable (at least with Perl).
> 
> Overall, I think Ævar's plan to parallelize linting via "make" is
> probably the way to go.

TBH, I think just running the linter once per test script when the
script is run would be sufficient. That is one extra process per script,
but they are already shell scripts running a bunch of processes. You get
parallelism for free because you're already running the tests in
parallel. You lose out on "don't bother linting because the file hasn't
changed", but I'm not sure that's really worth the extra complexity
overall.

-Peff
Eric Sunshine Nov. 21, 2022, 6:47 p.m. UTC | #12
On Mon, Nov 21, 2022 at 1:04 PM Jeff King <peff@peff.net> wrote:
> On Sun, Nov 20, 2022 at 11:02:54PM -0500, Eric Sunshine wrote:
> > Overall, I think Ævar's plan to parallelize linting via "make" is
> > probably the way to go.
>
> TBH, I think just running the linter once per test script when the
> script is run would be sufficient. That is one extra process per script,
> but they are already shell scripts running a bunch of processes. You get
> parallelism for free because you're already running the tests in
> parallel. You lose out on "don't bother linting because the file hasn't
> changed", but I'm not sure that's really worth the extra complexity
> overall.

Hmm, yes, that's appealing (especially since I've essentially given up
on making linting fast on Windows), and it wouldn't be hard to
implement. In fact, it's already implemented by 23a14f3016 (test-lib:
replace chainlint.sed with chainlint.pl, 2022-09-01); making it work
the way you describe would just involve dropping 69b9924b87
(t/Makefile: teach `make test` and `make prove` to run chainlint.pl,
2022-09-01) and 29fb2ec384 (chainlint.pl: validate test scripts in
parallel, 2022-09-01).

I think Ævar's use-case for `make` parallelization was to speed up
git-bisect runs. But thinking about it now, the likelihood of "lint"
problems cropping up during a git-bisect run is effectively nil, in
which case setting GIT_TEST_CHAIN_LINT=1 should be a perfectly
appropriate way to take linting out of the equation when bisecting.
Eric Sunshine Nov. 21, 2022, 6:50 p.m. UTC | #13
On Mon, Nov 21, 2022 at 1:47 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> I think Ævar's use-case for `make` parallelization was to speed up
> git-bisect runs. But thinking about it now, the likelihood of "lint"
> problems cropping up during a git-bisect run is effectively nil, in
> which case setting GIT_TEST_CHAIN_LINT=1 should be a perfectly
> appropriate way to take linting out of the equation when bisecting.

I mean "GIT_TEST_CHAIN_LINT=0", of course.
Jeff King Nov. 21, 2022, 6:52 p.m. UTC | #14
On Mon, Nov 21, 2022 at 01:47:42PM -0500, Eric Sunshine wrote:

> On Mon, Nov 21, 2022 at 1:04 PM Jeff King <peff@peff.net> wrote:
> > On Sun, Nov 20, 2022 at 11:02:54PM -0500, Eric Sunshine wrote:
> > > Overall, I think Ævar's plan to parallelize linting via "make" is
> > > probably the way to go.
> >
> > TBH, I think just running the linter once per test script when the
> > script is run would be sufficient. That is one extra process per script,
> > but they are already shell scripts running a bunch of processes. You get
> > parallelism for free because you're already running the tests in
> > parallel. You lose out on "don't bother linting because the file hasn't
> > changed", but I'm not sure that's really worth the extra complexity
> > overall.
> 
> Hmm, yes, that's appealing (especially since I've essentially given up
> on making linting fast on Windows), and it wouldn't be hard to
> implement. In fact, it's already implemented by 23a14f3016 (test-lib:
> replace chainlint.sed with chainlint.pl, 2022-09-01); making it work
> the way you describe would just involve dropping 69b9924b87
> (t/Makefile: teach `make test` and `make prove` to run chainlint.pl,
> 2022-09-01) and 29fb2ec384 (chainlint.pl: validate test scripts in
> parallel, 2022-09-01).

Yes, that was one of the modes I timed in my original email. :)

> I think Ævar's use-case for `make` parallelization was to speed up
> git-bisect runs. But thinking about it now, the likelihood of "lint"
> problems cropping up during a git-bisect run is effectively nil, in
> which case setting GIT_TEST_CHAIN_LINT=1 should be a perfectly
> appropriate way to take linting out of the equation when bisecting.

Yes. It's also dumb to run a straight "make test" while bisecting in the
first place, because you are going to run a zillion tests that aren't
relevant to your bisection. Bisecting on "cd t && ./test-that-fails" is
faster, at which point you're only running the one lint process (and if
it really bothers you, you can disable chain lint as you suggest).

-Peff
Eric Sunshine Nov. 21, 2022, 7 p.m. UTC | #15
On Mon, Nov 21, 2022 at 1:52 PM Jeff King <peff@peff.net> wrote:
> On Mon, Nov 21, 2022 at 01:47:42PM -0500, Eric Sunshine wrote:
> > I think Ævar's use-case for `make` parallelization was to speed up
> > git-bisect runs. But thinking about it now, the likelihood of "lint"
> > problems cropping up during a git-bisect run is effectively nil, in
> > which case setting GIT_TEST_CHAIN_LINT=1 should be a perfectly
> > appropriate way to take linting out of the equation when bisecting.
>
> Yes. It's also dumb to run a straight "make test" while bisecting in the
> first place, because you are going to run a zillion tests that aren't
> relevant to your bisection. Bisecting on "cd t && ./test-that-fails" is
> faster, at which point you're only running the one lint process (and if
> it really bothers you, you can disable chain lint as you suggest).

I think I misspoke. Dredging up old memories, I think Ævar's use-case
is that he now runs:

    git rebase -i --exec 'make test' ...

in order to ensure that the entire test suite passes for _every_ patch
in a series. (This is due to him having missed a runtime breakage by
only running "make test" after the final patch in a series was
applied, when the breakage was only temporary -- added by one patch,
but resolved by some other later patch.)

Even so, GIT_TEST_CHAIN_LINT=0 should be appropriate here too.
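
Concretely, for that workflow something like this (purely illustrative)
would skip the per-script lint on each --exec run:

	git rebase -i --exec 'GIT_TEST_CHAIN_LINT=0 make test' ...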
Jeff King Nov. 21, 2022, 7:28 p.m. UTC | #16
On Mon, Nov 21, 2022 at 02:00:41PM -0500, Eric Sunshine wrote:

> On Mon, Nov 21, 2022 at 1:52 PM Jeff King <peff@peff.net> wrote:
> > On Mon, Nov 21, 2022 at 01:47:42PM -0500, Eric Sunshine wrote:
> > > I think Ævar's use-case for `make` parallelization was to speed up
> > > git-bisect runs. But thinking about it now, the likelihood of "lint"
> > > problems cropping up during a git-bisect run is effectively nil, in
> > > which case setting GIT_TEST_CHAIN_LINT=1 should be a perfectly
> > > appropriate way to take linting out of the equation when bisecting.
> >
> > Yes. It's also dumb to run a straight "make test" while bisecting in the
> > first place, because you are going to run a zillion tests that aren't
> > relevant to your bisection. Bisecting on "cd t && ./test-that-fails" is
> > faster, at which point you're only running the one lint process (and if
> > it really bothers you, you can disable chain lint as you suggest).
> 
> I think I misspoke. Dredging up old memories, I think Ævar's use-case
> is that he now runs:
> 
>     git rebase -i --exec 'make test' ...
> 
> in order to ensure that the entire test suite passes for _every_ patch
> in a series. (This is due to him having missed a runtime breakage by
> only running "make test" after the final patch in a series was
> applied, when the breakage was only temporary -- added by one patch,
> but resolved by some other later patch.)

Yeah, I do that sometimes, too, especially when heavy refactoring is
involved.

> Even so, GIT_TEST_CHAIN_LINT=0 should be appropriate here too.

Agreed. But also, my original point stands. If you are running 10 CPU
minutes of tests, then a few CPU seconds of linting is not really that
important.

-Peff
Ævar Arnfjörð Bjarmason Nov. 22, 2022, 12:11 a.m. UTC | #17
On Mon, Nov 21 2022, Eric Sunshine wrote:

> On Mon, Nov 21, 2022 at 1:52 PM Jeff King <peff@peff.net> wrote:
>> On Mon, Nov 21, 2022 at 01:47:42PM -0500, Eric Sunshine wrote:
>> > I think Ævar's use-case for `make` parallelization was to speed up
>> > git-bisect runs. But thinking about it now, the likelihood of "lint"
>> > problems cropping up during a git-bisect run is effectively nil, in
>> > which case setting GIT_TEST_CHAIN_LINT=1 should be a perfectly
>> > appropriate way to take linting out of the equation when bisecting.
>>
>> Yes. It's also dumb to run a straight "make test" while bisecting in the
>> first place, because you are going to run a zillion tests that aren't
>> relevant to your bisection. Bisecting on "cd t && ./test-that-fails" is
>> faster, at which point you're only running the one lint process (and if
>> it really bothers you, you can disable chain lint as you suggest).
>
> I think I misspoke. Dredging up old memories, I think Ævar's use-case
> is that he now runs:
>
>     git rebase -i --exec 'make test' ...
>
> in order to ensure that the entire test suite passes for _every_ patch
> in a series. (This is due to him having missed a runtime breakage by
> only running "make test" after the final patch in a series was
> applied, when the breakage was only temporary -- added by one patch,
> but resolved by some other later patch.)
>
> Even so, GIT_TEST_CHAIN_LINT=0 should be appropriate here too.

I'd like to make "make" fast in terms of avoiding its own overhead
before it gets to actual work, mainly because of that use-case, but it
helps in general. E.g. if you switch branches we don't compile a file we
don't need to, and likewise we shouldn't re-run test checks we don't need either.

For t/ this is:

 - Running chainlint.pl on the file, even if it didn't change
 - Ditto check-non-portable-shell.pl
 - Ditto "non-portable file name(s)" check
 - Ditto "test -x" on all test files

I have a branch where these are all checked using dependencies instead,
e.g. we run a "test -x" on t0071-sort.sh and create a
".build/check-executable/t0071-sort.sh.ok" if that passed, so we don't need
to shell out in the common case.

The results of that are, and this is a best case in picking one where
the test itself is cheap:
	
	$ git hyperfine -L rev @{u},HEAD~,HEAD -s 'make CFLAGS=-O3' 'make test T=t0071-sort.sh' -w 1
	Benchmark 1: make test T=t0071-sort.sh' in '@{u}
	  Time (mean ± σ):      1.168 s ±  0.074 s    [User: 1.534 s, System: 0.082 s]
	  Range (min … max):    1.096 s …  1.316 s    10 runs
	
	Benchmark 2: make test T=t0071-sort.sh' in 'HEAD~
	  Time (mean ± σ):     719.1 ms ±  46.1 ms    [User: 910.6 ms, System: 79.7 ms]
	  Range (min … max):   682.0 ms … 828.2 ms    10 runs
	
	Benchmark 3: make test T=t0071-sort.sh' in 'HEAD
	  Time (mean ± σ):     685.0 ms ±  34.2 ms    [User: 645.0 ms, System: 56.8 ms]
	  Range (min … max):   657.6 ms … 773.6 ms    10 runs
	
	Summary
	  'make test T=t0071-sort.sh' in 'HEAD' ran
	    1.05 ± 0.09 times faster than 'make test T=t0071-sort.sh' in 'HEAD~'
	    1.71 ± 0.14 times faster than 'make test T=t0071-sort.sh' in '@{u}'

The @{u} being "master", HEAD~ is "incremental without chainlint.pl", and
"HEAD" is where it's all incremental.

It's very WIP-quality, but I pushed the chainlint.pl part of it as a POC
just now; I did the others a while ago:
https://github.com/avar/git/tree/avar/t-Makefile-break-T-to-file-association

Patch

diff --git a/t/chainlint.pl b/t/chainlint.pl
index d526723ac00..898573a9100 100755
--- a/t/chainlint.pl
+++ b/t/chainlint.pl
@@ -15,9 +15,11 @@ 
 
 use warnings;
 use strict;
+use Config;
 use File::Glob;
 use Getopt::Long;
 
+my $jobs = -1;
 my $show_stats;
 my $emit_all;
 
@@ -569,6 +571,16 @@  if (eval {require Time::HiRes; Time::HiRes->import(); 1;}) {
 	$interval = sub { return Time::HiRes::tv_interval(shift); };
 }
 
+sub ncores {
+	# Windows
+	return $ENV{NUMBER_OF_PROCESSORS} if exists($ENV{NUMBER_OF_PROCESSORS});
+	# Linux / MSYS2 / Cygwin / WSL
+	do { local @ARGV='/proc/cpuinfo'; return scalar(grep(/^processor\s*:/, <>)); } if -r '/proc/cpuinfo';
+	# macOS & BSD
+	return qx/sysctl -n hw.ncpu/ if $^O =~ /(?:^darwin$|bsd)/;
+	return 1;
+}
+
 sub show_stats {
 	my ($start_time, $stats) = @_;
 	my $walltime = $interval->($start_time);
@@ -621,7 +633,9 @@  sub exit_code {
 Getopt::Long::Configure(qw{bundling});
 GetOptions(
 	"emit-all!" => \$emit_all,
+	"jobs|j=i" => \$jobs,
 	"stats|show-stats!" => \$show_stats) or die("option error\n");
+$jobs = ncores() if $jobs < 1;
 
 my $start_time = $getnow->();
 my @stats;
@@ -633,6 +647,40 @@  unless (@scripts) {
 	exit;
 }
 
-push(@stats, check_script(1, sub { shift(@scripts); }, sub { print(@_); }));
+unless ($Config{useithreads} && eval {
+	require threads; threads->import();
+	require Thread::Queue; Thread::Queue->import();
+	1;
+	}) {
+	push(@stats, check_script(1, sub { shift(@scripts); }, sub { print(@_); }));
+	show_stats($start_time, \@stats) if $show_stats;
+	exit(exit_code(\@stats));
+}
+
+my $script_queue = Thread::Queue->new();
+my $output_queue = Thread::Queue->new();
+
+sub next_script { return $script_queue->dequeue(); }
+sub emit { $output_queue->enqueue(@_); }
+
+sub monitor {
+	while (my $s = $output_queue->dequeue()) {
+		print($s);
+	}
+}
+
+my $mon = threads->create({'context' => 'void'}, \&monitor);
+threads->create({'context' => 'list'}, \&check_script, $_, \&next_script, \&emit) for 1..$jobs;
+
+$script_queue->enqueue(@scripts);
+$script_queue->end();
+
+for (threads->list()) {
+	push(@stats, $_->join()) unless $_ == $mon;
+}
+
+$output_queue->end();
+$mon->join();
+
 show_stats($start_time, \@stats) if $show_stats;
 exit(exit_code(\@stats));