p5311: handle spaces in wc(1) output

Message ID a682e2c8-fecc-906e-0ff6-93de2b311d14@web.de (mailing list archive)
State New, archived

Commit Message

René Scharfe Oct. 2, 2021, 8:33 p.m. UTC
Some implementations of wc(1) align their output with leading spaces,
even when just a single number is requested, e.g. with "wc -c".  p5311
runs all tests successfully on such a platform, but fails to aggregate
their results and reports:

   # passed all 33 test(s)
   1..33
   bad input line:    57144

Use the helper function test_file_size to get the number without any
spaces in a portable way to avoid the issue.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/perf/README                      | 2 +-
 t/perf/p5311-pack-bitmaps-fetch.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--
2.33.0
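
The fix relies on a helper that prints a bare number. As a rough illustration of the same idea in plain shell, arithmetic expansion can normalize whatever padding the local wc(1) adds. The name `file_size_bytes` is hypothetical, and this sketch is not how git's test_file_size is implemented:

```shell
# Hypothetical helper (not git's test_file_size): print a file's size in
# bytes with no surrounding whitespace, whatever the local wc(1) prints.
file_size_bytes () {
	# Arithmetic expansion evaluates the padded number and re-emits it
	# as a bare integer.
	echo $(( $(wc -c <"$1") ))
}
```

Output in that form would pass the digits-only check that rejected "   57144" in the report above.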

Comments

Taylor Blau Oct. 3, 2021, 5:14 a.m. UTC | #1
On Sat, Oct 02, 2021 at 10:33:18PM +0200, René Scharfe wrote:
> Some implementations of wc(1) align their output with leading spaces,
> even when just a single number is requested, e.g. with "wc -c".  p5311
> runs all tests successfully on such a platform, but fails to aggregate
> their results and reports:

This makes sense, and makes me think that wc's platform-specific
implementations are too tricky to use when we are being picky about
leading spaces.

In other words, I think that your fix is absolutely correct, but I
wonder if test_size should be friendlier in what it accepts and
chomp off any leading space. So perhaps something like the below would
work without any modification to p5311.

--- 8< ---

Subject: [PATCH] t/perf/aggregate.perl: tolerate leading spaces

When using `test_size` with `wc -c`, users on certain platforms can run
into issues when `wc` emits leading space characters in its output,
which confuses get_times.

Callers could switch to use test_file_size instead of `wc -c` (the
former never prints leading space characters, so will always work with
test_size regardless of platform), but this is an easy enough spot to
miss that we should teach get_times to be more tolerant of the input it
accepts.

Teach get_times to do just that by stripping any leading space
characters.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 t/perf/aggregate.perl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
index 82c0df4553..575d2000cc 100755
--- a/t/perf/aggregate.perl
+++ b/t/perf/aggregate.perl
@@ -17,8 +17,8 @@ sub get_times {
 		my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
 		return ($rt, $4, $5);
 	# size
-	} elsif ($line =~ /^\d+$/) {
-		return $&;
+	} elsif ($line =~ /^\s*(\d+)$/) {
+		return $1;
 	} else {
 		die "bad input line: $line";
 	}
--
2.33.0.96.g73915697e6
Ævar Arnfjörð Bjarmason Oct. 3, 2021, 8:04 a.m. UTC | #2
On Sun, Oct 03 2021, Taylor Blau wrote:

> On Sat, Oct 02, 2021 at 10:33:18PM +0200, René Scharfe wrote:
>> Some implementations of wc(1) align their output with leading spaces,
>> even when just a single number is requested, e.g. with "wc -c".  p5311
>> runs all tests successfully on such a platform, but fails to aggregate
>> their results and reports:
>
> This makes sense, and makes me think that wc's platform-specific
> implementations are too tricky to use when we are being picky about
> leading spaces.
>
> In other words, I think that your fix is absolutely correct, but I
> wonder if test_size should be friendlier in what it accepts and
> chomp off any leading space. So perhaps something like the below would
> work without any modification to p5311.
>
> --- 8< ---
>
> Subject: [PATCH] t/perf/aggregate.perl: tolerate leading spaces
>
> When using `test_size` with `wc -c`, users on certain platforms can run
> into issues when `wc` emits leading space characters in its output,
> which confuses get_times.
>
> Callers could switch to use test_file_size instead of `wc -c` (the
> former never prints leading space characters, so will always work with
> test_size regardless of platform), but this is an easy enough spot to
> miss that we should teach get_times to be more tolerant of the input it
> accepts.
>
> Teach get_times to do just that by stripping any leading space
> characters.
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  t/perf/aggregate.perl | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
> index 82c0df4553..575d2000cc 100755
> --- a/t/perf/aggregate.perl
> +++ b/t/perf/aggregate.perl
> @@ -17,8 +17,8 @@ sub get_times {
>  		my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
>  		return ($rt, $4, $5);
>  	# size
> -	} elsif ($line =~ /^\d+$/) {
> -		return $&;
> +	} elsif ($line =~ /^\s*(\d+)$/) {
> +		return $1;
>  	} else {
>  		die "bad input line: $line";
>  	}

This approach seems like a bit of plastering over the real problem. It's
fine to use the output of "wc -l" or "wc -c" in the context of the
shell's whitespace handling. That's why in various places we do:

    test $(wc -l <$file) = 1

Or similar, but *don't* put that $() in double-quotes. I.e. we're
relying on the shell's whitespace semantics.

So isn't it better to just pass this through the shell's own handling
before emitting the data, something like this POC:

    $ stripspace() { var=$1; echo $@; }; x=$(stripspace "  hi" "  there "); echo "\"$x\""
    "hi there"

Of course fixing it up after that in Perl will work just as well, so I
guess this is just an aesthetic preference for having the shell handle
the shell's output issues with what's guaranteed to be shell-portable
solutions... :)
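
The word-splitting behaviour described in this message can be seen in isolation with a stand-in for a padding wc(1). This is only a sketch: `padded_count` is a made-up function, and the default $IFS is assumed.

```shell
# Made-up stand-in for a wc(1) that pads its output with leading spaces.
padded_count () {
	printf '   57144\n'
}

# Quoted substitution: the padding survives (command substitution only
# removes the trailing newline).
kept="$(padded_count)"

# Unquoted substitution: field splitting on the default $IFS drops the
# padding before echo sees its argument.
stripped=$(echo $(padded_count))

# The unquoted comparison idiom works for the same reason.
test $(padded_count) = 57144 && echo "unquoted comparison matches"
```

Here `kept` still carries the three leading spaces, while `stripped` holds only the digits.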
Jeff King Oct. 4, 2021, 7:43 a.m. UTC | #3
On Sun, Oct 03, 2021 at 01:14:49AM -0400, Taylor Blau wrote:

> On Sat, Oct 02, 2021 at 10:33:18PM +0200, René Scharfe wrote:
> > Some implementations of wc(1) align their output with leading spaces,
> > even when just a single number is requested, e.g. with "wc -c".  p5311
> > runs all tests successfully on such a platform, but fails to aggregate
> > their results and reports:
> 
> This makes sense, and makes me think that wc's platform-specific
> implementations are too tricky to use when we are being picky about
> leading spaces.
> 
> In other words, I think that your fix is absolutely correct, but I
> wonder if test_size should be friendlier in what it accepts and
> chomp off any leading space. So perhaps something like the below would
> work without any modification to p5311.

I do like this direction, because by centralizing, it's one less thing
for perf-script writers to mess up. And not only does it fix "wc -c",
but it is more friendly to any other tools (since test_size can really
be used with any scalar magnitude measurement we like; our current tests
just happen to use wc).

But...

> Subject: [PATCH] t/perf/aggregate.perl: tolerate leading spaces
> 
> When using `test_size` with `wc -c`, users on certain platforms can run
> into issues when `wc` emits leading space characters in its output,
> which confuses get_times.
> 
> Callers could switch to use test_file_size instead of `wc -c` (the
> former never prints leading space characters, so will always work with
> test_size regardless of platform), but this is an easy enough spot to
> miss that we should teach get_times to be more tolerant of the input it
> accepts.
> 
> Teach get_times to do just that by stripping any leading space
> characters.

This leaves the extra whitespace inside the test-results/foo.results
file, which is a bit unfortunate, just because anything else besides
aggregate.perl will have to do the same workaround. So we've traded one
gotcha for another. ;)

I don't have a strong opinion on which is worse. The ideal would be for
test_size() itself to handle it, though it's a bit awkward because it is
literally just redirecting the output of the test snippet into the
result file. It's probably not worth spending a ton of effort on that.

> diff --git a/t/perf/aggregate.perl b/t/perf/aggregate.perl
> index 82c0df4553..575d2000cc 100755
> --- a/t/perf/aggregate.perl
> +++ b/t/perf/aggregate.perl
> @@ -17,8 +17,8 @@ sub get_times {
>  		my $rt = ((defined $1 ? $1 : 0.0)*60+$2)*60+$3;
>  		return ($rt, $4, $5);
>  	# size
> -	} elsif ($line =~ /^\d+$/) {
> -		return $&;
> +	} elsif ($line =~ /^\s*(\d+)$/) {
> +		return $1;

If we do go this route, it might be nice to ignore trailing whitespace,
too (I don't think it matters for wc, but just for general
friendliness). I'm tempted even to say that it should just drop the
anchors and match "\d+" anywhere, but perhaps that is a recipe for
mistakes (if somebody writes "foo 1234" we probably want to detect and
complain).

-Peff
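
The "tolerate surrounding whitespace but keep the anchors" idea can be approximated in shell for experimentation. This is a sketch with sed rather than the Perl in aggregate.perl, and `parse_size` is a hypothetical name:

```shell
# Hypothetical helper: print the number if the line is digits with
# optional leading/trailing whitespace; print nothing otherwise.
parse_size () {
	printf '%s\n' "$1" |
	sed -n 's/^[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p'
}

parse_size '   57144  '   # padded number: accepted
parse_size 'foo 1234'     # stray text: rejected, prints nothing
```

Because the pattern stays anchored at both ends, a line such as "foo 1234" still fails to match and can be reported as bad input.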
Junio C Hamano Oct. 4, 2021, 4:16 p.m. UTC | #4
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> This approach seems like a bit of plastering over the real problem. It's
> fine to use the output of "wc -l" or "wc -c" in the context of the
> shell's whitespace handling. That's why in various places we do:

Sorry, but I am confused.

>     test $(wc -l <$file) = 1
>
> Or similar, but *don't* put that $() in double-quotes. I.e. we're
> relying on the shell's whitespace semantics.
>
> So isn't it better to just pass this through the shell's own handling
> before emitting the data, something like this POC:
>
>     $ stripspace() { var=$1; echo $@; }; x=$(stripspace "  hi" "  there "); echo "\"$x\""
>     "hi there"

None of the above is wrong per se, but if I read the scaffolding
code correctly, the way the output from "wc -c" is used is not via a
variable, but

    test_size_ () {
            say >&3 "running: $2"
            if test_eval_ "$2" 3>"$base".result; then
                    test_ok_ "$1"
            else
                    test_failure_ "$@"
            fi
    }

    test_size () {
            test_wrapper_ test_size_ "$@"
    }

where "$2" gets the script given to test_size, e.g.

 	test_size "size   $title" '
		wc -c <tmp.pack
 	'

the "wc -c" command.  And we just let the command emit its output to
"$base.result" (test_eval_ does the stdout-to-#3 redirection, and we
redirect #3 back to the file here).  So I am not quite sure where in
the current system your suggestion to apply the "substitution will
lose $IFS whitespace around values and get word-split if you omit dq
around it" would fit to address the issue at hand.

> Of course fixing it up after that in Perl will work just as well, so I
> guess this is just an aesthetic preference for having the shell handle
> the shell's output issues with what's guaranteed to be shell-portable
> solutions... :)

Meaning we could rewrite aggregation in shell, then we can say we
are not making Perl clean up after mess sh creates?  I dunno...

Thanks.
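
For the shell's word splitting to apply, the snippet's output would first have to be captured rather than redirected straight into the result file. A minimal sketch of that shape follows. It assumes a hypothetical `record_size` wrapper that reads the snippet's stdout; the real test_size_ goes through test_eval_ and file descriptor 3, which this does not reproduce.

```shell
# Hypothetical wrapper: run a snippet, normalize its output to a bare
# number, and write that number to a result file.
record_size () {
	result_file=$1
	raw=$(eval "$2") || return 1
	# Unquoted expansion: field splitting discards surrounding
	# whitespace. Globbing is disabled while splitting.
	set -f
	set -- $raw
	set +f
	# Expect exactly one field made up only of digits.
	test $# -eq 1 || return 1
	case "$1" in
	''|*[!0-9]*) return 1 ;;
	esac
	printf '%s\n' "$1" >"$result_file"
}
```

With something along these lines the result file itself would hold a clean number, so other consumers besides aggregate.perl would not need their own workaround.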
Patch

diff --git a/t/perf/README b/t/perf/README
index fb9127a66f..802402d738 100644
--- a/t/perf/README
+++ b/t/perf/README
@@ -190,7 +190,7 @@  shown in the aggregated output. For example:
 	'

 	test_size 'output size'
-		wc -c <foo.out
+		test_file_size foo.out
 	'

 might produce output like:
diff --git a/t/perf/p5311-pack-bitmaps-fetch.sh b/t/perf/p5311-pack-bitmaps-fetch.sh
index 47c3fd7581..ed0c7570d7 100755
--- a/t/perf/p5311-pack-bitmaps-fetch.sh
+++ b/t/perf/p5311-pack-bitmaps-fetch.sh
@@ -33,7 +33,7 @@  for days in 1 2 4 8 16 32 64 128; do
 	'

 	test_size "size   $title" '
-		wc -c <tmp.pack
+		test_file_size tmp.pack
 	'

 	test_perf "client $title" '