
[02/18] chainlint.pl: add POSIX shell lexical analyzer

Message ID c1042b9bcd94b9ecb0bf73dfbd4334b9f30ba99a.1661992197.git.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 7d4804731ed642b92b516908fb93397b08e986bf
Series make test "linting" more comprehensive

Commit Message

Eric Sunshine Sept. 1, 2022, 12:29 a.m. UTC
From: Eric Sunshine <sunshine@sunshineco.com>

Begin fleshing out chainlint.pl by adding a lexical analyzer for the
POSIX shell command language. The sole entry point Lexer::scan_token()
returns the next token from the input. It will be called by the upcoming
shell language parser.

Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
---
 t/chainlint.pl | 177 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 177 insertions(+)

Comments

Ævar Arnfjörð Bjarmason Sept. 1, 2022, 12:32 p.m. UTC | #1
On Thu, Sep 01 2022, Eric Sunshine via GitGitGadget wrote:

> From: Eric Sunshine <sunshine@sunshineco.com>

Just generally on this series:

> +	$tag =~ s/['"\\]//g;

I think this would be a *lot* easier to read if all of these little
regex decls could be split out into some "grammar" class, or other
helper module/namespace. So e.g.:

	my $SCRIPT_QUOTE_RX = qr/['"\\]/;

Then:

> +	return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;

	my $SCRIPT_WHATEVER_RX = qr/
		^(?:
		&&
		|
		\|\|
		[...]
	/x;

etc., i.e. we could then make use of /x to add inline comments to these.

Eric Sunshine Sept. 3, 2022, 6 a.m. UTC | #2
On Thu, Sep 1, 2022 at 8:35 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Thu, Sep 01 2022, Eric Sunshine via GitGitGadget wrote:
> Just generally on this series:
>
> > +     $tag =~ s/['"\\]//g;
>
> I think this would be a *lot* easier to read if all of these little
> regex decls could be split out into some "grammar" class, or other
> helper module/namespace. So e.g.:
>
>         my $SCRIPT_QUOTE_RX = qr/['"\\]/;

Taken out of context (as in the quoted snippet), it may indeed be
difficult to understand what that line is doing; however, in context
with a meaningful function name:

    sub scan_heredoc_tag {
        ...
        my $tag = $self->scan_token();
        $tag =~ s/['"\\]//g;
        push(@{$self->{heretags}}, $indented ? "\t$tag" : "$tag");
        ...
    }

for someone who is familiar with common heredoc tag quoting/escaping
(i.e. <<'EOF', <<"EOF", <<\EOF), I find the inline character class
`['"\\]` much easier to understand than some opaque name such as
$SCRIPT_QUOTE_RX, doubly so because the definition of the named regex
might be far removed from the actual code which uses it, which would
require going and studying that definition before being able to
understand what this code is doing.
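
(To make that concrete: the quoting being stripped is the optional
quoting/escaping of the tag itself. For instance, in:

    cat <<"EOF"
    hello
    EOF

the tag is written as "EOF", but the line which terminates the body is
the bare, unquoted word EOF, which is exactly the form that
swallow_heredocs() later needs to look for.)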

I grasp you made that name up on the fly as an example, but that does
highlight another reason why I'd be hesitant to try to pluck out and
name these regexes. Specifically, naming is hard and I don't trust
that I could come up with succinct meaningful names which would convey
what a regex does as well as the actual regex itself conveys what it
does. In context within the well-named function, `s/['"\\]//g` is
obviously stripping quoting/escaping from the tag name; trying to come
up with a succinct yet accurate name to convey that intention is
difficult. And this is just one example. The script is littered with
little regexes like this, and they are almost all unique, thus making
the task of inventing succinct meaningful names extra difficult. And,
as noted above, I'm not at all convinced that plucking the regex out
of its natural context -- thus making the reader go elsewhere to find
the definition of the regex -- would help improve comprehension.

> Then:
>
> > +     return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;
>
>         my $SCRIPT_WHATEVER_RX = qr/
>                 ^(?:
>                 &&
>                 |
>                 \|\|
>                 [...]
>         /x;
>
> etc., i.e. we could then make use of /x to add inline comments to these.

`/x` does make this slightly easier to grok, and this is an example
of a regex which might be easy to name (i.e. $TWO_CHAR_OPERATOR), but
-- extra mandatory escaping aside -- it's not hard to understand this
one as-is; it's pretty obvious that it's looking for operators `&&`,
`||`, `>>`, `;;`, `<&`, `>&`, `<>`, and `>|`.
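
FWIW, if we did go that route for this particular one, I'd expect it to
end up looking something like this (untested sketch):

    my $TWO_CHAR_OPERATOR = qr/
        ^(?:
          &&    # AND list
        | \|\|  # OR list
        | >>    # appending output redirection
        | ;;    # "case" arm terminator
        | <&    # duplicate input file descriptor
        | >&    # duplicate output file descriptor
        | <>    # open for reading and writing
        | >\|   # force output redirection (overrides "noclobber")
        )$
    /x;

    return $cc if $cc =~ $TWO_CHAR_OPERATOR;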

Patch

diff --git a/t/chainlint.pl b/t/chainlint.pl
index e8ab95c7858..81ffbf28bf3 100755
--- a/t/chainlint.pl
+++ b/t/chainlint.pl
@@ -21,6 +21,183 @@  use Getopt::Long;
 my $show_stats;
 my $emit_all;
 
+# Lexer tokenizes POSIX shell scripts. It is roughly modeled after section 2.3
+# "Token Recognition" of POSIX chapter 2 "Shell Command Language". Although
+# similar to lexical analyzers for other languages, this one differs in a few
+# substantial ways due to quirks of the shell command language.
+#
+# For instance, in many languages, newline is just whitespace like space or
+# TAB, but in shell a newline is a command separator, thus a distinct lexical
+# token. A newline is significant and returned as a distinct token even at the
+# end of a shell comment.
+#
+# In other languages, `1+2` would typically be scanned as three tokens
+# (`1`, `+`, and `2`), but in shell it is a single token. However, the similar
+# `1 + 2`, which embeds whitespace, is scanned as three tokens in shell, as well.
+# In shell, several characters with special meaning lose that meaning when not
+# surrounded by whitespace. For instance, the negation operator `!` is special
+# when standing alone surrounded by whitespace; whereas in `foo!uucp` it is
+# just a plain character in the longer token "foo!uucp". In many other
+# languages, `"string"/foo:'string'` might be scanned as five tokens ("string",
+# `/`, `foo`, `:`, and 'string'), but in shell, it is just a single token.
+#
+# The lexical analyzer for the shell command language is also somewhat unusual
+# in that it recursively invokes the parser to handle the body of `$(...)`
+# expressions which can contain arbitrary shell code. Such expressions may be
+# encountered both inside and outside of double-quoted strings.
+#
+# The lexical analyzer is responsible for consuming shell here-doc bodies which
+# extend from the line following a `<<TAG` operator until a line consisting
+# solely of `TAG`. Here-doc consumption begins when a newline is encountered.
+# It is legal for multiple here-doc `<<TAG` operators to be present on a single
+# line, in which case their bodies must be present one following the next, and
+# are consumed in the (left-to-right) order the `<<TAG` operators appear on the
+# line. A special complication is that the bodies of all here-docs must be
+# consumed when the newline is encountered even if the parse context depth has
+# changed. For instance, in `cat <<A && x=$(cat <<B &&\n`, bodies of here-docs
+# "A" and "B" must be consumed even though "A" was introduced outside the
+# recursive parse context in which "B" was introduced and in which the newline
+# is encountered.
+package Lexer;
+
+sub new {
+	my ($class, $parser, $s) = @_;
+	bless {
+		parser => $parser,
+		buff => $s,
+		heretags => []
+	} => $class;
+}
+
+sub scan_heredoc_tag {
+	my $self = shift @_;
+	${$self->{buff}} =~ /\G(-?)/gc;
+	my $indented = $1;
+	my $tag = $self->scan_token();
+	$tag =~ s/['"\\]//g;
+	push(@{$self->{heretags}}, $indented ? "\t$tag" : "$tag");
+	return "<<$indented$tag";
+}
+
+sub scan_op {
+	my ($self, $c) = @_;
+	my $b = $self->{buff};
+	return $c unless $$b =~ /\G(.)/sgc;
+	my $cc = $c . $1;
+	return scan_heredoc_tag($self) if $cc eq '<<';
+	return $cc if $cc =~ /^(?:&&|\|\||>>|;;|<&|>&|<>|>\|)$/;
+	pos($$b)--;
+	return $c;
+}
+
+sub scan_sqstring {
+	my $self = shift @_;
+	${$self->{buff}} =~ /\G([^']*'|.*\z)/sgc;
+	return "'" . $1;
+}
+
+sub scan_dqstring {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	my $s = '"';
+	while (1) {
+		# slurp up non-special characters
+		$s .= $1 if $$b =~ /\G([^"\$\\]+)/gc;
+		# handle special characters
+		last unless $$b =~ /\G(.)/sgc;
+		my $c = $1;
+		$s .= '"', last if $c eq '"';
+		$s .= '$' . $self->scan_dollar(), next if $c eq '$';
+		if ($c eq '\\') {
+			$s .= '\\', last unless $$b =~ /\G(.)/sgc;
+			$c = $1;
+			next if $c eq "\n"; # line splice
+			# backslash escapes only $, `, ", \ in dq-string
+			$s .= '\\' unless $c =~ /^[\$`"\\]$/;
+			$s .= $c;
+			next;
+		}
+		die("internal error scanning dq-string '$c'\n");
+	}
+	return $s;
+}
+
+sub scan_balanced {
+	my ($self, $c1, $c2) = @_;
+	my $b = $self->{buff};
+	my $depth = 1;
+	my $s = $c1;
+	while ($$b =~ /\G([^\Q$c1$c2\E]*(?:[\Q$c1$c2\E]|\z))/gc) {
+		$s .= $1;
+		$depth++, next if $s =~ /\Q$c1\E$/;
+		$depth--;
+		last if $depth == 0;
+	}
+	return $s;
+}
+
+sub scan_subst {
+	my $self = shift @_;
+	my @tokens = $self->{parser}->parse(qr/^\)$/);
+	$self->{parser}->next_token(); # closing ")"
+	return @tokens;
+}
+
+sub scan_dollar {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	return $self->scan_balanced('(', ')') if $$b =~ /\G\((?=\()/gc; # $((...))
+	return '(' . join(' ', $self->scan_subst()) . ')' if $$b =~ /\G\(/gc; # $(...)
+	return $self->scan_balanced('{', '}') if $$b =~ /\G\{/gc; # ${...}
+	return $1 if $$b =~ /\G(\w+)/gc; # $var
+	return $1 if $$b =~ /\G([@*#?$!0-9-])/gc; # $*, $1, $$, etc.
+	return '';
+}
+
+sub swallow_heredocs {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	my $tags = $self->{heretags};
+	while (my $tag = shift @$tags) {
+		my $indent = $tag =~ s/^\t// ? '\\s*' : '';
+		$$b =~ /(?:\G|\n)$indent\Q$tag\E(?:\n|\z)/gc;
+	}
+}
+
+sub scan_token {
+	my $self = shift @_;
+	my $b = $self->{buff};
+	my $token = '';
+RESTART:
+	$$b =~ /\G[ \t]+/gc; # skip whitespace (but not newline)
+	return "\n" if $$b =~ /\G#[^\n]*(?:\n|\z)/gc; # comment
+	while (1) {
+		# slurp up non-special characters
+		$token .= $1 if $$b =~ /\G([^\\;&|<>(){}'"\$\s]+)/gc;
+		# handle special characters
+		last unless $$b =~ /\G(.)/sgc;
+		my $c = $1;
+		last if $c =~ /^[ \t]$/; # whitespace ends token
+		pos($$b)--, last if length($token) && $c =~ /^[;&|<>(){}\n]$/;
+		$token .= $self->scan_sqstring(), next if $c eq "'";
+		$token .= $self->scan_dqstring(), next if $c eq '"';
+		$token .= $c . $self->scan_dollar(), next if $c eq '$';
+		$self->swallow_heredocs(), $token = $c, last if $c eq "\n";
+		$token = $self->scan_op($c), last if $c =~ /^[;&|<>]$/;
+		$token = $c, last if $c =~ /^[(){}]$/;
+		if ($c eq '\\') {
+			$token .= '\\', last unless $$b =~ /\G(.)/sgc;
+			$c = $1;
+			next if $c eq "\n" && length($token); # line splice
+			goto RESTART if $c eq "\n"; # line splice
+			$token .= '\\' . $c;
+			next;
+		}
+		die("internal error scanning character '$c'\n");
+	}
+	return length($token) ? $token : undef;
+}
+
 package ScriptParser;
 
 sub new {