Message ID | 20240701220840.GA20631@coredump.intra.peff.net (mailing list archive) |
---|---|
State | Accepted |
Commit | 1d133ae91f7dd5cc6ccd0137a175372460383235 |
Headers | show |
Series | here-doc test bodies | expand |
On Mon, Jul 1, 2024 at 6:08 PM Jeff King <peff@peff.net> wrote: > [...] > This commit adds another option: feeding the snippet via the function's > stdin. This doesn't conflict with anything the snippet would want to do, > because we always redirect its stdin from /dev/null anyway (which we'll > continue to do). > > Signed-off-by: Jeff King <peff@peff.net> > --- > diff --git a/t/README b/t/README > @@ -906,6 +906,14 @@ see test-lib-functions.sh for the full list and their options. > + If <script> is `-` (a single dash), then the script to run is read > + from stdin. This lets you more easily use single quotes within the > + script by using a here-doc. For example: > + > + test_expect_success 'output contains expected string' - <<\EOT > + grep "this string has 'quotes' in it" output > + EOT We lose `chainlint` functionality for test bodies specified in this manner. Restoring such functionality will require some (possibly) not-so-subtle changes. There are at least a couple issues which need to be addressed: (1) chainlint.pl:ScriptParser::parse_cmd() only currently recognizes `test_expect_* [prereq] 'title' 'body'` but will now also need to recognize `test_expect_success [prereq] 'title' - <body-as-here-doc>`. (2) Until now, chainlint.pl has never had to concern itself with the body of a here-doc; it just throws them away. With this new calling convention, here-doc bodies become relevant and must be returned by the lexer. This may involve some not-so-minor surgery.
Eric Sunshine <sunshine@sunshineco.com> writes: > We lose `chainlint` functionality for test bodies specified in this manner. Ouch. > Restoring such functionality will require some (possibly) > not-so-subtle changes. There are at least a couple issues which need > to be addressed: > > (1) chainlint.pl:ScriptParser::parse_cmd() only currently recognizes > `test_expect_* [prereq] 'title' 'body'` but will now also need to > recognize `test_expect_success [prereq] 'title' - <body-as-here-doc>`. > > (2) Until now, chainlint.pl has never had to concern itself with the > body of a here-doc; it just throws them away. With this new calling > convention, here-doc bodies become relevant and must be returned by > the lexer. This may involve some not-so-minor surgery.
On Mon, Jul 01, 2024 at 06:45:19PM -0400, Eric Sunshine wrote: > > @@ -906,6 +906,14 @@ see test-lib-functions.sh for the full list and their options. > > + If <script> is `-` (a single dash), then the script to run is read > > + from stdin. This lets you more easily use single quotes within the > > + script by using a here-doc. For example: > > + > > + test_expect_success 'output contains expected string' - <<\EOT > > + grep "this string has 'quotes' in it" output > > + EOT > > We lose `chainlint` functionality for test bodies specified in this manner. > > Restoring such functionality will require some (possibly) > not-so-subtle changes. There are at least a couple issues which need > to be addressed: > > (1) chainlint.pl:ScriptParser::parse_cmd() only currently recognizes > `test_expect_* [prereq] 'title' 'body'` but will now also need to > recognize `test_expect_success [prereq] 'title' - <body-as-here-doc>`. > > (2) Until now, chainlint.pl has never had to concern itself with the > body of a here-doc; it just throws them away. With this new calling > convention, here-doc bodies become relevant and must be returned by > the lexer. This may involve some not-so-minor surgery. Hmm. The patch below seems to work on a simple test. The lexer stuffs the heredoc into a special variable. Which at first glance feels like a hack versus returning it from the token stream, but the contents really _aren't_ part of that stream. They're a separate magic thing that is found on the stdin of whatever command the tokens represent. And then ScriptParser::parse_cmd() just has to recognize that any "<<" token isn't interesting, and that "-" means "read the here-doc". Obviously we'd want to add to the chainlint tests here. It looks like the current test infrastructure is focused on evaluating snippets, with the test_expect_success part already handled. diff --git a/t/chainlint.pl b/t/chainlint.pl index 1bbd985b78..7eb904afaa 100755 --- a/t/chainlint.pl +++ b/t/chainlint.pl @@ -168,12 +168,15 @@ sub swallow_heredocs { my $self = shift @_; my $b = $self->{buff}; my $tags = $self->{heretags}; + $self->{parser}->{heredoc} = ''; while (my $tag = shift @$tags) { my $start = pos($$b); my $indent = $$tag[0] =~ s/^\t// ? '\\s*' : ''; $$b =~ /(?:\G|\n)$indent\Q$$tag[0]\E(?:\n|\z)/gc; if (pos($$b) > $start) { my $body = substr($$b, $start, pos($$b) - $start); + $self->{parser}->{heredoc} .= + substr($body, 0, length($body) - length($&)); $self->{lineno} += () = $body =~ /\n/sg; next; } @@ -618,6 +621,9 @@ sub check_test { my $self = shift @_; my ($title, $body) = map(unwrap, @_); $self->{ntests}++; + if ($body eq '-') { + $body = $self->{heredoc}; + } my $parser = TestParser->new(\$body); my @tokens = $parser->parse(); my $problems = $parser->{problems}; @@ -648,7 +654,7 @@ sub parse_cmd { my @tokens = $self->SUPER::parse_cmd(); return @tokens unless @tokens && $tokens[0]->[0] =~ /^test_expect_(?:success|failure)$/; my $n = $#tokens; - $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/; + $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\||<<[A-Za-z]+)$/; $self->check_test($tokens[1], $tokens[2]) if $n == 2; # title body $self->check_test($tokens[2], $tokens[3]) if $n > 2; # prereq title body return @tokens; -Peff
On Mon, Jul 01, 2024 at 08:51:45PM -0400, Jeff King wrote: > Hmm. The patch below seems to work on a simple test. > > The lexer stuffs the heredoc into a special variable. Which at first > glance feels like a hack versus returning it from the token stream, but > the contents really _aren't_ part of that stream. They're a separate > magic thing that is found on the stdin of whatever command the tokens > represent. > > And then ScriptParser::parse_cmd() just has to recognize that any "<<" > token isn't interesting, and that "-" means "read the here-doc". BTW, there's one non-obvious thing here about why this works. You'd think that: test_expect_success 'foo' <<\EOT cat <<-\EOF this is a here-doc EOF echo ok EOT wouldn't work, because the lexer only has a single here-doc store, and the inner one is going to overwrite the outer. But we don't lex the inner contents of the test snippet until we've processed the test_expect_success line, at which point we've copied it out. So I dunno. It feels a bit hacky, but I think it's how you have to do it anyway. > @@ -648,7 +654,7 @@ sub parse_cmd { > my @tokens = $self->SUPER::parse_cmd(); > return @tokens unless @tokens && $tokens[0]->[0] =~ /^test_expect_(?:success|failure)$/; > my $n = $#tokens; > - $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/; > + $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\||<<[A-Za-z]+)$/; One curiosity I noted is that the backslash of my "<<\EOT" seems to be eaten by the lexer (I guess because it doesn't know the special meaning of backslash here, and just does the usual "take the next char literally"). I think that is OK for our purposes here, though we might in the long run want to raise a linting error if you accidentally used an interpolating here-doc (it's not strictly wrong to do so, but I think we generally frown on it as a style thing). -Peff
On Mon, Jul 01, 2024 at 08:51:44PM -0400, Jeff King wrote: > Obviously we'd want to add to the chainlint tests here. It looks like > the current test infrastructure is focused on evaluating snippets, with > the test_expect_success part already handled. So doing this (with the patch I showed earlier): diff --git a/t/Makefile b/t/Makefile index b2eb9f770b..7c97aa3673 100644 --- a/t/Makefile +++ b/t/Makefile @@ -106,18 +106,28 @@ clean: clean-except-prove-cache clean-chainlint: $(RM) -r '$(CHAINLINTTMP_SQ)' +CHAINLINTTESTS_SRC = $(patsubst %,chainlint/%.test,$(CHAINLINTTESTS)) check-chainlint: @mkdir -p '$(CHAINLINTTMP_SQ)' && \ for i in $(CHAINLINTTESTS); do \ echo "test_expect_success '$$i' '" && \ sed -e '/^# LINT: /d' chainlint/$$i.test && \ echo "'"; \ done >'$(CHAINLINTTMP_SQ)'/tests && \ + for i in $$(grep -L "'" $(CHAINLINTTESTS_SRC)); do \ + echo "test_expect_success '$$i' - <<\\\\EOT" && \ + sed -e '/^# LINT: /d' $$i && \ + echo "EOT"; \ + done >>'$(CHAINLINTTMP_SQ)'/tests && \ { \ echo "# chainlint: $(CHAINLINTTMP_SQ)/tests" && \ for i in $(CHAINLINTTESTS); do \ echo "# chainlint: $$i" && \ cat chainlint/$$i.expect; \ + done && \ + for i in $$(grep -L "'" $(CHAINLINTTESTS_SRC)); do \ + echo "# chainlint: $$i" && \ + cat $${i%.test}.expect; \ done \ } >'$(CHAINLINTTMP_SQ)'/expect && \ $(CHAINLINT) --emit-all '$(CHAINLINTTMP_SQ)'/tests | \ does pass. It's just running all of the tests inside an "EOT" block. But we have to omit ones that have single quotes in them, because they are making the implicit assumption that they're inside a single-quoted block (so they do things like '"$foo"', or '\'', etc, which behave differently in a here-doc). It was a nice check that the output is the same in both cases, but it's a bit limiting as a test suite, as there's no room to introduce test cases that vary the test_expect_success lines. I'm thinking the path forward may be: 1. Move the test_expect_success wrapping lines into each chainlint/*.test file. It's a little bit of extra boilerplate, but it makes them a bit easier to reason about on their own. 2. Add a few new tests that use here-docs with a few variations ("<<EOT", "<<\EOT", probably a here-doc inside the test here-doc). Does that sound OK to you? -Peff
On Mon, Jul 1, 2024 at 8:51 PM Jeff King <peff@peff.net> wrote: > On Mon, Jul 01, 2024 at 06:45:19PM -0400, Eric Sunshine wrote: > > We lose `chainlint` functionality for test bodies specified in this manner. > > Hmm. The patch below seems to work on a simple test. > > The lexer stuffs the heredoc into a special variable. Which at first > glance feels like a hack versus returning it from the token stream, but > the contents really _aren't_ part of that stream. They're a separate > magic thing that is found on the stdin of whatever command the tokens > represent. I created a white-room fix for this issue, as well, before taking a look at your patch. The two implementations bear a strong similarity which suggests that we agree upon the basic approach. My implementation, however, takes a more formal and paranoid stance. Rather than squirreling away only the most-recently-seen heredoc body, it stores each heredoc body along with the tag which introduced it. This makes it robust against cases when multiple heredocs are initiated on the same line (even within different parse contexts): cat <<EOFA && x=$(cat <<EOFB && A body EOFA B body EOFB Of course, that's not likely to come up in the context of test_expect_* calls, but I prefer the added robustness over the more lax approach. > And then ScriptParser::parse_cmd() just has to recognize that any "<<" > token isn't interesting, and that "-" means "read the here-doc". In my implementation, the `<<` token is "interesting" because the heredoc tag is attached to it, and the tag is needed to pluck the heredoc body from the set of saved bodies (since my implementation doesn't assume most-recently-seen body is the correct one). > Obviously we'd want to add to the chainlint tests here. It looks like > the current test infrastructure is focused on evaluating snippets, with > the test_expect_success part already handled. Yes, the "snippet" approach is a throwback to the old chainlint.sed implementation when there wasn't any actual parsing going on. As you note, this unfortunately does not allow for testing parsing-related aspects of the implementation, which is a limitation I most definitely felt when chainlint.pl was implemented. It probably would be a good idea to update the infrastructure to allow for more broad testing but that doesn't need to be part of the changes being discussed here. > diff --git a/t/chainlint.pl b/t/chainlint.pl > @@ -168,12 +168,15 @@ sub swallow_heredocs { > if (pos($$b) > $start) { > my $body = substr($$b, $start, pos($$b) - $start); > + $self->{parser}->{heredoc} .= > + substr($body, 0, length($body) - length($&)); > $self->{lineno} += () = $body =~ /\n/sg; In my implementation, I use regex to strip off the ending tag before storing the heredoc body. When I later looked at your implementation, I noticed that you used substr() -- which seems preferable -- but discovered that it strips too much in some cases. For instance, in t0600, I saw that: cat >expected <<-\EOF && HEAD PSEUDO_WT_HEAD refs/bisect/wt-random refs/heads/main refs/heads/wt-main EOF was getting stripped down to: HEAD PSEUDO_WT_HEAD refs/bisect/wt-random refs/heads/main refs/heads/wt-ma{{missing-nl}} It wasn't immediately obvious why this was happening, though I didn't spend a lot of time trying to debug it. Although I think my implementation is complete, I haven't submitted it yet because I discovered that the changes you made to t1404 are triggering false-positives: # chainlint: t1404-update-ref-errors.sh # chainlint: existing loose ref is a simple prefix of new 120 prefix=refs/1l && 121 test_update_rejected a c e false b c/x d \ 122 '$prefix/c' exists; ?!AMP?! cannot create '$prefix/c/x' Unfortunately, I ran out of time, thus haven't tracked down this problem yet. I also haven't tested your implementation yet to determine if this is due to a change I made or due to a deeper existing issue with chainlint.pl.
On Mon, Jul 1, 2024 at 9:13 PM Jeff King <peff@peff.net> wrote: > On Mon, Jul 01, 2024 at 08:51:45PM -0400, Jeff King wrote: > > And then ScriptParser::parse_cmd() just has to recognize that any "<<" > > token isn't interesting, and that "-" means "read the here-doc". > > BTW, there's one non-obvious thing here about why this works. You'd > think that: > > test_expect_success 'foo' <<\EOT > cat <<-\EOF > this is a here-doc > EOF > echo ok > EOT > > wouldn't work, because the lexer only has a single here-doc store, and > the inner one is going to overwrite the outer. But we don't lex the > inner contents of the test snippet until we've processed the > test_expect_success line, at which point we've copied it out. > > So I dunno. It feels a bit hacky, but I think it's how you have to do it > anyway. It wasn't non-obvious to me, but I suppose it's because I know the author, or I am the author, or something. > > - $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/; > > + $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\||<<[A-Za-z]+)$/; > > One curiosity I noted is that the backslash of my "<<\EOT" seems to be > eaten by the lexer (I guess because it doesn't know the special meaning > of backslash here, and just does the usual "take the next char > literally"). That's not the reason. It actively strips the backslash because it knows that it doesn't care about it after this point and, more importantly, because it needs to extract the raw heredoc tag name (without the slash or other surrounding quotes) so that it can match upon that name (say, "EOF") to find the end of the heredoc body. It's mostly an accident of implementation (and probably a throwback to chainlint.sed) that it strips the backslash early in Lexer::scan_heredoc_tag() even though it doesn't actually have to be stripped until Lexer::swallow_heredocs() needs to match the tag name to find the end of the heredoc body. Thus, in retrospect, the implementation could have retained the backslash (`\EOF`) or quotes (`'EOF'` or `"EOF"`) and left it for swallow_heredocs() to strip them only when needed. There's another weird throwback to chainlint.sed in Lexer::scan_heredoc_tag() where it transforms `<<-` into `<<\t`, which is potentially more than a little confusing, especially since it is (I believe) totally unnecessary in the context of chainlint.pl. > I think that is OK for our purposes here, though we might > in the long run want to raise a linting error if you accidentally used > an interpolating here-doc (it's not strictly wrong to do so, but I think > we generally frown on it as a style thing). Such a linting warning would probably have to be context-sensitive so it only triggers for test_expect_* calls.
On Tue, Jul 2, 2024 at 5:19 PM Jeff King <peff@peff.net> wrote: > On Mon, Jul 01, 2024 at 08:51:44PM -0400, Jeff King wrote: > > Obviously we'd want to add to the chainlint tests here. It looks like > > the current test infrastructure is focused on evaluating snippets, with > > the test_expect_success part already handled. > > So doing this (with the patch I showed earlier): > > diff --git a/t/Makefile b/t/Makefile > @@ -106,18 +106,28 @@ clean: clean-except-prove-cache > + for i in $$(grep -L "'" $(CHAINLINTTESTS_SRC)); do \ > + echo "test_expect_success '$$i' - <<\\\\EOT" && \ > + sed -e '/^# LINT: /d' $$i && \ > + echo "EOT"; \ > + done >>'$(CHAINLINTTMP_SQ)'/tests && \ Unfortunately, `grep -L` is not POSIX. > does pass. It's just running all of the tests inside an "EOT" block. But > we have to omit ones that have single quotes in them, because they are > making the implicit assumption that they're inside a single-quoted block > (so they do things like '"$foo"', or '\'', etc, which behave differently > in a here-doc). > > It was a nice check that the output is the same in both cases, but it's > a bit limiting as a test suite, as there's no room to introduce test > cases that vary the test_expect_success lines. Agreed. It feels rather hacky and awfully special-case, as it's only (additionally) checking that the `test_expect_* title - <<EOT` form works, but doesn't help at all with testing other parsing-related behaviors of chainlint.pl (which is something I definitely wanted to be able to do when implementing the Perl version). > I'm thinking the path forward may be: > > 1. Move the test_expect_success wrapping lines into each > chainlint/*.test file. It's a little bit of extra boilerplate, but > it makes them a bit easier to reason about on their own. Yes. This is exactly what I had in mind for moving forward. It's just a one-time noise-patch cost but gives us much more flexibility in terms of testing. It also makes spot-testing the chainlint self-test files much simpler. We would be able to do this: ./chainlint.pl chainlint/block.test rather than much more painful: { echo "test_expect_success foo '" && cat chainlint/block.test && echo "'"; } >dummy && ./chainlint.pl dummy; rm dummy or something similar. > 2. Add a few new tests that use here-docs with a few variations > ("<<EOT", "<<\EOT", probably a here-doc inside the test here-doc). > > Does that sound OK to you? Absolutely. I'm very much in favor of these changes.
On Tue, Jul 2, 2024 at 5:25 PM Eric Sunshine <sunshine@sunshineco.com> wrote: > On Mon, Jul 1, 2024 at 8:51 PM Jeff King <peff@peff.net> wrote: > > my $body = substr($$b, $start, pos($$b) - $start); > > + $self->{parser}->{heredoc} .= > > + substr($body, 0, length($body) - length($&)); > > $self->{lineno} += () = $body =~ /\n/sg; > > In my implementation, I use regex to strip off the ending tag before > storing the heredoc body. When I later looked at your implementation, > I noticed that you used substr() -- which seems preferable -- but > discovered that it strips too much in some cases. [...] Nevermind this part. I just looked again at the misbehaving code (which I had commented out but not deleted) and noticed that I botched the implementation in two distinct ways. With those botches removed, the substr() approach works just fine.
On Tue, Jul 2, 2024 at 5:25 PM Eric Sunshine <sunshine@sunshineco.com> wrote: > Although I think my implementation is complete, I haven't submitted it > yet because I discovered that the changes you made to t1404 are > triggering false-positives: > > # chainlint: t1404-update-ref-errors.sh > # chainlint: existing loose ref is a simple prefix of new > 120 prefix=refs/1l && > 121 test_update_rejected a c e false b c/x d \ > 122 '$prefix/c' exists; ?!AMP?! cannot create '$prefix/c/x' > > Unfortunately, I ran out of time, thus haven't tracked down this > problem yet. This is also now fixed. It wasn't any deep problem, just a minor oversight.
On Tue, Jul 02, 2024 at 05:59:46PM -0400, Eric Sunshine wrote: > > diff --git a/t/Makefile b/t/Makefile > > @@ -106,18 +106,28 @@ clean: clean-except-prove-cache > > + for i in $$(grep -L "'" $(CHAINLINTTESTS_SRC)); do \ > > + echo "test_expect_success '$$i' - <<\\\\EOT" && \ > > + sed -e '/^# LINT: /d' $$i && \ > > + echo "EOT"; \ > > + done >>'$(CHAINLINTTMP_SQ)'/tests && \ > > Unfortunately, `grep -L` is not POSIX. Yeah, this was just for illustration. Even if it were portable, I don't think it's a good direction. :) > > 1. Move the test_expect_success wrapping lines into each > > chainlint/*.test file. It's a little bit of extra boilerplate, but > > it makes them a bit easier to reason about on their own. > > Yes. This is exactly what I had in mind for moving forward. It's just > a one-time noise-patch cost but gives us much more flexibility in > terms of testing. > > It also makes spot-testing the chainlint self-test files much simpler. > We would be able to do this: > > ./chainlint.pl chainlint/block.test > > rather than much more painful: > > { echo "test_expect_success foo '" && cat chainlint/block.test && > echo "'"; } >dummy && ./chainlint.pl dummy; rm dummy Oh, nice. Having just written new chainlint tests, this made checking them _way_ easier. > > 2. Add a few new tests that use here-docs with a few variations > > ("<<EOT", "<<\EOT", probably a here-doc inside the test here-doc). > > > > Does that sound OK to you? > > Absolutely. I'm very much in favor of these changes. Great! I have patches which I'll send out in a moment. -Peff
On Tue, Jul 02, 2024 at 05:25:48PM -0400, Eric Sunshine wrote: > I created a white-room fix for this issue, as well, before taking a > look at your patch. The two implementations bear a strong similarity > which suggests that we agree upon the basic approach. > > My implementation, however, takes a more formal and paranoid stance. > Rather than squirreling away only the most-recently-seen heredoc body, > it stores each heredoc body along with the tag which introduced it. > This makes it robust against cases when multiple heredocs are > initiated on the same line (even within different parse contexts): > > cat <<EOFA && x=$(cat <<EOFB && > A body > EOFA > B body > EOFB > > Of course, that's not likely to come up in the context of > test_expect_* calls, but I prefer the added robustness over the more > lax approach. Yes, that's so much better than what I wrote. I didn't engage my brain very much when I read the in-code comments about multiple tags on the same line, and I thought you meant: cat <<FOO <<BAR this is foo FOO this is bar BAR which is...weird. It does "work" in the sense that "FOO" is a here-doc that should be skipped past. But it is not doing anything useful; cat sees only "this is bar" on stdin. So even for this case, the appending behavior that my patch does would not make sense. And of course for the actual useful thing, which you wrote above, appending is just nonsense. Recording and accessing by tag is the right thing. > > And then ScriptParser::parse_cmd() just has to recognize that any "<<" > > token isn't interesting, and that "-" means "read the here-doc". > > In my implementation, the `<<` token is "interesting" because the > heredoc tag is attached to it, and the tag is needed to pluck the > heredoc body from the set of saved bodies (since my implementation > doesn't assume most-recently-seen body is the correct one). Ah, OK. So it would probably not be that big of a deal to record a single bit for "this heredoc is interpolated". But until we have anything useful to do with that information, let's not worry about it for now. > > diff --git a/t/chainlint.pl b/t/chainlint.pl > > @@ -168,12 +168,15 @@ sub swallow_heredocs { > > if (pos($$b) > $start) { > > my $body = substr($$b, $start, pos($$b) - $start); > > + $self->{parser}->{heredoc} .= > > + substr($body, 0, length($body) - length($&)); > > $self->{lineno} += () = $body =~ /\n/sg; > > In my implementation, I use regex to strip off the ending tag before > storing the heredoc body. When I later looked at your implementation, > I noticed that you used substr() -- which seems preferable -- but > discovered that it strips too much in some cases. For instance, in > t0600, I saw that: Yeah, I was afraid of trying another regex, just because there are optional bits (like indentation) that we'd have to account for. Since $& contains the match already, that's all taken care of by the existing regex. From your follow-up, it sounds like the substr() approach does work (*phew*). -Peff
On Sat, Jul 06, 2024 at 01:31:05AM -0400, Jeff King wrote: > > > And then ScriptParser::parse_cmd() just has to recognize that any "<<" > > > token isn't interesting, and that "-" means "read the here-doc". > > > > In my implementation, the `<<` token is "interesting" because the > > heredoc tag is attached to it, and the tag is needed to pluck the > > heredoc body from the set of saved bodies (since my implementation > > doesn't assume most-recently-seen body is the correct one). > > Ah, OK. So it would probably not be that big of a deal to record a > single bit for "this heredoc is interpolated". But until we have > anything useful to do with that information, let's not worry about it > for now. Oh, oops. I attached this response to the wrong message (I read them all through before starting to respond). My response here was about the fact that "<<\EOT" does not record the "\" anywhere from the lexer. But yes, for your implementation, we do need to recognize "<<\EOT", etc, to pull out "EOT". -Peff
On Tue, Jul 02, 2024 at 05:37:39PM -0400, Eric Sunshine wrote: > > BTW, there's one non-obvious thing here about why this works. You'd > > think that: > > > > test_expect_success 'foo' <<\EOT > > cat <<-\EOF > > this is a here-doc > > EOF > > echo ok > > EOT > > > > wouldn't work, because the lexer only has a single here-doc store, and > > the inner one is going to overwrite the outer. But we don't lex the > > inner contents of the test snippet until we've processed the > > test_expect_success line, at which point we've copied it out. > > > > So I dunno. It feels a bit hacky, but I think it's how you have to do it > > anyway. > > It wasn't non-obvious to me, but I suppose it's because I know the > author, or I am the author, or something. :) I had a brief moment of panic where I thought "wait, what I sent out is going to break in this case!" and then was surprised when it worked. > > > - $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\|)$/; > > > + $n-- while $n >= 0 && $tokens[$n]->[0] =~ /^(?:[;&\n|]|&&|\|\||<<[A-Za-z]+)$/; > > > > One curiosity I noted is that the backslash of my "<<\EOT" seems to be > > eaten by the lexer (I guess because it doesn't know the special meaning > > of backslash here, and just does the usual "take the next char > > literally"). > > That's not the reason. It actively strips the backslash because it > knows that it doesn't care about it after this point and, more > importantly, because it needs to extract the raw heredoc tag name > (without the slash or other surrounding quotes) so that it can match > upon that name (say, "EOF") to find the end of the heredoc body. > > It's mostly an accident of implementation (and probably a throwback to > chainlint.sed) that it strips the backslash early in > Lexer::scan_heredoc_tag() even though it doesn't actually have to be > stripped until Lexer::swallow_heredocs() needs to match the tag name > to find the end of the heredoc body. Thus, in retrospect, the > implementation could have retained the backslash (`\EOF`) or quotes > (`'EOF'` or `"EOF"`) and left it for swallow_heredocs() to strip them > only when needed. OK. I think it does make things easier to normalize this a bit, so that ScriptParser::parse_cmd() doesn't have to worry about all of the various spellings. If we recorded a single bit for "this was quoted" alongside the heredoc contents, that would be plenty. But as I (erroneously) said elsewhere, we can worry about that later if we find something useful to do with it. > There's another weird throwback to chainlint.sed in > Lexer::scan_heredoc_tag() where it transforms `<<-` into `<<\t`, which > is potentially more than a little confusing, especially since it is (I > believe) totally unnecessary in the context of chainlint.pl. Ah, I hadn't noticed that. Looks like we use it in swallow_heredocs() to read the tag data itself. But importantly the token stream still has the correct original in it, which we need to correctly match in ScriptParser::parse_cmd(). > > I think that is OK for our purposes here, though we might > > in the long run want to raise a linting error if you accidentally used > > an interpolating here-doc (it's not strictly wrong to do so, but I think > > we generally frown on it as a style thing). > > Such a linting warning would probably have to be context-sensitive so > it only triggers for test_expect_* calls. Yes, definitely. -Peff
On Sat, Jul 6, 2024 at 1:31 AM Jeff King <peff@peff.net> wrote: > On Tue, Jul 02, 2024 at 05:25:48PM -0400, Eric Sunshine wrote: > > My implementation, however, takes a more formal and paranoid stance. > > Rather than squirreling away only the most-recently-seen heredoc body, > > it stores each heredoc body along with the tag which introduced it. > > This makes it robust against cases when multiple heredocs are > > initiated on the same line (even within different parse contexts): > > > > cat <<EOFA && x=$(cat <<EOFB && > > A body > > EOFA > > B body > > EOFB > > > > Of course, that's not likely to come up in the context of > > test_expect_* calls, but I prefer the added robustness over the more > > lax approach. > > Yes, that's so much better than what I wrote. I didn't engage my brain > very much when I read the in-code comments about multiple tags on the > same line, and I thought you meant: > > cat <<FOO <<BAR > this is foo > FOO > this is bar > BAR > > which is...weird. It does "work" in the sense that "FOO" is a here-doc > that should be skipped past. But it is not doing anything useful; cat > sees only "this is bar" on stdin. So even for this case, the appending > behavior that my patch does would not make sense. > > And of course for the actual useful thing, which you wrote above, > appending is just nonsense. Recording and accessing by tag is the right > thing. In retrospect, I think my claim is bogus in the context of ScriptParser::parse_cmd(). Specifically, ScriptParser::parse_cmd() calls its parent ShellParser::parse_cmd() to latch one command. ShellParser::parse_cmd() stops parsing as soon as it encounters a command terminator (i.e. `;`, `&&`, `||`, `|`, '&', '\n') and returns the command. Moreover, by definition, given the language specification, the lexer only consumes the heredocs upon encountering `\n`. Thus, if someone writes: test_expect_success title - <<\EOT && whatever && ...test body... EOT then ScriptParser::parse_cmd() will receive the command `test_expect_success title -` from ShellParser::parse_cmd() but the heredoc will not yet have been consumed by the lexer since it hasn't yet encountered the newline[1]. So, the above example simply can't work correctly given the way ScriptParser::parse_cmd() calls ScriptParser::check_test() as soon as it encounters a `test_expect_success/failure` invocation since it doesn't know if the heredocs have been latched at that point. To make it properly robust, rather than immediately calling check_test(), it would have to continue consuming commands, and saving the ones which match `test_expect_success/failure` invocation, until it finally hits a `\n`, and only then call check_test() with each command it saved. But that's probably overkill at this point considering that we never write code like the above, so the submitted patch[2] is probably good enough for now. FOOTNOTES [1] One might rightly ask that if ShellParser::parse_cmd() returns immediately upon seeing a command terminator (i.e. `;`, `&&`, etc.), then how is it that even a simple: test_expect_success title - <<\EOT && ...test body... EOT can work correctly since the `\n` comes after the `&&`. The answer is that, as a special case, the very last thing ShellParser::parse_cmd() does is peek ahead to see if a `\n` follows the command terminator (assuming the terminator is not itself a `\n`). When the next token is indeed a `\n`, that peek operation causes the lexer to consume the heredocs. [2]: https://lore.kernel.org/git/20240702235034.88219-1-ericsunshine@charter.net/
On Sat, Jul 6, 2024 at 2:11 AM Eric Sunshine <sunshine@sunshineco.com> wrote: > So, the above example simply can't work correctly given the way > ScriptParser::parse_cmd() calls ScriptParser::check_test() as soon as > it encounters a `test_expect_success/failure` invocation since it > doesn't know if the heredocs have been latched at that point. To make > it properly robust, rather than immediately calling check_test(), it > would have to continue consuming commands, and saving the ones which > match `test_expect_success/failure` invocation, until it finally hits > a `\n`, and only then call check_test() with each command it saved. > But that's probably overkill at this point considering that we never > write code like the above, so the submitted patch[2] is probably good > enough for now. Of course, the more I think about it, the more I dislike relying upon what is effectively an accident of implementation; i.e. that in the typical case, the heredoc will already have been latched by the time ScriptParser::parse_cmd() has identified a `test_expect_success` command, due to the fact that ShellParser::parse_cmd() has that special case which peeks for `\n` immediately following some other command terminator. As such, fixing ScriptParser::parse_cmd() to only call check_test() once it is sure that a '\n' has been encountered is becoming more appealing, though it is of course a more invasive and fundamental change than the posted patch.
On Sat, Jul 06, 2024 at 02:11:13AM -0400, Eric Sunshine wrote: > > > cat <<EOFA && x=$(cat <<EOFB && > > > A body > > > EOFA > > > B body > > > EOFB > [...] > In retrospect, I think my claim is bogus in the context of > ScriptParser::parse_cmd(). Specifically, ScriptParser::parse_cmd() > calls its parent ShellParser::parse_cmd() to latch one command. > ShellParser::parse_cmd() stops parsing as soon as it encounters a > command terminator (i.e. `;`, `&&`, `||`, `|`, '&', '\n') and returns > the command. Moreover, by definition, given the language > specification, the lexer only consumes the heredocs upon encountering > `\n`. Thus, if someone writes: > > test_expect_success title - <<\EOT && whatever && > ...test body... > EOT > > then ScriptParser::parse_cmd() will receive the command > `test_expect_success title -` from ShellParser::parse_cmd() but the > heredoc will not yet have been consumed by the lexer since it hasn't > yet encountered the newline[1]. > > So, the above example simply can't work correctly given the way > ScriptParser::parse_cmd() calls ScriptParser::check_test() as soon as > it encounters a `test_expect_success/failure` invocation since it > doesn't know if the heredocs have been latched at that point. Ah, yeah, I think you're right. I had parsed your example in my mind as: cat <<EOFA $(cat <<EOFB) without an intervening "&&" (taking the second here-doc as an argument to the original command). Which _does_ work with your patch. > To make it properly robust, rather than immediately calling > check_test(), it would have to continue consuming commands, and saving > the ones which match `test_expect_success/failure` invocation, until > it finally hits a `\n`, and only then call check_test() with each > command it saved. But that's probably overkill at this point > considering that we never write code like the above, so the submitted > patch[2] is probably good enough for now. Yep, I'd agree with all of that. -Peff
On Sat, Jul 06, 2024 at 02:47:57AM -0400, Eric Sunshine wrote: > On Sat, Jul 6, 2024 at 2:11 AM Eric Sunshine <sunshine@sunshineco.com> wrote: > > So, the above example simply can't work correctly given the way > > ScriptParser::parse_cmd() calls ScriptParser::check_test() as soon as > > it encounters a `test_expect_success/failure` invocation since it > > doesn't know if the heredocs have been latched at that point. To make > > it properly robust, rather than immediately calling check_test(), it > > would have to continue consuming commands, and saving the ones which > > match `test_expect_success/failure` invocation, until it finally hits > > a `\n`, and only then call check_test() with each command it saved. > > But that's probably overkill at this point considering that we never > > write code like the above, so the submitted patch[2] is probably good > > enough for now. > > Of course, the more I think about it, the more I dislike relying upon > what is effectively an accident of implementation; i.e. that in the > typical case, the heredoc will already have been latched by the time > ScriptParser::parse_cmd() has identified a `test_expect_success` > command, due to the fact that ShellParser::parse_cmd() has that > special case which peeks for `\n` immediately following some other > command terminator. As such, fixing ScriptParser::parse_cmd() to only > call check_test() once it is sure that a '\n' has been encountered is > becoming more appealing, though it is of course a more invasive and > fundamental change than the posted patch. Rats, I just agreed with your earlier email. ;) I am OK with the slightly hacky version we've posted (modulo the fixes I discussed elsewhere). But if you want to take a little time to explore the more robust fix, I am happy to review it. -Peff
On Sat, Jul 6, 2024 at 2:55 AM Jeff King <peff@peff.net> wrote: > On Sat, Jul 06, 2024 at 02:47:57AM -0400, Eric Sunshine wrote: > > Of course, the more I think about it, the more I dislike relying upon > > what is effectively an accident of implementation; i.e. that in the > > typical case, the heredoc will already have been latched by the time > > ScriptParser::parse_cmd() has identified a `test_expect_success` > > command, due to the fact that ShellParser::parse_cmd() has that > > special case which peeks for `\n` immediately following some other > > command terminator. As such, fixing ScriptParser::parse_cmd() to only > > call check_test() once it is sure that a '\n' has been encountered is > > becoming more appealing, though it is of course a more invasive and > > fundamental change than the posted patch. > > Rats, I just agreed with your earlier email. ;) I am OK with the > slightly hacky version we've posted (modulo the fixes I discussed > elsewhere). But if you want to take a little time to explore the more > robust fix, I am happy to review it. The primary reason I said "the more I dislike relying upon ... an accident of implementation" is that this limitation is not documented anywhere other than in this email thread. That said, I don't mind the posted version of the patch being picked up. The "correct" approach can always be implemented atop it at a later time.
diff --git a/t/README b/t/README index d9e0e07506..dec644f997 100644 --- a/t/README +++ b/t/README @@ -906,6 +906,14 @@ see test-lib-functions.sh for the full list and their options. 'git-write-tree should be able to write an empty tree.' \ 'tree=$(git-write-tree)' + If <script> is `-` (a single dash), then the script to run is read + from stdin. This lets you more easily use single quotes within the + script by using a here-doc. For example: + + test_expect_success 'output contains expected string' - <<\EOT + grep "this string has 'quotes' in it" output + EOT + If you supply three parameters the first will be taken to be a prerequisite; see the test_set_prereq and test_have_prereq documentation below: diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 427b375b39..803ed2df39 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -872,6 +872,24 @@ test_verify_prereq () { BUG "'$test_prereq' does not look like a prereq" } +# assign the variable named by "$1" with the contents of "$2"; +# if "$2" is "-", then read stdin into "$1" instead +test_body_or_stdin () { + if test "$2" != "-" + then + eval "$1=\$2" + return + fi + + # start with a newline, to match hanging newline from open-quote style + eval "$1=\$LF" + local test_line + while IFS= read -r test_line + do + eval "$1=\${$1}\${test_line}\${LF}" + done +} + test_expect_failure () { test_start_ "$@" test "$#" = 3 && { test_prereq=$1; shift; } || test_prereq= @@ -881,9 +899,11 @@ test_expect_failure () { export test_prereq if ! test_skip "$@" then + local test_body + test_body_or_stdin test_body "$2" test -n "$test_skip_test_preamble" || - say >&3 "checking known breakage of $TEST_NUMBER.$test_count '$1': $2" - if test_run_ "$2" expecting_failure + say >&3 "checking known breakage of $TEST_NUMBER.$test_count '$1': $test_body" + if test_run_ "$test_body" expecting_failure then test_known_broken_ok_ "$1" else @@ -902,13 +922,15 @@ test_expect_success () { export test_prereq if ! test_skip "$@" then + local test_body + test_body_or_stdin test_body "$2" test -n "$test_skip_test_preamble" || - say >&3 "expecting success of $TEST_NUMBER.$test_count '$1': $2" - if test_run_ "$2" + say >&3 "expecting success of $TEST_NUMBER.$test_count '$1': $test_body" + if test_run_ "$test_body" then test_ok_ "$1" else - test_failure_ "$@" + test_failure_ "$1" "$test_body" fi fi test_finish_
Most test snippets are wrapped in single quotes, like: test_expect_success 'some description' ' do_something ' This sometimes makes the snippets awkward to write, because you can't easily use single quotes within them. We sometimes work around this with $SQ, or by loosening regexes to use "." instead of a literal quote, or by using double quotes when we'd prefer to use single-quotes (and just adding extra backslash-escapes to avoid interpolation). This commit adds another option: feeding the snippet via the function's stdin. This doesn't conflict with anything the snippet would want to do, because we always redirect its stdin from /dev/null anyway (which we'll continue to do). A few notes on the implementation: - it would be nice to push this down into test_run_, but we can't, as test_expect_success and test_expect_failure want to see the actual script content to report it for verbose-mode. A helper function limits the amount of duplication in those callers here. - The helper function is a little awkward to call, as you feed it the name of the variable you want to set. The more natural thing in shell would be command substitution like: body=$(body_or_stdin "$2") but that loses trailing whitespace. There are tricks around this, like: body=$(body_or_stdin "$2"; printf .) body=${body%.} but we'd prefer to keep such tricks in the helper, not in each caller. - I implemented the helper using a sequence of "read" calls. Together with "-r" and unsetting the IFS, this preserves incoming whitespace. An alternative is to use "cat" (which then requires the gross "." trick above). But this saves us a process, which is probably a good thing. The "read" builtin does use more read() syscalls than necessary (one per byte), but that is almost certainly a win over a separate process. Both are probably slower than passing a single-quoted string, but the difference is lost in the noise for a script that I converted as an experiment. - I handle test_expect_success and test_expect_failure here. If we like this style, we could easily extend it to other spots (e.g., lazy_prereq bodies) on top of this patch. - even though we are using "local", we have to be careful about our variable names. Within test_expect_success, any variable we declare with local will be seen as local by the test snippets themselves (so it wouldn't persist between tests like normal variables would). Signed-off-by: Jeff King <peff@peff.net> --- t/README | 8 ++++++++ t/test-lib-functions.sh | 32 +++++++++++++++++++++++++++----- 2 files changed, 35 insertions(+), 5 deletions(-)