diff mbox

Expansion-lookalikes in heredoc delimiters

Message ID 20180315102730.GA32076@gondor.apana.org.au (mailing list archive)
State Accepted
Delegated to: Herbert Xu
Headers show

Commit Message

Herbert Xu March 15, 2018, 10:27 a.m. UTC
Harald van Dijk <harald@gigawatt.nl> wrote:
>
> Changing dash to support this would in my opinion be unreasonable, as it 
> would require a total rewrite of the parser.

I don't think it's that hard.

---8<---
Subject: parser: Fix backquote support in here-document EOF mark

Currently using backquotes in a here-document EOF mark is broken
because dash tries to do command substitution on it.  This patch
fixes it by checking whether we're looking for an EOF mark during
tokenisation.

Reported-by: Harald van Dijk <harald@gigawatt.nl>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Harald van Dijk March 15, 2018, 11:41 a.m. UTC | #1
On 15/03/2018 11:27, Herbert Xu wrote:
> Harald van Dijk <harald@gigawatt.nl> wrote:
>>
>> Changing dash to support this would in my opinion be unreasonable, as it
>> would require a total rewrite of the parser.
> 
> I don't think it's that hard.

It is if you want to do it the way POSIX specifies. You're adding a 
special exception in the parser. I don't see how this approach can be 
extended to handle the other examples in my mail:

   cat <<`one two`
   ok
   `one two`

`one two` is a single word, so should be allowed as a heredoc delimiter. 
By treating ` as a literal and not looking for the matching closing `, 
this only ends up taking `one as the delimiter, does it not?

And to hopefully better demonstrate the need to not introduce special 
parsing exceptions:

   cat <<$(case x in 0) echo bug ;; esac)
   ok
   $(case x in 0) echo bug ;; esac)

Yes, this actually works in bash/zsh/yash.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 15, 2018, 2:52 p.m. UTC | #2
On Thu, Mar 15, 2018 at 12:41:10PM +0100, Harald van Dijk wrote:
>
> It is if you want to do it the way POSIX specifies. You're adding a special
> exception in the parser. I don't see how this approach can be extended to
> handle the other examples in my mail:

I don't think it's exactly clear what POSIX says on this.

>   cat <<`one two`
>   ok
>   `one two`

Try this one:

	cat << ${
	${

bash/ksh93 both will get stuck forever until a right brace is
entered.  While other shells all behave the way dash does.
Considering the fact that even if you closed the brace after
the newline bash would still be stuck forever, I think this
behaviour is suboptimal.  ksh93 seems to do the right thing though.

Another interesting thing to try is

	cat << ${ foo
	${

Also you can look at the quotes:

	cat << "
	"

IOW it's not clear that "word" in this context necessarily follows
the same rules used during normal tokenisation.

Anyway, dash has had CHKEOFMARK since 2007 so this is just an
extension of existing behaviour.

Cheers,
Harald van Dijk March 15, 2018, 4:29 p.m. UTC | #3
On 15/03/2018 15:52, Herbert Xu wrote:
> On Thu, Mar 15, 2018 at 12:41:10PM +0100, Harald van Dijk wrote:
>>
>> It is if you want to do it the way POSIX specifies. You're adding a special
>> exception in the parser. I don't see how this approach can be extended to
>> handle the other examples in my mail:
> 
> I don't think it's exactly clear what POSIX says on this.

I don't think it's clear what POSIX means to say, but I do think it's 
clear what it does say.

>>    cat <<`one two`
>>    ok
>>    `one two`
> 
> Try this one:
> 
> 	cat << ${
> 	${
> 
> bash/ksh93 both will get stuck forever until a right brace is
> entered.

That's because POSIX specifies that after ${, everything up to the 
matching }, not including nested strings, expansions, etc., is part of 
the word. No exception is made when it spans multiple lines.

Another instance of this is

   if false; then
     echo ${
   fi
   }
   fi
   echo ok

This is accepted by bash, ksh, zsh and posh.

However, in this instance, I'm having trouble finding where, but IIRC, 
POSIX says or means to say that it's okay to flag this as an invalid 
parameter expansion at parse time even though the evaluation would 
otherwise be skipped.

> While other shells all behave the way dash does.

All? At least two others don't: yash rejects ${ as invalid because of 
the missing parameter name. zsh agrees with bash and ksh that it should 
look for the closing }.

> Considering the fact that even if you closed the brace after
> the newline bash would still be stuck forever,

That's because bash doesn't ever match multi-line heredoc delimiters. 
Which is what POSIX technically requires:

   "The here-document shall be treated as a single word that begins 
after the next <newline> and continues until there is a line containing 
only the delimiter and a <newline>, with no <blank> characters in between."

No single line will ever match a multi-line heredoc delimiter.

> I think this
> behaviour is suboptimal.  ksh93 seems to do the right thing though.

Yes, I agree that multi-line heredoc delimiters are useful.

> Another interesting thing to try is
> 
> 	cat << ${ foo
> 	${
> 
> Also you can look at the quotes:
> 
> 	cat << "
> 	"
> 
> IOW it's not clear that "word" in this context necessarily follows
> the same rules used during normal tokenisation.

Quotes are clear: as far as they go, this *must* follow the same rules 
used during normal tokenisation. Does any shell disagree?

   cat << "A ' B"
   ok 1
   A ' B
   echo ok 2

I expect this to print

   ok 1
   ok 2

and I'm not finding any shell that disagrees, not even dash. Do you have 
an example of a shell that sees only "A as the heredoc delimiter, or one 
that perhaps performs quote removal on the inner '?

For your example, does any shell, including dash, *not* treat this as a 
multi-line heredoc delimiter consisting of a newline character (which 
may or may not be matched by two blank lines, depending on the shell)?

> Anyway, dash has had CHKEOFMARK since 2007 so this is just an
> extension of existing behaviour.

This is definitely true: your patch, regardless of whether it matches 
POSIX's requirements, is consistent with how dash has behaved for a long 
time.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 15, 2018, 5:11 p.m. UTC | #4
On Thu, Mar 15, 2018 at 05:29:27PM +0100, Harald van Dijk wrote:
>
> That's because POSIX specifies that after ${, everything up to the matching
> }, not including nested strings, expansions, etc., is part of the word. No
> exception is made when it spans multiple lines.
> 
> Another instance of this is
> 
>   if false; then
>     echo ${
>   fi
>   }
>   fi
>   echo ok
> 
> This is accepted by bash, ksh, zsh and posh.

Interestingly, ksh bombs out on this:

	echo ${
	fi
	}

So this behaviour is not exactly consistent.

In any case, this substituion is invalid in all of these shells so
does it really matter?

> >While other shells all behave the way dash does.
> 
> All? At least two others don't: yash rejects ${ as invalid because of the
> missing parameter name. zsh agrees with bash and ksh that it should look for
> the closing }.

pdksh/mksh are the two shells that I tried.  OK I admit "all"
was stretching things :)

But my point regarding the inconsistency remains.

> >Considering the fact that even if you closed the brace after
> >the newline bash would still be stuck forever,
> 
> That's because bash doesn't ever match multi-line heredoc delimiters. Which
> is what POSIX technically requires:
> 
>   "The here-document shall be treated as a single word that begins after the
> next <newline> and continues until there is a line containing only the
> delimiter and a <newline>, with no <blank> characters in between."
> 
> No single line will ever match a multi-line heredoc delimiter.

Sure.  But this is a quality of implementation issue.  If you're
never going to match the delimter you should probably emit an error
at the very start rather than at the EOF.

> >Another interesting thing to try is
> >
> >	cat << ${ foo
> >	${
> >
> >Also you can look at the quotes:
> >
> >	cat << "
> >	"
> >
> >IOW it's not clear that "word" in this context necessarily follows
> >the same rules used during normal tokenisation.
> 
> Quotes are clear: as far as they go, this *must* follow the same rules used
> during normal tokenisation. Does any shell disagree?

I was talking about multi-line quotes, which you conveniently dismissed :)

> For your example, does any shell, including dash, *not* treat this as a
> multi-line heredoc delimiter consisting of a newline character (which may or
> may not be matched by two blank lines, depending on the shell)?

Yes, almost every shell fails with the multi-line delimiter inside
double quotes.  Only dash and ksh93 seem to get it right.

Cheers,
Harald van Dijk March 15, 2018, 9:49 p.m. UTC | #5
On 15/03/2018 18:11, Herbert Xu wrote:
> On Thu, Mar 15, 2018 at 05:29:27PM +0100, Harald van Dijk wrote:
>>
>> That's because POSIX specifies that after ${, everything up to the matching
>> }, not including nested strings, expansions, etc., is part of the word. No
>> exception is made when it spans multiple lines.
>>
>> Another instance of this is
>>
>>    if false; then
>>      echo ${
>>    fi
>>    }
>>    fi
>>    echo ok
>>
>> This is accepted by bash, ksh, zsh and posh.
> 
> Interestingly, ksh bombs out on this:
> 
> 	echo ${
> 	fi
> 	}
> 
> So this behaviour is not exactly consistent.

It's perfectly consistent. It gets accepted at parse time, it only gets 
rejected at expansion time. That's how dash generally behaves as well:

   $ dash -c 'echo ${x^}'
   dash: 1: Bad substitution
   $ dash -c ': || echo ${x^}'
   $

Historically, as I understand it, ash would reject both of these, but 
you specifically modified dash to accept invalid expansions during 
parsing (which makes sense):

 
<https://git.kernel.org/pub/scm/utils/dash/dash.git/commit/?id=3df3edd13389ae768010bfacee5612346b413e38>

> In any case, this substituion is invalid in all of these shells so
> does it really matter?

Okay, it can be trivially modified to something that does work in other 
shells (even if it were actually executed), but gets rejected at parse 
time by dash:

   if false; then
     : ${$+
   }
   fi

It'd be nice if all shells used the same parse rules, so that scripts 
can detect dynamically which expansions are supported, but don't have to 
go through ugly eval commands to use them.

>> That's because bash doesn't ever match multi-line heredoc delimiters. Which
>> is what POSIX technically requires:
>>
>>    "The here-document shall be treated as a single word that begins after the
>> next <newline> and continues until there is a line containing only the
>> delimiter and a <newline>, with no <blank> characters in between."
>>
>> No single line will ever match a multi-line heredoc delimiter.
> 
> Sure.  But this is a quality of implementation issue.  If you're
> never going to match the delimter you should probably emit an error
> at the very start rather than at the EOF.

POSIX isn't clear on whether reaching EOF without seeing the heredoc 
delimiter is an error and shells disagree. dash silently accepts it, as 
do ksh, yash and zsh.

When EOF is an acceptable way to terminate a heredoc, a delimiter which 
never matches can be useful to force the complete remainder of the file 
to be treated as a heredoc.

>>> Another interesting thing to try is
>>>
>>> 	cat << ${ foo
>>> 	${
>>>
>>> Also you can look at the quotes:
>>>
>>> 	cat << "
>>> 	"
>>>
>>> IOW it's not clear that "word" in this context necessarily follows
>>> the same rules used during normal tokenisation.
>>
>> Quotes are clear: as far as they go, this *must* follow the same rules used
>> during normal tokenisation. Does any shell disagree?
> 
> I was talking about multi-line quotes, which you conveniently dismissed :)

I got back to that :) I picked another example first because I thought 
it would be clearer using that that the normal tokenisation rules should 
be used.

>> For your example, does any shell, including dash, *not* treat this as a
>> multi-line heredoc delimiter consisting of a newline character (which may or
>> may not be matched by two blank lines, depending on the shell)?
> 
> Yes, almost every shell fails with the multi-line delimiter inside
> double quotes.  Only dash and ksh93 seem to get it right.

posh accepts multi-line heredoc delimiters as well.

Cheers,
Harald van Dijk
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu March 16, 2018, 3:27 a.m. UTC | #6
On Thu, Mar 15, 2018 at 10:49:15PM +0100, Harald van Dijk wrote:
>
> It's perfectly consistent. It gets accepted at parse time, it only gets
> rejected at expansion time. That's how dash generally behaves as well:
> 
>   $ dash -c 'echo ${x^}'
>   dash: 1: Bad substitution
>   $ dash -c ': || echo ${x^}'
>   $
> 
> Historically, as I understand it, ash would reject both of these, but you
> specifically modified dash to accept invalid expansions during parsing
> (which makes sense):
> 
> 
> <https://git.kernel.org/pub/scm/utils/dash/dash.git/commit/?id=3df3edd13389ae768010bfacee5612346b413e38>
> 
> >In any case, this substituion is invalid in all of these shells so
> >does it really matter?
> 
> Okay, it can be trivially modified to something that does work in other
> shells (even if it were actually executed), but gets rejected at parse time
> by dash:
> 
>   if false; then
>     : ${$+
>   }
>   fi

That's just a bug in dash's parser with ${} in general, because
it bombs out without the if clause too:

	: ${$+
	}

> It'd be nice if all shells used the same parse rules, so that scripts can
> detect dynamically which expansions are supported, but don't have to go
> through ugly eval commands to use them.

I think the reason dash bombs out on your first example (fi) isn't
because of the keyword at all, but it's actually the same reason
as the second, namely newlines are not accepted within parameter
substitution.  If we fixed this then it should work correctly.

Cheers,
diff mbox

Patch

diff --git a/src/parser.c b/src/parser.c
index 382658e..283ed63 100644
--- a/src/parser.c
+++ b/src/parser.c
@@ -1012,6 +1012,11 @@  quotemark:
 				}
 				break;
 			case CBQUOTE:	/* '`' */
+				if (checkkwd & CHKEOFMARK) {
+					USTPUTC('`', out);
+					break;
+				}
+
 				PARSEBACKQOLD();
 				break;
 			case CEOF: