diff mbox

Parameter expansion, patterns and fnmatch

Message ID 0ce0bca2-3bdd-a1f5-169e-0291a49cd6c7@gigawatt.nl (mailing list archive)
State Rejected
Delegated to: Herbert Xu
Headers show

Commit Message

Harald van Dijk Aug. 9, 2016, 9:39 p.m. UTC
On 09/08/16 11:28, Olof Johansson wrote:
> Hello,
>
> I'm seeing some discrepancies between dash built with --enable-fnmatch
> and dash built without that got me curious. Maybe you could help shed
> some light?
>
>  foo='[abc]'
>  echo ${foo#[}
>  echo ${foo#\[}
>
> With dash built with --enable-fnmatch:
>
>  abc]
>  abc]
>
> With dash built without --enable-fnmatch:
>
>  [abc]
>  abc]

Nice find.

> I was able to reproduce this behavior on latest git master
> (17a5f24e0). The dash manual states:
>
>> The end of the character class is indicated by a ]; if the ] is
>> missing then the [ matches a [ rather than introducing a character
>> class.
>
> Which to me suggests that the non-fnmatch case is not the expected
> behavior. Is this interpretation correct?

Yes, this looks like a bug in dash. With the default --disable-fnmatch 
code, when dash encounters [ in a pattern, it immediately treats the 
following characters as part of the set. If it then encounters the end 
of the pattern without having seen a matching ], it attempts to reset 
the state and continue as if [ was treated as a literal character right 
from the start. The attempt to reset the state doesn't look right, and 
has been like this since at least the initial Git commit in 2005.

This also affects

     case [a in [?) echo ok ;; *) echo bad ;; esac

which should print ok.

Attached is a patch that attempts to reset the state correctly. It 
passes your test and mine, but I have not yet tested it extensively.

> Thanks,
>

Cheers,
Harald van Dijk

Comments

Olof Johansson Aug. 17, 2016, 2:50 p.m. UTC | #1
On 2016-08-09 23:39 +0200, Harald van Dijk wrote:
> Yes, this looks like a bug in dash. With the default
> --disable-fnmatch code, when dash encounters [ in a pattern, it
> immediately treats the following characters as part of the set. If
> it then encounters the end of the pattern without having seen a
> matching ], it attempts to reset the state and continue as if [ was
> treated as a literal character right from the start. The attempt to
> reset the state doesn't look right, and has been like this since at
> least the initial Git commit in 2005.
> 
> This also affects
> 
>     case [a in [?) echo ok ;; *) echo bad ;; esac
> 
> which should print ok.
> 
> Attached is a patch that attempts to reset the state correctly. It
> passes your test and mine, but I have not yet tested it extensively.

Thanks a lot! For what it's worth, I've been using dash with this
patch this for about a week now, including building a lot of open
source software (via OpenEmbedded). I haven't seen any issues yet.
Herbert Xu Sept. 2, 2016, 2:04 p.m. UTC | #2
Harald van Dijk <harald@gigawatt.nl> wrote:
>
> Yes, this looks like a bug in dash. With the default --disable-fnmatch 
> code, when dash encounters [ in a pattern, it immediately treats the 
> following characters as part of the set. If it then encounters the end 
> of the pattern without having seen a matching ], it attempts to reset 
> the state and continue as if [ was treated as a literal character right 
> from the start. The attempt to reset the state doesn't look right, and 
> has been like this since at least the initial Git commit in 2005.

pdksh exhibits the same behaviour:

$ pdksh -c 'foo=[abc]; echo ${foo#[}'
[abc]
$

POSIX says:

9.3.3 BRE Special Characters

A BRE special character has special properties in certain contexts.
Outside those contexts, or when preceded by a backslash, such a
character is a BRE that matches the special character itself. The
BRE special characters and the contexts in which they have their
special meaning are as follows:

.[\
The period, left-bracket, and backslash shall be special except
when used in a bracket expression (see RE Bracket Expression). An
expression containing a '[' that is not preceded by a backslash
and is not part of a bracket expression produces undefined results.

> This also affects
> 
>     case [a in [?) echo ok ;; *) echo bad ;; esac
> 
> which should print ok.

Even ksh prints bad here.

I would however consider a patch that simplifies the code in the
undefined case.

Cheers,
Eric Blake Sept. 2, 2016, 2:25 p.m. UTC | #3
On 09/02/2016 09:04 AM, Herbert Xu wrote:

>> Yes, this looks like a bug in dash. With the default --disable-fnmatch 
>> code, when dash encounters [ in a pattern, it immediately treats the 
>> following characters as part of the set. If it then encounters the end 
>> of the pattern without having seen a matching ], it attempts to reset 
>> the state and continue as if [ was treated as a literal character right 
>> from the start.

> 
> POSIX says:
> 
> 9.3.3 BRE Special Characters
> 
> A BRE special character has special properties in certain contexts.
...
> An
> expression containing a '[' that is not preceded by a backslash
> and is not part of a bracket expression produces undefined results.


Ah, but POSIX also says:

2.13.1 Patterns Matching a Single Character

[
    If an open bracket introduces a bracket expression as in XBD RE
Bracket Expression, except that the <exclamation-mark> character ( '!' )
shall replace the <circumflex> character ( '^' ) in its role in a
non-matching list in the regular expression notation, it shall introduce
a pattern bracket expression. A bracket expression starting with an
unquoted <circumflex> character produces unspecified results. Otherwise,
'[' shall match the character itself.

So while a lone '[' is unspecified in a normal BRE, it is well-defined
in a shell filename pattern matching context.  Since '[' is not a
bracket expression, it MUST be treated as a literal '[', so ${foo#[}
MUST strip the leading [ from the contents of foo, without requiring
that the [ be quoted.

> 
>> This also affects
>>
>>     case [a in [?) echo ok ;; *) echo bad ;; esac
>>
>> which should print ok.
> 
> Even ksh prints bad here.

So ksh is also buggy.

> 
> I would however consider a patch that simplifies the code in the
> undefined case.

Except that it is well-defined by POSIX, not undefined.
Herbert Xu Sept. 2, 2016, 2:29 p.m. UTC | #4
On Fri, Sep 02, 2016 at 09:25:15AM -0500, Eric Blake wrote:
>
> >> This also affects
> >>
> >>     case [a in [?) echo ok ;; *) echo bad ;; esac
> >>
> >> which should print ok.
> > 
> > Even ksh prints bad here.
> 
> So ksh is also buggy.

Good luck writing a script with an unquoted [ expecting it to be
portable :)
Herbert Xu Sept. 2, 2016, 2:46 p.m. UTC | #5
On Fri, Sep 02, 2016 at 09:25:15AM -0500, Eric Blake wrote:
>
> 2.13.1 Patterns Matching a Single Character
> 
> [
>     If an open bracket introduces a bracket expression as in XBD RE
> Bracket Expression, except that the <exclamation-mark> character ( '!' )
> shall replace the <circumflex> character ( '^' ) in its role in a
> non-matching list in the regular expression notation, it shall introduce
> a pattern bracket expression. A bracket expression starting with an
> unquoted <circumflex> character produces unspecified results. Otherwise,
> '[' shall match the character itself.

BTW, this last sentence is not present in

http://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13

So I presume it's a newer unreleased revision.

Seriously, you guys are turning POSIX into a joke by introducing
all these new requirements.  At this point I think we should
pretty much give up on POSIX compliance the way it's headed.

Cheers,
Eric Blake Sept. 2, 2016, 2:48 p.m. UTC | #6
On 09/02/2016 09:25 AM, Eric Blake wrote:
> 
> So while a lone '[' is unspecified in a normal BRE, it is well-defined
> in a shell filename pattern matching context.  Since '[' is not a
> bracket expression, it MUST be treated as a literal '[', so ${foo#[}
> MUST strip the leading [ from the contents of foo, without requiring
> that the [ be quoted.

Rationale: The '[' shell builtin is not undefined behavior.  This was a
specific addition made in POSIX 2008 that was not present in POSIX 2001
(although I can't easily find the bug report that justified it, since
the bugs for POSIX 2008 predate the current austingroupbugs.net database)
Eric Blake Sept. 2, 2016, 2:49 p.m. UTC | #7
On 09/02/2016 09:29 AM, Herbert Xu wrote:
> On Fri, Sep 02, 2016 at 09:25:15AM -0500, Eric Blake wrote:
>>
>>>> This also affects
>>>>
>>>>     case [a in [?) echo ok ;; *) echo bad ;; esac
>>>>
>>>> which should print ok.
>>>
>>> Even ksh prints bad here.
>>
>> So ksh is also buggy.
> 
> Good luck writing a script with an unquoted [ expecting it to be
> portable :)

[ '' ] || echo empty

There, I just wrote a portable script with unquoted [ portably
interpreted as itself and not as a bracket filename expansion pattern.
Herbert Xu Sept. 2, 2016, 2:51 p.m. UTC | #8
On Fri, Sep 02, 2016 at 09:49:53AM -0500, Eric Blake wrote:
> On 09/02/2016 09:29 AM, Herbert Xu wrote:
> > On Fri, Sep 02, 2016 at 09:25:15AM -0500, Eric Blake wrote:
> >>
> >>>> This also affects
> >>>>
> >>>>     case [a in [?) echo ok ;; *) echo bad ;; esac
> >>>>
> >>>> which should print ok.
> >>>
> >>> Even ksh prints bad here.
> >>
> >> So ksh is also buggy.
> > 
> > Good luck writing a script with an unquoted [ expecting it to be
> > portable :)
> 
> [ '' ] || echo empty
> 
> There, I just wrote a portable script with unquoted [ portably
> interpreted as itself and not as a bracket filename expansion pattern.

dash handles that just fine.  We're not talking about a lone [
in file expansion context here.

Cheers,
Eric Blake Sept. 2, 2016, 2:54 p.m. UTC | #9
On 09/02/2016 09:46 AM, Herbert Xu wrote:
> On Fri, Sep 02, 2016 at 09:25:15AM -0500, Eric Blake wrote:
>>
>> 2.13.1 Patterns Matching a Single Character
>>
>> [
>>     If an open bracket introduces a bracket expression as in XBD RE
>> Bracket Expression, except that the <exclamation-mark> character ( '!' )
>> shall replace the <circumflex> character ( '^' ) in its role in a
>> non-matching list in the regular expression notation, it shall introduce
>> a pattern bracket expression. A bracket expression starting with an
>> unquoted <circumflex> character produces unspecified results. Otherwise,
>> '[' shall match the character itself.
> 
> BTW, this last sentence is not present in
> 
> http://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13
> 

That's the 2004 edition (TC1 of the 2001 spec, aka Issue 6).

> So I presume it's a newer unreleased revision.

Newer but released (TC2 of the 2008 spec, aka Issue 7):
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_01

The requirement has been there for 8 years now.

> 
> Seriously, you guys are turning POSIX into a joke by introducing
> all these new requirements.  At this point I think we should
> pretty much give up on POSIX compliance the way it's headed.

I hope you're just stating that out of frustration, and not something
that you actually intend to follow through with.  And if there is a
requirement being considered in the Austin Group that you disagree with,
please speak up on the Austin Group - membership is free.
Chet Ramey Sept. 2, 2016, 3:06 p.m. UTC | #10
On 9/2/16 10:46 AM, Herbert Xu wrote:
> On Fri, Sep 02, 2016 at 09:25:15AM -0500, Eric Blake wrote:
>>
>> 2.13.1 Patterns Matching a Single Character
>>
>> [
>>     If an open bracket introduces a bracket expression as in XBD RE
>> Bracket Expression, except that the <exclamation-mark> character ( '!' )
>> shall replace the <circumflex> character ( '^' ) in its role in a
>> non-matching list in the regular expression notation, it shall introduce
>> a pattern bracket expression. A bracket expression starting with an
>> unquoted <circumflex> character produces unspecified results. Otherwise,
>> '[' shall match the character itself.
> 
> BTW, this last sentence is not present in
> 
> http://pubs.opengroup.org/onlinepubs/009604499/utilities/xcu_chap02.html#tag_02_13

That's pretty old; it's from 2004.

> So I presume it's a newer unreleased revision.

It's in the current revision, which dates from 2008.

> 
> Seriously, you guys are turning POSIX into a joke by introducing
> all these new requirements.  At this point I think we should
> pretty much give up on POSIX compliance the way it's headed.

Let's not go overboard here.  This was introduced to tighten up ambiguous
behavior: what happens when the open bracket *doesn't* introduce a
bracket expression.
Jilles Tjoelker Sept. 2, 2016, 3:12 p.m. UTC | #11
On Fri, Sep 02, 2016 at 10:04:37PM +0800, Herbert Xu wrote:
> Harald van Dijk <harald@gigawatt.nl> wrote:
> > Yes, this looks like a bug in dash. With the default --disable-fnmatch 
> > code, when dash encounters [ in a pattern, it immediately treats the 
> > following characters as part of the set. If it then encounters the end 
> > of the pattern without having seen a matching ], it attempts to reset 
> > the state and continue as if [ was treated as a literal character right 
> > from the start. The attempt to reset the state doesn't look right, and 
> > has been like this since at least the initial Git commit in 2005.

> pdksh exhibits the same behaviour:

> $ pdksh -c 'foo=[abc]; echo ${foo#[}'
> [abc]
> $

> POSIX says:

> 9.3.3 BRE Special Characters

> A BRE special character has special properties in certain contexts.
> Outside those contexts, or when preceded by a backslash, such a
> character is a BRE that matches the special character itself. The
> BRE special characters and the contexts in which they have their
> special meaning are as follows:

> .[\
> The period, left-bracket, and backslash shall be special except
> when used in a bracket expression (see RE Bracket Expression). An
> expression containing a '[' that is not preceded by a backslash
> and is not part of a bracket expression produces undefined results.

I think this interpretation of POSIX is incorrect. This is about shell
patterns, not basic regular expressions. Shell patterns are specified in
XCU 2.13 Pattern Matching Notation. In XCU 2.13.1, it is written:

] [
] If an open bracket introduces a bracket expression as in XBD Section
] 9.3.5, except that the <exclamation-mark> character ('!') shall
] replace the <circumflex> character ('^') in its role in a non-matching
] list in the regular expression notation, it shall introduce a pattern
] bracket expression. A bracket expression starting with an unquoted
] <circumflex> character produces unspecified results. Otherwise, '['
] shall match the character itself.

Therefore, pdksh is wrong and the output should be abc].

It is normally better to test against the actively developed mksh
instead of pdksh, but here mksh has the same bug. OpenBSD's ksh also has
some active development but stays closer to the original pdksh.

> > This also affects

> >     case [a in [?) echo ok ;; *) echo bad ;; esac

> > which should print ok.

> Even ksh prints bad here.

I think POSIX may be saying something different here from what it really
wants to say. There is text in 2.13.3 Patterns Used for Filename
Expansion that leaves unspecified whether [? matches only the literal
filename component [? or all two-character filename components starting
with [ (other slash-separated components in the same pattern are
unaffected). However, if ksh93 behaves similarly in a case statement,
that may have been what the standard had intended to say.

Looking at as simple code as possible, this seems, however, unhelpful.
Since a pattern like *[ should match the literal string *[ in the choice
where brackets that do not introduce a bracket expression are supposed
to disable other special characters and any earlier work on the * is
therefore wrong, implementing this choice requires an additional scan
for brackets that do not introduce a bracket expression.
Harald van Dijk Sept. 3, 2016, 12:03 p.m. UTC | #12
On 02/09/16 16:51, Herbert Xu wrote:
> On Fri, Sep 02, 2016 at 09:49:53AM -0500, Eric Blake wrote:
>> On 09/02/2016 09:29 AM, Herbert Xu wrote:
>>> On Fri, Sep 02, 2016 at 09:25:15AM -0500, Eric Blake wrote:
>>>>
>>>>>> This also affects
>>>>>>
>>>>>>     case [a in [?) echo ok ;; *) echo bad ;; esac
>>>>>>
>>>>>> which should print ok.
>>>>>
>>>>> Even ksh prints bad here.
>>>>
>>>> So ksh is also buggy.
>>>
>>> Good luck writing a script with an unquoted [ expecting it to be
>>> portable :)
>>
>> [ '' ] || echo empty
>>
>> There, I just wrote a portable script with unquoted [ portably
>> interpreted as itself and not as a bracket filename expansion pattern.
>
> dash handles that just fine.  We're not talking about a lone [
> in file expansion context here.

We were originally talking about the [ in ${foo#[}. That is a lone [ in 
pattern matching context, which is what file expansion context is.

We're talking about something that dash has had code to handle for >10 
years, that's been documented by dash as supported for >10 years, and 
now that it turns out there's a flaw in the code where dash does not 
behave as documented and as clearly intended by the code, it's POSIX's 
fault?
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Sept. 3, 2016, 1:05 p.m. UTC | #13
On Sat, Sep 03, 2016 at 02:03:28PM +0200, Harald van Dijk wrote:
> >>>>
> >>>>>>This also affects
> >>>>>>
> >>>>>>    case [a in [?) echo ok ;; *) echo bad ;; esac
>
> We're talking about something that dash has had code to handle for
> >10 years, that's been documented by dash as supported for >10
> years, and now that it turns out there's a flaw in the code where
> dash does not behave as documented and as clearly intended by the
> code, it's POSIX's fault?

dash has never worked this way.  What's worse is that your patch
changes dash's behaviour in a way that is inconsistent with every
shell out there except bash.

Cheers,
Harald van Dijk Sept. 3, 2016, 1:19 p.m. UTC | #14
On 03/09/16 15:05, Herbert Xu wrote:
> On Sat, Sep 03, 2016 at 02:03:28PM +0200, Harald van Dijk wrote:
>>>>>>
>>>>>>>> This also affects
>>>>>>>>
>>>>>>>>    case [a in [?) echo ok ;; *) echo bad ;; esac
>>
>> We're talking about something that dash has had code to handle for
>>> 10 years, that's been documented by dash as supported for >10
>> years, and now that it turns out there's a flaw in the code where
>> dash does not behave as documented and as clearly intended by the
>> code, it's POSIX's fault?
>
> dash has never worked this way.

Very misleading to continue quoting the complicated case but leave out 
the simple case again: dash doesn't handle even ${foo#[} correctly. And 
again, by correctly I mean as documented by dash, as required by POSIX, 
and as clearly intended by the author at the time the code was written. 
I do not know if you were the author of that specific piece of code.

But yeah, sure, if the bug has been there for over 10 years, and I'm 
unable to find older versions of dash to check, I would have guessed 
that dash indeed has never worked this way.

> What's worse is that your patch
> changes dash's behaviour in a way that is inconsistent with every
> shell out there except bash.

For the simple case, ksh and posh also strip a leading [ if doing ${foo#[}.

For the complicated case, at the very least, it makes dash consistent 
between builds. If you feel that for the complicated case, bad should be 
printed rather than ok, then that just makes dash buggy in the 
--enable-fnmatch builds, it still means there's something to fix.
--
To unsubscribe from this list: send the line "unsubscribe dash" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

--- a/src/expand.c
+++ b/src/expand.c
@@ -1594,7 +1594,8 @@  pmatch(const char *pattern, const char *string)
 			do {
 				if (!c) {
 					p = startp;
-					c = *p;
+					q--;
+					c = '[';
 					goto dft;
 				}
 				if (c == '[') {