wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
diff mbox series

Message ID 20181027084823.23382-1-pclouds@gmail.com
State New
Headers show
Series
  • wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
Related show

Commit Message

Duy Nguyen Oct. 27, 2018, 8:48 a.m. UTC
In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
can but only in three patterns:

- '**/' matches zero or more leading directories
- '/**/' matches zero or more directories in between
- '/**' matches zero or more trailing directories/files

When '**' is present but not in one of these patterns, the current
behavior is consider the pattern invalid and stop matching. In other
words, 'foo**bar' never matches anything, whatever you throw at it.

This behavior is arguably a bit confusing partly because we can't
really tell the user their pattern is invalid so that they can fix
it. So instead, tolerate it and make '**' act like two regular '*'s
(which is essentially the same as a single asterisk). This behavior
seems more predictable.

Noticed-by: dana <dana@dana.is>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 Documentation/gitignore.txt | 3 ++-
 t/t3070-wildmatch.sh        | 4 ++--
 wildmatch.c                 | 4 ++--
 wildmatch.h                 | 1 -
 4 files changed, 6 insertions(+), 6 deletions(-)

Comments

Torsten Bögershausen Oct. 28, 2018, 6:25 a.m. UTC | #1
On Sat, Oct 27, 2018 at 10:48:23AM +0200, Nguyễn Thái Ngọc Duy wrote:
> In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> can but only in three patterns:
> 
> - '**/' matches zero or more leading directories
> - '/**/' matches zero or more directories in between
> - '/**' matches zero or more trailing directories/files
> 
> When '**' is present but not in one of these patterns, the current
> behavior is consider the pattern invalid and stop matching. In other
> words, 'foo**bar' never matches anything, whatever you throw at it.
> 
> This behavior is arguably a bit confusing partly because we can't
> really tell the user their pattern is invalid so that they can fix
> it. So instead, tolerate it and make '**' act like two regular '*'s
> (which is essentially the same as a single asterisk). This behavior
> seems more predictable.

Nice analyzes.
I have one question here:
If the user specifies '**' and nothhing is found,
would it be better to die() with a useful message
instead of silently correcting it ?

See the the patch below:
> -				} else
> -					return WM_ABORT_MALFORMED;

Would it be possible to put in the die() here?
As it is outlined so nicely above, a '**' must have either a '/'
before, or behind, or both, to make sense.
When there is no '/' then the user specified something wrong.
Either a '/' has been forgotten, or the '*' key may be bouncing.
I don't think that Git should assume anything here.
(but I didn't follow the previous discussions, so I may have missed
some arguments.)

[]
Duy Nguyen Oct. 28, 2018, 6:35 a.m. UTC | #2
On Sun, Oct 28, 2018 at 7:25 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Sat, Oct 27, 2018 at 10:48:23AM +0200, Nguyễn Thái Ngọc Duy wrote:
> > In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> > can but only in three patterns:
> >
> > - '**/' matches zero or more leading directories
> > - '/**/' matches zero or more directories in between
> > - '/**' matches zero or more trailing directories/files
> >
> > When '**' is present but not in one of these patterns, the current
> > behavior is consider the pattern invalid and stop matching. In other
> > words, 'foo**bar' never matches anything, whatever you throw at it.
> >
> > This behavior is arguably a bit confusing partly because we can't
> > really tell the user their pattern is invalid so that they can fix
> > it. So instead, tolerate it and make '**' act like two regular '*'s
> > (which is essentially the same as a single asterisk). This behavior
> > seems more predictable.
>
> Nice analyzes.
> I have one question here:
> If the user specifies '**' and nothing is found,
> would it be better to die() with a useful message
> instead of silently correcting it ?

Consider the main use case of wildmatch, .gitignore patterns, dying
would be really bad because it can affect a lot of commands. It would
be much better if wildmatch api allows the caller to handle the error
so they can either die() or propagate the error upwards. But even then
the current API is not suited for that, ideally we should have a
compile phase where you can validate the pattern first. Without it,
you encounter the error every time you try to match something and
handling errors becomes much uglier.

And it goes against the way pattern errors are handled by
fnmatch/wildmatch anyway. If you write '[ab' instead of '[ab]', '['
will just be considered literal '[' instead of erroring out.

> See the the patch below:
> > -                             } else
> > -                                     return WM_ABORT_MALFORMED;
>
> Would it be possible to put in the die() here?
> As it is outlined so nicely above, a '**' must have either a '/'
> before, or behind, or both, to make sense.
> When there is no '/' then the user specified something wrong.
> Either a '/' has been forgotten, or the '*' key may be bouncing.
> I don't think that Git should assume anything here.
> (but I didn't follow the previous discussions, so I may have missed
> some arguments.)
>
> []
Junio C Hamano Oct. 29, 2018, 2:28 a.m. UTC | #3
Duy Nguyen <pclouds@gmail.com> writes:

>> Nice analyzes.
>> I have one question here:
>> If the user specifies '**' and nothing is found,
>> would it be better to die() with a useful message
>> instead of silently correcting it ?
>
> Consider the main use case of wildmatch, .gitignore patterns, dying
> would be really bad because it can affect a lot of commands....

If the user gives 'foo*' and nothing is found, we may say "no match"
and some codepaths that uses wildmatch API may die.  And in such place,
when the user gives '**' and nothing is found, we should do the same
in the same codepath.  In either case, the implementation of wildmatch
API is not the place to call a die(), I think.

And yes, treating an unanchored "**" as if there is just a "*" followed
by another '*" makes good sense.

Thanks, both.
Ævar Arnfjörð Bjarmason Oct. 29, 2018, 1:24 p.m. UTC | #4
On Sat, Oct 27 2018, Nguyễn Thái Ngọc Duy wrote:

> In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> can but only in three patterns:
>
> - '**/' matches zero or more leading directories
> - '/**/' matches zero or more directories in between
> - '/**' matches zero or more trailing directories/files
>
> When '**' is present but not in one of these patterns, the current
> behavior is consider the pattern invalid and stop matching. In other
> words, 'foo**bar' never matches anything, whatever you throw at it.
>
> This behavior is arguably a bit confusing partly because we can't
> really tell the user their pattern is invalid so that they can fix
> it. So instead, tolerate it and make '**' act like two regular '*'s
> (which is essentially the same as a single asterisk). This behavior
> seems more predictable.
>
> Noticed-by: dana <dana@dana.is>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  Documentation/gitignore.txt | 3 ++-
>  t/t3070-wildmatch.sh        | 4 ++--
>  wildmatch.c                 | 4 ++--
>  wildmatch.h                 | 1 -
>  4 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
> index d107daaffd..1c94f08ff4 100644
> --- a/Documentation/gitignore.txt
> +++ b/Documentation/gitignore.txt
> @@ -129,7 +129,8 @@ full pathname may have special meaning:
>     matches zero or more directories. For example, "`a/**/b`"
>     matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.
>
> - - Other consecutive asterisks are considered invalid.
> + - Other consecutive asterisks are considered regular asterisks and
> +   will match according to the previous rules.
>
>  NOTES
>  -----
> diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh
> index 46aca0af10..891d4d7cb9 100755
> --- a/t/t3070-wildmatch.sh
> +++ b/t/t3070-wildmatch.sh
> @@ -237,7 +237,7 @@ match 0 0 0 0 foobar 'foo\*bar'
>  match 1 1 1 1 'f\oo' 'f\\oo'
>  match 1 1 1 1 ball '*[al]?'
>  match 0 0 0 0 ten '[ten]'
> -match 0 0 1 1 ten '**[!te]'
> +match 1 1 1 1 ten '**[!te]'
>  match 0 0 0 0 ten '**[!ten]'
>  match 1 1 1 1 ten 't[a-g]n'
>  match 0 0 0 0 ten 't[!a-g]n'
> @@ -253,7 +253,7 @@ match 1 1 1 1 ']' ']'
>  # Extended slash-matching features
>  match 0 0 1 1 'foo/baz/bar' 'foo*bar'
>  match 0 0 1 1 'foo/baz/bar' 'foo**bar'
> -match 0 0 1 1 'foobazbar' 'foo**bar'
> +match 1 1 1 1 'foobazbar' 'foo**bar'
>  match 1 1 1 1 'foo/baz/bar' 'foo/**/bar'
>  match 1 1 0 0 'foo/baz/bar' 'foo/**/**/bar'
>  match 1 1 1 1 'foo/b/a/z/bar' 'foo/**/bar'
> diff --git a/wildmatch.c b/wildmatch.c
> index d074c1be10..9e9e2a2f95 100644
> --- a/wildmatch.c
> +++ b/wildmatch.c
> @@ -104,8 +104,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>  					    dowild(p + 1, text, flags) == WM_MATCH)
>  						return WM_MATCH;
>  					match_slash = 1;
> -				} else
> -					return WM_ABORT_MALFORMED;
> +				} else /* WM_PATHNAME is set */
> +					match_slash = 0;
>  			} else
>  				/* without WM_PATHNAME, '*' == '**' */
>  				match_slash = flags & WM_PATHNAME ? 0 : 1;
> diff --git a/wildmatch.h b/wildmatch.h
> index b8c826aa68..5993696298 100644
> --- a/wildmatch.h
> +++ b/wildmatch.h
> @@ -4,7 +4,6 @@
>  #define WM_CASEFOLD 1
>  #define WM_PATHNAME 2
>
> -#define WM_ABORT_MALFORMED 2
>  #define WM_NOMATCH 1
>  #define WM_MATCH 0
>  #define WM_ABORT_ALL -1

This patch looks good to me, but I think it's a bad state of affairs to
keep changing these semantics and not having something like a
"gitwildmatch" doc were we document this matching syntax.

Also I still need to dig up the work for using PCRE as an alternate
matching engine, the PCRE devs produced a bug-for-bug compatible version
of our wildmatch function (all the more reason to document it), so I
think they'll need to change it now that this is in, but I haven't
rebased those ancient patches yet.

Do you have any thoughts on how to proceed with getting this documented
/ into some stable state where we can specify it? Even if we don't end
up using PCRE as a matching engine (sometimes it was faster, sometimes
slower) I think it would be very useful if we can spew out "here's your
pattern as a regex" for self-documentation purposes.

Then that can be piped into e.g. "perl -Mre=debug" to see a step-by-step
guide for how the pattern compiles, and why it does or doesn't match a
given thing.
Duy Nguyen Oct. 29, 2018, 3:53 p.m. UTC | #5
On Mon, Oct 29, 2018 at 2:24 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> This patch looks good to me, but I think it's a bad state of affairs to
> keep changing these semantics and not having something like a
> "gitwildmatch" doc were we document this matching syntax.

While we don't have a separate document for it, the behavior _is_
documented even if perhaps it wasn't as clear. There were even tests
for this corner case.

> Do you have any thoughts on how to proceed with getting this documented
> / into some stable state where we can specify it?

wildmatch has been used in git for a few years if I remember correctly
so to me it is stable. Granted there are corner cases like this, but I
can't prevent all bugs (especially this one which is more like design
mistake than bug per se). You are welcome to refactor gitignore.txt
and add gitwildmatch.txt.

Patch
diff mbox series

diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
index d107daaffd..1c94f08ff4 100644
--- a/Documentation/gitignore.txt
+++ b/Documentation/gitignore.txt
@@ -129,7 +129,8 @@  full pathname may have special meaning:
    matches zero or more directories. For example, "`a/**/b`"
    matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.
 
- - Other consecutive asterisks are considered invalid.
+ - Other consecutive asterisks are considered regular asterisks and
+   will match according to the previous rules.
 
 NOTES
 -----
diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh
index 46aca0af10..891d4d7cb9 100755
--- a/t/t3070-wildmatch.sh
+++ b/t/t3070-wildmatch.sh
@@ -237,7 +237,7 @@  match 0 0 0 0 foobar 'foo\*bar'
 match 1 1 1 1 'f\oo' 'f\\oo'
 match 1 1 1 1 ball '*[al]?'
 match 0 0 0 0 ten '[ten]'
-match 0 0 1 1 ten '**[!te]'
+match 1 1 1 1 ten '**[!te]'
 match 0 0 0 0 ten '**[!ten]'
 match 1 1 1 1 ten 't[a-g]n'
 match 0 0 0 0 ten 't[!a-g]n'
@@ -253,7 +253,7 @@  match 1 1 1 1 ']' ']'
 # Extended slash-matching features
 match 0 0 1 1 'foo/baz/bar' 'foo*bar'
 match 0 0 1 1 'foo/baz/bar' 'foo**bar'
-match 0 0 1 1 'foobazbar' 'foo**bar'
+match 1 1 1 1 'foobazbar' 'foo**bar'
 match 1 1 1 1 'foo/baz/bar' 'foo/**/bar'
 match 1 1 0 0 'foo/baz/bar' 'foo/**/**/bar'
 match 1 1 1 1 'foo/b/a/z/bar' 'foo/**/bar'
diff --git a/wildmatch.c b/wildmatch.c
index d074c1be10..9e9e2a2f95 100644
--- a/wildmatch.c
+++ b/wildmatch.c
@@ -104,8 +104,8 @@  static int dowild(const uchar *p, const uchar *text, unsigned int flags)
 					    dowild(p + 1, text, flags) == WM_MATCH)
 						return WM_MATCH;
 					match_slash = 1;
-				} else
-					return WM_ABORT_MALFORMED;
+				} else /* WM_PATHNAME is set */
+					match_slash = 0;
 			} else
 				/* without WM_PATHNAME, '*' == '**' */
 				match_slash = flags & WM_PATHNAME ? 0 : 1;
diff --git a/wildmatch.h b/wildmatch.h
index b8c826aa68..5993696298 100644
--- a/wildmatch.h
+++ b/wildmatch.h
@@ -4,7 +4,6 @@ 
 #define WM_CASEFOLD 1
 #define WM_PATHNAME 2
 
-#define WM_ABORT_MALFORMED 2
 #define WM_NOMATCH 1
 #define WM_MATCH 0
 #define WM_ABORT_ALL -1