Message ID | 20181027084823.23382-1-pclouds@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode |
On Sat, Oct 27, 2018 at 10:48:23AM +0200, Nguyễn Thái Ngọc Duy wrote:
> In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> can but only in three patterns:
>
> - '**/' matches zero or more leading directories
> - '/**/' matches zero or more directories in between
> - '/**' matches zero or more trailing directories/files
>
> When '**' is present but not in one of these patterns, the current
> behavior is to consider the pattern invalid and stop matching. In other
> words, 'foo**bar' never matches anything, whatever you throw at it.
>
> This behavior is arguably a bit confusing partly because we can't
> really tell the user their pattern is invalid so that they can fix
> it. So instead, tolerate it and make '**' act like two regular '*'s
> (which is essentially the same as a single asterisk). This behavior
> seems more predictable.

Nice analysis.
I have one question here:
If the user specifies '**' and nothing is found,
would it be better to die() with a useful message
instead of silently correcting it?

See the patch below:

> -				} else
> -					return WM_ABORT_MALFORMED;

Would it be possible to put in the die() here?
As it is outlined so nicely above, a '**' must have either a '/'
before, or behind, or both, to make sense.
When there is no '/' then the user specified something wrong.
Either a '/' has been forgotten, or the '*' key may be bouncing.
I don't think that Git should assume anything here.
(but I didn't follow the previous discussions, so I may have missed
some arguments.)

[]
On Sun, Oct 28, 2018 at 7:25 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Sat, Oct 27, 2018 at 10:48:23AM +0200, Nguyễn Thái Ngọc Duy wrote:
> > In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> > can but only in three patterns:
> >
> > - '**/' matches zero or more leading directories
> > - '/**/' matches zero or more directories in between
> > - '/**' matches zero or more trailing directories/files
> >
> > When '**' is present but not in one of these patterns, the current
> > behavior is to consider the pattern invalid and stop matching. In other
> > words, 'foo**bar' never matches anything, whatever you throw at it.
> >
> > This behavior is arguably a bit confusing partly because we can't
> > really tell the user their pattern is invalid so that they can fix
> > it. So instead, tolerate it and make '**' act like two regular '*'s
> > (which is essentially the same as a single asterisk). This behavior
> > seems more predictable.
>
> Nice analysis.
> I have one question here:
> If the user specifies '**' and nothing is found,
> would it be better to die() with a useful message
> instead of silently correcting it?

Consider the main use case of wildmatch, .gitignore patterns, dying
would be really bad because it can affect a lot of commands. It would
be much better if the wildmatch API allowed the caller to handle the
error so they can either die() or propagate the error upwards. But
even then the current API is not suited for that. Ideally we should
have a compile phase where you can validate the pattern first. Without
it, you encounter the error every time you try to match something, and
handling errors becomes much uglier.

And it goes against the way pattern errors are handled by
fnmatch/wildmatch anyway. If you write '[ab' instead of '[ab]', '['
will just be considered a literal '[' instead of erroring out.

> See the patch below:
> > -				} else
> > -					return WM_ABORT_MALFORMED;
>
> Would it be possible to put in the die() here?
> As it is outlined so nicely above, a '**' must have either a '/'
> before, or behind, or both, to make sense.
> When there is no '/' then the user specified something wrong.
> Either a '/' has been forgotten, or the '*' key may be bouncing.
> I don't think that Git should assume anything here.
> (but I didn't follow the previous discussions, so I may have missed
> some arguments.)
>
> []
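The "compile phase" idea above could look roughly like the following. This is only a sketch under stated assumptions: `find_unanchored_doublestar()` is a hypothetical helper, not part of git's wildmatch API, and it ignores bracket expressions. It scans a pattern once and reports the first run of two or more asterisks that is not in one of the three anchored forms, so a caller could warn, die() or ignore it before any matching happens.

```c
#include <stddef.h>

/*
 * Hypothetical pre-validation helper (not part of git's wildmatch API).
 * Returns a pointer to the first run of two or more asterisks that is
 * not anchored by '/' (or by the pattern's start/end) on both sides,
 * or NULL if every such run is anchored.
 * Simplification: bracket expressions are not parsed.
 */
static const char *find_unanchored_doublestar(const char *pattern)
{
	const char *p = pattern;

	while (*p) {
		if (*p == '\\' && p[1]) {
			p += 2;		/* skip an escaped character */
			continue;
		}
		if (p[0] == '*' && p[1] == '*') {
			const char *start = p;
			int anchored_before, anchored_after;

			while (*p == '*')
				p++;
			anchored_before = (start == pattern || start[-1] == '/');
			anchored_after = (*p == '\0' || *p == '/');
			if (!(anchored_before && anchored_after))
				return start;
			continue;
		}
		p++;
	}
	return NULL;
}
```

A caller such as the .gitignore parser could run a check like this once per pattern and decide for itself how loudly to complain, which keeps die() out of the matching code.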
Duy Nguyen <pclouds@gmail.com> writes:

>> Nice analysis.
>> I have one question here:
>> If the user specifies '**' and nothing is found,
>> would it be better to die() with a useful message
>> instead of silently correcting it?
>
> Consider the main use case of wildmatch, .gitignore patterns, dying
> would be really bad because it can affect a lot of commands....

If the user gives 'foo*' and nothing is found, we may say "no match",
and some codepaths that use the wildmatch API may die. And in such a
place, when the user gives '**' and nothing is found, we should do the
same in the same codepath. In either case, the implementation of the
wildmatch API is not the place to call die(), I think.

And yes, treating an unanchored "**" as if there is just a "*"
followed by another "*" makes good sense.

Thanks, both.
On Sat, Oct 27 2018, Nguyễn Thái Ngọc Duy wrote:

> In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> can but only in three patterns:
>
> - '**/' matches zero or more leading directories
> - '/**/' matches zero or more directories in between
> - '/**' matches zero or more trailing directories/files
>
> When '**' is present but not in one of these patterns, the current
> behavior is to consider the pattern invalid and stop matching. In other
> words, 'foo**bar' never matches anything, whatever you throw at it.
>
> This behavior is arguably a bit confusing partly because we can't
> really tell the user their pattern is invalid so that they can fix
> it. So instead, tolerate it and make '**' act like two regular '*'s
> (which is essentially the same as a single asterisk). This behavior
> seems more predictable.
>
> Noticed-by: dana <dana@dana.is>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  Documentation/gitignore.txt | 3 ++-
>  t/t3070-wildmatch.sh        | 4 ++--
>  wildmatch.c                 | 4 ++--
>  wildmatch.h                 | 1 -
>  4 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
> index d107daaffd..1c94f08ff4 100644
> --- a/Documentation/gitignore.txt
> +++ b/Documentation/gitignore.txt
> @@ -129,7 +129,8 @@ full pathname may have special meaning:
>     matches zero or more directories. For example, "`a/**/b`"
>     matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.
>
> - - Other consecutive asterisks are considered invalid.
> + - Other consecutive asterisks are considered regular asterisks and
> +   will match according to the previous rules.
>
>  NOTES
>  -----
> diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh
> index 46aca0af10..891d4d7cb9 100755
> --- a/t/t3070-wildmatch.sh
> +++ b/t/t3070-wildmatch.sh
> @@ -237,7 +237,7 @@ match 0 0 0 0 foobar 'foo\*bar'
>  match 1 1 1 1 'f\oo' 'f\\oo'
>  match 1 1 1 1 ball '*[al]?'
>  match 0 0 0 0 ten '[ten]'
> -match 0 0 1 1 ten '**[!te]'
> +match 1 1 1 1 ten '**[!te]'
>  match 0 0 0 0 ten '**[!ten]'
>  match 1 1 1 1 ten 't[a-g]n'
>  match 0 0 0 0 ten 't[!a-g]n'
> @@ -253,7 +253,7 @@ match 1 1 1 1 ']' ']'
>  # Extended slash-matching features
>  match 0 0 1 1 'foo/baz/bar' 'foo*bar'
>  match 0 0 1 1 'foo/baz/bar' 'foo**bar'
> -match 0 0 1 1 'foobazbar' 'foo**bar'
> +match 1 1 1 1 'foobazbar' 'foo**bar'
>  match 1 1 1 1 'foo/baz/bar' 'foo/**/bar'
>  match 1 1 0 0 'foo/baz/bar' 'foo/**/**/bar'
>  match 1 1 1 1 'foo/b/a/z/bar' 'foo/**/bar'
> diff --git a/wildmatch.c b/wildmatch.c
> index d074c1be10..9e9e2a2f95 100644
> --- a/wildmatch.c
> +++ b/wildmatch.c
> @@ -104,8 +104,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>  					    dowild(p + 1, text, flags) == WM_MATCH)
>  						return WM_MATCH;
>  					match_slash = 1;
> -				} else
> -					return WM_ABORT_MALFORMED;
> +				} else /* WM_PATHNAME is set */
> +					match_slash = 0;
>  			} else
>  				/* without WM_PATHNAME, '*' == '**' */
>  				match_slash = flags & WM_PATHNAME ? 0 : 1;
> diff --git a/wildmatch.h b/wildmatch.h
> index b8c826aa68..5993696298 100644
> --- a/wildmatch.h
> +++ b/wildmatch.h
> @@ -4,7 +4,6 @@
>  #define WM_CASEFOLD 1
>  #define WM_PATHNAME 2
>
> -#define WM_ABORT_MALFORMED 2
>  #define WM_NOMATCH 1
>  #define WM_MATCH 0
>  #define WM_ABORT_ALL -1

This patch looks good to me, but I think it's a bad state of affairs to
keep changing these semantics and not having something like a
"gitwildmatch" doc where we document this matching syntax.

Also I still need to dig up the work for using PCRE as an alternate
matching engine. The PCRE devs produced a bug-for-bug compatible version
of our wildmatch function (all the more reason to document it), so I
think they'll need to change it now that this is in, but I haven't
rebased those ancient patches yet.

Do you have any thoughts on how to proceed with getting this documented
/ into some stable state where we can specify it?

Even if we don't end up using PCRE as a matching engine (sometimes it
was faster, sometimes slower) I think it would be very useful if we can
spew out "here's your pattern as a regex" for self-documentation
purposes. Then that can be piped into e.g. "perl -Mre=debug" to see a
step-by-step guide for how the pattern compiles, and why it does or
doesn't match a given thing.
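To make the "here's your pattern as a regex" idea concrete, here is a rough sketch of such a translation. It is not the PCRE developers' converter and not anything in git: `glob_to_regex()` and `emit()` are hypothetical names, the output targets POSIX extended regex syntax, only WM_PATHNAME semantics are modelled, and bracket expressions are treated as literal characters for brevity. Unanchored runs of asterisks are translated like a single '*', which is the behavior this patch introduces.

```c
#include <stddef.h>
#include <string.h>

/* Append s to the buffer at *dst; 0 on success, -1 if it does not fit. */
static int emit(char **dst, const char *end, const char *s)
{
	size_t n = strlen(s);

	if ((size_t)(end - *dst) <= n)
		return -1;	/* always keep room for the final NUL */
	memcpy(*dst, s, n);
	*dst += n;
	**dst = '\0';
	return 0;
}

/*
 * Hypothetical translator: wildmatch pattern (pathname mode) to an
 * anchored POSIX ERE. Returns 0 on success, -1 if out is too small.
 */
static int glob_to_regex(const char *pat, char *out, size_t outlen)
{
	const char *p = pat;
	const char *end = out + outlen;
	char *dst = out;

	if (outlen == 0)
		return -1;
	*dst = '\0';
	if (emit(&dst, end, "^"))
		return -1;
	while (*p) {
		if (*p == '*') {
			const char *start = p;
			int anchored;

			while (*p == '*')
				p++;
			anchored = (p - start >= 2) &&
				   (start == pat || start[-1] == '/') &&
				   (*p == '\0' || *p == '/');
			if (anchored && *p == '/') {
				// anchored run followed by a slash:
				// zero or more whole directories
				if (emit(&dst, end, "(.*/)?"))
					return -1;
				p++;	/* the slash belongs to the group */
			} else if (anchored) {
				// anchored run at the end: everything below
				if (emit(&dst, end, ".*"))
					return -1;
			} else if (emit(&dst, end, "[^/]*")) {
				// single '*' or unanchored run: no slashes
				return -1;
			}
		} else if (*p == '?') {
			if (emit(&dst, end, "[^/]"))
				return -1;
			p++;
		} else {
			char buf[3] = { 0, 0, 0 };

			if (*p == '\\' && p[1])
				p++;	/* pattern escape: next char is literal */
			if (strchr(".^$*+?()[]{}|\\", *p)) {
				buf[0] = '\\';
				buf[1] = *p;
			} else {
				buf[0] = *p;
			}
			if (emit(&dst, end, buf))
				return -1;
			p++;
		}
	}
	return emit(&dst, end, "$");
}
```

Under these assumptions, 'foo**bar' and 'foo*bar' both come out as `^foo[^/]*bar$`, while 'foo/**/bar' becomes `^foo/(.*/)?bar$`, which makes the difference between anchored and unanchored "**" visible at a glance.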
On Mon, Oct 29, 2018 at 2:24 PM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> This patch looks good to me, but I think it's a bad state of affairs to
> keep changing these semantics and not having something like a
> "gitwildmatch" doc where we document this matching syntax.

While we don't have a separate document for it, the behavior _is_
documented, even if perhaps it wasn't as clear as it could be. There
were even tests for this corner case.

> Do you have any thoughts on how to proceed with getting this documented
> / into some stable state where we can specify it?

wildmatch has been used in git for a few years, if I remember
correctly, so to me it is stable. Granted, there are corner cases like
this, but I can't prevent all bugs (especially this one, which is more
like a design mistake than a bug per se).

You are welcome to refactor gitignore.txt and add gitwildmatch.txt.
diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
index d107daaffd..1c94f08ff4 100644
--- a/Documentation/gitignore.txt
+++ b/Documentation/gitignore.txt
@@ -129,7 +129,8 @@ full pathname may have special meaning:
    matches zero or more directories. For example, "`a/**/b`"
    matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.

- - Other consecutive asterisks are considered invalid.
+ - Other consecutive asterisks are considered regular asterisks and
+   will match according to the previous rules.

 NOTES
 -----
diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh
index 46aca0af10..891d4d7cb9 100755
--- a/t/t3070-wildmatch.sh
+++ b/t/t3070-wildmatch.sh
@@ -237,7 +237,7 @@ match 0 0 0 0 foobar 'foo\*bar'
 match 1 1 1 1 'f\oo' 'f\\oo'
 match 1 1 1 1 ball '*[al]?'
 match 0 0 0 0 ten '[ten]'
-match 0 0 1 1 ten '**[!te]'
+match 1 1 1 1 ten '**[!te]'
 match 0 0 0 0 ten '**[!ten]'
 match 1 1 1 1 ten 't[a-g]n'
 match 0 0 0 0 ten 't[!a-g]n'
@@ -253,7 +253,7 @@ match 1 1 1 1 ']' ']'
 # Extended slash-matching features
 match 0 0 1 1 'foo/baz/bar' 'foo*bar'
 match 0 0 1 1 'foo/baz/bar' 'foo**bar'
-match 0 0 1 1 'foobazbar' 'foo**bar'
+match 1 1 1 1 'foobazbar' 'foo**bar'
 match 1 1 1 1 'foo/baz/bar' 'foo/**/bar'
 match 1 1 0 0 'foo/baz/bar' 'foo/**/**/bar'
 match 1 1 1 1 'foo/b/a/z/bar' 'foo/**/bar'
diff --git a/wildmatch.c b/wildmatch.c
index d074c1be10..9e9e2a2f95 100644
--- a/wildmatch.c
+++ b/wildmatch.c
@@ -104,8 +104,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
 					    dowild(p + 1, text, flags) == WM_MATCH)
 						return WM_MATCH;
 					match_slash = 1;
-				} else
-					return WM_ABORT_MALFORMED;
+				} else /* WM_PATHNAME is set */
+					match_slash = 0;
 			} else
 				/* without WM_PATHNAME, '*' == '**' */
 				match_slash = flags & WM_PATHNAME ? 0 : 1;
diff --git a/wildmatch.h b/wildmatch.h
index b8c826aa68..5993696298 100644
--- a/wildmatch.h
+++ b/wildmatch.h
@@ -4,7 +4,6 @@
 #define WM_CASEFOLD 1
 #define WM_PATHNAME 2

-#define WM_ABORT_MALFORMED 2
 #define WM_NOMATCH 1
 #define WM_MATCH 0
 #define WM_ABORT_ALL -1
In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
can but only in three patterns:

- '**/' matches zero or more leading directories
- '/**/' matches zero or more directories in between
- '/**' matches zero or more trailing directories/files

When '**' is present but not in one of these patterns, the current
behavior is to consider the pattern invalid and stop matching. In other
words, 'foo**bar' never matches anything, whatever you throw at it.

This behavior is arguably a bit confusing partly because we can't
really tell the user their pattern is invalid so that they can fix
it. So instead, tolerate it and make '**' act like two regular '*'s
(which is essentially the same as a single asterisk). This behavior
seems more predictable.

Noticed-by: dana <dana@dana.is>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 Documentation/gitignore.txt | 3 ++-
 t/t3070-wildmatch.sh        | 4 ++--
 wildmatch.c                 | 4 ++--
 wildmatch.h                 | 1 -
 4 files changed, 6 insertions(+), 6 deletions(-)
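The rules in the commit message can be illustrated with a deliberately tiny matcher. This is a sketch, not git's dowild(): `match_here()` and `mini_wildmatch()` are hypothetical names, only literals, '?', '*' and runs of asterisks are supported (no brackets, escapes, case folding or abort codes), and WM_PATHNAME is assumed to be always set. It shows the three anchored forms crossing '/' and every other run of asterisks behaving like a plain '*'.

```c
#include <stddef.h>

/*
 * Hypothetical minimal matcher, pathname mode only.
 * Returns 1 if the rest of the pattern at p matches the text at t.
 * pat_start is the beginning of the whole pattern, needed to tell
 * whether a run of asterisks is preceded by '/' or starts the pattern.
 */
static int match_here(const char *pat_start, const char *p, const char *t)
{
	for (; *p; p++, t++) {
		const char *run_start;
		int match_slash = 0;

		if (*p == '?') {
			if (*t == '\0' || *t == '/')
				return 0;	/* '?' never matches '/' */
			continue;
		}
		if (*p != '*') {
			if (*p != *t)
				return 0;
			continue;
		}

		run_start = p;
		while (*p == '*')
			p++;

		if (p - run_start >= 2 &&
		    (run_start == pat_start || run_start[-1] == '/') &&
		    (*p == '\0' || *p == '/')) {
			// anchored run: first try matching zero directories
			if (*p == '/' && match_here(pat_start, p + 1, t))
				return 1;
			match_slash = 1;
		}
		// otherwise match_slash stays 0: the run acts like one '*'

		/* let the asterisk(s) swallow 0, 1, 2, ... characters */
		for (;; t++) {
			if (match_here(pat_start, p, t))
				return 1;
			if (*t == '\0')
				return 0;
			if (*t == '/' && !match_slash)
				return 0;
		}
	}
	return *t == '\0';
}

static int mini_wildmatch(const char *pattern, const char *text)
{
	return match_here(pattern, pattern, text);
}
```

With this sketch, 'foo**bar' matches 'foobazbar' but not 'foo/baz/bar', mirroring the updated expectations in t3070-wildmatch.sh, while 'foo/**/bar' still matches both 'foo/bar' and 'foo/baz/bar'.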