diff mbox series

[v2,1/1] pathspec: warn for a no-glob entry that contains `**`

Message ID 20210325233648.31162-2-stdedos+git@gmail.com (mailing list archive)
State Superseded
Headers show
Series pathspec: warn for a no-glob entry that contains `**` | expand

Commit Message

Σταύρος Ντέντος March 25, 2021, 11:36 p.m. UTC
From: Stavros Ntentos <133706+stdedos@users.noreply.github.com>

If a pathspec given that contains `**`, chances are that someone is
naively expecting that it will do what the manual has told him
(i.e. that `**` will match 0-or-more directories).

However, without an explicit `:(glob)` magic, that will fall out the sky:
the two `**` will merge into one star, which surrounded by slashes, will
match any directory name.

These changes attempt to bring awareness to this issue.

Signed-off-by: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
---
 pathspec.c                 | 13 +++++++++++++
 pathspec.h                 |  1 +
 t/t6130-pathspec-noglob.sh | 13 +++++++++++++
 3 files changed, 27 insertions(+)

Comments

Junio C Hamano March 26, 2021, 12:41 a.m. UTC | #1
"Σταύρος Ντέντος"  <stdedos@gmail.com> writes:

> From: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
>
> If a pathspec given that contains `**`, chances are that someone is
> naively expecting that it will do what the manual has told him
> (i.e. that `**` will match 0-or-more directories).

OK. this is a well written introduction of the problem being solved.

But chances are that the user may have meant to give double-asterisk
just out of habit, very well knowing that it would not be an error
to give ** when a single * suffices.

Which leads us to suspect that it is a bad change to make this a
warning that unconditionally fires and cannot squelch.  It may be
better to implement it as an advice, where the user who knowingly
uses that construct can say "yes, I know what I am doing" and
squelch it.

> However, without an explicit `:(glob)` magic, that will fall out the sky:
> the two `**` will merge into one star, which surrounded by slashes, will
> match any directory name.

I am not sure everything after the "the sky:" you wrote is what you
meant to write.  Without being marked with a "glob" magic, a
wildcard character in a pattern matches even a slash, so these two

	git ls-files 'Documentation**v2.txt'
	git ls-files 'Documentation*v2.txt'

give the identical result and there is nothing about "surrounded by
slashes" involved in it.

So, perhaps taking the first two paragraphs together and rewriting:

	When '/**/' appears in the pathspec, the user may be
	expecting that it would be matched using the "wildmatch"
	semantics, matching 0 or more directories.  But that is
	not what happens without ":(glob)" magic.

or did you want to warn about "foo**bar" as if it were "foo/**/bar"?

The output from these commands:

	git ls-files ':(glob)Documentation/**/*stash.txt'
	git ls-files ':(glob)Documentation***stash.txt'
	git ls-files ':(glob)Documentation**stash.txt'

seems to tell me that the "zero-or-more directories" magic happens
only when /**/ form is used, not when two asterisks are placed next
to each other in a random context.

> These changes attempt to bring awareness to this issue.

In any case, after presenting what problem we are trying to address,
we present our approach / solution in a form as if we are giving
orders to the codebase to "become like so".

	Teach the pathspec parser to emit an advice message when a
	substring "/**/" appears in a pathspec element that does not
	have a ":(glob)" magic.  Make sure we don't disturb users
	who use ":(literal)" magic with such a substring, as it is
	clear they want to find these strings literally.

perhaps.

> Signed-off-by: Stavros Ntentos <133706+stdedos@users.noreply.github.com>
> ---
>  pathspec.c                 | 13 +++++++++++++
>  pathspec.h                 |  1 +
>  t/t6130-pathspec-noglob.sh | 13 +++++++++++++
>  3 files changed, 27 insertions(+)
>
> diff --git a/pathspec.c b/pathspec.c
> index 7a229d8d22..d5b9c0d792 100644
> --- a/pathspec.c
> +++ b/pathspec.c
> @@ -1,3 +1,4 @@
> +#include <string.h>

Never include any header file before including <git-compat-util.h>.
Including <git-compat-util.h> indirectly by including <cache.h> or
<builtin.h> counts as including it as the first thing.

As all necessary standard system headers are supposed to be given by
including <git-compat-util.h>, if you needed to explicitly include
the above, that may mean <git-compat-util.h>, which should cause the
header that lists necessary function, is not working properly on
your platform and needs to be adjusted. Are you on a minority
platform we haven't adjusted the header inclusion order for, and
what function are you trying to use that does not work without
adding this extra #include here?  We already use strstr() in many
places, so that is not the function we are missing system includes
for.

Or perhaps this is a remnant of what you added while you were
experimenting, without realizing that "#include <cache.h>" that is
already there already gives you anything you need.  If that is the
case, iow, if the code works without the extra include, please
remove that line.

> @@ -588,6 +589,8 @@ void parse_pathspec(struct pathspec *pathspec,
>  
>  		init_pathspec_item(item + i, flags, prefix, prefixlen, entry);
>  
> +		check_missing_glob(entry, item[i].magic);
> +

We have only one caller of the helper (i.e. this one) and nowhere
else, and the helper is small enough ...

> @@ -739,3 +742,13 @@ int match_pathspec_attrs(const struct index_state *istate,
>  
>  	return 1;
>  }
> +
> +void check_missing_glob(const char *pathspec_entry, int flags) {
> +	if (flags & (PATHSPEC_GLOB | PATHSPEC_LITERAL)) {
> +		return;
> +	}
> +
> +	if (strstr(pathspec_entry, "**")) {
> +		warning(_("Pathspec provided contains `**`, but no :(glob) magic.\nIt will not match 0 or more directories!"));
> +	}
> +}

... hence, there is no reason why this helper should exist, let
alone be a public function.  Also, there is no reason to split the
conditional into two.  Just "it is OK if we have GLOB or LITERAL,
otherwise see if there is /**/" is sufficient.

It would be more sensible to just add the check to parse_pathspec()
directly.

Also, our warning messages do not begin with an uppercase.

See attached patch for all of the above, but it is for illustration
purposes only; not even compile tested.  I am not illustrating how
to turn this into an advice message that the user can squelch with
the illustration patch.

> diff --git a/t/t6130-pathspec-noglob.sh b/t/t6130-pathspec-noglob.sh
> index ba7902c9cd..af6cd16f76 100755
> --- a/t/t6130-pathspec-noglob.sh
> +++ b/t/t6130-pathspec-noglob.sh
> @@ -157,4 +157,17 @@ test_expect_success '**/ does not work with :(literal) and --glob-pathspecs' '
>  	test_must_be_empty actual
>  '
>  

> +cat > expected <<"EOF"
> +warning: Pathspec provided contains `**`, but no :(glob) magic.
> +EOF

Please don't imitate ancient tests.  These days, all preparations
for individual tests, including the expected outcome, are to be done
inside the test itself.  Study nearby tests in the same script for
better examples.

> +test_expect_success '** without :(glob) warns of lacking glob magic' '
> +	test_might_fail git stash -- "**/bar" 2>warns &&
> +	grep -Ff expected warns

Same comment.  Nearby examples all set up expeced output and never
uses "grep" to see if the expectation is satisfied.  Imitate them.

> +'
> +
> +test_expect_success '** with :(literal) does not warn of lacking glob magic' '
> +	test_might_fail git stash -- ":(literal)**/bar" 2>warns &&
> +	! grep -Ff expected warns

Ditto.

Thanks.


 pathspec.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git i/pathspec.c w/pathspec.c
index 18b3be362a..5b0eed5a65 100644
--- i/pathspec.c
+++ w/pathspec.c
@@ -598,6 +598,10 @@ void parse_pathspec(struct pathspec *pathspec,
 			die(_("pathspec '%s' is beyond a symbolic link"), entry);
 		}
 
+		if (!(item[i].magic & (PATHSPEC_GLOB | PATHSPEC_LITERAL)) &&
+		    strstr(entry, "**"))
+			warning(_("** in '%s'without :(glob) magic:"), entry);
+
 		if (item[i].nowildcard_len < item[i].len)
 			pathspec->has_wildcard = 1;
 		pathspec->magic |= item[i].magic;
Eric Sunshine March 26, 2021, 1:32 a.m. UTC | #2
On Thu, Mar 25, 2021 at 8:43 PM Junio C Hamano <gitster@pobox.com> wrote:
> So, perhaps taking the first two paragraphs together and rewriting:
>
>         When '/**/' appears in the pathspec, the user may be
>         expecting that it would be matched using the "wildmatch"
>         semantics, matching 0 or more directories.  But that is
>         not what happens without ":(glob)" magic.
>
>         Teach the pathspec parser to emit an advice message when a
>         substring "/**/" appears in a pathspec element that does not
>         have a ":(glob)" magic.  Make sure we don't disturb users
>         who use ":(literal)" magic with such a substring, as it is
>         clear they want to find these strings literally.

I haven't been following the discussion, but is there a reason we need
to penalize the user with a warning rather than helping, for instance
by inferring ":(glob)" in the presence of `/**/` if not otherwise
countermanded by ":(literal)" or whatnot?
Junio C Hamano March 26, 2021, 3:02 a.m. UTC | #3
Eric Sunshine <sunshine@sunshineco.com> writes:

> I haven't been following the discussion, but is there a reason we need
> to penalize the user with a warning rather than helping, for instance
> by inferring ":(glob)" in the presence of `/**/` if not otherwise
> countermanded by ":(literal)" or whatnot?

Two reasons I can think of offhand are

 - How /**/ is interpreted is not the only thing that is different
   between the normal mode and the glob magic mode.  IIRC, an
   asterisk * or a question mark ? matches slash in normal mode (it
   started out as fnmatch() without FNM_PATHNAME).  Should we warn
   about ":(glob)" if somebody asks for "foo*", "*foo", or
   "foo*bar".  If not, why shouldn't?

 - Thers is no explicit magic that says "there is no magic" to
   countermand such a DWIM.
Jeff King March 26, 2021, 5:13 a.m. UTC | #4
On Thu, Mar 25, 2021 at 08:02:31PM -0700, Junio C Hamano wrote:

> Eric Sunshine <sunshine@sunshineco.com> writes:
> 
> > I haven't been following the discussion, but is there a reason we need
> > to penalize the user with a warning rather than helping, for instance
> > by inferring ":(glob)" in the presence of `/**/` if not otherwise
> > countermanded by ":(literal)" or whatnot?
> 
> Two reasons I can think of offhand are
> 
>  - How /**/ is interpreted is not the only thing that is different
>    between the normal mode and the glob magic mode.  IIRC, an
>    asterisk * or a question mark ? matches slash in normal mode (it
>    started out as fnmatch() without FNM_PATHNAME).  Should we warn
>    about ":(glob)" if somebody asks for "foo*", "*foo", or
>    "foo*bar".  If not, why shouldn't?
> 
>  - Thers is no explicit magic that says "there is no magic" to
>    countermand such a DWIM.

I do wonder if this distinction creates more harm than good.

As somebody who has never used ":(glob)" myself, I was confused about
what it even does (and it was not easy to find the documentation; I
ended up finding the original commit in the history first!).

We have three modes:

  - no globbing

  - globbing with fnmatch(), with FNM_PATHNAME according to the docs

  - globbing with wildmatch

You may notice that I would call both of those latter two "globbing",
but only one of them is triggered by the ":(glob)" magic. :)

This just seems really confusing, and I wonder if anybody would be that
sad if we just used wildmatch everywhere. The original bd30c2e484
(pathspec: support :(glob) syntax, 2013-07-14) even says:

  The old fnmatch behavior is considered deprecated and discouraged to
  use.

but I guess it would be backwards-incompatible.

Maybe it would be less confusing if we named the three states
explicitly:

  :(literal)
  :(fnmatch)
  :(wildmatch)

(and keeping :(glob) as a synonym for compatibility).

-Peff
Σταύρος Ντέντος March 26, 2021, 3:43 p.m. UTC | #5
I think I have managed to address everyone's questions in this thread.
Hopefully everything will be addressed by this, and the patch that will soon follow:

> You may notice that I would call both of those latter two "globbing",
> but only one of them is triggered by the ":(glob)" magic. :)

And that's why I argued a DWIM was warranted here
(https://lore.kernel.org/git/xmqqft1iquka.fsf@gitster.g/; it's more clear
in Junio's quote, but you could of course read the original).

I would want to be considered as an above-average git user, and I still was
oblivious to the fact that `**` required a `:(glob)` from the command line.
Especially since `.gitignore` files are treated differently
(i.e. don't require `:(glob)`)

I cannot / don't want to argue to "do it" or not, as I think my experience
is not substantial enough to navigate such argument, and go from concept
to materialization.

That being said, if there was a cli.pathspec.wildmatch flag,
I would've had it set to true in my global gitconfig.
Ofc I could set `GIT_GLOB_PATHSPECS=1`, but I am not that happy to
force it by setting it in a shell rc, instead of where it belongs.

> I am not sure everything after the "the sky:" you wrote is what you
> meant to write.  Without being marked with a "glob" magic, a
> wildcard character in a pattern matches even a slash, so these two
>
> 	git ls-files 'Documentation**v2.txt'
> 	git ls-files 'Documentation*v2.txt'
>
> give the identical result and there is nothing about "surrounded by
> slashes" involved in it.

It seems you are right - in `fnmatch` mode, `:!*.gitignore` would've
served me right (and avoided the whole thread).

If you think that it's just my (i.e. few people's) lacking of understanding
`fnmatch` glob, then we can drop this.
However, given the disparity between `.gitignore` syntax and pathspec given
from the command line (from my limited POV meaningless/confusing), and your
(plural) arguments of backwards compatibility, I think we can draw the line
at an advice been acceptable.

Unless I see other points (like the warn vs advice), or pure C code-review
points, I am getting the impression that this thread will not move forward.
As I don't know how reviews usually happen here, and lacking some other medium
of discussion, I would appreciate (at some point) an explicit go/no-go
decision - to save everyone's time.

Being unfamiliar to the project, and being who I am, I prefer explicit vs implicit points.
Navigating an unknown process from top to bottom is pressure enough to me :-D

> seems to tell me that the "zero-or-more directories" magic happens
> only when /**/ form is used, not when two asterisks are placed next
> to each other in a random context.

Not exactly: ** needs to either start or end with a slash (or both) for its
wildmatch behavior. I can try to make the code more explicit, but of course
the code (and tests) will increase - and then maybe disproportionately to
the otherwise intended-to-be small change (since DWIM is not wanted).
diff mbox series

Patch

diff --git a/pathspec.c b/pathspec.c
index 7a229d8d22..d5b9c0d792 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -1,3 +1,4 @@ 
+#include <string.h>
 #include "cache.h"
 #include "config.h"
 #include "dir.h"
@@ -588,6 +589,8 @@  void parse_pathspec(struct pathspec *pathspec,
 
 		init_pathspec_item(item + i, flags, prefix, prefixlen, entry);
 
+		check_missing_glob(entry, item[i].magic);
+
 		if (item[i].magic & PATHSPEC_EXCLUDE)
 			nr_exclude++;
 		if (item[i].magic & magic_mask)
@@ -739,3 +742,13 @@  int match_pathspec_attrs(const struct index_state *istate,
 
 	return 1;
 }
+
+void check_missing_glob(const char *pathspec_entry, int flags) {
+	if (flags & (PATHSPEC_GLOB | PATHSPEC_LITERAL)) {
+		return;
+	}
+
+	if (strstr(pathspec_entry, "**")) {
+		warning(_("Pathspec provided contains `**`, but no :(glob) magic.\nIt will not match 0 or more directories!"));
+	}
+}
diff --git a/pathspec.h b/pathspec.h
index 454ce364fa..913518ebd3 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -157,5 +157,6 @@  char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
 int match_pathspec_attrs(const struct index_state *istate,
 			 const char *name, int namelen,
 			 const struct pathspec_item *item);
+void check_missing_glob(const char* pathspec_entry, int flags);
 
 #endif /* PATHSPEC_H */
diff --git a/t/t6130-pathspec-noglob.sh b/t/t6130-pathspec-noglob.sh
index ba7902c9cd..af6cd16f76 100755
--- a/t/t6130-pathspec-noglob.sh
+++ b/t/t6130-pathspec-noglob.sh
@@ -157,4 +157,17 @@  test_expect_success '**/ does not work with :(literal) and --glob-pathspecs' '
 	test_must_be_empty actual
 '
 
+cat > expected <<"EOF"
+warning: Pathspec provided contains `**`, but no :(glob) magic.
+EOF
+test_expect_success '** without :(glob) warns of lacking glob magic' '
+	test_might_fail git stash -- "**/bar" 2>warns &&
+	grep -Ff expected warns
+'
+
+test_expect_success '** with :(literal) does not warn of lacking glob magic' '
+	test_might_fail git stash -- ":(literal)**/bar" 2>warns &&
+	! grep -Ff expected warns
+'
+
 test_done