diff mbox series

[v2,04/14] dir: select directories correctly

Message ID 723de4e42582afbe841ed96470fc02db44b24b5e.1631453010.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Sparse-checkout: modify 'git add', 'git rm', and 'git add' behavior | expand

Commit Message

Derrick Stolee Sept. 12, 2021, 1:23 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

When matching a path against a list of patterns, the ones that require a
directory match previously did not work when a filename is specified.
This was fine when all pattern-matching was done within methods such as
unpack_trees() that check a directory before recursing into the
contained files. However, other commands will start matching individual
files against pattern lists without that recursive approach.

We modify path_matches_dir_pattern() to take a strbuf 'path_parent' that
is used to store the parent directory of 'pathname' between multiple
pattern matching tests. This is loaded lazily, only on the first pattern
it finds that has the PATTERN_FLAG_MUSTBEDIR flag.

If we find that a path has a parent directory, we start by checking to
see if that parent directory matches the pattern. If so, then we do not
need to query the index for the type (which can be expensive). If we
find that the parent does not match, then we still must check the type
from the index for the given pathname.

Note that this does not affect cone mode pattern matching, but instead
the more general -- and slower -- full pattern set. Thus, this does not
affect the sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 dir.c | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Sept. 12, 2021, 10:21 p.m. UTC | #1
On Sun, Sep 12 2021, Derrick Stolee via GitGitGadget wrote:

> +	/*
> +	 * Use 'alloc' as an indicator that the string has not been
> +	 * initialized, in case the parent is the root directory.
> +	 */
> +	if (!path_parent->alloc) {

This isn't wrong, but seems to be way too cozy with the internal
implementation details of strbuf. For what it's worth I renamed it to
"alloc2" and found that this would be only the 3rd bit of code out of
strbuf.[ch] that cares about that member.

> +		char *slash;
> +		strbuf_addstr(path_parent, pathname);

So is "pathname" ever the empty string? If not we could check the
length?

Or probably better: ...

> @@ -1331,6 +1359,7 @@ static struct path_pattern *last_matching_pattern_from_list(const char *pathname
>  {
>  	struct path_pattern *res = NULL; /* undecided */
>  	int i;
> +	struct strbuf path_parent = STRBUF_INIT;

Just malloc + strbuf_init() this in the above function and have a
"struct strbuf *" initialized to NULL here? Then we can use a much more
idiomatic "is it NULL?" to check if it's initialized.
Derrick Stolee Sept. 15, 2021, 2:41 p.m. UTC | #2
On 9/12/2021 6:21 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Sep 12 2021, Derrick Stolee via GitGitGadget wrote:
> 
>> +	/*
>> +	 * Use 'alloc' as an indicator that the string has not been
>> +	 * initialized, in case the parent is the root directory.
>> +	 */
>> +	if (!path_parent->alloc) {
> 
> This isn't wrong, but seems to be way too cozy with the internal
> implementation details of strbuf. For what it's worth I renamed it to
> "alloc2" and found that this would be only the 3rd bit of code out of
> strbuf.[ch] that cares about that member.

I can understand not wanting to poke into the internals.

>> +		char *slash;
>> +		strbuf_addstr(path_parent, pathname);
> 
> So is "pathname" ever the empty string? If not we could check the
> length?

We are given 'pathlen' as a parameter, so this should just use
strbuf_add() instead.

> Or probably better: ...
> 
>> @@ -1331,6 +1359,7 @@ static struct path_pattern *last_matching_pattern_from_list(const char *pathname
>>  {
>>  	struct path_pattern *res = NULL; /* undecided */
>>  	int i;
>> +	struct strbuf path_parent = STRBUF_INIT;
> 
> Just malloc + strbuf_init() this in the above function and have a
> "struct strbuf *" initialized to NULL here? Then we can use a much more
> idiomatic "is it NULL?" to check if it's initialized.
 
That makes sense. Can do.

Thanks,
-Stolee
Elijah Newren Sept. 15, 2021, 2:54 p.m. UTC | #3
On Sun, Sep 12, 2021 at 6:23 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When matching a path against a list of patterns, the ones that require a
> directory match previously did not work when a filename is specified.
> This was fine when all pattern-matching was done within methods such as
> unpack_trees() that check a directory before recursing into the
> contained files. However, other commands will start matching individual
> files against pattern lists without that recursive approach.
>
> We modify path_matches_dir_pattern() to take a strbuf 'path_parent' that
> is used to store the parent directory of 'pathname' between multiple
> pattern matching tests. This is loaded lazily, only on the first pattern
> it finds that has the PATTERN_FLAG_MUSTBEDIR flag.
>
> If we find that a path has a parent directory, we start by checking to
> see if that parent directory matches the pattern. If so, then we do not
> need to query the index for the type (which can be expensive). If we
> find that the parent does not match, then we still must check the type
> from the index for the given pathname.
>
> Note that this does not affect cone mode pattern matching, but instead
> the more general -- and slower -- full pattern set. Thus, this does not
> affect the sparse index.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  dir.c | 34 ++++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/dir.c b/dir.c
> index 652135df896..fe5ee87bb5f 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1305,10 +1305,38 @@ int match_pathname(const char *pathname, int pathlen,
>
>  static int path_matches_dir_pattern(const char *pathname,
>                                     int pathlen,
> +                                   struct strbuf *path_parent,
>                                     int *dtype,
>                                     struct path_pattern *pattern,
>                                     struct index_state *istate)
>  {
> +       /*
> +        * Use 'alloc' as an indicator that the string has not been
> +        * initialized, in case the parent is the root directory.
> +        */
> +       if (!path_parent->alloc) {
> +               char *slash;
> +               strbuf_addstr(path_parent, pathname);
> +               slash = find_last_dir_sep(path_parent->buf);
> +
> +               if (slash)
> +                       *slash = '\0';

Are you breaking strbuf invariants here?  path_parent->len will not be
corrected by this string manipulation.  Perhaps replace this if-else
block with

    strbuf_setlen(path_parent, slash ? slash - path_parent->buf : 0)

> +               else
> +                       strbuf_setlen(path_parent, 0);
> +       }
> +
> +       /*
> +        * If the parent directory matches the pattern, then we do not
> +        * need to check for dtype.
> +        */
> +       if (path_parent->len &&
> +           match_pathname(path_parent->buf, path_parent->len,
> +                          pattern->base,
> +                          pattern->baselen ? pattern->baselen - 1 : 0,
> +                          pattern->pattern, pattern->nowildcardlen,
> +                          pattern->patternlen, pattern->flags))
> +               return 1;
> +
>         *dtype = resolve_dtype(*dtype, istate, pathname, pathlen);
>         if (*dtype != DT_DIR)
>                 return 0;
> @@ -1331,6 +1359,7 @@ static struct path_pattern *last_matching_pattern_from_list(const char *pathname
>  {
>         struct path_pattern *res = NULL; /* undecided */
>         int i;
> +       struct strbuf path_parent = STRBUF_INIT;
>
>         if (!pl->nr)
>                 return NULL;    /* undefined */
> @@ -1340,8 +1369,8 @@ static struct path_pattern *last_matching_pattern_from_list(const char *pathname
>                 const char *exclude = pattern->pattern;
>                 int prefix = pattern->nowildcardlen;
>
> -               if ((pattern->flags & PATTERN_FLAG_MUSTBEDIR) &&
> -                   !path_matches_dir_pattern(pathname, pathlen,
> +               if (pattern->flags & PATTERN_FLAG_MUSTBEDIR &&
> +                   !path_matches_dir_pattern(pathname, pathlen, &path_parent,
>                                               dtype, pattern, istate))
>                         continue;
>
> @@ -1367,6 +1396,7 @@ static struct path_pattern *last_matching_pattern_from_list(const char *pathname
>                         break;
>                 }
>         }
> +       strbuf_release(&path_parent);
>         return res;
>  }
>
> --
> gitgitgadget
Derrick Stolee Sept. 15, 2021, 4:43 p.m. UTC | #4
On 9/15/2021 10:54 AM, Elijah Newren wrote:
>> +       /*
>> +        * Use 'alloc' as an indicator that the string has not been
>> +        * initialized, in case the parent is the root directory.
>> +        */
>> +       if (!path_parent->alloc) {
>> +               char *slash;
>> +               strbuf_addstr(path_parent, pathname);
>> +               slash = find_last_dir_sep(path_parent->buf);
>> +
>> +               if (slash)
>> +                       *slash = '\0';
> 
> Are you breaking strbuf invariants here?  path_parent->len will not be
> corrected by this string manipulation.  Perhaps replace this if-else
> block with
> 
>     strbuf_setlen(path_parent, slash ? slash - path_parent->buf : 0)

Yes, I am. I noticed and fixed this when I was rewriting this
patch for Ævar's feedback. Thanks for pointing it out.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/dir.c b/dir.c
index 652135df896..fe5ee87bb5f 100644
--- a/dir.c
+++ b/dir.c
@@ -1305,10 +1305,38 @@  int match_pathname(const char *pathname, int pathlen,
 
 static int path_matches_dir_pattern(const char *pathname,
 				    int pathlen,
+				    struct strbuf *path_parent,
 				    int *dtype,
 				    struct path_pattern *pattern,
 				    struct index_state *istate)
 {
+	/*
+	 * Use 'alloc' as an indicator that the string has not been
+	 * initialized, in case the parent is the root directory.
+	 */
+	if (!path_parent->alloc) {
+		char *slash;
+		strbuf_addstr(path_parent, pathname);
+		slash = find_last_dir_sep(path_parent->buf);
+
+		if (slash)
+			*slash = '\0';
+		else
+			strbuf_setlen(path_parent, 0);
+	}
+
+	/*
+	 * If the parent directory matches the pattern, then we do not
+	 * need to check for dtype.
+	 */
+	if (path_parent->len &&
+	    match_pathname(path_parent->buf, path_parent->len,
+			   pattern->base,
+			   pattern->baselen ? pattern->baselen - 1 : 0,
+			   pattern->pattern, pattern->nowildcardlen,
+			   pattern->patternlen, pattern->flags))
+		return 1;
+
 	*dtype = resolve_dtype(*dtype, istate, pathname, pathlen);
 	if (*dtype != DT_DIR)
 		return 0;
@@ -1331,6 +1359,7 @@  static struct path_pattern *last_matching_pattern_from_list(const char *pathname
 {
 	struct path_pattern *res = NULL; /* undecided */
 	int i;
+	struct strbuf path_parent = STRBUF_INIT;
 
 	if (!pl->nr)
 		return NULL;	/* undefined */
@@ -1340,8 +1369,8 @@  static struct path_pattern *last_matching_pattern_from_list(const char *pathname
 		const char *exclude = pattern->pattern;
 		int prefix = pattern->nowildcardlen;
 
-		if ((pattern->flags & PATTERN_FLAG_MUSTBEDIR) &&
-		    !path_matches_dir_pattern(pathname, pathlen,
+		if (pattern->flags & PATTERN_FLAG_MUSTBEDIR &&
+		    !path_matches_dir_pattern(pathname, pathlen, &path_parent,
 					      dtype, pattern, istate))
 			continue;
 
@@ -1367,6 +1396,7 @@  static struct path_pattern *last_matching_pattern_from_list(const char *pathname
 			break;
 		}
 	}
+	strbuf_release(&path_parent);
 	return res;
 }