diff mbox series

[RFC,2/3] grep: honor sparse checkout patterns

Message ID 0b9b4c4b414a571877163667694afa3053bf8890.1585027716.git.matheus.bernardino@usp.br (mailing list archive)
State New, archived
Headers show
Series grep: honor sparse checkout and add option to ignore it | expand

Commit Message

Matheus Tavares March 24, 2020, 6:12 a.m. UTC
One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and report all matches
found outside this subset, which kind of goes in the oposity direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case:

- git grep in worktree
- git grep --cached
- git grep $REVISION
- git grep --untracked and git grep --no-index (which already respect
  sparse checkout boundaries)

This is also what some users reported[1] they would want as the default
behavior.

Note: for `git grep $REVISION`, we will choose to honor the sparsity
patterns only when $REVISION is a commit-ish object. The reason is that,
for a tree, we don't know whether it represents the root of a
repository or a subtree. So we wouldn't be able to correctly match it
against the sparsity patterns. E.g. suppose we have a repository with
these two sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects (tags to commits should be fine).

Finally, the old behavior is still desirable for some use cases. So the
next patch will add an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Something I'm not entirely sure in this patch is how we implement the
mechanism to honor sparsity for the `git grep <commit-ish>` case (which
is treated in the grep_tree() function). Currently, the patch looks for
an index entry that matches the path, and then checks its skip_worktree
bit. But this operation is perfomed in O(log(N)); N being the number of
index entries. If there are many entries (and no so many sparsity
patterns), maybe a better approach would be to try matching the path
directly against the sparsity patterns. This would be O(M) in the number
of patterns, and it could be done, in builtin/grep.c, with a function
like the following:

static struct pattern_list sparsity_patterns;
static int sparsity_patterns_initialized = 0;
static enum pattern_match_result path_matches_sparsity_patterns(
					const char *path, int pathlen,
					const char *basename,
					struct repository *repo)
{
	int dtype = DT_UNKNOWN;

	if (!sparsity_patterns_initialized) {
		char *sparse_file = git_pathdup("info/sparse-checkout");
		int ret;

		memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
		sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
		ret = add_patterns_from_file_to_list(sparse_file, "", 0,
						     &sparsity_patterns, NULL);
		free(sparse_file);

		if (ret < 0)
			die(_("failed to load sparse-checkout patterns"));
		sparsity_patterns_initialized = 1;
	}

	return path_matches_pattern_list(path, pathlen, basename, &dtype,
					 &sparsity_patterns, repo->index);
}

Also, if I understand correctly, the index doesn't hold paths to dirs,
right? So even if a complete dir is excluded from sparse checkout, we
still have to check all its subentries, only to discover that they
should all be skipped from the search. However, if we were to check
against the sparsity patterns directly (e.g. with the function above),
we could skip such directories together with all their entries.

Oh, and there is also the case of a commit whose tree paths are not in
the index (maybe manually created objects?). For such commits, with the
index lookup approach, we would have to fall back on ignoring the
sparsity rules. I'm not sure if that would be OK, though.

Any thoughts on these two approaches (looking up the skip_worktree bit
in the index or directly matching against sparsity patterns), will be
highly appreciated. (Note that it only concerns the `git grep
<commit-ish>` case. The other cases already iterate thought the index, so
there is no O(log(N)) extra complexity).

 builtin/grep.c                   | 29 ++++++++---
 t/t7011-skip-worktree-reading.sh |  9 ----
 t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 15 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Comments

Elijah Newren March 24, 2020, 7:15 a.m. UTC | #1
Hi Matheus,

On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and report all matches
> found outside this subset, which kind of goes in the oposity direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

Wahoo!  This is great.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)
>
> This is also what some users reported[1] they would want as the default
> behavior.
>
> Note: for `git grep $REVISION`, we will choose to honor the sparsity
> patterns only when $REVISION is a commit-ish object. The reason is that,

Makes sense.

> for a tree, we don't know whether it represents the root of a
> repository or a subtree. So we wouldn't be able to correctly match it
> against the sparsity patterns. E.g. suppose we have a repository with
> these two sparsity rules: "/*" and "!/a"; and the following structure:
>
> /
> | - a (file)
> | - d (dir)
>     | - a (file)
>
> If `git grep $REVISION` were to honor the sparsity patterns for every
> object type, when grepping the /d tree, we would wrongly ignore the /d/a
> file. This happens because we wouldn't know it resides in /d and
> therefore it would wrongly match the pattern "!/a". Furthermore, for a
> search in a blob object, we wouldn't even have a path to check the
> patterns against. So, let's ignore the sparsity patterns when grepping
> non-commit-ish objects (tags to commits should be fine).
>
> Finally, the old behavior is still desirable for some use cases. So the
> next patch will add an option to allow restoring it when needed.
>
> [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Something I'm not entirely sure in this patch is how we implement the
> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> is treated in the grep_tree() function). Currently, the patch looks for
> an index entry that matches the path, and then checks its skip_worktree

As you discuss below, checking the index is both wrong _and_ costly.
You should use the sparsity patterns; Stolee did a lot of work to make
those correspond to simple hashes you could check to determine whether
to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
mode but the non-cone sparsity patterns were a performance nightmare
waiting to rear its ugly head.  We should just try to encourage
everyone to move to cone mode, or accept the slowness they get without
it.

> bit. But this operation is perfomed in O(log(N)); N being the number of
> index entries. If there are many entries (and no so many sparsity
> patterns), maybe a better approach would be to try matching the path
> directly against the sparsity patterns. This would be O(M) in the number
> of patterns, and it could be done, in builtin/grep.c, with a function
> like the following:
>
> static struct pattern_list sparsity_patterns;
> static int sparsity_patterns_initialized = 0;
> static enum pattern_match_result path_matches_sparsity_patterns(
>                                         const char *path, int pathlen,
>                                         const char *basename,
>                                         struct repository *repo)
> {
>         int dtype = DT_UNKNOWN;
>
>         if (!sparsity_patterns_initialized) {
>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>                 int ret;
>
>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>                                                      &sparsity_patterns, NULL);
>                 free(sparse_file);
>
>                 if (ret < 0)
>                         die(_("failed to load sparse-checkout patterns"));
>                 sparsity_patterns_initialized = 1;
>         }
>
>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>                                          &sparsity_patterns, repo->index);
> }
>
> Also, if I understand correctly, the index doesn't hold paths to dirs,
> right? So even if a complete dir is excluded from sparse checkout, we
> still have to check all its subentries, only to discover that they
> should all be skipped from the search. However, if we were to check
> against the sparsity patterns directly (e.g. with the function above),
> we could skip such directories together with all their entries.
>
> Oh, and there is also the case of a commit whose tree paths are not in
> the index (maybe manually created objects?). For such commits, with the
> index lookup approach, we would have to fall back on ignoring the
> sparsity rules. I'm not sure if that would be OK, though.
>
> Any thoughts on these two approaches (looking up the skip_worktree bit
> in the index or directly matching against sparsity patterns), will be
> highly appreciated. (Note that it only concerns the `git grep
> <commit-ish>` case. The other cases already iterate thought the index, so
> there is no O(log(N)) extra complexity).
>
>  builtin/grep.c                   | 29 ++++++++---
>  t/t7011-skip-worktree-reading.sh |  9 ----
>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+), 15 deletions(-)
>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 99e2685090..52ec72a036 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>                       const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr);
> +                    int from_commit);

I'm not familiar with grep.c and have to admit I don't know what
"check_attr" means.  Slightly surprised to see you replace it, but
maybe reading the rest will explain...

>
>  static int grep_submodule(struct grep_opt *opt,
>                           const struct pathspec *pathspec,
> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
> +
> +               if (ce_skip_worktree(ce))
> +                       continue;
> +

Looks good for the case where we are grepping through what's cached.

>                 strbuf_setlen(&name, name_base_len);
>                 strbuf_addstr(&name, ce->name);
>
> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID) ||
> -                           ce_skip_worktree(ce)) {
> +                       if (cached || (ce->ce_flags & CE_VALID)) {

I had the same change when I was trying to hack something like this
patch into place but only handled the worktree case before realized it
was a bit bigger job.

>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr)
> +                    int from_commit)
>  {
>         struct repository *repo = opt->repo;
>         int hit = 0;
> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> +       if (from_commit && repo_read_index(repo) < 0)
> +               die(_("index file corrupt"));
> +

As above, I don't think we should need to read the index.  We should
compare to sparsity patterns, which in the important case (cone mode)
simplifies to a hash lookup as we walk directories.

>         while (tree_entry(tree, &entry)) {
>                 int te_len = tree_entry_len(&entry);
>
> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> +               if (from_commit) {
> +                       int pos = index_name_pos(repo->index,
> +                                                base->buf + tn_len,
> +                                                base->len - tn_len);
> +                       if (pos >= 0 &&
> +                           ce_skip_worktree(repo->index->cache[pos])) {
> +                               strbuf_setlen(base, old_baselen);
> +                               continue;
> +                       }
> +               }
> +
>                 if (S_ISREG(entry.mode)) {
>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> -                                        check_attr ? base->buf + tn_len : NULL);
> +                                       from_commit ? base->buf + tn_len : NULL);

Sadly, this doesn't help me understand check_attr or from_commit.
Could you clue me in a bit?

>                 } else if (S_ISDIR(entry.mode)) {
>                         enum object_type type;
>                         struct tree_desc sub;
> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                         strbuf_addch(base, '/');
>                         init_tree_desc(&sub, data, size);
>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -                                        check_attr);
> +                                        from_commit);

Same.

>                         free(data);
>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>         test -z "$(git ls-files -m)"
>  '
>
> -test_expect_success 'grep with skip-worktree file' '
> -       git update-index --no-skip-worktree 1 &&
> -       echo test > 1 &&
> -       git update-index 1 &&
> -       git update-index --skip-worktree 1 &&
> -       rm 1 &&
> -       test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>         setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..fccf44e829
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,88 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates the following dir structure:
> +.
> +| - a
> +| - b
> +| - dir
> +    | - c
> +
> +Only "a" should be present due to the sparse checkout patterns:
> +"/*", "!/b" and "!/dir".
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +       echo "text" >a &&
> +       echo "text" >b &&
> +       mkdir dir &&
> +       echo "text" >dir/c &&
> +       git add a b dir &&
> +       git commit -m "initial commit" &&
> +       git tag -am t-commit t-commit HEAD &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       git tag -am t-tree t-tree $tree &&
> +       cat >.git/info/sparse-checkout <<-EOF &&
> +       /*
> +       !/b
> +       !/dir
> +       EOF
> +       git sparse-checkout init &&

Using `git sparse-checkout init` but then manually writing to
.git/info/sparse-checkout?  Seems like it'd make more sense to use
`git sparse-checkout set` than writing the patterns directly yourself.
Also, would prefer to have the examples use cone mode (even if you
have to add subdirectories), as it makes the testcase a bit easier to
read and more performant, though neither is a big deal.

> +       test_path_is_missing b &&
> +       test_path_is_missing dir &&
> +       test_path_is_file a
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       EOF
> +       cat >expect_t-commit <<-EOF &&
> +       t-commit:a:text
> +       EOF
> +       git grep "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep "text" t-commit >actual_t-commit &&
> +       test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '

I think the test is fine but the title seems misleading.  "outside"
and "inside" aren't defined because <tree-ish> isn't known to be
rooted, meaning we have no way to apply the sparsity patterns.  So
perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

> +       commit=$(git rev-parse HEAD) &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       cat >expect_tree <<-EOF &&
> +       $tree:a:text
> +       $tree:b:text
> +       $tree:dir/c:text
> +       EOF
> +       cat >expect_t-tree <<-EOF &&
> +       t-tree:a:text
> +       t-tree:b:text
> +       t-tree:dir/c:text
> +       EOF
> +       git grep "text" $tree >actual_tree &&
> +       test_cmp expect_tree actual_tree &&
> +       git grep "text" t-tree >actual_t-tree &&
> +       test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_done
> --
> 2.25.1
Derrick Stolee March 24, 2020, 3:12 p.m. UTC | #2
On 3/24/2020 3:15 AM, Elijah Newren wrote:
> Hi Matheus,
> 
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> One of the main uses for a sparse checkout is to allow users to focus on
>> the subset of files in a repository in which they are interested. But
>> git-grep currently ignores the sparsity patterns and report all matches
>> found outside this subset, which kind of goes in the oposity direction.
>> Let's fix that, making it honor the sparsity boundaries for every
>> grepping case:
>>
>> - git grep in worktree
>> - git grep --cached
>> - git grep $REVISION
> 
> Wahoo!  This is great.

I am also excited. Also thrilled to see the option to get the old
behavior in the next patch.

>> Something I'm not entirely sure in this patch is how we implement the
>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>> is treated in the grep_tree() function). Currently, the patch looks for
>> an index entry that matches the path, and then checks its skip_worktree
> 
> As you discuss below, checking the index is both wrong _and_ costly.

I'm not sure why checking the index is _wrong_, but I agree about the
performance cost.

> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.
> 
>> bit. But this operation is perfomed in O(log(N)); N being the number of
>> index entries. If there are many entries (and no so many sparsity
>> patterns), maybe a better approach would be to try matching the path
>> directly against the sparsity patterns. This would be O(M) in the number
>> of patterns, and it could be done, in builtin/grep.c, with a function
>> like the following:
>>
>> static struct pattern_list sparsity_patterns;
>> static int sparsity_patterns_initialized = 0;
>> static enum pattern_match_result path_matches_sparsity_patterns(
>>                                         const char *path, int pathlen,
>>                                         const char *basename,
>>                                         struct repository *repo)
>> {
>>         int dtype = DT_UNKNOWN;
>>
>>         if (!sparsity_patterns_initialized) {
>>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>>                 int ret;
>>
>>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>>                                                      &sparsity_patterns, NULL);
>>                 free(sparse_file);
>>
>>                 if (ret < 0)
>>                         die(_("failed to load sparse-checkout patterns"));
>>                 sparsity_patterns_initialized = 1;
>>         }
>>
>>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>>                                          &sparsity_patterns, repo->index);
>> }
>>
>> Also, if I understand correctly, the index doesn't hold paths to dirs,
>> right? So even if a complete dir is excluded from sparse checkout, we
>> still have to check all its subentries, only to discover that they
>> should all be skipped from the search. However, if we were to check
>> against the sparsity patterns directly (e.g. with the function above),
>> we could skip such directories together with all their entries.

When in cone mode, we can check if a directory is one of these three
modes:

1. Completely contained in the cone (recursive match)
2. Completely outside the cone
3. Neither. Keep matching subdirectories. (parent match)

The clear_ce_flags() code in dir.c includes the matching algorithms
for this. Hopefully you can re-use a lot of it. You may need to extract
some methods to use them from the grep code.

>> Oh, and there is also the case of a commit whose tree paths are not in
>> the index (maybe manually created objects?). For such commits, with the
>> index lookup approach, we would have to fall back on ignoring the
>> sparsity rules. I'm not sure if that would be OK, though.
>>
>> Any thoughts on these two approaches (looking up the skip_worktree bit
>> in the index or directly matching against sparsity patterns), will be
>> highly appreciated. (Note that it only concerns the `git grep
>> <commit-ish>` case. The other cases already iterate thought the index, so
>> there is no O(log(N)) extra complexity).
>>
>>  builtin/grep.c                   | 29 ++++++++---
>>  t/t7011-skip-worktree-reading.sh |  9 ----
>>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>>  3 files changed, 111 insertions(+), 15 deletions(-)
>>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index 99e2685090..52ec72a036 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>>                       const struct pathspec *pathspec, int cached);
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr);
>> +                    int from_commit);
> 
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means.  Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
> 
>>
>>  static int grep_submodule(struct grep_opt *opt,
>>                           const struct pathspec *pathspec,
>> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>>
>>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>>                 const struct cache_entry *ce = repo->index->cache[nr];
>> +
>> +               if (ce_skip_worktree(ce))
>> +                       continue;
>> +
> 
> Looks good for the case where we are grepping through what's cached.
> 
>>                 strbuf_setlen(&name, name_base_len);
>>                 strbuf_addstr(&name, ce->name);
>>
>> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>>                          * cache entry are identical, even if worktree file has
>>                          * been modified, so use cache version instead
>>                          */
>> -                       if (cached || (ce->ce_flags & CE_VALID) ||
>> -                           ce_skip_worktree(ce)) {
>> +                       if (cached || (ce->ce_flags & CE_VALID)) {
> 
> I had the same change when I was trying to hack something like this
> patch into place but only handled the worktree case before realized it
> was a bit bigger job.
> 
>>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>>                                         continue;
>>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
>> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>>
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr)
>> +                    int from_commit)
>>  {
>>         struct repository *repo = opt->repo;
>>         int hit = 0;
>> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                 name_base_len = name.len;
>>         }
>>
>> +       if (from_commit && repo_read_index(repo) < 0)
>> +               die(_("index file corrupt"));
>> +
> 
> As above, I don't think we should need to read the index.  We should
> compare to sparsity patterns, which in the important case (cone mode)
> simplifies to a hash lookup as we walk directories.
> 
>>         while (tree_entry(tree, &entry)) {
>>                 int te_len = tree_entry_len(&entry);
>>
>> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>
>>                 strbuf_add(base, entry.path, te_len);
>>
>> +               if (from_commit) {
>> +                       int pos = index_name_pos(repo->index,
>> +                                                base->buf + tn_len,
>> +                                                base->len - tn_len);
>> +                       if (pos >= 0 &&
>> +                           ce_skip_worktree(repo->index->cache[pos])) {
>> +                               strbuf_setlen(base, old_baselen);
>> +                               continue;
>> +                       }
>> +               }
>> +
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
> 
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Yeah, Elijah and I know the sparse-checkout code quite well, but are
unfamiliar with grep. Let's all expand our knowledge!

>>                 } else if (S_ISDIR(entry.mode)) {
>>                         enum object_type type;
>>                         struct tree_desc sub;
>> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                         strbuf_addch(base, '/');
>>                         init_tree_desc(&sub, data, size);
>>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
>> -                                        check_attr);
>> +                                        from_commit);
> 
> Same.
> 
>>                         free(data);
>>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
>> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
>> index 37525cae3a..26852586ac 100755
>> --- a/t/t7011-skip-worktree-reading.sh
>> +++ b/t/t7011-skip-worktree-reading.sh
>> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>>         test -z "$(git ls-files -m)"
>>  '
>>
>> -test_expect_success 'grep with skip-worktree file' '
>> -       git update-index --no-skip-worktree 1 &&
>> -       echo test > 1 &&
>> -       git update-index 1 &&
>> -       git update-index --skip-worktree 1 &&
>> -       rm 1 &&
>> -       test "$(git grep --no-ext-grep test)" = "1:test"
>> -'
>> -
>>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>>         setup_absent &&
>> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
>> new file mode 100755
>> index 0000000000..fccf44e829
>> --- /dev/null
>> +++ b/t/t7817-grep-sparse-checkout.sh
>> @@ -0,0 +1,88 @@
>> +#!/bin/sh
>> +
>> +test_description='grep in sparse checkout
>> +
>> +This test creates the following dir structure:
>> +.
>> +| - a
>> +| - b
>> +| - dir
>> +    | - c
>> +
>> +Only "a" should be present due to the sparse checkout patterns:
>> +"/*", "!/b" and "!/dir".
>> +'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup' '
>> +       echo "text" >a &&
>> +       echo "text" >b &&
>> +       mkdir dir &&
>> +       echo "text" >dir/c &&
>> +       git add a b dir &&
>> +       git commit -m "initial commit" &&
>> +       git tag -am t-commit t-commit HEAD &&
>> +       tree=$(git rev-parse HEAD^{tree}) &&
>> +       git tag -am t-tree t-tree $tree &&
>> +       cat >.git/info/sparse-checkout <<-EOF &&
>> +       /*
>> +       !/b
>> +       !/dir
>> +       EOF
>> +       git sparse-checkout init &&
> 
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

I agree that we should use the builtin so your test script is less
brittle to potential back-end changes to sparse-checkout (none planned).

I do recommend having at least one test with non-cone mode patterns,
especially if you are checking the pattern-matching yourself instead of
relying on the index.

Thanks,
-Stolee
Elijah Newren March 24, 2020, 4:16 p.m. UTC | #3
On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> > Hi Matheus,
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
...
> >> Something I'm not entirely sure in this patch is how we implement the
> >> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> >> is treated in the grep_tree() function). Currently, the patch looks for
> >> an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
>
> I'm not sure why checking the index is _wrong_, but I agree about the
> performance cost.

Let's say there are two directories, dir1 and dir2.  Over time, there
have existed a total of six files:
   dir1/{a,b,c}
   dir2/{d,e,f}
At the current time, there are only four files in the index:
   dir1/{a,b}
   dir2/{d,e}
And the user has done a `git sparse-checkout set dir2` and then at
some point later run `git grep OTHERCOMMIT foobar`.  What happens?

Well, since we're in a sparse checkout, we should only search the
relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
to figure out the "relevant paths" using the index.  We can tell that
dir1/a and dir2/a are marked as SKIP_WORKTREE so we don't search them.
dir1/c is untracked -- what do we do with it?  Include it?  Exclude
it?  Carrying on with the other files, dir2/d and dir2/e are tracked
and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
we do with it?  Include it?  Exclude it?

We're left without the necessary information to tell whether we should
search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
decision we make is going to be wrong for one of the two paths.

If we instead do not attempt to consult the index (which corresponds
to a version close to HEAD) in order to ask questions about the
completely different OTHERCOMMIT, but instead use the sparsity
patterns to query whether those files/directories are interesting,
then we get the right answer.  The index can only be consulted for the
right answer in the case of --cached; in all other cases (including
OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
we could also use the sparsity patterns in the case of --cached, it's
just that for that one particular case consulting the index will also
give the right answer.
Derrick Stolee March 24, 2020, 5:02 p.m. UTC | #4
On 3/24/2020 12:16 PM, Elijah Newren wrote:
> On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 3/24/2020 3:15 AM, Elijah Newren wrote:
>>> Hi Matheus,
>>>
>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> ...
>>>> Something I'm not entirely sure in this patch is how we implement the
>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>> an index entry that matches the path, and then checks its skip_worktree
>>>
>>> As you discuss below, checking the index is both wrong _and_ costly.
>>
>> I'm not sure why checking the index is _wrong_, but I agree about the
>> performance cost.
> 
> Let's say there are two directories, dir1 and dir2.  Over time, there
> have existed a total of six files:
>    dir1/{a,b,c}
>    dir2/{d,e,f}
> At the current time, there are only four files in the index:
>    dir1/{a,b}
>    dir2/{d,e}
> And the user has done a `git sparse-checkout set dir2` and then at
> some point later run `git grep OTHERCOMMIT foobar`.  What happens?
> 
> Well, since we're in a sparse checkout, we should only search the
> relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
> to figure out the "relevant paths" using the index.  We can tell that
> dir1/a and dir2/a are marked as SKIP_WORKTREE so we don't search them.
> dir1/c is untracked -- what do we do with it?  Include it?  Exclude
> it?  Carrying on with the other files, dir2/d and dir2/e are tracked
> and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
> we do with it?  Include it?  Exclude it?
> 
> We're left without the necessary information to tell whether we should
> search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
> decision we make is going to be wrong for one of the two paths.
> 
> If we instead do not attempt to consult the index (which corresponds
> to a version close to HEAD) in order to ask questions about the
> completely different OTHERCOMMIT, but instead use the sparsity
> patterns to query whether those files/directories are interesting,
> then we get the right answer.  The index can only be consulted for the
> right answer in the case of --cached; in all other cases (including
> OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
> we could also use the sparsity patterns in the case of --cached, it's
> just that for that one particular case consulting the index will also
> give the right answer.

Thanks! This helps a lot.

-Stolee
Matheus Tavares March 24, 2020, 10:55 p.m. UTC | #5
On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > Something I'm not entirely sure in this patch is how we implement the
> > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > is treated in the grep_tree() function). Currently, the patch looks for
> > an index entry that matches the path, and then checks its skip_worktree
>
> As you discuss below, checking the index is both wrong _and_ costly.
> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.

OK, makes sense. And your reply to Stolee, later in this thread, made
it clearer for me why checking the index is not only costly but also
wrong. Thanks for the great explanation! I will use the sparsity
patterns directly, in the next iteration.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 99e2685090..52ec72a036 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
> >                       const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                    int check_attr);
> > +                    int from_commit);
>
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means. Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
...
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
>
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Sure! The grep machinery can optionally look the .gitattributes file,
to see if a given path has a "diff" attribute assigned to it. This
attribute points to a diff driver in .gitconfig, which can specify
many things, such as whether the path should be treated as a binary or
not. The "check_attr" flag passed to grep_tree() tells the grep
machinery if it should perform this attribute lookup for the paths in
the given tree.

I decided to replace it with "from_commit" because the only times we
want an attribute lookup when grepping a tree, is when it comes from a
commit. I.e., when the tree is the root. (The reasoning goes in the
same lines as for why we only check sparsity patterns in git-grep for
commit-ish objects: we cannot check pattern matching for trees which
we are not sure to be rooted). Since "knowing if the tree is a root or
not" is useful in grep_tree() for both sparsity checks and attribute
checks, I thought we could use a single "from_commit" variable instead
of "check_attr" and "check_sparsity", which would always have matching
values. But on second thought, I could maybe rename the variable to
something as "is_root_tree" or add a comment explaining the usage of
"from_commit".

(I'm not a big fan of "is_root_tree", thought, because we could give a
root tree to grep_tree() but not really know it.)

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..fccf44e829
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
...
> > +test_expect_success 'setup' '
> > +       echo "text" >a &&
> > +       echo "text" >b &&
> > +       mkdir dir &&
> > +       echo "text" >dir/c &&
> > +       git add a b dir &&
> > +       git commit -m "initial commit" &&
> > +       git tag -am t-commit t-commit HEAD &&
> > +       tree=$(git rev-parse HEAD^{tree}) &&
> > +       git tag -am t-tree t-tree $tree &&
> > +       cat >.git/info/sparse-checkout <<-EOF &&
> > +       /*
> > +       !/b
> > +       !/dir
> > +       EOF
> > +       git sparse-checkout init &&
>
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

OK, I will make use of the builtin here. I will also use the cone mode
(and leave one test without it, as Stolee suggested later in this
thread).

> > +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>
> I think the test is fine but the title seems misleading.  "outside"
> and "inside" aren't defined because <tree-ish> isn't known to be
> rooted, meaning we have no way to apply the sparsity patterns.  So
> perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

Right! "should ignore sparsity patterns" is a much better name, thanks.

Thanks a lot for the thoughtful review and comments!
Matheus Tavares March 24, 2020, 11:01 p.m. UTC | #6
On Tue, Mar 24, 2020 at 12:12 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> Also, if I understand correctly, the index doesn't hold paths to dirs,
> >> right? So even if a complete dir is excluded from sparse checkout, we
> >> still have to check all its subentries, only to discover that they
> >> should all be skipped from the search. However, if we were to check
> >> against the sparsity patterns directly (e.g. with the function above),
> >> we could skip such directories together with all their entries.
>
> When in cone mode, we can check if a directory is one of these three
> modes:
>
> 1. Completely contained in the cone (recursive match)
> 2. Completely outside the cone
> 3. Neither. Keep matching subdirectories. (parent match)
>
> The clear_ce_flags() code in dir.c includes the matching algorithms
> for this. Hopefully you can re-use a lot of it. You may need to extract
> some methods to use them from the grep code.

Thanks for the pointer! I will take a look at the code in dir.c.

> >> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> >> new file mode 100755
> >> index 0000000000..fccf44e829
...
> >> +       cat >.git/info/sparse-checkout <<-EOF &&
> >> +       /*
> >> +       !/b
> >> +       !/dir
> >> +       EOF
> >> +       git sparse-checkout init &&
> >
> > Using `git sparse-checkout init` but then manually writing to
> > .git/info/sparse-checkout?  Seems like it'd make more sense to use
> > `git sparse-checkout set` than writing the patterns directly yourself.
> > Also, would prefer to have the examples use cone mode (even if you
> > have to add subdirectories), as it makes the testcase a bit easier to
> > read and more performant, though neither is a big deal.
>
> I agree that we should use the builtin so your test script is less
> brittle to potential back-end changes to sparse-checkout (none planned).

Makes sense!

> I do recommend having at least one test with non-cone mode patterns,
> especially if you are checking the pattern-matching yourself instead of
> relying on the index.

OK, I will leave at least one test with non-cone patterns then. Thanks
for the comments!
Matheus Tavares April 21, 2020, 2:10 a.m. UTC | #7
Hi, Elijah, Stolee and others

On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> > >
> > > Something I'm not entirely sure in this patch is how we implement the
> > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > > is treated in the grep_tree() function). Currently, the patch looks for
> > > an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
> > You should use the sparsity patterns; Stolee did a lot of work to make
> > those correspond to simple hashes you could check to determine whether
> > to even walk into a subdirectory.
[...]
> OK, makes sense.

I've been working on the file skipping mechanism using the sparsity
patterns directly. But I'm uncertain about some implementation
details. So I wanted to share my current plan with you, to get some
feedback before going deeper.

The first idea was to load the sparsity patterns a priori and pass
them to grep_tree(), which recursively greps the entries of a given
tree object. If --recurse-submodules is given, however, we would also
need to load each surepo's sparse-checkout file on the fly (as the
subrepos are lazily initialized in grep_tree()'s call chain). That's
not a problem on its own. But in the most naive implementation, this
means unnecessarily re-loading the sparse-checkout files of the
submodules for each tree given to git-grep (as grep_tree() is called
separately for each one of them).

So my next idea was to implement a cache, mapping 'struct repository's
to 'struct pattern_list'. Well, not 'struct repository' itself, but
repo->gitdir. This way we could load each file once, store the pattern
list, and quickly retrieve the one that affect the repository
currently being grepped, whether it is a submodule or not. But, is
gitidir unique per repository? If not, could we use
repo_git_path(repo, "info/sparse-checkout") as the key?

I already have a prototype implementation of the last idea (using
repo_git_path()). But I wanted to make sure, does this seem like a
good path? Or should we avoid the work of having this hashmap here and
do something else, as adding a 'struct pattern_list' to 'struct
repository', directly?

Thanks,
Matheus
Elijah Newren April 21, 2020, 3:08 a.m. UTC | #8
On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Elijah, Stolee and others
>
> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > > >
> > > > Something I'm not entirely sure in this patch is how we implement the
> > > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > > > is treated in the grep_tree() function). Currently, the patch looks for
> > > > an index entry that matches the path, and then checks its skip_worktree
> > >
> > > As you discuss below, checking the index is both wrong _and_ costly.
> > > You should use the sparsity patterns; Stolee did a lot of work to make
> > > those correspond to simple hashes you could check to determine whether
> > > to even walk into a subdirectory.
> [...]
> > OK, makes sense.
>
> I've been working on the file skipping mechanism using the sparsity
> patterns directly. But I'm uncertain about some implementation
> details. So I wanted to share my current plan with you, to get some
> feedback before going deeper.
>
> The first idea was to load the sparsity patterns a priori and pass
> them to grep_tree(), which recursively greps the entries of a given
> tree object. If --recurse-submodules is given, however, we would also
> need to load each surepo's sparse-checkout file on the fly (as the
> subrepos are lazily initialized in grep_tree()'s call chain). That's
> not a problem on its own. But in the most naive implementation, this
> means unnecessarily re-loading the sparse-checkout files of the
> submodules for each tree given to git-grep (as grep_tree() is called
> separately for each one of them).

Wouldn't loading the sparse-checkout files be fast compared to
grepping a submodule for matching strings?  And not just fast, but
essentially in the noise and hard to even measure?  I have a hard time
fathoming parsing the sparse-checkout file for a submodule somehow
appreciably affecting the cost of grepping through that submodule.  If
the submodule has a huge number of sparse-checkout patterns, that'll
be because it has a ginormous number of files and grepping through
them all would be way, way longer.  If the submodule only has a few
files, then the sparse-checkout file is only going to be a few lines
at most.

Also, from another angle: I think the original intent of submodules
was an alternate form of sparse-checkout/partial-clone, letting people
deal with just their piece of the repo.  As such, do we really even
expect people to use sparse-checkouts and submodules together, let
alone use them very heavily together?  Sure, someone will use them,
but I have a hard time imagining the scale of use of both features
heavily enough for this to matter, especially since it also requires
specifying multiple trees to grep (which is slightly unusual) in
addition to the combination of these other features before your
optimization here could kick in and be worthwhile.

I'd be very tempted to just implement the most naive implementation
and maybe leave a TODO note in the code for some future person to come
along and optimize if it really matters, but I'd like to see numbers
before we spend the development and maintenance effort on it because
I'm having a hard time imagining any scale where it could matter.

> So my next idea was to implement a cache, mapping 'struct repository's
> to 'struct pattern_list'. Well, not 'struct repository' itself, but
> repo->gitdir. This way we could load each file once, store the pattern
> list, and quickly retrieve the one that affect the repository
> currently being grepped, whether it is a submodule or not. But, is
> gitidir unique per repository? If not, could we use
> repo_git_path(repo, "info/sparse-checkout") as the key?
>
> I already have a prototype implementation of the last idea (using
> repo_git_path()). But I wanted to make sure, does this seem like a
> good path? Or should we avoid the work of having this hashmap here and
> do something else, as adding a 'struct pattern_list' to 'struct
> repository', directly?

Honestly, it sounds a bit like premature optimization to me.  Sorry if
that's disappointing since you've apparently already put some effort
into this, and it sounds like you're on a good track for optimizing
this if it were necessary, but I'm just having a hard time figuring
out whether it'd really help and be worth the code complexity.
Derrick Stolee April 22, 2020, 12:08 p.m. UTC | #9
On 4/20/2020 11:08 PM, Elijah Newren wrote:
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
>>
>> Hi, Elijah, Stolee and others
>>
>> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
>> <matheus.bernardino@usp.br> wrote:
>>>
>>> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
>>>>
>>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
>>>> <matheus.bernardino@usp.br> wrote:
>>>>>
>>>>> Something I'm not entirely sure in this patch is how we implement the
>>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>>> an index entry that matches the path, and then checks its skip_worktree
>>>>
>>>> As you discuss below, checking the index is both wrong _and_ costly.
>>>> You should use the sparsity patterns; Stolee did a lot of work to make
>>>> those correspond to simple hashes you could check to determine whether
>>>> to even walk into a subdirectory.
>> [...]
>>> OK, makes sense.
>>
>> I've been working on the file skipping mechanism using the sparsity
>> patterns directly. But I'm uncertain about some implementation
>> details. So I wanted to share my current plan with you, to get some
>> feedback before going deeper.
>>
>> The first idea was to load the sparsity patterns a priori and pass
>> them to grep_tree(), which recursively greps the entries of a given
>> tree object. If --recurse-submodules is given, however, we would also
>> need to load each surepo's sparse-checkout file on the fly (as the
>> subrepos are lazily initialized in grep_tree()'s call chain). That's
>> not a problem on its own. But in the most naive implementation, this
>> means unnecessarily re-loading the sparse-checkout files of the
>> submodules for each tree given to git-grep (as grep_tree() is called
>> separately for each one of them).
> 
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings?  And not just fast, but
> essentially in the noise and hard to even measure?  I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule.  If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files and grepping through
> them all would be way, way longer.  If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.
> 
> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo.  As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together?  Sure, someone will use them,
> but I have a hard time imagining the scale of use of both features
> heavily enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
> 
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it because
> I'm having a hard time imagining any scale where it could matter.
> 
>> So my next idea was to implement a cache, mapping 'struct repository's
>> to 'struct pattern_list'. Well, not 'struct repository' itself, but
>> repo->gitdir. This way we could load each file once, store the pattern
>> list, and quickly retrieve the one that affect the repository
>> currently being grepped, whether it is a submodule or not. But, is
>> gitidir unique per repository? If not, could we use
>> repo_git_path(repo, "info/sparse-checkout") as the key?
>>
>> I already have a prototype implementation of the last idea (using
>> repo_git_path()). But I wanted to make sure, does this seem like a
>> good path? Or should we avoid the work of having this hashmap here and
>> do something else, as adding a 'struct pattern_list' to 'struct
>> repository', directly?
> 
> Honestly, it sounds a bit like premature optimization to me.  Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

My initial thought was to use a stack or queue. It depend on how
git-grep treats submodules. Imagine directories A, B, C where B is a
submodule.

If results from 'B' are output between results from 'A' and 'C', then
use a stack to "push" the latest sparse-checkout patterns as you
deepen into a submodule, then "pop" the patterns as you leave a
submodule.

If results from 'B' are output after results from 'C', then you could
possibly use a queue instead. I find this unlikely, and it would
behave strangely for nested submodules.

Since "struct pattern_list" has most of the information you require,
then it should not be challenging to create a list of them.

Hopefully that provides some ideas.

Thanks,
-Stolee
Matheus Tavares April 23, 2020, 6:09 a.m. UTC | #10
On Tue, Apr 21, 2020 at 12:08 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > I've been working on the file skipping mechanism using the sparsity
> > patterns directly. But I'm uncertain about some implementation
> > details. So I wanted to share my current plan with you, to get some
> > feedback before going deeper.
> >
> > The first idea was to load the sparsity patterns a priori and pass
> > them to grep_tree(), which recursively greps the entries of a given
> > tree object. If --recurse-submodules is given, however, we would also
> > need to load each surepo's sparse-checkout file on the fly (as the
> > subrepos are lazily initialized in grep_tree()'s call chain). That's
> > not a problem on its own. But in the most naive implementation, this
> > means unnecessarily re-loading the sparse-checkout files of the
> > submodules for each tree given to git-grep (as grep_tree() is called
> > separately for each one of them).
>
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings?  And not just fast, but
> essentially in the noise and hard to even measure?  I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule.  If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files and grepping through
> them all would be way, way longer.  If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.

Yeah, makes sense.

> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo.  As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together?  Sure, someone will use them,
> but I have a hard time imagining the scale of use of both features
> heavily enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
>
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it because
> I'm having a hard time imagining any scale where it could matter.

You're right. I guess I got a little too excited about the
optimizations possibilities and neglected the fact that they might not
even be needed here.

Just to take a look at some numbers, I prototyped the naive
implementation and downloaded a testing repository[1] containing 8
submodules (or 14 counting the nested ones). For each of the
non-nested submodules, I added its .gitignore rules to the
sparse-checkout file (of course this doesn't make any sense for a
real-world usage, but I just wanted to populate the file with a large
quantity of valid rules, to test the parsing time). I also added the
rule '/*'. Then I ran:

git-grep --threads=1 --recurse-submodules -E "config_[a-z]+\(" $(cat /tmp/trees)

Where /tmp/trees contained about 120 trees in the said repository
(again, a probably unreal case, for testing purposes only). Then,
measuring the time spent only inside the function I created to load a
sparse-checkout file for a given 'struct repository', I got to the
following numbers:

Number of calls: 1531 (makes sense: ~120 trees and 14 submodules)
Percentage over the total time: 0.015%
Number of matches: 300897

And using 8 threads, I got the same numbers except for the percentage,
which was a little higher: 0.05%.

So, indeed, the overhead of re-loading the files is too insignificant.
And my cache idea was a premature and unnecessary optimization.

> > So my next idea was to implement a cache, mapping 'struct repository's
> > to 'struct pattern_list'. Well, not 'struct repository' itself, but
> > repo->gitdir. This way we could load each file once, store the pattern
> > list, and quickly retrieve the one that affect the repository
> > currently being grepped, whether it is a submodule or not. But, is
> > gitidir unique per repository? If not, could we use
> > repo_git_path(repo, "info/sparse-checkout") as the key?
> >
> > I already have a prototype implementation of the last idea (using
> > repo_git_path()). But I wanted to make sure, does this seem like a
> > good path? Or should we avoid the work of having this hashmap here and
> > do something else, as adding a 'struct pattern_list' to 'struct
> > repository', directly?
>
> Honestly, it sounds a bit like premature optimization to me.  Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

No problem! I'm glad to have this feedback now, while I'm still
working on v2  :) Now I can focus on what's really relevant. So thanks
again!

[1]: https://github.com/surevine/Metre
diff mbox series

Patch

diff --git a/builtin/grep.c b/builtin/grep.c
index 99e2685090..52ec72a036 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -388,7 +388,7 @@  static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int from_commit);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -486,6 +486,10 @@  static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -498,8 +502,7 @@  static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -532,7 +535,7 @@  static int grep_cache(struct grep_opt *opt,
 
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+		     int from_commit)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -546,6 +549,9 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
+	if (from_commit && repo_read_index(repo) < 0)
+		die(_("index file corrupt"));
+
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
 
@@ -564,9 +570,20 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (from_commit) {
+			int pos = index_name_pos(repo->index,
+						 base->buf + tn_len,
+						 base->len - tn_len);
+			if (pos >= 0 &&
+			    ce_skip_worktree(repo->index->cache[pos])) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
-					 check_attr ? base->buf + tn_len : NULL);
+					from_commit ? base->buf + tn_len : NULL);
 		} else if (S_ISDIR(entry.mode)) {
 			enum object_type type;
 			struct tree_desc sub;
@@ -581,7 +598,7 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
 			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+					 from_commit);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@  test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..fccf44e829
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,88 @@ 
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates the following dir structure:
+.
+| - a
+| - b
+| - dir
+    | - c
+
+Only "a" should be present due to the sparse checkout patterns:
+"/*", "!/b" and "!/dir".
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+	git add a b dir &&
+	git commit -m "initial commit" &&
+	git tag -am t-commit t-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am t-tree t-tree $tree &&
+	cat >.git/info/sparse-checkout <<-EOF &&
+	/*
+	!/b
+	!/dir
+	EOF
+	git sparse-checkout init &&
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_file a
+'
+
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_t-tree <<-EOF &&
+	t-tree:a:text
+	t-tree:b:text
+	t-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" t-tree >actual_t-tree &&
+	test_cmp expect_t-tree actual_t-tree
+'
+
+test_done