diff mbox series

[v3,4/5] grep: honor sparse checkout patterns

Message ID 5ddac818185d316cd5ad9576105f0e4e695b9456.1590627264.git.matheus.bernardino@usp.br (mailing list archive)
State New, archived
Headers show
Series grep: honor sparse checkout and add option to ignore it | expand

Commit Message

Matheus Tavares May 28, 2020, 1:13 a.m. UTC
One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and report all matches
found outside this subset, which kind of goes in the opposite direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case where this is relevant:

- git grep in worktree
- git grep --cached
- git grep $REVISION

For the worktree case, we will not grep paths that have the
SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
manually created after `git sparse-checkout init`). But the next patch
will add an option to do so. (See 'Note' below.)

For `git grep $REVISION`, we will choose to honor the sparsity patterns
only when $REVISION is a commit-ish object. The reason is that, for a
tree, we don't know whether it represents the root of a repository or a
subtree. So we wouldn't be able to correctly match it against the
sparsity patterns. E.g. suppose we have a repository with these two
sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects.

Note: The behavior introduced in this patch is what some users have
reported[1] that they would like by default. But the old behavior is
still desirable for some use cases. Therefore, the next patch will add
an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 builtin/grep.c                   | 125 ++++++++++++++++++++--
 t/t7011-skip-worktree-reading.sh |   9 --
 t/t7817-grep-sparse-checkout.sh  | 174 +++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 17 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Comments

Elijah Newren May 30, 2020, 3:48 p.m. UTC | #1
On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and report all matches
> found outside this subset, which kind of goes in the opposite direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case where this is relevant:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION
>
> For the worktree case, we will not grep paths that have the
> SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
> manually created after `git sparse-checkout init`).

This seems worded to rise alarm bells and make users suspect
implementation difficulties or regrets rather than desired behavior.
It would be much better to word this simply as something like:

    For the worktree and cached cases, we iterate over paths without
the SKIP_WORKTREE bit set, and limit our searches to these paths.

> But the next patch
> will add an option to do so. (See 'Note' below.)

Because this was in the same paragraph as the previous sentence, it
made it sound like you were going to provide a special worktree-only
option to search outside the SKIP_WORKTREE bits.  Very confusing.  I
think I'd combine this sentence into the very first paragraph of the
commit message and massage the wording a little.  Perhaps something
like:  ...goes in the opposite direction.  There are some usecases for
ignoring the sparsity patterns and the next commit will add an option
to obtain this behavior, but here we start by making grep honor the
sparsity boundaries for every...

> For `git grep $REVISION`, we will choose to honor the sparsity patterns
> only when $REVISION is a commit-ish object. The reason is that, for a
> tree, we don't know whether it represents the root of a repository or a
> subtree. So we wouldn't be able to correctly match it against the
> sparsity patterns. E.g. suppose we have a repository with these two
> sparsity rules: "/*" and "!/a"; and the following structure:
>
> /
> | - a (file)
> | - d (dir)
>     | - a (file)
>
> If `git grep $REVISION` were to honor the sparsity patterns for every
> object type, when grepping the /d tree, we would wrongly ignore the /d/a
> file. This happens because we wouldn't know it resides in /d and
> therefore it would wrongly match the pattern "!/a". Furthermore, for a
> search in a blob object, we wouldn't even have a path to check the
> patterns against. So, let's ignore the sparsity patterns when grepping
> non-commit-ish objects.

This doesn't actually make it clear how you handle $REVISION which is
a commit object; you focus so much on when $REVISION is just a tree
and contrasting that case that you omit the behavior for the case of
interest.  Also, $REVISION to my mind implies "commit"; if you want to
imply that a commit or tree could be used, you'd use $TREE or
$TREE_ISH or something else.  I think it'd make sense to cover all
three relevant cases into a single paragraph (thus combining with the
previous paragraph), and then add a second paragraph about the $TREE
case that streamlines the last two pargraphs above.  So, perhaps we
can your paragraphs from "For the worktree case, we will not grep
paths..." all the way to "So, let's ignore the sparsity patterns when
grepping non-commit-ish objects" (after first moving the comment about
adding an option in the next commit to some other area of the commit
message, as dicussed above) with something like the following:


    For the worktree and cached cases, we iterate over paths without
the SKIP_WORKTREE bit set, and limit our searches to these paths.  For
the $REVISION case, we limit the paths we search to those that match
the sparsity patterns.  (We do not check the SKIP_WORKTREE bit for the
$REVISION case, because $REVISION may contain paths that do not exist
in HEAD and thus for which we have no SKIP_WORKTREE bit to consult.
The sparsity patterns tell us how the SKIP_WORKTREE bit would be set
if we were to check out $REVISION, so we consult those.  Also, we
don't use the sparsity paths with the worktree or cached cases, both
because we have a bit we can check directly and more efficiently, and
because unmerged entries from a merge or a rebase could cause more
files to temporarily be present than the sparsity patterns would
normally select.)

    Note that there is a special case here: `git grep $TREE`.  In this
case we cannot know whether $TREE corresponds to the root of the
repository or some sub-tree, and thus there is no way for us to know
which sparsity patterns, if any, apply.  So the $TREE case will not
use sparsity patterns or any SKIP_WORKTREE bits and will instead
always search all files within the $TREE.

>
> Note: The behavior introduced in this patch is what some users have
> reported[1] that they would like by default. But the old behavior is
> still desirable for some use cases. Therefore, the next patch will add
> an option to allow restoring it when needed.

This paragraph duplicates information you already stated previously.
It's much clearer than what you stated before, but if you just reword
the previous comments and combine them into the first paragraph, then
we can drop this final note.


> [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  builtin/grep.c                   | 125 ++++++++++++++++++++--
>  t/t7011-skip-worktree-reading.sh |   9 --
>  t/t7817-grep-sparse-checkout.sh  | 174 +++++++++++++++++++++++++++++++
>  3 files changed, 291 insertions(+), 17 deletions(-)
>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index a5056f395a..11e33b8aee 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
>                       const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr);
> +                    int is_root_tree);

So you modified the forward declaration of grep_tree()...

>
>  static int grep_submodule(struct grep_opt *opt,
>                           const struct pathspec *pathspec,
> @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
>
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
> +
> +               if (ce_skip_worktree(ce))
> +                       continue;
> +
>                 strbuf_setlen(&name, name_base_len);
>                 strbuf_addstr(&name, ce->name);
>
> @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID) ||
> -                           ce_skip_worktree(ce)) {
> +                       if (cached || (ce->ce_flags & CE_VALID)) {
>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -552,9 +555,76 @@ static int grep_cache(struct grep_opt *opt,
>         return hit;
>  }
>
> -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> -                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr)

Here the patch splits your handling of grep_tree()...

> +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> +{
> +       struct pattern_list *patterns;
> +       char *sparse_file;
> +       int sparse_config, cone_config;
> +
> +       if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> +           !sparse_config) {
> +               return NULL;
> +       }

Is core_apply_sparse_checkout not initialized for some reason?

> +
> +       sparse_file = repo_git_path(repo, "info/sparse-checkout");
> +       patterns = xcalloc(1, sizeof(*patterns));
> +
> +       if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> +               cone_config = 0;
> +       patterns->use_cone_patterns = cone_config;

Similarly, is core_sparse_checkout_cone not intialized?

> +
> +       if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> +               if (file_exists(sparse_file)) {
> +                       warning(_("failed to load sparse-checkout file: '%s'"),
> +                               sparse_file);
> +               }
> +               free(sparse_file);
> +               free(patterns);
> +               return NULL;
> +       }
> +
> +       free(sparse_file);
> +       return patterns;
> +}
> +
> +static int in_sparse_checkout(struct strbuf *path, int prefix_len,

This function name in_sparse_checkout() makes me think "Does the
working tree represent a sparse checkout?"  Perhaps we could rename it
to path_matches_sparsity_patterns() ?

Also, is there a reason we can't use dir.c's
path_matches_pattern_list() here?  How does this new function differ
in behavior from that function?

> +                             unsigned int entry_mode,
> +                             struct index_state *istate,
> +                             struct pattern_list *sparsity,
> +                             enum pattern_match_result parent_match,
> +                             enum pattern_match_result *match)
> +{
> +       int dtype = DT_UNKNOWN;
> +       int is_dir = S_ISDIR(entry_mode);
> +
> +       if (parent_match == MATCHED_RECURSIVE) {
> +               *match = parent_match;
> +               return 1;
> +       }
> +
> +       if (is_dir && !is_dir_sep(path->buf[path->len - 1]))
> +               strbuf_addch(path, '/');
> +
> +       *match = path_matches_pattern_list(path->buf, path->len,
> +                                          path->buf + prefix_len, &dtype,
> +                                          sparsity, istate);
> +       if (*match == UNDECIDED)
> +               *match = parent_match;
> +
> +       if (is_dir)
> +               strbuf_trim_trailing_dir_sep(path);
> +
> +       if (*match == NOT_MATCHED &&
> +               (!is_dir || (is_dir && sparsity->use_cone_patterns)))
> +            return 0;
> +
> +       return 1;
> +}
> +
> +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,

I thought this meant you were renaming grep_tree() to do_grep_tree()
but it's a new function that happens to have most of the logic from
the old grep_tree() and which the new grep_tree() will call to do most
its work.

> +                       struct tree_desc *tree, struct strbuf *base, int tn_len,
> +                       int check_attr, struct pattern_list *sparsity,
> +                       enum pattern_match_result default_sparsity_match)
>  {
>         struct repository *repo = opt->repo;
>         int hit = 0;
> @@ -570,6 +640,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>         while (tree_entry(tree, &entry)) {
>                 int te_len = tree_entry_len(&entry);
> +               enum pattern_match_result sparsity_match = 0;
>
>                 if (match != all_entries_interesting) {
>                         strbuf_addstr(&name, base->buf + tn_len);
> @@ -586,6 +657,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> +               if (sparsity) {
> +                       struct strbuf path = STRBUF_INIT;
> +                       strbuf_addstr(&path, base->buf + tn_len);
> +
> +                       if (!in_sparse_checkout(&path, old_baselen - tn_len,
> +                                               entry.mode, repo->index,
> +                                               sparsity, default_sparsity_match,
> +                                               &sparsity_match)) {
> +                               strbuf_setlen(base, old_baselen);
> +                               continue;
> +                       }
> +               }
> +
>                 if (S_ISREG(entry.mode)) {
>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>                                          check_attr ? base->buf + tn_len : NULL);
> @@ -602,8 +686,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                         strbuf_addch(base, '/');
>                         init_tree_desc(&sub, data, size);
> -                       hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -                                        check_attr);
> +                       hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> +                                           check_attr, sparsity, sparsity_match);
>                         free(data);
>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
> @@ -621,6 +705,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>         return hit;
>  }
>
> +/*
> + * Note: sparsity patterns and paths' attributes will only be considered if
> + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> + * matching on paths.)
> + */
> +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> +                    int is_root_tree)
> +{
> +       struct pattern_list *patterns = NULL;
> +       int ret;
> +
> +       if (is_root_tree)
> +               patterns = get_sparsity_patterns(opt->repo);
> +
> +       ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> +                          patterns, 0);
> +
> +       if (patterns) {
> +               clear_pattern_list(patterns);
> +               free(patterns);
> +       }
> +       return ret;
> +}

Once I figured out grep_tree() was just becoming a wrapper around
do_grep_tree(), the patch made more sense; I should have scrolled to
the end quicker.  ;-)

> +
>  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
>                        struct object *obj, const char *name, const char *path)
>  {
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>         test -z "$(git ls-files -m)"
>  '
>
> -test_expect_success 'grep with skip-worktree file' '
> -       git update-index --no-skip-worktree 1 &&
> -       echo test > 1 &&
> -       git update-index 1 &&
> -       git update-index --skip-worktree 1 &&
> -       rm 1 &&
> -       test "$(git grep --no-ext-grep test)" = "1:test"
> -'

Yaay!

> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>         setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..ce080cf572
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,174 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates a repo with the following structure:
> +
> +.
> +|-- a
> +|-- b
> +|-- dir
> +|   `-- c
> +|-- sub
> +|   |-- A
> +|   |   `-- a
> +|   `-- B
> +|       `-- b
> +`-- sub2
> +    `-- a
> +
> +Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode

Maybe "Where the outer repository has non-code mode..."?  The use of
'.' threw me for a bit.

> +sparsity patterns and sub2 is a submodule that is excluded by the superproject
> +sparsity patterns. The resulting sparse checkout should leave the following
> +structure on the working tree:

s/on the/in the/?

> +
> +.
> +|-- a
> +|-- sub
> +|   `-- B
> +|       `-- b
> +`-- sub2
> +    `-- a
> +
> +But note that sub2 should have the SKIP_WORKTREE bit set.
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +       echo "text" >a &&
> +       echo "text" >b &&
> +       mkdir dir &&
> +       echo "text" >dir/c &&
> +
> +       git init sub &&
> +       (
> +               cd sub &&
> +               mkdir A B &&
> +               echo "text" >A/a &&
> +               echo "text" >B/b &&
> +               git add A B &&
> +               git commit -m sub &&
> +               git sparse-checkout init --cone &&
> +               git sparse-checkout set B
> +       ) &&
> +
> +       git init sub2 &&
> +       (
> +               cd sub2 &&
> +               echo "text" >a &&
> +               git add a &&
> +               git commit -m sub2
> +       ) &&
> +
> +       git submodule add ./sub &&
> +       git submodule add ./sub2 &&
> +       git add a b dir &&
> +       git commit -m super &&
> +       git sparse-checkout init --no-cone &&
> +       git sparse-checkout set "/*" "!b" "!/*/" "sub" &&
> +
> +       git tag -am tag-to-commit tag-to-commit HEAD &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       git tag -am tag-to-tree tag-to-tree $tree &&
> +
> +       test_path_is_missing b &&
> +       test_path_is_missing dir &&
> +       test_path_is_missing sub/A &&
> +       test_path_is_file a &&
> +       test_path_is_file sub/B/b &&
> +       test_path_is_file sub2/a
> +'
> +
> +# The test bellow checks a special case: the sparsity patterns exclude '/b'

s/bellow/below/

> +# and sparse checkout is enable, but the path exists on the working tree (e.g.

s/enable/enabled/, s/on/in/

> +# manually created after `git sparse-checkout init`). In this case, grep should
> +# skip it.
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       echo "new-text" >b &&
> +       test_when_finished "rm b" &&
> +       git grep "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       EOF
> +       cat >expect_tag-to-commit <<-EOF &&
> +       tag-to-commit:a:text
> +       EOF
> +       git grep "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep "text" tag-to-commit >actual_tag-to-commit &&
> +       test_cmp expect_tag-to-commit actual_tag-to-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> +       commit=$(git rev-parse HEAD) &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       cat >expect_tree <<-EOF &&
> +       $tree:a:text
> +       $tree:b:text
> +       $tree:dir/c:text
> +       EOF
> +       cat >expect_tag-to-tree <<-EOF &&
> +       tag-to-tree:a:text
> +       tag-to-tree:b:text
> +       tag-to-tree:dir/c:text
> +       EOF
> +       git grep "text" $tree >actual_tree &&
> +       test_cmp expect_tree actual_tree &&
> +       git grep "text" tag-to-tree >actual_tag-to-tree &&
> +       test_cmp expect_tag-to-tree actual_tag-to-tree
> +'
> +
> +# Note that sub2/ is present in the worktree but it is excluded by the sparsity
> +# patterns, so grep should not recurse into it.
> +test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/B/b:text
> +       EOF
> +       git grep --recurse-submodules "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/B/b:text
> +       EOF
> +       git grep --recurse-submodules --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       $commit:sub/B/b:text
> +       EOF
> +       cat >expect_tag-to-commit <<-EOF &&
> +       tag-to-commit:a:text
> +       tag-to-commit:sub/B/b:text
> +       EOF
> +       git grep --recurse-submodules "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep --recurse-submodules "text" tag-to-commit >actual_tag-to-commit &&
> +       test_cmp expect_tag-to-commit actual_tag-to-commit
> +'
> +
> +test_done
> --
> 2.26.2

Looks good.  Do we want to add a testcase where a file is unmerged and
present in the working copy despite not matching the sparsity patterns
(i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?
Matheus Tavares June 1, 2020, 4:44 a.m. UTC | #2
On Sat, May 30, 2020 at 12:48 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and report all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case where this is relevant:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
> >
> > For the worktree case, we will not grep paths that have the
> > SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
> > manually created after `git sparse-checkout init`).
>
> This seems worded to rise alarm bells and make users suspect
> implementation difficulties or regrets rather than desired behavior.
> It would be much better to word this simply as something like:
>
>     For the worktree and cached cases, we iterate over paths without
> the SKIP_WORKTREE bit set, and limit our searches to these paths.
>
> > But the next patch
> > will add an option to do so. (See 'Note' below.)
>
> Because this was in the same paragraph as the previous sentence, it
> made it sound like you were going to provide a special worktree-only
> option to search outside the SKIP_WORKTREE bits.  Very confusing.  I
> think I'd combine this sentence into the very first paragraph of the
> commit message and massage the wording a little.  Perhaps something
> like:  ...goes in the opposite direction.  There are some usecases for
> ignoring the sparsity patterns and the next commit will add an option
> to obtain this behavior, but here we start by making grep honor the
> sparsity boundaries for every...
>
> > For `git grep $REVISION`, we will choose to honor the sparsity patterns
> > only when $REVISION is a commit-ish object. The reason is that, for a
> > tree, we don't know whether it represents the root of a repository or a
> > subtree. So we wouldn't be able to correctly match it against the
> > sparsity patterns. E.g. suppose we have a repository with these two
> > sparsity rules: "/*" and "!/a"; and the following structure:
> >
> > /
> > | - a (file)
> > | - d (dir)
> >     | - a (file)
> >
> > If `git grep $REVISION` were to honor the sparsity patterns for every
> > object type, when grepping the /d tree, we would wrongly ignore the /d/a
> > file. This happens because we wouldn't know it resides in /d and
> > therefore it would wrongly match the pattern "!/a". Furthermore, for a
> > search in a blob object, we wouldn't even have a path to check the
> > patterns against. So, let's ignore the sparsity patterns when grepping
> > non-commit-ish objects.
>
> This doesn't actually make it clear how you handle $REVISION which is
> a commit object; you focus so much on when $REVISION is just a tree
> and contrasting that case that you omit the behavior for the case of
> interest.  Also, $REVISION to my mind implies "commit"; if you want to
> imply that a commit or tree could be used, you'd use $TREE or
> $TREE_ISH or something else.  I think it'd make sense to cover all
> three relevant cases into a single paragraph (thus combining with the
> previous paragraph), and then add a second paragraph about the $TREE
> case that streamlines the last two pargraphs above.  So, perhaps we
> can your paragraphs from "For the worktree case, we will not grep
> paths..." all the way to "So, let's ignore the sparsity patterns when
> grepping non-commit-ish objects" (after first moving the comment about
> adding an option in the next commit to some other area of the commit
> message, as dicussed above) with something like the following:
>
>     For the worktree and cached cases, we iterate over paths without
> the SKIP_WORKTREE bit set, and limit our searches to these paths.  For
> the $REVISION case, we limit the paths we search to those that match
> the sparsity patterns.  (We do not check the SKIP_WORKTREE bit for the
> $REVISION case, because $REVISION may contain paths that do not exist
> in HEAD and thus for which we have no SKIP_WORKTREE bit to consult.
> The sparsity patterns tell us how the SKIP_WORKTREE bit would be set
> if we were to check out $REVISION, so we consult those.  Also, we
> don't use the sparsity paths with the worktree or cached cases, both
> because we have a bit we can check directly and more efficiently, and
> because unmerged entries from a merge or a rebase could cause more
> files to temporarily be present than the sparsity patterns would
> normally select.)
>
>     Note that there is a special case here: `git grep $TREE`.  In this
> case we cannot know whether $TREE corresponds to the root of the
> repository or some sub-tree, and thus there is no way for us to know
> which sparsity patterns, if any, apply.  So the $TREE case will not
> use sparsity patterns or any SKIP_WORKTREE bits and will instead
> always search all files within the $TREE.
>
> >
> > Note: The behavior introduced in this patch is what some users have
> > reported[1] that they would like by default. But the old behavior is
> > still desirable for some use cases. Therefore, the next patch will add
> > an option to allow restoring it when needed.
>
> This paragraph duplicates information you already stated previously.
> It's much clearer than what you stated before, but if you just reword
> the previous comments and combine them into the first paragraph, then
> we can drop this final note.

All great suggestions! I will amend the commit message using your
proposed paragraphs. Thanks!

> >
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > ---
> >  builtin/grep.c                   | 125 ++++++++++++++++++++--
> >  t/t7011-skip-worktree-reading.sh |   9 --
> >  t/t7817-grep-sparse-checkout.sh  | 174 +++++++++++++++++++++++++++++++
[...]
> > +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> > +{
> > +       struct pattern_list *patterns;
> > +       char *sparse_file;
> > +       int sparse_config, cone_config;
> > +
> > +       if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> > +           !sparse_config) {
> > +               return NULL;
> > +       }
>
> Is core_apply_sparse_checkout not initialized for some reason?

It should be already initialized, yes. But we cannot rely on that as
`repo` might be a submodule, and core_apply_sparse_checkout holds the
configuration's value for `the_repository`.

> > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
>
> This function name in_sparse_checkout() makes me think "Does the
> working tree represent a sparse checkout?"  Perhaps we could rename it
> to path_matches_sparsity_patterns() ?
>
> Also, is there a reason we can't use dir.c's
> path_matches_pattern_list() here?

Oh, we do use path_matches_pattern_list() inside:

> > +       *match = path_matches_pattern_list(path->buf, path->len,
> > +                                          path->buf + prefix_len, &dtype,
> > +                                          sparsity, istate);
> > +       if (*match == UNDECIDED)
> > +               *match = parent_match;

> How does this new function differ
> in behavior from that function?

The idea of in_sparse_checkout() is to implement a logic closer to
what we have in clear_ce_flags_1(). Here, it is effectively a wrapper
to path_matches_pattern_list() but with some extra logic to decide
whether grep should search in a given entry, based on its mode, the
match result against the sparsity patterns, and the result from the
parent dir.

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..ce080cf572
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
> > @@ -0,0 +1,174 @@
> > +#!/bin/sh
> > +
> > +test_description='grep in sparse checkout
> > +
> > +This test creates a repo with the following structure:
> > +
> > +.
> > +|-- a
> > +|-- b
> > +|-- dir
> > +|   `-- c
> > +|-- sub
> > +|   |-- A
> > +|   |   `-- a
> > +|   `-- B
> > +|       `-- b
> > +`-- sub2
> > +    `-- a
> > +
> > +Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
>
> Maybe "Where the outer repository has non-code mode..."?  The use of
> '.' threw me for a bit.

Sure!

> > +test_done
> > --
> > 2.26.2
>
> Looks good.  Do we want to add a testcase where a file is unmerged and
> present in the working copy despite not matching the sparsity patterns
> (i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?

Sure, I can add that. But after a quick test here, it seems that the
unmerged path doesn't have the SKIP_WORKTREE bit set. Is this how it
should be?
Elijah Newren June 3, 2020, 2:38 a.m. UTC | #3
On Sun, May 31, 2020 at 9:44 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Sat, May 30, 2020 at 12:48 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> > >
[...]
> > > +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> > > +{
> > > +       struct pattern_list *patterns;
> > > +       char *sparse_file;
> > > +       int sparse_config, cone_config;
> > > +
> > > +       if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> > > +           !sparse_config) {
> > > +               return NULL;
> > > +       }
> >
> > Is core_apply_sparse_checkout not initialized for some reason?
>
> It should be already initialized, yes. But we cannot rely on that as
> `repo` might be a submodule, and core_apply_sparse_checkout holds the
> configuration's value for `the_repository`.

Ah, gotcha.  Thanks for straightening me out.

> > > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> >
> > This function name in_sparse_checkout() makes me think "Does the
> > working tree represent a sparse checkout?"  Perhaps we could rename it
> > to path_matches_sparsity_patterns() ?
> >
> > Also, is there a reason we can't use dir.c's
> > path_matches_pattern_list() here?
>
> Oh, we do use path_matches_pattern_list() inside:
>
> > > +       *match = path_matches_pattern_list(path->buf, path->len,
> > > +                                          path->buf + prefix_len, &dtype,
> > > +                                          sparsity, istate);
> > > +       if (*match == UNDECIDED)
> > > +               *match = parent_match;
>
> > How does this new function differ
> > in behavior from that function?
>
> The idea of in_sparse_checkout() is to implement a logic closer to
> what we have in clear_ce_flags_1(). Here, it is effectively a wrapper
> to path_matches_pattern_list() but with some extra logic to decide
> whether grep should search in a given entry, based on its mode, the
> match result against the sparsity patterns, and the result from the
> parent dir.

I've had this response and one to 5/5 sitting in my draft folder for
over a day because I was hoping to go read clear_ce_flags_1() and find
out what it is.  I have no idea, so your answer doesn't answer my
question... ;-)  I'll try to find some time and maybe respond further
after I do.

>
> > > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > > new file mode 100755
> > > index 0000000000..ce080cf572
> > > --- /dev/null
> > > +++ b/t/t7817-grep-sparse-checkout.sh
> > > @@ -0,0 +1,174 @@
> > > +#!/bin/sh
> > > +
> > > +test_description='grep in sparse checkout
> > > +
> > > +This test creates a repo with the following structure:
> > > +
> > > +.
> > > +|-- a
> > > +|-- b
> > > +|-- dir
> > > +|   `-- c
> > > +|-- sub
> > > +|   |-- A
> > > +|   |   `-- a
> > > +|   `-- B
> > > +|       `-- b
> > > +`-- sub2
> > > +    `-- a
> > > +
> > > +Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
> >
> > Maybe "Where the outer repository has non-code mode..."?  The use of
> > '.' threw me for a bit.
>
> Sure!
>
> > > +test_done
> > > --
> > > 2.26.2
> >
> > Looks good.  Do we want to add a testcase where a file is unmerged and
> > present in the working copy despite not matching the sparsity patterns
> > (i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?
>
> Sure, I can add that. But after a quick test here, it seems that the
> unmerged path doesn't have the SKIP_WORKTREE bit set. Is this how it
> should be?

Right, the merge machinery will clear the SKIP_WORKTREE bit when it
writes out conflicted files.  Also, any future 'git sparse-checkout'
commands will see the unmerged entry and avoid marking it as
SKIP_WORKTREE even though it doesn't match the sparsity patterns.
Thus, grep doesn't have to do any special checking for whether the
files are merged or not, and from your current implementation probably
doesn't look like a special case at all -- you just check the
SKIP_WORKTREE bit.

However, I think the test still has value because the test enforces
that other areas of the code (merge, sparse-checkout) don't break the
invariants that grep is relying on.  (I could see someone making a
merge change that keeps the SKIP_WORKTREE bit accidentally set even
though it writes the file out to the working tree, for example.)
Sure, merge has some tests around that, so it might be viewed as
slightly duplicative, but I see it as an interesting edge case that
exercises whether the SKIP_WORKTREE bit should really be set and since
grep expects a certain invariant about how that is handled, the
testcase will help make sure our expectations aren't violated.
Matheus Tavares June 10, 2020, 5:08 p.m. UTC | #4
On Tue, Jun 2, 2020 at 11:38 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Sun, May 31, 2020 at 9:44 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Sat, May 30, 2020 at 12:48 PM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > > >
> > > > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> > >
> > > This function name in_sparse_checkout() makes me think "Does the
> > > working tree represent a sparse checkout?"  Perhaps we could rename it
> > > to path_matches_sparsity_patterns() ?
> > >
> > > Also, is there a reason we can't use dir.c's
> > > path_matches_pattern_list() here?
> >
> > Oh, we do use path_matches_pattern_list() inside:
> >
> > > > +       *match = path_matches_pattern_list(path->buf, path->len,
> > > > +                                          path->buf + prefix_len, &dtype,
> > > > +                                          sparsity, istate);
> > > > +       if (*match == UNDECIDED)
> > > > +               *match = parent_match;
> >
> > > How does this new function differ
> > > in behavior from that function?
> >
> > The idea of in_sparse_checkout() is to implement a logic closer to
> > what we have in clear_ce_flags_1(). Here, it is effectively a wrapper
> > to path_matches_pattern_list() but with some extra logic to decide
> > whether grep should search in a given entry, based on its mode, the
> > match result against the sparsity patterns, and the result from the
> > parent dir.
>
> I've had this response and one to 5/5 sitting in my draft folder for
> over a day because I was hoping to go read clear_ce_flags_1() and find
> out what it is.  I have no idea, so your answer doesn't answer my
> question... ;-)  I'll try to find some time and maybe respond further
> after I do.

Oops, sorry for the incomplete answer. clear_ce_flags() recursively
traverses the index entries, unsetting the bits specified in a given
mask when the entry matches a given pattern list. (It is used in
unpack-trees.c:mark_new_skip_worktree() to clear the
CE_NEW_SKIP_WORKTREE bit for the matched entries.) clear_ce_flags()
does use path_matches_pattern_list() but it also has to check some
additional rules for cone mode (as there might be recursive
matches/non-matches). These rules are implemented in
clear_ce_flags_dir().

in_sparse_checkout() is a small wrapper around
path_matches_pattern_list() with (1) the additional checks for cone
mode, similar to what clear_ce_flags_dir() implements, and (2) the
usage of the parent dir's match_result when undecided about the
current path. We could just implement this directly in grep_tree(),
but I thought that isolating this logic into its own static function
would make grep_tree() more readable.

> > > > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > > > new file mode 100755
> > > > index 0000000000..ce080cf572
> > > > --- /dev/null
> > > > +++ b/t/t7817-grep-sparse-checkout.sh
> > >
> > > Looks good.  Do we want to add a testcase where a file is unmerged and
> > > present in the working copy despite not matching the sparsity patterns
> > > (i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?
> >
> > Sure, I can add that. But after a quick test here, it seems that the
> > unmerged path doesn't have the SKIP_WORKTREE bit set. Is this how it
> > should be?
>
> Right, the merge machinery will clear the SKIP_WORKTREE bit when it
> writes out conflicted files.  Also, any future 'git sparse-checkout'
> commands will see the unmerged entry and avoid marking it as
> SKIP_WORKTREE even though it doesn't match the sparsity patterns.
> Thus, grep doesn't have to do any special checking for whether the
> files are merged or not, and from your current implementation probably
> doesn't look like a special case at all -- you just check the
> SKIP_WORKTREE bit.
>
> However, I think the test still has value because the test enforces
> that other areas of the code (merge, sparse-checkout) don't break the
> invariants that grep is relying on.  (I could see someone making a
> merge change that keeps the SKIP_WORKTREE bit accidentally set even
> though it writes the file out to the working tree, for example.)
> Sure, merge has some tests around that, so it might be viewed as
> slightly duplicative, but I see it as an interesting edge case that
> exercises whether the SKIP_WORKTREE bit should really be set and since
> grep expects a certain invariant about how that is handled, the
> testcase will help make sure our expectations aren't violated.

OK. I will add this test for the next version.
diff mbox series

Patch

diff --git a/builtin/grep.c b/builtin/grep.c
index a5056f395a..11e33b8aee 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -410,7 +410,7 @@  static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int is_root_tree);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -508,6 +508,10 @@  static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -520,8 +524,7 @@  static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -552,9 +555,76 @@  static int grep_cache(struct grep_opt *opt,
 	return hit;
 }
 
-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+static struct pattern_list *get_sparsity_patterns(struct repository *repo)
+{
+	struct pattern_list *patterns;
+	char *sparse_file;
+	int sparse_config, cone_config;
+
+	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
+	    !sparse_config) {
+		return NULL;
+	}
+
+	sparse_file = repo_git_path(repo, "info/sparse-checkout");
+	patterns = xcalloc(1, sizeof(*patterns));
+
+	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
+		cone_config = 0;
+	patterns->use_cone_patterns = cone_config;
+
+	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
+		if (file_exists(sparse_file)) {
+			warning(_("failed to load sparse-checkout file: '%s'"),
+				sparse_file);
+		}
+		free(sparse_file);
+		free(patterns);
+		return NULL;
+	}
+
+	free(sparse_file);
+	return patterns;
+}
+
+static int in_sparse_checkout(struct strbuf *path, int prefix_len,
+			      unsigned int entry_mode,
+			      struct index_state *istate,
+			      struct pattern_list *sparsity,
+			      enum pattern_match_result parent_match,
+			      enum pattern_match_result *match)
+{
+	int dtype = DT_UNKNOWN;
+	int is_dir = S_ISDIR(entry_mode);
+
+	if (parent_match == MATCHED_RECURSIVE) {
+		*match = parent_match;
+		return 1;
+	}
+
+	if (is_dir && !is_dir_sep(path->buf[path->len - 1]))
+		strbuf_addch(path, '/');
+
+	*match = path_matches_pattern_list(path->buf, path->len,
+					   path->buf + prefix_len, &dtype,
+					   sparsity, istate);
+	if (*match == UNDECIDED)
+		*match = parent_match;
+
+	if (is_dir)
+		strbuf_trim_trailing_dir_sep(path);
+
+	if (*match == NOT_MATCHED &&
+		(!is_dir || (is_dir && sparsity->use_cone_patterns)))
+	     return 0;
+
+	return 1;
+}
+
+static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+			struct tree_desc *tree, struct strbuf *base, int tn_len,
+			int check_attr, struct pattern_list *sparsity,
+			enum pattern_match_result default_sparsity_match)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -570,6 +640,7 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
+		enum pattern_match_result sparsity_match = 0;
 
 		if (match != all_entries_interesting) {
 			strbuf_addstr(&name, base->buf + tn_len);
@@ -586,6 +657,19 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (sparsity) {
+			struct strbuf path = STRBUF_INIT;
+			strbuf_addstr(&path, base->buf + tn_len);
+
+			if (!in_sparse_checkout(&path, old_baselen - tn_len,
+						entry.mode, repo->index,
+						sparsity, default_sparsity_match,
+						&sparsity_match)) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
 					 check_attr ? base->buf + tn_len : NULL);
@@ -602,8 +686,8 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
+					    check_attr, sparsity, sparsity_match);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
@@ -621,6 +705,31 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 	return hit;
 }
 
+/*
+ * Note: sparsity patterns and paths' attributes will only be considered if
+ * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
+ * matching on paths.)
+ */
+static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+		     struct tree_desc *tree, struct strbuf *base, int tn_len,
+		     int is_root_tree)
+{
+	struct pattern_list *patterns = NULL;
+	int ret;
+
+	if (is_root_tree)
+		patterns = get_sparsity_patterns(opt->repo);
+
+	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
+			   patterns, 0);
+
+	if (patterns) {
+		clear_pattern_list(patterns);
+		free(patterns);
+	}
+	return ret;
+}
+
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		       struct object *obj, const char *name, const char *path)
 {
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@  test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..ce080cf572
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,174 @@ 
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates a repo with the following structure:
+
+.
+|-- a
+|-- b
+|-- dir
+|   `-- c
+|-- sub
+|   |-- A
+|   |   `-- a
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
+sparsity patterns and sub2 is a submodule that is excluded by the superproject
+sparsity patterns. The resulting sparse checkout should leave the following
+structure on the working tree:
+
+.
+|-- a
+|-- sub
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+But note that sub2 should have the SKIP_WORKTREE bit set.
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+
+	git init sub &&
+	(
+		cd sub &&
+		mkdir A B &&
+		echo "text" >A/a &&
+		echo "text" >B/b &&
+		git add A B &&
+		git commit -m sub &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set B
+	) &&
+
+	git init sub2 &&
+	(
+		cd sub2 &&
+		echo "text" >a &&
+		git add a &&
+		git commit -m sub2
+	) &&
+
+	git submodule add ./sub &&
+	git submodule add ./sub2 &&
+	git add a b dir &&
+	git commit -m super &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set "/*" "!b" "!/*/" "sub" &&
+
+	git tag -am tag-to-commit tag-to-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am tag-to-tree tag-to-tree $tree &&
+
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_missing sub/A &&
+	test_path_is_file a &&
+	test_path_is_file sub/B/b &&
+	test_path_is_file sub2/a
+'
+
+# The test bellow checks a special case: the sparsity patterns exclude '/b'
+# and sparse checkout is enable, but the path exists on the working tree (e.g.
+# manually created after `git sparse-checkout init`). In this case, grep should
+# skip it.
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	echo "new-text" >b &&
+	test_when_finished "rm b" &&
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_tag-to-tree <<-EOF &&
+	tag-to-tree:a:text
+	tag-to-tree:b:text
+	tag-to-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" tag-to-tree >actual_tag-to-tree &&
+	test_cmp expect_tag-to-tree actual_tag-to-tree
+'
+
+# Note that sub2/ is present in the worktree but it is excluded by the sparsity
+# patterns, so grep should not recurse into it.
+test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:sub/B/b:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	tag-to-commit:sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep --recurse-submodules "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_done