[RFC,v2,3/4] grep: honor sparse checkout patterns

Message ID	e00674c7278b032b826110f33e25a5dee176c7ba.1589058209.git.matheus.bernardino@usp.br (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=RpNG=6Y=vger.kernel.org=git-owner@kernel.org> From: Matheus Tavares <matheus.bernardino@usp.br> To: git@vger.kernel.org Cc: gitster@pobox.com, stolee@gmail.com, newren@gmail.com, jonathantanmy@google.com Subject: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Date: Sat, 9 May 2020 21:41:29 -0300 Message-Id: <e00674c7278b032b826110f33e25a5dee176c7ba.1589058209.git.matheus.bernardino@usp.br> In-Reply-To: <cover.1589058209.git.matheus.bernardino@usp.br> References: <cover.1589058209.git.matheus.bernardino@usp.br> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	grep: honor sparse checkout and add option to ignore it \| expand [RFC,v2,0/4] grep: honor sparse checkout and add option to ignore it [RFC,v2,1/4] doc: grep: unify info on configuration variables [RFC,v2,2/4] config: load the correct config.worktree file [RFC,v2,3/4] grep: honor sparse checkout patterns [RFC,v2,4/4] config: add setting to ignore sparsity patterns in some cmds

Matheus Tavares May 10, 2020, 12:41 a.m. UTC

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and report all matches
found outside this subset, which kind of goes in the opposite direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case:

- git grep in worktree
- git grep --cached
- git grep $REVISION
- git grep --untracked and git grep --no-index (which already respect
  sparse checkout boundaries)

This is also what some users reported[1] they would want as the default
behavior.

Note: for `git grep $REVISION`, we will choose to honor the sparsity
patterns only when $REVISION is a commit-ish object. The reason is that,
for a tree, we don't know whether it represents the root of a
repository or a subtree. So we wouldn't be able to correctly match it
against the sparsity patterns. E.g. suppose we have a repository with
these two sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects (tags to commits should be fine).

Finally, the old behavior may still be desirable for some use cases. So
the next patch will add an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Note: as I mentioned in the cover letter, the tests in this patch now
contain both cone mode and full pattern sparse checkouts. This was done
for two reasons: To test grep's behavior when searching with
--recurse-submodules and having submodules with different pattern sets
than the superproject (which was incorrect in my first implementation).
And to test the direct pattern matching in grep_tree(), using both
modes.

 builtin/grep.c                   | 127 ++++++++++++++++++++++++++--
 t/t7011-skip-worktree-reading.sh |   9 --
 t/t7817-grep-sparse-checkout.sh  | 140 +++++++++++++++++++++++++++++++
 3 files changed, 259 insertions(+), 17 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Junio C Hamano May 11, 2020, 7:35 p.m. UTC | #1

Matheus Tavares <matheus.bernardino@usp.br> writes:

> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and report all matches
> found outside this subset, which kind of goes in the opposite direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

It makes sense for these to be limited within the "sparse" area.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)

I can understand the former; those untracked files are what _could_
be brought into attention by "git add", so limiting to the same
"sparse" area may make sense.

I am not sure about the latter, though, as "--no-index" is an
explicit request to pretend that we are dealing with a random
collection of files, not managed in a git repository.  But perhaps
there is a similar justification like how "--untracked" is
unjustifiable.  I dunno.

> diff --git a/builtin/grep.c b/builtin/grep.c
> index a5056f395a..91ee0b2734 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
>  		      const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> -		     int check_attr);
> +		     int is_root_tree);
>  
>  static int grep_submodule(struct grep_opt *opt,
>  			  const struct pathspec *pathspec,
> @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
>  
>  	for (nr = 0; nr < repo->index->cache_nr; nr++) {
>  		const struct cache_entry *ce = repo->index->cache[nr];
> +
> +		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> +			continue;

Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
path that is excluded by the sparse pattern, should we still recurse
into it?

>  		strbuf_setlen(&name, name_base_len);
>  		strbuf_addstr(&name, ce->name);
>  
> @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
>  			 * cache entry are identical, even if worktree file has
>  			 * been modified, so use cache version instead
>  			 */
> -			if (cached || (ce->ce_flags & CE_VALID) ||
> -			    ce_skip_worktree(ce)) {
> +			if (cached || (ce->ce_flags & CE_VALID)) {
>  				if (ce_stage(ce) || ce_intent_to_add(ce))
>  					continue;
>  				hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
>  	return hit;
>  }
>  
> -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> -		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> -		     int check_attr)
> +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> +{
> +	struct pattern_list *patterns;
> +	char *sparse_file;
> +	int sparse_config, cone_config;
> +
> +	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> +	    !sparse_config) {
> +		return NULL;
> +	}
> +
> +	sparse_file = repo_git_path(repo, "info/sparse-checkout");
> +	patterns = xcalloc(1, sizeof(*patterns));
> +
> +	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> +		cone_config = 0;
> +	patterns->use_cone_patterns = cone_config;
> +
> +	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> +		if (file_exists(sparse_file)) {
> +			warning(_("failed to load sparse-checkout file: '%s'"),
> +				sparse_file);
> +		}
> +		free(sparse_file);
> +		free(patterns);
> +		return NULL;
> +	}
> +
> +	free(sparse_file);
> +	return patterns;
> +}
> +
> +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> +			      unsigned int entry_mode,
> +			      struct index_state *istate,
> +			      struct pattern_list *sparsity,
> +			      enum pattern_match_result parent_match,
> +			      enum pattern_match_result *match)
> +{
> +	int dtype = DT_UNKNOWN;
> +
> +	if (S_ISGITLINK(entry_mode))
> +		return 1;

This is consistent with the "we do not care where a gitlink
appears---submodules are always descended into, regardless of the
sparse definition" decision we saw earlier, I think.  I am not sure
if that is a good design in the first place, though.

> +	if (parent_match == MATCHED_RECURSIVE) {
> +		*match = parent_match;
> +		return 1;
> +	}
> +
> +	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
> +		strbuf_addch(path, '/');
> +
> +	*match = path_matches_pattern_list(path->buf, path->len,
> +					   path->buf + prefix_len, &dtype,
> +					   sparsity, istate);
> +	if (*match == UNDECIDED)
> +		*match = parent_match;
> +
> +	if (S_ISDIR(entry_mode))
> +		strbuf_trim_trailing_dir_sep(path);
> +
> +	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
> +	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
> +		return 0;
> +
> +	return 1;
> +}



> +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +			struct tree_desc *tree, struct strbuf *base, int tn_len,
> +			int check_attr, struct pattern_list *sparsity,
> +			enum pattern_match_result default_sparsity_match)
>  {
>  	struct repository *repo = opt->repo;
>  	int hit = 0;
> @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  	while (tree_entry(tree, &entry)) {
>  		int te_len = tree_entry_len(&entry);
> +		enum pattern_match_result sparsity_match = 0;
>  
>  		if (match != all_entries_interesting) {
>  			strbuf_addstr(&name, base->buf + tn_len);
> @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  		strbuf_add(base, entry.path, te_len);
>  
> +		if (sparsity) {
> +			struct strbuf path = STRBUF_INIT;
> +			strbuf_addstr(&path, base->buf + tn_len);
> +
> +			if (!in_sparse_checkout(&path, old_baselen - tn_len,
> +						entry.mode, repo->index,
> +						sparsity, default_sparsity_match,
> +						&sparsity_match)) {
> +				strbuf_setlen(base, old_baselen);
> +				continue;
> +			}
> +		}

OK.

>  		if (S_ISREG(entry.mode)) {
>  			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>  					 check_attr ? base->buf + tn_len : NULL);
> @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  			strbuf_addch(base, '/');
>  			init_tree_desc(&sub, data, size);
> -			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -					 check_attr);
> +			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> +					    check_attr, sparsity, sparsity_match);
>  			free(data);
>  		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>  			hit |= grep_submodule(opt, pathspec, &entry.oid,
> @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  	return hit;
>  }
>  
> +/*
> + * Note: sparsity patterns and paths' attributes will only be considered if
> + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> + * matching on paths.)
> + */
> +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> +		     int is_root_tree)
> +{
> +	struct pattern_list *patterns = NULL;
> +	int ret;
> +
> +	if (is_root_tree)
> +		patterns = get_sparsity_patterns(opt->repo);
> +
> +	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> +			   patterns, 0);
> +
> +	if (patterns) {
> +		clear_pattern_list(patterns);
> +		free(patterns);
> +	}

OK, it is not like this codepath is driven by "git log" to grep from
top-level tree objects of many commits, so it is OK to grab the
sparsity patterns once before do_grep_tree() and discard it when we
are done.

> +	return ret;
> +}
> +

>  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
>  		       struct object *obj, const char *name, const char *path)
>  {
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>  	test -z "$(git ls-files -m)"
>  '
>  
> -test_expect_success 'grep with skip-worktree file' '
> -	git update-index --no-skip-worktree 1 &&
> -	echo test > 1 &&
> -	git update-index 1 &&
> -	git update-index --skip-worktree 1 &&
> -	rm 1 &&
> -	test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>  	setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..3bd67082eb
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,140 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates a repo with the following structure:
> +
> +.
> +|-- a
> +|-- b
> +|-- dir
> +|   `-- c
> +`-- sub
> +    |-- A
> +    |   `-- a
> +    `-- B
> +	`-- b
> +
> +Where . has non-cone mode sparsity patterns and sub is a submodule with cone
> +mode sparsity patterns. The resulting sparse-checkout should leave the following
> +structure:
> +
> +.
> +|-- a
> +`-- sub
> +    `-- B
> +	`-- b
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	echo "text" >a &&
> +	echo "text" >b &&
> +	mkdir dir &&
> +	echo "text" >dir/c &&
> +
> +	git init sub &&
> +	(
> +		cd sub &&
> +		mkdir A B &&
> +		echo "text" >A/a &&
> +		echo "text" >B/b &&
> +		git add A B &&
> +		git commit -m sub &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set B
> +	) &&
> +
> +	git submodule add ./sub &&
> +	git add a b dir &&
> +	git commit -m super &&
> +	git sparse-checkout init --no-cone &&
> +	git sparse-checkout set "/*" "!b" "!/*/" &&
> +
> +	git tag -am t-commit t-commit HEAD &&
> +	tree=$(git rev-parse HEAD^{tree}) &&
> +	git tag -am t-tree t-tree $tree &&
> +
> +	test_path_is_missing b &&
> +	test_path_is_missing dir &&
> +	test_path_is_missing sub/A &&
> +	test_path_is_file a &&
> +	test_path_is_file sub/B/b
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep --cached "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +	commit=$(git rev-parse HEAD) &&
> +	cat >expect_commit <<-EOF &&
> +	$commit:a:text
> +	EOF
> +	cat >expect_t-commit <<-EOF &&
> +	t-commit:a:text
> +	EOF
> +	git grep "text" $commit >actual_commit &&
> +	test_cmp expect_commit actual_commit &&
> +	git grep "text" t-commit >actual_t-commit &&
> +	test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> +	commit=$(git rev-parse HEAD) &&
> +	tree=$(git rev-parse HEAD^{tree}) &&
> +	cat >expect_tree <<-EOF &&
> +	$tree:a:text
> +	$tree:b:text
> +	$tree:dir/c:text
> +	EOF
> +	cat >expect_t-tree <<-EOF &&
> +	t-tree:a:text
> +	t-tree:b:text
> +	t-tree:dir/c:text
> +	EOF
> +	git grep "text" $tree >actual_tree &&
> +	test_cmp expect_tree actual_tree &&
> +	git grep "text" t-tree >actual_t-tree &&
> +	test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	sub/B/b:text
> +	EOF
> +	git grep --recurse-submodules --cached "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> +	commit=$(git rev-parse HEAD) &&
> +	cat >expect_commit <<-EOF &&
> +	$commit:a:text
> +	$commit:sub/B/b:text
> +	EOF
> +	cat >expect_t-commit <<-EOF &&
> +	t-commit:a:text
> +	t-commit:sub/B/b:text
> +	EOF
> +	git grep --recurse-submodules "text" $commit >actual_commit &&
> +	test_cmp expect_commit actual_commit &&
> +	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
> +	test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_done

Matheus Tavares May 13, 2020, 12:05 a.m. UTC | #2

On Mon, May 11, 2020 at 4:35 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and report all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
>
> It makes sense for these to be limited within the "sparse" area.
>
> > - git grep --untracked and git grep --no-index (which already respect
> >   sparse checkout boundaries)
>
> I can understand the former; those untracked files are what _could_
> be brought into attention by "git add", so limiting to the same
> "sparse" area may make sense.
>
> I am not sure about the latter, though, as "--no-index" is an
> explicit request to pretend that we are dealing with a random
> collection of files, not managed in a git repository.  But perhaps
> there is a similar justification like how "--untracked" is
> unjustifiable.  I dunno.

Yeah, I think there was no need to mention those two cases here. My
intention was to say that, in these cases, we should stick to the
files that are present in the working tree (which should match the
sparsity patterns + untracked {and ignored, in --no-index}), as
opposed to how the worktree grep used to behave until now, falling
back to the cache on files excluded by the sparse checkout.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index a5056f395a..91ee0b2734 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
> >                     const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr);
> > +                  int is_root_tree);
> >
> >  static int grep_submodule(struct grep_opt *opt,
> >                         const struct pathspec *pathspec,
> > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
> >
> >       for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >               const struct cache_entry *ce = repo->index->cache[nr];
> > +
> > +             if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> > +                     continue;
>
> Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
> path that is excluded by the sparse pattern, should we still recurse
> into it?

The idea behind not skipping gitlinks here was to be compliant with
what we have in the working tree. In 4fd683b ("sparse-checkout:
document interactions with submodules"), we decided that, if the
sparse-checkout patterns exclude a submodule, the submodule would
still appear in the working tree. The purpose was to keep these
features (submodules and sparse-checkout) independent. Along the same
lines, I think we should always recurse into initialized submodules in
grep, and then load their own sparsity patterns, to decide what should
be grepped within.

[...]
> > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                  int is_root_tree)
> > +{
> > +     struct pattern_list *patterns = NULL;
> > +     int ret;
> > +
> > +     if (is_root_tree)
> > +             patterns = get_sparsity_patterns(opt->repo);
> > +
> > +     ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> > +                        patterns, 0);
> > +
> > +     if (patterns) {
> > +             clear_pattern_list(patterns);
> > +             free(patterns);
> > +     }
>
> OK, it is not like this codepath is driven by "git log" to grep from
> top-level tree objects of many commits, so it is OK to grab the
> sparsity patterns once before do_grep_tree() and discard it when we
> are done.

Yeah. A possible performance problem here would be when users pass
many trees to git-grep (since we are reloading the pattern lists, from
both the_repository and submodules, for each tree). But, as Elijah
pointed out [1], the cases where this overhead might be somewhat
noticeable should be very rare.

[1]: https://lore.kernel.org/git/CABPp-BGUf-4exGW23xka1twf2D=nFOz1CkD_f-rDX_AGdVEeDA@mail.gmail.com/

Junio C Hamano May 13, 2020, 12:17 a.m. UTC | #3

Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

> The idea behind not skipping gitlinks here was to be compliant with
> what we have in the working tree. In 4fd683b ("sparse-checkout:
> document interactions with submodules"), we decided that, if the
> sparse-checkout patterns exclude a submodule, the submodule would
> still appear in the working tree. The purpose was to keep these
> features (submodules and sparse-checkout) independent. Along the same
> lines, I think we should always recurse into initialized submodules in
> grep, and then load their own sparsity patterns, to decide what should
> be grepped within.

OK.  

I do not necessarily agree with the justification described in
4fd683b (e.g. "would easily cause problems." that is not
substantiated is merely an opinion), but I do agree with you that
the new code in "git grep" we are discussing here does behave in
line with that design.

Thanks.

Elijah Newren May 21, 2020, 7:26 a.m. UTC | #4

On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
>
> > The idea behind not skipping gitlinks here was to be compliant with
> > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > document interactions with submodules"), we decided that, if the
> > sparse-checkout patterns exclude a submodule, the submodule would
> > still appear in the working tree. The purpose was to keep these
> > features (submodules and sparse-checkout) independent. Along the same
> > lines, I think we should always recurse into initialized submodules in

Sorry if I missed it in the code, but do you check whether the
submodule is initialized before descending into it, or do you descend
into it based on it just being a submodule?

> > grep, and then load their own sparsity patterns, to decide what should
> > be grepped within.
>
> OK.
>
> I do not necessarily agree with the justification described in
> 4fd683b (e.g. "would easily cause problems." that is not
> substantiated is merely an opinion), but I do agree with you that
> the new code in "git grep" we are discussing here does behave in
> line with that design.
>
> Thanks.

I'm also a little worried by 4fd683b; are we headed towards a circular
reasoning of some sort?  In particular, sparse-checkout was written
assuming submodules might already be checked out.  I can see how
un-checking-out an existing submodule could raise fears of losing
untracked or ignored files within it, or stuff stored on other
branches, etc.  But that's not the only relevant case.  What if
someone runs:
   git clone --recurse-submodules --sparse=moduleA git.hosting.site:my/repo.git
In such a case, we don't have already checked out submodules.
Obviously, we should clone submodules that are within our sparsity
paths.  But should we automatically clone the submodules outside our
sparsity paths?  The the logic presented in 4fd683b makes this
completely ambiguous.  ("It will appear if it's initialized."  Okay,
but do we initialize it?)

You may say that clone doesn't have a --sparse= flag right now.  So
let me change the example slightly.  What if someone runs
   git checkout --recurse-submodules $otherBranch
and $otherBranch adds a new submodule somewhere deep under a directory
excluded by the sparsity patterns (i.e. deep within a directory we
aren't interested in and don't have checked out).  Should the
submodule be checked out, i.e. should it be initialized?  Commit
4fd683b only says it will appear if it's initialized, but my whole
question is should we initialize it?

Elijah Newren May 21, 2020, 7:36 a.m. UTC | #5

On Mon, May 11, 2020 at 12:35 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and report all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
>
> It makes sense for these to be limited within the "sparse" area.
>
> > - git grep --untracked and git grep --no-index (which already respect
> >   sparse checkout boundaries)
>
> I can understand the former; those untracked files are what _could_
> be brought into attention by "git add", so limiting to the same
> "sparse" area may make sense.
>
> I am not sure about the latter, though, as "--no-index" is an
> explicit request to pretend that we are dealing with a random
> collection of files, not managed in a git repository.  But perhaps
> there is a similar justification like how "--untracked" is
> unjustifiable.  I dunno.

I don't think it makes sense for sparsity patterns to affect either.
sparsity patterns are a way of splitting "tracked" files into two
subsets (those matching the sparsity paths and those that don't).
Therefore, flags that are about searching things that aren't tracked,
clearly don't have anything to do with sparsity patterns.

However, I think this was just a wording issue; in the subsequent
commit Matheus made it clear that he's not modifying the behavior of
grep --untracked or grep --no-index based on the presence or absence
of sparsity patterns.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index a5056f395a..91ee0b2734 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
> >                     const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr);
> > +                  int is_root_tree);
> >
> >  static int grep_submodule(struct grep_opt *opt,
> >                         const struct pathspec *pathspec,
> > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
> >
> >       for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >               const struct cache_entry *ce = repo->index->cache[nr];
> > +
> > +             if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> > +                     continue;
>
> Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
> path that is excluded by the sparse pattern, should we still recurse
> into it?

That bothers me too.

> >               strbuf_setlen(&name, name_base_len);
> >               strbuf_addstr(&name, ce->name);
> >
> > @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
> >                        * cache entry are identical, even if worktree file has
> >                        * been modified, so use cache version instead
> >                        */
> > -                     if (cached || (ce->ce_flags & CE_VALID) ||
> > -                         ce_skip_worktree(ce)) {
> > +                     if (cached || (ce->ce_flags & CE_VALID)) {
> >                               if (ce_stage(ce) || ce_intent_to_add(ce))
> >                                       continue;
> >                               hit |= grep_oid(opt, &ce->oid, name.buf,
> > @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
> >       return hit;
> >  }
> >
> > -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > -                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr)
> > +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> > +{
> > +     struct pattern_list *patterns;
> > +     char *sparse_file;
> > +     int sparse_config, cone_config;
> > +
> > +     if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> > +         !sparse_config) {
> > +             return NULL;
> > +     }
> > +
> > +     sparse_file = repo_git_path(repo, "info/sparse-checkout");
> > +     patterns = xcalloc(1, sizeof(*patterns));
> > +
> > +     if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> > +             cone_config = 0;
> > +     patterns->use_cone_patterns = cone_config;
> > +
> > +     if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> > +             if (file_exists(sparse_file)) {
> > +                     warning(_("failed to load sparse-checkout file: '%s'"),
> > +                             sparse_file);
> > +             }
> > +             free(sparse_file);
> > +             free(patterns);
> > +             return NULL;
> > +     }
> > +
> > +     free(sparse_file);
> > +     return patterns;
> > +}
> > +
> > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> > +                           unsigned int entry_mode,
> > +                           struct index_state *istate,
> > +                           struct pattern_list *sparsity,
> > +                           enum pattern_match_result parent_match,
> > +                           enum pattern_match_result *match)
> > +{
> > +     int dtype = DT_UNKNOWN;
> > +
> > +     if (S_ISGITLINK(entry_mode))
> > +             return 1;
>
> This is consistent with the "we do not care where a gitlink
> appears---submodules are always descended into, regardless of the
> sparse definition" decision we saw earlier, I think.  I am not sure
> if that is a good design in the first place, though.
>
> > +     if (parent_match == MATCHED_RECURSIVE) {
> > +             *match = parent_match;
> > +             return 1;
> > +     }
> > +
> > +     if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
> > +             strbuf_addch(path, '/');
> > +
> > +     *match = path_matches_pattern_list(path->buf, path->len,
> > +                                        path->buf + prefix_len, &dtype,
> > +                                        sparsity, istate);
> > +     if (*match == UNDECIDED)
> > +             *match = parent_match;
> > +
> > +     if (S_ISDIR(entry_mode))
> > +             strbuf_trim_trailing_dir_sep(path);
> > +
> > +     if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
> > +         (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
> > +             return 0;
> > +
> > +     return 1;
> > +}
>
>
>
> > +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                     struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                     int check_attr, struct pattern_list *sparsity,
> > +                     enum pattern_match_result default_sparsity_match)
> >  {
> >       struct repository *repo = opt->repo;
> >       int hit = 0;
> > @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >       while (tree_entry(tree, &entry)) {
> >               int te_len = tree_entry_len(&entry);
> > +             enum pattern_match_result sparsity_match = 0;
> >
> >               if (match != all_entries_interesting) {
> >                       strbuf_addstr(&name, base->buf + tn_len);
> > @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >               strbuf_add(base, entry.path, te_len);
> >
> > +             if (sparsity) {
> > +                     struct strbuf path = STRBUF_INIT;
> > +                     strbuf_addstr(&path, base->buf + tn_len);
> > +
> > +                     if (!in_sparse_checkout(&path, old_baselen - tn_len,
> > +                                             entry.mode, repo->index,
> > +                                             sparsity, default_sparsity_match,
> > +                                             &sparsity_match)) {
> > +                             strbuf_setlen(base, old_baselen);
> > +                             continue;
> > +                     }
> > +             }
>
> OK.
>
> >               if (S_ISREG(entry.mode)) {
> >                       hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> >                                        check_attr ? base->buf + tn_len : NULL);
> > @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >                       strbuf_addch(base, '/');
> >                       init_tree_desc(&sub, data, size);
> > -                     hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> > -                                      check_attr);
> > +                     hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> > +                                         check_attr, sparsity, sparsity_match);
> >                       free(data);
> >               } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
> >                       hit |= grep_submodule(opt, pathspec, &entry.oid,
> > @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >       return hit;
> >  }
> >
> > +/*
> > + * Note: sparsity patterns and paths' attributes will only be considered if
> > + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> > + * matching on paths.)
> > + */
> > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                  int is_root_tree)
> > +{
> > +     struct pattern_list *patterns = NULL;
> > +     int ret;
> > +
> > +     if (is_root_tree)
> > +             patterns = get_sparsity_patterns(opt->repo);
> > +
> > +     ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> > +                        patterns, 0);
> > +
> > +     if (patterns) {
> > +             clear_pattern_list(patterns);
> > +             free(patterns);
> > +     }
>
> OK, it is not like this codepath is driven by "git log" to grep from
> top-level tree objects of many commits, so it is OK to grab the
> sparsity patterns once before do_grep_tree() and discard it when we
> are done.
>
> > +     return ret;
> > +}
> > +
>
> >  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct object *obj, const char *name, const char *path)
> >  {
> > diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> > index 37525cae3a..26852586ac 100755
> > --- a/t/t7011-skip-worktree-reading.sh
> > +++ b/t/t7011-skip-worktree-reading.sh
> > @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
> >       test -z "$(git ls-files -m)"
> >  '
> >
> > -test_expect_success 'grep with skip-worktree file' '
> > -     git update-index --no-skip-worktree 1 &&
> > -     echo test > 1 &&
> > -     git update-index 1 &&
> > -     git update-index --skip-worktree 1 &&
> > -     rm 1 &&
> > -     test "$(git grep --no-ext-grep test)" = "1:test"
> > -'
> > -
> >  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected
> >  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
> >       setup_absent &&
> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..3bd67082eb
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
> > @@ -0,0 +1,140 @@
> > +#!/bin/sh
> > +
> > +test_description='grep in sparse checkout
> > +
> > +This test creates a repo with the following structure:
> > +
> > +.
> > +|-- a
> > +|-- b
> > +|-- dir
> > +|   `-- c
> > +`-- sub
> > +    |-- A
> > +    |   `-- a
> > +    `-- B
> > +     `-- b
> > +
> > +Where . has non-cone mode sparsity patterns and sub is a submodule with cone
> > +mode sparsity patterns. The resulting sparse-checkout should leave the following
> > +structure:
> > +
> > +.
> > +|-- a
> > +`-- sub
> > +    `-- B
> > +     `-- b
> > +'
> > +
> > +. ./test-lib.sh
> > +
> > +test_expect_success 'setup' '
> > +     echo "text" >a &&
> > +     echo "text" >b &&
> > +     mkdir dir &&
> > +     echo "text" >dir/c &&
> > +
> > +     git init sub &&
> > +     (
> > +             cd sub &&
> > +             mkdir A B &&
> > +             echo "text" >A/a &&
> > +             echo "text" >B/b &&
> > +             git add A B &&
> > +             git commit -m sub &&
> > +             git sparse-checkout init --cone &&
> > +             git sparse-checkout set B
> > +     ) &&
> > +
> > +     git submodule add ./sub &&
> > +     git add a b dir &&
> > +     git commit -m super &&
> > +     git sparse-checkout init --no-cone &&
> > +     git sparse-checkout set "/*" "!b" "!/*/" &&
> > +
> > +     git tag -am t-commit t-commit HEAD &&
> > +     tree=$(git rev-parse HEAD^{tree}) &&
> > +     git tag -am t-tree t-tree $tree &&
> > +
> > +     test_path_is_missing b &&
> > +     test_path_is_missing dir &&
> > +     test_path_is_missing sub/A &&
> > +     test_path_is_file a &&
> > +     test_path_is_file sub/B/b
> > +'
> > +
> > +test_expect_success 'grep in working tree should honor sparse checkout' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     EOF
> > +     git grep "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep --cached should honor sparse checkout' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     EOF
> > +     git grep --cached "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     cat >expect_commit <<-EOF &&
> > +     $commit:a:text
> > +     EOF
> > +     cat >expect_t-commit <<-EOF &&
> > +     t-commit:a:text
> > +     EOF
> > +     git grep "text" $commit >actual_commit &&
> > +     test_cmp expect_commit actual_commit &&
> > +     git grep "text" t-commit >actual_t-commit &&
> > +     test_cmp expect_t-commit actual_t-commit
> > +'
> > +
> > +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     tree=$(git rev-parse HEAD^{tree}) &&
> > +     cat >expect_tree <<-EOF &&
> > +     $tree:a:text
> > +     $tree:b:text
> > +     $tree:dir/c:text
> > +     EOF
> > +     cat >expect_t-tree <<-EOF &&
> > +     t-tree:a:text
> > +     t-tree:b:text
> > +     t-tree:dir/c:text
> > +     EOF
> > +     git grep "text" $tree >actual_tree &&
> > +     test_cmp expect_tree actual_tree &&
> > +     git grep "text" t-tree >actual_t-tree &&
> > +     test_cmp expect_t-tree actual_t-tree
> > +'
> > +
> > +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     sub/B/b:text
> > +     EOF
> > +     git grep --recurse-submodules --cached "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     cat >expect_commit <<-EOF &&
> > +     $commit:a:text
> > +     $commit:sub/B/b:text
> > +     EOF
> > +     cat >expect_t-commit <<-EOF &&
> > +     t-commit:a:text
> > +     t-commit:sub/B/b:text
> > +     EOF
> > +     git grep --recurse-submodules "text" $commit >actual_commit &&
> > +     test_cmp expect_commit actual_commit &&
> > +     git grep --recurse-submodules "text" t-commit >actual_t-commit &&
> > +     test_cmp expect_t-commit actual_t-commit
> > +'
> > +
> > +test_done

Matheus Tavares May 21, 2020, 5:35 p.m. UTC | #6

On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> >
> > > The idea behind not skipping gitlinks here was to be compliant with
> > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > document interactions with submodules"), we decided that, if the
> > > sparse-checkout patterns exclude a submodule, the submodule would
> > > still appear in the working tree. The purpose was to keep these
> > > features (submodules and sparse-checkout) independent. Along the same
> > > lines, I think we should always recurse into initialized submodules in
>
> Sorry if I missed it in the code, but do you check whether the
> submodule is initialized before descending into it, or do you descend
> into it based on it just being a submodule?

We only descend if the submodule is initialized. The new code in this
patch doesn't do this check, but it is already implemented in
grep_submodule() (which is called by grep_tree() and grep_cache() when
a submodule is found).

Elijah Newren May 21, 2020, 5:52 p.m. UTC | #7

On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> > >
> > > > The idea behind not skipping gitlinks here was to be compliant with
> > > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > > document interactions with submodules"), we decided that, if the
> > > > sparse-checkout patterns exclude a submodule, the submodule would
> > > > still appear in the working tree. The purpose was to keep these
> > > > features (submodules and sparse-checkout) independent. Along the same
> > > > lines, I think we should always recurse into initialized submodules in
> >
> > Sorry if I missed it in the code, but do you check whether the
> > submodule is initialized before descending into it, or do you descend
> > into it based on it just being a submodule?
>
> We only descend if the submodule is initialized. The new code in this
> patch doesn't do this check, but it is already implemented in
> grep_submodule() (which is called by grep_tree() and grep_cache() when
> a submodule is found).

Good to know.  To up the ante a bit: What if another branch has
directory that doesn't exist in HEAD or the current checkout, and
within that directory is a submodule.  Would it be recursed into?
What if it matched the sparsity paths?  (Is it even possible to
recurse into it?)

Matheus Tavares May 22, 2020, 5:49 a.m. UTC | #8

On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> > > >
> > > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> > > >
> > > > > The idea behind not skipping gitlinks here was to be compliant with
> > > > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > > > document interactions with submodules"), we decided that, if the
> > > > > sparse-checkout patterns exclude a submodule, the submodule would
> > > > > still appear in the working tree. The purpose was to keep these
> > > > > features (submodules and sparse-checkout) independent. Along the same
> > > > > lines, I think we should always recurse into initialized submodules in
> > >
> > > Sorry if I missed it in the code, but do you check whether the
> > > submodule is initialized before descending into it, or do you descend
> > > into it based on it just being a submodule?
> >
> > We only descend if the submodule is initialized. The new code in this
> > patch doesn't do this check, but it is already implemented in
> > grep_submodule() (which is called by grep_tree() and grep_cache() when
> > a submodule is found).
>
> Good to know.  To up the ante a bit: What if another branch has
> directory that doesn't exist in HEAD or the current checkout, and
> within that directory is a submodule.  Would it be recursed into?

In this case, `git grep --recurse-submodules <pattern> $branch` will
recurse into the submodule, but only if it has already been
initialized. I.e. if we have checked out to $branch, ran `git
submodule init` and then checked out back.

> What if it matched the sparsity paths?  (Is it even possible to
> recurse into it?)

That's a great question. The idea that I tried to implement is to
always recurse into _initialized_ submodules (even the ones excluded
by the superproject's sparsity patterns) and, then, follow their own
sparsity patterns inside. I'm not necessarily in favor (or against)
this behavior, but this seemed to be the most compatible way with the
design we describe in our docs:

"If your sparse-checkout patterns exclude an initialized submodule,
then that submodule will still appear in your working directory." (in
git-sparse-checkout.txt)

So, back to the original question, if you run `git grep
--recurse-submodules <pattern> $branch` and $branch contains a
submodule which was previously initialized, git-grep _would_ recurse
into it, even if it (or its parent dir) was excluded. However, your
question helped me notice an inconsistency in my patch: the behavior I
just described is working for the full pattern set, but not in cone
mode. That's because, in cone mode, we can mark the whole submodule's
parent dir as excluded. Then, path_matches_pattern_list() will return
NOT_MATCHED for the parent dir and we won't recurse into it, so we
won't even get to the submodule's path to discover that it refers to a
gitlink.

Therefore, if we decide to keep the behavior of always recursing into
submodules, we will need some extra work for the cone mode. I.e.
grep_tree() will have to check if NOT_MATCHED directories contain
submodules before discarding them, and recurse only into the
submodules if so. As for the implementation, the first idea that came
to my mind was to list the submodules' pathnames and do prefix
matching for each submodule and NOT_MATCHED dir. But the places I've
seen such submodule listings in the code base so far [1] seem to work
only in the current branch. My second idea was to continue the tree
walk when we hit NOT_MATCHED dir entries, but not doing any work, just
looking for possible gitlinks to recurse into. I'm not sure if that
could negatively affect the execution time, though.

Does this seem like a good approach? Or is there another solution that
I have not considered? Or even further, should we choose to skip the
submodules in excluded paths? My only concern in this case is that it
would be contrary to the design in git-sparse-checkout.txt. And the
working tree grep and cached grep would differ even on a clean working
tree.

[1]: builtin/submodule--helper.c:module_list_compute() and
submodule-config.c:config_from_gitmodules()

Elijah Newren May 22, 2020, 2:26 p.m. UTC | #9

Hi Matheus,

On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
>
> On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> >
<snip>
> > Good to know.  To up the ante a bit: What if another branch has
> > directory that doesn't exist in HEAD or the current checkout, and
> > within that directory is a submodule.  Would it be recursed into?
>
> In this case, `git grep --recurse-submodules <pattern> $branch` will
> recurse into the submodule, but only if it has already been
> initialized. I.e. if we have checked out to $branch, ran `git
> submodule init` and then checked out back.
>
> > What if it matched the sparsity paths?  (Is it even possible to
> > recurse into it?)
>
> That's a great question. The idea that I tried to implement is to
> always recurse into _initialized_ submodules (even the ones excluded
> by the superproject's sparsity patterns) and, then, follow their own
> sparsity patterns inside. I'm not necessarily in favor (or against)
> this behavior, but this seemed to be the most compatible way with the
> design we describe in our docs:
>
> "If your sparse-checkout patterns exclude an initialized submodule,
> then that submodule will still appear in your working directory." (in
> git-sparse-checkout.txt)
>
> So, back to the original question, if you run `git grep
> --recurse-submodules <pattern> $branch` and $branch contains a
> submodule which was previously initialized, git-grep _would_ recurse
> into it, even if it (or its parent dir) was excluded. However, your
> question helped me notice an inconsistency in my patch: the behavior I
> just described is working for the full pattern set, but not in cone
> mode. That's because, in cone mode, we can mark the whole submodule's
> parent dir as excluded. Then, path_matches_pattern_list() will return
> NOT_MATCHED for the parent dir and we won't recurse into it, so we
> won't even get to the submodule's path to discover that it refers to a
> gitlink.
>
> Therefore, if we decide to keep the behavior of always recursing into
> submodules, we will need some extra work for the cone mode. I.e.
> grep_tree() will have to check if NOT_MATCHED directories contain
> submodules before discarding them, and recurse only into the
> submodules if so. As for the implementation, the first idea that came
> to my mind was to list the submodules' pathnames and do prefix
> matching for each submodule and NOT_MATCHED dir. But the places I've
> seen such submodule listings in the code base so far [1] seem to work
> only in the current branch. My second idea was to continue the tree
> walk when we hit NOT_MATCHED dir entries, but not doing any work, just
> looking for possible gitlinks to recurse into. I'm not sure if that
> could negatively affect the execution time, though.
>
> Does this seem like a good approach? Or is there another solution that
> I have not considered? Or even further, should we choose to skip the
> submodules in excluded paths? My only concern in this case is that it
> would be contrary to the design in git-sparse-checkout.txt. And the
> working tree grep and cached grep would differ even on a clean working
> tree.

To be honest, I think it sounds insane.  What you propose does make
sense if you take what was written in git-sparse-checkout.txt very
literally and as though it was a core design principle meant to cover
all cases but I do not think it merits such a standing at all.  I
think it should be treated as a first draft attempt to explain
interactions that was written solely with the 'checkout' case in mind,
especially since it was written at the same approximate time that this
was written earlier in the same file:

    THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF
    OTHER COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY
    CHANGE IN THE FUTURE.

Anyway, the wording in that file seems to be really important, so
let's fix it.

-- >8 --
Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules

Ignoring the sparse-checkout feature momentarily, if one has a submodule and
creates local branches within it with unpushed changes and maybe adds some
untracked files to it, then we would want to avoid accidentally removing such
a submodule.  So, for example with git.git, if you run
   git checkout v2.13.0
then the sha1collisiondetection/ submodule is NOT removed even though it
did not exist as a submodule until v2.14.0.  Similarly, if you only had
v2.13.0 checked out previously and ran
   git checkout v2.14.0
the sha1collisiondetection/ submodule would NOT be automatically
initialized despite being part of v2.14.0.  In both cases, git requires
submodules to be initialized or deinitialized separately.  Further, we
also have special handling for submodules in other commands such as
clean, which requires two --force flags to delete untracked submodules,
and some commands have a --recurse-submodules flag.

sparse-checkout is very similar to checkout, as evidenced by the similar
name -- it adds and removes files from the working copy.  However, for
the same avoid-data-loss reasons we do not want to remove a submodule
from the working copy with checkout, we do not want to do it with
sparse-checkout either.  So submodules need to be separately initialized
or deinitialized; changing sparse-checkout rules should not
automatically trigger the removal or vivification of submodules.

I believe the previous wording in git-sparse-checkout.txt about
submodules was only about this particular issue.  Unfortunately, the
previous wording could be interpreted to imply that submodules should be
considered active regardless of sparsity patterns.  Update the wording
to avoid making such an implication.  It may be helpful to consider two
example situations where the differences in wording become important:

In the future, we want users to be able to run commands like
   git clone --sparse=moduleA --recurse-submodules $REPO_URL
and have sparsity paths automatically set up and have submodules *within
the sparsity paths* be automatically initialized.  We do not want all
submodules in any path to be automatically initialized with that
command.

Similarly, we want to be able to do things like
   git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN
and search through $REV for $PATTERN within the recorded sparsity
patterns.  We want it to recurse into submodules within those sparsity
patterns, but do not want to recurse into directories that do not match
the sparsity patterns in search of a possible submodule.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-sparse-checkout.txt | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index c0342e5393..7dde2d330c 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -190,10 +190,23 @@ directory.
 SUBMODULES
 ----------

-If your repository contains one or more submodules, then those submodules will
-appear based on which you initialized with the `git submodule` command. If
-your sparse-checkout patterns exclude an initialized submodule, then that
-submodule will still appear in your working directory.
+If your repository contains one or more submodules, then those submodules
+will appear based on which you initialized with the `git submodule`
+command.  Submodules may have additional untracked files or code stored on
+other branches, so to avoid data loss, changing sparse inclusion/exclusion
+rules will not cause an already checked out submodule to be removed from
+the working copy.  Said another way, just as `checkout` will not cause
+submodules to be automatically removed or initialized even when switching
+between branches that remove or add submodules, using `sparse-checkout` to
+reduce or expand the scope of "interesting" files will not cause submodules
+to be automatically deinitialized or initialized either.  Adding or
+removing them must be done as a separate step with `git submodule init` or
+`git submodule deinit`.
+
+This may mean that even if your sparsity patterns include or exclude
+submodules, until you manually initialize or deinitialize them, commands
+like grep that work on tracked files in the working copy will ignore "not
+yet initialized" submodules and pay attention to "left behind" ones.

 SEE ALSO

Elijah Newren May 22, 2020, 3:36 p.m. UTC | #10

On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Matheus,
>
> On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> >
> > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > >
<snip>
> > Does this seem like a good approach? Or is there another solution that
> > I have not considered? Or even further, should we choose to skip the
> > submodules in excluded paths? My only concern in this case is that it
> > would be contrary to the design in git-sparse-checkout.txt. And the
> > working tree grep and cached grep would differ even on a clean working
> > tree.
>
<snip>
> Anyway, the wording in that file seems to be really important, so
> let's fix it.
>

Let me also try to give a concrete proposal for grep behavior for the
edge cases we've discussed:

git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN

This goes through all the files in the index (i.e. all tracked files)
which do not have the SKIP_WORKTREE bit set.  For each of these: If
the file is a symlink, ignore it (like grep currently does).  If the
file is a regular file and is present in the working copy, search it.
If the file is a submodule and it is initialized, recurse into it.

git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN

This goes through all the files in the index (i.e. all tracked files)
which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
symlinks.  Search regular files.  Recurse into submodules if they are
initialized.

git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN

This goes through all the files in the given revision (i.e. all
tracked files) which match the sparsity patterns (i.e. that would not
have the SKIP_WORKTREE bit set if were we to checkout that commit).
For each of these: Skip symlinks.  Search regular files.  Recurse into
submodules if they are initialized.

Further, for any of these, when recursing into submodules, make sure
to load that submodules' core.sparseCheckout setting (and related
settings) and the submodules' sparsity patterns, if any.

Sound good?

I think this addresses the edge cases we've discussed so far:
interaction between submodules and sparsity patterns, and handling of
files that are still present despite not matching the sparsity
patterns. (Also note that files which are present-despite-the-rules
are prone to be removed by the next `git sparse-checkout reapply` or
anything that triggers a call to unpack_trees(); there's already
multiple things that do and Stolee's proposed patches would add more).
If I've missed edge cases, let me know.

Elijah

Matheus Tavares May 22, 2020, 8:54 p.m. UTC | #11

Hi, Elijah

On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > Hi Matheus,
> >
> > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> > >
> > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > > >
> <snip>
> > > Does this seem like a good approach? Or is there another solution that
> > > I have not considered? Or even further, should we choose to skip the
> > > submodules in excluded paths? My only concern in this case is that it
> > > would be contrary to the design in git-sparse-checkout.txt. And the
> > > working tree grep and cached grep would differ even on a clean working
> > > tree.
> >
> <snip>
> > Anyway, the wording in that file seems to be really important, so
> > let's fix it.
> >
>
> Let me also try to give a concrete proposal for grep behavior for the
> edge cases we've discussed:

Thank you for this proposal and for the previous comments as well.

> git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN
>
> This goes through all the files in the index (i.e. all tracked files)
> which do not have the SKIP_WORKTREE bit set.  For each of these: If
> the file is a symlink, ignore it (like grep currently does).  If the
> file is a regular file and is present in the working copy, search it.
> If the file is a submodule and it is initialized, recurse into it.

Sounds good. And when sparse.restrictCmds=false, we also search the
present files and present initialized submodules that have the
SKIP_WORKTREE set, right?

> git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN
>
> This goes through all the files in the index (i.e. all tracked files)
> which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
> symlinks.  Search regular files.  Recurse into submodules if they are
> initialized.

OK.

> git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN
>
> This goes through all the files in the given revision (i.e. all
> tracked files) which match the sparsity patterns (i.e. that would not
> have the SKIP_WORKTREE bit set if were we to checkout that commit).
> For each of these: Skip symlinks.  Search regular files.  Recurse into
> submodules if they are initialized.

OK.

> Further, for any of these, when recursing into submodules, make sure
> to load that submodules' core.sparseCheckout setting (and related
> settings) and the submodules' sparsity patterns, if any.
>
> Sound good?
>
> I think this addresses the edge cases we've discussed so far:
> interaction between submodules and sparsity patterns, and handling of
> files that are still present despite not matching the sparsity
> patterns. (Also note that files which are present-despite-the-rules
> are prone to be removed by the next `git sparse-checkout reapply` or
> anything that triggers a call to unpack_trees(); there's already
> multiple things that do and Stolee's proposed patches would add more).
> If I've missed edge cases, let me know.

Sounds great. This addresses all the edge cases we've mentioned
before. Thanks again for the detailed proposal, and for considering
case by case.

Elijah Newren May 22, 2020, 9:06 p.m. UTC | #12

On Fri, May 22, 2020 at 1:54 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Elijah
>
> On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > Hi Matheus,
> > >
> > > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> > > >
> > > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > > > >
> > <snip>
> > > > Does this seem like a good approach? Or is there another solution that
> > > > I have not considered? Or even further, should we choose to skip the
> > > > submodules in excluded paths? My only concern in this case is that it
> > > > would be contrary to the design in git-sparse-checkout.txt. And the
> > > > working tree grep and cached grep would differ even on a clean working
> > > > tree.
> > >
> > <snip>
> > > Anyway, the wording in that file seems to be really important, so
> > > let's fix it.
> > >
> >
> > Let me also try to give a concrete proposal for grep behavior for the
> > edge cases we've discussed:
>
> Thank you for this proposal and for the previous comments as well.
>
> > git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN
> >
> > This goes through all the files in the index (i.e. all tracked files)
> > which do not have the SKIP_WORKTREE bit set.  For each of these: If
> > the file is a symlink, ignore it (like grep currently does).  If the
> > file is a regular file and is present in the working copy, search it.
> > If the file is a submodule and it is initialized, recurse into it.
>
> Sounds good. And when sparse.restrictCmds=false, we also search the
> present files and present initialized submodules that have the
> SKIP_WORKTREE set, right?

You're really pushing those corner cases, I love it.  :-)
SKIP_WORKTREE is supposed to mean we have removed it from the working
tree, i.e. it shouldn't be present (if we decide we're not going to
remove it from the working tree, e.g. because the file is unmerged or
something, then we don't mark it as SKIP_WORKTREE even if it doesn't
match sparsity patterns).  Therefore, the set of files that satisfy
this condition you have given should generally be empty.

But presuming we hit this corner case, I'd say you are right.
sparse.restrictCmds=false means we ignore the SKIP_WORKTREE bit
entirely (and in the case of grepping a $REVISION, we ignore the
sparsity patterns entirely).

> > git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN
> >
> > This goes through all the files in the index (i.e. all tracked files)
> > which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
> > symlinks.  Search regular files.  Recurse into submodules if they are
> > initialized.
>
> OK.
>
> > git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN
> >
> > This goes through all the files in the given revision (i.e. all
> > tracked files) which match the sparsity patterns (i.e. that would not
> > have the SKIP_WORKTREE bit set if were we to checkout that commit).
> > For each of these: Skip symlinks.  Search regular files.  Recurse into
> > submodules if they are initialized.
>
> OK.
>
> > Further, for any of these, when recursing into submodules, make sure
> > to load that submodules' core.sparseCheckout setting (and related
> > settings) and the submodules' sparsity patterns, if any.
> >
> > Sound good?
> >
> > I think this addresses the edge cases we've discussed so far:
> > interaction between submodules and sparsity patterns, and handling of
> > files that are still present despite not matching the sparsity
> > patterns. (Also note that files which are present-despite-the-rules
> > are prone to be removed by the next `git sparse-checkout reapply` or
> > anything that triggers a call to unpack_trees(); there's already
> > multiple things that do and Stolee's proposed patches would add more).
> > If I've missed edge cases, let me know.
>
> Sounds great. This addresses all the edge cases we've mentioned
> before. Thanks again for the detailed proposal, and for considering
> case by case.

And thank you for working on this.  :-)

Derrick Stolee June 10, 2020, 11:40 a.m. UTC | #13

On 5/22/2020 10:26 AM, Elijah Newren wrote:

Sorry I missed this patch. I was searching all over for patches with
"sparse" or "submodule" in the _subject_. Thanks for calling out the
need for review, Junio!

> Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules
> 
> Ignoring the sparse-checkout feature momentarily, if one has a submodule and
> creates local branches within it with unpushed changes and maybe adds some
> untracked files to it, then we would want to avoid accidentally removing such
> a submodule.  So, for example with git.git, if you run
>    git checkout v2.13.0
> then the sha1collisiondetection/ submodule is NOT removed even though it
> did not exist as a submodule until v2.14.0.  Similarly, if you only had
> v2.13.0 checked out previously and ran
>    git checkout v2.14.0
> the sha1collisiondetection/ submodule would NOT be automatically
> initialized despite being part of v2.14.0.  In both cases, git requires
> submodules to be initialized or deinitialized separately.  Further, we
> also have special handling for submodules in other commands such as
> clean, which requires two --force flags to delete untracked submodules,
> and some commands have a --recurse-submodules flag.
> 
> sparse-checkout is very similar to checkout, as evidenced by the similar
> name -- it adds and removes files from the working copy.  However, for
> the same avoid-data-loss reasons we do not want to remove a submodule
> from the working copy with checkout, we do not want to do it with
> sparse-checkout either.  So submodules need to be separately initialized
> or deinitialized; changing sparse-checkout rules should not
> automatically trigger the removal or vivification of submodules.

This is a good summary of how submodules decide to be present or not.

> I believe the previous wording in git-sparse-checkout.txt about
> submodules was only about this particular issue.  Unfortunately, the
> previous wording could be interpreted to imply that submodules should be
> considered active regardless of sparsity patterns.  Update the wording
> to avoid making such an implication.  It may be helpful to consider two
> example situations where the differences in wording become important:

You are correct, the wording was unclear. Worth fixing.

> In the future, we want users to be able to run commands like
>    git clone --sparse=moduleA --recurse-submodules $REPO_URL
> and have sparsity paths automatically set up and have submodules *within
> the sparsity paths* be automatically initialized.  We do not want all
> submodules in any path to be automatically initialized with that
> command.

INTERESTING. You are correct that it would be nice to have one
feature that describes "what should be present or not". The in-tree
sparse-checkout feature (still in infancy) would benefit from a
redesign with that in mind.

I am interested as well in the idea that combining "--sparse[=X]"
with "--recurse-submodules" might want to imply that the submodules
themselves are initialized with sparse-checkout patterns.

These ramblings are of course off-topic for the current patch.

> Similarly, we want to be able to do things like
>    git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN
> and search through $REV for $PATTERN within the recorded sparsity
> patterns.  We want it to recurse into submodules within those sparsity
> patterns, but do not want to recurse into directories that do not match
> the sparsity patterns in search of a possible submodule.

(snipping way the old paragraph and focusing on the new text)

> +If your repository contains one or more submodules, then those submodules
> +will appear based on which you initialized with the `git submodule`
> +command.

This sentence is awkward. Here is a potential replacement:

  If your repository contains one or more submodules, then submodules are
  populated based on interactions with the `git submodule` command.
  Specifically, `git submodule init -- <path>` will ensure the submodule at
  `<path>` is present while `git submodule deinit -- <path>` will remove the
  files for the submodule at `<path>`. Similar to sparse-checkout, the
  deinitialized submodules still exist in the index, but are not present in
  the working directory.

That got a lot longer as I was working on it. Perhaps add a paragraph break
before the next bit.

>  Submodules may have additional untracked files or code stored on

To emphasize the importance of the following "to avoid data loss" statement,
you could mention that when a submodule is removed from the working directory,
then so is all of its Git data such as objects and branches. If that data was
not pushed to another repository, then deinitializing a submodule can result
in loss of important data. (Also: maybe I'm wrong about that?)

> +other branches, so to avoid data loss, changing sparse inclusion/exclusion

Edit: other branches. To avoid data loss, ...

> +rules will not cause an already checked out submodule to be removed from
> +the working copy.  Said another way, just as `checkout` will not cause
> +submodules to be automatically removed or initialized even when switching
> +between branches that remove or add submodules, using `sparse-checkout` to
> +reduce or expand the scope of "interesting" files will not cause submodules
> +to be automatically deinitialized or initialized either.  Adding or
> +removing them must be done as a separate step with `git submodule init` or
> +`git submodule deinit`.

This final sentence may be redundant if you include reference to init/deinit
earlier in the section.

> +This may mean that even if your sparsity patterns include or exclude
> +submodules, until you manually initialize or deinitialize them, commands
> +like grep that work on tracked files in the working copy will ignore "not
> +yet initialized" submodules and pay attention to "left behind" ones.

I don't think that "left behind" is a good phrase here. It feels like
they've been _dropped_ instead of _persisted despite sparse-checkout
changes_.

Perhaps:

  commands like `git grep` that work on tracked files in the working copy
  will pay attention only to initialized submodules, regardless of the
  sparse-checkout definition.

Thanks for pointing out how complicated this scenario is! It certainly
demands a careful update like this one.

-Stolee

Matheus Tavares June 10, 2020, 4:22 p.m. UTC | #14

On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 5/22/2020 10:26 AM, Elijah Newren wrote:
> > +This may mean that even if your sparsity patterns include or exclude
> > +submodules, until you manually initialize or deinitialize them, commands
> > +like grep that work on tracked files in the working copy will ignore "not
> > +yet initialized" submodules and pay attention to "left behind" ones.
>
> I don't think that "left behind" is a good phrase here. It feels like
> they've been _dropped_ instead of _persisted despite sparse-checkout
> changes_.
>
> Perhaps:
>
>   commands like `git grep` that work on tracked files in the working copy
>   will pay attention only to initialized submodules, regardless of the
>   sparse-checkout definition.

Hmm, I'm a little confused by the "regardless of the sparse-checkout
definition". The plan we discussed for grep was to not recurse into
submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?

[1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/

Derrick Stolee June 10, 2020, 5:42 p.m. UTC | #15

On 6/10/2020 12:22 PM, Matheus Tavares Bernardino wrote:
> On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 5/22/2020 10:26 AM, Elijah Newren wrote:
>>> +This may mean that even if your sparsity patterns include or exclude
>>> +submodules, until you manually initialize or deinitialize them, commands
>>> +like grep that work on tracked files in the working copy will ignore "not
>>> +yet initialized" submodules and pay attention to "left behind" ones.
>>
>> I don't think that "left behind" is a good phrase here. It feels like
>> they've been _dropped_ instead of _persisted despite sparse-checkout
>> changes_.
>>
>> Perhaps:
>>
>>   commands like `git grep` that work on tracked files in the working copy
>>   will pay attention only to initialized submodules, regardless of the
>>   sparse-checkout definition.
> 
> Hmm, I'm a little confused by the "regardless of the sparse-checkout
> definition". The plan we discussed for grep was to not recurse into
> submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?
> 
> [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/

Thanks for correcting my misunderstanding. By introducing
`git grep` into this documentation, I have also made it
co-dependent on your series. Instead, Elijah was probably
purposeful in his use of "grep" over "git grep".

If we revert that part of my change to use `grep` instead
of `git grep`, then is my statement correct?

Thanks,
-Stolee

Matheus Tavares June 10, 2020, 6:14 p.m. UTC | #16

On Wed, Jun 10, 2020 at 2:42 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/10/2020 12:22 PM, Matheus Tavares Bernardino wrote:
> > On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 5/22/2020 10:26 AM, Elijah Newren wrote:
> >>> +This may mean that even if your sparsity patterns include or exclude
> >>> +submodules, until you manually initialize or deinitialize them, commands
> >>> +like grep that work on tracked files in the working copy will ignore "not
> >>> +yet initialized" submodules and pay attention to "left behind" ones.
> >>
> >> I don't think that "left behind" is a good phrase here. It feels like
> >> they've been _dropped_ instead of _persisted despite sparse-checkout
> >> changes_.
> >>
> >> Perhaps:
> >>
> >>   commands like `git grep` that work on tracked files in the working copy
> >>   will pay attention only to initialized submodules, regardless of the
> >>   sparse-checkout definition.
> >
> > Hmm, I'm a little confused by the "regardless of the sparse-checkout
> > definition". The plan we discussed for grep was to not recurse into
> > submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?
> >
> > [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/
>
> Thanks for correcting my misunderstanding. By introducing
> `git grep` into this documentation, I have also made it
> co-dependent on your series. Instead, Elijah was probably
> purposeful in his use of "grep" over "git grep".

I think he used grep referring to git-grep as he mentioned "tracked
files in the working copy". Maybe he wanted to describe the current
state of git-grep, which does recurse into initialized submodules even
when they don't match the sparsity patterns. Was that it, Elijah?

If so, since this behavior is changed in mt/grep-sparse-checkout, I
think I should also change this doc section within my series. Or we
change the doc in this patch and make it dependent on the series.

Elijah Newren June 10, 2020, 7:58 p.m. UTC | #17

On Wed, Jun 10, 2020 at 4:41 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 5/22/2020 10:26 AM, Elijah Newren wrote:
>
> Sorry I missed this patch. I was searching all over for patches with
> "sparse" or "submodule" in the _subject_. Thanks for calling out the
> need for review, Junio!
>
> > Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules
> >
> > Ignoring the sparse-checkout feature momentarily, if one has a submodule and
> > creates local branches within it with unpushed changes and maybe adds some
> > untracked files to it, then we would want to avoid accidentally removing such
> > a submodule.  So, for example with git.git, if you run
> >    git checkout v2.13.0
> > then the sha1collisiondetection/ submodule is NOT removed even though it
> > did not exist as a submodule until v2.14.0.  Similarly, if you only had
> > v2.13.0 checked out previously and ran
> >    git checkout v2.14.0
> > the sha1collisiondetection/ submodule would NOT be automatically
> > initialized despite being part of v2.14.0.  In both cases, git requires
> > submodules to be initialized or deinitialized separately.  Further, we
> > also have special handling for submodules in other commands such as
> > clean, which requires two --force flags to delete untracked submodules,
> > and some commands have a --recurse-submodules flag.
> >
> > sparse-checkout is very similar to checkout, as evidenced by the similar
> > name -- it adds and removes files from the working copy.  However, for
> > the same avoid-data-loss reasons we do not want to remove a submodule
> > from the working copy with checkout, we do not want to do it with
> > sparse-checkout either.  So submodules need to be separately initialized
> > or deinitialized; changing sparse-checkout rules should not
> > automatically trigger the removal or vivification of submodules.
>
> This is a good summary of how submodules decide to be present or not.
>
> > I believe the previous wording in git-sparse-checkout.txt about
> > submodules was only about this particular issue.  Unfortunately, the
> > previous wording could be interpreted to imply that submodules should be
> > considered active regardless of sparsity patterns.  Update the wording
> > to avoid making such an implication.  It may be helpful to consider two
> > example situations where the differences in wording become important:
>
> You are correct, the wording was unclear. Worth fixing.
>
> > In the future, we want users to be able to run commands like
> >    git clone --sparse=moduleA --recurse-submodules $REPO_URL
> > and have sparsity paths automatically set up and have submodules *within
> > the sparsity paths* be automatically initialized.  We do not want all
> > submodules in any path to be automatically initialized with that
> > command.
>
> INTERESTING. You are correct that it would be nice to have one
> feature that describes "what should be present or not". The in-tree
> sparse-checkout feature (still in infancy) would benefit from a
> redesign with that in mind.
>
> I am interested as well in the idea that combining "--sparse[=X]"
> with "--recurse-submodules" might want to imply that the submodules
> themselves are initialized with sparse-checkout patterns.
>
> These ramblings are of course off-topic for the current patch.

Yeah, it might get complicated too; we'd almost certainly want to
limit to cone mode (globs could get super hairy).  It's also the case
we might want some submodules to have sparse-checkouts and others have
full checkouts, depending on whether the --sparse=X specification
listed some path that traversed from the toplevel outer repo down into
a submodule.  (But if --sparse is given with no specification, do all
submodules become sparse or do all remain full?)  Anyway, lots of
complications there and we should start a different thread to discuss
that when we feel it's time to tackle it.

> > Similarly, we want to be able to do things like
> >    git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN
> > and search through $REV for $PATTERN within the recorded sparsity
> > patterns.  We want it to recurse into submodules within those sparsity
> > patterns, but do not want to recurse into directories that do not match
> > the sparsity patterns in search of a possible submodule.
>
> (snipping way the old paragraph and focusing on the new text)
>
> > +If your repository contains one or more submodules, then those submodules
> > +will appear based on which you initialized with the `git submodule`
> > +command.
>
> This sentence is awkward. Here is a potential replacement:
>
>   If your repository contains one or more submodules, then submodules are
>   populated based on interactions with the `git submodule` command.
>   Specifically, `git submodule init -- <path>` will ensure the submodule at
>   `<path>` is present while `git submodule deinit -- <path>` will remove the
>   files for the submodule at `<path>`. Similar to sparse-checkout, the
>   deinitialized submodules still exist in the index, but are not present in
>   the working directory.
>
> That got a lot longer as I was working on it. Perhaps add a paragraph break
> before the next bit.

Sounds good, thanks.

> >  Submodules may have additional untracked files or code stored on
>
> To emphasize the importance of the following "to avoid data loss" statement,
> you could mention that when a submodule is removed from the working directory,
> then so is all of its Git data such as objects and branches. If that data was
> not pushed to another repository, then deinitializing a submodule can result
> in loss of important data. (Also: maybe I'm wrong about that?)
>
> > +other branches, so to avoid data loss, changing sparse inclusion/exclusion

I thought that was what I covered with the "code stored on other
branches" but I guess that wasn't clear enough.  So yeah, I can try
extending it a bit.

> Edit: other branches. To avoid data loss, ...

Sounds good.

> > +rules will not cause an already checked out submodule to be removed from
> > +the working copy.  Said another way, just as `checkout` will not cause
> > +submodules to be automatically removed or initialized even when switching
> > +between branches that remove or add submodules, using `sparse-checkout` to
> > +reduce or expand the scope of "interesting" files will not cause submodules
> > +to be automatically deinitialized or initialized either.  Adding or
> > +removing them must be done as a separate step with `git submodule init` or
> > +`git submodule deinit`.
>
> This final sentence may be redundant if you include reference to init/deinit
> earlier in the section.

Yep, I'll strike it.

> > +This may mean that even if your sparsity patterns include or exclude
> > +submodules, until you manually initialize or deinitialize them, commands
> > +like grep that work on tracked files in the working copy will ignore "not
> > +yet initialized" submodules and pay attention to "left behind" ones.
>
> I don't think that "left behind" is a good phrase here. It feels like
> they've been _dropped_ instead of _persisted despite sparse-checkout
> changes_.

I think in addition to the "left behind" wording being bad, my
paragraph left another funny gray area and might be inconsistent with
what Matheus and I wrote elsewhere:

If sparsity patterns would exclude a submodule that is initialized,
sparse-checkout clearly can't remove the submodule.  However, should
it set the SKIP_WORKTREE bit for that submodule if it's not going to
remove it?

I'm not sure of the answer, yet.  I think Matheus had the right idea
for how to make grep handle an initialized submodule in the different
sparse.restrictCmds settings, and if we do go ahead and clear the
SKIP_WORKTREE bit, then I think the wording of this paragraph needs to
change.  So, let's discuss your alternative:

> Perhaps:
>
>   commands like `git grep` that work on tracked files in the working copy
>   will pay attention only to initialized submodules, regardless of the
>   sparse-checkout definition.

I think this is easy to misconstrue in an entirely new way: if there
are initialized submodules (and maybe a sparse checkout), then your
wording implies normal files would be ignored by grep (even files that
aren't removed by the sparse checkout)!  While that sounds like crazy
behavior, this whole thread started because of suggested behaviors
being proposed to carefully follow what was already written in this
document even though the end user result seemed somewhat crazy to me.
So, we might want to avoid a repeat.  :-)

Also, your suggested wording is different than the behavior we came up
with before, and is also inconsistent with how we'd work with normal
files.  For example, what if a user:

* uses sparse-checkout to remove a bunch of files/directories they
don't care about
* creates a new file that happens to have the same name as an
(unfortunately) generically worded filename that exists in the index
(but is marked SKIP_WORKTREE and had previously been removed)

Is this new file related to the tracked file?  Is the new file
considered tracked?  Should the new file be considered part of the
sparse cone (i.e. should it be considered part of the set of tracked
working tree files relevant to the user for commands that operate on
that subset)?  It's a bit of a thorny case.

Here's the behavior Matheus and I came to previously:

git -c sparse.restrictCmds=true grep --recurse-submodules <pattern>:
This goes through all the files in the index (i.e. all tracked files)
which do NOT have the SKIP_WORKTREE bit set.  For each of these: If
the file is a symlink, ignore it (like git-grep currently does).  If
the file is a regular file and is present in the working copy, search
it.  If the file is a submodule and it is initialized, recurse into
it.

git -c sparse.restrictCmds=false grep --recurse-submodules <pattern>:
This goes through all the files in the index (i.e. all tracked files)
regardless of SKIP_WORKTREE bit setting.  For each of these: If the
file is a symlink, ignore it (like git-grep currently does).  If the
file is a regular file and is present in the working copy, search it.
If the file is a submodule and it is initialized, recurse into it.

The only difference between these two sparse.restrictCmds settings is
the handling of the SKIP_WORKTREE bit.  I think that makes them nice
and orthogonal.  They also generalize nicely to the cases of searching
--cached or $REVISION with a few obvious changes (check if data is
available in git object store rather than if file is present in
working tree, and for $REVISION check sparsity patterns rather than
SKIP_WORKTREE bit).

If we start ignoring the SKIP_WORKTREE bit for some types of files
even when sparse.restrictCmds=true, I think we start getting a number
of inconsistencies and user surprises.  So, these formal definitions
seem like a good high-level design.  I think our attempts to summarize
behavior in short sentences for users sometimes ignores some cases due
to the desire to summarize.  However, taking the summary literally can
suggest behaviors that'd be inconsistent if not downright crazy for
some of the ignored cases.  I'll see if I can clean it up somehow.

> Thanks for pointing out how complicated this scenario is! It certainly
> demands a careful update like this one.

Thanks for the thoughtful review!

Elijah Newren June 10, 2020, 8:12 p.m. UTC | #18

On Wed, Jun 10, 2020 at 9:23 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 5/22/2020 10:26 AM, Elijah Newren wrote:
> > > +This may mean that even if your sparsity patterns include or exclude
> > > +submodules, until you manually initialize or deinitialize them, commands
> > > +like grep that work on tracked files in the working copy will ignore "not
> > > +yet initialized" submodules and pay attention to "left behind" ones.
> >
> > I don't think that "left behind" is a good phrase here. It feels like
> > they've been _dropped_ instead of _persisted despite sparse-checkout
> > changes_.
> >
> > Perhaps:
> >
> >   commands like `git grep` that work on tracked files in the working copy
> >   will pay attention only to initialized submodules, regardless of the
> >   sparse-checkout definition.
>
> Hmm, I'm a little confused by the "regardless of the sparse-checkout
> definition". The plan we discussed for grep was to not recurse into
> submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?
>
> [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/

I flagged some issues with that sentence...and an additional issue in
my original sentence besides the one Stolee flagged.  It seems to be
easy to mess up a simple summary here.  :-)

But I do want a simple summary of some sort; I want
Documentation/git-sparse-checkout.txt to be an end-user guide and not
an implementation spec.  Perhaps I can bring up a simpler example that
will make it easier to see my distinction between the two -- let's
consider the case of unmerged files.  I think all of the following
statements are true, but some are meant strictly as implementation
details of relevant subcommands, while others are deduced overall
behavior observed by end-users:

* If you just ran merge or rebase and have some files with conflicts,
'git grep searchstring' will search the conflicted files for the
searchstring
* When searching the working tree, git grep should not do any special
checking for whether files are in a conflicted state
* sparse-checkout will never set the SKIP_WORKTREE bit on an unmerged
file (despite sparsity patterns)
* sparse-checkout will delete all (regular and symlink) files from the
working tree when it sets the SKIP_WORKTREE bit for them
* sparse-checkout will not delete files from the working copy if it
doesn't set the SKIP_WORKTREE bit on it
* When merging, if the merge machinery notices a conflict, it must
clear the SKIP_WORKTREE bit and write the (conflicted version of the)
file out to the working tree.  (It is also allowed to clear the
SKIP_WORKTREE bit for files that are not conflicted, though we'd
rather it didn't do that so much.)

These statements above are not incompatible, because some deal with
the implementation of git grep (the second item), others deal with
implementation details of other commands or machinery (all items after
the second), and the first item deals with the combination of
behaviors between sparse-checkout + merge machinery + grep.

So, even though the first bullet point says "git grep...will search
the conflicted files" that does NOT mean git grep should check for
whether files are conflicted.  My proposed update in v2 that I'll send
out (once I come up with one) might use similar broad brushes.

Hope that helps,
Elijah

[RFC,v2,3/4] grep: honor sparse checkout patterns

Commit Message

Comments

Patch