Message ID | e00674c7278b032b826110f33e25a5dee176c7ba.1589058209.git.matheus.bernardino@usp.br (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | grep: honor sparse checkout and add option to ignore it | expand |
Matheus Tavares <matheus.bernardino@usp.br> writes: > One of the main uses for a sparse checkout is to allow users to focus on > the subset of files in a repository in which they are interested. But > git-grep currently ignores the sparsity patterns and report all matches > found outside this subset, which kind of goes in the opposite direction. > Let's fix that, making it honor the sparsity boundaries for every > grepping case: > > - git grep in worktree > - git grep --cached > - git grep $REVISION It makes sense for these to be limited within the "sparse" area. > - git grep --untracked and git grep --no-index (which already respect > sparse checkout boundaries) I can understand the former; those untracked files are what _could_ be brought into attention by "git add", so limiting to the same "sparse" area may make sense. I am not sure about the latter, though, as "--no-index" is an explicit request to pretend that we are dealing with a random collection of files, not managed in a git repository. But perhaps there is a similar justification like how "--untracked" is unjustifiable. I dunno. > diff --git a/builtin/grep.c b/builtin/grep.c > index a5056f395a..91ee0b2734 100644 > --- a/builtin/grep.c > +++ b/builtin/grep.c > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt, > const struct pathspec *pathspec, int cached); > static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > struct tree_desc *tree, struct strbuf *base, int tn_len, > - int check_attr); > + int is_root_tree); > > static int grep_submodule(struct grep_opt *opt, > const struct pathspec *pathspec, > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt, > > for (nr = 0; nr < repo->index->cache_nr; nr++) { > const struct cache_entry *ce = repo->index->cache[nr]; > + > + if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode)) > + continue; Hmph. Why exclude gitlink from this rule? If a submodule sits at a path that is excluded by the sparse pattern, should we still recurse into it? > strbuf_setlen(&name, name_base_len); > strbuf_addstr(&name, ce->name); > > @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt, > * cache entry are identical, even if worktree file has > * been modified, so use cache version instead > */ > - if (cached || (ce->ce_flags & CE_VALID) || > - ce_skip_worktree(ce)) { > + if (cached || (ce->ce_flags & CE_VALID)) { > if (ce_stage(ce) || ce_intent_to_add(ce)) > continue; > hit |= grep_oid(opt, &ce->oid, name.buf, > @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt, > return hit; > } > > -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > - struct tree_desc *tree, struct strbuf *base, int tn_len, > - int check_attr) > +static struct pattern_list *get_sparsity_patterns(struct repository *repo) > +{ > + struct pattern_list *patterns; > + char *sparse_file; > + int sparse_config, cone_config; > + > + if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) || > + !sparse_config) { > + return NULL; > + } > + > + sparse_file = repo_git_path(repo, "info/sparse-checkout"); > + patterns = xcalloc(1, sizeof(*patterns)); > + > + if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config)) > + cone_config = 0; > + patterns->use_cone_patterns = cone_config; > + > + if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) { > + if (file_exists(sparse_file)) { > + warning(_("failed to load sparse-checkout file: '%s'"), > + sparse_file); > + } > + free(sparse_file); > + free(patterns); > + return NULL; > + } > + > + free(sparse_file); > + return patterns; > +} > + > +static int in_sparse_checkout(struct strbuf *path, int prefix_len, > + unsigned int entry_mode, > + struct index_state *istate, > + struct pattern_list *sparsity, > + enum pattern_match_result parent_match, > + enum pattern_match_result *match) > +{ > + int dtype = DT_UNKNOWN; > + > + if (S_ISGITLINK(entry_mode)) > + return 1; This is consistent with the "we do not care where a gitlink appears---submodules are always descended into, regardless of the sparse definition" decision we saw earlier, I think. I am not sure if that is a good design in the first place, though. > + if (parent_match == MATCHED_RECURSIVE) { > + *match = parent_match; > + return 1; > + } > + > + if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1])) > + strbuf_addch(path, '/'); > + > + *match = path_matches_pattern_list(path->buf, path->len, > + path->buf + prefix_len, &dtype, > + sparsity, istate); > + if (*match == UNDECIDED) > + *match = parent_match; > + > + if (S_ISDIR(entry_mode)) > + strbuf_trim_trailing_dir_sep(path); > + > + if (*match == NOT_MATCHED && (S_ISREG(entry_mode) || > + (S_ISDIR(entry_mode) && sparsity->use_cone_patterns))) > + return 0; > + > + return 1; > +} > +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > + struct tree_desc *tree, struct strbuf *base, int tn_len, > + int check_attr, struct pattern_list *sparsity, > + enum pattern_match_result default_sparsity_match) > { > struct repository *repo = opt->repo; > int hit = 0; > @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > while (tree_entry(tree, &entry)) { > int te_len = tree_entry_len(&entry); > + enum pattern_match_result sparsity_match = 0; > > if (match != all_entries_interesting) { > strbuf_addstr(&name, base->buf + tn_len); > @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > strbuf_add(base, entry.path, te_len); > > + if (sparsity) { > + struct strbuf path = STRBUF_INIT; > + strbuf_addstr(&path, base->buf + tn_len); > + > + if (!in_sparse_checkout(&path, old_baselen - tn_len, > + entry.mode, repo->index, > + sparsity, default_sparsity_match, > + &sparsity_match)) { > + strbuf_setlen(base, old_baselen); > + continue; > + } > + } OK. > if (S_ISREG(entry.mode)) { > hit |= grep_oid(opt, &entry.oid, base->buf, tn_len, > check_attr ? base->buf + tn_len : NULL); > @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > strbuf_addch(base, '/'); > init_tree_desc(&sub, data, size); > - hit |= grep_tree(opt, pathspec, &sub, base, tn_len, > - check_attr); > + hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len, > + check_attr, sparsity, sparsity_match); > free(data); > } else if (recurse_submodules && S_ISGITLINK(entry.mode)) { > hit |= grep_submodule(opt, pathspec, &entry.oid, > @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > return hit; > } > > +/* > + * Note: sparsity patterns and paths' attributes will only be considered if > + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern > + * matching on paths.) > + */ > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > + struct tree_desc *tree, struct strbuf *base, int tn_len, > + int is_root_tree) > +{ > + struct pattern_list *patterns = NULL; > + int ret; > + > + if (is_root_tree) > + patterns = get_sparsity_patterns(opt->repo); > + > + ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree, > + patterns, 0); > + > + if (patterns) { > + clear_pattern_list(patterns); > + free(patterns); > + } OK, it is not like this codepath is driven by "git log" to grep from top-level tree objects of many commits, so it is OK to grab the sparsity patterns once before do_grep_tree() and discard it when we are done. > + return ret; > +} > + > static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec, > struct object *obj, const char *name, const char *path) > { > diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh > index 37525cae3a..26852586ac 100755 > --- a/t/t7011-skip-worktree-reading.sh > +++ b/t/t7011-skip-worktree-reading.sh > @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' ' > test -z "$(git ls-files -m)" > ' > > -test_expect_success 'grep with skip-worktree file' ' > - git update-index --no-skip-worktree 1 && > - echo test > 1 && > - git update-index 1 && > - git update-index --skip-worktree 1 && > - rm 1 && > - test "$(git grep --no-ext-grep test)" = "1:test" > -' > - > echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected > test_expect_success 'diff-index does not examine skip-worktree absent entries' ' > setup_absent && > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh > new file mode 100755 > index 0000000000..3bd67082eb > --- /dev/null > +++ b/t/t7817-grep-sparse-checkout.sh > @@ -0,0 +1,140 @@ > +#!/bin/sh > + > +test_description='grep in sparse checkout > + > +This test creates a repo with the following structure: > + > +. > +|-- a > +|-- b > +|-- dir > +| `-- c > +`-- sub > + |-- A > + | `-- a > + `-- B > + `-- b > + > +Where . has non-cone mode sparsity patterns and sub is a submodule with cone > +mode sparsity patterns. The resulting sparse-checkout should leave the following > +structure: > + > +. > +|-- a > +`-- sub > + `-- B > + `-- b > +' > + > +. ./test-lib.sh > + > +test_expect_success 'setup' ' > + echo "text" >a && > + echo "text" >b && > + mkdir dir && > + echo "text" >dir/c && > + > + git init sub && > + ( > + cd sub && > + mkdir A B && > + echo "text" >A/a && > + echo "text" >B/b && > + git add A B && > + git commit -m sub && > + git sparse-checkout init --cone && > + git sparse-checkout set B > + ) && > + > + git submodule add ./sub && > + git add a b dir && > + git commit -m super && > + git sparse-checkout init --no-cone && > + git sparse-checkout set "/*" "!b" "!/*/" && > + > + git tag -am t-commit t-commit HEAD && > + tree=$(git rev-parse HEAD^{tree}) && > + git tag -am t-tree t-tree $tree && > + > + test_path_is_missing b && > + test_path_is_missing dir && > + test_path_is_missing sub/A && > + test_path_is_file a && > + test_path_is_file sub/B/b > +' > + > +test_expect_success 'grep in working tree should honor sparse checkout' ' > + cat >expect <<-EOF && > + a:text > + EOF > + git grep "text" >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'grep --cached should honor sparse checkout' ' > + cat >expect <<-EOF && > + a:text > + EOF > + git grep --cached "text" >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'grep <commit-ish> should honor sparse checkout' ' > + commit=$(git rev-parse HEAD) && > + cat >expect_commit <<-EOF && > + $commit:a:text > + EOF > + cat >expect_t-commit <<-EOF && > + t-commit:a:text > + EOF > + git grep "text" $commit >actual_commit && > + test_cmp expect_commit actual_commit && > + git grep "text" t-commit >actual_t-commit && > + test_cmp expect_t-commit actual_t-commit > +' > + > +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' ' > + commit=$(git rev-parse HEAD) && > + tree=$(git rev-parse HEAD^{tree}) && > + cat >expect_tree <<-EOF && > + $tree:a:text > + $tree:b:text > + $tree:dir/c:text > + EOF > + cat >expect_t-tree <<-EOF && > + t-tree:a:text > + t-tree:b:text > + t-tree:dir/c:text > + EOF > + git grep "text" $tree >actual_tree && > + test_cmp expect_tree actual_tree && > + git grep "text" t-tree >actual_t-tree && > + test_cmp expect_t-tree actual_t-tree > +' > + > +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' ' > + cat >expect <<-EOF && > + a:text > + sub/B/b:text > + EOF > + git grep --recurse-submodules --cached "text" >actual && > + test_cmp expect actual > +' > + > +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' ' > + commit=$(git rev-parse HEAD) && > + cat >expect_commit <<-EOF && > + $commit:a:text > + $commit:sub/B/b:text > + EOF > + cat >expect_t-commit <<-EOF && > + t-commit:a:text > + t-commit:sub/B/b:text > + EOF > + git grep --recurse-submodules "text" $commit >actual_commit && > + test_cmp expect_commit actual_commit && > + git grep --recurse-submodules "text" t-commit >actual_t-commit && > + test_cmp expect_t-commit actual_t-commit > +' > + > +test_done
On Mon, May 11, 2020 at 4:35 PM Junio C Hamano <gitster@pobox.com> wrote: > > Matheus Tavares <matheus.bernardino@usp.br> writes: > > > One of the main uses for a sparse checkout is to allow users to focus on > > the subset of files in a repository in which they are interested. But > > git-grep currently ignores the sparsity patterns and report all matches > > found outside this subset, which kind of goes in the opposite direction. > > Let's fix that, making it honor the sparsity boundaries for every > > grepping case: > > > > - git grep in worktree > > - git grep --cached > > - git grep $REVISION > > It makes sense for these to be limited within the "sparse" area. > > > - git grep --untracked and git grep --no-index (which already respect > > sparse checkout boundaries) > > I can understand the former; those untracked files are what _could_ > be brought into attention by "git add", so limiting to the same > "sparse" area may make sense. > > I am not sure about the latter, though, as "--no-index" is an > explicit request to pretend that we are dealing with a random > collection of files, not managed in a git repository. But perhaps > there is a similar justification like how "--untracked" is > unjustifiable. I dunno. Yeah, I think there was no need to mention those two cases here. My intention was to say that, in these cases, we should stick to the files that are present in the working tree (which should match the sparsity patterns + untracked {and ignored, in --no-index}), as opposed to how the worktree grep used to behave until now, falling back to the cache on files excluded by the sparse checkout. > > diff --git a/builtin/grep.c b/builtin/grep.c > > index a5056f395a..91ee0b2734 100644 > > --- a/builtin/grep.c > > +++ b/builtin/grep.c > > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt, > > const struct pathspec *pathspec, int cached); > > static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > struct tree_desc *tree, struct strbuf *base, int tn_len, > > - int check_attr); > > + int is_root_tree); > > > > static int grep_submodule(struct grep_opt *opt, > > const struct pathspec *pathspec, > > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt, > > > > for (nr = 0; nr < repo->index->cache_nr; nr++) { > > const struct cache_entry *ce = repo->index->cache[nr]; > > + > > + if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode)) > > + continue; > > Hmph. Why exclude gitlink from this rule? If a submodule sits at a > path that is excluded by the sparse pattern, should we still recurse > into it? The idea behind not skipping gitlinks here was to be compliant with what we have in the working tree. In 4fd683b ("sparse-checkout: document interactions with submodules"), we decided that, if the sparse-checkout patterns exclude a submodule, the submodule would still appear in the working tree. The purpose was to keep these features (submodules and sparse-checkout) independent. Along the same lines, I think we should always recurse into initialized submodules in grep, and then load their own sparsity patterns, to decide what should be grepped within. [...] > > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > + struct tree_desc *tree, struct strbuf *base, int tn_len, > > + int is_root_tree) > > +{ > > + struct pattern_list *patterns = NULL; > > + int ret; > > + > > + if (is_root_tree) > > + patterns = get_sparsity_patterns(opt->repo); > > + > > + ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree, > > + patterns, 0); > > + > > + if (patterns) { > > + clear_pattern_list(patterns); > > + free(patterns); > > + } > > OK, it is not like this codepath is driven by "git log" to grep from > top-level tree objects of many commits, so it is OK to grab the > sparsity patterns once before do_grep_tree() and discard it when we > are done. Yeah. A possible performance problem here would be when users pass many trees to git-grep (since we are reloading the pattern lists, from both the_repository and submodules, for each tree). But, as Elijah pointed out [1], the cases where this overhead might be somewhat noticeable should be very rare. [1]: https://lore.kernel.org/git/CABPp-BGUf-4exGW23xka1twf2D=nFOz1CkD_f-rDX_AGdVEeDA@mail.gmail.com/
Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: > The idea behind not skipping gitlinks here was to be compliant with > what we have in the working tree. In 4fd683b ("sparse-checkout: > document interactions with submodules"), we decided that, if the > sparse-checkout patterns exclude a submodule, the submodule would > still appear in the working tree. The purpose was to keep these > features (submodules and sparse-checkout) independent. Along the same > lines, I think we should always recurse into initialized submodules in > grep, and then load their own sparsity patterns, to decide what should > be grepped within. OK. I do not necessarily agree with the justification described in 4fd683b (e.g. "would easily cause problems." that is not substantiated is merely an opinion), but I do agree with you that the new code in "git grep" we are discussing here does behave in line with that design. Thanks.
On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote: > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: > > > The idea behind not skipping gitlinks here was to be compliant with > > what we have in the working tree. In 4fd683b ("sparse-checkout: > > document interactions with submodules"), we decided that, if the > > sparse-checkout patterns exclude a submodule, the submodule would > > still appear in the working tree. The purpose was to keep these > > features (submodules and sparse-checkout) independent. Along the same > > lines, I think we should always recurse into initialized submodules in Sorry if I missed it in the code, but do you check whether the submodule is initialized before descending into it, or do you descend into it based on it just being a submodule? > > grep, and then load their own sparsity patterns, to decide what should > > be grepped within. > > OK. > > I do not necessarily agree with the justification described in > 4fd683b (e.g. "would easily cause problems." that is not > substantiated is merely an opinion), but I do agree with you that > the new code in "git grep" we are discussing here does behave in > line with that design. > > Thanks. I'm also a little worried by 4fd683b; are we headed towards a circular reasoning of some sort? In particular, sparse-checkout was written assuming submodules might already be checked out. I can see how un-checking-out an existing submodule could raise fears of losing untracked or ignored files within it, or stuff stored on other branches, etc. But that's not the only relevant case. What if someone runs: git clone --recurse-submodules --sparse=moduleA git.hosting.site:my/repo.git In such a case, we don't have already checked out submodules. Obviously, we should clone submodules that are within our sparsity paths. But should we automatically clone the submodules outside our sparsity paths? The the logic presented in 4fd683b makes this completely ambiguous. ("It will appear if it's initialized." Okay, but do we initialize it?) You may say that clone doesn't have a --sparse= flag right now. So let me change the example slightly. What if someone runs git checkout --recurse-submodules $otherBranch and $otherBranch adds a new submodule somewhere deep under a directory excluded by the sparsity patterns (i.e. deep within a directory we aren't interested in and don't have checked out). Should the submodule be checked out, i.e. should it be initialized? Commit 4fd683b only says it will appear if it's initialized, but my whole question is should we initialize it?
On Mon, May 11, 2020 at 12:35 PM Junio C Hamano <gitster@pobox.com> wrote: > > Matheus Tavares <matheus.bernardino@usp.br> writes: > > > One of the main uses for a sparse checkout is to allow users to focus on > > the subset of files in a repository in which they are interested. But > > git-grep currently ignores the sparsity patterns and report all matches > > found outside this subset, which kind of goes in the opposite direction. > > Let's fix that, making it honor the sparsity boundaries for every > > grepping case: > > > > - git grep in worktree > > - git grep --cached > > - git grep $REVISION > > It makes sense for these to be limited within the "sparse" area. > > > - git grep --untracked and git grep --no-index (which already respect > > sparse checkout boundaries) > > I can understand the former; those untracked files are what _could_ > be brought into attention by "git add", so limiting to the same > "sparse" area may make sense. > > I am not sure about the latter, though, as "--no-index" is an > explicit request to pretend that we are dealing with a random > collection of files, not managed in a git repository. But perhaps > there is a similar justification like how "--untracked" is > unjustifiable. I dunno. I don't think it makes sense for sparsity patterns to affect either. sparsity patterns are a way of splitting "tracked" files into two subsets (those matching the sparsity paths and those that don't). Therefore, flags that are about searching things that aren't tracked, clearly don't have anything to do with sparsity patterns. However, I think this was just a wording issue; in the subsequent commit Matheus made it clear that he's not modifying the behavior of grep --untracked or grep --no-index based on the presence or absence of sparsity patterns. > > diff --git a/builtin/grep.c b/builtin/grep.c > > index a5056f395a..91ee0b2734 100644 > > --- a/builtin/grep.c > > +++ b/builtin/grep.c > > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt, > > const struct pathspec *pathspec, int cached); > > static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > struct tree_desc *tree, struct strbuf *base, int tn_len, > > - int check_attr); > > + int is_root_tree); > > > > static int grep_submodule(struct grep_opt *opt, > > const struct pathspec *pathspec, > > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt, > > > > for (nr = 0; nr < repo->index->cache_nr; nr++) { > > const struct cache_entry *ce = repo->index->cache[nr]; > > + > > + if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode)) > > + continue; > > Hmph. Why exclude gitlink from this rule? If a submodule sits at a > path that is excluded by the sparse pattern, should we still recurse > into it? That bothers me too. > > strbuf_setlen(&name, name_base_len); > > strbuf_addstr(&name, ce->name); > > > > @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt, > > * cache entry are identical, even if worktree file has > > * been modified, so use cache version instead > > */ > > - if (cached || (ce->ce_flags & CE_VALID) || > > - ce_skip_worktree(ce)) { > > + if (cached || (ce->ce_flags & CE_VALID)) { > > if (ce_stage(ce) || ce_intent_to_add(ce)) > > continue; > > hit |= grep_oid(opt, &ce->oid, name.buf, > > @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt, > > return hit; > > } > > > > -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > - struct tree_desc *tree, struct strbuf *base, int tn_len, > > - int check_attr) > > +static struct pattern_list *get_sparsity_patterns(struct repository *repo) > > +{ > > + struct pattern_list *patterns; > > + char *sparse_file; > > + int sparse_config, cone_config; > > + > > + if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) || > > + !sparse_config) { > > + return NULL; > > + } > > + > > + sparse_file = repo_git_path(repo, "info/sparse-checkout"); > > + patterns = xcalloc(1, sizeof(*patterns)); > > + > > + if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config)) > > + cone_config = 0; > > + patterns->use_cone_patterns = cone_config; > > + > > + if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) { > > + if (file_exists(sparse_file)) { > > + warning(_("failed to load sparse-checkout file: '%s'"), > > + sparse_file); > > + } > > + free(sparse_file); > > + free(patterns); > > + return NULL; > > + } > > + > > + free(sparse_file); > > + return patterns; > > +} > > + > > +static int in_sparse_checkout(struct strbuf *path, int prefix_len, > > + unsigned int entry_mode, > > + struct index_state *istate, > > + struct pattern_list *sparsity, > > + enum pattern_match_result parent_match, > > + enum pattern_match_result *match) > > +{ > > + int dtype = DT_UNKNOWN; > > + > > + if (S_ISGITLINK(entry_mode)) > > + return 1; > > This is consistent with the "we do not care where a gitlink > appears---submodules are always descended into, regardless of the > sparse definition" decision we saw earlier, I think. I am not sure > if that is a good design in the first place, though. > > > + if (parent_match == MATCHED_RECURSIVE) { > > + *match = parent_match; > > + return 1; > > + } > > + > > + if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1])) > > + strbuf_addch(path, '/'); > > + > > + *match = path_matches_pattern_list(path->buf, path->len, > > + path->buf + prefix_len, &dtype, > > + sparsity, istate); > > + if (*match == UNDECIDED) > > + *match = parent_match; > > + > > + if (S_ISDIR(entry_mode)) > > + strbuf_trim_trailing_dir_sep(path); > > + > > + if (*match == NOT_MATCHED && (S_ISREG(entry_mode) || > > + (S_ISDIR(entry_mode) && sparsity->use_cone_patterns))) > > + return 0; > > + > > + return 1; > > +} > > > > > +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > + struct tree_desc *tree, struct strbuf *base, int tn_len, > > + int check_attr, struct pattern_list *sparsity, > > + enum pattern_match_result default_sparsity_match) > > { > > struct repository *repo = opt->repo; > > int hit = 0; > > @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > > > while (tree_entry(tree, &entry)) { > > int te_len = tree_entry_len(&entry); > > + enum pattern_match_result sparsity_match = 0; > > > > if (match != all_entries_interesting) { > > strbuf_addstr(&name, base->buf + tn_len); > > @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > > > strbuf_add(base, entry.path, te_len); > > > > + if (sparsity) { > > + struct strbuf path = STRBUF_INIT; > > + strbuf_addstr(&path, base->buf + tn_len); > > + > > + if (!in_sparse_checkout(&path, old_baselen - tn_len, > > + entry.mode, repo->index, > > + sparsity, default_sparsity_match, > > + &sparsity_match)) { > > + strbuf_setlen(base, old_baselen); > > + continue; > > + } > > + } > > OK. > > > if (S_ISREG(entry.mode)) { > > hit |= grep_oid(opt, &entry.oid, base->buf, tn_len, > > check_attr ? base->buf + tn_len : NULL); > > @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > > > strbuf_addch(base, '/'); > > init_tree_desc(&sub, data, size); > > - hit |= grep_tree(opt, pathspec, &sub, base, tn_len, > > - check_attr); > > + hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len, > > + check_attr, sparsity, sparsity_match); > > free(data); > > } else if (recurse_submodules && S_ISGITLINK(entry.mode)) { > > hit |= grep_submodule(opt, pathspec, &entry.oid, > > @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > return hit; > > } > > > > +/* > > + * Note: sparsity patterns and paths' attributes will only be considered if > > + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern > > + * matching on paths.) > > + */ > > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, > > + struct tree_desc *tree, struct strbuf *base, int tn_len, > > + int is_root_tree) > > +{ > > + struct pattern_list *patterns = NULL; > > + int ret; > > + > > + if (is_root_tree) > > + patterns = get_sparsity_patterns(opt->repo); > > + > > + ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree, > > + patterns, 0); > > + > > + if (patterns) { > > + clear_pattern_list(patterns); > > + free(patterns); > > + } > > OK, it is not like this codepath is driven by "git log" to grep from > top-level tree objects of many commits, so it is OK to grab the > sparsity patterns once before do_grep_tree() and discard it when we > are done. > > > + return ret; > > +} > > + > > > static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec, > > struct object *obj, const char *name, const char *path) > > { > > diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh > > index 37525cae3a..26852586ac 100755 > > --- a/t/t7011-skip-worktree-reading.sh > > +++ b/t/t7011-skip-worktree-reading.sh > > @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' ' > > test -z "$(git ls-files -m)" > > ' > > > > -test_expect_success 'grep with skip-worktree file' ' > > - git update-index --no-skip-worktree 1 && > > - echo test > 1 && > > - git update-index 1 && > > - git update-index --skip-worktree 1 && > > - rm 1 && > > - test "$(git grep --no-ext-grep test)" = "1:test" > > -' > > - > > echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected > > test_expect_success 'diff-index does not examine skip-worktree absent entries' ' > > setup_absent && > > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh > > new file mode 100755 > > index 0000000000..3bd67082eb > > --- /dev/null > > +++ b/t/t7817-grep-sparse-checkout.sh > > @@ -0,0 +1,140 @@ > > +#!/bin/sh > > + > > +test_description='grep in sparse checkout > > + > > +This test creates a repo with the following structure: > > + > > +. > > +|-- a > > +|-- b > > +|-- dir > > +| `-- c > > +`-- sub > > + |-- A > > + | `-- a > > + `-- B > > + `-- b > > + > > +Where . has non-cone mode sparsity patterns and sub is a submodule with cone > > +mode sparsity patterns. The resulting sparse-checkout should leave the following > > +structure: > > + > > +. > > +|-- a > > +`-- sub > > + `-- B > > + `-- b > > +' > > + > > +. ./test-lib.sh > > + > > +test_expect_success 'setup' ' > > + echo "text" >a && > > + echo "text" >b && > > + mkdir dir && > > + echo "text" >dir/c && > > + > > + git init sub && > > + ( > > + cd sub && > > + mkdir A B && > > + echo "text" >A/a && > > + echo "text" >B/b && > > + git add A B && > > + git commit -m sub && > > + git sparse-checkout init --cone && > > + git sparse-checkout set B > > + ) && > > + > > + git submodule add ./sub && > > + git add a b dir && > > + git commit -m super && > > + git sparse-checkout init --no-cone && > > + git sparse-checkout set "/*" "!b" "!/*/" && > > + > > + git tag -am t-commit t-commit HEAD && > > + tree=$(git rev-parse HEAD^{tree}) && > > + git tag -am t-tree t-tree $tree && > > + > > + test_path_is_missing b && > > + test_path_is_missing dir && > > + test_path_is_missing sub/A && > > + test_path_is_file a && > > + test_path_is_file sub/B/b > > +' > > + > > +test_expect_success 'grep in working tree should honor sparse checkout' ' > > + cat >expect <<-EOF && > > + a:text > > + EOF > > + git grep "text" >actual && > > + test_cmp expect actual > > +' > > + > > +test_expect_success 'grep --cached should honor sparse checkout' ' > > + cat >expect <<-EOF && > > + a:text > > + EOF > > + git grep --cached "text" >actual && > > + test_cmp expect actual > > +' > > + > > +test_expect_success 'grep <commit-ish> should honor sparse checkout' ' > > + commit=$(git rev-parse HEAD) && > > + cat >expect_commit <<-EOF && > > + $commit:a:text > > + EOF > > + cat >expect_t-commit <<-EOF && > > + t-commit:a:text > > + EOF > > + git grep "text" $commit >actual_commit && > > + test_cmp expect_commit actual_commit && > > + git grep "text" t-commit >actual_t-commit && > > + test_cmp expect_t-commit actual_t-commit > > +' > > + > > +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' ' > > + commit=$(git rev-parse HEAD) && > > + tree=$(git rev-parse HEAD^{tree}) && > > + cat >expect_tree <<-EOF && > > + $tree:a:text > > + $tree:b:text > > + $tree:dir/c:text > > + EOF > > + cat >expect_t-tree <<-EOF && > > + t-tree:a:text > > + t-tree:b:text > > + t-tree:dir/c:text > > + EOF > > + git grep "text" $tree >actual_tree && > > + test_cmp expect_tree actual_tree && > > + git grep "text" t-tree >actual_t-tree && > > + test_cmp expect_t-tree actual_t-tree > > +' > > + > > +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' ' > > + cat >expect <<-EOF && > > + a:text > > + sub/B/b:text > > + EOF > > + git grep --recurse-submodules --cached "text" >actual && > > + test_cmp expect actual > > +' > > + > > +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' ' > > + commit=$(git rev-parse HEAD) && > > + cat >expect_commit <<-EOF && > > + $commit:a:text > > + $commit:sub/B/b:text > > + EOF > > + cat >expect_t-commit <<-EOF && > > + t-commit:a:text > > + t-commit:sub/B/b:text > > + EOF > > + git grep --recurse-submodules "text" $commit >actual_commit && > > + test_cmp expect_commit actual_commit && > > + git grep --recurse-submodules "text" t-commit >actual_t-commit && > > + test_cmp expect_t-commit actual_t-commit > > +' > > + > > +test_done
On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote: > > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: > > > > > The idea behind not skipping gitlinks here was to be compliant with > > > what we have in the working tree. In 4fd683b ("sparse-checkout: > > > document interactions with submodules"), we decided that, if the > > > sparse-checkout patterns exclude a submodule, the submodule would > > > still appear in the working tree. The purpose was to keep these > > > features (submodules and sparse-checkout) independent. Along the same > > > lines, I think we should always recurse into initialized submodules in > > Sorry if I missed it in the code, but do you check whether the > submodule is initialized before descending into it, or do you descend > into it based on it just being a submodule? We only descend if the submodule is initialized. The new code in this patch doesn't do this check, but it is already implemented in grep_submodule() (which is called by grep_tree() and grep_cache() when a submodule is found).
On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote: > > > > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: > > > > > > > The idea behind not skipping gitlinks here was to be compliant with > > > > what we have in the working tree. In 4fd683b ("sparse-checkout: > > > > document interactions with submodules"), we decided that, if the > > > > sparse-checkout patterns exclude a submodule, the submodule would > > > > still appear in the working tree. The purpose was to keep these > > > > features (submodules and sparse-checkout) independent. Along the same > > > > lines, I think we should always recurse into initialized submodules in > > > > Sorry if I missed it in the code, but do you check whether the > > submodule is initialized before descending into it, or do you descend > > into it based on it just being a submodule? > > We only descend if the submodule is initialized. The new code in this > patch doesn't do this check, but it is already implemented in > grep_submodule() (which is called by grep_tree() and grep_cache() when > a submodule is found). Good to know. To up the ante a bit: What if another branch has directory that doesn't exist in HEAD or the current checkout, and within that directory is a submodule. Would it be recursed into? What if it matched the sparsity paths? (Is it even possible to recurse into it?)
On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote: > > On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino > <matheus.bernardino@usp.br> wrote: > > > > On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote: > > > > > > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > > > > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes: > > > > > > > > > The idea behind not skipping gitlinks here was to be compliant with > > > > > what we have in the working tree. In 4fd683b ("sparse-checkout: > > > > > document interactions with submodules"), we decided that, if the > > > > > sparse-checkout patterns exclude a submodule, the submodule would > > > > > still appear in the working tree. The purpose was to keep these > > > > > features (submodules and sparse-checkout) independent. Along the same > > > > > lines, I think we should always recurse into initialized submodules in > > > > > > Sorry if I missed it in the code, but do you check whether the > > > submodule is initialized before descending into it, or do you descend > > > into it based on it just being a submodule? > > > > We only descend if the submodule is initialized. The new code in this > > patch doesn't do this check, but it is already implemented in > > grep_submodule() (which is called by grep_tree() and grep_cache() when > > a submodule is found). > > Good to know. To up the ante a bit: What if another branch has > directory that doesn't exist in HEAD or the current checkout, and > within that directory is a submodule. Would it be recursed into? In this case, `git grep --recurse-submodules <pattern> $branch` will recurse into the submodule, but only if it has already been initialized. I.e. if we have checked out to $branch, ran `git submodule init` and then checked out back. > What if it matched the sparsity paths? (Is it even possible to > recurse into it?) That's a great question. The idea that I tried to implement is to always recurse into _initialized_ submodules (even the ones excluded by the superproject's sparsity patterns) and, then, follow their own sparsity patterns inside. I'm not necessarily in favor (or against) this behavior, but this seemed to be the most compatible way with the design we describe in our docs: "If your sparse-checkout patterns exclude an initialized submodule, then that submodule will still appear in your working directory." (in git-sparse-checkout.txt) So, back to the original question, if you run `git grep --recurse-submodules <pattern> $branch` and $branch contains a submodule which was previously initialized, git-grep _would_ recurse into it, even if it (or its parent dir) was excluded. However, your question helped me notice an inconsistency in my patch: the behavior I just described is working for the full pattern set, but not in cone mode. That's because, in cone mode, we can mark the whole submodule's parent dir as excluded. Then, path_matches_pattern_list() will return NOT_MATCHED for the parent dir and we won't recurse into it, so we won't even get to the submodule's path to discover that it refers to a gitlink. Therefore, if we decide to keep the behavior of always recursing into submodules, we will need some extra work for the cone mode. I.e. grep_tree() will have to check if NOT_MATCHED directories contain submodules before discarding them, and recurse only into the submodules if so. As for the implementation, the first idea that came to my mind was to list the submodules' pathnames and do prefix matching for each submodule and NOT_MATCHED dir. But the places I've seen such submodule listings in the code base so far [1] seem to work only in the current branch. My second idea was to continue the tree walk when we hit NOT_MATCHED dir entries, but not doing any work, just looking for possible gitlinks to recurse into. I'm not sure if that could negatively affect the execution time, though. Does this seem like a good approach? Or is there another solution that I have not considered? Or even further, should we choose to skip the submodules in excluded paths? My only concern in this case is that it would be contrary to the design in git-sparse-checkout.txt. And the working tree grep and cached grep would differ even on a clean working tree. [1]: builtin/submodule--helper.c:module_list_compute() and submodule-config.c:config_from_gitmodules()
Hi Matheus, On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote: > > <snip> > > Good to know. Â To up the ante a bit: What if another branch has > > directory that doesn't exist in HEAD or the current checkout, and > > within that directory is a submodule. Â Would it be recursed into? > > In this case, `git grep --recurse-submodules <pattern> $branch` will > recurse into the submodule, but only if it has already been > initialized. I.e. if we have checked out to $branch, ran `git > submodule init` and then checked out back. > > > What if it matched the sparsity paths? Â (Is it even possible to > > recurse into it?) > > That's a great question. The idea that I tried to implement is to > always recurse into _initialized_ submodules (even the ones excluded > by the superproject's sparsity patterns) and, then, follow their own > sparsity patterns inside. I'm not necessarily in favor (or against) > this behavior, but this seemed to be the most compatible way with the > design we describe in our docs: > > "If your sparse-checkout patterns exclude an initialized submodule, > then that submodule will still appear in your working directory." (in > git-sparse-checkout.txt) > > So, back to the original question, if you run `git grep > --recurse-submodules <pattern> $branch` and $branch contains a > submodule which was previously initialized, git-grep _would_ recurse > into it, even if it (or its parent dir) was excluded. However, your > question helped me notice an inconsistency in my patch: the behavior I > just described is working for the full pattern set, but not in cone > mode. That's because, in cone mode, we can mark the whole submodule's > parent dir as excluded. Then, path_matches_pattern_list() will return > NOT_MATCHED for the parent dir and we won't recurse into it, so we > won't even get to the submodule's path to discover that it refers to a > gitlink. > > Therefore, if we decide to keep the behavior of always recursing into > submodules, we will need some extra work for the cone mode. I.e. > grep_tree() will have to check if NOT_MATCHED directories contain > submodules before discarding them, and recurse only into the > submodules if so. As for the implementation, the first idea that came > to my mind was to list the submodules' pathnames and do prefix > matching for each submodule and NOT_MATCHED dir. But the places I've > seen such submodule listings in the code base so far [1] seem to work > only in the current branch. My second idea was to continue the tree > walk when we hit NOT_MATCHED dir entries, but not doing any work, just > looking for possible gitlinks to recurse into. I'm not sure if that > could negatively affect the execution time, though. > > Does this seem like a good approach? Or is there another solution that > I have not considered? Or even further, should we choose to skip the > submodules in excluded paths? My only concern in this case is that it > would be contrary to the design in git-sparse-checkout.txt. And the > working tree grep and cached grep would differ even on a clean working > tree. To be honest, I think it sounds insane. What you propose does make sense if you take what was written in git-sparse-checkout.txt very literally and as though it was a core design principle meant to cover all cases but I do not think it merits such a standing at all. I think it should be treated as a first draft attempt to explain interactions that was written solely with the 'checkout' case in mind, especially since it was written at the same approximate time that this was written earlier in the same file: THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE. Anyway, the wording in that file seems to be really important, so let's fix it. -- >8 -- Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules Ignoring the sparse-checkout feature momentarily, if one has a submodule and creates local branches within it with unpushed changes and maybe adds some untracked files to it, then we would want to avoid accidentally removing such a submodule. So, for example with git.git, if you run git checkout v2.13.0 then the sha1collisiondetection/ submodule is NOT removed even though it did not exist as a submodule until v2.14.0. Similarly, if you only had v2.13.0 checked out previously and ran git checkout v2.14.0 the sha1collisiondetection/ submodule would NOT be automatically initialized despite being part of v2.14.0. In both cases, git requires submodules to be initialized or deinitialized separately. Further, we also have special handling for submodules in other commands such as clean, which requires two --force flags to delete untracked submodules, and some commands have a --recurse-submodules flag. sparse-checkout is very similar to checkout, as evidenced by the similar name -- it adds and removes files from the working copy. However, for the same avoid-data-loss reasons we do not want to remove a submodule from the working copy with checkout, we do not want to do it with sparse-checkout either. So submodules need to be separately initialized or deinitialized; changing sparse-checkout rules should not automatically trigger the removal or vivification of submodules. I believe the previous wording in git-sparse-checkout.txt about submodules was only about this particular issue. Unfortunately, the previous wording could be interpreted to imply that submodules should be considered active regardless of sparsity patterns. Update the wording to avoid making such an implication. It may be helpful to consider two example situations where the differences in wording become important: In the future, we want users to be able to run commands like git clone --sparse=moduleA --recurse-submodules $REPO_URL and have sparsity paths automatically set up and have submodules *within the sparsity paths* be automatically initialized. We do not want all submodules in any path to be automatically initialized with that command. Similarly, we want to be able to do things like git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN and search through $REV for $PATTERN within the recorded sparsity patterns. We want it to recurse into submodules within those sparsity patterns, but do not want to recurse into directories that do not match the sparsity patterns in search of a possible submodule. Signed-off-by: Elijah Newren <newren@gmail.com> --- Documentation/git-sparse-checkout.txt | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt index c0342e5393..7dde2d330c 100644 --- a/Documentation/git-sparse-checkout.txt +++ b/Documentation/git-sparse-checkout.txt @@ -190,10 +190,23 @@ directory. SUBMODULES ---------- -If your repository contains one or more submodules, then those submodules will -appear based on which you initialized with the `git submodule` command. If -your sparse-checkout patterns exclude an initialized submodule, then that -submodule will still appear in your working directory. +If your repository contains one or more submodules, then those submodules +will appear based on which you initialized with the `git submodule` +command. Submodules may have additional untracked files or code stored on +other branches, so to avoid data loss, changing sparse inclusion/exclusion +rules will not cause an already checked out submodule to be removed from +the working copy. Said another way, just as `checkout` will not cause +submodules to be automatically removed or initialized even when switching +between branches that remove or add submodules, using `sparse-checkout` to +reduce or expand the scope of "interesting" files will not cause submodules +to be automatically deinitialized or initialized either. Adding or +removing them must be done as a separate step with `git submodule init` or +`git submodule deinit`. + +This may mean that even if your sparsity patterns include or exclude +submodules, until you manually initialize or deinitialize them, commands +like grep that work on tracked files in the working copy will ignore "not +yet initialized" submodules and pay attention to "left behind" ones. SEE ALSO
On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote: > > Hi Matheus, > > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote: > > > <snip> > > Does this seem like a good approach? Or is there another solution that > > I have not considered? Or even further, should we choose to skip the > > submodules in excluded paths? My only concern in this case is that it > > would be contrary to the design in git-sparse-checkout.txt. And the > > working tree grep and cached grep would differ even on a clean working > > tree. > <snip> > Anyway, the wording in that file seems to be really important, so > let's fix it. > Let me also try to give a concrete proposal for grep behavior for the edge cases we've discussed: git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN This goes through all the files in the index (i.e. all tracked files) which do not have the SKIP_WORKTREE bit set. For each of these: If the file is a symlink, ignore it (like grep currently does). If the file is a regular file and is present in the working copy, search it. If the file is a submodule and it is initialized, recurse into it. git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN This goes through all the files in the index (i.e. all tracked files) which do not have the SKIP_WORKTREE bit set. For each of these: Skip symlinks. Search regular files. Recurse into submodules if they are initialized. git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN This goes through all the files in the given revision (i.e. all tracked files) which match the sparsity patterns (i.e. that would not have the SKIP_WORKTREE bit set if were we to checkout that commit). For each of these: Skip symlinks. Search regular files. Recurse into submodules if they are initialized. Further, for any of these, when recursing into submodules, make sure to load that submodules' core.sparseCheckout setting (and related settings) and the submodules' sparsity patterns, if any. Sound good? I think this addresses the edge cases we've discussed so far: interaction between submodules and sparsity patterns, and handling of files that are still present despite not matching the sparsity patterns. (Also note that files which are present-despite-the-rules are prone to be removed by the next `git sparse-checkout reapply` or anything that triggers a call to unpack_trees(); there's already multiple things that do and Stolee's proposed patches would add more). If I've missed edge cases, let me know. Elijah
Hi, Elijah On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote: > > On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote: > > > > Hi Matheus, > > > > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > > > > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote: > > > > > <snip> > > > Does this seem like a good approach? Or is there another solution that > > > I have not considered? Or even further, should we choose to skip the > > > submodules in excluded paths? My only concern in this case is that it > > > would be contrary to the design in git-sparse-checkout.txt. And the > > > working tree grep and cached grep would differ even on a clean working > > > tree. > > > <snip> > > Anyway, the wording in that file seems to be really important, so > > let's fix it. > > > > Let me also try to give a concrete proposal for grep behavior for the > edge cases we've discussed: Thank you for this proposal and for the previous comments as well. > git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN > > This goes through all the files in the index (i.e. all tracked files) > which do not have the SKIP_WORKTREE bit set. For each of these: If > the file is a symlink, ignore it (like grep currently does). If the > file is a regular file and is present in the working copy, search it. > If the file is a submodule and it is initialized, recurse into it. Sounds good. And when sparse.restrictCmds=false, we also search the present files and present initialized submodules that have the SKIP_WORKTREE set, right? > git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN > > This goes through all the files in the index (i.e. all tracked files) > which do not have the SKIP_WORKTREE bit set. For each of these: Skip > symlinks. Search regular files. Recurse into submodules if they are > initialized. OK. > git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN > > This goes through all the files in the given revision (i.e. all > tracked files) which match the sparsity patterns (i.e. that would not > have the SKIP_WORKTREE bit set if were we to checkout that commit). > For each of these: Skip symlinks. Search regular files. Recurse into > submodules if they are initialized. OK. > Further, for any of these, when recursing into submodules, make sure > to load that submodules' core.sparseCheckout setting (and related > settings) and the submodules' sparsity patterns, if any. > > Sound good? > > I think this addresses the edge cases we've discussed so far: > interaction between submodules and sparsity patterns, and handling of > files that are still present despite not matching the sparsity > patterns. (Also note that files which are present-despite-the-rules > are prone to be removed by the next `git sparse-checkout reapply` or > anything that triggers a call to unpack_trees(); there's already > multiple things that do and Stolee's proposed patches would add more). > If I've missed edge cases, let me know. Sounds great. This addresses all the edge cases we've mentioned before. Thanks again for the detailed proposal, and for considering case by case.
On Fri, May 22, 2020 at 1:54 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > Hi, Elijah > > On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote: > > > > On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote: > > > > > > Hi Matheus, > > > > > > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > > > > > > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote: > > > > > > > <snip> > > > > Does this seem like a good approach? Or is there another solution that > > > > I have not considered? Or even further, should we choose to skip the > > > > submodules in excluded paths? My only concern in this case is that it > > > > would be contrary to the design in git-sparse-checkout.txt. And the > > > > working tree grep and cached grep would differ even on a clean working > > > > tree. > > > > > <snip> > > > Anyway, the wording in that file seems to be really important, so > > > let's fix it. > > > > > > > Let me also try to give a concrete proposal for grep behavior for the > > edge cases we've discussed: > > Thank you for this proposal and for the previous comments as well. > > > git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN > > > > This goes through all the files in the index (i.e. all tracked files) > > which do not have the SKIP_WORKTREE bit set. For each of these: If > > the file is a symlink, ignore it (like grep currently does). If the > > file is a regular file and is present in the working copy, search it. > > If the file is a submodule and it is initialized, recurse into it. > > Sounds good. And when sparse.restrictCmds=false, we also search the > present files and present initialized submodules that have the > SKIP_WORKTREE set, right? You're really pushing those corner cases, I love it. :-) SKIP_WORKTREE is supposed to mean we have removed it from the working tree, i.e. it shouldn't be present (if we decide we're not going to remove it from the working tree, e.g. because the file is unmerged or something, then we don't mark it as SKIP_WORKTREE even if it doesn't match sparsity patterns). Therefore, the set of files that satisfy this condition you have given should generally be empty. But presuming we hit this corner case, I'd say you are right. sparse.restrictCmds=false means we ignore the SKIP_WORKTREE bit entirely (and in the case of grepping a $REVISION, we ignore the sparsity patterns entirely). > > git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN > > > > This goes through all the files in the index (i.e. all tracked files) > > which do not have the SKIP_WORKTREE bit set. For each of these: Skip > > symlinks. Search regular files. Recurse into submodules if they are > > initialized. > > OK. > > > git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN > > > > This goes through all the files in the given revision (i.e. all > > tracked files) which match the sparsity patterns (i.e. that would not > > have the SKIP_WORKTREE bit set if were we to checkout that commit). > > For each of these: Skip symlinks. Search regular files. Recurse into > > submodules if they are initialized. > > OK. > > > Further, for any of these, when recursing into submodules, make sure > > to load that submodules' core.sparseCheckout setting (and related > > settings) and the submodules' sparsity patterns, if any. > > > > Sound good? > > > > I think this addresses the edge cases we've discussed so far: > > interaction between submodules and sparsity patterns, and handling of > > files that are still present despite not matching the sparsity > > patterns. (Also note that files which are present-despite-the-rules > > are prone to be removed by the next `git sparse-checkout reapply` or > > anything that triggers a call to unpack_trees(); there's already > > multiple things that do and Stolee's proposed patches would add more). > > If I've missed edge cases, let me know. > > Sounds great. This addresses all the edge cases we've mentioned > before. Thanks again for the detailed proposal, and for considering > case by case. And thank you for working on this. :-)
On 5/22/2020 10:26 AM, Elijah Newren wrote: Sorry I missed this patch. I was searching all over for patches with "sparse" or "submodule" in the _subject_. Thanks for calling out the need for review, Junio! > Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules > > Ignoring the sparse-checkout feature momentarily, if one has a submodule and > creates local branches within it with unpushed changes and maybe adds some > untracked files to it, then we would want to avoid accidentally removing such > a submodule. So, for example with git.git, if you run > git checkout v2.13.0 > then the sha1collisiondetection/ submodule is NOT removed even though it > did not exist as a submodule until v2.14.0. Similarly, if you only had > v2.13.0 checked out previously and ran > git checkout v2.14.0 > the sha1collisiondetection/ submodule would NOT be automatically > initialized despite being part of v2.14.0. In both cases, git requires > submodules to be initialized or deinitialized separately. Further, we > also have special handling for submodules in other commands such as > clean, which requires two --force flags to delete untracked submodules, > and some commands have a --recurse-submodules flag. > > sparse-checkout is very similar to checkout, as evidenced by the similar > name -- it adds and removes files from the working copy. However, for > the same avoid-data-loss reasons we do not want to remove a submodule > from the working copy with checkout, we do not want to do it with > sparse-checkout either. So submodules need to be separately initialized > or deinitialized; changing sparse-checkout rules should not > automatically trigger the removal or vivification of submodules. This is a good summary of how submodules decide to be present or not. > I believe the previous wording in git-sparse-checkout.txt about > submodules was only about this particular issue. Unfortunately, the > previous wording could be interpreted to imply that submodules should be > considered active regardless of sparsity patterns. Update the wording > to avoid making such an implication. It may be helpful to consider two > example situations where the differences in wording become important: You are correct, the wording was unclear. Worth fixing. > In the future, we want users to be able to run commands like > git clone --sparse=moduleA --recurse-submodules $REPO_URL > and have sparsity paths automatically set up and have submodules *within > the sparsity paths* be automatically initialized. We do not want all > submodules in any path to be automatically initialized with that > command. INTERESTING. You are correct that it would be nice to have one feature that describes "what should be present or not". The in-tree sparse-checkout feature (still in infancy) would benefit from a redesign with that in mind. I am interested as well in the idea that combining "--sparse[=X]" with "--recurse-submodules" might want to imply that the submodules themselves are initialized with sparse-checkout patterns. These ramblings are of course off-topic for the current patch. > Similarly, we want to be able to do things like > git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN > and search through $REV for $PATTERN within the recorded sparsity > patterns. We want it to recurse into submodules within those sparsity > patterns, but do not want to recurse into directories that do not match > the sparsity patterns in search of a possible submodule. (snipping way the old paragraph and focusing on the new text) > +If your repository contains one or more submodules, then those submodules > +will appear based on which you initialized with the `git submodule` > +command. This sentence is awkward. Here is a potential replacement: If your repository contains one or more submodules, then submodules are populated based on interactions with the `git submodule` command. Specifically, `git submodule init -- <path>` will ensure the submodule at `<path>` is present while `git submodule deinit -- <path>` will remove the files for the submodule at `<path>`. Similar to sparse-checkout, the deinitialized submodules still exist in the index, but are not present in the working directory. That got a lot longer as I was working on it. Perhaps add a paragraph break before the next bit. > Submodules may have additional untracked files or code stored on To emphasize the importance of the following "to avoid data loss" statement, you could mention that when a submodule is removed from the working directory, then so is all of its Git data such as objects and branches. If that data was not pushed to another repository, then deinitializing a submodule can result in loss of important data. (Also: maybe I'm wrong about that?) > +other branches, so to avoid data loss, changing sparse inclusion/exclusion Edit: other branches. To avoid data loss, ... > +rules will not cause an already checked out submodule to be removed from > +the working copy. Said another way, just as `checkout` will not cause > +submodules to be automatically removed or initialized even when switching > +between branches that remove or add submodules, using `sparse-checkout` to > +reduce or expand the scope of "interesting" files will not cause submodules > +to be automatically deinitialized or initialized either. Adding or > +removing them must be done as a separate step with `git submodule init` or > +`git submodule deinit`. This final sentence may be redundant if you include reference to init/deinit earlier in the section. > +This may mean that even if your sparsity patterns include or exclude > +submodules, until you manually initialize or deinitialize them, commands > +like grep that work on tracked files in the working copy will ignore "not > +yet initialized" submodules and pay attention to "left behind" ones. I don't think that "left behind" is a good phrase here. It feels like they've been _dropped_ instead of _persisted despite sparse-checkout changes_. Perhaps: commands like `git grep` that work on tracked files in the working copy will pay attention only to initialized submodules, regardless of the sparse-checkout definition. Thanks for pointing out how complicated this scenario is! It certainly demands a careful update like this one. -Stolee
On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 5/22/2020 10:26 AM, Elijah Newren wrote: > > +This may mean that even if your sparsity patterns include or exclude > > +submodules, until you manually initialize or deinitialize them, commands > > +like grep that work on tracked files in the working copy will ignore "not > > +yet initialized" submodules and pay attention to "left behind" ones. > > I don't think that "left behind" is a good phrase here. It feels like > they've been _dropped_ instead of _persisted despite sparse-checkout > changes_. > > Perhaps: > > commands like `git grep` that work on tracked files in the working copy > will pay attention only to initialized submodules, regardless of the > sparse-checkout definition. Hmm, I'm a little confused by the "regardless of the sparse-checkout definition". The plan we discussed for grep was to not recurse into submodules if they have the SKIP_WORKTREE bit set [1], wasn't it? [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/
On 6/10/2020 12:22 PM, Matheus Tavares Bernardino wrote: > On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote: >> >> On 5/22/2020 10:26 AM, Elijah Newren wrote: >>> +This may mean that even if your sparsity patterns include or exclude >>> +submodules, until you manually initialize or deinitialize them, commands >>> +like grep that work on tracked files in the working copy will ignore "not >>> +yet initialized" submodules and pay attention to "left behind" ones. >> >> I don't think that "left behind" is a good phrase here. It feels like >> they've been _dropped_ instead of _persisted despite sparse-checkout >> changes_. >> >> Perhaps: >> >> commands like `git grep` that work on tracked files in the working copy >> will pay attention only to initialized submodules, regardless of the >> sparse-checkout definition. > > Hmm, I'm a little confused by the "regardless of the sparse-checkout > definition". The plan we discussed for grep was to not recurse into > submodules if they have the SKIP_WORKTREE bit set [1], wasn't it? > > [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/ Thanks for correcting my misunderstanding. By introducing `git grep` into this documentation, I have also made it co-dependent on your series. Instead, Elijah was probably purposeful in his use of "grep" over "git grep". If we revert that part of my change to use `grep` instead of `git grep`, then is my statement correct? Thanks, -Stolee
On Wed, Jun 10, 2020 at 2:42 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 6/10/2020 12:22 PM, Matheus Tavares Bernardino wrote: > > On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote: > >> > >> On 5/22/2020 10:26 AM, Elijah Newren wrote: > >>> +This may mean that even if your sparsity patterns include or exclude > >>> +submodules, until you manually initialize or deinitialize them, commands > >>> +like grep that work on tracked files in the working copy will ignore "not > >>> +yet initialized" submodules and pay attention to "left behind" ones. > >> > >> I don't think that "left behind" is a good phrase here. It feels like > >> they've been _dropped_ instead of _persisted despite sparse-checkout > >> changes_. > >> > >> Perhaps: > >> > >> commands like `git grep` that work on tracked files in the working copy > >> will pay attention only to initialized submodules, regardless of the > >> sparse-checkout definition. > > > > Hmm, I'm a little confused by the "regardless of the sparse-checkout > > definition". The plan we discussed for grep was to not recurse into > > submodules if they have the SKIP_WORKTREE bit set [1], wasn't it? > > > > [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/ > > Thanks for correcting my misunderstanding. By introducing > `git grep` into this documentation, I have also made it > co-dependent on your series. Instead, Elijah was probably > purposeful in his use of "grep" over "git grep". I think he used grep referring to git-grep as he mentioned "tracked files in the working copy". Maybe he wanted to describe the current state of git-grep, which does recurse into initialized submodules even when they don't match the sparsity patterns. Was that it, Elijah? If so, since this behavior is changed in mt/grep-sparse-checkout, I think I should also change this doc section within my series. Or we change the doc in this patch and make it dependent on the series.
On Wed, Jun 10, 2020 at 4:41 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 5/22/2020 10:26 AM, Elijah Newren wrote: > > Sorry I missed this patch. I was searching all over for patches with > "sparse" or "submodule" in the _subject_. Thanks for calling out the > need for review, Junio! > > > Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules > > > > Ignoring the sparse-checkout feature momentarily, if one has a submodule and > > creates local branches within it with unpushed changes and maybe adds some > > untracked files to it, then we would want to avoid accidentally removing such > > a submodule. So, for example with git.git, if you run > > git checkout v2.13.0 > > then the sha1collisiondetection/ submodule is NOT removed even though it > > did not exist as a submodule until v2.14.0. Similarly, if you only had > > v2.13.0 checked out previously and ran > > git checkout v2.14.0 > > the sha1collisiondetection/ submodule would NOT be automatically > > initialized despite being part of v2.14.0. In both cases, git requires > > submodules to be initialized or deinitialized separately. Further, we > > also have special handling for submodules in other commands such as > > clean, which requires two --force flags to delete untracked submodules, > > and some commands have a --recurse-submodules flag. > > > > sparse-checkout is very similar to checkout, as evidenced by the similar > > name -- it adds and removes files from the working copy. However, for > > the same avoid-data-loss reasons we do not want to remove a submodule > > from the working copy with checkout, we do not want to do it with > > sparse-checkout either. So submodules need to be separately initialized > > or deinitialized; changing sparse-checkout rules should not > > automatically trigger the removal or vivification of submodules. > > This is a good summary of how submodules decide to be present or not. > > > I believe the previous wording in git-sparse-checkout.txt about > > submodules was only about this particular issue. Unfortunately, the > > previous wording could be interpreted to imply that submodules should be > > considered active regardless of sparsity patterns. Update the wording > > to avoid making such an implication. It may be helpful to consider two > > example situations where the differences in wording become important: > > You are correct, the wording was unclear. Worth fixing. > > > In the future, we want users to be able to run commands like > > git clone --sparse=moduleA --recurse-submodules $REPO_URL > > and have sparsity paths automatically set up and have submodules *within > > the sparsity paths* be automatically initialized. We do not want all > > submodules in any path to be automatically initialized with that > > command. > > INTERESTING. You are correct that it would be nice to have one > feature that describes "what should be present or not". The in-tree > sparse-checkout feature (still in infancy) would benefit from a > redesign with that in mind. > > I am interested as well in the idea that combining "--sparse[=X]" > with "--recurse-submodules" might want to imply that the submodules > themselves are initialized with sparse-checkout patterns. > > These ramblings are of course off-topic for the current patch. Yeah, it might get complicated too; we'd almost certainly want to limit to cone mode (globs could get super hairy). It's also the case we might want some submodules to have sparse-checkouts and others have full checkouts, depending on whether the --sparse=X specification listed some path that traversed from the toplevel outer repo down into a submodule. (But if --sparse is given with no specification, do all submodules become sparse or do all remain full?) Anyway, lots of complications there and we should start a different thread to discuss that when we feel it's time to tackle it. > > Similarly, we want to be able to do things like > > git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN > > and search through $REV for $PATTERN within the recorded sparsity > > patterns. We want it to recurse into submodules within those sparsity > > patterns, but do not want to recurse into directories that do not match > > the sparsity patterns in search of a possible submodule. > > (snipping way the old paragraph and focusing on the new text) > > > +If your repository contains one or more submodules, then those submodules > > +will appear based on which you initialized with the `git submodule` > > +command. > > This sentence is awkward. Here is a potential replacement: > > If your repository contains one or more submodules, then submodules are > populated based on interactions with the `git submodule` command. > Specifically, `git submodule init -- <path>` will ensure the submodule at > `<path>` is present while `git submodule deinit -- <path>` will remove the > files for the submodule at `<path>`. Similar to sparse-checkout, the > deinitialized submodules still exist in the index, but are not present in > the working directory. > > That got a lot longer as I was working on it. Perhaps add a paragraph break > before the next bit. Sounds good, thanks. > > Submodules may have additional untracked files or code stored on > > To emphasize the importance of the following "to avoid data loss" statement, > you could mention that when a submodule is removed from the working directory, > then so is all of its Git data such as objects and branches. If that data was > not pushed to another repository, then deinitializing a submodule can result > in loss of important data. (Also: maybe I'm wrong about that?) > > > +other branches, so to avoid data loss, changing sparse inclusion/exclusion I thought that was what I covered with the "code stored on other branches" but I guess that wasn't clear enough. So yeah, I can try extending it a bit. > Edit: other branches. To avoid data loss, ... Sounds good. > > +rules will not cause an already checked out submodule to be removed from > > +the working copy. Said another way, just as `checkout` will not cause > > +submodules to be automatically removed or initialized even when switching > > +between branches that remove or add submodules, using `sparse-checkout` to > > +reduce or expand the scope of "interesting" files will not cause submodules > > +to be automatically deinitialized or initialized either. Adding or > > +removing them must be done as a separate step with `git submodule init` or > > +`git submodule deinit`. > > This final sentence may be redundant if you include reference to init/deinit > earlier in the section. Yep, I'll strike it. > > +This may mean that even if your sparsity patterns include or exclude > > +submodules, until you manually initialize or deinitialize them, commands > > +like grep that work on tracked files in the working copy will ignore "not > > +yet initialized" submodules and pay attention to "left behind" ones. > > I don't think that "left behind" is a good phrase here. It feels like > they've been _dropped_ instead of _persisted despite sparse-checkout > changes_. I think in addition to the "left behind" wording being bad, my paragraph left another funny gray area and might be inconsistent with what Matheus and I wrote elsewhere: If sparsity patterns would exclude a submodule that is initialized, sparse-checkout clearly can't remove the submodule. However, should it set the SKIP_WORKTREE bit for that submodule if it's not going to remove it? I'm not sure of the answer, yet. I think Matheus had the right idea for how to make grep handle an initialized submodule in the different sparse.restrictCmds settings, and if we do go ahead and clear the SKIP_WORKTREE bit, then I think the wording of this paragraph needs to change. So, let's discuss your alternative: > Perhaps: > > commands like `git grep` that work on tracked files in the working copy > will pay attention only to initialized submodules, regardless of the > sparse-checkout definition. I think this is easy to misconstrue in an entirely new way: if there are initialized submodules (and maybe a sparse checkout), then your wording implies normal files would be ignored by grep (even files that aren't removed by the sparse checkout)! While that sounds like crazy behavior, this whole thread started because of suggested behaviors being proposed to carefully follow what was already written in this document even though the end user result seemed somewhat crazy to me. So, we might want to avoid a repeat. :-) Also, your suggested wording is different than the behavior we came up with before, and is also inconsistent with how we'd work with normal files. For example, what if a user: * uses sparse-checkout to remove a bunch of files/directories they don't care about * creates a new file that happens to have the same name as an (unfortunately) generically worded filename that exists in the index (but is marked SKIP_WORKTREE and had previously been removed) Is this new file related to the tracked file? Is the new file considered tracked? Should the new file be considered part of the sparse cone (i.e. should it be considered part of the set of tracked working tree files relevant to the user for commands that operate on that subset)? It's a bit of a thorny case. Here's the behavior Matheus and I came to previously: git -c sparse.restrictCmds=true grep --recurse-submodules <pattern>: This goes through all the files in the index (i.e. all tracked files) which do NOT have the SKIP_WORKTREE bit set. For each of these: If the file is a symlink, ignore it (like git-grep currently does). If the file is a regular file and is present in the working copy, search it. If the file is a submodule and it is initialized, recurse into it. git -c sparse.restrictCmds=false grep --recurse-submodules <pattern>: This goes through all the files in the index (i.e. all tracked files) regardless of SKIP_WORKTREE bit setting. For each of these: If the file is a symlink, ignore it (like git-grep currently does). If the file is a regular file and is present in the working copy, search it. If the file is a submodule and it is initialized, recurse into it. The only difference between these two sparse.restrictCmds settings is the handling of the SKIP_WORKTREE bit. I think that makes them nice and orthogonal. They also generalize nicely to the cases of searching --cached or $REVISION with a few obvious changes (check if data is available in git object store rather than if file is present in working tree, and for $REVISION check sparsity patterns rather than SKIP_WORKTREE bit). If we start ignoring the SKIP_WORKTREE bit for some types of files even when sparse.restrictCmds=true, I think we start getting a number of inconsistencies and user surprises. So, these formal definitions seem like a good high-level design. I think our attempts to summarize behavior in short sentences for users sometimes ignores some cases due to the desire to summarize. However, taking the summary literally can suggest behaviors that'd be inconsistent if not downright crazy for some of the ignored cases. I'll see if I can clean it up somehow. > Thanks for pointing out how complicated this scenario is! It certainly > demands a careful update like this one. Thanks for the thoughtful review!
On Wed, Jun 10, 2020 at 9:23 AM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote: > > On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote: > > > > On 5/22/2020 10:26 AM, Elijah Newren wrote: > > > +This may mean that even if your sparsity patterns include or exclude > > > +submodules, until you manually initialize or deinitialize them, commands > > > +like grep that work on tracked files in the working copy will ignore "not > > > +yet initialized" submodules and pay attention to "left behind" ones. > > > > I don't think that "left behind" is a good phrase here. It feels like > > they've been _dropped_ instead of _persisted despite sparse-checkout > > changes_. > > > > Perhaps: > > > > commands like `git grep` that work on tracked files in the working copy > > will pay attention only to initialized submodules, regardless of the > > sparse-checkout definition. > > Hmm, I'm a little confused by the "regardless of the sparse-checkout > definition". The plan we discussed for grep was to not recurse into > submodules if they have the SKIP_WORKTREE bit set [1], wasn't it? > > [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/ I flagged some issues with that sentence...and an additional issue in my original sentence besides the one Stolee flagged. It seems to be easy to mess up a simple summary here. :-) But I do want a simple summary of some sort; I want Documentation/git-sparse-checkout.txt to be an end-user guide and not an implementation spec. Perhaps I can bring up a simpler example that will make it easier to see my distinction between the two -- let's consider the case of unmerged files. I think all of the following statements are true, but some are meant strictly as implementation details of relevant subcommands, while others are deduced overall behavior observed by end-users: * If you just ran merge or rebase and have some files with conflicts, 'git grep searchstring' will search the conflicted files for the searchstring * When searching the working tree, git grep should not do any special checking for whether files are in a conflicted state * sparse-checkout will never set the SKIP_WORKTREE bit on an unmerged file (despite sparsity patterns) * sparse-checkout will delete all (regular and symlink) files from the working tree when it sets the SKIP_WORKTREE bit for them * sparse-checkout will not delete files from the working copy if it doesn't set the SKIP_WORKTREE bit on it * When merging, if the merge machinery notices a conflict, it must clear the SKIP_WORKTREE bit and write the (conflicted version of the) file out to the working tree. (It is also allowed to clear the SKIP_WORKTREE bit for files that are not conflicted, though we'd rather it didn't do that so much.) These statements above are not incompatible, because some deal with the implementation of git grep (the second item), others deal with implementation details of other commands or machinery (all items after the second), and the first item deals with the combination of behaviors between sparse-checkout + merge machinery + grep. So, even though the first bullet point says "git grep...will search the conflicted files" that does NOT mean git grep should check for whether files are conflicted. My proposed update in v2 that I'll send out (once I come up with one) might use similar broad brushes. Hope that helps, Elijah
diff --git a/builtin/grep.c b/builtin/grep.c index a5056f395a..91ee0b2734 100644 --- a/builtin/grep.c +++ b/builtin/grep.c @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt, const struct pathspec *pathspec, int cached); static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, struct tree_desc *tree, struct strbuf *base, int tn_len, - int check_attr); + int is_root_tree); static int grep_submodule(struct grep_opt *opt, const struct pathspec *pathspec, @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt, for (nr = 0; nr < repo->index->cache_nr; nr++) { const struct cache_entry *ce = repo->index->cache[nr]; + + if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode)) + continue; + strbuf_setlen(&name, name_base_len); strbuf_addstr(&name, ce->name); @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt, * cache entry are identical, even if worktree file has * been modified, so use cache version instead */ - if (cached || (ce->ce_flags & CE_VALID) || - ce_skip_worktree(ce)) { + if (cached || (ce->ce_flags & CE_VALID)) { if (ce_stage(ce) || ce_intent_to_add(ce)) continue; hit |= grep_oid(opt, &ce->oid, name.buf, @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt, return hit; } -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, - struct tree_desc *tree, struct strbuf *base, int tn_len, - int check_attr) +static struct pattern_list *get_sparsity_patterns(struct repository *repo) +{ + struct pattern_list *patterns; + char *sparse_file; + int sparse_config, cone_config; + + if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) || + !sparse_config) { + return NULL; + } + + sparse_file = repo_git_path(repo, "info/sparse-checkout"); + patterns = xcalloc(1, sizeof(*patterns)); + + if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config)) + cone_config = 0; + patterns->use_cone_patterns = cone_config; + + if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) { + if (file_exists(sparse_file)) { + warning(_("failed to load sparse-checkout file: '%s'"), + sparse_file); + } + free(sparse_file); + free(patterns); + return NULL; + } + + free(sparse_file); + return patterns; +} + +static int in_sparse_checkout(struct strbuf *path, int prefix_len, + unsigned int entry_mode, + struct index_state *istate, + struct pattern_list *sparsity, + enum pattern_match_result parent_match, + enum pattern_match_result *match) +{ + int dtype = DT_UNKNOWN; + + if (S_ISGITLINK(entry_mode)) + return 1; + + if (parent_match == MATCHED_RECURSIVE) { + *match = parent_match; + return 1; + } + + if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1])) + strbuf_addch(path, '/'); + + *match = path_matches_pattern_list(path->buf, path->len, + path->buf + prefix_len, &dtype, + sparsity, istate); + if (*match == UNDECIDED) + *match = parent_match; + + if (S_ISDIR(entry_mode)) + strbuf_trim_trailing_dir_sep(path); + + if (*match == NOT_MATCHED && (S_ISREG(entry_mode) || + (S_ISDIR(entry_mode) && sparsity->use_cone_patterns))) + return 0; + + return 1; +} + +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, + struct tree_desc *tree, struct strbuf *base, int tn_len, + int check_attr, struct pattern_list *sparsity, + enum pattern_match_result default_sparsity_match) { struct repository *repo = opt->repo; int hit = 0; @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, while (tree_entry(tree, &entry)) { int te_len = tree_entry_len(&entry); + enum pattern_match_result sparsity_match = 0; if (match != all_entries_interesting) { strbuf_addstr(&name, base->buf + tn_len); @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, strbuf_add(base, entry.path, te_len); + if (sparsity) { + struct strbuf path = STRBUF_INIT; + strbuf_addstr(&path, base->buf + tn_len); + + if (!in_sparse_checkout(&path, old_baselen - tn_len, + entry.mode, repo->index, + sparsity, default_sparsity_match, + &sparsity_match)) { + strbuf_setlen(base, old_baselen); + continue; + } + } + if (S_ISREG(entry.mode)) { hit |= grep_oid(opt, &entry.oid, base->buf, tn_len, check_attr ? base->buf + tn_len : NULL); @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, strbuf_addch(base, '/'); init_tree_desc(&sub, data, size); - hit |= grep_tree(opt, pathspec, &sub, base, tn_len, - check_attr); + hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len, + check_attr, sparsity, sparsity_match); free(data); } else if (recurse_submodules && S_ISGITLINK(entry.mode)) { hit |= grep_submodule(opt, pathspec, &entry.oid, @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, return hit; } +/* + * Note: sparsity patterns and paths' attributes will only be considered if + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern + * matching on paths.) + */ +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec, + struct tree_desc *tree, struct strbuf *base, int tn_len, + int is_root_tree) +{ + struct pattern_list *patterns = NULL; + int ret; + + if (is_root_tree) + patterns = get_sparsity_patterns(opt->repo); + + ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree, + patterns, 0); + + if (patterns) { + clear_pattern_list(patterns); + free(patterns); + } + return ret; +} + static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec, struct object *obj, const char *name, const char *path) { diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh index 37525cae3a..26852586ac 100755 --- a/t/t7011-skip-worktree-reading.sh +++ b/t/t7011-skip-worktree-reading.sh @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' ' test -z "$(git ls-files -m)" ' -test_expect_success 'grep with skip-worktree file' ' - git update-index --no-skip-worktree 1 && - echo test > 1 && - git update-index 1 && - git update-index --skip-worktree 1 && - rm 1 && - test "$(git grep --no-ext-grep test)" = "1:test" -' - echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected test_expect_success 'diff-index does not examine skip-worktree absent entries' ' setup_absent && diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh new file mode 100755 index 0000000000..3bd67082eb --- /dev/null +++ b/t/t7817-grep-sparse-checkout.sh @@ -0,0 +1,140 @@ +#!/bin/sh + +test_description='grep in sparse checkout + +This test creates a repo with the following structure: + +. +|-- a +|-- b +|-- dir +| `-- c +`-- sub + |-- A + | `-- a + `-- B + `-- b + +Where . has non-cone mode sparsity patterns and sub is a submodule with cone +mode sparsity patterns. The resulting sparse-checkout should leave the following +structure: + +. +|-- a +`-- sub + `-- B + `-- b +' + +. ./test-lib.sh + +test_expect_success 'setup' ' + echo "text" >a && + echo "text" >b && + mkdir dir && + echo "text" >dir/c && + + git init sub && + ( + cd sub && + mkdir A B && + echo "text" >A/a && + echo "text" >B/b && + git add A B && + git commit -m sub && + git sparse-checkout init --cone && + git sparse-checkout set B + ) && + + git submodule add ./sub && + git add a b dir && + git commit -m super && + git sparse-checkout init --no-cone && + git sparse-checkout set "/*" "!b" "!/*/" && + + git tag -am t-commit t-commit HEAD && + tree=$(git rev-parse HEAD^{tree}) && + git tag -am t-tree t-tree $tree && + + test_path_is_missing b && + test_path_is_missing dir && + test_path_is_missing sub/A && + test_path_is_file a && + test_path_is_file sub/B/b +' + +test_expect_success 'grep in working tree should honor sparse checkout' ' + cat >expect <<-EOF && + a:text + EOF + git grep "text" >actual && + test_cmp expect actual +' + +test_expect_success 'grep --cached should honor sparse checkout' ' + cat >expect <<-EOF && + a:text + EOF + git grep --cached "text" >actual && + test_cmp expect actual +' + +test_expect_success 'grep <commit-ish> should honor sparse checkout' ' + commit=$(git rev-parse HEAD) && + cat >expect_commit <<-EOF && + $commit:a:text + EOF + cat >expect_t-commit <<-EOF && + t-commit:a:text + EOF + git grep "text" $commit >actual_commit && + test_cmp expect_commit actual_commit && + git grep "text" t-commit >actual_t-commit && + test_cmp expect_t-commit actual_t-commit +' + +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' ' + commit=$(git rev-parse HEAD) && + tree=$(git rev-parse HEAD^{tree}) && + cat >expect_tree <<-EOF && + $tree:a:text + $tree:b:text + $tree:dir/c:text + EOF + cat >expect_t-tree <<-EOF && + t-tree:a:text + t-tree:b:text + t-tree:dir/c:text + EOF + git grep "text" $tree >actual_tree && + test_cmp expect_tree actual_tree && + git grep "text" t-tree >actual_t-tree && + test_cmp expect_t-tree actual_t-tree +' + +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' ' + cat >expect <<-EOF && + a:text + sub/B/b:text + EOF + git grep --recurse-submodules --cached "text" >actual && + test_cmp expect actual +' + +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' ' + commit=$(git rev-parse HEAD) && + cat >expect_commit <<-EOF && + $commit:a:text + $commit:sub/B/b:text + EOF + cat >expect_t-commit <<-EOF && + t-commit:a:text + t-commit:sub/B/b:text + EOF + git grep --recurse-submodules "text" $commit >actual_commit && + test_cmp expect_commit actual_commit && + git grep --recurse-submodules "text" t-commit >actual_t-commit && + test_cmp expect_t-commit actual_t-commit +' + +test_done
One of the main uses for a sparse checkout is to allow users to focus on the subset of files in a repository in which they are interested. But git-grep currently ignores the sparsity patterns and report all matches found outside this subset, which kind of goes in the opposite direction. Let's fix that, making it honor the sparsity boundaries for every grepping case: - git grep in worktree - git grep --cached - git grep $REVISION - git grep --untracked and git grep --no-index (which already respect sparse checkout boundaries) This is also what some users reported[1] they would want as the default behavior. Note: for `git grep $REVISION`, we will choose to honor the sparsity patterns only when $REVISION is a commit-ish object. The reason is that, for a tree, we don't know whether it represents the root of a repository or a subtree. So we wouldn't be able to correctly match it against the sparsity patterns. E.g. suppose we have a repository with these two sparsity rules: "/*" and "!/a"; and the following structure: / | - a (file) | - d (dir) | - a (file) If `git grep $REVISION` were to honor the sparsity patterns for every object type, when grepping the /d tree, we would wrongly ignore the /d/a file. This happens because we wouldn't know it resides in /d and therefore it would wrongly match the pattern "!/a". Furthermore, for a search in a blob object, we wouldn't even have a path to check the patterns against. So, let's ignore the sparsity patterns when grepping non-commit-ish objects (tags to commits should be fine). Finally, the old behavior may still be desirable for some use cases. So the next patch will add an option to allow restoring it when needed. [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/ Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> --- Note: as I mentioned in the cover letter, the tests in this patch now contain both cone mode and full pattern sparse checkouts. This was done for two reasons: To test grep's behavior when searching with --recurse-submodules and having submodules with different pattern sets than the superproject (which was incorrect in my first implementation). And to test the direct pattern matching in grep_tree(), using both modes. builtin/grep.c | 127 ++++++++++++++++++++++++++-- t/t7011-skip-worktree-reading.sh | 9 -- t/t7817-grep-sparse-checkout.sh | 140 +++++++++++++++++++++++++++++++ 3 files changed, 259 insertions(+), 17 deletions(-) create mode 100755 t/t7817-grep-sparse-checkout.sh