diff mbox series

[RFC,3/3] grep: add option to ignore sparsity patterns

Message ID a76242ecfa69cf29995bd859305dc2ab1bc1a221.1585027716.git.matheus.bernardino@usp.br (mailing list archive)
State New, archived
Headers show
Series grep: honor sparse checkout and add option to ignore it | expand

Commit Message

Matheus Tavares March 24, 2020, 6:13 a.m. UTC
In the last commit, git-grep learned to honor sparsity patterns. For
some use cases, however, it may be desirable to search outside the
sparse checkout. So add the '--ignore-sparsity' option, which restores
the old behavior. Also add the grep.ignoreSparsity configuration, to
allow setting this behavior by default.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Note: I still have to make --ignore-sparsity be able to work together
with --untracked. Unfortunatelly, this won't be as simple because the
codeflow taken by --untracked goes to grep_directory() which just
iterates the working tree, without looking the index entries. So I will
have to either: make --untracked use grep_cache(), and grep the
untracked files later; or try matching the working tree paths against
the sparsity patterns, without looking for the skip_worktree bit in
the index (as I mentioned in the previous patch's comments). Any
preferences regarding these two approaches? (or other suggestions?)

 Documentation/config/grep.txt   |  3 +++
 Documentation/git-grep.txt      |  5 ++++
 builtin/grep.c                  | 19 +++++++++++----
 t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 4 deletions(-)

Comments

Elijah Newren March 24, 2020, 7:54 a.m. UTC | #1
On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> In the last commit, git-grep learned to honor sparsity patterns. For
> some use cases, however, it may be desirable to search outside the
> sparse checkout. So add the '--ignore-sparsity' option, which restores
> the old behavior. Also add the grep.ignoreSparsity configuration, to
> allow setting this behavior by default.

Should `--ignore-sparsity` be a global git option rather than a
grep-specific one?  Also, should grep.ignoreSparsity rather be
core.ignoreSparsity or core.searchOutsideSparsePaths or something?  In
particular, I want a world where:

* Someone can do a "sparse" clone that is NOT just about
sparse-checkout but also about partial clone.  In particular, it makes
use of partial clones to download only the history for the sparsity
paths, and does a sparse-checkout --cone to get those checked out.
(Or, perhaps, defaults to just downloading history for the toplevel
dir, much like `sparse-checkout init --cone`, and then when the user
runs `sparse-checkout set $dir1 $dir2 ...` then it downloads the extra
bits).
* grep, diff, log, shortlog, blame, bisect (and maybe others) all by
default make use of the sparsity patterns to limit their output (but
can all use whatever flag(s) are added here to search outside the
sparsity pattern cones).  This helps users feel they are in a smaller
repo and searching just their area of interest, and it avoids partial
clones downloading blobs unnecessarily.  Nice for the user, and nice
for the system.
* worktrees behave nicer; when creating a new one it inherits the
sparsity patterns of the parent (again to avoid partail clones having
to download everything, and let users continue working on their area
of interest, though they can disable sparse checkouts at any time, of
course).  Still would like Junio's feedback on this one.
* rebase, merge, cherry-pick, etc. (all via the merge machiner) have
smarter tree-merging logic such that when trees are unchanged on one
or both sides of history, we take advantage of the subset of those
cases where we can avoid traversing into subtrees but can resolve the
merge at the tree level.  This is a performance optimization even when
you have all trees and blob available, but an even more important one
if you don't want partial clones to suddenly have to download
unnecessary objects.  I have ideas and am working on this as part of
merge-ort.

> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Note: I still have to make --ignore-sparsity be able to work together
> with --untracked. Unfortunatelly, this won't be as simple because the
> codeflow taken by --untracked goes to grep_directory() which just
> iterates the working tree, without looking the index entries. So I will
> have to either: make --untracked use grep_cache(), and grep the
> untracked files later; or try matching the working tree paths against
> the sparsity patterns, without looking for the skip_worktree bit in
> the index (as I mentioned in the previous patch's comments). Any
> preferences regarding these two approaches? (or other suggestions?)

Hmm.  So, 'tracked' in git is the idea that we are keeping information
about specific files.  'sparse-checkout' is the idea that we have a
subset of those that we can work with without materializing all the
other tracked files; it's clearly a subset of the realm of 'tracked'.
'untracked' is about getting everything outside the set of 'tracked'
files, which to me means it is clearly outside the set of sparsity
paths too (and thus you could take --untracked as implying
--ignore-sparsity, though whether you do might not matter in practice
because of the items I'll discuss next).  Of course, I am also
assuming `--untracked` is incompatible with --cached or specifying
revisions or trees (based on it's definiton of "In addition to
searching in the tracked files in the *working tree*, search also in
untracked files." -- emphasis added.)  If the incompatibility of
--untracked and --cached/REVSIONS/TREES is not enforced, we may want
to look into erroring out if they are given together.  Once we do, we
don't have to worry about grep_cache() at all in the case of
--untracked and shouldn't.  Files with the skip_worktree bit won't
exist in the working directory, and thus won't be searched (this is
what makes --untracked imply --ignore-sparsity not really matter).

In short: With --untracked you are grepping ALL (non-ignored) files in
the working directory -- either because they are both tracked and in
the sparsity paths (anything tracked that isn't in the sparsity paths
has the skip_worktree bit and thus isn't present), or because it is an
untracked file.  [And this may be what grep_directory() already does.]

Does that make sense?

>  Documentation/config/grep.txt   |  3 +++
>  Documentation/git-grep.txt      |  5 ++++
>  builtin/grep.c                  | 19 +++++++++++----
>  t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
>  4 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 76689771aa..c1d49484c8 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -25,3 +25,6 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
>         is executed outside of a git repository.  Defaults to false.
> +
> +grep.ignoreSparsity::
> +       If set to true, enable `--ignore-sparsity` by default.
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 97e25d7b1b..5c5c66c056 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -65,6 +65,11 @@ OPTIONS
>         mechanism.  Only useful when searching files in the current directory
>         with `--no-index`.
>
> +--ignore-sparsity::
> +       In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
> +       also search in files that are outside the sparse checkout. This option
> +       cannot be used with --no-index or --untracked.

If they are outside the sparse checkout, then they are not present on
disk -- so what is this outside stuff that is being searched?  Perhaps
clarify that this is only useful in combination with
--cached/REVISION/TREE, where there do exist paths outside the
sparsity patterns that become relevant?

>  --recurse-submodules::
>         Recursively search in each submodule that has been initialized and
>         checked out in the repository.  When used in combination with the
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 52ec72a036..17eae3edd6 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
>
>  static int recurse_submodules;
>
> +static int ignore_sparsity = 0;
> +
>  static int num_threads;
>
>  static pthread_t *threads;
> @@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
>         if (!strcmp(var, "submodule.recurse"))
>                 recurse_submodules = git_config_bool(var, value);
>
> +       if (!strcmp(var, "grep.ignoresparsity"))
> +               ignore_sparsity = git_config_bool(var, value);
> +
>         return st;
>  }
>
> @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce))
> +               if (!ignore_sparsity && ce_skip_worktree(ce))

Oh boy on the double negatives...maybe we want to rename this flag somehow?

>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID)) {
> +                       if (cached || (ce->ce_flags & CE_VALID) ||
> +                           ce_skip_worktree(ce)) {
>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> -       if (from_commit && repo_read_index(repo) < 0)
> +       if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
>                 die(_("index file corrupt"));
>
>         while (tree_entry(tree, &entry)) {
> @@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> -               if (from_commit) {
> +               if (!ignore_sparsity && from_commit) {
>                         int pos = index_name_pos(repo->index,
>                                                  base->buf + tn_len,
>                                                  base->len - tn_len);
> @@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                 OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
>                            N_("allow calling of grep(1) (ignored by this build)"),
>                            PARSE_OPT_NOCOMPLETE),
> +               OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
> +                        N_("also search in files outside the sparse checkout")),
>                 OPT_END()
>         };
>
> @@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>         if (recurse_submodules && untracked)
>                 die(_("--untracked not supported with --recurse-submodules"));
>
> +       if (ignore_sparsity && (!use_index || untracked))
> +               die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
> +
>         if (show_in_pager) {
>                 if (num_threads > 1)
>                         warning(_("invalid option combination, ignoring --threads"));
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index fccf44e829..1891ddea57 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>         test_cmp expect_t-tree actual_t-tree
>  '
>
> +for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
> +          'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
> +do
> +       test_expect_success "$cmd should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd --cached should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_t-commit <<-EOF &&
> +               t-commit:a:text
> +               t-commit:b:text
> +               t-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" t-commit >actual_t-commit &&
> +               test_cmp expect_t-commit actual_t-commit
> +       '
> +done
> +
>  test_done
> --
> 2.25.1

I think there are several things that we need to straighten out first
and will affect a lot of this patch quite a bit:
* The feedback from the previous patch that the revision handling
should use sparsity patterns rather than ce_skip_worktree() is going
to affect this patch a fair amount.
* I think the fact that --ignore-sparsity is meaningless without
--cached or a REVISION or TREE may also affect things.
* The decision about how to globally name and set the
"ignore-sparsity" bit without requiring users to set it for each and
every subcommand will change this patch a bit too.


I'm super excited to see work in this area.  I hope I'm not
discouraging you by attempting to provide what I think is the bigger
picture I'd like us to work towards.
Junio C Hamano March 24, 2020, 6:30 p.m. UTC | #2
Elijah Newren <newren@gmail.com> writes:

> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> In the last commit, git-grep learned to honor sparsity patterns. For
>> some use cases, however, it may be desirable to search outside the
>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>> allow setting this behavior by default.
>
> Should `--ignore-sparsity` be a global git option rather than a
> grep-specific one?  Also, should grep.ignoreSparsity rather be
> core.ignoreSparsity or core.searchOutsideSparsePaths or something?

Great question.  I think "git diff" with various options would also
want to optionally be able to be confined within the sparse cone, or
checking the entire world by lazily fetching outside the sparsity.

> * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> default make use of the sparsity patterns to limit their output (but
> can all use whatever flag(s) are added here to search outside the
> sparsity pattern cones).  This helps users feel they are in a smaller
> repo and searching just their area of interest, and it avoids partial
> clones downloading blobs unnecessarily.  Nice for the user, and nice
> for the system.

I am not sure which one should be the default.  From historical
point of view that sparse stuff was done as an optimization to omit
initial work and lazily give the whole world, I may have slight
preference to the "we pretend that you have everything, just some
parts may be slower to come to you" world view to be the default,
with an option to limit the view to whatever sparsity you initially
set up.  Regardless of the choice of the default, it would be a good
idea to make the subcommands consistently offer the same default and
allow the non-default views with the same UI.
Elijah Newren March 24, 2020, 7:07 p.m. UTC | #3
On Tue, Mar 24, 2020 at 11:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
>
> > * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> > default make use of the sparsity patterns to limit their output (but
> > can all use whatever flag(s) are added here to search outside the
> > sparsity pattern cones).  This helps users feel they are in a smaller
> > repo and searching just their area of interest, and it avoids partial
> > clones downloading blobs unnecessarily.  Nice for the user, and nice
> > for the system.
>
> I am not sure which one should be the default.  From historical
> point of view that sparse stuff was done as an optimization to omit
> initial work and lazily give the whole world, I may have slight
> preference to the "we pretend that you have everything, just some
> parts may be slower to come to you" world view to be the default,
> with an option to limit the view to whatever sparsity you initially
> set up.

It sounds like you are describing partial clone rather than sparse
checkout?  Or perhaps you're trying to blur the distinction,
suggesting the two should be used together, with the partial clone
machinery learning to download history within the specified sparse
cones?

>  Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Agreed.
Junio C Hamano March 25, 2020, 8:18 p.m. UTC | #4
Elijah Newren <newren@gmail.com> writes:

> It sounds like you are describing partial clone rather than sparse
> checkout?  Or perhaps you're trying to blur the distinction,
> suggesting the two should be used together, with the partial clone
> machinery learning to download history within the specified sparse
> cones?

Yeah, I guess it is a little bit of both ;-)

>>  Regardless of the choice of the default, it would be a good
>> idea to make the subcommands consistently offer the same default and
>> allow the non-default views with the same UI.
>
> Agreed.

Yup, thanks.
Matheus Tavares March 25, 2020, 11:15 p.m. UTC | #5
On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > ---
> >
> > Note: I still have to make --ignore-sparsity be able to work together
> > with --untracked. Unfortunatelly, this won't be as simple because the
> > codeflow taken by --untracked goes to grep_directory() which just
> > iterates the working tree, without looking the index entries. So I will
> > have to either: make --untracked use grep_cache(), and grep the
> > untracked files later; or try matching the working tree paths against
> > the sparsity patterns, without looking for the skip_worktree bit in
> > the index (as I mentioned in the previous patch's comments). Any
> > preferences regarding these two approaches? (or other suggestions?)
>
> Hmm.  So, 'tracked' in git is the idea that we are keeping information
> about specific files.  'sparse-checkout' is the idea that we have a
> subset of those that we can work with without materializing all the
> other tracked files; it's clearly a subset of the realm of 'tracked'.
> 'untracked' is about getting everything outside the set of 'tracked'
> files, which to me means it is clearly outside the set of sparsity
> paths too (and thus you could take --untracked as implying
> --ignore-sparsity, though whether you do might not matter in practice
> because of the items I'll discuss next). Of course, I am also
> assuming `--untracked` is incompatible with --cached or specifying
> revisions or trees (based on it's definiton of "In addition to
> searching in the tracked files in the *working tree*, search also in
> untracked files." -- emphasis added.)

Hm, I see the point now, but I'm still a little confused: The "in the
working tree" section of the definition would exclude non checked out
files, right? However, git-grep's description says "Look for specified
patterns in the tracked files *in the work tree*", and it still
searches non checked out files (loading them from the cache, even when
--cache is not given). I know that's exactly what we are trying to
change with this patchset, but we will still give the
--ignore-sparsity option to allow the old behavior when needed (unless
we prohibit using --ignore-sparsity without --cached or $REV). I guess
my doubt is whether the problem is in the implementation of the
working tree grep, which considers non checked out files, or in the
docs, which say "tracked files *in the work tree*".

I tend to go with the latter, since using `git grep --ignore-sparsity`
in a sparse checked out working tree, to grep not present files as
well, kind of makes sense to me. And if the problem is indeed in the
docs, then I think we should also allow --ignore-sparsity when
grepping with --untracked, since it's an analogous case.

> If the incompatibility of
> --untracked and --cached/REVSIONS/TREES is not enforced, we may want
> to look into erroring out if they are given together.  Once we do, we
> don't have to worry about grep_cache() at all in the case of
> --untracked and shouldn't.  Files with the skip_worktree bit won't
> exist in the working directory, and thus won't be searched (this is
> what makes --untracked imply --ignore-sparsity not really matter).
>
> In short: With --untracked you are grepping ALL (non-ignored) files in
> the working directory -- either because they are both tracked and in
> the sparsity paths (anything tracked that isn't in the sparsity paths
> has the skip_worktree bit and thus isn't present), or because it is an
> untracked file.  [And this may be what grep_directory() already does.]
>
> Does that make sense?

It does, and thanks for a very detailed explanation. But as I
mentioned before, I'm a little uncertain about --untracked implying
--ignore-spasity. The commit that added --untracked (0a93fb8) says:

"grep --untracked" would find the specified patterns from files in
untracked files in addition to its usual behaviour of finding them in
the tracked files

So, in my mind, it feels like --untracked wasn't meant to limit the
search to "all non-ignored files in the working directory", but to add
untracked files to the search (which could also contain tracked but
non checked out files). Wouldn't the "all non-ignored files in the
working directory" case be the use of --no-index?

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 52ec72a036..17eae3edd6 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
...
> >
> > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >                 const struct cache_entry *ce = repo->index->cache[nr];
> >
> > -               if (ce_skip_worktree(ce))
> > +               if (!ignore_sparsity && ce_skip_worktree(ce))
>
> Oh boy on the double negatives...maybe we want to rename this flag somehow?

Yeah, I also thought about that, but couldn't come up with a better
name myself... My alternatives were all too verbose.

...
> I'm super excited to see work in this area.  I hope I'm not
> discouraging you by attempting to provide what I think is the bigger
> picture I'd like us to work towards.

Not at all! :) Thanks a lot for the bigger picture and other
explanations. They help me understand the long-term goals and make
better decisions now.
Elijah Newren March 26, 2020, 6:02 a.m. UTC | #6
Hi Matheus!

On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >
> > > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > > ---
> > >
> > > Note: I still have to make --ignore-sparsity be able to work together
> > > with --untracked. Unfortunatelly, this won't be as simple because the
> > > codeflow taken by --untracked goes to grep_directory() which just
> > > iterates the working tree, without looking the index entries. So I will
> > > have to either: make --untracked use grep_cache(), and grep the
> > > untracked files later; or try matching the working tree paths against
> > > the sparsity patterns, without looking for the skip_worktree bit in
> > > the index (as I mentioned in the previous patch's comments). Any
> > > preferences regarding these two approaches? (or other suggestions?)
> >
> > Hmm.  So, 'tracked' in git is the idea that we are keeping information
> > about specific files.  'sparse-checkout' is the idea that we have a
> > subset of those that we can work with without materializing all the
> > other tracked files; it's clearly a subset of the realm of 'tracked'.
> > 'untracked' is about getting everything outside the set of 'tracked'
> > files, which to me means it is clearly outside the set of sparsity
> > paths too (and thus you could take --untracked as implying
> > --ignore-sparsity, though whether you do might not matter in practice
> > because of the items I'll discuss next). Of course, I am also
> > assuming `--untracked` is incompatible with --cached or specifying
> > revisions or trees (based on it's definiton of "In addition to
> > searching in the tracked files in the *working tree*, search also in
> > untracked files." -- emphasis added.)
>
> Hm, I see the point now, but I'm still a little confused: The "in the
> working tree" section of the definition would exclude non checked out
> files, right? However, git-grep's description says "Look for specified
> patterns in the tracked files *in the work tree*", and it still
> searches non checked out files (loading them from the cache, even when
> --cache is not given). I know that's exactly what we are trying to

I really respect Duy and he does some amazing work and I wish he were
still active in git, but the SKIP_WORKTREE stuff wasn't his best work
and even he downplayed it: "In my defense it was one of my first
contribution when I was naiver...I'd love to hear how sparse checkout
could be improved, or even replaced."[0]

I've seen enough egregiously confusing cases and enough
difficult-to-recover-from cases with the implementation of the
SKIP_WORKTREE handling that I think it is dangerous to assume behavior
you see with it is intended design.  A year and a half ago, I read all
available docs to figure out how to sparsify and de-sparsify, and read
them several times but was still confused.  If I could only figure it
out with great difficulty, a lot of google searching, and even trying
to look at the code, what chance did "normal" users stand?  To add
more flavor to that argument, let me cite [1] (the three paragraphs
starting with "Playing with sparse-checkout, it feels to me like a
half-baked feature"), [2], as well as good chunks of [3], [4], and
[5].

[0] https://lore.kernel.org/git/CACsJy8ArUXD0cF2vQAVnzM_AGto2k2yQTFuTO7PhP4ffHM8dVQ@mail.gmail.com/
[1] https://lore.kernel.org/git/CABPp-BFKf2N6TYzCCneRwWUektMzRMnHLZ8JT64q=MGj5WQZkA@mail.gmail.com/
[2] https://lore.kernel.org/git/CABPp-BGE-m_UFfUt_moXG-YR=ZW8hMzMwraD7fkFV-+sEHw36w@mail.gmail.com/
[3] https://lore.kernel.org/git/pull.316.git.gitgitgadget@gmail.com/
[4] https://lore.kernel.org/git/pull.513.git.1579029962.gitgitgadget@gmail.com/
[5] https://lore.kernel.org/git/a46439c8536f912ad4a1e1751852cf477d3d7dc7.1584813609.git.gitgitgadget@gmail.com/

But let me try to explain it all below from first principles in a way
that will hopefully make sense why falling back to loading from the
cache when --cached is not given is just flat wrong.  The explanation
from first principles should also help explain --untracked a bit
better, and when there are decisions about whether to use sparsity
patterns.

> change with this patchset, but we will still give the
> --ignore-sparsity option to allow the old behavior when needed (unless
> we prohibit using --ignore-sparsity without --cached or $REV). I guess
> my doubt is whether the problem is in the implementation of the
> working tree grep, which considers non checked out files, or in the
> docs, which say "tracked files *in the work tree*".
>
> I tend to go with the latter, since using `git grep --ignore-sparsity`
> in a sparse checked out working tree, to grep not present files as
> well, kind of makes sense to me. And if the problem is indeed in the
> docs, then I think we should also allow --ignore-sparsity when
> grepping with --untracked, since it's an analogous case.

It's probably not a surprise to you given what I've already said above
to hear me say that the docs are correct in this case.  But not only
are the docs correct, I'll go even further and claim that falling back
to the cache when --cached is not passed is indefensible and leads to
surprises and contradictions.  But instead of just claiming that, let
me try to spell out a bit better why I believe that from first
principles, though:

There were previously three types of files for git:
  * tracked
  * ignored
  * untracked
where:
  * tracked was defined as "recorded in index"
  * ignored was defined as "a file which is not tracked and which
matches an ignore rule (.gitignore, .git/info/exclude, etc.)"
  * untracked was defined as "all other files present in the working directory".
With the SKIP_WORKTREE bit and sparse-checkouts, we actually have four
types because we split the "tracked" category into two:
  * tracked and matches the sparsity patterns (implies it will be
missing from the working directory as the SKIP_WORKTREE bit is set)
  * tracked and does not match the sparsity patterns (implies it will
be present in the working directory, as the SKIP_WORKTREE bit is not
set)
But let's ignore the splitting of the tracked type for a minute as
well as everything else related to sparseness.  Let's just look at how
grep was designed.

git grep has traditionally been about searching "tracked files in the
work tree" as you highlighted (and note that sparsity bits came four
years later in 2009, so cannot undercut that claim).  If the user has
made edits to files and hasn't staged them, grep would search those
working tree files with their edits, not old cached versions of those
files.  People were told that git grep was a great way to just search
relevant stuff (instead of normal grep which would look through build
results and random big files in your working directory that you
weren't even tracking).  Then in 2011 grep gained options like
--untracked to extend the search in the working tree to also include
untracked files, and added --no-exclude-standard (which is "only
useful with --untracked") so that people had a way to search *all*
files in the working tree (tracked, untracked, and ignored files).
(Note: no mechanism was provided for searching tracked and ignored
files without untracked as far as I can tell, though I don't see why
that would make sense.)  git-grep also gained options like --no-index
so that it could be used in a directory that wasn't tracked by git at
all -- it turns out people liked git-grep better than normal grep (I
think it got colorization first?), even for things that weren't being
tracked by git.  But again, all these cases were about searching files
that existed in the working tree.

Of course, people sometimes wanted to search a version other than what
existed in the working tree.  And thus options like --cached or
specifying a REVISION were added early on.

Sometimes, code that wasn't meant to be used together accidentally is
used together or the docs suggest they can be used together.  In 2010,
someone had to clarify that --cached was incompatible with <tree>; not
sure why someone would attempt to use them together, but that's the
type of accident that is easy to have in the implementation or docs
because it doesn't even occur to people who understand the design and
the data structures why anyone would attempt that.  Inevitably,
someone comes along who doesn't understand the underlying data
structures or design or terminology and tries incompatible options
together...and then gets surprised.  (Side note: I think this kind of
issues occurs fairly frequently, so I'm unlikely to assume options
were meant to be supported together based solely on a lack of logic
that would throw an error when both are specified.  We could probably
add a bunch of useful microprojects around checking for flags that
should be incompatible and making sure git throws errors when both are
specified.  We had lots of cases in rebase, for example, where if
users happened to specify two flags then one would just be silently
ignored.)

REVISION and --cached are not just incompatible with each other; each
is incompatible with all three of --untracked, --no-index, and
--no-exclude-standard.  This is because REVISION and --cached are
about picking some version other than what exists in the working tree
to search through, while those other options are all intended for when
we are searching through files in the working tree (and in particular,
exist to extend how many files in the working tree we look through).

One more useful case to consider before we start adding SKIP_WORKTREE
into the mix.  Let's say that you have three files:
   fileA
   fileB
   fileC
and all of them are tracked.  You have made edits to fileA and fileB,
and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
staged).  Now, you run 'git grep mystring'.  Quick question: Which
files are searched for 'mystring'?  Well...
  * REVISION and --cached were left out of the git grep command, so
working tree files should be searched, not staged versions or versions
from other commits
  * No flags like --untracked or --no-exclude-standard were included,
so only tracked files in the working tree should be searched
  * There are two files in the working tree, both tracked: fileA and fileB.
So, this searches fileA and fileB.  In particular: NO VERSION of fileC
is searched.  fileC may be tracked/cached, but we don't search any
version of that file, because this particular command line is about
searching the working directory and fileC is not in the working
directory.  To the best of my knowledge, git grep has always behaved
that way.


Users understand the idea of searching the working copy vs. the index
vs. "old" (or different) versions of the repository.  They also
understand that when searching the working copy, by default a subset
of the files are searched.  Tell me: given all this information here,
what possible explanation is there for SKIP_WORKTREE entries to be
translated into searches of the cache when --cached is not specified?
Please square that away with the fact that 'rm fileC' results in fileC
NOT being searched.

It's just completely, utterly wrong.

Also, hopefully this helps answer your question about --untracked and
skip_worktree.  --untracked is only useful when searching through the
working tree, and is entirely about adding the "untracked" category to
the things we search.  The skip_worktree bit is about adding more
granularity to the "tracked" category.  The two are thus entirely
orthogonal and --untracked shouldn't change behavior at all in the
face of sparse checkouts.

And I also think it explains more when the sparsity patterns and
--ignore-sparsity-patterns flags even matter.  The division of working
tree files which were tracked into two subsets (those that match
sparsity patterns and those that don't) didn't matter because only one
of those two sets existed and could be searched.  So the question is,
when can the sparsity pattern divide a set of files into two subsets
where both are non-empty?  And the answer is when --cached or REVISION
is specified.  This is the case Junio recently brought up and said
that there are good reasons users might want to limit to just the
paths that match the sparsity patterns, and other reasons when users
might want to search everything[6].  So, both cases need to be
supported fairly easily, and this will be true for several commands
besides just grep.

[6] https://lore.kernel.org/git/xmqq7dz938sc.fsf@gitster.c.googlers.com/

> > If the incompatibility of
> > --untracked and --cached/REVSIONS/TREES is not enforced, we may want
> > to look into erroring out if they are given together.  Once we do, we
> > don't have to worry about grep_cache() at all in the case of
> > --untracked and shouldn't.  Files with the skip_worktree bit won't
> > exist in the working directory, and thus won't be searched (this is
> > what makes --untracked imply --ignore-sparsity not really matter).
> >
> > In short: With --untracked you are grepping ALL (non-ignored) files in
> > the working directory -- either because they are both tracked and in
> > the sparsity paths (anything tracked that isn't in the sparsity paths
> > has the skip_worktree bit and thus isn't present), or because it is an
> > untracked file.  [And this may be what grep_directory() already does.]
> >
> > Does that make sense?
>
> It does, and thanks for a very detailed explanation. But as I
> mentioned before, I'm a little uncertain about --untracked implying
> --ignore-sparsity. The commit that added --untracked (0a93fb8) says:
>
> "grep --untracked" would find the specified patterns from files in
> untracked files in addition to its usual behaviour of finding them in
> the tracked files
>
> So, in my mind, it feels like --untracked wasn't meant to limit the
> search to "all non-ignored files in the working directory", but to add
> untracked files to the search (which could also contain tracked but
> non checked out files). Wouldn't the "all non-ignored files in the
> working directory" case be the use of --no-index?

--no-index is specifically designed for when the directory isn't
tracked by git at all.  It would be equivalent, though, to saying we
wanted to search all files in the working copy regardless of whether
they are tracked, untracked, or ignored, i.e. equivalent to specifying
both --untracked and --no-exclude-standard.

And you were right to be uncertain about --untracked implying
--ignore-sparsity; --untracked is completely orthogonal to sparsity.
(However, it wouldn't much matter if it did imply that option or if it
implied its opposite: --untracked implies we are only looking at the
working directory files, and thus we aren't even going to check the
sparsity patterns, we'll just check which files exist in the working
directory.  `git sparse-checkout reapply` will care about the sparsity
patterns and possibly add files to the working copy or remove some,
but grep certainly shouldn't be having a side effect like that; it
should just search the directory as it exists.)

> > > diff --git a/builtin/grep.c b/builtin/grep.c
> > > index 52ec72a036..17eae3edd6 100644
> > > --- a/builtin/grep.c
> > > +++ b/builtin/grep.c
> ...
> > >
> > > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> > >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> > >                 const struct cache_entry *ce = repo->index->cache[nr];
> > >
> > > -               if (ce_skip_worktree(ce))
> > > +               if (!ignore_sparsity && ce_skip_worktree(ce))
> >
> > Oh boy on the double negatives...maybe we want to rename this flag somehow?
>
> Yeah, I also thought about that, but couldn't come up with a better
> name myself... My alternatives were all too verbose.
>
> ...
> > I'm super excited to see work in this area.  I hope I'm not
> > discouraging you by attempting to provide what I think is the bigger
> > picture I'd like us to work towards.
>
> Not at all! :) Thanks a lot for the bigger picture and other
> explanations. They help me understand the long-term goals and make
> better decisions now.

Hope this email helps too.  I've composed it over about 4 different
sessions with various interruptions, so there's a good chance all my
edits and loss of train of thought might have made something murky.
Let me know which part(s) are confusing and I'll try to clarify.

Elijah
Junio C Hamano March 27, 2020, 3:51 p.m. UTC | #7
Elijah Newren <newren@gmail.com> writes:

> Sometimes, code that wasn't meant to be used together accidentally is
> used together or the docs suggest they can be used together.  ...
> ... but that's the
> type of accident that is easy to have in the implementation or docs
> because it doesn't even occur to people who understand the design and
> the data structures why anyone would attempt that.

The above is not limited to "git grep", but you said so clearly what
I have felt, without being able to express myself in a satisfactory
manner, for the last 10 years.

> ... (Side note: I think this kind of
> issues occurs fairly frequently, so I'm unlikely to assume options
> were meant to be supported together based solely on a lack of logic
> that would throw an error when both are specified.

Amen to that.

By the way, and I am so sorry to making the main issue of the
discussion into a mere "by the way" point, but if I understand your
message correctly, the primary conclusion in there is that a file
that is not in the working tree, if the sparsity pattern tells us
that it should not be checked out to the working tree, should not be
sought in the index instead.  I think I agree with that conclusion.

I however have some disagreement on a minor point, though.

"git grep -e '<pattern>' master" looks for the pattern in the commit
at the tip of the master branch.  "git grep -e '<pattern>' master
pu" does so in these two commits.  I do not think it is conceptually
wrong to allow "git grep -e '<pattern>' --cached master pu" to look
for three "commits", i.e. those two commits that already exist, plus
the one you would be creating if you were to "git commit" right now.
Similarly, I do not see a reason why we should forbid looking for
the same pattern in the tracked files in the working tree at the
same time we check tree object(s) and/or the index.

At least in principle.

There are two practical issues that makes these combinations
problematic, but I do not think they are insurmountable.

 - Once you give an object on the command line, there is no syntax
   to let you say "oh, by the way, I want the working tree as well".
   If you are looking in the index, the working tree, and optionally
   in some objects, "--index" instead of "--cached" would be the
   standard way to tell the command "I want to affect both the index
   and the working tree", but there is no way to say "I want only
   tracked files in the working tree and these objects searched".
   We'd need a new syntax to express it if we wanted to allow the
   combination.

 - The lines found in the working tree and in the index are prefixed
   by the filename, while they are prefixed by the tree's name and a
   colon.  When output for the working tree and the index are
   combined, we cannot tell where each hit came from.  We need to
   change the output to allow us to tell them apart, by
   e.g. prefixing "<worktree>:" and "<index>:" in a way similar to
   we use "<revision>:".

Thanks.
Elijah Newren March 27, 2020, 7:01 p.m. UTC | #8
On Fri, Mar 27, 2020 at 8:51 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Sometimes, code that wasn't meant to be used together accidentally is
> > used together or the docs suggest they can be used together.  ...
> > ... but that's the
> > type of accident that is easy to have in the implementation or docs
> > because it doesn't even occur to people who understand the design and
> > the data structures why anyone would attempt that.
>
> The above is not limited to "git grep", but you said so clearly what
> I have felt, without being able to express myself in a satisfactory
> manner, for the last 10 years.
>
> > ... (Side note: I think this kind of
> > issues occurs fairly frequently, so I'm unlikely to assume options
> > were meant to be supported together based solely on a lack of logic
> > that would throw an error when both are specified.
>
> Amen to that.
>
> By the way, and I am so sorry to making the main issue of the
> discussion into a mere "by the way" point, but if I understand your
> message correctly, the primary conclusion in there is that a file
> that is not in the working tree, if the sparsity pattern tells us
> that it should not be checked out to the working tree, should not be
> sought in the index instead.  I think I agree with that conclusion.

Cool.

> I however have some disagreement on a minor point, though.
>
> "git grep -e '<pattern>' master" looks for the pattern in the commit
> at the tip of the master branch.  "git grep -e '<pattern>' master
> pu" does so in these two commits.  I do not think it is conceptually
> wrong to allow "git grep -e '<pattern>' --cached master pu" to look
> for three "commits", i.e. those two commits that already exist, plus
> the one you would be creating if you were to "git commit" right now.
> Similarly, I do not see a reason why we should forbid looking for
> the same pattern in the tracked files in the working tree at the
> same time we check tree object(s) and/or the index.
>
> At least in principle.
>
> There are two practical issues that makes these combinations
> problematic, but I do not think they are insurmountable.
>
>  - Once you give an object on the command line, there is no syntax
>    to let you say "oh, by the way, I want the working tree as well".
>    If you are looking in the index, the working tree, and optionally
>    in some objects, "--index" instead of "--cached" would be the
>    standard way to tell the command "I want to affect both the index
>    and the working tree", but there is no way to say "I want only
>    tracked files in the working tree and these objects searched".
>    We'd need a new syntax to express it if we wanted to allow the
>    combination.
>
>  - The lines found in the working tree and in the index are prefixed
>    by the filename, while they are prefixed by the tree's name and a
>    colon.  When output for the working tree and the index are
>    combined, we cannot tell where each hit came from.  We need to
>    change the output to allow us to tell them apart, by
>    e.g. prefixing "<worktree>:" and "<index>:" in a way similar to
>    we use "<revision>:".
>
> Thanks.

Ah, so you're saying that even though --cached and REVISION are
incompatible today, that's not fundamental and we could conceivably
let them or even more options be used together in the future and you
even highlight how it could be made to sensibly work.  I agree with
what you say here: _if_ there is a way for users to explicitly specify
that they want to search multiple versions (whether that is revisions
or the index or the working tree), _and_ we have a way to distinguish
which version we found the results from, then (and only then) it'd
make sense to search the complete set of files from each of those
versions and show the results for the matches we found.

That differs in multiple important ways from the SKIP_WORKTREE
behavior I was railing against, and I think what you propose as a
possibility in contrast would make sense.
Matheus Tavares March 30, 2020, 1:12 a.m. UTC | #9
On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Matheus!

Hi, Elijah.

First of all, thanks for taking the time to go over these topics in
great detail. I must say it's much clearer for me now.

> On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
[...]
> One more useful case to consider before we start adding SKIP_WORKTREE
> into the mix.  Let's say that you have three files:
>    fileA
>   fileB
>    fileC
> and all of them are tracked.  You have made edits to fileA and fileB,
> and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> staged).  Now, you run 'git grep mystring'.  Quick question: Which
> files are searched for 'mystring'?  Well...
>   * REVISION and --cached were left out of the git grep command, so
> working tree files should be searched, not staged versions or versions
> from other commits
>  * No flags like --untracked or --no-exclude-standard were included,
> so only tracked files in the working tree should be searched
>   * There are two files in the working tree, both tracked: fileA and fileB.
> So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> is searched.  fileC may be tracked/cached, but we don't search any
> version of that file, because this particular command line is about
> searching the working directory and fileC is not in the working
> directory.  To the best of my knowledge, git grep has always behaved
> that way.
>
> Users understand the idea of searching the working copy vs. the index
> vs. "old" (or different) versions of the repository.  They also
> understand that when searching the working copy, by default a subset
> of the files are searched.  Tell me: given all this information here,
> what possible explanation is there for SKIP_WORKTREE entries to be
> translated into searches of the cache when --cached is not specified?
> Please square that away with the fact that 'rm fileC' results in fileC
> NOT being searched.
>
> It's just completely, utterly wrong.

Makes sense, thanks. I agree that we shouldn't fall back to the cache
when searching the working tree.

> Also, hopefully this helps answer your question about --untracked and
> skip_worktree.  --untracked is only useful when searching through the
> working tree, and is entirely about adding the "untracked" category to
> the things we search.  The skip_worktree bit is about adding more
> granularity to the "tracked" category.  The two are thus entirely
> orthogonal and --untracked shouldn't change behavior at all in the
> face of sparse checkouts.

Thanks, your explanation clarified the issue I had. I see now why
--untracked and --ignore-sparsity don't make sense together.

It also made me think about the combination of --cached and
--untracked which, IIUC, should be prohibited. I will add a patch in
v2, making git-grep error out in this case.

> And I also think it explains more when the sparsity patterns and
> --ignore-sparsity-patterns flags even matter.  The division of working
> tree files which were tracked into two subsets (those that match
> sparsity patterns and those that don't) didn't matter because only one
> of those two sets existed and could be searched.  So the question is,
> when can the sparsity pattern divide a set of files into two subsets
> where both are non-empty?  And the answer is when --cached or REVISION
> is specified.

Makes sense. I will add in --ignore-sparsity's description that it is
only relevant with --cached or REVISION, as you previously suggested.
When it is used outside of these cases, though, I think we could just
warn that --ignore-sparsity will be discarded (to avoid erroring out
when users have grep.ignoreSparsity enabled).
Matheus Tavares March 30, 2020, 3:23 a.m. UTC | #10
On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
[...]
> Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Yeah, it seems like a sensible path. Regarding implementation, there
is the question that Elijah raised, of whether to use a global git
option or separate but consistent options for each subcommand. I don't
have much experience with sparse checkout to argument for one or
another, so I would like to hear what others have to say about it.

A question that comes to my mind regarding the global git option is:
will --ignore-sparsity (or whichever name we choose for it [1]) be
sufficient for all subcommands? Or may some of them require additional
options for command-specific behaviors concerning sparsity patterns?
Also, would it be OK if we just ignored the option in commands that do
not operate differently in sparse checkouts (maybe, fetch, branch and
send-email, for example)? And would it make sense to allow
constructions such as `git --ignore-sparsity checkout` or even `git
--ignore-sparsity sparse-checkout ...`?

[1]: Does anyone have suggestions for the option/config name? The best
I could come up with so far (without being too verbose) is
--no-sparsity-constraints. But I fear this might sound generic. As
Elijah already mentioned, --ignore-sparsity is not good either, as it
introduces double negatives in code...
Elijah Newren March 31, 2020, 4:48 p.m. UTC | #11
On Sun, Mar 29, 2020 at 6:13 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > Hi Matheus!
>
> Hi, Elijah.
>
> First of all, thanks for taking the time to go over these topics in
> great detail. I must say it's much clearer for me now.
>
> > On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> > >
> [...]
> > One more useful case to consider before we start adding SKIP_WORKTREE
> > into the mix.  Let's say that you have three files:
> >    fileA
> >   fileB
> >    fileC
> > and all of them are tracked.  You have made edits to fileA and fileB,
> > and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> > staged).  Now, you run 'git grep mystring'.  Quick question: Which
> > files are searched for 'mystring'?  Well...
> >   * REVISION and --cached were left out of the git grep command, so
> > working tree files should be searched, not staged versions or versions
> > from other commits
> >  * No flags like --untracked or --no-exclude-standard were included,
> > so only tracked files in the working tree should be searched
> >   * There are two files in the working tree, both tracked: fileA and fileB.
> > So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> > is searched.  fileC may be tracked/cached, but we don't search any
> > version of that file, because this particular command line is about
> > searching the working directory and fileC is not in the working
> > directory.  To the best of my knowledge, git grep has always behaved
> > that way.
> >
> > Users understand the idea of searching the working copy vs. the index
> > vs. "old" (or different) versions of the repository.  They also
> > understand that when searching the working copy, by default a subset
> > of the files are searched.  Tell me: given all this information here,
> > what possible explanation is there for SKIP_WORKTREE entries to be
> > translated into searches of the cache when --cached is not specified?
> > Please square that away with the fact that 'rm fileC' results in fileC
> > NOT being searched.
> >
> > It's just completely, utterly wrong.
>
> Makes sense, thanks. I agree that we shouldn't fall back to the cache
> when searching the working tree.
>
> > Also, hopefully this helps answer your question about --untracked and
> > skip_worktree.  --untracked is only useful when searching through the
> > working tree, and is entirely about adding the "untracked" category to
> > the things we search.  The skip_worktree bit is about adding more
> > granularity to the "tracked" category.  The two are thus entirely
> > orthogonal and --untracked shouldn't change behavior at all in the
> > face of sparse checkouts.
>
> Thanks, your explanation clarified the issue I had. I see now why
> --untracked and --ignore-sparsity don't make sense together.
>
> It also made me think about the combination of --cached and
> --untracked which, IIUC, should be prohibited. I will add a patch in
> v2, making git-grep error out in this case.
>
> > And I also think it explains more when the sparsity patterns and
> > --ignore-sparsity-patterns flags even matter.  The division of working
> > tree files which were tracked into two subsets (those that match
> > sparsity patterns and those that don't) didn't matter because only one
> > of those two sets existed and could be searched.  So the question is,
> > when can the sparsity pattern divide a set of files into two subsets
> > where both are non-empty?  And the answer is when --cached or REVISION
> > is specified.
>
> Makes sense. I will add in --ignore-sparsity's description that it is
> only relevant with --cached or REVISION, as you previously suggested.
> When it is used outside of these cases, though, I think we could just
> warn that --ignore-sparsity will be discarded (to avoid erroring out
> when users have grep.ignoreSparsity enabled).

Not grep.ignoreSparsity but core.ignoreSparsity or core.$WHATEVER  ;-)
Elijah Newren March 31, 2020, 7:12 p.m. UTC | #12
// adding Jonathan Tan to cc based on the fact that we keep bringing
up partial clones and how it relates...

On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > >>
> > >> In the last commit, git-grep learned to honor sparsity patterns. For
> > >> some use cases, however, it may be desirable to search outside the
> > >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> > >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> > >> allow setting this behavior by default.
> > >
> > > Should `--ignore-sparsity` be a global git option rather than a
> > > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
> >
> > Great question.  I think "git diff" with various options would also
> > want to optionally be able to be confined within the sparse cone, or
> > checking the entire world by lazily fetching outside the sparsity.
> [...]
> > Regardless of the choice of the default, it would be a good
> > idea to make the subcommands consistently offer the same default and
> > allow the non-default views with the same UI.
>
> Yeah, it seems like a sensible path. Regarding implementation, there
> is the question that Elijah raised, of whether to use a global git
> option or separate but consistent options for each subcommand. I don't
> have much experience with sparse checkout to argument for one or
> another, so I would like to hear what others have to say about it.
>
> A question that comes to my mind regarding the global git option is:
> will --ignore-sparsity (or whichever name we choose for it [1]) be
> sufficient for all subcommands? Or may some of them require additional
> options for command-specific behaviors concerning sparsity patterns?
> Also, would it be OK if we just ignored the option in commands that do
> not operate differently in sparse checkouts (maybe, fetch, branch and
> send-email, for example)? And would it make sense to allow
> constructions such as `git --ignore-sparsity checkout` or even `git
> --ignore-sparsity sparse-checkout ...`?

I think the same option would probably be sufficient for all
subcommands, though I have a minor question about the merge machinery
(below).  And generally, I think it would be unusual for people to
pass the command line flag; I suspect most would set a config option
for most cases and then only occasionally override it on the command
line.  Since that config option would always be set, I'd expect
commands that are unaffected to just ignore it (much like both "git -c
merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
will both ignore the irrelevant options rather than trying to detect
that they were specified and error out).

> [1]: Does anyone have suggestions for the option/config name? The best
> I could come up with so far (without being too verbose) is
> --no-sparsity-constraints. But I fear this might sound generic. As
> Elijah already mentioned, --ignore-sparsity is not good either, as it
> introduces double negatives in code...

Does verbosity matter that much?  I think people would set it in
config, and tab completion would make it pretty easy to complete in
any event.

Anyway, maybe it will help if I provide a very rough first draft of
what changes we could introduce to Documentation/config/core.txt, and
then ask a bunch of my own questions about it below:

"""
core.restrictToSparsePaths::
        Only meaningful in conjuntion with core.sparseCheckoutCone.
        This option extends sparse checkouts (which limit which paths
        are written to the worktree), so that output and operations
        are also limited to the sparsity paths where possible and
        implemented.  The purpose of this option is to (1) focus
        output for the user on the portion of the repository that is
        of interest to them, and (2) enable potentially dramatic
        performance improvements, especially in conjunction with
        partial clones.
+
When this option is true, git commands such as log, diff, and grep may
limit their output to the directories specified by the sparse cone, or
to the intersection of those paths and any (like `*.c) that the user
might also specify on the command line.  (Note that this limit for
diff and grep only becomes relevant with --cached or when specifying a
REVISION, since a search of the working tree will automatically be
limited to the sparse paths that are present.)  Also, commands like
bisect may only select commits which modify paths within the sparsity
cone.  The merge machinery may use the sparse paths as a heuristic to
avoid trying to detect renames from within the sparsity cone to
outside the sparsity cone when at least one side of history only
touches paths within the sparsity cone (this can make the merge
machinery faster, but may risk modify/delete conflicts since upstream
can rename a file within the sparsity paths to a location outside
them).  Commands which export, integrity check, or create history will
always operate on full trees (e.g. fast-export, format-patch, fsck,
commit, etc.), unaffected by any sparsity patterns.
"""

Several questions here, of course:

  * do people like or hate the name?  indifferent?  have alternate ideas?
  * should we restrict this to core.sparseCheckoutCone as I suggested
above or also allow people to do it with core.sparseCheckout without
the cone mode?  I think attempting to weld partial clones together
with core.sparseCheckout is crazy, so I'm tempted to just make it be
specific to cone mode and to push people to use it.  But I'm
interested in thoughts on the matter.
  * should worktrees be affected?  (I've been an advocate of new
worktrees inheriting the sparse patterns of the worktree in use at the
time the new worktree was created.  Junio once suggested he didn't
like that and that worktrees should start out dense.  That seems
problematic to me in big repos with partial clones and sparse chckouts
in use.  Perhaps dense new worktrees is the behavior you get when
core.restrictToSparsePaths is false?)
  * does my idea for the merge machinery make folks uncomfortable?
Should that be a different option?  Being able to do trivial *tree*
merges for the huge portion of the tree outside the sparsity paths
would be a huge win, especially with partial clones, but it certainly
is different.  Then again, microsoft has disabled rename detection
entirely based on it being too expensive, so perhaps the idea of
rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
is a reasonable middle ground between off and on for rename detection.
  * what should the default be?  Junio suggested elsewhere[1] that
sparse-checkouts and partial clones should probably be welded together
(with partial clones downloading just history in the sparsity paths by
default), in which case having this option be true would be useful.
But it may also be slightly weird because it'll probably take us a
while to implement this; while the big warning in
git-sparse-checkout.txt certainly allows this:
        THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
        COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
        THE FUTURE.
It may still be slightly weird that the default behavior of commands
in the presence of sparse-checkouts changes release to release until
we get it all implemented.

[1] https://lore.kernel.org/git/xmqqh7ycw5lc.fsf@gitster.c.googlers.com/
Derrick Stolee March 31, 2020, 8:02 p.m. UTC | #13
On 3/31/2020 3:12 PM, Elijah Newren wrote:
> // adding Jonathan Tan to cc based on the fact that we keep bringing
> up partial clones and how it relates...
> 
> On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
>>
>> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> Elijah Newren <newren@gmail.com> writes:
>>>
>>>> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
>>>> <matheus.bernardino@usp.br> wrote:
>>>>>
>>>>> In the last commit, git-grep learned to honor sparsity patterns. For
>>>>> some use cases, however, it may be desirable to search outside the
>>>>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>>>>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>>>>> allow setting this behavior by default.
>>>>
>>>> Should `--ignore-sparsity` be a global git option rather than a
>>>> grep-specific one?  Also, should grep.ignoreSparsity rather be
>>>> core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>>>
>>> Great question.  I think "git diff" with various options would also
>>> want to optionally be able to be confined within the sparse cone, or
>>> checking the entire world by lazily fetching outside the sparsity.
>> [...]
>>> Regardless of the choice of the default, it would be a good
>>> idea to make the subcommands consistently offer the same default and
>>> allow the non-default views with the same UI.
>>
>> Yeah, it seems like a sensible path. Regarding implementation, there
>> is the question that Elijah raised, of whether to use a global git
>> option or separate but consistent options for each subcommand. I don't
>> have much experience with sparse checkout to argument for one or
>> another, so I would like to hear what others have to say about it.
>>
>> A question that comes to my mind regarding the global git option is:
>> will --ignore-sparsity (or whichever name we choose for it [1]) be
>> sufficient for all subcommands? Or may some of them require additional
>> options for command-specific behaviors concerning sparsity patterns?
>> Also, would it be OK if we just ignored the option in commands that do
>> not operate differently in sparse checkouts (maybe, fetch, branch and
>> send-email, for example)? And would it make sense to allow
>> constructions such as `git --ignore-sparsity checkout` or even `git
>> --ignore-sparsity sparse-checkout ...`?
> 
> I think the same option would probably be sufficient for all
> subcommands, though I have a minor question about the merge machinery
> (below).  And generally, I think it would be unusual for people to
> pass the command line flag; I suspect most would set a config option
> for most cases and then only occasionally override it on the command
> line.  Since that config option would always be set, I'd expect
> commands that are unaffected to just ignore it (much like both "git -c
> merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
> will both ignore the irrelevant options rather than trying to detect
> that they were specified and error out).
> 
>> [1]: Does anyone have suggestions for the option/config name? The best
>> I could come up with so far (without being too verbose) is
>> --no-sparsity-constraints. But I fear this might sound generic. As
>> Elijah already mentioned, --ignore-sparsity is not good either, as it
>> introduces double negatives in code...
> 
> Does verbosity matter that much?  I think people would set it in
> config, and tab completion would make it pretty easy to complete in
> any event.
> 
> Anyway, maybe it will help if I provide a very rough first draft of
> what changes we could introduce to Documentation/config/core.txt, and
> then ask a bunch of my own questions about it below:
> 
> """
> core.restrictToSparsePaths::
>         Only meaningful in conjuntion with core.sparseCheckoutCone.
>         This option extends sparse checkouts (which limit which paths
>         are written to the worktree), so that output and operations
>         are also limited to the sparsity paths where possible and
>         implemented.  The purpose of this option is to (1) focus
>         output for the user on the portion of the repository that is
>         of interest to them, and (2) enable potentially dramatic
>         performance improvements, especially in conjunction with
>         partial clones.
> +
> When this option is true, git commands such as log, diff, and grep may
> limit their output to the directories specified by the sparse cone, or
> to the intersection of those paths and any (like `*.c) that the user
> might also specify on the command line.  (Note that this limit for
> diff and grep only becomes relevant with --cached or when specifying a
> REVISION, since a search of the working tree will automatically be
> limited to the sparse paths that are present.)  Also, commands like
> bisect may only select commits which modify paths within the sparsity
> cone.  The merge machinery may use the sparse paths as a heuristic to
> avoid trying to detect renames from within the sparsity cone to
> outside the sparsity cone when at least one side of history only
> touches paths within the sparsity cone (this can make the merge
> machinery faster, but may risk modify/delete conflicts since upstream
> can rename a file within the sparsity paths to a location outside
> them).  Commands which export, integrity check, or create history will
> always operate on full trees (e.g. fast-export, format-patch, fsck,
> commit, etc.), unaffected by any sparsity patterns.
> """
> 
> Several questions here, of course:
> 
>   * do people like or hate the name?  indifferent?  have alternate ideas?

It's probably time to create a 'sparse-checkout' config space. That
would allow

	sparse-checkout.restrictGrep = true

as an option. Or a more general

	sparse-checkout.restrictCommands = true

to make it clear that it affects multiple commands.

>   * should we restrict this to core.sparseCheckoutCone as I suggested
> above or also allow people to do it with core.sparseCheckout without
> the cone mode?  I think attempting to weld partial clones together
> with core.sparseCheckout is crazy, so I'm tempted to just make it be
> specific to cone mode and to push people to use it.  But I'm
> interested in thoughts on the matter.

Personally, I prefer cone mode and think it covers 99% of cases.
However, there are some who are using a big directory full of large
binaries and relying on file-prefix matches to get only the big
binaries they need. Until they restructure their repositories to
take advantage of cone mode, we should be considerate of the full
sparse-checkout specification when possible.

>   * should worktrees be affected?  (I've been an advocate of new
> worktrees inheriting the sparse patterns of the worktree in use at the
> time the new worktree was created.  Junio once suggested he didn't
> like that and that worktrees should start out dense.  That seems
> problematic to me in big repos with partial clones and sparse chckouts
> in use.  Perhaps dense new worktrees is the behavior you get when
> core.restrictToSparsePaths is false?)

We should probably consider a `--sparse` option for `git worktree add`
so we can allow interested users to add worktrees that initialize to
a sparse-checkout. Optionally create a config option that would copy
the sparse-checkout file from the current repo to the worktree.

>   * does my idea for the merge machinery make folks uncomfortable?
> Should that be a different option?  Being able to do trivial *tree*
> merges for the huge portion of the tree outside the sparsity paths
> would be a huge win, especially with partial clones, but it certainly
> is different.  Then again, microsoft has disabled rename detection
> entirely based on it being too expensive, so perhaps the idea of
> rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
> is a reasonable middle ground between off and on for rename detection.

The part where you say " when at least one side of history only
touches paths within the sparsity cone" makes me want to entertain
the idea if it can be done cleanly.

I'm more concerned about the "git bisect" logic being restricted to
the cone, since that is such an open-ended command for what is
considered "good" or "bad".

>   * what should the default be?  Junio suggested elsewhere[1] that
> sparse-checkouts and partial clones should probably be welded together
> (with partial clones downloading just history in the sparsity paths by
> default), in which case having this option be true would be useful.

My opinion on this is as follows: filtering blobs based on sparse-
checkout patterns does not filter enough, and filtering trees based
on sparse-checkout patterns filters too much. The costs are just
flipped: having extra trees is not a huge problem but recovering from
a "tree miss" is problematic. Having extra blobs is painful, but
recovering from a "blob miss" is not a big deal.

> But it may also be slightly weird because it'll probably take us a
> while to implement this; while the big warning in
> git-sparse-checkout.txt certainly allows this:
>         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
>         COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
>         THE FUTURE.
> It may still be slightly weird that the default behavior of commands
> in the presence of sparse-checkouts changes release to release until
> we get it all implemented.

I appreciate that we put that warning at the top. We will be
able to do more experimental things with the feature because
of it. The idea I'm toying with is to have "git clone --sparse"
set core.sparseCheckoutCone = true.

Also, if we are creating the "sparse-checkout.*" config space,
we should "rename" core.sparseCheckoutCone to sparse-checkout.coneMode
or something. We would need to support both for a while, for sure.

Thanks,
-Stolee
Matheus Tavares April 27, 2020, 5:15 p.m. UTC | #14
Hi, Stolee and Elijah

I think I just finished addressing the comments on patch 2/3 [1]. And
I'm now looking at the ones in 3/3 (this one). Below are some
questions, just to make sure I'm going in the right direction with
this one.

On Tue, Mar 31, 2020 at 5:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/31/2020 3:12 PM, Elijah Newren wrote:
> >
> > Anyway, maybe it will help if I provide a very rough first draft of
> > what changes we could introduce to Documentation/config/core.txt, and
> > then ask a bunch of my own questions about it below:
> >
> > """
> > core.restrictToSparsePaths::
> >         Only meaningful in conjuntion with core.sparseCheckoutCone.
> >         This option extends sparse checkouts (which limit which paths
> >         are written to the worktree), so that output and operations
> >         are also limited to the sparsity paths where possible and
> >         implemented.  The purpose of this option is to (1) focus
> >         output for the user on the portion of the repository that is
> >         of interest to them, and (2) enable potentially dramatic
> >         performance improvements, especially in conjunction with
> >         partial clones.
...
> > """
> >
> > Several questions here, of course:
> >
> >   * do people like or hate the name?  indifferent?  have alternate ideas?
>
> It's probably time to create a 'sparse-checkout' config space. That
> would allow
>
>         sparse-checkout.restrictGrep = true
>
> as an option. Or a more general
>
>         sparse-checkout.restrictCommands = true
>
> to make it clear that it affects multiple commands.

If we are creating the new namespace, 'core.sparseCheckout' should
also be renamed to something like 'sparse-checkout.enabled', right?
And maybe we could use 'sparsecheckout.*', instead? That seems to be
the convention for settings on hyphenated commands (as in sendemail.*,
uploadpack.* and gitgui.*).

As for compatibility, when running `git sparse-checkout init`, if the
config file already has the core.sparseCheckout setting, should we
remove it? Or just add the new sparsecheckout.enabled config, which
will always be read first?

Also, should we emit a warning about the former being deprecated? The
good thing about deprecation warnings, IMO, is that users will know
the name change faster. But, at least for `git grep <tree>`, where we
read  core.sparseCheckout and core.sparseCheckoutCone for each
submodule and each tree, there would be too much pollution in the
output...

Finally, about restrictCommands, the idea is to have both
sparsecheckout.restrictCommands and `git --restrict-to-sparse-paths`,
right? For now, the option/setting would only affect grep, but support
would be added gradually to other commands in the future. I noticed
git-read-tree already has a --no-sparse-checkout option. Should we
remove this option in favor of the global
--[no]-restrict-to-sparse-paths?

Sorry for too many questions. I just wanted to make sure that I
understood the plan before diving into the implementation, to avoid
going in the wrong direction.

[1]: Here is a sneak peek for v2 of patch 2/3, in case you might want
to take a look:
https://github.com/matheustavares/git/commit/970ef529f1e8f719c4427bd9fea8205ada69d913
Elijah Newren April 29, 2020, 4:46 p.m. UTC | #15
On Mon, Apr 27, 2020 at 10:15 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Stolee and Elijah
>
> I think I just finished addressing the comments on patch 2/3 [1]. And
> I'm now looking at the ones in 3/3 (this one). Below are some
> questions, just to make sure I'm going in the right direction with
> this one.
>
> On Tue, Mar 31, 2020 at 5:02 PM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 3/31/2020 3:12 PM, Elijah Newren wrote:
> > >
> > > Anyway, maybe it will help if I provide a very rough first draft of
> > > what changes we could introduce to Documentation/config/core.txt, and
> > > then ask a bunch of my own questions about it below:
> > >
> > > """
> > > core.restrictToSparsePaths::
> > >         Only meaningful in conjuntion with core.sparseCheckoutCone.
> > >         This option extends sparse checkouts (which limit which paths
> > >         are written to the worktree), so that output and operations
> > >         are also limited to the sparsity paths where possible and
> > >         implemented.  The purpose of this option is to (1) focus
> > >         output for the user on the portion of the repository that is
> > >         of interest to them, and (2) enable potentially dramatic
> > >         performance improvements, especially in conjunction with
> > >         partial clones.
> ...
> > > """
> > >
> > > Several questions here, of course:
> > >
> > >   * do people like or hate the name?  indifferent?  have alternate ideas?
> >
> > It's probably time to create a 'sparse-checkout' config space. That
> > would allow
> >
> >         sparse-checkout.restrictGrep = true
> >
> > as an option. Or a more general
> >
> >         sparse-checkout.restrictCommands = true
> >
> > to make it clear that it affects multiple commands.
>
> If we are creating the new namespace, 'core.sparseCheckout' should
> also be renamed to something like 'sparse-checkout.enabled', right?
> And maybe we could use 'sparsecheckout.*', instead? That seems to be
> the convention for settings on hyphenated commands (as in sendemail.*,
> uploadpack.* and gitgui.*).

Or maybe just call the namespace 'sparse.*' if we're going that route?

> As for compatibility, when running `git sparse-checkout init`, if the
> config file already has the core.sparseCheckout setting, should we
> remove it? Or just add the new sparsecheckout.enabled config, which
> will always be read first?

We seem to have two competing issues:

  * If you remove the core.sparseCheckout setting in favor of
sparse.enabled, then people can't use the repo with an older version
of git.  (This may be acceptable, but we've generally been somewhat
careful with index extensions and such to avoid such a state, with
slow transitions with index and pack versions and such.)
  * If you leave the core.sparseCheckout setting around as well as
having sparse.enabled, then we have two different settings that we can
keep in sync with newer git but which older git will only update one
of.  What do we do if we detect they are out of sync?  Throw an error?
 Pretend that one overrules?  If the older one overrules, what do we
accomplish with the new name?  If the newer name overrules, doesn't
that also potentially break using an older git version?

I'm not sure what to do here.  Maybe people who have worked on index
version and pack version transitions have some good suggestions for
us?

> Also, should we emit a warning about the former being deprecated? The
> good thing about deprecation warnings, IMO, is that users will know
> the name change faster. But, at least for `git grep <tree>`, where we
> read  core.sparseCheckout and core.sparseCheckoutCone for each
> submodule and each tree, there would be too much pollution in the
> output...

We've already started to steer away from users setting these values
and just have them get set/updated/unset by sparse-checkout init and
sparse-checkout disable.  Since users won't be setting these directly,
I don't think deprecation warnings make sense.

> Finally, about restrictCommands, the idea is to have both
> sparsecheckout.restrictCommands and `git --restrict-to-sparse-paths`,
> right? For now, the option/setting would only affect grep, but support
> would be added gradually to other commands in the future. I noticed

There should be both a config option and a global command line flag,
yes.  We might need the flag to default to
not-restricting-to-sparse-paths for now because that's consistent with
the only thing the current implementation of these commands can do.
But I'm really worried that this will remain the default and we'll
force users in the future to jump through a bunch of hoops to do a
simple thing:

$ git clone --sparse-paths $WANTED_DIRECTORIES user@server.name:path/to/repo.git
$ cd repo
<Enjoy their small view of the repo without every command suddenly
requiring a network connection and downloading huge reams of data they
don't even care about.>

> git-read-tree already has a --no-sparse-checkout option. Should we
> remove this option in favor of the global
> --[no]-restrict-to-sparse-paths?

read-tree is plumbing; we can't break backward compatibility.  We'll
have to leave that option there and just document that the two options
do the same thing.

> Sorry for too many questions. I just wanted to make sure that I
> understood the plan before diving into the implementation, to avoid
> going in the wrong direction.

Nah, these are all good questions.  Sorry for the delay in getting back to you.
Elijah Newren April 29, 2020, 5:21 p.m. UTC | #16
Sorry for the super late reply...

On Tue, Mar 31, 2020 at 1:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/31/2020 3:12 PM, Elijah Newren wrote:
> > // adding Jonathan Tan to cc based on the fact that we keep bringing
> > up partial clones and how it relates...
> >
> > On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
> >>>
> >>> Elijah Newren <newren@gmail.com> writes:
> >>>
> >>>> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> >>>> <matheus.bernardino@usp.br> wrote:
> >>>>>
> >>>>> In the last commit, git-grep learned to honor sparsity patterns. For
> >>>>> some use cases, however, it may be desirable to search outside the
> >>>>> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >>>>> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >>>>> allow setting this behavior by default.
> >>>>
> >>>> Should `--ignore-sparsity` be a global git option rather than a
> >>>> grep-specific one?  Also, should grep.ignoreSparsity rather be
> >>>> core.ignoreSparsity or core.searchOutsideSparsePaths or something?
> >>>
> >>> Great question.  I think "git diff" with various options would also
> >>> want to optionally be able to be confined within the sparse cone, or
> >>> checking the entire world by lazily fetching outside the sparsity.
> >> [...]
> >>> Regardless of the choice of the default, it would be a good
> >>> idea to make the subcommands consistently offer the same default and
> >>> allow the non-default views with the same UI.
> >>
> >> Yeah, it seems like a sensible path. Regarding implementation, there
> >> is the question that Elijah raised, of whether to use a global git
> >> option or separate but consistent options for each subcommand. I don't
> >> have much experience with sparse checkout to argument for one or
> >> another, so I would like to hear what others have to say about it.
> >>
> >> A question that comes to my mind regarding the global git option is:
> >> will --ignore-sparsity (or whichever name we choose for it [1]) be
> >> sufficient for all subcommands? Or may some of them require additional
> >> options for command-specific behaviors concerning sparsity patterns?
> >> Also, would it be OK if we just ignored the option in commands that do
> >> not operate differently in sparse checkouts (maybe, fetch, branch and
> >> send-email, for example)? And would it make sense to allow
> >> constructions such as `git --ignore-sparsity checkout` or even `git
> >> --ignore-sparsity sparse-checkout ...`?
> >
> > I think the same option would probably be sufficient for all
> > subcommands, though I have a minor question about the merge machinery
> > (below).  And generally, I think it would be unusual for people to
> > pass the command line flag; I suspect most would set a config option
> > for most cases and then only occasionally override it on the command
> > line.  Since that config option would always be set, I'd expect
> > commands that are unaffected to just ignore it (much like both "git -c
> > merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
> > will both ignore the irrelevant options rather than trying to detect
> > that they were specified and error out).
> >
> >> [1]: Does anyone have suggestions for the option/config name? The best
> >> I could come up with so far (without being too verbose) is
> >> --no-sparsity-constraints. But I fear this might sound generic. As
> >> Elijah already mentioned, --ignore-sparsity is not good either, as it
> >> introduces double negatives in code...
> >
> > Does verbosity matter that much?  I think people would set it in
> > config, and tab completion would make it pretty easy to complete in
> > any event.
> >
> > Anyway, maybe it will help if I provide a very rough first draft of
> > what changes we could introduce to Documentation/config/core.txt, and
> > then ask a bunch of my own questions about it below:
> >
> > """
> > core.restrictToSparsePaths::
> >         Only meaningful in conjuntion with core.sparseCheckoutCone.
> >         This option extends sparse checkouts (which limit which paths
> >         are written to the worktree), so that output and operations
> >         are also limited to the sparsity paths where possible and
> >         implemented.  The purpose of this option is to (1) focus
> >         output for the user on the portion of the repository that is
> >         of interest to them, and (2) enable potentially dramatic
> >         performance improvements, especially in conjunction with
> >         partial clones.
> > +
> > When this option is true, git commands such as log, diff, and grep may
> > limit their output to the directories specified by the sparse cone, or
> > to the intersection of those paths and any (like `*.c) that the user
> > might also specify on the command line.  (Note that this limit for
> > diff and grep only becomes relevant with --cached or when specifying a
> > REVISION, since a search of the working tree will automatically be
> > limited to the sparse paths that are present.)  Also, commands like
> > bisect may only select commits which modify paths within the sparsity
> > cone.  The merge machinery may use the sparse paths as a heuristic to
> > avoid trying to detect renames from within the sparsity cone to
> > outside the sparsity cone when at least one side of history only
> > touches paths within the sparsity cone (this can make the merge
> > machinery faster, but may risk modify/delete conflicts since upstream
> > can rename a file within the sparsity paths to a location outside
> > them).  Commands which export, integrity check, or create history will
> > always operate on full trees (e.g. fast-export, format-patch, fsck,
> > commit, etc.), unaffected by any sparsity patterns.
> > """
> >
> > Several questions here, of course:
> >
> >   * do people like or hate the name?  indifferent?  have alternate ideas?
>
> It's probably time to create a 'sparse-checkout' config space. That
> would allow
>
>         sparse-checkout.restrictGrep = true
>
> as an option. Or a more general
>
>         sparse-checkout.restrictCommands = true
>
> to make it clear that it affects multiple commands.

As I mentioned to Matheus, would a "sparse" config space be nicer?

> >   * should we restrict this to core.sparseCheckoutCone as I suggested
> > above or also allow people to do it with core.sparseCheckout without
> > the cone mode?  I think attempting to weld partial clones together
> > with core.sparseCheckout is crazy, so I'm tempted to just make it be
> > specific to cone mode and to push people to use it.  But I'm
> > interested in thoughts on the matter.
>
> Personally, I prefer cone mode and think it covers 99% of cases.
> However, there are some who are using a big directory full of large
> binaries and relying on file-prefix matches to get only the big
> binaries they need. Until they restructure their repositories to
> take advantage of cone mode, we should be considerate of the full
> sparse-checkout specification when possible.

I agree with everything you say here except the last word; if you
replaced "possible" with "practical" then I'd agree.  In particular, I
like the idea of a partial clone that defaults to grabbing all the
blobs in the sparse path specification; I think it'd be reasonable to
transfer the sparseCone specification to the server and have it use
that to walk history and make a packfile.  Transfering a sparse
specification that does not match the cone mode requirements to a
server and making it use that as it walks over all of history sounds
like a good way to overload the server.

> >   * should worktrees be affected?  (I've been an advocate of new
> > worktrees inheriting the sparse patterns of the worktree in use at the
> > time the new worktree was created.  Junio once suggested he didn't
> > like that and that worktrees should start out dense.  That seems
> > problematic to me in big repos with partial clones and sparse chckouts
> > in use.  Perhaps dense new worktrees is the behavior you get when
> > core.restrictToSparsePaths is false?)
>
> We should probably consider a `--sparse` option for `git worktree add`
> so we can allow interested users to add worktrees that initialize to
> a sparse-checkout. Optionally create a config option that would copy
> the sparse-checkout file from the current repo to the worktree.

Okay, but if someone runs a future

$ git clone --sparse $RELEVANT_DIRECTORIES user@server.name:path/to/repo.git
$ cd repo
<Blissfully work in their smaller repo without commands suddenly
downloading reams of unwanted data>

should the clone command automatically set this option for the user?
I don't like the idea of users having to remember to set this option
(and the restrictToSparsePaths option, and whatever other options are
needed to work in their smaller environment).  I'd really like there
to be a single flag, in the form of some clone option, that sets all
of this up.

> >   * does my idea for the merge machinery make folks uncomfortable?
> > Should that be a different option?  Being able to do trivial *tree*
> > merges for the huge portion of the tree outside the sparsity paths
> > would be a huge win, especially with partial clones, but it certainly
> > is different.  Then again, microsoft has disabled rename detection
> > entirely based on it being too expensive, so perhaps the idea of
> > rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
> > is a reasonable middle ground between off and on for rename detection.
>
> The part where you say " when at least one side of history only
> touches paths within the sparsity cone" makes me want to entertain
> the idea if it can be done cleanly.

Yeah, I still have to dig in and verify that this really works.

> I'm more concerned about the "git bisect" logic being restricted to
> the cone, since that is such an open-ended command for what is
> considered "good" or "bad".

If the sparse checkout has sufficient information for them to build
and test whatever predicate they are interested in, then surely
bisecting in a way that restricts to the cone would be a nice
optimization, right?  And if the cone doesn't have enough information
for them to build and test commits, then they would need to leave the
sparse checkout in order to bisect anyway.

> >   * what should the default be?  Junio suggested elsewhere[1] that
> > sparse-checkouts and partial clones should probably be welded together
> > (with partial clones downloading just history in the sparsity paths by
> > default), in which case having this option be true would be useful.
>
> My opinion on this is as follows: filtering blobs based on sparse-
> checkout patterns does not filter enough, and filtering trees based
> on sparse-checkout patterns filters too much. The costs are just
> flipped: having extra trees is not a huge problem but recovering from
> a "tree miss" is problematic. Having extra blobs is painful, but
> recovering from a "blob miss" is not a big deal.

Sounds like --filter=blob:none already solves your issues.  It doesn't
make me happy; I really want the history within the sparse cone to be
downloaded as part of the initial clone.  (I can see various ways that
downloading all trees would be easier, so if we end up downloading all
commits and all trees and just the blobs within the sparse cone, that
sounds fine to me.)

> > But it may also be slightly weird because it'll probably take us a
> > while to implement this; while the big warning in
> > git-sparse-checkout.txt certainly allows this:
> >         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
> >         COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
> >         THE FUTURE.
> > It may still be slightly weird that the default behavior of commands
> > in the presence of sparse-checkouts changes release to release until
> > we get it all implemented.
>
> I appreciate that we put that warning at the top. We will be
> able to do more experimental things with the feature because
> of it. The idea I'm toying with is to have "git clone --sparse"
> set core.sparseCheckoutCone = true.

Sounds good to me.  We might also want to set worktrees.copySparsity
and sparse.restrictToCone (or whatever these end up being named) as
well.

> Also, if we are creating the "sparse-checkout.*" config space,
> we should "rename" core.sparseCheckoutCone to sparse-checkout.coneMode
> or something. We would need to support both for a while, for sure.

And, if we automatically migrate the setting and delete the old one,
do we prevent someone from successfully using an older git version
with the repo?  Or, if we don't automatically unset the old one, do we
risk the two values getting out of sync if they do switch to an older
git version?
diff mbox series

Patch

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 76689771aa..c1d49484c8 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -25,3 +25,6 @@  grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+grep.ignoreSparsity::
+	If set to true, enable `--ignore-sparsity` by default.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 97e25d7b1b..5c5c66c056 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -65,6 +65,11 @@  OPTIONS
 	mechanism.  Only useful when searching files in the current directory
 	with `--no-index`.
 
+--ignore-sparsity::
+	In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
+	also search in files that are outside the sparse checkout. This option
+	cannot be used with --no-index or --untracked.
+
 --recurse-submodules::
 	Recursively search in each submodule that has been initialized and
 	checked out in the repository.  When used in combination with the
diff --git a/builtin/grep.c b/builtin/grep.c
index 52ec72a036..17eae3edd6 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -33,6 +33,8 @@  static char const * const grep_usage[] = {
 
 static int recurse_submodules;
 
+static int ignore_sparsity = 0;
+
 static int num_threads;
 
 static pthread_t *threads;
@@ -292,6 +294,9 @@  static int grep_cmd_config(const char *var, const char *value, void *cb)
 	if (!strcmp(var, "submodule.recurse"))
 		recurse_submodules = git_config_bool(var, value);
 
+	if (!strcmp(var, "grep.ignoresparsity"))
+		ignore_sparsity = git_config_bool(var, value);
+
 	return st;
 }
 
@@ -487,7 +492,7 @@  static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce))
+		if (!ignore_sparsity && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -502,7 +507,8 @@  static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID)) {
+			if (cached || (ce->ce_flags & CE_VALID) ||
+			    ce_skip_worktree(ce)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -549,7 +555,7 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
-	if (from_commit && repo_read_index(repo) < 0)
+	if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
 	while (tree_entry(tree, &entry)) {
@@ -570,7 +576,7 @@  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
-		if (from_commit) {
+		if (!ignore_sparsity && from_commit) {
 			int pos = index_name_pos(repo->index,
 						 base->buf + tn_len,
 						 base->len - tn_len);
@@ -932,6 +938,8 @@  int cmd_grep(int argc, const char **argv, const char *prefix)
 		OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
 			   N_("allow calling of grep(1) (ignored by this build)"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
+			 N_("also search in files outside the sparse checkout")),
 		OPT_END()
 	};
 
@@ -1073,6 +1081,9 @@  int cmd_grep(int argc, const char **argv, const char *prefix)
 	if (recurse_submodules && untracked)
 		die(_("--untracked not supported with --recurse-submodules"));
 
+	if (ignore_sparsity && (!use_index || untracked))
+		die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
+
 	if (show_in_pager) {
 		if (num_threads > 1)
 			warning(_("invalid option combination, ignoring --threads"));
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index fccf44e829..1891ddea57 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -85,4 +85,46 @@  test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
 	test_cmp expect_t-tree actual_t-tree
 '
 
+for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
+	   'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
+do
+	test_expect_success "$cmd should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd --cached should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_t-commit <<-EOF &&
+		t-commit:a:text
+		t-commit:b:text
+		t-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" t-commit >actual_t-commit &&
+		test_cmp expect_t-commit actual_t-commit
+	'
+done
+
 test_done