sparse-checkout.txt: new document with sparse-checkout directions

Message ID pull.1367.git.1664064588846.gitgitgadget@gmail.com (mailing list archive)
State New, archived

Commit Message

Elijah Newren Sept. 25, 2022, 12:09 a.m. UTC
From: Elijah Newren <newren@gmail.com>

Once upon a time, Matheus wrote some patches to make
   git grep [--cached | <REVISION>] ...
restrict its output to the sparsity specification when working in a
sparse checkout[1].  That effort got derailed by two things:

  (1) The --sparse-index work just beginning which we wanted to avoid
      creating conflicts for
  (2) Never deciding on flag and config names and planned high level
      behavior for all commands.

More recently, Shaoxuan implemented a more limited form of Matheus'
patches that only affected --cached, using a different flag name,
but also changing the default behavior in line with what Matheus did.
This again highlighted the fact that we never decided on command line
flag names, config option names, and the big picture path forward.

The --sparse-index work has been mostly complete (or at least released
into production even if some small edges remain) for quite some time
now.  We have also had several discussions on flag and config names,
though we never came to solid conclusions.  Stolee once upon a time
suggested putting all these into some document in
Documentation/technical[3], which Victoria recently also requested[4].
I'm behind the times, but here's a patch attempting to finally do that.

Note that the "Implementation Questions" section is pretty large,
reflecting the fact that this is perhaps more RFC than proposal.

[1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
    (See his second link in that email in particular)
[2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
[3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
    (Scroll to the very end for the final few paragraphs)
[4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    [RFC] sparse-checkout.txt: new document with sparse-checkout directions
    
    As noted in the title and commit message, while I have some goals &
    plans proposed here, I have a lot more in the questions category.
    Thoughts and opinions very much welcome.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1367

 Documentation/technical/sparse-checkout.txt | 670 ++++++++++++++++++++
 1 file changed, 670 insertions(+)
 create mode 100644 Documentation/technical/sparse-checkout.txt


base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8

Comments

Junio C Hamano Sept. 26, 2022, 5:20 p.m. UTC | #1
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> Once upon a time, Matheus wrote some patches to make
>    git grep [--cached | <REVISION>] ...
> restrict its output to the sparsity specification when working in a
> sparse checkout[1].  That effort got derailed by two things:
>
>   (1) The --sparse-index work just beginning which we wanted to avoid
>       creating conflicts for
>   (2) Never deciding on flag and config names and planned high level
>       behavior for all commands.
>
> More recently, Shaoxuan implemented a more limited form of Matheus'
> patches that only affected --cached, using a different flag name,
> but also changing the default behavior in line with what Matheus did.
> This again highlighted the fact that we never decided on command line
> flag names, config option names, and the big picture path forward.
>
> The --sparse-index work has been mostly complete (or at least released
> into production even if some small edges remain) for quite some time
> now.  We have also had several discussions on flag and config names,
> though we never came to solid conclusions.  Stolee once upon a time
> suggested putting all these into some document in
> Documentation/technical[3], which Victoria recently also requested[4].
> I'm behind the times, but here's a patch attempting to finally do that.
>
> Note that the "Implementation Questions" section is pretty large,
> reflecting the fact that this is perhaps more RFC than proposal.

Thanks for starting this.  The document even in the current
iteration with a large set of "questions" helped me refresh my
memory on where we are in the bigger picture, and will offer us a
good frame of reference.

Junio C Hamano Sept. 26, 2022, 5:38 p.m. UTC | #2
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +    In the case of am and apply, those commands only operate on the
> +    working tree, so they are kind of in the same boat as stash.

"apply" does not touch the HEAD but it can touch the index; when it
operates with the "--cached" or the "--index" option, it should not
be considered as a working-tree-only command.

"am" is about recording what is in the patch as a commit.

> +    Perhaps `git am` could run `git sparse-checkout reapply`
> +    automatically afterward and move into a category more similar to
> +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> +    vivify files besides just conflicted ones when there are conflicts.

I do not particularly think it is so bad.

How would we handle the case where the user modifies paths outside
the sparse specification and makes a commit out of the result,
without using "am"?  We should be consistent with that use case, i.e.

    $ edit path/outside/sparse/specification
    $ git add path/outside/sparse/specification
    $ git commit

Do we require some "Yes, I am aware that I need to widen my sparse
specification to do this, because I am now stepping out of it, and I
understand that my sparse specification becomes wider after doing
this operation" confirmation with "add" or "commit"?  If not, then I
think "am" should silently widen just like these commands.  If they
do, then "am" should also require such an option.  Perhaps call it
"--widen-sparse" or whatever.

By the way, I like the term "sparse specification" very much, as
we should worry about non-cone mode as well.  Please use it
consistently in this document after getting a consensus that it
is a good phrase to use from others---I saw some other words
used after "sparse" elsewhere in this patch.

> +    In the case of ls-files, `git ls-files -t` is often used to see what
> +    is sparse and not, in which case restricting would not make sense.

I suspect that leaving it tree-wide would allow scripters to come up
with Porcelains that restrict to the sparse specification more
easily.
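
A minimal sketch of such a script, assuming a throwaway repository with
made-up directory names: the tree-wide `git ls-files -t` output tags
skip-worktree entries with `S` and ordinary cached entries with `H`, so
a wrapper can filter on the tag.

```shell
#!/bin/sh
# Sketch: restrict "git ls-files -t" output to the sparse specification
# by filtering on the status tag. Directory names are illustrative.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir inside outside
echo one >inside/a.txt
echo two >outside/b.txt
git add .
git -c user.name=Example -c user.email=example@example.com \
    commit -q -m initial
git sparse-checkout init --cone
git sparse-checkout set inside

# Tree-wide listing: "H" marks cached entries, "S" skip-worktree ones.
all=$(git ls-files -t)
echo "$all"

# Keep only the "H" entries and strip the tag.
restricted=$(git ls-files -t | sed -n 's/^H //p')
echo "$restricted"
```

Paths with unusual characters are quoted in this output, so a robust
wrapper would probably want `-z` and NUL-aware parsing instead of sed.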

Thanks.

Victoria Dye Sept. 26, 2022, 8:08 p.m. UTC | #3
Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> Once upon a time, Matheus wrote some patches to make
>    git grep [--cached | <REVISION>] ...
> restrict its output to the sparsity specification when working in a
> sparse checkout[1].  That effort got derailed by two things:
> 
>   (1) The --sparse-index work just beginning which we wanted to avoid
>       creating conflicts for
>   (2) Never deciding on flag and config names and planned high level
>       behavior for all commands.
> 
> More recently, Shaoxuan implemented a more limited form of Matheus'
> patches that only affected --cached, using a different flag name,
> but also changing the default behavior in line with what Matheus did.
> This again highlighted the fact that we never decided on command line
> flag names, config option names, and the big picture path forward.
> 
> The --sparse-index work has been mostly complete (or at least released
> into production even if some small edges remain) for quite some time
> now.  We have also had several discussions on flag and config names,
> though we never came to solid conclusions.  Stolee once upon a time
> suggested putting all these into some document in
> Documentation/technical[3], which Victoria recently also requested[4].
> I'm behind the times, but here's a patch attempting to finally do that.

Thank you so much for writing this!

> diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
> new file mode 100644
> index 00000000000..b213b2b3f35
> --- /dev/null
> +++ b/Documentation/technical/sparse-checkout.txt
> @@ -0,0 +1,670 @@
> +Table of contents:
> +
> +  * Purpose of sparse-checkouts
> +  * Desired behavior
> +  * Subcommand-dependent defaults
> +  * Implementation Questions
> +  * Implementation Goals/Plans
> +  * Known bugs
> +  * Reference Emails
> +
> +
> +=== Purpose of sparse-checkouts ===
> +
> +sparse-checkouts exist to allow users to work with a subset of their
> +files.
> +
> +The idea is simple enough, but there are two different high-level
> +usecases which affect how some Git subcommands should behave.  Further,
> +even if we only considered one of those usecases, sparse-checkouts
> +modify different subcommands in over a half dozen different ways.  Let's
> +start by considering the high level usecases in this section:
> +
> +  A) Users are _only_ interested in the sparse portion of the repo
> +
> +  B) Users want a sparse working tree, but are working in a larger whole

Both of these use cases make sense to me! Two thoughts/comments:

1. This could be a "me" problem, but I regularly struggle with "sparse"
   having different meanings in similar contexts. For example, a "sparse
   directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
   of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
   A quick note/section outlining some standard terminology would be
   immensely helpful.
2. One detail I'd like this document to clarify is the similarity/difference
   between "in the sparse portion of the repo" and "does not have
   'SKIP_WORKTREE' applied." In a well-behaved sparse-checkout, these are
   one and the same. However, if a user removes 'SKIP_WORKTREE' from a file
   (either with 'update-index' or by checking it out on disk), commands
   *sometimes* treat it as inside the sparse checkout (e.g., 'git status'),
   and some treat it as outside (e.g., 'git add'). Technically, I think it
   comes down to whether a command uses sparse patterns + 'SKIP_WORKTREE' to
   determine sparsity vs. just 'SKIP_WORKTREE', but the varying behavior
   feels inconsistent as an end user. 

> +
> +=== Desired behavior ===
> +
> +As noted in the previous section, despite the simple idea of just
> +working with a subset of files, there are a range of different
> +behavioral changes that need to be made to different subcommands to work
> +well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
> +examples.  In particular, at [2], we saw that mere composition of other
> +commands that individually worked correctly in a sparse-checkout context
> +did not imply that the higher level command would work correctly; it
> +sometimes requires further tweaks.  So, understanding these differences
> +can be beneficial.
> +
> +* Commands behaving the same regardless of high-level use-case
> +
> +  * commands that only look at files within the sparsity specification
> +
> +      * status
> +      * diff (without --cached or REVISION arguments)
> +      * grep (without --cached or REVISION arguments)

'status' and 'diff' currently show information about untracked files outside
the working tree (since, not being in the index, they don't have a
'SKIP_WORKTREE' to use). Should that change with the proposed '--restrict'
option?

> +
> +  * commands that restore files to the working tree that match sparsity patterns, and
> +    remove unmodified files that don't match those patterns:
> +
> +      * switch
> +      * checkout (the switch-like half)
> +      * read-tree
> +      * reset --hard
> +
> +      * `restore` & the restore-like half of `checkout` SHOULD be in this above
> +	category, but are buggy (see the "Known bugs" section below)

These commands do behave differently if there are *modified* files outside
the sparsity patterns:

- 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
  & advise on how to clean up the modified files to re-align with the
  sparsity patterns.
- 'reset --hard' silently drops the modified file and resets the
  'SKIP_WORKTREE' bit on the corresponding index entry.

With the exception of 'reset --hard' (aggressively and unconditionally
cleaning the worktree & index is an important aspect of the command, IMO),
I'd personally like to see commands in this category align with the behavior
of 'switch' where they don't already. Regardless of what we decide, though,
I think it's probably worth documenting the "modified outside of sparsity
patterns" case.

Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
of the entries it reads into the index. Having all of your files suddenly
appear "deleted" probably isn't desired behavior, so it might be a good
candidate for the "Known bugs" section. 

> +
> +  * commands that write conflicted files to the working tree, but otherwise will
> +    omit writing files that do not match the sparsity patterns:
> +
> +      * merge
> +      * rebase
> +      * cherry-pick
> +      * revert
> +
> +    Note that this somewhat depends upon the merge strategy being used:
> +      * `ort` behaves as described above
> +      * `recursive` tries to not vivify files unnecessarily, but does sometimes
> +	vivify files without conflicts.
> +      * `octopus` and `resolve` will always vivify any file changed in the merge
> +	relative to the first parent, which is rather suboptimal.
> +
> +  * commands that always ignore sparsity since commits must be full-tree
> +
> +      * archive
> +      * bundle
> +      * commit
> +      * format-patch
> +      * fast-export
> +      * fast-import
> +      * commit-tree
> +
> +  * commands that write any modified file to the working tree (conflicted or not,
> +    and whether those paths match sparsity patterns or not):
> +
> +      * stash
> +
> +      * am/apply probably should be in the above category, but need to be fixed to
> +	auto-vivify instead of failing
> +
> +* Commands that differ for behavior A vs. behavior B:
> +
> +  * commands that make modifications:

nit: "make modifications" -> "make modifications to the index"? 

> +      * add
> +      * rm
> +      * mv
> +
> +  * commands that query history
> +      * diff (with --cached or REVISION arguments)
> +      * grep (with --cached or REVISION arguments)
> +      * show (when given commit arguments)
> +      * bisect
> +      * blame
> +	* and annotate
> +      * log
> +	* and variants: shortlog, gitk, show-branch, whatchanged
> +
> +* Comands I don't know how to classify
> +
> +  * ls-files
> +
> +    Shows all tracked files by default, and with an option can show
> +    sparse directory entries instead of expanding them.  Should there be
> +    a way to restrict to just the non SKIP_WORKTREE files?

Yes, I think "restricting to just non SKIP_WORKTREE files" would be what a
'--restrict' option would do. The existing '--sparse' flag really is
independent of the sparse patterns altogether - it just toggles whether
sparse directories are shown as-is or expanded. Given your analysis so far,
'--sparse' should probably be renamed to something that reflects its unique
behavior ('--no-expand-sparse-directories'? I'm sure someone more creative
than me could come up with a better name ;) ).

So, disregarding the special sparse index behavior, I think 'ls-files' fits
neatly in the "commands that query history" section.

> +
> +    Note that `git ls-files -t` is often used to see what is sparse and
> +    what is not, which only works with a non-restricted assumption.
> +
> +  * checkout-index
> +
> +    should it be like `checkout` and pay attention to sparsity paths, or
> +    be considered special and write to working tree anyway?  The
> +    interaction with --prefix, and the use of specifically named files
> +    (rather than globs) makes me wonder.

IMO, it should still pay attention to sparsity paths, even with '--prefix'.
My interpretation would be that '--restrict' tells it how to *read* the
index when determining what to write to disk - even with '--prefix', then,
it'd only write files matching the sparsity patterns. In that case, it seems
to fit alongside 'switch', 'restore', etc. in "commands that restore files
to the working tree that match sparsity patterns." 

> +
> +  * update-index
> +
> +    The --[no-]ignore-skip-worktree-entries default is totally bogus,
> +    but otherwise this command seems okay?  Not sure what category it
> +    would go under, though.

I'd probably call this a "makes modifications" command (like 'git add', 'git
rm', etc.), since it adds/removes/modifies items in the index (either their
content or their flags).

> +
> +  * range-diff
> +
> +    Is this like `log` or `format-patch`?
> +
> +  * cherry
> +
> +    See range-diff
> +
> +  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
> +
> +    should these be tweaked or always operate full-tree?

For these (and the other plumbing/plumbing-ish commands you have listed:
'checkout-index', 'update-index', 'read-tree'), I'd lean towards making them
respect the sparsity patterns consistently with the porcelain layer. Part of
that is because the line between "plumbing" and "porcelain" is sometimes
fuzzy (like with 'read-tree'?), so having _very_ different behavior around
that boundary would probably be confusing. The other part is that I think
plumbing-based scripts would still fit one of your "A" or "B" user
archetypes, so full-tree behavior might not be desired anyway.

> +=== Subcommand-dependent defaults ===
> +
> +Note that we have different defaults (for the desired behavior, not just
> +the current implementation) depending on the command:
> +
> +  * Commands defaulting to --restrict:
> +    * status
> +    * diff (without --cached or REVISION arguments)
> +    * grep (without --cached or REVISION arguments)
> +    * switch
> +    * checkout (the switch-like half)
> +    * read-tree
> +    * reset (--hard)
> +    * restore/checkout
> +    * checkout-index
> +
> +    This behavior makes sense; these interact with the working tree.
> +
> +  * Commands defaulting to --restrict-unless-conflicts
> +    * merge
> +    * rebase
> +    * cherry-pick
> +    * revert
> +
> +    These also interact with the working tree, but require slightly different
> +    behavior so that conflicts can be resolved.
> +
> +  * Commands defaulting to --no-restrict
> +    * archive
> +    * bundle
> +    * commit
> +    * format-patch
> +    * fast-export
> +    * fast-import
> +    * commit-tree
> +
> +    * ls-files

In line with what I wrote earlier, I think 'ls-files' would belong wherever
other "commands that query history" go (looks like "Commands whose default
for --restrict vs. --no-restrict should vary").

> +    * stash
> +    * am
> +    * apply
> +
> +    These have completely different defaults and perhaps deserve the most detailed
> +    explanation:
> +
> +    In the case of commands in the first group (format-patch,
> +    fast-export, bundle, archive, etc.), these are commands for
> +    communicating history, which will be broken if they restrict to a
> +    subset of the repository.  As such, they operate on full paths and
> +    have no `--restrict` option for overriding.  Some of these commands may
> +    take paths for manually restricting what is exported, but it needs to
> +    be very explicit.
> +
> +    In the case of stash, it needs to vivify files to avoid losing the
> +    user's changes.
> +
> +    In the case of am and apply, those commands only operate on the
> +    working tree, so they are kind of in the same boat as stash.
> +    Perhaps `git am` could run `git sparse-checkout reapply`
> +    automatically afterward and move into a category more similar to
> +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> +    vivify files besides just conflicted ones when there are conflicts.
> +
> +    In the case of ls-files, `git ls-files -t` is often used to see what
> +    is sparse and not, in which case restricting would not make sense.
> +    Also, ls-files has traditionally been used to get a list of "all
> +    tracked files", which would suggest not restricting.  But it's
> +    slightly funny, because sparse-checkouts essentially split tracked
> +    files into two categories -- those in the sparse specification and
> +    those outside -- and how does the user specify which of those two
> +    types of tracked files they want?
> +
> +  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
> +    may affect how verbose the warnings are):
> +    * add
> +    * rm
> +    * mv

I was going to say that, if you consider 'update-index' part of the same
category as 'git add', it would belong here. However, the "but warn" part
seems a little weird with a mostly-plumbing command like 'update-index'. 

> +
> +    The defaults here perhaps make sense since they are nearly --restrict, but
> +    actually using --restrict could cause user confusion if users specify a
> +    specific filename, so they warn by default.  That logic may sound like
> +    --no-restrict should be the default, but that's prone to even bigger confusion:
> +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> +	the file randomly disappearing later when some subsequent command is run
> +	(since various commands automatically clean up unmodified files outside
> +	the sparsity specification).
> +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> +	outside the range of the user's interest.  Much better to operate on the
> +	sparsity specification and give the user warnings if other files could have
> +	matched.
> +      * `git mv` has similar surprises when moving into or out of the cone, so
> +	best to restrict and throw warnings if restriction might affect the result.
> +
> +    There may be a difference in here between behavior A and behavior B.
> +    For behavior A, we probably only want to warn if there were no
> +    suitable matches for files in the sparsity specification, whereas
> +    for behavior B, we may want to warn even if there are valid files to
> +    operate on if the result would have been different under
> +    `--no-restrict`.

I'm a bit confused why '--restrict-but-warn' needs to be separate from
'--restrict'. Couldn't the '--restrict' behavior for 'add'/'rm'/'mv' just be
what you described above, since behavior is set on a per-command (or
per-category) basis?

Also, I might be mistaken, but isn't the current behavior more like
'--restrict', in that it returns an error code & advisory message if it
tries to add files outside the sparse patterns? If this is already okay to
users, what's the benefit of relaxing the error to a warning?

Otherwise, I'm on board with the difference between behaviors A & B (i.e.,
"some files must be in the sparse-checkout to avoid a warning/error" vs.
"all files must be in the sparse-checkout to avoid a warning/error").

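The two policies can be written down as a small decision function.
This is a sketch only: `warn_decision` and its arguments (the behavior
letter, plus the number of matched paths inside and outside the sparse
specification) are hypothetical names, not anything that exists in Git.

```shell
#!/bin/sh
# Hypothetical helper: should add/rm/mv warn (or error) for a pathspec?
#   $1  behavior: A or B
#   $2  number of matched paths inside the sparse specification
#   $3  number of matched paths outside the sparse specification
warn_decision() {
    mode=$1 inside=$2 outside=$3
    # Nothing outside the sparse specification matched: restricting
    # cannot have changed the result, so stay quiet in both behaviors.
    if [ "$outside" -eq 0 ]; then
        echo quiet
        return
    fi
    case $mode in
    A)
        # Behavior A: complain only if nothing usable matched inside.
        if [ "$inside" -eq 0 ]; then echo warn; else echo quiet; fi
        ;;
    B)
        # Behavior B: complain whenever --no-restrict would differ.
        echo warn
        ;;
    *)
        echo "unknown behavior: $mode" >&2
        return 1
        ;;
    esac
}

warn_decision A 2 1
warn_decision B 2 1
```
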
> +
> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> +    on Behavior A or Behavior B
> +    * diff (with --cached or REVISION arguments)
> +    * grep (with --cached or REVISION arguments)
> +    * show (when given commit arguments)
> +    * bisect
> +    * blame
> +      * and annotate
> +    * log
> +      * and variants: shortlog, gitk, show-branch, whatchanged
> +
> +    For now, we default to behavior B for these, which want a default of
> +    --no-restrict.
> +
> +    Note that two of these commands -- diff and grep -- also appeared in
> +    a different list with a default of --restrict, but only when limited
> +    to searching the working tree.  The working tree vs. history
> +    distinction is fundamental in how behavior B operates, so this is
> +    expected.
> +
> +    --restrict may make more sense as the long term default for
> +    these[12], but that's a fair amount of work to implement, and it'd
> +    be very problematic for behavior B users.  Making it the default
> +    now, and then slowly implementing that default in various
> +    subcommands over multiple releases would mean that behavior B users
> +    would need to learn to slowly add additional flags to their
> +    commands, depending on git version, to get the behavior they want.
> +    That gradual switchover would be painful, so we should avoid it at
> +    least until it's fully implemented.

I think transitioning to '--restrict' by default is a good plan - as far as
I can tell, user A types seem more common than user B types, and
'--restrict' creates a more consistent experience. 

Maybe '--restrict' could be made the default earlier in 'scalar' (which
already sets up a cone-mode sparse-checkout by default)? We'd still
gradually move towards making the option a global default, but 'scalar'
might get it some early exposure with users that'd benefit the most from it.

> +
> +
> +=== Implementation Questions ===
> +
> +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> +    * Names in use, or appearing in patches, or previously suggested:
> +      * --sparse/--dense
> +      * --ignore-skip-worktree-bits
> +      * --ignore-skip-worktree-entries
> +      * --ignore-sparsity
> +      * --[no-]restrict-to-sparse-paths
> +      * --full-tree/--sparse-tree
> +      * --[no-]restrict
> +    * Rationale making me lean slightly towards --[no-]restrict:
> +      * We want a name that works for many commands, so we need a name that
> +	does not conflict
> +      * --[no-]restrict isn't overly long and seems relatively explanatory
> +      * `--sparse`, as used in add/rm/mv, is totally backwards for
> +	grep/log/etc.  Changing the meaning of `--sparse` for these
> +	commands would fix the backwardness, but possibly break existing
> +	scripts.  Using a new name pairing would allow us to treat
> +	`--sparse` in these commands as a deprecated alias.
> +      * There is a different `--sparse`/`--dense` pair for commands using
> +	revision machinery, so using that naming might cause confusion
> +      * There is also a `--sparse` in both pack-objects and show-branch, which
> +	don't conflict but do suggest that `--sparse` is overloaded
> +      * The name --ignore-skip-worktree-bits is a double negative, is
> +	quite a mouthful, refers to an implementation detail that many
> +	users may not be familiar with, and we'd need a negation for it
> +	which would probably be even more ridiculously long.  (But we
> +	can make --ignore-skip-worktree-bits a deprecated alias for
> +	--no-restrict.)

I think '--[no-]restrict' is a good choice - it doesn't have the ambiguity
of '--sparse' or the so-verbose-it's-confusing nature of
'--ignore-skip-worktree-(bits|entries)'. My only concern would be with the
fact that '--[no-]restrict' doesn't clearly indicate its relationship to
sparse-checkout, but a longer name (like
'--[no-]restrict-to-sparse-checkout') would be cumbersome and not worth it for
the little bit of extra info a user would get.

> +
> +  * Should --[no-]restrict be a git global option, or added as options to each
> +    relevant command?  (Does that make sense given the multitude of different
> +    default behaviors we have for different options?)

That's an interesting idea! I'd be fine either way, there are pros and cons
to each. E.g., it feels a little weird putting the option before the command
('git --no-restrict add' vs. 'git add --no-restrict'), but the option does
apply to nearly every command (and it's easier to describe/document from a
Git-wide perspective than a per-command perspective).

> +
> +  * If a config option is added (core.restrictToSparsity?) what should
> +    the values and description be?  There's a risk of confusion, because
> +    we only want this config option to affect the history-querying
> +    commands (log/diff/grep) and maybe the path-modifying worktree
> +    commands (add/rm/mv), but certainly not most the others.  Previous config
> +    suggestion here: [13]

For values, maybe 'strict' (for behavior A/'--restrict' across the board),
'loose' (for behavior B), 'off'/'none' (for '--no-restrict' across the
board)? For the description, it could outline each of the use cases and
highlight notable command behavior differences? Kind of like what you
already have in [13].
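
To make the three values concrete, here is a sketch of how they might
map to per-category defaults. Everything in it is hypothetical: the
function name, the two category labels ("worktree" for commands like
status, "history" for commands like log), and the values themselves,
which are only the ones floated above.

```shell
#!/bin/sh
# Hypothetical mapping from a core.restrictToSparsity-style value to
# the default flag for a command category. Not an existing Git feature.
#   $1  value: strict, loose, off or none
#   $2  category: worktree or history
default_restrict_flag() {
    value=$1 category=$2
    case $category in
    worktree|history) ;;
    *) echo "unknown category: $category" >&2; return 1 ;;
    esac
    case $value in
    strict)
        # Behavior A: restrict across the board.
        printf '%s\n' --restrict
        ;;
    loose)
        # Behavior B: sparse working tree, full-tree history queries.
        if [ "$category" = history ]; then
            printf '%s\n' --no-restrict
        else
            printf '%s\n' --restrict
        fi
        ;;
    off|none)
        printf '%s\n' --no-restrict
        ;;
    *)
        echo "unknown value: $value" >&2
        return 1
        ;;
    esac
}

default_restrict_flag loose history
default_restrict_flag loose worktree
```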

> +
> +  * Should --sparse in ls-files be made an alias for --restrict?
> +    `--restrict` is certainly a near synonym in cone-mode, but even then
> +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> +    option has no effect, and in cone-mode it still shows the sparse
> +    directory entries which are technically outside the sparsity
> +    specification.

I don't think so (for the reasons I mentioned earlier - tl;dr --sparse and
--restrict are conceptually quite different, and functionally independent).
I do think '--sparse' should be renamed as part of the "Implementation
Goals/Plans", though.

> +
> +  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
> +    restore be made deprecated aliases for --no-restrict?  (They have the
> +    same meaning.)
> +
> +  * Should --ignore-skip-worktree-entries in update-index be made a
> +    deprecated alias for --no-restrict?  (Or, better yet, should the
> +    option just be nuked from orbit after flipping the default, since
> +    the reverse option is never wanted and the sole purpose of this
> +    option was to turn off a bug?)

That's an interesting bit of history! I tend to think of 'update-index' as
"plumbing add/rm", so I think there's still a benefit to having a
'--restrict' mode.

In any case, if I'm reading this correctly, these two options are subtly
different than what's proposed for '--restrict', since IIRC they don't take
into account the sparse patterns at all (only operating based on
'SKIP_WORKTREE'). If '--restrict' will involve also using the sparse
patterns, the behavior would change. I'm happy with doing that (I think the
change would be beneficial), but it should probably be explicitly noted
either here or whenever those commands are updated.

> +
> +  * sparse-checkout: once behavior A is fully implemented, should we
> +    take an interim measure to easy people into switching the default?

nit: s/easy/ease/

> +    Namely, if folks are not already in a sparse checkout, then require
> +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> +    would set core.restrictToSparse according to the setting given), and
> +    throw an error if the flag is not provided?  That error would be a
> +    great place to warn folks that the default may change in the future,
> +    and get them used to specifying what they want so that the eventual
> +    default switch is seamless for them.

Sounds like a good approach to me! It avoids needing to constantly
re-specify '--[no-]restrict' on every 'sparse-checkout set' (because it sets
the config), and also provides visibility to users. 

> +
> +  * clone: should we provide some mechanism for tying partial clones and
> +    sparse checkouts together better.  Maybe an option
> +	--sparse=dir1,dir2,...,dirN
> +    which:
> +       * Does initial fetch with `--filter=blob:none`
> +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> +	 fault in the missing blobs within the sparse
> +	 specification...except that rev-list needs some kind of options
> +	 to also get files from leading directories too.
> +       * Sets --restrict mode to allow focusing on the cone of interest
> +	 (and to permit disconnected development)

Similar to the '--restrict' default, this could also be a good fit for
'scalar clone'.
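
For reference, the first two bullets of the quoted proposal can already be
approximated with options that exist today. A rough sketch only: the
`sparse_clone` name and the `GIT` override are hypothetical, the rev-list
backfill step is left out, and `--restrict` is not set because it does not
exist yet.

```shell
# Hypothetical helper approximating the proposed `clone --sparse=dir1,...,dirN`
# with existing options.  GIT can be overridden (e.g. GIT=echo) for a dry run.
sparse_clone() {
    url=$1 dest=$2
    shift 2
    # Partial clone without blobs; --sparse starts with only top-level files.
    ${GIT:-git} clone --filter=blob:none --sparse "$url" "$dest" &&
    # Widen the sparse specification to the requested directories.
    ${GIT:-git} -C "$dest" sparse-checkout set --cone "$@"
}
```

Running it as `GIT=echo sparse_clone <url> <dest> dir1 dir2` prints the two
commands instead of executing them.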

> +
> +
> +=== Implementation Goals/Plans ===

The rest of this (+the "Known bugs" section) all look good to me.

Thanks again for writing this document, I really appreciate the time &
effort you put into it! It'll serve as a clear reference for work on
sparse-checkout going forward, and ultimately make sparse-checkout usage a
much better experience for users.

> base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
Junio C Hamano Sept. 26, 2022, 10:36 p.m. UTC | #4
Victoria Dye <vdye@github.com> writes:

>> +* Commands behaving the same regardless of high-level use-case
>> +
>> +  * commands that only look at files within the sparsity specification
>> +
>> +      * status
>> +      * diff (without --cached or REVISION arguments)
>> +      * grep (without --cached or REVISION arguments)
>
> 'status' and 'diff' currently show information about untracked files outside
> the working tree (since, not being in the index, they don't have a
> 'SKIP_WORKTREE' to use). Should that change with the proposed '--restrict'
> option?

Most likely not.  When sparsity specification is in effect, as you
said elsewhere in your response, no files, whether tracked or
untracked, should exist that are outside your area of interest.
Their presence should be reported as anomalies by "git status".

Unless the command is being run with the "-uno" option, that is.

> - 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
>   & advise on how to clean up the modified files to re-align with the
>   sparsity patterns.
> - 'reset --hard' silently drops the modified file and resets the
>   'SKIP_WORKTREE' bit on the corresponding index entry.
>
> With the exception of 'reset --hard' (aggressively and unconditionally
> cleaning the worktree & index is an important aspect of the command, IMO),
> I'd personally like to see commands in this category align with the behavior
> of 'switch' where they don't already. Regardless of what we decide, though,
> I think it's probably worth documenting the "modified outside of sparsity
> patterns" case.

True.  I agree on both counts.

> Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
> of the entries it reads into the index. Having all of your files suddenly
> appear "deleted" probably isn't desired behavior, so it might be a good
> candidate for the "Known bugs" section. 

I would imagine that it actually is OK to say that it is the
responsibility of whoever invokes read-tree the plumbing command
to reapply the skip-worktree bits and/or collapse the index entries
outside the area of interest into trees afterwards.

>> +* Commands that differ for behavior A vs. behavior B:
>> +
>> +  * commands that make modifications:
>
> nit: "make modifications" -> "make modifications to the index"? 

That clarification actually raises an interesting question.  Do we
want a three-level distinction, i.e. different behaviour between
commands that touch and do not touch the working tree, between those
that touch and do not touch the index, and between those that touch
and do not touch the commit?

As the index is merely a way to express what the user did to
eventually create the next tree to be recorded in the commit, my gut
feeling is that it may be easier to understand if we treated the
working tree and the index at the same level, actually.  I.e. if
grepping in the working tree of a sparse checkout does not find a
match outside the cones of interest, it may make sense to do the
same at least by default in "grep --cached" mode.

If I understand Stolee's write-up on the use case of those in
camp B, they are more aware of the larger whole and expect to see
hits outside the area they have checked out when running "grep HEAD".
But in their use case, they do not touch (they only look at) the area
outside their cone of interest, so if we limit the operation to
their cone of interest by default for the working tree, the same
default probably should apply equally to an operation that inspects
the index.
Elijah Newren Sept. 27, 2022, 3:05 a.m. UTC | #5
On Mon, Sep 26, 2022 at 10:38 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +    In the case of am and apply, those commands only operate on the
> > +    working tree, so they are kind of in the same boat as stash.
>
> "apply" does not touch the HEAD but it can touch the index; when it
> operates with the "--cached" or the "--index" option, it should not
> be considered as a working-tree-only command.

Ah, right, good flag.  This helps resolve part of my question, but
gives me a new question as well.

Without --cached or --index, I think we'd need to make `apply` behave
like `stash` and just auto-vivify any files being tweaked.  If we
don't, we'll lose changes from the patch.

"apply --cached" could possibly just update the index.  However, it
appears to have another bug I need to add to the known bugs section.
`apply --cached` updates the index, but the new index entry fails to
carry over the "SKIP_WORKTREE" bit, making it appear there is an
unstaged deletion of the file.  (Users can run `git sparse-checkout
reapply` afterwards as a workaround.)  This is slightly weird for
files with conflicts (created when running `git apply -3 --cached`)
since those files with content conflicts will not be present in the
working tree, but that's in line with the fact that `git apply -3
--cached` refuses to touch the working tree in general.
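
As a stopgap, the workaround mentioned above could be wrapped up roughly like
this.  This is only a sketch: `apply_cached_and_reapply` and the `GIT`
override are hypothetical names, and it assumes `reapply` is safe to run
right after the apply.

```shell
# Hypothetical wrapper: apply a patch to the index only, then ask
# sparse-checkout to restore any SKIP_WORKTREE bits that were dropped.
# GIT can be overridden (e.g. GIT=echo) for a dry run.
apply_cached_and_reapply() {
    ${GIT:-git} apply --cached "$@" &&
    ${GIT:-git} sparse-checkout reapply
}
```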

In line with `--cached`, we could have "apply --index" do updates to
both the index and the working copy, while ensuring any
"SKIP_WORKTREE" bits are preserved for non-conflicted files.  However,
would preserving "SKIP_WORKTREE" bits be weird for users?  On one
hand, `git apply` without `--index` auto-vivifies files and `--index`
says to "also apply changes to the index" -- but preserving
SKIP_WORKTREE bits would make the `--index` flag also affect how the
working tree is treated, which might seem odd.  On the other hand,
merge/cherry-pick/rebase will update files in the index while leaving
the file missing from the working tree when not conflicted, so there
is some precedent for such behavior.  The question might just be
whether `git apply --index` should be more like mergy behavior, or
more like `git apply`/`git stash` behavior.

> "am" is about recording what is in the patch as a commit.

Does that mean it should behave like "apply --index"?  Or more like
cherry-pick?  (This question might be moot depending on what we choose
for "apply --index", in particular, it won't matter if we preserve
SKIP_WORKTREE bits on non-conflicted files.)

> > +    Perhaps `git am` could run `git sparse-checkout reapply`
> > +    automatically afterward and move into a category more similar to
> > +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> > +    vivify files besides just conflicted ones when there are conflicts.
>
> I do not particularly think it is so bad.

For some reason I was thinking of running `git sparse-checkout
reapply` only if the `am` operation succeeded, which would give us a
special one-off command treatment.  If we instead view it as always
running `git sparse-checkout reapply` whether or not we hit conflicts,
or equivalently, if we view `git am` preserving SKIP_WORKTREE bits on
non-conflicted files, then I agree it's not weird anymore and can be
classified in the same group as merge/rebase/cherry-pick.

But something else you said confuses me...

> How would we handle the case where the user modifies paths outside
> the sparse specification and makes a commit out of the result,
> without using "am"?  We should be consistent with that use case, i.e.
>
>     $ edit path/outside/sparse/specification
>     $ git add path/outside/sparse/specification
>     $ git commit
>
> Do we require some "Yes, I am aware that I need to widen my sparse
> specification to do this, because I am now stepping out of it, and I
> understand that my sparse specification becomes wider after doing
> this operation" confirmation with "add" or "commit"?  If not, then I
> think "am" should silently widen just like these commands.  If they
> do, then "am" should also require such an option.  Perhaps call it
> "--widen-sparse" or whatever.

The command
    $ edit path/outside/sparse/specification
doesn't make sense to me; the file (and perhaps also its leading
directories) are missing.  Most editors will probably tell you that
you are editing a new file, but then it's more of a "rewrite from
scratch" than an "edit".

Typically, we'd expect users who want to edit such files to do so by
first running the `add` or `set` subcommands of sparse-checkout to
change their sparse specification so that the file becomes present.
But then it's no longer outside the sparse specification.  So, I'm not
sure how this angle could help guide our direction.

> By the way, I like the term "sparse specification" very much, as
> we should worry about non-cone mode as well.  Please use it
> consistently in this document after getting a consensus that it
> is a good phrase to use from others---I saw some other words
> used after "sparse" elsewhere in this patch.

:-)

> > +    In the case of ls-files, `git ls-files -t` is often used to see what
> > +    is sparse and not, in which case restricting would not make sense.
>
> I suspect that leaving it tree-wide would allow scripters come up
> with Porcelains that restricts to the sparse specification more
> easily.
Junio C Hamano Sept. 27, 2022, 4:30 a.m. UTC | #6
>> "am" is about recording what is in the patch as a commit.
>
> Does that mean it should behave like "apply --index"?  Or more like
> cherry-pick?

It should behave like a manual edit (after widening the area of
interest by adjusting sparsity specification, if needed) followed by
"git add" followed by "git commit".

> The command
>     $ edit path/outside/sparse/specification
> doesn't make sense to me; the file (and perhaps also its leading
> directories) are missing.  Most editors will probably tell you that
> you are editing a new file, but then it's more of a "rewrite from
> scratch" than an "edit".

If it is a new file, read it with "mkdir -p $(dirname $that_file)"
prefixed.  If it is an existing file, then "checkout $that_file"
instead.  And then adjust your sparsity specification so that the
path is now within your area of interest.

> Typically, we'd expect users who want to edit such files to do so by
> first running the `add` or `set` subcommands of sparse-checkout to
> change their sparse specification so that the file becomes present.
> But then it's no longer outside the sparse specification.  So, I'm not
> sure how this angle could help guide our direction.

The fact that you accept and attempt to apply and make it into a
commit already indicates your intention that the paths touched by
the patch are now in your area of interest, just like whichever
paths you decide to manually edit and record the changes you made,
so it would be the most user friendly to automatically adjust the
sparsity specification to allow them to do exactly that, I would think.

That is how I look at the "am" command, anyway.
Elijah Newren Sept. 27, 2022, 6:09 a.m. UTC | #7
On Mon, Sep 26, 2022 at 1:09 PM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > Once upon a time, Matheus wrote some patches to make
> >    git grep [--cached | <REVISION>] ...
> > restrict its output to the sparsity specification when working in a
> > sparse checkout[1].  That effort got derailed by two things:
> >
> >   (1) The --sparse-index work just beginning which we wanted to avoid
> >       creating conflicts for
> >   (2) Never deciding on flag and config names and planned high level
> >       behavior for all commands.
> >
> > More recently, Shaoxuan implemented a more limited form of Matheus'
> > patches that only affected --cached, using a different flag name,
> > but also changing the default behavior in line with what Matheus did.
> > This again highlighted the fact that we never decided on command line
> > flag names, config option names, and the big picture path forward.
> >
> > The --sparse-index work has been mostly complete (or at least released
> > into production even if some small edges remain) for quite some time
> > now.  We have also had several discussions on flag and config names,
> > though we never came to solid conclusions.  Stolee once upon a time
> > suggested putting all these into some document in
> > Documentation/technical[3], which Victoria recently also requested[4].
> > I'm behind the times, but here's a patch attempting to finally do that.
>
> Thank you so much for writing this!
>
> > diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
> > new file mode 100644
> > index 00000000000..b213b2b3f35
> > --- /dev/null
> > +++ b/Documentation/technical/sparse-checkout.txt
> > @@ -0,0 +1,670 @@
> > +Table of contents:
> > +
> > +  * Purpose of sparse-checkouts
> > +  * Desired behavior
> > +  * Subcommand-dependent defaults
> > +  * Implementation Questions
> > +  * Implementation Goals/Plans
> > +  * Known bugs
> > +  * Reference Emails
> > +
> > +
> > +=== Purpose of sparse-checkouts ===
> > +
> > +sparse-checkouts exist to allow users to work with a subset of their
> > +files.
> > +
> > +The idea is simple enough, but there are two different high-level
> > +usecases which affect how some Git subcommands should behave.  Further,
> > +even if we only considered one of those usecases, sparse-checkouts
> > +modify different subcommands in over a half dozen different ways.  Let's
> > +start by considering the high level usecases in this section:
> > +
> > +  A) Users are _only_ interested in the sparse portion of the repo
> > +
> > +  B) Users want a sparse working tree, but are working in a larger whole
>
> Both of these use cases make sense to me! Two thoughts/comments:
>
> 1. This could be a "me" problem, but I regularly struggle with "sparse"
>    having different meanings in similar contexts. For example, a "sparse
>    directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
>    of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
>    A quick note/section outlining some standard terminology would be
>    immensely helpful.

Yeah, that's a good point.  I think we maybe misnamed the sparse
directory entries, and that led to other naming problems.

I like your idea of adding a terminology section; I'll add one.

> 2. One detail I'd like this document to clarify is the similarity/difference
>    between "in the sparse portion of the repo" and "does not have
>    'SKIP_WORKTREE' applied." In a well-behaved sparse-checkout, these are
>    one and the same. However, if a user removes 'SKIP_WORKTREE' from a file
>    (either with 'update-index' or by checking it out on disk), commands
>    *sometimes* treat it as inside the sparse checkout (e.g., 'git status'),
>    and some treat it as outside (e.g., 'git add'). Technically, I think it
>    comes down to whether a command uses sparse patterns + 'SKIP_WORKTREE' to
>    determine sparsity vs. just 'SKIP_WORKTREE', but the varying behavior
>    feels inconsistent as an end user.

Yeah, that's a good point, I should address this.  There are
additional ways to get more files too -- resolving conflicts, or
commands like `stash` that auto-vivify intentionally, or commands that
accidentally auto-vivify (various merge backends), etc.  Anyway,
here's my current mental model, in case it helps:

* In a well-behaved situation, the sparse specification is given
  directly by the $GIT_DIR/info/sparse-checkout file.
* The working tree can transiently have an expanded sparse
  specification, due to a variety of reasons like resolving conflicts or
  running various commands that might add or restore files to the
  working tree.
   * Such transient differences can and will be automatically removed
     as a side-effect of commands which call unpack_trees() (checkout,
     merge, reset, etc.).
   * Users can also request such transient differences be corrected
     via running `git sparse-checkout reapply`
   * Additional commands are also welcome to implicitly fix these
     differences.
   * Because of the above three items, users should make no assumption
     that files in a transiently expanded (or restricted) sparse
     specification will persist unless they manually explicitly request
     an expansion or restriction (via e.g. the `set` or `add`
     subcommands of sparse-checkout.)
   * (Yes, we avoid removing files when there are unstaged changes or
     conflicts, since we don't want to lose user data.  I don't think
     that undermines the general point of the last few bullets).
* The behavior wanted when doing something like "git grep expression
  REVISION" is roughly what the users would expect from "git checkout
  REVISION && git grep expression" (I know, we add "REVISION:" prefixes,
  so it's not exactly the same, but it captures the high level idea).
  This has a couple ramifications:
   * REVISION may have paths not in the current index, so there is no
     path we can consult for a SKIP_WORKTREE setting for those paths.
   * Since a checkout tries to remove transient differences in the
     sparse specification, it makes sense to use the corrected sparse
     specification (i.e. $GIT_DIR/info/sparse-checkout) rather than
     attempting to consult SKIP_WORKTREE anyway.
   * Therefore, a transiently expanded (or restricted) sparse
     specification *only* applies to the working tree and perhaps index.
     It does not apply for history queries.

We kind of discussed this previously for why SKIP_WORKTREE not
matching the normal sparse specification should only apply to the
worktree and not to history, in the context of grep[*]:

"""
For the worktree and cached cases, we iterate over paths without
the SKIP_WORKTREE bit set, and limit our searches to these paths.  For
the $REVISION case, we limit the paths we search to those that match
the sparsity patterns.  (We do not check the SKIP_WORKTREE bit for the
$REVISION case, because $REVISION may contain paths that do not exist
in HEAD and thus for which we have no SKIP_WORKTREE bit to consult.
The sparsity patterns tell us how the SKIP_WORKTREE bit would be set
if we were to check out $REVISION, so we consult those.  Also, we
don't use the sparsity paths with the worktree or cached cases, both
because we have a bit we can check directly and more efficiently, and
because unmerged entries from a merge or a rebase could cause more
files to temporarily be present than the sparsity patterns would
normally select.)
"""

(That email also discussed the weird case of being given a TREE
instead of a REVISION, which mucks things up a bit.)

[*] https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
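
To make the $REVISION case a bit more concrete, here is a rough model of
"limit the paths to those that match the sparsity patterns", for cone mode
only.  It is a sketch, not how Git implements it: `in_cone` and
`restrict_to_cone` are hypothetical names, and the cone directories are
passed as arguments rather than read from $GIT_DIR/info/sparse-checkout.

```shell
# Hypothetical helper: succeed if the path in $1 lies inside a cone-mode
# sparse specification made up of the directories given as remaining args.
in_cone() {
    path=$1
    shift
    case $path in
        */*) parent=${path%/*} ;;    # has a directory component; check below
        *) return 0 ;;               # top-level files are always included
    esac
    for dir in "$@"; do
        case $path in
            "$dir"/*) return 0 ;;    # anywhere beneath a cone directory
        esac
        case $dir/ in
            "$parent"/*) return 0 ;; # file directly inside a leading directory
        esac
    done
    return 1
}

# Filter paths on stdin (one per line) down to those inside the cone.
restrict_to_cone() {
    while IFS= read -r p; do
        in_cone "$p" "$@" && printf '%s\n' "$p"
    done
    return 0
}
```

Something like `git ls-tree -r --name-only REVISION | restrict_to_cone dir1
dir2` would then list the paths a restricted history query would consider,
without consulting any SKIP_WORKTREE bit.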

> > +
> > +=== Desired behavior ===
> > +
> > +As noted in the previous section, despite the simple idea of just
> > +working with a subset of files, there are a range of different
> > +behavioral changes that need to be made to different subcommands to work
> > +well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
> > +examples.  In particular, at [2], we saw that mere composition of other
> > +commands that individually worked correctly in a sparse-checkout context
> > +did not imply that the higher level command would work correctly; it
> > +sometimes requires further tweaks.  So, understanding these differences
> > +can be beneficial.
> > +
> > +* Commands behaving the same regardless of high-level use-case
> > +
> > +  * commands that only look at files within the sparsity specification
> > +
> > +      * status
> > +      * diff (without --cached or REVISION arguments)
> > +      * grep (without --cached or REVISION arguments)
>
> 'status' and 'diff' currently show information about untracked files outside
> the working tree (since, not being in the index, they don't have a
> 'SKIP_WORKTREE' to use).

'status' does, yes, but...I thought 'diff' only applied to tracked
files.  How do you get 'diff' to show information about untracked
files?

(Are you by chance referring to either (1) --no-index which requires
paths to be explicitly specified and thus --[no-]restrict is
irrelevant, or (2) --ignore-submodules, in which case I think
--[no-]restrict is also irrelevant since --[no-]restrict would apply
to the supermodule and the untracked files would just be ones found
within the submodule?)

> Should that change with the proposed '--restrict' option?

Here's how I look at it:

One way to view the purpose of sparse-checkouts is that it subdivides
"tracked" files into two categories -- a sparse subset, and all the
rest.  We mark "all the rest" with SKIP_WORKTREE.  The SKIP_WORKTREE
files are still tracked, just not present in the working copy.
`--restrict` is a modifier that only works to differentiate between
those two groups of tracked files.  In particular, `--restrict` exists
to allow us to specify that operations that normally operate on
tracked files should instead operate on that subset (and likewise,
`--no-restrict` exists to allow us to specify that operations that
default to working on a subset of tracked files should instead operate
on all tracked files).

untracked files are not tracked.  As such `--[no-]restrict` should not
affect how untracked files are treated...except when dealing with the
tracked/untracked boundary and moving files across that boundary (e.g.
with add/rm/mv).  In fact, I think that's why those three commands
have their own special category.

> > +
> > +  * commands that restore files to the working tree that match sparsity patterns, and
> > +    remove unmodified files that don't match those patterns:
> > +
> > +      * switch
> > +      * checkout (the switch-like half)
> > +      * read-tree
> > +      * reset --hard
> > +
> > +      * `restore` & the restore-like half of `checkout` SHOULD be in this above
> > +     category, but are buggy (see the "Known bugs" section below)
>
> These commands do behave differently if there are *modified* files outside
> the sparsity patterns:

I don't understand this claim; using checkout/switch:

$ git sparse-checkout disable
$ git status --porcelain
 M tracked-but-maybe-skipped
$ git checkout main~1
error: Your local changes to the following files would be overwritten by checkout:
tracked-but-maybe-skipped
Please commit your changes or stash them before you switch branches.
Aborting
$ git sparse-checkout set --no-cone /tracked 2>/dev/null
$ git ls-files -t  # Note: tracked-but-maybe-skipped is outside sparsity patterns, but modified
H tracked
H tracked-but-maybe-skipped
$ git checkout main~1
error: Your local changes to the following files would be overwritten by checkout:
tracked-but-maybe-skipped
Please commit your changes or stash them before you switch branches.
Aborting

Exact same error in both sparse and non-sparse checkouts, even when
the sparse-checkout has a modified file outside the sparsity patterns.

> - 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
>   & advise on how to clean up the modified files to re-align with the
>   sparsity patterns.

Perhaps you have a different case in mind than I do?  I'm not aware of
anywhere that switch/checkout does this.  (If I modified the above
testcase to have the changes be staged, I still get the same error
both with or without a sparse-checkout, and that error doesn't mention
sparsity patterns in any way.)  I tried grepping around the source
code, but maybe I'm missing something?

> - 'reset --hard' silently drops the modified file and resets the
>   'SKIP_WORKTREE' bit on the corresponding index entry.
>
> With the exception of 'reset --hard' (aggressively and unconditionally
> cleaning the worktree & index is an important aspect of the command, IMO),
> I'd personally like to see commands in this category align with the behavior
> of 'switch' where they don't already.

Oh, are you thinking that `reset --hard` has a different kind of
modification made to it in sparse-checkouts than the other commands in
this category?

I still don't see it, even if that's what you're referring to.  Each
of these commands, in a sparse-checkout, performs its operation within
the sparsity specification, and then attempts to aggressively cull
differences between the sparsity specification and the sparsity
patterns (by marking unmodified files outside the sparsity patterns as
SKIP_WORKTREE and removing them, and marking files matching the
sparsity patterns which were previously SKIP_WORKTREE as
!SKIP_WORKTREE and restoring them to the working tree).  Perhaps some
examples would help:

Having switch/checkout restore paths matching sparsity patterns:
  $ rm tracked
  $ git status --porcelain
   D tracked
  $ git update-index --skip-worktree tracked
  $ git status --porcelain
  $ git ls-files -t
  S tracked
  $

  $ git checkout HEAD~1
  $ git status --porcelain
  $ git ls-files -t
  H tracked

Having switch/checkout remove paths that do not match sparsity patterns:
  $ git ls-files -t
  S tracked-but-maybe-skipped
  $ git show HEAD:tracked-but-maybe-skipped >tracked-but-maybe-skipped
  $ git ls-files -t
  H tracked-but-maybe-skipped

  $ git checkout HEAD~1
  $ git ls-files -t
  S tracked-but-maybe-skipped

So, switch & checkout are doing the same culling that `reset --hard`
is doing.  It's just that all of these commands avoid culling a file
that still has modifications after the command's normal operation,
and by design, you'll see `reset --hard` have more opportunities to
cull files since it squashes those modifications.
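
Put as a decision table, the culling step I'm describing looks roughly like
the following.  This is only a model of the description above, not Git's
code; `cull_action` and its yes/no arguments are made up for illustration.

```shell
# Hypothetical model of the culling step.  Arguments are "yes"/"no":
# does the path match the sparsity patterns, is SKIP_WORKTREE currently
# set, and does the file still have local modifications?
cull_action() {
    matches=$1 skipped=$2 modified=$3
    if [ "$matches" = yes ] && [ "$skipped" = yes ]; then
        echo restore    # clear SKIP_WORKTREE and write the file out
    elif [ "$matches" = no ] && [ "$skipped" = no ] && [ "$modified" = no ]; then
        echo remove     # set SKIP_WORKTREE and delete the file
    else
        echo keep       # already consistent, or modified and left alone
    fi
}
```

In this model `reset --hard` only differs in that it has already discarded
the modifications, so the third argument is always "no" by the time the
culling happens.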

> Regardless of what we decide, though,
> I think it's probably worth documenting the "modified outside of sparsity
> patterns" case.

I'm happy to document if I understand it better; right now I'm just
not following.

> Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
> of the entries it reads into the index. Having all of your files suddenly
> appear "deleted" probably isn't desired behavior, so it might be a good
> candidate for the "Known bugs" section.

Ooh, good catch.  Yeah, I'll add it.

> > +
> > +  * commands that write conflicted files to the working tree, but otherwise will
> > +    omit writing files that do not match the sparsity patterns:
> > +
> > +      * merge
> > +      * rebase
> > +      * cherry-pick
> > +      * revert
> > +
> > +    Note that this somewhat depends upon the merge strategy being used:
> > +      * `ort` behaves as described above
> > +      * `recursive` tries to not vivify files unnecessarily, but does sometimes
> > +     vivify files without conflicts.
> > +      * `octopus` and `resolve` will always vivify any file changed in the merge
> > +     relative to the first parent, which is rather suboptimal.
> > +
> > +  * commands that always ignore sparsity since commits must be full-tree
> > +
> > +      * archive
> > +      * bundle
> > +      * commit
> > +      * format-patch
> > +      * fast-export
> > +      * fast-import
> > +      * commit-tree
> > +
> > +  * commands that write any modified file to the working tree (conflicted or not,
> > +    and whether those paths match sparsity patterns or not):
> > +
> > +      * stash
> > +
> > +      * am/apply probably should be in the above category, but need to be fixed to
> > +     auto-vivify instead of failing
> > +
> > +* Commands that differ for behavior A vs. behavior B:
> > +
> > +  * commands that make modifications:
>
> nit: "make modifications" -> "make modifications to the index"?

More specifically, "make modifications to which files are tracked".
In a sense, these commands determine whether "--[no-]restrict" applies
to _untracked_ files (because those untracked files are about to
become tracked), which is something no other command has to worry
about, and they deserve special treatment because of that.

> > +      * add
> > +      * rm
> > +      * mv
> > +
> > +  * commands that query history
> > +      * diff (with --cached or REVISION arguments)
> > +      * grep (with --cached or REVISION arguments)
> > +      * show (when given commit arguments)
> > +      * bisect
> > +      * blame
> > +     * and annotate
> > +      * log
> > +     * and variants: shortlog, gitk, show-branch, whatchanged
> > +
> > +* Commands I don't know how to classify
> > +
> > +  * ls-files
> > +
> > +    Shows all tracked files by default, and with an option can show
> > +    sparse directory entries instead of expanding them.  Should there be
> > +    a way to restrict to just the non SKIP_WORKTREE files?
>
> Yes, I think "restricting to just non SKIP_WORKTREE files" would be what a
> '--restrict' option would do.

Hmm...yeah, that makes sense...especially if as you say:

> The existing '--sparse' flag really is
> independent of the sparse patterns altogether - it just toggles whether
> sparse directories are shown as-is or expanded. Given your analysis so far,
> '--sparse' should probably be renamed to something that reflects its unique
> behavior ('--no-expand-sparse-directories'? I'm sure someone more creative
> than me could come up with a better name ;) ).

Maybe just `--no-expand`?  I'm also open to further alternatives.

> So, disregarding the special sparse index behavior, I think 'ls-files' fits
> neatly in the "commands that query history" section.

If it fits neatly in the "commands that query history" section, that
implies that `--restrict` should be the default for the behavior A
camp of people.  That may be fine, but...

Junio suggested that leaving ls-files as full-tree by default "would
allow scripters [to] come up with Porcelains that restricts to the
sparse specification more easily."  I know we've certainly used
`ls-files -t` a lot internally.  I guess it's a question of whether we
train such folks to always use `--no-restrict` together with `git
ls-files -t`, whether we actually treat ls-files as a special category
that defaults to full-tree even for the behavior A camp, or whether we
find some kind of middle ground by defaulting to `--restrict` but
making the `-t` option imply `--no-restrict`.  Thoughts?
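
For what it's worth, if ls-files stays full-tree, the kind of Porcelain Junio
mentions is not hard to script.  A minimal sketch, where `present_paths` is a
hypothetical name and the input is assumed to be plain `git ls-files -t`
output with no quoted (unusual) path names:

```shell
# Hypothetical filter: read `git ls-files -t` output on stdin and print only
# the paths that are not marked SKIP_WORKTREE (tag "S").
present_paths() {
    sed -n '/^S /!s/^. //p'
}
```

Usage would be `git ls-files -t | present_paths`.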

> > +
> > +    Note that `git ls-files -t` is often used to see what is sparse and
> > +    what is not, which only works with a non-restricted assumption.
> > +
> > +  * checkout-index
> > +
> > +    should it be like `checkout` and pay attention to sparsity paths, or
> > +    be considered special and write to working tree anyway?  The
> > +    interaction with --prefix, and the use of specifically named files
> > +    (rather than globs) makes me wonder.
>
> IMO, it should still pay attention to sparsity paths, even with '--prefix'.
> My interpretation would be that '--restrict' tells it how to *read* the
> index when determining what to write to disk - even with '--prefix', then,
> it'd only write files matching the sparsity patterns. In that case, it seems
> to fit alongside 'switch', 'restore', etc. in "commands that restore files
> to the working tree that match sparsity patterns."

Sounds fair; I like that.

> > +
> > +  * update-index
> > +
> > +    The --[no-]ignore-skip-worktree-entries default is totally bogus,
> > +    but otherwise this command seems okay?  Not sure what category it
> > +    would go under, though.
>
> I'd probably call this a "makes modifications" command (like 'git add', 'git
> rm', etc.), since it adds/removes/modifies items in the index (either their
> content or their flags).

That group has a restrict-or-error behavior.  Do we want update-index
to require a --no-restrict to operate on files outside the sparse
specification?  Maybe we do, for the same reasons we do with
add/rm/mv.  And that certainly would have helped us avoid the
--[no-]ignore-skip-worktree-entries bug.

If we go this route, should some flags imply --no-restrict (such as
--[no-]skip-worktree)?

> > +
> > +  * range-diff
> > +
> > +    Is this like `log` or `format-patch`?
> > +
> > +  * cherry
> > +
> > +    See range-diff

I'm presuming you didn't mean the answers below to apply to the above two.

> > +  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
> > +
> > +    should these be tweaked or always operate full-tree?
>
> For these (and the other plumbing/plumbing-ish commands you have listed:
> 'checkout-index', 'update-index', 'read-tree'), I'd lean towards making them
> respect the sparsity patterns consistently with the porcelain layer. Part of
> that is because the line between "plumbing" and "porcelain" is sometimes
> fuzzy (like with 'read-tree'?), so having _very_ different behavior around
> that boundary would probably be confusing. The other part is that I think
> plumbing-based scripts would still fit one of your "A" or "B" user
> archetypes, so full-tree behavior might not be desired anyway.

That sounds compelling to me, generally.

However, if we are given a tree rather than a revision, we have no way
of knowing where in the directory hierarchy the tree is found, so
we may not be able to provide `--restrict` behavior (unless we want to
just blindly assume the tree given is a toplevel tree; not sure Junio
would like that based on looking at the commit message of d4789c60aa
("ls-tree: add --full-tree option", 2008-12-25) where such an
assumption was made before).  Thus, things like `git grep $TREE`, `git
diff-tree $TREE1 $TREE2`, or `git ls-tree $TREE` may have to default
to `--no-restrict` when those arguments truly are trees rather than
commits.


> > +=== Subcommand-dependent defaults ===
> > +
> > +Note that we have different defaults (for the desired behavior, not just
> > +the current implementation) depending on the command:
> > +
> > +  * Commands defaulting to --restrict:
> > +    * status
> > +    * diff (without --cached or REVISION arguments)
> > +    * grep (without --cached or REVISION arguments)
> > +    * switch
> > +    * checkout (the switch-like half)
> > +    * read-tree
> > +    * reset (--hard)
> > +    * restore/checkout
> > +    * checkout-index
> > +
> > +    This behavior makes sense; these interact with the working tree.
> > +
> > +  * Commands defaulting to --restrict-unless-conflicts
> > +    * merge
> > +    * rebase
> > +    * cherry-pick
> > +    * revert
> > +
> > +    These also interact with the working tree, but require slightly different
> > +    behavior so that conflicts can be resolved.
> > +
> > +  * Commands defaulting to --no-restrict
> > +    * archive
> > +    * bundle
> > +    * commit
> > +    * format-patch
> > +    * fast-export
> > +    * fast-import
> > +    * commit-tree
> > +
> > +    * ls-files
>
> In line with what I wrote earlier, I think 'ls-files' would belong wherever
> other "commands that query history" go (looks like "Commands whose default
> for --restrict vs. --no-restrict should vary").
>
> > +    * stash
> > +    * am
> > +    * apply
> > +
> > +    These have completely different defaults and perhaps deserve the most detailed
> > +    explanation:
> > +
> > +    In the case of commands in the first group (format-patch,
> > +    fast-export, bundle, archive, etc.), these are commands for
> > +    communicating history, which will be broken if they restrict to a
> > +    subset of the repository.  As such, they operate on full paths and
> > +    have no `--restrict` option for overriding.  Some of these commands may
> > +    take paths for manually restricting what is exported, but it needs to
> > +    be very explicit.
> > +
> > +    In the case of stash, it needs to vivify files to avoid losing the
> > +    user's changes.
> > +
> > +    In the case of am and apply, those commands only operate on the
> > +    working tree, so they are kind of in the same boat as stash.
> > +    Perhaps `git am` could run `git sparse-checkout reapply`
> > +    automatically afterward and move into a category more similar to
> > +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> > +    vivify files besides just conflicted ones when there are conflicts.
> > +
> > +    In the case of ls-files, `git ls-files -t` is often used to see what
> > +    is sparse and not, in which case restricting would not make sense.
> > +    Also, ls-files has traditionally been used to get a list of "all
> > +    tracked files", which would suggest not restricting.  But it's
> > +    slightly funny, because sparse-checkouts essentially split tracked
> > +    files into two categories -- those in the sparse specification and
> > +    those outside -- and how does the user specify which of those two
> > +    types of tracked files they want?
> > +
> > +  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
> > +    may affect how verbose the warnings are):
> > +    * add
> > +    * rm
> > +    * mv
>
> I was going to say that, if you consider 'update-index' part of the same
> category as 'git add', it would belong here. However, the "but warn" part
> seems a little weird with a mostly-plumbing command like 'update-index'.

Is it more or less weird with "but error" rather than "but warn"?

> > +
> > +    The defaults here perhaps make sense since they are nearly --restrict, but
> > +    actually using --restrict could cause user confusion if users specify a
> > +    specific filename, so they warn by default.  That logic may sound like
> > +    --no-restrict should be the default, but that's prone to even bigger confusion:
> > +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> > +     the file randomly disappearing later when some subsequent command is run
> > +     (since various commands automatically clean up unmodified files outside
> > +     the sparsity specification).
> > +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> > +     outside the range of the user's interest.  Much better to operate on the
> > +     sparsity specification and give the user warnings if other files could have
> > +     matched.
> > +      * `git mv` has similar surprises when moving into or out of the cone, so
> > +     best to restrict and throw warnings if restriction might affect the result.
> > +
> > +    There may be a difference in here between behavior A and behavior B.
> > +    For behavior A, we probably only want to warn if there were no
> > +    suitable matches for files in the sparsity specification, whereas
> > +    for behavior B, we may want to warn even if there are valid files to
> > +    operate on if the result would have been different under
> > +    `--no-restrict`.
>
> I'm a bit confused why '--restrict-but-warn' needs to be separate from
> '--restrict'. Couldn't the '--restrict' behavior for 'add'/'rm'/'mv' just be
> what you described above, since behavior is set on a per-command (or
> per-category) basis?
>
> Also, I might be mistaken, but isn't the current behavior more like
> '--restrict', in that it returns an error code & advisory message if it
> tries to add files outside the sparse patterns? If this is already okay to
> users, what's the benefit of relaxing the error to a warning?
>
> Otherwise, I'm on board with the difference between behaviors A & B (i.e.,
> "some files must be in the sparse-checkout to avoid a warning/error" vs.
> "all files must be in the sparse-checkout to avoid a warning/error").

Sorry, I should have written "error" rather than "warning".  I wanted
these in a separate category, because initially these had
`--no-restrict` behavior and we had really big usability problems.  We
tried to fix this by implementing "--restrict" behavior and just
silently ignoring any paths users gave us outside the sparse
specification.  That reduced complaints and made problems much
smaller, but we still got complaints.  Providing an error message in
some cases due to the restriction (hence --restrict-but-error) is kind
of important to getting the user experience right on these commands.
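
As a rough illustration of the "restrict, but error in some cases" idea for add/rm/mv, here is a minimal Python sketch. It is not Git code: `select_paths`, `in_spec`, and the behavior labels are hypothetical names, and the rule for when to raise an error follows the behavior A vs. behavior B distinction discussed above (A complains only when nothing matched inside the sparse specification, B complains whenever anything matched outside it).

```python
def select_paths(matched, in_spec, behavior):
    """Split pathspec matches by sparse specification and decide on an error.

    matched:  paths the user's pathspec matched across the full index
    in_spec:  predicate telling whether a path is in the sparse specification
    behavior: "A" or "B" (hypothetical labels for the two user camps)
    Returns (paths to operate on, paths skipped, whether to report an error).
    """
    inside = [p for p in matched if in_spec(p)]
    outside = [p for p in matched if not in_spec(p)]
    if behavior == "A":
        # Only complain when the restriction left nothing to operate on.
        error = bool(outside) and not inside
    elif behavior == "B":
        # Complain whenever --no-restrict would have given a different result.
        error = bool(outside)
    else:
        raise ValueError("behavior must be 'A' or 'B'")
    return inside, outside, error


def in_src(path):
    # Stand-in sparse specification for the example: everything under src/.
    return path.startswith("src/")


matches = ["src/logo.jpg", "docs/banner.jpg"]
print(select_paths(matches, in_src, "A"))
print(select_paths(matches, in_src, "B"))
```

In both cases only the paths inside the sparse specification are operated on; the two behaviors differ only in how loudly the command reports what it skipped.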

> > +
> > +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> > +    on Behavior A or Behavior B
> > +    * diff (with --cached or REVISION arguments)
> > +    * grep (with --cached or REVISION arguments)
> > +    * show (when given commit arguments)
> > +    * bisect
> > +    * blame
> > +      * and annotate
> > +    * log
> > +      * and variants: shortlog, gitk, show-branch, whatchanged
> > +
> > +    For now, we default to behavior B for these, which want a default of
> > +    --no-restrict.
> > +
> > +    Note that two of these commands -- diff and grep -- also appeared in
> > +    a different list with a default of --restrict, but only when limited
> > +    to searching the working tree.  The working tree vs. history
> > +    distinction is fundamental in how behavior B operates, so this is
> > +    expected.
> > +
> > +    --restrict may make more sense as the long term default for
> > +    these[12], but that's a fair amount of work to implement, and it'd
> > +    be very problematic for behavior B users.  Making it the default
> > +    now, and then slowly implementing that default in various
> > +    subcommands over multiple releases would mean that behavior B users
> > +    would need to learn to slowly add additional flags to their
> > +    commands, depending on git version, to get the behavior they want.
> > +    That gradual switchover would be painful, so we should avoid it at
> > +    least until it's fully implemented.
>
> I think transitioning to '--restrict' by default is a good plan - as far as
> I can tell, user A types seem more common than user B types, and
> '--restrict' creates a more consistent experience.
>
> Maybe '--restrict' could be made the default earlier in 'scalar' (which
> already sets up a cone-mode sparse-checkout by default)? We'd still
> gradually move towards making the option a global default, but 'scalar'
> might get it some early exposure with users that'd benefit the most from it.

I'm glad others support this idea.  A couple years ago, I thought it
was going to be hard to get buy-in to even support it as a config
option.
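
For readers trying to keep the proposed per-command defaults straight, the lists in the quoted section can be summarized as a lookup. This is a minimal Python sketch of the proposal as written in the patch, not of anything Git implements; `default_mode` and the mode strings are hypothetical, "restrict-but-error" uses the corrected name from this reply, and ls-files and update-index are left out because their category is still under discussion in this thread.

```python
# Hypothetical summary of the proposed defaults; not Git code.
RESTRICT = {"status", "switch", "checkout", "read-tree", "reset --hard",
            "restore", "checkout-index"}
RESTRICT_UNLESS_CONFLICTS = {"merge", "rebase", "cherry-pick", "revert"}
NO_RESTRICT = {"archive", "bundle", "commit", "format-patch", "fast-export",
               "fast-import", "commit-tree", "stash", "am", "apply"}
RESTRICT_BUT_ERROR = {"add", "rm", "mv"}
VARIES = {"show", "bisect", "blame", "annotate", "log", "shortlog", "gitk",
          "show-branch", "whatchanged"}


def default_mode(command, behavior="B", history=False):
    """Return the proposed default for a command.

    behavior: "A" or "B", the two user camps described in the document.
    history:  for diff/grep, True when --cached or a REVISION is given.
    """
    if command in ("diff", "grep") and not history:
        return "restrict"  # working-tree searches are always restricted
    if command in ("diff", "grep") or command in VARIES:
        return "restrict" if behavior == "A" else "no-restrict"
    if command in RESTRICT:
        return "restrict"
    if command in RESTRICT_UNLESS_CONFLICTS:
        return "restrict-unless-conflicts"
    if command in RESTRICT_BUT_ERROR:
        return "restrict-but-error"
    if command in NO_RESTRICT:
        return "no-restrict"
    raise KeyError(f"no proposed default recorded for {command!r}")


print(default_mode("grep"))
print(default_mode("grep", behavior="B", history=True))
print(default_mode("log", behavior="A"))
```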

> > +=== Implementation Questions ===
> > +
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
> > +    * Rationale making me lean slightly towards --[no-]restrict:
> > +      * We want a name that works for many commands, so we need a name that
> > +     does not conflict
> > +      * --[no-]restrict isn't overly long and seems relatively explanatory
> > +      * `--sparse`, as used in add/rm/mv, is totally backwards for
> > +     grep/log/etc.  Changing the meaning of `--sparse` for these
> > +     commands would fix the backwardness, but possibly break existing
> > +     scripts.  Using a new name pairing would allow us to treat
> > +     `--sparse` in these commands as a deprecated alias.
> > +      * There is a different `--sparse`/`--dense` pair for commands using
> > +     revision machinery, so using that naming might cause confusion
> > +      * There is also a `--sparse` in both pack-objects and show-branch, which
> > +     don't conflict but do suggest that `--sparse` is overloaded
> > +      * The name --ignore-skip-worktree-bits is a double negative, is
> > +     quite a mouthful, refers to an implementation detail that many
> > +     users may not be familiar with, and we'd need a negation for it
> > +     which would probably be even more ridiculously long.  (But we
> > +     can make --ignore-skip-worktree-bits a deprecated alias for
> > +     --no-restrict.)
>
> I think '--[no-]restrict' is a good choice - it doesn't have the ambiguity
> of '--sparse' or the so-verbose-it's-confusing nature of
> '--ignore-skip-worktree-(bits|entries)'. My only concern would be with the
> fact that '--[no-]restrict' doesn't clearly indicate its relationship to
> sparse-checkout, but a longer name (like
> '--[no-]restrict-to-sparse-checkout') would be cumbersome, not worth it for
> the little bit of extra info a user would get.

Yeah, that lack of relationship is annoying, but perhaps we can create
one by adding a --[no-]restrict flag to `sparse-checkout (init|set)`?

> > +
> > +  * Should --[no-]restrict be a git global option, or added as options to each
> > +    relevant command?  (Does that make sense given the multitude of different
> > +    default behaviors we have for different options?)
>
> That's an interesting idea! I'd be fine either way, there are pros and cons
> to each. E.g., it feels a little weird putting the option before the command
> ('git --no-restrict add' vs. 'git add --no-restrict'), but the option does
> apply to nearly every command (and it's easier to describe/document from a
> Git-wide perspective than a per-command perspective).

One difficulty with global is that both --restrict and --no-restrict
will be added.  So:
  * What if --restrict is passed with a command that only uses
no-restrict behavior?  For example: stash? apply? commit?  etc.
  * What if --restrict is passed with a command that defaults to
something not-quite-restrict?  Such as add?  Or merge?  Should it
attempt harder to ignore paths outside the sparse specification?
  * What if --restrict is passed to a command that doesn't understand
or use paths at all?  Such as update-ref?  Or branch?  Or repack?

Do we just ignore in the first and third case, and map it to the
almost-restrict in the second case?
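
One possible answer to the three cases above (ignore the flag in the first and third, map it to the almost-restrict mode in the second) can be sketched as follows. This is a hypothetical Python model, not Git code: `resolve_global_restrict`, the command sets, and the mode strings are illustrative names, and the sets contain only the example commands mentioned above.

```python
# Hypothetical model of a global `--restrict`; not Git code.
NO_RESTRICT_ONLY = {"stash", "apply", "commit"}   # case 1: flag is ignored
ALMOST_RESTRICT = {                               # case 2: keep the variant
    "add": "restrict-but-error",
    "merge": "restrict-unless-conflicts",
}
PATHLESS = {"update-ref", "branch", "repack"}     # case 3: nothing to restrict


def resolve_global_restrict(command):
    """Return the mode a command would run in when given a global --restrict."""
    if command in NO_RESTRICT_ONLY:
        return "no-restrict"
    if command in ALMOST_RESTRICT:
        return ALMOST_RESTRICT[command]
    if command in PATHLESS:
        return None  # the option has no meaning for this command
    return "restrict"


for cmd in ("stash", "add", "merge", "branch", "log"):
    print(cmd, resolve_global_restrict(cmd))
```

Silently ignoring the flag keeps scripts simple, but it also means a user passing `--restrict` to such a command gets no hint that it had no effect.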

> > +
> > +  * If a config option is added (core.restrictToSparsity?) what should
> > +    the values and description be?  There's a risk of confusion, because
> > +    we only want this config option to affect the history-querying
> > +    commands (log/diff/grep) and maybe the path-modifying worktree
> > +    commands (add/rm/mv), but certainly not most of the others.  Previous config
> > +    suggestion here: [13]
>
> For values, maybe 'strict' (for behavior A/'--restrict' across the board),
> 'loose' (for behavior B), 'off'/'none' (for '--no-restrict' across the
> board)? For the description, it could outline each of the use cases and
> highlight notable command behavior differences? Kind of like what you
> already have in [13].

I'm a little lost on your third case there.  How would a
"`--no-restrict` across the board" setting be useful?  Doesn't having
checkout/switch default to --no-restrict defeat the point of
sparse-checkouts?  I suspect you meant something else by "across the
board", but I don't know what other usecase exists that defines the
edge of the board for your scenario.

> > +
> > +  * Should --sparse in ls-files be made an alias for --restrict?
> > +    `--restrict` is certainly a near synonym in cone-mode, but even then
> > +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> > +    option has no effect, and in cone-mode it still shows the sparse
> > +    directory entries which are technically outside the sparsity
> > +    specification.
>
> I don't think so (for the reasons I mentioned earlier - tl;dr --sparse and
> --restrict are conceptually quite different, and functionally independent).
> I do think '--sparse' should be renamed as part of the "Implementation
> Goals/Plans", though.

Yeah, sounds good.

> > +
> > +  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
> > +    restore be made deprecated aliases for --no-restrict?  (They have the
> > +    same meaning.)
> > +
> > +  * Should --ignore-skip-worktree-entries in update-index be made a
> > +    deprecated alias for --no-restrict?  (Or, better yet, should the
> > +    option just be nuked from orbit after flipping the default, since
> > +    the reverse option is never wanted and the sole purpose of this
> > +    option was to turn off a bug?)
>
> That's an interesting bit of history! I tend to think of 'update-index' as
> "plumbing add/rm", so I think there's still a benefit to having a
> '--restrict' mode.
>
> In any case, if I'm reading this correctly, these two options are subtly
> different than what's proposed for '--restrict', since IIRC they don't take
> into account the sparse patterns at all (only operating based on
> 'SKIP_WORKTREE'). If '--restrict' will involve also using the sparse
> patterns, the behavior would change. I'm happy with doing that (I think the
> change would be beneficial), but it should probably be explicitly noted
> either here or whenever those commands are updated.

I think of `--restrict` as "apply operation to the sparse
specification", and as noted above, I view the sparse specification as
able to transiently diverge from the canonical sparsity patterns in
$GIT_DIR/info/sparse-checkout.

However, that's not really relevant here, because the difference
between sparse specification and sparsity patterns only matters for
--restrict.  In contrast, --no-restrict means apply operation to all
paths in both cases, making that subtle difference a moot point.

Since in this case these flags map to --no-restrict, we don't need to
worry about that distinction.

> > +
> > +  * sparse-checkout: once behavior A is fully implemented, should we
> > +    take an interim measure to easy people into switching the default?
>
> nit: s/easy/ease/

Indeed, thanks for catching.

> > +    Namely, if folks are not already in a sparse checkout, then require
> > +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> > +    would set core.restrictToSparse according to the setting given), and
> > +    throw an error if the flag is not provided?  That error would be a
> > +    great place to warn folks that the default may change in the future,
> > +    and get them used to specifying what they want so that the eventual
> > +    default switch is seamless for them.
>
> Sounds like a good approach to me! It avoids needing to constantly
> re-specify '--[no-]restrict' on every 'sparse-checkout set' (because it sets
> the config), and also provides visibility to users.

:-)

> > +
> > +  * clone: should we provide some mechanism for tying partial clones and
> > +    sparse checkouts together better.  Maybe an option
> > +     --sparse=dir1,dir2,...,dirN
> > +    which:
> > +       * Does initial fetch with `--filter=blob:none`
> > +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> > +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> > +      fault in the missing blobs within the sparse
> > +      specification...except that rev-list needs some kind of options
> > +      to also get files from leading directories too.
> > +       * Sets --restrict mode to allow focusing on the cone of interest
> > +      (and to permit disconnected development)
>
> Similar to the '--restrict' default, this could also be a good fit for
> 'scalar clone'.

It's awesome that you're already thinking about how to get early testing.

> > +
> > +
> > +=== Implementation Goals/Plans ===
>
> The rest of this (+the "Known bugs" section) all look good to me.
>
> Thanks again for writing this document, I really appreciate the time &
> effort you put into it! It'll serve as a clear reference for work on
> sparse-checkout going forward, and ultimately make sparse-checkout usage a
> much better experience for users.

Thanks for taking the time to read through it and provide detailed feedback!

>
> > base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
>
Elijah Newren Sept. 27, 2022, 7:30 a.m. UTC | #8
On Mon, Sep 26, 2022 at 3:36 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Victoria Dye <vdye@github.com> writes:
>
> >> +* Commands behaving the same regardless of high-level use-case
> >> +
> >> +  * commands that only look at files within the sparsity specification
> >> +
> >> +      * status
> >> +      * diff (without --cached or REVISION arguments)
> >> +      * grep (without --cached or REVISION arguments)
> >
> > 'status' and 'diff' currently show information about untracked files outside
> > the working tree (since, not being in the index, they don't have a
> > 'SKIP_WORKTREE' to use). Should that change with the proposed '--restrict'
> > option?
>
> Most likely not.  When sparsity specification is in effect, as you
> said elsewhere in your response, no files, whether tracked or
> untracked, should exist that are outside your area of interest.
> Their presence should be reported as anomalies by "git status".
>
> Unless the command is being run with the "-uno" option, that is.

Oh, wow, that's something completely outside what I had considered.  I
had viewed sparse-checkouts as splitting "tracked files" into two
subsets.  As such, `--[no-]restrict` could only affect selecting
whether the smaller or larger set of tracked files was of interest.
From that viewpoint, untracked files seemed orthogonal, and thus there
couldn't be such a thing as an "anomalous untracked file".

But this idea is very interesting.  Hmm...

>
> > - 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
> >   & advise on how to clean up the modified files to re-align with the
> >   sparsity patterns.
> > - 'reset --hard' silently drops the modified file and resets the
> >   'SKIP_WORKTREE' bit on the corresponding index entry.
> >
> > With the exception of 'reset --hard' (aggressively and unconditionally
> > cleaning the worktree & index is an important aspect of the command, IMO),
> > I'd personally like to see commands in this category align with the behavior
> > of 'switch' where they don't already. Regardless of what we decide, though,
> > I think it's probably worth documenting the "modified outside of sparsity
> > patterns" case.
>
> True.  I agree on both counts.
>
> > Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
> > of the entries it reads into the index. Having all of your files suddenly
> > appear "deleted" probably isn't desired behavior, so it might be a good
> > candidate for the "Known bugs" section.
>
> I would imagine that it actually is OK to say that it is the
> responsibility of whoever invokes read-tree the plumbing command
> to reapply the skip-worktree bits and/or collapse the index entries
> outside the area of interest into trees afterwards.

I'll keep that in mind, but that sounds very error prone to me.

> >> +* Commands that differ for behavior A vs. behavior B:
> >> +
> >> +  * commands that make modifications:
> >
> > nit: "make modifications" -> "make modifications to the index"?
>
> That clarification actually raises an interesting question.  Do we
> want three level distinction, i.e. different behaviour between
> commands that touch and do not touch the working tree, between those
> that touch and do not touch the index, and between those that touch
> and do not touch the commit?
>
> As the index is merely a way to express what the user did to
> eventually create the next tree to be recorded in the commit, my gut
> feeling is that it may be easier to understand if we treated the
> working tree and the index at the same level, actually.  I.e. if
> grepping in the working tree of a sparse checkout does not find a
> match outside the cones of interest, it may make sense to do the
> same at least by default in "grep --cached" mode.
>
> If I understand Stolee's write-up on the use case of those in the
> camp B, they are more aware of the larger whole and expect to see
> hits outside the area they have checked out when running "grep HEAD".
> But in their use case, they do not touch (only look at) the area
> outside their cone of interest, so if we limit the operation to
> their cone of interest by default for working tree, the same default
> probably should apply equally for an operation that inspects the
> index.

That is an interesting angle to view things; I wondered if an idea
along these lines was going to come up when I was first responding to
Shaoxuan.  I also wondered if people would come to different
conclusions on whether "git grep --cached" should search outside the
sparsity-paths depending upon whether the sparse index was in use.

One thing that makes me a little leery about this path is whether we
can consistently apply the scoped-to-sparse-specification rule for
index operations.  For example:

  * You previously agreed that `git format-patch` should ignore sparse
specification and operate full tree.
  * `git apply --cached $PATCH` only updates the index, and you
suggested in an alternate email that apply should operate full-tree
(at least with --index or without --cached, but I assume by extension
it probably also applies with --cached).
  * What if someone ran the last two commands, and then goes to commit
the result?  Do we want to scope `git commit` to only accept staged
changes within the sparse specification by default?  I thought we
wouldn't and marked commit as a full-tree operation, by default.
  * What if someone runs `git diff --cached` just before that commit?
Do we scope the diff to only those paths within the sparse
specification?
  * What if someone runs `git status` just before that commit?  Do we
only show staged changes within the sparse specification?

It feels like "git grep --cached" is perhaps the next thing along this
sequence, and I don't see a clear line where to draw that we should
limit things to the sparse specification for the index while treating
the other operations as full tree; it seems like something feels
broken or inconsistent in this sequence of commands if we attempt to
do so.


Also, I have some users in camp B.  They specifically have been using
"git grep --cached ..." for a few years now to find other code of
interest outside of their current sparse-checkout (often in stubbed
out dependencies or other projects that depend on the area you are
modifying).  This allows them to make internal API changes and find
the other sites that need to be modified, including outside the normal
sparse cone.  Perhaps I could re-teach them to use "git grep ... HEAD"
instead, but it may feel like a bit of a break to them.  I've found
"git grep --cached" being documented by others who wrote various "how
to work in sparse checkouts" documents, all commenting on this being
the trick to do a whole-tree search.  I did warn them that we might
change that command on them (and sparse-checkouts in general have a
warning about potentially changing behavior), but I'm a little
hesitant to do so.  So that's a second reason I lean towards treating
index searches the same as REVISION ones -- full-tree for camp B.

Junio C Hamano Sept. 27, 2022, 3:43 p.m. UTC | #9
"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +  * Does the name --[no-]restrict sound good to others?  Are there better options?

Everybody in this thread is interested in sparse checkout, which
unfortunately blinds them to the fact that "restrict to", "limit
to", "focus on", etc. need not be limited to the sparse checkout
feature.  We must have something that hints that the option is about
the sparse checkout feature.

As to the verbs, I do not mind "restrict to".  Other good ones I do
not mind choosing are "limit to" and "focus on".  They would equally
convey the same thing in this context.  And the object of these
verb phrases is the area of interest: those paths without the
skip-worktree bit, i.e. the paths inside the sparse cone(s).

Or we could go the other way.  We are excluding those paths with the
skip-worktree bit, so "exclude" and "ignore" are natural candidates.

These two classes are good if the "restrict" behaviour will never be
the default.  When it is the default, the option often used will
become "--no-restrict", which is awkward.

	Personally I am slightly in favor of "focus on" (i.e.
	"--focus" vs "--unfocus") as that meshes well with the
	concept of "the areas of the working tree paths that I am
	interested in right now", which may already hint that the
	option is about the sparse checkout feature (i.e. "I am
	focusing on these areas right now") and can stay short.  But
	this is just one person's opinion.

> +      * `--sparse`, as used in add/rm/mv, is totally backwards for
> +	grep/log/etc.  Changing the meaning of `--sparse` for these
> +	commands would fix the backwardness, but possibly break existing
> +	scripts.  Using a new name pairing would allow us to treat
> +	`--sparse` in these commands as a deprecated alias.

I actually am in favor of this, even though the appearance of
breaking backward compatibility may be big, but ...

> +      * There is a different `--sparse`/`--dense` pair for commands using
> +	revision machinery, so using that naming might cause confusion

... that is a good reason to avoid these two words.

Junio C Hamano Sept. 27, 2022, 4:07 p.m. UTC | #10
Elijah Newren <newren@gmail.com> writes:

> Oh, wow, that's something completely outside what I had considered.  I
> had viewed sparse-checkouts as splitting "tracked files" into two
> subsets.  As such, `--[no-]restrict` could only affect selecting
> whether the smaller or larger set of tracked files was of interest.
> From that viewpoint, untracked files seemed orthogonal, and thus there
> couldn't be such a thing as an "anomalous untracked file".
>
> But this idea is very interesting.  Hmm...

We need to design the behaviour of "git add" sensibly.  Even if we say
"untracked files are just one class and there are two classes of
tracked ones, those paths of current interest and those that are
uninteresting", we would need to say "'git add F' behaves this way
if F would become 'tracked path of current interest' when added, but
the command behaves this other way if F becomes 'tracked path that
is not interesting right now'".  It may be cleaner to separate the
untracked ones along the same line as the tracked ones.

Which in turn would mean that the skip-worktree bit cannot be the
source of truth.  Sparsity specification (either pattern matching or
being in listed directories) authoritatively decides if a path is of
the current interest or not.  This is simply because untracked ones
cannot have that bit ;-)  We can treat the skip-worktree bit as mere
implementation detail, a measure for optimization.
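
To illustrate why the sparsity specification can classify untracked paths while the skip-worktree bit cannot, here is a minimal Python sketch of a cone-mode membership test that looks only at the path and the listed directories. It is an illustrative model, not Git's implementation: `in_cone` is a hypothetical name, and the rule assumed is the usual cone-mode one (toplevel files, files directly inside any ancestor of a listed directory, and everything beneath a listed directory).

```python
from pathlib import PurePosixPath


def in_cone(path, cone_dirs):
    """Return True if `path` falls inside a cone-mode sparsity specification.

    The answer depends only on the path and the listed directories, so it
    is equally well defined for tracked and untracked files.
    """
    p = PurePosixPath(path)
    if p.parent == PurePosixPath("."):
        return True  # toplevel files are always part of the cone
    for d in map(PurePosixPath, cone_dirs):
        if d in p.parents:
            return True  # the file lives somewhere beneath a listed directory
        if p.parent in d.parents:
            return True  # the file sits directly in an ancestor of one
    return False


cone = ["src/lib"]
for candidate in ("README", "src/Makefile", "src/lib/util/io.c",
                  "src/other/main.c", "docs/guide.txt"):
    print(candidate, in_cone(candidate, cone))
```

Under this model, whether `git add F` targets a path "of current interest" can be answered before F has any index entry at all.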

> One thing that makes me a little leery about this path is whether we
> can consistently apply the scoped-to-sparse-specification rule for
> index operations.  For example:
>
>   * You previously agreed that `git format-patch` should ignore sparse
> specification and operate full tree.

It is not "are we focusing on subset when we talk about index" to
begin with---format-patch is about a commit (or a series of commits),
and you should view it as a member of the "log" family.  Or the
first half of "rebase/cherry-pick" (the other half being "am"),
which should be full-tree, I would think.

>   * `git apply --cached $PATCH` only updates the index, and you
> suggested in an alternate email that apply should operate full-tree
> (at least with --index or without --cached, but I assume by extension
> it probably also applies with --cached).

I have not thought about "apply --cached".  Just like merge-tree can
merge without a working tree, "apply --cached" should be able to
serve as a foundation to apply a series out of lore archive and
create a topic branch without a working tree.

>   * What if someone runs `git diff --cached` just before that commit?
> Do we scope the diff to only those paths within the sparse
> specification?
>   * What if someone runs `git status` just before that commit?  Do we
> only show staged changes within the sparse specification?
>
> It feels like "git grep --cached" is perhaps the next thing along this
> sequence, and I don't see a clear line where to draw that we should
> limit things to the sparse specification for the index while treating
> the other operations as full tree; it seems like something feels
> broken or inconsistent in this sequence of commands if we attempt to
> do so.

OK, it seems that "--cached" has many cases that it wants to operate
on full tree.  I am in general more in favor of making things work
on full tree, simply because I feel it would have less chance of
going wrong, so defaulting to --no-restrict would be fine ;-)
Derrick Stolee Sept. 27, 2022, 4:36 p.m. UTC | #11
On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>

> +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
> +
> +These folks might know there are other things in the repository, but
> +don't care.  They are uninterested in other parts of the repository, and
> +only want to know about changes within their area of interest.  Showing
> +them other results from history (e.g. from diff/log/grep/etc.) is a
> +usability annoyance, potentially a huge one since other changes in
> +history may dwarf the changes they are interested in.

This idea of restricting the commit history to the sparse-checkout
definition (by default, with an escape hatch) seems like the most
radical of the things we've considered. I think it's interesting to
consider, but it might be better to think about things like diffstats,
grepping, and otherwise preventing out-of-cone adjustments by default.

That said, the idea of restricting history is also the simplest to
describe as a user-visible change.

> +Some of these users also arrive at this usecase from wanting to use
> +partial clones together with sparse checkouts and do disconnected
> +development.  Not only do these users generally not care about other
> +parts of the repository, but consider it a blocker for Git commands to
> +try to operate on those.  If commands attempt to access paths in history
> +outside the sparsity specification, then the partial clone will attempt
> +to download additional blobs on demand, fail, and then fail the user's
> +command.  (This may be unavoidable in some cases, e.g. when `git merge`
> +has non-trivial changes to reconcile outside the sparsity path, but we
> +should limit how often users are forced to connect to the network.)

This idea pairs well with a feature I've been meaning to build:
'git sparse-checkout backfill' would download all historical blobs
within the sparse-checkout definition. This is possible with rev-list,
but I want to investigate grouping blobs by path and making requests in
batches, hopefully allowing better deltification and ability to recover
from network disconnections. That makes this idea of "staying within
your sparse-checkout means no missing object downloads" even more likely.

> +  (Behavior B) Users want a sparse working tree, but are working in a larger whole
> +
> +Stolee described this usecase this way[11]:
> +
> +"I'm also focused on users that know that they are a part of a larger
> +whole. They know they are operating on a large repository but focus on
> +what they need to contribute their part. I expect multiple "roles" to
> +use very different, almost disjoint parts of the codebase. Some other
> +"architect" users operate across the entire tree or hop between different
> +sections of the codebase as necessary. In this situation, I'm wary of
> +scoping too many features to the sparse-checkout definition, especially
> +"git log," as it can be too confusing to have their view of the codebase
> +depend on your "point of view."

Thanks for including this.

> +People might also end up wanting behavior B due to complex inter-project
> +dependencies.  The initial attempts to use sparse-checkouts usually
> +involve the directories you are directly interested in plus what those
> +directories depend upon within your repository.  But there's a monkey
> +wrench here: if you have integration tests, they invert the hierarchy:
> +to run integration tests, you need not only what you are interested in
> +and its dependencies, you also need everything that depends upon what
> +you are interested in or that depends upon one of your
> +dependencies...AND you need all the dependencies of that expanded group.
> +That can easily change your sparse-checkout into a nearly dense one.

In my experience, the downstream dependencies are checked via builds in
the cloud, though that doesn't help if they are source dependencies and
you make a breaking change to an API interface. This kind of problem is
absolutely one of system architecture and I don't know what Git can do
other than to acknowledge it and recommend good patterns.

In a properly-organized project, 95% of engineers in the project can have
a small sparse-checkout, then 5% work on the common core that has these
downstream dependencies and require a large sparse-checkout definition.
There's nothing Git can do to help those engineers that do cross-tree
work.

(nit: this is a good place to break up this paragraph.)

> +Naturally, that tends to kill the benefits of sparse-checkouts.  There
> +are a couple solutions to this conundrum: either avoid grabbing
> +dependencies (maybe have built versions of your dependencies pulled from
> +a CI cache somewhere), or say that users shouldn't run integration tests
> +directly and instead do it on the CI server when they submit a code
> +review.  Or do both.  Regardless of whether you stub out your
> +dependencies or stub out the things that depend upon you, there is
> +certainly a reason to want to query and be aware of those other
> +stubbed-out parts of the repository, particularly when the dependencies
> +are complex or change relatively frequently.  Thus, for such uses,
> +sparse-checkouts can be used to limit what you directly build and
> +modify, but these users do not necessarily want their sparse checkout
> +paths to limit their queries of history.

...

> +* Commands behaving the same regardless of high-level use-case

Thank you for this audit of command usage.

> +* Commands that differ for behavior A vs. behavior B:
> +
> +  * commands that make modifications:
> +      * add
> +      * rm
> +      * mv

I think these, along with diff and grep, are great candidates to have
the default behavior fit category A with a flag to act with behavior B.

> +  * commands that query history
> +      * bisect

Interesting that 'bisect' could be considered differently, but I
suppose that if we are presenting the commit history graph in a
simplified form that we'd want to bisect on that simplified graph
instead of the full one.

> +      * blame
> +	* and annotate

blame and annotate operate on a single path, so they already
restrict within the sparse-checkout definition (unless the user
specifies a path outside of the sparse-checkout). The only difference
between A and B would be reporting an error if the path is outside the
definition, right? We don't need to do anything special to simplify
the history.

> +      * show (when given commit arguments)
> +      * log
> +	* and variants: shortlog, gitk, show-branch, whatchanged

And here is where we'd need to make those big changes for simplifying
the history graph. Does 'rev-list' not fit here? I tend to think of
'log' as a formatting layer on top of 'rev-list', but maybe that is
misguided.

> +* Comands I don't know how to classify

nit: s/Comands/Commands/

> +
> +  * ls-files
> +  * checkout-index
> +  * update-index
> +  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list

Plumbing commands might be a good candidate for "by default you
can do anything, but we can add ability to put guard rails on the
sparse-checkout set".

> +  * range-diff
> +
> +    Is this like `log` or `format-patch`?

I think this is more like format-patch. However, we need to be careful
if users use "git log" output to determine the range they provide to
the range-diff command, since that range could indicate a larger set of
commits.

> +=== Subcommand-dependent defaults ===
> +
> +Note that we have different defaults (for the desired behavior, not just
> +the current implementation) depending on the command:
> +
> +  * Commands defaulting to --restrict:

This appears to be the first mention of --restrict. Perhaps it would be
worth declaring what --restrict, --restrict-unless-conflicts, and
--no-restrict mean before creating this categorization?

> +    * status
> +    * diff (without --cached or REVISION arguments)
> +    * grep (without --cached or REVISION arguments)
> +    * switch
> +    * checkout (the switch-like half)
> +    * read-tree
> +    * reset (--hard)
> +    * restore/checkout
> +    * checkout-index
> +
> +    This behavior makes sense; these interact with the working tree.
> +
> +  * Commands defaulting to --restrict-unless-conflicts
> +    * merge
> +    * rebase
> +    * cherry-pick
> +    * revert

In my mind, --restrict-unless-conflicts doesn't provide any value unless
you want the --restrict mode to create an _error_ when trying to do
something outside of the sparse-checkout cone.

The only thing I can think about is that the diffstat might want to show
the stats for the conflicted files, in which case that's an important
perspective on the distinction from --restrict.

> +    In the case of am and apply, those commands only operate on the
> +    working tree, so they are kind of in the same boat as stash.
> +    Perhaps `git am` could run `git sparse-checkout reapply`
> +    automatically afterward and move into a category more similar to
> +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> +    vivify files besides just conflicted ones when there are conflicts.

'git am' should be able to construct the resulting commit from the patch
without adding files outside of the sparse-checkout definition. If there
is a conflict, it fails in the application, anyway. I suppose you are
writing this here because 'git am' does not play nice with sparse-checkout
right now.

> +    In the case of ls-files, `git ls-files -t` is often used to see what
> +    is sparse and not, in which case restricting would not make sense.
> +    Also, ls-files has traditionally been used to get a list of "all
> +    tracked files", which would suggest not restricting.  But it's
> +    slightly funny, because sparse-checkouts essentially split tracked
> +    files into two categories -- those in the sparse specification and
> +    those outside -- and how does the user specify which of those two
> +    types of tracked files they want?

> +  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
> +    may affect how verbose the warnings are):

More modes! OK.

> +    * add
> +    * rm
> +    * mv
> +
> +    The defaults here perhaps make sense since they are nearly --restrict, but
> +    actually using --restrict could cause user confusion if users specify a
> +    specific filename, so they warn by default.  That logic may sound like
> +    --no-restrict should be the default, but that's prone to even bigger confusion:
> +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> +	the file randomly disappearing later when some subsequent command is run
> +	(since various commands automatically clean up unmodified files outside
> +	the sparsity specification).
> +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> +	outside the range of the user's interest.  Much better to operate on the
> +	sparsity specification and give the user warnings if other files could have
> +	matched.

The cost of checking for other files that might match is sometimes so large
(needing to expand the sparse index or walk trees to find those path names) that
I would not recommend warning that we _didn't_ do something. Perhaps we could
instead emit advice that says "we did not look outside the sparse-checkout
definition for matching paths" when the pathspec is not an exact path or a
prefix match.

> +      * `git mv` has similar surprises when moving into or out of the cone, so
> +	best to restrict and throw warnings if restriction might affect the result.
> +
> +    There may be a difference in here between behavior A and behavior B.
> +    For behavior A, we probably only want to warn if there were no
> +    suitable matches for files in the sparsity specification, whereas
> +    for behavior B, we may want to warn even if there are valid files to
> +    operate on if the result would have been different under
> +    `--no-restrict`.

I think in behavior B, users who actually want to modify things tree-wide will
actually increase their sparse-checkout definition to include those files so
they can validate what they are doing.

> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> +    on Behavior A or Behavior B
> +    * diff (with --cached or REVISION arguments)
> +    * grep (with --cached or REVISION arguments)
> +    * show (when given commit arguments)
> +    * bisect
> +    * blame
> +      * and annotate
> +    * log
> +      * and variants: shortlog, gitk, show-branch, whatchanged
> +
> +    For now, we default to behavior B for these, which want a default of
> +    --no-restrict.

I do feel pretty strongly that we'll want a --no-restrict default here
because otherwise we will create confusion. I'm not even sure if we would
want to make this available via a config setting, but likely a config
setting makes sense in the long term.

> +=== Implementation Questions ===
> +
> +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> +    * Names in use, or appearing in patches, or previously suggested:
> +      * --sparse/--dense
> +      * --ignore-skip-worktree-bits
> +      * --ignore-skip-worktree-entries
> +      * --ignore-sparsity
> +      * --[no-]restrict-to-sparse-paths
> +      * --full-tree/--sparse-tree
> +      * --[no-]restrict

I like the simplicity of --[no-]restrict, and my only worry is that it
doesn't immediately link to what it is restricting.

Perhaps something like "scope" would describe the set of things we care
about, but use a text mode:

	--scope=sparse	(--restrict)
	--scope=all	(--no-restrict)

But I'm notoriously bad at naming things.

> +  * Should --[no-]restrict be a git global option, or added as options to each
> +    relevant command?  (Does that make sense given the multitude of different
> +    default behaviors we have for different options?)

If we can make it a global option, that would be great, then update
the commands to behave under that mode as we go.

If that doesn't work, then adding the consistent option across commands
would be helpful. It might be good to make an OPT_RESTRICT macro (much
like OPT__VERBOSE, OPT__QUIET, and similar macros).

> +  * Should --sparse in ls-files be made an alias for --restrict?
> +    `--restrict` is certainly a near synonym in cone-mode, but even then
> +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> +    option has no effect, and in cone-mode it still shows the sparse
> +    directory entries which are technically outside the sparsity
> +    specification.

We should definitely replace the --sparse option(s) with whatever we
choose here. For ls-files, we have the issue that we are reporting
what is in the index, and in non-cone-mode the index cannot be sparse.

Now, maybe we change what the ls-files mode does under --restrict and
only have it report the paths within the sparse-checkout and not even
show the results for sparse directory entries. The --no-restrict mode would
then expand a sparse index so that only file paths are shown.

> +  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
> +    restore be made deprecated aliases for --no-restrict?  (They have the
> +    same meaning.)

Yes.

> +  * Should --ignore-skip-worktree-entries in update-index be made a
> +    deprecated alias for --no-restrict?  (Or, better yet, should the
> +    option just be nuked from orbit after flipping the default, since
> +    the reverse option is never wanted and the sole purpose of this
> +    option was to turn off a bug?)

Yes and yes.

> +  * sparse-checkout: once behavior A is fully implemented, should we
> +    take an interim measure to easy people into switching the default?

nit: s/easy/ease/

> +    Namely, if folks are not already in a sparse checkout, then require
> +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> +    would set core.restrictToSparse according to the setting given), and
> +    throw an error if the flag is not provided?  That error would be a
> +    great place to warn folks that the default may change in the future,
> +    and get them used to specifying what they want so that the eventual
> +    default switch is seamless for them.

I don't like using the same option name (--[no-]restrict) for something
that sets a config option to keep that behavior permanently. Different
names that make it clearer could be:

	--enable-restrict-mode
	--set-scope=(sparse|all)

> +  * clone: should we provide some mechanism for tying partial clones and
> +    sparse checkouts together better.  Maybe an option
> +	--sparse=dir1,dir2,...,dirN
> +    which:
> +       * Does initial fetch with `--filter=blob:none`
> +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> +	 fault in the missing blobs within the sparse
> +	 specification...except that rev-list needs some kind of options
> +	 to also get files from leading directories too.
> +       * Sets --restrict mode to allow focusing on the cone of interest
> +	 (and to permit disconnected development)

As mentioned, I think we should have the option to backfill the blobs in
the sparse-checkout definition, but 'git clone' should not do this by
default. It's something that can be launched in the background, maybe, but
not a blocking operation on being able to use the repository.

'scalar clone' is an excellent testing bed for these kinds of things,
like setting the --restrict mode by default.

Hopefully my responses aren't too far off-base. I'll go read the rest of
the discussion now that I've contributed my thoughts on the doc.

Thanks,
-Stolee
Derrick Stolee Sept. 27, 2022, 4:42 p.m. UTC | #12
On 9/26/2022 4:08 PM, Victoria Dye wrote:
> Elijah Newren via GitGitGadget wrote:
>> +=== Purpose of sparse-checkouts ===
>> +
>> +sparse-checkouts exist to allow users to work with a subset of their
>> +files.
>> +
>> +The idea is simple enough, but there are two different high-level
>> +usecases which affect how some Git subcommands should behave.  Further,
>> +even if we only considered one of those usecases, sparse-checkouts
>> +modify different subcommands in over a half dozen different ways.  Let's
>> +start by considering the high level usecases in this section:
>> +
>> +  A) Users are _only_ interested in the sparse portion of the repo
>> +
>> +  B) Users want a sparse working tree, but are working in a larger whole
> 
> Both of these use cases make sense to me! Two thoughts/comments:
> 
> 1. This could be a "me" problem, but I regularly struggle with "sparse"
>    having different meanings in similar contexts. For example, a "sparse
>    directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
>    of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
>    A quick note/section outlining some standard terminology would be
>    immensely helpful.

This difference is absolutely my fault, and maybe we should consider
fixing this problem by renaming sparse directories something else.
Perhaps "skipped directory" would be a better name?

Thanks,
-Stolee
Elijah Newren Sept. 28, 2022, 5:38 a.m. UTC | #13
On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
>
> > +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
> > +
> > +These folks might know there are other things in the repository, but
> > +don't care.  They are uninterested in other parts of the repository, and
> > +only want to know about changes within their area of interest.  Showing
> > +them other results from history (e.g. from diff/log/grep/etc.) is a
> > +usability annoyance, potentially a huge one since other changes in
> > +history may dwarf the changes they are interested in.
>
> This idea of restricting the commit history to the sparse-checkout
> definition (by default, with an escape hatch) seems like the most
> radical of the things we've considered. I think it's interesting to
> consider, but it might be better to think about things like diffstats,
> grepping, and otherwise preventing out-of-cone adjustments by default.
>
> That said, the idea of restricting history is also the simplest to
> describe as a user-visible change.

By "restricting commit history", are you thinking in terms of "git log
-- PATHS" or more like some kind of special --filter to git-clone?

I get the feeling you might be thinking about the latter, whereas I
was assuming users had all commits (and all trees), but log/diff would
restrict output based on relevant paths.

> > +Some of these users also arrive at this usecase from wanting to use
> > +partial clones together with sparse checkouts and do disconnected
> > +development.  Not only do these users generally not care about other
> > +parts of the repository, but consider it a blocker for Git commands to
> > +try to operate on those.  If commands attempt to access paths in history
> > +outside the sparsity specification, then the partial clone will attempt
> > +to download additional blobs on demand, fail, and then fail the user's
> > +command.  (This may be unavoidable in some cases, e.g. when `git merge`
> > +has non-trivial changes to reconcile outside the sparsity path, but we
> > +should limit how often users are forced to connect to the network.)
>
> This idea pairs well with a feature I've been meaning to build:
> 'git sparse-checkout backfill' would download all historical blobs
> within the sparse-checkout definition. This is possible with rev-list,
> but I want to investigate grouping blobs by path and making requests in
> batches, hopefully allowing better deltification and ability to recover
> from network disconnections. That makes this idea of "staying within
> your sparse-checkout means no missing object downloads" even more likely.

This sounds awesome.

> > +  (Behavior B) Users want a sparse working tree, but are working in a larger whole
> > +
> > +Stolee described this usecase this way[11]:
> > +
> > +"I'm also focused on users that know that they are a part of a larger
> > +whole. They know they are operating on a large repository but focus on
> > +what they need to contribute their part. I expect multiple "roles" to
> > +use very different, almost disjoint parts of the codebase. Some other
> > +"architect" users operate across the entire tree or hop between different
> > +sections of the codebase as necessary. In this situation, I'm wary of
> > +scoping too many features to the sparse-checkout definition, especially
> > +"git log," as it can be too confusing to have their view of the codebase
> > +depend on your "point of view."
>
> Thanks for including this.

I was actually worried this usecase was decreasing in priority for
you.  More on that later...

> > +People might also end up wanting behavior B due to complex inter-project
> > +dependencies.  The initial attempts to use sparse-checkouts usually
> > +involve the directories you are directly interested in plus what those
> > +directories depend upon within your repository.  But there's a monkey
> > +wrench here: if you have integration tests, they invert the hierarchy:
> > +to run integration tests, you need not only what you are interested in
> > +and its dependencies, you also need everything that depends upon what
> > +you are interested in or that depends upon one of your
> > +dependencies...AND you need all the dependencies of that expanded group.
> > +That can easily change your sparse-checkout into a nearly dense one.
>
> In my experience, the downstream dependencies are checked via builds in
> the cloud, though that doesn't help if they are source dependencies and
> you make a breaking change to an API interface. This kind of problem is
> absolutely one of system architecture and I don't know what Git can do
> other than to acknowledge it and recommend good patterns.

I was talking about (source) dependencies between
modules/projects/whatever-you-want-to-call-the-subcomponents of your
repository.  We have hundreds of modules, with various cross-module
dependencies that evolve over time.

I get the feeling from your description that your intra-repository
dependencies between modules/projects/whatever are much more static
for you than what we deal with.  (Which is a good thing; it'd be nice
if ours were more static.)

> In a properly-organized project, 95% of engineers in the project can have
> a small sparse-checkout, then 5% work on the common core that has these
> downstream dependencies and require a large sparse-checkout definition.

"In a properly-organized project"?  I'm unsure if this is an
indictment of some of the repositories I deal with in reality (and to
be fair, it might be a totally fair indictment), or if your statement
is starting to cross into "No true Scotsman" territory.  ;-)

I would probably lean towards the former (we know it's more messy than
it should be), but I'm a bit puzzled that you'd just brush aside my
mention of integration tests.  We have people who want to run
integration tests locally, even when only modifying a small area of
the codebase.  These users are not doing cross-tree work, rather they
are doing cross-tree testing in conjunction with their work.  Running
such tests requires a build of the modules across the repository,
which naively would push folks into a dense checkout...and really long
local builds.  We want fast local builds, and sparse-checkouts help us
achieve that...but it does mean we have to be clever about how we
build in order to let these users run integration tests.  (And we have
to make it easy for users to discover the relevant integration tests,
and sometimes associated code components that depend on what they are
changing, which is where behavior B comes in).

> There's nothing Git can do to help those engineers that do cross-tree
> work.

I'm going to partially disagree with this, in part because of our
experience with many inter-module dependencies that evolve over time.
Folks can start on a certain module and begin refactoring.  Being
aware that their changes will affect other areas of the code, they can
do a search (e.g. "git grep --cached ..." to find cases outside their
current sparse checkout), and then selectively unsparsify to get the
relevant few dozen (or maybe even few hundred) modules added.  They
aren't switching to a dense checkout, just a less sparse one.  When
they are done, they may narrow their sparse specification again.  We
have a number of users doing cross-tree work who are using
sparse-checkouts, and who find it productive and say it still speeds
up their local build/test cycles.
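
As a rough sketch of that workflow in a throwaway repository (the
modules/a, modules/b, modules/c layout and the helper() function are
invented for illustration, not taken from any real project):

```shell
set -e
# Throwaway repository with three hypothetical modules.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p modules/a modules/b modules/c
echo 'int helper(void);' >modules/a/api.h
echo 'int use(void) { return helper(); }' >modules/b/user.c
echo 'int other(void) { return 0; }' >modules/c/other.c
git add .
git -c user.name=Example -c user.email=example@example.com \
    commit -q -m initial

# Start out focused on module a only.
git sparse-checkout init --cone
git sparse-checkout set modules/a

# Search the index rather than the working tree, so that callers in
# modules that are not checked out can be found as well.
git grep --cached -l helper

# Widen the checkout by just the module that turned up.
git sparse-checkout add modules/b
git sparse-checkout list
```

Once the cross-module change is done, `git sparse-checkout set
modules/a` narrows the checkout again.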

So, I'd say that ensuring Git supports behavior B well in
sparse-checkouts is something Git can do to help out both some of the
engineers doing cross-tree work, and some of the engineers that are
doing cross-tree testing.

(For full disclosure, we also have users doing cross-tree work using
regular dense checkouts and I agree there's not a lot we can do to
help them.)

> (nit: this is a good place to break up this paragraph.)

Yeah, it was kind of nice to have one paragraph per explanation of why
people might like behavior B.  But this is indeed a long paragraph.

[...]
> > +      * blame
> > +     * and annotate
>
> blame and annotate operate on a single path, so they already
> restrict within the sparse-checkout definition (unless the user
> specifies a path outside of the sparse-checkout). The only difference
> between A and B would be reporting an error if the path is outside the
> definition, right? We don't need to do anything special to simplify
> the history.

You're forgetting the possibility of one or more -C flags.  I'll note
it specifically on the line.

> > +      * show (when given commit arguments)
> > +      * log
> > +     * and variants: shortlog, gitk, show-branch, whatchanged
>
> And here is where we'd need to make those big changes for simplifying
> the history graph. Does 'rev-list' not fit here? I tend to think of
> 'log' as a formatting layer on top of 'rev-list', but maybe that is
> misguided.

Right, rev-list should probably be included here too.

> > +* Comands I don't know how to classify
>
> nit: s/Comands/Commands/

Thanks.

[...]
> > +=== Subcommand-dependent defaults ===
> > +
> > +Note that we have different defaults (for the desired behavior, not just
> > +the current implementation) depending on the command:
> > +
> > +  * Commands defaulting to --restrict:
>
> This appears to be the first mention of --restrict. Perhaps it would be
> worth declaring what --restrict, --restrict-unless-conflicts, and
> --no-restrict mean before creating this categorization?

Probably, yes.  Doing that might have even avoided some of the
confusion below...

[...]
> > +  * Commands defaulting to --restrict-unless-conflicts
> > +    * merge
> > +    * rebase
> > +    * cherry-pick
> > +    * revert
>
> In my mind, --restrict-unless-conflicts doesn't provide any value unless
> you want the --restrict mode to create an _error_ when trying to do
> something outside of the sparse-checkout cone.

Are you assuming here I was suggesting command line flags?  If so, I
apologize for my poor wording/descriptions.  At some point, I was just
noting that I was referring to behavior by the names of `--restrict`
and `--no-restrict`.  While pointing out that a strict interpretation
of the behaviors suggested by each name didn't match all commands, I
came up with names for alternate behaviors.  These names weren't meant
to become flags we'd use on the command line, despite the name that
perhaps suggests such.  Probably a really poor way to name these
behaviors; sorry about that.

Anyway, we do not want the behavior of `--restrict` for these
commands.  That would imply not providing conflicts to users for them
to resolve unless they are contained within the sparse specification,
which would clearly be broken.  We instead chose to write out files
with conflicts regardless of whether they are outside the sparse
specification.  This modified behavior I gave the name of
`--restrict-unless-conflicts`, but we don't need or want an actual
command line flag for that.  I think the behavior should just remain
hardcoded into these commands.

(Note: these commands are among those that make me think
--[no-]restrict or --[un]focus or whatever might not make sense as a
git global option: `--restrict-unless-conflicts` behavior is the
default for these and in fact the only sensible option, I think.  If
there's only one sensible option, no actual flag names are needed.)

> The only thing I can think about is that the diffstat might want to show
> the stats for the conflicted files, in which case that's an important
> perspective on the distinction from --restrict.

We only show the diffstat on a successful merge, so there's no
diffstat to show if there are any conflicted files.

> > +    In the case of am and apply, those commands only operate on the
> > +    working tree, so they are kind of in the same boat as stash.
> > +    Perhaps `git am` could run `git sparse-checkout reapply`
> > +    automatically afterward and move into a category more similar to
> > +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> > +    vivify files besides just conflicted ones when there are conflicts.
>
> 'git am' should be able to construct the resulting commit from the patch
> without adding files outside of the sparse-checkout definition. If there

That's yet another interesting take on `git am` -- different than what
I originally had in mind, and different from what Junio suggested.  I
think both of your takes are better than what I was initially
thinking, I just wish your two approaches weren't pulling in opposite
directions.  :-)

> is a conflict, it fails in the application, anyway. I suppose you are
> writing this here because 'git am' does not play nice with sparse-checkout
> right now.

Well, as a result of this thread, we now have at least 2-3 potential
solutions we could pursue...

[...]
> > +    * add
> > +    * rm
> > +    * mv
> > +
> > +    The defaults here perhaps make sense since they are nearly --restrict, but
> > +    actually using --restrict could cause user confusion if users specify a
> > +    specific filename, so they warn by default.  That logic may sound like
> > +    --no-restrict should be the default, but that's prone to even bigger confusion:
> > +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> > +     the file randomly disappearing later when some subsequent command is run
> > +     (since various commands automatically clean up unmodified files outside
> > +     the sparsity specification).
> > +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> > +     outside the range of the user's interest.  Much better to operate on the
> > +     sparsity specification and give the user warnings if other files could have
> > +     matched.
>
> The cost of checking for other files that might match is sometimes so large
> (needing to expand the sparse index or walk trees to find those path names) that
> I would not recommend warning that we _didn't_ do something. Perhaps an advice
> that says "we did not look outside the sparse-checkout definition for matching
> paths" when the pathspec is not an exact path or a prefix match.

Ah, good point, and a good idea to keep in mind.

However, I think advise_on_updating_sparse_paths() currently does what
you're warning against.  Do you think there's a good chance this is
the cause of the performance bug reported over at
https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com
?

> > +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> > +    on Behavior A or Behavior B
> > +    * diff (with --cached or REVISION arguments)
> > +    * grep (with --cached or REVISION arguments)
> > +    * show (when given commit arguments)
> > +    * bisect
> > +    * blame
> > +      * and annotate
> > +    * log
> > +      * and variants: shortlog, gitk, show-branch, whatchanged
> > +
> > +    For now, we default to behavior B for these, which want a default of
> > +    --no-restrict.
>
> I do feel pretty strongly that we'll want a --no-restrict default here
> because otherwise we will present confusion. I'm not even sure if we would
> want to make this available via a config setting, but likely a config
> setting makes sense in the long term.

You've got me slightly confused.  You did say the same thing a long time ago:

    "But I also want to avoid doing this as a default or even behind a
config setting."[A]

BUT, when Shaoxuan proposed making --restrict/--focus the default for
one of these commands, you seemed to be on board[B].

Personally, I thought that if anyone would object to some of these
commands changing, that grep would be considered as among the riskier.
For diff and log, printing a "Warning: restricting output to the
sparse-checkout specification" would be pretty innocuous, but for grep
that wouldn't be.

I was a little unsure about making `--restrict/--focus` the default
for these commands, both based on your previous concerns and because
of thinking about some of my behavior B users.  But then, it seemed
like everyone else was pushing for not only having this behavior but
making it the default[C,D,E,F].  I was beginning to wonder if even you
had decided behavior B didn't matter anymore between your support of
Shaoxuan's change at [B] and your diffstat comments at [G].  But now
it sounds like you're not only against behavior A by default but even
implementing it at all...even though I don't see how that squares with
your previous comments on grep and diffstat.

Is it just a matter of presentation?  Is it specific subcommands you
don't want changed?  Or am I either missing or misunderstanding
something?


Anyway...I will note that without a configurable option to give these
commands a behavior of `--restrict`, I think you make working in
disconnected partial clones practically impossible.  I want to be able
to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
disconnected partial clones, and I've wanted that kind of capability
for well over a decade[H].  So, don't be surprised if I keep bringing
up a config option of some sort for these commands.  :-)

[A] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
[B] https://lore.kernel.org/git/e719d1e1-1849-07bc-ea08-2729985e5048@github.com/,
and the others in the thread
[C] https://lore.kernel.org/git/2fc889c9c264fc10d878f31bd89cc44e79982516.1599758167.git.matheus.bernardino@usp.br/
[D] paragraphs with "transitioning" in them from
https://lore.kernel.org/git/a89413b5-464b-2d54-5b8c-4502392afde8@github.com/
[E] https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
[F] https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
[G] https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
[H] https://lore.kernel.org/git/1283645647-1891-1-git-send-email-newren@gmail.com/


> > +=== Implementation Questions ===
> > +
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
>
> I like the simplicity of --[no-]restrict, and my only worry is that it
> doesn't immediately link to what it is restricting.

Yeah, Junio and Victoria brought up other flavors of this same
concern, and it's also the one thing I find suboptimal about this
name.

The problem is just that we need to add the flag in more places,
"sparse" is already taken in some of them with a different meaning,
and I'm not sure there is any other flag that does automatically link
to sparse-checkouts and/or self-describe without being excessively
wordy.

> Perhaps something like "scope" would describe the set of things we care
> about, but use a text mode:
>
>         --scope=sparse  (--restrict)
>         --scope=all     (--no-restrict)
>
> But I'm notoriously bad at naming things.

Yeah, me too.  Naming things is one of the two hard problems in
computer science, right?  (The others being cache invalidation, and
off-by-one errors.)

However, in this case, your suggestion sounds pretty decent to me.
I'll add it to the list for us to consider.

> > +  * Should --[no-]restrict be a git global option, or added as options to each
> > +    relevant command?  (Does that make sense given the multitude of different
> > +    default behaviors we have for different options?)
>
> If we can make it a global option, that would be great, then update
> the commands to behave under that mode as we go.
>
> If that doesn't work, then adding the consistent option across commands
> would be helpful. It might be good to make an OPT_RESTRICT macro (much
> like OPT__VERBOSE, OPT__QUIET, and similar macros).

Ooh, I didn't know about OPT__VERBOSE and OPT__QUIET.  Thanks for the flag.

[...]
> > +  * clone: should we provide some mechanism for tying partial clones and
> > +    sparse checkouts together better.  Maybe an option
> > +     --sparse=dir1,dir2,...,dirN
> > +    which:
> > +       * Does initial fetch with `--filter=blob:none`
> > +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> > +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> > +      fault in the missing blobs within the sparse
> > +      specification...except that rev-list needs some kind of options
> > +      to also get files from leading directories too.
> > +       * Sets --restrict mode to allow focusing on the cone of interest
> > +      (and to permit disconnected development)
>
> As mentioned, I think we should have the option to backfill the blobs in
> the sparse-checkout definition, but 'git clone' should not do this by
> default. It's something that can be launched in the background, maybe, but
> not a blocking operation on being able to use the repository.
>
> 'scalar clone' is an excellent testing bed for these kinds of things,
> like setting the --restrict mode by default.

Earlier in this same email you were against even making an option to
request --restrict mode, but now you're suggesting to not only
implement it but make it the default in scalar?
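
For concreteness, the first three steps of the proposed option map
roughly onto commands that exist today.  The following is only a
sketch: the helper name `sparse_partial_clone_commands` and its
parameters are made up for illustration, and it merely builds the
command lines rather than running them.  It uses the existing boolean
`git clone --sparse` (which takes no directory list) in place of the
proposed `--sparse=dir1,...,dirN`, and it omits the "set --restrict
mode" step since no such setting exists yet.  The final rev-list is
the backfill step as written in the proposal, kept last so that a
caller could run it in the background instead of blocking on it.

```python
from typing import List, Sequence


def sparse_partial_clone_commands(
    url: str, dest: str, dirs: Sequence[str]
) -> List[List[str]]:
    """Build (but do not run) the git commands for a blobless sparse clone.

    Hypothetical helper for illustration only; it is not part of git.
    """
    dirs = list(dirs)
    return [
        # Initial fetch without blobs; the checkout starts with only the
        # top-level files present.
        ["git", "clone", "--filter=blob:none", "--sparse", url, dest],
        # Set the cone to the requested directories.
        ["git", "-C", dest, "sparse-checkout", "set", "--cone", *dirs],
        # Backfill step named in the proposal; as noted there, it would
        # still miss files in leading directories.
        ["git", "-C", dest, "rev-list", "--objects", "--all", "--", *dirs],
    ]
```

A caller could pass each returned list to `subprocess.run(cmd,
check=True)`, running the last one asynchronously if the backfill
should not block use of the repository.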

> Hopefully my responses aren't too far off-base. I'll go read the rest of
> the discussion now that I've contributed my thoughts on the doc.

Thanks for the detailed response!

I figured we'd have one or two places where all of us had some
disagreements on the big picture, but more and more I'm finding we
aren't even always thinking about the problems the same way (e.g. the 3+
different solutions to the `am` issues).  All the more reason that a
document like this is important for us to discuss these details and
work out a plan.

Elijah Newren Sept. 28, 2022, 5:42 a.m. UTC | #14
On Tue, Sep 27, 2022 at 9:42 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 9/26/2022 4:08 PM, Victoria Dye wrote:
[...]
> > 1. This could be a "me" problem, but I regularly struggle with "sparse"
> >    having different meanings in similar contexts. For example, a "sparse
> >    directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
> >    of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
> >    A quick note/section outlining some standard terminology would be
> >    immensely helpful.
>
> This difference is absolutely my fault, and maybe we should consider
> fixing this problem by renaming sparse directories something else.

Hey now, don't reviewers also get some of the "credit"?  ;-)

> Perhaps "skipped directory" would be a better name?

Sounds reasonable to me.

Elijah Newren Sept. 28, 2022, 6:13 a.m. UTC | #15
On Tue, Sep 27, 2022 at 9:07 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Oh, wow, that's something completely outside what I had considered.  I
> > had viewed sparse-checkouts as splitting "tracked files" into two
> > subsets.  As such, `--[no-]restrict` could only affect selecting
> > whether the smaller or larger set of tracked files was of interest.
> > From that viewpoint, untracked files seemed orthogonal, and thus there
> > couldn't be such a thing as an "anomalous untracked file".
> >
> > But this idea is very interesting.  Hmm...
>
> We need to design the behaviour of "git add" sensibly.  Even if we say
> "untracked files are just one class and there are two classes of
> tracked ones, those paths of current interest and those that are
> uninteresting", we would need to say "'git add F' behaves this way
> if F would become 'tracked path of current interest' when added, but
> the command behaves this other way if F becomes 'tracked path that
> is not interesting right now'".  It may be cleaner to separate the
> untracked ones along the same line as the tracked ones.
>
> Which in turn would mean that the skip-worktree bit cannot be the
> source of truth.  Sparsity specification (either pattern matching or
> being in listed directories) authoritatively decides if a path is of
> the current interest or not.  This is simply because untracked ones
> cannot have that bit ;-)  We can treat the skip-worktree bit as mere
> implementation detail, a measure for optimization.

I like this idea.  Seems I should then move 'status' into the category
with add/rm/mv -- commands that need to be modified to treat untracked
files carefully.

Of course, this also may drag "git clean" into that category...though
I'm not sure how or if it'd differ.
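
To make the "sparsity specification is the source of truth" idea
concrete, here is a small sketch of classifying any path, tracked or
not, purely from the cone definition.  The function names are
hypothetical, and the cone-mode rule encoded here is my summary of it
(everything under a listed directory, plus files directly in the top
level or in a parent of a listed directory); no skip-worktree bit is
consulted at all.

```python
from pathlib import PurePosixPath
from typing import Iterable


def in_cone(path: str, cone_dirs: Iterable[str]) -> bool:
    """Return True if `path` is inside a cone-mode sparse specification."""
    p = PurePosixPath(path)
    parent = p.parent  # PurePosixPath(".") for top-level entries
    # Files directly in the top-level directory are always included.
    if parent == PurePosixPath("."):
        return True
    for d in cone_dirs:
        cone = PurePosixPath(d)
        # Anywhere below a listed directory, or directly inside one of
        # the leading (parent) directories of a listed directory.
        if cone in p.parents or parent in cone.parents:
            return True
    return False


def classify(path: str, tracked: bool, cone_dirs: Iterable[str]) -> str:
    """Describe a path along both axes: tracked-ness and current interest."""
    kind = "tracked" if tracked else "untracked"
    interest = "of interest" if in_cone(path, cone_dirs) else "not of interest"
    return f"{kind}, {interest}"
```

With a cone of `src/lib`, an untracked `docs/new.txt` comes out as
"untracked, not of interest", which is the case `git add` (and perhaps
`status` and `clean`) would need to handle deliberately.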


[...]
> > It feels like "git grep --cached" is perhaps the next thing along this
> > sequence, and I don't see a clear line where to draw that we should
> > limit things to the sparse specification for the index while treating
> > the other operations as full tree; it seems like something feels
> > broken or inconsistent in this sequence of commands if we attempt to
> > do so.
>
> OK, it seems that "--cached" has many cases that it wants to operate
> on full tree.  I am in general more in favor of making things work
> on full tree, simply because I feel it would have less chance of
> going wrong, so defaulting to --no-restrict would be fine ;-)

Yeah, I think for the camp B folks, "--no-restrict" may make more
sense for operations searching or comparing to the index.

However, there's also another possibility I'm still mulling over.  To
understand it, first note that relative to the working tree, the
"sparse specification" can temporarily differ from the "paths matching
the sparsity patterns", because additional files might be transiently
present.  This most often happens due to conflicts, and we want
worktree related operations that behave under "restrict" mode (such as
"diff" or "grep" or "switch") to operate on all present tracked
files[1].  With that understanding, we could similarly consider that
relative to the index, the "sparse specification" could temporarily
differ from the "paths matching the sparsity patterns", because
additional paths outside the sparsity patterns could have been
modified in the index (e.g. during a merge or rebase or whatever).

Using a temporarily expanded sparsity specification may allow a
"restrict-like" behavior to make sense for index-related operations.
I currently think that'd be more useful for the camp A folks than the
camp B folks, though.
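
As a rough illustration of the "temporarily expanded" idea, the two
specifications can be modelled as set unions.  This is a toy sketch
under my own assumptions: the function names are invented, paths are
plain strings, and "modified in the index" is taken to mean "differs
from HEAD", which the discussion above does not pin down.

```python
from typing import Iterable, Set


def worktree_sparse_spec(
    matching_patterns: Iterable[str], present_tracked: Iterable[str]
) -> Set[str]:
    """Paths matching the sparsity patterns plus any tracked files that
    are transiently present in the working tree (e.g. due to conflicts)."""
    return set(matching_patterns) | set(present_tracked)


def index_sparse_spec(
    matching_patterns: Iterable[str], index_modified: Iterable[str]
) -> Set[str]:
    """Paths matching the sparsity patterns plus any paths outside them
    that have been modified in the index (e.g. by a merge or rebase)."""
    return set(matching_patterns) | set(index_modified)
```

Note that the two results can differ from each other as well as from
the pattern-matching set, which is why one cannot simply be reused for
the other.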

Either way, I don't think the index should use the sparsity defined by
or for the working tree.  The idea of using the working tree sparsity
for index-related operations may sound nice at first, but I think it
only behaves well when all paths modified in the index or working tree
are limited to those paths matching the sparsity patterns.  And
there's too many normal cases where that just doesn't hold.

[1] See also 82386b4496 ("Merge branch 'en/present-despite-skipped'",
2022-03-09)

Elijah Newren Sept. 28, 2022, 7:49 a.m. UTC | #16
On Tue, Sep 27, 2022 at 8:44 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
>
> Everybody in this thread is interested in sparse checkout, which
> unfortunately blinds them to the fact that "restrict to", "limit
> to", "focus on", etc. need not be limited to the sparse checkout
> feature.  We must have something that hints that the option is about
> the sparse checkout feature.
>
> As to the verbs, I do not mind "restrict to".  Other good ones I do
> not mind choosing are "limit to" and "focus on".  They would equally
> convey the same thing in this context.  And the object for these
> verb phrases are the area of interest, those paths without the
> skip-worktree bit, the paths outside the sparse cone(s).
>
> Or we could go the other way.  We are excluding those paths with the
> skip-worktree bit, so "exclude" and "ignore" are natural candidates.

If you're thinking about plain "exclude", that's already a flag in
'apply', 'am', 'clean', and 'ls-files'.

Also, if you want these words alone, then they also seem to lack hints
that the option is about the sparse checkout feature.  Expand them a
bit, perhaps?  "--ignore-sparsity"?
"--exclude-sparse-checkout-restrictions"?

Assuming we are worried about needing "--no-" variants, wouldn't the
risk of a "--no-ignore-sparsity" be worse than a "--no-restrict" in
terms of awkwardness, given the double negative?

> These two classes are good if the "restrict" behaviour will never be
> the default.  When it is the default, the option often used will
> become "--no-restrict", which is awkward.
>
>         Personally I am slightly in favor of "focus on" (i.e.
>         "--focus" vs "--unfocus") as that meshes well with the
>         concept of "the areas of the working tree paths that I am
>         interested in right now", which may already hint that the
>         option is about the sparse checkout feature (i.e. "I am
>         focusing on these areas right now") and can stay short.  But
>         this is just one person's opinion.

I'll add --focus/--unfocus to the list.  --unfocus seems a bit more
awkward to me than --no-restrict, but that might just be me.  If
others really liked it, I'd be fine with it.

Right now, I'm leaning a bit more towards Stolee's
--scope={sparse,all} (or maybe --scope={sparse,dense}?)

Derrick Stolee Sept. 28, 2022, 1:22 p.m. UTC | #17
On 9/28/22 1:38 AM, Elijah Newren wrote:
> On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
>>
>> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
>>> From: Elijah Newren <newren@gmail.com>
>>
>>> +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
>>> +
>>> +These folks might know there are other things in the repository, but
>>> +don't care.  They are uninterested in other parts of the repository, and
>>> +only want to know about changes within their area of interest.  Showing
>>> +them other results from history (e.g. from diff/log/grep/etc.) is a
>>> +usability annoyance, potentially a huge one since other changes in
>>> +history may dwarf the changes they are interested in.
>>
>> This idea of restricting the commit history to the sparse-checkout
>> definition (by default, with an escape hatch) seems like the most
>> radical of the things we've considered. I think it's interesting to
>> consider, but it might be better to think about things like diffstats,
>> grepping, and otherwise preventing out-of-cone adjustments by default.
>>
>> That said, the idea of restricting history is also the simplest to
>> describe as a user-visible change.
> 
> By "restricting commit history", are you thinking in terms of "git log
> -- PATHS" or more like some kind of special --filter to git-clone?
> 
> I get the feeling you might be thinking about the latter, whereas I
> was assuming users had all commits (and all trees), but log/diff would
> restrict output based on relevant paths.

I'm most skeptical of the "git log -- <sparse-checkout-paths>"
restriction showing a simplified history graph. I get enough
complaints about "missing commits" from simplified file history
as it is. Adding this simplified history scoped to the sparse-
checkout is more likely to add confusion than help users, in my
opinion.

>>> +People might also end up wanting behavior B due to complex inter-project
>>> +dependencies.  The initial attempts to use sparse-checkouts usually
>>> +involve the directories you are directly interested in plus what those
>>> +directories depend upon within your repository.  But there's a monkey
>>> +wrench here: if you have integration tests, they invert the hierarchy:
>>> +to run integration tests, you need not only what you are interested in
>>> +and its dependencies, you also need everything that depends upon what
>>> +you are interested in or that depends upon one of your
>>> +dependencies...AND you need all the dependencies of that expanded group.
>>> +That can easily change your sparse-checkout into a nearly dense one.
>>
>> In my experience, the downstream dependencies are checked via builds in
>> the cloud, though that doesn't help if they are source dependencies and
>> you make a breaking change to an API interface. This kind of problem is
>> absolutely one of system architecture and I don't know what Git can do
>> other than to acknowledge it and recommend good patterns.
> 
> I was talking about (source) dependencies between
> modules/projects/whatever-you-want-to-call-the-subcomponents of your
> repository.  We have hundreds of modules, with various cross-module
> dependencies that evolve over time.
> 
> I get the feeling from your description that your intra-repository
> dependencies between modules/projects/whatever are much more static
> for you than what we deal with.  (Which is a good thing; it'd be nice
> if ours were more static.)

The internal monorepo I know the most about has a very strict project
system that has less granularity than other build systems, so the
projects themselves don't change dependencies very frequently (but
they have lots of internal build adjustments that they can make
without updating the sparse-checkout). This is probably atypical,
especially from what I've heard from companies working with a build
system like Bazel.

>> In a properly-organized project, 95% of engineers in the project can have
>> a small sparse-checkout, then 5% work on the common core that has these
>> downstream dependencies and require a large sparse-checkout definition.
> 
> "In a properly-organized project"?  I'm unsure if this is an
> indictment of some of the repositories I deal with in reality (and to
> be fair, it might be a totally fair indictment), or if your statement
> is starting to cross into "No true scotsman" territory.  ;-)

I should probably say things like "If system architects want to
optimize for Git performance for the majority of their engineers, then
this kind of dependency organization is desirable." Building projects
in a vacuum, ignoring Git entirely, there is still a benefit to
minimizing local build costs for individual engineers. I think that
most of the time those improvements to the build system will also
result in more efficient sparse-checkout definitions for engineers
working on a small set of components.

> I would probably lean towards the former (we know it's more messy than
> it should be), but I'm a bit puzzled that you'd just brush aside my
> mention of integration tests.  We have people who want to run
> integration tests locally, even when only modifying a small area of
> the codebase.  These users are not doing cross-tree work, rather they
> are doing cross-tree testing in conjunction with their work.

I include "this component is used tree-wide" as tree-wide work, even
if it doesn't mean they are modifying code across the entire tree.
I will still assert that the vast majority of engineers in a large
repository should not be doing work that has tree-wide implications
such as this.

I would still argue that the most efficient way for these engineers
to work would be to modify their component directly locally, relying
on project-specific tests that check their API boundary for expectations,
then rely on a distributed build system to verify their changes across
the tree. They can then pull in the component(s) that have failing tests
in order to re-run tests locally and verify the correct fix.
 
>> There's nothing Git can do to help those engineers that do cross-tree
>> work.
> 
> I'm going to partially disagree with this, in part because of our
> experience with many inter-module dependencies that evolve over time.
> Folks can start on a certain module and begin refactoring.  Being
> aware that their changes will affect other areas of the code, they can
> do a search (e.g. "git grep --cached ..." to find cases outside their
> current sparse checkout), and then selectively unsparsify to get the
> relevant few dozen (or maybe even few hundred) modules added.  They
> aren't switching to a dense checkout, just a less sparse one.  When
> they are done, they may narrow their sparse specification again.  We
> have a number of users doing cross-tree work who are using
> sparse-checkouts, and who find it productive and say it still speeds
> up their local build/test cycles.

This matches my expectation of how to engage selectively with
dependent components, where we expand the sparse-checkout selectively.
My only difference is that, unless there is a breaking change to the
API boundary, this expansion happens reactively, not proactively.
(Expand to another project if it has failing tests due to changes to
the local components.)
 
> So, I'd say that ensuring Git supports behavior B well in
> sparse-checkouts, is something Git can do to help out both some of the
> engineers doing cross-tree work, and some of the engineers that are
> doing cross-tree testing.
> 
> (For full disclosure, we also have users doing cross-tree work using
> regular dense checkouts and I agree there's not a lot we can do to
> help them.)

Perhaps there are two different categories going on here:

 1. The engineer is building a component consumed by many others
    across the tree, but all edits are within that component.

 2. The engineer is editing code across many components across the
    tree.

>>> +  * Commands defaulting to --restrict-unless-conflicts
>>> +    * merge
>>> +    * rebase
>>> +    * cherry-pick
>>> +    * revert
>>
>> In my mind, --restrict-unless-conflicts doesn't provide any value unless
>> you want the --restrict mode to create an _error_ when trying to do
>> something outside of the sparse-checkout cone.
> 
> Are you assuming here I was suggesting command line flags?  If so, I
> apologize for my poor wording/descriptions.

Yes, I think that was my misunderstanding.

>> The only thing I can think about is that the diffstat might want to show
>> the stats for the conflicted files, in which case that's an important
>> perspective on the distinction from --restrict.
> 
> We only show the diffstat on a successful merge, so there's no
> diffstat to show if there are any conflicted files.

Thanks! TIL.

>>> +    * add
>>> +    * rm
>>> +    * mv
>>> +
>>> +    The defaults here perhaps make sense since they are nearly --restrict, but
>>> +    actually using --restrict could cause user confusion if users specify a
>>> +    specific filename, so they warn by default.  That logic may sound like
>>> +    --no-restrict should be the default, but that's prone to even bigger confusion:
>>> +      * `git add <somefile>` if honored and outside the sparse cone, can result in
>>> +     the file randomly disappearing later when some subsequent command is run
>>> +     (since various commands automatically clean up unmodified files outside
>>> +     the sparsity specification).
>>> +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
>>> +     outside the range of the user's interest.  Much better to operate on the
>>> +     sparsity specification and give the user warnings if other files could have
>>> +     matched.
>>
>> The cost of checking for other files that might match is sometimes so large
>> (needing to expand the sparse index or walk trees to find those path names) that
>> I would not recommend warning that we _didn't_ do something. Perhaps an advice
>> that says "we did not look outside the sparse-checkout definition for matching
>> paths" when the pathspec is not an exact path or a prefix match.
> 
> Ah, good point, and a good idea to keep in mind.
> 
> However, I think advise_on_updating_sparse_paths() currently does what
> you're warning against.  Do you think there's a good chance this is
> the cause of the performance bug reported over at
> https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com
> ?

Perhaps. You're right that it is warning about all of the paths that
match. That method was created before the sparse index was established,
so 'git add' was already checking all of the paths in the index, so
adding the warning made sense as something not too difficult to do after
checking each of those paths.

In the sparse index world, that check is much more expensive to do,
hence the work to add modes that focus the action only to the
paths in the sparse-checkout. In that world, we _may_ want to recognize
that the user ran 'git rm *.png' and we want to provide advice that
we didn't look for '*.png' files outside of the sparse-checkout definition.

This makes less sense for 'git add *.png' because it already would not do
anything for files outside of the sparse-checkout definition. 
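
A minimal sketch of that advice rule, to show that it needs no tree
walk or index expansion: the decision depends only on the text of the
pathspec.  Everything here is illustrative; the function names are
hypothetical, and treating `*`, `?` and `[` as the characters that make
a pathspec a glob is an assumption of the sketch.

```python
from typing import Iterable, List

# Characters assumed to turn a pathspec into a glob rather than an
# exact path or prefix.
GLOB_CHARS = frozenset("*?[")

ADVICE = (
    "we did not look outside the sparse-checkout definition "
    "for matching paths"
)


def is_literal_pathspec(pathspec: str) -> bool:
    """True if the pathspec can only be an exact path or a prefix match."""
    return not any(ch in GLOB_CHARS for ch in pathspec)


def sparse_advice(pathspecs: Iterable[str]) -> List[str]:
    """Return one hint per glob pathspec, without inspecting any paths."""
    return [
        f"hint: {ADVICE} ('{p}')"
        for p in pathspecs
        if not is_literal_pathspec(p)
    ]
```

For example, `sparse_advice(["docs/", "*.png"])` yields a single hint,
for `*.png` only.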

>>> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
>>> +    on Behavior A or Behavior B
>>> +    * diff (with --cached or REVISION arguments)
>>> +    * grep (with --cached or REVISION arguments)
>>> +    * show (when given commit arguments)
>>> +    * bisect
>>> +    * blame
>>> +      * and annotate
>>> +    * log
>>> +      * and variants: shortlog, gitk, show-branch, whatchanged
>>> +
>>> +    For now, we default to behavior B for these, which want a default of
>>> +    --no-restrict.
>>
>> I do feel pretty strongly that we'll want a --no-restrict default here
>> because otherwise we will present confusion. I'm not even sure if we would
>> want to make this available via a config setting, but likely a config
>> setting makes sense in the long term.
> 
> You've got me slightly confused.  You did say the same thing a long time ago:
> 
>     "But I also want to avoid doing this as a default or even behind a
> config setting."[A]
> 
> BUT, when Shaoxuan proposed making --restrict/--focus the default for
> one of these commands, you seemed to be on board[B].

I'm specifically talking about 'git log'. I think that having that be
in a restricted mode is extremely dangerous and will only confuse users.
This includes 'git show' (with commit arguments) and 'git bisect', I
think.

The rest (diff, grep, blame) are worktree-focused, so having a restrict
mode by default makes sense to me.

> Personally, I thought that if anyone would object to some of these
> commands changing, that grep would be considered as among the riskier.
> For diff and log, printing a "Warning: restricting output to the
> sparse-checkout specification" would be pretty innocuous, but for grep
> that wouldn't be.

My main concern with 'git grep --cached' is its interaction with
partial clone. Perhaps a restrict mode for grep should be toggled with
partial clone and not sparse-checkout alone. But that makes it more
confusing to tell whether the restrictions are applied or not.

> I was a little unsure about making `--restrict/--focus` the default
> for these commands, both based on your previous concerns and because
> of thinking about some of my behavior B users.  But then, it seemed
> like everyone else was pushing for not only having this behavior but
> making it the default[C,D,E,F].  I was beginning to wonder if even you
> had decided behavior B didn't matter anymore between your support of
> Shaoxuan's change at [B] and your diffstat comments at [G].  But now
> it sounds like you're not only against behavior A by default but even
> implementing it at all...even though I don't see how that squares with
> your previous comments on grep and diffstat.
> 
> Is it just a matter of presentation?  Is it specific subcommands you
> don't want changed?  Or am I either missing or misunderstanding
> something?

I think the biggest point is that the implications of behavior A
saying "I don't care about any changes outside of my sparse-checkout"
leading to changed history are unappealing to me. After removing that
kind of feature from consideration, I don't see any difference
between the behaviors.

> Anyway...I will note that without a configurable option to give these
> commands a behavior of `--restrict`, I think you make working in
> disconnected partial clones practically impossible.  I want to be able
> to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
> disconnected partial clones, and I've wanted that kind of capability
> for well over a decade[H].  So, don't be surprised if I keep bringing
> up a config option of some sort for these commands.  :-)

Now, if we're talking about "don't download extra objects" as a goal,
then we're thinking about things not just related to sparse-checkout
but even history within the sparse-checkout. Even if we make the
'backfill' command something that users could run, there isn't a
guarantee that users will want to have even that much data downloaded.
We would need a way to say "yes, I ran 'git blame' on this path in my
sparse-checkout, but please don't just fail if you can't get new objects,
instead inform me that the results are incomplete."

I think the sparse-checkout boundary is a good way to minimize the
number of objects downloaded by these commands, but to actually
remove the need for downloads at all we need a way to gracefully
return partial results.

>>> +  * clone: should we provide some mechanism for tying partial clones and
>>> +    sparse checkouts together better.  Maybe an option
>>> +     --sparse=dir1,dir2,...,dirN
>>> +    which:
>>> +       * Does initial fetch with `--filter=blob:none`
>>> +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
>>> +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
>>> +      fault in the missing blobs within the sparse
>>> +      specification...except that rev-list needs some kind of options
>>> +      to also get files from leading directories too.
>>> +       * Sets --restrict mode to allow focusing on the cone of interest
>>> +      (and to permit disconnected development)
>>
>> As mentioned, I think we should have the option to backfill the blobs in
>> the sparse-checkout definition, but 'git clone' should not do this by
>> default. It's something that can be launched in the background, maybe, but
>> not a blocking operation on being able to use the repository.
>>
>> 'scalar clone' is an excellent testing bed for these kinds of things,
>> like setting the --restrict mode by default.
> 
> Earlier in this same email you were against even making an option to
> request --restrict mode, but now you're suggesting to not only
> implement it but make it the default in scalar?

As I hope I've clarified earlier, there are some commands where I think
a --restrict mode is inadvisable, and turning it on by default is
dangerous. If we can configure the worktree commands to be restricted
by default and _not_ the history simplifying ones, then that's what I
would want enabled in Scalar.

> I figured we'd have one or two places where all of us had some
> disagreements on the big picture, but more and more I'm finding we
> aren't even always thinking about the problems the same (e.g. the 3+
> different solutions to the `am` issues).  All the more reason that a
> document like this is important for us to discuss these details and
> work out a plan.

With such a massive doc and an ambitious plan, we are bound to have
misunderstandings and seem to self-contradict here and there. This
discussion is helping to drive clarity, and I appreciate all of your
work to drive towards mutual understanding.

Thanks,
-Stolee
ZheNing Hu Sept. 30, 2022, 9:09 a.m. UTC | #18
Derrick Stolee <derrickstolee@github.com> wrote on Wed, Sep 28, 2022 at 00:36:
>
> > +Some of these users also arrive at this usecase from wanting to use
> > +partial clones together with sparse checkouts and do disconnected
> > +development.  Not only do these users generally not care about other
> > +parts of the repository, but consider it a blocker for Git commands to
> > +try to operate on those.  If commands attempt to access paths in history
> > +outside the sparsity specification, then the partial clone will attempt
> > +to download additional blobs on demand, fail, and then fail the user's
> > +command.  (This may be unavoidable in some cases, e.g. when `git merge`
> > +has non-trivial changes to reconcile outside the sparsity path, but we
> > +should limit how often users are forced to connect to the network.)
>
> This idea pairs well with a feature I've been meaning to build:
> 'git sparse-checkout backfill' would download all historical blobs
> within the sparse-checkout definition. This is possible with rev-list,
> but I want to investigate grouping blobs by path and making requests in
> batches, hopefully allowing better deltification and ability to recover
> from network disconnections. That makes this idea of "staying within
> your sparse-checkout means no missing object downloads" even more likely.
>

I think this is very useful: if I use sparse-checkout + partial-clone,
plugins like git blame in vscode (or other IDEs) will either stop working
or require a lot of network overhead to download the missing blobs, so
this 'git sparse-checkout backfill' looks like a promising solution to
that problem.

> > +People might also end up wanting behavior B due to complex inter-project
> > +dependencies.  The initial attempts to use sparse-checkouts usually
> > +involve the directories you are directly interested in plus what those
> > +directories depend upon within your repository.  But there's a monkey
> > +wrench here: if you have integration tests, they invert the hierarchy:
> > +to run integration tests, you need not only what you are interested in
> > +and its dependencies, you also need everything that depends upon what
> > +you are interested in or that depends upon one of your
> > +dependencies...AND you need all the dependencies of that expanded group.
> > +That can easily change your sparse-checkout into a nearly dense one.
>
> In my experience, the downstream dependencies are checked via builds in
> the cloud, though that doesn't help if they are source dependencies and
> you make a breaking change to an API interface. This kind of problem is
> absolutely one of system architecture and I don't know what Git can do
> other than to acknowledge it and recommend good patterns.
>
> In a properly-organized project, 95% of engineers in the project can have
> a small sparse-checkout, then 5% work on the common core that has these
> downstream dependencies and require a large sparse-checkout definition.
> There's nothing Git can do to help those engineers that do cross-tree
> work.
>

This feels like it's because your project code is stable enough, but at other
companies I think many of the project dependencies are subject to frequent
changes.
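
To make the inversion described in the quoted text concrete, here is a
minimal Python sketch. The module names are made up, and reading
"depends upon" transitively is my assumption, not something the document
states:

```python
from collections import defaultdict


def integration_test_closure(deps, interest):
    """Return the modules needed to run integration tests for `interest`.

    `deps` maps each module to the set of modules it depends on.
    """
    # Reverse graph: module -> modules that depend on it.
    rdeps = defaultdict(set)
    for mod, needed in deps.items():
        for dep in needed:
            rdeps[dep].add(mod)

    def reachable(start, graph):
        # Plain graph traversal; returns `start` plus everything reachable.
        seen, stack = set(start), list(start)
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # 1. What you are interested in, plus its dependencies.
    base = reachable(interest, deps)
    # 2. Everything that depends on any of those.
    expanded = reachable(base, rdeps)
    # 3. All dependencies of that expanded group.
    return reachable(expanded, deps)


# Hypothetical repository layout, for illustration only.
deps = {
    "app": {"lib"},
    "lib": {"core"},
    "tool": {"core", "extra"},
    "core": set(),
    "extra": set(),
    "docs": set(),
}
needed = integration_test_closure(deps, {"app"})
```

With this toy graph, being interested only in "app" already pulls in five
of the six modules, which is the "nearly dense" effect the document warns
about.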

> > +      * `git mv` has similar surprises when moving into or out of the cone, so
> > +     best to restrict and throw warnings if restriction might affect the result.
> > +
> > +    There may be a difference in here between behavior A and behavior B.
> > +    For behavior A, we probably only want to warn if there were no
> > +    suitable matches for files in the sparsity specification, whereas
> > +    for behavior B, we may want to warn even if there are valid files to
> > +    operate on if the result would have been different under
> > +    `--no-restrict`.
>
> I think in behavior B, users who actually want to modify things tree-wide will
> actually increase their sparse-checkout definition to include those files so
> they can validate what they are doing.
>

Agree.

> > +=== Implementation Questions ===
> > +
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
>
> I like the simplicity of --[no-]restrict, and my only worry is that it
> doesn't immediately link to what it is restricting.
>
> Perhaps something like "scope" would describe the set of things we care
> about, but use a text mode:
>
>         --scope=sparse  (--restrict)
>         --scope=all     (--no-restrict)
>
> But I'm notoriously bad at naming things.
>
> > +  * Should --[no-]restrict be a git global option, or added as options to each
> > +    relevant command?  (Does that make sense given the multitude of different
> > +    default behaviors we have for different options?)
>
> If we can make it a global option, that would be great, then update
> the commands to behave under that mode as we go.
>
> If that doesn't work, then adding the consistent option across commands
> would be helpful. It might be good to make an OPT_RESTRICT macro (much
> like OPT__VERBOSE, OPT__QUIET, and similar macros).
>
> > +  * Should --sparse in ls-files be made an alias for --restrict?
> > +    `--restrict` is certainly a near synonym in cone-mode, but even then
> > +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> > +    option has no effect, and in cone-mode it still shows the sparse
> > +    directory entries which are technically outside the sparsity
> > +    specification.
>
> We should definitely replace the --sparse option(s) with whatever we
> choose here. For ls-files, we have the issue that we are reporting
> what is in the index, and in non-cone-mode the index cannot be sparse.
>
> Now, maybe we change what the ls-files mode does under --restrict and
> only have it report the paths within the sparse-checkout and not even
> show the results for sparse directory entries. The --no-restrict would
> then expand a sparse-index to show only paths again.
>

> > +    Namely, if folks are not already in a sparse checkout, then require
> > +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> > +    would set core.restrictToSparse according to the setting given), and
> > +    throw an error if the flag is not provided?  That error would be a
> > +    great place to warn folks that the default may change in the future,
> > +    and get them used to specifying what they want so that the eventual
> > +    default switch is seamless for them.
>
> I don't like using the same option name (--[no-]restrict) for something
> that sets a config option to keep that behavior permanently. Different
> names that make it clearer could be:
>
>         --enable-restrict-mode
>         --set-scope=(sparse|all)
>

The name sounds clear enough. I had an idea to add some configuration like:

scope.<cmd>.mode=sparse|all

and then let scalar help users set some default configs...
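
A rough Python sketch of how such a lookup might resolve, just to show
the precedence I have in mind. Nothing here exists in Git today: the
`scope.<cmd>.mode` key, the `resolve_scope` helper, and the built-in
default of "all" are all assumptions for illustration:

```python
VALID_MODES = {"sparse", "all"}


def resolve_scope(cmd, config, cli_scope=None, builtin_default="all"):
    """Pick the scope for `cmd`.

    Assumed precedence: command-line value, then the hypothetical
    scope.<cmd>.mode config key, then a built-in default.
    """
    if cli_scope is not None:
        mode = cli_scope
    else:
        mode = config.get(f"scope.{cmd}.mode", builtin_default)
    if mode not in VALID_MODES:
        raise ValueError(f"unknown scope mode for {cmd}: {mode!r}")
    return mode


# Hypothetical defaults that a tool like scalar could write for the user.
config = {"scope.grep.mode": "sparse", "scope.diff.mode": "sparse"}

grep_mode = resolve_scope("grep", config)                 # from config
log_mode = resolve_scope("log", config)                   # built-in default
override = resolve_scope("grep", config, cli_scope="all") # explicit flag wins
```

The point of a per-command key is that worktree-focused commands and
history commands could get different defaults without a new flag name
for each.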

> > +  * clone: should we provide some mechanism for tying partial clones and
> > +    sparse checkouts together better.  Maybe an option
> > +     --sparse=dir1,dir2,...,dirN
> > +    which:
> > +       * Does initial fetch with `--filter=blob:none`
> > +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> > +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> > +      fault in the missing blobs within the sparse
> > +      specification...except that rev-list needs some kind of options
> > +      to also get files from leading directories too.
> > +       * Sets --restrict mode to allow focusing on the cone of interest
> > +      (and to permit disconnected development)
>
> As mentioned, I think we should have the option to backfill the blobs in
> the sparse-checkout definition, but 'git clone' should not do this by
> default. It's something that can be launched in the background, maybe, but
> not a blocking operation on being able to use the repository.
>
> 'scalar clone' is an excellent testing bed for these kinds of things,
> like setting the --restrict mode by default.
>

This sounds interesting, and I would like to see scalar support it!
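
For reference, the first three steps of the quoted `--sparse=dir1,...`
proposal can be approximated with commands that exist today. The sketch
below only builds the argument lists and does not run anything. The
helper name, URL, and directory names are placeholders, and I use `git
clone --sparse` for the initial sparse setup, which is my choice rather
than part of the proposal. The fourth step (setting a restrict mode) has
no existing equivalent, so it is left out:

```python
def sparse_clone_commands(url, dest, dirs):
    """Return the git invocations approximating the quoted proposal."""
    if not dirs:
        raise ValueError("at least one directory is required")
    return [
        # Initial fetch without blobs, starting from a sparse worktree.
        ["git", "clone", "--filter=blob:none", "--sparse", url, dest],
        # Restrict the worktree to the cone of interest.
        ["git", "-C", dest, "sparse-checkout", "set", "--cone", *dirs],
        # Walk history limited to those paths; per the quoted text this is
        # meant to fault in the missing blobs (leading directories are not
        # covered, as the proposal itself notes).
        ["git", "-C", dest, "rev-list", "--objects", "--all", "--", *dirs],
    ]


cmds = sparse_clone_commands(
    "https://example.com/repo.git", "repo", ["dir1", "dir2"]
)
```

Each list could be handed to `subprocess.run(..., check=True)`; as noted
in the quoted reply, the last one is the expensive step and is a
candidate for running in the background.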

> Hopefully my responses aren't too far off-base. I'll go read the rest of
> the discussion now that I've contributed my thoughts on the doc.
>
> Thanks,
> -Stolee

Thanks,
--
ZheNing Hu
ZheNing Hu Sept. 30, 2022, 9:54 a.m. UTC | #19
I am not sure if these ideas are feasible.

Elijah Newren <newren@gmail.com> wrote on Wed, Sep 28, 2022 at 13:38:
>
> > > +People might also end up wanting behavior B due to complex inter-project
> > > +dependencies.  The initial attempts to use sparse-checkouts usually
> > > +involve the directories you are directly interested in plus what those
> > > +directories depend upon within your repository.  But there's a monkey
> > > +wrench here: if you have integration tests, they invert the hierarchy:
> > > +to run integration tests, you need not only what you are interested in
> > > +and its dependencies, you also need everything that depends upon what
> > > +you are interested in or that depends upon one of your
> > > +dependencies...AND you need all the dependencies of that expanded group.
> > > +That can easily change your sparse-checkout into a nearly dense one.
> >
> > In my experience, the downstream dependencies are checked via builds in
> > the cloud, though that doesn't help if they are source dependencies and
> > you make a breaking change to an API interface. This kind of problem is
> > absolutely one of system architecture and I don't know what Git can do
> > other than to acknowledge it and recommend good patterns.
>
> I was talking about (source) dependencies between
> modules/projects/whatever-you-want-to-call-the-subcomponents of your
> repository.  We have hundreds of modules, with various cross-module
> dependencies that evolve over time.
>
> I get the feeling from your description that your intra-repository
> dependencies between modules/projects/whatever are much more static
> for you than what we deal with.  (Which is a good thing; it'd be nice
> if ours were more static.)
>
> > In a properly-organized project, 95% of engineers in the project can have
> > a small sparse-checkout, then 5% work on the common core that has these
> > downstream dependencies and require a large sparse-checkout definition.
>
> "In a properly-organized project"?  I'm unsure if this is an
> indictment of some of the repositories I deal with in reality (and to
> be fair, it might be a totally fair indictment), or if your statement
> is starting to cross into "No true scotsman" territory.  ;-)
>
> I would probably lean towards the former (we know it's more messy than
> it should be), but I'm a bit puzzled that you'd just brush aside my
> mention of integration tests.  We have people who want to run
> integration tests locally, even when only modifying a small area of
> the codebase.  These users are not doing cross-tree work, rather they
> are doing cross-tree testing in conjunction with their work.  Running
> such tests requires a build of the modules across the repository,
> which naively would push folks into a dense checkout...and really long
> local builds.  We want fast local builds, and sparse-checkouts help us
> achieve that...but it does mean we have to be clever about how we
> build in order to let these users run integration tests.  (And we have
> to make it easy for users to discover the relevant integration tests,
> and sometimes associated code components that depend on what they are
> changing, which is where behavior B comes in).
>
> > There's nothing Git can do to help those engineers that do cross-tree
> > work.
>
> I'm going to partially disagree with this, in part because of our
> experience with many inter-module dependencies that evolve over time.
> Folks can start on a certain module and begin refactoring.  Being
> aware that their changes will affect other areas of the code, they can
> do a search (e.g. "git grep --cached ..." to find cases outside their
> current sparse checkout), and then selectively unsparsify to get the
> relevant few dozen (or maybe even few hundred) modules added.  They
> aren't switching to a dense checkout, just a less sparse one.  When
> they are done, they may narrow their sparse specification again.  We
> have a number of users doing cross-tree work who are using
> sparse-checkouts, and who find it productive and say it still speeds
> up their local build/test cycles.
>
> So, I'd say that ensuring Git supports behavior B well in
> sparse-checkouts, is something Git can do to help out both some of the
> engineers doing cross-tree work, and some of the engineers that are
> doing cross-tree testing.
>
> (For full disclosure, we also have users doing cross-tree work using
> regular dense checkouts and I agree there's not a lot we can do to
> help them.)
>

Let me guess where the cross tree users using sparse-checkout are
getting their revenue from:

1. they don't have to download the entire repository of blobs at once
2. their working tree can be easily resized.
3. they could have something like sparse-index to optimize the performance
of git commands.

But it's still worth worrying about the size of the git repository blobs:
even just the blobs in the monorepo's HEAD may be too big for the user's
local machine to handle.

Perhaps it would make more sense to place this integration testing work on
a remote server.

I am not sure if these ideas are feasible:

1. mount the large git repo on the server to local.
2. just ssh to a remote server to run integration tests.
3. use an external tool to run integration tests on the remote server.

>
> Anyway, we do not want the behavior of `--restrict` for these
> commands.  That would imply not providing conflicts to users for them
> to resolve unless they are contained within the sparse specification,
> which would clearly be broken.  We instead chose to write out files
> with conflicts regardless of whether they are outside the sparse
> specification.  This modified behavior I gave the name of
> `--restrict-unless-conflict`, but we don't need or want an actual
> command line flag for that.  I think the behavior should just remain
> hardcoded into these commands.
>
> (Note: these commands are among those that make me think
> --[no-]restrict or --[un]focus or whatever might not make sense as a
> git global option: `--restrict-unless-conflict` behavior is the
> default for these and in fact the only sensible option, I think.  If
> there's only one sensible option, no actual flag names are needed.)
>
> > The only thing I can think about is that the diffstat might want to show
> > the stats for the conflicted files, in which case that's an important
> > perspective on the distinction from --restrict.
>
> We only show the diffstat on a successful merge, so there's no
> diffstat to show if there are any conflicted files.
>

Sorry, I have some questions here: how does git merge know there are
no conflicts without downloading the blobs?

> > Perhaps something like "scope" would describe the set of things we care
> > about, but use a text mode:
> >
> >         --scope=sparse  (--restrict)
> >         --scope=all     (--no-restrict)
> >
> > But I'm notoriously bad at naming things.
>
> Yeah, me too.  Naming things is one of the two hard problems in
> computer science, right?  (The others being cache invalidation, and
> off-by-one errors.)
>
> However, in this case, your suggestion sounds pretty decent to me.
> I'll add it to the list for us to consider.
>

Agree.

Thanks,
--
ZheNing Hu
Elijah Newren Oct. 6, 2022, 7:10 a.m. UTC | #20
On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 9/28/22 1:38 AM, Elijah Newren wrote:
> > On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
> >>
> >> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
> >>> From: Elijah Newren <newren@gmail.com>
> >>
[...]
> >>> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> >>> +    on Behavior A or Behavior B
> >>> +    * diff (with --cached or REVISION arguments)
> >>> +    * grep (with --cached or REVISION arguments)
> >>> +    * show (when given commit arguments)
> >>> +    * bisect
> >>> +    * blame
> >>> +      * and annotate
> >>> +    * log
> >>> +      * and variants: shortlog, gitk, show-branch, whatchanged
> >>> +
> >>> +    For now, we default to behavior B for these, which want a default of
> >>> +    --no-restrict.
> >>
> >> I do feel pretty strongly that we'll want a --no-restrict default here
> >> because otherwise we will present confusion. I'm not even sure if we would
> >> want to make this available via a config setting, but likely a config
> >> setting makes sense in the long term.
> >
> > You've got me slightly confused.  You did say the same thing a long time ago:
> >
> >     "But I also want to avoid doing this as a default or even behind a
> > config setting."[A]
> >
> > BUT, when Shaoxuan proposed making --restrict/--focus the default for
> > one of these commands, you seemed to be on board[B].
>
> I'm specifically talking about 'git log'. I think that having that be
> in a restricted mode is extremely dangerous and will only confuse users.
> This includes 'git show' (with commit arguments) and 'git bisect', I
> think.

Thanks, that helps me understand your position better.

I'm curious if, due to the length of the document and this thread,
you're just skimming past the idea I mentioned of showing a warning at
the beginning of `diff`, `log`, or `show` output when restricting
based on config or defaults.  Without such a warning, I agree that
restricting might be confusing at times, but I think such a warning
may be sufficient to address the concerns around partial/incomplete
results.  The one command that this warning idea doesn't help with is
`grep` since it cannot safely be applied there, which potentially
leaves `grep` giving confusing results when users pass either
`--cached` or revisions, but you seem to not be concerned about that.

I'm also curious if the problem partially stems from the fact that
with `git log` there is no way to control revision limiting and diff
generation paths independently.  If there was a way to make `git log
-p` continue showing the regular list of commits but restrict which
paths were shown in the diffs, and we made the --scope-sparse handling
do this so that only diffs were limited but not the revisions
traversed/printed, would that help address your concerns?
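
To illustrate the distinction, here is a small Python model. It is only a
sketch: commits are reduced to (id, changed paths) pairs, the cone check
is a simplified version of cone-mode matching, and real history
simplification in Git is more involved than the filter shown here:

```python
def in_cone(path, cone_dirs):
    """Simplified cone-mode check: everything under a cone directory,
    plus files directly inside the root or a leading directory."""
    parent = path.rpartition("/")[0]
    if parent == "":
        return True  # top-level files are always included
    for d in cone_dirs:
        if path.startswith(d + "/") or d.startswith(parent + "/"):
            return True
    return False


def log_restricting_diffs_only(commits, cone_dirs):
    # Every commit is still listed; only the displayed paths shrink.
    return [(cid, [p for p in paths if in_cone(p, cone_dirs)])
            for cid, paths in commits]


def log_restricting_revisions(commits, cone_dirs):
    # Commits that touch nothing in the cone disappear entirely.
    return [(cid, kept)
            for cid, kept in log_restricting_diffs_only(commits, cone_dirs)
            if kept]


# Made-up history, newest first.
commits = [
    ("c3", ["src/app/main.c", "docs/guide.txt"]),
    ("c2", ["docs/guide.txt"]),
    ("c1", ["README", "src/lib/util.c"]),
]
diffs_only = log_restricting_diffs_only(commits, ["src/app"])
revs_too = log_restricting_revisions(commits, ["src/app"])
```

In the first variant "c2" still appears, just with an empty diff; in the
second it vanishes from the output, which is the "changed history" effect
being objected to.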

> The rest, (diff, grep, blame) are worktree-focused, so having a restrict
> mode by default makes sense to me.

I was specifically calling out diff & grep when passed revision
arguments, which are definitely *not* worktree-focused operations.

Also, blame incorporates a component of changes from the worktree, but
it's mostly about history (and one or more -C's make it check other
paths as well).

[...]
> I think the biggest point is that the implications of behavior A
> saying "I don't care about any changes outside of my sparse-checkout"
> leading to changed history are unappealing to me. After removing that
> kind of feature from consideration, I don't see any difference
> between the behaviors.

Indeed, the differences between the behaviors are (mostly?) about
history queries, be it `git grep --cached`, `git grep REV`, `git diff
REV1 REV2`, `git log -p`, etc.

And I understand it's unappealing to you, but I haven't seen an
alternative solution to disconnected development in partial clones.
Nor have I seen an alternate plan for users who want to really focus
on their small subset of the repository.

So, maybe you don't want to use a configuration knob and always want a
certain default, but I very much want a knob.

> > Anyway...I will note that without a configurable option to give these
> > commands a behavior of `--restrict`, I think you make working in
> > disconnected partial clones practically impossible.  I want to be able
> > to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
> > disconnected partial clones, and I've wanted that kind of capability
> > for well over a decade[H].  So, don't be surprised if I keep bringing
> > up a config option of some sort for these commands.  :-)
>
> Now, if we're talking about "don't download extra objects" as a goal,
> then we're thinking about things not just related to sparse-checkout
> but even history within the sparse-checkout. Even if we make the
> 'backfill' command something that users could run, there isn't a
> guarantee that users will want to have even that much data downloaded.
> We would need a way to say "yes, I ran 'git blame' on this path in my
> sparse-checkout, but please don't just fail if you can't get new objects,
> instead inform me that the results are incomplete."
>
> I think the sparse-checkout boundary is a good way to minimize the
> number of objects downloaded by these commands, but to actually
> remove the need for downloads at all we need a way to gracefully
> return partial results.

There may be some merits to a partial clone with shallow blob history,
but I've never really been all that interested in it.  I know that
partial clones only really implement that kind of feature, but I've
always wanted a full-depth sparse clone instead.  I tried to create
that alternate reality[H], but didn't get the time to push it very
far, and in the meantime others came along and implemented both
shallow clones and partial clones.  I still want my thing, but at this
point rather than introduce a new kind of clone, it makes more sense
for me to reuse the existing partial clone framework and extend it --
especially since it more gracefully handles cases where additional
data outside user-specified sparsity is needed (such as for merges).

[H] https://lore.kernel.org/git/1283645647-1891-1-git-send-email-newren@gmail.com/

But you've got me curious.  You seem to be suggesting that partial
results are okay if the user is informed.  I have suggested making
diff-with-revisions, log -p, etc. show a warning that results may be
incomplete when restricting them to the sparse checkout based on
config.  So, aren't you suggesting that my proposal is safe after all?

Anyway, if someone wants to implement something like you suggest here,
while I might not use it, it sounds reasonable to me.  It'd probably
fit in as yet another config setting.  Then, for history queries, our
config would select the default between --scope=all (for behavior B
folks), --scope=sparse (for the behavior A folks) and
--scope=sparse-and-already-downloaded (the behavior you suggest above,
though it probably needs a better name).  Also, it sounds to me like
implementing --scope=sparse would be a step along the path to
implementing what you are suggesting here, if I'm understanding you
correctly.  (Also, this idea makes me like your --scope= naming even
more, because it's awkward to add a third option to
--restrict/--no-restrict.)
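
As a rough sketch of how the three values would differ for a history
query, consider the Python model below. The function, its arguments, and
the plain prefix match standing in for the sparsity check are all
assumptions for illustration, not existing Git behavior:

```python
def select_paths(paths, scope, sparse_dirs, local_paths):
    """Return (shown, skipped) for a history query under `scope`.

    `local_paths` stands in for "blobs already present locally".
    """
    def in_sparse(p):
        return any(p.startswith(d + "/") for d in sparse_dirs)

    if scope == "all":
        shown = list(paths)
    elif scope == "sparse":
        shown = [p for p in paths if in_sparse(p)]
    elif scope == "sparse-and-already-downloaded":
        shown = [p for p in paths if in_sparse(p) and p in local_paths]
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    # Anything skipped is what a warning about incomplete results
    # would need to mention.
    skipped = [p for p in paths if p not in shown]
    return shown, skipped


paths = ["src/app/a.c", "src/app/b.c", "docs/guide.txt"]
local = {"src/app/a.c"}

all_shown, _ = select_paths(paths, "all", ["src/app"], local)
sparse_shown, _ = select_paths(paths, "sparse", ["src/app"], local)
offline_shown, offline_skipped = select_paths(
    paths, "sparse-and-already-downloaded", ["src/app"], local
)
```

Only the third value can be answered with no network access at all; the
`skipped` list is what would feed a "results may be incomplete" warning.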

> > I figured we'd have one or two places where all of us had some
> > disagreements on the big picture, but more and more I'm finding we
> > aren't even always thinking about the problems the same (e.g. the 3+
> > different solutions to the `am` issues).  All the more reason that a
> > document like this is important for us to discuss these details and
> > work out a plan.
>
> With such a massive doc and an ambitious plan, we are bound to have
> misunderstandings and seem to self-contradict here and there. This
> discussion is helping to drive clarity, and I appreciate all of your
> work to drive towards mutual understanding.

Thanks for taking the time to read through it and respond in detail!
Elijah Newren Oct. 6, 2022, 7:53 a.m. UTC | #21
On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> I am not sure if these ideas are feasible.
>
> Elijah Newren <newren@gmail.com> wrote on Wed, Sep 28, 2022 at 13:38:
> >
[...]
> > > There's nothing Git can do to help those engineers that do cross-tree
> > > work.
> >
> > I'm going to partially disagree with this, in part because of our
> > experience with many inter-module dependencies that evolve over time.
> > Folks can start on a certain module and begin refactoring.  Being
> > aware that their changes will affect other areas of the code, they can
> > do a search (e.g. "git grep --cached ..." to find cases outside their
> > current sparse checkout), and then selectively unsparsify to get the
> > relevant few dozen (or maybe even few hundred) modules added.  They
> > aren't switching to a dense checkout, just a less sparse one.  When
> > they are done, they may narrow their sparse specification again.  We
> > have a number of users doing cross-tree work who are using
> > sparse-checkouts, and who find it productive and say it still speeds
> > up their local build/test cycles.
> >
> > So, I'd say that ensuring Git supports behavior B well in
> > sparse-checkouts, is something Git can do to help out both some of the
> > engineers doing cross-tree work, and some of the engineers that are
> > doing cross-tree testing.
> >
> > (For full disclosure, we also have users doing cross-tree work using
> > regular dense checkouts and I agree there's not a lot we can do to
> > help them.)
> >
>
> Let me guess where the cross tree users using sparse-checkout are
> getting their revenue from:

Is "revenue" perhaps a case of auto-correct choosing the wrong word?

> 1. they don't have to download the entire repository of blobs at once
> 2. their working tree can be easily resized.
> 3. they could have something like sparse-index to optimize the performance
> of git commands.

These correspond to partial clone, sparse-checkout, and sparse-index.
I think these 3 features and the various work done to support them,
plus submodule (which is a different kind of solution) are the
features Git provides to work with repository subsets.  Some
repositories (especially the big monorepos like the Microsoft ones)
will benefit from using all three of these features.  Others might
only want to use one or two of them.

As an example, the repository where we first applied sparse-checkouts
to (and which had the complicated dependencies) does not use partial
clones or a sparse-index.  While partial clone and sparse-index might
help a little, the .git directory for a full clone is merely 2G, and
there are fewer than 100K entries in the index.  However,
sparse-checkout helps out a lot.

> But it's still worth worrying about the size of the git repository blobs:
> even just the blobs in the monorepo's HEAD may be too big for the user's
> local machine to handle.
>
> Perhaps it would make more sense to place this integration testing work on
> a remote server.
>
> I am not sure if these ideas are feasible:
>
> 1. mount the large git repo on the server to local.
> 2. just ssh to a remote server to run integration tests.
> 3. use an external tool to run integration tests on the remote server.

Are you suggesting #1 as a way for just handling the git history, or
also for handling the worktree with some kind of virtual file system
where not all files are actually written locally?  If you're only
talking about the history, then you're kind of going on a tangent
unrelated to this document.  If you're talking about worktrees and
virtual file systems, then Git proper doesn't have anything of the
sort currently.  There are at least two solutions in this space --
Microsoft's Git-VFS (which I think they are phasing out) and Google's
similar virtual file system -- but I'm not currently particularly
interested in either one.

#3 is precisely what we did first (except "*a* remote server" rather
than "*the* remote server").  I think I called it out in the email
you're responding to; it's often good enough for many people.
However, sometimes those tests fail and people want to run locally so
it's easier to inspect.  Or they just want to be able to run locally
anyway.  So, while #3 helped, it wasn't good enough.

#2 is also something we did.  Using tools like Coder or GitHub
codespaces or other offerings in that area, you can provide developers
a nice beefy box with good network connectivity to the main Git
repository, on which they can do development and running of tests.
Then developers can connect to such machines from a variety of
different external locations.  Works great for some people...but build
times and ability of IDEs to handle the code base are still an issue,
so doing smarter things with sparse-checkouts is still important.
And, even if #2 works for some people, others still want to develop
and run integration tests on their (beefy) laptops.

All three of these, as far as I can tell, are just things that
individual teams set up and aren't anything that would affect Git's
development one way or another.


However, I'll note that while we internally definitely did two of the
three things you suggested here, it wasn't a complete enough solution
for us and sparse-checkout adoption was still pretty minimal at that
point.  So, we went back to our sparse-checkouts and asked how we
could modify the build system to allow us to not check out the in-tree
dependencies of the things we are tweaking, but still get a correct
build and allow us to run tests.  Once we got that working, we finally
really unlocked the value of sparse checkouts for us (both improving
things for developers on laptops, and for developers on the
development box in the cloud).  It went from very few folks using
sparse checkouts with that repository, to being the default and
recommended usage at that point.

While the build changes were internal things we did, I think that the
underlying usage scenario matters to Git development because it helps
inform how sparse-checkout can be used.  In particular, it suggests
why some sparse-checkout users may be interested in finding results
for files that do not match their sparse-checkout patterns -- in-tree
dependencies may not necessarily be checked out, but those are related
enough to the code that developers are working on, that developers are
still potentially interested in using e.g. "git grep" or "git log -p"
to find out information about code or changes in those other areas.
(And, of course, developers are also potentially interested in finding
out what other code depends on what they are changing, but I suspect
folks were already aware of that usecase.)  It's certainly not the
only usecase, but it's an additional one that I didn't think was quite
reflected in Stolee's description of why users would want searches to
turn up results for files not found in their working tree.

> > > The only thing I can think about is that the diffstat might want to show
> > > the stats for the conflicted files, in which case that's an important
> > > perspective on the distinction from --restrict.
> >
> > We only show the diffstat on a successful merge, so there's no
> > diffstat to show if there are any conflicted files.
> >
>
> Sorry, I have some questions here: how does git merge know there are
> no conflicts without downloading the blobs?

Not sure how that's related to the above, but to answer your question:

Sometimes merge has to download blobs to know if there are conflicts
or not.  But only sometimes.  Since tree objects have the hashes of
the blobs, having the tree objects is sufficient to determine which
side(s) of history modified each path.

If both sides of history modified the same file, then you *might* have
conflicts, and you indeed need the blobs to verify.  But if only one
side of history modified a file and the other left it alone, then
there is no conflict.
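
To make the tree-level check concrete, here is a minimal Python sketch.
It is only an illustration under simplifying assumptions: each tree is
modeled as a flat mapping from path to blob hash, the function name
`classify_paths` and the sample hashes are made up, and renames,
directory/file conflicts and the like are ignored entirely.

```python
def classify_paths(base, ours, theirs):
    """Split paths into those resolvable from hashes alone and those needing blobs.

    Each argument maps path -> blob hash; a missing key means the path is absent.
    """
    trivial, need_blobs = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            # Both sides agree (both unchanged, or both made the same change).
            trivial[path] = o
        elif o == b:
            # Only their side changed this path; take their hash as-is.
            trivial[path] = t
        elif t == b:
            # Only our side changed this path; keep our hash as-is.
            trivial[path] = o
        else:
            # Both sides changed it differently; blob contents are required.
            need_blobs.append(path)
    return trivial, need_blobs


# Hypothetical sample trees (hashes are invented for the example).
base = {"in/file1": "aaa111", "out/file1": "bbb222", "shared.c": "ccc333"}
ours = {"in/file1": "ddd444", "out/file1": "bbb222", "shared.c": "eee555"}
theirs = {"in/file1": "aaa111", "out/file1": "fff666", "shared.c": "abc777"}

trivial, need_blobs = classify_paths(base, ours, theirs)
print(trivial)
print(need_blobs)
```

In this toy input only `shared.c` was changed on both sides, so it is
the only path whose blobs would have to be present to decide whether
there is a real conflict.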
Derrick Stolee Oct. 6, 2022, 6:27 p.m. UTC | #22
On 10/6/22 3:10 AM, Elijah Newren wrote:
> On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@github.com> wrote:
>>
>> On 9/28/22 1:38 AM, Elijah Newren wrote:
>>> On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
>>>>
>>>> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
>>>>> From: Elijah Newren <newren@gmail.com>
>>>>
> [...]
>>>>> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
>>>>> +    on Behavior A or Behavior B
>>>>> +    * diff (with --cached or REVISION arguments)
>>>>> +    * grep (with --cached or REVISION arguments)
>>>>> +    * show (when given commit arguments)
>>>>> +    * bisect
>>>>> +    * blame
>>>>> +      * and annotate
>>>>> +    * log
>>>>> +      * and variants: shortlog, gitk, show-branch, whatchanged
>>>>> +
>>>>> +    For now, we default to behavior B for these, which want a default of
>>>>> +    --no-restrict.
>>>>
>>>> I do feel pretty strongly that we'll want a --no-restrict default here
>>>> because otherwise we will present confusion. I'm not even sure if we would
>>>> want to make this available via a config setting, but likely a config
>>>> setting makes sense in the long term.
>>>
>>> You've got me slightly confused.  You did say the same thing a long time ago:
>>>
>>>     "But I also want to avoid doing this as a default or even behind a
>>> config setting."[A]
>>>
>>> BUT, when Shaoxuan proposed making --restrict/--focus the default for
>>> one of these commands, you seemed to be on board[B].
>>
>> I'm specifically talking about 'git log'. I think that having that be
>> in a restricted mode is extremely dangerous and will only confuse users.
>> This includes 'git show' (with commit arguments) and 'git bisect', I
>> think.
> 
> Thanks, that helps me understand your position better.
> 
> I'm curious if, due to the length of the document and this thread,
> you're just skimming past the idea I mentioned of showing a warning at
> the beginning of `diff`, `log`, or `show` output when restricting
> based on config or defaults.  Without such a warning, I agree that
> restricting might be confusing at times, but I think such a warning
> may be sufficient to address the concerns around partial/incomplete
> results.  The one command that this warning idea doesn't help with is
> `grep` since it cannot safely be applied there, which potentially
> leaves `grep` giving confusing results when users pass either
> `--cached` or revisions, but you seem to not be concerned about that.

I'm not convinced that warnings are enough for some cases, especially
for output that is fed to a pager. Do the warnings stick around in
the pager? I'm not sure.

> I'm also curious if the problem partially stems from the fact that
> with `git log` there is no way to control revision limiting and diff
> generation paths independently.  If there was a way to make `git log
> -p` continue showing the regular list of commits but restrict which
> paths were shown in the diffs, and we made the --scope-sparse handling
> do this so that only diffs were limited but not the revisions
> traversed/printed, would that help address your concerns?

My biggest issue is with the idea of simplifying the commit history
based on the sparse-checkout path definitions. The '-p' option having
a diff scoped to the sparse-checkout paths would be fine.

>> The rest, (diff, grep, blame) are worktree-focused, so having a restrict
>> mode by default makes sense to me.
> 
> I was specifically calling out diff & grep when passed revision
> arguments, which are definitely *not* worktree-focused operations.

You're right. I'm not using the right terminology. They _are_
operations on a single tree, where path scopes make sense.

> Also, blame incorporates a component of changes from the worktree, but
> it's mostly about history (and one or more -C's make it check other
> paths as well).

Since each input is a specific file path, I'm not sure we need
anything here except perhaps a warning that they are requesting
a file outside the sparse-checkout definition (if even that).

>>> Anyway...I will note that without a configurable option to give these
>>> commands a behavior of `--restrict`, I think you make working in
>>> disconnected partial clones practically impossible.  I want to be able
>>> to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
>>> disconnected partial clones, and I've wanted that kind of capability
>>> for well over a decade[H].  So, don't be surprised if I keep bringing
>>> up a config option of some sort for these commands.  :-)
>>
>> Now, if we're talking about "don't download extra objects" as a goal,
>> then we're thinking about things not just related to sparse-checkout
>> but even history within the sparse-checkout. Even if we make the
>> 'backfill' command something that users could run, there isn't a
>> guarantee that users will want to have even that much data downloaded.
>> We would need a way to say "yes, I ran 'git blame' on this path in my
>> sparse-checkout, but please don't just fail if you can't get new objects,
>> instead inform me that the results are incomplete."
>>
>> I think the sparse-checkout boundary is a good way to minimize the
>> number of objects downloaded by these commands, but to actually
>> remove the need for downloads at all we need a way to gracefully
>> return partial results.
> 
> There may be some merits to a partial clone with shallow blob history,
> but I've never really been all that interested in it. ......
> But you've got me curious.  You seem to be suggesting that partial
> results are okay if the user is informed.  I have suggested making
> diff-with-revisions, log -p, etc. show a warning that results may be
> incomplete when restricting them to the sparse checkout based on
> config.  So, aren't you suggesting that my proposal is safe after all?

I think the following things are true:

1. It's really important to keep the current partial clone default of
   only downloading blobs on-demand. Even with a limited sparse-checkout,
   it's rare that users will need every version of every file in that
   sparse-checkout, and they may not want that tax on their local storage.

2. Adding an opt-in backfill for a sparse-checkout definition will
   prevent most on-demand downloads (although it might want to be
   integrated into 'git fetch' behind an option to be really sure that
   state continues in the future).

3. Updating Git features to scope down to sparse-checkout will prevent
   many of the remaining on-demand downloads.

4. To be _absolutely sure_ that on-demand downloads don't happen, we
   need an extra mode for Git and new ways of reporting partial results.
   Without this mode, Git commands fail when triggering an on-demand
   download and the network is unavailable.

So, I'm saying that (4) is a direction that we could go. It also seems
extremely difficult to do, so we should do (2) & (3) first, which will
get us 99% of the way there.

Thanks,
-Stolee
Elijah Newren Oct. 7, 2022, 2:56 a.m. UTC | #23
On Thu, Oct 6, 2022 at 11:27 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 10/6/22 3:10 AM, Elijah Newren wrote:
> > On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@github.com> wrote:
> >>
> >> On 9/28/22 1:38 AM, Elijah Newren wrote:
[...]
> >> I'm specifically talking about 'git log'. I think that having that be
> >> in a restricted mode is extremely dangerous and will only confuse users.
> >> This includes 'git show' (with commit arguments) and 'git bisect', I
> >> think.
> >
> > Thanks, that helps me understand your position better.
> >
> > I'm curious if, due to the length of the document and this thread,
> > you're just skimming past the idea I mentioned of showing a warning at
> > the beginning of `diff`, `log`, or `show` output when restricting
> > based on config or defaults.  Without such a warning, I agree that
> > restricting might be confusing at times, but I think such a warning
> > may be sufficient to address the concerns around partial/incomplete
> > results.  The one command that this warning idea doesn't help with is
> > `grep` since it cannot safely be applied there, which potentially
> > leaves `grep` giving confusing results when users pass either
> > `--cached` or revisions, but you seem to not be concerned about that.
>
> I'm not convinced that warnings are enough for some cases

I'm not sure I'm following.  You suggested earlier in this thread that
we may want to provide a mode where commands "don't just fail if you
can't get new objects, instead inform me that the results are
incomplete".  You re-emphasized that in your most recent email by
saying "To be _absolutely sure_ that on-demand downloads don't happen,
we need an extra mode for Git and new ways of reporting partial
results."  So it sounds like you're suggesting a mode where partial
results are a forced option, because how else can you be "_absolutely
sure_ that on-demand downloads don't happen"?  And if we always want
to allow partial results, don't you need to inform users about those
results being potentially incomplete?  How exactly does one inform the
user that results are incomplete if not by a warning?  Something seems
inconsistent here, but perhaps I'm just misunderstanding something?

I think, based on what you said below, that you're uncomfortable with
certain types of incompleteness, such as partial revision results, but
are fine with others such as those dealing with partial blob results
(whether in breadth or in depth).  But if so, I'm still not sure what
your statement about warnings means.  If we scope operations down to
the sparsity paths (e.g. potentially giving a partial-breadth diff for
"git diff REV1 REV2"), what's your expectation with regards to
warnings?

>, especially
> for output that is fed to a pager. Do the warnings stick around in
> the pager? I'm not sure.

If the warning is printed on stdout, then yes the warning will stick
around in a pager.  If the warning is printed on stderr, then the
warning is likely of dubious utility since it can easily get lost.
Since log & diff output are not adversely affected by additional
preliminary output, I think stdout is where such a warning should go
(unless folks feel like we don't even need a warning?).  However, grep
would be strongly negatively affected by additional output, and that's
why I've stated several times that warnings cannot reasonably be
included with grep.
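
The stdout/stderr distinction can be sketched with a small Python
experiment.  This is only a model of a plain `command | pager` pipe,
where the pager reads the command's stdout; the child program, the
warning text, and the helper name `pager_input` are all made up for
illustration, and Git's own pager setup may wire stderr differently,
which this sketch does not attempt to reproduce.

```python
import subprocess
import sys

# Child program standing in for a command that prints a warning, then output.
CHILD = r"""
import sys
stream = sys.stdout if sys.argv[1] == "stdout" else sys.stderr
print("warning: results limited to sparse-checkout paths", file=stream)
print("diff --git a/file b/file")
"""


def pager_input(warning_stream):
    """Return what a pager reading the child's stdout would receive."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, warning_stream],
        stdout=subprocess.PIPE,  # the pipe a pager would read from
        stderr=subprocess.PIPE,  # captured only to keep this demo quiet
        text=True,
        check=True,
    )
    return result.stdout


print(pager_input("stdout"))
print(pager_input("stderr"))
```

In the first call the warning is part of the stream the pager would
display; in the second it travels on a separate stream and never
reaches the pipe.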

But, so far, no one has expressed concern with providing partial
results for grep even if no warning can be given, so perhaps it
doesn't matter.

> > I'm also curious if the problem partially stems from the fact that
> > with `git log` there is no way to control revision limiting and diff
> > generation paths independently.  If there was a way to make `git log
> > -p` continue showing the regular list of commits but restrict which
> > paths were shown in the diffs, and we made the --scope-sparse handling
> > do this so that only diffs were limited but not the revisions
> > traversed/printed, would that help address your concerns?
>
> My biggest issue is with the idea of simplifying the commit history
> based on the sparse-checkout path definitions. The '-p' option having
> a diff scoped to the sparse-checkout paths would be fine.

Wahoo!  Sounds like we have a path forward then.  I'll update the
document in my patch to reflect this distinction.

Note that it's not just the -p option to log, though, but anything
related to patches: diff formatting, diff filtering, rename & copy
detection, and pickaxe-related options.  The one place where the
scoping to sparse-checkouts is slightly funny for `git log` is with
--remerge-diff (because the merge machinery ignores sparsity patterns
when generating the new toplevel tree; however after the new toplevel
tree is generated, we would generate a diff that is limited to the
sparsity patterns).
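
The distinction between scoping the diffs and simplifying the history
can be sketched in a few lines of Python.  This is a toy model, not how
`git log` is implemented: commits are plain dictionaries, the sparsity
specification is a list of directories, and all function names are
hypothetical.

```python
def in_sparse(path, sparse_dirs):
    """True if path lies under one of the sparse-checkout directories."""
    return any(path == d or path.startswith(d + "/") for d in sparse_dirs)


def log_with_scoped_diffs(commits, sparse_dirs):
    """Keep every commit; only trim each commit's diff to sparse paths."""
    return [
        (c["id"], [p for p in c["paths"] if in_sparse(p, sparse_dirs)])
        for c in commits
    ]


def log_with_simplified_history(commits, sparse_dirs):
    """Drop commits that touch nothing inside the sparse paths."""
    scoped = log_with_scoped_diffs(commits, sparse_dirs)
    return [(cid, paths) for cid, paths in scoped if paths]


# Hypothetical history, newest first.
commits = [
    {"id": "c3", "paths": ["docs/readme.txt"]},
    {"id": "c2", "paths": ["src/main.c", "lib/util.c"]},
    {"id": "c1", "paths": ["lib/util.c"]},
]
sparse_dirs = ["src", "docs"]

print(log_with_scoped_diffs(commits, sparse_dirs))
print(log_with_simplified_history(commits, sparse_dirs))
```

With scoped diffs, commit `c1` still appears in the output with an
empty diff; with history simplification it disappears altogether,
which is the behavior Stolee objected to.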

[...]
> > Also, blame incorporates a component of changes from the worktree, but
> > it's mostly about history (and one or more -C's make it check other
> > paths as well).
>
> Since each input is a specific file path, I'm not sure we need
> anything here except perhaps a warning that they are requesting
> a file outside the sparse-checkout definition (if even that).

Your statement seems to suggest you are assuming that git blame will
only operate on the path listed on the command line.  Am I reading
your assumption correctly, or am I totally misunderstanding why you
would claim nothing is needed beyond a warning about the path the user
typed?  If I'm understanding your assumption correctly, your
assumption does not hold when one or more -C options are passed.
Since my earlier mentions of those options and their ramifications
didn't connect, perhaps it would help if I was a bit more explicit
about what I mean.  Let's take a simple example, in git.git, which you
can run right now:

   git blame -C -C cache.h

This command will show lines of text that now appear in cache.h but
which came *from* all of these files:

    * builtin/clean.c
    * cache.h
    * merge-recursive.h
    * notes.c
    * object-file.c
    * object.h
    * read-cache.c
    * setup.c
    * sha1-file.c
    * sha1_file.c
    * sha1_name.c
    * show-diff.c
    * symlinks.c
    * tree-walk.h

In order to find out and report that the current lines of cache.h came
from these other files, blame has to search a wide range of other
files in the repository.  That potentially wide range of other files in
the repository is something we could consider tailoring when in a
sparse-checkout, at least for Behavior A folks.

[...]
> I think the following things are true:
>
> 1. It's really important to keep the current partial clone default of
>    only downloading blobs on-demand. Even with a limited sparse-checkout,
>    it's rare that users will need every version of every file in that
>    sparse-checkout, and they may not want that tax on their local storage.

I do agree we need to keep these in mind for some usecases, but I do
not agree these are universally true among sparse-checkout users.
However, our differences on this probably don't matter in practice
since you then immediately suggested...

> 2. Adding an opt-in backfill for a sparse-checkout definition will
>    prevent most on-demand downloads (although it might want to be
>    integrated into 'git fetch' behind an option to be really sure that
>    state continues in the future).

Yes, this would be great.  One question, though: integrated with
`fetch` or with `sparse-checkout set|add`?  If users adjust their
sparse-checkout definition, that might be a good time to allow them to
automatically trigger fixing the missing backfill at the same time.

> 3. Updating Git features to scope down to sparse-checkout will prevent
>    many of the remaining on-demand downloads.

Yes, though I'd clarify "scope down to sparse-checkout where it can
make sense".  Things like merge & bundle have to pay attention to
changes outside the sparse-checkout, but we can get commands like
diff/log -p/grep to scope down in breadth.

> 4. To be _absolutely sure_ that on-demand downloads don't happen, we
>    need an extra mode for Git and new ways of reporting partial results.
>    Without this mode, Git commands fail when triggering an on-demand
>    download and the network is unavailable.

While many commands might be able to produce partial results
realistically, I think things like merge & bundle should not support
such a mode and just fail if they are missing any data they normally
need.  Basically, we'd still have commands that would fail without a
network connection beyond push/pull/fetch, but this mode would limit
the list as much as possible through allowing commands to limit both
breadth and depth of the blobs we act upon.
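
As a rough sketch of what such a mode could look like, the Python below
models a search over blobs where some objects are not available
locally.  Everything here is an assumption made for illustration: the
object store is a dictionary, the function name and the `allow_partial`
parameter are invented, and no such mode exists in Git today.

```python
def search_available_blobs(local_store, wanted, term, allow_partial):
    """Search blobs that are available locally.

    local_store: dict of object id -> content for blobs already downloaded.
    wanted: list of (path, object id) pairs the command would like to search.
    Returns (matching paths, paths skipped because their blob is missing).
    """
    matches, skipped = [], []
    for path, oid in wanted:
        if oid not in local_store:
            if not allow_partial:
                # Models a command such as merge, which must fail, not guess.
                raise LookupError(f"missing object {oid} for {path}")
            skipped.append(path)
            continue
        if term in local_store[oid]:
            matches.append(path)
    return matches, skipped


# Hypothetical local object store and request.
local_store = {"1a": "int main(void)", "2b": "static int helper(void)"}
wanted = [("src/main.c", "1a"), ("src/util.c", "2b"), ("vendor/big.c", "3c")]

matches, skipped = search_available_blobs(
    local_store, wanted, "int", allow_partial=True
)
if skipped:
    print(f"warning: results may be incomplete; {len(skipped)} blob(s) missing")
print(matches)
```

Calling the same function with `allow_partial=False` raises instead of
returning, which is the behavior I would expect merge and bundle to
keep.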

> So, I'm saying that (4) is a direction that we could go. It also seems
> extremely difficult to do, so we should do (2) & (3) first, which will
> get us 99% of the way there.

Agreed on all three counts.
ZheNing Hu Oct. 15, 2022, 2:17 a.m. UTC | #24
On Thu, Oct 6, 2022 at 15:53, Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > I am not sure if these ideas are feasible.
> >
> > On Wed, Sep 28, 2022 at 13:38, Elijah Newren <newren@gmail.com> wrote:
> > >
> [...]
> > > > There's nothing Git can do to help those engineers that do cross-tree
> > > > work.
> > >
> > > I'm going to partially disagree with this, in part because of our
> > > experience with many inter-module dependencies that evolve over time.
> > > Folks can start on a certain module and begin refactoring.  Being
> > > > aware that their changes will affect other areas of the code, they can
> > > do a search (e.g. "git grep --cached ..." to find cases outside their
> > > current sparse checkout), and then selectively unsparsify to get the
> > > relevant few dozen (or maybe even few hundred) modules added.  They
> > > aren't switching to a dense checkout, just a less sparse one.  When
> > > they are done, they may narrow their sparse specification again.  We
> > > have a number of users doing cross-tree work who are using
> > > sparse-checkouts, and who find it productive and say it still speeds
> > > up their local build/test cycles.
> > >
> > > So, I'd say that ensuring Git supports behavior B well in
> > > sparse-checkouts, is something Git can do to help out both some of the
> > > engineers doing cross-tree work, and some of the engineers that are
> > > doing cross-tree testing.
> > >
> > > (For full disclosure, we also have users doing cross-tree work using
> > > regular dense checkouts and I agree there's not a lot we can do to
> > > help them.)
> > >
> >
> > Let me guess where the cross tree users using sparse-checkout are
> > getting their revenue from:
>
> Is "revenue" perhaps a case of auto-correct choosing the wrong word?
>

s/revenue/benefits

> > 1. they don't have to download the entire repository of blobs at once
> > 2. their working tree can be easily resized.
> > 3. they could have something like sparse-index to optimize the performance
> > of git commands.
>
> These correspond to partial clone, sparse-checkout, and sparse-index.
> I think these 3 features and the various work done to support them,
> plus submodule (which is a different kind of solution) are the
> features Git provides to work with repository subsets.  Some
> repositories (especially the big monorepos like the Microsoft ones)
> will benefit from using all three of these features.  Others might
> only want to use one or two of them.
>

Here I am just amazed that cross-tree users can shorten the
test/build cycle when only using sparse-checkout. So these benefits
don't come from any of my three conjectures above: not partial clone,
not sparse-index, and not frequently resizing the working tree.

> As an example, the repository where we first applied sparse-checkouts
> to (and which had the complicated dependencies) does not use partial
> clones or a sparse-index.   While partial clone and sparse-index might
> help a little, the .git directory for a full clone is merely 2G, and
> there are less than 100K entries in the index.  However,
> sparse-checkout helps out a lot.
>

Yes, you give a good explanation here that we don't necessarily need
to apply all of these features. But I still feel a little confused: where
do the time savings come from? Is it from the reduced time of
git checkout? Or is it the reduction of some unnecessary working tree scans
during test/build time?

> > But it's still worth worrying about the size of the git repository blobs,
> > even if it's just only blobs in mono-repo's HEAD, that may also be too big
> > for the user's local area to handle.
> >
> > Perhaps it would make more sense to place this integration testing work on
> > a remote server.
> >
> > I am not sure if these ideas are feasible:
> >
> > 1. mount the large git repo on the server to local.
> > 2. just ssh to a remote server to run integration tests.
> > 3. use an external tool to run integration tests on the remote server.
>
> Are you suggesting #1 as a way for just handling the git history, or
> also for handling the worktree with some kind of virtual file system
> where not all files are actually written locally?  If you're only
> talking about the history, then you're kind of going on a tangent
> unrelated to this document.  If you're talking about worktrees and
> virtual file systems, then Git proper doesn't have anything of the
> sort currently.  There are at least two solutions in this space --
> Microsoft's Git-VFS (which I think they are phasing out) and Google's
> similar virtual file system -- but I'm not currently particularly
> interested in either one.
>

Here I mean git nfs, or some kind of git virtual file system, or some
git workspace. I don't really understand: why are they phasing it
out now?

> #3 is precisely what we did first (except "*a* remote server" rather
> than "*the* remote server").  I think I called it out in the email
> you're responding to; it's often good enough for many people.
> However, sometimes those tests fail and people want to run locally so
> it's easier to inspect.  Or they just want to be able to run locally
> anyway.  So, while #3 helped, it wasn't good enough.
>

Agreed, testing locally is sometimes necessary.

> #2 is also something we did.  Using tools like Coder or GitHub
> codespaces or other offerings in that area, you can provide developers
> a nice beefy box with good network connectivity to the main Git
> repository, on which they can do development and running of tests.
> Then developers can connect to such machines from a variety of
> different external locations.  Works great for some people...but build
> times and ability of IDEs to handle the code base are still an issue,
> so doing smarter things with sparse-checkouts is still important.
> And, even if #2 works for some people, others still want to develop
> and run integration tests on their (beefy) laptops.
>

Agree too.

> All three of these, as far as I can tell, are just things that
> individual teams set up and aren't anything that would affect Git's
> development one way or another.
>
>
> However, I'll note that while we internally definitely did two of the
> three things you suggested here, it wasn't a complete enough solution
> for us and sparse-checkout adoption was still pretty minimal at that
> point.  So, we went back to our sparse-checkouts and asked how we
> could modify the build system to allow us to not check out the in-tree
> dependencies of the things we are tweaking, but still get a correct
> build and allow us to run tests.  Once we got that working, we finally
> really unlocked the value of sparse checkouts for us (both improving
> things for developers on laptops, and for developers on the
> development box in the cloud).  It went from very few folks using
> sparse checkouts with that repository, to being the default and
> recommended usage at that point.
>

Yeah, I'm a big believer in sparse-checkout and partial clone. They are
good features, but not many people realize that they can use them.

> While the build changes were internal things we did, I think that the
> underlying usage scenario matters to Git development because it helps
> inform how sparse-checkout can be used.  In particular, it suggests
> why some sparse-checkout users may be interested in finding results
> for files that do not match their sparse-checkout patterns -- in-tree
> dependencies may not necessarily be checked out, but those are related
> enough to the code that developers are working on, that developers are
> still potentially interested in using e.g. "git grep" or "git log -p"
> to find out information about code or changes in those other areas.
> (And, of course, developers are also potentially interested in finding
> out what other code depends on what they are changing, but I suspect
> folks were already aware of that usecase.)  It's certainly not the
> only usecase, but it's an additional one that I didn't think was quite
> reflected in Stolee's description of why users would want searches to
> turn up results for files not found in their working tree.
>

Some users may really want to focus only on their subprojects, so I think
"git log -p" shouldn't show files that don't satisfy the sparse-checkout
patterns, and neither should "git grep". But some users may need to search
globally, and I think those people are in the minority, so maybe there
should be a "git log -p --scrope=all" or "git grep --scrope=all" for them.

> > > > The only thing I can think about is that the diffstat might want to show
> > > > the stats for the conflicted files, in which case that's an important
> > > > perspective on the distinction from --restrict.
> > >
> > > We only show the diffstat on a successful merge, so there's no
> > > diffstat to show if there are any conflicted files.
> > >
> >
> > Sorry, I have some questions here: how does git merge know there are
> > no conflicts without downloading the blobs?
>
> Not sure how that's related to the above, but to answer your question:
>

Ah, this question relates to my previous question in [1]. At first I always
thought it was git merge that caused the extra blob downloading.
In the end, it turned out to be caused by the last diffstat of merge...

> Sometimes merge has to download blobs to know if there are conflicts
> or not.  But only sometimes.  Since tree objects have the hashes of
> the blobs, having the tree objects is sufficient to determine which
> side(s) of history modified each path.
>
> If both sides of history modified the same file, then you *might* have
> conflicts, and you indeed need the blobs to verify.  But if only one
> side of history modified a file and the other left it alone, then
> there is no conflict.

I think I probably get it. For example, the tree of user1's HEAD has a tree
entry "a4e1fc out/file1" whose blob hash is the same as in the merge base,
because the file is outside the sparse-checkout specification. User1 then
fetches a commit from user2 whose tree has the entry "13f91e out/file1".
So git merge doesn't really need to check the contents of the file here,
because only one side changed it.

Thanks for your answers!

[1]: https://lore.kernel.org/git/CABPp-BEBB1oqdVcXrWwMAdtb0TwHZvr-6KDa210j5ncw54Di_g@mail.gmail.com/
Elijah Newren Oct. 15, 2022, 4:37 a.m. UTC | #25
On Fri, Oct 14, 2022 at 7:17 PM ZheNing Hu <adlternative@gmail.com> wrote:
>
> On Thu, Oct 6, 2022 at 15:53, Elijah Newren <newren@gmail.com> wrote:
> >
> > On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > >
> > > On Wed, Sep 28, 2022 at 13:38, Elijah Newren <newren@gmail.com> wrote:
> > > >
[...]
> > As an example, the repository where we first applied sparse-checkouts
> > to (and which had the complicated dependencies) does not use partial
> > clones or a sparse-index.   While partial clone and sparse-index might
> > help a little, the .git directory for a full clone is merely 2G, and
> > there are less than 100K entries in the index.  However,
> > sparse-checkout helps out a lot.
>
> Yes, you give a good explanation here that we don't necessarily need
> to apply all of these features. But I still feel a little confused: where
> do the time savings come from? Is it from the reduced time of
> git checkout? Or is it the reduction of some unnecessary working tree scans
> during test/build time?

It is neither git checkout time, nor tree scans; it's the ability to
avoid building large parts of the project coupled with the
significantly better responsiveness of IDEs when project scope is
limited.  When directories are entirely missing, we don't need to
build any of the code in those directories and can instead just use
already built artifacts from the most recent point in history that has
been built on our continuous integration infrastructure.  (Note: our
sparsification tool will keep any modules/directories where there have
been modifications since the most recent upstream commit that has been
built, so we don't risk getting a wrong build via this strategy.)
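
A minimal Python sketch of that strategy follows.  It is not our
internal tool, just an illustration of the rule stated above; the
function names, module names, and the "build-locally" /
"reuse-ci-artifact" labels are all invented for the example.

```python
def sparsify(requested, modified_since_built):
    """Modules to check out: what the user asked for, plus anything modified
    since the most recent upstream commit that has been built."""
    return sorted(set(requested) | set(modified_since_built))


def plan_build(all_modules, checked_out):
    """Build what is checked out; reuse prebuilt artifacts for the rest."""
    present = set(checked_out)
    return {
        module: "build-locally" if module in present else "reuse-ci-artifact"
        for module in all_modules
    }


# Hypothetical project layout.
all_modules = ["core", "net", "ui", "tools"]
requested = ["ui"]
modified_since_built = ["net"]  # changed after the last built commit

checked_out = sparsify(requested, modified_since_built)
plan = plan_build(all_modules, checked_out)
print(checked_out)
print(plan)
```

Because `net` was modified after the last built commit, it stays
checked out and gets built locally even though the user only asked for
`ui`; `core` and `tools` are absent and come from prebuilt artifacts.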

[...]
> > > 1. mount the large git repo on the server to local.
> > > 2. just ssh to a remote server to run integration tests.
> > > 3. use an external tool to run integration tests on the remote server.
> >
> > Are you suggesting #1 as a way for just handling the git history, or
> > also for handling the worktree with some kind of virtual file system
> > where not all files are actually written locally?  If you're only
> > talking about the history, then you're kind of going on a tangent
> > unrelated to this document.  If you're talking about worktrees and
> > virtual file systems, then Git proper doesn't have anything of the
> > sort currently.  There are at least two solutions in this space --
> > Microsoft's Git-VFS (which I think they are phasing out) and Google's
> > similar virtual file system -- but I'm not currently particularly
> > interested in either one.
> >
>
> Here I mean git nfs, or some kind of git virtual file system, or some
> git workspace, I don't really understand why they are now
> phasing out?

You'd have to ask them, or read their comments on it.  I think they
believe sparse-checkout with a normal file system is or will be better
than the behavior they are getting from their virtual file system (and
they've put a lot of really good work behind making sure that is the
case).

[...]
> Some users may really want to focus only on their subprojects, so I think
> "git log -p" shouldn't show files that don't satisfy the
> sparse-checkout patterns,
> and "git grep" too. But some users may need to search something globally,
> and I think those people are in the minority, so maybe there should be a
> "git log -p --scrope=all" or "git grep --scrope=all" for them.

Good to know you're in the "Behavior A" camp and we've got another
vote for implementing things in that direction.  A couple of small
points, though:
  * It's --scope rather than --scrope.  ;-)
  * I have to disagree here slightly about people using a --scope=all
flag -- I don't think users should have to specify it with every grep
or log invocation.  Users in the "Behavior B" camp would want
`--scope=all` behavior for nearly every grep and log -p invocation
they make; it's annoying and unfair to force them to spell it out
every time.  So, I think we need a configuration option.
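
A rough sketch of the precedence this implies, assuming an explicit flag beats the config setting and the config setting beats the built-in default.  The value `sparse`, the function name, and the idea of a single config value are placeholders for illustration; only `--scope=all` has actually been named in this thread.

```python
VALID_SCOPES = {"all", "sparse"}  # "sparse" is an assumed placeholder value


def effective_scope(flag=None, config=None, default="all"):
    """Pick the scope for one grep/log invocation.

    flag:   value of a per-invocation --scope option, or None if not given.
    config: value of a (hypothetical) configuration setting, or None if unset.
    """
    # First non-None source wins: flag, then config, then the default.
    for value in (flag, config, default):
        if value is not None:
            if value not in VALID_SCOPES:
                raise ValueError(f"unknown scope: {value!r}")
            return value
    raise ValueError("no scope available")


# Behavior B user who set the config once: no flag needed per invocation.
print(effective_scope(config="all"))
# Behavior A user doing an occasional repository-wide search.
print(effective_scope(flag="all", config="sparse"))
```

With something like this, neither camp has to spell out its preference on every command; the flag is only for the exceptional invocation.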

[...]
> > Sometimes merge has to download blobs to know if there are conflicts
> > or not.  But only sometimes.  Since tree objects have the hashes of
> > the blobs, having the tree objects is sufficient to determine which
> > side(s) of history modified each path.
> >
> > If both sides of history modified the same file, then you *might* have
> > conflicts, and you indeed need the blobs to verify.  But if only one
> > side of history modified a file and the other left it alone, then
> > there is no conflict.
>
> I think I probably get it. e.g. tree of HEAD of user1 have a tree entry
> "a4e1fc out/file1" which is same SHA1 to blob in merge base, because
> it's out of sparse-checkout specification, and it fetch a commit of user2,
> and its tree has a tree entry "13f91e out/file1", so git merge doesn't really
> need to check the contents of the file here, because only one side
> changes it.

Precisely.  :-)
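
For illustration, that per-path decision can be sketched as a three-way comparison of blob IDs taken from the tree objects alone.  This is a simplified model, not Git's merge code: it ignores renames, file modes, and directory/file conflicts, and `resolve_path` is a made-up name.

```python
def resolve_path(base, ours, theirs):
    """Decide how to merge one path from its three blob IDs (None = absent)."""
    if ours == theirs:
        # Both sides agree (or neither side touched the path).
        return "take-ours"
    if ours == base:
        # Only their side changed the path; no blob contents are needed.
        return "take-theirs"
    if theirs == base:
        # Only our side changed the path; no blob contents are needed.
        return "take-ours"
    # Both sides changed the path differently: the blobs must be read
    # (and, in a partial clone, possibly downloaded) to look for conflicts.
    return "needs-content-merge"


# The out/file1 example above: our side still matches the merge base.
print(resolve_path(base="a4e1fc", ours="a4e1fc", theirs="13f91e"))
```

Only the last case requires blob contents, which is why a merge in a partial clone needs the network only sometimes.
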
ZheNing Hu Oct. 15, 2022, 2:49 p.m. UTC | #26
On Sat, Oct 15, 2022 at 12:38, Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Oct 14, 2022 at 7:17 PM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > On Thu, Oct 6, 2022 at 15:53, Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > > >
> > > > On Wed, Sep 28, 2022 at 13:38, Elijah Newren <newren@gmail.com> wrote:
> > > > >
> [...]
> > > As an example, the repository where we first applied sparse-checkouts
> > > to (and which had the complicated dependencies) does not use partial
> > > clones or a sparse-index.   While partial clone and sparse-index might
> > > help a little, the .git directory for a full clone is merely 2G, and
> > > there are less than 100K entries in the index.  However,
> > > sparse-checkout helps out a lot.
> >
> > Yes, you make a good explanation here that we don't necessarily need
> > to apply all these kinds of features. But I still feel a little confused: Where
> > does the time savings come from? Is it saved by the time reduction of
> > git checkout? Or is it the reduction of some unnecessary working tree scans
> > during test/build time?
>
> It is neither git checkout time, nor tree scans; it's the ability to
> avoid building large parts of the project coupled with the
> significantly better responsiveness of IDEs when project scope is
> limited.  When directories are entirely missing, we don't need to
> build any of the code in those directories and can instead just use
> already built artifacts from the most recent point in history that has
> been built on our continuous integration infrastructure.  (Note: our
> sparsification tool will keep any modules/directories where there have
> been modifications since the most recent upstream commit that has been
> built, so we don't risk getting a wrong build via this strategy.)
>

So these users are just building/testing on a few projects and save time
from building/testing on some other projects. This is reasonable.

> [...]
> > > > 1. mount the large git repo on the server to local.
> > > > 2. just ssh to a remote server to run integration tests.
> > > > 3. use an external tool to run integration tests on the remote server.
> > >
> > > Are you suggesting #1 as a way for just handling the git history, or
> > > also for handling the worktree with some kind of virtual file system
> > > where not all files are actually written locally?  If you're only
> > > talking about the history, then you're kind of going on a tangent
> > > unrelated to this document.  If you're talking about worktrees and
> > > virtual file systems, then Git proper doesn't have anything of the
> > > sort currently.  There are at least two solutions in this space --
> > > Microsoft's Git-VFS (which I think they are phasing out) and Google's
> > > similar virtual file system -- but I'm not currently particularly
> > > interested in either one.
> > >
> >
> > Here I mean git nfs, or some kind of git virtual file system, or some
> > git workspace, I don't really understand why they are now
> > phasing out?
>
> You'd have to ask them, or read their comments on it.  I think they
> believe sparse-checkout with a normal file system is or will be better
> than the behavior they are getting from their virtual file system (and
> they've put a lot of really good work behind making sure that is the
> case).
>

Okay.

> [...]
> > Some users may really want to focus only on their subprojects, so I think
> > "git log -p" shouldn't show files that don't satisfy the
> > sparse-checkout patterns,
> > and "git grep" too. But some users may need to search something globally,
> > and I think those people are in the minority, so maybe there should be a
> > "git log -p --scrope=all" or "git grep --scrope=all" for them.
>
> Good to know you're in the "Behavior A" camp and we've got another
> vote for implementing things in that direction.  A couple of small
> points, though:
>   * It's --scope rather than --scrope.  ;-)
>   * I have to disagree here slightly about people using a --scope=all
> flag -- I don't think users should have to specify it with every grep
> or log invocation.  Users in the "Behavior B" camp would want
> `--scope=all` behavior for nearly every grep and log -p invocation
> they make; it's annoying and unfair to force them to spell it out
> every time.  So, I think we need a configuration option.
>

Fine, this configuration looks like it can balance the needs of both camps.

Thanks,
ZheNing Hu

Patch

diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 00000000000..b213b2b3f35
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,670 @@ 
+Table of contents:
+
+  * Purpose of sparse-checkouts
+  * Desired behavior
+  * Subcommand-dependent defaults
+  * Implementation Questions
+  * Implementation Goals/Plans
+  * Known bugs
+  * Reference Emails
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+The idea is simple enough, but there are two different high-level
+usecases which affect how some Git subcommands should behave.  Further,
+even if we only considered one of those usecases, sparse-checkouts
+modify different subcommands in over a half dozen different ways.  Let's
+start by considering the high level usecases in this section:
+
+  A) Users are _only_ interested in the sparse portion of the repo
+
+  B) Users want a sparse working tree, but are working in a larger whole
+
+It may be worth explaining both of these in a bit more detail:
+
+  (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care.  They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest.  Showing
+them other results from history (e.g. from diff/log/grep/etc.) is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use
+partial clones together with sparse checkouts and do disconnected
+development.  Not only do these users generally not care about other
+parts of the repository, but consider it a blocker for Git commands to
+try to operate on those.  If commands attempt to access paths in history
+outside the sparsity specification, then the partial clone will attempt
+to download additional blobs on demand, fail, and then fail the user's
+command.  (This may be unavoidable in some cases, e.g. when `git merge`
+has non-trivial changes to reconcile outside the sparsity path, but we
+should limit how often users are forced to connect to the network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+  (Behavior B) Users want a sparse working tree, but are working in a larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies.  The initial attempts to use sparse-checkouts usually
+involve the directories you are directly interested in plus what those
+directories depend upon within your repository.  But there's a monkey
+wrench here: if you have integration tests, they invert the hierarchy:
+to run integration tests, you need not only what you are interested in
+and its dependencies, you also need everything that depends upon what
+you are interested in or that depends upon one of your
+dependencies...AND you need all the dependencies of that expanded group.
+That can easily change your sparse-checkout into a nearly dense one.
+Naturally, that tends to kill the benefits of sparse-checkouts.  There
+are a couple solutions to this conundrum: either avoid grabbing
+dependencies (maybe have built versions of your dependencies pulled from
+a CI cache somewhere), or say that users shouldn't run integration tests
+directly and instead do it on the CI server when they submit a code
+review.  Or do both.  Regardless of whether you stub out your
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other
+stubbed-out parts of the repository, particularly when the dependencies
+are complex or change relatively frequently.  Thus, for such uses,
+sparse-checkouts can be used to limit what you directly build and
+modify, but these users do not necessarily want their sparse checkout
+paths to limit their queries of history.
+
+Some people may also be interested in behavior B simply as a performance
+workaround: if they are using non-cone mode, then they have to deal with
+its inherent quadratic performance problems.  In that mode, every
+operation that checks whether paths match the sparsity specification can
+be expensive.  As such, these users may only be willing to pay for those
+expensive checks when interacting with the working copy, and may prefer
+getting "unrelated" results from their history queries over having slow
+commands.
+
+
+=== Desired behavior ===
+
+As noted in the previous section, despite the simple idea of just
+working with a subset of files, there are a range of different
+behavioral changes that need to be made to different subcommands to work
+well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
+examples.  In particular, at [2], we saw that mere composition of other
+commands that individually worked correctly in a sparse-checkout context
+did not imply that the higher level command would work correctly; it
+sometimes requires further tweaks.  So, understanding these differences
+can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+  * commands that only look at files within the sparsity specification
+
+      * status
+      * diff (without --cached or REVISION arguments)
+      * grep (without --cached or REVISION arguments)
+
+  * commands that restore files to the working tree that match sparsity patterns, and
+    remove unmodified files that don't match those patterns:
+
+      * switch
+      * checkout (the switch-like half)
+      * read-tree
+      * reset --hard
+
+      * `restore` & the restore-like half of `checkout` SHOULD be in the above
+	category, but are buggy (see the "Known bugs" section below)
+
+  * commands that write conflicted files to the working tree, but otherwise will
+    omit writing files that do not match the sparsity patterns:
+
+      * merge
+      * rebase
+      * cherry-pick
+      * revert
+
+    Note that this somewhat depends upon the merge strategy being used:
+      * `ort` behaves as described above
+      * `recursive` tries to not vivify files unnecessarily, but does sometimes
+	vivify files without conflicts.
+      * `octopus` and `resolve` will always vivify any file changed in the merge
+	relative to the first parent, which is rather suboptimal.
+
+  * commands that always ignore sparsity since commits must be full-tree
+
+      * archive
+      * bundle
+      * commit
+      * format-patch
+      * fast-export
+      * fast-import
+      * commit-tree
+
+  * commands that write any modified file to the working tree (conflicted or not,
+    and whether those paths match sparsity patterns or not):
+
+      * stash
+
+      * am/apply probably should be in the above category, but need to be fixed to
+	auto-vivify instead of failing
+
+* Commands that differ for behavior A vs. behavior B:
+
+  * commands that make modifications:
+      * add
+      * rm
+      * mv
+
+  * commands that query history
+      * diff (with --cached or REVISION arguments)
+      * grep (with --cached or REVISION arguments)
+      * show (when given commit arguments)
+      * bisect
+      * blame
+	* and annotate
+      * log
+	* and variants: shortlog, gitk, show-branch, whatchanged
+
+* Commands I don't know how to classify
+
+  * ls-files
+
+    Shows all tracked files by default, and with an option can show
+    sparse directory entries instead of expanding them.  Should there be
+    a way to restrict to just the non SKIP_WORKTREE files?
+
+    Note that `git ls-files -t` is often used to see what is sparse and
+    what is not, which only works with a non-restricted assumption.
+
+  * checkout-index
+
+    should it be like `checkout` and pay attention to sparsity paths, or
+    be considered special and write to working tree anyway?  The
+    interaction with --prefix, and the use of specifically named files
+    (rather than globs) makes me wonder.
+
+  * update-index
+
+    The --[no-]ignore-skip-worktree-entries default is totally bogus,
+    but otherwise this command seems okay?  Not sure what category it
+    would go under, though.
+
+  * range-diff
+
+    Is this like `log` or `format-patch`?
+
+  * cherry
+
+    See range-diff
+
+  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
+
+    should these be tweaked or always operate full-tree?
+
+* Commands unaffected by sparse-checkouts
+
+  * branch
+  * clean (works on untracked files, whereas SKIP_WORKTREE files are still tracked)
+  * describe
+  * fetch
+  * gc
+  * init
+  * maintenance
+  * notes
+  * pull (merge & rebase have the necessary changes)
+  * push
+  * submodule
+  * tag
+
+  * config
+  * filter-branch (works in separate checkout without sparse-checkout setup)
+  * pack-refs
+  * prune
+  * remote
+  * repack
+  * replace
+
+  * bugreport
+  * count-objects
+  * fsck
+  * gitweb
+  * help
+  * instaweb
+  * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+  * rerere
+  * verify-commit
+  * verify-tag
+
+  * commit-graph
+  * hash-object
+  * index-pack
+  * mktag
+  * mktree
+  * multi-pack-index
+  * pack-objects
+  * prune-packed
+  * symbolic-ref
+  * unpack-objects
+  * update-ref
+  * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+  * for-each-ref
+  * get-tar-commit-id
+  * ls-remote
+  * merge-base (merges are computed full tree, so merge base should be too)
+  * name-rev
+  * pack-redundant
+  * rev-parse
+  * show-index
+  * show-ref
+  * unpack-file
+  * var
+  * verify-pack
+
+  * <Everything under 'Interacting with Others' in 'git help --all'>
+  * <Everything under 'Low-level...Syncing' in 'git help --all'>
+  * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+  * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+  * merge-file
+  * merge-index
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults (for the desired behavior, not just
+the current implementation) depending on the command:
+
+  * Commands defaulting to --restrict:
+    * status
+    * diff (without --cached or REVISION arguments)
+    * grep (without --cached or REVISION arguments)
+    * switch
+    * checkout (the switch-like half)
+    * read-tree
+    * reset (--hard)
+    * restore/checkout
+    * checkout-index
+
+    This behavior makes sense; these interact with the working tree.
+
+  * Commands defaulting to --restrict-unless-conflicts
+    * merge
+    * rebase
+    * cherry-pick
+    * revert
+
+    These also interact with the working tree, but require slightly different
+    behavior so that conflicts can be resolved.
+
+  * Commands defaulting to --no-restrict
+    * archive
+    * bundle
+    * commit
+    * format-patch
+    * fast-export
+    * fast-import
+    * commit-tree
+
+    * ls-files
+    * stash
+    * am
+    * apply
+
+    These have completely different defaults and perhaps deserve the most detailed
+    explanation:
+
+    In the case of commands in the first group (format-patch,
+    fast-export, bundle, archive, etc.), these are commands for
+    communicating history, which will be broken if they restrict to a
+    subset of the repository.  As such, they operate on full paths and
+    have no `--restrict` option for overriding.  Some of these commands may
+    take paths for manually restricting what is exported, but it needs to
+    be very explicit.
+
+    In the case of stash, it needs to vivify files to avoid losing the
+    user's changes.
+
+    In the case of am and apply, those commands only operate on the
+    working tree, so they are kind of in the same boat as stash.
+    Perhaps `git am` could run `git sparse-checkout reapply`
+    automatically afterward and move into a category more similar to
+    merge/rebase/cherry-pick, but it'd still be weird because it'd
+    vivify files besides just conflicted ones when there are conflicts.
+
+    In the case of ls-files, `git ls-files -t` is often used to see what
+    is sparse and not, in which case restricting would not make sense.
+    Also, ls-files has traditionally been used to get a list of "all
+    tracked files", which would suggest not restricting.  But it's
+    slightly funny, because sparse-checkouts essentially split tracked
+    files into two categories -- those in the sparse specification and
+    those outside -- and how does the user specify which of those two
+    types of tracked files they want?
+
+  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
+    may affect how verbose the warnings are):
+    * add
+    * rm
+    * mv
+
+    The defaults here perhaps make sense since they are nearly --restrict, but
+    actually using --restrict could cause user confusion if users specify a
+    specific filename, so they warn by default.  That logic may sound like
+    --no-restrict should be the default, but that's prone to even bigger confusion:
+      * `git add <somefile>` if honored and outside the sparse cone, can result in
+	the file randomly disappearing later when some subsequent command is run
+	(since various commands automatically clean up unmodified files outside
+	the sparsity specification).
+      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
+	outside the range of the user's interest.  Much better to operate on the
+	sparsity specification and give the user warnings if other files could have
+	matched.
+      * `git mv` has similar surprises when moving into or out of the cone, so
+	best to restrict and throw warnings if restriction might affect the result.
+
+    There may be a difference in here between behavior A and behavior B.
+    For behavior A, we probably only want to warn if there were no
+    suitable matches for files in the sparsity specification, whereas
+    for behavior B, we may want to warn even if there are valid files to
+    operate on if the result would have been different under
+    `--no-restrict`.
+
+  * Commands whose default for --restrict vs. --no-restrict should vary depending
+    on Behavior A or Behavior B
+    * diff (with --cached or REVISION arguments)
+    * grep (with --cached or REVISION arguments)
+    * show (when given commit arguments)
+    * bisect
+    * blame
+      * and annotate
+    * log
+      * and variants: shortlog, gitk, show-branch, whatchanged
+
+    For now, we default to behavior B for these, which want a default of
+    --no-restrict.
+
+    Note that two of these commands -- diff and grep -- also appeared in
+    a different list with a default of --restrict, but only when limited
+    to searching the working tree.  The working tree vs. history
+    distinction is fundamental in how behavior B operates, so this is
+    expected.
+
+    --restrict may make more sense as the long term default for
+    these[12], but that's a fair amount of work to implement, and it'd
+    be very problematic for behavior B users.  Making it the default
+    now, and then slowly implementing that default in various
+    subcommands over multiple releases would mean that behavior B users
+    would need to learn to slowly add additional flags to their
+    commands, depending on git version, to get the behavior they want.
+    That gradual switchover would be painful, so we should avoid it at
+    least until it's fully implemented.
+
+
+=== Implementation Questions ===
+
+  * Does the name --[no-]restrict sound good to others?  Are there better options?
+    * Names in use, or appearing in patches, or previously suggested:
+      * --sparse/--dense
+      * --ignore-skip-worktree-bits
+      * --ignore-skip-worktree-entries
+      * --ignore-sparsity
+      * --[no-]restrict-to-sparse-paths
+      * --full-tree/--sparse-tree
+      * --[no-]restrict
+    * Rationale making me lean slightly towards --[no-]restrict:
+      * We want a name that works for many commands, so we need a name that
+	does not conflict
+      * --[no-]restrict isn't overly long and seems relatively explanatory
+      * `--sparse`, as used in add/rm/mv, is totally backwards for
+	grep/log/etc.  Changing the meaning of `--sparse` for these
+	commands would fix the backwardness, but possibly break existing
+	scripts.  Using a new name pairing would allow us to treat
+	`--sparse` in these commands as a deprecated alias.
+      * There is a different `--sparse`/`--dense` pair for commands using
+	revision machinery, so using that naming might cause confusion
+      * There is also a `--sparse` in both pack-objects and show-branch, which
+	don't conflict but do suggest that `--sparse` is overloaded
+      * The name --ignore-skip-worktree-bits is a double negative, is
+	quite a mouthful, refers to an implementation detail that many
+	users may not be familiar with, and we'd need a negation for it
+	which would probably be even more ridiculously long.  (But we
+	can make --ignore-skip-worktree-bits a deprecated alias for
+	--no-restrict.)
+
+  * Should --[no-]restrict be a git global option, or added as options to each
+    relevant command?  (Does that make sense given the multitude of different
+    default behaviors we have for different options?)
+
+  * If a config option is added (core.restrictToSparsity?) what should
+    the values and description be?  There's a risk of confusion, because
+    we only want this config option to affect the history-querying
+    commands (log/diff/grep) and maybe the path-modifying worktree
+    commands (add/rm/mv), but certainly not most of the others.  Previous config
+    suggestion here: [13]
+
+  * Should --sparse in ls-files be made an alias for --restrict?
+    `--restrict` is certainly a near synonym in cone-mode, but even then
+    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
+    option has no effect, and in cone-mode it still shows the sparse
+    directory entries which are technically outside the sparsity
+    specification.
+
+  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
+    restore be made deprecated aliases for --no-restrict?  (They have the
+    same meaning.)
+
+  * Should --ignore-skip-worktree-entries in update-index be made a
+    deprecated alias for --no-restrict?  (Or, better yet, should the
+    option just be nuked from orbit after flipping the default, since
+    the reverse option is never wanted and the sole purpose of this
+    option was to turn off a bug?)
+
+  * sparse-checkout: once behavior A is fully implemented, should we
+    take an interim measure to ease people into switching the default?
+    Namely, if folks are not already in a sparse checkout, then require
+    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
+    would set core.restrictToSparsity according to the setting given), and
+    throw an error if the flag is not provided?  That error would be a
+    great place to warn folks that the default may change in the future,
+    and get them used to specifying what they want so that the eventual
+    default switch is seamless for them.
+
+  * clone: should we provide some mechanism for tying partial clones and
+    sparse checkouts together better?  Maybe an option
+	--sparse=dir1,dir2,...,dirN
+    which:
+       * Does initial fetch with `--filter=blob:none`
+       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
+       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
+	 fault in the missing blobs within the sparse
+	 specification...except that rev-list needs some kind of options
+	 to also get files from leading directories too.
+       * Sets --restrict mode to allow focusing on the cone of interest
+	 (and to permit disconnected development)
+
+
+=== Implementation Goals/Plans ===
+
+ * Figure out answers to the 'Implementation Questions' section (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries, possibly
+   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users request
+   that they not trigger this bug." flag
+
+  * Flags & Config
+    * Make `--sparse` in add/rm/mv a deprecated alias for `--no-restrict`
+    * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+      a deprecated alias for `--no-restrict`
+    * Create config option (core.restrictToSparsity?), note how it only
+      affects two classes of commands
+
+ * Behavioral plans:
+     add, rm, mv:
+	Behavior B: throw error if would have affected paths outside of sparsity.
+	Behavior A: throw error if would have only affected paths outside of sparsity.
+     grep (on history), diff (on history), log, etc:
+	Behavior B: act on all paths (already implemented)
+	Behavior A: act on limited paths, maybe show stderr warning ("results limited")
+		    if selected via config rather than explicitly
+     other diff machinery:
+	make sure diff machinery changes don't mess with format-patch, fast-export, etc.
+
+  * Fix performance issues, such as
+    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git.  (Behavior B didn't use to be either,
+   but was the easier of the two to implement.)
+
+1. am and apply:
+
+   am and apply rely on files being present in the working copy, and
+   also write to them unconditionally.  They should probably first check
+   for the files' presence, and if found to be SKIP_WORKTREE, then clear
+   the bit and vivify the paths, then do their work.
+
+2. reset --hard:
+
+   reset --hard provides a confusing error message (works correctly, but
+   misleads the user into believing it didn't):
+
+    $ touch addme
+    $ git add addme
+    $ git ls-files -t
+    H addme
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git reset --hard                           # usually works great
+    error: Path 'addme' not uptodate; will not remove from working tree.
+    HEAD is now at bdbbb6f third
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ ls -1
+    tracked
+
+    `git reset --hard` DID remove addme from the index and the working tree, contrary
+    to the error message, but in line with how reset --hard should behave.
+
+3. Checkout, restore:
+
+   These commands do not handle path & revision arguments appropriately:
+
+    $ ls
+    tracked
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-files -- '*skipped'
+    tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-tree HEAD | grep skipped
+    100644 blob 276f5a64354b791b13840f02047738c77ad0584f	tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout HEAD~1 -- '*skipped'
+    $ git ls-files -t
+    H tracked
+    H tracked-but-maybe-skipped
+    $ git status --porcelain
+    M  tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    $ git status --porcelain
+    $
+
+    Note that checkout without a revision (or restore --staged) fails to
+    find a file to restore from the index, even though ls-files shows
+    such a file certainly exists.
+
+    Similar issues occur with HEAD (--source=HEAD in restore's case),
+    but suddenly works when HEAD~1 is specified.  And then after that it
+    will work with HEAD specified, even though it didn't before.
+
+    Directories are also an issue:
+
+    $ git sparse-checkout set nomatches
+    $ git status
+    On branch main
+    You are in a sparse checkout with 0% of tracked files present.
+
+    nothing to commit, working tree clean
+    $ git checkout .
+    error: pathspec '.' did not match any file(s) known to git
+    $ git checkout HEAD~1 .
+    Updated 1 path from 58916d9
+    $ git ls-files -t
+    S tracked
+    H tracked-but-maybe-skipped
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+    https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+    https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+    https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+    https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+    https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+    https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+    https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparsity specification)
+     https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparsity specification mechanism:
+  * https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+  * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/