Message ID | 9f2135f90ffea7f4ccb226f506bf554deab324cc.1605205427.git.matheus.bernardino@usp.br (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | rm: honor sparse checkout patterns | expand |
Hi, On Thu, Nov 12, 2020 at 1:02 PM Matheus Tavares <matheus.bernardino@usp.br> wrote: > > Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its > operation to the paths that match both the command line pathspecs and > the repository's sparsity patterns. This better matches the expectations > of users with sparse-checkout definitions, while still allowing them > to optionally enable the old behavior with 'sparse.restrictCmds=false' > or the global '--no-restrict-to-sparse-paths' option. (For Stolee:) Did this arise when a user specified a directory to delete, and a (possibly small) part of that directory was in the sparse checkout while other portions of it were outside? I can easily see users thinking they are dealing with just the files relevant to them, and expecting the directory deletion to only affect that relevant subset, so this seems like a great idea. We'd just want to make sure we have a good error message if they explicitly list a single path outside the sparse checkout. > Suggested-by: Derrick Stolee <stolee@gmail.com> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> > --- > > This is based on mt/grep-sparse-checkout. > Original feature request: https://github.com/gitgitgadget/git/issues/786 > > Documentation/config/sparse.txt | 3 ++- > Documentation/git-rm.txt | 9 +++++++++ > builtin/rm.c | 7 ++++++- > t/t3600-rm.sh | 22 ++++++++++++++++++++++ > t/t7011-skip-worktree-reading.sh | 5 ----- > 5 files changed, 39 insertions(+), 7 deletions(-) > > diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt > index 494761526e..79d7d173e9 100644 > --- a/Documentation/config/sparse.txt > +++ b/Documentation/config/sparse.txt > @@ -12,7 +12,8 @@ When this option is true (default), some git commands may limit their behavior > to the paths specified by the sparsity patterns, or to the intersection of > those paths and any (like `*.c`) that the user might also specify on the > command line. When false, the affected commands will work on full trees, > -ignoring the sparsity patterns. For now, only git-grep honors this setting. > +ignoring the sparsity patterns. For now, only git-grep and git-rm honor this > +setting. > + > Note: commands which export, integrity check, or create history will always > operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.), > diff --git a/Documentation/git-rm.txt b/Documentation/git-rm.txt > index ab750367fd..25dda8ff35 100644 > --- a/Documentation/git-rm.txt > +++ b/Documentation/git-rm.txt > @@ -25,6 +25,15 @@ When `--cached` is given, the staged content has to > match either the tip of the branch or the file on disk, > allowing the file to be removed from just the index. > > +CONFIGURATION > +------------- > + > +sparse.restrictCmds:: > + By default, git-rm only matches and removes paths within the > + sparse-checkout patterns. This behavior can be changed with the > + `sparse.restrictCmds` setting or the global > + `--no-restrict-to-sparse-paths` option. For more details, see the > + full `sparse.restrictCmds` definition in linkgit:git-config[1]. Hmm, I wonder what people will think who are reading through the manual and have never used sparse-checkout. This seems prone to confusion for them. Maybe instead we could word this as: When sparse-checkouts are in use, by default git-rm will only match and remove paths within the sparse-checkout patterns... > > OPTIONS > ------- > diff --git a/builtin/rm.c b/builtin/rm.c > index 4858631e0f..e1fe71c321 100644 > --- a/builtin/rm.c > +++ b/builtin/rm.c > @@ -14,6 +14,7 @@ > #include "string-list.h" > #include "submodule.h" > #include "pathspec.h" > +#include "sparse-checkout.h" > > static const char * const builtin_rm_usage[] = { > N_("git rm [<options>] [--] <file>..."), > @@ -254,7 +255,7 @@ static struct option builtin_rm_options[] = { > int cmd_rm(int argc, const char **argv, const char *prefix) > { > struct lock_file lock_file = LOCK_INIT; > - int i; > + int i, sparse_paths_only; > struct pathspec pathspec; > char *seen; > > @@ -293,8 +294,12 @@ int cmd_rm(int argc, const char **argv, const char *prefix) > > seen = xcalloc(pathspec.nr, 1); > > + sparse_paths_only = restrict_to_sparse_paths(the_repository); > + > for (i = 0; i < active_nr; i++) { > const struct cache_entry *ce = active_cache[i]; > + if (sparse_paths_only && ce_skip_worktree(ce)) > + continue; > if (!ce_path_match(&the_index, ce, &pathspec, seen)) > continue; > ALLOC_GROW(list.entry, list.nr + 1, list.alloc); > diff --git a/t/t3600-rm.sh b/t/t3600-rm.sh > index efec8d13b6..7bf55b42eb 100755 > --- a/t/t3600-rm.sh > +++ b/t/t3600-rm.sh > @@ -892,4 +892,26 @@ test_expect_success 'rm empty string should fail' ' > test_must_fail git rm -rf "" > ' > > +test_expect_success 'rm should respect --[no]-restrict-to-sparse-paths' ' > + git init sparse-repo && > + ( > + cd sparse-repo && > + touch a b c && > + git add -A && > + git commit -m files && > + git sparse-checkout set "/a" && > + > + # By default, it should not rm paths outside the sparse-checkout > + test_must_fail git rm b 2>stderr && > + test_i18ngrep "fatal: pathspec .b. did not match any files" stderr && Ah, this answers my question about whether the user gets an error message when they explicitly call out a single path outside the sparse checkout. I'm curious if we want to be slightly more verbose on the error message when sparse-checkouts are in effect. In particular, if no paths match the sparsity patterns, but some paths would have matched the pathspec ignoring the sparsity patterns, then perhaps the error message should include a reference to the --no-restrict-to-sparse-paths flag. > + > + # But it should rm them with --no-restrict-to-sparse-paths > + git --no-restrict-to-sparse-paths rm b && > + > + # And also with sparse.restrictCmds=false > + git reset && > + git -c sparse.restrictCmds=false rm b > + ) > +' > + > test_done Do we also want to include a testcase where the user specifies a directory and part of that directory is within the sparsity paths and part is out? E.g. 'git sparse-checkout set /sub/dir && git rm -r sub' ? > diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh > index 26852586ac..1761a2b1b9 100755 > --- a/t/t7011-skip-worktree-reading.sh > +++ b/t/t7011-skip-worktree-reading.sh > @@ -132,11 +132,6 @@ test_expect_success 'diff-files does not examine skip-worktree dirty entries' ' > test -z "$(git diff-files -- one)" > ' > > -test_expect_success 'git-rm succeeds on skip-worktree absent entries' ' > - setup_absent && > - git rm 1 > -' > - > test_expect_success 'commit on skip-worktree absent entries' ' > git reset && > setup_absent && > -- > 2.28.0 Sweet, nice and simple. Thanks for sending this in; I think it'll be very nice.
On 11/12/2020 6:54 PM, Elijah Newren wrote: > Hi, > > On Thu, Nov 12, 2020 at 1:02 PM Matheus Tavares > <matheus.bernardino@usp.br> wrote: >> >> Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its >> operation to the paths that match both the command line pathspecs and >> the repository's sparsity patterns. This better matches the expectations >> of users with sparse-checkout definitions, while still allowing them >> to optionally enable the old behavior with 'sparse.restrictCmds=false' >> or the global '--no-restrict-to-sparse-paths' option. > > (For Stolee:) Did this arise when a user specified a directory to > delete, and a (possibly small) part of that directory was in the > sparse checkout while other portions of it were outside? The user who suggested this used a command like 'git rm */*.csprojx' to remove all paths with that file extension, but then realized that they were deleting all of those files from the entire repo, not just the current sparse-checkout. > I can easily see users thinking they are dealing with just the files > relevant to them, and expecting the directory deletion to only affect > that relevant subset, so this seems like a great idea. We'd just want > to make sure we have a good error message if they explicitly list a > single path outside the sparse checkout. We should definitely consider how to make this more usable for users who operate within a sparse-checkout but try to modify files outside the sparse-checkout. Is there a warning message such as "the supplied pathspec doesn't match any known file" that we could extend to recommend possibly disabling the sparse.restrictCmds config? (I see that you identify one below.) >> +CONFIGURATION >> +------------- >> + >> +sparse.restrictCmds:: >> + By default, git-rm only matches and removes paths within the >> + sparse-checkout patterns. This behavior can be changed with the >> + `sparse.restrictCmds` setting or the global >> + `--no-restrict-to-sparse-paths` option. For more details, see the >> + full `sparse.restrictCmds` definition in linkgit:git-config[1]. > > Hmm, I wonder what people will think who are reading through the > manual and have never used sparse-checkout. This seems prone to > confusion for them. Maybe instead we could word this as: > > When sparse-checkouts are in use, by default git-rm will only match > and remove paths within the sparse-checkout patterns... A preface such as "When using sparse-checkouts..." can help users ignore these config settings if they are unfamiliar with the concept. >> @@ -293,8 +294,12 @@ int cmd_rm(int argc, const char **argv, const char *prefix) >> >> seen = xcalloc(pathspec.nr, 1); >> >> + sparse_paths_only = restrict_to_sparse_paths(the_repository); >> + >> for (i = 0; i < active_nr; i++) { >> const struct cache_entry *ce = active_cache[i]; >> + if (sparse_paths_only && ce_skip_worktree(ce)) >> + continue; >> if (!ce_path_match(&the_index, ce, &pathspec, seen)) >> continue; >> ALLOC_GROW(list.entry, list.nr + 1, list.alloc); This seems like an incredibly simple implementation! Excellent. >> +test_expect_success 'rm should respect --[no]-restrict-to-sparse-paths' ' >> + git init sparse-repo && >> + ( >> + cd sparse-repo && >> + touch a b c && >> + git add -A && >> + git commit -m files && >> + git sparse-checkout set "/a" && >> + >> + # By default, it should not rm paths outside the sparse-checkout >> + test_must_fail git rm b 2>stderr && >> + test_i18ngrep "fatal: pathspec .b. did not match any files" stderr && > > Ah, this answers my question about whether the user gets an error > message when they explicitly call out a single path outside the sparse > checkout. I'm curious if we want to be slightly more verbose on the > error message when sparse-checkouts are in effect. In particular, if > no paths match the sparsity patterns, but some paths would have > matched the pathspec ignoring the sparsity patterns, then perhaps the > error message should include a reference to the > --no-restrict-to-sparse-paths flag. The error message could be modified similar to below: if (!seen[i]) { if (!ignore_unmatch) { die(_("pathspec '%s' did not match any files%s"), original, sparse_paths_only ? _("; disable sparse.restrictCmds if you intend to edit outside the current sparse-checkout definition") : ""); } } >> + >> + # But it should rm them with --no-restrict-to-sparse-paths >> + git --no-restrict-to-sparse-paths rm b && >> + >> + # And also with sparse.restrictCmds=false >> + git reset && >> + git -c sparse.restrictCmds=false rm b >> + ) >> +' >> + >> test_done > > Do we also want to include a testcase where the user specifies a > directory and part of that directory is within the sparsity paths and > part is out? E.g. 'git sparse-checkout set /sub/dir && git rm -r > sub' ? That is definitely an interesting case. I'm not sure the current implementation will do the "right" thing here. Definitely worth testing, and it might require a more complicated implementation. >> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh >> index 26852586ac..1761a2b1b9 100755 >> --- a/t/t7011-skip-worktree-reading.sh >> +++ b/t/t7011-skip-worktree-reading.sh >> @@ -132,11 +132,6 @@ test_expect_success 'diff-files does not examine skip-worktree dirty entries' ' >> test -z "$(git diff-files -- one)" >> ' >> >> -test_expect_success 'git-rm succeeds on skip-worktree absent entries' ' >> - setup_absent && >> - git rm 1 >> -' >> - Instead of deleting this case, perhaps we should just use "-c sparse.restrictCmds=false" in the 'git rm' command, so we are still testing this case? Thanks again! I appreciate that you jumped on this suggestion. -Stolee
Hi, Stolee and Elijah Thank you both for the comments. I'll try to send v2 soon. On Fri, Nov 13, 2020 at 10:47 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 11/12/2020 6:54 PM, Elijah Newren wrote: > > > > Do we also want to include a testcase where the user specifies a > > directory and part of that directory is within the sparsity paths and > > part is out? E.g. 'git sparse-checkout set /sub/dir && git rm -r > > sub' ? > > That is definitely an interesting case. I've added the test [1], but it's failing on Windows and I'm not quite sure why. The trash dir artifact shows that `git sparse-checkout set /sub/dir` produced the following path on the sparse-checkout file: "D:/a/git/git/git-sdk-64-minimal/sub/dir". If I change the setup cmd to `git sparse-checkout set sub/dir` (i.e. without the root slash), it works as expected. Could this be a bug, or am I missing something? [1]: https://github.com/matheustavares/git/commit/656bffa1793ce86b638d7ad1da2452103ce8b14b#diff-69312bb98fb0cf46e6906e3384c11529f3f04713d331a85d67fc77a3e43944f9R919
Am 15.11.20 um 21:12 schrieb Matheus Tavares Bernardino: > Thank you both for the comments. I'll try to send v2 soon. > > On Fri, Nov 13, 2020 at 10:47 AM Derrick Stolee <stolee@gmail.com> wrote: >> >> On 11/12/2020 6:54 PM, Elijah Newren wrote: >>> >>> Do we also want to include a testcase where the user specifies a >>> directory and part of that directory is within the sparsity paths and >>> part is out? E.g. 'git sparse-checkout set /sub/dir && git rm -r >>> sub' ? >> >> That is definitely an interesting case. > > I've added the test [1], but it's failing on Windows and I'm not quite > sure why. The trash dir artifact shows that `git sparse-checkout set > /sub/dir` produced the following path on the sparse-checkout file: > "D:/a/git/git/git-sdk-64-minimal/sub/dir". If 'git sparse-checkout' is run from a bash command line, I would not be surprised if the absolute path is munched in the way that you observe, provided that D:/a/git/git/git-sdk-64-minimal is where your MinGW subsystem is located. I that the case? > If I change the setup cmd to `git sparse-checkout set sub/dir` (i.e. > without the root slash), it works as expected. Could this be a bug, or > am I missing something? Not a bug, unless tell us that D:/a/git/git/git-sdk-64-minimal is a completely random path in your system or does not even exist. -- Hannes
On Sun, Nov 15, 2020 at 6:42 PM Johannes Sixt <j6t@kdbg.org> wrote: > > Am 15.11.20 um 21:12 schrieb Matheus Tavares Bernardino: > > Thank you both for the comments. I'll try to send v2 soon. > > > > On Fri, Nov 13, 2020 at 10:47 AM Derrick Stolee <stolee@gmail.com> wrote: > >> > >> On 11/12/2020 6:54 PM, Elijah Newren wrote: > >>> > >>> Do we also want to include a testcase where the user specifies a > >>> directory and part of that directory is within the sparsity paths and > >>> part is out? E.g. 'git sparse-checkout set /sub/dir && git rm -r > >>> sub' ? > >> > >> That is definitely an interesting case. > > > > I've added the test [1], but it's failing on Windows and I'm not quite > > sure why. The trash dir artifact shows that `git sparse-checkout set > > /sub/dir` produced the following path on the sparse-checkout file: > > "D:/a/git/git/git-sdk-64-minimal/sub/dir". > > If 'git sparse-checkout' is run from a bash command line, I would not be > surprised if the absolute path is munched in the way that you observe, > provided that D:/a/git/git/git-sdk-64-minimal is where your MinGW > subsystem is located. I that the case? Yeah, that must be it, thanks. I didn't run the command myself as I'm not on Windows, but D:/a/git/git/git-sdk-64-minimal must be the path where MinGW was installed by our GitHub Actions script, then. I'll use "sub/dir" without the root slash in t3600 to avoid the conversion. Thanks again!
On 11/13/20 8:47 AM, Derrick Stolee wrote: > On 11/12/2020 6:54 PM, Elijah Newren wrote: >> Hi, >> >> On Thu, Nov 12, 2020 at 1:02 PM Matheus Tavares >> <matheus.bernardino@usp.br> wrote: >>> >>> Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its >>> operation to the paths that match both the command line pathspecs and >>> the repository's sparsity patterns. This better matches the expectations >>> of users with sparse-checkout definitions, while still allowing them >>> to optionally enable the old behavior with 'sparse.restrictCmds=false' >>> or the global '--no-restrict-to-sparse-paths' option. >> >> (For Stolee:) Did this arise when a user specified a directory to >> delete, and a (possibly small) part of that directory was in the >> sparse checkout while other portions of it were outside? > > The user who suggested this used a command like 'git rm */*.csprojx' to > remove all paths with that file extension, but then realized that they > were deleting all of those files from the entire repo, not just the > current sparse-checkout. Aren't the wildcards expanded by the shell before the command line is given to Git? So the Git command should only receive command line args that actually match existing files, right?? Jeff
Matheus Tavares <matheus.bernardino@usp.br> writes: > Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its > operation to the paths that match both the command line pathspecs and > the repository's sparsity patterns. > This better matches the expectations > of users with sparse-checkout definitions, while still allowing them > to optionally enable the old behavior with 'sparse.restrictCmds=false' > or the global '--no-restrict-to-sparse-paths' option. Hmph. Is "rm" the only oddball that ignores the sparse setting? > to the paths specified by the sparsity patterns, or to the intersection of > those paths and any (like `*.c`) that the user might also specify on the > command line. When false, the affected commands will work on full trees, > -ignoring the sparsity patterns. For now, only git-grep honors this setting. > +ignoring the sparsity patterns. For now, only git-grep and git-rm honor this > +setting. I am not sure if this is a good direction to go---can we make an inventory of all commands that affect working tree files and see which ones need the same treatment before going forward with just "grep" and "rm"? Documenting the decision on the ones that will not get the same treatment may also be a good idea. What I am aiming for is to prevent users from having to know in which versions of Git they can rely on the sparsity patterns with what commands, and doing things piecemeal like these two topics would be a road to confusion. Thanks.
On Mon, Nov 16, 2020 at 6:30 AM Jeff Hostetler <git@jeffhostetler.com> wrote: > > On 11/13/20 8:47 AM, Derrick Stolee wrote: > > On 11/12/2020 6:54 PM, Elijah Newren wrote: > >> Hi, > >> > >> On Thu, Nov 12, 2020 at 1:02 PM Matheus Tavares > >> <matheus.bernardino@usp.br> wrote: > >>> > >>> Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its > >>> operation to the paths that match both the command line pathspecs and > >>> the repository's sparsity patterns. This better matches the expectations > >>> of users with sparse-checkout definitions, while still allowing them > >>> to optionally enable the old behavior with 'sparse.restrictCmds=false' > >>> or the global '--no-restrict-to-sparse-paths' option. > >> > >> (For Stolee:) Did this arise when a user specified a directory to > >> delete, and a (possibly small) part of that directory was in the > >> sparse checkout while other portions of it were outside? > > > > The user who suggested this used a command like 'git rm */*.csprojx' to > > remove all paths with that file extension, but then realized that they > > were deleting all of those files from the entire repo, not just the > > current sparse-checkout. > > Aren't the wildcards expanded by the shell before the command > line is given to Git? So the Git command should only receive > command line args that actually match existing files, right?? Good point. I suspect, though, that the issue may still be a problem if the user were to quote the wildcards; that may have been what happened and the reporting of the case just lost them somewhere along the way.
Hi, On Mon, Nov 16, 2020 at 12:14 PM Junio C Hamano <gitster@pobox.com> wrote: > > Matheus Tavares <matheus.bernardino@usp.br> writes: > > > Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its > > operation to the paths that match both the command line pathspecs and > > the repository's sparsity patterns. > > > This better matches the expectations > > of users with sparse-checkout definitions, while still allowing them > > to optionally enable the old behavior with 'sparse.restrictCmds=false' > > or the global '--no-restrict-to-sparse-paths' option. > > Hmph. Is "rm" the only oddball that ignores the sparse setting? This might make you much less happy, but in general none of the commands pay attention to the setting; I think a line or two in merge-recursive.c is the only part of the codebase outside of unpack_trees() that pays any attention to it at all. This was noted as a problem in the initial review of the sparse-checkout series at [1], and was the biggest factor behind me requesting the following being added to the manpage for sparse-checkout[2]: THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE. However, multiple groups were using sparse checkouts anyway, via manually editing .git/info/sparse-checkout and running `git read-tree -mu HEAD`, and adding various wrappers around it, and Derrick and I thought there was value in getting _something_ out there to smooth it out a little bit. I'd still say it's pretty rough around the edges...but useful nonetheless. [1] https://lore.kernel.org/git/CABPp-BHJeuEHBDkf93m9sfSZ4rZB7+eFejiAXOsjLEUu5eT5FA@mail.gmail.com/ [2] https://lore.kernel.org/git/CABPp-BEryfaeYhuUsiDTaYdRKpK6GRi7hgZ5XSTVkoHVkx2qQA@mail.gmail.com/ > > to the paths specified by the sparsity patterns, or to the intersection of > > those paths and any (like `*.c`) that the user might also specify on the > > command line. When false, the affected commands will work on full trees, > > -ignoring the sparsity patterns. For now, only git-grep honors this setting. > > +ignoring the sparsity patterns. For now, only git-grep and git-rm honor this > > +setting. > > I am not sure if this is a good direction to go---can we make an > inventory of all commands that affect working tree files and see > which ones need the same treatment before going forward with just > "grep" and "rm"? Documenting the decision on the ones that will not > get the same treatment may also be a good idea. What I am aiming > for is to prevent users from having to know in which versions of Git > they can rely on the sparsity patterns with what commands, and doing > things piecemeal like these two topics would be a road to confusion. It's not just commands which affect the working tree that need to be inventoried and adjusted. We've made lists of commands in the past: [3] https://lore.kernel.org/git/CABPp-BEbNCYk0pCuEDQ_ViB2=varJPBsVODxNvJs0EVRyBqjBg@mail.gmail.com/ [4] https://lore.kernel.org/git/xmqqy2y3ejwe.fsf@gitster-ct.c.googlers.com/ But the working-directory related ones are perhaps more problematic. One additoinal example: I just got a report today that "git stash apply" dies with a fatal error and the working directory in some intermediate state when trying to apply a stash when the working directory has a different set of sparsity paths than when the stash was created. (Granted, an error makes sense, but this was throwing untranslated error messages, meaning they weren't in codepaths that were meant to be triggered.) This case may not be an apples to apples comparison, but the testcase did involve adding new files before stashing, so the stash apply would have been trying to remove files. Anyway, I'll send more details on that issue in a separate thread after I've had some time to dig into it. Anyway, I'm not sure this helps, because I'm basically saying things are kind of messy, and we're fixing as we go rather than having a full implementation and all the fixes.
Hi, On Mon, Nov 16, 2020 at 9:20 PM Elijah Newren <newren@gmail.com> wrote: > > On Mon, Nov 16, 2020 at 12:14 PM Junio C Hamano <gitster@pobox.com> wrote: > > > > Matheus Tavares <matheus.bernardino@usp.br> writes: > > > > > Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its > > > operation to the paths that match both the command line pathspecs and > > > the repository's sparsity patterns. > > > > > This better matches the expectations > > > of users with sparse-checkout definitions, while still allowing them > > > to optionally enable the old behavior with 'sparse.restrictCmds=false' > > > or the global '--no-restrict-to-sparse-paths' option. > > > > Hmph. Is "rm" the only oddball that ignores the sparse setting? > > This might make you much less happy, but in general none of the > commands pay attention to the setting; I think a line or two in This isn't quite right; as noted at the just submitted [1], there are three different classes of ways that existing commands at least partially pay attention to the setting. [1] https://lore.kernel.org/git/5143cba7047d25137b3d7f8c7811a875c1931aee.1605891222.git.gitgitgadget@gmail.com/ > merge-recursive.c is the only part of the codebase outside of > unpack_trees() that pays any attention to it at all. This was noted > as a problem in the initial review of the sparse-checkout series at > [1], and was the biggest factor behind me requesting the following > being added to the manpage for sparse-checkout[2]: > > THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER > COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN > THE FUTURE. The fact that commands have only somewhat paid attention to this setting is still a problem, though. In fact, it was apparently a known problem as far back as 2009 just from looking at the short list of TODOs at the end of that file. > > > to the paths specified by the sparsity patterns, or to the intersection of > > > those paths and any (like `*.c`) that the user might also specify on the > > > command line. When false, the affected commands will work on full trees, > > > -ignoring the sparsity patterns. For now, only git-grep honors this setting. > > > +ignoring the sparsity patterns. For now, only git-grep and git-rm honor this > > > +setting. > > > > I am not sure if this is a good direction to go---can we make an > > inventory of all commands that affect working tree files and see > > which ones need the same treatment before going forward with just > > "grep" and "rm"? Documenting the decision on the ones that will not > > get the same treatment may also be a good idea. What I am aiming > > for is to prevent users from having to know in which versions of Git > > they can rely on the sparsity patterns with what commands, and doing > > things piecemeal like these two topics would be a road to confusion. > > It's not just commands which affect the working tree that need to be > inventoried and adjusted. We've made lists of commands in the past: > > [3] https://lore.kernel.org/git/CABPp-BEbNCYk0pCuEDQ_ViB2=varJPBsVODxNvJs0EVRyBqjBg@mail.gmail.com/ > [4] https://lore.kernel.org/git/xmqqy2y3ejwe.fsf@gitster-ct.c.googlers.com/ So, I think there are a few other commands that need to be modified the same way rm is here by Matheus, a longer list of commands than what I previously linked to for other modifications, some warnings and error messages that need to be cleaned up, and a fair amount of additional testing needed. I also think we need to revisit the flag names for --restrict-to-sparse-paths and --no-restrict-to-sparse-paths; some feedback I'm getting suggest they might be more frequently used than I originally suspected and thus we might want shorter names. (--sparse and --dense?) So we probably want to wait off on both mt/grep-sparse-checkout and mt/rm-sparse-checkout (sorry Matheus) and maybe my recently submitted stash changes (though those don't have an exposed --[no]-restrict-to-sparse-paths flag and are modelled on existing merge behavior) until we have a bigger plan in place. But I only dug into it a bit while working on the stash apply bug; I'm going to dig more (probably just after Thanksgiving) and perhaps make a Documentation/technical/ file of some sort to propose more plans here.
Hi Matheus, On Mon, 16 Nov 2020, Matheus Tavares Bernardino wrote: > On Sun, Nov 15, 2020 at 6:42 PM Johannes Sixt <j6t@kdbg.org> wrote: > > > > Am 15.11.20 um 21:12 schrieb Matheus Tavares Bernardino: > > > Thank you both for the comments. I'll try to send v2 soon. > > > > > > On Fri, Nov 13, 2020 at 10:47 AM Derrick Stolee <stolee@gmail.com> wrote: > > >> > > >> On 11/12/2020 6:54 PM, Elijah Newren wrote: > > >>> > > >>> Do we also want to include a testcase where the user specifies a > > >>> directory and part of that directory is within the sparsity paths and > > >>> part is out? E.g. 'git sparse-checkout set /sub/dir && git rm -r > > >>> sub' ? > > >> > > >> That is definitely an interesting case. > > > > > > I've added the test [1], but it's failing on Windows and I'm not quite > > > sure why. The trash dir artifact shows that `git sparse-checkout set > > > /sub/dir` produced the following path on the sparse-checkout file: > > > "D:/a/git/git/git-sdk-64-minimal/sub/dir". > > > > If 'git sparse-checkout' is run from a bash command line, I would not be > > surprised if the absolute path is munched in the way that you observe, > > provided that D:/a/git/git/git-sdk-64-minimal is where your MinGW > > subsystem is located. I that the case? > > Yeah, that must be it, thanks. I didn't run the command myself as I'm > not on Windows, but D:/a/git/git/git-sdk-64-minimal must be the path > where MinGW was installed by our GitHub Actions script, then. I'll use > "sub/dir" without the root slash in t3600 to avoid the conversion. > Thanks again! In the `windows-test` job, the construct `$(pwd)` will give you the Windows form (`D:/a/git/git/git-sdk-64-minimal`) whereas the `$PWD` form will give you the Unix-y form (`/`). What form to use depends on the context (if the absolute path comes from a shell script, the Unix-y form, if the absolute path comes from `git.exe` itself, the Windows form). Ciao, Dscho
On Mon, Nov 23, 2020 at 10:23 AM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Matheus, > > On Mon, 16 Nov 2020, Matheus Tavares Bernardino wrote: > > > On Sun, Nov 15, 2020 at 6:42 PM Johannes Sixt <j6t@kdbg.org> wrote: > > > > > > Am 15.11.20 um 21:12 schrieb Matheus Tavares Bernardino: > > > > Thank you both for the comments. I'll try to send v2 soon. > > > > > > > > On Fri, Nov 13, 2020 at 10:47 AM Derrick Stolee <stolee@gmail.com> wrote: > > > >> > > > >> On 11/12/2020 6:54 PM, Elijah Newren wrote: > > > >>> > > > >>> Do we also want to include a testcase where the user specifies a > > > >>> directory and part of that directory is within the sparsity paths and > > > >>> part is out? E.g. 'git sparse-checkout set /sub/dir && git rm -r > > > >>> sub' ? > > > >> > > > >> That is definitely an interesting case. > > > > > > > > I've added the test [1], but it's failing on Windows and I'm not quite > > > > sure why. The trash dir artifact shows that `git sparse-checkout set > > > > /sub/dir` produced the following path on the sparse-checkout file: > > > > "D:/a/git/git/git-sdk-64-minimal/sub/dir". > > > > > > If 'git sparse-checkout' is run from a bash command line, I would not be > > > surprised if the absolute path is munched in the way that you observe, > > > provided that D:/a/git/git/git-sdk-64-minimal is where your MinGW > > > subsystem is located. I that the case? > > > > Yeah, that must be it, thanks. I didn't run the command myself as I'm > > not on Windows, but D:/a/git/git/git-sdk-64-minimal must be the path > > where MinGW was installed by our GitHub Actions script, then. I'll use > > "sub/dir" without the root slash in t3600 to avoid the conversion. > > Thanks again! > > In the `windows-test` job, the construct `$(pwd)` will give you the > Windows form (`D:/a/git/git/git-sdk-64-minimal`) whereas the `$PWD` form > will give you the Unix-y form (`/`). What form to use depends on the > context (if the absolute path comes from a shell script, the Unix-y form, > if the absolute path comes from `git.exe` itself, the Windows form). Got it, thanks for the explanation!
Hi, Sorry for the long delay... On Fri, Nov 20, 2020 at 9:06 AM Elijah Newren <newren@gmail.com> wrote: > On Mon, Nov 16, 2020 at 9:20 PM Elijah Newren <newren@gmail.com> wrote: > > On Mon, Nov 16, 2020 at 12:14 PM Junio C Hamano <gitster@pobox.com> wrote: > > > Matheus Tavares <matheus.bernardino@usp.br> writes: > > > > > > > Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its > > > > operation to the paths that match both the command line pathspecs and > > > > the repository's sparsity patterns. > > > > > > > This better matches the expectations > > > > of users with sparse-checkout definitions, while still allowing them > > > > to optionally enable the old behavior with 'sparse.restrictCmds=false' > > > > or the global '--no-restrict-to-sparse-paths' option. > > > > > > Hmph. Is "rm" the only oddball that ignores the sparse setting? > > > > to the paths specified by the sparsity patterns, or to the intersection of > > > > those paths and any (like `*.c`) that the user might also specify on the > > > > command line. When false, the affected commands will work on full trees, > > > > -ignoring the sparsity patterns. For now, only git-grep honors this setting. > > > > +ignoring the sparsity patterns. For now, only git-grep and git-rm honor this > > > > +setting. > > > > > > I am not sure if this is a good direction to go---can we make an > > > inventory of all commands that affect working tree files and see > > > which ones need the same treatment before going forward with just > > > "grep" and "rm"? Documenting the decision on the ones that will not > > > get the same treatment may also be a good idea. What I am aiming > > > for is to prevent users from having to know in which versions of Git > > > they can rely on the sparsity patterns with what commands, and doing > > > things piecemeal like these two topics would be a road to confusion. > > > > It's not just commands which affect the working tree that need to be > > inventoried and adjusted. We've made lists of commands in the past: > > > > [3] https://lore.kernel.org/git/CABPp-BEbNCYk0pCuEDQ_ViB2=varJPBsVODxNvJs0EVRyBqjBg@mail.gmail.com/ > > [4] https://lore.kernel.org/git/xmqqy2y3ejwe.fsf@gitster-ct.c.googlers.com/ > > So, I think there are a few other commands that need to be modified > the same way rm is here by Matheus, a longer list of commands than > what I previously linked to for other modifications, some warnings and > error messages that need to be cleaned up, and a fair amount of > additional testing needed. I also think we need to revisit the flag > names for --restrict-to-sparse-paths and > --no-restrict-to-sparse-paths; some feedback I'm getting suggest they > might be more frequently used than I originally suspected and thus we > might want shorter names. (--sparse and --dense?) So we probably > want to wait off on both mt/grep-sparse-checkout and > mt/rm-sparse-checkout (sorry Matheus) and maybe my recently submitted > stash changes (though those don't have an exposed > --[no]-restrict-to-sparse-paths flag and are modelled on existing > merge behavior) until we have a bigger plan in place. > > But I only dug into it a bit while working on the stash apply bug; I'm > going to dig more (probably just after Thanksgiving) and perhaps make > a Documentation/technical/ file of some sort to propose more plans > here. I apologize, this email is *very* lengthy but I have a summary up-front. This email includes: * questions * short term proposals for unsticking sparse-related topics (en/stash-apply-sparse-checkout, mt/rm-sparse-checkout, and mt/grep-sparse-checkout) * longer term proposals for continued sparse-checkout work. However, the core thing to get across is my main question, contained in the next four lines: sparse-checkout's purpose is not fully defined. Does it exist to: A) allow working on a subset of the repository? B) allow working with a subset of the repository checked out? C) something else? You can think of (A) as marrying partial clones and sparse checkouts, and trying to make the result feel like a smaller repository. That means that grep, diff, log, etc. cull output unrelated to your sparsity paths. (B) is treating the repo as dense history (so grep, diff, log do not cull output), but the working directory sparse. In my view, git still doesn't (yet) provide either of these. === Why it matters == There are unfortunately *many* gray areas when you try to define how git subcommands should behave in sparse-checkouts. (The implementation-level definition from a decade ago of "files are assumed to be unchanged from HEAD when SKIP_WORKTREE is set, and we remove files with that bit set from the working directory" definition from the past provides no clear vision about how to resolve gray areas, and also leads to various inconsistencies and surprises for users.) I believe a definition based around a usecase (or usecases) for the purpose of sparse-checkouts would remove most of the gray areas. Are there choices other than A & B that I proposed above that make sense? Traditionally, I thought of B as just a partial implementation of A, and that A was the desired end-goal. However, others have argued for B as a preferred choice (some users at $DAYJOB even want both A and B, meaning they'd like a simple short flag to switch between the two). There may be others I'm unaware of. git implements neither A nor B. It might be nice to think of git's current behavior as a partial implementation of B (enough to provide some value, but still feel buggy/incomplete), and that after finishing B we could add more work to allow A. I'm not sure if the current implementation is just a subset of B, though. Let's dig in... === sparse-checkout demonstration -- diff, status, add, clean, rm === $ git init testing && cd testing $ echo tracked >tracked $ echo tracked-but-maybe-skipped >tracked-but-maybe-skipped $ git add . $ git commit -m initial $ echo tracked-but-maybe-skipped-v2 >tracked-but-maybe-skipped $ git commit -am second $ echo tracked-but-maybe-skipped-v3 >tracked-but-maybe-skipped $ git commit -am third ### In a non-sparse checkout... $ ls -1 tracked tracked-but-maybe-skipped $ echo changed >tracked-but-maybe-skipped # modify the file $ git diff --stat # diff shows the change tracked-but-maybe-skipped | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) $ git status --porcelain # status shows the change M tracked-but-maybe-skipped $ git add tracked-but-maybe-skipped # add stages the change $ git status --porcelain # status shows the staged change M tracked-but-maybe-skipped $ git clean -f tracked-but-maybe-skipped # clean ignores it, it's tracked $ git rm -f tracked-but-maybe-skipped # rm removes the file rm 'tracked-but-maybe-skipped' $ git status --porcelain # status shows the removal D tracked-but-maybe-skipped $ git reset --hard # undo all changes ### Compared to a sparse-checkout... $ git sparse-checkout set tracked # sparse-checkout... $ ls -1 # ...removes non-matches tracked $ echo changed >tracked-but-maybe-skipped # put the file back, modified $ ls -1 tracked tracked-but-maybe-skipped $ git diff --stat # diff ignores changed file $ git status --porcelain # status ignores changed file $ git add tracked-but-maybe-skipped # add... $ git status --porcelain # ...also ignores changed file $ git clean -f tracked-but-maybe-skipped # maybe it's untracked? but... $ ls -1 # ...clean also ignores it tracked tracked-but-maybe-skipped $ git rm -f tracked-but-maybe-skipped # rm doesn't?! rm 'tracked-but-maybe-skipped' $ git status --porcelain D tracked-but-maybe-skipped $ git reset --hard # undo changes, re-sparsify With a direct filename some might question the behavior of add. However, note here that add & rm could have used a glob such as '*.c', or a directory like 'builtin/'. In such a case, the add behavior seems reasonable (though perhaps a warning would be nice if no paths end up matching the pathspec, much like it does if you specify `git add mistyped-filename`) and the rm behavior is quite dangerous. === more sparse-checkout discussion -- behavior A vs. B with history === Note that other forms of checkout/restore will also ignore paths that do not match sparsity patterns: $ git checkout HEAD tracked-but-maybe-skipped error: pathspec 'tracked-but-maybe-skipped' did not match any file(s) known to git $ git restore --source=HEAD tracked-but-maybe-skipped error: pathspec 'tracked-but-maybe-skipped' did not match any file(s) known to git The error message is suboptimal, but seems like this otherwise gives desirable behavior as we want the checkout to be sparse. If a user had specified a glob or a directory, we'd only want to match the portion of that glob or directory associated with the sparsity patterns. Unfortunately, this behavior changes once you specify a different version -- it actively not only ignores the sparse-checkout specification but clears the SKIP_WORKTREE bit: $ git checkout HEAD~1 tracked-but-maybe-skipped Updated 1 path from 58916d9 $ ls -1 tracked tracked-but-maybe-skipped $ git ls-files -t H tracked H tracked-but-maybe-skipped $ git reset --hard # Undo changes, re-sparsify And it gets even worse when passing globs like '*.c' or directories like 'src/' to checkout because then sparsity patterns are ignored if and only if the file in the index differs from the specified file: $ git checkout -- '*skipped' error: pathspec '*skipped' did not match any file(s) known to git $ ls -1 tracked $ git checkout HEAD~2 -- '*skipped' $ ls -1 tracked tracked-but-maybe-skipped We get similar weirdness with directories: $ git sparse-checkout set nomatches $ ls -1 $ git checkout . error: pathspec '.' did not match any file(s) known to git $ git checkout HEAD~2 . Updated 1 path from fb99ded $ git ls-files -t S tracked H tracked-but-maybe-skipped ### Undo the above changes $ git reset --hard $ git sparse-checkout set tracked Note that the above only updated 1 path, despite both files existing in HEAD~2. Only one of the files differed between HEAD~2 and the current index, so it only updated that one. When it updated that one, it cleared the SKIP_WORKTREE bit for it, but left the other file, that did exist in the older commit, with the SKIP_WORKTREE bit. Since checkout ignores non-matching paths, users might expect other commands like diff to do the same: $ git ls-files -t H tracked S tracked-but-maybe-skipped $ echo changed> tracked-but-maybe-skipped $ git diff --stat tracked-but-maybe-skipped # Yes, ignored $ git diff --stat HEAD tracked-but-maybe-skipped # Yes, ignored $ git diff --stat HEAD~1 tracked-but-maybe-skipped # Not ignored?!? tracked-but-maybe-skipped | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) $ rm tracked-but-maybe-skipped Technically this is because SKIP_WORKTREE is treated as "file is assumed to match HEAD", which might be implementationally well defined, but yields some weird results for users to attempt to form a mental model. Diff seems to match behavior B, whereas checkout with revisions and/or pathspecs doesn't match either A or B. There is a similar split in whether users would expect a search to return results for folks who prefer (A) or (B): $ git grep maybe HEAD~1 HEAD~1:tracked-but-maybe-skipped:tracked-but-maybe-skipped-v2 But, regardless of how history is treated (i.e. regardless of preferences for behavior A or B), consistent with checkout and diff above we'd expect a plain grep to not return anything: $ git grep v3 # Should be empty tracked-but-maybe-skipped:tracked-but-maybe-skipped-v3 Huh?!? Very confusing. Let's make it worse, though -- how about we manually create that file again (despite it being SKIP_WORKTREE) and give it different contents and see what it does: $ echo changed >tracked-but-maybe-skipped $ git grep v3 tracked-but-maybe-skipped:tracked-but-maybe-skipped-v3 WAT?!? What part of "changed" does "v3" match?? Oh, right, none of it. The pretend-the-working-directory-matches-HEAD implementation strikes again. It's a nice approximation to user-desired behavior, but I don't see how it actually makes sense in general. === sparse-checkout other behaviors -- merges and apply === The merge machinery has traditionally done something different than checkout/diff/status/commit in sparse-checkouts. Like commit, it has to include all paths in any created commit object. Like checkout, it wants to avoid writing files to the working directory that don't match sparsity paths -- BUT files might have conflicts. So, the merge machinery clears the skip_worktree bit for conflicted files. Also, merge-recursive.c also clears the skip_worktree bit for other files unnecessarily, due to difficulty of implementation of preserving it (merge-ort should correct this).[1] [1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/ Since the merge-machinery is used in multiple commands -- merge, rebase, cherry-pick, revert, checkout -m, etc., this behavior is present in all of those. stash is somewhat modelled on a merge, so it should behave the same. It doesn't quite do so currently in certain cases. The am and apply subcommands should also behave like merge; both will need fixes. === sparse-checkout papercuts === Some simple examples showing that commands which otherwise work quite nicely with sparse-checkout can have a few rough edges: $ touch addme $ git add addme $ git ls-files -t H addme H tracked S tracked-but-maybe-skipped $ git reset --hard # usually works great error: Path 'addme' not uptodate; will not remove from working tree. HEAD is now at bdbbb6f third $ git ls-files -t H tracked S tracked-but-maybe-skipped $ ls -1 tracked So, reset --hard worked correctly, but it made the user think that it didn't with its error message. $ git add mistyped-filename fatal: pathspec 'mistyped-filename' did not match any files $ echo $? 128 $ echo changed >tracked-but-maybe-skipped $ git add tracked-but-maybe-skipped $ echo $? 0 $ git status --procelain $ So, in the case of a mistyped-filename or a glob that doesn't match any files, `git add` prints an error and returns a non-zero exit code. In the case of SKIP_WORKTREE files, `git add` (correctly) refuses to add them, but confusingly does so silently and with a clean exit status. === behavioral proposals === Short term version: * en/stash-apply-sparse-checkout: apply as-is. * mt/rm-sparse-checkout: modify it to ignore sparse.restrictCmds -- `git rm` should be like `git add` and _always_ ignore SKIP_WORKTREE paths, but it should print a warning (and return with non-zero exit code) if only SKIP_WORKTREE'd paths match the pathspec. If folks want to remove (or add) files outside current sparsity paths, they can either update their sparsity paths or use `git update-index`. * mt/grep-sparse-checkout: figure out shorter flag names. Default to --no-restrict-to-sparse, for now. Then merge it for git-2.31. Longer term version: I'll split these into categories... --> Default behavior * Default to behavior B (--no-restrict-to-sparse from mt/grep-sparse-checkout) for now. I think that's the wrong default for when we marry sparse-checkouts with partial clones, but we only have patches for behavior A for git grep; it may take a while to support behavior A in each command. Slowly changing behavior of commands with each release is problematic. We can discuss again after behavior A is fully supported what to make the defaults be. --> Commands already working with sparse-checkouts; no known bugs: * status * switch, the "switch" parts of checkout * read-tree * update-index * ls-files --> Enhancements * General * shorter flag names than --[no-]restrict-to-sparse. --dense and --sparse? --[no-]restrict? * sparse-checkout (After behavior A is implemented...) * Provide warning if sparse.restrictCmds not set (similar to git pull's warning with no pull.rebase, or git checkout's warning when detaching HEAD) * clone * Consider having clone set sparse.restrictCmds based on whether --partial is provided in addition to --sparse. --> Commands with minor bugs/annoyances: * add * print a warning if pathspec only matches SKIP_WORKTREE files (much as it already does if the pathspec matches no files) * reset --hard * spurious and incorrect warning when removing a newly added file * merge, rebase, cherry-pick, revert * unnecessary unsparsification (merge-ort should fix this) * stash * similar to merge, but there are extra bugs from the pipeline design. en/stash-apply-sparse-checkout fixes the known issues. --> Buggy commands * am * should behave like merge commands -- (1) it needs to be okay for the file to not exist in the working directory; vivify it if necessary, (2) any conflicted paths must remain vivified, (3) paths which merge cleanly can be unvivified. * apply * See am * rm * should behave like add, skipping SKIP_WORKTREE entries. See comments on mt/rm-sparse-checkout elsewhere * restore * with revisions and/or globs, sparsity patterns should be heeded * checkout * see restore --> Commands that need no changes because commits are full-tree: * archive * bundle * commit * format-patch * fast-export * fast-import * commit-tree --> Commands that would change for behavior A * bisect * Only consider commits touching paths matching sparsity patterns * diff * When given revisions, only show subset of files matching sparsity patterns. If pathspecs are given, intersect them with sparsity patterns. * log * Only consider commits touching at least one path matching sparsity patterns. If pathspecs are given, paths must match both the pathspecs and the sparsity patterns in order to be considered relevant and be shown. * gitk * See log * shortlog * See log * grep * See mt/grep-sparse-checkout; it's been discussed in detail..and is implemented. (Other than that we don't want behavior A to be the default when so many commands do not support it yet.) * show-branch * See log * whatchanged * See log * show (at least for commits) * See diff * blame * With -C or -C -C, only detect lines moved/copied from files that match the sparsity paths. * annotate * See blame. --> Commands whose behavior I'm still uncertain of: * worktree add * for behavior A (marrying sparse-checkout with partial clone), we should almost certainly copy sparsity paths from the previous worktree (we either have to do that or have some kind of specify-at-clone-time default set of sparsity paths) * for behavior B, we may also want to copy sparsity paths from the previous worktree (much like a new command line shell will copy $PWD from the previous one), but it's less clear. Should it? * range-diff * is this considered to be log-like for format-patch-like in behavior? * cherry * see range-diff * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list * should these be tweaked or always operate full-tree? * checkout-index * should it be like checkout and pay attention to sparsity paths, or be considered special like update-index, ls-files, & read-tree and write to working tree anyway? * mv * I don't think mv can take a glob, and I think it currently happens to work. Should we add a comment to the code that if anyone wants to support mv using pathspecs they might need to be careful about SKIP_WORKTREE? --> Might need changes, but who cares? * merge-file * merge-index --> Commands with no interaction with sparse-checkout: * branch * clean * describe * fetch * gc * init * maintenance * notes * pull (merge & rebase have the necessary changes) * push * submodule * tag * config * filter-branch (works in separate checkout without sparsity paths) * pack-refs * prune * remote * repack * replace * bugreport * count-objects * fsck * gitweb * help * instaweb * merge-tree * rerere * verify-commit * verify-tag * commit-graph * hash-object * index-pack * mktag * mktree * multi-pack-index * pack-objects * prune-packed * symbolic-ref * unpack-objects * update-ref * write-tree * for-each-ref * get-tar-commit-id * ls-remote * merge-base * name-rev * pack-redundant * rev-parse * show-index * show-ref * unpack-file * var * verify-pack * <Everything under 'Interacting with Others' in 'git help --all'> * <Everything under 'Low-level...Syncing' in 'git help --all'> * <Everything under 'Low-level...Internal Helpers' in 'git help --all'> * <Everything under 'External commands' in 'git help --all'>
On 12/31/2020 3:03 PM, Elijah Newren wrote: > Sorry for the long delay... Thanks for bringing us all back to the topic. > sparse-checkout's purpose is not fully defined. Does it exist to: > A) allow working on a subset of the repository? > B) allow working with a subset of the repository checked out? > C) something else? I think it's all of the above! My main focus for sparse-checkout is a way for users who care about a small fraction of a repository's files to only do work on those files. This saves time because they are asking Git to do less, but also they can use tools like IDEs that with "Open Folder" options without falling over. Writing fewer files also affects things like their OS indexing files for search or antivirus scanning written files. Others use sparse-checkout to remove a few large files unless they need them. I'm less interested in this case, myself. Both perspectives get better with partial clone because the download size shrinks significantly. While partial clone has a sparse-checkout style filter, it is hard to compute on the server side. Further, it is not very forgiving of someone wanting to change their sparse definition after cloning. Tree misses are really expensive, and I find that the extra network transfer of the full tree set is a price that is worth paying. I'm also focused on users that know that they are a part of a larger whole. They know they are operating on a large repository but focus on what they need to contribute their part. I expect multiple "roles" to use very different, almost disjoint parts of the codebase. Some other "architect" users operate across the entire tree or hop between different sections of the codebase as necessary. In this situation, I'm wary of scoping too many features to the sparse-checkout definition, especially "git log," as it can be too confusing to have their view of the codebase depend on your "point of view." (In case we _do_ start changing behavior in this way, I'm going to use the term "sparse parallax" to describe users being confused about their repositories because they have different sparse-checkout definitions, changing what they see from "git log" or "git diff".) > === Why it matters == > > There are unfortunately *many* gray areas when you try to define how git > subcommands should behave in sparse-checkouts. (The > implementation-level definition from a decade ago of "files are assumed > to be unchanged from HEAD when SKIP_WORKTREE is set, and we remove files > with that bit set from the working directory" definition from the past > provides no clear vision about how to resolve gray areas, and also leads > to various inconsistencies and surprises for users.) I believe a > definition based around a usecase (or usecases) for the purpose of > sparse-checkouts would remove most of the gray areas. > > Are there choices other than A & B that I proposed above that make > sense? Traditionally, I thought of B as just a partial implementation > of A, and that A was the desired end-goal. However, others have argued > for B as a preferred choice (some users at $DAYJOB even want both A and > B, meaning they'd like a simple short flag to switch between the two). > There may be others I'm unaware of. > > git implements neither A nor B. It might be nice to think of git's > current behavior as a partial implementation of B (enough to provide > some value, but still feel buggy/incomplete), and that after finishing B > we could add more work to allow A. I'm not sure if the current > implementation is just a subset of B, though. > > Let's dig in... I read your detailed message and I think you make some great points. I think there are three possible situations: 1. sparse-checkout should not affect the behavior at all. An example for this is "git commit". We want the root tree to contain all of the subtrees and blobs that are out of the sparse-checkout definition. The underlying object model should never change. 2. sparse-checkout should change the default, but users can opt-out. The examples I think of here are 'git grep' and 'git rm', as we have discussed recently. Having a default of "you already chose to be in a sparse-checkout, so we think this behavior is better for you" should continue to be pursued. 3. Users can opt-in to a sparse-checkout version of a behavior. The example in this case is "git diff". Perhaps we would want to see a diff scoped only to our sparse definition, but that should not be the default. It is too risky to change the output here without an explicit choice by the user. Let's get into your concrete details now: > === behavioral proposals === > > Short term version: > > * en/stash-apply-sparse-checkout: apply as-is. > > * mt/rm-sparse-checkout: modify it to ignore sparse.restrictCmds -- > `git rm` should be like `git add` and _always_ ignore > SKIP_WORKTREE paths, but it should print a warning (and return > with non-zero exit code) if only SKIP_WORKTREE'd paths match the > pathspec. If folks want to remove (or add) files outside current > sparsity paths, they can either update their sparsity paths or use > `git update-index`. > > * mt/grep-sparse-checkout: figure out shorter flag names. Default to > --no-restrict-to-sparse, for now. Then merge it for git-2.31. I don't want to derail your high-level conversation too much, but by the end of January I hope to send an RFC to create a "sparse index" which allows the index to store entries corresponding to a directory with the skip- worktree bit on. The biggest benefit is that commands like 'git status' and 'git add' will actually change their performance based on the size of the sparse-checkout definition and not the total number of paths at HEAD. The other thing that happens once we have that idea is that these behaviors in 'git grep' or 'git rm' actually become _easier_ to implement because we don't even have an immediate reference to the blobs outside of the sparse cone (assuming cone mode). The tricky part (that I'm continuing to work on, hence no RFC today) is enabling the part where a user can opt-in to the old behavior. This requires parsing trees to expand the index as necessary. A simple approach is to create an in-memory index that is the full expansion at HEAD, when necessary. It will be better to do expansions in a targeted way. (Your merge-ort algorithm is critical to the success here, since that doesn't use the index as a data structure. I expect to make merge-ort the default for users with a sparse index. Your algorithm will be done first.) My point in bringing this up is that perhaps we should pause concrete work on updating other builtins until we have a clearer idea of what a sparse index could look like and how the implementation would change based on having one or not. I hope that my RFC will be illuminating in this regard. Ok, enough of that sidebar. I thought it important to bring up, but > Longer term version: > > I'll split these into categories... > > --> Default behavior > * Default to behavior B (--no-restrict-to-sparse from > mt/grep-sparse-checkout) for now. I think that's the wrong default > for when we marry sparse-checkouts with partial clones, but we only > have patches for behavior A for git grep; it may take a while to > support behavior A in each command. Slowly changing behavior of > commands with each release is problematic. We can discuss again > after behavior A is fully supported what to make the defaults be. > > --> Commands already working with sparse-checkouts; no known bugs: > * status > * switch, the "switch" parts of checkout > > * read-tree > * update-index > * ls-files > > --> Enhancements > * General > * shorter flag names than --[no-]restrict-to-sparse. --dense and > --sparse? --[no-]restrict? --full-workdir? > * sparse-checkout (After behavior A is implemented...) > * Provide warning if sparse.restrictCmds not set (similar to git > pull's warning with no pull.rebase, or git checkout's warning when > detaching HEAD) > * clone > * Consider having clone set sparse.restrictCmds based on whether > --partial is provided in addition to --sparse. In general, we could use some strategies to help users opt-in to these new behaviors more easily. We are very close to having the only real feature of Scalar be that it sets these options automatically, and will continue to push to the newest improvements as possible. > --> Commands with minor bugs/annoyances: > * add > * print a warning if pathspec only matches SKIP_WORKTREE files (much > as it already does if the pathspec matches no files) > > * reset --hard > * spurious and incorrect warning when removing a newly added file > * merge, rebase, cherry-pick, revert > * unnecessary unsparsification (merge-ort should fix this) > * stash > * similar to merge, but there are extra bugs from the pipeline > design. en/stash-apply-sparse-checkout fixes the known issues. > > --> Buggy commands > * am > * should behave like merge commands -- (1) it needs to be okay for > the file to not exist in the working directory; vivify it if > necessary, (2) any conflicted paths must remain vivified, (3) > paths which merge cleanly can be unvivified. > * apply > * See am > * rm > * should behave like add, skipping SKIP_WORKTREE entries. See comments > on mt/rm-sparse-checkout elsewhere > * restore > * with revisions and/or globs, sparsity patterns should be heeded > * checkout > * see restore > > --> Commands that need no changes because commits are full-tree: > * archive > * bundle > * commit > * format-patch > * fast-export > * fast-import > * commit-tree > > --> Commands that would change for behavior A > * bisect > * Only consider commits touching paths matching sparsity patterns > * diff > * When given revisions, only show subset of files matching sparsity > patterns. If pathspecs are given, intersect them with sparsity > patterns. > * log > * Only consider commits touching at least one path matching sparsity > patterns. If pathspecs are given, paths must match both the > pathspecs and the sparsity patterns in order to be considered > relevant and be shown. > * gitk > * See log > * shortlog > * See log > * grep > * See mt/grep-sparse-checkout; it's been discussed in detail..and is > implemented. (Other than that we don't want behavior A to be the > default when so many commands do not support it yet.) > > * show-branch > * See log > * whatchanged > * See log > * show (at least for commits) > * See diff > > * blame > * With -C or -C -C, only detect lines moved/copied from files that match > the sparsity paths. > * annotate > * See blame. this "behavior A" idea is the one I'm most skeptical about. Creating a way to opt-in to a sparse definition might be nice. It might be nice to run "git log --simplify-sparse" to see the simplified history when only caring about commits that changed according to the current sparse-checkout definitions. Expand that more when asking for diffs as part of that log, and the way we specify the option becomes tricky. But I also want to avoid doing this as a default or even behind a config setting. We already get enough complains about "missing commits" when someone does a bad merge so "git log -- file" simplifies away a commit that exists in the full history. Imagine someone saying "on my machine, 'git log' shows the commit, but my colleague can't see it!" I would really like to avoid adding to that confusion if possible. > --> Commands whose behavior I'm still uncertain of: > * worktree add > * for behavior A (marrying sparse-checkout with partial clone), we > should almost certainly copy sparsity paths from the previous > worktree (we either have to do that or have some kind of > specify-at-clone-time default set of sparsity paths) > * for behavior B, we may also want to copy sparsity paths from the > previous worktree (much like a new command line shell will copy > $PWD from the previous one), but it's less clear. Should it? I think 'git worktree add' should at minimum continue using a sparse- checkout if the current working directory has one. Worktrees are a great way to scale the creation of multiple working directories for the same repository without re-cloning all of the history. In a partial clone case, it's really important that we don't explode the workdir in the new worktree (or even download all those blobs). Now, should we copy the sparse-checkout definitions, or start with the "only files at root" default? That's a more subtle question. > * range-diff > * is this considered to be log-like for format-patch-like in > behavior? If we stick with log acting on the full tree unless specified in the command-line options, then range-diff can be the same. Seems like a really low priority, though, because of the proximity to format-patch. > * cherry > * see range-diff > * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list > * should these be tweaked or always operate full-tree? > * checkout-index > * should it be like checkout and pay attention to sparsity paths, or > be considered special like update-index, ls-files, & read-tree and > write to working tree anyway? > * mv > * I don't think mv can take a glob, and I think it currently happens to > work. Should we add a comment to the code that if anyone wants to > support mv using pathspecs they might need to be careful about > SKIP_WORKTREE? > > --> Might need changes, but who cares? > * merge-file > * merge-index > > --> Commands with no interaction with sparse-checkout: (I agree with the list you included here.) Thanks for starting the discussion. Perhaps more will pick it up as they return from the holiday break. Thanks, -Stolee
On Sun, Jan 3, 2021 at 7:02 PM Derrick Stolee <stolee@gmail.com> wrote: > > On 12/31/2020 3:03 PM, Elijah Newren wrote: > > Sorry for the long delay... > > Thanks for bringing us all back to the topic. > > > sparse-checkout's purpose is not fully defined. Does it exist to: > > A) allow working on a subset of the repository? > > B) allow working with a subset of the repository checked out? > > C) something else? > > I think it's all of the above! I think I understand your sentiment, but since my choice C was so vague, your answer doesn't really help us figure out correct behavior for commands that currently behave weirdly. ;-) > My main focus for sparse-checkout is a way for users who care about a > small fraction of a repository's files to only do work on those files. > This saves time because they are asking Git to do less, but also they > can use tools like IDEs that with "Open Folder" options without falling > over. Writing fewer files also affects things like their OS indexing > files for search or antivirus scanning written files. Interesting, IDE failure/overload was one of the big driving factors in our adoption as well. > Others use sparse-checkout to remove a few large files unless they > need them. I'm less interested in this case, myself. > > Both perspectives get better with partial clone because the download > size shrinks significantly. While partial clone has a sparse-checkout > style filter, it is hard to compute on the server side. Further, it > is not very forgiving of someone wanting to change their sparse > definition after cloning. Tree misses are really expensive, and I find > that the extra network transfer of the full tree set is a price that is > worth paying. Out of curiosity, is that because the promisor handling doesn't do nice batching of trees to download, as is done for blobs, or is there a more fundamental reason they are really expensive? (I'm just wondering if we are risking changing design in some areas based on suboptimal implementation of other things. I don't actually have experience with partial clones yet, though, so I'm basically just querying about random but interesting things without any experience to back it up.) > I'm also focused on users that know that they are a part of a larger > whole. They know they are operating on a large repository but focus on > what they need to contribute their part. I expect multiple "roles" to > use very different, almost disjoint parts of the codebase. Some other > "architect" users operate across the entire tree or hop between different > sections of the codebase as necessary. In this situation, I'm wary of > scoping too many features to the sparse-checkout definition, especially > "git log," as it can be too confusing to have their view of the codebase > depend on your "point of view." > > (In case we _do_ start changing behavior in this way, I'm going to use > the term "sparse parallax" to describe users being confused about their > repositories because they have different sparse-checkout definitions, > changing what they see from "git log" or "git diff".) Thanks for this extra perspective. > > > === Why it matters == > > > > There are unfortunately *many* gray areas when you try to define how git > > subcommands should behave in sparse-checkouts. (The > > implementation-level definition from a decade ago of "files are assumed > > to be unchanged from HEAD when SKIP_WORKTREE is set, and we remove files > > with that bit set from the working directory" definition from the past > > provides no clear vision about how to resolve gray areas, and also leads > > to various inconsistencies and surprises for users.) I believe a > > definition based around a usecase (or usecases) for the purpose of > > sparse-checkouts would remove most of the gray areas. > > > > Are there choices other than A & B that I proposed above that make > > sense? Traditionally, I thought of B as just a partial implementation > > of A, and that A was the desired end-goal. However, others have argued > > for B as a preferred choice (some users at $DAYJOB even want both A and > > B, meaning they'd like a simple short flag to switch between the two). > > There may be others I'm unaware of. > > > > git implements neither A nor B. It might be nice to think of git's > > current behavior as a partial implementation of B (enough to provide > > some value, but still feel buggy/incomplete), and that after finishing B > > we could add more work to allow A. I'm not sure if the current > > implementation is just a subset of B, though. > > > > Let's dig in... > > I read your detailed message and I think you make some great points. > > I think there are three possible situations: > > 1. sparse-checkout should not affect the behavior at all. > > An example for this is "git commit". We want the root tree to contain > all of the subtrees and blobs that are out of the sparse-checkout > definition. The underlying object model should never change. > > 2. sparse-checkout should change the default, but users can opt-out. > > The examples I think of here are 'git grep' and 'git rm', as we have > discussed recently. Having a default of "you already chose to be in > a sparse-checkout, so we think this behavior is better for you" > should continue to be pursued. > > 3. Users can opt-in to a sparse-checkout version of a behavior. > > The example in this case is "git diff". Perhaps we would want to see > a diff scoped only to our sparse definition, but that should not be > the default. It is too risky to change the output here without an > explicit choice by the user. I'm curious why you put grep and diff in different categories. A plain "git diff" without revisions will give the same output whether or not it restricts to the sparsity paths (because the other paths are unchanged), so restricting is purely an optimization question. Making "git diff REVISION" restrict to the sparsity paths would be a behavioral change as you note, but "git grep [REVISION]" would also require a behavioral change to limit to the sparsity paths. If it's too risky to change the output for git diff with revisions, why is it not also too risky to do that with git grep with revisions? Also, I think you are missing a really important category: 4. sparse-checkout changes the behavior of commands and there is no opt-out or configurability provided. The most obvious examples are switch and checkout -- their modified behavior is really the /point/ of sparse-checkouts and if you want to "opt out" then just don't use sparse-checkouts. `reset --hard` can go in the same bucket; it's modified in the same way. However, some commands are modified in a different way, but also have no opt-out -- for example, merge, rebase, cherry-pick, revert, and stash, all "try" to avoid writing files to the working tree that match the sparsify specifications, but will vivify files which have conflicts (and maybe a few additional files based on implementation shortcomings). Another command that behaves differently than any of these, and is also non-configurable in this change, is git-add. It'll ignore any tracked files with the SKIP_WORKTREE bit set, even if the file is present. That's really helpful thing for "git add -A [GLOB_OR_DIRECTORY]" to do, as we don't want sparsity to accidentally be treated as a directive to remove files from the repository. I think more commands should fall under this fourth category as well, including rm. However, I actually think 4 deserves to be broken up into separate categories based on the type of behavior change. Thus, I'd need a 4a, 4b, and 4c for my example commands above. > Let's get into your concrete details now: > > > === behavioral proposals === > > > > Short term version: > > > > * en/stash-apply-sparse-checkout: apply as-is. > > > > * mt/rm-sparse-checkout: modify it to ignore sparse.restrictCmds -- > > `git rm` should be like `git add` and _always_ ignore > > SKIP_WORKTREE paths, but it should print a warning (and return > > with non-zero exit code) if only SKIP_WORKTREE'd paths match the > > pathspec. If folks want to remove (or add) files outside current > > sparsity paths, they can either update their sparsity paths or use > > `git update-index`. > > > > * mt/grep-sparse-checkout: figure out shorter flag names. Default to > > --no-restrict-to-sparse, for now. Then merge it for git-2.31. > > I don't want to derail your high-level conversation too much, but by the > end of January I hope to send an RFC to create a "sparse index" which allows > the index to store entries corresponding to a directory with the skip- > worktree bit on. The biggest benefit is that commands like 'git status' and > 'git add' will actually change their performance based on the size of the > sparse-checkout definition and not the total number of paths at HEAD. This is _awesome_; I think it'll be huge. It'll cause even more commands behavior to change, of course, but in a good way. And I don't consider this derailing at all but extending the discussion complete with extra investigation work. :-) > The other thing that happens once we have that idea is that these behaviors > in 'git grep' or 'git rm' actually become _easier_ to implement because we > don't even have an immediate reference to the blobs outside of the sparse > cone (assuming cone mode). > > The tricky part (that I'm continuing to work on, hence no RFC today) is > enabling the part where a user can opt-in to the old behavior. This requires > parsing trees to expand the index as necessary. A simple approach is to > create an in-memory index that is the full expansion at HEAD, when necessary. > It will be better to do expansions in a targeted way. I'm not sure if you're just thinking of the old mt/rm-sparse-checkout and commenting on it, or if you're actively disagreeing with my proposal for rm. > (Your merge-ort algorithm is critical to the success here, since that doesn't > use the index as a data structure. I expect to make merge-ort the default for > users with a sparse index. Your algorithm will be done first.) Well, at 50 added/changed lines per patch, I've only got ~50 more patches to go for ort after the ones I submitted Monday (mostly optimization related). If I submit 10 patches per week (starting next week since I already sent a big patchset this week), then maybe mid-to-late February. That's a more aggressive pace than we've managed so far, but maybe it gets easier towards the end? Anyway, hopefully that helps you with timing predictions. On my end, this does make the ort work look like there's finally some light at the end of the tunnel; I just hope it's not an oncoming train. :-) > My point in bringing this up is that perhaps we should pause concrete work on > updating other builtins until we have a clearer idea of what a sparse index > could look like and how the implementation would change based on having one > or not. I hope that my RFC will be illuminating in this regard. Are you suggesting to pause any work on those pieces of the proposal that might be affected by your sparse index, or pause any work at all on sparse-checkouts? For example, I think en/stash-apply-sparse-checkout that's been sitting in seen is good to merge down to master now. I suspect mt/rm-sparse-checkout WITH my suggested changes (no configurability -- similar to git-add) and a better warning/error message for git-add are some examples of cleanups that could be done before your sparse index, but if you're worried about conflicting I certainly don't want to derail your project. (I agree that anything with configurability and touching on "behavior A" or "sparse parallelax", like mt/grep-sparse-checkout would be better if we waited on. I do feel pretty bad for how much we've made Matheus wait on that series, but waiting does still seem best.) > Ok, enough of that sidebar. I thought it important to bring up, but > > > Longer term version: > > > > I'll split these into categories... > > > > --> Default behavior > > * Default to behavior B (--no-restrict-to-sparse from > > mt/grep-sparse-checkout) for now. I think that's the wrong default > > for when we marry sparse-checkouts with partial clones, but we only > > have patches for behavior A for git grep; it may take a while to > > support behavior A in each command. Slowly changing behavior of > > commands with each release is problematic. We can discuss again > > after behavior A is fully supported what to make the defaults be. > > > > --> Commands already working with sparse-checkouts; no known bugs: > > * status > > * switch, the "switch" parts of checkout > > > > * read-tree > > * update-index > > * ls-files > > > > --> Enhancements > > * General > > * shorter flag names than --[no-]restrict-to-sparse. --dense and > > --sparse? --[no-]restrict? > > --full-workdir? Hmm. "workdir" sounds like an abbreviation of "working directory", which is the place where the files are checked out. And the working directory is sparse in a sparse-checkout. So isn't this misleading? Or did you intend for this option to be the name for requesting a sparser set? (If so, isn't "full" in its name a bit weird?) Also, what would the inverse name of --full-workdir be? I was looking to add options for both restricting the command to the sparser set and for expanding to the full set of files. Though I guess as you note below, you perhaps might be in favor of only one of these without configuration options to adjust defaults. > > * sparse-checkout (After behavior A is implemented...) > > * Provide warning if sparse.restrictCmds not set (similar to git > > pull's warning with no pull.rebase, or git checkout's warning when > > detaching HEAD) > > * clone > > * Consider having clone set sparse.restrictCmds based on whether > > --partial is provided in addition to --sparse. > > In general, we could use some strategies to help users opt-in to these > new behaviors more easily. We are very close to having the only real > feature of Scalar be that it sets these options automatically, and will > continue to push to the newest improvements as possible. Nice! > > --> Commands with minor bugs/annoyances: > > * add > > * print a warning if pathspec only matches SKIP_WORKTREE files (much > > as it already does if the pathspec matches no files) > > > > * reset --hard > > * spurious and incorrect warning when removing a newly added file > > * merge, rebase, cherry-pick, revert > > * unnecessary unsparsification (merge-ort should fix this) > > * stash > > * similar to merge, but there are extra bugs from the pipeline > > design. en/stash-apply-sparse-checkout fixes the known issues. > > > > --> Buggy commands > > * am > > * should behave like merge commands -- (1) it needs to be okay for > > the file to not exist in the working directory; vivify it if > > necessary, (2) any conflicted paths must remain vivified, (3) > > paths which merge cleanly can be unvivified. > > * apply > > * See am > > * rm > > * should behave like add, skipping SKIP_WORKTREE entries. See comments > > on mt/rm-sparse-checkout elsewhere > > * restore > > * with revisions and/or globs, sparsity patterns should be heeded > > * checkout > > * see restore > > > > --> Commands that need no changes because commits are full-tree: > > * archive > > * bundle > > * commit > > * format-patch > > * fast-export > > * fast-import > > * commit-tree > > > > --> Commands that would change for behavior A > > * bisect > > * Only consider commits touching paths matching sparsity patterns > > * diff > > * When given revisions, only show subset of files matching sparsity > > patterns. If pathspecs are given, intersect them with sparsity > > patterns. > > * log > > * Only consider commits touching at least one path matching sparsity > > patterns. If pathspecs are given, paths must match both the > > pathspecs and the sparsity patterns in order to be considered > > relevant and be shown. > > * gitk > > * See log > > * shortlog > > * See log > > * grep > > * See mt/grep-sparse-checkout; it's been discussed in detail..and is > > implemented. (Other than that we don't want behavior A to be the > > default when so many commands do not support it yet.) > > > > * show-branch > > * See log > > * whatchanged > > * See log > > * show (at least for commits) > > * See diff > > > > * blame > > * With -C or -C -C, only detect lines moved/copied from files that match > > the sparsity paths. > > * annotate > > * See blame. > > this "behavior A" idea is the one I'm most skeptical about. Creating a > way to opt-in to a sparse definition might be nice. It might be nice to > run "git log --simplify-sparse" to see the simplified history when only > caring about commits that changed according to the current sparse-checkout > definitions. Expand that more when asking for diffs as part of that log, > and the way we specify the option becomes tricky. --simplify-sparse is a really long name to need to specify at every invocation. Also, if we have --[no]-restrict or --sparse/--dense options at the git level (rather than the subcommand level), then I think we don't want extra ones like this at the subcommand level. Also, if the option appears at the global git level, doesn't that remove the trickiness of revision traversal vs. diff outputting in commands like log? It just automatically applies to both. (The only trickiness would be if you wanted to somehow apply sparsity patterns to just revision traversal or just diff outputting but not to both, but that's already tricky in log with explicit pathspecs and we've traditionally had files restrict both.) > But I also want to avoid doing this as a default or even behind a config > setting. We already get enough complains about "missing commits" when > someone does a bad merge so "git log -- file" simplifies away a commit > that exists in the full history. Imagine someone saying "on my machine, > 'git log' shows the commit, but my colleague can't see it!" I would really > like to avoid adding to that confusion if possible. That's a good point. A really good point. Maybe we do only want to allow explicit requests for this behavior -- and thus need very short option name for it. Here's a not-even-half-baked idea for thought: What if we allowed a configuration option to control this, BUT whenever a command like diff/grep/log restricts output based on the sparsity paths due solely to the configuration option, it prints a small reminder on stderr at the beginning of the output (e.g. "Note: output limited to sparsity paths, as per sparse.restrictCmds setting")? > > --> Commands whose behavior I'm still uncertain of: > > * worktree add > > * for behavior A (marrying sparse-checkout with partial clone), we > > should almost certainly copy sparsity paths from the previous > > worktree (we either have to do that or have some kind of > > specify-at-clone-time default set of sparsity paths) > > * for behavior B, we may also want to copy sparsity paths from the > > previous worktree (much like a new command line shell will copy > > $PWD from the previous one), but it's less clear. Should it? > > I think 'git worktree add' should at minimum continue using a sparse- > checkout if the current working directory has one. Worktrees are a > great way to scale the creation of multiple working directories for > the same repository without re-cloning all of the history. In a partial > clone case, it's really important that we don't explode the workdir in > the new worktree (or even download all those blobs). Okay, sounds like you agree with me for the partial clone case -- it's necessary. But what about the non-partial clone case? I think it should adopt the sparsity in that case too, but Junio has objected in the past. I'm pretty sure Junio wasn't thinking about the partial clone case, where I think it seems obvious and compelling. But I'm not sure how best to convince him in the non-partial clone case (or maybe I already did; he didn't respond further after his initial objection). > Now, should we copy the sparse-checkout definitions, or start with the > "only files at root" default? That's a more subtle question. Ooh, good point. Even if we adopt sparsity in the new worktree, there's apparently a number of ways to do it. > > * range-diff > > * is this considered to be log-like for format-patch-like in > > behavior? > > If we stick with log acting on the full tree unless specified in the > command-line options, then range-diff can be the same. Seems like a > really low priority, though, because of the proximity to format-patch. > > > * cherry > > * see range-diff > > * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list > > * should these be tweaked or always operate full-tree? > > * checkout-index > > * should it be like checkout and pay attention to sparsity paths, or > > be considered special like update-index, ls-files, & read-tree and > > write to working tree anyway? > > * mv > > * I don't think mv can take a glob, and I think it currently happens to > > work. Should we add a comment to the code that if anyone wants to > > support mv using pathspecs they might need to be careful about > > SKIP_WORKTREE? > > > > --> Might need changes, but who cares? > > * merge-file > > * merge-index > > > > --> Commands with no interaction with sparse-checkout: > > (I agree with the list you included here.) > > Thanks for starting the discussion. Perhaps more will pick it up as > they return from the holiday break. Thanks for jumping in and pushing it much further with sparse indices (or is it sparse indexes?) I'm excited.
On 1/6/2021 2:15 PM, Elijah Newren wrote: > On Sun, Jan 3, 2021 at 7:02 PM Derrick Stolee <stolee@gmail.com> wrote: >> >> On 12/31/2020 3:03 PM, Elijah Newren wrote: >> Others use sparse-checkout to remove a few large files unless they >> need them. I'm less interested in this case, myself. >> >> Both perspectives get better with partial clone because the download >> size shrinks significantly. While partial clone has a sparse-checkout >> style filter, it is hard to compute on the server side. Further, it >> is not very forgiving of someone wanting to change their sparse >> definition after cloning. Tree misses are really expensive, and I find >> that the extra network transfer of the full tree set is a price that is >> worth paying. > > Out of curiosity, is that because the promisor handling doesn't do > nice batching of trees to download, as is done for blobs, or is there > a more fundamental reason they are really expensive? (I'm just > wondering if we are risking changing design in some areas based on > suboptimal implementation of other things. I don't actually have > experience with partial clones yet, though, so I'm basically just > querying about random but interesting things without any experience to > back it up.) GitHub doesn't support pathspec filters for partial clone because it is too expensive to calculate that initial packfile (cannot use reachability bitmaps). Even outside of that initial cost, we have problems. The biggest problem is that we ask for the tree as a one-off request. There are two ways to approach this: 1. Ask for all trees that are reachable from that tree so we can complete the tree walk (current behavior). This downloads trees we already have, most of the time. 2. Ask for only that tree and no extra objects. This causes the request count to increase significantly, especially during a 'git pull' or 'git checkout' that spans a large distance. In either case, commands like "git log -- README.md" are really bad in a treeless clone (--filter=tree:0). For the sparse-checkout case, we still need the trees outside of our sparse cone in order to construct an index, even if we never actually check out those files. (Maybe not forever, though...) And maybe the solution would be to ask the server for your missing trees in the entire history when you change sparse-checkout definition, but what does that request look like? client> I have these commits with trees according to this pathspec. client> I want these commits with trees according to a new pathspec. server> *flips table* >> I think there are three possible situations: >> >> 1. sparse-checkout should not affect the behavior at all. >> >> An example for this is "git commit". We want the root tree to contain >> all of the subtrees and blobs that are out of the sparse-checkout >> definition. The underlying object model should never change. >> >> 2. sparse-checkout should change the default, but users can opt-out. >> >> The examples I think of here are 'git grep' and 'git rm', as we have >> discussed recently. Having a default of "you already chose to be in >> a sparse-checkout, so we think this behavior is better for you" >> should continue to be pursued. >> >> 3. Users can opt-in to a sparse-checkout version of a behavior. >> >> The example in this case is "git diff". Perhaps we would want to see >> a diff scoped only to our sparse definition, but that should not be >> the default. It is too risky to change the output here without an >> explicit choice by the user. > > I'm curious why you put grep and diff in different categories. A > plain "git diff" without revisions will give the same output whether > or not it restricts to the sparsity paths (because the other paths are > unchanged), so restricting is purely an optimization question. Making > "git diff REVISION" restrict to the sparsity paths would be a > behavioral change as you note, but "git grep [REVISION]" would also > require a behavioral change to limit to the sparsity paths. If it's > too risky to change the output for git diff with revisions, why is it > not also too risky to do that with git grep with revisions? I generally think of 'grep' as being "search for something I care about" which is easier to justify scoping to sparse-checkouts. 'diff' is something that I usually think of as "compare two git objects" and it is operating on immutable data. The practical difference comes into play with a blobless partial clone: 'diff' will download blobs that need a content comparison, so the cost is relative to the number of changed paths in that region and relative to the requested output. 'grep' will download every blob reachable from the root tree. We've seen too many cases of users trying 'git grep' to search the Windows codebase and complaining that it takes too long (because they are downloading 3 million blobs one at a time). > Also, I think you are missing a really important category: > > 4. sparse-checkout changes the behavior of commands and there is no > opt-out or configurability provided. > > The most obvious examples are switch and checkout -- their modified > behavior is really the /point/ of sparse-checkouts and if you want to > "opt out" then just don't use sparse-checkouts. `reset --hard` can go > in the same bucket; it's modified in the same way. However, some > commands are modified in a different way, but also have no opt-out -- > for example, merge, rebase, cherry-pick, revert, and stash, all "try" > to avoid writing files to the working tree that match the sparsify > specifications, but will vivify files which have conflicts (and maybe > a few additional files based on implementation shortcomings). Another > command that behaves differently than any of these, and is also > non-configurable in this change, is git-add. It'll ignore any tracked > files with the SKIP_WORKTREE bit set, even if the file is present. > That's really helpful thing for "git add -A [GLOB_OR_DIRECTORY]" to > do, as we don't want sparsity to accidentally be treated as a > directive to remove files from the repository. True. Except for these, the opt-in/out is "git sparse-checkout init" and "git sparse-checkout disable". If I want "git checkout" to behave differently, then I modify my sparse-checkout definition or disable it altogether. Perhaps instead we should think of this category as the "core functionality of sparse-checkout." > I think more commands should fall under this fourth category as well, > including rm. The biggest issue with 'rm' is that users may want to use it to delete paths outside of their sparse-checkout according to a pathspec. This is especially true since it is the current behavior, so if we change it by default we might discover more complaints than the current requests for a way to limit to the sparse-checkout definition. >>> * mt/grep-sparse-checkout: figure out shorter flag names. Default to >>> --no-restrict-to-sparse, for now. Then merge it for git-2.31. >> >> I don't want to derail your high-level conversation too much, but by the >> end of January I hope to send an RFC to create a "sparse index" which allows >> the index to store entries corresponding to a directory with the skip- >> worktree bit on. The biggest benefit is that commands like 'git status' and >> 'git add' will actually change their performance based on the size of the >> sparse-checkout definition and not the total number of paths at HEAD. > > This is _awesome_; I think it'll be huge. It'll cause even more > commands behavior to change, of course, but in a good way. And I > don't consider this derailing at all but extending the discussion > complete with extra investigation work. :-) > >> The other thing that happens once we have that idea is that these behaviors >> in 'git grep' or 'git rm' actually become _easier_ to implement because we >> don't even have an immediate reference to the blobs outside of the sparse >> cone (assuming cone mode). >> >> The tricky part (that I'm continuing to work on, hence no RFC today) is >> enabling the part where a user can opt-in to the old behavior. This requires >> parsing trees to expand the index as necessary. A simple approach is to >> create an in-memory index that is the full expansion at HEAD, when necessary. >> It will be better to do expansions in a targeted way. > > I'm not sure if you're just thinking of the old mt/rm-sparse-checkout > and commenting on it, or if you're actively disagreeing with my > proposal for rm. I remember the discussion around how making 'rm' sparse-aware was more complicated than "only look at entries without CE_SKIP_WORKTREE" but it might be easier with a sparse-index. So my intention here was to see if we should _delay_ our investigation here until I can at least get a prototype ready for inspection. I'm also saying that perhaps we could redirect this discussion around how to opt-in/out of these changes. Much like your "category 4" above being "behavior expected when in a sparse-checkout," what if this behavior of restricting to the sparse set was expected when using a sparse-index instead of based on config options or run-time arguments? What if we had something like "git update-index --[no-]sparse" to toggle between the two states? That's my intention with bringing up my half-baked idea before I have code to show for it. >> (Your merge-ort algorithm is critical to the success here, since that doesn't >> use the index as a data structure. I expect to make merge-ort the default for >> users with a sparse index. Your algorithm will be done first.) > > Well, at 50 added/changed lines per patch, I've only got ~50 more > patches to go for ort after the ones I submitted Monday (mostly > optimization related). If I submit 10 patches per week (starting next > week since I already sent a big patchset this week), then maybe > mid-to-late February. That's a more aggressive pace than we've > managed so far, but maybe it gets easier towards the end? Anyway, > hopefully that helps you with timing predictions. > > On my end, this does make the ort work look like there's finally some > light at the end of the tunnel; I just hope it's not an oncoming > train. :-) While I expect to have an RFC ready at the end of the month, I expect I will be working on sparse-index for the entire 2021 calendar year before it will be fully ready to use by end-users. I expect my RFC to have fast "git status" and "git add" times, but other commands will have a guard that expands a sparse-index into a "full" index before proceeding. This protection will keep behavior consistent but will cause performance problems. Iteratively removing these guards and implementing "sparse-aware" versions of each index operation will take time and care. >> My point in bringing this up is that perhaps we should pause concrete work on >> updating other builtins until we have a clearer idea of what a sparse index >> could look like and how the implementation would change based on having one >> or not. I hope that my RFC will be illuminating in this regard. > > Are you suggesting to pause any work on those pieces of the proposal > that might be affected by your sparse index, or pause any work at all > on sparse-checkouts? For example, I think > en/stash-apply-sparse-checkout that's been sitting in seen is good to > merge down to master now. I suspect mt/rm-sparse-checkout WITH my > suggested changes (no configurability -- similar to git-add) and a > better warning/error message for git-add are some examples of cleanups > that could be done before your sparse index, but if you're worried > about conflicting I certainly don't want to derail your project. (I > agree that anything with configurability and touching on "behavior A" > or "sparse parallelax", like mt/grep-sparse-checkout would be better > if we waited on. I do feel pretty bad for how much we've made Matheus > wait on that series, but waiting does still seem best.) I don't want to hold up valuable work. It's just tricky to navigate parallel efforts in the same space. I'm asking for a little more time to get my stuff together to see if it would influence your work. But it is unreasonable for me to "squat" on the feature and keep others from making valuable improvements. >>> * shorter flag names than --[no-]restrict-to-sparse. --dense and >>> --sparse? --[no-]restrict? >> >> --full-workdir? > > Hmm. "workdir" sounds like an abbreviation of "working directory", > which is the place where the files are checked out. And the working > directory is sparse in a sparse-checkout. So isn't this misleading? > Or did you intend for this option to be the name for requesting a > sparser set? (If so, isn't "full" in its name a bit weird?) > > Also, what would the inverse name of --full-workdir be? I was looking > to add options for both restricting the command to the sparser set and > for expanding to the full set of files. Though I guess as you note > below, you perhaps might be in favor of only one of these without > configuration options to adjust defaults. Right. Perhaps --full-tree or --sparse-tree would be better? I was trying to link the adjectives "full" and "sparse" to a noun that they modify. --dense already exists in rev-list to describe a form of history simplification. >>> --> Commands that would change for behavior A >>> * bisect >>> * Only consider commits touching paths matching sparsity patterns >>> * diff >>> * When given revisions, only show subset of files matching sparsity >>> patterns. If pathspecs are given, intersect them with sparsity >>> patterns. >>> * log >>> * Only consider commits touching at least one path matching sparsity >>> patterns. If pathspecs are given, paths must match both the >>> pathspecs and the sparsity patterns in order to be considered >>> relevant and be shown. >>> * gitk >>> * See log >>> * shortlog >>> * See log >>> * grep >>> * See mt/grep-sparse-checkout; it's been discussed in detail..and is >>> implemented. (Other than that we don't want behavior A to be the >>> default when so many commands do not support it yet.) >>> >>> * show-branch >>> * See log >>> * whatchanged >>> * See log >>> * show (at least for commits) >>> * See diff >>> >>> * blame >>> * With -C or -C -C, only detect lines moved/copied from files that match >>> the sparsity paths. >>> * annotate >>> * See blame. >> >> this "behavior A" idea is the one I'm most skeptical about. Creating a >> way to opt-in to a sparse definition might be nice. It might be nice to >> run "git log --simplify-sparse" to see the simplified history when only >> caring about commits that changed according to the current sparse-checkout >> definitions. Expand that more when asking for diffs as part of that log, >> and the way we specify the option becomes tricky. > > --simplify-sparse is a really long name to need to specify at every > invocation. Also, if we have --[no]-restrict or --sparse/--dense > options at the git level (rather than the subcommand level), then I > think we don't want extra ones like this at the subcommand level. > > Also, if the option appears at the global git level, doesn't that > remove the trickiness of revision traversal vs. diff outputting in > commands like log? It just automatically applies to both. (The only > trickiness would be if you wanted to somehow apply sparsity patterns > to just revision traversal or just diff outputting but not to both, > but that's already tricky in log with explicit pathspecs and we've > traditionally had files restrict both.) > >> But I also want to avoid doing this as a default or even behind a config >> setting. We already get enough complains about "missing commits" when >> someone does a bad merge so "git log -- file" simplifies away a commit >> that exists in the full history. Imagine someone saying "on my machine, >> 'git log' shows the commit, but my colleague can't see it!" I would really >> like to avoid adding to that confusion if possible. > > That's a good point. A really good point. Maybe we do only want to > allow explicit requests for this behavior -- and thus need very short > option name for it. And even though I mentioned earlier that "having a sparse-index might be a good way to opt-in," I would still say that simplifying commit history in 'git log' or reducing diff output would still require a short command-line option. > Here's a not-even-half-baked idea for thought: What if we allowed a > configuration option to control this, BUT whenever a command like > diff/grep/log restricts output based on the sparsity paths due solely > to the configuration option, it prints a small reminder on stderr at > the beginning of the output (e.g. "Note: output limited to sparsity > paths, as per sparse.restrictCmds setting")? I'm not thrilled with this idea, but perhaps the warning can be toggled by an advice.* config option. >>> --> Commands whose behavior I'm still uncertain of: >>> * worktree add >>> * for behavior A (marrying sparse-checkout with partial clone), we >>> should almost certainly copy sparsity paths from the previous >>> worktree (we either have to do that or have some kind of >>> specify-at-clone-time default set of sparsity paths) >>> * for behavior B, we may also want to copy sparsity paths from the >>> previous worktree (much like a new command line shell will copy >>> $PWD from the previous one), but it's less clear. Should it? >> >> I think 'git worktree add' should at minimum continue using a sparse- >> checkout if the current working directory has one. Worktrees are a >> great way to scale the creation of multiple working directories for >> the same repository without re-cloning all of the history. In a partial >> clone case, it's really important that we don't explode the workdir in >> the new worktree (or even download all those blobs). > > Okay, sounds like you agree with me for the partial clone case -- it's > necessary. > > But what about the non-partial clone case? I think it should adopt > the sparsity in that case too, but Junio has objected in the past. > I'm pretty sure Junio wasn't thinking about the partial clone case, > where I think it seems obvious and compelling. But I'm not sure how > best to convince him in the non-partial clone case (or maybe I already > did; he didn't respond further after his initial objection). We might want to consider certain behavior to be on by default when enough other optional features are enabled. A philosophy such as "We see you are using partial clone and sparse-checkout, so we restricted the search in 'git grep' for your own good" might be useful here. >> Thanks for starting the discussion. Perhaps more will pick it up as >> they return from the holiday break. > > Thanks for jumping in and pushing it much further with sparse indices > (or is it sparse indexes?) I'm excited. Another way to push this discussion further would be to create a forward-looking documentation file in Documentation/technical. We could use such a documentation as a place to organize thoughts and plans, especially things like: * How sparse-checkout works and why users need it. * How it works (and doesn't work) with partial clone. * Plans for modifying behavior in sparse scenarios: - Current behavior that is wrong or suspicious. - Commands that could have different default behavior. - Commands that could have different opt-in behavior. (This section would include a description of the planned flag that modifies behavior to limit to the sparse set.) I would add a section about sparse-index into such a document, if it existed. As things get implemented, these items could be moved out of the technical documentation and into Documentation/git-sparse-checkout.txt so we have a central place for users to discover how sparse-checkout can change their behavior. Thanks, -Stolee
On Thu, Jan 7, 2021 at 4:53 AM Derrick Stolee <stolee@gmail.com> wrote: > > On 1/6/2021 2:15 PM, Elijah Newren wrote: > > On Sun, Jan 3, 2021 at 7:02 PM Derrick Stolee <stolee@gmail.com> wrote: > >> > >> On 12/31/2020 3:03 PM, Elijah Newren wrote: > >> Others use sparse-checkout to remove a few large files unless they > >> need them. I'm less interested in this case, myself. > >> > >> Both perspectives get better with partial clone because the download > >> size shrinks significantly. While partial clone has a sparse-checkout > >> style filter, it is hard to compute on the server side. Further, it > >> is not very forgiving of someone wanting to change their sparse > >> definition after cloning. Tree misses are really expensive, and I find > >> that the extra network transfer of the full tree set is a price that is > >> worth paying. > > > > Out of curiosity, is that because the promisor handling doesn't do > > nice batching of trees to download, as is done for blobs, or is there > > a more fundamental reason they are really expensive? (I'm just > > wondering if we are risking changing design in some areas based on > > suboptimal implementation of other things. I don't actually have > > experience with partial clones yet, though, so I'm basically just > > querying about random but interesting things without any experience to > > back it up.) > > GitHub doesn't support pathspec filters for partial clone because it > is too expensive to calculate that initial packfile (cannot use > reachability bitmaps). Even outside of that initial cost, we have > problems. > > The biggest problem is that we ask for the tree as a one-off request. > There are two ways to approach this: > > 1. Ask for all trees that are reachable from that tree so we can > complete the tree walk (current behavior). This downloads trees we > already have, most of the time. > > 2. Ask for only that tree and no extra objects. This causes the request > count to increase significantly, especially during a 'git pull' or > 'git checkout' that spans a large distance. > > In either case, commands like "git log -- README.md" are really bad in > a treeless clone (--filter=tree:0). > > For the sparse-checkout case, we still need the trees outside of our > sparse cone in order to construct an index, even if we never actually > check out those files. (Maybe not forever, though...) > > And maybe the solution would be to ask the server for your missing > trees in the entire history when you change sparse-checkout definition, > but what does that request look like? > > client> I have these commits with trees according to this pathspec. > client> I want these commits with trees according to a new pathspec. > server> *flips table* Ah, the good old *flips table* codepath. That'd be a fun area of the code to work on. ;-) To be serious, though, thanks for the extra info. > >> I think there are three possible situations: > >> > >> 1. sparse-checkout should not affect the behavior at all. > >> > >> An example for this is "git commit". We want the root tree to contain > >> all of the subtrees and blobs that are out of the sparse-checkout > >> definition. The underlying object model should never change. > >> > >> 2. sparse-checkout should change the default, but users can opt-out. > >> > >> The examples I think of here are 'git grep' and 'git rm', as we have > >> discussed recently. Having a default of "you already chose to be in > >> a sparse-checkout, so we think this behavior is better for you" > >> should continue to be pursued. > >> > >> 3. Users can opt-in to a sparse-checkout version of a behavior. > >> > >> The example in this case is "git diff". Perhaps we would want to see > >> a diff scoped only to our sparse definition, but that should not be > >> the default. It is too risky to change the output here without an > >> explicit choice by the user. > > > > I'm curious why you put grep and diff in different categories. A > > plain "git diff" without revisions will give the same output whether > > or not it restricts to the sparsity paths (because the other paths are > > unchanged), so restricting is purely an optimization question. Making > > "git diff REVISION" restrict to the sparsity paths would be a > > behavioral change as you note, but "git grep [REVISION]" would also > > require a behavioral change to limit to the sparsity paths. If it's > > too risky to change the output for git diff with revisions, why is it > > not also too risky to do that with git grep with revisions? > > I generally think of 'grep' as being "search for something I care about" > which is easier to justify scoping to sparse-checkouts. > > 'diff' is something that I usually think of as "compare two git objects" > and it is operating on immutable data. > > The practical difference comes into play with a blobless partial clone: > 'diff' will download blobs that need a content comparison, so the cost > is relative to the number of changed paths in that region and relative > to the requested output. 'grep' will download every blob reachable from > the root tree. We've seen too many cases of users trying 'git grep' to > search the Windows codebase and complaining that it takes too long > (because they are downloading 3 million blobs one at a time). Oh, is the primary difference here that you're complaining about a bug in git grep, where without --cached it mixes worktree search results with index search results? That's just a flat out bug that should be fixed. See https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/. grep is supposed to search the working tree by default, and *just* the working tree -- as documented from the beginning. If you specify --cached, it'll search the index. If you specify revisions, then it searches revisions. There is not supposed to currently be mixing and matching of searches from multiple areas. Someone could add such an ability to search multiple locations, but then the user should have to specify that behavior. The fact that sparse-checkouts search multiple areas is a *bug*, regardless of sparse indexes or not, regardless of "behavior A" or "behavior B", etc. The interesting bit, is whether `git grep ... REVISION` should restrict to sparsity paths. Unlike the working tree, REVISION probably has many paths that don't match the sparsity paths. Same for --cached. That part makes sense to make configurable. So, I guess my question is, after the bug in git grep is fixed that I've been railing about for nearly a year (and sadly got tied up in other changes Matheus wanted to make so that the rest could be made configurable), then do you consider the rest of git grep different from git diff? > > Also, I think you are missing a really important category: > > > > 4. sparse-checkout changes the behavior of commands and there is no > > opt-out or configurability provided. > > > > The most obvious examples are switch and checkout -- their modified > > behavior is really the /point/ of sparse-checkouts and if you want to > > "opt out" then just don't use sparse-checkouts. `reset --hard` can go > > in the same bucket; it's modified in the same way. However, some > > commands are modified in a different way, but also have no opt-out -- > > for example, merge, rebase, cherry-pick, revert, and stash, all "try" > > to avoid writing files to the working tree that match the sparsify > > specifications, but will vivify files which have conflicts (and maybe > > a few additional files based on implementation shortcomings). Another > > command that behaves differently than any of these, and is also > > non-configurable in this change, is git-add. It'll ignore any tracked > > files with the SKIP_WORKTREE bit set, even if the file is present. > > That's really helpful thing for "git add -A [GLOB_OR_DIRECTORY]" to > > do, as we don't want sparsity to accidentally be treated as a > > directive to remove files from the repository. > > True. Except for these, the opt-in/out is "git sparse-checkout init" > and "git sparse-checkout disable". If I want "git checkout" to behave > differently, then I modify my sparse-checkout definition or disable > it altogether. > > Perhaps instead we should think of this category as the "core > functionality of sparse-checkout." The core functionality of sparse-checkout has always been only partially implemented. See the last few lines of t7012. See the bugs over the years with the merge machinery and SKIP_WORKTREE entries. See the bug reports with git-stash. See my long explanation to Matheus on how git-grep without --cached or revisions is flat broken in sparse-checkouts[1]. See Junio's comments about how "the sparse checkout 'feature' itself is a hack'"[2] and that folks working in other areas didn't need to provide full support for it. So, that raises the question -- what else is "core functionality of sparse-checkout" besides what I listed above? I reject the idea that whatever is currently implemented is the bright line. The rest is certainly up for discussion, but I don't like using the idea of current behavior as the litmus test for whether something is core functionality. In fact this is the very reason why I so strongly requested the huge warning in the sparse-checkout documentation[3]. [1] https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/ [2] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/ [3] https://lore.kernel.org/git/CABPp-BEryfaeYhuUsiDTaYdRKpK6GRi7hgZ5XSTVkoHVkx2qQA@mail.gmail.com/ And I'm arguing a change in rm behavior should be core functionality of sparse-checkout; more on that below. > > I think more commands should fall under this fourth category as well, > > including rm. > > The biggest issue with 'rm' is that users may want to use it to > delete paths outside of their sparse-checkout according to a > pathspec. This is especially true since it is the current > behavior, so if we change it by default we might discover more > complaints than the current requests for a way to limit to the > sparse-checkout definition. Users cannot update files outside their sparsity paths using add, though. Even if they create the file from scratch with the necessary changes and run 'git add' on it. And that's core functionality. Users must first change their sparsity paths before trying to add such files. If we were to allow them, I think things get weird. I think that's a good model. We should require them to do the same for removing. If they come to us requesting a way to delete paths outside their sparse-checkout, we give them the same answer that we do for updating paths outside their sparse-checkout: simply change your sparsity definition first. It's a simple solution. No configurability is required here, IMO, and it makes commands more consistent with each other. Of course, that's the high-level explanation. Digging in to the details, there we should also change or add warning/error messages for both 'add' and 'rm' in some cases; see my original email in this thread for that discussion. > >>> * mt/grep-sparse-checkout: figure out shorter flag names. Default to > >>> --no-restrict-to-sparse, for now. Then merge it for git-2.31. > >> > >> I don't want to derail your high-level conversation too much, but by the > >> end of January I hope to send an RFC to create a "sparse index" which allows > >> the index to store entries corresponding to a directory with the skip- > >> worktree bit on. The biggest benefit is that commands like 'git status' and > >> 'git add' will actually change their performance based on the size of the > >> sparse-checkout definition and not the total number of paths at HEAD. > > > > This is _awesome_; I think it'll be huge. It'll cause even more > > commands behavior to change, of course, but in a good way. And I > > don't consider this derailing at all but extending the discussion > > complete with extra investigation work. :-) > > > >> The other thing that happens once we have that idea is that these behaviors > >> in 'git grep' or 'git rm' actually become _easier_ to implement because we > >> don't even have an immediate reference to the blobs outside of the sparse > >> cone (assuming cone mode). > >> > >> The tricky part (that I'm continuing to work on, hence no RFC today) is > >> enabling the part where a user can opt-in to the old behavior. This requires > >> parsing trees to expand the index as necessary. A simple approach is to > >> create an in-memory index that is the full expansion at HEAD, when necessary. > >> It will be better to do expansions in a targeted way. > > > > I'm not sure if you're just thinking of the old mt/rm-sparse-checkout > > and commenting on it, or if you're actively disagreeing with my > > proposal for rm. > > I remember the discussion around how making 'rm' sparse-aware was more > complicated than "only look at entries without CE_SKIP_WORKTREE" but > it might be easier with a sparse-index. So my intention here was to > see if we should _delay_ our investigation here until I can at least > get a prototype ready for inspection. > > I'm also saying that perhaps we could redirect this discussion around > how to opt-in/out of these changes. Much like your "category 4" above > being "behavior expected when in a sparse-checkout," what if this > behavior of restricting to the sparse set was expected when using a > sparse-index instead of based on config options or run-time arguments? All the delay was based on the configurability (which I initially thought was needed too). But after further recent investigation, I think configurability for rm's behavior is just wrong. rm should restrict to sparse paths, just like add does. I agree on delaying any new features or changes that require configurability (and I agree with you that the sparse-index could play a role in how to configure things). I just think that en/stash-apply-sparse-checkout and a modified version of mt/rm-sparse-checkout are two cases that don't require configurability and could go in first. I also think pulling the bugfix out of mt/grep-sparse-checkout (don't mix worktree and index searches) and merging it could happen now, while waiting off on all the other bits that require configuration. > What if we had something like "git update-index --[no-]sparse" to > toggle between the two states? Ooh, interesting idea. We might have to discuss the name a bit; I'm worried folks might struggle to differentiate between that command and `git sparse-checkout {init,disable}`. > That's my intention with bringing up my half-baked idea before I have > code to show for it. > > >> (Your merge-ort algorithm is critical to the success here, since that doesn't > >> use the index as a data structure. I expect to make merge-ort the default for > >> users with a sparse index. Your algorithm will be done first.) > > > > Well, at 50 added/changed lines per patch, I've only got ~50 more > > patches to go for ort after the ones I submitted Monday (mostly > > optimization related). If I submit 10 patches per week (starting next > > week since I already sent a big patchset this week), then maybe > > mid-to-late February. That's a more aggressive pace than we've > > managed so far, but maybe it gets easier towards the end? Anyway, > > hopefully that helps you with timing predictions. > > > > On my end, this does make the ort work look like there's finally some > > light at the end of the tunnel; I just hope it's not an oncoming > > train. :-) > > While I expect to have an RFC ready at the end of the month, I expect > I will be working on sparse-index for the entire 2021 calendar year > before it will be fully ready to use by end-users. I expect my RFC to > have fast "git status" and "git add" times, but other commands will > have a guard that expands a sparse-index into a "full" index before > proceeding. This protection will keep behavior consistent but will > cause performance problems. Iteratively removing these guards and > implementing "sparse-aware" versions of each index operation will take > time and care. > > >> My point in bringing this up is that perhaps we should pause concrete work on > >> updating other builtins until we have a clearer idea of what a sparse index > >> could look like and how the implementation would change based on having one > >> or not. I hope that my RFC will be illuminating in this regard. > > > > Are you suggesting to pause any work on those pieces of the proposal > > that might be affected by your sparse index, or pause any work at all > > on sparse-checkouts? For example, I think > > en/stash-apply-sparse-checkout that's been sitting in seen is good to > > merge down to master now. I suspect mt/rm-sparse-checkout WITH my > > suggested changes (no configurability -- similar to git-add) and a > > better warning/error message for git-add are some examples of cleanups > > that could be done before your sparse index, but if you're worried > > about conflicting I certainly don't want to derail your project. (I > > agree that anything with configurability and touching on "behavior A" > > or "sparse parallelax", like mt/grep-sparse-checkout would be better > > if we waited on. I do feel pretty bad for how much we've made Matheus > > wait on that series, but waiting does still seem best.) > > I don't want to hold up valuable work. It's just tricky to navigate > parallel efforts in the same space. I'm asking for a little more time > to get my stuff together to see if it would influence your work. > > But it is unreasonable for me to "squat" on the feature and keep others > from making valuable improvements. I'm totally willing to hold off on bigger changes, new features, and anything that requires configurability. I'm also willing to hold off on any bug fixes that I have reason to think might conflict with your work. However, I think that several bug fixes would be independent of your current work. For example, the already-submitted en/stash-apply-sparse-checkout from a month ago (which I think should be merged to master), and fixing up reset --hard. You know more about me than your changes, though; do you have reason to think these might conflict? > >>> * shorter flag names than --[no-]restrict-to-sparse. --dense and > >>> --sparse? --[no-]restrict? > >> > >> --full-workdir? > > > > Hmm. "workdir" sounds like an abbreviation of "working directory", > > which is the place where the files are checked out. And the working > > directory is sparse in a sparse-checkout. So isn't this misleading? > > Or did you intend for this option to be the name for requesting a > > sparser set? (If so, isn't "full" in its name a bit weird?) > > > > Also, what would the inverse name of --full-workdir be? I was looking > > to add options for both restricting the command to the sparser set and > > for expanding to the full set of files. Though I guess as you note > > below, you perhaps might be in favor of only one of these without > > configuration options to adjust defaults. > > Right. Perhaps --full-tree or --sparse-tree would be better? I was > trying to link the adjectives "full" and "sparse" to a noun that they > modify. > > --dense already exists in rev-list to describe a form of history > simplification. I still think the flag should be a git global flag (like --no-pager, --work-tree, or --git-dir), not a subcommand flag, so rev-list's --dense flag isn't a collision even if we pick --dense/--sparse. --full-tree and --sparse-tree are good considerations though to add to the list we've come up with so far: --{full,sparse}-tree --dense/--sparse --[no-]restrict --[no-]restrict-to-sparse-paths > >>> --> Commands that would change for behavior A > >>> * bisect > >>> * Only consider commits touching paths matching sparsity patterns > >>> * diff > >>> * When given revisions, only show subset of files matching sparsity > >>> patterns. If pathspecs are given, intersect them with sparsity > >>> patterns. > >>> * log > >>> * Only consider commits touching at least one path matching sparsity > >>> patterns. If pathspecs are given, paths must match both the > >>> pathspecs and the sparsity patterns in order to be considered > >>> relevant and be shown. > >>> * gitk > >>> * See log > >>> * shortlog > >>> * See log > >>> * grep > >>> * See mt/grep-sparse-checkout; it's been discussed in detail..and is > >>> implemented. (Other than that we don't want behavior A to be the > >>> default when so many commands do not support it yet.) > >>> > >>> * show-branch > >>> * See log > >>> * whatchanged > >>> * See log > >>> * show (at least for commits) > >>> * See diff > >>> > >>> * blame > >>> * With -C or -C -C, only detect lines moved/copied from files that match > >>> the sparsity paths. > >>> * annotate > >>> * See blame. > >> > >> this "behavior A" idea is the one I'm most skeptical about. Creating a > >> way to opt-in to a sparse definition might be nice. It might be nice to > >> run "git log --simplify-sparse" to see the simplified history when only > >> caring about commits that changed according to the current sparse-checkout > >> definitions. Expand that more when asking for diffs as part of that log, > >> and the way we specify the option becomes tricky. > > > > --simplify-sparse is a really long name to need to specify at every > > invocation. Also, if we have --[no]-restrict or --sparse/--dense > > options at the git level (rather than the subcommand level), then I > > think we don't want extra ones like this at the subcommand level. > > > > Also, if the option appears at the global git level, doesn't that > > remove the trickiness of revision traversal vs. diff outputting in > > commands like log? It just automatically applies to both. (The only > > trickiness would be if you wanted to somehow apply sparsity patterns > > to just revision traversal or just diff outputting but not to both, > > but that's already tricky in log with explicit pathspecs and we've > > traditionally had files restrict both.) > > > >> But I also want to avoid doing this as a default or even behind a config > >> setting. We already get enough complains about "missing commits" when > >> someone does a bad merge so "git log -- file" simplifies away a commit > >> that exists in the full history. Imagine someone saying "on my machine, > >> 'git log' shows the commit, but my colleague can't see it!" I would really > >> like to avoid adding to that confusion if possible. > > > > That's a good point. A really good point. Maybe we do only want to > > allow explicit requests for this behavior -- and thus need very short > > option name for it. > > And even though I mentioned earlier that "having a sparse-index might > be a good way to opt-in," I would still say that simplifying commit > history in 'git log' or reducing diff output would still require a > short command-line option. > > > Here's a not-even-half-baked idea for thought: What if we allowed a > > configuration option to control this, BUT whenever a command like > > diff/grep/log restricts output based on the sparsity paths due solely > > to the configuration option, it prints a small reminder on stderr at > > the beginning of the output (e.g. "Note: output limited to sparsity > > paths, as per sparse.restrictCmds setting")? > > I'm not thrilled with this idea, but perhaps the warning can be > toggled by an advice.* config option. You've noted multiple times that you're leery of even providing such a configuration option -- and you provided good rationale to be worried about it. So, if we did use this escape hatch to make it sane to allow such a configuration option, I'd say this is probably one case where we do not allow this advice to be turned off. (In fact, we'd not only avoid implementing it, we'd be careful to document that it's intentionally not implemented so that someone doesn't come along later and assume we just overlooked it.) But that's still an 'if' and is just an idea I threw out there for us to consider over the next year or two; we may well require the short command-line option every time as you say. > >>> --> Commands whose behavior I'm still uncertain of: > >>> * worktree add > >>> * for behavior A (marrying sparse-checkout with partial clone), we > >>> should almost certainly copy sparsity paths from the previous > >>> worktree (we either have to do that or have some kind of > >>> specify-at-clone-time default set of sparsity paths) > >>> * for behavior B, we may also want to copy sparsity paths from the > >>> previous worktree (much like a new command line shell will copy > >>> $PWD from the previous one), but it's less clear. Should it? > >> > >> I think 'git worktree add' should at minimum continue using a sparse- > >> checkout if the current working directory has one. Worktrees are a > >> great way to scale the creation of multiple working directories for > >> the same repository without re-cloning all of the history. In a partial > >> clone case, it's really important that we don't explode the workdir in > >> the new worktree (or even download all those blobs). > > > > Okay, sounds like you agree with me for the partial clone case -- it's > > necessary. > > > > But what about the non-partial clone case? I think it should adopt > > the sparsity in that case too, but Junio has objected in the past. > > I'm pretty sure Junio wasn't thinking about the partial clone case, > > where I think it seems obvious and compelling. But I'm not sure how > > best to convince him in the non-partial clone case (or maybe I already > > did; he didn't respond further after his initial objection). > > We might want to consider certain behavior to be on by default when > enough other optional features are enabled. A philosophy such as "We > see you are using partial clone and sparse-checkout, so we restricted > the search in 'git grep' for your own good" might be useful here. You're dodging the question. ;-) We already agree on the behavior for sparse-checkouts with worktrees when partial clones are in effect; it seems obvious any newly added worktree needs to also be sparse in such a case. The whole question was what about the non-partial-clone case. I'm of the opinion that it makes sense there too (if for no other reason than making it easy to explain and understand what worktree does without having to provide a list of cases to users), but was curious if others had thoughts on the matter. Maybe you don't care since you always use partial clones? > >> Thanks for starting the discussion. Perhaps more will pick it up as > >> they return from the holiday break. > > > > Thanks for jumping in and pushing it much further with sparse indices > > (or is it sparse indexes?) I'm excited. > > Another way to push this discussion further would be to create a > forward-looking documentation file in Documentation/technical. > We could use such a documentation as a place to organize thoughts > and plans, especially things like: > > * How sparse-checkout works and why users need it. > * How it works (and doesn't work) with partial clone. > * Plans for modifying behavior in sparse scenarios: > - Current behavior that is wrong or suspicious. > - Commands that could have different default behavior. > - Commands that could have different opt-in behavior. > (This section would include a description of the planned flag > that modifies behavior to limit to the sparse set.) > > I would add a section about sparse-index into such a document, if it > existed. > > As things get implemented, these items could be moved out of the > technical documentation and into Documentation/git-sparse-checkout.txt > so we have a central place for users to discover how sparse-checkout > can change their behavior. Yeah, good idea. This thread has a lot of that information, so I'd be happy to start up such a document after I both hear back from Matheus and hear back from you on my rm & grep bug fix ideas.
diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt index 494761526e..79d7d173e9 100644 --- a/Documentation/config/sparse.txt +++ b/Documentation/config/sparse.txt @@ -12,7 +12,8 @@ When this option is true (default), some git commands may limit their behavior to the paths specified by the sparsity patterns, or to the intersection of those paths and any (like `*.c`) that the user might also specify on the command line. When false, the affected commands will work on full trees, -ignoring the sparsity patterns. For now, only git-grep honors this setting. +ignoring the sparsity patterns. For now, only git-grep and git-rm honor this +setting. + Note: commands which export, integrity check, or create history will always operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.), diff --git a/Documentation/git-rm.txt b/Documentation/git-rm.txt index ab750367fd..25dda8ff35 100644 --- a/Documentation/git-rm.txt +++ b/Documentation/git-rm.txt @@ -25,6 +25,15 @@ When `--cached` is given, the staged content has to match either the tip of the branch or the file on disk, allowing the file to be removed from just the index. +CONFIGURATION +------------- + +sparse.restrictCmds:: + By default, git-rm only matches and removes paths within the + sparse-checkout patterns. This behavior can be changed with the + `sparse.restrictCmds` setting or the global + `--no-restrict-to-sparse-paths` option. For more details, see the + full `sparse.restrictCmds` definition in linkgit:git-config[1]. OPTIONS ------- diff --git a/builtin/rm.c b/builtin/rm.c index 4858631e0f..e1fe71c321 100644 --- a/builtin/rm.c +++ b/builtin/rm.c @@ -14,6 +14,7 @@ #include "string-list.h" #include "submodule.h" #include "pathspec.h" +#include "sparse-checkout.h" static const char * const builtin_rm_usage[] = { N_("git rm [<options>] [--] <file>..."), @@ -254,7 +255,7 @@ static struct option builtin_rm_options[] = { int cmd_rm(int argc, const char **argv, const char *prefix) { struct lock_file lock_file = LOCK_INIT; - int i; + int i, sparse_paths_only; struct pathspec pathspec; char *seen; @@ -293,8 +294,12 @@ int cmd_rm(int argc, const char **argv, const char *prefix) seen = xcalloc(pathspec.nr, 1); + sparse_paths_only = restrict_to_sparse_paths(the_repository); + for (i = 0; i < active_nr; i++) { const struct cache_entry *ce = active_cache[i]; + if (sparse_paths_only && ce_skip_worktree(ce)) + continue; if (!ce_path_match(&the_index, ce, &pathspec, seen)) continue; ALLOC_GROW(list.entry, list.nr + 1, list.alloc); diff --git a/t/t3600-rm.sh b/t/t3600-rm.sh index efec8d13b6..7bf55b42eb 100755 --- a/t/t3600-rm.sh +++ b/t/t3600-rm.sh @@ -892,4 +892,26 @@ test_expect_success 'rm empty string should fail' ' test_must_fail git rm -rf "" ' +test_expect_success 'rm should respect --[no]-restrict-to-sparse-paths' ' + git init sparse-repo && + ( + cd sparse-repo && + touch a b c && + git add -A && + git commit -m files && + git sparse-checkout set "/a" && + + # By default, it should not rm paths outside the sparse-checkout + test_must_fail git rm b 2>stderr && + test_i18ngrep "fatal: pathspec .b. did not match any files" stderr && + + # But it should rm them with --no-restrict-to-sparse-paths + git --no-restrict-to-sparse-paths rm b && + + # And also with sparse.restrictCmds=false + git reset && + git -c sparse.restrictCmds=false rm b + ) +' + test_done diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh index 26852586ac..1761a2b1b9 100755 --- a/t/t7011-skip-worktree-reading.sh +++ b/t/t7011-skip-worktree-reading.sh @@ -132,11 +132,6 @@ test_expect_success 'diff-files does not examine skip-worktree dirty entries' ' test -z "$(git diff-files -- one)" ' -test_expect_success 'git-rm succeeds on skip-worktree absent entries' ' - setup_absent && - git rm 1 -' - test_expect_success 'commit on skip-worktree absent entries' ' git reset && setup_absent &&
Make git-rm honor the 'sparse.restrictCmds' setting, by restricting its operation to the paths that match both the command line pathspecs and the repository's sparsity patterns. This better matches the expectations of users with sparse-checkout definitions, while still allowing them to optionally enable the old behavior with 'sparse.restrictCmds=false' or the global '--no-restrict-to-sparse-paths' option. Suggested-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br> --- This is based on mt/grep-sparse-checkout. Original feature request: https://github.com/gitgitgadget/git/issues/786 Documentation/config/sparse.txt | 3 ++- Documentation/git-rm.txt | 9 +++++++++ builtin/rm.c | 7 ++++++- t/t3600-rm.sh | 22 ++++++++++++++++++++++ t/t7011-skip-worktree-reading.sh | 5 ----- 5 files changed, 39 insertions(+), 7 deletions(-)