diff mbox series

unpack-trees: do not set SKIP_WORKTREE on submodules

Message ID pull.809.git.git.1592356884310.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series unpack-trees: do not set SKIP_WORKTREE on submodules | expand

Commit Message

Elijah Newren via GitGitGadget June 17, 2020, 1:21 a.m. UTC
From: Elijah Newren <newren@gmail.com>

As noted in commit e7d7c73249 ("git-sparse-checkout: clarify
interactions with submodules", 2020-06-10), sparse-checkout cannot
remove submodules even if they don't match the sparsity patterns,
because doing so would risk data loss -- unpushed, uncommitted, or
untracked changes could all be lost.  That commit also updated the
documentation to point out that submodule initialization state was a
parallel, orthogonal reason that entries in the index might not be
present in the working tree.

However, sparsity and submodule initialization weren't actually fully
orthogonal yet.  The SKIP_WORKTREE handling in unpack_trees would
attempt to set the SKIP_WORKTREE bit on submodules when the submodule
did not match the sparsity patterns.  This resulted in innocuous but
potentially alarming warning messages:

    warning: unable to rmdir 'sha1collisiondetection': Directory not empty

It could also make things confusing since the entry would be marked as
SKIP_WORKTREE in the index but actually still be present in the working
tree:

    $ git ls-files -t | grep sha1collisiondetection
    S sha1collisiondetection
    $ ls -al sha1collisiondetection/ | wc -l
    13

Submodules have always been their own form of "partial checkout"
behavior, with their own "present or not" state determined by running
"git submodule [init|deinit|update]".  Enforce that separation by having
the SKIP_WORKTREE logic not touch submodules and allow submodules to
continue using their own initialization state for determining if the
submodule is present.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    unpack-trees: do not set SKIP_WORKTREE on submodules
    
    Interactions between submodules and sparsity patterns have come up a few
    times, with a certain edge case coming up multiple times recently:
    should a submodule have the SKIP_WORKTREE bit set if the submodule path
    does not match the sparsity patterns?[1][2][3].
    
    Here I try to resolve the question, with the answer of 'no'.
    
    This patch depends on en/sparse-with-submodule-doc lightly -- the commit
    message in this patch references the commit from that other series. It
    could possibly be considered an addition to that other topic, but
    "sparse-with-submodule-doc" implies the other topic is only a
    documentation change, whereas this patch involves a code change.
    
    [1] 
    https://lore.kernel.org/git/pull.805.git.git.1591831009762.gitgitgadget@gmail.com/
    
    > Since submodules may have unpushed changes or untracked files,
    > removing them could result in data loss. Thus, changing sparse
    > inclusion/exclusion rules will not cause an already checked out
    > submodule to be removed from the working copy. Said another way, just
    > as checkout will not cause submodules to be automatically removed or
    > initialized even when switching between branches that remove or add
    > submodules, using sparse-checkout to reduce or expand the scope of
    > "interesting" files will not cause submodules to be automatically
    > deinitialized or initialized either.
    
    [2] 
    https://lore.kernel.org/git/CABPp-BE+BL3Nq=Co=-kNB_wr=6gqX8zcGwa0ega_pGBpk6xYsg@mail.gmail.com/
    
    > If sparsity patterns would exclude a submodule that is initialized,
    > sparse-checkout clearly can't remove the submodule. However, should it
    > set the SKIP_WORKTREE bit for that submodule if it's not going to
    > remove it?
    
    [3] 
    https://lore.kernel.org/git/CABPp-BFJG7uFAZNidDPK2o7eHv-eYBsmcdnVxkOnKcZo7WzmFQ@mail.gmail.com/
    
    >> Or if you don't deinitialize a submodule that is excluded by the
    >> sparsity patterns (thus remaining in the working copy, anyway).
    >
    > This case requires more thought. If a submodule doesn't match the
    > sparsity patterns, we already said elsewhere that sparse-checkout
    > should not remove the submodule (since doing so would risk data loss).
    > But do we set the SKIP_WORKTREE bit for it? Generally, sparse-checkout
    > avoids removing files with modifications, and if it doesn't remove
    > them it also doesn't set the SKIP_WORKTREE bit. For consistency,
    > should sparse-checkout not set SKIP_WORKTREE for initialized
    > submodules?

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-809%2Fnewren%2Fsparse-submodule-no-skip-worktree-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-809/newren/sparse-submodule-no-skip-worktree-v1
Pull-Request: https://github.com/git/git/pull/809

 unpack-trees.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


base-commit: eebb51ba8cab97c0b3f3f18eaab7796803b8494b

Comments

Matheus Tavares June 17, 2020, 5:58 p.m. UTC | #1
On Tue, Jun 16, 2020 at 10:21 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Elijah Newren <newren@gmail.com>
>
> As noted in commit e7d7c73249 ("git-sparse-checkout: clarify
> interactions with submodules", 2020-06-10), sparse-checkout cannot
> remove submodules even if they don't match the sparsity patterns,
> because doing so would risk data loss -- unpushed, uncommitted, or
> untracked changes could all be lost.  That commit also updated the
> documentation to point out that submodule initialization state was a
> parallel, orthogonal reason that entries in the index might not be
> present in the working tree.
>
> However, sparsity and submodule initialization weren't actually fully
> orthogonal yet.  The SKIP_WORKTREE handling in unpack_trees would
> attempt to set the SKIP_WORKTREE bit on submodules when the submodule
> did not match the sparsity patterns.  This resulted in innocuous but
> potentially alarming warning messages:
>
>     warning: unable to rmdir 'sha1collisiondetection': Directory not empty
>
> It could also make things confusing since the entry would be marked as
> SKIP_WORKTREE in the index but actually still be present in the working
> tree:
>
>     $ git ls-files -t | grep sha1collisiondetection
>     S sha1collisiondetection
>     $ ls -al sha1collisiondetection/ | wc -l
>     13
>
> Submodules have always been their own form of "partial checkout"
> behavior, with their own "present or not" state determined by running
> "git submodule [init|deinit|update]".  Enforce that separation by having
> the SKIP_WORKTREE logic not touch submodules and allow submodules to
> continue using their own initialization state for determining if the
> submodule is present.

Makes sense to me.

I'm just thinking about the possible implications in grep (with
mt/grep-sparse-checkout). As you mentioned in [1], users might think
of "git grep $rev $pat" as an optimized version of "git checkout $rev
&& git grep $pat". And, in this sense, they probably expect the output
of these two operations to be equal. But if we don't set the
SKIP_WORKTREE bit for submodules they might diverge.

As an example, if we have a repository like:
.
|-- A
|   `-- sub
`-- B

And the [cone-mode] sparsity rules:
/*
!/*/
/B/

Then, "git grep --recurse-submodules $rev $pat" would search only in B
(as A doesn't match the sparsity patterns and thus, is not recursed
into). But "git checkout $rev && git grep --recurse-submodules $pat"
would search in both B and A/sub (as the latter would not have the
SKIP_WORKTREE bit set).

This might be a problem for git-grep, not git-sparse-checkout. But I'm
not sure how we could solve it efficiently, as the submodule might be
deep down in a path whose first dir was already ignored for not
matching the sparsity patterns. Is this a problem we should consider,
or is it OK if the outputs of these two operations diverge?

[1]: https://lore.kernel.org/git/CABPp-BFKHivKffBPO0M_s-JtpiLyEMLZr+sX9_yZk9ZX7ULtbw@mail.gmail.com/
Elijah Newren June 18, 2020, 12:24 a.m. UTC | #2
Hi Matheus,

On Wed, Jun 17, 2020 at 10:58 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Jun 16, 2020 at 10:21 PM Elijah Newren via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > From: Elijah Newren <newren@gmail.com>
> >
> > As noted in commit e7d7c73249 ("git-sparse-checkout: clarify
> > interactions with submodules", 2020-06-10), sparse-checkout cannot
> > remove submodules even if they don't match the sparsity patterns,
> > because doing so would risk data loss -- unpushed, uncommitted, or
> > untracked changes could all be lost.  That commit also updated the
> > documentation to point out that submodule initialization state was a
> > parallel, orthogonal reason that entries in the index might not be
> > present in the working tree.
> >
> > However, sparsity and submodule initialization weren't actually fully
> > orthogonal yet.  The SKIP_WORKTREE handling in unpack_trees would
> > attempt to set the SKIP_WORKTREE bit on submodules when the submodule
> > did not match the sparsity patterns.  This resulted in innocuous but
> > potentially alarming warning messages:
> >
> >     warning: unable to rmdir 'sha1collisiondetection': Directory not empty
> >
> > It could also make things confusing since the entry would be marked as
> > SKIP_WORKTREE in the index but actually still be present in the working
> > tree:
> >
> >     $ git ls-files -t | grep sha1collisiondetection
> >     S sha1collisiondetection
> >     $ ls -al sha1collisiondetection/ | wc -l
> >     13
> >
> > Submodules have always been their own form of "partial checkout"
> > behavior, with their own "present or not" state determined by running
> > "git submodule [init|deinit|update]".  Enforce that separation by having
> > the SKIP_WORKTREE logic not touch submodules and allow submodules to
> > continue using their own initialization state for determining if the
> > submodule is present.
>
> Makes sense to me.
>
> I'm just thinking about the possible implications in grep (with
> mt/grep-sparse-checkout). As you mentioned in [1], users might think
> of "git grep $rev $pat" as an optimized version of "git checkout $rev
> && git grep $pat". And, in this sense, they probably expect the output
> of these two operations to be equal. But if we don't set the
> SKIP_WORKTREE bit for submodules they might diverge.
>
> As an example, if we have a repository like:
> .
> |-- A
> |   `-- sub
> `-- B
>
> And the [cone-mode] sparsity rules:
> /*
> !/*/
> /B/
>
> Then, "git grep --recurse-submodules $rev $pat" would search only in B
> (as A doesn't match the sparsity patterns and thus, is not recursed
> into). But "git checkout $rev && git grep --recurse-submodules $pat"
> would search in both B and A/sub (as the latter would not have the
> SKIP_WORKTREE bit set).
>
> This might be a problem for git-grep, not git-sparse-checkout. But I'm
> not sure how we could solve it efficiently, as the submodule might be
> deep down in a path whose first dir was already ignored for not
> matching the sparsity patterns. Is this a problem we should consider,
> or is it OK if the outputs of these two operations diverge?

You always have an eagle eye for catching these things.  Good point.

So, ignoring submodules for a second... In the case of unmerged files
that do not match the sparsity patterns, we previously felt it was
fine that
    git grep $revision $pattern
won't search through $revision:${currently_unmerged_file} despite the fact that
    git grep $pattern
will search through ${currently_unmerged_file} from the working directory.

If we follow that analogy, then we're fine having those two "almost
command equivalents" diverge.

Let's dig a little further anyway and compare these "almost command
equivalents" a bit more, which as a reminder are (1) "git grep
--recurse-submodules $rev $pat" vs. (2) "git checkout $rev && git grep
--recurse-submodules $pat".  (I just wanted to give them simple
numerical labels.)  In all cases, we'll be considering a $FILENAME
which does not match sparsity patterns but is present in the working
directory for reasons listed below:

First case: $FILENAME has uncommitted changes:
(Command 1) This will search only through files in $REV that match
sparsity patterns; $REV:$FILENAME will be excluded
(Command 2) If $FILENAME is different between HEAD and $REV this fails
and shows nothing, even if other paths in the sparsity patterns had
matches.  Otherwise, it searches through $FILENAME.

If the command is modified to add a -m to the `git checkout` command
to make the checkout not fail, then the second command will always
search through $REV:$FILENAME, as modified by the dirty changes being
merged in.  If the user cleans up the file with uncommitted changes
first, then the checkout would make the file go away and thus it
wouldn't be searched after the switch.

Second case: $FILENAME has unmerged changes:
(Command 1) This will search through files in $REV that match sparsity
patterns; $REV:$FILENAME will be excluded.
(Command 2) This will always fail and show nothing, regardless of any
other matches.

Here the checkout command always fails due to the unmerged entries and
won't even allow the user to switch with -m.

So, my comparison of commands #1 and #2 isn't quite right, even for
non-submodule cases.  But let's dig a little further.  If someone were
to try do change their sparsity patterns or even just run a "git
sparse-checkout reapply" when they had the above issues, they'd see
something like:

    $ git sparse-checkout reapply
    warning: The following paths are unmerged and were left despite
sparse patterns:
            filename_with_conflicts

    After fixing the above paths, you may want to run `git
sparse-checkout reapply`.

This basically suggests that we consider uncommitted and unmerged
files to be "unclean" in some way (sparse-checkout wants to set the
SKIP_WORKTREE bit on all files that do not match the sparsity
specification, so "clean" means sparse-checkout is able to do so).  So
I could amend my earlier comparison and say that IF the user has a
clean directory, then "git grep --recurse-submodules $REVISION
$PATTERN" should be equivalent to "git checkout $REVISION && git grep
--recurse-submodules $PATTERN".  I could also say that given the big
warnings we give users when we can't set the SKIP_WORKTREE bit, that
we expect it to be a transient state and thus that we expect them to
more likely than not clear it out by the time they do switch branches.
That would lead us to the follow-up rule that if the user does not
have a clean directory then "git grep --recurse-submodules $REVISION
$PATTERN" should be equivalent to what you would get if the unclean
entries were ignored (expecting them to be cleaned before the any `git
checkout` could be run) and you then otherwise ran "git checkout
$REVISION && git grep --recurse-submodules $PATTERN".

That suggests that grep's implementation we agreed on earlier is still
correct (when given a $REVISION ignore submodulees that do not match
the sparsity patterns), but that unpack-trees/sparse-checkout still
need an update:

When we notice an initialized submodule that does not match the
sparsity patterns, we should print a warning just like we do for
unmerged and dirty entries.

So my patch needs a bit more work.
Matheus Tavares June 18, 2020, 2:34 p.m. UTC | #3
On Wed, Jun 17, 2020 at 9:24 PM Elijah Newren <newren@gmail.com> wrote:
>
> If someone were
> to try do change their sparsity patterns or even just run a "git
> sparse-checkout reapply" when they had the above issues, they'd see
> something like:
>
>     $ git sparse-checkout reapply
>     warning: The following paths are unmerged and were left despite
> sparse patterns:
>             filename_with_conflicts
>
>     After fixing the above paths, you may want to run `git
> sparse-checkout reapply`.
>
> This basically suggests that we consider uncommitted and unmerged
> files to be "unclean" in some way (sparse-checkout wants to set the
> SKIP_WORKTREE bit on all files that do not match the sparsity
> specification, so "clean" means sparse-checkout is able to do so).  So
> I could amend my earlier comparison and say that IF the user has a
> clean directory, then "git grep --recurse-submodules $REVISION
> $PATTERN" should be equivalent to "git checkout $REVISION && git grep
> --recurse-submodules $PATTERN".  I could also say that given the big
> warnings we give users when we can't set the SKIP_WORKTREE bit, that
> we expect it to be a transient state and thus that we expect them to
> more likely than not clear it out by the time they do switch branches.
> That would lead us to the follow-up rule that if the user does not
> have a clean directory then "git grep --recurse-submodules $REVISION
> $PATTERN" should be equivalent to what you would get if the unclean
> entries were ignored (expecting them to be cleaned before the any `git
> checkout` could be run) and you then otherwise ran "git checkout
> $REVISION && git grep --recurse-submodules $PATTERN".

Makes sense, thanks! We haven't mentioned "git grep --cached" yet, but
it would behave in the same way of the worktree grep, in this case.
(I.e. searching the submodules, as their SKIP_WORTREE bit was not
set.) So I guess it should be fine, as well.

> That suggests that grep's implementation we agreed on earlier is still
> correct (when given a $REVISION ignore submodulees that do not match
> the sparsity patterns), but that unpack-trees/sparse-checkout still
> need an update:
>
> When we notice an initialized submodule that does not match the
> sparsity patterns, we should print a warning just like we do for
> unmerged and dirty entries.

Yeah, seems like a good approach. Thanks for the explanations. Some of
the test cases in mt/grep-sparse-checkout will have to be adjusted
with this change. Should I reroll the series based on the patch you
will send or do you prefer to adjust them in your patch (and make it
dependent on mt/grep-sparse-checkout)?

Thanks,
Matheus
Elijah Newren June 18, 2020, 7:19 p.m. UTC | #4
On Thu, Jun 18, 2020 at 7:34 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Wed, Jun 17, 2020 at 9:24 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > If someone were
> > to try do change their sparsity patterns or even just run a "git
> > sparse-checkout reapply" when they had the above issues, they'd see
> > something like:
> >
> >     $ git sparse-checkout reapply
> >     warning: The following paths are unmerged and were left despite
> > sparse patterns:
> >             filename_with_conflicts
> >
> >     After fixing the above paths, you may want to run `git
> > sparse-checkout reapply`.
> >
> > This basically suggests that we consider uncommitted and unmerged
> > files to be "unclean" in some way (sparse-checkout wants to set the
> > SKIP_WORKTREE bit on all files that do not match the sparsity
> > specification, so "clean" means sparse-checkout is able to do so).  So
> > I could amend my earlier comparison and say that IF the user has a
> > clean directory, then "git grep --recurse-submodules $REVISION
> > $PATTERN" should be equivalent to "git checkout $REVISION && git grep
> > --recurse-submodules $PATTERN".  I could also say that given the big
> > warnings we give users when we can't set the SKIP_WORKTREE bit, that
> > we expect it to be a transient state and thus that we expect them to
> > more likely than not clear it out by the time they do switch branches.
> > That would lead us to the follow-up rule that if the user does not
> > have a clean directory then "git grep --recurse-submodules $REVISION
> > $PATTERN" should be equivalent to what you would get if the unclean
> > entries were ignored (expecting them to be cleaned before the any `git
> > checkout` could be run) and you then otherwise ran "git checkout
> > $REVISION && git grep --recurse-submodules $PATTERN".
>
> Makes sense, thanks! We haven't mentioned "git grep --cached" yet, but
> it would behave in the same way of the worktree grep, in this case.
> (I.e. searching the submodules, as their SKIP_WORTREE bit was not
> set.) So I guess it should be fine, as well.

:-)

> > That suggests that grep's implementation we agreed on earlier is still
> > correct (when given a $REVISION ignore submodulees that do not match
> > the sparsity patterns), but that unpack-trees/sparse-checkout still
> > need an update:
> >
> > When we notice an initialized submodule that does not match the
> > sparsity patterns, we should print a warning just like we do for
> > unmerged and dirty entries.
>
> Yeah, seems like a good approach. Thanks for the explanations. Some of
> the test cases in mt/grep-sparse-checkout will have to be adjusted
> with this change. Should I reroll the series based on the patch you
> will send or do you prefer to adjust them in your patch (and make it
> dependent on mt/grep-sparse-checkout)?

Ah, good catch.  Your series came first, is longer, and is reviewed
other than the submodule/config stuff that needs someone more familiar
with that area than me.  Since my patch needs more work anyway, how
about I rebase my patch on top of your work, and make sure to ping you
as a reviewer to make sure I don't mess anything up?

Thanks,
Elijah
Matheus Tavares June 18, 2020, 8:29 p.m. UTC | #5
On Thu, Jun 18, 2020 at 4:20 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, Jun 18, 2020 at 7:34 AM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Wed, Jun 17, 2020 at 9:24 PM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > When we notice an initialized submodule that does not match the
> > > sparsity patterns, we should print a warning just like we do for
> > > unmerged and dirty entries.
> >
> > Yeah, seems like a good approach. Thanks for the explanations. Some of
> > the test cases in mt/grep-sparse-checkout will have to be adjusted
> > with this change. Should I reroll the series based on the patch you
> > will send or do you prefer to adjust them in your patch (and make it
> > dependent on mt/grep-sparse-checkout)?
>
> Ah, good catch.  Your series came first, is longer, and is reviewed
> other than the submodule/config stuff that needs someone more familiar
> with that area than me.  Since my patch needs more work anyway, how
> about I rebase my patch on top of your work, and make sure to ping you
> as a reviewer to make sure I don't mess anything up?

Sounds good :) Thanks
diff mbox series

Patch

diff --git a/unpack-trees.c b/unpack-trees.c
index 4be5fc30754..9922522b29d 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1524,7 +1524,7 @@  static void mark_new_skip_worktree(struct pattern_list *pl,
 	int i;
 
 	/*
-	 * 1. Pretend the narrowest worktree: only unmerged entries
+	 * 1. Pretend the narrowest worktree: only unmerged files and symlinks
 	 * are checked out
 	 */
 	for (i = 0; i < istate->cache_nr; i++) {
@@ -1533,7 +1533,8 @@  static void mark_new_skip_worktree(struct pattern_list *pl,
 		if (select_flag && !(ce->ce_flags & select_flag))
 			continue;
 
-		if (!ce_stage(ce) && !(ce->ce_flags & CE_CONFLICTED))
+		if (!ce_stage(ce) && !(ce->ce_flags & CE_CONFLICTED) &&
+		    !S_ISGITLINK(ce->ce_mode))
 			ce->ce_flags |= skip_wt_flag;
 		else
 			ce->ce_flags &= ~skip_wt_flag;