mbox series

[0/2] Sparse index: delete ignored files outside sparse cone

Message ID pull.1009.git.1627579637.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Sparse index: delete ignored files outside sparse cone | expand

Message

Linus Arver via GitGitGadget July 29, 2021, 5:27 p.m. UTC
My commits are technically based on ds/add-with-sparse-index, but could
apply earlier, I believe.

We launched an experimental release [1] of the sparse-index feature to our
internal users. We immediately discovered a problem due to the isolated way
in which we tested the sparse index: we were never building the project and
changing our sparse-checkout definition.

[1] https://github.com/microsoft/git/releases/tag/v2.32.0.vfs.0.102.exp

Users who ran a build in one project and then moved to another still had
build artifacts in their worktree that lived inside the old directories.
Since the files are marked by the .gitignore patterns, these files were not
removed by the 'git sparse-checkout set' command. However, they make the
sparse-index unusable because every 'git status' command needs to expand the
sparse-directory entries in order to see if the files are tracked or not.
This made the first experimental release actually slower for all users
because of this cost.

The solution we shipped to these customers was to change the way our fork
handles these ignored files. Specifically, instead of Git completely
ignoring the files, we changed Git to understand that with cone-mode
sparse-checkout patterns, the users is asking for entire directories to be
removed from the worktree. The link [1] included earlier has this change.

I believe that this is a reasonable expectation, though I recognize that it
might look like breaking the expectations of how .gitignore files work.

To limit the scope of this change, the focus is limited to behave this way
only when the sparse-index is enabled, because otherwise the sparse index
does not behave as intended. With that restriction in mind, the
implementation can use the sparse directory entries to assist.

I imagine that when the sparse index is mature enough, we will enable it by
default when cone mode sparse-checkout is enabled, in which case this
behavior would expand to all users with cone mode sparse-checkout.

I'm interested in the community's thoughts about this change, as it seems
like one that we should make carefully and intentionally.

While the rewrite of the t7519 test seems unrelated, it is required to avoid
a test failure with this change that deletes files outside of the cone. By
moving the test into repositories not at $TRASH_DIRECTORY, we gain more
control over the repository structure.

Thanks, -Stolee

Derrick Stolee (2):
  t7519: rewrite sparse index test
  sparse-checkout: clear tracked sparse dirs

 Documentation/git-sparse-checkout.txt |  6 +++
 builtin/sparse-checkout.c             | 73 +++++++++++++++++++++++++++
 t/t1091-sparse-checkout-builtin.sh    | 42 +++++++++++++++
 t/t7519-status-fsmonitor.sh           | 38 +++++++-------
 4 files changed, 142 insertions(+), 17 deletions(-)


base-commit: adf5b15ac3d44d92e0438451ef36631ed3ee2a63
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1009%2Fderrickstolee%2Fsparse-index%2Fignored-files-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1009/derrickstolee/sparse-index/ignored-files-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1009

Comments

Elijah Newren July 30, 2021, 1:11 p.m. UTC | #1
On Thu, Jul 29, 2021 at 11:27 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> My commits are technically based on ds/add-with-sparse-index, but could
> apply earlier, I believe.
>
> We launched an experimental release [1] of the sparse-index feature to our
> internal users. We immediately discovered a problem due to the isolated way
> in which we tested the sparse index: we were never building the project and
> changing our sparse-checkout definition.
>
> [1] https://github.com/microsoft/git/releases/tag/v2.32.0.vfs.0.102.exp
>
> Users who ran a build in one project and then moved to another still had
> build artifacts in their worktree that lived inside the old directories.
> Since the files are marked by the .gitignore patterns, these files were not
> removed by the 'git sparse-checkout set' command. However, they make the
> sparse-index unusable because every 'git status' command needs to expand the
> sparse-directory entries in order to see if the files are tracked or not.
> This made the first experimental release actually slower for all users
> because of this cost.

As highlighted at
https://lore.kernel.org/git/CABPp-BHwRHhYE3BPxCG+fRFWf2dCZ60rnO=ykztquKoOp_x_QQ@mail.gmail.com/
?

"""So, although 'git status'
shouldn't have to check the tracked files anymore, it is still going
to have to look at each of the *untracked* files and compare to
.gitignore files in order to correctly classify each file as ignored
or just plain untracked."""

As I noted there,

"""about a month ago my
colleagues made the tool just auto-invoke [--remove-old-ignores] from other
sparsify invocations.  To my knowledge, there have been no complaints
about that being automatically turned on; but there were
complaints/confusion before about the directories being left around.
(Of course, non-ignored files are still left around by that option.)"""

I think your changes here shouldn't be limited to a sparse-index but
should apply to cone-mode sparse checkouts generally.

> The solution we shipped to these customers was to change the way our fork
> handles these ignored files. Specifically, instead of Git completely
> ignoring the files, we changed Git to understand that with cone-mode
> sparse-checkout patterns, the users is asking for entire directories to be
> removed from the worktree. The link [1] included earlier has this change.

Could you clarify what you are removing?
  1) Tracked but now SKIP_WOKRTREE files
  2) .gitignore'd files
  3) other untracked files

(1) has always been removed by sparse-checkout.  I think you're adding
(2) but it's not clear if you are also removing (3).

> I believe that this is a reasonable expectation, though I recognize that it
> might look like breaking the expectations of how .gitignore files work.
>
> To limit the scope of this change, the focus is limited to behave this way
> only when the sparse-index is enabled, because otherwise the sparse index
> does not behave as intended. With that restriction in mind, the
> implementation can use the sparse directory entries to assist.
>
> I imagine that when the sparse index is mature enough, we will enable it by
> default when cone mode sparse-checkout is enabled, in which case this
> behavior would expand to all users with cone mode sparse-checkout.

I'm all in favor of just doing it for all cone mode users, at least
for .gitignore'd files.

> I'm interested in the community's thoughts about this change, as it seems
> like one that we should make carefully and intentionally.

Yeah, I approached it pretty cautiously, just adding a new option in
our tool (`sparsify --remove-old-ignores`) because I wasn't sure if it
should be done by default but wanted to provide an easy way for users
to access it.  We got numerous complaints and after some months, added
a config option to run it automatically.  We still got complaints and
eventually my colleagues just made it automatically run and provided
no way to turn it off.  No user complaints since.