mbox series

[v4,00/10] Sparse index: delete ignored files outside sparse cone

Message ID pull.1009.v4.git.1629841904.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Sparse index: delete ignored files outside sparse cone | expand

Message

Johannes Schindelin via GitGitGadget Aug. 24, 2021, 9:51 p.m. UTC
UPDATE: I had to base this version on a merge of ds/add-with-sparse-index
and v2.33.0-rc1 because of some changes to dir.h that became important!

We launched an experimental release [1] of the sparse-index feature to our
internal users. We immediately discovered a problem due to the isolated way
in which we tested the sparse index: we were never building the project and
changing our sparse-checkout definition.

[1] https://github.com/microsoft/git/releases/tag/v2.32.0.vfs.0.102.exp

Users who ran a build in one project and then moved to another still had
build artifacts in their worktree that lived inside the old directories.
Since the files are marked by the .gitignore patterns, these files were not
removed by the 'git sparse-checkout set' command. However, they make the
sparse-index unusable because every 'git status' command needs to expand the
sparse-directory entries in order to see if the files are tracked or not.
This made the first experimental release actually slower for all users
because of this cost.

The solution we shipped to these customers was to change the way our fork
handles these ignored files. Specifically, instead of Git completely
ignoring the files, we changed Git to understand that with cone-mode
sparse-checkout patterns, the users is asking for entire directories to be
removed from the worktree. The link [1] included earlier has this change.

I believe that this is a reasonable expectation, though I recognize that it
might look like breaking the expectations of how .gitignore files work.

Since feedback demonstrated that this is a desired behavior, v2 includes
this behavior for all "cone mode" repositories.

I'm interested in the community's thoughts about this change, as it seems
like one that we should make carefully and intentionally.

While the rewrite of the t7519 test seems unrelated, it is required to avoid
a test failure with this change that deletes files outside of the cone. By
moving the test into repositories not at $TRASH_DIRECTORY, we gain more
control over the repository structure.


Updates in V4
=============

 * Fixed an issue with the split index.

 * The helper methods are used more consistently.

 * The helper method path_in_cone_mode_sparse_checkout() is introduced.

 * Commit messages are edited for clarity.

 * A new config option is added to disable the behavior being added in this
   series.

 * I split the commit that involves cache_tree_update(). I have not yet
   succeeded in creating tests to demonstrate why this is required outside
   of needing it in the Scalar functional tests, which includes a version of
   partial clone. I will continue to investigate recreating this scenario in
   the Git test suite, but I wanted to send this version out to get feedback
   on the things that have changed.


Update in V3
============

 * As promised [2], the helper methods are fixed to work with non-cone-mode
   patterns. A later series will use them to their fullest potential
   (changing git add, git rm, and git mv when interacting with sparse
   entries).

[2]
https://lore.kernel.org/git/bac76c72-955d-1ade-4ecf-778ffc45f297@gmail.com/


Updates in V2
=============

 * This version correctly leaves untracked files alone. If untracked files
   are found, then the directory is left as-is, in case those ignored files
   are important to the user's work resolving those untracked files.

 * This behavior is now enabled by core.sparseCheckoutCone=true.

 * To use a sparse index as an in-memory data structure even when
   index.sparse is disabled, a new patch is included to modify the prototype
   of convert_to_sparse() to include a flags parameter.

 * A few cleanup patches that I was collecting based on feedback from the
   experimental release and intending for my next series were necessary for
   this implementation.

 * Cleaned up the tests (no NEEDSWORK) and the remainders of a previous
   implementation that used run_subcommand().

Thanks, -Stolee

Derrick Stolee (10):
  t7519: rewrite sparse index test
  sparse-index: silently return when not using cone-mode patterns
  sparse-index: silently return when cache tree fails
  sparse-index: use WRITE_TREE_MISSING_OK
  unpack-trees: fix nested sparse-dir search
  sparse-checkout: create helper methods
  attr: be careful about sparse directories
  sparse-index: add SPARSE_INDEX_MEMORY_ONLY flag
  sparse-checkout: clear tracked sparse dirs
  sparse-checkout: add config to disable deleting dirs

 Documentation/config/index.txt        |   6 ++
 Documentation/git-sparse-checkout.txt |  10 +++
 attr.c                                |  14 ++++
 builtin/add.c                         |   7 +-
 builtin/sparse-checkout.c             | 101 ++++++++++++++++++++++++++
 dir.c                                 |  52 +++++++++++++
 dir.h                                 |   8 ++
 read-cache.c                          |   4 +-
 sparse-index.c                        |  76 ++++++++++---------
 sparse-index.h                        |   3 +-
 t/t1091-sparse-checkout-builtin.sh    |  63 ++++++++++++++++
 t/t7519-status-fsmonitor.sh           |  38 +++++-----
 unpack-trees.c                        |   7 +-
 13 files changed, 328 insertions(+), 61 deletions(-)


base-commit: 80b8d6c56b8a5f5db1d5c2a0159fd808e8a7fc4f
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1009%2Fderrickstolee%2Fsparse-index%2Fignored-files-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1009/derrickstolee/sparse-index/ignored-files-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1009

Range-diff vs v3:

  1:  e66106f7a99 !  1:  c407b2cb346 t7519: rewrite sparse index test
     @@ t/t7519-status-fsmonitor.sh: test_expect_success 'status succeeds after staging/
       test_expect_success 'status succeeds with sparse index' '
      -	git reset --hard &&
      +	git clone . full &&
     -+	git clone . sparse &&
     ++	git clone --sparse . sparse &&
      +	git -C sparse sparse-checkout init --cone --sparse-index &&
      +	git -C sparse sparse-checkout set dir1 dir2 &&
       
  2:  fb3ff9108bf !  2:  8660877ba7a sparse-index: silently return when not using cone-mode patterns
     @@ sparse-index.c: static int index_has_unmerged_entries(struct index_state *istate
       {
       	int test_env;
      -	if (istate->split_index || istate->sparse_index ||
     -+
      +	if (istate->split_index || istate->sparse_index || !istate->cache_nr ||
       	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
       		return 0;
  3:  37198535268 !  3:  a669740af9a sparse-index: silently return when cache tree fails
     @@ Commit message
          convert_to_sparse(), then we could still recover from this scenario and
          have a sparse index.
      
     -    When constructing the cache-tree extension in convert_to_sparse(), it is
     -    possible that we construct a tree object that is new to the object
     -    database. Without the WRITE_TREE_MISSING_OK flag, this results in an
     -    error that halts our conversion to a sparse index. Add this flag to
     -    remove this limitation.
     -
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## sparse-index.c ##
     @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
      +	/*
      +	 * Silently return if there is a problem with the cache tree update,
      +	 * which might just be due to a conflict state in some entry.
     -+	 *
     -+	 * This might create new tree objects, so be sure to use
     -+	 * WRITE_TREE_MISSING_OK.
      +	 */
     -+	if (cache_tree_update(istate, WRITE_TREE_MISSING_OK))
     ++	if (cache_tree_update(istate, 0))
      +		return 0;
       
       	remove_fsmonitor(istate);
  -:  ----------- >  4:  b379b8fc61a sparse-index: use WRITE_TREE_MISSING_OK
  4:  10bcadb284e !  5:  acdded0f762 unpack-trees: fix nested sparse-dir search
     @@ Commit message
          diff that modified files within the sparse directory entry, but we could
          not correctly find the entry.
      
     +    Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
     +    Helped-by: René Scharfe <l.s.r@web.de>
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## unpack-trees.c ##
     @@ unpack-trees.c: static int sparse_dir_matches_path(const struct cache_entry *ce,
       					    const struct name_entry *p)
       {
      -	struct cache_entry *ce;
     ++	const char *path;
      +	struct cache_entry *ce = NULL;
       	int pos = find_cache_pos(info, p->path, p->pathlen);
       	struct unpack_trees_options *o = info->data;
     -+	struct strbuf full_path = STRBUF_INIT;
       
     - 	if (0 <= pos)
     - 		return o->src_index->cache[pos];
     -@@ unpack-trees.c: static struct cache_entry *find_cache_entry(struct traverse_info *info,
     - 	if (pos < 0 || pos >= o->src_index->cache_nr)
     - 		return NULL;
     - 
     -+	strbuf_addstr(&full_path, info->traverse_path);
     -+	strbuf_add(&full_path, p->path, p->pathlen);
     -+	strbuf_addch(&full_path, '/');
     -+
     - 	/*
     - 	 * Due to lexicographic sorting and sparse directory
     - 	 * entries ending with a trailing slash, our path as a
      @@ unpack-trees.c: static struct cache_entry *find_cache_entry(struct traverse_info *info,
       	while (pos >= 0) {
       		ce = o->src_index->cache[pos];
       
      -		if (strncmp(ce->name, p->path, p->pathlen))
     --			return NULL;
     -+		if (strncmp(ce->name, full_path.buf, full_path.len)) {
     -+			ce = NULL;
     -+			break;
     -+		}
     ++		if (!skip_prefix(ce->name, info->traverse_path, &path) ||
     ++		    strncmp(path, p->path, p->pathlen) ||
     ++		    path[p->pathlen] != '/')
     + 			return NULL;
       
       		if (S_ISSPARSEDIR(ce->ce_mode) &&
     - 		    sparse_dir_matches_path(ce, info, p))
     --			return ce;
     -+			break;
     - 
     - 		pos--;
     - 	}
     - 
     --	return NULL;
     -+	strbuf_release(&full_path);
     -+	return ce;
     - }
     - 
     - static void debug_path(struct traverse_info *info)
  5:  5d28570c82a !  6:  1958751aa0e sparse-checkout: create helper methods
     @@ Commit message
          adopt these helpers. There are just two in builtin/add.c and
          sparse-index.c that can use path_in_sparse_checkout().
      
     +    We add a path_in_cone_mode_sparse_checkout() as well that will only
     +    return false if the path is outside of the sparse-checkout definition
     +    _and_ the sparse-checkout patterns are in cone mode.
     +
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## builtin/add.c ##
     @@ builtin/add.c: static int refresh(int verbose, const struct pathspec *pathspec)
      -			    (sparse_checkout_enabled &&
      -			     !path_matches_pattern_list(path, strlen(path), NULL,
      -							&dtype, &pl, &the_index))) {
     -+			    (core_apply_sparse_checkout &&
     -+			     path_in_sparse_checkout(path, &the_index) == NOT_MATCHED)) {
     ++			    !path_in_sparse_checkout(path, &the_index)) {
       				string_list_append(&only_match_skip_worktree,
       						   pathspec->items[i].original);
       			} else {
     @@ dir.c: done:
       
      +int init_sparse_checkout_patterns(struct index_state *istate)
      +{
     -+	if (!core_apply_sparse_checkout ||
     -+	    istate->sparse_checkout_patterns)
     ++	if (!core_apply_sparse_checkout)
     ++		return 1;
     ++	if (istate->sparse_checkout_patterns)
      +		return 0;
      +
      +	CALLOC_ARRAY(istate->sparse_checkout_patterns, 1);
     @@ dir.c: done:
      +	return 0;
      +}
      +
     -+int path_in_sparse_checkout(const char *path,
     -+			    struct index_state *istate)
     ++static int path_in_sparse_checkout_1(const char *path,
     ++				     struct index_state *istate,
     ++				     int require_cone_mode)
      +{
      +	const char *base;
      +	int dtype = DT_REG;
     -+	init_sparse_checkout_patterns(istate);
      +
     -+	if (!istate->sparse_checkout_patterns)
     -+		return MATCHED;
     ++	/*
     ++	 * We default to accepting a path if there are no patterns or
     ++	 * they are of the wrong type.
     ++	 */
     ++	if (init_sparse_checkout_patterns(istate) ||
     ++	    (require_cone_mode &&
     ++	     !istate->sparse_checkout_patterns->use_cone_patterns))
     ++		return 1;
      +
      +	base = strrchr(path, '/');
      +	return path_matches_pattern_list(path, strlen(path), base ? base + 1 : path,
     @@ dir.c: done:
      +					 istate->sparse_checkout_patterns,
      +					 istate) > 0;
      +}
     ++
     ++int path_in_sparse_checkout(const char *path,
     ++			    struct index_state *istate)
     ++{
     ++	return path_in_sparse_checkout_1(path, istate, 0);
     ++}
     ++
     ++int path_in_cone_mode_sparse_checkout(const char *path,
     ++				     struct index_state *istate)
     ++{
     ++	return path_in_sparse_checkout_1(path, istate, 1);
     ++}
      +
       static struct path_pattern *last_matching_pattern_from_lists(
       		struct dir_struct *dir, struct index_state *istate,
     @@ dir.h: enum pattern_match_result path_matches_pattern_list(const char *pathname,
      +
      +int path_in_sparse_checkout(const char *path,
      +			    struct index_state *istate);
     ++int path_in_cone_mode_sparse_checkout(const char *path,
     ++				      struct index_state *istate);
      +
       struct dir_entry *dir_add_ignored(struct dir_struct *dir,
       				  struct index_state *istate,
     @@ dir.h: enum pattern_match_result path_matches_pattern_list(const char *pathname,
      
       ## sparse-index.c ##
      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
     + {
       	int i, can_convert = 1;
       	int start_converted = num_converted;
     - 	enum pattern_match_result match;
     +-	enum pattern_match_result match;
      -	int dtype = DT_UNKNOWN;
       	struct strbuf child_path = STRBUF_INIT;
      -	struct pattern_list *pl = istate->sparse_checkout_patterns;
     @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
       	 */
      -	match = path_matches_pattern_list(ct_path, ct_pathlen,
      -					  NULL, &dtype, pl, istate);
     -+	match = path_in_sparse_checkout(ct_path, istate);
     - 	if (match != NOT_MATCHED)
     +-	if (match != NOT_MATCHED)
     ++	if (path_in_sparse_checkout(ct_path, istate))
       		can_convert = 0;
       
     + 	for (i = start; can_convert && i < end; i++) {
      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
       	if (!istate->repo->settings.sparse_index)
       		return 0;
     @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
      -		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
      -			return 0;
      -	}
     -+	if (init_sparse_checkout_patterns(istate) < 0)
     ++	if (init_sparse_checkout_patterns(istate))
      +		return 0;
       
       	/*
  6:  c9e100e68f8 !  7:  e496f3cee66 attr: be careful about sparse directories
     @@ attr.c: static struct attr_stack *read_attr_from_index(struct index_state *istat
       		return NULL;
       
      +	/*
     -+	 * In the case of cone-mode sparse-checkout, getting the
     -+	 * .gitattributes file from a directory is meaningless: all
     -+	 * contained paths will be sparse if the .gitattributes is also
     -+	 * sparse. In the case of a sparse index, it is critical that we
     -+	 * don't go looking for one as it will expand the index.
     ++	 * The .gitattributes file only applies to files within its
     ++	 * parent directory. In the case of cone-mode sparse-checkout,
     ++	 * the .gitattributes file is sparse if and only if all paths
     ++	 * within that directory are also sparse. Thus, don't load the
     ++	 * .gitattributes file since it will not matter.
     ++	 *
     ++	 * In the case of a sparse index, it is critical that we don't go
     ++	 * looking for a .gitattributes file, as the index will expand.
      +	 */
     -+	init_sparse_checkout_patterns(istate);
     -+	if (istate->sparse_checkout_patterns &&
     -+	    istate->sparse_checkout_patterns->use_cone_patterns &&
     -+	    path_in_sparse_checkout(path, istate) == NOT_MATCHED)
     ++	if (!path_in_cone_mode_sparse_checkout(path, istate))
      +		return NULL;
      +
       	buf = read_blob_data_from_index(istate, path, NULL);
  7:  b0ece4b7dcc !  8:  cab9360b1e9 sparse-index: add SPARSE_INDEX_IGNORE_CONFIG flag
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    sparse-index: add SPARSE_INDEX_IGNORE_CONFIG flag
     +    sparse-index: add SPARSE_INDEX_MEMORY_ONLY flag
      
          The convert_to_sparse() method checks for the GIT_TEST_SPARSE_INDEX
          environment variable or the "index.sparse" config setting before
     @@ Commit message
          index as an in-memory data structure, regardless of whether the on-disk
          format should be sparse.
      
     -    To that end, create the SPARSE_INDEX_IGNORE_CONFIG flag that will skip
     +    To that end, create the SPARSE_INDEX_MEMORY_ONLY flag that will skip
          these config checks when enabled. All current consumers are modified to
          pass '0' in the new 'flags' parameter.
      
     @@ sparse-index.c: static int index_has_unmerged_entries(struct index_state *istate
      +int convert_to_sparse(struct index_state *istate, int flags)
       {
       	int test_env;
     +-	if (istate->split_index || istate->sparse_index || !istate->cache_nr ||
     ++	if (istate->sparse_index || !istate->cache_nr ||
     + 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
     + 		return 0;
       
     -@@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
       	if (!istate->repo)
       		istate->repo = the_repository;
       
     @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
      -	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
      -	if (test_env >= 0)
      -		set_sparse_index_config(istate->repo, test_env);
     -+	if (!(flags & SPARSE_INDEX_IGNORE_CONFIG)) {
     ++	if (!(flags & SPARSE_INDEX_MEMORY_ONLY)) {
     ++		/*
     ++		 * The sparse index is not (yet) integrated with a split index.
     ++		 */
     ++		if (istate->split_index)
     ++			return 0;
      +		/*
      +		 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
      +		 * index.sparse config variable to be on.
     @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
      +			return 0;
      +	}
       
     - 	if (init_sparse_checkout_patterns(istate) < 0)
     + 	if (init_sparse_checkout_patterns(istate))
       		return 0;
      
       ## sparse-index.h ##
     @@ sparse-index.h
       
       struct index_state;
      -int convert_to_sparse(struct index_state *istate);
     -+#define SPARSE_INDEX_IGNORE_CONFIG (1 << 0)
     ++#define SPARSE_INDEX_MEMORY_ONLY (1 << 0)
      +int convert_to_sparse(struct index_state *istate, int flags);
       
       /*
  8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
     @@ Commit message
          the sparse directory. This depends on the indexed version of the file,
          so the sparse directory must be expanded.
      
     +    We must take special care to look for untracked, non-ignored files in
     +    these directories before deleting them. We do not want to delete any
     +    meaningful work that the users were doing in those directories and
     +    perhaps forgot to add and commit before switching sparse-checkout
     +    definitions. Since those untracked files might be code files that
     +    generated ignored build output, also do not delete any ignored files
     +    from these directories in that case. The users can recover their state
     +    by resetting their sparse-checkout definition to include that directory
     +    and continue. Alternatively, they can see the warning that is presented
     +    and delete the directory themselves to regain the performance they
     +    expect.
     +
          By deleting the sparse directories when changing scope (or running 'git
          sparse-checkout reapply') we regain these performance benefits as if the
          repository was in a clean state.
     @@ Documentation/git-sparse-checkout.txt: case-insensitive check. This corrects for
       'git sparse-checkout set' command to reflect the expected cone in the working
       directory.
       
     -+The cone mode sparse-checkout patterns will also remove ignored files that
     -+are not within the sparse-checkout definition. This is important behavior
     -+to preserve the performance of the sparse index, but also matches that
     -+cone mode patterns care about directories, not files. If there exist files
     -+that are untracked and not ignored, then Git will not delete files within
     -+that directory other than the tracked files that are now out of scope.
     -+These files should be removed manually to ensure Git can behave optimally.
     ++When changing the sparse-checkout patterns in cone mode, Git will inspect each
     ++tracked directory that is not within the sparse-checkout cone to see if it
     ++contains any untracked files. If all of those files are ignored due to the
     ++`.gitignore` patterns, then the directory will be deleted. If any of the
     ++untracked files within that directory is not ignored, then no deletions will
     ++occur within that directory and a warning message will appear. If these files
     ++are important, then reset your sparse-checkout definition so they are included,
     ++use `git add` and `git commit` to store them, then remove any remaining files
     ++manually to ensure Git can behave optimally.
      +
       
       SUBMODULES
     @@ builtin/sparse-checkout.c: static int sparse_checkout_list(int argc, const char
      +	 */
      +	if (!r || !r->index || !r->worktree)
      +		return;
     -+	init_sparse_checkout_patterns(r->index);
     -+	if (!r->index->sparse_checkout_patterns ||
     ++	if (init_sparse_checkout_patterns(r->index) ||
      +	    !r->index->sparse_checkout_patterns->use_cone_patterns)
      +		return;
      +
     @@ builtin/sparse-checkout.c: static int sparse_checkout_list(int argc, const char
      +		 * prevents us from converting to a sparse index, then do
      +		 * not try deleting files.
      +		 */
     -+		if (convert_to_sparse(r->index, SPARSE_INDEX_IGNORE_CONFIG))
     ++		if (convert_to_sparse(r->index, SPARSE_INDEX_MEMORY_ONLY))
      +			return;
      +		was_full = 1;
      +	}
  -:  ----------- > 10:  8d55a6ba2fd sparse-checkout: add config to disable deleting dirs

Comments

Elijah Newren Aug. 27, 2021, 9:56 p.m. UTC | #1
On Tue, Aug 24, 2021 at 2:51 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> UPDATE: I had to base this version on a merge of ds/add-with-sparse-index
> and v2.33.0-rc1 because of some changes to dir.h that became important!

Isn't this stale (i.e. just copied from v2/v3)?

...
> Updates in V4
> =============
>
>  * Fixed an issue with the split index.
>
>  * The helper methods are used more consistently.
>
>  * The helper method path_in_cone_mode_sparse_checkout() is introduced.
>
>  * Commit messages are edited for clarity.
>
>  * A new config option is added to disable the behavior being added in this
>    series.
>
>  * I split the commit that involves cache_tree_update(). I have not yet
>    succeeded in creating tests to demonstrate why this is required outside
>    of needing it in the Scalar functional tests, which includes a version of
>    partial clone. I will continue to investigate recreating this scenario in
>    the Git test suite, but I wanted to send this version out to get feedback
>    on the things that have changed.

Are you talking about patch 3 or patch 4?

For patch 4, as I commented on that patch, my guess would be that you
start with a full tree, pick some directory that you will sparsify
away, and before sparsifying it away, change some file under that
directory and stage the change.  Then when you sparsify the directory
away with cone mode, I think the cache_tree_update() will need to
write the tree.  At least, some quick greps of cache_tree_update
suggest there aren't that many codepaths that attempt to update it, so
I don't think anything else would have written it yet at that point.

For patch 3, since it comes before patch 5, it would be triggerable
with the same case.  Is it possible you originally found both patches
based off similar initial conditions that just resulted in slightly
different behavioral reports, and thus that patch 4 is good enough
without patch 3?  Trying to ignore that possibility and looking at the
cache_tree_update() code, the only other reasons for failure of
cache_tree_update() that I can see are (a) someone having invalid
hashes recorded in the index (e.g. someone using `git update-index
--index-info` with some invalid hashes, or someone manually deleting
some loose blobs from under their .git/ directory), or (b) disk being
full or read-only so that writing new tree objects to the object store
fail.

> base-commit: 80b8d6c56b8a5f5db1d5c2a0159fd808e8a7fc4f
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1009%2Fderrickstolee%2Fsparse-index%2Fignored-files-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1009/derrickstolee/sparse-index/ignored-files-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/1009

A super micro nit:

It appears this is based on your other branches that have since been
incorporated into master.  Not a big deal, but rebasing on master
might make it easier for others to take a look at the series (I was
confused at first why this series depended on other unmerged series,
before realizing those other series had since merged down).  I know we
normally don't like rebasing series, but I suspect it wouldn't change
the range-diff (as I don't think others are touching these areas of
the code), and thus be transparent to Junio.



I commented on some individual patches, but have some small comments
that are easier to post in response to the range-diff:

> Range-diff vs v3:
>
>   1:  e66106f7a99 !  1:  c407b2cb346 t7519: rewrite sparse index test
>      @@ t/t7519-status-fsmonitor.sh: test_expect_success 'status succeeds after staging/
>        test_expect_success 'status succeeds with sparse index' '
>       - git reset --hard &&
>       + git clone . full &&
>      -+ git clone . sparse &&
>      ++ git clone --sparse . sparse &&
>       + git -C sparse sparse-checkout init --cone --sparse-index &&
>       + git -C sparse sparse-checkout set dir1 dir2 &&
>
>   2:  fb3ff9108bf !  2:  8660877ba7a sparse-index: silently return when not using cone-mode patterns
>      @@ sparse-index.c: static int index_has_unmerged_entries(struct index_state *istate
>        {
>         int test_env;
>       - if (istate->split_index || istate->sparse_index ||
>      -+
>       + if (istate->split_index || istate->sparse_index || !istate->cache_nr ||
>             !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>                 return 0;
>   3:  37198535268 !  3:  a669740af9a sparse-index: silently return when cache tree fails
>      @@ Commit message
>           convert_to_sparse(), then we could still recover from this scenario and
>           have a sparse index.
>
>      -    When constructing the cache-tree extension in convert_to_sparse(), it is
>      -    possible that we construct a tree object that is new to the object
>      -    database. Without the WRITE_TREE_MISSING_OK flag, this results in an
>      -    error that halts our conversion to a sparse index. Add this flag to
>      -    remove this limitation.
>      -
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>        ## sparse-index.c ##
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       + /*
>       +  * Silently return if there is a problem with the cache tree update,
>       +  * which might just be due to a conflict state in some entry.
>      -+  *
>      -+  * This might create new tree objects, so be sure to use
>      -+  * WRITE_TREE_MISSING_OK.
>       +  */
>      -+ if (cache_tree_update(istate, WRITE_TREE_MISSING_OK))
>      ++ if (cache_tree_update(istate, 0))
>       +         return 0;
>
>         remove_fsmonitor(istate);
>   -:  ----------- >  4:  b379b8fc61a sparse-index: use WRITE_TREE_MISSING_OK

Thanks for splitting patch 3 & 4.  I left some comments on patch 5.
For patch 4, my comments above about testcases might be of interest.

>   4:  10bcadb284e !  5:  acdded0f762 unpack-trees: fix nested sparse-dir search
>      @@ Commit message
>           diff that modified files within the sparse directory entry, but we could
>           not correctly find the entry.
>
>      +    Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
>      +    Helped-by: René Scharfe <l.s.r@web.de>
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>        ## unpack-trees.c ##
>      @@ unpack-trees.c: static int sparse_dir_matches_path(const struct cache_entry *ce,
>                                             const struct name_entry *p)
>        {
>       - struct cache_entry *ce;
>      ++ const char *path;
>       + struct cache_entry *ce = NULL;
>         int pos = find_cache_pos(info, p->path, p->pathlen);
>         struct unpack_trees_options *o = info->data;
>      -+ struct strbuf full_path = STRBUF_INIT;
>
>      -  if (0 <= pos)
>      -          return o->src_index->cache[pos];
>      -@@ unpack-trees.c: static struct cache_entry *find_cache_entry(struct traverse_info *info,
>      -  if (pos < 0 || pos >= o->src_index->cache_nr)
>      -          return NULL;
>      -
>      -+ strbuf_addstr(&full_path, info->traverse_path);
>      -+ strbuf_add(&full_path, p->path, p->pathlen);
>      -+ strbuf_addch(&full_path, '/');
>      -+
>      -  /*
>      -   * Due to lexicographic sorting and sparse directory
>      -   * entries ending with a trailing slash, our path as a
>       @@ unpack-trees.c: static struct cache_entry *find_cache_entry(struct traverse_info *info,
>         while (pos >= 0) {
>                 ce = o->src_index->cache[pos];
>
>       -         if (strncmp(ce->name, p->path, p->pathlen))
>      --                 return NULL;
>      -+         if (strncmp(ce->name, full_path.buf, full_path.len)) {
>      -+                 ce = NULL;
>      -+                 break;
>      -+         }
>      ++         if (!skip_prefix(ce->name, info->traverse_path, &path) ||
>      ++             strncmp(path, p->path, p->pathlen) ||
>      ++             path[p->pathlen] != '/')
>      +                  return NULL;
>
>                 if (S_ISSPARSEDIR(ce->ce_mode) &&
>      -              sparse_dir_matches_path(ce, info, p))
>      --                 return ce;
>      -+                 break;
>      -
>      -          pos--;
>      -  }
>      -
>      -- return NULL;
>      -+ strbuf_release(&full_path);
>      -+ return ce;
>      - }
>      -
>      - static void debug_path(struct traverse_info *info)

René's suggestion definitely made this patch simpler (which sticks out
more when you look at the patch rather than the range-diff).

>   5:  5d28570c82a !  6:  1958751aa0e sparse-checkout: create helper methods
>      @@ Commit message
>           adopt these helpers. There are just two in builtin/add.c and
>           sparse-index.c that can use path_in_sparse_checkout().
>
>      +    We add a path_in_cone_mode_sparse_checkout() as well that will only
>      +    return false if the path is outside of the sparse-checkout definition
>      +    _and_ the sparse-checkout patterns are in cone mode.
>      +
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>        ## builtin/add.c ##
>      @@ builtin/add.c: static int refresh(int verbose, const struct pathspec *pathspec)
>       -                     (sparse_checkout_enabled &&
>       -                      !path_matches_pattern_list(path, strlen(path), NULL,
>       -                                                 &dtype, &pl, &the_index))) {
>      -+                     (core_apply_sparse_checkout &&
>      -+                      path_in_sparse_checkout(path, &the_index) == NOT_MATCHED)) {
>      ++                     !path_in_sparse_checkout(path, &the_index)) {
>                                 string_list_append(&only_match_skip_worktree,
>                                                    pathspec->items[i].original);
>                         } else {
>      @@ dir.c: done:
>
>       +int init_sparse_checkout_patterns(struct index_state *istate)
>       +{
>      -+ if (!core_apply_sparse_checkout ||
>      -+     istate->sparse_checkout_patterns)
>      ++ if (!core_apply_sparse_checkout)
>      ++         return 1;
>      ++ if (istate->sparse_checkout_patterns)
>       +         return 0;
>       +
>       + CALLOC_ARRAY(istate->sparse_checkout_patterns, 1);
>      @@ dir.c: done:
>       + return 0;
>       +}
>       +
>      -+int path_in_sparse_checkout(const char *path,
>      -+                     struct index_state *istate)
>      ++static int path_in_sparse_checkout_1(const char *path,
>      ++                              struct index_state *istate,
>      ++                              int require_cone_mode)
>       +{
>       + const char *base;
>       + int dtype = DT_REG;
>      -+ init_sparse_checkout_patterns(istate);
>       +
>      -+ if (!istate->sparse_checkout_patterns)
>      -+         return MATCHED;
>      ++ /*
>      ++  * We default to accepting a path if there are no patterns or
>      ++  * they are of the wrong type.
>      ++  */
>      ++ if (init_sparse_checkout_patterns(istate) ||
>      ++     (require_cone_mode &&
>      ++      !istate->sparse_checkout_patterns->use_cone_patterns))
>      ++         return 1;
>       +
>       + base = strrchr(path, '/');
>       + return path_matches_pattern_list(path, strlen(path), base ? base + 1 : path,
>      @@ dir.c: done:
>       +                                  istate->sparse_checkout_patterns,
>       +                                  istate) > 0;
>       +}
>      ++
>      ++int path_in_sparse_checkout(const char *path,
>      ++                     struct index_state *istate)
>      ++{
>      ++ return path_in_sparse_checkout_1(path, istate, 0);
>      ++}
>      ++
>      ++int path_in_cone_mode_sparse_checkout(const char *path,
>      ++                              struct index_state *istate)
>      ++{
>      ++ return path_in_sparse_checkout_1(path, istate, 1);
>      ++}
>       +
>        static struct path_pattern *last_matching_pattern_from_lists(
>                 struct dir_struct *dir, struct index_state *istate,
>      @@ dir.h: enum pattern_match_result path_matches_pattern_list(const char *pathname,
>       +
>       +int path_in_sparse_checkout(const char *path,
>       +                     struct index_state *istate);
>      ++int path_in_cone_mode_sparse_checkout(const char *path,
>      ++                               struct index_state *istate);
>       +
>        struct dir_entry *dir_add_ignored(struct dir_struct *dir,
>                                   struct index_state *istate,
>      @@ dir.h: enum pattern_match_result path_matches_pattern_list(const char *pathname,
>
>        ## sparse-index.c ##
>       @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>      + {
>         int i, can_convert = 1;
>         int start_converted = num_converted;
>      -  enum pattern_match_result match;
>      +- enum pattern_match_result match;
>       - int dtype = DT_UNKNOWN;
>         struct strbuf child_path = STRBUF_INIT;
>       - struct pattern_list *pl = istate->sparse_checkout_patterns;
>      @@ sparse-index.c: static int convert_to_sparse_rec(struct index_state *istate,
>          */
>       - match = path_matches_pattern_list(ct_path, ct_pathlen,
>       -                                   NULL, &dtype, pl, istate);
>      -+ match = path_in_sparse_checkout(ct_path, istate);
>      -  if (match != NOT_MATCHED)
>      +- if (match != NOT_MATCHED)
>      ++ if (path_in_sparse_checkout(ct_path, istate))
>                 can_convert = 0;
>
>      +  for (i = start; can_convert && i < end; i++) {
>       @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>         if (!istate->repo->settings.sparse_index)
>                 return 0;
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       -         if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
>       -                 return 0;
>       - }
>      -+ if (init_sparse_checkout_patterns(istate) < 0)
>      ++ if (init_sparse_checkout_patterns(istate))
>       +         return 0;
>
>         /*
>   6:  c9e100e68f8 !  7:  e496f3cee66 attr: be careful about sparse directories
>      @@ attr.c: static struct attr_stack *read_attr_from_index(struct index_state *istat
>                 return NULL;
>
>       + /*
>      -+  * In the case of cone-mode sparse-checkout, getting the
>      -+  * .gitattributes file from a directory is meaningless: all
>      -+  * contained paths will be sparse if the .gitattributes is also
>      -+  * sparse. In the case of a sparse index, it is critical that we
>      -+  * don't go looking for one as it will expand the index.
>      ++  * The .gitattributes file only applies to files within its
>      ++  * parent directory. In the case of cone-mode sparse-checkout,
>      ++  * the .gitattributes file is sparse if and only if all paths
>      ++  * within that directory are also sparse. Thus, don't load the
>      ++  * .gitattributes file since it will not matter.
>      ++  *
>      ++  * In the case of a sparse index, it is critical that we don't go
>      ++  * looking for a .gitattributes file, as the index will expand.

This comment is better, and I appreciate the thought behind adding the
second paragraph, but the wording seems fuzzy and potentially
confusing to future readers.  Perhaps changing the wording to:

...looking for a .gitattributes file, as doing so would cause the
index to expand.

>       +  */
>      -+ init_sparse_checkout_patterns(istate);
>      -+ if (istate->sparse_checkout_patterns &&
>      -+     istate->sparse_checkout_patterns->use_cone_patterns &&
>      -+     path_in_sparse_checkout(path, istate) == NOT_MATCHED)
>      ++ if (!path_in_cone_mode_sparse_checkout(path, istate))
>       +         return NULL;
>       +
>         buf = read_blob_data_from_index(istate, path, NULL);
>   7:  b0ece4b7dcc !  8:  cab9360b1e9 sparse-index: add SPARSE_INDEX_IGNORE_CONFIG flag
>      @@ Metadata
>       Author: Derrick Stolee <dstolee@microsoft.com>
>
>        ## Commit message ##
>      -    sparse-index: add SPARSE_INDEX_IGNORE_CONFIG flag
>      +    sparse-index: add SPARSE_INDEX_MEMORY_ONLY flag
>
>           The convert_to_sparse() method checks for the GIT_TEST_SPARSE_INDEX
>           environment variable or the "index.sparse" config setting before
>      @@ Commit message
>           index as an in-memory data structure, regardless of whether the on-disk
>           format should be sparse.
>
>      -    To that end, create the SPARSE_INDEX_IGNORE_CONFIG flag that will skip
>      +    To that end, create the SPARSE_INDEX_MEMORY_ONLY flag that will skip
>           these config checks when enabled. All current consumers are modified to
>           pass '0' in the new 'flags' parameter.
>
>      @@ sparse-index.c: static int index_has_unmerged_entries(struct index_state *istate
>       +int convert_to_sparse(struct index_state *istate, int flags)
>        {
>         int test_env;
>      +- if (istate->split_index || istate->sparse_index || !istate->cache_nr ||
>      ++ if (istate->sparse_index || !istate->cache_nr ||
>      +      !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>      +          return 0;
>
>      -@@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>         if (!istate->repo)
>                 istate->repo = the_repository;
>
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       - test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
>       - if (test_env >= 0)
>       -         set_sparse_index_config(istate->repo, test_env);
>      -+ if (!(flags & SPARSE_INDEX_IGNORE_CONFIG)) {
>      ++ if (!(flags & SPARSE_INDEX_MEMORY_ONLY)) {
>      ++         /*
>      ++          * The sparse index is not (yet) integrated with a split index.
>      ++          */
>      ++         if (istate->split_index)
>      ++                 return 0;
>       +         /*
>       +          * The GIT_TEST_SPARSE_INDEX environment variable triggers the
>       +          * index.sparse config variable to be on.
>      @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
>       +                 return 0;
>       + }
>
>      -  if (init_sparse_checkout_patterns(istate) < 0)
>      +  if (init_sparse_checkout_patterns(istate))
>                 return 0;
>
>        ## sparse-index.h ##
>      @@ sparse-index.h
>
>        struct index_state;
>       -int convert_to_sparse(struct index_state *istate);
>      -+#define SPARSE_INDEX_IGNORE_CONFIG (1 << 0)
>      ++#define SPARSE_INDEX_MEMORY_ONLY (1 << 0)
>       +int convert_to_sparse(struct index_state *istate, int flags);
>
>        /*
>   8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
>      @@ Commit message
>           the sparse directory. This depends on the indexed version of the file,
>           so the sparse directory must be expanded.
>
>      +    We must take special care to look for untracked, non-ignored files in
>      +    these directories before deleting them. We do not want to delete any
>      +    meaningful work that the users were doing in those directories and
>      +    perhaps forgot to add and commit before switching sparse-checkout
>      +    definitions. Since those untracked files might be code files that
>      +    generated ignored build output, also do not delete any ignored files
>      +    from these directories in that case. The users can recover their state
>      +    by resetting their sparse-checkout definition to include that directory
>      +    and continue. Alternatively, they can see the warning that is presented
>      +    and delete the directory themselves to regain the performance they
>      +    expect.
>      +
>           By deleting the sparse directories when changing scope (or running 'git
>           sparse-checkout reapply') we regain these performance benefits as if the
>           repository was in a clean state.
>      @@ Documentation/git-sparse-checkout.txt: case-insensitive check. This corrects for
>        'git sparse-checkout set' command to reflect the expected cone in the working
>        directory.
>
>      -+The cone mode sparse-checkout patterns will also remove ignored files that
>      -+are not within the sparse-checkout definition. This is important behavior
>      -+to preserve the performance of the sparse index, but also matches that
>      -+cone mode patterns care about directories, not files. If there exist files
>      -+that are untracked and not ignored, then Git will not delete files within
>      -+that directory other than the tracked files that are now out of scope.
>      -+These files should be removed manually to ensure Git can behave optimally.
>      ++When changing the sparse-checkout patterns in cone mode, Git will inspect each
>      ++tracked directory that is not within the sparse-checkout cone to see if it
>      ++contains any untracked files. If all of those files are ignored due to the
>      ++`.gitignore` patterns, then the directory will be deleted. If any of the
>      ++untracked files within that directory is not ignored, then no deletions will
>      ++occur within that directory and a warning message will appear. If these files
>      ++are important, then reset your sparse-checkout definition so they are included,
>      ++use `git add` and `git commit` to store them, then remove any remaining files
>      ++manually to ensure Git can behave optimally.
>       +
>
>        SUBMODULES
>      @@ builtin/sparse-checkout.c: static int sparse_checkout_list(int argc, const char
>       +  */
>       + if (!r || !r->index || !r->worktree)
>       +         return;
>      -+ init_sparse_checkout_patterns(r->index);
>      -+ if (!r->index->sparse_checkout_patterns ||
>      ++ if (init_sparse_checkout_patterns(r->index) ||
>       +     !r->index->sparse_checkout_patterns->use_cone_patterns)
>       +         return;
>       +
>      @@ builtin/sparse-checkout.c: static int sparse_checkout_list(int argc, const char
>       +          * prevents us from converting to a sparse index, then do
>       +          * not try deleting files.
>       +          */
>      -+         if (convert_to_sparse(r->index, SPARSE_INDEX_IGNORE_CONFIG))
>      ++         if (convert_to_sparse(r->index, SPARSE_INDEX_MEMORY_ONLY))
>       +                 return;
>       +         was_full = 1;
>       + }

Thanks for the extra explanations; those help.

You haven't yet addressed how this choice will interact with
.gitignore files, though (as I commented on v3).  You alluded in patch
7 to how it was critical to avoid looking for .gitattributes files
because doing so would require the index to expand, possibly
recursively all the way down.  Don't .gitignore files have the same
problem?  Nuking all the ignored files so that you can delete extra
directories and thus not need to walk into them during operations
helps, but does the fact that we sometimes avoid removing directories
mean we have to walk into them anyway?  Do we only need to walk into
such directories when they are (still? again?) present in the working
copy?  Does there need to be some code added somewhere that checks for
the presence of a directory in the working copy and unsparsifies
sparse directory entries in the index when such directories are found?

Also, does non-cone mode sparsity paths have weird bugs around letting
users specify patterns that would happen to also exclude .gitignore
files from the working copy, but keep other sibling files within the
same directory?  In particular, if the .gitignore files aren't
present, does git add -A add a bunch of supposed-to-be-ignored files?
(If so, that's an issue that's probably tangential to this series
since it's not about cone mode or the sparse index, but it's certainly
an interesting thought.)

Maybe the whole .gitignore situation with cone mode has a simple
answer, but since this whole series is about gaining performance by
avoiding recursing into directories corresponding to sparse directory
entries, it feels the answers to how .gitignore will be handled should
at least be addressed.

>   -:  ----------- > 10:  8d55a6ba2fd sparse-checkout: add config to disable deleting dirs

I left some comments.


Thanks, as always, for all your hard work on this!
Elijah Newren Aug. 27, 2021, 10:01 p.m. UTC | #2
Sorry, one more thing...

On Fri, Aug 27, 2021 at 2:56 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Aug 24, 2021 at 2:51 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:

> >   8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
> >      @@ Commit message
> >           the sparse directory. This depends on the indexed version of the file,
> >           so the sparse directory must be expanded.
> >
> >      +    We must take special care to look for untracked, non-ignored files in
> >      +    these directories before deleting them. We do not want to delete any
> >      +    meaningful work that the users were doing in those directories and
> >      +    perhaps forgot to add and commit before switching sparse-checkout
> >      +    definitions. Since those untracked files might be code files that
> >      +    generated ignored build output, also do not delete any ignored files
> >      +    from these directories in that case. The users can recover their state
> >      +    by resetting their sparse-checkout definition to include that directory
> >      +    and continue. Alternatively, they can see the warning that is presented
> >      +    and delete the directory themselves to regain the performance they
> >      +    expect.

Will deleting the directory regain the performance they expect, or is
another step needed?  In other worse, will the sparsification of the
paths under the no-longer-needed directory into a sparse directory
entry just happen automatically as part of some other command like
commit/add, or do they need to manually run `git sparse-checkout
reapply`?

> >      +
> >           By deleting the sparse directories when changing scope (or running 'git
> >           sparse-checkout reapply') we regain these performance benefits as if the
> >           repository was in a clean state.
> >      @@ Documentation/git-sparse-checkout.txt: case-insensitive check. This corrects for
> >        'git sparse-checkout set' command to reflect the expected cone in the working
> >        directory.
> >
> >      -+The cone mode sparse-checkout patterns will also remove ignored files that
> >      -+are not within the sparse-checkout definition. This is important behavior
> >      -+to preserve the performance of the sparse index, but also matches that
> >      -+cone mode patterns care about directories, not files. If there exist files
> >      -+that are untracked and not ignored, then Git will not delete files within
> >      -+that directory other than the tracked files that are now out of scope.
> >      -+These files should be removed manually to ensure Git can behave optimally.
> >      ++When changing the sparse-checkout patterns in cone mode, Git will inspect each
> >      ++tracked directory that is not within the sparse-checkout cone to see if it
> >      ++contains any untracked files. If all of those files are ignored due to the
> >      ++`.gitignore` patterns, then the directory will be deleted. If any of the
> >      ++untracked files within that directory is not ignored, then no deletions will
> >      ++occur within that directory and a warning message will appear. If these files
> >      ++are important, then reset your sparse-checkout definition so they are included,
> >      ++use `git add` and `git commit` to store them, then remove any remaining files
> >      ++manually to ensure Git can behave optimally.
Derrick Stolee Aug. 30, 2021, 1:34 p.m. UTC | #3
On 8/27/2021 6:01 PM, Elijah Newren wrote:
> Sorry, one more thing...
> 
> On Fri, Aug 27, 2021 at 2:56 PM Elijah Newren <newren@gmail.com> wrote:
>>
>> On Tue, Aug 24, 2021 at 2:51 PM Derrick Stolee via GitGitGadget
>> <gitgitgadget@gmail.com> wrote:
> 
>>>   8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
>>>      @@ Commit message
>>>           the sparse directory. This depends on the indexed version of the file,
>>>           so the sparse directory must be expanded.
>>>
>>>      +    We must take special care to look for untracked, non-ignored files in
>>>      +    these directories before deleting them. We do not want to delete any
>>>      +    meaningful work that the users were doing in those directories and
>>>      +    perhaps forgot to add and commit before switching sparse-checkout
>>>      +    definitions. Since those untracked files might be code files that
>>>      +    generated ignored build output, also do not delete any ignored files
>>>      +    from these directories in that case. The users can recover their state
>>>      +    by resetting their sparse-checkout definition to include that directory
>>>      +    and continue. Alternatively, they can see the warning that is presented
>>>      +    and delete the directory themselves to regain the performance they
>>>      +    expect.
> 
> Will deleting the directory regain the performance they expect, or is
> another step needed?  In other worse, will the sparsification of the
> paths under the no-longer-needed directory into a sparse directory
> entry just happen automatically as part of some other command like
> commit/add, or do they need to manually run `git sparse-checkout
> reapply`?

If the directory is gone, then their performance returns to as they
expect, since the next Git command will not need to expand the index
to detect the difference between tracked, untracked, and ignored files
within the sparse directory.

(Recall that the index is stored as sparse even if it expands to a
full one in-memory during execution.)

Thanks,
-Stolee
Derrick Stolee Aug. 30, 2021, 1:54 p.m. UTC | #4
On 8/27/2021 5:56 PM, Elijah Newren wrote:
> On Tue, Aug 24, 2021 at 2:51 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> UPDATE: I had to base this version on a merge of ds/add-with-sparse-index
>> and v2.33.0-rc1 because of some changes to dir.h that became important!
> 
> Isn't this stale (i.e. just copied from v2/v3)?

Sorry, yes. It was an update to the original cover letter, but was not
updated in later versions.

>>  * I split the commit that involves cache_tree_update(). I have not yet
>>    succeeded in creating tests to demonstrate why this is required outside
>>    of needing it in the Scalar functional tests, which includes a version of
>>    partial clone. I will continue to investigate recreating this scenario in
>>    the Git test suite, but I wanted to send this version out to get feedback
>>    on the things that have changed.
> 
> Are you talking about patch 3 or patch 4?

I'm talking about patches 3 and 4, which (at the time that I created
them a few months ago, before later merging them) they each had solved
some problems in the Scalar functional tests. The issue I'm seeing now
is that since so many commands have integrated with the sparse index,
those functional tests are not testing convert_to_sparse() as heavily,
so simply reverting each does not show the problem any more.

> For patch 4, as I commented on that patch, my guess would be that you
> start with a full tree, pick some directory that you will sparsify
> away, and before sparsifying it away, change some file under that
> directory and stage the change.  Then when you sparsify the directory
> away with cone mode, I think the cache_tree_update() will need to
> write the tree.  At least, some quick greps of cache_tree_update
> suggest there aren't that many codepaths that attempt to update it, so
> I don't think anything else would have written it yet at that point.
> 
> For patch 3, since it comes before patch 5, it would be triggerable
> with the same case.  Is it possible you originally found both patches
> based off similar initial conditions that just resulted in slightly
> different behavioral reports, and thus that patch 4 is good enough
> without patch 3?  Trying to ignore that possibility and looking at the
> cache_tree_update() code, the only other reasons for failure of
> cache_tree_update() that I can see are (a) someone having invalid
> hashes recorded in the index (e.g. someone using `git update-index
> --index-info` with some invalid hashes, or someone manually deleting
> some loose blobs from under their .git/ directory), or (b) disk being
> full or read-only so that writing new tree objects to the object store
> fail.

Just for full context:

* Patch 3 removes a warning and "return -1" when cache_tree_update()
  fails. This makes convert_to_sparse() resilient to any failure that
  might happen in cache_tree_update(), whatever it might be.

* Patch 4 adds the WRITE_TREE_MISSING_OK, which removes one possible
  reason that cache_tree_update() could fail (and removes an attempted
  promisor fetch in the case of partial clone).

So I could see that either of the two patches would be sufficient to
make tests pass in all normally-testable scenarios. However, patch 3
is valuable for possibly more complicated failure scenarios such as
corrupted data, while patch 4 is valuable for performance reasons in
a very specific environment.

>> base-commit: 80b8d6c56b8a5f5db1d5c2a0159fd808e8a7fc4f
>> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1009%2Fderrickstolee%2Fsparse-index%2Fignored-files-v4
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1009/derrickstolee/sparse-index/ignored-files-v4
>> Pull-Request: https://github.com/gitgitgadget/git/pull/1009
> 
> A super micro nit:
> 
> It appears this is based on your other branches that have since been
> incorporated into master.  Not a big deal, but rebasing on master
> might make it easier for others to take a look at the series (I was
> confused at first why this series depended on other unmerged series,
> before realizing those other series had since merged down).  I know we
> normally don't like rebasing series, but I suspect it wouldn't change
> the range-diff (as I don't think others are touching these areas of
> the code), and thus be transparent to Junio.

The 'git add' integration only recently hit 'master', so I have not
had a chance to base on a version of that branch with the necessary
dependencies.

>>   6:  c9e100e68f8 !  7:  e496f3cee66 attr: be careful about sparse directories
>>      @@ attr.c: static struct attr_stack *read_attr_from_index(struct index_state *istat
>>                 return NULL;
>>
>>       + /*
>>      -+  * In the case of cone-mode sparse-checkout, getting the
>>      -+  * .gitattributes file from a directory is meaningless: all
>>      -+  * contained paths will be sparse if the .gitattributes is also
>>      -+  * sparse. In the case of a sparse index, it is critical that we
>>      -+  * don't go looking for one as it will expand the index.
>>      ++  * The .gitattributes file only applies to files within its
>>      ++  * parent directory. In the case of cone-mode sparse-checkout,
>>      ++  * the .gitattributes file is sparse if and only if all paths
>>      ++  * within that directory are also sparse. Thus, don't load the
>>      ++  * .gitattributes file since it will not matter.
>>      ++  *
>>      ++  * In the case of a sparse index, it is critical that we don't go
>>      ++  * looking for a .gitattributes file, as the index will expand.
> 
> This comment is better, and I appreciate the thought behind adding the
> second paragraph, but the wording seems fuzzy and potentially
> confusing to future readers.  Perhaps changing the wording to:
> 
> ...looking for a .gitattributes file, as doing so would cause the
> index to expand.

Thanks.

>>   8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
>>      @@ Commit message
>>           the sparse directory. This depends on the indexed version of the file,
>>           so the sparse directory must be expanded.
>>
>>      +    We must take special care to look for untracked, non-ignored files in
>>      +    these directories before deleting them. We do not want to delete any
>>      +    meaningful work that the users were doing in those directories and
>>      +    perhaps forgot to add and commit before switching sparse-checkout
>>      +    definitions. Since those untracked files might be code files that
>>      +    generated ignored build output, also do not delete any ignored files
>>      +    from these directories in that case. The users can recover their state
>>      +    by resetting their sparse-checkout definition to include that directory
>>      +    and continue. Alternatively, they can see the warning that is presented
>>      +    and delete the directory themselves to regain the performance they
>>      +    expect.
>>      +
...
>>      ++When changing the sparse-checkout patterns in cone mode, Git will inspect each
>>      ++tracked directory that is not within the sparse-checkout cone to see if it
>>      ++contains any untracked files. If all of those files are ignored due to the
>>      ++`.gitignore` patterns, then the directory will be deleted. If any of the
>>      ++untracked files within that directory is not ignored, then no deletions will
>>      ++occur within that directory and a warning message will appear. If these files
>>      ++are important, then reset your sparse-checkout definition so they are included,
>>      ++use `git add` and `git commit` to store them, then remove any remaining files
>>      ++manually to ensure Git can behave optimally.
...
> Thanks for the extra explanations; those help.
> 
> You haven't yet addressed how this choice will interact with
> .gitignore files, though (as I commented on v3).  You alluded in patch
> 7 to how it was critical to avoid looking for .gitattributes files
> because doing so would require the index to expand, possibly
> recursively all the way down.  Don't .gitignore files have the same
> problem?  Nuking all the ignored files so that you can delete extra
> directories and thus not need to walk into them during operations
> helps, but does the fact that we sometimes avoid removing directories
> mean we have to walk into them anyway?  Do we only need to walk into
> such directories when they are (still? again?) present in the working
> copy?  Does there need to be some code added somewhere that checks for
> the presence of a directory in the working copy and unsparsifies
> sparse directory entries in the index when such directories are found?
> 
> Also, does non-cone mode sparsity paths have weird bugs around letting
> users specify patterns that would happen to also exclude .gitignore
> files from the working copy, but keep other sibling files within the
> same directory?  In particular, if the .gitignore files aren't
> present, does git add -A add a bunch of supposed-to-be-ignored files?
> (If so, that's an issue that's probably tangential to this series
> since it's not about cone mode or the sparse index, but it's certainly
> an interesting thought.)
> 
> Maybe the whole .gitignore situation with cone mode has a simple
> answer, but since this whole series is about gaining performance by
> avoiding recursing into directories corresponding to sparse directory
> entries, it feels the answers to how .gitignore will be handled should
> at least be addressed.

The goal of this series is to allow Git to operate as normal (respecting
.gitattributes and .gitignore files as part of that) without needing to
expand a sparse index. The way it achieves this is by deleting the
directories that leave the sparse-checkout cone. If those directories
don't exist, then we don't care about the .gitattributes or .gitignore
files in those locations. When the directory _does_ exist, then we need
to look for the .gitignore file which expands the index to full. This
is done naturally by Git using index_name_pos() for "<dir>/.gitignore"
which finds that "<dir>" (or one of its parent directories) is a
sparse directory, then expands the index and searches for
"<dir>/.gitignore" again, returning the position if it exists in the
fully-expanded index.

The difference between the .gitattributes and .gitignore files is that
we don't need to do anything with the .gitignore files to prevent Git
from accessing them with their contained directories don't exist. The
.gitattributes change in attr.c is reachable from some code path even
when the directory doesn't exist, so it needs special logic to avoid
that expansion.

When we don't succeed in deleting sparse directories, performance
suffers. That's the reason behind having a very-visible warnings to
users that something is blocking the removal. We will be monitoring
performance of users where it is _very clear_ when this happens. From
that experience, we might discover new ways to assist them to get
into a better state (perhaps a warning during "git status" would help?).

Thanks,
-Stolee
Elijah Newren Aug. 30, 2021, 8:14 p.m. UTC | #5
On Mon, Aug 30, 2021 at 6:34 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 8/27/2021 6:01 PM, Elijah Newren wrote:
> > Sorry, one more thing...
> >
> > On Fri, Aug 27, 2021 at 2:56 PM Elijah Newren <newren@gmail.com> wrote:
> >>
> >> On Tue, Aug 24, 2021 at 2:51 PM Derrick Stolee via GitGitGadget
> >> <gitgitgadget@gmail.com> wrote:
> >
> >>>   8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
> >>>      @@ Commit message
> >>>           the sparse directory. This depends on the indexed version of the file,
> >>>           so the sparse directory must be expanded.
> >>>
> >>>      +    We must take special care to look for untracked, non-ignored files in
> >>>      +    these directories before deleting them. We do not want to delete any
> >>>      +    meaningful work that the users were doing in those directories and
> >>>      +    perhaps forgot to add and commit before switching sparse-checkout
> >>>      +    definitions. Since those untracked files might be code files that
> >>>      +    generated ignored build output, also do not delete any ignored files
> >>>      +    from these directories in that case. The users can recover their state
> >>>      +    by resetting their sparse-checkout definition to include that directory
> >>>      +    and continue. Alternatively, they can see the warning that is presented
> >>>      +    and delete the directory themselves to regain the performance they
> >>>      +    expect.
> >
> > Will deleting the directory regain the performance they expect, or is
> > another step needed?  In other worse, will the sparsification of the
> > paths under the no-longer-needed directory into a sparse directory
> > entry just happen automatically as part of some other command like
> > commit/add, or do they need to manually run `git sparse-checkout
> > reapply`?
>
> If the directory is gone, then their performance returns to as they
> expect, since the next Git command will not need to expand the index
> to detect the difference between tracked, untracked, and ignored files
> within the sparse directory.
>
> (Recall that the index is stored as sparse even if it expands to a
> full one in-memory during execution.)

Right, as with the other email, this was the piece that hadn't quite
stuck or registered for me.  Thanks for clarifying.
Elijah Newren Aug. 30, 2021, 8:23 p.m. UTC | #6
On Mon, Aug 30, 2021 at 6:54 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 8/27/2021 5:56 PM, Elijah Newren wrote:
> > On Tue, Aug 24, 2021 at 2:51 PM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
...
> >>   8:  febef675f05 !  9:  c19d93ec5d7 sparse-checkout: clear tracked sparse dirs
> >>      @@ Commit message
> >>           the sparse directory. This depends on the indexed version of the file,
> >>           so the sparse directory must be expanded.
> >>
> >>      +    We must take special care to look for untracked, non-ignored files in
> >>      +    these directories before deleting them. We do not want to delete any
> >>      +    meaningful work that the users were doing in those directories and
> >>      +    perhaps forgot to add and commit before switching sparse-checkout
> >>      +    definitions. Since those untracked files might be code files that
> >>      +    generated ignored build output, also do not delete any ignored files
> >>      +    from these directories in that case. The users can recover their state
> >>      +    by resetting their sparse-checkout definition to include that directory
> >>      +    and continue. Alternatively, they can see the warning that is presented
> >>      +    and delete the directory themselves to regain the performance they
> >>      +    expect.
> >>      +
> ...
> >>      ++When changing the sparse-checkout patterns in cone mode, Git will inspect each
> >>      ++tracked directory that is not within the sparse-checkout cone to see if it
> >>      ++contains any untracked files. If all of those files are ignored due to the
> >>      ++`.gitignore` patterns, then the directory will be deleted. If any of the
> >>      ++untracked files within that directory is not ignored, then no deletions will
> >>      ++occur within that directory and a warning message will appear. If these files
> >>      ++are important, then reset your sparse-checkout definition so they are included,
> >>      ++use `git add` and `git commit` to store them, then remove any remaining files
> >>      ++manually to ensure Git can behave optimally.
> ...
> > Thanks for the extra explanations; those help.
> >
> > You haven't yet addressed how this choice will interact with
> > .gitignore files, though (as I commented on v3).
...
> The difference between the .gitattributes and .gitignore files is that
> we don't need to do anything with the .gitignore files to prevent Git
> from accessing them with their contained directories don't exist. The
> .gitattributes change in attr.c is reachable from some code path even
> when the directory doesn't exist, so it needs special logic to avoid
> that expansion.

Ah, this is what I was looking for and couldn't quite work out --
whether .gitignore files might need to be loaded (and whether our
other UI decisions were affecting that).  Sounds like you have it
covered.

Thanks for patiently explaining these various items I was not
understanding or missing.  Sorry I missed a couple key things despite
reading the earlier rounds.  Anyway, looks like we're in pretty good
shape and there's only a few minor things left, mostly stuff you
brought up yourself that you wanted to tweak for v5.