diff mbox series

[v3,6/8] read-tree: narrow scope of index expansion for '--prefix'

Message ID 1a9365c3bc5b810a60593612bfba97a8b0c1d657.1646166271.git.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 749703924121b9eb2750b4313b2c769113c8b310
Headers show
Series Sparse index: integrate with 'read-tree' | expand

Commit Message

Victoria Dye March 1, 2022, 8:24 p.m. UTC
From: Victoria Dye <vdye@github.com>

When 'git read-tree' is provided with a prefix, expand the index only if the
prefix is equivalent to a sparse directory or contained within one. If the
index is not expanded in these cases, 'ce_in_traverse_path' will indicate
that the relevant sparse directory is not in the prefix/traverse path,
skipping past it and not unpacking the appropriate tree(s).

If the prefix is in-cone, its sparse subdirectories (if any) will be
traversed correctly without index expansion.

The behavior of 'git read-tree' with prefixes 1) inside of cone, 2) equal to
a sparse directory, and 3) inside a sparse directory are all tested as part
of the 't/t1092-sparse-checkout-compatibility.sh' test 'read-tree --prefix',
ensuring that the sparse index case works the way it did prior to this
change as well as matching non-sparse index sparse-checkout.

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Victoria Dye <vdye@github.com>
---
 builtin/read-tree.c                      |  3 +-
 t/t1092-sparse-checkout-compatibility.sh |  8 ++++-
 unpack-trees.c                           | 38 ++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 3 deletions(-)

Comments

Glen Choo March 3, 2022, 5:54 p.m. UTC | #1
"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Victoria Dye <vdye@github.com>
> +static void update_sparsity_for_prefix(const char *prefix,
> +				       struct index_state *istate)
> +{
> +	int prefix_len = strlen(prefix);
> +	struct strbuf ce_prefix = STRBUF_INIT;
> +
> +	if (!istate->sparse_index)
> +		return;
> +
> +	while (prefix_len > 0 && prefix[prefix_len - 1] == '/')
> +		prefix_len--;
> +
> +	if (prefix_len <= 0)
> +		BUG("Invalid prefix passed to update_sparsity_for_prefix");
> +
> +	strbuf_grow(&ce_prefix, prefix_len + 1);
> +	strbuf_add(&ce_prefix, prefix, prefix_len);
> +	strbuf_addch(&ce_prefix, '/');
> +
> +	/*
> +	 * If the prefix points to a sparse directory or a path inside a sparse
> +	 * directory, the index should be expanded. This is accomplished in one
> +	 * of two ways:
> +	 * - if the prefix is inside a sparse directory, it will be expanded by
> +	 *   the 'ensure_full_index(...)' call in 'index_name_pos(...)'.
> +	 * - if the prefix matches an existing sparse directory entry,
> +	 *   'index_name_pos(...)' will return its index position, triggering
> +	 *   the 'ensure_full_index(...)' below.
> +	 */
> +	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate) &&
> +	    index_name_pos(istate, ce_prefix.buf, ce_prefix.len) >= 0)
> +		ensure_full_index(istate);
> +
> +	strbuf_release(&ce_prefix);
> +}

Hm, I don't think I follow the rationale for having two different ways
of expanding the index:

- If the prefix is inside a sparse directory, we should expand the
  index.
- If the prefix matches a sparse directory entry, we should expand the
  index.

So it seems like distinguishing between the two cases with
index_name_pos(...) isn't necessary. I've attached a diff that does
exactly this, and it passes t1092-sparse-checkout-compatibility.sh as
far as I can tell. I've also amended the comment in a way that makes
more sense to me, but I'm not 100% sure if it's accurate.

I'm also a little averse to using a side effect of index_name_pos() to
achieve what we really want, so I'd prefer to get rid of the call if we
can :)

----- >8 --------- >8 --------- >8 --------- >8 --------- >8 ----

diff --git a/unpack-trees.c b/unpack-trees.c
index b876caca0d..5b07055605 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1749,17 +1749,11 @@ static void update_sparsity_for_prefix(const char *prefix,
 	strbuf_addch(&ce_prefix, '/');
 
 	/*
-	 * If the prefix points to a sparse directory or a path inside a sparse
-	 * directory, the index should be expanded. This is accomplished in one
-	 * of two ways:
-	 * - if the prefix is inside a sparse directory, it will be expanded by
-	 *   the 'ensure_full_index(...)' call in 'index_name_pos(...)'.
-	 * - if the prefix matches an existing sparse directory entry,
-	 *   'index_name_pos(...)' will return its index position, triggering
-	 *   the 'ensure_full_index(...)' below.
+	 * If the prefix points to a sparse directory or a path inside a
+	 * sparse directory (not within the sparse patterns), the index
+	 * should be expanded.
 	 */
-	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate) &&
-	    index_name_pos(istate, ce_prefix.buf, ce_prefix.len) >= 0)
+	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate))
 		ensure_full_index(istate);
 
 	strbuf_release(&ce_prefix);
Victoria Dye March 3, 2022, 9:19 p.m. UTC | #2
Glen Choo wrote:
> "Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Victoria Dye <vdye@github.com>
>> +static void update_sparsity_for_prefix(const char *prefix,
>> +				       struct index_state *istate)
>> +{
>> +	int prefix_len = strlen(prefix);
>> +	struct strbuf ce_prefix = STRBUF_INIT;
>> +
>> +	if (!istate->sparse_index)
>> +		return;
>> +
>> +	while (prefix_len > 0 && prefix[prefix_len - 1] == '/')
>> +		prefix_len--;
>> +
>> +	if (prefix_len <= 0)
>> +		BUG("Invalid prefix passed to update_sparsity_for_prefix");
>> +
>> +	strbuf_grow(&ce_prefix, prefix_len + 1);
>> +	strbuf_add(&ce_prefix, prefix, prefix_len);
>> +	strbuf_addch(&ce_prefix, '/');
>> +
>> +	/*
>> +	 * If the prefix points to a sparse directory or a path inside a sparse
>> +	 * directory, the index should be expanded. This is accomplished in one
>> +	 * of two ways:
>> +	 * - if the prefix is inside a sparse directory, it will be expanded by
>> +	 *   the 'ensure_full_index(...)' call in 'index_name_pos(...)'.
>> +	 * - if the prefix matches an existing sparse directory entry,
>> +	 *   'index_name_pos(...)' will return its index position, triggering
>> +	 *   the 'ensure_full_index(...)' below.
>> +	 */
>> +	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate) &&
>> +	    index_name_pos(istate, ce_prefix.buf, ce_prefix.len) >= 0)
>> +		ensure_full_index(istate);
>> +
>> +	strbuf_release(&ce_prefix);
>> +}
> 
> Hm, I don't think I follow the rationale for having two different ways
> of expanding the index:
> 
> - If the prefix is inside a sparse directory, we should expand the
>   index.
> - If the prefix matches a sparse directory entry, we should expand the
>   index.
> 

There's an (admittedly subtle if you aren't familiar with sparse index)
distinction between a "sparse directory" index entry and "a directory
outside the sparse-checkout cone with SKIP_WORKTREE enabled on its entries".
Ultimately, that's what necessitates the two checks but, as in [1], I want
to use this as an opportunity to shed some light on what 'unpack_trees(...)'
does.

Taking a step back, why would index expansion matter when you pass a prefix
to 'read-tree'? The answer lies in the tree traversal at the core of
'unpack_trees(...)'; when a prefix is provided, 'unpack_trees' does the
following:

1. Set the "traversal prefix" to the user-provided prefix.
2. Call 'traverse_trees(...)', iterating through the child files/directories
   of the prefix directory (NOTE: this does *not* use the index - it finds
   the child entries by extracting them from the tree).
3. For each child, call 'unpack_callback' which, among other things, looks
   for the child entry as it exists in the index to merge the input trees
   into it.

The problem with sparse directories arises in step 3. Suppose you have the
following repo:

.
├── bar
│   └── f1
├── baz
│   ├── deep
│   │   └── a
│   └── f2
├── foo
└── foo1


where directory 'baz/' is sparse, leading to an index that looks like this
(per 'git ls-files -t --sparse'):

H bar/f1
S baz/
H foo
H foo1

If you provide 'baz/' as a prefix, 'unpack_callback(...)' tries to find
'baz/deep/a' and 'baz/f2' in the index, but they won't be found because
they're "wrapped" in sparse directory 'baz/'. Ultimately, this leads to a
corrupted index with duplicate 'baz/' contents merged in:

H bar/f1
S baz/
S baz/deep/a
S baz/f2
H foo
H foo1

This explains why you need to expand the index at all, but not why you need
to check that the prefix is not in the sparse cone *and* that it (or its
parent directory) doesn't exist in the index. For that, an important bit of
context is that you can have a sparse index with non-sparse directories
outside the cone. This can happen in a number of ways (for example, running
'git update-index --no-skip-worktree' on a file in a sparse directory), but
the important thing is that it is a completely valid state and not entirely
uncommon. 

Using our example above, suppose 'baz/' is partially expanded in the index,
with the following index contents:

H bar/f1
S baz/deep/
S baz/f2
H foo
H foo1

If we use the prefix 'baz/' here, we actually traverse the trees properly:
'baz/deep/' and 'baz/f2' will be found and merged - no index expansion
needed! But if we only checked '!path_in_cone_mode_sparse_checkout(...)', we
would have expanded the index because 'baz/' is outside the sparse cone. 

This presents a problem because index expansion is *extremely* expensive -
we should avoid it whenever possible. That's where checking
'index_name_pos(...)' comes in: if the directory is in the index as a sparse
directory, the position is '>= 0' and 'ensure_full_index(...)' is called; if
the directory is inside an existing sparse directory, the position will be
'< 0' but the index will be expanded implicitly. In every other case, we
avoid expanding the index and proceed with the merge as normal.

Hope this helps!

[1] https://lore.kernel.org/git/dc47f12b-8724-22ef-ed2c-096badfafd76@github.com/

> So it seems like distinguishing between the two cases with
> index_name_pos(...) isn't necessary. I've attached a diff that does
> exactly this, and it passes t1092-sparse-checkout-compatibility.sh as
> far as I can tell. I've also amended the comment in a way that makes
> more sense to me, but I'm not 100% sure if it's accurate.
> 
> I'm also a little averse to using a side effect of index_name_pos() to
> achieve what we really want, so I'd prefer to get rid of the call if we
> can :)
> 
> ----- >8 --------- >8 --------- >8 --------- >8 --------- >8 ----
> 
> diff --git a/unpack-trees.c b/unpack-trees.c
> index b876caca0d..5b07055605 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1749,17 +1749,11 @@ static void update_sparsity_for_prefix(const char *prefix,
>  	strbuf_addch(&ce_prefix, '/');
>  
>  	/*
> -	 * If the prefix points to a sparse directory or a path inside a sparse
> -	 * directory, the index should be expanded. This is accomplished in one
> -	 * of two ways:
> -	 * - if the prefix is inside a sparse directory, it will be expanded by
> -	 *   the 'ensure_full_index(...)' call in 'index_name_pos(...)'.
> -	 * - if the prefix matches an existing sparse directory entry,
> -	 *   'index_name_pos(...)' will return its index position, triggering
> -	 *   the 'ensure_full_index(...)' below.
> +	 * If the prefix points to a sparse directory or a path inside a
> +	 * sparse directory (not within the sparse patterns), the index
> +	 * should be expanded.
>  	 */
> -	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate) &&
> -	    index_name_pos(istate, ce_prefix.buf, ce_prefix.len) >= 0)
> +	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate))
>  		ensure_full_index(istate);
>  
>  	strbuf_release(&ce_prefix);
Glen Choo March 4, 2022, 6:47 p.m. UTC | #3
Victoria Dye <vdye@github.com> writes:
> There's an (admittedly subtle if you aren't familiar with sparse index)
> distinction between a "sparse directory" index entry and "a directory
> outside the sparse-checkout cone with SKIP_WORKTREE enabled on its entries".
> Ultimately, that's what necessitates the two checks but, as in [1], I want
> to use this as an opportunity to shed some light on what 'unpack_trees(...)'
> does.

Thanks! This explanation was really helpful.

> Using our example above, suppose 'baz/' is partially expanded in the index,
> with the following index contents:
>
> H bar/f1
> S baz/deep/
> S baz/f2
> H foo
> H foo1
>
> If we use the prefix 'baz/' here, we actually traverse the trees properly:
> 'baz/deep/' and 'baz/f2' will be found and merged - no index expansion
> needed! But if we only checked '!path_in_cone_mode_sparse_checkout(...)', we
> would have expanded the index because 'baz/' is outside the sparse cone. 

In particular, I didn't consider that a directory outside of the
sparse-checkout cone could be partially expanded. This seems to be the
crux of it, which is that even if the path is outside of the
sparse-checkout clone, we can still get correct behavior (without
expanding the index) if its entries are expanded...

> This presents a problem because index expansion is *extremely* expensive -
> we should avoid it whenever possible. That's where checking
> 'index_name_pos(...)' comes in: if the directory is in the index as a sparse
> directory, the position is '>= 0' and 'ensure_full_index(...)' is called; if
> the directory is inside an existing sparse directory, the position will be
> '< 0' but the index will be expanded implicitly. In every other case, we
> avoid expanding the index and proceed with the merge as normal.

and because of this, we don't always need to expand the index when the
path is outside of the cone, so my suggested patch expands the index in
too many cases.

What I also didn't consider is that index_name_pos() doesn't _always_
expand the index, it only expands the index when the directory is
inside a sparse directory entry.

So the side-effect of index_name_pos() is actually _exactly_ what we
want. Granted, it would be clearer if we had a function that did _only_
'expand if path is inside a sparse directory entry', but I suppose it's
overkill.

(Purely optional suggestion) I wonder if we could add a test that can
distinguish between 'always expand if --prefix is outside of the cone'
vs 'expand only if path is outside of cone AND inside a sparse directory
entry'. The scenario you described sounds perfect as a test case, though
I don't know how feasible it is to set up.
diff mbox series

Patch

diff --git a/builtin/read-tree.c b/builtin/read-tree.c
index 0a52cab7752..ec6d038242a 100644
--- a/builtin/read-tree.c
+++ b/builtin/read-tree.c
@@ -217,8 +217,7 @@  int cmd_read_tree(int argc, const char **argv, const char *cmd_prefix)
 	if (opts.merge && !opts.index_only)
 		setup_work_tree();
 
-	/* TODO: audit sparse index behavior in unpack_trees */
-	if (opts.skip_sparse_checkout || opts.prefix)
+	if (opts.skip_sparse_checkout)
 		ensure_full_index(&the_index);
 
 	if (opts.merge) {
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 86241b01a59..d98558f3238 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -1417,7 +1417,13 @@  test_expect_success 'sparse index is not expanded: read-tree' '
 	do
 		ensure_not_expanded read-tree -mu $MERGE_TREES &&
 		ensure_not_expanded reset --hard || return 1
-	done
+	done &&
+
+	rm -rf sparse-index/deep/deeper2 &&
+	ensure_not_expanded add . &&
+	ensure_not_expanded commit -m "test" &&
+
+	ensure_not_expanded read-tree --prefix=deep/deeper2 -u deepest
 '
 
 test_expect_success 'ls-files' '
diff --git a/unpack-trees.c b/unpack-trees.c
index 360844bda3a..f3667d85ec5 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1693,6 +1693,41 @@  static void populate_from_existing_patterns(struct unpack_trees_options *o,
 		o->pl = pl;
 }
 
+static void update_sparsity_for_prefix(const char *prefix,
+				       struct index_state *istate)
+{
+	int prefix_len = strlen(prefix);
+	struct strbuf ce_prefix = STRBUF_INIT;
+
+	if (!istate->sparse_index)
+		return;
+
+	while (prefix_len > 0 && prefix[prefix_len - 1] == '/')
+		prefix_len--;
+
+	if (prefix_len <= 0)
+		BUG("Invalid prefix passed to update_sparsity_for_prefix");
+
+	strbuf_grow(&ce_prefix, prefix_len + 1);
+	strbuf_add(&ce_prefix, prefix, prefix_len);
+	strbuf_addch(&ce_prefix, '/');
+
+	/*
+	 * If the prefix points to a sparse directory or a path inside a sparse
+	 * directory, the index should be expanded. This is accomplished in one
+	 * of two ways:
+	 * - if the prefix is inside a sparse directory, it will be expanded by
+	 *   the 'ensure_full_index(...)' call in 'index_name_pos(...)'.
+	 * - if the prefix matches an existing sparse directory entry,
+	 *   'index_name_pos(...)' will return its index position, triggering
+	 *   the 'ensure_full_index(...)' below.
+	 */
+	if (!path_in_cone_mode_sparse_checkout(ce_prefix.buf, istate) &&
+	    index_name_pos(istate, ce_prefix.buf, ce_prefix.len) >= 0)
+		ensure_full_index(istate);
+
+	strbuf_release(&ce_prefix);
+}
 
 static int verify_absent(const struct cache_entry *,
 			 enum unpack_trees_error_types,
@@ -1739,6 +1774,9 @@  int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 		setup_standard_excludes(o->dir);
 	}
 
+	if (o->prefix)
+		update_sparsity_for_prefix(o->prefix, o->src_index);
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {