mbox series

[v6,0/8] Sparse Index: integrate with reset

Message ID pull.1048.v6.git.1638201164.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Sparse Index: integrate with reset | expand

Message

Linus Arver via GitGitGadget Nov. 29, 2021, 3:52 p.m. UTC
This series integrates the sparse index with git reset and provides
miscellaneous fixes and improvements to the command in sparse checkouts.
This includes:

 1. tests added to t1092 and p2000 to establish the baseline functionality
    of the command
 2. repository settings to enable the sparse index with ensure_full_index
    guarding any code paths that break tests without other compatibility
    updates.
 3. modifications to remove or reduce the scope in which ensure_full_index
    must be called.

The sparse index updates are predicated on a fix originating from the
microsoft/git fork [1], correcting how git reset --mixed handles resetting
entries outside the sparse checkout definition. Additionally, a performance
"bug" in next_cache_entry with sparse index is corrected, preventing
repeatedly looping over already-searched entries.

The p2000 tests demonstrate a ~70% execution time reduction in git reset
using a sparse index, and no change (within expected variability [2]) using
a full index. Results summarized below [3, 4]:

Test                           base              [5/8]                 
-----------------------------------------------------------------------
git reset --hard (full-v3)     1.00(0.50+0.39)   0.97(0.50+0.37) -3.0% 
git reset --hard (full-v4)     1.00(0.51+0.38)   0.96(0.50+0.36) -4.0% 
git reset --hard (sparse-v3)   1.68(1.17+0.39)   1.37(0.91+0.35) -18.5%
git reset --hard (sparse-v4)   1.70(1.18+0.40)   1.41(0.94+0.35) -17.1%

Test                           base              [6/8]   
-----------------------------------------------------------------------
git reset --hard (full-v3)     1.00(0.50+0.39)   0.94(0.48+0.34) -6.0% 
git reset --hard (full-v4)     1.00(0.51+0.38)   0.95(0.51+0.34) -5.0% 
git reset --hard (sparse-v3)   1.68(1.17+0.39)   0.46(0.05+0.29) -72.6%
git reset --hard (sparse-v4)   1.70(1.18+0.40)   0.46(0.06+0.29) -72.9%

Test                               base              [7/8]
---------------------------------------------------------------------------
git reset (full-v3)                0.77(0.27+0.37)   0.72(0.26+0.32) -6.5%
git reset (full-v4)                0.75(0.27+0.34)   0.73(0.26+0.32) -2.7%
git reset (sparse-v3)              1.44(0.96+0.36)   0.43(0.04+0.96) -70.1%
git reset (sparse-v4)              1.46(0.97+0.36)   0.43(0.05+0.79) -70.5%
git reset -- missing (full-v3)     0.72(0.26+0.32)   0.69(0.26+0.30) -4.2%
git reset -- missing (full-v4)     0.74(0.28+0.33)   0.71(0.27+0.32) -4.1% 
git reset -- missing (sparse-v3)   1.45(0.97+0.35)   0.81(0.42+0.90) -44.1%
git reset -- missing (sparse-v4)   1.41(0.94+0.34)   0.79(0.42+0.76) -44.0%

Test                               base              [8/8]            
---------------------------------------------------------------------------
git reset -- missing (full-v3)     0.72(0.26+0.32)   0.73(0.26+0.33) +1.4% 
git reset -- missing (full-v4)     0.74(0.28+0.33)   0.74(0.27+0.32) +0.0% 
git reset -- missing (sparse-v3)   1.45(0.97+0.35)   0.43(0.05+0.80) -70.3%
git reset -- missing (sparse-v4)   1.41(0.94+0.34)   0.44(0.05+0.76) -68.8%



Changes since V1
================

 * Add --force-full-index option to update-index. The option is used
   circumvent changing command_requires_full_index from its default value -
   right now this is effectively a no-op, but will change once update-index
   is integrated with sparse index. By using this option in the t1092
   expand/collapse test, the command used to test will not need to be
   updated with subsequent sparse index integrations.
 * Update implementation of mixed reset for entries outside sparse checkout
   definition. The condition in which a file should be checked out before
   index reset is simplified to "if it has skip-worktree enabled and a reset
   would change the file, check it out".
   * After checking the behavior of update_index_from_diff with renames,
     found that the diff used by reset does not produce diff queue entries
     with different pathnames for one and two. Because of this, and that
     nothing in the implementation seems to rely on identical path names, no
     BUG check is added.
 * Correct a bug in the sparse index is not expanded tests in t1092 where
   failure of a git reset --mixed test was not being reported. Test now
   verifies an appropriate scenario with corrected failure-checking.


Changes since V2
================

 * Replace patch adding checkouts for git reset --mixed with sparse checkout
   with preserving the skip-worktree flag (including a new test for git
   reset --mixed and update to t1092 - checkout and reset (mixed))
 * Move rename of is_missing into its own patch
 * Further extend t1092 tests and remove unnecessary commands/tests where
   possible
 * Refine logic determining which pathspecs require ensure_full_index in git
   reset --mixed, add related ensure_not_expanded tests
 * Add index_search_mode enum to index_name_stage_pos
 * Clean up variable usage & remove unnecessary subtree_path in
   prime_cache_tree_rec
 * Update cover letter performance data
 * More thoroughly explain changes in each commit message


Changes since V3
================

 * Replace git update-index --force-full-index with git reset update-folder1
   -- folder1/a, remove introduction of new --force-full-index option
   entirely, and add comment clarifying the intent of sparse-index is
   expanded and converted back test
 * Fix authorship on reset: preserve skip-worktree bit in mixed reset
   (current patch fully replaces original patch, but metadata of the
   original wasn't properly replaced)


Changes since V4
================

 * Update t1092 test 'checkout and reset (mixed)' to explicitly verify
   differences between sparse and full checkouts


Changes since V5
================

 * Update t1092 test 'reset with wildcard pathspec' with more cases and
   better checks
 * Add "special case" wildcard pathspec check when determining whether to
   expand the index (avoids double-loop over pathspecs & index entries)

Thanks! -Victoria

[1] microsoft@6b8a074 [2]
https://lore.kernel.org/git/8b9fe3f8-f0e3-4567-b20b-17c92bd1a5c5@github.com/
[3] If a test and/or commit is not mentioned, there is no significant change
to performance [4] Pathspec "does-not-exist" is changed to "missing" to save
space in performance report

Victoria Dye (8):
  reset: rename is_missing to !is_in_reset_tree
  reset: preserve skip-worktree bit in mixed reset
  sparse-index: update command for expand/collapse test
  reset: expand test coverage for sparse checkouts
  reset: integrate with sparse index
  reset: make sparse-aware (except --mixed)
  reset: make --mixed sparse-aware
  unpack-trees: improve performance of next_cache_entry

 builtin/reset.c                          | 113 ++++++++++++++++-
 cache-tree.c                             |  46 ++++++-
 cache.h                                  |  10 ++
 read-cache.c                             |  27 ++--
 t/perf/p2000-sparse-operations.sh        |   3 +
 t/t1092-sparse-checkout-compatibility.sh | 154 ++++++++++++++++++++---
 t/t7102-reset.sh                         |  17 +++
 unpack-trees.c                           |  23 +++-
 8 files changed, 356 insertions(+), 37 deletions(-)


base-commit: cefe983a320c03d7843ac78e73bd513a27806845
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1048%2Fvdye%2Fvdye%2Fsparse-index-part1-v6
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1048/vdye/vdye/sparse-index-part1-v6
Pull-Request: https://github.com/gitgitgadget/git/pull/1048

Range-diff vs v5:

 1:  ad7013a31aa = 1:  ad7013a31aa reset: rename is_missing to !is_in_reset_tree
 2:  b221b00b7e0 = 2:  b221b00b7e0 reset: preserve skip-worktree bit in mixed reset
 3:  1bb2ca92c60 = 3:  1bb2ca92c60 sparse-index: update command for expand/collapse test
 4:  cc76c694647 ! 4:  741a2c9ffaa reset: expand test coverage for sparse checkouts
     @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'checkout and rese
      +test_expect_success 'reset with wildcard pathspec' '
      +	init_repos &&
      +
     -+	test_all_match git checkout -b reset-test update-deep &&
     -+	test_all_match git reset base -- \*/a &&
     -+	test_all_match git status --porcelain=v2 &&
     -+	test_all_match git rev-parse HEAD:folder1/a &&
     ++	test_all_match git reset update-deep -- deep\* &&
     ++	test_all_match git ls-files -s -- deep &&
      +
     -+	test_all_match git reset base -- folder\* &&
     -+	test_all_match git status --porcelain=v2 &&
     -+	test_all_match git rev-parse HEAD:folder2
     ++	test_all_match git reset deepest -- deep\*\*\* &&
     ++	test_all_match git ls-files -s -- deep &&
     ++
     ++	# The following `git reset`s result in updating the index on files with
     ++	# `skip-worktree` enabled. To avoid failing due to discrepencies in reported
     ++	# "modified" files, `test_sparse_match` reset is performed separately from
     ++	# "full-checkout" reset, then the index contents of all repos are verified.
     ++
     ++	test_sparse_match git reset update-folder1 -- \*/a &&
     ++	git -C full-checkout reset update-folder1 -- \*/a &&
     ++	test_all_match git ls-files -s -- deep/a folder1/a &&
     ++
     ++	test_sparse_match git reset update-folder2 -- folder\* &&
     ++	git -C full-checkout reset update-folder2 -- folder\* &&
     ++	test_all_match git ls-files -s -- folder10 folder1 folder2 &&
     ++
     ++	test_sparse_match git reset base -- folder1/\* &&
     ++	git -C full-checkout reset base -- folder1/\* &&
     ++	test_all_match git ls-files -s -- folder1
      +'
      +
       test_expect_success 'merge, cherry-pick, and rebase' '
 5:  217ae445418 = 5:  65b0eafd27c reset: integrate with sparse index
 6:  a3e2fd59867 = 6:  908c84005b9 reset: make sparse-aware (except --mixed)
 7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
     @@ Commit message
      
          Remove the `ensure_full_index` guard on `read_from_tree` and update `git
          reset --mixed` to ensure it can use sparse directory index entries wherever
     -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
     +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
          requires `change` and `add_remove` functions to process the internal
          contents of the sparse directory. The `recursive` diff option handles cases
          in which `reset --mixed` must diff/merge files that are nested multiple
     @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
      +		 * (since we can reset whole sparse directories without expanding them).
      +		 */
      +		if (item.nowildcard_len < item.len) {
     ++			/*
     ++			 * Special case: if the pattern is a path inside the cone
     ++			 * followed by only wildcards, the pattern cannot match
     ++			 * partial sparse directories, so we don't expand the index.
     ++			 */
     ++			if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
     ++			    strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)
     ++				continue;
     ++
      +			for (pos = 0; pos < active_nr; pos++) {
      +				struct cache_entry *ce = active_cache[pos];
      +
 8:  f91d1dcf024 = 8:  ddd97fb2837 unpack-trees: improve performance of next_cache_entry

Comments

Elijah Newren Nov. 29, 2021, 6:37 p.m. UTC | #1
Hi,

On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Changes since V5
> ================
>
>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
>    better checks
>  * Add "special case" wildcard pathspec check when determining whether to
>    expand the index (avoids double-loop over pathspecs & index entries)

Looks pretty good.  However, I'm worried this special case you added
at my prodding might be problematic, and that I may have been wrong to
prod you into it...

> Thanks! -Victoria
>
> Range-diff vs v5:
>
>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
>      @@ Commit message
>
>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
>           reset --mixed` to ensure it can use sparse directory index entries wherever
>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
>           requires `change` and `add_remove` functions to process the internal
>           contents of the sparse directory. The `recursive` diff option handles cases
>           in which `reset --mixed` must diff/merge files that are nested multiple
>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
>       +          * (since we can reset whole sparse directories without expanding them).
>       +          */
>       +         if (item.nowildcard_len < item.len) {
>      ++                 /*
>      ++                  * Special case: if the pattern is a path inside the cone
>      ++                  * followed by only wildcards, the pattern cannot match
>      ++                  * partial sparse directories, so we don't expand the index.
>      ++                  */
>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)

I usually expect in an &&-chain to see the cheaper function call first
(because that ordering often avoids the need to call the second
function), and I would presume that strspn() would be the cheaper of
the two.  Did you switch the order because you expect the strspn call
to nearly always return true, though?

Could the strspn() call be replaced by a `item.len ==
item.nowildcard_len + 1`?  I mean, sure, folks could list multiple
asterisks in a row in their pathspec, but that seems super unlikely
and even if it does happen the code will just fall back to the slower
codepath and still give them the right answer.  And the simpler check
feels a lot easier to parse for human readers.

But I'm worried there's a deeper issue here:


Is the wildcard character (or characters) in path treated as a literal
by path_in_cone_mode_sparse_checkout()?  I think it is...and I'm
worried that may be incorrect.  For example, if the path is

   foo/*

and the user has done a

  git sparse-checkout set foo/bar/

Then 'foo/baz/file' is not in the sparse checkout.  However, 'foo/*'
should match 'foo/baz/file' and yet 'foo/*' when treated as a literal
path would be considered in the sparse checkout by
path_in_cone_mode_sparse_checkout.  Does this result in the code
returning an incorrect answer?  (Or did I misunderstand something so
far?)

I'm wondering if I misled you earlier in my musings about whether we
could avoid the slow codepath for pathspecs with wildcard characters.
Maybe there's no safe optimization here and wildcard characters should
always go through the slower codepath.

>      ++                         continue;
>      ++
>       +                 for (pos = 0; pos < active_nr; pos++) {
>       +                         struct cache_entry *ce = active_cache[pos];
>       +
>  8:  f91d1dcf024 = 8:  ddd97fb2837 unpack-trees: improve performance of next_cache_entry
>
> --
> gitgitgadget
Victoria Dye Nov. 29, 2021, 7:44 p.m. UTC | #2
Elijah Newren wrote:
> Hi,
> 
> On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> Changes since V5
>> ================
>>
>>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
>>    better checks
>>  * Add "special case" wildcard pathspec check when determining whether to
>>    expand the index (avoids double-loop over pathspecs & index entries)
> 
> Looks pretty good.  However, I'm worried this special case you added
> at my prodding might be problematic, and that I may have been wrong to
> prod you into it...
> 
>> Thanks! -Victoria
>>
>> Range-diff vs v5:
>>
>>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
>>      @@ Commit message
>>
>>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
>>           reset --mixed` to ensure it can use sparse directory index entries wherever
>>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
>>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
>>           requires `change` and `add_remove` functions to process the internal
>>           contents of the sparse directory. The `recursive` diff option handles cases
>>           in which `reset --mixed` must diff/merge files that are nested multiple
>>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
>>       +          * (since we can reset whole sparse directories without expanding them).
>>       +          */
>>       +         if (item.nowildcard_len < item.len) {
>>      ++                 /*
>>      ++                  * Special case: if the pattern is a path inside the cone
>>      ++                  * followed by only wildcards, the pattern cannot match
>>      ++                  * partial sparse directories, so we don't expand the index.
>>      ++                  */
>>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
>>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)
> 
> I usually expect in an &&-chain to see the cheaper function call first
> (because that ordering often avoids the need to call the second
> function), and I would presume that strspn() would be the cheaper of
> the two.  Did you switch the order because you expect the strspn call
> to nearly always return true, though?
> 

This is a miss on my part, the `strspn()` check is probably less expensive
and should be first.

> Could the strspn() call be replaced by a `item.len ==
> item.nowildcard_len + 1`?  I mean, sure, folks could list multiple
> asterisks in a row in their pathspec, but that seems super unlikely
> and even if it does happen the code will just fall back to the slower
> codepath and still give them the right answer.  And the simpler check
> feels a lot easier to parse for human readers.
> 

Agreed on wanting better readability - if the multiple-wildcard case is
unlikely, the `PATHSPEC_ONESTAR` flag would indicate whether the pathspec
ends in a single wildcard character. If that flag is still too obscure,
though, I can stick with the length comparison.

> But I'm worried there's a deeper issue here:
> 
> 
> Is the wildcard character (or characters) in path treated as a literal
> by path_in_cone_mode_sparse_checkout()?  I think it is...and I'm
> worried that may be incorrect.  For example, if the path is
> 
>    foo/*
> 
> and the user has done a
> 
>   git sparse-checkout set foo/bar/
> 
> Then 'foo/baz/file' is not in the sparse checkout.  However, 'foo/*'
> should match 'foo/baz/file' and yet 'foo/*' when treated as a literal
> path would be considered in the sparse checkout by
> path_in_cone_mode_sparse_checkout.  Does this result in the code
> returning an incorrect answer?  (Or did I misunderstand something so
> far?)
> 

Correct: `path_in_cone_mode_sparse_checkout` interprets the wildcard
literally, and the checks here take that into account. The goal of
`pathspec_needs_expanded_index` is to determine if the pathspec *may* match
only partial contents of a sparse directory (like '*.c', or 'f*le'). For a
`git reset --mixed`, only this scenario requires expansion; if an entire
sparse directory is matched by a pathspec, the entire sparse directory is
reset. 

Using your example, 'foo/*' does match 'foo/baz/file', but it also matches
'foo/' itself; as a result, the `foo/` sparse directory index entry is reset
(rather than some individual files contained within it). The same goes for a
patchspec like 'fo*' ("in-cone" and ending in a wildcard). Conversely, a
pathspec like 'foo/ba*' would _not_ work (it wouldn't match something like
'foo/test-file'), and neither would 'f*o' (it would match all of 'foo', but
would only match files ending in "o" in a directory 'f/').

Hope that helps!

> I'm wondering if I misled you earlier in my musings about whether we
> could avoid the slow codepath for pathspecs with wildcard characters.
> Maybe there's no safe optimization here and wildcard characters should
> always go through the slower codepath.
> 
>>      ++                         continue;
>>      ++
>>       +                 for (pos = 0; pos < active_nr; pos++) {
>>       +                         struct cache_entry *ce = active_cache[pos];
>>       +
>>  8:  f91d1dcf024 = 8:  ddd97fb2837 unpack-trees: improve performance of next_cache_entry
>>
>> --
>> gitgitgadget
Elijah Newren Nov. 30, 2021, 3:59 a.m. UTC | #3
On Mon, Nov 29, 2021 at 11:44 AM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren wrote:
> > Hi,
> >
> > On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> Changes since V5
> >> ================
> >>
> >>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
> >>    better checks
> >>  * Add "special case" wildcard pathspec check when determining whether to
> >>    expand the index (avoids double-loop over pathspecs & index entries)
> >
> > Looks pretty good.  However, I'm worried this special case you added
> > at my prodding might be problematic, and that I may have been wrong to
> > prod you into it...
> >
> >> Thanks! -Victoria
> >>
> >> Range-diff vs v5:
> >>
> >>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
> >>      @@ Commit message
> >>
> >>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
> >>           reset --mixed` to ensure it can use sparse directory index entries wherever
> >>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
> >>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
> >>           requires `change` and `add_remove` functions to process the internal
> >>           contents of the sparse directory. The `recursive` diff option handles cases
> >>           in which `reset --mixed` must diff/merge files that are nested multiple
> >>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
> >>       +          * (since we can reset whole sparse directories without expanding them).
> >>       +          */
> >>       +         if (item.nowildcard_len < item.len) {
> >>      ++                 /*
> >>      ++                  * Special case: if the pattern is a path inside the cone
> >>      ++                  * followed by only wildcards, the pattern cannot match
> >>      ++                  * partial sparse directories, so we don't expand the index.
> >>      ++                  */
> >>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
> >>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)
> >
> > I usually expect in an &&-chain to see the cheaper function call first
> > (because that ordering often avoids the need to call the second
> > function), and I would presume that strspn() would be the cheaper of
> > the two.  Did you switch the order because you expect the strspn call
> > to nearly always return true, though?
> >
>
> This is a miss on my part, the `strspn()` check is probably less expensive
> and should be first.
>
> > Could the strspn() call be replaced by a `item.len ==
> > item.nowildcard_len + 1`?  I mean, sure, folks could list multiple
> > asterisks in a row in their pathspec, but that seems super unlikely
> > and even if it does happen the code will just fall back to the slower
> > codepath and still give them the right answer.  And the simpler check
> > feels a lot easier to parse for human readers.
> >
>
> Agreed on wanting better readability - if the multiple-wildcard case is
> unlikely, the `PATHSPEC_ONESTAR` flag would indicate whether the pathspec
> ends in a single wildcard character. If that flag is still too obscure,
> though, I can stick with the length comparison.
>
> > But I'm worried there's a deeper issue here:
> >
> >
> > Is the wildcard character (or characters) in path treated as a literal
> > by path_in_cone_mode_sparse_checkout()?  I think it is...and I'm
> > worried that may be incorrect.  For example, if the path is
> >
> >    foo/*
> >
> > and the user has done a
> >
> >   git sparse-checkout set foo/bar/
> >
> > Then 'foo/baz/file' is not in the sparse checkout.  However, 'foo/*'
> > should match 'foo/baz/file' and yet 'foo/*' when treated as a literal
> > path would be considered in the sparse checkout by
> > path_in_cone_mode_sparse_checkout.  Does this result in the code
> > returning an incorrect answer?  (Or did I misunderstand something so
> > far?)
> >
>
> Correct: `path_in_cone_mode_sparse_checkout` interprets the wildcard
> literally, and the checks here take that into account. The goal of
> `pathspec_needs_expanded_index` is to determine if the pathspec *may* match
> only partial contents of a sparse directory (like '*.c', or 'f*le'). For a
> `git reset --mixed`, only this scenario requires expansion; if an entire
> sparse directory is matched by a pathspec, the entire sparse directory is
> reset.
>
> Using your example, 'foo/*' does match 'foo/baz/file', but it also matches
> 'foo/' itself; as a result, the `foo/` sparse directory index entry is reset
> (rather than some individual files contained within it). The same goes for a
> patchspec like 'fo*' ("in-cone" and ending in a wildcard). Conversely, a
> pathspec like 'foo/ba*' would _not_ work (it wouldn't match something like
> 'foo/test-file'), and neither would 'f*o' (it would match all of 'foo', but
> would only match files ending in "o" in a directory 'f/').
>
> Hope that helps!

Ah, yes, thanks for the explanation.  :-)

> > I'm wondering if I misled you earlier in my musings about whether we
> > could avoid the slow codepath for pathspecs with wildcard characters.
> > Maybe there's no safe optimization here and wildcard characters should
> > always go through the slower codepath.
> >
> >>      ++                         continue;
> >>      ++
> >>       +                 for (pos = 0; pos < active_nr; pos++) {
> >>       +                         struct cache_entry *ce = active_cache[pos];
> >>       +
> >>  8:  f91d1dcf024 = 8:  ddd97fb2837 unpack-trees: improve performance of next_cache_entry
> >>
> >> --
> >> gitgitgadget
>
Ævar Arnfjörð Bjarmason Dec. 1, 2021, 8:26 p.m. UTC | #4
On Mon, Nov 29 2021, Victoria Dye wrote:

> Elijah Newren wrote:
>> Hi,
>> 
>> On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget
>> <gitgitgadget@gmail.com> wrote:
>>>
>>> Changes since V5
>>> ================
>>>
>>>  * Update t1092 test 'reset with wildcard pathspec' with more cases and
>>>    better checks
>>>  * Add "special case" wildcard pathspec check when determining whether to
>>>    expand the index (avoids double-loop over pathspecs & index entries)
>> 
>> Looks pretty good.  However, I'm worried this special case you added
>> at my prodding might be problematic, and that I may have been wrong to
>> prod you into it...
>> 
>>> Thanks! -Victoria
>>>
>>> Range-diff vs v5:
>>>
>>>  7:  a9135a5ed64 ! 7:  822d7344587 reset: make --mixed sparse-aware
>>>      @@ Commit message
>>>
>>>           Remove the `ensure_full_index` guard on `read_from_tree` and update `git
>>>           reset --mixed` to ensure it can use sparse directory index entries wherever
>>>      -    possible. Sparse directory entries are reset use `diff_tree_oid`, which
>>>      +    possible. Sparse directory entries are reset using `diff_tree_oid`, which
>>>           requires `change` and `add_remove` functions to process the internal
>>>           contents of the sparse directory. The `recursive` diff option handles cases
>>>           in which `reset --mixed` must diff/merge files that are nested multiple
>>>      @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q,
>>>       +          * (since we can reset whole sparse directories without expanding them).
>>>       +          */
>>>       +         if (item.nowildcard_len < item.len) {
>>>      ++                 /*
>>>      ++                  * Special case: if the pattern is a path inside the cone
>>>      ++                  * followed by only wildcards, the pattern cannot match
>>>      ++                  * partial sparse directories, so we don't expand the index.
>>>      ++                  */
>>>      ++                 if (path_in_cone_mode_sparse_checkout(item.original, &the_index) &&
>>>      ++                     strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len)
>> 
>> I usually expect in an &&-chain to see the cheaper function call first
>> (because that ordering often avoids the need to call the second
>> function), and I would presume that strspn() would be the cheaper of
>> the two.  Did you switch the order because you expect the strspn call
>> to nearly always return true, though?
>> 
>
> This is a miss on my part, the `strspn()` check is probably less expensive
> and should be first.

I doubt it matters either way, and I didn't look into this to any degree
of carefulness.

But having followed the breadcrumb trail from the "What's Cooking"
discussion & looked at the code one thing that stuck out for me was that
path_in_cone_mode_sparse_checkout() appears returns 1 inconditionally in
some cases based on global state:

	/*
	 * We default to accepting a path if there are no patterns or
	 * they are of the wrong type.
	 */
	if (init_sparse_checkout_patterns(istate) ||
	    (require_cone_mode &&
	     !istate->sparse_checkout_patterns->use_cone_patterns))
		return 1;

So moreso than the nano-optimization of strspn()
v.s. path_in_cone_mode_sparse_checkout() I found it a bit odd that we're
calling something in a loop where presumably we can punt out a lot
earlier, and at least make that "continue" a "break" or "return" in that
case.

I.e. something in this direction (this patch obviously doesn't even
compile, but should clarify what I'm blathering about :); but again, I
really haven't looked at this properly, so just food for thought:

diff --git a/builtin/reset.c b/builtin/reset.c
index b1ff699b43a..cefdabb09c2 100644
--- a/builtin/reset.c
+++ b/builtin/reset.c
@@ -187,6 +187,9 @@ static int pathspec_needs_expanded_index(const struct pathspec *pathspec)
 	if (pathspec->magic)
 		return 1;
 
+	if (cant_possibly_have_path_in_cone_mode_blah_blah(&the_index))
+		return 1;
+
 	for (i = 0; i < pathspec->nr; i++) {
 		struct pathspec_item item = pathspec->items[i];
 
diff --git a/dir.c b/dir.c
index 5aa6fbad0b7..19f2d989dd3 100644
--- a/dir.c
+++ b/dir.c
@@ -1456,14 +1456,8 @@ int init_sparse_checkout_patterns(struct index_state *istate)
 	return 0;
 }
 
-static int path_in_sparse_checkout_1(const char *path,
-				     struct index_state *istate,
-				     int require_cone_mode)
+int cant_possibly_have_path_in_cone_mode_blah_blah(...)
 {
-	int dtype = DT_REG;
-	enum pattern_match_result match = UNDECIDED;
-	const char *end, *slash;
-
 	/*
 	 * We default to accepting a path if there are no patterns or
 	 * they are of the wrong type.
@@ -1472,6 +1466,16 @@ static int path_in_sparse_checkout_1(const char *path,
 	    (require_cone_mode &&
 	     !istate->sparse_checkout_patterns->use_cone_patterns))
 		return 1;
+}
+
+
+static int path_in_sparse_checkout_1(const char *path,
+				     struct index_state *istate,
+				     int require_cone_mode)
+{
+	int dtype = DT_REG;
+	enum pattern_match_result match = UNDECIDED;
+	const char *end, *slash;
 
 	/*
 	 * If UNDECIDED, use the match from the parent dir (recursively), or