Message ID | pull.1048.v6.git.1638201164.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Sparse Index: integrate with reset | expand |
Hi, On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget <gitgitgadget@gmail.com> wrote: > > Changes since V5 > ================ > > * Update t1092 test 'reset with wildcard pathspec' with more cases and > better checks > * Add "special case" wildcard pathspec check when determining whether to > expand the index (avoids double-loop over pathspecs & index entries) Looks pretty good. However, I'm worried this special case you added at my prodding might be problematic, and that I may have been wrong to prod you into it... > Thanks! -Victoria > > Range-diff vs v5: > > 7: a9135a5ed64 ! 7: 822d7344587 reset: make --mixed sparse-aware > @@ Commit message > > Remove the `ensure_full_index` guard on `read_from_tree` and update `git > reset --mixed` to ensure it can use sparse directory index entries wherever > - possible. Sparse directory entries are reset use `diff_tree_oid`, which > + possible. Sparse directory entries are reset using `diff_tree_oid`, which > requires `change` and `add_remove` functions to process the internal > contents of the sparse directory. The `recursive` diff option handles cases > in which `reset --mixed` must diff/merge files that are nested multiple > @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q, > + * (since we can reset whole sparse directories without expanding them). > + */ > + if (item.nowildcard_len < item.len) { > ++ /* > ++ * Special case: if the pattern is a path inside the cone > ++ * followed by only wildcards, the pattern cannot match > ++ * partial sparse directories, so we don't expand the index. > ++ */ > ++ if (path_in_cone_mode_sparse_checkout(item.original, &the_index) && > ++ strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len) I usually expect in an &&-chain to see the cheaper function call first (because that ordering often avoids the need to call the second function), and I would presume that strspn() would be the cheaper of the two. Did you switch the order because you expect the strspn call to nearly always return true, though? Could the strspn() call be replaced by a `item.len == item.nowildcard_len + 1`? I mean, sure, folks could list multiple asterisks in a row in their pathspec, but that seems super unlikely and even if it does happen the code will just fall back to the slower codepath and still give them the right answer. And the simpler check feels a lot easier to parse for human readers. But I'm worried there's a deeper issue here: Is the wildcard character (or characters) in path treated as a literal by path_in_cone_mode_sparse_checkout()? I think it is...and I'm worried that may be incorrect. For example, if the path is foo/* and the user has done a git sparse-checkout set foo/bar/ Then 'foo/baz/file' is not in the sparse checkout. However, 'foo/*' should match 'foo/baz/file' and yet 'foo/*' when treated as a literal path would be considered in the sparse checkout by path_in_cone_mode_sparse_checkout. Does this result in the code returning an incorrect answer? (Or did I misunderstand something so far?) I'm wondering if I misled you earlier in my musings about whether we could avoid the slow codepath for pathspecs with wildcard characters. Maybe there's no safe optimization here and wildcard characters should always go through the slower codepath. > ++ continue; > ++ > + for (pos = 0; pos < active_nr; pos++) { > + struct cache_entry *ce = active_cache[pos]; > + > 8: f91d1dcf024 = 8: ddd97fb2837 unpack-trees: improve performance of next_cache_entry > > -- > gitgitgadget
Elijah Newren wrote: > Hi, > > On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> >> Changes since V5 >> ================ >> >> * Update t1092 test 'reset with wildcard pathspec' with more cases and >> better checks >> * Add "special case" wildcard pathspec check when determining whether to >> expand the index (avoids double-loop over pathspecs & index entries) > > Looks pretty good. However, I'm worried this special case you added > at my prodding might be problematic, and that I may have been wrong to > prod you into it... > >> Thanks! -Victoria >> >> Range-diff vs v5: >> >> 7: a9135a5ed64 ! 7: 822d7344587 reset: make --mixed sparse-aware >> @@ Commit message >> >> Remove the `ensure_full_index` guard on `read_from_tree` and update `git >> reset --mixed` to ensure it can use sparse directory index entries wherever >> - possible. Sparse directory entries are reset use `diff_tree_oid`, which >> + possible. Sparse directory entries are reset using `diff_tree_oid`, which >> requires `change` and `add_remove` functions to process the internal >> contents of the sparse directory. The `recursive` diff option handles cases >> in which `reset --mixed` must diff/merge files that are nested multiple >> @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q, >> + * (since we can reset whole sparse directories without expanding them). >> + */ >> + if (item.nowildcard_len < item.len) { >> ++ /* >> ++ * Special case: if the pattern is a path inside the cone >> ++ * followed by only wildcards, the pattern cannot match >> ++ * partial sparse directories, so we don't expand the index. >> ++ */ >> ++ if (path_in_cone_mode_sparse_checkout(item.original, &the_index) && >> ++ strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len) > > I usually expect in an &&-chain to see the cheaper function call first > (because that ordering often avoids the need to call the second > function), and I would presume that strspn() would be the cheaper of > the two. Did you switch the order because you expect the strspn call > to nearly always return true, though? > This is a miss on my part, the `strspn()` check is probably less expensive and should be first. > Could the strspn() call be replaced by a `item.len == > item.nowildcard_len + 1`? I mean, sure, folks could list multiple > asterisks in a row in their pathspec, but that seems super unlikely > and even if it does happen the code will just fall back to the slower > codepath and still give them the right answer. And the simpler check > feels a lot easier to parse for human readers. > Agreed on wanting better readability - if the multiple-wildcard case is unlikely, the `PATHSPEC_ONESTAR` flag would indicate whether the pathspec ends in a single wildcard character. If that flag is still too obscure, though, I can stick with the length comparison. > But I'm worried there's a deeper issue here: > > > Is the wildcard character (or characters) in path treated as a literal > by path_in_cone_mode_sparse_checkout()? I think it is...and I'm > worried that may be incorrect. For example, if the path is > > foo/* > > and the user has done a > > git sparse-checkout set foo/bar/ > > Then 'foo/baz/file' is not in the sparse checkout. However, 'foo/*' > should match 'foo/baz/file' and yet 'foo/*' when treated as a literal > path would be considered in the sparse checkout by > path_in_cone_mode_sparse_checkout. Does this result in the code > returning an incorrect answer? (Or did I misunderstand something so > far?) > Correct: `path_in_cone_mode_sparse_checkout` interprets the wildcard literally, and the checks here take that into account. The goal of `pathspec_needs_expanded_index` is to determine if the pathspec *may* match only partial contents of a sparse directory (like '*.c', or 'f*le'). For a `git reset --mixed`, only this scenario requires expansion; if an entire sparse directory is matched by a pathspec, the entire sparse directory is reset. Using your example, 'foo/*' does match 'foo/baz/file', but it also matches 'foo/' itself; as a result, the `foo/` sparse directory index entry is reset (rather than some individual files contained within it). The same goes for a patchspec like 'fo*' ("in-cone" and ending in a wildcard). Conversely, a pathspec like 'foo/ba*' would _not_ work (it wouldn't match something like 'foo/test-file'), and neither would 'f*o' (it would match all of 'foo', but would only match files ending in "o" in a directory 'f/'). Hope that helps! > I'm wondering if I misled you earlier in my musings about whether we > could avoid the slow codepath for pathspecs with wildcard characters. > Maybe there's no safe optimization here and wildcard characters should > always go through the slower codepath. > >> ++ continue; >> ++ >> + for (pos = 0; pos < active_nr; pos++) { >> + struct cache_entry *ce = active_cache[pos]; >> + >> 8: f91d1dcf024 = 8: ddd97fb2837 unpack-trees: improve performance of next_cache_entry >> >> -- >> gitgitgadget
On Mon, Nov 29, 2021 at 11:44 AM Victoria Dye <vdye@github.com> wrote: > > Elijah Newren wrote: > > Hi, > > > > On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget > > <gitgitgadget@gmail.com> wrote: > >> > >> Changes since V5 > >> ================ > >> > >> * Update t1092 test 'reset with wildcard pathspec' with more cases and > >> better checks > >> * Add "special case" wildcard pathspec check when determining whether to > >> expand the index (avoids double-loop over pathspecs & index entries) > > > > Looks pretty good. However, I'm worried this special case you added > > at my prodding might be problematic, and that I may have been wrong to > > prod you into it... > > > >> Thanks! -Victoria > >> > >> Range-diff vs v5: > >> > >> 7: a9135a5ed64 ! 7: 822d7344587 reset: make --mixed sparse-aware > >> @@ Commit message > >> > >> Remove the `ensure_full_index` guard on `read_from_tree` and update `git > >> reset --mixed` to ensure it can use sparse directory index entries wherever > >> - possible. Sparse directory entries are reset use `diff_tree_oid`, which > >> + possible. Sparse directory entries are reset using `diff_tree_oid`, which > >> requires `change` and `add_remove` functions to process the internal > >> contents of the sparse directory. The `recursive` diff option handles cases > >> in which `reset --mixed` must diff/merge files that are nested multiple > >> @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q, > >> + * (since we can reset whole sparse directories without expanding them). > >> + */ > >> + if (item.nowildcard_len < item.len) { > >> ++ /* > >> ++ * Special case: if the pattern is a path inside the cone > >> ++ * followed by only wildcards, the pattern cannot match > >> ++ * partial sparse directories, so we don't expand the index. > >> ++ */ > >> ++ if (path_in_cone_mode_sparse_checkout(item.original, &the_index) && > >> ++ strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len) > > > > I usually expect in an &&-chain to see the cheaper function call first > > (because that ordering often avoids the need to call the second > > function), and I would presume that strspn() would be the cheaper of > > the two. Did you switch the order because you expect the strspn call > > to nearly always return true, though? > > > > This is a miss on my part, the `strspn()` check is probably less expensive > and should be first. > > > Could the strspn() call be replaced by a `item.len == > > item.nowildcard_len + 1`? I mean, sure, folks could list multiple > > asterisks in a row in their pathspec, but that seems super unlikely > > and even if it does happen the code will just fall back to the slower > > codepath and still give them the right answer. And the simpler check > > feels a lot easier to parse for human readers. > > > > Agreed on wanting better readability - if the multiple-wildcard case is > unlikely, the `PATHSPEC_ONESTAR` flag would indicate whether the pathspec > ends in a single wildcard character. If that flag is still too obscure, > though, I can stick with the length comparison. > > > But I'm worried there's a deeper issue here: > > > > > > Is the wildcard character (or characters) in path treated as a literal > > by path_in_cone_mode_sparse_checkout()? I think it is...and I'm > > worried that may be incorrect. For example, if the path is > > > > foo/* > > > > and the user has done a > > > > git sparse-checkout set foo/bar/ > > > > Then 'foo/baz/file' is not in the sparse checkout. However, 'foo/*' > > should match 'foo/baz/file' and yet 'foo/*' when treated as a literal > > path would be considered in the sparse checkout by > > path_in_cone_mode_sparse_checkout. Does this result in the code > > returning an incorrect answer? (Or did I misunderstand something so > > far?) > > > > Correct: `path_in_cone_mode_sparse_checkout` interprets the wildcard > literally, and the checks here take that into account. The goal of > `pathspec_needs_expanded_index` is to determine if the pathspec *may* match > only partial contents of a sparse directory (like '*.c', or 'f*le'). For a > `git reset --mixed`, only this scenario requires expansion; if an entire > sparse directory is matched by a pathspec, the entire sparse directory is > reset. > > Using your example, 'foo/*' does match 'foo/baz/file', but it also matches > 'foo/' itself; as a result, the `foo/` sparse directory index entry is reset > (rather than some individual files contained within it). The same goes for a > patchspec like 'fo*' ("in-cone" and ending in a wildcard). Conversely, a > pathspec like 'foo/ba*' would _not_ work (it wouldn't match something like > 'foo/test-file'), and neither would 'f*o' (it would match all of 'foo', but > would only match files ending in "o" in a directory 'f/'). > > Hope that helps! Ah, yes, thanks for the explanation. :-) > > I'm wondering if I misled you earlier in my musings about whether we > > could avoid the slow codepath for pathspecs with wildcard characters. > > Maybe there's no safe optimization here and wildcard characters should > > always go through the slower codepath. > > > >> ++ continue; > >> ++ > >> + for (pos = 0; pos < active_nr; pos++) { > >> + struct cache_entry *ce = active_cache[pos]; > >> + > >> 8: f91d1dcf024 = 8: ddd97fb2837 unpack-trees: improve performance of next_cache_entry > >> > >> -- > >> gitgitgadget >
On Mon, Nov 29 2021, Victoria Dye wrote: > Elijah Newren wrote: >> Hi, >> >> On Mon, Nov 29, 2021 at 7:52 AM Victoria Dye via GitGitGadget >> <gitgitgadget@gmail.com> wrote: >>> >>> Changes since V5 >>> ================ >>> >>> * Update t1092 test 'reset with wildcard pathspec' with more cases and >>> better checks >>> * Add "special case" wildcard pathspec check when determining whether to >>> expand the index (avoids double-loop over pathspecs & index entries) >> >> Looks pretty good. However, I'm worried this special case you added >> at my prodding might be problematic, and that I may have been wrong to >> prod you into it... >> >>> Thanks! -Victoria >>> >>> Range-diff vs v5: >>> >>> 7: a9135a5ed64 ! 7: 822d7344587 reset: make --mixed sparse-aware >>> @@ Commit message >>> >>> Remove the `ensure_full_index` guard on `read_from_tree` and update `git >>> reset --mixed` to ensure it can use sparse directory index entries wherever >>> - possible. Sparse directory entries are reset use `diff_tree_oid`, which >>> + possible. Sparse directory entries are reset using `diff_tree_oid`, which >>> requires `change` and `add_remove` functions to process the internal >>> contents of the sparse directory. The `recursive` diff option handles cases >>> in which `reset --mixed` must diff/merge files that are nested multiple >>> @@ builtin/reset.c: static void update_index_from_diff(struct diff_queue_struct *q, >>> + * (since we can reset whole sparse directories without expanding them). >>> + */ >>> + if (item.nowildcard_len < item.len) { >>> ++ /* >>> ++ * Special case: if the pattern is a path inside the cone >>> ++ * followed by only wildcards, the pattern cannot match >>> ++ * partial sparse directories, so we don't expand the index. >>> ++ */ >>> ++ if (path_in_cone_mode_sparse_checkout(item.original, &the_index) && >>> ++ strspn(item.original + item.nowildcard_len, "*") == item.len - item.nowildcard_len) >> >> I usually expect in an &&-chain to see the cheaper function call first >> (because that ordering often avoids the need to call the second >> function), and I would presume that strspn() would be the cheaper of >> the two. Did you switch the order because you expect the strspn call >> to nearly always return true, though? >> > > This is a miss on my part, the `strspn()` check is probably less expensive > and should be first. I doubt it matters either way, and I didn't look into this to any degree of carefulness. But having followed the breadcrumb trail from the "What's Cooking" discussion & looked at the code one thing that stuck out for me was that path_in_cone_mode_sparse_checkout() appears returns 1 inconditionally in some cases based on global state: /* * We default to accepting a path if there are no patterns or * they are of the wrong type. */ if (init_sparse_checkout_patterns(istate) || (require_cone_mode && !istate->sparse_checkout_patterns->use_cone_patterns)) return 1; So moreso than the nano-optimization of strspn() v.s. path_in_cone_mode_sparse_checkout() I found it a bit odd that we're calling something in a loop where presumably we can punt out a lot earlier, and at least make that "continue" a "break" or "return" in that case. I.e. something in this direction (this patch obviously doesn't even compile, but should clarify what I'm blathering about :); but again, I really haven't looked at this properly, so just food for thought: diff --git a/builtin/reset.c b/builtin/reset.c index b1ff699b43a..cefdabb09c2 100644 --- a/builtin/reset.c +++ b/builtin/reset.c @@ -187,6 +187,9 @@ static int pathspec_needs_expanded_index(const struct pathspec *pathspec) if (pathspec->magic) return 1; + if (cant_possibly_have_path_in_cone_mode_blah_blah(&the_index)) + return 1; + for (i = 0; i < pathspec->nr; i++) { struct pathspec_item item = pathspec->items[i]; diff --git a/dir.c b/dir.c index 5aa6fbad0b7..19f2d989dd3 100644 --- a/dir.c +++ b/dir.c @@ -1456,14 +1456,8 @@ int init_sparse_checkout_patterns(struct index_state *istate) return 0; } -static int path_in_sparse_checkout_1(const char *path, - struct index_state *istate, - int require_cone_mode) +int cant_possibly_have_path_in_cone_mode_blah_blah(...) { - int dtype = DT_REG; - enum pattern_match_result match = UNDECIDED; - const char *end, *slash; - /* * We default to accepting a path if there are no patterns or * they are of the wrong type. @@ -1472,6 +1466,16 @@ static int path_in_sparse_checkout_1(const char *path, (require_cone_mode && !istate->sparse_checkout_patterns->use_cone_patterns)) return 1; +} + + +static int path_in_sparse_checkout_1(const char *path, + struct index_state *istate, + int require_cone_mode) +{ + int dtype = DT_REG; + enum pattern_match_result match = UNDECIDED; + const char *end, *slash; /* * If UNDECIDED, use the match from the parent dir (recursively), or