Message ID | c3b5ca597330275391704a0653398ee28f911fc1.1741275245.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 27be76b230b07360b64aec06d6b0b9bc9e993603 |
Headers | show |
Series | refs: a couple of --exclude fixes | expand |
On Thu, Mar 6, 2025 at 7:34 AM Taylor Blau <me@ttaylorr.com> wrote: > > In 59c35fac54 (refs/packed-backend.c: implement jump lists to avoid > excluded pattern(s), 2023-07-10), the packed-refs backend learned how to > construct "jump lists" to avoid enumerating sections of the packed-refs > file that we know the caller is going to throw out anyway. > > This process works by finding the start- and end-points (that is, where > in the packed-refs file corresponds to the range we're going to ignore) > for each exclude pattern, then constructing a jump list based on that. > At enumeration time we'll consult the jump list to skip past everything > in the range(s) found in the previous step, saving time when excluding a > large portion of references. > > But when there is a --exclude pattern which is just the empty string, > the behavior is a little funky. When we try and exclude the empty > string, the matched range covers the entire packed-refs file, meaning > that we won't output any packed references. But the empty pattern > doesn't actually match any references to begin with! For example, on my > copy of git.git I can do: > > $ git for-each-ref '' | wc -l > 0 > > So "git for-each-ref --exclude=''" shouldn't actually remove anything > from the output, and ought to be equivalent to "git for-each-ref". But > it's not, and in fact: > > $ git for-each-ref | wc -l > 2229 > $ git for-each-ref --exclude='' | wc -l > 480 > > But why does the '--exclude' version output only some of the references > in the repository? Here's a hint: > > $ find .git/refs -type f | wc -l > 480 > > Indeed, because the files backend doesn't implement[^1] the same jump > list concept as the packed backend we get the correct result for the > loose references, but none of the packed references. > > Since the empty string exclude pattern doesn't match anything, we can > discard them before the packed-refs backend has a chance to even see it > (and likewise for reftable, which also implements a similar concept > since 1869525066 (refs/reftable: wire up support for exclude patterns, > 2024-09-16)). > > This approach (copying only some of the patterns into a strvec at the > refs.c layer) may seem heavy-handed, but it's setting us up to fix > another bug in the following commit where the fix will involve modifying > the incoming patterns. > > [^1]: As noted in 59c35fac54. We technically could avoid opening and > enumerating the contents of, for e.g., "$GIT_DIR/refs/heads/foo/" if > we knew that we were excluding anything under the 'refs/heads/foo' > hierarchy. But the --exclude stuff is all best-effort anyway, since > the caller is expected to cull out any results that they don't want. > > Noticed-by: Jeff King <peff@peff.net> > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > refs.c | 16 ++++++++++++++++ > t/t1419-exclude-refs.sh | 10 ++++++++++ > 2 files changed, 26 insertions(+) > > diff --git a/refs.c b/refs.c > index 91da5325d7..17d3840aff 100644 > --- a/refs.c > +++ b/refs.c > @@ -1699,6 +1699,20 @@ struct ref_iterator *refs_ref_iterator_begin( > enum do_for_each_ref_flags flags) > { > struct ref_iterator *iter; > + struct strvec normalized_exclude_patterns = STRVEC_INIT; > + > + if (exclude_patterns) { > + for (size_t i = 0; exclude_patterns[i]; i++) { > + const char *pattern = exclude_patterns[i]; > + size_t len = strlen(pattern); > + if (!len) > + continue; > + > + strvec_push(&normalized_exclude_patterns, pattern); > + } > + > + exclude_patterns = normalized_exclude_patterns.v; > + } > > if (!(flags & DO_FOR_EACH_INCLUDE_BROKEN)) { > static int ref_paranoia = -1; > @@ -1719,6 +1733,8 @@ struct ref_iterator *refs_ref_iterator_begin( > if (trim) > iter = prefix_ref_iterator_begin(iter, "", trim); > > + strvec_clear(&normalized_exclude_patterns); > + > return iter; > } > > diff --git a/t/t1419-exclude-refs.sh b/t/t1419-exclude-refs.sh > index c04eeb7211..fd58260a24 100755 > --- a/t/t1419-exclude-refs.sh > +++ b/t/t1419-exclude-refs.sh > @@ -155,4 +155,14 @@ test_expect_success 'meta-characters are discarded' ' > assert_no_jumps perf > ' > > +test_expect_success 'empty string exclude pattern is ignored' ' > + git update-ref refs/heads/loose $(git rev-parse refs/heads/foo/1) && > + > + for_each_ref__exclude refs/heads "" >actual 2>perf && > + for_each_ref >expect && > + > + test_cmp expect actual && > + assert_no_jumps perf > +' > + > test_done > -- > 2.49.0.rc1.2.g67c8c5f7978 Makes sense...but doesn't the second patch also fix this issue without the first patch being needed?
On Fri, Mar 07, 2025 at 01:32:49PM -0800, Elijah Newren wrote: > Makes sense...but doesn't the second patch also fix this issue without > the first patch being needed? It does, but the mechanism is pretty round-about. (From a quick glance we'll turn the empty pattern "" into "/" which won't match anything, and thus won't contribute to the jump list). But there are a couple of reasons to keep this patch. Most importantly, it hardens us against potential future regressions here with the empty pattern. And it makes dealing with that case much more explicit by throwing those patterns out before they make their way to the backends instead of the quirk above. Thanks, Taylor
On Fri, Mar 7, 2025 at 3:37 PM Taylor Blau <me@ttaylorr.com> wrote: > > On Fri, Mar 07, 2025 at 01:32:49PM -0800, Elijah Newren wrote: > > Makes sense...but doesn't the second patch also fix this issue without > > the first patch being needed? > > It does, but the mechanism is pretty round-about. (From a quick glance > we'll turn the empty pattern "" into "/" which won't match anything, and > thus won't contribute to the jump list). How is that round-about? The whole point of patch 2 is to stop matching on excludes as a prefix unless that prefix is a directory name, right? To me, patch 1 merely looks like a special case of patch 2. > But there are a couple of reasons to keep this patch. Most importantly, > it hardens us against potential future regressions here with the empty > pattern. I'm fine with leaving the patch in place, since it doesn't hurt anything. And if the empty pattern is especially problematic, I can see this logic. > And it makes dealing with that case much more explicit by > throwing those patterns out before they make their way to the backends > instead of the quirk above. I don't understand this reason, though. It feels to me like the design behind patch 2 rather than a "quirk"...but maybe the fact that patch 2 doesn't exclude "refs/heads/bar" (at the low-level) even when that exact string is given as an exclude was unintentional or something? If it was an intentional part of patch 2 (as I assumed while reading it), then I don't see how patch 2 excluding the empty string exclude is a "quirk". Am I missing something? (Not that it matters, since I'm fine with keeping the patch for your first reason, I'm just curious...)
diff --git a/refs.c b/refs.c index 91da5325d7..17d3840aff 100644 --- a/refs.c +++ b/refs.c @@ -1699,6 +1699,20 @@ struct ref_iterator *refs_ref_iterator_begin( enum do_for_each_ref_flags flags) { struct ref_iterator *iter; + struct strvec normalized_exclude_patterns = STRVEC_INIT; + + if (exclude_patterns) { + for (size_t i = 0; exclude_patterns[i]; i++) { + const char *pattern = exclude_patterns[i]; + size_t len = strlen(pattern); + if (!len) + continue; + + strvec_push(&normalized_exclude_patterns, pattern); + } + + exclude_patterns = normalized_exclude_patterns.v; + } if (!(flags & DO_FOR_EACH_INCLUDE_BROKEN)) { static int ref_paranoia = -1; @@ -1719,6 +1733,8 @@ struct ref_iterator *refs_ref_iterator_begin( if (trim) iter = prefix_ref_iterator_begin(iter, "", trim); + strvec_clear(&normalized_exclude_patterns); + return iter; } diff --git a/t/t1419-exclude-refs.sh b/t/t1419-exclude-refs.sh index c04eeb7211..fd58260a24 100755 --- a/t/t1419-exclude-refs.sh +++ b/t/t1419-exclude-refs.sh @@ -155,4 +155,14 @@ test_expect_success 'meta-characters are discarded' ' assert_no_jumps perf ' +test_expect_success 'empty string exclude pattern is ignored' ' + git update-ref refs/heads/loose $(git rev-parse refs/heads/foo/1) && + + for_each_ref__exclude refs/heads "" >actual 2>perf && + for_each_ref >expect && + + test_cmp expect actual && + assert_no_jumps perf +' + test_done
In 59c35fac54 (refs/packed-backend.c: implement jump lists to avoid excluded pattern(s), 2023-07-10), the packed-refs backend learned how to construct "jump lists" to avoid enumerating sections of the packed-refs file that we know the caller is going to throw out anyway. This process works by finding the start- and end-points (that is, where in the packed-refs file corresponds to the range we're going to ignore) for each exclude pattern, then constructing a jump list based on that. At enumeration time we'll consult the jump list to skip past everything in the range(s) found in the previous step, saving time when excluding a large portion of references. But when there is a --exclude pattern which is just the empty string, the behavior is a little funky. When we try and exclude the empty string, the matched range covers the entire packed-refs file, meaning that we won't output any packed references. But the empty pattern doesn't actually match any references to begin with! For example, on my copy of git.git I can do: $ git for-each-ref '' | wc -l 0 So "git for-each-ref --exclude=''" shouldn't actually remove anything from the output, and ought to be equivalent to "git for-each-ref". But it's not, and in fact: $ git for-each-ref | wc -l 2229 $ git for-each-ref --exclude='' | wc -l 480 But why does the '--exclude' version output only some of the references in the repository? Here's a hint: $ find .git/refs -type f | wc -l 480 Indeed, because the files backend doesn't implement[^1] the same jump list concept as the packed backend we get the correct result for the loose references, but none of the packed references. Since the empty string exclude pattern doesn't match anything, we can discard them before the packed-refs backend has a chance to even see it (and likewise for reftable, which also implements a similar concept since 1869525066 (refs/reftable: wire up support for exclude patterns, 2024-09-16)). This approach (copying only some of the patterns into a strvec at the refs.c layer) may seem heavy-handed, but it's setting us up to fix another bug in the following commit where the fix will involve modifying the incoming patterns. [^1]: As noted in 59c35fac54. We technically could avoid opening and enumerating the contents of, for e.g., "$GIT_DIR/refs/heads/foo/" if we knew that we were excluding anything under the 'refs/heads/foo' hierarchy. But the --exclude stuff is all best-effort anyway, since the caller is expected to cull out any results that they don't want. Noticed-by: Jeff King <peff@peff.net> Signed-off-by: Taylor Blau <me@ttaylorr.com> --- refs.c | 16 ++++++++++++++++ t/t1419-exclude-refs.sh | 10 ++++++++++ 2 files changed, 26 insertions(+)