Message ID | a39d1107c1f3e9fbd859a23aed72e63bbd394fa2.1683581621.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | refs: implement skip lists for packed backend | expand |
I haven't read through any of this closely (don't have time for it now) but want to mention one thing here: On Mon, May 8, 2023 at 3:06 PM Taylor Blau <me@ttaylorr.com> wrote: > - Construct a skip list of regions by combining adjacent and > overlapping regions from the previous step. You might want to add a note to the code that there is no relationship here to the skip list data structure (see https://en.wikipedia.org/wiki/Skip_list). Chris
On Mon, May 08, 2023 at 06:00:08PM -0400, Taylor Blau wrote: > When iterating through the `packed-refs` file in order to answer a query > like: > > $ git for-each-ref --exclude=refs/__hidden__ > > it would be useful to avoid walking over all of the entries in > `refs/__hidden__/*` when possible, since we know that the ref-filter > code is going to throw them away anyways. > > In certain circumstances, doing so is possible. The algorithm for doing > so is as follows: > > - For each excluded pattern, find the first record that matches it, > and the first pattern that *doesn't* match it (i.e. the location > you'd next want to consider when excluding that pattern). > > - Sort the patterns by their starting location within the > `packed-refs` file. > > - Construct a skip list of regions by combining adjacent and > overlapping regions from the previous step. > > - When iterating through the `packed-refs` file, if `iter->pos` is > ever contained in one of the regions from the previous steps, > advance `iter->pos` past the end of that region, and continue > enumeration. > > Note that this optimization is only possible when none of the excluded > pattern(s) have special meta-characters in them. To see why this is the > case, consider the exclusion pattern "refs/foo[a]". In general, in order > to find the location of the first record that matches this pattern, we > could only consider up to the first meta-character, "refs/foo". But this > doesn't work, since the excluded region we'd come up with would include > "refs/foobar", even though it is not excluded. Is this generally true though? A naive implementation would iterate through all references and find the first reference that matches the exclusion regular exepression. From thereon we continue to iterate until we find the first entry that doesn't match. This may cause us to end up with a suboptimal skip list, but the skip list would still be valid. As I said, this implementation would be naive as we're now forced to iterate through all references starting at the beginning. I assume that your implementation will instead use a binary search to locate the first entry that matches the exclusion pattern and the last pattern. But the way this paragraph is formulated makes it sound like this is a general fact, even though it is a fact that derives from the implementation. I of course don't propose to change the algorithm here, but instead to clarify where this restriction actually comes from and why the tradeoff makes sense. [snip] > @@ -929,6 +972,108 @@ static struct ref_iterator_vtable packed_ref_iterator_vtable = { > .abort = packed_ref_iterator_abort > }; > > +static int skip_list_entry_cmp(const void *va, const void *vb) > +{ > + const struct skip_list_entry *a = va; > + const struct skip_list_entry *b = vb; > + > + if (a->start < b->start) > + return -1; > + if (a->start > b->start) > + return 1; > + return 0; > +} > + > +static int has_glob_special(const char *str) > +{ > + const char *p; > + for (p = str; *p; p++) { > + if (is_glob_special(*p)) > + return 1; > + } > + return 0; > +} > + > +static const char *ptr_max(const char *x, const char *y) > +{ > + if (x > y) > + return x; > + return y; > +} > + > +static void populate_excluded_skip_list(struct packed_ref_iterator *iter, > + struct snapshot *snapshot, > + const char **excluded_patterns) > +{ > + size_t i, j; > + const char **pattern; > + > + if (!excluded_patterns) > + return; > + > + for (pattern = excluded_patterns; *pattern; pattern++) { > + struct skip_list_entry *e; > + > + /* > + * We can't feed any excludes with globs in them to the > + * refs machinery. It only understands prefix matching. > + * We likewise can't even feed the string leading up to > + * the first meta-character, as something like "foo[a]" > + * should not exclude "foobar" (but the prefix "foo" > + * would match that and mark it for exclusion). > + */ > + if (has_glob_special(*pattern)) > + continue; > + > + ALLOC_GROW(iter->skip, iter->skip_nr + 1, iter->skip_alloc); > + > + e = &iter->skip[iter->skip_nr++]; > + e->start = find_reference_location(snapshot, *pattern, 0); > + e->end = find_reference_location_end(snapshot, *pattern, 0); One thing that makes this hard to reason about is that we don't explicitly handle the case where the pattern doesn't match at all. So you require a bunch of knowledge about what exactly the functions `find_reference_location()` and `find_reference_location_end()` do in that case in order to determine whether we will end up doing the right thing. Explicitly handling this would give us some benefits: - It makes the code more obvious. - We can avoid an extra skip list entry for every non-matching pattern. - We wouldn't have to perform binary search over the snapshot twice. Might be I'm missing something though. > + } > + > + if (!iter->skip_nr) { > + /* > + * Every entry in exclude_patterns has a meta-character, > + * nothing to do here. > + */ > + return; > + } > + > + QSORT(iter->skip, iter->skip_nr, skip_list_entry_cmp); > + > + /* > + * As an optimization, merge adjacent entries in the skip list > + * to jump forwards as far as possible when entering a skipped > + * region. > + * > + * For example, if we have two skipped regions: > + * > + * [[A, B], [B, C]] > + * > + * we want to combine that into a single entry jumping from A to > + * C. > + */ > + for (i = 1, j = 1; i < iter->skip_nr; i++) { > + struct skip_list_entry *ours = &iter->skip[i]; > + struct skip_list_entry *prev = &iter->skip[i - 1]; > + > + if (ours->start == ours->end) { > + /* ignore empty regions (no matching entries) */ > + continue; > + } else if (prev->end >= ours->start) { > + /* overlapping regions extend the previous one */ > + prev->end = ptr_max(prev->end, ours->end); > + } else { > + /* otherwise, insert a new region */ > + iter->skip[j++] = *ours; > + } > + } Mh. Does this do the right thing in case we have multiple consecutive overlapping skip list entries? We always end up adjusting the immediate predecessor as we use `i - 1` to find `prev`. Shouldn't we instead start with `j = 0` and assign `prev = &iter->skip[j]`? [snip] > diff --git a/t/t1419-exclude-refs.sh b/t/t1419-exclude-refs.sh > new file mode 100755 > index 0000000000..da5265a5a8 > --- /dev/null > +++ b/t/t1419-exclude-refs.sh > @@ -0,0 +1,101 @@ > +#!/bin/sh > + > +test_description='test exclude_patterns functionality in main ref store' > + > +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main > +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME > + > +TEST_PASSES_SANITIZE_LEAK=true > +. ./test-lib.sh > + > +for_each_ref__exclude () { > + test-tool ref-store main for-each-ref--exclude "$@" >actual.raw > + cut -d ' ' -f 2 actual.raw > +} > + > +for_each_ref () { > + git for-each-ref --format='%(refname)' "$@" > +} > + > +test_expect_success 'setup' ' > + test_commit --no-tag base && > + base="$(git rev-parse HEAD)" && > + > + for name in foo bar baz quux > + do > + for i in 1 2 3 > + do > + echo "create refs/heads/$name/$i $base" || return 1 > + done || return 1 > + done >in && > + echo "delete refs/heads/main" >>in && > + > + git update-ref --stdin <in && > + git pack-refs --all > +' > + > +test_expect_success 'for_each_ref__exclude(refs/heads/foo/)' ' > + # region in middle > + for_each_ref__exclude refs/heads refs/heads/foo >actual && > + for_each_ref refs/heads/bar refs/heads/baz refs/heads/quux >expect && > + > + test_cmp expect actual > +' Nit: it might be a bit more readable if we put the comment into the test description instead of having an opaque description that mentions ref names that don't have much of a meaning without reading the test itself. Patrick
On Mon, May 08, 2023 at 05:10:52PM -0700, Chris Torek wrote: > On Mon, May 8, 2023 at 3:06 PM Taylor Blau <me@ttaylorr.com> wrote: > > - Construct a skip list of regions by combining adjacent and > > overlapping regions from the previous step. > > You might want to add a note to the code that there is no > relationship here to the skip list data structure (see > https://en.wikipedia.org/wiki/Skip_list). Good suggestion, thanks. I picked the name skip list for this concept since we're skipping over excluded regions of the packed-refs file, but you're right it has no relation to the skip list data structure. Will note in the patch message. Thanks, Taylor
On Tue, May 09, 2023 at 05:15:43PM +0200, Patrick Steinhardt wrote: > On Mon, May 08, 2023 at 06:00:08PM -0400, Taylor Blau wrote: > > > Note that this optimization is only possible when none of the excluded > > pattern(s) have special meta-characters in them. To see why this is the > > case, consider the exclusion pattern "refs/foo[a]". In general, in order > > to find the location of the first record that matches this pattern, we > > could only consider up to the first meta-character, "refs/foo". But this > > doesn't work, since the excluded region we'd come up with would include > > "refs/foobar", even though it is not excluded. > > Is this generally true though? A naive implementation would iterate > through all references and find the first reference that matches the > exclusion regular exepression. From thereon we continue to iterate until > we find the first entry that doesn't match. This may cause us to end up > with a suboptimal skip list, but the skip list would still be valid. > > As I said, this implementation would be naive as we're now forced to > iterate through all references starting at the beginning. I assume that > your implementation will instead use a binary search to locate the first > entry that matches the exclusion pattern and the last pattern. But the > way this paragraph is formulated makes it sound like this is a general > fact, even though it is a fact that derives from the implementation. > > I of course don't propose to change the algorithm here, but instead to > clarify where this restriction actually comes from and why the tradeoff > makes sense. In the example you include, it's possible. But consider something like: $ git for-each-ref --exclude='refs/foo[ac]' The region that matches that expression ("refs/fooa", "refs/fooc" and everything underneath them) does not have to appear as a continuous single region in the packed-refs file. If you have, say, "refs/foobar", that will appear between the two regions you want to exclude. So I think you *might* be able to do it in general, but at the very least it would involve splitting each character class and finding the start and end of any region(s) that it matches. Even so, you'd have to try and match each entry as you determine the width of the excluded region, at which point you're at par with enumerating them anyway and having the caller discard any entries it doesn't want. > > + for (pattern = excluded_patterns; *pattern; pattern++) { > > + struct skip_list_entry *e; > > + > > + /* > > + * We can't feed any excludes with globs in them to the > > + * refs machinery. It only understands prefix matching. > > + * We likewise can't even feed the string leading up to > > + * the first meta-character, as something like "foo[a]" > > + * should not exclude "foobar" (but the prefix "foo" > > + * would match that and mark it for exclusion). > > + */ > > + if (has_glob_special(*pattern)) > > + continue; > > + > > + ALLOC_GROW(iter->skip, iter->skip_nr + 1, iter->skip_alloc); > > + > > + e = &iter->skip[iter->skip_nr++]; > > + e->start = find_reference_location(snapshot, *pattern, 0); > > + e->end = find_reference_location_end(snapshot, *pattern, 0); > > One thing that makes this hard to reason about is that we don't > explicitly handle the case where the pattern doesn't match at all. So > you require a bunch of knowledge about what exactly the functions > `find_reference_location()` and `find_reference_location_end()` do in > that case in order to determine whether we will end up doing the right > thing. > > Explicitly handling this would give us some benefits: > > - It makes the code more obvious. > > - We can avoid an extra skip list entry for every non-matching > pattern. > > - We wouldn't have to perform binary search over the snapshot twice. > > Might be I'm missing something though. We handle it implicitly via the mustexist parameter that both find_reference_location() and find_reference_location_end() take, which returns the position that you would find a matching entry if one does not already exist. You're right that it would save you a second binary search, but that is likely to be a vanishingly small cost compared to the actual traversal, disk I/O, etc. I could imagine modifying the signature of both of those functions to be something like: int find_reference_location(struct snapshot *, const char *pattern, int mustexist, char *out) Which would propagate the result through `out`, and the return value would be whether or not it found a matching entry. But that only really makes sense when `mustexist` is set to 0, and it adds some verbosity throughout. I think that it's totally possible to avoid the second search, but I'm not sure that the cost of additional complexity and verbosity is worth the benefit of avoiding one binary search (which will likely be resident, anyway). Note though that if e->start == e->end here we will discard the empty skip region below. > > + for (i = 1, j = 1; i < iter->skip_nr; i++) { > > + struct skip_list_entry *ours = &iter->skip[i]; > > + struct skip_list_entry *prev = &iter->skip[i - 1]; > > + > > + if (ours->start == ours->end) { > > + /* ignore empty regions (no matching entries) */ > > + continue; > > + } else if (prev->end >= ours->start) { > > + /* overlapping regions extend the previous one */ > > + prev->end = ptr_max(prev->end, ours->end); > > + } else { > > + /* otherwise, insert a new region */ > > + iter->skip[j++] = *ours; > > + } > > + } > > Mh. Does this do the right thing in case we have multiple consecutive > overlapping skip list entries? We always end up adjusting the immediate > predecessor as we use `i - 1` to find `prev`. Shouldn't we instead start > with `j = 0` and assign `prev = &iter->skip[j]`? Good catch. I think applying this on top should do the trick: --- 8< --- diff --git a/refs/packed-backend.c b/refs/packed-backend.c index 137a4233f6..3b1337267a 100644 --- a/refs/packed-backend.c +++ b/refs/packed-backend.c @@ -1054,9 +1054,9 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter, * we want to combine that into a single entry jumping from A to * C. */ - for (i = 1, j = 1; i < iter->skip_nr; i++) { + for (i = 0, j = 0; i < iter->skip_nr; i++) { struct skip_list_entry *ours = &iter->skip[i]; - struct skip_list_entry *prev = &iter->skip[i - 1]; + struct skip_list_entry *prev = &iter->skip[j]; if (ours->start == ours->end) { /* ignore empty regions (no matching entries) */ @@ -1066,7 +1066,7 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter, prev->end = ptr_max(prev->end, ours->end); } else { /* otherwise, insert a new region */ - iter->skip[j++] = *ours; + iter->skip[++j] = *ours; } } --- >8 --- > > +test_expect_success 'for_each_ref__exclude(refs/heads/foo/)' ' > > + # region in middle > > + for_each_ref__exclude refs/heads refs/heads/foo >actual && > > + for_each_ref refs/heads/bar refs/heads/baz refs/heads/quux >expect && > > + > > + test_cmp expect actual > > +' > > Nit: it might be a bit more readable if we put the comment into the test > description instead of having an opaque description that mentions ref > names that don't have much of a meaning without reading the test itself. Fair enough ;-). Thanks, Taylor
On Tue, May 09, 2023 at 04:55:55PM -0400, Taylor Blau wrote: > Good catch. I think applying this on top should do the trick: > > --- 8< --- > diff --git a/refs/packed-backend.c b/refs/packed-backend.c > index 137a4233f6..3b1337267a 100644 > --- a/refs/packed-backend.c > +++ b/refs/packed-backend.c > @@ -1054,9 +1054,9 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter, > * we want to combine that into a single entry jumping from A to > * C. > */ > - for (i = 1, j = 1; i < iter->skip_nr; i++) { > + for (i = 0, j = 0; i < iter->skip_nr; i++) { > struct skip_list_entry *ours = &iter->skip[i]; > - struct skip_list_entry *prev = &iter->skip[i - 1]; > + struct skip_list_entry *prev = &iter->skip[j]; > > if (ours->start == ours->end) { > /* ignore empty regions (no matching entries) */ > @@ -1066,7 +1066,7 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter, > prev->end = ptr_max(prev->end, ours->end); > } else { > /* otherwise, insert a new region */ > - iter->skip[j++] = *ours; > + iter->skip[++j] = *ours; > } > } > --- >8 --- Oops, this is wrong. It should be something like this instead: --- 8< --- diff --git a/refs/packed-backend.c b/refs/packed-backend.c index 137a4233f6..574f32d67f 100644 --- a/refs/packed-backend.c +++ b/refs/packed-backend.c @@ -1007,6 +1007,7 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter, { size_t i, j; const char **pattern; + struct skip_list_entry *last_disjoint; if (!excluded_patterns) return; @@ -1054,19 +1055,22 @@ static void populate_excluded_skip_list(struct packed_ref_iterator *iter, * we want to combine that into a single entry jumping from A to * C. */ + last_disjoint = iter->skip; + for (i = 1, j = 1; i < iter->skip_nr; i++) { struct skip_list_entry *ours = &iter->skip[i]; - struct skip_list_entry *prev = &iter->skip[i - 1]; if (ours->start == ours->end) { /* ignore empty regions (no matching entries) */ continue; - } else if (prev->end >= ours->start) { + } else if (ours->start <= last_disjoint->end) { /* overlapping regions extend the previous one */ - prev->end = ptr_max(prev->end, ours->end); + last_disjoint->end = ptr_max(last_disjoint->end, ours->end); } else { /* otherwise, insert a new region */ iter->skip[j++] = *ours; + last_disjoint = ours; + } } --- >8 --- Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes: > When iterating through the `packed-refs` file in order to answer a query > like: > > $ git for-each-ref --exclude=refs/__hidden__ > > it would be useful to avoid walking over all of the entries in > `refs/__hidden__/*` when possible, since we know that the ref-filter > code is going to throw them away anyways. > > In certain circumstances, doing so is possible. The algorithm for doing > so is as follows: > > - For each excluded pattern, find the first record that matches it, > and the first pattern that *doesn't* match it (i.e. the location > you'd next want to consider when excluding that pattern). Do we find "record" and then "pattern", or is the latter a misspelt "record"? I will assume it is the latter while reading the rest. Is the latter "the record that does not match the pattern, whose record number is the smallest but yet larger than the first record that matches the pattern?" That is we assume that the refs in the packed refs file are sorted and can be partitioned into three by each pattern: (1) refs before the first matching record---they do not match the pattern. (2) refs at or after the first matching record that match. (3) refs after (2) that do not match. I am not sure if that would work. What if the pattern were refs/tags/[ad-z]* and there are packed tags refs/tags/{a,b,c,d,e}? The pattern would partition the refs into [a](matches), [b,c](does not match), [d,e](matches). Perhaps I am grossly misunderstanding what the above explanation says. > - Sort the patterns by their starting location within the > `packed-refs` file. So the idea is that the patterns are sorted by the first record they match, and after sorting these patterns, refs that are between the beginning of the list of refs and the first record associated with the first pattern will not match _any_ pattern in that list? > - Construct a skip list of regions by combining adjacent and > overlapping regions from the previous step. "skip list of regions" -> "list of regions to skip", I guess. > - When iterating through the `packed-refs` file, if `iter->pos` is > ever contained in one of the regions from the previous steps, > advance `iter->pos` past the end of that region, and continue > enumeration. > > Note that this optimization is only possible when none of the excluded > pattern(s) have special meta-characters in them. Ah, so this is called "patterns" but it deals with literal patterns without globs? Then I'll stop worrying about the refs/tags/[ad-z] counter-example. > To see why this is the > case, consider the exclusion pattern "refs/foo[a]". In general, in order > to find the location of the first record that matches this pattern, we > could only consider up to the first meta-character, "refs/foo". But this > doesn't work, since the excluded region we'd come up with would include > "refs/foobar", even though it is not excluded. OK. > Using the skip list is fairly straightforward (see the changes to > `refs/packed-backend.c::next_record()`), but constructing the list is > not. To ensure that the construction is correct, add a new suite of > tests in t1419 covering various corner cases (overlapping regions, > partially overlapping regions, adjacent regions, etc.). Sounds good. Does this actually use the skip list data structure, or do they happen to share the same two words in their names, but otherwise have nothing common with each other? If the latter, we may want to revise the explanation, data type names, and variable names, to avoid confusion, as Chris pointed out earlier. > +static void populate_excluded_skip_list(struct packed_ref_iterator *iter, > + struct snapshot *snapshot, > + const char **excluded_patterns) > +{ > + size_t i, j; > + const char **pattern; > + > + if (!excluded_patterns) > + return; > + > + for (pattern = excluded_patterns; *pattern; pattern++) { > + struct skip_list_entry *e; > + > + /* > + * We can't feed any excludes with globs in them to the > + * refs machinery. It only understands prefix matching. > + * We likewise can't even feed the string leading up to > + * the first meta-character, as something like "foo[a]" > + * should not exclude "foobar" (but the prefix "foo" > + * would match that and mark it for exclusion). > + */ > + if (has_glob_special(*pattern)) > + continue; Hmph. I would have expected that a set of patterns with any one pattern with glob would invalidate the whole skip optimization, but it is nice if we can salvage such a set and still optimize, if only for literal patterns. Interesting. Ah, that is because we are dealing with ranges that cannot possibly match. With the mention of "first record that matches" etc. in the earlier descriptoin, I somehow misled myself that we are dealing with ranges that have interesting records. So, a pattern with glob does not contribute any range to be skipped, but that is OK. > + ALLOC_GROW(iter->skip, iter->skip_nr + 1, iter->skip_alloc); > + > + e = &iter->skip[iter->skip_nr++]; > + e->start = find_reference_location(snapshot, *pattern, 0); > + e->end = find_reference_location_end(snapshot, *pattern, 0); So, iter->skip[] array has one range per pattern at most, but some patterns may not contribute any range to the list. > + } > + > + if (!iter->skip_nr) { > + /* > + * Every entry in exclude_patterns has a meta-character, > + * nothing to do here. > + */ > + return; > + } > + > + QSORT(iter->skip, iter->skip_nr, skip_list_entry_cmp); > + > + /* > + * As an optimization, merge adjacent entries in the skip list > + * to jump forwards as far as possible when entering a skipped > + * region. > + * > + * For example, if we have two skipped regions: > + * > + * [[A, B], [B, C]] I am confused. The first pattern may never match records in [A..B] range, and the second pattern may never match records in [B..C] range, but what does it mean to combine these two ranges? > + * we want to combine that into a single entry jumping from A to > + * C. > + */ > + for (i = 1, j = 1; i < iter->skip_nr; i++) { > + struct skip_list_entry *ours = &iter->skip[i]; > + struct skip_list_entry *prev = &iter->skip[i - 1]; > + if (ours->start == ours->end) { > + /* ignore empty regions (no matching entries) */ > + continue; > + } else if (prev->end >= ours->start) { > + /* overlapping regions extend the previous one */ > + prev->end = ptr_max(prev->end, ours->end); > + } else { > + /* otherwise, insert a new region */ > + iter->skip[j++] = *ours; > + } None of the {braces} seem needed, but OK. > + } > + > + iter->skip_nr = j; > + iter->skip_pos = 0; > +}
On Tue, May 09, 2023 at 04:40:50PM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > When iterating through the `packed-refs` file in order to answer a query > > like: > > > > $ git for-each-ref --exclude=refs/__hidden__ > > > > it would be useful to avoid walking over all of the entries in > > `refs/__hidden__/*` when possible, since we know that the ref-filter > > code is going to throw them away anyways. > > > > In certain circumstances, doing so is possible. The algorithm for doing > > so is as follows: > > > > - For each excluded pattern, find the first record that matches it, > > and the first pattern that *doesn't* match it (i.e. the location > > you'd next want to consider when excluding that pattern). > > Do we find "record" and then "pattern", or is the latter a misspelt > "record"? I will assume it is the latter while reading the rest. > Is the latter "the record that does not match the pattern, whose > record number is the smallest but yet larger than the first record > that matches the pattern?" That is we assume that the refs in the > packed refs file are sorted and can be partitioned into three by > each pattern: (1) refs before the first matching record---they do > not match the pattern. (2) refs at or after the first matching > record that match. (3) refs after (2) that do not match. > > I am not sure if that would work. What if the pattern were refs/tags/[ad-z]* > and there are packed tags refs/tags/{a,b,c,d,e}? The pattern would partition > the refs into [a](matches), [b,c](does not match), [d,e](matches). > > Perhaps I am grossly misunderstanding what the above explanation > says. Sorry, you're right to assume the latter. This should read "..., find the record that matches it, and the first record that *doesn't* match." In the example you give, we would ignore that whole pattern and enumerate everything (including entries that match 'refs/tags/[ad-z]*'), and the caller is expected to discard them. > > - Sort the patterns by their starting location within the > > `packed-refs` file. > > So the idea is that the patterns are sorted by the first record they > match, and after sorting these patterns, refs that are between the > beginning of the list of refs and the first record associated with > the first pattern will not match _any_ pattern in that list? My patch could use some clarification here, since it is much easier to treat the patterns as unsorted, and then sort the beginning of the range that they match. > > - Construct a skip list of regions by combining adjacent and > > overlapping regions from the previous step. > > "skip list of regions" -> "list of regions to skip", I guess. Thanks, will update. > > Using the skip list is fairly straightforward (see the changes to > > `refs/packed-backend.c::next_record()`), but constructing the list is > > not. To ensure that the construction is correct, add a new suite of > > tests in t1419 covering various corner cases (overlapping regions, > > partially overlapping regions, adjacent regions, etc.). > > Sounds good. Does this actually use the skip list data structure, > or do they happen to share the same two words in their names, but > otherwise have nothing common with each other? If the latter, we > may want to revise the explanation, data type names, and variable > names, to avoid confusion, as Chris pointed out earlier. They have nothing to do with each other ;-). I made a note in this patch in the revised version I'll send in the next day or two to note the distinction. But I'm fine with renaming the whole concept to "jump list" or something like that if you prefer. > > + for (pattern = excluded_patterns; *pattern; pattern++) { > > + struct skip_list_entry *e; > > + > > + /* > > + * We can't feed any excludes with globs in them to the > > + * refs machinery. It only understands prefix matching. > > + * We likewise can't even feed the string leading up to > > + * the first meta-character, as something like "foo[a]" > > + * should not exclude "foobar" (but the prefix "foo" > > + * would match that and mark it for exclusion). > > + */ > > + if (has_glob_special(*pattern)) > > + continue; > > Hmph. I would have expected that a set of patterns with any one > pattern with glob would invalidate the whole skip optimization, but > it is nice if we can salvage such a set and still optimize, if only > for literal patterns. Interesting. > > Ah, that is because we are dealing with ranges that cannot possibly > match. With the mention of "first record that matches" etc. in the > earlier descriptoin, I somehow misled myself that we are dealing > with ranges that have interesting records. So, a pattern with glob > does not contribute any range to be skipped, but that is OK. Exactly right. This is all "best effort" anyway, since there are some patterns that we cannot construct a skip list entry for in the general case. So it's fine if we enumerate some references that match one or more of the excluded patterns, because the caller is expected to drop those results themselves. > > + /* > > + * As an optimization, merge adjacent entries in the skip list > > + * to jump forwards as far as possible when entering a skipped > > + * region. > > + * > > + * For example, if we have two skipped regions: > > + * > > + * [[A, B], [B, C]] > > I am confused. The first pattern may never match records in [A..B] > range, and the second pattern may never match records in [B..C] > range, but what does it mean to combine these two ranges? The patterns would match all records in their respective regions, but since they are excluded we want to jump over those references instead of iterating (and then discarding them later). So if we have a jump from A->B, and another from B->C, it would be fine to perform two jumps to get from A to C. But we can detect this case by combining adjacent/overlapping regions. > > + if (ours->start == ours->end) { > > + /* ignore empty regions (no matching entries) */ > > + continue; > > + } else if (prev->end >= ours->start) { > > + /* overlapping regions extend the previous one */ > > + prev->end = ptr_max(prev->end, ours->end); > > + } else { > > + /* otherwise, insert a new region */ > > + iter->skip[j++] = *ours; > > + } > > None of the {braces} seem needed, but OK. Yeah. I added the braces here intentionally to match the CodingGuidelines, which state that this case is an exception. Thanks, Taylor
On Tue, May 09, 2023 at 04:55:55PM -0400, Taylor Blau wrote: > On Tue, May 09, 2023 at 05:15:43PM +0200, Patrick Steinhardt wrote: > > On Mon, May 08, 2023 at 06:00:08PM -0400, Taylor Blau wrote: > > > > > Note that this optimization is only possible when none of the excluded > > > pattern(s) have special meta-characters in them. To see why this is the > > > case, consider the exclusion pattern "refs/foo[a]". In general, in order > > > to find the location of the first record that matches this pattern, we > > > could only consider up to the first meta-character, "refs/foo". But this > > > doesn't work, since the excluded region we'd come up with would include > > > "refs/foobar", even though it is not excluded. > > > > Is this generally true though? A naive implementation would iterate > > through all references and find the first reference that matches the > > exclusion regular exepression. From thereon we continue to iterate until > > we find the first entry that doesn't match. This may cause us to end up > > with a suboptimal skip list, but the skip list would still be valid. > > > > As I said, this implementation would be naive as we're now forced to > > iterate through all references starting at the beginning. I assume that > > your implementation will instead use a binary search to locate the first > > entry that matches the exclusion pattern and the last pattern. But the > > way this paragraph is formulated makes it sound like this is a general > > fact, even though it is a fact that derives from the implementation. > > > > I of course don't propose to change the algorithm here, but instead to > > clarify where this restriction actually comes from and why the tradeoff > > makes sense. > > In the example you include, it's possible. But consider something like: > > $ git for-each-ref --exclude='refs/foo[ac]' > > The region that matches that expression ("refs/fooa", "refs/fooc" and > everything underneath them) does not have to appear as a continuous > single region in the packed-refs file. If you have, say, "refs/foobar", > that will appear between the two regions you want to exclude. > > So I think you *might* be able to do it in general, but at the very > least it would involve splitting each character class and finding the > start and end of any region(s) that it matches. > > Even so, you'd have to try and match each entry as you determine the > width of the excluded region, at which point you're at par with > enumerating them anyway and having the caller discard any entries it > doesn't want. Alternatively you could also do this on a best-effort basis and only find the first matching region. But anyway, as said: I'm fine with the limitations but think that we should document better where they come from. The current commit message sounds like the limitation is of general nature even though it is in fact a conciously-chosen tradeoff that allows us to make the implementation more efficient for most cases. Patrick
diff --git a/ref-filter.c b/ref-filter.c index c8ced1104b..56ebd332fa 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -2208,12 +2208,13 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter, if (!filter->name_patterns[0]) { /* no patterns; we have to look at everything */ return refs_for_each_fullref_in(get_main_ref_store(the_repository), - "", NULL, cb, cb_data); + "", filter->exclude.v, cb, cb_data); } return refs_for_each_fullref_in_prefixes(get_main_ref_store(the_repository), NULL, filter->name_patterns, - NULL, cb, cb_data); + filter->exclude.v, + cb, cb_data); } /* diff --git a/refs/packed-backend.c b/refs/packed-backend.c index 98f96bf3ee..137a4233f6 100644 --- a/refs/packed-backend.c +++ b/refs/packed-backend.c @@ -595,6 +595,21 @@ static const char *find_reference_location(struct snapshot *snapshot, return find_reference_location_1(snapshot, refname, mustexist, 1); } +/* + * Find the place in `snapshot->buf` after the end of the record for + * `refname`. In other words, find the location of first thing *after* + * `refname`. + * + * Other semantics are identical to the ones in + * `find_reference_location()`. + */ +static const char *find_reference_location_end(struct snapshot *snapshot, + const char *refname, + int mustexist) +{ + return find_reference_location_1(snapshot, refname, mustexist, 0); +} + /* * Create a newly-allocated `snapshot` of the `packed-refs` file in * its current state and return it. The return value will already have @@ -786,6 +801,13 @@ struct packed_ref_iterator { /* The end of the part of the buffer that will be iterated over: */ const char *eof; + struct skip_list_entry { + const char *start; + const char *end; + } *skip; + size_t skip_nr, skip_alloc; + size_t skip_pos; + /* Scratch space for current values: */ struct object_id oid, peeled; struct strbuf refname_buf; @@ -803,14 +825,34 @@ struct packed_ref_iterator { */ static int next_record(struct packed_ref_iterator *iter) { - const char *p = iter->pos, *eol; + const char *p, *eol; strbuf_reset(&iter->refname_buf); + /* + * If iter->pos is contained within a skipped region, jump past + * it. + * + * Note that each skipped region is considered at most once, + * since they are ordered based on their starting position. + */ + while (iter->skip_pos < iter->skip_nr) { + struct skip_list_entry *curr = &iter->skip[iter->skip_pos]; + if (iter->pos < curr->start) + break; /* not to the next jump yet */ + + iter->skip_pos++; + if (iter->pos < curr->end) { + iter->pos = curr->end; + break; + } + } + if (iter->pos == iter->eof) return ITER_DONE; iter->base.flags = REF_ISPACKED; + p = iter->pos; if (iter->eof - p < the_hash_algo->hexsz + 2 || parse_oid_hex(p, &iter->oid, &p) || @@ -918,6 +960,7 @@ static int packed_ref_iterator_abort(struct ref_iterator *ref_iterator) int ok = ITER_DONE; strbuf_release(&iter->refname_buf); + free(iter->skip); release_snapshot(iter->snapshot); base_ref_iterator_free(ref_iterator); return ok; @@ -929,6 +972,108 @@ static struct ref_iterator_vtable packed_ref_iterator_vtable = { .abort = packed_ref_iterator_abort }; +static int skip_list_entry_cmp(const void *va, const void *vb) +{ + const struct skip_list_entry *a = va; + const struct skip_list_entry *b = vb; + + if (a->start < b->start) + return -1; + if (a->start > b->start) + return 1; + return 0; +} + +static int has_glob_special(const char *str) +{ + const char *p; + for (p = str; *p; p++) { + if (is_glob_special(*p)) + return 1; + } + return 0; +} + +static const char *ptr_max(const char *x, const char *y) +{ + if (x > y) + return x; + return y; +} + +static void populate_excluded_skip_list(struct packed_ref_iterator *iter, + struct snapshot *snapshot, + const char **excluded_patterns) +{ + size_t i, j; + const char **pattern; + + if (!excluded_patterns) + return; + + for (pattern = excluded_patterns; *pattern; pattern++) { + struct skip_list_entry *e; + + /* + * We can't feed any excludes with globs in them to the + * refs machinery. It only understands prefix matching. + * We likewise can't even feed the string leading up to + * the first meta-character, as something like "foo[a]" + * should not exclude "foobar" (but the prefix "foo" + * would match that and mark it for exclusion). + */ + if (has_glob_special(*pattern)) + continue; + + ALLOC_GROW(iter->skip, iter->skip_nr + 1, iter->skip_alloc); + + e = &iter->skip[iter->skip_nr++]; + e->start = find_reference_location(snapshot, *pattern, 0); + e->end = find_reference_location_end(snapshot, *pattern, 0); + } + + if (!iter->skip_nr) { + /* + * Every entry in exclude_patterns has a meta-character, + * nothing to do here. + */ + return; + } + + QSORT(iter->skip, iter->skip_nr, skip_list_entry_cmp); + + /* + * As an optimization, merge adjacent entries in the skip list + * to jump forwards as far as possible when entering a skipped + * region. + * + * For example, if we have two skipped regions: + * + * [[A, B], [B, C]] + * + * we want to combine that into a single entry jumping from A to + * C. + */ + for (i = 1, j = 1; i < iter->skip_nr; i++) { + struct skip_list_entry *ours = &iter->skip[i]; + struct skip_list_entry *prev = &iter->skip[i - 1]; + + if (ours->start == ours->end) { + /* ignore empty regions (no matching entries) */ + continue; + } else if (prev->end >= ours->start) { + /* overlapping regions extend the previous one */ + prev->end = ptr_max(prev->end, ours->end); + } else { + /* otherwise, insert a new region */ + iter->skip[j++] = *ours; + } + } + + iter->skip_nr = j; + iter->skip_pos = 0; +} + static struct ref_iterator *packed_ref_iterator_begin( struct ref_store *ref_store, const char *prefix, const char **exclude_patterns, @@ -964,6 +1109,9 @@ static struct ref_iterator *packed_ref_iterator_begin( ref_iterator = &iter->base; base_ref_iterator_init(ref_iterator, &packed_ref_iterator_vtable, 1); + if (exclude_patterns) + populate_excluded_skip_list(iter, snapshot, exclude_patterns); + iter->snapshot = snapshot; acquire_snapshot(snapshot); diff --git a/t/helper/test-ref-store.c b/t/helper/test-ref-store.c index 6d8f844e9c..2bff003f7c 100644 --- a/t/helper/test-ref-store.c +++ b/t/helper/test-ref-store.c @@ -175,6 +175,15 @@ static int cmd_for_each_ref(struct ref_store *refs, const char **argv) return refs_for_each_ref_in(refs, prefix, each_ref, NULL); } +static int cmd_for_each_ref__exclude(struct ref_store *refs, const char **argv) +{ + const char *prefix = notnull(*argv++, "prefix"); + const char **exclude_patterns = argv; + + return refs_for_each_fullref_in(refs, prefix, exclude_patterns, each_ref, + NULL); +} + static int cmd_resolve_ref(struct ref_store *refs, const char **argv) { struct object_id oid = *null_oid(); @@ -307,6 +316,7 @@ static struct command commands[] = { { "delete-refs", cmd_delete_refs }, { "rename-ref", cmd_rename_ref }, { "for-each-ref", cmd_for_each_ref }, + { "for-each-ref--exclude", cmd_for_each_ref__exclude }, { "resolve-ref", cmd_resolve_ref }, { "verify-ref", cmd_verify_ref }, { "for-each-reflog", cmd_for_each_reflog }, diff --git a/t/t1419-exclude-refs.sh b/t/t1419-exclude-refs.sh new file mode 100755 index 0000000000..da5265a5a8 --- /dev/null +++ b/t/t1419-exclude-refs.sh @@ -0,0 +1,101 @@ +#!/bin/sh + +test_description='test exclude_patterns functionality in main ref store' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +TEST_PASSES_SANITIZE_LEAK=true +. ./test-lib.sh + +for_each_ref__exclude () { + test-tool ref-store main for-each-ref--exclude "$@" >actual.raw + cut -d ' ' -f 2 actual.raw +} + +for_each_ref () { + git for-each-ref --format='%(refname)' "$@" +} + +test_expect_success 'setup' ' + test_commit --no-tag base && + base="$(git rev-parse HEAD)" && + + for name in foo bar baz quux + do + for i in 1 2 3 + do + echo "create refs/heads/$name/$i $base" || return 1 + done || return 1 + done >in && + echo "delete refs/heads/main" >>in && + + git update-ref --stdin <in && + git pack-refs --all +' + +test_expect_success 'for_each_ref__exclude(refs/heads/foo/)' ' + # region in middle + for_each_ref__exclude refs/heads refs/heads/foo >actual && + for_each_ref refs/heads/bar refs/heads/baz refs/heads/quux >expect && + + test_cmp expect actual +' + +test_expect_success 'for_each_ref__exclude(refs/heads/bar/)' ' + # region at beginning + for_each_ref__exclude refs/heads refs/heads/bar >actual && + for_each_ref refs/heads/baz refs/heads/foo refs/heads/quux >expect && + + test_cmp expect actual +' + +test_expect_success 'for_each_ref__exclude(refs/heads/quux/)' ' + # region at end + for_each_ref__exclude refs/heads refs/heads/quux >actual && + for_each_ref refs/heads/foo refs/heads/bar refs/heads/baz >expect && + + test_cmp expect actual +' + +test_expect_success 'for_each_ref__exclude(refs/heads/bar/, refs/heads/quux/)' ' + # disjoint regions + for_each_ref__exclude refs/heads refs/heads/bar refs/heads/quux >actual && + for_each_ref refs/heads/baz refs/heads/foo >expect && + + test_cmp expect actual +' + +test_expect_success 'for_each_ref__exclude(refs/heads/bar/, refs/heads/baz/)' ' + # adjacent, non-overlapping regions + for_each_ref__exclude refs/heads refs/heads/bar refs/heads/baz >actual && + for_each_ref refs/heads/foo refs/heads/quux >expect && + + test_cmp expect actual +' + +test_expect_success 'for_each_ref__exclude(refs/heads/ba refs/heads/baz/)' ' + # overlapping region + for_each_ref__exclude refs/heads refs/heads/ba refs/heads/baz >actual && + for_each_ref refs/heads/foo refs/heads/quux >expect && + + test_cmp expect actual +' + +test_expect_success 'for_each_ref__exclude(refs/heads/does/not/exist)' ' + # empty region + for_each_ref__exclude refs/heads refs/heads/does/not/exist >actual && + for_each_ref >expect && + + test_cmp expect actual +' + +test_expect_success 'for_each_ref__exclude(refs/heads/ba*)' ' + # discards meta-characters + for_each_ref__exclude refs/heads "refs/heads/ba*" >actual && + for_each_ref >expect && + + test_cmp expect actual +' + +test_done