Message ID | 20210119144251.27924-2-jacob@gitlab.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | ls-refs.c: minimize number of refs visited | expand |
Hi Jacob, On Tue, Jan 19, 2021 at 03:42:51PM +0100, Jacob Vosmaer wrote: > The previous implementation of ls-refs would perform exactly one ref > walk, matching each ref against the prefixes (if any) provided by the > user. This can be expensive if there are a lot of refs and the user > only cares about a small subset of them. > > In this patch we analyze the prefixes provided by the user and build a > minimal set of disjoint prefixes that contains all of them. We then do > a ref walk for each of these minimal prefixes. This reminds me of b31e2680c4 (ref-filter.c: find disjoint pattern prefixes, 2019-06-26), where we solved a very similar problem for 'git for-each-ref'. The difference here is that we are operating on a set of prefixes, not a set of refs. But, I think that we could get pretty far by treating the prefixes as refs so that we can call ref-filter.c:find_longest_prefixes(). For its purposes, it doesn't really care about whether or not the arguments actually are references. It simply returns the longest common prefix among all of its arguments (delimited by '/' characters). > It is tempting to have just one strvec for the prefixes and use it > both for matching and for iterating. But every time I tried that, it > made things more complicated. I settled on leaving the existing ref > matching (using &data.prefixes) alone, and I added a new layer around > it for the ref walk optimization (using &iter_prefixes). I think the implementation in b31e2680c4 deals with this nicely: it takes a pointer to a strvec and dumps prefixes in there. > This commit also fixes a bug in ls-refs.c that was not triggered > before: we were using a strvec set to zero, which is not how you are > supposed to initialize a strvec. We now call strvec_init after zeroing. Good. > Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> > --- > ls-refs.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 62 insertions(+), 1 deletion(-) > > diff --git a/ls-refs.c b/ls-refs.c > index a1e0b473e4..6d5f0c769a 100644 > --- a/ls-refs.c > +++ b/ls-refs.c > @@ -84,12 +84,44 @@ static int ls_refs_config(const char *var, const char *value, void *data) > return parse_hide_refs_config(var, value, "uploadpack"); > } > > +static int cmp_prefix(const void *a_, const void *b_){ > + const char *a = *(const char **)a_; > + const char *b = *(const char **)b_; > + return strcmp(a, b); > +} > + > +static void deduplicate_prefixes(struct strvec *prefixes) { > + int i; > + > + QSORT(prefixes->v, prefixes->nr, cmp_prefix); > + > + for (i = 1; i < prefixes->nr;) { > + const char *p = prefixes->v[i]; > + > + /* > + * If p is "refs/foobar" and its predecessor is "refs/foo" then we should > + * drop p, both to avoid sending duplicate refs to the user, and to avoid > + * doing unnecessary work. > + */ > + if (starts_with(p, prefixes->v[i - 1])) { > + MOVE_ARRAY(&prefixes->v[i], &prefixes->v[i + 1], prefixes->nr - (i + 1)); > + prefixes->v[prefixes->nr - 1] = p; > + strvec_pop(prefixes); > + } else { > + i++; > + } > + } > +} > + Indeed, this and the below code are very reminiscent of b31e2680c4. So, I wonder if it's possible to use the existing implementation rather than implement what is roughly the same thing twice. Below is a completely untested patch to try and reuse the code from b31e2680c4. (It compiles, but that's the extent of my guarantees about it ;-).) It's all smashed into one huge patch, so if you're happy with the direction I'll take care of cleaning it up. The new function in ref-filter.h really belongs in refs.h, but I left the implementation in ref-filter.c to avoid creating more noise in the diff. Let me know what you think. Thanks, Taylor --- >8 --- Subject: [PATCH] ls-refs: iterate longest common refs prefix Signed-off-by: Taylor Blau <me@ttaylorr.com> --- ls-refs.c | 4 +++- ref-filter.c | 46 ++++++++++++++++++++++++++++++++-------------- ref-filter.h | 10 ++++++++++ 3 files changed, 45 insertions(+), 15 deletions(-) diff --git a/ls-refs.c b/ls-refs.c index a1e0b473e4..6a3e11d45c 100644 --- a/ls-refs.c +++ b/ls-refs.c @@ -6,6 +6,7 @@ #include "ls-refs.h" #include "pkt-line.h" #include "config.h" +#include "ref-filter.h" /* * Check if one of the prefixes is a prefix of the ref. @@ -109,7 +110,8 @@ int ls_refs(struct repository *r, struct strvec *keys, die(_("expected flush after ls-refs arguments")); head_ref_namespaced(send_ref, &data); - for_each_namespaced_ref(send_ref, &data); + for_each_fullref_in_prefixes(get_git_namespace(), data.prefixes.v, + send_ref, &data, 0); packet_flush(1); strvec_clear(&data.prefixes); return 0; diff --git a/ref-filter.c b/ref-filter.c index aa260bfd09..c34bf34d06 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -1987,6 +1987,36 @@ static void find_longest_prefixes(struct string_list *out, strbuf_release(&prefix); } +int for_each_fullref_in_prefixes(const char *namespace, + const char **patterns, + each_ref_fn cb, + void *cb_data, + int broken) +{ + struct string_list prefixes = STRING_LIST_INIT_DUP; + struct string_list_item *prefix; + struct strbuf buf = STRBUF_INIT; + int ret = 0, namespace_len; + + find_longest_prefixes(&prefixes, patterns); + + if (namespace) + strbuf_addstr(&buf, namespace); + namespace_len = buf.len; + + for_each_string_list_item(prefix, &prefixes) { + strbuf_addf(&buf, prefix->string); + ret = for_each_fullref_in(buf.buf, cb, cb_data, broken); + if (ret) + break; + strbuf_setlen(&buf, namespace_len); + } + + string_list_clear(&prefixes, 0); + strbuf_release(&buf); + return ret; +} + /* * This is the same as for_each_fullref_in(), but it tries to iterate * only over the patterns we'll care about. Note that it _doesn't_ do a full @@ -1997,10 +2027,6 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter, void *cb_data, int broken) { - struct string_list prefixes = STRING_LIST_INIT_DUP; - struct string_list_item *prefix; - int ret; - if (!filter->match_as_path) { /* * in this case, the patterns are applied after @@ -2024,16 +2050,8 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter, return for_each_fullref_in("", cb, cb_data, broken); } - find_longest_prefixes(&prefixes, filter->name_patterns); - - for_each_string_list_item(prefix, &prefixes) { - ret = for_each_fullref_in(prefix->string, cb, cb_data, broken); - if (ret) - break; - } - - string_list_clear(&prefixes, 0); - return ret; + return for_each_fullref_in_prefixes(NULL, filter->name_patterns, + cb, cb_data, broken); } /* diff --git a/ref-filter.h b/ref-filter.h index feaef4a8fd..f666a0fb49 100644 --- a/ref-filter.h +++ b/ref-filter.h @@ -146,4 +146,14 @@ struct ref_array_item *ref_array_push(struct ref_array *array, const char *refname, const struct object_id *oid); +/** + * iterate all refs which descend from the longest common prefix among + * "patterns". + */ +int for_each_fullref_in_prefixes(const char *namespace, + const char **patterns, + each_ref_fn cb, + void *cb_data, + int broken); + #endif /* REF_FILTER_H */ -- 2.30.0.138.g6d7191ea01
Hi Taylor, Thanks for your reply. That sounds like a great idea! On Tue, Jan 19, 2021 at 5:12 PM Taylor Blau <me@ttaylorr.com> wrote: > But, I think that we could get pretty far by treating the prefixes as > refs so that we can call ref-filter.c:find_longest_prefixes(). For its > purposes, it doesn't really care about whether or not the arguments > actually are references. It simply returns the longest common prefix > among all of its arguments (delimited by '/' characters). What does "delimited by /" mean? Without really understanding the longest common prefix code in ref-filter.c, my intuitive concern is that the specifics of glob matching and special treatment of '/' may bite us. I suppose we'll be fine because ls-refs has its own matching logic. So long as for_each_fullref_in_prefixes yield enough prefixes, the end result would remain the same. The question is then, does for_each_fullref_in_prefixes yield everything we need? I think my approach would be to expose the new for_each_fullref_in_prefixes iterator you propose through test-tool, and unit test it so we can be sure it handles both contexts (for-each-refs with globs and special '/', and ls-refs without any special character behavior) correctly. I may be overly cautious here, take this with a grain of salt because I am not an experienced Git contributor. On that topic, apologies if I'm botching my inline replies in this email. Regarding your patch: it works correctly and as fast as expected for my development "many refs" test case. Yay! It also segfaults and fails some tests but see my comments below. All in all: thanks, great idea, yes we should reuse, I only lack confidence on correctness because I don't fully grasp your longest-common-prefix algorithm yet. Cheers, Jacob > --- >8 --- > > Subject: [PATCH] ls-refs: iterate longest common refs prefix > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > ls-refs.c | 4 +++- > ref-filter.c | 46 ++++++++++++++++++++++++++++++++-------------- > ref-filter.h | 10 ++++++++++ > 3 files changed, 45 insertions(+), 15 deletions(-) > > diff --git a/ls-refs.c b/ls-refs.c > index a1e0b473e4..6a3e11d45c 100644 > --- a/ls-refs.c > +++ b/ls-refs.c > @@ -6,6 +6,7 @@ > #include "ls-refs.h" > #include "pkt-line.h" > #include "config.h" > +#include "ref-filter.h" > > /* > * Check if one of the prefixes is a prefix of the ref. > @@ -109,7 +110,8 @@ int ls_refs(struct repository *r, struct strvec *keys, > die(_("expected flush after ls-refs arguments")); > > head_ref_namespaced(send_ref, &data); > - for_each_namespaced_ref(send_ref, &data); > + for_each_fullref_in_prefixes(get_git_namespace(), data.prefixes.v, > + send_ref, &data, 0); Remember to add that strvec_init(&data.prefixes) I talked about in my commit message, or you get a segfault. You also need: if (!data.prefixes.nr) strvec_push(&data.prefixes, "refs/"); Without it, the "ls-refs without ref-prefix" scenario fails. > packet_flush(1); > strvec_clear(&data.prefixes); > return 0; > diff --git a/ref-filter.c b/ref-filter.c > index aa260bfd09..c34bf34d06 100644 > --- a/ref-filter.c > +++ b/ref-filter.c > @@ -1987,6 +1987,36 @@ static void find_longest_prefixes(struct string_list *out, > strbuf_release(&prefix); > } > > +int for_each_fullref_in_prefixes(const char *namespace, > + const char **patterns, > + each_ref_fn cb, > + void *cb_data, > + int broken) > +{ > + struct string_list prefixes = STRING_LIST_INIT_DUP; > + struct string_list_item *prefix; > + struct strbuf buf = STRBUF_INIT; > + int ret = 0, namespace_len; > + > + find_longest_prefixes(&prefixes, patterns); > + > + if (namespace) > + strbuf_addstr(&buf, namespace); > + namespace_len = buf.len; > + > + for_each_string_list_item(prefix, &prefixes) { > + strbuf_addf(&buf, prefix->string); Missing format string "%s". > + ret = for_each_fullref_in(buf.buf, cb, cb_data, broken); > + if (ret) > + break; > + strbuf_setlen(&buf, namespace_len); > + } > + > + string_list_clear(&prefixes, 0); > + strbuf_release(&buf); > + return ret; > +} > + > /* > * This is the same as for_each_fullref_in(), but it tries to iterate > * only over the patterns we'll care about. Note that it _doesn't_ do a full > @@ -1997,10 +2027,6 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter, > void *cb_data, > int broken) > { > - struct string_list prefixes = STRING_LIST_INIT_DUP; > - struct string_list_item *prefix; > - int ret; > - > if (!filter->match_as_path) { > /* > * in this case, the patterns are applied after > @@ -2024,16 +2050,8 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter, > return for_each_fullref_in("", cb, cb_data, broken); > } > > - find_longest_prefixes(&prefixes, filter->name_patterns); > - > - for_each_string_list_item(prefix, &prefixes) { > - ret = for_each_fullref_in(prefix->string, cb, cb_data, broken); > - if (ret) > - break; > - } > - > - string_list_clear(&prefixes, 0); > - return ret; > + return for_each_fullref_in_prefixes(NULL, filter->name_patterns, > + cb, cb_data, broken); > } > > /* > diff --git a/ref-filter.h b/ref-filter.h > index feaef4a8fd..f666a0fb49 100644 > --- a/ref-filter.h > +++ b/ref-filter.h > @@ -146,4 +146,14 @@ struct ref_array_item *ref_array_push(struct ref_array *array, > const char *refname, > const struct object_id *oid); > > +/** > + * iterate all refs which descend from the longest common prefix among > + * "patterns". > + */ > +int for_each_fullref_in_prefixes(const char *namespace, > + const char **patterns, > + each_ref_fn cb, > + void *cb_data, > + int broken); > + > #endif /* REF_FILTER_H */ > -- > 2.30.0.138.g6d7191ea01 >
On Tue, Jan 19, 2021 at 06:42:56PM +0100, Jacob Vosmaer wrote: > Hi Taylor, > > Thanks for your reply. That sounds like a great idea! > > On Tue, Jan 19, 2021 at 5:12 PM Taylor Blau <me@ttaylorr.com> wrote: > > But, I think that we could get pretty far by treating the prefixes as > > refs so that we can call ref-filter.c:find_longest_prefixes(). For its > > purposes, it doesn't really care about whether or not the arguments > > actually are references. It simply returns the longest common prefix > > among all of its arguments (delimited by '/' characters). > > What does "delimited by /" mean? Ah, I just meant that it looks for the longest common prefix where it will only split at '/' characters. But, that's not right at all: find_longest_prefixes_1() will happily split anywhere there is a difference. > Without really understanding the longest common prefix code in > ref-filter.c, my intuitive concern is that the specifics of glob > matching and special treatment of '/' may bite us. I suppose we'll be > fine because ls-refs has its own matching logic. So long as > for_each_fullref_in_prefixes yield enough prefixes, the end result > would remain the same. Right. We can ignore the concern about '/' (seeing my comment above), and note that find_longest_prefixes_1() breaks on glob metacharacters, so we'll only match or overmatch the desired set (and we'll never undermatch). I made sure to write in the second patch downthread that ls-refs.c:send_ref() correctly handles receiving too many refs (and it discards ones that it doesn't want). > The question is then, does for_each_fullref_in_prefixes yield > everything we need? For the reasons above, yes: it will. > I think my approach would be to expose the new > for_each_fullref_in_prefixes iterator you propose through test-tool, > and unit test it so we can be sure it handles both contexts > (for-each-refs with globs and special '/', and ls-refs without any > special character behavior) correctly. > > I may be overly cautious here, take this with a grain of salt because > I am not an experienced Git contributor. On that topic, apologies if > I'm botching my inline replies in this email. I do appreciate your caution, but I'm not sure exposing a test-tool is necessary, since we already test this behavior extensively in t6300 (and now t5701, t5702 and t5704, too). > Regarding your patch: it works correctly and as fast as expected for > my development "many refs" test case. Yay! It also segfaults and fails > some tests but see my comments below. > > All in all: thanks, great idea, yes we should reuse, I only lack > confidence on correctness because I don't fully grasp your > longest-common-prefix algorithm yet. :-). Thanks for the pointers on the spots that I had missed (as I mentioned, I only compiled it before sending, so having an additional set of more careful eyes was quite helpful). Thanks, Taylor
On Tue, Jan 19, 2021 at 02:09:04PM -0500, Taylor Blau wrote: > > What does "delimited by /" mean? > > Ah, I just meant that it looks for the longest common prefix where it > will only split at '/' characters. But, that's not right at all: > find_longest_prefixes_1() will happily split anywhere there is a > difference. Right. We thought in early revisions of the ref-filter work that we might have to split on path components, but it turns out that the underlying ref code is happy to take arbitrary prefixes. And it's that code which defines our strategy; Even if the ls-refs code wanted to allow only full path components, it should be using the limiting from for_each_ref_in() only as an optimization, and applying its own filtering to the output. > > Without really understanding the longest common prefix code in > > ref-filter.c, my intuitive concern is that the specifics of glob > > matching and special treatment of '/' may bite us. I suppose we'll be > > fine because ls-refs has its own matching logic. So long as > > for_each_fullref_in_prefixes yield enough prefixes, the end result > > would remain the same. > > Right. We can ignore the concern about '/' (seeing my comment above), > and note that find_longest_prefixes_1() breaks on glob metacharacters, > so we'll only match or overmatch the desired set (and we'll never > undermatch). > > I made sure to write in the second patch downthread that > ls-refs.c:send_ref() correctly handles receiving too many refs (and it > discards ones that it doesn't want). Yeah, I think the glob-handling is OK for that reason. We'd have a smaller prefix, which means seeing more refs. So it's never a correctness issue, but only a potential optimization one (we consider more refs in ls-refs.c than we might otherwise). But I think even that s impossible, because the glob characters are not valid in refnames. So we would never see one at all in a string which is meant to be a pure prefix. > > I think my approach would be to expose the new > > for_each_fullref_in_prefixes iterator you propose through test-tool, > > and unit test it so we can be sure it handles both contexts > > (for-each-refs with globs and special '/', and ls-refs without any > > special character behavior) correctly. > > > > I may be overly cautious here, take this with a grain of salt because > > I am not an experienced Git contributor. On that topic, apologies if > > I'm botching my inline replies in this email. > > I do appreciate your caution, but I'm not sure exposing a test-tool is > necessary, since we already test this behavior extensively in t6300 (and > now t5701, t5702 and t5704, too). Agreed. test-tool is fine when we have no way to easily feed data to a unit. But in this case, the prefix code is easy to test via the for-each-ref plumbing. We'd want to make sure the new setting in ls-refs is covered, too, but I agree that t5701 covers that well (regular fetches make sure we don't send too few refs, but those ones check that we limited the refs in the expected way). -Peff
On Tue, Jan 19, 2021 at 04:59:18PM -0500, Jeff King wrote: > > > What does "delimited by /" mean? > > > > Ah, I just meant that it looks for the longest common prefix where it > > will only split at '/' characters. But, that's not right at all: > > find_longest_prefixes_1() will happily split anywhere there is a > > difference. > > Right. We thought in early revisions of the ref-filter work that we > might have to split on path components, but it turns out that the > underlying ref code is happy to take arbitrary prefixes. And it's that > code which defines our strategy; Even if the ls-refs code wanted to > allow only full path components, it should be using the limiting from > for_each_ref_in() only as an optimization, and applying its own > filtering to the output. Having now looked carefully at the ls-refs code, it's a pure prefix-match, too. So I think we _could_ rely on for_each_fullref_in() returning us the correct full results, and not checking it further in send_ref(). But I kind of like keeping it there as an extra check (and one which could in theory grow more logic later). -Peff
On Tue, Jan 19, 2021 at 05:15:30PM -0500, Jeff King wrote: > On Tue, Jan 19, 2021 at 04:59:18PM -0500, Jeff King wrote: > > > > > What does "delimited by /" mean? > > > > > > Ah, I just meant that it looks for the longest common prefix where it > > > will only split at '/' characters. But, that's not right at all: > > > find_longest_prefixes_1() will happily split anywhere there is a > > > difference. > > > > Right. We thought in early revisions of the ref-filter work that we > > might have to split on path components, but it turns out that the > > underlying ref code is happy to take arbitrary prefixes. And it's that > > code which defines our strategy; Even if the ls-refs code wanted to > > allow only full path components, it should be using the limiting from > > for_each_ref_in() only as an optimization, and applying its own > > filtering to the output. > > Having now looked carefully at the ls-refs code, it's a pure > prefix-match, too. So I think we _could_ rely on for_each_fullref_in() > returning us the correct full results, and not checking it further in > send_ref(). But I kind of like keeping it there as an extra check (and > one which could in theory grow more logic later). Hmm. What if the caller asks for: ref-prefix refs/tags/a ref-prefix refs/tags/b ? The LCP between those two is refs/tags, so send_ref() will presumably get lots of reuslts that it doesn't care about (assuming there are tags besides a and b). Thanks, Taylor
On Tue, Jan 19, 2021 at 05:23:36PM -0500, Taylor Blau wrote: > > Having now looked carefully at the ls-refs code, it's a pure > > prefix-match, too. So I think we _could_ rely on for_each_fullref_in() > > returning us the correct full results, and not checking it further in > > send_ref(). But I kind of like keeping it there as an extra check (and > > one which could in theory grow more logic later). > > Hmm. What if the caller asks for: > > ref-prefix refs/tags/a > ref-prefix refs/tags/b > > ? > > The LCP between those two is refs/tags, so send_ref() will presumably > get lots of reuslts that it doesn't care about (assuming there are tags > besides a and b). Oh, you're right, of course. Ignore me. :) -Peff
On Tue, Jan 19, 2021 at 03:42:51PM +0100, Jacob Vosmaer wrote: > The previous implementation of ls-refs would perform exactly one ref > walk, matching each ref against the prefixes (if any) provided by the > user. This can be expensive if there are a lot of refs and the user > only cares about a small subset of them. > > In this patch we analyze the prefixes provided by the user and build a > minimal set of disjoint prefixes that contains all of them. We then do > a ref walk for each of these minimal prefixes. Thanks for posting this. I have a vague recollection that we considered this either when we did the for-each-ref prefixes, or when we added ls-refs prefixes, but I can't seem to find either. At any rate, at GitHub we haven't generally found it to be a problem because our horrifically-large repos tend to be aggregated alternates repos, not the ones people serve upload-pack out of (though I did just time it, and some of our largest repos should save a few hundred milliseconds per advertisement, which is certainly not nothing). I do think we should reuse the code from ref-filter, as Taylor showed. > This commit also fixes a bug in ls-refs.c that was not triggered > before: we were using a strvec set to zero, which is not how you are > supposed to initialize a strvec. We now call strvec_init after zeroing. Good catch. It didn't matter until now because nobody relied on having a NULL entry when no prefix had been added (instead, they always iterated over prefixes->nr). IMHO that is worth fixing as a separate commit. -Peff
On Tue, Jan 19, 2021 at 05:52:04PM -0500, Jeff King wrote: > On Tue, Jan 19, 2021 at 05:23:36PM -0500, Taylor Blau wrote: > > > > Having now looked carefully at the ls-refs code, it's a pure > > > prefix-match, too. So I think we _could_ rely on for_each_fullref_in() > > > returning us the correct full results, and not checking it further in > > > send_ref(). But I kind of like keeping it there as an extra check (and > > > one which could in theory grow more logic later). > > > > Hmm. What if the caller asks for: > > > > ref-prefix refs/tags/a > > ref-prefix refs/tags/b > > > > ? > > > > The LCP between those two is refs/tags, so send_ref() will presumably > > get lots of reuslts that it doesn't care about (assuming there are tags > > besides a and b). > > Oh, you're right, of course. Ignore me. :) Actually, I am not sure that we would look for "refs/tags/" in that case (I did a quick test and we do not seem to). Which makes sense, as it is cheaper to find the "a" and "b" hierarchies separately if there is a very big "refs/tags/c" hierarchy. But I agree that this is a good reason that callers should consider it as an optimization which could return more results than expected. -Peff
On Tue, Jan 19, 2021 at 05:53:56PM -0500, Jeff King wrote: > Thanks for posting this. I have a vague recollection that we considered > this either when we did the for-each-ref prefixes, or when we added > ls-refs prefixes, but I can't seem to find either. At any rate, at > GitHub we haven't generally found it to be a problem because our > horrifically-large repos tend to be aggregated alternates repos, not the > ones people serve upload-pack out of (though I did just time it, and > some of our largest repos should save a few hundred milliseconds per > advertisement, which is certainly not nothing). Great on all counts! > > This commit also fixes a bug in ls-refs.c that was not triggered > > before: we were using a strvec set to zero, which is not how you are > > supposed to initialize a strvec. We now call strvec_init after zeroing. > > Good catch. It didn't matter until now because nobody relied on having a > NULL entry when no prefix had been added (instead, they always iterated > over prefixes->nr). IMHO that is worth fixing as a separate commit. Yeah. Even after calling it out as such myself, I promptly forgot it when preparing the first patch I sent back to Jacob! I didn't pull it out into its own patch, and rather folded it in to my 2/2. It has a small call-out of its own, but if you'd prefer it by itself, I'm happy to resubmit it with that change included. > -Peff Thanks, Taylor
On Tue, Jan 19, 2021 at 05:59:34PM -0500, Jeff King wrote: > Actually, I am not sure that we would look for "refs/tags/" in that case > (I did a quick test and we do not seem to). Which makes sense, as it is > cheaper to find the "a" and "b" hierarchies separately if there is a > very big "refs/tags/c" hierarchy. Ah, makes sense. Thanks for double checking. > But I agree that this is a good reason that callers should consider it > as an optimization which could return more results than expected. Yep. Even though I couldn't quite remember when the algorithm would split without looking more closely, I made sure to document that it iterates *all* references that are descendent of the LCP of its arguments. > -Peff Thanks, Taylor
On Tue, Jan 19, 2021 at 06:00:19PM -0500, Taylor Blau wrote: > > Good catch. It didn't matter until now because nobody relied on having a > > NULL entry when no prefix had been added (instead, they always iterated > > over prefixes->nr). IMHO that is worth fixing as a separate commit. > > Yeah. Even after calling it out as such myself, I promptly forgot it > when preparing the first patch I sent back to Jacob! > > I didn't pull it out into its own patch, and rather folded it in to my > 2/2. It has a small call-out of its own, but if you'd prefer it by > itself, I'm happy to resubmit it with that change included. It's not a _huge_ deal to me, but I think it is slightly nicer as a separate patch. Plus it can easily be credited to Jacob, so at least he gets some authorship credit out of this. :) -Peff
On Wed, Jan 20, 2021 at 12:11 AM Jeff King <peff@peff.net> wrote: > It's not a _huge_ deal to me, but I think it is slightly nicer as a > separate patch. Plus it can easily be credited to Jacob, so at least he > gets some authorship credit out of this. :) Thanks, I'm fine either way. Just remember that the other changes in ls-refs.c depend on that strvec_init, so if we split it out, we need to remember / maintain that (order) dependency. > Having now looked carefully at the ls-refs code, it's a pure > prefix-match, too. So I think we _could_ rely on for_each_fullref_in() > returning us the correct full results, and not checking it further in > send_ref(). I also think we could. But as I alluded to in my original commit message, I don't like how complicated that gets. I find it easier to convince myself in the current form that the longest prefix code selects _enough_ prefixes, which is a weaker property than "selects exactly the right prefixes". Jacob
On Wed, Jan 20, 2021 at 11:40 AM Jacob Vosmaer <jacob@gitlab.com> wrote: > I also think we could. But as I alluded to in my original commit > message, I don't like how complicated that gets. I find it easier to > convince myself in the current form that the longest prefix code > selects _enough_ prefixes, which is a weaker property than "selects > exactly the right prefixes". By which I mean to say I think we should keep the current form, with the final authoritative ref matching pass. Jacob
diff --git a/ls-refs.c b/ls-refs.c index a1e0b473e4..6d5f0c769a 100644 --- a/ls-refs.c +++ b/ls-refs.c @@ -84,12 +84,44 @@ static int ls_refs_config(const char *var, const char *value, void *data) return parse_hide_refs_config(var, value, "uploadpack"); } +static int cmp_prefix(const void *a_, const void *b_){ + const char *a = *(const char **)a_; + const char *b = *(const char **)b_; + return strcmp(a, b); +} + +static void deduplicate_prefixes(struct strvec *prefixes) { + int i; + + QSORT(prefixes->v, prefixes->nr, cmp_prefix); + + for (i = 1; i < prefixes->nr;) { + const char *p = prefixes->v[i]; + + /* + * If p is "refs/foobar" and its predecessor is "refs/foo" then we should + * drop p, both to avoid sending duplicate refs to the user, and to avoid + * doing unnecessary work. + */ + if (starts_with(p, prefixes->v[i - 1])) { + MOVE_ARRAY(&prefixes->v[i], &prefixes->v[i + 1], prefixes->nr - (i + 1)); + prefixes->v[prefixes->nr - 1] = p; + strvec_pop(prefixes); + } else { + i++; + } + } +} + int ls_refs(struct repository *r, struct strvec *keys, struct packet_reader *request) { struct ls_refs_data data; + struct strvec iter_prefixes = STRVEC_INIT; + const char **p; memset(&data, 0, sizeof(data)); + strvec_init(&data.prefixes); git_config(ls_refs_config, NULL); @@ -109,8 +141,37 @@ int ls_refs(struct repository *r, struct strvec *keys, die(_("expected flush after ls-refs arguments")); head_ref_namespaced(send_ref, &data); - for_each_namespaced_ref(send_ref, &data); + + /* + * Try to create a minimal disjoint set of prefixes that + * contains everything the user wants to see, but that lets us + * avoid a full ref walk. If the repo has a lot of refs and + * the user only wants to see refs/heads and refs/tags, + * it is faster to do two small walks of refs/heads and + * refs/tags instead of one big walk of all of refs/. + */ + for (p = data.prefixes.v; *p; p++) { + if (starts_with(*p, "refs/")) { + strvec_push(&iter_prefixes, *p); + } else { /* full ref walk in pathological cases like "" */ + strvec_push(&iter_prefixes, "refs/"); + } + } + + if (!iter_prefixes.nr) /* full ref walk when there are no prefixes */ + strvec_push(&iter_prefixes, "refs/"); + + deduplicate_prefixes(&iter_prefixes); + + for (p = iter_prefixes.v; *p; p++) { + struct strbuf buf = STRBUF_INIT; + strbuf_addf(&buf, "%s%s", get_git_namespace(), *p); + for_each_fullref_in(buf.buf, send_ref, &data, 0); + strbuf_release(&buf); + } + packet_flush(1); strvec_clear(&data.prefixes); + strvec_clear(&iter_prefixes); return 0; }
The previous implementation of ls-refs would perform exactly one ref walk, matching each ref against the prefixes (if any) provided by the user. This can be expensive if there are a lot of refs and the user only cares about a small subset of them. In this patch we analyze the prefixes provided by the user and build a minimal set of disjoint prefixes that contains all of them. We then do a ref walk for each of these minimal prefixes. It is tempting to have just one strvec for the prefixes and use it both for matching and for iterating. But every time I tried that, it made things more complicated. I settled on leaving the existing ref matching (using &data.prefixes) alone, and I added a new layer around it for the ref walk optimization (using &iter_prefixes). This commit also fixes a bug in ls-refs.c that was not triggered before: we were using a strvec set to zero, which is not how you are supposed to initialize a strvec. We now call strvec_init after zeroing. Signed-off-by: Jacob Vosmaer <jacob@gitlab.com> --- ls-refs.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 62 insertions(+), 1 deletion(-)