diff mbox series

[1/1] ls-refs.c: minimize number of refs visited

Message ID 20210119144251.27924-2-jacob@gitlab.com (mailing list archive)
State New, archived
Headers show
Series ls-refs.c: minimize number of refs visited | expand

Commit Message

Jacob Vosmaer Jan. 19, 2021, 2:42 p.m. UTC
The previous implementation of ls-refs would perform exactly one ref
walk, matching each ref against the prefixes (if any) provided by the
user. This can be expensive if there are a lot of refs and the user
only cares about a small subset of them.

In this patch we analyze the prefixes provided by the user and build a
minimal set of disjoint prefixes that contains all of them. We then do
a ref walk for each of these minimal prefixes.

It is tempting to have just one strvec for the prefixes and use it
both for matching and for iterating. But every time I tried that, it
made things more complicated. I settled on leaving the existing ref
matching (using &data.prefixes) alone, and I added a new layer around
it for the ref walk optimization (using &iter_prefixes).

This commit also fixes a bug in ls-refs.c that was not triggered
before: we were using a strvec set to zero, which is not how you are
supposed to initialize a strvec. We now call strvec_init after zeroing.

Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
---
 ls-refs.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

Comments

Taylor Blau Jan. 19, 2021, 4:12 p.m. UTC | #1
Hi Jacob,

On Tue, Jan 19, 2021 at 03:42:51PM +0100, Jacob Vosmaer wrote:
> The previous implementation of ls-refs would perform exactly one ref
> walk, matching each ref against the prefixes (if any) provided by the
> user. This can be expensive if there are a lot of refs and the user
> only cares about a small subset of them.
>
> In this patch we analyze the prefixes provided by the user and build a
> minimal set of disjoint prefixes that contains all of them. We then do
> a ref walk for each of these minimal prefixes.

This reminds me of b31e2680c4 (ref-filter.c: find disjoint pattern
prefixes, 2019-06-26), where we solved a very similar problem for 'git
for-each-ref'. The difference here is that we are operating on a set of
prefixes, not a set of refs.

But, I think that we could get pretty far by treating the prefixes as
refs so that we can call ref-filter.c:find_longest_prefixes(). For its
purposes, it doesn't really care about whether or not the arguments
actually are references. It simply returns the longest common prefix
among all of its arguments (delimited by '/' characters).

> It is tempting to have just one strvec for the prefixes and use it
> both for matching and for iterating. But every time I tried that, it
> made things more complicated. I settled on leaving the existing ref
> matching (using &data.prefixes) alone, and I added a new layer around
> it for the ref walk optimization (using &iter_prefixes).

I think the implementation in b31e2680c4 deals with this nicely: it
takes a pointer to a strvec and dumps prefixes in there.

> This commit also fixes a bug in ls-refs.c that was not triggered
> before: we were using a strvec set to zero, which is not how you are
> supposed to initialize a strvec. We now call strvec_init after zeroing.

Good.

> Signed-off-by: Jacob Vosmaer <jacob@gitlab.com>
> ---
>  ls-refs.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/ls-refs.c b/ls-refs.c
> index a1e0b473e4..6d5f0c769a 100644
> --- a/ls-refs.c
> +++ b/ls-refs.c
> @@ -84,12 +84,44 @@ static int ls_refs_config(const char *var, const char *value, void *data)
>  	return parse_hide_refs_config(var, value, "uploadpack");
>  }
>
> +static int cmp_prefix(const void *a_, const void *b_){
> +	const char *a = *(const char **)a_;
> +	const char *b = *(const char **)b_;
> +	return strcmp(a, b);
> +}
> +
> +static void deduplicate_prefixes(struct strvec *prefixes) {
> +	int i;
> +
> +	QSORT(prefixes->v, prefixes->nr, cmp_prefix);
> +
> +	for (i = 1; i < prefixes->nr;) {
> +		const char *p = prefixes->v[i];
> +
> +		/*
> +		 * If p is "refs/foobar" and its predecessor is "refs/foo" then we should
> +		 * drop p, both to avoid sending duplicate refs to the user, and to avoid
> +		 * doing unnecessary work.
> +		 */
> +		if (starts_with(p, prefixes->v[i - 1])) {
> +			MOVE_ARRAY(&prefixes->v[i], &prefixes->v[i + 1], prefixes->nr - (i + 1));
> +			prefixes->v[prefixes->nr - 1] = p;
> +			strvec_pop(prefixes);
> +		} else {
> +			i++;
> +		}
> +	}
> +}
> +

Indeed, this and the below code are very reminiscent of b31e2680c4. So,
I wonder if it's possible to use the existing implementation rather than
implement what is roughly the same thing twice.

Below is a completely untested patch to try and reuse the code from
b31e2680c4. (It compiles, but that's the extent of my guarantees about
it ;-).) It's all smashed into one huge patch, so if you're happy with
the direction I'll take care of cleaning it up. The new function in
ref-filter.h really belongs in refs.h, but I left the implementation in
ref-filter.c to avoid creating more noise in the diff.

Let me know what you think.

Thanks,
Taylor

--- >8 ---

Subject: [PATCH] ls-refs: iterate longest common refs prefix

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 ls-refs.c    |  4 +++-
 ref-filter.c | 46 ++++++++++++++++++++++++++++++++--------------
 ref-filter.h | 10 ++++++++++
 3 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/ls-refs.c b/ls-refs.c
index a1e0b473e4..6a3e11d45c 100644
--- a/ls-refs.c
+++ b/ls-refs.c
@@ -6,6 +6,7 @@
 #include "ls-refs.h"
 #include "pkt-line.h"
 #include "config.h"
+#include "ref-filter.h"

 /*
  * Check if one of the prefixes is a prefix of the ref.
@@ -109,7 +110,8 @@ int ls_refs(struct repository *r, struct strvec *keys,
 		die(_("expected flush after ls-refs arguments"));

 	head_ref_namespaced(send_ref, &data);
-	for_each_namespaced_ref(send_ref, &data);
+	for_each_fullref_in_prefixes(get_git_namespace(), data.prefixes.v,
+				     send_ref, &data, 0);
 	packet_flush(1);
 	strvec_clear(&data.prefixes);
 	return 0;
diff --git a/ref-filter.c b/ref-filter.c
index aa260bfd09..c34bf34d06 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1987,6 +1987,36 @@ static void find_longest_prefixes(struct string_list *out,
 	strbuf_release(&prefix);
 }

+int for_each_fullref_in_prefixes(const char *namespace,
+				 const char **patterns,
+				 each_ref_fn cb,
+				 void *cb_data,
+				 int broken)
+{
+	struct string_list prefixes = STRING_LIST_INIT_DUP;
+	struct string_list_item *prefix;
+	struct strbuf buf = STRBUF_INIT;
+	int ret = 0, namespace_len;
+
+	find_longest_prefixes(&prefixes, patterns);
+
+	if (namespace)
+		strbuf_addstr(&buf, namespace);
+	namespace_len = buf.len;
+
+	for_each_string_list_item(prefix, &prefixes) {
+		strbuf_addf(&buf, prefix->string);
+		ret = for_each_fullref_in(buf.buf, cb, cb_data, broken);
+		if (ret)
+			break;
+		strbuf_setlen(&buf, namespace_len);
+	}
+
+	string_list_clear(&prefixes, 0);
+	strbuf_release(&buf);
+	return ret;
+}
+
 /*
  * This is the same as for_each_fullref_in(), but it tries to iterate
  * only over the patterns we'll care about. Note that it _doesn't_ do a full
@@ -1997,10 +2027,6 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
 				       void *cb_data,
 				       int broken)
 {
-	struct string_list prefixes = STRING_LIST_INIT_DUP;
-	struct string_list_item *prefix;
-	int ret;
-
 	if (!filter->match_as_path) {
 		/*
 		 * in this case, the patterns are applied after
@@ -2024,16 +2050,8 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
 		return for_each_fullref_in("", cb, cb_data, broken);
 	}

-	find_longest_prefixes(&prefixes, filter->name_patterns);
-
-	for_each_string_list_item(prefix, &prefixes) {
-		ret = for_each_fullref_in(prefix->string, cb, cb_data, broken);
-		if (ret)
-			break;
-	}
-
-	string_list_clear(&prefixes, 0);
-	return ret;
+	return for_each_fullref_in_prefixes(NULL, filter->name_patterns,
+					    cb, cb_data, broken);
 }

 /*
diff --git a/ref-filter.h b/ref-filter.h
index feaef4a8fd..f666a0fb49 100644
--- a/ref-filter.h
+++ b/ref-filter.h
@@ -146,4 +146,14 @@ struct ref_array_item *ref_array_push(struct ref_array *array,
 				      const char *refname,
 				      const struct object_id *oid);

+/**
+ * iterate all refs which descend from the longest common prefix among
+ * "patterns".
+ */
+int for_each_fullref_in_prefixes(const char *namespace,
+				 const char **patterns,
+				 each_ref_fn cb,
+				 void *cb_data,
+				 int broken);
+
 #endif /*  REF_FILTER_H  */
--
2.30.0.138.g6d7191ea01
Jacob Vosmaer Jan. 19, 2021, 5:42 p.m. UTC | #2
Hi Taylor,

Thanks for your reply. That sounds like a great idea!

On Tue, Jan 19, 2021 at 5:12 PM Taylor Blau <me@ttaylorr.com> wrote:
> But, I think that we could get pretty far by treating the prefixes as
> refs so that we can call ref-filter.c:find_longest_prefixes(). For its
> purposes, it doesn't really care about whether or not the arguments
> actually are references. It simply returns the longest common prefix
> among all of its arguments (delimited by '/' characters).

What does "delimited by /" mean?

Without really understanding the longest common prefix code in
ref-filter.c, my intuitive concern is that the specifics of glob
matching and special treatment of '/' may bite us. I suppose we'll be
fine because ls-refs has its own matching logic. So long as
for_each_fullref_in_prefixes yield enough prefixes, the end result
would remain the same.

The question is then, does for_each_fullref_in_prefixes yield
everything we need?

I think my approach would be to expose the new
for_each_fullref_in_prefixes iterator you propose through test-tool,
and unit test it so we can be sure it handles both contexts
(for-each-refs with globs and special '/', and ls-refs without any
special character behavior) correctly.

I may be overly cautious here, take this with a grain of salt because
I am not an experienced Git contributor. On that topic, apologies if
I'm botching my inline replies in this email.

Regarding your patch: it works correctly and as fast as expected for
my development "many refs" test case. Yay! It also segfaults and fails
some tests but see my comments below.

All in all: thanks, great idea, yes we should reuse, I only lack
confidence on correctness because I don't fully grasp your
longest-common-prefix algorithm yet.

Cheers, Jacob


> --- >8 ---
>
> Subject: [PATCH] ls-refs: iterate longest common refs prefix
>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  ls-refs.c    |  4 +++-
>  ref-filter.c | 46 ++++++++++++++++++++++++++++++++--------------
>  ref-filter.h | 10 ++++++++++
>  3 files changed, 45 insertions(+), 15 deletions(-)
>
> diff --git a/ls-refs.c b/ls-refs.c
> index a1e0b473e4..6a3e11d45c 100644
> --- a/ls-refs.c
> +++ b/ls-refs.c
> @@ -6,6 +6,7 @@
>  #include "ls-refs.h"
>  #include "pkt-line.h"
>  #include "config.h"
> +#include "ref-filter.h"
>
>  /*
>   * Check if one of the prefixes is a prefix of the ref.
> @@ -109,7 +110,8 @@ int ls_refs(struct repository *r, struct strvec *keys,
>                 die(_("expected flush after ls-refs arguments"));
>
>         head_ref_namespaced(send_ref, &data);
> -       for_each_namespaced_ref(send_ref, &data);
> +       for_each_fullref_in_prefixes(get_git_namespace(), data.prefixes.v,
> +                                    send_ref, &data, 0);

Remember to add that strvec_init(&data.prefixes) I talked about in my
commit message, or you get a segfault.

You also need:  if (!data.prefixes.nr) strvec_push(&data.prefixes, "refs/");

Without it, the "ls-refs without ref-prefix" scenario fails.

>         packet_flush(1);
>         strvec_clear(&data.prefixes);
>         return 0;
> diff --git a/ref-filter.c b/ref-filter.c
> index aa260bfd09..c34bf34d06 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -1987,6 +1987,36 @@ static void find_longest_prefixes(struct string_list *out,
>         strbuf_release(&prefix);
>  }
>
> +int for_each_fullref_in_prefixes(const char *namespace,
> +                                const char **patterns,
> +                                each_ref_fn cb,
> +                                void *cb_data,
> +                                int broken)
> +{
> +       struct string_list prefixes = STRING_LIST_INIT_DUP;
> +       struct string_list_item *prefix;
> +       struct strbuf buf = STRBUF_INIT;
> +       int ret = 0, namespace_len;
> +
> +       find_longest_prefixes(&prefixes, patterns);
> +
> +       if (namespace)
> +               strbuf_addstr(&buf, namespace);
> +       namespace_len = buf.len;
> +
> +       for_each_string_list_item(prefix, &prefixes) {
> +               strbuf_addf(&buf, prefix->string);

Missing format string "%s".


> +               ret = for_each_fullref_in(buf.buf, cb, cb_data, broken);
> +               if (ret)
> +                       break;
> +               strbuf_setlen(&buf, namespace_len);
> +       }
> +
> +       string_list_clear(&prefixes, 0);
> +       strbuf_release(&buf);
> +       return ret;
> +}
> +
>  /*
>   * This is the same as for_each_fullref_in(), but it tries to iterate
>   * only over the patterns we'll care about. Note that it _doesn't_ do a full
> @@ -1997,10 +2027,6 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
>                                        void *cb_data,
>                                        int broken)
>  {
> -       struct string_list prefixes = STRING_LIST_INIT_DUP;
> -       struct string_list_item *prefix;
> -       int ret;
> -
>         if (!filter->match_as_path) {
>                 /*
>                  * in this case, the patterns are applied after
> @@ -2024,16 +2050,8 @@ static int for_each_fullref_in_pattern(struct ref_filter *filter,
>                 return for_each_fullref_in("", cb, cb_data, broken);
>         }
>
> -       find_longest_prefixes(&prefixes, filter->name_patterns);
> -
> -       for_each_string_list_item(prefix, &prefixes) {
> -               ret = for_each_fullref_in(prefix->string, cb, cb_data, broken);
> -               if (ret)
> -                       break;
> -       }
> -
> -       string_list_clear(&prefixes, 0);
> -       return ret;
> +       return for_each_fullref_in_prefixes(NULL, filter->name_patterns,
> +                                           cb, cb_data, broken);
>  }
>
>  /*
> diff --git a/ref-filter.h b/ref-filter.h
> index feaef4a8fd..f666a0fb49 100644
> --- a/ref-filter.h
> +++ b/ref-filter.h
> @@ -146,4 +146,14 @@ struct ref_array_item *ref_array_push(struct ref_array *array,
>                                       const char *refname,
>                                       const struct object_id *oid);
>
> +/**
> + * iterate all refs which descend from the longest common prefix among
> + * "patterns".
> + */
> +int for_each_fullref_in_prefixes(const char *namespace,
> +                                const char **patterns,
> +                                each_ref_fn cb,
> +                                void *cb_data,
> +                                int broken);
> +
>  #endif /*  REF_FILTER_H  */
> --
> 2.30.0.138.g6d7191ea01
>
Taylor Blau Jan. 19, 2021, 7:09 p.m. UTC | #3
On Tue, Jan 19, 2021 at 06:42:56PM +0100, Jacob Vosmaer wrote:
> Hi Taylor,
>
> Thanks for your reply. That sounds like a great idea!
>
> On Tue, Jan 19, 2021 at 5:12 PM Taylor Blau <me@ttaylorr.com> wrote:
> > But, I think that we could get pretty far by treating the prefixes as
> > refs so that we can call ref-filter.c:find_longest_prefixes(). For its
> > purposes, it doesn't really care about whether or not the arguments
> > actually are references. It simply returns the longest common prefix
> > among all of its arguments (delimited by '/' characters).
>
> What does "delimited by /" mean?

Ah, I just meant that it looks for the longest common prefix where it
will only split at '/' characters. But, that's not right at all:
find_longest_prefixes_1() will happily split anywhere there is a
difference.

> Without really understanding the longest common prefix code in
> ref-filter.c, my intuitive concern is that the specifics of glob
> matching and special treatment of '/' may bite us. I suppose we'll be
> fine because ls-refs has its own matching logic. So long as
> for_each_fullref_in_prefixes yield enough prefixes, the end result
> would remain the same.

Right. We can ignore the concern about '/' (seeing my comment above),
and note that find_longest_prefixes_1() breaks on glob metacharacters,
so we'll only match or overmatch the desired set (and we'll never
undermatch).

I made sure to write in the second patch downthread that
ls-refs.c:send_ref() correctly handles receiving too many refs (and it
discards ones that it doesn't want).

> The question is then, does for_each_fullref_in_prefixes yield
> everything we need?

For the reasons above, yes: it will.

> I think my approach would be to expose the new
> for_each_fullref_in_prefixes iterator you propose through test-tool,
> and unit test it so we can be sure it handles both contexts
> (for-each-refs with globs and special '/', and ls-refs without any
> special character behavior) correctly.
>
> I may be overly cautious here, take this with a grain of salt because
> I am not an experienced Git contributor. On that topic, apologies if
> I'm botching my inline replies in this email.

I do appreciate your caution, but I'm not sure exposing a test-tool is
necessary, since we already test this behavior extensively in t6300 (and
now t5701, t5702 and t5704, too).

> Regarding your patch: it works correctly and as fast as expected for
> my development "many refs" test case. Yay! It also segfaults and fails
> some tests but see my comments below.
>
> All in all: thanks, great idea, yes we should reuse, I only lack
> confidence on correctness because I don't fully grasp your
> longest-common-prefix algorithm yet.

:-). Thanks for the pointers on the spots that I had missed (as I
mentioned, I only compiled it before sending, so having an additional
set of more careful eyes was quite helpful).


Thanks,
Taylor
Jeff King Jan. 19, 2021, 9:59 p.m. UTC | #4
On Tue, Jan 19, 2021 at 02:09:04PM -0500, Taylor Blau wrote:

> > What does "delimited by /" mean?
> 
> Ah, I just meant that it looks for the longest common prefix where it
> will only split at '/' characters. But, that's not right at all:
> find_longest_prefixes_1() will happily split anywhere there is a
> difference.

Right. We thought in early revisions of the ref-filter work that we
might have to split on path components, but it turns out that the
underlying ref code is happy to take arbitrary prefixes. And it's that
code which defines our strategy; Even if the ls-refs code wanted to
allow only full path components, it should be using the limiting from
for_each_ref_in() only as an optimization, and applying its own
filtering to the output.

> > Without really understanding the longest common prefix code in
> > ref-filter.c, my intuitive concern is that the specifics of glob
> > matching and special treatment of '/' may bite us. I suppose we'll be
> > fine because ls-refs has its own matching logic. So long as
> > for_each_fullref_in_prefixes yield enough prefixes, the end result
> > would remain the same.
> 
> Right. We can ignore the concern about '/' (seeing my comment above),
> and note that find_longest_prefixes_1() breaks on glob metacharacters,
> so we'll only match or overmatch the desired set (and we'll never
> undermatch).
> 
> I made sure to write in the second patch downthread that
> ls-refs.c:send_ref() correctly handles receiving too many refs (and it
> discards ones that it doesn't want).

Yeah, I think the glob-handling is OK for that reason. We'd have a
smaller prefix, which means seeing more refs. So it's never a
correctness issue, but only a potential optimization one (we consider
more refs in ls-refs.c than we might otherwise). But I think even that s
impossible, because the glob characters are not valid in refnames. So we
would never see one at all in a string which is meant to be a pure
prefix.

> > I think my approach would be to expose the new
> > for_each_fullref_in_prefixes iterator you propose through test-tool,
> > and unit test it so we can be sure it handles both contexts
> > (for-each-refs with globs and special '/', and ls-refs without any
> > special character behavior) correctly.
> >
> > I may be overly cautious here, take this with a grain of salt because
> > I am not an experienced Git contributor. On that topic, apologies if
> > I'm botching my inline replies in this email.
> 
> I do appreciate your caution, but I'm not sure exposing a test-tool is
> necessary, since we already test this behavior extensively in t6300 (and
> now t5701, t5702 and t5704, too).

Agreed. test-tool is fine when we have no way to easily feed data to a
unit. But in this case, the prefix code is easy to test via the
for-each-ref plumbing. We'd want to make sure the new setting in ls-refs
is covered, too, but I agree that t5701 covers that well (regular
fetches make sure we don't send too few refs, but those ones check that
we limited the refs in the expected way).

-Peff
Jeff King Jan. 19, 2021, 10:15 p.m. UTC | #5
On Tue, Jan 19, 2021 at 04:59:18PM -0500, Jeff King wrote:

> > > What does "delimited by /" mean?
> > 
> > Ah, I just meant that it looks for the longest common prefix where it
> > will only split at '/' characters. But, that's not right at all:
> > find_longest_prefixes_1() will happily split anywhere there is a
> > difference.
> 
> Right. We thought in early revisions of the ref-filter work that we
> might have to split on path components, but it turns out that the
> underlying ref code is happy to take arbitrary prefixes. And it's that
> code which defines our strategy; Even if the ls-refs code wanted to
> allow only full path components, it should be using the limiting from
> for_each_ref_in() only as an optimization, and applying its own
> filtering to the output.

Having now looked carefully at the ls-refs code, it's a pure
prefix-match, too. So I think we _could_ rely on for_each_fullref_in()
returning us the correct full results, and not checking it further in
send_ref(). But I kind of like keeping it there as an extra check (and
one which could in theory grow more logic later).

-Peff
Taylor Blau Jan. 19, 2021, 10:23 p.m. UTC | #6
On Tue, Jan 19, 2021 at 05:15:30PM -0500, Jeff King wrote:
> On Tue, Jan 19, 2021 at 04:59:18PM -0500, Jeff King wrote:
>
> > > > What does "delimited by /" mean?
> > >
> > > Ah, I just meant that it looks for the longest common prefix where it
> > > will only split at '/' characters. But, that's not right at all:
> > > find_longest_prefixes_1() will happily split anywhere there is a
> > > difference.
> >
> > Right. We thought in early revisions of the ref-filter work that we
> > might have to split on path components, but it turns out that the
> > underlying ref code is happy to take arbitrary prefixes. And it's that
> > code which defines our strategy; Even if the ls-refs code wanted to
> > allow only full path components, it should be using the limiting from
> > for_each_ref_in() only as an optimization, and applying its own
> > filtering to the output.
>
> Having now looked carefully at the ls-refs code, it's a pure
> prefix-match, too. So I think we _could_ rely on for_each_fullref_in()
> returning us the correct full results, and not checking it further in
> send_ref(). But I kind of like keeping it there as an extra check (and
> one which could in theory grow more logic later).

Hmm. What if the caller asks for:

  ref-prefix refs/tags/a
  ref-prefix refs/tags/b

?

The LCP between those two is refs/tags, so send_ref() will presumably
get lots of reuslts that it doesn't care about (assuming there are tags
besides a and b).

Thanks,
Taylor
Jeff King Jan. 19, 2021, 10:52 p.m. UTC | #7
On Tue, Jan 19, 2021 at 05:23:36PM -0500, Taylor Blau wrote:

> > Having now looked carefully at the ls-refs code, it's a pure
> > prefix-match, too. So I think we _could_ rely on for_each_fullref_in()
> > returning us the correct full results, and not checking it further in
> > send_ref(). But I kind of like keeping it there as an extra check (and
> > one which could in theory grow more logic later).
> 
> Hmm. What if the caller asks for:
> 
>   ref-prefix refs/tags/a
>   ref-prefix refs/tags/b
> 
> ?
> 
> The LCP between those two is refs/tags, so send_ref() will presumably
> get lots of reuslts that it doesn't care about (assuming there are tags
> besides a and b).

Oh, you're right, of course. Ignore me. :)

-Peff
Jeff King Jan. 19, 2021, 10:53 p.m. UTC | #8
On Tue, Jan 19, 2021 at 03:42:51PM +0100, Jacob Vosmaer wrote:

> The previous implementation of ls-refs would perform exactly one ref
> walk, matching each ref against the prefixes (if any) provided by the
> user. This can be expensive if there are a lot of refs and the user
> only cares about a small subset of them.
> 
> In this patch we analyze the prefixes provided by the user and build a
> minimal set of disjoint prefixes that contains all of them. We then do
> a ref walk for each of these minimal prefixes.

Thanks for posting this. I have a vague recollection that we considered
this either when we did the for-each-ref prefixes, or when we added
ls-refs prefixes, but I can't seem to find either. At any rate, at
GitHub we haven't generally found it to be a problem because our
horrifically-large repos tend to be aggregated alternates repos, not the
ones people serve upload-pack out of (though I did just time it, and
some of our largest repos should save a few hundred milliseconds per
advertisement, which is certainly not nothing).

I do think we should reuse the code from ref-filter, as Taylor showed.

> This commit also fixes a bug in ls-refs.c that was not triggered
> before: we were using a strvec set to zero, which is not how you are
> supposed to initialize a strvec. We now call strvec_init after zeroing.

Good catch. It didn't matter until now because nobody relied on having a
NULL entry when no prefix had been added (instead, they always iterated
over prefixes->nr). IMHO that is worth fixing as a separate commit.

-Peff
Jeff King Jan. 19, 2021, 10:59 p.m. UTC | #9
On Tue, Jan 19, 2021 at 05:52:04PM -0500, Jeff King wrote:

> On Tue, Jan 19, 2021 at 05:23:36PM -0500, Taylor Blau wrote:
> 
> > > Having now looked carefully at the ls-refs code, it's a pure
> > > prefix-match, too. So I think we _could_ rely on for_each_fullref_in()
> > > returning us the correct full results, and not checking it further in
> > > send_ref(). But I kind of like keeping it there as an extra check (and
> > > one which could in theory grow more logic later).
> > 
> > Hmm. What if the caller asks for:
> > 
> >   ref-prefix refs/tags/a
> >   ref-prefix refs/tags/b
> > 
> > ?
> > 
> > The LCP between those two is refs/tags, so send_ref() will presumably
> > get lots of reuslts that it doesn't care about (assuming there are tags
> > besides a and b).
> 
> Oh, you're right, of course. Ignore me. :)

Actually, I am not sure that we would look for "refs/tags/" in that case
(I did a quick test and we do not seem to). Which makes sense, as it is
cheaper to find the "a" and "b" hierarchies separately if there is a
very big "refs/tags/c" hierarchy.

But I agree that this is a good reason that callers should consider it
as an optimization which could return more results than expected.

-Peff
Taylor Blau Jan. 19, 2021, 11 p.m. UTC | #10
On Tue, Jan 19, 2021 at 05:53:56PM -0500, Jeff King wrote:
> Thanks for posting this. I have a vague recollection that we considered
> this either when we did the for-each-ref prefixes, or when we added
> ls-refs prefixes, but I can't seem to find either. At any rate, at
> GitHub we haven't generally found it to be a problem because our
> horrifically-large repos tend to be aggregated alternates repos, not the
> ones people serve upload-pack out of (though I did just time it, and
> some of our largest repos should save a few hundred milliseconds per
> advertisement, which is certainly not nothing).

Great on all counts!

> > This commit also fixes a bug in ls-refs.c that was not triggered
> > before: we were using a strvec set to zero, which is not how you are
> > supposed to initialize a strvec. We now call strvec_init after zeroing.
>
> Good catch. It didn't matter until now because nobody relied on having a
> NULL entry when no prefix had been added (instead, they always iterated
> over prefixes->nr). IMHO that is worth fixing as a separate commit.

Yeah. Even after calling it out as such myself, I promptly forgot it
when preparing the first patch I sent back to Jacob!

I didn't pull it out into its own patch, and rather folded it in to my
2/2. It has a small call-out of its own, but if you'd prefer it by
itself, I'm happy to resubmit it with that change included.

> -Peff

Thanks,
Taylor
Taylor Blau Jan. 19, 2021, 11:02 p.m. UTC | #11
On Tue, Jan 19, 2021 at 05:59:34PM -0500, Jeff King wrote:
> Actually, I am not sure that we would look for "refs/tags/" in that case
> (I did a quick test and we do not seem to). Which makes sense, as it is
> cheaper to find the "a" and "b" hierarchies separately if there is a
> very big "refs/tags/c" hierarchy.

Ah, makes sense. Thanks for double checking.

> But I agree that this is a good reason that callers should consider it
> as an optimization which could return more results than expected.

Yep. Even though I couldn't quite remember when the algorithm would
split without looking more closely, I made sure to document that it
iterates *all* references that are descendent of the LCP of its
arguments.

> -Peff

Thanks,
Taylor
Jeff King Jan. 19, 2021, 11:11 p.m. UTC | #12
On Tue, Jan 19, 2021 at 06:00:19PM -0500, Taylor Blau wrote:

> > Good catch. It didn't matter until now because nobody relied on having a
> > NULL entry when no prefix had been added (instead, they always iterated
> > over prefixes->nr). IMHO that is worth fixing as a separate commit.
> 
> Yeah. Even after calling it out as such myself, I promptly forgot it
> when preparing the first patch I sent back to Jacob!
> 
> I didn't pull it out into its own patch, and rather folded it in to my
> 2/2. It has a small call-out of its own, but if you'd prefer it by
> itself, I'm happy to resubmit it with that change included.

It's not a _huge_ deal to me, but I think it is slightly nicer as a
separate patch. Plus it can easily be credited to Jacob, so at least he
gets some authorship credit out of this. :)

-Peff
Jacob Vosmaer Jan. 20, 2021, 10:40 a.m. UTC | #13
On Wed, Jan 20, 2021 at 12:11 AM Jeff King <peff@peff.net> wrote:
> It's not a _huge_ deal to me, but I think it is slightly nicer as a
> separate patch. Plus it can easily be credited to Jacob, so at least he
> gets some authorship credit out of this. :)

Thanks, I'm fine either way. Just remember that the other changes in
ls-refs.c depend on that strvec_init, so if we split it out, we need
to remember / maintain that (order) dependency.

> Having now looked carefully at the ls-refs code, it's a pure
> prefix-match, too. So I think we _could_ rely on for_each_fullref_in()
> returning us the correct full results, and not checking it further in
> send_ref().

I also think we could. But as I alluded to in my original commit
message, I don't like how complicated that gets. I find it easier to
convince myself in the current form that the longest prefix code
selects _enough_ prefixes, which is a weaker property than "selects
exactly the right prefixes".

Jacob
Jacob Vosmaer Jan. 20, 2021, 10:44 a.m. UTC | #14
On Wed, Jan 20, 2021 at 11:40 AM Jacob Vosmaer <jacob@gitlab.com> wrote:
> I also think we could. But as I alluded to in my original commit
> message, I don't like how complicated that gets. I find it easier to
> convince myself in the current form that the longest prefix code
> selects _enough_ prefixes, which is a weaker property than "selects
> exactly the right prefixes".

By which I mean to say I think we should keep the current form, with
the final authoritative ref matching pass.

Jacob
diff mbox series

Patch

diff --git a/ls-refs.c b/ls-refs.c
index a1e0b473e4..6d5f0c769a 100644
--- a/ls-refs.c
+++ b/ls-refs.c
@@ -84,12 +84,44 @@  static int ls_refs_config(const char *var, const char *value, void *data)
 	return parse_hide_refs_config(var, value, "uploadpack");
 }
 
+static int cmp_prefix(const void *a_, const void *b_){
+	const char *a = *(const char **)a_;
+	const char *b = *(const char **)b_;
+	return strcmp(a, b);
+}
+
+static void deduplicate_prefixes(struct strvec *prefixes) {
+	int i;
+
+	QSORT(prefixes->v, prefixes->nr, cmp_prefix);
+
+	for (i = 1; i < prefixes->nr;) {
+		const char *p = prefixes->v[i];
+
+		/*
+		 * If p is "refs/foobar" and its predecessor is "refs/foo" then we should
+		 * drop p, both to avoid sending duplicate refs to the user, and to avoid
+		 * doing unnecessary work.
+		 */
+		if (starts_with(p, prefixes->v[i - 1])) {
+			MOVE_ARRAY(&prefixes->v[i], &prefixes->v[i + 1], prefixes->nr - (i + 1));
+			prefixes->v[prefixes->nr - 1] = p;
+			strvec_pop(prefixes);
+		} else {
+			i++;
+		}
+	}
+}
+
 int ls_refs(struct repository *r, struct strvec *keys,
 	    struct packet_reader *request)
 {
 	struct ls_refs_data data;
+	struct strvec iter_prefixes = STRVEC_INIT;
+	const char **p;
 
 	memset(&data, 0, sizeof(data));
+	strvec_init(&data.prefixes);
 
 	git_config(ls_refs_config, NULL);
 
@@ -109,8 +141,37 @@  int ls_refs(struct repository *r, struct strvec *keys,
 		die(_("expected flush after ls-refs arguments"));
 
 	head_ref_namespaced(send_ref, &data);
-	for_each_namespaced_ref(send_ref, &data);
+
+	/*
+	 * Try to create a minimal disjoint set of prefixes that
+	 * contains everything the user wants to see, but that lets us
+	 * avoid a full ref walk. If the repo has a lot of refs and
+	 * the user only wants to see refs/heads and refs/tags,
+	 * it is faster to do two small walks of refs/heads and
+	 * refs/tags instead of one big walk of all of refs/.
+	 */
+	for (p = data.prefixes.v; *p; p++) {
+		if (starts_with(*p, "refs/")) {
+			strvec_push(&iter_prefixes, *p);
+		} else { /* full ref walk in pathological cases like "" */
+			strvec_push(&iter_prefixes, "refs/");
+		}
+	}
+
+	if (!iter_prefixes.nr) /* full ref walk when there are no prefixes */
+		strvec_push(&iter_prefixes, "refs/");
+
+	deduplicate_prefixes(&iter_prefixes);
+
+	for (p = iter_prefixes.v; *p; p++) {
+		struct strbuf buf = STRBUF_INIT;
+		strbuf_addf(&buf, "%s%s", get_git_namespace(), *p);
+		for_each_fullref_in(buf.buf, send_ref, &data, 0);
+		strbuf_release(&buf);
+	}
+
 	packet_flush(1);
 	strvec_clear(&data.prefixes);
+	strvec_clear(&iter_prefixes);
 	return 0;
 }