[2/2] ls-refs.c: traverse longest common ref prefix

Message ID	fb8681d12864d724108c524834f9498d91e270e6.1611080326.git.me@ttaylorr.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Tue, 19 Jan 2021 13:19:17 -0500 From: Taylor Blau <me@ttaylorr.com> To: git@vger.kernel.org Cc: jacob@gitlab.com, peff@peff.net Subject: [PATCH 2/2] ls-refs.c: traverse longest common ref prefix Message-ID: <fb8681d12864d724108c524834f9498d91e270e6.1611080326.git.me@ttaylorr.com> References: <CADMWQoPREhirr+RJPkJJV2U+8VG=DFotvTBCDSXFhn-3pn2X-A@mail.gmail.com> <cover.1611080326.git.me@ttaylorr.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <cover.1611080326.git.me@ttaylorr.com> Precedence: bulk
Series	ls-refs: only traverse through longest common ref prefix \| expand [0/2] ls-refs: only traverse through longest common ref prefix [1/2] refs: expose 'for_each_fullref_in_prefixes' [2/2] ls-refs.c: traverse longest common ref prefix

Message ID

fb8681d12864d724108c524834f9498d91e270e6.1611080326.git.me@ttaylorr.com (mailing list archive)

State

New, archived

Headers

Date: Tue, 19 Jan 2021 13:19:17 -0500
From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: jacob@gitlab.com, peff@peff.net
Subject: [PATCH 2/2] ls-refs.c: traverse longest common ref prefix
Message-ID: 
 <fb8681d12864d724108c524834f9498d91e270e6.1611080326.git.me@ttaylorr.com>
References: 
 <CADMWQoPREhirr+RJPkJJV2U+8VG=DFotvTBCDSXFhn-3pn2X-A@mail.gmail.com>
 <cover.1611080326.git.me@ttaylorr.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <cover.1611080326.git.me@ttaylorr.com>
Precedence: bulk

Series

ls-refs: only traverse through longest common ref prefix | expand

Commit Message

Taylor Blau Jan. 19, 2021, 6:19 p.m. UTC

ls-refs performs a single revision walk over the whole ref namespace,
and sends ones that match with one of the given ref prefixes down to the
user.

This can be expensive if there are many refs overall, but the portion of
them covered by the given prefixes is small by comparison.

To attempt to reduce the difference between the number of refs
traversed, and the number of refs sent, only traverse references which
are in the longest common prefix of the given prefixes. This is very
reminiscent of the approach taken in b31e2680c4 (ref-filter.c: find
disjoint pattern prefixes, 2019-06-26) which does an analogous thing for
multi-patterned 'git for-each-ref' invocations.

The only difference here is that we are operating on ref prefixes, which
do not necessarily point to a single reference. That is just fine, since
all we care about is finding the longest common prefix among prefixes
which can be thought of as refspecs for our purposes here.

Similarly, for_each_fullref_in_prefixes may return more results than the
caller asked for (since the longest common prefix might match something
that a longer prefix in the same set wouldn't match) but
ls-refs.c:send_ref() discards such results.

The code introduced in b31e2680c4 is resilient to stop early (and
return a shorter prefix) when it encounters a metacharacter (as
mentioned in that patch, there is some opportunity to improve this, but
nobody has done it).

There are two remaining small items in this patch:

  - If no prefixes were provided, then implicitly add the empty string
    (which will match all references).

  - Since we are manually munging the prefixes, make sure that we
    initialize it ourselves (previously this wasn't necessary since the
    first strvec_push would do so).

Original-patch-by: Jacob Vosmaer <jacob@gitlab.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 ls-refs.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Jeff King Jan. 19, 2021, 11:09 p.m. UTC | #1

On Tue, Jan 19, 2021 at 01:19:17PM -0500, Taylor Blau wrote:

> To attempt to reduce the difference between the number of refs
> traversed, and the number of refs sent, only traverse references which
> are in the longest common prefix of the given prefixes. This is very
> reminiscent of the approach taken in b31e2680c4 (ref-filter.c: find
> disjoint pattern prefixes, 2019-06-26) which does an analogous thing for
> multi-patterned 'git for-each-ref' invocations.
> 
> The only difference here is that we are operating on ref prefixes, which
> do not necessarily point to a single reference. That is just fine, since
> all we care about is finding the longest common prefix among prefixes
> which can be thought of as refspecs for our purposes here.

This second paragraph confused me. Aren't the inputs to for-each-ref
also prefixes?

I guess they require an explicit '*', but fundamentally it's the same
concept (and certainly they are not just single references).

> Similarly, for_each_fullref_in_prefixes may return more results than the
> caller asked for (since the longest common prefix might match something
> that a longer prefix in the same set wouldn't match) but
> ls-refs.c:send_ref() discards such results.

Based on my other poking, I'm not entirely sure that we can return too
many results. But I do think it's worth keeping the caller more careful.

> The code introduced in b31e2680c4 is resilient to stop early (and
> return a shorter prefix) when it encounters a metacharacter (as
> mentioned in that patch, there is some opportunity to improve this, but
> nobody has done it).

I agree that is how b31e2680c4 works, but we don't care about that
behavior here, since we have strict prefixes. Isn't the argument we need
to make the other way? I.e., that stopping early on a metacharacter will
not hurt us. Because at worst we return too many results (hey, there's a
case!) and because we would not expect metacharacters in the prefixes
anyway, as they are illegal in refnames.

> There are two remaining small items in this patch:
> 
>   - If no prefixes were provided, then implicitly add the empty string
>     (which will match all references).

I wonder if for_each_fullref_in_prefixes() should do that, since that is
exactly what the other caller does, too. OTOH, maybe it is better to
make the callers be more explicit. In which case should it maybe BUG()
on an empty set of prefixes, rather than letting the caller assume some
behavior?

>   - Since we are manually munging the prefixes, make sure that we
>     initialize it ourselves (previously this wasn't necessary since the
>     first strvec_push would do so).

I think this was an existing bug (that we were just lucky it was
impossible to trigger, because nobody looked for the NULL sentinel), and
would make more sense as a separate fix.

-Peff

Taylor Blau Jan. 19, 2021, 11:52 p.m. UTC | #2

On Tue, Jan 19, 2021 at 06:09:42PM -0500, Jeff King wrote:
> On Tue, Jan 19, 2021 at 01:19:17PM -0500, Taylor Blau wrote:
>
> > To attempt to reduce the difference between the number of refs
> > traversed, and the number of refs sent, only traverse references which
> > are in the longest common prefix of the given prefixes. This is very
> > reminiscent of the approach taken in b31e2680c4 (ref-filter.c: find
> > disjoint pattern prefixes, 2019-06-26) which does an analogous thing for
> > multi-patterned 'git for-each-ref' invocations.
> >
> > The only difference here is that we are operating on ref prefixes, which
> > do not necessarily point to a single reference. That is just fine, since
> > all we care about is finding the longest common prefix among prefixes
> > which can be thought of as refspecs for our purposes here.
>
> This second paragraph confused me. Aren't the inputs to for-each-ref
> also prefixes?
>
> I guess they require an explicit '*', but fundamentally it's the same
> concept (and certainly they are not just single references).

Yeah, that is the point that I was trying to make. But re-reading this
patch after knowing that it confused you, I think the clearest way to
make that point is to drop that second paragraph entirely.

> > Similarly, for_each_fullref_in_prefixes may return more results than the
> > caller asked for (since the longest common prefix might match something
> > that a longer prefix in the same set wouldn't match) but
> > ls-refs.c:send_ref() discards such results.
>
> Based on my other poking, I'm not entirely sure that we can return too
> many results. But I do think it's worth keeping the caller more careful.

It can return more results, but I don't think that my writing in
b31e2680c4 is particularly clear. Here's an example, though. Say I ask
for `git for-each-refs 'refs/tags/a/*' 'refs/tags/a/b/c'`. The LCP of
that is definitely "refs/tags/a", which might traverse other stuff like
"refs/tags/a/b/d", which wouldn't get matched by either.

> > The code introduced in b31e2680c4 is resilient to stop early (and
> > return a shorter prefix) when it encounters a metacharacter (as
> > mentioned in that patch, there is some opportunity to improve this, but
> > nobody has done it).
>
> I agree that is how b31e2680c4 works, but we don't care about that
> behavior here, since we have strict prefixes. Isn't the argument we need
> to make the other way? I.e., that stopping early on a metacharacter will
> not hurt us. Because at worst we return too many results (hey, there's a
> case!) and because we would not expect metacharacters in the prefixes
> anyway, as they are illegal in refnames.

Yeah, thinking on it more I agree that's the argument we should be
making here. I updated the patch to reflect it.

> > There are two remaining small items in this patch:
> >
> >   - If no prefixes were provided, then implicitly add the empty string
> >     (which will match all references).
>
> I wonder if for_each_fullref_in_prefixes() should do that, since that is
> exactly what the other caller does, too. OTOH, maybe it is better to
> make the callers be more explicit. In which case should it maybe BUG()
> on an empty set of prefixes, rather than letting the caller assume some
> behavior?

Hmm. I don't feel strongly either way, but I think that the BUG is
probably the most sensible option.

> >   - Since we are manually munging the prefixes, make sure that we
> >     initialize it ourselves (previously this wasn't necessary since the
> >     first strvec_push would do so).
>
> I think this was an existing bug (that we were just lucky it was
> impossible to trigger, because nobody looked for the NULL sentinel), and
> would make more sense as a separate fix.

Right. I'll make sure to pull this one out into a separate patch and
credit Jacob with authorship.

> -Peff

Thanks,
Taylor

Jeff King Jan. 20, 2021, 12:08 a.m. UTC | #3

On Tue, Jan 19, 2021 at 06:52:31PM -0500, Taylor Blau wrote:

> > I guess they require an explicit '*', but fundamentally it's the same
> > concept (and certainly they are not just single references).
> 
> Yeah, that is the point that I was trying to make. But re-reading this
> patch after knowing that it confused you, I think the clearest way to
> make that point is to drop that second paragraph entirely.

Sounds good.

> > Based on my other poking, I'm not entirely sure that we can return too
> > many results. But I do think it's worth keeping the caller more careful.
> 
> It can return more results, but I don't think that my writing in
> b31e2680c4 is particularly clear. Here's an example, though. Say I ask
> for `git for-each-refs 'refs/tags/a/*' 'refs/tags/a/b/c'`. The LCP of
> that is definitely "refs/tags/a", which might traverse other stuff like
> "refs/tags/a/b/d", which wouldn't get matched by either.

I thought that would be matched by refs/tags/a/*, but it looks like
for-each-ref treats "*" as matching only a single path component. So
really just:

  git for-each-ref refs/tags/*

requires extra filtering already. But AFAICT none of that is true for
ls-refs, which is strictly prefix matching already.

-Peff

Jacob Vosmaer Jan. 20, 2021, 11 a.m. UTC | #4

On Tue, Jan 19, 2021 at 7:19 PM Taylor Blau <me@ttaylorr.com> wrote:
>
> ls-refs performs a single revision walk over the whole ref namespace,
> and sends ones that match with one of the given ref prefixes down to the
> user.
>
> This can be expensive if there are many refs overall, but the portion of
> them covered by the given prefixes is small by comparison.
>
> To attempt to reduce the difference between the number of refs
> traversed, and the number of refs sent, only traverse references which
> are in the longest common prefix of the given prefixes. This is very
> reminiscent of the approach taken in b31e2680c4 (ref-filter.c: find
> disjoint pattern prefixes, 2019-06-26) which does an analogous thing for
> multi-patterned 'git for-each-ref' invocations.
>
> The only difference here is that we are operating on ref prefixes, which
> do not necessarily point to a single reference. That is just fine, since
> all we care about is finding the longest common prefix among prefixes
> which can be thought of as refspecs for our purposes here.
>
> Similarly, for_each_fullref_in_prefixes may return more results than the
> caller asked for (since the longest common prefix might match something
> that a longer prefix in the same set wouldn't match) but
> ls-refs.c:send_ref() discards such results.
>
> The code introduced in b31e2680c4 is resilient to stop early (and
> return a shorter prefix) when it encounters a metacharacter (as
> mentioned in that patch, there is some opportunity to improve this, but
> nobody has done it).
>
> There are two remaining small items in this patch:
>
>   - If no prefixes were provided, then implicitly add the empty string
>     (which will match all references).
>
>   - Since we are manually munging the prefixes, make sure that we
>     initialize it ourselves (previously this wasn't necessary since the
>     first strvec_push would do so).
>
> Original-patch-by: Jacob Vosmaer <jacob@gitlab.com>
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>  ls-refs.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/ls-refs.c b/ls-refs.c
> index a1e0b473e4..eaaa36d0df 100644
> --- a/ls-refs.c
> +++ b/ls-refs.c
> @@ -90,6 +90,7 @@ int ls_refs(struct repository *r, struct strvec *keys,
>         struct ls_refs_data data;
>
>         memset(&data, 0, sizeof(data));
> +       strvec_init(&data.prefixes);
>
>         git_config(ls_refs_config, NULL);
>
> @@ -109,7 +110,10 @@ int ls_refs(struct repository *r, struct strvec *keys,
>                 die(_("expected flush after ls-refs arguments"));
>
>         head_ref_namespaced(send_ref, &data);
> -       for_each_namespaced_ref(send_ref, &data);
> +       if (!data.prefixes.nr)
> +               strvec_push(&data.prefixes, "");

The old code, with for_each_namespaced_ref, would walk
"${NAMESPACE}refs/". The new code would walk "${NAMESPACE}" because
we're pushing "" onto data.prefixes. So if there is anything in the
namespace that does not start with "refs/" it will get yielded.

Does that matter?

> +       for_each_fullref_in_prefixes(get_git_namespace(), data.prefixes.v,
> +                                    send_ref, &data, 0);
>         packet_flush(1);
>         strvec_clear(&data.prefixes);
>         return 0;
> --
> 2.30.0.138.g6d7191ea01

diff --git a/ls-refs.c b/ls-refs.c
index a1e0b473e4..eaaa36d0df 100644
--- a/ls-refs.c
+++ b/ls-refs.c
@@ -90,6 +90,7 @@  int ls_refs(struct repository *r, struct strvec *keys,
 	struct ls_refs_data data;
 
 	memset(&data, 0, sizeof(data));
+	strvec_init(&data.prefixes);
 
 	git_config(ls_refs_config, NULL);
 
@@ -109,7 +110,10 @@  int ls_refs(struct repository *r, struct strvec *keys,
 		die(_("expected flush after ls-refs arguments"));
 
 	head_ref_namespaced(send_ref, &data);
-	for_each_namespaced_ref(send_ref, &data);
+	if (!data.prefixes.nr)
+		strvec_push(&data.prefixes, "");
+	for_each_fullref_in_prefixes(get_git_namespace(), data.prefixes.v,
+				     send_ref, &data, 0);
 	packet_flush(1);
 	strvec_clear(&data.prefixes);
 	return 0;

[2/2] ls-refs.c: traverse longest common ref prefix

Commit Message

Comments

Patch