[1/2] ref-filter: hacky "streaming" mode

Message ID	YTNpeH+jO0zQgAVT@coredump.intra.peff.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Sat, 4 Sep 2021 08:41:28 -0400 From: Jeff King <peff@peff.net> To: git@vger.kernel.org Cc: ZheNing Hu <adlternative@gmail.com> Subject: [PATCH 1/2] ref-filter: hacky "streaming" mode Message-ID: <YTNpeH+jO0zQgAVT@coredump.intra.peff.net> References: <YTNpQ7Od1U/5i0R7@coredump.intra.peff.net> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <YTNpQ7Od1U/5i0R7@coredump.intra.peff.net> Precedence: bulk
Series	speeding up trivial for-each-ref invocations \| expand [hacky,0/2] speeding up trivial for-each-ref invocations [1/2] ref-filter: hacky "streaming" mode [2/2] ref-filter: implement "quick" formats

Jeff King Sept. 4, 2021, 12:41 p.m. UTC

The ref-filter code is very keen to collect all of the refs in an array
before writing any output. This makes things slower than necessary when
using the default sort order (which is already sorted by refname when we
call for_each_ref()), and when no filtering options require it.

This commit adds a mildly-ugly interface to detect this case and stream
directly from filter_refs(). The caller just needs to call the
"maybe_stream" function. Either way, they can call the usual sort/print
functions; they'll just be noops if we did stream instead of collecting.

Here are some timings on a fully-packed 500k-ref repo:

  Benchmark #1: ./git.orig for-each-ref --format='%(objectname) %(refname)'
    Time (mean ± σ):     340.2 ms ±   5.3 ms    [User: 300.5 ms, System: 39.6 ms]
    Range (min … max):   332.9 ms … 347.0 ms    10 runs

  Benchmark #2: ./git.stream for-each-ref --format='%(objectname) %(refname)'
    Time (mean ± σ):     224.0 ms ±   3.4 ms    [User: 222.7 ms, System: 1.3 ms]
    Range (min … max):   218.1 ms … 228.5 ms    13 runs

  Summary
    './git.stream for-each-ref --format='%(objectname) %(refname)'' ran
      1.52 ± 0.03 times faster than './git.orig for-each-ref --format='%(objectname) %(refname)''

So we definitely shave off some time, though we're still _much_ slower
than a simple `wc -l <packed-refs` (which is around 10ms, though of
course it isn't actually doing as much work).

The point here is to reduce overall effort, though of course the time to
first output is much improved in the streaming version, too (about 250ms
versus 4ms).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/for-each-ref.c |  7 ++++++
 ref-filter.c           | 50 ++++++++++++++++++++++++++++++++++--------
 ref-filter.h           |  8 +++++++
 3 files changed, 56 insertions(+), 9 deletions(-)

ZheNing Hu Sept. 5, 2021, 8:20 a.m. UTC | #1

Jeff King <peff@peff.net> 于2021年9月4日周六 下午8:41写道：
>
> The ref-filter code is very keen to collect all of the refs in an array
> before writing any output. This makes things slower than necessary when
> using the default sort order (which is already sorted by refname when we
> call for_each_ref()), and when no filtering options require it.
>
> This commit adds a mildly-ugly interface to detect this case and stream
> directly from filter_refs(). The caller just needs to call the
> "maybe_stream" function. Either way, they can call the usual sort/print
> functions; they'll just be noops if we did stream instead of collecting.
>
> Here are some timings on a fully-packed 500k-ref repo:
>
>   Benchmark #1: ./git.orig for-each-ref --format='%(objectname) %(refname)'
>     Time (mean ± σ):     340.2 ms ±   5.3 ms    [User: 300.5 ms, System: 39.6 ms]
>     Range (min … max):   332.9 ms … 347.0 ms    10 runs
>
>   Benchmark #2: ./git.stream for-each-ref --format='%(objectname) %(refname)'
>     Time (mean ± σ):     224.0 ms ±   3.4 ms    [User: 222.7 ms, System: 1.3 ms]
>     Range (min … max):   218.1 ms … 228.5 ms    13 runs
>
>   Summary
>     './git.stream for-each-ref --format='%(objectname) %(refname)'' ran
>       1.52 ± 0.03 times faster than './git.orig for-each-ref --format='%(objectname) %(refname)''
>
> So we definitely shave off some time, though we're still _much_ slower
> than a simple `wc -l <packed-refs` (which is around 10ms, though of
> course it isn't actually doing as much work).
>
> The point here is to reduce overall effort, though of course the time to
> first output is much improved in the streaming version, too (about 250ms
> versus 4ms).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
>  builtin/for-each-ref.c |  7 ++++++
>  ref-filter.c           | 50 ++++++++++++++++++++++++++++++++++--------
>  ref-filter.h           |  8 +++++++
>  3 files changed, 56 insertions(+), 9 deletions(-)
>
> diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
> index 89cb6307d4..fe0b92443f 100644
> --- a/builtin/for-each-ref.c
> +++ b/builtin/for-each-ref.c
> @@ -70,6 +70,13 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
>         if (verify_ref_format(&format))
>                 usage_with_options(for_each_ref_usage, opts);
>
> +       /*
> +        * try streaming, but only without maxcount; in theory the ref-filter
> +        * code could learn about maxcount, but for now it is our own thing
> +        */
> +       if (!maxcount)
> +               ref_filter_maybe_stream(&filter, sorting, icase, &format);
> +

Yes, I think this maxcount is easy to support.

>         if (!sorting)
>                 sorting = ref_default_sorting();
>         ref_sorting_set_sort_flags_all(sorting, REF_SORTING_ICASE, icase);
> diff --git a/ref-filter.c b/ref-filter.c
> index 93ce2a6ef2..17b78b1d30 100644
> --- a/ref-filter.c
> +++ b/ref-filter.c
> @@ -2283,15 +2283,19 @@ static int ref_filter_handler(const char *refname, const struct object_id *oid,
>                         return 0;
>         }
>
> -       /*
> -        * We do not open the object yet; sort may only need refname
> -        * to do its job and the resulting list may yet to be pruned
> -        * by maxcount logic.
> -        */
> -       ref = ref_array_push(ref_cbdata->array, refname, oid);
> -       ref->commit = commit;
> -       ref->flag = flag;
> -       ref->kind = kind;
> +       if (ref_cbdata->filter->streaming_format) {
> +               pretty_print_ref(refname, oid, ref_cbdata->filter->streaming_format);

So we directly use pretty_print_ref() in streaming mode, OK.

> +       } else {
> +               /*
> +                * We do not open the object yet; sort may only need refname
> +                * to do its job and the resulting list may yet to be pruned
> +                * by maxcount logic.
> +                */
> +               ref = ref_array_push(ref_cbdata->array, refname, oid);
> +               ref->commit = commit;
> +               ref->flag = flag;
> +               ref->kind = kind;
> +       }
>
>         return 0;
>  }

Therefore, in streaming mode, there is no need to push ref to
ref_array, which can
reduce the overhead of malloc(), free(), which makes sense.

But here is a terrible fact: we did not use ref_array_sort() for sorting here.
So in fact, for_each_fullref_in() does the sorting work () for us by
default sort (%(refname)),
This may be due to the file system or some implementation of ref_iterator.
But this limit the application of this optimization when we use other
atoms to sort.

> @@ -2563,6 +2567,34 @@ void ref_array_sort(struct ref_sorting *sorting, struct ref_array *array)
>         QSORT_S(array->items, array->nr, compare_refs, sorting);
>  }
>
> +void ref_filter_maybe_stream(struct ref_filter *filter,
> +                            const struct ref_sorting *sort, int icase,
> +                            struct ref_format *format)
> +{
> +       /* streaming only works with default for_each_ref sort order */
> +       if (sort || icase)
> +               return;
> +

Yes, this really can only be optimized on the default sort.

> +       /* these filters want to see all candidates up front */
> +       if (filter->reachable_from || filter->unreachable_from)
> +               return;
> +

Make Sence.

> +       /*
> +        * the %(symref) placeholder is broken with pretty_print_ref(),
> +        * which our streaming code uses. I suspect this is a sign of breakage
> +        * in other callers like verify_tag(), which should be fixed. But for
> +        * now just disable streaming.
> +        *
> +        * Note that this implies we've parsed the format already with
> +        * verify_ref_format().
> +        */
> +       if (need_symref)
> +               return;
> +

I haven’t taken a closer look at why pretty_print_ref() does not
support %(symref),
we can skip it first.

> +       /* OK to stream */
> +       filter->streaming_format = format;
> +}
> +
>  static void append_literal(const char *cp, const char *ep, struct ref_formatting_state *state)
>  {
>         struct strbuf *s = &state->stack->output;
> diff --git a/ref-filter.h b/ref-filter.h
> index c15dee8d6b..ecea1837a2 100644
> --- a/ref-filter.h
> +++ b/ref-filter.h
> @@ -69,6 +69,9 @@ struct ref_filter {
>                 lines;
>         int abbrev,
>                 verbose;
> +
> +       /* if non-NULL, streaming output during filter_refs() is enabled */
> +       struct ref_format *streaming_format;
>  };
>
>  struct ref_format {
> @@ -135,6 +138,11 @@ char *get_head_description(void);
>  /*  Set up translated strings in the output. */
>  void setup_ref_filter_porcelain_msg(void);
>
> +/* enable streaming during filter_refs() if options allow it */
> +void ref_filter_maybe_stream(struct ref_filter *filter,
> +                            const struct ref_sorting *sort, int icase,
> +                            struct ref_format *format);
> +
>  /*
>   * Print a single ref, outside of any ref-filter. Note that the
>   * name must be a fully qualified refname.
> --
> 2.33.0.618.g5b11852304
>

Unfortunately, this optimization may not be helpful for git cat-file --batch,
see [1], batch_object_write() directly constructs a ref_array_item and call
format_ref_array_item() to grab data, It does not use ref_array. So it also
does not have this malloc() | free() overhead.

[1]: https://lore.kernel.org/git/9c5fddf6885875ccd3ce3f047bb938c77d9bbca2.1628842990.git.gitgitgadget@gmail.com/

--
ZheNing Hu

Jeff King Sept. 5, 2021, 1:04 p.m. UTC | #2

On Sun, Sep 05, 2021 at 04:20:02PM +0800, ZheNing Hu wrote:

> But here is a terrible fact: we did not use ref_array_sort() for sorting here.
> So in fact, for_each_fullref_in() does the sorting work () for us by
> default sort (%(refname)),
> This may be due to the file system or some implementation of ref_iterator.
> But this limit the application of this optimization when we use other
> atoms to sort.

Right. This technique does not help at all if you use --sort. I do think
it's reasonable to rely on the sorted output of the for_each_ref()
functions; other parts of Git likely do as well, so I think we'd try
pretty hard to maintain that (and no, it's not a filesystem thing; we do
make sure to sort it ourselves; see the calls to sort_ref_dir()).

> > +       /*
> > +        * the %(symref) placeholder is broken with pretty_print_ref(),
> > +        * which our streaming code uses. I suspect this is a sign of breakage
> > +        * in other callers like verify_tag(), which should be fixed. But for
> > +        * now just disable streaming.
> > +        *
> > +        * Note that this implies we've parsed the format already with
> > +        * verify_ref_format().
> > +        */
> > +       if (need_symref)
> > +               return;
> > +
> 
> I haven’t taken a closer look at why pretty_print_ref() does not
> support %(symref),
> we can skip it first.

I think it's just because pretty_print_ref() does not take a "flag"
parameter for the caller. So it never sees that REF_ISSYMREF is set.

There aren't many callers of that function, so I think nobody ever
really noticed. But you can see the bug like this:

  git init repo
  cd repo

  git commit --allow-empty -m foo
  git tag -s -m bar annotated &&
  git symbolic-ref refs/tags/symref refs/tags/annotated

  # this is ok; ref-filter handles the flags
  git tag --list --format='list: %(refname) %(symref)'

  # not ok; we do not tell the formatter about the flags, so it does
  # not notice that "symref" is a symref
  git tag --verify --format='verify: %(refname) %(symref)' annotated symref

I notice that the --verify output also shows the short refname, which
makes me wonder if %(symref) would have other bugs there (because it
has to re-resolve the ref to come up with the symref destination).

> Unfortunately, this optimization may not be helpful for git cat-file --batch,
> see [1], batch_object_write() directly constructs a ref_array_item and call
> format_ref_array_item() to grab data, It does not use ref_array. So it also
> does not have this malloc() | free() overhead.

Right. It would only be helped by the "quick" formats in the second
patch (and by skipping the malloc/free of the individual
ref_array_items).

-Peff

Jeff King Sept. 5, 2021, 1:15 p.m. UTC | #3

On Sun, Sep 05, 2021 at 04:20:02PM +0800, ZheNing Hu wrote:

> > +       if (ref_cbdata->filter->streaming_format) {
> > +               pretty_print_ref(refname, oid, ref_cbdata->filter->streaming_format);
> 
> So we directly use pretty_print_ref() in streaming mode, OK.
> 
> > +       } else {
> > +               /*
> > +                * We do not open the object yet; sort may only need refname
> > +                * to do its job and the resulting list may yet to be pruned
> > +                * by maxcount logic.
> > +                */
> > +               ref = ref_array_push(ref_cbdata->array, refname, oid);
> > +               ref->commit = commit;
> > +               ref->flag = flag;
> > +               ref->kind = kind;
> > +       }
> >
> >         return 0;
> >  }
> 
> Therefore, in streaming mode, there is no need to push ref to
> ref_array, which can
> reduce the overhead of malloc(), free(), which makes sense.

By the way, one thing I wondered here: how much of the benefit is from
avoiding the ref_array, and how much is from skipping the sort entirely.

It turns out that most of it is from the latter. If I do this:

diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
index 89cb6307d4..037d5db814 100644
--- a/builtin/for-each-ref.c
+++ b/builtin/for-each-ref.c
@@ -78,7 +78,11 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 	filter.name_patterns = argv;
 	filter.match_as_path = 1;
 	filter_refs(&array, &filter, FILTER_REFS_ALL | FILTER_REFS_INCLUDE_BROKEN);
-	ref_array_sort(sorting, &array);
+	/*
+	 * we should skip this only when we are using the default refname
+	 * sorting, but as an experimental hack, we'll just comment it out.
+	 */
+	// ref_array_sort(sorting, &array);
 
 	if (!maxcount || array.nr < maxcount)
 		maxcount = array.nr;

then the timings I get are:

  Benchmark #1: ./git.old for-each-ref --format='%(objectname) %(refname)'
    Time (mean ± σ):     341.4 ms ±   7.4 ms    [User: 299.8 ms, System: 41.6 ms]
    Range (min … max):   333.5 ms … 355.1 ms    10 runs
   
  Benchmark #2: ./git.new for-each-ref --format='%(objectname) %(refname)'
    Time (mean ± σ):     249.1 ms ±   5.7 ms    [User: 211.8 ms, System: 37.2 ms]
    Range (min … max):   245.9 ms … 267.0 ms    12 runs
   
  Summary
    './git.new for-each-ref --format='%(objectname) %(refname)'' ran
      1.37 ± 0.04 times faster than './git.old for-each-ref --format='%(objectname) %(refname)''

So of the 1.5x improvement that the original patch showed, 1.37x is from
skipping the sort of the already-sorted data. I suspect that has less to
do with sorting at all, and more to do with the fact that even just
formatting "%(refname)" for each entry takes a non-trivial amount of
time.

-Peff

ZheNing Hu Sept. 7, 2021, 5:28 a.m. UTC | #4

Jeff King <peff@peff.net> 于2021年9月5日周日 下午9:04写道：
>
> On Sun, Sep 05, 2021 at 04:20:02PM +0800, ZheNing Hu wrote:
>
> > But here is a terrible fact: we did not use ref_array_sort() for sorting here.
> > So in fact, for_each_fullref_in() does the sorting work () for us by
> > default sort (%(refname)),
> > This may be due to the file system or some implementation of ref_iterator.
> > But this limit the application of this optimization when we use other
> > atoms to sort.
>
> Right. This technique does not help at all if you use --sort. I do think
> it's reasonable to rely on the sorted output of the for_each_ref()
> functions; other parts of Git likely do as well, so I think we'd try
> pretty hard to maintain that (and no, it's not a filesystem thing; we do
> make sure to sort it ourselves; see the calls to sort_ref_dir()).
>
> > > +       /*
> > > +        * the %(symref) placeholder is broken with pretty_print_ref(),
> > > +        * which our streaming code uses. I suspect this is a sign of breakage
> > > +        * in other callers like verify_tag(), which should be fixed. But for
> > > +        * now just disable streaming.
> > > +        *
> > > +        * Note that this implies we've parsed the format already with
> > > +        * verify_ref_format().
> > > +        */
> > > +       if (need_symref)
> > > +               return;
> > > +
> >
> > I haven’t taken a closer look at why pretty_print_ref() does not
> > support %(symref),
> > we can skip it first.
>
> I think it's just because pretty_print_ref() does not take a "flag"
> parameter for the caller. So it never sees that REF_ISSYMREF is set.
>

yeah, pretty_print_ref() does not set the flag, this is a defect of
pretty_print_ref(), maybe we need to find a way to set this flag.
I also deliberately avoided %(symref) when refactoring git cat-file --batch,
perhaps the improvements here can also be applied to git cat-file --batch

> There aren't many callers of that function, so I think nobody ever
> really noticed. But you can see the bug like this:
>
>   git init repo
>   cd repo
>
>   git commit --allow-empty -m foo
>   git tag -s -m bar annotated &&
>   git symbolic-ref refs/tags/symref refs/tags/annotated
>
>   # this is ok; ref-filter handles the flags
>   git tag --list --format='list: %(refname) %(symref)'
>
>   # not ok; we do not tell the formatter about the flags, so it does
>   # not notice that "symref" is a symref
>   git tag --verify --format='verify: %(refname) %(symref)' annotated symref
>
> I notice that the --verify output also shows the short refname, which
> makes me wonder if %(symref) would have other bugs there (because it
> has to re-resolve the ref to come up with the symref destination).
>

This may be easy to fix:

diff --git a/builtin/tag.c b/builtin/tag.c
index 452558ec95..4be5d36366 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -152,11 +152,11 @@ static int verify_tag(const char *name, const char *ref,
        if (format->format)
                flags = GPG_VERIFY_OMIT_STATUS;

-       if (gpg_verify_tag(oid, name, flags))
+       if (gpg_verify_tag(oid, ref, flags))
                return -1;

        if (format->format)
-               pretty_print_ref(name, oid, format);
+               pretty_print_ref(ref, oid, format);

        return 0;
 }

> > Unfortunately, this optimization may not be helpful for git cat-file --batch,
> > see [1], batch_object_write() directly constructs a ref_array_item and call
> > format_ref_array_item() to grab data, It does not use ref_array. So it also
> > does not have this malloc() | free() overhead.
>
> Right. It would only be helped by the "quick" formats in the second
> patch (and by skipping the malloc/free of the individual
> ref_array_items).
>

Yes, your test is indeed worth thinking about: how to avoid
intermediate data to reduce overhead.

> -Peff

Thanks.
--
ZheNing Hu

ZheNing Hu Sept. 7, 2021, 5:42 a.m. UTC | #5

Jeff King <peff@peff.net> 于2021年9月5日周日 下午9:15写道：
>
> On Sun, Sep 05, 2021 at 04:20:02PM +0800, ZheNing Hu wrote:
>
> > > +       if (ref_cbdata->filter->streaming_format) {
> > > +               pretty_print_ref(refname, oid, ref_cbdata->filter->streaming_format);
> >
> > So we directly use pretty_print_ref() in streaming mode, OK.
> >
> > > +       } else {
> > > +               /*
> > > +                * We do not open the object yet; sort may only need refname
> > > +                * to do its job and the resulting list may yet to be pruned
> > > +                * by maxcount logic.
> > > +                */
> > > +               ref = ref_array_push(ref_cbdata->array, refname, oid);
> > > +               ref->commit = commit;
> > > +               ref->flag = flag;
> > > +               ref->kind = kind;
> > > +       }
> > >
> > >         return 0;
> > >  }
> >
> > Therefore, in streaming mode, there is no need to push ref to
> > ref_array, which can
> > reduce the overhead of malloc(), free(), which makes sense.
>
> By the way, one thing I wondered here: how much of the benefit is from
> avoiding the ref_array, and how much is from skipping the sort entirely.
>
> It turns out that most of it is from the latter. If I do this:
>
> diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
> index 89cb6307d4..037d5db814 100644
> --- a/builtin/for-each-ref.c
> +++ b/builtin/for-each-ref.c
> @@ -78,7 +78,11 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
>         filter.name_patterns = argv;
>         filter.match_as_path = 1;
>         filter_refs(&array, &filter, FILTER_REFS_ALL | FILTER_REFS_INCLUDE_BROKEN);
> -       ref_array_sort(sorting, &array);
> +       /*
> +        * we should skip this only when we are using the default refname
> +        * sorting, but as an experimental hack, we'll just comment it out.
> +        */
> +       // ref_array_sort(sorting, &array);
>
>         if (!maxcount || array.nr < maxcount)
>                 maxcount = array.nr;
>
> then the timings I get are:
>
>   Benchmark #1: ./git.old for-each-ref --format='%(objectname) %(refname)'
>     Time (mean ± σ):     341.4 ms ±   7.4 ms    [User: 299.8 ms, System: 41.6 ms]
>     Range (min … max):   333.5 ms … 355.1 ms    10 runs
>
>   Benchmark #2: ./git.new for-each-ref --format='%(objectname) %(refname)'
>     Time (mean ± σ):     249.1 ms ±   5.7 ms    [User: 211.8 ms, System: 37.2 ms]
>     Range (min … max):   245.9 ms … 267.0 ms    12 runs
>
>   Summary
>     './git.new for-each-ref --format='%(objectname) %(refname)'' ran
>       1.37 ± 0.04 times faster than './git.old for-each-ref --format='%(objectname) %(refname)''
>
> So of the 1.5x improvement that the original patch showed, 1.37x is from
> skipping the sort of the already-sorted data. I suspect that has less to
> do with sorting at all, and more to do with the fact that even just
> formatting "%(refname)" for each entry takes a non-trivial amount of
> time.
>

Yes, I think this overhead may come from get_ref_atom_value() instead
of QSORT_S().

> -Peff

Thanks.
--
ZheNing

Jeff King Sept. 7, 2021, 6:01 p.m. UTC | #6

On Tue, Sep 07, 2021 at 01:28:33PM +0800, ZheNing Hu wrote:

> > I think it's just because pretty_print_ref() does not take a "flag"
> > parameter for the caller. So it never sees that REF_ISSYMREF is set.
> >
> 
> yeah, pretty_print_ref() does not set the flag, this is a defect of
> pretty_print_ref(), maybe we need to find a way to set this flag.

In general, I think it could take a flag parameter from its caller. But
its caller comes indirectly from for_each_tag_name(), which isn't a
regular ref-iterator. It would probably need to switch to using
read_ref_full() to get the flags.

> > I notice that the --verify output also shows the short refname, which
> > makes me wonder if %(symref) would have other bugs there (because it
> > has to re-resolve the ref to come up with the symref destination).
> >
> 
> This may be easy to fix:
> 
> diff --git a/builtin/tag.c b/builtin/tag.c
> index 452558ec95..4be5d36366 100644
> --- a/builtin/tag.c
> +++ b/builtin/tag.c
> @@ -152,11 +152,11 @@ static int verify_tag(const char *name, const char *ref,
>         if (format->format)
>                 flags = GPG_VERIFY_OMIT_STATUS;
> 
> -       if (gpg_verify_tag(oid, name, flags))
> +       if (gpg_verify_tag(oid, ref, flags))
>                 return -1;
> 
>         if (format->format)
> -               pretty_print_ref(name, oid, format);
> +               pretty_print_ref(ref, oid, format);
> 
>         return 0;
>  }

Yeah, I think that would work, although:

  - there's another caller in cmd_verify_tag() which seems to just pass
    whatever was on the command line. It doesn't even resolve the ref
    itself!

  - I suspect people may be relying on the current behavior. The
    original was added to be able to compare the internal tagname to the
    refname. I.e., that:

      git tag -v --format='%(refname) %(tag)' foo

    would show "foo foo". Now that _should_ be "%(refname:strip=2)", I
    think, but we'd probably be breaking scripts. OTOH, it really feels
    like _not_ handing over a real, fully-qualified refname to the
    ref-filter code will mean other things are broken (e.g.,
    ATOM_UPSTREAM is assuming we have a fully-qualified ref).

    I think a backwards-compatible way of fixing it would be to have
    this call hand over the full refname to the ref-filter code, but
    tell it that %(refname) should default to strip=2. And then anybody
    who wants the full name can use %(refname:strip=0).

    It makes the behavior confusing and quirky, but we can't avoid that
    without breaking compatibility.

-Peff

ZheNing Hu Sept. 9, 2021, 2:45 p.m. UTC | #7

Jeff King <peff@peff.net> 于2021年9月8日周三 上午2:01写道：
>
> On Tue, Sep 07, 2021 at 01:28:33PM +0800, ZheNing Hu wrote:
>
> > > I think it's just because pretty_print_ref() does not take a "flag"
> > > parameter for the caller. So it never sees that REF_ISSYMREF is set.
> > >
> >
> > yeah, pretty_print_ref() does not set the flag, this is a defect of
> > pretty_print_ref(), maybe we need to find a way to set this flag.
>
> In general, I think it could take a flag parameter from its caller. But
> its caller comes indirectly from for_each_tag_name(), which isn't a
> regular ref-iterator. It would probably need to switch to using
> read_ref_full() to get the flags.
>

Yeah, I think using read_ref_full() with verify_tag() changes can
solve this BUG:

$ git.fix tag --verify --format='verify: %(refname) %(symref)' annotated symref
verify: refs/tags/annotated
verify: refs/tags/symref refs/tags/annotated

diff --git a/ref-filter.c b/ref-filter.c
index 1efa3aadc8..71b1d7da4f 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -2613,18 +2613,6 @@ void ref_filter_maybe_stream(struct ref_filter *filter,
        if (filter->reachable_from || filter->unreachable_from)
                return;

-       /*
-        * the %(symref) placeholder is broken with pretty_print_ref(),
-        * which our streaming code uses. I suspect this is a sign of breakage
-        * in other callers like verify_tag(), which should be fixed. But for
-        * now just disable streaming.
-        *
-        * Note that this implies we've parsed the format already with
-        * verify_ref_format().
-        */
-       if (need_symref)
-               return;
-
        /* OK to stream */
        filter->streaming_format = format;
 }
@@ -2735,6 +2723,7 @@ void pretty_print_ref(const char *name, const
struct object_id *oid,

        ref_item = new_ref_array_item(name, oid);
        ref_item->kind = ref_kind_from_refname(name);
+       read_ref_full(name, 0, NULL, &ref_item->flag);
        if (format_ref_array_item(ref_item, format, &output, &err))
                die("%s", err.buf);
        fwrite(output.buf, 1, output.len, stdout);

> > > I notice that the --verify output also shows the short refname, which
> > > makes me wonder if %(symref) would have other bugs there (because it
> > > has to re-resolve the ref to come up with the symref destination).
> > >
> >
> > This may be easy to fix:
> >
> > diff --git a/builtin/tag.c b/builtin/tag.c
> > index 452558ec95..4be5d36366 100644
> > --- a/builtin/tag.c
> > +++ b/builtin/tag.c
> > @@ -152,11 +152,11 @@ static int verify_tag(const char *name, const char *ref,
> >         if (format->format)
> >                 flags = GPG_VERIFY_OMIT_STATUS;
> >
> > -       if (gpg_verify_tag(oid, name, flags))
> > +       if (gpg_verify_tag(oid, ref, flags))
> >                 return -1;
> >
> >         if (format->format)
> > -               pretty_print_ref(name, oid, format);
> > +               pretty_print_ref(ref, oid, format);
> >
> >         return 0;
> >  }
>
> Yeah, I think that would work, although:
>
>   - there's another caller in cmd_verify_tag() which seems to just pass
>     whatever was on the command line. It doesn't even resolve the ref
>     itself!
>

We can modify get_oid() --> read_ref_full() in cmd_verify_tag()...
Yes, the inconsistency between cmd_verify_tag() and verify_tag() makes
me feel very strange.

>   - I suspect people may be relying on the current behavior. The
>     original was added to be able to compare the internal tagname to the
>     refname. I.e., that:
>
>       git tag -v --format='%(refname) %(tag)' foo
>
>     would show "foo foo". Now that _should_ be "%(refname:strip=2)", I
>     think, but we'd probably be breaking scripts. OTOH, it really feels
>     like _not_ handing over a real, fully-qualified refname to the
>     ref-filter code will mean other things are broken (e.g.,
>     ATOM_UPSTREAM is assuming we have a fully-qualified ref).
>

This is indeed a sad thing: A bug becomes a feature.

>     I think a backwards-compatible way of fixing it would be to have
>     this call hand over the full refname to the ref-filter code, but
>     tell it that %(refname) should default to strip=2. And then anybody
>     who wants the full name can use %(refname:strip=0).
>

Doesn't this make things more complicated? Those callers of git for-each-ref,
wouldn't our changes like this destroy them?


>     It makes the behavior confusing and quirky, but we can't avoid that
>     without breaking compatibility.
>

Eh, I think we may need other solutions.

> -Peff

Thanks.
--
ZheNing Hu

Jeff King Sept. 10, 2021, 2:26 p.m. UTC | #8

On Thu, Sep 09, 2021 at 10:45:15PM +0800, ZheNing Hu wrote:

> @@ -2735,6 +2723,7 @@ void pretty_print_ref(const char *name, const
> struct object_id *oid,
> 
>         ref_item = new_ref_array_item(name, oid);
>         ref_item->kind = ref_kind_from_refname(name);
> +       read_ref_full(name, 0, NULL, &ref_item->flag);
>         if (format_ref_array_item(ref_item, format, &output, &err))
>                 die("%s", err.buf);
>         fwrite(output.buf, 1, output.len, stdout);

IMHO this is the wrong place to do it, since the caller may already have
the flags (and looking up the ref again is a non-trivial cost).

The caller in builtin/tag.c should switch to using read_ref_full() and
pass in the flags.

The caller in builtin/verify-tag.c _probably_ should resolve the ref in
the same way and pass in that full refname and flags. I do worry that
this may be a compatibility problem, but the current behavior seems so
broken to me.

> >   - I suspect people may be relying on the current behavior. The
> >     original was added to be able to compare the internal tagname to the
> >     refname. I.e., that:
> >
> >       git tag -v --format='%(refname) %(tag)' foo
> >
> >     would show "foo foo". Now that _should_ be "%(refname:strip=2)", I
> >     think, but we'd probably be breaking scripts. OTOH, it really feels
> >     like _not_ handing over a real, fully-qualified refname to the
> >     ref-filter code will mean other things are broken (e.g.,
> >     ATOM_UPSTREAM is assuming we have a fully-qualified ref).
> >
> 
> This is indeed a sad thing: A bug becomes a feature.
> 
> >     I think a backwards-compatible way of fixing it would be to have
> >     this call hand over the full refname to the ref-filter code, but
> >     tell it that %(refname) should default to strip=2. And then anybody
> >     who wants the full name can use %(refname:strip=0).
> >
> 
> Doesn't this make things more complicated? Those callers of git for-each-ref,
> wouldn't our changes like this destroy them?

My proposal was that we'd have a specific flag in ref-filter to say
"default refname:strip to this value". And then _only_ "tag --verify"
would set that (to "2"), leaving for-each-ref, etc unaffected.

So yes, it's complicated. And it must be explained to the user that
"%(refname)" behaves slightly differently with "git tag --verify", but
that is unavoidable if we do not want to break scripts (it _already_
behaves slightly differently, and we just never told anyone).

The other option is to declare the current behavior a bug and fix it. I
am quite tempted by that route, given the inconsistency with other
formatters, including even "git tag --list --format=%(refname)"!

-Peff

ZheNing Hu Sept. 15, 2021, 12:27 p.m. UTC | #9

Jeff King <peff@peff.net> 于2021年9月10日周五 下午10:26写道：
>
> On Thu, Sep 09, 2021 at 10:45:15PM +0800, ZheNing Hu wrote:
>
> > @@ -2735,6 +2723,7 @@ void pretty_print_ref(const char *name, const
> > struct object_id *oid,
> >
> >         ref_item = new_ref_array_item(name, oid);
> >         ref_item->kind = ref_kind_from_refname(name);
> > +       read_ref_full(name, 0, NULL, &ref_item->flag);
> >         if (format_ref_array_item(ref_item, format, &output, &err))
> >                 die("%s", err.buf);
> >         fwrite(output.buf, 1, output.len, stdout);
>
> IMHO this is the wrong place to do it, since the caller may already have
> the flags (and looking up the ref again is a non-trivial cost).
>

Well, but not doing this means that we have to pass the flag from the
pretty_print_ref() call stack.

> The caller in builtin/tag.c should switch to using read_ref_full() and
> pass in the flags.
>

Agree.

> The caller in builtin/verify-tag.c _probably_ should resolve the ref in
> the same way and pass in that full refname and flags. I do worry that
> this may be a compatibility problem, but the current behavior seems so
> broken to me.
>

Yeah.

> > >   - I suspect people may be relying on the current behavior. The
> > >     original was added to be able to compare the internal tagname to the
> > >     refname. I.e., that:
> > >
> > >       git tag -v --format='%(refname) %(tag)' foo
> > >
> > >     would show "foo foo". Now that _should_ be "%(refname:strip=2)", I
> > >     think, but we'd probably be breaking scripts. OTOH, it really feels
> > >     like _not_ handing over a real, fully-qualified refname to the
> > >     ref-filter code will mean other things are broken (e.g.,
> > >     ATOM_UPSTREAM is assuming we have a fully-qualified ref).
> > >
> >
> > This is indeed a sad thing: A bug becomes a feature.
> >
> > >     I think a backwards-compatible way of fixing it would be to have
> > >     this call hand over the full refname to the ref-filter code, but
> > >     tell it that %(refname) should default to strip=2. And then anybody
> > >     who wants the full name can use %(refname:strip=0).
> > >
> >
> > Doesn't this make things more complicated? Those callers of git for-each-ref,
> > wouldn't our changes like this destroy them?
>
> My proposal was that we'd have a specific flag in ref-filter to say
> "default refname:strip to this value". And then _only_ "tag --verify"
> would set that (to "2"), leaving for-each-ref, etc unaffected.
>

Indeed this may be a feasible solution. I will try to do this first.

> So yes, it's complicated. And it must be explained to the user that
> "%(refname)" behaves slightly differently with "git tag --verify", but
> that is unavoidable if we do not want to break scripts (it _already_
> behaves slightly differently, and we just never told anyone).
>
> The other option is to declare the current behavior a bug and fix it. I
> am quite tempted by that route, given the inconsistency with other
> formatters, including even "git tag --list --format=%(refname)"!

I don't know, I think both fix methods are okay.

>
> -Peff

Thanks.
--
ZheNing Hu

ZheNing Hu Sept. 15, 2021, 2:23 p.m. UTC | #10

ZheNing Hu <adlternative@gmail.com> 于2021年9月15日周三 下午8:27写道：
>
> > So yes, it's complicated. And it must be explained to the user that
> > "%(refname)" behaves slightly differently with "git tag --verify", but
> > that is unavoidable if we do not want to break scripts (it _already_
> > behaves slightly differently, and we just never told anyone).
> >

$ git tag --verify --format='verify: %(refname) %(symref)' annotated symref
verify: annotated
verify: symref
$ git tag --verify --format='verify: %(refname) %(symref)'
refs/tags/annotated refs/tags/symref
error: tag 'refs/tags/annotated' not found.
error: tag 'refs/tags/symref' not found.

$ git verify-tag --format='verify: %(refname) %(symref)' annotated
symref
verify: annotated
verify: symref
$ git verify-tag --format='verify: %(refname) %(symref)'
refs/tags/annotated refs/tags/symref
verify: refs/tags/annotated
verify: refs/tags/symref

As we can see, there is a slight difference between git tag --verify and
git verify-tag: git tag --verify can not handle refs' fullname refs/tags/*
(because read_ref_full() | read_ref() can't handle them). So, as a standard,
which characteristics should we keep?

> > The other option is to declare the current behavior a bug and fix it. I
> > am quite tempted by that route, given the inconsistency with other
> > formatters, including even "git tag --list --format=%(refname)"!
>
> I don't know, I think both fix methods are okay.
>
> >
> > -Peff
>
> Thanks.
> --
> ZheNing Hu

Jeff King Sept. 16, 2021, 9:31 p.m. UTC | #11

On Wed, Sep 15, 2021 at 08:27:15PM +0800, ZheNing Hu wrote:

> > > @@ -2735,6 +2723,7 @@ void pretty_print_ref(const char *name, const
> > > struct object_id *oid,
> > >
> > >         ref_item = new_ref_array_item(name, oid);
> > >         ref_item->kind = ref_kind_from_refname(name);
> > > +       read_ref_full(name, 0, NULL, &ref_item->flag);
> > >         if (format_ref_array_item(ref_item, format, &output, &err))
> > >                 die("%s", err.buf);
> > >         fwrite(output.buf, 1, output.len, stdout);
> >
> > IMHO this is the wrong place to do it, since the caller may already have
> > the flags (and looking up the ref again is a non-trivial cost).
> >
> 
> Well, but not doing this means that we have to pass the flag from the
> pretty_print_ref() call stack.

Right. But there are only two callers of that function, so I don't think
it's a big imposition to require it.

> > My proposal was that we'd have a specific flag in ref-filter to say
> > "default refname:strip to this value". And then _only_ "tag --verify"
> > would set that (to "2"), leaving for-each-ref, etc unaffected.
> >
> 
> Indeed this may be a feasible solution. I will try to do this first.

Great, thanks for taking a look at it.

-Peff

Jeff King Sept. 16, 2021, 9:45 p.m. UTC | #12

On Wed, Sep 15, 2021 at 10:23:43PM +0800, ZheNing Hu wrote:

> ZheNing Hu <adlternative@gmail.com> 于2021年9月15日周三 下午8:27写道：
> >
> > > So yes, it's complicated. And it must be explained to the user that
> > > "%(refname)" behaves slightly differently with "git tag --verify", but
> > > that is unavoidable if we do not want to break scripts (it _already_
> > > behaves slightly differently, and we just never told anyone).
> > >
> 
> $ git tag --verify --format='verify: %(refname) %(symref)' annotated symref
> verify: annotated
> verify: symref
> $ git tag --verify --format='verify: %(refname) %(symref)'
> refs/tags/annotated refs/tags/symref
> error: tag 'refs/tags/annotated' not found.
> error: tag 'refs/tags/symref' not found.

This is expected. When you provide a tag name on the command line of
"git tag" it is assumed to be a non-qualified name in refs/tags/ (and
ditto for git-branch and refs/heads/). It is tempting to try to be
friendly and accept fully-qualified refs there, but it would create
ambiguities (e.g., you could really have refs/tags/refs/tags/foo as a
ref).

I think we can ignore that for our purposes here, though. It's a
question of input from the command-line, and we focus on just the output
that we produce.

> $ git verify-tag --format='verify: %(refname) %(symref)' annotated
> symref
> verify: annotated
> verify: symref
> $ git verify-tag --format='verify: %(refname) %(symref)'
> refs/tags/annotated refs/tags/symref
> verify: refs/tags/annotated
> verify: refs/tags/symref
> 
> As we can see, there is a slight difference between git tag --verify and
> git verify-tag: git tag --verify can not handle refs' fullname refs/tags/*
> (because read_ref_full() | read_ref() can't handle them). So, as a standard,
> which characteristics should we keep?

Whereas are you notice here, verify-tag takes any name (which could be
fully qualified), and uses it as-is. In fact, it might not even be a ref
at all! You can say "git verify-tag c06b72d02" if you want to. And as a
plumbing tool, we should make sure this continues to work. For example,
careful scripts may resolve a ref into an object, and want to continue
talking about that object without worrying about the ref being changed
simultaneously.

But it also creates a weirdness for "git verify-tag --format". We do not
necessarily even have a ref to show. So IMHO the feature is somewhat
mis-designed in the first place. But we should probably continue to
support it as best we can.

The best I can come up with is:

  - when we resolve the name, if it was a ref, we should record that.
    I think this is hard to do now. It would probably require
    get_oid_with_context() learning to report on the results it got from
    dwim_ref().

  - if we have a refname, then feed it to pretty_print_ref() as a
    fully-qualified name. And pass whatever "default lstrip=2" magic we
    come up with for "git tag --verify". That would mean that "git
    verify-tag --format=%(refname) v2.33.0" would behave the same before
    and after.

  - if we didn't get a refname, then...I guess continue to pass the name
    the user gave us into pretty_print_ref()? That would keep "git
    verify-tag --format=%(refname) c06b72d02" working as it does today.

The alternative is to do none of those things, and just document that
"verify-tag" is weird:

  - its %(refname) reports whatever you gave it, whether it is a ref or
    not

  - some advanced format placeholders like %(symref) may not work if you
    don't pass a fully-qualified ref

-Peff

ZheNing Hu Sept. 20, 2021, 7:42 a.m. UTC | #13

Jeff King <peff@peff.net> 于2021年9月17日周五 上午5:45写道：
>
> On Wed, Sep 15, 2021 at 10:23:43PM +0800, ZheNing Hu wrote:
>
> > ZheNing Hu <adlternative@gmail.com> 于2021年9月15日周三 下午8:27写道：
> > >
> > > > So yes, it's complicated. And it must be explained to the user that
> > > > "%(refname)" behaves slightly differently with "git tag --verify", but
> > > > that is unavoidable if we do not want to break scripts (it _already_
> > > > behaves slightly differently, and we just never told anyone).
> > > >
> >
> > $ git tag --verify --format='verify: %(refname) %(symref)' annotated symref
> > verify: annotated
> > verify: symref
> > $ git tag --verify --format='verify: %(refname) %(symref)'
> > refs/tags/annotated refs/tags/symref
> > error: tag 'refs/tags/annotated' not found.
> > error: tag 'refs/tags/symref' not found.
>
> This is expected. When you provide a tag name on the command line of
> "git tag" it is assumed to be a non-qualified name in refs/tags/ (and
> ditto for git-branch and refs/heads/). It is tempting to try to be
> friendly and accept fully-qualified refs there, but it would create
> ambiguities (e.g., you could really have refs/tags/refs/tags/foo as a
> ref).
>

Yeah, maybe you are right, for git tag --verify, there may have ambiguities.
But for git verify-tag, if we have tags like "refs/tags/refs/tags/foo" and
"refs/tags/foo":

$ git verify-tag --format='verify: %(refname) %(symref)' refs/tags/foo
warning: refname 'refs/tags/foo' is ambiguous.

We see the ambiguities too here.

> I think we can ignore that for our purposes here, though. It's a
> question of input from the command-line, and we focus on just the output
> that we produce.
>

Yeah, but using different functions (read_ref_full(), get_oid()) will
affect what
kind of input we can provide.

> > $ git verify-tag --format='verify: %(refname) %(symref)' annotated
> > symref
> > verify: annotated
> > verify: symref
> > $ git verify-tag --format='verify: %(refname) %(symref)'
> > refs/tags/annotated refs/tags/symref
> > verify: refs/tags/annotated
> > verify: refs/tags/symref
> >
> > As we can see, there is a slight difference between git tag --verify and
> > git verify-tag: git tag --verify can not handle refs' fullname refs/tags/*
> > (because read_ref_full() | read_ref() can't handle them). So, as a standard,
> > which characteristics should we keep?
>
> Whereas are you notice here, verify-tag takes any name (which could be
> fully qualified), and uses it as-is. In fact, it might not even be a ref
> at all! You can say "git verify-tag c06b72d02" if you want to. And as a
> plumbing tool, we should make sure this continues to work. For example,
> careful scripts may resolve a ref into an object, and want to continue
> talking about that object without worrying about the ref being changed
> simultaneously.
>

Yes, this feature is very bad. %(refname) seems to do the %(objectname)
work.


> But it also creates a weirdness for "git verify-tag --format". We do not
> necessarily even have a ref to show. So IMHO the feature is somewhat
> mis-designed in the first place. But we should probably continue to
> support it as best we can.
>
> The best I can come up with is:
>
>   - when we resolve the name, if it was a ref, we should record that.
>     I think this is hard to do now. It would probably require
>     get_oid_with_context() learning to report on the results it got from
>     dwim_ref().
>
>   - if we have a refname, then feed it to pretty_print_ref() as a
>     fully-qualified name. And pass whatever "default lstrip=2" magic we
>     come up with for "git tag --verify". That would mean that "git
>     verify-tag --format=%(refname) v2.33.0" would behave the same before
>     and after.
>
>   - if we didn't get a refname, then...I guess continue to pass the name
>     the user gave us into pretty_print_ref()? That would keep "git
>     verify-tag --format=%(refname) c06b72d02" working as it does today.
>
> The alternative is to do none of those things, and just document that
> "verify-tag" is weird:
>
>   - its %(refname) reports whatever you gave it, whether it is a ref or
>     not
>
>   - some advanced format placeholders like %(symref) may not work if you
>     don't pass a fully-qualified ref
>
> -Peff

This is my solution according to your above suggection:
https://lore.kernel.org/git/pull.1042.git.1632123476.gitgitgadget@gmail.com/

Thanks.
--
ZheNing Hu

[1/2] ref-filter: hacky "streaming" mode

Commit Message

Comments

Patch