[v4,0/2] rev-list: print additional missing object information

Message-ID: <20250205004147.887106-1-jltobler@gmail.com>
From: Justin Tobler <jltobler@gmail.com>
To: git@vger.kernel.org
Cc: christian.couder@gmail.com, phillip.wood123@gmail.com, Justin Tobler <jltobler@gmail.com>
Subject: [PATCH v4 0/2] rev-list: print additional missing object information
Date: Tue, 4 Feb 2025 18:41:45 -0600
In-Reply-To: <20250201201658.11562-1-jltobler@gmail.com>

Justin Tobler Feb. 5, 2025, 12:41 a.m. UTC

Greetings,

It is possible to configure git-rev-list(1) to print the OID of missing
objects by setting the `--missing=print` option. While it is useful to
know about these objects, it would be nice to have even more context
about the objects that are missing. Luckily, from an object containing
the missing object, it is possible to infer additional information
about the missing object. For example, if the tree containing a missing
blob still exists, the tree entry for the missing object should contain
path and type information.

This series aims to provide git-rev-list(1) with a new `print-info`
missing action for the `--missing` option that, when set, behaves like
the existing `print` action but also prints other potentially
interesting information about the missing object.

Missing object info is printed in the form `?<oid> [<token>=<value>]...`
where multiple `<token>=<value>` pairs may be specified, separated from
one another by a single SP. Values that contain SP or LF characters are
expected to be encoded so that these problematic bytes cannot be
mistaken for delimiters. For missing object path information, this is
done by quoting the path in the C style if it contains SP or special
characters.
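
To make the quoting rule concrete, here is a minimal Python sketch of how
one such line could be assembled. It is not the code from
builtin/rev-list.c: `c_quote` and `format_missing_line` are hypothetical
names, the escape table only approximates Git's C-style quoting, and the
`quote_high_bytes` parameter is an assumed stand-in for the effect of
core.quotePath.

```python
# Sketch only: approximates Git's C-style path quoting, not Git's own code.
_NAMED = {
    0x07: b"a", 0x08: b"b", 0x09: b"t", 0x0A: b"n",
    0x0B: b"v", 0x0C: b"f", 0x0D: b"r",
    0x22: b'"', 0x5C: b"\\",
}


def c_quote(path, quote_high_bytes=True):
    """Return `path` (bytes) quoted in the C style if it needs quoting."""
    out = bytearray()
    needs_quotes = False
    for b in path:
        if b in _NAMED:
            # Characters with a short escape such as \n or \".
            out += b"\\" + _NAMED[b]
            needs_quotes = True
        elif b < 0x20 or b == 0x7F or (b >= 0x80 and quote_high_bytes):
            # Everything else that is unsafe becomes a 3-digit octal escape.
            out += b"\\%03o" % b
            needs_quotes = True
        else:
            out.append(b)
            # A plain SP is kept literally but forces surrounding quotes.
            needs_quotes = needs_quotes or b == 0x20
    return b'"' + bytes(out) + b'"' if needs_quotes else bytes(out)


def format_missing_line(oid, path, obj_type):
    """Build a `?<oid> path=<value> type=<value>` line (hypothetical helper)."""
    return (b"?" + oid.encode() + b" path=" + c_quote(path)
            + b" type=" + obj_type.encode())
```

With this sketch, a path such as `foo "bar` comes out as `path="foo \"bar"`,
while `foo/bar` is emitted unquoted.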

One concern I currently have with this quoting approach is that it is a
bit more challenging to machine parse compared to something like using a
null byte to delimit between missing info. One option is, in a followup
series, introduce a git-for-each-ref(1) style format syntax. Maybe
something like: `--missing=print-info:%(path)%00%(type)`. I'm curious if
anyone may have thoughts around this. My goal is to ensure that there is
an easy-to-use, machine-parsable interface to get this information. I
could see something like `?<oid> path="foo \"bar" type=blob` being a
bit complex to parse.
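
To give a feel for what a consumer of the quoted format has to do, here is
a rough Python sketch of a parser for a single `?<oid> [<token>=<value>]...`
line. The function names are hypothetical, and the set of escapes handled
(named escapes plus 3-digit octal) is an assumption about what C-style
quoting can produce, not a statement of the exact output of the patches.

```python
# Sketch only: a consumer-side parser for one print-info line.
_UNESCAPE = {
    ord("a"): 0x07, ord("b"): 0x08, ord("t"): 0x09, ord("n"): 0x0A,
    ord("v"): 0x0B, ord("f"): 0x0C, ord("r"): 0x0D,
    ord('"'): 0x22, ord("\\"): 0x5C,
}


def _read_value(buf, i):
    """Read one value starting at buf[i]; return (value, next_index)."""
    if i >= len(buf) or buf[i] != ord('"'):
        # Unquoted value: runs up to the next SP or the end of the line.
        end = buf.find(b" ", i)
        end = len(buf) if end == -1 else end
        return buf[i:end], end
    out = bytearray()
    i += 1
    while i < len(buf):
        c = buf[i]
        if c == ord('"'):
            return bytes(out), i + 1
        if c != ord("\\"):
            out.append(c)
            i += 1
            continue
        if i + 1 >= len(buf):
            break
        nxt = buf[i + 1]
        if nxt in _UNESCAPE:
            out.append(_UNESCAPE[nxt])
            i += 2
        else:
            # Assume a 3-digit octal escape such as \303.
            digits = buf[i + 1:i + 4]
            if len(digits) != 3 or not digits.isdigit():
                raise ValueError("unsupported escape sequence")
            out.append(int(digits, 8))
            i += 4
    raise ValueError("unterminated quoted value")


def parse_missing_line(line):
    """Parse b'?<oid> tok=val ...' into (oid, {token: value_bytes})."""
    if not line.startswith(b"?"):
        raise ValueError("not a missing-object line")
    oid, _, rest = line[1:].partition(b" ")
    info = {}
    i = 0
    while i < len(rest):
        eq = rest.find(b"=", i)
        if eq == -1:
            raise ValueError("expected <token>=<value>")
        token = rest[i:eq].decode()
        value, i = _read_value(rest, eq + 1)
        info[token] = value
        if i < len(rest):
            if rest[i] != 0x20:
                raise ValueError("expected SP between pairs")
            i += 1
    return oid.decode(), info
```

Values are kept as bytes because paths are not guaranteed to be valid
UTF-8. The amount of state needed just to find the end of a value is the
complexity referred to above.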

The series is set up as follows:

- Patch 1 introduces the `print-info` missing action and supports
  printing missing object path information.

- Patch 2 extends the `print-info` missing action to also print object
  type information about the missing object.

Changes in V4:

- The core.quotePath behavior is no longer force enabled for the missing
  info values. Consequently the first two patches from the previous
  version are dropped.

Thanks,
-Justin

Justin Tobler (2):
  rev-list: add print-info action to print missing object path
  rev-list: extend print-info to print missing object type

 Documentation/rev-list-options.txt |  19 ++++++
 builtin/rev-list.c                 | 106 ++++++++++++++++++++++++-----
 t/t6022-rev-list-missing.sh        |  53 +++++++++++++++
 3 files changed, 161 insertions(+), 17 deletions(-)

Range-diff against v3:
1:  f628728300 < -:  ---------- quote: add c quote flag to ignore core.quotePath
2:  53a3811d8f < -:  ---------- quote: add quote_path() flag to ignore config
3:  fe7a3da8de ! 1:  e3d5295b4d rev-list: add print-info action to print missing object path
    @@ builtin/rev-list.c: static off_t get_object_disk_usage(struct object *obj)
     +		struct strbuf path = STRBUF_INIT;
     +
     +		strbuf_addstr(&sb, " path=");
    -+		quote_path(entry->path, NULL, &path,
    -+			   QUOTE_PATH_QUOTE_SP | QUOTE_PATH_IGNORE_CONFIG);
    ++		quote_path(entry->path, NULL, &path, QUOTE_PATH_QUOTE_SP);
     +		strbuf_addbuf(&sb, &path);
     +
     +		strbuf_release(&path);
4:  788b497d00 = 2:  6aa71444d3 rev-list: extend print-info to print missing object type

base-commit: b74ff38af58464688b211140b90ec90598d340c6

Christian Couder Feb. 5, 2025, 10:35 a.m. UTC | #1

On Wed, Feb 5, 2025 at 1:45 AM Justin Tobler <jltobler@gmail.com> wrote:

> Changes in V4:
>
> - The core.quotePath behavior is no longer force enabled for the missing
>   info values. Consequently the first two patches from the previous
>   version are dropped.

This v4 looks good to me. Ack!

Junio C Hamano Feb. 5, 2025, 1:18 p.m. UTC | #2

Justin Tobler <jltobler@gmail.com> writes:

> One concern I currently have with this quoting approach is that it is a
> bit more challenging to machine parse compared to something like using a
> null byte to delimit between missing info. One option is, in a followup
> series, introduce a git-for-each-ref(1) style format syntax. Maybe
> something like: `--missing=print-info:%(path)%00%(type)`. I'm curious if
> anyone may have thoughts around this.

Would it be so bad if we said that in -z mode with --info option,
each record is terminated with two NUL bytes, and elements on a list
of var=value pairs have a single NUL in between, or something silly
like that?  The point is to get away with just a fixed format,
without any customization.

Justin Tobler Feb. 5, 2025, 5:17 p.m. UTC | #3

On 25/02/05 05:18AM, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
> 
> > One concern I currently have with this quoting approach is that it is a
> > bit more challenging to machine parse compared to something like using a
> > null byte to delimit between missing info. One option is, in a followup
> > series, introduce a git-for-each-ref(1) style format syntax. Maybe
> > something like: `--missing=print-info:%(path)%00%(type)`. I'm curious if
> > anyone may have thoughts around this.
> 
> Would it be so bad if we said that in -z mode with --info option,
> each record is terminated with two NUL bytes, and elements on a list
> of var=value pairs have a single NUL in between, or something silly
> like that?  The point is to get away with just a fixed format,
> without any customization.

I agree that some sort of fixed format would be preferable as it's less
complex while also being simpler to implement. I originally considered
using NUL but realized a single NUL byte to delimit between entries
wouldn't be sufficient to determine where each record would end. Using
two NUL bytes next to each other to mark the end of a record would work
though.

Since even a normal rev-list record may have an object name entry in
addition to its OID when the `--objects` option is set, maybe we could
introduce a `-z` option that always terminates a record with two NUL
bytes?

The output for `git rev-list -z --objects --missing=print-info` could
look something like the following (no LF at EOL):

  6aa71444d3d41315509c3f2cfe2d45d86cea20d7<NUL><NUL>
  f009994f5d7fc97c1e87b4dc7ad69057a07e85c4<NUL>foo/bar<NUL><NUL>
  ?f10f78c60046b2be841c9e2403960663439296c3<NUL>path=foo/bar/baz<NUL>type=blob<NUL><NUL>
  ?ead43a34efd775b58d6b3e86db6bc71bbedd2c1c<NUL>path=foo/bar/baz 2<NUL>type=blob<NUL><NUL>

Having two NUL bytes to delimit between records might be a bit odd in
the common case for git-rev-list(1) without the `--objects` and
`--missing` options since we would only expect a list of OIDs. Having
consistent `-z` option output irrespective of other options might be
preferable though.
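
For comparison, a consumer of the NUL-delimited layout sketched above
could be as small as the following Python snippet. This is only an
illustration of the proposal in this mail: `parse_z_records` is a
hypothetical name, the record layout is taken from the example output
above rather than from any implemented `-z` option, and the sketch
assumes that no field is ever empty (an empty field would be
indistinguishable from a record terminator).

```python
# Sketch only: parses the proposed layout, where every field ends with one
# NUL and every record ends with one additional NUL.
def parse_z_records(data):
    """Split `data` (bytes) into a list of record dictionaries."""
    records = []
    for chunk in data.split(b"\0\0"):
        if not chunk:
            # Trailing piece after the final record terminator.
            continue
        fields = chunk.split(b"\0")
        head, rest = fields[0], fields[1:]
        if head.startswith(b"?"):
            # Missing object: remaining fields are <token>=<value> pairs.
            info = {}
            for field in rest:
                token, _, value = field.partition(b"=")
                info[token.decode()] = value
            records.append({"oid": head[1:].decode(), "missing": True,
                            "name": None, "info": info})
        else:
            # Present object: optional object name from `--objects`.
            name = rest[0] if rest else None
            records.append({"oid": head.decode(), "missing": False,
                            "name": name, "info": {}})
    return records
```

No unquoting step is needed here, since a path containing SP (such as
`foo/bar/baz 2`) is carried through unchanged.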

If this approach seems reasonable, I can do so in a followup series.

Thanks
-Justin

Justin Tobler Feb. 5, 2025, 5:18 p.m. UTC | #4

On 25/02/05 11:35AM, Christian Couder wrote:
> On Wed, Feb 5, 2025 at 1:45 AM Justin Tobler <jltobler@gmail.com> wrote:
> 
> > Changes in V4:
> >
> > - The core.quotePath behavior is no longer force enabled for the missing
> >   info values. Consequently the first two patches from the previous
> >   version are dropped.
> 
> This v4 looks good to me. Ack!

Thanks for the review!

-Justin

Junio C Hamano Feb. 5, 2025, 6:29 p.m. UTC | #5

Justin Tobler <jltobler@gmail.com> writes:

> wouldn't be sufficient to determine where each record would end. Using
> two NUL bytes next to each other to mark the end of a record would work
> though.

I think we already use that convention elsewhere, and that is why I
brought it up as a potential approach to take.
