mbox series

[0/4] rev-list: introduce NUL-delimited output mode

Message ID 20250310192829.661692-1-jltobler@gmail.com (mailing list archive)
Headers show
Series rev-list: introduce NUL-delimited output mode | expand

Message

Justin Tobler March 10, 2025, 7:28 p.m. UTC
When walking objects, git-rev-list(1) prints each object entry on a
separate line in the form:

        <oid> LF

Some options, such as `--objects`, may print additional information
about the object on the same line:

        <oid> SP [<path>] LF

In this mode, if the object path contains a newline it is truncated at
the newline.

When the `--missing={print,print-info}` option is provided, information
about any missing objects encountered during the object walk are also
printed in the form:

        ?<oid> [SP <token>=<value>]... LF

where values containing LF or SP are printed in a token specific fashion
so that the resulting encoded value does not contain either of these two
problematic bytes. For example, missing object paths are quoted in the C
style so they contain LF or SP.

To make machine parsing easier, this series introduces a NUL-delimited
output mode for git-rev-list(1) via a `-z` option following a suggestion
from Junio in a previous thread[1]. In this mode, instead of LF, each
object is delimited with two NUL bytes and any object metadata is
separated with a single NUL byte. Examples:

        <oid> NUL NUL
        <oid> [NUL <path>] NUL NUL
        ?<oid> [NUL <token>=<value>]... NUL NUL

In this mode, path and value info are printed as-is without any special
encoding or truncation.

For now this series only adds support for use with the `--objects` and
`--missing` options. Usage of `-z` with other options is rejected, so it
can potentially be added in the future.

One idea I had, but did not implement in this version, was to also use
the `<token>=<value>` format for regular non-missing object info while
in the NUL-delimited mode. I could see this being a bit more flexible
instead of relying strictly on order. Interested if anyone has thoughts
on this. :)

This series is structured as follows:

        - Patches 1 and 2 do some minor preparatory refactors.

        - Patch 3 adds the `-z` option to git-rev-list(1) to print
          objects in a NUL-delimited fashion. Printed object paths with
          the `--objects` option are also handled.

        - Patch 4 teaches the `--missing` option how to print info in a
          NUL-delimited fashion.

Thanks for taking a look,
-Justin

[1]: <xmqq5xlor0la.fsf@gitster.g>

Justin Tobler (4):
  rev-list: inline `show_object_with_name()` in `show_object()`
  rev-list: refactor early option parsing
  rev-list: support delimiting objects with NUL bytes
  rev-list: support NUL-delimited --missing option

 Documentation/rev-list-options.adoc | 26 +++++++++
 builtin/rev-list.c                  | 86 ++++++++++++++++++++++-------
 revision.c                          |  8 ---
 revision.h                          |  2 -
 t/t6000-rev-list-misc.sh            | 34 ++++++++++++
 t/t6022-rev-list-missing.sh         | 30 ++++++++++
 6 files changed, 155 insertions(+), 31 deletions(-)


base-commit: 87a0bdbf0f72b7561f3cd50636eee33dcb7dbcc3

Comments

Junio C Hamano March 10, 2025, 8:37 p.m. UTC | #1
Justin Tobler <jltobler@gmail.com> writes:

>         ?<oid> [SP <token>=<value>]... LF
>
> where values containing LF or SP are printed in a token specific fashion
> so that the resulting encoded value does not contain either of these two
> problematic bytes. For example, missing object paths are quoted in the C
> style so they contain LF or SP.

"so" -> "when"???

> To make machine parsing easier, this series introduces a NUL-delimited
> output mode for git-rev-list(1) via a `-z` option following a suggestion
> from Junio in a previous thread[1]. In this mode, instead of LF, each
> object is delimited with two NUL bytes and any object metadata is
> separated with a single NUL byte. Examples:
>
>         <oid> NUL NUL
>         <oid> [NUL <path>] NUL NUL

Why do we need double-NUL in the above two cases?

>         ?<oid> [NUL <token>=<value>]... NUL NUL

This one I understand; we could do without double-NUL and take the
lack of "=" in the token after NUL termination as the sign that the
previous record ended, though, to avoid double-NUL while keeping the
format extensible.

As this topic is designing essentially a new and machine parseable
format, we could even unify all three formats into one.  For example,
the format could be like this:

	<oid> NUL [<attr>=<value> NUL]...

where

 (1) A record ends when a new record begins.

 (2) The beginning of a new record is signaled by <oid> that is all
     hexadecimal and does not have any '=' in it.

 (3) The traditional "rev-list --objects" output that gives path in
     addition to the object name uses "path" as the <attr> name,
     i.e. such a record looks like "<oid> NUL path=<path> NUL".

 (4) The traditional "rev-list --missing" output loses the leading
     "?"; it is replaced by "missing" as the <attr> name, i.e. such
     a record may look like "<oid> NUL missing=yes NUL..." together
     with other "<token>=<value> NUL" pairs appended as needed at
     the end.

Hmm?
Junio C Hamano March 10, 2025, 9:08 p.m. UTC | #2
Junio C Hamano <gitster@pobox.com> writes:

> As this topic is designing essentially a new and machine parseable
> format, we could even unify all three formats into one.  For example,
> the format could be like this:
>
> 	<oid> NUL [<attr>=<value> NUL]...
>
> where

(0) "rev-list" that gives only a sequence of "<oid>" for commit-ish,
    as well as "rev-list --boundary", would fall out as a natural
    consequence.  Bog-standard "list of commits" would see a
    sequence of "<oid> NUL", while a boundary object would see
    "<oid> NUL boundary=yes NUL".

>  (1) A record ends when a new record begins.
>
>  (2) The beginning of a new record is signaled by <oid> that is all
>      hexadecimal and does not have any '=' in it.
>
>  (3) The traditional "rev-list --objects" output that gives path in
>      addition to the object name uses "path" as the <attr> name,
>      i.e. such a record looks like "<oid> NUL path=<path> NUL".
>
>  (4) The traditional "rev-list --missing" output loses the leading
>      "?"; it is replaced by "missing" as the <attr> name, i.e. such
>      a record may look like "<oid> NUL missing=yes NUL..." together
>      with other "<token>=<value> NUL" pairs appended as needed at
>      the end.
D. Ben Knoble March 10, 2025, 10:38 p.m. UTC | #3
On Mon, Mar 10, 2025 at 3:32 PM Justin Tobler <jltobler@gmail.com> wrote:
>
> When walking objects, git-rev-list(1) prints each object entry on a
> separate line in the form:
>
>         <oid> LF
>
> Some options, such as `--objects`, may print additional information
> about the object on the same line:
>
>         <oid> SP [<path>] LF
>
> In this mode, if the object path contains a newline it is truncated at
> the newline.
>
> When the `--missing={print,print-info}` option is provided, information
> about any missing objects encountered during the object walk are also
> printed in the form:
>
>         ?<oid> [SP <token>=<value>]... LF
>
> where values containing LF or SP are printed in a token specific fashion
> so that the resulting encoded value does not contain either of these two
> problematic bytes. For example, missing object paths are quoted in the C
> style so they contain LF or SP.
>
> To make machine parsing easier, this series introduces a NUL-delimited
> output mode for git-rev-list(1) via a `-z` option following a suggestion
> from Junio in a previous thread[1]. In this mode, instead of LF, each
> object is delimited with two NUL bytes and any object metadata is
> separated with a single NUL byte. Examples:
>
>         <oid> NUL NUL
>         <oid> [NUL <path>] NUL NUL
>         ?<oid> [NUL <token>=<value>]... NUL NUL
>
> In this mode, path and value info are printed as-is without any special
> encoding or truncation.
>
> For now this series only adds support for use with the `--objects` and
> `--missing` options. Usage of `-z` with other options is rejected, so it
> can potentially be added in the future.
>
> One idea I had, but did not implement in this version, was to also use
> the `<token>=<value>` format for regular non-missing object info while
> in the NUL-delimited mode. I could see this being a bit more flexible
> instead of relying strictly on order. Interested if anyone has thoughts
> on this. :)

Without taking a deeper look, I think token=value has the benefit of
being self-describing at the cost of more output bytes (which might
matter over the wire, for example). Generally I like the idea;
sometimes I find it troublesome having to parse prose manuals for the
specifics of output formats like field order, especially when I end up
coding a parser for the format. If the field order doesn’t matter to
the consumer, then perhaps using ordered fields AWK-style is
inappropriately terse?

OTOH, the -z format is for machines, and they don’t need human labels
;) [I think token labels would be a great parser-writing and debugging
aid]

Best,
Ben

>
> This series is structured as follows:
>
>         - Patches 1 and 2 do some minor preparatory refactors.
>
>         - Patch 3 adds the `-z` option to git-rev-list(1) to print
>           objects in a NUL-delimited fashion. Printed object paths with
>           the `--objects` option are also handled.
>
>         - Patch 4 teaches the `--missing` option how to print info in a
>           NUL-delimited fashion.
>
> Thanks for taking a look,
> -Justin
>
> [1]: <xmqq5xlor0la.fsf@gitster.g>
>
> Justin Tobler (4):
>   rev-list: inline `show_object_with_name()` in `show_object()`
>   rev-list: refactor early option parsing
>   rev-list: support delimiting objects with NUL bytes
>   rev-list: support NUL-delimited --missing option
>
>  Documentation/rev-list-options.adoc | 26 +++++++++
>  builtin/rev-list.c                  | 86 ++++++++++++++++++++++-------
>  revision.c                          |  8 ---
>  revision.h                          |  2 -
>  t/t6000-rev-list-misc.sh            | 34 ++++++++++++
>  t/t6022-rev-list-missing.sh         | 30 ++++++++++
>  6 files changed, 155 insertions(+), 31 deletions(-)
>
>
> base-commit: 87a0bdbf0f72b7561f3cd50636eee33dcb7dbcc3
> --
> 2.49.0.rc2
>
>
Justin Tobler March 11, 2025, 10:59 p.m. UTC | #4
On 25/03/10 06:38PM, D. Ben Knoble wrote:
> On Mon, Mar 10, 2025 at 3:32 PM Justin Tobler <jltobler@gmail.com> wrote:
> >
> > When walking objects, git-rev-list(1) prints each object entry on a
> > separate line in the form:
> >
> >         <oid> LF
> >
> > Some options, such as `--objects`, may print additional information
> > about the object on the same line:
> >
> >         <oid> SP [<path>] LF
> >
> > In this mode, if the object path contains a newline it is truncated at
> > the newline.
> >
> > When the `--missing={print,print-info}` option is provided, information
> > about any missing objects encountered during the object walk are also
> > printed in the form:
> >
> >         ?<oid> [SP <token>=<value>]... LF
> >
> > where values containing LF or SP are printed in a token specific fashion
> > so that the resulting encoded value does not contain either of these two
> > problematic bytes. For example, missing object paths are quoted in the C
> > style so they contain LF or SP.
> >
> > To make machine parsing easier, this series introduces a NUL-delimited
> > output mode for git-rev-list(1) via a `-z` option following a suggestion
> > from Junio in a previous thread[1]. In this mode, instead of LF, each
> > object is delimited with two NUL bytes and any object metadata is
> > separated with a single NUL byte. Examples:
> >
> >         <oid> NUL NUL
> >         <oid> [NUL <path>] NUL NUL
> >         ?<oid> [NUL <token>=<value>]... NUL NUL
> >
> > In this mode, path and value info are printed as-is without any special
> > encoding or truncation.
> >
> > For now this series only adds support for use with the `--objects` and
> > `--missing` options. Usage of `-z` with other options is rejected, so it
> > can potentially be added in the future.
> >
> > One idea I had, but did not implement in this version, was to also use
> > the `<token>=<value>` format for regular non-missing object info while
> > in the NUL-delimited mode. I could see this being a bit more flexible
> > instead of relying strictly on order. Interested if anyone has thoughts
> > on this. :)
> 
> Without taking a deeper look, I think token=value has the benefit of
> being self-describing at the cost of more output bytes (which might
> matter over the wire, for example). Generally I like the idea;
> sometimes I find it troublesome having to parse prose manuals for the
> specifics of output formats like field order, especially when I end up
> coding a parser for the format. If the field order doesn’t matter to
> the consumer, then perhaps using ordered fields AWK-style is
> inappropriately terse?
> 
> OTOH, the -z format is for machines, and they don’t need human labels
> ;) [I think token labels would be a great parser-writing and debugging
> aid]

One of the challenges with parsing git-rev-list(1) is all the various
forms it can take based on the options provided. For example:

    $ git rev-list --timestamp --objects --parents <rev>

    timestamp SP <oid> [SP <parent oid>] LF   (commit)
    <oid> SP [<path>] LF                      (tree/blob)

Relying strictly on order can be a bit tricky to parse due to how the
output format can change even line to line. So even for machine parsing,
labels may help simplify things if all object records follow something
along the lines of:

    <oid> NUL [<token>=<value> NUL]...

As you mentioned, this could potentially also be useful for users since
the attributes would be self-describing. This series is currently
focussed on the machine parsing side, but I think support for this mode
in a human-readable format could be added via a separate option in the
future.

-Justin
Justin Tobler March 11, 2025, 11:19 p.m. UTC | #5
On 25/03/10 01:37PM, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
> > To make machine parsing easier, this series introduces a NUL-delimited
> > output mode for git-rev-list(1) via a `-z` option following a suggestion
> > from Junio in a previous thread[1]. In this mode, instead of LF, each
> > object is delimited with two NUL bytes and any object metadata is
> > separated with a single NUL byte. Examples:
> >
> >         <oid> NUL NUL
> >         <oid> [NUL <path>] NUL NUL
> 
> Why do we need double-NUL in the above two cases?

In the `<oid> [NUL <path>] NUL NUL` case, it would technically be
possible for an object path to match an OID. The use of two NUL bytes
signals when the object record ends.

Without someother mechanism to know when a record starts/stops, even the
`<oid> NUL NUL` case would need the two trailing NUL bytes to avoid
being considered a potential path.

If the output format would not result in any additional object metadata
being appended, we could use a single NUL byte to delimit between
objects in this case, but always using two NUL bytes allowed for a more
consistent format.

> 
> >         ?<oid> [NUL <token>=<value>]... NUL NUL
> 
> This one I understand; we could do without double-NUL and take the
> lack of "=" in the token after NUL termination as the sign that the
> previous record ended, though, to avoid double-NUL while keeping the
> format extensible.
> 
> As this topic is designing essentially a new and machine parseable
> format, we could even unify all three formats into one.  For example,
> the format could be like this:
> 
> 	<oid> NUL [<attr>=<value> NUL]...

I was also considering something similar. This format could allow other
object metadata like `--timestamp` to be supported in the future with a
more flexible format. In the next version I'll implement a unified
format here.

> 
> where
> 
>  (1) A record ends when a new record begins.
> 
>  (2) The beginning of a new record is signaled by <oid> that is all
>      hexadecimal and does not have any '=' in it.

I think this is a good idea. By always appending printed object metadata
in the form `<token>=<value>`, we know that any entry without '=' must
be the start of a new record. This removes the need for the two NUL
bytes to indicate the end of a record.

I'll use only a single NUL byte to delimit in the next version.

> 
>  (3) The traditional "rev-list --objects" output that gives path in
>      addition to the object name uses "path" as the <attr> name,
>      i.e. such a record looks like "<oid> NUL path=<path> NUL".
> 
>  (4) The traditional "rev-list --missing" output loses the leading
>      "?"; it is replaced by "missing" as the <attr> name, i.e. such
>      a record may look like "<oid> NUL missing=yes NUL..." together
>      with other "<token>=<value> NUL" pairs appended as needed at
>      the end.

I think this is good. Instead of prefixing missing OIDs with '?', we can
just append another token/value pair `missing=yes`.

Thanks,
-Justin
Justin Tobler March 11, 2025, 11:24 p.m. UTC | #6
On 25/03/10 02:08PM, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > As this topic is designing essentially a new and machine parseable
> > format, we could even unify all three formats into one.  For example,
> > the format could be like this:
> >
> > 	<oid> NUL [<attr>=<value> NUL]...
> >
> > where
> 
> (0) "rev-list" that gives only a sequence of "<oid>" for commit-ish,
>     as well as "rev-list --boundary", would fall out as a natural
>     consequence.  Bog-standard "list of commits" would see a
>     sequence of "<oid> NUL", while a boundary object would see
>     "<oid> NUL boundary=yes NUL".

I had not considered handling the `--boundary` option. It looks like
boundary objects are printed as part of `show_commit()`, so I can adapt
the handling and do something similar to missing objects:

    $ git rev-list -z --boundary <rev>
    <oid> NUL boundary=yes NUL

This would remain consistent with the unified format.

-Justin
Junio C Hamano March 11, 2025, 11:44 p.m. UTC | #7
Justin Tobler <jltobler@gmail.com> writes:

>>  (4) The traditional "rev-list --missing" output loses the leading
>>      "?"; it is replaced by "missing" as the <attr> name, i.e. such
>>      a record may look like "<oid> NUL missing=yes NUL..." together
>>      with other "<token>=<value> NUL" pairs appended as needed at
>>      the end.
>
> I think this is good. Instead of prefixing missing OIDs with '?', we can
> just append another token/value pair `missing=yes`.

And we may want to avoid excessive bloat of the output that is not
primarily meant for human consumption, in which case perhaps a
common things like "missing" and "path" can be given a shorter
token, perhaps like "m=yes" and "p=Makefile"?
Jeff King March 11, 2025, 11:57 p.m. UTC | #8
On Mon, Mar 10, 2025 at 02:28:25PM -0500, Justin Tobler wrote:

> To make machine parsing easier, this series introduces a NUL-delimited
> output mode for git-rev-list(1) via a `-z` option following a suggestion
> from Junio in a previous thread[1]. In this mode, instead of LF, each
> object is delimited with two NUL bytes and any object metadata is
> separated with a single NUL byte. Examples:
> 
>         <oid> NUL NUL
>         <oid> [NUL <path>] NUL NUL
>         ?<oid> [NUL <token>=<value>]... NUL NUL
> 
> In this mode, path and value info are printed as-is without any special
> encoding or truncation.

I think this is a good direction, but I have two compatibility
questions:

  1. What should "git rev-list -z --stdin" do? In most other programs
     with a "-z" option it affects both input and output. I don't
     particularly care about this case myself, but it will be hard to
     change later. So we probably want to decide now.

  2. I was a little surprised that rev-list already takes a "-z" option,
     but it doesn't seem to do anything. I guess it's probably picked up
     via diff_opt_parse(), though (I think) you can't actually trigger a
     diff via rev-list itself. So even though this is a change in
     behavior, probably nobody would have been using it until now?

     If it is possible to see some effect from "-z" now (I didn't dig
     very far), then it may be better to continue to let the diff
     options parser handle it, and simply pick the result out of
     revs.diffopt.line_termination. As your patch 3 is written, I think
     the diff code probably doesn't see it anymore at all.

-Peff
Patrick Steinhardt March 12, 2025, 7:37 a.m. UTC | #9
On Tue, Mar 11, 2025 at 04:44:10PM -0700, Junio C Hamano wrote:
> Justin Tobler <jltobler@gmail.com> writes:
> 
> >>  (4) The traditional "rev-list --missing" output loses the leading
> >>      "?"; it is replaced by "missing" as the <attr> name, i.e. such
> >>      a record may look like "<oid> NUL missing=yes NUL..." together
> >>      with other "<token>=<value> NUL" pairs appended as needed at
> >>      the end.
> >
> > I think this is good. Instead of prefixing missing OIDs with '?', we can
> > just append another token/value pair `missing=yes`.
> 
> And we may want to avoid excessive bloat of the output that is not
> primarily meant for human consumption, in which case perhaps a
> common things like "missing" and "path" can be given a shorter
> token, perhaps like "m=yes" and "p=Makefile"?

I would prefer to keep longer paths here:

  - While the output typically shouldn't be seen by a human, the code
    handling it both on the printing and on the receiving side would.
    And there it helps the reader quite a bit to get additional context
    rather than cryptic single-letter abbreviations that one would have
    to always look up.

  - By keeping with long names we also avoid "letter squatting" when two
    attributes of an object start with the same letter.

  - I doubt that the couple of extra characters is really going to
    matter much given that most of the time will most likely be spent
    reading and decompressing the objects from disk anyway.

Patrick
Patrick Steinhardt March 12, 2025, 7:42 a.m. UTC | #10
On Tue, Mar 11, 2025 at 07:57:20PM -0400, Jeff King wrote:
> On Mon, Mar 10, 2025 at 02:28:25PM -0500, Justin Tobler wrote:
> 
> > To make machine parsing easier, this series introduces a NUL-delimited
> > output mode for git-rev-list(1) via a `-z` option following a suggestion
> > from Junio in a previous thread[1]. In this mode, instead of LF, each
> > object is delimited with two NUL bytes and any object metadata is
> > separated with a single NUL byte. Examples:
> > 
> >         <oid> NUL NUL
> >         <oid> [NUL <path>] NUL NUL
> >         ?<oid> [NUL <token>=<value>]... NUL NUL
> > 
> > In this mode, path and value info are printed as-is without any special
> > encoding or truncation.
> 
> I think this is a good direction, but I have two compatibility
> questions:
> 
>   1. What should "git rev-list -z --stdin" do? In most other programs
>      with a "-z" option it affects both input and output. I don't
>      particularly care about this case myself, but it will be hard to
>      change later. So we probably want to decide now.

I would lean into the direction of making "-z" change the format both
for stdin and stdout. That's what we do in most cases, and in those
cases where we didn't we came to regret it (git-cat-file(1)).

Patrick