Message ID | 20250310192829.661692-1-jltobler@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | rev-list: introduce NUL-delimited output mode | expand |
Justin Tobler <jltobler@gmail.com> writes: > ?<oid> [SP <token>=<value>]... LF > > where values containing LF or SP are printed in a token specific fashion > so that the resulting encoded value does not contain either of these two > problematic bytes. For example, missing object paths are quoted in the C > style so they contain LF or SP. "so" -> "when"??? > To make machine parsing easier, this series introduces a NUL-delimited > output mode for git-rev-list(1) via a `-z` option following a suggestion > from Junio in a previous thread[1]. In this mode, instead of LF, each > object is delimited with two NUL bytes and any object metadata is > separated with a single NUL byte. Examples: > > <oid> NUL NUL > <oid> [NUL <path>] NUL NUL Why do we need double-NUL in the above two cases? > ?<oid> [NUL <token>=<value>]... NUL NUL This one I understand; we could do without double-NUL and take the lack of "=" in the token after NUL termination as the sign that the previous record ended, though, to avoid double-NUL while keeping the format extensible. As this topic is designing essentially a new and machine parseable format, we could even unify all three formats into one. For example, the format could be like this: <oid> NUL [<attr>=<value> NUL]... where (1) A record ends when a new record begins. (2) The beginning of a new record is signaled by <oid> that is all hexadecimal and does not have any '=' in it. (3) The traditional "rev-list --objects" output that gives path in addition to the object name uses "path" as the <attr> name, i.e. such a record looks like "<oid> NUL path=<path> NUL". (4) The traditional "rev-list --missing" output loses the leading "?"; it is replaced by "missing" as the <attr> name, i.e. such a record may look like "<oid> NUL missing=yes NUL..." together with other "<token>=<value> NUL" pairs appended as needed at the end. Hmm?
Junio C Hamano <gitster@pobox.com> writes: > As this topic is designing essentially a new and machine parseable > format, we could even unify all three formats into one. For example, > the format could be like this: > > <oid> NUL [<attr>=<value> NUL]... > > where (0) "rev-list" that gives only a sequence of "<oid>" for commit-ish, as well as "rev-list --boundary", would fall out as a natural consequence. Bog-standard "list of commits" would see a sequence of "<oid> NUL", while a boundary object would see "<oid> NUL boundary=yes NUL". > (1) A record ends when a new record begins. > > (2) The beginning of a new record is signaled by <oid> that is all > hexadecimal and does not have any '=' in it. > > (3) The traditional "rev-list --objects" output that gives path in > addition to the object name uses "path" as the <attr> name, > i.e. such a record looks like "<oid> NUL path=<path> NUL". > > (4) The traditional "rev-list --missing" output loses the leading > "?"; it is replaced by "missing" as the <attr> name, i.e. such > a record may look like "<oid> NUL missing=yes NUL..." together > with other "<token>=<value> NUL" pairs appended as needed at > the end.
On Mon, Mar 10, 2025 at 3:32 PM Justin Tobler <jltobler@gmail.com> wrote: > > When walking objects, git-rev-list(1) prints each object entry on a > separate line in the form: > > <oid> LF > > Some options, such as `--objects`, may print additional information > about the object on the same line: > > <oid> SP [<path>] LF > > In this mode, if the object path contains a newline it is truncated at > the newline. > > When the `--missing={print,print-info}` option is provided, information > about any missing objects encountered during the object walk are also > printed in the form: > > ?<oid> [SP <token>=<value>]... LF > > where values containing LF or SP are printed in a token specific fashion > so that the resulting encoded value does not contain either of these two > problematic bytes. For example, missing object paths are quoted in the C > style so they contain LF or SP. > > To make machine parsing easier, this series introduces a NUL-delimited > output mode for git-rev-list(1) via a `-z` option following a suggestion > from Junio in a previous thread[1]. In this mode, instead of LF, each > object is delimited with two NUL bytes and any object metadata is > separated with a single NUL byte. Examples: > > <oid> NUL NUL > <oid> [NUL <path>] NUL NUL > ?<oid> [NUL <token>=<value>]... NUL NUL > > In this mode, path and value info are printed as-is without any special > encoding or truncation. > > For now this series only adds support for use with the `--objects` and > `--missing` options. Usage of `-z` with other options is rejected, so it > can potentially be added in the future. > > One idea I had, but did not implement in this version, was to also use > the `<token>=<value>` format for regular non-missing object info while > in the NUL-delimited mode. I could see this being a bit more flexible > instead of relying strictly on order. Interested if anyone has thoughts > on this. :) Without taking a deeper look, I think token=value has the benefit of being self-describing at the cost of more output bytes (which might matter over the wire, for example). Generally I like the idea; sometimes I find it troublesome having to parse prose manuals for the specifics of output formats like field order, especially when I end up coding a parser for the format. If the field order doesn’t matter to the consumer, then perhaps using ordered fields AWK-style is inappropriately terse? OTOH, the -z format is for machines, and they don’t need human labels ;) [I think token labels would be a great parser-writing and debugging aid] Best, Ben > > This series is structured as follows: > > - Patches 1 and 2 do some minor preparatory refactors. > > - Patch 3 adds the `-z` option to git-rev-list(1) to print > objects in a NUL-delimited fashion. Printed object paths with > the `--objects` option are also handled. > > - Patch 4 teaches the `--missing` option how to print info in a > NUL-delimited fashion. > > Thanks for taking a look, > -Justin > > [1]: <xmqq5xlor0la.fsf@gitster.g> > > Justin Tobler (4): > rev-list: inline `show_object_with_name()` in `show_object()` > rev-list: refactor early option parsing > rev-list: support delimiting objects with NUL bytes > rev-list: support NUL-delimited --missing option > > Documentation/rev-list-options.adoc | 26 +++++++++ > builtin/rev-list.c | 86 ++++++++++++++++++++++------- > revision.c | 8 --- > revision.h | 2 - > t/t6000-rev-list-misc.sh | 34 ++++++++++++ > t/t6022-rev-list-missing.sh | 30 ++++++++++ > 6 files changed, 155 insertions(+), 31 deletions(-) > > > base-commit: 87a0bdbf0f72b7561f3cd50636eee33dcb7dbcc3 > -- > 2.49.0.rc2 > >
On 25/03/10 06:38PM, D. Ben Knoble wrote: > On Mon, Mar 10, 2025 at 3:32 PM Justin Tobler <jltobler@gmail.com> wrote: > > > > When walking objects, git-rev-list(1) prints each object entry on a > > separate line in the form: > > > > <oid> LF > > > > Some options, such as `--objects`, may print additional information > > about the object on the same line: > > > > <oid> SP [<path>] LF > > > > In this mode, if the object path contains a newline it is truncated at > > the newline. > > > > When the `--missing={print,print-info}` option is provided, information > > about any missing objects encountered during the object walk are also > > printed in the form: > > > > ?<oid> [SP <token>=<value>]... LF > > > > where values containing LF or SP are printed in a token specific fashion > > so that the resulting encoded value does not contain either of these two > > problematic bytes. For example, missing object paths are quoted in the C > > style so they contain LF or SP. > > > > To make machine parsing easier, this series introduces a NUL-delimited > > output mode for git-rev-list(1) via a `-z` option following a suggestion > > from Junio in a previous thread[1]. In this mode, instead of LF, each > > object is delimited with two NUL bytes and any object metadata is > > separated with a single NUL byte. Examples: > > > > <oid> NUL NUL > > <oid> [NUL <path>] NUL NUL > > ?<oid> [NUL <token>=<value>]... NUL NUL > > > > In this mode, path and value info are printed as-is without any special > > encoding or truncation. > > > > For now this series only adds support for use with the `--objects` and > > `--missing` options. Usage of `-z` with other options is rejected, so it > > can potentially be added in the future. > > > > One idea I had, but did not implement in this version, was to also use > > the `<token>=<value>` format for regular non-missing object info while > > in the NUL-delimited mode. I could see this being a bit more flexible > > instead of relying strictly on order. Interested if anyone has thoughts > > on this. :) > > Without taking a deeper look, I think token=value has the benefit of > being self-describing at the cost of more output bytes (which might > matter over the wire, for example). Generally I like the idea; > sometimes I find it troublesome having to parse prose manuals for the > specifics of output formats like field order, especially when I end up > coding a parser for the format. If the field order doesn’t matter to > the consumer, then perhaps using ordered fields AWK-style is > inappropriately terse? > > OTOH, the -z format is for machines, and they don’t need human labels > ;) [I think token labels would be a great parser-writing and debugging > aid] One of the challenges with parsing git-rev-list(1) is all the various forms it can take based on the options provided. For example: $ git rev-list --timestamp --objects --parents <rev> timestamp SP <oid> [SP <parent oid>] LF (commit) <oid> SP [<path>] LF (tree/blob) Relying strictly on order can be a bit tricky to parse due to how the output format can change even line to line. So even for machine parsing, labels may help simplify things if all object records follow something along the lines of: <oid> NUL [<token>=<value> NUL]... As you mentioned, this could potentially also be useful for users since the attributes would be self-describing. This series is currently focussed on the machine parsing side, but I think support for this mode in a human-readable format could be added via a separate option in the future. -Justin
On 25/03/10 01:37PM, Junio C Hamano wrote: > Justin Tobler <jltobler@gmail.com> writes: > > To make machine parsing easier, this series introduces a NUL-delimited > > output mode for git-rev-list(1) via a `-z` option following a suggestion > > from Junio in a previous thread[1]. In this mode, instead of LF, each > > object is delimited with two NUL bytes and any object metadata is > > separated with a single NUL byte. Examples: > > > > <oid> NUL NUL > > <oid> [NUL <path>] NUL NUL > > Why do we need double-NUL in the above two cases? In the `<oid> [NUL <path>] NUL NUL` case, it would technically be possible for an object path to match an OID. The use of two NUL bytes signals when the object record ends. Without someother mechanism to know when a record starts/stops, even the `<oid> NUL NUL` case would need the two trailing NUL bytes to avoid being considered a potential path. If the output format would not result in any additional object metadata being appended, we could use a single NUL byte to delimit between objects in this case, but always using two NUL bytes allowed for a more consistent format. > > > ?<oid> [NUL <token>=<value>]... NUL NUL > > This one I understand; we could do without double-NUL and take the > lack of "=" in the token after NUL termination as the sign that the > previous record ended, though, to avoid double-NUL while keeping the > format extensible. > > As this topic is designing essentially a new and machine parseable > format, we could even unify all three formats into one. For example, > the format could be like this: > > <oid> NUL [<attr>=<value> NUL]... I was also considering something similar. This format could allow other object metadata like `--timestamp` to be supported in the future with a more flexible format. In the next version I'll implement a unified format here. > > where > > (1) A record ends when a new record begins. > > (2) The beginning of a new record is signaled by <oid> that is all > hexadecimal and does not have any '=' in it. I think this is a good idea. By always appending printed object metadata in the form `<token>=<value>`, we know that any entry without '=' must be the start of a new record. This removes the need for the two NUL bytes to indicate the end of a record. I'll use only a single NUL byte to delimit in the next version. > > (3) The traditional "rev-list --objects" output that gives path in > addition to the object name uses "path" as the <attr> name, > i.e. such a record looks like "<oid> NUL path=<path> NUL". > > (4) The traditional "rev-list --missing" output loses the leading > "?"; it is replaced by "missing" as the <attr> name, i.e. such > a record may look like "<oid> NUL missing=yes NUL..." together > with other "<token>=<value> NUL" pairs appended as needed at > the end. I think this is good. Instead of prefixing missing OIDs with '?', we can just append another token/value pair `missing=yes`. Thanks, -Justin
On 25/03/10 02:08PM, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > As this topic is designing essentially a new and machine parseable > > format, we could even unify all three formats into one. For example, > > the format could be like this: > > > > <oid> NUL [<attr>=<value> NUL]... > > > > where > > (0) "rev-list" that gives only a sequence of "<oid>" for commit-ish, > as well as "rev-list --boundary", would fall out as a natural > consequence. Bog-standard "list of commits" would see a > sequence of "<oid> NUL", while a boundary object would see > "<oid> NUL boundary=yes NUL". I had not considered handling the `--boundary` option. It looks like boundary objects are printed as part of `show_commit()`, so I can adapt the handling and do something similar to missing objects: $ git rev-list -z --boundary <rev> <oid> NUL boundary=yes NUL This would remain consistent with the unified format. -Justin
Justin Tobler <jltobler@gmail.com> writes: >> (4) The traditional "rev-list --missing" output loses the leading >> "?"; it is replaced by "missing" as the <attr> name, i.e. such >> a record may look like "<oid> NUL missing=yes NUL..." together >> with other "<token>=<value> NUL" pairs appended as needed at >> the end. > > I think this is good. Instead of prefixing missing OIDs with '?', we can > just append another token/value pair `missing=yes`. And we may want to avoid excessive bloat of the output that is not primarily meant for human consumption, in which case perhaps a common things like "missing" and "path" can be given a shorter token, perhaps like "m=yes" and "p=Makefile"?
On Mon, Mar 10, 2025 at 02:28:25PM -0500, Justin Tobler wrote: > To make machine parsing easier, this series introduces a NUL-delimited > output mode for git-rev-list(1) via a `-z` option following a suggestion > from Junio in a previous thread[1]. In this mode, instead of LF, each > object is delimited with two NUL bytes and any object metadata is > separated with a single NUL byte. Examples: > > <oid> NUL NUL > <oid> [NUL <path>] NUL NUL > ?<oid> [NUL <token>=<value>]... NUL NUL > > In this mode, path and value info are printed as-is without any special > encoding or truncation. I think this is a good direction, but I have two compatibility questions: 1. What should "git rev-list -z --stdin" do? In most other programs with a "-z" option it affects both input and output. I don't particularly care about this case myself, but it will be hard to change later. So we probably want to decide now. 2. I was a little surprised that rev-list already takes a "-z" option, but it doesn't seem to do anything. I guess it's probably picked up via diff_opt_parse(), though (I think) you can't actually trigger a diff via rev-list itself. So even though this is a change in behavior, probably nobody would have been using it until now? If it is possible to see some effect from "-z" now (I didn't dig very far), then it may be better to continue to let the diff options parser handle it, and simply pick the result out of revs.diffopt.line_termination. As your patch 3 is written, I think the diff code probably doesn't see it anymore at all. -Peff
On Tue, Mar 11, 2025 at 04:44:10PM -0700, Junio C Hamano wrote: > Justin Tobler <jltobler@gmail.com> writes: > > >> (4) The traditional "rev-list --missing" output loses the leading > >> "?"; it is replaced by "missing" as the <attr> name, i.e. such > >> a record may look like "<oid> NUL missing=yes NUL..." together > >> with other "<token>=<value> NUL" pairs appended as needed at > >> the end. > > > > I think this is good. Instead of prefixing missing OIDs with '?', we can > > just append another token/value pair `missing=yes`. > > And we may want to avoid excessive bloat of the output that is not > primarily meant for human consumption, in which case perhaps a > common things like "missing" and "path" can be given a shorter > token, perhaps like "m=yes" and "p=Makefile"? I would prefer to keep longer paths here: - While the output typically shouldn't be seen by a human, the code handling it both on the printing and on the receiving side would. And there it helps the reader quite a bit to get additional context rather than cryptic single-letter abbreviations that one would have to always look up. - By keeping with long names we also avoid "letter squatting" when two attributes of an object start with the same letter. - I doubt that the couple of extra characters is really going to matter much given that most of the time will most likely be spent reading and decompressing the objects from disk anyway. Patrick
On Tue, Mar 11, 2025 at 07:57:20PM -0400, Jeff King wrote: > On Mon, Mar 10, 2025 at 02:28:25PM -0500, Justin Tobler wrote: > > > To make machine parsing easier, this series introduces a NUL-delimited > > output mode for git-rev-list(1) via a `-z` option following a suggestion > > from Junio in a previous thread[1]. In this mode, instead of LF, each > > object is delimited with two NUL bytes and any object metadata is > > separated with a single NUL byte. Examples: > > > > <oid> NUL NUL > > <oid> [NUL <path>] NUL NUL > > ?<oid> [NUL <token>=<value>]... NUL NUL > > > > In this mode, path and value info are printed as-is without any special > > encoding or truncation. > > I think this is a good direction, but I have two compatibility > questions: > > 1. What should "git rev-list -z --stdin" do? In most other programs > with a "-z" option it affects both input and output. I don't > particularly care about this case myself, but it will be hard to > change later. So we probably want to decide now. I would lean into the direction of making "-z" change the format both for stdin and stdout. That's what we do in most cases, and in those cases where we didn't we came to regret it (git-cat-file(1)). Patrick