diff mbox series

[2/2] builtin/cat-file.c: support NUL-delimited input with `-z`

Message ID ed1583223f63cfde99829069f14af62e4f0f2a82.1658532524.git.me@ttaylorr.com (mailing list archive)
State Accepted
Commit db9d67f2e9c9ea389f5558d6a168460d51631769
Headers show
Series cat-file: support NUL-delimited input with `-z` | expand

Commit Message

Taylor Blau July 22, 2022, 11:29 p.m. UTC
When callers are using `cat-file` via one of the stdin-driven `--batch`
modes, all input is newline-delimited. This presents a problem when
callers wish to ask about, e.g. tree-entries that have a newline
character present in their filename.

To support this niche scenario, introduce a new `-z` mode to the
`--batch`, `--batch-check`, and `--batch-command` suite of options that
instructs `cat-file` to treat its input as NUL-delimited, allowing the
individual commands themselves to have newlines present.

The refactoring here is slightly unfortunate, since we turn loops like:

    while (strbuf_getline(&buf, stdin) != EOF)

into:

    while (1) {
        int ret;
        if (opt->nul_terminated)
            ret = strbuf_getline_nul(&input, stdin);
        else
            ret = strbuf_getline(&input, stdin);

        if (ret == EOF)
            break;
    }

It's tempting to think that we could use `strbuf_getwholeline()` and
specify either `\n` or `\0` as the terminating character. But for input
on platforms that include a CR character preceeding the LF, this
wouldn't quite be the same, since `strbuf_getline(...)` will trim any
trailing CR, while `strbuf_getwholeline(&buf, stdin, '\n')` will not.

In the future, we could clean this up further by introducing a variant
of `strbuf_getwholeline()` that addresses the aforementioned gap, but
that approach felt too heavy-handed for this pair of uses.

Some tests are added in t1006 to ensure that `cat-file` produces the
same output in `--batch`, `--batch-check`, and `--batch-command` modes
with and without the new `-z` option.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 Documentation/git-cat-file.txt |  7 +++++-
 builtin/cat-file.c             | 28 ++++++++++++++++++++---
 t/t1006-cat-file.sh            | 42 +++++++++++++++++++++++++++++++++-
 3 files changed, 72 insertions(+), 5 deletions(-)

Comments

Chris Torek July 22, 2022, 11:41 p.m. UTC | #1
On Fri, Jul 22, 2022 at 4:34 PM Taylor Blau <me@ttaylorr.com> wrote:
> The refactoring here is slightly unfortunate, since we turn loops like:
>
>     while (strbuf_getline(&buf, stdin) != EOF)
>
> into:
>
>     while (1) {
>         int ret;
>         if (opt->nul_terminated)
>             ret = strbuf_getline_nul(&input, stdin);
>         else
>             ret = strbuf_getline(&input, stdin);
>
>         if (ret == EOF)
>             break;
>     }
>
> It's tempting to think that we could use `strbuf_getwholeline()` and
> specify either `\n` or `\0` as the terminating character. ...

How about:

    int (*get)(struct strbuf *, FILE *);
    get = opt->nul_terminated ? strbuf_getline_nul : strbuf_getline;
    while (get(&buf, stdin) != EOF)

or similar?

Chris
Ævar Arnfjörð Bjarmason July 23, 2022, 5:17 a.m. UTC | #2
On Fri, Jul 22 2022, Taylor Blau wrote:

[I think John Cai would appreciate being CC'd on this]

> It's tempting to think that we could use `strbuf_getwholeline()` and
> specify either `\n` or `\0` as the terminating character. But for input
> on platforms that include a CR character preceeding the LF, this
> wouldn't quite be the same, since `strbuf_getline(...)` will trim any
> trailing CR, while `strbuf_getwholeline(&buf, stdin, '\n')` will not.

I commend the effort to maintain bug-for-bug compatibility, but do we
care about this distinction, or is it just an accident that we support
\r\n here in the first place?

This doesn't apply to the rest of cat-file directly, but the origin of
the recent --batch-command mode cdoe was to lift the same-ish code from
builtin/update-ref.c, whose \n or \0 mode does exactly that:

	while (!strbuf_getwholeline(&input, stdin, line_termination)) {

I.e. it doesn't support \r\n, just \n or \0.

Isn't that fine? I may be missing something, but why isn't it OK even on
platforms that use \r\n for their normal *.txt endings to only use \n or
\0 for this bit of protocol?

For the command mode at least this passes our tests:
	
	diff --git a/builtin/cat-file.c b/builtin/cat-file.c
	index f42782e955f..8646059472d 100644
	--- a/builtin/cat-file.c
	+++ b/builtin/cat-file.c
	@@ -614,12 +614,16 @@ static void batch_objects_command(struct batch_options *opt,
	 	struct queued_cmd *queued_cmd = NULL;
	 	size_t alloc = 0, nr = 0;
	 
	-	while (!strbuf_getline(&input, stdin)) {
	+	while (!strbuf_getwholeline(&input, stdin, '\n')) {
	 		int i;
	 		const struct parse_cmd *cmd = NULL;
	 		const char *p = NULL, *cmd_end;
	 		struct queued_cmd call = {0};
	 
	+		if (input.len > 0 && input.buf[input.len - 1] == '\n')
	+			--input.len;
	+		input.buf[input.len] = '\0';
	+
	 		if (!input.len)
	 			die(_("empty command in input"));
	 		if (isspace(*input.buf))

So maybe we should just do something like that instead? I.e. declare
that a mistake.

As for the rest of cat-file 05d5667fec9 (git-cat-file: Add --batch-check
option, 2008-04-23) documents that it's LF, not CR LF, ditto
git-cat-file.txt.

So isn't this just an accident in of us having used the strbuf_getline()
function to mean "\n", but actually it also does "\r\n".

Which is a really unfortunately named function b.t.w., since it sneaks
this bit of Windows portability into places that may not want it in the
first place.

>  strlen () {
>      echo_without_newline "$1" | wc -c | sed -e 's/^ *//'
>  }
> @@ -398,6 +403,12 @@ test_expect_success '--batch with multiple sha1s gives correct format' '
>  	test "$(maybe_remove_timestamp "$batch_output" 1)" = "$(maybe_remove_timestamp "$(echo_without_newline "$batch_input" | git cat-file --batch)" 1)"
>  '
>  
> +test_expect_success '--batch, -z with multiple sha1s gives correct format' '
> +	echo_without_newline_nul "$batch_input" >in &&
> +	test "$(maybe_remove_timestamp "$batch_output" 1)" = \
> +	"$(maybe_remove_timestamp "$(git cat-file --batch -z <in)" 1)"

This...

> +'
> +
>  batch_check_input="$hello_sha1
>  $tree_sha1
>  $commit_sha1
> @@ -418,6 +429,24 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" '
>      "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
>  '
>  
> +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" '
> +    echo_without_newline_nul "$batch_check_input" >in &&
> +    test "$batch_check_output" = "$(git cat-file --batch-check -z <in)"

This....

> +'
> +
> +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' '
> +	touch -- "newline${LF}embedded" &&
> +	git add -- "newline${LF}embedded" &&
> +	git commit -m "file with newline embedded" &&
> +	test_tick &&
> +
> +	printf "HEAD:newline${LF}embedded" >in &&
> +	git cat-file --batch-check -z <in >actual &&
> +
> +	echo "$(git rev-parse "HEAD:newline${LF}embedded") blob 0" >expect &&

..and this hides git's exit code, better to pipe to a file, use test_cmp
etc. etc.
Junio C Hamano July 23, 2022, 5:35 a.m. UTC | #3
Taylor Blau <me@ttaylorr.com> writes:

> @@ -14,7 +14,7 @@ SYNOPSIS
>  'git cat-file' (-t | -s) [--allow-unknown-type] <object>
>  'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects]
>  	     [--buffer] [--follow-symlinks] [--unordered]
> -	     [--textconv | --filters]
> +	     [--textconv | --filters] [-z]

Is "-z" useful with any other option, or is it useful only in
combination with one of the three --batch-*?  The above suggests the
former.

> +test_expect_success '--batch, -z with multiple sha1s gives correct format' '
> +	echo_without_newline_nul "$batch_input" >in &&

I I recall [1/2] correctly, the input lacked the LF at the end.  In
the original "LF terminated" use converted to use these variables,
because $batch_*_input is "echo"ed to create the file "in", the lack
of LF at the end is a GOOD thing.

But here, echo_without_newline_nul is just a glorified "printf %s"
piped into tr to turn LF into NUL.  What is fed by printf into the
pipe lacks LF at the end, so the output from tr will not have NUL at
the end, either.

That might happen to work (because the EOF may be enough to signal
the end of the entire input, thus the last input item), but it does
not make the test case for "-z" exactly parallel to the line oriented
input.

> +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" '
> +    echo_without_newline_nul "$batch_check_input" >in &&
> +    test "$batch_check_output" = "$(git cat-file --batch-check -z <in)"
> +'
> +
> +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' '
> +	touch -- "newline${LF}embedded" &&
> +	git add -- "newline${LF}embedded" &&
> +	git commit -m "file with newline embedded" &&
> +	test_tick &&
> +
> +	printf "HEAD:newline${LF}embedded" >in &&
> +	git cat-file --batch-check -z <in >actual &&

As I already said, I suspect that new users who know how our path
quoting works would expect c-quoted path would work just fine
without using "-z".  It is not a reason to refuse "-z" to exist,
though.

> @@ -436,6 +465,11 @@ test_expect_success '--batch-command with multiple info calls gives correct form
>  	echo "$batch_command_multiple_info" >in &&
>  	git cat-file --batch-command --buffer <in >actual &&
>  
> +	test_cmp expect actual &&
> +
> +	echo "$batch_command_multiple_info" | tr "\n" "\0" >in &&

This is what I would expect.  The _info variable lacks final LF,
which is supplied by "echo", so output from tr ends with NUL, which
mirrors the line-oriented input we used above.

> +	git cat-file --batch-command --buffer -z <in >actual &&
> +
>  	test_cmp expect actual
>  '
>  
> @@ -459,6 +493,12 @@ test_expect_success '--batch-command with multiple command calls gives correct f
>  	echo "$batch_command_multiple_contents" >in &&
>  	git cat-file --batch-command --buffer <in >actual_raw &&
>  
> +	remove_timestamp <actual_raw >actual &&
> +	test_cmp expect actual &&
> +
> +	echo "$batch_command_multiple_contents" | tr "\n" "\0" >in &&
> +	git cat-file --batch-command --buffer -z <in >actual_raw &&
> +

Likewise.
Junio C Hamano July 23, 2022, 5:45 p.m. UTC | #4
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> This doesn't apply to the rest of cat-file directly, but the origin of
> the recent --batch-command mode cdoe was to lift the same-ish code from
> builtin/update-ref.c, whose \n or \0 mode does exactly that:
>
> 	while (!strbuf_getwholeline(&input, stdin, line_termination)) {
>
> I.e. it doesn't support \r\n, just \n or \0.
>
> Isn't that fine? I may be missing something, but why isn't it OK even on
> platforms that use \r\n for their normal *.txt endings to only use \n or
> \0 for this bit of protocol?
>
> So maybe we should just do something like that instead? I.e. declare
> that a mistake.

Good point.

> So isn't this just an accident in of us having used the strbuf_getline()
> function to mean "\n", but actually it also does "\r\n".
>
> Which is a really unfortunately named function b.t.w., since it sneaks
> this bit of Windows portability into places that may not want it in the
> first place.

getline() is to read a single "logical" line, so it is fine for it
to strip CRLF on platforms with CRLF, and to leave CR at the end of
line on a LF platform.  If the "protocol" is defined to use LF on
any platform (and allow a payload that ends with CR to be passed),
you can argue that it is a wrong helper to call.

But does that result in a sensible behaviour?  I am not sure.  Some
editors that are popular on LF platforms can produce CRLF files when
the user asks (either on purpose or by mistake), and when not using
the "-z" mode of input, I suspect that most users would expect CR at
the end of the "line" (terminated with LF on their platform of
choice) would be thrown away even on their LF platform, simply
because it is unlikely that the kind of input they are preparing can
usefully and legitimately end with CR, as their colleagues on CRLF
platforms may also have to produce similar input.  IOW, the presence
of CRLF platforms makes text lines that end with CR much less useful
everywhere.

And from that point of view, "getline() or getwholeline(NUL)" may be
a pattern that is practically more useful than the old "use always
getwholeline(NUL or LF)" convention that I invented more than 10
years ago.
Taylor Blau July 25, 2022, 11:39 p.m. UTC | #5
On Fri, Jul 22, 2022 at 04:41:54PM -0700, Chris Torek wrote:
> > It's tempting to think that we could use `strbuf_getwholeline()` and
> > specify either `\n` or `\0` as the terminating character. ...
>
> How about:
>
>     int (*get)(struct strbuf *, FILE *);
>     get = opt->nul_terminated ? strbuf_getline_nul : strbuf_getline;
>     while (get(&buf, stdin) != EOF)
>
> or similar?

I thought about it, but it seemed cleaner to lift the ternary out to
capture the result, too, leading to the if/else-statement I sent in the
patch above.

If we wanted to change it, I'd probably suggest a more general-purpose
strbuf function that implemented this behavior through a single call.
But it sounds like supporting the carriage return character was a
mistake from the beginning, which simplifies this direction quite a bit,
anyways.

Thanks,
Taylor
Taylor Blau July 25, 2022, 11:44 p.m. UTC | #6
Hi Junio,

On Sat, Jul 23, 2022 at 10:45:05AM -0700, Junio C Hamano wrote:
> > So isn't this just an accident in of us having used the strbuf_getline()
> > function to mean "\n", but actually it also does "\r\n".
> >
> > Which is a really unfortunately named function b.t.w., since it sneaks
> > this bit of Windows portability into places that may not want it in the
> > first place.
>
> getline() is to read a single "logical" line, so it is fine for it
> to strip CRLF on platforms with CRLF, and to leave CR at the end of
> line on a LF platform.  If the "protocol" is defined to use LF on
> any platform (and allow a payload that ends with CR to be passed),
> you can argue that it is a wrong helper to call.

Reading your email up to this point makes me think that we should ignore
any CR bytes we see next to a LF. So by this point I think that we
should take Ævar's suggestion and call:

    strbuf_getwholeline(&buf, stdin, opt->nul_terminated ? '\0' : '\n');

But...

> But does that result in a sensible behaviour?  I am not sure.  Some
> editors that are popular on LF platforms can produce CRLF files when
> the user asks (either on purpose or by mistake), and when not using
> the "-z" mode of input, I suspect that most users would expect CR at
> the end of the "line" (terminated with LF on their platform of
> choice) would be thrown away even on their LF platform, simply
> because it is unlikely that the kind of input they are preparing can
> usefully and legitimately end with CR, as their colleagues on CRLF
> platforms may also have to produce similar input.  IOW, the presence
> of CRLF platforms makes text lines that end with CR much less useful
> everywhere.
>
> And from that point of view, "getline() or getwholeline(NUL)" may be
> a pattern that is practically more useful than the old "use always
> getwholeline(NUL or LF)" convention that I invented more than 10
> years ago.

This makes me think that we should retain support for dropping the CR
preceeeding a LF and treat it as a historical wart that we are stuck
supporting.

Do you have a preference? I am fine with either approach, FWIW.

Thanks,
Taylor
Taylor Blau July 25, 2022, 11:50 p.m. UTC | #7
On Fri, Jul 22, 2022 at 10:35:43PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@ttaylorr.com> writes:
>
> > @@ -14,7 +14,7 @@ SYNOPSIS
> >  'git cat-file' (-t | -s) [--allow-unknown-type] <object>
> >  'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects]
> >  	     [--buffer] [--follow-symlinks] [--unordered]
> > -	     [--textconv | --filters]
> > +	     [--textconv | --filters] [-z]
>
> Is "-z" useful with any other option, or is it useful only in
> combination with one of the three --batch-*?  The above suggests the
> former.

It only makes sense with `--batch`-related options. But doesn't the
above suggest the latter, not the former? That synopsis line begins
with:

    'git cat-file' (--batch | --batch-check | --batch-command) ...

which made me think that this was the invocation for batch-related
options, and only listed options that made sense with a `--batch` mode
of one kind or another.

> > +test_expect_success '--batch, -z with multiple sha1s gives correct format' '
> > +	echo_without_newline_nul "$batch_input" >in &&
>
> I I recall [1/2] correctly, the input lacked the LF at the end.  In
> the original "LF terminated" use converted to use these variables,
> because $batch_*_input is "echo"ed to create the file "in", the lack
> of LF at the end is a GOOD thing.
>
> But here, echo_without_newline_nul is just a glorified "printf %s"
> piped into tr to turn LF into NUL.  What is fed by printf into the
> pipe lacks LF at the end, so the output from tr will not have NUL at
> the end, either.
>
> That might happen to work (because the EOF may be enough to signal
> the end of the entire input, thus the last input item), but it does
> not make the test case for "-z" exactly parallel to the line oriented
> input.

I see what you're saying. And, yeah, I think it happens to work since we
treat EOF as marking the end of the last input element, regardless of
whether or not we saw a NUL byte or a LF (depending on whether or not we
passed `-z`).

I think the helper should probably be something more like:

    echo_with_nul () {
        echo "$@" | tr '\n' '\0'
    }

or similar. But as you note below, this is probably not even worth
extracting to a helper function.

'
> > +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" '
> > +    echo_without_newline_nul "$batch_check_input" >in &&
> > +    test "$batch_check_output" = "$(git cat-file --batch-check -z <in)"
> > +'
> > +
> > +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' '
> > +	touch -- "newline${LF}embedded" &&
> > +	git add -- "newline${LF}embedded" &&
> > +	git commit -m "file with newline embedded" &&
> > +	test_tick &&
> > +
> > +	printf "HEAD:newline${LF}embedded" >in &&
> > +	git cat-file --batch-check -z <in >actual &&
>
> As I already said, I suspect that new users who know how our path
> quoting works would expect c-quoted path would work just fine
> without using "-z".  It is not a reason to refuse "-z" to exist,
> though.

Yeah. I think we can do both, if there is a need. I suspect that just
`-z` support would be sufficient for now, but I agree that one doesn't
need to tie up the other.

> > @@ -436,6 +465,11 @@ test_expect_success '--batch-command with multiple info calls gives correct form
> >  	echo "$batch_command_multiple_info" >in &&
> >  	git cat-file --batch-command --buffer <in >actual &&
> >
> > +	test_cmp expect actual &&
> > +
> > +	echo "$batch_command_multiple_info" | tr "\n" "\0" >in &&
>
> This is what I would expect.  The _info variable lacks final LF,
> which is supplied by "echo", so output from tr ends with NUL, which
> mirrors the line-oriented input we used above.

Yep.

> > +	git cat-file --batch-command --buffer -z <in >actual &&
> > +
> >  	test_cmp expect actual
> >  '
> >
> > @@ -459,6 +493,12 @@ test_expect_success '--batch-command with multiple command calls gives correct f
> >  	echo "$batch_command_multiple_contents" >in &&
> >  	git cat-file --batch-command --buffer <in >actual_raw &&
> >
> > +	remove_timestamp <actual_raw >actual &&
> > +	test_cmp expect actual &&
> > +
> > +	echo "$batch_command_multiple_contents" | tr "\n" "\0" >in &&
> > +	git cat-file --batch-command --buffer -z <in >actual_raw &&
> > +
>
> Likewise.

Ditto, thanks.

Thanks,
Taylor
Junio C Hamano July 27, 2022, 2:10 p.m. UTC | #8
Taylor Blau <me@ttaylorr.com> writes:

> Do you have a preference? I am fine with either approach, FWIW.

I do not have a strong preference, but in "text" mode, I suspect
that the users may find it more convenient if we discarded CR at the
end of the line even on LF platforms.

Thanks.
Junio C Hamano July 27, 2022, 2:20 p.m. UTC | #9
Taylor Blau <me@ttaylorr.com> writes:

> On Fri, Jul 22, 2022 at 10:35:43PM -0700, Junio C Hamano wrote:
>> Taylor Blau <me@ttaylorr.com> writes:
>>
>> > @@ -14,7 +14,7 @@ SYNOPSIS
>> >  'git cat-file' (-t | -s) [--allow-unknown-type] <object>
>> >  'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects]
>> >  	     [--buffer] [--follow-symlinks] [--unordered]
>> > -	     [--textconv | --filters]
>> > +	     [--textconv | --filters] [-z]
>>
>> Is "-z" useful with any other option, or is it useful only in
>> combination with one of the three --batch-*?  The above suggests the
>> former.
>
> It only makes sense with `--batch`-related options. But doesn't the
> above suggest the latter, not the former? That synopsis line begins
> with:
>
>     'git cat-file' (--batch | --batch-check | --batch-command) ...
>
> which made me think that this was the invocation for batch-related
> options, and only listed options that made sense with a `--batch` mode
> of one kind or another.

Ah, yes you are absolutely right.  I misread the synopsis.

> I think the helper should probably be something more like:
>
>     echo_with_nul () {
>         echo "$@" | tr '\n' '\0'
>     }

I think you meant

	printf "%s\n" "$@" | tr ..

but then I wonder if

	printf "%s\000" "$@"

would work better?

>> > +	printf "HEAD:newline${LF}embedded" >in &&
>> > +	git cat-file --batch-check -z <in >actual &&
>>
>> As I already said, I suspect that new users who know how our path
>> quoting works would expect c-quoted path would work just fine
>> without using "-z".  It is not a reason to refuse "-z" to exist,
>> though.
>
> Yeah. I think we can do both, if there is a need.

Even though we know it is needed already, it is unfortunately way
too late now X-<, because existing scripts know that paths are not
taken as c-quoted, even though people would naturally expect that
paths are, and satisfying the latter expectation now means breaking
existing scripts.

In any case, as long as "-z" is designed right from the day one, it
would be fine ;-)

Thanks.
Phillip Wood July 31, 2022, 3:50 p.m. UTC | #10
Hi Taylor

Thanks for working on this, I've been annoyed by the lack of a "-z" 
option but never got round to doing anything about it.

On 23/07/2022 00:29, Taylor Blau wrote:
> When callers are using `cat-file` via one of the stdin-driven `--batch`
> modes, all input is newline-delimited. This presents a problem when
> callers wish to ask about, e.g. tree-entries that have a newline
> character present in their filename.
> 
> To support this niche scenario, introduce a new `-z` mode to the
> `--batch`, `--batch-check`, and `--batch-command` suite of options that
> instructs `cat-file` to treat its input as NUL-delimited, allowing the
> individual commands themselves to have newlines present.

I wonder if this should also change the output delimiter from NL to NUL.

printf 'HEAD:does\nnot\nexist\000' | bin-wrappers/git cat-file 
--batch-check -z

prints

HEAD:does
not
exist missing

If it was terminated by a NUL rather than a newline it would be much 
easier to parse.

Best Wishes

Phillip

> The refactoring here is slightly unfortunate, since we turn loops like:
> 
>      while (strbuf_getline(&buf, stdin) != EOF)
> 
> into:
> 
>      while (1) {
>          int ret;
>          if (opt->nul_terminated)
>              ret = strbuf_getline_nul(&input, stdin);
>          else
>              ret = strbuf_getline(&input, stdin);
> 
>          if (ret == EOF)
>              break;
>      }
> 
> It's tempting to think that we could use `strbuf_getwholeline()` and
> specify either `\n` or `\0` as the terminating character. But for input
> on platforms that include a CR character preceeding the LF, this
> wouldn't quite be the same, since `strbuf_getline(...)` will trim any
> trailing CR, while `strbuf_getwholeline(&buf, stdin, '\n')` will not.
> 
> In the future, we could clean this up further by introducing a variant
> of `strbuf_getwholeline()` that addresses the aforementioned gap, but
> that approach felt too heavy-handed for this pair of uses.
> 
> Some tests are added in t1006 to ensure that `cat-file` produces the
> same output in `--batch`, `--batch-check`, and `--batch-command` modes
> with and without the new `-z` option.
> 
> Signed-off-by: Taylor Blau <me@ttaylorr.com>
> ---
>   Documentation/git-cat-file.txt |  7 +++++-
>   builtin/cat-file.c             | 28 ++++++++++++++++++++---
>   t/t1006-cat-file.sh            | 42 +++++++++++++++++++++++++++++++++-
>   3 files changed, 72 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index 24a811f0ef..3515350ed6 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -14,7 +14,7 @@ SYNOPSIS
>   'git cat-file' (-t | -s) [--allow-unknown-type] <object>
>   'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects]
>   	     [--buffer] [--follow-symlinks] [--unordered]
> -	     [--textconv | --filters]
> +	     [--textconv | --filters] [-z]
>   'git cat-file' (--textconv | --filters)
>   	     [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>]
>   
> @@ -207,6 +207,11 @@ respectively print:
>   	/etc/passwd
>   --
>   
> +-z::
> +	Only meaningful with `--batch`, `--batch-check`, or
> +	`--batch-command`; input is NUL-delimited instead of
> +	newline-delimited.
> +
>   
>   OUTPUT
>   ------
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index f42782e955..c3602d15df 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -31,6 +31,7 @@ struct batch_options {
>   	int all_objects;
>   	int unordered;
>   	int transform_mode; /* may be 'w' or 'c' for --filters or --textconv */
> +	int nul_terminated;
>   	const char *format;
>   };
>   
> @@ -614,12 +615,20 @@ static void batch_objects_command(struct batch_options *opt,
>   	struct queued_cmd *queued_cmd = NULL;
>   	size_t alloc = 0, nr = 0;
>   
> -	while (!strbuf_getline(&input, stdin)) {
> -		int i;
> +	while (1) {
> +		int i, ret;
>   		const struct parse_cmd *cmd = NULL;
>   		const char *p = NULL, *cmd_end;
>   		struct queued_cmd call = {0};
>   
> +		if (opt->nul_terminated)
> +			ret = strbuf_getline_nul(&input, stdin);
> +		else
> +			ret = strbuf_getline(&input, stdin);
> +
> +		if (ret)
> +			break;
> +
>   		if (!input.len)
>   			die(_("empty command in input"));
>   		if (isspace(*input.buf))
> @@ -763,7 +772,16 @@ static int batch_objects(struct batch_options *opt)
>   		goto cleanup;
>   	}
>   
> -	while (strbuf_getline(&input, stdin) != EOF) {
> +	while (1) {
> +		int ret;
> +		if (opt->nul_terminated)
> +			ret = strbuf_getline_nul(&input, stdin);
> +		else
> +			ret = strbuf_getline(&input, stdin);
> +
> +		if (ret == EOF)
> +			break;
> +
>   		if (data.split_on_whitespace) {
>   			/*
>   			 * Split at first whitespace, tying off the beginning
> @@ -866,6 +884,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>   			N_("like --batch, but don't emit <contents>"),
>   			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
>   			batch_option_callback),
> +		OPT_BOOL('z', NULL, &batch.nul_terminated, N_("stdin is NUL-terminated")),
>   		OPT_CALLBACK_F(0, "batch-command", &batch, N_("format"),
>   			N_("read commands from stdin"),
>   			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
> @@ -921,6 +940,9 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
>   	else if (batch.all_objects)
>   		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
>   			       "--batch-all-objects");
> +	else if (batch.nul_terminated)
> +		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
> +			       "-z");
>   
>   	/* Batch defaults */
>   	if (batch.buffer_output < 0)
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 01c0535765..23b8942edb 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -88,7 +88,8 @@ done
>   
>   for opt in --buffer \
>   	--follow-symlinks \
> -	--batch-all-objects
> +	--batch-all-objects \
> +	-z
>   do
>   	test_expect_success "usage: bad option combination: $opt without batch mode" '
>   		test_incompatible_usage git cat-file $opt &&
> @@ -100,6 +101,10 @@ echo_without_newline () {
>       printf '%s' "$*"
>   }
>   
> +echo_without_newline_nul () {
> +	echo_without_newline "$@" | tr '\n' '\0'
> +}
> +
>   strlen () {
>       echo_without_newline "$1" | wc -c | sed -e 's/^ *//'
>   }
> @@ -398,6 +403,12 @@ test_expect_success '--batch with multiple sha1s gives correct format' '
>   	test "$(maybe_remove_timestamp "$batch_output" 1)" = "$(maybe_remove_timestamp "$(echo_without_newline "$batch_input" | git cat-file --batch)" 1)"
>   '
>   
> +test_expect_success '--batch, -z with multiple sha1s gives correct format' '
> +	echo_without_newline_nul "$batch_input" >in &&
> +	test "$(maybe_remove_timestamp "$batch_output" 1)" = \
> +	"$(maybe_remove_timestamp "$(git cat-file --batch -z <in)" 1)"
> +'
> +
>   batch_check_input="$hello_sha1
>   $tree_sha1
>   $commit_sha1
> @@ -418,6 +429,24 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" '
>       "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
>   '
>   
> +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" '
> +    echo_without_newline_nul "$batch_check_input" >in &&
> +    test "$batch_check_output" = "$(git cat-file --batch-check -z <in)"
> +'
> +
> +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' '
> +	touch -- "newline${LF}embedded" &&
> +	git add -- "newline${LF}embedded" &&
> +	git commit -m "file with newline embedded" &&
> +	test_tick &&
> +
> +	printf "HEAD:newline${LF}embedded" >in &&
> +	git cat-file --batch-check -z <in >actual &&
> +
> +	echo "$(git rev-parse "HEAD:newline${LF}embedded") blob 0" >expect &&
> +	test_cmp expect actual
> +'
> +
>   batch_command_multiple_info="info $hello_sha1
>   info $tree_sha1
>   info $commit_sha1
> @@ -436,6 +465,11 @@ test_expect_success '--batch-command with multiple info calls gives correct form
>   	echo "$batch_command_multiple_info" >in &&
>   	git cat-file --batch-command --buffer <in >actual &&
>   
> +	test_cmp expect actual &&
> +
> +	echo "$batch_command_multiple_info" | tr "\n" "\0" >in &&
> +	git cat-file --batch-command --buffer -z <in >actual &&
> +
>   	test_cmp expect actual
>   '
>   
> @@ -459,6 +493,12 @@ test_expect_success '--batch-command with multiple command calls gives correct f
>   	echo "$batch_command_multiple_contents" >in &&
>   	git cat-file --batch-command --buffer <in >actual_raw &&
>   
> +	remove_timestamp <actual_raw >actual &&
> +	test_cmp expect actual &&
> +
> +	echo "$batch_command_multiple_contents" | tr "\n" "\0" >in &&
> +	git cat-file --batch-command --buffer -z <in >actual_raw &&
> +
>   	remove_timestamp <actual_raw >actual &&
>   	test_cmp expect actual
>   '
Ævar Arnfjörð Bjarmason Aug. 11, 2022, 11:52 a.m. UTC | #11
On Fri, Jul 22 2022, Taylor Blau wrote:

This change:

> diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
> index 24a811f0ef..3515350ed6 100644
> --- a/Documentation/git-cat-file.txt
> +++ b/Documentation/git-cat-file.txt
> @@ -14,7 +14,7 @@ SYNOPSIS
>  'git cat-file' (-t | -s) [--allow-unknown-type] <object>
>  'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects]
>  	     [--buffer] [--follow-symlinks] [--unordered]
> -	     [--textconv | --filters]
> +	     [--textconv | --filters] [-z]
>  'git cat-file' (--textconv | --filters)
>  	     [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>]
>  
> [...]
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index f42782e955..c3602d15df 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -31,6 +31,7 @@ struct batch_options {
>  	int all_objects;
>  	int unordered;
>  	int transform_mode; /* may be 'w' or 'c' for --filters or --textconv */
> +	int nul_terminated;
>  	const char *format;
>  };

Is missing a corresponding change to the -h output here. Before this
they were the same, but now:
	
	+ diff -u txt help
	--- txt 2022-08-11 11:53:00.235221628 +0000
	+++ help        2022-08-11 11:53:00.239221594 +0000
	@@ -3,6 +3,6 @@
	 git cat-file (-t | -s) [--allow-unknown-type] <object>
	 git cat-file (--batch | --batch-check | --batch-command) [--batch-all-objects]
	              [--buffer] [--follow-symlinks] [--unordered]
	-             [--textconv | --filters] [-z]
	+             [--textconv | --filters]
	 git cat-file (--textconv | --filters)
	              [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>]

(Spotted with some local patches of mine which automatically check this)
diff mbox series

Patch

diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
index 24a811f0ef..3515350ed6 100644
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -14,7 +14,7 @@  SYNOPSIS
 'git cat-file' (-t | -s) [--allow-unknown-type] <object>
 'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects]
 	     [--buffer] [--follow-symlinks] [--unordered]
-	     [--textconv | --filters]
+	     [--textconv | --filters] [-z]
 'git cat-file' (--textconv | --filters)
 	     [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>]
 
@@ -207,6 +207,11 @@  respectively print:
 	/etc/passwd
 --
 
+-z::
+	Only meaningful with `--batch`, `--batch-check`, or
+	`--batch-command`; input is NUL-delimited instead of
+	newline-delimited.
+
 
 OUTPUT
 ------
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index f42782e955..c3602d15df 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -31,6 +31,7 @@  struct batch_options {
 	int all_objects;
 	int unordered;
 	int transform_mode; /* may be 'w' or 'c' for --filters or --textconv */
+	int nul_terminated;
 	const char *format;
 };
 
@@ -614,12 +615,20 @@  static void batch_objects_command(struct batch_options *opt,
 	struct queued_cmd *queued_cmd = NULL;
 	size_t alloc = 0, nr = 0;
 
-	while (!strbuf_getline(&input, stdin)) {
-		int i;
+	while (1) {
+		int i, ret;
 		const struct parse_cmd *cmd = NULL;
 		const char *p = NULL, *cmd_end;
 		struct queued_cmd call = {0};
 
+		if (opt->nul_terminated)
+			ret = strbuf_getline_nul(&input, stdin);
+		else
+			ret = strbuf_getline(&input, stdin);
+
+		if (ret)
+			break;
+
 		if (!input.len)
 			die(_("empty command in input"));
 		if (isspace(*input.buf))
@@ -763,7 +772,16 @@  static int batch_objects(struct batch_options *opt)
 		goto cleanup;
 	}
 
-	while (strbuf_getline(&input, stdin) != EOF) {
+	while (1) {
+		int ret;
+		if (opt->nul_terminated)
+			ret = strbuf_getline_nul(&input, stdin);
+		else
+			ret = strbuf_getline(&input, stdin);
+
+		if (ret == EOF)
+			break;
+
 		if (data.split_on_whitespace) {
 			/*
 			 * Split at first whitespace, tying off the beginning
@@ -866,6 +884,7 @@  int cmd_cat_file(int argc, const char **argv, const char *prefix)
 			N_("like --batch, but don't emit <contents>"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
 			batch_option_callback),
+		OPT_BOOL('z', NULL, &batch.nul_terminated, N_("stdin is NUL-terminated")),
 		OPT_CALLBACK_F(0, "batch-command", &batch, N_("format"),
 			N_("read commands from stdin"),
 			PARSE_OPT_OPTARG | PARSE_OPT_NONEG,
@@ -921,6 +940,9 @@  int cmd_cat_file(int argc, const char **argv, const char *prefix)
 	else if (batch.all_objects)
 		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
 			       "--batch-all-objects");
+	else if (batch.nul_terminated)
+		usage_msg_optf(_("'%s' requires a batch mode"), usage, options,
+			       "-z");
 
 	/* Batch defaults */
 	if (batch.buffer_output < 0)
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 01c0535765..23b8942edb 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -88,7 +88,8 @@  done
 
 for opt in --buffer \
 	--follow-symlinks \
-	--batch-all-objects
+	--batch-all-objects \
+	-z
 do
 	test_expect_success "usage: bad option combination: $opt without batch mode" '
 		test_incompatible_usage git cat-file $opt &&
@@ -100,6 +101,10 @@  echo_without_newline () {
     printf '%s' "$*"
 }
 
+echo_without_newline_nul () {
+	echo_without_newline "$@" | tr '\n' '\0'
+}
+
 strlen () {
     echo_without_newline "$1" | wc -c | sed -e 's/^ *//'
 }
@@ -398,6 +403,12 @@  test_expect_success '--batch with multiple sha1s gives correct format' '
 	test "$(maybe_remove_timestamp "$batch_output" 1)" = "$(maybe_remove_timestamp "$(echo_without_newline "$batch_input" | git cat-file --batch)" 1)"
 '
 
+test_expect_success '--batch, -z with multiple sha1s gives correct format' '
+	echo_without_newline_nul "$batch_input" >in &&
+	test "$(maybe_remove_timestamp "$batch_output" 1)" = \
+	"$(maybe_remove_timestamp "$(git cat-file --batch -z <in)" 1)"
+'
+
 batch_check_input="$hello_sha1
 $tree_sha1
 $commit_sha1
@@ -418,6 +429,24 @@  test_expect_success "--batch-check with multiple sha1s gives correct format" '
     "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)"
 '
 
+test_expect_success "--batch-check, -z with multiple sha1s gives correct format" '
+    echo_without_newline_nul "$batch_check_input" >in &&
+    test "$batch_check_output" = "$(git cat-file --batch-check -z <in)"
+'
+
+test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' '
+	touch -- "newline${LF}embedded" &&
+	git add -- "newline${LF}embedded" &&
+	git commit -m "file with newline embedded" &&
+	test_tick &&
+
+	printf "HEAD:newline${LF}embedded" >in &&
+	git cat-file --batch-check -z <in >actual &&
+
+	echo "$(git rev-parse "HEAD:newline${LF}embedded") blob 0" >expect &&
+	test_cmp expect actual
+'
+
 batch_command_multiple_info="info $hello_sha1
 info $tree_sha1
 info $commit_sha1
@@ -436,6 +465,11 @@  test_expect_success '--batch-command with multiple info calls gives correct form
 	echo "$batch_command_multiple_info" >in &&
 	git cat-file --batch-command --buffer <in >actual &&
 
+	test_cmp expect actual &&
+
+	echo "$batch_command_multiple_info" | tr "\n" "\0" >in &&
+	git cat-file --batch-command --buffer -z <in >actual &&
+
 	test_cmp expect actual
 '
 
@@ -459,6 +493,12 @@  test_expect_success '--batch-command with multiple command calls gives correct f
 	echo "$batch_command_multiple_contents" >in &&
 	git cat-file --batch-command --buffer <in >actual_raw &&
 
+	remove_timestamp <actual_raw >actual &&
+	test_cmp expect actual &&
+
+	echo "$batch_command_multiple_contents" | tr "\n" "\0" >in &&
+	git cat-file --batch-command --buffer -z <in >actual_raw &&
+
 	remove_timestamp <actual_raw >actual &&
 	test_cmp expect actual
 '