Message ID | ed1583223f63cfde99829069f14af62e4f0f2a82.1658532524.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Accepted |
Commit | db9d67f2e9c9ea389f5558d6a168460d51631769 |
Headers | show |
Series | cat-file: support NUL-delimited input with `-z` | expand |
On Fri, Jul 22, 2022 at 4:34 PM Taylor Blau <me@ttaylorr.com> wrote: > The refactoring here is slightly unfortunate, since we turn loops like: > > while (strbuf_getline(&buf, stdin) != EOF) > > into: > > while (1) { > int ret; > if (opt->nul_terminated) > ret = strbuf_getline_nul(&input, stdin); > else > ret = strbuf_getline(&input, stdin); > > if (ret == EOF) > break; > } > > It's tempting to think that we could use `strbuf_getwholeline()` and > specify either `\n` or `\0` as the terminating character. ... How about: int (*get)(struct strbuf *, FILE *); get = opt->nul_terminated ? strbuf_getline_nul : strbuf_getline; while (get(&buf, stdin) != EOF) or similar? Chris
On Fri, Jul 22 2022, Taylor Blau wrote: [I think John Cai would appreciate being CC'd on this] > It's tempting to think that we could use `strbuf_getwholeline()` and > specify either `\n` or `\0` as the terminating character. But for input > on platforms that include a CR character preceeding the LF, this > wouldn't quite be the same, since `strbuf_getline(...)` will trim any > trailing CR, while `strbuf_getwholeline(&buf, stdin, '\n')` will not. I commend the effort to maintain bug-for-bug compatibility, but do we care about this distinction, or is it just an accident that we support \r\n here in the first place? This doesn't apply to the rest of cat-file directly, but the origin of the recent --batch-command mode cdoe was to lift the same-ish code from builtin/update-ref.c, whose \n or \0 mode does exactly that: while (!strbuf_getwholeline(&input, stdin, line_termination)) { I.e. it doesn't support \r\n, just \n or \0. Isn't that fine? I may be missing something, but why isn't it OK even on platforms that use \r\n for their normal *.txt endings to only use \n or \0 for this bit of protocol? For the command mode at least this passes our tests: diff --git a/builtin/cat-file.c b/builtin/cat-file.c index f42782e955f..8646059472d 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -614,12 +614,16 @@ static void batch_objects_command(struct batch_options *opt, struct queued_cmd *queued_cmd = NULL; size_t alloc = 0, nr = 0; - while (!strbuf_getline(&input, stdin)) { + while (!strbuf_getwholeline(&input, stdin, '\n')) { int i; const struct parse_cmd *cmd = NULL; const char *p = NULL, *cmd_end; struct queued_cmd call = {0}; + if (input.len > 0 && input.buf[input.len - 1] == '\n') + --input.len; + input.buf[input.len] = '\0'; + if (!input.len) die(_("empty command in input")); if (isspace(*input.buf)) So maybe we should just do something like that instead? I.e. declare that a mistake. As for the rest of cat-file 05d5667fec9 (git-cat-file: Add --batch-check option, 2008-04-23) documents that it's LF, not CR LF, ditto git-cat-file.txt. So isn't this just an accident in of us having used the strbuf_getline() function to mean "\n", but actually it also does "\r\n". Which is a really unfortunately named function b.t.w., since it sneaks this bit of Windows portability into places that may not want it in the first place. > strlen () { > echo_without_newline "$1" | wc -c | sed -e 's/^ *//' > } > @@ -398,6 +403,12 @@ test_expect_success '--batch with multiple sha1s gives correct format' ' > test "$(maybe_remove_timestamp "$batch_output" 1)" = "$(maybe_remove_timestamp "$(echo_without_newline "$batch_input" | git cat-file --batch)" 1)" > ' > > +test_expect_success '--batch, -z with multiple sha1s gives correct format' ' > + echo_without_newline_nul "$batch_input" >in && > + test "$(maybe_remove_timestamp "$batch_output" 1)" = \ > + "$(maybe_remove_timestamp "$(git cat-file --batch -z <in)" 1)" This... > +' > + > batch_check_input="$hello_sha1 > $tree_sha1 > $commit_sha1 > @@ -418,6 +429,24 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" ' > "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)" > ' > > +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" ' > + echo_without_newline_nul "$batch_check_input" >in && > + test "$batch_check_output" = "$(git cat-file --batch-check -z <in)" This.... > +' > + > +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' > + touch -- "newline${LF}embedded" && > + git add -- "newline${LF}embedded" && > + git commit -m "file with newline embedded" && > + test_tick && > + > + printf "HEAD:newline${LF}embedded" >in && > + git cat-file --batch-check -z <in >actual && > + > + echo "$(git rev-parse "HEAD:newline${LF}embedded") blob 0" >expect && ..and this hides git's exit code, better to pipe to a file, use test_cmp etc. etc.
Taylor Blau <me@ttaylorr.com> writes: > @@ -14,7 +14,7 @@ SYNOPSIS > 'git cat-file' (-t | -s) [--allow-unknown-type] <object> > 'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects] > [--buffer] [--follow-symlinks] [--unordered] > - [--textconv | --filters] > + [--textconv | --filters] [-z] Is "-z" useful with any other option, or is it useful only in combination with one of the three --batch-*? The above suggests the former. > +test_expect_success '--batch, -z with multiple sha1s gives correct format' ' > + echo_without_newline_nul "$batch_input" >in && I I recall [1/2] correctly, the input lacked the LF at the end. In the original "LF terminated" use converted to use these variables, because $batch_*_input is "echo"ed to create the file "in", the lack of LF at the end is a GOOD thing. But here, echo_without_newline_nul is just a glorified "printf %s" piped into tr to turn LF into NUL. What is fed by printf into the pipe lacks LF at the end, so the output from tr will not have NUL at the end, either. That might happen to work (because the EOF may be enough to signal the end of the entire input, thus the last input item), but it does not make the test case for "-z" exactly parallel to the line oriented input. > +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" ' > + echo_without_newline_nul "$batch_check_input" >in && > + test "$batch_check_output" = "$(git cat-file --batch-check -z <in)" > +' > + > +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' > + touch -- "newline${LF}embedded" && > + git add -- "newline${LF}embedded" && > + git commit -m "file with newline embedded" && > + test_tick && > + > + printf "HEAD:newline${LF}embedded" >in && > + git cat-file --batch-check -z <in >actual && As I already said, I suspect that new users who know how our path quoting works would expect c-quoted path would work just fine without using "-z". It is not a reason to refuse "-z" to exist, though. > @@ -436,6 +465,11 @@ test_expect_success '--batch-command with multiple info calls gives correct form > echo "$batch_command_multiple_info" >in && > git cat-file --batch-command --buffer <in >actual && > > + test_cmp expect actual && > + > + echo "$batch_command_multiple_info" | tr "\n" "\0" >in && This is what I would expect. The _info variable lacks final LF, which is supplied by "echo", so output from tr ends with NUL, which mirrors the line-oriented input we used above. > + git cat-file --batch-command --buffer -z <in >actual && > + > test_cmp expect actual > ' > > @@ -459,6 +493,12 @@ test_expect_success '--batch-command with multiple command calls gives correct f > echo "$batch_command_multiple_contents" >in && > git cat-file --batch-command --buffer <in >actual_raw && > > + remove_timestamp <actual_raw >actual && > + test_cmp expect actual && > + > + echo "$batch_command_multiple_contents" | tr "\n" "\0" >in && > + git cat-file --batch-command --buffer -z <in >actual_raw && > + Likewise.
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > This doesn't apply to the rest of cat-file directly, but the origin of > the recent --batch-command mode cdoe was to lift the same-ish code from > builtin/update-ref.c, whose \n or \0 mode does exactly that: > > while (!strbuf_getwholeline(&input, stdin, line_termination)) { > > I.e. it doesn't support \r\n, just \n or \0. > > Isn't that fine? I may be missing something, but why isn't it OK even on > platforms that use \r\n for their normal *.txt endings to only use \n or > \0 for this bit of protocol? > > So maybe we should just do something like that instead? I.e. declare > that a mistake. Good point. > So isn't this just an accident in of us having used the strbuf_getline() > function to mean "\n", but actually it also does "\r\n". > > Which is a really unfortunately named function b.t.w., since it sneaks > this bit of Windows portability into places that may not want it in the > first place. getline() is to read a single "logical" line, so it is fine for it to strip CRLF on platforms with CRLF, and to leave CR at the end of line on a LF platform. If the "protocol" is defined to use LF on any platform (and allow a payload that ends with CR to be passed), you can argue that it is a wrong helper to call. But does that result in a sensible behaviour? I am not sure. Some editors that are popular on LF platforms can produce CRLF files when the user asks (either on purpose or by mistake), and when not using the "-z" mode of input, I suspect that most users would expect CR at the end of the "line" (terminated with LF on their platform of choice) would be thrown away even on their LF platform, simply because it is unlikely that the kind of input they are preparing can usefully and legitimately end with CR, as their colleagues on CRLF platforms may also have to produce similar input. IOW, the presence of CRLF platforms makes text lines that end with CR much less useful everywhere. And from that point of view, "getline() or getwholeline(NUL)" may be a pattern that is practically more useful than the old "use always getwholeline(NUL or LF)" convention that I invented more than 10 years ago.
On Fri, Jul 22, 2022 at 04:41:54PM -0700, Chris Torek wrote: > > It's tempting to think that we could use `strbuf_getwholeline()` and > > specify either `\n` or `\0` as the terminating character. ... > > How about: > > int (*get)(struct strbuf *, FILE *); > get = opt->nul_terminated ? strbuf_getline_nul : strbuf_getline; > while (get(&buf, stdin) != EOF) > > or similar? I thought about it, but it seemed cleaner to lift the ternary out to capture the result, too, leading to the if/else-statement I sent in the patch above. If we wanted to change it, I'd probably suggest a more general-purpose strbuf function that implemented this behavior through a single call. But it sounds like supporting the carriage return character was a mistake from the beginning, which simplifies this direction quite a bit, anyways. Thanks, Taylor
Hi Junio, On Sat, Jul 23, 2022 at 10:45:05AM -0700, Junio C Hamano wrote: > > So isn't this just an accident in of us having used the strbuf_getline() > > function to mean "\n", but actually it also does "\r\n". > > > > Which is a really unfortunately named function b.t.w., since it sneaks > > this bit of Windows portability into places that may not want it in the > > first place. > > getline() is to read a single "logical" line, so it is fine for it > to strip CRLF on platforms with CRLF, and to leave CR at the end of > line on a LF platform. If the "protocol" is defined to use LF on > any platform (and allow a payload that ends with CR to be passed), > you can argue that it is a wrong helper to call. Reading your email up to this point makes me think that we should ignore any CR bytes we see next to a LF. So by this point I think that we should take Ævar's suggestion and call: strbuf_getwholeline(&buf, stdin, opt->nul_terminated ? '\0' : '\n'); But... > But does that result in a sensible behaviour? I am not sure. Some > editors that are popular on LF platforms can produce CRLF files when > the user asks (either on purpose or by mistake), and when not using > the "-z" mode of input, I suspect that most users would expect CR at > the end of the "line" (terminated with LF on their platform of > choice) would be thrown away even on their LF platform, simply > because it is unlikely that the kind of input they are preparing can > usefully and legitimately end with CR, as their colleagues on CRLF > platforms may also have to produce similar input. IOW, the presence > of CRLF platforms makes text lines that end with CR much less useful > everywhere. > > And from that point of view, "getline() or getwholeline(NUL)" may be > a pattern that is practically more useful than the old "use always > getwholeline(NUL or LF)" convention that I invented more than 10 > years ago. This makes me think that we should retain support for dropping the CR preceeeding a LF and treat it as a historical wart that we are stuck supporting. Do you have a preference? I am fine with either approach, FWIW. Thanks, Taylor
On Fri, Jul 22, 2022 at 10:35:43PM -0700, Junio C Hamano wrote: > Taylor Blau <me@ttaylorr.com> writes: > > > @@ -14,7 +14,7 @@ SYNOPSIS > > 'git cat-file' (-t | -s) [--allow-unknown-type] <object> > > 'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects] > > [--buffer] [--follow-symlinks] [--unordered] > > - [--textconv | --filters] > > + [--textconv | --filters] [-z] > > Is "-z" useful with any other option, or is it useful only in > combination with one of the three --batch-*? The above suggests the > former. It only makes sense with `--batch`-related options. But doesn't the above suggest the latter, not the former? That synopsis line begins with: 'git cat-file' (--batch | --batch-check | --batch-command) ... which made me think that this was the invocation for batch-related options, and only listed options that made sense with a `--batch` mode of one kind or another. > > +test_expect_success '--batch, -z with multiple sha1s gives correct format' ' > > + echo_without_newline_nul "$batch_input" >in && > > I I recall [1/2] correctly, the input lacked the LF at the end. In > the original "LF terminated" use converted to use these variables, > because $batch_*_input is "echo"ed to create the file "in", the lack > of LF at the end is a GOOD thing. > > But here, echo_without_newline_nul is just a glorified "printf %s" > piped into tr to turn LF into NUL. What is fed by printf into the > pipe lacks LF at the end, so the output from tr will not have NUL at > the end, either. > > That might happen to work (because the EOF may be enough to signal > the end of the entire input, thus the last input item), but it does > not make the test case for "-z" exactly parallel to the line oriented > input. I see what you're saying. And, yeah, I think it happens to work since we treat EOF as marking the end of the last input element, regardless of whether or not we saw a NUL byte or a LF (depending on whether or not we passed `-z`). I think the helper should probably be something more like: echo_with_nul () { echo "$@" | tr '\n' '\0' } or similar. But as you note below, this is probably not even worth extracting to a helper function. ' > > +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" ' > > + echo_without_newline_nul "$batch_check_input" >in && > > + test "$batch_check_output" = "$(git cat-file --batch-check -z <in)" > > +' > > + > > +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' > > + touch -- "newline${LF}embedded" && > > + git add -- "newline${LF}embedded" && > > + git commit -m "file with newline embedded" && > > + test_tick && > > + > > + printf "HEAD:newline${LF}embedded" >in && > > + git cat-file --batch-check -z <in >actual && > > As I already said, I suspect that new users who know how our path > quoting works would expect c-quoted path would work just fine > without using "-z". It is not a reason to refuse "-z" to exist, > though. Yeah. I think we can do both, if there is a need. I suspect that just `-z` support would be sufficient for now, but I agree that one doesn't need to tie up the other. > > @@ -436,6 +465,11 @@ test_expect_success '--batch-command with multiple info calls gives correct form > > echo "$batch_command_multiple_info" >in && > > git cat-file --batch-command --buffer <in >actual && > > > > + test_cmp expect actual && > > + > > + echo "$batch_command_multiple_info" | tr "\n" "\0" >in && > > This is what I would expect. The _info variable lacks final LF, > which is supplied by "echo", so output from tr ends with NUL, which > mirrors the line-oriented input we used above. Yep. > > + git cat-file --batch-command --buffer -z <in >actual && > > + > > test_cmp expect actual > > ' > > > > @@ -459,6 +493,12 @@ test_expect_success '--batch-command with multiple command calls gives correct f > > echo "$batch_command_multiple_contents" >in && > > git cat-file --batch-command --buffer <in >actual_raw && > > > > + remove_timestamp <actual_raw >actual && > > + test_cmp expect actual && > > + > > + echo "$batch_command_multiple_contents" | tr "\n" "\0" >in && > > + git cat-file --batch-command --buffer -z <in >actual_raw && > > + > > Likewise. Ditto, thanks. Thanks, Taylor
Taylor Blau <me@ttaylorr.com> writes:
> Do you have a preference? I am fine with either approach, FWIW.
I do not have a strong preference, but in "text" mode, I suspect
that the users may find it more convenient if we discarded CR at the
end of the line even on LF platforms.
Thanks.
Taylor Blau <me@ttaylorr.com> writes: > On Fri, Jul 22, 2022 at 10:35:43PM -0700, Junio C Hamano wrote: >> Taylor Blau <me@ttaylorr.com> writes: >> >> > @@ -14,7 +14,7 @@ SYNOPSIS >> > 'git cat-file' (-t | -s) [--allow-unknown-type] <object> >> > 'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects] >> > [--buffer] [--follow-symlinks] [--unordered] >> > - [--textconv | --filters] >> > + [--textconv | --filters] [-z] >> >> Is "-z" useful with any other option, or is it useful only in >> combination with one of the three --batch-*? The above suggests the >> former. > > It only makes sense with `--batch`-related options. But doesn't the > above suggest the latter, not the former? That synopsis line begins > with: > > 'git cat-file' (--batch | --batch-check | --batch-command) ... > > which made me think that this was the invocation for batch-related > options, and only listed options that made sense with a `--batch` mode > of one kind or another. Ah, yes you are absolutely right. I misread the synopsis. > I think the helper should probably be something more like: > > echo_with_nul () { > echo "$@" | tr '\n' '\0' > } I think you meant printf "%s\n" "$@" | tr .. but then I wonder if printf "%s\000" "$@" would work better? >> > + printf "HEAD:newline${LF}embedded" >in && >> > + git cat-file --batch-check -z <in >actual && >> >> As I already said, I suspect that new users who know how our path >> quoting works would expect c-quoted path would work just fine >> without using "-z". It is not a reason to refuse "-z" to exist, >> though. > > Yeah. I think we can do both, if there is a need. Even though we know it is needed already, it is unfortunately way too late now X-<, because existing scripts know that paths are not taken as c-quoted, even though people would naturally expect that paths are, and satisfying the latter expectation now means breaking existing scripts. In any case, as long as "-z" is designed right from the day one, it would be fine ;-) Thanks.
Hi Taylor Thanks for working on this, I've been annoyed by the lack of a "-z" option but never got round to doing anything about it. On 23/07/2022 00:29, Taylor Blau wrote: > When callers are using `cat-file` via one of the stdin-driven `--batch` > modes, all input is newline-delimited. This presents a problem when > callers wish to ask about, e.g. tree-entries that have a newline > character present in their filename. > > To support this niche scenario, introduce a new `-z` mode to the > `--batch`, `--batch-check`, and `--batch-command` suite of options that > instructs `cat-file` to treat its input as NUL-delimited, allowing the > individual commands themselves to have newlines present. I wonder if this should also change the output delimiter from NL to NUL. printf 'HEAD:does\nnot\nexist\000' | bin-wrappers/git cat-file --batch-check -z prints HEAD:does not exist missing If it was terminated by a NUL rather than a newline it would be much easier to parse. Best Wishes Phillip > The refactoring here is slightly unfortunate, since we turn loops like: > > while (strbuf_getline(&buf, stdin) != EOF) > > into: > > while (1) { > int ret; > if (opt->nul_terminated) > ret = strbuf_getline_nul(&input, stdin); > else > ret = strbuf_getline(&input, stdin); > > if (ret == EOF) > break; > } > > It's tempting to think that we could use `strbuf_getwholeline()` and > specify either `\n` or `\0` as the terminating character. But for input > on platforms that include a CR character preceeding the LF, this > wouldn't quite be the same, since `strbuf_getline(...)` will trim any > trailing CR, while `strbuf_getwholeline(&buf, stdin, '\n')` will not. > > In the future, we could clean this up further by introducing a variant > of `strbuf_getwholeline()` that addresses the aforementioned gap, but > that approach felt too heavy-handed for this pair of uses. > > Some tests are added in t1006 to ensure that `cat-file` produces the > same output in `--batch`, `--batch-check`, and `--batch-command` modes > with and without the new `-z` option. > > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > Documentation/git-cat-file.txt | 7 +++++- > builtin/cat-file.c | 28 ++++++++++++++++++++--- > t/t1006-cat-file.sh | 42 +++++++++++++++++++++++++++++++++- > 3 files changed, 72 insertions(+), 5 deletions(-) > > diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt > index 24a811f0ef..3515350ed6 100644 > --- a/Documentation/git-cat-file.txt > +++ b/Documentation/git-cat-file.txt > @@ -14,7 +14,7 @@ SYNOPSIS > 'git cat-file' (-t | -s) [--allow-unknown-type] <object> > 'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects] > [--buffer] [--follow-symlinks] [--unordered] > - [--textconv | --filters] > + [--textconv | --filters] [-z] > 'git cat-file' (--textconv | --filters) > [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>] > > @@ -207,6 +207,11 @@ respectively print: > /etc/passwd > -- > > +-z:: > + Only meaningful with `--batch`, `--batch-check`, or > + `--batch-command`; input is NUL-delimited instead of > + newline-delimited. > + > > OUTPUT > ------ > diff --git a/builtin/cat-file.c b/builtin/cat-file.c > index f42782e955..c3602d15df 100644 > --- a/builtin/cat-file.c > +++ b/builtin/cat-file.c > @@ -31,6 +31,7 @@ struct batch_options { > int all_objects; > int unordered; > int transform_mode; /* may be 'w' or 'c' for --filters or --textconv */ > + int nul_terminated; > const char *format; > }; > > @@ -614,12 +615,20 @@ static void batch_objects_command(struct batch_options *opt, > struct queued_cmd *queued_cmd = NULL; > size_t alloc = 0, nr = 0; > > - while (!strbuf_getline(&input, stdin)) { > - int i; > + while (1) { > + int i, ret; > const struct parse_cmd *cmd = NULL; > const char *p = NULL, *cmd_end; > struct queued_cmd call = {0}; > > + if (opt->nul_terminated) > + ret = strbuf_getline_nul(&input, stdin); > + else > + ret = strbuf_getline(&input, stdin); > + > + if (ret) > + break; > + > if (!input.len) > die(_("empty command in input")); > if (isspace(*input.buf)) > @@ -763,7 +772,16 @@ static int batch_objects(struct batch_options *opt) > goto cleanup; > } > > - while (strbuf_getline(&input, stdin) != EOF) { > + while (1) { > + int ret; > + if (opt->nul_terminated) > + ret = strbuf_getline_nul(&input, stdin); > + else > + ret = strbuf_getline(&input, stdin); > + > + if (ret == EOF) > + break; > + > if (data.split_on_whitespace) { > /* > * Split at first whitespace, tying off the beginning > @@ -866,6 +884,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix) > N_("like --batch, but don't emit <contents>"), > PARSE_OPT_OPTARG | PARSE_OPT_NONEG, > batch_option_callback), > + OPT_BOOL('z', NULL, &batch.nul_terminated, N_("stdin is NUL-terminated")), > OPT_CALLBACK_F(0, "batch-command", &batch, N_("format"), > N_("read commands from stdin"), > PARSE_OPT_OPTARG | PARSE_OPT_NONEG, > @@ -921,6 +940,9 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix) > else if (batch.all_objects) > usage_msg_optf(_("'%s' requires a batch mode"), usage, options, > "--batch-all-objects"); > + else if (batch.nul_terminated) > + usage_msg_optf(_("'%s' requires a batch mode"), usage, options, > + "-z"); > > /* Batch defaults */ > if (batch.buffer_output < 0) > diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh > index 01c0535765..23b8942edb 100755 > --- a/t/t1006-cat-file.sh > +++ b/t/t1006-cat-file.sh > @@ -88,7 +88,8 @@ done > > for opt in --buffer \ > --follow-symlinks \ > - --batch-all-objects > + --batch-all-objects \ > + -z > do > test_expect_success "usage: bad option combination: $opt without batch mode" ' > test_incompatible_usage git cat-file $opt && > @@ -100,6 +101,10 @@ echo_without_newline () { > printf '%s' "$*" > } > > +echo_without_newline_nul () { > + echo_without_newline "$@" | tr '\n' '\0' > +} > + > strlen () { > echo_without_newline "$1" | wc -c | sed -e 's/^ *//' > } > @@ -398,6 +403,12 @@ test_expect_success '--batch with multiple sha1s gives correct format' ' > test "$(maybe_remove_timestamp "$batch_output" 1)" = "$(maybe_remove_timestamp "$(echo_without_newline "$batch_input" | git cat-file --batch)" 1)" > ' > > +test_expect_success '--batch, -z with multiple sha1s gives correct format' ' > + echo_without_newline_nul "$batch_input" >in && > + test "$(maybe_remove_timestamp "$batch_output" 1)" = \ > + "$(maybe_remove_timestamp "$(git cat-file --batch -z <in)" 1)" > +' > + > batch_check_input="$hello_sha1 > $tree_sha1 > $commit_sha1 > @@ -418,6 +429,24 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" ' > "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)" > ' > > +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" ' > + echo_without_newline_nul "$batch_check_input" >in && > + test "$batch_check_output" = "$(git cat-file --batch-check -z <in)" > +' > + > +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' > + touch -- "newline${LF}embedded" && > + git add -- "newline${LF}embedded" && > + git commit -m "file with newline embedded" && > + test_tick && > + > + printf "HEAD:newline${LF}embedded" >in && > + git cat-file --batch-check -z <in >actual && > + > + echo "$(git rev-parse "HEAD:newline${LF}embedded") blob 0" >expect && > + test_cmp expect actual > +' > + > batch_command_multiple_info="info $hello_sha1 > info $tree_sha1 > info $commit_sha1 > @@ -436,6 +465,11 @@ test_expect_success '--batch-command with multiple info calls gives correct form > echo "$batch_command_multiple_info" >in && > git cat-file --batch-command --buffer <in >actual && > > + test_cmp expect actual && > + > + echo "$batch_command_multiple_info" | tr "\n" "\0" >in && > + git cat-file --batch-command --buffer -z <in >actual && > + > test_cmp expect actual > ' > > @@ -459,6 +493,12 @@ test_expect_success '--batch-command with multiple command calls gives correct f > echo "$batch_command_multiple_contents" >in && > git cat-file --batch-command --buffer <in >actual_raw && > > + remove_timestamp <actual_raw >actual && > + test_cmp expect actual && > + > + echo "$batch_command_multiple_contents" | tr "\n" "\0" >in && > + git cat-file --batch-command --buffer -z <in >actual_raw && > + > remove_timestamp <actual_raw >actual && > test_cmp expect actual > '
On Fri, Jul 22 2022, Taylor Blau wrote: This change: > diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt > index 24a811f0ef..3515350ed6 100644 > --- a/Documentation/git-cat-file.txt > +++ b/Documentation/git-cat-file.txt > @@ -14,7 +14,7 @@ SYNOPSIS > 'git cat-file' (-t | -s) [--allow-unknown-type] <object> > 'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects] > [--buffer] [--follow-symlinks] [--unordered] > - [--textconv | --filters] > + [--textconv | --filters] [-z] > 'git cat-file' (--textconv | --filters) > [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>] > > [...] > diff --git a/builtin/cat-file.c b/builtin/cat-file.c > index f42782e955..c3602d15df 100644 > --- a/builtin/cat-file.c > +++ b/builtin/cat-file.c > @@ -31,6 +31,7 @@ struct batch_options { > int all_objects; > int unordered; > int transform_mode; /* may be 'w' or 'c' for --filters or --textconv */ > + int nul_terminated; > const char *format; > }; Is missing a corresponding change to the -h output here. Before this they were the same, but now: + diff -u txt help --- txt 2022-08-11 11:53:00.235221628 +0000 +++ help 2022-08-11 11:53:00.239221594 +0000 @@ -3,6 +3,6 @@ git cat-file (-t | -s) [--allow-unknown-type] <object> git cat-file (--batch | --batch-check | --batch-command) [--batch-all-objects] [--buffer] [--follow-symlinks] [--unordered] - [--textconv | --filters] [-z] + [--textconv | --filters] git cat-file (--textconv | --filters) [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>] (Spotted with some local patches of mine which automatically check this)
diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt index 24a811f0ef..3515350ed6 100644 --- a/Documentation/git-cat-file.txt +++ b/Documentation/git-cat-file.txt @@ -14,7 +14,7 @@ SYNOPSIS 'git cat-file' (-t | -s) [--allow-unknown-type] <object> 'git cat-file' (--batch | --batch-check | --batch-command) [--batch-all-objects] [--buffer] [--follow-symlinks] [--unordered] - [--textconv | --filters] + [--textconv | --filters] [-z] 'git cat-file' (--textconv | --filters) [<rev>:<path|tree-ish> | --path=<path|tree-ish> <rev>] @@ -207,6 +207,11 @@ respectively print: /etc/passwd -- +-z:: + Only meaningful with `--batch`, `--batch-check`, or + `--batch-command`; input is NUL-delimited instead of + newline-delimited. + OUTPUT ------ diff --git a/builtin/cat-file.c b/builtin/cat-file.c index f42782e955..c3602d15df 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -31,6 +31,7 @@ struct batch_options { int all_objects; int unordered; int transform_mode; /* may be 'w' or 'c' for --filters or --textconv */ + int nul_terminated; const char *format; }; @@ -614,12 +615,20 @@ static void batch_objects_command(struct batch_options *opt, struct queued_cmd *queued_cmd = NULL; size_t alloc = 0, nr = 0; - while (!strbuf_getline(&input, stdin)) { - int i; + while (1) { + int i, ret; const struct parse_cmd *cmd = NULL; const char *p = NULL, *cmd_end; struct queued_cmd call = {0}; + if (opt->nul_terminated) + ret = strbuf_getline_nul(&input, stdin); + else + ret = strbuf_getline(&input, stdin); + + if (ret) + break; + if (!input.len) die(_("empty command in input")); if (isspace(*input.buf)) @@ -763,7 +772,16 @@ static int batch_objects(struct batch_options *opt) goto cleanup; } - while (strbuf_getline(&input, stdin) != EOF) { + while (1) { + int ret; + if (opt->nul_terminated) + ret = strbuf_getline_nul(&input, stdin); + else + ret = strbuf_getline(&input, stdin); + + if (ret == EOF) + break; + if (data.split_on_whitespace) { /* * Split at first whitespace, tying off the beginning @@ -866,6 +884,7 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix) N_("like --batch, but don't emit <contents>"), PARSE_OPT_OPTARG | PARSE_OPT_NONEG, batch_option_callback), + OPT_BOOL('z', NULL, &batch.nul_terminated, N_("stdin is NUL-terminated")), OPT_CALLBACK_F(0, "batch-command", &batch, N_("format"), N_("read commands from stdin"), PARSE_OPT_OPTARG | PARSE_OPT_NONEG, @@ -921,6 +940,9 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix) else if (batch.all_objects) usage_msg_optf(_("'%s' requires a batch mode"), usage, options, "--batch-all-objects"); + else if (batch.nul_terminated) + usage_msg_optf(_("'%s' requires a batch mode"), usage, options, + "-z"); /* Batch defaults */ if (batch.buffer_output < 0) diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh index 01c0535765..23b8942edb 100755 --- a/t/t1006-cat-file.sh +++ b/t/t1006-cat-file.sh @@ -88,7 +88,8 @@ done for opt in --buffer \ --follow-symlinks \ - --batch-all-objects + --batch-all-objects \ + -z do test_expect_success "usage: bad option combination: $opt without batch mode" ' test_incompatible_usage git cat-file $opt && @@ -100,6 +101,10 @@ echo_without_newline () { printf '%s' "$*" } +echo_without_newline_nul () { + echo_without_newline "$@" | tr '\n' '\0' +} + strlen () { echo_without_newline "$1" | wc -c | sed -e 's/^ *//' } @@ -398,6 +403,12 @@ test_expect_success '--batch with multiple sha1s gives correct format' ' test "$(maybe_remove_timestamp "$batch_output" 1)" = "$(maybe_remove_timestamp "$(echo_without_newline "$batch_input" | git cat-file --batch)" 1)" ' +test_expect_success '--batch, -z with multiple sha1s gives correct format' ' + echo_without_newline_nul "$batch_input" >in && + test "$(maybe_remove_timestamp "$batch_output" 1)" = \ + "$(maybe_remove_timestamp "$(git cat-file --batch -z <in)" 1)" +' + batch_check_input="$hello_sha1 $tree_sha1 $commit_sha1 @@ -418,6 +429,24 @@ test_expect_success "--batch-check with multiple sha1s gives correct format" ' "$(echo_without_newline "$batch_check_input" | git cat-file --batch-check)" ' +test_expect_success "--batch-check, -z with multiple sha1s gives correct format" ' + echo_without_newline_nul "$batch_check_input" >in && + test "$batch_check_output" = "$(git cat-file --batch-check -z <in)" +' + +test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' + touch -- "newline${LF}embedded" && + git add -- "newline${LF}embedded" && + git commit -m "file with newline embedded" && + test_tick && + + printf "HEAD:newline${LF}embedded" >in && + git cat-file --batch-check -z <in >actual && + + echo "$(git rev-parse "HEAD:newline${LF}embedded") blob 0" >expect && + test_cmp expect actual +' + batch_command_multiple_info="info $hello_sha1 info $tree_sha1 info $commit_sha1 @@ -436,6 +465,11 @@ test_expect_success '--batch-command with multiple info calls gives correct form echo "$batch_command_multiple_info" >in && git cat-file --batch-command --buffer <in >actual && + test_cmp expect actual && + + echo "$batch_command_multiple_info" | tr "\n" "\0" >in && + git cat-file --batch-command --buffer -z <in >actual && + test_cmp expect actual ' @@ -459,6 +493,12 @@ test_expect_success '--batch-command with multiple command calls gives correct f echo "$batch_command_multiple_contents" >in && git cat-file --batch-command --buffer <in >actual_raw && + remove_timestamp <actual_raw >actual && + test_cmp expect actual && + + echo "$batch_command_multiple_contents" | tr "\n" "\0" >in && + git cat-file --batch-command --buffer -z <in >actual_raw && + remove_timestamp <actual_raw >actual && test_cmp expect actual '
When callers are using `cat-file` via one of the stdin-driven `--batch` modes, all input is newline-delimited. This presents a problem when callers wish to ask about, e.g. tree-entries that have a newline character present in their filename. To support this niche scenario, introduce a new `-z` mode to the `--batch`, `--batch-check`, and `--batch-command` suite of options that instructs `cat-file` to treat its input as NUL-delimited, allowing the individual commands themselves to have newlines present. The refactoring here is slightly unfortunate, since we turn loops like: while (strbuf_getline(&buf, stdin) != EOF) into: while (1) { int ret; if (opt->nul_terminated) ret = strbuf_getline_nul(&input, stdin); else ret = strbuf_getline(&input, stdin); if (ret == EOF) break; } It's tempting to think that we could use `strbuf_getwholeline()` and specify either `\n` or `\0` as the terminating character. But for input on platforms that include a CR character preceeding the LF, this wouldn't quite be the same, since `strbuf_getline(...)` will trim any trailing CR, while `strbuf_getwholeline(&buf, stdin, '\n')` will not. In the future, we could clean this up further by introducing a variant of `strbuf_getwholeline()` that addresses the aforementioned gap, but that approach felt too heavy-handed for this pair of uses. Some tests are added in t1006 to ensure that `cat-file` produces the same output in `--batch`, `--batch-check`, and `--batch-command` modes with and without the new `-z` option. Signed-off-by: Taylor Blau <me@ttaylorr.com> --- Documentation/git-cat-file.txt | 7 +++++- builtin/cat-file.c | 28 ++++++++++++++++++++--- t/t1006-cat-file.sh | 42 +++++++++++++++++++++++++++++++++- 3 files changed, 72 insertions(+), 5 deletions(-)