[v5,1/1] cat-file: quote-format name in error when using -z

Message ID	20230510190116.795641-2-toon@iotcl.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@vger.kernel.org> From: Toon Claes <toon@iotcl.com> To: git@vger.kernel.org Cc: Phillip Wood <phillip.wood123@gmail.com>, Toon Claes <toon@iotcl.com> Subject: [PATCH v5 1/1] cat-file: quote-format name in error when using -z Date: Wed, 10 May 2023 21:01:16 +0200 Message-Id: <20230510190116.795641-2-toon@iotcl.com> In-Reply-To: <20230510190116.795641-1-toon@iotcl.com> References: <xmqqilfhctrr.fsf@gitster.g> <20230510190116.795641-1-toon@iotcl.com> Mime-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	cat-file: quote-format name in error when using -z \| expand [v5,0/1] cat-file: quote-format name in error when using -z [v5,1/1] cat-file: quote-format name in error when using -z

Toon Claes May 10, 2023, 7:01 p.m. UTC

Since it's supported to have NUL-delimited input, introduced in
db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`,
2022-07-22), it's possible to pass paths that contain newlines. This
works great when the object is found, but when it's not, the input path
is returned in the error message. Because this can contain newlines, the
error message might get spread over multiple lines, making it harder to
machine-parse this error message.

With this change, the input is quote-formatted in the error message, if
needed. This ensures the error message is always on a single line and
makes parsing the error more straightforward.

Signed-off-by: Toon Claes <toon@iotcl.com>
---
 builtin/cat-file.c  | 19 +++++++++++++++++++
 t/t1006-cat-file.sh | 27 +++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

Junio C Hamano May 10, 2023, 8:13 p.m. UTC | #1

Toon Claes <toon@iotcl.com> writes:

> Since it's supported to have NUL-delimited input, introduced in
> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`,
> 2022-07-22), it's possible to pass paths that contain newlines.

It has been a while since I saw this patch the last time, and it did
not immediately click to me how "pass paths" relates to passing
object names, which is what "--batch-check" takes.  Perhaps

    "cat-file --batch-check" may be fed object names of blobs and
    trees as "<commit>:<path>", and with its "-z" option, it is
    possible to feed <commit> or <path> with newlines in it.

or something?

> This
> works great when the object is found, but when it's not, the input path
> is returned in the error message. Because this can contain newlines, the
> error message might get spread over multiple lines, making it harder to
> machine-parse this error message.

Good description.  I may suggest

    "the input path is returned" -> "the input is shown verbatim"

because <path> is not the only thing that can contain LF.  E.g.

    $ git show -s 'HEAD^{/introduced in
    > db9d67}'

can find the commit resulting from this patch, so

    $ printf "%s\0" 'HEAD^{/introduced in
    > db9d67}:builtin/cat-file.cc' |
    > git cat-file --batch-check -z 

would be an input record with newline in it, that has no newline in
the path.

> With this change, the input is quote-formatted in the error message, if
> needed. This ensures the error message is always on a single line and
> makes parsing the error more straightforward.

Drop "With this change, ..." and give a command to the codebase to
c-quote the object name in the output, e.g.

    C-quote the object name from the input in the error message as
    needed, to ensure that the error message is on a single line and
    ...

The other side of the coin, however, is that an existing project
that is sane enough not to have a path with LF in it, but is not
sane enough to avoid a path with double-quote in it, would now stop
being able to parse the error message for a missing path.

It is a regression, and we may argue that it is not a large enough
regression to block progress given by this patch, but if it is not
common enough to have funny characters in the paths then we wouldn't
be seeing this patch in the first place.  So I would prefer to see
that we at least admit that we are deliberately making this change
with a known regression.  Perhaps add something like

    Note that if a project already parses the error message
    correctly because it does not have paths with newlines, this is
    a breaking change if it has paths with characters that need
    c-quoting (like double quotes and backslashes) that are not
    newlines.  We consider that this is a small enough price to pay
    to allow newlines in paths because ...

and fill the "because ..." part is sensible.  I am not filling the
"because" part, simply because I do not offhand see any good excuse
to rob Peter to pay Paul in this case.

> @@ -470,8 +471,17 @@ static void batch_object_write(const char *obj_name,
>  						       &data->oid, &data->info,
>  						       OBJECT_INFO_LOOKUP_REPLACE);
>  		if (ret < 0) {
> +			struct strbuf quoted = STRBUF_INIT;
> +
> +			if (opt->nul_terminated &&
> +			    obj_name) {
> +				quote_c_style(obj_name, &quoted, NULL, 0);
> +				obj_name = quoted.buf;
> +			}
> +
>  			printf("%s missing\n",
>  			       obj_name ? obj_name : oid_to_hex(&data->oid));
> +			strbuf_release(&quoted);
>  			fflush(stdout);
>  			return;
>  		}

Interesting.

When running "--batch-all-objects", we do not have obj_name, and we
do not have anything to "quote", but the fallback label is the full
hexadecimal object name, and we do not have to worry about line
breaks.

OK.

> @@ -518,6 +528,13 @@ static void batch_one_object(const char *obj_name,
>  	result = get_oid_with_context(the_repository, obj_name,
>  				      flags, &data->oid, &ctx);
>  	if (result != FOUND) {
> +		struct strbuf quoted = STRBUF_INIT;
> +
> +		if (opt->nul_terminated) {
> +			quote_c_style(obj_name, &quoted, NULL, 0);
> +			obj_name = quoted.buf;
> +		}
> +
>  		switch (result) {
>  		case MISSING_OBJECT:
>  			printf("%s missing\n", obj_name);
> @@ -542,6 +559,8 @@ static void batch_one_object(const char *obj_name,
>  			       result);
>  			break;
>  		}
> +
> +		strbuf_release(&quoted);
>  		fflush(stdout);
>  		return;
>  	}

This is the side that runs without "--batch-all-objects" and always
has non-null obj_name, so it looks a bit different from the other
hunk and is more straight-forward.  Looking good.

Thanks.

Toon Claes May 12, 2023, 8:54 a.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

> Toon Claes <toon@iotcl.com> writes:
>
>> Since it's supported to have NUL-delimited input, introduced in
>> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`,
>> 2022-07-22), it's possible to pass paths that contain newlines.
>
> It has been a while since I saw this patch the last time, and it did
> not immediately click to me how "pass paths" relates to passing
> object names, which is what "--batch-check" takes.  Perhaps
>
>     "cat-file --batch-check" may be fed object names of blobs and
>     trees as "<commit>:<path>", and with its "-z" option, it is
>     possible to feed <commit> or <path> with newlines in it.
>
> or something?

Good suggestion.

>> This
>> works great when the object is found, but when it's not, the input path
>> is returned in the error message. Because this can contain newlines, the
>> error message might get spread over multiple lines, making it harder to
>> machine-parse this error message.
>
> Good description.  I may suggest
>
>     "the input path is returned" -> "the input is shown verbatim"
>
> because <path> is not the only thing that can contain LF.  E.g.
>
>     $ git show -s 'HEAD^{/introduced in
>     > db9d67}'
>
> can find the commit resulting from this patch, so
>
>     $ printf "%s\0" 'HEAD^{/introduced in
>     > db9d67}:builtin/cat-file.cc' |
>     > git cat-file --batch-check -z

Thanks for that thorough explanation, makes a lot of sense.

>
> would be an input record with newline in it, that has no newline in
> the path.
>
>> With this change, the input is quote-formatted in the error message, if
>> needed. This ensures the error message is always on a single line and
>> makes parsing the error more straightforward.
>
> Drop "With this change, ..." and give a command to the codebase to
> c-quote the object name in the output, e.g.
>
>     C-quote the object name from the input in the error message as
>     needed, to ensure that the error message is on a single line and
>     ...

Sure, I'll update the commit message accordingly.

> The other side of the coin, however, is that an existing project
> that is sane enough not to have a path with LF in it, but is not
> sane enough to avoid a path with double-quote in it, would now stop
> being able to parse the error message for a missing path.

Ah interesting, this is not a case I did consider before.

> It is a regression, and we may argue that it is not a large enough
> regression to block progress given by this patch, but if it is not
> common enough to have funny characters in the paths then we wouldn't
> be seeing this patch in the first place.  So I would prefer to see
> that we at least admit that we are deliberately making this change
> with a known regression.  Perhaps add something like
>
>     Note that if a project already parses the error message
>     correctly because it does not have paths with newlines, this is
>     a breaking change if it has paths with characters that need
>     c-quoting (like double quotes and backslashes) that are not
>     newlines.  We consider that this is a small enough price to pay
>     to allow newlines in paths because ...
>
> and fill the "because ..." part is sensible.  I am not filling the
> "because" part, simply because I do not offhand see any good excuse
> to rob Peter to pay Paul in this case.

I intended to finalize this patch sooner, but other priorities popped
up. The -z flag was added in v2.38 and it would have been nice to have
the patch in v2.40, this would reduce the number of Peters affected. Now
we're a couple months and yet another release further between
introduction of the flag and making the regression, so I feel your
sentiment.

I previous conversations[1] we've been talking about making the output
NUL-terminated as well. We agreed on the cquote fix because the -z flag
was still fresh, but maybe at this time we need to revisit this.

Ideally the output should be NUL-terminated if -z is used. This was also
suggested[2] when the flag was introduced. Obviously we cannot change
this now, because it would break behavior for *everyone* using -z, not
only when funny names are used. So if we want to go this route, we
should only do so with another flag (e.g. `--null-output`) or a config
option.

But I was looking at the git-config(1) documentation:

> core.quotePath::
> 	Commands that output paths (e.g. 'ls-files', 'diff'), will
> 	quote "unusual" characters in the pathname by enclosing the
> 	pathname in double-quotes and escaping those characters with
> 	backslashes in the same way C escapes control characters (e.g.
> 	`\t` for TAB, `\n` for LF, `\\` for backslash) or bytes with
> 	values larger than 0x80 (e.g. octal `\302\265` for "micro" in
> 	UTF-8).  If this variable is set to false, bytes higher than
> 	0x80 are not considered "unusual" any more. Double-quotes,
> 	backslash and control characters are always escaped regardless
> 	of the setting of this variable.  A simple space character is
> 	not considered "unusual".  Many commands can output pathnames
> 	completely verbatim using the `-z` option. The default value
> 	is true.

If you read this, the changes of this patch fully contradict this. Also
documentation on other commands (e.g. git-check-ignore(1)) using `-z`
will mention the verbatim output. So at the moment I'm leaning toward
another solution. I'm looping in Taylor as he added the flag initially,
so he might have some insights how they are using it and how they expect
it to work.

-- Toon

[1]: https://lore.kernel.org/git/xmqq5yekyvrh.fsf@gitster.g/
[2]: https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/

Junio C Hamano May 12, 2023, 4:57 p.m. UTC | #3

Toon Claes <toon@iotcl.com> writes:

> Ideally the output should be NUL-terminated if -z is used. This was also
> suggested[2] when the flag was introduced. Obviously we cannot change
> this now, because it would break behavior for *everyone* using -z, not
> only when funny names are used. So if we want to go this route, we
> should only do so with another flag (e.g. `--null-output`) or a config
> option.

Yes, `--null-output` came also to my mind.  As this new mode of
output is for consumption by programs, letting them read
NUL-terminated records is a viable, if cumbersome, possibility.

> But I was looking at the git-config(1) documentation:
>
>> core.quotePath::
>> 	Commands that output paths (e.g. 'ls-files', 'diff'), will
>> 	quote "unusual" characters in the pathname by enclosing the
>> 	pathname in double-quotes and escaping those characters with
>> 	backslashes in the same way C escapes control characters (e.g.
>> 	`\t` for TAB, `\n` for LF, `\\` for backslash) or bytes with
>> 	values larger than 0x80 (e.g. octal `\302\265` for "micro" in
>> 	UTF-8).  If this variable is set to false, bytes higher than
>> 	0x80 are not considered "unusual" any more. Double-quotes,
>> 	backslash and control characters are always escaped regardless
>> 	of the setting of this variable.  A simple space character is
>> 	not considered "unusual".  Many commands can output pathnames
>> 	completely verbatim using the `-z` option. The default value
>> 	is true.
>
> If you read this, the changes of this patch fully contradict this.

Hmph, I do not quite see where the contradiction is.  If you mean
"Many commands can output" part, I do not think it applies here.
First, your "cat-file" does not have to be a part of "many".  More
importantly, the mention of `-z` there is about the option accepted
by the diff family of commants, e.g. "git diff --name-only -z
HEAD^", that is an output record separator.  Your "-z" is about the
input record separator, and if you are not changing "-z" to suddenly
mean both input and output  separator to break existing scripts that
expect "-z" only applies to input, the above "completely verbatim"
does not apply to you.

> Also
> documentation on other commands (e.g. git-check-ignore(1)) using `-z`
> will mention the verbatim output.

Again, it is about the output.

Stepping back a bit, how big a problem is this in real life?  It
certainly is possible to create a pathname with funny byte values in
it, and in some environments,letters like single-quote that are
considered cumbersome to handle by those who are used to CLI
programs may be commonplace.  But a path with newline?  Or any
control character for that matter?  And this is not even the primary
output from the program but is an error message for consumption by
humans, no?

I am wondering if it is simpler to just declare that the paths
output in error messages have certain bytes, probably all control
characters other than HT, replaced with a dot, and tell the users
not to rely on the pathnames being intact if they contain funny
bytes in them.  That way, with the definition of "work" being "you
can read the path out of error messages that talk about it", paths
with bytes that c-quote mechanism butchers, like double quotes and
backslashes, that have worked before will not be broken, and paths
with LF or CRLF in them that have never worked would not work, but
at least does not break the input stream of whoever is reading the
error messages line by line.

I dunno.

Phillip Wood May 15, 2023, 8:47 a.m. UTC | #4

On 12/05/2023 17:57, Junio C Hamano wrote:
> Toon Claes <toon@iotcl.com> writes:
> Stepping back a bit, how big a problem is this in real life?  It
> certainly is possible to create a pathname with funny byte values in
> it, and in some environments,letters like single-quote that are
> considered cumbersome to handle by those who are used to CLI
> programs may be commonplace.  But a path with newline?  Or any
> control character for that matter?  And this is not even the primary
> output from the program but is an error message for consumption by
> humans, no?
> 
> I am wondering if it is simpler to just declare that the paths
> output in error messages have certain bytes, probably all control
> characters other than HT, replaced with a dot, and tell the users
> not to rely on the pathnames being intact if they contain funny
> bytes in them.

We could only c-quote the name when it contains a control character 
other that HT. That way names containing double quotes and backslashes 
are unchanged but it will still be possible to parse the path from the 
error message. If we're going to munge the name we might as well use our 
standard quoting rather than some ad-hoc scheme.

Best Wishes

Phillip

   That way, with the definition of "work" being "you
> can read the path out of error messages that talk about it", paths
> with bytes that c-quote mechanism butchers, like double quotes and
> backslashes, that have worked before will not be broken, and paths
> with LF or CRLF in them that have never worked would not work, but
> at least does not break the input stream of whoever is reading the
> error messages line by line.
> 
> I dunno.
> 
>

Junio C Hamano May 15, 2023, 5:20 p.m. UTC | #5

Phillip Wood <phillip.wood123@gmail.com> writes:

> On 12/05/2023 17:57, Junio C Hamano wrote:
>> Toon Claes <toon@iotcl.com> writes:
>> Stepping back a bit, how big a problem is this in real life?  It
>> certainly is possible to create a pathname with funny byte values in
>> it, and in some environments,letters like single-quote that are
>> considered cumbersome to handle by those who are used to CLI
>> programs may be commonplace.  But a path with newline?  Or any
>> control character for that matter?  And this is not even the primary
>> output from the program but is an error message for consumption by
>> humans, no?
>> I am wondering if it is simpler to just declare that the paths
>> output in error messages have certain bytes, probably all control
>> characters other than HT, replaced with a dot, and tell the users
>> not to rely on the pathnames being intact if they contain funny
>> bytes in them.
>
> We could only c-quote the name when it contains a control character
> other that HT. That way names containing double quotes and backslashes
> are unchanged but it will still be possible to parse the path from the
> error message. If we're going to munge the name we might as well use
> our standard quoting rather than some ad-hoc scheme.

In the above suggestion, I gave up and no longer aim to do
"quoting".  A more appropriate word for the approach is "redacting".
The message essentially is: If you use truly problematic bytes in
your path, they are redacted (so do not use them if it hurts).

This is because I am not sure how "names containing dq and bs are
unchanged" can be done without ambiguity.  If I see a message that
comes out of this:

	printf("%s missing\n", obj_name);

and it looks like

	"a\nb" missing

how do I tell if it is complaining about the object the user named
with a three-byte string (i.e. lowercase-A, newline, lowercase-B),
or a six-byte string (i.e. dq, lowercase-A, bs, lowercase-N,
lowercase-B, dq)?

If we were forbidding '"' to appear in a refname, then we could take
advantage of the fact that the name of an object inside a tree at a
funny path would not start with '"', to disambiguate.  For the
three- and six-byte string cases above, the formatting function will
give these messages (referred to as "sample output" below):

	"master:a\nb" missing
	master:"a\nb" missing

because of your "we do not exactly do our standard c-quote; we
exempt dq and bs from the bytes to be quoted" rule.

But it still feels a bit misleading.  This codepath may have the
whole objectname as a single string so that c-quoting the entire
"<commit> <colon> <path>" inside a single c-quoted string that
begins with a dq is easy, but not all codepaths are lucky and some
may have to show <commit> and <path> separately, concatenated with
<colon> at the outermost output layer, which means that the second
one from the sample output may still mean the path with three-byte
name in the tree of 'master' commit.

And worse yet, because

	git branch '"master'

is possible (even though nobody sane would do that), so "treat the
string as c-quoted only if the object name as a whole begins with a
dq", this disambiguation idea would not work.  The first one from
the sample output could be the blob at the path with a five-byte
string name (i.e. lowercase-A, bs, lowercase-N, lowercase-B, dq)
in the tree of the commit at the tip of branch with seven-byte
string name (i.e. dq followed by 'master').

So, I dunno.

Phillip Wood June 2, 2023, 1:29 p.m. UTC | #6

Hi Junio

Sorry for the slow reply, I had intended to reply before but got 
distracted and forgot about it.

On 15/05/2023 18:20, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> On 12/05/2023 17:57, Junio C Hamano wrote:
>>> Toon Claes <toon@iotcl.com> writes:
>>> Stepping back a bit, how big a problem is this in real life?  It
>>> certainly is possible to create a pathname with funny byte values in
>>> it, and in some environments,letters like single-quote that are
>>> considered cumbersome to handle by those who are used to CLI
>>> programs may be commonplace.  But a path with newline?  Or any
>>> control character for that matter?  And this is not even the primary
>>> output from the program but is an error message for consumption by
>>> humans, no?
>>> I am wondering if it is simpler to just declare that the paths
>>> output in error messages have certain bytes, probably all control
>>> characters other than HT, replaced with a dot, and tell the users
>>> not to rely on the pathnames being intact if they contain funny
>>> bytes in them.
>>
>> We could only c-quote the name when it contains a control character
>> other that HT. That way names containing double quotes and backslashes
>> are unchanged but it will still be possible to parse the path from the
>> error message. If we're going to munge the name we might as well use
>> our standard quoting rather than some ad-hoc scheme.
> 
> In the above suggestion, I gave up and no longer aim to do
> "quoting".  A more appropriate word for the approach is "redacting".
> The message essentially is: If you use truly problematic bytes in
> your path, they are redacted (so do not use them if it hurts).
> 
> This is because I am not sure how "names containing dq and bs are
> unchanged" can be done without ambiguity.

D'oh, I should have thought of that. You're right it ends up being 
ambiguous. Anyway Patrick has just posted a patch to add NUL terminated 
output which looks like a cleaner approach.

Best Wishes

Phillip

>  If I see a message that
> comes out of this:
> 
> 	printf("%s missing\n", obj_name);
> 
> and it looks like
> 
> 	"a\nb" missing
> 
> how do I tell if it is complaining about the object the user named
> with a three-byte string (i.e. lowercase-A, newline, lowercase-B),
> or a six-byte string (i.e. dq, lowercase-A, bs, lowercase-N,
> lowercase-B, dq)?
> 
> If we were forbidding '"' to appear in a refname, then we could take
> advantage of the fact that the name of an object inside a tree at a
> funny path would not start with '"', to disambiguate.  For the
> three- and six-byte string cases above, the formatting function will
> give these messages (referred to as "sample output" below):
> 
> 	"master:a\nb" missing
> 	master:"a\nb" missing
> 
> because of your "we do not exactly do our standard c-quote; we
> exempt dq and bs from the bytes to be quoted" rule.
> 
> But it still feels a bit misleading.  This codepath may have the
> whole objectname as a single string so that c-quoting the entire
> "<commit> <colon> <path>" inside a single c-quoted string that
> begins with a dq is easy, but not all codepaths are lucky and some
> may have to show <commit> and <path> separately, concatenated with
> <colon> at the outermost output layer, which means that the second
> one from the sample output may still mean the path with three-byte
> name in the tree of 'master' commit.
> 
> And worse yet, because
> 
> 	git branch '"master'
> 
> is possible (even though nobody sane would do that), so "treat the
> string as c-quoted only if the object name as a whole begins with a
> dq", this disambiguation idea would not work.  The first one from
> the sample output could be the blob at the path with a five-byte
> string name (i.e. lowercase-A, bs, lowercase-N, lowercase-B, dq)
> in the tree of the commit at the tip of branch with seven-byte
> string name (i.e. dq followed by 'master').
> 
> So, I dunno.

[v5,1/1] cat-file: quote-format name in error when using -z

Commit Message

Comments

Patch