diff mbox series

[1/1] cat-file: quote-format name in error when using -z

Message ID 20221209150048.2400648-2-toon@iotcl.com (mailing list archive)
State Superseded
Headers show
Series cat-file: quote-format name in error when using -z | expand

Commit Message

Toon claes Dec. 9, 2022, 3 p.m. UTC
Since it's supported to have NUL-delimited input, introduced in
db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`,
2022-07-22), it's possible to pass paths that contain newlines. This
works great when the object is found, but when it's not, the input path
is returned in the error message. Because this can contain newlines, the
error message might get spread over multiple lines, making it harder to
machine-parse this error message.

With this change, the input is quote-formatted in the error message, if
needed. This ensures the error message is always on a single line and
makes parsing the error more straightforward.

Signed-off-by: Toon Claes <toon@iotcl.com>
---
 builtin/cat-file.c  | 10 ++++++++++
 t/t1006-cat-file.sh |  8 ++++++++
 2 files changed, 18 insertions(+)

--
2.39.0.rc0.57

Comments

Phillip Wood Dec. 9, 2022, 7:33 p.m. UTC | #1
Hi Toon

On 09/12/2022 15:00, Toon Claes wrote:
> Since it's supported to have NUL-delimited input, introduced in
> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`,
> 2022-07-22), it's possible to pass paths that contain newlines. This
> works great when the object is found, but when it's not, the input path
> is returned in the error message. Because this can contain newlines, the
> error message might get spread over multiple lines, making it harder to
> machine-parse this error message.
> 
> With this change, the input is quote-formatted in the error message, if
> needed. This ensures the error message is always on a single line and
> makes parsing the error more straightforward.

Thanks for working on this. I'd previously suggested NUL terminating the 
output of "git cat-file -z" to avoid this problem [1] but quoting the 
object name is a better solution. The implementation looks good, thanks 
for adding a test - are you sure we need the FUNNYNAMES guard on the 
test even though it isn't creating files with funny names?

Best Wishes

Phillip

[1] 
https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/

> Signed-off-by: Toon Claes <toon@iotcl.com>
> ---
>   builtin/cat-file.c  | 10 ++++++++++
>   t/t1006-cat-file.sh |  8 ++++++++
>   2 files changed, 18 insertions(+)
> 
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index b3be58b1fb..67dd81238d 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -14,6 +14,7 @@
>   #include "tree-walk.h"
>   #include "oid-array.h"
>   #include "packfile.h"
> +#include "quote.h"
>   #include "object-store.h"
>   #include "promisor-remote.h"
>   #include "mailmap.h"
> @@ -475,6 +476,13 @@ static void batch_one_object(const char *obj_name,
>   	result = get_oid_with_context(the_repository, obj_name,
>   				      flags, &data->oid, &ctx);
>   	if (result != FOUND) {
> +		struct strbuf quoted = STRBUF_INIT;
> +
> +		if (opt->nul_terminated) {
> +			quote_c_style(obj_name, &quoted, NULL, 0);
> +			obj_name = quoted.buf;
> +		}
> +
>   		switch (result) {
>   		case MISSING_OBJECT:
>   			printf("%s missing\n", obj_name);
> @@ -499,6 +507,8 @@ static void batch_one_object(const char *obj_name,
>   			       result);
>   			break;
>   		}
> +
> +		strbuf_release(&quoted);
>   		fflush(stdout);
>   		return;
>   	}
> diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
> index 23b8942edb..232bfd1723 100755
> --- a/t/t1006-cat-file.sh
> +++ b/t/t1006-cat-file.sh
> @@ -447,6 +447,14 @@ test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' '
>   	test_cmp expect actual
>   '
> 
> +test_expect_success FUNNYNAMES '--batch-check, -z with newline in non-existent named object' '
> +	printf "HEAD:newline${LF}embedded" >in &&
> +	git cat-file --batch-check -z <in >actual &&
> +
> +	printf "\"HEAD:newline\\\\nembedded\" missing\n" >expect &&
> +	test_cmp expect actual
> +'
> +
>   batch_command_multiple_info="info $hello_sha1
>   info $tree_sha1
>   info $commit_sha1
> --
> 2.39.0.rc0.57
Junio C Hamano Dec. 9, 2022, 11:58 p.m. UTC | #2
Phillip Wood <phillip.wood123@gmail.com> writes:

> Hi Toon
>
> On 09/12/2022 15:00, Toon Claes wrote:
>> Since it's supported to have NUL-delimited input, introduced in
>> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`,
>> 2022-07-22), it's possible to pass paths that contain newlines. This
>> works great when the object is found, but when it's not, the input path
>> is returned in the error message. Because this can contain newlines, the
>> error message might get spread over multiple lines, making it harder to
>> machine-parse this error message.
>> With this change, the input is quote-formatted in the error message,
>> if
>> needed. This ensures the error message is always on a single line and
>> makes parsing the error more straightforward.
>
> Thanks for working on this. I'd previously suggested NUL terminating
> the output of "git cat-file -z" to avoid this problem [1] but quoting
> the object name is a better solution.

Hmph.  My knee-jerk reaction was that it is utterly disgusting if we
quote when we do NUL-terminated.  Is your "quoting is OK over NUL
terminating" because "-z" applies only to the input?  If so, then I
would agree that is OK, but shouldn't the quoting apply regardless
of how the input is formulated?  Why do we call the cquote helper
only under "-z"?
Phillip Wood Dec. 11, 2022, 4:30 p.m. UTC | #3
On 09/12/2022 23:58, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> Hi Toon
>>
>> On 09/12/2022 15:00, Toon Claes wrote:
>>> Since it's supported to have NUL-delimited input, introduced in
>>> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`,
>>> 2022-07-22), it's possible to pass paths that contain newlines. This
>>> works great when the object is found, but when it's not, the input path
>>> is returned in the error message. Because this can contain newlines, the
>>> error message might get spread over multiple lines, making it harder to
>>> machine-parse this error message.
>>> With this change, the input is quote-formatted in the error message,
>>> if
>>> needed. This ensures the error message is always on a single line and
>>> makes parsing the error more straightforward.
>>
>> Thanks for working on this. I'd previously suggested NUL terminating
>> the output of "git cat-file -z" to avoid this problem [1] but quoting
>> the object name is a better solution.
> 
> Hmph.  My knee-jerk reaction was that it is utterly disgusting if we
> quote when we do NUL-terminated.  Is your "quoting is OK over NUL
> terminating" because "-z" applies only to the input?

Yes, if the object exists then delimiting the output with newlines is 
fine, it is only if the object is missing and its name contains a 
newline that there is a problem. It also makes adopting "-z" in existing 
scripts easier as there is no change required when parsing the output.

As "-z" was added in 2.38 there is also a pragmatic reason to prefer 
quoting over NUL terminated output as it allows us to fix this issue 
without changing the output delimiter of an existing option.

> If so, then I
> would agree that is OK, but shouldn't the quoting apply regardless
> of how the input is formulated?  Why do we call the cquote helper
> only under "-z"?

Without "-z" you cannot pass object names that contain newlines so not 
quoting the output does not cause a problem. We could start quoting the 
object name without "-z" but we'd be changing the output without a huge 
benefit.

Best Wishes

Phillip
Junio C Hamano Dec. 12, 2022, 12:11 a.m. UTC | #4
Phillip Wood <phillip.wood123@gmail.com> writes:

> Without "-z" you cannot pass object names that contain newlines so not
> quoting the output does not cause a problem. We could start quoting
> the object name without "-z" but we'd be changing the output without a
> huge benefit.

That's fair.  The next question is from a devil's advocate:
is switching to the full cquote the best thing to do?

If we were using the full cquote from the very beginning, of course
it is, simply because that is what is used in all other places in
Git.  Using the full cquote does mean a LF byte will be protected
(i.e. instead of shown literally in the middle of other letters
around LF, "other\nletters around LF" would be shown), but pathnames
with backslashes and double quotes in them that have been shown
without problems would be shown differently and will break existing
parsers, which are written lazily with the assumption that they are
perfectly happy to be "simple" thans to not having to deal with LF
(because in their environment a path with LF in it do not matter).

A bit safer thing to do is to replace LF (and not any other bytes
that would be quoted with full cquote) in the path given in these
messages with something else (like NUL to truncate the output
there).  As these answers are given in order, the object names are
not absolutely needed to identify and match up the input and the
output, and properly written parsers would be prepared to see a
response with an object name that it did not request and handle it
sanely, such a change may not break such a parser for a path with
any byte that are modified with full cquote.

The above is with a devil's adovocate hat on, and I do not care too
much, as I do not think butchering backslash with full cquote would
not hurt even existing Windows users (if "HEAD:t\README.txt" named
the same blob as "HEAD:t/README.txt" on a platform, doubling the
backslashes in the output would have made quite a lot of damage, but
I do not think we allow backslashes to name tree paths).

By the way, there is another use of obj_name in batch_object_write()
that can show whatever byte in it literally to the output.

Thanks.
Toon Claes Dec. 12, 2022, 11:34 a.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> Phillip Wood <phillip.wood123@gmail.com> writes:
>
>> Without "-z" you cannot pass object names that contain newlines so not
>> quoting the output does not cause a problem. We could start quoting
>> the object name without "-z" but we'd be changing the output without a
>> huge benefit.
>
> That's fair.  The next question is from a devil's advocate:
> is switching to the full cquote the best thing to do?
>
> If we were using the full cquote from the very beginning, of course
> it is, simply because that is what is used in all other places in
> Git.  Using the full cquote does mean a LF byte will be protected
> (i.e. instead of shown literally in the middle of other letters
> around LF, "other\nletters around LF" would be shown), but pathnames
> with backslashes and double quotes in them that have been shown
> without problems would be shown differently and will break existing
> parsers, which are written lazily with the assumption that they are
> perfectly happy to be "simple" thans to not having to deal with LF
> (because in their environment a path with LF in it do not matter).
>
> A bit safer thing to do is to replace LF (and not any other bytes
> that would be quoted with full cquote) in the path given in these
> messages with something else (like NUL to truncate the output
> there).

So object name "HEAD:other\nletters around LF" would give the error
message "other missing"? That error message would also occur when the
user does not provide -z. I think that might be confusing.

> As these answers are given in order, the object names are
> not absolutely needed to identify and match up the input and the
> output, and properly written parsers would be prepared to see a
> response with an object name that it did not request and handle it
> sanely, such a change may not break such a parser for a path with
> any byte that are modified with full cquote.

Yes, the answers are returned in order, so personally I don't care too
much about the returned object name format. I even would be fine with a
generic error message that omits the input name, for example "object
missing". As long as it's clear that the requested object is not found.

For your information, there is an extreme edge case a user could fake an
object, and that's what we want to avoid as well. For example the
command (line break included):

printf "aabbccddeeff00112233445566778899aabbccdd blob 26
this object is not" | git cat-file --batch -z

Would print:

aabbccddeeff00112233445566778899aabbccdd blob 26
this object is not missing

This is perfectly valid git-cat-file output. Luckily I don't see any way
how this can be abused. Generally I think it's a good idea to not return
the input as-is in any situation. We could only replace newlines, but
cquoting already sanitizes the input, so why not use that?

> The above is with a devil's adovocate hat on, and I do not care too
> much, as I do not think butchering backslash with full cquote would
> not hurt even existing Windows users (if "HEAD:t\README.txt" named
> the same blob as "HEAD:t/README.txt" on a platform, doubling the
> backslashes in the output would have made quite a lot of damage, but
> I do not think we allow backslashes to name tree paths).

> By the way, there is another use of obj_name in batch_object_write()
> that can show whatever byte in it literally to the output.

Ah thanks! I will include in the next version, when we reach a consensus
on when or what to cquote.

--
Toon
Junio C Hamano Dec. 12, 2022, 10:09 p.m. UTC | #6
Toon Claes <toon@to1.studio> writes:

> Ah thanks! I will include in the next version, when we reach a consensus
> on when or what to cquote.

I think we already have consensus on when---"-z" may benefit,
otherwise we should not quote.  I am not sure what you mean by "what
to cquote"---are there things other than input object names that may
be ambiguous and require quoting?
Phillip Wood Dec. 13, 2022, 3:06 p.m. UTC | #7
On 12/12/2022 00:11, Junio C Hamano wrote:
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> Without "-z" you cannot pass object names that contain newlines so not
>> quoting the output does not cause a problem. We could start quoting
>> the object name without "-z" but we'd be changing the output without a
>> huge benefit.
> 
> That's fair.  The next question is from a devil's advocate:
> is switching to the full cquote the best thing to do?
> 
> If we were using the full cquote from the very beginning, of course
> it is, simply because that is what is used in all other places in
> Git.  Using the full cquote does mean a LF byte will be protected
> (i.e. instead of shown literally in the middle of other letters
> around LF, "other\nletters around LF" would be shown), but pathnames
> with backslashes and double quotes in them that have been shown
> without problems would be shown differently and will break existing
> parsers, which are written lazily with the assumption that they are
> perfectly happy to be "simple" thans to not having to deal with LF
> (because in their environment a path with LF in it do not matter).
> 
> A bit safer thing to do is to replace LF (and not any other bytes
> that would be quoted with full cquote) in the path given in these
> messages with something else (like NUL to truncate the output
> there).  As these answers are given in order, the object names are
> not absolutely needed to identify and match up the input and the
> output, and properly written parsers would be prepared to see a
> response with an object name that it did not request and handle it
> sanely, such a change may not break such a parser for a path with
> any byte that are modified with full cquote.
> 
> The above is with a devil's adovocate hat on, and I do not care too
> much, as I do not think butchering backslash with full cquote would
> not hurt even existing Windows users (if "HEAD:t\README.txt" named
> the same blob as "HEAD:t/README.txt" on a platform, doubling the
> backslashes in the output would have made quite a lot of damage, but
> I do not think we allow backslashes to name tree paths).

By default quoting also affects names that contain non-ascii characters. 
If we're worried about that we could only quote names that contain LF 
but as you point out the answers are given in order so I don't think 
there should be a problem with calling cquote unconditionally. (As this 
option has only existed for one release I suspect there aren't that many 
users yet as well)

Best Wishes

Phillip

> By the way, there is another use of obj_name in batch_object_write()
> that can show whatever byte in it literally to the output.
> 
> Thanks.
Junio C Hamano Dec. 14, 2022, 8:29 a.m. UTC | #8
Phillip Wood <phillip.wood123@gmail.com> writes:

> By default quoting also affects names that contain non-ascii
> characters. If we're worried about that we could only quote names that
> contain LF but as you point out the answers are given in order so I
> don't think there should be a problem with calling cquote
> unconditionally. (As this option has only existed for one release I
> suspect there aren't that many users yet as well)

Good point.
Toon Claes Dec. 20, 2022, 5:31 a.m. UTC | #9
Phillip Wood <phillip.wood123@gmail.com> writes:

> Thanks for working on this. I'd previously suggested NUL terminating the output
> of "git cat-file -z" to avoid this problem [1]
>
> [1]
> https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/

What happened to this proposal? I don't see any replies to that. That's
a bit sad, because it would have been nice to have it this behavior from
the start.

> but quoting the object name is a better solution.

I would not say it's a better solution, but it's a less invasive
solution that /minimizes/ breaking changes. Ideally I'd like to have NUL
terminated output for "git cat-file -z". In a success situation I
assume this would return:

    <oid> SP <type> SP <size> NUL
    <contents> NUL

In a failure situation something like:

    <object> SP missing NUL

So when you pass -z you can keep reading until the first NUL and then
you'll know if you should read for contents as well.

Would you consider change behavior to this now?

--
Toon
Phillip Wood Dec. 20, 2022, 10:18 a.m. UTC | #10
Hi Toon

On 20/12/2022 05:31, Toon Claes wrote:
> 
> Phillip Wood <phillip.wood123@gmail.com> writes:
> 
>> Thanks for working on this. I'd previously suggested NUL terminating the output
>> of "git cat-file -z" to avoid this problem [1]
>>
>> [1]
>> https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/
> 
> What happened to this proposal? I don't see any replies to that. That's
> a bit sad, because it would have been nice to have it this behavior from
> the start.

Yes it would have been nice, unfortunately it feel through the cracks 
due to a combination of unfortunate timing and lack of time. I think the 
patch was probably in next by the time I realized the problem and wrote 
that email. Taylor was on holiday at the time and then I went away 
around the time he got back. I know it was on his todo list but I think 
he was busy catching up from being away getting ready for git merge. I 
had other things I was working on and unfortunately didn't get round to 
following it up.

>> but quoting the object name is a better solution.
> 
> I would not say it's a better solution, but it's a less invasive
> solution that /minimizes/ breaking changes. Ideally I'd like to have NUL
> terminated output for "git cat-file -z". In a success situation I
> assume this would return:
> 
>      <oid> SP <type> SP <size> NUL
>      <contents> NUL
> 
> In a failure situation something like:
> 
>      <object> SP missing NUL
> 
> So when you pass -z you can keep reading until the first NUL and then
> you'll know if you should read for contents as well.
> 
> Would you consider change behavior to this now?

Hmm I'm not sure (and luckily it isn't up to me to take the final 
decision!). I really don't know how many people are using "-z" I suspect 
it is not many and so the fallout wouldn't be too bad but we'd still be 
inconveniencing some users. I theory we could so "sorry we made a 
mistake when implementing this feature and have decided to change it" 
but we have a solution in your patch which hopefully does not break any 
users (I say hopefully because think if one is careful and keep track of 
which responses you've read it is possible to detect missing objects now 
even if their name contains a new line but it will be messy and probably 
slightly inefficient but anyone doing that will notice the change in 
output).

Overall I'd say it is tempting but risky and inconvenient to any 
existing users of "-z" (assuming someone else is actually using it). 
Quoting the object name is definitively the safer option at this point.

Best Wishes

Phillip
Toon Claes Dec. 21, 2022, 12:42 p.m. UTC | #11
Phillip Wood <phillip.wood123@gmail.com> writes:

> Yes it would have been nice, unfortunately it feel through the cracks due to a
> combination of unfortunate timing and lack of time. I think the patch was
> probably in next by the time I realized the problem and wrote that email. Taylor
> was on holiday at the time and then I went away around the time he got back. I
> know it was on his todo list but I think he was busy catching up from being away
> getting ready for git merge. I had other things I was working on and
> unfortunately didn't get round to following it up.

Oh sorry, I didn't want to put blame on anyone. Things happen. I just
wasn't sure I missed anything.

> Overall I'd say it is tempting but risky and inconvenient to any existing users
> of "-z" (assuming someone else is actually using it). Quoting the object name is
> definitively the safer option at this point.

I assume Taylor would be one of the consumers of output of "-z", so
thanks for putting him in Cc to get his take on this.

--
Toon
diff mbox series

Patch

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index b3be58b1fb..67dd81238d 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -14,6 +14,7 @@ 
 #include "tree-walk.h"
 #include "oid-array.h"
 #include "packfile.h"
+#include "quote.h"
 #include "object-store.h"
 #include "promisor-remote.h"
 #include "mailmap.h"
@@ -475,6 +476,13 @@  static void batch_one_object(const char *obj_name,
 	result = get_oid_with_context(the_repository, obj_name,
 				      flags, &data->oid, &ctx);
 	if (result != FOUND) {
+		struct strbuf quoted = STRBUF_INIT;
+
+		if (opt->nul_terminated) {
+			quote_c_style(obj_name, &quoted, NULL, 0);
+			obj_name = quoted.buf;
+		}
+
 		switch (result) {
 		case MISSING_OBJECT:
 			printf("%s missing\n", obj_name);
@@ -499,6 +507,8 @@  static void batch_one_object(const char *obj_name,
 			       result);
 			break;
 		}
+
+		strbuf_release(&quoted);
 		fflush(stdout);
 		return;
 	}
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 23b8942edb..232bfd1723 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -447,6 +447,14 @@  test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' '
 	test_cmp expect actual
 '

+test_expect_success FUNNYNAMES '--batch-check, -z with newline in non-existent named object' '
+	printf "HEAD:newline${LF}embedded" >in &&
+	git cat-file --batch-check -z <in >actual &&
+
+	printf "\"HEAD:newline\\\\nembedded\" missing\n" >expect &&
+	test_cmp expect actual
+'
+
 batch_command_multiple_info="info $hello_sha1
 info $tree_sha1
 info $commit_sha1