Message ID | 20230510190116.795641-2-toon@iotcl.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | cat-file: quote-format name in error when using -z | expand |
Toon Claes <toon@iotcl.com> writes: > Since it's supported to have NUL-delimited input, introduced in > db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`, > 2022-07-22), it's possible to pass paths that contain newlines. It has been a while since I saw this patch the last time, and it did not immediately click to me how "pass paths" relates to passing object names, which is what "--batch-check" takes. Perhaps "cat-file --batch-check" may be fed object names of blobs and trees as "<commit>:<path>", and with its "-z" option, it is possible to feed <commit> or <path> with newlines in it. or something? > This > works great when the object is found, but when it's not, the input path > is returned in the error message. Because this can contain newlines, the > error message might get spread over multiple lines, making it harder to > machine-parse this error message. Good description. I may suggest "the input path is returned" -> "the input is shown verbatim" because <path> is not the only thing that can contain LF. E.g. $ git show -s 'HEAD^{/introduced in > db9d67}' can find the commit resulting from this patch, so $ printf "%s\0" 'HEAD^{/introduced in > db9d67}:builtin/cat-file.cc' | > git cat-file --batch-check -z would be an input record with newline in it, that has no newline in the path. > With this change, the input is quote-formatted in the error message, if > needed. This ensures the error message is always on a single line and > makes parsing the error more straightforward. Drop "With this change, ..." and give a command to the codebase to c-quote the object name in the output, e.g. C-quote the object name from the input in the error message as needed, to ensure that the error message is on a single line and ... The other side of the coin, however, is that an existing project that is sane enough not to have a path with LF in it, but is not sane enough to avoid a path with double-quote in it, would now stop being able to parse the error message for a missing path. It is a regression, and we may argue that it is not a large enough regression to block progress given by this patch, but if it is not common enough to have funny characters in the paths then we wouldn't be seeing this patch in the first place. So I would prefer to see that we at least admit that we are deliberately making this change with a known regression. Perhaps add something like Note that if a project already parses the error message correctly because it does not have paths with newlines, this is a breaking change if it has paths with characters that need c-quoting (like double quotes and backslashes) that are not newlines. We consider that this is a small enough price to pay to allow newlines in paths because ... and fill the "because ..." part is sensible. I am not filling the "because" part, simply because I do not offhand see any good excuse to rob Peter to pay Paul in this case. > @@ -470,8 +471,17 @@ static void batch_object_write(const char *obj_name, > &data->oid, &data->info, > OBJECT_INFO_LOOKUP_REPLACE); > if (ret < 0) { > + struct strbuf quoted = STRBUF_INIT; > + > + if (opt->nul_terminated && > + obj_name) { > + quote_c_style(obj_name, "ed, NULL, 0); > + obj_name = quoted.buf; > + } > + > printf("%s missing\n", > obj_name ? obj_name : oid_to_hex(&data->oid)); > + strbuf_release("ed); > fflush(stdout); > return; > } Interesting. When running "--batch-all-objects", we do not have obj_name, and we do not have anything to "quote", but the fallback label is the full hexadecimal object name, and we do not have to worry about line breaks. OK. > @@ -518,6 +528,13 @@ static void batch_one_object(const char *obj_name, > result = get_oid_with_context(the_repository, obj_name, > flags, &data->oid, &ctx); > if (result != FOUND) { > + struct strbuf quoted = STRBUF_INIT; > + > + if (opt->nul_terminated) { > + quote_c_style(obj_name, "ed, NULL, 0); > + obj_name = quoted.buf; > + } > + > switch (result) { > case MISSING_OBJECT: > printf("%s missing\n", obj_name); > @@ -542,6 +559,8 @@ static void batch_one_object(const char *obj_name, > result); > break; > } > + > + strbuf_release("ed); > fflush(stdout); > return; > } This is the side that runs without "--batch-all-objects" and always has non-null obj_name, so it looks a bit different from the other hunk and is more straight-forward. Looking good. Thanks.
Junio C Hamano <gitster@pobox.com> writes: > Toon Claes <toon@iotcl.com> writes: > >> Since it's supported to have NUL-delimited input, introduced in >> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`, >> 2022-07-22), it's possible to pass paths that contain newlines. > > It has been a while since I saw this patch the last time, and it did > not immediately click to me how "pass paths" relates to passing > object names, which is what "--batch-check" takes. Perhaps > > "cat-file --batch-check" may be fed object names of blobs and > trees as "<commit>:<path>", and with its "-z" option, it is > possible to feed <commit> or <path> with newlines in it. > > or something? Good suggestion. >> This >> works great when the object is found, but when it's not, the input path >> is returned in the error message. Because this can contain newlines, the >> error message might get spread over multiple lines, making it harder to >> machine-parse this error message. > > Good description. I may suggest > > "the input path is returned" -> "the input is shown verbatim" > > because <path> is not the only thing that can contain LF. E.g. > > $ git show -s 'HEAD^{/introduced in > > db9d67}' > > can find the commit resulting from this patch, so > > $ printf "%s\0" 'HEAD^{/introduced in > > db9d67}:builtin/cat-file.cc' | > > git cat-file --batch-check -z Thanks for that thorough explanation, makes a lot of sense. > > would be an input record with newline in it, that has no newline in > the path. > >> With this change, the input is quote-formatted in the error message, if >> needed. This ensures the error message is always on a single line and >> makes parsing the error more straightforward. > > Drop "With this change, ..." and give a command to the codebase to > c-quote the object name in the output, e.g. > > C-quote the object name from the input in the error message as > needed, to ensure that the error message is on a single line and > ... Sure, I'll update the commit message accordingly. > The other side of the coin, however, is that an existing project > that is sane enough not to have a path with LF in it, but is not > sane enough to avoid a path with double-quote in it, would now stop > being able to parse the error message for a missing path. Ah interesting, this is not a case I did consider before. > It is a regression, and we may argue that it is not a large enough > regression to block progress given by this patch, but if it is not > common enough to have funny characters in the paths then we wouldn't > be seeing this patch in the first place. So I would prefer to see > that we at least admit that we are deliberately making this change > with a known regression. Perhaps add something like > > Note that if a project already parses the error message > correctly because it does not have paths with newlines, this is > a breaking change if it has paths with characters that need > c-quoting (like double quotes and backslashes) that are not > newlines. We consider that this is a small enough price to pay > to allow newlines in paths because ... > > and fill the "because ..." part is sensible. I am not filling the > "because" part, simply because I do not offhand see any good excuse > to rob Peter to pay Paul in this case. I intended to finalize this patch sooner, but other priorities popped up. The -z flag was added in v2.38 and it would have been nice to have the patch in v2.40, this would reduce the number of Peters affected. Now we're a couple months and yet another release further between introduction of the flag and making the regression, so I feel your sentiment. I previous conversations[1] we've been talking about making the output NUL-terminated as well. We agreed on the cquote fix because the -z flag was still fresh, but maybe at this time we need to revisit this. Ideally the output should be NUL-terminated if -z is used. This was also suggested[2] when the flag was introduced. Obviously we cannot change this now, because it would break behavior for *everyone* using -z, not only when funny names are used. So if we want to go this route, we should only do so with another flag (e.g. `--null-output`) or a config option. But I was looking at the git-config(1) documentation: > core.quotePath:: > Commands that output paths (e.g. 'ls-files', 'diff'), will > quote "unusual" characters in the pathname by enclosing the > pathname in double-quotes and escaping those characters with > backslashes in the same way C escapes control characters (e.g. > `\t` for TAB, `\n` for LF, `\\` for backslash) or bytes with > values larger than 0x80 (e.g. octal `\302\265` for "micro" in > UTF-8). If this variable is set to false, bytes higher than > 0x80 are not considered "unusual" any more. Double-quotes, > backslash and control characters are always escaped regardless > of the setting of this variable. A simple space character is > not considered "unusual". Many commands can output pathnames > completely verbatim using the `-z` option. The default value > is true. If you read this, the changes of this patch fully contradict this. Also documentation on other commands (e.g. git-check-ignore(1)) using `-z` will mention the verbatim output. So at the moment I'm leaning toward another solution. I'm looping in Taylor as he added the flag initially, so he might have some insights how they are using it and how they expect it to work. -- Toon [1]: https://lore.kernel.org/git/xmqq5yekyvrh.fsf@gitster.g/ [2]: https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/
Toon Claes <toon@iotcl.com> writes: > Ideally the output should be NUL-terminated if -z is used. This was also > suggested[2] when the flag was introduced. Obviously we cannot change > this now, because it would break behavior for *everyone* using -z, not > only when funny names are used. So if we want to go this route, we > should only do so with another flag (e.g. `--null-output`) or a config > option. Yes, `--null-output` came also to my mind. As this new mode of output is for consumption by programs, letting them read NUL-terminated records is a viable, if cumbersome, possibility. > But I was looking at the git-config(1) documentation: > >> core.quotePath:: >> Commands that output paths (e.g. 'ls-files', 'diff'), will >> quote "unusual" characters in the pathname by enclosing the >> pathname in double-quotes and escaping those characters with >> backslashes in the same way C escapes control characters (e.g. >> `\t` for TAB, `\n` for LF, `\\` for backslash) or bytes with >> values larger than 0x80 (e.g. octal `\302\265` for "micro" in >> UTF-8). If this variable is set to false, bytes higher than >> 0x80 are not considered "unusual" any more. Double-quotes, >> backslash and control characters are always escaped regardless >> of the setting of this variable. A simple space character is >> not considered "unusual". Many commands can output pathnames >> completely verbatim using the `-z` option. The default value >> is true. > > If you read this, the changes of this patch fully contradict this. Hmph, I do not quite see where the contradiction is. If you mean "Many commands can output" part, I do not think it applies here. First, your "cat-file" does not have to be a part of "many". More importantly, the mention of `-z` there is about the option accepted by the diff family of commants, e.g. "git diff --name-only -z HEAD^", that is an output record separator. Your "-z" is about the input record separator, and if you are not changing "-z" to suddenly mean both input and output separator to break existing scripts that expect "-z" only applies to input, the above "completely verbatim" does not apply to you. > Also > documentation on other commands (e.g. git-check-ignore(1)) using `-z` > will mention the verbatim output. Again, it is about the output. Stepping back a bit, how big a problem is this in real life? It certainly is possible to create a pathname with funny byte values in it, and in some environments,letters like single-quote that are considered cumbersome to handle by those who are used to CLI programs may be commonplace. But a path with newline? Or any control character for that matter? And this is not even the primary output from the program but is an error message for consumption by humans, no? I am wondering if it is simpler to just declare that the paths output in error messages have certain bytes, probably all control characters other than HT, replaced with a dot, and tell the users not to rely on the pathnames being intact if they contain funny bytes in them. That way, with the definition of "work" being "you can read the path out of error messages that talk about it", paths with bytes that c-quote mechanism butchers, like double quotes and backslashes, that have worked before will not be broken, and paths with LF or CRLF in them that have never worked would not work, but at least does not break the input stream of whoever is reading the error messages line by line. I dunno.
On 12/05/2023 17:57, Junio C Hamano wrote: > Toon Claes <toon@iotcl.com> writes: > Stepping back a bit, how big a problem is this in real life? It > certainly is possible to create a pathname with funny byte values in > it, and in some environments,letters like single-quote that are > considered cumbersome to handle by those who are used to CLI > programs may be commonplace. But a path with newline? Or any > control character for that matter? And this is not even the primary > output from the program but is an error message for consumption by > humans, no? > > I am wondering if it is simpler to just declare that the paths > output in error messages have certain bytes, probably all control > characters other than HT, replaced with a dot, and tell the users > not to rely on the pathnames being intact if they contain funny > bytes in them. We could only c-quote the name when it contains a control character other that HT. That way names containing double quotes and backslashes are unchanged but it will still be possible to parse the path from the error message. If we're going to munge the name we might as well use our standard quoting rather than some ad-hoc scheme. Best Wishes Phillip That way, with the definition of "work" being "you > can read the path out of error messages that talk about it", paths > with bytes that c-quote mechanism butchers, like double quotes and > backslashes, that have worked before will not be broken, and paths > with LF or CRLF in them that have never worked would not work, but > at least does not break the input stream of whoever is reading the > error messages line by line. > > I dunno. > >
Phillip Wood <phillip.wood123@gmail.com> writes: > On 12/05/2023 17:57, Junio C Hamano wrote: >> Toon Claes <toon@iotcl.com> writes: >> Stepping back a bit, how big a problem is this in real life? It >> certainly is possible to create a pathname with funny byte values in >> it, and in some environments,letters like single-quote that are >> considered cumbersome to handle by those who are used to CLI >> programs may be commonplace. But a path with newline? Or any >> control character for that matter? And this is not even the primary >> output from the program but is an error message for consumption by >> humans, no? >> I am wondering if it is simpler to just declare that the paths >> output in error messages have certain bytes, probably all control >> characters other than HT, replaced with a dot, and tell the users >> not to rely on the pathnames being intact if they contain funny >> bytes in them. > > We could only c-quote the name when it contains a control character > other that HT. That way names containing double quotes and backslashes > are unchanged but it will still be possible to parse the path from the > error message. If we're going to munge the name we might as well use > our standard quoting rather than some ad-hoc scheme. In the above suggestion, I gave up and no longer aim to do "quoting". A more appropriate word for the approach is "redacting". The message essentially is: If you use truly problematic bytes in your path, they are redacted (so do not use them if it hurts). This is because I am not sure how "names containing dq and bs are unchanged" can be done without ambiguity. If I see a message that comes out of this: printf("%s missing\n", obj_name); and it looks like "a\nb" missing how do I tell if it is complaining about the object the user named with a three-byte string (i.e. lowercase-A, newline, lowercase-B), or a six-byte string (i.e. dq, lowercase-A, bs, lowercase-N, lowercase-B, dq)? If we were forbidding '"' to appear in a refname, then we could take advantage of the fact that the name of an object inside a tree at a funny path would not start with '"', to disambiguate. For the three- and six-byte string cases above, the formatting function will give these messages (referred to as "sample output" below): "master:a\nb" missing master:"a\nb" missing because of your "we do not exactly do our standard c-quote; we exempt dq and bs from the bytes to be quoted" rule. But it still feels a bit misleading. This codepath may have the whole objectname as a single string so that c-quoting the entire "<commit> <colon> <path>" inside a single c-quoted string that begins with a dq is easy, but not all codepaths are lucky and some may have to show <commit> and <path> separately, concatenated with <colon> at the outermost output layer, which means that the second one from the sample output may still mean the path with three-byte name in the tree of 'master' commit. And worse yet, because git branch '"master' is possible (even though nobody sane would do that), so "treat the string as c-quoted only if the object name as a whole begins with a dq", this disambiguation idea would not work. The first one from the sample output could be the blob at the path with a five-byte string name (i.e. lowercase-A, bs, lowercase-N, lowercase-B, dq) in the tree of the commit at the tip of branch with seven-byte string name (i.e. dq followed by 'master'). So, I dunno.
Hi Junio Sorry for the slow reply, I had intended to reply before but got distracted and forgot about it. On 15/05/2023 18:20, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > >> On 12/05/2023 17:57, Junio C Hamano wrote: >>> Toon Claes <toon@iotcl.com> writes: >>> Stepping back a bit, how big a problem is this in real life? It >>> certainly is possible to create a pathname with funny byte values in >>> it, and in some environments,letters like single-quote that are >>> considered cumbersome to handle by those who are used to CLI >>> programs may be commonplace. But a path with newline? Or any >>> control character for that matter? And this is not even the primary >>> output from the program but is an error message for consumption by >>> humans, no? >>> I am wondering if it is simpler to just declare that the paths >>> output in error messages have certain bytes, probably all control >>> characters other than HT, replaced with a dot, and tell the users >>> not to rely on the pathnames being intact if they contain funny >>> bytes in them. >> >> We could only c-quote the name when it contains a control character >> other that HT. That way names containing double quotes and backslashes >> are unchanged but it will still be possible to parse the path from the >> error message. If we're going to munge the name we might as well use >> our standard quoting rather than some ad-hoc scheme. > > In the above suggestion, I gave up and no longer aim to do > "quoting". A more appropriate word for the approach is "redacting". > The message essentially is: If you use truly problematic bytes in > your path, they are redacted (so do not use them if it hurts). > > This is because I am not sure how "names containing dq and bs are > unchanged" can be done without ambiguity. D'oh, I should have thought of that. You're right it ends up being ambiguous. Anyway Patrick has just posted a patch to add NUL terminated output which looks like a cleaner approach. Best Wishes Phillip > If I see a message that > comes out of this: > > printf("%s missing\n", obj_name); > > and it looks like > > "a\nb" missing > > how do I tell if it is complaining about the object the user named > with a three-byte string (i.e. lowercase-A, newline, lowercase-B), > or a six-byte string (i.e. dq, lowercase-A, bs, lowercase-N, > lowercase-B, dq)? > > If we were forbidding '"' to appear in a refname, then we could take > advantage of the fact that the name of an object inside a tree at a > funny path would not start with '"', to disambiguate. For the > three- and six-byte string cases above, the formatting function will > give these messages (referred to as "sample output" below): > > "master:a\nb" missing > master:"a\nb" missing > > because of your "we do not exactly do our standard c-quote; we > exempt dq and bs from the bytes to be quoted" rule. > > But it still feels a bit misleading. This codepath may have the > whole objectname as a single string so that c-quoting the entire > "<commit> <colon> <path>" inside a single c-quoted string that > begins with a dq is easy, but not all codepaths are lucky and some > may have to show <commit> and <path> separately, concatenated with > <colon> at the outermost output layer, which means that the second > one from the sample output may still mean the path with three-byte > name in the tree of 'master' commit. > > And worse yet, because > > git branch '"master' > > is possible (even though nobody sane would do that), so "treat the > string as c-quoted only if the object name as a whole begins with a > dq", this disambiguation idea would not work. The first one from > the sample output could be the blob at the path with a five-byte > string name (i.e. lowercase-A, bs, lowercase-N, lowercase-B, dq) > in the tree of the commit at the tip of branch with seven-byte > string name (i.e. dq followed by 'master'). > > So, I dunno.
diff --git a/builtin/cat-file.c b/builtin/cat-file.c index 0bafc14e6c..0ff31c4593 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -25,6 +25,7 @@ #include "object-store.h" #include "replace-object.h" #include "promisor-remote.h" +#include "quote.h" #include "mailmap.h" #include "write-or-die.h" @@ -470,8 +471,17 @@ static void batch_object_write(const char *obj_name, &data->oid, &data->info, OBJECT_INFO_LOOKUP_REPLACE); if (ret < 0) { + struct strbuf quoted = STRBUF_INIT; + + if (opt->nul_terminated && + obj_name) { + quote_c_style(obj_name, "ed, NULL, 0); + obj_name = quoted.buf; + } + printf("%s missing\n", obj_name ? obj_name : oid_to_hex(&data->oid)); + strbuf_release("ed); fflush(stdout); return; } @@ -518,6 +528,13 @@ static void batch_one_object(const char *obj_name, result = get_oid_with_context(the_repository, obj_name, flags, &data->oid, &ctx); if (result != FOUND) { + struct strbuf quoted = STRBUF_INIT; + + if (opt->nul_terminated) { + quote_c_style(obj_name, "ed, NULL, 0); + obj_name = quoted.buf; + } + switch (result) { case MISSING_OBJECT: printf("%s missing\n", obj_name); @@ -542,6 +559,8 @@ static void batch_one_object(const char *obj_name, result); break; } + + strbuf_release("ed); fflush(stdout); return; } diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh index 8eac74b59c..03c1bfb86b 100755 --- a/t/t1006-cat-file.sh +++ b/t/t1006-cat-file.sh @@ -447,6 +447,33 @@ test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' test_cmp expect actual ' +test_expect_success '--batch-check, -z with newline in non-existent named object' ' + printf "HEAD:newline${LF}missing" >in && + git cat-file --batch-check -z <in >actual && + + printf "\"HEAD:newline\\\\nmissing\" missing\n" >expect && + test_cmp expect actual +' + +test_expect_success FUNNYNAMES '--batch-check, -z with missing object having newline in name' ' + git init missing-object-newline && + ( + cd missing-object-newline && + file="newline${LF}embedded" && + echo_without_newline "hello" > $file && + git add "$file" && + git commit -m "file with newline embedded" && + test_tick && + + sha1=$(git rev-parse HEAD:"$file") && + rm .git/objects/$(test_oid_to_path $sha1) && + printf "HEAD:$file" >in && + git cat-file --batch-check -z <in >actual && + printf "\"HEAD:newline\\\\nembedded\" missing\n" >expect && + test_cmp expect actual + ) +' + batch_command_multiple_info="info $hello_sha1 info $tree_sha1 info $commit_sha1
Since it's supported to have NUL-delimited input, introduced in db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`, 2022-07-22), it's possible to pass paths that contain newlines. This works great when the object is found, but when it's not, the input path is returned in the error message. Because this can contain newlines, the error message might get spread over multiple lines, making it harder to machine-parse this error message. With this change, the input is quote-formatted in the error message, if needed. This ensures the error message is always on a single line and makes parsing the error more straightforward. Signed-off-by: Toon Claes <toon@iotcl.com> --- builtin/cat-file.c | 19 +++++++++++++++++++ t/t1006-cat-file.sh | 27 +++++++++++++++++++++++++++ 2 files changed, 46 insertions(+)