Message ID | pull.959.git.1621763612.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | ref-filter: add contents:raw atom | expand |
"ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes: > In (a2f3241: [GSOC] ref-filter: add contents:raw atom) I did not notice the > ... Before going into any of these details, remember that the cover letter is where you first sell the series to the readers. Why is it worth their time to read it? What problem does it solve in the bigger picture? Mention that we want to let "cat-file --batch" use the ref-filter --format logic, if that is the primary reason why we have these three patches, for example. I actually do not know if a modified form of %(contents) is a good match for this feature. It was invented as a way to extract the unstructured "free text" part of structured textual objects (namely, commits and tags) so it is natural that it would not apply to trees (they are structured and there is no unstractured "free text" part) nor blobs (they are totally unstructured and does not even have to be text). Is there another word to refer to the entire payload unadulterated? > git for-each-ref --format="%(contents)" --python refs/mytrees/first > > will output a string processed by python_quote_buf_with_size(), which > contains'\0'. But the binary files seem to be useless after quoting. Should > we allow these binary files to be output in the default way with > strbuf_add()? If so, we can remove the first patch. The --language option is designed to be used to write a small script in the language and used like this: git for-each-ref --format=' name=%(refname) var=%(placeholder) mkdir -p "$(dirname "$name")" printf "%%s" "$var" >"$name" ' --shell | /bin/sh Note that %(refname) and %(placeholder) in the --format string is not quoted at all; the "--shell" option knows how values are quoted in the host language (shell) and writes single-quotes around %(refname). If %(placeholder) produces something with a single-quote in it, that will (eh, at least "should") be quoted appropriately. So It does not make any sense not to quote a value that comes from %(placeholder), whether it is binary or not, to match the syntax of the host language you are making the "for-each-ref --format=" to write such a script in. So, "binary files seem to be useless after quoting" is a misunderstanding. They are useless if you do not quote them. Thanks. P.S. I am mostly offline today, and response will be slow.
Junio C Hamano wrote:
> Is there another word to refer to the entire payload unadulterated?
Another word better than "raw"? I don't think so: basic, crude, green,
rough.
On 24/05/21 09.41, Felipe Contreras wrote: > Junio C Hamano wrote: > >> Is there another word to refer to the entire payload unadulterated? > > Another word better than "raw"? I don't think so: basic, crude, green, > rough. > I think we should go with "raw", because the payloads we discussed here are unmodified. It is akin to "raw data" that is processed further to produce porcelain output, such as templating engine that process raw HTML into HTML pages that would be served to end user.
Junio C Hamano <gitster@pobox.com> 于2021年5月24日周一 上午9:09写道: > > "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > In (a2f3241: [GSOC] ref-filter: add contents:raw atom) I did not notice the > > ... > > Before going into any of these details, remember that the cover > letter is where you first sell the series to the readers. Why is it > worth their time to read it? What problem does it solve in the > bigger picture? Mention that we want to let "cat-file --batch" use > the ref-filter --format logic, if that is the primary reason why we > have these three patches, for example. > Thanks for reminding, Indeed, as you said, this cover-letter needs to explain the patches' main purpose: let "cat-file --batch" use the ref-filter --format logic. > I actually do not know if a modified form of %(contents) is a good > match for this feature. It was invented as a way to extract the > unstructured "free text" part of structured textual objects (namely, > commits and tags) so it is natural that it would not apply to trees > (they are structured and there is no unstractured "free text" part) > nor blobs (they are totally unstructured and does not even have to > be text). > This is exactly what I worry about: Does %(contents) have good semantics to explain things like blob, tree objects which have no unstractured "free text" part or totally unstructured? > Is there another word to refer to the entire payload unadulterated? > If you specifically mean "raw" in %(contents:raw), I agree with Felipe Contreras or Bagas Sanjaya, "raw" seems to be enough. But if you mean we need another atom name, then "%(binary)", "%(text)" may be possible choices. There is a good word in `ref-filter.c`: C_BARE, but this enum vale corresponds to %(contents). Therefore, I cannot use the name "%(contents:bare)". > > git for-each-ref --format="%(contents)" --python refs/mytrees/first > > > > will output a string processed by python_quote_buf_with_size(), which > > contains'\0'. But the binary files seem to be useless after quoting. Should > > we allow these binary files to be output in the default way with > > strbuf_add()? If so, we can remove the first patch. > > The --language option is designed to be used to write a small script > in the language and used like this: > > git for-each-ref --format=' > name=%(refname) > var=%(placeholder) > mkdir -p "$(dirname "$name")" > printf "%%s" "$var" >"$name" > ' --shell | /bin/sh > > Note that %(refname) and %(placeholder) in the --format string is > not quoted at all; the "--shell" option knows how values are quoted > in the host language (shell) and writes single-quotes around > %(refname). If %(placeholder) produces something with a single-quote > in it, that will (eh, at least "should") be quoted appropriately. > Well, now I have also noticed that --language can make these atoms be surrounded by single quotes, special characters in the output will be escaped e.g. "a'b" will turn to "'a\'b'", In this way, shell, python... can directly quote the content. > So It does not make any sense not to quote a value that comes from > %(placeholder), whether it is binary or not, to match the syntax of > the host language you are making the "for-each-ref --format=" to > write such a script in. > > So, "binary files seem to be useless after quoting" is a > misunderstanding. They are useless if you do not quote them. > Now I agree that It seems that the content of the blob or tree is meaningful after being quoted. > Thanks. > > P.S. I am mostly offline today, and response will be slow. > Thanks! -- ZheNing Hu
Bagas Sanjaya <bagasdotme@gmail.com> writes: > I think we should go with "raw", because the payloads we discussed here > are unmodified. It is akin to "raw data" that is processed further to > produce porcelain output, such as templating engine that process raw > HTML into HTML pages that would be served to end user. Oh, I can certainly live with --format='%(raw)'; I just view that %(contents:raw) is problematic, especially if we meant to apply it to trees and blobs, because they are not what %(contents) is about. Thanks.
Junio C Hamano <gitster@pobox.com> writes: >> git for-each-ref --format="%(contents)" --python refs/mytrees/first >> >> will output a string processed by python_quote_buf_with_size(), which >> contains'\0'. But the binary files seem to be useless after quoting. Should >> we allow these binary files to be output in the default way with >> strbuf_add()? If so, we can remove the first patch. > > The --language option is designed to be used to write a small script > in the language and used like this: > > git for-each-ref --format=' > name=%(refname) > var=%(placeholder) > mkdir -p "$(dirname "$name")" > printf "%%s" "$var" >"$name" > ' --shell | /bin/sh > > Note that %(refname) and %(placeholder) in the --format string is > not quoted at all; the "--shell" option knows how values are quoted > in the host language (shell) and writes single-quotes around > %(refname). If %(placeholder) produces something with a single-quote > in it, that will (eh, at least "should") be quoted appropriately. > > So It does not make any sense not to quote a value that comes from > %(placeholder), whether it is binary or not, to match the syntax of > the host language you are making the "for-each-ref --format=" to > write such a script in. > > So, "binary files seem to be useless after quoting" is a > misunderstanding. They are useless if you do not quote them. Another thing to keep in mind is that not all host languages may be capable of expressing a string with NUL in it. Most notably, shell. The --shell quoting rule used by for-each-ref would produce an equivalent of the "script" produced like this: $ tr Q '\000' >script <<\EOF #!/bin/sh varname='varQname' echo "$varname" EOF but I do not think it would say 'var' followed by a NUL followed by 'name'. The NUL is likely lost when assigned to the variable. So for some host languages, binaries may be useless with or without quoting. But for ones that can use strings to hold arbitrary byte sequence, it should be OK to let for-each-ref to quote the byte sequence as a string literal for the language (so that the exact byte sequence will end up being in the variable after assignment). That reminds me of another thing. The --python thing was written back when Python3 was still a distant dream and strings were the appropriate type for a random sequence of bytes (as opposed to unicode, which cannot have a random sequence of bytes). Somebody needs to check if it needs any update to work with Python3.
> > Another thing to keep in mind is that not all host languages may be > capable of expressing a string with NUL in it. Most notably, shell. > The --shell quoting rule used by for-each-ref would produce an > equivalent of the "script" produced like this: > > $ tr Q '\000' >script <<\EOF > #!/bin/sh > varname='varQname' > echo "$varname" > EOF > > but I do not think it would say 'var' followed by a NUL followed by > 'name'. The NUL is likely lost when assigned to the variable. > Yes, in the following example you mentioned earlier, I have also noticed the loss of '\0'. > > git for-each-ref --format=' > > name=%(refname) > > var=%(placeholder) > > mkdir -p "$(dirname "$name")" > > printf "%%s" "$var" >"$name" > > ' --shell | /bin/sh > > > So for some host languages, binaries may be useless with or without > quoting. But for ones that can use strings to hold arbitrary byte > sequence, it should be OK to let for-each-ref to quote the byte > sequence as a string literal for the language (so that the exact > byte sequence will end up being in the variable after assignment). > I agree, and maybe some'\0' can be escaped appropriately to let host languages recognize it.... > That reminds me of another thing. The --python thing was written > back when Python3 was still a distant dream and strings were the > appropriate type for a random sequence of bytes (as opposed to > unicode, which cannot have a random sequence of bytes). Somebody > needs to check if it needs any update to work with Python3. $ printf '%b' "name='a\\\0b\\\0c'\nprint(name)" | python2.7 | od -c 0000000 a \0 b \0 c \n 0000006 $ printf '%b' "name='a\\\0b\\\0c'\necho -e \"\$name\"" | sh | od -c 0000000 a \0 b \0 c \n 0000006 In shell or python2/3, we can replace'\0' with "\\0". In Tcl and perl, they are ok with '\0'. $ printf '%b' "set name \"a\0b\0c\"\nputs \$name" | tclsh | od -c 0000000 ' a \0 b \0 c ' \n 0000010 $ printf '%b' "\$name = 'a\0b\0c';\n print \"\$name\"" | perl | od -c 0000000 a \0 b \0 c 0000005 So I currently think that a possible strategy is to modify `python_quote_buf_with_size()` and `sq_quote_buf_with_size()` from '\0' to "\\0". Thanks! -- ZheNing Hu
ZheNing Hu <adlternative@gmail.com> writes: > $ printf '%b' "name='a\\\0b\\\0c'\necho -e \"\$name\"" | sh | od -c > 0000000 a \0 b \0 c \n > 0000006 This is wrong. In the above, the variable name does not have a NUL in it. It insead has 2-byte sequence "\" and "0" in it, and you are letting "echo -e" to convert it into binary, which is not portable at all. I'd suggest instead to declare that some host languages, like shell, are not binary-clean and either refuse to process atoms whose values have NUL in them. Silently truncating strings at NUL or striping NULs in strings are also acceptable options if clearly documented. Claiming that we stuff binaries into variables of the host language, while not doing so and instead assigning a quoted form, is not good. I have not thought about Python3 very much. For the purpose of most %(placeholders), it is vastly more preferrable to use str (i.e. text sequence type) as opposed to bytes, as you do not have to .decode() to use the resulting "string", but even for things like %(refname), it is not technically kosher to assume that the contents are UTF-8 encoded text, as filenames used to represent refnames are merely a sequence of bytes except NUL, but for consistency with binary gunk, we might have to emit everything as bytes. I dunno. > In shell or python2/3, we can replace'\0' with "\\0". Not for shell. We should declare that it is not supported to feed binary to shell.
Junio C Hamano <gitster@pobox.com> 于2021年5月26日周三 下午3:06写道: > > ZheNing Hu <adlternative@gmail.com> writes: > > > $ printf '%b' "name='a\\\0b\\\0c'\necho -e \"\$name\"" | sh | od -c > > 0000000 a \0 b \0 c \n > > 0000006 > > This is wrong. In the above, the variable name does not have a NUL > in it. It insead has 2-byte sequence "\" and "0" in it, and you are > letting "echo -e" to convert it into binary, which is not portable > at all. > Indeed I was wrong, the var name does not contain '\0'. > I'd suggest instead to declare that some host languages, like shell, > are not binary-clean and either refuse to process atoms whose values > have NUL in them. Silently truncating strings at NUL or striping > NULs in strings are also acceptable options if clearly documented. > Claiming that we stuff binaries into variables of the host language, > while not doing so and instead assigning a quoted form, is not good. > Makes sense. Either choose to truncate, or choose to reject. > I have not thought about Python3 very much. For the purpose of most > %(placeholders), it is vastly more preferrable to use str (i.e. text > sequence type) as opposed to bytes, as you do not have to .decode() > to use the resulting "string", but even for things like %(refname), > it is not technically kosher to assume that the contents are UTF-8 > encoded text, as filenames used to represent refnames are merely a > sequence of bytes except NUL, but for consistency with binary gunk, > we might have to emit everything as bytes. I dunno. > > > In shell or python2/3, we can replace'\0' with "\\0". > > Not for shell. We should declare that it is not supported to feed > binary to shell. Eh, it seems that we adopt a "reject" strategy. $ git hash-object a.out -w | xargs git update-ref refs/myblobs/aoutblob $ git for-each-ref --format="name=%(raw)" refs/myblobs/aoutblob --python | python2 File "<stdin>", line 1 SyntaxError: Non-ASCII character '\x8b' in file <stdin> on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details $ git for-each-ref --format="name=%(raw)" refs/myblobs/aoutblob --python |python3 SyntaxError: Non-UTF-8 code starting with '\x8b' in file <stdin> on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details It seems that --python also needs to "reject", no matter python2 or python3. What about tcl and perl? $ cat a.out | od >1.od $ git for-each-ref --format="set name %(raw) puts -nonewline \$name" refs/myblobs/aoutblob --tcl | tclsh | od > 2.od $ diff 1.od 2.od | head 7,12c7,12 < 0000140 114303 000002 000000 000000 141400 001230 000000 000000 < 0000160 000000 000010 000000 000000 000000 000003 000000 000004 < 0000200 000000 001430 000000 000000 000000 001430 000000 000000 < 0000220 000000 001430 000000 000000 000000 000034 000000 000000 < 0000240 000000 000034 000000 000000 000000 000001 000000 000000 < 0000260 000000 000001 000000 000004 000000 000000 000000 000000 --- > 0000140 001330 000000 000000 000000 001330 000000 000000 000000 > 0000160 000010 000000 000000 000000 000003 000000 000004 000000 > 0000200 001430 000000 000000 000000 001430 000000 000000 000000 > 0000220 001430 000000 000000 000000 000034 000000 000000 000000 > 0000240 000034 000000 000000 000000 000001 000000 000000 000000 > 0000260 000001 000000 000004 000000 000000 000000 000000 000000 It seems that a.out contents passed into tcl and then the output is very different... But, $ cat a.out | od >1.od $ git for-each-ref --format="\$name= %(raw); print \"\$name\"" refs/myblobs/aoutblob --perl | perl | od >6.od $ diff 1.od 2.od There was no error this time, so for perl, it's ok... The "binary security" we care about is currently only complied with by the Perl language. So I think we better reject them all languages together for normative. The clear definition of this rejection strategy is that %(raw) and --language cannot be used at the same time. If our binary data is passed to a variable in the host language, there may be escape errors for the host language. Thanks. -- ZheNing Hu