Message ID | 2247912.lYO0ccLKhl@localhost.localdomain (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] pretty-options.txt: describe supported encoding |
Krzysztof Żelechowski <giecrilj@stegny.2a.pl> writes:

> git log recognises only system encodings supported by iconv(1), but not
> POSIX character maps used by iconv(1p). Document it.
>
> Signed-off-by: <ne01026@shark.2a.pl>

The "Human Readable Name <email@add.re.ss>" on this line must match
the one on the "From: " line that records the author of the patch.

If you are forwarding somebody else's patch (with or without
improvement), we also need your sign off.

> diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-
> options.txt
> index 27ddaf84a19..4f8376d681b 100644
> --- a/Documentation/pretty-options.txt
> +++ b/Documentation/pretty-options.txt
> @@ -36,9 +36,13 @@ people using 80-column terminals.
>  	The commit objects record the encoding used for the log message
>  	in their encoding header; this option can be used to tell the
>  	command to re-code the commit log message in the encoding
> -	preferred by the user. For non plumbing commands this
> -	defaults to UTF-8. Note that if an object claims to be encoded
> -	in `X` and we are outputting in `X`, we will output the object
> +	preferred by the user.
> +	The encoding must be a system encoding supported by iconv(1),
> +	otherwise this option will be ignored.
> +	POSIX character maps used by iconv(1p) are not supported.

This paragraph is a bit hard to grok.

I think it is saying that the "-f frommap -t tomap" form in [*1*]
that can use arbitrary character set description file is not
supported, but "-f fromcode -t tocode" form, which also is what
iconv_open() takes [*2*], is supported. Am I reading it correctly?

Is there an easier-to-read way to explain the distinction to our
average reader?

What I am getting at is this. Imagine average users who need to see
their commits recoded to iso-8859-2. They see "git log" has
"--encoding=<encoding>" option, read the above paragraph and wonder
if they are on the supported side or unsupported side of the above
paragraph. I want to make it easy for them to stop wondering.

For that purpose, "iconv(1) vs iconv(1p)" would not help them very
much, especially considering that not all Git users are UNIX users
(they probably do not even know what (1) and (1p) means).

> +	For non-plumbing commands this defaults to UTF-8.

I think I can guess why the patch wants to change "non plumbing" to
"non-plumbing" (I do not strongly care either way, so I'd take the
patch without complaint about that particular change). It would
have been nicer to mention this change in the proposed commit log
message, though, but that is minor.

> +	Note that if an object claims to be encoded in `X`
> +	and we are outputting in `X`, we shall output the object
>  	verbatim; this means that invalid sequences in the original
>  	commit may be copied to the output.

I probably wouldn't have noticed this if a new manual page used
"shall" consistently, but since the original deliberately used
"will" and the patch changes it to "shall", I have to ask: why?

I think our end-user facing manual pages tend to avoid the latter.
We do use "shall" in the RFC2119/BCP14 sense on the technical side
of our documentation where we give requirements to the third-party
implementations so that they can interoperate with us, but this is
not such a description.

Thanks.

[References]

*1* https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
*2* https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_open.html
On Fri, Aug 27, 2021 at 10:03:56AM -0700, Junio C Hamano wrote:

> > +	The encoding must be a system encoding supported by iconv(1),
> > +	otherwise this option will be ignored.
> > +	POSIX character maps used by iconv(1p) are not supported.
>
> This paragraph is a bit hard to grok.
>
> I think it is saying that the "-f frommap -t tomap" form in [*1*]
> that can use arbitrary character set description file is not
> supported, but "-f fromcode -t tocode" form, which also is what
> iconv_open() takes [*2*], is supported. Am I reading it correctly?
>
> Is there an easier-to-read way to explain the distinction to our
> average reader?
>
> What I am getting at is this. Imagine average users who need to see
> their commits recoded to iso-8859-2. They see "git log" has
> "--encoding=<encoding>" option, read the above paragraph and wonder
> if they are on the supported side or unsupported side of the above
> paragraph. I want to make it easy for them to stop wondering.
>
> For that purpose, "iconv(1) vs iconv(1p)" would not help them very
> much, especially considering that not all Git users are UNIX users
> (they probably do not even know what (1) and (1p) means).

I likewise found the mention of character maps confusing. If we were to
refer to anything, it would be iconv(3) or iconv_open(3).

But really, all of the discussion that led to this patch seemed to be
about the distinction between "character set conversion" (or "character
encoding", or "codeset conversion", all terms used by the POSIX pages)
and the syntactic encoding of HTML.

Is there any version of iconv that would convert "<" to "&lt;"? I guess
that _conceptually_ one could think of that as a multi-byte character
conversion, but it seems to me that it is generally considered a layer
above (after all, the original "<" and characters in the HTML entity
have to be in some character encoding; generally ASCII, but I think you
could have UTF-16 HTML, too).

What I'm getting at is that maybe we just need to use a less generic
word than "encoding". Perhaps just s/encoding/character &/ or something?
And maybe add something like:

  Conversions are done using the system iconv(3) function. The set of
  available encodings will depend on your system.

You _can_ use "iconv -l" to get such a list on many systems, but it is
not even necessarily the same list.

I also wonder if other mentions of encoding would want to use the same
term (e.g., gitattributes working-tree-encoding), and of course
i18n.commitEncoding (though peeking at the latter, it seems to already
say "Character encoding", so maybe that is sufficient).

-Peff
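
The distinction drawn above between character encoding conversion and HTML escaping can be illustrated with a short sketch. This is only an analogy: it uses Python's built-in codecs and `html.escape`, not iconv(3) and not anything Git does, and the sample string and function names are made up for illustration.

```python
import html

SAMPLE = "a < ż"  # illustrative text, not taken from the thread


def recode(text: str, encoding: str) -> bytes:
    """Character encoding conversion: same characters, different bytes."""
    return text.encode(encoding)


def escape_for_html(text: str) -> str:
    """Syntactic escaping: some characters are replaced by entities."""
    return html.escape(text)


utf8_bytes = recode(SAMPLE, "utf-8")         # "ż" becomes two bytes
latin2_bytes = recode(SAMPLE, "iso-8859-2")  # "ż" becomes one byte

# Both byte strings decode back to exactly the same characters,
# and "<" is untouched by either conversion.
assert utf8_bytes.decode("utf-8") == latin2_bytes.decode("iso-8859-2")

# HTML escaping works on characters and is independent of the byte
# encoding chosen afterwards; here "<" turns into an entity.
escaped = escape_for_html(SAMPLE)
print(utf8_bytes, latin2_bytes, escaped)
```

In this sketch the two steps compose rather than compete: the escaped string would still have to be written out in some character encoding, which matches the "layer above" description in the message.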
On Friday, 27 August 2021 19:03:56 CEST, Junio C Hamano wrote:

> > +	The encoding must be a system encoding supported by iconv(1),
> > +	otherwise this option will be ignored.
> > +	POSIX character maps used by iconv(1p) are not supported.
>
> This paragraph is a bit hard to grok.
>
> I think it is saying that the "-f frommap -t tomap" form in [*1*]
> that can use arbitrary character set description file is not
> supported, but "-f fromcode -t tocode" form, which also is what
> iconv_open() takes [*2*], is supported. Am I reading it correctly?

Yes

>
> Is there an easier-to-read way to explain the distinction to our
> average reader?

It is not our job to explain what POSIX character maps are. The takeaway
is they are unsupported; if you do not know what they are, why should you
bother?

>
> What I am getting at is this. Imagine average users who need to see
> their commits recoded to iso-8859-2. They see "git log" has
> "--encoding=<encoding>" option, read the above paragraph and wonder
> if they are on the supported side or unsupported side of the above
> paragraph. I want to make it easy for them to stop wondering.
>
> For that purpose, "iconv(1) vs iconv(1p)" would not help them very
> much, especially considering that not all Git users are UNIX users
> (they probably do not even know what (1) and (1p) means).

I am sorry, as a UNIX user I have no idea what iconv, being part of the
GNU C library, means and how it works on a non-UNIX system that does not
contain one. If you know, could you enlighten us please?

> I think our end-user facing manual pages tend to avoid the latter.
> We do use "shall" in the RFC2119/BCP14 sense on the technical side
> of our documentation where we give requirements to the third-party
> implementations so that they can interoperate with us, but this is
> not such a description.
>
> Thanks.

I shall revert it after we have come to an agreement about the POSIX
stuff.

BR,
Chris
diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-
options.txt
index 27ddaf84a19..4f8376d681b 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -36,9 +36,13 @@ people using 80-column terminals.
 	The commit objects record the encoding used for the log message
 	in their encoding header; this option can be used to tell the
 	command to re-code the commit log message in the encoding
-	preferred by the user. For non plumbing commands this
-	defaults to UTF-8. Note that if an object claims to be encoded
-	in `X` and we are outputting in `X`, we will output the object
+	preferred by the user.
+	The encoding must be a system encoding supported by iconv(1),
+	otherwise this option will be ignored.
+	POSIX character maps used by iconv(1p) are not supported.
+	For non-plumbing commands this defaults to UTF-8.
+	Note that if an object claims to be encoded in `X`
+	and we are outputting in `X`, we shall output the object
 	verbatim; this means that invalid sequences in the original
 	commit may be copied to the output.
git log recognises only system encodings supported by iconv(1), but not
POSIX character maps used by iconv(1p). Document it.

Signed-off-by: <ne01026@shark.2a.pl>
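
The documentation paragraph touched by this patch says that when a commit claims encoding `X` and the output encoding is also `X`, the message is emitted verbatim, so invalid byte sequences can pass through. A minimal sketch of that rule follows. It is not Git's implementation: `recode_message` is a hypothetical helper, Python's codecs stand in for iconv(3), and the case-insensitive comparison of encoding names is an assumption made for the example.

```python
def recode_message(raw: bytes, commit_encoding: str, output_encoding: str) -> bytes:
    """Hypothetical helper illustrating the documented pass-through rule."""
    # Assumption: encoding names are compared case-insensitively.
    if commit_encoding.lower() == output_encoding.lower():
        # Same encoding on both sides: return the bytes untouched,
        # without checking that they are valid in that encoding.
        return raw
    # Different encodings: decode from the claimed one, encode to the target.
    return raw.decode(commit_encoding).encode(output_encoding)


# b"caf\xe9" is not valid UTF-8, yet it is passed through unchanged
# because no conversion (and hence no validation) takes place.
broken = b"caf\xe9"
passed_through = recode_message(broken, "UTF-8", "utf-8")

# A real conversion: the ISO-8859-2 byte 0xBF is re-coded to UTF-8.
converted = recode_message(b"\xbf", "ISO-8859-2", "UTF-8")
print(passed_through, converted)
```

The first call shows why the documentation warns about invalid sequences: skipping the conversion also skips the only step that would have noticed them.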