[v2] pretty-options.txt: describe supported encoding

Message ID 2247912.lYO0ccLKhl@localhost.localdomain (mailing list archive)
State New, archived
Series [v2] pretty-options.txt: describe supported encoding

Commit Message

Krzysztof Żelechowski Aug. 27, 2021, 11:51 a.m. UTC
git log recognises only system encodings supported by iconv(1), but not 
POSIX character maps used by iconv(1p). Document it.

Signed-off-by:  <ne01026@shark.2a.pl>

Comments

Junio C Hamano Aug. 27, 2021, 5:03 p.m. UTC | #1
Krzysztof Żelechowski <giecrilj@stegny.2a.pl> writes:

> git log recognises only system encodings supported by iconv(1), but not 
> POSIX character maps used by iconv(1p). Document it.
>
> Signed-off-by:  <ne01026@shark.2a.pl>

The "Human Readable Name <email@add.re.ss>" on this line must match
the one on the "From: " line that records the author of the patch.

If you are forwarding somebody else's patch (with or without
improvement), we also need your sign off.

> diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
> index 27ddaf84a19..4f8376d681b 100644
> --- a/Documentation/pretty-options.txt
> +++ b/Documentation/pretty-options.txt
> @@ -36,9 +36,13 @@ people using 80-column terminals.
>         The commit objects record the encoding used for the log message
>         in their encoding header; this option can be used to tell the
>         command to re-code the commit log message in the encoding
> -       preferred by the user.  For non plumbing commands this
> -       defaults to UTF-8. Note that if an object claims to be encoded
> -       in `X` and we are outputting in `X`, we will output the object
> +       preferred by the user.


> +       The encoding must be a system encoding supported by iconv(1),
> +       otherwise this option will be ignored.
> +       POSIX character maps used by iconv(1p) are not supported.

This paragraph is a bit hard to grok.

I think it is saying that the "-f frommap -t tomap" form in [*1*]
that can use arbitrary character set description file is not
supported, but "-f fromcode -t tocode" form, which also is what
iconv_open() takes [*2*], is supported.  Am I reading it correctly?

Is there an easier-to-read way to explain the distinction to our
average reader?

What I am getting at is this.  Imagine average users who need to see
their commits recoded to iso-8859-2.  They see "git log" has
"--encoding=<encoding>" option, read the above paragraph and wonder
if they are on the supported side or unsupported side of the above
paragraph.  I want to make it easy for them to stop wondering.

For that purpose, "iconv(1) vs iconv(1p)" would not help them very
much, especially considering that not all Git users are UNIX users
(they probably do not even know what (1) and (1p) mean).

> +       For non-plumbing commands this defaults to UTF-8.

I think I can guess why the patch wants to change "non plumbing" to
"non-plumbing" (I do not strongly care either way, so I'd take the
patch without complaint about that particular change).  It would
have been nicer to mention this change in the proposed commit log
message, though, but that is minor.

> +       Note that if an object claims to be encoded in `X`
> +       and we are outputting in `X`, we shall output the object
>         verbatim; this means that invalid sequences in the original
>         commit may be copied to the output.

I probably wouldn't have noticed this if a new manual page used
"shall" consistently, but since the original deliberately used
"will" and the patch changes it to "shall", I have to ask: why?

I think our end-user facing manual pages tend to avoid the latter.
We do use "shall" in the RFC2119/BCP14 sense on the technical side
of our documentation where we give requirements to the third-party
implementations so that they can interoperate with us, but this is
not such a description.

Thanks.


[References]

*1* https://pubs.opengroup.org/onlinepubs/9699919799/utilities/iconv.html
*2* https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_open.html

Jeff King Aug. 27, 2021, 6:03 p.m. UTC | #2
On Fri, Aug 27, 2021 at 10:03:56AM -0700, Junio C Hamano wrote:

> > +       The encoding must be a system encoding supported by iconv(1),
> > +       otherwise this option will be ignored.
> > +       POSIX character maps used by iconv(1p) are not supported.
> 
> This paragraph is a bit hard to grok.
> 
> I think it is saying that the "-f frommap -t tomap" form in [*1*]
> that can use arbitrary character set description file is not
> supported, but "-f fromcode -t tocode" form, which also is what
> iconv_open() takes [*2*], is supported.  Am I reading it correctly?
> 
> Is there an easier-to-read way to explain the distinction to our
> average reader?
> 
> What I am getting at is this.  Imagine average users who need to see
> their commits recoded to iso-8859-2.  They see "git log" has
> "--encoding=<encoding>" option, read the above paragraph and wonder
> if they are on the supported side or unsupported side of the above
> paragraph.  I want to make it easy for them to stop wondering.
> 
> For that purpose, "iconv(1) vs iconv(1p)" would not help them very
> much, especially considering that not all Git users are UNIX users
> (they probably do not even know what (1) and (1p) mean).

I likewise found the mention of character maps confusing. If we were to
refer to anything, it would be iconv(3) or iconv_open(3). But really,
all of the discussion that led to this patch seemed to be about the
distinction between "character set conversion" (or "character encoding",
or "codeset conversion", all terms used by the POSIX pages) and the
syntactic encoding of HTML.

Is there any version of iconv that would convert "<" to "&lt;"?

I guess that _conceptually_ one could think of that as a multi-byte
character conversion, but it seems to me that it is generally considered
a layer above (after all, the original "<" and characters in the HTML
entity have to be in some character encoding; generally ASCII, but I
think you could have UTF-16 HTML, too).
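
To make the layering point concrete, here is a minimal Python sketch. It is only an illustration and says nothing about how Git or iconv are implemented; Python's built-in codecs stand in for a character set converter, and the variable names are made up for the example. Converting between character encodings changes the bytes that represent "<" but leaves it the same character, whereas HTML escaping replaces the character itself and the result can then be stored in any character encoding.

```python
import html

text = "a < b"

# Character set conversion: the same characters, different byte
# representations. "<" stays "<" in every target encoding.
latin2_bytes = text.encode("iso-8859-2")
utf16_bytes = text.encode("utf-16-le")

# HTML escaping: a syntactic transformation on characters, independent
# of whichever character encoding the result is later stored in.
escaped = html.escape(text)

# The escaped text is itself just characters, so it can be written out
# in any encoding, including UTF-16.
escaped_utf16 = escaped.encode("utf-16-le")

print(latin2_bytes, utf16_bytes, escaped, escaped_utf16, sep="\n")
```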

What I'm getting at is that maybe we just need to use a less generic
word than "encoding". Perhaps just s/encoding/character &/ or something?
And maybe add something like:

  Conversions are done using the system iconv(3) function. The set of
  available encodings will depend on your system.

You _can_ use "iconv -l" to get such a list on many systems, but it is
not even necessarily the same list.

I also wonder if other mentions of encoding would want to use the same
term (e.g., gitattributes working-tree-encoding), and of course
i18n.commitEncoding (though peeking at the latter, it seems to already
say "Character encoding", so maybe that is sufficient).

-Peff

Krzysztof Żelechowski Aug. 27, 2021, 11:20 p.m. UTC | #3
On Friday, 27 August 2021 19:03:56 CEST, Junio C Hamano wrote:
> > +       The encoding must be a system encoding supported by iconv(1),
> > +       otherwise this option will be ignored.
> > +       POSIX character maps used by iconv(1p) are not supported.
> 
> This paragraph is a bit hard to grok.
> 
> I think it is saying that the "-f frommap -t tomap" form in [*1*]
> that can use arbitrary character set description file is not
> supported, but "-f fromcode -t tocode" form, which also is what
> iconv_open() takes [*2*], is supported.  Am I reading it correctly?

Yes

> 
> Is there an easier-to-read way to explain the distinction to our
> average reader?

It is not our job to explain what POSIX character maps are.  The takeaway is 
they are unsupported; if you do not know what they are, why should you bother?

> 
> What I am getting at is this.  Imagine average users who need to see
> their commits recoded to iso-8859-2.  They see "git log" has
> "--encoding=<encoding>" option, read the above paragraph and wonder
> if they are on the supported side or unsupported side of the above
> paragraph.  I want to make it easy for them to stop wondering.
> 
> For that purpose, "iconv(1) vs iconv(1p)" would not help them very
> much, especially considering that not all Git users are UNIX users
> (they probably do not even know what (1) and (1p) mean).

I am sorry, as a UNIX user I have no idea what iconv, being part of the GNU C 
library, means and how it works on a non-UNIX system that does not contain 
one.  If you know, could you enlighten us please?

> I think our end-user facing manual pages tend to avoid the latter.
> We do use "shall" in the RFC2119/BCP14 sense on the technical side
> of our documentation where we give requirements to the third-party
> implementations so that they can interoperate with us, but this is
> not such a description.
> 
> Thanks.

I shall revert it after we have come to an agreement about the POSIX stuff.

BR,
Chris

Patch

diff --git a/Documentation/pretty-options.txt b/Documentation/pretty-options.txt
index 27ddaf84a19..4f8376d681b 100644
--- a/Documentation/pretty-options.txt
+++ b/Documentation/pretty-options.txt
@@ -36,9 +36,13 @@  people using 80-column terminals.
        The commit objects record the encoding used for the log message
        in their encoding header; this option can be used to tell the
        command to re-code the commit log message in the encoding
-       preferred by the user.  For non plumbing commands this
-       defaults to UTF-8. Note that if an object claims to be encoded
-       in `X` and we are outputting in `X`, we will output the object
+       preferred by the user.
+       The encoding must be a system encoding supported by iconv(1),
+       otherwise this option will be ignored.
+       POSIX character maps used by iconv(1p) are not supported.
+       For non-plumbing commands this defaults to UTF-8.
+       Note that if an object claims to be encoded in `X`
+       and we are outputting in `X`, we shall output the object
        verbatim; this means that invalid sequences in the original
        commit may be copied to the output.
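
The behaviour this hunk describes can be modelled with a short Python sketch. This is a toy model under stated assumptions, not Git's code: the function name `recode_log_message` is hypothetical, and Python's codec registry stands in for the system's `iconv_open()` name lookup, so the set of recognised names differs from any real iconv. It shows the two documented cases: an unrecognised encoding name leaves the message untouched, and a message recorded in `X` and requested in `X` is passed through verbatim, invalid bytes included. What happens when a real conversion fails is deliberately left out of the sketch.

```python
import codecs


def recode_log_message(raw: bytes, recorded: str, wanted: str) -> bytes:
    """Toy model of the documented --encoding rules (hypothetical helper)."""
    try:
        # Stand-in for asking the converter whether it knows both names.
        src = codecs.lookup(recorded)
        dst = codecs.lookup(wanted)
    except LookupError:
        # Unrecognised encoding name: the option is ignored and the
        # message is returned as recorded.
        return raw

    if src.name == dst.name:
        # Recorded in X, output in X: copy verbatim. No validation is
        # done, so invalid sequences survive into the output.
        return raw

    # Real conversion; strict mode raises on invalid input, which this
    # sketch does not try to handle.
    return raw.decode(src.name).encode(dst.name)


if __name__ == "__main__":
    latin2 = "Żelechowski".encode("iso-8859-2")
    print(recode_log_message(latin2, "iso-8859-2", "UTF-8"))
    print(recode_log_message(b"broken \xff", "UTF-8", "utf8"))
```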