[2/2] DocBook: Use a fixed encoding for output

Message ID 1441147759.9215.44.camel@decadent.org.uk (mailing list archive)
State New, archived

Commit Message

Ben Hutchings Sept. 1, 2015, 10:49 p.m. UTC
Currently the encoding of documents generated by DocBook depends on
the current locale.  Make the output reproducible independently of
the locale, by setting the encoding to UTF-8 (LC_CTYPE=C.UTF-8) by
preference, or ASCII (LC_CTYPE=C) as a fallback.

LC_CTYPE can normally be overridden by LC_ALL, but the top-level
Makefile unsets that.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---
 Documentation/DocBook/Makefile | 6 ++++++
 Makefile                       | 2 +-
 scripts/Makefile               | 7 +++++--
 scripts/check-lc_ctype.c       | 6 ++++++
 4 files changed, 18 insertions(+), 3 deletions(-)
 create mode 100644 scripts/check-lc_ctype.c
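
The remark about LC_ALL in the commit message rests on the usual POSIX precedence for locale environment variables: LC_ALL, then the category variable (here LC_CTYPE), then LANG. A minimal sketch of that lookup follows. The function names are made up for illustration and are not part of the patch, the locale names are sample values only, and the final "C" fallback is a simplification of what POSIX leaves implementation-defined.

```c
#include <assert.h>
#include <stddef.h>

/* An unset (NULL) or empty variable counts as "not set". */
int is_set(const char *value)
{
	return value != NULL && value[0] != '\0';
}

/*
 * Return the value that decides the LC_CTYPE category, given the
 * contents of the three relevant environment variables (NULL = unset).
 * Precedence: LC_ALL first, then LC_CTYPE, then LANG.
 * Falling back to "C" is a simplification for this sketch.
 */
const char *effective_lc_ctype(const char *lc_all,
			       const char *lc_ctype,
			       const char *lang)
{
	if (is_set(lc_all))
		return lc_all;
	if (is_set(lc_ctype))
		return lc_ctype;
	if (is_set(lang))
		return lang;
	return "C";
}

/* Self-check using sample locale names (illustrative values only). */
void check_precedence(void)
{
	/* With LC_ALL set, it wins and an exported LC_CTYPE has no effect. */
	assert(effective_lc_ctype("de_DE.ISO-8859-1", "C.UTF-8",
				  "en_US.UTF-8")[0] == 'd');
	/* With LC_ALL unset, LC_CTYPE decides, whatever LANG says. */
	assert(effective_lc_ctype(NULL, "C.UTF-8", "en_US.UTF-8")[0] == 'C');
}
```

Because the top-level Makefile already unsets LC_ALL, the second case is the one that applies during a docs build, which is why exporting LC_CTYPE alone is enough.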

Comments

Jonathan Corbet Sept. 11, 2015, 7:30 p.m. UTC | #1
On Tue, 01 Sep 2015 23:49:19 +0100
Ben Hutchings <ben@decadent.org.uk> wrote:

> Currently the encoding of documents generated by DocBook depends on
> the current locale.  Make the output reproducible independently of
> the locale, by setting the encoding to UTF-8 (LC_CTYPE=C.UTF-8) by
> preference, or ASCII (LC_CTYPE=C) as a fallback.

I guess I have to ask, though: doesn't it seem that having the docs
produced according to the current locale is the Right Thing to do?  Users
have their locale set as it is for a reason, it seems like the production
of textual documents should respect their choice.

Am I missing something here?

Thanks,

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-kbuild" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Kahn Gillmor Sept. 11, 2015, 9:40 p.m. UTC | #2
On Fri 2015-09-11 15:30:59 -0400, Jonathan Corbet wrote:
> On Tue, 01 Sep 2015 23:49:19 +0100
> Ben Hutchings <ben@decadent.org.uk> wrote:
>
>> Currently the encoding of documents generated by DocBook depends on
>> the current locale.  Make the output reproducible independently of
>> the locale, by setting the encoding to UTF-8 (LC_CTYPE=C.UTF-8) by
>> preference, or ASCII (LC_CTYPE=C) as a fallback.
>
> I guess I have to ask, though: doesn't it seem that having the docs
> produced according to the current locale is the Right Thing to do?  Users
> have their locale set as it is for a reason, it seems like the production
> of textual documents should respect their choice.
>
> Am I missing something here?

I sympathize with Jonathan's general concern here -- if this patchset
makes it impossible for people to build documentation with (for example)
their preferred collation order, it would be suboptimal.

On the other hand, this seems to focus on character encodings
specifically; do we really want to encourage any sort of encodings other
than UTF-8?  The only plausible arguments i've heard are for documents that
are exclusively CJK characters, which could achieve a modest size
reduction using more targeted encodings.  afaik, there are no such
documents in the kernel, and i doubt there ever will be.

          --dkg
Jonathan Corbet Sept. 12, 2015, 8:06 p.m. UTC | #3
On Fri, 11 Sep 2015 17:40:33 -0400
Daniel Kahn Gillmor <dkg@fifthhorseman.net> wrote:

> I sympathize with Jonathan's general concern here -- if this patchset
> makes it impossible for people to build documentation with (for example)
> their preferred collation order, it would be suboptimal.
> 
> On the other hand, this seems to focus on character encodings
> specifically; do we really want to encourage any sort of encodings other
> than UTF-8?  The only plausible arguments i've heard are for documents that
> are exclusively CJK characters, which could achieve a modest size
> reduction using more targeted encodings.  afaik, there are no such
> documents in the kernel, and i doubt there ever will be.

Well, there are CJK documents in the kernel, actually, though none are in
the DocBook directory currently.

Regardless of this, it's not a matter of which encodings we are
encouraging.  If we want to encourage utf-8 use, we might not want to
start in the kernel's documentation directory.  I think we need to
respect the user's choice in this regard and not try to override it.  If
I take this patch, I suspect somebody will yell at me for it...

With regard to reproducible builds: success in this area certainly
requires reproducing the build environment as well.  Honestly, I think
that needs to include the locale settings.

Let me know if you think I've totally misunderstood things.

jon
Ben Hutchings Sept. 14, 2015, 12:32 a.m. UTC | #4
On Fri, 2015-09-11 at 13:30 -0600, Jonathan Corbet wrote:
> On Tue, 01 Sep 2015 23:49:19 +0100
> Ben Hutchings <ben@decadent.org.uk> wrote:
> 
> > Currently the encoding of documents generated by DocBook depends on
> > the current locale.  Make the output reproducible independently of
> > the locale, by setting the encoding to UTF-8 (LC_CTYPE=C.UTF-8) by
> > preference, or ASCII (LC_CTYPE=C) as a fallback.
> 
> I guess I have to ask, though: doesn't it seem that having the docs
> produced according to the current locale is the Right Thing to do?  Users
> have their locale set as it is for a reason, it seems like the production
> of textual documents should respect their choice.
> 
> Am I missing something here?

Yes - the locale's character encoding applies to plain text, but rich
text formats can have a locale-independent encoding which the viewer
will automatically convert to the current locale's encoding.

For HTML, the document encoding can be explicit in the document header
(and is, in this case).

Manual pages were already consistently encoded in UTF-8, as this is the
default behaviour of DocBook-XSL (and is what man-db prefers as input).

PDF and Postscript documents have arbitrary and explicit mappings from
character numbers (or names) to glyphs, and PDF documents normally have
a mapping from glyphs back to Unicode code points to support searching
and copying text.

Ben.
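
To make the encoding point concrete, here is a small sketch of why a byte stream needs a declared encoding: the same character becomes different bytes under UTF-8 and ISO-8859-1. The helper names are invented for this illustration and do not come from the patch or from DocBook, and the UTF-8 helper deliberately covers only code points below U+0800.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode a code point below U+0800 as UTF-8.
 * Returns the number of bytes written to out (1 or 2).
 */
size_t encode_utf8_short(unsigned int cp, unsigned char *out)
{
	assert(out != NULL);
	assert(cp < 0x800);

	if (cp < 0x80) {
		/* ASCII range: one byte, identical in both encodings. */
		out[0] = (unsigned char)cp;
		return 1;
	}
	/* Two-byte form: 110xxxxx 10xxxxxx */
	out[0] = (unsigned char)(0xC0 | (cp >> 6));
	out[1] = (unsigned char)(0x80 | (cp & 0x3F));
	return 2;
}

/*
 * Encode a code point below U+0100 as ISO-8859-1.
 * Always writes exactly one byte, equal to the code point.
 */
size_t encode_latin1(unsigned int cp, unsigned char *out)
{
	assert(out != NULL);
	assert(cp < 0x100);

	out[0] = (unsigned char)cp;
	return 1;
}
```

For U+00E9 (e with acute accent), the first helper produces the two bytes 0xC3 0xA9 and the second the single byte 0xE9, so a reader of the output has to know which encoding was used unless the document states it explicitly.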

Jonathan Corbet Sept. 18, 2015, 4:30 p.m. UTC | #5
On Mon, 14 Sep 2015 01:32:50 +0100
Ben Hutchings <ben@decadent.org.uk> wrote:

> > I guess I have to ask, though: doesn't it seem that having the docs
> > produced according to the current locale is the Right Thing to do?  Users
> > have their locale set as it is for a reason, it seems like the production
> > of textual documents should respect their choice.
> > 
> > Am I missing something here?  
> 
> Yes - the locale's character encoding applies to plain text, but rich
> text formats can have a locale-independent encoding which the viewer
> will automatically convert to the current locale's encoding.
> 
> For HTML, the document encoding can be explicit in the document header
> (and is, in this case).
> 
> Manual pages were already consistently encoded in UTF-8, as this is the
> default behaviour of DocBook-XSL (and is what man-db prefers as input).
> 
> PDF and Postscript documents have arbitrary and explicit mappings from
> character numbers (or names) to glyphs, and PDF documents normally have
> a mapping from glyphs back to Unicode code points to support searching
> and copying text.

OK, I guess you've talked me into it.  Can I ask you for one last favor,
though: please resubmit this patch with a couple of tweaks:

 - Based off current mainline, please (or docs-next, but that shouldn't
   be necessary).  The patch as sent doesn't apply.

 - Could you add a comment to the check-lc_ctype proglet so that somebody
   stumbling across it in the scripts directory knows why it's there?

Thanks,

jon

Patch

diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile
index 198e9b5..9af25da 100644
--- a/Documentation/DocBook/Makefile
+++ b/Documentation/DocBook/Makefile
@@ -68,6 +68,12 @@  installmandocs: mandocs
 #External programs used
 KERNELDOC = $(srctree)/scripts/kernel-doc
 DOCPROC   = $(objtree)/scripts/docproc
+CHECK_LC_CTYPE = $(objtree)/scripts/check-lc_ctype
+
+# Use a fixed encoding - UTF-8 if the C library has support built-in
+# or ASCII if not
+LC_CTYPE := $(call try-run, LC_CTYPE=C.UTF-8 $(CHECK_LC_CTYPE),C.UTF-8,C)
+export LC_CTYPE
 
 XMLTOFLAGS = -m $(srctree)/$(src)/stylesheet.xsl
 XMLTOFLAGS += --skip-validation
diff --git a/Makefile b/Makefile
index 13270c0..5846c06 100644
--- a/Makefile
+++ b/Makefile
@@ -1338,7 +1338,7 @@  $(help-board-dirs): help-%:
 # Documentation targets
 # ---------------------------------------------------------------------------
 %docs: scripts_basic FORCE
-	$(Q)$(MAKE) $(build)=scripts build_docproc
+	$(Q)$(MAKE) $(build)=scripts build_docproc build_check-lc_ctype
 	$(Q)$(MAKE) $(build)=Documentation/DocBook $@
 
 else # KBUILD_EXTMOD
diff --git a/scripts/Makefile b/scripts/Makefile
index 2016a64..6f0349f 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -7,6 +7,7 @@ 
 # conmakehash:   Create chartable
 # conmakehash:	 Create arrays for initializing the kernel console tables
 # docproc:       Used in Documentation/DocBook
+# check-lc_ctype: Used in Documentation/DocBook
 
 HOST_EXTRACFLAGS += -I$(srctree)/tools/include
 
@@ -23,14 +24,16 @@  HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
 always		:= $(hostprogs-y) $(hostprogs-m)
 
 # The following hostprogs-y programs are only build on demand
-hostprogs-y += unifdef docproc
+hostprogs-y += unifdef docproc check-lc_ctype
 
 # These targets are used internally to avoid "is up to date" messages
-PHONY += build_unifdef build_docproc
+PHONY += build_unifdef build_docproc build_check-lc_ctype
 build_unifdef: $(obj)/unifdef
 	@:
 build_docproc: $(obj)/docproc
 	@:
+build_check-lc_ctype: $(obj)/check-lc_ctype
+	@:
 
 subdir-$(CONFIG_MODVERSIONS) += genksyms
 subdir-y                     += mod
diff --git a/scripts/check-lc_ctype.c b/scripts/check-lc_ctype.c
new file mode 100644
index 0000000..51fe229
--- /dev/null
+++ b/scripts/check-lc_ctype.c
@@ -0,0 +1,6 @@ 
+#include <locale.h>
+
+int main(void)
+{
+	return !setlocale(LC_CTYPE, "");
+}