Message ID | 20180615044042.7928-2-tamiko@43-1.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Jun 14, 2018 at 11:40:41PM -0500, Matthias Maier wrote: > This commit removes the PYTHON_UTF8 workaround. The problem with setting > > LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8 > > is that the en_US.UTF-8 locale might not be available. In this case What platform are you using where UTF8 locale is not available ? Indeed I would ideally like to make the entire of QEMU build with an explicit en_US.UTF-8 or C.UTF-8 locale, to ensure that we get reliably reproducible builds, as locale differences have been known to impact output of many tools not just python. Regards, Daniel
On Fri, Jun 15, 2018, at 04:42 CDT, Daniel P. Berrangé <berrange@redhat.com> wrote: > On Thu, Jun 14, 2018 at 11:40:41PM -0500, Matthias Maier wrote: >> This commit removes the PYTHON_UTF8 workaround. The problem with setting >> >> LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8 >> >> is that the en_US.UTF-8 locale might not be available. In this case > > What platform are you using where UTF8 locale is not available ? For example, neither Debian (and for that matter Ubuntu) nor Gentoo guarantee that the en_US.UTF-8 locale is available. We in particular encounter build problems on Gentoo when users have only set very specific, non en_US locales, for example de_DE.UTF-8 (or similar). > Indeed I would ideally like to make the entire of QEMU build with an > explicit en_US.UTF-8 or C.UTF-8 locale, to ensure that we get reliably > reproducible builds, as locale differences have been known to impact > output of many tools not just python. We face the same problem in Gentoo and usually advice users to set LC_ALL=C when submitting bug reports. (It is frustrating that glibc upstream doesn't get their act together fixing and merging the current C.UTF-8 proposal.) So what about making the build system more robust (by merging the patches, or a variant) and either setting C.UTF-8, or C globally (depending on availability)? Best, Matthias
"Partially revert"? Which part isn't reverted? Matthias Maier <tamiko@43-1.org> writes: > On Fri, Jun 15, 2018, at 04:42 CDT, Daniel P. Berrangé <berrange@redhat.com> wrote: > >> On Thu, Jun 14, 2018 at 11:40:41PM -0500, Matthias Maier wrote: >>> This commit removes the PYTHON_UTF8 workaround. The problem with setting >>> >>> LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8 >>> >>> is that the en_US.UTF-8 locale might not be available. In this case The workaround is from commit d4e5ec877ca698a87dabe68814c6f93668f50c60 Author: Daniel P. Berrange <berrange@redhat.com> Date: Tue Jan 16 13:42:11 2018 +0000 qapi: force a UTF-8 locale for running Python Python2 did not validate locale correctness when reading input data, so would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so if you try to read UTF-8 data in the C locale, it will raise an error for any UTF-8 bytes that aren't representable in 7-bit ascii encoding. e.g. More background on this can be seen in https://www.python.org/dev/peps/pep-0538/ Many distros support a new C.UTF-8 locale that is like the C locale, but with UTF-8 instead of 7-bit ASCII. That is not entirely portable though. This patch thus sets the LANG to "C", but overrides LC_CTYPE to be en_US.UTF-8 locale. This gets us pretty close to C.UTF-8, but in a way that should be portable to everywhere QEMU builds. This patch only forces UTF-8 for QAPI scripts, since that is the one showing the immediate error under Python3 with C locale, but potentially we ought to force this for all python scripts used in the build process. It's still used only for running QAPI generators. As far as I can tell, the only non-ASCII input characters are: * qapi/trace.json # Copyright (C) 2011-2016 Lluís Vilanova <vilanova@ac.upc.edu> * tests/qapi-schema/escape-too-big.json # { 'command': 'é' } * tests/qapi-schema/unicode-str.json { 'command': 'é' } I believe these characters made Dan put in the workaround. We could get rid of them if the fix is too onerous (I haven't really looked, yet). >> What platform are you using where UTF8 locale is not available ? > > For example, neither Debian (and for that matter Ubuntu) nor Gentoo > guarantee that the en_US.UTF-8 locale is available. > > We in particular encounter build problems on Gentoo when users have only > set very specific, non en_US locales, for example de_DE.UTF-8 (or > similar). > >> Indeed I would ideally like to make the entire of QEMU build with an >> explicit en_US.UTF-8 or C.UTF-8 locale, to ensure that we get reliably >> reproducible builds, as locale differences have been known to impact >> output of many tools not just python. > > We face the same problem in Gentoo and usually advice users to set > LC_ALL=C when submitting bug reports. (It is frustrating that glibc > upstream doesn't get their act together fixing and merging the current > C.UTF-8 proposal.) > > So what about making the build system more robust (by merging the > patches, or a variant) and either setting C.UTF-8, or C globally > (depending on availability)? "git-grep -Fi .utf-8" coughs up ui/gtk.c: setlocale(LC_CTYPE, "C.UTF-8"); This runs whenever you pick -display gtk. Errors are ignored, though. commit 27b224a61f97faabbd20bdf72c0c1a3dbe400cd1 Author: Kevin Wolf <kwolf@redhat.com> Date: Tue Jan 31 11:09:45 2017 +0100 gtk: Hardcode LC_CTYPE as C.utf-8 Commit 2cb5d2a4 removed setlocale() for everything except LC_MESSAGES in order to avoid unwanted side effects such as using the wrong decimal separator in generated JSON objects. However, the problem that unsetting LC_CTYPE caused is that non-ASCII characters are considered non-printable now and therefore the GTK menus display question marks for accented letters, Chinese characters etc. A first attempt to fix this [1] was rejected because even just setting LC_CTYPE to the user's locale (and thereby modifying the semantics of the ctype.h functions) could have unwanted effects that we're not aware of yet. Recently, however, glibc introduced a new locale "C.utf-8" that just uses UTF-8 as its charset, but otherwise leaves the semantics alone. Just setting the right character set is enough for our use case, so we can just hardcode this one without having to be afraid of nasty side effects. Older systems that don't have the new locale will continue displaying question marks, but this should fix the problem for most users. [1] https://lists.gnu.org/archive/html/qemu-devel/2015-12/msg03591.html ('Re: gtk: use setlocale() for LC_MESSAGES only') Signed-off-by: Kevin Wolf <kwolf@redhat.com> Message-id: 20170131100945.8189-1-kwolf@redhat.com [ kraxel: change C.utf-8 to C.UTF-8 ] Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Looks like your frustration about upstream glibc is quite stale :)
On Fri, Jun 15, 2018 at 08:20:36AM -0500, Matthias Maier wrote: > > On Fri, Jun 15, 2018, at 04:42 CDT, Daniel P. Berrangé <berrange@redhat.com> wrote: > > > On Thu, Jun 14, 2018 at 11:40:41PM -0500, Matthias Maier wrote: > >> This commit removes the PYTHON_UTF8 workaround. The problem with setting > >> > >> LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8 > >> > >> is that the en_US.UTF-8 locale might not be available. In this case > > > > What platform are you using where UTF8 locale is not available ? > > For example, neither Debian (and for that matter Ubuntu) nor Gentoo > guarantee that the en_US.UTF-8 locale is available. > > We in particular encounter build problems on Gentoo when users have only > set very specific, non en_US locales, for example de_DE.UTF-8 (or > similar). > > > Indeed I would ideally like to make the entire of QEMU build with an > > explicit en_US.UTF-8 or C.UTF-8 locale, to ensure that we get reliably > > reproducible builds, as locale differences have been known to impact > > output of many tools not just python. > > We face the same problem in Gentoo and usually advice users to set > LC_ALL=C when submitting bug reports. (It is frustrating that glibc > upstream doesn't get their act together fixing and merging the current > C.UTF-8 proposal.) > > So what about making the build system more robust (by merging the > patches, or a variant) and either setting C.UTF-8, or C globally > (depending on availability)? Yes, if we could figure out a way to check for existance of locales, we could make configure check for C.UTF-*, en_US.UTF-8, C in that order, using the first it finds to work. Fun fact, on macOS 'C' is always UTF-8, so its valid to use just 'C'. Regards, Daniel
On Fri, Jun 15, 2018, at 10:17 CDT, Markus Armbruster <armbru@redhat.com> wrote: > "Partially revert"? Which part isn't reverted? Yes, it ended up being a full revert of the commit in question. I am sorry for the sloppy wording. > [...] >> Recently, however, glibc introduced a new locale "C.utf-8" that just >> uses UTF-8 as its charset, but otherwise leaves the semantics alone. >> Just setting the right character set is enough for our use case, so we >> can just hardcode this one without having to be afraid of nasty side >> effects. > Looks like your frustration about upstream glibc is quite stale :) Unfortunately, this statement is not correct. The corresponding glibc bug report summarizes the current situation [1]. Fact is that a lot of distributions ship a custom C.UTF-8 locale, for example Debian [2] (for the currenct glibc-2.27 release). Unfortunately, not everyone applies such custom patches. :-/ Best, Matthias [1] https://sourceware.org/bugzilla/show_bug.cgi?id=17318 [2] https://sources.debian.org/patches/glibc/2.27-3/localedata/locale-C.diff/
diff --git a/Makefile b/Makefile index e46f2b625a..7ed9cc4a21 100644 --- a/Makefile +++ b/Makefile @@ -20,8 +20,6 @@ ifneq ($(wildcard config-host.mak),) all: include config-host.mak -PYTHON_UTF8 = LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8 $(PYTHON) - git-submodule-update: .PHONY: git-submodule-update @@ -576,7 +574,7 @@ qga/qapi-generated/qga-qapi-commands.h qga/qapi-generated/qga-qapi-commands.c \ qga/qapi-generated/qga-qapi-doc.texi: \ qga/qapi-generated/qapi-gen-timestamp ; qga/qapi-generated/qapi-gen-timestamp: $(SRC_PATH)/qga/qapi-schema.json $(qapi-py) - $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-gen.py \ + $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-gen.py \ -o qga/qapi-generated -p "qga-" $<, \ "GEN","$(@:%-timestamp=%)") @>$@ @@ -676,7 +674,7 @@ qapi/qapi-introspect.h qapi/qapi-introspect.c \ qapi/qapi-doc.texi: \ qapi-gen-timestamp ; qapi-gen-timestamp: $(qapi-modules) $(qapi-py) - $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-gen.py \ + $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-gen.py \ -o "qapi" -b $<, \ "GEN","$(@:%-timestamp=%)") @>$@ diff --git a/tests/Makefile.include b/tests/Makefile.include index 607afe5bed..b5121f2851 100644 --- a/tests/Makefile.include +++ b/tests/Makefile.include @@ -674,13 +674,13 @@ tests/test-qapi-events.c tests/test-qapi-events.h \ tests/test-qapi-introspect.c tests/test-qapi-introspect.h: \ tests/test-qapi-gen-timestamp ; tests/test-qapi-gen-timestamp: $(SRC_PATH)/tests/qapi-schema/qapi-schema-test.json $(qapi-py) - $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-gen.py \ + $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-gen.py \ -o tests -p "test-" $<, \ "GEN","$(@:%-timestamp=%)") @>$@ tests/qapi-schema/doc-good.test.texi: $(SRC_PATH)/tests/qapi-schema/doc-good.json $(qapi-py) - $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-gen.py \ + $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-gen.py \ -o tests/qapi-schema -p "doc-good-" $<, \ "GEN","$@") @mv tests/qapi-schema/doc-good-qapi-doc.texi $@ @@ -938,7 +938,7 @@ check-tests/qemu-iotests-quick.sh: tests/qemu-iotests-quick.sh qemu-img$(EXESUF) .PHONY: $(patsubst %, check-%, $(check-qapi-schema-y)) $(patsubst %, check-%, $(check-qapi-schema-y)): check-%.json: $(SRC_PATH)/%.json $(call quiet-command, PYTHONPATH=$(SRC_PATH)/scripts \ - $(PYTHON_UTF8) $(SRC_PATH)/tests/qapi-schema/test-qapi.py \ + $(PYTHON) $(SRC_PATH)/tests/qapi-schema/test-qapi.py \ $^ >$*.test.out 2>$*.test.err; \ echo $$? >$*.test.exit, \ "TEST","$*.out")