misc: Avoid UTF-8 in error messages

Message ID	20181120203628.2367003-1-eblake@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> From: Eric Blake <eblake@redhat.com> To: qemu-devel@nongnu.org Date: Tue, 20 Nov 2018 14:36:28 -0600 Message-Id: <20181120203628.2367003-1-eblake@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Subject: [Qemu-devel] [PATCH] misc: Avoid UTF-8 in error messages Precedence: list Cc: qemu-trivial@nongnu.org Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
Series	misc: Avoid UTF-8 in error messages \| expand misc: Avoid UTF-8 in error messages

Eric Blake Nov. 20, 2018, 8:36 p.m. UTC

While most developers are now using UTF-8 environments, it's
harder to guarantee that error messages will be output to
a multibyte locale. Rather than risking error messages that
get corrupted into mojibake when the user runs qemu in a
non-multibyte locale, let's stick to straight ASCII error
messages, rather than assuming that our use of UTF-8 in source
code string constants will work unchanged in other locales.

Reported-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 hw/misc/tmp105.c | 2 +-
 hw/misc/tmp421.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

John Snow Nov. 20, 2018, 9:28 p.m. UTC | #1

On 11/20/18 3:36 PM, Eric Blake wrote:
> While most developers are now using UTF-8 environments, it's
> harder to guarantee that error messages will be output to
> a multibyte locale. Rather than risking error messages that
> get corrupted into mojibake when the user runs qemu in a
> non-multibyte locale, let's stick to straight ASCII error
> messages, rather than assuming that our use of UTF-8 in source
> code string constants will work unchanged in other locales.
> 
> Reported-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  hw/misc/tmp105.c | 2 +-
>  hw/misc/tmp421.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c
> index 0918f3a6ea2..f6d7163273a 100644
> --- a/hw/misc/tmp105.c
> +++ b/hw/misc/tmp105.c
> @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name,
>          return;
>      }
>      if (temp >= 128000 || temp < -128000) {
> -        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> +        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
>                     temp / 1000, temp % 1000);
>          return;
>      }
> diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c
> index c234044305d..eeb11000f0f 100644
> --- a/hw/misc/tmp421.c
> +++ b/hw/misc/tmp421.c
> @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name,
>      }
> 
>      if (temp >= maxs[ext_range] || temp < mins[ext_range]) {
> -        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> +        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
>                     temp / 1000, temp % 1000);
>          return;
>      }
> 

Do we have any policy in place to prohibit this in the future?
(Presumably a policy that is automatic and won't interfere with QEMU
localization efforts which may rightly attempt to use UTF-8 for those
locales.)

Do you have a script or trick to find utf-8 containing strings in our
source?

Only curious, don't hold this patch up on my account. I'm not raising a
challenge.

--js

Eric Blake Nov. 20, 2018, 10:01 p.m. UTC | #2

[adding Markus in CC, since git didn't do it automatically from the 
'Reported-by']

On 11/20/18 3:28 PM, John Snow wrote:
> 
> 
> On 11/20/18 3:36 PM, Eric Blake wrote:
>> While most developers are now using UTF-8 environments, it's
>> harder to guarantee that error messages will be output to
>> a multibyte locale. Rather than risking error messages that
>> get corrupted into mojibake when the user runs qemu in a
>> non-multibyte locale, let's stick to straight ASCII error
>> messages, rather than assuming that our use of UTF-8 in source
>> code string constants will work unchanged in other locales.
>>
>> Reported-by: Markus Armbruster <armbru@redhat.com>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>   hw/misc/tmp105.c | 2 +-
>>   hw/misc/tmp421.c | 2 +-
>>   2 files changed, 2 insertions(+), 2 deletions(-)

> 
> Do we have any policy in place to prohibit this in the future?
> (Presumably a policy that is automatic and won't interfere with QEMU
> localization efforts which may rightly attempt to use UTF-8 for those
> locales.)

Not that I know of.

> 
> Do you have a script or trick to find utf-8 containing strings in our
> source?

Markus found these two, probably by reading over a list resulting from 
his claim of finding 217 out of 6455 files (53 of them binary, which 
don't count):
https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html

My quick and dirty attempt, which does not quite reproduce his numbers:

$ LC_ALL=C git grep -l $'[\x80-\xff]' | wc
     279     279    7490

Thus, by forcing a unibyte locale (where encoding errors are impossible) 
with sane range expressions (POSIX says only the C locale is required to 
interpret regex ranges according to byte value - all bets are off in 
other locales) and using $'' to type non-UTF-8 bytes into my search, I 
found 279 files with at least one byte outside of ASCII.  But the use of 
-l has no easy way to filter which of those files are binary; while 
dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to 
scroll through, especially considering there ARE binary files in the mix.

Narrowing the search to a more specific pattern:

$ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc
     129     685    8808

is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc 
(false positive hits, due to interesting? comments), in po/ (which 
doesn't count), or in scripts/ for python.  And the proof for THIS patch:

$ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat
origin:hw/misc/tmp105.c
origin:hw/misc/tmp421.c

> 
> Only curious, don't hold this patch up on my account. I'm not raising a
> challenge.

Maybe checkpatch.pl could be taught to do a similar check?

John Snow Nov. 20, 2018, 10:43 p.m. UTC | #3

On 11/20/18 5:01 PM, Eric Blake wrote:
> [adding Markus in CC, since git didn't do it automatically from the
> 'Reported-by']
> 
> On 11/20/18 3:28 PM, John Snow wrote:
>>
>>
>> On 11/20/18 3:36 PM, Eric Blake wrote:
>>> While most developers are now using UTF-8 environments, it's
>>> harder to guarantee that error messages will be output to
>>> a multibyte locale. Rather than risking error messages that
>>> get corrupted into mojibake when the user runs qemu in a
>>> non-multibyte locale, let's stick to straight ASCII error
>>> messages, rather than assuming that our use of UTF-8 in source
>>> code string constants will work unchanged in other locales.
>>>
>>> Reported-by: Markus Armbruster <armbru@redhat.com>
>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>> ---
>>>   hw/misc/tmp105.c | 2 +-
>>>   hw/misc/tmp421.c | 2 +-
>>>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
>>
>> Do we have any policy in place to prohibit this in the future?
>> (Presumably a policy that is automatic and won't interfere with QEMU
>> localization efforts which may rightly attempt to use UTF-8 for those
>> locales.)
> 
> Not that I know of.
> 
>>
>> Do you have a script or trick to find utf-8 containing strings in our
>> source?
> 
> Markus found these two, probably by reading over a list resulting from
> his claim of finding 217 out of 6455 files (53 of them binary, which
> don't count):
> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html
> 
> My quick and dirty attempt, which does not quite reproduce his numbers:
> 
> $ LC_ALL=C git grep -l $'[\x80-\xff]' | wc
>     279     279    7490
> 
> Thus, by forcing a unibyte locale (where encoding errors are impossible)
> with sane range expressions (POSIX says only the C locale is required to
> interpret regex ranges according to byte value - all bets are off in
> other locales) and using $'' to type non-UTF-8 bytes into my search, I
> found 279 files with at least one byte outside of ASCII.  But the use of
> -l has no easy way to filter which of those files are binary; while
> dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to
> scroll through, especially considering there ARE binary files in the mix.
> 
> Narrowing the search to a more specific pattern:
> 
> $ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc
>     129     685    8808
> 
> is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc
> (false positive hits, due to interesting? comments), in po/ (which
> doesn't count), or in scripts/ for python.  And the proof for THIS patch:
> 
> $ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat
> origin:hw/misc/tmp105.c
> origin:hw/misc/tmp421.c
> 

Aha!

Thanks. I was surprised we've managed to introduce only two cases of
this so far. Your patch seems complete in this case.

Since I took the time to read along, I may as well:

Reviewed-by: John Snow <jsnow@redhat.com>

>>
>> Only curious, don't hold this patch up on my account. I'm not raising a
>> challenge.
> 
> Maybe checkpatch.pl could be taught to do a similar check?
>

Thomas Huth Nov. 21, 2018, 6:03 a.m. UTC | #4

On 2018-11-20 21:36, Eric Blake wrote:
> While most developers are now using UTF-8 environments, it's
> harder to guarantee that error messages will be output to
> a multibyte locale. Rather than risking error messages that
> get corrupted into mojibake when the user runs qemu in a
> non-multibyte locale, let's stick to straight ASCII error
> messages, rather than assuming that our use of UTF-8 in source
> code string constants will work unchanged in other locales.
> 
> Reported-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  hw/misc/tmp105.c | 2 +-
>  hw/misc/tmp421.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c
> index 0918f3a6ea2..f6d7163273a 100644
> --- a/hw/misc/tmp105.c
> +++ b/hw/misc/tmp105.c
> @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name,
>          return;
>      }
>      if (temp >= 128000 || temp < -128000) {
> -        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> +        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
>                     temp / 1000, temp % 1000);
>          return;
>      }
> diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c
> index c234044305d..eeb11000f0f 100644
> --- a/hw/misc/tmp421.c
> +++ b/hw/misc/tmp421.c
> @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name,
>      }
> 
>      if (temp >= maxs[ext_range] || temp < mins[ext_range]) {
> -        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> +        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
>                     temp / 1000, temp % 1000);
>          return;
>      }
> 

Reviewed-by: Thomas Huth <thuth@redhat.com>

Philippe Mathieu-Daudé Nov. 21, 2018, 11:39 a.m. UTC | #5

On 20/11/18 23:01, Eric Blake wrote:
> [adding Markus in CC, since git didn't do it automatically from the 
> 'Reported-by']
> 
> On 11/20/18 3:28 PM, John Snow wrote:
>>
>>
>> On 11/20/18 3:36 PM, Eric Blake wrote:
>>> While most developers are now using UTF-8 environments, it's
>>> harder to guarantee that error messages will be output to
>>> a multibyte locale. Rather than risking error messages that
>>> get corrupted into mojibake when the user runs qemu in a
>>> non-multibyte locale, let's stick to straight ASCII error
>>> messages, rather than assuming that our use of UTF-8 in source
>>> code string constants will work unchanged in other locales.
>>>
>>> Reported-by: Markus Armbruster <armbru@redhat.com>
>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>> ---
>>>   hw/misc/tmp105.c | 2 +-
>>>   hw/misc/tmp421.c | 2 +-
>>>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
>>
>> Do we have any policy in place to prohibit this in the future?
>> (Presumably a policy that is automatic and won't interfere with QEMU
>> localization efforts which may rightly attempt to use UTF-8 for those
>> locales.)
> 
> Not that I know of.
> 
>>
>> Do you have a script or trick to find utf-8 containing strings in our
>> source?
> 
> Markus found these two, probably by reading over a list resulting from 
> his claim of finding 217 out of 6455 files (53 of them binary, which 
> don't count):
> https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html
> 
> My quick and dirty attempt, which does not quite reproduce his numbers:
> 
> $ LC_ALL=C git grep -l $'[\x80-\xff]' | wc
>      279     279    7490
> 
> Thus, by forcing a unibyte locale (where encoding errors are impossible) 
> with sane range expressions (POSIX says only the C locale is required to 
> interpret regex ranges according to byte value - all bets are off in 
> other locales) and using $'' to type non-UTF-8 bytes into my search, I 
> found 279 files with at least one byte outside of ASCII.  But the use of 
> -l has no easy way to filter which of those files are binary; while 
> dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to 
> scroll through, especially considering there ARE binary files in the mix.
> 
> Narrowing the search to a more specific pattern:
> 
> $ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc
>      129     685    8808
> 
> is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc 
> (false positive hits, due to interesting? comments), in po/ (which 
> doesn't count), or in scripts/ for python.  And the proof for THIS patch:
> 
> $ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat
> origin:hw/misc/tmp105.c
> origin:hw/misc/tmp421.c

Can we add the last 3 lines in the commit message?

> 
>>
>> Only curious, don't hold this patch up on my account. I'm not raising a
>> challenge.
> 
> Maybe checkpatch.pl could be taught to do a similar check?

It looks easier in shell than perl...

We could add a checkpatch.sh which finally call 'exec -l checkpatch.pl 
$@' or similar?

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

Markus Armbruster Nov. 21, 2018, 4:20 p.m. UTC | #6

Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> On 20/11/18 23:01, Eric Blake wrote:
>> [adding Markus in CC, since git didn't do it automatically from the
>> 'Reported-by']
>>
>> On 11/20/18 3:28 PM, John Snow wrote:
>>>
>>>
>>> On 11/20/18 3:36 PM, Eric Blake wrote:
>>>> While most developers are now using UTF-8 environments, it's
>>>> harder to guarantee that error messages will be output to
>>>> a multibyte locale. Rather than risking error messages that
>>>> get corrupted into mojibake when the user runs qemu in a
>>>> non-multibyte locale, let's stick to straight ASCII error
>>>> messages, rather than assuming that our use of UTF-8 in source
>>>> code string constants will work unchanged in other locales.
>>>>
>>>> Reported-by: Markus Armbruster <armbru@redhat.com>
>>>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>>> ---
>>>>   hw/misc/tmp105.c | 2 +-
>>>>   hw/misc/tmp421.c | 2 +-
>>>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>
>>>
>>> Do we have any policy in place to prohibit this in the future?
>>> (Presumably a policy that is automatic and won't interfere with QEMU
>>> localization efforts which may rightly attempt to use UTF-8 for those
>>> locales.)
>>
>> Not that I know of.

We already outlaw newline and trailing punctuation, we could amend that
to outlaw non-ASCII.

$ git-grep 'no newline'
include/qapi/error.h: * The resulting message should be a single phrase, with no newline or
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * a single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * single phrase, with no newline or trailing punctuation.
util/qemu-error.c: * single phrase, with no newline or trailing punctuation.

[...]
>> Maybe checkpatch.pl could be taught to do a similar check?
>
> It looks easier in shell than perl...
>
> We could add a checkpatch.sh which finally call 'exec -l checkpatch.pl
> $@' or similar?

Congratulations, you found yet another way to make our checkpatch
program less readable.

Back to serious.  checkpatch.pl already flags error messages containing
newlines (search for "should not contain newlines").  Extending that to
flag non-ASCII characters shouldn't be hard.

> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>

Reviewed-by: Markus Armbruster <armbru@redhat.com>

Laurent Vivier Nov. 22, 2018, 12:27 p.m. UTC | #7

On 20/11/2018 21:36, Eric Blake wrote:
> While most developers are now using UTF-8 environments, it's
> harder to guarantee that error messages will be output to
> a multibyte locale. Rather than risking error messages that
> get corrupted into mojibake when the user runs qemu in a
> non-multibyte locale, let's stick to straight ASCII error
> messages, rather than assuming that our use of UTF-8 in source
> code string constants will work unchanged in other locales.
> 
> Reported-by: Markus Armbruster <armbru@redhat.com>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  hw/misc/tmp105.c | 2 +-
>  hw/misc/tmp421.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c
> index 0918f3a6ea2..f6d7163273a 100644
> --- a/hw/misc/tmp105.c
> +++ b/hw/misc/tmp105.c
> @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name,
>          return;
>      }
>      if (temp >= 128000 || temp < -128000) {
> -        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> +        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
>                     temp / 1000, temp % 1000);
>          return;
>      }
> diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c
> index c234044305d..eeb11000f0f 100644
> --- a/hw/misc/tmp421.c
> +++ b/hw/misc/tmp421.c
> @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name,
>      }
> 
>      if (temp >= maxs[ext_range] || temp < mins[ext_range]) {
> -        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range",
> +        error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range",
>                     temp / 1000, temp % 1000);
>          return;
>      }
> 

Applied to qemu-trivial with updated message log (command to find the
non-ASCII characters).

Thanks,
Laurent

misc: Avoid UTF-8 in error messages

Commit Message

Comments

Patch