diff mbox series

[1/2] util: Refactor json-writer's string sanitizer to be public

Message ID 20240802194156.2131519-5-eblake@redhat.com (mailing list archive)
State New, archived
Headers show
Series NBD: don't print raw server error text to terminal | expand

Commit Message

Eric Blake Aug. 2, 2024, 7:26 p.m. UTC
My next patch needs to convert text from an untrusted input into an
output representation that is suitable for display on a terminal is
useful to more than just the json-writer; the text should normally be
UTF-8, but blindly allowing all Unicode code points (including ASCII
ESC) through to a terminal risks remote-code-execution attacks on some
terminals.  Extract the existing body of json-writer's quoted_strinto
a new helper routine mod_utf8_sanitize, and generalize it to also work
on data that is length-limited rather than NUL-terminated.  [I was
actually surprised that glib does not have such a sanitizer already -
Google turns up lots of examples of rolling your own string
sanitizer.]

If desired in the future, we may want to tweak whether the output is
guaranteed to be ASCII (using lots of \u escape sequences, including
surrogate pairs for code points outside the BMP) or if we are okay
passing printable Unicode through (we still need to escape control
characters).  But for now, I went for minimal code churn, including
the fact that the resulting function allows a non-UTF-8 2-byte synonym
for U+0000.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 include/qemu/unicode.h |  3 ++
 qobject/json-writer.c  | 47 +----------------------
 util/unicode.c         | 84 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 88 insertions(+), 46 deletions(-)

Comments

Richard W.M. Jones Aug. 2, 2024, 9 p.m. UTC | #1
On Fri, Aug 02, 2024 at 02:26:05PM -0500, Eric Blake wrote:
> My next patch needs to convert text from an untrusted input into an
> output representation that is suitable for display on a terminal is
> useful to more than just the json-writer; the text should normally be
> UTF-8, but blindly allowing all Unicode code points (including ASCII
> ESC) through to a terminal risks remote-code-execution attacks on some
> terminals.  Extract the existing body of json-writer's quoted_strinto
> a new helper routine mod_utf8_sanitize, and generalize it to also work
> on data that is length-limited rather than NUL-terminated.  [I was
> actually surprised that glib does not have such a sanitizer already -
> Google turns up lots of examples of rolling your own string
> sanitizer.]
> 
> If desired in the future, we may want to tweak whether the output is
> guaranteed to be ASCII (using lots of \u escape sequences, including
> surrogate pairs for code points outside the BMP) or if we are okay
> passing printable Unicode through (we still need to escape control
> characters).  But for now, I went for minimal code churn, including
> the fact that the resulting function allows a non-UTF-8 2-byte synonym
> for U+0000.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

I have to admit I'd never heard of "modified UTF-8" before, but it's a
thing:

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

As the patch is almost a simple code motion:

Reviewed-by: Richard W.M. Jones <rjones@redhat.com>

Rich.

> ---
>  include/qemu/unicode.h |  3 ++
>  qobject/json-writer.c  | 47 +----------------------
>  util/unicode.c         | 84 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 88 insertions(+), 46 deletions(-)
> 
> diff --git a/include/qemu/unicode.h b/include/qemu/unicode.h
> index 7fa10b8e604..e1013b29f72 100644
> --- a/include/qemu/unicode.h
> +++ b/include/qemu/unicode.h
> @@ -4,4 +4,7 @@
>  int mod_utf8_codepoint(const char *s, size_t n, char **end);
>  ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint);
> 
> +void mod_utf8_sanitize(GString *buf, const char *str);
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len);
> +
>  #endif
> diff --git a/qobject/json-writer.c b/qobject/json-writer.c
> index 309a31d57a9..5c14574efee 100644
> --- a/qobject/json-writer.c
> +++ b/qobject/json-writer.c
> @@ -104,53 +104,8 @@ static void pretty_newline_or_space(JSONWriter *writer)
> 
>  static void quoted_str(JSONWriter *writer, const char *str)
>  {
> -    const char *ptr;
> -    char *end;
> -    int cp;
> -
>      g_string_append_c(writer->contents, '"');
> -
> -    for (ptr = str; *ptr; ptr = end) {
> -        cp = mod_utf8_codepoint(ptr, 6, &end);
> -        switch (cp) {
> -        case '\"':
> -            g_string_append(writer->contents, "\\\"");
> -            break;
> -        case '\\':
> -            g_string_append(writer->contents, "\\\\");
> -            break;
> -        case '\b':
> -            g_string_append(writer->contents, "\\b");
> -            break;
> -        case '\f':
> -            g_string_append(writer->contents, "\\f");
> -            break;
> -        case '\n':
> -            g_string_append(writer->contents, "\\n");
> -            break;
> -        case '\r':
> -            g_string_append(writer->contents, "\\r");
> -            break;
> -        case '\t':
> -            g_string_append(writer->contents, "\\t");
> -            break;
> -        default:
> -            if (cp < 0) {
> -                cp = 0xFFFD; /* replacement character */
> -            }
> -            if (cp > 0xFFFF) {
> -                /* beyond BMP; need a surrogate pair */
> -                g_string_append_printf(writer->contents, "\\u%04X\\u%04X",
> -                                       0xD800 + ((cp - 0x10000) >> 10),
> -                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
> -            } else if (cp < 0x20 || cp >= 0x7F) {
> -                g_string_append_printf(writer->contents, "\\u%04X", cp);
> -            } else {
> -                g_string_append_c(writer->contents, cp);
> -            }
> -        }
> -    };
> -
> +    mod_utf8_sanitize(writer->contents, str);
>      g_string_append_c(writer->contents, '"');
>  }
> 
> diff --git a/util/unicode.c b/util/unicode.c
> index 8580bc598b3..a419ed4de35 100644
> --- a/util/unicode.c
> +++ b/util/unicode.c
> @@ -154,3 +154,87 @@ ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint)
>      buf[4] = 0;
>      return 4;
>  }
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + *
> + * Append the contents of the NUL-terminated Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII
> + * and for quote characters.  The result is therefore safe for output
> + * to a terminal.
> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */
> +void mod_utf8_sanitize(GString *buf, const char *str)
> +{
> +    mod_utf8_sanitize_len(buf, str, -1);
> +}
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + * @len: Length of @str, or negative to stop at NUL terminator
> + *
> + * Append the contents of @len bytes of the Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII
> + * and for quote characters.  The result is therefore safe for output
> + * to a terminal.
> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len)
> +{
> +    const char *ptr;
> +    char *end;
> +    int cp;
> +
> +    if (len < 0) {
> +        len = strlen(str);
> +    }
> +
> +    for (ptr = str; *ptr; ptr = end) {
> +        cp = mod_utf8_codepoint(ptr, MIN(6, str + len - ptr), &end);
> +        switch (cp) {
> +        case '\"':
> +            g_string_append(buf, "\\\"");
> +            break;
> +        case '\\':
> +            g_string_append(buf, "\\\\");
> +            break;
> +        case '\b':
> +            g_string_append(buf, "\\b");
> +            break;
> +        case '\f':
> +            g_string_append(buf, "\\f");
> +            break;
> +        case '\n':
> +            g_string_append(buf, "\\n");
> +            break;
> +        case '\r':
> +            g_string_append(buf, "\\r");
> +            break;
> +        case '\t':
> +            g_string_append(buf, "\\t");
> +            break;
> +        default:
> +            if (cp < 0) {
> +                cp = 0xFFFD; /* replacement character */
> +            }
> +            if (cp > 0xFFFF) {
> +                /* beyond BMP; need a surrogate pair */
> +                g_string_append_printf(buf, "\\u%04X\\u%04X",
> +                                       0xD800 + ((cp - 0x10000) >> 10),
> +                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
> +            } else if (cp < 0x20 || cp >= 0x7F) {
> +                g_string_append_printf(buf, "\\u%04X", cp);
> +            } else {
> +                g_string_append_c(buf, cp);
> +            }
> +        }
> +    }
> +}
> -- 
> 2.45.2
Philippe Mathieu-Daudé Aug. 2, 2024, 9:38 p.m. UTC | #2
On 2/8/24 21:26, Eric Blake wrote:
> My next patch needs to convert text from an untrusted input into an
> output representation that is suitable for display on a terminal is
> useful to more than just the json-writer; the text should normally be
> UTF-8, but blindly allowing all Unicode code points (including ASCII
> ESC) through to a terminal risks remote-code-execution attacks on some
> terminals.  Extract the existing body of json-writer's quoted_strinto
> a new helper routine mod_utf8_sanitize, and generalize it to also work
> on data that is length-limited rather than NUL-terminated.  [I was
> actually surprised that glib does not have such a sanitizer already -
> Google turns up lots of examples of rolling your own string
> sanitizer.]
> 
> If desired in the future, we may want to tweak whether the output is
> guaranteed to be ASCII (using lots of \u escape sequences, including
> surrogate pairs for code points outside the BMP) or if we are okay
> passing printable Unicode through (we still need to escape control
> characters).  But for now, I went for minimal code churn, including
> the fact that the resulting function allows a non-UTF-8 2-byte synonym
> for U+0000.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>   include/qemu/unicode.h |  3 ++
>   qobject/json-writer.c  | 47 +----------------------
>   util/unicode.c         | 84 ++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 88 insertions(+), 46 deletions(-)

Preferably moving the docstring help to the header,

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Daniel P. Berrangé Aug. 7, 2024, 6:49 p.m. UTC | #3
On Fri, Aug 02, 2024 at 02:26:05PM -0500, Eric Blake wrote:
> My next patch needs to convert text from an untrusted input into an
> output representation that is suitable for display on a terminal is
> useful to more than just the json-writer; the text should normally be
> UTF-8, but blindly allowing all Unicode code points (including ASCII
> ESC) through to a terminal risks remote-code-execution attacks on some
> terminals.  Extract the existing body of json-writer's quoted_strinto
> a new helper routine mod_utf8_sanitize, and generalize it to also work
> on data that is length-limited rather than NUL-terminated.  [I was
> actually surprised that glib does not have such a sanitizer already -
> Google turns up lots of examples of rolling your own string
> sanitizer.]
> 
> If desired in the future, we may want to tweak whether the output is
> guaranteed to be ASCII (using lots of \u escape sequences, including
> surrogate pairs for code points outside the BMP) or if we are okay
> passing printable Unicode through (we still need to escape control
> characters).  But for now, I went for minimal code churn, including
> the fact that the resulting function allows a non-UTF-8 2-byte synonym
> for U+0000.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  include/qemu/unicode.h |  3 ++
>  qobject/json-writer.c  | 47 +----------------------
>  util/unicode.c         | 84 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 88 insertions(+), 46 deletions(-)

I was going to ask for a unit test, but "escaped_string" in
test-qjson.c  looks like it will be covering this sufficiently
well already, that we don't need to test it in isolation.

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


With regards,
Daniel
Markus Armbruster Aug. 8, 2024, 7:54 a.m. UTC | #4
Eric Blake <eblake@redhat.com> writes:

> My next patch needs to convert text from an untrusted input into an
> output representation that is suitable for display on a terminal is
> useful to more than just the json-writer; the text should normally be
> UTF-8, but blindly allowing all Unicode code points (including ASCII
> ESC) through to a terminal risks remote-code-execution attacks on some
> terminals.  Extract the existing body of json-writer's quoted_strinto
> a new helper routine mod_utf8_sanitize, and generalize it to also work
> on data that is length-limited rather than NUL-terminated.

This is two changes in one patch: factor out, and generalize.

You don't actually use the generalized variant.  Please leave that for
later, and keep this patch simple.

>                                                             [I was
> actually surprised that glib does not have such a sanitizer already -
> Google turns up lots of examples of rolling your own string
> sanitizer.]
>
> If desired in the future, we may want to tweak whether the output is
> guaranteed to be ASCII (using lots of \u escape sequences, including
> surrogate pairs for code points outside the BMP) or if we are okay
> passing printable Unicode through (we still need to escape control
> characters).  But for now, I went for minimal code churn, including
> the fact that the resulting function allows a non-UTF-8 2-byte synonym
> for U+0000.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  include/qemu/unicode.h |  3 ++
>  qobject/json-writer.c  | 47 +----------------------
>  util/unicode.c         | 84 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 88 insertions(+), 46 deletions(-)
>
> diff --git a/include/qemu/unicode.h b/include/qemu/unicode.h
> index 7fa10b8e604..e1013b29f72 100644
> --- a/include/qemu/unicode.h
> +++ b/include/qemu/unicode.h
> @@ -4,4 +4,7 @@
>  int mod_utf8_codepoint(const char *s, size_t n, char **end);
>  ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint);
>
> +void mod_utf8_sanitize(GString *buf, const char *str);
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len);
> +
>  #endif
> diff --git a/qobject/json-writer.c b/qobject/json-writer.c
> index 309a31d57a9..5c14574efee 100644
> --- a/qobject/json-writer.c
> +++ b/qobject/json-writer.c
> @@ -104,53 +104,8 @@ static void pretty_newline_or_space(JSONWriter *writer)
>
>  static void quoted_str(JSONWriter *writer, const char *str)
>  {
> -    const char *ptr;
> -    char *end;
> -    int cp;
> -
>      g_string_append_c(writer->contents, '"');
> -
> -    for (ptr = str; *ptr; ptr = end) {
> -        cp = mod_utf8_codepoint(ptr, 6, &end);
> -        switch (cp) {
> -        case '\"':
> -            g_string_append(writer->contents, "\\\"");
> -            break;
> -        case '\\':
> -            g_string_append(writer->contents, "\\\\");
> -            break;
> -        case '\b':
> -            g_string_append(writer->contents, "\\b");
> -            break;
> -        case '\f':
> -            g_string_append(writer->contents, "\\f");
> -            break;
> -        case '\n':
> -            g_string_append(writer->contents, "\\n");
> -            break;
> -        case '\r':
> -            g_string_append(writer->contents, "\\r");
> -            break;
> -        case '\t':
> -            g_string_append(writer->contents, "\\t");
> -            break;
> -        default:
> -            if (cp < 0) {
> -                cp = 0xFFFD; /* replacement character */
> -            }
> -            if (cp > 0xFFFF) {
> -                /* beyond BMP; need a surrogate pair */
> -                g_string_append_printf(writer->contents, "\\u%04X\\u%04X",
> -                                       0xD800 + ((cp - 0x10000) >> 10),
> -                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
> -            } else if (cp < 0x20 || cp >= 0x7F) {
> -                g_string_append_printf(writer->contents, "\\u%04X", cp);
> -            } else {
> -                g_string_append_c(writer->contents, cp);
> -            }
> -        }
> -    };
> -
> +    mod_utf8_sanitize(writer->contents, str);
>      g_string_append_c(writer->contents, '"');
>  }
>
> diff --git a/util/unicode.c b/util/unicode.c
> index 8580bc598b3..a419ed4de35 100644
> --- a/util/unicode.c
> +++ b/util/unicode.c
> @@ -154,3 +154,87 @@ ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint)
>      buf[4] = 0;
>      return 4;
>  }
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + *
> + * Append the contents of the NUL-terminated Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII

"Append into" or "append to"?

> + * and for quote characters.  The result is therefore safe for output
> + * to a terminal.

Actually, we escape double quote, backslash, ASCII control characters,
and non-ASCII characters, i.e. we leave just printable ASCII characters
other than '"' and '\\' unescaped.  See the code quoted below.

Escaping '\\' is of course necessary.

Escaping '"' is necessary only for callers that want to enclose the
result in double quotes without ambiguity.  Which is what the existing
caller wants.  Future callers may prefer not to escape '"'.  But we can
worry about this when we run into such callers.

> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */

Missing: behavior for invalid sequences.  Each such sequence is replaced
by a replacement character U+FFFD.  For the definition of "invalid
sequence", see mod_utf8_codepoint().

On the function name.  "Sanitize" could be many things.  This function
actually does two things: (1) replace invalid sequences, and (2) escape
to printable ASCII.  What about append_mod_utf8_as_printable_ascii()?
Admittedly a mouthful.

> +void mod_utf8_sanitize(GString *buf, const char *str)
> +{
> +    mod_utf8_sanitize_len(buf, str, -1);
> +}
> +
> +/**
> + * mod_utf8_sanitize:
> + * @buf: Destination buffer
> + * @str: Modified UTF-8 string to sanitize
> + * @len: Length of @str, or negative to stop at NUL terminator
> + *
> + * Append the contents of @len bytes of the Modified UTF-8 string
> + * @str into @buf, with escape sequences used for non-printable ASCII
> + * and for quote characters.  The result is therefore safe for output
> + * to a terminal.
> + *
> + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + */
> +void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len)
> +{
> +    const char *ptr;
> +    char *end;
> +    int cp;
> +
> +    if (len < 0) {
> +        len = strlen(str);
> +    }
> +
> +    for (ptr = str; *ptr; ptr = end) {
> +        cp = mod_utf8_codepoint(ptr, MIN(6, str + len - ptr), &end);
> +        switch (cp) {
> +        case '\"':
> +            g_string_append(buf, "\\\"");
> +            break;
> +        case '\\':
> +            g_string_append(buf, "\\\\");
> +            break;
> +        case '\b':
> +            g_string_append(buf, "\\b");
> +            break;
> +        case '\f':
> +            g_string_append(buf, "\\f");
> +            break;
> +        case '\n':
> +            g_string_append(buf, "\\n");
> +            break;
> +        case '\r':
> +            g_string_append(buf, "\\r");
> +            break;
> +        case '\t':
> +            g_string_append(buf, "\\t");
> +            break;
> +        default:
> +            if (cp < 0) {
> +                cp = 0xFFFD; /* replacement character */
> +            }
> +            if (cp > 0xFFFF) {
> +                /* beyond BMP; need a surrogate pair */
> +                g_string_append_printf(buf, "\\u%04X\\u%04X",
> +                                       0xD800 + ((cp - 0x10000) >> 10),
> +                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
> +            } else if (cp < 0x20 || cp >= 0x7F) {
> +                g_string_append_printf(buf, "\\u%04X", cp);
> +            } else {
> +                g_string_append_c(buf, cp);
> +            }
> +        }
> +    }
> +}
Markus Armbruster Aug. 8, 2024, 7:57 a.m. UTC | #5
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Fri, Aug 02, 2024 at 02:26:05PM -0500, Eric Blake wrote:
>> My next patch needs to convert text from an untrusted input into an
>> output representation that is suitable for display on a terminal is
>> useful to more than just the json-writer; the text should normally be
>> UTF-8, but blindly allowing all Unicode code points (including ASCII
>> ESC) through to a terminal risks remote-code-execution attacks on some
>> terminals.  Extract the existing body of json-writer's quoted_strinto
>> a new helper routine mod_utf8_sanitize, and generalize it to also work
>> on data that is length-limited rather than NUL-terminated.  [I was
>> actually surprised that glib does not have such a sanitizer already -
>> Google turns up lots of examples of rolling your own string
>> sanitizer.]
>> 
>> If desired in the future, we may want to tweak whether the output is
>> guaranteed to be ASCII (using lots of \u escape sequences, including
>> surrogate pairs for code points outside the BMP) or if we are okay
>> passing printable Unicode through (we still need to escape control
>> characters).  But for now, I went for minimal code churn, including
>> the fact that the resulting function allows a non-UTF-8 2-byte synonym
>> for U+0000.
>> 
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>  include/qemu/unicode.h |  3 ++
>>  qobject/json-writer.c  | 47 +----------------------
>>  util/unicode.c         | 84 ++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 88 insertions(+), 46 deletions(-)
>
> I was going to ask for a unit test, but "escaped_string" in
> test-qjson.c  looks like it will be covering this sufficiently

check-qjson.c, and other test cases torture it some more.

> well already, that we don't need to test it in isolation.
>
> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
>
>
> With regards,
> Daniel
Eric Blake Aug. 8, 2024, 2:02 p.m. UTC | #6
On Thu, Aug 08, 2024 at 09:54:26AM GMT, Markus Armbruster wrote:
> Eric Blake <eblake@redhat.com> writes:
> 
> > My next patch needs to convert text from an untrusted input into an
> > output representation that is suitable for display on a terminal is
> > useful to more than just the json-writer; the text should normally be
> > UTF-8, but blindly allowing all Unicode code points (including ASCII
> > ESC) through to a terminal risks remote-code-execution attacks on some
> > terminals.  Extract the existing body of json-writer's quoted_strinto
> > a new helper routine mod_utf8_sanitize, and generalize it to also work
> > on data that is length-limited rather than NUL-terminated.
> 
> This is two changes in one patch: factor out, and generalize.
> 
> You don't actually use the generalized variant.  Please leave that for
> later, and keep this patch simple.

Makes sense. Will simplify in v2.

> 
> >                                                             [I was
> > actually surprised that glib does not have such a sanitizer already -
> > Google turns up lots of examples of rolling your own string
> > sanitizer.]

See https://gitlab.gnome.org/GNOME/glib/-/issues/3426 where I asked if
glib should provide a more generic sanitizer.  In turn, I found glib
does have:

char*
g_uri_escape_string (
  const char* unescaped,
  const char* reserved_chars_allowed,
  gboolean allow_utf8
)

which is a bit more powerful (you have some fine-tuning on what gets
escaped), but a different escaping mechanism (%XX instead of \uXXXX
escapes, where % rather than \ is special).  I'm not sure whether it
is nicer to have a malloc'd result or to append into a larger g_string
as the base operation (and where you could write an easy wrapper to
provide the other operation).

> > +/**
> > + * mod_utf8_sanitize:
> > + * @buf: Destination buffer
> > + * @str: Modified UTF-8 string to sanitize
> > + *
> > + * Append the contents of the NUL-terminated Modified UTF-8 string
> > + * @str into @buf, with escape sequences used for non-printable ASCII
> 
> "Append into" or "append to"?

"append to" works, will simplify.

> 
> > + * and for quote characters.  The result is therefore safe for output
> > + * to a terminal.
> 
> Actually, we escape double quote, backslash, ASCII control characters,
> and non-ASCII characters, i.e. we leave just printable ASCII characters
> other than '"' and '\\' unescaped.  See the code quoted below.
> 
> Escaping '\\' is of course necessary.
> 
> Escaping '"' is necessary only for callers that want to enclose the
> result in double quotes without ambiguity.  Which is what the existing
> caller wants.  Future callers may prefer not to escape '"'.  But we can
> worry about this when we run into such callers.

If we encounter more users, I could see this morphing into a backend
that takes a flag argument on knobs like whether to escape " or ',
whether to preserve printing Unicode, ...; coupled with frontends with
fewer arguments that pass the right flags intot the backend.

> 
> > + *
> > + * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> > + * "\xC0\x80".
> > + */
> 
> Missing: behavior for invalid sequences.  Each such sequence is replaced
> by a replacement character U+FFFD.  For the definition of "invalid
> sequence", see mod_utf8_codepoint().

Indeed, more documentation here wouldn't hurt (both on \ and " being
escaped, and on U+FFFD codepoints being placed into the output
stream).

> 
> On the function name.  "Sanitize" could be many things.  This function
> actually does two things: (1) replace invalid sequences, and (2) escape
> to printable ASCII.  What about append_mod_utf8_as_printable_ascii()?
> Admittedly a mouthful.

Naming is always tough.  Your suggestion is longer, but indeed
accurate.  Maybe a shorter compromise of append_escaped_mod_utf8()?

> 
> > +void mod_utf8_sanitize(GString *buf, const char *str)
> > +{
> > +    mod_utf8_sanitize_len(buf, str, -1);
> > +}
diff mbox series

Patch

diff --git a/include/qemu/unicode.h b/include/qemu/unicode.h
index 7fa10b8e604..e1013b29f72 100644
--- a/include/qemu/unicode.h
+++ b/include/qemu/unicode.h
@@ -4,4 +4,7 @@ 
 int mod_utf8_codepoint(const char *s, size_t n, char **end);
 ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint);

+void mod_utf8_sanitize(GString *buf, const char *str);
+void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len);
+
 #endif
diff --git a/qobject/json-writer.c b/qobject/json-writer.c
index 309a31d57a9..5c14574efee 100644
--- a/qobject/json-writer.c
+++ b/qobject/json-writer.c
@@ -104,53 +104,8 @@  static void pretty_newline_or_space(JSONWriter *writer)

 static void quoted_str(JSONWriter *writer, const char *str)
 {
-    const char *ptr;
-    char *end;
-    int cp;
-
     g_string_append_c(writer->contents, '"');
-
-    for (ptr = str; *ptr; ptr = end) {
-        cp = mod_utf8_codepoint(ptr, 6, &end);
-        switch (cp) {
-        case '\"':
-            g_string_append(writer->contents, "\\\"");
-            break;
-        case '\\':
-            g_string_append(writer->contents, "\\\\");
-            break;
-        case '\b':
-            g_string_append(writer->contents, "\\b");
-            break;
-        case '\f':
-            g_string_append(writer->contents, "\\f");
-            break;
-        case '\n':
-            g_string_append(writer->contents, "\\n");
-            break;
-        case '\r':
-            g_string_append(writer->contents, "\\r");
-            break;
-        case '\t':
-            g_string_append(writer->contents, "\\t");
-            break;
-        default:
-            if (cp < 0) {
-                cp = 0xFFFD; /* replacement character */
-            }
-            if (cp > 0xFFFF) {
-                /* beyond BMP; need a surrogate pair */
-                g_string_append_printf(writer->contents, "\\u%04X\\u%04X",
-                                       0xD800 + ((cp - 0x10000) >> 10),
-                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
-            } else if (cp < 0x20 || cp >= 0x7F) {
-                g_string_append_printf(writer->contents, "\\u%04X", cp);
-            } else {
-                g_string_append_c(writer->contents, cp);
-            }
-        }
-    };
-
+    mod_utf8_sanitize(writer->contents, str);
     g_string_append_c(writer->contents, '"');
 }

diff --git a/util/unicode.c b/util/unicode.c
index 8580bc598b3..a419ed4de35 100644
--- a/util/unicode.c
+++ b/util/unicode.c
@@ -154,3 +154,87 @@  ssize_t mod_utf8_encode(char buf[], size_t bufsz, int codepoint)
     buf[4] = 0;
     return 4;
 }
+
+/**
+ * mod_utf8_sanitize:
+ * @buf: Destination buffer
+ * @str: Modified UTF-8 string to sanitize
+ *
+ * Append the contents of the NUL-terminated Modified UTF-8 string
+ * @str into @buf, with escape sequences used for non-printable ASCII
+ * and for quote characters.  The result is therefore safe for output
+ * to a terminal.
+ *
+ * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
+ * "\xC0\x80".
+ */
+void mod_utf8_sanitize(GString *buf, const char *str)
+{
+    mod_utf8_sanitize_len(buf, str, -1);
+}
+
+/**
+ * mod_utf8_sanitize:
+ * @buf: Destination buffer
+ * @str: Modified UTF-8 string to sanitize
+ * @len: Length of @str, or negative to stop at NUL terminator
+ *
+ * Append the contents of @len bytes of the Modified UTF-8 string
+ * @str into @buf, with escape sequences used for non-printable ASCII
+ * and for quote characters.  The result is therefore safe for output
+ * to a terminal.
+ *
+ * Modified UTF-8 is exactly like UTF-8, except U+0000 is encoded as
+ * "\xC0\x80".
+ */
+void mod_utf8_sanitize_len(GString *buf, const char *str, ssize_t len)
+{
+    const char *ptr;
+    char *end;
+    int cp;
+
+    if (len < 0) {
+        len = strlen(str);
+    }
+
+    for (ptr = str; *ptr; ptr = end) {
+        cp = mod_utf8_codepoint(ptr, MIN(6, str + len - ptr), &end);
+        switch (cp) {
+        case '\"':
+            g_string_append(buf, "\\\"");
+            break;
+        case '\\':
+            g_string_append(buf, "\\\\");
+            break;
+        case '\b':
+            g_string_append(buf, "\\b");
+            break;
+        case '\f':
+            g_string_append(buf, "\\f");
+            break;
+        case '\n':
+            g_string_append(buf, "\\n");
+            break;
+        case '\r':
+            g_string_append(buf, "\\r");
+            break;
+        case '\t':
+            g_string_append(buf, "\\t");
+            break;
+        default:
+            if (cp < 0) {
+                cp = 0xFFFD; /* replacement character */
+            }
+            if (cp > 0xFFFF) {
+                /* beyond BMP; need a surrogate pair */
+                g_string_append_printf(buf, "\\u%04X\\u%04X",
+                                       0xD800 + ((cp - 0x10000) >> 10),
+                                       0xDC00 + ((cp - 0x10000) & 0x3FF));
+            } else if (cp < 0x20 || cp >= 0x7F) {
+                g_string_append_printf(buf, "\\u%04X", cp);
+            } else {
+                g_string_append_c(buf, cp);
+            }
+        }
+    }
+}