[0/8] Add multi-byte support

Message ID	cover.1714215826.git.herbert@gondor.apana.org.au (mailing list archive)
Headers	show Received: from abb.hmeau.com (abb.hmeau.com [144.6.53.87]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A9AC5482E1 for <dash@vger.kernel.org>; Sat, 27 Apr 2024 11:06:59 +0000 (UTC) Message-Id: <cover.1714215826.git.herbert@gondor.apana.org.au> From: Herbert Xu <herbert@gondor.apana.org.au> Date: Sat, 27 Apr 2024 19:03:46 +0800 Subject: [PATCH 0/8] Add multi-byte support To: DASH Mailing List <dash@vger.kernel.org> Precedence: bulk
Series	Add multi-byte support \| expand [0/8] Add multi-byte support [1/8] shell: Call setlocale [2/8] shell: Use strcoll instead of strcmp where applicable [3/8] expand: Count multi-byte characters for VSLENGTH [4/8] expand: Process multi-byte characters in subevalvar [5/8] expand: Process multi-byte characters in expmeta [6/8] expand: Support multi-byte characters during field splitting [7/8] input: Allow MB_LEN_MAX calls to pungetc [8/8] parser: Add support for multi-byte characters

Herbert Xu April 27, 2024, 11:03 a.m. UTC

This patch series adds multi-byte support to dash.  For now only
fnmatch is supported as the native pmatch function has not been
modified to support multi-byte characters.

Herbert Xu (8):
  shell: Call setlocale
  shell: Use strcoll instead of strcmp where applicable
  expand: Count multi-byte characters for VSLENGTH
  expand: Process multi-byte characters in subevalvar
  expand: Process multi-byte characters in expmeta
  expand: Support multi-byte characters during field splitting
  input: Allow MB_LEN_MAX calls to pungetc
  parser: Add support for multi-byte characters

 src/bltin/test.c |   8 +-
 src/expand.c     | 489 +++++++++++++++++++++++++++++++++++------------
 src/expand.h     |   1 +
 src/input.c      |  58 ++++--
 src/input.h      |  11 +-
 src/main.c       |   4 +
 src/mystring.c   |   2 +-
 src/parser.c     | 127 +++++++++---
 src/parser.h     |   1 +
 9 files changed, 519 insertions(+), 182 deletions(-)

Christoph Anton Mitterer April 27, 2024, 9:31 p.m. UTC | #1

Hey.

On Sat, 2024-04-27 at 19:03 +0800, Herbert Xu wrote:
> This patch series adds multi-byte support to dash.  For now only
> fnmatch is supported as the native pmatch function has not been
> modified to support multi-byte characters.

Nothing against the functionality per se, but I think for all scripts
that assumed dash's (and thus on may systems /bin/sh's) current
behaviour of being C locale only even without explicitly setting
LC_ALL=C, this may have quite some subtle issues.

AFAIU, in the C locale, all bytes is a character, and thus in
particular pattern matching notation is defined for every defined
outcome of command substitution respectively every content of variables
(that is: in every(!) locale every byte other than NUL).

For example:
************
A while ago I've asked on the Austin Group mailing list for a portable
way to get command substitution without stripping of trailing newlines.

Long story short:
The recommended way was to add a sentinel character '.' at the end of
the output within the command substitution and strip that off later
with parameter expansion.
But despite of the very special properties[0] of '.', it's apparently
still required to set LC_ALL=C when stripping the sentinel, because the
pattern matching notation in ${foo%.} is defined only on strings of
characters, not on strings of bytes.

Back then, Harald van Dijk had some ideas how that might be resolved
for good, but IIRC none of the shell implementors seemed to really have
interest.

My goal was to make a portable function like
   command_subst_with_newlines "eval-ed-command-string" "target-variable-name"
which, with the requirement of setting LC_ALL proved more or less
impossible when the function should have no side effects (like keeping
the LC_ALL overridden, over possibly overriding some existing var like
OLD_LC_ALL).

Anyway... I could image, that if dash becomes multi-byte aware, there
might be more or less subtle surprises.

Cheers,
Chris.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html
"The encoded values associated with <period>, <slash>, <newline>, and
<carriage-return> shall be invariant across all locales supported by
the implementation."

Herbert Xu April 28, 2024, 12:49 a.m. UTC | #2

On Sat, Apr 27, 2024 at 11:31:43PM +0200, Christoph Anton Mitterer wrote:
>
> Long story short:
> The recommended way was to add a sentinel character '.' at the end of
> the output within the command substitution and strip that off later
> with parameter expansion.
> But despite of the very special properties[0] of '.', it's apparently
> still required to set LC_ALL=C when stripping the sentinel, because the
> pattern matching notation in ${foo%.} is defined only on strings of
> characters, not on strings of bytes.

Are you talking about a theoretical undefined condition, or an
actual one?  Which shell doesn't deal with ${foo%.} correctly?

Cheers,

Christoph Anton Mitterer April 28, 2024, 1:19 a.m. UTC | #3

On Sun, 2024-04-28 at 08:49 +0800, Herbert Xu wrote:
> 
> Are you talking about a theoretical undefined condition, or an
> actual one?  Which shell doesn't deal with ${foo%.} correctly?

Well my main point for this mail was how dash does it (and not just
about '.').
I guess it simply resorts to fnmatch(3) so it will probably do whatever
the system's libc does?

But it's really not just about '.' (which is more or less a rather safe
case).
If someone assumes dash would always be LC_ALL=C, that any such
operation where e.g. a byte is used that is part of a multi-byte
character might, AFAIU, lead to unexpected results.
E.g. an fnmatch implementation may just decide to stop and give an
error at the first byte that's not a valid characters, right?

Also there were some locales like Big5, which had the weird property of
having multibyte chars that contain byte sequences that form other
valid chars (see [0]).
Not sure if I remember that correctly, but it might have been undefined
when stripping of the "shorter" character from that.
Which again, couldn't have happened so far in dash, as it simply was C
local only.

Harald van Dijk made some extensive tests back then, how different
shells behave.
I think the austrin-group-l mailing list archive is not publicly
available, but if you have an account it was in that mail:
https://collaboration.opengroup.org/operational/mailarch.php?soph=N&action=show&archive=austin-group-l&num=34339&limit=100&offset=0
which showed that shells do indeed behave differently (the tests
weren't for '.').

Cheers,
Chris.

[0] https://unix.stackexchange.com/questions/383217/shell-keep-trailing-newlines-n-in-command-substitution/383411#383411

Lawrence Velázquez April 28, 2024, 1:35 a.m. UTC | #4

On Sat, Apr 27, 2024, at 9:19 PM, Christoph Anton Mitterer wrote:
> If someone assumes dash would always be LC_ALL=C, that any such
> operation where e.g. a byte is used that is part of a multi-byte
> character might, AFAIU, lead to unexpected results.

What's your point?  Do you think dash should not acquire multibyte
support, to avoid breaking scripts that make such assumptions?

Christoph Anton Mitterer April 28, 2024, 1:50 a.m. UTC | #5

On Sat, 2024-04-27 at 21:35 -0400, Lawrence Velázquez wrote:
> What's your point?  Do you think dash should not acquire multibyte
> support, to avoid breaking scripts that make such assumptions?

No, as I've said I have nothing against the functionality per se, but
just wanted to give a heads up that this might have some not
immediately obvious consequences.

Maybe it makes sense to add some big fat warning in a NEWS or changelog
file, telling people that their scripts may now run under another
locale and perhaps also giving some explanation on possible subtle
consequences?

Cheers,
Chris.

Christoph Anton Mitterer April 28, 2024, 2:03 a.m. UTC | #6

Just to give a simple example:

x.sh:
   char=ä
   printf '%s' "$char" | hd
   
   byte="$( printf '\244' )"
   printf '%s' "$byte" | hd
   
   printf '%s' "${char%$byte}" | hd


$ LC_ALL=en_US.UTF-8 bash x.sh 
00000000  c3 a4                                             |..|
00000002
00000000  a4                                                |.|
00000001
00000000  c3                                                |.|
00000001
$ LC_ALL=C bash x.sh 
00000000  c3 a4                                             |..|
00000002
00000000  a4                                                |.|
00000001
00000000  c3                                                |.|
00000001

$ dash x.sh 
00000000  c3 a4                                             |..|
00000002
00000000  a4                                                |.|
00000001
00000000  c3                                                |.|
00000001

but:

$ LC_ALL=en_US.UTF-8 zsh x.sh 
00000000  c3 a4                                             |..|
00000002
00000000  a4                                                |.|
00000001
00000000  c3 a4                                             |..|
00000002
$ LC_ALL=C zsh x.sh 
00000000  c3 a4                                             |..|
00000002
00000000  a4                                                |.|
00000001
00000000  c3                                                |.|
00000001


Again, not saying that this means dash shouldn't support other
locales... but people may (of course wrongly) rely on the behaviour
that dash is always in the C locale.

Even if that doesn't change right now with the fnmatch()
implementation, it could in principle do so with any change there.


Ceers,
Chris.

Harald van Dijk April 28, 2024, 2:50 p.m. UTC | #7

On 28/04/2024 01:49, Herbert Xu wrote:
> On Sat, Apr 27, 2024 at 11:31:43PM +0200, Christoph Anton Mitterer wrote:
>>
>> Long story short:
>> The recommended way was to add a sentinel character '.' at the end of
>> the output within the command substitution and strip that off later
>> with parameter expansion.
>> But despite of the very special properties[0] of '.', it's apparently
>> still required to set LC_ALL=C when stripping the sentinel, because the
>> pattern matching notation in ${foo%.} is defined only on strings of
>> characters, not on strings of bytes.
> 
> Are you talking about a theoretical undefined condition, or an
> actual one?  Which shell doesn't deal with ${foo%.} correctly?

The way you are implementing it, once you get to pmatch(), arguably you 
will not handle ${foo%.} correctly.

Consider an UTF-8 locale, where '\303' is not a valid multibyte 
character. In this locale, consider

   foo=$(printf '\303.')
   foo=${foo%.}

This is something I expect to set foo to '\303', and it does in all 
shells I know of, despite POSIX not saying this needs to work. The way 
you are implementing multibyte character support, if I am reading it 
right, as long as a full multibyte character has not been read, the next 
byte will be taken as part of that multibyte character, meaning you will 
take '\303.' as a single invalid multibyte character.

At the same time, '\303\251' is a valid multibyte character, and '\251' 
is not. So also consider

   foo=$(printf '\303\251')
   foo=${foo%$(printf '\251')}

Here, it is not clear what the correct result is, and indeed, shells 
disagree. bosh, ksh, zsh, and my shell do not break up characters, which 
I believe to be the most sensible behaviour. bash and mksh do.

The corner cases need to be carefully considered in order to figure out 
how to write the multibyte character support core functionality.

Cheers,
Harald van Dijk

Herbert Xu April 29, 2024, 1:12 p.m. UTC | #8

On Sun, Apr 28, 2024 at 03:50:57PM +0100, Harald van Dijk wrote:
>
> The way you are implementing it, once you get to pmatch(), arguably you will
> not handle ${foo%.} correctly.

I took the easy way out so for now only the fnmatch path works
properly.  But eventually I will get to the pmatch path too,
especially because fnmatch(3) seems to slow down quite a bit
once locale is set.

> This is something I expect to set foo to '\303', and it does in all shells I
> know of, despite POSIX not saying this needs to work. The way you are
> implementing multibyte character support, if I am reading it right, as long
> as a full multibyte character has not been read, the next byte will be taken
> as part of that multibyte character, meaning you will take '\303.' as a
> single invalid multibyte character.

A multi-byte character has to be valid according to mbrtowc
before I mark it with MBCHAR.  Otherwise it'll be treated as
just a single-byte character.

Cheers,

[0/8] Add multi-byte support

Message

Comments