[v4,00/23] Ext4 Encoding and Case-insensitive support

Message ID: 20181206230903.30011-1-krisman@collabora.com
From: Gabriel Krisman Bertazi <krisman@collabora.com>
To: tytso@mit.edu
Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi <krisman@collabora.com>
Subject: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support
Date: Thu, 6 Dec 2018 18:08:40 -0500

Series: Ext4 Encoding and Case-insensitive support
  [v4,00/23] Ext4 Encoding and Case-insensitive support
  [v4,01/23] nls: Wrap uni2char/char2uni callers
  [v4,02/23] nls: Wrap charset field access
  [v4,03/23] nls: Wrap charset hooks in ops structure
  [v4,04/23] nls: Split default charset from NLS core
  [v4,05/23] nls: Split struct nls_charset from struct nls_table
  [v4,06/23] nls: Add support for multiple versions of an encoding
  [v4,07/23] nls: Implement NLS_STRICT_MODE flag
  [v4,08/23] nls: Let charsets define the behavior of tolower/toupper
  [v4,09/23] nls: Add new interface for string comparisons
  [v4,10/23] nls: Add optional normalization and casefold hooks
  [v4,11/23] nls: ascii: Support validation and normalization operations
  [v4,12/23] nls: utf8: Add unicode character database files
  [v4,13/23] scripts: add trie generator for UTF-8
  [v4,14/23] nls: utf8: Move nls-utf8{,-core}.c
  [v4,15/23] nls: utf8: Introduce code for UTF-8 normalization
  [v4,16/23] nls: utf8n: reduce the size of utf8data[]
  [v4,17/23] nls: utf8: Integrate utf8 normalization code with utf8 charset
  [v4,18/23] nls: utf8: Introduce test module for normalized utf8 implementation
  [v4,19/23] ext4: Reserve superblock fields for encoding information
  [v4,20/23] ext4: Include encoding information in the superblock
  [v4,21/23] ext4: Support encoding-aware file name lookups
  [v4,22/23] ext4: Implement EXT4_CASEFOLD_FL flag
  [v4,23/23] docs: ext4.rst: Document encoding and case-insensitive

Gabriel Krisman Bertazi Dec. 6, 2018, 11:08 p.m. UTC

Hi,

[Resending to include fsdevel, as requested by Dave Chinner]

Following the e2fsprogs changes, these are the corresponding kernel-side
modifications to support the fname_encoding feature.

The patches are split in two parts. The first 14 patches are refactoring
and improvements to the NLS code, including the utf8 normalization
support.  The final patches implement the fname_encoding feature in ext4.

To test this feature, you need to use the tip of the e2fsprogs branch,
which already includes support for enabling this feature.

As usual, the ucd files are not included in this email because they are
too large, and would actually cause the email message to bounce.

There are two test files for this in a private xfstests branch, that I
plan to submit upstream once we get this series merged:

  https://gitlab.collabora.com/krisman/xfstests.git -b encoding_v4

I also tested this with the xfstests smoke tests using two scenarios:
(1) a non-encoding TEST_DEV; (2) a utf8 enabled TEST_DEV.  In both
cases, no unrelated regressions were observed.  With my branch of
xfstests above, which fixes some related tests, I didn't observe any
regressions.

Gabriel Krisman Bertazi (19):
  nls: Wrap uni2char/char2uni callers
  nls: Wrap charset field access
  nls: Wrap charset hooks in ops structure
  nls: Split default charset from NLS core
  nls: Split struct nls_charset from struct nls_table
  nls: Add support for multiple versions of an encoding
  nls: Implement NLS_STRICT_MODE flag
  nls: Let charsets define the behavior of tolower/toupper
  nls: Add new interface for string comparisons
  nls: Add optional normalization and casefold hooks
  nls: ascii: Support validation and normalization operations
  nls: utf8: Move nls-utf8{,-core}.c
  nls: utf8: Integrate utf8 normalization code with utf8 charset
  nls: utf8: Introduce test module for normalized utf8 implementation
  ext4: Reserve superblock fields for encoding information
  ext4: Include encoding information in the superblock
  ext4: Support encoding-aware file name lookups
  ext4: Implement EXT4_CASEFOLD_FL flag
  docs: ext4.rst: Document encoding and case-insensitive

Olaf Weber (4):
  nls: utf8: Add unicode character database files
  scripts: add trie generator for UTF-8
  nls: utf8: Introduce code for UTF-8 normalization
  nls: utf8n: reduce the size of utf8data[]

 Documentation/admin-guide/ext4.rst   |   29 +
 fs/befs/linuxvfs.c                   |    8 +-
 fs/cifs/cifs_unicode.c               |   15 +-
 fs/cifs/cifsfs.c                     |    2 +-
 fs/cifs/connect.c                    |    2 +-
 fs/cifs/dir.c                        |    7 +-
 fs/ext4/dir.c                        |   59 +
 fs/ext4/ext4.h                       |   33 +-
 fs/ext4/hash.c                       |   38 +-
 fs/ext4/ialloc.c                     |    2 +-
 fs/ext4/inline.c                     |    2 +-
 fs/ext4/inode.c                      |    4 +-
 fs/ext4/ioctl.c                      |   18 +
 fs/ext4/namei.c                      |   85 +-
 fs/ext4/super.c                      |   83 +
 fs/fat/dir.c                         |   13 +-
 fs/fat/inode.c                       |    6 +-
 fs/fat/namei_vfat.c                  |    6 +-
 fs/hfs/super.c                       |    6 +-
 fs/hfs/trans.c                       |    9 +-
 fs/hfsplus/options.c                 |    2 +-
 fs/hfsplus/unicode.c                 |    6 +-
 fs/isofs/inode.c                     |    5 +-
 fs/isofs/joliet.c                    |    3 +-
 fs/jfs/jfs_unicode.c                 |    9 +-
 fs/jfs/super.c                       |    3 +-
 fs/nls/Kconfig                       |   15 +
 fs/nls/Makefile                      |   20 +
 fs/nls/mac-celtic.c                  |   34 +-
 fs/nls/mac-centeuro.c                |   34 +-
 fs/nls/mac-croatian.c                |   34 +-
 fs/nls/mac-cyrillic.c                |   34 +-
 fs/nls/mac-gaelic.c                  |   34 +-
 fs/nls/mac-greek.c                   |   34 +-
 fs/nls/mac-iceland.c                 |   34 +-
 fs/nls/mac-inuit.c                   |   34 +-
 fs/nls/mac-roman.c                   |   34 +-
 fs/nls/mac-romanian.c                |   34 +-
 fs/nls/mac-turkish.c                 |   34 +-
 fs/nls/nls_ascii.c                   |   84 +-
 fs/nls/nls_core.c                    |  163 ++
 fs/nls/nls_cp1250.c                  |   34 +-
 fs/nls/nls_cp1251.c                  |   34 +-
 fs/nls/nls_cp1255.c                  |   36 +-
 fs/nls/nls_cp437.c                   |   34 +-
 fs/nls/nls_cp737.c                   |   34 +-
 fs/nls/nls_cp775.c                   |   34 +-
 fs/nls/nls_cp850.c                   |   34 +-
 fs/nls/nls_cp852.c                   |   34 +-
 fs/nls/nls_cp855.c                   |   34 +-
 fs/nls/nls_cp857.c                   |   34 +-
 fs/nls/nls_cp860.c                   |   34 +-
 fs/nls/nls_cp861.c                   |   34 +-
 fs/nls/nls_cp862.c                   |   34 +-
 fs/nls/nls_cp863.c                   |   34 +-
 fs/nls/nls_cp864.c                   |   34 +-
 fs/nls/nls_cp865.c                   |   34 +-
 fs/nls/nls_cp866.c                   |   34 +-
 fs/nls/nls_cp869.c                   |   34 +-
 fs/nls/nls_cp874.c                   |   36 +-
 fs/nls/nls_cp932.c                   |   36 +-
 fs/nls/nls_cp936.c                   |   36 +-
 fs/nls/nls_cp949.c                   |   36 +-
 fs/nls/nls_cp950.c                   |   36 +-
 fs/nls/{nls_base.c => nls_default.c} |  124 +-
 fs/nls/nls_euc-jp.c                  |   29 +-
 fs/nls/nls_iso8859-1.c               |   34 +-
 fs/nls/nls_iso8859-13.c              |   34 +-
 fs/nls/nls_iso8859-14.c              |   34 +-
 fs/nls/nls_iso8859-15.c              |   34 +-
 fs/nls/nls_iso8859-2.c               |   34 +-
 fs/nls/nls_iso8859-3.c               |   34 +-
 fs/nls/nls_iso8859-4.c               |   34 +-
 fs/nls/nls_iso8859-5.c               |   34 +-
 fs/nls/nls_iso8859-6.c               |   34 +-
 fs/nls/nls_iso8859-7.c               |   34 +-
 fs/nls/nls_iso8859-9.c               |   34 +-
 fs/nls/nls_koi8-r.c                  |   34 +-
 fs/nls/nls_koi8-ru.c                 |   30 +-
 fs/nls/nls_koi8-u.c                  |   34 +-
 fs/nls/nls_utf8-core.c               |  328 +++
 fs/nls/nls_utf8-norm.c               |  797 ++++++
 fs/nls/nls_utf8-selftest.c           |  316 +++
 fs/nls/nls_utf8.c                    |   67 -
 fs/nls/ucd/README                    |   34 +
 fs/nls/utf8n.h                       |  117 +
 fs/ntfs/inode.c                      |    2 +-
 fs/ntfs/super.c                      |    6 +-
 fs/ntfs/unistr.c                     |   13 +-
 fs/udf/super.c                       |    3 +-
 fs/udf/unicode.c                     |    4 +-
 include/linux/fs.h                   |    2 +
 include/linux/nls.h                  |  293 ++-
 scripts/Makefile                     |    1 +
 scripts/mkutf8data.c                 | 3392 ++++++++++++++++++++++++++
 95 files changed, 7287 insertions(+), 618 deletions(-)
 create mode 100644 fs/nls/nls_core.c
 rename fs/nls/{nls_base.c => nls_default.c} (89%)
 create mode 100644 fs/nls/nls_utf8-core.c
 create mode 100644 fs/nls/nls_utf8-norm.c
 create mode 100644 fs/nls/nls_utf8-selftest.c
 delete mode 100644 fs/nls/nls_utf8.c
 create mode 100644 fs/nls/ucd/README
 create mode 100644 fs/nls/utf8n.h
 create mode 100644 scripts/mkutf8data.c

Randy Dunlap Dec. 7, 2018, 6:41 p.m. UTC | #1

On 12/6/18 3:08 PM, Gabriel Krisman Bertazi wrote:
> Hi,
> 
> [Resending to include fsdevel, as requested by Dave Chinner]
> 
> Following the e2fsprogs changes, these are the corresponding kernel-side
> modifications to support the fname_encoding feature.
> 
> The patches are split in two parts. The first 14 patches are refactoring
> and improvements to the NLS code, including the utf8 normalization
> support.  The final patches implement the fname_encoding feature in ext4.

Hi,

Please include some justification and use case(s) in the patch description.

Thanks.


Linus Torvalds Dec. 8, 2018, 9:48 p.m. UTC | #2

On Sat, Dec 8, 2018 at 12:22 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> There's a patch series that's been baking for a while that will likely
> go upstream either in the next upcoming merge window, or the one after
> that.  Since it adds support for Unicode case-folding, it involves a
> non-trivial number of changes to fs/nls.  As near as I can tell, no
> one is really maintaining fs/nls.

Christ.

Why do people want to do this? We know it's a crazy and stupid thing
to do. And we know that, exactly because people have done it, and it
has always been a mistake.

It causes actual and very subtle security issues.

It breaks things subtly even when they supposedly "know" about case
folding because different things will do it differently (ie user space
vs kernel space not having the *exact* same rules due to using
different tables, for example).

It doesn't work with locales, because people often want different
locales at the same time.

And it slows things down enormously because you can't do hashing well,
and comparisons get hugely more expensive.

And to add insult to injury, people always implement it so *horribly*
badly that it's not even funny.

For example, the usual way that people do it is to case-fold two
strings, and then compare the end results. And that's *incredibly*
stupid and slow and generates extra temporary allocations etc.
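
The contrast drawn here (folding into temporary copies versus folding on the fly) can be illustrated with a minimal sketch. It assumes ASCII-only folding, which is a stand-in and not what the patch series implements; the names fold_ascii, icase_cmp and icase_hash are hypothetical. The length shortcut in icase_cmp is only valid because ASCII folding never changes a string's length, which is not true of Unicode folding in general.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII-only fold: an illustrative stand-in for a real Unicode casefold. */
static unsigned char fold_ascii(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/*
 * Compare two names case-insensitively, folding one byte at a time so
 * that no temporary folded copy of either name is allocated.
 * Returns 0 on a match, nonzero otherwise.
 */
static int icase_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t i;

	assert(a != NULL && b != NULL);
	if (alen != blen)	/* valid for ASCII folding only */
		return 1;
	for (i = 0; i < alen; i++)
		if (fold_ascii((unsigned char)a[i]) !=
		    fold_ascii((unsigned char)b[i]))
			return 1;
	return 0;
}

/*
 * FNV-1a computed over the folded bytes, so names that are equal under
 * folding land in the same hash bucket.
 */
static uint32_t icase_hash(const char *s, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	assert(s != NULL);
	for (i = 0; i < len; i++) {
		h ^= fold_ascii((unsigned char)s[i]);
		h *= 16777619u;
	}
	return h;
}
```

With these helpers, "README" and "ReadMe" compare equal and hash to the same value, and no folded copy of either name is ever built.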

Or people do it character-by-character instead, and don't understand
utf-8 (which is literally designed to be easy to see character
boundaries *without* having to do a full decode!), and do *that*
incredibly badly instead.
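
As a small illustration of the boundary property mentioned above: in UTF-8 every continuation byte has the bit pattern 10xxxxxx, so a character start can be recognised by looking at a single byte. The sketch below assumes its input is already valid UTF-8, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A byte of the form 10xxxxxx continues a character; any other byte starts one. */
static int utf8_is_continuation(unsigned char c)
{
	return (c & 0xC0) == 0x80;
}

/* Count characters by counting the bytes that are not continuation bytes. */
static size_t utf8_count_chars(const char *s, size_t len)
{
	size_t i, n = 0;

	assert(s != NULL);
	for (i = 0; i < len; i++)
		if (!utf8_is_continuation((unsigned char)s[i]))
			n++;
	return n;
}

/* Return the offset of the next character start after pos, or len at the end. */
static size_t utf8_next(const char *s, size_t len, size_t pos)
{
	assert(s != NULL && pos <= len);
	if (pos < len)
		pos++;
	while (pos < len && utf8_is_continuation((unsigned char)s[pos]))
		pos++;
	return pos;
}
```

Neither helper decodes a code point; both only test the top two bits of each byte.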

And when you create a file with an ambiguous name, what does readdir
report? Does it report the name you used, some normalized thing, or
what?

Finally, people then invariably do it in ways that preclude any
concurrent sane uses.

For example, they make it a single mount-time flag for the whole
filesystem, so now if you are (for example) wanting to do emulation of
bad system decisions, you now force the *host* to buy into the whole
mistake too.

And they make it a whole-filesystem flag, instead of (for example)
allowing just the emulated environment to do case-insensitive
filesystem operations on an operation-by-operation basis, and possibly
only within a particular subdirectory structure (or bind mount).

So the first thing I want to know is who really needs it, *why* they
need it, and what the design is for.

Because I can almost guarantee that the design is horrible, and the
reasons are really really bad.

And what *are* the case insensitivity rules, and how do you co-exist
when there are two *different* folding rules at the same time? For
example, OS X has some truly horrendously bad rules, that take the
badness that Windows did to a whole different level. What if you're a
file server (or emulation environment) and you want to expose the same
filesystem to both of those environments?

Because it would quite possibly be a whole lot better to allow
per-operation flags, so that you can do

    fd = openat(dir, path, O_RDONLY | O_ICASE);

so that you can allow *one* process to treat a filesystem as if it was
case insensitive (think "Wine with a ~/.wine/C directory"), without
forcing the whole filesystem to be icase.

Yes, allowing concurrent use then generates whole new "interesting"
questions, like "what happens if a case _sensitive_ user creates two
files with names that are identical to an insensitive user", but they
aren't necessarily any worse than the issues you face *not* allowing
that.

> Given your recent comments about not wanting to see pull requests for
> things outside of fs/xfs as part of the xfs pull, do you have any
> opinions about how to manage this feature going upstream?  My
> original plan was to send them through the ext4 tree, since I very
> much doubt Al cares much about nls issues, and they will only impact
> ext4.

I really want to know what is driving this insanity, and what the
actual use-case is.

You have a diffstat, but not a git tree to look at what the heck is going on.

Seriously, case insensitivity is *such* a horrendously bad idea that
people need to think about it deeply, and nobody seems to ever do
that.

And yes, we have d_hash() and some rudimentary support for it in the
VFS layer, but that VFS layer bit was always meant purely for
interoperability filesystems that nobody really cared about as a real
filesystem for Linux. Notably FAT and its ilk.

If we have a major native filesystem doing it, I think we need to
actively think about the big picture and do it *right*. None of the
crazy "ok, you can't even look things up in the dcache directly at
all" stuff that we have as a hack to just allow _bad_ filesystems to
do their thing.

So I think this is a bigger deal than that diffstat of yours implies.
I don't think people understand just how *bad* case insensitivity is.

The old DOS/Mac people thought case insensitivity was a "helpful"
idea, and that was understandable - but wrong - even back in the 80's.
They are still living with the end result of that horrendously bad
decision decades later. They've _tried_ to fix their bad decisions,
and have never been able to (except, apparently, in iOS where somebody
finally had a glimmer of a clue).

                 Linus

Linus Torvalds Dec. 8, 2018, 9:58 p.m. UTC | #3

On Sat, Dec 8, 2018 at 1:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yes, allowing concurrent use then generates whole new "interesting"
> questions, like "what happens if a case _sensitive_ user creates two
> files with names that are identical to an insensitive user", but they
> aren't necessarily any worse than the issues you face *not* allowing
> that.

I'm hoping you are at least doing it per-directory. That makes at
least the "oh, the whole filesystem needs to do this wrong" issue a
bit less bad.

Just looking at the shortlog you posted, my guess is that the ext4
patches didn't even get *that* right, though. That shortlog "encoding
information in superblock" implies this is the same kind of just
horribly bad mess that we've seen before.

I really despise every single case-insensitive filesystem I have ever
seen, exactly because nobody apparently spends even a minimal amount
of effort on getting any of the basics remotely right. Every single
case I've seen has been a huge nasty hack, with seriously bad
system-wide consequences.

                  Linus

Linus Torvalds Dec. 8, 2018, 10:59 p.m. UTC | #4

On Sat, Dec 8, 2018 at 1:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm hoping you are at least doing it per-directory. That makes at
> least the "oh, the whole filesystem needs to do this wrong" issue a
> bit less bad.

So for example, if you do it per-directory, the rules could be something like:

 - new directories (ie "mkdir()") inherit the icase/folding semantics
of the parent directory

 - empty directories can have their case/folding rules changed with
some well-defined interface

and even from just those simple rules, now some icase behavior could
be useful to testing.
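
The two rules above can be modelled in a few lines. This is a toy in-memory model under stated assumptions, not ext4 code: struct toy_dir, toy_mkdir and toy_set_casefold are hypothetical names, and returning -ENOTEMPTY for a non-empty directory is an assumption chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory directory model; not an ext4 structure. */
struct toy_dir {
	bool casefold;		/* per-directory icase/folding flag */
	unsigned int nentries;	/* entries currently in the directory */
};

/* Rule 1: a new directory inherits the folding semantics of its parent. */
static struct toy_dir toy_mkdir(struct toy_dir *parent)
{
	struct toy_dir child;

	assert(parent != NULL);
	child.casefold = parent->casefold;
	child.nentries = 0;
	parent->nentries++;	/* the parent now contains the child */
	return child;
}

/* Rule 2: only an empty directory may have its folding rules changed. */
static int toy_set_casefold(struct toy_dir *dir, bool enable)
{
	assert(dir != NULL);
	if (dir->nentries != 0)
		return -ENOTEMPTY;	/* assumed error code for this sketch */
	dir->casefold = enable;
	return 0;
}
```

Restricting changes to empty directories means no existing entry ever has to be re-checked for collisions under the new rules.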

Not just filesystem testing (although that would be a thing - think
fsstress), but for doing app development in a test directory.

Apps like git (and GNU fileutils) could use it for having test suites
for FAT etc filesystems.

And cross-platform apps could use it as a "I want to check that I do
the right thing" if you do development on Linux, but might have a
portable app for other platforms.

If the whole filesystem is that way, nobody is going to do it. Sure,
they could do it on a FAT filesystem using a USB disk, but nobody
really does that. But if you can trivially just run your tests in a
test subdirectory, it's another thing entirely.

So this is the kind of thing I mean when I think icase behavior for a
major Linux filesystem should have a real _design_. It's really quite
fundamentally different from the "oh, I need FAT to be icase" hack
that we have now.

(We might also be able to make the dcache better at handling
well-defined icase/folding rules, as opposed to the current "just give
up, let the filesystem hash it" behavior).

              Linus

Andreas Dilger Dec. 9, 2018, 12:46 a.m. UTC | #5

On Dec 8, 2018, at 3:59 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Sat, Dec 8, 2018 at 1:58 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> 
>> I'm hoping you are at least doing it per-directory. That makes at
>> least the "oh, the whole filesystem needs to do this wrong" issue a
>> bit less bad.
> 
> So for example, if you do it per-directory, the rules could be something like:
> 
> - new directories (ie "mkdir()") inherit the icase/folding semantics
> of the parent directory
> 
> - empty directories can have their case/folding rules changed with
> some well-defined interface
> 
> and even from just those simple rules, now some icase behavior could
> be useful to testing.
> 
> Not just filesystem testing (although that would be a thing - think
> fsstress), but for doing app development in a test directory.
> 
> Apps like git (and GNU fileutils) could use it for having test suites
> for FAT etc filesystems.
> 
> And cross-platform apps could use it as a "I want to check that I do
> the right thing" if you do development on Linux, but might have a
> portable app for other platforms.
> 
> If the whole filesystem is that way, nobody is going to do it. Sure,
> they could do it on a FAT filesystem using a USB disk, but nobody
> really does that. But if you can trivially just run your tests in a
> test subdirectory, it's another thing entirely.
> 
> So this is the kind of thing I mean when I think icase behavior for a
> major Linux filesystem should have a real _design_. It's really quite
> fundamentally different from the "oh, I need FAT to be icase" hack
> that we have now.
> 
> (We might also be able to make the dcache better at handling
> well-defined icase/folding rules, as opposed to the current "just give
> up, let the filesystem hash it" behavior).

In theory, we could store the encoding on a per-entry basis if we
wanted, using the dir_data feature (this would consume 2-3 bytes per
entry, depending on how rich an encoding type we wanted).  The tricky
part is how does the kernel know what the filename encoding is?  How
do we communicate the encoding type back to userspace?

Cheers, Andreas

Linus Torvalds Dec. 9, 2018, 5:41 p.m. UTC | #6

On Sat, Dec 8, 2018 at 9:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> Whether or not case-folding is being done is per-directory (it's a
> flag on the directory set by chattr).  What encoding is supported
> (and we only will support two, ASCII and UTF-8) is per-file system.  I
> personally believe it's insane to try to encode a large number of
> encodings, like big5, or iso-8859-1, etc. on a per-directory basis.
> Either don't do encodings at all, or use utf-8.  Period.  I believe
> you made a similar request for git metadata, no?   :-)

Absolutely.

But if you only support ascii or utf-8, then why are you messing with
the nls part? That makes no sense.

You can't have it both ways.

Either you have a horrible fundamental design mistake that has
different per-filesystem locales, or you don't.

If you don't, you shouldn't be touching any of the nls code.

Whatever unicode tables you use for case folding shouldn't be in the nls code.

                 Linus

Theodore Ts'o Dec. 9, 2018, 8:10 p.m. UTC | #7

On Sun, Dec 09, 2018 at 09:41:13AM -0800, Linus Torvalds wrote:
> But if you only support ascii or utf-8, then why are you messing with
> the nls part? That makes no sense.
> 
> You can't have it both ways.
> 
> Either you have a horrible fundamental design mistake that has
> different per-filesystem locales, or you don't.
> 
> If you don't, you shouldn't be touching any of the nls code.
> 
> Whatever unicode tables you use for case folding shouldn't be in the nls code.

Gabriel added the Unicode tables for case folding to the fs/nls
directory.  If you'd prefer that we put them somewhere else, we
can; do you have a preference?

					- Ted

Gabriel Krisman Bertazi Dec. 9, 2018, 8:53 p.m. UTC | #8

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Dec 8, 2018 at 9:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:

> Either you have a horrible fundamental design mistake that has
> different per-filesystem locales, or you don't.
>
> If you don't, you shouldn't be touching any of the nls code.
>
> Whatever unicode tables you use for case folding shouldn't be in the nls code.

Hi Linus,

As Ted mentioned the SMB case, in my understanding, we might have more
users for in-kernel utf8 normalization/casefold comparison functions than
just ext4 in the future.  Steve French (in cc.), for instance, mentioned
his interest in using this higher level NLS API when I first submitted
these patches.

My first RFC actually included this code as a separate module inside
lib/ instead of touching NLS, but I found myself rewriting much of the
same APIs that already existed in NLS.  That is why I merged my work
with that subsystem.  I am open to rethinking it, if there is a better
alternative.

Thanks,

Linus Torvalds Dec. 9, 2018, 8:54 p.m. UTC | #9

On Sun, Dec 9, 2018 at 12:10 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> Gabriel added the Unicode tables for case folding to the fs/nls
> directory.  If you'd prefer that we put them somewhere else, we
> can; do you have a preference?

I have a really hard time judging, since I haven't seen the code, just
a random diffstat and shortlog.

First off, there is no such thing as "one" unicode table for case
folding. There are lots and lots of tables, and I'm not clear what
table it is all about.

For example, both OS X and Windows do some form of case folding on
unicode. They don't do the *same* folding, though.

There are also various locale variations to case folding. This is
where I thought your nls choice came from, but then you tried to imply
that there are no locale issues and that directories can just have a
single flag to enable/disable the folding.

In some locales, "SS" and "ß" (perhaps "SZ" too) will compare the same
in case-insensitivity. Crazy in general, and afaik modern unicode even
has a real upper-case "ß" so it's arguably legacy, but...

And that's all entirely independent of the issues with all the
combining characters, modifier letters, white-space, overlong utf8
questions, etc etc.

It's also easy to generate overlong utf-8 that decodes to '/', for
example. Some broken systems might consider that identical to a real
'/' and it matters for path lookup.
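
To make the overlong case concrete: the two bytes C0 AF carry the same
payload bits as a plain '/' (0x2F). What follows is only a minimal
userspace sketch, not code from this patch series, of a decoder that
refuses such sequences. The name utf8_decode_strict and its return
convention are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence from s[0..len).
 * Returns the number of bytes consumed, or -1 if the sequence is
 * truncated, malformed, overlong, a surrogate, or above U+10FFFF.
 * Hypothetical helper for illustration only.
 */
static int utf8_decode_strict(const unsigned char *s, size_t len, uint32_t *cp)
{
	/* Smallest code point that legitimately needs n bytes. */
	static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	uint32_t c;
	int n, i;

	assert(s != NULL && cp != NULL);
	if (len == 0)
		return -1;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2;
		c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4;
		c = s[0] & 0x07;
	} else {
		return -1;	/* stray continuation byte or invalid lead byte */
	}

	if (len < (size_t)n)
		return -1;	/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return -1;	/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}

	if (c < min_cp[n])
		return -1;	/* overlong, e.g. C0 AF would decode to '/' */
	if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return -1;	/* outside Unicode range, or a surrogate */

	*cp = c;
	return n;
}
```

With this check, C0 AF is rejected outright, so only the single byte
0x2F can ever be treated as a path separator.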

So what's the actual code? What rules did you happen to pick? Did you
take the windows rules as-is (I _think_ they may be documented) since
the primary target apparently is just samba performance?

And even if the answer is "we follow NTFS rules", which *version* of
NTFS folding rules are you using if you're trying to speed up samba,
for example? Because afaik they have changed over time.

Is the *only* target samba? You are never interested in local loads
like "oh, people want to run Wine and might need it" or the
application testing parts?

All of these matter.

For example, if it's some "ext4 special case just for samba", then
perhaps the logical place to put all this is just in fs/ext4/ and not
bother anybody else about it.

But if it might be useful as some generic "NTFS hashing" library, then
make it that.

                   Linus

Linus Torvalds Dec. 9, 2018, 9:05 p.m. UTC | #10

On Sun, Dec 9, 2018 at 12:53 PM Gabriel Krisman Bertazi
<krisman@collabora.com> wrote:
>
> As Ted mentioned the SMB case, in my understanding, we might have more
> users for in-kernel utf8 normalization/casefold comparison functions than
> just ext4 in the future.

Crossed emails.

See my note about how there really is not a single case-folding
library. It's simply not physically possible, because there are so
many different ideas about what case-folding actually means.

That's still true even if "everything is utf-8", sadly.

So how do you handle locale issues and things like "we have ten
different tables for utf-8 comparisons, and that's _ignoring_ the
issue of whether we combine or decompose characters"?

And there's no way you can use the existing nls interfaces for
upper/lower case, for example, since they are all limited to 256-byte
tables and direct accesses to said tables, afaik.

And if that is where the extensions were, and that is why you changed
other filesystems, this all matters.

My *guess* is that what you really want is not really about unicode at
all, but specifically about just the NTFS rules. Which, yes, might
find generic sharing interest between cifs/ext4/etc, but my gut feel
is that they'd be specifically about some NTFS interoperability
library.

Because even then I think you might have issues like "NTFS-5.1" vs
"NTFS-4.0" etc.

Maybe you don't care, and you're picking just *one* version. And I
haven't seen the code.

Basically, I would not be surprised if the sanest model is simply to
make a "ntfs" library. Because I'm really fairly sure that OS X rules
are very different indeed, even if it too is "unicode".

                 Linus

Theodore Ts'o Dec. 10, 2018, 12:08 a.m. UTC | #11

On Sun, Dec 09, 2018 at 12:54:38PM -0800, Linus Torvalds wrote:
> First off, there is no such thing as "one" unicode table for case
> folding. There are lots and lots of tables, and I'm not clear what
> table it is all about.
> 
> For example, both OS X and Windows do some form of case folding on
> unicode. They don't do the *same* folding, though.

So things are much better in recent years.  In the past it was kind of
a disaster, but the world is converging enough that the latest
versions of Mac OS X's APFS and Windows NTFS behave pretty much the same
way.  They are both case-insensitive, case-preserving and
normalization-preserving, normalization-insensitive with respect to
filenames.

In the bad old-days, MacOS X's HFS+ was not normalization-preserving.
So it would force filenames to NFD form --- so if the user tried to
create a file named Å, and passed in the Unicode string U+212B to
creat(2), HFS+ would store it as U+0041,U+030A and that is what
readdir(2) would return.  Apple has effectively admitted this was a
mistake, and their new APFS doesn't do this any more.

Now, both file systems basically say, "we don't care whether you pass
in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we
will treat it as the same filename; but readdir(2) will return what
you gave us."
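
As a rough illustration of "normalization-insensitive but
normalization-preserving", the sketch below compares two names after
decomposing them into scratch buffers, leaving the caller's strings
untouched. It is a toy under loud assumptions: it works on already
decoded code points, its decomposition table covers only the Å example
above (real tables come from the Unicode Character Database), it skips
canonical reordering of combining marks, and every name in it
(toy_decompose, toy_nfd, toy_names_equal, TOY_MAX) is made up for this
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX 64	/* capacity of the scratch buffers, in code points */

/* Toy canonical decomposition: knows only the code points from the example. */
static size_t toy_decompose(uint32_t cp, uint32_t out[2])
{
	switch (cp) {
	case 0x212B:	/* ANGSTROM SIGN */
	case 0x00C5:	/* LATIN CAPITAL LETTER A WITH RING ABOVE */
		out[0] = 0x0041;	/* LATIN CAPITAL LETTER A */
		out[1] = 0x030A;	/* COMBINING RING ABOVE */
		return 2;
	default:
		out[0] = cp;
		return 1;
	}
}

/* Decompose src into dst (capacity TOY_MAX); returns length or -1 on overflow. */
static int toy_nfd(const uint32_t *src, size_t n, uint32_t *dst)
{
	size_t i, used = 0;

	assert(src != NULL && dst != NULL);
	for (i = 0; i < n; i++) {
		uint32_t tmp[2];
		size_t k = toy_decompose(src[i], tmp);

		if (used + k > TOY_MAX)
			return -1;
		memcpy(dst + used, tmp, k * sizeof(tmp[0]));
		used += k;
	}
	return (int)used;
}

/*
 * Returns 1 if the two names are equal after toy decomposition, else 0.
 * Names too long for the scratch buffers are reported as not equal.
 * The inputs are never modified, so the stored form is preserved.
 */
static int toy_names_equal(const uint32_t *a, size_t alen,
			   const uint32_t *b, size_t blen)
{
	uint32_t na[TOY_MAX], nb[TOY_MAX];
	int la = toy_nfd(a, alen, na);
	int lb = toy_nfd(b, blen, nb);

	if (la < 0 || lb < 0 || la != lb)
		return 0;
	return memcmp(na, nb, (size_t)la * sizeof(na[0])) == 0;
}
```

Here { U+212B } and { U+0041, U+030A } compare equal, while the array
handed to the comparison is never rewritten, which is the property that
lets readdir(2) return exactly what was passed to creat(2).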

It's been a *long* time since Unicode has changed case folding rules
for pre-existing characters.  The tables have only changed with
respect to new character sets that have been added.  If you have a set
of filenames which were all legal under Unicode 5.0, how they case
fold didn't change with respect to Unicode 6.0, 7.0, 8.0, 9.0, 10.0 or
11.0.

Unicode 11.0 added some character sets like Ancient Sanskrit, a bunch
of new emoji, and the copyleft symbol, and to the extent that
Ancient Sanskrit had case, the tables might have been *extended*.  But
that doesn't break backwards compatibility.

And, of course, MacOS and Windows have been aggressively tracking
Unicode updates because everybody wants the latest emoji.  :-)

And it's not just SAMBA/CIFS.  The NFSv4 protocol also provides for
case/normalization preserving filenames, and you can specify a NFSv4
mount option whether or not file name lookups should be
case/normalization insensitive.  And the NFSv4 protocol specs also
specify the use of the Unicode tables, of which the latest versions
can be downloaded here:

	http://www.unicode.org/Public/11.0.0/ucd/

So how about this?  We'll put the unicode handling functions in a new
directory, fs/unicode, just to make it really clear that this will not
be changing any of the legacy fs/nls functions which other file
systems will use.  By putting it in a separate directory, it will be
easier for other file systems to use it, whether it's for better Samba
or NFSv4 support.

						- Ted

Linus Torvalds Dec. 10, 2018, 7:35 p.m. UTC | #12

On Sun, Dec 9, 2018 at 4:08 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> So things are much better in recent years.  In the past it was kind of
> a disaster, but the world is converging enough that the latest
> versions of Mac OS X's APFS and Windows NTFS behave pretty much the same
> way.  They are both case-insensitive, case-preserving and
> normalization-preserving, normalization-insensitive with respect to
> filenames.

Oh, so APFS at least fixed *that* horrific problem with their
filesystem. Oh how I despised the exposure of NFD (which should at
most be used as an internal representation, not externally visible).
Turning basic letters (coming from Finland, åäö) into character
combinations was an absolute abomination.

> In the bad old-days, MacOS X's HFS+ was not normalization-preserving.

Oh, I'm very aware.

It's not even that it wasn't normalization-preserving, it picked the
*wrong* normalization to use.

> Now, both file systems basically say, "we don't care whether you pass
> in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we
> will treat it as the same filename; but readdir(2) will return what
> you gave us."

Actually, the "on the screen it will look identical" is a horribly
incorrect thing to do too.

There are lots of things that look identical on the screen without
being at all the same thing. Sometimes it depends on font, sometimes
it's just how it is. A nonbreaking space is *not* the same as a
regular space, even if they may look identical on the screen.

I suspect (and sincerely _hope_) neither filesystem actually does
anything as stupid as taking "glyph equivalence" into account.

I'm hoping it's just "convert to NFx, then lower-case, then compare
for equality". Where the 'x' doesn't much matter as long as it is
never _exposed_ in any way outside of the comparison (ie NFD is a fine
and probably simpler model for the lower-casing, the HFS+ mistake was
to then expose the corrupted form of the filename).

> It's been a *long* time since Unicode has changed case folding rules
> for pre-existing characters.  The tables have only changed with
> respect to new character sets that have been added.

But new characters _have_ been added, and some of them do have
lower-case form, so the folding tables have changed.

Happily, maybe that is over. As long as the Unicode people continue to
mainly play with their Emoji list, I guess we can consider it done.

> So how about this?  We'll put the unicode handling functions in a new
> directory, fs/unicode, just to make it really clear that this will not
> be changing any of the legacy fs/nls functions which other file
> systems will use.  By putting it in a separate directory, it will be
> easier for other file systems to use it, whether it's for better Samba
> or NFSv4 support.

Ok, that sounds fine.

Some of the unicode translation functions from the NLS code could well
move into that, and NLS itself could be relegated to the sad
historical thing.

And please try to make the *interfaces* sane.

For example, the interface for "let's compare with folded case" should
*not* be about "convert to NFKD and lower case into a temp buffer,
then compare the results".

You can do a lot of "let's handle the simple cases" faster even if the
"oh, I hit a complex character" case might then become one of those
"convert to a temp buffer" cases.

And it shouldn't be about C strings, since we very much have cases
where it's not a C string but a {ptr,len} tuple. Maybe even use the
"struct qstr", which is a not-horrible way to pass those around.

Even if you have a C string, you can always just do

        struct qstr str = QSTR_INIT(name, strlen(name));

and then pass that qstr pointer around.

Finally, don't do the NLS thing with "descriptors" that you register
and look up. The indirection kills you. Particularly the crazy "one
character at a time" model.

Just let people explicitly say "utf8_icasecmp(qstr, qstr)" or
something like that.  With the interface at least allowing for the
common simple cases (ie everything is in the ASCII subset) to be
handled basically as a specialized thing.
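
One possible shape for such an interface, sketched in userspace C
rather than as kernel code: a direct function taking two {ptr,len}
pairs, with an inline ASCII loop that hands off to a slow path only
when it meets a byte with the high bit set. Everything here is an
assumption for illustration. struct name_ref merely stands in for
struct qstr, sketch_icasecmp is not the utf8_icasecmp mentioned above,
and slow_path_cmp is a placeholder that compares the remaining bytes
exactly where a real implementation would normalize and case-fold
using the Unicode tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct qstr: a {ptr,len} pair. */
struct name_ref {
	const unsigned char *name;
	size_t len;
};

static unsigned char ascii_lower(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/*
 * Placeholder slow path: exact byte comparison of the two remainders.
 * A real version would normalize and case-fold here, possibly into a
 * temporary buffer. Returns 0 if equal, nonzero otherwise.
 */
static int slow_path_cmp(const unsigned char *a, size_t alen,
			 const unsigned char *b, size_t blen)
{
	return !(alen == blen && memcmp(a, b, alen) == 0);
}

/* Returns 0 if the names match (ASCII case ignored), nonzero otherwise. */
static int sketch_icasecmp(const struct name_ref *a, const struct name_ref *b)
{
	size_t i = 0;

	assert(a != NULL && b != NULL);
	while (i < a->len && i < b->len) {
		unsigned char ca = a->name[i];
		unsigned char cb = b->name[i];

		/* Any non-ASCII byte: give the rest to the slow path. */
		if ((ca | cb) & 0x80)
			return slow_path_cmp(a->name + i, a->len - i,
					     b->name + i, b->len - i);
		if (ascii_lower(ca) != ascii_lower(cb))
			return 1;
		i++;
	}
	/* One name ended first: equal only if both ended together. */
	return a->len != b->len;
}
```

Because the hand-off happens before the non-ASCII byte is consumed, the
slow path always starts on a position where both names have matched so
far, and all-ASCII names never leave the simple loop.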

                    Linus
