Message ID | 20181206230903.30011-1-krisman@collabora.com (mailing list archive) |
---|---|
Series | Ext4 Encoding and Case-insensitive support |
On 12/6/18 3:08 PM, Gabriel Krisman Bertazi wrote:
> Hi,
>
> [Resending to include fsdevel, as requested by Dave Chinner]
>
> Following the e2fsprogs changes, these are the corresponding kernel-side
> modifications to support the fname_encoding feature.
>
> The patches are split in two parts. The first 14 patches are refactoring
> and improvements to the NLS code, including the utf8 normalization
> support. The final patches implement the fname_encoding feature in ext4.

Hi,

Please include some justification and use case(s) in the patch description.

Thanks.

> To test this feature, you need to use the tip of e2fsprogs branch, which
> already includes support for enabling this feature.
>
> As usual, the ucd files are not included in this email because they are
> too large, and would actually cause the email message to bounce.
>
> There are two test files for this in a private xfstests branch, that I
> plan to submit upstream once we get this series merged:
>
> https://gitlab.collabora.com/krisman/xfstests.git -b encoding_v4
>
> I also tested this with the xfstests smoke tests using two scenarios:
> (1) a non-encoding TEST_DEV; (2) a utf8 enabled TEST_DEV. In both
> cases, no unrelated regressions were observed. With my branch of
> xfstests above, that fixes some related tests, I didn't observe any
> regressions.
>
> Gabriel Krisman Bertazi (19):
>   nls: Wrap uni2char/char2uni callers
>   nls: Wrap charset field access
>   nls: Wrap charset hooks in ops structure
>   nls: Split default charset from NLS core
>   nls: Split struct nls_charset from struct nls_table
>   nls: Add support for multiple versions of an encoding
>   nls: Implement NLS_STRICT_MODE flag
>   nls: Let charsets define the behavior of tolower/toupper
>   nls: Add new interface for string comparisons
>   nls: Add optional normalization and casefold hooks
>   nls: ascii: Support validation and normalization operations
>   nls: utf8: Move nls-utf8{,-core}.c
>   nls: utf8: Integrate utf8 normalization code with utf8 charset
>   nls: utf8: Introduce test module for normalized utf8 implementation
>   ext4: Reserve superblock fields for encoding information
>   ext4: Include encoding information in the superblock
>   ext4: Support encoding-aware file name lookups
>   ext4: Implement EXT4_CASEFOLD_FL flag
>   docs: ext4.rst: Document encoding and case-insensitive
>
> Olaf Weber (4):
>   nls: utf8: Add unicode character database files
>   scripts: add trie generator for UTF-8
>   nls: utf8: Introduce code for UTF-8 normalization
>   nls: utf8n: reduce the size of utf8data[]
>
>  Documentation/admin-guide/ext4.rst | 29 +
>  fs/befs/linuxvfs.c | 8 +-
>  fs/cifs/cifs_unicode.c | 15 +-
>  fs/cifs/cifsfs.c | 2 +-
>  fs/cifs/connect.c | 2 +-
>  fs/cifs/dir.c | 7 +-
>  fs/ext4/dir.c | 59 +
>  fs/ext4/ext4.h | 33 +-
>  fs/ext4/hash.c | 38 +-
>  fs/ext4/ialloc.c | 2 +-
>  fs/ext4/inline.c | 2 +-
>  fs/ext4/inode.c | 4 +-
>  fs/ext4/ioctl.c | 18 +
>  fs/ext4/namei.c | 85 +-
>  fs/ext4/super.c | 83 +
>  fs/fat/dir.c | 13 +-
>  fs/fat/inode.c | 6 +-
>  fs/fat/namei_vfat.c | 6 +-
>  fs/hfs/super.c | 6 +-
>  fs/hfs/trans.c | 9 +-
>  fs/hfsplus/options.c | 2 +-
>  fs/hfsplus/unicode.c | 6 +-
>  fs/isofs/inode.c | 5 +-
>  fs/isofs/joliet.c | 3 +-
>  fs/jfs/jfs_unicode.c | 9 +-
>  fs/jfs/super.c | 3 +-
>  fs/nls/Kconfig | 15 +
>  fs/nls/Makefile | 20 +
>  fs/nls/mac-celtic.c | 34 +-
>  fs/nls/mac-centeuro.c | 34 +-
>  fs/nls/mac-croatian.c | 34 +-
>  fs/nls/mac-cyrillic.c | 34 +-
>  fs/nls/mac-gaelic.c | 34 +-
>  fs/nls/mac-greek.c | 34 +-
>  fs/nls/mac-iceland.c | 34 +-
>  fs/nls/mac-inuit.c | 34 +-
>  fs/nls/mac-roman.c | 34 +-
>  fs/nls/mac-romanian.c | 34 +-
>  fs/nls/mac-turkish.c | 34 +-
>  fs/nls/nls_ascii.c | 84 +-
>  fs/nls/nls_core.c | 163 ++
>  fs/nls/nls_cp1250.c | 34 +-
>  fs/nls/nls_cp1251.c | 34 +-
>  fs/nls/nls_cp1255.c | 36 +-
>  fs/nls/nls_cp437.c | 34 +-
>  fs/nls/nls_cp737.c | 34 +-
>  fs/nls/nls_cp775.c | 34 +-
>  fs/nls/nls_cp850.c | 34 +-
>  fs/nls/nls_cp852.c | 34 +-
>  fs/nls/nls_cp855.c | 34 +-
>  fs/nls/nls_cp857.c | 34 +-
>  fs/nls/nls_cp860.c | 34 +-
>  fs/nls/nls_cp861.c | 34 +-
>  fs/nls/nls_cp862.c | 34 +-
>  fs/nls/nls_cp863.c | 34 +-
>  fs/nls/nls_cp864.c | 34 +-
>  fs/nls/nls_cp865.c | 34 +-
>  fs/nls/nls_cp866.c | 34 +-
>  fs/nls/nls_cp869.c | 34 +-
>  fs/nls/nls_cp874.c | 36 +-
>  fs/nls/nls_cp932.c | 36 +-
>  fs/nls/nls_cp936.c | 36 +-
>  fs/nls/nls_cp949.c | 36 +-
>  fs/nls/nls_cp950.c | 36 +-
>  fs/nls/{nls_base.c => nls_default.c} | 124 +-
>  fs/nls/nls_euc-jp.c | 29 +-
>  fs/nls/nls_iso8859-1.c | 34 +-
>  fs/nls/nls_iso8859-13.c | 34 +-
>  fs/nls/nls_iso8859-14.c | 34 +-
>  fs/nls/nls_iso8859-15.c | 34 +-
>  fs/nls/nls_iso8859-2.c | 34 +-
>  fs/nls/nls_iso8859-3.c | 34 +-
>  fs/nls/nls_iso8859-4.c | 34 +-
>  fs/nls/nls_iso8859-5.c | 34 +-
>  fs/nls/nls_iso8859-6.c | 34 +-
>  fs/nls/nls_iso8859-7.c | 34 +-
>  fs/nls/nls_iso8859-9.c | 34 +-
>  fs/nls/nls_koi8-r.c | 34 +-
>  fs/nls/nls_koi8-ru.c | 30 +-
>  fs/nls/nls_koi8-u.c | 34 +-
>  fs/nls/nls_utf8-core.c | 328 +++
>  fs/nls/nls_utf8-norm.c | 797 ++++++
>  fs/nls/nls_utf8-selftest.c | 316 +++
>  fs/nls/nls_utf8.c | 67 -
>  fs/nls/ucd/README | 34 +
>  fs/nls/utf8n.h | 117 +
>  fs/ntfs/inode.c | 2 +-
>  fs/ntfs/super.c | 6 +-
>  fs/ntfs/unistr.c | 13 +-
>  fs/udf/super.c | 3 +-
>  fs/udf/unicode.c | 4 +-
>  include/linux/fs.h | 2 +
>  include/linux/nls.h | 293 ++-
>  scripts/Makefile | 1 +
>  scripts/mkutf8data.c | 3392 ++++++++++++++++++++++++++
>  95 files changed, 7287 insertions(+), 618 deletions(-)
>  create mode 100644 fs/nls/nls_core.c
>  rename fs/nls/{nls_base.c => nls_default.c} (89%)
>  create mode 100644 fs/nls/nls_utf8-core.c
>  create mode 100644 fs/nls/nls_utf8-norm.c
>  create mode 100644 fs/nls/nls_utf8-selftest.c
>  delete mode 100644 fs/nls/nls_utf8.c
>  create mode 100644 fs/nls/ucd/README
>  create mode 100644 fs/nls/utf8n.h
>  create mode 100644 scripts/mkutf8data.c
>
On Sat, Dec 8, 2018 at 12:22 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> There's a patch series that's been baking for a while that will likely
> go upstream either in the next upcoming merge window, or the one after
> that. Since it adds support for Unicode case-folding, it involves a
> non-trivial number of changes to fs/nls. As near as I can tell, no
> one is really maintaining fs/nls.

Christ. Why do people want to do this? We know it's a crazy and stupid
thing to do. And we know that, exactly because people have done it,
and it has always been a mistake.

It causes actual and very subtle security issues.

It breaks things subtly even when they supposedly "know" about case
folding because different things will do it differently (ie user space
vs kernel space not having the *exact* same rules due to using
different tables, for example).

It doesn't work with locales, because people often want different
locales at the same time.

And it slows things down enormously because you can't do hashing well,
and comparisons get hugely more expensive.

And to add insult to injury, people always implement it so *horribly*
badly that it's not even funny.

For example, the usual way that people do it is to case-fold two
strings, and then compare the end results. And that's *incredibly*
stupid and slow and generates extra temporary allocations etc.

Or people do it character-by-character instead, and don't understand
utf-8 (which is literally designed to be easy to see character
boundaries *without* having to do a full decode!), and do *that*
incredibly badly instead.

And when you create a file with an ambiguous name, what does readdir
report? Does it report the name you used, some normalized thing, or
what?

Finally, people then invariably do it in ways that preclude any
concurrent sane uses.

For example, they make it a single mount-time flag for the whole
filesystem, so now if you are (for example) wanting to do emulation of
bad system decisions, you now force the *host* to buy into the whole
mistake too. And they make it a whole-filesystem flag, instead of (for
example) allowing just the emulated environment to do case-insensitive
filesystem operations on an operation-by-operation basis, and possibly
only within a particular subdirectory structure (or bind mount).

So the first thing I want to know is who really needs it, *why* they
need it, and what the design is for. Because I can almost guarantee
that the design is horrible, and the reasons are really really bad.

And what *are* the case insensitivity rules, and how do you co-exist
when there are two *different* folding rules at the same time?

For example, OS X has some truly horrendously bad rules, that take the
badness that Windows did to a whole different level. What if you're a
file server (or emulation environment) and you want to expose the same
filesystem to both of those environments?

Because it would quite possibly be a whole lot better to allow
per-operation flags, so that you can do

    fd = openat(dir, path, O_RDONLY | O_ICASE);

so that you can allow *one* process to treat a filesystem as if it was
case insensitive (think "Wine with a ~/.wine/C directory"), without
forcing the whole filesystem to be icase.

Yes, allowing concurrent use then generates whole new "interesting"
questions, like "what happens if a case _sensitive_ user creates two
files with names that are identical to an in-sensitive user", but they
aren't necessarily any worse than the issues you face *not* allowing
that.

> Given your recent comments about not wanting to see pull requests for
> things outside of fs/xfs as part of the xfs pull, do you have any
> opinions about how to manage this feature going upstream? My
> original plan was to send them through the ext4 tree, since I very
> much doubt Al cares much about nls issues, and they will only impact
> ext4.

I really want to know what is driving this insanity, and what the
actual use-case is.

You have a diffstat, but not a git tree to look at what the heck is
going on.

Seriously, case insensitivity is *such* a horrendously bad idea that
people need to think about it deeply, and nobody seems to ever do
that.

And yes, we have d_hash() and some rudimentary support for it in the
VFS layer, but that VFS layer bit was always meant purely for
interoperability filesystems that nobody really cared about as a real
filesystem for Linux. Notably FAT and its ilk.

If we have a major native filesystem doing it, I think we need to
actively think about the big picture and do it *right*. None of the
crazy "ok, you can't even look things up in the dcache directly at
all" stuff that we have as a hack to just allow _bad_ filesystems to
do their thing.

So I think this is a bigger deal than that diffstat of yours implies.

I don't think people understand just how *bad* case insensitivity is.
The old DOS/Mac people thought case insensitivity was a "helpful"
idea, and that was understandable - but wrong - even back in the 80's.
They are still living with the end result of that horrendously bad
decision decades later. They've _tried_ to fix their bad decisions,
and have never been able to (except, apparently, in iOS where somebody
finally had a glimmer of a clue).

Linus
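
[Editorial illustration, not part of the thread] The hashing cost mentioned in the message above can be made concrete with a small userspace sketch. It is not kernel code and does not reflect the patch series: `toy_hash_exact`, `toy_hash_folded` and `toy_ascii_fold` are hypothetical names, 32-bit FNV-1a stands in for whatever hash a filesystem or the dcache really uses, and only ASCII letters are folded. The only point is that a case-insensitive lookup has to hash a folded form of the name, so every byte pays for a fold before it is mixed in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold ASCII upper case to lower case; every other byte is left alone.
 * (Assumption: a real implementation would need full Unicode rules.) */
static unsigned char toy_ascii_fold(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* 32-bit FNV-1a over the raw bytes: enough for a case-sensitive lookup. */
static uint32_t toy_hash_exact(const char *name, size_t len)
{
	uint32_t h = 2166136261u;

	assert(name != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		h ^= (unsigned char)name[i];
		h *= 16777619u;
	}
	return h;
}

/* Same hash, but over folded bytes, so names that differ only in
 * ASCII case land in the same bucket. */
static uint32_t toy_hash_folded(const char *name, size_t len)
{
	uint32_t h = 2166136261u;

	assert(name != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		h ^= toy_ascii_fold((unsigned char)name[i]);
		h *= 16777619u;
	}
	return h;
}
```

With these definitions, `toy_hash_folded("README", 6)` and `toy_hash_folded("readme", 6)` are equal by construction, whereas the exact hash treats the two spellings as unrelated names.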
On Sat, Dec 8, 2018 at 1:48 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Yes, allowing concurrent use then generates whole new "interesting"
> questions, like "what happens if a case _sensitive_ user creates two
> files with names that are identical to an in-sensitive user", but they
> aren't necessarily any worse than the issues you face *not* allowing
> that.

I'm hoping you are at least doing it per-directory. That makes at
least the "oh, the whole filesystem needs to do this wrong" issue a
bit less bad.

Just looking at the shortlog you posted, my guess is that the ext4
patches didn't even get *that* right, though. That shortlog "encoding
information in superblock" implies this is the same kind of just
horribly bad mess that we've seen before.

I really despise every single case-insensitive filesystem I have ever
seen, exactly because nobody apparently spends even a minimal amount
of effort on getting any of the basics remotely right. Every single
case I've seen has been a huge nasty hack, with seriously bad
system-wide consequences.

Linus
On Sat, Dec 8, 2018 at 1:58 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I'm hoping you are at least doing it per-directory. That makes at
> least the "oh, the whole filesystem needs to do this wrong" issue a
> bit less bad.

So for example, if you do it per-directory, the rules could be something like:

 - new directories (ie "mkdir()") inherit the icase/folding semantics
   of the parent directory

 - empty directories can have their case/folding rules changed with
   some well-defined interface

and even from just those simple rules, now some icase behavior could
be useful to testing.

Not just filesystem testing (although that would be a thing - think
fsstress), but for doing app development in a test directory.

Apps like git (and GNU fileutils) could use it for having test suites
for FAT etc filesystems.

And cross-platform apps could use it as a "I want to check that I do
the right thing" if you do development on Linux, but might have a
portable app for other platforms.

If the whole filesystem is that way, nobody is going to do it. Sure,
they could do it on a FAT filesystem using a USB disk, but nobody
really does that. But if you can trivially just run your tests in a
test subdirectory, it's another thing entirely.

So this is the kind of thing I mean when I think icase behavior for a
major Linux filesystem should have a real _design_. It's really quite
fundamentally different from the "oh, I need FAT to be icase" hack
that we have now.

(We might also be able to make the dcache better at handling
well-defined icase/folding rules, as opposed to the current "just give
up, let the filesystem hash it" behavior).

Linus
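
[Editorial illustration, not part of the thread] The two per-directory rules proposed above (inherit on mkdir, change only while empty) can be modelled in a few lines. This is a minimal userspace sketch under stated assumptions: `struct dir_model`, `model_mkdir` and `model_set_casefold` are hypothetical names invented here, and nothing in it corresponds to ext4 or VFS data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory model of a directory; not a kernel structure. */
struct dir_model {
	bool   casefold;    /* lookups in this directory ignore case */
	size_t nr_entries;  /* number of names currently in the directory */
};

/* Rule 1: a new directory inherits the folding semantics of its parent.
 * Creating it also adds one entry to the parent. */
static struct dir_model model_mkdir(struct dir_model *parent)
{
	struct dir_model child;

	assert(parent != NULL);
	child.casefold = parent->casefold;
	child.nr_entries = 0;
	parent->nr_entries++;
	return child;
}

/* Rule 2: folding may only be changed while the directory is empty.
 * Returns 0 on success, -1 if the change is refused. */
static int model_set_casefold(struct dir_model *dir, bool enable)
{
	assert(dir != NULL);
	if (dir->nr_entries != 0)
		return -1;
	dir->casefold = enable;
	return 0;
}
```

Restricting the change to empty directories sidesteps the question of what to do with existing names that would collide once folding is switched on.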
On Dec 8, 2018, at 3:59 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Sat, Dec 8, 2018 at 1:58 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I'm hoping you are at least doing it per-directory. That makes at
>> least the "oh, the whole filesystem needs to do this wrong" issue a
>> bit less bad.
>
> So for example, if you do it per-directory, the rules could be something like:
>
> - new directories (ie "mkdir()") inherit the icase/folding semantics
> of the parent directory
>
> - empty directories can have their case/folding rules changed with
> some well-defined interface
>
> and even from just those simple rules, now some icase behavior could
> be useful to testing.
>
> Not just filesystem testing (although that would be a thing - think
> fsstress), but for doing app development in a test directory.
>
> Apps like git (and GNU fileutils) could use it for having test suites
> for FAT etc filesystems.
>
> And cross-platform apps could use it as a "I want to check that I do
> the right thing" if you do development on Linux, but might have a
> portable app for other platforms.
>
> If the whole filesystem is that way, nobody is going to do it. Sure,
> they could do it on a FAT filesystem using a USB disk, but nobody
> really does that. But if you can trivially just run your tests in a
> test subdirectory, it's another thing entirely.
>
> So this is the kind of thing I mean when I think icase behavior for a
> major Linux filesystem should have a real _design_. It's really quite
> fundamentally different from the "oh, I need FAT to be icase" hack
> that we have now.
>
> (We might also be able to make the dcache better at handling
> well-defined icase/folding rules, as opposed to the current "just give
> up, let the filesystem hash it" behavior).

In theory, we could store the encoding on a per-entry basis if we
wanted, using the dir_data feature (this would consume 2-3 bytes per
entry, depending on how rich an encoding type we wanted).

The tricky part is how does the kernel know what the filename encoding
is? How do we communicate the encoding type back to userspace?

Cheers, Andreas
On Sat, Dec 8, 2018 at 9:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> Whether or not case-folding is being done is per-directory (it's a
> flag on the directory set by chattr). What encoding is supported
> (and we only will support two, ASCII and UTF-8) is per-file system. I
> personally believe it's insane to try to encode a large number of
> encodings, like big5, or iso-8859-1, etc. on a per-directory basis.
> Either don't do encodings at all, or use utf-8. Period. I believe
> you made a similar request for git metadata, no? :-)

Absolutely.

But if you only support ascii or utf-8, then why are you messing with
the nls part? That makes no sense.

You can't have it both ways.

Either you have a horrible fundamental design mistake that has
different per-filesystem locales, or you don't.

If you don't, you shouldn't be touching any of the nls code.

Whatever unicode tables you use for case folding shouldn't be in the
nls code.

Linus
On Sun, Dec 09, 2018 at 09:41:13AM -0800, Linus Torvalds wrote:
> But if you only support ascii or utf-8, then why are you messing with
> the nls part? That makes no sense.
>
> You can't have it both ways.
>
> Either you have a horrible fundamental design mistake that has
> different per-filesystem locales, or you don't.
>
> If you don't, you shouldn't be touching any of the nls code.
>
> Whatever unicode tables you use for case folding shouldn't be in the nls code.

Gabriel added the Unicode tables for case folding to the fs/nls
directory. If you'd prefer that we put them somewhere else, we can; do
you have a preference?

- Ted
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Sat, Dec 8, 2018 at 9:03 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
> Either you have a horrible fundamental design mistake that has
> different per-filesystem locales, or you don't.
>
> If you don't, you shouldn't be touching any of the nls code.
>
> Whatever unicode tables you use for case folding shouldn't be in the nls code.

Hi Linus,

As Ted mentioned the SMB case, in my understanding, we might have more
users for in-kernel utf8 normalization/casefold comparison functions
than just ext4 in the future. Steve French (in Cc), for instance,
mentioned his interest in using this higher level NLS API when I first
submitted these patches.

My first RFC actually included this code as a separate module inside
lib/ instead of touching NLS, but I found myself rewriting much of the
same APIs that already existed in NLS. That is why I merged my work
with that subsystem. I am open to rethinking it, if there is a better
alternative.

Thanks,
On Sun, Dec 9, 2018 at 12:10 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> Gabriel added the Unicode tables for case folding to the fs/nls
> directory. If you'd prefer that we put them somewhere else, we
> can; do you have a preference?

I have a really hard time judging, since I haven't seen the code, just
a random diffstat and shortlog.

First off, there is no such thing as "one" unicode table for case
folding. There are lots and lots of tables, and I'm not clear what
table it is all about.

For example, both OS X and Windows do some form of case folding on
unicode. They don't do the *same* folding, though.

There are also various locale variations to case folding. This is
where I thought your nls choice came from, but then you tried to imply
that there are no locale issues and that directories can just have a
single flag to enable/disable the folding.

In some locales, "SS" and "ß" (perhaps "SZ" too) will compare the same
in case-insensitivity. Crazy in general, and afaik modern unicode even
has a real upper-case "ß" so it's arguably legacy, but...

And that's all entirely independent of the issues with all the
combining characters, modifier letters, white-space, overlong utf8
questions, etc etc.

It's also easy to generate overlong utf-8 that decodes to '/', for
example. Some broken systems might consider that identical to a real
'/' and it matters for path lookup.

So what's the actual code? What rules did you happen to pick? Did you
take the Windows rules as-is (I _think_ they may be documented) since
the primary target apparently is just samba performance?

And even if the answer is "we follow NTFS rules", which *version* of
NTFS folding rules are you using if you're trying to speed up samba,
for example? Because afaik they have changed over time.

Is the *only* target samba? You are never interested in local loads
like "oh, people want to run Wine and might need it" or the
application testing parts?

All of these matter. For example, if it's some "ext4 special case just
for samba", then perhaps the logical place to put all this is just in
fs/ext4/ and not bother anybody else about it.

But if it might be useful as some generic "NTFS hashing" library, then
make it that.

Linus
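
[Editorial illustration, not part of the thread] The overlong UTF-8 point above can be shown with bytes: '/' is U+002F and is properly encoded as the single byte 0x2F, but a lax decoder will also accept the two-byte sequence 0xC0 0xAF (or the three-byte 0xE0 0x80 0xAF) and produce the same code point. The sketch below only flags overlong lead patterns; it is not a full UTF-8 validator, and `utf8_has_overlong` is a hypothetical name, not a function from the patch series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if the buffer contains a lead byte pattern that can only
 * start an overlong encoding:
 *   0xC0, 0xC1        two-byte forms of U+0000..U+007F
 *   0xE0 0x80..0x9F   three-byte forms of U+0000..U+07FF
 *   0xF0 0x80..0x8F   four-byte forms of U+0000..U+FFFF
 * A byte-wise scan is enough because none of these lead bytes can occur
 * as a continuation byte (continuation bytes are 0x80..0xBF).
 * This does NOT check truncation, surrogates or other malformations.
 */
static bool utf8_has_overlong(const unsigned char *s, size_t len)
{
	assert(s != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		unsigned char c = s[i];

		if (c == 0xC0 || c == 0xC1)
			return true;
		if (c == 0xE0 && i + 1 < len &&
		    s[i + 1] >= 0x80 && s[i + 1] < 0xA0)
			return true;
		if (c == 0xF0 && i + 1 < len &&
		    s[i + 1] >= 0x80 && s[i + 1] < 0x90)
			return true;
	}
	return false;
}
```

A name component containing `"\xC0\xAF"` would be flagged, while a plain `"/"` or the correctly encoded "Å" (`"\xC3\x85"`) would not.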
On Sun, Dec 9, 2018 at 12:53 PM Gabriel Krisman Bertazi <krisman@collabora.com> wrote:
>
> As Ted mentioned the SMB case, in my understanding, we might have more
> users for in-kernel utf8 normalization/casefold comparison functions than
> just ext4 in the future.

Crossed emails. See my note about how there really is not a single
case-folding library.

It's simply not physically possible, because there are so many
different ideas about what case-folding actually means. That's still
true even if "everything is utf-8", sadly.

So how do you handle locale issues and things like "we have ten
different tables for utf-8 comparisons, and that's _ignoring_ the
issue of whether we combine or decompose characters"?

And there's no way you can use the existing nls interfaces for
upper/lower case, for example, since they are all limited to 256-byte
tables and direct accesses to said tables, afaik. And if that is where
the extensions were, and that is why you changed other filesystems,
this all matters.

My *guess* is that what you really want is not really about unicode at
all, but specifically about just the NTFS rules. Which, yes, might
find generic sharing interest between cifs/ext4/etc, but my gut feel
is that they'd be specifically about some NTFS interoperability
library. Because even then I think you might have issues like
"NTFS-5.1" vs "NTFS-4.0" etc. Maybe you don't care, and you're picking
just *one* version.

And I haven't seen the code.

Basically, I would not be surprised if the sanest model is simply to
make a "ntfs" library. Because I'm really fairly sure that OS X rules
are very different indeed, even if it too is "unicode".

Linus
On Sun, Dec 09, 2018 at 12:54:38PM -0800, Linus Torvalds wrote:
> First off, there is no such thing as "one" unicode table for case
> folding. There are lots and lots of tables, and I'm not clear what
> table it is all about.
>
> For example, both OS X and Windows do some form of case folding on
> unicode. They don't do the *same* folding, though.

So things are much better in recent years. In the past it was kind of
a disaster, but the world is converging enough that the latest
versions of Mac OS X's APFS and Windows NTFS behave pretty much the
same way. They are both case-insensitive, case-preserving and
normalization-preserving, normalization-insensitive with respect to
filenames.

In the bad old days, MacOS X's HFS+ was not normalization-preserving.
So it would force filenames to NFD form --- so if the user tried to
create a file named Å, and passed in the Unicode string U+212B to
creat(2), HFS+ would store it as U+0041,U+030A and that is what
readdir(2) would return. Apple has effectively admitted this was a
mistake, and their new APFS doesn't do this any more.

Now, both file systems basically say, "we don't care whether you pass
in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we
will treat it as the same filename; but readdir(2) will return what
you gave us."

It's been a *long* time since Unicode has changed case folding rules
for pre-existing characters. The tables have only changed with respect
to new character sets that have been added. If you have a set of
filenames which were all legal under Unicode 5.0, how they case fold
didn't change with respect to Unicode 6.0, 7.0, 8.0, 9.0, 10.0 or
11.0. Unicode 11.0 added some character sets like Ancient Sanskrit, a
bunch of new emoji's, and the copyleft symbol, and to the extent that
Ancient Sanskrit had case, the tables might have been *extended*. But
that doesn't break backwards compatibility.

And, of course, MacOS and Windows have been aggressively tracking
Unicode updates because everybody wants the latest emoji's. :-)

And it's not just SAMBA/CIFS. The NFSv4 protocol also provides for
case/normalization preserving filenames, and you can specify a NFSv4
mount option whether or not file name lookups should be
case/normalization insensitive. And the NFSv4 protocol specs also
specify the use of the Unicode tables, of which the latest versions
can be downloaded here:

http://www.unicode.org/Public/11.0.0/ucd/

So how about this? We'll put the unicode handling functions in a new
directory, fs/unicode, just to make it really clear that this will not
be changing any of the legacy fs/nls functions which other file
systems will use. By putting it in a separate directory, it will be
easier for other file systems to use it, whether it's for better Samba
or NFSv4 support.

- Ted
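
[Editorial illustration, not part of the thread] The Å example above is easy to check at the byte level: U+212B, U+00C5 and U+0041,U+030A have three different UTF-8 encodings, so a plain memcmp() treats them as three different names. The sketch below uses a deliberately tiny stand-in for normalization that knows about this one character only; `toy_nfd` and `toy_names_equal` are hypothetical names, and a real implementation would be driven by the Unicode data files rather than two hard-coded patterns. It also uses temporary buffers purely for clarity, not as a recommended design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Three UTF-8 spellings of what renders as "Å". */
static const char ANGSTROM_SIGN[] = "\xE2\x84\xAB"; /* U+212B */
static const char A_RING_NFC[]    = "\xC3\x85";     /* U+00C5 */
static const char A_RING_NFD[]    = "A\xCC\x8A";    /* U+0041 U+030A */

/*
 * Toy normalizer: copies src to dst, rewriting only U+212B and U+00C5
 * to the decomposed form U+0041 U+030A. Returns the number of bytes
 * written, or 0 if dst is too small (or src is empty).
 */
static size_t toy_nfd(const char *src, size_t len, char *dst, size_t cap)
{
	size_t out = 0;

	assert(src != NULL && dst != NULL);
	for (size_t i = 0; i < len; ) {
		size_t skip = 0;

		if (len - i >= 3 && memcmp(src + i, "\xE2\x84\xAB", 3) == 0)
			skip = 3;
		else if (len - i >= 2 && memcmp(src + i, "\xC3\x85", 2) == 0)
			skip = 2;

		if (skip) {
			if (cap - out < 3)
				return 0;
			memcpy(dst + out, "A\xCC\x8A", 3);
			out += 3;
			i += skip;
		} else {
			if (cap - out < 1)
				return 0;
			dst[out++] = src[i++];
		}
	}
	return out;
}

/* Normalization-insensitive comparison; the callers' bytes are never
 * modified, which is what "normalization-preserving" asks for. */
static bool toy_names_equal(const char *a, size_t alen,
			    const char *b, size_t blen)
{
	char na[64], nb[64];
	size_t la = toy_nfd(a, alen, na, sizeof(na));
	size_t lb = toy_nfd(b, blen, nb, sizeof(nb));

	return la != 0 && la == lb && memcmp(na, nb, la) == 0;
}
```

The stored name stays whatever the user passed to creat(2); only the comparison sees the normalized form.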
On Sun, Dec 9, 2018 at 4:08 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> So things are much better in recent years. In the past it was kind of
> a disaster, but the world is converging enough that the latest
> versions of Mac OS X's APFS and Windows NTFS behave pretty much the same
> way. They are both case-insensitive, case-preserving and
> normalization-preserving, normalization-insensitive with respect to
> filenames.

Oh, so APFS at least fixed *that* horrific problem with their filesystem.

Oh how I despised the exposure of NFD (which should at most be used as
an internal representation, not externally visible). Turning basic
letters (coming from Finland, åäö) into character combinations was an
absolute abomination.

> In the bad old days, MacOS X's HFS+ was not normalization-preserving.

Oh, I'm very aware. It's not even that it wasn't
normalization-preserving, it picked the *wrong* normalization to use.

> Now, both file systems basically say, "we don't care whether you pass
> in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we
> will treat it as the same filename; but readdir(2) will return what
> you gave us."

Actually, the "on the screen it will look identical" is a horribly
incorrect thing to do too.

There are lots of things that look identical on the screen without
being at all the same thing. Sometimes it depends on font, sometimes
it's just how it is. A nonbreaking space is *not* the same as a
regular space, even if they may look identical on the screen.

I suspect (and sincerely _hope_) neither filesystem actually does
anything as stupid as taking "glyph equivalence" into account. I'm
hoping it's just "convert to NFx, then lower-case, then compare for
equality". Where the 'x' doesn't much matter as long as it is never
_exposed_ in any way outside of the comparison (ie NFD is a fine and
probably simpler model for the lower-casing, the HFS+ mistake was to
then expose the corrupted form of the filename).

> It's been a *long* time since Unicode has changed case folding rules
> for pre-existing characters. The tables have only changed with
> respect to new character sets that have been added.

But new characters _have_ been added, and some of them do have
lower-case form, so the folding tables have changed.

Happily, maybe that is over. As long as the Unicode people continue to
mainly play with their Emoji list, I guess we can consider it done.

> So how about this? We'll put the unicode handling functions in a new
> directory, fs/unicode, just to make it really clear that this will not
> be changing any of the legacy fs/nls functions which other file
> systems will use. By putting it in a separate directory, it will be
> easier for other file systems to use it, whether it's for better Samba
> or NFSv4 support.

Ok, that sounds fine. Some of the unicode translation functions from
the NLS code could well move into that, and NLS itself could be
relegated to the sad historical thing.

And please try to make the *interfaces* sane. For example, the
interface for "let's compare with folded case" should *not* be about
"convert to NFKD and lower case into a temp buffer, then compare the
results". You can do a lot of "let's handle the simple cases" faster
even if the "oh, I hit a complex character" case might then become one
of those "convert to a temp buffer" cases.

And it shouldn't be about C strings, since we very much have cases
where it's not a C string but a {ptr,len} tuple. Maybe even use the
"struct qstr", which is a not-horrible way to pass those around. Even
if you have a C string, you can always just do

    struct qstr str = QSTR_INIT(name, strlen(name));

and then pass that qstr pointer around.

Finally, don't do the NLS thing with "descriptors" that you register
and look up. The indirection kills you. Particularly the crazy "one
character at a time" model. Just let people explicitly say
"utf8_icasecmp(qstr, qstr)" or something like that. With the interface
at least allowing for the common simple cases (ie everything is in the
ASCII subset) to be handled basically as a specialized thing.

Linus
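
[Editorial illustration, not part of the thread] The interface shape asked for above (a {ptr,len} pair, no temporary buffer, a specialized ASCII path) might look roughly like the following userspace sketch. Everything here is an assumption made for illustration: `struct name_ref` is a hypothetical stand-in for the kernel's `struct qstr`, `icase_cmp_fast` is not the `utf8_icasecmp` mentioned in the message, and the slow path for non-ASCII input is left to the caller. The early "different" result also assumes that the full folding rules never equate two names whose first divergence is between two distinct folded ASCII bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct qstr: a {ptr,len} name. */
struct name_ref {
	const unsigned char *name;
	size_t len;
};

enum icase_result {
	ICASE_EQUAL = 0,
	ICASE_DIFFERENT = 1,
	ICASE_NEEDS_SLOW_PATH = 2  /* non-ASCII seen: use the full rules */
};

static unsigned char ascii_fold(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/*
 * Compare two names ignoring ASCII case, in place and in one pass.
 * As soon as either side shows a byte >= 0x80 the function gives up
 * and asks the caller to run a full Unicode-aware comparison.
 */
static enum icase_result icase_cmp_fast(const struct name_ref *a,
					const struct name_ref *b)
{
	assert(a != NULL && b != NULL);

	size_t common = a->len < b->len ? a->len : b->len;
	const struct name_ref *longer = a->len > b->len ? a : b;

	for (size_t i = 0; i < common; i++) {
		unsigned char ca = a->name[i], cb = b->name[i];

		if ((ca | cb) & 0x80)
			return ICASE_NEEDS_SLOW_PATH;
		if (ascii_fold(ca) != ascii_fold(cb))
			return ICASE_DIFFERENT;
	}

	/* One name is a prefix of the other: only decide here if the
	 * remaining tail is plain ASCII. */
	for (size_t i = common; i < longer->len; i++)
		if (longer->name[i] & 0x80)
			return ICASE_NEEDS_SLOW_PATH;

	return a->len == b->len ? ICASE_EQUAL : ICASE_DIFFERENT;
}
```

A caller would handle `ICASE_NEEDS_SLOW_PATH` by falling back to whatever normalization-plus-casefold comparison the filesystem defines, so the expensive path is only paid for names that actually contain non-ASCII bytes.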