Message ID | 20240117222836.11086-1-krisman@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | libfs: Attempt exact-match comparison first during casefold lookup | expand |
On Wed, Jan 17, 2024 at 07:28:36PM -0300, Gabriel Krisman Bertazi wrote: > Note that, for strict mode, generic_ci_d_compare used to reject an > invalid UTF-8 string, which would now be considered valid if it > exact-matches the disk-name. But, if that is the case, the filesystem > is corrupt. More than that, it really doesn't matter in practice, > because the name-under-lookup will have already been rejected by > generic_ci_d_hash and we won't even get here. > - if (ret >= 0) > - return ret; > - > - if (sb_has_strict_encoding(sb)) > + qstr.len = len; > + qstr.name = str; > + ret = utf8_strncasecmp(dentry->d_sb->s_encoding, name, &qstr); > + if (ret < 0) > return -EINVAL; Umm... So why bother with -EINVAL? The rest looks sane, but considering the fact that your string *has* passed ->d_hash(), do we need anything beyond return utf8_strncasecmp(dentry->d_sb->s_encoding, name, &qstr); here?
Al Viro <viro@zeniv.linux.org.uk> writes: > On Wed, Jan 17, 2024 at 07:28:36PM -0300, Gabriel Krisman Bertazi wrote: > >> Note that, for strict mode, generic_ci_d_compare used to reject an >> invalid UTF-8 string, which would now be considered valid if it >> exact-matches the disk-name. But, if that is the case, the filesystem >> is corrupt. More than that, it really doesn't matter in practice, >> because the name-under-lookup will have already been rejected by >> generic_ci_d_hash and we won't even get here. > >> - if (ret >= 0) >> - return ret; >> - >> - if (sb_has_strict_encoding(sb)) >> + qstr.len = len; >> + qstr.name = str; >> + ret = utf8_strncasecmp(dentry->d_sb->s_encoding, name, &qstr); >> + if (ret < 0) >> return -EINVAL; > > Umm... So why bother with -EINVAL? The rest looks sane, but > considering the fact that your string *has* passed ->d_hash(), > do we need anything beyond > return utf8_strncasecmp(dentry->d_sb->s_encoding, name, &qstr); > here? No, you are right. In fact, it seems d_compare can't propagate errors anyway as it is only compared against 0, anyway. I spin the v2 dropping this bit as suggested. thanks,
On Wed, 17 Jan 2024 at 16:02, Gabriel Krisman Bertazi <krisman@suse.de> wrote: > > No, you are right. In fact, it seems d_compare can't propagate errors > anyway as it is only compared against 0, anyway. Note that the whole "malformed utf-8 is an error" is actually wrong anyway. Yes, if you *output* utf-8, and your output is malformed, then that's an error that needs fixing. But honestly, "malformed utf-8" on input is almost always just "oh, it wasn't utf-8 to begin with, and somebody is still using Latin-1 or Shift-JIS or whatever". And then treating that as some kind of hard error is actually really really wrong and annoying, and may end up meaning that the user cannot *fix* it, because they can't access the data at all. And yes, I realize that if somebody is an unicode person, they go "oh, but it *IS* an error, and you need to detect it". At that point you should just block that paper-pushing pinhead from your life. Bad locales happen. It's just a fact of life. The patch that makes an exact lookup work even when some locale rule is being violated should be seen as an absolute win, not as a failure to detect an error. This is another case of the old adage: be conservative in what you do, be liberal in what you accept from others. I find libraries that just error out on "malformed utf-8" to be actively harmful. Linus
On Wed, Jan 17, 2024 at 04:40:17PM -0800, Linus Torvalds wrote: > Note that the whole "malformed utf-8 is an error" is actually wrong anyway. > > Yes, if you *output* utf-8, and your output is malformed, then that's > an error that needs fixing. > > But honestly, "malformed utf-8" on input is almost always just "oh, it > wasn't utf-8 to begin with, and somebody is still using Latin-1 or > Shift-JIS or whatever". > > And then treating that as some kind of hard error is actually really > really wrong and annoying, and may end up meaning that the user cannot > *fix* it, because they can't access the data at all. A file system which supports casefolding can support "strict" mode (not the default) where attempts to create files that have invalid UTF-8 characters are rejected before a file or hard link is created (or renamed) with an error. This is what MacOS does, by the way. If you try to rsync a file from a Linux box where the file was created by unpacking a Windows Zip file created by downloading a directory hierarchy from a Microsoft Sharepoint, and then you try to scp or rsync it over to MacOS, MacOS will will refuse to allow the file to be created if it contains invalid UTF-8 characters, and rsync or scp will report an error. I just ran into this earlier today... So we don't need to worry about the user not being able to fix it, because they won't have been able to create the file in the first place. This is not the default, since we know there are a bunch of users who might be creating files using the unofficial "Klingon" characters (for example) that are not officially part of Unicode since Unicode will only allow characters used by human languages, and Klingon doesn't qualify. I believe though that Android has elected to enable casefolding in strict mode, which is fine as far as I'm concerned. > I find libraries that just error out on "malformed utf-8" to be > actively harmful. I admit that when I discovered that MacOS errored out on illegal utf-8 characters it was mildly annoying, but it wasn't that hard to fix it on the Linux side and then I retried the rsync. It also turned out that if I unpacked the zip file on MacOS, the filename was created without the illegal utf-8 characters, so there may have been something funky going on with the zip userspace program on Linux. I haven't cared enough to try to debug it... - Ted
On Wed, 17 Jan 2024 at 18:06, Theodore Ts'o <tytso@mit.edu> wrote: > > A file system which supports casefolding can support "strict" mode .. and we can support "shit" mode that craps all over your filesystem too. The fact that you can support something doesn't make it good. > This is what MacOS does, by the way. The MacOS filesystem is the CVS of filesystems: if you see that it does something, the right thing is almost certainly to do the exact opposite. Case-folding is a horrible thing to do in the first place, but MacOS compounds on its bad ways by then using NFD normalization, AND EXPOSING THAT TO THE USER. IOW, MacOS actively changes and corrupts the data the user gave it for filenames. There are tons of Apple apologists out there that will make any number of excuses for it, but it's actively shit. It's broken garbage. So the fact that MacOS then has "strict" mode, really is an argument *against* it. The MacOS filesystem designers were fed the wrong kind of drugs, and I suspect they may not have been the brightest bunch to begin with. > So we don't need to worry about the user not being able to fix it, > because they won't have been able to create the file in the first > place. Yeah, that's a fine argument, until you have a bug or subtle bit flip data corruption, and now instead of having something you can recover, the system actively says "Nope". > I admit that when I discovered that MacOS errored out on illegal utf-8 > characters it was mildly annoying, We may have to be able to interoperate with shit, but let's call it what it is. Nobody pretends FAT is a great filesystem that made great design decisions. That doesn't mean that we can't interoperate with it just fine. But we don't need to take those idiotic and bad design decisions to heart, and we don't need to hide the fact that they are horrendous design mistakes. So "strict" mode should mean that you can't *create* a misformed UTF-8 filename. It's that same "be conservative in what you do". But *dammit*, if "strict" mode means that you can't even read other peoples mistakes because your "->lookup()" function refuses to even look at it, then "strict" mode is GARBAGE. That's the "be liberal in what you accept" part. Do it, or be damned. Really. No excuses for shit modes. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Wed, 17 Jan 2024 at 18:06, Theodore Ts'o <tytso@mit.edu> wrote: >> So we don't need to worry about the user not being able to fix it, >> because they won't have been able to create the file in the first >> place. > > Yeah, that's a fine argument, until you have a bug or subtle bit flip > data corruption, and now instead of having something you can recover, > the system actively says "Nope". I know this is not your point, but I should add that, in case of a bug or bit flip, we support "fixing" the "bad utf8" string through fsck. >> I admit that when I discovered that MacOS errored out on illegal utf-8 >> characters it was mildly annoying, > > We may have to be able to interoperate with shit, but let's call it what it is. > > Nobody pretends FAT is a great filesystem that made great design > decisions. That doesn't mean that we can't interoperate with it just > fine. > > But we don't need to take those idiotic and bad design decisions to > heart, and we don't need to hide the fact that they are horrendous > design mistakes. There is a correctness issue with accepting the creation of invalid utf-8 names that justifies the existence of strict mode. Currently undefined code-points can become a casefold match to some other file in a later unicode version. When you decide to update your unicode version or even copy the file to a volume with a different version, the lookup might yield a different file, making one of them inaccessible or overwriting the wrong file. Obviously, not all corruptions would yield a "valid" undefined code-point. But those are possible. We currently don't care much, since mkfs will create the volume with a fixed, never-changed unicode version. That is, unless the user goes out of their way to shoot themselves in the foot. Strict mode is an easy way to prevent this class of issues (aside from corruptions). > So "strict" mode should mean that you can't *create* a misformed UTF-8 > filename. > > It's that same "be conservative in what you do". > > But *dammit*, if "strict" mode means that you can't even read other > peoples mistakes because your "->lookup()" function refuses to even > look at it, then "strict" mode is GARBAGE. > > That's the "be liberal in what you accept" part. Do it, or be damned. Yes, we could be more liberal in the lookup while restricting the creation of invalid utf8 sequences. But, the only case where it would matter is for corrupted volumes, where a file-name suddenly changed to something invalid. Considering ext4 and f2fs, since the disk direntry hash (which is hash(casefolded(filename))) didn't get corrupted exactly right, looking up the exact-match of the invalid name might fail. This would create an even more inconsistent semantics, where small, non-hashed directories can find these files, but larger, hashed directories might not. And that is even more confusing to users, since it exposes internal filesystem details. I get the point about how annoying the current semantics is. But I still think this is the sanest approach to a fundamentally insane feature.
diff --git a/fs/libfs.c b/fs/libfs.c index eec6031b0155..d17a8adb4a35 100644 --- a/fs/libfs.c +++ b/fs/libfs.c @@ -1704,17 +1704,28 @@ bool is_empty_dir_inode(struct inode *inode) static int generic_ci_d_compare(const struct dentry *dentry, unsigned int len, const char *str, const struct qstr *name) { - const struct dentry *parent = READ_ONCE(dentry->d_parent); - const struct inode *dir = READ_ONCE(parent->d_inode); - const struct super_block *sb = dentry->d_sb; - const struct unicode_map *um = sb->s_encoding; - struct qstr qstr = QSTR_INIT(str, len); + const struct dentry *parent; + const struct inode *dir; char strbuf[DNAME_INLINE_LEN]; + struct qstr qstr; int ret; + /* + * Attempt a case-sensitive match first. It is cheaper and + * should cover most lookups, including all the sane + * applications that expect a case-sensitive filesystem. + */ + if (len == name->len && !memcmp(str, name->name, len)) + return 0; + + parent = READ_ONCE(dentry->d_parent); + dir = READ_ONCE(parent->d_inode); if (!dir || !IS_CASEFOLDED(dir)) - goto fallback; + return 1; + /* + * Finally, try the case-insensitive match. + * * If the dentry name is stored in-line, then it may be concurrently * modified by a rename. If this happens, the VFS will eventually retry * the lookup, so it doesn't matter what ->d_compare() returns. @@ -1724,20 +1735,17 @@ static int generic_ci_d_compare(const struct dentry *dentry, unsigned int len, if (len <= DNAME_INLINE_LEN - 1) { memcpy(strbuf, str, len); strbuf[len] = 0; - qstr.name = strbuf; + str = strbuf; /* prevent compiler from optimizing out the temporary buffer */ barrier(); } - ret = utf8_strncasecmp(um, name, &qstr); - if (ret >= 0) - return ret; - - if (sb_has_strict_encoding(sb)) + qstr.len = len; + qstr.name = str; + ret = utf8_strncasecmp(dentry->d_sb->s_encoding, name, &qstr); + if (ret < 0) return -EINVAL; -fallback: - if (len != name->len) - return 1; - return !!memcmp(str, name->name, len); + + return ret; } /**
Casefolded comparisons are (obviously) way more costly than a simple memcmp. Try the case-sensitive comparison first, falling-back to the case-insensitive lookup only when needed. This allows any exact-match lookup to complete without having to walk the utf8 trie. Note that, for strict mode, generic_ci_d_compare used to reject an invalid UTF-8 string, which would now be considered valid if it exact-matches the disk-name. But, if that is the case, the filesystem is corrupt. More than that, it really doesn't matter in practice, because the name-under-lookup will have already been rejected by generic_ci_d_hash and we won't even get here. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> --- fs/libfs.c | 40 ++++++++++++++++++++++++---------------- 1 file changed, 24 insertions(+), 16 deletions(-)