From patchwork Thu Dec 6 23:08:41 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717179 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EF0FE15A6 for ; Thu, 6 Dec 2018 23:09:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id DAC9C2F205 for ; Thu, 6 Dec 2018 23:09:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id CBB502F21D; Thu, 6 Dec 2018 23:09:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E33C92F205 for ; Thu, 6 Dec 2018 23:09:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726250AbeLFXJN (ORCPT ); Thu, 6 Dec 2018 18:09:13 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56064 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726119AbeLFXJN (ORCPT ); Thu, 6 Dec 2018 18:09:13 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 5F0E327E83A From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 01/23] nls: Wrap uni2char/char2uni callers Date: Thu, 6 Dec 2018 18:08:41 -0500 Message-Id: <20181206230903.30011-2-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi Just a cosmetic change at this point, this patch will simplify the following Coccinelle patches which will move the hooks into a dedicated structure, with the goal of splitting the nls_table structure to support versioning. This was generated with the following coccinele script: @@ expression A, B, C, D; @@ ( - A->uni2char(B, C, D) + nls_uni2char(A, B, C, D) | - A->char2uni(B, C, D) + nls_char2uni(A, B, C, D) ) Signed-off-by: Gabriel Krisman Bertazi --- fs/befs/linuxvfs.c | 4 ++-- fs/cifs/cifs_unicode.c | 9 +++++---- fs/cifs/dir.c | 7 ++++--- fs/fat/dir.c | 8 ++++---- fs/fat/namei_vfat.c | 6 +++--- fs/hfs/trans.c | 9 +++++---- fs/hfsplus/unicode.c | 6 +++--- fs/isofs/joliet.c | 3 ++- fs/jfs/jfs_unicode.c | 7 +++---- fs/nls/nls_euc-jp.c | 4 ++-- fs/nls/nls_koi8-ru.c | 6 +++--- fs/ntfs/unistr.c | 8 ++++---- include/linux/nls.h | 13 +++++++++++++ 13 files changed, 53 insertions(+), 37 deletions(-) diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c index 4700b4534439..0ba368fbfad4 100644 --- a/fs/befs/linuxvfs.c +++ b/fs/befs/linuxvfs.c @@ -542,7 +542,7 @@ befs_utf2nls(struct super_block *sb, const char *in, /* convert from Unicode to nls */ if (uni > MAX_WCHAR_T) goto conv_err; - unilen = nls->uni2char(uni, &result[o], in_len - o); + unilen = nls_uni2char(nls, uni, &result[o], in_len - o); if (unilen < 0) goto conv_err; } @@ -616,7 +616,7 @@ befs_nls2utf(struct super_block *sb, const char *in, for (i = o = 0; i < in_len; i += unilen, o += utflen) { /* convert from nls to unicode */ - unilen = nls->char2uni(&in[i], in_len - i, &uni); + unilen = nls_char2uni(nls, &in[i], in_len - i, &uni); if (unilen < 0) goto conv_err; diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c index a2b2355e7f01..ffad8b4f90d1 100644 --- a/fs/cifs/cifs_unicode.c +++ b/fs/cifs/cifs_unicode.c @@ -145,7 +145,7 @@ cifs_mapchar(char *target, const __u16 *from, const struct nls_table *cp, return len; /* if character not one of seven in special remap set */ - len = cp->uni2char(src_char, target, NLS_MAX_CHARSET_SIZE); + len = nls_uni2char(cp, src_char, target, NLS_MAX_CHARSET_SIZE); if (len <= 0) goto surrogate_pair; @@ -289,7 +289,7 @@ cifs_strtoUTF16(__le16 *to, const char *from, int len, } for (i = 0; len && *from; i++, from += charlen, len -= charlen) { - charlen = codepage->char2uni(from, len, &wchar_to); + charlen = nls_char2uni(codepage, from, len, &wchar_to); if (charlen < 1) { cifs_dbg(VFS, "strtoUTF16: char2uni of 0x%x returned %d\n", *from, charlen); @@ -515,7 +515,8 @@ cifsConvertToUTF16(__le16 *target, const char *source, int srclen, * as they use backslash as separator. */ if (dst_char == 0) { - charlen = cp->char2uni(source + i, srclen - i, &tmp); + charlen = nls_char2uni(cp, source + i, srclen - i, + &tmp); dst_char = cpu_to_le16(tmp); /* @@ -605,7 +606,7 @@ cifs_local_to_utf16_bytes(const char *from, int len, wchar_t wchar_to; for (i = 0; len && *from; i++, from += charlen, len -= charlen) { - charlen = codepage->char2uni(from, len, &wchar_to); + charlen = nls_char2uni(codepage, from, len, &wchar_to); /* Failed conversion defaults to a question mark */ if (charlen < 1) charlen = 1; diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c index 3713d22b95a7..f8bb9285f630 100644 --- a/fs/cifs/dir.c +++ b/fs/cifs/dir.c @@ -910,7 +910,7 @@ static int cifs_ci_hash(const struct dentry *dentry, struct qstr *q) hash = init_name_hash(dentry); for (i = 0; i < q->len; i += charlen) { - charlen = codepage->char2uni(&q->name[i], q->len - i, &c); + charlen = nls_char2uni(codepage, &q->name[i], q->len - i, &c); /* error out if we can't convert the character */ if (unlikely(charlen < 0)) return charlen; @@ -939,8 +939,9 @@ static int cifs_ci_compare(const struct dentry *dentry, for (i = 0; i < len; i += l1) { /* Convert characters in both strings to UTF-16. */ - l1 = codepage->char2uni(&str[i], len - i, &c1); - l2 = codepage->char2uni(&name->name[i], name->len - i, &c2); + l1 = nls_char2uni(codepage, &str[i], len - i, &c1); + l2 = nls_char2uni(codepage, &name->name[i], name->len - i, + &c2); /* * If we can't convert either character, just declare it to diff --git a/fs/fat/dir.c b/fs/fat/dir.c index c8366cb8eccd..d5f856651a08 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -153,7 +153,7 @@ static int uni16_to_x8(struct super_block *sb, unsigned char *ascii, while (*ip && ((len - NLS_MAX_CHARSET_SIZE) > 0)) { ec = *ip++; - charlen = nls->uni2char(ec, op, NLS_MAX_CHARSET_SIZE); + charlen = nls_uni2char(nls, ec, op, NLS_MAX_CHARSET_SIZE); if (charlen > 0) { op += charlen; len -= charlen; @@ -195,7 +195,7 @@ fat_short2uni(struct nls_table *t, unsigned char *c, int clen, wchar_t *uni) { int charlen; - charlen = t->char2uni(c, clen, uni); + charlen = nls_char2uni(t, c, clen, uni); if (charlen < 0) { *uni = 0x003f; /* a question mark */ charlen = 1; @@ -210,7 +210,7 @@ fat_short2lower_uni(struct nls_table *t, unsigned char *c, int charlen; wchar_t wc; - charlen = t->char2uni(c, clen, &wc); + charlen = nls_char2uni(t, c, clen, &wc); if (charlen < 0) { *uni = 0x003f; /* a question mark */ charlen = 1; @@ -220,7 +220,7 @@ fat_short2lower_uni(struct nls_table *t, unsigned char *c, if (!nc) nc = *c; - charlen = t->char2uni(&nc, 1, uni); + charlen = nls_char2uni(t, &nc, 1, uni); if (charlen < 0) { *uni = 0x003f; /* a question mark */ charlen = 1; diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 996c8c25e9c6..ab6b450521ff 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -289,7 +289,7 @@ static inline int to_shortname_char(struct nls_table *nls, return 1; } - len = nls->uni2char(*src, buf, buf_size); + len = nls_uni2char(nls, *src, buf, buf_size); if (len <= 0) { info->valid = 0; buf[0] = '_'; @@ -544,8 +544,8 @@ xlate_to_uni(const unsigned char *name, int len, unsigned char *outname, ip += 5; i += 5; } else { - charlen = nls->char2uni(ip, len - i, - (wchar_t *)op); + charlen = nls_char2uni(nls, ip, len - i, + (wchar_t *)op); if (charlen < 0) return -EINVAL; ip += charlen; diff --git a/fs/hfs/trans.c b/fs/hfs/trans.c index 39f5e343bf4d..2fae312edb31 100644 --- a/fs/hfs/trans.c +++ b/fs/hfs/trans.c @@ -49,7 +49,8 @@ int hfs_mac2asc(struct super_block *sb, char *out, const struct hfs_name *in) while (srclen > 0) { if (nls_disk) { - size = nls_disk->char2uni(src, srclen, &ch); + size = nls_char2uni(nls_disk, src, srclen, + &ch); if (size <= 0) { ch = '?'; size = 1; @@ -62,7 +63,7 @@ int hfs_mac2asc(struct super_block *sb, char *out, const struct hfs_name *in) } if (ch == '/') ch = ':'; - size = nls_io->uni2char(ch, dst, dstlen); + size = nls_uni2char(nls_io, ch, dst, dstlen); if (size < 0) { if (size == -ENAMETOOLONG) goto out; @@ -110,7 +111,7 @@ void hfs_asc2mac(struct super_block *sb, struct hfs_name *out, const struct qstr wchar_t ch; while (srclen > 0) { - size = nls_io->char2uni(src, srclen, &ch); + size = nls_char2uni(nls_io, src, srclen, &ch); if (size < 0) { ch = '?'; size = 1; @@ -120,7 +121,7 @@ void hfs_asc2mac(struct super_block *sb, struct hfs_name *out, const struct qstr if (ch == ':') ch = '/'; if (nls_disk) { - size = nls_disk->uni2char(ch, dst, dstlen); + size = nls_uni2char(nls_disk, ch, dst, dstlen); if (size < 0) { if (size == -ENAMETOOLONG) goto out; diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c index c8d1b2be7854..057dc7e57cb1 100644 --- a/fs/hfsplus/unicode.c +++ b/fs/hfsplus/unicode.c @@ -190,7 +190,7 @@ int hfsplus_uni2asc(struct super_block *sb, c0 = ':'; break; } - res = nls->uni2char(c0, op, len); + res = nls_uni2char(nls, c0, op, len); if (res < 0) { if (res == -ENAMETOOLONG) goto out; @@ -233,7 +233,7 @@ int hfsplus_uni2asc(struct super_block *sb, cc = c0; } done: - res = nls->uni2char(cc, op, len); + res = nls_uni2char(nls, cc, op, len); if (res < 0) { if (res == -ENAMETOOLONG) goto out; @@ -256,7 +256,7 @@ int hfsplus_uni2asc(struct super_block *sb, static inline int asc2unichar(struct super_block *sb, const char *astr, int len, wchar_t *uc) { - int size = HFSPLUS_SB(sb)->nls->char2uni(astr, len, uc); + int size = nls_char2uni(HFSPLUS_SB(sb)->nls, astr, len, uc); if (size <= 0) { *uc = '?'; size = 1; diff --git a/fs/isofs/joliet.c b/fs/isofs/joliet.c index be8b6a9d0b92..56fac73b27a5 100644 --- a/fs/isofs/joliet.c +++ b/fs/isofs/joliet.c @@ -25,7 +25,8 @@ uni16_to_x8(unsigned char *ascii, __be16 *uni, int len, struct nls_table *nls) while ((ch = get_unaligned(ip)) && len) { int llen; - llen = nls->uni2char(be16_to_cpu(ch), op, NLS_MAX_CHARSET_SIZE); + llen = nls_uni2char(nls, be16_to_cpu(ch), op, + NLS_MAX_CHARSET_SIZE); if (llen > 0) op += llen; else diff --git a/fs/jfs/jfs_unicode.c b/fs/jfs/jfs_unicode.c index 0148e2e4d97a..4ca88ef661e9 100644 --- a/fs/jfs/jfs_unicode.c +++ b/fs/jfs/jfs_unicode.c @@ -41,9 +41,8 @@ int jfs_strfromUCS_le(char *to, const __le16 * from, for (i = 0; (i < len) && from[i]; i++) { int charlen; charlen = - codepage->uni2char(le16_to_cpu(from[i]), - &to[outlen], - NLS_MAX_CHARSET_SIZE); + nls_uni2char(codepage, le16_to_cpu(from[i]), + &to[outlen], NLS_MAX_CHARSET_SIZE); if (charlen > 0) outlen += charlen; else @@ -88,7 +87,7 @@ static int jfs_strtoUCS(wchar_t * to, const unsigned char *from, int len, if (codepage) { for (i = 0; len && *from; i++, from += charlen, len -= charlen) { - charlen = codepage->char2uni(from, len, &to[i]); + charlen = nls_char2uni(codepage, from, len, &to[i]); if (charlen < 1) { jfs_err("jfs_strtoUCS: char2uni returned %d.", charlen); diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c index 162b3f160353..eec257545f04 100644 --- a/fs/nls/nls_euc-jp.c +++ b/fs/nls/nls_euc-jp.c @@ -413,7 +413,7 @@ static int uni2char(const wchar_t uni, if (!p_nls) return -EINVAL; - if ((n = p_nls->uni2char(uni, out, boundlen)) < 0) + if ((n = nls_uni2char(p_nls, uni, out, boundlen)) < 0) return n; /* translate SJIS into EUC-JP */ @@ -543,7 +543,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen, sjis_temp[1] = 0x00; } - if ( (n = p_nls->char2uni(sjis_temp, sizeof(sjis_temp), uni)) < 0) + if ( (n = nls_char2uni(p_nls, sjis_temp, sizeof(sjis_temp), uni)) < 0) return n; return euc_offset; diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c index a80a741a8676..32781252110d 100644 --- a/fs/nls/nls_koi8-ru.c +++ b/fs/nls/nls_koi8-ru.c @@ -28,12 +28,12 @@ static int uni2char(const wchar_t uni, else if (uni == 0x255d || uni == 0x256c) return 0; else - return p_nls->uni2char(uni, out, boundlen); + return nls_uni2char(p_nls, uni, out, boundlen); return 1; } else /* fast path */ - return p_nls->uni2char(uni, out, boundlen); + return nls_uni2char(p_nls, uni, out, boundlen); } static int char2uni(const unsigned char *rawstring, int boundlen, @@ -47,7 +47,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return 1; } - n = p_nls->char2uni(rawstring, boundlen, uni); + n = nls_char2uni(p_nls, rawstring, boundlen, uni); return n; } diff --git a/fs/ntfs/unistr.c b/fs/ntfs/unistr.c index 005ca4b0f132..e0a5f33441df 100644 --- a/fs/ntfs/unistr.c +++ b/fs/ntfs/unistr.c @@ -269,8 +269,8 @@ int ntfs_nlstoucs(const ntfs_volume *vol, const char *ins, ucs = kmem_cache_alloc(ntfs_name_cache, GFP_NOFS); if (likely(ucs)) { for (i = o = 0; i < ins_len; i += wc_len) { - wc_len = nls->char2uni(ins + i, ins_len - i, - &wc); + wc_len = nls_char2uni(nls, ins + i, + ins_len - i, &wc); if (likely(wc_len >= 0 && o < NTFS_MAX_NAME_LEN)) { if (likely(wc)) { @@ -355,8 +355,8 @@ int ntfs_ucstonls(const ntfs_volume *vol, const ntfschar *ins, goto mem_err_out; } for (i = o = 0; i < ins_len; i++) { -retry: wc = nls->uni2char(le16_to_cpu(ins[i]), ns + o, - ns_len - o); +retry: wc = nls_uni2char(nls, le16_to_cpu(ins[i]), + ns + o, ns_len - o); if (wc > 0) { o += wc; continue; diff --git a/include/linux/nls.h b/include/linux/nls.h index 499e486b3722..5073ecd57279 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -59,6 +59,19 @@ extern int utf8s_to_utf16s(const u8 *s, int len, extern int utf16s_to_utf8s(const wchar_t *pwcs, int len, enum utf16_endian endian, u8 *s, int maxlen); +static inline int nls_uni2char(const struct nls_table *table, wchar_t uni, + unsigned char *out, int boundlen) +{ + return table->uni2char(uni, out, boundlen); +} + +static inline int nls_char2uni(const struct nls_table *table, + const unsigned char *rawstring, + int boundlen, wchar_t *uni) +{ + return table->char2uni(rawstring, boundlen, uni); +} + static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c) { unsigned char nc = t->charset2lower[c]; From patchwork Thu Dec 6 23:08:42 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717181 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7CA9A14E2 for ; Thu, 6 Dec 2018 23:09:20 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6C0282F205 for ; Thu, 6 Dec 2018 23:09:20 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 603CB2F21D; Thu, 6 Dec 2018 23:09:20 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 8C3772F205 for ; Thu, 6 Dec 2018 23:09:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726255AbeLFXJS (ORCPT ); Thu, 6 Dec 2018 18:09:18 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56066 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726119AbeLFXJS (ORCPT ); Thu, 6 Dec 2018 18:09:18 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 1400127E83A From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 02/23] nls: Wrap charset field access Date: Thu, 6 Dec 2018 18:08:42 -0500 Message-Id: <20181206230903.30011-3-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi The goal is to simplify the following patches that split nls_table. No behavior changes intended. @@ struct nls_table *c; @@ - c->charset + nls_charset_name(c) Signed-off-by: Gabriel Krisman Bertazi --- fs/befs/linuxvfs.c | 4 ++-- fs/cifs/cifs_unicode.c | 6 +++--- fs/cifs/cifsfs.c | 2 +- fs/cifs/connect.c | 2 +- fs/fat/inode.c | 6 ++++-- fs/hfs/super.c | 6 ++++-- fs/hfsplus/options.c | 2 +- fs/isofs/inode.c | 5 +++-- fs/jfs/jfs_unicode.c | 2 +- fs/jfs/super.c | 3 ++- fs/nls/nls_base.c | 2 +- fs/ntfs/inode.c | 2 +- fs/ntfs/super.c | 6 +++--- fs/ntfs/unistr.c | 5 +++-- fs/udf/super.c | 3 ++- include/linux/nls.h | 5 +++++ 16 files changed, 37 insertions(+), 24 deletions(-) diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c index 0ba368fbfad4..8b7af0a9011a 100644 --- a/fs/befs/linuxvfs.c +++ b/fs/befs/linuxvfs.c @@ -555,7 +555,7 @@ befs_utf2nls(struct super_block *sb, const char *in, conv_err: befs_error(sb, "Name using character set %s contains a character that " - "cannot be converted to unicode.", nls->charset); + "cannot be converted to unicode.", nls_charset_name(nls)); befs_debug(sb, "<--- %s", __func__); kfree(result); return -EILSEQ; @@ -635,7 +635,7 @@ befs_nls2utf(struct super_block *sb, const char *in, conv_err: befs_error(sb, "Name using character set %s contains a character that " - "cannot be converted to unicode.", nls->charset); + "cannot be converted to unicode.", nls_charset_name(nls)); befs_debug(sb, "<--- %s", __func__); kfree(result); return -EILSEQ; diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c index ffad8b4f90d1..2a9396d24f60 100644 --- a/fs/cifs/cifs_unicode.c +++ b/fs/cifs/cifs_unicode.c @@ -153,7 +153,7 @@ cifs_mapchar(char *target, const __u16 *from, const struct nls_table *cp, surrogate_pair: /* convert SURROGATE_PAIR and IVS */ - if (strcmp(cp->charset, "utf8")) + if (strcmp(nls_charset_name(cp), "utf8")) goto unknown; len = utf16s_to_utf8s(from, 3, UTF16_LITTLE_ENDIAN, target, 6); if (len <= 0) @@ -268,7 +268,7 @@ cifs_strtoUTF16(__le16 *to, const char *from, int len, wchar_t wchar_to; /* needed to quiet sparse */ /* special case for utf8 to handle no plane0 chars */ - if (!strcmp(codepage->charset, "utf8")) { + if (!strcmp(nls_charset_name(codepage), "utf8")) { /* * convert utf8 -> utf16, we assume we have enough space * as caller should have assumed conversion does not overflow @@ -527,7 +527,7 @@ cifsConvertToUTF16(__le16 *target, const char *source, int srclen, goto ctoUTF16; /* convert SURROGATE_PAIR */ - if (strcmp(cp->charset, "utf8") || !wchar_to) + if (strcmp(nls_charset_name(cp), "utf8") || !wchar_to) goto unknown; if (*(source + i) & 0x80) { charlen = utf8_to_utf32(source + i, 6, &u); diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index 865706edb307..b0531986cfd7 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -414,7 +414,7 @@ cifs_show_nls(struct seq_file *s, struct nls_table *cur) /* Display iocharset= option if it's not default charset */ def = load_nls_default(); if (def != cur) - seq_printf(s, ",iocharset=%s", cur->charset); + seq_printf(s, ",iocharset=%s", nls_charset_name(cur)); unload_nls(def); } diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index 6f24f129a751..3011276a06f0 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -3191,7 +3191,7 @@ compare_mount_options(struct super_block *sb, struct cifs_mnt_data *mnt_data) old->mnt_dir_mode != new->mnt_dir_mode) return 0; - if (strcmp(old->local_nls->charset, new->local_nls->charset)) + if (strcmp(nls_charset_name(old->local_nls), nls_charset_name(new->local_nls))) return 0; if (old->actimeo != new->actimeo) diff --git a/fs/fat/inode.c b/fs/fat/inode.c index c0b5b5c3373b..2563dc306e7f 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -948,10 +948,12 @@ static int fat_show_options(struct seq_file *m, struct dentry *root) seq_printf(m, ",allow_utime=%04o", opts->allow_utime); if (sbi->nls_disk) /* strip "cp" prefix from displayed option */ - seq_printf(m, ",codepage=%s", &sbi->nls_disk->charset[2]); + seq_printf(m, ",codepage=%s", + &nls_charset_name(sbi->nls_disk)[2]); if (isvfat) { if (sbi->nls_io) - seq_printf(m, ",iocharset=%s", sbi->nls_io->charset); + seq_printf(m, ",iocharset=%s", + nls_charset_name(sbi->nls_io)); switch (opts->shortname) { case VFAT_SFN_DISPLAY_WIN95 | VFAT_SFN_CREATE_WIN95: diff --git a/fs/hfs/super.c b/fs/hfs/super.c index 173876782f73..b16ca01180a5 100644 --- a/fs/hfs/super.c +++ b/fs/hfs/super.c @@ -151,9 +151,11 @@ static int hfs_show_options(struct seq_file *seq, struct dentry *root) if (sbi->session >= 0) seq_printf(seq, ",session=%u", sbi->session); if (sbi->nls_disk) - seq_printf(seq, ",codepage=%s", sbi->nls_disk->charset); + seq_printf(seq, ",codepage=%s", + nls_charset_name(sbi->nls_disk)); if (sbi->nls_io) - seq_printf(seq, ",iocharset=%s", sbi->nls_io->charset); + seq_printf(seq, ",iocharset=%s", + nls_charset_name(sbi->nls_io)); if (sbi->s_quiet) seq_printf(seq, ",quiet"); return 0; diff --git a/fs/hfsplus/options.c b/fs/hfsplus/options.c index 047e05c57560..2d6644465566 100644 --- a/fs/hfsplus/options.c +++ b/fs/hfsplus/options.c @@ -230,7 +230,7 @@ int hfsplus_show_options(struct seq_file *seq, struct dentry *root) if (sbi->session >= 0) seq_printf(seq, ",session=%u", sbi->session); if (sbi->nls) - seq_printf(seq, ",nls=%s", sbi->nls->charset); + seq_printf(seq, ",nls=%s", nls_charset_name(sbi->nls)); if (test_bit(HFSPLUS_SB_NODECOMPOSE, &sbi->flags)) seq_puts(seq, ",nodecompose"); if (test_bit(HFSPLUS_SB_NOBARRIER, &sbi->flags)) diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c index 488a9e7f8f66..b23a3955b8c6 100644 --- a/fs/isofs/inode.c +++ b/fs/isofs/inode.c @@ -520,8 +520,9 @@ static int isofs_show_options(struct seq_file *m, struct dentry *root) #ifdef CONFIG_JOLIET if (sbi->s_nls_iocharset && - strcmp(sbi->s_nls_iocharset->charset, CONFIG_NLS_DEFAULT) != 0) - seq_printf(m, ",iocharset=%s", sbi->s_nls_iocharset->charset); + strcmp(nls_charset_name(sbi->s_nls_iocharset), CONFIG_NLS_DEFAULT) != 0) + seq_printf(m, ",iocharset=%s", + nls_charset_name(sbi->s_nls_iocharset)); #endif return 0; } diff --git a/fs/jfs/jfs_unicode.c b/fs/jfs/jfs_unicode.c index 4ca88ef661e9..1e89b3b8caa7 100644 --- a/fs/jfs/jfs_unicode.c +++ b/fs/jfs/jfs_unicode.c @@ -92,7 +92,7 @@ static int jfs_strtoUCS(wchar_t * to, const unsigned char *from, int len, jfs_err("jfs_strtoUCS: char2uni returned %d.", charlen); jfs_err("charset = %s, char = 0x%x", - codepage->charset, *from); + nls_charset_name(codepage), *from); return charlen; } } diff --git a/fs/jfs/super.c b/fs/jfs/super.c index 65d8fc87ab11..a04ff4bc5afd 100644 --- a/fs/jfs/super.c +++ b/fs/jfs/super.c @@ -736,7 +736,8 @@ static int jfs_show_options(struct seq_file *seq, struct dentry *root) if (sbi->flag & JFS_DISCARD) seq_printf(seq, ",discard=%u", sbi->minblks_trim); if (sbi->nls_tab) - seq_printf(seq, ",iocharset=%s", sbi->nls_tab->charset); + seq_printf(seq, ",iocharset=%s", + nls_charset_name(sbi->nls_tab)); if (sbi->flag & JFS_ERR_CONTINUE) seq_printf(seq, ",errors=continue"); if (sbi->flag & JFS_ERR_PANIC) diff --git a/fs/nls/nls_base.c b/fs/nls/nls_base.c index 52ccd34b1e79..e5d083b6e2b2 100644 --- a/fs/nls/nls_base.c +++ b/fs/nls/nls_base.c @@ -277,7 +277,7 @@ static struct nls_table *find_nls(char *charset) struct nls_table *nls; spin_lock(&nls_lock); for (nls = tables; nls; nls = nls->next) { - if (!strcmp(nls->charset, charset)) + if (!strcmp(nls_charset_name(nls), charset)) break; if (nls->alias && !strcmp(nls->alias, charset)) break; diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c index bd3221cbdd95..872ef265b117 100644 --- a/fs/ntfs/inode.c +++ b/fs/ntfs/inode.c @@ -2313,7 +2313,7 @@ int ntfs_show_options(struct seq_file *sf, struct dentry *root) seq_printf(sf, ",fmask=0%o", vol->fmask); seq_printf(sf, ",dmask=0%o", vol->dmask); } - seq_printf(sf, ",nls=%s", vol->nls_map->charset); + seq_printf(sf, ",nls=%s", nls_charset_name(vol->nls_map)); if (NVolCaseSensitive(vol)) seq_printf(sf, ",case_sensitive"); if (NVolShowSystemFiles(vol)) diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c index bb7159f697f2..1c68c33e9816 100644 --- a/fs/ntfs/super.c +++ b/fs/ntfs/super.c @@ -224,7 +224,7 @@ static bool parse_options(ntfs_volume *vol, char *opt) } ntfs_error(vol->sb, "NLS character set %s not " "found. Using previous one %s.", - v, old_nls->charset); + v, nls_charset_name(old_nls)); nls_map = old_nls; } else /* nls_map */ { unload_nls(old_nls); @@ -274,7 +274,7 @@ static bool parse_options(ntfs_volume *vol, char *opt) "on remount."); return false; } /* else (!vol->nls_map) */ - ntfs_debug("Using NLS character set %s.", nls_map->charset); + ntfs_debug("Using NLS character set %s.", nls_charset_name(nls_map)); vol->nls_map = nls_map; } else /* (!nls_map) */ { if (!vol->nls_map) { @@ -285,7 +285,7 @@ static bool parse_options(ntfs_volume *vol, char *opt) return false; } ntfs_debug("Using default NLS character set (%s).", - vol->nls_map->charset); + nls_charset_name(vol->nls_map)); } } if (mft_zone_multiplier != -1) { diff --git a/fs/ntfs/unistr.c b/fs/ntfs/unistr.c index e0a5f33441df..a30911979a55 100644 --- a/fs/ntfs/unistr.c +++ b/fs/ntfs/unistr.c @@ -297,7 +297,7 @@ int ntfs_nlstoucs(const ntfs_volume *vol, const char *ins, if (wc_len < 0) { ntfs_error(vol->sb, "Name using character set %s contains " "characters that cannot be converted to " - "Unicode.", nls->charset); + "Unicode.", nls_charset_name(nls)); i = -EILSEQ; } else /* if (o >= NTFS_MAX_NAME_LEN) */ { ntfs_error(vol->sb, "Name is too long (maximum length for a " @@ -386,7 +386,8 @@ retry: wc = nls_uni2char(nls, le16_to_cpu(ins[i]), conversion_err: ntfs_error(vol->sb, "Unicode name contains characters that cannot be " "converted to character set %s. You might want to " - "try to use the mount option nls=utf8.", nls->charset); + "try to use the mount option nls=utf8.", + nls_charset_name(nls)); if (ns != *outs) kfree(ns); if (wc != -ENAMETOOLONG) diff --git a/fs/udf/super.c b/fs/udf/super.c index 8f2f56d9a1bb..284087eb64d0 100644 --- a/fs/udf/super.c +++ b/fs/udf/super.c @@ -361,7 +361,8 @@ static int udf_show_options(struct seq_file *seq, struct dentry *root) if (UDF_QUERY_FLAG(sb, UDF_FLAG_UTF8)) seq_puts(seq, ",utf8"); if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP) && sbi->s_nls_map) - seq_printf(seq, ",iocharset=%s", sbi->s_nls_map->charset); + seq_printf(seq, ",iocharset=%s", + nls_charset_name(sbi->s_nls_map)); return 0; } diff --git a/include/linux/nls.h b/include/linux/nls.h index 5073ecd57279..cacbcd7d63e6 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -72,6 +72,11 @@ static inline int nls_char2uni(const struct nls_table *table, return table->char2uni(rawstring, boundlen, uni); } +static inline const char *nls_charset_name(const struct nls_table *table) +{ + return table->charset; +} + static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c) { unsigned char nc = t->charset2lower[c]; From patchwork Thu Dec 6 23:08:43 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717185 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6B52C15A6 for ; Thu, 6 Dec 2018 23:09:34 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 584622F205 for ; Thu, 6 Dec 2018 23:09:34 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 4B8162F21E; Thu, 6 Dec 2018 23:09:34 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5F25B2F205 for ; Thu, 6 Dec 2018 23:09:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726274AbeLFXJX (ORCPT ); Thu, 6 Dec 2018 18:09:23 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56070 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726119AbeLFXJW (ORCPT ); Thu, 6 Dec 2018 18:09:22 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 0373727E03E From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 03/23] nls: Wrap charset hooks in ops structure Date: Thu, 6 Dec 2018 18:08:43 -0500 Message-Id: <20181206230903.30011-4-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi This is done in preparation to splitting the nls_table structure, in order to support multiple versions of the same encoding. By placing all the operations together, we can allow multiple operations for the same encoding, depending on the version. For now, there is no behavior change intended, but this simplify the following patches. With the exception of the declaration of the structure, this patch was generated by the following Coccinelle script: @nlstable@ identifier p; expression uni2char_fn; expression char2uni_fn; @@ static struct nls_table p = { - .char2uni = char2uni_fn, - .uni2char = uni2char_fn, + .ops = &charset_ops, }; @createops@ identifier nlstable.p; expression nlstable.uni2char_fn; expression nlstable.char2uni_fn; @@ +static const struct nls_ops charset_ops = { + .uni2char = uni2char_fn, + .char2uni = char2uni_fn, +}; + static struct nls_table p = {}; @@ struct nls_table *c; @@ ( - c->uni2char + c->ops->uni2char | - c->char2uni + c->ops->char2uni ) Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/mac-celtic.c | 8 ++++++-- fs/nls/mac-centeuro.c | 8 ++++++-- fs/nls/mac-croatian.c | 8 ++++++-- fs/nls/mac-cyrillic.c | 8 ++++++-- fs/nls/mac-gaelic.c | 8 ++++++-- fs/nls/mac-greek.c | 8 ++++++-- fs/nls/mac-iceland.c | 8 ++++++-- fs/nls/mac-inuit.c | 8 ++++++-- fs/nls/mac-roman.c | 8 ++++++-- fs/nls/mac-romanian.c | 8 ++++++-- fs/nls/mac-turkish.c | 8 ++++++-- fs/nls/nls_ascii.c | 8 ++++++-- fs/nls/nls_base.c | 8 ++++++-- fs/nls/nls_cp1250.c | 8 ++++++-- fs/nls/nls_cp1251.c | 8 ++++++-- fs/nls/nls_cp1255.c | 8 ++++++-- fs/nls/nls_cp437.c | 8 ++++++-- fs/nls/nls_cp737.c | 8 ++++++-- fs/nls/nls_cp775.c | 8 ++++++-- fs/nls/nls_cp850.c | 8 ++++++-- fs/nls/nls_cp852.c | 8 ++++++-- fs/nls/nls_cp855.c | 8 ++++++-- fs/nls/nls_cp857.c | 8 ++++++-- fs/nls/nls_cp860.c | 8 ++++++-- fs/nls/nls_cp861.c | 8 ++++++-- fs/nls/nls_cp862.c | 8 ++++++-- fs/nls/nls_cp863.c | 8 ++++++-- fs/nls/nls_cp864.c | 8 ++++++-- fs/nls/nls_cp865.c | 8 ++++++-- fs/nls/nls_cp866.c | 8 ++++++-- fs/nls/nls_cp869.c | 8 ++++++-- fs/nls/nls_cp874.c | 8 ++++++-- fs/nls/nls_cp932.c | 8 ++++++-- fs/nls/nls_cp936.c | 8 ++++++-- fs/nls/nls_cp949.c | 8 ++++++-- fs/nls/nls_cp950.c | 8 ++++++-- fs/nls/nls_euc-jp.c | 8 ++++++-- fs/nls/nls_iso8859-1.c | 8 ++++++-- fs/nls/nls_iso8859-13.c | 8 ++++++-- fs/nls/nls_iso8859-14.c | 8 ++++++-- fs/nls/nls_iso8859-15.c | 8 ++++++-- fs/nls/nls_iso8859-2.c | 8 ++++++-- fs/nls/nls_iso8859-3.c | 8 ++++++-- fs/nls/nls_iso8859-4.c | 8 ++++++-- fs/nls/nls_iso8859-5.c | 8 ++++++-- fs/nls/nls_iso8859-6.c | 8 ++++++-- fs/nls/nls_iso8859-7.c | 8 ++++++-- fs/nls/nls_iso8859-9.c | 8 ++++++-- fs/nls/nls_koi8-r.c | 8 ++++++-- fs/nls/nls_koi8-ru.c | 8 ++++++-- fs/nls/nls_koi8-u.c | 8 ++++++-- fs/nls/nls_utf8.c | 8 ++++++-- fs/udf/unicode.c | 4 ++-- include/linux/nls.h | 16 ++++++++++------ 54 files changed, 324 insertions(+), 112 deletions(-) diff --git a/fs/nls/mac-celtic.c b/fs/nls/mac-celtic.c index 266c2d7d50bd..1b59b04f26f2 100644 --- a/fs/nls/mac-celtic.c +++ b/fs/nls/mac-celtic.c @@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "macceltic", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-centeuro.c b/fs/nls/mac-centeuro.c index 9789c6057551..d5b8f38f97b6 100644 --- a/fs/nls/mac-centeuro.c +++ b/fs/nls/mac-centeuro.c @@ -507,10 +507,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "maccenteuro", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-croatian.c b/fs/nls/mac-croatian.c index bb19e7a07d43..32de6accd526 100644 --- a/fs/nls/mac-croatian.c +++ b/fs/nls/mac-croatian.c @@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "maccroatian", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-cyrillic.c b/fs/nls/mac-cyrillic.c index 2a7dea36acba..34d5c1c05ff1 100644 --- a/fs/nls/mac-cyrillic.c +++ b/fs/nls/mac-cyrillic.c @@ -472,10 +472,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "maccyrillic", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-gaelic.c b/fs/nls/mac-gaelic.c index 77b001653588..2aabf5213176 100644 --- a/fs/nls/mac-gaelic.c +++ b/fs/nls/mac-gaelic.c @@ -542,10 +542,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "macgaelic", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-greek.c b/fs/nls/mac-greek.c index 1eccf499e2eb..df62909ef57e 100644 --- a/fs/nls/mac-greek.c +++ b/fs/nls/mac-greek.c @@ -472,10 +472,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "macgreek", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-iceland.c b/fs/nls/mac-iceland.c index cbd0875c6d69..8daa68b995bc 100644 --- a/fs/nls/mac-iceland.c +++ b/fs/nls/mac-iceland.c @@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "maciceland", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-inuit.c b/fs/nls/mac-inuit.c index fba8357aaf03..b0799693502a 100644 --- a/fs/nls/mac-inuit.c +++ b/fs/nls/mac-inuit.c @@ -507,10 +507,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "macinuit", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-roman.c b/fs/nls/mac-roman.c index b6a98a5208cd..ba358b864b05 100644 --- a/fs/nls/mac-roman.c +++ b/fs/nls/mac-roman.c @@ -612,10 +612,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "macroman", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-romanian.c b/fs/nls/mac-romanian.c index 25547f023638..7a8a7f9a0bbc 100644 --- a/fs/nls/mac-romanian.c +++ b/fs/nls/mac-romanian.c @@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "macromanian", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/mac-turkish.c b/fs/nls/mac-turkish.c index b5454bc7b7fa..eb3c5e53ec88 100644 --- a/fs/nls/mac-turkish.c +++ b/fs/nls/mac-turkish.c @@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "macturkish", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c index a2620650d5e4..6bad3e779284 100644 --- a/fs/nls/nls_ascii.c +++ b/fs/nls/nls_ascii.c @@ -142,10 +142,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "ascii", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_base.c b/fs/nls/nls_base.c index e5d083b6e2b2..0bb0acf6893f 100644 --- a/fs/nls/nls_base.c +++ b/fs/nls/nls_base.c @@ -520,10 +520,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table default_table = { .charset = "default", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp1250.c b/fs/nls/nls_cp1250.c index ace3e19d3407..08902e86fc8e 100644 --- a/fs/nls/nls_cp1250.c +++ b/fs/nls/nls_cp1250.c @@ -323,10 +323,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp1250", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp1251.c b/fs/nls/nls_cp1251.c index 9273ddfd08a1..2bb88c8cc5bf 100644 --- a/fs/nls/nls_cp1251.c +++ b/fs/nls/nls_cp1251.c @@ -277,10 +277,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp1251", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp1255.c b/fs/nls/nls_cp1255.c index 1caf5dfed85b..c6bf8d575c5b 100644 --- a/fs/nls/nls_cp1255.c +++ b/fs/nls/nls_cp1255.c @@ -358,11 +358,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp1255", .alias = "iso8859-8", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp437.c b/fs/nls/nls_cp437.c index 7ddb830da3fd..0f3f8bdbb62b 100644 --- a/fs/nls/nls_cp437.c +++ b/fs/nls/nls_cp437.c @@ -363,10 +363,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp437", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp737.c b/fs/nls/nls_cp737.c index c593f683a0cd..9383359ca25f 100644 --- a/fs/nls/nls_cp737.c +++ b/fs/nls/nls_cp737.c @@ -326,10 +326,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp737", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp775.c b/fs/nls/nls_cp775.c index 554c863745f2..6c787b9079ed 100644 --- a/fs/nls/nls_cp775.c +++ b/fs/nls/nls_cp775.c @@ -295,10 +295,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp775", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp850.c b/fs/nls/nls_cp850.c index 56cccd14b40b..50a57138a571 100644 --- a/fs/nls/nls_cp850.c +++ b/fs/nls/nls_cp850.c @@ -291,10 +291,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp850", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp852.c b/fs/nls/nls_cp852.c index 7cdc05ac1d40..0cbb199f1cd5 100644 --- a/fs/nls/nls_cp852.c +++ b/fs/nls/nls_cp852.c @@ -313,10 +313,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp852", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp855.c b/fs/nls/nls_cp855.c index 7426eea05663..530b77c86363 100644 --- a/fs/nls/nls_cp855.c +++ b/fs/nls/nls_cp855.c @@ -275,10 +275,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp855", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp857.c b/fs/nls/nls_cp857.c index 098309733ebd..0db642ec6f45 100644 --- a/fs/nls/nls_cp857.c +++ b/fs/nls/nls_cp857.c @@ -277,10 +277,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp857", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp860.c b/fs/nls/nls_cp860.c index 84224478e731..44a40dac26bd 100644 --- a/fs/nls/nls_cp860.c +++ b/fs/nls/nls_cp860.c @@ -340,10 +340,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp860", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp861.c b/fs/nls/nls_cp861.c index dc873e4be092..50e08174fc48 100644 --- a/fs/nls/nls_cp861.c +++ b/fs/nls/nls_cp861.c @@ -363,10 +363,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp861", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp862.c b/fs/nls/nls_cp862.c index d5263e3c5566..3505f3437972 100644 --- a/fs/nls/nls_cp862.c +++ b/fs/nls/nls_cp862.c @@ -397,10 +397,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp862", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp863.c b/fs/nls/nls_cp863.c index 051c9832e36a..e3489cdc0c04 100644 --- a/fs/nls/nls_cp863.c +++ b/fs/nls/nls_cp863.c @@ -357,10 +357,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp863", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp864.c b/fs/nls/nls_cp864.c index 97eb1273b2f7..d4185bc7f1bf 100644 --- a/fs/nls/nls_cp864.c +++ b/fs/nls/nls_cp864.c @@ -383,10 +383,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp864", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp865.c b/fs/nls/nls_cp865.c index 111214228525..9f468944e577 100644 --- a/fs/nls/nls_cp865.c +++ b/fs/nls/nls_cp865.c @@ -363,10 +363,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp865", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp866.c b/fs/nls/nls_cp866.c index ffdcbc3fc38d..ee46fd5a76b1 100644 --- a/fs/nls/nls_cp866.c +++ b/fs/nls/nls_cp866.c @@ -281,10 +281,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp866", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp869.c b/fs/nls/nls_cp869.c index 3b5a34589354..da29a4a53e1d 100644 --- a/fs/nls/nls_cp869.c +++ b/fs/nls/nls_cp869.c @@ -291,10 +291,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp869", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp874.c b/fs/nls/nls_cp874.c index 8dfaa10710fa..642659b9ed89 100644 --- a/fs/nls/nls_cp874.c +++ b/fs/nls/nls_cp874.c @@ -249,11 +249,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp874", .alias = "tis-620", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp932.c b/fs/nls/nls_cp932.c index 67b7398e8483..3e7bdefdca90 100644 --- a/fs/nls/nls_cp932.c +++ b/fs/nls/nls_cp932.c @@ -7907,11 +7907,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return -EINVAL; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp932", .alias = "sjis", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp936.c b/fs/nls/nls_cp936.c index c96546cfec9f..b1fa2918992b 100644 --- a/fs/nls/nls_cp936.c +++ b/fs/nls/nls_cp936.c @@ -11085,11 +11085,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp936", .alias = "gb2312", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp949.c b/fs/nls/nls_cp949.c index 199171e97aa4..1d334095d86c 100644 --- a/fs/nls/nls_cp949.c +++ b/fs/nls/nls_cp949.c @@ -13920,11 +13920,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp949", .alias = "euc-kr", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_cp950.c b/fs/nls/nls_cp950.c index 8e1418708209..d936160a48f9 100644 --- a/fs/nls/nls_cp950.c +++ b/fs/nls/nls_cp950.c @@ -9456,11 +9456,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "cp950", .alias = "big5", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c index eec257545f04..0af73982738b 100644 --- a/fs/nls/nls_euc-jp.c +++ b/fs/nls/nls_euc-jp.c @@ -549,10 +549,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return euc_offset; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "euc-jp", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, }; static int __init init_nls_euc_jp(void) diff --git a/fs/nls/nls_iso8859-1.c b/fs/nls/nls_iso8859-1.c index 69ac020d43b1..6212b2925fa0 100644 --- a/fs/nls/nls_iso8859-1.c +++ b/fs/nls/nls_iso8859-1.c @@ -233,10 +233,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-1", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-13.c b/fs/nls/nls_iso8859-13.c index afb3f8f275f0..8f0a23109207 100644 --- a/fs/nls/nls_iso8859-13.c +++ b/fs/nls/nls_iso8859-13.c @@ -261,10 +261,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-13", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-14.c b/fs/nls/nls_iso8859-14.c index 046370f0b6f0..80ab77f37480 100644 --- a/fs/nls/nls_iso8859-14.c +++ b/fs/nls/nls_iso8859-14.c @@ -317,10 +317,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-14", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-15.c b/fs/nls/nls_iso8859-15.c index 7e34a841a056..5c02f93e7b20 100644 --- a/fs/nls/nls_iso8859-15.c +++ b/fs/nls/nls_iso8859-15.c @@ -283,10 +283,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-15", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-2.c b/fs/nls/nls_iso8859-2.c index 7dd571181741..97afc1233da1 100644 --- a/fs/nls/nls_iso8859-2.c +++ b/fs/nls/nls_iso8859-2.c @@ -284,10 +284,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-2", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-3.c b/fs/nls/nls_iso8859-3.c index 740b75ec4493..f835fcec3aae 100644 --- a/fs/nls/nls_iso8859-3.c +++ b/fs/nls/nls_iso8859-3.c @@ -284,10 +284,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-3", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-4.c b/fs/nls/nls_iso8859-4.c index 8826021e32f5..14acb68fb013 100644 --- a/fs/nls/nls_iso8859-4.c +++ b/fs/nls/nls_iso8859-4.c @@ -284,10 +284,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-4", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-5.c b/fs/nls/nls_iso8859-5.c index 7c04057a1ad8..f559bbb25045 100644 --- a/fs/nls/nls_iso8859-5.c +++ b/fs/nls/nls_iso8859-5.c @@ -248,10 +248,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-5", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-6.c b/fs/nls/nls_iso8859-6.c index d4a881400d74..e3d7e28363b8 100644 --- a/fs/nls/nls_iso8859-6.c +++ b/fs/nls/nls_iso8859-6.c @@ -239,10 +239,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-6", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-7.c b/fs/nls/nls_iso8859-7.c index 37b75d825a75..49fd2b24e492 100644 --- a/fs/nls/nls_iso8859-7.c +++ b/fs/nls/nls_iso8859-7.c @@ -293,10 +293,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-7", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_iso8859-9.c b/fs/nls/nls_iso8859-9.c index 557b98250d37..876696f89626 100644 --- a/fs/nls/nls_iso8859-9.c +++ b/fs/nls/nls_iso8859-9.c @@ -248,10 +248,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "iso8859-9", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_koi8-r.c b/fs/nls/nls_koi8-r.c index 811f232fccfb..6a85211402a8 100644 --- a/fs/nls/nls_koi8-r.c +++ b/fs/nls/nls_koi8-r.c @@ -299,10 +299,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "koi8-r", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c index 32781252110d..c4e382fd0f13 100644 --- a/fs/nls/nls_koi8-ru.c +++ b/fs/nls/nls_koi8-ru.c @@ -51,10 +51,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "koi8-ru", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, }; static int __init init_nls_koi8_ru(void) diff --git a/fs/nls/nls_koi8-u.c b/fs/nls/nls_koi8-u.c index 7e029e4c188a..5f91e9cdb165 100644 --- a/fs/nls/nls_koi8-u.c +++ b/fs/nls/nls_koi8-u.c @@ -306,10 +306,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "koi8-u", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8.c index afcfbc4a14db..6988fffd5cf6 100644 --- a/fs/nls/nls_utf8.c +++ b/fs/nls/nls_utf8.c @@ -40,10 +40,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return n; } +static const struct nls_ops charset_ops = { + .uni2char = uni2char, + .char2uni = char2uni, +}; + static struct nls_table table = { .charset = "utf8", - .uni2char = uni2char, - .char2uni = char2uni, + .ops = &charset_ops, .charset2lower = identity, /* no conversion */ .charset2upper = identity, }; diff --git a/fs/udf/unicode.c b/fs/udf/unicode.c index 45234791fec2..f1a9625ade43 100644 --- a/fs/udf/unicode.c +++ b/fs/udf/unicode.c @@ -178,7 +178,7 @@ static int udf_name_from_CS0(struct super_block *sb, } if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP)) - conv_f = UDF_SB(sb)->s_nls_map->uni2char; + conv_f = UDF_SB(sb)->s_nls_map->ops->uni2char; else conv_f = NULL; @@ -286,7 +286,7 @@ static int udf_name_to_CS0(struct super_block *sb, return 0; if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP)) - conv_f = UDF_SB(sb)->s_nls_map->char2uni; + conv_f = UDF_SB(sb)->s_nls_map->ops->char2uni; else conv_f = NULL; diff --git a/include/linux/nls.h b/include/linux/nls.h index cacbcd7d63e6..5d63fe6aa55e 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -22,12 +22,16 @@ typedef u16 wchar_t; /* Arbitrary Unicode character */ typedef u32 unicode_t; -struct nls_table { - const char *charset; - const char *alias; +struct nls_ops { int (*uni2char) (wchar_t uni, unsigned char *out, int boundlen); int (*char2uni) (const unsigned char *rawstring, int boundlen, wchar_t *uni); +}; + +struct nls_table { + const char *charset; + const char *alias; + const struct nls_ops *ops; const unsigned char *charset2lower; const unsigned char *charset2upper; struct module *owner; @@ -62,14 +66,14 @@ extern int utf16s_to_utf8s(const wchar_t *pwcs, int len, static inline int nls_uni2char(const struct nls_table *table, wchar_t uni, unsigned char *out, int boundlen) { - return table->uni2char(uni, out, boundlen); + return table->ops->uni2char(uni, out, boundlen); } static inline int nls_char2uni(const struct nls_table *table, const unsigned char *rawstring, int boundlen, wchar_t *uni) { - return table->char2uni(rawstring, boundlen, uni); + return table->ops->char2uni(rawstring, boundlen, uni); } static inline const char *nls_charset_name(const struct nls_table *table) @@ -116,7 +120,7 @@ nls_nullsize(const struct nls_table *codepage) int charlen; char tmp[NLS_MAX_CHARSET_SIZE]; - charlen = codepage->uni2char(0, tmp, NLS_MAX_CHARSET_SIZE); + charlen = codepage->ops->uni2char(0, tmp, NLS_MAX_CHARSET_SIZE); return charlen > 0 ? charlen : 1; } From patchwork Thu Dec 6 23:08:44 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717183 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 32E3514E2 for ; Thu, 6 Dec 2018 23:09:32 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2277B2F205 for ; Thu, 6 Dec 2018 23:09:32 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 16B922F21D; Thu, 6 Dec 2018 23:09:32 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 69FED2F205 for ; Thu, 6 Dec 2018 23:09:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726288AbeLFXJ3 (ORCPT ); Thu, 6 Dec 2018 18:09:29 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56074 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726270AbeLFXJY (ORCPT ); Thu, 6 Dec 2018 18:09:24 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 435D127DA75 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 04/23] nls: Split default charset from NLS core Date: Thu, 6 Dec 2018 18:08:44 -0500 Message-Id: <20181206230903.30011-5-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/Makefile | 1 + fs/nls/nls_core.c | 94 ++++++++++++++++++++++++++++ fs/nls/{nls_base.c => nls_default.c} | 93 +++------------------------ 3 files changed, 102 insertions(+), 86 deletions(-) create mode 100644 fs/nls/nls_core.c rename fs/nls/{nls_base.c => nls_default.c} (90%) diff --git a/fs/nls/Makefile b/fs/nls/Makefile index ac54db297128..5f42ceff9d15 100644 --- a/fs/nls/Makefile +++ b/fs/nls/Makefile @@ -3,6 +3,7 @@ # Makefile for native language support # +nls_base-y := nls_core.o nls_default.o obj-$(CONFIG_NLS) += nls_base.o obj-$(CONFIG_NLS_CODEPAGE_437) += nls_cp437.o diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c new file mode 100644 index 000000000000..3f7de8f4c5b2 --- /dev/null +++ b/fs/nls/nls_core.c @@ -0,0 +1,94 @@ +/* + * linux/fs/nls/nls_core.c + * + * Native language support--charsets and unicode translations. + * By Gordon Chaffee 1996, 1997 + * + * Unicode based case conversion 1999 by Wolfram Pienkoss + * + */ + +#include +#include +#include +#include +#include +#include +#include + +static struct nls_table default_table; +static struct nls_table *tables = &default_table; +static DEFINE_SPINLOCK(nls_lock); + +int __register_nls(struct nls_table *nls, struct module *owner) +{ + struct nls_table ** tmp = &tables; + + if (nls->next) + return -EBUSY; + + nls->owner = owner; + spin_lock(&nls_lock); + while (*tmp) { + if (nls == *tmp) { + spin_unlock(&nls_lock); + return -EBUSY; + } + tmp = &(*tmp)->next; + } + nls->next = tables; + tables = nls; + spin_unlock(&nls_lock); + return 0; +} +EXPORT_SYMBOL(__register_nls); + +int unregister_nls(struct nls_table * nls) +{ + struct nls_table ** tmp = &tables; + + spin_lock(&nls_lock); + while (*tmp) { + if (nls == *tmp) { + *tmp = nls->next; + spin_unlock(&nls_lock); + return 0; + } + tmp = &(*tmp)->next; + } + spin_unlock(&nls_lock); + return -EINVAL; +} + +static struct nls_table *find_nls(char *charset) +{ + struct nls_table *nls; + spin_lock(&nls_lock); + for (nls = tables; nls; nls = nls->next) { + if (!strcmp(nls_charset_name(nls), charset)) + break; + if (nls->alias && !strcmp(nls->alias, charset)) + break; + } + if (nls && !try_module_get(nls->owner)) + nls = NULL; + spin_unlock(&nls_lock); + return nls; +} + +struct nls_table *load_nls(char *charset) +{ + return try_then_request_module(find_nls(charset), "nls_%s", charset); +} + +void unload_nls(struct nls_table *nls) +{ + if (nls) + module_put(nls->owner); +} + +EXPORT_SYMBOL(unregister_nls); +EXPORT_SYMBOL(unload_nls); +EXPORT_SYMBOL(load_nls); + +MODULE_LICENSE("Dual BSD/GPL"); diff --git a/fs/nls/nls_base.c b/fs/nls/nls_default.c similarity index 90% rename from fs/nls/nls_base.c rename to fs/nls/nls_default.c index 0bb0acf6893f..c5d7e8391b22 100644 --- a/fs/nls/nls_base.c +++ b/fs/nls/nls_default.c @@ -1,5 +1,5 @@ /* - * linux/fs/nls/nls_base.c + * linux/fs/nls/nls_default.c * * Native language support--charsets and unicode translations. * By Gordon Chaffee 1996, 1997 @@ -8,23 +8,17 @@ * */ +/* + * Sample implementation from Unicode home page. + * http://www.stonehand.com/unicode/standard/fss-utf.html + */ + #include -#include -#include -#include -#include -#include -#include #include +#include static struct nls_table default_table; -static struct nls_table *tables = &default_table; -static DEFINE_SPINLOCK(nls_lock); -/* - * Sample implementation from Unicode home page. - * http://www.stonehand.com/unicode/standard/fss-utf.html - */ struct utf8_table { int cmask; int cval; @@ -232,73 +226,6 @@ int utf16s_to_utf8s(const wchar_t *pwcs, int inlen, enum utf16_endian endian, } EXPORT_SYMBOL(utf16s_to_utf8s); -int __register_nls(struct nls_table *nls, struct module *owner) -{ - struct nls_table ** tmp = &tables; - - if (nls->next) - return -EBUSY; - - nls->owner = owner; - spin_lock(&nls_lock); - while (*tmp) { - if (nls == *tmp) { - spin_unlock(&nls_lock); - return -EBUSY; - } - tmp = &(*tmp)->next; - } - nls->next = tables; - tables = nls; - spin_unlock(&nls_lock); - return 0; -} -EXPORT_SYMBOL(__register_nls); - -int unregister_nls(struct nls_table * nls) -{ - struct nls_table ** tmp = &tables; - - spin_lock(&nls_lock); - while (*tmp) { - if (nls == *tmp) { - *tmp = nls->next; - spin_unlock(&nls_lock); - return 0; - } - tmp = &(*tmp)->next; - } - spin_unlock(&nls_lock); - return -EINVAL; -} - -static struct nls_table *find_nls(char *charset) -{ - struct nls_table *nls; - spin_lock(&nls_lock); - for (nls = tables; nls; nls = nls->next) { - if (!strcmp(nls_charset_name(nls), charset)) - break; - if (nls->alias && !strcmp(nls->alias, charset)) - break; - } - if (nls && !try_module_get(nls->owner)) - nls = NULL; - spin_unlock(&nls_lock); - return nls; -} - -struct nls_table *load_nls(char *charset) -{ - return try_then_request_module(find_nls(charset), "nls_%s", charset); -} - -void unload_nls(struct nls_table *nls) -{ - if (nls) - module_put(nls->owner); -} - static const wchar_t charset2uni[256] = { /* 0x00*/ 0x0000, 0x0001, 0x0002, 0x0003, @@ -543,10 +470,4 @@ struct nls_table *load_nls_default(void) else return &default_table; } - -EXPORT_SYMBOL(unregister_nls); -EXPORT_SYMBOL(unload_nls); -EXPORT_SYMBOL(load_nls); EXPORT_SYMBOL(load_nls_default); - -MODULE_LICENSE("Dual BSD/GPL"); From patchwork Thu Dec 6 23:08:45 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717189 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 36EE913AF for ; Thu, 6 Dec 2018 23:09:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1FC612F205 for ; Thu, 6 Dec 2018 23:09:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1415B2F21D; Thu, 6 Dec 2018 23:09:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7BDB82F212 for ; Thu, 6 Dec 2018 23:09:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726295AbeLFXJb (ORCPT ); Thu, 6 Dec 2018 18:09:31 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56078 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726166AbeLFXJa (ORCPT ); Thu, 6 Dec 2018 18:09:30 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id B117C27E83A From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 05/23] nls: Split struct nls_charset from struct nls_table Date: Thu, 6 Dec 2018 18:08:45 -0500 Message-Id: <20181206230903.30011-6-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi In order to support multiple versions of the same encoding, we want to split the charset registering data, which enrolls it with the NLS subsystem and is valid for all versions of the encoding, from the version specific data, that is actually going to be returned when the caller loads an NLS charset. With the exception of the following files (files declaration and default charset), which were edited by hand, the other files were generated with the following Coccinelle patch: Files edited by hand: - fs/nls/nls_base.c - include/linux/nls.h - fs/nls/nls_default.c @nlstable@ identifier p; expression charset_str; @@ static struct nls_table p = { - .charset = charset_str, }; @createops@ identifier nlstable.p; expression nlstable.charset_str; @@ +static struct nls_charset nls_charset; static struct nls_table p = { + .charset = &nls_charset, }; + +static struct nls_charset nls_charset = { + .charset = charset_str, + .tables = &p, +}; @@ expression A; @@ ? return - register_nls(A); + register_nls(&nls_charset); @@ expression A; @@ - unregister_nls(A); + unregister_nls(&nls_charset); @mvalias@ identifier p; expression alias_str; @@ static struct nls_table p = { - .alias = alias_str, }; @@ expression mvalias.alias_str; @@ static struct nls_charset nls_charset = { + .alias = alias_str, }; Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/mac-celtic.c | 12 ++++++++--- fs/nls/mac-centeuro.c | 12 ++++++++--- fs/nls/mac-croatian.c | 12 ++++++++--- fs/nls/mac-cyrillic.c | 12 ++++++++--- fs/nls/mac-gaelic.c | 12 ++++++++--- fs/nls/mac-greek.c | 12 ++++++++--- fs/nls/mac-iceland.c | 12 ++++++++--- fs/nls/mac-inuit.c | 12 ++++++++--- fs/nls/mac-roman.c | 12 ++++++++--- fs/nls/mac-romanian.c | 12 ++++++++--- fs/nls/mac-turkish.c | 12 ++++++++--- fs/nls/nls_ascii.c | 12 ++++++++--- fs/nls/nls_core.c | 48 +++++++++++++++++++++++++++-------------- fs/nls/nls_cp1250.c | 12 ++++++++--- fs/nls/nls_cp1251.c | 12 ++++++++--- fs/nls/nls_cp1255.c | 14 ++++++++---- fs/nls/nls_cp437.c | 12 ++++++++--- fs/nls/nls_cp737.c | 12 ++++++++--- fs/nls/nls_cp775.c | 12 ++++++++--- fs/nls/nls_cp850.c | 12 ++++++++--- fs/nls/nls_cp852.c | 12 ++++++++--- fs/nls/nls_cp855.c | 12 ++++++++--- fs/nls/nls_cp857.c | 12 ++++++++--- fs/nls/nls_cp860.c | 12 ++++++++--- fs/nls/nls_cp861.c | 12 ++++++++--- fs/nls/nls_cp862.c | 12 ++++++++--- fs/nls/nls_cp863.c | 12 ++++++++--- fs/nls/nls_cp864.c | 12 ++++++++--- fs/nls/nls_cp865.c | 12 ++++++++--- fs/nls/nls_cp866.c | 12 ++++++++--- fs/nls/nls_cp869.c | 12 ++++++++--- fs/nls/nls_cp874.c | 14 ++++++++---- fs/nls/nls_cp932.c | 14 ++++++++---- fs/nls/nls_cp936.c | 14 ++++++++---- fs/nls/nls_cp949.c | 14 ++++++++---- fs/nls/nls_cp950.c | 14 ++++++++---- fs/nls/nls_default.c | 9 ++++++-- fs/nls/nls_euc-jp.c | 12 ++++++++--- fs/nls/nls_iso8859-1.c | 12 ++++++++--- fs/nls/nls_iso8859-13.c | 12 ++++++++--- fs/nls/nls_iso8859-14.c | 12 ++++++++--- fs/nls/nls_iso8859-15.c | 12 ++++++++--- fs/nls/nls_iso8859-2.c | 12 ++++++++--- fs/nls/nls_iso8859-3.c | 12 ++++++++--- fs/nls/nls_iso8859-4.c | 12 ++++++++--- fs/nls/nls_iso8859-5.c | 12 ++++++++--- fs/nls/nls_iso8859-6.c | 12 ++++++++--- fs/nls/nls_iso8859-7.c | 12 ++++++++--- fs/nls/nls_iso8859-9.c | 12 ++++++++--- fs/nls/nls_koi8-r.c | 12 ++++++++--- fs/nls/nls_koi8-ru.c | 12 ++++++++--- fs/nls/nls_koi8-u.c | 12 ++++++++--- fs/nls/nls_utf8.c | 12 ++++++++--- include/linux/nls.h | 18 ++++++++++------ 54 files changed, 516 insertions(+), 183 deletions(-) diff --git a/fs/nls/mac-celtic.c b/fs/nls/mac-celtic.c index 1b59b04f26f2..4fe7347c55d6 100644 --- a/fs/nls/mac-celtic.c +++ b/fs/nls/mac-celtic.c @@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "macceltic", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "macceltic", + .tables = &table, +}; + static int __init init_nls_macceltic(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_macceltic(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_macceltic) diff --git a/fs/nls/mac-centeuro.c b/fs/nls/mac-centeuro.c index d5b8f38f97b6..2d115aae4240 100644 --- a/fs/nls/mac-centeuro.c +++ b/fs/nls/mac-centeuro.c @@ -512,21 +512,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "maccenteuro", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "maccenteuro", + .tables = &table, +}; + static int __init init_nls_maccenteuro(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_maccenteuro(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_maccenteuro) diff --git a/fs/nls/mac-croatian.c b/fs/nls/mac-croatian.c index 32de6accd526..b496b85fcde1 100644 --- a/fs/nls/mac-croatian.c +++ b/fs/nls/mac-croatian.c @@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "maccroatian", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "maccroatian", + .tables = &table, +}; + static int __init init_nls_maccroatian(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_maccroatian(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_maccroatian) diff --git a/fs/nls/mac-cyrillic.c b/fs/nls/mac-cyrillic.c index 34d5c1c05ff1..18c9e0eb8e58 100644 --- a/fs/nls/mac-cyrillic.c +++ b/fs/nls/mac-cyrillic.c @@ -477,21 +477,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "maccyrillic", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "maccyrillic", + .tables = &table, +}; + static int __init init_nls_maccyrillic(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_maccyrillic(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_maccyrillic) diff --git a/fs/nls/mac-gaelic.c b/fs/nls/mac-gaelic.c index 2aabf5213176..8f8d6ae20f02 100644 --- a/fs/nls/mac-gaelic.c +++ b/fs/nls/mac-gaelic.c @@ -547,21 +547,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "macgaelic", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "macgaelic", + .tables = &table, +}; + static int __init init_nls_macgaelic(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_macgaelic(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_macgaelic) diff --git a/fs/nls/mac-greek.c b/fs/nls/mac-greek.c index df62909ef57e..0e2c12fe3447 100644 --- a/fs/nls/mac-greek.c +++ b/fs/nls/mac-greek.c @@ -477,21 +477,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "macgreek", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "macgreek", + .tables = &table, +}; + static int __init init_nls_macgreek(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_macgreek(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_macgreek) diff --git a/fs/nls/mac-iceland.c b/fs/nls/mac-iceland.c index 8daa68b995bc..414767fa47a4 100644 --- a/fs/nls/mac-iceland.c +++ b/fs/nls/mac-iceland.c @@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "maciceland", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "maciceland", + .tables = &table, +}; + static int __init init_nls_maciceland(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_maciceland(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_maciceland) diff --git a/fs/nls/mac-inuit.c b/fs/nls/mac-inuit.c index b0799693502a..0e06fd3a0c8f 100644 --- a/fs/nls/mac-inuit.c +++ b/fs/nls/mac-inuit.c @@ -512,21 +512,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "macinuit", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "macinuit", + .tables = &table, +}; + static int __init init_nls_macinuit(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_macinuit(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_macinuit) diff --git a/fs/nls/mac-roman.c b/fs/nls/mac-roman.c index ba358b864b05..fcfd387cfaa8 100644 --- a/fs/nls/mac-roman.c +++ b/fs/nls/mac-roman.c @@ -617,21 +617,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "macroman", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "macroman", + .tables = &table, +}; + static int __init init_nls_macroman(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_macroman(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_macroman) diff --git a/fs/nls/mac-romanian.c b/fs/nls/mac-romanian.c index 7a8a7f9a0bbc..74027022a135 100644 --- a/fs/nls/mac-romanian.c +++ b/fs/nls/mac-romanian.c @@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "macromanian", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "macromanian", + .tables = &table, +}; + static int __init init_nls_macromanian(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_macromanian(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_macromanian) diff --git a/fs/nls/mac-turkish.c b/fs/nls/mac-turkish.c index eb3c5e53ec88..0edc0f8b1f4d 100644 --- a/fs/nls/mac-turkish.c +++ b/fs/nls/mac-turkish.c @@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "macturkish", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "macturkish", + .tables = &table, +}; + static int __init init_nls_macturkish(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_macturkish(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_macturkish) diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c index 6bad3e779284..3c3ee908d1ed 100644 --- a/fs/nls/nls_ascii.c +++ b/fs/nls/nls_ascii.c @@ -147,21 +147,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "ascii", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "ascii", + .tables = &table, +}; + static int __init init_nls_ascii(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_ascii(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_ascii) diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c index 3f7de8f4c5b2..200a7f8165e6 100644 --- a/fs/nls/nls_core.c +++ b/fs/nls/nls_core.c @@ -16,13 +16,18 @@ #include #include -static struct nls_table default_table; -static struct nls_table *tables = &default_table; +extern struct nls_charset default_charset; +static struct nls_charset *charsets = &default_charset; static DEFINE_SPINLOCK(nls_lock); +static struct nls_table *nls_load_table(struct nls_charset *charset) +{ + /* For now, return the default table, which is the first one found. */ + return charset->tables; +} -int __register_nls(struct nls_table *nls, struct module *owner) +int __register_nls(struct nls_charset *nls, struct module *owner) { - struct nls_table ** tmp = &tables; + struct nls_charset **tmp = &charsets; if (nls->next) return -EBUSY; @@ -36,16 +41,16 @@ int __register_nls(struct nls_table *nls, struct module *owner) } tmp = &(*tmp)->next; } - nls->next = tables; - tables = nls; + nls->next = charsets; + charsets = nls; spin_unlock(&nls_lock); return 0; } EXPORT_SYMBOL(__register_nls); -int unregister_nls(struct nls_table * nls) +int unregister_nls(struct nls_charset * nls) { - struct nls_table ** tmp = &tables; + struct nls_charset **tmp = &charsets; spin_lock(&nls_lock); while (*tmp) { @@ -60,31 +65,42 @@ int unregister_nls(struct nls_table * nls) return -EINVAL; } -static struct nls_table *find_nls(char *charset) +static struct nls_charset *find_nls(const char *charset) { - struct nls_table *nls; + struct nls_charset *nls; spin_lock(&nls_lock); - for (nls = tables; nls; nls = nls->next) { - if (!strcmp(nls_charset_name(nls), charset)) + for (nls = charsets; nls; nls = nls->next) { + if (!strcmp(nls->charset, charset)) break; if (nls->alias && !strcmp(nls->alias, charset)) break; } - if (nls && !try_module_get(nls->owner)) - nls = NULL; + + if (!nls) + nls = ERR_PTR(-EINVAL); + else if (!try_module_get(nls->owner)) + nls = ERR_PTR(-EBUSY); + spin_unlock(&nls_lock); return nls; } struct nls_table *load_nls(char *charset) { - return try_then_request_module(find_nls(charset), "nls_%s", charset); + struct nls_charset *nls_charset; + + nls_charset = try_then_request_module(find_nls(charset), + "nls_%s", charset); + if (!IS_ERR(nls_charset)) + return NULL; + + return nls_load_table(nls_charset); } void unload_nls(struct nls_table *nls) { if (nls) - module_put(nls->owner); + module_put(nls->charset->owner); } EXPORT_SYMBOL(unregister_nls); diff --git a/fs/nls/nls_cp1250.c b/fs/nls/nls_cp1250.c index 08902e86fc8e..080717694405 100644 --- a/fs/nls/nls_cp1250.c +++ b/fs/nls/nls_cp1250.c @@ -328,20 +328,26 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp1250", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp1250", + .tables = &table, +}; + static int __init init_nls_cp1250(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp1250(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp1250) diff --git a/fs/nls/nls_cp1251.c b/fs/nls/nls_cp1251.c index 2bb88c8cc5bf..2fba498ab289 100644 --- a/fs/nls/nls_cp1251.c +++ b/fs/nls/nls_cp1251.c @@ -282,21 +282,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp1251", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp1251", + .tables = &table, +}; + static int __init init_nls_cp1251(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp1251(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp1251) diff --git a/fs/nls/nls_cp1255.c b/fs/nls/nls_cp1255.c index c6bf8d575c5b..c268e8d8c038 100644 --- a/fs/nls/nls_cp1255.c +++ b/fs/nls/nls_cp1255.c @@ -363,22 +363,28 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp1255", - .alias = "iso8859-8", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .alias = "iso8859-8", + .charset = "cp1255", + .tables = &table, +}; + static int __init init_nls_cp1255(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp1255(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp1255) diff --git a/fs/nls/nls_cp437.c b/fs/nls/nls_cp437.c index 0f3f8bdbb62b..f24f8691e720 100644 --- a/fs/nls/nls_cp437.c +++ b/fs/nls/nls_cp437.c @@ -368,21 +368,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp437", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp437", + .tables = &table, +}; + static int __init init_nls_cp437(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp437(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp437) diff --git a/fs/nls/nls_cp737.c b/fs/nls/nls_cp737.c index 9383359ca25f..f5a8b9e88165 100644 --- a/fs/nls/nls_cp737.c +++ b/fs/nls/nls_cp737.c @@ -331,21 +331,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp737", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp737", + .tables = &table, +}; + static int __init init_nls_cp737(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp737(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp737) diff --git a/fs/nls/nls_cp775.c b/fs/nls/nls_cp775.c index 6c787b9079ed..d268bfb873e4 100644 --- a/fs/nls/nls_cp775.c +++ b/fs/nls/nls_cp775.c @@ -300,21 +300,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp775", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp775", + .tables = &table, +}; + static int __init init_nls_cp775(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp775(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp775) diff --git a/fs/nls/nls_cp850.c b/fs/nls/nls_cp850.c index 50a57138a571..b698b0df65e3 100644 --- a/fs/nls/nls_cp850.c +++ b/fs/nls/nls_cp850.c @@ -296,21 +296,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp850", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp850", + .tables = &table, +}; + static int __init init_nls_cp850(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp850(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp850) diff --git a/fs/nls/nls_cp852.c b/fs/nls/nls_cp852.c index 0cbb199f1cd5..738e95346b34 100644 --- a/fs/nls/nls_cp852.c +++ b/fs/nls/nls_cp852.c @@ -318,21 +318,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp852", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp852", + .tables = &table, +}; + static int __init init_nls_cp852(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp852(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp852) diff --git a/fs/nls/nls_cp855.c b/fs/nls/nls_cp855.c index 530b77c86363..9a1c4e307cb1 100644 --- a/fs/nls/nls_cp855.c +++ b/fs/nls/nls_cp855.c @@ -280,21 +280,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp855", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp855", + .tables = &table, +}; + static int __init init_nls_cp855(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp855(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp855) diff --git a/fs/nls/nls_cp857.c b/fs/nls/nls_cp857.c index 0db642ec6f45..782e31cb9f5a 100644 --- a/fs/nls/nls_cp857.c +++ b/fs/nls/nls_cp857.c @@ -282,21 +282,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp857", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp857", + .tables = &table, +}; + static int __init init_nls_cp857(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp857(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp857) diff --git a/fs/nls/nls_cp860.c b/fs/nls/nls_cp860.c index 44a40dac26bd..2ad1954b84e6 100644 --- a/fs/nls/nls_cp860.c +++ b/fs/nls/nls_cp860.c @@ -345,21 +345,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp860", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp860", + .tables = &table, +}; + static int __init init_nls_cp860(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp860(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp860) diff --git a/fs/nls/nls_cp861.c b/fs/nls/nls_cp861.c index 50e08174fc48..5930b0e6e8f1 100644 --- a/fs/nls/nls_cp861.c +++ b/fs/nls/nls_cp861.c @@ -368,21 +368,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp861", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp861", + .tables = &table, +}; + static int __init init_nls_cp861(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp861(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp861) diff --git a/fs/nls/nls_cp862.c b/fs/nls/nls_cp862.c index 3505f3437972..63c27b24a011 100644 --- a/fs/nls/nls_cp862.c +++ b/fs/nls/nls_cp862.c @@ -402,21 +402,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp862", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp862", + .tables = &table, +}; + static int __init init_nls_cp862(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp862(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp862) diff --git a/fs/nls/nls_cp863.c b/fs/nls/nls_cp863.c index e3489cdc0c04..aa815cdc7481 100644 --- a/fs/nls/nls_cp863.c +++ b/fs/nls/nls_cp863.c @@ -362,21 +362,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp863", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp863", + .tables = &table, +}; + static int __init init_nls_cp863(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp863(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp863) diff --git a/fs/nls/nls_cp864.c b/fs/nls/nls_cp864.c index d4185bc7f1bf..a20725f661e9 100644 --- a/fs/nls/nls_cp864.c +++ b/fs/nls/nls_cp864.c @@ -388,21 +388,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp864", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp864", + .tables = &table, +}; + static int __init init_nls_cp864(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp864(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp864) diff --git a/fs/nls/nls_cp865.c b/fs/nls/nls_cp865.c index 9f468944e577..3d22ec2bd7af 100644 --- a/fs/nls/nls_cp865.c +++ b/fs/nls/nls_cp865.c @@ -368,21 +368,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp865", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp865", + .tables = &table, +}; + static int __init init_nls_cp865(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp865(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp865) diff --git a/fs/nls/nls_cp866.c b/fs/nls/nls_cp866.c index ee46fd5a76b1..35dc7b2f023a 100644 --- a/fs/nls/nls_cp866.c +++ b/fs/nls/nls_cp866.c @@ -286,21 +286,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp866", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp866", + .tables = &table, +}; + static int __init init_nls_cp866(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp866(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp866) diff --git a/fs/nls/nls_cp869.c b/fs/nls/nls_cp869.c index da29a4a53e1d..56504ab0f405 100644 --- a/fs/nls/nls_cp869.c +++ b/fs/nls/nls_cp869.c @@ -296,21 +296,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp869", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "cp869", + .tables = &table, +}; + static int __init init_nls_cp869(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp869(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp869) diff --git a/fs/nls/nls_cp874.c b/fs/nls/nls_cp874.c index 642659b9ed89..41394620d000 100644 --- a/fs/nls/nls_cp874.c +++ b/fs/nls/nls_cp874.c @@ -254,22 +254,28 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp874", - .alias = "tis-620", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .alias = "tis-620", + .charset = "cp874", + .tables = &table, +}; + static int __init init_nls_cp874(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp874(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp874) diff --git a/fs/nls/nls_cp932.c b/fs/nls/nls_cp932.c index 3e7bdefdca90..25fe26fb2603 100644 --- a/fs/nls/nls_cp932.c +++ b/fs/nls/nls_cp932.c @@ -7912,22 +7912,28 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp932", - .alias = "sjis", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .alias = "sjis", + .charset = "cp932", + .tables = &table, +}; + static int __init init_nls_cp932(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp932(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp932) diff --git a/fs/nls/nls_cp936.c b/fs/nls/nls_cp936.c index b1fa2918992b..766f86b53a7b 100644 --- a/fs/nls/nls_cp936.c +++ b/fs/nls/nls_cp936.c @@ -11090,22 +11090,28 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp936", - .alias = "gb2312", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .alias = "gb2312", + .charset = "cp936", + .tables = &table, +}; + static int __init init_nls_cp936(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp936(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp936) diff --git a/fs/nls/nls_cp949.c b/fs/nls/nls_cp949.c index 1d334095d86c..138eec74bb3f 100644 --- a/fs/nls/nls_cp949.c +++ b/fs/nls/nls_cp949.c @@ -13925,22 +13925,28 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp949", - .alias = "euc-kr", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .alias = "euc-kr", + .charset = "cp949", + .tables = &table, +}; + static int __init init_nls_cp949(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp949(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp949) diff --git a/fs/nls/nls_cp950.c b/fs/nls/nls_cp950.c index d936160a48f9..899da09fe0d7 100644 --- a/fs/nls/nls_cp950.c +++ b/fs/nls/nls_cp950.c @@ -9461,22 +9461,28 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "cp950", - .alias = "big5", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .alias = "big5", + .charset = "cp950", + .tables = &table, +}; + static int __init init_nls_cp950(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_cp950(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_cp950) diff --git a/fs/nls/nls_default.c b/fs/nls/nls_default.c index c5d7e8391b22..ef8c0efb8a3c 100644 --- a/fs/nls/nls_default.c +++ b/fs/nls/nls_default.c @@ -17,7 +17,7 @@ #include #include -static struct nls_table default_table; +struct nls_charset default_charset; struct utf8_table { int cmask; @@ -453,12 +453,17 @@ static const struct nls_ops charset_ops = { }; static struct nls_table default_table = { - .charset = "default", + .charset = &default_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +struct nls_charset default_charset = { + .charset = "default", + .tables = &default_table, +}; + /* Returns a simple default translation table */ struct nls_table *load_nls_default(void) { diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c index 0af73982738b..8bc5d9991452 100644 --- a/fs/nls/nls_euc-jp.c +++ b/fs/nls/nls_euc-jp.c @@ -554,11 +554,17 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "euc-jp", + .charset = &nls_charset, .ops = &charset_ops, }; +static struct nls_charset nls_charset = { + .charset = "euc-jp", + .tables = &table, +}; + static int __init init_nls_euc_jp(void) { p_nls = load_nls("cp932"); @@ -566,7 +572,7 @@ static int __init init_nls_euc_jp(void) if (p_nls) { table.charset2upper = p_nls->charset2upper; table.charset2lower = p_nls->charset2lower; - return register_nls(&table); + return register_nls(&nls_charset); } return -EINVAL; @@ -574,7 +580,7 @@ static int __init init_nls_euc_jp(void) static void __exit exit_nls_euc_jp(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); unload_nls(p_nls); } diff --git a/fs/nls/nls_iso8859-1.c b/fs/nls/nls_iso8859-1.c index 6212b2925fa0..78e9c0169f69 100644 --- a/fs/nls/nls_iso8859-1.c +++ b/fs/nls/nls_iso8859-1.c @@ -238,21 +238,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-1", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-1", + .tables = &table, +}; + static int __init init_nls_iso8859_1(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_1(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_1) diff --git a/fs/nls/nls_iso8859-13.c b/fs/nls/nls_iso8859-13.c index 8f0a23109207..eb8665629e0f 100644 --- a/fs/nls/nls_iso8859-13.c +++ b/fs/nls/nls_iso8859-13.c @@ -266,21 +266,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-13", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-13", + .tables = &table, +}; + static int __init init_nls_iso8859_13(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_13(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_13) diff --git a/fs/nls/nls_iso8859-14.c b/fs/nls/nls_iso8859-14.c index 80ab77f37480..c8d5a48f869c 100644 --- a/fs/nls/nls_iso8859-14.c +++ b/fs/nls/nls_iso8859-14.c @@ -322,21 +322,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-14", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-14", + .tables = &table, +}; + static int __init init_nls_iso8859_14(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_14(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_14) diff --git a/fs/nls/nls_iso8859-15.c b/fs/nls/nls_iso8859-15.c index 5c02f93e7b20..0611c6cb56b4 100644 --- a/fs/nls/nls_iso8859-15.c +++ b/fs/nls/nls_iso8859-15.c @@ -288,21 +288,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-15", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-15", + .tables = &table, +}; + static int __init init_nls_iso8859_15(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_15(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_15) diff --git a/fs/nls/nls_iso8859-2.c b/fs/nls/nls_iso8859-2.c index 97afc1233da1..5255d92a25eb 100644 --- a/fs/nls/nls_iso8859-2.c +++ b/fs/nls/nls_iso8859-2.c @@ -289,21 +289,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-2", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-2", + .tables = &table, +}; + static int __init init_nls_iso8859_2(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_2(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_2) diff --git a/fs/nls/nls_iso8859-3.c b/fs/nls/nls_iso8859-3.c index f835fcec3aae..ad1b84f3e102 100644 --- a/fs/nls/nls_iso8859-3.c +++ b/fs/nls/nls_iso8859-3.c @@ -289,21 +289,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-3", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-3", + .tables = &table, +}; + static int __init init_nls_iso8859_3(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_3(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_3) diff --git a/fs/nls/nls_iso8859-4.c b/fs/nls/nls_iso8859-4.c index 14acb68fb013..82469deee0ba 100644 --- a/fs/nls/nls_iso8859-4.c +++ b/fs/nls/nls_iso8859-4.c @@ -289,21 +289,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-4", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-4", + .tables = &table, +}; + static int __init init_nls_iso8859_4(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_4(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_4) diff --git a/fs/nls/nls_iso8859-5.c b/fs/nls/nls_iso8859-5.c index f559bbb25045..3f3cd0c28797 100644 --- a/fs/nls/nls_iso8859-5.c +++ b/fs/nls/nls_iso8859-5.c @@ -253,21 +253,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-5", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-5", + .tables = &table, +}; + static int __init init_nls_iso8859_5(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_5(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_5) diff --git a/fs/nls/nls_iso8859-6.c b/fs/nls/nls_iso8859-6.c index e3d7e28363b8..43e6675998bc 100644 --- a/fs/nls/nls_iso8859-6.c +++ b/fs/nls/nls_iso8859-6.c @@ -244,21 +244,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-6", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-6", + .tables = &table, +}; + static int __init init_nls_iso8859_6(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_6(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_6) diff --git a/fs/nls/nls_iso8859-7.c b/fs/nls/nls_iso8859-7.c index 49fd2b24e492..83893e487f82 100644 --- a/fs/nls/nls_iso8859-7.c +++ b/fs/nls/nls_iso8859-7.c @@ -298,21 +298,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-7", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-7", + .tables = &table, +}; + static int __init init_nls_iso8859_7(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_7(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_7) diff --git a/fs/nls/nls_iso8859-9.c b/fs/nls/nls_iso8859-9.c index 876696f89626..df03f97cd9d1 100644 --- a/fs/nls/nls_iso8859-9.c +++ b/fs/nls/nls_iso8859-9.c @@ -253,21 +253,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "iso8859-9", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "iso8859-9", + .tables = &table, +}; + static int __init init_nls_iso8859_9(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_iso8859_9(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_iso8859_9) diff --git a/fs/nls/nls_koi8-r.c b/fs/nls/nls_koi8-r.c index 6a85211402a8..22918e154dbe 100644 --- a/fs/nls/nls_koi8-r.c +++ b/fs/nls/nls_koi8-r.c @@ -304,21 +304,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "koi8-r", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "koi8-r", + .tables = &table, +}; + static int __init init_nls_koi8_r(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_koi8_r(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_koi8_r) diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c index c4e382fd0f13..f4edbc313706 100644 --- a/fs/nls/nls_koi8-ru.c +++ b/fs/nls/nls_koi8-ru.c @@ -56,11 +56,17 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "koi8-ru", + .charset = &nls_charset, .ops = &charset_ops, }; +static struct nls_charset nls_charset = { + .charset = "koi8-ru", + .tables = &table, +}; + static int __init init_nls_koi8_ru(void) { p_nls = load_nls("koi8-u"); @@ -68,7 +74,7 @@ static int __init init_nls_koi8_ru(void) if (p_nls) { table.charset2upper = p_nls->charset2upper; table.charset2lower = p_nls->charset2lower; - return register_nls(&table); + return register_nls(&nls_charset); } return -EINVAL; @@ -76,7 +82,7 @@ static int __init init_nls_koi8_ru(void) static void __exit exit_nls_koi8_ru(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); unload_nls(p_nls); } diff --git a/fs/nls/nls_koi8-u.c b/fs/nls/nls_koi8-u.c index 5f91e9cdb165..b2421625e98b 100644 --- a/fs/nls/nls_koi8-u.c +++ b/fs/nls/nls_koi8-u.c @@ -311,21 +311,27 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "koi8-u", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = charset2lower, .charset2upper = charset2upper, }; +static struct nls_charset nls_charset = { + .charset = "koi8-u", + .tables = &table, +}; + static int __init init_nls_koi8_u(void) { - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_koi8_u(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_koi8_u) diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8.c index 6988fffd5cf6..aecf460827ac 100644 --- a/fs/nls/nls_utf8.c +++ b/fs/nls/nls_utf8.c @@ -45,25 +45,31 @@ static const struct nls_ops charset_ops = { .char2uni = char2uni, }; +static struct nls_charset nls_charset; static struct nls_table table = { - .charset = "utf8", + .charset = &nls_charset, .ops = &charset_ops, .charset2lower = identity, /* no conversion */ .charset2upper = identity, }; +static struct nls_charset nls_charset = { + .charset = "utf8", + .tables = &table, +}; + static int __init init_nls_utf8(void) { int i; for (i=0; i<256; i++) identity[i] = i; - return register_nls(&table); + return register_nls(&nls_charset); } static void __exit exit_nls_utf8(void) { - unregister_nls(&table); + unregister_nls(&nls_charset); } module_init(init_nls_utf8) diff --git a/include/linux/nls.h b/include/linux/nls.h index 5d63fe6aa55e..cdc95cd9e5d4 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -29,15 +29,21 @@ struct nls_ops { }; struct nls_table { - const char *charset; - const char *alias; + const struct nls_charset *charset; const struct nls_ops *ops; const unsigned char *charset2lower; const unsigned char *charset2upper; - struct module *owner; struct nls_table *next; }; +struct nls_charset { + const char *charset; + const char *alias; + struct module *owner; + struct nls_table *tables; + struct nls_charset *next; +}; + /* this value hold the maximum octet of charset */ #define NLS_MAX_CHARSET_SIZE 6 /* for UTF-8 */ @@ -49,8 +55,8 @@ enum utf16_endian { }; /* nls_base.c */ -extern int __register_nls(struct nls_table *, struct module *); -extern int unregister_nls(struct nls_table *); +extern int __register_nls(struct nls_charset *, struct module *); +extern int unregister_nls(struct nls_charset *); extern struct nls_table *load_nls(char *); extern void unload_nls(struct nls_table *); extern struct nls_table *load_nls_default(void); @@ -78,7 +84,7 @@ static inline int nls_char2uni(const struct nls_table *table, static inline const char *nls_charset_name(const struct nls_table *table) { - return table->charset; + return table->charset->charset; } static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c) From patchwork Thu Dec 6 23:08:46 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717187 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E11CB15A6 for ; Thu, 6 Dec 2018 23:09:35 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D2B032F205 for ; Thu, 6 Dec 2018 23:09:35 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C772C2F21E; Thu, 6 Dec 2018 23:09:35 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 573B12F205 for ; Thu, 6 Dec 2018 23:09:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726298AbeLFXJd (ORCPT ); Thu, 6 Dec 2018 18:09:33 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56082 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726270AbeLFXJb (ORCPT ); Thu, 6 Dec 2018 18:09:31 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id CFE0427E03E From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 06/23] nls: Add support for multiple versions of an encoding Date: Thu, 6 Dec 2018 18:08:46 -0500 Message-Id: <20181206230903.30011-7-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi This allows a user to request a specific version of an encoding, like the version 10.0.0 of Unicode encoded in utf8. Supporting specific versions of encodings is important to ensure stability of names in filesystems, specially when doing transformations like casefold and normalization. Even for unicode, where defined code-points are stable, there is instability for code points that weren't defined on a previous version, so the user might want to use an older version of the encoding to ensure the encoding is exact. Not every NLS charset supports this feature. It doesn't make sense for many of them, like ASCII. Others just don't implement it yet, and never will. In those cases, the interface allows the caller to get the un-versioned charset, which is the same original behavior as if this patch weren't applied. A user that is not interested in a specific version can also ask for a versioned charset without specifying the version, and in this case, NLS will return the latest version available of that charset. Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/nls_core.c | 45 ++++++++++++++++++++++++++++++++++++++------- include/linux/nls.h | 8 ++++++++ 2 files changed, 46 insertions(+), 7 deletions(-) diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c index 200a7f8165e6..20e00a8b968c 100644 --- a/fs/nls/nls_core.c +++ b/fs/nls/nls_core.c @@ -19,10 +19,26 @@ extern struct nls_charset default_charset; static struct nls_charset *charsets = &default_charset; static DEFINE_SPINLOCK(nls_lock); -static struct nls_table *nls_load_table(struct nls_charset *charset) + +static struct nls_table *nls_load_table(struct nls_charset *charset, + const char *version, + unsigned int flags) { - /* For now, return the default table, which is the first one found. */ - return charset->tables; + struct nls_table *tbl; + + /* If there is no load_table hook, only 1 table is supported and + * it must have been loaded statically. + */ + if (charset->load_table) + tbl = charset->load_table(version, flags); + else + tbl = charset->tables; + + if (IS_ERR(tbl)) + return tbl; + + tbl->flags = flags; + return tbl; } int __register_nls(struct nls_charset *nls, struct module *owner) @@ -85,21 +101,36 @@ static struct nls_charset *find_nls(const char *charset) return nls; } -struct nls_table *load_nls(char *charset) +struct nls_table *load_nls_version(const char *charset, const char *version, + unsigned int flags) { struct nls_charset *nls_charset; nls_charset = try_then_request_module(find_nls(charset), "nls_%s", charset); - if (!IS_ERR(nls_charset)) + if (IS_ERR(nls_charset)) + return ERR_PTR(-EINVAL); + + return nls_load_table(nls_charset, version, flags); +} +EXPORT_SYMBOL(load_nls_version); + +struct nls_table *load_nls(char *charset) +{ + struct nls_table *table = load_nls_version(charset, NULL, 0); + + /* Pre-versioned load_nls() didn't return error pointers. Let's + * keep the abi for now to prevent breakage. + */ + if (IS_ERR(table)) return NULL; - return nls_load_table(nls_charset); + return table; } void unload_nls(struct nls_table *nls) { - if (nls) + if (!IS_ERR_OR_NULL(nls)) module_put(nls->charset->owner); } diff --git a/include/linux/nls.h b/include/linux/nls.h index cdc95cd9e5d4..91524bb4477b 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -30,6 +30,9 @@ struct nls_ops { struct nls_table { const struct nls_charset *charset; + unsigned int version; + unsigned int flags; + const struct nls_ops *ops; const unsigned char *charset2lower; const unsigned char *charset2upper; @@ -42,6 +45,8 @@ struct nls_charset { struct module *owner; struct nls_table *tables; struct nls_charset *next; + struct nls_table *(*load_table)(const char *version, + unsigned int flags); }; /* this value hold the maximum octet of charset */ @@ -58,6 +63,9 @@ enum utf16_endian { extern int __register_nls(struct nls_charset *, struct module *); extern int unregister_nls(struct nls_charset *); extern struct nls_table *load_nls(char *); +extern struct nls_table *load_nls_version(const char *charset, + const char *version, + unsigned int flags); extern void unload_nls(struct nls_table *); extern struct nls_table *load_nls_default(void); #define register_nls(nls) __register_nls((nls), THIS_MODULE) From patchwork Thu Dec 6 23:08:47 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717191 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DCAD813AF for ; Thu, 6 Dec 2018 23:09:37 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CDD7D2F205 for ; Thu, 6 Dec 2018 23:09:37 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C29732F21D; Thu, 6 Dec 2018 23:09:37 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5D8F02F205 for ; Thu, 6 Dec 2018 23:09:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726299AbeLFXJf (ORCPT ); Thu, 6 Dec 2018 18:09:35 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56086 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726270AbeLFXJf (ORCPT ); Thu, 6 Dec 2018 18:09:35 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id C5DE427ED45 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 07/23] nls: Implement NLS_STRICT_MODE flag Date: Thu, 6 Dec 2018 18:08:47 -0500 Message-Id: <20181206230903.30011-8-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi The flag NLS_STRICT_MODE indicates whether NLS should reject invalid characters or ignore them. Support for this relies on the .validate() hook, which is implemented by each charset and states whether a given string is valid within that charset. Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/nls_core.c | 11 +++++++++++ include/linux/nls.h | 25 +++++++++++++++++++++++++ 2 files changed, 36 insertions(+) diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c index 20e00a8b968c..49a15bb2174f 100644 --- a/fs/nls/nls_core.c +++ b/fs/nls/nls_core.c @@ -20,6 +20,14 @@ extern struct nls_charset default_charset; static struct nls_charset *charsets = &default_charset; static DEFINE_SPINLOCK(nls_lock); +static int nls_validate_flags(struct nls_table *table, unsigned int flags) +{ + if (flags & NLS_STRICT_MODE && !table->ops->validate) + return -1; + + return 0; +} + static struct nls_table *nls_load_table(struct nls_charset *charset, const char *version, unsigned int flags) @@ -37,6 +45,9 @@ static struct nls_table *nls_load_table(struct nls_charset *charset, if (IS_ERR(tbl)) return tbl; + if (nls_validate_flags(tbl, flags) < 0) + return ERR_PTR(-EINVAL); + tbl->flags = flags; return tbl; } diff --git a/include/linux/nls.h b/include/linux/nls.h index 91524bb4477b..9f61015a54bf 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -22,10 +22,22 @@ typedef u16 wchar_t; /* Arbitrary Unicode character */ typedef u32 unicode_t; +struct nls_table; + struct nls_ops { int (*uni2char) (wchar_t uni, unsigned char *out, int boundlen); int (*char2uni) (const unsigned char *rawstring, int boundlen, wchar_t *uni); + /** + * @validate: + * + * Returns 0 if the argument is a valid string in this charset. + * Otherwise, return non-zero. + * + * This is required iff the charset supports strict mode. + **/ + int (*validate)(const struct nls_table *charset, + const unsigned char *str, size_t len); }; struct nls_table { @@ -59,6 +71,13 @@ enum utf16_endian { UTF16_BIG_ENDIAN }; +#define NLS_STRICT_MODE 0x00000001 + +static inline int IS_STRICT_MODE(const struct nls_table *charset) +{ + return (charset->flags & NLS_STRICT_MODE); +} + /* nls_base.c */ extern int __register_nls(struct nls_charset *, struct module *); extern int unregister_nls(struct nls_charset *); @@ -90,6 +109,12 @@ static inline int nls_char2uni(const struct nls_table *table, return table->ops->char2uni(rawstring, boundlen, uni); } +static inline int nls_validate(const struct nls_table *t, const unsigned char *str, + const size_t len) +{ + return t->ops->validate(t, str, len); +} + static inline const char *nls_charset_name(const struct nls_table *table) { return table->charset->charset; From patchwork Thu Dec 6 23:08:48 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717195 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E46D213AF for ; Thu, 6 Dec 2018 23:09:49 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CBFD02F212 for ; Thu, 6 Dec 2018 23:09:49 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C07372F21E; Thu, 6 Dec 2018 23:09:49 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 10A3C2F212 for ; Thu, 6 Dec 2018 23:09:47 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726324AbeLFXJp (ORCPT ); Thu, 6 Dec 2018 18:09:45 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56090 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726270AbeLFXJo (ORCPT ); Thu, 6 Dec 2018 18:09:44 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 0B82027ED58 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 08/23] nls: Let charsets define the behavior of tolower/toupper Date: Thu, 6 Dec 2018 18:08:48 -0500 Message-Id: <20181206230903.30011-9-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi Instead of always reading from a table, give the charset a chance to implement tolower() and toupper() algorithmically. This allow us to drop a lot of tables which hardcode the identity functions (like ASCII), and replace them with a few lines of code in the hooks. This patch was created using the semantic patch below, with the exception of the header files (hook definitions) and a fix to files that didn't have the tables statically allocated (koi8-u and cp932). @tbl@ identifier p; expression lower_tbl; expression upper_tbl; @@ static struct nls_table p = { - .charset2lower = lower_tbl, - .charset2upper = upper_tbl, }; @@ identifier charset_ops; expression tbl.lower_tbl; expression tbl.upper_tbl; @@ + static unsigned char charset_tolower(const struct nls_table *table, unsigned int c) + { + return lower_tbl[c]; + } + + static unsigned char charset_toupper(const struct nls_table *table, unsigned int c) + { + return upper_tbl[c]; + } static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, }; @@ struct nls_table *t; expression A; expression nc; @@ ( - nc = t->charset2lower[A] + nc = nls_tolower(t, A) | - nc = t->charset2upper[A] + nc = nls_toupper(t, A) ) <... - if(!nc) - nc = A; ...> Signed-off-by: Gabriel Krisman Bertazi --- fs/fat/dir.c | 5 +---- fs/nls/mac-celtic.c | 14 ++++++++++++-- fs/nls/mac-centeuro.c | 14 ++++++++++++-- fs/nls/mac-croatian.c | 14 ++++++++++++-- fs/nls/mac-cyrillic.c | 14 ++++++++++++-- fs/nls/mac-gaelic.c | 14 ++++++++++++-- fs/nls/mac-greek.c | 14 ++++++++++++-- fs/nls/mac-iceland.c | 14 ++++++++++++-- fs/nls/mac-inuit.c | 14 ++++++++++++-- fs/nls/mac-roman.c | 14 ++++++++++++-- fs/nls/mac-romanian.c | 14 ++++++++++++-- fs/nls/mac-turkish.c | 14 ++++++++++++-- fs/nls/nls_ascii.c | 14 ++++++++++++-- fs/nls/nls_cp1250.c | 14 ++++++++++++-- fs/nls/nls_cp1251.c | 14 ++++++++++++-- fs/nls/nls_cp1255.c | 14 ++++++++++++-- fs/nls/nls_cp437.c | 14 ++++++++++++-- fs/nls/nls_cp737.c | 14 ++++++++++++-- fs/nls/nls_cp775.c | 14 ++++++++++++-- fs/nls/nls_cp850.c | 14 ++++++++++++-- fs/nls/nls_cp852.c | 14 ++++++++++++-- fs/nls/nls_cp855.c | 14 ++++++++++++-- fs/nls/nls_cp857.c | 14 ++++++++++++-- fs/nls/nls_cp860.c | 14 ++++++++++++-- fs/nls/nls_cp861.c | 14 ++++++++++++-- fs/nls/nls_cp862.c | 14 ++++++++++++-- fs/nls/nls_cp863.c | 14 ++++++++++++-- fs/nls/nls_cp864.c | 14 ++++++++++++-- fs/nls/nls_cp865.c | 14 ++++++++++++-- fs/nls/nls_cp866.c | 14 ++++++++++++-- fs/nls/nls_cp869.c | 14 ++++++++++++-- fs/nls/nls_cp874.c | 14 ++++++++++++-- fs/nls/nls_cp932.c | 14 ++++++++++++-- fs/nls/nls_cp936.c | 14 ++++++++++++-- fs/nls/nls_cp949.c | 14 ++++++++++++-- fs/nls/nls_cp950.c | 14 ++++++++++++-- fs/nls/nls_default.c | 14 ++++++++++++-- fs/nls/nls_euc-jp.c | 7 ++++--- fs/nls/nls_iso8859-1.c | 14 ++++++++++++-- fs/nls/nls_iso8859-13.c | 14 ++++++++++++-- fs/nls/nls_iso8859-14.c | 14 ++++++++++++-- fs/nls/nls_iso8859-15.c | 14 ++++++++++++-- fs/nls/nls_iso8859-2.c | 14 ++++++++++++-- fs/nls/nls_iso8859-3.c | 14 ++++++++++++-- fs/nls/nls_iso8859-4.c | 14 ++++++++++++-- fs/nls/nls_iso8859-5.c | 14 ++++++++++++-- fs/nls/nls_iso8859-6.c | 14 ++++++++++++-- fs/nls/nls_iso8859-7.c | 14 ++++++++++++-- fs/nls/nls_iso8859-9.c | 14 ++++++++++++-- fs/nls/nls_koi8-r.c | 14 ++++++++++++-- fs/nls/nls_koi8-ru.c | 6 +++--- fs/nls/nls_koi8-u.c | 14 ++++++++++++-- fs/nls/nls_utf8.c | 14 ++++++++++++-- include/linux/nls.h | 17 +++++++++++------ 54 files changed, 619 insertions(+), 116 deletions(-) diff --git a/fs/fat/dir.c b/fs/fat/dir.c index d5f856651a08..6518886ee5cf 100644 --- a/fs/fat/dir.c +++ b/fs/fat/dir.c @@ -215,10 +215,7 @@ fat_short2lower_uni(struct nls_table *t, unsigned char *c, *uni = 0x003f; /* a question mark */ charlen = 1; } else if (charlen <= 1) { - unsigned char nc = t->charset2lower[*c]; - - if (!nc) - nc = *c; + unsigned char nc = nls_tolower(t, *c); charlen = nls_char2uni(t, &nc, 1, uni); if (charlen < 0) { diff --git a/fs/nls/mac-celtic.c b/fs/nls/mac-celtic.c index 4fe7347c55d6..7207f9a14342 100644 --- a/fs/nls/mac-celtic.c +++ b/fs/nls/mac-celtic.c @@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -586,8 +598,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-centeuro.c b/fs/nls/mac-centeuro.c index 2d115aae4240..0664408e4451 100644 --- a/fs/nls/mac-centeuro.c +++ b/fs/nls/mac-centeuro.c @@ -507,7 +507,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -516,8 +528,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-croatian.c b/fs/nls/mac-croatian.c index b496b85fcde1..a4b7992ef8ec 100644 --- a/fs/nls/mac-croatian.c +++ b/fs/nls/mac-croatian.c @@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -586,8 +598,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-cyrillic.c b/fs/nls/mac-cyrillic.c index 18c9e0eb8e58..cb60563911ea 100644 --- a/fs/nls/mac-cyrillic.c +++ b/fs/nls/mac-cyrillic.c @@ -472,7 +472,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -481,8 +493,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-gaelic.c b/fs/nls/mac-gaelic.c index 8f8d6ae20f02..e683881f4a13 100644 --- a/fs/nls/mac-gaelic.c +++ b/fs/nls/mac-gaelic.c @@ -542,7 +542,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -551,8 +563,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-greek.c b/fs/nls/mac-greek.c index 0e2c12fe3447..bd2245238512 100644 --- a/fs/nls/mac-greek.c +++ b/fs/nls/mac-greek.c @@ -472,7 +472,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -481,8 +493,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-iceland.c b/fs/nls/mac-iceland.c index 414767fa47a4..3ce3e27b3660 100644 --- a/fs/nls/mac-iceland.c +++ b/fs/nls/mac-iceland.c @@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -586,8 +598,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-inuit.c b/fs/nls/mac-inuit.c index 0e06fd3a0c8f..6f12cccccb37 100644 --- a/fs/nls/mac-inuit.c +++ b/fs/nls/mac-inuit.c @@ -507,7 +507,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -516,8 +528,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-roman.c b/fs/nls/mac-roman.c index fcfd387cfaa8..d8e411c82c69 100644 --- a/fs/nls/mac-roman.c +++ b/fs/nls/mac-roman.c @@ -612,7 +612,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -621,8 +633,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-romanian.c b/fs/nls/mac-romanian.c index 74027022a135..cd638dfe9d7c 100644 --- a/fs/nls/mac-romanian.c +++ b/fs/nls/mac-romanian.c @@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -586,8 +598,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/mac-turkish.c b/fs/nls/mac-turkish.c index 0edc0f8b1f4d..82ba6f6b4c24 100644 --- a/fs/nls/mac-turkish.c +++ b/fs/nls/mac-turkish.c @@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -586,8 +598,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c index 3c3ee908d1ed..2f4826478d3d 100644 --- a/fs/nls/nls_ascii.c +++ b/fs/nls/nls_ascii.c @@ -142,7 +142,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -151,8 +163,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp1250.c b/fs/nls/nls_cp1250.c index 080717694405..1cfe65851185 100644 --- a/fs/nls/nls_cp1250.c +++ b/fs/nls/nls_cp1250.c @@ -323,7 +323,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -332,8 +344,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp1251.c b/fs/nls/nls_cp1251.c index 2fba498ab289..061eb23892f1 100644 --- a/fs/nls/nls_cp1251.c +++ b/fs/nls/nls_cp1251.c @@ -277,7 +277,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -286,8 +298,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp1255.c b/fs/nls/nls_cp1255.c index c268e8d8c038..2a71dc175c9b 100644 --- a/fs/nls/nls_cp1255.c +++ b/fs/nls/nls_cp1255.c @@ -358,7 +358,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -367,8 +379,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp437.c b/fs/nls/nls_cp437.c index f24f8691e720..4f763761b699 100644 --- a/fs/nls/nls_cp437.c +++ b/fs/nls/nls_cp437.c @@ -363,7 +363,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -372,8 +384,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp737.c b/fs/nls/nls_cp737.c index f5a8b9e88165..2f2ab91340e7 100644 --- a/fs/nls/nls_cp737.c +++ b/fs/nls/nls_cp737.c @@ -326,7 +326,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -335,8 +347,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp775.c b/fs/nls/nls_cp775.c index d268bfb873e4..92f311e620f3 100644 --- a/fs/nls/nls_cp775.c +++ b/fs/nls/nls_cp775.c @@ -295,7 +295,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -304,8 +316,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp850.c b/fs/nls/nls_cp850.c index b698b0df65e3..77cdce20ced6 100644 --- a/fs/nls/nls_cp850.c +++ b/fs/nls/nls_cp850.c @@ -291,7 +291,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -300,8 +312,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp852.c b/fs/nls/nls_cp852.c index 738e95346b34..47722904e9f1 100644 --- a/fs/nls/nls_cp852.c +++ b/fs/nls/nls_cp852.c @@ -313,7 +313,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -322,8 +334,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp855.c b/fs/nls/nls_cp855.c index 9a1c4e307cb1..b52709886900 100644 --- a/fs/nls/nls_cp855.c +++ b/fs/nls/nls_cp855.c @@ -275,7 +275,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -284,8 +296,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp857.c b/fs/nls/nls_cp857.c index 782e31cb9f5a..fcdf30a540f8 100644 --- a/fs/nls/nls_cp857.c +++ b/fs/nls/nls_cp857.c @@ -277,7 +277,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -286,8 +298,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp860.c b/fs/nls/nls_cp860.c index 2ad1954b84e6..a1504424e923 100644 --- a/fs/nls/nls_cp860.c +++ b/fs/nls/nls_cp860.c @@ -340,7 +340,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -349,8 +361,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp861.c b/fs/nls/nls_cp861.c index 5930b0e6e8f1..9fa1f54cee0d 100644 --- a/fs/nls/nls_cp861.c +++ b/fs/nls/nls_cp861.c @@ -363,7 +363,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -372,8 +384,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp862.c b/fs/nls/nls_cp862.c index 63c27b24a011..00474e2b2102 100644 --- a/fs/nls/nls_cp862.c +++ b/fs/nls/nls_cp862.c @@ -397,7 +397,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -406,8 +418,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp863.c b/fs/nls/nls_cp863.c index aa815cdc7481..908e573c1c42 100644 --- a/fs/nls/nls_cp863.c +++ b/fs/nls/nls_cp863.c @@ -357,7 +357,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -366,8 +378,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp864.c b/fs/nls/nls_cp864.c index a20725f661e9..6cae9e9c73aa 100644 --- a/fs/nls/nls_cp864.c +++ b/fs/nls/nls_cp864.c @@ -383,7 +383,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -392,8 +404,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp865.c b/fs/nls/nls_cp865.c index 3d22ec2bd7af..5aa6415ec357 100644 --- a/fs/nls/nls_cp865.c +++ b/fs/nls/nls_cp865.c @@ -363,7 +363,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -372,8 +384,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp866.c b/fs/nls/nls_cp866.c index 35dc7b2f023a..f24b73839680 100644 --- a/fs/nls/nls_cp866.c +++ b/fs/nls/nls_cp866.c @@ -281,7 +281,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -290,8 +302,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp869.c b/fs/nls/nls_cp869.c index 56504ab0f405..c2ba80140906 100644 --- a/fs/nls/nls_cp869.c +++ b/fs/nls/nls_cp869.c @@ -291,7 +291,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -300,8 +312,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp874.c b/fs/nls/nls_cp874.c index 41394620d000..844bb205deee 100644 --- a/fs/nls/nls_cp874.c +++ b/fs/nls/nls_cp874.c @@ -249,7 +249,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -258,8 +270,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp932.c b/fs/nls/nls_cp932.c index 25fe26fb2603..0a5db2a0a6b3 100644 --- a/fs/nls/nls_cp932.c +++ b/fs/nls/nls_cp932.c @@ -7907,7 +7907,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return -EINVAL; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -7916,8 +7928,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp936.c b/fs/nls/nls_cp936.c index 766f86b53a7b..6b0d725cdfab 100644 --- a/fs/nls/nls_cp936.c +++ b/fs/nls/nls_cp936.c @@ -11085,7 +11085,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -11094,8 +11106,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp949.c b/fs/nls/nls_cp949.c index 138eec74bb3f..292c2d02d2c2 100644 --- a/fs/nls/nls_cp949.c +++ b/fs/nls/nls_cp949.c @@ -13920,7 +13920,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -13929,8 +13941,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_cp950.c b/fs/nls/nls_cp950.c index 899da09fe0d7..d4e35bfd8dbd 100644 --- a/fs/nls/nls_cp950.c +++ b/fs/nls/nls_cp950.c @@ -9456,7 +9456,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -9465,8 +9477,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_default.c b/fs/nls/nls_default.c index ef8c0efb8a3c..602eeec24b3d 100644 --- a/fs/nls/nls_default.c +++ b/fs/nls/nls_default.c @@ -447,7 +447,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -455,8 +467,6 @@ static const struct nls_ops charset_ops = { static struct nls_table default_table = { .charset = &default_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; struct nls_charset default_charset = { diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c index 8bc5d9991452..b3a81350cbea 100644 --- a/fs/nls/nls_euc-jp.c +++ b/fs/nls/nls_euc-jp.c @@ -549,7 +549,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return euc_offset; } -static const struct nls_ops charset_ops = { +static struct nls_ops charset_ops = { .uni2char = uni2char, .char2uni = char2uni, }; @@ -570,8 +570,9 @@ static int __init init_nls_euc_jp(void) p_nls = load_nls("cp932"); if (p_nls) { - table.charset2upper = p_nls->charset2upper; - table.charset2lower = p_nls->charset2lower; + + charset_ops.uppercase = p_nls->ops->uppercase; + charset_ops.lowercase = p_nls->ops->lowercase; return register_nls(&nls_charset); } diff --git a/fs/nls/nls_iso8859-1.c b/fs/nls/nls_iso8859-1.c index 78e9c0169f69..a98298bd5de5 100644 --- a/fs/nls/nls_iso8859-1.c +++ b/fs/nls/nls_iso8859-1.c @@ -233,7 +233,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -242,8 +254,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-13.c b/fs/nls/nls_iso8859-13.c index eb8665629e0f..811f4cf1d1a3 100644 --- a/fs/nls/nls_iso8859-13.c +++ b/fs/nls/nls_iso8859-13.c @@ -261,7 +261,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -270,8 +282,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-14.c b/fs/nls/nls_iso8859-14.c index c8d5a48f869c..d8dafca31d26 100644 --- a/fs/nls/nls_iso8859-14.c +++ b/fs/nls/nls_iso8859-14.c @@ -317,7 +317,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -326,8 +338,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-15.c b/fs/nls/nls_iso8859-15.c index 0611c6cb56b4..9de12c9e25a3 100644 --- a/fs/nls/nls_iso8859-15.c +++ b/fs/nls/nls_iso8859-15.c @@ -283,7 +283,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -292,8 +304,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-2.c b/fs/nls/nls_iso8859-2.c index 5255d92a25eb..c59e2424f2b5 100644 --- a/fs/nls/nls_iso8859-2.c +++ b/fs/nls/nls_iso8859-2.c @@ -284,7 +284,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -293,8 +305,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-3.c b/fs/nls/nls_iso8859-3.c index ad1b84f3e102..4bab1b607059 100644 --- a/fs/nls/nls_iso8859-3.c +++ b/fs/nls/nls_iso8859-3.c @@ -284,7 +284,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -293,8 +305,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-4.c b/fs/nls/nls_iso8859-4.c index 82469deee0ba..1a3cf5f507f6 100644 --- a/fs/nls/nls_iso8859-4.c +++ b/fs/nls/nls_iso8859-4.c @@ -284,7 +284,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -293,8 +305,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-5.c b/fs/nls/nls_iso8859-5.c index 3f3cd0c28797..0a26cea9d578 100644 --- a/fs/nls/nls_iso8859-5.c +++ b/fs/nls/nls_iso8859-5.c @@ -248,7 +248,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -257,8 +269,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-6.c b/fs/nls/nls_iso8859-6.c index 43e6675998bc..d5a230888eed 100644 --- a/fs/nls/nls_iso8859-6.c +++ b/fs/nls/nls_iso8859-6.c @@ -239,7 +239,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -248,8 +260,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-7.c b/fs/nls/nls_iso8859-7.c index 83893e487f82..a5a171849ae4 100644 --- a/fs/nls/nls_iso8859-7.c +++ b/fs/nls/nls_iso8859-7.c @@ -293,7 +293,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -302,8 +314,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_iso8859-9.c b/fs/nls/nls_iso8859-9.c index df03f97cd9d1..795093547cd6 100644 --- a/fs/nls/nls_iso8859-9.c +++ b/fs/nls/nls_iso8859-9.c @@ -248,7 +248,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -257,8 +269,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_koi8-r.c b/fs/nls/nls_koi8-r.c index 22918e154dbe..bbce9a608419 100644 --- a/fs/nls/nls_koi8-r.c +++ b/fs/nls/nls_koi8-r.c @@ -299,7 +299,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -308,8 +320,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c index f4edbc313706..d3e946652bf6 100644 --- a/fs/nls/nls_koi8-ru.c +++ b/fs/nls/nls_koi8-ru.c @@ -51,7 +51,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen, return n; } -static const struct nls_ops charset_ops = { +static struct nls_ops charset_ops = { .uni2char = uni2char, .char2uni = char2uni, }; @@ -72,8 +72,8 @@ static int __init init_nls_koi8_ru(void) p_nls = load_nls("koi8-u"); if (p_nls) { - table.charset2upper = p_nls->charset2upper; - table.charset2lower = p_nls->charset2lower; + charset_ops.uppercase = p_nls->ops->uppercase; + charset_ops.lowercase = p_nls->ops->lowercase; return register_nls(&nls_charset); } diff --git a/fs/nls/nls_koi8-u.c b/fs/nls/nls_koi8-u.c index b2421625e98b..5de52a74f0b3 100644 --- a/fs/nls/nls_koi8-u.c +++ b/fs/nls/nls_koi8-u.c @@ -306,7 +306,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return charset2lower[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return charset2upper[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -315,8 +327,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = charset2lower, - .charset2upper = charset2upper, }; static struct nls_charset nls_charset = { diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8.c index aecf460827ac..fe1ac5efaa37 100644 --- a/fs/nls/nls_utf8.c +++ b/fs/nls/nls_utf8.c @@ -40,7 +40,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return n; } +static unsigned char charset_tolower(const struct nls_table *table, + unsigned int c){ + return identity[c]; +} + +static unsigned char charset_toupper(const struct nls_table *table, + unsigned int c) { + return identity[c]; +} + static const struct nls_ops charset_ops = { + .lowercase = charset_toupper, + .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, }; @@ -49,8 +61,6 @@ static struct nls_charset nls_charset; static struct nls_table table = { .charset = &nls_charset, .ops = &charset_ops, - .charset2lower = identity, /* no conversion */ - .charset2upper = identity, }; static struct nls_charset nls_charset = { diff --git a/include/linux/nls.h b/include/linux/nls.h index 9f61015a54bf..c43746bd390e 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -38,6 +38,10 @@ struct nls_ops { **/ int (*validate)(const struct nls_table *charset, const unsigned char *str, size_t len); + unsigned char (*lowercase)(const struct nls_table *charset, + unsigned int c); + unsigned char (*uppercase)(const struct nls_table *charset, + unsigned int c); }; struct nls_table { @@ -46,9 +50,8 @@ struct nls_table { unsigned int flags; const struct nls_ops *ops; - const unsigned char *charset2lower; - const unsigned char *charset2upper; struct nls_table *next; + }; struct nls_charset { @@ -120,16 +123,18 @@ static inline const char *nls_charset_name(const struct nls_table *table) return table->charset->charset; } -static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c) +static inline unsigned char nls_tolower(const struct nls_table *t, + unsigned char c) { - unsigned char nc = t->charset2lower[c]; + unsigned char nc = t->ops->lowercase(t, c); return nc ? nc : c; } -static inline unsigned char nls_toupper(struct nls_table *t, unsigned char c) +static inline unsigned char nls_toupper(const struct nls_table *t, + unsigned char c) { - unsigned char nc = t->charset2upper[c]; + unsigned char nc = t->ops->uppercase(t, c); return nc ? nc : c; } From patchwork Thu Dec 6 23:08:49 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717193 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id F105013AF for ; Thu, 6 Dec 2018 23:09:46 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E19462F212 for ; Thu, 6 Dec 2018 23:09:46 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D5B372F21E; Thu, 6 Dec 2018 23:09:46 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6BF722F212 for ; Thu, 6 Dec 2018 23:09:46 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726321AbeLFXJo (ORCPT ); Thu, 6 Dec 2018 18:09:44 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56100 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726300AbeLFXJn (ORCPT ); Thu, 6 Dec 2018 18:09:43 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id C8A1B27ED5C From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 09/23] nls: Add new interface for string comparisons Date: Thu, 6 Dec 2018 18:08:49 -0500 Message-Id: <20181206230903.30011-10-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi The existing stricmp() interface is limited by not accepting separated length parameters for each string being compared. This is a problem for charsets doing normalization or full casefold comparison, since different sized strings can still be matched. To resolve this problem, this patch implements a new interface, allowing charsets to do the comparison, if needed. The original stricmp is left in the code, until we convert all caller to the new interface. Nevertheless, it was reimplemented using the new interface. Signed-off-by: Gabriel Krisman Bertazi --- include/linux/nls.h | 69 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 66 insertions(+), 3 deletions(-) diff --git a/include/linux/nls.h b/include/linux/nls.h index c43746bd390e..980103d4c363 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -3,6 +3,7 @@ #define _LINUX_NLS_H #include +#include /* Unicode has changed over the years. Unicode code points no longer * fit into 16 bits; as of Unicode 5 valid code points range from 0 @@ -38,6 +39,32 @@ struct nls_ops { **/ int (*validate)(const struct nls_table *charset, const unsigned char *str, size_t len); + /** + * @strncmp: + * + * strncmp is the function for case-sensitive string comparison. + * It only needs to be implemented by charsets that want to do + * some fancy comparisons, like normalization-insensitive. + * + * Returns 0 if str1 and str2 are equal, otherwise return + * non-zero. + **/ + int (*strncmp)(const struct nls_table *charset, + const unsigned char *str1, size_t len1, + const unsigned char *str2, size_t len2); + + /** + * @strncasecmp: + * + * strncasecmp is the function for case-insensitive string + * comparison. + * + * Returns 0 if str1 and str2 are equal, otherwise return + * non-zero. + **/ + int (*strncasecmp)(const struct nls_table *charset, + const unsigned char *str1, size_t len1, + const unsigned char *str2, size_t len2); unsigned char (*lowercase)(const struct nls_table *charset, unsigned int c); unsigned char (*uppercase)(const struct nls_table *charset, @@ -139,10 +166,21 @@ static inline unsigned char nls_toupper(const struct nls_table *t, return nc ? nc : c; } -static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1, - const unsigned char *s2, int len) +static inline int nls_strncasecmp(struct nls_table *t, + const unsigned char *s1, size_t len1, + const unsigned char *s2, size_t len2) { - while (len--) { + if (t->ops->strncasecmp) + return t->ops->strncasecmp(t, s1, len1, s2, len2); + + if (IS_STRICT_MODE(t) && + (nls_validate(t, s1, len1) || nls_validate(t, s1, len1))) + return -EINVAL; + + if (len1 != len2) + return 1; + + while (len1--) { if (nls_tolower(t, *s1++) != nls_tolower(t, *s2++)) return 1; } @@ -150,6 +188,31 @@ static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1, return 0; } +static inline int nls_strncmp(struct nls_table *t, + const unsigned char *s1, size_t len1, + const unsigned char *s2, size_t len2) +{ + if (t->ops->strncmp) + return t->ops->strncmp(t, s1, len1, s2, len2); + + if (IS_STRICT_MODE(t) && + (nls_validate(t, s1, len1) || nls_validate(t, s1, len1))) + return -EINVAL; + + if (len1 != len2) + return 1; + + /* strnicmp did not return negative values. So let's keep the + * abi for now */ + return !!memcmp(s1, s2, len1); +} + +static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1, + const unsigned char *s2, int len) +{ + return nls_strncasecmp(t, s1, len, s2, len); +} + /* * nls_nullsize - return length of null character for codepage * @codepage - codepage for which to return length of NULL terminator From patchwork Thu Dec 6 23:08:50 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717197 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1370E15A6 for ; Thu, 6 Dec 2018 23:09:50 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 032B42F205 for ; Thu, 6 Dec 2018 23:09:50 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id EBF052F212; Thu, 6 Dec 2018 23:09:49 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 532352F205 for ; Thu, 6 Dec 2018 23:09:49 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726104AbeLFXJr (ORCPT ); Thu, 6 Dec 2018 18:09:47 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56104 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726322AbeLFXJq (ORCPT ); Thu, 6 Dec 2018 18:09:46 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 583F127E03E From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 10/23] nls: Add optional normalization and casefold hooks Date: Thu, 6 Dec 2018 18:08:50 -0500 Message-Id: <20181206230903.30011-11-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi The Normalization operation applies a transformation to strings to obtain the normalization form, which allow the user to determine whether any two strings are equivalent to each other. The NLS subsystem doesn't impose any constraint on what means to be equivalent, for any charsets. Unicode-based charsets, for instance, are free to support one, a few or all kinds of Unicode equivalences. The Casefold operation is similar to Normalization, in a sense that it also allows the caller to identify equivalent strings, but it disregards case, making it ideal for case insensitive comparisons. Default implementation are provided by the nls core, such that existing charsets can operate on the new interface. The Normalization default operation is the format NLS_NORMALIZATION_TYPE_PLAIN, which returns the identity of the string, which means no normalization. The casefold default is NLS_CASEFOLD_TYPE_TOUPPER, which returns the string with all characters converted to uppercase. Changes since V1: - Add default operations for casefold and normalization Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/nls_core.c | 11 +++++ include/linux/nls.h | 116 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 127 insertions(+) diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c index 49a15bb2174f..c49088f36f4c 100644 --- a/fs/nls/nls_core.c +++ b/fs/nls/nls_core.c @@ -25,6 +25,17 @@ static int nls_validate_flags(struct nls_table *table, unsigned int flags) if (flags & NLS_STRICT_MODE && !table->ops->validate) return -1; + if ((flags & NLS_NORMALIZATION_TYPE_MASK) && !table->ops->normalize) + return -1; + + if ((flags & NLS_CASEFOLD_TYPE_MASK) && !table->ops->casefold) + return -1; + + /* Reject unused flags */ + if (flags & ~(NLS_CASEFOLD_TYPE_MASK | NLS_NORMALIZATION_TYPE_MASK | + NLS_STRICT_MODE)) + return -1; + return 0; } diff --git a/include/linux/nls.h b/include/linux/nls.h index 980103d4c363..44a06a9c69e7 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -4,6 +4,7 @@ #include #include +#include /* Unicode has changed over the years. Unicode code points no longer * fit into 16 bits; as of Unicode 5 valid code points range from 0 @@ -65,6 +66,51 @@ struct nls_ops { int (*strncasecmp)(const struct nls_table *charset, const unsigned char *str1, size_t len1, const unsigned char *str2, size_t len2); + /** + * @normalize: + * + * Obtain the normalized form of a string, which can be used to + * determine whether any two strings are equivalent. The NLS + * subsystem doesn't impose any constraint on the charsets + * regarding what it means to be equivalent. Unicode-based + * charsets, for instance, are free to support one, a few or all + * kinds of Unicode equivalences. Different kinds of + * normalizations can be specified using the nls_table flags. + * + * This hook is responsible for performing string validation if + * the strict mode flag is set. The only case where it is not + * called by nls_core is when strict mode and normalization are + * disabled, because in this case the normalization is + * guaranteed to be the string identity. + * + * Not every charset implements this hook. It is only required + * if the charset supports strict mode or some kind of + * normalization. + * + * If this operation cannot be executed for this charset, + * -ENOTSUPP is returned. If the sequence is invalid, -EINVAL + * is returned. Otherwise, this function returns the size of the + * new string. + **/ + int (*normalize)(const struct nls_table *charset, + const unsigned char *str, size_t len, + unsigned char *dest, size_t dlen); + /** + * @casefold: + * + * Casefold returns a version of the string that can be used to + * perform case-insensitive comparisons. The kind of casefold + * algorithm that will be used is charset dependent, and can be + * configured using the nls_table flags field. + * + * If this operation cannot be executed for this charset, + * -ENOTSUPP is returned. If the sequence fails, -EINVAL is + * returned. Otherwise, this function returns the size of the + * new string. + **/ + int (*casefold)(const struct nls_table *charset, + const unsigned char *str, size_t len, + unsigned char *dest, size_t dlen); unsigned char (*lowercase)(const struct nls_table *charset, unsigned int c); unsigned char (*uppercase)(const struct nls_table *charset, @@ -101,13 +147,37 @@ enum utf16_endian { UTF16_BIG_ENDIAN }; +#define NLS_NORMALIZATION_TYPE(i) ((i & 0x7) << 1) +#define NLS_CASEFOLD_TYPE(i) ((i & 0x7) << 4) + #define NLS_STRICT_MODE 0x00000001 +#define NLS_NORMALIZATION_TYPE_PLAIN NLS_NORMALIZATION_TYPE(0) +#define NLS_NORMALIZATION_TYPE_MASK 0x0000000E +#define NLS_CASEFOLD_TYPE_TOUPPER NLS_CASEFOLD_TYPE(0) +#define NLS_CASEFOLD_TYPE_MASK 0x00000070 static inline int IS_STRICT_MODE(const struct nls_table *charset) { return (charset->flags & NLS_STRICT_MODE); } +#define NLS_NORMALIZATION_FUNCS(charset, type, i) \ +static inline int \ +IS_NORMALIZATION_TYPE_##charset##_##type(const struct nls_table *c) \ +{ \ + return ((c->flags & NLS_NORMALIZATION_TYPE_MASK) == i); \ +} + +#define NLS_CASEFOLD_FUNCS(charset, type, i) \ +static inline int \ +IS_CASEFOLD_TYPE_##charset##_##type(const struct nls_table *c) \ +{ \ + return ((c->flags & NLS_CASEFOLD_TYPE_MASK) == i); \ +} + +NLS_NORMALIZATION_FUNCS(ALL, PLAIN, NLS_NORMALIZATION_TYPE_PLAIN) +NLS_CASEFOLD_FUNCS(ALL, TOUPPER, NLS_CASEFOLD_TYPE_TOUPPER) + /* nls_base.c */ extern int __register_nls(struct nls_charset *, struct module *); extern int unregister_nls(struct nls_charset *); @@ -213,6 +283,52 @@ static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1, return nls_strncasecmp(t, s1, len, s2, len); } +static inline int nls_casefold(const struct nls_table *t, + const unsigned char *str, size_t len, + unsigned char *dest, size_t dlen) +{ + int i; + + if (t->ops->casefold) + return t->ops->casefold(t, str, len, dest, dlen); + + if (!IS_CASEFOLD_TYPE_ALL_TOUPPER(t)) + return -ENOTSUPP; + + if (IS_STRICT_MODE(t) && nls_validate(t, str, len)) + return -EINVAL; + + if (len > dlen) + return -EINVAL; + + for (i = 0 ; i < len; i++) + dest[i] = nls_toupper(t, str[i]); + + return len; +} + +static inline int nls_normalize(const struct nls_table *t, + const unsigned char *str, size_t len, + unsigned char *dest, size_t dlen) +{ + if (t->ops->normalize) + return t->ops->normalize(t, str, len, dest, dlen); + + if (!IS_NORMALIZATION_TYPE_ALL_PLAIN(t)) + return -ENOTSUPP; + + if (IS_STRICT_MODE(t) && nls_validate(t, str, len)) + return -EINVAL; + + if (len > dlen) + return -EINVAL; + + /* If normalization are disabled, normalization is the + * identity. */ + strncpy(dest, str, len); + return len; +} + /* * nls_nullsize - return length of null character for codepage * @codepage - codepage for which to return length of NULL terminator From patchwork Thu Dec 6 23:08:51 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717199 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8EE7D15A6 for ; Thu, 6 Dec 2018 23:09:53 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7FAFE2F205 for ; Thu, 6 Dec 2018 23:09:53 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 747812F21D; Thu, 6 Dec 2018 23:09:53 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0FEA02F205 for ; Thu, 6 Dec 2018 23:09:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726335AbeLFXJv (ORCPT ); Thu, 6 Dec 2018 18:09:51 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56108 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726322AbeLFXJt (ORCPT ); Thu, 6 Dec 2018 18:09:49 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id CFE2927ED58 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 11/23] nls: ascii: Support validation and normalization operations Date: Thu, 6 Dec 2018 18:08:51 -0500 Message-Id: <20181206230903.30011-12-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi validation is trivial. Any byte that has the MSB set is an invalid sequence. Casefold can be implemented with uppercase or lowercase, and we have no specification on that. Callers should be safe using either of them, as long as it doesn't change. Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/nls_ascii.c | 50 +++++++++++++++++++++++++++++++++++++++++++++ include/linux/nls.h | 8 ++++++++ 2 files changed, 58 insertions(+) diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c index 2f4826478d3d..079a1574c19d 100644 --- a/fs/nls/nls_ascii.c +++ b/fs/nls/nls_ascii.c @@ -12,6 +12,7 @@ #include #include #include +#include static const wchar_t charset2uni[256] = { /* 0x00*/ @@ -117,6 +118,8 @@ static const unsigned char charset2upper[256] = { 0x58, 0x59, 0x5a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */ }; +#define VALID_ASCII(c) (c < 128) + static int uni2char(wchar_t uni, unsigned char *out, int boundlen) { const unsigned char *uni2charset; @@ -142,6 +145,16 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni) return 1; } +static int ascii_validate(const struct nls_table *table, + const unsigned char *str, size_t len) +{ + int i; + for (i = 0; i < len && str[i]; i++) + if (!VALID_ASCII(str[i])) + return -1; + return 0; +} + static unsigned char charset_tolower(const struct nls_table *table, unsigned int c){ return charset2lower[c]; @@ -152,11 +165,36 @@ static unsigned char charset_toupper(const struct nls_table *table, return charset2upper[c]; } +static int ascii_casefold(const struct nls_table *charset, + const unsigned char *str, size_t len, + unsigned char *dest, size_t dlen) +{ + unsigned int i; + + if (dlen < len) + return -EINVAL; + + for (i = 0; i < len; i++) { + if (IS_STRICT_MODE(charset) && !VALID_ASCII(str[i])) + return -EINVAL; + + if (IS_CASEFOLD_TYPE_ASCII_TOLOWER(charset)) + dest[i] = charset_tolower(charset, str[i]); + else + dest[i] = charset_toupper(charset, str[i]); + } + dest[len] = '\0'; + + return len; +} + static const struct nls_ops charset_ops = { + .validate = ascii_validate, .lowercase = charset_toupper, .uppercase = charset_tolower, .uni2char = uni2char, .char2uni = char2uni, + .casefold = ascii_casefold, }; static struct nls_charset nls_charset; @@ -165,9 +203,21 @@ static struct nls_table table = { .ops = &charset_ops, }; +struct nls_table *ascii_load_table(const char *version, unsigned int flags) +{ + if (flags & ~(NLS_STRICT_MODE) || + (flags & NLS_NORMALIZATION_TYPE_MASK) != NLS_NORMALIZATION_TYPE_PLAIN) + return ERR_PTR(-EINVAL); + + table.flags = flags; + return &table; +} + + static struct nls_charset nls_charset = { .charset = "ascii", .tables = &table, + .load_table = ascii_load_table, }; static int __init init_nls_ascii(void) diff --git a/include/linux/nls.h b/include/linux/nls.h index 44a06a9c69e7..aab60d4858ee 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -178,6 +178,14 @@ IS_CASEFOLD_TYPE_##charset##_##type(const struct nls_table *c) \ NLS_NORMALIZATION_FUNCS(ALL, PLAIN, NLS_NORMALIZATION_TYPE_PLAIN) NLS_CASEFOLD_FUNCS(ALL, TOUPPER, NLS_CASEFOLD_TYPE_TOUPPER) +/* ASCII */ + +#define NLS_ASCII_CASEFOLD_TOUPPER NLS_CASEFOLD_TYPE_TOUPPER +#define NLS_ASCII_CASEFOLD_TOLOWER NLS_CASEFOLD_TYPE(1) + +NLS_CASEFOLD_FUNCS(ASCII, TOUPPER, NLS_ASCII_CASEFOLD_TOUPPER) +NLS_CASEFOLD_FUNCS(ASCII, TOLOWER, NLS_ASCII_CASEFOLD_TOLOWER) + /* nls_base.c */ extern int __register_nls(struct nls_charset *, struct module *); extern int unregister_nls(struct nls_charset *); From patchwork Thu Dec 6 23:08:52 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717201 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0C2E915A6 for ; Thu, 6 Dec 2018 23:09:55 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F17902F205 for ; Thu, 6 Dec 2018 23:09:54 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E60722F21D; Thu, 6 Dec 2018 23:09:54 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7C1722F205 for ; Thu, 6 Dec 2018 23:09:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726338AbeLFXJx (ORCPT ); Thu, 6 Dec 2018 18:09:53 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56112 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726322AbeLFXJw (ORCPT ); Thu, 6 Dec 2018 18:09:52 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 4FF3D27ED74 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH v4 12/23] nls: utf8: Add unicode character database files Date: Thu, 6 Dec 2018 18:08:52 -0500 Message-Id: <20181206230903.30011-13-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Olaf Weber Add files from the Unicode Character Database, version 11.0, to the source. A helper program that generates a trie used for normalization from these files is part of a separate commit. - Notes on the update from 8.0.0 and 11.0: The structure of ucd files and special cases have not experienced any changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of Cherokee LC characters, which is an interesting case for case-folding. The update is accompanied by new tests on the test_ucd module to catch specific cases. No changes to mkutf8data script was required for the update. The actual files are not part of the commit submitted to the list because they are to big and would bounce. Still, they can be obtained by the following script: FILES="CaseFolding.txt DerivedAge.txt extracted/DerivedCombiningClass.txt DerivedCoreProperties.txt NormalizationCorrections.txt NormalizationTest.txt UnicodeData.txt" VERSION=11.0.0 BASE=http://www.unicode.org/Public/${VERSION}/ucd for i in ${FILES} ; do wget "${BASE}/$i" -O fs/nls/ucd/$(basename ${i} .txt)-${VERSION}.txt done Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Move ucd directory to fs/nls/] [Update to Unicode 11.0.0] --- fs/nls/ucd/README | 34 ++++++++++++++++++++++++++++++++++ 1 file changed, 34 insertions(+) create mode 100644 fs/nls/ucd/README diff --git a/fs/nls/ucd/README b/fs/nls/ucd/README new file mode 100644 index 000000000000..553ce7e4c224 --- /dev/null +++ b/fs/nls/ucd/README @@ -0,0 +1,34 @@ +The files in this directory are part of the Unicode Character Database +for version 11.0.0 of the Unicode standard. + +The full set of files can be found here: + + http://www.unicode.org/Public/11.0.0/ucd/ + +The latest released version of the UCD can be found here: + + http://www.unicode.org/Public/UCD/latest/ + +The files in this directory are identical, except that they have been +renamed with a suffix indicating the unicode version. + +Individual source links: + + http://www.unicode.org/Public/11.0.0/ucd/CaseFolding.txt + http://www.unicode.org/Public/11.0.0/ucd/DerivedAge.txt + http://www.unicode.org/Public/11.0.0/ucd/extracted/DerivedCombiningClass.txt + http://www.unicode.org/Public/11.0.0/ucd/DerivedCoreProperties.txt + http://www.unicode.org/Public/11.0.0/ucd/NormalizationCorrections.txt + http://www.unicode.org/Public/11.0.0/ucd/NormalizationTest.txt + http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt + +md5sums + + 414436796cf097df55f798e1585448ee CaseFolding-11.0.0.txt + 6032a595fbb782694456491d86eecfac DerivedAge-11.0.0.txt + 3240997d671297ac754ab0d27577acf7 DerivedCombiningClass-11.0.0.txt + d41d8cd98f00b204e9800998ecf8427e DerivedCombiningClass.txt + 2a4fe257d9d8184518e036194d2248ec DerivedCoreProperties-11.0.0.txt + 4e7d383fa0dd3cd9d49d64e5b7b7c9e0 NormalizationCorrections-11.0.0.txt + c9500c5b8b88e584469f056023ecc3f2 NormalizationTest-11.0.0.txt + acc291106c3758d2025f8d7bd5518bee UnicodeData-11.0.0.txt From patchwork Thu Dec 6 23:08:53 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717205 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9D87B15A6 for ; Thu, 6 Dec 2018 23:10:06 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7EBA72F205 for ; Thu, 6 Dec 2018 23:10:06 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 701EC2F21D; Thu, 6 Dec 2018 23:10:06 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9A6AC2F212 for ; Thu, 6 Dec 2018 23:10:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726342AbeLFXKA (ORCPT ); Thu, 6 Dec 2018 18:10:00 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56118 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726322AbeLFXKA (ORCPT ); Thu, 6 Dec 2018 18:10:00 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 9265327E819 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH v4 13/23] scripts: add trie generator for UTF-8 Date: Thu, 6 Dec 2018 18:08:53 -0500 Message-Id: <20181206230903.30011-14-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Olaf Weber mkutf8data.c is the source for a program that generates utf8data.h, which contains the trie that utf8norm.c uses. The trie is generated from the Unicode 11.0.0 data files. The format of the utf8data[] table is described in utf8norm.c, which is added in the next patch. Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Rebase to mainline] [Fix out-of-tree-build] [Fix checkpatch warnings] [Merge back robustness fixes from original patch. Requested by Dave Chinner] [Update makefile to build 11.0.0 ucd files] [drop references to xfs] --- fs/nls/Kconfig | 10 + fs/nls/Makefile | 13 + scripts/Makefile | 1 + scripts/mkutf8data.c | 3168 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 3192 insertions(+) create mode 100644 scripts/mkutf8data.c diff --git a/fs/nls/Kconfig b/fs/nls/Kconfig index e2ce79ef48c4..7cb2848da608 100644 --- a/fs/nls/Kconfig +++ b/fs/nls/Kconfig @@ -616,4 +616,14 @@ config NLS_UTF8 input/output character sets. Say Y here for the UTF-8 encoding of the Unicode/ISO9646 universal character set. +# +# utf8 normalization +# +config NLS_UTF8_NORMALIZATION + bool "UTF-8 normalization and casefolding support" + depends on NLS_UTF8 + help + Say Y here to enable utf8 NFKD normalization and casefolding + support. + endif # NLS diff --git a/fs/nls/Makefile b/fs/nls/Makefile index 5f42ceff9d15..840e06aefd47 100644 --- a/fs/nls/Makefile +++ b/fs/nls/Makefile @@ -55,3 +55,16 @@ obj-$(CONFIG_NLS_MAC_INUIT) += mac-inuit.o obj-$(CONFIG_NLS_MAC_ROMANIAN) += mac-romanian.o obj-$(CONFIG_NLS_MAC_ROMAN) += mac-roman.o obj-$(CONFIG_NLS_MAC_TURKISH) += mac-turkish.o + +$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE + $(call cmd,mkutf8data) +quiet_cmd_mkutf8data = MKUTF8DATA $@ + cmd_mkutf8data = $(objtree)/scripts/mkutf8data \ + -a $(srctree)/$(src)/ucd/DerivedAge-11.0.0.txt \ + -c $(srctree)/$(src)/ucd/DerivedCombiningClass-11.0.0.txt \ + -p $(srctree)/$(src)/ucd/DerivedCoreProperties-11.0.0.txt \ + -d $(srctree)/$(src)/ucd/UnicodeData-11.0.0.txt \ + -f $(srctree)/$(src)/ucd/CaseFolding-11.0.0.txt \ + -n $(srctree)/$(src)/ucd/NormalizationCorrections-11.0.0.txt \ + -t $(srctree)/$(src)/ucd/NormalizationTest-11.0.0.txt \ + -o $@ diff --git a/scripts/Makefile b/scripts/Makefile index ece52ff20171..b36208c62c17 100644 --- a/scripts/Makefile +++ b/scripts/Makefile @@ -20,6 +20,7 @@ hostprogs-$(CONFIG_ASN1) += asn1_compiler hostprogs-$(CONFIG_MODULE_SIG) += sign-file hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert +hostprogs-$(CONFIG_NLS_UTF8_NORMALIZATION) += mkutf8data HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c new file mode 100644 index 000000000000..26794053d0d4 --- /dev/null +++ b/scripts/mkutf8data.c @@ -0,0 +1,3168 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write the Free Software Foundation, + * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA + */ + +/* Generator for a compact trie for unicode normalization */ + +#include +#include +#include +#include +#include +#include +#include +#include + +/* Default names of the in- and output files. */ + +#define AGE_NAME "DerivedAge.txt" +#define CCC_NAME "DerivedCombiningClass.txt" +#define PROP_NAME "DerivedCoreProperties.txt" +#define DATA_NAME "UnicodeData.txt" +#define FOLD_NAME "CaseFolding.txt" +#define NORM_NAME "NormalizationCorrections.txt" +#define TEST_NAME "NormalizationTest.txt" +#define UTF8_NAME "utf8data.h" + +const char *age_name = AGE_NAME; +const char *ccc_name = CCC_NAME; +const char *prop_name = PROP_NAME; +const char *data_name = DATA_NAME; +const char *fold_name = FOLD_NAME; +const char *norm_name = NORM_NAME; +const char *test_name = TEST_NAME; +const char *utf8_name = UTF8_NAME; + +int verbose = 0; + +/* An arbitrary line size limit on input lines. */ + +#define LINESIZE 1024 +char line[LINESIZE]; +char buf0[LINESIZE]; +char buf1[LINESIZE]; +char buf2[LINESIZE]; +char buf3[LINESIZE]; + +const char *argv0; + +/* ------------------------------------------------------------------ */ + +/* + * Unicode version numbers consist of three parts: major, minor, and a + * revision. These numbers are packed into an unsigned int to obtain + * a single version number. + * + * To save space in the generated trie, the unicode version is not + * stored directly, instead we calculate a generation number from the + * unicode versions seen in the DerivedAge file, and use that as an + * index into a table of unicode versions. + */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_MAJ_MAX ((unsigned short)-1) +#define UNICODE_MIN_MAX ((unsigned char)-1) +#define UNICODE_REV_MAX ((unsigned char)-1) + +#define UNICODE_AGE(MAJ,MIN,REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +unsigned int *ages; +int ages_count; + +unsigned int unicode_maxage; + +static int age_valid(unsigned int major, unsigned int minor, + unsigned int revision) +{ + if (major > UNICODE_MAJ_MAX) + return 0; + if (minor > UNICODE_MIN_MAX) + return 0; + if (revision > UNICODE_REV_MAX) + return 0; + return 1; +} + +/* ------------------------------------------------------------------ */ + +/* + * utf8trie_t + * + * A compact binary tree, used to decode UTF-8 characters. + * + * Internal nodes are one byte for the node itself, and up to three + * bytes for an offset into the tree. The first byte contains the + * following information: + * NEXTBYTE - flag - advance to next byte if set + * BITNUM - 3 bit field - the bit number to tested + * OFFLEN - 2 bit field - number of bytes in the offset + * if offlen == 0 (non-branching node) + * RIGHTPATH - 1 bit field - set if the following node is for the + * right-hand path (tested bit is set) + * TRIENODE - 1 bit field - set if the following node is an internal + * node, otherwise it is a leaf node + * if offlen != 0 (branching node) + * LEFTNODE - 1 bit field - set if the left-hand node is internal + * RIGHTNODE - 1 bit field - set if the right-hand node is internal + * + * Due to the way utf8 works, there cannot be branching nodes with + * NEXTBYTE set, and moreover those nodes always have a righthand + * descendant. + */ +typedef unsigned char utf8trie_t; +#define BITNUM 0x07 +#define NEXTBYTE 0x08 +#define OFFLEN 0x30 +#define OFFLEN_SHIFT 4 +#define RIGHTPATH 0x40 +#define TRIENODE 0x80 +#define RIGHTNODE 0x40 +#define LEFTNODE 0x80 + +/* + * utf8leaf_t + * + * The leaves of the trie are embedded in the trie, and so the same + * underlying datatype, unsigned char. + * + * leaf[0]: The unicode version, stored as a generation number that is + * an index into utf8agetab[]. With this we can filter code + * points based on the unicode version in which they were + * defined. The CCC of a non-defined code point is 0. + * leaf[1]: Canonical Combining Class. During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. + */ +typedef unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2)) + +#define MAXGEN (255) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +struct tree; +static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, const char *); + +unsigned char *utf8data; +size_t utf8data_size; + +utf8trie_t *nfkdi; +utf8trie_t *nfkdicf; + +/* ------------------------------------------------------------------ */ + +/* + * UTF8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. + * + * 0 - 0x7f: 0 0x7f + * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf + * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf + * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf + * + * Even within those ranges not all values are allowed: the surrogates + * 0xd800 - 0xdfff should never be seen. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same a single UTF-32 character. This makes the UTF-8 + * representation of Unicode strictly smaller than UTF-32. + * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +#define UTF8_2_BITS 0xC0 +#define UTF8_3_BITS 0xE0 +#define UTF8_4_BITS 0xF0 +#define UTF8_N_BITS 0x80 +#define UTF8_2_MASK 0xE0 +#define UTF8_3_MASK 0xF0 +#define UTF8_4_MASK 0xF8 +#define UTF8_N_MASK 0xC0 +#define UTF8_V_MASK 0x3F +#define UTF8_V_SHIFT 6 + +static int utf8encode(char *str, unsigned int val) +{ + int len; + + if (val < 0x80) { + str[0] = val; + len = 1; + } else if (val < 0x800) { + str[1] = val & UTF8_V_MASK; + str[1] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[0] = val; + str[0] |= UTF8_2_BITS; + len = 2; + } else if (val < 0x10000) { + str[2] = val & UTF8_V_MASK; + str[2] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[1] = val & UTF8_V_MASK; + str[1] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[0] = val; + str[0] |= UTF8_3_BITS; + len = 3; + } else if (val < 0x110000) { + str[3] = val & UTF8_V_MASK; + str[3] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[2] = val & UTF8_V_MASK; + str[2] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[1] = val & UTF8_V_MASK; + str[1] |= UTF8_N_BITS; + val >>= UTF8_V_SHIFT; + str[0] = val; + str[0] |= UTF8_4_BITS; + len = 4; + } else { + printf("%#x: illegal val\n", val); + len = 0; + } + return len; +} + +static unsigned int utf8decode(const char *str) +{ + const unsigned char *s = (const unsigned char*)str; + unsigned int unichar = 0; + + if (*s < 0x80) { + unichar = *s; + } else if (*s < UTF8_3_BITS) { + unichar = *s++ & 0x1F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else if (*s < UTF8_4_BITS) { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } else { + unichar = *s++ & 0x0F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s++ & 0x3F; + unichar <<= UTF8_V_SHIFT; + unichar |= *s & 0x3F; + } + return unichar; +} + +static int utf32valid(unsigned int unichar) +{ + return unichar < 0x110000; +} + +#define NODE 1 +#define LEAF 0 + +struct tree { + void *root; + int childnode; + const char *type; + unsigned int maxage; + struct tree *next; + int (*leaf_equal)(void *, void *); + void (*leaf_print)(void *, int); + int (*leaf_mark)(void *); + int (*leaf_size)(void *); + int *(*leaf_index)(struct tree *, void *); + unsigned char *(*leaf_emit)(void *, unsigned char *); + int leafindex[0x110000]; + int index; +}; + +struct node { + int index; + int offset; + int mark; + int size; + struct node *parent; + void *left; + void *right; + unsigned char bitnum; + unsigned char nextbyte; + unsigned char leftnode; + unsigned char rightnode; + unsigned int keybits; + unsigned int keymask; +}; + +/* + * Example lookup function for a tree. + */ +static void *lookup(struct tree *tree, const char *key) +{ + struct node *node; + void *leaf = NULL; + + node = tree->root; + while (!leaf && node) { + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) { + /* Right leg */ + if (node->rightnode == NODE) { + node = node->right; + } else if (node->rightnode == LEAF) { + leaf = node->right; + } else { + node = NULL; + } + } else { + /* Left leg */ + if (node->leftnode == NODE) { + node = node->left; + } else if (node->leftnode == LEAF) { + leaf = node->left; + } else { + node = NULL; + } + } + } + + return leaf; +} + +/* + * A simple non-recursive tree walker: keep track of visits to the + * left and right branches in the leftmask and rightmask. + */ +static void tree_walk(struct tree *tree) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int indent = 1; + int nodes, singletons, leaves; + + nodes = singletons = leaves = 0; + + printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_print(tree->root, indent); + leaves = 1; + } else { + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + printf("%*snode @ %p bitnum %d nextbyte %d" + " left %p right %p mask %x bits %x\n", + indent, "", node, + node->bitnum, node->nextbyte, + node->left, node->right, + node->keymask, node->keybits); + nodes += 1; + if (!(node->left && node->right)) + singletons += 1; + + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + tree->leaf_print(node->left, + indent+1); + leaves += 1; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + tree->leaf_print(node->right, + indent+1); + leaves += 1; + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } + } + printf("nodes %d leaves %d singletons %d\n", + nodes, leaves, singletons); +} + +/* + * Allocate an initialize a new internal node. + */ +static struct node *alloc_node(struct node *parent) +{ + struct node *node; + int bitnum; + + node = malloc(sizeof(*node)); + node->left = node->right = NULL; + node->parent = parent; + node->leftnode = NODE; + node->rightnode = NODE; + node->keybits = 0; + node->keymask = 0; + node->mark = 0; + node->index = 0; + node->offset = -1; + node->size = 4; + + if (node->parent) { + bitnum = parent->bitnum; + if ((bitnum & 7) == 0) { + node->bitnum = bitnum + 7 + 8; + node->nextbyte = 1; + } else { + node->bitnum = bitnum - 1; + node->nextbyte = 0; + } + } else { + node->bitnum = 7; + node->nextbyte = 0; + } + + return node; +} + +/* + * Insert a new leaf into the tree, and collapse any subtrees that are + * fully populated and end in identical leaves. A nextbyte tagged + * internal node will not be removed to preserve the tree's integrity. + * Note that due to the structure of utf8, no nextbyte tagged node + * will be a candidate for removal. + */ +static int insert(struct tree *tree, char *key, int keylen, void *leaf) +{ + struct node *node; + struct node *parent; + void **cursor; + int keybits; + + assert(keylen >= 1 && keylen <= 4); + + node = NULL; + cursor = &tree->root; + keybits = 8 * keylen; + + /* Insert, creating path along the way. */ + while (keybits) { + if (!*cursor) + *cursor = alloc_node(node); + node = *cursor; + if (node->nextbyte) + key++; + if (*key & (1 << (node->bitnum & 7))) + cursor = &node->right; + else + cursor = &node->left; + keybits--; + } + *cursor = leaf; + + /* Merge subtrees if possible. */ + while (node) { + if (*key & (1 << (node->bitnum & 7))) + node->rightnode = LEAF; + else + node->leftnode = LEAF; + if (node->nextbyte) + break; + if (node->leftnode == NODE || node->rightnode == NODE) + break; + assert(node->left); + assert(node->right); + /* Compare */ + if (! tree->leaf_equal(node->left, node->right)) + break; + /* Keep left, drop right leaf. */ + leaf = node->left; + /* Check in parent */ + parent = node->parent; + if (!parent) { + /* root of tree! */ + tree->root = leaf; + tree->childnode = LEAF; + } else if (parent->left == node) { + parent->left = leaf; + parent->leftnode = LEAF; + if (parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + } + } else if (parent->right == node) { + parent->right = leaf; + parent->rightnode = LEAF; + if (parent->left) { + parent->keymask = 0; + parent->keybits = 0; + } else { + parent->keymask |= (1 << node->bitnum); + parent->keybits |= (1 << node->bitnum); + } + } else { + /* internal tree error */ + assert(0); + } + free(node); + node = parent; + } + + /* Propagate keymasks up along singleton chains. */ + while (node) { + parent = node->parent; + if (!parent) + break; + /* Nix the mask for parents with two children. */ + if (node->keymask == 0) { + parent->keymask = 0; + parent->keybits = 0; + } else if (parent->left && parent->right) { + parent->keymask = 0; + parent->keybits = 0; + } else { + assert((parent->keymask & node->keymask) == 0); + parent->keymask |= node->keymask; + parent->keymask |= (1 << parent->bitnum); + parent->keybits |= node->keybits; + if (parent->right) + parent->keybits |= (1 << parent->bitnum); + } + node = parent; + } + + return 0; +} + +/* + * Prune internal nodes. + * + * Fully populated subtrees that end at the same leaf have already + * been collapsed. There are still internal nodes that have for both + * their left and right branches a sequence of singletons that make + * identical choices and end in identical leaves. The keymask and + * keybits collected in the nodes describe the choices made in these + * singleton chains. When they are identical for the left and right + * branch of a node, and the two leaves comare identical, the node in + * question can be removed. + * + * Note that nodes with the nextbyte tag set will not be removed by + * this to ensure tree integrity. Note as well that the structure of + * utf8 ensures that these nodes would not have been candidates for + * removal in any case. + */ +static void prune(struct tree *tree) +{ + struct node *node; + struct node *left; + struct node *right; + struct node *parent; + void *leftleaf; + void *rightleaf; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + + if (verbose > 0) + printf("Pruning %s_%x\n", tree->type, tree->maxage); + + count = 0; + if (tree->childnode == LEAF) + return; + if (!tree->root) + return; + + leftmask = rightmask = 0; + node = tree->root; + while (node) { + if (node->nextbyte) + goto advance; + if (node->leftnode == LEAF) + goto advance; + if (node->rightnode == LEAF) + goto advance; + if (!node->left) + goto advance; + if (!node->right) + goto advance; + left = node->left; + right = node->right; + if (left->keymask == 0) + goto advance; + if (right->keymask == 0) + goto advance; + if (left->keymask != right->keymask) + goto advance; + if (left->keybits != right->keybits) + goto advance; + leftleaf = NULL; + while (!leftleaf) { + assert(left->left || left->right); + if (left->leftnode == LEAF) + leftleaf = left->left; + else if (left->rightnode == LEAF) + leftleaf = left->right; + else if (left->left) + left = left->left; + else if (left->right) + left = left->right; + else + assert(0); + } + rightleaf = NULL; + while (!rightleaf) { + assert(right->left || right->right); + if (right->leftnode == LEAF) + rightleaf = right->left; + else if (right->rightnode == LEAF) + rightleaf = right->right; + else if (right->left) + right = right->left; + else if (right->right) + right = right->right; + else + assert(0); + } + if (! tree->leaf_equal(leftleaf, rightleaf)) + goto advance; + /* + * This node has identical singleton-only subtrees. + * Remove it. + */ + parent = node->parent; + left = node->left; + right = node->right; + if (parent->left == node) + parent->left = left; + else if (parent->right == node) + parent->right = left; + else + assert(0); + left->parent = parent; + left->keymask |= (1 << node->bitnum); + node->left = NULL; + while (node) { + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + if (node->leftnode == NODE && node->left) { + left = node->left; + free(node); + count++; + node = left; + } else if (node->rightnode == NODE && node->right) { + right = node->right; + free(node); + count++; + node = right; + } else { + node = NULL; + } + } + /* Propagate keymasks up along singleton chains. */ + node = parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + for (;;) { + if (node->left && node->right) + break; + if (node->left) { + left = node->left; + node->keymask |= left->keymask; + node->keybits |= left->keybits; + } + if (node->right) { + right = node->right; + node->keymask |= right->keymask; + node->keybits |= right->keybits; + } + node->keymask |= (1 << node->bitnum); + node = node->parent; + /* Force re-check */ + bitmask = 1 << node->bitnum; + leftmask &= ~bitmask; + rightmask &= ~bitmask; + } + advance: + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0 && + node->leftnode == NODE && + node->left) { + leftmask |= bitmask; + node = node->left; + } else if ((rightmask & bitmask) == 0 && + node->rightnode == NODE && + node->right) { + rightmask |= bitmask; + node = node->right; + } else { + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + } + if (verbose > 0) + printf("Pruned %d nodes\n", count); +} + +/* + * Mark the nodes in the tree that lead to leaves that must be + * emitted. + */ +static void mark_nodes(struct tree *tree) +{ + struct node *node; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int marked; + + marked = 0; + if (verbose > 0) + printf("Marking %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } + + /* second pass: left siblings and singletons */ + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + bitmask = 1 << node->bitnum; + if ((leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + if (tree->leaf_mark(node->left)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->left) { + assert(node->leftnode == NODE); + node = node->left; + if (!node->mark && node->parent->mark) { + marked++; + node->mark = 1; + } + continue; + } + } + if ((rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + if (tree->leaf_mark(node->right)) { + n = node; + while (n && !n->mark) { + marked++; + n->mark = 1; + n = n->parent; + } + } + } else if (node->right) { + assert(node->rightnode==NODE); + node = node->right; + if (!node->mark && node->parent->mark && + !node->parent->left) { + marked++; + node->mark = 1; + } + continue; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + } +done: + if (verbose > 0) + printf("Marked %d nodes\n", marked); +} + +/* + * Compute the index of each node and leaf, which is the offset in the + * emitted trie. These values must be pre-computed because relative + * offsets between nodes are used to navigate the tree. + */ +static int index_nodes(struct tree *tree, int index) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int count; + int indent; + + /* Align to a cache line (or half a cache line?). */ + while (index % 64) + index++; + tree->index = index; + indent = 1; + count = 0; + + if (verbose > 0) + printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index); + if (tree->childnode == LEAF) { + index += tree->leaf_size(tree->root); + goto done; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + count++; + if (node->index != index) + node->index = index; + index += node->size; +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + *tree->leaf_index(tree, node->left) = + index; + index += tree->leaf_size(node->left); + count++; + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + *tree->leaf_index(tree, node->right) = index; + index += tree->leaf_size(node->right); + count++; + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + /* Round up to a multiple of 16 */ + while (index % 16) + index++; + if (verbose > 0) + printf("Final index %d\n", index); + return index; +} + +/* + * Compute the size of nodes and leaves. We start by assuming that + * each node needs to store a three-byte offset. The indexes of the + * nodes are calculated based on that, and then this function is + * called to see if the sizes of some nodes can be reduced. This is + * repeated until no more changes are seen. + */ +static int size_nodes(struct tree *tree) +{ + struct tree *next; + struct node *node; + struct node *right; + struct node *n; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + unsigned int pathbits; + unsigned int pathmask; + int changed; + int offset; + int size; + int indent; + + indent = 1; + changed = 0; + size = 0; + + if (verbose > 0) + printf("Sizing %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) + goto done; + + assert(tree->childnode == NODE); + pathbits = 0; + pathmask = 0; + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + offset = 0; + if (!node->left || !node->right) { + size = 1; + } else { + if (node->rightnode == NODE) { + right = node->right; + next = tree->next; + while (!right->mark) { + assert(next); + n = next->root; + while (n->bitnum != node->bitnum) { + if (pathbits & (1<bitnum)) + n = n->right; + else + n = n->left; + } + n = n->right; + assert(right->bitnum == n->bitnum); + right = n; + next = next->next; + } + offset = right->index - node->index; + } else { + offset = *tree->leaf_index(tree, node->right); + offset -= node->index; + } + assert(offset >= 0); + assert(offset <= 0xffffff); + if (offset <= 0xff) { + size = 2; + } else if (offset <= 0xffff) { + size = 3; + } else { /* offset <= 0xffffff */ + size = 4; + } + } + if (node->size != size || node->offset != offset) { + node->size = size; + node->offset = offset; + changed++; + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + pathmask |= bitmask; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + pathbits |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + pathmask &= ~bitmask; + pathbits &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +done: + if (verbose > 0) + printf("Found %d changes\n", changed); + return changed; +} + +/* + * Emit a trie for the given tree into the data array. + */ +static void emit(struct tree *tree, unsigned char *data) +{ + struct node *node; + unsigned int leftmask; + unsigned int rightmask; + unsigned int bitmask; + int offlen; + int offset; + int index; + int indent; + unsigned char byte; + + index = tree->index; + data += index; + indent = 1; + if (verbose > 0) + printf("Emitting %s_%x\n", tree->type, tree->maxage); + if (tree->childnode == LEAF) { + assert(tree->root); + tree->leaf_emit(tree->root, data); + return; + } + + assert(tree->childnode == NODE); + node = tree->root; + leftmask = rightmask = 0; + while (node) { + if (!node->mark) + goto skip; + assert(node->offset != -1); + assert(node->index == index); + + byte = 0; + if (node->nextbyte) + byte |= NEXTBYTE; + byte |= (node->bitnum & BITNUM); + if (node->left && node->right) { + if (node->leftnode == NODE) + byte |= LEFTNODE; + if (node->rightnode == NODE) + byte |= RIGHTNODE; + if (node->offset <= 0xff) + offlen = 1; + else if (node->offset <= 0xffff) + offlen = 2; + else + offlen = 3; + offset = node->offset; + byte |= offlen << OFFLEN_SHIFT; + *data++ = byte; + index++; + while (offlen--) { + *data++ = offset & 0xff; + index++; + offset >>= 8; + } + } else if (node->left) { + if (node->leftnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else if (node->right) { + byte |= RIGHTNODE; + if (node->rightnode == NODE) + byte |= TRIENODE; + *data++ = byte; + index++; + } else { + assert(0); + } +skip: + while (node) { + bitmask = 1 << node->bitnum; + if (node->mark && (leftmask & bitmask) == 0) { + leftmask |= bitmask; + if (node->leftnode == LEAF) { + assert(node->left); + data = tree->leaf_emit(node->left, + data); + index += tree->leaf_size(node->left); + } else if (node->left) { + assert(node->leftnode == NODE); + indent += 1; + node = node->left; + break; + } + } + if (node->mark && (rightmask & bitmask) == 0) { + rightmask |= bitmask; + if (node->rightnode == LEAF) { + assert(node->right); + data = tree->leaf_emit(node->right, + data); + index += tree->leaf_size(node->right); + } else if (node->right) { + assert(node->rightnode==NODE); + indent += 1; + node = node->right; + break; + } + } + leftmask &= ~bitmask; + rightmask &= ~bitmask; + node = node->parent; + indent -= 1; + } + } +} + +/* ------------------------------------------------------------------ */ + +/* + * Unicode data. + * + * We need to keep track of the Canonical Combining Class, the Age, + * and decompositions for a code point. + * + * For the Age, we store the index into the ages table. Effectively + * this is a generation number that the table maps to a unicode + * version. + * + * The correction field is used to indicate that this entry is in the + * corrections array, which contains decompositions that were + * corrected in later revisions. The value of the correction field is + * the Unicode version in which the mapping was corrected. + */ +struct unicode_data { + unsigned int code; + int ccc; + int gen; + int correction; + unsigned int *utf32nfkdi; + unsigned int *utf32nfkdicf; + char *utf8nfkdi; + char *utf8nfkdicf; +}; + +struct unicode_data unicode_data[0x110000]; +struct unicode_data *corrections; +int corrections_count; + +struct tree *nfkdi_tree; +struct tree *nfkdicf_tree; + +struct tree *trees; +int trees_count; + +/* + * Check the corrections array to see if this entry was corrected at + * some point. + */ +static struct unicode_data *corrections_lookup(struct unicode_data *u) +{ + int i; + + for (i = 0; i != corrections_count; i++) + if (u->code == corrections[i].code) + return &corrections[i]; + return u; +} + +static int nfkdi_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static int nfkdicf_equal(void *l, void *r) +{ + struct unicode_data *left = l; + struct unicode_data *right = r; + + if (left->gen != right->gen) + return 0; + if (left->ccc != right->ccc) + return 0; + if (left->utf8nfkdicf && right->utf8nfkdicf && + strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0) + return 1; + if (left->utf8nfkdicf && right->utf8nfkdicf) + return 0; + if (left->utf8nfkdicf || right->utf8nfkdicf) + return 0; + if (left->utf8nfkdi && right->utf8nfkdi && + strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0) + return 1; + if (left->utf8nfkdi || right->utf8nfkdi) + return 0; + return 1; +} + +static void nfkdi_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static void nfkdicf_print(void *l, int indent) +{ + struct unicode_data *leaf = l; + + printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, + leaf->code, leaf->ccc, leaf->gen); + if (leaf->utf8nfkdicf) + printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf); + else if (leaf->utf8nfkdi) + printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); + printf("\n"); +} + +static int nfkdi_mark(void *l) +{ + return 1; +} + +static int nfkdicf_mark(void *l) +{ + struct unicode_data *leaf = l; + + if (leaf->utf8nfkdicf) + return 1; + return 0; +} + +static int correction_mark(void *l) +{ + struct unicode_data *leaf = l; + + return leaf->correction; +} + +static int nfkdi_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int nfkdicf_size(void *l) +{ + struct unicode_data *leaf = l; + + int size = 2; + if (leaf->utf8nfkdicf) + size += strlen(leaf->utf8nfkdicf) + 1; + else if (leaf->utf8nfkdi) + size += strlen(leaf->utf8nfkdi) + 1; + return size; +} + +static int *nfkdi_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static int *nfkdicf_index(struct tree *tree, void *l) +{ + struct unicode_data *leaf = l; + + return &tree->leafindex[leaf->code]; +} + +static unsigned char *nfkdi_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static unsigned char *nfkdicf_emit(void *l, unsigned char *data) +{ + struct unicode_data *leaf = l; + unsigned char *s; + + *data++ = leaf->gen; + if (leaf->utf8nfkdicf) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdicf; + while ((*data++ = *s++) != 0) + ; + } else if (leaf->utf8nfkdi) { + *data++ = DECOMPOSE; + s = (unsigned char*)leaf->utf8nfkdi; + while ((*data++ = *s++) != 0) + ; + } else { + *data++ = leaf->ccc; + } + return data; +} + +static void utf8_create(struct unicode_data *data) +{ + char utf[18*4+1]; + char *u; + unsigned int *um; + int i; + + u = utf; + um = data->utf32nfkdi; + if (um) { + for (i = 0; um[i]; i++) + u += utf8encode(u, um[i]); + *u = '\0'; + data->utf8nfkdi = strdup(utf); + } + u = utf; + um = data->utf32nfkdicf; + if (um) { + for (i = 0; um[i]; i++) + u += utf8encode(u, um[i]); + *u = '\0'; + if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, utf)) + data->utf8nfkdicf = strdup(utf); + } +} + +static void utf8_init(void) +{ + unsigned int unichar; + int i; + + for (unichar = 0; unichar != 0x110000; unichar++) + utf8_create(&unicode_data[unichar]); + + for (i = 0; i != corrections_count; i++) + utf8_create(&corrections[i]); +} + +static void trees_init(void) +{ + struct unicode_data *data; + unsigned int maxage; + unsigned int nextage; + int count; + int i; + int j; + + /* Count the number of different ages. */ + count = 0; + nextage = (unsigned int)-1; + do { + maxage = nextage; + nextage = 0; + for (i = 0; i <= corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + count++; + } while (nextage); + + /* Two trees per age: nfkdi and nfkdicf */ + trees_count = count * 2; + trees = calloc(trees_count, sizeof(struct tree)); + + /* Assign ages to the trees. */ + count = trees_count; + nextage = (unsigned int)-1; + do { + maxage = nextage; + trees[--count].maxage = maxage; + trees[--count].maxage = maxage; + nextage = 0; + for (i = 0; i <= corrections_count; i++) { + data = &corrections[i]; + if (nextage < data->correction && + data->correction < maxage) + nextage = data->correction; + } + } while (nextage); + + /* The ages assigned above are off by one. */ + for (i = 0; i != trees_count; i++) { + j = 0; + while (ages[j] < trees[i].maxage) + j++; + trees[i].maxage = ages[j-1]; + } + + /* Set up the forwarding between trees. */ + trees[trees_count-2].next = &trees[trees_count-1]; + trees[trees_count-1].leaf_mark = nfkdi_mark; + trees[trees_count-2].leaf_mark = nfkdicf_mark; + for (i = 0; i != trees_count-2; i += 2) { + trees[i].next = &trees[trees_count-2]; + trees[i].leaf_mark = correction_mark; + trees[i+1].next = &trees[trees_count-1]; + trees[i+1].leaf_mark = correction_mark; + } + + /* Assign the callouts. */ + for (i = 0; i != trees_count; i += 2) { + trees[i].type = "nfkdicf"; + trees[i].leaf_equal = nfkdicf_equal; + trees[i].leaf_print = nfkdicf_print; + trees[i].leaf_size = nfkdicf_size; + trees[i].leaf_index = nfkdicf_index; + trees[i].leaf_emit = nfkdicf_emit; + + trees[i+1].type = "nfkdi"; + trees[i+1].leaf_equal = nfkdi_equal; + trees[i+1].leaf_print = nfkdi_print; + trees[i+1].leaf_size = nfkdi_size; + trees[i+1].leaf_index = nfkdi_index; + trees[i+1].leaf_emit = nfkdi_emit; + } + + /* Finish init. */ + for (i = 0; i != trees_count; i++) + trees[i].childnode = NODE; +} + +static void trees_populate(void) +{ + struct unicode_data *data; + unsigned int unichar; + char keyval[4]; + int keylen; + int i; + + for (i = 0; i != trees_count; i++) { + if (verbose > 0) { + printf("Populating %s_%x\n", + trees[i].type, trees[i].maxage); + } + for (unichar = 0; unichar != 0x110000; unichar++) { + if (unicode_data[unichar].gen < 0) + continue; + keylen = utf8encode(keyval, unichar); + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= trees[i].maxage) + data = &unicode_data[unichar]; + insert(&trees[i], keyval, keylen, data); + } + } +} + +static void trees_reduce(void) +{ + int i; + int size; + int changed; + + for (i = 0; i != trees_count; i++) + prune(&trees[i]); + for (i = 0; i != trees_count; i++) + mark_nodes(&trees[i]); + do { + size = 0; + for (i = 0; i != trees_count; i++) + size = index_nodes(&trees[i], size); + changed = 0; + for (i = 0; i != trees_count; i++) + changed += size_nodes(&trees[i]); + } while (changed); + + utf8data = calloc(size, 1); + utf8data_size = size; + for (i = 0; i != trees_count; i++) + emit(&trees[i], utf8data); + + if (verbose > 0) { + for (i = 0; i != trees_count; i++) { + printf("%s_%x idx %d\n", + trees[i].type, trees[i].maxage, trees[i].index); + } + } + + nfkdi = utf8data + trees[trees_count-1].index; + nfkdicf = utf8data + trees[trees_count-2].index; + + nfkdi_tree = &trees[trees_count-1]; + nfkdicf_tree = &trees[trees_count-2]; +} + +static void verify(struct tree *tree) +{ + struct unicode_data *data; + utf8leaf_t *leaf; + unsigned int unichar; + char key[4]; + int report; + int nocf; + + if (verbose > 0) + printf("Verifying %s_%x\n", tree->type, tree->maxage); + nocf = strcmp(tree->type, "nfkdicf"); + + for (unichar = 0; unichar != 0x110000; unichar++) { + report = 0; + data = corrections_lookup(&unicode_data[unichar]); + if (data->correction <= tree->maxage) + data = &unicode_data[unichar]; + utf8encode(key,unichar); + leaf = utf8lookup(tree, key); + if (!leaf) { + if (data->gen != -1) + report++; + if (unichar < 0xd800 || unichar > 0xdfff) + report++; + } else { + if (unichar >= 0xd800 && unichar <= 0xdfff) + report++; + if (data->gen == -1) + report++; + if (data->gen != LEAF_GEN(leaf)) + report++; + if (LEAF_CCC(leaf) == DECOMPOSE) { + if (nocf) { + if (!data->utf8nfkdi) { + report++; + } else if (strcmp(data->utf8nfkdi, + LEAF_STR(leaf))) { + report++; + } + } else { + if (!data->utf8nfkdicf && + !data->utf8nfkdi) { + report++; + } else if (data->utf8nfkdicf) { + if (strcmp(data->utf8nfkdicf, + LEAF_STR(leaf))) + report++; + } else if (strcmp(data->utf8nfkdi, + LEAF_STR(leaf))) { + report++; + } + } + } else if (data->ccc != LEAF_CCC(leaf)) { + report++; + } + } + if (report) { + printf("%X code %X gen %d ccc %d" + " nfkdi -> \"%s\"", + unichar, data->code, data->gen, + data->ccc, + data->utf8nfkdi); + if (leaf) { + printf(" gen %d ccc %d" + " nfkdi -> \"%s\"", + LEAF_GEN(leaf), + LEAF_CCC(leaf), + LEAF_CCC(leaf) == DECOMPOSE ? + LEAF_STR(leaf) : ""); + } + printf("\n"); + } + } +} + +static void trees_verify(void) +{ + int i; + + for (i = 0; i != trees_count; i++) + verify(&trees[i]); +} + +/* ------------------------------------------------------------------ */ + +static void help(void) +{ + printf("Usage: %s [options]\n", argv0); + printf("\n"); + printf("This program creates an a data trie used for parsing and\n"); + printf("normalization of UTF-8 strings. The trie is derived from\n"); + printf("a set of input files from the Unicode character database\n"); + printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n"); + printf("\n"); + printf("The generated tree supports two normalization forms:\n"); + printf("\n"); + printf("\tnfkdi:\n"); + printf("\t- Apply unicode normalization form NFKD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\n"); + printf("\tnfkdicf:\n"); + printf("\t- Apply unicode normalization form NFKD.\n"); + printf("\t- Remove any Default_Ignorable_Code_Point.\n"); + printf("\t- Apply a full casefold (C + F).\n"); + printf("\n"); + printf("These forms were chosen as being most useful when dealing\n"); + printf("with file names: NFKD catches most cases where characters\n"); + printf("should be considered equivalent. The ignorables are mostly\n"); + printf("invisible, making names hard to type.\n"); + printf("\n"); + printf("The options to specify the files to be used are listed\n"); + printf("below with their default values, which are the names used\n"); + printf("by version 11.0.0 of the Unicode Character Database.\n"); + printf("\n"); + printf("The input files:\n"); + printf("\t-a %s\n", AGE_NAME); + printf("\t-c %s\n", CCC_NAME); + printf("\t-p %s\n", PROP_NAME); + printf("\t-d %s\n", DATA_NAME); + printf("\t-f %s\n", FOLD_NAME); + printf("\t-n %s\n", NORM_NAME); + printf("\n"); + printf("Additionally, the generated tables are tested using:\n"); + printf("\t-t %s\n", TEST_NAME); + printf("\n"); + printf("Finally, the output file:\n"); + printf("\t-o %s\n", UTF8_NAME); + printf("\n"); +} + +static void usage(void) +{ + help(); + exit(1); +} + +static void open_fail(const char *name, int error) +{ + printf("Error %d opening %s: %s\n", error, name, strerror(error)); + exit(1); +} + +static void file_fail(const char *filename) +{ + printf("Error parsing %s\n", filename); + exit(1); +} + +static void line_fail(const char *filename, const char *line) +{ + printf("Error parsing %s:%s\n", filename, line); + exit(1); +} + +/* ------------------------------------------------------------------ */ + +static void print_utf32(unsigned int *utf32str) +{ + int i; + + for (i = 0; utf32str[i]; i++) + printf(" %X", utf32str[i]); +} + +static void print_utf32nfkdi(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdi); + printf("\n"); +} + +static void print_utf32nfkdicf(unsigned int unichar) +{ + printf(" %X ->", unichar); + print_utf32(unicode_data[unichar].utf32nfkdicf); + printf("\n"); +} + +/* ------------------------------------------------------------------ */ + +static void age_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + int gen; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", age_name); + + file = fopen(age_name, "r"); + if (!file) + open_fail(age_name, errno); + count = 0; + + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d_%d\n", + major, minor, revision); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages_count++; + if (verbose > 1) + printf(" Age V%d_%d\n", major, minor); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + } + + /* We must have found something above. */ + if (verbose > 1) + printf("%d age entries\n", ages_count); + if (ages_count == 0 || ages_count > MAXGEN) + file_fail(age_name); + + /* There is a 0 entry. */ + ages_count++; + ages = calloc(ages_count + 1, sizeof(*ages)); + /* And a guard entry. */ + ages[ages_count] = (unsigned int)-1; + + rewind(file); + count = 0; + gen = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "# Age=V%d_%d_%d", + &major, &minor, &revision); + if (ret == 3) { + ages[++gen] = + UNICODE_AGE(major, minor, revision); + if (verbose > 1) + printf(" Age V%d_%d_%d = gen %d\n", + major, minor, revision, gen); + if (!age_valid(major, minor, revision)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "# Age=V%d_%d", &major, &minor); + if (ret == 2) { + ages[++gen] = UNICODE_AGE(major, minor, 0); + if (verbose > 1) + printf(" Age V%d_%d = %d\n", + major, minor, gen); + if (!age_valid(major, minor, 0)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X..%X ; %d.%d #", + &first, &last, &major, &minor); + if (ret == 4) { + for (unichar = first; unichar <= last; unichar++) + unicode_data[unichar].gen = gen; + count += 1 + last - first; + if (verbose > 1) + printf(" %X..%X gen %d\n", first, last, gen); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(age_name, line); + continue; + } + ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor); + if (ret == 3) { + unicode_data[unichar].gen = gen; + count++; + if (verbose > 1) + printf(" %X gen %d\n", unichar, gen); + if (!utf32valid(unichar)) + line_fail(age_name, line); + continue; + } + } + unicode_maxage = ages[gen]; + fclose(file); + + /* Nix surrogate block */ + if (verbose > 1) + printf(" Removing surrogate block D800..DFFF\n"); + for (unichar = 0xd800; unichar <= 0xdfff; unichar++) + unicode_data[unichar].gen = -1; + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(age_name); +} + +static void ccc_init(void) +{ + FILE *file; + unsigned int first; + unsigned int last; + unsigned int unichar; + unsigned int value; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", ccc_name); + + file = fopen(ccc_name, "r"); + if (!file) + open_fail(ccc_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value); + if (ret == 3) { + for (unichar = first; unichar <= last; unichar++) { + unicode_data[unichar].ccc = value; + count++; + } + if (verbose > 1) + printf(" %X..%X ccc %d\n", first, last, value); + if (!utf32valid(first) || !utf32valid(last)) + line_fail(ccc_name, line); + continue; + } + ret = sscanf(line, "%X ; %d #", &unichar, &value); + if (ret == 2) { + unicode_data[unichar].ccc = value; + count++; + if (verbose > 1) + printf(" %X ccc %d\n", unichar, value); + if (!utf32valid(unichar)) + line_fail(ccc_name, line); + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(ccc_name); +} + +static void nfkdi_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + unsigned int *um; + int count; + int i; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", data_name); + file = fopen(data_name, "r"); + if (!file) + open_fail(data_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];", + &unichar, buf0); + if (ret != 2) + continue; + if (!utf32valid(unichar)) + line_fail(data_name, line); + + s = buf0; + /* skip over */ + if (*s == '<') + while (*s++ != ' ') + ; + /* decode the decomposition into UTF-32 */ + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(data_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(data_name); +} + +static void nfkdicf_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char status; + char *s; + unsigned int *um; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", fold_name); + file = fopen(fold_name, "r"); + if (!file) + open_fail(fold_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0); + if (ret != 3) + continue; + if (!utf32valid(unichar)) + line_fail(fold_name, line); + /* Use the C+F casefold. */ + if (status != 'C' && status != 'F') + continue; + s = buf0; + if (*s == '<') + while (*s++ != ' ') + ; + i = 0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(fold_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + fclose(file); + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(fold_name); +} + +static void ignore_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int first; + unsigned int last; + unsigned int *um; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", prop_name); + file = fopen(prop_name, "r"); + if (!file) + open_fail(prop_name, errno); + assert(file); + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0); + if (ret == 3) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(first) || !utf32valid(last)) + line_fail(prop_name, line); + for (unichar = first; unichar <= last; unichar++) { + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = um; + count++; + } + if (verbose > 1) + printf(" %X..%X Default_Ignorable_Code_Point\n", + first, last); + continue; + } + ret = sscanf(line, "%X ; %s # ", &unichar, buf0); + if (ret == 2) { + if (strcmp(buf0, "Default_Ignorable_Code_Point")) + continue; + if (!utf32valid(unichar)) + line_fail(prop_name, line); + free(unicode_data[unichar].utf32nfkdi); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdi = um; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(sizeof(unsigned int)); + *um = 0; + unicode_data[unichar].utf32nfkdicf = um; + if (verbose > 1) + printf(" %X Default_Ignorable_Code_Point\n", + unichar); + count++; + continue; + } + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(prop_name); +} + +static void corrections_init(void) +{ + FILE *file; + unsigned int unichar; + unsigned int major; + unsigned int minor; + unsigned int revision; + unsigned int age; + unsigned int *um; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + char *s; + int i; + int count; + int ret; + + if (verbose > 0) + printf("Parsing %s\n", norm_name); + file = fopen(norm_name, "r"); + if (!file) + open_fail(norm_name, errno); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + count++; + } + corrections = calloc(count, sizeof(struct unicode_data)); + corrections_count = count; + rewind(file); + + count = 0; + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #", + &unichar, buf0, buf1, + &major, &minor, &revision); + if (ret != 6) + continue; + if (!utf32valid(unichar) || !age_valid(major, minor, revision)) + line_fail(norm_name, line); + corrections[count] = unicode_data[unichar]; + assert(corrections[count].code == unichar); + age = UNICODE_AGE(major, minor, revision); + corrections[count].correction = age; + + i = 0; + s = buf0; + while (*s) { + mapping[i] = strtoul(s, &s, 16); + if (!utf32valid(mapping[i])) + line_fail(norm_name, line); + i++; + } + mapping[i++] = 0; + + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + corrections[count].utf32nfkdi = um; + + if (verbose > 1) + printf(" %X -> %s -> %s V%d_%d_%d\n", + unichar, buf0, buf1, major, minor, revision); + count++; + } + fclose(file); + + if (verbose > 0) + printf("Found %d entries\n", count); + if (count == 0) + file_fail(norm_name); +} + +/* ------------------------------------------------------------------ */ + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;;Lo;0;L;;;;;N;;;;; + * D7A3;;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = (Sindex % TCount) + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * TIndex = (Sindex % TCount) + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = + * } else { + * TPart = TBase + TIndex + * d = + * } + * + */ + +static void +hangul_decompose(void) +{ + unsigned int sb = 0xAC00; + unsigned int lb = 0x1100; + unsigned int vb = 0x1161; + unsigned int tb = 0x11a7; + /* unsigned int lc = 19; */ + unsigned int vc = 21; + unsigned int tc = 28; + unsigned int nc = (vc * tc); + /* unsigned int sc = (lc * nc); */ + unsigned int unichar; + unsigned int mapping[4]; + unsigned int *um; + int count; + int i; + + if (verbose > 0) + printf("Decomposing hangul\n"); + /* Hangul */ + count = 0; + for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) { + unsigned int si = unichar - sb; + unsigned int li = si / nc; + unsigned int vi = (si % nc) / tc; + unsigned int ti = si % tc; + + i = 0; + mapping[i++] = lb + li; + mapping[i++] = vb + vi; + if (ti) + mapping[i++] = tb + ti; + mapping[i++] = 0; + + assert(!unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + + assert(!unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + + if (verbose > 1) + print_utf32nfkdi(unichar); + + count++; + } + if (verbose > 0) + printf("Created %d entries\n", count); +} + +static void nfkdi_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdi\n"); + + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdi) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdi; + while (*um) { + dc = unicode_data[*um].utf32nfkdi; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdi); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdi = um; + } + /* Add this decomposition to nfkdicf if there is no entry. */ + if (!unicode_data[unichar].utf32nfkdicf) { + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdi(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +static void nfkdicf_decompose(void) +{ + unsigned int unichar; + unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */ + unsigned int *um; + unsigned int *dc; + int count; + int i; + int j; + int ret; + + if (verbose > 0) + printf("Decomposing nfkdicf\n"); + count = 0; + for (unichar = 0; unichar != 0x110000; unichar++) { + if (!unicode_data[unichar].utf32nfkdicf) + continue; + for (;;) { + ret = 1; + i = 0; + um = unicode_data[unichar].utf32nfkdicf; + while (*um) { + dc = unicode_data[*um].utf32nfkdicf; + if (dc) { + for (j = 0; dc[j]; j++) + mapping[i++] = dc[j]; + ret = 0; + } else { + mapping[i++] = *um; + } + um++; + } + mapping[i++] = 0; + if (ret) + break; + free(unicode_data[unichar].utf32nfkdicf); + um = malloc(i * sizeof(unsigned int)); + memcpy(um, mapping, i * sizeof(unsigned int)); + unicode_data[unichar].utf32nfkdicf = um; + } + if (verbose > 1) + print_utf32nfkdicf(unichar); + count++; + } + if (verbose > 0) + printf("Processed %d entries\n", count); +} + +/* ------------------------------------------------------------------ */ + +int utf8agemax(struct tree *, const char *); +int utf8nagemax(struct tree *, const char *, size_t); +int utf8agemin(struct tree *, const char *); +int utf8nagemin(struct tree *, const char *, size_t); +ssize_t utf8len(struct tree *, const char *); +ssize_t utf8nlen(struct tree *, const char *, size_t); +struct utf8cursor; +int utf8cursor(struct utf8cursor *, struct tree *, const char *); +int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); +int utf8byte(struct utf8cursor *); + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". + */ +static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) +{ + utf8trie_t *trie = utf8data + tree->index; + int offlen; + int offset; + int mask; + int node; + + if (!tree) + return NULL; + if (len == 0) + return NULL; + node = 1; + while (node) { + offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT; + if (*trie & NEXTBYTE) { + if (--len == 0) + return NULL; + s++; + } + mask = 1 << (*trie & BITNUM); + if (*s & mask) { + /* Right leg */ + if (offlen) { + /* Right node at offset of trie */ + node = (*trie & RIGHTNODE); + offset = trie[offlen]; + while (--offlen) { + offset <<= 8; + offset |= trie[offlen]; + } + trie += offset; + } else if (*trie & RIGHTPATH) { + /* Right node after this node */ + node = (*trie & TRIENODE); + trie++; + } else { + /* No right node. */ + return NULL; + } + } else { + /* Left leg */ + if (offlen) { + /* Left node after this node. */ + node = (*trie & LEFTNODE); + trie += offlen + 1; + } else if (*trie & RIGHTPATH) { + /* No left node. */ + return NULL; + } else { + /* Left node after this node */ + node = (*trie & TRIENODE); + trie++; + } + } + } + return trie; +} + +/* + * Use trie to scan s. + * Returns the leaf if one exists, NULL otherwise. + * + * Forwards to trie_nlookup(). + */ +static utf8leaf_t *utf8lookup(struct tree *tree, const char *s) +{ + return utf8nlookup(tree, s, (size_t)-1); +} + +/* + * Return the number of bytes used by the current UTF-8 sequence. + * Assumes the input points to the first byte of a valid UTF-8 + * sequence. + */ +static inline int utf8clen(const char *s) +{ + unsigned char c = *s; + return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); +} + +/* + * Maximum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if only non-assigned code points are used. + */ +int utf8agemax(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Minimum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if non-assigned code points are used. + */ +int utf8agemin(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + int age; + int leaf_age; + + if (!tree) + return -1; + age = tree->maxage; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemax(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age > age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemin(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int leaf_age; + int age; + + if (!tree) + return -1; + age = tree->maxage; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + leaf_age = ages[LEAF_GEN(leaf)]; + if (leaf_age <= tree->maxage && leaf_age < age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} + +/* + * Length of the normalization of s. + * Return -1 if s is not valid UTF-8 unicode. + * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t utf8len(struct tree *tree, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (*s) { + if (!(leaf = utf8lookup(tree, s))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t utf8nlen(struct tree *tree, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!tree) + return -1; + while (len && *s) { + if (!(leaf = utf8nlookup(tree, s, len))) + return -1; + if (ages[LEAF_GEN(leaf)] > tree->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + struct tree *tree; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; + unsigned int unichar; +}; + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : string. + * len : length of s. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int utf8ncursor(struct utf8cursor *u8c, struct tree *tree, const char *s, + size_t len) +{ + if (!tree) + return -1; + if (!s) + return -1; + u8c->tree = tree; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->unichar = 0; + /* Check we didn't clobber the maximum length. */ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * s : NUL-terminated string. + * u8c : pointer to cursor. + * trie : utf8trie_t to use for normalization. + * + * Returns -1 on error, 0 on success. + */ +int utf8cursor(struct utf8cursor *u8c, struct tree *tree, const char *s) +{ + return utf8ncursor(u8c, tree, s, (unsigned int)-1); +} + +/* + * Get one byte from the normalized form of the string described by u8c. + * + * Returns the byte cast to an unsigned char on succes, and -1 on failure. + * + * The cursor keeps track of the location in the string in u8c->s. + * When a character is decomposed, the current location is stored in + * u8c->p, and u8c->s is set to the start of the decomposition. Note + * that bytes from a decomposition do not count against u8c->len. + * + * Characters are emitted if they match the current CCC in u8c->ccc. + * Hitting end-of-string while u8c->ccc == STOPPER means we're done, + * and the function returns 0 in that case. + * + * Sorting by CCC is done by repeatedly scanning the string. The + * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at + * the start of the scan. The first pass finds the lowest CCC to be + * emitted and stores it in u8c->nccc, the second pass emits the + * characters with this CCC and finds the next lowest CCC. This limits + * the number of passes to 1 + the number of different CCCs in the + * sequence being scanned. + * + * Therefore: + * u8c->p != NULL -> a decomposition is being scanned. + * u8c->ss != NULL -> this is a repeating scan. + * u8c->ccc == -1 -> this is the first scan of a repeating scan. + */ +int utf8byte(struct utf8cursor *u8c) +{ + utf8leaf_t *leaf; + int ccc; + + for (;;) { + /* Check for the end of a decomposed character. */ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->tree, u8c->s); + else + leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + /* Characters that are too new have CCC 0. */ + if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) { + ccc = STOPPER; + } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->tree, u8c->s); + ccc = LEAF_CCC(leaf); + } + u8c->unichar = utf8decode(u8c->s); + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ + ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. + */ + assert(u8c->ccc == STOPPER); + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} + +/* ------------------------------------------------------------------ */ + +static int normalize_line(struct tree *tree) +{ + char *s; + char *t; + int c; + struct utf8cursor u8c; + + /* First test: null-terminated string. */ + s = buf2; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + /* Second test: length-limited string. */ + s = buf2; + /* Replace NUL with a value that will cause an error if seen. */ + s[strlen(s) + 1] = -1; + t = buf3; + if (utf8cursor(&u8c, tree, s)) + return -1; + while ((c = utf8byte(&u8c)) > 0) + if (c != (unsigned char)*t++) + return -1; + if (c < 0) + return -1; + if (*t != 0) + return -1; + + return 0; +} + +static void normalization_test(void) +{ + FILE *file; + unsigned int unichar; + struct unicode_data *data; + char *s; + char *t; + int ret; + int ignorables; + int tests = 0; + int failures = 0; + + if (verbose > 0) + printf("Parsing %s\n", test_name); + /* Step one, read data from file. */ + file = fopen(test_name, "r"); + if (!file) + open_fail(test_name, errno); + + while (fgets(line, LINESIZE, file)) { + ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];", + buf0, buf1); + if (ret != 2 || *line == '#') + continue; + s = buf0; + t = buf2; + while (*s) { + unichar = strtoul(s, &s, 16); + t += utf8encode(t, unichar); + } + *t = '\0'; + + ignorables = 0; + s = buf1; + t = buf3; + while (*s) { + unichar = strtoul(s, &s, 16); + data = &unicode_data[unichar]; + if (data->utf8nfkdi && !*data->utf8nfkdi) + ignorables = 1; + else + t += utf8encode(t, unichar); + } + *t = '\0'; + + tests++; + if (normalize_line(nfkdi_tree) < 0) { + printf("Line %s -> %s", buf0, buf1); + if (ignorables) + printf(" (ignorables removed)"); + printf(" failure\n"); + failures++; + } + } + fclose(file); + if (verbose > 0) + printf("Ran %d tests with %d failures\n", tests, failures); + if (failures) + file_fail(test_name); +} + +/* ------------------------------------------------------------------ */ + +static void write_file(void) +{ + FILE *file; + int i; + int j; + int t; + int gen; + + if (verbose > 0) + printf("Writing %s\n", utf8_name); + file = fopen(utf8_name, "w"); + if (!file) + open_fail(utf8_name, errno); + + fprintf(file, "/* This file is generated code, do not edit. */\n"); + fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n"); + fprintf(file, "#error Only nls_utf8-norm.c should include this file.\n"); + fprintf(file, "#endif\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8vers = %#x;\n", + unicode_maxage); + fprintf(file, "\n"); + fprintf(file, "static const unsigned int utf8agetab[] = {\n"); + for (i = 0; i != ages_count; i++) + fprintf(file, "\t%#x%s\n", ages[i], + ages[i] == unicode_maxage ? "" : ","); + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n"); + t = 0; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n"); + t = 1; + for (gen = 0; gen < ages_count; gen++) { + fprintf(file, "\t{ %#x, %d }%s\n", + ages[gen], trees[t].index, + ages[gen] == unicode_maxage ? "" : ","); + if (trees[t].maxage == ages[gen]) + t += 2; + } + fprintf(file, "};\n"); + fprintf(file, "\n"); + fprintf(file, "static const unsigned char utf8data[%zd] = {\n", + utf8data_size); + t = 0; + for (i = 0; i != utf8data_size; i += 16) { + if (i == trees[t].index) { + fprintf(file, "\t/* %s_%x */\n", + trees[t].type, trees[t].maxage); + if (t < trees_count-1) + t++; + } + fprintf(file, "\t"); + for (j = i; j != i + 16; j++) + fprintf(file, "0x%.2x%s", utf8data[j], + (j < utf8data_size -1 ? "," : "")); + fprintf(file, "\n"); + } + fprintf(file, "};\n"); + fclose(file); +} + +/* ------------------------------------------------------------------ */ + +int main(int argc, char *argv[]) +{ + unsigned int unichar; + int opt; + + argv0 = argv[0]; + + while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) { + switch (opt) { + case 'a': + age_name = optarg; + break; + case 'c': + ccc_name = optarg; + break; + case 'd': + data_name = optarg; + break; + case 'f': + fold_name = optarg; + break; + case 'n': + norm_name = optarg; + break; + case 'o': + utf8_name = optarg; + break; + case 'p': + prop_name = optarg; + break; + case 't': + test_name = optarg; + break; + case 'v': + verbose++; + break; + case 'h': + help(); + exit(0); + default: + usage(); + } + } + + if (verbose > 1) + help(); + for (unichar = 0; unichar != 0x110000; unichar++) + unicode_data[unichar].code = unichar; + age_init(); + ccc_init(); + nfkdi_init(); + nfkdicf_init(); + ignore_init(); + corrections_init(); + hangul_decompose(); + nfkdi_decompose(); + nfkdicf_decompose(); + utf8_init(); + trees_init(); + trees_populate(); + trees_reduce(); + trees_verify(); + /* Prevent "unused function" warning. */ + (void)lookup(nfkdi_tree, " "); + if (verbose > 2) + tree_walk(nfkdi_tree); + if (verbose > 2) + tree_walk(nfkdicf_tree); + normalization_test(); + write_file(); + + return 0; +} From patchwork Thu Dec 6 23:08:54 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717203 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2D5EA17DB for ; Thu, 6 Dec 2018 23:10:02 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1B2EA2F21D for ; Thu, 6 Dec 2018 23:10:02 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 0B6352F22E; Thu, 6 Dec 2018 23:10:02 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A6B982F21D for ; Thu, 6 Dec 2018 23:10:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726345AbeLFXKA (ORCPT ); Thu, 6 Dec 2018 18:10:00 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56124 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726331AbeLFXKA (ORCPT ); Thu, 6 Dec 2018 18:10:00 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 4AFF827ED7B From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 14/23] nls: utf8: Move nls-utf8{,-core}.c Date: Thu, 6 Dec 2018 18:08:54 -0500 Message-Id: <20181206230903.30011-15-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi nls_utf8 will be generated from multiple files, so lets move the existing code to a -core suffix. Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/Makefile | 3 +++ fs/nls/{nls_utf8.c => nls_utf8-core.c} | 0 2 files changed, 3 insertions(+) rename fs/nls/{nls_utf8.c => nls_utf8-core.c} (100%) diff --git a/fs/nls/Makefile b/fs/nls/Makefile index 840e06aefd47..c94221b6108d 100644 --- a/fs/nls/Makefile +++ b/fs/nls/Makefile @@ -43,7 +43,10 @@ obj-$(CONFIG_NLS_ISO8859_14) += nls_iso8859-14.o obj-$(CONFIG_NLS_ISO8859_15) += nls_iso8859-15.o obj-$(CONFIG_NLS_KOI8_R) += nls_koi8-r.o obj-$(CONFIG_NLS_KOI8_U) += nls_koi8-u.o nls_koi8-ru.o + obj-$(CONFIG_NLS_UTF8) += nls_utf8.o +nls_utf8-y += nls_utf8-core.o + obj-$(CONFIG_NLS_MAC_CELTIC) += mac-celtic.o obj-$(CONFIG_NLS_MAC_CENTEURO) += mac-centeuro.o obj-$(CONFIG_NLS_MAC_CROATIAN) += mac-croatian.o diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8-core.c similarity index 100% rename from fs/nls/nls_utf8.c rename to fs/nls/nls_utf8-core.c From patchwork Thu Dec 6 23:08:55 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717207 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EBB161932 for ; Thu, 6 Dec 2018 23:10:08 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D85982F205 for ; Thu, 6 Dec 2018 23:10:08 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id CC3882F21D; Thu, 6 Dec 2018 23:10:08 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4797C2F205 for ; Thu, 6 Dec 2018 23:10:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726349AbeLFXKG (ORCPT ); Thu, 6 Dec 2018 18:10:06 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56128 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726331AbeLFXKF (ORCPT ); Thu, 6 Dec 2018 18:10:05 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 8692F27ED5C From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH v4 15/23] nls: utf8: Introduce code for UTF-8 normalization Date: Thu, 6 Dec 2018 18:08:55 -0500 Message-Id: <20181206230903.30011-16-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Olaf Weber Supporting functions for UTF-8 normalization are in utf8norm.c with the header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf. nfkdi: - Apply unicode normalization form NFKD. - Remove any Default_Ignorable_Code_Point. nfkdicf: - Apply unicode normalization form NFKD. - Remove any Default_Ignorable_Code_Point. - Apply a full casefold (C + F). For the purposes of the code, a string is valid UTF-8 if: - The values encoded are 0x1..0x10FFFF. - The surrogate codepoints 0xD800..0xDFFFF are not encoded. - The shortest possible encoding is used for all values. The supporting functions work on null-terminated strings (utf8 prefix) and on length-limited strings (utf8n prefix). From the original SGI patch and for conformity with coding standards, the utf8data_t typedef was dropped, since it was just masking the struct keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I decided to keep it, since they are simple pointers to memory buffers, and using uchars here wouldn't provide any more meaningful information. Changes since RFC v2: - Merge to NLS system Changes since RFC v1: - utf8_version_is_supported receives maj, min and rev as separate arguments. (Olaf Weber) Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Rebase to Mainline] [Fix up checkpatch.pl warnings] [Drop typedefs] [Merge with NLS subsystem] --- fs/nls/Makefile | 2 + fs/nls/nls_utf8-norm.c | 640 +++++++++++++++++++++++++++++++++++++++++ fs/nls/utf8n.h | 112 ++++++++ 3 files changed, 754 insertions(+) create mode 100644 fs/nls/nls_utf8-norm.c create mode 100644 fs/nls/utf8n.h diff --git a/fs/nls/Makefile b/fs/nls/Makefile index c94221b6108d..bd13c1a90767 100644 --- a/fs/nls/Makefile +++ b/fs/nls/Makefile @@ -46,6 +46,7 @@ obj-$(CONFIG_NLS_KOI8_U) += nls_koi8-u.o nls_koi8-ru.o obj-$(CONFIG_NLS_UTF8) += nls_utf8.o nls_utf8-y += nls_utf8-core.o +nls_utf8-$(CONFIG_NLS_UTF8_NORMALIZATION) += nls_utf8-norm.o obj-$(CONFIG_NLS_MAC_CELTIC) += mac-celtic.o obj-$(CONFIG_NLS_MAC_CENTEURO) += mac-centeuro.o @@ -59,6 +60,7 @@ obj-$(CONFIG_NLS_MAC_ROMANIAN) += mac-romanian.o obj-$(CONFIG_NLS_MAC_ROMAN) += mac-roman.o obj-$(CONFIG_NLS_MAC_TURKISH) += mac-turkish.o +$(obj)/nls_utf8-norm.o: $(obj)/utf8data.h $(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE $(call cmd,mkutf8data) quiet_cmd_mkutf8data = MKUTF8DATA $@ diff --git a/fs/nls/nls_utf8-norm.c b/fs/nls/nls_utf8-norm.c new file mode 100644 index 000000000000..ca0bbf644b49 --- /dev/null +++ b/fs/nls/nls_utf8-norm.c @@ -0,0 +1,640 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include "utf8n.h" + +struct utf8data { + unsigned int maxage; + unsigned int offset; +}; + +#define __INCLUDED_FROM_UTF8NORM_C__ +#include "utf8data.h" +#undef __INCLUDED_FROM_UTF8NORM_C__ + +int utf8version_is_supported(u8 maj, u8 min, u8 rev) +{ + int i = ARRAY_SIZE(utf8agetab) - 1; + unsigned int sb_utf8version = UNICODE_AGE(maj, min, rev); + + while (i >= 0 && utf8agetab[i] != 0) { + if (sb_utf8version == utf8agetab[i]) + return 1; + i--; + } + return 0; +} +EXPORT_SYMBOL(utf8version_is_supported); + +/* + * UTF-8 valid ranges. + * + * The UTF-8 encoding spreads the bits of a 32bit word over several + * bytes. This table gives the ranges that can be held and how they'd + * be represented. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * There is an additional requirement on UTF-8, in that only the + * shortest representation of a 32bit value is to be used. A decoder + * must not decode sequences that do not satisfy this requirement. + * Thus the allowed ranges have a lower bound. + * + * 0x00000000 0x0000007F: 0xxxxxxx + * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx + * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx + * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + * + * Actual unicode characters are limited to the range 0x0 - 0x10FFFF, + * 17 planes of 65536 values. This limits the sequences actually seen + * even more, to just the following. + * + * 0 - 0x7F: 0 - 0x7F + * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF + * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF + * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF + * + * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed. + * + * Note that the longest sequence seen with valid usage is 4 bytes, + * the same a single UTF-32 character. This makes the UTF-8 + * representation of Unicode strictly smaller than UTF-32. + * + * The shortest sequence requirement was introduced by: + * Corrigendum #1: UTF-8 Shortest Form + * It can be found here: + * http://www.unicode.org/versions/corrigendum1.html + * + */ + +/* + * Return the number of bytes used by the current UTF-8 sequence. + * Assumes the input points to the first byte of a valid UTF-8 + * sequence. + */ +static inline int utf8clen(const char *s) +{ + unsigned char c = *s; + + return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); +} + +/* + * utf8trie_t + * + * A compact binary tree, used to decode UTF-8 characters. + * + * Internal nodes are one byte for the node itself, and up to three + * bytes for an offset into the tree. The first byte contains the + * following information: + * NEXTBYTE - flag - advance to next byte if set + * BITNUM - 3 bit field - the bit number to tested + * OFFLEN - 2 bit field - number of bytes in the offset + * if offlen == 0 (non-branching node) + * RIGHTPATH - 1 bit field - set if the following node is for the + * right-hand path (tested bit is set) + * TRIENODE - 1 bit field - set if the following node is an internal + * node, otherwise it is a leaf node + * if offlen != 0 (branching node) + * LEFTNODE - 1 bit field - set if the left-hand node is internal + * RIGHTNODE - 1 bit field - set if the right-hand node is internal + * + * Due to the way utf8 works, there cannot be branching nodes with + * NEXTBYTE set, and moreover those nodes always have a righthand + * descendant. + */ +typedef const unsigned char utf8trie_t; +#define BITNUM 0x07 +#define NEXTBYTE 0x08 +#define OFFLEN 0x30 +#define OFFLEN_SHIFT 4 +#define RIGHTPATH 0x40 +#define TRIENODE 0x80 +#define RIGHTNODE 0x40 +#define LEFTNODE 0x80 + +/* + * utf8leaf_t + * + * The leaves of the trie are embedded in the trie, and so the same + * underlying datatype: unsigned char. + * + * leaf[0]: The unicode version, stored as a generation number that is + * an index into utf8agetab[]. With this we can filter code + * points based on the unicode version in which they were + * defined. The CCC of a non-defined code point is 0. + * leaf[1]: Canonical Combining Class. During normalization, we need + * to do a stable sort into ascending order of all characters + * with a non-zero CCC that occur between two characters with + * a CCC of 0, or at the begin or end of a string. + * The unicode standard guarantees that all CCC values are + * between 0 and 254 inclusive, which leaves 255 available as + * a special value. + * Code points with CCC 0 are known as stoppers. + * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the + * start of a NUL-terminated string that is the decomposition + * of the character. + * The CCC of a decomposable character is the same as the CCC + * of the first character of its decomposition. + * Some characters decompose as the empty string: these are + * characters with the Default_Ignorable_Code_Point property. + * These do affect normalization, as they all have CCC 0. + * + * The decompositions in the trie have been fully expanded. + * + * Casefolding, if applicable, is also done using decompositions. + * + * The trie is constructed in such a way that leaves exist for all + * UTF-8 sequences that match the criteria from the "UTF-8 valid + * ranges" comment above, and only for those sequences. Therefore a + * lookup in the trie can be used to validate the UTF-8 input. + */ +typedef const unsigned char utf8leaf_t; + +#define LEAF_GEN(LEAF) ((LEAF)[0]) +#define LEAF_CCC(LEAF) ((LEAF)[1]) +#define LEAF_STR(LEAF) ((const char *)((LEAF) + 2)) + +#define MINCCC (0) +#define MAXCCC (254) +#define STOPPER (0) +#define DECOMPOSE (255) + +/* + * Use trie to scan s, touching at most len bytes. + * Returns the leaf if one exists, NULL otherwise. + * + * A non-NULL return guarantees that the UTF-8 sequence starting at s + * is well-formed and corresponds to a known unicode code point. The + * shorthand for this will be "is valid UTF-8 unicode". + */ +static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, + size_t len) +{ + utf8trie_t *trie = utf8data + data->offset; + int offlen; + int offset; + int mask; + int node; + + if (!data) + return NULL; + if (len == 0) + return NULL; + node = 1; + while (node) { + offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT; + if (*trie & NEXTBYTE) { + if (--len == 0) + return NULL; + s++; + } + mask = 1 << (*trie & BITNUM); + if (*s & mask) { + /* Right leg */ + if (offlen) { + /* Right node at offset of trie */ + node = (*trie & RIGHTNODE); + offset = trie[offlen]; + while (--offlen) { + offset <<= 8; + offset |= trie[offlen]; + } + trie += offset; + } else if (*trie & RIGHTPATH) { + /* Right node after this node */ + node = (*trie & TRIENODE); + trie++; + } else { + /* No right node. */ + node = 0; + trie = NULL; + } + } else { + /* Left leg */ + if (offlen) { + /* Left node after this node. */ + node = (*trie & LEFTNODE); + trie += offlen + 1; + } else if (*trie & RIGHTPATH) { + /* No left node. */ + node = 0; + trie = NULL; + } else { + /* Left node after this node */ + node = (*trie & TRIENODE); + trie++; + } + } + } + return trie; +} + +/* + * Use trie to scan s. + * Returns the leaf if one exists, NULL otherwise. + * + * Forwards to utf8nlookup(). + */ +static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s) +{ + return utf8nlookup(data, s, (size_t)-1); +} + +/* + * Maximum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if only non-assigned code points are used. + */ +int utf8agemax(const struct utf8data *data, const char *s) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!data) + return -1; + while (*s) { + leaf = utf8lookup(data, s); + if (!leaf) + return -1; + + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age > age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8agemax); + +/* + * Minimum age of any character in s. + * Return -1 if s is not valid UTF-8 unicode. + * Return 0 if non-assigned code points are used. + */ +int utf8agemin(const struct utf8data *data, const char *s) +{ + utf8leaf_t *leaf; + int age; + int leaf_age; + + if (!data) + return -1; + age = data->maxage; + while (*s) { + leaf = utf8lookup(data, s); + if (!leaf) + return -1; + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age < age) + age = leaf_age; + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8agemin); + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemax(const struct utf8data *data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int age = 0; + int leaf_age; + + if (!data) + return -1; + while (len && *s) { + leaf = utf8nlookup(data, s, len); + if (!leaf) + return -1; + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age > age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8nagemax); + +/* + * Maximum age of any character in s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +int utf8nagemin(const struct utf8data *data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + int leaf_age; + int age; + + if (!data) + return -1; + age = data->maxage; + while (len && *s) { + leaf = utf8nlookup(data, s, len); + if (!leaf) + return -1; + leaf_age = utf8agetab[LEAF_GEN(leaf)]; + if (leaf_age <= data->maxage && leaf_age < age) + age = leaf_age; + len -= utf8clen(s); + s += utf8clen(s); + } + return age; +} +EXPORT_SYMBOL(utf8nagemin); + +/* + * Length of the normalization of s. + * Return -1 if s is not valid UTF-8 unicode. + * + * A string of Default_Ignorable_Code_Point has length 0. + */ +ssize_t utf8len(const struct utf8data *data, const char *s) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (*s) { + leaf = utf8lookup(data, s); + if (!leaf) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8len); + +/* + * Length of the normalization of s, touch at most len bytes. + * Return -1 if s is not valid UTF-8 unicode. + */ +ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len) +{ + utf8leaf_t *leaf; + size_t ret = 0; + + if (!data) + return -1; + while (len && *s) { + leaf = utf8nlookup(data, s, len); + if (!leaf) + return -1; + if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) + ret += utf8clen(s); + else if (LEAF_CCC(leaf) == DECOMPOSE) + ret += strlen(LEAF_STR(leaf)); + else + ret += utf8clen(s); + len -= utf8clen(s); + s += utf8clen(s); + } + return ret; +} +EXPORT_SYMBOL(utf8nlen); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : const struct utf8data to use for normalization. + * s : string. + * len : length of s. + * + * Returns -1 on error, 0 on success. + */ +int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s, size_t len) +{ + if (!data) + return -1; + if (!s) + return -1; + u8c->data = data; + u8c->s = s; + u8c->p = NULL; + u8c->ss = NULL; + u8c->sp = NULL; + u8c->len = len; + u8c->slen = 0; + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + /* Check we didn't clobber the maximum length. */ + if (u8c->len != len) + return -1; + /* The first byte of s may not be an utf8 continuation. */ + if (len > 0 && (*s & 0xC0) == 0x80) + return -1; + return 0; +} +EXPORT_SYMBOL(utf8ncursor); + +/* + * Set up an utf8cursor for use by utf8byte(). + * + * u8c : pointer to cursor. + * data : const struct utf8data to use for normalization. + * s : NUL-terminated string. + * + * Returns -1 on error, 0 on success. + */ +int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s) +{ + return utf8ncursor(u8c, data, s, (unsigned int)-1); +} +EXPORT_SYMBOL(utf8cursor); + +/* + * Get one byte from the normalized form of the string described by u8c. + * + * Returns the byte cast to an unsigned char on succes, and -1 on failure. + * + * The cursor keeps track of the location in the string in u8c->s. + * When a character is decomposed, the current location is stored in + * u8c->p, and u8c->s is set to the start of the decomposition. Note + * that bytes from a decomposition do not count against u8c->len. + * + * Characters are emitted if they match the current CCC in u8c->ccc. + * Hitting end-of-string while u8c->ccc == STOPPER means we're done, + * and the function returns 0 in that case. + * + * Sorting by CCC is done by repeatedly scanning the string. The + * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at + * the start of the scan. The first pass finds the lowest CCC to be + * emitted and stores it in u8c->nccc, the second pass emits the + * characters with this CCC and finds the next lowest CCC. This limits + * the number of passes to 1 + the number of different CCCs in the + * sequence being scanned. + * + * Therefore: + * u8c->p != NULL -> a decomposition is being scanned. + * u8c->ss != NULL -> this is a repeating scan. + * u8c->ccc == -1 -> this is the first scan of a repeating scan. + */ +int utf8byte(struct utf8cursor *u8c) +{ + utf8leaf_t *leaf; + int ccc; + + for (;;) { + /* Check for the end of a decomposed character. */ + if (u8c->p && *u8c->s == '\0') { + u8c->s = u8c->p; + u8c->p = NULL; + } + + /* Check for end-of-string. */ + if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) { + /* There is no next byte. */ + if (u8c->ccc == STOPPER) + return 0; + /* End-of-string during a scan counts as a stopper. */ + ccc = STOPPER; + goto ccc_mismatch; + } else if ((*u8c->s & 0xC0) == 0x80) { + /* This is a continuation of the current character. */ + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Look up the data for the current character. */ + if (u8c->p) + leaf = utf8lookup(u8c->data, u8c->s); + else + leaf = utf8nlookup(u8c->data, u8c->s, u8c->len); + + /* No leaf found implies that the input is a binary blob. */ + if (!leaf) + return -1; + + ccc = LEAF_CCC(leaf); + /* Characters that are too new have CCC 0. */ + if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) { + ccc = STOPPER; + } else if (ccc == DECOMPOSE) { + u8c->len -= utf8clen(u8c->s); + u8c->p = u8c->s + utf8clen(u8c->s); + u8c->s = LEAF_STR(leaf); + /* Empty decomposition implies CCC 0. */ + if (*u8c->s == '\0') { + if (u8c->ccc == STOPPER) + continue; + ccc = STOPPER; + goto ccc_mismatch; + } + leaf = utf8lookup(u8c->data, u8c->s); + } + + /* + * If this is not a stopper, then see if it updates + * the next canonical class to be emitted. + */ + if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc) + u8c->nccc = ccc; + + /* + * Return the current byte if this is the current + * combining class. + */ + if (ccc == u8c->ccc) { + if (!u8c->p) + u8c->len--; + return (unsigned char)*u8c->s++; + } + + /* Current combining class mismatch. */ +ccc_mismatch: + if (u8c->nccc == STOPPER) { + /* + * Scan forward for the first canonical class + * to be emitted. Save the position from + * which to restart. + */ + u8c->ccc = MINCCC - 1; + u8c->nccc = ccc; + u8c->sp = u8c->p; + u8c->ss = u8c->s; + u8c->slen = u8c->len; + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (ccc != STOPPER) { + /* Not a stopper, and not the ccc we're emitting. */ + if (!u8c->p) + u8c->len -= utf8clen(u8c->s); + u8c->s += utf8clen(u8c->s); + } else if (u8c->nccc != MAXCCC + 1) { + /* At a stopper, restart for next ccc. */ + u8c->ccc = u8c->nccc; + u8c->nccc = MAXCCC + 1; + u8c->s = u8c->ss; + u8c->p = u8c->sp; + u8c->len = u8c->slen; + } else { + /* All done, proceed from here. */ + u8c->ccc = STOPPER; + u8c->nccc = STOPPER; + u8c->sp = NULL; + u8c->ss = NULL; + u8c->slen = 0; + } + } +} +EXPORT_SYMBOL(utf8byte); + +const struct utf8data *utf8nfkdi(unsigned int maxage) +{ + int i = ARRAY_SIZE(utf8nfkdidata) - 1; + + while (maxage < utf8nfkdidata[i].maxage) + i--; + if (maxage > utf8nfkdidata[i].maxage) + return NULL; + return &utf8nfkdidata[i]; +} +EXPORT_SYMBOL(utf8nfkdi); + +const struct utf8data *utf8nfkdicf(unsigned int maxage) +{ + int i = ARRAY_SIZE(utf8nfkdicfdata) - 1; + + while (maxage < utf8nfkdicfdata[i].maxage) + i--; + if (maxage > utf8nfkdicfdata[i].maxage) + return NULL; + return &utf8nfkdicfdata[i]; +} +EXPORT_SYMBOL(utf8nfkdicf); diff --git a/fs/nls/utf8n.h b/fs/nls/utf8n.h new file mode 100644 index 000000000000..0f5fc14d4fd2 --- /dev/null +++ b/fs/nls/utf8n.h @@ -0,0 +1,112 @@ +/* + * Copyright (c) 2014 SGI. + * All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#ifndef UTF8NORM_H +#define UTF8NORM_H + +#include +#include +#include +#include + +/* Encoding a unicode version number as a single unsigned int. */ +#define UNICODE_MAJ_SHIFT (16) +#define UNICODE_MIN_SHIFT (8) + +#define UNICODE_AGE(MAJ, MIN, REV) \ + (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \ + ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \ + ((unsigned int)(REV))) + +/* Highest unicode version supported by the data tables. */ +extern int utf8version_is_supported(u8 maj, u8 min, u8 rev); + +/* + * Look for the correct const struct utf8data for a unicode version. + * Returns NULL if the version requested is too new. + * + * Two normalization forms are supported: nfkdi and nfkdicf. + * + * nfkdi: + * - Apply unicode normalization form NFKD. + * - Remove any Default_Ignorable_Code_Point. + * + * nfkdicf: + * - Apply unicode normalization form NFKD. + * - Remove any Default_Ignorable_Code_Point. + * - Apply a full casefold (C + F). + */ +extern const struct utf8data *utf8nfkdi(unsigned int maxage); +extern const struct utf8data *utf8nfkdicf(unsigned int maxage); + +/* + * Determine the maximum age of any unicode character in the string. + * Returns 0 if only unassigned code points are present. + * Returns -1 if the input is not valid UTF-8. + */ +extern int utf8agemax(const struct utf8data *data, const char *s); +extern int utf8nagemax(const struct utf8data *data, const char *s, size_t len); + +/* + * Determine the minimum age of any unicode character in the string. + * Returns 0 if any unassigned code points are present. + * Returns -1 if the input is not valid UTF-8. + */ +extern int utf8agemin(const struct utf8data *data, const char *s); +extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len); + +/* + * Determine the length of the normalized from of the string, + * excluding any terminating NULL byte. + * Returns 0 if only ignorable code points are present. + * Returns -1 if the input is not valid UTF-8. + */ +extern ssize_t utf8len(const struct utf8data *data, const char *s); +extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len); + +/* + * Cursor structure used by the normalizer. + */ +struct utf8cursor { + const struct utf8data *data; + const char *s; + const char *p; + const char *ss; + const char *sp; + unsigned int len; + unsigned int slen; + short int ccc; + short int nccc; +}; + +/* + * Initialize a utf8cursor to normalize a string. + * Returns 0 on success. + * Returns -1 on failure. + */ +extern int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s); +extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data, + const char *s, size_t len); + +/* + * Get the next byte in the normalization. + * Returns a value > 0 && < 256 on success. + * Returns 0 when the end of the normalization is reached. + * Returns -1 if the string being normalized is not valid UTF-8. + */ +extern int utf8byte(struct utf8cursor *u8c); + +#endif /* UTF8NORM_H */ From patchwork Thu Dec 6 23:08:56 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717209 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 031DC15A6 for ; Thu, 6 Dec 2018 23:10:12 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E1DA22F205 for ; Thu, 6 Dec 2018 23:10:11 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D5FB12F21D; Thu, 6 Dec 2018 23:10:11 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4404C2F205 for ; Thu, 6 Dec 2018 23:10:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726352AbeLFXKJ (ORCPT ); Thu, 6 Dec 2018 18:10:09 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56134 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726331AbeLFXKI (ORCPT ); Thu, 6 Dec 2018 18:10:08 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id BAD8F27ED80 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Olaf Weber , Gabriel Krisman Bertazi Subject: [PATCH v4 16/23] nls: utf8n: reduce the size of utf8data[] Date: Thu, 6 Dec 2018 18:08:56 -0500 Message-Id: <20181206230903.30011-17-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Olaf Weber Remove the Hangul decompositions from the utf8data trie, and do algorithmic decomposition to calculate them on the fly. To store the decomposition the caller of utf8lookup()/utf8nlookup() must provide a 12-byte buffer, which is used to synthesize a leaf with the decomposition. Trie size is reduced from 245kB to 90kB. Signed-off-by: Olaf Weber Signed-off-by: Gabriel Krisman Bertazi [Rebase to mainline] [Fix checkpatch errors] [Extract robustness fixes and merge back to original mkutf8data.c patch] --- fs/nls/nls_utf8-norm.c | 191 +++++++++++++++++++++++--- fs/nls/utf8n.h | 4 + scripts/mkutf8data.c | 298 ++++++++++++++++++++++++++++++++++++----- 3 files changed, 436 insertions(+), 57 deletions(-) diff --git a/fs/nls/nls_utf8-norm.c b/fs/nls/nls_utf8-norm.c index ca0bbf644b49..64c3cc74a2ca 100644 --- a/fs/nls/nls_utf8-norm.c +++ b/fs/nls/nls_utf8-norm.c @@ -98,6 +98,38 @@ static inline int utf8clen(const char *s) return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0); } +/* + * Decode a 3-byte UTF-8 sequence. + */ +static unsigned int +utf8decode3(const char *str) +{ + unsigned int uc; + + uc = *str++ & 0x0F; + uc <<= 6; + uc |= *str++ & 0x3F; + uc <<= 6; + uc |= *str++ & 0x3F; + + return uc; +} + +/* + * Encode a 3-byte UTF-8 sequence. + */ +static int +utf8encode3(char *str, unsigned int val) +{ + str[2] = (val & 0x3F) | 0x80; + val >>= 6; + str[1] = (val & 0x3F) | 0x80; + val >>= 6; + str[0] = val | 0xE0; + + return 3; +} + /* * utf8trie_t * @@ -159,7 +191,8 @@ typedef const unsigned char utf8trie_t; * characters with the Default_Ignorable_Code_Point property. * These do affect normalization, as they all have CCC 0. * - * The decompositions in the trie have been fully expanded. + * The decompositions in the trie have been fully expanded, with the + * exception of Hangul syllables, which are decomposed algorithmically. * * Casefolding, if applicable, is also done using decompositions. * @@ -179,6 +212,105 @@ typedef const unsigned char utf8leaf_t; #define STOPPER (0) #define DECOMPOSE (255) +/* Marker for hangul syllable decomposition. */ +#define HANGUL ((char)(255)) +/* Size of the synthesized leaf used for Hangul syllable decomposition. */ +#define UTF8HANGULLEAF (12) + +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;;Lo;0;L;;;;;N;;;;; + * D7A3;;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = (Sindex % TCount) + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * TIndex = (Sindex % TCount) + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = + * } else { + * TPart = TBase + TIndex + * d = + * } + */ + +/* Constants */ +#define SB (0xAC00) +#define LB (0x1100) +#define VB (0x1161) +#define TB (0x11A7) +#define LC (19) +#define VC (21) +#define TC (28) +#define NC (VC * TC) +#define SC (LC * NC) + +/* Algorithmic decomposition of hangul syllable. */ +static utf8leaf_t * +utf8hangul(const char *str, unsigned char *hangul) +{ + unsigned int si; + unsigned int li; + unsigned int vi; + unsigned int ti; + unsigned char *h; + + /* Calculate the SI, LI, VI, and TI values. */ + si = utf8decode3(str) - SB; + li = si / NC; + vi = (si % NC) / TC; + ti = si % TC; + + /* Fill in base of leaf. */ + h = hangul; + LEAF_GEN(h) = 2; + LEAF_CCC(h) = DECOMPOSE; + h += 2; + + /* Add LPart, a 3-byte UTF-8 sequence. */ + h += utf8encode3((char *)h, li + LB); + + /* Add VPart, a 3-byte UTF-8 sequence. */ + h += utf8encode3((char *)h, vi + VB); + + /* Add TPart if required, also a 3-byte UTF-8 sequence. */ + if (ti) + h += utf8encode3((char *)h, ti + TB); + + /* Terminate string. */ + h[0] = '\0'; + + return hangul; +} + /* * Use trie to scan s, touching at most len bytes. * Returns the leaf if one exists, NULL otherwise. @@ -187,8 +319,8 @@ typedef const unsigned char utf8leaf_t; * is well-formed and corresponds to a known unicode code point. The * shorthand for this will be "is valid UTF-8 unicode". */ -static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, - size_t len) +static utf8leaf_t *utf8nlookup(const struct utf8data *data, + unsigned char *hangul, const char *s, size_t len) { utf8trie_t *trie = utf8data + data->offset; int offlen; @@ -226,8 +358,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, trie++; } else { /* No right node. */ - node = 0; - trie = NULL; + return NULL; } } else { /* Left leg */ @@ -237,8 +368,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, trie += offlen + 1; } else if (*trie & RIGHTPATH) { /* No left node. */ - node = 0; - trie = NULL; + return NULL; } else { /* Left node after this node */ node = (*trie & TRIENODE); @@ -246,6 +376,14 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, } } } + /* + * Hangul decomposition is done algorithmically. These are the + * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is + * always 3 bytes long, so s has been advanced twice, and the + * start of the sequence is at s-2. + */ + if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL) + trie = utf8hangul(s - 2, hangul); return trie; } @@ -255,9 +393,10 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s, * * Forwards to utf8nlookup(). */ -static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s) +static utf8leaf_t *utf8lookup(const struct utf8data *data, + unsigned char *hangul, const char *s) { - return utf8nlookup(data, s, (size_t)-1); + return utf8nlookup(data, hangul, s, (size_t)-1); } /* @@ -270,11 +409,13 @@ int utf8agemax(const struct utf8data *data, const char *s) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; + while (*s) { - leaf = utf8lookup(data, s); + leaf = utf8lookup(data, hangul, s); if (!leaf) return -1; @@ -297,12 +438,13 @@ int utf8agemin(const struct utf8data *data, const char *s) utf8leaf_t *leaf; int age; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; age = data->maxage; while (*s) { - leaf = utf8lookup(data, s); + leaf = utf8lookup(data, hangul, s); if (!leaf) return -1; leaf_age = utf8agetab[LEAF_GEN(leaf)]; @@ -323,11 +465,13 @@ int utf8nagemax(const struct utf8data *data, const char *s, size_t len) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; + while (len && *s) { - leaf = utf8nlookup(data, s, len); + leaf = utf8nlookup(data, hangul, s, len); if (!leaf) return -1; leaf_age = utf8agetab[LEAF_GEN(leaf)]; @@ -349,12 +493,13 @@ int utf8nagemin(const struct utf8data *data, const char *s, size_t len) utf8leaf_t *leaf; int leaf_age; int age; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; age = data->maxage; while (len && *s) { - leaf = utf8nlookup(data, s, len); + leaf = utf8nlookup(data, hangul, s, len); if (!leaf) return -1; leaf_age = utf8agetab[LEAF_GEN(leaf)]; @@ -377,11 +522,12 @@ ssize_t utf8len(const struct utf8data *data, const char *s) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; while (*s) { - leaf = utf8lookup(data, s); + leaf = utf8lookup(data, hangul, s); if (!leaf) return -1; if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) @@ -404,11 +550,12 @@ ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!data) return -1; while (len && *s) { - leaf = utf8nlookup(data, s, len); + leaf = utf8nlookup(data, hangul, s, len); if (!leaf) return -1; if (utf8agetab[LEAF_GEN(leaf)] > data->maxage) @@ -531,10 +678,12 @@ int utf8byte(struct utf8cursor *u8c) } /* Look up the data for the current character. */ - if (u8c->p) - leaf = utf8lookup(u8c->data, u8c->s); - else - leaf = utf8nlookup(u8c->data, u8c->s, u8c->len); + if (u8c->p) { + leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s); + } else { + leaf = utf8nlookup(u8c->data, u8c->hangul, + u8c->s, u8c->len); + } /* No leaf found implies that the input is a binary blob. */ if (!leaf) @@ -555,7 +704,9 @@ int utf8byte(struct utf8cursor *u8c) ccc = STOPPER; goto ccc_mismatch; } - leaf = utf8lookup(u8c->data, u8c->s); + + leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s); + ccc = LEAF_CCC(leaf); } /* diff --git a/fs/nls/utf8n.h b/fs/nls/utf8n.h index 0f5fc14d4fd2..f60827663503 100644 --- a/fs/nls/utf8n.h +++ b/fs/nls/utf8n.h @@ -76,6 +76,9 @@ extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len); extern ssize_t utf8len(const struct utf8data *data, const char *s); extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len); +/* Needed in struct utf8cursor below. */ +#define UTF8HANGULLEAF (12) + /* * Cursor structure used by the normalizer. */ @@ -89,6 +92,7 @@ struct utf8cursor { unsigned int slen; short int ccc; short int nccc; + unsigned char hangul[UTF8HANGULLEAF]; }; /* diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c index 26794053d0d4..49bb0e16669b 100644 --- a/scripts/mkutf8data.c +++ b/scripts/mkutf8data.c @@ -180,10 +180,14 @@ typedef unsigned char utf8leaf_t; #define MAXCCC (254) #define STOPPER (0) #define DECOMPOSE (255) +#define HANGUL ((char)(255)) + +#define UTF8HANGULLEAF (12) struct tree; -static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t); -static utf8leaf_t *utf8lookup(struct tree *, const char *); +static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *, + const char *, size_t); +static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *); unsigned char *utf8data; size_t utf8data_size; @@ -331,6 +335,8 @@ static int utf32valid(unsigned int unichar) return unichar < 0x110000; } +#define HANGUL_SYLLABLE(U) ((U) >= 0xAC00 && (U) <= 0xD7A3) + #define NODE 1 #define LEAF 0 @@ -461,7 +467,7 @@ static void tree_walk(struct tree *tree) indent+1); leaves += 1; } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -855,7 +861,7 @@ static void mark_nodes(struct tree *tree) } } } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); node = node->right; continue; } @@ -907,7 +913,7 @@ static void mark_nodes(struct tree *tree) } } } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); node = node->right; if (!node->mark && node->parent->mark && !node->parent->left) { @@ -990,7 +996,7 @@ static int index_nodes(struct tree *tree, int index) index += tree->leaf_size(node->right); count++; } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -1011,6 +1017,25 @@ static int index_nodes(struct tree *tree, int index) return index; } +/* + * Mark the nodes in a subtree, helper for size_nodes(). + */ +static int mark_subtree(struct node *node) +{ + int changed; + + if (!node || node->mark) + return 0; + node->mark = 1; + node->index = node->parent->index; + changed = 1; + if (node->leftnode == NODE) + changed += mark_subtree(node->left); + if (node->rightnode == NODE) + changed += mark_subtree(node->right); + return changed; +} + /* * Compute the size of nodes and leaves. We start by assuming that * each node needs to store a three-byte offset. The indexes of the @@ -1029,6 +1054,7 @@ static int size_nodes(struct tree *tree) unsigned int bitmask; unsigned int pathbits; unsigned int pathmask; + unsigned int nbit; int changed; int offset; int size; @@ -1056,22 +1082,40 @@ static int size_nodes(struct tree *tree) size = 1; } else { if (node->rightnode == NODE) { + /* + * If the right node is not marked, + * look for a corresponding node in + * the next tree. Such a node need + * not exist. + */ right = node->right; next = tree->next; while (!right->mark) { assert(next); n = next->root; while (n->bitnum != node->bitnum) { - if (pathbits & (1<bitnum)) + nbit = 1 << n->bitnum; + if (!(pathmask & nbit)) + break; + if (pathbits & nbit) { + if (n->rightnode == LEAF) + break; n = n->right; - else + } else { + if (n->leftnode == LEAF) + break; n = n->left; + } } + if (n->bitnum != node->bitnum) + break; n = n->right; - assert(right->bitnum == n->bitnum); right = n; next = next->next; } + /* Make sure the right node is marked. */ + if (!right->mark) + changed += mark_subtree(right); offset = right->index - node->index; } else { offset = *tree->leaf_index(tree, node->right); @@ -1113,7 +1157,7 @@ static int size_nodes(struct tree *tree) if (node->rightnode == LEAF) { assert(node->right); } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -1146,8 +1190,15 @@ static void emit(struct tree *tree, unsigned char *data) int offset; int index; int indent; + int size; + int bytes; + int leaves; + int nodes[4]; unsigned char byte; + nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0; + leaves = 0; + bytes = 0; index = tree->index; data += index; indent = 1; @@ -1156,7 +1207,10 @@ static void emit(struct tree *tree, unsigned char *data) if (tree->childnode == LEAF) { assert(tree->root); tree->leaf_emit(tree->root, data); - return; + size = tree->leaf_size(tree->root); + index += size; + leaves++; + goto done; } assert(tree->childnode == NODE); @@ -1183,6 +1237,7 @@ static void emit(struct tree *tree, unsigned char *data) offlen = 2; else offlen = 3; + nodes[offlen]++; offset = node->offset; byte |= offlen << OFFLEN_SHIFT; *data++ = byte; @@ -1195,12 +1250,14 @@ static void emit(struct tree *tree, unsigned char *data) } else if (node->left) { if (node->leftnode == NODE) byte |= TRIENODE; + nodes[0]++; *data++ = byte; index++; } else if (node->right) { byte |= RIGHTNODE; if (node->rightnode == NODE) byte |= TRIENODE; + nodes[0]++; *data++ = byte; index++; } else { @@ -1215,7 +1272,10 @@ static void emit(struct tree *tree, unsigned char *data) assert(node->left); data = tree->leaf_emit(node->left, data); - index += tree->leaf_size(node->left); + size = tree->leaf_size(node->left); + index += size; + bytes += size; + leaves++; } else if (node->left) { assert(node->leftnode == NODE); indent += 1; @@ -1229,9 +1289,12 @@ static void emit(struct tree *tree, unsigned char *data) assert(node->right); data = tree->leaf_emit(node->right, data); - index += tree->leaf_size(node->right); + size = tree->leaf_size(node->right); + index += size; + bytes += size; + leaves++; } else if (node->right) { - assert(node->rightnode==NODE); + assert(node->rightnode == NODE); indent += 1; node = node->right; break; @@ -1243,6 +1306,15 @@ static void emit(struct tree *tree, unsigned char *data) indent -= 1; } } +done: + if (verbose > 0) { + printf("Emitted %d (%d) leaves", + leaves, bytes); + printf(" %d (%d+%d+%d+%d) nodes", + nodes[0] + nodes[1] + nodes[2] + nodes[3], + nodes[0], nodes[1], nodes[2], nodes[3]); + printf(" %d total\n", index - tree->index); + } } /* ------------------------------------------------------------------ */ @@ -1344,7 +1416,9 @@ static void nfkdi_print(void *l, int indent) printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf, leaf->code, leaf->ccc, leaf->gen); - if (leaf->utf8nfkdi) + if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL) + printf(" nfkdi \"%s\"", "HANGUL SYLLABLE"); + else if (leaf->utf8nfkdi) printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); printf("\n"); } @@ -1357,6 +1431,8 @@ static void nfkdicf_print(void *l, int indent) leaf->code, leaf->ccc, leaf->gen); if (leaf->utf8nfkdicf) printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf); + else if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL) + printf(" nfkdi \"%s\"", "HANGUL SYLLABLE"); else if (leaf->utf8nfkdi) printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi); printf("\n"); @@ -1388,7 +1464,9 @@ static int nfkdi_size(void *l) struct unicode_data *leaf = l; int size = 2; - if (leaf->utf8nfkdi) + if (HANGUL_SYLLABLE(leaf->code)) + size += 1; + else if (leaf->utf8nfkdi) size += strlen(leaf->utf8nfkdi) + 1; return size; } @@ -1398,7 +1476,9 @@ static int nfkdicf_size(void *l) struct unicode_data *leaf = l; int size = 2; - if (leaf->utf8nfkdicf) + if (HANGUL_SYLLABLE(leaf->code)) + size += 1; + else if (leaf->utf8nfkdicf) size += strlen(leaf->utf8nfkdicf) + 1; else if (leaf->utf8nfkdi) size += strlen(leaf->utf8nfkdi) + 1; @@ -1425,7 +1505,10 @@ static unsigned char *nfkdi_emit(void *l, unsigned char *data) unsigned char *s; *data++ = leaf->gen; - if (leaf->utf8nfkdi) { + if (HANGUL_SYLLABLE(leaf->code)) { + *data++ = DECOMPOSE; + *data++ = HANGUL; + } else if (leaf->utf8nfkdi) { *data++ = DECOMPOSE; s = (unsigned char*)leaf->utf8nfkdi; while ((*data++ = *s++) != 0) @@ -1442,7 +1525,10 @@ static unsigned char *nfkdicf_emit(void *l, unsigned char *data) unsigned char *s; *data++ = leaf->gen; - if (leaf->utf8nfkdicf) { + if (HANGUL_SYLLABLE(leaf->code)) { + *data++ = DECOMPOSE; + *data++ = HANGUL; + } else if (leaf->utf8nfkdicf) { *data++ = DECOMPOSE; s = (unsigned char*)leaf->utf8nfkdicf; while ((*data++ = *s++) != 0) @@ -1465,6 +1551,11 @@ static void utf8_create(struct unicode_data *data) unsigned int *um; int i; + if (data->utf8nfkdi) { + assert(data->utf8nfkdi[0] == HANGUL); + return; + } + u = utf; um = data->utf32nfkdi; if (um) { @@ -1650,6 +1741,7 @@ static void verify(struct tree *tree) utf8leaf_t *leaf; unsigned int unichar; char key[4]; + unsigned char hangul[UTF8HANGULLEAF]; int report; int nocf; @@ -1663,7 +1755,8 @@ static void verify(struct tree *tree) if (data->correction <= tree->maxage) data = &unicode_data[unichar]; utf8encode(key,unichar); - leaf = utf8lookup(tree, key); + leaf = utf8lookup(tree, hangul, key); + if (!leaf) { if (data->gen != -1) report++; @@ -1677,7 +1770,10 @@ static void verify(struct tree *tree) if (data->gen != LEAF_GEN(leaf)) report++; if (LEAF_CCC(leaf) == DECOMPOSE) { - if (nocf) { + if (HANGUL_SYLLABLE(data->code)) { + if (data->utf8nfkdi[0] != HANGUL) + report++; + } else if (nocf) { if (!data->utf8nfkdi) { report++; } else if (strcmp(data->utf8nfkdi, @@ -2302,8 +2398,7 @@ static void corrections_init(void) * */ -static void -hangul_decompose(void) +static void hangul_decompose(void) { unsigned int sb = 0xAC00; unsigned int lb = 0x1100; @@ -2347,6 +2442,15 @@ hangul_decompose(void) memcpy(um, mapping, i * sizeof(unsigned int)); unicode_data[unichar].utf32nfkdicf = um; + /* + * Add a cookie as a reminder that the hangul syllable + * decompositions must not be stored in the generated + * trie. + */ + unicode_data[unichar].utf8nfkdi = malloc(2); + unicode_data[unichar].utf8nfkdi[0] = HANGUL; + unicode_data[unichar].utf8nfkdi[1] = '\0'; + if (verbose > 1) print_utf32nfkdi(unichar); @@ -2472,6 +2576,99 @@ int utf8cursor(struct utf8cursor *, struct tree *, const char *); int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t); int utf8byte(struct utf8cursor *); +/* + * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0) + * + * AC00;;Lo;0;L;;;;;N;;;;; + * D7A3;;Lo;0;L;;;;;N;;;;; + * + * SBase = 0xAC00 + * LBase = 0x1100 + * VBase = 0x1161 + * TBase = 0x11A7 + * LCount = 19 + * VCount = 21 + * TCount = 28 + * NCount = 588 (VCount * TCount) + * SCount = 11172 (LCount * NCount) + * + * Decomposition: + * SIndex = s - SBase + * + * LV (Canonical/Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * + * LVT (Canonical) + * LVIndex = (SIndex / TCount) * TCount + * TIndex = (Sindex % TCount) + * LVPart = SBase + LVIndex + * TPart = TBase + TIndex + * + * LVT (Full) + * LIndex = SIndex / NCount + * VIndex = (Sindex % NCount) / TCount + * TIndex = (Sindex % TCount) + * LPart = LBase + LIndex + * VPart = VBase + VIndex + * if (TIndex == 0) { + * d = + * } else { + * TPart = TBase + TIndex + * d = + * } + */ + +/* Constants */ +#define SB (0xAC00) +#define LB (0x1100) +#define VB (0x1161) +#define TB (0x11A7) +#define LC (19) +#define VC (21) +#define TC (28) +#define NC (VC * TC) +#define SC (LC * NC) + +/* Algorithmic decomposition of hangul syllable. */ +static utf8leaf_t *utf8hangul(const char *str, unsigned char *hangul) +{ + unsigned int si; + unsigned int li; + unsigned int vi; + unsigned int ti; + unsigned char *h; + + /* Calculate the SI, LI, VI, and TI values. */ + si = utf8decode(str) - SB; + li = si / NC; + vi = (si % NC) / TC; + ti = si % TC; + + /* Fill in base of leaf. */ + h = hangul; + LEAF_GEN(h) = 2; + LEAF_CCC(h) = DECOMPOSE; + h += 2; + + /* Add LPart, a 3-byte UTF-8 sequence. */ + h += utf8encode((char *)h, li + LB); + + /* Add VPart, a 3-byte UTF-8 sequence. */ + h += utf8encode((char *)h, vi + VB); + + /* Add TPart if required, also a 3-byte UTF-8 sequence. */ + if (ti) + h += utf8encode((char *)h, ti + TB); + + /* Terminate string. */ + h[0] = '\0'; + + return hangul; +} + /* * Use trie to scan s, touching at most len bytes. * Returns the leaf if one exists, NULL otherwise. @@ -2480,7 +2677,8 @@ int utf8byte(struct utf8cursor *); * is well-formed and corresponds to a known unicode code point. The * shorthand for this will be "is valid UTF-8 unicode". */ -static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) +static utf8leaf_t *utf8nlookup(struct tree *tree, unsigned char *hangul, + const char *s, size_t len) { utf8trie_t *trie = utf8data + tree->index; int offlen; @@ -2536,6 +2734,14 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) } } } + /* + * Hangul decomposition is done algorithmically. These are the + * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is + * always 3 bytes long, so s has been advanced twice, and the + * start of the sequence is at s-2. + */ + if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL) + trie = utf8hangul(s - 2, hangul); return trie; } @@ -2545,9 +2751,10 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len) * * Forwards to trie_nlookup(). */ -static utf8leaf_t *utf8lookup(struct tree *tree, const char *s) +static utf8leaf_t *utf8lookup(struct tree *tree, unsigned char *hangul, + const char *s) { - return utf8nlookup(tree, s, (size_t)-1); + return utf8nlookup(tree, hangul, s, (size_t)-1); } /* @@ -2571,11 +2778,14 @@ int utf8agemax(struct tree *tree, const char *s) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; + while (*s) { - if (!(leaf = utf8lookup(tree, s))) + leaf = utf8lookup(tree, hangul, s); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age > age) @@ -2595,12 +2805,14 @@ int utf8agemin(struct tree *tree, const char *s) utf8leaf_t *leaf; int age; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; age = tree->maxage; while (*s) { - if (!(leaf = utf8lookup(tree, s))) + leaf = utf8lookup(tree, hangul, s); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age < age) @@ -2619,11 +2831,14 @@ int utf8nagemax(struct tree *tree, const char *s, size_t len) utf8leaf_t *leaf; int age = 0; int leaf_age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; + while (len && *s) { - if (!(leaf = utf8nlookup(tree, s, len))) + leaf = utf8nlookup(tree, hangul, s, len); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age > age) @@ -2643,12 +2858,14 @@ int utf8nagemin(struct tree *tree, const char *s, size_t len) utf8leaf_t *leaf; int leaf_age; int age; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; age = tree->maxage; while (len && *s) { - if (!(leaf = utf8nlookup(tree, s, len))) + leaf = utf8nlookup(tree, hangul, s, len); + if (!leaf) return -1; leaf_age = ages[LEAF_GEN(leaf)]; if (leaf_age <= tree->maxage && leaf_age < age) @@ -2669,11 +2886,13 @@ ssize_t utf8len(struct tree *tree, const char *s) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; while (*s) { - if (!(leaf = utf8lookup(tree, s))) + leaf = utf8lookup(tree, hangul, s); + if (!leaf) return -1; if (ages[LEAF_GEN(leaf)] > tree->maxage) ret += utf8clen(s); @@ -2694,11 +2913,13 @@ ssize_t utf8nlen(struct tree *tree, const char *s, size_t len) { utf8leaf_t *leaf; size_t ret = 0; + unsigned char hangul[UTF8HANGULLEAF]; if (!tree) return -1; while (len && *s) { - if (!(leaf = utf8nlookup(tree, s, len))) + leaf = utf8nlookup(tree, hangul, s, len); + if (!leaf) return -1; if (ages[LEAF_GEN(leaf)] > tree->maxage) ret += utf8clen(s); @@ -2726,6 +2947,7 @@ struct utf8cursor { short int ccc; short int nccc; unsigned int unichar; + unsigned char hangul[UTF8HANGULLEAF]; }; /* @@ -2833,10 +3055,12 @@ int utf8byte(struct utf8cursor *u8c) } /* Look up the data for the current character. */ - if (u8c->p) - leaf = utf8lookup(u8c->tree, u8c->s); - else - leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len); + if (u8c->p) { + leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s); + } else { + leaf = utf8nlookup(u8c->tree, u8c->hangul, + u8c->s, u8c->len); + } /* No leaf found implies that the input is a binary blob. */ if (!leaf) @@ -2856,7 +3080,7 @@ int utf8byte(struct utf8cursor *u8c) ccc = STOPPER; goto ccc_mismatch; } - leaf = utf8lookup(u8c->tree, u8c->s); + leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s); ccc = LEAF_CCC(leaf); } u8c->unichar = utf8decode(u8c->s); From patchwork Thu Dec 6 23:08:57 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717211 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AF3411932 for ; Thu, 6 Dec 2018 23:10:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9D3FA2F205 for ; Thu, 6 Dec 2018 23:10:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 91CFC2F21D; Thu, 6 Dec 2018 23:10:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CD5042F205 for ; Thu, 6 Dec 2018 23:10:14 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726358AbeLFXKN (ORCPT ); Thu, 6 Dec 2018 18:10:13 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56140 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726353AbeLFXKM (ORCPT ); Thu, 6 Dec 2018 18:10:12 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id E1EAB27ED68 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 17/23] nls: utf8: Integrate utf8 normalization code with utf8 charset Date: Thu, 6 Dec 2018 18:08:57 -0500 Message-Id: <20181206230903.30011-18-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi This patch integrates the utf8n patches with the NLS utf8 charset by implementing the nls_ops operations and nls_charset table. The Normalization is done with NFKD, and Casefold is implemented using the NFKD+CF algorithm, implemented by Olaf Weber and SGI. The high level, strcmp, strncmp functions are implemented on top of the same utf8 code. Utf-8 with normalization is exposed as optional on top of the existing utf8 charset, and disabled by default, to avoid changing the behavior of existing nls_utf8 users. To enable normalization, the specific normalization type must be set at load_table() time. Changes since RFC v2: - Integrate with NLS - Merge utf8n with nls_utf8. Changes since RFC v1: - Change error return code from EIO to EINVAL. (Olaf Weber) - Fix issues with strncmp/strcmp. (Olaf Weber) - Remove stack buffer in normalization/casefold. (Olaf Weber) - Include length parameter for second string on comparison functions. - Change length type to size_t. Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/nls_utf8-core.c | 269 ++++++++++++++++++++++++++++++++++++++--- fs/nls/nls_utf8-norm.c | 6 + fs/nls/utf8n.h | 1 + include/linux/nls.h | 8 ++ 4 files changed, 270 insertions(+), 14 deletions(-) diff --git a/fs/nls/nls_utf8-core.c b/fs/nls/nls_utf8-core.c index fe1ac5efaa37..1b7320bd9c34 100644 --- a/fs/nls/nls_utf8-core.c +++ b/fs/nls/nls_utf8-core.c @@ -6,10 +6,15 @@ #include #include #include +#include +#include #include #include +#include "utf8n.h" + static unsigned char identity[256]; +static struct nls_charset utf8_info; static int uni2char(wchar_t uni, unsigned char *out, int boundlen) { @@ -50,22 +55,257 @@ static unsigned char charset_toupper(const struct nls_table *table, return identity[c]; } -static const struct nls_ops charset_ops = { - .lowercase = charset_toupper, - .uppercase = charset_tolower, - .uni2char = uni2char, - .char2uni = char2uni, -}; +#ifdef CONFIG_NLS_UTF8_NORMALIZATION + +static int utf8_validate(const struct nls_table *charset, + const unsigned char *str, size_t len) +{ + const struct utf8data *data = utf8nfkdi(charset->version); + + if (utf8nlen(data, str, len) < 0) + return -1; + return 0; +} + +static int utf8_strncmp(const struct nls_table *charset, + const unsigned char *str1, size_t len1, + const unsigned char *str2, size_t len2) +{ + const struct utf8data *data = utf8nfkdi(charset->version); + struct utf8cursor cur1, cur2; + int c1, c2; + + if (utf8ncursor(&cur1, data, str1, len1) < 0) + goto invalid_seq; + + if (utf8ncursor(&cur2, data, str2, len2) < 0) + goto invalid_seq; + + do { + c1 = utf8byte(&cur1); + c2 = utf8byte(&cur2); + + if (c1 < 0 || c2 < 0) + goto invalid_seq; + if (c1 != c2) + return 1; + } while (c1); + + return 0; + +invalid_seq: + if(IS_STRICT_MODE(charset)) + return -EINVAL; + + /* Treat the sequence as a binary blob. */ + if (len1 != len2) + return 1; + + return !!memcmp(str1, str2, len1); +} + +static int utf8_strncasecmp(const struct nls_table *charset, + const unsigned char *str1, size_t len1, + const unsigned char *str2, size_t len2) +{ + const struct utf8data *data = utf8nfkdicf(charset->version); + struct utf8cursor cur1, cur2; + int c1, c2; + + if (utf8ncursor(&cur1, data, str1, len1) < 0) + goto invalid_seq; + + if (utf8ncursor(&cur2, data, str2, len2) < 0) + goto invalid_seq; + + do { + c1 = utf8byte(&cur1); + c2 = utf8byte(&cur2); + + if (c1 < 0 || c2 < 0) + goto invalid_seq; + if (c1 != c2) + return 1; + } while (c1); + + return 0; + +invalid_seq: + if(IS_STRICT_MODE(charset)) + return -EINVAL; + + /* Treat the sequence as a binary blob. */ + if (len1 != len2) + return 1; + + return !!memcmp(str1, str2, len1); +} + +static int utf8_casefold_nfkdcf(const struct nls_table *charset, + const unsigned char *str, size_t len, + unsigned char *dest, size_t dlen) +{ + const struct utf8data *data = utf8nfkdicf(charset->version); + struct utf8cursor cur; + size_t nlen = 0; + + if (utf8ncursor(&cur, data, str, len) < 0) + goto invalid_seq; + + for (nlen = 0; nlen < dlen; nlen++) { + dest[nlen] = utf8byte(&cur); + if (!dest[nlen]) + return nlen; + if (dest[nlen] == -1) + break; + } + +invalid_seq: + if (IS_STRICT_MODE(charset)) + return -EINVAL; + + /* Treat the sequence as a binary blob. */ + memcpy(dest, str, len); + return len; +} + +static int utf8_normalize_nfkd(const struct nls_table *charset, + const unsigned char *str, + size_t len, unsigned char *dest, size_t dlen) +{ + const struct utf8data *data = utf8nfkdi(charset->version); + struct utf8cursor cur; + ssize_t nlen = 0; + + if (utf8ncursor(&cur, data, str, len) < 0) + goto invalid_seq; -static struct nls_charset nls_charset; -static struct nls_table table = { - .charset = &nls_charset, - .ops = &charset_ops, + for (nlen = 0; nlen < dlen; nlen++) { + dest[nlen] = utf8byte(&cur); + if (!dest[nlen]) + return nlen; + if (dest[nlen] == -1) + break; + } + +invalid_seq: + if (IS_STRICT_MODE(charset)) + return -EINVAL; + + /* Treat the sequence as a binary blob. */ + memcpy(dest, str, len); + return len; +} + +static int utf8_parse_version(const char *version, unsigned int *maj, + unsigned int *min, unsigned int *rev) +{ + substring_t args[3]; + char version_string[12]; + const struct match_token token[] = { + {1, "%d.%d.%d"}, + {0, NULL} + }; + + strncpy(version_string, version, sizeof(version_string)); + + if (match_token(version_string, token, args) != 1) + return -EINVAL; + + if (match_int(&args[0], maj) || match_int(&args[1], min) || + match_int(&args[2], rev)) + return -EINVAL; + + return 0; +} +#endif + +struct utf8_table { + struct nls_table tbl; + struct nls_ops ops; }; -static struct nls_charset nls_charset = { +static void utf8_set_ops(struct utf8_table *utbl) +{ + utbl->ops.lowercase = charset_toupper; + utbl->ops.uppercase = charset_tolower; + utbl->ops.uni2char = uni2char; + utbl->ops.char2uni = char2uni; + +#ifdef CONFIG_NLS_UTF8_NORMALIZATION + utbl->ops.validate = utf8_validate; + + if (IS_NORMALIZATION_TYPE_UTF8_NFKD(&utbl->tbl)) { + utbl->ops.normalize = utf8_normalize_nfkd; + utbl->ops.strncmp = utf8_strncmp; + } + + if (IS_CASEFOLD_TYPE_UTF8_NFKDCF(&utbl->tbl)) { + utbl->ops.casefold = utf8_casefold_nfkdcf; + utbl->ops.strncasecmp = utf8_strncasecmp; + } +#endif + + utbl->tbl.ops = &utbl->ops; +} + +static struct nls_table *utf8_load_table(const char *version, unsigned int flags) +{ + struct utf8_table *utbl = NULL; + unsigned int nls_version; + +#ifdef CONFIG_NLS_UTF8_NORMALIZATION + if (version) { + unsigned int maj, min, rev; + + if (utf8_parse_version(version, &maj, &min, &rev) < 0) + return ERR_PTR(-EINVAL); + + if (!utf8version_is_supported(maj, min, rev)) + return ERR_PTR(-EINVAL); + + nls_version = UNICODE_AGE(maj, min, rev); + } else { + nls_version = utf8version_latest(); + printk(KERN_WARNING"UTF-8 version not specified. " + "Assuming latest supported version (%d.%d.%d).", + (nls_version >> 16) & 0xff, (nls_version >> 8) & 0xff, + (nls_version & 0xff)); + } +#else + nls_version = 0; +#endif + + utbl = kzalloc(sizeof(struct utf8_table), GFP_KERNEL); + if (!utbl) + return ERR_PTR(-ENOMEM); + + utbl->tbl.charset = &utf8_info; + utbl->tbl.version = nls_version; + utbl->tbl.flags = flags; + utf8_set_ops(utbl); + + utbl->tbl.next = utf8_info.tables; + utf8_info.tables = &utbl->tbl; + + return &utbl->tbl; +} + +static void utf8_cleanup_tables(void) +{ + struct nls_table *tmp, *tbl = utf8_info.tables; + + while (tbl) { + tmp = tbl; + tbl = tbl->next; + kfree(tmp); + } + utf8_info.tables = NULL; +} + +static struct nls_charset utf8_info = { .charset = "utf8", - .tables = &table, + .load_table = utf8_load_table, }; static int __init init_nls_utf8(void) @@ -74,12 +314,13 @@ static int __init init_nls_utf8(void) for (i=0; i<256; i++) identity[i] = i; - return register_nls(&nls_charset); + return register_nls(&utf8_info); } static void __exit exit_nls_utf8(void) { - unregister_nls(&nls_charset); + unregister_nls(&utf8_info); + utf8_cleanup_tables(); } module_init(init_nls_utf8) diff --git a/fs/nls/nls_utf8-norm.c b/fs/nls/nls_utf8-norm.c index 64c3cc74a2ca..abee8b376a87 100644 --- a/fs/nls/nls_utf8-norm.c +++ b/fs/nls/nls_utf8-norm.c @@ -38,6 +38,12 @@ int utf8version_is_supported(u8 maj, u8 min, u8 rev) } EXPORT_SYMBOL(utf8version_is_supported); +int utf8version_latest() +{ + return utf8vers; +} +EXPORT_SYMBOL(utf8version_latest); + /* * UTF-8 valid ranges. * diff --git a/fs/nls/utf8n.h b/fs/nls/utf8n.h index f60827663503..b4697f9bfbab 100644 --- a/fs/nls/utf8n.h +++ b/fs/nls/utf8n.h @@ -32,6 +32,7 @@ /* Highest unicode version supported by the data tables. */ extern int utf8version_is_supported(u8 maj, u8 min, u8 rev); +extern int utf8version_latest(void); /* * Look for the correct const struct utf8data for a unicode version. diff --git a/include/linux/nls.h b/include/linux/nls.h index aab60d4858ee..aee5cbfc07c6 100644 --- a/include/linux/nls.h +++ b/include/linux/nls.h @@ -186,6 +186,14 @@ NLS_CASEFOLD_FUNCS(ALL, TOUPPER, NLS_CASEFOLD_TYPE_TOUPPER) NLS_CASEFOLD_FUNCS(ASCII, TOUPPER, NLS_ASCII_CASEFOLD_TOUPPER) NLS_CASEFOLD_FUNCS(ASCII, TOLOWER, NLS_ASCII_CASEFOLD_TOLOWER) +/* UTF-8 */ + +#define NLS_UTF8_NORMALIZATION_TYPE_NFKD NLS_NORMALIZATION_TYPE(1) +#define NLS_UTF8_CASEFOLD_TYPE_NFKDCF NLS_CASEFOLD_TYPE(1) + +NLS_NORMALIZATION_FUNCS(UTF8, NFKD, NLS_UTF8_NORMALIZATION_TYPE_NFKD) +NLS_CASEFOLD_FUNCS(UTF8, NFKDCF, NLS_UTF8_CASEFOLD_TYPE_NFKDCF) + /* nls_base.c */ extern int __register_nls(struct nls_charset *, struct module *); extern int unregister_nls(struct nls_charset *); From patchwork Thu Dec 6 23:08:58 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717213 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 01EBB15A6 for ; Thu, 6 Dec 2018 23:10:18 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E54642F205 for ; Thu, 6 Dec 2018 23:10:17 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D9CB72F21D; Thu, 6 Dec 2018 23:10:17 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 14DCD2F205 for ; Thu, 6 Dec 2018 23:10:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726361AbeLFXKQ (ORCPT ); Thu, 6 Dec 2018 18:10:16 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56146 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726353AbeLFXKP (ORCPT ); Thu, 6 Dec 2018 18:10:15 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 5A35027E03E From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 18/23] nls: utf8: Introduce test module for normalized utf8 implementation Date: Thu, 6 Dec 2018 18:08:58 -0500 Message-Id: <20181206230903.30011-19-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi This implements a in-kernel sanity test module for the utf8 normalization core. At probe time, it will run basic sequences through the utf8n core, to identify problems will equivalent sequences and normalization/casefold code. This is supposed to be useful for regression testing when adding support for a new version of utf8 to linux. Changes since RFC v2: - Merge with NLS Changes since RFC v1: - Include comparison tests for matching strings with different lengths. - Include tests for characters included in unicode 8.0.0, 9.0.0 and 10.0.0. Signed-off-by: Gabriel Krisman Bertazi --- fs/nls/Kconfig | 5 + fs/nls/Makefile | 1 + fs/nls/nls_utf8-selftest.c | 316 +++++++++++++++++++++++++++++++++++++ 3 files changed, 322 insertions(+) create mode 100644 fs/nls/nls_utf8-selftest.c diff --git a/fs/nls/Kconfig b/fs/nls/Kconfig index 7cb2848da608..b9c9b663d0ab 100644 --- a/fs/nls/Kconfig +++ b/fs/nls/Kconfig @@ -626,4 +626,9 @@ config NLS_UTF8_NORMALIZATION Say Y here to enable utf8 NFKD normalization and casefolding support. +config NLS_UTF8_NORMALIZATION_SELFTEST + tristate "Test UTF-8 normalization support" + depends on NLS_UTF8 + default n + endif # NLS diff --git a/fs/nls/Makefile b/fs/nls/Makefile index bd13c1a90767..e88006d2f3ac 100644 --- a/fs/nls/Makefile +++ b/fs/nls/Makefile @@ -47,6 +47,7 @@ obj-$(CONFIG_NLS_KOI8_U) += nls_koi8-u.o nls_koi8-ru.o obj-$(CONFIG_NLS_UTF8) += nls_utf8.o nls_utf8-y += nls_utf8-core.o nls_utf8-$(CONFIG_NLS_UTF8_NORMALIZATION) += nls_utf8-norm.o +obj-$(CONFIG_NLS_UTF8_NORMALIZATION_SELFTEST) += nls_utf8-selftest.o obj-$(CONFIG_NLS_MAC_CELTIC) += mac-celtic.o obj-$(CONFIG_NLS_MAC_CENTEURO) += mac-centeuro.o diff --git a/fs/nls/nls_utf8-selftest.c b/fs/nls/nls_utf8-selftest.c new file mode 100644 index 000000000000..47e4f24a3f44 --- /dev/null +++ b/fs/nls/nls_utf8-selftest.c @@ -0,0 +1,316 @@ +/* + * Kernel module for testing utf-8 support. + * + * Copyright 2017 Collabora Ltd. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and modified under those terms. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include + +#include "utf8n.h" + +unsigned int failed_tests; +unsigned int total_tests; + +/* Tests will be based on this version. */ +#define latest_maj 11 +#define latest_min 0 +#define latest_rev 0 + +#define _test(cond, func, line, fmt, ...) do { \ + total_tests++; \ + if (!cond) { \ + failed_tests++; \ + pr_err("test %s:%d Failed: %s%s", \ + func, line, #cond, (fmt?":":".")); \ + if (fmt) \ + pr_err(fmt, ##__VA_ARGS__); \ + } \ + } while (0) +#define test_f(cond, fmt, ...) _test(cond, __func__, __LINE__, fmt, ##__VA_ARGS__) +#define test(cond) _test(cond, __func__, __LINE__, "") + +const static struct { + /* UTF-8 strings in this vector _must_ be NULL-terminated. */ + unsigned char str[10]; + unsigned char dec[10]; +} nfkdi_test_data[] = { + /* Trivial sequence */ + { + /* "ABba" decomposes to itself */ + .str = {0x41, 0x42, 0x62, 0x61, 0x00}, + .dec = {0x41, 0x42, 0x62, 0x61, 0x00} + }, + /* Simple equivalent sequences */ + { + /* 'VULGAR FRACTION ONE QUARTER' decomposes to + 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */ + .str = {0xc2, 0xbc, 0x00}, + .dec = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00}, + }, + { + /* 'LATIN SMALL LETTER A WITH DIAERESIS' decomposes to + 'LETTER A' + 'COMBINING DIAERESIS' */ + .str = {0xc3, 0xa4, 0x00}, + .dec = {0x61, 0xcc, 0x88, 0x00}, + }, + { + /* 'LATIN SMALL LETTER LJ' decomposes to + 'LETTER L' + 'LETTER J' */ + .str = {0xC7, 0x89, 0x00}, + .dec = {0x6c, 0x6a, 0x00}, + }, + { + /* GREEK ANO TELEIA decomposes to MIDDLE DOT */ + .str = {0xCE, 0x87, 0x00}, + .dec = {0xC2, 0xB7, 0x00} + }, + /* Canonical ordering */ + { + /* A + 'COMBINING ACUTE ACCENT' + 'COMBINING OGONEK' decomposes + to A + 'COMBINING OGONEK' + 'COMBINING ACUTE ACCENT' */ + .str = {0x41, 0xcc, 0x81, 0xcc, 0xa8, 0x0}, + .dec = {0x41, 0xcc, 0xa8, 0xcc, 0x81, 0x0}, + }, + { + /* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK' + decomposes to + 'LETTER A' + 'COMBINING OGONEK' + 'COMBINING DIAERESIS' */ + .str = {0xc3, 0xa4, 0xCC, 0xA8, 0x00}, + + .dec = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00}, + }, + +}; + +const static struct { + /* UTF-8 strings in this vector _must_ be NULL-terminated. */ + unsigned char str[30]; + unsigned char ncf[30]; +} nfkdicf_test_data[] = { + /* Trivial sequences */ + { + /* "ABba" folds to lowercase */ + .str = {0x41, 0x42, 0x62, 0x61, 0x00}, + .ncf = {0x61, 0x62, 0x62, 0x61, 0x00}, + }, + { + /* All ASCII folds to lower-case */ + .str = "ABCDEFGHIJKLMNOPRSTUVWXYZ0.1", + .ncf = "abcdefghijklmnoprstuvwxyz0.1", + }, + { + /* LATIN SMALL LETTER SHARP S folds to + LATIN SMALL LETTER S + LATIN SMALL LETTER S */ + .str = {0xc3, 0x9f, 0x00}, + .ncf = {0x73, 0x73, 0x00}, + }, + { + /* LATIN CAPITAL LETTER A WITH RING ABOVE folds to + LATIN SMALL LETTER A + COMBINING RING ABOVE */ + .str = {0xC3, 0x85, 0x00}, + .ncf = {0x61, 0xcc, 0x8a, 0x00}, + }, + /* Introduced by UTF-8.0.0. */ + /* Cherokee letters are interesting test-cases because they fold + to upper-case. Before 8.0.0, Cherokee lowercase were + undefined, thus, the folding from LC is not stable between + 7.0.0 -> 8.0.0, but it is from UC. */ + { + /* CHEROKEE SMALL LETTER A folds to CHEROKEE LETTER A */ + .str = {0xea, 0xad, 0xb0, 0x00}, + .ncf = {0xe1, 0x8e, 0xa0, 0x00}, + }, + { + /* CHEROKEE SMALL LETTER YE folds to CHEROKEE LETTER YE */ + .str = {0xe1, 0x8f, 0xb8, 0x00}, + .ncf = {0xe1, 0x8f, 0xb0, 0x00}, + }, + { + /* OLD HUNGARIAN CAPITAL LETTER AMB folds to + OLD HUNGARIAN SMALL LETTER AMB */ + .str = {0xf0, 0x90, 0xb2, 0x83, 0x00}, + .ncf = {0xf0, 0x90, 0xb3, 0x83, 0x00}, + }, + /* Introduced by UTF-9.0.0. */ + { + /* OSAGE CAPITAL LETTER CHA folds to + OSAGE SMALL LETTER CHA */ + .str = {0xf0, 0x90, 0x92, 0xb5, 0x00}, + .ncf = {0xf0, 0x90, 0x93, 0x9d, 0x00}, + }, + { + /* LATIN CAPITAL LETTER SMALL CAPITAL I folds to + LATIN LETTER SMALL CAPITAL I */ + .str = {0xea, 0x9e, 0xae, 0x00}, + .ncf = {0xc9, 0xaa, 0x00}, + }, + /* Introduced by UTF-11.0.0. */ + { + /* GEORGIAN SMALL LETTER AN folds to GEORGIAN MTAVRULI + CAPITAL LETTER AN */ + .str = {0xe1, 0xb2, 0x90, 0x00}, + .ncf = {0xe1, 0x83, 0x90, 0x00}, + } +}; + +static void check_utf8_nfkdi(void) +{ + int i; + struct utf8cursor u8c; + const struct utf8data *data; + + data = utf8nfkdi(UNICODE_AGE(latest_maj, latest_min, latest_rev)); + if (!data) { + pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n", + __func__, latest_maj, latest_min, latest_rev); + return; + } + + for (i = 0; i < ARRAY_SIZE(nfkdi_test_data); i++) { + int len = strlen(nfkdi_test_data[i].str); + int nlen = strlen(nfkdi_test_data[i].dec); + int j = 0; + unsigned char c; + + test((utf8len(data, nfkdi_test_data[i].str) == nlen)); + test((utf8nlen(data, nfkdi_test_data[i].str, len) == nlen)); + + if (utf8cursor(&u8c, data, nfkdi_test_data[i].str) < 0) + pr_err("can't create cursor\n"); + + while ((c = utf8byte(&u8c)) > 0) { + test_f((c == nfkdi_test_data[i].dec[j]), + "Unexpected byte 0x%x should be 0x%x\n", + c, nfkdi_test_data[i].dec[j]); + j++; + } + + test((j == nlen)); + } +} + +static void check_utf8_nfkdicf(void) +{ + int i; + struct utf8cursor u8c; + const struct utf8data *data; + + data = utf8nfkdicf(UNICODE_AGE(latest_maj, latest_min, latest_rev)); + if (!data) { + pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n", + __func__, latest_maj, latest_min, latest_rev); + return; + } + + for (i = 0; i < ARRAY_SIZE(nfkdicf_test_data); i++) { + int len = strlen(nfkdicf_test_data[i].str); + int nlen = strlen(nfkdicf_test_data[i].ncf); + int j = 0; + unsigned char c; + + test((utf8len(data, nfkdicf_test_data[i].str) == nlen)); + test((utf8nlen(data, nfkdicf_test_data[i].str, len) == nlen)); + + if (utf8cursor(&u8c, data, nfkdicf_test_data[i].str) < 0) + pr_err("can't create cursor\n"); + + while ((c = utf8byte(&u8c)) > 0) { + test_f((c == nfkdicf_test_data[i].ncf[j]), + "Unexpected byte 0x%x should be 0x%x\n", + c, nfkdicf_test_data[i].ncf[j]); + j++; + } + + test((j == nlen)); + } +} + +static void check_utf8_comparisons(void) +{ + int i; + struct nls_table *table = load_nls_version("utf8", "11.0.0", + NLS_UTF8_NORMALIZATION_TYPE_NFKD | + NLS_UTF8_CASEFOLD_TYPE_NFKDCF); + + if (IS_ERR(table)) { + pr_err("%s: Unable to load utf8 %d.%d.%d. Skipping.\n", + __func__, latest_maj, latest_min, latest_rev); + return; + } + + for (i = 0; i < ARRAY_SIZE(nfkdi_test_data); i++) { + const char *s1 = nfkdi_test_data[i].str; + const char *s2 = nfkdi_test_data[i].dec; + + test_f(!nls_strncmp(table, s1, strlen(s1), s2, strlen(s2)), + "%s %s comparison mismatch\n", s1, s2); + } + for (i = 0; i < ARRAY_SIZE(nfkdicf_test_data); i++) { + const char *s1 = nfkdicf_test_data[i].str; + const char *s2 = nfkdicf_test_data[i].ncf; + + test_f(!nls_strncasecmp(table, s1, strlen(s1), + s2, strlen(s2)), + "%s %s comparison mismatch\n", s1, s2); + } + + unload_nls(table); +} + +static void check_supported_versions(void) +{ + /* Unicode 7.0.0 should be supported. */ + test(utf8version_is_supported(7, 0, 0)); + + /* Unicode 9.0.0 should be supported. */ + test(utf8version_is_supported(9, 0, 0)); + + /* Unicode 1x.0.0 (the latest version) should be supported. */ + test(utf8version_is_supported(latest_maj, latest_min, latest_rev)); + + /* Next versions don't exist. */ + test(!utf8version_is_supported(12, 0, 0)); + test(!utf8version_is_supported(0, 0, 0)); + test(!utf8version_is_supported(-1, -1, -1)); +} + +static int __init init_test_ucd(void) +{ + failed_tests = 0; + total_tests = 0; + + check_supported_versions(); + check_utf8_nfkdi(); + check_utf8_nfkdicf(); + check_utf8_comparisons(); + + if (!failed_tests) + pr_info("All %u tests passed\n", total_tests); + else + pr_err("%u out of %u tests failed\n", failed_tests, + total_tests); + return 0; +} + +static void __exit exit_test_ucd(void) +{ +} + +module_init(init_test_ucd); +module_exit(exit_test_ucd); + +MODULE_AUTHOR("Gabriel Krisman Bertazi "); +MODULE_LICENSE("GPL"); From patchwork Thu Dec 6 23:08:59 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717215 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3321F1932 for ; Thu, 6 Dec 2018 23:10:21 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 249FF2F21D for ; Thu, 6 Dec 2018 23:10:21 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 18F832F22E; Thu, 6 Dec 2018 23:10:21 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B9FDD2F21D for ; Thu, 6 Dec 2018 23:10:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726366AbeLFXKT (ORCPT ); Thu, 6 Dec 2018 18:10:19 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56154 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726353AbeLFXKT (ORCPT ); Thu, 6 Dec 2018 18:10:19 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 6212027ED58 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 19/23] ext4: Reserve superblock fields for encoding information Date: Thu, 6 Dec 2018 18:08:59 -0500 Message-Id: <20181206230903.30011-20-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi The s_encoding field stores a magic number indicating the encoding format and version used globally by file and directory names in the filesystem. The s_encoding_flags defines policies for using the charset encoding, like how to handle invalid sequences and what kind of normalization to use. Signed-off-by: Gabriel Krisman Bertazi --- fs/ext4/ext4.h | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 3f89d0ab08fc..52c9e8b948a0 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1311,7 +1311,9 @@ struct ext4_super_block { __u8 s_first_error_time_hi; __u8 s_last_error_time_hi; __u8 s_pad[2]; - __le32 s_reserved[96]; /* Padding to the end of the block */ + __le16 s_encoding; /* Filename charset encoding */ + __le16 s_encoding_flags; /* Filename charset encoding flags */ + __le32 s_reserved[95]; /* Padding to the end of the block */ __le32 s_checksum; /* crc32c(superblock) */ }; @@ -1661,6 +1663,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei) #define EXT4_FEATURE_INCOMPAT_LARGEDIR 0x4000 /* >2GB or 3-lvl htree */ #define EXT4_FEATURE_INCOMPAT_INLINE_DATA 0x8000 /* data in inode */ #define EXT4_FEATURE_INCOMPAT_ENCRYPT 0x10000 +#define EXT4_FEATURE_INCOMPAT_FNAME_ENCODING 0x20000 #define EXT4_FEATURE_COMPAT_FUNCS(name, flagname) \ static inline bool ext4_has_feature_##name(struct super_block *sb) \ @@ -1749,6 +1752,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(csum_seed, CSUM_SEED) EXT4_FEATURE_INCOMPAT_FUNCS(largedir, LARGEDIR) EXT4_FEATURE_INCOMPAT_FUNCS(inline_data, INLINE_DATA) EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT) +EXT4_FEATURE_INCOMPAT_FUNCS(fname_encoding, FNAME_ENCODING) #define EXT2_FEATURE_COMPAT_SUPP EXT4_FEATURE_COMPAT_EXT_ATTR #define EXT2_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \ @@ -1776,6 +1780,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT) EXT4_FEATURE_INCOMPAT_MMP | \ EXT4_FEATURE_INCOMPAT_INLINE_DATA | \ EXT4_FEATURE_INCOMPAT_ENCRYPT | \ + EXT4_FEATURE_INCOMPAT_FNAME_ENCODING | \ EXT4_FEATURE_INCOMPAT_CSUM_SEED | \ EXT4_FEATURE_INCOMPAT_LARGEDIR) #define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \ From patchwork Thu Dec 6 23:09:00 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717217 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C1C3E15A6 for ; Thu, 6 Dec 2018 23:10:25 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B0C162F21D for ; Thu, 6 Dec 2018 23:10:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A4E5D2F22E; Thu, 6 Dec 2018 23:10:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 21D7A2F21D for ; Thu, 6 Dec 2018 23:10:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726370AbeLFXKX (ORCPT ); Thu, 6 Dec 2018 18:10:23 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56158 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726353AbeLFXKX (ORCPT ); Thu, 6 Dec 2018 18:10:23 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 66BEF27ED70 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 20/23] ext4: Include encoding information in the superblock Date: Thu, 6 Dec 2018 18:09:00 -0500 Message-Id: <20181206230903.30011-21-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi Support for encoding is considered an incompatible feature, since it has potential to create collisions of file names in existing filesystems. If the feature flag is not enabled, the entire filesystem will operate on opaque byte sequences, respecting the original behavior. The charset data is encoded in a new field in the superblock using a magic number specific to ext4. This is the easiest way I found to avoid writing the name of the charset in the superblock. The magic number is mapped to the exact NLS table, but the mapping is specific to ext4. Since we don't have any commitment to support old encodings, the only encodings I am supporting right now is utf8-11.0 and ascii, both using the NLS abstraction. The current implementation prevents the user from enabling encoding and per-directory encryption on the same filesystem at the same time. The incompatibility between these features lies in how we do efficient directory searches when we cannot be sure the encryption of the user provided fname will match the actual hash stored in the disk without decrypting every directory entry, because of normalization cases. My quickest solution is to simply block the concurrent use of these features for now, and enable it later, once we have a better solution. Changes since v2: - Split superblock bitfield reservation into another patch. - Rename s_ioencoding -> s_encoding - Remove encoding_info from in-memory superblock. Changes since v1: - Guard code with CONFIG_NLS. - Use 16 bits for s_ioencoding. - Split mount option from this patch Signed-off-by: Gabriel Krisman Bertazi --- fs/ext4/ext4.h | 7 +++++ fs/ext4/super.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 84 insertions(+) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 52c9e8b948a0..c21717a19106 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1338,6 +1338,9 @@ struct ext4_super_block { /* Number of quota types we support */ #define EXT4_MAXQUOTAS 3 +#define EXT4_ENC_ASCII 0 +#define EXT4_ENC_UTF8_11_0 1 + /* * fourth extended-fs super-block data in memory */ @@ -1387,6 +1390,10 @@ struct ext4_sb_info { struct kobject s_kobj; struct completion s_kobj_unregister; struct super_block *s_sb; +#ifdef CONFIG_NLS + struct nls_table *s_encoding; + __u16 s_encoding_flags; +#endif /* Journaling */ struct journal_s *s_journal; diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 53ff6c2a26ed..e64a9ed2ca12 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -1022,6 +1023,9 @@ static void ext4_put_super(struct super_block *sb) crypto_free_shash(sbi->s_chksum_driver); kfree(sbi->s_blockgroup_lock); fs_put_dax(sbi->s_daxdev); +#ifdef CONFIG_NLS + unload_nls(sbi->s_encoding); +#endif kfree(sbi); } @@ -1716,6 +1720,37 @@ static const struct mount_opts { {Opt_err, 0, 0} }; +#ifdef CONFIG_NLS +static const struct ext4_sb_encodings { + __u16 magic; + char *name; + char *version; +} ext4_sb_encoding_map[] = { + {EXT4_ENC_ASCII, "ascii", NULL}, + {EXT4_ENC_UTF8_11_0, "utf8", "11.0.0"}, +}; + +static int ext4_sb_read_encoding(const struct ext4_super_block *es, + const struct ext4_sb_encodings **encoding, + __u16 *flags) +{ + __u16 magic = le16_to_cpu(es->s_encoding); + int i; + + for (i = 0; i < ARRAY_SIZE(ext4_sb_encoding_map); i++) + if (magic == ext4_sb_encoding_map[i].magic) + break; + + if (i >= ARRAY_SIZE(ext4_sb_encoding_map)) + return -EINVAL; + + *encoding = &ext4_sb_encoding_map[i]; + *flags = le16_to_cpu(es->s_encoding_flags); + + return 0; +} +#endif + static int handle_mount_opt(struct super_block *sb, char *opt, int token, substring_t *args, unsigned long *journal_devnum, unsigned int *journal_ioprio, int is_remount) @@ -3534,6 +3569,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) int err = 0; unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO; ext4_group_t first_not_zeroed; +#ifdef CONFIG_NLS + struct nls_table *encoding; + const struct ext4_sb_encodings *encoding_info; + __u16 nls_flags; +#endif if ((data && !orig_data) || !sbi) goto out_free_base; @@ -3706,6 +3746,38 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) &journal_ioprio, 0)) goto failed_mount; +#ifdef CONFIG_NLS + if (ext4_has_feature_fname_encoding(sb) && !sbi->s_encoding) { + if (ext4_has_feature_encrypt(sb)) { + ext4_msg(sb, KERN_ERR, + "Can't mount with both encoding and encryption"); + goto failed_mount; + } + + if (ext4_sb_read_encoding(es, &encoding_info, &nls_flags)) { + ext4_msg(sb, KERN_ERR, + "Encoding requested by superblock is unknown"); + goto failed_mount; + } + + encoding = load_nls_version(encoding_info->name, + encoding_info->version, nls_flags); + if (IS_ERR(encoding)) { + ext4_msg(sb, KERN_ERR, "can't mount with superblock charset: " + "%s-%s not supported by the kernel. flags: 0x%x", + encoding_info->name, encoding_info->version, + nls_flags); + goto failed_mount; + } + ext4_msg(sb, KERN_INFO,"Using encoding defined by superblock: " + "%s-%s with flags 0x%hx", encoding_info->name, + encoding_info->version?:"\b", nls_flags); + + sbi->s_encoding = encoding; + sbi->s_encoding_flags = nls_flags; + } +#endif + if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) { printk_once(KERN_WARNING "EXT4-fs: Warning: mounting " "with data=journal disables delayed " @@ -4547,6 +4619,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) failed_mount: if (sbi->s_chksum_driver) crypto_free_shash(sbi->s_chksum_driver); + +#ifdef CONFIG_NLS + unload_nls(sbi->s_encoding); +#endif + #ifdef CONFIG_QUOTA for (i = 0; i < EXT4_MAXQUOTAS; i++) kfree(sbi->s_qf_names[i]); From patchwork Thu Dec 6 23:09:01 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717219 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D518617DB for ; Thu, 6 Dec 2018 23:10:29 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id C445C2F21D for ; Thu, 6 Dec 2018 23:10:29 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B87382F22E; Thu, 6 Dec 2018 23:10:29 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AB0342F21D for ; Thu, 6 Dec 2018 23:10:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726374AbeLFXK1 (ORCPT ); Thu, 6 Dec 2018 18:10:27 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56162 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726353AbeLFXK1 (ORCPT ); Thu, 6 Dec 2018 18:10:27 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id E6FAE27ED73 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 21/23] ext4: Support encoding-aware file name lookups Date: Thu, 6 Dec 2018 18:09:01 -0500 Message-Id: <20181206230903.30011-22-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi This patch implements the actual support for encoding-aware file name lookups in ext4, based on the feature bit and the encoding stored in the superblock. A filesystem that has the encoding feature set is able to find files even if the name used by userspace is not exactly the same, but if it is an equivalent string. This operation will be called and inexact-match name search. Ext4 only stores the first equivalent name dentry used in the dcache. This is done to prevent unintentional duplication of dentries in the dcache, while also allowing the VFS code to quickly find the right entry in the cache despite what equivalent string was used without resorting to ->lookup(). d_hash() is implemented as the hash of the normalized string, such that we always have a well-known bucket for all the equivalencies of the same string. d_compare uses the nls_strncmp() infrastructure, which should handle the comparison of equivalent names as well. If the filesystem's normalization type is PLAIN, though, we can just reuse the VFS hash. For now, negative lookups are not inserted in the dcache, since they would need to be invalidated anyway, because we can't trust missing file dentries. This is bad for performance but requires some leveraging of the vfs layer to fix. We can live without that for now, and so does everyone else. DX is supported by modifying the hashes to make them encoding-aware. The new disk hashes are also calculated as the hash of the normalized string, instead of the string directly. This allows us to efficiently search for file names in the htree without requiring the user to provide the exact name. Changes since v2: - Don't use d_add_ci. - Squash the dcache hooks into this patch. - Rename sbi->encoding -> sbi->s_encoding. Changes since v1: - Support normalized htree hashes. - Guard code with CONFIG_NLS. - Use qstr->len instead of strlen in dcache hookups. Signed-off-by: Gabriel Krisman Bertazi --- fs/ext4/dir.c | 45 ++++++++++++++++++++++++++++ fs/ext4/ext4.h | 12 ++++++-- fs/ext4/hash.c | 34 ++++++++++++++++++++- fs/ext4/ialloc.c | 2 +- fs/ext4/inline.c | 2 +- fs/ext4/namei.c | 78 +++++++++++++++++++++++++++++++++++++++++------- fs/ext4/super.c | 6 ++++ 7 files changed, 163 insertions(+), 16 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index f93f9881ec18..efb75c204551 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "ext4.h" #include "xattr.h" @@ -662,3 +663,47 @@ const struct file_operations ext4_dir_operations = { .open = ext4_dir_open, .release = ext4_release_dir, }; + +#ifdef CONFIG_NLS +static int ext4_d_compare(const struct dentry *dentry, unsigned int len, + const char *str, const struct qstr *name) +{ + struct nls_table *charset = EXT4_SB(dentry->d_sb)->s_encoding; + + return nls_strncmp(charset, str, len, name->name, name->len); +} + +static int ext4_d_hash(const struct dentry *dentry, struct qstr *q) +{ + const struct nls_table *charset = EXT4_SB(dentry->d_sb)->s_encoding; + unsigned char *norm; + int len, ret = 0; + + /* If normalization is TYPE_PLAIN, we can just reuse the vfs + * hash. */ + if (IS_NORMALIZATION_TYPE_ALL_PLAIN(charset)) + return 0; + + norm = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!norm) + return -ENOMEM; + + len = nls_normalize(charset, q->name, q->len, norm, PATH_MAX); + + if (len < 0) { + ret = -EINVAL; + goto out; + } + + q->hash = full_name_hash(dentry, norm, len); + +out: + kfree (norm); + return ret; +} + +const struct dentry_operations ext4_dentry_ops = { + .d_hash = ext4_d_hash, + .d_compare = ext4_d_compare, +}; +#endif diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index c21717a19106..e84a6605a19a 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1341,6 +1341,11 @@ struct ext4_super_block { #define EXT4_ENC_ASCII 0 #define EXT4_ENC_UTF8_11_0 1 +/* + * Flags for ext4_sb_info.s_encoding_flags. Be careful when modifying + * these, as they must match their NLS counterpart. */ +#define EXT4_ENC_STRICT_MODE_FL (1 << 0) + /* * fourth extended-fs super-block data in memory */ @@ -2387,8 +2392,8 @@ extern int ext4_check_all_de(struct inode *dir, struct buffer_head *bh, extern int ext4_sync_file(struct file *, loff_t, loff_t, int); /* hash.c */ -extern int ext4fs_dirhash(const char *name, int len, struct - dx_hash_info *hinfo); +extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len, + struct dx_hash_info *hinfo); /* ialloc.c */ extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t, @@ -2971,6 +2976,9 @@ static inline void ext4_unlock_group(struct super_block *sb, /* dir.c */ extern const struct file_operations ext4_dir_operations; +#ifdef CONFIG_NLS +extern const struct dentry_operations ext4_dentry_ops; +#endif /* file.c */ extern const struct inode_operations ext4_file_inode_operations; diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c index e22dcfab308b..8ec9c7145987 100644 --- a/fs/ext4/hash.c +++ b/fs/ext4/hash.c @@ -6,6 +6,7 @@ */ #include +#include #include #include #include "ext4.h" @@ -196,7 +197,8 @@ static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num) * represented, and whether or not the returned hash is 32 bits or 64 * bits. 32 bit hashes will return 0 for the minor hash. */ -int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo) +static int __ext4fs_dirhash(const char *name, int len, + struct dx_hash_info *hinfo) { __u32 hash; __u32 minor_hash = 0; @@ -266,3 +268,33 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo) hinfo->minor_hash = minor_hash; return 0; } + +int ext4fs_dirhash(const struct inode *dir, const char *name, int len, + struct dx_hash_info *hinfo) +{ +#ifdef CONFIG_NLS + const struct nls_table *charset = EXT4_SB(dir->i_sb)->s_encoding; + int r, dlen; + unsigned char *buff; + + if (len && charset) { + buff = kzalloc(sizeof (char) * PATH_MAX, GFP_KERNEL); + if (!buff) + return -1; + + dlen = nls_normalize(charset, name, len, buff, PATH_MAX); + + if (dlen < 0) { + kfree(buff); + goto opaque_seq; + } + + r = __ext4fs_dirhash(buff, dlen, hinfo); + + kfree(buff); + return r; + } +opaque_seq: +#endif + return __ext4fs_dirhash(name, len, hinfo); +} diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index 014f6a698cb7..1ef355549abf 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -455,7 +455,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent, if (qstr) { hinfo.hash_version = DX_HASH_HALF_MD4; hinfo.seed = sbi->s_hash_seed; - ext4fs_dirhash(qstr->name, qstr->len, &hinfo); + ext4fs_dirhash(parent, qstr->name, qstr->len, &hinfo); grp = hinfo.hash; } else grp = prandom_u32(); diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 9c4bac18cc6c..10b9d3dcec4e 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -1404,7 +1404,7 @@ int htree_inlinedir_to_tree(struct file *dir_file, } } - ext4fs_dirhash(de->name, de->name_len, hinfo); + ext4fs_dirhash(dir, de->name, de->name_len, hinfo); if ((hinfo->hash < start_hash) || ((hinfo->hash == start_hash) && (hinfo->minor_hash < start_minor_hash))) diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 437f71fe83ae..23e0e911b3fe 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -35,6 +35,7 @@ #include #include #include +#include #include "ext4.h" #include "ext4_jbd2.h" @@ -629,7 +630,7 @@ static struct stats dx_show_leaf(struct inode *dir, } if (!fscrypt_has_encryption_key(dir)) { /* Directory is not encrypted */ - ext4fs_dirhash(de->name, + ext4fs_dirhash(dir, de->name, de->name_len, &h); printk("%*.s:(U)%x.%u ", len, name, h.hash, @@ -662,8 +663,8 @@ static struct stats dx_show_leaf(struct inode *dir, name = fname_crypto_str.name; len = fname_crypto_str.len; } - ext4fs_dirhash(de->name, de->name_len, - &h); + ext4fs_dirhash(dir, de->name, + de->name_len, &h); printk("%*.s:(E)%x.%u ", len, name, h.hash, (unsigned) ((char *) de - base)); @@ -673,7 +674,7 @@ static struct stats dx_show_leaf(struct inode *dir, #else int len = de->name_len; char *name = de->name; - ext4fs_dirhash(de->name, de->name_len, &h); + ext4fs_dirhash(dir, de->name, de->name_len, &h); printk("%*.s:%x.%u ", len, name, h.hash, (unsigned) ((char *) de - base)); #endif @@ -762,7 +763,7 @@ dx_probe(struct ext4_filename *fname, struct inode *dir, hinfo->hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned; hinfo->seed = EXT4_SB(dir->i_sb)->s_hash_seed; if (fname && fname_name(fname)) - ext4fs_dirhash(fname_name(fname), fname_len(fname), hinfo); + ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), hinfo); hash = hinfo->hash; if (root->info.unused_flags & 1) { @@ -1008,7 +1009,7 @@ static int htree_dirblock_to_tree(struct file *dir_file, /* silently ignore the rest of the block */ break; } - ext4fs_dirhash(de->name, de->name_len, hinfo); + ext4fs_dirhash(dir, de->name, de->name_len, hinfo); if ((hinfo->hash < start_hash) || ((hinfo->hash == start_hash) && (hinfo->minor_hash < start_minor_hash))) @@ -1197,7 +1198,7 @@ static int dx_make_map(struct inode *dir, struct ext4_dir_entry_2 *de, while ((char *) de < base + blocksize) { if (de->name_len && de->inode) { - ext4fs_dirhash(de->name, de->name_len, &h); + ext4fs_dirhash(dir, de->name, de->name_len, &h); map_tail--; map_tail->hash = h.hash; map_tail->offs = ((char *) de - base)>>2; @@ -1257,10 +1258,14 @@ static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block) * * Return: %true if the directory entry matches, otherwise %false. */ -static inline bool ext4_match(const struct ext4_filename *fname, +static inline bool ext4_match(const struct inode *parent, + const struct ext4_filename *fname, const struct ext4_dir_entry_2 *de) { struct fscrypt_name f; +#ifdef CONFIG_NLS + const struct ext4_sb_info *sbi = EXT4_SB(parent->i_sb); +#endif if (!de->inode) return false; @@ -1270,6 +1275,15 @@ static inline bool ext4_match(const struct ext4_filename *fname, #ifdef CONFIG_EXT4_FS_ENCRYPTION f.crypto_buf = fname->crypto_buf; #endif + +#ifdef CONFIG_NLS + if (sbi->s_encoding) { + return !nls_strncmp(sbi->s_encoding, + de->name, de->name_len, + f.disk_name.name, f.disk_name.len); + } +#endif + return fscrypt_match_name(&f, de->name, de->name_len); } @@ -1290,7 +1304,7 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size, /* this code is executed quadratically often */ /* do minimal checking `by hand' */ if ((char *) de + de->name_len <= dlimit && - ext4_match(fname, de)) { + ext4_match(dir, fname, de)) { /* found a match - just to be sure, do * a full check */ if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data, @@ -1588,6 +1602,17 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi return ERR_PTR(-EPERM); } } + +#ifdef CONFIG_NLS + if (EXT4_SB(dir->i_sb)->s_encoding && !inode) { + /* Eventually we want to call d_add_ci(dentry, NULL) + * for negative dentries in the encoding case as + * well. For now, prevent the negative dentry + * from being cached. + */ + return NULL; + } +#endif return d_splice_alias(inode, dentry); } @@ -1798,7 +1823,7 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode, if (ext4_check_dir_entry(dir, NULL, de, bh, buf, buf_size, offset)) return -EFSCORRUPTED; - if (ext4_match(fname, de)) + if (ext4_match(dir, fname, de)) return -EEXIST; nlen = EXT4_DIR_REC_LEN(de->name_len); rlen = ext4_rec_len_from_disk(de->rec_len, buf_size); @@ -1983,7 +2008,7 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname, if (fname->hinfo.hash_version <= DX_HASH_TEA) fname->hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned; fname->hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed; - ext4fs_dirhash(fname_name(fname), fname_len(fname), &fname->hinfo); + ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), &fname->hinfo); memset(frames, 0, sizeof(frames)); frame = frames; @@ -2036,6 +2061,7 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry, struct ext4_dir_entry_2 *de; struct ext4_dir_entry_tail *t; struct super_block *sb; + struct ext4_sb_info *sbi; struct ext4_filename fname; int retval; int dx_fallback=0; @@ -2047,10 +2073,18 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry, csum_size = sizeof(struct ext4_dir_entry_tail); sb = dir->i_sb; + sbi = EXT4_SB(sb); blocksize = sb->s_blocksize; if (!dentry->d_name.len) return -EINVAL; +#ifdef CONFIG_NLS + if (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL && + nls_validate(sbi->s_encoding, dentry->d_name.name, + dentry->d_name.len)) + return -EINVAL; +#endif + retval = ext4_fname_setup_filename(dir, &dentry->d_name, 0, &fname); if (retval) return retval; @@ -2975,6 +3009,17 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry) ext4_update_dx_flag(dir); ext4_mark_inode_dirty(handle, dir); +#ifdef CONFIG_NLS + /* VFS negative dentries are incompatible with Encoding and + * Case-insensitiveness. Eventually we'll want avoid + * invalidating the dentries here, alongside with returning the + * negative dentries at ext4_lookup(), when it is better + * supported by the VFS for the CI case. + */ + if (EXT4_SB(dir->i_sb)->s_encoding) + d_invalidate(dentry); +#endif + end_rmdir: brelse(bh); if (handle) @@ -3044,6 +3089,17 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry) inode->i_ctime = current_time(inode); ext4_mark_inode_dirty(handle, inode); +#ifdef CONFIG_NLS + /* VFS negative dentries are incompatible with Encoding and + * Case-insensitiveness. Eventually we'll want avoid + * invalidating the dentries here, alongside with returning the + * negative dentries at ext4_lookup(), when it is better + * supported by the VFS for the CI case. + */ + if (EXT4_SB(dir->i_sb)->s_encoding) + d_invalidate(dentry); +#endif + end_unlink: brelse(bh); if (handle) diff --git a/fs/ext4/super.c b/fs/ext4/super.c index e64a9ed2ca12..5ac1ed77f36b 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4412,6 +4412,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) iput(root); goto failed_mount4; } + +#ifdef CONFIG_NLS + if (sbi->s_encoding) + sb->s_d_op = &ext4_dentry_ops; +#endif + sb->s_root = d_make_root(root); if (!sb->s_root) { ext4_msg(sb, KERN_ERR, "get root dentry failed"); From patchwork Thu Dec 6 23:09:02 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717221 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5DA1017DB for ; Thu, 6 Dec 2018 23:10:33 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4CEEA2F21D for ; Thu, 6 Dec 2018 23:10:33 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 414D52F22E; Thu, 6 Dec 2018 23:10:33 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A36FB2F21D for ; Thu, 6 Dec 2018 23:10:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726375AbeLFXKb (ORCPT ); Thu, 6 Dec 2018 18:10:31 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56166 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726256AbeLFXKb (ORCPT ); Thu, 6 Dec 2018 18:10:31 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 4AC4B27ED64 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 22/23] ext4: Implement EXT4_CASEFOLD_FL flag Date: Thu, 6 Dec 2018 18:09:02 -0500 Message-Id: <20181206230903.30011-23-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi Casefold is a flag applied to directories and inherited by its children which states that the directory requires case-insensitive searches. This flag can only be enabled on empty directories for filesystems that support the encoding feature, thus preventing collision of file names that only differ by case. Enconding-awareness is also required because we consider the casefold operation not be defined for opaque byte sequences. Changes since v2: - Rename sbi->encoding -> sbi->s_encoding. Changes since v1: - Moved the CASEFOLD_FL to prevent collision with reserved verity flag. Signed-off-by: Gabriel Krisman Bertazi --- fs/ext4/dir.c | 30 ++++++++++++++++++++++-------- fs/ext4/ext4.h | 7 ++++--- fs/ext4/hash.c | 6 +++++- fs/ext4/inode.c | 4 +++- fs/ext4/ioctl.c | 18 ++++++++++++++++++ fs/ext4/namei.c | 13 ++++++++++--- include/linux/fs.h | 2 ++ 7 files changed, 64 insertions(+), 16 deletions(-) diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c index efb75c204551..43b91747f7e7 100644 --- a/fs/ext4/dir.c +++ b/fs/ext4/dir.c @@ -670,6 +670,10 @@ static int ext4_d_compare(const struct dentry *dentry, unsigned int len, { struct nls_table *charset = EXT4_SB(dentry->d_sb)->s_encoding; + if (IS_CASEFOLDED(dentry->d_parent->d_inode)) + return nls_strncasecmp(charset, str, len, name->name, + name->len); + return nls_strncmp(charset, str, len, name->name, name->len); } @@ -679,16 +683,26 @@ static int ext4_d_hash(const struct dentry *dentry, struct qstr *q) unsigned char *norm; int len, ret = 0; - /* If normalization is TYPE_PLAIN, we can just reuse the vfs - * hash. */ - if (IS_NORMALIZATION_TYPE_ALL_PLAIN(charset)) - return 0; + if (!IS_CASEFOLDED(dentry->d_inode)) { - norm = kmalloc(PATH_MAX, GFP_ATOMIC); - if (!norm) - return -ENOMEM; + /* If normalization is TYPE_PLAIN, we can just reuse the + * VFS hash. + */ + if (IS_NORMALIZATION_TYPE_ALL_PLAIN(charset)) + return 0; - len = nls_normalize(charset, q->name, q->len, norm, PATH_MAX); + norm = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!norm) + return -ENOMEM; + + len = nls_normalize(charset, q->name, q->len, norm, PATH_MAX); + } else { + norm = kmalloc(PATH_MAX, GFP_ATOMIC); + if (!norm) + return -ENOMEM; + + len = nls_casefold(charset, q->name, q->len, norm, PATH_MAX); + } if (len < 0) { ret = -EINVAL; diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index e84a6605a19a..d21ed5e88302 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -400,10 +400,11 @@ struct flex_groups { #define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */ #define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */ #define EXT4_PROJINHERIT_FL 0x20000000 /* Create with parents projid */ +#define EXT4_CASEFOLD_FL 0x40000000 /* Casefolded file */ #define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */ -#define EXT4_FL_USER_VISIBLE 0x304BDFFF /* User visible flags */ -#define EXT4_FL_USER_MODIFIABLE 0x204BC0FF /* User modifiable flags */ +#define EXT4_FL_USER_VISIBLE 0x704BDFFF /* User visible flags */ +#define EXT4_FL_USER_MODIFIABLE 0x604BC0FF /* User modifiable flags */ /* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */ #define EXT4_FL_XFLAG_VISIBLE (EXT4_SYNC_FL | \ @@ -418,7 +419,7 @@ struct flex_groups { EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\ EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\ EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\ - EXT4_PROJINHERIT_FL) + EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL) /* Flags that are appropriate for regular files (all but dir-specific ones). */ #define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL)) diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c index 8ec9c7145987..78cb97664a33 100644 --- a/fs/ext4/hash.c +++ b/fs/ext4/hash.c @@ -282,7 +282,11 @@ int ext4fs_dirhash(const struct inode *dir, const char *name, int len, if (!buff) return -1; - dlen = nls_normalize(charset, name, len, buff, PATH_MAX); + if (!IS_CASEFOLDED(dir)) + dlen = nls_normalize(charset, name, len, buff, + PATH_MAX); + else + dlen = nls_casefold(charset, name, len, buff, PATH_MAX); if (dlen < 0) { kfree(buff); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 22a9d8159720..9908d7d98b6e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4745,9 +4745,11 @@ void ext4_set_inode_flags(struct inode *inode) new_fl |= S_DAX; if (flags & EXT4_ENCRYPT_FL) new_fl |= S_ENCRYPTED; + if (flags & EXT4_CASEFOLD_FL) + new_fl |= S_CASEFOLD; inode_set_flags(inode, new_fl, S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX| - S_ENCRYPTED); + S_ENCRYPTED|S_CASEFOLD); } static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode, diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index 0edee31913d1..ef4ffe681836 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -231,6 +231,7 @@ static int ext4_ioctl_setflags(struct inode *inode, struct ext4_iloc iloc; unsigned int oldflags, mask, i; unsigned int jflag; + struct super_block *sb = inode->i_sb; /* Is it quota file? Do not allow user to mess with it */ if (ext4_is_quota_file(inode)) @@ -275,6 +276,23 @@ static int ext4_ioctl_setflags(struct inode *inode, goto flags_out; } + if ((flags ^ oldflags) & EXT4_CASEFOLD_FL) { + if (!ext4_has_feature_fname_encoding(sb)) { + err = -EOPNOTSUPP; + goto flags_out; + } + + if (!S_ISDIR(inode->i_mode)) { + err = -ENOTDIR; + goto flags_out; + } + + if (!ext4_empty_dir(inode)) { + err = -ENOTEMPTY; + goto flags_out; + } + } + handle = ext4_journal_start(inode, EXT4_HT_INODE, 1); if (IS_ERR(handle)) { err = PTR_ERR(handle); diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 23e0e911b3fe..a21f0d7227db 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -1278,9 +1278,16 @@ static inline bool ext4_match(const struct inode *parent, #ifdef CONFIG_NLS if (sbi->s_encoding) { - return !nls_strncmp(sbi->s_encoding, - de->name, de->name_len, - f.disk_name.name, f.disk_name.len); + if (!IS_CASEFOLDED(parent)) + return !nls_strncmp(sbi->s_encoding, + de->name, de->name_len, + fname->disk_name.name, + fname->disk_name.len); + else + return !nls_strncasecmp(sbi->s_encoding, + de->name, de->name_len, + fname->disk_name.name, + fname->disk_name.len); } #endif diff --git a/include/linux/fs.h b/include/linux/fs.h index c95c0807471f..69abaca207c0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1947,6 +1947,7 @@ struct super_operations { #define S_DAX 0 /* Make all the DAX code disappear */ #endif #define S_ENCRYPTED 16384 /* Encrypted file (using fs/crypto/) */ +#define S_CASEFOLD 32768 /* Casefolded file */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -1987,6 +1988,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags #define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC) #define IS_DAX(inode) ((inode)->i_flags & S_DAX) #define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED) +#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD) #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ (inode)->i_rdev == WHITEOUT_DEV) From patchwork Thu Dec 6 23:09:03 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gabriel Krisman Bertazi X-Patchwork-Id: 10717223 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9F9D215A6 for ; Thu, 6 Dec 2018 23:10:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9045D2F21D for ; Thu, 6 Dec 2018 23:10:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 84E662F22E; Thu, 6 Dec 2018 23:10:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,UNPARSEABLE_RELAY autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 295D52F21D for ; Thu, 6 Dec 2018 23:10:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726380AbeLFXKe (ORCPT ); Thu, 6 Dec 2018 18:10:34 -0500 Received: from bhuna.collabora.co.uk ([46.235.227.227]:56170 "EHLO bhuna.collabora.co.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726256AbeLFXKe (ORCPT ); Thu, 6 Dec 2018 18:10:34 -0500 Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 19C1C27ED89 From: Gabriel Krisman Bertazi To: tytso@mit.edu Cc: linux-fsdevel@vger.kernel.org, kernel@collabora.com, linux-ext4@vger.kernel.org, Gabriel Krisman Bertazi Subject: [PATCH v4 23/23] docs: ext4.rst: Document encoding and case-insensitive Date: Thu, 6 Dec 2018 18:09:03 -0500 Message-Id: <20181206230903.30011-24-krisman@collabora.com> X-Mailer: git-send-email 2.20.0.rc2 In-Reply-To: <20181206230903.30011-1-krisman@collabora.com> References: <20181206230903.30011-1-krisman@collabora.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Gabriel Krisman Bertazi Introduces the encoding-awareness and case-insensitive features on ext4 for system administrators. Explain the minimum of design decisions that are important for sysadmins enabling this feature. Signed-off-by: Gabriel Krisman Bertazi --- Documentation/admin-guide/ext4.rst | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index e506d3dae510..f42c682acecc 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -91,10 +91,39 @@ Currently Available * large block (up to pagesize) support * efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force the ordering) +* Encoding aware file names +* Case insensitive file name lookups [1] Filesystems with a block size of 1k may see a limit imposed by the directory hash tree having a maximum depth of two. +Encoding-aware file names and case-insensitive lookups +====================================================== + +Ext4 optionally supports filesystem-wide charset knowledge when handling +file names, which allows the user to perform file system lookups using +charset equivalent versions of the same file name, and optionally ensure +that no invalid names are held by the filesystem. charset encoding +awareness is also essential for performing case-insensitive lookups, +because it is what defines the casefold operation. + +The case-insensitive file name lookup feature is supported in a smaller +granularity, on a per-directory basis, allowing the user to mix +case-insensitive and case-sensitive directories in the same filesystem. +It is enabled by flipping a file attribute on an empty directory. For +the reason stated above, the filesystem must have encoding enabled to +use this feature. + +When we change from filenames as opaque byte sequences to seeing them as +encoded strings we need to address what happens when a program tries to +create a file with an invalid name. The Natural Language System within +the kernel leaves the decision of what to do in this case to the +filesystem, which select its preferred behavior by enabling/disabling +the strict mode in NLS. When Ext4 encounters one of those strings, it +falls back to considering the entire string as an opaque byte sequence, +which still allows the user to operate on that file but the +case-insensitive and equivalent sequence lookups won't work. + Options =======