From patchwork Wed Mar 21 03:40:29 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Darrick J. Wong"
X-Patchwork-Id: 10298503
Return-Path:
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 2319960349
	for ; Wed, 21 Mar 2018 03:40:37 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 113FB295F1
	for ; Wed, 21 Mar 2018 03:40:37 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id 059CC29621; Wed, 21 Mar 2018 03:40:37 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00,DKIM_SIGNED,
	RCVD_IN_DNSWL_HI, T_DKIM_INVALID, UNPARSEABLE_RELAY autolearn=ham
	version=3.3.1
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 22375295F1
	for ; Wed, 21 Mar 2018 03:40:36 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751938AbeCUDkf (ORCPT ); Tue, 20 Mar 2018 23:40:35 -0400
Received: from aserp2130.oracle.com ([141.146.126.79]:41150 "EHLO
	aserp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751828AbeCUDkf (ORCPT );
	Tue, 20 Mar 2018 23:40:35 -0400
Received: from pps.filterd (aserp2130.oracle.com [127.0.0.1])
	by aserp2130.oracle.com (8.16.0.22/8.16.0.22) with SMTP id w2L3WT6F027590;
	Wed, 21 Mar 2018 03:40:32 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=oracle.com;
	h=subject : from : to : cc : date : message-id : in-reply-to :
	references : mime-version : content-type : content-transfer-encoding;
	s=corp-2017-10-26; bh=dF1pDDc1BekXbOMEHVpmwMyyqeuAHsgAHHUiHuhK4+g=;
	b=S3Iss9W65bABJA4XWEiSHQPu5mvy5U8/h1nUB0c0owdBBe7XMefKudavx3Pt6f+NIus+
	F8Pu241zrWzKCacnwMVxxbKAPb0YdXxUUkE3lB4NTHJyp6Jovi4PgvbpCSTGDrLKFzxS
	FT4z+ut3xLruY4aIA6eDQqHhlN8pD59ir9JY15Xcc7OEXRMVFUgb9NagaWcfKwOHfbQL
	J17o30v2OnKaG2uNrncx2vwXZol3HmNTc+DaoDTcG1utEOtUzoTpBwIAPKWYsl00M4Fv
	oS2C8+z26aSd35ZOa9qs112yadfy0EQYtKVFkol3pO3T9yGYbUf5RNCPhtmEkLnCPsYJ
	Cg==
Received: from userv0021.oracle.com (userv0021.oracle.com [156.151.31.71])
	by aserp2130.oracle.com with ESMTP id 2gufjqr0hn-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Wed, 21 Mar 2018 03:40:32 +0000
Received: from aserv0122.oracle.com (aserv0122.oracle.com [141.146.126.236])
	by userv0021.oracle.com (8.14.4/8.14.4) with ESMTP id w2L3eVuf024656
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK);
	Wed, 21 Mar 2018 03:40:31 GMT
Received: from abhmp0008.oracle.com (abhmp0008.oracle.com [141.146.116.14])
	by aserv0122.oracle.com (8.14.4/8.14.4) with ESMTP id w2L3eUK2021550;
	Wed, 21 Mar 2018 03:40:30 GMT
Received: from localhost (/10.159.242.221)
	by default (Oracle Beehive Gateway v4.0)
	with ESMTP ; Tue, 20 Mar 2018 20:40:30 -0700
Subject: [PATCH 08/14] xfs_scrub: use Unicode skeleton function to find
	confusing names
From: "Darrick J. Wong"
To: sandeen@redhat.com, darrick.wong@oracle.com
Cc: linux-xfs@vger.kernel.org
Date: Tue, 20 Mar 2018 20:40:29 -0700
Message-ID: <152160362958.8288.5732447522762104227.stgit@magnolia>
In-Reply-To: <152160358015.8288.2700156777231657519.stgit@magnolia>
References: <152160358015.8288.2700156777231657519.stgit@magnolia>
User-Agent: StGit/0.17.1-dirty
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=nai engine=5900 definitions=8838
	signatures=668695
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=2
	malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0
	mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx
	scancount=1 engine=8.0.1-1711220000 definitions=main-1803200127
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-xfs@vger.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

From: Darrick J. Wong

Drop the weak normalization-based Unicode name collision detection in
favor of the confusable name guidelines provided in Unicode TR36 & TR39.
This means that we transform the original name into its Unicode skeleton
in order to do hashing-based collision detection.

The Unicode skeleton is defined as nfd(translation(nfd(string))), which
is to say that it flattens sequences that render ambiguously into an
unambiguous format. For example, 'l' and '1' can render identically in
some typefaces, so they're both squashed to 'l'. From the skeletons we
can figure out if two names will look the same, and thereby complain
about them. The Unicode spoof detection is provided by libicu, hence
the switch away from libunistring.

Note that potentially confusable names are only worth an informational
message, since it's entirely possible that with the system typefaces in
use, two names will render distinctly enough that users can tell the
difference.

Signed-off-by: Darrick J. Wong
---
 scrub/unicrash.c |  147 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 120 insertions(+), 27 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

diff --git a/scrub/unicrash.c b/scrub/unicrash.c
index 3b5b46e..f35b7bd 100644
--- a/scrub/unicrash.c
+++ b/scrub/unicrash.c
@@ -26,38 +26,45 @@
 #include
 #include
 #include
+#include
 #include "path.h"
 #include "xfs_scrub.h"
 #include "common.h"
 
 /*
- * Detect collisions of Unicode-normalized names.
+ * Detect Unicode confusable names in directories and attributes.
  *
- * Record all the name->ino mappings in a directory/xattr, with a twist!
- * The twist is that we perform unicode normalization on every name we
- * see, so that we can warn about a directory containing more than one
- * directory entries that normalize to the same Unicode string. These
- * entries are at best a sign of Unicode mishandling, or some sort of
- * weird name substitution attack if the entries do not point to the
- * same inode. Warn if we see multiple dirents that do not all point to
- * the same inode.
+ * Record all the name->ino mappings in a directory/xattr, with a twist! The
+ * twist is to record the Unicode skeleton and normalized version of every
+ * name we see so that we can check for a name space (directory, extended
+ * attribute set) containing names containing malicious characters or that
+ * could be confused for one another. These entries are at best a sign of
+ * Unicode mishandling, or some sort of weird name substitution attack if the
+ * entries do not point to the same inode. Warn if we see multiple dirents
+ * that do not all point to the same inode.
  *
  * For extended attributes we perform the same collision checks on the
  * attribute, though any collision is enough to trigger a warning.
  *
- * We flag these collisions as warnings and not errors because XFS
- * treats names as a sequence of arbitrary nonzero bytes. While a
- * Unicode collision is not technically a filesystem corruption, we
- * ought to say something if there's a possibility for misleading a
- * user.
+ * We avoid flagging these problems as errors because XFS treats names as a
+ * sequence of arbitrary nonzero bytes. While a Unicode collision is not
+ * technically a filesystem corruption, we ought to say something if there's a
+ * possibility for misleading a user. Unquestionably bad things (direction
+ * overrides, control characters, names that normalize to the same string)
+ * produce warnings, whereas potentially confusable names produce
+ * informational messages.
  *
- * To normalize, we use Unicode NFKC. We use the composing
- * normalization mode (e.g. "E WITH ACUTE" instead of "E" then "ACUTE")
- * because that's what W3C (and in general Linux) uses. This enables us
- * to detect multiple object names that normalize to the same name and
- * could be confusing to users. Furthermore, we use the compatibility
- * mode to detect names with compatible but different code points to
- * strengthen those checks.
+ * The skeleton algorithm is detailed in section 4 ("Confusable Detection") of
+ * the Unicode technical standard #39. First we normalize the name, then we
+ * substitute code points according to the confusable code point table, then
+ * normalize again.
+ *
+ * We take the extra step of removing non-identifier code points such as
+ * formatting characters, control characters, zero width characters, etc.
+ * from the skeleton so that we can complain about names that are confusable
+ * due to invisible control characters.
+ *
+ * In other words, skel = remove_invisible(nfd(remap_confusables(nfd(name)))).
  */
 
 struct name_entry {
@@ -67,6 +74,10 @@ struct name_entry {
 	UChar			*normstr;
 	size_t			normstrlen;
 
+	/* Unicode skeletonized name */
+	UChar			*skelstr;
+	size_t			skelstrlen;
+
 	xfs_ino_t		ino;
 
 	/* Raw UTF8 name */
@@ -78,6 +89,7 @@ struct name_entry {
 
 struct unicrash {
 	struct scrub_ctx	*ctx;
+	USpoofChecker		*spoof;
 	const UNormalizer2	*normalizer;
 	bool			compare_ino;
 	size_t			nr_buckets;
@@ -106,6 +118,9 @@ struct unicrash {
 /* Invisible characters. Only a problem if we have collisions. */
 #define UNICRASH_ZERO_WIDTH	(1 << 4)
 
+/* Multiple names resolve to the same skeleton string. */
+#define UNICRASH_CONFUSABLE	(1 << 5)
+
 /*
  * We only care about validating utf8 collisions if the underlying
  * system configuration says we're using utf8. If the language
@@ -141,7 +156,7 @@ is_utf8_locale(void)
 }
 
 /*
- * Generate normalized form of the name.
+ * Generate normalized form and skeleton of the name.
  * If this fails, just forget everything; this is an advisory checker.
  */
 static bool
@@ -151,8 +166,13 @@ name_entry_compute_checknames(
 {
 	UChar		*normstr;
 	UChar		*unistr;
+	UChar		*skelstr;
 	int32_t		normstrlen;
 	int32_t		unistrlen;
+	int32_t		skelstrlen;
+	UChar32		uchr;
+	int32_t		i, j;
+
 	UErrorCode	uerr = U_ZERO_ERROR;
 
 	/* Convert bytestr to unistr for normalization */
@@ -182,11 +202,40 @@ name_entry_compute_checknames(
 	if (U_FAILURE(uerr))
 		goto out_normstr;
 
+	/* Compute skeleton. */
+	skelstrlen = uspoof_getSkeleton(uc->spoof, 0, unistr, unistrlen, NULL,
+			0, &uerr);
+	if (uerr != U_BUFFER_OVERFLOW_ERROR)
+		goto out_normstr;
+	uerr = U_ZERO_ERROR;
+	skelstr = calloc(skelstrlen + 1, sizeof(UChar));
+	if (!skelstr)
+		goto out_normstr;
+	uspoof_getSkeleton(uc->spoof, 0, unistr, unistrlen, skelstr, skelstrlen,
+			&uerr);
+	if (U_FAILURE(uerr))
+		goto out_skelstr;
+
+	/* Remove control/formatting characters from skeleton. */
+	for (i = 0, j = 0; i < skelstrlen; j = i) {
+		U16_NEXT_UNSAFE(skelstr, i, uchr);
+		if (!u_isIDIgnorable(uchr))
+			continue;
+		memmove(&skelstr[j], &skelstr[i],
+				(skelstrlen - i + 1) * sizeof(UChar));
+		skelstrlen -= (i - j);
+		i = j;
+	}
+
+	entry->skelstr = skelstr;
+	entry->skelstrlen = skelstrlen;
 	entry->normstr = normstr;
 	entry->normstrlen = normstrlen;
 	free(unistr);
 	return true;
 
+out_skelstr:
+	free(skelstr);
 out_normstr:
 	free(normstr);
 out_unistr:
@@ -215,7 +264,7 @@ name_entry_create(
 	new_entry->name[namelen] = 0;
 	new_entry->namelen = namelen;
 
-	/* Normalize name to find collisions. */
+	/* Normalize/skeletonize name to find collisions. */
 	if (!name_entry_compute_checknames(uc, new_entry))
 		goto out;
 
@@ -233,6 +282,7 @@ name_entry_free(
 	struct name_entry	*entry)
 {
 	free(entry->normstr);
+	free(entry->skelstr);
 	free(entry);
 }
 
@@ -253,8 +303,8 @@ name_entry_hash(
 	size_t			namelen;
 	xfs_dahash_t		hash;
 
-	name = (uint8_t *)entry->normstr;
-	namelen = entry->normstrlen * sizeof(UChar);
+	name = (uint8_t *)entry->skelstr;
+	namelen = entry->skelstrlen * sizeof(UChar);
 
 	/*
 	 * Do four characters at a time as long as we can.
@@ -369,9 +419,17 @@ unicrash_init(
 	p->normalizer = unorm2_getNFKCInstance(&uerr);
 	if (U_FAILURE(uerr))
 		goto out_free;
+	p->spoof = uspoof_open(&uerr);
+	if (U_FAILURE(uerr))
+		goto out_free;
+	uspoof_setChecks(p->spoof, USPOOF_ALL_CHECKS, &uerr);
+	if (U_FAILURE(uerr))
+		goto out_spoof;
 	*ucp = p;
 
 	return true;
+out_spoof:
+	uspoof_close(p->spoof);
 out_free:
 	free(p);
 	return false;
@@ -414,6 +472,7 @@ unicrash_free(
 	if (!uc)
 		return;
 
+	uspoof_close(uc->spoof);
 	for (i = 0; i < uc->nr_buckets; i++) {
 		for (ne = uc->buckets[i]; ne != NULL; ne = x) {
 			x = ne->next;
@@ -466,6 +525,19 @@ _("Unicode name \"%s\" in %s renders identically to \"%s\"."),
 	}
 
 	/*
+	 * If a name contains invisible/nonprinting characters and can be
+	 * confused with another name as a result, we should complain.
+	 * "moocow" and "moocow" are misleading.
+	 */
+	if ((badflags & UNICRASH_ZERO_WIDTH) &&
+	    (badflags & UNICRASH_CONFUSABLE)) {
+		str_warn(uc->ctx, descr,
+_("Unicode name \"%s\" in %s could be confused with '%s' due to invisible characters."),
+				bad1, what, bad2);
+		goto out;
+	}
+
+	/*
 	 * Unfiltered control characters can mess up your terminal and render
 	 * invisibly in filechooser UIs.
 	 */
@@ -489,6 +561,18 @@ _("Unicode name \"%s\" in %s mixes bidirectional characters."),
 		goto out;
 	}
 
+	/*
+	 * We'll note if two names could be confusable with each other, but
+	 * whether or not the user will actually confuse them is dependent
+	 * on the rendering system and the typefaces in use. Maybe "foo.1"
+	 * and "foo.l" look the same, maybe they do not.
+	 */
+	if (badflags & UNICRASH_CONFUSABLE) {
+		str_info(uc->ctx, descr,
+_("Unicode name \"%s\" in %s could be confused with \"%s\"."),
+				bad1, what, bad2);
+	}
+
 out:
 	free(bad1);
 	free(bad2);
@@ -496,8 +580,8 @@ _("Unicode name \"%s\" in %s mixes bidirectional characters."),
 
 /*
  * Try to add a name -> ino entry to the collision detector. The name
- * must be normalized according to Unicode NFKC rules to detect names that
- * could be confused with each other.
+ * must be skeletonized according to Unicode TR39 to detect names that
+ * could be visually confused with each other.
  */
 static bool
 unicrash_add(
@@ -526,6 +610,15 @@ unicrash_add(
 			*existing_entry = entry;
 			return true;
 		}
+
+		/* Confusable? */
+		if (new_entry->skelstrlen == entry->skelstrlen &&
+		    !u_strcmp(new_entry->skelstr, entry->skelstr) &&
+		    (uc->compare_ino ? entry->ino != new_entry->ino : true)) {
+			*badflags |= UNICRASH_CONFUSABLE;
+			*existing_entry = entry;
+			return true;
+		}
 		entry = entry->next;
 	}
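
To make the skeleton comparison in unicrash_add() concrete, here is a minimal, self-contained sketch of the idea. It is not the patch's code and does not use libicu: toy_remap() is a hypothetical three-entry stand-in for the TR39 confusables table, toy_is_ignorable() is a hypothetical stand-in for u_isIDIgnorable(), and names are assumed to be plain ASCII so that the NFD steps are no-ops. None of these helpers exist in xfs_scrub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for u_isIDIgnorable(): treat ASCII controls as invisible. */
static bool toy_is_ignorable(unsigned char c)
{
	return c < 0x20 || c == 0x7f;
}

/* Hypothetical three-entry confusables table; the real data comes from TR39. */
static char toy_remap(char c)
{
	switch (c) {
	case '1':
	case 'I':
		return 'l';
	case '0':
		return 'O';
	default:
		return c;
	}
}

/*
 * Write the toy skeleton of name into out, which holds outlen bytes
 * including the terminating NUL.  Returns false if out is too small.
 */
static bool toy_skeleton(const char *name, char *out, size_t outlen)
{
	size_t j = 0;

	assert(name != NULL && out != NULL);
	for (size_t i = 0; name[i] != '\0'; i++) {
		/* Drop invisible bytes, as the patch does for the skeleton. */
		if (toy_is_ignorable((unsigned char)name[i]))
			continue;
		if (j + 1 >= outlen)
			return false;
		out[j++] = toy_remap(name[i]);
	}
	if (j >= outlen)
		return false;
	out[j] = '\0';
	return true;
}

/* Two distinct names are confusable when their skeletons are identical. */
static bool toy_confusable(const char *a, const char *b)
{
	char skel_a[256], skel_b[256];

	if (strcmp(a, b) == 0)
		return false;	/* same name, nothing to confuse */
	if (!toy_skeleton(a, skel_a, sizeof(skel_a)) ||
	    !toy_skeleton(b, skel_b, sizeof(skel_b)))
		return false;	/* advisory check: give up on oversized names */
	return strcmp(skel_a, skel_b) == 0;
}
```

With these helpers, toy_confusable("foo.1", "foo.l") returns true because both names reduce to the skeleton "foo.l", while toy_confusable("foo", "bar") returns false. A name with an embedded control byte, such as "moo\001cow", is also reported as confusable with "moocow", which mirrors the UNICRASH_ZERO_WIDTH plus UNICRASH_CONFUSABLE case in the patch.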