[4/5] generic/45[34]: force UTF-8 codeset to enable utf-8 namer checks in xfs_scrub

Message ID 150836987529.27213.1335818370452284585.stgit@magnolia (mailing list archive)
State Accepted

Commit Message

Darrick J. Wong Oct. 18, 2017, 11:37 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

The upcoming xfs_scrub tool will have the ability to warn about
suspicious UTF-8 normalization collisions.  We want generic/45[34] to be
able to test this functionality, but to do that we have to forcibly set
the codeset to UTF-8 via LC_ALL since the rest of xfstests only uses
LC_ALL=C.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 tests/generic/453 |    2 +-
 tests/generic/454 |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
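
For reference, a minimal sketch of why a one-line change per test is
enough: a VAR=value prefix on a simple command exports the variable to
that command only, so the rest of the test keeps running under LC_ALL=C.
The 'sh -c' child is only a stand-in for ${XFS_SCRUB_PROG}, and the
variable names are made up for illustration.  C.UTF-8 does not need to be
installed for the environment variable itself to be passed through.

```shell
# The harness-wide setting, as in the rest of xfstests.
LC_ALL=C
export LC_ALL

# Prefix assignment: only the child process sees C.UTF-8.
# 'sh -c' is a stand-in for the real scrub invocation.
child_locale="$(LC_ALL="C.UTF-8" sh -c 'printf "%s" "$LC_ALL"')"

# The calling shell's value is untouched afterwards.
parent_locale="$LC_ALL"

echo "child saw ${child_locale}, parent still has ${parent_locale}"
```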



--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Christoph Hellwig Oct. 19, 2017, 7:18 a.m. UTC | #1
On Wed, Oct 18, 2017 at 04:37:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> The upcoming xfs_scrub tool will have the ability to warn about
> suspicious UTF-8 normalization collisions.  We want generic/45[34] to be
> able to test this functionality, but to do that we have to forcibly set
> the codeset to UTF-8 via LC_ALL since the rest of xfstests only uses
> LC_ALL=C.

Wait.  Where do you want to validate UTF-8 normalization?  There is
absolutely no guarantee that someone uses UTF-8, so any reliance on
the character set in the file system is bogus.

Darrick J. Wong Oct. 20, 2017, 5:56 p.m. UTC | #2
On Thu, Oct 19, 2017 at 12:18:42AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 18, 2017 at 04:37:55PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > The upcoming xfs_scrub tool will have the ability to warn about
> > suspicious UTF-8 normalization collisions.  We want generic/45[34] to be
> > able to test this functionality, but to do that we have to forcibly set
> > the codeset to UTF-8 via LC_ALL since the rest of xfstests only uses
> > LC_ALL=C.
> 
> Wait.  Where do you want to validate UTF-8 normalization?  There is
> absolutely no guarantee that someone uses UTF-8, so any reliance on
> the character set in the file system is bogus.

I'll start by summarizing a problem statement[1].  In XFS (and nearly
all the other filesystems), neither the on-disk format nor the kernel
driver cares about the contents of file names or attribute names; they
treat these as arbitrary byte sequences.  Userspace can set whatever
localization and encoding parameters it wants, and the kernel doesn't
care except for '\0' and '/'.  That doesn't change.

In modern Linux userspace, however, we /do/ care about being able to
encode Unicode codepoints into byte streams, so we encode them in UTF-8.
Because Unicode has more than one normalization form (NFC and NFD, among
others), this leads to the funny situation where two distinct filename
byte sequences can render the same but point to totally different files:

$ echo NFC > "$(echo -e "french_caf\xc3\xa9.txt")"
$ echo NFD > "$(echo -e "french_cafe\xcc\x81.txt")"
$ ls -lai
133 -rw-r--r-- 1 root root   4 Oct 20 10:40 french_café.txt
132 -rw-r--r-- 1 root root   4 Oct 20 10:40 french_café.txt
$ echo $LANG
en_US.UTF-8

At least on my computer, the two filenames render identically yet point
to different inodes.  This could be used to mislead people into opening
a malicious file whose name appears identical to a legitimate file.
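
A minimal sketch of the same point at the byte level, assuming a POSIX
shell with printf, od and wc available; the variable names are made up
for illustration.  Octal escapes are used because '\x' escapes in printf
are not portable across shells.

```shell
# Illustrative only: variable names are hypothetical.
# NFC: "caf" followed by precomposed U+00E9 (bytes c3 a9).
nfc="$(printf 'caf\303\251')"
# NFD: "cafe" followed by combining acute accent U+0301 (bytes cc 81).
nfd="$(printf 'cafe\314\201')"

# Count bytes, not characters; arithmetic expansion strips wc's padding.
nfc_bytes=$(( $(printf '%s' "$nfc" | wc -c) ))
nfd_bytes=$(( $(printf '%s' "$nfd" | wc -c) ))

# The kernel treats names as raw bytes, so these are two distinct names.
if [ "$nfc" = "$nfd" ]; then
	verdict="same"
else
	verdict="different"
fi
echo "nfc=${nfc_bytes} bytes, nfd=${nfd_bytes} bytes, names are ${verdict}"

# Hex dump of each name for inspection.
printf '%s' "$nfc" | od -An -tx1
printf '%s' "$nfd" | od -An -tx1
```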

xfs_scrub is the (proposed) userspace component of XFS online fsck.  The
first four phases simply call the in-kernel fsck code and pass status
back, but the fifth phase walks the directory tree looking for problems.

If xfs_scrub was built with libunistring and the LC_MESSAGES string
contains "UTF-8", phase 5 will warn if it finds multiple filenames in a
directory that normalize to the same string but point to different
inodes.  Similarly, it will warn about colliding attribute names.
Warnings in xfs_scrub are for situations that warrant administrative
review but are not filesystem corruptions.
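
As a rough sketch of the kind of locale test described here (this is not
the actual xfs_scrub code, which is C), the check can be approximated in
shell.  The helper name messages_locale_is_utf8 is hypothetical; the
LC_ALL, then LC_MESSAGES, then LANG precedence is the standard POSIX
lookup order for the messages category.

```shell
# Hypothetical helper, not part of xfs_scrub or xfstests.
# Resolve the effective LC_MESSAGES value using POSIX precedence
# (LC_ALL overrides LC_MESSAGES, which overrides LANG), then look
# for "UTF-8" in the resulting locale name.
messages_locale_is_utf8() {
	case "${LC_ALL:-${LC_MESSAGES:-${LANG:-C}}}" in
	*UTF-8*)	return 0 ;;
	*)		return 1 ;;
	esac
}

if messages_locale_is_utf8; then
	echo "UTF-8 locale: name collision checks would be enabled"
else
	echo "non-UTF-8 locale: name collision checks would be skipped"
fi
```

Under LC_ALL=C the helper returns nonzero even when LANG names a UTF-8
locale, which is why the override has to be applied through LC_ALL.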

IOWs, if userspace is configured for UTF-8, the userspace part of online
fsck will flag suspicious-looking uses of Unicode for admin review.  The
kernel remains uninvolved.

--D

[1] https://eclecticlight.co/2017/04/06/apfs-is-currently-unusable-with-most-non-english-languages/


Patch

diff --git a/tests/generic/453 b/tests/generic/453
index 40fae91..16589d1 100755
--- a/tests/generic/453
+++ b/tests/generic/453
@@ -148,7 +148,7 @@  check_xfs_scrub() {
 }
 
 if check_xfs_scrub; then
-	output="$(${XFS_SCRUB_PROG} -n "${SCRATCH_MNT}" 2>&1 | filter_scrub)"
+	output="$(LC_ALL="C.UTF-8" ${XFS_SCRUB_PROG} -n "${SCRATCH_MNT}" 2>&1 | filter_scrub)"
 	echo "${output}" | grep -q "french_" || echo "No complaints about french e accent?"
 	echo "${output}" | grep -q "chinese_" || echo "No complaints about chinese width-different?"
 	echo "${output}" | grep -q "greek_" || echo "No complaints about greek letter mess?"
diff --git a/tests/generic/454 b/tests/generic/454
index 462185a..efac860 100755
--- a/tests/generic/454
+++ b/tests/generic/454
@@ -144,7 +144,7 @@  check_xfs_scrub() {
 }
 
 if check_xfs_scrub; then
-	output="$(${XFS_SCRUB_PROG} -n "${SCRATCH_MNT}" 2>&1 | filter_scrub)"
+	output="$(LC_ALL="C.UTF-8" ${XFS_SCRUB_PROG} -n "${SCRATCH_MNT}" 2>&1 | filter_scrub)"
 	echo "${output}" | grep -q "french_" || echo "No complaints about french e accent?"
 	echo "${output}" | grep -q "chinese_" || echo "No complaints about chinese width-different?"
 	echo "${output}" | grep -q "greek_" || echo "No complaints about greek letter mess?"