diff mbox series

[03/13] xfs_scrub: add a couple of omitted invisible code points

Message ID 172229847576.1348850.17804705325546372553.stgit@frogsfrogsfrogs (mailing list archive)
State Accepted, archived
Series [01/13] xfs_scrub: use proper UChar string iterators

Commit Message

Darrick J. Wong July 30, 2024, 1:06 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

I missed a few non-rendering code points in the "zero width"
classification code.  Add them now, and sort the list.  Finding them is
an annoyingly manual process because there are various code points that
are not supposed to affect the rendering of a string of text but are not
explicitly named as such.  There are other code points that, when
surrounded by code points from the same chart, actually /do/ affect the
rendering.

In other words, the only way to figure this out is to grep for the likely
code points and then work out how each of them renders by reading the
Unicode spec or trying it.

$ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
$ grep -E '(separator|zero width|invisible|joiner|application)' -i UnicodeData.txt

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 scrub/unicrash.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
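
For readers who want to try the classification outside of xfs_scrub, here is
a minimal standalone sketch of the same check. It is not the code in
scrub/unicrash.c: the helper names is_zero_width() and count_zero_width() are
hypothetical, and it assumes the name has already been decoded into an array
of 32-bit code points instead of being walked with the ICU iterator that
name_entry_examine() uses. The code point list is copied from the patch below.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of xfs_scrub): return true if the code
 * point is one of those the patch flags as UNICRASH_ZERO_WIDTH.
 */
static bool is_zero_width(uint32_t cp)
{
	switch (cp) {
	case 0x034F:	/* combining grapheme joiner */
	case 0x200B:	/* zero width space */
	case 0x200C:	/* zero width non-joiner */
	case 0x200D:	/* zero width joiner */
	case 0x2028:	/* line separator */
	case 0x2029:	/* paragraph separator */
	case 0x2060:	/* word joiner */
	case 0x2061:	/* function application */
	case 0x2062:	/* invisible times (multiply) */
	case 0x2063:	/* invisible separator (comma) */
	case 0x2064:	/* invisible plus (addition) */
	case 0x2D7F:	/* tifinagh consonant joiner */
	case 0xFEFF:	/* zero width non breaking space */
		return true;
	default:
		return false;
	}
}

/*
 * Hypothetical helper: count the flagged code points in a name that has
 * already been decoded into 32-bit code points.
 */
static size_t count_zero_width(const uint32_t *cps, size_t nr)
{
	size_t count = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (is_zero_width(cps[i]))
			count++;
	}
	return count;
}
```

Keeping the cases in ascending order, as the patch does, makes it easier to
check new candidates from UnicodeData.txt against the list by eye.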
Patch

diff --git a/scrub/unicrash.c b/scrub/unicrash.c
index 96e20114c..edc32d55c 100644
--- a/scrub/unicrash.c
+++ b/scrub/unicrash.c
@@ -351,15 +351,19 @@  name_entry_examine(
 	while ((uchr = uiter_next32(&uiter)) != U_SENTINEL) {
 		/* zero width character sequences */
 		switch (uchr) {
+		case 0x034F:	/* combining grapheme joiner */
 		case 0x200B:	/* zero width space */
 		case 0x200C:	/* zero width non-joiner */
 		case 0x200D:	/* zero width joiner */
-		case 0xFEFF:	/* zero width non breaking space */
+		case 0x2028:	/* line separator */
+		case 0x2029:	/* paragraph separator */
 		case 0x2060:	/* word joiner */
 		case 0x2061:	/* function application */
 		case 0x2062:	/* invisible times (multiply) */
 		case 0x2063:	/* invisible separator (comma) */
 		case 0x2064:	/* invisible plus (addition) */
+		case 0x2D7F:	/* tifinagh consonant joiner */
+		case 0xFEFF:	/* zero width non breaking space */
 			*badflags |= UNICRASH_ZERO_WIDTH;
 			break;
 		}