diff mbox series

[03/13] xfs_scrub: add a couple of omitted invisible code points

Message ID 171988117657.2007123.5376979485947307326.stgit@frogsfrogsfrogs (mailing list archive)
State New
Headers show
Series [01/13] xfs_scrub: use proper UChar string iterators | expand

Commit Message

Darrick J. Wong July 2, 2024, 12:58 a.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

I missed a few non-rendering code points in the "zero width"
classification code.  Add them now, and sort the list.

$ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
$ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/unicrash.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig July 2, 2024, 5:22 a.m. UTC | #1
On Mon, Jul 01, 2024 at 05:58:10PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> I missed a few non-rendering code points in the "zero width"
> classification code.  Add them now, and sort the list.
> 
> $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt

Should this be automated?

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong July 3, 2024, 1:59 a.m. UTC | #2
On Tue, Jul 02, 2024 at 07:22:25AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 01, 2024 at 05:58:10PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > I missed a few non-rendering code points in the "zero width"
> > classification code.  Add them now, and sort the list.
> > 
> > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt
> 
> Should this be automated?

That will require a bit more thought -- many distro build systems these
days operate in a sealed box with no network access, so you can't really
automate this.  libicu (the last time I looked) didn't have a predicate
to tell you if a particular code point was one of the invisible ones.

Here's what I get from running the commands right now:

009F;<control>;Cc;0;BN;;;;;N;APPLICATION PROGRAM COMMAND;;;;
034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;
2060;WORD JOINER;Cf;0;BN;;;;;N;;;;;
2061;FUNCTION APPLICATION;Cf;0;BN;;;;;N;;;;;
2062;INVISIBLE TIMES;Cf;0;BN;;;;;N;;;;;
2063;INVISIBLE SEPARATOR;Cf;0;BN;;;;;N;;;;;
2064;INVISIBLE PLUS;Cf;0;BN;;;;;N;;;;;
2D7F;TIFINAGH CONSONANT JOINER;Mn;9;NSM;;;;;N;;;;;
FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;;
1107F;BRAHMI NUMBER JOINER;Mn;9;NSM;;;;;N;;;;;
11A47;ZANABAZAR SQUARE SUBJOINER;Mn;9;NSM;;;;;N;;;;;
11A99;SOYOMBO SUBJOINER;Mn;9;NSM;;;;;N;;;;;
11F42;KAWI CONJOINER;Mn;9;NSM;;;;;N;;;;;
13430;EGYPTIAN HIEROGLYPH VERTICAL JOINER;Cf;0;L;;;;;N;;;;;
13431;EGYPTIAN HIEROGLYPH HORIZONTAL JOINER;Cf;0;L;;;;;N;;;;;

The five-digit ones are new to me; I'll go have a look at how the
noto(fu) fonts render those.

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

--D
Christoph Hellwig July 3, 2024, 4:27 a.m. UTC | #3
On Tue, Jul 02, 2024 at 06:59:56PM -0700, Darrick J. Wong wrote:
> > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> > > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt
> > 
> > Should this be automated?
> 
> That will require a bit more thought -- many distro build systems these
> days operate in a sealed box with no network access, so you can't really
> automate this.  libicu (the last time I looked) didn't have a predicate
> to tell you if a particular code point was one of the invisible ones.

Oh, I absolutely do not suggest to run the wget from a normal build!

But if you look at the kernel unicode CI support, it allows you to
place the downloaded file into the kernel tree, and then case make file
rules to re-generate the tables from it (see fs/unicode/Makefile).
Darrick J. Wong July 3, 2024, 5:02 a.m. UTC | #4
On Wed, Jul 03, 2024 at 06:27:32AM +0200, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 06:59:56PM -0700, Darrick J. Wong wrote:
> > > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> > > > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt
> > > 
> > > Should this be automated?
> > 
> > That will require a bit more thought -- many distro build systems these
> > days operate in a sealed box with no network access, so you can't really
> > automate this.  libicu (the last time I looked) didn't have a predicate
> > to tell you if a particular code point was one of the invisible ones.
> 
> Oh, I absolutely do not suggest to run the wget from a normal build!
> 
> But if you look at the kernel unicode CI support, it allows you to
> place the downloaded file into the kernel tree, and then case make file
> rules to re-generate the tables from it (see fs/unicode/Makefile).

Ah ok got it, will take a closer look tomorrow.

--D
Darrick J. Wong July 3, 2024, 6:35 p.m. UTC | #5
On Tue, Jul 02, 2024 at 10:02:19PM -0700, Darrick J. Wong wrote:
> On Wed, Jul 03, 2024 at 06:27:32AM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 02, 2024 at 06:59:56PM -0700, Darrick J. Wong wrote:
> > > > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
> > > > > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt
> > > > 
> > > > Should this be automated?
> > > 
> > > That will require a bit more thought -- many distro build systems these
> > > days operate in a sealed box with no network access, so you can't really
> > > automate this.  libicu (the last time I looked) didn't have a predicate
> > > to tell you if a particular code point was one of the invisible ones.
> > 
> > Oh, I absolutely do not suggest to run the wget from a normal build!
> > 
> > But if you look at the kernel unicode CI support, it allows you to
> > place the downloaded file into the kernel tree, and then case make file
> > rules to re-generate the tables from it (see fs/unicode/Makefile).
> 
> Ah ok got it, will take a closer look tomorrow.

I'm not sure if there's a good way to automate this.  But let's go
through the grep output?

009F;<control>;Cc;0;BN;;;;;N;APPLICATION PROGRAM COMMAND;;;;

This one seems to render as Ÿ (Y with two dots above it) so I didn't put
this in the case statement.

034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;;

Doesn't seem to have any sort of rendering (i.e. if I type them into a
web browser, it doesn't render any glyph at all), so I put it in the
case statement.

200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;;
200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;

These ones have 'zero-width' in the name, so I put these in the
denylist.

2060;WORD JOINER;Cf;0;BN;;;;;N;;;;;

https://www.unicode.org/reports/tr14/ says that this doesn't have
any rendering, nor does my web browser try one.

Come to think of it, I should probably add these two:

2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;

because it doesn't render either.

2061;FUNCTION APPLICATION;Cf;0;BN;;;;;N;;;;;

https://www.unicode.org/reports/tr25/tr25-6.html says this is a
metacharacter and doesn't have its own rendering.  It seems to be there
to indicate that this:

f(x + y)

is a function taking "x + y" as a parameter, and not shorthand for:

f*x + f+y

2062;INVISIBLE TIMES;Cf;0;BN;;;;;N;;;;;
2063;INVISIBLE SEPARATOR;Cf;0;BN;;;;;N;;;;;
2064;INVISIBLE PLUS;Cf;0;BN;;;;;N;;;;;

These have 'invisible' in the name, that seems pretty obvious to me.

2D7F;TIFINAGH CONSONANT JOINER;Mn;9;NSM;;;;;N;;;;;

This one doesn't seem to render either.  Unicode 15.1 says it isn't
visible rendered.

FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;;

Another 'zero-width' one, same as the others.

1107F;BRAHMI NUMBER JOINER;Mn;9;NSM;;;;;N;;;;;
11A47;ZANABAZAR SQUARE SUBJOINER;Mn;9;NSM;;;;;N;;;;;
11A99;SOYOMBO SUBJOINER;Mn;9;NSM;;;;;N;;;;;
13430;EGYPTIAN HIEROGLYPH VERTICAL JOINER;Cf;0;L;;;;;N;;;;;
13431;EGYPTIAN HIEROGLYPH HORIZONTAL JOINER;Cf;0;L;;;;;N;;;;;

When I typed these into a browser, I got *some* kind of rendering.  I
doubt I'm using them correctly, since they're likely supposed to be
surrounded by other codepoints from the same chart.

11F42;KAWI CONJOINER;Mn;9;NSM;;;;;N;;;;;

I don't even know about this one; it generated "tofu" (aka those horrid
rectangular boxes with the byte representation in them).  On the other
hand, https://www.unicode.org/notes/tn48/UTN48-Implementing-Kawi-1.pdf
says it's not visible by itself.  But paired with Kawi script, it's
supposed to affect the formatting.

I'm not sure how exactly to write a classifier here -- the 'invisible'
and 'zero width' ones are obvious, but the 'joiner' code points don't
seem to have any obvious trend to them.

For now I think I'll take the "conservative" approach and only flag
things that sound like they're supposed to be general metacharacters,
and leave out the modifier codepoints that are ok if they're surrounded
by certain codepoints.  But this is a rather manual process.

--D
Christoph Hellwig July 4, 2024, 7:22 a.m. UTC | #6
On Wed, Jul 03, 2024 at 11:35:10AM -0700, Darrick J. Wong wrote:
> I'm not sure how exactly to write a classifier here -- the 'invisible'
> and 'zero width' ones are obvious, but the 'joiner' code points don't
> seem to have any obvious trend to them.
> 
> For now I think I'll take the "conservative" approach and only flag
> things that sound like they're supposed to be general metacharacters,
> and leave out the modifier codepoints that are ok if they're surrounded
> by certain codepoints.  But this is a rather manual process.

Oh, right.  There is no clear identification and you are just doing
a manual search based on viѕual output.  Yes, there's unfortunately
no really good way to automate that.
diff mbox series

Patch

diff --git a/scrub/unicrash.c b/scrub/unicrash.c
index 96e20114c484..fc1adb2caab7 100644
--- a/scrub/unicrash.c
+++ b/scrub/unicrash.c
@@ -351,15 +351,17 @@  name_entry_examine(
 	while ((uchr = uiter_next32(&uiter)) != U_SENTINEL) {
 		/* zero width character sequences */
 		switch (uchr) {
+		case 0x034F:	/* combining grapheme joiner */
 		case 0x200B:	/* zero width space */
 		case 0x200C:	/* zero width non-joiner */
 		case 0x200D:	/* zero width joiner */
-		case 0xFEFF:	/* zero width non breaking space */
 		case 0x2060:	/* word joiner */
 		case 0x2061:	/* function application */
 		case 0x2062:	/* invisible times (multiply) */
 		case 0x2063:	/* invisible separator (comma) */
 		case 0x2064:	/* invisible plus (addition) */
+		case 0x2D7F:	/* tifinagh consonant joiner */
+		case 0xFEFF:	/* zero width non breaking space */
 			*badflags |= UNICRASH_ZERO_WIDTH;
 			break;
 		}