Message ID | 171988117657.2007123.5376979485947307326.stgit@frogsfrogsfrogs (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [01/13] xfs_scrub: use proper UChar string iterators | expand |
On Mon, Jul 01, 2024 at 05:58:10PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > I missed a few non-rendering code points in the "zero width" > classification code. Add them now, and sort the list. > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt Should this be automated? Otherwise looks good: Reviewed-by: Christoph Hellwig <hch@lst.de>
On Tue, Jul 02, 2024 at 07:22:25AM +0200, Christoph Hellwig wrote: > On Mon, Jul 01, 2024 at 05:58:10PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > I missed a few non-rendering code points in the "zero width" > > classification code. Add them now, and sort the list. > > > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt > > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt > > Should this be automated? That will require a bit more thought -- many distro build systems these days operate in a sealed box with no network access, so you can't really automate this. libicu (the last time I looked) didn't have a predicate to tell you if a particular code point was one of the invisible ones. Here's what I get from running the commands right now: 009F;<control>;Cc;0;BN;;;;;N;APPLICATION PROGRAM COMMAND;;;; 034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;; 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;; 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;; 200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;; 2060;WORD JOINER;Cf;0;BN;;;;;N;;;;; 2061;FUNCTION APPLICATION;Cf;0;BN;;;;;N;;;;; 2062;INVISIBLE TIMES;Cf;0;BN;;;;;N;;;;; 2063;INVISIBLE SEPARATOR;Cf;0;BN;;;;;N;;;;; 2064;INVISIBLE PLUS;Cf;0;BN;;;;;N;;;;; 2D7F;TIFINAGH CONSONANT JOINER;Mn;9;NSM;;;;;N;;;;; FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;; 1107F;BRAHMI NUMBER JOINER;Mn;9;NSM;;;;;N;;;;; 11A47;ZANABAZAR SQUARE SUBJOINER;Mn;9;NSM;;;;;N;;;;; 11A99;SOYOMBO SUBJOINER;Mn;9;NSM;;;;;N;;;;; 11F42;KAWI CONJOINER;Mn;9;NSM;;;;;N;;;;; 13430;EGYPTIAN HIEROGLYPH VERTICAL JOINER;Cf;0;L;;;;;N;;;;; 13431;EGYPTIAN HIEROGLYPH HORIZONTAL JOINER;Cf;0;L;;;;;N;;;;; The five-digit ones are new to me; I'll go have a look at how the noto(fu) fonts render those. > Otherwise looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> Thanks! --D
On Tue, Jul 02, 2024 at 06:59:56PM -0700, Darrick J. Wong wrote: > > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt > > > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt > > > > Should this be automated? > > That will require a bit more thought -- many distro build systems these > days operate in a sealed box with no network access, so you can't really > automate this. libicu (the last time I looked) didn't have a predicate > to tell you if a particular code point was one of the invisible ones. Oh, I absolutely do not suggest to run the wget from a normal build! But if you look at the kernel unicode CI support, it allows you to place the downloaded file into the kernel tree, and then case make file rules to re-generate the tables from it (see fs/unicode/Makefile).
On Wed, Jul 03, 2024 at 06:27:32AM +0200, Christoph Hellwig wrote: > On Tue, Jul 02, 2024 at 06:59:56PM -0700, Darrick J. Wong wrote: > > > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt > > > > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt > > > > > > Should this be automated? > > > > That will require a bit more thought -- many distro build systems these > > days operate in a sealed box with no network access, so you can't really > > automate this. libicu (the last time I looked) didn't have a predicate > > to tell you if a particular code point was one of the invisible ones. > > Oh, I absolutely do not suggest to run the wget from a normal build! > > But if you look at the kernel unicode CI support, it allows you to > place the downloaded file into the kernel tree, and then case make file > rules to re-generate the tables from it (see fs/unicode/Makefile). Ah ok got it, will take a closer look tomorrow. --D
On Tue, Jul 02, 2024 at 10:02:19PM -0700, Darrick J. Wong wrote: > On Wed, Jul 03, 2024 at 06:27:32AM +0200, Christoph Hellwig wrote: > > On Tue, Jul 02, 2024 at 06:59:56PM -0700, Darrick J. Wong wrote: > > > > > $ wget https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt > > > > > $ grep -E '(zero width|invisible|joiner|application)' -i UnicodeData.txt > > > > > > > > Should this be automated? > > > > > > That will require a bit more thought -- many distro build systems these > > > days operate in a sealed box with no network access, so you can't really > > > automate this. libicu (the last time I looked) didn't have a predicate > > > to tell you if a particular code point was one of the invisible ones. > > > > Oh, I absolutely do not suggest to run the wget from a normal build! > > > > But if you look at the kernel unicode CI support, it allows you to > > place the downloaded file into the kernel tree, and then case make file > > rules to re-generate the tables from it (see fs/unicode/Makefile). > > Ah ok got it, will take a closer look tomorrow. I'm not sure if there's a good way to automate this. But let's go through the grep output? 009F;<control>;Cc;0;BN;;;;;N;APPLICATION PROGRAM COMMAND;;;; This one seems to render as Ÿ (Y with two dots above it) so I didn't put this in the case statement. 034F;COMBINING GRAPHEME JOINER;Mn;0;NSM;;;;;N;;;;; Doesn't seem to have any sort of rendering (i.e. if I type them into a web browser, it doesn't render any glyph at all), so I put it in the case statement. 200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;; 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;; 200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;; These ones have 'zero-width' in the name, so I put these in the denylist. 2060;WORD JOINER;Cf;0;BN;;;;;N;;;;; https://www.unicode.org/reports/tr14/ says that this doesn't have any rendering, nor does my web browser try one. Come to think of it, I should probably add these two: 2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;; 2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;; because it doesn't render either. 2061;FUNCTION APPLICATION;Cf;0;BN;;;;;N;;;;; https://www.unicode.org/reports/tr25/tr25-6.html says this is a metacharacter and doesn't have its own rendering. It seems to be there to indicate that this: f(x + y) is a function taking "x + y" as a parameter, and not shorthand for: f*x + f+y 2062;INVISIBLE TIMES;Cf;0;BN;;;;;N;;;;; 2063;INVISIBLE SEPARATOR;Cf;0;BN;;;;;N;;;;; 2064;INVISIBLE PLUS;Cf;0;BN;;;;;N;;;;; These have 'invisible' in the name, that seems pretty obvious to me. 2D7F;TIFINAGH CONSONANT JOINER;Mn;9;NSM;;;;;N;;;;; This one doesn't seem to render either. Unicode 15.1 says it isn't visible rendered. FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;; Another 'zero-width' one, same as the others. 1107F;BRAHMI NUMBER JOINER;Mn;9;NSM;;;;;N;;;;; 11A47;ZANABAZAR SQUARE SUBJOINER;Mn;9;NSM;;;;;N;;;;; 11A99;SOYOMBO SUBJOINER;Mn;9;NSM;;;;;N;;;;; 13430;EGYPTIAN HIEROGLYPH VERTICAL JOINER;Cf;0;L;;;;;N;;;;; 13431;EGYPTIAN HIEROGLYPH HORIZONTAL JOINER;Cf;0;L;;;;;N;;;;; When I typed these into a browser, I got *some* kind of rendering. I doubt I'm using them correctly, since they're likely supposed to be surrounded by other codepoints from the same chart. 11F42;KAWI CONJOINER;Mn;9;NSM;;;;;N;;;;; I don't even know about this one; it generated "tofu" (aka those horrid rectangular boxes with the byte representation in them). On the other hand, https://www.unicode.org/notes/tn48/UTN48-Implementing-Kawi-1.pdf says it's not visible by itself. But paired with Kawi script, it's supposed to affect the formatting. I'm not sure how exactly to write a classifier here -- the 'invisible' and 'zero width' ones are obvious, but the 'joiner' code points don't seem to have any obvious trend to them. For now I think I'll take the "conservative" approach and only flag things that sound like they're supposed to be general metacharacters, and leave out the modifier codepoints that are ok if they're surrounded by certain codepoints. But this is a rather manual process. --D
On Wed, Jul 03, 2024 at 11:35:10AM -0700, Darrick J. Wong wrote: > I'm not sure how exactly to write a classifier here -- the 'invisible' > and 'zero width' ones are obvious, but the 'joiner' code points don't > seem to have any obvious trend to them. > > For now I think I'll take the "conservative" approach and only flag > things that sound like they're supposed to be general metacharacters, > and leave out the modifier codepoints that are ok if they're surrounded > by certain codepoints. But this is a rather manual process. Oh, right. There is no clear identification and you are just doing a manual search based on viѕual output. Yes, there's unfortunately no really good way to automate that.
diff --git a/scrub/unicrash.c b/scrub/unicrash.c index 96e20114c484..fc1adb2caab7 100644 --- a/scrub/unicrash.c +++ b/scrub/unicrash.c @@ -351,15 +351,17 @@ name_entry_examine( while ((uchr = uiter_next32(&uiter)) != U_SENTINEL) { /* zero width character sequences */ switch (uchr) { + case 0x034F: /* combining grapheme joiner */ case 0x200B: /* zero width space */ case 0x200C: /* zero width non-joiner */ case 0x200D: /* zero width joiner */ - case 0xFEFF: /* zero width non breaking space */ case 0x2060: /* word joiner */ case 0x2061: /* function application */ case 0x2062: /* invisible times (multiply) */ case 0x2063: /* invisible separator (comma) */ case 0x2064: /* invisible plus (addition) */ + case 0x2D7F: /* tifinagh consonant joiner */ + case 0xFEFF: /* zero width non breaking space */ *badflags |= UNICRASH_ZERO_WIDTH; break; }