
[v7,27/30] t/lib-unicode-nfc-nfd: helper prereqs for testing unicode nfc/nfd

Message ID 6a8308699543faaea760d4605babe50a0e478f41.1653336765.git.gitgitgadget@gmail.com (mailing list archive)
State New
Series Builtin FSMonitor Part 3

Commit Message

Jeff Hostetler May 23, 2022, 8:12 p.m. UTC
From: Jeff Hostetler <jeffhost@microsoft.com>

Create a set of prereqs to help understand how file names
are handled by the filesystem when they contain NFC and NFD
Unicode characters.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/lib-unicode-nfc-nfd.sh | 158 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100755 t/lib-unicode-nfc-nfd.sh

Comments

Junio C Hamano May 23, 2022, 9:33 p.m. UTC | #1
"Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	ls | test-tool hexdump | grep "63 5f c3 a9"

A few comments:

 * Not folding output lines at arbitrary places, as "od", "hd",
   etc. do, is a good design decision made by "hexdump" here.
   Depending on where in the pathname the 4-byte sequence appears,
   tools from other people may split the sequence across output
   lines, making grep ineffective.  But our hexdump would work fine
   here.

 * For the narrow purpose of the tests in this script, output that
   is a single long line produced by hexdump might be sufficient,
   but I wonder if it makes the tool more useful if we at least
   placed the hexified output for each line on separate output
   lines.

 * The purist in us may find it a bit disturbing that exit status from
   test-tool is hidden by the pipe.  I do not care too deeply about
   it, as it is very unlikely that we care about segfault after
   hexdump successfully shows the substring the downstream grep is
   looking for, but it does make us feel dirty.

A devil's advocate suggestion is to go in the completely opposite
side of the spectrum.  Perhaps if we are willing to limit the tool's
utility to the tests done in this script file, it might be a good
idea to combine the latter two elements in the pipeline, i.e.

	ls | test-tool hexgrep 63 5f c3 a9

that exits with 0 when the output from "ls" has the 4-byte sequence,
exits with 1 when it does not, and exits with 139 when it segfaults ;-)
Johannes Schindelin May 24, 2022, 12:14 p.m. UTC | #2
Hi Junio,

On Mon, 23 May 2022, Junio C Hamano wrote:

> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> A devil's advocate suggestion is to go in the completely opposite
> side of the spectrum.  Perhaps if we are willing to limit the tool's
> utility to the tests done in this script file, it might be a good
> idea to combine the latter two elements in the pipeline, i.e.
>
> 	ls | test-tool hexgrep 63 5f c3 a9
>
> that exits with 0 when the output from "ls" has the 4-byte sequence,
> exits with 1 when it does not, and exits with 139 when it segfaults ;-)

I like the idea, but from what I recall of the Knuth-Pratt algorithm
[*1*], the implementation might get a bit more involved than the current
`test-hexdump.c`. With non-repetitive patterns like you wrote above, you
can simply re-set the needle's offset to 0 if a mismatch was seen. It's
partially-repetitive patterns such as `01 02 01 02 01 ff` that make things
trickier: After encountering a `01 02 01 02 01`, if the next character is a
`02`, we must not reset the needle's offset completely, as the next two
characters might be `01 ff`, i.e. a match.

Since the purpose of this already-long, already well-iterated patch series
is not necessarily to improve the test suite in such an involved manner,
it should be left as an exercise for another patch series whose purpose
_is_ to improve Git's test framework.

Don't get me wrong, I am very much in favor of that `hexgrep` idea. Just
in its own, dedicated patch series.

Ciao,
Dscho

Footnote *1*: I actually had to look at the Wikipedia page at
https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
to realize that I did an injustice to James Morris by forgetting that they
had discovered the algorithm independently.
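
The partial-match problem described above can be sketched in shell.
This is only an illustration under stated assumptions: `hexgrep_kmp` is
a hypothetical name, no `test-tool hexgrep` exists in this series, and
the matching is delegated to awk rather than written in C. It reads
space-separated hex bytes on stdin and uses the Knuth-Morris-Pratt
failure table to decide how far to fall back after a mismatch, instead
of resetting the needle's offset to 0.

```shell
# hexgrep_kmp BYTE...: hypothetical helper, not part of Git.
# Reads space-separated hex bytes on stdin; exits 0 if the BYTE
# sequence occurs as a contiguous run, 1 otherwise.
hexgrep_kmp () {
	awk -v pat="$*" '
	BEGIN {
		m = split(pat, p, " ")
		# fail[i] is the length of the longest proper prefix of
		# p[1..i] that is also a suffix of p[1..i].
		fail[1] = 0
		k = 0
		for (i = 2; i <= m; i++) {
			while (k > 0 && (p[k + 1] "") != (p[i] ""))
				k = fail[k]
			if ((p[k + 1] "") == (p[i] ""))
				k++
			fail[i] = k
		}
		k = 0
		found = 0
	}
	{
		for (j = 1; j <= NF; j++) {
			# On a mismatch, fall back to the longest prefix
			# that can still match; do not restart from 0.
			while (k > 0 && (p[k + 1] "") != ($j ""))
				k = fail[k]
			if ((p[k + 1] "") == ($j ""))
				k++
			if (k == m) {
				found = 1
				exit
			}
		}
	}
	END { exit !found }
	'
}

# The input from the example above: "01 02 01 02 01" followed by
# "02 01 ff" still contains the needle.
echo "01 02 01 02 01 02 01 ff" | hexgrep_kmp 01 02 01 02 01 ff &&
echo "match found"
```

With the needle `01 02 01 02 01 ff`, the mismatch on the sixth input
byte (`02`) falls back to an offset of 3 rather than 0, so the trailing
`01 ff` completes the match. A reset-to-zero matcher would miss it.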
Jeff Hostetler May 24, 2022, 3:06 p.m. UTC | #3
On 5/23/22 5:33 PM, Junio C Hamano wrote:
> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +	ls | test-tool hexdump | grep "63 5f c3 a9"
> 
> A few comments:
> 
>   * Not folding output lines at arbitrary places, as "od", "hd",
>     etc. do, is a good design decision made by "hexdump" here.
>     Depending on where in the pathname the 4-byte sequence appears,
>     tools from other people may split the sequence across output
>     lines, making grep ineffective.  But our hexdump would work fine
>     here.
> 
>   * For the narrow purpose of the tests in this script, output that
>     is a single long line produced by hexdump might be sufficient,
>     but I wonder if it makes the tool more useful if we at least
>     placed the hexified output for each line on separate output
>     lines.


Yeah, having tools arbitrarily wrap every 16 or whatever bytes
(and including offset line prefixes) makes it difficult to use
when looking for specific patterns that might span a boundary.

I could see having a command line option to emit a '\n' (in addition
to or in place of) each LF in the input.  I suppose it depends on the
type of data we are dumping. (That also gets into issues about CRLFs,
however.)

I'm using hexdump for unicode text here, so it could make sense.  But
if I were using it to dump .git/index it wouldn't.

So having the default be one very long line is a good start.
We can teach it more later.
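
As a rough illustration of the per-line idea, here is a sketch that
emits one line of hex per input line. It is not an option of
`test-tool hexdump`; `hexdump_per_line` is a hypothetical name, and
`od` plus `tr` stand in for the helper so that the sketch runs outside
Git's test suite.

```shell
# hexdump_per_line: hypothetical stand-in, not a test-tool option.
# Prints the hex bytes of each input line (including its LF) on a
# separate output line, with no offsets and no folding.
hexdump_per_line () {
	while IFS= read -r line
	do
		# od may fold long lines; tr joins them back together
		# and squeezes the padding down to single spaces.
		hex=$(printf '%s\n' "$line" | od -An -v -tx1 | tr -s ' \n' '  ')
		hex=${hex# }
		hex=${hex% }
		printf '%s\n' "$hex"
	done
}

printf 'c_\303\251\nd\n' | hexdump_per_line
```

For binary data such as .git/index, splitting at LF bytes has no
meaning, which is why such a mode would have to be opt-in.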

> 
>   * The purist in us may find it a bit disturbing that exit status from
>     test-tool is hidden by the pipe.  I do not care too deeply about
>     it, as it is very unlikely that we care about segfault after
>     hexdump successfully shows the substring the downstream grep is
>     looking for, but it does make us feel dirty.

Given the simplicity of the current version of the helper, I'm not
really worried about such problems.  I suppose that we could do the
usual trick of writing the hex dump to a file and grepping it, but
I'm not sure it's worth the bother right now.
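
For reference, the "usual trick" could look roughly like the sketch
below. The names `hexdump_oneline` and `has_bytes` are hypothetical,
and `od` plus `tr` stand in for `test-tool hexdump` so that the sketch
runs outside Git's test suite. In the prereq itself the equivalent
would be to redirect the hexdump output to a file and grep that file.

```shell
# hexdump_oneline: stand-in for "test-tool hexdump" (an assumption).
# Emits all input bytes as one long line of space-separated hex.
hexdump_oneline () {
	od -An -v -tx1 | tr -s ' \n' '  '
}

# has_bytes FILE PATTERN: write the dump to FILE.hex first, then grep
# it.  Because the two steps are chained with && instead of a pipe, a
# failure of the dump step is not hidden by grep's exit status.
has_bytes () {
	hexdump_oneline <"$1" >"$1.hex" &&
	grep -q "$2" "$1.hex"
}

dir=$(mktemp -d) &&
printf 'c_\303\251\n' >"$dir/names" &&
has_bytes "$dir/names" "63 5f c3 a9" &&
echo "NFC spelling found"
rm -rf "$dir"
```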

> 
> A devil's advocate suggestion is to go in the completely opposite
> side of the spectrum.  Perhaps if we are willing to limit the tool's
> utility to the tests done in this script file, it might be a good
> idea to combine the latter two elements in the pipeline, i.e.
> 
> 	ls | test-tool hexgrep 63 5f c3 a9
> 
> that exits with 0 when the output from "ls" has the 4-byte sequence,
> exits with 1 when it does not, and exits with 139 when it segfaults ;-)
> 

I was a little afraid to suggest a hex version of grep.  That would
be an interesting project to work on, but has lots of hard problems in
it and is too much to tack on to this series.  Johannes raises some
interesting questions in a later response in this thread that suggest
that this could be a seriously non-trivial task.  So again, I'd like
to not attempt this.

Thanks
Jeff

Patch

diff --git a/t/lib-unicode-nfc-nfd.sh b/t/lib-unicode-nfc-nfd.sh
new file mode 100755
index 00000000000..76c6fbc0ec2
--- /dev/null
+++ b/t/lib-unicode-nfc-nfd.sh
@@ -0,0 +1,158 @@ 
+# Help detect how Unicode NFC and NFD are handled on the filesystem.
+
+# A simple character that has an NFD form.
+#
+# NFC:       U+00e9 LATIN SMALL LETTER E WITH ACUTE
+# UTF8(NFC): \xc3 \xa9
+#
+# NFD:       U+0065 LATIN SMALL LETTER E
+#            U+0301 COMBINING ACUTE ACCENT
+# UTF8(NFD): \x65  +  \xcc \x81
+#
+utf8_nfc=$(printf "\xc3\xa9")
+utf8_nfd=$(printf "\x65\xcc\x81")
+
+# Is the OS or the filesystem "Unicode composition sensitive"?
+#
+# That is, does the OS or the filesystem allow files to exist with
+# both the NFC and NFD spellings?  Or, does the OS/FS lie to us and
+# tell us that the NFC and NFD forms are equivalent?
+#
+# This is or may be independent of what type of filesystem we have,
+# since it might be handled by the OS at a layer above the FS.
+# Testing shows that MacOS reports a collision on APFS, HFS+, and
+# FAT32, for example.
+#
+# This does not tell us how the Unicode pathname will be spelled
+# on disk, but rather only that the two spellings "collide".  We
+# will examine the actual on disk spelling in a later prereq.
+#
+test_lazy_prereq UNICODE_COMPOSITION_SENSITIVE '
+	mkdir trial_${utf8_nfc} &&
+	mkdir trial_${utf8_nfd}
+'
+
+# Is the spelling of an NFC pathname preserved on disk?
+#
+# On MacOS with HFS+ and FAT32, NFC paths are converted into NFD
+# and on APFS, NFC paths are preserved.  As we have established
+# above, this is independent of "composition sensitivity".
+#
+test_lazy_prereq UNICODE_NFC_PRESERVED '
+	mkdir c_${utf8_nfc} &&
+	ls | test-tool hexdump | grep "63 5f c3 a9"
+'
+
+# Is the spelling of an NFD pathname preserved on disk?
+#
+test_lazy_prereq UNICODE_NFD_PRESERVED '
+	mkdir d_${utf8_nfd} &&
+	ls | test-tool hexdump | grep "64 5f 65 cc 81"
+'
+
+# The following _DOUBLE_ forms are more for my curiosity,
+# but there may be quirks lurking when there are multiple
+# combining characters in non-canonical order.
+
+# Unicode also allows multiple combining characters
+# that can be decomposed in pieces.
+#
+# NFC:        U+1f67 GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI
+# UTF8(NFC):  \xe1 \xbd \xa7
+#
+# NFD1:       U+1f61 GREEK SMALL LETTER OMEGA WITH DASIA
+#             U+0342 COMBINING GREEK PERISPOMENI
+# UTF8(NFD1): \xe1 \xbd \xa1  +  \xcd \x82
+#
+# But U+1f61 decomposes into
+# NFD2:       U+03c9 GREEK SMALL LETTER OMEGA
+#             U+0314 COMBINING REVERSED COMMA ABOVE
+# UTF8(NFD2): \xcf \x89  +  \xcc \x94
+#
+# Yielding:   \xcf \x89  +  \xcc \x94  +  \xcd \x82
+#
+# Note that I've used the canonical ordering of the
+# combining characters.  It is also possible to
+# swap them.  My testing shows that that non-standard
+# ordering also causes a collision in mkdir.  However,
+# the resulting names don't draw correctly on the
+# terminal (implying that the on-disk format also has
+# them out of order).
+#
+greek_nfc=$(printf "\xe1\xbd\xa7")
+greek_nfd1=$(printf "\xe1\xbd\xa1\xcd\x82")
+greek_nfd2=$(printf "\xcf\x89\xcc\x94\xcd\x82")
+
+# See if a double decomposition also collides.
+#
+test_lazy_prereq UNICODE_DOUBLE_COMPOSITION_SENSITIVE '
+	mkdir trial_${greek_nfc} &&
+	mkdir trial_${greek_nfd2}
+'
+
+# See if the NFC spelling appears on the disk.
+#
+test_lazy_prereq UNICODE_DOUBLE_NFC_PRESERVED '
+	mkdir c_${greek_nfc} &&
+	ls | test-tool hexdump | grep "63 5f e1 bd a7"
+'
+
+# See if the NFD spelling appears on the disk.
+#
+test_lazy_prereq UNICODE_DOUBLE_NFD_PRESERVED '
+	mkdir d_${greek_nfd2} &&
+	ls | test-tool hexdump | grep "64 5f cf 89 cc 94 cd 82"
+'
+
+# The following is for debugging. I found it useful when
+# trying to understand the various (OS, FS) quirks WRT
+# Unicode and how composition/decomposition is handled.
+# For example, when trying to understand how (macOS, APFS)
+# and (macOS, HFS) and (macOS, FAT32) compare.
+#
+# It is rather noisy, so it is disabled by default.
+#
+if test "$unicode_debug" = "true"
+then
+	if test_have_prereq UNICODE_COMPOSITION_SENSITIVE
+	then
+		echo NFC and NFD are distinct on this OS/filesystem.
+	else
+		echo NFC and NFD are aliases on this OS/filesystem.
+	fi
+
+	if test_have_prereq UNICODE_NFC_PRESERVED
+	then
+		echo NFC maintains original spelling.
+	else
+		echo NFC is modified.
+	fi
+
+	if test_have_prereq UNICODE_NFD_PRESERVED
+	then
+		echo NFD maintains original spelling.
+	else
+		echo NFD is modified.
+	fi
+
+	if test_have_prereq UNICODE_DOUBLE_COMPOSITION_SENSITIVE
+	then
+		echo DOUBLE NFC and NFD are distinct on this OS/filesystem.
+	else
+		echo DOUBLE NFC and NFD are aliases on this OS/filesystem.
+	fi
+
+	if test_have_prereq UNICODE_DOUBLE_NFC_PRESERVED
+	then
+		echo Double NFC maintains original spelling.
+	else
+		echo Double NFC is modified.
+	fi
+
+	if test_have_prereq UNICODE_DOUBLE_NFD_PRESERVED
+	then
+		echo Double NFD maintains original spelling.
+	else
+		echo Double NFD is modified.
+	fi
+fi
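
The composition-sensitivity probe in the patch can also be tried by
hand outside the test suite. The sketch below is a standalone
approximation, not part of the patch: `probe_composition` is a
hypothetical name, `mktemp -d` is assumed to be available, and the
bytes are spelled with octal escapes because POSIX does not define
`\x` escapes for printf.

```shell
# Same characters as in the patch, written with octal escapes.
nfc=$(printf '\303\251')     # U+00e9           -> bytes c3 a9
nfd=$(printf 'e\314\201')    # U+0065 + U+0301  -> bytes 65 cc 81

# probe_composition DIR: hypothetical helper, not part of Git.
# Prints "distinct" if DIR can hold both spellings of the same name,
# and "aliases" if either mkdir fails (as with the prereq, any mkdir
# failure is treated as a collision).
probe_composition () {
	if mkdir "$1/trial_$nfc" && mkdir "$1/trial_$nfd" 2>/dev/null
	then
		echo distinct
	else
		echo aliases
	fi
}

scratch=$(mktemp -d) &&
probe_composition "$scratch"
rm -rf "$scratch"
```

The result depends on the OS and filesystem that hold the scratch
directory, which is the point of the prereqs.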