diff mbox series

[v2,2/2] t0028: add more tests

Message ID 40e54cf5ce74d1404187e31c94644df29134b4ff.1569233057.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Update: fixed typos in commit message | expand

Commit Message

Linus Arver via GitGitGadget Sept. 23, 2019, 10:04 a.m. UTC
From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>

After I discovered that UTF-16-LE-BOM test was bugged and still
succeeded, I decided that better tests are required. Possibly the best
option here is to compare git results against hardcoded ground truth.

The new tests also cover more interesting chars where (ANSI != UTF-8).

Signed-off-by: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
---
 t/t0028-working-tree-encoding.sh | 39 ++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

Comments

Torsten Bögershausen Sept. 24, 2019, 4:06 a.m. UTC | #1
On Mon, Sep 23, 2019 at 03:04:19AM -0700, Alexandr Miloslavskiy via GitGitGadget wrote:
> From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>

Thanks for the tests, some nit-picks inline.

>
> After I discovered that UTF-16-LE-BOM test was bugged and still
> succeeded...

My interpretation is that the \000\000 must be handled correctly
on all platforms, and that seems to be the case.

Would this make more sense:

After I discovered that UTF-16-LE-BOM test was bugged,
I decided that better tests are required

> ... I decided that better tests are required. Possibly the best
> option here is to compare git results against hardcoded ground truth.


>
> The new tests also cover more interesting chars where (ANSI != UTF-8).
>
> Signed-off-by: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
> ---
>  t/t0028-working-tree-encoding.sh | 39 ++++++++++++++++++++++++++++++++
>  1 file changed, 39 insertions(+)
>
> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
> index 5493cf3ca9..d0dd5dd0ea 100755
> --- a/t/t0028-working-tree-encoding.sh
> +++ b/t/t0028-working-tree-encoding.sh
> @@ -280,4 +280,43 @@ test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
>  	git reset
>  '
>
> +# $1: checkout encoding
> +# $2: test string
> +# $3: binary test string in checkout encoding
> +test_commit_utf8_checkout_other () {
> +	encoding="$1"
> +	orig_string="$2"
> +	expect_bytes="$3"
> +
> +	test_expect_success "Commit utf-8, checkout ${encoding}" '

General remark:
Do we need the {} here?
${encoding} could be simpler written as $encoding

> +		test_when_finished "git checkout HEAD -- .gitattributes" &&
> +
> +		test_ext="commit_utf8_checkout_${encoding}" &&
> +		test_file="test.${test_ext}" &&
> +
> +		# Commit as utf-8

Another nit-pick:
Looking at the other test cases, should utf-8 be written as UTF-8
for consistency ?

> +		echo "*.${test_ext} text working-tree-encoding=utf-8" >.gitattributes &&
> +		printf "${orig_string}" >"${test_file}" &&
> +		git add "${test_file}" &&
> +		git commit -m "Test data" &&
> +
> +		# Checkout in tested encoding
> +		rm "${test_file}" &&
> +		echo "*.${test_ext} text working-tree-encoding=${encoding}" >.gitattributes &&
> +		git checkout HEAD -- "${test_file}" &&
> +
> +		# Test
> +		printf "${expect_bytes}" > "${test_file}.raw" &&
> +		test_cmp_bin "${test_file}.raw" "${test_file}"

More a style-nit: could we simply write like this:
 		printf $expect_bytes > $test_file.raw &&
		test_cmp_bin $test_file.raw $test_file

(Even on other places)

> +	'
> +}
> +
> +test_commit_utf8_checkout_other "UTF-8"        "Test Тест" "\124\145\163\164\040\320\242\320\265\321\201\321\202"
> +test_commit_utf8_checkout_other "UTF-16LE"     "Test Тест" "\124\000\145\000\163\000\164\000\040\000\042\004\065\004\101\004\102\004"
> +test_commit_utf8_checkout_other "UTF-16BE"     "Test Тест" "\000\124\000\145\000\163\000\164\000\040\004\042\004\065\004\101\004\102"
> +test_commit_utf8_checkout_other "UTF-16LE-BOM" "Test Тест" "\377\376\124\000\145\000\163\000\164\000\040\000\042\004\065\004\101\004\102\004"
> +test_commit_utf8_checkout_other "UTF-16BE-BOM" "Test Тест" "\376\377\000\124\000\145\000\163\000\164\000\040\004\042\004\065\004\101\004\102"
> +test_commit_utf8_checkout_other "UTF-32LE"     "Test Тест" "\124\000\000\000\145\000\000\000\163\000\000\000\164\000\000\000\040\000\000\000\042\004\000\000\065\004\000\000\101\004\000\000\102\004\000\000"
> +test_commit_utf8_checkout_other "UTF-32BE"     "Test Тест" "\000\000\000\124\000\000\000\145\000\000\000\163\000\000\000\164\000\000\000\040\000\000\004\042\000\000\004\065\000\000\004\101\000\000\004\102"
> +
>  test_done
> --
> gitgitgadget
>
Johannes Sixt Sept. 24, 2019, 6:21 a.m. UTC | #2
Am 23.09.19 um 12:04 schrieb Alexandr Miloslavskiy via GitGitGadget:
> From: Alexandr Miloslavskiy <alexandr.miloslavskiy@syntevo.com>
> 
> After I discovered that UTF-16-LE-BOM test was bugged and still
> succeeded, I decided that better tests are required. Possibly the best
> option here is to compare git results against hardcoded ground truth.
> 
> The new tests also cover more interesting chars where (ANSI != UTF-8).

What are we testing here? Is there some back-and-forth conversion going
on, and are we testing that the conversion happens at all, or that the
correct conversion/encoding is picked, or that the conversion that is
finally chosen is correct? Why does it help to test more interesting
chars (and would you not also regard codepoints outside the BMP the most
interesting because they require surrogate codepoints in UTF-16)?

-- Hannes
Alexandr Miloslavskiy Sept. 24, 2019, 10:03 a.m. UTC | #3
On 24.09.2019 6:06, Torsten Bögershausen wrote:
> Would this make more sense:
> After I discovered that UTF-16-LE-BOM test was bugged,
> I decided that better tests are required

OK

 > Looking at the other test cases, should utf-8 be written as UTF-8
 > for consistency ?

OK

 > General remark:
 > Do we need the {} here?
 > ${encoding} could be simpler written as $encoding

 > More a style-nit: could we simply write like this:
 > printf $expect_bytes > $test_file.raw &&
 > test_cmp_bin $test_file.raw $test_file

This is pretty much my first experience with test framework
(and maybe third with shell scripts...). I will change as suggested.
Alexandr Miloslavskiy Sept. 24, 2019, 10:31 a.m. UTC | #4
On 24.09.2019 8:21, Johannes Sixt wrote:
> What are we testing here? Is there some back-and-forth conversion going
> on, and are we testing that the conversion happens at all, or that the
> correct conversion/encoding is picked, or that the conversion that is
> finally chosen is correct? Why does it help to test more interesting
> chars (and would you not also regard codepoints outside the BMP the most
> interesting because they require surrogate codepoints in UTF-16)?

According to my understanding (I'm not the author of test package),
it is designed to test that various encodings are properly supported
by git in the working tree.

The new tests are designed to avoid any back-and-forth, which actually
happened for the previous UTF-16-LE-BOM test, which in turn hidden
that the test was bugged.

Otherwise, the test verifies that if you requested some encoding, you
get exactly that, and it covers various potential problems at once.

 > Why does it help to test more interesting chars (and would you not
 > also regard codepoints outside the BMP the most interesting because
 > they require surrogate codepoints in UTF-16)?
					
It helps to cover more potential problems. One could agree that
converting latin characters is mostly about padding/dropping zero
chars, but this approach could never work for the chars I used. As for
"outside the BMP", I'm simply not experienced with that. If you are,
you're welcome to further improve the tests I added.
diff mbox series

Patch

diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 5493cf3ca9..d0dd5dd0ea 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -280,4 +280,43 @@  test_expect_success ICONV_SHIFT_JIS 'check roundtrip encoding' '
 	git reset
 '
 
+# $1: checkout encoding
+# $2: test string
+# $3: binary test string in checkout encoding
+test_commit_utf8_checkout_other () {
+	encoding="$1"
+	orig_string="$2"
+	expect_bytes="$3"
+	
+	test_expect_success "Commit utf-8, checkout ${encoding}" '
+		test_when_finished "git checkout HEAD -- .gitattributes" &&
+		
+		test_ext="commit_utf8_checkout_${encoding}" &&
+		test_file="test.${test_ext}" &&
+		
+		# Commit as utf-8
+		echo "*.${test_ext} text working-tree-encoding=utf-8" >.gitattributes &&
+		printf "${orig_string}" >"${test_file}" &&
+		git add "${test_file}" &&
+		git commit -m "Test data" &&
+
+		# Checkout in tested encoding
+		rm "${test_file}" &&
+		echo "*.${test_ext} text working-tree-encoding=${encoding}" >.gitattributes &&
+		git checkout HEAD -- "${test_file}" &&
+		
+		# Test
+		printf "${expect_bytes}" > "${test_file}.raw" &&
+		test_cmp_bin "${test_file}.raw" "${test_file}"
+	'
+}
+
+test_commit_utf8_checkout_other "UTF-8"        "Test Тест" "\124\145\163\164\040\320\242\320\265\321\201\321\202"
+test_commit_utf8_checkout_other "UTF-16LE"     "Test Тест" "\124\000\145\000\163\000\164\000\040\000\042\004\065\004\101\004\102\004"
+test_commit_utf8_checkout_other "UTF-16BE"     "Test Тест" "\000\124\000\145\000\163\000\164\000\040\004\042\004\065\004\101\004\102"
+test_commit_utf8_checkout_other "UTF-16LE-BOM" "Test Тест" "\377\376\124\000\145\000\163\000\164\000\040\000\042\004\065\004\101\004\102\004"
+test_commit_utf8_checkout_other "UTF-16BE-BOM" "Test Тест" "\376\377\000\124\000\145\000\163\000\164\000\040\004\042\004\065\004\101\004\102"
+test_commit_utf8_checkout_other "UTF-32LE"     "Test Тест" "\124\000\000\000\145\000\000\000\163\000\000\000\164\000\000\000\040\000\000\000\042\004\000\000\065\004\000\000\101\004\000\000\102\004\000\000"
+test_commit_utf8_checkout_other "UTF-32BE"     "Test Тест" "\000\000\000\124\000\000\000\145\000\000\000\163\000\000\000\164\000\000\000\040\000\000\004\042\000\000\004\065\000\000\004\101\000\000\004\102"
+
 test_done