diff mbox series

[PATCH/RFC,v1,1/1] Support working-tree-encoding "UTF-16LE-BOM"

Message ID 20181229110924.26598-1-tboegi@web.de (mailing list archive)
State New, archived
Headers show
Series [PATCH/RFC,v1,1/1] Support working-tree-encoding "UTF-16LE-BOM" | expand

Commit Message

Torsten Bögershausen Dec. 29, 2018, 11:09 a.m. UTC
From: Torsten Bögershausen <tboegi@web.de>

Users who want UTF-16 files in the working tree set the .gitattributes
like this:
test.txt working-tree-encoding=UTF-16

After a checkout, the resulting file has a BOM and is encoded in "UTF-16".
The unicode standard allows both little- and big-endianess (LE/BE) for
those files, the BOM will tell which one is used inside the file.
iconv seems to prefer the BE version.
Not all users under Windows are happy with this when tools are not fully
unicode aware and don't digest the BE version at all.

Today there is no name for "UTF-16 with BOM, little endian please".
Introduce "UTF-16LE-BOM".

Rported-by: Adrián Gimeno Balaguer <adrigibal@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
---

This feels like an RFC at the moment - please comment.
Using UTF-16 in the way "UTF-16LE-BOM" is used in this patch
could be an alternative - simply produce UTF-16 in LE version
under Git - this could make people using Git happy as well.

Documentation/gitattributes.txt  |  4 +--
 compat/precompose_utf8.c         |  2 +-
 t/t0028-working-tree-encoding.sh | 12 ++++++++-
 utf8.c                           | 42 ++++++++++++++++++++++++--------
 utf8.h                           |  2 +-
 5 files changed, 47 insertions(+), 15 deletions(-)

Comments

Adrián Gimeno Balaguer Dec. 29, 2018, 3:48 p.m. UTC | #1
Hello again.

I appreciate the grown interest in this issue.

Torsten, may I know what is the benefit on your code? My PR solved it
by only tweaking the utf8.c's function 'has_prohibited_utf_bom', which
is likely the shortest way:

https://github.com/git/git/pull/550/files

In order to make sure everything is clear, here is a case list of
current Git behaviour and new one after my PR, regarding this issue.

Current behaviour:

- Placing 'test.txt working-tree-encoding=UTF-16' for a new test.txt
file with either UTF-16 BE or LE BOM, and comitting everything -> The
file gets re-encoded from UTF-8 (as stored internally), to UTF-16 and
the default system/libiconv endianness -> Problem (as long as user
required the opposite endianness for any reason on his project). As a
note, user can see however human-readable diffs on that file.

- Placing  'test.txt working-tree-encoding=UTF-16LE' or 'test.txt
working-tree-encoding=UTF-16BE' for a new test.txt file with either
UTF-16 BE or LE BOM, and comitting everything: we assume user is doing
this because he requires that exact endianness, thus he writes it in
order to attempt preserving it -> Git prohibites commiting it, also no
human-readable diff is shown in the diff viewer/tool being used, but
file is simply shown as binary.

New behaviour:

-  Just got too lazy to repeat it all over, read my PR description:
https://github.com/git/git/pull/550

- Git translations may need to be tweaked to in order to be consistent
with new behaviour.

Thanks for your attention.
Philip Oakley Dec. 29, 2018, 5:54 p.m. UTC | #2
(adding Brian as cc who was in the original thread)

On 29/12/2018 15:48, Adrián Gimeno Balaguer wrote:
> Hello again.
> 
> I appreciate the grown interest in this issue.
> 
> Torsten, may I know what is the benefit on your code? My PR solved it
> by only tweaking the utf8.c's function 'has_prohibited_utf_bom', which
> is likely the shortest way:
> 
> https://github.com/git/git/pull/550/files

My main complaint with the PR would be the lack of documentation updates.

As the discussion has highlighted, whatever our solution, we will need 
to tell the users in plain and simple terms which parts of which 
standards are being used, and why we need to be somehow 'different'.

That is because a revision control system must be able to recover the 
original, for use in the original software tool, not just interpret it 
is some alternate form. The standards generally abdicate responsibility 
for the last step ;-)

I did not fully understand the conversion process you proposed, as I 
assumed(?) that on receipt of the source file, the Git conversion to 
utf-8 would convert the 16-bit BOM to the three byte utf-8 BOM byte 
sequence `EF BB BF` which has lost any knowledge of the original BE/LE 
coding.

Or, are we saying that the the 16-bit BOM is being interpreted as, a) 
the BE/LE indicator and b) a genuine "ZERO WIDTH NON-BREAKING SPACE" 
which is stored as the two byte utf-8 character code, again loosing 
(once stored as a blob object) the BE/LE indication.

Or, we see the BOM, note the endianness and then loose the BOM character 
when converting to utf-8. My ignorance of this step is starting to show. 
Regular users are probably even more confused, hence my hope for some 
documentation.

Given the above confusions, and many more when exploring the internet, 
the provision of a new, extra, clear, name for the encoding, as 
suggested by Torsten does offer an advantage in that it explicitly 
(rather than implicitly) makes plain what we are trying to do, without 
squeezing it in 'under the radar'.

That said, assuming an appropriate internal utf-8 Git coding that does 
remember the BE/LE state [if so how?] then the PR is a neat trick.

Torsten's patch also suffers from the lack of user facing documentation.

> 
> In order to make sure everything is clear, here is a case list of
> current Git behaviour and new one after my PR, regarding this issue.
> 
> Current behaviour:
> 
> - Placing 'test.txt working-tree-encoding=UTF-16' for a new test.txt
> file with either UTF-16 BE or LE BOM, and comitting everything -> The
> file gets re-encoded from UTF-8 (as stored internally), to UTF-16 and
> the default system/libiconv endianness -> Problem (as long as user
> required the opposite endianness for any reason on his project). As a
> note, user can see however human-readable diffs on that file.
> 
> - Placing  'test.txt working-tree-encoding=UTF-16LE' or 'test.txt
> working-tree-encoding=UTF-16BE' for a new test.txt file with either
> UTF-16 BE or LE BOM, and comitting everything: we assume user is doing
> this because he requires that exact endianness, thus he writes it in
> order to attempt preserving it -> Git prohibites commiting it, also no
> human-readable diff is shown in the diff viewer/tool being used, but
> file is simply shown as binary.
> 
> New behaviour:
> 
> -  Just got too lazy to repeat it all over, read my PR description:
> https://github.com/git/git/pull/550

"In this PR: Git only prohibites the opposite BOM than the one in 
working-tree-encoding (e.g. if declared LE, then it denies BE BOM 
presence within the associated file, of the declared UTF-16/UTF-32). 
This way the user can now make Git operations which were previously 
impossible, with the only requisite being to match the endianness of 
working-tree-encoding attribute with the associated file/s."

> 
> - Git translations may need to be tweaked to in order to be consistent
> with new behaviour.
> 
> Thanks for your attention.
>
diff mbox series

Patch

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index b8392fc330..4a88ab8be7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -343,13 +343,13 @@  automatic line ending conversion based on your platform.
 ------------------------
 
 Use the following attributes if your '*.ps1' files are UTF-16 little
-endian encoded without BOM and you want Git to use Windows line endings
+endian encoded with BOM and you want Git to use Windows line endings
 in the working directory. Please note, it is highly recommended to
 explicitly define the line endings with `eol` if the `working-tree-encoding`
 attribute is used to avoid ambiguity.
 
 ------------------------
-*.ps1		text working-tree-encoding=UTF-16LE eol=CRLF
+*.ps1		text working-tree-encoding=UTF-16LE-BOM eol=CRLF
 ------------------------
 
 You can get a list of all available encodings on your platform with the
diff --git a/compat/precompose_utf8.c b/compat/precompose_utf8.c
index de61c15d34..136250fbf6 100644
--- a/compat/precompose_utf8.c
+++ b/compat/precompose_utf8.c
@@ -79,7 +79,7 @@  void precompose_argv(int argc, const char **argv)
 		size_t namelen;
 		oldarg = argv[i];
 		if (has_non_ascii(oldarg, (size_t)-1, &namelen)) {
-			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, NULL);
+			newarg = reencode_string_iconv(oldarg, namelen, ic_precompose, 0, NULL);
 			if (newarg)
 				argv[i] = newarg;
 		}
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 7e87b5a200..e58ecbfc44 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -11,9 +11,12 @@  test_expect_success 'setup test files' '
 
 	text="hallo there!\ncan you read me?" &&
 	echo "*.utf16 text working-tree-encoding=utf-16" >.gitattributes &&
+	echo "*.utf16lebom text working-tree-encoding=UTF-16LE-BOM" >>.gitattributes &&
 	printf "$text" >test.utf8.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-16 >test.utf16.raw &&
 	printf "$text" | iconv -f UTF-8 -t UTF-32 >test.utf32.raw &&
+	printf "\377\376"                         >test.utf16lebom.raw &&
+	printf "$text" | iconv -f UTF-8 -t UTF-32LE >>test.utf16lebom.raw &&
 
 	# Line ending tests
 	printf "one\ntwo\nthree\n" >lf.utf8.raw &&
@@ -32,7 +35,8 @@  test_expect_success 'setup test files' '
 	# Add only UTF-16 file, we will add the UTF-32 file later
 	cp test.utf16.raw test.utf16 &&
 	cp test.utf32.raw test.utf32 &&
-	git add .gitattributes test.utf16 &&
+	cp test.utf16lebom.raw test.utf16lebom &&
+	git add .gitattributes test.utf16 test.utf16lebom &&
 	git commit -m initial
 '
 
@@ -51,6 +55,12 @@  test_expect_success 're-encode to UTF-16 on checkout' '
 	test_cmp_bin test.utf16.raw test.utf16
 '
 
+test_expect_success 're-encode to UTF-16-LE-BOM on checkout' '
+	rm test.utf16lebom &&
+	git checkout test.utf16lebom &&
+	test_cmp_bin test.utf16lebom.raw test.utf16lebom
+'
+
 test_expect_success 'check $GIT_DIR/info/attributes support' '
 	test_when_finished "rm -f test.utf32.git" &&
 	test_when_finished "git reset --hard HEAD" &&
diff --git a/utf8.c b/utf8.c
index eb78587504..83824dc2f4 100644
--- a/utf8.c
+++ b/utf8.c
@@ -4,6 +4,11 @@ 
 
 /* This code is originally from http://www.cl.cam.ac.uk/~mgk25/ucs/ */
 
+static const char utf16_be_bom[] = {'\xFE', '\xFF'};
+static const char utf16_le_bom[] = {'\xFF', '\xFE'};
+static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
+static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
+
 struct interval {
 	ucs_char_t first;
 	ucs_char_t last;
@@ -470,16 +475,17 @@  int utf8_fprintf(FILE *stream, const char *format, ...)
 #else
 	typedef char * iconv_ibp;
 #endif
-char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv, size_t *outsz_p)
+char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv,
+			    size_t bom_len, size_t *outsz_p)
 {
 	size_t outsz, outalloc;
 	char *out, *outpos;
 	iconv_ibp cp;
 
 	outsz = insz;
-	outalloc = st_add(outsz, 1); /* for terminating NUL */
+	outalloc = st_add(outsz, 1 + bom_len); /* for terminating NUL */
 	out = xmalloc(outalloc);
-	outpos = out;
+	outpos = out + bom_len;
 	cp = (iconv_ibp)in;
 
 	while (1) {
@@ -540,10 +546,30 @@  char *reencode_string_len(const char *in, size_t insz,
 {
 	iconv_t conv;
 	char *out;
+	const char *bom_str = NULL;
+	size_t bom_len = 0;
 
 	if (!in_encoding)
 		return NULL;
 
+	/* UTF-16LE-BOM is the same as UTF-16 for reading */
+	if (same_utf_encoding("UTF-16LE-BOM", in_encoding))
+		in_encoding = "UTF-16";
+
+	/*
+	 * For writing, UTF-16 iconv typically creates "UTF-16BE-BOM"
+	 * Some users under Windows want the little endian version
+	 */
+	if (same_utf_encoding("UTF-16LE-BOM", out_encoding)) {
+		bom_str = utf16_le_bom;
+		bom_len = sizeof(utf16_le_bom);
+		out_encoding = "UTF-16LE";
+	} else if (same_utf_encoding("UTF-16BE-BOM", out_encoding)) {
+		bom_str = utf16_be_bom;
+		bom_len = sizeof(utf16_be_bom);
+		out_encoding = "UTF-16BE";
+	}
+
 	conv = iconv_open(out_encoding, in_encoding);
 	if (conv == (iconv_t) -1) {
 		in_encoding = fallback_encoding(in_encoding);
@@ -553,9 +579,10 @@  char *reencode_string_len(const char *in, size_t insz,
 		if (conv == (iconv_t) -1)
 			return NULL;
 	}
-
-	out = reencode_string_iconv(in, insz, conv, outsz);
+	out = reencode_string_iconv(in, insz, conv, bom_len, outsz);
 	iconv_close(conv);
+	if (out && bom_str && bom_len)
+		memcpy(out, bom_str, bom_len);
 	return out;
 }
 #endif
@@ -566,11 +593,6 @@  static int has_bom_prefix(const char *data, size_t len,
 	return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }
 
-static const char utf16_be_bom[] = {'\xFE', '\xFF'};
-static const char utf16_le_bom[] = {'\xFF', '\xFE'};
-static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
-static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};
-
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 {
 	return (
diff --git a/utf8.h b/utf8.h
index edea55e093..84efbfcb1f 100644
--- a/utf8.h
+++ b/utf8.h
@@ -27,7 +27,7 @@  void strbuf_utf8_replace(struct strbuf *sb, int pos, int width,
 
 #ifndef NO_ICONV
 char *reencode_string_iconv(const char *in, size_t insz,
-			    iconv_t conv, size_t *outsz);
+			    iconv_t conv, size_t bom_len, size_t *outsz);
 char *reencode_string_len(const char *in, size_t insz,
 			  const char *out_encoding,
 			  const char *in_encoding,