diff mbox series

[v6] Add git-p4.fallbackEncoding

Message ID 20210429073905.837-1-tzadik.vanderhoof@gmail.com (mailing list archive)
State New
Headers show
Series [v6] Add git-p4.fallbackEncoding | expand

Commit Message

Tzadik Vanderhoof April 29, 2021, 7:39 a.m. UTC
Add git-p4.fallbackEncoding config variable, to prevent git-p4 from
crashing on non UTF-8 changeset descriptions.

When git-p4 reads the output from a p4 command, it assumes it will
be 100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   |  9 ++
 git-p4.py                                  | 11 ++-
 t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++
 3 files changed, 117 insertions(+), 1 deletion(-)
 create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh

Comments

Luke Diamand April 29, 2021, 8:36 a.m. UTC | #1
On 29/04/2021 07:39, Tzadik Vanderhoof wrote:
> Add git-p4.fallbackEncoding config variable, to prevent git-p4 from
> crashing on non UTF-8 changeset descriptions.
> 
> When git-p4 reads the output from a p4 command, it assumes it will
> be 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes with:
> 
>      File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>          value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
>          decode byte Ox93 in position 42: invalid start byte
> 
> This is especially a problem for the "git p4 clone ... @all" command,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
> 
> Add a new config setting, allowing git-p4 to try a fallback encoding
> (for example, "cp1252") and/or use the Unicode replacement character,
> to prevent the whole program from crashing on such a minor problem.

I think Andrew Oakley pointed this out earlier - but in the days of 
Python2 this was (I think) never a problem. Python2 just took in the 
binary data, in whatever encoding, and passed it untouched on to git, 
which in turn just stored it.

https://lore.kernel.org/git/20210412085251.51475-1-andrew@adoakley.name/

It was only whatever was trying to render the bytestream that needed to 
worry about the encoding.

Now we're making the decision in git-p4 when we ingest it - did we 
consider just passing it along untouched?

The problem at hand is that git-p4 is trying to store it internally as a 
`string' which now is unicode-aware, when perhaps it should not be.

It's going to get very confusing if anyone ingests something from 
Perforce having set the encoding to one thing, and it turns out to be a 
different encoding, or worse, multiple encodings for the same repo.

I also worry that if someone has connected to a Unicode-aware Perforce 
server, and then unwittingly set P4CHARSET, then are we going to end up 
silently scrambling everything?

I'm not sure, encodings make my head hurt.

Luke



> 
> Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> ---
>   Documentation/git-p4.txt                   |  9 ++
>   git-p4.py                                  | 11 ++-
>   t/t9836-git-p4-config-fallback-encoding.sh | 98 ++++++++++++++++++++++
>   3 files changed, 117 insertions(+), 1 deletion(-)
>   create mode 100755 t/t9836-git-p4-config-fallback-encoding.sh
> 
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b424..86d3ffa644 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,15 @@ git-p4.pathEncoding::
>   	to transcode the paths to UTF-8. As an example, Perforce on Windows
>   	often uses "cp1252" to encode path names.
>   
> +git-p4.fallbackEncoding::
> +	Perforce changeset descriptions can be stored in any encoding.
> +	Git-p4 first tries to interpret each description as UTF-8. If that
> +	fails, this config allows another encoding to be tried. You can specify,
> +	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
> +	be used, with invalid UTF-8 characters replaced by the Unicode replacement
> +	character. The default is "none": there is no fallback, and any non UTF-8
> +	character will cause git-p4 to immediately fail.
> +
>   git-p4.largeFileSystem::
>   	Specify the system that is used for large (binary) files. Please note
>   	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac4..202fb01bdf 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                   for key, value in entry.items():
>                       key = key.decode()
>                       if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except UnicodeDecodeError:
> +                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
> +                            if fallbackEncoding == 'none':
> +                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                       decoded_entry[key] = value
>                   # Parse out data if it's an error response
>                   if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
> diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh
> new file mode 100755
> index 0000000000..901bb3759d
> --- /dev/null
> +++ b/t/t9836-git-p4-config-fallback-encoding.sh
> @@ -0,0 +1,98 @@
> +#!/bin/sh
> +
> +test_description='test git-p4.fallbackEncoding config'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./lib-git-p4.sh
> +
> +test_expect_success 'start p4d' '
> +	start_p4d
> +'
> +
> +test_expect_success 'add Unicode description' '
> +	cd "$cli" &&
> +	echo file1 >file1 &&
> +	p4 add file1 &&
> +	p4 submit -d documentación
> +'
> +
> +# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some
> +# environments. This test determines if that is the case in our environment. If so,
> +# we create a file called "clone_fails". In subsequent tests, we check whether that
> +# file exists to determine what behavior to expect.
> +
> +clone_fails="$TRASH_DIRECTORY/clone_fails"
> +
> +# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file,
> +# and make sure the error message is correct
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
> +	git config --global git-p4.fallbackEncoding none &&
> +	test_when_finished cleanup_git && {
> +		git p4 clone --dest="$git" //depot@all 2>error || (
> +			>"$clone_fails" &&
> +			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
> +		)
> +	}
> +'
> +
> +# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset,
> +# also with the correct error message.  Otherwise the clone should succeed.
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding unset' '
> +	git config --global --unset git-p4.fallbackEncoding &&
> +	test_when_finished cleanup_git && {
> +		(
> +			test -f "$clone_fails" &&
> +			test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
> +			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
> +		) ||
> +		(
> +			! test -f "$clone_fails" &&
> +			git p4 clone --dest="$git" //depot@all 2>error
> +		)
> +	}
> +'
> +
> +# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding
> +# to "cp1252" should cause clone to succeed and get the right description
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
> +	git config --global git-p4.fallbackEncoding cp1252 &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | cut -d" " -f2) &&
> +		test "$desc" = "documentación"
> +	)
> +'
> +
> +# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed.
> +# If "clone_fails" exists, the description should contain the Unicode replacement
> +# character, otherwise the description should be correct (since we're on a system that
> +# doesn't have the Unicode issue)
> +
> +test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
> +	git config --global git-p4.fallbackEncoding replace &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | cut -d" " -f2) &&
> +		{
> +			(test -f "$clone_fails" &&
> +				test "$desc" = "documentaci�n"
> +			) ||
> +			(! test -f "$clone_fails" &&
> +				test "$desc" = "documentación"
> +			)
> +		}
> +	)
> +'
> +
> +test_done
>
Tzadik Vanderhoof April 29, 2021, 5:29 p.m. UTC | #2
On Thu, Apr 29, 2021 at 1:36 AM Luke Diamand <luke@diamand.org> wrote:
>
> I think Andrew Oakley pointed this out earlier - but in the days of
> Python2 this was (I think) never a problem. Python2 just took in the
> binary data, in whatever encoding, and passed it untouched on to git,
> which in turn just stored it.
>

Unfortunately, I just became aware yesterday that Andrew Oakley was
also working on this issue (his CC to me 2 weeks ago somehow ended up
in my Spam folder, and I only dug it out of there after finding out
that we both created a test with the same number).

When I first became aware of Andrew's work (yesterday), I thought it
would make mine unnecessary, but upon further investigation, I don't
think Andrew's work will solve this problem.  Please see my reply
yesterday to Andew's thread:
https://lore.kernel.org/git/CAKu1iLXRrsB4mRsDfhBH5aahWzDjpfqLuWP9t47RMB=RdpL1iA@mail.gmail.com
diff mbox series

Patch

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424..86d3ffa644 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,15 @@  git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be stored in any encoding.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You can specify,
+	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
+	be used, with invalid UTF-8 characters replaced by the Unicode replacement
+	character. The default is "none": there is no fallback, and any non UTF-8
+	character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..202fb01bdf 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9836-git-p4-config-fallback-encoding.sh b/t/t9836-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000000..901bb3759d
--- /dev/null
+++ b/t/t9836-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,98 @@ 
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add Unicode description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+# Unicode descriptions cause "git p4 clone" to crash with a UnicodeDecodeError in some
+# environments. This test determines if that is the case in our environment. If so,
+# we create a file called "clone_fails". In subsequent tests, we check whether that
+# file exists to determine what behavior to expect.
+
+clone_fails="$TRASH_DIRECTORY/clone_fails"
+
+# If clone fails with git-p4.fallbackEncoding set to "none", create the "clone_fails" file,
+# and make sure the error message is correct
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git && {
+		git p4 clone --dest="$git" //depot@all 2>error || (
+			>"$clone_fails" &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		)
+	}
+'
+
+# If clone fails with git-p4.fallbackEncoding set to "none", it should also fail when it's unset,
+# also with the correct error message.  Otherwise the clone should succeed.
+
+test_expect_success 'clone with git-p4.fallbackEncoding unset' '
+	git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git && {
+		(
+			test -f "$clone_fails" &&
+			test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		) ||
+		(
+			! test -f "$clone_fails" &&
+			git p4 clone --dest="$git" //depot@all 2>error
+		)
+	}
+'
+
+# Whether or not "clone_fails" exists, setting git-p4.fallbackEncoding
+# to "cp1252" should cause clone to succeed and get the right description
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentación"
+	)
+'
+
+# Setting git-p4.fallbackEncoding to "replace" should always cause clone to succeed.
+# If "clone_fails" exists, the description should contain the Unicode replacement
+# character, otherwise the description should be correct (since we're on a system that
+# doesn't have the Unicode issue)
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		{
+			(test -f "$clone_fails" &&
+				test "$desc" = "documentaci�n"
+			) ||
+			(! test -f "$clone_fails" &&
+				test "$desc" = "documentación"
+			)
+		}
+	)
+'
+
+test_done