diff mbox series

[v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions

Message ID 20210427053916.1977-1-tzadik.vanderhoof@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions | expand

Commit Message

Tzadik Vanderhoof April 27, 2021, 5:39 a.m. UTC
When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   |  9 +++
 git-p4.py                                  | 11 ++-
 t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++
 3 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

Comments

Tzadik Vanderhoof April 27, 2021, 5:45 a.m. UTC | #1
I modified the test to work on both Linux and Windows.  See the
comments in the test.
Junio C Hamano April 28, 2021, 4:39 a.m. UTC | #2
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:

>  t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++
>  3 files changed, 106 insertions(+), 1 deletion(-)
>  create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

9835 is already taken (see 'seen').
Torsten Bögershausen April 28, 2021, 2:58 p.m. UTC | #3
On Wed, Apr 28, 2021 at 01:39:38PM +0900, Junio C Hamano wrote:
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>
> >  t/t9835-git-p4-config-fallback-encoding.sh | 87 ++++++++++++++++++++++
> >  3 files changed, 106 insertions(+), 1 deletion(-)
> >  create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh
>
> 9835 is already taken (see 'seen').

In general, this looks good to me.
There are two minor nitpicks to make the patch more the git-way:

> Subject: [PATCH v5] add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptionsw

The head line is somewhat too long.
It should be much shorter, like 50-55 characters, if I recall it rigth.
The first line of the commit message is what we see under PATCH in the email,
followed by a blank line (that's what we have) and a detailed description
(Which we have)

How abut this ?

git-p4: Add git-p4.fallbackEncoding

Add git-p4.fallbackEncoding config variable,
to prevent git-p4 from crashing on non UTF-8 changeset descriptions.

When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.


[]

And then, somewhere in the test:

			cp /dev/null "$clone_fails" &&

This should create an empty file, right ?
Then we can use a simple output-redirection:

			>"$clone_fails" &&
Tzadik Vanderhoof April 29, 2021, 5:14 p.m. UTC | #4
On Thu, Apr 29, 2021 at 7:12 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> Hej Tzadik,
>
> This version went only to my email ?
>

v6 went to the list as well as your email.  I just forgot to include
you in the email to the list, so I sent you another copy with just
you.

> The test case number seems to be fixed, thanks.
>
> (Normally we don't have this collision, but right now
> it seem as if there is much going on in the git-p4 area,
> which is good)
>
> The "headline" is still overlong, it seams.
>

I did shorten the first line of my commit as you asked and used that
commit to create the v6 path. That first (short) line goes into the
Subject line of the patch. When you do "git am" it will use the
subject (which is short) as the first line of the commit. The overlong
summary will become the 3rd line of the commit (after a blank second
line)
diff mbox series

Patch

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424..86d3ffa644 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,15 @@  git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be stored in any encoding.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You can specify,
+	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
+	be used, with invalid UTF-8 characters replaced by the Unicode replacement
+	character. The default is "none": there is no fallback, and any non UTF-8
+	character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..202fb01bdf 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000000..383507803e
--- /dev/null
+++ b/t/t9835-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,87 @@ 
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add Unicode description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+# Unicode descriptions cause clone to throw in some environments. This test
+# determines if that is the case in our environment. If so we create a file called "clone_fails".
+# We check that file to in subsequent tests to determine what behavior to expect.
+
+clone_fails="$TRASH_DIRECTORY/clone_fails"
+
+test_expect_success 'clone with git-p4.fallbackEncoding unset' '
+	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git && {
+		git p4 clone --dest="$git" //depot@all 2>error || (
+			cp /dev/null "$clone_fails" &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		)
+	}
+'
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git && {
+		(
+			test -f "$clone_fails" &&
+			test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+			grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+		) ||
+		(
+			! test -f "$clone_fails" &&
+			git p4 clone --dest="$git" //depot@all 2>error
+		)
+	}
+'
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentación"
+	)
+'
+
+test_expect_success 'clone with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		{
+			(test -f "$clone_fails" &&
+				test "$desc" = "documentaci�n"
+			) ||
+			(! test -f "$clone_fails" &&
+				test "$desc" = "documentación"
+			)
+		}
+	)
+'
+
+test_expect_success 'unset git-p4.fallbackEncoding' '
+	git config --global --unset git-p4.fallbackEncoding
+'
+
+test_done