diff mbox series

add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions

Message ID 20210423063632.1973-1-tzadik.vanderhoof@gmail.com (mailing list archive)
State Superseded
Headers show
Series add git-p4.fallbackEncoding config variable, to prevent git-p4 from crashing on non UTF-8 changeset descriptions | expand

Commit Message

Tzadik Vanderhoof April 23, 2021, 6:36 a.m. UTC
When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

    File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
        value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
        decode byte Ox93 in position 42: invalid start byte

This is especially a problem for the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   |  9 +++
 git-p4.py                                  | 11 +++-
 t/t9835-git-p4-config-fallback-encoding.sh | 76 ++++++++++++++++++++++
 3 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

Comments

Tzadik Vanderhoof April 23, 2021, 6:44 a.m. UTC | #1
(last patch should be labeled as v4... sorry)
Tzadik Vanderhoof April 23, 2021, 7:08 p.m. UTC | #2
To clarify....

The new config variable I am introducing addresses an issue that only
occurs on Windows.  This is because the behavior of the "p4" command
differs on Windows vs Linux around Unicode in changeset descriptions.

I don't have the source code for "p4", but I'm guessing it's written
in C, and that this difference in behavior is simply a result of the
fact that there is no defined standard of how "char *argv[]" in "main"
should deal with non-ASCII characters being passed in from the command
line.

As a result, "git p4 clone" on Linux is not affected by this "p4"
behavior.  Since my tests assume the Windows behavior, they fail when
run on Linux.  For this reason, I added code to my tests to skip them
on Linux.

On a related note, I don't think there are any CI environments on
github for git that are (a) on Windows, and (b) have Python and (c)
have Perforce, so I don't think my tests are actually running on
github CI.  I'm not sure how that can be addressed.
Torsten Bögershausen April 24, 2021, 8:14 a.m. UTC | #3
(Adding some of the p4 and Windows experts in cc)

On Fri, Apr 23, 2021 at 12:08:17PM -0700, Tzadik Vanderhoof wrote:
> To clarify....

Good. This is good information, and the important stuff could go
into the commit message. And because the commit as such should be
self-contained (as much as possible).
Giving an overview about the problem.

>
> The new config variable I am introducing addresses an issue that only
> occurs on Windows.  This is because the behavior of the "p4" command
> differs on Windows vs Linux around Unicode in changeset descriptions.

What does Windows mean in this context ?
Is p4 a "console application" ?
In this case it may be possible to use CHCP to change to a different code page ?

>
> I don't have the source code for "p4", but I'm guessing it's written
> in C, and that this difference in behavior is simply a result of the
> fact that there is no defined standard of how "char *argv[]" in "main"
> should deal with non-ASCII characters being passed in from the command
> line.
>
> As a result, "git p4 clone" on Linux is not affected by this "p4"
> behavior.

Is it ?
What happens if yoy have a p4 depot that was feed from Windows in CP-1252 and is now
accessed from a Linux  machine ?
Doesthe Linux box face the same problems ?

> Since my tests assume the Windows behavior, they fail when
> run on Linux.  For this reason, I added code to my tests to skip them
> on Linux.

That makes sense, but what is the "Windows behavior" more in detail ?
My understanding is that when you press e.g. the key 'Ä' on the keybaord,
it will give a different byte sequence once that 'Ä' is transferred
across the wire (to the p4 server).
This is dependent on what Linux calls a locale, and all major Linux installations
use UTF-8 these days by default.
But that was not always the case, since in old days they used ISO-8851-1 or something
else, usable for your contry/region.

So most Windows "console applications" are not run under UTF-8, but it
may be possible that "CHCP 65000" (or so) works - more testing needed.
>
> On a related note, I don't think there are any CI environments on
> github for git that are (a) on Windows, and (b) have Python and (c)
> have Perforce, so I don't think my tests are actually running on
> github CI.  I'm not sure how that can be addressed.

That are 3 different questions -
(a) Yes, git is compiled under Windows, both gcc and MSVC (correct me if that is wrong)
(b) Yes, we have python on the different CI. Github actions has python, I use it every day.
(c) There are tests run for p4, but it seems if they are only run under Linux.

It would be nice if your test can pass under Linux, why are they failing ?

If I dig here:
<https://github.com/git/git/runs/2420889332?check_suite_focus=true>

We can see that the t98 test are run, and are passing. Just to pick one:
[15:28:22] t9804-git-p4-label.sh .............................. ok     3969

Thanks for working on this.
It would be good to have a v5 version of the patch with some more informations,
like above, and may be :how is the p4 server configured ?
(Unicode or not ?)
diff mbox series

Patch

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b424..86d3ffa644 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,15 @@  git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be stored in any encoding.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You can specify,
+	for example, "cp1252". If git-p4.fallbackEncoding is "replace", UTF-8 will
+	be used, with invalid UTF-8 characters replaced by the Unicode replacement
+	character. The default is "none": there is no fallback, and any non UTF-8
+	character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93ac4..202fb01bdf 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000000..ce352c826b
--- /dev/null
+++ b/t/t9835-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,76 @@ 
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+# The Windows build of p4 encodes its command-line arguments according to the
+# active code page (which defaults to "cp1252"). As a result, "p4 submit -d" causes
+# Unicode changeset descriptions to be stored in the Perforce database as cp1252,
+# and a subsequent "git p4 clone" attempting to decode these descriptions as UTF-8
+# will raise a UnicodeDecodeError, necessitating the use of the git-p4.fallbackEncoding config.
+#
+# The Linux build of p4 encodes its command-line arguments as UTF-8, so changeset descriptions
+# are stored as UTF-8, and UnicodeDecodeError is never raised by "git p4 clone".
+
+if test_have_prereq !MINGW,!CYGWIN; then
+	skip_all='This system is not subject to encoding failures in "git p4 clone"'
+	test_done
+fi
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add cp1252 description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
+	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+		grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+	)
+'
+test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>error &&
+		grep "UTF-8 decoding failed. Consider using git config git-p4.fallbackEncoding" error
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentación"
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | cut -d" " -f2) &&
+		test "$desc" = "documentaci�n"
+	)
+'
+
+test_done