diff mbox series

[v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions

Message ID 20210422050504.441-1-tzadik.vanderhoof@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions | expand

Commit Message

Tzadik Vanderhoof April 22, 2021, 5:05 a.m. UTC
When git-p4 reads the output from a p4 command, it assumes it will be
100% UTF-8. If even one character in the output of one p4 command is
not UTF-8, git-p4 crashes with:

File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
    value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
decode byte Ox93 in position 42: invalid start byte

This is especially a problem on the "git p4 clone ... @all" command,
where git-p4 needs to read thousands of changeset descriptions, one of
which may have a stray smart quote, causing the whole clone operation
to fail.

This pull request adds a new config setting, allowing git-p4 to try
another encoding (for example, "cp1252") and/or use the Unicode replacement
character, to prevent the whole program from crashing on such a minor problem.

See the documentation included in the patch for more details of how
the new config setting works.

Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
---
 Documentation/git-p4.txt                   | 10 ++++
 git-p4.py                                  | 11 +++-
 t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh

Comments

Torsten Bögershausen April 22, 2021, 3:50 p.m. UTC | #1
On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote:

Thanks for V3, please see some comments inline.

> When git-p4 reads the output from a p4 command, it assumes it will be
> 100% UTF-8. If even one character in the output of one p4 command is
> not UTF-8, git-p4 crashes with:
>
> File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList
>     value = value.decode() UnicodeDecodeError: 'utf-8' codec can't
> decode byte Ox93 in position 42: invalid start byte
>
> This is especially a problem on the "git p4 clone ... @all" command,
> where git-p4 needs to read thousands of changeset descriptions, one of
> which may have a stray smart quote, causing the whole clone operation
> to fail.
>

> This pull request adds a new config setting, allowing git-p4 to try
> another encoding (for example, "cp1252") and/or use the Unicode replacement
> character, to prevent the whole program from crashing on such a minor problem.

"This pull request" is somewhat superflous wording.
How about:

Add a new config setting, allowing git-p4 to try a fallback encoding
(for example, "cp1252") and/or use the Unicode replacement character,
to prevent the whole program from crashing on such a minor problem.

Documentation is good (and needed, and neccessary).
Probably this is then not needed:
> See the documentation included in the patch for more details of how
> the new config setting works.

>
> Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>
> ---
>  Documentation/git-p4.txt                   | 10 ++++
>  git-p4.py                                  | 11 +++-
>  t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++
>  3 files changed, 85 insertions(+), 1 deletion(-)
>  create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b..e0131a9 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,16 @@ git-p4.pathEncoding::
>  	to transcode the paths to UTF-8. As an example, Perforce on Windows
>  	often uses "cp1252" to encode path names.
>
> +git-p4.fallbackEncoding::
> +	Perforce changeset descriptions can be in a mixture of encodings.
> +	Git-p4 first tries to interpret each description as UTF-8. If that
> +	fails, this config allows another encoding to be tried. You
> +	can specify, for example, "cp1252".

That looks OK according to
https://docs.python.org/3/library/codecs.html#standard-encodings

> + If instead of an encoding,
> +	you specify "replace", UTF-8 will be used, with invalid UTF-8
> +	characters replaced by the Unicode replacement character. If you
> +	specify "none" (the default), there is no fallback, and any non
> +	UTF-8 character will cause git-p4 to immediately fail.
> +

May be, that is a matter of taste:

> + If git-p4.fallbackEncoding is "replace" ", UTF-8 will be used, with invalid UTF-8
> +	characters replaced by the Unicode replacement character.
> +	The default is "none": there is no fallback, and any non
> +	UTF-8 character will cause git-p4 to immediately fail.
> +



>  git-p4.largeFileSystem::
>  	Specify the system that is used for large (binary) files. Please note
>  	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93..3364287 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                  for key, value in entry.items():
>                      key = key.decode()
>                      if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        try:
> +                            value = value.decode()
> +                        except UnicodeDecodeError as ex:
> +                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
> +                            if fallbackEncoding == 'none':
> +                                raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding")
> +                            elif fallbackEncoding == 'replace':
> +                                value = value.decode(errors='replace')
> +                            else:
> +                                value = value.decode(encoding=fallbackEncoding)
>                      decoded_entry[key] = value
>                  # Parse out data if it's an error response
>                  if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
> diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
> new file mode 100755
> index 0000000..56a245e
> --- /dev/null
> +++ b/t/t9835-git-p4-config-fallback-encoding.sh
> @@ -0,0 +1,65 @@
> +#!/bin/sh
> +
> +test_description='test git-p4.fallbackEncoding config'
> +
> +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
> +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
> +
> +. ./lib-git-p4.sh
> +
> +if test_have_prereq !MINGW,!CYGWIN; then
> +	skip_all='This system is not subject to encoding failures in "git p4 clone"'
> +	test_done
> +fi

Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ?

> +
> +test_expect_success 'start p4d' '
> +	start_p4d
> +'
> +
> +test_expect_success 'add cp1252 description' '
> +	cd "$cli" &&
> +	echo file1 >file1 &&
> +	p4 add file1 &&
> +	p4 submit -d documentación
> +'
> +
> +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
> +	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
> +	test_when_finished cleanup_git &&
> +	(
> +		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
Should this be >actual ?
And please no ' ' between the '>' and the filename.

> +		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
> +	)
> +'
> +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
> +	git config --global git-p4.fallbackEncoding none &&
> +	test_when_finished cleanup_git &&
> +	(
> +		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
Same here
2 >actual

> +		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
> +	)
> +'
> +
> +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
> +	git config --global git-p4.fallbackEncoding cp1252 &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentación" ]

Style nit:
See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]".

desc=$(head -1 log | awk '\''{print $2}'\'') &&	test "$desc" = "documentación"

> +	)
> +'
> +
> +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
> +	git config --global git-p4.fallbackEncoding replace &&
> +	test_when_finished cleanup_git &&
> +	(
> +		git p4 clone --dest="$git" //depot@all &&
> +		cd "$git" &&
> +		git log --oneline >log &&
> +		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentaci�n" ]
> +	)
> +'
> +
> +test_done
> --
> 2.31.1.windows.1
>

Are there any more comments from the p4 experts ?
Eric Sunshine April 22, 2021, 4:17 p.m. UTC | #2
On Thu, Apr 22, 2021 at 11:51 AM Torsten Bögershausen <tboegi@web.de> wrote:
> On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote:
> > +if test_have_prereq !MINGW,!CYGWIN; then
> > +     skip_all='This system is not subject to encoding failures in "git p4 clone"'
> > +     test_done
> > +fi
>
> Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ?

The answer to this question is probably worthy of recording as an
in-code comment just above this conditional so that people coming upon
this test script in the future don't have to ask the same question
(which is especially important if the author is no longer reachable).
If an in-code comment is overkill, then it would probably be a good
idea for the commit message to explain the reason.

> > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
> > +     git config --global git-p4.fallbackEncoding cp1252 &&
> > +     test_when_finished cleanup_git &&
> > +     (
> > +             git p4 clone --dest="$git" //depot@all &&
> > +             cd "$git" &&
> > +             git log --oneline >log &&
> > +             desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ]
>
> Style nit:
> See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]".
>
> desc=$(head -1 log | awk '\''{print $2}'\'') && test "$desc" = "documentación"

Style also suggests splitting the line after the &&.

We normally want to avoid using bare single-quotes inside the body of
the test since the body itself is a single-quoted string. These
single-quotes make it harder for a reader to reason about what is
going on; especially with the $2 in there, one has to spend extra
cycles wondering if $2 is correctly expanded when the test runs or
when it is first defined. So, an easier-to-understand rewrite might
be:

    desc=$(head -1 log | awk ''{print \$2}'') &&
    test "$desc" = "documentación"

Many existing tests in this project use `cut` for word-plucking, so an
alternative would be:

    desc=$(head -1 log | cut -d" " -f2) &&
Eric Sunshine April 22, 2021, 10:33 p.m. UTC | #3
On Thu, Apr 22, 2021 at 12:17 PM Eric Sunshine <sunshine@sunshineco.com> wrote:
> We normally want to avoid using bare single-quotes inside the body of
> the test since the body itself is a single-quoted string. These
> single-quotes make it harder for a reader to reason about what is
> going on; especially with the $2 in there, one has to spend extra
> cycles wondering if $2 is correctly expanded when the test runs or
> when it is first defined. So, an easier-to-understand rewrite might
> be:
>
>     desc=$(head -1 log | awk ''{print \$2}'') &&

Of course, the quotes surrounding the {print...} should be
double-quotes, not pairs of single-quotes:

    desc=$(head -1 log | awk "{print \$2}") &&

(I didn't notice the problem when originally composing the email since
the compose window wasn't using a fixed-width font, and only noticed
it later when re-reading it in a mail reader which does use
fixed-width. Sorry for any potential confusion.)

> Many existing tests in this project use `cut` for word-plucking, so an
> alternative would be:
>
>     desc=$(head -1 log | cut -d" " -f2) &&

At any rate, using `cut` would be a good option since there's plenty
of precedent in existing test scripts.
diff mbox series

Patch

diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
index f89e68b..e0131a9 100644
--- a/Documentation/git-p4.txt
+++ b/Documentation/git-p4.txt
@@ -638,6 +638,16 @@  git-p4.pathEncoding::
 	to transcode the paths to UTF-8. As an example, Perforce on Windows
 	often uses "cp1252" to encode path names.
 
+git-p4.fallbackEncoding::
+	Perforce changeset descriptions can be in a mixture of encodings.
+	Git-p4 first tries to interpret each description as UTF-8. If that
+	fails, this config allows another encoding to be tried. You
+	can specify, for example, "cp1252". If instead of an encoding,
+	you specify "replace", UTF-8 will be used, with invalid UTF-8
+	characters replaced by the Unicode replacement character. If you
+	specify "none" (the default), there is no fallback, and any non
+	UTF-8 character will cause git-p4 to immediately fail.
+
 git-p4.largeFileSystem::
 	Specify the system that is used for large (binary) files. Please note
 	that large file systems do not support the 'git p4 submit' command.
diff --git a/git-p4.py b/git-p4.py
index 09c9e93..3364287 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -771,7 +771,16 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
                 for key, value in entry.items():
                     key = key.decode()
                     if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
-                        value = value.decode()
+                        try:
+                            value = value.decode()
+                        except UnicodeDecodeError as ex:
+                            fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none'
+                            if fallbackEncoding == 'none':
+                                raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding")
+                            elif fallbackEncoding == 'replace':
+                                value = value.decode(errors='replace')
+                            else:
+                                value = value.decode(encoding=fallbackEncoding)
                     decoded_entry[key] = value
                 # Parse out data if it's an error response
                 if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh
new file mode 100755
index 0000000..56a245e
--- /dev/null
+++ b/t/t9835-git-p4-config-fallback-encoding.sh
@@ -0,0 +1,65 @@ 
+#!/bin/sh
+
+test_description='test git-p4.fallbackEncoding config'
+
+GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main
+export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME
+
+. ./lib-git-p4.sh
+
+if test_have_prereq !MINGW,!CYGWIN; then
+	skip_all='This system is not subject to encoding failures in "git p4 clone"'
+	test_done
+fi
+
+test_expect_success 'start p4d' '
+	start_p4d
+'
+
+test_expect_success 'add cp1252 description' '
+	cd "$cli" &&
+	echo file1 >file1 &&
+	p4 add file1 &&
+	p4 submit -d documentación
+'
+
+test_expect_success 'clone fails with git-p4.fallbackEncoding unset' '
+	test_might_fail git config --global --unset git-p4.fallbackEncoding &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
+		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
+	)
+'
+test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' '
+	git config --global git-p4.fallbackEncoding none &&
+	test_when_finished cleanup_git &&
+	(
+		test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual &&
+		grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' '
+	git config --global git-p4.fallbackEncoding cp1252 &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentación" ]
+	)
+'
+
+test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' '
+	git config --global git-p4.fallbackEncoding replace &&
+	test_when_finished cleanup_git &&
+	(
+		git p4 clone --dest="$git" //depot@all &&
+		cd "$git" &&
+		git log --oneline >log &&
+		desc=$(head -1 log | awk '\''{print $2}'\'') &&	[ "$desc" = "documentaci�n" ]
+	)
+'
+
+test_done