Message ID | 20210422050504.441-1-tzadik.vanderhoof@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] add git-p4.fallbackEncoding config setting, to prevent git-p4 from crashing on non UTF-8 changeset descriptions | expand |
On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote: Thanks for V3, please see some comments inline. > When git-p4 reads the output from a p4 command, it assumes it will be > 100% UTF-8. If even one character in the output of one p4 command is > not UTF-8, git-p4 crashes with: > > File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList > value = value.decode() UnicodeDecodeError: 'utf-8' codec can't > decode byte Ox93 in position 42: invalid start byte > > This is especially a problem on the "git p4 clone ... @all" command, > where git-p4 needs to read thousands of changeset descriptions, one of > which may have a stray smart quote, causing the whole clone operation > to fail. > > This pull request adds a new config setting, allowing git-p4 to try > another encoding (for example, "cp1252") and/or use the Unicode replacement > character, to prevent the whole program from crashing on such a minor problem. "This pull request" is somewhat superflous wording. How about: Add a new config setting, allowing git-p4 to try a fallback encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. Documentation is good (and needed, and neccessary). Probably this is then not needed: > See the documentation included in the patch for more details of how > the new config setting works. > > Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> > --- > Documentation/git-p4.txt | 10 ++++ > git-p4.py | 11 +++- > t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++ > 3 files changed, 85 insertions(+), 1 deletion(-) > create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > index f89e68b..e0131a9 100644 > --- a/Documentation/git-p4.txt > +++ b/Documentation/git-p4.txt > @@ -638,6 +638,16 @@ git-p4.pathEncoding:: > to transcode the paths to UTF-8. As an example, Perforce on Windows > often uses "cp1252" to encode path names. > > +git-p4.fallbackEncoding:: > + Perforce changeset descriptions can be in a mixture of encodings. > + Git-p4 first tries to interpret each description as UTF-8. If that > + fails, this config allows another encoding to be tried. You > + can specify, for example, "cp1252". That looks OK according to https://docs.python.org/3/library/codecs.html#standard-encodings > + If instead of an encoding, > + you specify "replace", UTF-8 will be used, with invalid UTF-8 > + characters replaced by the Unicode replacement character. If you > + specify "none" (the default), there is no fallback, and any non > + UTF-8 character will cause git-p4 to immediately fail. > + May be, that is a matter of taste: > + If git-p4.fallbackEncoding is "replace" ", UTF-8 will be used, with invalid UTF-8 > + characters replaced by the Unicode replacement character. > + The default is "none": there is no fallback, and any non > + UTF-8 character will cause git-p4 to immediately fail. > + > git-p4.largeFileSystem:: > Specify the system that is used for large (binary) files. Please note > that large file systems do not support the 'git p4 submit' command. > diff --git a/git-p4.py b/git-p4.py > index 09c9e93..3364287 100755 > --- a/git-p4.py > +++ b/git-p4.py > @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, > for key, value in entry.items(): > key = key.decode() > if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): > - value = value.decode() > + try: > + value = value.decode() > + except UnicodeDecodeError as ex: > + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' > + if fallbackEncoding == 'none': > + raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding") > + elif fallbackEncoding == 'replace': > + value = value.decode(errors='replace') > + else: > + value = value.decode(encoding=fallbackEncoding) > decoded_entry[key] = value > # Parse out data if it's an error response > if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: > diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh > new file mode 100755 > index 0000000..56a245e > --- /dev/null > +++ b/t/t9835-git-p4-config-fallback-encoding.sh > @@ -0,0 +1,65 @@ > +#!/bin/sh > + > +test_description='test git-p4.fallbackEncoding config' > + > +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main > +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME > + > +. ./lib-git-p4.sh > + > +if test_have_prereq !MINGW,!CYGWIN; then > + skip_all='This system is not subject to encoding failures in "git p4 clone"' > + test_done > +fi Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ? > + > +test_expect_success 'start p4d' ' > + start_p4d > +' > + > +test_expect_success 'add cp1252 description' ' > + cd "$cli" && > + echo file1 >file1 && > + p4 add file1 && > + p4 submit -d documentación > +' > + > +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' ' > + test_might_fail git config --global --unset git-p4.fallbackEncoding && > + test_when_finished cleanup_git && > + ( > + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && Should this be >actual ? And please no ' ' between the '>' and the filename. > + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual > + ) > +' > +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' ' > + git config --global git-p4.fallbackEncoding none && > + test_when_finished cleanup_git && > + ( > + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && Same here 2 >actual > + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual > + ) > +' > + > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' > + git config --global git-p4.fallbackEncoding cp1252 && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] Style nit: See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]". desc=$(head -1 log | awk '\''{print $2}'\'') && test "$desc" = "documentación" > + ) > +' > + > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' ' > + git config --global git-p4.fallbackEncoding replace && > + test_when_finished cleanup_git && > + ( > + git p4 clone --dest="$git" //depot@all && > + cd "$git" && > + git log --oneline >log && > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentaci�n" ] > + ) > +' > + > +test_done > -- > 2.31.1.windows.1 > Are there any more comments from the p4 experts ?
On Thu, Apr 22, 2021 at 11:51 AM Torsten Bögershausen <tboegi@web.de> wrote: > On Wed, Apr 21, 2021 at 10:05:04PM -0700, Tzadik Vanderhoof wrote: > > +if test_have_prereq !MINGW,!CYGWIN; then > > + skip_all='This system is not subject to encoding failures in "git p4 clone"' > > + test_done > > +fi > > Out of curiosity: Why are Windows versions (MINGW, CYGWIN) excluded ? The answer to this question is probably worthy of recording as an in-code comment just above this conditional so that people coming upon this test script in the future don't have to ask the same question (which is especially important if the author is no longer reachable). If an in-code comment is overkill, then it would probably be a good idea for the commit message to explain the reason. > > +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' > > + git config --global git-p4.fallbackEncoding cp1252 && > > + test_when_finished cleanup_git && > > + ( > > + git p4 clone --dest="$git" //depot@all && > > + cd "$git" && > > + git log --oneline >log && > > + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] > > Style nit: > See Documentation/CodingGuidelines: - We prefer "test" over "[ ... ]". > > desc=$(head -1 log | awk '\''{print $2}'\'') && test "$desc" = "documentación" Style also suggests splitting the line after the &&. We normally want to avoid using bare single-quotes inside the body of the test since the body itself is a single-quoted string. These single-quotes make it harder for a reader to reason about what is going on; especially with the $2 in there, one has to spend extra cycles wondering if $2 is correctly expanded when the test runs or when it is first defined. So, an easier-to-understand rewrite might be: desc=$(head -1 log | awk ''{print \$2}'') && test "$desc" = "documentación" Many existing tests in this project use `cut` for word-plucking, so an alternative would be: desc=$(head -1 log | cut -d" " -f2) &&
On Thu, Apr 22, 2021 at 12:17 PM Eric Sunshine <sunshine@sunshineco.com> wrote: > We normally want to avoid using bare single-quotes inside the body of > the test since the body itself is a single-quoted string. These > single-quotes make it harder for a reader to reason about what is > going on; especially with the $2 in there, one has to spend extra > cycles wondering if $2 is correctly expanded when the test runs or > when it is first defined. So, an easier-to-understand rewrite might > be: > > desc=$(head -1 log | awk ''{print \$2}'') && Of course, the quotes surrounding the {print...} should be double-quotes, not pairs of single-quotes: desc=$(head -1 log | awk "{print \$2}") && (I didn't notice the problem when originally composing the email since the compose window wasn't using a fixed-width font, and only noticed it later when re-reading it in a mail reader which does use fixed-width. Sorry for any potential confusion.) > Many existing tests in this project use `cut` for word-plucking, so an > alternative would be: > > desc=$(head -1 log | cut -d" " -f2) && At any rate, using `cut` would be a good option since there's plenty of precedent in existing test scripts.
diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt index f89e68b..e0131a9 100644 --- a/Documentation/git-p4.txt +++ b/Documentation/git-p4.txt @@ -638,6 +638,16 @@ git-p4.pathEncoding:: to transcode the paths to UTF-8. As an example, Perforce on Windows often uses "cp1252" to encode path names. +git-p4.fallbackEncoding:: + Perforce changeset descriptions can be in a mixture of encodings. + Git-p4 first tries to interpret each description as UTF-8. If that + fails, this config allows another encoding to be tried. You + can specify, for example, "cp1252". If instead of an encoding, + you specify "replace", UTF-8 will be used, with invalid UTF-8 + characters replaced by the Unicode replacement character. If you + specify "none" (the default), there is no fallback, and any non + UTF-8 character will cause git-p4 to immediately fail. + git-p4.largeFileSystem:: Specify the system that is used for large (binary) files. Please note that large file systems do not support the 'git p4 submit' command. diff --git a/git-p4.py b/git-p4.py index 09c9e93..3364287 100755 --- a/git-p4.py +++ b/git-p4.py @@ -771,7 +771,16 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, for key, value in entry.items(): key = key.decode() if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): - value = value.decode() + try: + value = value.decode() + except UnicodeDecodeError as ex: + fallbackEncoding = gitConfig("git-p4.fallbackEncoding").lower() or 'none' + if fallbackEncoding == 'none': + raise Exception("UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding") + elif fallbackEncoding == 'replace': + value = value.decode(errors='replace') + else: + value = value.decode(encoding=fallbackEncoding) decoded_entry[key] = value # Parse out data if it's an error response if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: diff --git a/t/t9835-git-p4-config-fallback-encoding.sh b/t/t9835-git-p4-config-fallback-encoding.sh new file mode 100755 index 0000000..56a245e --- /dev/null +++ b/t/t9835-git-p4-config-fallback-encoding.sh @@ -0,0 +1,65 @@ +#!/bin/sh + +test_description='test git-p4.fallbackEncoding config' + +GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME=main +export GIT_TEST_DEFAULT_INITIAL_BRANCH_NAME + +. ./lib-git-p4.sh + +if test_have_prereq !MINGW,!CYGWIN; then + skip_all='This system is not subject to encoding failures in "git p4 clone"' + test_done +fi + +test_expect_success 'start p4d' ' + start_p4d +' + +test_expect_success 'add cp1252 description' ' + cd "$cli" && + echo file1 >file1 && + p4 add file1 && + p4 submit -d documentación +' + +test_expect_success 'clone fails with git-p4.fallbackEncoding unset' ' + test_might_fail git config --global --unset git-p4.fallbackEncoding && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual + ) +' +test_expect_success 'clone fails with git-p4.fallbackEncoding set to "none"' ' + git config --global git-p4.fallbackEncoding none && + test_when_finished cleanup_git && + ( + test_must_fail git p4 clone --dest="$git" //depot@all 2>> actual && + grep "UTF8 decoding failed. Consider using git config git-p4.fallbackEncoding" actual + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "cp1252"' ' + git config --global git-p4.fallbackEncoding cp1252 && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentación" ] + ) +' + +test_expect_success 'clone succeeds with git-p4.fallbackEncoding set to "replace"' ' + git config --global git-p4.fallbackEncoding replace && + test_when_finished cleanup_git && + ( + git p4 clone --dest="$git" //depot@all && + cd "$git" && + git log --oneline >log && + desc=$(head -1 log | awk '\''{print $2}'\'') && [ "$desc" = "documentaci�n" ] + ) +' + +test_done
When git-p4 reads the output from a p4 command, it assumes it will be 100% UTF-8. If even one character in the output of one p4 command is not UTF-8, git-p4 crashes with: File "C:/Program Files/Git/bin/git-p4.py", line 774, in p4CmdList value = value.decode() UnicodeDecodeError: 'utf-8' codec can't decode byte Ox93 in position 42: invalid start byte This is especially a problem on the "git p4 clone ... @all" command, where git-p4 needs to read thousands of changeset descriptions, one of which may have a stray smart quote, causing the whole clone operation to fail. This pull request adds a new config setting, allowing git-p4 to try another encoding (for example, "cp1252") and/or use the Unicode replacement character, to prevent the whole program from crashing on such a minor problem. See the documentation included in the patch for more details of how the new config setting works. Signed-off-by: Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> --- Documentation/git-p4.txt | 10 ++++ git-p4.py | 11 +++- t/t9835-git-p4-config-fallback-encoding.sh | 65 ++++++++++++++++++++++ 3 files changed, 85 insertions(+), 1 deletion(-) create mode 100755 t/t9835-git-p4-config-fallback-encoding.sh