[RFC,4/4] git-p4: use utf-8 encoding for file paths throughout

Message ID 20191128012807.3103-5-yang.zhao@skyboxlabs.com (mailing list archive)
State New, archived
Series git-p4: python 3 compatibility

Commit Message

Yang Zhao Nov. 28, 2019, 1:28 a.m. UTC
Try to decode file paths in responses from p4 as soon as possible so
that we are working with unicode strings throughout the rest of the flow.
This makes python 3 a lot happier.

Signed-off-by: Yang Zhao <yang.zhao@skyboxlabs.com>
---

This is probably the most risky patch out of the set. It's very likely
that I've neglected to consider certain corner cases with decoding of
path data.

 git-p4.py | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

Comments

Elijah Newren Nov. 28, 2019, 2:57 a.m. UTC | #1
On Wed, Nov 27, 2019 at 5:32 PM Yang Zhao <yang.zhao@skyboxlabs.com> wrote:
>
> Try to decode file paths in responses from p4 as soon as possible so
> that we are working with unicode strings throughout the rest of the flow.
> This makes python 3 a lot happier.
>
> Signed-off-by: Yang Zhao <yang.zhao@skyboxlabs.com>
> ---
>
> This is probably the most risky patch out of the set. It's very likely
> that I've neglected to consider certain corner cases with decoding of
> path data.

Yes, this does seem somewhat risky to me.  It may go well on platforms
that require all filenames to be unicode.  And it may work for users
who happen to restrict their filenames to valid utf-8.  But this
abstraction doesn't fit the general problem, so some users may be left
out in the cold.
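
To make the concern concrete, here is a minimal sketch that is not part of
the thread. It assumes a hypothetical file name stored as Latin-1 bytes and
shows that decoding with the 'replace' error handler, as the patch does,
cannot be reversed to recover the original bytes.

```python
# Hypothetical path bytes: Latin-1 encoded "café.txt", which is not valid UTF-8.
raw_path = b"caf\xe9.txt"

# Decoding with errors='replace' substitutes U+FFFD for the invalid byte.
decoded = raw_path.decode("utf-8", "replace")

# Re-encoding does not give back the original bytes, so the real name is lost.
round_tripped = decoded.encode("utf-8")

print(repr(decoded))
print(round_tripped == raw_path)
```

Whether this matters in practice depends on how often depot paths are not
valid in the configured encoding.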

I tried multiple times while switching git-filter-repo from python2 to
python3, at different levels of pervasiveness, to use unicode more
generally.  But I mostly gave up; everyone knows files won't
necessarily be unicode, but you just can't assume filenames or commit
messages or branch or tag names (and perhaps a few other things I'm
forgetting) are either.  I ended up using bytestrings everywhere
except messages displayed to the user, and I only decode at that
point.
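
As a rough illustration of that approach (a sketch only, not code from
git-filter-repo or git-p4), paths can stay as bytes through all processing
and be decoded only when printed. The helper names strip_prefix and display
and the sample depot path are assumptions made for this example.

```python
import sys


def strip_prefix(path: bytes, prefix: bytes) -> bytes:
    """Strip a leading prefix from a path while staying in bytes."""
    if path.startswith(prefix):
        return path[len(prefix):]
    return path


def display(path: bytes) -> str:
    """Decode only for human-readable output; invalid bytes show as escapes."""
    return path.decode("utf-8", "backslashreplace")


# Hypothetical depot path containing a byte that is not valid UTF-8.
raw = b"//depot/project/caf\xe9.txt"
rel = strip_prefix(raw, b"//depot/project/")

# The bytes in `rel` are untouched; only the printed form is decoded.
sys.stdout.write("delete %s\n" % display(rel))
```

The trade-off is that every literal and string operation in the processing
path has to use bytes as well.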


Of course, if perforce happens to only work with unicode filenames
then you'll be fine.  And perhaps you don't want or need to be as
paranoid as I was about what people could do.  So I don't know if my
experience applies in your case (I've never used perforce myself), but
I just thought I'd mention it in case it's useful.

Patch

diff --git a/git-p4.py b/git-p4.py
index 6821d6aafd..bd693e1404 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -650,11 +650,27 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
             if use_encoded_streams:
                 # Decode unmarshalled dict to use str keys and values, except for:
                 #   - `data` which may contain arbitrary binary data
-                #   - `depotFile` which may contain non-UTF8 encoded text
+                #   - `depotFile` which may contain non-UTF8 encoded text, and is decoded
+                #     according to git-p4.pathEncoding config
                 decoded_entry = {}
                 for key, value in entry.items():
                     key = key.decode()
-                    decoded_entry[key] = value.decode() if not (key in ['data', 'depotFile'] or isinstance(value, str)) else value
+                    if key == 'data':
+                        pass
+                    elif key == 'depotFile':
+                        try:
+                            value = value.decode('ascii')
+                        except:
+                            encoding = 'utf-8'
+                            if gitConfig('git-p4.pathEncoding'):
+                                encoding = gitConfig('git-p4.pathEncoding')
+                            value = value.decode(encoding, 'replace')
+                            if verbose:
+                                print('Path with non-ASCII characters detected. Used %s to decode: %s ' % (encoding, value))
+                    elif not isinstance(value, str):
+                        value = value.decode()
+
+                    decoded_entry[key] = value
                 entry = decoded_entry
             if skip_info:
                 if 'code' in entry and entry['code'] == 'info':
@@ -2758,24 +2774,11 @@  def writeToGitStream(self, gitMode, relPath, contents):
             self.gitStream.write(d)
         self.gitStream.write('\n')
 
-    def encodeWithUTF8(self, path):
-        try:
-            path.decode('ascii')
-        except:
-            encoding = 'utf8'
-            if gitConfig('git-p4.pathEncoding'):
-                encoding = gitConfig('git-p4.pathEncoding')
-            path = path.decode(encoding, 'replace').encode('utf8', 'replace')
-            if self.verbose:
-                print('Path with non-ASCII characters detected. Used %s to encode: %s ' % (encoding, path))
-        return path
-
     # output one file from the P4 stream
     # - helper for streamP4Files
 
     def streamOneP4File(self, file, contents):
         relPath = self.stripRepoPath(file['depotFile'], self.branchPrefixes)
-        relPath = self.encodeWithUTF8(relPath)
         if verbose:
             if 'fileSize' in self.stream_file:
                 size = int(self.stream_file['fileSize'])
@@ -2858,7 +2861,6 @@  def streamOneP4File(self, file, contents):
 
     def streamOneP4Deletion(self, file):
         relPath = self.stripRepoPath(file['path'], self.branchPrefixes)
-        relPath = self.encodeWithUTF8(relPath)
         if verbose:
             sys.stdout.write("delete %s\n" % relPath)
             sys.stdout.flush()
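
The decoding strategy in the first hunk can be read in isolation as the
following sketch. It is a simplified stand-in, not the patch itself: the
function name decode_path is hypothetical, and the fallback encoding is
passed as a parameter here instead of being read from git-p4.pathEncoding.

```python
def decode_path(raw, fallback_encoding="utf-8"):
    """Decode a path to str: try ASCII first, then the fallback encoding."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # Bytes that are invalid in the fallback encoding become U+FFFD.
        return raw.decode(fallback_encoding, "replace")


# Pure ASCII paths take the fast path and are returned unchanged.
print(decode_path(b"//depot/a.txt"))

# A hypothetical Latin-1 path decodes cleanly when the encoding matches.
print(decode_path(b"//depot/caf\xe9.txt", "latin-1"))
```

With a mismatched encoding the call still succeeds, but the result contains
replacement characters instead of the original name.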