[v2,1/3] Cast byte strings to unicode strings in python3

Message ID	0bca930ff82623bbef172b4cb6c36ef8e5c46098.1573828978.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=V4lN=ZH=vger.kernel.org=git-owner@kernel.org> Message-Id: <0bca930ff82623bbef172b4cb6c36ef8e5c46098.1573828978.git.gitgitgadget@gmail.com> In-Reply-To: <pull.465.v2.git.1573828978.gitgitgadget@gmail.com> References: <pull.465.git.1573679665.gitgitgadget@gmail.com> <pull.465.v2.git.1573828978.gitgitgadget@gmail.com> From: "Ben Keene via GitGitGadget" <gitgitgadget@gmail.com> Date: Fri, 15 Nov 2019 14:42:56 +0000 Subject: [PATCH v2 1/3] Cast byte strings to unicode strings in python3 Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: Ben Keene <seraphire@gmail.com>, Junio C Hamano <gitster@pobox.com>, Ben Keene <bkeene@partswatch.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Feature: New Variable git-p4.p4program \| expand [v2,0/3] Feature: New Variable git-p4.p4program [v2,1/3] Cast byte strings to unicode strings in python3 [v2,2/3] Added general variable git-p4.binary and added a default for windows of 'P4.EXE' [v2,3/3] Changed the name of the parameter from git-p4.binary to git-p4.p4program

Message ID

0bca930ff82623bbef172b4cb6c36ef8e5c46098.1573828978.git.gitgitgadget@gmail.com (mailing list archive)

State

New, archived

Headers

Message-Id: 
 <0bca930ff82623bbef172b4cb6c36ef8e5c46098.1573828978.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.465.v2.git.1573828978.gitgitgadget@gmail.com>
References: <pull.465.git.1573679665.gitgitgadget@gmail.com>
        <pull.465.v2.git.1573828978.gitgitgadget@gmail.com>
From: "Ben Keene via GitGitGadget" <gitgitgadget@gmail.com>
Date: Fri, 15 Nov 2019 14:42:56 +0000
Subject: [PATCH v2 1/3] Cast byte strings to unicode strings in python3
Fcc: Sent
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
To: git@vger.kernel.org
Cc: Ben Keene <seraphire@gmail.com>,
        Junio C Hamano <gitster@pobox.com>,
        Ben Keene <bkeene@partswatch.com>
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

Feature: New Variable git-p4.p4program | expand

Comments

Junio C Hamano Nov. 16, 2019, 2:40 a.m. UTC | #1

"Ben Keene via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Ben Keene <bkeene@partswatch.com>
> Subject: Re: [PATCH v2 1/3] Cast byte strings to unicode strings in python3

That is "how" this patch tries to achieve something that is not
explained in this proposed log message.  Worse, the rest of the
proposed log message is entirely empty and does not help readers
understand what the author wanted to do, or decide if applying these
patches is a good idea.

You wrote something about Python 3 having issues while comparing the
strings read out of process.popen() against string literals in the
code elsewhere in the cover letter.  That information is probably
very relevant to explain why this commit exists, i.e. it should be
explained here.

Here is what I _think_ is close to why you wrote this patch, but as
I already said, I am far from being familiar with Py2to3
differences, so there may be factual misunderstanding you would want
to fix, even if you decide to pick some pieces out of it when
sending an updated patch.

    Subject: [PATCH] git-p4: compare "p4" output with the right string type

    Communicating with a subprocess yields a byte string in both
    Python 2 and Python 3.  Comparing the output or the error string
    with string literals makes the code fail, as string literals in
    Python 3 are unicode strings and not byte strings.

    Introduce a ustring() helper, which returns the corresponding
    unicode string for a given byte string (or return the argument
    as-is, when a unicode string is given) in Python 3, but returns
    the given bytestring as-is in Python 2.  Use it to turn strings
    obtained from subprocess to the type suitable to compare with
    the string literals in both versions of Python.

What does the name "ustring()" stand for?

What purpose does the function serve at the higher conceptual level?
Name the function after the answer to that question and the code
would become easier to explain.

It certainly is not "make sure we have unicode string", as the name
is shared between Python 2 and Python 3, and in the older Python 2
world, you want the helper to mean "keep the string a bytestring",
not turning it into a unicode string.  So I doubt ustring() is a
great name for this helper.

>              if skip_info:
>                  if 'code' in entry and entry['code'] == 'info':
>                      continue
> +                if b'code' in entry and entry[b'code'] == b'info':
> +                    continue

This one explicitly calls for 'bytes', which I think would work
correctly with Python 2.  But why would we have to prepare for both
variants existing in the entry[] hash (especially when working with
Python3)?  Shouldn't the code that populates entry[] be responsible
for turning a byte string into a unicode string?

Also, is subprocess.communicate the only way bytestrings can come to
"git-p4" program, or are there other avenues?  What I am wondering
is if it is sensible to declare that when running under Python 3,
the program will internally use unicode strings for everything and
convert any bytestring to unicode string as soon as it enters [*1*]

That would mean ...

>              if cb is not None:
>                  cb(entry)
>              else:
> -                result.append(entry)
> +                if isunicode:
> +                    out = {}
> +                    for key, value in entry.items():
> +                        out[ustring(key)] = ustring(value)
> +                    result.append(out)
> +                else:
> +                    result.append(entry)

... a hunk like this would become entirely unnecessary, as keys and
values of entry[] that are read from outside world would have
already been made into unicode strings under Python 3.

By the way, would values in entry[] always some string (iow, can
they have a literal number like 47 and 3.14)?

>      except EOFError:
>          pass
>      exitCode = p4.wait()
> @@ -792,7 +813,7 @@ def gitConfig(key, typeSpecifier=None):
>          cmd += [ key ]
>          s = read_pipe(cmd, ignore_error=True)
>          _gitConfig[key] = s.strip()
> -    return _gitConfig[key]
> +    return ustring(_gitConfig[key])

Likewise.  I'd expect, if we were to declare that our internal
strings are all unicode in Python 3, we'd be using a thin wrapper of
read_pipe() that yields a unicode string, so 's' would not be a
bytestring, and s.strip() would not be either.

>  def gitConfigBool(key):
>      """Return a bool, using git config --bool.  It is True only if the
> @@ -860,6 +881,7 @@ def branch_exists(branch):
>      cmd = [ "git", "rev-parse", "--symbolic", "--verify", branch ]
>      p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>      out, _ = p.communicate()
> +    out = ustring(out)

But perhaps it is a bit too much and it is easier to call the
conversion helper on the results obtained by read_pipe() that gives
us bytestring, like this one (and one above that) does?  I dunno.

Thanks.

[Footnote]

*1* If it is not just string comparison but the general direction
    were to consistently use unicode strings under Python 3, which
    I think makes sense at the conceptual level, the sample log
    message I prepared in the earlier part of this response needs to
    be updated, so that it is not so message/comparison centric.

    I said "at the conceptual level", because there may be practical
    reasons not to use unicode everywhere (e.g. performance might
    degrade and some bytestrings used may not be text data in the
    first place).

Junio C Hamano Nov. 16, 2019, 3:52 a.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

>>          cmd += [ key ]
>>          s = read_pipe(cmd, ignore_error=True)
>>          _gitConfig[key] = s.strip()
>> -    return _gitConfig[key]
>> +    return ustring(_gitConfig[key])
>
> Likewise.  I'd expect, if we were to declare that our internal
> strings are all unicode in Python 3, we'd be using a thin wrapper of
> read_pipe() that yields a unicode string, so 's' would not be a
> bytestring, and s.strip() would not be either.

Oops, I did not even realize read_pipe() is already our own.  So
there is no need to any wrapper around it, if we were to declare
that all strings we use are unicode under Python 3---we just need
to have the ustring() call (after giviing the helper a better name)
directly inside read_pipe() function.

diff --git a/git-p4.py b/git-p4.py
index 60c73b6a37..6e8b3a26cd 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -36,12 +36,22 @@ 
     unicode = str
     bytes = bytes
     basestring = (str,bytes)
+    isunicode = True
+    def ustring(text):
+        """Returns the byte string as a unicode string"""
+        if text == '' or text == b'':
+            return ''
+        return unicode(text, "utf-8")
 else:
     # 'unicode' exists, must be Python 2
     str = str
     unicode = unicode
     bytes = str
     basestring = basestring
+    isunicode = False
+    def ustring(text):
+        """Returns the byte string unchanged"""
+        return text
 
 try:
     from subprocess import CalledProcessError
@@ -196,6 +206,8 @@  def read_pipe_full(c):
     expand = isinstance(c,basestring)
     p = subprocess.Popen(c, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=expand)
     (out, err) = p.communicate()
+    out = ustring(out)
+    err = ustring(err)
     return (p.returncode, out, err)
 
 def read_pipe(c, ignore_error=False):
@@ -263,6 +275,7 @@  def p4_has_move_command():
     cmd = p4_build_cmd(["move", "-k", "@from", "@to"])
     p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
     (out, err) = p.communicate()
+    err = ustring(err)
     # return code will be 1 in either case
     if err.find("Invalid option") >= 0:
         return False
@@ -646,10 +659,18 @@  def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
             if skip_info:
                 if 'code' in entry and entry['code'] == 'info':
                     continue
+                if b'code' in entry and entry[b'code'] == b'info':
+                    continue
             if cb is not None:
                 cb(entry)
             else:
-                result.append(entry)
+                if isunicode:
+                    out = {}
+                    for key, value in entry.items():
+                        out[ustring(key)] = ustring(value)
+                    result.append(out)
+                else:
+                    result.append(entry)
     except EOFError:
         pass
     exitCode = p4.wait()
@@ -792,7 +813,7 @@  def gitConfig(key, typeSpecifier=None):
         cmd += [ key ]
         s = read_pipe(cmd, ignore_error=True)
         _gitConfig[key] = s.strip()
-    return _gitConfig[key]
+    return ustring(_gitConfig[key])
 
 def gitConfigBool(key):
     """Return a bool, using git config --bool.  It is True only if the
@@ -860,6 +881,7 @@  def branch_exists(branch):
     cmd = [ "git", "rev-parse", "--symbolic", "--verify", branch ]
     p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
     out, _ = p.communicate()
+    out = ustring(out)
     if p.returncode:
         return False
     # expect exactly one line of output: the branch name

[v2,1/3] Cast byte strings to unicode strings in python3

Commit Message

Comments

Patch