[v4,03/11] git-p4: add new helper functions for python3 conversion

Message ID f0e658b984ca009c575368e661016f785922f970.1575498577.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Series git-p4.py: Cast byte strings to unicode strings in python3

Commit Message

Ben Keene via GitGitGadget Dec. 4, 2019, 10:29 p.m. UTC
From: Ben Keene <seraphire@gmail.com>

Python 3+ handles strings differently than Python 2.7.  Since Python 2 is reaching its end of life, a series of changes are being submitted to enable python 3.7+ support. The current code fails basic tests under python 3.7.

Change the existing unicode test and add new support functions for python2-python3 support.

Define the following variables:
- isunicode - a boolean variable that states if the version of python natively supports unicode (true) or not (false). This is true for Python3 and false for Python2.
- unicode - a type alias for the datatype that holds a unicode string.  It is assigned to a str under python 3 and the unicode type for Python2.
- bytes - a type alias for an array of bytes.  It is assigned the native bytes type for Python3 and str for Python2.

Add the following new functions:

- as_string(text) - A new function that will convert a byte array to a unicode (UTF-8) string under python 3.  Under python 2, this returns the string unchanged.
- as_bytes(text) - A new function that will convert a unicode string to a byte array under python 3.  Under python 2, this returns the string unchanged.
- to_unicode(text) - Converts a text string to Unicode (UTF-8) on both Python2 and Python3.

Add a new function alias raw_input:
If raw_input does not exist (it was renamed to input in python 3) alias input as raw_input.
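
The alias can be sketched on its own as follows. This is only an illustration: read_line is a hypothetical name chosen here so that the sketch does not rebind a built-in, whereas the patch itself assigns to raw_input.

```python
# Pick whichever line-reading built-in this interpreter provides.
try:
    read_line = raw_input  # Python 2: raw_input() exists
except NameError:
    read_line = input      # Python 3: raw_input() was renamed to input()
```

Callers then use read_line(prompt) without caring which interpreter is running.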

The AS_STRING and AS_BYTES functions allow for modifying the code with a minimal amount of impact on Python2 support.  When a string is expected, as_string() will be used to convert ("cast") the incoming "bytes" to a string type. Conversely, as_bytes() will be used to convert a "string" to a "byte array" type. Since Python2 overloads the datatype 'str' to serve both purposes, the Python2 versions of these functions do not change the data.
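
As a rough, self-contained sketch of the intended behavior (not the code in the patch): the version below folds both interpreter cases into one definition and detects the interpreter with sys.version_info instead of probing for the unicode name. PY3 is a name invented for this illustration.

```python
import sys

# Assumption for this sketch: detect the interpreter by version number.
PY3 = sys.version_info[0] >= 3


def as_string(text):
    """Decode UTF-8 bytes to str on Python 3; return anything else unchanged."""
    if text is None:
        return None
    if PY3 and isinstance(text, bytes):
        return text.decode("utf-8")
    return text


def as_bytes(text):
    """Encode str to UTF-8 bytes on Python 3; return anything else unchanged."""
    if text is None:
        return None
    if PY3 and isinstance(text, str):
        return text.encode("utf-8")
    return text


# Data coming from p4 stays in bytes; convert only at the boundary.
depot_path = as_string(b"//depot/main/file.txt")
wire_value = as_bytes(depot_path)
```

On Python 2 both helpers fall through to the final return, which matches the no-op behavior described above.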

basestring is removed since its only references are found in tests that were changed in the previous change list.

Signed-off-by: Ben Keene <seraphire@gmail.com>
(cherry picked from commit 7921aeb3136b07643c1a503c2d9d8b5ada620356)
---
 git-p4.py | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 4 deletions(-)

Comments

Denton Liu Dec. 5, 2019, 10:40 a.m. UTC | #1
On Wed, Dec 04, 2019 at 10:29:29PM +0000, Ben Keene via GitGitGadget wrote:
> From: Ben Keene <seraphire@gmail.com>
> 
> Python 3+ handles strings differently than Python 2.7.  Since Python 2 is reaching its end of life, a series of changes are being submitted to enable python 3.7+ support. The current code fails basic tests under python 3.7.
> 
> Change the existing unicode test and add new support functions for python2-python3 support.
> 
> Define the following variables:
> - isunicode - a boolean variable that states if the version of python natively supports unicode (true) or not (false). This is true for Python3 and false for Python2.
> - unicode - a type alias for the datatype that holds a unicode string.  It is assigned to a str under python 3 and the unicode type for Python2.
> - bytes - a type alias for an array of bytes.  It is assigned the native bytes type for Python3 and str for Python2.
> 
> Add the following new functions:
> 
> - as_string(text) - A new function that will convert a byte array to a unicode (UTF-8) string under python 3.  Under python 2, this returns the string unchanged.
> - as_bytes(text) - A new function that will convert a unicode string to a byte array under python 3.  Under python 2, this returns the string unchanged.
> - to_unicode(text) - Converts a text string to Unicode (UTF-8) on both Python2 and Python3.
> 
> Add a new function alias raw_input:
> If raw_input does not exist (it was renamed to input in python 3) alias input as raw_input.
> 
> The AS_STRING and AS_BYTES functions allow for modifying the code with a minimal amount of impact on Python2 support.  When a string is expected, as_string() will be used to convert ("cast") the incoming "bytes" to a string type. Conversely, as_bytes() will be used to convert a "string" to a "byte array" type. Since Python2 overloads the datatype 'str' to serve both purposes, the Python2 versions of these functions do not change the data.

How come AS_STRING and AS_BYTES are all-caps here?

> 
> basestring is removed since its only references are found in tests that were changed in the previous change list.
> 
> Signed-off-by: Ben Keene <seraphire@gmail.com>
> (cherry picked from commit 7921aeb3136b07643c1a503c2d9d8b5ada620356)
> ---
>  git-p4.py | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 66 insertions(+), 4 deletions(-)
> 
> diff --git a/git-p4.py b/git-p4.py
> index 0f27996393..93dfd0920a 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -32,16 +32,78 @@
>      unicode = unicode
>  except NameError:
>      # 'unicode' is undefined, must be Python 3
> -    str = str
> +    #
> +    # For Python3 which is natively unicode, we will use 
> +    # unicode for internal information but all P4 Data
> +    # will remain in bytes
> +    isunicode = True
>      unicode = str
>      bytes = bytes
> -    basestring = (str,bytes)
> +
> +    def as_string(text):
> +        """Return a byte array as a unicode string"""
> +        if text == None:

Nit: use `text is None` instead. Actually, any time you're checking an
object to see if it's None, you should use `is` instead of `==` since
there's usually only one None reference.
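
To illustrate why the two checks can disagree, here is a minimal sketch. AlwaysEqual is a made-up class for this example and has nothing to do with git-p4: `==` calls the object's `__eq__`, which a class may override, while `is` compares identity and cannot be overridden.

```python
class AlwaysEqual(object):
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

equal_to_none = (obj == None)  # consults AlwaysEqual.__eq__
is_none = (obj is None)        # pure identity comparison

print(equal_to_none, is_none)
```

Here the equality test reports a match even though obj is clearly not None, which is the kind of surprise `is None` avoids.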

> +            return None
> +        if isinstance(text, bytes):
> +            return unicode(text, "utf-8")
> +        else:
> +            return text
> +
> +    def as_bytes(text):
> +        """Return a Unicode string as a byte array"""
> +        if text == None:
> +            return None
> +        if isinstance(text, bytes):
> +            return text
> +        else:
> +            return bytes(text, "utf-8")
> +
> +    def to_unicode(text):
> +        """Return a byte array as a unicode string"""
> +        return as_string(text)    
> +
> +    def path_as_string(path):
> +        """ Converts a path to the UTF8 encoded string """
> +        if isinstance(path, unicode):
> +            return path
> +        return encodeWithUTF8(path).decode('utf-8')
> +    

Trailing whitespace.

>  else:
>      # 'unicode' exists, must be Python 2
> -    str = str
> +    #
> +    # We will treat the data as:
> +    #   str   -> str
> +    #   bytes -> str
> +    # So for Python2 these functions are no-ops
> +    # and will leave the data in the ambiguious
> +    # string/bytes state
> +    isunicode = False
>      unicode = unicode
>      bytes = str
> -    basestring = basestring
> +
> +    def as_string(text):
> +        """ Return text unaltered (for Python3 support) """

I didn't mention this in earlier emails but it's been bothering me a
lot: is there any reason why you write it as "Python3" vs. "Python 3"
sometimes (and Python2 as well)? If there's no difference, then we
should probably stick to one variant in both the commit messages and in
the code. (I prefer the spaced variant.)

> +        return text
> +
> +    def as_bytes(text):
> +        """ Return text unaltered (for Python3 support) """
> +        return text
> +
> +    def to_unicode(text):
> +        """Return a string as a unicode string"""
> +        return text.decode('utf-8')
> +    

Trailing whitespace.

> +    def path_as_string(path):
> +        """ Converts a path to the UTF8 encoded bytes """
> +        return encodeWithUTF8(path)
> +
> +
> + 

Trailing whitespace.

> +# Check for raw_input support
> +try:
> +    raw_input
> +except NameError:
> +    raw_input = input
>  
>  try:
>      from subprocess import CalledProcessError
> -- 
> gitgitgadget
>

Ben Keene Dec. 5, 2019, 6:42 p.m. UTC | #2
On 12/5/2019 5:40 AM, Denton Liu wrote:
> On Wed, Dec 04, 2019 at 10:29:29PM +0000, Ben Keene via GitGitGadget wrote:
>> From: Ben Keene <seraphire@gmail.com>
>>
>> Python 3+ handles strings differently than Python 2.7.  Since Python 2 is reaching its end of life, a series of changes are being submitted to enable python 3.7+ support. The current code fails basic tests under python 3.7.
>>
>> Change the existing unicode test and add new support functions for python2-python3 support.
>>
>> Define the following variables:
>> - isunicode - a boolean variable that states if the version of python natively supports unicode (true) or not (false). This is true for Python3 and false for Python2.
>> - unicode - a type alias for the datatype that holds a unicode string.  It is assigned to a str under python 3 and the unicode type for Python2.
>> - bytes - a type alias for an array of bytes.  It is assigned the native bytes type for Python3 and str for Python2.
>>
>> Add the following new functions:
>>
>> - as_string(text) - A new function that will convert a byte array to a unicode (UTF-8) string under python 3.  Under python 2, this returns the string unchanged.
>> - as_bytes(text) - A new function that will convert a unicode string to a byte array under python 3.  Under python 2, this returns the string unchanged.
>> - to_unicode(text) - Converts a text string to Unicode (UTF-8) on both Python2 and Python3.
>>
>> Add a new function alias raw_input:
>> If raw_input does not exist (it was renamed to input in python 3) alias input as raw_input.
>>
>> The AS_STRING and AS_BYTES functions allow for modifying the code with a minimal amount of impact on Python2 support.  When a string is expected, as_string() will be used to convert ("cast") the incoming "bytes" to a string type. Conversely, as_bytes() will be used to convert a "string" to a "byte array" type. Since Python2 overloads the datatype 'str' to serve both purposes, the Python2 versions of these functions do not change the data.
> How come AS_STRING and AS_BYTES are all-caps here?


I changed them.  I used all caps to designate that they are code strings.
They are now written as as_string() and as_bytes().


>
>> basestring is removed since its only references are found in tests that were changed in the previous change list.
>>
>> Signed-off-by: Ben Keene <seraphire@gmail.com>
>> (cherry picked from commit 7921aeb3136b07643c1a503c2d9d8b5ada620356)
>> ---
>>   git-p4.py | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++----
>>   1 file changed, 66 insertions(+), 4 deletions(-)
>>
>> diff --git a/git-p4.py b/git-p4.py
>> index 0f27996393..93dfd0920a 100755
>> --- a/git-p4.py
>> +++ b/git-p4.py
>> @@ -32,16 +32,78 @@
>>       unicode = unicode
>>   except NameError:
>>       # 'unicode' is undefined, must be Python 3
>> -    str = str
>> +    #
>> +    # For Python3 which is natively unicode, we will use
>> +    # unicode for internal information but all P4 Data
>> +    # will remain in bytes
>> +    isunicode = True
>>       unicode = str
>>       bytes = bytes
>> -    basestring = (str,bytes)
>> +
>> +    def as_string(text):
>> +        """Return a byte array as a unicode string"""
>> +        if text == None:
> Nit: use `text is None` instead. Actually, any time you're checking an
> object to see if it's None, you should use `is` instead of `==` since
> there's usually only one None reference.

I changed this in this commit and will attempt to fix this in all the 
following commits as well.


>
>> +            return None
>> +        if isinstance(text, bytes):
>> +            return unicode(text, "utf-8")
>> +        else:
>> +            return text
>> +
>> +    def as_bytes(text):
>> +        """Return a Unicode string as a byte array"""
>> +        if text == None:
>> +            return None
>> +        if isinstance(text, bytes):
>> +            return text
>> +        else:
>> +            return bytes(text, "utf-8")
>> +
>> +    def to_unicode(text):
>> +        """Return a byte array as a unicode string"""
>> +        return as_string(text)
>> +
>> +    def path_as_string(path):
>> +        """ Converts a path to the UTF8 encoded string """
>> +        if isinstance(path, unicode):
>> +            return path
>> +        return encodeWithUTF8(path).decode('utf-8')
>> +
> Trailing whitespace.
>
>>   else:
>>       # 'unicode' exists, must be Python 2
>> -    str = str
>> +    #
>> +    # We will treat the data as:
>> +    #   str   -> str
>> +    #   bytes -> str
>> +    # So for Python2 these functions are no-ops
>> +    # and will leave the data in the ambiguious
>> +    # string/bytes state
>> +    isunicode = False
>>       unicode = unicode
>>       bytes = str
>> -    basestring = basestring
>> +
>> +    def as_string(text):
>> +        """ Return text unaltered (for Python3 support) """
> I didn't mention this in earlier emails but it's been bothering me a
> lot: is there any reason why you write it as "Python3" vs. "Python 3"
> sometimes (and Python2 as well)? If there's no difference, then we
> should probably stick to one variant in both the commit messages and in
> the code. (I prefer the spaced variant.)


The difference was sloppy typing.  As with the "is None" checks and the
trailing whitespace, I'll work on fixing these.


>> +        return text
>> +
>> +    def as_bytes(text):
>> +        """ Return text unaltered (for Python3 support) """
>> +        return text
>> +
>> +    def to_unicode(text):
>> +        """Return a string as a unicode string"""
>> +        return text.decode('utf-8')
>> +
> Trailing whitespace.
>
>> +    def path_as_string(path):
>> +        """ Converts a path to the UTF8 encoded bytes """
>> +        return encodeWithUTF8(path)
>> +
>> +
>> +
> Trailing whitespace.
>
>> +# Check for raw_input support
>> +try:
>> +    raw_input
>> +except NameError:
>> +    raw_input = input
>>   
>>   try:
>>       from subprocess import CalledProcessError
>> -- 
>> gitgitgadget
>>

Patch

diff --git a/git-p4.py b/git-p4.py
index 0f27996393..93dfd0920a 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -32,16 +32,78 @@ 
     unicode = unicode
 except NameError:
     # 'unicode' is undefined, must be Python 3
-    str = str
+    #
+    # For Python3 which is natively unicode, we will use 
+    # unicode for internal information but all P4 Data
+    # will remain in bytes
+    isunicode = True
     unicode = str
     bytes = bytes
-    basestring = (str,bytes)
+
+    def as_string(text):
+        """Return a byte array as a unicode string"""
+        if text == None:
+            return None
+        if isinstance(text, bytes):
+            return unicode(text, "utf-8")
+        else:
+            return text
+
+    def as_bytes(text):
+        """Return a Unicode string as a byte array"""
+        if text == None:
+            return None
+        if isinstance(text, bytes):
+            return text
+        else:
+            return bytes(text, "utf-8")
+
+    def to_unicode(text):
+        """Return a byte array as a unicode string"""
+        return as_string(text)    
+
+    def path_as_string(path):
+        """ Converts a path to the UTF8 encoded string """
+        if isinstance(path, unicode):
+            return path
+        return encodeWithUTF8(path).decode('utf-8')
+    
 else:
     # 'unicode' exists, must be Python 2
-    str = str
+    #
+    # We will treat the data as:
+    #   str   -> str
+    #   bytes -> str
+    # So for Python2 these functions are no-ops
+    # and will leave the data in the ambiguious
+    # string/bytes state
+    isunicode = False
     unicode = unicode
     bytes = str
-    basestring = basestring
+
+    def as_string(text):
+        """ Return text unaltered (for Python3 support) """
+        return text
+
+    def as_bytes(text):
+        """ Return text unaltered (for Python3 support) """
+        return text
+
+    def to_unicode(text):
+        """Return a string as a unicode string"""
+        return text.decode('utf-8')
+    
+    def path_as_string(path):
+        """ Converts a path to the UTF8 encoded bytes """
+        return encodeWithUTF8(path)
+
+
+ 
+# Check for raw_input support
+try:
+    raw_input
+except NameError:
+    raw_input = input
 
 try:
     from subprocess import CalledProcessError