diff mbox series

git-p4: fix crlf handling for utf16 files on Windows

Message ID pull.1294.git.git.1658294873702.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 4d35f744219335d8b32df363891b6336dcf02a6e

Commit Message

Baumann, Moritz July 20, 2022, 5:27 a.m. UTC
From: Moritz Baumann <moritz.baumann@sap.com>

Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>
---
    git-p4: fix crlf handling for utf16 files on Windows
    
    Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-git-1294%2Fmbs-c%2Ffix-crlf-conversion-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-git-1294/mbs-c/fix-crlf-conversion-v1
Pull-Request: https://github.com/git/git/pull/1294

 git-p4.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


base-commit: bbea4dcf42b28eb7ce64a6306cdde875ae5d09ca

Comments

Junio C Hamano July 20, 2022, 4:08 p.m. UTC | #1
"Moritz Baumann via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Moritz Baumann <moritz.baumann@sap.com>

Can you describe briefly what problem is being solved and how the
change solves it in this place above your Sign-off?  The title says
"fix", without saying how the behaviour by the current code is
"broken", so that is one thing you can describe.  It talks about
"UTF-16 files on Windows", but does it mean git-p4 running on
Windows or git-p4 running anywhere that (over the wire) talks with
P4 running on Windows?  IOW, would the same problem trigger if you
are on macOS but the contents of the file you exchange with P4
happens to be in UTF-16?

These are the things you can describe to help those who are not you
(i.e. without access to an environment similar to what you saw the
problem on) understand the issue and help them convince themselves
that the patch they are seeing is a sensible solution.  Without any,
it is hard to evaluate.

> Signed-off-by: Moritz Baumann <moritz.baumann@sap.com>
> ---

> diff --git a/git-p4.py b/git-p4.py
> index 8fbf6eb1fe3..0a9d7e2ed7c 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -3148,7 +3148,7 @@ class P4Sync(Command, P4UserMap):
>                      raise e
>              else:
>                  if p4_version_string().find('/NT') >= 0:
> -                    text = text.replace(b'\r\n', b'\n')
> +                    text = text.replace(b'\x0d\x00\x0a\x00', b'\x0a\x00')
>                  contents = [text]
>  
>          if type_base == "apple":

OK, the part being touched is inside this context:

        if type_base == "utf16":
            # ...
            # But ascii text saved as -t utf16 is completely mangled.
            # Invoke print -o to get the real contents.
            #
            # On windows, the newlines will always be mangled by print, so put
            # them back too.  This is not needed to the cygwin windows version,
            # just the native "NT" type.
            #

            try:
                text = ...
            except Exception as e:
                ...
            else:
                if p4_version_string().find('/NT') >= 0:
                    text = text.replace(b'\r\n', b'\n')
                contents = [text]

So the intent of the existing code is "we know we are dealing with
UTF-16 text, and after successfully reading 'text' without
exception, we need to convert CRLF back to LF if we are on 'the
native NT type'".  Presumably 'text' that came from
p4_read_pipe(... raw=True) is not unicode string but just a bunch of
bytes, so each "char" is represented as two-byte sequence in UTF-16?

With that (speculative) understanding, I can guess that the patch
makes sense, but the patch should not make readers guess.
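
A minimal sketch of that understanding, assuming the raw bytes are
little-endian UTF-16 (which the byte pattern in the patch implies).
The sample text is made up for illustration and is not real
"p4 print" output:

```python
# Sample content as it would look with LF and with CRLF line endings,
# both encoded as UTF-16LE without a BOM (an assumption of this sketch).
lf_text = "first\nsecond\n"
lf_bytes = lf_text.encode("utf-16-le")
crlf_bytes = lf_text.replace("\n", "\r\n").encode("utf-16-le")

# In UTF-16LE, CR LF is the four bytes 0d 00 0a 00, so the two-byte
# pattern b'\r\n' (0d 0a) never occurs and the old call changes nothing.
old_result = crlf_bytes.replace(b'\r\n', b'\n')

# The patched call matches the encoded CR LF and leaves an encoded LF.
new_result = crlf_bytes.replace(b'\x0d\x00\x0a\x00', b'\x0a\x00')

print(old_result == crlf_bytes)
print(new_result == lf_bytes)
```

Note that the four-byte pattern only matches little-endian data;
big-endian UTF-16 would encode CR LF as 00 0d 00 0a.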

Thanks.
Baumann, Moritz July 20, 2022, 4:32 p.m. UTC | #2
Hi Junio,

Thank you for your notes. I assumed the intent of the original code would be clear, in which case the fix should also be clear, but I am happy to elaborate.

> Can you describe briefly what problem is being solved and how the change
> solves it in this place above your Sign-off?  […] It talks about
> "UTF-16 files on Windows", but does it mean git-p4 running on Windows or
> git-p4 running anywhere that (over the wire) talks with
> P4 running on Windows?  IOW, would the same problem trigger if you are on
> macOS but the contents of the file you exchange with P4 happens to be in
> UTF-16?

The potential problem that the original code was trying to solve is the following: If a file is marked as utf16 in Perforce, and if the Perforce client is on Windows, then Perforce will replace all LF line endings with CRLF when the file is synced. This is different from git's autocrlf behavior, which ignores UTF-16 encoded files and always treats them as binary files. Without special handling, this can lead to git-p4 creating files with different hashes when run on Windows. (Which is how I stumbled upon this issue.)

Therefore, git-p4 checks the Perforce "file type" and tries to undo the line-ending changes.

> So the intent of the existing code is "we know we are dealing with
> UTF-16 text, and after successfully reading 'text' without exception, we need
> to convert CRLF back to LF if we are on 'the native NT type'".  Presumably
> 'text' that came from p4_read_pipe(... raw=True) is not unicode string but just
> a bunch of bytes, so each "char" is represented as two-byte sequence in
> UTF-16?

Exactly. The original code tried to do the right thing to ensure stable hashes that are independent of the operating system git-p4 is run on, but failed to do so successfully. With my fix, I finally got deterministic hashes on my test repository.
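
To illustrate the hash argument, here is a sketch that computes SHA-1 blob IDs the way git does for a blob (the header `blob <size>\0` followed by the content). The helper name `git_blob_sha1` and the sample text are illustrative assumptions and not part of git-p4; the data is again assumed to be little-endian UTF-16.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 object ID git would assign to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# The same hypothetical file as seen with LF and with CRLF line endings.
lf_bytes = "a\nb\n".encode("utf-16-le")
crlf_bytes = "a\r\nb\r\n".encode("utf-16-le")

# Undo the CRLF conversion with the byte pattern from the patch.
normalized = crlf_bytes.replace(b'\x0d\x00\x0a\x00', b'\x0a\x00')

# Without normalization the blob IDs differ; with it they agree.
print(git_blob_sha1(crlf_bytes) == git_blob_sha1(lf_bytes))
print(git_blob_sha1(normalized) == git_blob_sha1(lf_bytes))
```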

> With that (speculative) understanding, I can guess that the patch makes sense,
> but the patch should not make readers guess.

Do you need me to resubmit the patch with an explanatory description? If so, I can try to summarize the above.

Best regards,
Moritz
Junio C Hamano July 20, 2022, 5:18 p.m. UTC | #3
"Baumann, Moritz" <moritz.baumann@sap.com> writes:

> Do you need me to resubmit the patch with an explanatory
> description? If so, I can try to summarize the above.

Yup.

Review comments are not a request to the authors to explain their
patches to reviewers.  Their primary purpose is to point out the
potential issues that readers of the commit resulting from the
proposed patches may have.  So answering in your response to see if
your clarifications are understandable is very good, but please
consider it a preparation to write a better version (i.e. [PATCH
v2]).

Thanks for working on this fix.

Patch

diff --git a/git-p4.py b/git-p4.py
index 8fbf6eb1fe3..0a9d7e2ed7c 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -3148,7 +3148,7 @@  class P4Sync(Command, P4UserMap):
                     raise e
             else:
                 if p4_version_string().find('/NT') >= 0:
-                    text = text.replace(b'\r\n', b'\n')
+                    text = text.replace(b'\x0d\x00\x0a\x00', b'\x0a\x00')
                 contents = [text]
 
         if type_base == "apple":