[2/2] git-p4: do not decode data from perforce by default

Message ID	20210412085251.51475-3-andrew@adoakley.name (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> From: Andrew Oakley <andrew@adoakley.name> To: git@vger.kernel.org Cc: Luke Diamand <luke@diamand.org>, Feiyang Xue <me@feiyangxue.com>, Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com>, Andrew Oakley <andrew@adoakley.name> Subject: [PATCH 2/2] git-p4: do not decode data from perforce by default Date: Mon, 12 Apr 2021 09:52:51 +0100 Message-Id: <20210412085251.51475-3-andrew@adoakley.name> In-Reply-To: <20210412085251.51475-1-andrew@adoakley.name> References: <20210412085251.51475-1-andrew@adoakley.name> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	git-p4: encoding of data from perforce \| expand [0/2] git-p4: encoding of data from perforce [1/2] git-p4: avoid decoding more data from perforce [2/2] git-p4: do not decode data from perforce by default

Andrew Oakley April 12, 2021, 8:52 a.m. UTC

This commit is not intended to change behaviour, any we still attempt to
decode values that might not be valid unicode.  It's not clear that all
of these values are safe to decode, but it's now more obvious which data
is decoded.
---
 git-p4.py | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Tzadik Vanderhoof April 29, 2021, 10 a.m. UTC | #1

I checked out "seen" and ran the test script from this patch
(t9835-git-p4-message-encoding.sh) on my Windows machine and it fails.

I don't think the solution in this patch will solve the issue of non
UTF-8 descriptions on Windows. The interaction between git-p4.py and
p4 around non-ASCII descriptions is different on Linux and Windows (at
least with the default code page settings).  Unfortunately the CI on
gitlab does not include any Windows test environments that have p4
installed.

As far as I can tell, non-ASCII strings passed to "p4 submit -d" pass
unchanged to the Perforce database on Linux.  As well, such data also
passes unchanged in the other direction, when "p4" output is consumed
by git-p4.py.  Since this patch avoids decoding descriptions, and the
test script uses binary data for descriptions, the tests pass on
Linux.

However, on Windows, UTF-8 strings passed to "p4 submit -d" are
somehow converted to the default Windows code page by the time they
are stored in the Perforce database, probably as part of the process
of passing the command line arguments to the Windows p4 executable.
However, the "code page" data is *not* converted to UTF-8 on the way
back from p4 to git-p4.py.  The only way to get it into UTF-8 is to
call string.decode().  As a result, this patch, which takes out the
call to string.decode() will not work on Windows.

Andrew Oakley April 30, 2021, 8:53 a.m. UTC | #2

On Thu, 29 Apr 2021 03:00:06 -0700
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
> However, on Windows, UTF-8 strings passed to "p4 submit -d" are
> somehow converted to the default Windows code page by the time they
> are stored in the Perforce database, probably as part of the process
> of passing the command line arguments to the Windows p4 executable.
> However, the "code page" data is *not* converted to UTF-8 on the way
> back from p4 to git-p4.py.  The only way to get it into UTF-8 is to
> call string.decode().  As a result, this patch, which takes out the
> call to string.decode() will not work on Windows.

Thanks for that explanation, the reencoding of the data on Windows is
not something I was expecting.  Given the behaviour you've described, I
suspect that there might be two different problems that we are trying
to solve.

The perforce depot I'm working with has a mixture of encodings, and
commits are created from a variety of different environments. The
majority of commits are ASCII or UTF-8, there are a small number that
are in some other encoding.  Any attempt to reencode the data is likely
to make the problem worse in at least some cases.

I suspect that other perforce depots are used primarily from Windows
machines, and have data that is encoded in a mostly consistent way but
the encoding is not UTF-8.  Re-encoding the data for git makes sense in
that case.  Is this the kind of repository you have?

If there are these two different cases then we probably need to come up
with a patch that solves both issues.

For my cases where we've got a repository containing all sorts of junk,
it sounds like it might be awkward to create a test case that works on
Windows.

Luke Diamand April 30, 2021, 3:33 p.m. UTC | #3

On 30/04/2021 08:53, Andrew Oakley wrote:
> On Thu, 29 Apr 2021 03:00:06 -0700
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
>> However, on Windows, UTF-8 strings passed to "p4 submit -d" are
>> somehow converted to the default Windows code page by the time they
>> are stored in the Perforce database, probably as part of the process
>> of passing the command line arguments to the Windows p4 executable.
>> However, the "code page" data is *not* converted to UTF-8 on the way
>> back from p4 to git-p4.py.  The only way to get it into UTF-8 is to
>> call string.decode().  As a result, this patch, which takes out the
>> call to string.decode() will not work on Windows.
> 
> Thanks for that explanation, the reencoding of the data on Windows is
> not something I was expecting.  Given the behaviour you've described, I
> suspect that there might be two different problems that we are trying
> to solve.
> 
> The perforce depot I'm working with has a mixture of encodings, and
> commits are created from a variety of different environments. The
> majority of commits are ASCII or UTF-8, there are a small number that
> are in some other encoding.  Any attempt to reencode the data is likely
> to make the problem worse in at least some cases.
> 
> I suspect that other perforce depots are used primarily from Windows
> machines, and have data that is encoded in a mostly consistent way but
> the encoding is not UTF-8.  Re-encoding the data for git makes sense in
> that case.  Is this the kind of repository you have?
> 
> If there are these two different cases then we probably need to come up
> with a patch that solves both issues.
> 
> For my cases where we've got a repository containing all sorts of junk,
> it sounds like it might be awkward to create a test case that works on
> Windows.
> 


https://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

Tzadik - is your server unicode enabled or not? That would be 
interesting to know:

     p4 counters | grep -i unicode

I suspect it is not. It's only if unicode is enabled that the server 
will convert to/from utf8 (at least that's my understanding). Without 
this setting, p4d and p4 are (probably) not doing any conversions.

I think it might be useful to clarify exactly what conversions are 
actually happening.

I wonder what encoding Perforce thinks you've got in place.

Tzadik Vanderhoof April 30, 2021, 6:08 p.m. UTC | #4

On Fri, Apr 30, 2021 at 8:33 AM Luke Diamand <luke@diamand.org> wrote:
>
> Tzadik - is your server unicode enabled or not? That would be
> interesting to know:
>
>      p4 counters | grep -i unicode
>
> I suspect it is not. It's only if unicode is enabled that the server
> will convert to/from utf8 (at least that's my understanding). Without
> this setting, p4d and p4 are (probably) not doing any conversions.

My server is not unicode.

These conversions are happening even with a non-Unicode perforce db.
I don't think it's the p4d code per se that is doing the conversion, but
rather an interaction between the OS and the code, which is different
under Linux vs Windows.  If you create a trivial C program that dumps
the hex values of the bytes it receives in argv, you can see this
different behavior:

#include <stdio.h>

void main(int argc, char *argv[]) {
    int i, j;
    char *s;
    for (i = 1; i < argc; ++i) {
        s = argv[i];
        for (j = 0; s[j] != '\0'; ++j)
            printf(" %X", (unsigned char)s[j]);
        printf("\n");
        printf("[%s]\n\n", s);
    }
}

When built with Visual Studio and called from Cygwin, if you pass in
args with UTF-8 encoded characters, the program will spit them out in
cp1252. If you compile it on a Linux system using gcc, it will spit them out
in UTF-8 (unchanged).  I suspect that's what's happening with p4d on
Windows vs Linux.

In any event, if you look at my patch (v6 is the latest...
https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/ ),
you will see I have written tests that pass under both Linux and Windows.
(If you want to run them yourself, you need to base my patch off of "master",
not "seen").  The tests make clear what the different behavior is and
also show that p4d is not set to Unicode (since the tests do not change the
default setting).

Andrew Oakley May 4, 2021, 9:01 p.m. UTC | #5

On Fri, 30 Apr 2021 11:08:57 -0700
Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> wrote:
> My server is not unicode.
> 
> These conversions are happening even with a non-Unicode perforce db.
> I don't think it's the p4d code per se that is doing the conversion,
> but rather an interaction between the OS and the code, which is
> different under Linux vs Windows.

It's not particularly obvious exactly what is happening here.  The
perforce command line client is written in a rather odd way - it uses
the unicode (UTF-16) wWinMainCRTStartup entry point but then calls an
undocumented API to get the "narrow" version of the command line.  The
code is here:

https://swarm.workshop.perforce.com/projects/perforce_software-p4/files/2016-1/client/clientmain.cc

I think that perforce will end up with the data in a code page that
depends on the configuration of the machine.  I don't think the exact
details matter here - just that it's some semi-arbitrary encoding that
isn't recorded in the commit.

The key thing that I'm trying to point out here is that the encoding is
not necessarily consistent between different commits.  The changes that
you have proposed force you to pick one encoding that will be used for
every commit.  If it's wrong then data will be corrupted, and there is
no option provided to avoid that.  The only way I can see to avoid this
issue is to not attempt to re-encode the data - just pass it directly
to git.

I think another way to solve the issue you have is the encoding header
on git commits.  We can pass the bytes through git-p4 unmodified, but
mark the commit message as being encoded using something that isn't
UTF-8.  That avoids any potentially lossy conversions when cloning the
repository, but should allow the data to be displayed correctly in git.

> In any event, if you look at my patch (v6 is the latest...
> https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/
> ), you will see I have written tests that pass under both Linux and
> Windows. (If you want to run them yourself, you need to base my patch
> off of "master", not "seen").  The tests make clear what the
> different behavior is and also show that p4d is not set to Unicode
> (since the tests do not change the default setting).

I don't think the tests are doing anything interesting on Linux - you
stick valid UTF-8 in, and valid UTF-8 data comes out.   I suspect the
tests will fail on Windows if the relevant code page is set to a value
that you're not expecting.

For the purposes of writing tests that work the same everywhere we can
use `p4 submit -i`.  The data written on stdin isn't reencoded, even on
Windows.

I can rework my test to use `p4 submit -i` on windows.  It should be
fairly simple to write another change which allows the encoding to be
set on commits created by git-p4.

Does that seem like a reasonable way forward?  I think it gets us:
- sensible handling for repositories with mixed encodings
- sensible handling for repositories with known encodings
- tests that work the same on Linux and Windows

Tzadik Vanderhoof May 4, 2021, 9:46 p.m. UTC | #6

On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
> The key thing that I'm trying to point out here is that the encoding is
> not necessarily consistent between different commits.  The changes that
> you have proposed force you to pick one encoding that will be used for
> every commit.  If it's wrong then data will be corrupted, and there is
> no option provided to avoid that.  The only way I can see to avoid this
> issue is to not attempt to re-encode the data - just pass it directly
> to git.

No, my "fallbackEndocing" setting is just that... a fallback.  My proposal
*always* tries to decode in UTF-8 first!  Only if that throws an exception
will my "fallbackEncoding" come into play, and it only comes into play
for the single changeset description that was invalid UTF-8.  After that,
subsequent descriptions will again be tried in UTF-8 first.

The design of the UTF-8 format makes it very unlikely that non UTF-8 text
will pass undetected through a UTF-8 decoder, so by attempting to decode
in UTF-8 first, there is very little risk of a lossy conversion.

As for passing data through "raw", that will *guarantee* bad encoding on
any descriptions that are not UTF-8, because git will interpret the data
as UTF-8 once it has been put into the commit (unless the encoding header
is used, as you mentioned) .  If that header is not used, and it was not in
UTF-8 in Perforce, it has zero chance of being correct in git unless
it is decoded.
At least "fallbackEncoding" gives it SOME chance of decoding it correctly.

> I think another way to solve the issue you have is the encoding header
> on git commits.  We can pass the bytes through git-p4 unmodified, but
> mark the commit message as being encoded using something that isn't
> UTF-8.  That avoids any potentially lossy conversions when cloning the
> repository, but should allow the data to be displayed correctly in git.

Yes, that could be a solution.  I will try that out.

> > In any event, if you look at my patch (v6 is the latest...
> > https://lore.kernel.org/git/20210429073905.837-1-tzadik.vanderhoof@gmail.com/
> > ), you will see I have written tests that pass under both Linux and
> > Windows. (If you want to run them yourself, you need to base my patch
> > off of "master", not "seen").  The tests make clear what the
> > different behavior is and also show that p4d is not set to Unicode
> > (since the tests do not change the default setting).
>
> I don't think the tests are doing anything interesting on Linux - you
> stick valid UTF-8 in, and valid UTF-8 data comes out.

Totally agree.... I only did that to get them to pass the Gitlab CI.
I submitted an earlier
patch that simply skipped the test file on Linux, but I got pushback
on that, so I
made them pass on Linux, even though they aren't useful.

> I suspect the tests will fail on Windows if the relevant code page is set to a value
> that you're not expecting.

It depends.  If the code page is set to UTF-8 (65001) I think the
tests would still work,
because as I said above, my code always tries to decode with UTF-8
first, no matter what
the "fallbackEncoding" setting is.

If the code page is set to something other than UTF-8 or the default,
then one of my tests would
fail, because it uses a hard-coded "fallbackEncoding" of "cp1252".

But the code would work for the user!  All they would need to do is
set "fallbackEncoding" to
the code page they're actually using, instead of "cp1252".

(More sophisticated tests could be developed that explicitly set the
code page and use the
corresponding "fallbackEncoding" setting.)

> For the purposes of writing tests that work the same everywhere we can
> use `p4 submit -i`.  The data written on stdin isn't reencoded, even on
> Windows.

I already have gone down the `p4 submit -i` road.  It behaves exactly
the same as
passing the description on the command line.

(One of several dead-ends I went down that I haven't mentioned in my emails)

Junio C Hamano May 5, 2021, 1:11 a.m. UTC | #7

Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:

> On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
>> The key thing that I'm trying to point out here is that the encoding is
>> not necessarily consistent between different commits.  The changes that
>> you have proposed force you to pick one encoding that will be used for
>> every commit.  If it's wrong then data will be corrupted, and there is
>> no option provided to avoid that.  The only way I can see to avoid this
>> issue is to not attempt to re-encode the data - just pass it directly
>> to git.
>
> No, my "fallbackEndocing" setting is just that... a fallback.  My proposal
> *always* tries to decode in UTF-8 first!  Only if that throws an exception
> will my "fallbackEncoding" come into play, and it only comes into play
> for the single changeset description that was invalid UTF-8.  After that,
> subsequent descriptions will again be tried in UTF-8 first.

Hmph, I do not quite see the need for "No" at the beginning of what
you said.  The fallbackEncoding can specify only one non UTF-8
encoding, so even if majority of commits were in UTF-8 but when you
need to import two commits with non UTF-8 encoding, there is no
suitable value to give to the fallbackEncoding setting.  One of
these two commits will fail to decode first in UTF-8 and then fail
to decode again with the fallback, and after that a corrupted
message remains.

>> I think another way to solve the issue you have is the encoding header
>> on git commits.  We can pass the bytes through git-p4 unmodified, but
>> mark the commit message as being encoded using something that isn't
>> UTF-8.  That avoids any potentially lossy conversions when cloning the
>> repository, but should allow the data to be displayed correctly in git.
>
> Yes, that could be a solution.  I will try that out.

If we can determine in what encoding the thing that came out of
Perforce is written in, we can put it on the encoding header of the
resulting commit.  But if that is possible to begin with, perhaps we
do not even need to do so---if you can determine what the original
encoding is, you can reencode with that encoding into UTF-8 inside
git-p4 while creating the commit, no?

And if the raw data that came from Perforce cannot be reencoded to
UTF-8 (i.e. iconv fails to process for some reason), then whether
the translation is done at the import time (i.e. where you would
have used the fallbackEncoding to reencode into UTF-8) or at the
display time (i.e. "git show" would notice the encoding header and
try to reencode the raw data from that encoding into UTF-8), it
would fail in the same way, so I do not see much advantage in
writing the encoding header into the resulting object (other than
shifting the blame to downstream and keeping the original data
intact, which is a good design principle).

Tzadik Vanderhoof May 5, 2021, 4:02 a.m. UTC | #8

On Tue, May 4, 2021 at 6:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>
> > On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
> >> The key thing that I'm trying to point out here is that the encoding is
> >> not necessarily consistent between different commits.  The changes that
> >> you have proposed force you to pick one encoding that will be used for
> >> every commit.  If it's wrong then data will be corrupted, and there is
> >> no option provided to avoid that.  The only way I can see to avoid this
> >> issue is to not attempt to re-encode the data - just pass it directly
> >> to git.
> >
> > No, my "fallbackEndocing" setting is just that... a fallback.  My proposal
> > *always* tries to decode in UTF-8 first!  Only if that throws an exception
> > will my "fallbackEncoding" come into play, and it only comes into play
> > for the single changeset description that was invalid UTF-8.  After that,
> > subsequent descriptions will again be tried in UTF-8 first.
>
>  The fallbackEncoding can specify only one non UTF-8
> encoding, so even if majority of commits were in UTF-8 but when you
> need to import two commits with non UTF-8 encoding, there is no
> suitable value to give to the fallbackEncoding setting.  One of
> these two commits will fail to decode first in UTF-8 and then fail
> to decode again with the fallback, and after that a corrupted
> message remains.

I'm not sure I understand your scenario.  If the majority of commits
are in UTF-8, but there are 2 with the same UTF-8 encoding (say
"cp1252"), then just set "fallbackEndocing" to "cp1252" and all
the commits will display fine.

Are you talking about a scenario where most of the commits are UTF-8,
one is "cp1252" and another one is "cp1251", so a total of 3 encodings
are used in the Perforce depot?  I don't think that is a common scenario.

But you have a point that my patch does not address that scenario.

> If we can determine in what encoding the thing that came out of
> Perforce is written in, we can put it on the encoding header of the
> resulting commit.  But if that is possible to begin with, perhaps we
> do not even need to do so---if you can determine what the original
> encoding is, you can reencode with that encoding into UTF-8 inside
> git-p4 while creating the commit, no?
>
> And if the raw data that came from Perforce cannot be reencoded to
> UTF-8 (i.e. iconv fails to process for some reason), then whether
> the translation is done at the import time (i.e. where you would
> have used the fallbackEncoding to reencode into UTF-8) or at the
> display time (i.e. "git show" would notice the encoding header and
> try to reencode the raw data from that encoding into UTF-8), it
> would fail in the same way, so I do not see much advantage in
> writing the encoding header into the resulting object (other than
> shifting the blame to downstream and keeping the original data
> intact, which is a good design principle).

I agree with the idea that if you know what the encoding is, then
why not just use that knowledge to convert that to UTF-8, rather
than use the encoding header.

I'm not sure about how complete the support in "git" is for encoding
headers. Does *everything* that reads commit messages respect the
encoding header and handle it properly?  The documentation seems to
discourage their use altogether and in any event, the main use case
seems to be a project where everyone has settled on the same legacy
encoding to use everywhere, which is not really the case for our situation.

If we want to abandon the "fallbackEncoding" direction, then the only other
option I see is:

Step 1) try to decode in UTF-8.  If that succeeds then it is almost certain
that it really was in UTF-8 (due to the design of UTF-8).  Make the commit
in UTF-8.

Step 2) If it failed to decode in UTF-8, either use heuristics to detect the
encoding, or query the OS for the current code page and assume it's that.
Then use the selected encoding to convert to UTF-8.

Only the heuristics option will satisfy the case of more than 1 encoding
(in addition to UTF-8) being present in the Perforce depot, and in any event
the current code page may not help at all.

I'm not really familiar with what encoding-detection heuristics are available
for Python and how reliable, stable or performant they are.  It would also
involve taking on another library dependency.

I will take a look at the encoding-detection options out there.

Tzadik Vanderhoof May 5, 2021, 4:06 a.m. UTC | #9

Oops, noticed a typo.... the paragraph should read:

I'm not sure I understand your scenario.  If the majority of commits
are in UTF-8, but there are 2 with the same *non* UTF-8 encoding (say
"cp1252"), then just set "fallbackEndocing" to "cp1252" and all
the commits will display fine.

Junio C Hamano May 5, 2021, 4:34 a.m. UTC | #10

Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:

> On Tue, May 4, 2021 at 6:11 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Tzadik Vanderhoof <tzadik.vanderhoof@gmail.com> writes:
>>
>> > On Tue, May 4, 2021 at 2:01 PM Andrew Oakley <andrew@adoakley.name> wrote:
>> >> The key thing that I'm trying to point out here is that the encoding is
>> >> not necessarily consistent between different commits.  The changes that
>> >> you have proposed force you to pick one encoding that will be used for
>> >> every commit.  If it's wrong then data will be corrupted, and there is
>> >> no option provided to avoid that.  The only way I can see to avoid this
>> >> issue is to not attempt to re-encode the data - just pass it directly
>> >> to git.
>> > ...
> Are you talking about a scenario where most of the commits are UTF-8,
> one is "cp1252" and another one is "cp1251", so a total of 3 encodings
> are used in the Perforce depot?  I don't think that is a common scenario.

Yes.  I think that is where "not necessarily consistent between
different commits" leads us to---not limited only to two encodings.

> I agree with the idea that if you know what the encoding is, then
> why not just use that knowledge to convert that to UTF-8, rather
> than use the encoding header.

[2/2] git-p4: do not decode data from perforce by default

Commit Message

Comments

Patch