diff mbox series

[4/4] git-p4: resolve RCS keywords in binary

Message ID 20211213225441.1865782-5-jholdsworth@nvidia.com (mailing list archive)
State Superseded
Series git-p4: fix RCS keyword processing encoding errors

Commit Message

Joel Holdsworth Dec. 13, 2021, 10:54 p.m. UTC
RCS keywords are strings that are replaced with information from
Perforce. Examples include $Date$, $Author$, $File$, $Change$ etc.

Perforce expands these keywords to their current values when files
are synced, but Git's data model requires the expanded values to be
converted back into their unexpanded form.

Previously, git-p4.py would implement this behaviour through the use of
regular expressions. However, the regular expression substitution was
applied using decoded strings i.e. the content of incoming commit diffs
was first decoded from bytes into UTF-8, processed with regular
expressions, then converted back to bytes.

Not only is this behaviour inefficient, it is also the cause of a
common failure on text files containing invalid UTF-8 data. In files
created on Windows, CP1252 smart quote characters (0x93 and 0x94) are
seen fairly frequently. These bytes are not valid UTF-8 on their own,
so if the script encounters a file containing them, on Python 2 the
symbols are corrupted, and on Python 3 the script fails with an
exception.
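
As an illustration only (not part of the patch), the Python 3 failure
can be reproduced with a few lines. The sample content and the variable
names here are made up for the example:

```python
# Bytes 0x93 and 0x94 are the CP1252 curly double quotes.
# The file content below is a hypothetical sample, not taken from the patch.
data = b"\x93quoted\x94 $Id: //depot/file#1 $\n"

try:
    data.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # 0x93 is a UTF-8 continuation byte with no lead byte before it,
    # so Python 3 refuses to decode the buffer.
    decoded_ok = False

print(decoded_ok)
```

The same buffer decodes cleanly as "cp1252", but the script has no
reliable way of knowing that this is the right encoding for a given file.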

This patch replaces this decoding/encoding with bytes object regular
expressions, so that the substitution is performed directly upon the
source data with no conversions.
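
A minimal sketch of the bytes-based substitution, using the same
pattern as re_k_keywords in this patch. The input line is a made-up
sample and "collapsed" is a name chosen for the example:

```python
import re

# Same pattern as re_k_keywords in the patch, compiled as a bytes pattern.
re_k_keywords = re.compile(
    br'\$(Id|Header|Author|Date|DateTime|Change|File|Revision)(:[^$\n]+)?\$')

# Hypothetical line containing CP1252 smart quotes and expanded keywords.
line = b"\x93smart\x94 $Id: //depot/file.c#3 $ $Author: alice $\n"

# The replacement template must also be bytes; \1 is the keyword name.
collapsed = re_k_keywords.sub(br'$\1$', line)

print(collapsed)
```

The 0x93 and 0x94 bytes pass through untouched because the input is
never decoded.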

A test for smart quote handling has been added to the
t9810-git-p4-rcs.sh test suite.

Signed-off-by: Joel Holdsworth <jholdsworth@nvidia.com>
---
 git-p4.py             | 15 ++++++++-------
 t/t9810-git-p4-rcs.sh | 15 +++++++++++++++
 2 files changed, 23 insertions(+), 7 deletions(-)

Comments

Junio C Hamano Dec. 13, 2021, 11:34 p.m. UTC | #1
Joel Holdsworth <jholdsworth@nvidia.com> writes:

> RCS keywords are strings that will are replaced with information from
> Perforce. Examples include $Date$, $Author$, $File$, $Change$ etc.
>
> Perforce resolves these by expanding them with their expanded values
> when files are synced, but Git's data model requires these expanded
> values to be converted back into their unexpanded form.
>
> Previously, git-p4.py would implement this behaviour through the use of
> regular expressions. However, the regular expression substitution was
> applied using decoded strings i.e. the content of incoming commit diffs
> was first decoded from bytes into UTF-8, processed with regular
> expressions, then converted back to bytes.
>
> Not only is this behaviour inefficient, but it is also a cause of a
> common issue caused by text files containing invalid UTF-8 data. For
> files created in Windows, CP1252 Smart Quote Characters (0x93 and 0x94)
> are seen fairly frequently. These codes are invalid in UTF-8, so if the
> script encountered any file containing them, on Python 2 the symbols
> will be corrupted, and on Python 3 the script will fail with an
> exception.

Makes sense, and I am with others who commented on the previous
discussion thread that the right approach to take is to take the
stuff coming from Perforce as byte strings, process them as such and
write them out as byte strings, UNLESS we positively know what the
source and destination encodings are.

And this change we see here, matching with patterns, is perfectly in
line with that direction.  Very nice.

>          try:
> -            with os.fdopen(handle, "w+") as outFile, open(file, "r") as inFile:
> +            with os.fdopen(handle, "wb") as outFile, open(file, "rb") as inFile:

We seem to have lost "w+" and now it is "wb".  I do not see a reason
to make outFile anything but write-only, so the end result looks
good to me, but is it an unrelated "bug"fix that should be explained
as such (e.g. "there is no reason to make outFile read-write, so
instead of using 'w+' just use 'wb' while we make it unencoded
output by adding 'b' to it")?
Joel Holdsworth Dec. 14, 2021, 1:12 p.m. UTC | #2
> Makes sense, and I am with others who commented on the previous
> discussion thread that the right approach to take is to take the stuff coming
> from Perforce as byte strings, process them as such and write them out as
> byte strings, UNLESS we positively know what the source and destination
> encodings are.
> 
> And this change we see here, matching with patterns, is perfectly in line with
> that direction.  Very nice.

Not bad. Fortunately, it's not possible for $ characters to appear as a component of a multi-byte UTF-8 character, so it's possible to do the matching byte-wise.
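
The claim about $ can be checked by brute force. This is an
illustrative sketch only, not part of the patch; the function name is
made up for the example:

```python
DOLLAR = ord("$")  # 0x24

def dollar_in_multibyte():
    """Return True if 0x24 occurs inside any multi-byte UTF-8 sequence."""
    # Code points below 0x80 encode to a single byte, so start at 0x80.
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        if DOLLAR in chr(cp).encode("utf-8"):
            return True
    return False

found = dollar_in_multibyte()
print(found)
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, so a
0x24 byte in valid UTF-8 is always a literal $ character.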

> 
> >          try:
> > -            with os.fdopen(handle, "w+") as outFile, open(file, "r") as inFile:
> > +            with os.fdopen(handle, "wb") as outFile, open(file, "rb") as inFile:
> 
> We seem to have lost "w+" and now it is "wb".  I do not see a reason to make
> outFile anything but write-only, so the end result looks good to me, but is it
> an unrelated "bug"fix that should be explained as such (e.g. "there is no
> reason to make outFile read-write, so instead of using 'w+' just use 'wb'
> while we make it unencoded output by adding 'b' to it")?

I am happy to split this change into a separate patch if this is preferred.

Joel
Junio C Hamano Dec. 15, 2021, 9:41 p.m. UTC | #3
Joel Holdsworth <jholdsworth@nvidia.com> writes:

>> Makes sense, and I am with others who commented on the previous
>> discussion thread that the right approach to take is to take the stuff coming
>> from Perforce as byte strings, process them as such and write them out as
>> byte strings, UNLESS we positively know what the source and destination
>> encodings are.
>> 
>> And this change we see here, matching with patterns, is perfectly in line with
>> that direction.  Very nice.
>
> Not bad. Fortunately, it's not possible for $ characters to appear as a component of a multi-byte UTF-8 character, so it's possible to do the matching byte-wise.
>
>> 
>> >          try:
>> > -            with os.fdopen(handle, "w+") as outFile, open(file, "r") as inFile:
>> > +            with os.fdopen(handle, "wb") as outFile, open(file, "rb") as inFile:
>> 
>> We seem to have lost "w+" and now it is "wb".  I do not see a reason to make
>> outFile anything but write-only, so the end result looks good to me, but is it
>> an unrelated "bug"fix that should be explained as such (e.g. "there is no
>> reason to make outFile read-write, so instead of using 'w+' just use 'wb'
>> while we make it unencoded output by adding 'b' to it")?
>
> I am happy to split this change into a separate patch if this is preferred.

I do not think this is big enough for a separate patch; just a
mention in the log message is sufficient.

Patch

diff --git a/git-p4.py b/git-p4.py
index 509feac2d8..986595bef0 100755
--- a/git-p4.py
+++ b/git-p4.py
@@ -56,8 +56,8 @@ 
 
 p4_access_checked = False
 
-re_ko_keywords = re.compile(r'\$(Id|Header)(:[^$\n]+)?\$')
-re_k_keywords = re.compile(r'\$(Id|Header|Author|Date|DateTime|Change|File|Revision)(:[^$\n]+)?\$')
+re_ko_keywords = re.compile(br'\$(Id|Header)(:[^$\n]+)?\$')
+re_k_keywords = re.compile(br'\$(Id|Header|Author|Date|DateTime|Change|File|Revision)(:[^$\n]+)?\$')
 
 def p4_build_cmd(cmd):
     """Build a suitable p4 command line.
@@ -1754,9 +1754,9 @@  def patchRCSKeywords(self, file, regexp):
         # Attempt to zap the RCS keywords in a p4 controlled file matching the given regex
         (handle, outFileName) = tempfile.mkstemp(dir='.')
         try:
-            with os.fdopen(handle, "w+") as outFile, open(file, "r") as inFile:
+            with os.fdopen(handle, "wb") as outFile, open(file, "rb") as inFile:
                 for line in inFile.readlines():
-                    outFile.write(regexp.sub(r'$\1$', line))
+                    outFile.write(regexp.sub(br'$\1$', line))
             # Forcibly overwrite the original file
             os.unlink(file)
             shutil.move(outFileName, file)
@@ -2089,7 +2089,9 @@  def applyCommit(self, id):
                     regexp = p4_keywords_regexp_for_file(file)
                     if regexp:
                         # this file is a possibility...look for RCS keywords.
-                        for line in read_pipe_lines(["git", "diff", "%s^..%s" % (id, id), file]):
+                        for line in read_pipe_lines(
+                            ["git", "diff", "%s^..%s" % (id, id), file],
+                            raw=True):
                             if regexp.search(line):
                                 if verbose:
                                     print("got keyword match on %s in %s in %s" % (regex.pattern, line, file))
@@ -3020,8 +3022,7 @@  def streamOneP4File(self, file, contents):
         # even though in theory somebody may want that.
         regexp = p4_keywords_regexp_for_type(type_base, type_mods)
         if regexp:
-            contents = [encode_text_stream(regexp.sub(
-                r'$\1$', ''.join(decode_text_stream(c) for c in contents)))]
+            contents = [regexp.sub(br'$\1$', c) for c in contents]
 
         if self.largeFileSystem:
             (git_mode, contents) = self.largeFileSystem.processContent(git_mode, relPath, contents)
diff --git a/t/t9810-git-p4-rcs.sh b/t/t9810-git-p4-rcs.sh
index e3836888ec..5fe83315ec 100755
--- a/t/t9810-git-p4-rcs.sh
+++ b/t/t9810-git-p4-rcs.sh
@@ -4,6 +4,8 @@  test_description='git p4 rcs keywords'
 
 . ./lib-git-p4.sh
 
+CP1252="\223\224"
+
 test_expect_success 'start p4d' '
 	start_p4d
 '
@@ -32,6 +34,9 @@  test_expect_success 'init depot' '
 		p4 submit -d "filek" &&
 		p4 add -t text+ko fileko &&
 		p4 submit -d "fileko" &&
+		printf "$CP1252" >fileko_cp1252 &&
+		p4 add -t text+ko fileko_cp1252 &&
+		p4 submit -d "fileko_cp1252" &&
 		p4 add -t text file_text &&
 		p4 submit -d "file_text"
 	)
@@ -359,4 +364,14 @@  test_expect_failure 'Add keywords in git which do not match the default p4 value
 	)
 '
 
+test_expect_success 'check cp1252 smart quote are preserved through RCS keyword processing' '
+	test_when_finished cleanup_git &&
+	git p4 clone --dest="$git" //depot &&
+	(
+		cd "$git" &&
+		printf "$CP1252" >expect &&
+		test_cmp_bin expect fileko_cp1252
+	)
+'
+
 test_done