mbox series

[v5,0/5] Fix and extend encoding handling in fast export/import

Message ID 20190513231726.16218-1-newren@gmail.com (mailing list archive)
Headers show
Series Fix and extend encoding handling in fast export/import | expand

Message

Elijah Newren May 13, 2019, 11:17 p.m. UTC
While stress testing `git filter-repo`, I noticed an issue with
encoding; further digging led to the fixes and features in this series.
See the individual commit messages for details.

Changes since v4 (full range-diff below):
  * Used git_parse_maybe_bool()
  * Updated Documentation/git-fast-export.txt to document the new option

Elijah Newren (5):
  t9350: fix encoding test to actually test reencoding
  fast-import: support 'encoding' commit header
  fast-export: avoid stripping encoding header if we cannot reencode
  fast-export: differentiate between explicitly utf-8 and implicitly
    utf-8
  fast-export: do automatic reencoding of commit messages only if
    requested

 Documentation/git-fast-export.txt            |  7 ++
 Documentation/git-fast-import.txt            |  7 ++
 builtin/fast-export.c                        | 55 ++++++++++++--
 fast-import.c                                | 11 ++-
 t/t9300-fast-import.sh                       | 20 +++++
 t/t9350-fast-export.sh                       | 78 +++++++++++++++++---
 t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
 t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
 8 files changed, 163 insertions(+), 17 deletions(-)
 create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
 create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt

Range-diff:
1:  37a68a0ffd = 1:  37a68a0ffd t9350: fix encoding test to actually test reencoding
2:  3d84f4613d = 2:  3d84f4613d fast-import: support 'encoding' commit header
3:  baa8394a3a = 3:  baa8394a3a fast-export: avoid stripping encoding header if we cannot reencode
4:  49960164c6 = 4:  49960164c6 fast-export: differentiate between explicitly utf-8 and implicitly utf-8
5:  571613a09e ! 5:  d8be4ee826 fast-export: do automatic reencoding of commit messages only if requested
    @@ -13,6 +13,24 @@
     
         Signed-off-by: Elijah Newren <newren@gmail.com>
     
    + diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt
    + --- a/Documentation/git-fast-export.txt
    + +++ b/Documentation/git-fast-export.txt
    +@@
    + 	for intermediary filters (e.g. for rewriting commit messages
    + 	which refer to older commits, or for stripping blobs by id).
    + 
    ++--reencode=(yes|no|abort)::
    ++	Specify how to handle `encoding` header in commit objects.  When
    ++	asking to 'abort' (which is the default), this program will die
    ++	when encountering such a commit object.  With 'yes', the commit
    ++	message will be reencoded into UTF-8.  With 'no', the original
    ++	encoding will be preserved.
    ++
    + --refspec::
    + 	Apply the specified refspec to each ref exported. Multiple of them can
    + 	be specified.
    +
      diff --git a/builtin/fast-export.c b/builtin/fast-export.c
      --- a/builtin/fast-export.c
      +++ b/builtin/fast-export.c
    @@ -31,14 +49,25 @@
     +static int parse_opt_reencode_mode(const struct option *opt,
     +				   const char *arg, int unset)
     +{
    -+	if (unset || !strcmp(arg, "abort"))
    ++	if (unset) {
     +		reencode_mode = REENCODE_ABORT;
    -+	else if (!strcmp(arg, "yes") || !strcmp(arg, "true") || !strcmp(arg, "on"))
    -+		reencode_mode = REENCODE_YES;
    -+	else if (!strcmp(arg, "no") || !strcmp(arg, "false") || !strcmp(arg, "off"))
    ++		return 0;
    ++	}
    ++
    ++	switch (git_parse_maybe_bool(arg)) {
    ++	case 0:
     +		reencode_mode = REENCODE_NO;
    -+	else
    -+		return error("Unknown reencoding mode: %s", arg);
    ++		break;
    ++	case 1:
    ++		reencode_mode = REENCODE_YES;
    ++		break;
    ++	default:
    ++		if (arg && !strcasecmp(arg, "abort"))
    ++			reencode_mode = REENCODE_ABORT;
    ++		else
    ++			return error("Unknown reencoding mode: %s", arg);
    ++	}
    ++
     +	return 0;
     +}
     +

Comments

Torsten Bögershausen May 14, 2019, 2:50 a.m. UTC | #1
On Mon, May 13, 2019 at 04:17:22PM -0700, Elijah Newren wrote:
> This test used an author with non-ascii characters in the name, but
> no special commit message.  It then grep'ed for those non-ascii
> characters, but those are guaranteed to exist regardless of the
> reencoding process since the reencoding only affects the commit message,
> not the author or committer names.  As such, the test would work even if
> the re-encoding process simply stripped the commit message entirely.
> Modify the test to actually check that the reencoding in utf-8 worked.

(Not a native english speaker here): Should that be
"...reencoding into utf-8 worked" ?

>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  t/t9350-fast-export.sh                       | 27 ++++++++++++--------
>  t/t9350/simple-iso-8859-7-commit-message.txt |  1 +
>  2 files changed, 18 insertions(+), 10 deletions(-)
>  create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt
>
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index 5690fe2810..c721026260 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -94,22 +94,30 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' '
>  	test $MUSS = $(git rev-parse --verify refs/tags/muss)
>  '
>
> -test_expect_success 'iso-8859-1' '
> +test_expect_success 'iso-8859-7' '
>
> -	git config i18n.commitencoding ISO8859-1 &&
> -	# use author and committer name in ISO-8859-1 to match it.
> -	. "$TEST_DIRECTORY"/t3901/8859-1.txt &&
> +	test_when_finished "git reset --hard HEAD~1" &&
> +	test_config i18n.commitencoding iso-8859-7 &&
>  	test_tick &&
>  	echo rosten >file &&
> -	git commit -s -m den file &&
> -	git fast-export wer^..wer >iso8859-1.fi &&
> -	sed "s/wer/i18n/" iso8859-1.fi |
> +	git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file &&
> +	git fast-export wer^..wer >iso-8859-7.fi &&
> +	sed "s/wer/i18n/" iso-8859-7.fi |
>  		(cd new &&
>  		 git fast-import &&
> +		 # The commit object, if not re-encoded, would be 240 bytes.
> +		 # Removing the "encoding iso-8859-7\n" header drops 20 bytes.
> +		 # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7
> +		 # to \xCF\x80 (\317\200) in utf-8 adds a byte.  Check for
> +		 # the expected size.
> +		 test 221 -eq "$(git cat-file -s i18n)" &&
> +		 # ...and for the expected translation of bytes.
>  		 git cat-file commit i18n >actual &&
> -		 grep "Áéí óú" actual)
> -
> +		 grep $(printf "\317\200") actual &&
> +		 # Also make sure the commit does not have the "encoding" header
> +		 ! grep ^encoding actual)
>  '
> +
>  test_expect_success 'import/export-marks' '
>
>  	git checkout -b marks master &&
> @@ -224,7 +232,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME
>
>  test_expect_success 'setup copies' '
>
> -	git config --unset i18n.commitencoding &&
>  	git checkout -b copy rein &&
>  	git mv file file3 &&
>  	git commit -m move1 &&
> diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt
> new file mode 100644
> index 0000000000..8b3f0c3dba
> --- /dev/null
> +++ b/t/t9350/simple-iso-8859-7-commit-message.txt
> @@ -0,0 +1 @@
> +Pi: ?
> \ No newline at end of file
> --
> 2.21.0.782.gd8be4ee826
>
Torsten Bögershausen May 14, 2019, 2:56 a.m. UTC | #2
On Mon, May 13, 2019 at 04:17:24PM -0700, Elijah Newren wrote:
> When fast-export encounters a commit with an 'encoding' header, it tries
> to reencode in utf-8 and then drops the encoding header.  However, if it
> fails to reencode in utf-8 because e.g. one of the characters in the
> commit message was invalid in the old encoding, then we need to retain
> the original encoding or otherwise we lose information needed to
> understand all the other (valid) characters in the original commit
> message.

Minor question: "utf-8" or "UTF-8" ?
Mostly we use UTF-8 in Git.

>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  builtin/fast-export.c                        |  7 +++++--
>  t/t9350-fast-export.sh                       | 21 ++++++++++++++++++++
>  t/t9350/broken-iso-8859-7-commit-message.txt |  1 +
>  3 files changed, 27 insertions(+), 2 deletions(-)
>  create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt
>
> diff --git a/builtin/fast-export.c b/builtin/fast-export.c
> index 9e283482ef..7734a9f5a5 100644
> --- a/builtin/fast-export.c
> +++ b/builtin/fast-export.c
> @@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev,
>  	printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum);
>  	if (show_original_ids)
>  		printf("original-oid %s\n", oid_to_hex(&commit->object.oid));
> -	printf("%.*s\n%.*s\ndata %u\n%s",
> +	printf("%.*s\n%.*s\n",
>  	       (int)(author_end - author), author,
> -	       (int)(committer_end - committer), committer,
> +	       (int)(committer_end - committer), committer);
> +	if (!reencoded && encoding)
> +		printf("encoding %s\n", encoding);
> +	printf("data %u\n%s",
>  	       (unsigned)(reencoded
>  			  ? strlen(reencoded) : message
>  			  ? strlen(message) : 0),
> diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh
> index c721026260..4fd637312a 100755
> --- a/t/t9350-fast-export.sh
> +++ b/t/t9350-fast-export.sh
> @@ -118,6 +118,27 @@ test_expect_success 'iso-8859-7' '
>  		 ! grep ^encoding actual)
>  '
>
> +test_expect_success 'encoding preserved if reencoding fails' '
> +
> +	test_when_finished "git reset --hard HEAD~1" &&
> +	test_config i18n.commitencoding iso-8859-7 &&
> +	echo rosten >file &&
> +	git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file &&
> +	git fast-export wer^..wer >iso-8859-7.fi &&
> +	sed "s/wer/i18n-invalid/" iso-8859-7.fi |
> +		(cd new &&
> +		 git fast-import &&
> +		 git cat-file commit i18n-invalid >actual &&
> +		 # Make sure the commit still has the encoding header
> +		 grep ^encoding actual &&
> +		 # Verify that the commit has the expected size; i.e.
> +		 # that no bytes were re-encoded to a different encoding.
> +		 test 252 -eq "$(git cat-file -s i18n-invalid)" &&
> +		 # ...and check for the original special bytes
> +		 grep $(printf "\360") actual &&
> +		 grep $(printf "\377") actual)
> +'
> +
>  test_expect_success 'import/export-marks' '
>
>  	git checkout -b marks master &&
> diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt
> new file mode 100644
> index 0000000000..d06ad75b44
> --- /dev/null
> +++ b/t/t9350/broken-iso-8859-7-commit-message.txt
> @@ -0,0 +1 @@
> +Pi: ?; Invalid: ?
> \ No newline at end of file
> --
> 2.21.0.782.gd8be4ee826
>