Message ID | 20190513231726.16218-1-newren@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Fix and extend encoding handling in fast export/import | expand |
On Mon, May 13, 2019 at 04:17:22PM -0700, Elijah Newren wrote: > This test used an author with non-ascii characters in the name, but > no special commit message. It then grep'ed for those non-ascii > characters, but those are guaranteed to exist regardless of the > reencoding process since the reencoding only affects the commit message, > not the author or committer names. As such, the test would work even if > the re-encoding process simply stripped the commit message entirely. > Modify the test to actually check that the reencoding in utf-8 worked. (Not a native english speaker here): Should that be "...reencoding into utf-8 worked" ? > > Signed-off-by: Elijah Newren <newren@gmail.com> > --- > t/t9350-fast-export.sh | 27 ++++++++++++-------- > t/t9350/simple-iso-8859-7-commit-message.txt | 1 + > 2 files changed, 18 insertions(+), 10 deletions(-) > create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt > > diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh > index 5690fe2810..c721026260 100755 > --- a/t/t9350-fast-export.sh > +++ b/t/t9350-fast-export.sh > @@ -94,22 +94,30 @@ test_expect_success 'fast-export --show-original-ids | git fast-import' ' > test $MUSS = $(git rev-parse --verify refs/tags/muss) > ' > > -test_expect_success 'iso-8859-1' ' > +test_expect_success 'iso-8859-7' ' > > - git config i18n.commitencoding ISO8859-1 && > - # use author and committer name in ISO-8859-1 to match it. > - . "$TEST_DIRECTORY"/t3901/8859-1.txt && > + test_when_finished "git reset --hard HEAD~1" && > + test_config i18n.commitencoding iso-8859-7 && > test_tick && > echo rosten >file && > - git commit -s -m den file && > - git fast-export wer^..wer >iso8859-1.fi && > - sed "s/wer/i18n/" iso8859-1.fi | > + git commit -s -F "$TEST_DIRECTORY/t9350/simple-iso-8859-7-commit-message.txt" file && > + git fast-export wer^..wer >iso-8859-7.fi && > + sed "s/wer/i18n/" iso-8859-7.fi | > (cd new && > git fast-import && > + # The commit object, if not re-encoded, would be 240 bytes. > + # Removing the "encoding iso-8859-7\n" header drops 20 bytes. > + # Re-encoding the Pi character from \xF0 (\360) in iso-8859-7 > + # to \xCF\x80 (\317\200) in utf-8 adds a byte. Check for > + # the expected size. > + test 221 -eq "$(git cat-file -s i18n)" && > + # ...and for the expected translation of bytes. > git cat-file commit i18n >actual && > - grep "Áéí óú" actual) > - > + grep $(printf "\317\200") actual && > + # Also make sure the commit does not have the "encoding" header > + ! grep ^encoding actual) > ' > + > test_expect_success 'import/export-marks' ' > > git checkout -b marks master && > @@ -224,7 +232,6 @@ GIT_COMMITTER_NAME='C O Mitter'; export GIT_COMMITTER_NAME > > test_expect_success 'setup copies' ' > > - git config --unset i18n.commitencoding && > git checkout -b copy rein && > git mv file file3 && > git commit -m move1 && > diff --git a/t/t9350/simple-iso-8859-7-commit-message.txt b/t/t9350/simple-iso-8859-7-commit-message.txt > new file mode 100644 > index 0000000000..8b3f0c3dba > --- /dev/null > +++ b/t/t9350/simple-iso-8859-7-commit-message.txt > @@ -0,0 +1 @@ > +Pi: ? > \ No newline at end of file > -- > 2.21.0.782.gd8be4ee826 >
On Mon, May 13, 2019 at 04:17:24PM -0700, Elijah Newren wrote: > When fast-export encounters a commit with an 'encoding' header, it tries > to reencode in utf-8 and then drops the encoding header. However, if it > fails to reencode in utf-8 because e.g. one of the characters in the > commit message was invalid in the old encoding, then we need to retain > the original encoding or otherwise we lose information needed to > understand all the other (valid) characters in the original commit > message. Minor question: "utf-8" or "UTF-8" ? Mostly we use UTF-8 in Git. > > Signed-off-by: Elijah Newren <newren@gmail.com> > --- > builtin/fast-export.c | 7 +++++-- > t/t9350-fast-export.sh | 21 ++++++++++++++++++++ > t/t9350/broken-iso-8859-7-commit-message.txt | 1 + > 3 files changed, 27 insertions(+), 2 deletions(-) > create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt > > diff --git a/builtin/fast-export.c b/builtin/fast-export.c > index 9e283482ef..7734a9f5a5 100644 > --- a/builtin/fast-export.c > +++ b/builtin/fast-export.c > @@ -642,9 +642,12 @@ static void handle_commit(struct commit *commit, struct rev_info *rev, > printf("commit %s\nmark :%"PRIu32"\n", refname, last_idnum); > if (show_original_ids) > printf("original-oid %s\n", oid_to_hex(&commit->object.oid)); > - printf("%.*s\n%.*s\ndata %u\n%s", > + printf("%.*s\n%.*s\n", > (int)(author_end - author), author, > - (int)(committer_end - committer), committer, > + (int)(committer_end - committer), committer); > + if (!reencoded && encoding) > + printf("encoding %s\n", encoding); > + printf("data %u\n%s", > (unsigned)(reencoded > ? strlen(reencoded) : message > ? strlen(message) : 0), > diff --git a/t/t9350-fast-export.sh b/t/t9350-fast-export.sh > index c721026260..4fd637312a 100755 > --- a/t/t9350-fast-export.sh > +++ b/t/t9350-fast-export.sh > @@ -118,6 +118,27 @@ test_expect_success 'iso-8859-7' ' > ! grep ^encoding actual) > ' > > +test_expect_success 'encoding preserved if reencoding fails' ' > + > + test_when_finished "git reset --hard HEAD~1" && > + test_config i18n.commitencoding iso-8859-7 && > + echo rosten >file && > + git commit -s -F "$TEST_DIRECTORY/t9350/broken-iso-8859-7-commit-message.txt" file && > + git fast-export wer^..wer >iso-8859-7.fi && > + sed "s/wer/i18n-invalid/" iso-8859-7.fi | > + (cd new && > + git fast-import && > + git cat-file commit i18n-invalid >actual && > + # Make sure the commit still has the encoding header > + grep ^encoding actual && > + # Verify that the commit has the expected size; i.e. > + # that no bytes were re-encoded to a different encoding. > + test 252 -eq "$(git cat-file -s i18n-invalid)" && > + # ...and check for the original special bytes > + grep $(printf "\360") actual && > + grep $(printf "\377") actual) > +' > + > test_expect_success 'import/export-marks' ' > > git checkout -b marks master && > diff --git a/t/t9350/broken-iso-8859-7-commit-message.txt b/t/t9350/broken-iso-8859-7-commit-message.txt > new file mode 100644 > index 0000000000..d06ad75b44 > --- /dev/null > +++ b/t/t9350/broken-iso-8859-7-commit-message.txt > @@ -0,0 +1 @@ > +Pi: ?; Invalid: ? > \ No newline at end of file > -- > 2.21.0.782.gd8be4ee826 >
While stress testing `git filter-repo`, I noticed an issue with encoding; further digging led to the fixes and features in this series. See the individual commit messages for details. Changes since v4 (full range-diff below): * Used git_parse_maybe_bool() * Updated Documentation/git-fast-export.txt to document the new option Elijah Newren (5): t9350: fix encoding test to actually test reencoding fast-import: support 'encoding' commit header fast-export: avoid stripping encoding header if we cannot reencode fast-export: differentiate between explicitly utf-8 and implicitly utf-8 fast-export: do automatic reencoding of commit messages only if requested Documentation/git-fast-export.txt | 7 ++ Documentation/git-fast-import.txt | 7 ++ builtin/fast-export.c | 55 ++++++++++++-- fast-import.c | 11 ++- t/t9300-fast-import.sh | 20 +++++ t/t9350-fast-export.sh | 78 +++++++++++++++++--- t/t9350/broken-iso-8859-7-commit-message.txt | 1 + t/t9350/simple-iso-8859-7-commit-message.txt | 1 + 8 files changed, 163 insertions(+), 17 deletions(-) create mode 100644 t/t9350/broken-iso-8859-7-commit-message.txt create mode 100644 t/t9350/simple-iso-8859-7-commit-message.txt Range-diff: 1: 37a68a0ffd = 1: 37a68a0ffd t9350: fix encoding test to actually test reencoding 2: 3d84f4613d = 2: 3d84f4613d fast-import: support 'encoding' commit header 3: baa8394a3a = 3: baa8394a3a fast-export: avoid stripping encoding header if we cannot reencode 4: 49960164c6 = 4: 49960164c6 fast-export: differentiate between explicitly utf-8 and implicitly utf-8 5: 571613a09e ! 5: d8be4ee826 fast-export: do automatic reencoding of commit messages only if requested @@ -13,6 +13,24 @@ Signed-off-by: Elijah Newren <newren@gmail.com> + diff --git a/Documentation/git-fast-export.txt b/Documentation/git-fast-export.txt + --- a/Documentation/git-fast-export.txt + +++ b/Documentation/git-fast-export.txt +@@ + for intermediary filters (e.g. for rewriting commit messages + which refer to older commits, or for stripping blobs by id). + ++--reencode=(yes|no|abort):: ++ Specify how to handle `encoding` header in commit objects. When ++ asking to 'abort' (which is the default), this program will die ++ when encountering such a commit object. With 'yes', the commit ++ message will be reencoded into UTF-8. With 'no', the original ++ encoding will be preserved. ++ + --refspec:: + Apply the specified refspec to each ref exported. Multiple of them can + be specified. + diff --git a/builtin/fast-export.c b/builtin/fast-export.c --- a/builtin/fast-export.c +++ b/builtin/fast-export.c @@ -31,14 +49,25 @@ +static int parse_opt_reencode_mode(const struct option *opt, + const char *arg, int unset) +{ -+ if (unset || !strcmp(arg, "abort")) ++ if (unset) { + reencode_mode = REENCODE_ABORT; -+ else if (!strcmp(arg, "yes") || !strcmp(arg, "true") || !strcmp(arg, "on")) -+ reencode_mode = REENCODE_YES; -+ else if (!strcmp(arg, "no") || !strcmp(arg, "false") || !strcmp(arg, "off")) ++ return 0; ++ } ++ ++ switch (git_parse_maybe_bool(arg)) { ++ case 0: + reencode_mode = REENCODE_NO; -+ else -+ return error("Unknown reencoding mode: %s", arg); ++ break; ++ case 1: ++ reencode_mode = REENCODE_YES; ++ break; ++ default: ++ if (arg && !strcasecmp(arg, "abort")) ++ reencode_mode = REENCODE_ABORT; ++ else ++ return error("Unknown reencoding mode: %s", arg); ++ } ++ + return 0; +} +