[3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
diff mbox series

Message ID 20191031092618.29073-4-congdanhqx@gmail.com
State New
Headers show
Series
  • Linux with musl libc improvement
Related show

Commit Message

Danh Doan Oct. 31, 2019, 9:26 a.m. UTC
On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900 is failing on Linux with musl libc.

Reencode to utf-8 before arranging rebase's todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---

Notes:
    The todo list shown to user has already been reencoded by sequencer_make_script,
    without this patch it looks like this:
    
    $ head -3 .git/rebase-merge/git-rebase-todo | xxd
    00000000: 7069 636b 2065 6633 3961 3033 201b 2442  pick ef39a03 .$B
    00000010: 244f 1b28 421b 2442 246c 1b28 421b 2442  $O.(B.$B$l.(B.$B
    00000020: 2452 1b28 421b 2442 245b 1b28 421b 2442  $R.(B.$B$[.(B.$B
    00000030: 2455 1b28 420a 7069 636b 2062 3832 3931  $U.(B.pick b8291
    00000040: 3336 2073 7175 6173 6821 201b 2442 244f  36 squash! .$B$O
    00000050: 1b28 421b 2442 246c 1b28 421b 2442 2452  .(B.$B$l.(B.$B$R
    00000060: 1b28 421b 2442 245b 1b28 421b 2442 2455  .(B.$B$[.(B.$B$U
    00000070: 1b28 420a 7069 636b 2062 3532 3132 6437  .(B.pick b5212d7
    00000080: 2069 6e74 6572 6d65 6469 6174 6520 636f   intermediate co
    00000090: 6d6d 6974 0a                             mmit.

 sequencer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Johannes Schindelin Oct. 31, 2019, 10:38 a.m. UTC | #1
Hi,


On Thu, 31 Oct 2019, Doan Tran Cong Danh wrote:

> On musl libc, ISO-2022-JP encoder is too eager to switch back to
> 1 byte encoding, musl's iconv always switch back after every combining
> character. Comparing glibc and musl's output for this command
> $ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
> 	iconv -f utf-8 -t ISO-2022-JP | xxd
>
> glibc:
> 00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
> 00000010: 0a                                       .
>
> musl:
> 00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
> 00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
> 00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.
>
> Although musl iconv's output isn't optimal, it's still correct.
>
> From commit 7d509878b8, ("pretty.c: format string with truncate respects
> logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
> first, then format it and convert the message to the actual output
> encoding on git commit --squash.
>
> Thus, t3900 is failing on Linux with musl libc.
>
> Reencode to utf-8 before arranging rebase's todo list.

Since the re-encoded commit messages are only used for figuring out the
relationships between the `fixup!`/`squash!` commits and their targets,
but are not used in any of the lines that are written out, this change
looks good to me.

In fact, all three patches look good to me.

Thanks for unbreaking Git with musl!
Johannes

>
> Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
> ---
>
> Notes:
>     The todo list shown to user has already been reencoded by sequencer_make_script,
>     without this patch it looks like this:
>
>     $ head -3 .git/rebase-merge/git-rebase-todo | xxd
>     00000000: 7069 636b 2065 6633 3961 3033 201b 2442  pick ef39a03 .$B
>     00000010: 244f 1b28 421b 2442 246c 1b28 421b 2442  $O.(B.$B$l.(B.$B
>     00000020: 2452 1b28 421b 2442 245b 1b28 421b 2442  $R.(B.$B$[.(B.$B
>     00000030: 2455 1b28 420a 7069 636b 2062 3832 3931  $U.(B.pick b8291
>     00000040: 3336 2073 7175 6173 6821 201b 2442 244f  36 squash! .$B$O
>     00000050: 1b28 421b 2442 246c 1b28 421b 2442 2452  .(B.$B$l.(B.$B$R
>     00000060: 1b28 421b 2442 245b 1b28 421b 2442 2455  .(B.$B$[.(B.$B$U
>     00000070: 1b28 420a 7069 636b 2062 3532 3132 6437  .(B.pick b5212d7
>     00000080: 2069 6e74 6572 6d65 6469 6174 6520 636f   intermediate co
>     00000090: 6d6d 6974 0a                             mmit.
>
>  sequencer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/sequencer.c b/sequencer.c
> index 9d5964fd81..69430fe23f 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
>  		*commit_todo_item_at(&commit_todo, item->commit) = item;
>
>  		parse_commit(item->commit);
> -		commit_buffer = get_commit_buffer(item->commit, NULL);
> +		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
>  		find_commit_subject(commit_buffer, &subject);
>  		format_subject(&buf, subject, " ");
>  		subject = subjects[i] = strbuf_detach(&buf, &subject_len);
> --
> 2.24.0.rc1.3.g89530838a3.dirty
>
>
Jeff King Oct. 31, 2019, 7:26 p.m. UTC | #2
On Thu, Oct 31, 2019 at 11:38:14AM +0100, Johannes Schindelin wrote:

> On Thu, 31 Oct 2019, Doan Tran Cong Danh wrote:
> 
> > On musl libc, ISO-2022-JP encoder is too eager to switch back to
> > 1 byte encoding, musl's iconv always switch back after every combining
> > character. Comparing glibc and musl's output for this command
> > $ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
> > 	iconv -f utf-8 -t ISO-2022-JP | xxd
> >
> > glibc:
> > 00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
> > 00000010: 0a                                       .
> >
> > musl:
> > 00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
> > 00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
> > 00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.
> >
> > Although musl iconv's output isn't optimal, it's still correct.
> >
> > From commit 7d509878b8, ("pretty.c: format string with truncate respects
> > logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
> > first, then format it and convert the message to the actual output
> > encoding on git commit --squash.
> >
> > Thus, t3900 is failing on Linux with musl libc.
> >
> > Reencode to utf-8 before arranging rebase's todo list.
> 
> Since the re-encoded commit messages are only used for figuring out the
> relationships between the `fixup!`/`squash!` commits and their targets,
> but are not used in any of the lines that are written out, this change
> looks good to me.

I'm confused about a few things here, though. I agree with you that the
subjects here are only used for finding the fixup/squash relationships.
But I don't understand the musl connection.

Wouldn't failure to reencode here always be a problem? E.g., if I do:

  for encoding in utf-8 iso-8859-1; do
    # commit using the encoding
    echo $encoding >file && git add file
    echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
      git -c i18n.commitEncoding=$encoding commit -F -
    # and then fixup without it
    echo "$encoding fixed" >file && git add file
    git commit --fixup HEAD
  done
  
  GIT_EDITOR='echo; grep -v ^#' git rebase -i --root --autosquash

then the resulting todo-list output (on my glibc system) is:

  pick 3a5bace éñcödèd with utf-8
  fixup aa9f09c fixup! éñcödèd with utf-8
  pick 6e85d32 éñcödèd with iso-8859-1
  pick 3ceac05 fixup! éñcödèd with iso-8859-1

I.e., we don't actually match up the second pair, and I think we
probably ought to.

I guess the test in t3900 is less exotic; it uses the same encoding for
both commits. And it's just that "foo" and "!fixup foo" can (and do in
musl) end up with different encodings (because of the specific language,
and the vagaries of each iconv implementation).

Would we have similar problems in all of the other functions which use
get_commit_buffer() without reencoding? For instance if I do this:

  echo base >file && git add file && git commit -m base
  for encoding in utf-8 iso-8859-1; do
    echo $encoding >file && git add file
    echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
      git -c i18n.commitEncoding=$encoding commit -F -
  done
  git checkout -b side HEAD~2
  git cherry-pick master master^
  cat .git/sequencer/todo

then the resulting todo file has a mix of iso-8859-1 and utf-8.

It seems to me that we should always be working with the subjects in a
single encoding internally, and likewise outputting in that format
(which should probably be git_log_output_encoding(), for the instances
where we show it to the user).

I.e., we should always call logmsg_reencode() instead of
get_commit_buffer().

-Peff
Danh Doan Nov. 1, 2019, 4:49 a.m. UTC | #3
On 2019-10-31 15:26:50 -0400, Jeff King wrote:
> I'm confused about a few things here, though. I agree with you that the
> subjects here are only used for finding the fixup/squash relationships.
> But I don't understand the musl connection.

You're right.

Because of musl's iconv implementation, the problem is being shown up
earlier.

> Wouldn't failure to reencode here always be a problem? E.g., if I do:
> 
>   for encoding in utf-8 iso-8859-1; do
>     # commit using the encoding
>     echo $encoding >file && git add file
>     echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
>       git -c i18n.commitEncoding=$encoding commit -F -
>     # and then fixup without it
>     echo "$encoding fixed" >file && git add file
>     git commit --fixup HEAD
>   done
>   
>   GIT_EDITOR='echo; grep -v ^#' git rebase -i --root --autosquash
> 
> then the resulting todo-list output (on my glibc system) is:
> 
>   pick 3a5bace éñcödèd with utf-8
>   fixup aa9f09c fixup! éñcödèd with utf-8
>   pick 6e85d32 éñcödèd with iso-8859-1
>   pick 3ceac05 fixup! éñcödèd with iso-8859-1
> 
> I.e., we don't actually match up the second pair, and I think we
> probably ought to.

Yes, we ought to match up the second pair, and after changing
get_commit_buffer to logmsg_reencode, we do.

> 
> I guess the test in t3900 is less exotic; it uses the same encoding for
> both commits. And it's just that "foo" and "!fixup foo" can (and do in
> musl) end up with different encodings (because of the specific language,
> and the vagaries of each iconv implementation).
> 
> Would we have similar problems in all of the other functions which use
> get_commit_buffer() without reencoding? For instance if I do this:
> 
>   echo base >file && git add file && git commit -m base
>   for encoding in utf-8 iso-8859-1; do
>     echo $encoding >file && git add file
>     echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
>       git -c i18n.commitEncoding=$encoding commit -F -
>   done
>   git checkout -b side HEAD~2
>   git cherry-pick master master^
>   cat .git/sequencer/todo
> 
> then the resulting todo file has a mix of iso-8859-1 and utf-8.
> 
> It seems to me that we should always be working with the subjects in a
> single encoding internally,

I'm in favour of this idea.

> and likewise outputting in that format
> (which should probably be git_log_output_encoding(), for the instances
> where we show it to the user).

This is git's current behaviour but it's get_log_output_encoding()
instead of git_log_output_encoding().

> I.e., we should always call logmsg_reencode() instead of
> get_commit_buffer().

Patch
diff mbox series

diff --git a/sequencer.c b/sequencer.c
index 9d5964fd81..69430fe23f 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5169,7 +5169,7 @@  int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);