[v2,3/3] sequencer: reencode to utf-8 before arrange rebase's todo list
diff mbox series

Message ID b7927b27235422ac53595cfaa63b4f1cbe009013.1572596278.git.congdanhqx@gmail.com
State New
Headers show
Series
  • Linux with musl libc improvement
Related show

Commit Message

Danh Doan Nov. 1, 2019, 8:25 a.m. UTC
On musl libc, ISO-2022-JP encoder is too eager to switch back to
1 byte encoding, musl's iconv always switch back after every combining
character. Comparing glibc and musl's output for this command
$ sed q t/t3900/ISO-2022-JP.txt| iconv -f ISO-2022-JP -t utf-8 |
	iconv -f utf-8 -t ISO-2022-JP | xxd

glibc:
00000000: 1b24 4224 4f24 6c24 5224 5b24 551b 2842  .$B$O$l$R$[$U.(B
00000010: 0a                                       .

musl:
00000000: 1b24 4224 4f1b 2842 1b24 4224 6c1b 2842  .$B$O.(B.$B$l.(B
00000010: 1b24 4224 521b 2842 1b24 4224 5b1b 2842  .$B$R.(B.$B$[.(B
00000020: 1b24 4224 551b 2842 0a                   .$B$U.(B.

Although musl iconv's output isn't optimal, it's still correct.

From commit 7d509878b8, ("pretty.c: format string with truncate respects
logOutputEncoding", 2014-05-21), we're encoding the message to utf-8
first, then format it and convert the message to the actual output
encoding on git commit --squash.

Thus, t3900 is failing on Linux with musl libc.

This problem wasn't specific to musl libc. On Linux with glibc, this
problem can be observed by:

for encoding in utf-8 iso-8859-1; do
	# commit using the encoding
	echo $encoding >file && git add file
	echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
	  git -c i18n.commitEncoding=$encoding commit -F -
	# and then fixup without it
	echo "$encoding fixed" >file && git add file
	git commit --fixup HEAD
done
git rebase -i --autosquash --root

Reencode to utf-8 before arranging rebase's todo list.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
The code that demonstrate the problem on Linux with glibc is written by Jeff.
But I don't know how to attribute him properly.

 sequencer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jeff King Nov. 1, 2019, 4:59 p.m. UTC | #1
On Fri, Nov 01, 2019 at 03:25:11PM +0700, Doan Tran Cong Danh wrote:

> This problem wasn't specific to musl libc. On Linux with glibc, this
> problem can be observed by:
> 
> for encoding in utf-8 iso-8859-1; do
> 	# commit using the encoding
> 	echo $encoding >file && git add file
> 	echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
> 	  git -c i18n.commitEncoding=$encoding commit -F -
> 	# and then fixup without it
> 	echo "$encoding fixed" >file && git add file
> 	git commit --fixup HEAD
> done
> git rebase -i --autosquash --root

Is it worth adding this as a test in t3900?

> ---
> The code that demonstrate the problem on Linux with glibc is written by Jeff.
> But I don't know how to attribute him properly.

I'm OK with or without attribution, but people sometimes add a
"Helped-by" trailer if they want to acknowledge someone.

> diff --git a/sequencer.c b/sequencer.c
> index 9d5964fd81..69430fe23f 100644
> --- a/sequencer.c
> +++ b/sequencer.c
> @@ -5169,7 +5169,7 @@ int todo_list_rearrange_squash(struct todo_list *todo_list)
>  		*commit_todo_item_at(&commit_todo, item->commit) = item;
>  
>  		parse_commit(item->commit);
> -		commit_buffer = get_commit_buffer(item->commit, NULL);
> +		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");

I think there are several other spots in this file that could use the
same treatment. But I can live with it if you want to just fix the one
that's bugging you and move on. It's still a strict improvement.

-Peff
Danh Doan Nov. 2, 2019, 1:02 a.m. UTC | #2
On 2019-11-01 12:59:21 -0400, Jeff King wrote:
> On Fri, Nov 01, 2019 at 03:25:11PM +0700, Doan Tran Cong Danh wrote:
> 
> > for encoding in utf-8 iso-8859-1; do
> > 	# commit using the encoding
> > 	echo $encoding >file && git add file
> > 	echo "éñcödèd with $encoding" | iconv -f utf-8 -t $encoding |
> > 	  git -c i18n.commitEncoding=$encoding commit -F -
> > 	# and then fixup without it
> > 	echo "$encoding fixed" >file && git add file
> > 	git commit --fixup HEAD
> > done
> > git rebase -i --autosquash --root
> 
> Is it worth adding this as a test in t3900?

I think yes, but with a little more work.
I'll make it as a separated patch in a re-roll.

> >  		parse_commit(item->commit);
> > -		commit_buffer = get_commit_buffer(item->commit, NULL);
> > +		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
> 
> I think there are several other spots in this file that could use the
> same treatment. But I can live with it if you want to just fix the one
> that's bugging you and move on. It's still a strict improvement.

There're 6 more occurence of get_commit_buffer in sequencer.c, and 13
occurences in other C source files. I'll try to figure out if it's
safe to change.

Anyway, if we're going to working with a single encoding internally,
can we take other extreme approach: reencode the commit message to
utf-8 before writing the commit object? (Is there any codepoint in
other encoding that can't be reencoded to utf-8?)
Since git-log and friends are doing 2 steps conversion for commit
message for now (reencode to utf-8 first, then reencode again to
get_log_output_encoding()). With this new approach, first step is
likely a noop (but must be kept for backward compatible).
Danh Doan Nov. 2, 2019, 12:20 p.m. UTC | #3
On 2019-11-02 08:02:15 +0700, Danh Doan wrote:
> Anyway, if we're going to working with a single encoding internally,
> can we take other extreme approach: reencode the commit message to
> utf-8 before writing the commit object? (Is there any codepoint in
> other encoding that can't be reencoded to utf-8?)

With this test (added into t3900):

    git checkout -b fixup-ISO-2022-JP-ISO-8859-1 C0 &&
    git config i18n.commitencoding ISO-2022-JP &&
    echo ISO-2022-JP >>F &&
    git commit -a -F "$TEST_DIRECTORY"/t3900/ISO-2022-JP.txt &&
    test_tick &&
    echo intermediate stuff >>G &&
    git add G &&
    git commit -a -m "intermediate commit" &&
    test_tick &&
    git config i18n.commitencoding ISO-8859-1 &&
    echo ISO-8859-1-fixup >>F &&
    git commit -a --fixup HEAD^ &&
    git config --unset-all i18n.commitencoding &&
    git rebase --autosquash -i HEAD^^^ &&
    git rev-list HEAD >actual &&
    test_line_count = 3 actual

reencode the commit message to utf-8 before writing the commit object
is (likely) the most simple option to fix it.

At the very least, the commit message for fixup/squash-ing commit must
be encoded in either utf-8 or the target-commit's encode.
Jeff King Nov. 5, 2019, 8 a.m. UTC | #4
On Sat, Nov 02, 2019 at 08:02:15AM +0700, Danh Doan wrote:

> Anyway, if we're going to working with a single encoding internally,
> can we take other extreme approach: reencode the commit message to
> utf-8 before writing the commit object? (Is there any codepoint in
> other encoding that can't be reencoded to utf-8?)

That's normally what we do. The only cases we're covering here are when
somebody has explicitly asked that the commit object be stored in
another encoding. Presumably they'd also be using a matching
i18n.logOutputEncoding in that case, in which case logmsg_reencode()
would be a noop. I think the only reasons to do that are:

  1. You're stuck on some legacy encoding for your terminal. But in that
     case, I think you'd still be better off storing utf-8 and
     translating on the fly, since whatever encoding you do store is
     baked into your objects for all time (so accept some slowness now,
     but eventually move to utf-8).

  2. Your preferred language is bigger in utf-8 than in some specific
     encoding, and you'd rather save some bytes. I'm not sure how big a
     deal this is, given that commit messages don't tend to be that big
     in the first place (compared to trees and blobs). And the zlib
     deflation on the result might help remove some of the redundancy,
     too.

So I'd actually expect very few people to be using this feature at all
these days (which is part of why I would not be all that broken up if we
just fix the test and move on, if nobody is reporting real-world
problems).

> Since git-log and friends are doing 2 steps conversion for commit
> message for now (reencode to utf-8 first, then reencode again to
> get_log_output_encoding()). With this new approach, first step is
> likely a noop (but must be kept for backward compatible).

Interesting. Traditionally we did a single step conversion to the output
format, and it looks like most output formats still do that (i.e.,
everything in pretty_print_commit() except FMT_USERFORMAT, which is what
powers "--pretty=format:%s", etc).

The two-part user-format thing goes back to 7e77df39bf (pretty: two
phase conversion for non utf-8 commits, 2013-04-19). It does seem like
it would be cheaper to convert the format string into the output
encoding (it would need to be an ascii superset, but that's already the
case, since we expect to parse "author", etc out of the re-encoded
commit object). But again, I have trouble caring too much about the
performance of this case, as I consider it to be mostly legacy at this
point. But I also don't write in (say) Japanese, so maybe I'm being too
narrow-minded about whether people really want to avoid utf-8.

-Peff
Junio C Hamano Nov. 6, 2019, 1:30 a.m. UTC | #5
Jeff King <peff@peff.net> writes:

> That's normally what we do. The only cases we're covering here are when
> somebody has explicitly asked that the commit object be stored in
> another encoding. Presumably they'd also be using a matching
> i18n.logOutputEncoding in that case, in which case logmsg_reencode()
> would be a noop. I think the only reasons to do that are:
>
>   1. You're stuck on some legacy encoding for your terminal. But in that
>      case, I think you'd still be better off storing utf-8 and
>      translating on the fly, since whatever encoding you do store is
>      baked into your objects for all time (so accept some slowness now,
>      but eventually move to utf-8).
>
>   2. Your preferred language is bigger in utf-8 than in some specific
>      encoding, and you'd rather save some bytes. I'm not sure how big a
>      deal this is, given that commit messages don't tend to be that big
>      in the first place (compared to trees and blobs). And the zlib
>      deflation on the result might help remove some of the redundancy,
>      too.

Perhaps add

    3. You are dealing with a project originated on and migrated
       from a foreign SCM, and older parts of the history is stored
       in a non-utf-8, even though recent history is in utf-8

to the mix?

> The two-part user-format thing goes back to 7e77df39bf (pretty: two
> phase conversion for non utf-8 commits, 2013-04-19). It does seem like
> it would be cheaper to convert the format string into the output
> encoding (it would need to be an ascii superset, but that's already the
> case, since we expect to parse "author", etc out of the re-encoded
> commit object). But again, I have trouble caring too much about the
> performance of this case, as I consider it to be mostly legacy at this
> point. But I also don't write in (say) Japanese, so maybe I'm being too
> narrow-minded about whether people really want to avoid utf-8.

I suspect even the heavy Windows/Mac users in Japan have migrated
out of legacy (the suspicion comes from an anecdote that is offtopic
here).
Jeff King Nov. 6, 2019, 4:03 a.m. UTC | #6
On Wed, Nov 06, 2019 at 10:30:34AM +0900, Junio C Hamano wrote:

> > That's normally what we do. The only cases we're covering here are when
> > somebody has explicitly asked that the commit object be stored in
> > another encoding. Presumably they'd also be using a matching
> > i18n.logOutputEncoding in that case, in which case logmsg_reencode()
> > would be a noop. I think the only reasons to do that are:
> >
> >   1. You're stuck on some legacy encoding for your terminal. But in that
> >      case, I think you'd still be better off storing utf-8 and
> >      translating on the fly, since whatever encoding you do store is
> >      baked into your objects for all time (so accept some slowness now,
> >      but eventually move to utf-8).
> >
> >   2. Your preferred language is bigger in utf-8 than in some specific
> >      encoding, and you'd rather save some bytes. I'm not sure how big a
> >      deal this is, given that commit messages don't tend to be that big
> >      in the first place (compared to trees and blobs). And the zlib
> >      deflation on the result might help remove some of the redundancy,
> >      too.
> 
> Perhaps add
> 
>     3. You are dealing with a project originated on and migrated
>        from a foreign SCM, and older parts of the history is stored
>        in a non-utf-8, even though recent history is in utf-8
> 
> to the mix?

I would think you'd want to convert to utf-8 as you do the migration in
that case, since you're writing new hashes anyway. But I think a similar
case would just be an old Git repository, where for some reason you
thought i18n.commitEncoding was a good idea back then (perhaps because
you were in situation (1) then, but now you aren't).

In either case, though, I don't think it's a compelling motivation for
optimization, if only because those old commits will be shown less and
less (and even without modern optimizations like commit-graph, we'd
generally avoid reencoding those old commits unless we're actually going
to _show_ them).

> I suspect even the heavy Windows/Mac users in Japan have migrated
> out of legacy (the suspicion comes from an anecdote that is offtopic
> here).

Thanks for the data point. All of this is very far from my personal
experience, so I mostly go on scraps of hearsay I pick up reading this
or that. :)

-Peff
Danh Doan Nov. 6, 2019, 10:03 a.m. UTC | #7
On 2019-11-05 23:03:14 -0500, Jeff King wrote:
> >     3. You are dealing with a project originated on and migrated
> >        from a foreign SCM, and older parts of the history is stored
> >        in a non-utf-8, even though recent history is in utf-8
> > 
> > to the mix?
> 
> I would think you'd want to convert to utf-8 as you do the migration in
> that case, since you're writing new hashes anyway.

Sorry but I'm confused.
If we're migrating from foreign SCM, we could make our commit in
utf-8 (convert their commit message to utf8).
Even if we need to synchronise history between the foreign SCM in
question with git, we could use i18n.logoutputencoding for the output
comestic.

> But I think a similar
> case would just be an old Git repository, where for some reason you
> thought i18n.commitEncoding was a good idea back then (perhaps because
> you were in situation (1) then, but now you aren't).
> 
> In either case, though, I don't think it's a compelling motivation for
> optimization, if only because those old commits will be shown less and
> less (and even without modern optimizations like commit-graph, we'd
> generally avoid reencoding those old commits unless we're actually going
> to _show_ them).

I'm not sure if we're misunderstood each other.
I've only suggested to encode _new_ commit from now on in utf-8.
Reencoding old history in utf-8 is definitely not in that suggestion.
Jeff King Nov. 7, 2019, 5:56 a.m. UTC | #8
On Wed, Nov 06, 2019 at 05:03:22PM +0700, Danh Doan wrote:

> On 2019-11-05 23:03:14 -0500, Jeff King wrote:
> > >     3. You are dealing with a project originated on and migrated
> > >        from a foreign SCM, and older parts of the history is stored
> > >        in a non-utf-8, even though recent history is in utf-8
> > > 
> > > to the mix?
> > 
> > I would think you'd want to convert to utf-8 as you do the migration in
> > that case, since you're writing new hashes anyway.
> 
> Sorry but I'm confused.
> If we're migrating from foreign SCM, we could make our commit in
> utf-8 (convert their commit message to utf8).
> Even if we need to synchronise history between the foreign SCM in
> question with git, we could use i18n.logoutputencoding for the output
> comestic.

Right, that's the same thing I'm suggesting.

> > But I think a similar
> > case would just be an old Git repository, where for some reason you
> > thought i18n.commitEncoding was a good idea back then (perhaps because
> > you were in situation (1) then, but now you aren't).
> > 
> > In either case, though, I don't think it's a compelling motivation for
> > optimization, if only because those old commits will be shown less and
> > less (and even without modern optimizations like commit-graph, we'd
> > generally avoid reencoding those old commits unless we're actually going
> > to _show_ them).
> 
> I'm not sure if we're misunderstood each other.
> I've only suggested to encode _new_ commit from now on in utf-8.
> Reencoding old history in utf-8 is definitely not in that suggestion.

Yes. My point was that's _already_ the default behavior, unless you
explicitly set some config asking for non-utf8 commit objects. And I
don't think there's any good reason to set that these days.

-Peff

Patch
diff mbox series

diff --git a/sequencer.c b/sequencer.c
index 9d5964fd81..69430fe23f 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -5169,7 +5169,7 @@  int todo_list_rearrange_squash(struct todo_list *todo_list)
 		*commit_todo_item_at(&commit_todo, item->commit) = item;
 
 		parse_commit(item->commit);
-		commit_buffer = get_commit_buffer(item->commit, NULL);
+		commit_buffer = logmsg_reencode(item->commit, NULL, "UTF-8");
 		find_commit_subject(commit_buffer, &subject);
 		format_subject(&buf, subject, " ");
 		subject = subjects[i] = strbuf_detach(&buf, &subject_len);