diff mbox series

[v4,8/8] sequencer: reencode commit message for am/rebase --show-current-patch

Message ID 36796e2b679cd8b2d341058e775db401f9abcef7.1573094789.git.congdanhqx@gmail.com (mailing list archive)
State New, archived
Headers show
Series Correct internal working and output encoding | expand

Commit Message

Đoàn Trần Công Danh Nov. 7, 2019, 2:56 a.m. UTC
The message file will be used as commit message for the
git-{am,rebase} --continue.

Signed-off-by: Doan Tran Cong Danh <congdanhqx@gmail.com>
---
 sequencer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Jeff King Nov. 7, 2019, 6:32 a.m. UTC | #1
On Thu, Nov 07, 2019 at 09:56:19AM +0700, Doan Tran Cong Danh wrote:

> The message file will be used as commit message for the
> git-{am,rebase} --continue.
>
> [...]
>  	strbuf_addf(&buf, "%s/message", get_dir(opts));
>  	if (!file_exists(buf.buf)) {
> -		const char *commit_buffer = get_commit_buffer(commit, NULL);
> +		const char *encoding = get_commit_output_encoding();
> +		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);

That makes sense, though it's hard to understand the flow of this data
through multiple sequencer invocations. I _think_ this would be fixing a
case like this:

-- >8 --
git init repo
cd repo

# some commits to build off of
echo base >file
git add file
git commit -m base

echo side >file
git add file
git commit -m side

# now make a commit in iso8859-1
git checkout -b side HEAD^
echo iso8859-1 >file
git add file
iconv -f utf8 -t iso8859-1 <<-\EOF |
súbject

bödy
EOF
git -c i18n.commitEncoding=iso8859-1 commit -F -

# and rebase it with the merge strategy, which will fail;
# now .git/rebase-merge/message has iso8859-1 in it
git rebase -m master

# and if we resolve and commit, presumably we'd get a broken commit,
# with iso8859-1 and no encoding header
echo resolved >file
git add file
GIT_EDITOR=: git rebase --continue
-- 8< --

But somehow it all seems to work. The resulting commit has real utf8 in
it. I'm not sure if we pull it from the original commit via "commit -c",
or if it's in one of the other files. But it's not clear to me how
this "message" file is being used.

-Peff
Đoàn Trần Công Danh Nov. 7, 2019, 7:48 a.m. UTC | #2
On 2019-11-07 01:32:23 -0500, Jeff King wrote:
> On Thu, Nov 07, 2019 at 09:56:19AM +0700, Doan Tran Cong Danh wrote:
> 
> > The message file will be used as commit message for the
> > git-{am,rebase} --continue.
> >
> > [...]
> >  	strbuf_addf(&buf, "%s/message", get_dir(opts));
> >  	if (!file_exists(buf.buf)) {
> > -		const char *commit_buffer = get_commit_buffer(commit, NULL);
> > +		const char *encoding = get_commit_output_encoding();
> > +		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
> 
> That makes sense, though it's hard to understand the flow of this data
> through multiple sequencer invocations. I _think_ this would be fixing a
> case like this:
> 
> -- >8 --
> git init repo
> cd repo
> 
> # some commits to build off of
> echo base >file
> git add file
> git commit -m base
> 
> echo side >file
> git add file
> git commit -m side
> 
> # now make a commit in iso8859-1
> git checkout -b side HEAD^
> echo iso8859-1 >file
> git add file
> iconv -f utf8 -t iso8859-1 <<-\EOF |
> súbject
> 
> bödy
> EOF
> git -c i18n.commitEncoding=iso8859-1 commit -F -
> 
> # and rebase it with the merge strategy, which will fail;
> # now .git/rebase-merge/message has iso8859-1 in it
> git rebase -m master
> 
> # and if we resolve and commit, presumably we'd get a broken commit,
> # with iso8859-1 and no encoding header
> echo resolved >file
> git add file
> GIT_EDITOR=: git rebase --continue
> -- 8< --
> 
> But somehow it all seems to work. The resulting commit has real utf8 in
> it. I'm not sure if we pull it from the original commit via "commit -c",

Yes, somehow it worked. But, without this patch, git also warns:

    % GIT_EDITOR=: git rebase --continue
    Warning: commit message did not conform to UTF-8.
    You may want to amend it after fixing the message, or set the config
    variable i18n.commitencoding to the encoding your project uses.

Checking with strace (on glibc, musl strace can't trace execve):

> [pid 12848] execve("/home/danh/workspace/git/git", ["/home/danh/workspace/git/git", "commit", "-n", "-F", ".git/rebase-merge/message", "-e", "--allow-empty"], 0x558fb02e8240 /* 51 vars */) = 0

Turn out, it's because of: commit.c::verify_utf8

    /*
     * This verifies that the buffer is in proper utf8 format.
     *
     * If it isn't, it assumes any non-utf8 characters are Latin1,
     * and does the conversion.
     */
    static int verify_utf8(struct strbuf *buf)

Hence, your test is just pure luck (because it's in latin1).
Jeff King Nov. 7, 2019, 8:03 a.m. UTC | #3
On Thu, Nov 07, 2019 at 02:48:58PM +0700, Danh Doan wrote:

> > # and if we resolve and commit, presumably we'd get a broken commit,
> > # with iso8859-1 and no encoding header
> > echo resolved >file
> > git add file
> > GIT_EDITOR=: git rebase --continue
> > -- 8< --
> > 
> > But somehow it all seems to work. The resulting commit has real utf8 in
> > it. I'm not sure if we pull it from the original commit via "commit -c",
> 
> Yes, somehow it worked. But, without this patch, git also warns:
> 
>     % GIT_EDITOR=: git rebase --continue
>     Warning: commit message did not conform to UTF-8.
>     You may want to amend it after fixing the message, or set the config
>     variable i18n.commitencoding to the encoding your project uses.
> 
> Checking with strace (on glibc, musl strace can't trace execve):
> 
> > [pid 12848] execve("/home/danh/workspace/git/git", ["/home/danh/workspace/git/git", "commit", "-n", "-F", ".git/rebase-merge/message", "-e", "--allow-empty"], 0x558fb02e8240 /* 51 vars */) = 0
> 
> Turn out, it's because of: commit.c::verify_utf8
> 
>     /*
>      * This verifies that the buffer is in proper utf8 format.
>      *
>      * If it isn't, it assumes any non-utf8 characters are Latin1,
>      * and does the conversion.
>      */
>     static int verify_utf8(struct strbuf *buf)
> 
> Hence, your test is just pure luck (because it's in latin1).

Ah, thanks for resolving that mystery. Is it worth turning the scenario
above into a test?

-Peff
Đoàn Trần Công Danh Nov. 7, 2019, 4:32 p.m. UTC | #4
On 2019-11-07 03:03:07 -0500, Jeff King wrote:
> On Thu, Nov 07, 2019 at 02:48:58PM +0700, Danh Doan wrote:
> 
> > > # and if we resolve and commit, presumably we'd get a broken commit,
> > > # with iso8859-1 and no encoding header
> > > echo resolved >file
> > > git add file
> > > GIT_EDITOR=: git rebase --continue
> > > -- 8< --
> > > 
> > > But somehow it all seems to work. The resulting commit has real utf8 in
> > > it. I'm not sure if we pull it from the original commit via "commit -c",
> > 
> > Yes, somehow it worked. But, without this patch, git also warns:
> > 
> >     % GIT_EDITOR=: git rebase --continue
> >     Warning: commit message did not conform to UTF-8.
> >     You may want to amend it after fixing the message, or set the config
> >     variable i18n.commitencoding to the encoding your project uses.
> > 
> > Checking with strace (on glibc, musl strace can't trace execve):
> > 
> > > [pid 12848] execve("/home/danh/workspace/git/git", ["/home/danh/workspace/git/git", "commit", "-n", "-F", ".git/rebase-merge/message", "-e", "--allow-empty"], 0x558fb02e8240 /* 51 vars */) = 0
> > 
> > Turn out, it's because of: commit.c::verify_utf8
> > 
> >     /*
> >      * This verifies that the buffer is in proper utf8 format.
> >      *
> >      * If it isn't, it assumes any non-utf8 characters are Latin1,
> >      * and does the conversion.
> >      */
> >     static int verify_utf8(struct strbuf *buf)
> > 
> > Hence, your test is just pure luck (because it's in latin1).
> 
> Ah, thanks for resolving that mystery. Is it worth turning the scenario
> above into a test?

Yes, it's worth to have a test.
In fact, I found another breakage (rebase with encoding) while writing
this test. I'll delay the re-roll a bit to include that breakage.
diff mbox series

Patch

diff --git a/sequencer.c b/sequencer.c
index d735d09f98..4d12ad3cc6 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -2972,7 +2972,8 @@  static int make_patch(struct repository *r,
 
 	strbuf_addf(&buf, "%s/message", get_dir(opts));
 	if (!file_exists(buf.buf)) {
-		const char *commit_buffer = get_commit_buffer(commit, NULL);
+		const char *encoding = get_commit_output_encoding();
+		const char *commit_buffer = logmsg_reencode(commit, NULL, encoding);
 		find_commit_subject(commit_buffer, &subject);
 		res |= write_message(subject, strlen(subject), buf.buf, 1);
 		unuse_commit_buffer(commit, commit_buffer);