diff mbox series

[v4,1/2] ref-filter: handle CRLF at end-of-line more gracefully

Message ID 03b2d7d78a15d15130a68ed1e6092072aa0807cd.1603335680.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v4,1/2] ref-filter: handle CRLF at end-of-line more gracefully | expand

Commit Message

Philippe Blain Oct. 22, 2020, 3:01 a.m. UTC
From: Philippe Blain <levraiphilippeblain@gmail.com>

The ref-filter code does not correctly handle commit or tag messages
that use CRLF as the line terminator. Such messages can be created with
the `--cleanup=verbatim` option of `git commit` and `git tag`, or by
using `git commit-tree` directly.

The function `find_subpos` in ref-filter.c looks for two consecutive
LFs to find the end of the subject line, a sequence which is absent in
messages using CRLF. This results in the whole message being parsed as
the subject line (`%(contents:subject)`), and the body of the message
(`%(contents:body)`) being empty.

Moreover, in `copy_subject`, which wants to return the subject as a
single line, '\n' is replaced by space, but '\r' is
untouched.

This impacts the output of `git branch`, `git tag` and `git
for-each-ref`.

This bug is a regression for `git branch --verbose`, which
bisects down to 949af0684c (branch: use ref-filter printing APIs,
2017-01-10).

Fix this bug in ref-filter by hardening the logic in `copy_subject` and
`find_subpos` to correctly parse messages containing CRLF.

Add a new test script, 't3920-crlf-messages.sh', to test the behaviour
of commands using either the ref-filter or the pretty APIs with messages
using CRLF line endings. The function `test_crlf_subject_body_and_contents`
can be used to test that the `--format` option of `branch`, `tag`,
`for-each-ref`, `log` and `show` correctly displays the subject, body
and raw content of commit and tag messages using CRLF. Test the
output of `branch`, `tag` and `for-each-ref` with such commits.

Helped-by: Junio C Hamano <gitster@pobox.com>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com>
---
 ref-filter.c             |  36 ++++++++-----
 t/t3920-crlf-messages.sh | 108 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 130 insertions(+), 14 deletions(-)
 create mode 100755 t/t3920-crlf-messages.sh

Comments

Junio C Hamano Oct. 23, 2020, 12:52 a.m. UTC | #1
"Philippe Blain via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Philippe Blain <levraiphilippeblain@gmail.com>
>
> The ref-filter code does not correctly handle commit or tag messages
> that use CRLF as the line terminator. Such messages can be created with
> the `--cleanup=verbatim` option of `git commit` and `git tag`, or by
> using `git commit-tree` directly.
>
> The function `find_subpos` in ref-filter.c looks for two consecutive
> LFs to find the end of the subject line, a sequence which is absent in
> messages using CRLF. This results in the whole message being parsed as
> the subject line (`%(contents:subject)`), and the body of the message
> (`%(contents:body)`) being empty.
>
> Moreover, in `copy_subject`, which wants to return the subject as a
> single line, '\n' is replaced by space, but '\r' is
> untouched.

Honestly, all of the above signal, at least to me, that these
objects are designed to use LF terminated lines and nothing else,
whether Windows or DOS existed in the same world or not.  There is
no such thing as commit objects that use CRLF as the line
terminator.  They are commit objects whose payload has CR at the end
of each and every line.  Just like there can be commit objects whose
payload has trailing SP on each line, or even has binary guck, these
things can be created using the "commit --cleanup=verbatim" command,
or the "hash-objects" command.  It does not mean it is encouraged to
create such objects.  It does not mean it is sensible to expect them
to behave as if these trailing whitespaces (be it SP or CR) are not
there.

> This impacts the output of `git branch`, `git tag` and `git
> for-each-ref`.

The answer to that problem description is "then don't" ;-).  If you
do not want to have trailing whitespaces, you need to clean them up
somehow, and we give an easy way to do so with the default --cleanup
action.  Setting it to verbatim is to decline that easy way offered
to you, and it makes it your responsibility to do your own clean-up
if you still want to remove the CR at the end of your lines.

Having said all that.

Here is how I explained the topic in the "What's cooking" report.

     A commit and tag object may have CR at the end of each and
     every line (you can create such an object with hash-object or
     using --cleanup=verbatim to decline the default clean-up
     action), but it would make it impossible to have a blank line
     to separate the title from the body of the message.  Be lenient
     and accept a line with lone CR on it as a blank line, too.

Let's not call this change a "bug fix".  The phrase you used in your
title, "more gracefully", is a very good one.

In the meantime, I've squashed your "oops forgot ||return 1" change
into [PATCH 2/2].

Thanks.
Philippe Blain Oct. 23, 2020, 1:46 a.m. UTC | #2
Hi Junio,

> Le 22 oct. 2020 à 20:52, Junio C Hamano <gitster@pobox.com> a écrit :
> 
> "Philippe Blain via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> From: Philippe Blain <levraiphilippeblain@gmail.com>
>> 
>> The ref-filter code does not correctly handle commit or tag messages
>> that use CRLF as the line terminator. Such messages can be created with
>> the `--cleanup=verbatim` option of `git commit` and `git tag`, or by
>> using `git commit-tree` directly.
>> 
>> The function `find_subpos` in ref-filter.c looks for two consecutive
>> LFs to find the end of the subject line, a sequence which is absent in
>> messages using CRLF. This results in the whole message being parsed as
>> the subject line (`%(contents:subject)`), and the body of the message
>> (`%(contents:body)`) being empty.
>> 
>> Moreover, in `copy_subject`, which wants to return the subject as a
>> single line, '\n' is replaced by space, but '\r' is
>> untouched.
> 
> Honestly, all of the above signal, at least to me, that these
> objects are designed to use LF terminated lines and nothing else,
> whether Windows or DOS existed in the same world or not.  There is
> no such thing as commit objects that use CRLF as the line
> terminator.  They are commit objects whose payload has CR at the end
> of each and every line.  Just like there can be commit objects whose
> payload has trailing SP on each line, or even has binary guck, these
> things can be created using the "commit --cleanup=verbatim" command,
> or the "hash-objects" command.  It does not mean it is encouraged to
> create such objects.  It does not mean it is sensible to expect them
> to behave as if these trailing whitespaces (be it SP or CR) are not
> there.
> 
>> This impacts the output of `git branch`, `git tag` and `git
>> for-each-ref`.
> 
> The answer to that problem description is "then don't" ;-).  If you
> do not want to have trailing whitespaces, you need to clean them up
> somehow, and we give an easy way to do so with the default --cleanup
> action.  Setting it to verbatim is to decline that easy way offered
> to you, and it makes it your responsibility to do your own clean-up
> if you still want to remove the CR at the end of your lines.

I agree with you on that : if you are creating the object yourself,
you should let the default cleanup take place.

But as a lot of projects use GitHub, GitLab or similar services
to accept contributions, and let these web systems perform the "merge" 
(or rebase or whatever) operation to integrate these contributions;
maintainers sometime choose to not always have complete control
on all objects that become part of the canonical history of their repository. 
And as I wrote in [1], GitLab was creating commits using CRLF up until 9.2... [2].
So for these poor projects that are now stuck with these CRLFs in their
merge commit messages, I think it's good that Git handles these correctly.

> Having said all that.
> 
> Here is how I explained the topic in the "What's cooking" report.
> 
>     A commit and tag object may have CR at the end of each and
>     every line (you can create such an object with hash-object or
>     using --cleanup=verbatim to decline the default clean-up
>     action), but it would make it impossible to have a blank line
>     to separate the title from the body of the message.  Be lenient
>     and accept a line with lone CR on it as a blank line, too.

Just for the sake of searchability, I think it would be good to have 
CRLF spelled out in this topic description (since I gather this is what 
ends up in the release notes). But I don't feel that strongly
about that.

> Let's not call this change a "bug fix".  The phrase you used in your
> title, "more gracefully", is a very good one.

It was your suggestion ;) 

> In the meantime, I've squashed your "oops forgot ||return 1" change
> into [PATCH 2/2].

Thanks for squashing it in.

Cheers,
Philippe.


[1] https://lore.kernel.org/git/63755050-10A5-4A46-9BB3-8207E055692C@gmail.com/
[2] https://gitlab.com/gitlab-org/gitlab-foss/-/issues/31671
Junio C Hamano Oct. 28, 2020, 8:24 p.m. UTC | #3
Philippe Blain <levraiphilippeblain@gmail.com> writes:

>> Having said all that.
>> 
>> Here is how I explained the topic in the "What's cooking" report.
>> 
>>     A commit and tag object may have CR at the end of each and
>>     every line (you can create such an object with hash-object or
>>     using --cleanup=verbatim to decline the default clean-up
>>     action), but it would make it impossible to have a blank line
>>     to separate the title from the body of the message.  Be lenient
>>     and accept a line with lone CR on it as a blank line, too.
>
> Just for the sake of searchability, I think it would be good to have 
> CRLF spelled out in this topic description (since I gather this is what 
> ends up in the release notes). But I don't feel that strongly
> about that.
>
>> Let's not call this change a "bug fix".  The phrase you used in your
>> title, "more gracefully", is a very good one.
>
> It was your suggestion ;) 
>
>> In the meantime, I've squashed your "oops forgot ||return 1" change
>> into [PATCH 2/2].
>
> Thanks for squashing it in.

Squashing in the "oops forgot || return 1" was the only thing I did.
I did not rewrite (and will not do so myself) the proposed log
message 1/2, which needs to happen before the topic can hit 'next'.

Thanks.
Philippe Blain Oct. 29, 2020, 1:29 a.m. UTC | #4
Hi Junio,

> Le 28 oct. 2020 à 16:24, Junio C Hamano <gitster@pobox.com> a écrit :
> 
> Philippe Blain <levraiphilippeblain@gmail.com> writes:
> 
>>> Having said all that.
>>> 
>>> Here is how I explained the topic in the "What's cooking" report.
>>> 
>>>    A commit and tag object may have CR at the end of each and
>>>    every line (you can create such an object with hash-object or
>>>    using --cleanup=verbatim to decline the default clean-up
>>>    action), but it would make it impossible to have a blank line
>>>    to separate the title from the body of the message.  Be lenient
>>>    and accept a line with lone CR on it as a blank line, too.
>> 
>> Just for the sake of searchability, I think it would be good to have 
>> CRLF spelled out in this topic description (since I gather this is what 
>> ends up in the release notes). But I don't feel that strongly
>> about that.
>> 
>>> Let's not call this change a "bug fix".  The phrase you used in your
>>> title, "more gracefully", is a very good one.
>> 
>> It was your suggestion ;) 
>> 
>>> In the meantime, I've squashed your "oops forgot ||return 1" change
>>> into [PATCH 2/2].
>> 
>> Thanks for squashing it in.
> 
> Squashing in the "oops forgot || return 1" was the only thing I did.
> I did not rewrite (and will not do so myself) the proposed log
> message 1/2, which needs to happen before the topic can hit 'next'.

Ah! I thought you meant "Let's not call it a bug fix in the release notes".
I'll send a new version to not mention "bug" in the log messages.

Thanks,

Philippe.
diff mbox series

Patch

diff --git a/ref-filter.c b/ref-filter.c
index c62f6b4822..6476686fea 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1097,14 +1097,19 @@  static const char *copy_email(const char *buf, struct used_atom *atom)
 
 static char *copy_subject(const char *buf, unsigned long len)
 {
-	char *r = xmemdupz(buf, len);
+	struct strbuf sb = STRBUF_INIT;
 	int i;
 
-	for (i = 0; i < len; i++)
-		if (r[i] == '\n')
-			r[i] = ' ';
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n')
+			continue; /* ignore CR in CRLF */
 
-	return r;
+		if (buf[i] == '\n')
+			strbuf_addch(&sb, ' ');
+		else
+			strbuf_addch(&sb, buf[i]);
+	}
+	return strbuf_detach(&sb, NULL);
 }
 
 static void grab_date(const char *buf, struct atom_value *v, const char *atomname)
@@ -1228,20 +1233,23 @@  static void find_subpos(const char *buf,
 
 	/* subject is first non-empty line */
 	*sub = buf;
-	/* subject goes to first empty line */
-	while (buf < *sig && *buf && *buf != '\n') {
-		eol = strchrnul(buf, '\n');
-		if (*eol)
-			eol++;
-		buf = eol;
-	}
+	/* subject goes to first empty line before signature begins */
+	if ((eol = strstr(*sub, "\n\n"))) {
+		eol = eol < *sig ? eol : *sig;
+	/* check if message uses CRLF */
+	} else if (! (eol = strstr(*sub, "\r\n\r\n"))) {
+		/* treat whole message as subject */
+		eol = strrchr(*sub, '\0');
+	}
+	buf = eol;
 	*sublen = buf - *sub;
 	/* drop trailing newline, if present */
-	if (*sublen && (*sub)[*sublen - 1] == '\n')
+	while (*sublen && ((*sub)[*sublen - 1] == '\n' ||
+			   (*sub)[*sublen - 1] == '\r'))
 		*sublen -= 1;
 
 	/* skip any empty lines */
-	while (*buf == '\n')
+	while (*buf == '\n' || *buf == '\r')
 		buf++;
 	*body = buf;
 	*bodylen = strlen(buf);
diff --git a/t/t3920-crlf-messages.sh b/t/t3920-crlf-messages.sh
new file mode 100755
index 0000000000..3f0ce02c3f
--- /dev/null
+++ b/t/t3920-crlf-messages.sh
@@ -0,0 +1,108 @@ 
+#!/bin/sh
+
+test_description='Test ref-filter and pretty APIs for commit and tag messages using CRLF'
+. ./test-lib.sh
+
+LIB_CRLF_BRANCHES=""
+
+create_crlf_ref () {
+	branch="$1" &&
+	cat >.crlf-orig-$branch.txt &&
+	cat .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
+	grep 'Subject' .crlf-orig-$branch.txt | tr '\n' ' ' | sed 's/[ ]*$//' | tr -d '\n' >.crlf-subject-$branch.txt &&
+	grep 'Body' .crlf-message-$branch.txt >.crlf-body-$branch.txt || true &&
+	LIB_CRLF_BRANCHES="${LIB_CRLF_BRANCHES} ${branch}" &&
+	test_tick &&
+	hash=$(git commit-tree HEAD^{tree} -p HEAD -F .crlf-message-${branch}.txt) &&
+	git branch ${branch} ${hash} &&
+	git tag tag-${branch} ${branch} -F .crlf-message-${branch}.txt --cleanup=verbatim
+}
+
+create_crlf_refs () {
+	create_crlf_ref crlf <<-\EOF &&
+	Subject first line
+
+	Body first line
+	Body second line
+	EOF
+	create_crlf_ref crlf-empty-lines-after-subject <<-\EOF &&
+	Subject first line
+
+
+	Body first line
+	Body second line
+	EOF
+	create_crlf_ref crlf-two-line-subject <<-\EOF &&
+	Subject first line
+	Subject second line
+
+	Body first line
+	Body second line
+	EOF
+	create_crlf_ref crlf-two-line-subject-no-body <<-\EOF &&
+	Subject first line
+	Subject second line
+	EOF
+	create_crlf_ref crlf-two-line-subject-no-body-trailing-newline <<-\EOF
+	Subject first line
+	Subject second line
+
+	EOF
+}
+
+test_crlf_subject_body_and_contents() {
+	command_and_args="$@" &&
+	command=$1 &&
+	if test ${command} = "branch" || test ${command} = "for-each-ref" || test ${command} = "tag"
+	then
+		atoms="(contents:subject) (contents:body) (contents)"
+	elif test ${command} = "log" || test ${command} = "show"
+	then
+		atoms="s b B"
+	fi &&
+	files="subject body message" &&
+	while test -n "${atoms}"
+	do
+		set ${atoms} && atom=$1 && shift && atoms="$*" &&
+		set ${files} &&	file=$1 && shift && files="$*" &&
+		test_expect_success "${command}: --format='%${atom}' works with messages using CRLF" "
+			rm -f expect &&
+			for ref in ${LIB_CRLF_BRANCHES}
+			do
+				cat .crlf-${file}-\"\${ref}\".txt >>expect &&
+				printf \"\n\" >>expect
+			done &&
+			git $command_and_args --format=\"%${atom}\" >actual &&
+			test_cmp expect actual
+		"
+	done
+}
+
+
+test_expect_success 'Setup refs with commit and tag messages using CRLF' '
+	test_commit inital &&
+	create_crlf_refs
+'
+
+test_expect_success 'branch: --verbose works with messages using CRLF' '
+	rm -f expect &&
+	for branch in $LIB_CRLF_BRANCHES
+	do
+		printf "  " >>expect &&
+		cat .crlf-subject-${branch}.txt >>expect &&
+		printf "\n" >>expect
+	done &&
+	git branch -v >tmp &&
+	# Remove first two columns, and the line for the currently checked out branch
+	current=$(git branch --show-current) &&
+	grep -v $current <tmp | awk "{\$1=\$2=\"\"}1"  >actual &&
+	test_cmp expect actual
+'
+
+test_crlf_subject_body_and_contents branch --list crlf*
+
+test_crlf_subject_body_and_contents tag --list tag-crlf*
+
+test_crlf_subject_body_and_contents for-each-ref refs/heads/crlf*
+
+test_done