diff mbox series

mailinfo: support Unicode scissors

Message ID 20190331220104.31628-1-rybak.a.v@gmail.com (mailing list archive)
State New, archived
Headers show
Series mailinfo: support Unicode scissors | expand

Commit Message

Andrei Rybak March 31, 2019, 10:01 p.m. UTC
'git am --scissors' allows cutting a patch from an email at a scissors
line.  Such a line should contain perforation, i.e. hyphens, and a
scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
recognized by 'git am --scissors' command at the moment.

Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
of character set Emoji 1.0, published in 2015 [2].  With its adoption as
an emoji, availability of this character on keyboards has increased.

Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
--scissors' to be able to cut at Unicode perforation lines in emails.
Note, that Unicode character '✂' is three bytes in UTF-8 encoding.

1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf
   https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt
2. https://unicode.org/Public/emoji/1.0/emoji-data.txt

Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com>
---

This applies on top of ar/t4150-remove-cruft merged into next in 
commit a0106a8d5c, which also edited the test setup in t4150.

 mailinfo.c    |  7 +++++++
 t/t4150-am.sh | 26 +++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

Comments

SZEDER Gábor March 31, 2019, 11:09 p.m. UTC | #1
On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
> diff --git a/mailinfo.c b/mailinfo.c
> index b395adbdf2..4ef6cdee85 100644
> --- a/mailinfo.c
> +++ b/mailinfo.c
> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>  			c++;
>  			continue;
>  		}
> +		if (!memcmp(c, "✂", 3)) {

This character is tiny.  Please add a comment that it's supposed to be
a Unicode scissors character.

Should we worry about this memcmp() potentially reading past the end
of the string when 'c' points to the last character?

> +			in_perforation = 1;
> +			perforation += 3;
> +			scissors += 3;
> +			c++;

Here you should jump past the three byte long Unicode character, so
this should be c += 2.

> +			continue;
> +		}
>  		in_perforation = 0;
>  	}
>
Junio C Hamano April 1, 2019, 9:07 a.m. UTC | #2
SZEDER Gábor <szeder.dev@gmail.com> writes:

> On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
>> diff --git a/mailinfo.c b/mailinfo.c
>> index b395adbdf2..4ef6cdee85 100644
>> --- a/mailinfo.c
>> +++ b/mailinfo.c
>> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
>>  			c++;
>>  			continue;
>>  		}
>> +		if (!memcmp(c, "✂", 3)) {
>
> This character is tiny.  Please add a comment that it's supposed to be
> a Unicode scissors character.
>
> Should we worry about this memcmp() potentially reading past the end
> of the string when 'c' points to the last character?

Quite honestly, I'd rather document what "scissors" line looks like
exactly and make sure no readers would mistake that we'd accept any
Unicode character whose name has substring "scissors" in it.

Ah, wait, we already do.  It is very clear that scissors are either
">8" (for right handers) or "8<" (for lefties) and nothing else.

Unless you are sure that you are (and more importantly, can stay to
be) exhaustive, adding allowed representations for a thing will
force users to learn more non-essential things ("we allow only 8<
and >8" vs "we allow only these 7, even though we are aware that
there are at least 14 more that we do not allow"---the end-user
needs to remember which 7 are allowed) and does not help users.

Taking only "black scissors" U+2702 but not all of U+2700 - U+2704
will be a cause for unnecessary end-user complaints "why do you take
this but not that one?"  Then the next noise would be "why is '-'
the only perforation and not U+2014 Em Dash or U+2013 En Dash?"

Let's try not to be cute in non-essential things like how a pair of
scissors ought to be spelled.  If "8<" had worked well for us for
the past 10 years, we should just stick to it.
Duy Nguyen April 1, 2019, 9:27 a.m. UTC | #3
On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote:
>
> 'git am --scissors' allows cutting a patch from an email at a scissors
> line.  Such a line should contain perforation, i.e. hyphens, and a
> scissors symbol.  Only ASCII graphics scissors '8<' '>8' '%<' '>%' are
> recognized by 'git am --scissors' command at the moment.
>
> Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode
> since version 1.0.0 [1].  Since then 'BLACK SCISSORS' also became part
> of character set Emoji 1.0, published in 2015 [2].  With its adoption as
> an emoji, availability of this character on keyboards has increased.
>
> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
> --scissors' to be able to cut at Unicode perforation lines in emails.
> Note, that Unicode character '✂' is three bytes in UTF-8 encoding.

On top of what was already said in this thread. For some reason (bad
font?) these scissors are drawn cutting _down_ for me instead of left
or right. It looks a bit strange.
Jeff King April 1, 2019, 10:11 a.m. UTC | #4
On Mon, Apr 01, 2019 at 01:09:47AM +0200, SZEDER Gábor wrote:

> On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote:
> > diff --git a/mailinfo.c b/mailinfo.c
> > index b395adbdf2..4ef6cdee85 100644
> > --- a/mailinfo.c
> > +++ b/mailinfo.c
> > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line)
> >  			c++;
> >  			continue;
> >  		}
> > +		if (!memcmp(c, "✂", 3)) {
> 
> This character is tiny.  Please add a comment that it's supposed to be
> a Unicode scissors character.

I think it might also be the first raw UTF-8 character in our source,
which is otherwise ASCII. Usually we'd spell out the binary (with a
comment).

I think I agree with Junio's response, tough, that this is probably not
a road we want to go down, unless this micro-format is being actively
used in the wild (I have no idea, but I have never seen it).

> Should we worry about this memcmp() potentially reading past the end
> of the string when 'c' points to the last character?

I also wondered if the existing memcmps for ">8", etc, would have this
problem. They don't, but it's somewhat subtle. They are only 2
characters long, and the outer loop guarantees we have at least 1
character. So at most we will look at the NUL. But obviously a 3-byte
sequence like this may invoke undefined behavior, and the existing
memcmps encourage anybody adding code to do it wrong.

I wonder if it's worth re-writing it like:

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..46b1b2a4a8 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -693,8 +693,8 @@ static int is_scissors_line(const char *line)
 			perforation++;
 			continue;
 		}
-		if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) ||
-		     !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) {
+		if ((starts_with(c, ">8") || starts_with(c, "8<") ||
+		     starts_with(c, ">%") || starts_with(c, "%<"))) {
 			in_perforation = 1;
 			perforation += 2;
 			scissors += 2;

-Peff
Andrei Rybak April 1, 2019, 9:54 p.m. UTC | #5
On Mon, 1 Apr 2019 16:27:02 +0700, Duy Nguyen wrote:
> On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote:
>> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am
>> --scissors' to be able to cut at Unicode perforation lines in emails.
>> Note, that Unicode character '✂' is three bytes in UTF-8 encoding.
> 
> On top of what was already said in this thread. For some reason (bad
> font?) these scissors are drawn cutting _down_ for me instead of left
> or right. It looks a bit strange.

This might be an indication that the font used is rendering the
symbol as an emoji. Most scissors in emoji fonts are vertical:
https://emojipedia.org/black-scissors/
diff mbox series

Patch

diff --git a/mailinfo.c b/mailinfo.c
index b395adbdf2..4ef6cdee85 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -701,6 +701,13 @@  static int is_scissors_line(const char *line)
 			c++;
 			continue;
 		}
+		if (!memcmp(c, "✂", 3)) {
+			in_perforation = 1;
+			perforation += 3;
+			scissors += 3;
+			c++;
+			continue;
+		}
 		in_perforation = 0;
 	}
 
diff --git a/t/t4150-am.sh b/t/t4150-am.sh
index 3f7f750cc8..3ea8e8a2cf 100755
--- a/t/t4150-am.sh
+++ b/t/t4150-am.sh
@@ -77,12 +77,20 @@  test_expect_success 'setup: messages' '
 
 	printf "Subject: " >subject-prefix &&
 
-	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF
+	cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF &&
 	This line should not be included in the commit message with --scissors enabled.
 
 	 - - >8 - - remove everything above this line - - >8 - -
 
 	EOF
+
+	cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF
+	Lines above unicode scissors line should not be included in the commit
+	message with --scissors enabled.
+
+	- - - ✂ - - - ✂ - - -
+
+	EOF
 '
 
 test_expect_success setup '
@@ -161,6 +169,12 @@  test_expect_success setup '
 	git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml &&
 	git reset --hard HEAD^ &&
 
+	echo file >file &&
+	git add file &&
+	git commit -F msg-with-unicode-scissors &&
+	git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml &&
+	git reset --hard HEAD^ &&
+
 	sed -n -e "3,\$p" msg >file &&
 	git add file &&
 	test_tick &&
@@ -421,6 +435,16 @@  test_expect_success 'am --scissors cuts the message at the scissors line' '
 	test_cmp_rev expected-for-scissors HEAD
 '
 
+test_expect_success 'am --scissors cuts the message at the unicode scissors line' '
+	rm -fr .git/rebase-apply &&
+	git reset --hard &&
+	git checkout second &&
+	git am --scissors patch-with-unicode-scissors.eml &&
+	test_path_is_missing .git/rebase-apply &&
+	git diff --exit-code expected-for-scissors &&
+	test_cmp_rev expected-for-scissors HEAD
+'
+
 test_expect_success 'am --no-scissors overrides mailinfo.scissors' '
 	rm -fr .git/rebase-apply &&
 	git reset --hard &&