Message ID | 20190331220104.31628-1-rybak.a.v@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mailinfo: support Unicode scissors | expand |
On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote: > diff --git a/mailinfo.c b/mailinfo.c > index b395adbdf2..4ef6cdee85 100644 > --- a/mailinfo.c > +++ b/mailinfo.c > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) > c++; > continue; > } > + if (!memcmp(c, "✂", 3)) { This character is tiny. Please add a comment that it's supposed to be a Unicode scissors character. Should we worry about this memcmp() potentially reading past the end of the string when 'c' points to the last character? > + in_perforation = 1; > + perforation += 3; > + scissors += 3; > + c++; Here you should jump past the three byte long Unicode character, so this should be c += 2. > + continue; > + } > in_perforation = 0; > } >
SZEDER Gábor <szeder.dev@gmail.com> writes: > On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote: >> diff --git a/mailinfo.c b/mailinfo.c >> index b395adbdf2..4ef6cdee85 100644 >> --- a/mailinfo.c >> +++ b/mailinfo.c >> @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) >> c++; >> continue; >> } >> + if (!memcmp(c, "✂", 3)) { > > This character is tiny. Please add a comment that it's supposed to be > a Unicode scissors character. > > Should we worry about this memcmp() potentially reading past the end > of the string when 'c' points to the last character? Quite honestly, I'd rather document what "scissors" line looks like exactly and make sure no readers would mistake that we'd accept any Unicode character whose name has substring "scissors" in it. Ah, wait, we already do. It is very clear that scissors are either ">8" (for right handers) or "8<" (for lefties) and nothing else. Unless you are sure that you are (and more importantly, can stay to be) exhaustive, adding allowed representations for a thing will force users to learn more non-essential things ("we allow only 8< and >8" vs "we allow only these 7, even though we are aware that there are at least 14 more that we do not allow"---the end-user needs to remember which 7 are allowed) and does not help users. Taking only "black scissors" U+2702 but not all of U+2700 - U+2704 will be a cause for unnecessary end-user complaints "why do you take this but not that one?" Then the next noise would be "why is '-' the only perforation and not U+2014 Em Dash or U+2013 En Dash?" Let's try not to be cute in non-essential things like how a pair of scissors ought to be spelled. If "8<" had worked well for us for the past 10 years, we should just stick to it.
On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote: > > 'git am --scissors' allows cutting a patch from an email at a scissors > line. Such a line should contain perforation, i.e. hyphens, and a > scissors symbol. Only ASCII graphics scissors '8<' '>8' '%<' '>%' are > recognized by 'git am --scissors' command at the moment. > > Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode > since version 1.0.0 [1]. Since then 'BLACK SCISSORS' also became part > of character set Emoji 1.0, published in 2015 [2]. With its adoption as > an emoji, availability of this character on keyboards has increased. > > Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am > --scissors' to be able to cut at Unicode perforation lines in emails. > Note, that Unicode character '✂' is three bytes in UTF-8 encoding. On top of what was already said in this thread. For some reason (bad font?) these scissors are drawn cutting _down_ for me instead of left or right. It looks a bit strange.
On Mon, Apr 01, 2019 at 01:09:47AM +0200, SZEDER Gábor wrote: > On Mon, Apr 01, 2019 at 12:01:04AM +0200, Andrei Rybak wrote: > > diff --git a/mailinfo.c b/mailinfo.c > > index b395adbdf2..4ef6cdee85 100644 > > --- a/mailinfo.c > > +++ b/mailinfo.c > > @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) > > c++; > > continue; > > } > > + if (!memcmp(c, "✂", 3)) { > > This character is tiny. Please add a comment that it's supposed to be > a Unicode scissors character. I think it might also be the first raw UTF-8 character in our source, which is otherwise ASCII. Usually we'd spell out the binary (with a comment). I think I agree with Junio's response, tough, that this is probably not a road we want to go down, unless this micro-format is being actively used in the wild (I have no idea, but I have never seen it). > Should we worry about this memcmp() potentially reading past the end > of the string when 'c' points to the last character? I also wondered if the existing memcmps for ">8", etc, would have this problem. They don't, but it's somewhat subtle. They are only 2 characters long, and the outer loop guarantees we have at least 1 character. So at most we will look at the NUL. But obviously a 3-byte sequence like this may invoke undefined behavior, and the existing memcmps encourage anybody adding code to do it wrong. I wonder if it's worth re-writing it like: diff --git a/mailinfo.c b/mailinfo.c index b395adbdf2..46b1b2a4a8 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -693,8 +693,8 @@ static int is_scissors_line(const char *line) perforation++; continue; } - if ((!memcmp(c, ">8", 2) || !memcmp(c, "8<", 2) || - !memcmp(c, ">%", 2) || !memcmp(c, "%<", 2))) { + if ((starts_with(c, ">8") || starts_with(c, "8<") || + starts_with(c, ">%") || starts_with(c, "%<"))) { in_perforation = 1; perforation += 2; scissors += 2; -Peff
On Mon, 1 Apr 2019 16:27:02 +0700, Duy Nguyen wrote: > On Mon, Apr 1, 2019 at 5:03 AM Andrei Rybak <rybak.a.v@gmail.com> wrote: >> Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am >> --scissors' to be able to cut at Unicode perforation lines in emails. >> Note, that Unicode character '✂' is three bytes in UTF-8 encoding. > > On top of what was already said in this thread. For some reason (bad > font?) these scissors are drawn cutting _down_ for me instead of left > or right. It looks a bit strange. This might be an indication that the font used is rendering the symbol as an emoji. Most scissors in emoji fonts are vertical: https://emojipedia.org/black-scissors/
diff --git a/mailinfo.c b/mailinfo.c index b395adbdf2..4ef6cdee85 100644 --- a/mailinfo.c +++ b/mailinfo.c @@ -701,6 +701,13 @@ static int is_scissors_line(const char *line) c++; continue; } + if (!memcmp(c, "✂", 3)) { + in_perforation = 1; + perforation += 3; + scissors += 3; + c++; + continue; + } in_perforation = 0; } diff --git a/t/t4150-am.sh b/t/t4150-am.sh index 3f7f750cc8..3ea8e8a2cf 100755 --- a/t/t4150-am.sh +++ b/t/t4150-am.sh @@ -77,12 +77,20 @@ test_expect_success 'setup: messages' ' printf "Subject: " >subject-prefix && - cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF + cat - subject-prefix msg-without-scissors-line >msg-with-scissors-line <<-\EOF && This line should not be included in the commit message with --scissors enabled. - - >8 - - remove everything above this line - - >8 - - EOF + + cat - subject-prefix msg-without-scissors-line >msg-with-unicode-scissors <<-\EOF + Lines above unicode scissors line should not be included in the commit + message with --scissors enabled. + + - - - ✂ - - - ✂ - - - + + EOF ' test_expect_success setup ' @@ -161,6 +169,12 @@ test_expect_success setup ' git format-patch --stdout expected-for-no-scissors^ >patch-with-scissors-line.eml && git reset --hard HEAD^ && + echo file >file && + git add file && + git commit -F msg-with-unicode-scissors && + git format-patch --stdout HEAD^ >patch-with-unicode-scissors.eml && + git reset --hard HEAD^ && + sed -n -e "3,\$p" msg >file && git add file && test_tick && @@ -421,6 +435,16 @@ test_expect_success 'am --scissors cuts the message at the scissors line' ' test_cmp_rev expected-for-scissors HEAD ' +test_expect_success 'am --scissors cuts the message at the unicode scissors line' ' + rm -fr .git/rebase-apply && + git reset --hard && + git checkout second && + git am --scissors patch-with-unicode-scissors.eml && + test_path_is_missing .git/rebase-apply && + git diff --exit-code expected-for-scissors && + test_cmp_rev expected-for-scissors HEAD +' + test_expect_success 'am --no-scissors overrides mailinfo.scissors' ' rm -fr .git/rebase-apply && git reset --hard &&
'git am --scissors' allows cutting a patch from an email at a scissors line. Such a line should contain perforation, i.e. hyphens, and a scissors symbol. Only ASCII graphics scissors '8<' '>8' '%<' '>%' are recognized by 'git am --scissors' command at the moment. Unicode character 'BLACK SCISSORS' (U+2702) has been a part of Unicode since version 1.0.0 [1]. Since then 'BLACK SCISSORS' also became part of character set Emoji 1.0, published in 2015 [2]. With its adoption as an emoji, availability of this character on keyboards has increased. Support UTF-8 encoding of '✂' in function is_scissors_line, for 'git am --scissors' to be able to cut at Unicode perforation lines in emails. Note, that Unicode character '✂' is three bytes in UTF-8 encoding. 1. https://www.unicode.org/versions/Unicode1.0.0/CodeCharts1.pdf https://www.unicode.org/Public/reconstructed/1.0.0/UnicodeData.txt 2. https://unicode.org/Public/emoji/1.0/emoji-data.txt Signed-off-by: Andrei Rybak <rybak.a.v@gmail.com> --- This applies on top of ar/t4150-remove-cruft merged into next in commit a0106a8d5c, which also edited the test setup in t4150. mailinfo.c | 7 +++++++ t/t4150-am.sh | 26 +++++++++++++++++++++++++- 2 files changed, 32 insertions(+), 1 deletion(-)