diff mbox series

mailinfo: strip CR from base64/quoted-printable email

Message ID 20210421013404.17383-1-congdanhqx@gmail.com (mailing list archive)
State New, archived
Headers show
Series mailinfo: strip CR from base64/quoted-printable email | expand

Commit Message

Đoàn Trần Công Danh April 21, 2021, 1:34 a.m. UTC
When an SMTP server receives an 8-bit email message, possibly with only
LF as line ending, some of those servers decide to change said LF to
CRLF.

Some other SMTP servers, when receives an 8-bit email message, decide to
encoding such message in base64 and/or quoted-printable instead.

If an email is transfered through those 2 email servers in order, the
final recipients will receive an email contains a patch mungled with
CRLF encoded inside another encoding. Thus, such CR couldn't be dropped
by mailsplit. Such accidents have been observed in the wild [1].

Let's guess if such CR was added automatically and strip them in
mailinfo.

[1]: https://nmbug.notmuchmail.org/nmweb/show/m2lf9ejegj.fsf%40guru.guru-group.fi

Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
---

 I'm not sure if guessing the heuristic to strip CR is a good approach.
 I think it's better to pass --keep-cr down from git-am.
 Let's say --keep-cr=<yes|no|auto>

 mailinfo.c             | 20 +++++++++++++++++---
 t/t5100-mailinfo.sh    |  5 +++++
 t/t5100/cr-base64.mbox | 22 ++++++++++++++++++++++
 t/t5100/info1000       |  5 +++++
 t/t5100/msg1000        |  2 ++
 t/t5100/patch1000      | 22 ++++++++++++++++++++++
 6 files changed, 73 insertions(+), 3 deletions(-)
 create mode 100644 t/t5100/cr-base64.mbox
 create mode 100644 t/t5100/info1000
 create mode 100644 t/t5100/msg1000
 create mode 100644 t/t5100/patch1000

Comments

Junio C Hamano April 21, 2021, 2:09 a.m. UTC | #1
Đoàn Trần Công Danh  <congdanhqx@gmail.com> writes:

> When an SMTP server receives an 8-bit email message, possibly with only
> LF as line ending, some of those servers decide to change said LF to
> CRLF.
>
> Some other SMTP servers, when receives an 8-bit email message, decide to
> encoding such message in base64 and/or quoted-printable instead.

encoding -> encode

>
> If an email is transfered through those 2 email servers in order, the
> final recipients will receive an email contains a patch mungled with
> CRLF encoded inside another encoding. Thus, such CR couldn't be dropped
> by mailsplit. Such accidents have been observed in the wild [1].
>
> Let's guess if such CR was added automatically and strip them in
> mailinfo.
>
> [1]: https://nmbug.notmuchmail.org/nmweb/show/m2lf9ejegj.fsf%40guru.guru-group.fi
>
> Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
> ---
>
>  I'm not sure if guessing the heuristic to strip CR is a good approach.
>  I think it's better to pass --keep-cr down from git-am.
>  Let's say --keep-cr=<yes|no|auto>

It matches my instinct to tie this with the existing --keep-cr
option, even though I admit that I haven't thought things through.

.
brian m. carlson April 21, 2021, 3:32 a.m. UTC | #2
On 2021-04-21 at 01:34:04, Đoàn Trần Công Danh wrote:
> When an SMTP server receives an 8-bit email message, possibly with only
> LF as line ending, some of those servers decide to change said LF to
> CRLF.
> 
> Some other SMTP servers, when receives an 8-bit email message, decide to
> encoding such message in base64 and/or quoted-printable instead.

This really isn't an SMTP server.  It's mailing list software, namely
mailman, and I would argue it's a bug, even though we may want to work
around it.  For example, re-encoding the message breaks DKIM signatures,
which means that mailman is likely to cause mail to be needlessly
rejected.

8BITMIME is now so common with SMTP that I'd argue that we should just
write off servers that don't support it (especially in the context of
SMTPUTF8 existing), but this isn't the case of an SMTP server being
stuck in the last century.  Can we say more accurately that this is
mailing list software (or just call it out by name)?

> If an email is transfered through those 2 email servers in order, the
> final recipients will receive an email contains a patch mungled with
> CRLF encoded inside another encoding. Thus, such CR couldn't be dropped
> by mailsplit. Such accidents have been observed in the wild [1].
> 
> Let's guess if such CR was added automatically and strip them in
> mailinfo.
>
> [1]: https://nmbug.notmuchmail.org/nmweb/show/m2lf9ejegj.fsf%40guru.guru-group.fi
> 
> Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
> ---
> 
>  I'm not sure if guessing the heuristic to strip CR is a good approach.
>  I think it's better to pass --keep-cr down from git-am.
>  Let's say --keep-cr=<yes|no|auto>

I think we may want a separate option here.  When I send a 7bit or 8bit
body, I expect text canonicalization on the line endings.  However, when
I send a base64 or quoted-printable body, I don't expect my data to be
modified at all, and absent a compelling reason, doing so is incorrect.
In most cases, using base64 or quoted-printable is going to mean that
the sender knew that the body shouldn't be modified, not that mailman
modified it, so we should make line munging in this case opt-in.
Đoàn Trần Công Danh April 21, 2021, 12:07 p.m. UTC | #3
On 2021-04-21 03:32:07+0000, "brian m. carlson" <sandals@crustytoothpaste.net> wrote:
> On 2021-04-21 at 01:34:04, Đoàn Trần Công Danh wrote:
> > When an SMTP server receives an 8-bit email message, possibly with only
> > LF as line ending, some of those servers decide to change said LF to
> > CRLF.
> > 
> > Some other SMTP servers, when receives an 8-bit email message, decide to
> > encoding such message in base64 and/or quoted-printable instead.
> 
> This really isn't an SMTP server.  It's mailing list software, namely
> mailman, and I would argue it's a bug, even though we may want to work
> around it.  For example, re-encoding the message breaks DKIM signatures,
> which means that mailman is likely to cause mail to be needlessly
> rejected.
> 
> 8BITMIME is now so common with SMTP that I'd argue that we should just
> write off servers that don't support it (especially in the context of
> SMTPUTF8 existing), but this isn't the case of an SMTP server being
> stuck in the last century.  Can we say more accurately that this is
> mailing list software (or just call it out by name)?

I think replace "SMTP servers" with "mailing list managers" is
correct. I don't feel comfortable to call it out, since I don't know
if other managers do it that way or not.

> 
> > If an email is transfered through those 2 email servers in order, the
> > final recipients will receive an email contains a patch mungled with
> > CRLF encoded inside another encoding. Thus, such CR couldn't be dropped
> > by mailsplit. Such accidents have been observed in the wild [1].
> > 
> > Let's guess if such CR was added automatically and strip them in
> > mailinfo.
> >
> > [1]: https://nmbug.notmuchmail.org/nmweb/show/m2lf9ejegj.fsf%40guru.guru-group.fi
> > 
> > Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
> > ---
> > 
> >  I'm not sure if guessing the heuristic to strip CR is a good approach.
> >  I think it's better to pass --keep-cr down from git-am.
> >  Let's say --keep-cr=<yes|no|auto>
> 
> I think we may want a separate option here.  When I send a 7bit or 8bit
> body, I expect text canonicalization on the line endings.  However, when
> I send a base64 or quoted-printable body, I don't expect my data to be
> modified at all, and absent a compelling reason, doing so is incorrect.
> In most cases, using base64 or quoted-printable is going to mean that
> the sender knew that the body shouldn't be modified, not that mailman
> modified it, so we should make line munging in this case opt-in.

Make sense, this patch was sent mostly for some discussion first.
Would you mind suggest something for the option.

I'm thinking about --quoted-cr=<nowarn|warn|fix>, mimicking the
--whitespace option.
brian m. carlson April 22, 2021, 1:10 a.m. UTC | #4
On 2021-04-21 at 12:07:42, Đoàn Trần Công Danh wrote:
> I think replace "SMTP servers" with "mailing list managers" is
> correct. I don't feel comfortable to call it out, since I don't know
> if other managers do it that way or not.

I think that's fair.  I would hope not for the reasons I mentioned, but
it also would not be surprising to me if others did nevertheless.

> Make sense, this patch was sent mostly for some discussion first.
> Would you mind suggest something for the option.
> 
> I'm thinking about --quoted-cr=<nowarn|warn|fix>, mimicking the
> --whitespace option.

I think that sounds like a great name.
diff mbox series

Patch

diff --git a/mailinfo.c b/mailinfo.c
index 5681d9130d..dbff867f42 100644
--- a/mailinfo.c
+++ b/mailinfo.c
@@ -988,16 +988,27 @@  static int handle_boundary(struct mailinfo *mi, struct strbuf *line)
 }
 
 static void handle_filter_flowed(struct mailinfo *mi, struct strbuf *line,
-				 struct strbuf *prev)
+				 struct strbuf *prev, int *keep_cr)
 {
 	size_t len = line->len;
 	const char *rest;
 
 	if (!mi->format_flowed) {
+		if (*keep_cr == -1 && len >= 2)
+			*keep_cr = !(line->buf[len - 2] == '\r' &&
+				     line->buf[len - 1] == '\n');
+		if (!*keep_cr && len >= 2 &&
+		    line->buf[len - 2] == '\r' &&
+		    line->buf[len - 1] == '\n') {
+			strbuf_setlen(line, len - 2);
+			strbuf_addch(line, '\n');
+			len--;
+		}
 		handle_filter(mi, line);
 		return;
 	}
 
+	*keep_cr = 1;
 	if (line->buf[len - 1] == '\n') {
 		len--;
 		if (len && line->buf[len - 1] == '\r')
@@ -1036,6 +1047,7 @@  static void handle_filter_flowed(struct mailinfo *mi, struct strbuf *line,
 static void handle_body(struct mailinfo *mi, struct strbuf *line)
 {
 	struct strbuf prev = STRBUF_INIT;
+	int keep_cr = -1;
 
 	/* Skip up to the first boundary */
 	if (*(mi->content_top)) {
@@ -1081,7 +1093,7 @@  static void handle_body(struct mailinfo *mi, struct strbuf *line)
 						strbuf_addbuf(&prev, sb);
 						break;
 					}
-				handle_filter_flowed(mi, sb, &prev);
+				handle_filter_flowed(mi, sb, &prev, &keep_cr);
 			}
 			/*
 			 * The partial chunk is saved in "prev" and will be
@@ -1091,7 +1103,9 @@  static void handle_body(struct mailinfo *mi, struct strbuf *line)
 			break;
 		}
 		default:
-			handle_filter_flowed(mi, line, &prev);
+			/* CR in plain message was processed in mailsplit */
+			keep_cr = 1;
+			handle_filter_flowed(mi, line, &prev, &keep_cr);
 		}
 
 		if (mi->input_error)
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index 147e616533..9ccc11d16a 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -228,4 +228,9 @@  test_expect_success 'mailinfo handles unusual header whitespace' '
 	test_cmp expect actual
 '
 
+test_expect_success 'mailinfo strip CR after decode base64' '
+	cp $DATA/cr-base64.mbox 1000 &&
+	check_mailinfo 1000 ""
+'
+
 test_done
diff --git a/t/t5100/cr-base64.mbox b/t/t5100/cr-base64.mbox
new file mode 100644
index 0000000000..6ea9806a6b
--- /dev/null
+++ b/t/t5100/cr-base64.mbox
@@ -0,0 +1,22 @@ 
+From: A U Thor <mail@example.com>
+To: list@example.org
+Subject: [PATCH v2] sample
+Date: Mon,  3 Aug 2020 22:40:55 +0700
+Message-Id: <msg-id@example.com>
+Content-Type: text/plain; charset="utf-8"
+Content-Transfer-Encoding: base64
+
+T24gZGlmZmVyZW50IGRpc3RybywgcHl0ZXN0IGlzIHN1ZmZpeGVkIHdpdGggZGlmZmVyZW50IHBh
+dHRlcm5zLg0KDQotLS0NCiBjb25maWd1cmUgfCAyICstDQogMSBmaWxlIGNoYW5nZWQsIDEgaW5z
+ZXJ0aW9uKCspLCAxIGRlbGV0aW9uKC0pDQoNCmRpZmYgLS1naXQgYS9jb25maWd1cmUgYi9jb25m
+aWd1cmUNCmluZGV4IGRiMzUzOGIzLi5mN2MxYzA5NSAxMDA3NTUNCi0tLSBhL2NvbmZpZ3VyZQ0K
+KysrIGIvY29uZmlndXJlDQpAQCAtODE0LDcgKzgxNCw3IEBAIGlmIFsgJGhhdmVfcHl0aG9uMyAt
+ZXEgMSBdOyB0aGVuDQogICAgIHByaW50ZiAiQ2hlY2tpbmcgZm9yIHB5dGhvbjMgcHl0ZXN0ICg+
+PSAzLjApLi4uICINCiAgICAgY29uZj0kKG1rdGVtcCkNCiAgICAgcHJpbnRmICJbcHl0ZXN0XVxu
+bWludmVyc2lvbj0zLjBcbiIgPiAkY29uZg0KLSAgICBpZiBweXRlc3QtMyAtYyAkY29uZiAtLXZl
+cnNpb24gPi9kZXYvbnVsbCAyPiYxOyB0aGVuDQorICAgIGlmICIkcHl0aG9uIiAtbSBweXRlc3Qg
+LWMgJGNvbmYgLS12ZXJzaW9uID4vZGV2L251bGwgMj4mMTsgdGhlbg0KICAgICAgICAgcHJpbnRm
+ICJZZXMuXG4iDQogICAgICAgICBoYXZlX3B5dGhvbjNfcHl0ZXN0PTENCiAgICAgZWxzZQ0KLS0g
+DQoyLjI4LjANCl9fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f
+CmV4YW1wbGUgbWFpbGluZyBsaXN0IC0tIGxpc3RAZXhhbXBsZS5vcmcKVG8gdW5zdWJzY3JpYmUg
+c2VuZCBhbiBlbWFpbCB0byBsaXN0LWxlYXZlQGV4YW1wbGUub3JnCg==
diff --git a/t/t5100/info1000 b/t/t5100/info1000
new file mode 100644
index 0000000000..dab2228b70
--- /dev/null
+++ b/t/t5100/info1000
@@ -0,0 +1,5 @@ 
+Author: A U Thor
+Email: mail@example.com
+Subject: sample
+Date: Mon, 3 Aug 2020 22:40:55 +0700
+
diff --git a/t/t5100/msg1000 b/t/t5100/msg1000
new file mode 100644
index 0000000000..5e8e860aae
--- /dev/null
+++ b/t/t5100/msg1000
@@ -0,0 +1,2 @@ 
+On different distro, pytest is suffixed with different patterns.
+
diff --git a/t/t5100/patch1000 b/t/t5100/patch1000
new file mode 100644
index 0000000000..51c4fb4cb5
--- /dev/null
+++ b/t/t5100/patch1000
@@ -0,0 +1,22 @@ 
+---
+ configure | 2 +-
+ 1 file changed, 1 insertion(+), 1 deletion(-)
+
+diff --git a/configure b/configure
+index db3538b3..f7c1c095 100755
+--- a/configure
++++ b/configure
+@@ -814,7 +814,7 @@ if [ $have_python3 -eq 1 ]; then
+     printf "Checking for python3 pytest (>= 3.0)... "
+     conf=$(mktemp)
+     printf "[pytest]\nminversion=3.0\n" > $conf
+-    if pytest-3 -c $conf --version >/dev/null 2>&1; then
++    if "$python" -m pytest -c $conf --version >/dev/null 2>&1; then
+         printf "Yes.\n"
+         have_python3_pytest=1
+     else
+-- 
+2.28.0
+_______________________________________________
+example mailing list -- list@example.org
+To unsubscribe send an email to list-leave@example.org