diff mbox series

[resend] git-send-email: Use sanitized address when reading mbox body

Message ID 20240621092721.2980939-2-csokas.bence@prolan.hu (mailing list archive)
State Superseded
Headers show
Series [resend] git-send-email: Use sanitized address when reading mbox body | expand

Commit Message

Csókás, Bence June 21, 2024, 9:27 a.m. UTC
Commas and other punctuation marks in 'Cc: ', 'Signed-off-by: '
etc. lines mess with git-send-email. In parsing the mbox headers,
this is handled by calling `sanitize_address()`. This function
is called when parsing the message body as well, but was only
used for comparing it to $author. Now we add it to @cc too.

Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu>
---
 git-send-email.perl | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Junio C Hamano June 21, 2024, 5:27 p.m. UTC | #1
"Csókás, Bence" <csokas.bence@prolan.hu> writes:

> Commas and other punctuation marks in 'Cc: ', 'Signed-off-by: '
> etc. lines mess with git-send-email. In parsing the mbox headers,
> this is handled by calling `sanitize_address()`. This function
> is called when parsing the message body as well, but was only
> used for comparing it to $author. Now we add it to @cc too.
>
> Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu>
> ---
>  git-send-email.perl | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/git-send-email.perl b/git-send-email.perl
> index f0be4b4560..72044e5ef3 100755
> --- a/git-send-email.perl
> +++ b/git-send-email.perl
> @@ -1847,9 +1847,9 @@ sub pre_process_file {
>  					$what, $_) unless $quiet;
>  				next;
>  			}
> -			push @cc, $c;
> +			push @cc, $sc;
>  			printf(__("(body) Adding cc: %s from line '%s'\n"),
> -				$c, $_) unless $quiet;
> +				$sc, $_) unless $quiet;
>  		}
>  	}

Hmph, there is this piece of code before the block:

			if ($c !~ /.+@.+|<.+>/) {
				printf("(body) Ignoring %s from line '%s'\n",
					$what, $_) unless $quiet;
				next;
			}

This is to reject strings that do not even look like an e-mail
address, but we called sanitize_address() call on $c way before this
check.  I wonder if we should move this block way up, even before
the call to santize_address()?

That was a relatively unrelated tangent.

In the same function, there is this snippet about Cc: (I didn't
check if the same issue is shared with other header fields):

			elsif (/^Cc:\s+(.*)$/i) {
				foreach my $addr (parse_address_line($1)) {
					my $qaddr = unquote_rfc2047($addr);
					my $saddr = sanitize_address($qaddr);
					if ($saddr eq $sender) {
						next if ($suppress_cc{'self'});
					} else {
						next if ($suppress_cc{'cc'});
					}
					printf(__("(mbox) Adding cc: %s from line '%s'\n"),
						$addr, $_) unless $quiet;
					push @cc, $addr;
				}
			}

We seem to use sanitized address only for the purpose of suppressing
Cc, and use the original address given in the input for e-mail
purposes (@cc is the same variable as the patch under discussion
sends the "fixed" address to, which holds the data to formulate the
Cc: header of the outgoing message, I presume?).  So in that way, we
seem to be very consistent.

Possibly we are being consistent in a broken way, but I am not yet
convinced that we are.

It looks to me that there are many other places that we try to be as
faithful to the original as possible.  In the same block as the one
that handles "Cc:" I quoted above, an address on "From:" is also
sent intact into @cc and addresses on "To:" are handled the same
way.

The patch under discussion singles out the addresses on the trailers
in the message body and treat them differently from others, which I
am not sure is what we want to do.
Csókás, Bence June 24, 2024, 7:37 a.m. UTC | #2
Hi!

On 6/21/24 19:27, Junio C Hamano wrote:
> Hmph, there is this piece of code before the block:
> 
> 			if ($c !~ /.+@.+|<.+>/) {
> 				printf("(body) Ignoring %s from line '%s'\n",
> 					$what, $_) unless $quiet;
> 				next;
> 			}
> 
> This is to reject strings that do not even look like an e-mail
> address, but we called sanitize_address() call on $c way before this
> check.  I wonder if we should move this block way up, even before
> the call to santize_address()?
> 
> That was a relatively unrelated tangent.

Hm, maybe. Should I add this to the patch as well?

> In the same function, there is this snippet about Cc: (I didn't
> check if the same issue is shared with other header fields):

Yes, I've seen that. However, the mbox headers are more likely to 
conform to RFC 2047, or at least `git format-patch` generates correct 
mbox headers. (Before sending the original patch, I was playing around 
with purposefully broken mbox files, and at least Thunderbird seems to 
correctly parse these non-conforming mbox headers in _most_ cases, but 
that's probably just extra caution on Mozilla's side).

> It looks to me that there are many other places that we try to be as
> faithful to the original as possible.  In the same block as the one
> that handles "Cc:" I quoted above, an address on "From:" is also
> sent intact into @cc and addresses on "To:" are handled the same
> way.
> 
> The patch under discussion singles out the addresses on the trailers
> in the message body and treat them differently from others, which I
> am not sure is what we want to do.

I think it is a reasonable assumption that the mbox headers will be 
conforming, whereas the message body is just freeform text and no such 
guarantees exist. But if we want to be paranoid about it, we could try 
and sanitize everything.

Bence
Junio C Hamano June 24, 2024, 3:50 p.m. UTC | #3
Csókás Bence <csokas.bence@prolan.hu> writes:

> On 6/21/24 19:27, Junio C Hamano wrote:
>> Hmph, there is this piece of code before the block:
>> 			if ($c !~ /.+@.+|<.+>/) {
>> 				printf("(body) Ignoring %s from line '%s'\n",
>> 					$what, $_) unless $quiet;
>> 				next;
>> 			}
>> This is to reject strings that do not even look like an e-mail
>> address, but we called sanitize_address() call on $c way before this
>> check.  I wonder if we should move this block way up, even before
>> the call to santize_address()?
>> That was a relatively unrelated tangent.
>
> Hm, maybe. Should I add this to the patch as well?

No, it is an unrelated tangent.  We try not to mix unrelated things
into a single patch.  Perhaps as a separate "clean-up" patch after
the dust from the main topic settles, or as a preliminary "clean-up"
before the main topic, would be good (but see below).

>> It looks to me that there are many other places that we try to be as
>> faithful to the original as possible.  In the same block as the one
>> that handles "Cc:" I quoted above, an address on "From:" is also
>> sent intact into @cc and addresses on "To:" are handled the same
>> way.
>> The patch under discussion singles out the addresses on the trailers
>> in the message body and treat them differently from others, which I
>> am not sure is what we want to do.
>
> I think it is a reasonable assumption that the mbox headers will be
> conforming, whereas the message body is just freeform text and no such
> guarantees exist. But if we want to be paranoid about it, we could try
> and sanitize everything.

I actually do not think we want to be more paranoid.  I'd rather
trust what the user gave us, which was where my response came from.

Having said that, given that the "Ignoring ..." check we saw earlier
in this message triggers only for trailers, it may be a sensible
position to take that mail headers are more trustworthy and a random
address-looking strings on the trailer lines should be checked more
carefully.  Because the "Ignoring ..." check is primarily to reject
strings that are not addresses (things like bug IDs, or just
people's names without e-mail addresses to credit for a new feature
request) that we do not ever intend to actually Cc: the message to,
another sensible choice coming from that "strings on trailers may
not even be addresses" position may be not to add the $c to the Cc:
list, if $sc (the sanitized address) and $c (the original address)
are different.  That is, "the simple regexp check currently used to
trigger 'Ignoring ...' message thought that the string looked like
an address, but when we ask sanitize_address, it turns out that it
was not, really."  And if we were to take that route, "Ignoring ..."
check, which I called an unrelated tangent, would become very much
related to the topic at hand, as it would mean the resulting logic
would look more like this:

	my $sc = sanitize_address($c);
	if ($c !~ /.+@.+|<.+>/ || $sc ne $c) {
        	printf("(body) Ignoring %s from line '%s'\n",
		       $what, $_) unless $quiet;
		next;
	}

In any case, if we were to move forward with this topic (whether
"send to corrected $sc instead" or "$c is suspicious, do not add it
to $cc" is picked as the direction), the motivation behind the
design decision to treat the address taken from a trailer line
differently needs to be explained better, I think.  I had a chance
to ask you directly on list, and you gave a good explanation to me
in the message I am responding to, but you will not be around when
future readers of "git log" want to ask the same question "why are
we singling this out while we use the original address in all the
other places?".

Your proposed commit log message is the place to help them.

Oh, before I forget, is this something we can test easily in t9001?
We would want to protect a new behaviour from accidental breakage,
so adding a test or two would be a good thing.

Thanks.
Csókás, Bence June 26, 2024, 8:58 a.m. UTC | #4
Hi!

On 6/24/24 17:50, Junio C Hamano wrote:
> another sensible choice coming from that "strings on trailers may
> not even be addresses" position may be not to add the $c to the Cc:
> list, if $sc (the sanitized address) and $c (the original address)
> are different.  That is, "the simple regexp check currently used to
> trigger 'Ignoring ...' message thought that the string looked like
> an address, but when we ask sanitize_address, it turns out that it
> was not, really."

Maybe. Though we would need to run through the sanitized address through 
`unquote_rfc2047()` first. But I don't think it's necessary; if someone 
feeds us an "unsanitary" address (for instance, there is whitespace 
between the angle brackets), we should try to make sense of it, and 
worst case, the SMTP server rejects it, as it does now.

> In any case, if we were to move forward with this topic (whether
> "send to corrected $sc instead" or "$c is suspicious, do not add it
> to $cc" is picked as the direction), the motivation behind the
> design decision to treat the address taken from a trailer line
> differently needs to be explained better, I think. [...]
> Your proposed commit log message is the place to help them.

Okay. So in short, I should add justification for trusting mbox headers 
and not the message body, correct?

> Oh, before I forget, is this something we can test easily in t9001?
> We would want to protect a new behaviour from accidental breakage,
> so adding a test or two would be a good thing.

Maybe, I'll look into it.

Bence
Junio C Hamano June 26, 2024, 1:33 p.m. UTC | #5
Csókás Bence <csokas.bence@prolan.hu> writes:

>> In any case, if we were to move forward with this topic (whether
>> "send to corrected $sc instead" or "$c is suspicious, do not add it
>> to $cc" is picked as the direction), the motivation behind the
>> design decision to treat the address taken from a trailer line
>> differently needs to be explained better, I think. [...]
>> Your proposed commit log message is the place to help them.
>
> Okay. So in short, I should add justification for trusting mbox
> headers and not the message body, correct?

We want these to be explained for future readers:

 (1) we stuff the sanitized address to @cc in this particular place,
     but we still let all other places copy the original taken from
     the message to @to or @cc (as in the original code).

 (2) the reason behind the above difference is because we trust less
     about the "address looking" strings that appear on the trailer
     lines.

So, not just (2), but in order for the readers to understand why
they should care about (2), they need to be told (1) as well.

Thanks.
diff mbox series

Patch

diff --git a/git-send-email.perl b/git-send-email.perl
index f0be4b4560..72044e5ef3 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -1847,9 +1847,9 @@  sub pre_process_file {
 					$what, $_) unless $quiet;
 				next;
 			}
-			push @cc, $c;
+			push @cc, $sc;
 			printf(__("(body) Adding cc: %s from line '%s'\n"),
-				$c, $_) unless $quiet;
+				$sc, $_) unless $quiet;
 		}
 	}
 	close $fh;