[v5] gitweb: redacted e-mail addresses feature.

Message ID pull.910.v5.git.1616817387441.gitgitgadget@gmail.com
State Superseded

Commit Message

Georgios Kontaxis March 27, 2021, 3:56 a.m. UTC
From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers and they may not respect robots.txt.
This can result in unsolicited messages.

Introduce an 'email-privacy' feature which redacts e-mail addresses
from the generated HTML content. Specifically, obscure addresses
retrieved from the author/committer and comment sections of the
Git log. The feature is off by default.

This feature does not prevent someone from downloading the
unredacted commit log, e.g., by cloning the repository, and
extracting information from it. It aims to hinder the low-
effort, bulk collection of e-mail addresses by web crawlers.

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: redacted e-mail addresses feature.
    
    Changes since v1:
    
     * Turned off the feature by default.
     * Removed duplicate code.
     * Added note about Gitweb consumers receiving redacted logs.
    
    Changes since v2:
    
     * The feature can be set on a per-project basis. ('override' => 1)
    
    Changes since v3:
    
     * Renamed feature to "email-privacy" and improved documentation.
     * Removed UI elements for git-format-patch since it won't be redacted.
     * Simplified calls to the address redaction logic.
     * Mail::Address is now used to reduce false-positive redactions.
    
    Changes since v4:
    
     * Rephrased the commit comment.
     * hide_mailaddrs_if_private is slightly more compact.
    
    Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/910

Range-diff vs v4:

 1:  03a3f41c37ef ! 1:  1427231f9db5 gitweb: redacted e-mail addresses feature.
     @@ Commit message
          Gitweb extracts content from the Git log and makes it accessible
          over HTTP. As a result, e-mail addresses found in commits are
          exposed to web crawlers and they may not respect robots.txt.
     -    This may result in unsolicited messages.
     -    This is a feature for redacting e-mail addresses
     -    from the generated HTML, etc. content.
     +    This can result in unsolicited messages.
     +
     +    Introduce an 'email-privacy' feature which redacts e-mail addresses
     +    from the generated HTML content. Specifically, obscure addresses
     +    retrieved from the author/committer and comment sections of the
     +    Git log. The feature is off by default.
      
          This feature does not prevent someone from downloading the
          unredacted commit log, e.g., by cloning the repository, and
     -    extracting information from it.
     -    It aims to hinder the low-effort bulk collection of e-mail
     -    addresses by web crawlers.
     +    extracting information from it. It aims to hinder the low-
     +    effort, bulk collection of e-mail addresses by web crawlers.
      
          Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
      
     @@ Documentation/gitweb.conf.txt: default font sizes or lineheights are changed (e.
       
      +email-privacy::
      +	Redact e-mail addresses from the generated HTML, etc. content.
     -+	This hides e-mail addresses found in the commit log from HTTP clients.
     ++	This obscures e-mail addresses retrieved from the author/committer
     ++	and comment sections of the Git log.
      +	It is meant to hinder web crawlers that harvest and abuse addresses.
      +	Such crawlers may not respect robots.txt.
     -+	Note that users and user tools also see the addresses redacted.
     ++	Note that users and user tools also see the addresses as redacted.
      +	If Gitweb is not the final step in a workflow then subsequent steps
      +	may misbehave because of the redacted information they receive.
      +	Disabled by default.
     @@ gitweb/gitweb.perl: sub parse_date {
      +		if (!is_mailaddr($match)) {
      +			next;
      +		}
     -+		my $offset = pos $line;
     -+		my $head = substr $line, 0, $offset - length($match);
     ++		my $match_offset = pos($line) - length($match);
     ++		pos $line = $match_offset;
     ++
      +		my $redaction = "<redacted>";
     -+		my $tail = substr $line, $offset;
     -+		$line = $head . $redaction . $tail;
     -+		pos $line = length($head) + length($redaction);
     ++		$line =~ s/\G(<[^>]+>)/$redaction/;
     ++
     ++		pos $line = $match_offset + length($redaction);
      +	}
      +	return $line;
      +}
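
The effect of the loop in the range-diff above can be reproduced outside Gitweb. The following is only a sketch: it uses a single s///ge substitution instead of the patch's pos/\G loop, and looks_like_mailaddr is a hypothetical regex stand-in for the patch's is_mailaddr, which relies on Mail::Address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the patch's is_mailaddr(). The patch parses the
# candidate with Mail::Address; this regex is only a rough approximation.
sub looks_like_mailaddr {
	my ($candidate) = @_;
	return $candidate =~ m/^<[^\s<>@]+@[^\s<>@]+>$/ ? 1 : 0;
}

# Replace every <...> token that looks like an address with "<redacted>",
# leaving other bracketed text untouched.
sub redact_mailaddrs {
	my ($line) = @_;
	# Copy $1 before calling a sub that runs its own regex match.
	$line =~ s{(<[^>]+>)}{ my $tok = $1; looks_like_mailaddr($tok) ? '<redacted>' : $tok }ge;
	return $line;
}

print redact_mailaddrs('A U Thor <author@example.com>'), "\n";
print redact_mailaddrs('see <stdio.h> for details'), "\n";
```

The second call shows why the patch validates each match: bracketed text that is not an address should pass through unchanged.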


 Documentation/gitweb.conf.txt | 11 +++++++
 gitweb/gitweb.perl            | 55 ++++++++++++++++++++++++++++++-----
 t/lib-gitweb.sh               |  3 ++
 3 files changed, 62 insertions(+), 7 deletions(-)


base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f
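
Since the feature is off by default, an administrator has to opt in. A minimal sketch of a configuration file that does so, based on the comment the patch adds next to the feature definition in gitweb.perl (the file name is whatever $GITWEB_CONFIG points to; gitweb_config.perl is the name used by the test library):

```perl
# gitweb_config.perl, i.e. the file that $GITWEB_CONFIG points to.
# Enable e-mail redaction for every project served by this Gitweb instance.
$feature{'email-privacy'}{'default'} = [1];

# The patch declares the feature with 'override' => 1, so individual
# projects are allowed to override this default.
```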

Comments

Eric Wong March 29, 2021, 1:47 a.m. UTC | #1
Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This can result in unsolicited messages.

> Introduce an 'email-privacy' feature which redacts e-mail addresses
> from the generated HTML content

A general reply to the topic: have you considered munging
addresses in a way that is still human readable, but obviously
obfuscated?

On some other project, I settled on HTML "&#8226;" as a replacement
for '.' for admins who enable that option.  The $USER@$NO_DOT
remains as-is for easy identification+recognition of hosts.

I also considered Unicode homographs which can look identical
to replacement characters, too; but rejected that idea since
it would cause grief for legitimate users who would not notice
the homograph when pasting into their mail client.

Anyways, here's the list of candidates I tried:

homograph∂80x24.org
homograph@80x24ͺorg
homograph@80x24·org
homograph@80x24•org
homograph@80x24.org
homograph﹫80x24.org

https://en.wikipedia.org/wiki/Ano_Teleia#Similar_symbols
https://en.wikipedia.org/wiki/Enclosed_A

homographⒶ80x24.org
homograph@80x24 org
homograph@80x24․org
homograph@80x24ꓸorg
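
The "&#8226;" replacement mentioned in this comment is not part of the patch. As an illustration only, it could look like the sketch below; munge_host_dots is a hypothetical helper, and the assumption that only dots in the host part are replaced is a reading of the comment, not something the thread specifies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: swap each '.' in the host part of user@host for the
# HTML entity &#8226; (a bullet), leaving the user part and '@' as they are.
sub munge_host_dots {
	my ($text) = @_;
	$text =~ s{([^\s<>@]+@)([^\s<>@]+)}{
		my ($user_at, $host) = ($1, $2);   # copy before the nested s///
		$host =~ s/\./&#8226;/g;
		$user_at . $host;
	}ge;
	return $text;
}

print munge_host_dots('Eric <homograph@80x24.org>'), "\n";
```

Unlike "<redacted>", the result stays readable in a browser, which is the trade-off discussed in the rest of the thread.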

Georgios Kontaxis March 29, 2021, 3:17 a.m. UTC | #2
> Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This can result in unsolicited messages.
>
>> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> from the generated HTML content
>
> A general reply to the topic: have you considered munging
> addresses in a way that is still human readable, but obviously
> obfuscated?
>
> On some other project, I settled on HTML "&#8226;" as a replacement
> for '.' for admins who enable that option.  The $USER@$NO_DOT
> remains as-is for easy identification+recognition of hosts.
>
Thanks for the suggestion.

People have been trying to hinder address harvesting for a while now.
Replacing '@' with "at", the dot with "dot", adding spaces, etc.
was pretty common at some point. May still be.
I would expect crawlers to have caught up and this includes
all sorts of character encodings and unicode look-alike substitutions.

At the end of the day we are looking for something that's easy for humans
to read but hard for scripts to parse as an e-mail address.
(And that scripts cannot learn through an additional regex)
I'm not aware of anything like that. (I know CAPTCHAs, etc.)

> I also considered Unicode homographs which can look identical
> to replacement characters, too; but rejected that idea since
> it would cause grief for legitimate users who would not notice
> the homograph when pasting into their mail client.
>
> Anyways, here's the list of candidates I tried:
>
> homograph∂80x24.org
> homograph@80x24ͺorg
> homograph@80x24·org
> homograph@80x24•org
> homograph@80x24.org
> homograph﹫80x24.org
>
> https://en.wikipedia.org/wiki/Ano_Teleia#Similar_symbols
> https://en.wikipedia.org/wiki/Enclosed_A
>
> homographⒶ80x24.org
> homograph@80x24 org
> homograph@80x24․org
> homograph@80x24ꓸorg
>

Eric Wong April 8, 2021, 5:16 p.m. UTC | #3
Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
> >> from the generated HTML content
> >
> Eric Wong wrote:
> > A general reply to the topic: have you considered munging
> > addresses in a way that is still human readable, but obviously
> > obfuscated?
> >
> > On some other project, I settled on HTML "&#8226;" as a replacement
> > for '.' for admins who enable that option.  The $USER@$NO_DOT
> > remains as-is for easy identification+recognition of hosts.
> >
> Thanks for the suggestion.
> 
> People have been trying to hinder address harvesting for a while now.
> Replacing '@' with "at", the dot with "dot", adding spaces, etc.
> was pretty common at some point. May still be.
> I would expect crawlers to have caught up and this includes
> all sorts of character encodings and unicode look-alike substitutions.

I figure the crawlers hit a combinatorial explosion and
give up since they'd be wasting time with false-positives.

> > I also considered Unicode homographs which can look identical
> > to replacement characters, too; but rejected that idea since
> > it would cause grief for legitimate users who would not notice
> > the homograph when pasting into their mail client.

As a data point, none of the homograph@ candidates I posted here
on Mar 29 have attracted any attempts on my mail server.

Junio C Hamano April 8, 2021, 9:04 p.m. UTC | #4
Eric Wong <e@80x24.org> writes:

> Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
>> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
>> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> >> from the generated HTML content
>> >
>> Eric Wong wrote:
>> > A general reply to the topic: have you considered munging
>> > addresses in a way that is still human readable, but obviously
>> > obfuscated?
>> ...
>> > I also considered Unicode homographs which can look identical
>> > to replacement characters, too; but rejected that idea since
>> > it would cause grief for legitimate users who would not notice
>> > the homograph when pasting into their mail client.
>
> As a data point, none of the homograph@ candidates I posted here
> on Mar 29 have attracted any attempts on my mail server.

That is an interesting observation.  All homograph@ non-addresses,
if a human corrected the funnies in their spelling, would have hit
whoever handles @80x24.org mailboxes.

I take it to mean that as a future direction, replacing <redacted>
with the obfuscated-but-readable-by-humans homographs is a likely
improvement that would help human users while still inconveniencing
the crawlers.  It may however need some provision to prevent casual
end-users from cutting-and-pasting these homographs, as you said in
your original mention of the homograph approach.

But other than that, does the patch look reasonable?

Thanks.

Eric Wong April 8, 2021, 9:19 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > As a data point, none of the homograph@ candidates I posted here
> > on Mar 29 have attracted any attempts on my mail server.
> 
> That is an interesting observation.  All homograph@ non-addresses,
> if a human corrected the funnies in their spelling, would have hit
> whoever handles @80x24.org mailboxes.
> 
> I take it to mean that as a future direction, replacing <redacted>
> with the obfuscated-but-readable-by-humans homographs is a likely
> improvement that would help human users while still inconveniencing
> the crawlers.  It may however need some provision to prevent casual
> end-users from cutting-and-pasting these homographs, as you said in
> your original mention of the homograph approach.

Yes, exactly.

> But other than that, does the patch look reasonable?

I only took a cursory glance at it, but v6 seemed fine.

Ævar Arnfjörð Bjarmason April 8, 2021, 10:45 p.m. UTC | #6
On Thu, Apr 08 2021, Eric Wong wrote:

> Junio C Hamano <gitster@pobox.com> wrote:
>> Eric Wong <e@80x24.org> writes:
>> > As a data point, none of the homograph@ candidates I posted here
>> > on Mar 29 have attracted any attempts on my mail server.
>> 
>> That is an interesting observation.  All homograph@ non-addresses,
>> if a human corrected the funnies in their spelling, would have hit
>> whoever handles @80x24.org mailboxes.
>> 
>> I take it to mean that as a future direction, replacing <redacted>
>> with the obfuscated-but-readable-by-humans homographs is a likely
>> improvement that would help human users while still inconveniencing
>> the crawlers.  It may however need some provision to prevent casual
>> end-users from cutting-and-pasting these homographs, as you said in
>> your original mention of the homograph approach.
>
> Yes, exactly.
>
>> But other than that, does the patch look reasonable?
>
> I only took a cursory glance at it, but v6 seemed fine.

Ditto, I left a small nit comment about a needless /i in a regex, but I
don't think that needs a re-roll.

Junio C Hamano April 8, 2021, 10:54 p.m. UTC | #7
Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Apr 08 2021, Eric Wong wrote:
>
>> Junio C Hamano <gitster@pobox.com> wrote:
>>> Eric Wong <e@80x24.org> writes:
>>> > As a data point, none of the homograph@ candidates I posted here
>>> > on Mar 29 have attracted any attempts on my mail server.
>>> 
>>> That is an interesting observation.  All homograph@ non-addresses,
>>> if a human corrected the funnies in their spelling, would have hit
>>> whoever handles @80x24.org mailboxes.
>>> 
>>> I take it to mean that as a future direction, replacing <redacted>
>>> with the obfuscated-but-readable-by-humans homographs is a likely
>>> improvement that would help human users while still inconveniencing
>>> the crawlers.  It may however need some provision to prevent casual
>>> end-users from cutting-and-pasting these homographs, as you said in
>>> your original mention of the homograph approach.
>>
>> Yes, exactly.
>>
>>> But other than that, does the patch look reasonable?
>>
>> I only took a cursory glance at it, but v6 seemed fine.
>
> Ditto, I left a small nit comment about a needless /i in a regex, but I
> don't think that needs a re-roll.

Thanks, both.

Will tweak the /i out, and re-queue with acked-by from you two.

Patch

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..34b1d6e22435 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -751,6 +751,17 @@  default font sizes or lineheights are changed (e.g. via adding extra
 CSS stylesheet in `@stylesheets`), it may be appropriate to change
 these values.
 
+email-privacy::
+	Redact e-mail addresses from the generated HTML, etc. content.
+	This obscures e-mail addresses retrieved from the author/committer
+	and comment sections of the Git log.
+	It is meant to hinder web crawlers that harvest and abuse addresses.
+	Such crawlers may not respect robots.txt.
+	Note that users and user tools also see the addresses as redacted.
+	If Gitweb is not the final step in a workflow then subsequent steps
+	may misbehave because of the redacted information they receive.
+	Disabled by default.
+
 highlight::
 	Server-side syntax highlight support in "blob" view.  It requires
 	`$highlight_bin` program to be available (see the description of
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..fe1dbc266ea7 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -21,6 +21,7 @@ 
 use File::Basename qw(basename);
 use Time::HiRes qw(gettimeofday tv_interval);
 use Digest::MD5 qw(md5_hex);
+use Git::LoadCPAN::Mail::Address;
 
 binmode STDOUT, ':utf8';
 
@@ -569,6 +570,15 @@  sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+	# Redact e-mail addresses.
+
+	# To enable system wide have in $GITWEB_CONFIG
+	# $feature{'email-privacy'}{'default'} = [1];
+	'email-privacy' => {
+		'sub' => sub { feature_bool('email-privacy', @_) },
+		'override' => 1,
+		'default' => [0]},
 );
 
 sub gitweb_get_feature {
@@ -3449,6 +3459,33 @@  sub parse_date {
 	return %date;
 }
 
+sub is_mailaddr {
+	my @addrs = Mail::Address->parse(shift);
+	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
+		return 0;
+	}
+	return 1;
+}
+
+sub hide_mailaddrs_if_private {
+	my $line = shift;
+	return $line unless gitweb_check_feature('email-privacy');
+	while ($line =~ m/(<[^>]+>)/g) {
+		my $match = $1;
+		if (!is_mailaddr($match)) {
+			next;
+		}
+		my $match_offset = pos($line) - length($match);
+		pos $line = $match_offset;
+
+		my $redaction = "<redacted>";
+		$line =~ s/\G(<[^>]+>)/$redaction/;
+
+		pos $line = $match_offset + length($redaction);
+	}
+	return $line;
+}
+
 sub parse_tag {
 	my $tag_id = shift;
 	my %tag;
@@ -3465,7 +3502,7 @@  sub parse_tag {
 		} elsif ($line =~ m/^tag (.+)$/) {
 			$tag{'name'} = $1;
 		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
-			$tag{'author'} = $1;
+			$tag{'author'} = hide_mailaddrs_if_private($1);
 			$tag{'author_epoch'} = $2;
 			$tag{'author_tz'} = $3;
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3513,7 +3550,7 @@  sub parse_commit_text {
 		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
 			push @parents, $1;
 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
-			$co{'author'} = to_utf8($1);
+			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'author_epoch'} = $2;
 			$co{'author_tz'} = $3;
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3523,7 +3560,7 @@  sub parse_commit_text {
 				$co{'author_name'} = $co{'author'};
 			}
 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
-			$co{'committer'} = to_utf8($1);
+			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'committer_epoch'} = $2;
 			$co{'committer_tz'} = $3;
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3568,9 +3605,10 @@  sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		$line = hide_mailaddrs_if_private($line);
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -7489,7 +7527,8 @@  sub git_log_generic {
 			         -accesskey => "n", -title => "Alt-n"}, "next");
 	}
 	my $patch_max = gitweb_get_feature('patches');
-	if ($patch_max && !defined $file_name) {
+	if ($patch_max && !defined $file_name &&
+		!gitweb_check_feature('email-privacy')) {
 		if ($patch_max < 0 || @commitlist <= $patch_max) {
 			$paging_nav .= " &sdot; " .
 				$cgi->a({-href => href(action=>"patches", -replay=>1)},
@@ -7550,7 +7589,8 @@  sub git_commit {
 			} @$parents ) .
 			')';
 	}
-	if (gitweb_check_feature('patches') && @$parents <= 1) {
+	if (gitweb_check_feature('patches') && @$parents <= 1 &&
+		!gitweb_check_feature('email-privacy')) {
 		$formats_nav .= " | " .
 			$cgi->a({-href => href(action=>"patch", -replay=>1)},
 				"patch");
@@ -7863,7 +7903,8 @@  sub git_commitdiff {
 		$formats_nav =
 			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
 			        "raw");
-		if ($patch_max && @{$co{'parents'}} <= 1) {
+		if ($patch_max && @{$co{'parents'}} <= 1 &&
+			!gitweb_check_feature('email-privacy')) {
 			$formats_nav .= " | " .
 				$cgi->a({-href => href(action=>"patch", -replay=>1)},
 					"patch");
diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
index 1f32ca66ea51..77fc1298d4c6 100644
--- a/t/lib-gitweb.sh
+++ b/t/lib-gitweb.sh
@@ -67,6 +67,9 @@  gitweb_run () {
 	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
 	export GITWEB_CONFIG
 
+	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
+	export PERL5LIB
+
 	# some of git commands write to STDERR on error, but this is not
 	# written to web server logs, so we are not interested in that:
 	# we are interested only in properly formatted errors/warnings