Message ID | pull.910.v5.git.1616817387441.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Series | [v5] gitweb: redacted e-mail addresses feature. |
Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers and they may not respect robots.txt.
> This can result in unsolicited messages.

> Introduce an 'email-privacy' feature which redacts e-mail addresses
> from the generated HTML content

A general reply to the topic: have you considered munging
addresses in a way that is still human readable, but obviously
obfuscated?

On some other project, I settled on HTML "•" as a replacement
for '.' for admins who enable that option. The $USER@$NO_DOT
remains as-is for easy identification+recognition of hosts.

I also considered Unicode homographs which can look identical
to replacement characters, too; but rejected that idea since
it would cause grief for legitimate users who would not notice
the homograph when pasting into their mail client.

Anyways, here's the list of candidates I tried:

homograph∂80x24.org
homograph@80x24ͺorg
homograph@80x24·org
homograph@80x24•org
homograph@80x24.org
homograph﹫80x24.org

https://en.wikipedia.org/wiki/Ano_Teleia#Similar_symbols
https://en.wikipedia.org/wiki/Enclosed_A

homographⒶ80x24.org
homograph@80x24 org
homograph@80x24․org
homograph@80x24ꓸorg

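For illustration only, the dot-replacement idea described in this message could look roughly like the Python sketch below. It is not code from the other project mentioned or from this patch. The function name `munge_addresses`, the loose address pattern, and the choice to leave the user part untouched are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately loose pattern for user@host.tld.
# It is not an RFC 5322 parser; it only serves this sketch.
ADDRESS_RE = re.compile(r"([\w.+-]+@)([\w-]+(?:\.[\w-]+)+)")


def munge_addresses(text, replacement="\u2022"):
    """Replace each '.' in the host part of an address with `replacement`.

    The user part and the '@' are left as-is (an assumption), so the
    result stays readable to humans but is no longer a valid address.
    """
    def _munge(match):
        user_at, host = match.group(1), match.group(2)
        return user_at + host.replace(".", replacement)

    return ADDRESS_RE.sub(_munge, text)


if __name__ == "__main__":
    print(munge_addresses("Contact homograph@80x24.org for details."))
```

When the output is HTML, the replacement could just as well be an entity string passed in as `replacement`; the sketch uses the literal bullet character to keep it short.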
> Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers and they may not respect robots.txt.
>> This can result in unsolicited messages.
>
>> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> from the generated HTML content
>
> A general reply to the topic: have you considered munging
> addresses in a way that is still human readable, but obviously
> obfuscated?
>
> On some other project, I settled on HTML "•" as a replacement
> for '.' for admins who enable that option. The $USER@$NO_DOT
> remains as-is for easy identification+recognition of hosts.
>
Thanks for the suggestion.

People have been trying to hinder address harvesting for a while now.
Replacing '@' with "at", the dot with "dot", adding spaces, etc.
was pretty common at some point. May still be.
I would expect crawlers to have caught up and this includes
all sorts of character encodings and unicode look-alike substitutions.

At the end of the day we are looking for something that's easy for
humans to read but hard for scripts to parse as an e-mail address.
(And that scripts cannot learn through an additional regex)
I'm not aware of anything like that. (I know CAPTCHAs, etc.)

> I also considered Unicode homographs which can look identical
> to replacement characters, too; but rejected that idea since
> it would cause grief for legitimate users who would not notice
> the homograph when pasting into their mail client.
>
> Anyways, here's the list of candidates I tried:
>
> homograph∂80x24.org
> homograph@80x24ͺorg
> homograph@80x24·org
> homograph@80x24•org
> homograph@80x24.org
> homograph﹫80x24.org
>
> https://en.wikipedia.org/wiki/Ano_Teleia#Similar_symbols
> https://en.wikipedia.org/wiki/Enclosed_A
>
> homographⒶ80x24.org
> homograph@80x24 org
> homograph@80x24․org
> homograph@80x24ꓸorg
>

Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
> >> from the generated HTML content
> >
> Eric Wong wrote:
> > A general reply to the topic: have you considered munging
> > addresses in a way that is still human readable, but obviously
> > obfuscated?
> >
> > On some other project, I settled on HTML "•" as a replacement
> > for '.' for admins who enable that option. The $USER@$NO_DOT
> > remains as-is for easy identification+recognition of hosts.
> >
> Thanks for the suggestion.
>
> People have been trying to hinder address harvesting for a while now.
> Replacing '@' with "at", the dot with "dot", adding spaces, etc.
> was pretty common at some point. May still be.
> I would expect crawlers to have caught up and this includes
> all sorts of character encodings and unicode look-alike substitutions.

I figure the crawlers hit a combinatorial explosion and give up
since they'd be wasting time with false-positives.

> > I also considered Unicode homographs which can look identical
> > to replacement characters, too; but rejected that idea since
> > it would cause grief for legitimate users who would not notice
> > the homograph when pasting into their mail client.

As a data point, none of the homograph@ candidates I posted here
on Mar 29 have attracted any attempts on my mail server.

Eric Wong <e@80x24.org> writes:

> Georgios Kontaxis <geko1702+commits@99rst.org> wrote:
>> > Georgios Kontaxis via GitGitGadget <gitgitgadget@gmail.com> wrote:
>> >> Introduce an 'email-privacy' feature which redacts e-mail addresses
>> >> from the generated HTML content
>> >
>> Eric Wong wrote:
>> > A general reply to the topic: have you considered munging
>> > addresses in a way that is still human readable, but obviously
>> > obfuscated?
>> ...
>> > I also considered Unicode homographs which can look identical
>> > to replacement characters, too; but rejected that idea since
>> > it would cause grief for legitimate users who would not notice
>> > the homograph when pasting into their mail client.
>
> As a data point, none of the homograph@ candidates I posted here
> on Mar 29 have attracted any attempts on my mail server.

That is an interesting observation. All homograph@ non-addresses,
if a human corrected the funnies in their spelling, would have hit
whoever handles @80x24.org mailboxes.

I take it to mean that as a future direction, replacing <redacted>
with the obfuscated-but-readable-by-humans homographs is a likely
improvement that would help human users while still inconveniencing
the crawlers. It may however need some provision to prevent casual
end-users from cutting-and-pasting these homographs, as you said in
your original mention of the homograph approach.

But other than that, does the patch look reasonable?

Thanks.

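The copy-and-paste hazard raised here can be made concrete with a small Python sketch. It only lists the non-ASCII characters in a string together with their Unicode names, so a look-alike dot or at-sign becomes visible. The helper name `non_ascii_chars` is hypothetical, and nothing like it is part of the patch. A real tool would also have to allow for legitimate internationalized addresses.

```python
import unicodedata


def non_ascii_chars(address):
    """Return (character, Unicode name) pairs for every non-ASCII character.

    A plain ASCII address yields an empty list; a munged or homograph
    address reveals which characters were substituted.
    """
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"))
        for ch in address
        if ord(ch) > 127
    ]


if __name__ == "__main__":
    for candidate in ("homograph@80x24.org", "homograph@80x24\u2022org"):
        print(candidate, non_ascii_chars(candidate))
```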
Junio C Hamano <gitster@pobox.com> wrote:
> Eric Wong <e@80x24.org> writes:
> > As a data point, none of the homograph@ candidates I posted here
> > on Mar 29 have attracted any attempts on my mail server.
>
> That is an interesting observation. All homograph@ non-addresses,
> if a human corrected the funnies in their spelling, would have hit
> whoever handles @80x24.org mailboxes.
>
> I take it to mean that as a future direction, replacing <redacted>
> with the obfuscated-but-readable-by-humans homographs is a likely
> improvement that would help human users while still inconveniencing
> the crawlers. It may however need some provision to prevent casual
> end-users from cutting-and-pasting these homographs, as you said in
> your original mention of the homograph approach.

Yes, exactly.

> But other than that, does the patch look reasonable?

I only took a cursory glance at it, but v6 seemed fine.

On Thu, Apr 08 2021, Eric Wong wrote:

> Junio C Hamano <gitster@pobox.com> wrote:
>> Eric Wong <e@80x24.org> writes:
>> > As a data point, none of the homograph@ candidates I posted here
>> > on Mar 29 have attracted any attempts on my mail server.
>>
>> That is an interesting observation. All homograph@ non-addresses,
>> if a human corrected the funnies in their spelling, would have hit
>> whoever handles @80x24.org mailboxes.
>>
>> I take it to mean that as a future direction, replacing <redacted>
>> with the obfuscated-but-readable-by-humans homographs is a likely
>> improvement that would help human users while still inconveniencing
>> the crawlers. It may however need some provision to prevent casual
>> end-users from cutting-and-pasting these homographs, as you said in
>> your original mention of the homograph approach.
>
> Yes, exactly.
>
>> But other than that, does the patch look reasonable?
>
> I only took a cursory glance at it, but v6 seemed fine.

Ditto, I left a small nit comment about a needless /i in a regex, but I
don't think that needs a re-roll.

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> On Thu, Apr 08 2021, Eric Wong wrote:
>
>> Junio C Hamano <gitster@pobox.com> wrote:
>>> Eric Wong <e@80x24.org> writes:
>>> > As a data point, none of the homograph@ candidates I posted here
>>> > on Mar 29 have attracted any attempts on my mail server.
>>>
>>> That is an interesting observation. All homograph@ non-addresses,
>>> if a human corrected the funnies in their spelling, would have hit
>>> whoever handles @80x24.org mailboxes.
>>>
>>> I take it to mean that as a future direction, replacing <redacted>
>>> with the obfuscated-but-readable-by-humans homographs is a likely
>>> improvement that would help human users while still inconveniencing
>>> the crawlers. It may however need some provision to prevent casual
>>> end-users from cutting-and-pasting these homographs, as you said in
>>> your original mention of the homograph approach.
>>
>> Yes, exactly.
>>
>>> But other than that, does the patch look reasonable?
>>
>> I only took a cursory glance at it, but v6 seemed fine.
>
> Ditto, I left a small nit comment about a needless /i in a regex, but I
> don't think that needs a re-roll.

Thanks, both. Will tweak the /i out, and re-queue with acked-by
from you two.

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..34b1d6e22435 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -751,6 +751,17 @@ default font sizes or lineheights are changed (e.g. via adding extra
 CSS stylesheet in `@stylesheets`), it may be appropriate to change
 these values.
 
+email-privacy::
+	Redact e-mail addresses from the generated HTML, etc. content.
+	This obscures e-mail addresses retrieved from the author/committer
+	and comment sections of the Git log.
+	It is meant to hinder web crawlers that harvest and abuse addresses.
+	Such crawlers may not respect robots.txt.
+	Note that users and user tools also see the addresses as redacted.
+	If Gitweb is not the final step in a workflow then subsequent steps
+	may misbehave because of the redacted information they receive.
+	Disabled by default.
+
 highlight::
 	Server-side syntax highlight support in "blob" view. It requires
 	`$highlight_bin` program to be available (see the description of
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..fe1dbc266ea7 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -21,6 +21,7 @@
 use File::Basename qw(basename);
 use Time::HiRes qw(gettimeofday tv_interval);
 use Digest::MD5 qw(md5_hex);
+use Git::LoadCPAN::Mail::Address;
 
 binmode STDOUT, ':utf8';
 
@@ -569,6 +570,15 @@ sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+	# Redact e-mail addresses.
+
+	# To enable system wide have in $GITWEB_CONFIG
+	# $feature{'email-privacy'}{'default'} = [1];
+	'email-privacy' => {
+		'sub' => sub { feature_bool('email-privacy', @_) },
+		'override' => 1,
+		'default' => [0]},
 );
 
 sub gitweb_get_feature {
@@ -3449,6 +3459,33 @@ sub parse_date {
 	return %date;
 }
 
+sub is_mailaddr {
+	my @addrs = Mail::Address->parse(shift);
+	if (!@addrs || !$addrs[0]->host || !$addrs[0]->user) {
+		return 0;
+	}
+	return 1;
+}
+
+sub hide_mailaddrs_if_private {
+	my $line = shift;
+	return $line unless gitweb_check_feature('email-privacy');
+	while ($line =~ m/(<[^>]+>)/g) {
+		my $match = $1;
+		if (!is_mailaddr($match)) {
+			next;
+		}
+		my $match_offset = pos($line) - length($match);
+		pos $line = $match_offset;
+
+		my $redaction = "<redacted>";
+		$line =~ s/\G(<[^>]+>)/$redaction/;
+
+		pos $line = $match_offset + length($redaction);
+	}
+	return $line;
+}
+
 sub parse_tag {
 	my $tag_id = shift;
 	my %tag;
@@ -3465,7 +3502,7 @@ sub parse_tag {
 		} elsif ($line =~ m/^tag (.+)$/) {
 			$tag{'name'} = $1;
 		} elsif ($line =~ m/^tagger (.*) ([0-9]+) (.*)$/) {
-			$tag{'author'} = $1;
+			$tag{'author'} = hide_mailaddrs_if_private($1);
 			$tag{'author_epoch'} = $2;
 			$tag{'author_tz'} = $3;
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3513,7 +3550,7 @@ sub parse_commit_text {
 		} elsif ((!defined $withparents) && ($line =~ m/^parent ($oid_regex)$/)) {
 			push @parents, $1;
 		} elsif ($line =~ m/^author (.*) ([0-9]+) (.*)$/) {
-			$co{'author'} = to_utf8($1);
+			$co{'author'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'author_epoch'} = $2;
 			$co{'author_tz'} = $3;
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3523,7 +3560,7 @@ sub parse_commit_text {
 				$co{'author_name'} = $co{'author'};
 			}
 		} elsif ($line =~ m/^committer (.*) ([0-9]+) (.*)$/) {
-			$co{'committer'} = to_utf8($1);
+			$co{'committer'} = hide_mailaddrs_if_private(to_utf8($1));
 			$co{'committer_epoch'} = $2;
 			$co{'committer_tz'} = $3;
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
@@ -3568,9 +3605,10 @@ sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		$line = hide_mailaddrs_if_private($line);
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -7489,7 +7527,8 @@ sub git_log_generic {
 			         -accesskey => "n", -title => "Alt-n"}, "next");
 	}
 	my $patch_max = gitweb_get_feature('patches');
-	if ($patch_max && !defined $file_name) {
+	if ($patch_max && !defined $file_name &&
+		!gitweb_check_feature('email-privacy')) {
 		if ($patch_max < 0 || @commitlist <= $patch_max) {
 			$paging_nav .= " &sdot; " .
 				$cgi->a({-href => href(action=>"patches", -replay=>1)},
@@ -7550,7 +7589,8 @@ sub git_commit {
 			} @$parents ) .
 			')';
 	}
-	if (gitweb_check_feature('patches') && @$parents <= 1) {
+	if (gitweb_check_feature('patches') && @$parents <= 1 &&
+		!gitweb_check_feature('email-privacy')) {
 		$formats_nav .= " | " .
 			$cgi->a({-href => href(action=>"patch", -replay=>1)},
 				"patch");
@@ -7863,7 +7903,8 @@ sub git_commitdiff {
 		$formats_nav =
 			$cgi->a({-href => href(action=>"commitdiff_plain", -replay=>1)},
 			        "raw");
-		if ($patch_max && @{$co{'parents'}} <= 1) {
+		if ($patch_max && @{$co{'parents'}} <= 1 &&
+			!gitweb_check_feature('email-privacy')) {
 			$formats_nav .= " | " .
 				$cgi->a({-href => href(action=>"patch", -replay=>1)},
 					"patch");
diff --git a/t/lib-gitweb.sh b/t/lib-gitweb.sh
index 1f32ca66ea51..77fc1298d4c6 100644
--- a/t/lib-gitweb.sh
+++ b/t/lib-gitweb.sh
@@ -67,6 +67,9 @@ gitweb_run () {
 	GITWEB_CONFIG=$(pwd)/gitweb_config.perl
 	export GITWEB_CONFIG
 
+	PERL5LIB="$GIT_BUILD_DIR/perl:$GIT_BUILD_DIR/perl/FromCPAN"
+	export PERL5LIB
+
 	# some of git commands write to STDERR on error, but this is not
 	# written to web server logs, so we are not interested in that:
 	# we are interested only in properly formatted errors/warnings
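As a reading aid, the redaction logic of `is_mailaddr` and `hide_mailaddrs_if_private` can be approximated in Python as follows. This is a loose sketch, not a port. It uses `email.utils.parseaddr` from the Python standard library in place of `Mail::Address->parse`, and the two parsers may disagree on unusual input. The `enabled` parameter is a stand-in for `gitweb_check_feature('email-privacy')`.

```python
import re
from email.utils import parseaddr

# Same shape as the patch's m/(<[^>]+>)/: any non-empty <...> span.
ANGLE_RE = re.compile(r"<[^>]+>")


def is_mailaddr(candidate):
    """True when `candidate` parses to an address with a user and a host part."""
    _, addr = parseaddr(candidate)
    user, sep, host = addr.rpartition("@")
    return bool(sep and user and host)


def hide_mailaddrs(line, enabled=True):
    """Replace every <...> span that looks like an address with <redacted>.

    `enabled` is an assumption standing in for the gitweb feature check.
    Spans that are not addresses, such as <stdio.h>, are left alone.
    """
    if not enabled:
        return line

    def _redact(match):
        span = match.group(0)
        return "<redacted>" if is_mailaddr(span) else span

    return ANGLE_RE.sub(_redact, line)


if __name__ == "__main__":
    print(hide_mailaddrs("Junio C Hamano <gitster@pobox.com>"))
    print(hide_mailaddrs("#include <stdio.h> in the commit"))
```

The Perl version walks the line with `pos` and `\G` so that each replacement resumes scanning right after the inserted `<redacted>`; the callback passed to `re.sub` here achieves the same single left-to-right pass.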