gitweb: redacted e-mail addresses feature.

Message ID pull.910.git.1616283780358.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Series gitweb: redacted e-mail addresses feature.

Commit Message

Georgios Kontaxis March 20, 2021, 11:42 p.m. UTC
From: Georgios Kontaxis <geko1702+commits@99rst.org>

Gitweb extracts content from the Git log and makes it accessible
over HTTP. As a result, e-mail addresses found in commits are
exposed to web crawlers. This may result in unsolicited messages.
This is a feature for redacting e-mail addresses from the generated
HTML content.

This feature does not prevent someone from downloading the
unredacted commit log and extracting information from it.
It aims to hinder the low-effort bulk collection of e-mail
addresses by web crawlers.

Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
---
    gitweb: Redacted e-mail addresses feature.
    
    Gitweb extracts content from the Git log and makes it accessible over
    HTTP. As a result, e-mail addresses found in commits are exposed to web
    crawlers. This may result in unsolicited messages. This is a feature for
    redacting e-mail addresses from the generated HTML content.
    
    This feature does not prevent someone from downloading the unredacted
    commit log and extracting information from it. It aims to hinder the
    low-effort bulk collection of e-mail addresses by web crawlers.
    
    Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/910

 Documentation/gitweb.conf.txt | 12 ++++++++++++
 gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
 2 files changed, 45 insertions(+), 3 deletions(-)


base-commit: a5828ae6b52137b913b978e16cd2334482eb4c1f

Comments

Ævar Arnfjörð Bjarmason March 21, 2021, 12:42 a.m. UTC | #1
On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:

> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>
> Gitweb extracts content from the Git log and makes it accessible
> over HTTP. As a result, e-mail addresses found in commits are
> exposed to web crawlers. This may result in unsolicited messages.
> This is a feature for redacting e-mail addresses from the generated
> HTML content.
>
> This feature does not prevent someone from downloading the
> unredacted commit log and extracting information from it.
> It aims to hinder the low-effort bulk collection of e-mail
> addresses by web crawlers.

So web crawlers that aren't going to obey robots.txt?

I'm not opposed to this feature, but a glance at gitweb's documentation
seems to show that we don't discuss how to set robots.txt up for it at
all.

Perhaps having that in the docs or otherwise in the default setup would
get us most of the win of this feature?
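
For reference, a minimal robots.txt sketch of the kind being suggested. It assumes gitweb is served under the hypothetical path /gitweb/, so adjust it to the actual mount point. Well-behaved crawlers honor such a file; nothing forces a crawler to.

```
# Hypothetical mount point; replace with wherever gitweb actually lives.
User-agent: *
Disallow: /gitweb/
```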

> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
> ---

Odd:

>     gitweb: Redacted e-mail addresses feature.
>     
>     Gitweb extracts content from the Git log and makes it accessible over
>     HTTP. As a result, e-mail addresses found in commits are exposed to web
>     crawlers. This may result in unsolicited messages. This is a feature for
>     redacting e-mail addresses from the generated HTML content.
>     
>     This feature does not prevent someone from downloading the unredacted
>     commit log and extracting information from it. It aims to hinder the
>     low-effort bulk collection of e-mail addresses by web crawlers.
>     
>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org

To have this duplication of the patch here below "---", some GGG feature
gone awry?

> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-910/kontaxis/kontaxis/email_privacy-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>
>  Documentation/gitweb.conf.txt | 12 ++++++++++++
>  gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> index 7963a79ba98b..10653d8670a8 100644
> --- a/Documentation/gitweb.conf.txt
> +++ b/Documentation/gitweb.conf.txt
> @@ -896,6 +896,18 @@ same as of the snippet above:
>  It is an error to specify a ref that does not pass "git check-ref-format"
>  scrutiny. Duplicated values are filtered.
>  
> +email_privacy::
> +    Redact e-mail addresses from the generated HTML, etc. content.
> +    This hides e-mail addresses found in the commit log from web crawlers.
> +    Enabled by default.
> ++
> +It is highly recommended to keep this feature enabled unless web crawlers
> +are hindered in some other way. You can disable this feature as shown below:
> ++
> +---------------------------------------------------------------------------
> +$feature{'email_privacy'}{'default'} = [0];
> +---------------------------------------------------------------------------

I think there's plenty of gitweb users that are going to be relying on
the current behavior, so doesn't it make more sense for this to be
opt-in rather than opt-out?

>  
>  EXAMPLES
>  --------
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index 0959a782eccb..9d21c2583e18 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -569,6 +569,15 @@ sub evaluate_uri {
>  		'sub' => \&feature_extra_branch_refs,
>  		'override' => 0,
>  		'default' => []},
> +
> +    # Redact e-mail addresses.
> +
> +    # To disable system wide have in $GITWEB_CONFIG
> +    # $feature{'email_privacy'}{'default'} = [0];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 0,
> +		'default' => [1]},
>  );
> [...]
>  sub gitweb_get_feature {
> @@ -3471,6 +3480,10 @@ sub parse_tag {
>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$tag{'author_name'}  = $1;
>  				$tag{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$tag{'author_email'} = "private";
> +					$tag{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$tag{'author_name'} = $tag{'author'};
>  			}
> @@ -3519,6 +3532,10 @@ sub parse_commit_text {
>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'author_name'}  = $1;
>  				$co{'author_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'author_email'} = "private";
> +					$co{'author'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'author_name'} = $co{'author'};
>  			}
> @@ -3529,6 +3546,10 @@ sub parse_commit_text {
>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>  				$co{'committer_name'}  = $1;
>  				$co{'committer_email'} = $2;
> +				if (gitweb_check_feature('email_privacy')) {
> +					$co{'committer_email'} = "private";
> +					$co{'committer'} =~ s/<([^>]+)>/<private>/;
> +				}
>  			} else {
>  				$co{'committer_name'} = $co{'committer'};
>  			}
> @@ -3568,9 +3589,13 @@ sub parse_commit_text {
>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>  	}
> -	# remove added spaces
> +	# remove added spaces, redact e-mail addresses if applicable.
>  	foreach my $line (@commit_lines) {
>  		$line =~ s/^    //;
> +		if (gitweb_check_feature('email_privacy') &&
> +			$line =~ m/^([^<]+) <([^>]*)>/) {
> +			$line =~ s/<([^>]+)>/<private>/;
> +		}
>  	}
>  	$co{'comment'} = \@commit_lines;

All of these hunks (and the below) should use some new function that
does this feature check + sanitizing instead of copy/pasting mostly the
same code N times. e.g.:
    
    sub maybe_hide_email {
        my $email = shift;
        return $email unless gitweb_check_feature('email_privacy');
        return hide_email($email);
    }

then:

    $tag{author_email} = maybe_hide_email($2);

Also it looks like this isn't a new issue, but does this need to
implement its own E-Mail parser? We ship with Mail::Address for
git-send-email, can gitweb (and the elided hide_email() function above)
use that too?
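
A minimal, self-contained sketch of how the two helpers could fit together. The body of hide_email() is an assumption (the suggestion above leaves it elided), and gitweb_check_feature() is stubbed with a hypothetical %feature hash so the snippet runs outside gitweb. This variant takes a whole "Name <address>" string rather than the bare `$2` capture.

```perl
use strict;
use warnings;

# Hypothetical stand-in for gitweb's feature lookup, so the sketch runs on its own.
my %feature = (email_privacy => 1);
sub gitweb_check_feature { return $feature{ $_[0] }; }

# Assumed implementation: replace anything between angle brackets
# with a fixed placeholder.
sub hide_email {
    my $text = shift;
    $text =~ s/<[^>]*>/<private>/g;
    return $text;
}

# Redact only when the feature is switched on.
sub maybe_hide_email {
    my $text = shift;
    return $text unless gitweb_check_feature('email_privacy');
    return hide_email($text);
}

print maybe_hide_email('A U Thor <author@example.com>'), "\n";
```

With a helper like this, each call site shrinks to a single assignment, and the feature check lives in one place.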


> @@ -8060,8 +8085,13 @@ sub git_commitdiff {
>  		close $fd
>  			or print "Reading git-diff-tree failed\n";
>  	} elsif ($format eq 'patch') {
> -		local $/ = undef;
> -		print <$fd>;
> +		while (my $line = <$fd>) {
> +			if (gitweb_check_feature('email_privacy') &&
> +				$line =~ m/^([^<]+) <([^>]*)>/) {
> +				$line =~ s/<([^>]+)>/<private>/;
> +			}
> +			print $line;
> +		}
>  		close $fd
>  			or print "Reading git-format-patch failed\n";

Is that "patch" output meant for "git am"? Won't this severely break
that use-case if so?
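
To make the concern concrete, here is a small sketch that applies the same match and substitution as the 'patch' hunk to two sample lines. The helper name redact_line() and the sample lines are illustrative assumptions, and the feature check is dropped so the snippet runs standalone.

```perl
use strict;
use warnings;

# Same match and substitution as in the 'patch' hunk,
# minus the gitweb_check_feature() call.
sub redact_line {
    my $line = shift;
    if ($line =~ m/^([^<]+) <([^>]*)>/) {
        $line =~ s/<([^>]+)>/<private>/;
    }
    return $line;
}

# A mail header and an added line of C source, as they might
# appear in format-patch output.
for my $line ('From: A U Thor <author@example.com>',
              '+#include <stdio.h>') {
    print redact_line($line), "\n";
}
```

The header loses the real address, and the pattern also rewrites the `#include` line, so the redacted output would no longer reproduce the original change.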

brian m. carlson March 21, 2021, 1:27 a.m. UTC | #2
On 2021-03-21 at 00:42:58, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
> 
> > From: Georgios Kontaxis <geko1702+commits@99rst.org>
> >
> > Gitweb extracts content from the Git log and makes it accessible
> > over HTTP. As a result, e-mail addresses found in commits are
> > exposed to web crawlers. This may result in unsolicited messages.
> > This is a feature for redacting e-mail addresses from the generated
> > HTML content.
> >
> > This feature does not prevent someone from downloading the
> > unredacted commit log and extracting information from it.
> > It aims to hinder the low-effort bulk collection of e-mail
> > addresses by web crawlers.
> 
> So web crawlers that aren't going to obey robots.txt?
> 
> I'm not opposed to this feature, but a glance at gitweb's documentation
> seems to show that we don't discuss how to set robots.txt up for it at
> all.
> 
> Perhaps having that in the docs or otherwise in the default setup would
> get us most of the win of this feature?

I'm going to guess that the two features are orthogonal.  robots.txt is
great for communicating to well-meaning actors what you do and don't
want crawled.  For example, one might ask a web crawler not to crawl
individual commits because that creates excessive load on the server.

This option is about preventing email harvesting, usually for the
purposes of sending spam.  Spam is email abuse and all reasonable people
know it's unacceptable, so by definition the people doing this are bad
actors and are not likely to honor the robots.txt.  As someone who runs
his own mail server, that is certainly my experience.

So I am in favor of this feature.  I think it mirrors what many other
tools do in this space and having it as an option is valuable.

> > diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
> > index 7963a79ba98b..10653d8670a8 100644
> > --- a/Documentation/gitweb.conf.txt
> > +++ b/Documentation/gitweb.conf.txt
> > @@ -896,6 +896,18 @@ same as of the snippet above:
> >  It is an error to specify a ref that does not pass "git check-ref-format"
> >  scrutiny. Duplicated values are filtered.
> >  
> > +email_privacy::
> > +    Redact e-mail addresses from the generated HTML, etc. content.
> > +    This hides e-mail addresses found in the commit log from web crawlers.
> > +    Enabled by default.
> > ++
> > +It is highly recommended to keep this feature enabled unless web crawlers
> > +are hindered in some other way. You can disable this feature as shown below:
> > ++
> > +---------------------------------------------------------------------------
> > +$feature{'email_privacy'}{'default'} = [0];
> > +---------------------------------------------------------------------------
> 
> I think there's plenty of gitweb users that are going to be relying on
> the current behavior, so doesn't it make more sense for this to be
> opt-in rather than opt-out?

I agree this makes sense as an opt-in feature.  While many people will
want to enable it, users who are performing an upgrade won't necessarily
want the behavior to change right away.

Georgios Kontaxis March 21, 2021, 3:30 a.m. UTC | #3
>
> On Sun, Mar 21 2021, Georgios Kontaxis via GitGitGadget wrote:
>
>> From: Georgios Kontaxis <geko1702+commits@99rst.org>
>>
>> Gitweb extracts content from the Git log and makes it accessible
>> over HTTP. As a result, e-mail addresses found in commits are
>> exposed to web crawlers. This may result in unsolicited messages.
>> This is a feature for redacting e-mail addresses from the generated
>> HTML content.
>>
>> This feature does not prevent someone from downloading the
>> unredacted commit log and extracting information from it.
>> It aims to hinder the low-effort bulk collection of e-mail
>> addresses by web crawlers.
>
> So web crawlers that aren't going to obey robots.txt?
>
> I'm not opposed to this feature, but a glance at gitweb's documentation
> seems to show that we don't discuss how to set robots.txt up for it at
> all.
>
> Perhaps having that in the docs or otherwise in the default setup would
> get us most of the win of this feature?
>
File robots.txt is basically asking nicely and we should work on that.
At the same time crawlers that look for addresses to send SPAM to
will probably ignore it so this change is meant for them.

>> Signed-off-by: Georgios Kontaxis <geko1702+commits@99rst.org>
>> ---
>
> Odd:
>
>>     gitweb: Redacted e-mail addresses feature.
>>
>>     Gitweb extracts content from the Git log and makes it accessible
>> over
>>     HTTP. As a result, e-mail addresses found in commits are exposed to
>> web
>>     crawlers. This may result in unsolicited messages. This is a feature
>> for
>>     redacting e-mail addresses from the generated HTML content.
>>
>>     This feature does not prevent someone from downloading the
>> unredacted
>>     commit log and extracting information from it. It aims to hinder the
>>     low-effort bulk collection of e-mail addresses by web crawlers.
>>
>>     Signed-off-by: Georgios Kontaxis geko1702+commits@99rst.org
>
> To have this duplication of the patch here below "---", some GGG feature
> gone awry?
>
>> Published-As:
>> https://github.com/gitgitgadget/git/releases/tag/pr-910%2Fkontaxis%2Fkontaxis%2Femail_privacy-v1
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git
>> pr-910/kontaxis/kontaxis/email_privacy-v1
>> Pull-Request: https://github.com/gitgitgadget/git/pull/910
>>
>>  Documentation/gitweb.conf.txt | 12 ++++++++++++
>>  gitweb/gitweb.perl            | 36 ++++++++++++++++++++++++++++++++---
>>  2 files changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/gitweb.conf.txt
>> b/Documentation/gitweb.conf.txt
>> index 7963a79ba98b..10653d8670a8 100644
>> --- a/Documentation/gitweb.conf.txt
>> +++ b/Documentation/gitweb.conf.txt
>> @@ -896,6 +896,18 @@ same as of the snippet above:
>>  It is an error to specify a ref that does not pass "git
>> check-ref-format"
>>  scrutiny. Duplicated values are filtered.
>>
>> +email_privacy::
>> +    Redact e-mail addresses from the generated HTML, etc. content.
>> +    This hides e-mail addresses found in the commit log from web
>> crawlers.
>> +    Enabled by default.
>> ++
>> +It is highly recommended to keep this feature enabled unless web
>> crawlers
>> +are hindered in some other way. You can disable this feature as shown
>> below:
>> ++
>> +---------------------------------------------------------------------------
>> +$feature{'email_privacy'}{'default'} = [0];
>> +---------------------------------------------------------------------------
>
> I think there's plenty of gitweb users that are going to be relying on
> the current behavior, so doesn't it make more sense for this to be
> opt-in rather than opt-out?
>
My concern is that Gitweb operators may not understand the need
for this feature or maybe won't be aware the feature exists.

Nevertheless, I've changed the feature to be off by default.
Perhaps we can revisit this decision in the future? :)

>>
>>  EXAMPLES
>>  --------
>> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
>> index 0959a782eccb..9d21c2583e18 100755
>> --- a/gitweb/gitweb.perl
>> +++ b/gitweb/gitweb.perl
>> @@ -569,6 +569,15 @@ sub evaluate_uri {
>>  		'sub' => \&feature_extra_branch_refs,
>>  		'override' => 0,
>>  		'default' => []},
>> +
>> +    # Redact e-mail addresses.
>> +
>> +    # To disable system wide have in $GITWEB_CONFIG
>> +    # $feature{'email_privacy'}{'default'} = [0];
>> +	'email_privacy' => {
>> +		'sub' => sub { feature_bool('email_privacy', @_) },
>> +		'override' => 0,
>> +		'default' => [1]},
>>  );
>> [...]
>>  sub gitweb_get_feature {
>> @@ -3471,6 +3480,10 @@ sub parse_tag {
>>  			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$tag{'author_name'}  = $1;
>>  				$tag{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$tag{'author_email'} = "private";
>> +					$tag{'author'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$tag{'author_name'} = $tag{'author'};
>>  			}
>> @@ -3519,6 +3532,10 @@ sub parse_commit_text {
>>  			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'author_name'}  = $1;
>>  				$co{'author_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'author_email'} = "private";
>> +					$co{'author'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$co{'author_name'} = $co{'author'};
>>  			}
>> @@ -3529,6 +3546,10 @@ sub parse_commit_text {
>>  			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
>>  				$co{'committer_name'}  = $1;
>>  				$co{'committer_email'} = $2;
>> +				if (gitweb_check_feature('email_privacy')) {
>> +					$co{'committer_email'} = "private";
>> +					$co{'committer'} =~ s/<([^>]+)>/<private>/;
>> +				}
>>  			} else {
>>  				$co{'committer_name'} = $co{'committer'};
>>  			}
>> @@ -3568,9 +3589,13 @@ sub parse_commit_text {
>>  	if (! defined $co{'title'} || $co{'title'} eq "") {
>>  		$co{'title'} = $co{'title_short'} = '(no commit message)';
>>  	}
>> -	# remove added spaces
>> +	# remove added spaces, redact e-mail addresses if applicable.
>>  	foreach my $line (@commit_lines) {
>>  		$line =~ s/^    //;
>> +		if (gitweb_check_feature('email_privacy') &&
>> +			$line =~ m/^([^<]+) <([^>]*)>/) {
>> +			$line =~ s/<([^>]+)>/<private>/;
>> +		}
>>  	}
>>  	$co{'comment'} = \@commit_lines;
>
> All of these hunks (and the below) should use some new function that
> does this feature check + sanitizing instead of copy/pasting mostly the
> same code N times. e.g.:
>
>     sub maybe_hide_email {
>         my $email = shift;
>         return $email unless gitweb_check_feature('email_privacy');
>         return hide_email($email);
>     }
>
> then:
>
>     $tag{author_email} = maybe_hide_email($2);
>
> Also it looks like this isn't a new issue, but does this need to
> implement its own E-Mail parser? We ship with Mail::Address for
> git-send-email, can gitweb (and the elided hide_email() function above)
> use that too?
>
Thanks for the suggestion.
I've removed the duplicate code.

The places where there are e-mail addresses today are pretty specific.
(Pretty much the author, committer fields and the Signed-off-by lines)

In theory someone could write a comment with a bunch of addresses in it
but that would be unstructured text and I think Mail::Address is not good
with that.

I can definitely keep working on this topic but maybe in subsequent PRs?
Assuming we're not exposing any addresses right now or redacting things
we shouldn't.

>
>> @@ -8060,8 +8085,13 @@ sub git_commitdiff {
>>  		close $fd
>>  			or print "Reading git-diff-tree failed\n";
>>  	} elsif ($format eq 'patch') {
>> -		local $/ = undef;
>> -		print <$fd>;
>> +		while (my $line = <$fd>) {
>> +			if (gitweb_check_feature('email_privacy') &&
>> +				$line =~ m/^([^<]+) <([^>]*)>/) {
>> +				$line =~ s/<([^>]+)>/<private>/;
>> +			}
>> +			print $line;
>> +		}
>>  		close $fd
>>  			or print "Reading git-format-patch failed\n";
>
> Is that "patch" output meant for "git am"? Won't this severely break
> that use-case if so?
>
Not sure who may be consuming that output.
I've added a note in the documentation for gitweb.conf.

If a web crawler can get to the information by following URLs
then I think we should redact it.

Hopefully by documenting this as a potential issue Gitweb operators
can create a workaround specific to their use case.
Possibly implementing access control and leaving this feature off.

Junio C Hamano March 21, 2021, 6 a.m. UTC | #4
"Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> +    # To disable system wide have in $GITWEB_CONFIG
> +    # $feature{'email_privacy'}{'default'} = [0];
> +	'email_privacy' => {
> +		'sub' => sub { feature_bool('email_privacy', @_) },
> +		'override' => 0,
> +		'default' => [1]},
>  );

I do not see why this should default to true.

It would break existing installations, which have been perfectly happy
with the convenience of giving potential new contributors ready access
to the addresses of plausible mentors in the projects they are
interested in.

And more importantly, I do not see why it should be made impossible
to override per repository/project in a multi-tenant installation.
Some projects may be more "privacy" sensitive than others.  Those
who want to use tighter setting should be able to enable it even
when the site-wide default is set to false, no?

Thanks.

Junio C Hamano March 21, 2021, 6:18 a.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

> And more importantly, I do not see why it should be made impossible
> to override per repository/project in a multi-tenant installation.
> Some projects may be more "privacy" sensitive than others.  Those
> who want to use tighter setting should be able to enable it even
> when the site-wide default is set to false, no?

To answer an inevitable and natural follow-up question preemptively,
the primary reason why we have the override => 0 mechanism is so
that site administrators can disable certain expensive features
(like blame, snapshot, etc.) no matter what each project hosted by
them wish.

And hiding contributor identity would not be a choice that is based
on how expensive the feature is to run.
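
As an illustration of the setup being argued for, a gitweb.conf fragment could keep the site-wide default off while still letting individual projects change it. This is a sketch that uses the feature name from this v1 patch; the exact name and the per-repository configuration key may differ in later revisions.

```perl
# Sketch only: feature name as spelled in this v1 patch.
# Site-wide default: redaction off.
$feature{'email_privacy'}{'default'} = [0];
# Allow individual projects to override the site-wide default.
$feature{'email_privacy'}{'override'} = 1;
```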

Georgios Kontaxis March 21, 2021, 6:43 a.m. UTC | #6
> "Georgios Kontaxis via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
>> +    # To disable system wide have in $GITWEB_CONFIG
>> +    # $feature{'email_privacy'}{'default'} = [0];
>> +	'email_privacy' => {
>> +		'sub' => sub { feature_bool('email_privacy', @_) },
>> +		'override' => 0,
>> +		'default' => [1]},
>>  );
>
> I do not see why this should default to true.
>
I've changed the default to "false". V2 should reflect the change.

> It would break existing installations, which have been perfectly happy
> with the convenience of giving potential new contributors ready access
> to the addresses of plausible mentors in the projects they are
> interested in.
>
> And more importantly, I do not see why it should be made impossible
> to override per repository/project in a multi-tenant installation.
> Some projects may be more "privacy" sensitive than others.  Those
> who want to use tighter setting should be able to enable it even
> when the site-wide default is set to false, no?
>
> Thanks.
>
I was actually thinking about the other way around;
preventing projects from disabling this feature.

Sounds like the "override" flag is for other types
of use cases though. I'll change it to "true".

Thanks for the feedback.

Junio C Hamano March 21, 2021, 4:55 p.m. UTC | #7
"Georgios Kontaxis" <geko1702+commits@99rst.org> writes:

>> And more importantly, I do not see why it should be made impossible
>> to override per repository/project in a multi-tenant installation.
>> Some projects may be more "privacy" sensitive than others.  Those
>> who want to use tighter setting should be able to enable it even
>> when the site-wide default is set to false, no?
>>
>> Thanks.
>>
> I was actually thinking about the other way around;
> preventing projects from disabling this feature.

Yes, it cuts both ways, and override should be allowed for most
cases (unless absolutely necessary for healthy operation of the
system) for a simple reason that anybody who sets site-wide default
is not in a better position than those who set per-repository or
per-project setting to judge what is good for them.

Thanks.

Patch

diff --git a/Documentation/gitweb.conf.txt b/Documentation/gitweb.conf.txt
index 7963a79ba98b..10653d8670a8 100644
--- a/Documentation/gitweb.conf.txt
+++ b/Documentation/gitweb.conf.txt
@@ -896,6 +896,18 @@  same as of the snippet above:
 It is an error to specify a ref that does not pass "git check-ref-format"
 scrutiny. Duplicated values are filtered.
 
+email_privacy::
+    Redact e-mail addresses from the generated HTML, etc. content.
+    This hides e-mail addresses found in the commit log from web crawlers.
+    Enabled by default.
++
+It is highly recommended to keep this feature enabled unless web crawlers
+are hindered in some other way. You can disable this feature as shown below:
++
+---------------------------------------------------------------------------
+$feature{'email_privacy'}{'default'} = [0];
+---------------------------------------------------------------------------
+
 
 EXAMPLES
 --------
diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index 0959a782eccb..9d21c2583e18 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -569,6 +569,15 @@  sub evaluate_uri {
 		'sub' => \&feature_extra_branch_refs,
 		'override' => 0,
 		'default' => []},
+
+    # Redact e-mail addresses.
+
+    # To disable system wide have in $GITWEB_CONFIG
+    # $feature{'email_privacy'}{'default'} = [0];
+	'email_privacy' => {
+		'sub' => sub { feature_bool('email_privacy', @_) },
+		'override' => 0,
+		'default' => [1]},
 );
 
 sub gitweb_get_feature {
@@ -3471,6 +3480,10 @@  sub parse_tag {
 			if ($tag{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$tag{'author_name'}  = $1;
 				$tag{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$tag{'author_email'} = "private";
+					$tag{'author'} =~ s/<([^>]+)>/<private>/;
+				}
 			} else {
 				$tag{'author_name'} = $tag{'author'};
 			}
@@ -3519,6 +3532,10 @@  sub parse_commit_text {
 			if ($co{'author'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'author_name'}  = $1;
 				$co{'author_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'author_email'} = "private";
+					$co{'author'} =~ s/<([^>]+)>/<private>/;
+				}
 			} else {
 				$co{'author_name'} = $co{'author'};
 			}
@@ -3529,6 +3546,10 @@  sub parse_commit_text {
 			if ($co{'committer'} =~ m/^([^<]+) <([^>]*)>/) {
 				$co{'committer_name'}  = $1;
 				$co{'committer_email'} = $2;
+				if (gitweb_check_feature('email_privacy')) {
+					$co{'committer_email'} = "private";
+					$co{'committer'} =~ s/<([^>]+)>/<private>/;
+				}
 			} else {
 				$co{'committer_name'} = $co{'committer'};
 			}
@@ -3568,9 +3589,13 @@  sub parse_commit_text {
 	if (! defined $co{'title'} || $co{'title'} eq "") {
 		$co{'title'} = $co{'title_short'} = '(no commit message)';
 	}
-	# remove added spaces
+	# remove added spaces, redact e-mail addresses if applicable.
 	foreach my $line (@commit_lines) {
 		$line =~ s/^    //;
+		if (gitweb_check_feature('email_privacy') &&
+			$line =~ m/^([^<]+) <([^>]*)>/) {
+			$line =~ s/<([^>]+)>/<private>/;
+		}
 	}
 	$co{'comment'} = \@commit_lines;
 
@@ -8060,8 +8085,13 @@  sub git_commitdiff {
 		close $fd
 			or print "Reading git-diff-tree failed\n";
 	} elsif ($format eq 'patch') {
-		local $/ = undef;
-		print <$fd>;
+		while (my $line = <$fd>) {
+			if (gitweb_check_feature('email_privacy') &&
+				$line =~ m/^([^<]+) <([^>]*)>/) {
+				$line =~ s/<([^>]+)>/<private>/;
+			}
+			print $line;
+		}
 		close $fd
 			or print "Reading git-format-patch failed\n";
 	}