
[2/2] gitweb: remove invalid http-equiv="content-type"

Message ID 20220307033723.175553-3-jason@jasonyundt.email (mailing list archive)
State New, archived

Commit Message

Jason Yundt March 7, 2022, 3:37 a.m. UTC
Before this change, gitweb would generate pages which included:

	<meta http-equiv="content-type" content="application/xhtml+xml; charset=utf-8"/>

A meta element with http-equiv="content-type" is said to be in the
"Encoding declaration state". According to the HTML Standard,

	The Encoding declaration state may be used in HTML documents,
	but elements with an http-equiv attribute in that state must not
	be used in XML documents.

	Source: <https://html.spec.whatwg.org/multipage/semantics.html#attr-meta-http-equiv-content-type>

This change removes that meta element since gitweb always generates XML
documents.

Signed-off-by: Jason Yundt <jason@jasonyundt.email>
---
 gitweb/gitweb.perl                        |  4 +---
 t/t9502-gitweb-standalone-parse-output.sh | 13 +++++++++++++
 2 files changed, 14 insertions(+), 3 deletions(-)

Comments

Ævar Arnfjörð Bjarmason March 7, 2022, 12:23 p.m. UTC | #1
On Sun, Mar 06 2022, Jason Yundt wrote:

> Before this change, gitweb would generate pages which included:
>
> 	<meta http-equiv="content-type" content="application/xhtml+xml; charset=utf-8"/>
>
> A meta element with http-equiv="content-type" is said to be in the
> "Encoding declaration state". According to the HTML Standard,
>
> 	The Encoding declaration state may be used in HTML documents,
> 	but elements with an http-equiv attribute in that state must not
> 	be used in XML documents.
>
> 	Source: <https://html.spec.whatwg.org/multipage/semantics.html#attr-meta-http-equiv-content-type>
>
> This change removes that meta element since gitweb always generates XML
> documents.
>
> Signed-off-by: Jason Yundt <jason@jasonyundt.email>
> ---
>  gitweb/gitweb.perl                        |  4 +---
>  t/t9502-gitweb-standalone-parse-output.sh | 13 +++++++++++++
>  2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
> index fbd1c20a23..606b50104c 100755
> --- a/gitweb/gitweb.perl
> +++ b/gitweb/gitweb.perl
> @@ -4213,8 +4213,7 @@ sub git_header_html {
>  	my %opts = @_;
>  
>  	my $title = get_page_title();
> -	my $content_type = get_content_type_html();
> -	print $cgi->header(-type=>$content_type, -charset => 'utf-8',
> +	print $cgi->header(-type=>get_content_type_html(), -charset => 'utf-8',

I think it would be better to just skip this hunk, no behavior will
change if it's left in.

>  	                   -status=> $status, -expires => $expires)
>  		unless ($opts{'-no_http_header'});
>  	my $mod_perl_version = $ENV{'MOD_PERL'} ? " $ENV{'MOD_PERL'}" : '';
> @@ -4225,7 +4224,6 @@ sub git_header_html {
>  <!-- git web interface version $version, (C) 2005-2006, Kay Sievers <kay.sievers\@vrfy.org>, Christian Gierke -->
>  <!-- git core binaries version $git_version -->
>  <head>
> -<meta http-equiv="content-type" content="$content_type; charset=utf-8"/>

...with this being the only behavior change (yes, the variable will now
be used in only one place, but that's fine).

I'm not sure I really understand this change. The result is always XML,
so which part is redundant: application/xhtml+xml, text/html, or both?

But aside from that: I have seen browsers get the lack of encoding=""
"wrong" with data at rest, don't some still default to ISO-8859-1?

So won't this result in badly decoded data if you save the web page &
view it locally?

>  <meta name="generator" content="gitweb/$version git/$git_version$mod_perl_version"/>
>  <meta name="robots" content="index, nofollow"/>
>  <title>$title</title>
> diff --git a/t/t9502-gitweb-standalone-parse-output.sh b/t/t9502-gitweb-standalone-parse-output.sh
> index e7363511dd..25165edacc 100755
> --- a/t/t9502-gitweb-standalone-parse-output.sh
> +++ b/t/t9502-gitweb-standalone-parse-output.sh
> @@ -207,4 +207,17 @@ test_expect_success 'xss checks' '
>  	xss "" "$TAG+"
>  '
>  
> +no_http_equiv_content_type() {
> +	gitweb_run "$@" &&
> +	! grep -Ei "http-equiv=['\"]?content-type" gitweb.body

Nit: Should we skip the "-i" here, since we're testing our own output
and not HTML standards in general (i.e. we don't have to worry about the
letter case of http-equiv)?
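
To illustrate the nit, here is a minimal standalone sketch of the check
without "-i". It does not use the git test library: gitweb.body is written
by hand as a stand-in for what gitweb_run would produce, and the helper name
has_http_equiv_content_type is hypothetical.

```shell
#!/bin/sh
# Stand-in for gitweb output (assumption); in the real test,
# gitweb_run is what writes gitweb.body.
cat >gitweb.body <<\EOF
<head>
<meta name="generator" content="gitweb"/>
<meta name="robots" content="index, nofollow"/>
</head>
EOF

# Hypothetical helper: succeeds if the file contains the meta element
# spelled the way gitweb used to emit it (lower case, so no -i).
has_http_equiv_content_type () {
	grep -E "http-equiv=['\"]?content-type" "$1" >/dev/null
}

if has_http_equiv_content_type gitweb.body
then
	echo "found"
else
	echo "not found"
fi
```

The trade-off is that a case-sensitive pattern would miss the element if
gitweb ever emitted it with different capitalization, such as
http-equiv="Content-Type".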
Jason Yundt March 7, 2022, 10:49 p.m. UTC | #2
On Monday, March 7, 2022 7:23:49 AM EST Ævar Arnfjörð Bjarmason wrote:
> I'm not sure I really understand this change. The result is always XML,
> so which part is redundant: application/xhtml+xml, text/html, or both?

To be honest, using an http-equiv="content-type" in XHTML is confusing. When 
you do use one, your goal shouldn’t really be to specify the document’s MIME 
type. After all, the first three lines of each page say

	<?xml version="1.0" encoding="utf-8"?>
	<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
	<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">

Those lines are more than enough to determine that something is using XHTML
and UTF-8. Instead, the idea behind the meta element is to help out a parser
that is incorrectly parsing the document as HTML (instead of as XHTML).
Historical W3C documents (that were applicable when
http-equiv="content-type" was allowed in XHTML) [1][2][3] indicate that
http-equiv="content-type" should be used like this:

	<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

In other words, to use http-equiv="content-type" properly in XHTML, you had to 
lie about the document’s type. The fact that this is confusing is probably 
part of why WHATWG disallowed it in the HTML Standard.

> But aside from that: I have seen browsers get the lack of encoding=""
> "wrong" with data at rest, don't some still default to ISO-8859-1?
> 
> So won't this result in badly decoded data if you save the web page &
> view it locally?

I tested this idea in ungoogled-chromium, Firefox and Pale Moon. Other than 
Pale Moon in one specific circumstance, they all used UTF-8 as the encoding. 
Pale Moon used windows-1252, but only when the file ended with .html. When the 
file ended with .xhtml, Pale Moon used UTF-8. That being said, we don’t have to 
use an http-equiv="content-type" to fix the problem. Instead, we can use a 
<meta charset="utf-8"> which is allowed by the HTML Standard [4].

[1]: <https://www.w3.org/TR/xhtml1/#C_9>
[2]: <https://www.w3.org/TR/html-polyglot/#character-encoding>
[3]: <https://www.w3.org/Bugs/Public/show_bug.cgi?id=21818>

[4]: <https://html.spec.whatwg.org/multipage/semantics.html#attr-meta-charset>
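
As a rough illustration of the point that the first line already carries the
encoding, the sketch below reads the encoding out of the XML declaration of a
saved page. The file name saved.xhtml, its body content and the helper name
xml_declared_encoding are assumptions made for the example, and the sed
expression is only a simplification of what a real XML parser does.

```shell
#!/bin/sh
# Hand-written stand-in for a saved gitweb page (assumption); the first
# three lines are the ones quoted above.
cat >saved.xhtml <<\EOF
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
<head><title>example</title></head>
<body><p>example</p></body>
</html>
EOF

# Hypothetical helper: print the encoding named in the XML declaration
# on line 1, or nothing if line 1 has no such declaration.
xml_declared_encoding () {
	sed -n '1s/^<?xml .*encoding="\([^"]*\)".*?>.*$/\1/p' "$1"
}

xml_declared_encoding saved.xhtml
```

This only helps consumers that treat the file as XML. A browser that parses
the saved file as HTML, as Pale Moon did with the .html extension, does not
consult the XML declaration in the same way.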
brian m. carlson March 7, 2022, 11:24 p.m. UTC | #3
On 2022-03-07 at 03:37:23, Jason Yundt wrote:
> Before this change, gitweb would generate pages which included:
> 
> 	<meta http-equiv="content-type" content="application/xhtml+xml; charset=utf-8"/>
> 
> A meta element with http-equiv="content-type" is said to be in the
> "Encoding declaration state". According to the HTML Standard,
> 
> 	The Encoding declaration state may be used in HTML documents,
> 	but elements with an http-equiv attribute in that state must not
> 	be used in XML documents.
> 
> 	Source: <https://html.spec.whatwg.org/multipage/semantics.html#attr-meta-http-equiv-content-type>
> 
> This change removes that meta element since gitweb always generates XML
> documents.

This change seems fine.  We do specify this in the HTTP header,
including the character set, which is what matters, so this should work
in every browser, and the http-equiv is unneeded.

I also don't think we need a meta header here, since we have an XML
declaration, and that's controlling in this situation.  This isn't
regular HTML and we don't declare it as such, so using a meta header to
control this isn't correct: the XML declaration should be used instead
in the event a user downloads this to a local disk and processes it
outside the context of an HTTP request.

Since we control the HTTP headers, I'd actually argue that your test
might well reject all http-equiv headers since they could be done much
better with actual HTTP headers (and would therefore work with
non-browser clients), but I don't think that's worth a reroll, nor do I
think a test is even needed here (but bonus points for adding one).

So I think this looks good as is.  Thanks for the patch.
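
For comparison, a sketch of what the broader check could look like next to
the narrow one from the patch. Both helpers take a file argument instead of
calling gitweb_run so that the example runs on its own. The names
no_http_equiv_content_type_in and no_http_equiv_in, the file page.body and
its refresh element are all hypothetical and not taken from gitweb.

```shell
#!/bin/sh
# Hand-written stand-in for a page body (assumption); the refresh
# element is there only to exercise the broader check.
cat >page.body <<\EOF
<head>
<meta http-equiv="refresh" content="30"/>
<meta name="robots" content="index, nofollow"/>
</head>
EOF

# Narrow check, same pattern as the patch: reject only content-type.
no_http_equiv_content_type_in () {
	! grep -Ei "http-equiv=['\"]?content-type" "$1" >/dev/null
}

# Broader, hypothetical check: reject any meta element with http-equiv.
no_http_equiv_in () {
	! grep -Ei "<meta[^>]*http-equiv=" "$1" >/dev/null
}

no_http_equiv_content_type_in page.body && echo "narrow check: pass"
no_http_equiv_in page.body || echo "broad check: fail"
```

The narrow check accepts this sample while the broad one rejects it, which
is the difference the suggestion is about.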

Patch

diff --git a/gitweb/gitweb.perl b/gitweb/gitweb.perl
index fbd1c20a23..606b50104c 100755
--- a/gitweb/gitweb.perl
+++ b/gitweb/gitweb.perl
@@ -4213,8 +4213,7 @@  sub git_header_html {
 	my %opts = @_;
 
 	my $title = get_page_title();
-	my $content_type = get_content_type_html();
-	print $cgi->header(-type=>$content_type, -charset => 'utf-8',
+	print $cgi->header(-type=>get_content_type_html(), -charset => 'utf-8',
 	                   -status=> $status, -expires => $expires)
 		unless ($opts{'-no_http_header'});
 	my $mod_perl_version = $ENV{'MOD_PERL'} ? " $ENV{'MOD_PERL'}" : '';
@@ -4225,7 +4224,6 @@  sub git_header_html {
 <!-- git web interface version $version, (C) 2005-2006, Kay Sievers <kay.sievers\@vrfy.org>, Christian Gierke -->
 <!-- git core binaries version $git_version -->
 <head>
-<meta http-equiv="content-type" content="$content_type; charset=utf-8"/>
 <meta name="generator" content="gitweb/$version git/$git_version$mod_perl_version"/>
 <meta name="robots" content="index, nofollow"/>
 <title>$title</title>
diff --git a/t/t9502-gitweb-standalone-parse-output.sh b/t/t9502-gitweb-standalone-parse-output.sh
index e7363511dd..25165edacc 100755
--- a/t/t9502-gitweb-standalone-parse-output.sh
+++ b/t/t9502-gitweb-standalone-parse-output.sh
@@ -207,4 +207,17 @@  test_expect_success 'xss checks' '
 	xss "" "$TAG+"
 '
 
+no_http_equiv_content_type() {
+	gitweb_run "$@" &&
+	! grep -Ei "http-equiv=['\"]?content-type" gitweb.body
+}
+
+# See: <https://html.spec.whatwg.org/dev/semantics.html#attr-meta-http-equiv-content-type>
+test_expect_success 'no http-equiv="content-type" in XHTML' '
+	no_http_equiv_content_type &&
+	no_http_equiv_content_type "p=.git" &&
+	no_http_equiv_content_type "p=.git;a=log" &&
+	no_http_equiv_content_type "p=.git;a=tree"
+'
+
 test_done