diff mbox series

http-backend: treat empty CONTENT_LENGTH as zero

Message ID 20180911034227.GB20518@aiede.svl.corp.google.com (mailing list archive)
State New, archived
Headers show
Series http-backend: treat empty CONTENT_LENGTH as zero | expand

Commit Message

Jonathan Nieder Sept. 11, 2018, 3:42 a.m. UTC
As discussed in v2.19.0-rc0~45^2~2 (http-backend: respect
CONTENT_LENGTH as specified by rfc3875, 2018-06-10), HTTP servers such
as IIS do not close a CGI script's standard input at the end of a
request, instead expecting CGI scripts to stop reading after
CONTENT_LENGTH bytes.  That commit taught http-backend to respect this
convention except when CONTENT_LENGTH is unset, in which case it
preserved the previous behavior of reading until EOF.

RFC 3875 (the CGI specification) explains:

   The CONTENT_LENGTH variable contains the size of the message-body
   attached to the request, if any, in decimal number of octets.  If no
   data is attached, then NULL (or unset).

      CONTENT_LENGTH = "" | 1*digit

And:

   This specification does not distinguish between zero-length (NULL)
   values and missing values.

But that specification was written before HTTP/1.1 and chunked
encoding.  With chunked encoding, the length of a request is not known
early and it is useful to start a CGI script to process it anyway, so
Apache and many other servers violate the spec: they leave
CONTENT_LENGTH unset and rely on EOF to indicate the end of request.
This is reproducible using t5510-fetch.sh, which hangs if http-backend
is patched to treat a missing CONTENT_LENGTH as zero.

So we are in a bind: to support HTTP servers that don't produce EOF,
http-backend should respect an unset or empty CONTENT_LENGTH that
represents zero, and to support chunked encoding, http-backend should
respect an unset CONTENT_LENGTH that represents "read until EOF".

Fortunately, there's a way out.  Use the HTTP_TRANSFER_ENCODING
environment variable to distinguish the two cases.

Reported-by: Jeff King <peff@peff.net>
Helped-by: Max Kirillov <max@max630.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
How about this?

 http-backend.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Comments

Jonathan Nieder Sept. 11, 2018, 4:03 a.m. UTC | #1
Kicking off the reviews: ;-)

Jonathan Nieder wrote:

> --- a/http-backend.c
> +++ b/http-backend.c
> @@ -350,10 +350,25 @@ static ssize_t read_request_fixed_len(int fd, ssize_t req_len, unsigned char **o
>  
>  static ssize_t get_content_length(void)
[...]
> +		/*
> +		 * According to RFC 3875, an empty or missing
> +		 * CONTENT_LENGTH means "no body", but RFC 3875
> +		 * precedes HTTP/1.1 and chunked encoding. Apache and
> +		 * its imitators leave CONTENT_LENGTH unset for

Which imitators?  Maybe this should just say "Apache leaves [...]".

> +		 * chunked requests, for which we should use EOF to
> +		 * detect the end of the request.
> +		 */
> +		str = getenv("HTTP_TRANSFER_ENCODING");
> +		if (str && !strcmp(str, "chunked"))

RFC 2616 says Transfer-Encoding is a list of transfer-codings applied,
in the order that they were applied, and that "chunked" is always
applied last.  That means a transfer-encoding like

	Transfer-Encoding: identity chunked

would be permitted, or e.g.

	Transfer-Encoding: gzip chunked

Does that means we should be using a check like

	str && (!strcmp(str, "chunked") || ends_with(str, " chunked"))

?

That said, a quick search of codesearch.debian.net mostly finds
examples using straight comparison, so maybe the patch is fine as-is.

Thanks,
Jonathan
Junio C Hamano Sept. 11, 2018, 4:18 a.m. UTC | #2
Jonathan Nieder <jrnieder@gmail.com> writes:

> RFC 3875 (the CGI specification) explains:
>
>    The CONTENT_LENGTH variable contains the size of the message-body
>    attached to the request, if any, in decimal number of octets.  If no
>    data is attached, then NULL (or unset).
>
>       CONTENT_LENGTH = "" | 1*digit
>
> And:
>
>    This specification does not distinguish between zero-length (NULL)
>    values and missing values.
>
> But that specification was written before HTTP/1.1 and chunked
> encoding.

RFC 3875 is from October 2004, while RFC 2616 (HTTP/1.1) is from
June 1999; I presume that 3875 only writes down what has already
been an established practice and that is where the date discrepancy
comes from.

This is a bit old but some of those who participated in the
discussion archived at https://lists.gt.net/apache/users/373042
seemed to know what they were talking about.  In short, lack of
content-length for CGI scripts driven by Apache seems to mean that
the content length is unknown and we are expected to read through to
the eof, and we are expected to ignore the CGI spec, which says
missing CONTENT_LENGTH and CONTENT_LENGTH set to an empty string
both must mean there is no message body.  Instead, we need to take
the former as a sign to read through to the end.

> Fortunately, there's a way out.  Use the HTTP_TRANSFER_ENCODING
> environment variable to distinguish the two cases.

Cute.

I'm anxious to learn how well this works in practice. Or is this a
trick you know somebody else's system already uses (in which case,
that's wonderful)?

> +	if (!str || !*str) {
> +		/*
> +		 * According to RFC 3875, an empty or missing
> +		 * CONTENT_LENGTH means "no body", but RFC 3875
> +		 * precedes HTTP/1.1 and chunked encoding. Apache and
> +		 * its imitators leave CONTENT_LENGTH unset for
> +		 * chunked requests, for which we should use EOF to
> +		 * detect the end of the request.
> +		 */
> +		str = getenv("HTTP_TRANSFER_ENCODING");
> +		if (str && !strcmp(str, "chunked"))
> +			return -1;
> +
> +		return 0;
> +	}
> +	if (!git_parse_ssize_t(str, &val))
>  		die("failed to parse CONTENT_LENGTH: %s", str);
>  	return val;
>  }
Jonathan Nieder Sept. 11, 2018, 4:29 a.m. UTC | #3
Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:

>> RFC 3875 (the CGI specification) explains:
>>
>>    The CONTENT_LENGTH variable contains the size of the message-body
>>    attached to the request, if any, in decimal number of octets.  If no
>>    data is attached, then NULL (or unset).
[...]
>> But that specification was written before HTTP/1.1 and chunked
>> encoding.
>
> RFC 3875 is from October 2004, while RFC 2616 (HTTP/1.1) is from
> June 1999; I presume that 3875 only writes down what has already
> been an established practice and that is where the date discrepancy
> comes from.

Yes, CGI 1.1 is from 1995.  More details are at
https://www.w3.org/CGI/.

[...]
>> Fortunately, there's a way out.  Use the HTTP_TRANSFER_ENCODING
>> environment variable to distinguish the two cases.
>
> Cute.
>
> I'm anxious to learn how well this works in practice. Or is this a
> trick you know somebody else's system already uses (in which case,
> that's wonderful)?

Alas, I came up with it today so I don't know yet how well it will
work in practice.

I can poke around a little tomorrow in Apache to sanity-check the
approach.  Results from anyone able to test using various HTTP servers
(lighttpd, etc) would also be very welcome.

Thanks,
Jonathan
Junio C Hamano Sept. 11, 2018, 6:15 p.m. UTC | #4
Jonathan Nieder <jrnieder@gmail.com> writes:

> Kicking off the reviews: ;-)
>
> Jonathan Nieder wrote:
>
>> --- a/http-backend.c
>> +++ b/http-backend.c
>> @@ -350,10 +350,25 @@ static ssize_t read_request_fixed_len(int fd, ssize_t req_len, unsigned char **o
>>  
>>  static ssize_t get_content_length(void)
> [...]
>> +		/*
>> +		 * According to RFC 3875, an empty or missing
>> +		 * CONTENT_LENGTH means "no body", but RFC 3875
>> +		 * precedes HTTP/1.1 and chunked encoding. Apache and
>> +		 * its imitators leave CONTENT_LENGTH unset for
>
> Which imitators?  Maybe this should just say "Apache leaves [...]".

I tend to agree; I do not mind amending the text while queuing.

>> +		 * chunked requests, for which we should use EOF to
>> +		 * detect the end of the request.
>> +		 */
>> +		str = getenv("HTTP_TRANSFER_ENCODING");
>> +		if (str && !strcmp(str, "chunked"))
>
> RFC 2616 says Transfer-Encoding is a list of transfer-codings applied,
> in the order that they were applied, and that "chunked" is always
> applied last.  That means a transfer-encoding like
>
> 	Transfer-Encoding: identity chunked
>
> would be permitted, or e.g.
>
> 	Transfer-Encoding: gzip chunked
>
> Does that means we should be using a check like
>
> 	str && (!strcmp(str, "chunked") || ends_with(str, " chunked"))
>
> ?

Hmph, that's 

	"Transfer-Encoding" ":" 1#transfer-coding

where #rule is

   #rule
      A construct "#" is defined, similar to "*", for defining lists of
      elements. The full form is "<n>#<m>element" indicating at least
      <n> and at most <m> elements, each separated by one or more commas
      (",") and OPTIONAL linear white space (LWS). This makes the usual
      form of lists very easy; a rule such as
         ( *LWS element *( *LWS "," *LWS element ))
      can be shown as
         1#element

So

 - you need to account for comma
 - your LWS may not be a SP

if you want to handle gzipped stream coming in a chunked form, I
think.

Unless I am missing the rule in CGI spec that is used to transform
the value on the Transfer-Encoding header to HTTP_TRANSFER_ENCODING
environment variable, that is.

> That said, a quick search of codesearch.debian.net mostly finds
> examples using straight comparison, so maybe the patch is fine as-is.

I do not think we would mind terribly if we do not support
combinations like gzipped-and-then-chunked from day one.  An in-code
NEEDSWORK comment that refers to the production in RFC 2616 Page 143
may not hurt, though.

Thanks.
Junio C Hamano Sept. 11, 2018, 6:27 p.m. UTC | #5
Junio C Hamano <gitster@pobox.com> writes:

>>> +		/*
>>> +		 * According to RFC 3875, an empty or missing
>>> +		 * CONTENT_LENGTH means "no body", but RFC 3875
>>> +		 * precedes HTTP/1.1 and chunked encoding. Apache and
>>> +		 * its imitators leave CONTENT_LENGTH unset for
>>
>> Which imitators?  Maybe this should just say "Apache leaves [...]".
>
> I tend to agree; I do not mind amending the text while queuing.
> ...
> I do not think we would mind terribly if we do not support
> combinations like gzipped-and-then-chunked from day one.  An in-code
> NEEDSWORK comment that refers to the production in RFC 2616 Page 143
> may not hurt, though.

I refrained from reflowing the first paragraph of the comment in
this message, but will probably reflow it before committing, if the
updated text is acceptable.


 http-backend.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/http-backend.c b/http-backend.c
index 8f515a6def..b997eafb00 100644
--- a/http-backend.c
+++ b/http-backend.c
@@ -357,10 +357,17 @@ static ssize_t get_content_length(void)
 		/*
 		 * According to RFC 3875, an empty or missing
 		 * CONTENT_LENGTH means "no body", but RFC 3875
-		 * precedes HTTP/1.1 and chunked encoding. Apache and
-		 * its imitators leave CONTENT_LENGTH unset for
+		 * precedes HTTP/1.1 and chunked encoding. Apache
+		 * leaves CONTENT_LENGTH unset for
 		 * chunked requests, for which we should use EOF to
 		 * detect the end of the request.
+		 *
+		 * NEEDSWORK: Transfer-Encoding header is defined to
+		 * be a list of elements where "chunked", if exists,
+		 * must be at the end.  The current code only deals
+		 * with the case where "chunked" is the only element.
+		 * See RFC 2616 (14.41 Transfer-Encoding) when
+		 * extending this code.
 		 */
 		str = getenv("HTTP_TRANSFER_ENCODING");
 		if (str && !strcmp(str, "chunked"))
Jeff King Sept. 12, 2018, 5:56 a.m. UTC | #6
On Tue, Sep 11, 2018 at 11:15:13AM -0700, Junio C Hamano wrote:

> > That said, a quick search of codesearch.debian.net mostly finds
> > examples using straight comparison, so maybe the patch is fine as-is.
> 
> I do not think we would mind terribly if we do not support
> combinations like gzipped-and-then-chunked from day one.  An in-code
> NEEDSWORK comment that refers to the production in RFC 2616 Page 143
> may not hurt, though.

It's pretty common for Git to send gzip'd contents, so this might
actually be necessary on day one. However, it looks like we do so by
setting the content-encoding header.

I really wonder if this topic is worth pursuing further without finding
a real-world case that actually fails with the v2.19 code. I.e., is
there actually a server that doesn't set CONTENT_LENGTH and really can't
handle read-to-eof? It's plausible to me, but it's also equally
plausible that we'd be breaking some other case.

-Peff
Jonathan Nieder Sept. 12, 2018, 6:26 a.m. UTC | #7
Jeff King wrote:
> On Tue, Sep 11, 2018 at 11:15:13AM -0700, Junio C Hamano wrote:

>> I do not think we would mind terribly if we do not support
>> combinations like gzipped-and-then-chunked from day one.  An in-code
>> NEEDSWORK comment that refers to the production in RFC 2616 Page 143
>> may not hurt, though.
>
> It's pretty common for Git to send gzip'd contents, so this might
> actually be necessary on day one. However, it looks like we do so by
> setting the content-encoding header.

Correct, we haven't been using Transfer-Encoding for that.

> I really wonder if this topic is worth pursuing further without finding
> a real-world case that actually fails with the v2.19 code. I.e., is
> there actually a server that doesn't set CONTENT_LENGTH and really can't
> handle read-to-eof? It's plausible to me, but it's also equally
> plausible that we'd be breaking some other case.

I wonder about the motivating IIS case.  The CGI spec says that
CONTENT_LENGTH is set if and only if the message has a message-body.
When discussing message-body, it says

      Request-Data   = [ request-body ] [ extension-data ]
[...]
   A request-body is supplied with the request if the CONTENT_LENGTH is
   not NULL.  The server MUST make at least that many bytes available
   for the script to read.  The server MAY signal an end-of-file
   condition after CONTENT_LENGTH bytes have been read or it MAY supply
   extension data.  Therefore, the script MUST NOT attempt to read more
   than CONTENT_LENGTH bytes, even if more data is available.

Does that mean that if CONTENT_LENGTH is not set, then we are
guaranteed to see EOF, because extension-data cannot be present?  If
so, then what we have in v2.19 (plus Max's test improvement that is in
"next") is already enough.

So I agree.

 1. Junio, please eject this patch from "pu", since we don't have any
    need for it.

 2. IIS users, please test v2.19 and let us know how it goes.

Do we have any scenarios that would use an empty POST (or other
non-GET) request?

Thanks,
Jonathan
Junio C Hamano Sept. 12, 2018, 4:10 p.m. UTC | #8
Jeff King <peff@peff.net> writes:

> I really wonder if this topic is worth pursuing further without finding
> a real-world case that actually fails with the v2.19 code. I.e., is
> there actually a server that doesn't set CONTENT_LENGTH and really can't
> handle read-to-eof? It's plausible to me, but it's also equally
> plausible that we'd be breaking some other case.

OK.
diff mbox series

Patch

diff --git a/http-backend.c b/http-backend.c
index 458642ef72..7902eeb0b3 100644
--- a/http-backend.c
+++ b/http-backend.c
@@ -350,10 +350,25 @@  static ssize_t read_request_fixed_len(int fd, ssize_t req_len, unsigned char **o
 
 static ssize_t get_content_length(void)
 {
-	ssize_t val = -1;
+	ssize_t val;
 	const char *str = getenv("CONTENT_LENGTH");
 
-	if (str && *str && !git_parse_ssize_t(str, &val))
+	if (!str || !*str) {
+		/*
+		 * According to RFC 3875, an empty or missing
+		 * CONTENT_LENGTH means "no body", but RFC 3875
+		 * precedes HTTP/1.1 and chunked encoding. Apache and
+		 * its imitators leave CONTENT_LENGTH unset for
+		 * chunked requests, for which we should use EOF to
+		 * detect the end of the request.
+		 */
+		str = getenv("HTTP_TRANSFER_ENCODING");
+		if (str && !strcmp(str, "chunked"))
+			return -1;
+
+		return 0;
+	}
+	if (!git_parse_ssize_t(str, &val))
 		die("failed to parse CONTENT_LENGTH: %s", str);
 	return val;
 }