Message ID | 20180911034227.GB20518@aiede.svl.corp.google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | http-backend: treat empty CONTENT_LENGTH as zero | expand |
Kicking off the reviews: ;-) Jonathan Nieder wrote: > --- a/http-backend.c > +++ b/http-backend.c > @@ -350,10 +350,25 @@ static ssize_t read_request_fixed_len(int fd, ssize_t req_len, unsigned char **o > > static ssize_t get_content_length(void) [...] > + /* > + * According to RFC 3875, an empty or missing > + * CONTENT_LENGTH means "no body", but RFC 3875 > + * precedes HTTP/1.1 and chunked encoding. Apache and > + * its imitators leave CONTENT_LENGTH unset for Which imitators? Maybe this should just say "Apache leaves [...]". > + * chunked requests, for which we should use EOF to > + * detect the end of the request. > + */ > + str = getenv("HTTP_TRANSFER_ENCODING"); > + if (str && !strcmp(str, "chunked")) RFC 2616 says Transfer-Encoding is a list of transfer-codings applied, in the order that they were applied, and that "chunked" is always applied last. That means a transfer-encoding like Transfer-Encoding: identity chunked would be permitted, or e.g. Transfer-Encoding: gzip chunked Does that means we should be using a check like str && (!strcmp(str, "chunked") || ends_with(str, " chunked")) ? That said, a quick search of codesearch.debian.net mostly finds examples using straight comparison, so maybe the patch is fine as-is. Thanks, Jonathan
Jonathan Nieder <jrnieder@gmail.com> writes: > RFC 3875 (the CGI specification) explains: > > The CONTENT_LENGTH variable contains the size of the message-body > attached to the request, if any, in decimal number of octets. If no > data is attached, then NULL (or unset). > > CONTENT_LENGTH = "" | 1*digit > > And: > > This specification does not distinguish between zero-length (NULL) > values and missing values. > > But that specification was written before HTTP/1.1 and chunked > encoding. RFC 3875 is from October 2004, while RFC 2616 (HTTP/1.1) is from June 1999; I presume that 3875 only writes down what has already been an established practice and that is where the date discrepancy comes from. This is a bit old but some of those who participated in the discussion archived at https://lists.gt.net/apache/users/373042 seemed to know what they were talking about. In short, lack of content-length for CGI scripts driven by Apache seems to mean that the content length is unknown and we are expected to read through to the eof, and we are expected to ignore the CGI spec, which says missing CONTENT_LENGTH and CONTENT_LENGTH set to an empty string both must mean there is no message body. Instead, we need to take the former as a sign to read through to the end. > Fortunately, there's a way out. Use the HTTP_TRANSFER_ENCODING > environment variable to distinguish the two cases. Cute. I'm anxious to learn how well this works in practice. Or is this a trick you know somebody else's system already uses (in which case, that's wonderful)? > + if (!str || !*str) { > + /* > + * According to RFC 3875, an empty or missing > + * CONTENT_LENGTH means "no body", but RFC 3875 > + * precedes HTTP/1.1 and chunked encoding. Apache and > + * its imitators leave CONTENT_LENGTH unset for > + * chunked requests, for which we should use EOF to > + * detect the end of the request. > + */ > + str = getenv("HTTP_TRANSFER_ENCODING"); > + if (str && !strcmp(str, "chunked")) > + return -1; > + > + return 0; > + } > + if (!git_parse_ssize_t(str, &val)) > die("failed to parse CONTENT_LENGTH: %s", str); > return val; > }
Junio C Hamano wrote: > Jonathan Nieder <jrnieder@gmail.com> writes: >> RFC 3875 (the CGI specification) explains: >> >> The CONTENT_LENGTH variable contains the size of the message-body >> attached to the request, if any, in decimal number of octets. If no >> data is attached, then NULL (or unset). [...] >> But that specification was written before HTTP/1.1 and chunked >> encoding. > > RFC 3875 is from October 2004, while RFC 2616 (HTTP/1.1) is from > June 1999; I presume that 3875 only writes down what has already > been an established practice and that is where the date discrepancy > comes from. Yes, CGI 1.1 is from 1995. More details are at https://www.w3.org/CGI/. [...] >> Fortunately, there's a way out. Use the HTTP_TRANSFER_ENCODING >> environment variable to distinguish the two cases. > > Cute. > > I'm anxious to learn how well this works in practice. Or is this a > trick you know somebody else's system already uses (in which case, > that's wonderful)? Alas, I came up with it today so I don't know yet how well it will work in practice. I can poke around a little tomorrow in Apache to sanity-check the approach. Results from anyone able to test using various HTTP servers (lighttpd, etc) would also be very welcome. Thanks, Jonathan
Jonathan Nieder <jrnieder@gmail.com> writes: > Kicking off the reviews: ;-) > > Jonathan Nieder wrote: > >> --- a/http-backend.c >> +++ b/http-backend.c >> @@ -350,10 +350,25 @@ static ssize_t read_request_fixed_len(int fd, ssize_t req_len, unsigned char **o >> >> static ssize_t get_content_length(void) > [...] >> + /* >> + * According to RFC 3875, an empty or missing >> + * CONTENT_LENGTH means "no body", but RFC 3875 >> + * precedes HTTP/1.1 and chunked encoding. Apache and >> + * its imitators leave CONTENT_LENGTH unset for > > Which imitators? Maybe this should just say "Apache leaves [...]". I tend to agree; I do not mind amending the text while queuing. >> + * chunked requests, for which we should use EOF to >> + * detect the end of the request. >> + */ >> + str = getenv("HTTP_TRANSFER_ENCODING"); >> + if (str && !strcmp(str, "chunked")) > > RFC 2616 says Transfer-Encoding is a list of transfer-codings applied, > in the order that they were applied, and that "chunked" is always > applied last. That means a transfer-encoding like > > Transfer-Encoding: identity chunked > > would be permitted, or e.g. > > Transfer-Encoding: gzip chunked > > Does that means we should be using a check like > > str && (!strcmp(str, "chunked") || ends_with(str, " chunked")) > > ? Hmph, that's "Transfer-Encoding" ":" 1#transfer-coding where #rule is #rule A construct "#" is defined, similar to "*", for defining lists of elements. The full form is "<n>#<m>element" indicating at least <n> and at most <m> elements, each separated by one or more commas (",") and OPTIONAL linear white space (LWS). This makes the usual form of lists very easy; a rule such as ( *LWS element *( *LWS "," *LWS element )) can be shown as 1#element So - you need to account for comma - your LWS may not be a SP if you want to handle gzipped stream coming in a chunked form, I think. Unless I am missing the rule in CGI spec that is used to transform the value on the Transfer-Encoding header to HTTP_TRANSFER_ENCODING environment variable, that is. > That said, a quick search of codesearch.debian.net mostly finds > examples using straight comparison, so maybe the patch is fine as-is. I do not think we would mind terribly if we do not support combinations like gzipped-and-then-chunked from day one. An in-code NEEDSWORK comment that refers to the production in RFC 2616 Page 143 may not hurt, though. Thanks.
Junio C Hamano <gitster@pobox.com> writes: >>> + /* >>> + * According to RFC 3875, an empty or missing >>> + * CONTENT_LENGTH means "no body", but RFC 3875 >>> + * precedes HTTP/1.1 and chunked encoding. Apache and >>> + * its imitators leave CONTENT_LENGTH unset for >> >> Which imitators? Maybe this should just say "Apache leaves [...]". > > I tend to agree; I do not mind amending the text while queuing. > ... > I do not think we would mind terribly if we do not support > combinations like gzipped-and-then-chunked from day one. An in-code > NEEDSWORK comment that refers to the production in RFC 2616 Page 143 > may not hurt, though. I refrained from reflowing the first paragraph of the comment in this message, but will probably reflow it before committing, if the updated text is acceptable. http-backend.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/http-backend.c b/http-backend.c index 8f515a6def..b997eafb00 100644 --- a/http-backend.c +++ b/http-backend.c @@ -357,10 +357,17 @@ static ssize_t get_content_length(void) /* * According to RFC 3875, an empty or missing * CONTENT_LENGTH means "no body", but RFC 3875 - * precedes HTTP/1.1 and chunked encoding. Apache and - * its imitators leave CONTENT_LENGTH unset for + * precedes HTTP/1.1 and chunked encoding. Apache + * leaves CONTENT_LENGTH unset for * chunked requests, for which we should use EOF to * detect the end of the request. + * + * NEEDSWORK: Transfer-Encoding header is defined to + * be a list of elements where "chunked", if exists, + * must be at the end. The current code only deals + * with the case where "chunked" is the only element. + * See RFC 2616 (14.41 Transfer-Encoding) when + * extending this code. */ str = getenv("HTTP_TRANSFER_ENCODING"); if (str && !strcmp(str, "chunked"))
On Tue, Sep 11, 2018 at 11:15:13AM -0700, Junio C Hamano wrote: > > That said, a quick search of codesearch.debian.net mostly finds > > examples using straight comparison, so maybe the patch is fine as-is. > > I do not think we would mind terribly if we do not support > combinations like gzipped-and-then-chunked from day one. An in-code > NEEDSWORK comment that refers to the production in RFC 2616 Page 143 > may not hurt, though. It's pretty common for Git to send gzip'd contents, so this might actually be necessary on day one. However, it looks like we do so by setting the content-encoding header. I really wonder if this topic is worth pursuing further without finding a real-world case that actually fails with the v2.19 code. I.e., is there actually a server that doesn't set CONTENT_LENGTH and really can't handle read-to-eof? It's plausible to me, but it's also equally plausible that we'd be breaking some other case. -Peff
Jeff King wrote: > On Tue, Sep 11, 2018 at 11:15:13AM -0700, Junio C Hamano wrote: >> I do not think we would mind terribly if we do not support >> combinations like gzipped-and-then-chunked from day one. An in-code >> NEEDSWORK comment that refers to the production in RFC 2616 Page 143 >> may not hurt, though. > > It's pretty common for Git to send gzip'd contents, so this might > actually be necessary on day one. However, it looks like we do so by > setting the content-encoding header. Correct, we haven't been using Transfer-Encoding for that. > I really wonder if this topic is worth pursuing further without finding > a real-world case that actually fails with the v2.19 code. I.e., is > there actually a server that doesn't set CONTENT_LENGTH and really can't > handle read-to-eof? It's plausible to me, but it's also equally > plausible that we'd be breaking some other case. I wonder about the motivating IIS case. The CGI spec says that CONTENT_LENGTH is set if and only if the message has a message-body. When discussing message-body, it says Request-Data = [ request-body ] [ extension-data ] [...] A request-body is supplied with the request if the CONTENT_LENGTH is not NULL. The server MUST make at least that many bytes available for the script to read. The server MAY signal an end-of-file condition after CONTENT_LENGTH bytes have been read or it MAY supply extension data. Therefore, the script MUST NOT attempt to read more than CONTENT_LENGTH bytes, even if more data is available. Does that mean that if CONTENT_LENGTH is not set, then we are guaranteed to see EOF, because extension-data cannot be present? If so, then what we have in v2.19 (plus Max's test improvement that is in "next") is already enough. So I agree. 1. Junio, please eject this patch from "pu", since we don't have any need for it. 2. IIS users, please test v2.19 and let us know how it goes. Do we have any scenarios that would use an empty POST (or other non-GET) request? Thanks, Jonathan
Jeff King <peff@peff.net> writes: > I really wonder if this topic is worth pursuing further without finding > a real-world case that actually fails with the v2.19 code. I.e., is > there actually a server that doesn't set CONTENT_LENGTH and really can't > handle read-to-eof? It's plausible to me, but it's also equally > plausible that we'd be breaking some other case. OK.
diff --git a/http-backend.c b/http-backend.c index 458642ef72..7902eeb0b3 100644 --- a/http-backend.c +++ b/http-backend.c @@ -350,10 +350,25 @@ static ssize_t read_request_fixed_len(int fd, ssize_t req_len, unsigned char **o static ssize_t get_content_length(void) { - ssize_t val = -1; + ssize_t val; const char *str = getenv("CONTENT_LENGTH"); - if (str && *str && !git_parse_ssize_t(str, &val)) + if (!str || !*str) { + /* + * According to RFC 3875, an empty or missing + * CONTENT_LENGTH means "no body", but RFC 3875 + * precedes HTTP/1.1 and chunked encoding. Apache and + * its imitators leave CONTENT_LENGTH unset for + * chunked requests, for which we should use EOF to + * detect the end of the request. + */ + str = getenv("HTTP_TRANSFER_ENCODING"); + if (str && !strcmp(str, "chunked")) + return -1; + + return 0; + } + if (!git_parse_ssize_t(str, &val)) die("failed to parse CONTENT_LENGTH: %s", str); return val; }
As discussed in v2.19.0-rc0~45^2~2 (http-backend: respect CONTENT_LENGTH as specified by rfc3875, 2018-06-10), HTTP servers such as IIS do not close a CGI script's standard input at the end of a request, instead expecting CGI scripts to stop reading after CONTENT_LENGTH bytes. That commit taught http-backend to respect this convention except when CONTENT_LENGTH is unset, in which case it preserved the previous behavior of reading until EOF. RFC 3875 (the CGI specification) explains: The CONTENT_LENGTH variable contains the size of the message-body attached to the request, if any, in decimal number of octets. If no data is attached, then NULL (or unset). CONTENT_LENGTH = "" | 1*digit And: This specification does not distinguish between zero-length (NULL) values and missing values. But that specification was written before HTTP/1.1 and chunked encoding. With chunked encoding, the length of a request is not known early and it is useful to start a CGI script to process it anyway, so Apache and many other servers violate the spec: they leave CONTENT_LENGTH unset and rely on EOF to indicate the end of request. This is reproducible using t5510-fetch.sh, which hangs if http-backend is patched to treat a missing CONTENT_LENGTH as zero. So we are in a bind: to support HTTP servers that don't produce EOF, http-backend should respect an unset or empty CONTENT_LENGTH that represents zero, and to support chunked encoding, http-backend should respect an unset CONTENT_LENGTH that represents "read until EOF". Fortunately, there's a way out. Use the HTTP_TRANSFER_ENCODING environment variable to distinguish the two cases. Reported-by: Jeff King <peff@peff.net> Helped-by: Max Kirillov <max@max630.net> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> --- How about this? http-backend.c | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-)