http: do not ignore proxy path

Message ID	pull.1767.git.1722009170590.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	show Received: from mail-ed1-f52.google.com (mail-ed1-f52.google.com [209.85.208.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DF624101F7 for <git@vger.kernel.org>; Fri, 26 Jul 2024 15:52:53 +0000 (UTC) Message-Id: <pull.1767.git.1722009170590.gitgitgadget@gmail.com> Date: Fri, 26 Jul 2024 15:52:50 +0000 Subject: [PATCH] http: do not ignore proxy path Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk MIME-Version: 1.0 To: git@vger.kernel.org Cc: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>, Ryan Hendrickson <ryan.hendrickson@alum.mit.edu> From: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Series	http: do not ignore proxy path \| expand http: do not ignore proxy path

Ryan Hendrickson July 26, 2024, 3:52 p.m. UTC

From: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>

The documentation for `http.proxy` describes that option, and the
environment variables it overrides, as supporting "the syntax understood
by curl". curl allows SOCKS proxies to use a path to a Unix domain
socket, like `socks5h://localhost/path/to/socket.sock`. Git should
therefore include, if present, the path part of the proxy URL in what it
passes to libcurl.

Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
---
    http: do not ignore proxy path
    
     * Documentation: do I need to add anything?
     * Tests: I could use a pointer on how best to add a test for this.
       Adding a case to t5564-http-proxy.sh seems straightforward but I
       don't think httpd can be configured to listen to domain sockets; can
       I use netcat?

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1767%2Frhendric%2Frhendric%2Fhttp-proxy-path-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1767/rhendric/rhendric/http-proxy-path-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1767

 http.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)


base-commit: ad57f148c6b5f8735b62238dda8f571c582e0e54

Junio C Hamano July 26, 2024, 4:29 p.m. UTC | #1

"Ryan Hendrickson via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
>
> The documentation for `http.proxy` describes that option, and the
> environment variables it overrides, as supporting "the syntax understood
> by curl". curl allows SOCKS proxies to use a path to a Unix domain
> socket, like `socks5h://localhost/path/to/socket.sock`. Git should
> therefore include, if present, the path part of the proxy URL in what it
> passes to libcurl.
>
> Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
> ---
>     http: do not ignore proxy path
>     
>      * Documentation: do I need to add anything?

http.proxy documentation says

  The syntax thus is [protocol://][user[:password]@]proxyhost[:port].

but the updated code pays attention to what can come after the
"host[:post]" part, does it not?

> diff --git a/http.c b/http.c
> index 623ed234891..0cd75986a6b 100644
> --- a/http.c
> +++ b/http.c

I am unfamiliar with this code path, so let me follow along aloud.

> @@ -1265,7 +1265,13 @@ static CURL *get_curl_handle(void)
>  		if (!proxy_auth.host)
>  			die("Invalid proxy URL '%s'", curl_http_proxy);

We grabbed the value from the configuration variable (or various
environment variables like $http_proxy) in the curl_http_proxy
variable, and then passed it to credential_from_url() to parse parts
out of [protocol://][user[:password]@]proxyhost[:port][/p/a/t/h].
The parsed result is in proxy_auth struct and there is no hope it
can be usable if the .host member is missing.

> -		curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);

We used to only use the .host member but the curl_http_proxy could
have had a path after it, held in the .path member.

> +		if (proxy_auth.path) {
> +			struct strbuf proxy = STRBUF_INIT;
> +			strbuf_addf(&proxy, "%s/%s", proxy_auth.host, proxy_auth.path);
> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy.buf);
> +			strbuf_release(&proxy);
> +		} else
> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);

Style.  If "if" side needs {braces} because it consists of multiple
statements, the corresponding "else" side should also have {braces}
around its body, even if it only has a single statement.

If you have the proxy strbuf in a bit wider scope, then the above becomes

	if (proxy_auth.path)
		strbuf_addf(&proxy, "%s/%s", proxy_auth.host, proxy_auth.path);
	else
		strbuf_addstr(&proxy, proxy_auth.host);
	curl_easy_setopt(result, CURLOPT_PROXY, proxy.buf);
	strbuf_release(&proxy);

which might (I am not decided) be easier to follow.

Could existing users have been taking advantage of the fact that the
extra /path at the end of http.proxy (and $http_proxy and friends)
are ignored?  For them, this change will appear as a regression.

Other than that (and the lack of any documentation and test
updates), looking good.

Thanks.

Ryan Hendrickson July 26, 2024, 5:12 p.m. UTC | #2

At 2024-07-26 09:29-0700, Junio C Hamano <gitster@pobox.com> sent:

> http.proxy documentation says
>
>  The syntax thus is [protocol://][user[:password]@]proxyhost[:port].
>
> but the updated code pays attention to what can come after the
> "host[:post]" part, does it not?

Correct; I'll add a "[/path]" to that construction.

>> +		if (proxy_auth.path) {
>> +			struct strbuf proxy = STRBUF_INIT;
>> +			strbuf_addf(&proxy, "%s/%s", proxy_auth.host, proxy_auth.path);
>> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy.buf);
>> +			strbuf_release(&proxy);
>> +		} else
>> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);
>
> Style.  If "if" side needs {braces} because it consists of multiple
> statements, the corresponding "else" side should also have {braces}
> around its body, even if it only has a single statement.
>
> If you have the proxy strbuf in a bit wider scope, then the above becomes
>
> 	if (proxy_auth.path)
> 		strbuf_addf(&proxy, "%s/%s", proxy_auth.host, proxy_auth.path);
> 	else
> 		strbuf_addstr(&proxy, proxy_auth.host);
> 	curl_easy_setopt(result, CURLOPT_PROXY, proxy.buf);
> 	strbuf_release(&proxy);
>
> which might (I am not decided) be easier to follow.

For what it's worth, I was following the precedent set by the if-statement 
starting at line 1256 (a few lines above this patch):

> 		if (strstr(curl_http_proxy, "://"))
> 			credential_from_url(&proxy_auth, curl_http_proxy);
> 		else {
> 			struct strbuf url = STRBUF_INIT;
> 			strbuf_addf(&url, "http://%s", curl_http_proxy);
> 			credential_from_url(&proxy_auth, url.buf);
> 			strbuf_release(&url);
> 		}

<https://github.com/git/git/blob/ad57f148c6b5f8735b62238dda8f571c582e0e54/http.c#L1256>

I have no problem with being inconsistent with surrounding code in these 
style choices; just let me know what I should do in light of that.

> Could existing users have been taking advantage of the fact that the
> extra /path at the end of http.proxy (and $http_proxy and friends)
> are ignored?  For them, this change will appear as a regression.

That is possible, though I have difficulty imagining a scenario in which 
it would be intentional.

What do you recommend I do about that possibility?

R

Junio C Hamano July 26, 2024, 5:45 p.m. UTC | #3

Ryan Hendrickson <ryan.hendrickson@alum.mit.edu> writes:

> For what it's worth, I was following the precedent set by the
> if-statement starting at line 1256 (a few lines above this patch):

It is worth nothing.  Existing violation does not make it a good
idea to add more of them.  It makes it more work to clean them all
up later.

>> Could existing users have been taking advantage of the fact that the
>> extra /path at the end of http.proxy (and $http_proxy and friends)
>> are ignored?  For them, this change will appear as a regression.
>
> That is possible, though I have difficulty imagining a scenario in
> which it would be intentional.
>
> What do you recommend I do about that possibility?

I have no idea.  I already said that I am not familiar with this
code path, and it is likely I am a worse person than you are to find
a potential creative (ab)uses of the mechanism to take advantage of
the fact that the current code ignores the path part.

Having said that, given that http.proxy falls back to environment
variables that have been established a lot longer than Git itself,
like HTTPS_PROXY etc., if we _know_ that setting HTTPS_PROXY to such
a value with /path part causes curl (or other popular programs) to
try using such a value without stripping the path part (and even
better if that causes them to fail), then such an observation would
make a very good argument why it is extremely unlikely that existing
users used such a setting, hence it is unlikely this patch would
make a regression.

Thanks.

Jeff King July 26, 2024, 9:11 p.m. UTC | #4

On Fri, Jul 26, 2024 at 03:52:50PM +0000, Ryan Hendrickson via GitGitGadget wrote:

>      * Tests: I could use a pointer on how best to add a test for this.
>        Adding a case to t5564-http-proxy.sh seems straightforward but I
>        don't think httpd can be configured to listen to domain sockets; can
>        I use netcat?

I don't offhand know of a way to test this without a custom program like
netcat. If it's the only option, it's OK to use tools that might not be
available everywhere as long as the tests are marked as optional with
the appropriate prerequisite. You can find prior art by looking for
test_lazy_prereq calls (e.g., the ones for GZIP or ZIPINFO are pretty
straight-forward).

I would warn that there are several not-quite-compatible variants of
netcat floating around, which can create headaches. You might be better
off with a short perl script using IO::Socket::UNIX or similar.

> diff --git a/http.c b/http.c
> index 623ed234891..0cd75986a6b 100644
> --- a/http.c
> +++ b/http.c
> @@ -1265,7 +1265,13 @@ static CURL *get_curl_handle(void)
>  		if (!proxy_auth.host)
>  			die("Invalid proxy URL '%s'", curl_http_proxy);
>  
> -		curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);
> +		if (proxy_auth.path) {
> +			struct strbuf proxy = STRBUF_INIT;
> +			strbuf_addf(&proxy, "%s/%s", proxy_auth.host, proxy_auth.path);
> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy.buf);
> +			strbuf_release(&proxy);
> +		} else
> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);

The fields in the proxy_auth struct have been parsed from the url, with
any url encoding removed. But then we paste them back together into a
pseudo-url without doing any further encoding. Is that correct?

I doubt that the host contains a "/", but if you had a path that
contained a "%", then the URL form of that is going to be %25. Which is
curl expecting to get here?

I say "pseudo-url" because it is weird to slap a path on the end of the
hostname but leave off the scheme, etc. Which kind of makes me wonder
why we pass proxy_auth.host in the first place, and not simply the
original curl_http_proxy. It looks like a weird interaction between
372370f167 (http: use credential API to handle proxy authentication,
2016-01-26) and 57415089bd (http: honor empty http.proxy option to
bypass proxy, 2017-04-11). The former added a _second_ CURLOPT_PROXY
call, and the latter removed the first one.

I wonder if we could go back to passing the string straight to curl (as
we did prior to 2016), and keeping the proxy_auth struct purely as a
mechanism for gathering credentials. That would presumably fix your use
case. And this is also perhaps an interesting data point for Junio's
question about regressions (if people were passing paths, it used to
work and was broken in 2016).

-Peff

Ryan Hendrickson July 26, 2024, 10:43 p.m. UTC | #5

At 2024-07-26 17:11-0400, Jeff King <peff@peff.net> sent:

> I would warn that there are several not-quite-compatible variants of
> netcat floating around, which can create headaches. You might be better
> off with a short perl script using IO::Socket::UNIX or similar.

Ah, okay, thanks for the pointer!

>> diff --git a/http.c b/http.c
>> index 623ed234891..0cd75986a6b 100644
>> --- a/http.c
>> +++ b/http.c
>> @@ -1265,7 +1265,13 @@ static CURL *get_curl_handle(void)
>>  		if (!proxy_auth.host)
>>  			die("Invalid proxy URL '%s'", curl_http_proxy);
>>
>> -		curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);
>> +		if (proxy_auth.path) {
>> +			struct strbuf proxy = STRBUF_INIT;
>> +			strbuf_addf(&proxy, "%s/%s", proxy_auth.host, proxy_auth.path);
>> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy.buf);
>> +			strbuf_release(&proxy);
>> +		} else
>> +			curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);
>
> The fields in the proxy_auth struct have been parsed from the url, with
> any url encoding removed. But then we paste them back together into a
> pseudo-url without doing any further encoding. Is that correct?
>
> I doubt that the host contains a "/", but if you had a path that
> contained a "%", then the URL form of that is going to be %25. Which is
> curl expecting to get here?

Oh, I see! Yes, this is an issue with my patch: if I create a socket file 
named "%30", command-line curl wants http_proxy to contain "%2530", and 
patched Git wants http_proxy to contain "%252530". Good edge case to put 
in a test.

> I wonder if we could go back to passing the string straight to curl (as
> we did prior to 2016), and keeping the proxy_auth struct purely as a
> mechanism for gathering credentials.

Hmm, that would be nice, but I think curl doesn't deal well with the extra 
case that Git supports of specifying a username but no password. It causes 
one of the existing tests to fail if I pass the string straight through.

On top of that, all of those starts_with tests for checking the protocol 
are written quite loosely, so in practice Git "supports" the protocols 
"socks://" and "socksonmyfeet://" by mapping them both to SOCKS4, and curl 
would not like it if it received those strings directly.

So given that Git wants to handle the protocol and the credentials, it 
makes sense that only the host and the path are passed to curl. I just 
have to make sure that they are correctly re-encoded.

R

Jeff King July 29, 2024, 7:31 p.m. UTC | #6

On Fri, Jul 26, 2024 at 06:43:29PM -0400, Ryan Hendrickson wrote:

> > I wonder if we could go back to passing the string straight to curl (as
> > we did prior to 2016), and keeping the proxy_auth struct purely as a
> > mechanism for gathering credentials.
> 
> Hmm, that would be nice, but I think curl doesn't deal well with the extra
> case that Git supports of specifying a username but no password. It causes
> one of the existing tests to fail if I pass the string straight through.

OK, thanks for trying. It would have been nice, but I'm not surprised
that there's an unusual interaction.

> On top of that, all of those starts_with tests for checking the protocol are
> written quite loosely, so in practice Git "supports" the protocols
> "socks://" and "socksonmyfeet://" by mapping them both to SOCKS4, and curl
> would not like it if it received those strings directly.
>
> So given that Git wants to handle the protocol and the credentials, it makes
> sense that only the host and the path are passed to curl. I just have to
> make sure that they are correctly re-encoded.

Makes sense.

-Peff

http: do not ignore proxy path

Commit Message

Comments

Patch