diff mbox series

[v2] http: do not ignore proxy path

Message ID pull.1767.v2.git.1722062682858.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v2] http: do not ignore proxy path | expand

Commit Message

Ryan Hendrickson July 27, 2024, 6:44 a.m. UTC
From: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>

The documentation for `http.proxy` describes that option, and the
environment variables it overrides, as supporting "the syntax understood
by curl". curl allows SOCKS proxies to use a path to a Unix domain
socket, like `socks5h://localhost/path/to/socket.sock`. Git should
therefore include, if present, the path part of the proxy URL in what it
passes to libcurl.

Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
---
    http: do not ignore proxy path
    
    Re earlier discussion about regressions, it turns out that curl has only
    supported this syntax since 2022 [https://curl.se/ch/7.84.0.html].
    Earlier versions of curl, if provided an http_proxy that contained a
    path, would also ignore it. curl didn't seem to consider this a breaking
    change when the feature was introduced
    [https://github.com/curl/curl/pull/8668], though; is that a valid
    argument for Git not to lose sleep over it either?
    
    A test has been added, but unfortunately it is not being run on any of
    GitHub's CI workflows, because the runner images in use all have
    versions of libcurl that are too old. It works on my machine, but
    unknown troubles may await someone else.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1767%2Frhendric%2Frhendric%2Fhttp-proxy-path-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1767/rhendric/rhendric/http-proxy-path-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1767

Range-diff vs v1:

 1:  6b6ef680dce < -:  ----------- http: do not ignore proxy path
 -:  ----------- > 1:  b4715eba5b1 http: do not ignore proxy path


 Documentation/config/http.txt       |   4 +-
 http.c                              |  20 ++-
 t/socks5-proxy-server/LICENSE       |  21 +++
 t/socks5-proxy-server/README.md     |  27 +++
 t/socks5-proxy-server/src/socks5.pl | 260 ++++++++++++++++++++++++++++
 t/t5564-http-proxy.sh               |  18 ++
 6 files changed, 347 insertions(+), 3 deletions(-)
 create mode 100644 t/socks5-proxy-server/LICENSE
 create mode 100644 t/socks5-proxy-server/README.md
 create mode 100755 t/socks5-proxy-server/src/socks5.pl


base-commit: ad57f148c6b5f8735b62238dda8f571c582e0e54

Comments

Jeff King July 29, 2024, 8:09 p.m. UTC | #1
On Sat, Jul 27, 2024 at 06:44:42AM +0000, Ryan Hendrickson via GitGitGadget wrote:

>     Re earlier discussion about regressions, it turns out that curl has only
>     supported this syntax since 2022 [https://curl.se/ch/7.84.0.html].
>     Earlier versions of curl, if provided an http_proxy that contained a
>     path, would also ignore it. curl didn't seem to consider this a breaking
>     change when the feature was introduced
>     [https://github.com/curl/curl/pull/8668], though; is that a valid
>     argument for Git not to lose sleep over it either?

IMHO it is probably OK. The path did not do anything before then, so why
would anybody pass it?

Looking into the curl support, I did notice two things:

  - unix sockets are only supported for socks proxies. Is it meaningful
    to have a path on an http proxy? I don't think so (there is no place
    to send it in the "GET" line). But perhaps we should limit the new
    code only to socks.

  - curl says you should put "localhost" in the host field for this,
    though obviously it is not otherwise used. Should we likewise
    require that to kick in the code?

> @@ -1265,7 +1266,24 @@ static CURL *get_curl_handle(void)
>  		if (!proxy_auth.host)
>  			die("Invalid proxy URL '%s'", curl_http_proxy);
>  
> -		curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);
> +		strbuf_addstr(&proxy, proxy_auth.host);
> +		if (proxy_auth.path) {

There's one gotcha here I wondered about: we usually throw away the path
element, unless credential.useHTTPPath is set. That only happens after
we load the per-credential config, though, via credential_apply_config(),
which in turn is triggered by a call to credential_fill(), etc.

I think this is OK here, since we have just called credential_from_url()
ourselves, and the earliest credential_fill() we'd hit is from
init_curl_proxy_auth(), which is called right after the code you're
touching.

> +			int path_is_supported = 0;
> +			/* curl_version_info was added in curl 7.10 */
> +#if LIBCURL_VERSION_NUM >= 0x070a00
> +			curl_version_info_data *ver = curl_version_info(CURLVERSION_NOW);
> +			path_is_supported = ver->version_num >= 0x075400;
> +#endif

We've been trying to keep our curl version #ifdefs in git-curl-compat.h,
with human-readable names. But in this case, I think we can ditch it
entirely, as we require curl 7.21.3 or later already.

This will be the first time we check curl_version_info() instead of a
build-time option. I think that makes sense here. Most features require
extra information at build-time (e.g., CURLOPT_* constants), but in this
case it is purely a check on the runtime behavior.

We perhaps could get away with just letting an older curl quietly ignore
the path field, but what you have here is probably a bit friendlier for
users.

> diff --git a/t/socks5-proxy-server/src/socks5.pl b/t/socks5-proxy-server/src/socks5.pl
> new file mode 100755
> index 00000000000..3ef66a1a287
> --- /dev/null
> +++ b/t/socks5-proxy-server/src/socks5.pl

This is a lot more code than I was hoping for. There are definitely
parts of it we don't care about (e.g, threading, handling multiple
connections at once) that could be pared down a bit. I don't think we
particularly care about auth either (we just want to make sure we talk
to the unix-socket proxy at all).

What if we switched to socks4, which drops all of the auth bits and
supports only IPs, not hostnames (that came in socks4a). This is the
shortest I could come up with (only lightly tested):

-- >8 --
use strict;
use IO::Select;
use IO::Socket::UNIX;
use IO::Socket::INET;

my $path = shift;

unlink($path);
my $server = IO::Socket::UNIX->new(Listen => 1, Local => $path)
	or die "unable to listen on $path: $!";

$| = 1;
print "ready to rumble\n";

while (my $client = $server->accept()) {
	sysread $client, my $buf, 8;
	my ($version, $cmd, $port, $ip) = unpack 'CCnN', $buf;
	next unless $version == 4; # socks4
	next unless $cmd == 1; # TCP stream connection

	# skip NUL-terminated id
	while (sysread $client, my $char, 1) {
		last unless ord($char);
	}

	# version(0), reply(5a == granted), port (ignored), ip (ignored)
	syswrite $client, "\x00\x5a\x00\x00\x00\x00\x00\x00";

	my $remote = IO::Socket::INET->new(PeerHost => $ip, PeerPort => $port)
		or die "unable to connect to $ip/$port: $!";

	my $io = IO::Select->new($client, $remote);
	while ($io->count) {
		for my $fh ($io->can_read(0)) {
			for my $pair ([$client, $remote], [$remote, $client]) {
				my ($from, $to) = @$pair;
				next unless $fh == $from;

				my $r = sysread $from, my $buf, 1024;
				if (!defined $r || $r <= 0) {
					$io->remove($from);
					next;
				}
				syswrite $to, $buf;
			}
		}
	}
}
-- >8 --

That's still more than I'd like, but quite a bit smaller. I dunno.
Probably the one you found is more battle-tested, but I really think we
have pretty simple requirements.

> diff --git a/t/t5564-http-proxy.sh b/t/t5564-http-proxy.sh
> index bb35b87071d..bafaa31adf8 100755
> --- a/t/t5564-http-proxy.sh
> +++ b/t/t5564-http-proxy.sh
> @@ -39,4 +39,22 @@ test_expect_success 'clone can prompt for proxy password' '
>  	expect_askpass pass proxuser
>  '
>  
> +start_socks() {
> +	# The %30 tests that the correct amount of percent-encoding is applied
> +	# to the proxy string passed to curl.
> +	"$PERL_PATH" "$TEST_DIRECTORY/socks5-proxy-server/src/socks5.pl" \
> +		"$TRASH_DIRECTORY/%30.sock" proxuser proxpass &
> +	socks_pid=$!
> +}
> +
> +test_lazy_prereq LIBCURL_7_84 'expr x$(curl-config --vernum) \>= x075400'

This is the first time we'd be relying on curl-config in the test suite.
I suspect it would _usually_ work, but I'd worry about a few things:

  1. The Makefile allows for a different path for $(CURL_CONFIG), or
     even skipping it entirely by setting $(CURLDIR).

  2. We'd usually build and test in the same environment, but not
     necessarily. In particular, I know Windows uses separate CI jobs
     for this, and I'm not sure if curl-config will be around in the
     test jobs.

I can see two paths forward:

  a. There's been recent discussion about adding an option for Git to
     report the run-time curl version. That doesn't exist yet, though
     "git version --build-options" will report the build-time version.
     That might be good enough for the purposes of testing a build.

  b. Write the test such that it expects the request to work _or_ we see
     the "version too old" failure.

> +test_expect_success PERL,LIBCURL_7_84 'clone via Unix socket' '

I'm not sure if this PERL prereq is sufficient. We also need to know
that we can make unix sockets. Probably we need to actually run the
socks proxy and make sure it gets to the "starting..." line before
accepting that it works. Which also gets us to...

> +	start_socks &&
> +	test_when_finished "rm -rf clone && kill $socks_pid" &&
> +	test_config_global http.proxy "socks5://proxuser:proxpass@localhost$PWD/%2530.sock" &&
> +	GIT_TRACE_CURL=$PWD/trace git clone "$HTTPD_URL/smart/repo.git" clone &&

This is racy. We started the proxy in the background, but we don't know
that it's ready to accept connections by the time we run "git clone". My
first few runs all failed, but putting a "sleep 1" in between fixed it
(which obviously is not a real fix, but just a debugging aid).

If you usually see success, try running the test script with "--stress"
to create extra load, and you should see failures.

The usual trick here is to start the process in the background, and then
in the foreground wait for some "I'm ready" output, which involves ugly
tricks with fifos. There's prior art in lib-git-daemon.sh (though it's a
little more complicated there, because we want to relay its stderr, too,
but with a script under our control we can just put the ready message on
stdout).

So coupled with the prereq fix that I mentioned above, you might be able
to do something like (totally untested):

  start_socks_proxy () {
	mkfifo socks_output &&
	{
		perl proxy.pl ... >socks_output &
		socks_pid=$!
	} &&
	read line <socks_output &&
	test "$line" = "Starting..."
  }

  test_expect_success 'try to start socks proxy' '
	if start_socks_proxy
	then
		test_seq_prereq SOCKS_PROXY
	fi
  '

  test_expect_success SOCKS_PROXY 'clone via unix socket' '
	...
  '

-Peff
Ryan Hendrickson July 31, 2024, 3:33 p.m. UTC | #2
At 2024-07-29 16:09-0400, Jeff King <peff@peff.net> sent:

> Looking into the curl support, I did notice two things:
>
>  - unix sockets are only supported for socks proxies. Is it meaningful
>    to have a path on an http proxy? I don't think so (there is no place
>    to send it in the "GET" line). But perhaps we should limit the new
>    code only to socks.
>
>  - curl says you should put "localhost" in the host field for this,
>    though obviously it is not otherwise used. Should we likewise
>    require that to kick in the code?

Sounds good, I've added checks and tests for both of these cases.

>> @@ -1265,7 +1266,24 @@ static CURL *get_curl_handle(void)
>>  		if (!proxy_auth.host)
>>  			die("Invalid proxy URL '%s'", curl_http_proxy);
>>
>> -		curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);
>> +		strbuf_addstr(&proxy, proxy_auth.host);
>> +		if (proxy_auth.path) {
>
> There's one gotcha here I wondered about: we usually throw away the path
> element, unless credential.useHTTPPath is set. That only happens after
> we load the per-credential config, though, via credential_apply_config(),
> which in turn is triggered by a call to credential_fill(), etc.
>
> I think this is OK here, since we have just called credential_from_url()
> ourselves, and the earliest credential_fill() we'd hit is from
> init_curl_proxy_auth(), which is called right after the code you're
> touching.

Yes, good lookout but I don't think this is a problem either.

>> +			int path_is_supported = 0;
>> +			/* curl_version_info was added in curl 7.10 */
>> +#if LIBCURL_VERSION_NUM >= 0x070a00
>> +			curl_version_info_data *ver = curl_version_info(CURLVERSION_NOW);
>> +			path_is_supported = ver->version_num >= 0x075400;
>> +#endif
>
> We've been trying to keep our curl version #ifdefs in git-curl-compat.h,
> with human-readable names. But in this case, I think we can ditch it
> entirely, as we require curl 7.21.3 or later already.

Ah, okay, that is good to know! I'll remove the #if.

> This will be the first time we check curl_version_info() instead of a
> build-time option. I think that makes sense here. Most features require
> extra information at build-time (e.g., CURLOPT_* constants), but in this
> case it is purely a check on the runtime behavior.
>
> We perhaps could get away with just letting an older curl quietly ignore
> the path field, but what you have here is probably a bit friendlier for
> users.

Yes, curl has a nasty tendency to quietly ignore a lot of things. I didn't 
want users to be uncertain about whether they were using the feature 
incorrectly if an older curl was the actual issue.

> What if we switched to socks4, which drops all of the auth bits and
> supports only IPs, not hostnames (that came in socks4a). This is the
> shortest I could come up with (only lightly tested):

Ah, many thanks! I like this much better, and I'm not proficient enough 
with Perl to have written it myself.

> I can see two paths forward:
>
>  a. There's been recent discussion about adding an option for Git to
>     report the run-time curl version. That doesn't exist yet, though
>     "git version --build-options" will report the build-time version.
>     That might be good enough for the purposes of testing a build.
>
>  b. Write the test such that it expects the request to work _or_ we see
>     the "version too old" failure.

Got it, I'll go with option b.

> If you usually see success, try running the test script with "--stress"
> to create extra load, and you should see failures.

Huh. I never saw any race issues on my machine even with --stress, but 
your approach is safer so I'm running with it.

Thank you!

R
diff mbox series

Patch

diff --git a/Documentation/config/http.txt b/Documentation/config/http.txt
index 162b33fc52f..a14371b5c96 100644
--- a/Documentation/config/http.txt
+++ b/Documentation/config/http.txt
@@ -5,8 +5,8 @@  http.proxy::
 	proxy string with a user name but no password, in which case git will
 	attempt to acquire one in the same way it does for other credentials. See
 	linkgit:gitcredentials[7] for more information. The syntax thus is
-	'[protocol://][user[:password]@]proxyhost[:port]'. This can be overridden
-	on a per-remote basis; see remote.<name>.proxy
+	'[protocol://][user[:password]@]proxyhost[:port][/path]'. This can be
+	overridden on a per-remote basis; see remote.<name>.proxy
 +
 Any proxy, however configured, must be completely transparent and must not
 modify, transform, or buffer the request or response in any way.  Proxies which
diff --git a/http.c b/http.c
index 623ed234891..a6001296dd0 100644
--- a/http.c
+++ b/http.c
@@ -1227,6 +1227,7 @@  static CURL *get_curl_handle(void)
 		 */
 		curl_easy_setopt(result, CURLOPT_PROXY, "");
 	} else if (curl_http_proxy) {
+		struct strbuf proxy = STRBUF_INIT;
 		if (starts_with(curl_http_proxy, "socks5h"))
 			curl_easy_setopt(result,
 				CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5_HOSTNAME);
@@ -1265,7 +1266,24 @@  static CURL *get_curl_handle(void)
 		if (!proxy_auth.host)
 			die("Invalid proxy URL '%s'", curl_http_proxy);
 
-		curl_easy_setopt(result, CURLOPT_PROXY, proxy_auth.host);
+		strbuf_addstr(&proxy, proxy_auth.host);
+		if (proxy_auth.path) {
+			int path_is_supported = 0;
+			/* curl_version_info was added in curl 7.10 */
+#if LIBCURL_VERSION_NUM >= 0x070a00
+			curl_version_info_data *ver = curl_version_info(CURLVERSION_NOW);
+			path_is_supported = ver->version_num >= 0x075400;
+#endif
+			if (path_is_supported) {
+				strbuf_addch(&proxy, '/');
+				strbuf_add_percentencode(&proxy, proxy_auth.path, 0);
+			} else {
+				die("libcurl 7.84 or later is required to support paths in proxy URLs");
+			}
+		}
+		curl_easy_setopt(result, CURLOPT_PROXY, proxy.buf);
+		strbuf_release(&proxy);
+
 		var_override(&curl_no_proxy, getenv("NO_PROXY"));
 		var_override(&curl_no_proxy, getenv("no_proxy"));
 		curl_easy_setopt(result, CURLOPT_NOPROXY, curl_no_proxy);
diff --git a/t/socks5-proxy-server/LICENSE b/t/socks5-proxy-server/LICENSE
new file mode 100644
index 00000000000..94981550c94
--- /dev/null
+++ b/t/socks5-proxy-server/LICENSE
@@ -0,0 +1,21 @@ 
+MIT License
+
+Copyright (c) 2014 kaimi.io
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/t/socks5-proxy-server/README.md b/t/socks5-proxy-server/README.md
new file mode 100644
index 00000000000..6eed0e5ce7a
--- /dev/null
+++ b/t/socks5-proxy-server/README.md
@@ -0,0 +1,27 @@ 
+socks5-proxy-server
+===================
+
+This Perl code implements a SOCKS5 proxy server that listens for incoming connections and processes them in separate threads. The server takes three input parameters: `path`, `login`, and `password`.
+
+When a client attempts to connect to the server, the server checks if the client supports any of the available authentication methods (no authentication or login/password authentication). If a suitable method is found, the server establishes a connection with the target server and begins forwarding data between the client and the target server.
+
+The code uses the `IO::Select` module for working with sockets and the `threads` module for creating and managing threads. It includes several functions, including:
+
+- `main`: the main function that creates threads for processing incoming connections.
+- `replenish`: a function that creates additional threads if the number of free threads is less than the specified value.
+- `new_client`: a function that handles incoming client connections and checks if the available authentication methods are supported by the client.
+- `socks_auth`: a function that performs login and password authentication.
+- `handle_client`: a function that establishes a connection with the target server and forwards data between the client and the target server.
+
+This code includes the use of the following Perl modules: `IO::Select`, `Socket`, `threads`, `threads::shared`.
+
+To run this code, enter the following command:
+`perl socks5.pl path login password`
+
+Note that this code is designed for educational purposes and should not be used in production environments without proper modifications and security measures.