diff mbox series

[v2,7/9] bundle-uri: allow relative URLs in bundle lists

Message ID 186e112d821d9a42ffbc3c8b46e24b2b4bb3dfe8.1668628303.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Bundle URIs IV: advertise over protocol v2 | expand

Commit Message

Derrick Stolee Nov. 16, 2022, 7:51 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

Bundle providers may want to distribute that data across multiple CDNs.
This might require a change in the base URI, all the way to the domain
name. If all bundles require an absolute URI in their 'uri' value, then
every push to a CDN would require altering the table of contents to
match the expected domain and exact location within it.

Allow a bundle list to specify a relative URI for the bundles.
This allows easier distribution of bundle data.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle-uri.c                | 16 ++++++++++-
 bundle-uri.h                |  9 +++++++
 t/helper/test-bundle-uri.c  |  2 ++
 t/t5750-bundle-uri-parse.sh | 54 +++++++++++++++++++++++++++++++++++++
 transport.c                 |  3 +++
 5 files changed, 83 insertions(+), 1 deletion(-)

Comments

Victoria Dye Nov. 29, 2022, 1:25 a.m. UTC | #1
Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Bundle providers may want to distribute that data across multiple CDNs.
> This might require a change in the base URI, all the way to the domain
> name. If all bundles require an absolute URI in their 'uri' value, then
> every push to a CDN would require altering the table of contents to
> match the expected domain and exact location within it.
> 
> Allow a bundle list to specify a relative URI for the bundles.
> This allows easier distribution of bundle data.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  bundle-uri.c                | 16 ++++++++++-
>  bundle-uri.h                |  9 +++++++
>  t/helper/test-bundle-uri.c  |  2 ++
>  t/t5750-bundle-uri-parse.sh | 54 +++++++++++++++++++++++++++++++++++++
>  transport.c                 |  3 +++
>  5 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/bundle-uri.c b/bundle-uri.c
> @@ -190,6 +192,18 @@ int bundle_uri_parse_config_format(const char *uri,
>  		.error_action = CONFIG_ERROR_ERROR,
>  	};
>  
> +	if (!list->baseURI) {
> +		struct strbuf baseURI = STRBUF_INIT;
> +		strbuf_addstr(&baseURI, uri);
> +
> +		/*
> +		 * If the URI does not end with a trailing slash, then
> +		 * remove the filename portion of the path. This is
> +		 * important for relative URIs.
> +		 */
> +		strbuf_strip_file_from_path(&baseURI);
> +		list->baseURI = strbuf_detach(&baseURI, NULL);

Is the 'baseURI' is set to the URI of the first bundle (ordered by hash)? If
data is distributed across multiple CDNs, couldn't this be a suboptimal
choice? For example, if the first bundle is on 'A.com', but every other
bundle is on 'B.org'?

> +	}
>  	result = git_config_from_file_with_options(config_to_bundle_list,
>  						   filename, list,
>  						   &opts);
> diff --git a/bundle-uri.h b/bundle-uri.h
> index 357111ecce8..7905e56732c 100644
> --- a/bundle-uri.h
> +++ b/bundle-uri.h
> @@ -61,6 +61,15 @@ struct bundle_list {
>  	int version;
>  	enum bundle_list_mode mode;
>  	struct hashmap bundles;
> +
> +	/**
> +	 * The baseURI of a bundle_list is used as the base for any
> +	 * relative URIs advertised by the bundle list at that location.
> +	 *
> +	 * When the list is generated from a Git server, then use that
> +	 * server's location.

Hmmm, I think I'm missing something with my earlier comment. I thought the
'uri' argument to 'bundle_uri_parse_config_format()' was an individual
bundle's URI? What's the "server's location" in this context?

> +	 */
> +	char *baseURI;
>  };
>  
>  void init_bundle_list(struct bundle_list *list);
> diff --git a/t/helper/test-bundle-uri.c b/t/helper/test-bundle-uri.c
> index ffb975b7b4f..5aa0b494ce3 100644
> --- a/t/helper/test-bundle-uri.c
> +++ b/t/helper/test-bundle-uri.c
> @@ -40,6 +40,8 @@ static int cmd__bundle_uri_parse(int argc, const char **argv, enum input_mode mo
>  
>  	init_bundle_list(&list);
>  
> +	list.baseURI = xstrdup("<uri>");

Using a hardcoded value here leads to pretty different behavior in
'test-bundle-uri.c' vs. starting with an unset 'list.baseURI' in something
like 'clone'. Why does this need to be set to '<uri>' for the tests?

> +
>  	switch (mode) {
>  	case KEY_VALUE_PAIRS:
>  		if (argc != 1)
> diff --git a/t/t5750-bundle-uri-parse.sh b/t/t5750-bundle-uri-parse.sh
> index c2fe3f9c5a5..ed5262a8d2b 100755
> --- a/t/t5750-bundle-uri-parse.sh
> +++ b/t/t5750-bundle-uri-parse.sh
> @@ -30,6 +30,30 @@ test_expect_success 'bundle_uri_parse_line() just URIs' '
>  	test_cmp_config_output expect actual
>  '
>  
> +test_expect_success 'bundle_uri_parse_line(): relative URIs' '
> +	cat >in <<-\EOF &&
> +	bundle.one.uri=bundle.bdl
> +	bundle.two.uri=../bundle.bdl
> +	bundle.three.uri=sub/dir/bundle.bdl
> +	EOF
> +
> +	cat >expect <<-\EOF &&
> +	[bundle]
> +		version = 1
> +		mode = all
> +	[bundle "one"]
> +		uri = <uri>/bundle.bdl
> +	[bundle "two"]
> +		uri = bundle.bdl

This seems a little strange, but it looks like '<uri>/../bundle.bdl'
normalizes to 'bundle.bdl' because '<uri>' is treated like a regular path
element (like a directory). 

Out of curiosity, what would happen if 'bundle.two.uri' was
'../../bundle.bdl'?

> +	[bundle "three"]
> +		uri = <uri>/sub/dir/bundle.bdl
> +	EOF
> +
> +	test-tool bundle-uri parse-key-values in >actual 2>err &&
> +	test_must_be_empty err &&
> +	test_cmp_config_output expect actual
> +'
> +
>  test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty key or value' '
>  	cat >in <<-\EOF &&
>  	=bogus-value
Derrick Stolee Dec. 2, 2022, 4:03 p.m. UTC | #2
On 11/28/2022 8:25 PM, Victoria Dye wrote:
> Derrick Stolee via GitGitGadget wrote:

>> +	if (!list->baseURI) {
>> +		struct strbuf baseURI = STRBUF_INIT;
>> +		strbuf_addstr(&baseURI, uri);
>> +
>> +		/*
>> +		 * If the URI does not end with a trailing slash, then
>> +		 * remove the filename portion of the path. This is
>> +		 * important for relative URIs.
>> +		 */
>> +		strbuf_strip_file_from_path(&baseURI);
>> +		list->baseURI = strbuf_detach(&baseURI, NULL);
>
> Is the 'baseURI' is set to the URI of the first bundle (ordered by hash)? If
> data is distributed across multiple CDNs, couldn't this be a suboptimal
> choice? For example, if the first bundle is on 'A.com', but every other
> bundle is on 'B.org'?

The baseURI is set to one of two things:

1. The URI used for the clone, specifying the way the client connected to
   the Git server, or

2. The URI used to download the bundle list itself.

This allows the same bundle list file to be distributed to multiple CDNs,
assuming that the bundles themselves will have the same relative position
to the list.

>> +	/**
>> +	 * The baseURI of a bundle_list is used as the base for any
>> +	 * relative URIs advertised by the bundle list at that location.
>> +	 *
>> +	 * When the list is generated from a Git server, then use that
>> +	 * server's location.
>
> Hmmm, I think I'm missing something with my earlier comment. I thought the
> 'uri' argument to 'bundle_uri_parse_config_format()' was an individual
> bundle's URI? What's the "server's location" in this context?

I can work to make this concept clearer by rewording this comment.

>> @@ -40,6 +40,8 @@ static int cmd__bundle_uri_parse(int argc, const char **argv, enum input_mode mo
>>
>>  	init_bundle_list(&list);
>>
>> +	list.baseURI = xstrdup("<uri>");
>
> Using a hardcoded value here leads to pretty different behavior in
> 'test-bundle-uri.c' vs. starting with an unset 'list.baseURI' in something
> like 'clone'. Why does this need to be set to '<uri>' for the tests?

In this part of the test helper, we are not making a connection to a server
and instead parsing a bundle list file directly. To demonstrate how the
relative paths work during this parsing, we add a bogus baseURI here so
we can clearly see where the relative paths were parsed versus using the
URI as an absolute URI.


>> +test_expect_success 'bundle_uri_parse_line(): relative URIs' '
>> +	cat >in <<-\EOF &&
>> +	bundle.one.uri=bundle.bdl
>> +	bundle.two.uri=../bundle.bdl
>> +	bundle.three.uri=sub/dir/bundle.bdl
>> +	EOF
>> +
>> +	cat >expect <<-\EOF &&
>> +	[bundle]
>> +		version = 1
>> +		mode = all
>> +	[bundle "one"]
>> +		uri = <uri>/bundle.bdl
>> +	[bundle "two"]
>> +		uri = bundle.bdl
>
> This seems a little strange, but it looks like '<uri>/../bundle.bdl'
> normalizes to 'bundle.bdl' because '<uri>' is treated like a regular path
> element (like a directory).
>
> Out of curiosity, what would happen if 'bundle.two.uri' was
> '../../bundle.bdl'?

It will fail! The error message is

	"fatal: cannot strip one component off url '.'"

This is disappointing that an erroneous bundle list could cause a 'git
clone' command to die(), when we want the bundle URI feature to allow the
clone to continue normally even if the bundle downloads fail. I will mark
this for #leftoverbits, since it would involve changing the interface for
chop_last_dir() and relative_url() in remote.c.

At minimum, I will document this with a test case.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/bundle-uri.c b/bundle-uri.c
index 3469f1aaa98..ab91bb32e9b 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -7,6 +7,7 @@ 
 #include "hashmap.h"
 #include "pkt-line.h"
 #include "config.h"
+#include "remote.h"
 
 static int compare_bundles(const void *hashmap_cmp_fn_data,
 			   const struct hashmap_entry *he1,
@@ -49,6 +50,7 @@  void clear_bundle_list(struct bundle_list *list)
 
 	for_all_bundles_in_list(list, clear_remote_bundle_info, NULL);
 	hashmap_clear_and_free(&list->bundles, struct remote_bundle_info, ent);
+	free(list->baseURI);
 }
 
 int for_all_bundles_in_list(struct bundle_list *list,
@@ -163,7 +165,7 @@  static int bundle_list_update(const char *key, const char *value,
 	if (!strcmp(subkey, "uri")) {
 		if (bundle->uri)
 			return -1;
-		bundle->uri = xstrdup(value);
+		bundle->uri = relative_url(list->baseURI, value, NULL);
 		return 0;
 	}
 
@@ -190,6 +192,18 @@  int bundle_uri_parse_config_format(const char *uri,
 		.error_action = CONFIG_ERROR_ERROR,
 	};
 
+	if (!list->baseURI) {
+		struct strbuf baseURI = STRBUF_INIT;
+		strbuf_addstr(&baseURI, uri);
+
+		/*
+		 * If the URI does not end with a trailing slash, then
+		 * remove the filename portion of the path. This is
+		 * important for relative URIs.
+		 */
+		strbuf_strip_file_from_path(&baseURI);
+		list->baseURI = strbuf_detach(&baseURI, NULL);
+	}
 	result = git_config_from_file_with_options(config_to_bundle_list,
 						   filename, list,
 						   &opts);
diff --git a/bundle-uri.h b/bundle-uri.h
index 357111ecce8..7905e56732c 100644
--- a/bundle-uri.h
+++ b/bundle-uri.h
@@ -61,6 +61,15 @@  struct bundle_list {
 	int version;
 	enum bundle_list_mode mode;
 	struct hashmap bundles;
+
+	/**
+	 * The baseURI of a bundle_list is used as the base for any
+	 * relative URIs advertised by the bundle list at that location.
+	 *
+	 * When the list is generated from a Git server, then use that
+	 * server's location.
+	 */
+	char *baseURI;
 };
 
 void init_bundle_list(struct bundle_list *list);
diff --git a/t/helper/test-bundle-uri.c b/t/helper/test-bundle-uri.c
index ffb975b7b4f..5aa0b494ce3 100644
--- a/t/helper/test-bundle-uri.c
+++ b/t/helper/test-bundle-uri.c
@@ -40,6 +40,8 @@  static int cmd__bundle_uri_parse(int argc, const char **argv, enum input_mode mo
 
 	init_bundle_list(&list);
 
+	list.baseURI = xstrdup("<uri>");
+
 	switch (mode) {
 	case KEY_VALUE_PAIRS:
 		if (argc != 1)
diff --git a/t/t5750-bundle-uri-parse.sh b/t/t5750-bundle-uri-parse.sh
index c2fe3f9c5a5..ed5262a8d2b 100755
--- a/t/t5750-bundle-uri-parse.sh
+++ b/t/t5750-bundle-uri-parse.sh
@@ -30,6 +30,30 @@  test_expect_success 'bundle_uri_parse_line() just URIs' '
 	test_cmp_config_output expect actual
 '
 
+test_expect_success 'bundle_uri_parse_line(): relative URIs' '
+	cat >in <<-\EOF &&
+	bundle.one.uri=bundle.bdl
+	bundle.two.uri=../bundle.bdl
+	bundle.three.uri=sub/dir/bundle.bdl
+	EOF
+
+	cat >expect <<-\EOF &&
+	[bundle]
+		version = 1
+		mode = all
+	[bundle "one"]
+		uri = <uri>/bundle.bdl
+	[bundle "two"]
+		uri = bundle.bdl
+	[bundle "three"]
+		uri = <uri>/sub/dir/bundle.bdl
+	EOF
+
+	test-tool bundle-uri parse-key-values in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp_config_output expect actual
+'
+
 test_expect_success 'bundle_uri_parse_line() parsing edge cases: empty key or value' '
 	cat >in <<-\EOF &&
 	=bogus-value
@@ -136,6 +160,36 @@  test_expect_success 'parse config format: just URIs' '
 	test_cmp_config_output expect actual
 '
 
+test_expect_success 'parse config format: relative URIs' '
+	cat >in <<-\EOF &&
+	[bundle]
+		version = 1
+		mode = all
+	[bundle "one"]
+		uri = bundle.bdl
+	[bundle "two"]
+		uri = ../bundle.bdl
+	[bundle "three"]
+		uri = sub/dir/bundle.bdl
+	EOF
+
+	cat >expect <<-\EOF &&
+	[bundle]
+		version = 1
+		mode = all
+	[bundle "one"]
+		uri = <uri>/bundle.bdl
+	[bundle "two"]
+		uri = bundle.bdl
+	[bundle "three"]
+		uri = <uri>/sub/dir/bundle.bdl
+	EOF
+
+	test-tool bundle-uri parse-config in >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp_config_output expect actual
+'
+
 test_expect_success 'parse config format edge cases: empty key or value' '
 	cat >in1 <<-\EOF &&
 	= bogus-value
diff --git a/transport.c b/transport.c
index b33180226ae..2c4ff0c2023 100644
--- a/transport.c
+++ b/transport.c
@@ -1553,6 +1553,9 @@  int transport_get_remote_bundle_uri(struct transport *transport, int quiet)
 	    (git_config_get_bool("transfer.bundleuri", &value) || !value))
 		return 0;
 
+	if (!transport->bundles->baseURI)
+		transport->bundles->baseURI = xstrdup(transport->url);
+
 	if (!vtable->get_bundle_uri) {
 		if (quiet)
 			return -1;