diff mbox series

[2/8] bundle-uri: parse bundle.heuristic=creationToken

Message ID 9007249b9488c23f00c2d498ffd520e4af8b37a4.1673037405.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Bundle URIs V: creationToken heuristic for incremental fetches | expand

Commit Message

Derrick Stolee Jan. 6, 2023, 8:36 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The bundle.heuristic value communicates that the bundle list is
organized to make use of the bundle.<id>.creationToken values that may
be provided in the bundle list. Those values will create a total order
on the bundles, allowing the Git client to download them in a specific
order and even remember previously-downloaded bundles by storing the
maximum creation token value.

Before implementing any logic that parses or uses the
bundle.<id>.creationToken values, teach Git to parse the
bundle.heuristic value from a bundle list. We can use 'test-tool
bundle-uri' to print the heuristic value and verify that the parsing
works correctly.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/config/bundle.txt |  7 +++++++
 bundle-uri.c                    | 21 +++++++++++++++++++++
 bundle-uri.h                    | 14 ++++++++++++++
 t/t5750-bundle-uri-parse.sh     | 19 +++++++++++++++++++
 4 files changed, 61 insertions(+)

Comments

Junio C Hamano Jan. 9, 2023, 2:38 a.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static const char *heuristics[] = {
> +	[BUNDLE_HEURISTIC_NONE] = "",
> +	[BUNDLE_HEURISTIC_CREATIONTOKEN] = "creationToken",
> +};

Ideally it would require the least amount of maintenance if we could
define BUNDLE_HEURISTIC__COUNT as ARRAY_SIZE() of this thing, but it
being a file scope static, it might not be easy to arrange that.  As
a lessor altenative, would it make it safer to size this array more
explicitly using BUNDLE_HEURISTIC__COUNT macro?

	static const char *heuristics[BUNDLE_HEURISTIC__COUNT] = {
		...
	};

or is it more-or-less moot point to aim for safety because nobody
enforces that these [indices] used to define the contents of this
array are dense?

That is ...

> @@ -142,6 +150,19 @@ static int bundle_list_update(const char *key, const char *value,
>  			return 0;
>  		}
>  
> +		if (!strcmp(subkey, "heuristic")) {
> +			int i;
> +			for (i = 0; i < BUNDLE_HEURISTIC__COUNT; i++) {
> +				if (!strcmp(value, heuristics[i])) {
> +					list->heuristic = i;
> +					return 0;
> +				}
> +			}

... this strcmp() will segfault if heuristics[] array is sparse, or
BUNDLE_HEURISTIC__COUNT is larger than the array (i.e. you add a new
heuristic in "enum bundle_heuristic" before the __COUNT sentinel,
but forget to add it to the heuristics[] array).

"You are worrying too much.  Our developers would notice a segfault
and the current code, which may look risky to you, is something they
can live with", is a perfectly acceptable response, but somehow I
have this nagging feeling that we should be able to make it easier
to maintain without incurring extra runtime cost.

> diff --git a/bundle-uri.h b/bundle-uri.h
> index d5e89f1671c..ad82174112d 100644
> --- a/bundle-uri.h
> +++ b/bundle-uri.h
> @@ -52,6 +52,14 @@ enum bundle_list_mode {
>  	BUNDLE_MODE_ANY
>  };
>  
> +enum bundle_list_heuristic {
> +	BUNDLE_HEURISTIC_NONE = 0,
> +	BUNDLE_HEURISTIC_CREATIONTOKEN,
> +
> +	/* Must be last. */
> +	BUNDLE_HEURISTIC__COUNT,
> +};

The only reason to leave a trailing comma is to make it easy to
append new values at the end.  By omitting the trailing comma, you
can doubly signal "Must be last" here (not suggesting to remove the
comment; suggesting to remove the trailing comma).

Thanks.
Derrick Stolee Jan. 9, 2023, 2:20 p.m. UTC | #2
On 1/8/2023 9:38 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +static const char *heuristics[] = {
>> +	[BUNDLE_HEURISTIC_NONE] = "",
>> +	[BUNDLE_HEURISTIC_CREATIONTOKEN] = "creationToken",
>> +};
> 
> Ideally it would require the least amount of maintenance if we could
> define BUNDLE_HEURISTIC__COUNT as ARRAY_SIZE() of this thing, but it
> being a file scope static, it might not be easy to arrange that.  As
> a lessor altenative, would it make it safer to size this array more
> explicitly using BUNDLE_HEURISTIC__COUNT macro?
> 
> 	static const char *heuristics[BUNDLE_HEURISTIC__COUNT] = {
> 		...
> 	};

Yes, I should have used this size indicator.
 
> or is it more-or-less moot point to aim for safety because nobody
> enforces that these [indices] used to define the contents of this
> array are dense?
> 
> That is ...
> 
>> @@ -142,6 +150,19 @@ static int bundle_list_update(const char *key, const char *value,
>>  			return 0;
>>  		}
>>  
>> +		if (!strcmp(subkey, "heuristic")) {
>> +			int i;
>> +			for (i = 0; i < BUNDLE_HEURISTIC__COUNT; i++) {
>> +				if (!strcmp(value, heuristics[i])) {
>> +					list->heuristic = i;
>> +					return 0;
>> +				}
>> +			}
> 
> ... this strcmp() will segfault if heuristics[] array is sparse, or
> BUNDLE_HEURISTIC__COUNT is larger than the array (i.e. you add a new
> heuristic in "enum bundle_heuristic" before the __COUNT sentinel,
> but forget to add it to the heuristics[] array).
> 
> "You are worrying too much.  Our developers would notice a segfault
> and the current code, which may look risky to you, is something they
> can live with", is a perfectly acceptable response, but somehow I
> have this nagging feeling that we should be able to make it easier
> to maintain without incurring extra runtime cost.

You're right. I was following an established pattern of linking
enums to values, but I'm not sure that those other examples will
loop over the array like this looking for a value.

A safer approach would be to have an array of (enum, string) pairs
that could either be iterated in a loop (fast enough for a small
number of enum values, such as this case) or used to populate a
hashmap at runtime if needed for a large number of queries.

>> diff --git a/bundle-uri.h b/bundle-uri.h
>> index d5e89f1671c..ad82174112d 100644
>> --- a/bundle-uri.h
>> +++ b/bundle-uri.h
>> @@ -52,6 +52,14 @@ enum bundle_list_mode {
>>  	BUNDLE_MODE_ANY
>>  };
>>  
>> +enum bundle_list_heuristic {
>> +	BUNDLE_HEURISTIC_NONE = 0,
>> +	BUNDLE_HEURISTIC_CREATIONTOKEN,
>> +
>> +	/* Must be last. */
>> +	BUNDLE_HEURISTIC__COUNT,
>> +};
> 
> The only reason to leave a trailing comma is to make it easy to
> append new values at the end.  By omitting the trailing comma, you
> can doubly signal "Must be last" here (not suggesting to remove the
> comment; suggesting to remove the trailing comma).

This is a great example of "doing the typically right thing" but
without thinking of _why_ we do that thing. Thanks for pointing this
out.

Thanks,
-Stolee
Victoria Dye Jan. 17, 2023, 7:13 p.m. UTC | #3
Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> The bundle.heuristic value communicates that the bundle list is
> organized to make use of the bundle.<id>.creationToken values that may
> be provided in the bundle list. Those values will create a total order
> on the bundles, allowing the Git client to download them in a specific
> order and even remember previously-downloaded bundles by storing the
> maximum creation token value.
> 
> Before implementing any logic that parses or uses the
> bundle.<id>.creationToken values, teach Git to parse the
> bundle.heuristic value from a bundle list. We can use 'test-tool
> bundle-uri' to print the heuristic value and verify that the parsing
> works correctly.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/config/bundle.txt |  7 +++++++
>  bundle-uri.c                    | 21 +++++++++++++++++++++
>  bundle-uri.h                    | 14 ++++++++++++++
>  t/t5750-bundle-uri-parse.sh     | 19 +++++++++++++++++++
>  4 files changed, 61 insertions(+)
> 
> diff --git a/Documentation/config/bundle.txt b/Documentation/config/bundle.txt
> index daa21eb674a..3faae386853 100644
> --- a/Documentation/config/bundle.txt
> +++ b/Documentation/config/bundle.txt
> @@ -15,6 +15,13 @@ bundle.mode::
>  	complete understanding of the bundled information (`all`) or if any one
>  	of the listed bundle URIs is sufficient (`any`).
>  
> +bundle.heuristic::
> +	If this string-valued key exists, then the bundle list is designed to
> +	work well with incremental `git fetch` commands. The heuristic signals
> +	that there are additional keys available for each bundle that help
> +	determine which subset of bundles the client should download. The
> +	only value currently understood is `creationToken`.

This description clearly describes the 'heuristic' key and what it does.

> +
>  bundle.<id>.*::
>  	The `bundle.<id>.*` keys are used to describe a single item in the
>  	bundle list, grouped under `<id>` for identification purposes.
> diff --git a/bundle-uri.c b/bundle-uri.c
> index 36268dda172..56c94595c2a 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -9,6 +9,11 @@
>  #include "config.h"
>  #include "remote.h"
>  
> +static const char *heuristics[] = {
> +	[BUNDLE_HEURISTIC_NONE] = "",
> +	[BUNDLE_HEURISTIC_CREATIONTOKEN] = "creationToken",
> +};
> +
>  static int compare_bundles(const void *hashmap_cmp_fn_data,
>  			   const struct hashmap_entry *he1,
>  			   const struct hashmap_entry *he2,
> @@ -100,6 +105,9 @@ void print_bundle_list(FILE *fp, struct bundle_list *list)
>  	fprintf(fp, "\tversion = %d\n", list->version);
>  	fprintf(fp, "\tmode = %s\n", mode);
>  
> +	if (list->heuristic)
> +		printf("\theuristic = %s\n", heuristics[list->heuristic]);

Given this condition, the 'heuristic' key should not be sent if it's
'BUNDLE_HEURISTIC_NONE'. But, as a fallback...

> +
>  	for_all_bundles_in_list(list, summarize_bundle, fp);
>  }
>  
> @@ -142,6 +150,19 @@ static int bundle_list_update(const char *key, const char *value,
>  			return 0;
>  		}
>  
> +		if (!strcmp(subkey, "heuristic")) {
> +			int i;
> +			for (i = 0; i < BUNDLE_HEURISTIC__COUNT; i++) {
> +				if (!strcmp(value, heuristics[i])) {
> +					list->heuristic = i;
> +					return 0;
> +				}
> +			}

...this condition seems to handle 'BUNDLE_HEURISTIC_NONE' anyway. There's no
harm in this, since 'BUNDLE_HEURISTIC_NONE' is the default value of
'list->heuristic' anyway.

>  void init_bundle_list(struct bundle_list *list);
> diff --git a/t/t5750-bundle-uri-parse.sh b/t/t5750-bundle-uri-parse.sh
> index 7b4f930e532..6fc92a9c0d4 100755
> --- a/t/t5750-bundle-uri-parse.sh
> +++ b/t/t5750-bundle-uri-parse.sh
> @@ -250,4 +250,23 @@ test_expect_success 'parse config format edge cases: empty key or value' '
>  	test_cmp_config_output expect actual
>  '
>  
> +test_expect_success 'parse config format: creationToken heuristic' '
> +	cat >expect <<-\EOF &&
> +	[bundle]
> +		version = 1
> +		mode = all
> +		heuristic = creationToken
> +	[bundle "one"]
> +		uri = http://example.com/bundle.bdl
> +	[bundle "two"]
> +		uri = https://example.com/bundle.bdl
> +	[bundle "three"]
> +		uri = file:///usr/share/git/bundle.bdl
> +	EOF
> +
> +	test-tool bundle-uri parse-config expect >actual 2>err &&
> +	test_must_be_empty err &&
> +	test_cmp_config_output expect actual
> +'

And this test verifies that 'heuristic' is no longer being ignored. Looks
good!

> +
>  test_done
diff mbox series

Patch

diff --git a/Documentation/config/bundle.txt b/Documentation/config/bundle.txt
index daa21eb674a..3faae386853 100644
--- a/Documentation/config/bundle.txt
+++ b/Documentation/config/bundle.txt
@@ -15,6 +15,13 @@  bundle.mode::
 	complete understanding of the bundled information (`all`) or if any one
 	of the listed bundle URIs is sufficient (`any`).
 
+bundle.heuristic::
+	If this string-valued key exists, then the bundle list is designed to
+	work well with incremental `git fetch` commands. The heuristic signals
+	that there are additional keys available for each bundle that help
+	determine which subset of bundles the client should download. The
+	only value currently understood is `creationToken`.
+
 bundle.<id>.*::
 	The `bundle.<id>.*` keys are used to describe a single item in the
 	bundle list, grouped under `<id>` for identification purposes.
diff --git a/bundle-uri.c b/bundle-uri.c
index 36268dda172..56c94595c2a 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -9,6 +9,11 @@ 
 #include "config.h"
 #include "remote.h"
 
+static const char *heuristics[] = {
+	[BUNDLE_HEURISTIC_NONE] = "",
+	[BUNDLE_HEURISTIC_CREATIONTOKEN] = "creationToken",
+};
+
 static int compare_bundles(const void *hashmap_cmp_fn_data,
 			   const struct hashmap_entry *he1,
 			   const struct hashmap_entry *he2,
@@ -100,6 +105,9 @@  void print_bundle_list(FILE *fp, struct bundle_list *list)
 	fprintf(fp, "\tversion = %d\n", list->version);
 	fprintf(fp, "\tmode = %s\n", mode);
 
+	if (list->heuristic)
+		printf("\theuristic = %s\n", heuristics[list->heuristic]);
+
 	for_all_bundles_in_list(list, summarize_bundle, fp);
 }
 
@@ -142,6 +150,19 @@  static int bundle_list_update(const char *key, const char *value,
 			return 0;
 		}
 
+		if (!strcmp(subkey, "heuristic")) {
+			int i;
+			for (i = 0; i < BUNDLE_HEURISTIC__COUNT; i++) {
+				if (!strcmp(value, heuristics[i])) {
+					list->heuristic = i;
+					return 0;
+				}
+			}
+
+			/* Ignore unknown heuristics. */
+			return 0;
+		}
+
 		/* Ignore other unknown global keys. */
 		return 0;
 	}
diff --git a/bundle-uri.h b/bundle-uri.h
index d5e89f1671c..ad82174112d 100644
--- a/bundle-uri.h
+++ b/bundle-uri.h
@@ -52,6 +52,14 @@  enum bundle_list_mode {
 	BUNDLE_MODE_ANY
 };
 
+enum bundle_list_heuristic {
+	BUNDLE_HEURISTIC_NONE = 0,
+	BUNDLE_HEURISTIC_CREATIONTOKEN,
+
+	/* Must be last. */
+	BUNDLE_HEURISTIC__COUNT,
+};
+
 /**
  * A bundle_list contains an unordered set of remote_bundle_info structs,
  * as well as information about the bundle listing, such as version and
@@ -75,6 +83,12 @@  struct bundle_list {
 	 * advertised by the bundle list at that location.
 	 */
 	char *baseURI;
+
+	/**
+	 * A list can have a heuristic, which helps reduce the number of
+	 * downloaded bundles.
+	 */
+	enum bundle_list_heuristic heuristic;
 };
 
 void init_bundle_list(struct bundle_list *list);
diff --git a/t/t5750-bundle-uri-parse.sh b/t/t5750-bundle-uri-parse.sh
index 7b4f930e532..6fc92a9c0d4 100755
--- a/t/t5750-bundle-uri-parse.sh
+++ b/t/t5750-bundle-uri-parse.sh
@@ -250,4 +250,23 @@  test_expect_success 'parse config format edge cases: empty key or value' '
 	test_cmp_config_output expect actual
 '
 
+test_expect_success 'parse config format: creationToken heuristic' '
+	cat >expect <<-\EOF &&
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+	[bundle "one"]
+		uri = http://example.com/bundle.bdl
+	[bundle "two"]
+		uri = https://example.com/bundle.bdl
+	[bundle "three"]
+		uri = file:///usr/share/git/bundle.bdl
+	EOF
+
+	test-tool bundle-uri parse-config expect >actual 2>err &&
+	test_must_be_empty err &&
+	test_cmp_config_output expect actual
+'
+
 test_done