[RFC,01/13] serve: add command to advertise bundle URIs

Message ID RFC-patch-01.13-4e1a0dbef5-20210805T150534Z-avarab@gmail.com (mailing list archive)
State New, archived
Series Add bundle-uri: resumably clones, static "dumb" CDN etc.

Commit Message

Ævar Arnfjörð Bjarmason Aug. 5, 2021, 3:07 p.m. UTC
When the uploadpack.bundleURI config is set to a URI (or URIs, if set
>1 times), advertise a "bundle-uri" command, then when the client
requests "bundle-uri" emit those URIs back at them.

The client CAN then request those URIs out of band, and after they're
done (after either disconnecting & coming back, or leaving us
hanging), proceed with the rest of the request flow, i.e. issuing an
"ls-refs" followed by a "fetch". The client MAY then send us "have"
lines with the tips they've unpacked from their newly acquired
bundle(s).

This commit doesn't implement any of that required client behavior,
only the trivial server behavior of spewing a list of URLs at the
client on request.
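
For illustration, the client's side of the request described above
could be framed on the wire roughly as follows. This is a minimal
sketch and not code from this series: pkt_line_format() and
bundle_uri_request() are hypothetical helper names. The only things
taken from protocol v2 are the pkt-line framing (four hex digits of
length, counting the 4-byte header itself) and the "0000" flush-pkt.

```c
#include <stdio.h>
#include <string.h>

/*
 * Frame one payload as a pkt-line: four lowercase hex digits giving the
 * total length (payload plus the 4-byte length header), then the payload.
 * Returns the number of bytes written, or -1 if it does not fit.
 * (Hypothetical helper, not part of git.git.)
 */
static int pkt_line_format(char *out, size_t outlen, const char *payload)
{
	size_t len = strlen(payload) + 4;

	/* 65520 is the largest total pkt-line length the protocol allows. */
	if (len > 65520 || len + 1 > outlen)
		return -1;
	snprintf(out, outlen, "%04x%s", (unsigned int)len, payload);
	return (int)len;
}

/* Build the request: "command=bundle-uri" followed by a flush-pkt. */
static int bundle_uri_request(char *out, size_t outlen)
{
	int n = pkt_line_format(out, outlen, "command=bundle-uri\n");

	if (n < 0 || (size_t)n + 5 > outlen)
		return -1;
	memcpy(out + n, "0000", 5); /* flush-pkt plus the trailing NUL */
	return n + 4;
}
```

The tests in this patch build the same request with "test-tool
pkt-line pack", which appends the LF to each line for you.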

There is already a uploadpack.blobPackfileUri setting for the server,
so why is this needed? The Documentation/technical/bundle-uri.txt
added in a preceding commit discusses that in more detail, but in
summary:

 1. There is no "real" support for packfile-uri in git.git. The
    uploadpack.blobPackfileUri setting allows carving out a list of
    blobs (actually any OIDs), but as alluded to in bfc2a36ff2a (Doc:
    clarify contents of packfile sent as URI, 2021-01-20) the only
    "real" implementation is JGit based.

 2. The uploadpack.blobPackfileUri is a MUST where this is a
    "CAN". I.e. once a client says they support packfile-uri for a
    given list of protocols, the server will send them a PACK response
    that assumes they've downloaded the URI the client was sent. If
    the client doesn't do that, they don't have a valid repository.

    Pointing at a bundle and having the client send us "have"
    lines (or not, maybe they couldn't fetch it, or decided they
    didn't want to) is more flexible, and can gracefully recover
    e.g. if the CDN isn't reachable (maybe you do support "https", but
    the CDN provider is down, or blocked your whole country).

 3. Because of the disconnect in #2 "dumb" servers can seed
    pre-clients, e.g. we might point to a repo.bundle whose exact
    state we aren't sure of (a cronjob updates it, sometimes). The
    client will discover its contents and give us the "have" lines.
    The "packfile-uri" method, by contrast, effectively requires the
    server to have those exact "have" lines (or rather, it will
    produce a similar PACK using give-or-take the same exclusions).

 4. This provides an easy way to implement the long sought-after "resumable
    clones". I.e. since we can assume that it's in the server's
    interest to keep their bundle(s) as up-to-date as possible, most
    or all of the history we need to fetch will be in the bundle. If
    we fail midway through the "clone" we can offload the problem of
    resuming to wget/curl/rsync/whatever, instead of (as has been
    suggested, but not implemented for the "normal" dialog)
    "repairing" a partial PACK response or something.
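
The "graceful recovery" in point 2 could look something like the
following on the client. This is a sketch under stated assumptions,
since this commit implements no client behavior: seed_from_bundles(),
the bundle_fetch_fn callback and stub_fetch() are all hypothetical
names, and the stub only stands in for a real download-and-unpack
step.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: download and unpack one bundle, 0 on success. */
typedef int (*bundle_fetch_fn)(const char *uri);

/*
 * Try every advertised URI. A failed download is skipped rather than
 * treated as fatal. Returns how many bundles were obtained; with 0 the
 * client simply does a normal fetch with no extra "have" lines.
 */
static size_t seed_from_bundles(const char *const *uris, size_t nr,
				bundle_fetch_fn fetch_bundle)
{
	size_t i, ok = 0;

	for (i = 0; i < nr; i++) {
		if (fetch_bundle(uris[i]) == 0)
			ok++;
	}
	return ok;
}

/* Illustrative stub: pretend any host starting with "down." is unreachable. */
static int stub_fetch(const char *uri)
{
	return strstr(uri, "//down.") ? -1 : 0;
}
```

Whatever the return value, the client still proceeds to "fetch". The
bundles only change which "have" lines it is able to send.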

There was a suggestion of implementing a similar feature long ago[1]
by Jeff King. The main difference between it and this approach is that
we've since gained protocol v2, so we can add this as an optional path
in the dialog between client and server. The 2011 implementation
hooked into the transport mechanism to try to clone from a bundle
directly. See also [2] and [3] for some later mentions of that
approach.

See also [4] for the series that implemented
uploadpack.blobPackfileUri, and [5] for a series on top that did the
.gitmodules check in that context. See [6] for the "ls-refs unborn"
feature which modified code in similar areas of the request flow.

1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/
2. https://lore.kernel.org/git/20190514092900.GA11679@sigill.intra.peff.net/
3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@coredump.intra.peff.net/
4. https://lore.kernel.org/git/cover.1591821067.git.jonathantanmy@google.com/
   Merged as 34e849b05a4 (Merge branch 'jt/cdn-offload', 2020-06-25)
5. https://lore.kernel.org/git/cover.1614021092.git.jonathantanmy@google.com/
   Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs',
   2021-03-01)
6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/protocol-v2.txt | 140 ++++++++++++++++++++++++
 Makefile                                |   1 +
 bundle-uri.c                            |  65 +++++++++++
 bundle-uri.h                            |  14 +++
 serve.c                                 |   6 +
 t/t5701-git-serve.sh                    | 124 ++++++++++++++++++++-
 6 files changed, 349 insertions(+), 1 deletion(-)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h

Comments

Derrick Stolee Aug. 10, 2021, 1:58 p.m. UTC | #1
On 8/5/2021 11:07 AM, Ævar Arnfjörð Bjarmason wrote:
...
> +bundle-uri CLIENT AND SERVER EXPECTATIONS
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The advertised bundles MUST contain one or more reference tips for use
> +by the client. Bundles that are not self-contained MUST use the
> +standard "-" prefixes in the bundle format to indicate their
> +prerequisites. I.e. they must be in the standard format "git bundle
> +create" would create.
> +
> +If after an `ls-refs` the client finds that the ref tips it wants can
> +be retrieved in their entirety from advertised bundle(s), it MAY
> +disconnect. The results of such a "clone" or "fetch" should be
> +indistinguishable from the state attained without using bundle-uri.
> +
> +The client MAY also keep the connection open pending download of the
> +bundle-uris, e.g. should one or more downloads (or their validation)
> +fail.

The only technical thought I had (so far) about this proposal was that
leaving the connection open while downloading the bundle would leave
unnecessary load on the servers when no communication is happening.
There is a cost to keeping an open SSH connection, so here it would be
good to at least have the Git client close the connection after
getting a 200 response from the bundle (but not waiting for all of its
contents).

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Aug. 23, 2021, 1:25 p.m. UTC | #2
On Tue, Aug 10 2021, Derrick Stolee wrote:

> On 8/5/2021 11:07 AM, Ævar Arnfjörð Bjarmason wrote:
> ...
>> +bundle-uri CLIENT AND SERVER EXPECTATIONS
>> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +The advertised bundles MUST contain one or more reference tips for use
>> +by the client. Bundles that are not self-contained MUST use the
>> +standard "-" prefixes in the bundle format to indicate their
>> +prerequisites. I.e. they must be in the standard format "git bundle
>> +create" would create.
>> +
>> +If after an `ls-refs` the client finds that the ref tips it wants can
>> +be retrieved in their entirety from advertised bundle(s), it MAY
>> +disconnect. The results of such a "clone" or "fetch" should be
>> +indistinguishable from the state attained without using bundle-uri.
>> +
>> +The client MAY also keep the connection open pending download of the
>> +bundle-uris, e.g. should one or more downloads (or their validation)
>> +fail.
>
> The only technical thought I had (so far) about this proposal was that
> leaving the connection open while downloading the bundle would leave
> unnecessary load on the servers when no communication is happening.
> There is a cost to keeping an open SSH connection, so here it would be
> good to at least have the Git client close the connection after
> getting a 200 response from the bundle (but not waiting for all of its
> contents).

Thanks. Yes it's something I'll have to fix. I was hoping that I'd get
away with it for an initial implementation, but e.g. using
transfer.injectBundleURI to bootstrap chromium.git's repo from a bundle
will take so long that Google's server will give up and hang up on you.

I wonder if it's something the transport layer should be doing in
general to resume connections if they go stale if it's at a point of
clean separation in the dialog, but in any case I'll need it for
bundle-uri.

Closing the connection is also going to be more expensive in some cases,
e.g. if the bundle takes 1s we'll open/close/download
bundle-uri/open/close the connection, instead of open/download
bundle-uri/close. I wonder if anyone cares though, we can always apply
some heuristic later I guess...
Patch

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 213538f1d0..d10d5e9ef6 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -556,3 +556,143 @@  and associated requested information, each separated by a single space.
 	attr = "size"
 
 	obj-info = obj-id SP obj-size
+
+bundle-uri
+~~~~~~~~~~
+
+If the 'bundle-uri' capability is advertised, the server supports the
+`bundle-uri` command.
+
+The capability is currently advertised with no value (i.e. not
+"bundle-uri=somevalue"), a value may be added in the future for
+supporting command-wide extensions. Clients MUST ignore any unknown
+capability values and proceed with the `bundle-uri` dialog they
+support.
+
+The 'bundle-uri' command is intended to be issued before `fetch` to
+get URIs to bundle files (see linkgit:git-bundle[1]) to "seed" and
+inform the subsequent `fetch` command.
+
+The client CAN issue `bundle-uri` before or after any other valid
+command. It's expected that it'll be issued after an `ls-refs` and
+before `fetch`.
+
+DISCUSSION of bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The intent of the feature is to optimize server resource consumption
+in the common case, by turning the fetch of a very
+large PACK during linkgit:git-clone[1] into a smaller incremental
+fetch.
+
+It also allows servers to achieve better caching in combination with
+an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
+
+It does this by making new clones or fetches a more predictable and
+common negotiation against the tips of recently produced *.bundle file(s).
+Servers might even pre-generate the results of such negotiations for
+the `uploadpack.packObjectsHook` as new pushes come in.
+
+I.e. the server would anticipate that fresh clones will download a
+known bundle, followed by catching up to the current state of the
+repository using ref tips found in that bundle (or bundles).
+
+PROTOCOL for bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^
+
+A `bundle-uri` request takes no arguments, and as noted above does not
+currently advertise a capability value. Both may be added in the
+future.
+
+When the client issues a `command=bundle-uri` the response is a list
+of URIs the server would like the client to fetch out-of-band before
+proceeding with the `fetch` request in this format:
+
+	output = bundle-uri-line
+		 bundle-uri-line* flush-pkt
+
+	bundle-uri-line = PKT-LINE(bundle-uri)
+			  *(SP bundle-feature-key *(=bundle-feature-val))
+			  LF
+
+	bundle-uri = A URI such as a https://, ssh:// etc. URI
+
+	bundle-feature-key = Any printable ASCII characters except SP or "="
+	bundle-feature-val = Any printable ASCII characters except SP or "="
+
+No `bundle-feature-key`=`bundle-feature-val` fields are currently
+defined. See the discussion of features below.
+
+bundle-uri CLIENT AND SERVER EXPECTATIONS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The advertised bundles MUST contain one or more reference tips for use
+by the client. Bundles that are not self-contained MUST use the
+standard "-" prefixes in the bundle format to indicate their
+prerequisites. I.e. they must be in the standard format "git bundle
+create" would create.
+
+If after an `ls-refs` the client finds that the ref tips it wants can
+be retrieved in their entirety from advertised bundle(s), it MAY
+disconnect. The results of such a "clone" or "fetch" should be
+indistinguishable from the state attained without using bundle-uri.
+
+The client MAY also keep the connection open pending download of the
+bundle-uris, e.g. should one or more downloads (or their validation)
+fail.
+
+The client MAY provide reference tips found in the bundle(s) to be
+downloaded out-of-band as `have` lines in the `fetch` request. They
+MAY also ignore the bundle(s) entirely (e.g. if they can't be
+downloaded) or some combination of the two.
+
+For the convenience of clients bundles SHOULD be provided in the order
+that they must be unpacked in if processed one-at-a-time by a dumber client.
+
+That usually means a "big bundle" first with most of the history
+that's self-contained, optionally followed by incremental updates on
+that "big bundle".
+
+This ordering is a mere convention and not a MUST, e.g. a repository
+with N branches with disconnected histories might have N "big
+bundles", each with their own self-contained history. A server might
+also only provide "incremental updates".
+
+A client MUST consider the content of the bundles themselves and their
+header as the ultimate source of truth. Servers MAY be tolerant of
+simpler clients by using the convention outlined above.
+
+As noted before a client MUST gracefully degrade on errors, whether
+that error is because of bad/missing data in the bundle URI(s), or
+because that client is too dumb to e.g. understand and fully parse out
+bundle headers and their prerequisite relationships.
+
+bundle-uri PROTOCOL FEATURES
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above no `bundle-feature-key`=`bundle-feature-val` fields
+are currently defined.
+
+They are intended for future per-URI metadata which older clients MUST
+ignore and gracefully degrade on. Any fields they do recognize they
+CAN also ignore.
+
+Any backwards-incompatible addition of per-URI key-value fields will be
+guarded by a new value or values in 'bundle-uri' capability
+advertisement itself, and/or by new future `bundle-uri` request
+arguments.
+
+While no per-URI key-value fields are currently supported, they're
+intended to support future features such as:
+
+ * Add a "hash=<val>" or "size=<bytes>" to advertise the expected hash or
+   size of the bundle file.
+
+ * Advertise that one or more bundle files are the same (to e.g. have
+   clients round-robin or otherwise choose one of N possible files).
+
+ * A "tip=<OID>" shortcut. A client who'd have the advertised <OID>
+   would know there was no need to download the relevant bundle(s),
+   they've got that OID already (for multi-tips the client would need
+   to fetch the bundle, or do e.g. HTTP range requests to get its
+   header).
diff --git a/Makefile b/Makefile
index 9573190f1d..877c6c47b6 100644
--- a/Makefile
+++ b/Makefile
@@ -850,6 +850,7 @@  LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
+LIB_OBJS += bundle-uri.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += cbtree.o
diff --git a/bundle-uri.c b/bundle-uri.c
new file mode 100644
index 0000000000..2d93e8b003
--- /dev/null
+++ b/bundle-uri.c
@@ -0,0 +1,65 @@ 
+#include "cache.h"
+#include "bundle-uri.h"
+#include "pkt-line.h"
+#include "config.h"
+
+/**
+ * serve.[ch] API.
+ */
+
+/*
+ * The "bundle-uri" capability is advertised only if there are URIs to
+ * serve up per the "uploadpack.bundleURI" config.
+ */
+static int advertise_bundle_uri = -1;
+
+static void send_bundle_uris(struct packet_writer *writer,
+			     struct string_list *uris)
+{
+	struct string_list_item *item;
+	for_each_string_list_item(item, uris) {
+		const char *uri = item->string;
+
+		packet_writer_write(writer, "%s", uri);
+	}
+}
+
+static struct string_list bundle_uris = STRING_LIST_INIT_DUP;
+
+static int bundle_uri_startup_config(const char *var, const char *value,
+				     void *data)
+{
+	if (!strcmp(var, "uploadpack.bundleuri")) {
+		advertise_bundle_uri = 1;
+		string_list_append(&bundle_uris, value);
+	}
+	return 0;
+}
+
+int bundle_uri_advertise(struct repository *r, struct strbuf *value)
+{
+	if (advertise_bundle_uri == -1) {
+		git_config(bundle_uri_startup_config, NULL);
+		if (advertise_bundle_uri == -1)
+			advertise_bundle_uri = 0;
+	}
+	return advertise_bundle_uri;
+}
+
+int bundle_uri_command(struct repository *r,
+		       struct packet_reader *request)
+{
+	struct packet_writer writer;
+	packet_writer_init(&writer, 1);
+
+	while (packet_reader_read(request) == PACKET_READ_NORMAL)
+		die("bundle-uri: unexpected argument: '%s'", request->line);
+	if (request->status != PACKET_READ_FLUSH)
+		die("bundle-uri: expected flush after arguments");
+
+	send_bundle_uris(&writer, &bundle_uris);
+
+	packet_writer_flush(&writer);
+
+	return 0;
+}
diff --git a/bundle-uri.h b/bundle-uri.h
new file mode 100644
index 0000000000..6a40efeb39
--- /dev/null
+++ b/bundle-uri.h
@@ -0,0 +1,14 @@ 
+#ifndef BUNDLE_URI_H
+#define BUNDLE_URI_H
+
+struct repository;
+struct packet_reader;
+struct packet_writer;
+
+/**
+ * serve.[ch] API.
+ */
+int bundle_uri_advertise(struct repository *r, struct strbuf *value);
+int bundle_uri_command(struct repository *r, struct packet_reader *request);
+
+#endif /* BUNDLE_URI_H */
diff --git a/serve.c b/serve.c
index 1817edc7f5..789bf5fc38 100644
--- a/serve.c
+++ b/serve.c
@@ -8,6 +8,7 @@ 
 #include "protocol-caps.h"
 #include "serve.h"
 #include "upload-pack.h"
+#include "bundle-uri.h"
 
 static int advertise_sid = -1;
 
@@ -104,6 +105,11 @@  static struct protocol_capability capabilities[] = {
 		.advertise = always_advertise,
 		.command = cap_object_info,
 	},
+	{
+		.name = "bundle-uri",
+		.advertise = bundle_uri_advertise,
+		.command = bundle_uri_command,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index 930721f053..21d5314d83 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -12,7 +12,7 @@  test_expect_success 'test capability advertisement' '
 	wrong_algo sha1:sha256
 	wrong_algo sha256:sha1
 	EOF
-	cat >expect <<-EOF &&
+	cat >expect.base <<-EOF &&
 	version 2
 	agent=git/$(git version | cut -d" " -f3)
 	ls-refs=unborn
@@ -20,8 +20,11 @@  test_expect_success 'test capability advertisement' '
 	server-option
 	object-format=$(test_oid algo)
 	object-info
+	EOF
+	cat >expect.trailer <<-EOF &&
 	0000
 	EOF
+	cat expect.base expect.trailer >expect &&
 
 	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
 		--advertise-capabilities >out &&
@@ -266,4 +269,123 @@  test_expect_success 'basics of object-info' '
 	test_cmp expect actual
 '
 
+# Test the basics of bundle-uri
+#
+test_expect_success 'test capability advertisement with uploadpack.bundleURI' '
+	test_config uploadpack.bundleURI FAKE &&
+
+	cat >expect.extra <<-EOF &&
+	bundle-uri
+	EOF
+	cat expect.base \
+	    expect.extra \
+	    expect.trailer >expect &&
+
+	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
+		--advertise-capabilities >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: dies if not enabled' '
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	cat >expect <<-\EOF &&
+	ERR serve: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with two URIs' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+	test_config uploadpack.bundleURI https://cdn.example.com/recent.bdl --add &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	https://cdn.example.com/recent.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: unknown future feature(s)' '
+	test_config uploadpack.bundleURI https://cdn.example.com/fake.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0001
+	some-feature
+	we-do-not
+	know=about
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: bundle-uri: unexpected argument: '"'"'some-feature'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
 test_done
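
A client consuming the `bundle-uri-line` grammar from the
protocol-v2.txt hunk above might split each line roughly like this.
It is a minimal sketch and not part of this series:
parse_bundle_uri_line(), the two structs and MAX_FEATURES are
hypothetical, and "size=1024" in the usage note below is only the
kind of future field the documentation speculates about, not a
defined feature.

```c
#include <string.h>

#define MAX_FEATURES 8

struct bundle_feature {
	const char *key;
	const char *val; /* NULL when the feature has no "=value" part */
};

struct bundle_uri_line {
	const char *uri;
	struct bundle_feature features[MAX_FEATURES];
	size_t nr;
};

/*
 * Split "uri[ SP key[=val]]...[LF]" in place. Returns 0 on success and
 * -1 if the line holds no URI. Features beyond MAX_FEATURES are dropped.
 * Note that strtok() is not reentrant and collapses runs of spaces.
 */
static int parse_bundle_uri_line(char *line, struct bundle_uri_line *out)
{
	char *tok, *eq;
	size_t len = strlen(line);

	if (len && line[len - 1] == '\n')
		line[len - 1] = '\0';

	memset(out, 0, sizeof(*out));
	tok = strtok(line, " ");
	if (!tok)
		return -1;
	out->uri = tok;

	while (out->nr < MAX_FEATURES && (tok = strtok(NULL, " "))) {
		eq = strchr(tok, '=');
		if (eq)
			*eq++ = '\0';
		out->features[out->nr].key = tok;
		out->features[out->nr].val = eq;
		out->nr++;
	}
	return 0;
}
```

Given "https://cdn.example.com/repo.bdl size=1024 future-flag", this
yields the URI plus two features, the second with a NULL value. A
client following the documentation would then ignore any key it does
not recognize instead of failing.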