diff mbox series

[2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"

Message ID patch-2.3-3ac0539c053-20211025T211159Z-avarab@gmail.com (mailing list archive)
State New, archived
Headers show
Series bundle-uri: "dumb" static CDN offloading, spec & server implementation | expand

Commit Message

Ævar Arnfjörð Bjarmason Oct. 25, 2021, 9:25 p.m. UTC
Add a server-side implementation of a new "bundle-uri" command to
protocol v2. As discussed in the updated "protocol-v2.txt" this will
allow conforming clients to optionally seed their initial clones or
incremental fetches from URLs containing "*.bundle" files created with
"git bundle create".

The use-cases are similar to those of the existing "Packfile URIs",
and the two feature can be combined within a single request, but
"bundle-uri" has a few advantages over packfile-uris in some some
common scenarios, discussed below.

This change does not give us a working "bundle-uri" client. I have
those patches as a follow-up, but let's first establish what the
protocol for this should be like first. The client implementation will
then implement this specification.

With this change when the uploadpack.bundleURI config is set to a
URI (or URIs, if set >1 times), advertise a "bundle-uri" command. Then
when the client requests "bundle-uri" emit those URIs back at them.

Differences between this and the existing packfile-uri facility:

 A. There is no "real" support for packfile-uri in git.git. The
    uploadpack.blobPackfileUri setting allows carving out a list of
    blobs (actually any OIDs), but as alluded to in bfc2a36ff2a (Doc:
    clarify contents of packfile sent as URI, 2021-01-20) the only
    "real" implementation is JGit based.

 B. The uploadpack.blobPackfileUri is a MUST where this is a
    "CAN". I.e. once a client says they support packfile-uri of given
    list of protocols the server will send them a PACK response
    assuming they've downloaded the URI they client was sent, if the
    client doesn't do that they don't have a valid repository.

    Pointing at a bundle and having the client send us "have"
    lines (or not, maybe they couldn't fetch it, or decided they
    didn't want to) is more flexible, and can gracefully recover
    e.g. if the CDN isn't reachable (maybe you do support "https", but
    the CDN provider is down, or blocked your whole country).

 C. The client, after executing "ls-refs" will disconnect if it has
    also grabbed the "bundle-uris" and knows the server won't send it
    anything it doesn't already have (or expect to have, if it's
    downloading the bundles concurrent to an early disconnect).

    This is in (small) contrast to packfile-uri where a client would
    enter a negotiation dialog, which may or may not result in a
    packfile-uri and/or an inline PACK.

 D. Because of "C" clients can, if the bundles are up-to-date, get an
    up-to-date repository with just "bundle-uri" and "ls-refs" commands,
    with no need to enter a dialog with "git upload-pack".

    That small dialog is unlikely to matter for performance purposes,
    this section is just noting differences between "bundle-uri" and
    "packfile-uri".

As noted above the features are compatible, a client that supports
"bundle-uri" and "packfile-uri" might download a bundle, and then
proceed with a "fetch" dialog, that dialog might then result in
"packfile-uri" response.

In practice server operators are unlikely to want to mix the two,
since the main benefit of either approach is the ability to offload
large "clone" responses to CDNs. A server operator would have little
reason not to go with one approach or the other.

There was a suggestion of implementing a similar feature long ago[1]
by Jeff King. The main difference between it and this approach is that
we've since gained protocol v2, so we can add this as an optional path
in the dialog between client and server. The 2011 implementation
hooked into the transport mechanism to try to clone from a bundle
directly. See also [2] and [3] for some later mentions of that
approach.

See also [4] for the series that implemented
uploadpack.blobPackfileUri, and [5] for a series on top that did the
.gitmodules check in that context. See [6] for the "ls-refs unborn"
feature which modified code in similar areas of the request flow.

1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/
2. https://lore.kernel.org/git/20190514092900.GA11679@sigill.intra.peff.net/
3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@coredump.intra.peff.net/
4. https://lore.kernel.org/git/cover.1591821067.git.jonathantanmy@google.com/
   Merged as 34e849b05a4 (Merge branch 'jt/cdn-offload', 2020-06-25)
5. https://lore.kernel.org/git/cover.1614021092.git.jonathantanmy@google.com/
   Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs',
   2021-03-01)
6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17)

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/protocol-v2.txt | 209 ++++++++++++++++++++++++
 Makefile                                |   1 +
 bundle-uri.c                            |  55 +++++++
 bundle-uri.h                            |  14 ++
 serve.c                                 |   6 +
 t/t5701-git-serve.sh                    | 124 +++++++++++++-
 6 files changed, 408 insertions(+), 1 deletion(-)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h

Comments

Derrick Stolee Oct. 26, 2021, 2 p.m. UTC | #1
On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> +bundle-uri CLIENT ERROR RECOVERY
> +++++++++++++++++++++++++++++++++
> +
> +A client MUST above all gracefully degrade on errors, whether that
> +error is because of bad missing/data in the bundle URI(s), because
> +that client is too dumb to e.g. understand and fully parse out bundle
> +headers and their prerequisite relationships, or something else.

"too dumb" seems a bit informal to me, especially because you
immediately elaborate on its meaning. You could rewrite as follows:

  ...because
  that client can't understand or fully parse out bundle
  headers and their prerequisite relationships, or something else.

> +Server operators should feel confident in turning on "bundle-uri" and
> +not worry if e.g. their CDN goes down that clones or fetches will run
> +into hard failures. Even if the server bundle bundle(s) are
> +incomplete, or bad in some way the client should still end up with a
> +functioning repository, just as if it had chosen not to use this
> +protocol extension.

Also, insertions of "e.g." in the middle of a sentence don't flow well.

  Server operators should feel confident in turning on "bundle-uri" and
  not worry that failures such as the CDN being unavailable will cause
  clones or fetches to have hard failures. Even if the server bundle(s)
  are invalid, the client should still end up with a functioning
  repository, just as if it had chosen not to use this protocol extension.

(Note: I also removed a "bundle bundle(s)" that was split across a line
break.)

> +bundle-uri SERVER TO CLIENT
> ++++++++++++++++++++++++++++
> +
> +The ordering of the returned bundle uris is not significant. Clients

I'm late to noticing, but shouldn't "URI" be all-caps when not used in
the literal capability string "bundle-uri"?

> +bundle-uri CLIENT TO SERVER
> ++++++++++++++++++++++++++++
> +
> +The client SHOULD provide reference tips found in the bundle header(s)
> +as 'have' lines in any subsequent `fetch` request. A client MAY also
> +ignore the bundle(s) entirely if doing so is deemed worse for some
> +reason, e.g. if the bundles can't be downloaded, it doesn't like the
> +tips it finds etc.

I would just stop after "is deemed worse for some reason." because one
example is obvious and the other is unclear how the client would detect
that situation. (Maybe: tip commit timestamps are really old?)

> +
> +WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +
> +If after issuing `bundle-uri` and `ls-refs`, and getting the header(s)
> +of the bundle(s) the client finds that the ref tips it wants can be
> +retrieved entirety from advertised bundle(s), it MAY disconnect. The

s/entirety/entirely/

> +results of such a 'clone' or 'fetch' should be indistinguishable from
> +the state attained without using bundle-uri.
> +
> +EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY
> +++++++++++++++++++++++++++++++++++++++++++++++
> +
> +A client MAY perform an early disconnect while still downloading the
> +bundle(s) (having streamed and parsed their headers). In such a case
> +the client MUST gracefully recover from any errors related to
> +finishing the download and validation of the bundle(s).
> +
> +I.e. a client might need to re-connect and issue a 'fetch' command,
> +and possibly fall back to not making use of 'bundle-uri' at all.

Use "For example," over starting a sentence with "i.e.". The examples
of "i.e." and "e.g." already in this document show proper use, which
involves parentheses.

> +This "MAY" behavior is specified as such (and not a "SHOULD") on the
> +assumption that a server advertising bundle uris is more likely than
> +not to be serving up a relatively large repository, and to be pointing
> +to URIs that have a good chance of being in working order. A client
> +MAY e.g. look at the payload size of the bundles as a heuristic to see

Again, here, the entire sentence is an example. This "e.g." can be
removed with no loss of meaning.

> +if an early disconnect is worth it, should falling back on a full
> +"fetch" dialog be necessary.


> +While no per-URI key-value are currently supported currently they're
> +intended to support future features such as:
> +
> + * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or
> +   size of the bundle file.

I suppose if one wanted to add this server-to-bundle coupling, then some
clients might appreciate it.

> + * Advertise that one or more bundle files are the same (to e.g. have
> +   clients round-robin or otherwise choose one of N possible files).

  * Advertise that one or more bundle files are the same, to allow for
    redundancy without causing duplicated effort.

> +static void send_bundle_uris(struct packet_writer *writer,
> +			     struct string_list *uris)
> +{
> +	struct string_list_item *item;
> +
> +	for_each_string_list_item(item, uris)
> +		packet_writer_write(writer, "%s", item->string);
> +}
> +
> +static int advertise_bundle_uri = -1;
> +static struct string_list bundle_uris = STRING_LIST_INIT_DUP;

I see you put send_bundle_uris() before the global bundle_uris so
it can be independent, but do you expect anyone to call send_bundle_uris()
via a different list?

Should we find a different place to store this data?

> +static int bundle_uri_config(const char *var, const char *value, void *data)
> +{
> +	if (!strcmp(var, "uploadpack.bundleuri")) {
> +		advertise_bundle_uri = 1;
> +		string_list_append(&bundle_uris, value);
> +	}
> +
> +	return 0;
> +}

Here, we are dictating that the URI list is available as a multi-valued
config "uploadpack.bundleuri".

1. Should this be updated in Documentation/config/uploadpack.txt?

2. This seems difficult to extend to your possible future features as
   listed in the protocol docs, mainly because this can only store the
   flat URI string. To add things like hash values, sizes, and prereqs,
   you would need more data included and grouped on a per-URI basis.
   What plans do you have to make extensions here while remaining
   somewhat compatible with downgrading Git versions?

> @@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = {
>  		.advertise = always_advertise,
>  		.command = cap_object_info,
>  	},
> +	{
> +		.name = "bundle-uri",
> +		.advertise = bundle_uri_advertise,
> +		.command = bundle_uri_command,
> +	},
>  };

I really appreciate that it is this simple to extend protocol v2.

> +test_expect_success 'basics of bundle-uri: dies if not enabled' '
> +	test-tool pkt-line pack >in <<-EOF &&
> +	command=bundle-uri
> +	0000
> +	EOF
> +
> +	cat >err.expect <<-\EOF &&
> +	fatal: invalid command '"'"'bundle-uri'"'"'
> +	EOF
> +
> +	cat >expect <<-\EOF &&
> +	ERR serve: invalid command '"'"'bundle-uri'"'"'
> +	EOF
> +
> +	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
> +	test_cmp err.expect err.actual &&
> +	test_must_be_empty out
> +'
> +
> +

hyper-nit: double newline.

The implementation seems simple enough, which I like. I'm a bit
your current use of Git config as the back-end, only because it is
difficult to be future-proof. As the functionality stands today, the
current config design works just fine. Perhaps we don't need to
worry about the future, because we can design a new, complementary
storage for that extra data. It seems worth exploring for a little
while, though. Perhaps we should take a page out of 'git-remote'
and how it stores named remotes with sub-items for metadata. The
names probably don't need to ever be exposed to users, but it could
be beneficial to anyone implementing this scheme.

[bundle "main"]
	uri = https://example.com/my-bundle
	uri = https://redundant-cdn.com/my-bundle
	size = 120523
	sha256 = {64hexchars}

[bundle "fork"]
	uri = https://cdn.org/my-fork
	size = 334
	sha256 = {...}
	prereq = {oid}

This kind of layout has an immediate grouping of data that should
help any future plan. Notice how I included multiple "uri" lines
in the "main", which helps with your plan for duplicate URIs.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Oct. 26, 2021, 3 p.m. UTC | #2
On Tue, Oct 26 2021, Derrick Stolee wrote:

> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
>> +bundle-uri CLIENT ERROR RECOVERY
>> +++++++++++++++++++++++++++++++++
>> +
>> +A client MUST above all gracefully degrade on errors, whether that
>> +error is because of bad missing/data in the bundle URI(s), because
>> +that client is too dumb to e.g. understand and fully parse out bundle
>> +headers and their prerequisite relationships, or something else.
>
> "too dumb" seems a bit informal to me, especially because you
> immediately elaborate on its meaning. You could rewrite as follows:
>
>   ...because
>   that client can't understand or fully parse out bundle
>   headers and their prerequisite relationships, or something else.

Thanks, I've snipped all your subsequent comments on
phrasing/clarifications etc, except insofar as they have questions I
need to address (as opposed to just my bad grammar/phrasing etc).

Thanks a lot for them, will go through them closely for any subsequent
re-roll & address them.

> [...]
>> +While no per-URI key-value are currently supported currently they're
>> +intended to support future features such as:
>> +
>> + * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or
>> +   size of the bundle file.
>
> I suppose if one wanted to add this server-to-bundle coupling, then some
> clients might appreciate it.

For packfile-uri there's a hard dependency on the server transferring
the hash of the PACK file.

I've intentionally omitted it, the reasons are covered in [1], which I
realize now should really be part of this early series.

Basically having it as a hard requirement isn't necessary for security
or payload validation. Any server who's worried about their transport
integrity would point to a https URI under their control, any
checksumming and validation we'll need we'll get from the transport
layer and the client's reachability check.

Having it would mean that you need closer cooperation by default between
server and CDN than I'm aiming for, i.e. a server should be able to
point to some URI somewhere updated by a dumb hourly cronjob, without
needing to pass information back & forth about what the "current" URL
is. The client will discover all that.

But I left that "hash=*" in because it could be optionally added, in
case someone really wants it for some reason...

1. https://lore.kernel.org/git/RFC-patch-13.13-1e657ed27a-20210805T150534Z-avarab@gmail.com/

>> + * Advertise that one or more bundle files are the same (to e.g. have
>> +   clients round-robin or otherwise choose one of N possible files).
>
>   * Advertise that one or more bundle files are the same, to allow for
>     redundancy without causing duplicated effort.

*nod*

>> +static void send_bundle_uris(struct packet_writer *writer,
>> +			     struct string_list *uris)
>> +{
>> +	struct string_list_item *item;
>> +
>> +	for_each_string_list_item(item, uris)
>> +		packet_writer_write(writer, "%s", item->string);
>> +}
>> +
>> +static int advertise_bundle_uri = -1;
>> +static struct string_list bundle_uris = STRING_LIST_INIT_DUP;
>
> I see you put send_bundle_uris() before the global bundle_uris so
> it can be independent, but do you expect anyone to call send_bundle_uris()
> via a different list?

No, I'll move that around or rather fold it into bundle_uri_command()
directly.

I think I'd originally copied the structure of send_ref() and ls_refs()
from ls-refs.c, but it doesn't make much sense anymore here for this
2-line function. Thanks.

> Should we find a different place to store this data?
>
>> +static int bundle_uri_config(const char *var, const char *value, void *data)
>> +{
>> +	if (!strcmp(var, "uploadpack.bundleuri")) {
>> +		advertise_bundle_uri = 1;
>> +		string_list_append(&bundle_uris, value);
>> +	}
>> +
>> +	return 0;
>> +}
>
> Here, we are dictating that the URI list is available as a multi-valued
> config "uploadpack.bundleuri".
>
> 1. Should this be updated in Documentation/config/uploadpack.txt?

Definitely. I'll either incorporate that or re-structure this leading
series so that it's more design-doc/protocol focused, in any case all of
this ends up documented in the right places eventually...

> 2. This seems difficult to extend to your possible future features as
>    listed in the protocol docs, mainly because this can only store the
>    flat URI string. To add things like hash values, sizes, and prereqs,
>    you would need more data included and grouped on a per-URI basis.
>    What plans do you have to make extensions here while remaining
>    somewhat compatible with downgrading Git versions?

[...addressed below...]

>> @@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = {
>>  		.advertise = always_advertise,
>>  		.command = cap_object_info,
>>  	},
>> +	{
>> +		.name = "bundle-uri",
>> +		.advertise = bundle_uri_advertise,
>> +		.command = bundle_uri_command,
>> +	},
>>  };
>
> I really appreciate that it is this simple to extend protocol v2.

Yeah! FWIW I've got some WIP patches to make it easier still, i.e. some
further simplification & validation in the serve.[ch] API.

>> +test_expect_success 'basics of bundle-uri: dies if not enabled' '
>> +	test-tool pkt-line pack >in <<-EOF &&
>> +	command=bundle-uri
>> +	0000
>> +	EOF
>> +
>> +	cat >err.expect <<-\EOF &&
>> +	fatal: invalid command '"'"'bundle-uri'"'"'
>> +	EOF
>> +
>> +	cat >expect <<-\EOF &&
>> +	ERR serve: invalid command '"'"'bundle-uri'"'"'
>> +	EOF
>> +
>> +	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
>> +	test_cmp err.expect err.actual &&
>> +	test_must_be_empty out
>> +'
>> +
>> +
>
> hyper-nit: double newline.
>
> The implementation seems simple enough, which I like. I'm a bit

I mentally inserted the missing "skeptical/uncertain" etc. here :)

> your current use of Git config as the back-end, only because it is
> difficult to be future-proof. As the functionality stands today, the
> current config design works just fine. Perhaps we don't need to
> worry about the future, because we can design a new, complementary
> storage for that extra data. It seems worth exploring for a little
> while, though. Perhaps we should take a page out of 'git-remote'
> and how it stores named remotes with sub-items for metadata. The
> names probably don't need to ever be exposed to users, but it could
> be beneficial to anyone implementing this scheme.
>
> [bundle "main"]
> 	uri = https://example.com/my-bundle
> 	uri = https://redundant-cdn.com/my-bundle
> 	size = 120523
> 	sha256 = {64hexchars}
>
> [bundle "fork"]
> 	uri = https://cdn.org/my-fork
> 	size = 334
> 	sha256 = {...}
> 	prereq = {oid}
>
> This kind of layout has an immediate grouping of data that should
> help any future plan. Notice how I included multiple "uri" lines
> in the "main", which helps with your plan for duplicate URIs.

At first sight I like that config schema much better than my current
one, in particular how it makes the future-proofed "these N urls are one
logical URL" case simpler.

But overall I'm a bit on the fence, and leaning towards keeping what I
have, not out of any lazynes or wanting to just keep what I have mind
you.

But rather that the main benefit of the current one is that it's a 1=1
mapping to the line-based protocol, and you can say update your URLs as:

    git config --replace-all uploadpack.bundleUri "$first_url" &&
    git config --add uploadpack.bundleUri "$second_url"

Having usually you'd know the URL you'd like to replace, so you can use
the [value-pattern] of --replace-all, if it's a named section or other
split-out structure that become a two-step lookup.

Also for testing I've got a (trivial) plumbing tool I'll submit called
"git ls-remote-bundle-uri" (could be folded into something else I guess)
to dump the server-side config, it's nice that you can pretty much
directly copy/paste it into config without needing to adjust it.

Having said all that I'm not sure I feel strongly about it either way,
what do you think given the above?

I think most "real" server operators will use this as
GIT_CONFIG_COUNT=<n> GIT_CONFIG_{KEY,VALUE}_<1..n>, which can of course
work with any config schema, but if you've got code generating it on the
other side naming the targets is probably a slight hassle / confusing.

There's also the small matter of it being consistent with the
packfile-uri config in its current form, but that shouldn't be a reason
not to come up with something better. If anything any better suggestion
(if we go for that) could be supported by it too...
Derrick Stolee Oct. 27, 2021, 1:55 a.m. UTC | #3
On 10/26/2021 11:00 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Oct 26 2021, Derrick Stolee wrote:
>> The implementation seems simple enough, which I like. I'm a bit
> 
> I mentally inserted the missing "skeptical/uncertain" etc. here :)

More "uncertain" than "skeptical". The current plan works perfectly
for the current implementation, so there is an element of YAGNI that
could easily lead us to avoid overthinking this.
 
>> your current use of Git config as the back-end, only because it is
>> difficult to be future-proof. As the functionality stands today, the
>> current config design works just fine. Perhaps we don't need to
>> worry about the future, because we can design a new, complementary
>> storage for that extra data. It seems worth exploring for a little
>> while, though. Perhaps we should take a page out of 'git-remote'
>> and how it stores named remotes with sub-items for metadata. The
>> names probably don't need to ever be exposed to users, but it could
>> be beneficial to anyone implementing this scheme.
>>
>> [bundle "main"]
>> 	uri = https://example.com/my-bundle
>> 	uri = https://redundant-cdn.com/my-bundle
>> 	size = 120523
>> 	sha256 = {64hexchars}
>>
>> [bundle "fork"]
>> 	uri = https://cdn.org/my-fork
>> 	size = 334
>> 	sha256 = {...}
>> 	prereq = {oid}
>>
>> This kind of layout has an immediate grouping of data that should
>> help any future plan. Notice how I included multiple "uri" lines
>> in the "main", which helps with your plan for duplicate URIs.
> 
> At first sight I like that config schema much better than my current
> one, in particular how it makes the future-proofed "these N urls are one
> logical URL" case simpler.
> 
> But overall I'm a bit on the fence, and leaning towards keeping what I
> have, not out of any lazynes or wanting to just keep what I have mind
> you.
> 
> But rather that the main benefit of the current one is that it's a 1=1
> mapping to the line-based protocol, and you can say update your URLs as:
> 
>     git config --replace-all uploadpack.bundleUri "$first_url" &&
>     git config --add uploadpack.bundleUri "$second_url"
> 
> Having usually you'd know the URL you'd like to replace, so you can use
> the [value-pattern] of --replace-all, if it's a named section or other
> split-out structure that become a two-step lookup.

Don't forget to use --fixed-value for exact string matching instead of
regex matching!

> Also for testing I've got a (trivial) plumbing tool I'll submit called
> "git ls-remote-bundle-uri" (could be folded into something else I guess)
> to dump the server-side config, it's nice that you can pretty much
> directly copy/paste it into config without needing to adjust it.

With the appropriate helper structs and methods in the product code,
such helper tools will still be simple without being a second place
that is directly aware of how the values are stored to disk. I don't
judge your prototype work that helps you build the feature, but it's
simultaneously not a reason to stick to a design.

> Having said all that I'm not sure I feel strongly about it either way,
> what do you think given the above?

I'm not feeling too strong about it right now. The current design
does not need anything extra, but it also purposefully leaves certain
things open for extension in the future.

The thing I worry about is that there will be two supported ways to
store a list of bundle URIs: a flat list of URIs in the multi-valued
uploadPack.bundleURI config value, but then also a second option that
allows the extensions that arise. It's a layer of complication that
would be nice to avoid if there was an easy way to do it, but the
schema I sketched earlier isn't simple enough to merit a switch right
now. Perhaps someone else will have an idea that accomplishes the
same goal, but also is less complicated?
 
> I think most "real" server operators will use this as
> GIT_CONFIG_COUNT=<n> GIT_CONFIG_{KEY,VALUE}_<1..n>, which can of course
> work with any config schema, but if you've got code generating it on the
> other side naming the targets is probably a slight hassle / confusing.

You'd really overload the environment this way? That's not how I would
approach it, but maybe there is a benefit over writing to the repository's
config file. I suppose that you could store the data in a database and
link it to the repository at runtime instead.

> There's also the small matter of it being consistent with the
> packfile-uri config in its current form, but that shouldn't be a reason
> not to come up with something better. If anything any better suggestion
> (if we go for that) could be supported by it too...

What do you mean about being consistent with packfile-uri? This layer
that we care about isn't even implemented in git.git.

Thanks,
-Stolee
Derrick Stolee Oct. 27, 2021, 2:01 a.m. UTC | #4
On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> Add a server-side implementation of a new "bundle-uri" command to
> protocol v2. As discussed in the updated "protocol-v2.txt" this will
> allow conforming clients to optionally seed their initial clones or
> incremental fetches from URLs containing "*.bundle" files created with
> "git bundle create".

...

> +DISCUSSION of bundle-uri
> +^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +The intent of the feature is optimize for server resource consumption
> +in the common case by changing the common case of fetching a very
> +large PACK during linkgit:git-clone[1] into a smaller incremental
> +fetch.
> +
> +It also allows servers to achieve better caching in combination with
> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
> +
> +By having new clones or fetches be a more predictable and common
> +negotiation against the tips of recently produces *.bundle file(s).
> +Servers might even pre-generate the results of such negotiations for
> +the `uploadpack.packObjectsHook` as new pushes come in.
> +
> +I.e. the server would anticipate that fresh clones will download a
> +known bundle, followed by catching up to the current state of the
> +repository using ref tips found in that bundle (or bundles).
> +
> +PROTOCOL for bundle-uri
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +A `bundle-uri` request takes no arguments, and as noted above does not
> +currently advertise a capability value. Both may be added in the
> +future.

One thing I realized was missing from this proposal is any interaction
with partial clone. It would be disappointing if we could not advertise
bundles of commit-and-tree packfiles for blobless partial clones.

There is currently no way for the client to signal the filter type
during this command. Not having any way to extend to include that
seems like an oversight we should remedy before committing to a
protocol that can't be extended.

(This also seems like a good enough reason to group the URIs into a
struct-like storage, because the filter type could be stored next to
the URI.)

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Oct. 27, 2021, 8:29 a.m. UTC | #5
On Tue, Oct 26 2021, Derrick Stolee wrote:

> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> Add a server-side implementation of a new "bundle-uri" command to
>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>> allow conforming clients to optionally seed their initial clones or
>> incremental fetches from URLs containing "*.bundle" files created with
>> "git bundle create".
>
> ...
>
>> +DISCUSSION of bundle-uri
>> +^^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +The intent of the feature is optimize for server resource consumption
>> +in the common case by changing the common case of fetching a very
>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>> +fetch.
>> +
>> +It also allows servers to achieve better caching in combination with
>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>> +
>> +By having new clones or fetches be a more predictable and common
>> +negotiation against the tips of recently produces *.bundle file(s).
>> +Servers might even pre-generate the results of such negotiations for
>> +the `uploadpack.packObjectsHook` as new pushes come in.
>> +
>> +I.e. the server would anticipate that fresh clones will download a
>> +known bundle, followed by catching up to the current state of the
>> +repository using ref tips found in that bundle (or bundles).
>> +
>> +PROTOCOL for bundle-uri
>> +^^^^^^^^^^^^^^^^^^^^^^^
>> +
>> +A `bundle-uri` request takes no arguments, and as noted above does not
>> +currently advertise a capability value. Both may be added in the
>> +future.
>
> One thing I realized was missing from this proposal is any interaction
> with partial clone. It would be disappointing if we could not advertise
> bundles of commit-and-tree packfiles for blobless partial clones.
>
> There is currently no way for the client to signal the filter type
> during this command. Not having any way to extend to include that
> seems like an oversight we should remedy before committing to a
> protocol that can't be extended.
>
> (This also seems like a good enough reason to group the URIs into a
> struct-like storage, because the filter type could be stored next to
> the URI.)

I'll update the docs to note that. I'd definitely like to leave out any
implementation of filter/shallow for an initial iteration of this for
simplicity, but the protocol keyword/behavior is open-ended enough to
permit any extension.

I.e. the server can start advertising "bundle-uri=shallow", and future
clients can request arbitrary key-value pairs in addition to just
"bundle-uri" now.

Having said that I think that *probably* this is something that'll never
be implemented, but maybe I'll eat my words there.

The reason is that once we're in the "fetch" dialog with the server, as
we are with "filter" and "shallow" I'd think that we'd be better of just
sending a packfile-uri, since that's tailor-made for that use-case.

But I suppose we could also advertise e.g.:

    <bundle-uri> tip=<oid> depth=1

Which a client that noticed that it noticed say the --single-branch at
<oid> but with depth=1 could use before it ever got to "fetch".

But (and I haven't looked into this really) I'd think that would quickly
get you into having a bundle with a PACK payload that wouldn't be
representable with the current bundle header format, which I think we'd
always want a 1=1 mapping of. I.e. you can specify a prereq, but not
leave out trees/blobs etc.

So thoughts on that most welcome, in particular how it could be made
future-proof.
Derrick Stolee Oct. 27, 2021, 4:31 p.m. UTC | #6
On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Oct 26 2021, Derrick Stolee wrote:
> 
>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>> Add a server-side implementation of a new "bundle-uri" command to
>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>> allow conforming clients to optionally seed their initial clones or
>>> incremental fetches from URLs containing "*.bundle" files created with
>>> "git bundle create".
>>
>> ...
>>
>>> +DISCUSSION of bundle-uri
>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>> +
>>> +The intent of the feature is optimize for server resource consumption
>>> +in the common case by changing the common case of fetching a very
>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>> +fetch.
>>> +
>>> +It also allows servers to achieve better caching in combination with
>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>> +
>>> +By having new clones or fetches be a more predictable and common
>>> +negotiation against the tips of recently produces *.bundle file(s).
>>> +Servers might even pre-generate the results of such negotiations for
>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>> +
>>> +I.e. the server would anticipate that fresh clones will download a
>>> +known bundle, followed by catching up to the current state of the
>>> +repository using ref tips found in that bundle (or bundles).
>>> +
>>> +PROTOCOL for bundle-uri
>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>> +
>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>> +currently advertise a capability value. Both may be added in the
>>> +future.
>>
>> One thing I realized was missing from this proposal is any interaction
>> with partial clone. It would be disappointing if we could not advertise
>> bundles of commit-and-tree packfiles for blobless partial clones.
>>
>> There is currently no way for the client to signal the filter type
>> during this command. Not having any way to extend to include that
>> seems like an oversight we should remedy before committing to a
>> protocol that can't be extended.
>>
>> (This also seems like a good enough reason to group the URIs into a
>> struct-like storage, because the filter type could be stored next to
>> the URI.)
> 
> I'll update the docs to note that. I'd definitely like to leave out any
> implementation of filter/shallow for an initial iteration of this for
> simplicity, but the protocol keyword/behavior is open-ended enough to
> permit any extension.

It would be good to be explicit about how this would work. Looking at
it fresh, it seems that the server could send multiple bundle URIs with
the extra metadata to say which ones have a filter (and what that filter
is). The client could then check if a bundle matches the given filter.

But this is a bit inverted: the filter mechanism currently has the client
request a given filter and the server responds with _at least_ that much
data. This allows the server to ignore things like pathspec-filters or
certain size-based filters. If the client just ignores a bundle URI
because it doesn't match the exact filter, this could lead the client to
ask for the data without a bundle, even if it would be faster to just
download the advertised bundle.

For this reason, I think it would be valuable for the client to tell
the server the intended filter, and the server responds with bundle
URIs that contain a subset of the information that would be provided
by a later fetch request with that filter.

> I.e. the server can start advertising "bundle-uri=shallow", and future
> clients can request arbitrary key-value pairs in addition to just
> "bundle-uri" now.
> 
> Having said that I think that *probably* this is something that'll never
> be implemented, but maybe I'll eat my words there.

You continue focusing on the shallow option, which I agree is not
important. The filter option, specifically --filter=blob:none, seems
to be critical to have a short-term plan for implementing with this
in mind.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Oct. 27, 2021, 5:49 p.m. UTC | #7
On Tue, Oct 26 2021, Derrick Stolee wrote:

> On 10/26/2021 11:00 AM, Ævar Arnfjörð Bjarmason wrote:

[I'll reply to the rest later, either here or in related threads. I.e. I
might end up entirely revamping the config etc. format]

>> There's also the small matter of it being consistent with the
>> packfile-uri config in its current form, but that shouldn't be a reason
>> not to come up with something better. If anything any better suggestion
>> (if we go for that) could be supported by it too...
>
> What do you mean about being consistent with packfile-uri? This layer
> that we care about isn't even implemented in git.git.

It's rather limited, but we do support a uploadpack.BlobPackFileUri as a
server-side feature for upload-pack. I.e.:

    uploadpack.BlobPackFileUri=<OID> <pack-hash> <packfile-uri>

See Documentation/technical/packfile-uri.txt.

The <pack-hash> is part of the protocol, but the <OID> is just an aid to
upload-pack to peel out that OID when it serves up the PACK, the <OID>
being what you get from the URI.

In terms of server implementation it's rather proof-of-concept-ish,
i.e. it's not really all that useful unless your use case is carving out
a small number of really big blobs. JGit's is much more mature, and
there's some patches on-list recently to make the git.git one more
practically useful[1].

1. https://lore.kernel.org/git/cover.1634634814.git.tenglong@alibaba-inc.com/
Ævar Arnfjörð Bjarmason Oct. 27, 2021, 6:01 p.m. UTC | #8
On Wed, Oct 27 2021, Derrick Stolee wrote:

> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Oct 26 2021, Derrick Stolee wrote:
>> 
>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> Add a server-side implementation of a new "bundle-uri" command to
>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>>> allow conforming clients to optionally seed their initial clones or
>>>> incremental fetches from URLs containing "*.bundle" files created with
>>>> "git bundle create".
>>>
>>> ...
>>>
>>>> +DISCUSSION of bundle-uri
>>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> +The intent of the feature is optimize for server resource consumption
>>>> +in the common case by changing the common case of fetching a very
>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>>> +fetch.
>>>> +
>>>> +It also allows servers to achieve better caching in combination with
>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>>> +
>>>> +By having new clones or fetches be a more predictable and common
>>>> +negotiation against the tips of recently produces *.bundle file(s).
>>>> +Servers might even pre-generate the results of such negotiations for
>>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>>> +
>>>> +I.e. the server would anticipate that fresh clones will download a
>>>> +known bundle, followed by catching up to the current state of the
>>>> +repository using ref tips found in that bundle (or bundles).
>>>> +
>>>> +PROTOCOL for bundle-uri
>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>> +
>>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>>> +currently advertise a capability value. Both may be added in the
>>>> +future.
>>>
>>> One thing I realized was missing from this proposal is any interaction
>>> with partial clone. It would be disappointing if we could not advertise
>>> bundles of commit-and-tree packfiles for blobless partial clones.
>>>
>>> There is currently no way for the client to signal the filter type
>>> during this command. Not having any way to extend to include that
>>> seems like an oversight we should remedy before committing to a
>>> protocol that can't be extended.
>>>
>>> (This also seems like a good enough reason to group the URIs into a
>>> struct-like storage, because the filter type could be stored next to
>>> the URI.)
>> 
>> I'll update the docs to note that. I'd definitely like to leave out any
>> implementation of filter/shallow for an initial iteration of this for
>> simplicity, but the protocol keyword/behavior is open-ended enough to
>> permit any extension.
>
> It would be good to be explicit about how this would work. Looking at
> it fresh, it seems that the server could send multiple bundle URIs with
> the extra metadata to say which ones have a filter (and what that filter
> is). The client could then check if a bundle matches the given filter.
>
> But this is a bit inverted: the filter mechanism currently has the client
> request a given filter and the server responds with _at least_ that much
> data. This allows the server to ignore things like pathspec-filters or
> certain size-based filters. If the client just ignores a bundle URI
> because it doesn't match the exact filter, this could lead the client to
> ask for the data without a bundle, even if it would be faster to just
> download the advertised bundle.
>
> For this reason, I think it would be valuable for the client to tell
> the server the intended filter, and the server responds with bundle
> URIs that contain a subset of the information that would be provided
> by a later fetch request with that filter.
>
>> I.e. the server can start advertising "bundle-uri=shallow", and future
>> clients can request arbitrary key-value pairs in addition to just
>> "bundle-uri" now.
>> 
>> Having said that I think that *probably* this is something that'll never
>> be implemented, but maybe I'll eat my words there.

I didn't mean to elide past "filter", but was just using "shallow" as a
short-hand for one thing in the "fetch" dialog that a client can mention
that'll impact PACK generation, just like filter.

Having thought about this a bit more, I think it should be an invariant
in any bundle-uri design that the server shouldn't communicate any
side-channel information whatsoever about a bundle it advertises, if
that information can't be discovered in the header of that bundle file.

Mind you, this means throwing out large parts of my current proposed
over-the-wire design, but I think for the better. I.e. the whole
response part where we communicate:

    (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)*
    flush-pkt

Would just become something like:

    (bundle-uri delim-pkt bundle-header? delim-pkt)*
    flush-pkt

I.e. we'd optionally transfer the content of the bundle header (content
up to the first "\n\n") to the client, but *only* ever as a shorthand
for saving the client a roundtrip.

The pointed-to bundle is still 100% the source of truth, and when
retrieving the bundle-uri we'd ignore whatever "bundle-header" we got
earlier (except insofar as we'd like to say emit a warning() if the two
don't match).

(I'd not thought too carefully about these shallow/filter etc. edge
cases, my main intended use-case has been pre-seeding full clones, and
having this feedback to make me think about it is very valuable).

> You continue focusing on the shallow option, which I agree is not
> important. The filter option, specifically --filter=blob:none, seems
> to be critical to have a short-term plan for implementing with this
> in mind.

Per the above this then just becomes a question of "how do we produce a
bundle with those attributes?".

I *think* that currently there isn't a way to do that, i.e. the PACK
payload of a bundle is the output of "git pack-objects", but due to it
including refs, tips and prerequisites.

I don't think you can say "this bundle has no blobs". The
"prerequisites" hard map to the same thing you could put on a
"want/have" line during PACK negotiation.

I think we could/should fix that, i.e. we can bump the bundle format
version and have it encode some extended prerequisites/filter/shallow
etc information. You'd then have a 1=1 match between the features of
git-upload-pack and what you can transfer via the bundle side-channel.

But the more I think about it, the more strongly I feel that we should
always add that to the bundle *format*, and not as some side-channel
information in this "bundle-uri" protocol keyword.

To me *the* point of this feature is to have servers provide a shorthand
for something that's been a well-established trick you can do today, and
of which there are any number of pre-existing implementations.

I'm not trying to break any new ground here, just make "git
[fetch|clone]" support a well-known trick as a first-class feature via
protocol v2.

I'm not the first person to whip up some custom
"git-clone-via-bundle.sh" that takes bundle URI(s) and a repo URI,
wget's the bundle, calls "git bundle unbundle", updates ref tips, and
then does a "git fetch".

The benefit of making that a first-class protocol feature over a full
negotiation is essentially synonymous with how it's easier in practice
to widely deploy static assets on CDNs v.s. guaranteeing the same
network locality, caching etc. when serving up the same payload by
running a custom binary.

One reason not to add any side-channel information not found in the
bundle header(s) is that we can also guarantee that there won't be any
feature gap between the "transfer.injectBundleURI" config key I've
already got implemented (and is in the earlier RFC version of this
series). I.e. you can do:

    # You can specify this N number of times to inject N bundles
    git clone \
	-c transfer.injectBundleURI="https://something.local/some-repo.bundle" \
        https://example.com/some-repo.git

To inject CDN support to any remote server that doesn't know about
"bundle-uri", or add to the bundles of a server that does. That URI can
even be a file:// if you add "-c fetch.uriProtocols=file".

I realize that all of the above does *not* answer part of your question
about filters, which I think I can accurately rephrase as:

    Ok, so you can dump a static list of bundle URIs from config, but
    that's always going to be a small list, what about the combinatorial
    explosion arbitrary upload-pack options? Filters, shallow,
    include-tag etc.

My main answer to that is that YAGNI. If you need to spew out an URL for
a PACK after a client describes any of its arbitrary wants, needs,
filters etc. you've exactly re-invented what "packfile-uri" is today. I
think that feature is very useful, and I've got no intention of trying
to replace it.

I think the sweet spot for "bundle-uri" is to advertise a small number
of bundles that encompass common clone/fetch patterns. I.e. something
like a bundle for a full clone with the repo data up to today, and maybe
a couple of other bundles covering a time period that clients would be
likely to incrementally update within, e.g. 1 week ago..today &&
yesterday..now.

I agree that adding say "full clone, --depth=1" and "full clone, no
blobs" etc. to that might be *very* useful for some deployments, but per
the above I think we should really add that to the bundle format first,
not protocol v2.
Derrick Stolee Oct. 27, 2021, 7:23 p.m. UTC | #9
On 10/27/2021 2:01 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 27 2021, Derrick Stolee wrote:
> 
>> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Tue, Oct 26 2021, Derrick Stolee wrote:
>>>
>>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>>> Add a server-side implementation of a new "bundle-uri" command to
>>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>>>> allow conforming clients to optionally seed their initial clones or
>>>>> incremental fetches from URLs containing "*.bundle" files created with
>>>>> "git bundle create".
>>>>
>>>> ...
>>>>
>>>>> +DISCUSSION of bundle-uri
>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> +
>>>>> +The intent of the feature is optimize for server resource consumption
>>>>> +in the common case by changing the common case of fetching a very
>>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>>>> +fetch.
>>>>> +
>>>>> +It also allows servers to achieve better caching in combination with
>>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>>>> +
>>>>> +By having new clones or fetches be a more predictable and common
>>>>> +negotiation against the tips of recently produces *.bundle file(s).
>>>>> +Servers might even pre-generate the results of such negotiations for
>>>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>>>> +
>>>>> +I.e. the server would anticipate that fresh clones will download a
>>>>> +known bundle, followed by catching up to the current state of the
>>>>> +repository using ref tips found in that bundle (or bundles).
>>>>> +
>>>>> +PROTOCOL for bundle-uri
>>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>>> +
>>>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>>>> +currently advertise a capability value. Both may be added in the
>>>>> +future.
>>>>
>>>> One thing I realized was missing from this proposal is any interaction
>>>> with partial clone. It would be disappointing if we could not advertise
>>>> bundles of commit-and-tree packfiles for blobless partial clones.
>>>>
>>>> There is currently no way for the client to signal the filter type
>>>> during this command. Not having any way to extend to include that
>>>> seems like an oversight we should remedy before committing to a
>>>> protocol that can't be extended.
>>>>
>>>> (This also seems like a good enough reason to group the URIs into a
>>>> struct-like storage, because the filter type could be stored next to
>>>> the URI.)
>>>
>>> I'll update the docs to note that. I'd definitely like to leave out any
>>> implementation of filter/shallow for an initial iteration of this for
>>> simplicity, but the protocol keyword/behavior is open-ended enough to
>>> permit any extension.
>>
>> It would be good to be explicit about how this would work. Looking at
>> it fresh, it seems that the server could send multiple bundle URIs with
>> the extra metadata to say which ones have a filter (and what that filter
>> is). The client could then check if a bundle matches the given filter.
>>
>> But this is a bit inverted: the filter mechanism currently has the client
>> request a given filter and the server responds with _at least_ that much
>> data. This allows the server to ignore things like pathspec-filters or
>> certain size-based filters. If the client just ignores a bundle URI
>> because it doesn't match the exact filter, this could lead the client to
>> ask for the data without a bundle, even if it would be faster to just
>> download the advertised bundle.
>>
>> For this reason, I think it would be valuable for the client to tell
>> the server the intended filter, and the server responds with bundle
>> URIs that contain a subset of the information that would be provided
>> by a later fetch request with that filter.
>>
>>> I.e. the server can start advertising "bundle-uri=shallow", and future
>>> clients can request arbitrary key-value pairs in addition to just
>>> "bundle-uri" now.
>>>
>>> Having said that I think that *probably* this is something that'll never
>>> be implemented, but maybe I'll eat my words there.
> 
> I didn't mean to elide past "filter", but was just using "shallow" as a
> short-hand for one thing in the "fetch" dialog that a client can mention
> that'll impact PACK generation, just like filter.
> 
> Having thought about this a bit more, I think it should be an invariant
> in any bundle-uri design that the server shouldn't communicate any
> side-channel information whatsoever about a bundle it advertises, if
> that information can't be discovered in the header of that bundle file.
> 
> Mind you, this means throwing out large parts of my current proposed
> over-the-wire design, but I think for the better. I.e. the whole
> response part where we communicate:
> 
>     (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)*
>     flush-pkt
> 
> Would just become something like:
> 
>     (bundle-uri delim-pkt bundle-header? delim-pkt)*
>     flush-pkt
> 
> I.e. we'd optionally transfer the content of the bundle header (content
> up to the first "\n\n") to the client, but *only* ever as a shorthand
> for saving the client a roundtrip.

It still seems like we're better off letting the client request a
filter and have the server present URIs that the client can use,
and the server can choose to ignore the filter or provide URIs that
are specific to that filter. Sending the full list and making the
client decide what it wants seems like it will be more complicated
than necessary.

However, I'll withhold complete judgement until I see a full proposal
in a v2.

> I think the sweet spot for "bundle-uri" is to advertise a small number
> of bundles that encompass common clone/fetch patterns. I.e. something
> like a bundle for a full clone with the repo data up to today, and maybe
> a couple of other bundles covering a time period that clients would be
> likely to incrementally update within, e.g. 1 week ago..today &&
> yesterday..now.
> 
> I agree that adding say "full clone, --depth=1" and "full clone, no
> blobs" etc. to that might be *very* useful for some deployments, but per
> the above I think we should really add that to the bundle format first,
> not protocol v2.
 
I'm focusing my interest in this topic not on "how can we make what we
already do faster?" but "how can we unlock scale not previously
possible?" Allowing blobless clones is an important part of this, in
my opinion, so it is my _default_ mode of operating.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Oct. 27, 2021, 8:22 p.m. UTC | #10
On Wed, Oct 27 2021, Derrick Stolee wrote:

> On 10/27/2021 2:01 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Oct 27 2021, Derrick Stolee wrote:
>> 
>>> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Tue, Oct 26 2021, Derrick Stolee wrote:
>>>>
>>>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>>>> Add a server-side implementation of a new "bundle-uri" command to
>>>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>>>>> allow conforming clients to optionally seed their initial clones or
>>>>>> incremental fetches from URLs containing "*.bundle" files created with
>>>>>> "git bundle create".
>>>>>
>>>>> ...
>>>>>
>>>>>> +DISCUSSION of bundle-uri
>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> +
>>>>>> +The intent of the feature is optimize for server resource consumption
>>>>>> +in the common case by changing the common case of fetching a very
>>>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>>>>> +fetch.
>>>>>> +
>>>>>> +It also allows servers to achieve better caching in combination with
>>>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>>>>> +
>>>>>> +By having new clones or fetches be a more predictable and common
>>>>>> +negotiation against the tips of recently produces *.bundle file(s).
>>>>>> +Servers might even pre-generate the results of such negotiations for
>>>>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>>>>> +
>>>>>> +I.e. the server would anticipate that fresh clones will download a
>>>>>> +known bundle, followed by catching up to the current state of the
>>>>>> +repository using ref tips found in that bundle (or bundles).
>>>>>> +
>>>>>> +PROTOCOL for bundle-uri
>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> +
>>>>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>>>>> +currently advertise a capability value. Both may be added in the
>>>>>> +future.
>>>>>
>>>>> One thing I realized was missing from this proposal is any interaction
>>>>> with partial clone. It would be disappointing if we could not advertise
>>>>> bundles of commit-and-tree packfiles for blobless partial clones.
>>>>>
>>>>> There is currently no way for the client to signal the filter type
>>>>> during this command. Not having any way to extend to include that
>>>>> seems like an oversight we should remedy before committing to a
>>>>> protocol that can't be extended.
>>>>>
>>>>> (This also seems like a good enough reason to group the URIs into a
>>>>> struct-like storage, because the filter type could be stored next to
>>>>> the URI.)
>>>>
>>>> I'll update the docs to note that. I'd definitely like to leave out any
>>>> implementation of filter/shallow for an initial iteration of this for
>>>> simplicity, but the protocol keyword/behavior is open-ended enough to
>>>> permit any extension.
>>>
>>> It would be good to be explicit about how this would work. Looking at
>>> it fresh, it seems that the server could send multiple bundle URIs with
>>> the extra metadata to say which ones have a filter (and what that filter
>>> is). The client could then check if a bundle matches the given filter.
>>>
>>> But this is a bit inverted: the filter mechanism currently has the client
>>> request a given filter and the server responds with _at least_ that much
>>> data. This allows the server to ignore things like pathspec-filters or
>>> certain size-based filters. If the client just ignores a bundle URI
>>> because it doesn't match the exact filter, this could lead the client to
>>> ask for the data without a bundle, even if it would be faster to just
>>> download the advertised bundle.
>>>
>>> For this reason, I think it would be valuable for the client to tell
>>> the server the intended filter, and the server responds with bundle
>>> URIs that contain a subset of the information that would be provided
>>> by a later fetch request with that filter.
>>>
>>>> I.e. the server can start advertising "bundle-uri=shallow", and future
>>>> clients can request arbitrary key-value pairs in addition to just
>>>> "bundle-uri" now.
>>>>
>>>> Having said that I think that *probably* this is something that'll never
>>>> be implemented, but maybe I'll eat my words there.
>> 
>> I didn't mean to elide past "filter", but was just using "shallow" as a
>> short-hand for one thing in the "fetch" dialog that a client can mention
>> that'll impact PACK generation, just like filter.
>> 
>> Having thought about this a bit more, I think it should be an invariant
>> in any bundle-uri design that the server shouldn't communicate any
>> side-channel information whatsoever about a bundle it advertises, if
>> that information can't be discovered in the header of that bundle file.
>> 
>> Mind you, this means throwing out large parts of my current proposed
>> over-the-wire design, but I think for the better. I.e. the whole
>> response part where we communicate:
>> 
>>     (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)*
>>     flush-pkt
>> 
>> Would just become something like:
>> 
>>     (bundle-uri delim-pkt bundle-header? delim-pkt)*
>>     flush-pkt
>> 
>> I.e. we'd optionally transfer the content of the bundle header (content
>> up to the first "\n\n") to the client, but *only* ever as a shorthand
>> for saving the client a roundtrip.
>
> It still seems like we're better off letting the client request a
> filter and have the server present URIs that the client can use,
> and the server can choose to ignore the filter or provide URIs that
> are specific to that filter.[...]

I've tested this a bit now and think there's no way to create such a
bundle currently. I.e. try:

    git clone --filter=blob:none --single-branch --no-tags https://github.com/git/git.git
    cd git
    git config --unset remote.origin.partialclonefilter
    git config --unset remote.origin.promisor

You'll get:
    
    $ GIT_TRACE_PACKET=1 git bundle create --version=3 master.bdl master
    Enumerating objects: 306784, done.
    Counting objects: 100% (306784/306784), done.
    Compressing objects: 100% (69265/69265), done.
    fatal: unable to read c85385dc03228450cb7fb6d306252038a91b47e6
    error: pack-objects died

If you didn't do that config munging we'd create the pack, but it would
be inflated with the blobs (after going back and getting them from the
server).

So aside from any questions of how you'd hypothetically communicate your
desire to get such bundle from the server, I don't think it could serve
one up.

So I think this is moot until the bundle format itself could support
it. I'll need to "git bundle [verify|unbundle]" whatever I get on the
other end.

I really don't mean this in any way as dodging the desirability of this
feature. I'd really like to have it too. I think implementing it should
be relatively simple, and I've got an implementation in mind that makes
this future-proof for anything else we'd like to add.

I.e. if you look at that v3 format bundle you'll see:
    
    $ head -c 100 master.bdl
    # v3 git bundle
    @object-format=sha1
    e9e5ba39a78c8f5057262d49e261b42a8660d5b9 refs/heads/master
    
    PACK

Wouldn't this just be a matter of including extra lines with:

    # I'm assuming that the promisor url can be assumed to be "the url
    # we cloned this from", but maybe we need @remoteURL=https://....
    @promisor=true
    @filter = blob:none

I.e. exactly corresponding to the .git/config we'd end up with,
config. We'd then (I think) create .git/objects/pack/*.promisor with the
OIDs of each of the inflated tips (I'm not familiar with .promisor
files).

And a thing I need to include in the bundle-uri protocol is that the
client should not just include a "bundle-uri" attribute, but have a
"value" describing the bundle format it accepts. I.e. now:

    bundle-uri=v3,object-format

And for supporting the above:

    bundle-uri=v3,object-format,promisor,filter

I.e. currently we die on any bundle capability except "object-format",
if we're going to discover what to send we'd like a less crappy way than
parsing the version from the "agent" field.

> Sending the full list and making the client decide what it wants seems
> like it will be more complicated than necessary. However, I'll
> withhold complete judgement until I see a full proposal in a v2.

I'm very thankful for the thorough review, and it's exciting that you'd
like to use this feature in some form, and I'll definitely do my best to
support (and if not, future-proof) any use-cases you have in mind.

But I really don't get how this wouldn't effectively be functionally
indistinguishable from packfile-uri, sans a nit here and there.

I can see the convenience of having say 100 bundles, advertising 5, and
then after a full negotiation dialog pointing the equivalent of a
packfile-uri at a *.bundle file, just because that's what you happen to
have around already. If bundle-uri is your main static file distribution
you don't want a duplicate *.pack (without the bundle header) just for
that.

I think a logical extension of the packfile-uri feature for those that
need extended negotiation before deciding on the static URL would be to
teach the packfile-uri downloader to ignore an optional bundle header of
any PACK it finds at a URL (which would not be the same as this
proposal), just to support that use-case.

But, isn't that essentially what you'd want in those cases?

Spewing a "here's my bundles" at a client gets it started quickly, and
also has the side-benefit of making those assets more cachable, as well
as creating a known base for the caching of any subsequent "...and a
PACK negotiation to make it fully up-to-date" request.

The bundles are also our de-facto sneakernet and format, and can be used
for incremental replication. All of which is also a sweet spot for
bundle-uri, i.e. the combination of being able to re-use already
"replicated" files for CDN-ing, and providing wider access to CDN
features for "dumb" servers.

But once we're in dialog with a client to discuss its arbitrary filter
preferences before giving it a URL we're going to be most likely
implementing that as a mode of upload-pack.c anyway, and when it spews
out an optional URL at the end of that dialog....

>> I think the sweet spot for "bundle-uri" is to advertise a small number
>> of bundles that encompass common clone/fetch patterns. I.e. something
>> like a bundle for a full clone with the repo data up to today, and maybe
>> a couple of other bundles covering a time period that clients would be
>> likely to incrementally update within, e.g. 1 week ago..today &&
>> yesterday..now.
>> 
>> I agree that adding say "full clone, --depth=1" and "full clone, no
>> blobs" etc. to that might be *very* useful for some deployments, but per
>> the above I think we should really add that to the bundle format first,
>> not protocol v2.
>  
> I'm focusing my interest in this topic not on "how can we make what we
> already do faster?" but "how can we unlock scale not previously
> possible?" Allowing blobless clones is an important part of this, in
> my opinion, so it is my _default_ mode of operating.

*nod*, I'm keen to support it using something like what's described
above. I.e. stick it as new headers in the bundle format, then be able
to advertise those for common cases. I'd think most of these would be a
few combinations (usually something copy/pasted from a relevant dev
guide), e.g. "all history, no blobs", "main branch, no tags, no blobs"
etc.

Aside from this bundle-uri protocol proposal being able to sneakernet a
repo you cloned like that around as-is seems highly desirable, and just
a feature gap we added when those features were added to "git fetch" and
friends, but not "git bundle".
Derrick Stolee Oct. 29, 2021, 6:30 p.m. UTC | #11
On 10/27/2021 4:22 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 27 2021, Derrick Stolee wrote:
>> It still seems like we're better off letting the client request a
>> filter and have the server present URIs that the client can use,
>> and the server can choose to ignore the filter or provide URIs that
>> are specific to that filter.[...]
> 
> I've tested this a bit now and think there's no way to create such a
> bundle currently. I.e. try:
> 
>     git clone --filter=blob:none --single-branch --no-tags https://github.com/git/git.git
>     cd git
>     git config --unset remote.origin.partialclonefilter
>     git config --unset remote.origin.promisor
> 
> You'll get:
>     
>     $ GIT_TRACE_PACKET=1 git bundle create --version=3 master.bdl master
>     Enumerating objects: 306784, done.
>     Counting objects: 100% (306784/306784), done.
>     Compressing objects: 100% (69265/69265), done.
>     fatal: unable to read c85385dc03228450cb7fb6d306252038a91b47e6
>     error: pack-objects died
> 
> If you didn't do that config munging we'd create the pack, but it would
> be inflated with the blobs (after going back and getting them from the
> server).
> 
> So aside from any questions of how you'd hypothetically communicate your
> desire to get such bundle from the server, I don't think it could serve
> one up.
> 
> So I think this is moot until the bundle format itself could support
> it. I'll need to "git bundle [verify|unbundle]" whatever I get on the
> other end.

Thank you for demonstrating that bundles don't currently work with
filters. This would need to be remedied, but can be done independently.

For now, let's just make sure that there is a paved path for extending
whatever is decided here to work with filters in the future.

As for the overall design, I have some thoughts that I'm going to
break out into a new message responding to your cover letter.

Thanks,
-Stolee
Philip Oakley Oct. 30, 2021, 2:51 p.m. UTC | #12
An aside:

On 27/10/2021 19:01, Ævar Arnfjörð Bjarmason wrote:
> I don't think you can say "this bundle has no blobs". The
> "prerequisites" hard map to the same thing you could put on a
> "want/have" line during PACK negotiation.
>
> I think we could/should fix that, i.e. we can 

> bump the bundle format
> version and have it encode some extended prerequisites/filter/shallow
> etc information.
If the format is bumped, could we also include the
HEAD=<particular-branch> info within that format.
The `guess the HEAD` algorithm isn't ideal and shows up in user
questions every now and again.

>  You'd then have a 1=1 match between the features of
> git-upload-pack and what you can transfer via the bundle side-channel.
--
Philip
diff mbox series

Patch

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 21e8258ccf3..4bc15a976cd 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -566,3 +566,212 @@  and associated requested information, each separated by a single space.
 	attr = "size"
 
 	obj-info = obj-id SP obj-size
+
+bundle-uri
+~~~~~~~~~~
+
+If the 'bundle-uri' capability is advertised, the server supports the
+`bundle-uri' command.
+
+The capability is currently advertised with no value (i.e. not
+"bundle-uri=somevalue"), a value may be added in the future for
+supporting command-wide extensions. Clients MUST ignore any unknown
+capability values and proceed with the 'bundle-uri` dialog they
+support.
+
+The 'bundle-uri' command is intended to be issued before `fetch` to
+get URIs to bundle files (see linkgit:git-bundle[1]) to "seed" and
+inform the subsequent `fetch` command.
+
+The client CAN issue `bundle-uri` before or after any other valid
+command. To be useful to clients it's expected that it'll be issued
+after an `ls-refs` and before `fetch`, but CAN be issued at any time
+in the dialog.
+
+DISCUSSION of bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+The intent of the feature is optimize for server resource consumption
+in the common case by changing the common case of fetching a very
+large PACK during linkgit:git-clone[1] into a smaller incremental
+fetch.
+
+It also allows servers to achieve better caching in combination with
+an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
+
+By having new clones or fetches be a more predictable and common
+negotiation against the tips of recently produces *.bundle file(s).
+Servers might even pre-generate the results of such negotiations for
+the `uploadpack.packObjectsHook` as new pushes come in.
+
+I.e. the server would anticipate that fresh clones will download a
+known bundle, followed by catching up to the current state of the
+repository using ref tips found in that bundle (or bundles).
+
+PROTOCOL for bundle-uri
+^^^^^^^^^^^^^^^^^^^^^^^
+
+A `bundle-uri` request takes no arguments, and as noted above does not
+currently advertise a capability value. Both may be added in the
+future.
+
+When the client issues a `command=bundle-uri` the response is a list
+of URIs the server would like the client to fetch out-of-bounds before
+proceeding with the `fetch` request in this format:
+
+	output = bundle-uri-line
+		 bundle-uri-line* flush-pkt
+
+	bundle-uri-line = PKT-LINE(bundle-uri)
+			  *(SP bundle-feature-key *(=bundle-feature-val))
+			  LF
+
+	bundle-uri = A URI such as a https://, ssh:// etc. URI
+
+	bundle-feature-key = Any printable ASCII characters except SP or "="
+	bundle-feature-val = Any printable ASCII characters except SP or "="
+
+No `bundle-feature-key`=`bundle-feature-value` fields are currently
+defined. See the discussion of features below.
+
+Clients are still expected to fully parse the line according to the
+above format, lines that do not conform to the format SHOULD be
+discarded. The user MAY be warned in such a case.
+
+bundle-uri CLIENT AND SERVER EXPECTATIONS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+".bundle" FORMAT
+++++++++++++++++
+
+The advertised bundle(s) MUST be in a format that "git bundle verify"
+would accept. I.e. they MUST contain one or more reference tips for
+use by the client, MUST indicate prerequisites (in any) with standard
+"-" prefixes, and MUST indicate their "object-format", if
+applicable. Create "*.bundle" files with "git bundle create".
+
+bundle-uri CLIENT ERROR RECOVERY
+++++++++++++++++++++++++++++++++
+
+A client MUST above all gracefully degrade on errors, whether that
+error is because of bad missing/data in the bundle URI(s), because
+that client is too dumb to e.g. understand and fully parse out bundle
+headers and their prerequisite relationships, or something else.
+
+Server operators should feel confident in turning on "bundle-uri" and
+not worry if e.g. their CDN goes down that clones or fetches will run
+into hard failures. Even if the server bundle bundle(s) are
+incomplete, or bad in some way the client should still end up with a
+functioning repository, just as if it had chosen not to use this
+protocol extension.
+
+All subsequent discussion on client and server interaction MUST keep
+this in mind.
+
+bundle-uri SERVER TO CLIENT
++++++++++++++++++++++++++++
+
+The ordering of the returned bundle uris is not significant. Clients
+MUST parse their headers to discover their contained OIDS and
+prerequisites. A client MUST consider the content of the bundle(s)
+themselves and their header as the ultimate source of truth.
+
+A server MAY even return bundle(s) that don't have any direct
+relationship to the repository being cloned (either through accident,
+or intentional "clever" configuration), and expect a client to sort
+out what data they'd like from the bundle(s), if any.
+
+bundle-uri CLIENT TO SERVER
++++++++++++++++++++++++++++
+
+The client SHOULD provide reference tips found in the bundle header(s)
+as 'have' lines in any subsequent `fetch` request. A client MAY also
+ignore the bundle(s) entirely if doing so is deemed worse for some
+reason, e.g. if the bundles can't be downloaded, it doesn't like the
+tips it finds etc.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+If after issuing `bundle-uri` and `ls-refs`, and getting the header(s)
+of the bundle(s) the client finds that the ref tips it wants can be
+retrieved entirety from advertised bundle(s), it MAY disconnect. The
+results of such a 'clone' or 'fetch' should be indistinguishable from
+the state attained without using bundle-uri.
+
+EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY
+++++++++++++++++++++++++++++++++++++++++++++++
+
+A client MAY perform an early disconnect while still downloading the
+bundle(s) (having streamed and parsed their headers). In such a case
+the client MUST gracefully recover from any errors related to
+finishing the download and validation of the bundle(s).
+
+I.e. a client might need to re-connect and issue a 'fetch' command,
+and possibly fall back to not making use of 'bundle-uri' at all.
+
+This "MAY" behavior is specified as such (and not a "SHOULD") on the
+assumption that a server advertising bundle uris is more likely than
+not to be serving up a relatively large repository, and to be pointing
+to URIs that have a good chance of being in working order. A client
+MAY e.g. look at the payload size of the bundles as a heuristic to see
+if an early disconnect is worth it, should falling back on a full
+"fetch" dialog be necessary.
+
+WHEN ADVERTISED BUNDLE(S) REQUIRE FURTHER NEGOTIATION
++++++++++++++++++++++++++++++++++++++++++++++++++++++
+
+A client SHOULD commence a negotiation of a PACK from the server via
+the "fetch" command using the OID tips found in advertised bundles,
+even if's still in the process of downloading those bundle(s).
+
+This allows for aggressive early disconnects from any interactive
+server dialog. The client blindly trusts that the advertised OID tips
+are relevant, and issues them as 'have' lines, it then requests any
+tips it would like (usually from the "ls-refs" advertisement) via
+'want' lines. The server will then compute a (hopefully small) PACK
+with the expected difference between the tips from the bundle(s) and
+the data requested.
+
+The only connection the client then needs to keep active is to the
+concurrently downloading static bundle(s), when those and the
+incremental PACK are retrieved they should be inflated and
+validated. Any errors at this point should be gracefully recovered
+from, see above.
+
+bundle-uri PROTOCOL FEATURES
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+As noted above no `bundle-feature-key`=`bundle-feature-value` fields
+are currently defined.
+
+They are intended for future per-URI metadata which older clients MUST
+ignore and gracefully degrade on. Any fields they do recognize they
+CAN also ignore.
+
+Any backwards-incompatible addition of pre-URI key-value will be
+guarded by a new value or values in 'bundle-uri' capability
+advertisement itself, and/or by new future `bundle-uri` request
+arguments.
+
+While no per-URI key-value are currently supported currently they're
+intended to support future features such as:
+
+ * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or
+   size of the bundle file.
+
+ * Advertise that one or more bundle files are the same (to e.g. have
+   clients round-robin or otherwise choose one of N possible files).
+
+ * A "oid=<OID>" shortcut and "prerequisite=<OID>" shortcut. For
+   expressing the common case of a bundle with one tip and no
+   prerequisites, or one tip and one prerequisite.
++
+This would allow for optimizing the common case of servers who'd like
+to provide one "big bundle" containing only their "main" branch,
+and/or incremental updates thereof.
++
+A client receiving such a a response MAY assume that they can skip
+retrieving the header from a bundle at the indicated URI, and thus
+save themselves and the server(s) the request(s) needed to inspect the
+headers of that bundle or bundles.
diff --git a/Makefile b/Makefile
index 381bed2c1d2..e41ac60829d 100644
--- a/Makefile
+++ b/Makefile
@@ -846,6 +846,7 @@  LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
+LIB_OBJS += bundle-uri.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += cbtree.o
diff --git a/bundle-uri.c b/bundle-uri.c
new file mode 100644
index 00000000000..ff054ddc690
--- /dev/null
+++ b/bundle-uri.c
@@ -0,0 +1,55 @@ 
+#include "cache.h"
+#include "bundle-uri.h"
+#include "pkt-line.h"
+#include "config.h"
+
+static void send_bundle_uris(struct packet_writer *writer,
+			     struct string_list *uris)
+{
+	struct string_list_item *item;
+
+	for_each_string_list_item(item, uris)
+		packet_writer_write(writer, "%s", item->string);
+}
+
+static int advertise_bundle_uri = -1;
+static struct string_list bundle_uris = STRING_LIST_INIT_DUP;
+static int bundle_uri_config(const char *var, const char *value, void *data)
+{
+	if (!strcmp(var, "uploadpack.bundleuri")) {
+		advertise_bundle_uri = 1;
+		string_list_append(&bundle_uris, value);
+	}
+
+	return 0;
+}
+
+int bundle_uri_advertise(struct repository *r, struct strbuf *value)
+{
+	if (advertise_bundle_uri != -1)
+		goto cached;
+
+	git_config(bundle_uri_config, NULL);
+	advertise_bundle_uri = !!bundle_uris.nr;
+
+cached:
+	return advertise_bundle_uri;
+}
+
+int bundle_uri_command(struct repository *r,
+		       struct packet_reader *request)
+{
+	struct packet_writer writer;
+	packet_writer_init(&writer, 1);
+
+	while (packet_reader_read(request) == PACKET_READ_NORMAL)
+		die(_("bundle-uri: unexpected argument: '%s'"), request->line);
+	if (request->status != PACKET_READ_FLUSH)
+		die(_("bundle-uri: expected flush after arguments"));
+
+	send_bundle_uris(&writer, &bundle_uris);
+
+	packet_writer_flush(&writer);
+
+	return 0;
+}
diff --git a/bundle-uri.h b/bundle-uri.h
new file mode 100644
index 00000000000..b8762e6a8e4
--- /dev/null
+++ b/bundle-uri.h
@@ -0,0 +1,14 @@ 
+#ifndef BUNDLE_URI_H
+#define BUNDLE_URI_H
+
+struct repository;
+struct packet_reader;
+struct packet_writer;
+
+/**
+ * API used by serve.[ch].
+ */
+int bundle_uri_advertise(struct repository *r, struct strbuf *value);
+int bundle_uri_command(struct repository *r, struct packet_reader *request);
+
+#endif /* BUNDLE_URI_H */
diff --git a/serve.c b/serve.c
index b3fe9b5126a..f3e0203d2c6 100644
--- a/serve.c
+++ b/serve.c
@@ -8,6 +8,7 @@ 
 #include "protocol-caps.h"
 #include "serve.h"
 #include "upload-pack.h"
+#include "bundle-uri.h"
 
 static int advertise_sid = -1;
 static int client_hash_algo = GIT_HASH_SHA1;
@@ -136,6 +137,11 @@  static struct protocol_capability capabilities[] = {
 		.advertise = always_advertise,
 		.command = cap_object_info,
 	},
+	{
+		.name = "bundle-uri",
+		.advertise = bundle_uri_advertise,
+		.command = bundle_uri_command,
+	},
 };
 
 void protocol_v2_advertise_capabilities(void)
diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh
index 1896f671cb3..9d053f77a93 100755
--- a/t/t5701-git-serve.sh
+++ b/t/t5701-git-serve.sh
@@ -13,7 +13,7 @@  test_expect_success 'test capability advertisement' '
 	wrong_algo sha1:sha256
 	wrong_algo sha256:sha1
 	EOF
-	cat >expect <<-EOF &&
+	cat >expect.base <<-EOF &&
 	version 2
 	agent=git/$(git version | cut -d" " -f3)
 	ls-refs=unborn
@@ -21,8 +21,11 @@  test_expect_success 'test capability advertisement' '
 	server-option
 	object-format=$(test_oid algo)
 	object-info
+	EOF
+	cat >expect.trailer <<-EOF &&
 	0000
 	EOF
+	cat expect.base expect.trailer >expect &&
 
 	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
 		--advertise-capabilities >out &&
@@ -342,4 +345,123 @@  test_expect_success 'basics of object-info' '
 	test_cmp expect actual
 '
 
+# Test the basics of bundle-uri
+#
+test_expect_success 'test capability advertisement with uploadpack.bundleURI' '
+	test_config uploadpack.bundleURI FAKE &&
+
+	cat >expect.extra <<-EOF &&
+	bundle-uri
+	EOF
+	cat expect.base \
+	    expect.extra \
+	    expect.trailer >expect &&
+
+	GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \
+		--advertise-capabilities >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: dies if not enabled' '
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	cat >expect <<-\EOF &&
+	ERR serve: invalid command '"'"'bundle-uri'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with single URI' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: enabled with two URIs' '
+	test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl &&
+	test_config uploadpack.bundleURI https://cdn.example.com/recent.bdl --add &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0000
+	EOF
+
+	cat >expect <<-EOF &&
+	https://cdn.example.com/repo.bdl
+	https://cdn.example.com/recent.bdl
+	0000
+	EOF
+
+	test-tool serve-v2 --stateless-rpc <in >out &&
+	test-tool pkt-line unpack <out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'basics of bundle-uri: unknown future feature(s)' '
+	test_config uploadpack.bundleURI https://cdn.example.com/fake.bdl &&
+
+	test-tool pkt-line pack >in <<-EOF &&
+	command=bundle-uri
+	object-format=$(test_oid algo)
+	0001
+	some-feature
+	we-do-not
+	know=about
+	0000
+	EOF
+
+	cat >err.expect <<-\EOF &&
+	fatal: bundle-uri: unexpected argument: '"'"'some-feature'"'"'
+	EOF
+
+	test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual &&
+	test_cmp err.expect err.actual &&
+	test_must_be_empty out
+'
+
 test_done