[0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation

Message ID: cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com

Message

Ævar Arnfjörð Bjarmason Oct. 25, 2021, 9:25 p.m. UTC
This implements a new "bundle-uri" protocol v2 extension, which allows
servers to advertise *.bundle files from which clients can pre-seed a
full "clone" or an incremental "fetch".

This is both an alternative and a complement to the existing
"packfile-uri" mechanism, i.e. servers and/or clients can pick one or
both, but would generally pick one over the other.

This "bundle-uri" mechanism has the advantage of being dumber, and
offloads more complexity from the server side to the client
side.

Unlike with packfile-uri, a conforming server doesn't need to produce
a PACK that (hopefully, otherwise there's not much point) excludes
OIDs that it knows it'll provide via a packfile-uri.

To the server a "bundle-uri" negotiation looks the same as a "normal"
one; the client just happens to provide OIDs it found in bundles as
"have" lines.
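
For illustration (a sketch, with a made-up OID): after unbundling an
advertised bundle whose tip is e9e5ba39... the subsequent dialog is
just an ordinary fetch, with the bundle's tips showing up as "have"
lines:

    client: command=fetch
    client: want <tip-from-the-server's-ref-advertisement>
    client: have e9e5ba39a78c8f5057262d49e261b42a8660d5b9
    client: done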

In my WIP client patches I even have a (trivial to implement) mode
where a client can choose to pretend that a server reported that a
given set of bundle URIs can be used to pre-seed its "clone" or
"fetch".

A client can thus use a CDN it controls to optimistically pre-seed a
clone from a server that knows nothing about "bundle-uri", sort of
like a "git clone --reference <path> --dissociate", except with a
<uri> instead of a <path>.

Need to re-clone a bunch of large repositories on CI boxes from
git.example.com, but git.example.com doesn't support "bundle-uri", and
you've got a slow outbound connection? Just point to a pre-seeding CDN
you control.
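
E.g. (a hypothetical invocation; the option name here is made up, and
the WIP patches may spell this differently):

    git clone --bundle-uri=https://ci-cdn.example.com/repo.bundle \
        https://git.example.com/repo.git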

There are disadvantages to this over packfile-uri: JGit has a mature
implementation of packfile-uri, and I doubt that e.g. Google will ever
want to use this, since that feature was tailor-made for their use
case.

E.g. a repository that has a *.pack sitting on disk can't re-use it
and stream it out with sendfile() as it could with a "packfile-uri";
instead it would need to point to some duplicate of that data in
*.bundle form (or generate a header for the *.pack on the fly).

The goal of this feature isn't to win over packfile-uri users, but to
give users who wouldn't consider packfile-uri, due to its tight
coupling, access to CDN offloading.

The optimistic error recovery of "bundle-uri" and the looser coupling
between server and CDN mean that it should be easy to use this in
cases where the CDN is something like, say, Debian's mirror network.

We're coming up on 2.34.0-rc0, so this certainly won't be in 2.34.0,
but I'm submitting this now per discussion during '#git-devel' standup
today.

There was a discussion on the RFC version of the larger series of
patches to implement this "bundle-uri"[1].

I've updated the protocol-v2.txt changes in 2/3 a lot in response to
that, in particular I've specified and implemented early client
disconnection behavior, so bundle-uri SHOULD never cause
client<->server dialog to hang (at most we'll need to re-connect, if
we need to fall back from a failed bundle-uri).

1. https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

Ævar Arnfjörð Bjarmason (3):
  leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak
  protocol v2: specify static seeding of clone/fetch via "bundle-uri"
  bundle-uri client: add "bundle-uri" parsing + tests

 Documentation/technical/protocol-v2.txt | 209 ++++++++++++++++++++++++
 Makefile                                |   2 +
 bundle-uri.c                            | 179 ++++++++++++++++++++
 bundle-uri.h                            |  30 ++++
 serve.c                                 |   6 +
 t/helper/test-bundle-uri.c              |  83 ++++++++++
 t/helper/test-tool.c                    |   1 +
 t/helper/test-tool.h                    |   1 +
 t/t5701-git-serve.sh                    | 125 +++++++++++++-
 t/t5750-bundle-uri-parse.sh             | 153 +++++++++++++++++
 10 files changed, 788 insertions(+), 1 deletion(-)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100755 t/t5750-bundle-uri-parse.sh

Comments

Derrick Stolee Oct. 29, 2021, 6:46 p.m. UTC | #1
On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
> This implements a new "bundle-uri" protocol v2 extension, which allows
> servers to advertise *.bundle files from which clients can pre-seed a
> full "clone" or an incremental "fetch".
> 
> This is both an alternative and a complement to the existing
> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
> both, but would generally pick one over the other.
> 
> This "bundle-uri" mechanism has the advantage of being dumber, and
> offloads more complexity from the server side to the client
> side.

Generally, I like that using bundles presents an easier way to serve
static content from an alternative source and then let Git's fetch
negotiation catch up with the remainder.

However, after inspecting your design and talking to some GitHub
engineers who know more about CDNs and general internet things than I
do, I want to propose an alternative design. I think this new design
is simultaneously more flexible and promotes further decoupling of
the origin Git server and the bundle contents.

Your proposed design extends protocol v2 to let the client request a
list of bundle URIs from the origin server. However, this still requires
the origin server to know about this list. Further, your implementation
focuses on the server side without integrating with the client.

I propose that we flip this around. The "bundle server" should know
which bundles are available at which URIs, and the client should contact
the bundle server directly for a "table of contents" that lists these
URIs, along with metadata related to each URI. The origin Git server
then would only need to store the list of bundle servers and the URIs
to their table of contents. The client could then pick from among those
bundle servers (probably by ping time, or randomly) to start the bundle
downloads.

To summarize, there are two pieces here, that can be implemented at
different times:

1. Create a specification for a "bundle server" that doesn't need to
   speak the Git protocol at all. This could be a REST API specification
   using well-established standards such as JSON for the table of
   contents (see the sketch after this list).

2. Create a way for the origin Git server to advertise known bundle
   servers to clients so they can automatically benefit from faster
   downloads without needing to know about bundle servers.
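
As a rough sketch (the field names here are illustrative, not a spec;
"timestamp" and "requires" are discussed further below), such a table
of contents could look like:

    {
      "bundles": [
        { "uri": "https://bundles.example.com/git/base.bundle",
          "timestamp": 1633046400 },
        { "uri": "https://bundles.example.com/git/2021-10.bundle",
          "timestamp": 1635176400,
          "requires": "base.bundle" }
      ]
    }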

There are a few key benefits to this approach:

 * Further decoupling. The origin Git server doesn't need to know how
   the bundle server organizes its bundles. This allows maximum flexibility
   depending on whether the bundles are stored in something like a CDN
   (where bundles can't be too big) or some kind of blob storage (where
   they can have arbitrarily large size).

 * The bundle servers could be run completely independently from the
   origin Git server. Organizations could run their own bundle servers to
   host data in the same building as their build farms. As long as they
   can configure the bundle location at clone/fetch time, the origin Git
   server doesn't need to be involved.

While I didn't go so far as to create a clear standard or implement a
prototype in the Git codebase, I created a very simple prototype [1] using
a python script that parses a JSON table of contents and downloads
bundles into the Git repository. Then, I made a 'clone.sh' script that
initializes a repository using the bundle fetcher and fetches the
remainder from the origin Git server. I even computed static bundles for
the git.git repository based on where 'master' has been over several days
in the past month, to give an example of incremental bundles. You can
test the approach all the way through, including the fetch to github.com (note
how the GitHub servers were not modified in any way for this).

[1] https://github.com/derrickstolee/bundles

There are a lot of limitations to the prototype, but it hopefully
demonstrates the possibility of using something other than the Git protocol
to solve these problems.
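
In the same spirit, a minimal sketch of the idea (not the prototype
itself, and assuming a table of contents shaped like the example
above):

    import json
    import subprocess
    import urllib.request

    def seed_from_bundles(toc_uri, repo_dir):
        # Download the JSON table of contents, then fetch and unbundle
        # each listed bundle into an existing (possibly empty) repo.
        with urllib.request.urlopen(toc_uri) as resp:
            toc = json.load(resp)
        for entry in toc["bundles"]:
            path, _ = urllib.request.urlretrieve(entry["uri"])
            # "git bundle unbundle" indexes the objects and prints the
            # ref tips the bundle contained; a real client would record
            # those tips for the follow-up fetch negotiation.
            subprocess.run(
                ["git", "-C", repo_dir, "bundle", "unbundle", path],
                check=True)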

Let me know if you are interested in switching your approach to something
more like what I propose here. There are many more questions about what
information could/should be located in the table of contents and how it can
be extended in the future. I'm interested to explore that space with you.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Oct. 30, 2021, 7:21 a.m. UTC | #2
On Fri, Oct 29 2021, Derrick Stolee wrote:

> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> This implements a new "bundle-uri" protocol v2 extension, which allows
>> servers to advertise *.bundle files from which clients can pre-seed a
>> full "clone" or an incremental "fetch".
>> 
>> This is both an alternative and a complement to the existing
>> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
>> both, but would generally pick one over the other.
>> 
>> This "bundle-uri" mechanism has the advantage of being dumber, and
>> offloads more complexity from the server side to the client
>> side.
>
> Generally, I like that using bundles presents an easier way to serve
> static content from an alternative source and then let Git's fetch
> negotiation catch up with the remainder.
>
> However, after inspecting your design and talking to some GitHub
> engineers who know more about CDNs and general internet things than I
> do, I want to propose an alternative design. I think this new design
> is simultaneously more flexible and promotes further decoupling of
> the origin Git server and the bundle contents.
>
> Your proposed design extends protocol v2 to let the client request a
> list of bundle URIs from the origin server. However, this still requires
> the origin server to know about this list. [...]

Interesting, more below...

> Further, your implementation focuses on the server side without
> integrating with the client.

Do you mean these 3 patches we're discussing now? Yes, that's the
server-side and protocol specification only, because I figured talking
about just the spec might be helpful.

But as noted in the CL and previously on-list I have a larger set of
patches to implement the client behavior, an old RFC version of that
here (I've since changed some things...):
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

I mean, you commented on those too, so I'm not sure if that's what you
meant, but just for context...

> I propose that we flip this around. The "bundle server" should know
> which bundles are available at which URIs, and the client should contact
> the bundle server directly for a "table of contents" that lists these
> URIs, along with metadata related to each URI. The origin Git server
> then would only need to store the list of bundle servers and the URIs
> to their table of contents. The client could then pick from among those
> bundle servers (probably by ping time, or randomly) to start the bundle
> downloads.

I hadn't considered the server not advertising the list, but pointing to
another URI that has the list. I was thinking that the server would be
close enough to whatever's generating the list that updating the list
there wouldn't be a meaningful limitation for anyone.

But you seem to have a use-case for it; I'd be curious to hear why
specifically, but in any case that's easy to support in the client
patches I have.

There's a point at which we get the list of URIs from the server; to
support your case the server would just advertise the one TOC URI.

Then, similarly to the "packfile-uri" special-case of handling a
*.bundle instead of a PACK that I noted in [1], the downloader would
just spot "oh, this isn't a bundle, but a list of URIs", then fetch
those (even recursively), and eventually get to *.bundle files.

> To summarize, there are two pieces here, that can be implemented at
> different times:
>
> 1. Create a specification for a "bundle server" that doesn't need to
>    speak the Git protocol at all. This could be a REST API specification
>    using well-established standards such as JSON for the table of
>    contents.
>
> 2. Create a way for the origin Git server to advertise known bundle
>    servers to clients so they can automatically benefit from faster
>    downloads without needing to know about bundle servers.
>
> There are a few key benefits to this approach:
>
>  * Further decoupling. The origin Git server doesn't need to know how
>    the bundle server organizes its bundles. This allows maximum flexibility
>    depending on whether the bundles are stored in something like a CDN
>    (where bundles can't be too big) or some kind of blob storage (where
>    they can have arbitrarily large size).
>
>  * The bundle servers could be run completely independently from the
>    origin Git server. Organizations could run their own bundle servers to
>    host data in the same building as their build farms. As long as they
>    can configure the bundle location at clone/fetch time, the origin Git
>    server doesn't need to be involved.
>
> While I didn't go so far as to create a clear standard or implement a
> prototype in the Git codebase, I created a very simple prototype [1] using
> a python script that parses a JSON table of contents and downloads
> bundles into the Git repository. Then, I made a 'clone.sh' script that
> initializes a repository using the bundle fetcher and fetches the
> remainder from the origin Git server. I even computed static bundles for
> the git.git repository based on where 'master' has been over several days
> in the past month, to give an example of incremental bundles. You can
> test the approach all the way through, including the fetch to github.com (note
> how the GitHub servers were not modified in any way for this).
>
> [1] https://github.com/derrickstolee/bundles
>
> There are a lot of limitations to the prototype, but it hopefully
> demonstrates the possibility of using something other than the Git protocol
> to solve these problems.

In your proposal the bundle server itself doesn't need to speak the
git protocol.

But as soon as we specify such a thing, all of that becomes a part of
the git protocol at large in every meaningful way, i.e. git.git's
client and any other client that wants to implement the full protocol
would now need to understand not only pkt-line but also ship a JSON
decoder etc.

I don't see an inherent problem with us wanting to support some nested
encoding format as part of the protocol, but JSON seems like a
particularly bad choice. It's specified as UTF-8 only (or rather, "a
Unicode encoding"), so you can't stick both valid UTF-8 and arbitrary
binary data into it.

Our refs on the other hand don't conform to that, so having a JSON
format means you can never have something that refers to refnames;
that's odd given that we're talking about bundles, whose own header
already has that information.

> Let me know if you are interested in switching your approach to something
> more like what I propose here. There are many more questions about what
> information could/should be located in the table of contents and how it can
> be extended in the future. I'm interested to explore that space with you.

As noted above, the TOC part of this seems interesting, and I don't see
a reason not to implement that.

But as noted in [1] I really don't see why it would be a good idea to
implement a side-format that's encoding a limited subset of what you'd
find in bundle headers.

Specifically on the meta-information you're proposing:

== requires

In your example you've added a monolithic "requires" relationship
between bundles, saying "This assumes that the bundles can be ordered".

But that's not something you can assume for actual bundle files,
i.e. the prerequisite relationship is per-reftip: it's not the case
that a given bundle requires another bundle, it's that the tips found
in it may or may not depend on other prerequisites.

If you're creating bundles that contain only one tip there's a 1:1
mapping to what you're proposing with "requires", but otherwise there
isn't.
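
For reference, a v2 bundle header (made-up OIDs) with two tips; the
"-" lines are the prerequisite objects assumed to exist in the
receiving repository, and which tips actually depend on them is a
property of the history behind each tip, not of some bundle-level
ordering:

    # v2 git bundle
    -0123456789abcdef0123456789abcdef01234567
    89abcdef0123456789abcdef0123456789abcdef refs/heads/master
    fedcba9876543210fedcba9876543210fedcba98 refs/heads/next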

== timestamp

"This allows us to reexamine the table of contents and only download the
bundles that are newer than that timestamp."

We're usually going to be fetching these over http(s), so why
duplicate what you can already get if the server just takes care to
create unique filenames (e.g. as a function of the SHA of their
contents) and then provides appropriate caching headers so that
clients will cache them forever?
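
E.g., an illustrative response for such a content-addressed bundle:

    HTTP/1.1 200 OK
    Content-Type: application/octet-stream
    Cache-Control: public, max-age=31536000, immutable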

I think that gives you everything you'd like out of the "timestamp" and
more, the "more" being that since it's part of a protocol that's already
standard you'd have e.g. intermediate caching proxies understanding this
implicitly, in addition to the git client itself.

So on a network that's, say, locally unpacking https connections to a
public CDN, you could have a local caching proxy for your N local
clients, as opposed to a custom "timestamp" value, which only each
local git client will understand.

== Generally

Sorry, I've got to run, so I haven't addressed all the things you
brought up, but generally I think that the TOC idea is a good one.

I don't see a reason why most/all of the other bits shouldn't be
leaning into either the bundle header (and for any TOC shortcut, dump
it as-is, as noted in [1]), or in the case of "timestamp" into the
properties of the transport protocol.

And just generally on overall protocol complexity, wouldn't it be OK if
any such TOC is just in pkt-line format?

We could just provide a git plumbing tool to spew that out, and having
some static server job call that once and ever after serve up a plain
file doesn't seem like a big restriction, and would mean that any git
client code wouldn't need to deal with another encoding format.

1. https://lore.kernel.org/git/211027.86a6iuxk3x.gmgdl@evledraar.gmail.com/
Derrick Stolee Nov. 1, 2021, 9 p.m. UTC | #3
On 10/30/2021 3:21 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Oct 29 2021, Derrick Stolee wrote:
> 
>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>> This implements a new "bundle-uri" protocol v2 extension, which allows
>>> servers to advertise *.bundle files from which clients can pre-seed a
>>> full "clone" or an incremental "fetch".
>>>
>>> This is both an alternative and a complement to the existing
>>> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
>>> both, but would generally pick one over the other.
>>>
>>> This "bundle-uri" mechanism has the advantage of being dumber, and
>>> offloads more complexity from the server side to the client
>>> side.
>>
>> Generally, I like that using bundles presents an easier way to serve
>> static content from an alternative source and then let Git's fetch
>> negotiation catch up with the remainder.
>>
>> However, after inspecting your design and talking to some GitHub
>> engineers who know more about CDNs and general internet things than I
>> do, I want to propose an alternative design. I think this new design
>> is simultaneously more flexible and promotes further decoupling of
>> the origin Git server and the bundle contents.
>>
>> Your proposed design extends protocol v2 to let the client request a
>> list of bundle URIs from the origin server. However, this still requires
>> the origin server to know about this list. [...]
> 
> Interesting, more below...
> 
>> Further, your implementation focuses on the server side without
>> integrating with the client.
> 
> Do you mean these 3 patches we're discussing now? Yes, that's the
> server-side and protocol specification only, because I figured talking
> about just the spec might be helpful.
> 
> But as noted in the CL and previously on-list I have a larger set of
> patches to implement the client behavior, an old RFC version of that
> here (I've since changed some things...):
> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> 
> I mean, you commented on those too, so I'm not sure if that's what you
> meant, but just for context...

Yeah, I'm not able to keep all of that in my head, and I focused on
what you presented in this thread.

>> I propose that we flip this around. The "bundle server" should know
>> which bundles are available at which URIs, and the client should contact
>> the bundle server directly for a "table of contents" that lists these
>> URIs, along with metadata related to each URI. The origin Git server
>> then would only need to store the list of bundle servers and the URIs
>> to their table of contents. The client could then pick from among those
>> bundle servers (probably by ping time, or randomly) to start the bundle
>> downloads.
> 
> I hadn't considered the server not advertising the list, but pointing to
> another URI that has the list. I was thinking that the server would be
> close enough to whatever's generating the list that updating the list
> there wouldn't be a meaningful limitation for anyone.
> 
> But you seem to have a use-case for it; I'd be curious to hear why
> specifically, but in any case that's easy to support in the client
> patches I have.

Show me the client patches and then I can determine if I think that
is sufficiently flexible.

In general, I want to expand the scope of this feature beyond "bundles
on a CDN" and towards "alternative sources of Git object data" which
_could_ be a CDN, but could also be geodistributed HTTP servers that
manage their own copy of the Git data (periodically fetching from the
origin). These could be self-hosted by organizations with needs for
low-latency, high-throughput downloads of object data.

For a concrete example, a group with a build farm could create their
own bundle server on the same LAN as the build machines, and they could
mirror whatever Git service they want. This requires the users setting
up the server and telling the machines about the URL for the table of
contents at clone/fetch time.

By having the origin Git server advertise the table of contents, hosts
such as GitHub, GitLab, and others could have their own CDN solutions
that clients discover automatically. This is clearly the environment
you are targeting. Allowing a redirection to another table of contents
further decouples what is responsible for the bundle organization away
from the origin Git server.

One thing neither of us has touched is authentication, so we'll want
to find out how to access private information securely in this model.
Authentication doesn't matter for CDNs, but would matter for other
hosting models.

> There's a point at which we get the list of URIs from the server; to
> support your case the server would just advertise the one TOC URI.
> 
> Then, similarly to the "packfile-uri" special-case of handling a
> *.bundle instead of a PACK that I noted in [1], the downloader would
> just spot "oh, this isn't a bundle, but a list of URIs", then fetch
> those (even recursively), and eventually get to *.bundle files.

This recursive "follow the contents" approach seems to be a nice
general approach. I would still want to have some kind of specification
about what could be seen at these URIs before modifying the protocol v2
specification.

>> To summarize, there are two pieces here, that can be implemented at
>> different times:
>>
>> 1. Create a specification for a "bundle server" that doesn't need to
>>    speak the Git protocol at all. This could be a REST API specification
>>    using well-established standards such as JSON for the table of
>>    contents.
>>
>> 2. Create a way for the origin Git server to advertise known bundle
>>    servers to clients so they can automatically benefit from faster
>>    downloads without needing to know about bundle servers.
>>
>> There are a few key benefits to this approach:
>>
>>  * Further decoupling. The origin Git server doesn't need to know how
>>    the bundle server organizes its bundles. This allows maximum flexibility
>>    depending on whether the bundles are stored in something like a CDN
>>    (where bundles can't be too big) or some kind of blob storage (where
>>    they can have arbitrarily large size).
>>
>>  * The bundle servers could be run completely independently from the
>>    origin Git server. Organizations could run their own bundle servers to
>>    host data in the same building as their build farms. As long as they
>>    can configure the bundle location at clone/fetch time, the origin Git
>>    server doesn't need to be involved.
>>
>> While I didn't go so far as to create a clear standard or implement a
>> prototype in the Git codebase, I created a very simple prototype [1] using
>> a python script that parses a JSON table of contents and downloads
>> bundles into the Git repository. Then, I made a 'clone.sh' script that
>> initializes a repository using the bundle fetcher and fetches the
>> remainder from the origin Git server. I even computed static bundles for
>> the git.git repository based on where 'master' has been over several days
>> in the past month, to give an example of incremental bundles. You can
>> test the approach all the way through, including the fetch to github.com (note
>> how the GitHub servers were not modified in any way for this).
>>
>> [1] https://github.com/derrickstolee/bundles
>>
>> There are a lot of limitations to the prototype, but it hopefully
>> demonstrates the possibility of using something other than the Git protocol
>> to solve these problems.
> 
> In your proposal the bundle server itself doesn't need to speak the
> git protocol.

Yes. I see this as a HUGE opportunity for flexibility.

> But as soon as we specify such a thing, all of that becomes a part of
> the git protocol at large in every meaningful way, i.e. git.git's
> client and any other client that wants to implement the full protocol
> would now need to understand not only pkt-line but also ship a JSON
> decoder etc.

I use JSON as my example because it is easy to implement in other
languages; it's easy to use in practically any language other than C.

The reason to use something like JSON is that it already encodes a way
to include optional, structured information in the list of results. I
focused on using that instead of specifics of a pkt-line protocol.

You have some flexibility in your protocol, but it's unclear how the
optional data would work without careful testing and actually building
a way to store and communicate the optional data.

> I don't see an inherent problem with us wanting to support some nested
> encoding format as part of the protocol, but JSON seems like a
> particularly bad choice. It's specified as UTF-8 only (or rather, "a
> Unicode encoding"), so you can't stick both valid UTF-8 and arbitrary
> binary data into it.
> 
> Our refs on the other hand don't conform to that, so having a JSON
> format means you can never have something that refers to refnames;
> that's odd given that we're talking about bundles, whose own header
> already has that information.

As I imagine it, we won't want the bundles to store real refnames,
anyway, since we just need pointers into the commit history to start
the incremental fetch of the real refs after the bundles are downloaded.
The prototype I made doesn't even store the bundled refs in "refs/heads"
or "refs/remotes".

>> Let me know if you are interested in switching your approach to something
>> more like what I propose here. There are many more questions about what
>> information could/should be located in the table of contents and how it can
>> be extended in the future. I'm interested to explore that space with you.
> 
> As noted above, the TOC part of this seems interesting, and I don't see
> a reason not to implement that.
> 
> But as noted in [1] I really don't see why it would be a good idea to
> implement a side-format that's encoding a limited subset of what you'd
> find in bundle headers.

You say "bundle headers" a lot as if we are supposed to download and
examine the start of the bundle file before completing the download.
Is that what you mean? Or, do you mean that somehow you are communicating
the bundle header in your list of URIs and I just missed it?

If you do mean that the bundle header _must_ be included in the table
of contents, then I think that is a blocker for scaling this approach,
since the bundle header includes the 'prerequisite' and 'reference'
records, which could be substantial. At scale, info/refs already takes
too much data to want to do often, let alone multiple times in a single
request.

I think the table of contents should provide enough information for the
client to decide if they should initiate the download at all, but be
flexible to multiple strategies for organizing the data.

> Specifically on the meta-information you're proposing:
> 
> == requires
> == timestamp

Part of the point of my proposal was to show how things can work in a
different way, especially with a flexible format like JSON. One possible
way to organize bundles is in a linear list of files that form snapshots
of the history at a given timestamp, storing the new objects since the
previous bundle. A client could store a local timestamp as a heuristic
for whether they need to download a bundle, but if they are missing
reachable objects, then the 'requires' tag gives them a bundle that
could fill in those gaps (they might need to follow a list until closing
all the gaps, depending on many factors).

So, that's the high-level "one could organize bundles like this" plan.
It is _not_ intended as the only way to do it, but I also believe it is
the most efficient. It's also why things like "requires" and "timestamp"
are intended to be optional metadata.

> == requires
> In your example you've added a monolithic "requires" relationship
> between bundles, saying "This assumes that the bundles can be ordered".
> 
> But that's not something you can assume for actual bundle files,
> i.e. the prerequisite relationship is per-reftip: it's not the case
> that a given bundle requires another bundle, it's that the tips found
> in it may or may not depend on other prerequisites.

No, not all lists of bundles can be ordered. For instance, if you
use the "here is where the 'main' branch was for each month, minus
the previous month" as one list of bundles and then another "here is
where the 'dev' branch was for each month, minus the 'dev' for the
previous month and the 'main' branch for the current month" then you
get a partial order instead.

I agree that it is important to allow for full generality of cases like
this, but for large repositories, users will probably just want a
snapshot of all the ref tips.

> If you're creating bundles that contain only one tip there's a 1:1
> mapping to what you're proposing with "requires", but otherwise there
> isn't.
> 
> 
> "This allows us to reexamine the table of contents and only download the
> bundles that are newer than that timestamp."
> 
> We're usually going to be fetching these over http(s), so why
> duplicate what you can already get if the server just takes care to
> create unique filenames (e.g. as a function of the SHA of their
> contents) and then provides appropriate caching headers so that
> clients will cache them forever?

This assumes that a bundle URI will always be available forever, and that
the table of contents will never shift with any future reorganization.
If the snapshot layout that I specified was always additive, then the URI
would be sufficient (although we would need to keep a full list of every
URI ever downloaded), but then a single timestamp would also be sufficient.

The issue arises if the bundle server reorganizes the data somehow, or
worse, the back-end of the bundle server is completely replaced with a
different server that had a different view of the refs at these timestamps.

Now, the 'requires' links provide a way to reconcile missing objects
after downloading only the new bundles, without downloading the full list.

> I think that gives you everything you'd like out of the "timestamp" and
> more, the "more" being that since it's part of a protocol that's already
> standard you'd have e.g. intermediate caching proxies understanding this
> implicitly, in addition to the git client itself.
> 
> So on a network that's, say, locally unpacking https connections to a
> public CDN, you could have a local caching proxy for your N local
> clients, as opposed to a custom "timestamp" value, which only each
> local git client will understand.

I don't understand most of what you are saying here, and perhaps that
is my lack of understanding of possible network services that you are
referencing.

What I'm trying to communicate is that a URI is not enough information
to make a decision about whether or not the Git client should start
downloading that data. Providing clues about the bundle content can
be helpful.

In talking with some others about this, they were thinking about
advertising the ref tips at the bundle boundaries. This would be a
list of "have"s and "have not"s as OIDs that were used to generate
the bundle. However, in my opinion, this only works if you are focused
on snapshots of a single ref instead of a large set of moving refs
(think thousands of refs). In that environment, timestamps are rather
effective so it's nice to have the option.

I'm also not saying that you need to implement an organization such
as the one I'm proposing. I am only strongly recommending that you
build it with enough generality that it is possible.

> == Generally
> 
> Sorry, I've got to run, so I haven't addressed all the things you
> brought up, but generally I think that the TOC idea is a good one.
> 
> I don't see a reason why most/all of the other bits shouldn't be
> leaning into either the bundle header (and for any TOC shortcut, dump
> it as-is, as noted in [1]), or in the case of "timestamp" into the
> properties of the transport protocol.
> 
> And just generally on overall protocol complexity, wouldn't it be OK if
> any such TOC is just in pkt-line format?

The complexity either lives in the code that parses a well-known format
or the design of a format that is sufficiently general to handle extra
metadata that we could use for the future. Pick your poison.

> We could just provide a git plumbing tool to spew that out, and having
> some static server job call that once and ever after serve up a plain
> file doesn't seem like a big restriction, and would mean that any git
> client code wouldn't need to deal with another encoding format.

I agree that it would be good to have a Git command that prepares bundles
for publishing on a static webserver, complete with table of contents.
It would be especially good if this incrementally updated the table with
new bundles as time goes on.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Nov. 1, 2021, 11:18 p.m. UTC | #4
On Mon, Nov 01 2021, Derrick Stolee wrote:

> On 10/30/2021 3:21 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Fri, Oct 29 2021, Derrick Stolee wrote:
>> 
>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>> This implements a new "bundle-uri" protocol v2 extension, which allows
>>>> servers to advertise *.bundle files from which clients can pre-seed a
>>>> full "clone" or an incremental "fetch".
>>>>
>>>> This is both an alternative and a complement to the existing
>>>> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
>>>> both, but would generally pick one over the other.
>>>>
>>>> This "bundle-uri" mechanism has the advantage of being dumber, and
>>>> offloads more complexity from the server side to the client
>>>> side.
>>>
>>> Generally, I like that using bundles presents an easier way to serve
>>> static content from an alternative source and then let Git's fetch
>>> negotiation catch up with the remainder.
>>>
>>> However, after inspecting your design and talking to some GitHub
>>> engineers who know more about CDNs and general internet things than I
>>> do, I want to propose an alternative design. I think this new design
>>> is simultaneously more flexible and promotes further decoupling of
>>> the origin Git server and the bundle contents.
>>>
>>> Your proposed design extends protocol v2 to let the client request a
>>> list of bundle URIs from the origin server. However, this still requires
>>> the origin server to know about this list. [...]
>> 
>> Interesting, more below...
>> 
>>> Further, your implementation focuses on the server side without
>>> integrating with the client.
>> 
>> Do you mean these 3 patches we're discussing now? Yes, that's the
>> server-side and protocol specification only, because I figured talking
>> about just the spec might be helpful.
>> 
>> But as noted in the CL and previously on-list I have a larger set of
>> patches to implement the client behavior, an old RFC version of that
>> here (I've since changed some things...):
>> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
>> 
>> I mean, you commented on those too, so I'm not sure if that's what you
>> meant, but just for context...
>
> Yeah, I'm not able to keep all of that in my head, and I focused on
> what you presented in this thread.

[...]

>>> I propose that we flip this around. The "bundle server" should know
>>> which bundles are available at which URIs, and the client should contact
>>> the bundle server directly for a "table of contents" that lists these
>>> URIs, along with metadata related to each URI. The origin Git server
>>> then would only need to store the list of bundle servers and the URIs
>>> to their table of contents. The client could then pick from among those
>>> bundle servers (probably by ping time, or randomly) to start the bundle
>>> downloads.
>> 
>> I hadn't considered the server not advertising the list, but pointing to
>> another URI that has the list. I was thinking that the server would be
>> close enough to whatever's generating the list that updating the list
>> there wouldn't be a meaningful limitation for anyone.
>> 
>> But you seem to have a use-case for it; I'd be curious to hear why
>> specifically, but in any case that's easy to support in the client
>> patches I have.
>
> Show me the client patches and then I can determine if I think that
> is sufficiently flexible.

I'm working on that larger re-roll, hoping to have something to submit
by the end of the week. I've snipped out much of the below, since it's
probably better on my part to reply to it with working code than just
continued discussion...

> In general, I want to expand the scope of this feature beyond "bundles
> on a CDN" and towards "alternative sources of Git object data" which
> _could_ be a CDN, but could also be geodistributed HTTP servers that
> manage their own copy of the Git data (periodically fetching from the
> origin). These could be self-hosted by organizations with needs for
> low-latency, high-throughput downloads of object data.

*Nod*. FWIW I share those goal[...]

> One thing neither of us has touched is authentication, so we'll want
> to find out how to access private information securely in this model.
> Authentication doesn't matter for CDNs, but would matter for other
> hosting models.

We've got authentication already for packfile-uri in the form of bearer
tokens, and I was planning to do the same (there are recent related
patches on-list to strip out those URLs during logging for that reason).

I was planning to handle it the same way, i.e. a server implementation
would be responsible for spewing out a bundle-uri that's
authenticated/non-public.

In practice I think that something which just allows generating the
bundle-uri list via a hook would be flexible enough for both this &
some of the other use-cases you've had in mind, as long as we pass the
protocol-specific auth info down to it in some way.
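
Sketching that (the hook name, environment variable and interface here
are all invented for illustration, nothing like this exists yet): the
serving side might run a hook and advertise whatever it prints, e.g.:

    $ GIT_AUTH_TOKEN=<bearer-token> ./hooks/list-bundle-uris
    https://cdn.example.com/private/18719edd.bdl?token=<short-lived>
    https://cdn.example.com/private/9cfaf0ef.bdl?token=<short-lived>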

>> There's a point at which we get the list of URIs from the server; to
>> support your case the server would just advertise the one TOC URI.
>> 
>> Then, similarly to the "packfile-uri" special-case of handling a
>> *.bundle instead of a PACK that I noted in [1], the downloader would
>> just spot "oh, this isn't a bundle, but a list of URIs", then fetch
>> those (even recursively), and eventually get to *.bundle files.
>
> This recursive "follow the contents" approach seems to be a nice
> general approach. I would still want to have some kind of specification
> about what could be seen at these URIs before modifying the protocol v2
> specification.

*nod*, of course.

> [...]
> Yes. I see this as a HUGE opportunity for flexibility.
>
>> But as soon as we specify such a thing, all of that becomes a part of
>> the git protocol at large in every meaningful way, i.e. git.git's
>> client and any other client that wants to implement the full protocol
>> would now need to understand not only pkt-line but also ship a JSON
>> decoder etc.
>
> I use JSON as my example because it is easy to implement in other
> languages; it's easy to use in practically any language other than C.
>
> The reason to use something like JSON is that it already encodes a way
> to include optional, structured information in the list of results. I
> focused on using that instead of specifics of a pkt-line protocol.
>
> You have some flexibility in your protocol, but it's unclear how the
> optional data would work without careful testing and actually building
> a way to store and communicate the optional data.

Yes, it's definitely not as out-of-the-box extensible as a nested
structure like JSON, but if we leave in space for arbitrary key-values
(which I'm planning) then we should be able to extend it in the
future; a key/value could even be serialized data of some sort...

>> I don't see an inherent problem with us wanting to support some nested
>> encoding format as part of the protocol, but JSON seems like a
>> particularly bad choice. It's specified as UTF-8 only (or rather, "a
>> Unicode encoding"), so you can't stick both valid UTF-8 and arbitrary
>> binary data into it.
>> 
>> Our refs on the other hand don't conform to that, so having a JSON
>> format means you can never have something that refers to refnames;
>> that's odd given that we're talking about bundles, whose own header
>> already has that information.
>
> As I imagine it, we won't want the bundles to store real refnames,
> anyway, since we just need pointers into the commit history to start
> the incremental fetch of the real refs after the bundles are downloaded.
> The prototype I made doesn't even store the bundled refs in "refs/heads"
> or "refs/remotes".

*nod*, FWIW I've got off-list patches to make that use-case much easier
with "git bundle", i.e. teaching it to take <oid>\t<refname>\n... on
--stdin, not just <oid>\n..., so you can pick arbitrary names.
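
I.e. something like this (a hypothetical invocation; the \t<refname>
part is what those off-list patches add):

    printf '%s\t%s\n' \
        e9e5ba39a78c8f5057262d49e261b42a8660d5b9 refs/snapshots/2021-10 |
        git bundle create snapshot.bdl --stdin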

>>> Let me know if you are interested in switching your approach to something
>>> more like what I propose here. There are many more questions about what
>>> information could/should be located in the table of contents and how it can
>>> be extended in the future. I'm interested to explore that space with you.
>> 
>> As noted above, the TOC part of this seems interesting, and I don't see
>> a reason not to implement that.
>> 
>> But as noted in [1] I really don't see why it would be a good idea to
>> implement a side-format that's encoding a limited subset of what you'd
>> find in bundle headers.
>
> You say "bundle headers" a lot as if we are supposed to download and
> examine the start of the bundle file before completing the download.
> Is that what you mean? Or, do you mean that somehow you are communicating
> the bundle header in your list of URIs and I just missed it?

It's something I've changed my mind on mid-way through this RFC, but
yes, as described in
https://lore.kernel.org/git/211027.86a6iuxk3x.gmgdl@evledraar.gmail.com/
I was originally thinking of a design like

    client: bundle-uri
    server: https://example.com/bundle2.bdl
    server: https://example.com/bundle1.bdl optional key=values

But am now thinking of/working on something like:

    client: command=bundle-uri
    client: v3
    client: object-format
    client: want=headers <pkt-line-flush>
    server: https://example.com/bundle2.bdl <pkt-line-delim>
    server: # v3 git bundle
    server: @object-format=sha1
    server: e9e5ba39a78c8f5057262d49e261b42a8660d5b9 refs/heads/master <pkt-line-delim>
    server: https://example.com/bundle1.bdl <pkt-line-delim> [...]

I.e. a client requests bundle-uris with N arguments; those communicate
what sort of bundles it's able to understand, and that it would like the
server to give it headers as a shortcut, if it can.

The server replies with a list of URIs to bundles (or TOCs etc.), and
optionally, as a shortcut to the client, includes the headers of those
bundles.

It could also simply skip those, but then the client will need to go and
fetch the headers from the pointed-to resources, or decide it doesn't
need to.

> If you do mean that the bundle header _must_ be included in the table
> of contents, then I think that is a blocker for scaling this approach,
> since the bundle header includes the 'prerequisite' and 'reference'
> records, which could be substantial. At scale, info/refs already takes
> too much data to want to do often, let alone multiple times in a single
> request.

We'll see when I submit the updated working patches for this and come
up with test cases, but I think that you can attain a sweet spot in
most repositories by advertising some of your common/recent tips, as
opposed to a big dump of all the refs.

> [...]
> Part of the point of my proposal was to show how things can work in a
> different way, especially with a flexible format like JSON. One possible
> way to organize bundles is in a linear list of files that form snapshots
> of the history at a given timestamp, storing the new objects since the
> previous bundle. A client could store a local timestamp as a heuristic
> for whether they need to download a bundle, but if they are missing
> reachable objects, then the 'requires' tag gives them a bundle that
> could fill in those gaps (they might need to follow a list until closing
> all the gaps, depending on many factors).
>
> So, that's the high-level "one could organize bundles like this" plan.
> It is _not_ intended as the only way to do it, but I also believe it is
> the most efficient. It's also why things like "requires" and "timestamp"
> are intended to be optional metadata.

[more below]

> [...]
>> If you're creating bundles that contain only one tip there's a 1:1
>> mapping to what you're proposing with "requires", but otherwise there
>> isn't.
>> 
>> 
>> "This allows us to reexamine the table of contents and only download the
>> bundles that are newer than that timestamp."
>> 
>> We're usually going to be fetching these over http(s), so why
>> duplicate what you can already get if the server just takes care to
>> create unique filenames (e.g. as a function of the SHA of their
>> contents) and then provides appropriate caching headers so that
>> clients will cache them forever?
>
> This assumes that a bundle URI will always be available forever, and that
> the table of contents will never shift with any future reorganization.
> If the snapshot layout that I specified was always additive, then the URI
> would be sufficient (although we would need to keep a full list of every
> URI ever downloaded), but then a single timestamp would also be sufficient.
>
> The issue arises if the bundle server reorganizes the data somehow, or
> worse, the back-end of the bundle server is completely replaced with a
> different server that had a different view of the refs at these timestamps.
>
> Now, the 'requires' links provide a way to reconcile missing objects
> after downloading only the new bundles, without downloading the full list.

[also on this]

>> I think that gives you everything you'd like out of the "timestamp" and
>> more, the "more" being that since it's part of a protocol that's already
>> standard you'd have e.g. intermediate caching proxies understanding this
>> implicitly, in addition to the git client itself.
>> 
>> So on a network that's, say, locally unpacking https connections to a
>> public CDN, you could have a local caching proxy for your N local
>> clients, as opposed to a custom "timestamp" value, which only each
>> local git client will understand.
>
> I don't understand most of what you are saying here, and perhaps that
> is my lack of understanding of possible network services that you are
> referencing.
>
> What I'm trying to communicate is that a URI is not enough information
> to make a decision about whether or not the Git client should start
> downloading that data. Providing clues about the bundle content can
> be helpful.
>
> In talking with some others about this, they were thinking about
> advertising the ref tips at the bundle boundaries. This would be a
> list of "have"s and "have not"s as OIDs that were used to generate
> the bundle. However, in my opinion, this only works if you are focused
> on snapshots of a single ref instead of a large set of moving refs
> (think thousands of refs). In that environment, timestamps are rather
> effective so it's nice to have the option.
>
> I'm also not saying that you need to implement an organization such
> as the one I'm proposing. I am only strongly recommending that you
> build it with enough generality that it is possible.

On the "This assumes that a bundle URI will always be available forever"
& generally on piggy-backing on the HTTP protocol. I mean that you can
just advertise a:

    https://example.com/big-bundle.bdl

I.e. a URL that never changes, and if you serve it up with appropriate
caching headers a client may or may not request it, e.g. a "clone"
probably will every time, but we MAY also just ignore it based on some
other client heuristic.

But then let's say you serve up (opaque SHA1s of the content) with
appropriate Cache-Control headers[1][2]:

    https://example.com/18719eddecbdf01d6c4166402d62e178482d83d4.bdl
    https://example.com/9cfaf0ef69c3c3024ff5fe92ba84bf7f6caefa2a.bdl

Now a client only needs to grab those once, and if the server operator
has set Cache-Control appropriately we'll only need to request the
header for each resource once, e.g. for a use-case of a "big bundle"
updated monthly, plus some weekly updates etc.

One reason I highly prefer this sort of approach is that it works well
out of the box with other software.

Want to make your local git clients faster? Just pipe those URLs into
wget, and if you've got a squid/varnish or other http cache sitting in
front of your clients doing that will pre-seed your local cache.
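
E.g. (assuming the advertised URLs have been saved to a file):

    wget --input-file=bundle-urls.txt --directory-prefix=/var/cache/bundles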

I may also turn out to be wrong, but early tests with
pipelining/streaming the headers are very promising (see e.g. [3] for
the API). I.e. for the common case of N bundles on the same CDN you can
pipeline them all & stream them in parallel, and if you don't like what
you see in some of them you don't need to continue to download the PACK
payload itself.
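
A rough CLI approximation of that (a real client would use the libcurl
multi API as in [3]): fetch just the first 1k of each bundle in
parallel, which will usually cover the whole bundle header:

    curl --parallel --range 0-1023 \
        --output a.hdr https://example.com/18719eddecbdf01d6c4166402d62e178482d83d4.bdl \
        --output b.hdr https://example.com/9cfaf0ef69c3c3024ff5fe92ba84bf7f6caefa2a.bdl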

1. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control
2. https://bitsup.blogspot.com/2016/05/cache-control-immutable.html
3. https://curl.se/libcurl/c/http2-download.html