[v2,1/6] docs: document bundle URI standard

Message ID	d444042dc4dcc1f9b218ca851fcf603a3afce92f.1656535245.git.gitgitgadget@gmail.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <git-owner@kernel.org> Message-Id: <d444042dc4dcc1f9b218ca851fcf603a3afce92f.1656535245.git.gitgitgadget@gmail.com> In-Reply-To: <pull.1248.v2.git.1656535245.gitgitgadget@gmail.com> References: <pull.1248.git.1654545325.gitgitgadget@gmail.com> <pull.1248.v2.git.1656535245.gitgitgadget@gmail.com> Date: Wed, 29 Jun 2022 20:40:40 +0000 Subject: [PATCH v2 1/6] docs: document bundle URI standard Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, me@ttaylorr.com, newren@gmail.com, avarab@gmail.com, dyroneteng@gmail.com, Johannes.Schindelin@gmx.de, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <derrickstolee@github.com> Precedence: bulk From: Derrick Stolee <derrickstolee@github.com>
Series	bundle URIs: design doc and initial git fetch --bundle-uri implementation \| expand [v2,0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation [v2,1/6] docs: document bundle URI standard [v2,2/6] remote-curl: add 'get' capability [v2,3/6] bundle-uri: create basic file-copy logic [v2,4/6] fetch: add --bundle-uri option [v2,5/6] bundle-uri: add support for http(s):// and file:// [v2,6/6] fetch: add 'refs/bundle/' to log.excludeDecoration

Derrick Stolee June 29, 2022, 8:40 p.m. UTC

From: Derrick Stolee <derrickstolee@github.com>

Introduce the idea of bundle URIs to the Git codebase through an
aspirational design document. This document includes the full design
intended to include the feature in its fully-implemented form. This will
take several steps as detailed in the Implementation Plan section.

By committing this document now, it can be used to motivate changes
necessary to reach these final goals. The design can still be altered as
new information is discovered.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/technical/bundle-uri.txt | 464 +++++++++++++++++++++++++
 1 file changed, 464 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

SZEDER Gábor July 18, 2022, 9:20 a.m. UTC | #1

On Wed, Jun 29, 2022 at 08:40:40PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
> new file mode 100644
> index 00000000000..a0230a902a4
> --- /dev/null
> +++ b/Documentation/technical/bundle-uri.txt

> +Implementation Plan
> +-------------------
> +
> +This design document is being submitted on its own as an aspirational
> +document, with the goal of implementing all of the mentioned client
> +features over the course of several patch series. Here is a potential
> +outline for submitting these features:
> +
> +1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
> +   This will include a new `git fetch --bundle-uri` mode for use as the
> +   implementation underneath `git clone`. The initial version here will
> +   expect a single bundle at the given URI.
> +
> +2. Implement the ability to parse a bundle list from a bundle URI and
> +   update the `git fetch --bundle-uri` logic to properly distinguish
> +   between `bundle.mode` options. Specifically design the feature so
> +   that the config format parsing feeds a list of key-value pairs into the
> +   bundle list logic.
> +
> +3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise

s/verb/command/ or s/verb/request/

'bundle-uri' is not a verb, but even if it were, the protocol v2
documentation talks only about commands and requests, but never
mentions a single "verb".

> +   bundle URIs using the key-value pairs. Plug into the existing key-value
> +   input to the bundle list logic. Allow `git clone` to discover these
> +   bundle URIs and bootstrap the client repository from the bundle data.
> +   (This choice is an opt-in via a config option and a command-line
> +   option.)

Matthew John Cheetham July 21, 2022, 12:09 p.m. UTC | #2

I had a few questions and suggestions; below.

On 2022-06-29 21:40, Derrick Stolee via GitGitGadget wrote:> +Assuming a 
`200 OK` response from the server, the content at the URL is
> +inspected. First, Git attempts to parse the file as a bundle file of
> +version 2 or higher. If the file is not a bundle, then the file is parsed
> +as a plain-text file using Git's config parser. The key-value pairs in
> +that config file are expected to describe a list of bundle URIs. If
> +neither of these parse attempts succeed, then Git will report an error to
> +the user that the bundle URI provided erroneous data.
> +
> +Any other data provided by the server is considered erroneous.

I wonder if it may be worth considering adding an optional server 
requirement ("MAY" not "MUST") to provide a `Content-Type` header 
indicating if the response is a bundle list (or bundle) to skip straight 
to parsing the correct type of file? Eg "application/x-git-bundle-list"?

Even the simplest of content servers should be able to set Content-Type. 
If not falling back to 'try parse bundle else try parse bundle list' is 
still OK.

> +bundle.mode::
> +	(Required) This value has one of two values: `all` and `any`. When `all`
> +	is specified, then the client should expect to need all of the listed
> +	bundle URIs that match their repository's requirements. When `any` is
> +	specified, then the client should expect that any one of the bundle URIs
> +	that match their repository's requirements will suffice. Typically, the
> +	`any` option is used to list a number of different bundle servers
> +	located in different geographies.

Do you forsee any future where we'd want or need to specify 'sets' of 
bundles where "all" _of_ "any" particular set would be required?
Eg. there are 3 sets of bundles (A, B, C), and the client would need to 
download all bundles belonging to any of A, B, or C? Where ABC would be 
different geo-distributed sets?

I guess what I'm getting at here is with this design (which I appreciate 
is intentionally flexible), there are several different ways a server 
could direct a client to bundles stored in a nearby geography:

1. Serve an "all" bundle list with geo-located bundle URIs?
2. Serve a global "any" bundle list with each single bundle in a 
different geography (specified by `location`)
3. Serve a single bundle (not a list) with a different

Are any of these going to be preferred over another for potential client 
optimisations?

> +
> +bundle.heuristic::
> +	If this string-valued key exists, then the bundle list is designed to
> +	work well with incremental `git fetch` commands. The heuristic signals
> +	that there are additional keys available for each bundle that help
> +	determine which subset of bundles the client should download.
> +
> +The remaining keys include an `<id>` segment which is a server-designated
> +name for each available bundle.

Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?

> +bundle.<id>.location::
> +	This string value advertises a real-world location from where the bundle
> +	URI is served. This can be used to present the user with an option for
> +	which bundle URI to use or simply as an informative indicator of which
> +	bundle URI was selected by Git. This is only valuable when
> +	`bundle.mode` is `any`.

I assume `location` is just an opaque string that is just used for info 
or display purposes? Does it make sense for other 'display' type strings 
like 'name' or 'message'?

> +Here is an example bundle list using the Git config format:
> +
> +```
> +[bundle]
> +	version = 1
> +	mode = all
> +	heuristic = creationToken
> +
> +[bundle "2022-02-09-1644442601-daily"]
> +	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
> +	timestamp = 1644442601
> +
> +[bundle "2022-02-02-1643842562"]
> +	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
> +	timestamp = 1643842562
> +
> +[bundle "2022-02-09-1644442631-daily-blobless"]
> +	uri = 2022-02-09-1644442631-daily-blobless.bundle
> +	timestamp = 1644442631
> +	filter = blob:none
> +
> +[bundle "2022-02-02-1643842568-blobless"]
> +	uri = /git/git/2022-02-02-1643842568-blobless.bundle
> +	timestamp = 1643842568
> +	filter = blob:none
> +```

Do you mean to use `creationToken` in these examples rather than 
`timestamp`?

> +Advertising Bundle URIs
> +-----------------------
> +
...
> +The client could choose an arbitrary bundle URI as an option _or_ select
> +the URI with best performance by some exploratory checks. It is up to the
> +bundle provider to decide if having multiple URIs is preferable to a
> +single URI that is geodistributed through server-side infrastructure.

Would it make sense for the client to pick the first bundle URI rather 
than an arbitrary one? The server could use information about the 
request (origin IP/geography) to provide a sorted list of URIs by 
physical distance to the client.

I guess if arbitrary is 'random' then this provides some client-side 
load balancing over multiple potential servers too. Interested in your 
thoughts behind what would be 'best practice' for a bundle server here.

> +If the bundle provider does not provide a heuristic, then the client
> +should attempt to inspect the bundle headers before downloading the full
> +bundle data in case the bundle tips already exist in the client
> +repository.

Would this default behaviour also be considered another explicit 
heurisitic option? For example: `bundle.heuristic=default` or `inspect`.
Is it ever likely that the default behaviour would change?

> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
> +  use the `credential.helper` to attempt authentication after the first
> +  `401 Not Authorized` response, but a second such response is a failure.

I'd probably say a 500 is not solvable with a different set of 
credentials, but potentially a retry or just `die`. Do we attempt to 
`credential_fill()` with anything other than a 401 (or maybe 403 or 404) 
elsewhere in Git?

> +* A bundle download during a `git fetch` contains objects already in the
> +  object database. This is probably unavoidable if we are using bundles
> +  for fetches, since the client will almost always be slightly ahead of
> +  the bundle servers after performing its "catch-up" fetch to the remote
> +  server. This extra work is most wasteful when the client is fetching
> +  much more frequently than the server is computing bundles, such as if
> +  the client is using hourly prefetches with background maintenance, but
> +  the server is computing bundles weekly. For this reason, the client
> +  should not use bundle URIs for fetch unless the server has explicitly
> +  recommended it through the `bundle.flags = forFetch` value.

`bundle.flags` is not mentioned elsewhere in this document. Might be 
worth including this, and possible values, with the other key 
definitions above.


> +Implementation Plan
> +-------------------
> +
> +This design document is being submitted on its own as an aspirational
> +document, with the goal of implementing all of the mentioned client
> +features over the course of several patch series. Here is a potential
> +outline for submitting these features:
> +
> +1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
> +   This will include a new `git fetch --bundle-uri` mode for use as the
> +   implementation underneath `git clone`. The initial version here will
> +   expect a single bundle at the given URI.
> +
> +2. Implement the ability to parse a bundle list from a bundle URI and
> +   update the `git fetch --bundle-uri` logic to properly distinguish
> +   between `bundle.mode` options. Specifically design the feature so
> +   that the config format parsing feeds a list of key-value pairs into the
> +   bundle list logic.
> +
> +3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
> +   bundle URIs using the key-value pairs. Plug into the existing key-value
> +   input to the bundle list logic. Allow `git clone` to discover these
> +   bundle URIs and bootstrap the client repository from the bundle data.
> +   (This choice is an opt-in via a config option and a command-line
> +   option.)
> +
> +4. Allow the client to understand the `bundle.flag=forFetch` configuration
> +   and the `bundle.<id>.creationToken` heuristic. When `git clone`
> +   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
> +   client repository to check that bundle URI during later `git fetch <remote>`
> +   commands.
> +
> +5. Allow clients to discover bundle URIs during `git fetch` and configure
> +   a bundle URI for later fetches if `bundle.flag=forFetch`.
> +
> +6. Implement the "inspect headers" heuristic to reduce data downloads when
> +   the `bundle.<id>.creationToken` heuristic is not available.
> +
> +As these features are reviewed, this plan might be updated. We also expect
> +that new designs will be discovered and implemented as this feature
> +matures and becomes used in real-world scenarios.

This plan seems logical to me at least! :-)

--
Thanks,
Matthew

Josh Steadmon July 21, 2022, 9:39 p.m. UTC | #3

On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Introduce the idea of bundle URIs to the Git codebase through an
> aspirational design document. This document includes the full design
> intended to include the feature in its fully-implemented form. This will
> take several steps as detailed in the Implementation Plan section.
> 
> By committing this document now, it can be used to motivate changes
> necessary to reach these final goals. The design can still be altered as
> new information is discovered.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/technical/bundle-uri.txt | 464 +++++++++++++++++++++++++
>  1 file changed, 464 insertions(+)
>  create mode 100644 Documentation/technical/bundle-uri.txt
> 
> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
> new file mode 100644
> index 00000000000..a0230a902a4
> --- /dev/null
> +++ b/Documentation/technical/bundle-uri.txt
> @@ -0,0 +1,464 @@
> +Bundle URIs
> +===========
> +
> +Bundle URIs are locations where Git can download one or more bundles in
> +order to bootstrap the object database in advance of fetching the remaining
> +objects from a remote.
> +
> +One goal is to speed up clones and fetches for users with poor network
> +connectivity to the origin server. Another benefit is to allow heavy users,
> +such as CI build farms, to use local resources for the majority of Git data
> +and thereby reducing the load on the origin server.
> +
> +To enable the bundle URI feature, users can specify a bundle URI using
> +command-line options or the origin server can advertise one or more URIs
> +via a protocol v2 capability.
> +
> +Design Goals
> +------------
> +
> +The bundle URI standard aims to be flexible enough to satisfy multiple
> +workloads. The bundle provider and the Git client have several choices in
> +how they create and consume bundle URIs.
> +
> +* Bundles can have whatever name the server desires. This name could refer
> +  to immutable data by using a hash of the bundle contents. However, this
> +  means that a new URI will be needed after every update of the content.
> +  This might be acceptable if the server is advertising the URI (and the
> +  server is aware of new bundles being generated) but would not be
> +  ergonomic for users using the command line option.
> +
> +* The bundles could be organized specifically for bootstrapping full
> +  clones, but could also be organized with the intention of bootstrapping
> +  incremental fetches. The bundle provider must decide on one of several
> +  organization schemes to minimize client downloads during incremental
> +  fetches, but the Git client can also choose whether to use bundles for
> +  either of these operations.
> +
> +* The bundle provider can choose to support full clones, partial clones,
> +  or both. The client can detect which bundles are appropriate for the
> +  repository's partial clone filter, if any.
> +
> +* The bundle provider can use a single bundle (for clones only), or a
> +  list of bundles. When using a list of bundles, the provider can specify
> +  whether or not the client needs _all_ of the bundle URIs for a full
> +  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
> +  bundle provider to use different URIs for different geographies.
> +
> +* The bundle provider can organize the bundles using heuristics, such as
> +  creation tokens, to help the client prevent downloading bundles it does
> +  not need. When the bundle provider does not provide these heuristics,
> +  the client can use optimizations to minimize how much of the data is
> +  downloaded.
> +
> +* The bundle provider does not need to be associated with the Git server.
> +  The client can choose to use the bundle provider without it being
> +  advertised by the Git server.
> +
> +* The client can choose to discover bundle providers that are advertised
> +  by the Git server. This could happen during `git clone`, during
> +  `git fetch`, both, or neither. The user can choose which combination
> +  works best for them.
> +
> +* The client can choose to configure a bundle provider manually at any
> +  time. The client can also choose to specify a bundle provider manually
> +  as a command-line option to `git clone`.
> +
> +Each repository is different and every Git server has different needs.
> +Hopefully the bundle URI feature is flexible enough to satisfy all needs.
> +If not, then the feature can be extended through its versioning mechanism.
> +
> +Server requirements
> +-------------------
> +
> +To provide a server-side implementation of bundle servers, no other parts
> +of the Git protocol are required. This allows server maintainers to use
> +static content solutions such as CDNs in order to serve the bundle files.
> +
> +At the current scope of the bundle URI feature, all URIs are expected to
> +be HTTP(S) URLs where content is downloaded to a local file using a `GET`
> +request to that URL. The server could include authentication requirements
> +to those requests with the aim of triggering the configured credential
> +helper for secure access. (Future extensions could use "file://" URIs or
> +SSH URIs.)
> +
> +Assuming a `200 OK` response from the server, the content at the URL is
> +inspected. First, Git attempts to parse the file as a bundle file of
> +version 2 or higher. If the file is not a bundle, then the file is parsed
> +as a plain-text file using Git's config parser. The key-value pairs in
> +that config file are expected to describe a list of bundle URIs. If
> +neither of these parse attempts succeed, then Git will report an error to
> +the user that the bundle URI provided erroneous data.
> +
> +Any other data provided by the server is considered erroneous.
> +
> +Bundle Lists
> +------------
> +
> +The Git server can advertise bundle URIs using a set of `key=value` pairs.
> +A bundle URI can also serve a plain-text file in the Git config format
> +containing these same `key=value` pairs. In both cases, we consider this
> +to be a _bundle list_. The pairs specify information about the bundles
> +that the client can use to make decisions for which bundles to download
> +and which to ignore.
> +
> +A few keys focus on properties of the list itself.
> +
> +bundle.version::
> +	(Required) This value provides a version number for the bundle
> +	list. If a future Git change enables a feature that needs the Git
> +	client to react to a new key in the bundle list file, then this version
> +	will increment. The only current version number is 1, and if any other
> +	value is specified then Git will fail to use this file.
> +
> +bundle.mode::
> +	(Required) This value has one of two values: `all` and `any`. When `all`
> +	is specified, then the client should expect to need all of the listed
> +	bundle URIs that match their repository's requirements. When `any` is
> +	specified, then the client should expect that any one of the bundle URIs
> +	that match their repository's requirements will suffice. Typically, the
> +	`any` option is used to list a number of different bundle servers
> +	located in different geographies.
> +
> +bundle.heuristic::
> +	If this string-valued key exists, then the bundle list is designed to
> +	work well with incremental `git fetch` commands. The heuristic signals
> +	that there are additional keys available for each bundle that help
> +	determine which subset of bundles the client should download.

To be clear, the values of `bundle.heuristic` are keys underneath the
following `bundle.<id>` segments?


> +
> +The remaining keys include an `<id>` segment which is a server-designated
> +name for each available bundle.
> +
> +bundle.<id>.uri::
> +	(Required) This string value is the URI for downloading bundle `<id>`.
> +	If the URI begins with a protocol (`http://` or `https://`) then the URI
> +	is absolute. Otherwise, the URI is interpreted as relative to the URI
> +	used for the bundle list. If the URI begins with `/`, then that relative
> +	path is relative to the domain name used for the bundle list. (This use
> +	of relative paths is intended to make it easier to distribute a set of
> +	bundles across a large number of servers or CDNs with different domain
> +	names.)
> +
> +bundle.<id>.filter::
> +	This string value represents an object filter that should also appear in
> +	the header of this bundle. The server uses this value to differentiate
> +	different kinds of bundles from which the client can choose those that
> +	match their object filters.
> +
> +bundle.<id>.creationToken::
> +	This value is a nonnegative 64-bit integer used for sorting the bundles
> +	the list. This is used to download a subset of bundles during a fetch
> +	when `bundle.heuristic=creationToken`.
> +
> +bundle.<id>.location::
> +	This string value advertises a real-world location from where the bundle
> +	URI is served. This can be used to present the user with an option for
> +	which bundle URI to use or simply as an informative indicator of which
> +	bundle URI was selected by Git. This is only valuable when
> +	`bundle.mode` is `any`.
> +
> +Here is an example bundle list using the Git config format:
> +
> +```
> +[bundle]
> +	version = 1
> +	mode = all
> +	heuristic = creationToken
> +
> +[bundle "2022-02-09-1644442601-daily"]
> +	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
> +	timestamp = 1644442601
> +
> +[bundle "2022-02-02-1643842562"]
> +	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
> +	timestamp = 1643842562
> +
> +[bundle "2022-02-09-1644442631-daily-blobless"]
> +	uri = 2022-02-09-1644442631-daily-blobless.bundle
> +	timestamp = 1644442631
> +	filter = blob:none
> +
> +[bundle "2022-02-02-1643842568-blobless"]
> +	uri = /git/git/2022-02-02-1643842568-blobless.bundle
> +	timestamp = 1643842568
> +	filter = blob:none
> +```
> +
> +This example uses `bundle.mode=all` as well as the
> +`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
> +options to present two parallel sets of bundles: one for full clones and
> +another for blobless partial clones.
> +
> +Suppose that this bundle list was found at the URI
> +`https://bundles.example.com/git/git/` and so the two blobless bundles have
> +the following fully-expanded URIs:
> +
> +* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
> +* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
> +
> +Advertising Bundle URIs
> +-----------------------
> +
> +If a user knows a bundle URI for the repository they are cloning, then
> +they can specify that URI manually through a command-line option. However,
> +a Git host may want to advertise bundle URIs during the clone operation,
> +helping users unaware of the feature.
> +
> +The only thing required for this feature is that the server can advertise
> +one or more bundle URIs. This advertisement takes the form of a new
> +protocol v2 capability specifically for discovering bundle URIs.
> +
> +The client could choose an arbitrary bundle URI as an option _or_ select
> +the URI with best performance by some exploratory checks. It is up to the
> +bundle provider to decide if having multiple URIs is preferable to a
> +single URI that is geodistributed through server-side infrastructure.
> +
> +Cloning with Bundle URIs
> +------------------------
> +
> +The primary need for bundle URIs is to speed up clones. The Git client
> +will interact with bundle URIs according to the following flow:
> +
> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
> +   option _or_ the client discovers a bundle list advertised by the
> +   Git server.
> +
> +2. If the downloaded data from a bundle URI is a bundle, then the client
> +   inspects the bundle headers to check that the prerequisite commit OIDs
> +   are present in the client repository. If some are missing, then the
> +   client delays unbundling until other bundles have been unbundled,
> +   making those OIDs present. When all required OIDs are present, the
> +   client unbundles that data using a refspec. The default refspec is
> +   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
> +   are stored so that later `git fetch` negotiations can communicate the
> +   bundled refs as `have`s, reducing the size of the fetch over the Git
> +   protocol. To allow pruning refs from this ref namespace, Git may
> +   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
> +   stale bundle refs can be deleted.
> +
> +3. If the file is instead a bundle list, then the client inspects the
> +   `bundle.mode` to see if the list is of the `all` or `any` form.
> +
> +   a. If `bundle.mode=all`, then the client considers all bundle
> +      URIs. The list is reduced based on the `bundle.<id>.filter` options
> +      matching the client repository's partial clone filter. Then, all
> +      bundle URIs are requested. If the `bundle.<id>.creationToken`
> +      heuristic is provided, then the bundles are downloaded in decreasing
> +      order by the creation token, stopping when a bundle has all required
> +      OIDs. The bundles can then be unbundled in increasing creation token
> +      order. The client stores the latest creation token as a heuristic
> +      for avoiding future downloads if the bundle list does not advertise
> +      bundles with larger creation tokens.
> +
> +   b. If `bundle.mode=any`, then the client can choose any one of the
> +      bundle URIs to inspect. The client can use a variety of ways to
> +      choose among these URIs. The client can also fallback to another URI
> +      if the initial choice fails to return a result.
> +
> +Note that during a clone we expect that all bundles will be required, and
> +heuristics such as `bundle.<uri>.creationToken` can be used to download
> +bundles in chronological order or in parallel.
> +
> +If a given bundle URI is a bundle list with a `bundle.heuristic`
> +value, then the client can choose to store that URI as its chosen bundle
> +URI. The client can then navigate directly to that URI during later `git
> +fetch` calls.
> +
> +When downloading bundle URIs, the client can choose to inspect the initial
> +content before committing to downloading the entire content. This may
> +provide enough information to determine if the URI is a bundle list or
> +a bundle. In the case of a bundle, the client may inspect the bundle
> +header to determine that all advertised tips are already in the client
> +repository and cancel the remaining download.
> +
> +Fetching with Bundle URIs
> +-------------------------
> +
> +When the client fetches new data, it can decide to fetch from bundle
> +servers before fetching from the origin remote. This could be done via a
> +command-line option, but it is more likely useful to use a config value
> +such as the one specified during the clone.
> +
> +The fetch operation follows the same procedure to download bundles from a
> +bundle list (although we do _not_ want to use parallel downloads here). We
> +expect that the process will end when all prerequisite commit OIDs in a
> +thin bundle are already in the object database.
> +
> +When using the `creationToken` heuristic, the client can avoid downloading
> +any bundles if their creation tokenss are not larger than the stored

Typo: tokenss


> +creation token. After fetching new bundles, Git updates this local
> +creation token.
> +
> +If the bundle provider does not provide a heuristic, then the client
> +should attempt to inspect the bundle headers before downloading the full
> +bundle data in case the bundle tips already exist in the client
> +repository.
> +
> +Error Conditions
> +----------------
> +
> +If the Git client discovers something unexpected while downloading
> +information according to a bundle URI or the bundle list found at that
> +location, then Git can ignore that data and continue as if it was not
> +given a bundle URI. The remote Git server is the ultimate source of truth,
> +not the bundle URI.
> +
> +Here are a few example error conditions:
> +
> +* The client fails to connect with a server at the given URI or a connection
> +  is lost without any chance to recover.
> +
> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
> +  use the `credential.helper` to attempt authentication after the first
> +  `401 Not Authorized` response, but a second such response is a failure.
> +
> +* The client receives data that is not parsable as a bundle or bundle list.
> +
> +* The bundle list describes a directed cycle in the
> +  `bundle.<id>.requires` links.
> +
> +* A bundle includes a filter that does not match expectations.
> +
> +* The client cannot unbundle the bundles because the prerequisite commit OIDs
> +  are not in the object database and there are no more
> +  `bundle.<id>.requires` links to follow.

The list above still mentions `.requires` fields, which IIUC have been
removed from the V2 design.


> +There are also situations that could be seen as wasteful, but are not
> +error conditions:
> +
> +* The downloaded bundles contain more information than is requested by
> +  the clone or fetch request. A primary example is if the user requests
> +  a clone with `--single-branch` but downloads bundles that store every
> +  reachable commit from all `refs/heads/*` references. This might be
> +  initially wasteful, but perhaps these objects will become reachable by
> +  a later ref update that the client cares about.
> +
> +* A bundle download during a `git fetch` contains objects already in the
> +  object database. This is probably unavoidable if we are using bundles
> +  for fetches, since the client will almost always be slightly ahead of
> +  the bundle servers after performing its "catch-up" fetch to the remote
> +  server. This extra work is most wasteful when the client is fetching
> +  much more frequently than the server is computing bundles, such as if
> +  the client is using hourly prefetches with background maintenance, but
> +  the server is computing bundles weekly. For this reason, the client
> +  should not use bundle URIs for fetch unless the server has explicitly
> +  recommended it through the `bundle.flags = forFetch` value.

The "Bundle Lists" section above doesn't mention `bundle.flags` at all;
has this been replaced by `bundle.heuristic`?


> +
> +Implementation Plan
> +-------------------
> +
> +This design document is being submitted on its own as an aspirational
> +document, with the goal of implementing all of the mentioned client
> +features over the course of several patch series. Here is a potential
> +outline for submitting these features:
> +
> +1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
> +   This will include a new `git fetch --bundle-uri` mode for use as the
> +   implementation underneath `git clone`. The initial version here will
> +   expect a single bundle at the given URI.
> +
> +2. Implement the ability to parse a bundle list from a bundle URI and
> +   update the `git fetch --bundle-uri` logic to properly distinguish
> +   between `bundle.mode` options. Specifically design the feature so
> +   that the config format parsing feeds a list of key-value pairs into the
> +   bundle list logic.
> +
> +3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
> +   bundle URIs using the key-value pairs. Plug into the existing key-value
> +   input to the bundle list logic. Allow `git clone` to discover these
> +   bundle URIs and bootstrap the client repository from the bundle data.
> +   (This choice is an opt-in via a config option and a command-line
> +   option.)
> +
> +4. Allow the client to understand the `bundle.flag=forFetch` configuration
> +   and the `bundle.<id>.creationToken` heuristic. When `git clone`
> +   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
> +   client repository to check that bundle URI during later `git fetch <remote>`
> +   commands.
> +
> +5. Allow clients to discover bundle URIs during `git fetch` and configure
> +   a bundle URI for later fetches if `bundle.flag=forFetch`.
> +
> +6. Implement the "inspect headers" heuristic to reduce data downloads when
> +   the `bundle.<id>.creationToken` heuristic is not available.
> +
> +As these features are reviewed, this plan might be updated. We also expect
> +that new designs will be discovered and implemented as this feature
> +matures and becomes used in real-world scenarios.
> +
> +Related Work: Packfile URIs
> +---------------------------

Thanks for including this section; my very first question before I even
started reading the doc was "How does this compare to Packfile URIs?".


> +
> +The Git protocol already has a capability where the Git server can list
> +a set of URLs along with the packfile response when serving a client
> +request. The client is then expected to download the packfiles at those
> +locations in order to have a complete understanding of the response.
> +
> +This mechanism is used by the Gerrit server (implemented with JGit) and
> +has been effective at reducing CPU load and improving user performance for
> +clones.
> +
> +A major downside to this mechanism is that the origin server needs to know
> +_exactly_ what is in those packfiles, and the packfiles need to be available
> +to the user for some time after the server has responded. This coupling
> +between the origin and the packfile data is difficult to manage.

It was not immediately clear to me why bundle URIs would avoid these two
downsides, but after thinking about it and discussing it, I believe this
is the reasoning (please correct me if I'm wrong):

Bundle URIs are intended to be fully processed before negotiation
happens, so the server can rely solely on the client's reported Haves /
Wants to determine its response, as usual, and therefore doesn't need to
know the bundle contents. Packfile URIs are not processed by the client
before negotiation, so the server needs to be aware of the packfile
contents in order to determine which additional objects to send, and
can't rely solely on the Haves/Wants from the client.

If that's accurate, then it seems fine that when the bundle URI is
provided by the user on the command-line, Git can download and process
the bundle before even attempting to contact the server. But in future
series, when we add the ability for the server to provide a URI via a
capability, the client will have to pause after seeing the server's
advertised URI, fetch and process the bundle, and then proceed with a
fetch command. This seems fine assuming the bundles can be handled in a
reasonable amount of time. And even if the connection breaks before the
fetch command can be issued, presumably the client would not need to
attempt to re-download the bundle a second time when the client makes a
second connection to re-attempt the fetch.

For the point about the packfiles needing to be available for some time,
this makes sense because the server's response is processed by the
client before the referenced packfile is downloaded, but the response is
incomplete without the packfile. But with bundle URIs, the server's
response doesn't depend on whether or not the client actually processed
the referenced bundle, only on what was negotiated. So the server's
response is still valid and useful even if the client is unable to get
the bundle.


> +
> +Further, this implementation is extremely hard to make work with fetches.
> +
> +Related Work: GVFS Cache Servers
> +--------------------------------
> +
> +The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
> +the Git project before Git's partial clone was created. One feature of this
> +protocol is the idea of a "cache server" which can be colocated with build
> +machines or developer offices to transfer Git data without overloading the
> +central server.
> +
> +The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
> +endpoint, which allows downloading an object on-demand. This is a critical
> +piece of the filesystem virtualization of that product.
> +
> +However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
> +endpoint. Given an optional timestamp, the cache server responds with a list
> +of precomputed packfiles containing the commits and trees that were introduced
> +in those time intervals.
> +
> +The cache server computes these "prefetch" packfiles using the following
> +strategy:
> +
> +1. Every hour, an "hourly" pack is generated with a given timestamp.
> +2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
> +3. Nightly, all prefetch packs more than 30 days old are rolled up into
> +   one pack.
> +
> +When a user runs `gvfs clone` or `scalar clone` against a repo with cache
> +servers, the client requests all prefetch packfiles, which is at most
> +`24 + 30 + 1` packfiles downloading only commits and trees. The client
> +then follows with a request to the origin server for the references, and
> +attempts to checkout that tip reference. (There is an extra endpoint that
> +helps get all reachable trees from a given commit, in case that commit
> +was not already in a prefetch packfile.)
> +
> +During a `git fetch`, a hook requests the prefetch endpoint using the
> +most-recent timestamp from a previously-downloaded prefetch packfile.
> +Only the list of packfiles with later timestamps are downloaded. Most
> +users fetch hourly, so they get at most one hourly prefetch pack. Users
> +whose machines have been off or otherwise have not fetched in over 30 days
> +might redownload all prefetch packfiles. This is rare.
> +
> +It is important to note that the clients always contact the origin server
> +for the refs advertisement, so the refs are frequently "ahead" of the
> +prefetched pack data. The missing objects are downloaded on-demand using
> +the `GET gvfs/objects/{oid}` requests, when needed by a command such as
> +`git checkout` or `git log`. Some Git optimizations disable checks that
> +would cause these on-demand downloads to be too aggressive.
> +
> +See Also
> +--------
> +
> +[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> +    An earlier RFC for a bundle URI feature.
> +
> +[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
> +    The GVFS Protocol
> -- 
> gitgitgadget
>

Derrick Stolee July 22, 2022, 1:15 p.m. UTC | #4

On 7/21/2022 5:39 PM, Josh Steadmon wrote:
> On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>

>> +bundle.heuristic::
>> +	If this string-valued key exists, then the bundle list is designed to
>> +	work well with incremental `git fetch` commands. The heuristic signals
>> +	that there are additional keys available for each bundle that help
>> +	determine which subset of bundles the client should download.
> 
> To be clear, the values of `bundle.heuristic` are keys underneath the
> following `bundle.<id>` segments?

At the moment, the only planned heuristic is associated with a single
key (bundle.<id>.creationToken) but future heuristics could be more
complicated. The general idea is to say "these bundles were organized
with this heuristic in mind, so either take all of them (ignoring the
heuristic) or apply a specific algorithm to download a subset at a
time." The creationToken heuristic uses the total order on the bundles
to do a greedy algorithm for downloading the most-recent bundles until
the required commits are all satisfied.

>> +Related Work: Packfile URIs
>> +---------------------------
> 
> Thanks for including this section; my very first question before I even
> started reading the doc was "How does this compare to Packfile URIs?".
> 
> 
>> +
>> +The Git protocol already has a capability where the Git server can list
>> +a set of URLs along with the packfile response when serving a client
>> +request. The client is then expected to download the packfiles at those
>> +locations in order to have a complete understanding of the response.
>> +
>> +This mechanism is used by the Gerrit server (implemented with JGit) and
>> +has been effective at reducing CPU load and improving user performance for
>> +clones.
>> +
>> +A major downside to this mechanism is that the origin server needs to know
>> +_exactly_ what is in those packfiles, and the packfiles need to be available
>> +to the user for some time after the server has responded. This coupling
>> +between the origin and the packfile data is difficult to manage.
> 
> It was not immediately clear to me why bundle URIs would avoid these two
> downsides, but after thinking about it and discussing it, I believe this
> is the reasoning (please correct me if I'm wrong):
> 
> Bundle URIs are intended to be fully processed before negotiation
> happens, so the server can rely solely on the client's reported Haves /
> Wants to determine its response, as usual, and therefore doesn't need to
> know the bundle contents. Packfile URIs are not processed by the client
> before negotiation, so the server needs to be aware of the packfile
> contents in order to determine which additional objects to send, and
> can't rely solely on the Haves/Wants from the client.
> 
> If that's accurate, then it seems fine that when the bundle URI is
> provided by the user on the command-line, Git can download and process
> the bundle before even attempting to contact the server. But in future
> series, when we add the ability for the server to provide a URI via a
> capability, the client will have to pause after seeing the server's
> advertised URI, fetch and process the bundle, and then proceed with a
> fetch command. This seems fine assuming the bundles can be handled in a
> reasonable amount of time. And even if the connection breaks before the
> fetch command can be issued, presumably the client would not need to
> attempt to re-download the bundle a second time when the client makes a
> second connection to re-attempt the fetch.

After the server has advertised the bundle URIs, the client should
consider disconnecting while downloading bundles, then reconnect to
the server after that is complete.

For stateless connections (https) we need to reconnect every time, so
this isn't a problem.

I anticipate that in the long-term view of this feature, the server
advertised bundle URIs will usually point to an independent bundle
list at a static URI. In that situation, we can store the chosen
bundle list URI in Git config and check that bundle list for new data
before even connecting to the origin Git server. We would only
"rediscover" the URIs when either the bundle URI returns 404 and/or
the origin server lists new URIs.

Of course, there is the option that the origin Git server wants to
track and advertise the individual bundles directly, in which case the
client can't do this opportunistic bundle pre-check.

> For the point about the packfiles needing to be available for some time,
> this makes sense because the server's response is processed by the
> client before the referenced packfile is downloaded, but the response is
> incomplete without the packfile. But with bundle URIs, the server's
> response doesn't depend on whether or not the client actually processed
> the referenced bundle, only on what was negotiated. So the server's
> response is still valid and useful even if the client is unable to get
> the bundle.

You're pointing out the critical feature: the bundles are completely
optional. If the client fails to download the bundles due to something
like this race, then it will just negotiate for the origin Git server
to provide the data.

What's perhaps not entirely fair about this point is that the bundle
provider should keep this race in mind between serving a bundle list
and having the client request the associated bundles. Something like
a 24-hour window between updating the bundle list and deleting an
unreferenced bundle file should satisfy the vast majority of clients.

Thanks for the close look. While I removed them from the context, I've
made note of your smaller comments on things like typos and will fix
them in v3.

Thanks,
-Stolee

Derrick Stolee July 22, 2022, 1:52 p.m. UTC | #5

On 7/21/2022 8:09 AM, Matthew John Cheetham wrote:> I had a few questions and suggestions; below.
> 
> On 2022-06-29 21:40, Derrick Stolee via GitGitGadget wrote:> +Assuming a `200 OK` response from the server, the content at the URL is
>> +inspected. First, Git attempts to parse the file as a bundle file of
>> +version 2 or higher. If the file is not a bundle, then the file is parsed
>> +as a plain-text file using Git's config parser. The key-value pairs in
>> +that config file are expected to describe a list of bundle URIs. If
>> +neither of these parse attempts succeed, then Git will report an error to
>> +the user that the bundle URI provided erroneous data.
>> +
>> +Any other data provided by the server is considered erroneous.
> 
> I wonder if it may be worth considering adding an optional server requirement ("MAY" not "MUST") to provide a `Content-Type` header indicating if the response is a bundle list (or bundle) to skip straight to parsing the correct type of file? Eg "application/x-git-bundle-list"?
> 
> Even the simplest of content servers should be able to set Content-Type. If not falling back to 'try parse bundle else try parse bundle list' is still OK.
This is an interesting idea. We should keep it in mind for a future
extension, since the Git client needs to do some work to parse the
header and communicate that upwards to the logic that parses the file. 
>> +bundle.mode::
>> +    (Required) This value has one of two values: `all` and `any`. When `all`
>> +    is specified, then the client should expect to need all of the listed
>> +    bundle URIs that match their repository's requirements. When `any` is
>> +    specified, then the client should expect that any one of the bundle URIs
>> +    that match their repository's requirements will suffice. Typically, the
>> +    `any` option is used to list a number of different bundle servers
>> +    located in different geographies.
> 
> Do you forsee any future where we'd want or need to specify 'sets' of bundles where "all" _of_ "any" particular set would be required?
> Eg. there are 3 sets of bundles (A, B, C), and the client would need to download all bundles belonging to any of A, B, or C? Where ABC would be different geo-distributed sets?

The bundle.heuristic space is open-ended to allow for different ways to
scan the bundle list and download a subset, so that might be a way to
extend this in the future.

I think what you're hinting at is a single "global" bundle list that knows
about all of the bundles scattered across the world and it groups bundles
by geography, even though the list knows about multiple geos. Is that what
you mean?

> I guess what I'm getting at here is with this design (which I appreciate is intentionally flexible), there are several different ways a server could direct a client to bundles stored in a nearby geography:
> 
> 1. Serve an "all" bundle list with geo-located bundle URIs?
> 2. Serve a global "any" bundle list with each single bundle in a different geography (specified by `location`)
> 3. Serve a single bundle (not a list) with a different

I think this item 3 is incomplete. What were you going to say?

I'll assume for now that you just mean "3. Serve a single bundle (not a
list)."

> Are any of these going to be preferred over another for potential client optimisations?

I could do better in this document of giving clear examples of how a
bundle provider could organize bundles. As the client becomes more
sophisticated, then different organizations become "unlocked" as something
the client can understand and use efficiently. That also sets the stage
for how to add extensions: the client change can be paired with an example
bundle provider organization.

The bundle provider setup that I personally think will work best is:

 1. The origin Git server advertises a bundle list in "any" mode. Each URI
    is a static URI that corresponds to a different geography. The client
    picks the closest one and stores that URI in local config.

 2. Each static URI provides a bundle list in "all" mode with the
    creationToken heuristic. The bundles are sorted by creation time, and
    new bundles are created on roughly a daily basis, but it could be
    a wider time frame if not enough Git data is added every day. The
    creationTokens are added as appending to this order, but after the
    list has some fixed length, the oldest bundles are merged into a single
    bundle. That merged bundle is then assigned the maximum creationToken
    of the bundles used to create it.

This comes from experience in how the GVFS Cache Server prefetch packfiles
are created. To support those very-active repos, there's another layer of
"hourly" packs that are merged into "daily" packs; those daily packs are
merged into a giant "everything old" pack after 30 days. This means that
there is a maximum of 1 + 30 + 23 packs at any given time.

>> +
>> +bundle.heuristic::
>> +    If this string-valued key exists, then the bundle list is designed to
>> +    work well with incremental `git fetch` commands. The heuristic signals
>> +    that there are additional keys available for each bundle that help
>> +    determine which subset of bundles the client should download.
>> +
>> +The remaining keys include an `<id>` segment which is a server-designated
>> +name for each available bundle.
> 
> Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?

I would say "Same as Git config rules" in general, but we could try to be
more strict here, if we want.

>> +bundle.<id>.location::
>> +    This string value advertises a real-world location from where the bundle
>> +    URI is served. This can be used to present the user with an option for
>> +    which bundle URI to use or simply as an informative indicator of which
>> +    bundle URI was selected by Git. This is only valuable when
>> +    `bundle.mode` is `any`.
> 
> I assume `location` is just an opaque string that is just used for info or display purposes? Does it make sense for other 'display' type strings like 'name' or 'message'?

We should definitely give this key another look when we get around to
building the UX around choosing from a list in "any" mode with human-
readable information like this.

>> +Advertising Bundle URIs
>> +-----------------------
>> +
> ...
>> +The client could choose an arbitrary bundle URI as an option _or_ select
>> +the URI with best performance by some exploratory checks. It is up to the
>> +bundle provider to decide if having multiple URIs is preferable to a
>> +single URI that is geodistributed through server-side infrastructure.
> 
> Would it make sense for the client to pick the first bundle URI rather than an arbitrary one? The server could use information about the request (origin IP/geography) to provide a sorted list of URIs by physical distance to the client.
> 
> I guess if arbitrary is 'random' then this provides some client-side load balancing over multiple potential servers too. Interested in your thoughts behind what would be 'best practice' for a bundle server here.

I tend to think of the client as being the "smartest" participant in this
exchange. The origin Git server is just serving lines from its local config
and isn't thinking at all about the client's network topology. If instead
the bundle provider (not the Git server) organizes to have a single URI
listing all of the geographic options, then that server could do smarter
things, for sure. It might be able to do that reordering to take advantage
of the Git client picking the first option (before the client has the
capability to test the connections to all of them for the fastest link).

>> +If the bundle provider does not provide a heuristic, then the client
>> +should attempt to inspect the bundle headers before downloading the full
>> +bundle data in case the bundle tips already exist in the client
>> +repository.
> 
> Would this default behaviour also be considered another explicit heurisitic option? For example: `bundle.heuristic=default` or `inspect`.
> Is it ever likely that the default behaviour would change?

I think that if there is no explicit heuristic, then this is the only way
the client can avoid downloading all of the bundle content.

>> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
>> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
>> +  use the `credential.helper` to attempt authentication after the first
>> +  `401 Not Authorized` response, but a second such response is a failure.
> 
> I'd probably say a 500 is not solvable with a different set of credentials, but potentially a retry or just `die`. Do we attempt to `credential_fill()` with anything other than a 401 (or maybe 403 or 404) elsewhere in Git?

This is poorly worded, and I should group 400-level errors in one bullet
and 500 errors as its own category. The implementation uses the same
"retry with credentials" logic that other HTTPS connections make, so the
logic should be shared there. This document should point to another place
that documents that contract, especially in case it changes in the future.

Thanks!
-Stolee

Derrick Stolee July 22, 2022, 3:01 p.m. UTC | #6

On 7/21/2022 5:39 PM, Josh Steadmon wrote:
> On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>

>> +* A bundle download during a `git fetch` contains objects already in the
>> +  object database. This is probably unavoidable if we are using bundles
>> +  for fetches, since the client will almost always be slightly ahead of
>> +  the bundle servers after performing its "catch-up" fetch to the remote
>> +  server. This extra work is most wasteful when the client is fetching
>> +  much more frequently than the server is computing bundles, such as if
>> +  the client is using hourly prefetches with background maintenance, but
>> +  the server is computing bundles weekly. For this reason, the client
>> +  should not use bundle URIs for fetch unless the server has explicitly
>> +  recommended it through the `bundle.flags = forFetch` value.
> 
> The "Bundle Lists" section above doesn't mention `bundle.flags` at all;
> has this been replaced by `bundle.heuristic`?

You and Matthew both caught this. I first thought that absolutely this
was a replacement I missed. Then I re-checked my future steps and the
very last patch implements `bundle.flags = forFetch`. However, the
reason it exists is for the ability to incrementally fetch even if the
`bundle.heuristic` value is not set. That requires the "download only
the headers" feature, so should be delayed to that implementation.

I'll remove it from the design doc and instead implement the logic to
store the bundle URI for fetching if `bundle.heuristic=creationToken`,
which can be extended later.

Thanks,
-Stolee

Derrick Stolee July 22, 2022, 4:03 p.m. UTC | #7

On 7/21/2022 8:09 AM, Matthew John Cheetham wrote:

>> +The remaining keys include an `<id>` segment which is a server-designated
>> +name for each available bundle.
> 
> Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?

I was thinking "same as Git config rules", but those rules are extremely
flexible, perhaps more than we want them to be. (For instance, the rules
allow specifying "remote.<id>.url" where <id> could be anything except
having a newline or null byte.

Perhaps we should start with "alphanumeric and `-`" like the section
name, just to be extra careful here. We can probably be fine parsing
more flexibly on the client, but better to be restrictive now and
relax later.

Thanks,
-Stolee

[v2,1/6] docs: document bundle URI standard

Commit Message

Comments

Patch