Message ID: e771b2971d092af5ea8a47eb708d03e34b284a0f.1654545325.git.gitgitgadget@gmail.com
State: Superseded
Series: bundle URIs: design doc and initial git fetch --bundle-uri implementation
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <derrickstolee@github.com> > > Introduce the idea of bundle URIs to the Git codebase through an > aspirational design document. This document includes the full design > intended to include the feature in its fully-implemented form. This will > take several steps as detailed in the Implementation Plan section. > > By committing this document now, it can be used to motivate changes > necessary to reach these final goals. The design can still be altered as > new information is discovered. > > Signed-off-by: Derrick Stolee <derrickstolee@github.com> > --- > Documentation/technical/bundle-uri.txt | 475 +++++++++++++++++++++++++ > 1 file changed, 475 insertions(+) > create mode 100644 Documentation/technical/bundle-uri.txt > > diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt > new file mode 100644 > index 00000000000..6657ba079ab > --- /dev/null > +++ b/Documentation/technical/bundle-uri.txt > @@ -0,0 +1,475 @@ > +Bundle URIs > +=========== > + > +Bundle URIs are locations where Git can download one or more bundles in > +order to bootstrap the object database in advance of fetching the remaining > +objects from a remote. > + > +One goal is to speed up clones and fetches for users with poor network > +connectivity to the origin server. Another benefit is to allow heavy users, > +such as CI build farms, to use local resources for the majority of Git data > +and thereby reducing the load on the origin server. > + > +To enable the bundle URI feature, users can specify a bundle URI using > +command-line options or the origin server can advertise one or more URIs > +via a protocol v2 capability. > + > +Design Goals > +------------ > + > +The bundle URI standard aims to be flexible enough to satisfy multiple > +workloads. The bundle provider and the Git client have several choices in > +how they create and consume bundle URIs. 
> + > +* Bundles can have whatever name the server desires. This name could refer > + to immutable data by using a hash of the bundle contents. However, this > + means that a new URI will be needed after every update of the content. > + This might be acceptable if the server is advertising the URI (and the > + server is aware of new bundles being generated) but would not be > + ergonomic for users using the command line option. > + > +* The bundles could be organized specifically for bootstrapping full > + clones, but could also be organized with the intention of bootstrapping > + incremental fetches. The bundle provider must decide on one of several > + organization schemes to minimize client downloads during incremental > + fetches, but the Git client can also choose whether to use bundles for > + either of these operations. > + > +* The bundle provider can choose to support full clones, partial clones, > + or both. The client can detect which bundles are appropriate for the > + repository's partial clone filter, if any. > + > +* The bundle provider can use a single bundle (for clones only), or a > + list of bundles. When using a list of bundles, the provider can specify > + whether or not the client needs _all_ of the bundle URIs for a full > + clone, or if _any_ one of the bundle URIs is sufficient. This allows the > + bundle provider to use different URIs for different geographies. > + > +* The bundle provider can organize the bundles using heuristics, such as > + timestamps or tokens, to help the client prevent downloading bundles it > + does not need. When the bundle provider does not provide these > + heuristics, the client can use optimizations to minimize how much of the > + data is downloaded. > + > +* The bundle provider does not need to be associated with the Git server. > + The client can choose to use the bundle provider without it being > + advertised by the Git server. 
> +
> +* The client can choose to discover bundle providers that are advertised
> +  by the Git server. This could happen during `git clone`, during
> +  `git fetch`, both, or neither. The user can choose which combination
> +  works best for them.
> +
> +* The client can choose to configure a bundle provider manually at any
> +  time. The client can also choose to specify a bundle provider manually
> +  as a command-line option to `git clone`.
> +
> +Each repository is different and every Git server has different needs.
> +Hopefully the bundle URI feature is flexible enough to satisfy all needs.
> +If not, then the feature can be extended through its versioning mechanism.
> +
> +Server requirements
> +-------------------
> +
> +To provide a server-side implementation of bundle servers, no other parts
> +of the Git protocol are required. This allows server maintainers to use
> +static content solutions such as CDNs in order to serve the bundle files.
> +
> +At the current scope of the bundle URI feature, all URIs are expected to
> +be HTTP(S) URLs where content is downloaded to a local file using a `GET`
> +request to that URL. The server could include authentication requirements
> +to those requests with the aim of triggering the configured credential
> +helper for secure access. (Future extensions could use "file://" URIs or
> +SSH URIs.)
> +
> +Assuming a `200 OK` response from the server, the content at the URL is
> +expected to be of one of two forms:
> +
> +1. Bundle: A Git bundle file of version 2 or higher.
> +
> +2. Bundle List: A plain-text file that is parsable using Git's
> +   config file parser. This file describes one or more bundles that are
> +   accessible from other URIs.
> +
> +Any other data provided by the server is considered erroneous.

How does a client tell which one it got? Do we register a mimetype with
IANA to use for these two types of files, and have the HTTP downloader
use the information?
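(For what it's worth, content sniffing alone can distinguish the two forms:
a Git bundle file begins with a signature line such as `# v2 git bundle` or
`# v3 git bundle`. A rough sketch — the helper name is hypothetical, not
Git's actual code:)

```python
def classify_response(data: bytes) -> str:
    """Guess whether downloaded content is a bundle or a bundle list.

    Git bundles start with a signature line ("# v2 git bundle" or
    "# v3 git bundle"); anything else is handed to the config parser
    as a candidate bundle list. Sketch only, not Git's implementation.
    """
    first_line = data.split(b"\n", 1)[0]
    if first_line.startswith(b"# v") and first_line.endswith(b" git bundle"):
        return "bundle"
    return "list"  # config parse is attempted next; if that also fails, error
```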
> +Bundle Lists
> +------------
> +
> +The Git server can advertise bundle URIs using a set of `key=value` pairs.
> +A bundle URI can also serve a plain-text file in the Git config format
> +containing these same `key=value` pairs. In both cases, we consider this
> +to be a _bundle list_. The pairs specify information about the bundles
> +that the client can use to make decisions for which bundles to download
> +and which to ignore.
> +
> +A few keys focus on properties of the list itself.
> +
> +bundle.list.version::
> +    (Required) This value provides a version number for the table of
> +    contents. If a future Git change enables a feature that needs the Git
> +    client to react to a new key in the bundle list file, then this version
> +    will increment. The only current version number is 1, and if any other
> +    value is specified then Git will fail to use this file.
> +
> +bundle.list.mode::
> +    (Required) This value has one of two values: `all` and `any`. When `all`
> +    is specified, then the client should expect to need all of the listed
> +    bundle URIs that match their repository's requirements. When `any` is
> +    specified, then the client should expect that any one of the bundle URIs
> +    that match their repository's requirements will suffice. Typically, the
> +    `any` option is used to list a number of different bundle servers
> +    located in different geographies.

OK. Presumably, if there are two sets of bundles, A and B, that consist
of 3 and 2 bundle files respectively, and either one of these two sets is
sufficient to help the client, then we'd have a bundle.list of type 'any',
with two bundle.<id>.uri that point at (sub) bundle.lists of type 'all' in
which these 3 or 2 bundle files are contained? I am just wondering why we
need 'all' and 'any', and at the same time why these two are sufficient
for our needs.
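(To make the question concrete: the two-set scenario could be written as an
'any' list whose entries point at 'all' sub-lists via the `bundle.<id>.list`
key described below — all URIs here are made up:)

```
[bundle "list"]
    version = 1
    mode = any

[bundle "set-A"]
    uri = https://bundles-a.example.com/list   ; serves an 'all' list of 3 bundles
    list = true

[bundle "set-B"]
    uri = https://bundles-b.example.com/list   ; serves an 'all' list of 2 bundles
    list = true
```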
> +bundle.list.heuristic::
> +    If this string-valued key exists, then the bundle list is designed to
> +    work well with incremental `git fetch` commands. The heuristic signals
> +    that there are additional keys available for each bundle that help
> +    determine which subset of bundles the client should download.

Funny indentation?

> +The remaining keys include an `<id>` segment which is a server-designated
> +name for each available bundle.
> +
> +bundle.<id>.uri::
> +    (Required) This string value is the URI for downloading bundle `<id>`.
> +    If the URI begins with a protocol (`http://` or `https://`) then the URI
> +    is absolute. Otherwise, the URI is interpreted as relative to the URI
> +    used for the bundle list. If the URI begins with `/`, then that relative
> +    path is relative to the domain name used for the bundle list. (This use
> +    of relative paths is intended to make it easier to distribute a set of
> +    bundles across a large number of servers or CDNs with different domain
> +    names.)

I have no objection to a host-relative URI notation, but is it something
we need to spell out here? I am mostly interested in making sure that we
do not deviate from a practice that is already used to point at a
different resource at the same server. If the way we specify
host-relative is unnecessarily different from the way existing "internet"
programs (say, a web browser) work, that would be embarrassing, unless
there is a very good reason for us to be different.

> +bundle.<id>.list::
> +    This boolean value indicates whether the client should expect the
> +    content from this URI to be a list (if `true`) or a bundle (if `false`).
> +    This is typically used when `bundle.list.mode` is `any`.

OK, so the type of a (sub) bundle.list can be specified using this
without having the HTTP(S) server annotate the resource with a mimetype
when the thing gets actually downloaded. It still leaves the issue of
bootstrapping the system.
If the server advises bundle URI when the client contacts, presumably that first-contact bundle.*.uri can be annotated with the bundle.*.list at the same time, but the model allows the client to learn bundles independently from the server, and it still is a bit unclear how we tell. Of course, we can examine the contents of a file that was obtained from a bundle URI, a file that parses correctly as a config-like file is very unlikely to be a valid bundle file, and we need to be prepared to deal with a corrupt resource downloaded from a bundle URI anyway, so this may not be a huge deal. > +bundle.<id>.filter:: > + This string value represents an object filter that should also appear in > + the header of this bundle. The server uses this value to differentiate > + different kinds of bundles from which the client can choose those that > + match their object filters. Is it an error to have .filter defined for a bundle URI whose .list says "true"? Or does it mean all bundle files that make up the list share the same object filter? > +bundle.<id>.timestamp:: > + This value is the number of seconds since Unix epoch (UTC) that this > + bundle was created. This is used as an approximation of a point in time > + that the bundle matches the data available at the origin server. This is > + used when `bundle.list.heuristic=timestamp`. Name of this field should be better than 'timestamp'; we should say timestamp of creation (or last modification if the same name can be reused). > +bundle.<id>.requires:: > + This string value represents the ID of another bundle. When present, the > + server is indicating that this bundle contains a thin packfile. If the > + client does not have all necessary objects to unbundle this packfile, > + then the client can download the bundle with the `requires` ID and try > + again. (Note: it may be beneficial to allow the server to specify > + multiple `requires` bundles.) This is used when > + `bundle.list.heuristic=timestamp`. 
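(A sketch of how a client might walk such a `requires` chain, including a
check for the cycle error condition mentioned later in the document — the
function and dict layout are hypothetical, not Git's implementation:)

```python
def bundles_to_download(bundles, start_id, have_prereqs):
    """Follow bundle.<id>.requires links until prerequisites are met.

    `bundles` maps id -> {"requires": <other id>} (key absent if none);
    `have_prereqs(id)` reports whether the local repository already has
    the objects needed to unbundle that bundle.  If the chain runs out
    before prerequisites are satisfied, the caller treats it as the
    "no more requires links to follow" error condition.
    """
    chain, cur, seen = [], start_id, set()
    while cur is not None:
        if cur in seen:
            raise ValueError("cycle in bundle.<id>.requires links")
        seen.add(cur)
        chain.append(cur)
        if have_prereqs(cur):
            break
        cur = bundles[cur].get("requires")
    return chain  # download in this order, unbundle in reverse
```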
So, bundle.list.mode can say 'any', with three <id>s in it, but
bundle.1.requires can point at '2', while bundle.2.requires can point at
'1', and bundle.3.requires can be empty, in which case you can either
fetch 1&2 or 3 alone. Is that the idea?

> +bundle.<id>.location::
> +    This string value advertises a real-world location from where the bundle
> +    URI is served. This can be used to present the user with an option for
> +    which bundle URI to use. This is only valuable when `bundle.list.mode`
> +    is `any`.

I am afraid I do not follow. Do you mean, by "a real-world location", we
write things like "America/Los_Angeles" and "Asia/Tokyo" in this field,
so people can tell which one is cheaper to get to? Do we make any further
specification to make it machine usable in any way (I suspect machines
would rather measure the latency and throughput against bundle.<id>.uri
and .location is meant purely for human consumption)?

> +Here is an example bundle list using the Git config format:
> +
> +```
> +[bundle "list"]
> +    version = 1
> +    mode = all
> +    heuristic = timestamp

In all mode, how does heuristic help? Doesn't mode=all by definition
require you to grab everything anyway?

> +[bundle "2022-02-09-1644442601-daily"]
> +    uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle

example.com (cf. RFC6761)?
> +    timestamp = 1644442601
> +    requires = 2022-02-02-1643842562
> +
> +[bundle "2022-02-02-1643842562"]
> +    uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
> +    timestamp = 1643842562
> +
> +[bundle "2022-02-09-1644442631-daily-blobless"]
> +    uri = 2022-02-09-1644442631-daily-blobless.bundle
> +    timestamp = 1644442631
> +    requires = 2022-02-02-1643842568-blobless
> +    filter = blob:none
> +
> +[bundle "2022-02-02-1643842568-blobless"]
> +    uri = /git/git/2022-02-02-1643842568-blobless.bundle
> +    timestamp = 1643842568
> +    filter = blob:none
> +```
> +
> +This example uses `bundle.list.mode=all` as well as the
> +`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
> +options to present two parallel sets of bundles: one for full clones and
> +another for blobless partial clones.
> +
> +Suppose that this bundle list was found at the URI
> +`https://bundles.fake.com/git/git/` and so the two blobless bundles have
> +the following fully-expanded URIs:
> +
> +* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
> +* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`

So... is the idea that bundle.list.mode=all does *not* mean "you need to
get all of them"? Rather, you first group bundles with the same filter
attribute, and then for each group with the same filter, you'd need to
grab all of them? IOW, if you are interested in a full clone, you can
ignore <id>'s with a non-empty bundle.<id>.filter and grab all the rest?
If so, then I can see how the design makes sense. I still do not know how
the heuristic kicks in, though.

Another thing I noticed: the above scheme makes it impossible to have an
<id> that happens to be "list". I think the variables that apply to the
entire list should be given two-level names, i.e.

	[bundle]
		version = 1
		mode = all
		heuristic = timestamp

	[bundle "2022-02-09-1644442631-daily"]
		uri = ...
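(The grouping-by-filter reading above can be expressed in a few lines — a
sketch only, with a hypothetical dict layout for the parsed list:)

```python
def select_bundles(bundles, repo_filter=None):
    """From an 'all'-mode bundle list, keep only the entries whose
    bundle.<id>.filter matches the repository's partial clone filter.
    repo_filter=None means a full clone (no filter key expected).
    """
    return {bid: b for bid, b in bundles.items()
            if b.get("filter") == repo_filter}
```

With the four bundles in the example list, a full clone would keep the two
unfiltered bundles and a blob:none partial clone the two blobless ones.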
> +The client could choose an arbitrary bundle URI as an option _or_ select
> +the URI with lowest latency by some exploratory checks.

Some places may have higher latency but great throughput.

The review for the rest of the document will be left for another sitting.

Thanks.
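(One datapoint on the host-relative URI question earlier: the three URI
forms described for `bundle.<id>.uri` — plain relative, leading-`/`
host-relative, and absolute — match standard RFC 3986 reference
resolution, i.e. what a web browser does, as a quick check with Python's
`urllib` shows. The list URI and bundle names are made up:)

```python
from urllib.parse import urljoin

# Standard RFC 3986 resolution against a hypothetical bundle-list URI.
list_uri = "https://bundles.example.com/git/git/list"

# Plain relative path: resolved against the list URI's directory.
assert urljoin(list_uri, "daily.bundle") == \
    "https://bundles.example.com/git/git/daily.bundle"

# Leading '/': resolved against the host only.
assert urljoin(list_uri, "/git/git/base.bundle") == \
    "https://bundles.example.com/git/git/base.bundle"

# Absolute URI: used as-is.
assert urljoin(list_uri, "https://cdn.example.com/x.bundle") == \
    "https://cdn.example.com/x.bundle"
```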
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > +Cloning with Bundle URIs > +------------------------ > + > +The primary need for bundle URIs is to speed up clones. The Git client > +will interact with bundle URIs according to the following flow: > + > +1. The user specifies a bundle URI with the `--bundle-uri` command-line > + option _or_ the client discovers a bundle list advertised by the > + Git server. > + > +2. If the downloaded data from a bundle URI is a bundle, then the client > + inspects the bundle headers to check that the negative commit OIDs are Although "negative" would be understandable to pros, the commits required to unbundle a bundle file are officially called "prerequisite commits" (cf. "git bundle --help"), so that may be easier to understand by ordinary readers. > + present in the client repository. If some are missing, then the client > + delays unbundling until other bundles have been unbundled, making those > + OIDs present. When all required OIDs are present, the client unbundles > + that data using a refspec. The default refspec is > + `+refs/heads/*:refs/bundles/*`, but this can be configured. The refs/bundles/ appear in the document only here, and it is unclear why we even want it (I am assuming this is against gc while "git clone" is still running) or how we are going to retire it, if ever. If there are multiple bundle files involved in this "git clone", to anchor objects that are necessary against "gc", don't we need to use refs/bundles/<i>/* or something like that, where <i> is uniquely assigned number locally? > +3. If the file is instead a bundle list, then the client inspects the > + `bundle.list.mode` to see if the list is of the `all` or `any` form. If the downloaded file is not a bundle (e.g. 
"git bundle list-heads" barfs on it) and it is not parseable with our configuration parser, do we error out, or do we pretend as if that bundle file or the TOC did not exist (if the bundle list with mode=any at the higher level has appropriate alternatives)? > + a. If `bundle.list.mode=all`, then the client considers all bundle > + URIs. The list is reduced based on the `bundle.<id>.filter` options > + matching the client repository's partial clone filter. OK, this answers my earlier question nicely. It probably means that either the presentation order needs a bit of rethinking, or "we group by .filter" needs to be mentioned a lot earlier. > Then, all > + bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic > + is provided, then the bundles are downloaded in reverse- > + chronological order, stopping when a bundle has all required OIDs. Stop as soon as just one bundle has all the prerequisite objects, or should we keep going until all bundles have their prerequisites satisfied? I presume it is the latter. > + The bundles can then be unbundled in chronological order. The client > + stores the latest timestamp as a heuristic for avoiding future > + downloads if the bundle list does not advertise newer bundles. So we see a list, we start grabbing from new to old. Newer ones that are based on older ones may have dependencies, so we do not unbndle until we have all the prerequisites for them. The bundles that satisfy their prerequisites are unbundled---that would give us enough objects to play with. What happens to the refs recorded in them, though? Is the timestamp per the serving host, or per the CDN host that serve us bundle files, or...? I guess it is premature to discuss it here. "git clone" bootstraps from the advertisement made only by a single serving host, so the single newest timestamp among the bundles used from the bundle list is what we store here. 
How that timestamp is used is primarily of interest in future fetching, which would be discussed later. > +Fetching with Bundle URIs > +------------------------- > + > +When the client fetches new data, it can decide to fetch from bundle > +servers before fetching from the origin remote. This could be done via a > +command-line option, but it is more likely useful to use a config value > +such as the one specified during the clone. > + > +The fetch operation follows the same procedure to download bundles from a > +bundle list (although we do _not_ want to use parallel downloads here). We > +expect that the process will end when all negative commit OIDs in a thin > +bundle are already in the object database. I do not see why we do not want to use parallel download, though. If our last bundle download was last month, and they have two newer bundles since then, don't we want to grab both at the same time? Wasn't that the point of recording the newest timestamp when "git clone" grabbed bundles? > +Error Conditions > +---------------- > + > +If the Git client discovers something unexpected while downloading > +information according to a bundle URI or the bundle list found at that > +location, then Git can ignore that data and continue as if it was not > +given a bundle URI. The remote Git server is the ultimate source of truth, > +not the bundle URI. > + > +Here are a few example error conditions: > + > +* The client fails to connect with a server at the given URI or a connection > + is lost without any chance to recover. > + > +* The client receives a response other than `200 OK` (such as `404 Not Found`, > + `401 Not Authorized`, or `500 Internal Server Error`). The client should > + use the `credential.helper` to attempt authentication after the first > + `401 Not Authorized` response, but a second such response is a failure. > + > +* The client receives data that is not parsable as a bundle or table of > + contents. 
Is it an error if bundle.<id>.list and the contents disagree? It is fine to call the possibility other than "a bundle file" "table of contents", but then let's do so consistently throughout the document. When we explain bundle.<id>.list, we should not call the other possibility "list" but "table of contents", for example. > +* The bundle list describes a directed cycle in the > + `bundle.<id>.requires` links. > + > +* A bundle includes a filter that does not match expectations. Does this refer to a mismatch between the filter recorded in a bundle and bundle.<id>.filter entry that described the bundle? > +* The client cannot unbundle the bundles because the negative commit OIDs > + are not in the object database and there are no more > + `bundle.<id>.requires` links to follow. Is a .requires link mandatory? In a mode=all table of contents, we should not have to have .requires at all. In the above description on how bundle files are downloaded and in what order in Clone and Fetch operations, I didn't see any mention of .requires at all, but I think there should be. For example, the timestamp heuristics may say the bundle A is the latest. In a mode=any table of contents, shouldn't bundles that contain prerequisite commits of the bundle A be pointed by A's .requires fields? > +4. Allow the client to understand the `bundle.list.forFetch` configuration > + and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a > + bundle URI with `bundle.list.forFetch=true`, it configures the client > + repository to check that bundle URI during later `git fetch <remote>` > + commands. So bundle.list.forFetch is, unlike everything else we saw that looked like a configuration variable in this document, a configuration variable whose value is boolean? Ah, no. You mean the "git clone" sees a bundle URI, grabs it and sees a table of contents, and in it, finds "bundle.forFetch" is set to true? Then "git fetch <remote>" is configured to also use bundle URI? 
It is unclear to me (with the information given here so far) why we want
this. Isn't this something the responder to "git fetch" can advertise
over the wire? If we leave a permanent record in the resulting repository
to do the bundle URI during 'fetch', wouldn't it become more cumbersome
to cancel (iow "no, you no longer want to talk to me with the bundle URI
feature") from the server side?

> +5. Allow clients to discover bundle URIs during `git fetch` and configure
> +   a bundle URI for later fetches if `bundle.list.forFetch=true`.
> +
> +6. Implement the "inspect headers" heuristic to reduce data downloads when
> +   the `bundle.<id>.timestamp` heuristic is not available.

Sounds sensible, even though I do not offhand see why the "peek header
and stop" is any less useful when the timestamp heuristic is available.

> +A major downside to this mechanism is that the origin server needs to know
> +_exactly_ what is in those packfiles, and the packfiles need to be available
> +to the user for some time after the server has responded. This coupling
> +between the origin and the packfile data is difficult to manage.

Hmph. I strongly suspect that there are Googlers on the list who have
been managing such JGit server installations. Has this "coupling" been
difficult to manage for you guys in the real world?

> +Further, this implementation is extremely hard to make work with fetches.

IOW, they do this only for clones and not fetches?

> +Related Work: GVFS Cache Servers
> +--------------------------------
> ...
> +During a `git fetch`, a hook requests the prefetch endpoint using the
> +most-recent timestamp from a previously-downloaded prefetch packfile.
> +Only the list of packfiles with later timestamps are downloaded.

That sounds quite straightforward. Do you envision that their incremental
snapshot packfile chains can somehow be shared with the bundle URI
implementations?
Doesn't it make it more cumbersome that this proposal uses the bundles as the encapsulation format, rather than packfiles? As you are sending extra pieces of information on top of the payload in the form of table-of-contents already, I wonder if bundle.<id>.uri should point at a bare packfile (instead of a bundle), while multi-valued bundle.<id>.prerequisite give the prerequisite objects? The machinery that is already generating the prefetch packfiles already know which packfile has what prerequisites in it, so it rather looks simpler if the solution did not involve bundles. Thanks. Here ends my two-part review on the doc.
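(A footnote on the Error Conditions section reviewed above: the described
`401` handling — one credentialed retry after the first `401 Not
Authorized`, with a second `401` or any other non-`200` treated as if no
bundle URI had been given — might look like this sketch; `get` and
`fill_credentials` are hypothetical stand-ins, the latter for Git's
credential.helper machinery:)

```python
def download_bundle_data(get, uri, fill_credentials):
    """Sketch of the 401 flow from the Error Conditions section.

    get(uri, creds) returns (status, body); fill_credentials(uri)
    produces credentials.  Any failure means "ignore the bundle URI
    and fall back to fetching from the remote" (returns None).
    """
    status, body = get(uri, None)
    if status == 401:
        # One retry with credentials; a second 401 is a hard failure.
        status, body = get(uri, fill_credentials(uri))
    return body if status == 200 else None
```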
On 6/6/2022 6:18 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: >> +Assuming a `200 OK` response from the server, the content at the URL is >> +expected to be of one of two forms: >> + >> +1. Bundle: A Git bundle file of version 2 or higher. >> + >> +2. Bundle List: A plain-text file that is parsable using Git's >> + config file parser. This file describes one or more bundles that are >> + accessible from other URIs. >> + >> +Any other data provided by the server is considered erroneous. > > How does a client tell which one it got? Do we register mimetype > with iana to use for these two types of files, and have the HTTP > downloader to use the information? My implementation is much dumber than that: it first attempts to parse the file as a bundle (looking for a bundle header) and then attempts to parse it as a config file after that. If neither succeed, then an error is thrown. >> +bundle.list.mode:: >> + (Required) This value has one of two values: `all` and `any`. When `all` >> + is specified, then the client should expect to need all of the listed >> + bundle URIs that match their repository's requirements. When `any` is >> + specified, then the client should expect that any one of the bundle URIs >> + that match their repository's requirements will suffice. Typically, the >> + `any` option is used to list a number of different bundle servers >> + located in different geographies. > > OK. Presumably, if there are two sets of bundles, A and B, that > consist of 3 and 2 bundle files respectively, and either one of > these two sets is sufficient to help the client, then we'd have a > bundle.list of type 'any', with two bundle.<id>.uri, that point at > (sub) bundle.list of type 'all' in which these 3 or 2 bundle files > are contained? I am just wondering why we need 'all' and 'any', and > at the same time why these two are sufficient for our needs. 
Necessary: The origin Git server may want to advertise a list of geo-distributed bundle servers, but not need to know the exact list of bundles at each of those locations. The client can choose from "any" advertised bundle URI, then download a bundle list from that URI and download "all" bundles it advertised. Sufficient: I can see a few different ways that we could want to have something in-between "any" and "all" and that is: "The bundles break into 'buckets', so pick any bucket and get all within that bucket." This is already planned as part of the bundle.<id>.filter, which creates different 'buckets'. So, if we need a partitioning like this, then we can rely on functionality-based partitions. Finally, we can always extend this in the future. If we want to add a new mode "use-key-X" or something, then bundle providers could start using it, but knowing that older clients would not understand it and would lose the ability to use their bundles. This is only a performance issue, not a correctness issue. I'm definitely trying to minimize the need for these kinds of extensions while keeping the spec small enough to implement in a reasonable amount of time. >> +bundle.list.heuristic:: >> + If this string-valued key exists, then the bundle list is designed to >> + work well with incremental `git fetch` commands. The heuristic signals >> + that there are additional keys available for each bundle that help >> + determine which subset of bundles the client should download. > > Funny indentation? Thanks. Editor confusion with .txt files, apparently. Fixed now. >> +The remaining keys include an `<id>` segment which is a server-designated >> +name for each available bundle. >> + >> +bundle.<id>.uri:: >> + (Required) This string value is the URI for downloading bundle `<id>`. >> + If the URI begins with a protocol (`http://` or `https://`) then the URI >> + is absolute. Otherwise, the URI is interpreted as relative to the URI >> + used for the bundle list. 
If the URI begins with `/`, then that relative >> + path is relative to the domain name used for the bundle list. (This use >> + of relative paths is intended to make it easier to distribute a set of >> + bundles across a large number of servers or CDNs with different domain >> + names.) > > I have no objection to a host-relative URI notation, but is it > something we need to spell out here? I am mostly interested in > making sure that we do not deviate a practice that is already used > to point at different resource at the same server. If the way we > specify host-relative is unnecessarily different from the way > existing "internet" programs (say, a web browser) work, that would > be embarrasing, unless there is a very good reason for us to be > different. Since it requires careful implementation, I thought the detail along with the justification would fit in this technical document. I can avoid including that when describing the config options inside the user-facing docs. >> +bundle.<id>.list:: >> + This boolean value indicates whether the client should expect the >> + content from this URI to be a list (if `true`) or a bundle (if `false`). >> + This is typically used when `bundle.list.mode` is `any`. > > OK, so the type of a (sub) bundle.list can be specified using this > without having the HTTP(s) server annotate the resource with > mimetype when the thing gets actually downloaded. It still leaves > the issue of bootstrapping the system. If the server advises bundle > URI when the client contacts, presumably that first-contact > bundle.*.uri can be annotated with the bundle.*.list at the same > time, but the model allows the client to learn bundles independently > from the server, and it still is a bit unclear how we tell. 
Of > course, we can examine the contents of a file that was obtained from > a bundle URI, a file that parses correctly as a config-like file is > very unlikely to be a valid bundle file, and we need to be prepared > to deal with a corrupt resource downloaded from a bundle URI anyway, > so this may not be a huge deal. Right. We can inspect the file with our existing tools to see if they fit the format. It might be worth doing some fuzz testing on these parsers to be sure there isn't a surprising way to trick them into doing strange things. >> +bundle.<id>.filter:: >> + This string value represents an object filter that should also appear in >> + the header of this bundle. The server uses this value to differentiate >> + different kinds of bundles from which the client can choose those that >> + match their object filters. > > Is it an error to have .filter defined for a bundle URI whose .list > says "true"? Or does it mean all bundle files that make up the list > share the same object filter? While this would not be the typical situation, a bundle provider could choose to combine these and the client would expect a list where all of the .filter values match the one here. Of course, it would not be a _failure_ if that wasn't true, but the client would ignore any bundles it finds where .filter doesn't match. >> +bundle.<id>.timestamp:: >> + This value is the number of seconds since Unix epoch (UTC) that this >> + bundle was created. This is used as an approximation of a point in time >> + that the bundle matches the data available at the origin server. This is >> + used when `bundle.list.heuristic=timestamp`. > > Name of this field should be better than 'timestamp'; we should say > timestamp of creation (or last modification if the same name can be > reused). How about creationToken? That communicates that we don't really care what the number is as long as it comes from an increasing sequence controlled by the bundle provider. 
>> +bundle.<id>.requires:: >> + This string value represents the ID of another bundle. When present, the >> + server is indicating that this bundle contains a thin packfile. If the >> + client does not have all necessary objects to unbundle this packfile, >> + then the client can download the bundle with the `requires` ID and try >> + again. (Note: it may be beneficial to allow the server to specify >> + multiple `requires` bundles.) This is used when >> + `bundle.list.heuristic=timestamp`. > > So, bundle.list.mode can say 'any', with three <id>s in it, but > bundle.1.requires can point at '2', while bundle.2.requires can > point at '1', and bundle.3.requires can be empty, in which case you > can either fetch 1&2 or 3 alone. Is that the idea? The idea is that if I download bundle '1' and I can't unbundle it (because I'm missing some required refs), then I can look at bundle.1.requires to get any missing refs. If that is '2', then I download that. It then continues in a chain. The 'any' means "start anywhere", but I also don't expect a provider to use .requires without the (maybe-to-be-renamed) timestamp heuristic. We could also make that be a hard-coded statement: ".requires will be ignored unless mode=all and heuristic=X" >> +bundle.<id>.location:: >> + This string value advertises a real-world location from where the bundle >> + URI is served. This can be used to present the user with an option for >> + which bundle URI to use. This is only valuable when `bundle.list.mode` >> + is `any`. > > I am afraid I do not follow. Do you mean, by "a real-world > location", we write things like "America/Los_Angeles" and > "Asia/Tokyo" in this field, so people can tell which one is cheaper > to get to? Do we make any further specification to make it machine > usable in any way (I suspect machines would rather measure the > latency and throughput against bundle.<id>.uri and .location is > meant purely for human consumption)?
The intention is to be human-readable, for a user-facing prompt. This could be for an interactive "chooser" or just letting the user know "this is the location of the bundle URI I picked". If we want the computer to automatically select, then using ping latency would be a better way forward. Even in that case, it would be helpful to tell the user "I discovered the closest bundle URI is <location>". >> +Here is an example bundle list using the Git config format: >> + >> +``` >> +[bundle "list"] >> + version = 1 >> + mode = all >> + heuristic = timestamp > > In all mode, how does heuristic help? Doesn't mode=all by > definition require you to grab everything anyway? The heuristic is for incremental fetches, when you already have some Git object data locally and don't want to download every single bundle if you don't need to. (I think I have a step-by-step flow of this lower in the doc.) >> +[bundle "2022-02-09-1644442601-daily"] >> + uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle > > example.com (cf. RFC6761)? Sure. Thanks. >> + timestamp = 1644442601 >> + requires = 2022-02-02-1643842562 >> + >> +[bundle "2022-02-02-1643842562"] >> + uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle >> + timestamp = 1643842562 >> + >> +[bundle "2022-02-09-1644442631-daily-blobless"] >> + uri = 2022-02-09-1644442631-daily-blobless.bundle >> + timestamp = 1644442631 >> + requires = 2022-02-02-1643842568-blobless >> + filter = blob:none >> + >> +[bundle "2022-02-02-1643842568-blobless"] >> + uri = /git/git/2022-02-02-1643842568-blobless.bundle >> + timestamp = 1643842568 >> + filter = blob:none >> +``` >> + >> +This example uses `bundle.list.mode=all` as well as the >> +`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter` >> +options to present two parallel sets of bundles: one for full clones and >> +another for blobless partial clones. 
>> + >> +Suppose that this bundle list was found at the URI >> +`https://bundles.fake.com/git/git/` and so the two blobless bundles have >> +the following fully-expanded URIs: >> + >> +* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle` >> +* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle` > > So,... is the idea that bundle.list.mode=all does *not* mean "you > need to get all of them"? Rather, you first group bundles with the > same filter attribute, and then for each group with the same filter, > you'd need to grab all of them? IOW, if you are interested in a > full clone, you can ignore <id>'s with non-empty bundle.<id>.filter > and grab all the rest? > > If so, then I can see how the design makes sense. I still do not > know how heuristic kicks in, though. > > Another thing I noticed. The above scheme makes it impossible to > have <id> that happens to be "list". I think the variables that > apply to the entire list should be given two-level names, i.e. > > [bundle] > version = 1 > mode = all > heuristic = timestamp > [bundle "2022-02-09-1644442631-daily"] > uri = ... This then means that <id> can't be "version", "mode", or "heuristic", or any other possible key that we use in the future, right? Using "list" helped with this. Perhaps we could do this: [bundles] version = 1 mode = all heuristic = timestamp [bundle "id1"] uri = ... Notably: "bundles" refers to the full list, while "bundle" refers to a single bundle at a time. It makes the situation slightly more complicated from the server side (we reserve bundles.* and bundle.* for this advertisement). >> +The client could choose an arbitrary bundle URI as an option _or_ select >> +the URI with lowest latency by some exploratory checks. > > Some places may have higher latency but great throughput. > > The review for the rest of the document will be left for another > sitting. Thanks for your careful reading! -Stolee
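To make the "group bundles by .filter" behavior concrete, here is a rough sketch that parses a bundle list written in the config-like format and partitions the entries by object filter; a full clone would then take the group with no filter, a blobless clone the `blob:none` group. The hand-rolled parser is purely illustrative (a real client would reuse Git's config machinery), and the sample ids and URIs are made up:

```python
import re
from collections import defaultdict

def parse_bundle_list(text):
    """Parse a bundle list in Git-config-like syntax into {id: {key: value}}.
    Only [bundle "<id>"] sections are modelled here."""
    bundles = defaultdict(dict)
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r'\[bundle "(.+)"\]$', line)
        if header:
            section = header.group(1)
        elif section is not None and "=" in line:
            key, _, value = line.partition("=")
            bundles[section][key.strip()] = value.strip()
    return dict(bundles)

def group_by_filter(bundles):
    """Partition bundle ids by their object filter (None = full bundles)."""
    groups = defaultdict(list)
    for bundle_id, keys in bundles.items():
        groups[keys.get("filter")].append(bundle_id)
    return dict(groups)

# Hypothetical list mixing full and blobless bundles.
sample = '''
[bundle "full-1"]
    uri = https://example.com/full-1.bundle
[bundle "blobless-1"]
    uri = https://example.com/blobless-1.bundle
    filter = blob:none
'''
groups = group_by_filter(parse_bundle_list(sample))
```

A client interested in a full clone would then consider only `groups[None]` and ignore the rest, matching the reading above.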
Derrick Stolee <derrickstolee@github.com> writes: >> How does a client tell which one it got? Do we register mimetype >> with iana to use for these two types of files, and have the HTTP >> downloader to use the information? > > My implementation is much dumber than that: it first attempts to > parse the file as a bundle (looking for a bundle header) and then > attempts to parse it as a config file after that. If neither > succeed, then an error is thrown. I think that is probably the best implementation after all. We cannot trust what the other side tells us. "They claimed that this is a bundle file and not a table-of-contents, and it does look like one, but it may be corrupt or even malicious file that may look like a bundle file on surface, so let's carefully examine it" ought to be the attitude the receiving side has towards the incoming data. Thanks.
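That "look at the bytes, not the label" approach can be sketched as follows: a v2 or v3 bundle starts with a fixed signature line (per the bundle format documentation), so a client can cheaply distinguish a bundle from a config-like table of contents before handing the data to the full parser. The fallback classification below is an illustrative heuristic, not the actual implementation:

```python
BUNDLE_SIGNATURES = (b"# v2 git bundle\n", b"# v3 git bundle\n")

def classify_download(data):
    """First-pass guess at what a bundle URI returned.  Either answer
    still has to survive full parsing before the data is trusted."""
    if data.startswith(BUNDLE_SIGNATURES):
        return "bundle"
    # Config-like table of contents: every non-blank line should be a
    # section header or a key = value pair.
    lines = [ln.strip() for ln in data.decode("utf-8", "replace").splitlines()]
    if any(lines) and all(not ln or ln.startswith("[") or "=" in ln for ln in lines):
        return "list"
    return "unknown"
```

Even when this returns "bundle", the receiving side should keep the paranoid attitude described above and validate the full header and payload.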
On 6/6/2022 8:33 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> +Cloning with Bundle URIs >> +------------------------ >> + >> +The primary need for bundle URIs is to speed up clones. The Git client >> +will interact with bundle URIs according to the following flow: >> + >> +1. The user specifies a bundle URI with the `--bundle-uri` command-line >> + option _or_ the client discovers a bundle list advertised by the >> + Git server. >> + >> +2. If the downloaded data from a bundle URI is a bundle, then the client >> + inspects the bundle headers to check that the negative commit OIDs are > > Although "negative" would be understandable to pros, the commits > required to unbundle a bundle file are officially called > "prerequisite commits" (cf. "git bundle --help"), so that may be > easier to understand by ordinary readers. Ok. I can work to replace this language throughout. >> + present in the client repository. If some are missing, then the client >> + delays unbundling until other bundles have been unbundled, making those >> + OIDs present. When all required OIDs are present, the client unbundles >> + that data using a refspec. The default refspec is >> + `+refs/heads/*:refs/bundles/*`, but this can be configured. > > The refs/bundles/ appear in the document only here, and it is > unclear why we even want it (I am assuming this is against gc while > "git clone" is still running) or how we are going to retire it, if > ever. If there are multiple bundle files involved in this "git clone", > to anchor objects that are necessary against "gc", don't we need to use > refs/bundles/<i>/* or something like that, where <i> is uniquely assigned > number locally? The real reason to keep them in refs/bundles/ is because then those refs can be used in the incremental 'git fetch' after downloading the bundles (in perpetuity) while not stomping refs/heads or refs/remotes/ >> +3. 
If the file is instead a bundle list, then the client inspects the >> + `bundle.list.mode` to see if the list is of the `all` or `any` form. > > If the downloaded file is not a bundle (e.g. "git bundle list-heads" > barfs on it) and it is not parseable with our configuration parser, > do we error out, or do we pretend as if that bundle file or the TOC > did not exist (if the bundle list with mode=any at the higher level > has appropriate alternatives)? I think the best thing to do would be to fail as gracefully as possible here. In particular: give a warning that the remote did not give anything helpful, but continue with the normal fetching from the remote without help from bundles. >> + a. If `bundle.list.mode=all`, then the client considers all bundle >> + URIs. The list is reduced based on the `bundle.<id>.filter` options >> + matching the client repository's partial clone filter. > > OK, this answers my earlier question nicely. It probably means that > either the presentation order needs a bit of rethinking, or "we > group by .filter" needs to be mentioned a lot earlier. Ok. It may even need to be _implemented_ a lot earlier, looking at my current outline of future changes. >> Then, all >> + bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic >> + is provided, then the bundles are downloaded in reverse- >> + chronological order, stopping when a bundle has all required OIDs. > > Stop as soon as just one bundle has all the prerequisite objects, or > should we keep going until all bundles have their prerequisites > satisfied? I presume it is the latter. True. We keep going until all have their required OIDs. With a .requires chain of "1 requires 2" and "2 requires 3", it is possible that "1" has a prerequisite commit that actually exists in "3", but somehow the bundle "2" can unbundle in the client's repository. 
One could argue that the "1" bundle should have a direct .requires relationship on "3" (in addition to "2"), but for simplicity I think it is OK to imply "transitive requires". Using "all" here should make it clearer. >> + The bundles can then be unbundled in chronological order. The client >> + stores the latest timestamp as a heuristic for avoiding future >> + downloads if the bundle list does not advertise newer bundles. > > So we see a list, we start grabbing from new to old. Newer ones > that are based on older ones may have dependencies, so we do not > unbundle until we have all the prerequisites for them. The bundles > that satisfy their prerequisites are unbundled---that would give us > enough objects to play with. What happens to the refs recorded in > them, though? Those refs are translated into refs/bundles/. If multiple bundles have the same refs/heads/ included, then the newest ones would overwrite those refs in refs/bundles/. > Is the timestamp per the serving host, or per the CDN host that > serves us bundle files, or...? I guess it is premature to discuss it > here. "git clone" bootstraps from the advertisement made only by a > single serving host, so the single newest timestamp among the > bundles used from the bundle list is what we store here. How that > timestamp is used is primarily of interest in future fetching, which > would be discussed later. The bundle provider is responsible for making the timestamps make sense as an opaque increasing token. >> +Fetching with Bundle URIs >> +------------------------- >> + >> +When the client fetches new data, it can decide to fetch from bundle >> +servers before fetching from the origin remote. This could be done via a >> +command-line option, but it is more likely useful to use a config value >> +such as the one specified during the clone. >> + >> +The fetch operation follows the same procedure to download bundles from a >> +bundle list (although we do _not_ want to use parallel downloads here).
We >> +expect that the process will end when all negative commit OIDs in a thin >> +bundle are already in the object database. > > I do not see why we do not want to use parallel download, though. > If our last bundle download was last month, and they have two newer > bundles since then, don't we want to grab both at the same time? > Wasn't that the point of recording the newest timestamp when "git > clone" grabbed bundles? In theory, we have also been fetching from the origin, so we might have already received all of the objects in that second (or first!) bundle. By going in reverse-chronological order, we minimize the amount of data downloaded (assuming that is the most expensive part of the operation). This is something where we have room to experiment. With Ævar's idea of downloading only the headers, we could download all headers in parallel and halt the download for any bundles where we have all of the ref tips. >> +Error Conditions >> +---------------- >> + >> +If the Git client discovers something unexpected while downloading >> +information according to a bundle URI or the bundle list found at that >> +location, then Git can ignore that data and continue as if it was not >> +given a bundle URI. The remote Git server is the ultimate source of truth, >> +not the bundle URI. >> + >> +Here are a few example error conditions: >> + >> +* The client fails to connect with a server at the given URI or a connection >> + is lost without any chance to recover. >> + >> +* The client receives a response other than `200 OK` (such as `404 Not Found`, >> + `401 Not Authorized`, or `500 Internal Server Error`). The client should >> + use the `credential.helper` to attempt authentication after the first >> + `401 Not Authorized` response, but a second such response is a failure. >> + >> +* The client receives data that is not parsable as a bundle or table of >> + contents. > > Is it an error if bundle.<id>.list and the contents disagree? I think this can be flexible. 
It is not difficult to treat the .list value as advisory, so we can ignore disagreements. If we start making decisions that hinge on the value of .list, then we can start treating it as an error. Or: maybe the .list value is so "advisory" that it is useless, and I should just drop it from the schema. > It is fine to call the possibility other than "a bundle file" "table > of contents", but then let's do so consistently throughout the document. > When we explain bundle.<id>.list, we should not call the other > possibility "list" but "table of contents", for example. Sorry, I had intended to find-and-replace all of these instances, but missed some that cross line breaks. >> +* The bundle list describes a directed cycle in the >> + `bundle.<id>.requires` links. >> + >> +* A bundle includes a filter that does not match expectations. > > Does this refer to a mismatch between the filter recorded in a > bundle and bundle.<id>.filter entry that described the bundle? Yes. I can make that more explicit. >> +* The client cannot unbundle the bundles because the negative commit OIDs >> + are not in the object database and there are no more >> + `bundle.<id>.requires` links to follow. > > Is a .requires link mandatory? In a mode=all table of contents, we > should not have to have .requires at all. In the above description > on how bundle files are downloaded and in what order in Clone and > Fetch operations, I didn't see any mention of .requires at all, but > I think there should be. For example, the timestamp heuristics may > say the bundle A is the latest. In a mode=any table of contents, > shouldn't bundles that contain prerequisite commits of the bundle A > be pointed by A's .requires fields? It could be considered useful for the .timestamp heuristic, or we could infer the .requires value as "the next most-recent bundle" in that heuristic. (So maybe we remove it from the schema.) 
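The `.requires` flow discussed above can be sketched end to end: follow the links from the newest bundle (guarding against the directed-cycle error condition), then unbundle oldest-first so prerequisites are satisfied before dependents. This is an illustrative model only; the names and data shapes do not come from the implementation:

```python
def download_plan(bundles, start_id):
    """Follow bundle.<id>.requires links, newest first, refusing the
    directed-cycle case that the design treats as an error condition."""
    order, seen = [], set()
    bundle_id = start_id
    while bundle_id is not None:
        if bundle_id in seen:
            raise ValueError("cycle in bundle.<id>.requires links")
        seen.add(bundle_id)
        order.append(bundle_id)
        bundle_id = bundles[bundle_id].get("requires")
    return order

def unbundle_order(bundles, start_id):
    """Unbundle oldest first so prerequisites land before dependents."""
    plan = download_plan(bundles, start_id)
    return sorted(plan, key=lambda b: bundles[b]["creationToken"])

# Hypothetical three-bundle chain: daily requires weekly requires base.
chain = {
    "daily": {"creationToken": 3, "requires": "weekly"},
    "weekly": {"creationToken": 2, "requires": "base"},
    "base": {"creationToken": 1},
}
```

A real client would stop walking the chain as soon as its local object database already satisfies a bundle's prerequisites, rather than always downloading to the end as this toy does.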
With "all" but no heuristic, then we need everything without knowing where to "start" and trying to minimize downloads. >> +4. Allow the client to understand the `bundle.list.forFetch` configuration >> + and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a >> + bundle URI with `bundle.list.forFetch=true`, it configures the client >> + repository to check that bundle URI during later `git fetch <remote>` >> + commands. > > So bundle.list.forFetch is, unlike everything else we saw that > looked like a configuration variable in this document, a > configuration variable whose value is boolean? > > Ah, no. You mean the "git clone" sees a bundle URI, grabs it and > sees a table of contents, and in it, finds "bundle.forFetch" is set > to true? Then "git fetch <remote>" is configured to also use bundle > URI? Yes. The bundle provider is advertising "I'm working hard to help you with your 'git fetches' after your initial clone." > It is unclear to me (with the information given here so far), why we > want this. Isn't this something the responder to "git fetch" can > advertise over the wire? If we leave a permanent record in the > resulting repository to do the bundle URI during 'fetch', wouldn't > it become more cumbersome to cancel (iow "no, you no longer want to > talk to me with bundle URI feature") from the server side? I've tried very hard to make the bundle provider as independent from the origin Git server as possible (including no relationship at all). Even if the origin Git server knows about a bundle provider at a given URI, it does not necessarily know if _at this moment_ the provider is bundling for fetches. >> +5. Allow clients to discover bundle URIs during `git fetch` and configure >> + a bundle URI for later fetches if `bundle.list.forFetch=true`. >> + >> +6. Implement the "inspect headers" heuristic to reduce data downloads when >> + the `bundle.<id>.timestamp` heuristic is not available. 
> > Sounds sensible, even though I do not offhand see why the "peek > header and stop" is any less useful when the timestamp heuristic is > available. Yes, it is still helpful with the heuristic. I can remove that phrasing. (This functionality is _required_ to help incremental fetches when a heuristic is not provided.) >> +A major downside to this mechanism is that the origin server needs to know >> +_exactly_ what is in those packfiles, and the packfiles need to be available >> +to the user for some time after the server has responded. This coupling >> +between the origin and the packfile data is difficult to manage. > > Hmph. I strongly suspect that there are Googlers on the list who > have been managing such JGit server installations. Has this > "coupling" been difficult to manage for you guys in the real world? > >> +Further, this implementation is extremely hard to make work with fetches. > > IOW, they do this only for clones and not fetches? The last time I spoke with Googlers about this subject, what I heard from them was "Yes, this is hard to get right for incremental fetches, but the benefit for clones is big enough that we are happy with just that." If advancements have come about since then to work with incremental fetches, then I'd love to hear about it. >> +Related Work: GVFS Cache Servers >> +-------------------------------- >> ... >> +During a `git fetch`, a hook requests the prefetch endpoint using the >> +most-recent timestamp from a previously-downloaded prefetch packfile. >> +Only the list of packfiles with later timestamps are downloaded. > > That sounds quite straightforward. Do you envision that their > incremental snapshot packfile chains can somehow be shared with the > bundle URI implementations? Doesn't it make it more cumbersome that > this proposal uses the bundles as the encapsulation format, rather > than packfiles?
As you are sending extra pieces of information on > top of the payload in the form of table-of-contents already, I > wonder if bundle.<id>.uri should point at a bare packfile (instead > of a bundle), while multi-valued bundle.<id>.prerequisite give the > prerequisite objects? The machinery that is already generating the > prefetch packfiles already know which packfile has what > prerequisites in it, so it rather looks simpler if the solution did > not involve bundles. The prefetch packfiles could be replaced with bundle URIs, if desired. The reason bundles were not needed for the GVFS protocol was that all other object data not in those prefetch packfiles is downloaded via direct requests (one object per request or a batch of objects requested from a list of OIDs) and not from an incremental fetch. The VFS for Git clients only talk to the origin server for the ref advertisement and talk to the cache servers for the objects necessary to satisfy the client's Git command. (Also: the client pushes directly to the origin server.) So in this world, the bundle URIs could be used as a replacement for downloading these prefetch packfiles (bundles with filter=blob:none) but the bundled refs become useless to the client. Thanks, -Stolee
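For an ordinary Git client (unlike the VFS for Git case just described), the bundled refs are exactly what makes the follow-up fetch cheap: the values installed under `refs/bundles/` become tips the client can claim to have during negotiation with the origin server. A toy model of collecting those tips (real negotiation considers more candidates than this, and all names here are hypothetical):

```python
def negotiation_tips(local_refs):
    """Tips the client can report as 'have's during fetch negotiation,
    including the refs installed from downloaded bundles."""
    return sorted(
        oid
        for ref, oid in local_refs.items()
        if ref.startswith(("refs/bundles/", "refs/heads/"))
    )

# Hypothetical ref state after unbundling a downloaded bundle.
local_refs = {
    "refs/bundles/main": "aaa111",
    "refs/heads/topic": "bbb222",
    "refs/tags/ignored-in-this-toy": "ccc333",
}
tips = negotiation_tips(local_refs)
```

The point is only that the bundle refs survive in their own namespace, so the catch-up fetch can use them without clobbering `refs/heads/` or `refs/remotes/`.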
Derrick Stolee <derrickstolee@github.com> writes: > How about creationToken? That communicates that we don't really care > what the number is as long as it comes from an increasing sequence > controlled by the bundle provider. Is it important for the clients that this value is tied to creation time? I am trying to come up with a word like "serial" that makes it clear that the values have no meaning other than that they are always growing in one direction to suggest as a replacement, but failing to find a good one. The word should probably not have any connotation that they must be consecutive, but "serial" numbers are usually dense. >> Another thing I noticed. The above scheme makes it impossible to >> have <id> that happens to be "list". I think the variables that >> apply to the entire list should be given two-level names, i.e. >> >> [bundle] >> version = 1 >> mode = all >> heuristic = timestamp >> [bundle "2022-02-09-1644442631-daily"] >> uri = ... > > This then means that <id> can't be "version", "mode", or "heuristic", > or any other possible key that we use in the future, right? Left ;-). Two-level variable names and three-level variable names live completely in a separate namespace (there is no D/F conflict). [bundle] version = 1 [bundle "version"] url = ... mode = ... is perfectly legit.
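The namespace point can be made concrete by modelling each config entry under its canonical dotted name, the way `git config --list` prints them: the two-level key `bundle.version` and a three-level key under a subsection literally named `version` are distinct strings and cannot collide. A small illustrative sketch (this deliberately ignores details like case-insensitivity of section names):

```python
def config_key(section, subsection, name):
    """Canonical dotted name for a config entry, in the style that
    `git config --list` prints."""
    if subsection is None:
        return "{}.{}".format(section, name)
    return "{}.{}.{}".format(section, subsection, name)

# [bundle] version = 1   vs.   [bundle "version"] url = ...
entries = {
    config_key("bundle", None, "version"): "1",
    config_key("bundle", "version", "url"): "https://example.com/list",
}
```

Both entries coexist, which is what makes the two-level-plus-three-level scheme "perfectly legit".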
Junio C Hamano <gitster@pobox.com> writes: > Derrick Stolee <derrickstolee@github.com> writes: > >>> How does a client tell which one it got? Do we register mimetype >>> with iana to use for these two types of files, and have the HTTP >>> downloader to use the information? >> >> My implementation is much dumber than that: it first attempts to >> parse the file as a bundle (looking for a bundle header) and then >> attempts to parse it as a config file after that. If neither >> succeed, then an error is thrown. > > I think that is probably the best implementation after all. > > We cannot trust what the other side tells us. "They claimed that > this is a bundle file and not a table-of-contents, and it does look > like one, but it may be corrupt or even malicious file that may look > like a bundle file on surface, so let's carefully examine it" ought > to be the attitude the receiving side has towards the incoming data. With the above, I do not mean that this new mechanism must be more paranoid than we already are. $ git fetch bootstrap.bndl refs/*:refs/bundle/bootstrap/* should already have sensible error checking, and we should use the available mechanism. But there of course are places the new feature should be careful in its new code, for example, we may want to unbundle all these bundles in a quarantined area until we resolve all the prerequisite objects and then move them out of the quarantine, for example, if the new feature rolls its own code to unbundle instead of invoking "git fetch" on it. Even if it spawns "git fetch" on it, it may have to choose the parameters carefully (e.g. the refmap would want to avoid clobbering our own ref namespace, which you plan to do).
On 6/8/2022 4:39 PM, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > >> How about creationToken? That communicates that we don't really care >> what the number is as long as it comes from an increasing sequence >> controlled by the bundle provider. > > Is it important for the clients that this value is tied to creation > time? I am trying to come up with a word like "serial" that makes > it clear that the values have no meaning other than that they are > always growing in one direction to suggest as a replacement, but > failing to find a good one. The word should probably not have > any connotation that they must be consecutive, but "serial" numbers > are usually dense. > >>> Another thing I noticed. The above scheme makes it impossible to >>> have <id> that happens to be "list". I think the variables that >>> apply to the entire list should be given two-level names, i.e. >>> >>> [bundle] >>> version = 1 >>> mode = all >>> heuristic = timestamp >>> [bundle "2022-02-09-1644442631-daily"] >>> uri = ... >> >> This then means that <id> can't be "version", "mode", or "heuristic", >> or any other possible key that we use in the future, right? > > Left ;-). > > Two-level variable names and three-level variable names live > completely in a separate namespace (there is no D/F conflict). > > [bundle] > version = 1 > [bundle "version"] > url = ... > mode = ... > > is perfectly legit. Then I stand corrected. For some reason I thought this would cause a problem, but I must have messed something _else_ up in the process of testing it. Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: >> Is it an error if bundle.<id>.list and the contents disagree? > > I think this can be flexible. It is not difficult to treat the .list > value as advisory, so we can ignore disagreements. If we start making > decisions that hinge on the value of .list, then we can start treating > it as an error. > > Or: maybe the .list value is so "advisory" that it is useless, and I > should just drop it from the schema. I was wondering about the same thing ;-) >>> +Related Work: GVFS Cache Servers >>> +-------------------------------- >>> ... >>> +During a `git fetch`, a hook requests the prefetch endpoint using the >>> +most-recent timestamp from a previously-downloaded prefetch packfile. >>> +Only the list of packfiles with later timestamps are downloaded. >> >> That sounds quite straight-forward. Do you envision that their >> incremental snapshot packfile chains can somehow be shared with the >> bundle URI implementations? Doesn't it make it more cumbersome that >> this proposal uses the bundles as the encapsulation format, rather >> than packfiles? As you are sending extra pieces of information on >> top of the payload in the form of table-of-contents already, I >> wonder if bundle.<id>.uri should point at a bare packfile (instead >> of a bundle), while multi-valued bundle.<id>.prerequisite give the >> prerequisite objects? The machinery that is already generating the >> prefetch packfiles already know which packfile has what >> prerequisites in it, so it rather looks simpler if the solution did >> not involve bundles. > > The prefetch packfiles could be replaced with bundle URIs, if desired. > ... > So in this world, the bundle URIs could be used as a replacement for > downloading these prefetch packfiles (bundles with filter=blob:none) > but the bundled refs become useless to the client. That's all understandable, but what I was alluding to was to go in the other direction. 
Since "bundle URI" thing is new, while the GVFS Cache Servers already use these prefetch packfiles, it could be beneficial if the new thing can be done without bundle files and instead with packfiles. You are already generating these snapshot packfiles for GVFS Cache Servers. So if we can reuse them to also serve "git clone" and "git fetch" clients, we can do so without doubling the disk footprint. Even if you scrapped the "bundle URI" and rebuilt it as the "packfile URI" mechanism, the only change you need is to make positive and negative refs, which were available in bundle files but not stored in packfiles, available as a part of the metadata for each packfile, no? You'd be keeping track of associated metadata (like the .timestamp and .requires fields) in addition to what is in the bundle anyway, so...
On 6/8/2022 5:01 PM, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: >>> That sounds quite straight-forward. Do you envision that their >>> incremental snapshot packfile chains can somehow be shared with the >>> bundle URI implementations? Doesn't it make it more cumbersome that >>> this proposal uses the bundles as the encapsulation format, rather >>> than packfiles? As you are sending extra pieces of information on >>> top of the payload in the form of table-of-contents already, I >>> wonder if bundle.<id>.uri should point at a bare packfile (instead >>> of a bundle), while multi-valued bundle.<id>.prerequisite give the >>> prerequisite objects? The machinery that is already generating the >>> prefetch packfiles already know which packfile has what >>> prerequisites in it, so it rather looks simpler if the solution did >>> not involve bundles. >> >> The prefetch packfiles could be replaced with bundle URIs, if desired. >> ... >> So in this world, the bundle URIs could be used as a replacement for >> downloading these prefetch packfiles (bundles with filter=blob:none) >> but the bundled refs become useless to the client. > > That's all understandable, but what I was alluding to was to go in > the other direction. Since "bundle URI" thing is new, while the > GVFS Cache Servers already use these prefetch packfiles, it could be > beneficial if the new thing can be done without bundle files and > instead with packfiles. You are already generating these snapshot > packfiles for GVFS Cache Servers. So if we can reuse them to also > serve "git clone" and "git fetch" clients, we can do so without > doubling the disk footprint. Now I'm confused as to what you are trying to say, so let me back up and start from the beginning. Hopefully, that brings clarity so we can get to the root of my confusion. The GVFS Cache Servers started as a way to have low-latency per-object downloads to satisfy the filesystem virtualization feature of the clients. 
This initially was going to be the _only_ way clients got objects until we realized that commit and tree "misses" are very expensive. So, the "prefetch packfile" system was developed to use timestamp- based packs that contain commits and trees. Clients would provide their latest timestamp and the servers would provide the list of packfiles to download. Because the GVFS Protocol still has the "download objects on-demand" feature, any objects that were needed that were not already in those prefetch packfiles (including recently-pushed commits and trees) could be downloaded by the clients on-demand. This has been successful in production, and in particular is helpful that cache servers can be maintained completely independently of the origin Git server. There is some configuration to allow the origin server to advertise the list of cache servers via the <url>/gvfs/config REST API, but otherwise they are completely independent. For years, I've been interested in bringing this kind of functionality to Git proper, but struggled on multiple fronts: 1. The independence of the cache servers could not use packfile-URIs. 2. The way packfile-URIs happens _within_ a fetch negotiation makes it hard to integrate even if we didn't have this independence. 3. If the Git client directly downloaded these packfiles from the cache server, then how does it get the remaining objects from the origin server? Ævar's observation that bundles also add ref tips to the packfile is the key to breaking down this concern: these ref tips give us a way to negotiate the difference between what the client already has (including the bundles downloaded from a bundle provider) and what it wants from the origin Git server. This all happens without any change necessary to the origin Git server. And thus, this bundle URI design came about. 
It takes all of the best things about the GVFS Cache Server but then layers refs on top of the time-based prefetch packfiles so a normal Git client can do that "catch-up fetch" afterwards. This motivated my "could we use the new bundle URI feature in the old GVFS Cache Server environment?" comment: I could imagine updating GVFS Cache Servers to generate bundles instead (or also) and updating the VFS for Git clients to use the bundle URI feature to download the data. However, for the sake of not overloading the origin server with those incremental fetches, we would probably keep the "only download missing objects on-demand" feature in that environment. (Hence, the refs are useless to those clients.) However, you seem to be hinting at "the GVFS Cache Servers seem to work just fine, so why do we need bundles?" but I think that the constraints of what is expected at the end of "git clone" or "git fetch" require us to not "catch up later" and instead complete the full download during the process. The refs in the bundles are critical to making that work. > Even if you scrapped the "bundle URI" and rebuilt it as the > "packfile URI" mechanism, the only change you need is to make > positive and negative refs, which were available in bundle files but > not stored in packfiles, available as a part of the metadata for > each packfile, no? You'd be keeping track of associated metadata > (like the .timestamp and .requires fields) in addition to what is in > the bundle anyway, so... From this comment, it seems you are suggesting that we augment the packfile data being served by the packfile-URI feature in order to include those positive/negative refs (as well as other metadata that is included in the bundle URI design). I see two major issues with that: 1. We don't have a way to add that metadata directly into packfiles, so we would need to update the file standard or update the packfile-URI protocol to include that metadata. 2. 
The only source of packfile-URI listings comes as a response to the "git fetch" request to the origin Git server, so there is no way to allow an independent server to provide that data. I hope I am going in the right direction here, but I likely misunderstood some of your proposed alternatives. Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: > Ævar's observation that bundles also add ref tips to the packfile is > the key to breaking down this concern: these ref tips give us a way > to negotiate the difference between what the client already has > (including the bundles downloaded from a bundle provider) and what it > wants from the origin Git server. This all happens without any change > necessary to the origin Git server. > > And thus, this bundle URI design came about. It takes all of the best > things about the GVFS Cache Server but then layers refs on top of the > time-based prefetch packfiles so a normal Git client can do that > "catch-up fetch" afterwards. Yup. My observation was that (1) you would need ref tips in some way, (2) you are conveying not just "here are the set of bundle files", but "this bundle file has these associated attributes" (like .timestamp, and .uri to fetch it from), in the table-of-contents the clients are expected to obtain anyway, hence (3) you could, but you do not need to, use bundle as a way to convey "packfile contents plus refs" to the clients (iow, instead you can use packfile and then report these refs information in the table-of-contents as more "associated attributes" to the items listed in the table-of-contents). > I could imagine updating GVFS Cache Servers to generate bundles > instead (or also) and updating the VFS for Git clients to use the > bundle URI feature to download the data. I could, too, but I do not think that would buy us anything. If an existing system is happily working, I do not see a point in switching it to use bundle. What I was imagining is going the other way. A new thing being written, instead of basing it on bundles, can be based on packfiles, and that would allow you to share the on-disk packfiles between the two systems. > However, you seem to be hinting at "the GVFS Cache Servers seem to > work just fine, so why do we need bundles?" 
but I think that the > constraints of what is expected at the end of "git clone" or "git > fetch" require us to not "catch up later" and instead complete the > full download during the process. The refs in the bundles are critical > to making that work. The refs are critical. They do not have to be recorded in a bundle. > I see two major issues with that: > > 1. We don't have a way to add that metadata directly into packfiles, > so we would need to update the file standard or update the > packfile-URI protocol to include that metadata. > > 2. The only source of packfile-URI listings comes as a response to the > "git fetch" request to the origin Git server, so there is no way > to allow an independent server to provide that data. It might end up being the same thing in the end, but I was thinking about starting from the "bundle URI standard" document. I am not yet interested in discussing the "packfile URI" feature that independently exists already in this discussion (but going this route we _might_ be able to share data and some code with it---but that was not where I am coming from). Starting from the "bundle URI standard" document at the beginning of the thread, if we replace all the mentions of "bundle file" with "packfile" in it, and then add .positiveRefs and .negativeRefs to each "packfile" (renamed from "bundle file") as additional "packfile.<id>.*" (renamed from "bundle.<id>.*") attributes, without changing anything else, the result would be feature-equivalent to the original "bundle URI standard", I would think, but without having to wrap a packfile in a bundle file? > I hope I am going in the right direction here, but I likely > misunderstood some of your proposed alternatives. I wasn't seriously "proposing" an alternative. It was just that it looked wasteful to go to a separate format (i.e.
bundle) when packfiles should suffice, as you would be adding extra information that is not in bundles via the table-of-contents anyway, and what is given by a bundle that is missing in a packfile is only the refs information, which should be trivial to add to the table-of-contents.
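To make this hypothetical concrete, such a packfile-oriented table of contents might look like the following (every key name here is invented for illustration; nothing like this exists in Git):

```
[packfile "2022-02-09-1644442601-daily"]
	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.pack
	timestamp = 1644442601
	requires = 2022-02-02-1643842562
	positiveRefs = refs/heads/main <tip-oid>
	negativeRefs = <prerequisite-oid>
```

In this sketch the refs live only in the table of contents, not in the downloaded file itself.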
On Thu, Jun 09 2022, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > [...] >> I hope I am going in the right direction here, but I likely >> misunderstood some of your proposed alternatives. > > I wasn't seriously "proposing" an alternative. It was just that it > looked wasteful to go to a separate format (i.e. bundle) when packfiles > should suffice, as you would be adding extra information that is not > in bundles via the table-of-contents anyway, and what is given by a > bundle that is missing in a packfile is only the refs information, > which should be trivial to add to the table-of-contents. One thing that got pointed out to me by someone interested in this feature is that they'd be interested in serving up historical git repositories with a repo.bundle file. We're much better off with a format that includes refs for those use cases, even though no version of the patches that have been kicked around supports this particular use-case yet (but it would be relatively easy as a follow-up). I also daresay that bundles will tend to integrate better into existing infrastructure, since if you're doing incremental backups with them it's a natural extension to throw those same incremental updates at a CDN and serve them up with bundle-uri. I also think we can come up with much better tooling for collections of bundles than random packfiles, since they declare what "tip" they want, and we're able to add arbitrary key-values to the header format in the future (and Stolee already added one such key-value).
On 6/9/2022 1:56 PM, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > >> Ævar's observation that bundles also add ref tips to the packfile is >> the key to breaking down this concern: these ref tips give us a way >> to negotiate the difference between what the client already has >> (including the bundles downloaded from a bundle provider) and what it >> wants from the origin Git server. This all happens without any change >> necessary to the origin Git server. >> >> And thus, this bundle URI design came about. It takes all of the best >> things about the GVFS Cache Server but then layers refs on top of the >> time-based prefetch packfiles so a normal Git client can do that >> "catch-up fetch" afterwards. > > Yup. My observation was that (1) you would need ref tips in some > way, (2) you are conveying not just "here are the set of bundle > files", but "this bundle file has these associated attributes" (like > .timestamp, and .uri to fetch it from), in the table-of-contents the > clients are expected to obtain anyway, hence (3) you could, but you > do not need to, use bundle as a way to convey "packfile contents > plus refs" to the clients (iow, instead you can use packfile and > then report these refs information in the table-of-contents as more > "associated attributes" to the items listed in the table-of-contents). ... > Starting from "bundle URI standard" document at the beginning of the > thread, if we replace all the mentions of "bundle file" with > "packfile" in it, and then add .positiveRefs and .negativeRefs to > each "packfile" (renamed from "bundle file") as additional > "packfile.<id>.*" (renamed from "bundle.<id>.*") attributes, without > changing anything else, the result would be feature equivalent to > the original "bundle URI standard", I would think, but without > having to wrap a packfile in a bundle file? > >> I hope I am going in the right direction here, but I likely >> misunderstood some of your proposed alternatives. 
> > I wasn't seriously "proposing" an alternative. It was just that it > looked wasteful to go to a separate format (i.e. bundle) when packfiles > should suffice, as you would be adding extra information that is not > in bundles via the table-of-contents anyway, and what is given by a > bundle that is missing in a packfile is only the refs information, > which should be trivial to add to the table-of-contents. Ok, I've trimmed your latest reply to focus on the main point: "Why bundles?" You are right that we could use a table of contents to list the metadata that we need (that is currently stored in the bundle header) except for one case: the single bundle. If the provider wants to skip the table of contents/bundle list and only provide one file that bootstraps clones, then we need something more than just a packfile. This could be remedied by _requiring_ the table of contents with the ref list, but it does lead to separation of the packfile from the important ref information. Further, the provider might want to cover a large number of refs, not just the default ref. That would increase the size of the table of contents more than necessary. With these things in mind, I do still think bundles are a good way to store and share this data. Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: > Further, the provider might want to cover a large number of refs, > not just the default ref. That would increase the size of the > table of contents more than necessary. > > With these things in mind, I do still think bundles are a good > way to store and share this data. If you keep the refs and filter information separately from the packdata (i.e. in the table-of-contents like I outlined in the message you are responding to), one downside is that if you lose these pieces of information but still have the packfiles, such an accident would make the set of packfiles pretty much useless. But if you have bundles, the filter information to be placed in the table-of-contents can be recovered from them. Which is much better. I wonder if we should add more to the bundle, like what we would write to the .timestamp field of the table-of-contents, though, if we are to go in that direction. Thanks.
On 6/8/2022 3:46 PM, Derrick Stolee wrote: > On 6/6/2022 8:33 PM, Junio C Hamano wrote: >> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: I'm finally getting around to applying the recommended changes locally, when I noticed... >>> + present in the client repository. If some are missing, then the client >>> + delays unbundling until other bundles have been unbundled, making those >>> + OIDs present. When all required OIDs are present, the client unbundles >>> + that data using a refspec. The default refspec is >>> + `+refs/heads/*:refs/bundles/*`, but this can be configured. >> >> The refs/bundles/ appear in the document only here, and it is >> unclear why we even want it (I am assuming this is against gc while >> "git clone" is still running) or how we are going to retire it, if >> ever. If there are multiple bundle files involved in this "git clone", >> to anchor objects that are necessary against "gc", don't we need to use >> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned >> number locally? > > The real reason to keep them in refs/bundles/ is because then those > refs can be used in the incremental 'git fetch' after downloading the > bundles (in perpetuity) while not stomping refs/heads or refs/remotes/ ...I completely ignored your "refs/bundles/<i>/*" suggestion, which is an interesting way to allow dropping refs from this space, allowing GC to clear up space over time. I'm making note of this and will include it as a potential way forward (while I also think about what the implementation would look like). Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: >>> The refs/bundles/ appear in the document only here, and it is >>> unclear why we even want it (I am assuming this is against gc while >>> "git clone" is still running) or how we are going to retire it, if >>> ever. If there are multiple bundle files involved in this "git clone", >>> to anchor objects that are necessary against "gc", don't we need to use >>> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned >>> number locally? >> >> The real reason to keep them in refs/bundles/ is because then those >> refs can be used in the incremental 'git fetch' after downloading the >> bundles (in perpetuity) while not stomping refs/heads or refs/remotes/ > > ...I completely ignored your "refs/bundles/<i>/*" suggestion, which is > an interesting way to allow dropping refs from this space, allowing GC > to clear up space over time. FWIW, I wasn't thinking about GC and expiration. If bundle URI thing can say "you need this, that and that other bundle" and cause you to fetch three bundles, I thought that there needs a way for you to record the tips of these three bundles---these three bundles should not have to compete for refs/bundles/master, for example.
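That suggestion can be demonstrated with stock Git commands, since `git fetch` accepts a bundle file in place of a remote. A minimal sketch (the repository names and the `refs/bundles/1/` numbering are made up for the demo):

```shell
# Demo setup: build a tiny source repo and create a bundle from it.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main src
git -C src -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'demo commit'
git -C src bundle create bundle-1.bundle --all

# The interesting part: fetch from the bundle file with a refspec that
# files its tips under a per-bundle namespace, so a second bundle could
# use refs/bundles/2/* without competing for refs/bundles/main.
git init -q -b main dst
git -C dst fetch -q --no-tags "$tmp/src/bundle-1.bundle" \
    '+refs/heads/*:refs/bundles/1/*'
git -C dst for-each-ref refs/bundles/
```

A later `git fetch <remote>` can then use any of these `refs/bundles/<i>/*` tips as "have" lines during negotiation.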
On 6/21/2022 4:16 PM, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > >>>> The refs/bundles/ appear in the document only here, and it is >>>> unclear why we even want it (I am assuming this is against gc while >>>> "git clone" is still running) or how we are going to retire it, if >>>> ever. If there are multiple bundle files involved in this "git clone", >>>> to anchor objects that are necessary against "gc", don't we need to use >>>> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned >>>> number locally? >>> >>> The real reason to keep them in refs/bundles/ is because then those >>> refs can be used in the incremental 'git fetch' after downloading the >>> bundles (in perpetuity) while not stomping refs/heads or refs/remotes/ >> >> ...I completely ignored your "refs/bundles/<i>/*" suggestion, which is >> an interesting way to allow dropping refs from this space, allowing GC >> to clear up space over time. > > FWIW, I wasn't thinking about GC and expiration. If bundle URI > thing can say "you need this, that and that other bundle" and cause > you to fetch three bundles, I thought that there needs a way for you > to record the tips of these three bundles---these three bundles > should not have to compete for refs/bundles/master, for example. Not wanting to compete makes sense, but also we should usually expect the "most recent" bundle to be the most recent version of the branch. However, that ordering only makes sense when we have the creationToken (nee timestamp) heuristic, so having distinct ref spaces makes sense to avoid collisions. Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: >> FWIW, I wasn't thinking about GC and expiration. If bundle URI >> thing can say "you need this, that and that other bundle" and cause >> you to fetch three bundles, I thought that there needs a way for you >> to record the tips of these three bundles---these three bundles >> should not have to compete for refs/bundles/master, for example. > > Not wanting to compete makes sense, but also we should usually > expect the "most recent" bundle to be the most recent version of the > branch. However, that ordering only makes sense when we have the > creationToken (nee timestamp) heuristic, so having distinct ref > spaces makes sense to avoid collisions. I still do not see how keeping track of bundle tips in the longer term with refs fits in the larger picture. It's not like we are keeping the bundle files (we are instead exploding them into the object store, at which point the contents are mixed with all other objects), and if we do so as part of a boot-strapping, we'd fetch more history on top of what came from bundles, at which point we no longer need these refs for protecting objects from collection. And if the project rewinds and force pushes history, some objects that originally came from these bundles can and should go stale and be collected. Also, if I am not mistaken, the table of contents does not record the tip commits of each bundle, so keeping track of the bundle tips we have seen does not help us optimizing download upon seeing a toc---we'd need to look at the bundle data (at least the header part that lists the tips) anyway. Not complaining against the existence of refs/bundles/ hierarchy. Just stating that I do not think I got the use of it documented in the proposed design (perhaps it was described but I didn't understand it, or perhaps it was under-described and needs clarification). Thanks.
diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..6657ba079ab
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,475 @@
+Bundle URIs
+===========
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reduce the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Design Goals
+------------
+
+The bundle URI standard aims to be flexible enough to satisfy multiple
+workloads. The bundle provider and the Git client have several choices in
+how they create and consume bundle URIs.
+
+* Bundles can have whatever name the server desires. This name could refer
+  to immutable data by using a hash of the bundle contents. However, this
+  means that a new URI will be needed after every update of the content.
+  This might be acceptable if the server is advertising the URI (and the
+  server is aware of new bundles being generated) but would not be
+  ergonomic for users using the command line option.
+
+* The bundles could be organized specifically for bootstrapping full
+  clones, but could also be organized with the intention of bootstrapping
+  incremental fetches. The bundle provider must decide on one of several
+  organization schemes to minimize client downloads during incremental
+  fetches, but the Git client can also choose whether to use bundles for
+  either of these operations.
+
+* The bundle provider can choose to support full clones, partial clones,
+  or both. The client can detect which bundles are appropriate for the
+  repository's partial clone filter, if any.
+
+* The bundle provider can use a single bundle (for clones only), or a
+  list of bundles. When using a list of bundles, the provider can specify
+  whether or not the client needs _all_ of the bundle URIs for a full
+  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
+  bundle provider to use different URIs for different geographies.
+
+* The bundle provider can organize the bundles using heuristics, such as
+  timestamps or tokens, to help the client prevent downloading bundles it
+  does not need. When the bundle provider does not provide these
+  heuristics, the client can use optimizations to minimize how much of the
+  data is downloaded.
+
+* The bundle provider does not need to be associated with the Git server.
+  The client can choose to use the bundle provider without it being
+  advertised by the Git server.
+
+* The client can choose to discover bundle providers that are advertised
+  by the Git server. This could happen during `git clone`, during
+  `git fetch`, both, or neither. The user can choose which combination
+  works best for them.
+
+* The client can choose to configure a bundle provider manually at any
+  time. The client can also choose to specify a bundle provider manually
+  as a command-line option to `git clone`.
+
+Each repository is different and every Git server has different needs.
+Hopefully the bundle URI feature is flexible enough to satisfy all needs.
+If not, then the feature can be extended through its versioning mechanism.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
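Since the provider can be a static CDN, it is the client's job to decide from the downloaded bytes which of the two content forms described in this section it received. A minimal sketch in Python (the function name is illustrative; a real client would sniff a stream rather than buffer the whole payload):

```python
# Sketch: decide whether a downloaded payload is a bundle or a bundle
# list. A Git bundle begins with a signature line such as
# "# v2 git bundle" (or "# v3 git bundle"); anything else is treated
# as a candidate bundle list in Git's config format.
def classify_payload(first_bytes: bytes) -> str:
    """Return 'bundle' or 'list' for the given leading bytes."""
    if first_bytes.startswith((b"# v2 git bundle", b"# v3 git bundle")):
        return "bundle"
    return "list"
```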
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access. (Future extensions could use "file://" URIs or
+SSH URIs.)
+
+Assuming a `200 OK` response from the server, the content at the URL is
+expected to be of one of two forms:
+
+1. Bundle: A Git bundle file of version 2 or higher.
+
+2. Bundle List: A plain-text file that is parsable using Git's
+   config file parser. This file describes one or more bundles that are
+   accessible from other URIs.
+
+Any other data provided by the server is considered erroneous.
+
+Bundle Lists
+------------
+
+The Git server can advertise bundle URIs using a set of `key=value` pairs.
+A bundle URI can also serve a plain-text file in the Git config format
+containing these same `key=value` pairs. In both cases, we consider this
+to be a _bundle list_. The pairs specify information about the bundles
+that the client can use to make decisions for which bundles to download
+and which to ignore.
+
+A few keys focus on properties of the list itself.
+
+bundle.list.version::
+	(Required) This value provides a version number for the table of
+	contents. If a future Git change enables a feature that needs the Git
+	client to react to a new key in the bundle list file, then this version
+	will increment. The only current version number is 1, and if any other
+	value is specified then Git will fail to use this file.
+
+bundle.list.mode::
+	(Required) This value has one of two values: `all` and `any`. When `all`
+	is specified, then the client should expect to need all of the listed
+	bundle URIs that match their repository's requirements. When `any` is
+	specified, then the client should expect that any one of the bundle URIs
+	that match their repository's requirements will suffice. Typically, the
+	`any` option is used to list a number of different bundle servers
+	located in different geographies.
+
+bundle.list.heuristic::
+	If this string-valued key exists, then the bundle list is designed to
+	work well with incremental `git fetch` commands. The heuristic signals
+	that there are additional keys available for each bundle that help
+	determine which subset of bundles the client should download.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle.
+
+bundle.<id>.uri::
+	(Required) This string value is the URI for downloading bundle `<id>`.
+	If the URI begins with a protocol (`http://` or `https://`) then the URI
+	is absolute. Otherwise, the URI is interpreted as relative to the URI
+	used for the bundle list. If the URI begins with `/`, then that relative
+	path is relative to the domain name used for the bundle list. (This use
+	of relative paths is intended to make it easier to distribute a set of
+	bundles across a large number of servers or CDNs with different domain
+	names.)
+
+bundle.<id>.list::
+	This boolean value indicates whether the client should expect the
+	content from this URI to be a list (if `true`) or a bundle (if `false`).
+	This is typically used when `bundle.list.mode` is `any`.
+
+bundle.<id>.filter::
+	This string value represents an object filter that should also appear in
+	the header of this bundle. The server uses this value to differentiate
+	different kinds of bundles from which the client can choose those that
+	match their object filters.
+
+bundle.<id>.timestamp::
+	This value is the number of seconds since Unix epoch (UTC) that this
+	bundle was created. This is used as an approximation of a point in time
+	that the bundle matches the data available at the origin server. This is
+	used when `bundle.list.heuristic=timestamp`.
+
+bundle.<id>.requires::
+	This string value represents the ID of another bundle. When present, the
+	server is indicating that this bundle contains a thin packfile. If the
+	client does not have all necessary objects to unbundle this packfile,
+	then the client can download the bundle with the `requires` ID and try
+	again. (Note: it may be beneficial to allow the server to specify
+	multiple `requires` bundles.) This is used when
+	`bundle.list.heuristic=timestamp`.
+
+bundle.<id>.location::
+	This string value advertises a real-world location from where the bundle
+	URI is served. This can be used to present the user with an option for
+	which bundle URI to use. This is only valuable when `bundle.list.mode`
+	is `any`.
+
+Here is an example bundle list using the Git config format:
+
+```
+[bundle "list"]
+	version = 1
+	mode = all
+	heuristic = timestamp
+
+[bundle "2022-02-09-1644442601-daily"]
+	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle
+	timestamp = 1644442601
+	requires = 2022-02-02-1643842562
+
+[bundle "2022-02-02-1643842562"]
+	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
+	timestamp = 1643842562
+
+[bundle "2022-02-09-1644442631-daily-blobless"]
+	uri = 2022-02-09-1644442631-daily-blobless.bundle
+	timestamp = 1644442631
+	requires = 2022-02-02-1643842568-blobless
+	filter = blob:none
+
+[bundle "2022-02-02-1643842568-blobless"]
+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
+	timestamp = 1643842568
+	filter = blob:none
+```
+
+This example uses `bundle.list.mode=all` as well as the
+`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
+options to present two parallel sets of bundles: one for full clones and
+another for blobless partial clones.
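A toy parser for a list like the example above, sketched in Python (a real client would reuse Git's own config parser, which also handles quoting, escaping, comments, and includes):

```python
# Sketch: parse a bundle list (Git config format) into a dictionary
# keyed by section id. The list-level keys live under the reserved id
# "list", mirroring the [bundle "list"] spelling in the example.
import re

def parse_bundle_list(text):
    """Return {"list": {...}, "<id>": {...}} from bundle-list text."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank lines and comments
        header = re.match(r'\[bundle "([^"]+)"\]$', line)
        if header:
            current = sections.setdefault(header.group(1), {})
            continue
        if current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return sections

# A subset of the example list from this document:
example = '''
[bundle "list"]
	version = 1
	mode = all
	heuristic = timestamp

[bundle "2022-02-09-1644442601-daily"]
	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle
	timestamp = 1644442601
	requires = 2022-02-02-1643842562

[bundle "2022-02-02-1643842562"]
	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
	timestamp = 1643842562
'''

bundles = parse_bundle_list(example)
```

From the resulting dictionary a client can read `bundles["list"]["mode"]` and walk the `requires` chain between the two daily bundles.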
+
+Suppose that this bundle list was found at the URI
+`https://bundles.fake.com/git/git/` and so the two blobless bundles have
+the following fully-expanded URIs:
+
+* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then
+they can specify that URI manually through a command-line option. However,
+a Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. This advertisement takes the form of a new
+protocol v2 capability specifically for discovering bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with lowest latency by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+   option _or_ the client discovers a bundle list advertised by the
+   Git server.
+
+2. If the downloaded data from a bundle URI is a bundle, then the client
+   inspects the bundle headers to check that the negative commit OIDs are
+   present in the client repository. If some are missing, then the client
+   delays unbundling until other bundles have been unbundled, making those
+   OIDs present. When all required OIDs are present, the client unbundles
+   that data using a refspec. The default refspec is
+   `+refs/heads/*:refs/bundles/*`, but this can be configured.
+
+3. If the file is instead a bundle list, then the client inspects the
+   `bundle.list.mode` to see if the list is of the `all` or `any` form.
+
+   a. If `bundle.list.mode=all`, then the client considers all bundle
+      URIs. The list is reduced based on the `bundle.<id>.filter` options
+      matching the client repository's partial clone filter. Then, all
+      bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
+      is provided, then the bundles are downloaded in reverse-
+      chronological order, stopping when a bundle has all required OIDs.
+      The bundles can then be unbundled in chronological order. The client
+      stores the latest timestamp as a heuristic for avoiding future
+      downloads if the bundle list does not advertise newer bundles.
+
+   b. If `bundle.list.mode=any`, then the client can choose any one of the
+      bundle URIs to inspect. The client can use a variety of ways to
+      choose among these URIs. The client can also fall back to another URI
+      if the initial choice fails to return a result.
+
+Note that during a clone we expect that all bundles will be required, and
+heuristics such as `bundle.<id>.timestamp` can be used to download bundles
+in chronological order or in parallel.
+
+If a given bundle URI is a bundle list with a `bundle.list.heuristic`
+value, then the client can choose to store that URI as its chosen bundle
+URI. The client can then navigate directly to that URI during later `git
+fetch` calls.
+
+When downloading bundle URIs, the client can choose to inspect the initial
+content before committing to downloading the entire content. This may
+provide enough information to determine if the URI is a bundle list or
+a bundle. In the case of a bundle, the client may inspect the bundle
+header to determine that all advertised tips are already in the client
+repository and cancel the remaining download.
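The reverse-chronological download with early stopping described in step 3a can be sketched as a small planning function (names are illustrative; `have_prereqs` stands in for checking that a bundle's required OIDs are already present locally):

```python
# Sketch of the timestamp heuristic from step 3a: download bundles
# newest-first, stop as soon as one bundle's prerequisites are already
# satisfied locally, then unbundle what was downloaded oldest-first.
def plan_bundle_application(bundles, have_prereqs):
    """bundles: {id: unix_timestamp}; have_prereqs(id) -> bool.
    Returns ids in the order they should be unbundled."""
    newest_first = sorted(bundles, key=bundles.get, reverse=True)
    downloaded = []
    for bundle_id in newest_first:
        downloaded.append(bundle_id)
        if have_prereqs(bundle_id):
            break  # all older history is already present locally
    return list(reversed(downloaded))  # chronological unbundle order
```

For example, with three daily bundles where the middle one's prerequisites are already satisfied, only the two newest are downloaded and they are unbundled oldest-first.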
+ +Fetching with Bundle URIs +------------------------- + +When the client fetches new data, it can decide to fetch from bundle +servers before fetching from the origin remote. This could be done via a +command-line option, but it is more likely useful to use a config value +such as the one specified during the clone. + +The fetch operation follows the same procedure to download bundles from a +bundle list (although we do _not_ want to use parallel downloads here). We +expect that the process will end when all negative commit OIDs in a thin +bundle are already in the object database. + +When using the `timestamp` heuristic, the client can avoid downloading any +bundles if their timestamps are not larger than the stored timestamp. +After fetching new bundles, this local timestamp value is updated. + +If the bundle provider does not provide a heuristic, then the client +should attempt to inspect the bundle headers before downloading the full +bundle data in case the bundle tips already exist in the client +repository. + +Error Conditions +---------------- + +If the Git client discovers something unexpected while downloading +information according to a bundle URI or the bundle list found at that +location, then Git can ignore that data and continue as if it was not +given a bundle URI. The remote Git server is the ultimate source of truth, +not the bundle URI. + +Here are a few example error conditions: + +* The client fails to connect with a server at the given URI or a connection + is lost without any chance to recover. + +* The client receives a response other than `200 OK` (such as `404 Not Found`, + `401 Not Authorized`, or `500 Internal Server Error`). The client should + use the `credential.helper` to attempt authentication after the first + `401 Not Authorized` response, but a second such response is a failure. + +* The client receives data that is not parsable as a bundle or table of + contents. 

* The bundle list describes a directed cycle in the
  `bundle.<id>.requires` links.

* A bundle includes a filter that does not match expectations.

* The client cannot unbundle the bundles because the negative commit OIDs
  are not in the object database and there are no more
  `bundle.<id>.requires` links to follow.

There are also situations that could be seen as wasteful, but are not
error conditions:

* The downloaded bundles contain more information than is requested by
  the clone or fetch request. A primary example is if the user requests
  a clone with `--single-branch` but downloads bundles that store every
  reachable commit from all `refs/heads/*` references. This might be
  initially wasteful, but perhaps these objects will become reachable by
  a later ref update that the client cares about.

* A bundle download during a `git fetch` contains objects already in the
  object database. This is probably unavoidable if we are using bundles
  for fetches, since the client will almost always be slightly ahead of
  the bundle servers after performing its "catch-up" fetch to the remote
  server. This extra work is most wasteful when the client is fetching
  much more frequently than the server is computing bundles, such as if
  the client is using hourly prefetches with background maintenance, but
  the server is computing bundles weekly. For this reason, the client
  should not use bundle URIs for fetch unless the server has explicitly
  recommended it through the `bundle.list.forFetch = true` value.

Implementation Plan
-------------------

This design document is being submitted on its own as an aspirational
document, with the goal of implementing all of the mentioned client
features over the course of several patch series. Here is a potential
outline for submitting these features:

1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
   This will include a new `git fetch --bundle-uri` mode for use as the
   implementation underneath `git clone`. The initial version here will
   expect a single bundle at the given URI.

2. Implement the ability to parse a bundle list from a bundle URI and
   update the `git fetch --bundle-uri` logic to properly distinguish
   between `bundle.list.mode` options. Specifically design the feature so
   that the config format parsing feeds a list of key-value pairs into the
   bundle list logic.

3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
   bundle URIs using the key-value pairs. Plug into the existing key-value
   input to the bundle list logic. Allow `git clone` to discover these
   bundle URIs and bootstrap the client repository from the bundle data.
   (This choice is an opt-in via a config option and a command-line
   option.)

4. Allow the client to understand the `bundle.list.forFetch` configuration
   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
   bundle URI with `bundle.list.forFetch=true`, it configures the client
   repository to check that bundle URI during later `git fetch <remote>`
   commands.

5. Allow clients to discover bundle URIs during `git fetch` and configure
   a bundle URI for later fetches if `bundle.list.forFetch=true`.

6. Implement the "inspect headers" heuristic to reduce data downloads when
   the `bundle.<id>.timestamp` heuristic is not available.

As these features are reviewed, this plan might be updated. We also expect
that new designs will be discovered and implemented as this feature
matures and becomes used in real-world scenarios.

Related Work: Packfile URIs
---------------------------

The Git protocol already has a capability where the Git server can list
a set of URLs along with the packfile response when serving a client
request. The client is then expected to download the packfiles at those
locations in order to have a complete understanding of the response.
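This capability is opt-in on both ends. As a sketch only, the opt-in
might look like the following configuration; the object hash, pack hash,
and CDN URL below are placeholders, not working values:

```
# Server side: advertise a precomputed packfile at a stable URL.
[uploadpack]
	blobPackfileUri = <object-hash> <pack-hash> https://cdn.example.com/base.pack

# Client side: accept packfile URIs served over HTTPS.
[fetch]
	uriProtocols = https
```

Here `uploadpack.blobPackfileUri` tells the server which precomputed
packfile to advertise, and `fetch.uriProtocols` lists the URI protocols
the client is willing to accept.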

This mechanism is used by the Gerrit server (implemented with JGit) and
has been effective at reducing CPU load and improving user performance for
clones.

A major downside to this mechanism is that the origin server needs to know
_exactly_ what is in those packfiles, and the packfiles need to be available
to the user for some time after the server has responded. This coupling
between the origin and the packfile data is difficult to manage.

Further, this implementation is extremely hard to make work with fetches.

Related Work: GVFS Cache Servers
--------------------------------

The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
the Git project before Git's partial clone was created. One feature of this
protocol is the idea of a "cache server" which can be colocated with build
machines or developer offices to transfer Git data without overloading the
central server.

The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
endpoint, which allows downloading an object on-demand. This is a critical
piece of the filesystem virtualization of that product.

However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
endpoint. Given an optional timestamp, the cache server responds with a list
of precomputed packfiles containing the commits and trees that were introduced
in those time intervals.

The cache server computes these "prefetch" packfiles using the following
strategy:

1. Every hour, an "hourly" pack is generated with a given timestamp.
2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
3. Nightly, all prefetch packs more than 30 days old are rolled up into
   one pack.

When a user runs `gvfs clone` or `scalar clone` against a repo with cache
servers, the client requests all prefetch packfiles, which is at most
`24 + 30 + 1` packfiles, containing only commits and trees.
The client
then follows with a request to the origin server for the references, and
attempts to check out that tip reference. (There is an extra endpoint that
helps get all reachable trees from a given commit, in case that commit
was not already in a prefetch packfile.)

During a `git fetch`, a hook requests the prefetch endpoint using the
most-recent timestamp from a previously-downloaded prefetch packfile.
Only the packfiles with later timestamps are downloaded. Most users
fetch hourly, so they get at most one hourly prefetch pack. Users
whose machines have been off or otherwise have not fetched in over 30 days
might redownload all prefetch packfiles. This is rare.

It is important to note that the clients always contact the origin server
for the refs advertisement, so the refs are frequently "ahead" of the
prefetched pack data. The missing objects are downloaded on-demand using
the `GET /gvfs/objects/{oid}` requests, when needed by a command such as
`git checkout` or `git log`. Some Git optimizations disable checks that
would cause these on-demand downloads to be too aggressive.

See Also
--------

[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
    An earlier RFC for a bundle URI feature.

[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
    The GVFS Protocol