Message ID | patch-2.3-3ac0539c053-20211025T211159Z-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | bundle-uri: "dumb" static CDN offloading, spec & server implementation | expand |
On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: > diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt > +bundle-uri CLIENT ERROR RECOVERY > +++++++++++++++++++++++++++++++++ > + > +A client MUST above all gracefully degrade on errors, whether that > +error is because of bad missing/data in the bundle URI(s), because > +that client is too dumb to e.g. understand and fully parse out bundle > +headers and their prerequisite relationships, or something else. "too dumb" seems a bit informal to me, especially because you immediately elaborate on its meaning. You could rewrite as follows: ...because that client can't understand or fully parse out bundle headers and their prerequisite relationships, or something else. > +Server operators should feel confident in turning on "bundle-uri" and > +not worry if e.g. their CDN goes down that clones or fetches will run > +into hard failures. Even if the server bundle bundle(s) are > +incomplete, or bad in some way the client should still end up with a > +functioning repository, just as if it had chosen not to use this > +protocol extension. Also, insertions of "e.g." in the middle of a sentence don't flow well. Server operators should feel confident in turning on "bundle-uri" and not worry that failures such as the CDN being unavailable will cause clones or fetches to have hard failures. Even if the server bundle(s) are invalid, the client should still end up with a functioning repository, just as if it had chosen not to use this protocol extension. (Note: I also removed a "bundle bundle(s)" that was split across a line break.) > +bundle-uri SERVER TO CLIENT > ++++++++++++++++++++++++++++ > + > +The ordering of the returned bundle uris is not significant. Clients I'm late to noticing, but shouldn't "URI" be all-caps when not used in the literal capability string "bundle-uri"? > +bundle-uri CLIENT TO SERVER > ++++++++++++++++++++++++++++ > + > +The client SHOULD provide reference tips found in the bundle header(s) > +as 'have' lines in any subsequent `fetch` request. A client MAY also > +ignore the bundle(s) entirely if doing so is deemed worse for some > +reason, e.g. if the bundles can't be downloaded, it doesn't like the > +tips it finds etc. I would just stop after "is deemed worse for some reason." because one example is obvious and the other is unclear how the client would detect that situation. (Maybe: tip commit timestamps are really old?) > + > +WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > + > +If after issuing `bundle-uri` and `ls-refs`, and getting the header(s) > +of the bundle(s) the client finds that the ref tips it wants can be > +retrieved entirety from advertised bundle(s), it MAY disconnect. The s/entirety/entirely/ > +results of such a 'clone' or 'fetch' should be indistinguishable from > +the state attained without using bundle-uri. > + > +EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY > +++++++++++++++++++++++++++++++++++++++++++++++ > + > +A client MAY perform an early disconnect while still downloading the > +bundle(s) (having streamed and parsed their headers). In such a case > +the client MUST gracefully recover from any errors related to > +finishing the download and validation of the bundle(s). > + > +I.e. a client might need to re-connect and issue a 'fetch' command, > +and possibly fall back to not making use of 'bundle-uri' at all. Use "For example," over starting a sentence with "i.e.". The examples of "i.e." and "e.g." already in this document show proper use, which involves parentheses. > +This "MAY" behavior is specified as such (and not a "SHOULD") on the > +assumption that a server advertising bundle uris is more likely than > +not to be serving up a relatively large repository, and to be pointing > +to URIs that have a good chance of being in working order. A client > +MAY e.g. look at the payload size of the bundles as a heuristic to see Again, here, the entire sentence is an example. This "e.g." can be removed with no loss of meaning. > +if an early disconnect is worth it, should falling back on a full > +"fetch" dialog be necessary. > +While no per-URI key-value are currently supported currently they're > +intended to support future features such as: > + > + * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or > + size of the bundle file. I suppose if one wanted to add this server-to-bundle coupling, then some clients might appreciate it. > + * Advertise that one or more bundle files are the same (to e.g. have > + clients round-robin or otherwise choose one of N possible files). * Advertise that one or more bundle files are the same, to allow for redundancy without causing duplicated effort. > +static void send_bundle_uris(struct packet_writer *writer, > + struct string_list *uris) > +{ > + struct string_list_item *item; > + > + for_each_string_list_item(item, uris) > + packet_writer_write(writer, "%s", item->string); > +} > + > +static int advertise_bundle_uri = -1; > +static struct string_list bundle_uris = STRING_LIST_INIT_DUP; I see you put send_bundle_uris() before the global bundle_uris so it can be independent, but do you expect anyone to call send_bundle_uris() via a different list? Should we find a different place to store this data? > +static int bundle_uri_config(const char *var, const char *value, void *data) > +{ > + if (!strcmp(var, "uploadpack.bundleuri")) { > + advertise_bundle_uri = 1; > + string_list_append(&bundle_uris, value); > + } > + > + return 0; > +} Here, we are dictating that the URI list is available as a multi-valued config "uploadpack.bundleuri". 1. Should this be updated in Documentation/config/uploadpack.txt? 2. This seems difficult to extend to your possible future features as listed in the protocol docs, mainly because this can only store the flat URI string. To add things like hash values, sizes, and prereqs, you would need more data included and grouped on a per-URI basis. What plans do you have to make extensions here while remaining somewhat compatible with downgrading Git versions? > @@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = { > .advertise = always_advertise, > .command = cap_object_info, > }, > + { > + .name = "bundle-uri", > + .advertise = bundle_uri_advertise, > + .command = bundle_uri_command, > + }, > }; I really appreciate that it is this simple to extend protocol v2. > +test_expect_success 'basics of bundle-uri: dies if not enabled' ' > + test-tool pkt-line pack >in <<-EOF && > + command=bundle-uri > + 0000 > + EOF > + > + cat >err.expect <<-\EOF && > + fatal: invalid command '"'"'bundle-uri'"'"' > + EOF > + > + cat >expect <<-\EOF && > + ERR serve: invalid command '"'"'bundle-uri'"'"' > + EOF > + > + test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual && > + test_cmp err.expect err.actual && > + test_must_be_empty out > +' > + > + hyper-nit: double newline. The implementation seems simple enough, which I like. I'm a bit your current use of Git config as the back-end, only because it is difficult to be future-proof. As the functionality stands today, the current config design works just fine. Perhaps we don't need to worry about the future, because we can design a new, complementary storage for that extra data. It seems worth exploring for a little while, though. Perhaps we should take a page out of 'git-remote' and how it stores named remotes with sub-items for metadata. The names probably don't need to ever be exposed to users, but it could be beneficial to anyone implementing this scheme. [bundle "main"] uri = https://example.com/my-bundle uri = https://redundant-cdn.com/my-bundle size = 120523 sha256 = {64hexchars} [bundle "fork"] uri = https://cdn.org/my-fork size = 334 sha256 = {...} prereq = {oid} This kind of layout has an immediate grouping of data that should help any future plan. Notice how I included multiple "uri" lines in the "main", which helps with your plan for duplicate URIs. Thanks, -Stolee
On Tue, Oct 26 2021, Derrick Stolee wrote: > On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: >> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt >> +bundle-uri CLIENT ERROR RECOVERY >> +++++++++++++++++++++++++++++++++ >> + >> +A client MUST above all gracefully degrade on errors, whether that >> +error is because of bad missing/data in the bundle URI(s), because >> +that client is too dumb to e.g. understand and fully parse out bundle >> +headers and their prerequisite relationships, or something else. > > "too dumb" seems a bit informal to me, especially because you > immediately elaborate on its meaning. You could rewrite as follows: > > ...because > that client can't understand or fully parse out bundle > headers and their prerequisite relationships, or something else. Thanks, I've snipped all your subsequent comments on phrasing/clarifications etc, except insofar as they have questions I need to address (as opposed to just my bad grammar/phrasing etc). Thanks a lot for them, will go through them closely for any subsequent re-roll & address them. > [...] >> +While no per-URI key-value are currently supported currently they're >> +intended to support future features such as: >> + >> + * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or >> + size of the bundle file. > > I suppose if one wanted to add this server-to-bundle coupling, then some > clients might appreciate it. For packfile-uri there's a hard dependency on the server transferring the hash of the PACK file. I've intentionally omitted it, the reasons are covered in [1], which I realize now should really be part of this early series. Basically having it as a hard requirement isn't necessary for security or payload validation. Any server who's worried about their transport integrity would point to a https URI under their control, any checksumming and validation we'll need we'll get from the transport layer and the client's reachability check. Having it would mean that you need closer cooperation by default between server and CDN than I'm aiming for, i.e. a server should be able to point to some URI somewhere updated by a dumb hourly cronjob, without needing to pass information back & forth about what the "current" URL is. The client will discover all that. But I left that "hash=*" in because it could be optionally added, in case someone really wants it for some reason... 1. https://lore.kernel.org/git/RFC-patch-13.13-1e657ed27a-20210805T150534Z-avarab@gmail.com/ >> + * Advertise that one or more bundle files are the same (to e.g. have >> + clients round-robin or otherwise choose one of N possible files). > > * Advertise that one or more bundle files are the same, to allow for > redundancy without causing duplicated effort. *nod* >> +static void send_bundle_uris(struct packet_writer *writer, >> + struct string_list *uris) >> +{ >> + struct string_list_item *item; >> + >> + for_each_string_list_item(item, uris) >> + packet_writer_write(writer, "%s", item->string); >> +} >> + >> +static int advertise_bundle_uri = -1; >> +static struct string_list bundle_uris = STRING_LIST_INIT_DUP; > > I see you put send_bundle_uris() before the global bundle_uris so > it can be independent, but do you expect anyone to call send_bundle_uris() > via a different list? No, I'll move that around or rather fold it into bundle_uri_command() directly. I think I'd originally copied the structure of send_ref() and ls_refs() from ls-refs.c, but it doesn't make much sense anymore here for this 2-line function. Thanks. > Should we find a different place to store this data? > >> +static int bundle_uri_config(const char *var, const char *value, void *data) >> +{ >> + if (!strcmp(var, "uploadpack.bundleuri")) { >> + advertise_bundle_uri = 1; >> + string_list_append(&bundle_uris, value); >> + } >> + >> + return 0; >> +} > > Here, we are dictating that the URI list is available as a multi-valued > config "uploadpack.bundleuri". > > 1. Should this be updated in Documentation/config/uploadpack.txt? Definitely. I'll either incorporate that or re-structure this leading series so that it's more design-doc/protocol focused, in any case all of this ends up documented in the right places eventually... > 2. This seems difficult to extend to your possible future features as > listed in the protocol docs, mainly because this can only store the > flat URI string. To add things like hash values, sizes, and prereqs, > you would need more data included and grouped on a per-URI basis. > What plans do you have to make extensions here while remaining > somewhat compatible with downgrading Git versions? [...addressed below...] >> @@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = { >> .advertise = always_advertise, >> .command = cap_object_info, >> }, >> + { >> + .name = "bundle-uri", >> + .advertise = bundle_uri_advertise, >> + .command = bundle_uri_command, >> + }, >> }; > > I really appreciate that it is this simple to extend protocol v2. Yeah! FWIW I've got some WIP patches to make it easier still, i.e. some further simplification & validation in the serve.[ch] API. >> +test_expect_success 'basics of bundle-uri: dies if not enabled' ' >> + test-tool pkt-line pack >in <<-EOF && >> + command=bundle-uri >> + 0000 >> + EOF >> + >> + cat >err.expect <<-\EOF && >> + fatal: invalid command '"'"'bundle-uri'"'"' >> + EOF >> + >> + cat >expect <<-\EOF && >> + ERR serve: invalid command '"'"'bundle-uri'"'"' >> + EOF >> + >> + test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual && >> + test_cmp err.expect err.actual && >> + test_must_be_empty out >> +' >> + >> + > > hyper-nit: double newline. > > The implementation seems simple enough, which I like. I'm a bit I mentally inserted the missing "skeptical/uncertain" etc. here :) > your current use of Git config as the back-end, only because it is > difficult to be future-proof. As the functionality stands today, the > current config design works just fine. Perhaps we don't need to > worry about the future, because we can design a new, complementary > storage for that extra data. It seems worth exploring for a little > while, though. Perhaps we should take a page out of 'git-remote' > and how it stores named remotes with sub-items for metadata. The > names probably don't need to ever be exposed to users, but it could > be beneficial to anyone implementing this scheme. > > [bundle "main"] > uri = https://example.com/my-bundle > uri = https://redundant-cdn.com/my-bundle > size = 120523 > sha256 = {64hexchars} > > [bundle "fork"] > uri = https://cdn.org/my-fork > size = 334 > sha256 = {...} > prereq = {oid} > > This kind of layout has an immediate grouping of data that should > help any future plan. Notice how I included multiple "uri" lines > in the "main", which helps with your plan for duplicate URIs. At first sight I like that config schema much better than my current one, in particular how it makes the future-proofed "these N urls are one logical URL" case simpler. But overall I'm a bit on the fence, and leaning towards keeping what I have, not out of any lazynes or wanting to just keep what I have mind you. But rather that the main benefit of the current one is that it's a 1=1 mapping to the line-based protocol, and you can say update your URLs as: git config --replace-all uploadpack.bundleUri "$first_url" && git config --add uploadpack.bundleUri "$second_url" Having usually you'd know the URL you'd like to replace, so you can use the [value-pattern] of --replace-all, if it's a named section or other split-out structure that become a two-step lookup. Also for testing I've got a (trivial) plumbing tool I'll submit called "git ls-remote-bundle-uri" (could be folded into something else I guess) to dump the server-side config, it's nice that you can pretty much directly copy/paste it into config without needing to adjust it. Having said all that I'm not sure I feel strongly about it either way, what do you think given the above? I think most "real" server operators will use this as GIT_CONFIG_COUNT=<n> GIT_CONFIG_{KEY,VALUE}_<1..n>, which can of course work with any config schema, but if you've got code generating it on the other side naming the targets is probably a slight hassle / confusing. There's also the small matter of it being consistent with the packfile-uri config in its current form, but that shouldn't be a reason not to come up with something better. If anything any better suggestion (if we go for that) could be supported by it too...
On 10/26/2021 11:00 AM, Ævar Arnfjörð Bjarmason wrote: > > On Tue, Oct 26 2021, Derrick Stolee wrote: >> The implementation seems simple enough, which I like. I'm a bit > > I mentally inserted the missing "skeptical/uncertain" etc. here :) More "uncertain" than "skeptical". The current plan works perfectly for the current implementation, so there is an element of YAGNI that could easily lead us to avoid overthinking this. >> your current use of Git config as the back-end, only because it is >> difficult to be future-proof. As the functionality stands today, the >> current config design works just fine. Perhaps we don't need to >> worry about the future, because we can design a new, complementary >> storage for that extra data. It seems worth exploring for a little >> while, though. Perhaps we should take a page out of 'git-remote' >> and how it stores named remotes with sub-items for metadata. The >> names probably don't need to ever be exposed to users, but it could >> be beneficial to anyone implementing this scheme. >> >> [bundle "main"] >> uri = https://example.com/my-bundle >> uri = https://redundant-cdn.com/my-bundle >> size = 120523 >> sha256 = {64hexchars} >> >> [bundle "fork"] >> uri = https://cdn.org/my-fork >> size = 334 >> sha256 = {...} >> prereq = {oid} >> >> This kind of layout has an immediate grouping of data that should >> help any future plan. Notice how I included multiple "uri" lines >> in the "main", which helps with your plan for duplicate URIs. > > At first sight I like that config schema much better than my current > one, in particular how it makes the future-proofed "these N urls are one > logical URL" case simpler. > > But overall I'm a bit on the fence, and leaning towards keeping what I > have, not out of any lazynes or wanting to just keep what I have mind > you. > > But rather that the main benefit of the current one is that it's a 1=1 > mapping to the line-based protocol, and you can say update your URLs as: > > git config --replace-all uploadpack.bundleUri "$first_url" && > git config --add uploadpack.bundleUri "$second_url" > > Having usually you'd know the URL you'd like to replace, so you can use > the [value-pattern] of --replace-all, if it's a named section or other > split-out structure that become a two-step lookup. Don't forget to use --fixed-value for exact string matching instead of regex matching! > Also for testing I've got a (trivial) plumbing tool I'll submit called > "git ls-remote-bundle-uri" (could be folded into something else I guess) > to dump the server-side config, it's nice that you can pretty much > directly copy/paste it into config without needing to adjust it. With the appropriate helper structs and methods in the product code, such helper tools will still be simple without being a second place that is directly aware of how the values are stored to disk. I don't judge your prototype work that helps you build the feature, but it's simultaneously not a reason to stick to a design. > Having said all that I'm not sure I feel strongly about it either way, > what do you think given the above? I'm not feeling too strong about it right now. The current design does not need anything extra, but it also purposefully leaves certain things open for extension in the future. The thing I worry about is that there will be two supported ways to store a list of bundle URIs: a flat list of URIs in the multi-valued uploadPack.bundleURI config value, but then also a second option that allows the extensions that arise. It's a layer of complication that would be nice to avoid if there was an easy way to do it, but the schema I sketched earlier isn't simple enough to merit a switch right now. Perhaps someone else will have an idea that accomplishes the same goal, but also is less complicated? > I think most "real" server operators will use this as > GIT_CONFIG_COUNT=<n> GIT_CONFIG_{KEY,VALUE}_<1..n>, which can of course > work with any config schema, but if you've got code generating it on the > other side naming the targets is probably a slight hassle / confusing. You'd really overload the environment this way? That's not how I would approach it, but maybe there is a benefit over writing to the repository's config file. I suppose that you could store the data in a database and link it to the repository at runtime instead. > There's also the small matter of it being consistent with the > packfile-uri config in its current form, but that shouldn't be a reason > not to come up with something better. If anything any better suggestion > (if we go for that) could be supported by it too... What do you mean about being consistent with packfile-uri? This layer that we care about isn't even implemented in git.git. Thanks, -Stolee
On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: > Add a server-side implementation of a new "bundle-uri" command to > protocol v2. As discussed in the updated "protocol-v2.txt" this will > allow conforming clients to optionally seed their initial clones or > incremental fetches from URLs containing "*.bundle" files created with > "git bundle create". ... > +DISCUSSION of bundle-uri > +^^^^^^^^^^^^^^^^^^^^^^^^ > + > +The intent of the feature is optimize for server resource consumption > +in the common case by changing the common case of fetching a very > +large PACK during linkgit:git-clone[1] into a smaller incremental > +fetch. > + > +It also allows servers to achieve better caching in combination with > +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]). > + > +By having new clones or fetches be a more predictable and common > +negotiation against the tips of recently produces *.bundle file(s). > +Servers might even pre-generate the results of such negotiations for > +the `uploadpack.packObjectsHook` as new pushes come in. > + > +I.e. the server would anticipate that fresh clones will download a > +known bundle, followed by catching up to the current state of the > +repository using ref tips found in that bundle (or bundles). > + > +PROTOCOL for bundle-uri > +^^^^^^^^^^^^^^^^^^^^^^^ > + > +A `bundle-uri` request takes no arguments, and as noted above does not > +currently advertise a capability value. Both may be added in the > +future. One thing I realized was missing from this proposal is any interaction with partial clone. It would be disappointing if we could not advertise bundles of commit-and-tree packfiles for blobless partial clones. There is currently no way for the client to signal the filter type during this command. Not having any way to extend to include that seems like an oversight we should remedy before committing to a protocol that can't be extended. (This also seems like a good enough reason to group the URIs into a struct-like storage, because the filter type could be stored next to the URI.) Thanks, -Stolee
On Tue, Oct 26 2021, Derrick Stolee wrote: > On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: >> Add a server-side implementation of a new "bundle-uri" command to >> protocol v2. As discussed in the updated "protocol-v2.txt" this will >> allow conforming clients to optionally seed their initial clones or >> incremental fetches from URLs containing "*.bundle" files created with >> "git bundle create". > > ... > >> +DISCUSSION of bundle-uri >> +^^^^^^^^^^^^^^^^^^^^^^^^ >> + >> +The intent of the feature is optimize for server resource consumption >> +in the common case by changing the common case of fetching a very >> +large PACK during linkgit:git-clone[1] into a smaller incremental >> +fetch. >> + >> +It also allows servers to achieve better caching in combination with >> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]). >> + >> +By having new clones or fetches be a more predictable and common >> +negotiation against the tips of recently produces *.bundle file(s). >> +Servers might even pre-generate the results of such negotiations for >> +the `uploadpack.packObjectsHook` as new pushes come in. >> + >> +I.e. the server would anticipate that fresh clones will download a >> +known bundle, followed by catching up to the current state of the >> +repository using ref tips found in that bundle (or bundles). >> + >> +PROTOCOL for bundle-uri >> +^^^^^^^^^^^^^^^^^^^^^^^ >> + >> +A `bundle-uri` request takes no arguments, and as noted above does not >> +currently advertise a capability value. Both may be added in the >> +future. > > One thing I realized was missing from this proposal is any interaction > with partial clone. It would be disappointing if we could not advertise > bundles of commit-and-tree packfiles for blobless partial clones. > > There is currently no way for the client to signal the filter type > during this command. Not having any way to extend to include that > seems like an oversight we should remedy before committing to a > protocol that can't be extended. > > (This also seems like a good enough reason to group the URIs into a > struct-like storage, because the filter type could be stored next to > the URI.) I'll update the docs to note that. I'd definitely like to leave out any implementation of filter/shallow for an initial iteration of this for simplicity, but the protocol keyword/behavior is open-ended enough to permit any extension. I.e. the server can start advertising "bundle-uri=shallow", and future clients can request arbitrary key-value pairs in addition to just "bundle-uri" now. Having said that I think that *probably* this is something that'll never be implemented, but maybe I'll eat my words there. The reason is that once we're in the "fetch" dialog with the server, as we are with "filter" and "shallow" I'd think that we'd be better of just sending a packfile-uri, since that's tailor-made for that use-case. But I suppose we could also advertise e.g.: <bundle-uri> tip=<oid> depth=1 Which a client that noticed that it noticed say the --single-branch at <oid> but with depth=1 could use before it ever got to "fetch". But (and I haven't looked into this really) I'd think that would quickly get you into having a bundle with a PACK payload that wouldn't be representable with the current bundle header format, which I think we'd always want a 1=1 mapping of. I.e. you can specify a prereq, but not leave out trees/blobs etc. So thoughts on that most welcome, in particular how it could be made future-proof.
On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote: > > On Tue, Oct 26 2021, Derrick Stolee wrote: > >> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: >>> Add a server-side implementation of a new "bundle-uri" command to >>> protocol v2. As discussed in the updated "protocol-v2.txt" this will >>> allow conforming clients to optionally seed their initial clones or >>> incremental fetches from URLs containing "*.bundle" files created with >>> "git bundle create". >> >> ... >> >>> +DISCUSSION of bundle-uri >>> +^^^^^^^^^^^^^^^^^^^^^^^^ >>> + >>> +The intent of the feature is optimize for server resource consumption >>> +in the common case by changing the common case of fetching a very >>> +large PACK during linkgit:git-clone[1] into a smaller incremental >>> +fetch. >>> + >>> +It also allows servers to achieve better caching in combination with >>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]). >>> + >>> +By having new clones or fetches be a more predictable and common >>> +negotiation against the tips of recently produces *.bundle file(s). >>> +Servers might even pre-generate the results of such negotiations for >>> +the `uploadpack.packObjectsHook` as new pushes come in. >>> + >>> +I.e. the server would anticipate that fresh clones will download a >>> +known bundle, followed by catching up to the current state of the >>> +repository using ref tips found in that bundle (or bundles). >>> + >>> +PROTOCOL for bundle-uri >>> +^^^^^^^^^^^^^^^^^^^^^^^ >>> + >>> +A `bundle-uri` request takes no arguments, and as noted above does not >>> +currently advertise a capability value. Both may be added in the >>> +future. >> >> One thing I realized was missing from this proposal is any interaction >> with partial clone. It would be disappointing if we could not advertise >> bundles of commit-and-tree packfiles for blobless partial clones. >> >> There is currently no way for the client to signal the filter type >> during this command. Not having any way to extend to include that >> seems like an oversight we should remedy before committing to a >> protocol that can't be extended. >> >> (This also seems like a good enough reason to group the URIs into a >> struct-like storage, because the filter type could be stored next to >> the URI.) > > I'll update the docs to note that. I'd definitely like to leave out any > implementation of filter/shallow for an initial iteration of this for > simplicity, but the protocol keyword/behavior is open-ended enough to > permit any extension. It would be good to be explicit about how this would work. Looking at it fresh, it seems that the server could send multiple bundle URIs with the extra metadata to say which ones have a filter (and what that filter is). The client could then check if a bundle matches the given filter. But this is a bit inverted: the filter mechanism currently has the client request a given filter and the server responds with _at least_ that much data. This allows the server to ignore things like pathspec-filters or certain size-based filters. If the client just ignores a bundle URI because it doesn't match the exact filter, this could lead the client to ask for the data without a bundle, even if it would be faster to just download the advertised bundle. For this reason, I think it would be valuable for the client to tell the server the intended filter, and the server responds with bundle URIs that contain a subset of the information that would be provided by a later fetch request with that filter. > I.e. the server can start advertising "bundle-uri=shallow", and future > clients can request arbitrary key-value pairs in addition to just > "bundle-uri" now. > > Having said that I think that *probably* this is something that'll never > be implemented, but maybe I'll eat my words there. You continue focusing on the shallow option, which I agree is not important. The filter option, specifically --filter=blob:none, seems to be critical to have a short-term plan for implementing with this in mind. Thanks, -Stolee
On Tue, Oct 26 2021, Derrick Stolee wrote: > On 10/26/2021 11:00 AM, Ævar Arnfjörð Bjarmason wrote: [I'll reply to the rest later, either here or in related threads. I.e. I might end up entirely revamping the config etc. format] >> There's also the small matter of it being consistent with the >> packfile-uri config in its current form, but that shouldn't be a reason >> not to come up with something better. If anything any better suggestion >> (if we go for that) could be supported by it too... > > What do you mean about being consistent with packfile-uri? This layer > that we care about isn't even implemented in git.git. It's rather limited, but we do support a uploadpack.BlobPackFileUri as a server-side feature for upload-pack. I.e.: uploadpack.BlobPackFileUri=<OID> <pack-hash> <packfile-uri> See Documentation/technical/packfile-uri.txt. The <pack-hash> is part of the protocol, but the <OID> is just an aid to upload-pack to peel out that OID when it serves up the PACK, the <OID> being what you get from the URI. In terms of server implementation it's rather proof-of-concept-ish, i.e. it's not really all that useful unless your use case is carving out a small number of really big blobs. JGit's is much more mature, and there's some patches on-list recently to make the git.git one more practically useful[1]. 1. https://lore.kernel.org/git/cover.1634634814.git.tenglong@alibaba-inc.com/
On Wed, Oct 27 2021, Derrick Stolee wrote: > On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote: >> >> On Tue, Oct 26 2021, Derrick Stolee wrote: >> >>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: >>>> Add a server-side implementation of a new "bundle-uri" command to >>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will >>>> allow conforming clients to optionally seed their initial clones or >>>> incremental fetches from URLs containing "*.bundle" files created with >>>> "git bundle create". >>> >>> ... >>> >>>> +DISCUSSION of bundle-uri >>>> +^^^^^^^^^^^^^^^^^^^^^^^^ >>>> + >>>> +The intent of the feature is optimize for server resource consumption >>>> +in the common case by changing the common case of fetching a very >>>> +large PACK during linkgit:git-clone[1] into a smaller incremental >>>> +fetch. >>>> + >>>> +It also allows servers to achieve better caching in combination with >>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]). >>>> + >>>> +By having new clones or fetches be a more predictable and common >>>> +negotiation against the tips of recently produces *.bundle file(s). >>>> +Servers might even pre-generate the results of such negotiations for >>>> +the `uploadpack.packObjectsHook` as new pushes come in. >>>> + >>>> +I.e. the server would anticipate that fresh clones will download a >>>> +known bundle, followed by catching up to the current state of the >>>> +repository using ref tips found in that bundle (or bundles). >>>> + >>>> +PROTOCOL for bundle-uri >>>> +^^^^^^^^^^^^^^^^^^^^^^^ >>>> + >>>> +A `bundle-uri` request takes no arguments, and as noted above does not >>>> +currently advertise a capability value. Both may be added in the >>>> +future. >>> >>> One thing I realized was missing from this proposal is any interaction >>> with partial clone. It would be disappointing if we could not advertise >>> bundles of commit-and-tree packfiles for blobless partial clones. >>> >>> There is currently no way for the client to signal the filter type >>> during this command. Not having any way to extend to include that >>> seems like an oversight we should remedy before committing to a >>> protocol that can't be extended. >>> >>> (This also seems like a good enough reason to group the URIs into a >>> struct-like storage, because the filter type could be stored next to >>> the URI.) >> >> I'll update the docs to note that. I'd definitely like to leave out any >> implementation of filter/shallow for an initial iteration of this for >> simplicity, but the protocol keyword/behavior is open-ended enough to >> permit any extension. > > It would be good to be explicit about how this would work. Looking at > it fresh, it seems that the server could send multiple bundle URIs with > the extra metadata to say which ones have a filter (and what that filter > is). The client could then check if a bundle matches the given filter. > > But this is a bit inverted: the filter mechanism currently has the client > request a given filter and the server responds with _at least_ that much > data. This allows the server to ignore things like pathspec-filters or > certain size-based filters. If the client just ignores a bundle URI > because it doesn't match the exact filter, this could lead the client to > ask for the data without a bundle, even if it would be faster to just > download the advertised bundle. > > For this reason, I think it would be valuable for the client to tell > the server the intended filter, and the server responds with bundle > URIs that contain a subset of the information that would be provided > by a later fetch request with that filter. > >> I.e. the server can start advertising "bundle-uri=shallow", and future >> clients can request arbitrary key-value pairs in addition to just >> "bundle-uri" now. >> >> Having said that I think that *probably* this is something that'll never >> be implemented, but maybe I'll eat my words there. I didn't mean to elide past "filter", but was just using "shallow" as a short-hand for one thing in the "fetch" dialog that a client can mention that'll impact PACK generation, just like filter. Having thought about this a bit more, I think it should be an invariant in any bundle-uri design that the server shouldn't communicate any side-channel information whatsoever about a bundle it advertises, if that information can't be discovered in the header of that bundle file. Mind you, this means throwing out large parts of my current proposed over-the-wire design, but I think for the better. I.e. the whole response part where we communicate: (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)* flush-pkt Would just become something like: (bundle-uri delim-pkt bundle-header? delim-pkt)* flush-pkt I.e. we'd optionally transfer the content of the bundle header (content up to the first "\n\n") to the client, but *only* ever as a shorthand for saving the client a roundtrip. The pointed-to bundle is still 100% the source of truth, and when retrieving the bundle-uri we'd ignore whatever "bundle-header" we got earlier (except insofar as we'd like to say emit a warning() if the two don't match). (I'd not thought too carefully about these shallow/filter etc. edge cases, my main intended use-case has been pre-seeding full clones, and having this feedback to make me think about it is very valuable). > You continue focusing on the shallow option, which I agree is not > important. The filter option, specifically --filter=blob:none, seems > to be critical to have a short-term plan for implementing with this > in mind. Per the above this then just becomes a question of "how do we produce a bundle with those attributes?". I *think* that currently there isn't a way to do that, i.e. the PACK payload of a bundle is the output of "git pack-objects", but due to it including refs, tips and prerequisites. I don't think you can say "this bundle has no blobs". The "prerequisites" hard map to the same thing you could put on a "want/have" line during PACK negotiation. I think we could/should fix that, i.e. we can bump the bundle format version and have it encode some extended prerequisites/filter/shallow etc information. You'd then have a 1=1 match between the features of git-upload-pack and what you can transfer via the bundle side-channel. But the more I think about it, the more strongly I feel that we should always add that to the bundle *format*, and not as some side-channel information in this "bundle-uri" protocol keyword. To me *the* point of this feature is to have servers provide a shorthand for something that's been a well-established trick you can do today, and of which there are any number of pre-existing implementations. I'm not trying to break any new ground here, just make "git [fetch|clone]" support a well-known trick as a first-class feature via protocol v2. I'm not the first person to whip up some custom "git-clone-via-bundle.sh" that takes bundle URI(s) and a repo URI, wget's the bundle, calls "git bundle unbundle", updates ref tips, and then does a "git fetch". The benefit of making that a first-class protocol feature over a full negotiation is essentially synonymous with how it's easier in practice to widely deploy static assets on CDNs v.s. guaranteeing the same network locality, caching etc. when serving up the same payload by running a custom binary. One reason not to add any side-channel information not found in the bundle header(s) is that we can also guarantee that there won't be any feature gap between the "transfer.injectBundleURI" config key I've already got implemented (and is in the earlier RFC version of this series). I.e. you can do: # You can specify this N number of times to inject N bundles git clone \ -c transfer.injectBundleURI="https://something.local/some-repo.bundle" \ https://example.com/some-repo.git To inject CDN support to any remote server that doesn't know about "bundle-uri", or add to the bundles of a server that does. That URI can even be a file:// if you add "-c fetch.uriProtocols=file". I realize that all of the above does *not* answer part of your question about filters, which I think I can accurately rephrase as: Ok, so you can dump a static list of bundle URIs from config, but that's always going to be a small list, what about the combinatorial explosion arbitrary upload-pack options? Filters, shallow, include-tag etc. My main answer to that is that YAGNI. If you need to spew out an URL for a PACK after a client describes any of its arbitrary wants, needs, filters etc. you've exactly re-invented what "packfile-uri" is today. I think that feature is very useful, and I've got no intention of trying to replace it. I think the sweet spot for "bundle-uri" is to advertise a small number of bundles that encompass common clone/fetch patterns. I.e. something like a bundle for a full clone with the repo data up to today, and maybe a couple of other bundles covering a time period that clients would be likely to incrementally update within, e.g. 1 week ago..today && yesterday..now. I agree that adding say "full clone, --depth=1" and "full clone, no blobs" etc. to that might be *very* useful for some deployments, but per the above I think we should really add that to the bundle format first, not protocol v2.
On 10/27/2021 2:01 PM, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Oct 27 2021, Derrick Stolee wrote: > >> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote: >>> >>> On Tue, Oct 26 2021, Derrick Stolee wrote: >>> >>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: >>>>> Add a server-side implementation of a new "bundle-uri" command to >>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will >>>>> allow conforming clients to optionally seed their initial clones or >>>>> incremental fetches from URLs containing "*.bundle" files created with >>>>> "git bundle create". >>>> >>>> ... >>>> >>>>> +DISCUSSION of bundle-uri >>>>> +^^^^^^^^^^^^^^^^^^^^^^^^ >>>>> + >>>>> +The intent of the feature is optimize for server resource consumption >>>>> +in the common case by changing the common case of fetching a very >>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental >>>>> +fetch. >>>>> + >>>>> +It also allows servers to achieve better caching in combination with >>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]). >>>>> + >>>>> +By having new clones or fetches be a more predictable and common >>>>> +negotiation against the tips of recently produces *.bundle file(s). >>>>> +Servers might even pre-generate the results of such negotiations for >>>>> +the `uploadpack.packObjectsHook` as new pushes come in. >>>>> + >>>>> +I.e. the server would anticipate that fresh clones will download a >>>>> +known bundle, followed by catching up to the current state of the >>>>> +repository using ref tips found in that bundle (or bundles). >>>>> + >>>>> +PROTOCOL for bundle-uri >>>>> +^^^^^^^^^^^^^^^^^^^^^^^ >>>>> + >>>>> +A `bundle-uri` request takes no arguments, and as noted above does not >>>>> +currently advertise a capability value. Both may be added in the >>>>> +future. >>>> >>>> One thing I realized was missing from this proposal is any interaction >>>> with partial clone. It would be disappointing if we could not advertise >>>> bundles of commit-and-tree packfiles for blobless partial clones. >>>> >>>> There is currently no way for the client to signal the filter type >>>> during this command. Not having any way to extend to include that >>>> seems like an oversight we should remedy before committing to a >>>> protocol that can't be extended. >>>> >>>> (This also seems like a good enough reason to group the URIs into a >>>> struct-like storage, because the filter type could be stored next to >>>> the URI.) >>> >>> I'll update the docs to note that. I'd definitely like to leave out any >>> implementation of filter/shallow for an initial iteration of this for >>> simplicity, but the protocol keyword/behavior is open-ended enough to >>> permit any extension. >> >> It would be good to be explicit about how this would work. Looking at >> it fresh, it seems that the server could send multiple bundle URIs with >> the extra metadata to say which ones have a filter (and what that filter >> is). The client could then check if a bundle matches the given filter. >> >> But this is a bit inverted: the filter mechanism currently has the client >> request a given filter and the server responds with _at least_ that much >> data. This allows the server to ignore things like pathspec-filters or >> certain size-based filters. If the client just ignores a bundle URI >> because it doesn't match the exact filter, this could lead the client to >> ask for the data without a bundle, even if it would be faster to just >> download the advertised bundle. >> >> For this reason, I think it would be valuable for the client to tell >> the server the intended filter, and the server responds with bundle >> URIs that contain a subset of the information that would be provided >> by a later fetch request with that filter. >> >>> I.e. the server can start advertising "bundle-uri=shallow", and future >>> clients can request arbitrary key-value pairs in addition to just >>> "bundle-uri" now. >>> >>> Having said that I think that *probably* this is something that'll never >>> be implemented, but maybe I'll eat my words there. > > I didn't mean to elide past "filter", but was just using "shallow" as a > short-hand for one thing in the "fetch" dialog that a client can mention > that'll impact PACK generation, just like filter. > > Having thought about this a bit more, I think it should be an invariant > in any bundle-uri design that the server shouldn't communicate any > side-channel information whatsoever about a bundle it advertises, if > that information can't be discovered in the header of that bundle file. > > Mind you, this means throwing out large parts of my current proposed > over-the-wire design, but I think for the better. I.e. the whole > response part where we communicate: > > (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)* > flush-pkt > > Would just become something like: > > (bundle-uri delim-pkt bundle-header? delim-pkt)* > flush-pkt > > I.e. we'd optionally transfer the content of the bundle header (content > up to the first "\n\n") to the client, but *only* ever as a shorthand > for saving the client a roundtrip. It still seems like we're better off letting the client request a filter and have the server present URIs that the client can use, and the server can choose to ignore the filter or provide URIs that are specific to that filter. Sending the full list and making the client decide what it wants seems like it will be more complicated than necessary. However, I'll withhold complete judgement until I see a full proposal in a v2. > I think the sweet spot for "bundle-uri" is to advertise a small number > of bundles that encompass common clone/fetch patterns. I.e. something > like a bundle for a full clone with the repo data up to today, and maybe > a couple of other bundles covering a time period that clients would be > likely to incrementally update within, e.g. 1 week ago..today && > yesterday..now. > > I agree that adding say "full clone, --depth=1" and "full clone, no > blobs" etc. to that might be *very* useful for some deployments, but per > the above I think we should really add that to the bundle format first, > not protocol v2. I'm focusing my interest in this topic not on "how can we make what we already do faster?" but "how can we unlock scale not previously possible?" Allowing blobless clones is an important part of this, in my opinion, so it is my _default_ mode of operating. Thanks, -Stolee
On Wed, Oct 27 2021, Derrick Stolee wrote: > On 10/27/2021 2:01 PM, Ævar Arnfjörð Bjarmason wrote: >> >> On Wed, Oct 27 2021, Derrick Stolee wrote: >> >>> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote: >>>> >>>> On Tue, Oct 26 2021, Derrick Stolee wrote: >>>> >>>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote: >>>>>> Add a server-side implementation of a new "bundle-uri" command to >>>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will >>>>>> allow conforming clients to optionally seed their initial clones or >>>>>> incremental fetches from URLs containing "*.bundle" files created with >>>>>> "git bundle create". >>>>> >>>>> ... >>>>> >>>>>> +DISCUSSION of bundle-uri >>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^ >>>>>> + >>>>>> +The intent of the feature is optimize for server resource consumption >>>>>> +in the common case by changing the common case of fetching a very >>>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental >>>>>> +fetch. >>>>>> + >>>>>> +It also allows servers to achieve better caching in combination with >>>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]). >>>>>> + >>>>>> +By having new clones or fetches be a more predictable and common >>>>>> +negotiation against the tips of recently produces *.bundle file(s). >>>>>> +Servers might even pre-generate the results of such negotiations for >>>>>> +the `uploadpack.packObjectsHook` as new pushes come in. >>>>>> + >>>>>> +I.e. the server would anticipate that fresh clones will download a >>>>>> +known bundle, followed by catching up to the current state of the >>>>>> +repository using ref tips found in that bundle (or bundles). >>>>>> + >>>>>> +PROTOCOL for bundle-uri >>>>>> +^^^^^^^^^^^^^^^^^^^^^^^ >>>>>> + >>>>>> +A `bundle-uri` request takes no arguments, and as noted above does not >>>>>> +currently advertise a capability value. Both may be added in the >>>>>> +future. >>>>> >>>>> One thing I realized was missing from this proposal is any interaction >>>>> with partial clone. It would be disappointing if we could not advertise >>>>> bundles of commit-and-tree packfiles for blobless partial clones. >>>>> >>>>> There is currently no way for the client to signal the filter type >>>>> during this command. Not having any way to extend to include that >>>>> seems like an oversight we should remedy before committing to a >>>>> protocol that can't be extended. >>>>> >>>>> (This also seems like a good enough reason to group the URIs into a >>>>> struct-like storage, because the filter type could be stored next to >>>>> the URI.) >>>> >>>> I'll update the docs to note that. I'd definitely like to leave out any >>>> implementation of filter/shallow for an initial iteration of this for >>>> simplicity, but the protocol keyword/behavior is open-ended enough to >>>> permit any extension. >>> >>> It would be good to be explicit about how this would work. Looking at >>> it fresh, it seems that the server could send multiple bundle URIs with >>> the extra metadata to say which ones have a filter (and what that filter >>> is). The client could then check if a bundle matches the given filter. >>> >>> But this is a bit inverted: the filter mechanism currently has the client >>> request a given filter and the server responds with _at least_ that much >>> data. This allows the server to ignore things like pathspec-filters or >>> certain size-based filters. If the client just ignores a bundle URI >>> because it doesn't match the exact filter, this could lead the client to >>> ask for the data without a bundle, even if it would be faster to just >>> download the advertised bundle. >>> >>> For this reason, I think it would be valuable for the client to tell >>> the server the intended filter, and the server responds with bundle >>> URIs that contain a subset of the information that would be provided >>> by a later fetch request with that filter. >>> >>>> I.e. the server can start advertising "bundle-uri=shallow", and future >>>> clients can request arbitrary key-value pairs in addition to just >>>> "bundle-uri" now. >>>> >>>> Having said that I think that *probably* this is something that'll never >>>> be implemented, but maybe I'll eat my words there. >> >> I didn't mean to elide past "filter", but was just using "shallow" as a >> short-hand for one thing in the "fetch" dialog that a client can mention >> that'll impact PACK generation, just like filter. >> >> Having thought about this a bit more, I think it should be an invariant >> in any bundle-uri design that the server shouldn't communicate any >> side-channel information whatsoever about a bundle it advertises, if >> that information can't be discovered in the header of that bundle file. >> >> Mind you, this means throwing out large parts of my current proposed >> over-the-wire design, but I think for the better. I.e. the whole >> response part where we communicate: >> >> (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)* >> flush-pkt >> >> Would just become something like: >> >> (bundle-uri delim-pkt bundle-header? delim-pkt)* >> flush-pkt >> >> I.e. we'd optionally transfer the content of the bundle header (content >> up to the first "\n\n") to the client, but *only* ever as a shorthand >> for saving the client a roundtrip. > > It still seems like we're better off letting the client request a > filter and have the server present URIs that the client can use, > and the server can choose to ignore the filter or provide URIs that > are specific to that filter.[...] I've tested this a bit now and think there's no way to create such a bundle currently. I.e. try: git clone --filter=blob:none --single-branch --no-tags https://github.com/git/git.git cd git git config --unset remote.origin.partialclonefilter git config --unset remote.origin.promisor You'll get: $ GIT_TRACE_PACKET=1 git bundle create --version=3 master.bdl master Enumerating objects: 306784, done. Counting objects: 100% (306784/306784), done. Compressing objects: 100% (69265/69265), done. fatal: unable to read c85385dc03228450cb7fb6d306252038a91b47e6 error: pack-objects died If you didn't do that config munging we'd create the pack, but it would be inflated with the blobs (after going back and getting them from the server). So aside from any questions of how you'd hypothetically communicate your desire to get such bundle from the server, I don't think it could serve one up. So I think this is moot until the bundle format itself could support it. I'll need to "git bundle [verify|unbundle]" whatever I get on the other end. I really don't mean this in any way as dodging the desirability of this feature. I'd really like to have it too. I think implementing it should be relatively simple, and I've got an implementation in mind that makes this future-proof for anything else we'd like to add. I.e. if you look at that v3 format bundle you'll see: $ head -c 100 master.bdl # v3 git bundle @object-format=sha1 e9e5ba39a78c8f5057262d49e261b42a8660d5b9 refs/heads/master PACK Wouldn't this just be a matter of including extra lines with: # I'm assuming that the promisor url can be assumed to be "the url # we cloned this from", but maybe we need @remoteURL=https://.... @promisor=true @filter = blob:none I.e. exactly corresponding to the .git/config we'd end up with, config. We'd then (I think) create .git/objects/pack/*.promisor with the OIDs of each of the inflated tips (I'm not familiar with .promisor files). And a thing I need to include in the bundle-uri protocol is that the client should not just include a "bundle-uri" attribute, but have a "value" describing the bundle format it accepts. I.e. now: bundle-uri=v3,object-format And for supporting the above: bundle-uri=v3,object-format,promisor,filter I.e. currently we die on any bundle capability except "object-format", if we're going to discover what to send we'd like a less crappy way than parsing the version from the "agent" field. > Sending the full list and making the client decide what it wants seems > like it will be more complicated than necessary. However, I'll > withhold complete judgement until I see a full proposal in a v2. I'm very thankful for the thorough review, and it's exciting that you'd like to use this feature in some form, and I'll definitely do my best to support (and if not, future-proof) any use-cases you have in mind. But I really don't get how this wouldn't effectively be functionally indistinguishable from packfile-uri, sans a nit here and there. I can see the convenience of having say 100 bundles, advertising 5, and then after a full negotiation dialog pointing the equivalent of a packfile-uri at a *.bundle file, just because that's what you happen to have around already. If bundle-uri is your main static file distribution you don't want a duplicate *.pack (without the bundle header) just for that. I think a logical extension of the packfile-uri feature for those that need extended negotiation before deciding on the static URL would be to teach the packfile-uri downloader to ignore an optional bundle header of any PACK it finds at a URL (which would not be the same as this proposal), just to support that use-case. But, isn't that essentially what you'd want in those cases? Spewing a "here's my bundles" at a client gets it started quickly, and also has the side-benefit of making those assets more cachable, as well as creating a known base for the caching of any subsequent "...and a PACK negotiation to make it fully up-to-date" request. The bundles are also our de-facto sneakernet and format, and can be used for incremental replication. All of which is also a sweet spot for bundle-uri, i.e. the combination of being able to re-use already "replicated" files for CDN-ing, and providing wider access to CDN features for "dumb" servers. But once we're in dialog with a client to discuss its arbitrary filter preferences before giving it a URL we're going to be most likely implementing that as a mode of upload-pack.c anyway, and when it spews out an optional URL at the end of that dialog.... >> I think the sweet spot for "bundle-uri" is to advertise a small number >> of bundles that encompass common clone/fetch patterns. I.e. something >> like a bundle for a full clone with the repo data up to today, and maybe >> a couple of other bundles covering a time period that clients would be >> likely to incrementally update within, e.g. 1 week ago..today && >> yesterday..now. >> >> I agree that adding say "full clone, --depth=1" and "full clone, no >> blobs" etc. to that might be *very* useful for some deployments, but per >> the above I think we should really add that to the bundle format first, >> not protocol v2. > > I'm focusing my interest in this topic not on "how can we make what we > already do faster?" but "how can we unlock scale not previously > possible?" Allowing blobless clones is an important part of this, in > my opinion, so it is my _default_ mode of operating. *nod*, I'm keen to support it using something like what's described above. I.e. stick it as new headers in the bundle format, then be able to advertise those for common cases. I'd think most of these would be a few combinations (usually something copy/pasted from a relevant dev guide), e.g. "all history, no blobs", "main branch, no tags, no blobs" etc. Aside from this bundle-uri protocol proposal being able to sneakernet a repo you cloned like that around as-is seems highly desirable, and just a feature gap we added when those features were added to "git fetch" and friends, but not "git bundle".
On 10/27/2021 4:22 PM, Ævar Arnfjörð Bjarmason wrote: > > On Wed, Oct 27 2021, Derrick Stolee wrote: >> It still seems like we're better off letting the client request a >> filter and have the server present URIs that the client can use, >> and the server can choose to ignore the filter or provide URIs that >> are specific to that filter.[...] > > I've tested this a bit now and think there's no way to create such a > bundle currently. I.e. try: > > git clone --filter=blob:none --single-branch --no-tags https://github.com/git/git.git > cd git > git config --unset remote.origin.partialclonefilter > git config --unset remote.origin.promisor > > You'll get: > > $ GIT_TRACE_PACKET=1 git bundle create --version=3 master.bdl master > Enumerating objects: 306784, done. > Counting objects: 100% (306784/306784), done. > Compressing objects: 100% (69265/69265), done. > fatal: unable to read c85385dc03228450cb7fb6d306252038a91b47e6 > error: pack-objects died > > If you didn't do that config munging we'd create the pack, but it would > be inflated with the blobs (after going back and getting them from the > server). > > So aside from any questions of how you'd hypothetically communicate your > desire to get such bundle from the server, I don't think it could serve > one up. > > So I think this is moot until the bundle format itself could support > it. I'll need to "git bundle [verify|unbundle]" whatever I get on the > other end. Thank you for demonstrating that bundles don't currently work with filters. This would need to be remedied, but can be done independently. For now, let's just make sure that there is a paved path for extending whatever is decided here to work with filters in the future. As for the overall design, I have some thoughts that I'm going to break out into a new message responding to your cover letter. Thanks, -Stolee
An aside: On 27/10/2021 19:01, Ævar Arnfjörð Bjarmason wrote: > I don't think you can say "this bundle has no blobs". The > "prerequisites" hard map to the same thing you could put on a > "want/have" line during PACK negotiation. > > I think we could/should fix that, i.e. we can > bump the bundle format > version and have it encode some extended prerequisites/filter/shallow > etc information. If the format is bumped, could we also include the HEAD=<particular-branch> info within that format. The `guess the HEAD` algorithm isn't ideal and shows up in user questions every now and again. > You'd then have a 1=1 match between the features of > git-upload-pack and what you can transfer via the bundle side-channel. -- Philip
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt index 21e8258ccf3..4bc15a976cd 100644 --- a/Documentation/technical/protocol-v2.txt +++ b/Documentation/technical/protocol-v2.txt @@ -566,3 +566,212 @@ and associated requested information, each separated by a single space. attr = "size" obj-info = obj-id SP obj-size + +bundle-uri +~~~~~~~~~~ + +If the 'bundle-uri' capability is advertised, the server supports the +`bundle-uri' command. + +The capability is currently advertised with no value (i.e. not +"bundle-uri=somevalue"), a value may be added in the future for +supporting command-wide extensions. Clients MUST ignore any unknown +capability values and proceed with the 'bundle-uri` dialog they +support. + +The 'bundle-uri' command is intended to be issued before `fetch` to +get URIs to bundle files (see linkgit:git-bundle[1]) to "seed" and +inform the subsequent `fetch` command. + +The client CAN issue `bundle-uri` before or after any other valid +command. To be useful to clients it's expected that it'll be issued +after an `ls-refs` and before `fetch`, but CAN be issued at any time +in the dialog. + +DISCUSSION of bundle-uri +^^^^^^^^^^^^^^^^^^^^^^^^ + +The intent of the feature is optimize for server resource consumption +in the common case by changing the common case of fetching a very +large PACK during linkgit:git-clone[1] into a smaller incremental +fetch. + +It also allows servers to achieve better caching in combination with +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]). + +By having new clones or fetches be a more predictable and common +negotiation against the tips of recently produces *.bundle file(s). +Servers might even pre-generate the results of such negotiations for +the `uploadpack.packObjectsHook` as new pushes come in. + +I.e. the server would anticipate that fresh clones will download a +known bundle, followed by catching up to the current state of the +repository using ref tips found in that bundle (or bundles). + +PROTOCOL for bundle-uri +^^^^^^^^^^^^^^^^^^^^^^^ + +A `bundle-uri` request takes no arguments, and as noted above does not +currently advertise a capability value. Both may be added in the +future. + +When the client issues a `command=bundle-uri` the response is a list +of URIs the server would like the client to fetch out-of-bounds before +proceeding with the `fetch` request in this format: + + output = bundle-uri-line + bundle-uri-line* flush-pkt + + bundle-uri-line = PKT-LINE(bundle-uri) + *(SP bundle-feature-key *(=bundle-feature-val)) + LF + + bundle-uri = A URI such as a https://, ssh:// etc. URI + + bundle-feature-key = Any printable ASCII characters except SP or "=" + bundle-feature-val = Any printable ASCII characters except SP or "=" + +No `bundle-feature-key`=`bundle-feature-value` fields are currently +defined. See the discussion of features below. + +Clients are still expected to fully parse the line according to the +above format, lines that do not conform to the format SHOULD be +discarded. The user MAY be warned in such a case. + +bundle-uri CLIENT AND SERVER EXPECTATIONS +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +".bundle" FORMAT +++++++++++++++++ + +The advertised bundle(s) MUST be in a format that "git bundle verify" +would accept. I.e. they MUST contain one or more reference tips for +use by the client, MUST indicate prerequisites (in any) with standard +"-" prefixes, and MUST indicate their "object-format", if +applicable. Create "*.bundle" files with "git bundle create". + +bundle-uri CLIENT ERROR RECOVERY +++++++++++++++++++++++++++++++++ + +A client MUST above all gracefully degrade on errors, whether that +error is because of bad missing/data in the bundle URI(s), because +that client is too dumb to e.g. understand and fully parse out bundle +headers and their prerequisite relationships, or something else. + +Server operators should feel confident in turning on "bundle-uri" and +not worry if e.g. their CDN goes down that clones or fetches will run +into hard failures. Even if the server bundle bundle(s) are +incomplete, or bad in some way the client should still end up with a +functioning repository, just as if it had chosen not to use this +protocol extension. + +All subsequent discussion on client and server interaction MUST keep +this in mind. + +bundle-uri SERVER TO CLIENT ++++++++++++++++++++++++++++ + +The ordering of the returned bundle uris is not significant. Clients +MUST parse their headers to discover their contained OIDS and +prerequisites. A client MUST consider the content of the bundle(s) +themselves and their header as the ultimate source of truth. + +A server MAY even return bundle(s) that don't have any direct +relationship to the repository being cloned (either through accident, +or intentional "clever" configuration), and expect a client to sort +out what data they'd like from the bundle(s), if any. + +bundle-uri CLIENT TO SERVER ++++++++++++++++++++++++++++ + +The client SHOULD provide reference tips found in the bundle header(s) +as 'have' lines in any subsequent `fetch` request. A client MAY also +ignore the bundle(s) entirely if doing so is deemed worse for some +reason, e.g. if the bundles can't be downloaded, it doesn't like the +tips it finds etc. + +WHEN ADVERTISED BUNDLE(S) REQUIRE NO FURTHER NEGOTIATION +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +If after issuing `bundle-uri` and `ls-refs`, and getting the header(s) +of the bundle(s) the client finds that the ref tips it wants can be +retrieved entirety from advertised bundle(s), it MAY disconnect. The +results of such a 'clone' or 'fetch' should be indistinguishable from +the state attained without using bundle-uri. + +EARLY CLIENT DISCONNECTIONS AND ERROR RECOVERY +++++++++++++++++++++++++++++++++++++++++++++++ + +A client MAY perform an early disconnect while still downloading the +bundle(s) (having streamed and parsed their headers). In such a case +the client MUST gracefully recover from any errors related to +finishing the download and validation of the bundle(s). + +I.e. a client might need to re-connect and issue a 'fetch' command, +and possibly fall back to not making use of 'bundle-uri' at all. + +This "MAY" behavior is specified as such (and not a "SHOULD") on the +assumption that a server advertising bundle uris is more likely than +not to be serving up a relatively large repository, and to be pointing +to URIs that have a good chance of being in working order. A client +MAY e.g. look at the payload size of the bundles as a heuristic to see +if an early disconnect is worth it, should falling back on a full +"fetch" dialog be necessary. + +WHEN ADVERTISED BUNDLE(S) REQUIRE FURTHER NEGOTIATION ++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +A client SHOULD commence a negotiation of a PACK from the server via +the "fetch" command using the OID tips found in advertised bundles, +even if's still in the process of downloading those bundle(s). + +This allows for aggressive early disconnects from any interactive +server dialog. The client blindly trusts that the advertised OID tips +are relevant, and issues them as 'have' lines, it then requests any +tips it would like (usually from the "ls-refs" advertisement) via +'want' lines. The server will then compute a (hopefully small) PACK +with the expected difference between the tips from the bundle(s) and +the data requested. + +The only connection the client then needs to keep active is to the +concurrently downloading static bundle(s), when those and the +incremental PACK are retrieved they should be inflated and +validated. Any errors at this point should be gracefully recovered +from, see above. + +bundle-uri PROTOCOL FEATURES +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +As noted above no `bundle-feature-key`=`bundle-feature-value` fields +are currently defined. + +They are intended for future per-URI metadata which older clients MUST +ignore and gracefully degrade on. Any fields they do recognize they +CAN also ignore. + +Any backwards-incompatible addition of pre-URI key-value will be +guarded by a new value or values in 'bundle-uri' capability +advertisement itself, and/or by new future `bundle-uri` request +arguments. + +While no per-URI key-value are currently supported currently they're +intended to support future features such as: + + * Add a "hash=<val>" or "size=<bytes>" advertise the expected hash or + size of the bundle file. + + * Advertise that one or more bundle files are the same (to e.g. have + clients round-robin or otherwise choose one of N possible files). + + * A "oid=<OID>" shortcut and "prerequisite=<OID>" shortcut. For + expressing the common case of a bundle with one tip and no + prerequisites, or one tip and one prerequisite. ++ +This would allow for optimizing the common case of servers who'd like +to provide one "big bundle" containing only their "main" branch, +and/or incremental updates thereof. ++ +A client receiving such a a response MAY assume that they can skip +retrieving the header from a bundle at the indicated URI, and thus +save themselves and the server(s) the request(s) needed to inspect the +headers of that bundle or bundles. diff --git a/Makefile b/Makefile index 381bed2c1d2..e41ac60829d 100644 --- a/Makefile +++ b/Makefile @@ -846,6 +846,7 @@ LIB_OBJS += blob.o LIB_OBJS += bloom.o LIB_OBJS += branch.o LIB_OBJS += bulk-checkin.o +LIB_OBJS += bundle-uri.o LIB_OBJS += bundle.o LIB_OBJS += cache-tree.o LIB_OBJS += cbtree.o diff --git a/bundle-uri.c b/bundle-uri.c new file mode 100644 index 00000000000..ff054ddc690 --- /dev/null +++ b/bundle-uri.c @@ -0,0 +1,55 @@ +#include "cache.h" +#include "bundle-uri.h" +#include "pkt-line.h" +#include "config.h" + +static void send_bundle_uris(struct packet_writer *writer, + struct string_list *uris) +{ + struct string_list_item *item; + + for_each_string_list_item(item, uris) + packet_writer_write(writer, "%s", item->string); +} + +static int advertise_bundle_uri = -1; +static struct string_list bundle_uris = STRING_LIST_INIT_DUP; +static int bundle_uri_config(const char *var, const char *value, void *data) +{ + if (!strcmp(var, "uploadpack.bundleuri")) { + advertise_bundle_uri = 1; + string_list_append(&bundle_uris, value); + } + + return 0; +} + +int bundle_uri_advertise(struct repository *r, struct strbuf *value) +{ + if (advertise_bundle_uri != -1) + goto cached; + + git_config(bundle_uri_config, NULL); + advertise_bundle_uri = !!bundle_uris.nr; + +cached: + return advertise_bundle_uri; +} + +int bundle_uri_command(struct repository *r, + struct packet_reader *request) +{ + struct packet_writer writer; + packet_writer_init(&writer, 1); + + while (packet_reader_read(request) == PACKET_READ_NORMAL) + die(_("bundle-uri: unexpected argument: '%s'"), request->line); + if (request->status != PACKET_READ_FLUSH) + die(_("bundle-uri: expected flush after arguments")); + + send_bundle_uris(&writer, &bundle_uris); + + packet_writer_flush(&writer); + + return 0; +} diff --git a/bundle-uri.h b/bundle-uri.h new file mode 100644 index 00000000000..b8762e6a8e4 --- /dev/null +++ b/bundle-uri.h @@ -0,0 +1,14 @@ +#ifndef BUNDLE_URI_H +#define BUNDLE_URI_H + +struct repository; +struct packet_reader; +struct packet_writer; + +/** + * API used by serve.[ch]. + */ +int bundle_uri_advertise(struct repository *r, struct strbuf *value); +int bundle_uri_command(struct repository *r, struct packet_reader *request); + +#endif /* BUNDLE_URI_H */ diff --git a/serve.c b/serve.c index b3fe9b5126a..f3e0203d2c6 100644 --- a/serve.c +++ b/serve.c @@ -8,6 +8,7 @@ #include "protocol-caps.h" #include "serve.h" #include "upload-pack.h" +#include "bundle-uri.h" static int advertise_sid = -1; static int client_hash_algo = GIT_HASH_SHA1; @@ -136,6 +137,11 @@ static struct protocol_capability capabilities[] = { .advertise = always_advertise, .command = cap_object_info, }, + { + .name = "bundle-uri", + .advertise = bundle_uri_advertise, + .command = bundle_uri_command, + }, }; void protocol_v2_advertise_capabilities(void) diff --git a/t/t5701-git-serve.sh b/t/t5701-git-serve.sh index 1896f671cb3..9d053f77a93 100755 --- a/t/t5701-git-serve.sh +++ b/t/t5701-git-serve.sh @@ -13,7 +13,7 @@ test_expect_success 'test capability advertisement' ' wrong_algo sha1:sha256 wrong_algo sha256:sha1 EOF - cat >expect <<-EOF && + cat >expect.base <<-EOF && version 2 agent=git/$(git version | cut -d" " -f3) ls-refs=unborn @@ -21,8 +21,11 @@ test_expect_success 'test capability advertisement' ' server-option object-format=$(test_oid algo) object-info + EOF + cat >expect.trailer <<-EOF && 0000 EOF + cat expect.base expect.trailer >expect && GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \ --advertise-capabilities >out && @@ -342,4 +345,123 @@ test_expect_success 'basics of object-info' ' test_cmp expect actual ' +# Test the basics of bundle-uri +# +test_expect_success 'test capability advertisement with uploadpack.bundleURI' ' + test_config uploadpack.bundleURI FAKE && + + cat >expect.extra <<-EOF && + bundle-uri + EOF + cat expect.base \ + expect.extra \ + expect.trailer >expect && + + GIT_TEST_SIDEBAND_ALL=0 test-tool serve-v2 \ + --advertise-capabilities >out && + test-tool pkt-line unpack <out >actual && + test_cmp expect actual +' + +test_expect_success 'basics of bundle-uri: dies if not enabled' ' + test-tool pkt-line pack >in <<-EOF && + command=bundle-uri + 0000 + EOF + + cat >err.expect <<-\EOF && + fatal: invalid command '"'"'bundle-uri'"'"' + EOF + + cat >expect <<-\EOF && + ERR serve: invalid command '"'"'bundle-uri'"'"' + EOF + + test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual && + test_cmp err.expect err.actual && + test_must_be_empty out +' + + +test_expect_success 'basics of bundle-uri: enabled with single URI' ' + test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl && + + test-tool pkt-line pack >in <<-EOF && + command=bundle-uri + object-format=$(test_oid algo) + 0000 + EOF + + cat >expect <<-EOF && + https://cdn.example.com/repo.bdl + 0000 + EOF + + test-tool serve-v2 --stateless-rpc <in >out && + test-tool pkt-line unpack <out >actual && + test_cmp expect actual +' + +test_expect_success 'basics of bundle-uri: enabled with single URI' ' + test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl && + + test-tool pkt-line pack >in <<-EOF && + command=bundle-uri + object-format=$(test_oid algo) + 0000 + EOF + + cat >expect <<-EOF && + https://cdn.example.com/repo.bdl + 0000 + EOF + + test-tool serve-v2 --stateless-rpc <in >out && + test-tool pkt-line unpack <out >actual && + test_cmp expect actual +' + +test_expect_success 'basics of bundle-uri: enabled with two URIs' ' + test_config uploadpack.bundleURI https://cdn.example.com/repo.bdl && + test_config uploadpack.bundleURI https://cdn.example.com/recent.bdl --add && + + test-tool pkt-line pack >in <<-EOF && + command=bundle-uri + object-format=$(test_oid algo) + 0000 + EOF + + cat >expect <<-EOF && + https://cdn.example.com/repo.bdl + https://cdn.example.com/recent.bdl + 0000 + EOF + + test-tool serve-v2 --stateless-rpc <in >out && + test-tool pkt-line unpack <out >actual && + test_cmp expect actual +' + +test_expect_success 'basics of bundle-uri: unknown future feature(s)' ' + test_config uploadpack.bundleURI https://cdn.example.com/fake.bdl && + + test-tool pkt-line pack >in <<-EOF && + command=bundle-uri + object-format=$(test_oid algo) + 0001 + some-feature + we-do-not + know=about + 0000 + EOF + + cat >err.expect <<-\EOF && + fatal: bundle-uri: unexpected argument: '"'"'some-feature'"'"' + EOF + + test_must_fail test-tool serve-v2 --stateless-rpc <in >out 2>err.actual && + test_cmp err.expect err.actual && + test_must_be_empty out +' + test_done
Add a server-side implementation of a new "bundle-uri" command to protocol v2. As discussed in the updated "protocol-v2.txt" this will allow conforming clients to optionally seed their initial clones or incremental fetches from URLs containing "*.bundle" files created with "git bundle create". The use-cases are similar to those of the existing "Packfile URIs", and the two feature can be combined within a single request, but "bundle-uri" has a few advantages over packfile-uris in some some common scenarios, discussed below. This change does not give us a working "bundle-uri" client. I have those patches as a follow-up, but let's first establish what the protocol for this should be like first. The client implementation will then implement this specification. With this change when the uploadpack.bundleURI config is set to a URI (or URIs, if set >1 times), advertise a "bundle-uri" command. Then when the client requests "bundle-uri" emit those URIs back at them. Differences between this and the existing packfile-uri facility: A. There is no "real" support for packfile-uri in git.git. The uploadpack.blobPackfileUri setting allows carving out a list of blobs (actually any OIDs), but as alluded to in bfc2a36ff2a (Doc: clarify contents of packfile sent as URI, 2021-01-20) the only "real" implementation is JGit based. B. The uploadpack.blobPackfileUri is a MUST where this is a "CAN". I.e. once a client says they support packfile-uri of given list of protocols the server will send them a PACK response assuming they've downloaded the URI they client was sent, if the client doesn't do that they don't have a valid repository. Pointing at a bundle and having the client send us "have" lines (or not, maybe they couldn't fetch it, or decided they didn't want to) is more flexible, and can gracefully recover e.g. if the CDN isn't reachable (maybe you do support "https", but the CDN provider is down, or blocked your whole country). C. The client, after executing "ls-refs" will disconnect if it has also grabbed the "bundle-uris" and knows the server won't send it anything it doesn't already have (or expect to have, if it's downloading the bundles concurrent to an early disconnect). This is in (small) contrast to packfile-uri where a client would enter a negotiation dialog, which may or may not result in a packfile-uri and/or an inline PACK. D. Because of "C" clients can, if the bundles are up-to-date, get an up-to-date repository with just "bundle-uri" and "ls-refs" commands, with no need to enter a dialog with "git upload-pack". That small dialog is unlikely to matter for performance purposes, this section is just noting differences between "bundle-uri" and "packfile-uri". As noted above the features are compatible, a client that supports "bundle-uri" and "packfile-uri" might download a bundle, and then proceed with a "fetch" dialog, that dialog might then result in "packfile-uri" response. In practice server operators are unlikely to want to mix the two, since the main benefit of either approach is the ability to offload large "clone" responses to CDNs. A server operator would have little reason not to go with one approach or the other. There was a suggestion of implementing a similar feature long ago[1] by Jeff King. The main difference between it and this approach is that we've since gained protocol v2, so we can add this as an optional path in the dialog between client and server. The 2011 implementation hooked into the transport mechanism to try to clone from a bundle directly. See also [2] and [3] for some later mentions of that approach. See also [4] for the series that implemented uploadpack.blobPackfileUri, and [5] for a series on top that did the .gitmodules check in that context. See [6] for the "ls-refs unborn" feature which modified code in similar areas of the request flow. 1. https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/ 2. https://lore.kernel.org/git/20190514092900.GA11679@sigill.intra.peff.net/ 3. https://lore.kernel.org/git/YFJWz5yIGng+a16k@coredump.intra.peff.net/ 4. https://lore.kernel.org/git/cover.1591821067.git.jonathantanmy@google.com/ Merged as 34e849b05a4 (Merge branch 'jt/cdn-offload', 2020-06-25) 5. https://lore.kernel.org/git/cover.1614021092.git.jonathantanmy@google.com/ Merged as 6ee353d42f3 (Merge branch 'jt/transfer-fsck-across-packs', 2021-03-01) 6. 69571dfe219 (Merge branch 'jt/clone-unborn-head', 2021-02-17) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/technical/protocol-v2.txt | 209 ++++++++++++++++++++++++ Makefile | 1 + bundle-uri.c | 55 +++++++ bundle-uri.h | 14 ++ serve.c | 6 + t/t5701-git-serve.sh | 124 +++++++++++++- 6 files changed, 408 insertions(+), 1 deletion(-) create mode 100644 bundle-uri.c create mode 100644 bundle-uri.h