
[RFC,00/13] Add bundle-uri: resumable clones, static "dumb" CDN etc.

Message ID RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com (mailing list archive)

Message

Ævar Arnfjörð Bjarmason Aug. 5, 2021, 3:07 p.m. UTC
We're in the 2.33.0 rc cycle, and I'd hoped to have some more prep
work for this integrated already, but for now here's something
interesting I've been working on for early commentary/feedback.

This adds the ability to protocol v2 for servers to optimistically
pre-seed supporting clients with one or more bundles via a new
"bundle-uri" protocol extension.

Right now only "clone" supports this, but it's a rather easy change on
top to add incremental "fetch" support as well.

The elevator pitch for this feature (I guess it's a long elevator
ride...):

 * Allows for offloading most/all of a PACK "fetch" to "dumb" CDNs
   that *don't* have very close coordination with the server running
   "git-upload-pack" (unlike packfile-uri, more on that below).

   I.e. distributing an up-to-date-enough bundle via something like
   Debian's FTP mirror system, or a best-effort occasionally updated
   CDN.

   Should the bundle(s) be outdated, corrupt or otherwise unusable,
   the client gracefully recovers by either ignoring the bad data or
   catching up via negotiation with whatever data it did get.

   Server operators should be confident in using bundle URIs, even if
   the CDN they're pointing to is flaky, out of date, or even
   sometimes outright broken or unreachable. The client will recover
   in all those cases.

 * Makes performant git infrastructure more accessible. I.e. this
   feature helps the least when you already have an up-to-date repack
   with up-to-date bitmaps on a network-local git server, but a lot of
   users have more options for scaling or distributing things via dumb
   CDNs than via a server that can run "git-upload-pack".

 * You can even bootstrap a clone of a remote server that doesn't
   support bundle-uri with a local or remote bundle with the
   "transfer.injectBundleURI" config.

 * Trivial path to resumable clones. Note that that's "resumable" in
   the sense that curl(1) will resume a partially downloaded bundle;
   we don't resume whatever state index-pack was in when the
   connection was broken.

   I have a POC of this working locally, it's just a matter of
   invoking curl(1) with "--continue-at -".

   The hindrance for resumable clones is now just the UI for git-clone
   (i.e. stashing the partial data somewhere), not protocol
   limitations. See the sketch just after this list for what the
   download side could look like.
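
As a rough illustration of that last point, here's what the resumable
download could look like done by hand. The URL and paths are
placeholders, and this is only a sketch of the idea, not what the
clone integration in this series does internally:

    # Hypothetical example: resume a partially downloaded bundle with
    # curl(1), then clone from the local bundle file. A later "git
    # fetch" against the real remote catches up whatever the bundle
    # lacked.
    curl --continue-at - -o /tmp/repo.bundle https://cdn.example.com/repo.bundle
    git clone /tmp/repo.bundle /tmp/repo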

This goes on top of the outstanding series I have for serve.c API
cleanup & fixes at
https://lore.kernel.org/git/cover-v4-00.10-00000000000-20210805T011823Z-avarab@gmail.com

I also needed to grab one patch from my "bundle unbundle progress"
series:
https://lore.kernel.org/git/cover-0.4-0000000000-20210727T004015Z-avarab@gmail.com/

Something like this approach had been suggested before in late 2011 by
Jeff King, see:
https://lore.kernel.org/git/20111110074330.GA27925@sigill.intra.peff.net/;
There are significant differences in the approach, mainly due to
protocol v2 not existing at the time. I wrote most of this before
finding/seeing Jeff's earlier patches.

For a demo of how this works head over to 12/13:
https://lore.kernel.org/git/RFC-patch-12.13-8dc5613e87-20210805T150534Z-avarab@gmail.com

In 13/13 there's a design doc discussing the approach, and major
differences with the existing packfile-uri mechanism:
https://lore.kernel.org/git/RFC-patch-13.13-1e657ed27a-20210805T150534Z-avarab@gmail.com

This can also be grabbed from the "avar/bundle-uri-client-clone"
branch of https://github.com/avar/git/

Ævar Arnfjörð Bjarmason (13):
  serve: add command to advertise bundle URIs
  bundle-uri client: add "bundle-uri" parsing + tests
  connect.c: refactor sending of agent & object-format
  bundle-uri client: add minimal NOOP client
  bundle-uri client: add "git ls-remote-bundle-uri"
  bundle-uri client: add transfer.injectBundleURI support
  bundle-uri client: add boolean transfer.bundleURI setting
  bundle.h: make "fd" version of read_bundle_header() public
  fetch-pack: add a deref_without_lazy_fetch_extended()
  fetch-pack: move --keep=* option filling to a function
  index-pack: add --progress-title option
  bundle-uri client: support for bundle-uri with "clone"
  bundle-uri docs: add design notes

 Documentation/config/transfer.txt          |  26 ++
 Documentation/git-index-pack.txt           |   6 +
 Documentation/git-ls-remote-bundle-uri.txt |  63 +++
 Documentation/git-ls-remote.txt            |   1 +
 Documentation/technical/bundle-uri.txt     | 119 ++++++
 Documentation/technical/protocol-v2.txt    | 145 +++++++
 Makefile                                   |   3 +
 builtin.h                                  |   1 +
 builtin/clone.c                            |   7 +
 builtin/index-pack.c                       |   6 +
 builtin/ls-remote-bundle-uri.c             |  90 +++++
 bundle-uri.c                               | 151 ++++++++
 bundle-uri.h                               |  30 ++
 bundle.c                                   |   8 +-
 bundle.h                                   |   2 +
 command-list.txt                           |   1 +
 connect.c                                  |  80 +++-
 fetch-pack.c                               | 304 ++++++++++++++-
 fetch-pack.h                               |   6 +
 git.c                                      |   1 +
 remote.h                                   |   4 +
 serve.c                                    |   6 +
 t/helper/test-bundle-uri.c                 |  80 ++++
 t/helper/test-tool.c                       |   1 +
 t/helper/test-tool.h                       |   1 +
 t/lib-t5730-protocol-v2-bundle-uri.sh      | 425 +++++++++++++++++++++
 t/t5701-git-serve.sh                       | 124 +++++-
 t/t5730-protocol-v2-bundle-uri-file.sh     |  36 ++
 t/t5731-protocol-v2-bundle-uri-git.sh      |  17 +
 t/t5732-protocol-v2-bundle-uri-http.sh     |  17 +
 t/t5750-bundle-uri-parse.sh                |  98 +++++
 transport-helper.c                         |  13 +
 transport-internal.h                       |   7 +
 transport.c                                | 120 ++++++
 transport.h                                |  22 ++
 35 files changed, 1988 insertions(+), 33 deletions(-)
 create mode 100644 Documentation/git-ls-remote-bundle-uri.txt
 create mode 100644 Documentation/technical/bundle-uri.txt
 create mode 100644 builtin/ls-remote-bundle-uri.c
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100644 t/helper/test-bundle-uri.c
 create mode 100644 t/lib-t5730-protocol-v2-bundle-uri.sh
 create mode 100755 t/t5730-protocol-v2-bundle-uri-file.sh
 create mode 100755 t/t5731-protocol-v2-bundle-uri-git.sh
 create mode 100755 t/t5732-protocol-v2-bundle-uri-http.sh
 create mode 100755 t/t5750-bundle-uri-parse.sh

Comments

Jonathan Nieder Aug. 6, 2021, 2:38 p.m. UTC | #1
Hi,

Ævar Arnfjörð Bjarmason wrote:

> We're in the 2.33.0 rc cycle, and I'd hoped to have some more prep
> work for this integrated already, but for now here's something
> interesting I've been working on for early commentary/feedback.
>
> This adds the ability to protocol v2 for servers to optimistically
> pre-seed supporting clients with one or more bundles via a new
> "bundle-uri" protocol extension.

My initial thought here is that even though this includes a comparison
to packfile URIs, I suspect you're underestimating them. :)

Would it be possible to do the same pre-seeding using the packfile
URIs protocol?  Nothing stops a server from sending more objects than
the client asked for.  Is the issue that you want the client to be
able to list "have"s based on that pack?  Can't the server obtain that
same information at the same time as it obtains the bundle URL?

The reason I ask is that this contains a number of differences
relative to packfile URIs, most noticeably the use of bundles instead
of packfiles as the format for the static content.  If we were
starting from scratch and chose this design _instead_ of packfile URIs
then that could make sense (though there are issues with the bundle
format that we can also go into), but in a world where people are also
using packfile URIs it makes for a kind of confusing UX.  Is a server
operator expected to put both kinds of files on CDN and double their
storage bill?  Is this meant as an alternative, a replacement, or
something that combines well together with the packfile URIs feature?
What does the intended end state look like?

Projects like chromium have been using packfile URIs in production for
about 11 months now and it's been working well.  Because of that, I'd
be interested in understanding its shortcomings and improving it in
place --- or in other words, I want _you_ to benefit from them instead
of having to create an alternative to them.  Alternatively, if the
packfile URIs protocol is fundamentally flawed, then I'd like us to
understand that early and act on it instead of creating a parallel
alternative and waiting for it to bitrot.

I'll try to find time to look more closely at the patches to
understand the use case in more detail, but it will take some time
since I'm currently focused on the -rc.

Thanks,
Jonathan
Ævar Arnfjörð Bjarmason Aug. 6, 2021, 4:26 p.m. UTC | #2
On Fri, Aug 06 2021, Jonathan Nieder wrote:

> Hi,
>
> Ævar Arnfjörð Bjarmason wrote:
>
>> We're in the 2.33.0 rc cycle, and I'd hoped to have some more prep
>> work for this integrated already, but for now here's something
>> interesting I've been working on for early commentary/feedback.
>>
>> This adds the ability to protocol v2 for servers to optimistically
>> pre-seed supporting clients with one or more bundles via a new
>> "bundle-uri" protocol extension.
>
> My initial thought here is that even though this includes a comparison
> to packfile URIs, I suspect you're underestimating them. :)
>
> Would it be possible to do the same pre-seeding using the packfile
> URIs protocol?  Nothing stops a server from sending more objects than
> the client asked for.  Is the issue that you want the client to be
> able to list "have"s based on that pack?  Can't the server obtain that
> same information at the same time as it obtains the bundle URL?

Hi. Thanks for taking a quick look.

I think that the changes to Documentation/technical/protocol-v2.txt in
01/13[1] and the Documentation/technical/bundle-uri.txt document added
in 13/13[2] should address these questions.

Or perhaps not, but they're currently my best effort to explain the
differences between the two and how they interact. So I think it's best
to point to those instead of coming up with something in this reply,
which'll inevitably be an incomplete rewrite of much of that.

In short, there are use-cases that packfile-uri is inherently unsuitable
for, or rather changing the packfile-uri feature to support them would
pretty much make it indistinguishable from this bundle-uri mechanism,
which I think would just add more confusion to the protocol.

> The reason I ask is that this contains a number of differences
> relative to packfile URIs, most noticeably the use of bundles instead
> of packfiles as the format for the static content.  If we were
> starting from scratch and chose this design _instead_ of packfile URIs
> then that could make sense (though there are issues with the bundle
> format that we can also go into), but in a world where people are also
> using packfile URIs it makes for a kind of confusing UX.  Is a server
> operator expected to put both kinds of files on CDN and double their
> storage bill?  Is this meant as an alternative, a replacement, or
> something that combines well together with the packfile URIs feature?
> What does the intended end state look like?

The two are complementary, i.e. it's meant as something that combines
well with packfile-uri. One (bundle-uri) is meant as a dumb
pre-seeding of objects from static URLs that happens pre-negotiation.

The other (packfile-uri) is a post-negotiation mechanism for giving a
client not only a PACK, but complementing it with a URI; together the
two form one logical complete PACK. (I know that you know this, but for
anyone reading along...).

> Projects like chromium have been using packfile URIs in production for
> about 11 months now and it's been working well.  Because of that, I'd
> be interested in understanding its shortcomings and improving it in
> place --- or in other words, I want _you_ to benefit from them instead
> of having to create an alternative to them.  Alternatively, if the
> packfile URIs protocol is fundamentally flawed, then I'd like us to
> understand that early and act on it instead of creating a parallel
> alternative and waiting for it to bitrot.

I don't think there are any real shortcomings of packfile-uri in the sense
that it makes sense to improve it in the direction of this bundle-uri
mechanism, they're simply doing different things at different phases in
the dialog.

When packfile-uri was initially being discussed I was interested in
getting something like this bundle-uri mechanism to be a part of it,
but it was clear from discussions with Jonathan Tan that he was taking
packfile-uri in a very different direction. Links to those discussions:

    https://lore.kernel.org/git/87k1hv6eel.fsf@evledraar.gmail.com/
    https://lore.kernel.org/git/87pnpbsra8.fsf@evledraar.gmail.com/
    https://lore.kernel.org/git/87zh35okzy.fsf@evledraar.gmail.com/

A demo of this feature playing nice with a server that supports
packfile-uri, although I haven't actually gotten it to also send me a
packfile-uri if I pre-seed with bundles (perhaps it's only doing it on
full clones?):

    git -c fetch.uriprotocols=https \
        -c transfer.injectBundleURI="https://vm.nix.is/~avar/noindex/codesearch-git-master.bdl" \
        -c transfer.injectBundleURI="https://vm.nix.is/~avar/noindex/codesearch-git-recent.bdl" \
        clone --bare https://chromium.googlesource.com/chromiumos/codesearch.git /tmp/codesearch.git

Will emit:
    
    Cloning into bare repository '/tmp/codesearch.git'...
    Receiving bundle (1/2): 100% (322661/322661), 89.66 MiB | 0 bytes/s, done.
    Resolving deltas: 100% (142478/142478), done.
    Receiving bundle (2/2): 100% (69378/69378), 5.51 MiB | 0 bytes/s, done.
    Resolving deltas: 100% (52558/52558)completed with 4 local objects
    remote: Total 1958 (delta 4), reused 1958 (delta 4)
    Receiving objects: 100% (1958/1958), 1.60 MiB | 0 bytes/s, done.
    Resolving deltas: 100% (4/4)completed with 3 local objects
    Checking connectivity: 393997, done.

I.e. I produced a combination of bundles going up to a commit at the
start of this month, so that last "Receiving objects" is the only
dynamic fetching of objects via the negotiation.

1. https://lore.kernel.org/git/RFC-patch-01.13-4e1a0dbef5-20210805T150534Z-avarab@gmail.com/
2. https://lore.kernel.org/git/RFC-patch-13.13-1e657ed27a-20210805T150534Z-avarab@gmail.com/
Jonathan Nieder Aug. 6, 2021, 8:40 p.m. UTC | #3
Hi,

Ævar Arnfjörð Bjarmason wrote:

> Or perhaps not, but they're currently my best effort to explain the
> differences between the two and how they interact. So I think it's best
> to point to those instead of coming up with something in this reply,
> which'll inevitably be an incomplete rewrite of much of that.
>
> In short, there are use-cases that packfile-uri is inherently unsuitable
> for, or rather changing the packfile-uri feature to support them would
> pretty much make it indistinguishable from this bundle-uri mechanism,
> which I think would just add more confusion to the protocol.

Hm.  I was hoping you might say more about those use cases --- e.g. is
there a concrete installation that wants to take advantage of this?
By focusing on the real-world example, we'd get a better shared
understanding of the underlying constraints.

After all, both are ways to reduce the bandwidth of a clone or other
large fetch operation by offloading the bulk of content to static
serving.

Thanks,
Jonathan
Ævar Arnfjörð Bjarmason Aug. 7, 2021, 2:19 a.m. UTC | #4
On Fri, Aug 06 2021, Jonathan Nieder wrote:

> Ævar Arnfjörð Bjarmason wrote:
>
>> Or perhaps not, but they're currently my best effort to explain the
>> differences between the two and how they interact. So I think it's best
>> to point to those instead of coming up with something in this reply,
>> which'll inevitably be an incomplete rewrite of much of that.
>>
>> In short, there are use-cases that packfile-uri is inherently unsuitable
>> for, or rather changing the packfile-uri feature to support them would
>> pretty much make it indistinguishable from this bundle-uri mechanism,
>> which I think would just add more confusion to the protocol.
>
> Hm.  I was hoping you might say more about those use cases --- e.g. is
> there a concrete installation that wants to take advantage of this?
> By focusing on the real-world example, we'd get a better shared
> understanding of the underlying constraints.

I hacked this up for potential use on GitLab's infrastructure, mainly as
a mechanism to relieve CPU pressure on CI thundering herds.

Often you need full clones, and you sometimes need to do those from
scratch. When you've just had a push come in it's handy to convert those
to incremental fetches on top of a bundle you made recently.

It's not deployed on anything currently, it's just something I've been
hacking up. I'll be on vacation much of the rest of this month, the plan
is to start stressing it on real-world use-cases after that. I thought
I'd send this RFC first.

> After all, both are ways to reduce the bandwidth of a clone or other
> large fetch operation by offloading the bulk of content to static
> serving.

The support we have for packfile-uri in git.git now is, as far as the
server side goes, I think it's fair to say, fairly immature. I gather
that JGit's version is more advanced, and that's what's serving up
things at Google at e.g. https://chromium.googlesource.com.

I.e. right now for git-upload-pack you need to exhaustively enumerate
all the objects to exclude, although there are some recent on-list
patches for being able to supply tips.

More importantly your CDN reliability MUST match that of your git
server, otherwise your clone fails (as the server has already sent the
"other side" of the expected CDN pack).

Furthermore you MUST as the server be able to tell the client what pack
checksum on the CDN they should expect, which requires a very tight
coupling between git server and CDN.

You can't e.g. say "bootstrap with this bundle/pack" and point to
something like Debian's async-updated FTP network as a source. The
bootstrap data may or may not be there, and it may or may not be as
up-to-date as you'd like.

I think any proposed integration into git.git should mainly consider the
bulk of users, the big hosting providers can always run with their own
patches.

I think this approach opens up the potential for easier and therefore
wider CDN integration for git servers for providers that aren't one of
the big fish. E.g. your CDN generation can be a daily cronjob, and the
server can point to it blindly and hope for the best. The client will
optimistically use the CDN data, and recover if not.
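
For example, a sketch only: the paths, URL and upload mechanism below
are placeholders, and the server-side configuration for advertising
the resulting URI isn't shown here:

    # Hypothetical daily cron job: regenerate a single all-refs bundle
    # and copy it to a dumb CDN / static file host.
    cd /srv/git/repo.git &&
    git bundle create /tmp/repo.bundle --all &&
    rsync -a /tmp/repo.bundle static-host:/var/www/bundles/repo.bundle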

I think one thing I oversold is the "path to resumable clones",
i.e. that's all true, but I don't think that's really any harder to do
with packfile-uri in theory (in practice just serving up a sensible pack
with it is pretty tedious with git-upload-pack as it stands).

The negotiation aspect of it is also interesting and something I've been
experimenting with. I.e. the bundles are what the client sends as its
HAVE tips. This allows a server to anticipate what dialog newly cloning
clients are likely to have, and even pre-populate a cache of that
locally (it could even serve that diff up as a packfile-uri :).
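
To make that concrete, here's a rough sketch of the protocol v2
"fetch" request a pre-seeded client might end up sending, with the
bundle tips supplied as "have" lines. The object IDs are placeholders
and pkt-line framing is omitted:

    command=fetch
    ...
    want <oid of the ref we're cloning>
    have <oid of bundle tip #1>
    have <oid of bundle tip #2>
    done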

Right now it retrieves each bundle in full before adding the tips to
negotiate to a subsequent dialog, but I've successfully experimented
locally with negotiating on the basis of objects we don't even have
yet. I.e. download the bundle(s), and as soon as we have the header fire
off the dialog with the server to get its PACK on the basis of those
promised tips.

It makes the recovery a bit more involved in case the bundles don't have
what we're after, but this allows us to disconnect really fast from the
server and twiddle our own thumbs while we finish downloading the
bundles to get the full clone. We already disconnect early in cases
where the bundle(s) already have what we're after.

This e-mail got rather long, but hey, I did point you to parts of this
series that cover some/most of this, and since it wasn't clear what if
anything of that you'd read ... :) Hopefully this summary is useful.
Derrick Stolee Aug. 10, 2021, 1:55 p.m. UTC | #5
On 8/5/2021 11:07 AM, Ævar Arnfjörð Bjarmason wrote:
> We're in the 2.33.0 rc cycle, and I'd hoped to have some more prep
> work for this integrated already, but for now here's something
> interesting I've been working on for early commentary/feedback.

I've taken a brief look at the code, but mostly have thoughts about the
core idea based on reading the documentation. I'll keep my feedback here
restricted to that, especially because I saw some thoughts from Jonathan
that question the idea.

> This adds the ability to protocol v2 for servers to optimistically
> pre-seed supporting clients with one or more bundles via a new
> "bundle-uri" protocol extension.

To make sure I understand things correctly, I'm going to rewrite the
description of the feature in my own words. Please correct me if I have
misread something.

This presents a new capability for protocol v2 that allows a server to
notify a client that they can download a bundle to bootstrap their object
database, and then come back to the origin server to "catch up" the
remaining new objects since that bundle via a fetch negotiation.

This idea is similar to the packfile-uri feature in that it offloads some
server load to a dumb data store (like a CDN) but differs in a few key
ways:

1. The bundle download does not need to be controlled by the server, and
   the server doesn't even need to know its contents. The client will
   discover the content and share its view in a later negotiation.

2. The packfile-uri could advertise a pack that contains only the large
   reachable blobs, and hence could use the "happy path" for bitmaps to
   compute a packfile containing all reachable objects except these large
   blobs. For the bundle-uri feature to mimic this, the blobs would need
   to be reachable from refs (perhaps shallow tags?) but it would not be
   clear that the negotiation would prevent redownloading those files.
   This is an area to explore and expand upon.

3. The bundle-uri feature is focused on "clone" scenarios, and I don't see
   how it could be used to help "fetch" scenarios. To be fair, I also have
   been vocal about how the packfile-uri feature is very hard to make
   helpful for the "fetch" scenario. The benefits to "clone" seem to be
   worth the effort alone. I think the bundle-api doubles-down on that
   being the core functionality that matters, and that is fine by me. It
   sacrifices the flexibility of the packfile-uri with a lower maintenance
   burden for servers that want to implement it.

The biggest benefit that I see is that the Git server does not need to
worry about whether the bundle has an exact set of data that it expects.
There is no timing issue about whether or not the exact packfile is seeded.
Instead, the server could have a fixed URL for its bundle and update its
contents (atomically) at any schedule without worrying about concurrent
clones being interrupted. From an operational perspective, I find the
bundle-uri a better way to offload "clone" load.

This also depends on that following "git fetch" being easy to serve. In
that sense, it can be beneficial to be somewhat aware of the bundles that
are being served: can we store the bundled refs as reachability bitmaps so
we have those available for the negotiation in the following "git fetch"
operations? This choice seems specific to how the server is deciding to
create these bundles.

It also presents interesting ideas for how to create the bundle to focus
on some core portion of the repo. The "thundering herd" of CI machines
that re-clone repos at a high rate will also be likely to use the
"--single-branch" option to reduce the refs that they will ask for in the
negotiation. In that sense, we won't want a snapshot of all refs at a
given time and instead might prefer a snapshot of the default branch or
some set of highly-active branches.

One question I saw Jonathan ask was "can we modify the packfile-uri
capability to be better?" I think this feature has enough different goals
that they could co-exist. That's the point of protocol v2, right? Servers
can implement and advertise the subset of functionality that they think is
best for their needs.

I hope my ramblings provide some amount of clarity to the discussion, but
also I intend to show support of the idea. If I was given the choice of
which feature to support (I mostly work on the client experience, so take
my choice with a grain of salt), then I would focus on implementing the
bundle-uri capability _before_ the packfile-uri capability. And that's the
best part: more options present more flexibility for different hosts to
make different decisions.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Aug. 23, 2021, 1:28 p.m. UTC | #6
On Tue, Aug 10 2021, Derrick Stolee wrote:

> On 8/5/2021 11:07 AM, Ævar Arnfjörð Bjarmason wrote:
>> We're in the 2.33.0 rc cycle, and I'd hoped to have some more prep
>> work for this integrated already, but for now here's something
>> interesting I've been working on for early commentary/feedback.
>
> I've taking a brief look at the code, but mostly have thoughts about the
> core idea based on reading the documentation. I'll keep my feedback here
> restricted to that, especially because I saw some thoughts from Jonathan
> that question the idea.

Thanks a lot for taking a look, and sorry about the very late reply. I
was away from the computer for a while (on vacation).

>> This adds the ability to protocol v2 for servers to optimistically
>> pre-seed supporting clients with one or more bundles via a new
>> "bundle-uri" protocol extension.
>
> To make sure I understand things correctly, I'm going to rewrite the
> description of the feature in my own words. Please correct me if I have
> misread something.

Thanks. I'll probably try to steal some of this summary...

> This presents a new capability for protocol v2 that allows a server to
> notify a client that they can download a bundle to bootstrap their object
> database, and then come back to the origin server to "catch up" the
> remaining new objects since that bundle via a fetch negotiation.
>
> This idea is similar to the packfile-uri feature in that it offloads some
> server load to a dumb data store (like a CDN) but differs in a few key
> ways:
>
> 1. The bundle download does not need to be controlled by the server, and
>    the server doesn't even need to know its contents. The client will
>    discover the content and share its view in a later negotiation.

Correct.

> 2. The packfile-uri could advertise a pack that contains only the large
>    reachable blobs, and hence could use the "happy path" for bitmaps to
>    compute a packfile containing all reachable objects except these large
>    blobs. For the bundle-uri feature to mimic this, the blobs would need
>    to be reachable from refs (perhaps shallow tags?) but it would not be
>    clear that the negotiation would prevent redownloading those files.
>    This is an area to explore and expand upon.

Correct that this isn't handled currently.

The few paragraphs below are mostly a (hopefully helpful) digression.

Despite what I said in
https://lore.kernel.org/git/87h7g2zd8f.fsf@evledraar.gmail.com/ about
packfile-uri and bundle-uri playing nicely together there are cases
where they won't. They won't error out or anything, just that combining
them in those cases won't make much sense.

In particular if you now use packfile-uri to exclude one giant blob from
your early history, you can't use bundle-uri to serve up that early
history without also including that blob (as we'll expect the bundles to
be connected).

That property isn't inherent though; in the same way that we're clever
about .gitmodules checking, we could make note of any missing content
and do a full connectivity check at the end.

But if you're using packfile-uri for that "one big blob on a CDN" I
don't see why you wouldn't be happy to serve it up in your bundle, so I
don't see much of a practical reason for further work in that area.

And yes, you do run into limitations in the general negotiation
mechanism eventually (which to be fair, we probably wouldn't care about
before bundle-uri). E.g. we can't say we HAVE a blob OID currently
(server expects commit OID).

I don't see why it *couldn't* be supported though, and would be handled
under the hood with the same happy path for bitmaps that we use for
packfile-uri exclusions now...

> 3. The bundle-uri feature is focused on "clone" scenarios, and I don't see
>    how it could be used to help "fetch" scenarios. To be fair, I also have
>    been vocal about how the packfile-uri feature is very hard to make
>    helpful for the "fetch" scenario. The benefits to "clone" seem to be
>    worth the effort alone. I think the bundle-api doubles-down on that
>    being the core functionality that matters, and that is fine by me. It
>    sacrifices the flexibility of the packfile-uri with a lower maintenance
>    burden for servers that want to implement it.

Not correct, I'm going to have it handle incremental fetches.

What *is* correct is that the initial RFC patches here entirely punt on
it (I only hook them into clone.c, no fetch.c).

I focused on the "clone" case to get to a MVP earlier, and as you note I
think in practice most users of this will want to only use it for clone
bootstrapping, and won't care so much about the incremental case.

But for the "fetch" case we'd simply start fetching the N bundles in
parallel, and read their headers, then abort fetching any whose header
declares tips that we already have.

So e.g. for a repo with a "big" bundle that's 1 week old, and 7
incremental bundles for the last 7 days of updates we'd quickly find
that we only need the last 3 if we updated 4 days ago, based on only
fetching their headers.
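
As a sketch of how such a layout might be produced (the file names and
the "tip-day-N" refs are made up for illustration, e.g. bookkeeping
tags recorded by whatever job generates the bundles):

    # One "big" bundle of the main branch's full history, plus daily
    # incremental bundles whose prerequisite is the previous day's tip.
    git bundle create big.bundle master
    git bundle create incr-1.bundle tip-day-1..tip-day-2
    git bundle create incr-2.bundle tip-day-2..tip-day-3
    # ...and so on, one per day.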

The client spec is very liberal on "MAY" instead of "MUST" for
supporting clients. E.g. a client might sensibly conclude that their
most recent commits are recent enough, and that it would probably be a
waste of time to fetch the headers of the pointed-to bundles, and just
do a normal fetch instead.

The response format is also currently:

    <URI>[ <optional key-value>*]

And nothing uses the "optional key-value". One of the main things I had
in mind was to support something like (none of these key-values are
specified currently):

    https://example.com/big.bundle tip=deadbeef1
    https://example.com/incremental-1.bundle prereq=deadbeef1 tip=deadbeef2
    https://example.com/incremental-2.bundle prereq=deadbeef2 tip=deadbeef3

Or:

    https://example.com/big.bundle type=complete
    https://example.com/incremental-1.bundle type=incremental
    https://example.com/incremental-2.bundle type=incremental

I.e. allow the server, up-front, to optionally specify some details
about the pointed-to bundles, purely so the client can avoid the work
and roundtrip of trying to fetch the headers of N pointed-to bundles
just to see which if any it needs for a clone/fetch.

This peeking behind the curtain would always yield to the source of
truth of inspecting the bundle header/content itself, so it would be
purely an optional client hint. It's also not as flexible as a full
bundle header (e.g. it can't supply N tips or N prereqs), but for the
common case of "one big history" it should work nicely.

But maybe this use for key-value headers would be a premature
optimization, I haven't actually benchmarked it...

> The biggest benefit that I see is that the Git server does not need to
> worry about whether the bundle has an exact set of data that it expects.
> There is no timing issue about whether or not the exact packfile is seeded.
> Instead, the server could have a fixed URL for its bundle and update its
> contents (atomically) at any schedule without worrying about concurrent
> clones being interrupted. From an operational perspective, I find the
> bundle-uri a better way to offload "clone" load.

Thanks, yeah. That's pretty much the main reason I hacked this up when
we already have packfile-uri, being able to e.g. have a dumb cronjob
(which may not even run) optimistically handle your CDN offloading.

> This also depends on that following "git fetch" being easy to serve. In
> that sense, it can be beneficial to be somewhat aware of the bundles that
> are being served: can we store the bundled refs as reachability bitmaps so
> we have those available for the negotiation in the following "git fetch"
> operations? This choice seems specific to how the server is deciding to
> create these bundles.

Yeah, one of the first things I'm experimenting with on the server side
(but haven't yet) is to generate bundles on the one hand, and on the
other, being aware of the last generated bundle, do the equivalent of
pre-populating the pack-objects hook cache in anticipation of
pre-seeded clients coming in.

> It also presents interesting ideas for how to create the bundle to focus
> on some core portion of the repo. The "thundering herd" of CI machines
> that re-clone repos at a high rate will also be likely to use the
> "--single-branch" option to reduce the refs that they will ask for in the
> negotiation. In that sense, we won't want a snapshot of all refs at a
> given time and instead might prefer a snapshot of the default branch or
> some set of highly-active branches.

Yes, the current RFC MVP I have is rather dumb about this, but it's an
easy follow-up change along with the "fetch" support to e.g. have a
server have one "big" bundle of just its main branch in anticipation of
--single-branch --no-tags fetches, and then the next bundle is the delta
between that and the other branches.

So the first is master, the second is master..next, master..seen + tags
etc. The client knows from ls-refs what tips it wants, so it can then
ignore or retrieve the bundles as appropriate.

A server operator can then play around with what they think might be the
optimal arrangement and split-up of bundles.

They obviously can't handle all cases, e.g. if you have a bundle with
"master" and clone just git.git's "todo" branch, with these current
patches you'll get redundant objects you don't need, but a slightly
smarter client would ignore that bundle-uri.

> One question I saw Jonathan ask was "can we modify the packfile-uri
> capability to be better?" I think this feature has enough different goals
> that they could co-exist. That's the point of protocol v2, right? Servers
> can implement and advertise the subset of functionality that they think is
> best for their needs.

*Nod*, I also think as shown with these patches the overall complexity
after setting up the scaffolding for a new feature isn't very high, and
much of it is going away in some generally useful prep patches I'm
trying to get in...

> I hope my ramblings provide some amount of clarity to the discussion, but
> also I intend to show support of the idea. If I was given the choice of
> which feature to support (I mostly work on the client experience, so take
> my choice with a grain of salt), then I would focus on implementing the
> bundle-uri capability _before_ the packfile-uri capability. And that's the
> best part: more options present more flexibility for different hosts to
> make different decisions.

Thanks again. I'll keep you CC'd on future submissions if that's OK.

One thing I could really use to move this forward is code review of some
of the outstanding series I've got, in particular these two are
immediately applicable to this one. It's based on the former, and I
needed to steal one patch from the latter:

https://lore.kernel.org/git/cover-v4-00.10-00000000000-20210805T011823Z-avarab@gmail.com/
https://lore.kernel.org/git/cover-v2-0.4-00000000000-20210823T110136Z-avarab@gmail.com/
Derrick Stolee Aug. 24, 2021, 2:03 a.m. UTC | #7
On 8/23/2021 9:28 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Aug 10 2021, Derrick Stolee wrote:
> 
>> On 8/5/2021 11:07 AM, Ævar Arnfjörð Bjarmason wrote:
>>> We're in the 2.33.0 rc cycle, and I'd hoped to have some more prep

(I'll skip the parts where you agree with me.)

>> 3. The bundle-uri feature is focused on "clone" scenarios, and I don't see
>>    how it could be used to help "fetch" scenarios. To be fair, I also have
>>    been vocal about how the packfile-uri feature is very hard to make
>>    helpful for the "fetch" scenario. The benefits to "clone" seem to be
>>    worth the effort alone. I think the bundle-api doubles-down on that
>>    being the core functionality that matters, and that is fine by me. It
>>    sacrifices the flexibility of the packfile-uri with a lower maintenance
>>    burden for servers that want to implement it.
> 
> Not correct, I'm going to have it handle incremental fetches.
> 
> What *is* correct is that the initial RFC patches here entirely punt on
> it (I only hook them into clone.c, no fetch.c).
> 
> I focused on the "clone" case to get to a MVP earlier, and as you note I
> think in practice most users of this will want to only use it for clone
> bootstrapping, and won't care so much about the incremental case.
> 
> But for the "fetch" case we'd simply start fetching the N bundles in
> parallel, and read their headers, then abort fetching any whose header
> declares tips that we already have.

To save network time, we would need to read data as it is streaming
from the network and stop the download when we realize we don't need
the bundle? This seems wasteful, but I'm willing to see it work in
reality before judging it too harshly.

It would be better if the origin Git server could see the tips that
the user has and then only recommend bundles that serve new data.
That requires a closer awareness of which bundles contain which
content.

> So e.g. for a repo with a "big" bundle that's 1 week old, and 7
> incremental bundles for the last 7 days of updates we'd quickly find
> that we only need the last 3 if we updated 4 days ago, based on only
> fetching their headers.

I can attest that using time-based packs can be a good way to catch
up based on pre-computed pack-files instead of generating one on
demand. The GVFS protocol has a "prefetch" endpoint that gives all
pre-computed pack-files since a given timestamp. (One big difference
is that those pack-files contain only commits and trees, not blobs.
We should ensure that your proposal can extend to bundles that serve
filtered repos such as --filter=blob:none.)

The thing that works especially well with the GVFS protocol is that
if we are missing a reachable object from one of those pack-files,
then it is downloaded as a one-off request _when needed_. This allows
the pack-files to be computed by independent cache servers that have
their own clocks and compute their own timestamps and generate these
pack-files as "new objects I fetched since the last timestamp".

This is all to say that your timestamp-based approach will work as
long as you are very careful in how the incremental bundles are
constructed. There can really only be one list of bundles at a time,
and somehow we need those servers to atomically swap their list of
bundles when they are reorganized.

> The client spec is very liberal on "MAY" instead of "MUST" for
> supporting clients. E.g. a client might sensibly conclude that their
> most recent commits are recent enough, and that it would probably be a
> waste of time to fetch the headers of the pointed-to bundles, and just
> do a normal fetch instead.

Sensible. I wouldn't expect anything less.

> The response format is also currently:
> 
>     <URI>[ <optional key-value>*]
> 
> And nothing uses the "optional key-value". One of the main things I had
> in mind was to support something like (none of these key-values are
> specified currently):
> 
>     https://example.com/big.bundle tip=deadbeef1
>     https://example.com/incremental-1.bundle prereq=deadbeef1 tip=deadbeef2
>     https://example.com/incremental-2.bundle prereq=deadbeef2 tip=deadbeef3
> 
> Or:
> 
>     https://example.com/big.bundle type=complete
>     https://example.com/incremental-1.bundle type=incremental
>     https://example.com/incremental-2.bundle type=incremental
> 
> I.e. allow the server to up-front to optionally specify some details
> about the pointed-to-bundles, purely so the client can avoid the work
> and roundtrip of trying to fetch the headers of N pointed-to bundles
> just to see which if any it needs for a clone/fetch.

Ok, this seems like a reasonable way forward. If the bundles are
truly focused on data within a core "trunk" (say, the reflog of
the default branch) then this works. It gets more complicated for
repos with many long-lived branches that merge together over the
span of weeks. I'm talking about extreme outliers like the Windows
OS repository, so take this concern with a grain of salt. The
point is that in that world, a single tip isn't enough sometimes.

...
>> One question I saw Jonathan ask was "can we modify the packfile-uri
>> capability to be better?" I think this feature has enough different goals
>> that they could co-exist. That's the point of protocol v2, right? Servers
>> can implement and advertise the subset of functionality that they think is
>> best for their needs.
> 
> *Nod*, I also think as shown with these patches the overall complexity
> after setting up the scaffolding for a new feature isn't very high, and
> much of it is going away in some generally useful prep patches I'm
> trying to get in...
> 
>> I hope my ramblings provide some amount of clarity to the discussion, but
>> also I intend to show support of the idea. If I was given the choice of
>> which feature to support (I mostly work on the client experience, so take
>> my choice with a grain of salt), then I would focus on implementing the
>> bundle-uri capability _before_ the packfile-uri capability. And that's the
>> best part: more options present more flexibility for different hosts to
>> make different decisions.
> 
> Thanks again. I'll keep you CC'd on future submissions if that's OK.

Sounds good. I'll try to prioritize review.

> One thing I could really use to move this forward is code review of some
> of the outstanding serieses I've got, in particular these two are
> immediately applicable to this one. It's based on the former, and I
> needed to steal one patch from the latter:
> 
> https://lore.kernel.org/git/cover-v4-00.10-00000000000-20210805T011823Z-avarab@gmail.com/
> https://lore.kernel.org/git/cover-v2-0.4-00000000000-20210823T110136Z-avarab@gmail.com/

I'll see what I can contribute here.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason Aug. 24, 2021, 10 p.m. UTC | #8
On Mon, Aug 23 2021, Derrick Stolee wrote:

> On 8/23/2021 9:28 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Aug 10 2021, Derrick Stolee wrote:
>> 
>>> On 8/5/2021 11:07 AM, Ævar Arnfjörð Bjarmason wrote:
>>>> We're in the 2.33.0 rc cycle, and I'd hoped to have some more prep
>
> (I'll skip the parts where you agree with me.)
>
>>> 3. The bundle-uri feature is focused on "clone" scenarios, and I don't see
>>>    how it could be used to help "fetch" scenarios. To be fair, I also have
>>>    been vocal about how the packfile-uri feature is very hard to make
>>>    helpful for the "fetch" scenario. The benefits to "clone" seem to be
>>>    worth the effort alone. I think the bundle-api doubles-down on that
>>>    being the core functionality that matters, and that is fine by me. It
>>>    sacrifices the flexibility of the packfile-uri with a lower maintenance
>>>    burden for servers that want to implement it.
>> 
>> Not correct, I'm going to have it handle incremental fetches.
>> 
>> What *is* correct is that the initial RFC patches here entirely punt on
>> it (I only hook them into clone.c, no fetch.c).
>> 
>> I focused on the "clone" case to get to a MVP earlier, and as you note I
>> think in practice most users of this will want to only use it for clone
>> bootstrapping, and won't care so much about the incremental case.
>> 
>> But for the "fetch" case we'd simply start fetching the N bundles in
>> parallel, and read their headers, then abort fetching any whose header
>> declares tips that we already have.
>
> To save network time, we would need to read data as it is streaming
> from the network and stop the download when we realize we don't need
> the bundle? This seems wasteful, but I'm willing to see it work in
> reality before judging it too harshly.

The common case would be making a Range request to get (what's most
likely to contain) the header, so not deciding on the fly while we're
redundantly getting data we may not use from the server.
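
For illustration, something along these lines. The URL is a
placeholder, whether a given byte range is enough to cover the header
depends on the bundle, and reading the header out of a truncated file
like this is an assumption on my part, not something the series
specifies:

    # Hypothetical: fetch just the first few KB of a bundle, which for
    # most bundles covers the whole header, and list the tips it
    # advertises without downloading the pack data.
    curl -s -r 0-4095 -o partial.bundle https://cdn.example.com/incr-1.bundle
    git bundle list-heads partial.bundle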

But yes, I think for that incremental case it's likely to be better to
skip it entirely for small-ish fetches, both for client & server.

An out by default here is that a server operator would be unlikely to
configure this if they don't suspect that it's helping their use-case.

I also wonder if I shouldn't make it explicit from day 1 in the spec
that the first listed bundle is always one without prereqs, so we
could short-circuit this on "fetch" for what's likely to be the common
case of a server operator who's only interested in this for "clone",
i.e. without any incremental bundles.

It would mostly close the door on (or rather, result in redundant
requests for) someone who wants incremental-only bundles, but I can't imagine
anyone using this at all would want that, but maybe...

> It would be better if the origin Git server could see the tips that
> the user has and then only recommend bundles that serve new data.
> That requires a closer awareness of which bundles contain which
> content.

Once we're imagining this happening after the user sends over the tips
they have we're really starting to talk about something that's pretty
much indistinguishable from packfile-uri.

I.e. that happen after ls-refs & fetch negotiation, whereas bundle-uri
happens after ls-refs, but pre-negotiation.

"Pretty much" because we can imagine that instead of the server sending
the packfile-uri they'd send some hybrid of bundle-uri and packfile-uri
where instead of sending a PACK they say "oh if you want these tips
here's a bundle that has most/all of it, fetch that and get back to me".

Except the "get back to me" probably can't happen, the client would
either need to keep the server waiting as they fetch that
hybrid-bundle-uri, or disconnect, fetch it, and then re-do the
negotiation.

I can see how it might be useful if you're generating incremental
bundles anyway, but I think mostly I'd just say "well, use packfile-uri
then", but maybe I'm not being imaginative enough ... :)

>> So e.g. for a repo with a "big" bundle that's 1 week old, and 7
>> incremental bundles for the last 7 days of updates we'd quickly find
>> that we only need the last 3 if we updated 4 days ago, based on only
>> fetching their headers.
>
> I can attest that using time-based packs can be a good way to catch
> up based on pre-computed pack-files instead of generating one on
> demand. The GVFS protocol has a "prefetch" endpoint that gives all
> pre-computed pack-files since a given timestamp. (One big difference
> is that those pack-files contain only commits and trees, not blobs.
> We should ensure that your proposal can extend to bundles that serve
> filtered repos such as --filter=blob:none.)

I may be wrong, as I'm not at all familiar with the guts of these
filtering mechanisms, but would that even be possible? Doesn't that now involve
the server sending over a filtered PACK payload without the blobs per
client request?

Whereas a bundle per my understanding must always be complete as far as
a commit range goes, i.e. it can have prereqs, but those prereqs are
commits, not arbitrary OIDs. But maybe arbitrary OIDs (i.e. the OIDs of
the missing blob/trees etc.) would work, I haven't tried to manually
craft such a thing...

> The thing that works especially well with the GVFS protocol is that
> if we are missing a reachable object from one of those pack-files,
> then it is downloaded as a one-off request _when needed_. This allows
> the pack-files to be computed by independent cache servers that have
> their own clocks and compute their own timestamps and generate these
> pack-files as "new objects I fetched since the last timestamp".

I *think* that's covered by the above, but not sure, i.e. we're still
discussing these filtered PACKs I think...

> This is all to say that your timestamp-based approach will work as
> long as you are very careful in how the incremental bundles are
> constructed. There can really only be one list of bundles at a time,
> and somehow we need those servers to atomically swap their list of
> bundles when they are reorganized.

The current implementation is theoretically racy in that we could have
bundle URIs on the server defined in different parts of config (e.g. via
includes), and have an A..F "big" bundle, then F..G, G..H etc., but in
the middle of a swap-out a client might see A..E, F..G.

I think in practice the general way we read/write config in one location
should alleviate that, and for any remaining races the client will
handle it gracefully, i.e. we'll fetch that A..E, try to apply F..G on
top of it (or rather, fail on reading the prereqs in its header), and
just fall back on a negotiation from E, rather than the expected G.

>> The client spec is very liberal on "MAY" instead of "MUST" for
>> supporting clients. E.g. a client might sensibly conclude that their
>> most recent commits are recent enough, and that it would probably be a
>> waste of time to fetch the headers of the pointed-to bundles, and just
>> do a normal fetch instead.
>
> Sensible. I wouldn't expect anything less.

*Nod*, but it's useful to note that by comparison with packfile-uri, the
plan here is to have the server and its CDN not have a hard dependency
on one another. I.e. with packfile-uri if your server and CDN have a 99%
availability, your total availability is .99^2 =~ 98.01%.

So it's inherently a softer and more optimistic way to handle failures,
and allows server operators who aren't 100% sure of their CDN never
failing them (lest clones start failing) to experiment with using one.

>> The response format is also currently:
>> 
>>     <URI>[ <optional key-value>*]
>> 
>> And nothing uses the "optional key-value". One of the main things I had
>> in mind was to support something like (none of these key-values are
>> specified currently):
>> 
>>     https://example.com/big.bundle tip=deadbeef1
>>     https://example.com/incremental-1.bundle prereq=deadbeef1 tip=deadbeef2
>>     https://example.com/incremental-2.bundle prereq=deadbeef2 tip=deadbeef3
>> 
>> Or:
>> 
>>     https://example.com/big.bundle type=complete
>>     https://example.com/incremental-1.bundle type=incremental
>>     https://example.com/incremental-2.bundle type=incremental
>> 
>> I.e. allow the server to up-front to optionally specify some details
>> about the pointed-to-bundles, purely so the client can avoid the work
>> and roundtrip of trying to fetch the headers of N pointed-to bundles
>> just to see which if any it needs for a clone/fetch.
>
> Ok, this seems like a reasonable way forward. If the bundles are
> truly focused on data within a core "trunk" (say, the reflog of
> the default branch) then this works. It gets more complicated for
> repos with many long-lived branches that merge together over the
> span of weeks. I'm talking about extreme outliers like the Windows
> OS repository, so take this concern with a grain of salt. The
> point is that in that world, a single tip isn't enough sometimes.

*Nod*, but note that this would just be a shortcut over using the Range
request to get the header discussed above.

For the Windows repository you could presumably (using git.git as an
example) have bundles for:

    ..master (the big bundle, no prereqs)
    master..next
    next..ds/add-with-sparse-index
    next..ab/serve-cleanup

I.e. ds/add-with-sparse-index and ab/serve-cleanup are two divergent
branches from "next" that we'd like to get.

A bundle header could of course have N branches and N prereqs, but the
prereq/tip suggested above would handle the common case of a small
number of "big topics" that are likely to diverge quite a bit, which I
imagine is what the use-case you're noting looks like.

>>> One question I saw Jonathan ask was "can we modify the packfile-uri
>>> capability to be better?" I think this feature has enough different goals
>>> that they could co-exist. That's the point of protocol v2, right? Servers
>>> can implement and advertise the subset of functionality that they think is
>>> best for their needs.
>> 
>> *Nod*, I also think as shown with these patches the overall complexity
>> after setting up the scaffolding for a new feature isn't very high, and
>> much of it is going away in some generally useful prep patches I'm
>> trying to get in...
>> 
>>> I hope my ramblings provide some amount of clarity to the discussion, but
>>> also I intend to show support of the idea. If I was given the choice of
>>> which feature to support (I mostly work on the client experience, so take
>>> my choice with a grain of salt), then I would focus on implementing the
>>> bundle-uri capability _before_ the packfile-uri capability. And that's the
>>> best part: more options present more flexibility for different hosts to
>>> make different decisions.
>> 
>> Thanks again. I'll keep you CC'd on future submissions if that's OK.
>
> Sounds good. I'll try to prioritize review.
>
>> One thing I could really use to move this forward is code review of some
>> of the outstanding serieses I've got, in particular these two are
>> immediately applicable to this one. It's based on the former, and I
>> needed to steal one patch from the latter:
>> 
>> https://lore.kernel.org/git/cover-v4-00.10-00000000000-20210805T011823Z-avarab@gmail.com/
>> https://lore.kernel.org/git/cover-v2-0.4-00000000000-20210823T110136Z-avarab@gmail.com/
>
> I'll see what I can contribute here.

Thanks, the reviews on the "unbundle progress" series are already very
useful...