
[WIP,0/7] CDN offloading of fetch response

Message ID: cover.1550963965.git.jonathantanmy@google.com
Series: CDN offloading of fetch response

Message

Jonathan Tan Feb. 23, 2019, 11:38 p.m. UTC
It's been a while, so here is an updated version of what I previously
sent [1]. The main difference is that fetch-pack now actually downloads
whatever the server tells it to. The other main difference is that we
no longer buffer progress messages or suspend keepalives - we no longer
need to, because the sideband-all patches have been merged.

I think I've addressed all the technical comments in [1], except one
comment of Junio's [2] that I still don't understand:

> And the "fix" I was alluding to was to update that "else" clause to
> make it a no-op that keeps os->used non-zero, which would not call
> send-client-data.
>
> When that fix happens, the part that early in the function this
> patch added "now we know we will call send-client-data, so let's say
> 'here comes packdata' unless we have already said that" is making
> the decision too early.  Depending on the value of os->used when we
> enter the code and the number of bytes xread() reads from the
> upstream, we might not call send-client-data yet (namely, when we
> have no buffered data and we happened to get only one byte).

With or without this fix, I don't think there is ever a time when we
say "here comes packdata" without calling send-client-data - we say
"here comes packdata" only when we see the string "PACK", which forms
part of the packfile, and thus we should have no problem sending any
client data. Having said that, maybe this is a moot point - Junio says
that this only happens when the fix is implemented, and the fix is not
implemented.

There are probably some more design discussions to be had:

 - Requirement that all pointed-to CDNs support byte ranges (for
   resumability) and guarantee that the packfiles will be available
   permanently. After some thought, I think it is a good idea for CDNs
   to support that, but we should also support CDNs that can only give
   time-limited guarantees (e.g. if/when we implement resumption, we
   could read the Cache-Control headers). I haven't mentioned this
   issue in the documentation yet.

 - Client-side whitelist of protocol and hostnames. I've implemented
   whitelist of protocol, but not hostnames.

 - Any sort of follow-up fetch - for example, if the download from the
   CDN fails, or if we allow the server to tell us about best-effort
   packfiles (in which case the client still must check and formulate
   the correct request to the server to fetch the rest). This protocol
   seems like a prerequisite to all of those, and is independently
   useful, so maybe they can all be future work.
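
To make the whitelist item above concrete: what I have in mind on the
client side is something like the following (the config names here are
purely illustrative - the exact knobs in this WIP may differ or not
exist yet):

	# illustrative config names, not necessarily what this series uses
	git config fetch.uriProtocols https                # protocol whitelist (implemented)
	git config fetch.uriAllowedHosts cdn.example.com   # hostname whitelist (not implemented)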

Please take a look. Feel free to comment on anything, but I prefer
comments on the major things first (e.g. my usage of a separate process
(http-fetch) to fetch packfiles, since as far as I know the git binary
itself doesn't link to libcurl; or any of the design decisions I
described above). I know that there are some implementation details
that could be improved (e.g. parallelizing the CDN downloads, starting
CDN downloads *after* closing the first HTTP request, holding on to the
.keep locks until after the refs are set), but I will work on those
once the overall design is more or less finalized.
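
For reference, by "separate process" I mean roughly the following: for
each URI the server sends, fetch-pack spawns http-fetch to do the
actual download. The invocation below is purely illustrative - see the
http-fetch patch for the actual interface:

	# illustrative sketch, not the exact command line added by this series
	git http-fetch --packfile https://cdn.example.com/pack-1234abcd.pack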

Note that the first patch is exactly the same as one I've previously
sent [3].

[1] https://public-inbox.org/git/cover.1543879256.git.jonathantanmy@google.com/
[2] https://public-inbox.org/git/xmqqmupi89ub.fsf@gitster-ct.c.googlers.com/
[3] https://public-inbox.org/git/20190221001447.124088-1-jonathantanmy@google.com/

Jonathan Tan (7):
  http: use --stdin and --keep when downloading pack
  http: improve documentation of http_pack_request
  http-fetch: support fetching packfiles by URL
  Documentation: order protocol v2 sections
  Documentation: add Packfile URIs design doc
  upload-pack: refactor reading of pack-objects out
  upload-pack: send part of packfile response as uri

 Documentation/git-http-fetch.txt         |   7 +-
 Documentation/technical/packfile-uri.txt |  79 ++++++++++++
 Documentation/technical/protocol-v2.txt  |  22 ++--
 builtin/pack-objects.c                   |  63 +++++++++
 fetch-pack.c                             |  58 +++++++++
 http-fetch.c                             |  65 ++++++++--
 http-push.c                              |   7 +-
 http-walker.c                            |   5 +-
 http.c                                   |  83 +++++++-----
 http.h                                   |  26 +++-
 t/t5550-http-fetch-dumb.sh               |  18 +++
 t/t5702-protocol-v2.sh                   |  54 ++++++++
 upload-pack.c                            | 155 +++++++++++++++++------
 13 files changed, 542 insertions(+), 100 deletions(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

Comments

Christian Couder Feb. 25, 2019, 9:30 p.m. UTC | #1
On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:

> There are probably some more design discussions to be had:

[...]

>  - Client-side whitelist of protocol and hostnames. I've implemented
>    whitelist of protocol, but not hostnames.

I would appreciate a more complete answer to my comments in:

https://public-inbox.org/git/CAP8UFD16fvtu_dg3S_J9BjGpxAYvgp8SXscdh=TJB5jvAbzi4A@mail.gmail.com/

Especially I'd like to know what should the client do if they find out
that for example a repo that contains a lot of large files is
configured so that the large files should be fetched from a CDN that
the client cannot use? Is the client forced to find or setup another
repo configured differently if the client still wants to use CDN
offloading?

Wouldn't it be better if the client could use the same repo and just
select a CDN configuration among many?
Jonathan Nieder Feb. 25, 2019, 11:45 p.m. UTC | #2
Hi,

Christian Couder wrote:
> On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:

>> There are probably some more design discussions to be had:
>
> [...]
>
>>  - Client-side whitelist of protocol and hostnames. I've implemented
>>    whitelist of protocol, but not hostnames.
>
> I would appreciate a more complete answer to my comments in:
>
> https://public-inbox.org/git/CAP8UFD16fvtu_dg3S_J9BjGpxAYvgp8SXscdh=TJB5jvAbzi4A@mail.gmail.com/
>
> Especially I'd like to know what should the client do if they find out
> that for example a repo that contains a lot of large files is
> configured so that the large files should be fetched from a CDN that
> the client cannot use? Is the client forced to find or setup another
> repo configured differently if the client still wants to use CDN
> offloading?

The example from that message:

  For example I think the Great Firewall of China lets people in China
  use GitHub.com but not Google.com. So if people start configuring
  their repos on GitHub so that they send packs that contain Google.com
  CDN URLs (or actually anything that the Firewall blocks), it might
  create many problems for users in China if they don't have a way to
  opt out of receiving packs with those kind of URLs.

But the same thing can happen with redirects, with embedded assets in
web pages, and so on.  I think in this situation the user would likely
(and rightly) blame the host (github.com) for requiring access to a
separate inaccessible site, and the problem could be resolved with
them.

The beauty of this is that it's transparent to the client: the fact
that packfile transfer was offloaded to a CDN is an implementation
detail, and the server takes full responsibility for it.

This doesn't stop a hosting provider from using e.g. server options to
allow the client more control over how their response is served, just
like can be done for other features of how the transfer works (how
often to send progress updates, whether to prioritize latency or
throughput, etc).

What the client *can* do is turn off support for packfile URLs in a
request completely.  This is required for backward compatibility and
allows working around a host that has configured the feature
incorrectly.
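
Concretely, at the protocol level (using the argument name from the
design doc in this series - treat the exact syntax as approximate):

	# v2 fetch request that opts in, listing acceptable protocols
	want <oid>
	...
	packfile-uris https
	done

A client that never sends the packfile-uris argument gets the whole
response inline in the packfile section, exactly as today.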

Thanks for an interesting example,
Jonathan
Christian Couder Feb. 26, 2019, 8:30 a.m. UTC | #3
Hi,

On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Christian Couder wrote:
> > On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > Especially I'd like to know what should the client do if they find out
> > that for example a repo that contains a lot of large files is
> > configured so that the large files should be fetched from a CDN that
> > the client cannot use? Is the client forced to find or setup another
> > repo configured differently if the client still wants to use CDN
> > offloading?
>
> The example from that message:
>
>   For example I think the Great Firewall of China lets people in China
>   use GitHub.com but not Google.com. So if people start configuring
>   their repos on GitHub so that they send packs that contain Google.com
>   CDN URLs (or actually anything that the Firewall blocks), it might
>   create many problems for users in China if they don't have a way to
>   opt out of receiving packs with those kind of URLs.
>
> But the same thing can happen with redirects, with embedded assets in
> web pages, and so on.

I don't think it's the same situation, because the CDN offloading is
likely to be used for large objects that some hosting sites like
GitHub, GitLab and BitBucket might not be ok to have them store for
free on their machines. (I think the current limitations are around
10GB or 20GB, everything included, for each user.)

So it's likely that users will want a way to host on such sites
incomplete repos using CDN offloading to a CDN on another site. And
then if the CDN is not accessible for some reason, things will
completely break when users will clone.

You could say that it's the same issue as when a video is not
available on a web page, but the web browser can still render the page
when a video is not available. So I don't think it's the same kind of
issue.

And by the way that's a reason why I think it's important to think
about this in relation to promisor/partial clone remotes. Because with
them it's less of a big deal if the CDN is unavailable, temporarily or
not, for some reason.

> I think in this situation the user would likely
> (and rightly) blame the host (github.com) for requiring access to a
> separate inaccessible site, and the problem could be resolved with
> them.

The host will say that it's repo admins' responsibility to use a CDN
that works for the repo users (or to pay for more space on the host).
Then repo admins will say that they use this CDN because it's simpler
for them or the only thing they can afford or deal with. (For example
I don't think it would be easy for westerners to use a Chinese CDN.)
Then users will likely blame Git for not supporting a way to use a
different CDN than the one configured in each repo.

> The beauty of this is that it's transparent to the client: the fact
> that packfile transfer was offloaded to a CDN is an implementation
> detail, and the server takes full responsibility for it.

Who is "the server" in real life? Are you sure they would be ok to
take full responsibility?

And yes, I agree that transparency for the client is nice. And if it's
really nice, then why not have it for promisor/partial clone remotes
too? But then do we really need duplicated functionality between
promisor remotes and CDN offloading?

And also I just think that in real life there needs to be an easy way
to override this transparency, and we already have that with promisor
remotes.

> This doesn't stop a hosting provider from using e.g. server options to
> allow the client more control over how their response is served, just
> like can be done for other features of how the transfer works (how
> often to send progress updates, whether to prioritize latency or
> throughput, etc).

Could you give a more concrete example of what could be done?

> What the client *can* do is turn off support for packfile URLs in a
> request completely.  This is required for backward compatibility and
> allows working around a host that has configured the feature
> incorrectly.

If the full content of a repo is really large, the size of a full pack
file sent by an initial clone could be really big and many client
machines could not have enough memory to deal with that. And this
suppose that repo hosting providers would be ok to host very large
repos in the first place.

With promisor remotes, it's less of a problem if for example:

- a repo hosting provider is not ok with very large repos,
- a CDN is unavailable,
- a repo admin has not configured some repos very well.

Thanks for your answer,
Christian.
Ævar Arnfjörð Bjarmason Feb. 26, 2019, 9:12 a.m. UTC | #4
On Tue, Feb 26 2019, Christian Couder wrote:

> Hi,
>
> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>>
>> Christian Couder wrote:
>> > On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>>
>> > Especially I'd like to know what should the client do if they find out
>> > that for example a repo that contains a lot of large files is
>> > configured so that the large files should be fetched from a CDN that
>> > the client cannot use? Is the client forced to find or setup another
>> > repo configured differently if the client still wants to use CDN
>> > offloading?
>>
>> The example from that message:
>>
>>   For example I think the Great Firewall of China lets people in China
>>   use GitHub.com but not Google.com. So if people start configuring
>>   their repos on GitHub so that they send packs that contain Google.com
>>   CDN URLs (or actually anything that the Firewall blocks), it might
>>   create many problems for users in China if they don't have a way to
>>   opt out of receiving packs with those kind of URLs.
>>
>> But the same thing can happen with redirects, with embedded assets in
>> web pages, and so on.
>
> I don't think it's the same situation, because the CDN offloading is
> likely to be used for large objects that some hosting sites like
> GitHub, GitLab and BitBucket might not be ok to have them store for
> free on their machines. (I think the current limitations are around
> 10GB or 20GB, everything included, for each user.)
>
> So it's likely that users will want a way to host on such sites
> incomplete repos using CDN offloading to a CDN on another site. And
> then if the CDN is not accessible for some reason, things will
> completely break when users will clone.
>
> You could say that it's the same issue as when a video is not
> available on a web page, but the web browser can still render the page
> when a video is not available. So I don't think it's the same kind of
> issue.
>
> And by the way that's a reason why I think it's important to think
> about this in relation to promisor/partial clone remotes. Because with
> them it's less of a big deal if the CDN is unavailable, temporarily or
> not, for some reason.

I think all of that's correct. E.g. you can imagine a CDN where the CDN
serves literally one blob (not a pack), and the server the rest of the
trees/commits/blobs.

But for the purposes of reviewing this I think it's better to say that
we're going to have a limited initial introduction of CDN where those
more complex cases don't need to be handled.

That can always be added later, as far as I can tell from the protocol
alteration in the RFC nothing's closing the door on that, we could
always add another capability / protocol extension.
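
To make that extension point concrete, the fetch response in the RFC's
design doc is (roughly) laid out like this:

	packfile-uris
	<packfile-hash> <uri>     # one line per pre-generated pack on the CDN
	packfile
	<pack data for the remaining objects>

So a later capability could add another section, or another argument
negotiated up front, without breaking this layout.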
Jonathan Nieder Feb. 28, 2019, 11:21 p.m. UTC | #5
Hi,

Sorry for the slow followup.  Thanks for probing into the design ---
this should be useful for getting the docs to be clear.

Christian Couder wrote:

> So it's likely that users will want a way to host on such sites
> incomplete repos using CDN offloading to a CDN on another site. And
> then if the CDN is not accessible for some reason, things will
> completely break when users will clone.

I think this would be a broken setup --- we can make it clear in the
protocol and server docs that you should only point to a CDN for which
you control the contents, to avoid breaking clients.

That doesn't prevent adding additional features in the future e.g. for
"server suggested alternates" --- it's just that I consider that a
separate feature.

Using CDN offloading requires cooperation of the hosting provider.
It's a way to optimize how fetches work, not a way to have a partial
repository on the server side.

[...]
> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:

>> This doesn't stop a hosting provider from using e.g. server options to
>> allow the client more control over how their response is served, just
>> like can be done for other features of how the transfer works (how
>> often to send progress updates, whether to prioritize latency or
>> throughput, etc).
>
> Could you give a more concrete example of what could be done?

What I mean is passing server options using "git fetch --server-option".
For example:

	git fetch -o priority=BATCH origin master

or

	git fetch -o avoid-cdn=badcdn.example.com origin master

The interpretation of server options is up to the server.

>> What the client *can* do is turn off support for packfile URLs in a
>> request completely.  This is required for backward compatibility and
>> allows working around a host that has configured the feature
>> incorrectly.
>
> If the full content of a repo is really large, the size of a full pack
> file sent by an initial clone could be really big and many client
> machines could not have enough memory to deal with that. And this
> suppose that repo hosting providers would be ok to host very large
> repos in the first place.

Do we require the packfile to fit in memory?  If so, we should fix
that (to use streaming instead).

Thanks,
Jonathan
Christian Couder March 4, 2019, 8:24 a.m. UTC | #6
On Tue, Feb 26, 2019 at 10:12 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Feb 26 2019, Christian Couder wrote:
>
> > On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:

> >> But the same thing can happen with redirects, with embedded assets in
> >> web pages, and so on.
> >
> > I don't think it's the same situation, because the CDN offloading is
> > likely to be used for large objects that some hosting sites like
> > GitHub, GitLab and BitBucket might not be ok to have them store for
> > free on their machines. (I think the current limitations are around
> > 10GB or 20GB, everything included, for each user.)
> >
> > So it's likely that users will want a way to host on such sites
> > incomplete repos using CDN offloading to a CDN on another site. And
> > then if the CDN is not accessible for some reason, things will
> > completely break when users will clone.
> >
> > You could say that it's the same issue as when a video is not
> > available on a web page, but the web browser can still render the page
> > when a video is not available. So I don't think it's the same kind of
> > issue.
> >
> > And by the way that's a reason why I think it's important to think
> > about this in relation to promisor/partial clone remotes. Because with
> > them it's less of a big deal if the CDN is unavailable, temporarily or
> > not, for some reason.
>
> I think all of that's correct. E.g. you can imagine a CDN where the CDN
> serves literally one blob (not a pack), and the server the rest of the
> trees/commits/blobs.
>
> But for the purposes of reviewing this I think it's better to say that
> we're going to have a limited initial introduction of CDN where those
> more complex cases don't need to be handled.
>
> That can always be added later, as far as I can tell from the protocol
> alteration in the RFC nothing's closing the door on that, we could
> always add another capability / protocol extension.

Yeah, it doesn't close the door on further improvements. The issue,
though, is that it doesn't seem to have many benefits over implementing
things with many promisor remotes. The main benefit seems to be that
the secondary server locations are automatically configured. But when
looking at what can happen in the real world, this benefit seems more
like a drawback to me, as it potentially creates a lot of problems.

A solution, many promisor remotes, where:

- secondary server URLs are first specified manually on the client
side, and then
- some kind of negotiation is implemented, so that they can be
selected automatically

seems better to me than a solution, CDN offloading, where:

- first the main server decides the secondary server URLs, and then
- we work around the cases where this creates problems

In the case of CDN offloading, it is likely that early client and
server implementations will create problems for many people as long as
most of the workarounds aren't implemented, while in the case of many
promisor remotes there is always the manual solution as long as the
automated selection doesn't work well enough.
Christian Couder March 4, 2019, 8:54 a.m. UTC | #7
Hi,

On Fri, Mar 1, 2019 at 12:21 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Sorry for the slow followup.  Thanks for probing into the design ---
> this should be useful for getting the docs to be clear.
>
> Christian Couder wrote:
>
> > So it's likely that users will want a way to host on such sites
> > incomplete repos using CDN offloading to a CDN on another site. And
> > then if the CDN is not accessible for some reason, things will
> > completely break when users will clone.
>
> I think this would be a broken setup --- we can make it clear in the
> protocol and server docs that you should only point to a CDN for which
> you control the contents, to avoid breaking clients.

We can say whatever we want in the docs, but in real life, if it's
simpler or cheaper for repo admins to use, say, a CDN on Google and a
repo on GitHub, they are likely to do it anyway.

> That doesn't prevent adding additional features in the future e.g. for
> "server suggested alternates" --- it's just that I consider that a
> separate feature.
>
> Using CDN offloading requires cooperation of the hosting provider.
> It's a way to optimize how fetches work, not a way to have a partial
> repository on the server side.

We can say whatever we want about what it is for. Users are likely to
use it anyway in the way they think it will benefit them the most.

> > On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> >> This doesn't stop a hosting provider from using e.g. server options to
> >> allow the client more control over how their response is served, just
> >> like can be done for other features of how the transfer works (how
> >> often to send progress updates, whether to prioritize latency or
> >> throughput, etc).
> >
> > Could you give a more concrete example of what could be done?
>
> What I mean is passing server options using "git fetch --server-option".
> For example:
>
>         git fetch -o priority=BATCH origin master
>
> or
>
>         git fetch -o avoid-cdn=badcdn.example.com origin master
>
> The interpretation of server options is up to the server.

If you often have to pass things like "-o
avoid-cdn=badcdn.example.com", then how is that better than just
specifying "-o usecdn=goodcdn.example.com", or, even better, using the
remote mechanism to configure a remote for goodcdn.example.com and then
configuring that remote to be used along with the origin remote (which
is what many promisor remotes is about)?
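
For concreteness, the kind of client-side configuration I have in mind
is roughly the following sketch (the key names follow the many promisor
remotes proposal and are illustrative rather than final):

	[remote "goodcdn"]
		url = https://goodcdn.example.com/repo.git
		promisor = true
		partialCloneFilter = blob:none
	[extensions]
		partialClone = goodcdn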

> >> What the client *can* do is turn off support for packfile URLs in a
> >> request completely.  This is required for backward compatibility and
> >> allows working around a host that has configured the feature
> >> incorrectly.
> >
> > If the full content of a repo is really large, the size of a full pack
> > file sent by an initial clone could be really big and many client
> > machines could not have enough memory to deal with that. And this
> > suppose that repo hosting providers would be ok to host very large
> > repos in the first place.
>
> Do we require the packfile to fit in memory?  If so, we should fix
> that (to use streaming instead).

Even if we stream the packfile to write it, at some point we have to
use it.

And I could be wrong, but I think that mmap doesn't work on Windows, so
I think we will just try to read the whole thing into memory. Even on
Linux I don't think it's a good idea to mmap a very large file and then
use some big parts of it, which I think we will have to do when
checking out the large files from inside the packfile.

Yeah, we can improve that part of Git too. I think though that it
means yet another thing (and not an easy one) that needs to be
improved before CDN offloading can work well in the real world.

I think that the Git "development philosophy" since the beginning has
been more about adding things that work well in the real world first
even if they are small and a bit manual, and then improving on top of
those early things, rather than adding a big thing that doesn't quite
work well in the real world but is automated and then improving on
that.