Message ID | cover.1550963965.git.jonathantanmy@google.com (mailing list archive)
---|---
Series | CDN offloading of fetch response
On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:

> There are probably some more design discussions to be had:

[...]

> - Client-side whitelist of protocol and hostnames. I've implemented whitelist of protocol, but not hostnames.

I would appreciate a more complete answer to my comments in:

https://public-inbox.org/git/CAP8UFD16fvtu_dg3S_J9BjGpxAYvgp8SXscdh=TJB5jvAbzi4A@mail.gmail.com/

Especially I'd like to know what should the client do if they find out that for example a repo that contains a lot of large files is configured so that the large files should be fetched from a CDN that the client cannot use? Is the client forced to find or setup another repo configured differently if the client still wants to use CDN offloading? Wouldn't it be better if the client could use the same repo and just select a CDN configuration among many?
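(For context, the offloading being discussed works roughly like this, per the RFC this thread replies to; the sketch below is simplified, the host name is purely illustrative, and the exact pkt-line framing and argument names may differ.)

    # Client's protocol v2 fetch request: it advertises which URI
    # protocols it is willing to accept (the "whitelist of protocol").
    command=fetch
    want <oid>
    packfile-uris https
    done

    # Server response: a new section listing URIs the client must
    # download separately, followed by the usual inline packfile
    # containing everything else.
    packfile-uris
    <pack-hash> https://cdn.example.org/pack-0123abcd.pack
    packfile
    <pack data for the remaining objects>

Christian's question is what the client should do when it cannot reach the URI the server chose for it.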
Hi,

Christian Couder wrote:
> On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:

>> There are probably some more design discussions to be had:
>
> [...]
>
>> - Client-side whitelist of protocol and hostnames. I've implemented whitelist of protocol, but not hostnames.
>
> I would appreciate a more complete answer to my comments in:
>
> https://public-inbox.org/git/CAP8UFD16fvtu_dg3S_J9BjGpxAYvgp8SXscdh=TJB5jvAbzi4A@mail.gmail.com/
>
> Especially I'd like to know what should the client do if they find out that for example a repo that contains a lot of large files is configured so that the large files should be fetched from a CDN that the client cannot use? Is the client forced to find or setup another repo configured differently if the client still wants to use CDN offloading?

The example from that message:

  For example I think the Great Firewall of China lets people in China use GitHub.com but not Google.com. So if people start configuring their repos on GitHub so that they send packs that contain Google.com CDN URLs (or actually anything that the Firewall blocks), it might create many problems for users in China if they don't have a way to opt out of receiving packs with those kind of URLs.

But the same thing can happen with redirects, with embedded assets in web pages, and so on.

I think in this situation the user would likely (and rightly) blame the host (github.com) for requiring access to a separate inaccessible site, and the problem could be resolved with them.

The beauty of this is that it's transparent to the client: the fact that packfile transfer was offloaded to a CDN is an implementation detail, and the server takes full responsibility for it.

This doesn't stop a hosting provider from using e.g. server options to allow the client more control over how their response is served, just like can be done for other features of how the transfer works (how often to send progress updates, whether to prioritize latency or throughput, etc).

What the client *can* do is turn off support for packfile URLs in a request completely. This is required for backward compatibility and allows working around a host that has configured the feature incorrectly.

Thanks for an interesting example,
Jonathan
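(Concretely, "turn off support for packfile URLs" presumably means the client simply does not advertise the packfile-uris capability in its request, so the server has to inline everything in the returned pack. A hypothetical client-side sketch follows; the config key names are illustrative, not taken from the series.)

    # Opt out entirely: never ask for packfile URIs, forcing the
    # server to send a self-contained pack.
    git config --global fetch.allowPackfileURIs false

    # Or opt in only for URI protocols the client can actually reach
    # (the client-side protocol whitelist mentioned above).
    git config --global fetch.packfileURIProtocols https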
Hi,

On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Christian Couder wrote:
>> On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
>> Especially I'd like to know what should the client do if they find out that for example a repo that contains a lot of large files is configured so that the large files should be fetched from a CDN that the client cannot use? Is the client forced to find or setup another repo configured differently if the client still wants to use CDN offloading?
>
> The example from that message:
>
>   For example I think the Great Firewall of China lets people in China use GitHub.com but not Google.com. So if people start configuring their repos on GitHub so that they send packs that contain Google.com CDN URLs (or actually anything that the Firewall blocks), it might create many problems for users in China if they don't have a way to opt out of receiving packs with those kind of URLs.
>
> But the same thing can happen with redirects, with embedded assets in web pages, and so on.

I don't think it's the same situation, because the CDN offloading is likely to be used for large objects that some hosting sites like GitHub, GitLab and BitBucket might not be ok to have them store for free on their machines. (I think the current limitations are around 10GB or 20GB, everything included, for each user.)

So it's likely that users will want a way to host on such sites incomplete repos using CDN offloading to a CDN on another site. And then if the CDN is not accessible for some reason, things will completely break when users will clone.

You could say that it's the same issue as when a video is not available on a web page, but the web browser can still render the page when a video is not available. So I don't think it's the same kind of issue.

And by the way that's a reason why I think it's important to think about this in relation to promisor/partial clone remotes. Because with them it's less of a big deal if the CDN is unavailable, temporarily or not, for some reason.

> I think in this situation the user would likely (and rightly) blame the host (github.com) for requiring access to a separate inaccessible site, and the problem could be resolved with them.

The host will say that it's repo admins' responsibility to use a CDN that works for the repo users (or to pay for more space on the host). Then repo admins will say that they use this CDN because it's simpler for them or the only thing they can afford or deal with. (For example I don't think it would be easy for westerners to use a Chinese CDN.) Then users will likely blame Git for not supporting a way to use a different CDN than the one configured in each repo.

> The beauty of this is that it's transparent to the client: the fact that packfile transfer was offloaded to a CDN is an implementation detail, and the server takes full responsibility for it.

Who is "the server" in real life? Are you sure they would be ok to take full responsibility?

And yes, I agree that transparency for the client is nice. And if it's really nice, then why not have it for promisor/partial clone remotes too? But then do we really need duplicated functionality between promisor remotes and CDN offloading?

And also I just think that in real life there needs to be an easy way to override this transparency, and we already have that with promisor remotes.

> This doesn't stop a hosting provider from using e.g. server options to allow the client more control over how their response is served, just like can be done for other features of how the transfer works (how often to send progress updates, whether to prioritize latency or throughput, etc).

Could you give a more concrete example of what could be done?

> What the client *can* do is turn off support for packfile URLs in a request completely. This is required for backward compatibility and allows working around a host that has configured the feature incorrectly.

If the full content of a repo is really large, the size of a full pack file sent by an initial clone could be really big and many client machines could not have enough memory to deal with that. And this suppose that repo hosting providers would be ok to host very large repos in the first place.

With promisor remotes, it's less of a problem if for example:

- a repo hosting provider is not ok with very large repos,
- a CDN is unavailable,
- a repo admin has not configured some repos very well.

Thanks for your answer,
Christian.
On Tue, Feb 26 2019, Christian Couder wrote:

> Hi,
>
> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>>
>> Christian Couder wrote:
>>> On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>>
>>> Especially I'd like to know what should the client do if they find out that for example a repo that contains a lot of large files is configured so that the large files should be fetched from a CDN that the client cannot use? Is the client forced to find or setup another repo configured differently if the client still wants to use CDN offloading?
>>
>> The example from that message:
>>
>>   For example I think the Great Firewall of China lets people in China use GitHub.com but not Google.com. So if people start configuring their repos on GitHub so that they send packs that contain Google.com CDN URLs (or actually anything that the Firewall blocks), it might create many problems for users in China if they don't have a way to opt out of receiving packs with those kind of URLs.
>>
>> But the same thing can happen with redirects, with embedded assets in web pages, and so on.
>
> I don't think it's the same situation, because the CDN offloading is likely to be used for large objects that some hosting sites like GitHub, GitLab and BitBucket might not be ok to have them store for free on their machines. (I think the current limitations are around 10GB or 20GB, everything included, for each user.)
>
> So it's likely that users will want a way to host on such sites incomplete repos using CDN offloading to a CDN on another site. And then if the CDN is not accessible for some reason, things will completely break when users will clone.
>
> You could say that it's the same issue as when a video is not available on a web page, but the web browser can still render the page when a video is not available. So I don't think it's the same kind of issue.
>
> And by the way that's a reason why I think it's important to think about this in relation to promisor/partial clone remotes. Because with them it's less of a big deal if the CDN is unavailable, temporarily or not, for some reason.

I think all of that's correct. E.g. you can imagine a CDN where the CDN serves literally one blob (not a pack), and the server the rest of the trees/commits/blobs.

But for the purposes of reviewing this I think it's better to say that we're going to have a limited initial introduction of CDN where those more complex cases don't need to be handled.

That can always be added later, as far as I can tell from the protocol alteration in the RFC nothing's closing the door on that, we could always add another capability / protocol extension.
Hi,

Sorry for the slow followup. Thanks for probing into the design --- this should be useful for getting the docs to be clear.

Christian Couder wrote:

> So it's likely that users will want a way to host on such sites incomplete repos using CDN offloading to a CDN on another site. And then if the CDN is not accessible for some reason, things will completely break when users will clone.

I think this would be a broken setup --- we can make it clear in the protocol and server docs that you should only point to a CDN for which you control the contents, to avoid breaking clients.

That doesn't prevent adding additional features in the future e.g. for "server suggested alternates" --- it's just that I consider that a separate feature.

Using CDN offloading requires cooperation of the hosting provider. It's a way to optimize how fetches work, not a way to have a partial repository on the server side.

[...]

> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>> This doesn't stop a hosting provider from using e.g. server options to allow the client more control over how their response is served, just like can be done for other features of how the transfer works (how often to send progress updates, whether to prioritize latency or throughput, etc).
>
> Could you give a more concrete example of what could be done?

What I mean is passing server options using "git fetch --server-option". For example:

    git fetch -o priority=BATCH origin master

or

    git fetch -o avoid-cdn=badcdn.example.com origin master

The interpretation of server options is up to the server.

>> What the client *can* do is turn off support for packfile URLs in a request completely. This is required for backward compatibility and allows working around a host that has configured the feature incorrectly.
>
> If the full content of a repo is really large, the size of a full pack file sent by an initial clone could be really big and many client machines could not have enough memory to deal with that. And this suppose that repo hosting providers would be ok to host very large repos in the first place.

Do we require the packfile to fit in memory? If so, we should fix that (to use streaming instead).

Thanks,
Jonathan
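(A note on the streaming question: on the receiving side Git already pipes an incoming pack into an index-pack process rather than buffering it whole, roughly as sketched below; whether every later consumer of a huge pack is equally careful is the part Christian questions later in the thread.)

    # Simplified sketch of what fetch does with the pack it receives:
    # the data is indexed and written to disk as it streams in, so the
    # whole packfile never has to fit in memory at once.
    ... data from the connection ... | git index-pack --stdin --fix-thin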
On Tue, Feb 26, 2019 at 10:12 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Tue, Feb 26 2019, Christian Couder wrote:
>
>> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>>> But the same thing can happen with redirects, with embedded assets in web pages, and so on.
>>
>> I don't think it's the same situation, because the CDN offloading is likely to be used for large objects that some hosting sites like GitHub, GitLab and BitBucket might not be ok to have them store for free on their machines. (I think the current limitations are around 10GB or 20GB, everything included, for each user.)
>>
>> So it's likely that users will want a way to host on such sites incomplete repos using CDN offloading to a CDN on another site. And then if the CDN is not accessible for some reason, things will completely break when users will clone.
>>
>> You could say that it's the same issue as when a video is not available on a web page, but the web browser can still render the page when a video is not available. So I don't think it's the same kind of issue.
>>
>> And by the way that's a reason why I think it's important to think about this in relation to promisor/partial clone remotes. Because with them it's less of a big deal if the CDN is unavailable, temporarily or not, for some reason.
>
> I think all of that's correct. E.g. you can imagine a CDN where the CDN serves literally one blob (not a pack), and the server the rest of the trees/commits/blobs.
>
> But for the purposes of reviewing this I think it's better to say that we're going to have a limited initial introduction of CDN where those more complex cases don't need to be handled.
>
> That can always be added later, as far as I can tell from the protocol alteration in the RFC nothing's closing the door on that, we could always add another capability / protocol extension.

Yeah, it doesn't close the door on further improvements. The issue though is that it doesn't seem to have many benefits over implementing things in many promisor remotes.

The main benefit seems to be that the secondary server locations are automatically configured. But when looking at what can happen in the real world, this benefit seems more like a drawback to me as it potentially creates a lot of problems.

A solution, many promisor remotes, where:

- first secondary server URLs are manually specified on the client side, and then
- some kind of negotiation, so that they can be automatically selected, is implemented

seems better to me than a solution, CDN offloading, where:

- first the main server decides the secondary server URLs, and then
- we work around the cases where this creates problems

In the case of CDN offloading it is likely that early client and server implementations will create problems for many people as long as most of the workarounds aren't implemented. While in the case of many promisor remotes there is always the manual solution as long as the automated selection doesn't work well enough.
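(To make the comparison concrete, here is roughly what the manual, client-side half of the "many promisor remotes" approach looks like. This is a sketch based on the series Christian refers to; the exact config keys may differ, and the URLs are purely illustrative.)

    # Partial clone from the main host, omitting large blobs.
    git clone --filter=blob:limit=1m https://github.com/example/repo.git
    cd repo

    # Manually add a second remote that holds the large objects and
    # mark it as a promisor, so missing blobs can be fetched from it
    # lazily (e.g. at checkout time).
    git remote add cdn https://goodcdn.example.com/repo.git
    git config remote.cdn.promisor true
    git config remote.cdn.partialclonefilter blob:limit=1m

The point of the comparison is that here the secondary URL is chosen and configured by the client, whereas with CDN offloading the server picks it and the client has to work around it when it is unreachable.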
Hi,

On Fri, Mar 1, 2019 at 12:21 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Sorry for the slow followup. Thanks for probing into the design --- this should be useful for getting the docs to be clear.
>
> Christian Couder wrote:
>
>> So it's likely that users will want a way to host on such sites incomplete repos using CDN offloading to a CDN on another site. And then if the CDN is not accessible for some reason, things will completely break when users will clone.
>
> I think this would be a broken setup --- we can make it clear in the protocol and server docs that you should only point to a CDN for which you control the contents, to avoid breaking clients.

We can say whatever in the docs, but in real life if it's simpler/cheaper for repo admins to use a CDN for example on Google and a repo on GitHub, they are likely to do it anyway.

> That doesn't prevent adding additional features in the future e.g. for "server suggested alternates" --- it's just that I consider that a separate feature.
>
> Using CDN offloading requires cooperation of the hosting provider. It's a way to optimize how fetches work, not a way to have a partial repository on the server side.

We can say whatever we want about what it is for. Users are likely to use it anyway in the way they think it will benefit them the most.

>> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
>>> This doesn't stop a hosting provider from using e.g. server options to allow the client more control over how their response is served, just like can be done for other features of how the transfer works (how often to send progress updates, whether to prioritize latency or throughput, etc).
>>
>> Could you give a more concrete example of what could be done?
>
> What I mean is passing server options using "git fetch --server-option". For example:
>
>     git fetch -o priority=BATCH origin master
>
> or
>
>     git fetch -o avoid-cdn=badcdn.example.com origin master
>
> The interpretation of server options is up to the server.

If you often have to tell things like "-o avoid-cdn=badcdn.example.com", then how is it better than just specifying "-o usecdn=goodcdn.example.com", or even better using the remote mechanism to configure a remote for goodcdn.example.com and then configuring this remote to be used along the origin remote (which is what many promisor remotes is about)?

>>> What the client *can* do is turn off support for packfile URLs in a request completely. This is required for backward compatibility and allows working around a host that has configured the feature incorrectly.
>>
>> If the full content of a repo is really large, the size of a full pack file sent by an initial clone could be really big and many client machines could not have enough memory to deal with that. And this suppose that repo hosting providers would be ok to host very large repos in the first place.
>
> Do we require the packfile to fit in memory? If so, we should fix that (to use streaming instead).

Even if we stream the packfile to write it, at one point we have to use it. And I could be wrong but I think that mmap doesn't work on Windows, so I think we will just try to read the whole thing into memory. Even on Linux I don't think it's a good idea to mmap a very large file and then use some big parts of it, which I think we will have to do when checking out the large files from inside the packfile.

Yeah, we can improve that part of Git too. I think though that it means yet another thing (and not an easy one) that needs to be improved before CDN offloading can work well in the real world.

I think that the Git "development philosophy" since the beginning has been more about adding things that work well in the real world first even if they are small and a bit manual, and then improving on top of those early things, rather than adding a big thing that doesn't quite work well in the real world but is automated and then improving on that.