Message ID | 20240910163000.1985723-1-christian.couder@gmail.com (mailing list archive)
---|---
Series | Introduce a "promisor-remote" capability
Christian Couder <christian.couder@gmail.com> writes:

> Changes compared to version 1
> ...
> Thanks to Junio, Patrick, Eric and Taylor for their suggestions.

We haven't heard from anybody in support of (or against, for that matter) this series even after a few weeks, which is not a good sign, even with everybody away for GitMerge for a few days.

IIRC, the comments that the initial iteration has received were mostly about clarifying the intent of this new capability (and some typofixes). What are the opinions on this round from folks (especially those who did not read the initial round)? Does this round clearly explain what the capability means and why projects would want to use it, and under what conditions?

Personally, I still find that knownName increases the potential attack surface without much benefit, but in a tightly controlled intranet environment it might have convenience value. I dunno.
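[Editorial aside for readers who did not follow the first round: the knobs under discussion look roughly like the following sketch, using the configuration names proposed by the series. Treat the exact names and values, in particular "KnownName", as illustrative of this round rather than final.]

    # Server side (sketch): advertise this repository's configured promisor
    # remote(s) to clients via the proposed "promisor-remote" capability.
    git config promisor.advertise true

    # Client side (sketch): only accept an advertised promisor remote whose
    # name is already configured locally -- the "knownName" mode Junio
    # mentions above.
    git config promisor.acceptFromServer KnownName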
On Thu, Sep 26, 2024 at 8:09 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Christian Couder <christian.couder@gmail.com> writes:
>
> > Changes compared to version 1
> > ...
> > Thanks to Junio, Patrick, Eric and Taylor for their suggestions.
>
> We haven't heard from anybody in support of (or against, for that matter) this series even after a few weeks, which is not a good sign, even with everybody away for GitMerge for a few days.

By the way, there was an unconference breakout session on day 2 of the Git Merge called "Git LFS Can we do better?" where this was discussed with a number of people. Scott Chacon took some notes:

https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md

It was in parallel with the Contributor Summit, so few contributors participated in this session (maybe only Michael Haggerty, John Cai and me). But the impression of the GitLab people there, including me, was that folks in general would be happy to have an alternative to Git LFS based on this.

> IIRC, the comments that the initial iteration has received were mostly about clarifying the intent of this new capability (and some typofixes). What are the opinions on this round from folks (especially those who did not read the initial round)? Does this round clearly explain what the capability means and why projects would want to use it, and under what conditions?
>
> Personally, I still find that knownName increases the potential attack surface without much benefit, but in a tightly controlled intranet environment it might have convenience value. I dunno.
Christian Couder <christian.couder@gmail.com> writes:

> By the way, there was an unconference breakout session on day 2 of the Git Merge called "Git LFS Can we do better?" where this was discussed with a number of people. Scott Chacon took some notes:
>
> https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md

Thanks for the link.

> It was in parallel with the Contributor Summit, so few contributors participated in this session (maybe only Michael Haggerty, John Cai and me). But the impression of the GitLab people there, including me, was that folks in general would be happy to have an alternative to Git LFS based on this.

I am not sure what "based on this" is really about, though.

This series adds a feature to redirect requests from one server to another, but does it really do much to solve the problem LFS wants to solve? I would imagine that you would want to be able to manage larger objects separately, to avoid affecting the performance and convenience of handling smaller objects, and to serve these larger objects from a dedicated server. You certainly can filter the larger blobs away with a blob size filter, but when you really need these larger blobs, it is unclear how the new capability helps, as you cannot really tell what criteria the serving side that gave you the "promisor-remote" capability wants you to use to sift your requests between the original server and the new promisor. Wouldn't your requests _all_ be redirected to a single place, the promisor remote you learned via the capability?

Coming up with a better alternative to LFS is certainly good, and it would be a worthwhile addition to the system. I just do not see how the topic of this series helps further that goal.

Thanks.
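[Editorial note: for background on the blob size filter referred to above, which already exists independently of this series, a partial clone can leave out large blobs and any missing blob is lazily fetched from the promisor remote when a command actually needs it. A minimal example; the threshold, URL and path are placeholders.]

    # Clone without any blob larger than 1 MiB; "origin" becomes a promisor
    # remote from which the omitted blobs can be fetched later.
    git clone --filter=blob:limit=1m https://example.com/repo.git
    cd repo

    # A command that needs one of the omitted blobs (e.g. showing a diff
    # that touches a large file) triggers a lazy fetch of just that object.
    git log -p -- path/to/large.bin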
On September 27, 2024 6:48 PM, Junio C Hamano wrote:
>Christian Couder <christian.couder@gmail.com> writes:
>
>> By the way, there was an unconference breakout session on day 2 of the Git Merge called "Git LFS Can we do better?" where this was discussed with a number of people. Scott Chacon took some notes:
>>
>> https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
>
>Thanks for the link.
>
>> It was in parallel with the Contributor Summit, so few contributors participated in this session (maybe only Michael Haggerty, John Cai and me). But the impression of the GitLab people there, including me, was that folks in general would be happy to have an alternative to Git LFS based on this.
>
>I am not sure what "based on this" is really about, though.
>
>This series adds a feature to redirect requests from one server to another, but does it really do much to solve the problem LFS wants to solve? I would imagine that you would want to be able to manage larger objects separately, to avoid affecting the performance and convenience of handling smaller objects, and to serve these larger objects from a dedicated server. You certainly can filter the larger blobs away with a blob size filter, but when you really need these larger blobs, it is unclear how the new capability helps, as you cannot really tell what criteria the serving side that gave you the "promisor-remote" capability wants you to use to sift your requests between the original server and the new promisor. Wouldn't your requests _all_ be redirected to a single place, the promisor remote you learned via the capability?
>
>Coming up with a better alternative to LFS is certainly good, and it would be a worthwhile addition to the system. I just do not see how the topic of this series helps further that goal.

I am one of those who really would like to see an improvement in this area. My community needs large binaries, and the GitHub LFS support limits sizes to the point of being pretty much not enough. I would be happy to participate in requirements gathering for this effort (even if it goes to Rust
On Sat, Sep 28, 2024, at 01:31, rsbecker@nexbridge.com wrote:

> I am one of those who really would like to see an improvement in this area. My community needs large binaries, and the GitHub LFS support limits sizes to the point of being pretty much not enough. I would be happy to participate in requirements gathering for this effort (even if it goes to Rust
On Fri, Sep 27, 2024 at 03:48:11PM -0700, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
>
> > By the way, there was an unconference breakout session on day 2 of the Git Merge called "Git LFS Can we do better?" where this was discussed with a number of people. Scott Chacon took some notes:
> >
> > https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
>
> Thanks for the link.
>
> > It was in parallel with the Contributor Summit, so few contributors participated in this session (maybe only Michael Haggerty, John Cai and me). But the impression of the GitLab people there, including me, was that folks in general would be happy to have an alternative to Git LFS based on this.
>
> I am not sure what "based on this" is really about, though.
>
> This series adds a feature to redirect requests from one server to another, but does it really do much to solve the problem LFS wants to solve? I would imagine that you would want to be able to manage larger objects separately, to avoid affecting the performance and convenience of handling smaller objects, and to serve these larger objects from a dedicated server. You certainly can filter the larger blobs away with a blob size filter, but when you really need these larger blobs, it is unclear how the new capability helps, as you cannot really tell what criteria the serving side that gave you the "promisor-remote" capability wants you to use to sift your requests between the original server and the new promisor. Wouldn't your requests _all_ be redirected to a single place, the promisor remote you learned via the capability?
>
> Coming up with a better alternative to LFS is certainly good, and it would be a worthwhile addition to the system. I just do not see how the topic of this series helps further that goal.

I guess it helps to address part of the problem. I'm not sure whether my understanding is aligned with Chris' intention, but I could certainly see that at some point in time we start to advertise promisor remote URLs that use different transport helpers to fetch objects. This would allow hosting providers to offload objects to e.g. blob storage or some such thing, and the client would know how to fetch them.

But there are still a couple of pieces missing in the bigger puzzle:

  - How would a client know to omit certain objects? Right now it only knows that there are promisor remotes, but it doesn't know that it e.g. should omit every blob larger than X megabytes. The answer could of course be that the client should just know to do a partial clone by themselves.

  - Storing those large objects locally is still expensive. We had discussions in the past where such objects could be stored uncompressed to stop wasting compute here. At GitLab, we're thinking about the ability to use rolling hash functions to chunk such big objects into smaller parts to also allow for somewhat efficient deduplication. We're also thinking about how to make the overall ODB pluggable such that we can eventually make it more scalable in this context. But that's of course thinking into the future quite a bit.

  - Local repositories would likely want to prune large objects that have not been accessed for a while to eventually regain some storage space.

I think chipping away the problems one by one is fine. But it would be nice to draw something like a "big picture" of where we eventually want to end up and how all the parts connect with each other to form a viable native replacement for Git LFS.

Also Cc'ing brian, who likely has a thing or two to say about this :)

Patrick
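[Editorial note on the first missing piece: everything the capability would carry can be wired up by hand today, which also shows exactly which bits of information (the extra remote's URL and the filter it assumes) the client currently has to learn out of band. A sketch with made-up remote name, URL and filter.]

    # Assumes the repository was already created with a partial clone
    # (extensions.partialClone is set by "git clone --filter=...").

    # Add a second remote that promises to serve the objects we omitted,
    # and record which filter this arrangement assumes.
    git remote add big-objects https://large.example.com/repo.git
    git config remote.big-objects.promisor true
    git config remote.big-objects.partialclonefilter blob:limit=10m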
On Mon, Sep 30, 2024 at 9:57 AM Patrick Steinhardt <ps@pks.im> wrote:
>
> On Fri, Sep 27, 2024 at 03:48:11PM -0700, Junio C Hamano wrote:
> > Christian Couder <christian.couder@gmail.com> writes:
> >
> > > By the way, there was an unconference breakout session on day 2 of the Git Merge called "Git LFS Can we do better?" where this was discussed with a number of people. Scott Chacon took some notes:
> > >
> > > https://github.com/git/git-merge/blob/main/breakouts/git-lfs.md
> >
> > Thanks for the link.
> >
> > > It was in parallel with the Contributor Summit, so few contributors participated in this session (maybe only Michael Haggerty, John Cai and me). But the impression of the GitLab people there, including me, was that folks in general would be happy to have an alternative to Git LFS based on this.
> >
> > I am not sure what "based on this" is really about, though.
> >
> > This series adds a feature to redirect requests from one server to another, but does it really do much to solve the problem LFS wants to solve? I would imagine that you would want to be able to manage larger objects separately, to avoid affecting the performance and convenience of handling smaller objects, and to serve these larger objects from a dedicated server. You certainly can filter the larger blobs away with a blob size filter, but when you really need these larger blobs, it is unclear how the new capability helps, as you cannot really tell what criteria the serving side that gave you the "promisor-remote" capability wants you to use to sift your requests between the original server and the new promisor. Wouldn't your requests _all_ be redirected to a single place, the promisor remote you learned via the capability?
> >
> > Coming up with a better alternative to LFS is certainly good, and it would be a worthwhile addition to the system. I just do not see how the topic of this series helps further that goal.
>
> I guess it helps to address part of the problem. I'm not sure whether my understanding is aligned with Chris' intention, but I could certainly see that at some point in time we start to advertise promisor remote URLs that use different transport helpers to fetch objects. This would allow hosting providers to offload objects to e.g. blob storage or some such thing, and the client would know how to fetch them.
>
> But there are still a couple of pieces missing in the bigger puzzle:
>
>   - How would a client know to omit certain objects? Right now it only knows that there are promisor remotes, but it doesn't know that it e.g. should omit every blob larger than X megabytes. The answer could of course be that the client should just know to do a partial clone by themselves.

If we add a "filter" field to the "promisor-remote" capability in a future patch series, then the server could pass information like a filter-spec that the client could use to omit some large blobs. Patch 3/4 has the following in its commit message about it: "In the future, it might be possible to pass other information like a filter-spec that the client should use when cloning from S".

>   - Storing those large objects locally is still expensive. We had discussions in the past where such objects could be stored uncompressed to stop wasting compute here.

Yeah, I think a new "verbatim" object representation in the object database, as discussed in https://lore.kernel.org/git/xmqqbkdometi.fsf@gitster.g/, is the most likely and easiest in the short term.

>     At GitLab, we're thinking about the ability to use rolling hash functions to chunk such big objects into smaller parts to also allow for somewhat efficient deduplication. We're also thinking about how to make the overall ODB pluggable such that we can eventually make it more scalable in this context. But that's of course thinking into the future quite a bit.

Yeah, there are different options for this. For example HuggingFace (https://huggingface.co/) recently acquired the XetHub company (see https://huggingface.co/blog/xethub-joins-hf), and said they might open source the XetHub software that does chunking and deduplicates chunks, so that could be an option too.

>   - Local repositories would likely want to prune large objects that have not been accessed for a while to eventually regain some storage space.

`git repack --filter` and such might already help a bit in this area. I agree that more work is needed though.

> I think chipping away the problems one by one is fine. But it would be nice to draw something like a "big picture" of where we eventually want to end up and how all the parts connect with each other to form a viable native replacement for Git LFS.

I have tried to discuss this at the Git Merge 2022 and 2024 and perhaps even before that. But as you know, it's difficult to make people agree on big projects that are not backed by patches and that might span several years (especially when very few people actually work on them and when they might have other things to work on too).

Thanks,
Christian.
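[Editorial note: the existing option mentioned above can already thin out a local object store along a filter. A rough example; the size threshold is arbitrary, and the omitted objects must remain fetchable from a promisor remote.]

    # Rewrite the local packs, leaving out blobs larger than 10 MiB.
    # Only safe if the omitted objects can still be obtained from a
    # promisor remote when they are needed again.
    git repack -a -d --filter=blob:limit=10m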
Patrick Steinhardt <ps@pks.im> writes:

> I guess it helps to address part of the problem. I'm not sure whether my understanding is aligned with Chris' intention, but I could certainly see that at some point in time we start to advertise promisor remote URLs that use different transport helpers to fetch objects. This would allow hosting providers to offload objects to e.g. blob storage or some such thing, and the client would know how to fetch them.
>
> But there are still a couple of pieces missing in the bigger puzzle:
> ...
> I think chipping away the problems one by one is fine. But it would be nice to draw something like a "big picture" of where we eventually want to end up and how all the parts connect with each other to form a viable native replacement for Git LFS.

Yes, thanks for stating this a lot more clearly than I said in the reviews so far.

> Also Cc'ing brian, who likely has a thing or two to say about this :)
Christian Couder <christian.couder@gmail.com> writes:

>> But there are still a couple of pieces missing in the bigger puzzle:
>>
>>   - How would a client know to omit certain objects? Right now it only knows that there are promisor remotes, but it doesn't know that it e.g. should omit every blob larger than X megabytes. The answer could of course be that the client should just know to do a partial clone by themselves.
>
> If we add a "filter" field to the "promisor-remote" capability in a future patch series, then the server could pass information like a filter-spec that the client could use to omit some large blobs.

Yes, but at that point, doesn't the current scheme of marking a promisor pack with a single bit, recording only the fact that the pack came from a promisor remote (which one? and with what filter settings was the remote used?), become insufficient? Chipping away one by one is fine, but we'd at least need to be aware that it is one of the things we need to upgrade in the scope of the bigger picture.

It may even be OK to upgrade the on-the-wire protocol side before the code on both ends learns to take advantage of the feature (e.g., to add the "promisor-remote" capability itself, or to add a capability that can also convey the associated filter specification to that remote), but without even the design (let alone the implementation) of what runs on both ends of the connection to make use of what is communicated via the capability, it is rather hard to get the details of the protocol design right.

As the on-the-wire protocol is harder to upgrade due to compatibility constraints, it smells like a better order of doing things would leave it as the _last_ piece to be designed and implemented, if we were to chip away one by one. That may, for example, go like this:

 (0) We want to ensure that projects can specify what kind of objects are to be offloaded to other transports.

 (1) We design the client end first. We may want to be able to choose what remote to run a lazy fetch against, based on a filter spec, for example. We realize and make a mental note that our new "capability" needs to tell the client enough information to make such a decision.

 (2) We design the server end to supply the above pieces of information to the client end. During this process, we may realize that some pieces of information cannot be prepared on the server end, and (1) may need to get adjusted.

 (3) There may be tons of other things that need to be designed and implemented before we know what pieces of information our new "capability" needs to convey, and what these pieces of information mean, by iterating (1) and (2).

 (4) Once we nail (3) down, we can add a new protocol capability, knowing how it should work, and knowing that the client and the server ends will work well once it is implemented.

>> At GitLab, we're thinking about the ability to use rolling hash functions to chunk such big objects into smaller parts to also allow for somewhat efficient deduplication. We're also thinking about how to make the overall ODB pluggable such that we can eventually make it more scalable in this context. But that's of course thinking into the future quite a bit.

Reminds me of rsync and bup ;-).

Thanks.
On 2024-09-30 at 07:57:17, Patrick Steinhardt wrote:
> But there are still a couple of pieces missing in the bigger puzzle:
>
>   - How would a client know to omit certain objects? Right now it only knows that there are promisor remotes, but it doesn't know that it e.g. should omit every blob larger than X megabytes. The answer could of course be that the client should just know to do a partial clone by themselves.

It would be helpful to have some sort of protocol v2 feature that says that a partial clone (of whatever sort) is recommended, and let honouring that be a config flag. Otherwise, you're going to have a bunch of users who try to download every giant object in the repository when they don't need to. Git LFS has the advantage that this is the default behaviour, which is really valuable.

>   - Storing those large objects locally is still expensive. We had discussions in the past where such objects could be stored uncompressed to stop wasting compute here. At GitLab, we're thinking about the ability to use rolling hash functions to chunk such big objects into smaller parts to also allow for somewhat efficient deduplication. We're also thinking about how to make the overall ODB pluggable such that we can eventually make it more scalable in this context. But that's of course thinking into the future quite a bit.

Git LFS has a `git lfs dedup` command, which takes the files in the working tree and creates a copy using the copy-on-write functionality in the operating system and file system to avoid duplicating them. There are certainly some users who simply cannot afford to store multiple copies of the files (say, because their repository is 500 GB), and this is important functionality for them. Note that this doesn't work for all file systems. It does for APFS on macOS, XFS and Btrfs on Linux, and ReFS on Windows, but not HFS+, ext4, or NTFS, which lack copy-on-write functionality.

We'd probably need to add an extension for uncompressed objects for this, since it's a repository format change, but it shouldn't be hard to do. In Git LFS, it's also possible to share a set of objects across repositories, although one must be careful not to prune them. We already have that through alternates, so I don't think we're lacking anything there.

>   - Local repositories would likely want to prune large objects that have not been accessed for a while to eventually regain some storage space.

Git LFS has a `git lfs prune` command for this as well. It does have to be run manually, though.

> I think chipping away the problems one by one is fine. But it would be nice to draw something like a "big picture" of where we eventually want to end up and how all the parts connect with each other to form a viable native replacement for Git LFS.

I think a native replacement would be a valuable feature. An essential component is going to be a way to handle this gracefully during pushes, since part of the goal of Git LFS is to get large blobs off the main server storage, where they tend to make repacks extremely expensive, and into an external store. Without that, it's unlikely that this feature is going to be viable on the server side. GitHub doesn't allow large blobs for exactly that reason, so we'd want some way to store them outside the main repository but still have the repo think they are present.

One idea I had about this was pluggable storage backends, which might be a nice feature to add via a dynamically loaded shared library. In addition, this seems like the kind of feature that one might like to use Rust for, since it probably will involve HTTP code, and generally people like doing that less in C (I do, at least).

> Also Cc'ing brian, who likely has a thing or two to say about this :)

I certainly have thought about this a lot. I will say that I've stepped down from being one of the Git LFS maintainers (endless supply of work, not nearly enough time), but I am still familiar with the architecture of the project.
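[Editorial note: to make the copy-on-write discussion above concrete, this is roughly what it looks like today with Git LFS, plus the underlying file-system primitive. File names are placeholders, and support depends on the file system, as noted above.]

    # Replace working-tree copies of LFS-tracked files with copy-on-write
    # clones of the objects stored under .git/lfs, where supported.
    git lfs dedup

    # Remove local copies of LFS objects that only old revisions reference.
    git lfs prune

    # The primitive involved: a reflink copy shares data blocks until one
    # side is modified (works on e.g. XFS and Btrfs, fails on plain ext4).
    cp --reflink=always big-asset.bin big-asset.copy.bin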
"brian m. carlson" <sandals@crustytoothpaste.net> writes: > One idea I had about this was pluggable storage backends, which might be > a nice feature to add via a dynamically loaded shared library. In > addition, this seems like the kind of feature that one might like to use > Rust for, since it probably will involve HTTP code, and generally people > like doing that less in C (I do, at least). Yes, yes, and yes. >> Also Cc'ing brian, who likely has a thing or two to say about this :) > > I certainly have thought about this a lot. I will say that I've stepped > down from being one of the Git LFS maintainers (endless supply of work, > not nearly enough time), but I am still familiar with the architecture > of the project. Thanks.
On Mon, Sep 30, 2024 at 03:27:14PM -0700, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
>
> > One idea I had about this was pluggable storage backends, which might be a nice feature to add via a dynamically loaded shared library. In addition, this seems like the kind of feature that one might like to use Rust for, since it probably will involve HTTP code, and generally people like doing that less in C (I do, at least).
>
> Yes, yes, and yes.

Indeed, I strongly agree with this. In fact, pluggable ODBs are the next big topic I'll be working on now that the refdb is pluggable. Naturally this is a huge undertaking that will likely take more on the order of years to realize, but one has to start at some point, I guess.

I'm also aligned with the idea of having something like dlopen-style implementations of the backends. While the reftable library is nice and fixes some of the issues that we have at GitLab, the more important win is that it demonstrates that the abstractions we have hold. Which also means that adding a new backend has gotten a ton easier now.

And yes, being able to implement self-contained features like a refdb implementation or an ODB implementation in Rust would be a sensible first step for adopting it. It doesn't interact with anything else, and initially we could continue to support platforms that do not have Rust by simply not compiling such a backend.

Patrick
On Mon, Sep 30, 2024 at 11:17:48AM +0200, Christian Couder wrote:
> On Mon, Sep 30, 2024 at 9:57 AM Patrick Steinhardt <ps@pks.im> wrote:
> > I think chipping away the problems one by one is fine. But it would be nice to draw something like a "big picture" of where we eventually want to end up and how all the parts connect with each other to form a viable native replacement for Git LFS.
>
> I have tried to discuss this at the Git Merge 2022 and 2024 and perhaps even before that. But as you know, it's difficult to make people agree on big projects that are not backed by patches and that might span several years (especially when very few people actually work on them and when they might have other things to work on too).

Certainly true, yeah. But we did have documents in our tree in the past that outlined long-term visions, and it may help the project as a whole to better understand the long-term vision we're heading toward. And by encouraging discussion up front, we may be able to spot any weaknesses and address them before it is too late.

Patrick