Message ID | 20200319174439.230969-1-jonathantanmy@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] fetch-pack: lazy fetch using tree:0 | expand |
On 3/19/2020 1:44 PM, Jonathan Tan wrote: > Support for partial clones with filtered trees was added in bc5975d24f > ("list-objects-filter: implement filter tree:0", 2018-10-07), but > whenever a lazy fetch of a tree is done, besides the tree itself, some > other objects that it references are also fetched. > > The "blob:none" filter was added to lazy fetches in 4c7f9567ea > ("fetch-pack: exclude blobs when lazy-fetching trees", 2018-10-04) to > restrict blobs from being fetched, but it didn't restrict trees. > ("tree:0", which would restrict all trees as well, wasn't added then > because "tree:0" was itself new and may not have been supported by Git > servers, as you can see from the dates of the commits.) > > Now that "tree:0" has been supported in Git for a while, teach lazy > fetches to use "tree:0" instead of "blob:none". > > (An alternative to doing this is to teach Git a new filter that only > returns exactly the objects requested, no more - but "tree:0" already > does that for us for now, hence this patch. If we were to support > filtering of commits in partial clones later, I think that specifying a > depth will work to restrict the commits returned, so we won't need an > additional filter anyway.) > --- > This looks like a good change to me - in particular, it makes Git align > with the (in my opinion, reasonable) mental model that when we lazily > fetch something, we only fetch that thing. Some issues that I can think > about: > > - Some hosts like GitHub support some partial clone filters, but not > "tree:0". > - I haven't figured out the performance implications yet. If we want a > tree, I think that we typically will want some of its subtrees, but > not all. > > Any thoughts? The end result of fetching missing objects one-by-one matches how the GVFS protocol has handled these tree misses in the past. While there may be a lot more round trips, it saves on excess data since a missing tree likely can reach several known trees and blobs. The real unknown here is how the "boundary" of missing trees is created. In the GVFS protocol, missing trees happen mostly when our pre-computed "prefetch pack-files" of commits and trees are behind the ref tips. The usage pattern for depth-limited or path-scoped filters is not quite as established as the blob-limited patterns (because they are similar to the behavior in VFS for Git and Scalar). The code seems to be doing what you say, but I highly recommend taking this for a spin on a real repository with a real remote, if possible. The more that we could get some numbers for which situations do better in one case or the other, the more this change can be adopted with confidence. Thanks, -Stolee
On Thu, Mar 19, 2020 at 10:44:39AM -0700, Jonathan Tan wrote: > Support for partial clones with filtered trees was added in bc5975d24f > ("list-objects-filter: implement filter tree:0", 2018-10-07), but > whenever a lazy fetch of a tree is done, besides the tree itself, some > other objects that it references are also fetched. > > The "blob:none" filter was added to lazy fetches in 4c7f9567ea > ("fetch-pack: exclude blobs when lazy-fetching trees", 2018-10-04) to > restrict blobs from being fetched, but it didn't restrict trees. > ("tree:0", which would restrict all trees as well, wasn't added then > because "tree:0" was itself new and may not have been supported by Git > servers, as you can see from the dates of the commits.) > > Now that "tree:0" has been supported in Git for a while, teach lazy > fetches to use "tree:0" instead of "blob:none". This does mean a new client fetching objects for a partial clone from an older server (pre-bc5975d24f) used to work, but now won't (we couldn't have fetched from it with a tree filter, but this patch makes the use of tree:0 unconditional; so even a blob:none followup fetch would use it). I'm not _too_ broken up about that, given that partial clone support at that era was pretty clearly labeled as experimental. But it would be a nice bonus to make it work everywhere. > (An alternative to doing this is to teach Git a new filter that only > returns exactly the objects requested, no more - but "tree:0" already > does that for us for now, hence this patch. If we were to support > filtering of commits in partial clones later, I think that specifying a > depth will work to restrict the commits returned, so we won't need an > additional filter anyway.) The depth thing might work for commits, though there are a lot of special code paths taken when the client is asking for shallow commits that might be better avoided. Being able to say "only send me the objects I'm asking for" seems like a much more direct way. It doesn't even need to be a filter, really; it could be a protocol capability. And in fact I think we'd want a capability either way, because clients would ideally be able to send the old "blob:none" for older servers, or the new "only what I'm asking for" with new servers. > --- > This looks like a good change to me - in particular, it makes Git align > with the (in my opinion, reasonable) mental model that when we lazily > fetch something, we only fetch that thing. Some issues that I can think > about: Yeah, I like the mental model. I just think it should be expressed even more explicitly. :) > - Some hosts like GitHub support some partial clone filters, but not > "tree:0". Yes, this is going to fail against GitHub servers, just like it would for older servers. One way to prevent that would be to use a "blob" filter if that's what we originally partial-cloned with. I don't know if that information always reliably makes it into this code path, though. I think I'd prefer a capability-based fix in the long run. We may support "tree:0" eventually at GitHub. It's quick to compute with bitmaps, just like "blob:none" is. But "tree:1" isn't. One side note (for Taylor, cc'd): our patches elsewhere to limit the allowed filters don't make it possible to express the difference between "tree:0" and "tree:1". It may be worth thinking about that, especially if it influences the config schema (since we'll have to support it forever once it makes it into a release). > - I haven't figured out the performance implications yet. If we want a > tree, I think that we typically will want some of its subtrees, but > not all. I could imagine a scenario where you want to get trees one level at a time in order to only grab the sub-trees you want based on pathnames (sort of like sparse-checkout's cone mode). Though you do get into "n+1" fetches based on tree depth there. If the latency for a fetch is high, it will be pretty painful. I can equally imagine there are cases where you want to grab the whole subtree in one go, but I think that raises another performance issue: you might already have most of it. E.g., consider a root tree with one toplevel subtree that contains a million files. You already have the root tree at some commit A. Now you want to diff against its parent, B. You ask the server for B^{tree}, and it sends you the million-entry tree, too (and maybe some blobs?). You could tell it you already have them, but you don't actually know what's in B^{tree} until you get it. And advertising all of your trees and blobs is prohibitively expensive. So I think that pushes us back towards wanting an "n+1" scheme, even if the latency is bad. And is really why partial clone is _so_ much easier if you just resign yourself to giving the client all the commits and trees. :) -Peff
Hi Peff, On Fri, Mar 20, 2020 at 02:12:14AM -0400, Jeff King wrote: > On Thu, Mar 19, 2020 at 10:44:39AM -0700, Jonathan Tan wrote: > > - Some hosts like GitHub support some partial clone filters, but not > > "tree:0". > > Yes, this is going to fail against GitHub servers, just like it would > for older servers. One way to prevent that would be to use a "blob" > filter if that's what we originally partial-cloned with. I don't know if > that information always reliably makes it into this code path, though. > I think I'd prefer a capability-based fix in the long run. > > We may support "tree:0" eventually at GitHub. It's quick to compute with > bitmaps, just like "blob:none" is. But "tree:1" isn't. I'm rolling this out shortly :). > One side note (for Taylor, cc'd): our patches elsewhere to limit the > allowed filters don't make it possible to express the difference between > "tree:0" and "tree:1". It may be worth thinking about that, especially > if it influences the config schema (since we'll have to support it > forever once it makes it into a release). They do now, mostly thanks to Peff's original thinking about separating out each filter choice into its own subsection, so that we can write not just: [uploadpack "filter.tree:depth"] allow = true but also: [uploadpack "filter.tree:depth"] allow = true maxDepth = <n> I have some patches that I'm rolling out to GitHub in the next day or so that do just that. I think that we will likely just support the no-trees version of this filter (i.e., '--filter=tree:0'), since that is the only case that is helped by having up-to-date bitmaps enabled. I assume that other administrators will do the same. I'm going to send these patches upstream in the not-too-distant future, because I want to address the original issues that I talked about in my RFC about adding these filter subsections to 'upload-pack's configuration in the first place in [1]. My hope is that I can do that while these patches are being vetted in the wild before I send them to the list. Thanks, Taylor [1]: https://lore.kernel.org/git/cover.1584477196.git.me@ttaylorr.com/
On Thu, Mar 26, 2020 at 01:50:58PM -0600, Taylor Blau wrote: > > One side note (for Taylor, cc'd): our patches elsewhere to limit the > > allowed filters don't make it possible to express the difference between > > "tree:0" and "tree:1". It may be worth thinking about that, especially > > if it influences the config schema (since we'll have to support it > > forever once it makes it into a release). > [...] > [uploadpack "filter.tree:depth"] > allow = true > maxDepth = <n> Yeah, that scheme solves the problem very neatly, and removes any worry I had about painting ourselves into a corner. Thanks. -Peff
diff --git a/fetch-pack.c b/fetch-pack.c index 1734a573b0..655ec5d4a3 100644 --- a/fetch-pack.c +++ b/fetch-pack.c @@ -1747,14 +1747,14 @@ struct ref *fetch_pack(struct fetch_pack_args *args, /* * The protocol does not support requesting that only the * wanted objects be sent, so approximate this by setting a - * "blob:none" filter if no filter is already set. This works - * for all object types: note that wanted blobs will still be + * "tree:0" filter if no filter is already set. This works + * for all object types: note that wanted blobs and trees will still be * sent because they are directly specified as a "want". * * NEEDSWORK: Add an option in the protocol to request that * only the wanted objects be sent, and implement it. */ - parse_list_objects_filter(&args->filter_options, "blob:none"); + parse_list_objects_filter(&args->filter_options, "tree:0"); } if (version != protocol_v2 && !ref) {