diff mbox series

[RFC] fetch-pack: lazy fetch using tree:0

Message ID 20200319174439.230969-1-jonathantanmy@google.com (mailing list archive)
State New, archived
Headers show
Series [RFC] fetch-pack: lazy fetch using tree:0 | expand

Commit Message

Jonathan Tan March 19, 2020, 5:44 p.m. UTC
Support for partial clones with filtered trees was added in bc5975d24f
("list-objects-filter: implement filter tree:0", 2018-10-07), but
whenever a lazy fetch of a tree is done, besides the tree itself, some
other objects that it references are also fetched.

The "blob:none" filter was added to lazy fetches in 4c7f9567ea
("fetch-pack: exclude blobs when lazy-fetching trees", 2018-10-04) to
restrict blobs from being fetched, but it didn't restrict trees.
("tree:0", which would restrict all trees as well, wasn't added then
because "tree:0" was itself new and may not have been supported by Git
servers, as you can see from the dates of the commits.)

Now that "tree:0" has been supported in Git for a while, teach lazy
fetches to use "tree:0" instead of "blob:none".

(An alternative to doing this is to teach Git a new filter that only
returns exactly the objects requested, no more - but "tree:0" already
does that for us for now, hence this patch. If we were to support
filtering of commits in partial clones later, I think that specifying a
depth will work to restrict the commits returned, so we won't need an
additional filter anyway.)
---
This looks like a good change to me - in particular, it makes Git align
with the (in my opinion, reasonable) mental model that when we lazily
fetch something, we only fetch that thing. Some issues that I can think
about:

 - Some hosts like GitHub support some partial clone filters, but not
   "tree:0".
 - I haven't figured out the performance implications yet. If we want a
   tree, I think that we typically will want some of its subtrees, but
   not all.

Any thoughts?
---
 fetch-pack.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Derrick Stolee March 19, 2020, 7:58 p.m. UTC | #1
On 3/19/2020 1:44 PM, Jonathan Tan wrote:
> Support for partial clones with filtered trees was added in bc5975d24f
> ("list-objects-filter: implement filter tree:0", 2018-10-07), but
> whenever a lazy fetch of a tree is done, besides the tree itself, some
> other objects that it references are also fetched.
> 
> The "blob:none" filter was added to lazy fetches in 4c7f9567ea
> ("fetch-pack: exclude blobs when lazy-fetching trees", 2018-10-04) to
> restrict blobs from being fetched, but it didn't restrict trees.
> ("tree:0", which would restrict all trees as well, wasn't added then
> because "tree:0" was itself new and may not have been supported by Git
> servers, as you can see from the dates of the commits.)
> 
> Now that "tree:0" has been supported in Git for a while, teach lazy
> fetches to use "tree:0" instead of "blob:none".
> 
> (An alternative to doing this is to teach Git a new filter that only
> returns exactly the objects requested, no more - but "tree:0" already
> does that for us for now, hence this patch. If we were to support
> filtering of commits in partial clones later, I think that specifying a
> depth will work to restrict the commits returned, so we won't need an
> additional filter anyway.)
> ---
> This looks like a good change to me - in particular, it makes Git align
> with the (in my opinion, reasonable) mental model that when we lazily
> fetch something, we only fetch that thing. Some issues that I can think
> about:
> 
>  - Some hosts like GitHub support some partial clone filters, but not
>    "tree:0".
>  - I haven't figured out the performance implications yet. If we want a
>    tree, I think that we typically will want some of its subtrees, but
>    not all.
> 
> Any thoughts?

The end result of fetching missing objects one-by-one matches how the
GVFS protocol has handled these tree misses in the past. While there
may be a lot more round trips, it saves on excess data since a
missing tree likely can reach several known trees and blobs.

The real unknown here is how the "boundary" of missing trees is
created. In the GVFS protocol, missing trees happen mostly when our
pre-computed "prefetch pack-files" of commits and trees are behind the
ref tips.

The usage pattern for depth-limited or path-scoped filters is not
quite as established as the blob-limited patterns (because they are
similar to the behavior in VFS for Git and Scalar).

The code seems to be doing what you say, but I highly recommend taking
this for a spin on a real repository with a real remote, if possible.
The more that we could get some numbers for which situations do better
in one case or the other, the more this change can be adopted with
confidence.

Thanks,
-Stolee
Jeff King March 20, 2020, 6:12 a.m. UTC | #2
On Thu, Mar 19, 2020 at 10:44:39AM -0700, Jonathan Tan wrote:

> Support for partial clones with filtered trees was added in bc5975d24f
> ("list-objects-filter: implement filter tree:0", 2018-10-07), but
> whenever a lazy fetch of a tree is done, besides the tree itself, some
> other objects that it references are also fetched.
> 
> The "blob:none" filter was added to lazy fetches in 4c7f9567ea
> ("fetch-pack: exclude blobs when lazy-fetching trees", 2018-10-04) to
> restrict blobs from being fetched, but it didn't restrict trees.
> ("tree:0", which would restrict all trees as well, wasn't added then
> because "tree:0" was itself new and may not have been supported by Git
> servers, as you can see from the dates of the commits.)
> 
> Now that "tree:0" has been supported in Git for a while, teach lazy
> fetches to use "tree:0" instead of "blob:none".

This does mean a new client fetching objects for a partial clone from an
older server (pre-bc5975d24f) used to work, but now won't (we couldn't
have fetched from it with a tree filter, but this patch makes the use of
tree:0 unconditional; so even a blob:none followup fetch would use it).
I'm not _too_ broken up about that, given that partial clone support at
that era was pretty clearly labeled as experimental. But it would be a
nice bonus to make it work everywhere.

> (An alternative to doing this is to teach Git a new filter that only
> returns exactly the objects requested, no more - but "tree:0" already
> does that for us for now, hence this patch. If we were to support
> filtering of commits in partial clones later, I think that specifying a
> depth will work to restrict the commits returned, so we won't need an
> additional filter anyway.)

The depth thing might work for commits, though there are a lot of
special code paths taken when the client is asking for shallow commits
that might be better avoided.

Being able to say "only send me the objects I'm asking for" seems like a
much more direct way. It doesn't even need to be a filter, really; it
could be a protocol capability. And in fact I think we'd want a
capability either way, because clients would ideally be able to send the
old "blob:none" for older servers, or the new "only what I'm asking for"
with new servers.

> ---
> This looks like a good change to me - in particular, it makes Git align
> with the (in my opinion, reasonable) mental model that when we lazily
> fetch something, we only fetch that thing. Some issues that I can think
> about:

Yeah, I like the mental model. I just think it should be expressed even
more explicitly. :)

>  - Some hosts like GitHub support some partial clone filters, but not
>    "tree:0".

Yes, this is going to fail against GitHub servers, just like it would
for older servers. One way to prevent that would be to use a "blob"
filter if that's what we originally partial-cloned with. I don't know if
that information always reliably makes it into this code path, though.
I think I'd prefer a capability-based fix in the long run.

We may support "tree:0" eventually at GitHub. It's quick to compute with
bitmaps, just like "blob:none" is. But "tree:1" isn't.

One side note (for Taylor, cc'd): our patches elsewhere to limit the
allowed filters don't make it possible to express the difference between
"tree:0" and "tree:1". It may be worth thinking about that, especially
if it influences the config schema (since we'll have to support it
forever once it makes it into a release).

>  - I haven't figured out the performance implications yet. If we want a
>    tree, I think that we typically will want some of its subtrees, but
>    not all.

I could imagine a scenario where you want to get trees one level at a
time in order to only grab the sub-trees you want based on pathnames
(sort of like sparse-checkout's cone mode). Though you do get into "n+1"
fetches based on tree depth there. If the latency for a fetch is high,
it will be pretty painful.

I can equally imagine there are cases where you want to grab the whole
subtree in one go, but I think that raises another performance issue:
you might already have most of it. E.g., consider a root tree with one
toplevel subtree that contains a million files. You already have the
root tree at some commit A. Now you want to diff against its parent, B.
You ask the server for B^{tree}, and it sends you the million-entry
tree, too (and maybe some blobs?). You could tell it you already have
them, but you don't actually know what's in B^{tree} until you get it.
And advertising all of your trees and blobs is prohibitively expensive.

So I think that pushes us back towards wanting an "n+1" scheme, even if
the latency is bad. And is really why partial clone is _so_ much easier
if you just resign yourself to giving the client all the commits and
trees. :)

-Peff
Taylor Blau March 26, 2020, 7:50 p.m. UTC | #3
Hi Peff,

On Fri, Mar 20, 2020 at 02:12:14AM -0400, Jeff King wrote:
> On Thu, Mar 19, 2020 at 10:44:39AM -0700, Jonathan Tan wrote:
> >  - Some hosts like GitHub support some partial clone filters, but not
> >    "tree:0".
>
> Yes, this is going to fail against GitHub servers, just like it would
> for older servers. One way to prevent that would be to use a "blob"
> filter if that's what we originally partial-cloned with. I don't know if
> that information always reliably makes it into this code path, though.
> I think I'd prefer a capability-based fix in the long run.
>
> We may support "tree:0" eventually at GitHub. It's quick to compute with
> bitmaps, just like "blob:none" is. But "tree:1" isn't.

I'm rolling this out shortly :).

> One side note (for Taylor, cc'd): our patches elsewhere to limit the
> allowed filters don't make it possible to express the difference between
> "tree:0" and "tree:1". It may be worth thinking about that, especially
> if it influences the config schema (since we'll have to support it
> forever once it makes it into a release).

They do now, mostly thanks to Peff's original thinking about separating
out each filter choice into its own subsection, so that we can write not
just:

  [uploadpack "filter.tree:depth"]
    allow = true

but also:

  [uploadpack "filter.tree:depth"]
    allow = true
    maxDepth = <n>

I have some patches that I'm rolling out to GitHub in the next day or so
that do just that. I think that we will likely just support the no-trees
version of this filter (i.e., '--filter=tree:0'), since that is the only
case that is helped by having up-to-date bitmaps enabled. I assume that
other administrators will do the same.

I'm going to send these patches upstream in the not-too-distant future,
because I want to address the original issues that I talked about in my
RFC about adding these filter subsections to 'upload-pack's
configuration in the first place in [1].

My hope is that I can do that while these patches are being vetted in
the wild before I send them to the list.

Thanks,
Taylor

[1]: https://lore.kernel.org/git/cover.1584477196.git.me@ttaylorr.com/
Jeff King March 27, 2020, 9:37 a.m. UTC | #4
On Thu, Mar 26, 2020 at 01:50:58PM -0600, Taylor Blau wrote:

> > One side note (for Taylor, cc'd): our patches elsewhere to limit the
> > allowed filters don't make it possible to express the difference between
> > "tree:0" and "tree:1". It may be worth thinking about that, especially
> > if it influences the config schema (since we'll have to support it
> > forever once it makes it into a release).
> [...]
>   [uploadpack "filter.tree:depth"]
>     allow = true
>     maxDepth = <n>

Yeah, that scheme solves the problem very neatly, and removes any worry
I had about painting ourselves into a corner. Thanks.

-Peff
diff mbox series

Patch

diff --git a/fetch-pack.c b/fetch-pack.c
index 1734a573b0..655ec5d4a3 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1747,14 +1747,14 @@  struct ref *fetch_pack(struct fetch_pack_args *args,
 		/*
 		 * The protocol does not support requesting that only the
 		 * wanted objects be sent, so approximate this by setting a
-		 * "blob:none" filter if no filter is already set. This works
-		 * for all object types: note that wanted blobs will still be
+		 * "tree:0" filter if no filter is already set. This works
+		 * for all object types: note that wanted blobs and trees will still be
 		 * sent because they are directly specified as a "want".
 		 *
 		 * NEEDSWORK: Add an option in the protocol to request that
 		 * only the wanted objects be sent, and implement it.
 		 */
-		parse_list_objects_filter(&args->filter_options, "blob:none");
+		parse_list_objects_filter(&args->filter_options, "tree:0");
 	}
 
 	if (version != protocol_v2 && !ref) {