Message ID | 20191023233403.145457-1-jonathantanmy@google.com (mailing list archive)
---|---
State | New, archived
Series | [v3] fetch: delay fetch_if_missing=0 until after config
Jonathan Tan <jonathantanmy@google.com> writes:

> Suppose, from a repository that has ".gitmodules", we clone with
> --filter=blob:none:
>
>   git clone --filter=blob:none --no-checkout \
>     https://kernel.googlesource.com/pub/scm/git/git
>
> Then we fetch:
>
>   git -C git fetch
>
> This will cause an "unable to load config blob object" error, because
> the fetch_config_from_gitmodules() invocation in cmd_fetch() will attempt
> to load ".gitmodules" (which Git knows to exist because the client has
> the tree of HEAD) while fetch_if_missing is set to 0.
>
> fetch_if_missing is set to 0 too early - ".gitmodules" here should be
> lazily fetched. Git must set fetch_if_missing to 0 before the fetch
> because as part of the fetch, packfile negotiation happens (and we do
> not want to fetch any missing objects when checking existence of
> objects)...

Is it only me to feel that this is piling band-aids on top of
band-aids?

Perhaps the addition (and enabling) of lazy-fetch should have been
done after "checking existence" calls are vetted and sifted into two
categories? Some accesses to objects are "because we need it
now---so let's lazily fetch if that is available as a fallback
option to us", as opposed to "if we do not have it locally right
now, we want to know the fact". And each callsite should be able to
declare for what reason between the two it is making the access.

The single fetch-if-missing boolean may have been a quick-and-dirty
way to get the ball rolling, but perhaps the codebase grew up enough
so that it is time to wean off of it?
> > fetch_if_missing is set to 0 too early - ".gitmodules" here should be
> > lazily fetched. Git must set fetch_if_missing to 0 before the fetch
> > because as part of the fetch, packfile negotiation happens (and we do
> > not want to fetch any missing objects when checking existence of
> > objects)...
>
> Is it only me to feel that this is piling band-aids on top of
> band-aids?

To me, this is moving a band-aid, not adding another one. But to the
bigger point...

> Perhaps the addition (and enabling) of lazy-fetch should have been
> done after "checking existence" calls are vetted and sifted into two
> categories? Some accesses to objects are "because we need it
> now---so let's lazily fetch if that is available as a fallback
> option to us", as opposed to "if we do not have it locally right
> now, we want to know the fact". And each callsite should be able to
> declare for what reason between the two it is making the access.
>
> The single fetch-if-missing boolean may have been a quick-and-dirty
> way to get the ball rolling, but perhaps the codebase grew up enough
> so that it is time to wean off of it?

This is one of the alternatives I mentioned in the original email [1]
after "---". But to elaborate on that, I prefer sifting regions over
sifting calls.

Sifting calls into two categories might work, but it's error-prone in
that we would have to do the same line-by-line analysis we did when we
added the repository argument to many functions, and we would have to
modify functions like parse_commit() to take a flag similar to
OBJECT_INFO_SKIP_FETCH_OBJECT. Also, we would have to do the same
careful inspection for future patches.

Instead, we can control whether a region of code lazy-fetches by moving
fetch_if_missing to the struct repository object and using a struct
repository that has fetch_if_missing==0 when we don't want lazy
fetching. Besides being less error-prone, the infrastructure for this
has already been built (if I remember correctly, for submodules, at
first).

The microproject to put fetch_if_missing into struct repository and my
Outreachy project 'Refactor "git index-pack" logic into library code'
[2] are steps towards this goal. (I don't think that the latter is
strictly necessary, but it will make things simpler. In particular,
passing "-C" then the_repository->gitdir to index-pack makes a few
tests fail - not many, but not zero; and even after we resolve that,
we would need CLI parameters ensuring that we marshal everything we
need from the_repository, including fetch_if_missing. It seems better
to libify index-pack first.)

[1] https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/
[2] https://www.outreachy.org/apply/project-selection/#git
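To make the "sifting regions" proposal concrete, here is a small
standalone toy model in C. This is not Git code: struct repository has
no fetch_if_missing member as of this thread, and every name below is
illustrative only; it just sketches the shape of the idea.

/*
 * Standalone toy model of "move fetch_if_missing into the repository
 * object".  Not Git code: the real struct repository has no such
 * member as of this discussion, and the names are made up.
 */
#include <stdio.h>

struct repository {
	const char *gitdir;
	int fetch_if_missing;	/* 0: lookups through this handle never lazy-fetch */
};

/* Stand-in for an object lookup that may fall back to a lazy fetch. */
int read_object(const struct repository *r, const char *oid)
{
	int present_locally = 0;	/* pretend the object is a missing promisor object */

	if (present_locally)
		return 0;
	if (r->fetch_if_missing) {
		printf("lazily fetching %s from the promisor remote\n", oid);
		return 0;
	}
	return -1;	/* missing, and this handle forbids lazy fetching */
}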
Jonathan Tan <jonathantanmy@google.com> writes:

> To me, this is moving a band-aid, not adding another one. But to the

Fair enough.

> Sifting calls into two categories might work, but it's error-prone in
> that we would have to do the same line-by-line analysis we did when we
> added the repository argument to many functions, and we would have to
> modify functions like parse_commit() to take a flag similar to
> OBJECT_INFO_SKIP_FETCH_OBJECT. Also, we would have to do the same
> careful inspection for future patches.

Absolutely. That is the price to pay for the lazy-fetch feature.

> Instead, we can control whether a region of code lazy-fetches...

The approach "from here to there, we can set global to forbid
lazy-fetch" may prolong the life support of the quick-and-dirty
mechanism, but it has to assume you can say "from here to there"; it
would mean that we cannot go multi-threaded until we get off of it.

That is why I said it may be time for us to wean ourselves off of the
quick-and-dirty hack that helped us bootstrap the initial
implementation of lazy clone.
> > Instead, we can control whether a region of code lazy-fetches...
>
> The approach "from here to there, we can set global to forbid
> lazy-fetch" may prolong the life support of the quick-and-dirty
> mechanism, but it has to assume you can say "from here to there"; it
> would mean that we cannot go multi-threaded until we get off of it.

By "from here to there", I meant, for example, creating a struct
repository in cmd_fetch() that has fetch_if_missing=0, then passing that
repository to fetch_pack() (once fetch_pack() and all functions it calls
support a repository object). In that way, from here (start of
fetch_pack()) to there (end of fetch_pack()) there will be no lazy
fetching.

As a bonus, if we ever want to support fetching in repositories other
than the_repository (e.g. submodules), this change will allow us to
support that too.

I don't think this is quick-and-dirty, and I don't see how this
prevents multithreading.
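Continuing the toy model sketched earlier in the thread (still purely
illustrative, not Git code), the "from here to there" region described
above maps to the lifetime of a repository handle whose flag has been
cleared; the functions below assume the struct repository and
read_object() from that earlier sketch.

/*
 * Continuation of the earlier toy model.  The "no lazy fetch" region
 * is exactly the toy_fetch_pack() call, because that is the only code
 * that sees the handle with the flag cleared.
 */
void toy_fetch_pack(const struct repository *r)
{
	/* packfile negotiation: existence checks must not trigger fetches */
	if (read_object(r, "abc123") < 0)
		printf("abc123 treated as missing during negotiation\n");
}

void toy_cmd_fetch(const struct repository *repo)
{
	struct repository negotiating = *repo;

	negotiating.fetch_if_missing = 0;
	toy_fetch_pack(&negotiating);	/* no lazy fetching in here */

	/* outside the region, e.g. reading .gitmodules, may lazy-fetch */
	read_object(repo, "gitmodules-blob");
}

int main(void)
{
	struct repository repo = { ".git", 1 };

	toy_cmd_fetch(&repo);
	return 0;
}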
Jonathan Tan <jonathantanmy@google.com> writes:

>> > Instead, we can control whether a region of code lazy-fetches...
>>
>> The approach "from here to there, we can set global to forbid
>> lazy-fetch" may prolong the life support of the quick-and-dirty
>> mechanism, but it has to assume you can say "from here to there"; it
>> would mean that we cannot go multi-threaded until we get off of it.
>
> By "from here to there", I meant, for example, creating a struct
> repository in cmd_fetch() that has fetch_if_missing=0, then passing that
> repository to fetch_pack() (once fetch_pack() and all functions it calls
> support a repository object).

I know---but that means the struct cannot be shared among threads
that are calling object layer, some of which want to lazily fetch
missing objects while others only want to check the existence, at
the same time.

Compared to that, judicious use of OBJECT_INFO_SKIP_FETCH_OBJECT and
other flags by callers can tell the underlying machinery why we are
interested in the object, which I think is the right direction to go
in the longer term. What I am not certain about is if we are ready
to move to the right direction for the longer term, or we should still
be relying on the big-hammer global bit for expediency a bit longer.
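For concreteness, this is roughly what per-caller intent looks like
with the object-info flags that already exist in Git's object-store.h.
The two helper functions and their names are made up for illustration
and are meant to live inside the Git tree; exact function names and
signatures may differ between Git versions, so treat this as a sketch
rather than a reference.

/*
 * Illustration only: made-up helpers showing a caller declaring "just
 * tell me whether this object is available locally" via
 * OBJECT_INFO_SKIP_FETCH_OBJECT, versus the default read path that may
 * trigger a lazy fetch in a partial clone when fetch_if_missing is on.
 */
#include "cache.h"
#include "repository.h"
#include "object-store.h"

int have_it_locally(const struct object_id *oid)
{
	/*
	 * "If we do not have it locally right now, we want to know the
	 * fact": skip the lazy fetch; OBJECT_INFO_QUICK additionally
	 * skips some expensive retries when the object is absent.
	 */
	return oid_object_info_extended(the_repository, oid, NULL,
					OBJECT_INFO_SKIP_FETCH_OBJECT |
					OBJECT_INFO_QUICK) == 0;
}

void *need_it_now(const struct object_id *oid, unsigned long *size)
{
	enum object_type type;

	/*
	 * "We need it now": a plain read may fall back to lazily
	 * fetching the object from the promisor remote.
	 */
	return read_object_file(oid, &type, size);
}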
> Jonathan Tan <jonathantanmy@google.com> writes:
>
> >> > Instead, we can control whether a region of code lazy-fetches...
> >>
> >> The approach "from here to there, we can set global to forbid
> >> lazy-fetch" may prolong the life support of the quick-and-dirty
> >> mechanism, but it has to assume you can say "from here to there"; it
> >> would mean that we cannot go multi-threaded until we get off of it.
> >
> > By "from here to there", I meant, for example, creating a struct
> > repository in cmd_fetch() that has fetch_if_missing=0, then passing that
> > repository to fetch_pack() (once fetch_pack() and all functions it calls
> > support a repository object).
>
> I know---but that means the struct cannot be shared among threads
> that are calling object layer, some of which want to lazily fetch
> missing objects while others only want to check the existence, at
> the same time.

That's true. Mixing lazy-fetch and non-lazy-fetch in a function is
rare, but admittedly it does happen - in particular, when fetching, the
server may send us objects delta-ed against a missing object, and we
need to lazy-fetch those missing objects.

> Compared to that, judicious use of OBJECT_INFO_SKIP_FETCH_OBJECT and
> other flags by callers can tell the underlying machinery why we are
> interested in the object, which I think is the right direction to go
> in the longer term. What I am not certain about is if we are ready
> to move to the right direction for the longer term, or we should still
> be relying on the big-hammer global bit for expediency a bit longer.

I've taken a look at this and figured out tests that can demonstrate
what exactly is being lazy-fetched (and hopefully will be good enough
to prevent future code changes from adding extra lazy-fetches). In this
way, maybe we can at least reduce the scope of the big-hammer global. I
just sent out a patch [1].

[1] https://public-inbox.org/git/20191101203830.231676-1-jonathantanmy@google.com/
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 0c345b5dfe..863c858fde 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1755,8 +1755,6 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 
 	packet_trace_identity("fetch");
 
-	fetch_if_missing = 0;
-
 	/* Record the command line for the reflog */
 	strbuf_addstr(&default_rla, "fetch");
 	for (i = 1; i < argc; i++)
@@ -1824,6 +1822,8 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 		}
 	}
 
+	fetch_if_missing = 0;
+
 	if (remote) {
 		if (filter_options.choice || has_promisor_remote())
 			fetch_one_setup_partial(remote);
Suppose, from a repository that has ".gitmodules", we clone with
--filter=blob:none:

  git clone --filter=blob:none --no-checkout \
    https://kernel.googlesource.com/pub/scm/git/git

Then we fetch:

  git -C git fetch

This will cause an "unable to load config blob object" error, because
the fetch_config_from_gitmodules() invocation in cmd_fetch() will attempt
to load ".gitmodules" (which Git knows to exist because the client has
the tree of HEAD) while fetch_if_missing is set to 0.

fetch_if_missing is set to 0 too early - ".gitmodules" here should be
lazily fetched. Git must set fetch_if_missing to 0 before the fetch
because as part of the fetch, packfile negotiation happens (and we do
not want to fetch any missing objects when checking existence of
objects), but we do not need to set it so early. Move the setting of
fetch_if_missing to the earliest possible point in cmd_fetch(), right
before any fetching happens.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
No change from v2 except that I've added Signed-off-by.
---
 builtin/fetch.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)