[v3] fetch: delay fetch_if_missing=0 until after config

Message ID	20191023233403.145457-1-jonathantanmy@google.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=JW2X=YQ=vger.kernel.org=git-owner@kernel.org> Date: Wed, 23 Oct 2019 16:34:03 -0700 In-Reply-To: <20191007181825.13463-1-jonathantanmy@google.com> Message-Id: <20191023233403.145457-1-jonathantanmy@google.com> Mime-Version: 1.0 References: <20191007181825.13463-1-jonathantanmy@google.com> Subject: [PATCH v3] fetch: delay fetch_if_missing=0 until after config From: Jonathan Tan <jonathantanmy@google.com> To: git@vger.kernel.org Cc: Jonathan Tan <jonathantanmy@google.com>, emilyshaffer@google.com Content-Type: text/plain; charset="UTF-8" Sender: git-owner@vger.kernel.org Precedence: bulk
Series	[v3] fetch: delay fetch_if_missing=0 until after config \| expand [v3] fetch: delay fetch_if_missing=0 until after config

Jonathan Tan Oct. 23, 2019, 11:34 p.m. UTC

Suppose, from a repository that has ".gitmodules", we clone with
--filter=blob:none:

  git clone --filter=blob:none --no-checkout \
    https://kernel.googlesource.com/pub/scm/git/git

Then we fetch:

  git -C git fetch

This will cause a "unable to load config blob object", because the
fetch_config_from_gitmodules() invocation in cmd_fetch() will attempt to
load ".gitmodules" (which Git knows to exist because the client has the
tree of HEAD) while fetch_if_missing is set to 0.

fetch_if_missing is set to 0 too early - ".gitmodules" here should be
lazily fetched.  Git must set fetch_if_missing to 0 before the fetch
because as part of the fetch, packfile negotiation happens (and we do
not want to fetch any missing objects when checking existence of
objects), but we do not need to set it so early. Move the setting of
fetch_if_missing to the earliest possible point in cmd_fetch(), right
before any fetching happens.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
No change from v2 except that I've added Signed-off-by.
---
 builtin/fetch.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Junio C Hamano Oct. 24, 2019, 4:30 a.m. UTC | #1

Jonathan Tan <jonathantanmy@google.com> writes:

> Suppose, from a repository that has ".gitmodules", we clone with
> --filter=blob:none:
>
>   git clone --filter=blob:none --no-checkout \
>     https://kernel.googlesource.com/pub/scm/git/git
>
> Then we fetch:
>
>   git -C git fetch
>
> This will cause a "unable to load config blob object", because the
> fetch_config_from_gitmodules() invocation in cmd_fetch() will attempt to
> load ".gitmodules" (which Git knows to exist because the client has the
> tree of HEAD) while fetch_if_missing is set to 0.
>
> fetch_if_missing is set to 0 too early - ".gitmodules" here should be
> lazily fetched.  Git must set fetch_if_missing to 0 before the fetch
> because as part of the fetch, packfile negotiation happens (and we do
> not want to fetch any missing objects when checking existence of
> objects)...

Is it only me to feel that this is piling band-aids on top of
band-aids?

Perhaps the addition (and enabling) of lazy-fetch should have been
done after "checking existence" calls are vetted and sifted into two
categories?  Some accesses to objects are "because we need it
now---so let's lazily fetch if that is available as a fallback
option to us", as opposed to "if we do not have it locally right
now, we want to know the fact".  And each callsite should be able to
declare for what reason between the two it is making the access.

The single fetch-if-missing boolean may have been a quick-and-dirty
way to get the ball rolling, but perhaps the codebase grew up enough
so that it is time to wean off of it?

Jonathan Tan Oct. 24, 2019, 7:18 p.m. UTC | #2

> > fetch_if_missing is set to 0 too early - ".gitmodules" here should be
> > lazily fetched.  Git must set fetch_if_missing to 0 before the fetch
> > because as part of the fetch, packfile negotiation happens (and we do
> > not want to fetch any missing objects when checking existence of
> > objects)...
> 
> Is it only me to feel that this is piling band-aids on top of
> band-aids?

To me, this is moving a band-aid, not adding another one. But to the
bigger point...

> Perhaps the addition (and enabling) of lazy-fetch should have been
> done after "checking existence" calls are vetted and sifted into two
> categories?  Some accesses to objects are "because we need it
> now---so let's lazily fetch if that is available as a fallback
> option to us", as opposed to "if we do not have it locally right
> now, we want to know the fact".  And each callsite should be able to
> declare for what reason between the two it is making the access.
> 
> The single fetch-if-missing boolean may have been a quick-and-dirty
> way to get the ball rolling, but perhaps the codebase grew up enough
> so that it is time to wean off of it?

This is one of the alternatives I mentioned in the original email [1]
after "---". But to elaborate on that, I prefer sifting regions over
sifting calls.

Sifting calls into two categories might work, but it's error-prone in
that we would have to do the same line-by-line analysis we did when we
added the repository argument to many functions, and we would have to
modify functions like parse_commit() to take a flag similar to
OBJECT_INFO_SKIP_FETCH_OBJECT. Also, we would have to do the same
careful inspection for future patches.

Instead, we can control whether a region of code lazy-fetches by moving
fetch_if_missing to the struct repository object and using a struct
repository that has fetch_if_missing==0 when we don't want lazy
fetching. Besides being less error-prone, the infrastructure for this
has already been built (if I remember correctly, for submodules, at
first).

The microproject to put fetch_if_missing into struct repository and my
Outreachy project 'Refactor "git index-pack" logic into library code'
[2] are steps towards this goal. (I don't think that the latter is
strictly necessary, but it will make things simpler. In particular,
passing "-C" then the_repository->gitdir to index-pack makes a few tests
fail - not many, but not zero; and even after we resolve that, we would
need CLI parameters ensuring that we marshal everything we need from
the_repository, including fetch_if_missing. It seems better to libify
index-pack first.)

[1] https://public-inbox.org/git/20191007181825.13463-1-jonathantanmy@google.com/
[2] https://www.outreachy.org/apply/project-selection/#git

Junio C Hamano Oct. 25, 2019, 2:38 a.m. UTC | #3

Jonathan Tan <jonathantanmy@google.com> writes:

> To me, this is moving a band-aid, not adding another one. But to the

Fair enough.

> Sifting calls into two categories might work, but it's error-prone in
> that we would have to do the same line-by-line analysis we did when we
> added the repository argument to many functions, and we would have to
> modify functions like parse_commit() to take a flag similar to
> OBJECT_INFO_SKIP_FETCH_OBJECT. Also, we would have to do the same
> careful inspection for future patches.

Absolutely.  That is the price to pay for the lazy-fetch feature.

> Instead, we can control whether a region of code lazy-fetches...

The approach "from here to there, we can set global to forbid
lazy-fetch" may prolong the life support of the quick-and-dirty
mechanism, but it has to assume you can say "from here to there"; it
would mean that we cannot go multi-threaded until we get off of it.

That is why I said it may be time for us to wean us off of the
quick-and-dirty hack that helped us bootstrapping the initial
implementation of lazy clone.

Jonathan Tan Oct. 25, 2019, 5:41 p.m. UTC | #4

> > Instead, we can control whether a region of code lazy-fetches...
> 
> The approach "from here to there, we can set global to forbid
> lazy-fetch" may prolong the life support of the quick-and-dirty
> mechanism, but it has to assume you can say "from here to there"; it
> would mean that we cannot go multi-threaded until we get off of it.

By "from here to there", I meant, for example, creating a struct
repository in cmd_fetch() that has fetch_if_missing=0, then passing that
repository to fetch_pack() (once fetch_pack() and all functions it calls
support a repository object). In that way, from here (start of
fetch_pack()) to there (end of fetch_pack()) there will be no lazy
fetching. As a bonus, if we ever want to support fetching in
repositories other than the_repository (e.g. submodules), this change
will allow us to support that too. I don't think this is
quick-and-dirty, and I don't see how this prevents multithreading.

Junio C Hamano Oct. 29, 2019, 1:39 a.m. UTC | #5

Jonathan Tan <jonathantanmy@google.com> writes:

>> > Instead, we can control whether a region of code lazy-fetches...
>> 
>> The approach "from here to there, we can set global to forbid
>> lazy-fetch" may prolong the life support of the quick-and-dirty
>> mechanism, but it has to assume you can say "from here to there"; it
>> would mean that we cannot go multi-threaded until we get off of it.
>
> By "from here to there", I meant, for example, creating a struct
> repository in cmd_fetch() that has fetch_if_missing=0, then passing that
> repository to fetch_pack() (once fetch_pack() and all functions it calls
> support a repository object).

I know---but that means the struct cannot be shared among threads
that are calling object layer, some of which want to lazily fetch
missing objects while others only want to check the existence, at
the same time.

Compared to that, judicious use of OBJECT_INFO_SKIP_FETCH_OBJECT and
other flags by callers can tell the underlying machinery why we are
interested in the object, which I think is the right direction to go
in the longer term.  What I am not certain about is if we are ready
to move to the right direction for the longer term, or we should still
be relying on the big-hammer global bit for expediency a bit longer.

Jonathan Tan Nov. 1, 2019, 8:43 p.m. UTC | #6

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> >> > Instead, we can control whether a region of code lazy-fetches...
> >> 
> >> The approach "from here to there, we can set global to forbid
> >> lazy-fetch" may prolong the life support of the quick-and-dirty
> >> mechanism, but it has to assume you can say "from here to there"; it
> >> would mean that we cannot go multi-threaded until we get off of it.
> >
> > By "from here to there", I meant, for example, creating a struct
> > repository in cmd_fetch() that has fetch_if_missing=0, then passing that
> > repository to fetch_pack() (once fetch_pack() and all functions it calls
> > support a repository object).
> 
> I know---but that means the struct cannot be shared among threads
> that are calling object layer, some of which want to lazily fetch
> missing objects while others only want to check the existence, at
> the same time.

That's true. Mixing lazy-fetch and non-lazy-fetch in a function is rare,
but admittedly it does happen - in particular, when fetching, the server
may send us objects delta-ed against a missing object, and we need to
lazy-fetch those missing objects.

> Compared to that, judicious use of OBJECT_INFO_SKIP_FETCH_OBJECT and
> other flags by callers can tell the underlying machinery why we are
> interested in the object, which I think is the right direction to go
> in the longer term.  What I am not certain about is if we are ready
> to move to the right direction for the longer term, or we should still
> be relying on the big-hammer global bit for expediency a bit longer.

I've taken a look at this and figured out tests that can demonstrate
what exactly is being lazy-fetched (and hopefully will be good enough to
prevent future code changes from adding extra lazy-fetches). In this
way, maybe we can at least reduce the scope of the big-hammer global.

I just sent out a patch [1].

[1] https://public-inbox.org/git/20191101203830.231676-1-jonathantanmy@google.com/

[v3] fetch: delay fetch_if_missing=0 until after config

Commit Message

Comments

Patch