[0/8] fetch --recurse-submodules: fetch unpopulated submodules

Message ID	20220210044152.78352-1-chooglen@google.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> Date: Thu, 10 Feb 2022 12:41:44 +0800 Message-Id: <20220210044152.78352-1-chooglen@google.com> Mime-Version: 1.0 Subject: [PATCH 0/8] fetch --recurse-submodules: fetch unpopulated submodules From: Glen Choo <chooglen@google.com> To: git@vger.kernel.org Cc: Glen Choo <chooglen@google.com>, Jonathan Tan <jonathantanmy@google.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	fetch --recurse-submodules: fetch unpopulated submodules \| expand [0/8] fetch --recurse-submodules: fetch unpopulated submodules [1/8] submodule: inline submodule_commits() into caller [2/8] submodule: store new submodule commits oid_array in a struct [3/8] submodule: make static functions read submodules from commits [4/8] t5526: introduce test helper to assert on fetches [5/8] t5526: use grep to assert on fetches [6/8] submodule: extract get_fetch_task() [7/8] fetch: fetch unpopulated, changed submodules [8/8] submodule: fix bug and remove add_submodule_odb()

Message ID

20220210044152.78352-1-chooglen@google.com (mailing list archive)

Headers

Date: Thu, 10 Feb 2022 12:41:44 +0800
Message-Id: <20220210044152.78352-1-chooglen@google.com>
Mime-Version: 1.0
Subject: [PATCH 0/8] fetch --recurse-submodules: fetch unpopulated submodules
From: Glen Choo <chooglen@google.com>
To: git@vger.kernel.org
Cc: Glen Choo <chooglen@google.com>,
        Jonathan Tan <jonathantanmy@google.com>
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

Series

fetch --recurse-submodules: fetch unpopulated submodules | expand

Message

Glen Choo Feb. 10, 2022, 4:41 a.m. UTC

= Background

When fetching submodule commits, "git fetch --recurse-submodules" only
considers populated submodules, and not all of the submodules in
$GIT_DIR/modules as one might expect. As a result, "git fetch
--recurse-submodules" behaves differently based on which commit is
checked out.

This can be a problem, for instance, if the user has a branch with
submodules and a branch without:

  # the submodules were initialized at some point in history..
  git checkout -b branch-with-submodules origin/branch-with-submodules
  git submodule update --init

  # later down the road..
  git checkout --recurse-submodules branch-without-submodules
  # no submodules are fetched!
  git fetch --recurse-submodules
  # if origin/branch-with-submodules has new submodule commits, this
  # checkout will fail because we never fetched the submodule
  git checkout --recurse-submodules branch-with-submodules

This series makes "git fetch" fetch the right submodules regardless of
which commit is checked out, as long as the submodule has already been
cloned. In particular, "git fetch" learns to:

1. read submodules from the relevant superproject commit instead of
   the file system
2. fetch all changed submodules, even if they are not populated

= Patch organization

- Patches 1-3 teach "git fetch" to read .gitmodules from the
  superproject commit.
- Patches 4-5 are quality-of-life improvements to the test suite that
  make it easier to write the tests in patch 7.
- Patch 6 separates the steps of "finding which submodules to fetch" and
  "fetching the submodules", making it easier to tell "git fetch" to
  fetch unpopulated submodules.
- Patch 7 teaches "git fetch" to fetch changed, unpopulated submodules
  in addition to populated submodules.
- Patch 8 is an optional bugfix + cleanup of the "git fetch" code that
  removes the last caller of the deprecated "add_submodule_odb()".

= Future work

Even with this series, there is no guarantee that "git fetch" will fetch
every necessary submodule commit because a superproject commit can
introduce new submodules, and since those submodules are not cloned, "git
fetch" cannot fetch those commits yet. This series should get us closer
to that goal because "git fetch" can read submodules from the
superproject commit, which is a necessary precursor to figuring out
whether to clone submodules from superproject commits.

Glen Choo (8):
  submodule: inline submodule_commits() into caller
  submodule: store new submodule commits oid_array in a struct
  submodule: make static functions read submodules from commits
  t5526: introduce test helper to assert on fetches
  t5526: use grep to assert on fetches
  submodule: extract get_fetch_task()
  fetch: fetch unpopulated, changed submodules
  submodule: fix bug and remove add_submodule_odb()

 Documentation/fetch-options.txt |  26 ++-
 Documentation/git-fetch.txt     |  10 +-
 submodule.c                     | 316 ++++++++++++++++----------
 submodule.h                     |   9 +-
 t/t5526-fetch-submodules.sh     | 386 ++++++++++++++++++++++++--------
 5 files changed, 524 insertions(+), 223 deletions(-)


base-commit: 679e3693aba0c17af60c031f7eef68f2296b8dad

Comments

Junio C Hamano Feb. 10, 2022, 7:07 a.m. UTC | #1

Glen Choo <chooglen@google.com> writes:

> = Background
>
> When fetching submodule commits, "git fetch --recurse-submodules" only
> considers populated submodules, and not all of the submodules in
> $GIT_DIR/modules as one might expect. As a result, "git fetch
> --recurse-submodules" behaves differently based on which commit is
> checked out.

After getting 'init'ed, which is a sign that the user is interested
in that submodule, when we happen to check out a superproject commit
that lack the submodule in question, do we _lose_ the record that it
was once of interest?  I do not think so.  The cloned copy in
$GIT_DIR/modules/ would not go away, so we _could_ update it, even
there is no checkout, when the superproject we currently have may
not have the submodule.

But I am not sure if that is a problem.  After all, the recursive
fetch tries to make sure that the superproject commit that is
checked out is reproduced as recorded by fetching the submodule
commit recorded in the superproject commit, not a commit that
happens to be at the tip of random branch in the submodule.

It is OK to allow fetching into submodule that is not currently have
a checkout, but I think we should view it purely as prefetching.  We
do not even know, after doing such a fetch in the submodule, we have
the commit necessary for the _next_ commit in superproject we will
check out.

> This can be a problem, for instance, if the user has a branch with
> submodules and a branch without:
>
>   # the submodules were initialized at some point in history..
>   git checkout -b branch-with-submodules origin/branch-with-submodules
>   git submodule update --init
>
>   # later down the road..
>   git checkout --recurse-submodules branch-without-submodules
>   # no submodules are fetched!
>   git fetch --recurse-submodules
>   # if origin/branch-with-submodules has new submodule commits, this
>   # checkout will fail because we never fetched the submodule
>   git checkout --recurse-submodules branch-with-submodules

That is expected, and UNLESS we fetched _everything_ offered by the
remote repository to the submodule in the previous step, there is no
guarantee that this "recurse-submodules" checkout would succeed.

> This series makes "git fetch" fetch the right submodules regardless of
> which commit is checked out, as long as the submodule has already been
> cloned. In particular, "git fetch" learns to:
>
> 1. read submodules from the relevant superproject commit instead of
>    the file system
> 2. fetch all changed submodules, even if they are not populated

The real question is not "in which submodules we fetch", but "what
commits we fetch in these submodules".  I do not think there is a
good answer to the latter.

Of course, we we take this sequence instead:

	git checkout branch-with-submodules
	git fetch --recurse-submodules
	git checkout --recurse-submodules branch-with-submodules

things should work correctly (I think we both are assuming that the
other side allows to fetch _any_ object, not just ref), as "fetch"
knows what superproject commit it is asked to complete, unlike the
previous example you gave, where it does not have a clue on what
superproject commit it is preparing submodules for, right?

So, I am not quite sure if we are solving the right problem, let
alone with the right approach.

Also, if the strategy is to prefetch in all submodules that were
'init'ed, we do not have to read .gitmodules from the superproject
commit at all, right?  We can just go check .git/modules/ which is
available locally.  We need to see which submodules are of local
interest by consulting .git/config and/or .git/modules/ anyway even
if we read .gitmodules from the superproject commit to learn what
modules are there.

Thanks.

Glen Choo Feb. 10, 2022, 8:51 a.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

> Glen Choo <chooglen@google.com> writes:
>
>> = Background
>>
>> When fetching submodule commits, "git fetch --recurse-submodules" only
>> considers populated submodules, and not all of the submodules in
>> $GIT_DIR/modules as one might expect. As a result, "git fetch
>> --recurse-submodules" behaves differently based on which commit is
>> checked out.
>
> After getting 'init'ed, which is a sign that the user is interested
> in that submodule, when we happen to check out a superproject commit
> that lack the submodule in question, do we _lose_ the record that it
> was once of interest?  I do not think so.  The cloned copy in
> $GIT_DIR/modules/ would not go away, so we _could_ update it, even
> there is no checkout, when the superproject we currently have may
> not have the submodule.
>
> But I am not sure if that is a problem.  After all, the recursive
> fetch tries to make sure that the superproject commit that is
> checked out is reproduced as recorded by fetching the submodule
> commit recorded in the superproject commit, not a commit that
> happens to be at the tip of random branch in the submodule.
>
> It is OK to allow fetching into submodule that is not currently have
> a checkout, but I think we should view it purely as prefetching.  We
> do not even know, after doing such a fetch in the submodule, we have
> the commit necessary for the _next_ commit in superproject we will
> check out.

Hm, I may be misreading your message, but by "tip of random branch in
the submodule", did you mean "tip of random branch in the
_superproject_"?

If so, prior to this series, recursive fetch already fetches submodule
commits that are recorded by superproject commits other than the one
checked out. submodule.c:calculate_changed_submodule_paths() performs a
rev walk starting from the newly fetched superproject branch tips to
find missing submodule commits that are referenced by superproject
commits. These missing submodule commits are explicitly fetched by the
recursive fetch.

So we already do prefetching, but this series makes the prefetching
smarter by also prefetching in submodules that aren't checked out.

(I think my cover letter could have been clearer; I should have
explicitly called out the fact that we already do prefetching.)

>> This can be a problem, for instance, if the user has a branch with
>> submodules and a branch without:
>>
>>   # the submodules were initialized at some point in history..
>>   git checkout -b branch-with-submodules origin/branch-with-submodules
>>   git submodule update --init
>>
>>   # later down the road..
>>   git checkout --recurse-submodules branch-without-submodules
>>   # no submodules are fetched!
>>   git fetch --recurse-submodules
>>   # if origin/branch-with-submodules has new submodule commits, this
>>   # checkout will fail because we never fetched the submodule
>>   git checkout --recurse-submodules branch-with-submodules
>
> That is expected, and UNLESS we fetched _everything_ offered by the
> remote repository to the submodule in the previous step, there is no
> guarantee that this "recurse-submodules" checkout would succeed.

Yes. With the current prefetching, I don't think we make any guarantee
to the user that all submodule commits will be fetched (even if all of
the subomdules are checked out).

But if I understand the "find changed submodules" rev walk correctly, we
look for changed submodules in the ancestry chains of the branch tips
(but I'm not sure how the rev walk decides to stop). So we might be
_very close_ to fetching all the commits that we think users care about
even though we don't guarantee that all commits will be fetched.

>> This series makes "git fetch" fetch the right submodules regardless of
>> which commit is checked out, as long as the submodule has already been
>> cloned. In particular, "git fetch" learns to:
>>
>> 1. read submodules from the relevant superproject commit instead of
>>    the file system
>> 2. fetch all changed submodules, even if they are not populated
>
> The real question is not "in which submodules we fetch", but "what
> commits we fetch in these submodules".  I do not think there is a
> good answer to the latter.
>
> Of course, we we take this sequence instead:
>
> 	git checkout branch-with-submodules
> 	git fetch --recurse-submodules
> 	git checkout --recurse-submodules branch-with-submodules
> 	
> things should work correctly (I think we both are assuming that the
> other side allows to fetch _any_ object, not just ref), as "fetch"
> knows what superproject commit it is asked to complete, unlike the
> previous example you gave, where it does not have a clue on what
> superproject commit it is preparing submodules for, right?

So, given my prior description of recursive fetch, we actually _do_ know
which superproject commits to prepare for and which submodule commits to
fetch.

> Also, if the strategy is to prefetch in all submodules that were
> 'init'ed, we do not have to read .gitmodules from the superproject
> commit at all, right?  We can just go check .git/modules/ which is
> available locally.  We need to see which submodules are of local
> interest by consulting .git/config and/or .git/modules/ anyway even
> if we read .gitmodules from the superproject commit to learn what
> modules are there.

Hm, good point. Finding submodules of interest in .git/modules or
.git/config sounds like common sense (it's more obvious than trying to
identify all submodules by doing a rev walk at least). 

That said, just looking at what submodules we have doesn't tell us which
submodule commits we need, which is why we have the "find changed
submodules" rev walk. And since we already have the rev walk (which
tells us which superproject commits we care about), it's not that much
effort to fetch non-checked-out submodules.

So I think we'd eventually want to consult .git/modules and .git/config
(we'll have to do this when we start teaching "git fetch" to clone new
submodules, for example) but it's unnecessary for this series.

Junio C Hamano Feb. 10, 2022, 5:40 p.m. UTC | #3

Glen Choo <chooglen@google.com> writes:

>> It is OK to allow fetching into submodule that is not currently have
>> a checkout, but I think we should view it purely as prefetching.  We
>> do not even know, after doing such a fetch in the submodule, we have
>> the commit necessary for the _next_ commit in superproject we will
>> check out.
>
> Hm, I may be misreading your message, but by "tip of random branch in
> the submodule", did you mean "tip of random branch in the
> _superproject_"?

No, I meant something like "git submodule foreach 'git fetch --all'"
(or without '--all' to fetch whatever the refspec there tells us),
i.e. tips of branches in the submodule.

>> The real question is not "in which submodules we fetch", but "what
>> commits we fetch in these submodules".  I do not think there is a
>> good answer to the latter.
>>
>> Of course, we we take this sequence instead:
>>
>> 	git checkout branch-with-submodules
>> 	git fetch --recurse-submodules
>> 	git checkout --recurse-submodules branch-with-submodules
>> 	
>> things should work correctly (I think we both are assuming that the
>> other side allows to fetch _any_ object, not just ref), as "fetch"
>> knows what superproject commit it is asked to complete, unlike the
>> previous example you gave, where it does not have a clue on what
>> superproject commit it is preparing submodules for, right?
>
> So, given my prior description of recursive fetch, we actually _do_ know
> which superproject commits to prepare for and which submodule commits to
> fetch.

Just to make sure I understand what is going on, let me rephrase.

 * To find out which submodule commits we need to fetch, we find new
   commits in the superproject we just fetched, inspect the trees of
   these commits to see gitlinks that name commits we need to fetch
   into the submodule repositories.

 * For that to work well, we need to know, from the path these
   commits appear in the trees of the superproject, to find out from
   which submodule to fetch these commits from.  And to make the
   mapping from paths to submodule names, we need to read
   .gitmodules from the same superproject commit we found the
   submodule commit in (as during the history of the superproject,
   the submodule may have moved around).

If so, I understand why being able to read .gitmodules from
superproject commits is essential.  The flow would become like

 (1) fetch in the superproject

 (2) iterate over each new superproject commit:
     - read its .gitmodules
     - iterate over each gitlink found in the superproject commit:
       - map the path we found gitlink at into module name
       - find the submodule repository initialized for the module
         - if the submodule is not of local interest, skip
         - add the submodule commit pointed by gitlink to the
           set of commits that need to be fetched for the submodule [*]

 (3) iterate over each submodule we found more than one commits that
     need to be fetched in, and fetch these commits (we do not have
     to go over the network to re-fetch commits that exist in the
     object store and are reachable from the refs, but "fetch"
     already knows how to optimize that).

Am I on the right track?

Thanks.

Glen Choo Feb. 11, 2022, 2:39 a.m. UTC | #4

Junio C Hamano <gitster@pobox.com> writes:

>>> The real question is not "in which submodules we fetch", but "what
>>> commits we fetch in these submodules".  I do not think there is a
>>> good answer to the latter.
>>>
>>> Of course, we we take this sequence instead:
>>>
>>> 	git checkout branch-with-submodules
>>> 	git fetch --recurse-submodules
>>> 	git checkout --recurse-submodules branch-with-submodules
>>> 	
>>> things should work correctly (I think we both are assuming that the
>>> other side allows to fetch _any_ object, not just ref), as "fetch"
>>> knows what superproject commit it is asked to complete, unlike the
>>> previous example you gave, where it does not have a clue on what
>>> superproject commit it is preparing submodules for, right?
>>
>> So, given my prior description of recursive fetch, we actually _do_ know
>> which superproject commits to prepare for and which submodule commits to
>> fetch.
>
> Just to make sure I understand what is going on, let me rephrase.
>
>  * To find out which submodule commits we need to fetch, we find new
>    commits in the superproject we just fetched, inspect the trees of
>    these commits to see gitlinks that name commits we need to fetch
>    into the submodule repositories.
>
>  * For that to work well, we need to know, from the path these
>    commits appear in the trees of the superproject, to find out from
>    which submodule to fetch these commits from.  And to make the
>    mapping from paths to submodule names, we need to read
>    .gitmodules from the same superproject commit we found the
>    submodule commit in (as during the history of the superproject,
>    the submodule may have moved around).
>
> If so, I understand why being able to read .gitmodules from
> superproject commits is essential.  The flow would become like
>
>  (1) fetch in the superproject
>
>  (2) iterate over each new superproject commit:
>      - read its .gitmodules
>      - iterate over each gitlink found in the superproject commit:
>        - map the path we found gitlink at into module name
>        - find the submodule repository initialized for the module
>          - if the submodule is not of local interest, skip
>          - add the submodule commit pointed by gitlink to the
>            set of commits that need to be fetched for the submodule [*]
>
>  (3) iterate over each submodule we found more than one commits that
>      need to be fetched in, and fetch these commits (we do not have
>      to go over the network to re-fetch commits that exist in the
>      object store and are reachable from the refs, but "fetch"
>      already knows how to optimize that).
>
> Am I on the right track?

Yup, I think that's quite an accurate description. In particular..

>  (2) iterate over each new superproject commit:
>      - read its .gitmodules

Prior to this series, .gitmodules is read from the filesystem even
though we may notice the missing commit in a non-checked-out
superproject commit.

>  (3) iterate over each submodule we found more than one commits that
>      need to be fetched in, and fetch these commits

Yes, this describes the new "fetch changed submodules behavior"
accurately. However, we also attempt to fetch checked out submodules,
and this is where the two fetching strategies, "yes" and "on-demand" [1]
matter:

"yes", aka "--recurse-submodules" tells "git fetch" to attempt to fetch
_every_ checked out submodule regardless of whether Git thinks it has
missing commits (if we do not find any missing commits, I believe it
defaults to the "git fetch <remote-name>" form). [2]

"on-demand", aka "--recurse-submodules=on-demand", is the 'default'
option. (It is 'default' as in the sense of not passing a
"--recurse-submodules" arg, not default as in passing
"--recurse-submodules" without an "="). With "on-demand", we _only_
attempt to fetch changed submodules. Vis-a-vis "yes", this strategy has
no effect on fetching non-checked-out submodules because we only
fetch changed, non-checked-out submodules anyway, but it lets us ignore
unchanged, checked out submodules.

[1] From Documentation/fetch-options.txt:

--recurse-submodules[=yes|on-demand|no]::
  This option controls if and under what conditions new commits of
  populated submodules should be fetched too. It can be used as a
  boolean option to completely disable recursion when set to 'no' or to
  unconditionally recurse into all populated submodules when set to
  'yes', which is the default when this option is used without any
  value. Use 'on-demand' to only recurse into a populated submodule
  when the superproject retrieves a commit that updates the submodule's
  reference to a commit that isn't already in the local submodule
  clone. By default, 'on-demand' is used, unless
`fetch.recurseSubmodules` is set (see linkgit:git-config[1]).

[2] Sidenote on the "yes" option: I think the rationale for doing
unconditional fetches is not clear to reviewers. Admittedly, beyond 'the
test suite fails', I don't really remember why either. I'll dig into
this as I respond to the reviewer feedback.