Message ID | 20220210044152.78352-1-chooglen@google.com (mailing list archive)
Series     | fetch --recurse-submodules: fetch unpopulated submodules
Glen Choo <chooglen@google.com> writes:

> = Background
>
> When fetching submodule commits, "git fetch --recurse-submodules" only considers populated submodules, and not all of the submodules in $GIT_DIR/modules as one might expect. As a result, "git fetch --recurse-submodules" behaves differently based on which commit is checked out.

After getting 'init'ed, which is a sign that the user is interested in that submodule, when we happen to check out a superproject commit that lacks the submodule in question, do we _lose_ the record that it was once of interest? I do not think so. The cloned copy in $GIT_DIR/modules/ would not go away, so we _could_ update it even when there is no checkout, i.e. even when the superproject commit we currently have does not have the submodule.

But I am not sure if that is a problem. After all, the recursive fetch tries to make sure that the superproject commit that is checked out is reproduced as recorded, by fetching the submodule commit recorded in the superproject commit, not a commit that happens to be at the tip of random branch in the submodule.

It is OK to allow fetching into a submodule that does not currently have a checkout, but I think we should view it purely as prefetching. We do not even know, after doing such a fetch in the submodule, whether we have the commit necessary for the _next_ commit in the superproject we will check out.

> This can be a problem, for instance, if the user has a branch with submodules and a branch without:
>
> # the submodules were initialized at some point in history..
> git checkout -b branch-with-submodules origin/branch-with-submodules
> git submodule update --init
>
> # later down the road..
> git checkout --recurse-submodules branch-without-submodules
> # no submodules are fetched!
> git fetch --recurse-submodules
> # if origin/branch-with-submodules has new submodule commits, this
> # checkout will fail because we never fetched the submodule
> git checkout --recurse-submodules branch-with-submodules

That is expected, and UNLESS we fetched _everything_ offered by the remote repository into the submodule in the previous step, there is no guarantee that this "recurse-submodules" checkout would succeed.

> This series makes "git fetch" fetch the right submodules regardless of which commit is checked out, as long as the submodule has already been cloned. In particular, "git fetch" learns to:
>
> 1. read submodules from the relevant superproject commit instead of the file system
> 2. fetch all changed submodules, even if they are not populated

The real question is not "in which submodules we fetch", but "what commits we fetch in these submodules". I do not think there is a good answer to the latter.

Of course, if we take this sequence instead:

    git checkout branch-with-submodules
    git fetch --recurse-submodules
    git checkout --recurse-submodules branch-with-submodules

things should work correctly (I think we both are assuming that the other side allows us to fetch _any_ object, not just refs), as "fetch" knows what superproject commit it is asked to complete, unlike the previous example you gave, where it does not have a clue on what superproject commit it is preparing submodules for, right?

So, I am not quite sure if we are solving the right problem, let alone with the right approach.

Also, if the strategy is to prefetch in all submodules that were 'init'ed, we do not have to read .gitmodules from the superproject commit at all, right? We can just go check .git/modules/ which is available locally. We need to see which submodules are of local interest by consulting .git/config and/or .git/modules/ anyway, even if we read .gitmodules from the superproject commit to learn what modules are there.

Thanks.
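(For illustration only: the "which submodules are of local interest" information Junio refers to can be inspected from the command line like this; it is a sketch of where that state lives, not what the series implements.)

    # submodules the user has 'init'ed (their URLs are copied into .git/config)
    git config --get-regexp '^submodule\..*\.url$'

    # submodules that have actually been cloned under the superproject's $GIT_DIR
    ls .git/modules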
Junio C Hamano <gitster@pobox.com> writes:

> Glen Choo <chooglen@google.com> writes:
>
>> = Background
>>
>> When fetching submodule commits, "git fetch --recurse-submodules" only considers populated submodules, and not all of the submodules in $GIT_DIR/modules as one might expect. As a result, "git fetch --recurse-submodules" behaves differently based on which commit is checked out.
>
> After getting 'init'ed, which is a sign that the user is interested in that submodule, when we happen to check out a superproject commit that lacks the submodule in question, do we _lose_ the record that it was once of interest? I do not think so. The cloned copy in $GIT_DIR/modules/ would not go away, so we _could_ update it even when there is no checkout, i.e. even when the superproject commit we currently have does not have the submodule.
>
> But I am not sure if that is a problem. After all, the recursive fetch tries to make sure that the superproject commit that is checked out is reproduced as recorded, by fetching the submodule commit recorded in the superproject commit, not a commit that happens to be at the tip of random branch in the submodule.
>
> It is OK to allow fetching into a submodule that does not currently have a checkout, but I think we should view it purely as prefetching. We do not even know, after doing such a fetch in the submodule, whether we have the commit necessary for the _next_ commit in the superproject we will check out.

Hm, I may be misreading your message, but by "tip of random branch in the submodule", did you mean "tip of random branch in the _superproject_"?

If so, prior to this series, recursive fetch already fetches submodule commits that are recorded by superproject commits other than the one checked out. submodule.c:calculate_changed_submodule_paths() performs a rev walk starting from the newly fetched superproject branch tips to find missing submodule commits that are referenced by superproject commits. These missing submodule commits are explicitly fetched by the recursive fetch.

So we already do prefetching, but this series makes the prefetching smarter by also prefetching in submodules that aren't checked out. (I think my cover letter could have been clearer; I should have explicitly called out the fact that we already do prefetching.)

>> This can be a problem, for instance, if the user has a branch with submodules and a branch without:
>>
>> # the submodules were initialized at some point in history..
>> git checkout -b branch-with-submodules origin/branch-with-submodules
>> git submodule update --init
>>
>> # later down the road..
>> git checkout --recurse-submodules branch-without-submodules
>> # no submodules are fetched!
>> git fetch --recurse-submodules
>> # if origin/branch-with-submodules has new submodule commits, this
>> # checkout will fail because we never fetched the submodule
>> git checkout --recurse-submodules branch-with-submodules
>
> That is expected, and UNLESS we fetched _everything_ offered by the remote repository into the submodule in the previous step, there is no guarantee that this "recurse-submodules" checkout would succeed.

Yes. With the current prefetching, I don't think we make any guarantee to the user that all submodule commits will be fetched (even if all of the submodules are checked out). But if I understand the "find changed submodules" rev walk correctly, we look for changed submodules in the ancestry chains of the branch tips (but I'm not sure how the rev walk decides to stop). So we might be _very close_ to fetching all the commits that we think users care about, even though we don't guarantee that all commits will be fetched.

>> This series makes "git fetch" fetch the right submodules regardless of which commit is checked out, as long as the submodule has already been cloned. In particular, "git fetch" learns to:
>>
>> 1. read submodules from the relevant superproject commit instead of the file system
>> 2. fetch all changed submodules, even if they are not populated
>
> The real question is not "in which submodules we fetch", but "what commits we fetch in these submodules". I do not think there is a good answer to the latter.
>
> Of course, if we take this sequence instead:
>
>     git checkout branch-with-submodules
>     git fetch --recurse-submodules
>     git checkout --recurse-submodules branch-with-submodules
>
> things should work correctly (I think we both are assuming that the other side allows us to fetch _any_ object, not just refs), as "fetch" knows what superproject commit it is asked to complete, unlike the previous example you gave, where it does not have a clue on what superproject commit it is preparing submodules for, right?

So, given my prior description of recursive fetch, we actually _do_ know which superproject commits to prepare for and which submodule commits to fetch.

> Also, if the strategy is to prefetch in all submodules that were 'init'ed, we do not have to read .gitmodules from the superproject commit at all, right? We can just go check .git/modules/ which is available locally. We need to see which submodules are of local interest by consulting .git/config and/or .git/modules/ anyway, even if we read .gitmodules from the superproject commit to learn what modules are there.

Hm, good point. Finding submodules of interest in .git/modules or .git/config sounds like common sense (it's more obvious than trying to identify all submodules by doing a rev walk, at least).

That said, just looking at what submodules we have doesn't tell us which submodule commits we need, which is why we have the "find changed submodules" rev walk. And since we already have the rev walk (which tells us which superproject commits we care about), it's not that much effort to fetch non-checked-out submodules.

So I think we'd eventually want to consult .git/modules and .git/config (we'll have to do this when we start teaching "git fetch" to clone new submodules, for example), but it's unnecessary for this series.
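(A rough illustration of the rev walk described above, not the actual logic in submodule.c: it simply lists every gitlink recorded by each newly fetched superproject commit, which over-approximates "changed". $new_tips and $old_tips are placeholders for the just-fetched and previously known superproject tips.)

    # walk the superproject commits that are new since the last fetch...
    git rev-list $new_tips --not $old_tips |
    while read commit
    do
        # ...and list the gitlink entries (mode 160000) each commit records
        git ls-tree -r "$commit" | awk '$1 == "160000" { print $4, $3 }'
    done | sort -u
    # any listed submodule commit that is missing locally still needs fetching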
Glen Choo <chooglen@google.com> writes:

>> It is OK to allow fetching into a submodule that does not currently have a checkout, but I think we should view it purely as prefetching. We do not even know, after doing such a fetch in the submodule, whether we have the commit necessary for the _next_ commit in the superproject we will check out.
>
> Hm, I may be misreading your message, but by "tip of random branch in the submodule", did you mean "tip of random branch in the _superproject_"?

No, I meant something like "git submodule foreach 'git fetch --all'" (or without '--all' to fetch whatever the refspec there tells us), i.e. tips of branches in the submodule.

>> The real question is not "in which submodules we fetch", but "what commits we fetch in these submodules". I do not think there is a good answer to the latter.
>>
>> Of course, if we take this sequence instead:
>>
>>     git checkout branch-with-submodules
>>     git fetch --recurse-submodules
>>     git checkout --recurse-submodules branch-with-submodules
>>
>> things should work correctly (I think we both are assuming that the other side allows us to fetch _any_ object, not just refs), as "fetch" knows what superproject commit it is asked to complete, unlike the previous example you gave, where it does not have a clue on what superproject commit it is preparing submodules for, right?
>
> So, given my prior description of recursive fetch, we actually _do_ know which superproject commits to prepare for and which submodule commits to fetch.

Just to make sure I understand what is going on, let me rephrase.

 * To find out which submodule commits we need to fetch, we find new commits in the superproject we just fetched, and inspect the trees of these commits to see gitlinks that name commits we need to fetch into the submodule repositories.

 * For that to work well, we need to know, from the path these commits appear at in the trees of the superproject, which submodule to fetch these commits from. And to make the mapping from paths to submodule names, we need to read .gitmodules from the same superproject commit we found the submodule commit in (as, during the history of the superproject, the submodule may have moved around).

If so, I understand why being able to read .gitmodules from superproject commits is essential. The flow would become like:

 (1) fetch in the superproject

 (2) iterate over each new superproject commit:
     - read its .gitmodules
     - iterate over each gitlink found in the superproject commit:
       - map the path we found the gitlink at into a module name
       - find the submodule repository initialized for the module
       - if the submodule is not of local interest, skip
       - add the submodule commit pointed at by the gitlink to the set of commits that need to be fetched for the submodule [*]

 (3) iterate over each submodule we found one or more commits that need to be fetched in, and fetch these commits (we do not have to go over the network to re-fetch commits that exist in the object store and are reachable from the refs, but "fetch" already knows how to optimize that).

Am I on the right track?

Thanks.
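(Step (2) of the flow above, the path-to-module-name mapping, can be sketched with plain git commands; $commit is a placeholder for a newly fetched superproject commit, and this is only an illustration, not the series' code.)

    # read the .gitmodules recorded in that superproject commit (not the
    # checked-out copy) and print, for each module name, its path in that commit
    git config --blob "$commit:.gitmodules" --get-regexp '^submodule\..*\.path$'
    # output is e.g. "submodule.foo.path sub/foo"; the cloned repository for
    # module "foo", if any, lives under .git/modules/foo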
Junio C Hamano <gitster@pobox.com> writes:

>>> The real question is not "in which submodules we fetch", but "what commits we fetch in these submodules". I do not think there is a good answer to the latter.
>>>
>>> Of course, if we take this sequence instead:
>>>
>>>     git checkout branch-with-submodules
>>>     git fetch --recurse-submodules
>>>     git checkout --recurse-submodules branch-with-submodules
>>>
>>> things should work correctly (I think we both are assuming that the other side allows us to fetch _any_ object, not just refs), as "fetch" knows what superproject commit it is asked to complete, unlike the previous example you gave, where it does not have a clue on what superproject commit it is preparing submodules for, right?
>>
>> So, given my prior description of recursive fetch, we actually _do_ know which superproject commits to prepare for and which submodule commits to fetch.
>
> Just to make sure I understand what is going on, let me rephrase.
>
>  * To find out which submodule commits we need to fetch, we find new commits in the superproject we just fetched, and inspect the trees of these commits to see gitlinks that name commits we need to fetch into the submodule repositories.
>
>  * For that to work well, we need to know, from the path these commits appear at in the trees of the superproject, which submodule to fetch these commits from. And to make the mapping from paths to submodule names, we need to read .gitmodules from the same superproject commit we found the submodule commit in (as, during the history of the superproject, the submodule may have moved around).
>
> If so, I understand why being able to read .gitmodules from superproject commits is essential. The flow would become like:
>
>  (1) fetch in the superproject
>
>  (2) iterate over each new superproject commit:
>      - read its .gitmodules
>      - iterate over each gitlink found in the superproject commit:
>        - map the path we found the gitlink at into a module name
>        - find the submodule repository initialized for the module
>        - if the submodule is not of local interest, skip
>        - add the submodule commit pointed at by the gitlink to the set of commits that need to be fetched for the submodule [*]
>
>  (3) iterate over each submodule we found one or more commits that need to be fetched in, and fetch these commits (we do not have to go over the network to re-fetch commits that exist in the object store and are reachable from the refs, but "fetch" already knows how to optimize that).
>
> Am I on the right track?

Yup, I think that's quite an accurate description. In particular..

> (2) iterate over each new superproject commit:
>     - read its .gitmodules

Prior to this series, .gitmodules is read from the filesystem even though we may notice the missing commit in a non-checked-out superproject commit.

> (3) iterate over each submodule we found one or more commits that need to be fetched in, and fetch these commits

Yes, this describes the new "fetch changed submodules" behavior accurately. However, we also attempt to fetch checked-out submodules, and this is where the two fetching strategies, "yes" and "on-demand" [1], matter:

"yes", aka "--recurse-submodules", tells "git fetch" to attempt to fetch _every_ checked-out submodule regardless of whether Git thinks it has missing commits (if we do not find any missing commits, I believe it defaults to the "git fetch <remote-name>" form). [2]

"on-demand", aka "--recurse-submodules=on-demand", is the 'default' option. (It is 'default' in the sense of not passing a "--recurse-submodules" arg, not the default when passing "--recurse-submodules" without an "=".) With "on-demand", we _only_ attempt to fetch changed submodules. Vis-a-vis "yes", this strategy has no effect on fetching non-checked-out submodules, because we only fetch changed, non-checked-out submodules anyway, but it lets us ignore unchanged, checked-out submodules.

[1] From Documentation/fetch-options.txt:

    --recurse-submodules[=yes|on-demand|no]::
        This option controls if and under what conditions new commits of
        populated submodules should be fetched too. It can be used as a
        boolean option to completely disable recursion when set to 'no' or to
        unconditionally recurse into all populated submodules when set to
        'yes', which is the default when this option is used without any
        value. Use 'on-demand' to only recurse into a populated submodule
        when the superproject retrieves a commit that updates the submodule's
        reference to a commit that isn't already in the local submodule
        clone. By default, 'on-demand' is used, unless
        `fetch.recurseSubmodules` is set (see linkgit:git-config[1]).

[2] Sidenote on the "yes" option: I think the rationale for doing unconditional fetches is not clear to reviewers. Admittedly, beyond 'the test suite fails', I don't really remember why either. I'll dig into this as I respond to the reviewer feedback.
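(For reference, the three modes discussed above are selected as follows; the last line shows the `fetch.recurseSubmodules` fallback mentioned in the documentation excerpt.)

    git fetch --recurse-submodules=on-demand   # only submodules with newly referenced commits
    git fetch --recurse-submodules=yes         # every populated submodule; same as --recurse-submodules
    git fetch --recurse-submodules=no          # never recurse into submodules

    # make one of the above the default for this repository
    git config fetch.recurseSubmodules on-demand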