mbox series

[RFC,0/4] cache parent project's gitdir in submodules

Message ID 20210611225428.1208973-1-emilyshaffer@google.com (mailing list archive)
Headers show
Series cache parent project's gitdir in submodules | expand

Message

Emily Shaffer June 11, 2021, 10:54 p.m. UTC
It's necessary for a superproject to know which submodules it contains.
However, historically submodules do not know they are anything but a
normal single-repo Git project (or a superproject, in nested-submodules
cases). This decision does help prevent us from having to support
divergent behaviors in submodule projects vs. superprojects, which makes
sure Git is (somewhat) less confusing for the reader, and helps simplify
our code.

One could imagine, though, some conveniences we could gain from
submodules learning added behavior (though not necessarily *different*
behavior) to provide more context about the state of the project as a
whole, and to make large submodule-based projects easier to work with.
One example is a series[1] I sent some time ago, adding a config to be
shared between the superproject and all its submodules. The RFC[2] I
sent around the same time mentions a few other examples, such as "git
status" run in the submodule noticing whether the superproject has a
commit referencing the submodule's current HEAD.

It's expensive and non-definitive to try and guess whether or not the
current repo is a submodule. submodule.c:get_superproject_working_tree()
does so by essentially running 'git -C .. ls-files -- <own-path>',
invoking an additional process. get_superproject_working_tree() is not
called often, so that's mostly fine. However, [1] attempted to include
an additional config located in the superproject's gitdir by running
'git -C .. rev-parse --git-dir' during startup - a little expensive in
the best case, because it's an extra process, but extremely expensive in
the case when the current repo is *not* a submodule, because we hunt all
the way up the filesystem looking for a '.git'. Adding that cost to
every startup is infeasible.

To that end, in this series I propose caching a path to the
superproject's gitdir - by having the superproject write that relative
path to the submodule's config on creation or update. The goal here is
*not* to say "If I am a submodule, I must have
submodule.superprojectGitDir set" - but instead to say "If I have
submodule.superprojectGitDir set, then I must be a submodule." That is,
I expect we will find edge cases where a submodule was introduced in
some interesting way that bypassed any of the patches below, and
therefore doesn't have the superproject's gitdir cached.

The combination of these two rules:
 - Anything relying on submodule.superprojectGitDir must be nice to
   have, but not essential, because
 - It's possible for a submodule to be valid without having
   submodule.superprojectGitDir set
makes me feel more comfortable with the idea of submodules learning
additional behavior based on this config. I feel pretty unconfident in
our ability to ensure that *every* submodule has this config set.

The series covers a few paths for introducing that config, which I'm
hoping covers most cases.
 - "git submodule update" (which seems to be part of the "git submodule
   init" flow)
 - "git submodule absorbgitdir" to convert a "git init"'d repo into a
   submodule

Notably, we can only really set this config when 'the_repository' is the
superproject - that appears to be the only time when we know the gitdirs
of both the superproject and the submodule.

I'm expecting folks may have a lot to say about this, so I look forward
to discussion :)

 - Emily

1: https://lore.kernel.org/git/20210423001539.4059524-1-emilyshaffer@google.com
2: https://lore.kernel.org/git/YHofmWcIAidkvJiD@google.com

Emily Shaffer (4):
  t7400-submodule-basic: modernize inspect() helper
  introduce submodule.superprojectGitDir cache
  submodule: cache superproject gitdir during absorbgitdirs
  submodule: cache superproject gitdir during 'update'

 builtin/submodule--helper.c        |  4 +++
 git-submodule.sh                   |  9 ++++++
 submodule.c                        | 10 ++++++
 t/t7400-submodule-basic.sh         | 49 ++++++++++++++----------------
 t/t7406-submodule-update.sh        | 10 ++++++
 t/t7412-submodule-absorbgitdirs.sh |  1 +
 6 files changed, 57 insertions(+), 26 deletions(-)

Comments

Jacob Keller June 12, 2021, 8:12 p.m. UTC | #1
On Fri, Jun 11, 2021 at 3:54 PM Emily Shaffer <emilyshaffer@google.com> wrote:
>
> It's necessary for a superproject to know which submodules it contains.
> However, historically submodules do not know they are anything but a
> normal single-repo Git project (or a superproject, in nested-submodules
> cases). This decision does help prevent us from having to support
> divergent behaviors in submodule projects vs. superprojects, which makes
> sure Git is (somewhat) less confusing for the reader, and helps simplify
> our code.
>
> One could imagine, though, some conveniences we could gain from
> submodules learning added behavior (though not necessarily *different*
> behavior) to provide more context about the state of the project as a
> whole, and to make large submodule-based projects easier to work with.
> One example is a series[1] I sent some time ago, adding a config to be
> shared between the superproject and all its submodules. The RFC[2] I
> sent around the same time mentions a few other examples, such as "git
> status" run in the submodule noticing whether the superproject has a
> commit referencing the submodule's current HEAD.
>
> It's expensive and non-definitive to try and guess whether or not the
> current repo is a submodule. submodule.c:get_superproject_working_tree()
> does so by essentially running 'git -C .. ls-files -- <own-path>',
> invoking an additional process. get_superproject_working_tree() is not
> called often, so that's mostly fine. However, [1] attempted to include
> an additional config located in the superproject's gitdir by running
> 'git -C .. rev-parse --git-dir' during startup - a little expensive in
> the best case, because it's an extra process, but extremely expensive in
> the case when the current repo is *not* a submodule, because we hunt all
> the way up the filesystem looking for a '.git'. Adding that cost to
> every startup is infeasible.
>

It also adds cost for no benefit to the normal case where it's not a
submodule. The cost added for non-submodules ought to be as near-zero
as possible.

> To that end, in this series I propose caching a path to the
> superproject's gitdir - by having the superproject write that relative
> path to the submodule's config on creation or update. The goal here is
> *not* to say "If I am a submodule, I must have
> submodule.superprojectGitDir set" - but instead to say "If I have
> submodule.superprojectGitDir set, then I must be a submodule." That is,
> I expect we will find edge cases where a submodule was introduced in
> some interesting way that bypassed any of the patches below, and
> therefore doesn't have the superproject's gitdir cached.
>
> The combination of these two rules:
>  - Anything relying on submodule.superprojectGitDir must be nice to
>    have, but not essential, because
>  - It's possible for a submodule to be valid without having
>    submodule.superprojectGitDir set
> makes me feel more comfortable with the idea of submodules learning
> additional behavior based on this config. I feel pretty unconfident in
> our ability to ensure that *every* submodule has this config set.
>

I think this is a good direction. I do think having some information
about being a submodule could be very useful for tools such as git
status, and making it more of a "if we know for sure, we get some
small benefit, but if we don't know then it's no harm" is a good
direction.

> The series covers a few paths for introducing that config, which I'm
> hoping covers most cases.
>  - "git submodule update" (which seems to be part of the "git submodule
>    init" flow)
>  - "git submodule absorbgitdir" to convert a "git init"'d repo into a
>    submodule
>
> Notably, we can only really set this config when 'the_repository' is the
> superproject - that appears to be the only time when we know the gitdirs
> of both the superproject and the submodule.

We could also presumably add a new command for setting this?

>
> I'm expecting folks may have a lot to say about this, so I look forward
> to discussion :)
>
>  - Emily
>
> 1: https://lore.kernel.org/git/20210423001539.4059524-1-emilyshaffer@google.com
> 2: https://lore.kernel.org/git/YHofmWcIAidkvJiD@google.com
>
> Emily Shaffer (4):
>   t7400-submodule-basic: modernize inspect() helper
>   introduce submodule.superprojectGitDir cache
>   submodule: cache superproject gitdir during absorbgitdirs
>   submodule: cache superproject gitdir during 'update'
>
>  builtin/submodule--helper.c        |  4 +++
>  git-submodule.sh                   |  9 ++++++
>  submodule.c                        | 10 ++++++
>  t/t7400-submodule-basic.sh         | 49 ++++++++++++++----------------
>  t/t7406-submodule-update.sh        | 10 ++++++
>  t/t7412-submodule-absorbgitdirs.sh |  1 +
>  6 files changed, 57 insertions(+), 26 deletions(-)
>
> --
> 2.32.0.272.g935e593368-goog
>
Junio C Hamano June 14, 2021, 7:26 a.m. UTC | #2
Emily Shaffer <emilyshaffer@google.com> writes:

> It's necessary for a superproject to know which submodules it contains.
> However, historically submodules do not know they are anything but a
> normal single-repo Git project (or a superproject, in nested-submodules
> cases). This decision does help prevent us from having to support
> divergent behaviors in submodule projects vs. superprojects, which makes
> sure Git is (somewhat) less confusing for the reader, and helps simplify
> our code.

https://github.com/git/git/actions/runs/934803365 is the CI run of
'seen' that passes.  With this and another topic merged to 'seen',
as can be seen in https://github.com/git/git/actions/runs/934680687,
a handful of tests fail.

Thanks.
Emily Shaffer June 15, 2021, 9:18 p.m. UTC | #3
On Mon, Jun 14, 2021 at 04:26:17PM +0900, Junio C Hamano wrote:
> 
> Emily Shaffer <emilyshaffer@google.com> writes:
> 
> > It's necessary for a superproject to know which submodules it contains.
> > However, historically submodules do not know they are anything but a
> > normal single-repo Git project (or a superproject, in nested-submodules
> > cases). This decision does help prevent us from having to support
> > divergent behaviors in submodule projects vs. superprojects, which makes
> > sure Git is (somewhat) less confusing for the reader, and helps simplify
> > our code.
> 
> https://github.com/git/git/actions/runs/934803365 is the CI run of
> 'seen' that passes.  With this and another topic merged to 'seen',
> as can be seen in https://github.com/git/git/actions/runs/934680687,
> a handful of tests fail.

Thanks, I will investigate.