
[0/7] Submodules and partial clones

Message ID 20200929155350.49066-1-andrew@adoakley.name

Message

Andrew Oakley Sept. 29, 2020, 3:53 p.m. UTC
I've been investigating what is required to get submodules and partial
clones to work well together.  The issue seems to be that the correct
repository is not passed around, so we sometimes end up trying to fetch
objects from the wrong place.

These patches don't make promisor_remote_get_direct handle different
repositories because I've not found a case where that is necessary.

The patches rework various cases where objects from a submodule are
added to the object store of the main repository.  There are some
remaining cases where add_to_alternates_memory is used to do this, but
add_submodule_odb has been removed.

I expect there will be some remaining issues, but these changes seem to
be enough to get the basics working.

Andrew Oakley (6):
  refs: store owning repository for object lookup
  submodule: use separate submodule repositories
  refs: use correct repo in refs_peel_ref
  merge-recursive: use separate submodule repository
  submodule: remove add_submodule_odb
  submodule: use partial clone filter

Luke Diamand (1):
  Add failing test for partial clones with submodules

 builtin/clone.c                     |   5 ++
 builtin/fsck.c                      |   2 +-
 builtin/pack-objects.c              |   2 +-
 builtin/submodule--helper.c         |  21 +++--
 git-submodule.sh                    |  20 ++++-
 http-push.c                         |   2 +-
 merge-recursive.c                   |  73 ++++++++-------
 object.c                            |   7 +-
 object.h                            |   2 +-
 refs.c                              |  37 ++++----
 refs.h                              |  37 +-------
 refs/debug.c                        |   3 +-
 refs/files-backend.c                |  22 +++--
 refs/iterator.c                     |  11 ++-
 refs/packed-backend.c               |  15 ++--
 refs/packed-backend.h               |   3 +-
 refs/ref-cache.c                    |   5 +-
 refs/refs-internal.h                |  18 +++-
 revision.c                          |  21 ++---
 revision.h                          |   1 -
 submodule.c                         | 135 +++++++++++++---------------
 submodule.h                         |  11 ++-
 t/helper/test-example-decorate.c    |   6 +-
 t/helper/test-reach.c               |   2 +-
 t/helper/test-ref-store.c           |   1 -
 t/t0411-partial-clone-submodules.sh |  44 +++++++++
 t/t1406-submodule-ref-store.sh      |  17 +---
 tag.c                               |   4 +-
 tag.h                               |   2 +-
 upload-pack.c                       |   2 +-
 walker.c                            |   3 +-
 31 files changed, 301 insertions(+), 233 deletions(-)
 create mode 100755 t/t0411-partial-clone-submodules.sh

Comments

Jonathan Tan Sept. 29, 2020, 6:05 p.m. UTC | #1
> I've been investigating what is required to get submodules and partial
> clones to work well together.  The issue seems to be that the correct
> repository is not passed around, so we sometimes end up trying to fetch
> objects from the wrong place.
> 
> These patches don't make promisor_remote_get_direct handle different
> repositories because I've not found a case where that is necessary.

Anything that reads a submodule object without spawning another process
to do so (e.g. grep, which adds submodule object stores as alternates in
order to read from them) will need to be prepared to lazy-fetch objects
into those stores.

> The patches rework various cases where objects from a submodule are
> added to the object store of the main repository.  There are some
> remaining cases where add_to_alternates_memory is used to do this, but
> add_submodule_odb has been removed.
> 
> I expect there will be some remaining issues, but these changes seem to
> be enough to get the basics working.

What are the basics that work?

When I looked into this, my main difficulty lay in getting the
lazy fetch to work in another repository. Now that lazy fetches are done
using a separate process, the problem has shifted to being able to
invoke run_command() in a separate Git repository. I haven't figured out
the best way to ensure that run_command() is run with a clean set of
environment variables (so no inheriting of GIT_DIR etc.), but that
doesn't seem insurmountable.
Andrew Oakley Sept. 30, 2020, 1:28 p.m. UTC | #2
On Tue, 29 Sep 2020 11:05:08 -0700
Jonathan Tan <jonathantanmy@google.com> wrote:

> > I've been investigating what is required to get submodules and
> > partial clones to work well together.  The issue seems to be that
> > the correct repository is not passed around, so we sometimes end up
> > trying to fetch objects from the wrong place.
> > 
> > These patches don't make promisor_remote_get_direct handle different
> > repositories because I've not found a case where that is necessary.
> 
> Anything that reads a submodule object without spawning another
> process to do so (e.g. grep, which adds submodule object stores as
> alternates in order to read from them) will need to be prepared to
> lazy-fetch objects into those stores.

Yes, grep just calls `add_to_alternates_memory` and will be broken.

When handling nested submodules `config_from_gitmodules` does the same
thing, so that will also be broken if some of the .gitmodules files
need fetching.

Fixing these probably does require supporting fetching of objects from
submodules.

> > The patches rework various cases where objects from a submodule are
> > added to the object store of the main repository.  There are some
> > remaining cases where add_to_alternates_memory is used to do this,
> > but add_submodule_odb has been removed.
> > 
> > I expect there will be some remaining issues, but these changes
> > seem to be enough to get the basics working.  
> 
> What are the basics that work?

I've tried at least the following, in a repo with several submodules and
large objects (but no nested submodules):
- git clone --recursive --filter=blob:limit=1M ...
- git pull --rebase --recurse-submodules=on-demand
- git show --submodule=diff <commit-with-big-submodule-object>
- git push --recurse-submodules=check
- git push --recurse-submodules=on-demand

I used the partial clone for a while and didn't hit any problems, but I
can't say what (relevant) commands I might have used.

An important thing that I've not tried is a merge that needs to fetch
objects.  I should probably write a testcase for that.

> When I looked into this, my main difficulty lay in getting the
> lazy fetch to work in another repository. Now that lazy fetches are
> done using a separate process, the problem has shifted to being able
> to invoke run_command() in a separate Git repository. I haven't
> figured out the best way to ensure that run_command() is run with a
> clean set of environment variables (so no inheriting of GIT_DIR
> etc.), but that doesn't seem insurmountable.

Yes, I think that to fix promisor_remote_get_direct we need to:
- store the promisor configuration per-repository
- run the fetch process in the correct repository

AFAICT we just need to set cp.dir and call prepare_submodule_repo_env
to get the right environment for the fetch process. The per-repository
configuration looks more fiddly to do.  I'm happy to try and make these
additional changes (but it won't be quick as I'm busy with the day job).

In any case we need to pass the right repository around.
Jonathan Tan Sept. 30, 2020, 8:41 p.m. UTC | #3
> Yes, grep just calls `add_to_alternates_memory` and will be broken.
> 
> When handling nested submodules `config_from_gitmodules` does the same
> thing, so that will also be broken if some of the .gitmodules files
> need fetching.
> 
> Fixing these probably does require supporting fetching of objects from
> submodules.

[snip]

> > > The patches rework various cases where objects from a submodule are
> > > added to the object store of the main repository.  There are some
> > > remaining cases where add_to_alternates_memory is used to do this,
> > > but add_submodule_odb has been removed.
> > > 
> > > I expect there will be some remaining issues, but these changes
> > > seem to be enough to get the basics working.  
> > 
> > What are the basics that work?
> 
> I've tried at least the following, in a repo with several submodules and
> large objects (but no nested submodules):
> - git clone --recursive --filter=blob:limit=1M ...
> - git pull --rebase --recurse-submodules=on-demand
> - git show --submodule=diff <commit-with-big-submodule-object>
> - git push --recurse-submodules=check
> - git push --recurse-submodules=on-demand
> 
> I used the partial clone for a while and didn't hit any problems, but I
> can't say what (relevant) commands I might have used.
> 
> An important thing that I've not tried is a merge that needs to fetch
> objects.  I should probably write a testcase for that.

Thanks - so it looks like what we need to have are:

 (1) propagate --filter when cloning (done here)
 (2) handle how Git lazy-fetches when accessing submodule objects
  (2a) access through add_submodule_odb (seems to have been done here -
       the patches here convert these accesses into (2c))
  (2b) access through add_to_alternates_memory (reading missing objects will
       trigger a likely-to-fail fetch)
  (2c) access through repo_.* and similar functions (reading missing objects
       will fail outright)
  (2d) access through invoking a subprocess (will work)

Having (1) and (2a) means that (as you described above) we can have
generally working partial-clone submodules (in that commands like
"clone", "pull", and "push" work), except that there may be some
commands like "grep" that will have strange behavior (lazy-fetching from
the wrong repo and failing). Currently, partial-clone submodules do not
work at all. My initial inclination was to say that we should resolve
(2c) first, but it is true that this change (even without (2b) and (2c))
would bring user-facing benefit, albeit at the cost of possible negative
surprises (even if we warn in the documentation that this feature is
experimental and might fail). I'm not sure what the rest of the Git
developer community thinks about this.

> > When I looked into this, my main difficulty lay in getting the
> > lazy fetch to work in another repository. Now that lazy fetches are
> > done using a separate process, the problem has shifted to being able
> > to invoke run_command() in a separate Git repository. I haven't
> > figured out the best way to ensure that run_command() is run with a
> > clean set of environment variables (so no inheriting of GIT_DIR
> > etc.), but that doesn't seem insurmountable.
> 
> Yes, I think that to fix promisor_remote_get_direct we need to:
> - store the promisor configuration per-repository
> - run the fetch process in the correct repository
> 
> AFAICT we just need to set cp.dir and call prepare_submodule_repo_env
> to get the right environment for the fetch process. The per-repository
> configuration looks more fiddly to do.  I'm happy to try and make these
> additional changes (but it won't be quick as I'm busy with the day job).
> 
> In any case we need to pass the right repository around.

Ah, good point about prepare_submodule_repo_env() - that does take care
of the environment variables.

I'll look at per-repository configuration and see what I can do too.