[1/4] gitfaq: add advice on monorepos

Message ID 20211020010624.675562-2-sandals@crustytoothpaste.net (mailing list archive)
State Superseded
Series Additional FAQ entries

Commit Message

brian m. carlson Oct. 20, 2021, 1:06 a.m. UTC
Many projects around the world have chosen monorepos, and active
development on Git is ongoing to support them better.  However, as
projects using monorepos grow, they often find various performance
and scalability problems that are unpleasant to deal with.

Add a FAQ entry to note that while Git is attempting improvements in
this area, it is not uncommon to see performance problems that
necessitate the use of partial or shallow clone, sparse checkout, or the
like, and that if users wish to avoid these problems, avoiding a
monorepo may be best.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
---
 Documentation/gitfaq.txt | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

Comments

Bagas Sanjaya Oct. 20, 2021, 4:45 a.m. UTC | #1
On 20/10/21 08.06, brian m. carlson wrote:
> +[[monorepos]]
> +Should we use a monorepo or many individual repos?::
> +	This is a decision that is typically made based on an organization's needs and
> +	desires for their projects.  Git has several features, such as shallow clone,
> +	partial clone, and sparse checkout to make working with large repositories
> +	easier, and there is active development on making the monorepo experience
> +	better.
> ++
> +However, at a certain size, the performance of a monorepo will likely become
> +unacceptable _unless_ you use these features.  If you choose to start with a
> +monorepo and continue to grow, you may end up unhappy with the performance
> +characteristics at a point where making a change is difficult.  The performance
> +of using many smaller repositories will almost always be much better and will
> +generally not necessitate the use of these more advanced features.  If you are
> +concerned about future performance of your repository and related tools, you may
> +wish to avoid a monorepo.
> ++
> +Ultimately, you should make a decision fully informed about the potential
> +benefits and downsides, including the capabilities, performance, and future
> +requirements for your repository and related tools, including your hosting
> +platform, build tools, and other programs you typically use as part of your
> +workflow.
> +
>   Merging and Rebasing
>   --------------------
>   
> 

This seems to recommend splitting repositories, right?

Ultimately, if people choose split repos instead of a monorepo, that will
only delay the need for advanced features (partial/shallow clones, sparse
checkout, etc.) until the repos become large.

For a balanced view, we should describe the benefits and drawbacks of both
monorepos and split repos.
Ævar Arnfjörð Bjarmason Oct. 20, 2021, 10:54 a.m. UTC | #2
On Wed, Oct 20 2021, brian m. carlson wrote:

> +[[monorepos]]
> +Should we use a monorepo or many individual repos?::
> +	This is a decision that is typically made based on an organization's needs and
> +	desires for their projects.  Git has several features, such as shallow clone,
> +	partial clone, and sparse checkout to make working with large repositories
> +	easier, and there is active development on making the monorepo experience
> +	better.
> ++
> +However, at a certain size, the performance of a monorepo will likely become
> +unacceptable _unless_ you use these features.  If you choose to start with a
> +monorepo and continue to grow, you may end up unhappy with the performance
> +characteristics at a point where making a change is difficult.  The performance
> +of using many smaller repositories will almost always be much better and will
> +generally not necessitate the use of these more advanced features.  If you are
> +concerned about future performance of your repository and related tools, you may
> +wish to avoid a monorepo.
> ++
> +Ultimately, you should make a decision fully informed about the potential
> +benefits and downsides, including the capabilities, performance, and future
> +requirements for your repository and related tools, including your hosting
> +platform, build tools, and other programs you typically use as part of your
> +workflow.

In the context of git development we're typically talking about really
big repos when we're talking about monorepos, saying "monorepo"
communicates among other things that the user of that pattern is
unwilling to use splitting up as a way to address any scalability issues
they may have.

But a monorepo doesn't really say anything about size per se, and it
would be confusing to conflate the two in a FAQ. I may be wrong, perhaps
the term has really come to exclusively refer to colossal size, but I
haven't seen or heard it exclusively (or even mainly) used like that.

My understanding is that a monorepo, in the context of software
development, is a collection of all your "main" code and all its
dependencies in one repository, such that a person working on it rarely
or never has to worry about N=N dependencies between different pieces of
that collection; they'll all move in unison. You are able to atomically
change a function and all its users.

I think people have a different understanding of "all its
dependencies". Some monorepo users really mean it and try to e.g. fold
the configuration management system that manages their files in /etc
into their monorepo; others might have "another monorepo" for that
software, etc.

I bet that the vast majority of monorepo users are never going to
experience scaling problems, e.g. having your laptop dotfiles and
automation of /etc in one repo is a "monorepo", and most companies/teams
that use monorepos I'd bet are in the long tail of size
distribution. They're not going to grow to the size of a MS's, FB's
etc. monorepo, but they might benefit (or not) from the monorepo
/workflow/.

Anyway, all of the above can be read as a suggestion that we should
split any discussion of "large repo [that runs into scaling issues]"
from "monorepo", the latter should of course make a passing reference to
scaling (as the pattern will lead to that sooner than not), but IMO not
conflate the two.
Johannes Schindelin Oct. 20, 2021, 11:55 a.m. UTC | #3
Hi brian,

thank you for this excellent idea to talk about monorepos.

On Wed, 20 Oct 2021, brian m. carlson wrote:

> diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
> index 8c1f2d5675..946691c153 100644
> --- a/Documentation/gitfaq.txt
> +++ b/Documentation/gitfaq.txt
> @@ -241,6 +241,32 @@ How do I know if I want to do a fetch or a pull?::
>  	ignore the upstream changes.  A pull consists of a fetch followed
>  	immediately by either a merge or rebase.  See linkgit:git-pull[1].
>
> +Design
> +------
> +
> +[[monorepos]]
> +Should we use a monorepo or many individual repos?::
> +	This is a decision that is typically made based on an organization's needs and
> +	desires for their projects.  Git has several features, such as shallow clone,
> +	partial clone, and sparse checkout to make working with large repositories

May I request taking out shallow clones? A user new to Git might think
that shallow clones are a sane way to clone a large repository. In
practice, this only makes sense for "throw-away" clones, though. As soon
as you fetch in such clones, performance will be so horrible that it is
frequently a better idea to start with a partial clone instead.

At the same time, I would like to swap in "sparse index" for "shallow
clone" because it _does_ have the best potential of all currently
discussed new features to improve performance with monorepos.

> +	easier, and there is active development on making the monorepo experience
> +	better.
> ++
> +However, at a certain size, the performance of a monorepo will likely become
> +unacceptable _unless_ you use these features.  If you choose to start with a

I would like to add a plug for Scalar here. Maybe we can link to this
"opinionated tool based on Git" here? I wouldn't ask if I didn't _know_
that it helps monorepo users out there.

> +monorepo and continue to grow, you may end up unhappy with the performance
> +characteristics at a point where making a change is difficult.  The performance
> +of using many smaller repositories will almost always be much better and will
> +generally not necessitate the use of these more advanced features.  If you are
> +concerned about future performance of your repository and related tools, you may
> +wish to avoid a monorepo.
> ++
> +Ultimately, you should make a decision fully informed about the potential
> +benefits and downsides, including the capabilities, performance, and future
> +requirements for your repository and related tools, including your hosting
> +platform, build tools, and other programs you typically use as part of your
> +workflow.

I wish we had a good article to link to, here. Yes, it is a decision that
should be fully informed, and yes, this FAQ entry is not the place for a
thorough discussion of monorepos and how Git can be asked to handle them
more efficiently.

Do you know of any good resource that we could use here?

Thanks,
Dscho

> +
>  Merging and Rebasing
>  --------------------
>
>
Philip Oakley Oct. 20, 2021, 2:11 p.m. UTC | #4
On 20/10/2021 02:06, brian m. carlson wrote:
> Many projects around the world have chosen monorepos, and active
> development on Git is ongoing to support them better.  However, as
> projects using monorepos grow, they often find various performance
> and scalability problems that are unpleasant to deal with.
>
> Add a FAQ entry to note that while Git is attempting improvements in
> this area, it is not uncommon to see performance problems that
> necessitate the use of partial or shallow clone, sparse checkout, or the
> like, and that if users wish to avoid these problems, avoiding a
> monorepo may be best.
>
> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
> ---
>  Documentation/gitfaq.txt | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
>
> diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
> index 8c1f2d5675..946691c153 100644
> --- a/Documentation/gitfaq.txt
> +++ b/Documentation/gitfaq.txt
> @@ -241,6 +241,32 @@ How do I know if I want to do a fetch or a pull?::
>  	ignore the upstream changes.  A pull consists of a fetch followed
>  	immediately by either a merge or rebase.  See linkgit:git-pull[1].
>  
> +Design
> +------
> +
> +[[monorepos]]
> +Should we use a monorepo or many individual repos?::
> +	This is a decision that is typically made based on an organization's needs and
> +	desires for their projects.  Git has several features, such as shallow clone,
> +	partial clone, and sparse checkout to make working with large repositories
> +	easier, and there is active development on making the monorepo experience
> +	better.
> ++
> +However, at a certain size, the performance of a monorepo will likely become
> +unacceptable _unless_ you use these features.  If you choose to start with a
> +monorepo and continue to grow, you may end up unhappy with the performance
> +characteristics at a point where making a change is difficult.  The performance
> +of using many smaller repositories will almost always be much better and will
> +generally not necessitate the use of these more advanced features.  If you are
> +concerned about future performance of your repository and related tools, you may
> +wish to avoid a monorepo.
> ++
> +Ultimately, you should make a decision fully informed about the potential
> +benefits and downsides, including the capabilities, performance, and future
> +requirements for your repository and related tools, including your hosting
> +platform, build tools, and other programs you typically use as part of your
> +workflow.
> +

Does this need some comparison, or link, with sub-module methods and
issues? Such as the nested sub-module problem, the distinction between
active sub-modules and quiescent sub-modules (e.g. libraries Vx.y.z)?

As an aside, I don't think we provide any background to the Git
philosophy that frames some of these issues.
brian m. carlson Oct. 20, 2021, 9:19 p.m. UTC | #5
On 2021-10-20 at 10:54:47, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Oct 20 2021, brian m. carlson wrote:
> 
> > +[[monorepos]]
> > +Should we use a monorepo or many individual repos?::
> > +	This is a decision that is typically made based on an organization's needs and
> > +	desires for their projects.  Git has several features, such as shallow clone,
> > +	partial clone, and sparse checkout to make working with large repositories
> > +	easier, and there is active development on making the monorepo experience
> > +	better.
> > ++
> > +However, at a certain size, the performance of a monorepo will likely become
> > +unacceptable _unless_ you use these features.  If you choose to start with a
> > +monorepo and continue to grow, you may end up unhappy with the performance
> > +characteristics at a point where making a change is difficult.  The performance
> > +of using many smaller repositories will almost always be much better and will
> > +generally not necessitate the use of these more advanced features.  If you are
> > +concerned about future performance of your repository and related tools, you may
> > +wish to avoid a monorepo.
> > ++
> > +Ultimately, you should make a decision fully informed about the potential
> > +benefits and downsides, including the capabilities, performance, and future
> > +requirements for your repository and related tools, including your hosting
> > +platform, build tools, and other programs you typically use as part of your
> > +workflow.
> 
> In the context of git development we're typically talking about really
> big repos when we're talking about monorepos, saying "monorepo"
> communicates among other things that the user of that pattern is
> unwilling to use splitting up as a way to address any scalability issues
> they may have.
> 
> But a monorepo doesn't really say anything about size per-se, and it
> would be confusing to conflate the two in a FAQ. I may be wrong, perhaps
> the term has really come to exclusively refer to colossal size, but I
> haven't seen or heard it exclusively (or even mainly) used like that

I routinely hear "monorepo" used to imply repositories of specifically
large size.  However, I'm happy to rephrase to make it clearer.

> I bet that the vast majority of monorepo users are never going to
> experience scaling problems, e.g. having your laptop dotfiles and
> automation of /etc in one repo is a "monorepo", and most companies/teams
> that use monorepos I'd bet are in the long tail of size
> distribution. They're not going to grow to the size of a MS's, FB's
> etc. monorepo, but they might benefit (or not) from the monorepo
> /workflow/.

I almost never hear individuals refer to such a configuration as a
monorepo.  Technically, it is one, yes, but I almost always hear it in
the context of an organization's repository covering all of their
services or the entirety of one major project.

I will point out that I personally would run into scaling issues if I
put all of my projects in the same repository.  I have many projects,
and that would quickly become unsustainable, since the resources I have
at my disposal are more limited than most organizations.

> Anyway, all of the above can be read as a suggestion that we should
> split any discussion of "large repo [that runs into scaling issues]"
> from "monorepo", the latter should of course make a passing reference to
> scaling (as the pattern will lead to that sooner than not), but IMO not
> conflate the two.

I'm happy to clarify, but I think we need to mention the word "monorepo"
specifically because (a) that's the term that's commonly used for this
approach and (b) that approach is one that tends to lead to
significantly greater growth in a single repository leading to scale
problems.
brian m. carlson Oct. 20, 2021, 10:22 p.m. UTC | #6
On 2021-10-20 at 14:11:09, Philip Oakley wrote:
> Does this need some comparison, or link, with sub-module methods and
> issues? Such as the nested sub-module problem, the distinction between
> active sub-modules and quiescent sub-modules (e.g. libraries Vx.y.z)?

I don't think it does.  Some projects choose to use many repositories
with submodules, and some use many repositories without submodules.  At
work, we do the latter, and it tends to work just fine.
Philip Oakley Oct. 25, 2021, 10:44 a.m. UTC | #7
Hi Brian,
On 20/10/2021 23:22, brian m. carlson wrote:
> On 2021-10-20 at 14:11:09, Philip Oakley wrote:
>> Does this need some comparison, or link, with sub-module methods and
>> issues? Such as the nested sub-module problem, the distinction between
>> active sub-modules and quiescent sub-modules (e.g. libraries Vx.y.z)?
> I don't think it does.  Some projects choose to use many repositories
> with submodules, and some use many repositories without submodules.  At
> work, we do the latter, and it tends to work just fine.

To clarify, my comment was with regard to the complementary discussions
about _choice_ of repo types, rather than mono-repo and other
post-choice issues. Possibly part of such a discussion on choice of
repo-type could include the potential slippery slope between the
different uses of sub-modules.

I feel a lot of the difficult sub-module discussions arise because we
don't have a common terminology for the different (mental) models of
sub-module use, their benefits and problems.
--
Philip
Patch

diff --git a/Documentation/gitfaq.txt b/Documentation/gitfaq.txt
index 8c1f2d5675..946691c153 100644
--- a/Documentation/gitfaq.txt
+++ b/Documentation/gitfaq.txt
@@ -241,6 +241,32 @@  How do I know if I want to do a fetch or a pull?::
 	ignore the upstream changes.  A pull consists of a fetch followed
 	immediately by either a merge or rebase.  See linkgit:git-pull[1].
 
+Design
+------
+
+[[monorepos]]
+Should we use a monorepo or many individual repos?::
+	This is a decision that is typically made based on an organization's needs and
+	desires for their projects.  Git has several features, such as shallow clone,
+	partial clone, and sparse checkout to make working with large repositories
+	easier, and there is active development on making the monorepo experience
+	better.
++
+However, at a certain size, the performance of a monorepo will likely become
+unacceptable _unless_ you use these features.  If you choose to start with a
+monorepo and continue to grow, you may end up unhappy with the performance
+characteristics at a point where making a change is difficult.  The performance
+of using many smaller repositories will almost always be much better and will
+generally not necessitate the use of these more advanced features.  If you are
+concerned about future performance of your repository and related tools, you may
+wish to avoid a monorepo.
++
+Ultimately, you should make a decision fully informed about the potential
+benefits and downsides, including the capabilities, performance, and future
+requirements for your repository and related tools, including your hosting
+platform, build tools, and other programs you typically use as part of your
+workflow.
+
 Merging and Rebasing
 --------------------