[RFC,0/4] name-rev: improve memory usage

Message ID	20190301175024.17337-1-alban.gruin@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@kernel.org> From: Alban Gruin <alban.gruin@gmail.com> To: Git Mailing List <git@vger.kernel.org> Cc: Alban Gruin <alban.gruin@gmail.com> Subject: [RFC PATCH 0/4] name-rev: improve memory usage Date: Fri, 1 Mar 2019 18:50:20 +0100 Message-Id: <20190301175024.17337-1-alban.gruin@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	name-rev: improve memory usage \| expand [RFC,0/4] name-rev: improve memory usage [RFC,1/4] name-rev: improve name_rev() memory usage [RFC,2/4] commit-list: add a function to check if a commit is in a list [RFC,3/4] name-rev: check if a commit should be named before naming it [RFC,4/4] name-rev: avoid naming from a ref if it’s not a descendant of any commit

Message ID

20190301175024.17337-1-alban.gruin@gmail.com (mailing list archive)

Headers

From: Alban Gruin <alban.gruin@gmail.com>
To: Git Mailing List <git@vger.kernel.org>
Cc: Alban Gruin <alban.gruin@gmail.com>
Subject: [RFC PATCH 0/4] name-rev: improve memory usage
Date: Fri,  1 Mar 2019 18:50:20 +0100
Message-Id: <20190301175024.17337-1-alban.gruin@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: git-owner@vger.kernel.org
Precedence: bulk

Series

name-rev: improve memory usage | expand

Message

Alban Gruin March 1, 2019, 5:50 p.m. UTC

rafasc reported on IRC that on a repository with a lot of branches,
tags, remotes, and commits, name-rev --stdin could use a massive amount
of memory (more than 2GB of RAM) to complete.

This patch series tries to improve name-rev’s memory usage.

There is some improvement that could be done, such as reference counting
the names attributed to commits.  Tell me if it could be worth to pursue
this way, or if name-rev’s internals would need a more thorough rewrite.

This is based on master (8104ec994e, "Git 2.21").

The tip of this series is tagged as "fix-name-rev-leak-rfc-v1" in
https://github.com/agrn/git.

Alban Gruin (4):
  name-rev: improve name_rev() memory usage
  commit-list: add a function to check if a commit is in a list
  name-rev: check if a commit should be named before naming it
  name-rev: avoid naming from a ref if it’s not a descendant of any
    commit

 builtin/name-rev.c | 124 +++++++++++++++++++++++++++++++++++----------
 commit.c           |  12 +++++
 commit.h           |   1 +
 3 files changed, 111 insertions(+), 26 deletions(-)

Comments

Jeff King March 1, 2019, 6:42 p.m. UTC | #1

On Fri, Mar 01, 2019 at 06:50:20PM +0100, Alban Gruin wrote:

> rafasc reported on IRC that on a repository with a lot of branches,
> tags, remotes, and commits, name-rev --stdin could use a massive amount
> of memory (more than 2GB of RAM) to complete.
> 
> This patch series tries to improve name-rev’s memory usage.

Have you tried this?

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index f1cb45c227..7aaa86f1c0 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 		OPT_END(),
 	};
 
+	save_commit_buffer = 0;
+
 	init_commit_rev_name(&rev_names);
 	git_config(git_default_config, NULL);
 	argc = parse_options(argc, argv, prefix, opts, name_rev_usage, 0);

It seems to lower heap usage of:

  git name-rev 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2

in linux.git (that commit is the final one reported by "git log", so
it's traversing all of history) from ~1GB to ~300MB.

-Peff

Alban Gruin March 1, 2019, 7:14 p.m. UTC | #2

Hi Jeff,

Le 01/03/2019 à 19:42, Jeff King a écrit :
> On Fri, Mar 01, 2019 at 06:50:20PM +0100, Alban Gruin wrote:
> 
>> rafasc reported on IRC that on a repository with a lot of branches,
>> tags, remotes, and commits, name-rev --stdin could use a massive amount
>> of memory (more than 2GB of RAM) to complete.
>>
>> This patch series tries to improve name-rev’s memory usage.
> 
> Have you tried this?
> 
> diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> index f1cb45c227..7aaa86f1c0 100644
> --- a/builtin/name-rev.c
> +++ b/builtin/name-rev.c
> @@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
>  		OPT_END(),
>  	};
>  
> +	save_commit_buffer = 0;
> +
>  	init_commit_rev_name(&rev_names);
>  	git_config(git_default_config, NULL);
>  	argc = parse_options(argc, argv, prefix, opts, name_rev_usage, 0);
> 
> It seems to lower heap usage of:
> 
>   git name-rev 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
> 
> in linux.git (that commit is the final one reported by "git log", so
> it's traversing all of history) from ~1GB to ~300MB.
> 
> -Peff
> 

Unfortunately this does not work in all cases, apparently.  On my git
copy, I have 3 origins.  If I run this:

	git log --graph --oneline --abbrev=-1 -5 | git name-rev --stdin

With or without your change, it uses 3GB of RAM.  With this series, it
uses 25MB of RAM.

With the first git commit:

	git name-rev e83c5163316f89bfbde7d9ab23ca2e25604af290

It also uses 3GB of RAM.  With this series, it uses around 350MB of RAM.
 Which makes me think that I should adapt 4/4 to arguments.

Sorry, I should have specified this in the cover letter.

Cheers,
Alban

Jeff King March 1, 2019, 7:39 p.m. UTC | #3

On Fri, Mar 01, 2019 at 08:14:26PM +0100, Alban Gruin wrote:

> > diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> > index f1cb45c227..7aaa86f1c0 100644
> > --- a/builtin/name-rev.c
> > +++ b/builtin/name-rev.c
> > @@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
> >  		OPT_END(),
> >  	};
> >  
> > +	save_commit_buffer = 0;
> > +
> [...]
> 
> Unfortunately this does not work in all cases, apparently.  On my git
> copy, I have 3 origins.  If I run this:
> 
> 	git log --graph --oneline --abbrev=-1 -5 | git name-rev --stdin
> 
> With or without your change, it uses 3GB of RAM.  With this series, it
> uses 25MB of RAM.

Sorry if I was unclear. I meant to try that _in addition_ to your
changes. It helps by avoiding keeping the useless commit-object buffers
in RAM as we traverse. But the most it can save is the total
uncompressed bytes of all commit objects. I.e., in git.git:

  $ git cat-file --batch-check='%(objectsize) %(objecttype)' --batch-all-objects |
    grep commit |
    perl -alne '$total += $F[0]; END { print $total }'
  74678114

or around 70MB. In linux.git, it's more like 700MB.

But in your examples, the problem is the inefficiencies in name-rev's
algorithm, and you're not actually traversing that many commits. So I
think you'd want to turn off save_commit_buffer as an extra patch in
your series. It may or not be a big win for any given case, but it's
quite easy to do.

-Peff