mbox series

[RFC,0/4] name-rev: improve memory usage

Message ID 20190301175024.17337-1-alban.gruin@gmail.com (mailing list archive)
Headers show
Series name-rev: improve memory usage | expand

Message

Alban Gruin March 1, 2019, 5:50 p.m. UTC
rafasc reported on IRC that on a repository with a lot of branches,
tags, remotes, and commits, name-rev --stdin could use a massive amount
of memory (more than 2GB of RAM) to complete.

This patch series tries to improve name-rev’s memory usage.

There is some improvement that could be done, such as reference counting
the names attributed to commits.  Tell me if it could be worth to pursue
this way, or if name-rev’s internals would need a more thorough rewrite.

This is based on master (8104ec994e, "Git 2.21").

The tip of this series is tagged as "fix-name-rev-leak-rfc-v1" in
https://github.com/agrn/git.

Alban Gruin (4):
  name-rev: improve name_rev() memory usage
  commit-list: add a function to check if a commit is in a list
  name-rev: check if a commit should be named before naming it
  name-rev: avoid naming from a ref if it’s not a descendant of any
    commit

 builtin/name-rev.c | 124 +++++++++++++++++++++++++++++++++++----------
 commit.c           |  12 +++++
 commit.h           |   1 +
 3 files changed, 111 insertions(+), 26 deletions(-)

Comments

Jeff King March 1, 2019, 6:42 p.m. UTC | #1
On Fri, Mar 01, 2019 at 06:50:20PM +0100, Alban Gruin wrote:

> rafasc reported on IRC that on a repository with a lot of branches,
> tags, remotes, and commits, name-rev --stdin could use a massive amount
> of memory (more than 2GB of RAM) to complete.
> 
> This patch series tries to improve name-rev’s memory usage.

Have you tried this?

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index f1cb45c227..7aaa86f1c0 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 		OPT_END(),
 	};
 
+	save_commit_buffer = 0;
+
 	init_commit_rev_name(&rev_names);
 	git_config(git_default_config, NULL);
 	argc = parse_options(argc, argv, prefix, opts, name_rev_usage, 0);

It seems to lower heap usage of:

  git name-rev 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2

in linux.git (that commit is the final one reported by "git log", so
it's traversing all of history) from ~1GB to ~300MB.

-Peff
Alban Gruin March 1, 2019, 7:14 p.m. UTC | #2
Hi Jeff,

Le 01/03/2019 à 19:42, Jeff King a écrit :
> On Fri, Mar 01, 2019 at 06:50:20PM +0100, Alban Gruin wrote:
> 
>> rafasc reported on IRC that on a repository with a lot of branches,
>> tags, remotes, and commits, name-rev --stdin could use a massive amount
>> of memory (more than 2GB of RAM) to complete.
>>
>> This patch series tries to improve name-rev’s memory usage.
> 
> Have you tried this?
> 
> diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> index f1cb45c227..7aaa86f1c0 100644
> --- a/builtin/name-rev.c
> +++ b/builtin/name-rev.c
> @@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
>  		OPT_END(),
>  	};
>  
> +	save_commit_buffer = 0;
> +
>  	init_commit_rev_name(&rev_names);
>  	git_config(git_default_config, NULL);
>  	argc = parse_options(argc, argv, prefix, opts, name_rev_usage, 0);
> 
> It seems to lower heap usage of:
> 
>   git name-rev 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2
> 
> in linux.git (that commit is the final one reported by "git log", so
> it's traversing all of history) from ~1GB to ~300MB.
> 
> -Peff
> 

Unfortunately this does not work in all cases, apparently.  On my git
copy, I have 3 origins.  If I run this:

	git log --graph --oneline --abbrev=-1 -5 | git name-rev --stdin

With or without your change, it uses 3GB of RAM.  With this series, it
uses 25MB of RAM.

With the first git commit:

	git name-rev e83c5163316f89bfbde7d9ab23ca2e25604af290

It also uses 3GB of RAM.  With this series, it uses around 350MB of RAM.
 Which makes me think that I should adapt 4/4 to arguments.

Sorry, I should have specified this in the cover letter.

Cheers,
Alban
Jeff King March 1, 2019, 7:39 p.m. UTC | #3
On Fri, Mar 01, 2019 at 08:14:26PM +0100, Alban Gruin wrote:

> > diff --git a/builtin/name-rev.c b/builtin/name-rev.c
> > index f1cb45c227..7aaa86f1c0 100644
> > --- a/builtin/name-rev.c
> > +++ b/builtin/name-rev.c
> > @@ -431,6 +431,8 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
> >  		OPT_END(),
> >  	};
> >  
> > +	save_commit_buffer = 0;
> > +
> [...]
> 
> Unfortunately this does not work in all cases, apparently.  On my git
> copy, I have 3 origins.  If I run this:
> 
> 	git log --graph --oneline --abbrev=-1 -5 | git name-rev --stdin
> 
> With or without your change, it uses 3GB of RAM.  With this series, it
> uses 25MB of RAM.

Sorry if I was unclear. I meant to try that _in addition_ to your
changes. It helps by avoiding keeping the useless commit-object buffers
in RAM as we traverse. But the most it can save is the total
uncompressed bytes of all commit objects. I.e., in git.git:

  $ git cat-file --batch-check='%(objectsize) %(objecttype)' --batch-all-objects |
    grep commit |
    perl -alne '$total += $F[0]; END { print $total }'
  74678114

or around 70MB. In linux.git, it's more like 700MB.

But in your examples, the problem is the inefficiencies in name-rev's
algorithm, and you're not actually traversing that many commits. So I
think you'd want to turn off save_commit_buffer as an extra patch in
your series. It may or not be a big win for any given case, but it's
quite easy to do.

-Peff