[v3,1/4] alloc: introduce parsed_commits_count

Message ID	20200612184014.1226972-2-abhishekkumar8222@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=soZh=7Z=vger.kernel.org=git-owner@kernel.org> From: Abhishek Kumar <abhishekkumar8222@gmail.com> To: git@vger.kernel.org Cc: stolee@gmail.com, jnareb@gmail.com Subject: [PATCH v3 1/4] alloc: introduce parsed_commits_count Date: Sat, 13 Jun 2020 00:10:11 +0530 Message-Id: <20200612184014.1226972-2-abhishekkumar8222@gmail.com> In-Reply-To: <20200612184014.1226972-1-abhishekkumar8222@gmail.com> References: <20200612184014.1226972-1-abhishekkumar8222@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Move generation, graph_pos to a slab \| expand [GSoC,v3,0/4] Move generation, graph_pos to a slab [v3,1/4] alloc: introduce parsed_commits_count [v3,2/4] commit-graph: introduce commit_graph_data_slab [v3,3/4] commit: move members graph_pos, generation to a slab [v3,4/4] commit-graph: minimize commit_graph_data_slab access

Abhishek Kumar June 12, 2020, 6:40 p.m. UTC

Commit slab relies on uniqueness of commit->index to access data. As
submodules are repositories on their own, alloc_commit_index() (which
depends on repository->parsed_objects->commit_count) no longer
returns unique values.

This would break tests once we move `generation` and `graph_pos` into a
commit slab, as commits of supermodule and submodule can have the same
index but must have different graph positions.

Let's introduce a counter variable, `parsed_commits_count` to keep track
of parsed commits so far.


Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---

CI Build for the failing tests:
https://travis-ci.com/github/abhishekkumar2718/git/jobs/345413840

 alloc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Junio C Hamano June 12, 2020, 8:08 p.m. UTC | #1

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

>  static unsigned int alloc_commit_index(struct repository *r)
>  {
> -	return r->parsed_objects->commit_count++;
> +	static unsigned int parsed_commits_count = 0;
> +	r->parsed_objects->commit_count++;
> +	return parsed_commits_count++;
>  }

Hmph, ugly.  

The need for this hack, which butchers the function that explicitly
takes a repository as its parameter so that parsed commits can be
counted per repository to not count commits per repository, makes
the whole thing smell fishy.

There shouldn't be a reason why a commit in the superproject and
another commit in the submodule need to be in the same commit graph
in the first place.

Instead of breaking the function like this, the right "fix" may be
to make the commit slab per repository, because the commit index are
taken from the separate namespace per repository.

Junio C Hamano June 12, 2020, 10 p.m. UTC | #2

Junio C Hamano <gitster@pobox.com> writes:

> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>
>>  static unsigned int alloc_commit_index(struct repository *r)
>>  {
>> -	return r->parsed_objects->commit_count++;
>> +	static unsigned int parsed_commits_count = 0;
>> +	r->parsed_objects->commit_count++;
>> +	return parsed_commits_count++;
>>  }
>
> Hmph, ugly.  
>
> The need for this hack, which butchers the function that explicitly
> takes a repository as its parameter so that parsed commits can be
> counted per repository to not count commits per repository, makes
> the whole thing smell fishy.
>
> There shouldn't be a reason why a commit in the superproject and
> another commit in the submodule need to be in the same commit graph
> in the first place.
>
> Instead of breaking the function like this, the right "fix" may be
> to make the commit slab per repository, because the commit index are
> taken from the separate namespace per repository.

Given a commit object (i.e. "struct commit *"), currently there is
no way to tell from which repository it came from.  So from that
point of view, we cannot identify which commit slab to use, unless
we add a back-pointer that says from which repository each commit
object came from, but that makes the commit object even heavier.

Back when 14ba97f8 (alloc: allow arbitrary repositories for alloc
functions, 2018-05-15) made the count per repository (which you are
reverting with this patch), there must have been a reason why it did
so.  We know that commit slab code wants you to count globally and
that is why you wrote this patch, but don't other parts of the code
expect and rely on the commits being counted per repository?  In
other words, with this change, who are you breaking to help the
commit slab code?

Cc'ing SZEDER who also have touched this and made the function
static early last year.

Thanks.

Junio C Hamano June 12, 2020, 10:20 p.m. UTC | #3

Junio C Hamano <gitster@pobox.com> writes:

> Back when 14ba97f8 (alloc: allow arbitrary repositories for alloc
> functions, 2018-05-15) made the count per repository (which you are
> reverting with this patch), there must have been a reason why it did
> so.  We know that commit slab code wants you to count globally and
> that is why you wrote this patch, but don't other parts of the code
> expect and rely on the commits being counted per repository?  In
> other words, with this change, who are you breaking to help the
> commit slab code?

I did a bit more digging, as I had a bit of time after pushing
today's integration result out, and it turns out that in the today's
code, c->index is the only thing that uses this counter.  It is set
in init_commit_node() function, which is called from object.c when
an in-core commit object is created.  The callchain looks like this:

object_as_type(struct repository *, struct object *, enum object_type, int quiet)
  -> init_commit_node(struct repository *, struct commit *)
       -> alloc_commit_index(struct repository *)

What is interesting is that object_as_type() takes the parameter
"struct repository *" ONLY BECAUSE it needs to pass something to
init_commit_node(), which in turn takes it ONLY BECAUSE the
alloc_commit_index() wants one to be passed.

Since the ONLY reason why there needs a monotorically increasing
counter of in-core commit objects is because we need to be able to
index into the commit slab, and because we cannot make commit slabs
per-repository due to the lack of backpointer from an in-core commit
object to the repository it belongs to, reverting 14ba97f8 (alloc:
allow arbitrary repositories for alloc functions, 2018-05-15) is a
reasonable thing to do, but then since the only reason the above
three functions in the callchain take "struct repository *" is
because the bottom-most alloc_commit_index() wants it WHEN IT DOES
NOT USE IT, it would be good to get rid of the "struct repository"
parameter from the functions involved in the callchain altogether.

Then we won't have to ask the "who are we breaking with this change"
question.  At that point it would be clear that everybody would be
OK with a single counter to ensure uniqueness of in-core commit
across all the in-core repository instances.

Thanks.

Junio C Hamano June 12, 2020, 10:52 p.m. UTC | #4

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:

>  static unsigned int alloc_commit_index(struct repository *r)
>  {
> -	return r->parsed_objects->commit_count++;
> +	static unsigned int parsed_commits_count = 0;
> +	r->parsed_objects->commit_count++;
> +	return parsed_commits_count++;
>  }

I'll queue this as-is, together with the rest of the series, but
with the following SQUASH??? to document why we are counting
globally.

I do not think r->parsed_objects->*_count is used by anybody, and we
probably can eventually get rid of these stats fields, and when that
happens, we may want to lose the "struct repository *" pointer from
the functions in the callchain, but not right now.

Thanks.

 alloc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/alloc.c b/alloc.c
index 29f0e3aa80..ee92661b71 100644
--- a/alloc.c
+++ b/alloc.c
@@ -99,6 +99,11 @@ void *alloc_object_node(struct repository *r)
 	return obj;
 }

+/*
+ * The returned count is to be used as an index into commit slabs,
+ * that are *NOT* maintained per repository, and that is why a single
+ * global counter is used.
+ */
 static unsigned int alloc_commit_index(struct repository *r)
 {
 	static unsigned int parsed_commits_count = 0;

Jakub Narębski June 12, 2020, 11:16 p.m. UTC | #5

On 12.06.2020, Abhishek Kumar wrote:

> Commit slab relies on uniqueness of commit->index to access data. As
> submodules are repositories on their own, alloc_commit_index() (which
> depends on repository->parsed_objects->commit_count) no longer
> returns unique values.
> 
> This would break tests once we move `generation` and `graph_pos` into a
> commit slab, as commits of supermodule and submodule can have the same
> index but must have different graph positions.

First, commits of supermodule and of submodule are in different graphs,
so I don't see why they have to be in the same serialized commit-graph
file.

Second, Git stores many different types of information on slab already.
How comes that we have not had any problems till now?  

There is contains_cache, commit_seen, indegree_slab, author_date_slab,
commit_base, commit_pos, bloom_filter_slab, buffer_slab, commit_rev_name,
commit_names, commit_name_slab, saved_parents, blame_suspects,
commit_todo_item.

> 
> Let's introduce a counter variable, `parsed_commits_count` to keep track
> of parsed commits so far.

All right, thought it might be worth mentioning that it is a global
variable, or rather a static variable in a function.

> 
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
> 
> CI Build for the failing tests:
> https://travis-ci.com/github/abhishekkumar2718/git/jobs/345413840

The failed tests are, from what I see:
- t4060-diff-submodule-option-diff-format.sh
- t4041-diff-submodule-option.sh
- t4059-diff-submodule-not-initialized.sh

> 
>   alloc.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/alloc.c b/alloc.c
> index 1c64c4dd16..29f0e3aa80 100644
> --- a/alloc.c
> +++ b/alloc.c
> @@ -101,7 +101,9 @@ void *alloc_object_node(struct repository *r)
>   
>   static unsigned int alloc_commit_index(struct repository *r)
>   {
> -	return r->parsed_objects->commit_count++;
> +	static unsigned int parsed_commits_count = 0;
> +	r->parsed_objects->commit_count++;

Do we use r->parsed_objects->commit_count anywhere?

> +	return parsed_commits_count++;

Does it matter that it is not thread safe, because it is not atomic?
Shouldn't it be

  +	static _Atomic unsigned int parsed_commits_count = 0;

or

  +	static _Atomic unsigned int parsed_commits_count = ATOMIC_VAR_INIT(0);

(If it is allowed).

>   }
>   
>   void init_commit_node(struct repository *r, struct commit *c)
>

Abhishek Kumar June 13, 2020, 6:57 p.m. UTC | #6

On Fri, Jun 12, 2020 at 03:52:26PM -0700, Junio C Hamano wrote:
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> 
> >  static unsigned int alloc_commit_index(struct repository *r)
> >  {
> > -	return r->parsed_objects->commit_count++;
> > +	static unsigned int parsed_commits_count = 0;
> > +	r->parsed_objects->commit_count++;
> > +	return parsed_commits_count++;
> >  }
> 
> I'll queue this as-is, together with the rest of the series, but
> with the following SQUASH??? to document why we are counting
> globally.
> 
> I do not think r->parsed_objects->*_count is used by anybody, and we
> probably can eventually get rid of these stats fields, and when that
> happens, we may want to lose the "struct repository *" pointer from
> the functions in the callchain, but not right now.
> 
> Thanks.
> 
>  alloc.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/alloc.c b/alloc.c
> index 29f0e3aa80..ee92661b71 100644
> --- a/alloc.c
> +++ b/alloc.c
> @@ -99,6 +99,11 @@ void *alloc_object_node(struct repository *r)
>  	return obj;
>  }
>  
> +/*
> + * The returned count is to be used as an index into commit slabs,
> + * that are *NOT* maintained per repository, and that is why a single
> + * global counter is used.
> + */
>  static unsigned int alloc_commit_index(struct repository *r)
>  {
>  	static unsigned int parsed_commits_count = 0;
> -- 
> 2.27.0-90-geebb51ba8c
> 

Thanks for taking a closer look. I have created an issue on
gitgitgadget [1] to get rid of commit count and avoid passing repository
to the functions anymore.

[1]: https://github.com/gitgitgadget/git/issues/657

Abhishek

[v3,1/4] alloc: introduce parsed_commits_count

Commit Message

Comments

Patch