diff mbox series

[v3,07/11] commit-graph: implement corrected commit date

Message ID 4074ace65be3094d35dd0aaedb89eb5a0ec98cee.1597509583.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Implement Corrected Commit Date | expand

Commit Message

Philippe Blain via GitGitGadget Aug. 15, 2020, 4:39 p.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
 1 file changed, 31 insertions(+), 27 deletions(-)

Comments

Jakub Narębski Aug. 22, 2020, 12:05 a.m. UTC | #1
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
>   equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
>   the maximum of its commit date and one more than the largest corrected
>   commit date among its parents.

Good.

>
> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.

Perhaps we should add more details about data type sizes in question.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t).  However corrected commit date
offsets can be safely stored^* using only 32 bits.  This halves the size
of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
2*H + 16 + 4 bytes, which is reduction of around 6%, not including
header, fanout table (OIDF) and extra edges list (EDGE).

Which might mean that the extra complication is not worth it, and we
should store corrected commit date directly instead.

*) unless for example one of commits is malformed but valid,
   and has committerdate of 0 Unix time, 1 January 1970.

>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.

OK, so the agenda for introducing geeration number v2 is as follows:
- compute generation numbers v2, i.e. corrected commit date
- store corrected commit date [offsets] in new GDAT chunk,
  unless backward-compatibility concerns require us to not to
- load [and compute] corrected commit date from commit-graph
  storing it as 'generation' field of `struct commit_graph_data`,
  unless backward-compatibility concerns require us to store
  topological levels (generation number v1) in there instead

Because the reachability condition for corrected commit date and for
topological level is exactly the same, we don't need to do anything to
take advantage of generation number v2.

Though we can use generation number v2 in more cases, where we turned
off use of generation numbers because v1 gave worse performance than
date heuristics.

Did I got this right?

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index a2f15b2825..fd69534dd5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	else if (generation_a > generation_b)
>  		return 1;
>
> -	/* use date as a heuristic when generations are equal */
> -	if (a->date < b->date)
> -		return -1;
> -	else if (a->date > b->date)
> -		return 1;

At first I was wondering why this tie-breaking is beig removed; wouldn't
be needed for backward-compatibility?  But then I remembered that this
comparison function is used _only_ for sorting commits when writing
Bloom filters, for `git commit-graph write --reachable --changed-paths ...`

Assuming that when writing the commit graph we always compute geeration
number v2 and 'generation' field stores corrected commit date, we don't
need to use date as a heuristic when generations are equal, and it would
not help in tie-breaking anyway.

All right.

>  	return 0;
>  }
>
> @@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;

All right, so the pattern is to add 'corrected_commit_date' stuff after
'topological_level' stuff.

>
>  		display_progress(ctx->progress, i + 1);
>  		if (level != GENERATION_NUMBER_V1_INFINITY &&
> -		    level != GENERATION_NUMBER_ZERO)
> +		    level != GENERATION_NUMBER_ZERO &&
> +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> +		    )
>  			continue;
>
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
>  			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = 0;
>
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>
>  				if (level == GENERATION_NUMBER_V1_INFINITY ||
> -				    level == GENERATION_NUMBER_ZERO) {
> +				    level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> +				    ) {
>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				} else {
> +					if (level > max_level)
> +						max_level = level;
> +
> +					if (corrected_commit_date > max_corrected_commit_date)
> +						max_corrected_commit_date = corrected_commit_date;
>  				}
>  			}
>
> @@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				if (max_level > GENERATION_NUMBER_MAX - 1)
>  					max_level = GENERATION_NUMBER_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +
> +				if (current->date > max_corrected_commit_date)
> +					max_corrected_commit_date = current->date - 1;
> +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>  			}
>  		}
>  	}

All right.  Looks good to me.

> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  	for (i = 0; i < g->num_commits; i++) {
>  		struct commit *graph_commit, *odb_commit;
>  		struct commit_list *graph_parents, *odb_parents;
> -		timestamp_t max_generation = 0;
> -		timestamp_t generation;
> +		timestamp_t max_corrected_commit_date = 0;
> +		timestamp_t corrected_commit_date;

This is simple, and perhaps unnecessary, rename of variables.
Shouldn't we however verify *both* topological level, and
(if exists) corrected commit date?

>
>  		display_progress(progress, i + 1);
>  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> @@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  					     oid_to_hex(&graph_parents->item->object.oid),
>  					     oid_to_hex(&odb_parents->item->object.oid));
>
> -			generation = commit_graph_generation(graph_parents->item);
> -			if (generation > max_generation)
> -				max_generation = generation;
> +			corrected_commit_date = commit_graph_generation(graph_parents->item);
> +			if (corrected_commit_date > max_corrected_commit_date)
> +				max_corrected_commit_date = corrected_commit_date;

Actually, commit_graph_generation(<commit>) can return either corrected
commit date, or topological level, the latter in backward-compatibility
case (if at least one commit-graph file is lacking GDAT chunk, because
[some of] it was created by the "Old" Git).

>
>  			graph_parents = graph_parents->next;
>  			odb_parents = odb_parents->next;
> @@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>  			continue;
>
> -		/*
> -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> -		 * extra logic in the following condition.
> -		 */
> -		if (max_generation == GENERATION_NUMBER_MAX)
> -			max_generation--;

All right, this was needed for checking the correctness of topological
levels (generation number v1) because we were checking not that it
fullfills the reachability condition, but more strict one: namely that
topological level of commit is equal to maximum of topological levels of
its parents plus one.

The comment about checking both generation number v1 and v2 still
applies.

> -
> -		generation = commit_graph_generation(graph_commit);
> -		if (generation != max_generation + 1)
> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> +		corrected_commit_date = commit_graph_generation(graph_commit);
> +		if (corrected_commit_date < max_corrected_commit_date + 1)
> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>  				     oid_to_hex(&cur_oid),
> -				     generation,
> -				     max_generation + 1);
> +				     corrected_commit_date,
> +				     max_corrected_commit_date + 1);

All right, we check less strict condition for corrected commit date.

>
>  		if (graph_commit->date != odb_commit->date)
>  			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),

Best,
Abhishek Kumar Aug. 25, 2020, 6:49 a.m. UTC | #2
On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With most of preparations done, let's implement corrected commit date.
> >
> > The corrected commit date for a commit is defined as:
> >
> > * A commit with no parents (a root commit) has corrected commit date
> >   equal to its committer date.
> > * A commit with at least one parent has corrected commit date equal to
> >   the maximum of its commit date and one more than the largest corrected
> >   commit date among its parents.
> 
> Good.
> 
> >
> > To minimize the space required to store corrected commit date, Git
> > stores corrected commit date offsets into the commit-graph file. The
> > corrected commit date offset for a commit is defined as the difference
> > between its corrected commit date and actual commit date.
> 
> Perhaps we should add more details about data type sizes in question.

Will add.

> 
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t).  However corrected commit date
> offsets can be safely stored^* using only 32 bits.  This halves the size
> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
> header, fanout table (OIDF) and extra edges list (EDGE).
> 
> Which might mean that the extra complication is not worth it, and we
> should store corrected commit date directly instead.
> 
> *) unless for example one of commits is malformed but valid,
>    and has committerdate of 0 Unix time, 1 January 1970.
> 
> >
> > While Git does not write out offsets at this stage, Git stores the
> > corrected commit dates in member generation of struct commit_graph_data.
> > It will begin writing commit date offsets with the introduction of
> > generation data chunk.
> 
> OK, so the agenda for introducing geeration number v2 is as follows:
> - compute generation numbers v2, i.e. corrected commit date
> - store corrected commit date [offsets] in new GDAT chunk,
>   unless backward-compatibility concerns require us to not to
> - load [and compute] corrected commit date from commit-graph
>   storing it as 'generation' field of `struct commit_graph_data`,
>   unless backward-compatibility concerns require us to store
>   topological levels (generation number v1) in there instead
> 

The last point is not correct. We always store topological levels into
the topo_levels slab introduced and always store corrected commit date
into data->generation, regardless of backward compatibility concerns.

We could avoid initializing topo_slab if we are not writing generation
data chunk (and thus don't need corrected commit dates) but that
wouldn't have an impact on run time while writing commit-graph because
computing corrected commit dates is cheap as the main cost is in walking
the graph and writing the file.

> Because the reachability condition for corrected commit date and for
> topological level is exactly the same, we don't need to do anything to
> take advantage of generation number v2.
> 
> Though we can use generation number v2 in more cases, where we turned
> off use of generation numbers because v1 gave worse performance than
> date heuristics.
> 
> Did I got this right?
> 
> >
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> >  commit-graph.c | 58 +++++++++++++++++++++++++++-----------------------
> >  1 file changed, 31 insertions(+), 27 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index a2f15b2825..fd69534dd5 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -169,11 +169,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	else if (generation_a > generation_b)
> >  		return 1;
> >
> > -	/* use date as a heuristic when generations are equal */
> > -	if (a->date < b->date)
> > -		return -1;
> > -	else if (a->date > b->date)
> > -		return 1;
> 
> At first I was wondering why this tie-breaking is beig removed; wouldn't
> be needed for backward-compatibility?  But then I remembered that this
> comparison function is used _only_ for sorting commits when writing
> Bloom filters, for `git commit-graph write --reachable --changed-paths ...`
> 
> Assuming that when writing the commit graph we always compute geeration
> number v2 and 'generation' field stores corrected commit date, we don't
> need to use date as a heuristic when generations are equal, and it would
> not help in tie-breaking anyway.
> 
> All right.
> 
> >  	return 0;
> >  }
> >
> > @@ -1342,10 +1337,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> >  		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> > +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> 
> All right, so the pattern is to add 'corrected_commit_date' stuff after
> 'topological_level' stuff.
> 
> >
> >  		display_progress(ctx->progress, i + 1);
> >  		if (level != GENERATION_NUMBER_V1_INFINITY &&
> > -		    level != GENERATION_NUMBER_ZERO)
> > +		    level != GENERATION_NUMBER_ZERO &&
> > +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> > +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> > +		    )
> >  			continue;
> >
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1354,17 +1353,26 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> >  			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = 0;
> >
> >  			for (parent = current->parents; parent; parent = parent->next) {
> >  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> >
> >  				if (level == GENERATION_NUMBER_V1_INFINITY ||
> > -				    level == GENERATION_NUMBER_ZERO) {
> > +				    level == GENERATION_NUMBER_ZERO ||
> > +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> > +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> > +				    ) {
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (level > max_level) {
> > -					max_level = level;
> > +				} else {
> > +					if (level > max_level)
> > +						max_level = level;
> > +
> > +					if (corrected_commit_date > max_corrected_commit_date)
> > +						max_corrected_commit_date = corrected_commit_date;
> >  				}
> >  			}
> >
> > @@ -1374,6 +1382,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  				if (max_level > GENERATION_NUMBER_MAX - 1)
> >  					max_level = GENERATION_NUMBER_MAX - 1;
> >  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +
> > +				if (current->date > max_corrected_commit_date)
> > +					max_corrected_commit_date = current->date - 1;
> > +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> >  			}
> >  		}
> >  	}
> 
> All right.  Looks good to me.
> 
> > @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  	for (i = 0; i < g->num_commits; i++) {
> >  		struct commit *graph_commit, *odb_commit;
> >  		struct commit_list *graph_parents, *odb_parents;
> > -		timestamp_t max_generation = 0;
> > -		timestamp_t generation;
> > +		timestamp_t max_corrected_commit_date = 0;
> > +		timestamp_t corrected_commit_date;
> 
> This is simple, and perhaps unnecessary, rename of variables.
> Shouldn't we however verify *both* topological level, and
> (if exists) corrected commit date?

The problem with verifying both topological level and corrected commit
dates is that we would have to re-fill commit_graph_data slab with commit
data chunk as we cannot modify data->generation otherwise, essentially
repeating the whole verification process.

While it's okay for now, I might take this up in a future series [1].

[1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

> 
> >
> >  		display_progress(progress, i + 1);
> >  		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
> > @@ -2412,9 +2424,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  					     oid_to_hex(&graph_parents->item->object.oid),
> >  					     oid_to_hex(&odb_parents->item->object.oid));
> >
> > -			generation = commit_graph_generation(graph_parents->item);
> > -			if (generation > max_generation)
> > -				max_generation = generation;
> > +			corrected_commit_date = commit_graph_generation(graph_parents->item);
> > +			if (corrected_commit_date > max_corrected_commit_date)
> > +				max_corrected_commit_date = corrected_commit_date;
> 
> Actually, commit_graph_generation(<commit>) can return either corrected
> commit date, or topological level, the latter in backward-compatibility
> case (if at least one commit-graph file is lacking GDAT chunk, because
> [some of] it was created by the "Old" Git).
> 
> >
> >  			graph_parents = graph_parents->next;
> >  			odb_parents = odb_parents->next;
> > @@ -2436,20 +2448,12 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  		if (generation_zero == GENERATION_ZERO_EXISTS)
> >  			continue;
> >
> > -		/*
> > -		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
> > -		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
> > -		 * extra logic in the following condition.
> > -		 */
> > -		if (max_generation == GENERATION_NUMBER_MAX)
> > -			max_generation--;
> 
> All right, this was needed for checking the correctness of topological
> levels (generation number v1) because we were checking not that it
> fullfills the reachability condition, but more strict one: namely that
> topological level of commit is equal to maximum of topological levels of
> its parents plus one.
> 
> The comment about checking both generation number v1 and v2 still
> applies.
> 
> > -
> > -		generation = commit_graph_generation(graph_commit);
> > -		if (generation != max_generation + 1)
> > -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> > +		corrected_commit_date = commit_graph_generation(graph_commit);
> > +		if (corrected_commit_date < max_corrected_commit_date + 1)
> > +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
> >  				     oid_to_hex(&cur_oid),
> > -				     generation,
> > -				     max_generation + 1);
> > +				     corrected_commit_date,
> > +				     max_corrected_commit_date + 1);
> 
> All right, we check less strict condition for corrected commit date.
> 
> >
> >  		if (graph_commit->date != odb_commit->date)
> >  			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
> 
> Best,
> -- 
> Jakub Narębski
Jakub Narębski Aug. 25, 2020, 10:07 a.m. UTC | #3
Hello,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 02:05:41AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> To minimize the space required to store corrected commit date, Git
>>> stores corrected commit date offsets into the commit-graph file. The
>>> corrected commit date offset for a commit is defined as the difference
>>> between its corrected commit date and actual commit date.
>>
>> Perhaps we should add more details about data type sizes in question.
>
> Will add.

Note however that we need to solve the problem of storing values which
are not monotonic wrt. parent relation (partial order) in limited disk
space, that is GENERATION_NUMBER_V2_OFFSET_MAX vs GENERATION_NUMBER_MAX;
see comments in 11/11 and 00/11.

>>
>> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
>> in most cases is 64 bits (uintmax_t).  However corrected commit date
>> offsets can be safely stored^* using only 32 bits.  This halves the size
>> of GDAT chunk, reducing per-commit storage from 2*H + 16 + 8 bytes to
>> 2*H + 16 + 4 bytes, which is reduction of around 6%, not including
>> header, fanout table (OIDF) and extra edges list (EDGE).
>>
>> Which might mean that the extra complication is not worth it, and we
>> should store corrected commit date directly instead.
>>
>> *) unless for example one of commits is malformed but valid,
>>    and has committerdate of 0 Unix time, 1 January 1970.

See above.

>>> While Git does not write out offsets at this stage, Git stores the
>>> corrected commit dates in member generation of struct commit_graph_data.
>>> It will begin writing commit date offsets with the introduction of
>>> generation data chunk.
>>
>> OK, so the agenda for introducing geeration number v2 is as follows:
>> - compute generation numbers v2, i.e. corrected commit date
>> - store corrected commit date [offsets] in new GDAT chunk,
>>   unless backward-compatibility concerns require us to not to
>> - load [and compute] corrected commit date from commit-graph
>>   storing it as 'generation' field of `struct commit_graph_data`,
>>   unless backward-compatibility concerns require us to store
>>   topological levels (generation number v1) in there instead
>>
>
> The last point is not correct. We always store topological levels into
> the topo_levels slab introduced and always store corrected commit date
> into data->generation, regardless of backward compatibility concerns.

I think I was not clear enough (in trying to be brief).  I meant here
loading available generation numbers for use in graph traversal,
done in later patches in this series.

In _next_ commit we store topological levels in `generation` field:

  @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
   	date_low = get_be32(commit_data + g->hash_len + 12);
   	item->date = (timestamp_t)((date_high << 32) | date_low);

  -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
  +	if (g->chunk_generation_data)
  +		graph_data->generation = item->date +
  +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
  +	else
  +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;


We use topo_level slab only when writing the commit-graph file.

> We could avoid initializing topo_slab if we are not writing generation
> data chunk (and thus don't need corrected commit dates) but that
> wouldn't have an impact on run time while writing commit-graph because
> computing corrected commit dates is cheap as the main cost is in walking
> the graph and writing the file.

Right.

Though you need to add the cost of allocation and managing extra
commit slab, I think that amortized cost is negligible.

But what would be better is showing benchmark data: does writing the
commit graph without GDAT take not insigificant more time than without
this patch?

[...]
>>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>  	for (i = 0; i < g->num_commits; i++) {
>>>  		struct commit *graph_commit, *odb_commit;
>>>  		struct commit_list *graph_parents, *odb_parents;
>>> -		timestamp_t max_generation = 0;
>>> -		timestamp_t generation;
>>> +		timestamp_t max_corrected_commit_date = 0;
>>> +		timestamp_t corrected_commit_date;
>>
>> This is simple, and perhaps unnecessary, rename of variables.
>> Shouldn't we however verify *both* topological level, and
>> (if exists) corrected commit date?
>
> The problem with verifying both topological level and corrected commit
> dates is that we would have to re-fill commit_graph_data slab with commit
> data chunk as we cannot modify data->generation otherwise, essentially
> repeating the whole verification process.
>
> While it's okay for now, I might take this up in a future series [1].
>
> [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/

All right, I believe you that verifying both topological level and
corrected commit date would be more difficult.

That doesn't change the conclusion that this variable should remain to
be named `generation`, as when verifying GDAT-less commit-graph files it
would check topological levels (it uses commit_graph_generation(), which
in turn uses `generation` field in commit graph info, which as I have
show above in later patch could be v1 or v2 generation number).

Best,
Abhishek Kumar Sept. 1, 2020, 11:01 a.m. UTC | #4
On Tue, Aug 25, 2020 at 12:07:17PM +0200, Jakub Narębski wrote:
> Hello,
> 
> ...
> 
> I think I was not clear enough (in trying to be brief).  I meant here
> loading available generation numbers for use in graph traversal,
> done in later patches in this series.
> 
> In _next_ commit we store topological levels in `generation` field:
> 
>   @@ -755,7 +763,11 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>    	date_low = get_be32(commit_data + g->hash_len + 12);
>    	item->date = (timestamp_t)((date_high << 32) | date_low);
> 
>   -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
>   +	if (g->chunk_generation_data)
>   +		graph_data->generation = item->date +
>   +			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>   +	else
>   +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
> 
> 
> We use topo_level slab only when writing the commit-graph file.
> 

Right, I thought the agenda outlined points in the process of writing
commit-graph file.

>
> > We could avoid initializing topo_slab if we are not writing generation
> > data chunk (and thus don't need corrected commit dates) but that
> > wouldn't have an impact on run time while writing commit-graph because
> > computing corrected commit dates is cheap as the main cost is in walking
> > the graph and writing the file.
> 
> Right.
> 
> Though you need to add the cost of allocation and managing extra
> commit slab, I think that amortized cost is negligible.
> 
> But what would be better is showing benchmark data: does writing the
> commit graph without GDAT take not insigificant more time than without
> this patch?

Right, we could compare time taken by master and series until (but not
including this patcth) to write a commit-graph file. Will add.

> 
> [...]
> >>> @@ -2372,8 +2384,8 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >>>  	for (i = 0; i < g->num_commits; i++) {
> >>>  		struct commit *graph_commit, *odb_commit;
> >>>  		struct commit_list *graph_parents, *odb_parents;
> >>> -		timestamp_t max_generation = 0;
> >>> -		timestamp_t generation;
> >>> +		timestamp_t max_corrected_commit_date = 0;
> >>> +		timestamp_t corrected_commit_date;
> >>
> >> This is simple, and perhaps unnecessary, rename of variables.
> >> Shouldn't we however verify *both* topological level, and
> >> (if exists) corrected commit date?
> >
> > The problem with verifying both topological level and corrected commit
> > dates is that we would have to re-fill commit_graph_data slab with commit
> > data chunk as we cannot modify data->generation otherwise, essentially
> > repeating the whole verification process.
> >
> > While it's okay for now, I might take this up in a future series [1].
> >
> > [1]: https://lore.kernel.org/git/4043ffbc-84df-0cd6-5c75-af80383a56cf@gmail.com/
> 
> All right, I believe you that verifying both topological level and
> corrected commit date would be more difficult.
> 
> That doesn't change the conclusion that this variable should remain to
> be named `generation`, as when verifying GDAT-less commit-graph files it
> would check topological levels (it uses commit_graph_generation(), which
> in turn uses `generation` field in commit graph info, which as I have
> show above in later patch could be v1 or v2 generation number).
> 

Right, I completely misunderstood you initially. Reverted the variable
name changes.

> Best,
> -- 
> Jakub Narębski
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index a2f15b2825..fd69534dd5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -169,11 +169,6 @@  static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -1342,10 +1337,14 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
 		if (level != GENERATION_NUMBER_V1_INFINITY &&
-		    level != GENERATION_NUMBER_ZERO)
+		    level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO
+		    )
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1354,17 +1353,26 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
 
 				if (level == GENERATION_NUMBER_V1_INFINITY ||
-				    level == GENERATION_NUMBER_ZERO) {
+				    level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO
+				    ) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
+				} else {
+					if (level > max_level)
+						max_level = level;
+
+					if (corrected_commit_date > max_corrected_commit_date)
+						max_corrected_commit_date = corrected_commit_date;
 				}
 			}
 
@@ -1374,6 +1382,10 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_MAX - 1)
 					max_level = GENERATION_NUMBER_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
@@ -2372,8 +2384,8 @@  int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 	for (i = 0; i < g->num_commits; i++) {
 		struct commit *graph_commit, *odb_commit;
 		struct commit_list *graph_parents, *odb_parents;
-		timestamp_t max_generation = 0;
-		timestamp_t generation;
+		timestamp_t max_corrected_commit_date = 0;
+		timestamp_t corrected_commit_date;
 
 		display_progress(progress, i + 1);
 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
@@ -2412,9 +2424,9 @@  int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 					     oid_to_hex(&graph_parents->item->object.oid),
 					     oid_to_hex(&odb_parents->item->object.oid));
 
-			generation = commit_graph_generation(graph_parents->item);
-			if (generation > max_generation)
-				max_generation = generation;
+			corrected_commit_date = commit_graph_generation(graph_parents->item);
+			if (corrected_commit_date > max_corrected_commit_date)
+				max_corrected_commit_date = corrected_commit_date;
 
 			graph_parents = graph_parents->next;
 			odb_parents = odb_parents->next;
@@ -2436,20 +2448,12 @@  int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
-		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_MAX)
-			max_generation--;
-
-		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		corrected_commit_date = commit_graph_generation(graph_commit);
+		if (corrected_commit_date < max_corrected_commit_date + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
-				     generation,
-				     max_generation + 1);
+				     corrected_commit_date,
+				     max_corrected_commit_date + 1);
 
 		if (graph_commit->date != odb_commit->date)
 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),