diff mbox series

[4/5] commit-graph: be extra careful about mixed generations

Message ID b267a9653a7560d1e59708f20106ef054d140a9f.1612199707.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Generation Number v2: Fix a tricky split graph bug | expand

Commit Message

Derrick Stolee Feb. 1, 2021, 5:15 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

When upgrading to a commit-graph with corrected commit dates from
one without, there are a few things that need to be considered.

When computing generation numbers for the new commit-graph file that
expects to add the generation_data chunk with corrected commit
dates, we need to ensure that the 'generation' member of the
commit_graph_data struct is set to zero for these commits.

Unfortunately, the fallback to use topological level for generation
number when corrected commit dates are not available are causing us
harm here: parsing commits notices that read_generation_data is
false and populates 'generation' with the topological level.

The solution is to iterate through the commits, parse the commits
to populate initial values, then reset the generation values to
zero to trigger recalculation. This loop only occurs when the
existing commit-graph data has no corrected commit dates.

While this improves our situation somewhat, we have not completely
solved the issue for correctly computing generation numbers for mixes
layers. That follows in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-graph.c | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

Comments

Taylor Blau Feb. 1, 2021, 6:04 p.m. UTC | #1
On Mon, Feb 01, 2021 at 05:15:06PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When upgrading to a commit-graph with corrected commit dates from
> one without, there are a few things that need to be considered.
>
> When computing generation numbers for the new commit-graph file that
> expects to add the generation_data chunk with corrected commit
> dates, we need to ensure that the 'generation' member of the
> commit_graph_data struct is set to zero for these commits.
>
> Unfortunately, the fallback to use topological level for generation
> number when corrected commit dates are not available are causing us
> harm here: parsing commits notices that read_generation_data is
> false and populates 'generation' with the topological level.
>
> The solution is to iterate through the commits, parse the commits
> to populate initial values, then reset the generation values to
> zero to trigger recalculation. This loop only occurs when the
> existing commit-graph data has no corrected commit dates.
>
> While this improves our situation somewhat, we have not completely
> solved the issue for correctly computing generation numbers for mixes
> layers. That follows in the next change.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 32 +++++++++++++++++++++++---------
>  1 file changed, 23 insertions(+), 9 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 13992137dd0..08148dd17f1 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1033,7 +1033,8 @@ struct write_commit_graph_context {
>  		 split:1,
>  		 changed_paths:1,
>  		 order_by_pack:1,
> -		 write_generation_data:1;
> +		 write_generation_data:1,
> +		 trust_generation_numbers:1;
>
>  	struct topo_level_slab *topo_levels;
>  	const struct commit_graph_opts *opts;
> @@ -1452,6 +1453,15 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  		ctx->progress = start_delayed_progress(
>  					_("Computing commit graph generation numbers"),
>  					ctx->commits.nr);
> +
> +	if (ctx->write_generation_data && !ctx->trust_generation_numbers) {
> +		for (i = 0; i < ctx->commits.nr; i++) {
> +			struct commit *c = ctx->commits.list[i];
> +			repo_parse_commit(ctx->r, c);
> +			commit_graph_data_at(c)->generation = GENERATION_NUMBER_ZERO;
> +		}
> +	}
> +

This took me a while to figure out since I spent quite a lot of time
thinking that you were setting the topological level to zero, _not_ the
corrected committer date.

Now that I understand which is which, I agree that this is the right way
to go forward.

That said, I do find it unnecessarily complex that we compute both the
generation number and the topological level in the same loops in
compute_generation_numbers()...

>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = ctx->commits.list[i];
>  		uint32_t level;
> @@ -1480,7 +1490,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>
>  				if (level == GENERATION_NUMBER_ZERO ||
> -				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
> +				    (ctx->write_generation_data &&
> +				     corrected_commit_date == GENERATION_NUMBER_ZERO)) {

...for exactly reasons like this. It does make sense that they could be
computed together since their computation is indeed quite similar. But
in practice I think you end up spending a lot of time reasoning around
complex conditionals like these.

So, I feel a little bit like we should spend some effort to split these
up. I'm OK with a little bit of code duplication (though if we can
factor out some common routine, that would also be nice). But I think
there's a tradeoff between DRY-ness and understandability, and that we
might be on the wrong side of it here.

>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> @@ -1500,12 +1511,15 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					max_level = GENERATION_NUMBER_V1_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
>
> -				if (current->date && current->date > max_corrected_commit_date)
> -					max_corrected_commit_date = current->date - 1;
> -				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> -
> -				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
> -					ctx->num_generation_data_overflows++;
> +				if (ctx->write_generation_data) {
> +					timestamp_t cur_g;
> +					if (current->date && current->date > max_corrected_commit_date)
> +						max_corrected_commit_date = current->date - 1;
> +					cur_g = commit_graph_data_at(current)->generation
> +					      = max_corrected_commit_date + 1;
> +					if (cur_g - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
> +						ctx->num_generation_data_overflows++;
> +				}

Looks like two things happened here:

  - A new local variable was introduced to store the value of
    'commit_graph_data_at(current)->generation' (now called 'cur_g'),
    and

  - All of this was guarded by a conditional on
    'ctx->write_generation_data'.

The first one is a readability improvement, and the second is the
substantive one, no?

>  			}
>  		}
>  	}
> @@ -2396,7 +2410,7 @@ int write_commit_graph(struct object_directory *odb,
>  	} else
>  		ctx->num_commit_graphs_after = 1;
>
> -	validate_mixed_generation_chain(ctx->r->objects->commit_graph);
> +	ctx->trust_generation_numbers = validate_mixed_generation_chain(ctx->r->objects->commit_graph);
>
>  	compute_generation_numbers(ctx);

Makes sense.

Thanks,
Taylor
Derrick Stolee Feb. 1, 2021, 6:13 p.m. UTC | #2
On 2/1/2021 1:04 PM, Taylor Blau wrote:
> On Mon, Feb 01, 2021 at 05:15:06PM +0000, Derrick Stolee via GitGitGadget wrote:
...
>>  	struct topo_level_slab *topo_levels;
>>  	const struct commit_graph_opts *opts;
>> @@ -1452,6 +1453,15 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>  		ctx->progress = start_delayed_progress(
>>  					_("Computing commit graph generation numbers"),
>>  					ctx->commits.nr);
>> +
>> +	if (ctx->write_generation_data && !ctx->trust_generation_numbers) {
>> +		for (i = 0; i < ctx->commits.nr; i++) {
>> +			struct commit *c = ctx->commits.list[i];
>> +			repo_parse_commit(ctx->r, c);
>> +			commit_graph_data_at(c)->generation = GENERATION_NUMBER_ZERO;
>> +		}
>> +	}
>> +
> 
> This took me a while to figure out since I spent quite a lot of time
> thinking that you were setting the topological level to zero, _not_ the
> corrected committer date.
> 
> Now that I understand which is which, I agree that this is the right way
> to go forward.
> 
> That said, I do find it unnecessarily complex that we compute both the
> generation number and the topological level in the same loops in
> compute_generation_numbers()...
> 
>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>  		struct commit *c = ctx->commits.list[i];
>>  		uint32_t level;
>> @@ -1480,7 +1490,8 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>  				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
>>
>>  				if (level == GENERATION_NUMBER_ZERO ||
>> -				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
>> +				    (ctx->write_generation_data &&
>> +				     corrected_commit_date == GENERATION_NUMBER_ZERO)) {
> 
> ...for exactly reasons like this. It does make sense that they could be
> computed together since their computation is indeed quite similar. But
> in practice I think you end up spending a lot of time reasoning around
> complex conditionals like these.
> 
> So, I feel a little bit like we should spend some effort to split these
> up. I'm OK with a little bit of code duplication (though if we can
> factor out some common routine, that would also be nice). But I think
> there's a tradeoff between DRY-ness and understandability, and that we
> might be on the wrong side of it here.

You're probably right that it is valuable to split the computations.
It would allow us to skip all of the "if (ctx->write_generation_data)"
checks in this implementation and rely on the callers to make that
choice.

>>  					all_parents_computed = 0;
>>  					commit_list_insert(parent->item, &list);
>>  					break;
>> @@ -1500,12 +1511,15 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>  					max_level = GENERATION_NUMBER_V1_MAX - 1;
>>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
>>
>> -				if (current->date && current->date > max_corrected_commit_date)
>> -					max_corrected_commit_date = current->date - 1;
>> -				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
>> -
>> -				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
>> -					ctx->num_generation_data_overflows++;
>> +				if (ctx->write_generation_data) {
>> +					timestamp_t cur_g;
>> +					if (current->date && current->date > max_corrected_commit_date)
>> +						max_corrected_commit_date = current->date - 1;
>> +					cur_g = commit_graph_data_at(current)->generation
>> +					      = max_corrected_commit_date + 1;
>> +					if (cur_g - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
>> +						ctx->num_generation_data_overflows++;
>> +				}
> 
> Looks like two things happened here:
> 
>   - A new local variable was introduced to store the value of
>     'commit_graph_data_at(current)->generation' (now called 'cur_g'),
>     and
> 
>   - All of this was guarded by a conditional on
>     'ctx->write_generation_data'.
> 
> The first one is a readability improvement, and the second is the
> substantive one, no?

Yes. Adding these checks and tabs made things super-wide, so cur_g
exists only for readability. If we split the computation, then this
is no longer required.

Thanks,
-Stolee
Junio C Hamano Feb. 1, 2021, 6:55 p.m. UTC | #3
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When upgrading to a commit-graph with corrected commit dates from
> one without, there are a few things that need to be considered.
>
> When computing generation numbers for the new commit-graph file that
> expects to add the generation_data chunk with corrected commit
> dates, we need to ensure that the 'generation' member of the
> commit_graph_data struct is set to zero for these commits.
>
> Unfortunately, the fallback to use topological level for generation
> number when corrected commit dates are not available are causing us
> harm here: parsing commits notices that read_generation_data is
> false and populates 'generation' with the topological level.
>
> The solution is to iterate through the commits, parse the commits
> to populate initial values, then reset the generation values to
> zero to trigger recalculation. This loop only occurs when the
> existing commit-graph data has no corrected commit dates.
>
> While this improves our situation somewhat, we have not completely
> solved the issue for correctly computing generation numbers for mixes

"mixed"? (will not touch locally).

> layers. That follows in the next change.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-graph.c | 32 +++++++++++++++++++++++---------
>  1 file changed, 23 insertions(+), 9 deletions(-)

The hunks that mention trust_generation_numbers match what the
commit log message says what the patch does, but other two hunks
that make sure that some code that used to run regardless of the
value of "ctx->write_generation_data" are now run only when the
member is nonzero smells like an unrelated change.

Thanks.
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index 13992137dd0..08148dd17f1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1033,7 +1033,8 @@  struct write_commit_graph_context {
 		 split:1,
 		 changed_paths:1,
 		 order_by_pack:1,
-		 write_generation_data:1;
+		 write_generation_data:1,
+		 trust_generation_numbers:1;
 
 	struct topo_level_slab *topo_levels;
 	const struct commit_graph_opts *opts;
@@ -1452,6 +1453,15 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 		ctx->progress = start_delayed_progress(
 					_("Computing commit graph generation numbers"),
 					ctx->commits.nr);
+
+	if (ctx->write_generation_data && !ctx->trust_generation_numbers) {
+		for (i = 0; i < ctx->commits.nr; i++) {
+			struct commit *c = ctx->commits.list[i];
+			repo_parse_commit(ctx->r, c);
+			commit_graph_data_at(c)->generation = GENERATION_NUMBER_ZERO;
+		}
+	}
+
 	for (i = 0; i < ctx->commits.nr; i++) {
 		struct commit *c = ctx->commits.list[i];
 		uint32_t level;
@@ -1480,7 +1490,8 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
 
 				if (level == GENERATION_NUMBER_ZERO ||
-				    corrected_commit_date == GENERATION_NUMBER_ZERO) {
+				    (ctx->write_generation_data &&
+				     corrected_commit_date == GENERATION_NUMBER_ZERO)) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
@@ -1500,12 +1511,15 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
 
-				if (current->date && current->date > max_corrected_commit_date)
-					max_corrected_commit_date = current->date - 1;
-				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
-
-				if (commit_graph_data_at(current)->generation - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
-					ctx->num_generation_data_overflows++;
+				if (ctx->write_generation_data) {
+					timestamp_t cur_g;
+					if (current->date && current->date > max_corrected_commit_date)
+						max_corrected_commit_date = current->date - 1;
+					cur_g = commit_graph_data_at(current)->generation
+					      = max_corrected_commit_date + 1;
+					if (cur_g - current->date > GENERATION_NUMBER_V2_OFFSET_MAX)
+						ctx->num_generation_data_overflows++;
+				}
 			}
 		}
 	}
@@ -2396,7 +2410,7 @@  int write_commit_graph(struct object_directory *odb,
 	} else
 		ctx->num_commit_graphs_after = 1;
 
-	validate_mixed_generation_chain(ctx->r->objects->commit_graph);
+	ctx->trust_generation_numbers = validate_mixed_generation_chain(ctx->r->objects->commit_graph);
 
 	compute_generation_numbers(ctx);