diff mbox series

[v2,08/10] commit-graph: handle mixed generation commit chains

Message ID 833779ad53eb4f57ae514f4e8964e397845f1ddd.1596941625.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Implement Corrected Commit Date | expand

Commit Message

Jean-Noël Avila via GitGitGadget Aug. 9, 2020, 2:53 a.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

As corrected commit dates and topological levels cannot be compared
directly, we must handle commit graph chains with mixed generation
number definitions.

While reading a commit graph file, we disable generation numbers if the
chain contains mixed generation numbers.

While writing to commit graph chain, we write generation data chunk only
if the previous tip of chain had a generation data chunk. Using
`--split=replace` overwrites the existing chain and writes generation
data chunk regardless of previous tip.

In t5324-split-commit-graph, we set up a repo with twelve commits and
write a base commit graph file with no generation data chunk. When add
three commits and write to chain again, Git does not write generation
data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
as we replace the existing chain, Git writes a commit graph file with
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 14 ++++++++
 t/t5324-split-commit-graph.sh | 66 +++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)

Comments

Derrick Stolee Aug. 10, 2020, 4:42 p.m. UTC | #1
On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> As corrected commit dates and topological levels cannot be compared
> directly, we must handle commit graph chains with mixed generation
> number definitions.
> 
> While reading a commit graph file, we disable generation numbers if the
> chain contains mixed generation numbers.
> 
> While writing to commit graph chain, we write generation data chunk only
> if the previous tip of chain had a generation data chunk. Using
> `--split=replace` overwrites the existing chain and writes generation
> data chunk regardless of previous tip.
> 
> In t5324-split-commit-graph, we set up a repo with twelve commits and
> write a base commit graph file with no generation data chunk. When add
> three commits and write to chain again, Git does not write generation
> data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
> as we replace the existing chain, Git writes a commit graph file with
> generation data chunk.
> 
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 14 ++++++++
>  t/t5324-split-commit-graph.sh | 66 +++++++++++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+)
> 
> diff --git a/commit-graph.c b/commit-graph.c
> index d0f977852b..c6b6111adf 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -674,6 +674,14 @@ int generation_numbers_enabled(struct repository *r)
>  	if (!g->num_commits)
>  		return 0;
>  
> +	/* We cannot compare topological levels and corrected commit dates */
> +	while (g->base_graph) {
> +		warning(_("commit-graph-chain contains mixed generation versions"));

This warning is premature. It will add a warning whenever we have
a split commit-graph, regardless of an incorrect chain.

> +		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))

Hm. A bit-wise XOR here? That seems unfortunate. I think that it
is easier to focus on the 



> +			return 0;
> +		g = g->base_graph;
> +	}
> +

Hm. So this scenario actually disables generation numbers completely
in the event that anything in the chain disagrees. I think this is
not the right way to approach the situation, as it will significantly
punish users in this state with slow performance.

The patch I sent [1] is probably better: it uses generation number
v1 if the tip of the chain does not have a GDAT chunk.

[1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/

>  	first_generation = get_be32(g->chunk_commit_data +
>  				    g->hash_len + 8) >> 2;
>  
> @@ -2186,6 +2194,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		g = ctx->r->objects->commit_graph;
>  
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +
>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2204,6 +2215,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		if (ctx->split_opts)
>  			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;
>  	}

Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
from its current location to after this line. We want to make sure that the
environment variable is checked _last_. The best location is likely the start
of the implementation of compute_generation_numbers(), or immediately before
the call to the method.

> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&
> +	graphdir=".git/objects/info/commit-graphs" &&
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 3 0
> +	num_commits: 12
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output
> +'
> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 1
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output
> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split='replace' &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&
> +	graph_read_expect 15
> +'

It would be valuable to double-check here that the values in the GDAT chunk
are correct. I'm concerned about the possibility that the 'generation'
member of struct commit_graph_data gets filled with topological level during
parsing and then that is written as an offset into the CDAT chunk.

Perhaps this is best left for a follow-up series that updates the 'verify'
subcommand to check the GDAT chunk.

Thanks,
-Stolee
Abhishek Kumar Aug. 11, 2020, 11:36 a.m. UTC | #2
On Mon, Aug 10, 2020 at 12:42:29PM -0400, Derrick Stolee wrote:
> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
> 
> ...
>
> Hm. So this scenario actually disables generation numbers completely
> in the event that anything in the chain disagrees. I think this is
> not the right way to approach the situation, as it will significantly
> punish users in this state with slow performance.
> 
> The patch I sent [1] is probably better: it uses generation number
> v1 if the tip of the chain does not have a GDAT chunk.
> 
> [1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/
> 

Yes, the patch is an clear improvement over my (convoluted and incorrect)
logic. Will add.

>
> ...
> 
> Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
> from its current location to after this line. We want to make sure that the
> environment variable is checked _last_. The best location is likely the start
> of the implementation of compute_generation_numbers(), or immediately before
> the call to the method.
> 

Sure, will do.

>
> ...
> 
> It would be valuable to double-check here that the values in the GDAT chunk
> are correct. I'm concerned about the possibility that the 'generation'
> member of struct commit_graph_data gets filled with topological level during
> parsing and then that is written as an offset into the CDAT chunk.
> 
> Perhaps this is best left for a follow-up series that updates the 'verify'
> subcommand to check the GDAT chunk.

If I can understand it correctly, one of ways to update 'verify'
subcommand to check the GDAT chunk as well would to be make use of the
flag variable introduced in your patch. We can isolate generation number
related checks and run checks once with flag = 1 (checking corrected
commit dates) and once with flag = 0 (checking topological levels).

This has the unfortunate effect of filling all commits twice, but as we
cannot change the commit_graph_data->generation any other way, I see no
alternatives without changing how commit_graph_generation() works.

Would it make more sense if we add the flag to struct commit_graph
instead of making it depend solely on g->chunk_generation_data and set
it within parse_commit_graph()?

We would be able to control the behavior of fill_commit_graph_info() and
we will not need to check g->chunk_generation_data before filling every
commit.

> 
> Thanks,
> -Stolee
> 

Thanks
- Abhishek
Derrick Stolee Aug. 11, 2020, 12:43 p.m. UTC | #3
On 8/11/2020 7:36 AM, Abhishek Kumar wrote:
> On Mon, Aug 10, 2020 at 12:42:29PM -0400, Derrick Stolee wrote:
>> On 8/8/2020 10:53 PM, Abhishek Kumar via GitGitGadget wrote:
>>
>> ...
>>
>> Hm. So this scenario actually disables generation numbers completely
>> in the event that anything in the chain disagrees. I think this is
>> not the right way to approach the situation, as it will significantly
>> punish users in this state with slow performance.
>>
>> The patch I sent [1] is probably better: it uses generation number
>> v1 if the tip of the chain does not have a GDAT chunk.
>>
>> [1] https://lore.kernel.org/git/a3910f82-ab2e-bf35-ac43-c30d77f3c96b@gmail.com/
>>
> 
> Yes, the patch is an clear improvement over my (convoluted and incorrect)
> logic. Will add.
> 
>>
>> ...
>>
>> Please make a point to move the line that checks GIT_TEST_COMMIT_GRAPH_NO_GDAT
>> from its current location to after this line. We want to make sure that the
>> environment variable is checked _last_. The best location is likely the start
>> of the implementation of compute_generation_numbers(), or immediately before
>> the call to the method.
>>
> 
> Sure, will do.
> 
>>
>> ...
>>
>> It would be valuable to double-check here that the values in the GDAT chunk
>> are correct. I'm concerned about the possibility that the 'generation'
>> member of struct commit_graph_data gets filled with topological level during
>> parsing and then that is written as an offset into the CDAT chunk.
>>
>> Perhaps this is best left for a follow-up series that updates the 'verify'
>> subcommand to check the GDAT chunk.
> 
> If I can understand it correctly, one of ways to update 'verify'
> subcommand to check the GDAT chunk as well would to be make use of the
> flag variable introduced in your patch. We can isolate generation number
> related checks and run checks once with flag = 1 (checking corrected
> commit dates) and once with flag = 0 (checking topological levels).
> 
> This has the unfortunate effect of filling all commits twice, but as we
> cannot change the commit_graph_data->generation any other way, I see no
> alternatives without changing how commit_graph_generation() works.
> 
> Would it make more sense if we add the flag to struct commit_graph
> instead of making it depend solely on g->chunk_generation_data and set
> it within parse_commit_graph()?
> 
> We would be able to control the behavior of fill_commit_graph_info() and
> we will not need to check g->chunk_generation_data before filling every
> commit.

I missed that you _already_ updated the logic in verify_commit_graph()
based on the generation. That logic should catch the problem, so it
might be enough to just add some "git commit-graph verify" commands into
your multi-level tests.

Specifically, the end result is this check:

	corrected_commit_date = commit_graph_generation(graph_commit);
	if (corrected_commit_date < max_parent_corrected_commit_date + 1)
		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
			     oid_to_hex(&cur_oid),
			     corrected_commit_date,
			     max_parent_corrected_commit_date + 1);

This will catch the order violations I was proposing could happen. It
doesn't go the extra mile to ensure that the commit-graph stores the
exact correct value or that the two bits of data are correct (both
topo-level and corrected commit date). That is fine for now, and we
can revisit if necessary.

The diff below makes some tweaks to your split-level test to show the
logic _was_ incorrect without my patch. Please incorporate the test
changes into your series. Note in particular that I added a base
layer that includes the GDAT chunk and _then_ adds a layer without
the GDAT chunk. That is an important case!

Thanks,
-Stolee

--- >8 ---

diff --git a/commit-graph.c b/commit-graph.c
index 17623274d9..d891a8ba3a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -674,14 +674,6 @@ int generation_numbers_enabled(struct repository *r)
 	if (!g->num_commits)
 		return 0;
 
-	/* We cannot compare topological levels and corrected commit dates */
-	while (g->base_graph) {
-		warning(_("commit-graph-chain contains mixed generation versions"));
-		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
-			return 0;
-		g = g->base_graph;
-	}
-
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_len + 8) >> 2;
 
@@ -787,7 +779,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data && (flags & COMMIT_GRAPH_GENERATION_V2))
+	if (g->chunk_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 	else
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 1a9be5e656..721515cc23 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -443,6 +443,7 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 		test_commit $i &&
 		git branch commits/$i || return 1
 	done &&
+	git commit-graph write --reachable --split &&
 	git reset --hard commits/2 &&
 	for i in $(test_seq 6 10)
 	do
@@ -455,14 +456,15 @@ test_expect_success 'setup repo for mixed generation commit-graph-chain' '
 	git reset --hard commits/4 &&
 	git merge commits/6 &&
 	git branch merge/2 &&
-	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
 	test-tool read-graph >output &&
 	cat >expect <<-EOF &&
-	header: 43475048 1 1 3 0
-	num_commits: 12
+	header: 43475048 1 1 4 1
+	num_commits: 7
 	chunks: oid_fanout oid_lookup commit_metadata
 	EOF
-	test_cmp expect output
+	test_cmp expect output &&
+	git commit-graph verify
 '
 
 test_expect_success 'does not write generation data chunk if not present on existing tip' '
@@ -472,23 +474,25 @@ test_expect_success 'does not write generation data chunk if not present on exis
 	git merge commits/5 &&
 	git merge merge/2 &&
 	git branch merge/3 &&
-	git commit-graph write --reachable --split &&
+	git commit-graph write --reachable --split=no-merge &&
 	test-tool read-graph >output &&
 	cat >expect <<-EOF &&
 	header: 43475048 1 1 4 1
 	num_commits: 3
 	chunks: oid_fanout oid_lookup commit_metadata
 	EOF
-	test_cmp expect output
+	test_cmp expect output &&
+	git commit-graph verify
 '
 
 test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
 	cd "$TRASH_DIRECTORY/mixed" &&
-	git commit-graph write --reachable --split='replace' &&
+	git commit-graph write --reachable --split=replace &&
 	test_path_is_file $graphdir/commit-graph-chain &&
 	test_line_count = 1 $graphdir/commit-graph-chain &&
 	verify_chain_files_exist $graphdir &&
-	graph_read_expect 15
+	graph_read_expect 15 &&
+	git commit-graph verify
 '
 
 test_done
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index d0f977852b..c6b6111adf 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -674,6 +674,14 @@  int generation_numbers_enabled(struct repository *r)
 	if (!g->num_commits)
 		return 0;
 
+	/* We cannot compare topological levels and corrected commit dates */
+	while (g->base_graph) {
+		warning(_("commit-graph-chain contains mixed generation versions"));
+		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
+			return 0;
+		g = g->base_graph;
+	}
+
 	first_generation = get_be32(g->chunk_commit_data +
 				    g->hash_len + 8) >> 2;
 
@@ -2186,6 +2194,9 @@  int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2204,6 +2215,9 @@  int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 6b25c3d9ce..1a9be5e656 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -425,4 +425,70 @@  done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 3 0
+	num_commits: 12
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split='replace' &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15
+'
+
 test_done