diff mbox series

[v3,09/11] commit-graph: use generation v2 only if entire chain does

Message ID 5a147a9704f0f8d8644c92ea38583e966378b931.1597509583.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Implement Corrected Commit Date | expand

Commit Message

Johannes Schindelin via GitGitGadget Aug. 15, 2020, 4:39 p.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 32 +++++++++++++++-
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 1 deletion(-)

Comments

Jakub Narębski Aug. 22, 2020, 5:14 p.m. UTC | #1
Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
>
> 1. "New" Git writes a commit-graph with the GDAT chunk.
> 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
> Because of the current use of inspecting the current layer for a
> chunk_generation_data pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.

All right, this explains changes to the *reading* side.  If there is
split commit-graph layer without GDAT chunk (or to be more exact,
current code checks if there is any layer without GDAT chunk), then we
fall back to using topological levels, as if all layers / all graphs
were without GDAT.  This allows to avoid the issue explained above,
where for some commits 'generation' holds corrected commit date, and for
some it holds topological levels, breaking the reachability condition
guarantee.


However the commit message do not say anything about the *writing* side.

We have decided to not write the GDAT chunk when writing the new layer
in split commit-graph, and top layer doesn't itself have GDAT chunks.
That makes for easier reasoning and safer handling: in mixed-version
environment the only possible arrangement is for the lower layers
(possibly zero) have GDAT chunk, and higher layers (possibly zero) not
have GDAT chunks.

Rewriting layers follows similar approach: if the topmost layer below
set of layers being rewritten (in the split commit-graph chain) exists,
and it does not contain GDAT chunk, then the result of rewrite should
not have GDAT chunks either.


To be more detailed, without '--split=replace' we would want the following
layer merging behavior:

   [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
           1              2           3             4            5

In the split commit-graph chain above, merging two topmost layers
(layers 4 and 5) should create a layer without GDAT; merging three
topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
4) should create a new layer with GDAT.

   [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
           1              2           3               merged

   [layer with GDAT][with GDAT][-------------with GDAT------------------]
           1              2                    merged

I hope those ASCII-art pictures help understanding it

>
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.
>
> This issue would manifest itself as a performance problem in this case,
> especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).

Wouldn't breaking the reachability condition promise make some Git
commands to return *incorrect* results if they short-circuit, stop
walking if generation number shows that A cannot reach B?

I am talking here about commands that return boolean, or select subset
from given set of revisions:
- git merge-base --is-ancestor <B> <A>
- git branch branch-A <A> && git branch --contains <B>
- git branch branch-B <B> && git branch --merged <A>

Git assumes that generation numbers fulfill the following condition:

  if A can reach B, then gen(A) > gen(B)

Notably this includes commits not in commit-graph, and clamped values.

However, in the following case

* if commit A is from higher layer without GDAT
  and uses topological levels for 'generation', e.g. 115 (in a small repo)
* and commit B is from lower layer with GDAT
  and uses corrected commit date as 'generation', for example 1598112896,

it may happen that A (later commit) can reach B (earlier commit), but
gen(B) > gen(A).  The reachability condition promise for generation
numbers is broken.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---

I have reordered files in the patch itself to make it easier to review
the proposed changes.

>  commit-graph.h                |  1 +
>  commit-graph.c                | 32 +++++++++++++++-
>  t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
>  3 files changed, 102 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.h b/commit-graph.h
> index f78c892fc0..3cf89d895d 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -63,6 +63,7 @@ struct commit_graph {
>  	struct object_directory *odb;
>
>  	uint32_t num_commits_in_base;
> +	uint32_t read_generation_data;
>  	struct commit_graph *base_graph;
>

First, why `read_generation_data` is of uint32_t type, when it stores
(as far as I understand it), a "boolean" value of either 0 or 1?

Second, couldn't we simply set chunk_generation_data to NULL?  Or would
that interfere with the case of rewriting, where we want to use existing
GDAT data when writing new commit-graph with GDAT chunk?

>  	const uint32_t *chunk_oid_fanout;
> diff --git a/commit-graph.c b/commit-graph.c
> index b7a72b40db..c1292f8e08 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -597,6 +597,27 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
>  	return graph_chain;
>  }
>
> +static void validate_mixed_generation_chain(struct repository *r)
> +{
> +	struct commit_graph *g = r->objects->commit_graph;
> +	int read_generation_data = 1;
> +
> +	while (g) {
> +		if (!g->chunk_generation_data) {
> +			read_generation_data = 0;
> +			break;
> +		}
> +		g = g->base_graph;
> +	}

This loop checks whole split commit-graph chain for existence of layers
without GDAT chunk.

On one hand it is more than needed _if_ we assume that the fact that
only topmost layers can be without GDAT holds true. On the other hand it
is safer (an example of defensive coding), and as the length of chain is
limited it should be not much of a performance penalty.

> +
> +	g = r->objects->commit_graph;
> +
> +	while (g) {
> +		g->read_generation_data = read_generation_data;
> +		g = g->base_graph;
> +	}

All right... though one of earlier commits introduced similar loop, but
it set chunk_generation_data to NULL instead.  Or did I remember it wrong?

> +}
> +
>  struct commit_graph *read_commit_graph_one(struct repository *r,
>  					   struct object_directory *odb)
>  {
> @@ -605,6 +626,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  	if (!g)
>  		g = load_commit_graph_chain(r, odb);
>
> +	validate_mixed_generation_chain(r);
> +

All right, when reading the commit-graph, check if we are in forced
backward-compatibile mode, and we need to use topological levels for
generation numbers.

>  	return g;
>  }
>
> @@ -763,7 +786,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>
> -	if (g->chunk_generation_data)
> +	if (g->chunk_generation_data && g->read_generation_data)

All right, when deciding whether to use corrected commit date
(generation number v2 from GDAT chunk), or fall back to using
topological levels (generation number v1 from CDAT chunk), we need to
take into accout other layers, to not mix v1 and v2.

We have earlier checked whether we can use generation number v2, now we
use the result of this check, propagated down the commit-graph chain.

>  		graph_data->generation = item->date +
>  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
>  	else
> @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
>  	uint32_t pos;
>  	if (!prepare_commit_graph(r))
>  		return;
> +
>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
>  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
>  }

This is unrelated whitespace fix, a "while at it" in neighbourhood of
changes.  All right then.

> @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>
>  		g = ctx->r->objects->commit_graph;
>
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +
>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2210,6 +2237,9 @@ int write_commit_graph(struct object_directory *odb,
>
>  		if (ctx->split_opts)
>  			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;
>  	}


The previous commit introduced `write_generation_data` member in
`struct write_commit_graph_context`, then used to handle support for
GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable.

Those two hunks of changes above are both inside

   if (ctx->split) {
      ...
   }

Here we examine the topmost layer of split commit-graph chain, and if it
does not contain GDAT chunk, then we do not store the GDAT chunk, unless
`git commit-graph write` is ru with `--split=replace` option.

However this is overly strict condition.  If we merge layer without GDAT
with layer with GDAT below, then we surely can write GDAT; the condition
for GDAT-less layers would be still fulfilled (met).  However we can
consider it 'good enough' for now, and relax this condition in later
commits.


Note that it is the first time in this patch were we make use of
assumption that if there are layers without GDAT then topmost layer is
without GDAT.

>
>  	ctx->approx_nr_objects = approximate_object_count();
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 531016f405..ac5e7783fb 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -424,4 +424,74 @@ done <<\EOF
>  0600 -r--------
>  EOF
>

It would be nice to have an ASCII-art graph of commits, but earlier
tests do not have it either...

> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&
> +	graphdir=".git/objects/info/commit-graphs" &&
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&
> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git commit-graph write --reachable --split &&
> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&
> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&
> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 1
> +	num_commits: 2
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify
> +'

Looks all right to me.

> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&
> +	git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 1 4 2
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify
> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split=replace &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&
> +	graph_read_expect 15 &&
> +	git commit-graph verify
> +'

It would be nice to have an example with merging layers (whether we
would handle it in strict or relaxed way).

> +
>  test_done

Best,
Abhishek Kumar Aug. 26, 2020, 7:15 a.m. UTC | #2
On Sat, Aug 22, 2020 at 07:14:38PM +0200, Jakub Narębski wrote:
> Hi Abhishek,
> 
> ... 
> 
> However the commit message do not say anything about the *writing* side.
> 

Revised the commit message to include the following at the end:

When writing the new layer in split commit-graph, we write a GDAT chunk
only if the topmost layer has a GDAT chunk. This guarantees that if a
layer has GDAT chunk, all lower layers must have a GDAT chunk as well.

Rewriting layers follows similar approach: if the topmost layer below
set of layers being rewritten (in the split commit-graph chain) exists,
and it does not contain GDAT chunk, then the result of rewrite does not
have GDAT chunks either.

> 
> ...
> 
> To be more detailed, without '--split=replace' we would want the following
> layer merging behavior:
> 
>    [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
>            1              2           3             4            5
> 
> In the split commit-graph chain above, merging two topmost layers
> (layers 4 and 5) should create a layer without GDAT; merging three
> topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
> 4) should create a new layer with GDAT.
> 
>    [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
>            1              2           3               merged
> 
>    [layer with GDAT][with GDAT][-------------with GDAT------------------]
>            1              2                    merged
> 
> I hope those ASCII-art pictures help understanding it
> 

Thanks! There were helpful.

While we work as expected in the first scenario i.e merging 4 and 5, we
would *still* write a layer without GDAT in the second scenario.

I have tweaked split_graph_merge_strategy() to fix this:

----------------------------------------------

diff --git a/commit-graph.c b/commit-graph.c
index 6d54d9a286..246fad030d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1973,6 +1973,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		}
 	}
 
+	if (!ctx->write_generation_data && g->chunk_generation_data)
+		ctx->write_generation_data = 1;
+
 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
 		ctx->new_base_graph = g;
 	else if (ctx->num_commit_graphs_after != 1)

----------------------------------------------------

That is, if we were not writing generation data (because of mixed
generation concerns) but the new topmost layer has a generation data
chunk, we have merged all layers without GDAT chunk and can now write a
GDAT chunk safely.

> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
> 
> Wouldn't breaking the reachability condition promise make some Git
> commands to return *incorrect* results if they short-circuit, stop
> walking if generation number shows that A cannot reach B?
> 
> I am talking here about commands that return boolean, or select subset
> from given set of revisions:
> - git merge-base --is-ancestor <B> <A>
> - git branch branch-A <A> && git branch --contains <B>
> - git branch branch-B <B> && git branch --merged <A>
> 
> Git assumes that generation numbers fulfill the following condition:
> 
>   if A can reach B, then gen(A) > gen(B)
> 
> Notably this includes commits not in commit-graph, and clamped values.
> 
> However, in the following case
> 
> * if commit A is from higher layer without GDAT
>   and uses topological levels for 'generation', e.g. 115 (in a small repo)
> * and commit B is from lower layer with GDAT
>   and uses corrected commit date as 'generation', for example 1598112896,
> 
> it may happen that A (later commit) can reach B (earlier commit), but
> gen(B) > gen(A).  The reachability condition promise for generation
> numbers is broken.
> 
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ---
> 
> I have reordered files in the patch itself to make it easier to review
> the proposed changes.
> 
> >  commit-graph.h                |  1 +
> >  commit-graph.c                | 32 +++++++++++++++-
> >  t/t5324-split-commit-graph.sh | 70 +++++++++++++++++++++++++++++++++++
> >  3 files changed, 102 insertions(+), 1 deletion(-)
> >
> > diff --git a/commit-graph.h b/commit-graph.h
> > index f78c892fc0..3cf89d895d 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -63,6 +63,7 @@ struct commit_graph {
> >  	struct object_directory *odb;
> >
> >  	uint32_t num_commits_in_base;
> > +	uint32_t read_generation_data;
> >  	struct commit_graph *base_graph;
> >
> 
> First, why `read_generation_data` is of uint32_t type, when it stores
> (as far as I understand it), a "boolean" value of either 0 or 1?
> 

Yes, using unsigned int instead of uint32_t (although in most of cases
it would be same).  If commit_graph had other flags as well, we could
have used a bit field.

> Second, couldn't we simply set chunk_generation_data to NULL?  Or would
> that interfere with the case of rewriting, where we want to use existing
> GDAT data when writing new commit-graph with GDAT chunk?

It interferes with rewriting the split commit-graph, as you might have
guessed from the above code snippet.

> 
> ...
>
> > diff --git a/commit-graph.c b/commit-graph.c
> 
> >  		graph_data->generation = item->date +
> >  			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
> >  	else
> > @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
> >  	uint32_t pos;
> >  	if (!prepare_commit_graph(r))
> >  		return;
> > +
> >  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
> >  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
> >  }
> 
> This is unrelated whitespace fix, a "while at it" in neighbourhood of
> changes.  All right then.
> 

Reverted this change, as it's unimportant.

> > @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>
> ...
> 
> It would be nice to have an example with merging layers (whether we
> would handle it in strict or relaxed way).
> 

Sure, will add.

> > +
> >  test_done
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek
Jakub Narębski Aug. 26, 2020, 10:38 a.m. UTC | #3
Hi Abhishek,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sat, Aug 22, 2020 at 07:14:38PM +0200, Jakub Narębski wrote:
>> Hi Abhishek,
>>
>> ...
>>
>> However the commit message do not say anything about the *writing* side.
>>
>
> Revised the commit message to include the following at the end:
>
> When writing the new layer in split commit-graph, we write a GDAT chunk
> only if the topmost layer has a GDAT chunk. This guarantees that if a
> layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>

All right.

> Rewriting layers follows similar approach: if the topmost layer below
> set of layers being rewritten (in the split commit-graph chain) exists,
> and it does not contain GDAT chunk, then the result of rewrite does not
> have GDAT chunks either.

All right.

I see that you went with proposed more complex (but better) solution...

>>
>> ...
>>
>> To be more detailed, without '--split=replace' we would want the following
>> layer merging behavior:
>>
>>    [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]
>>            1              2           3             4            5
>>
>> In the split commit-graph chain above, merging two topmost layers
>> (layers 4 and 5) should create a layer without GDAT; merging three
>> topmost layers (and any other layers, e.g. two middle ones, i.e. 3 and
>> 4) should create a new layer with GDAT.

A simpler solution would be to create a new merged layer without GDAT if
any of the layers being merged do not have GDAT.

In this solution merging 3+4+5, 3+4, and even 2+3 would result with
layer without GDAT, and only merging 1+2 would result in layer with GDAT.

>>
>>    [layer with GDAT][with GDAT][without GDAT][-------without GDAT-------]
>>            1              2           3               merged
>>
>>    [layer with GDAT][with GDAT][-------------with GDAT------------------]
>>            1              2                    merged
>>
>> I hope those ASCII-art pictures help understanding it
>>
>
> Thanks! There were helpful.
>
> While we work as expected in the first scenario i.e merging 4 and 5, we
> would *still* write a layer without GDAT in the second scenario.
>
> I have tweaked split_graph_merge_strategy() to fix this:
>
> ----------------------------------------------
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 6d54d9a286..246fad030d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1973,6 +1973,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
>  		}
>  	}
>
> +	if (!ctx->write_generation_data && g->chunk_generation_data)
> +		ctx->write_generation_data = 1;
> +
>  	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
>  		ctx->new_base_graph = g;
>  	else if (ctx->num_commit_graphs_after != 1)

...which turned out to be not that complicated.  Nice work!

Though this needs tests that if fulfills the stated condition (because I
am not sure if it is entirely correct: we are not checking the layer
below current one, isn't it?... ah, you explain it below).

One possible solution would be to grep `test-tool read-graph` output for
"^chunks: ", then pass it through `uniq` (without `sort`!), check that
the number of lines is less or equal 2, and if there are two lines then
check that we get the following contents:

  chunks: oid_fanout oid_lookup commit_metadata generation_data
  chunks: oid_fanout oid_lookup commit_metadata

(assuming that information about layers is added in top-down order).

This test must be run with GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0, which
I think is the default.

> ----------------------------------------------------
>
> That is, if we were not writing generation data (because of mixed
> generation concerns) but the new topmost layer has a generation data
> chunk, we have merged all layers without GDAT chunk and can now write a
> GDAT chunk safely.

All right.

[...]
>>> diff --git a/commit-graph.h b/commit-graph.h
>>> index f78c892fc0..3cf89d895d 100644
>>> --- a/commit-graph.h
>>> +++ b/commit-graph.h
>>> @@ -63,6 +63,7 @@ struct commit_graph {
>>>  	struct object_directory *odb;
>>>
>>>  	uint32_t num_commits_in_base;
>>> +	uint32_t read_generation_data;
>>>  	struct commit_graph *base_graph;
>>>
>>
>> First, why `read_generation_data` is of uint32_t type, when it stores
>> (as far as I understand it), a "boolean" value of either 0 or 1?
>
> Yes, using unsigned int instead of uint32_t (although in most of cases
> it would be same).  If commit_graph had other flags as well, we could
> have used a bit field.

OK.

>> Second, couldn't we simply set chunk_generation_data to NULL?  Or would
>> that interfere with the case of rewriting, where we want to use existing
>> GDAT data when writing new commit-graph with GDAT chunk?
>
> It interferes with rewriting the split commit-graph, as you might have
> guessed from the above code snippet.

All right.

[...]
>>> @@ -885,6 +908,7 @@ void load_commit_graph_info(struct repository *r, struct commit *item)
>>>  	uint32_t pos;
>>>  	if (!prepare_commit_graph(r))
>>>  		return;
>>> +
>>>  	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
>>>  		fill_commit_graph_info(item, r->objects->commit_graph, pos);
>>>  }
>>
>> This is unrelated whitespace fix, a "while at it" in neighbourhood of
>> changes.  All right then.
>>
>
> Reverted this change, as it's unimportant.

Actually I am not against fixing the whitespace in the neighbourhood of
changes, so you can keep it or revert it (discard).

>>> @@ -2192,6 +2216,9 @@ int write_commit_graph(struct object_directory *odb,
>>
>> ...
>>
>> It would be nice to have an example with merging layers (whether we
>> would handle it in strict or relaxed way).
>>
>
> Sure, will add.

Thanks.


Best,
--
Jakub Narębski
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index b7a72b40db..c1292f8e08 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -597,6 +597,27 @@  static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct repository *r)
+{
+	struct commit_graph *g = r->objects->commit_graph;
+	int read_generation_data = 1;
+
+	while (g) {
+		if (!g->chunk_generation_data) {
+			read_generation_data = 0;
+			break;
+		}
+		g = g->base_graph;
+	}
+
+	g = r->objects->commit_graph;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -605,6 +626,8 @@  struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(r);
+
 	return g;
 }
 
@@ -763,7 +786,7 @@  static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data)
+	if (g->chunk_generation_data && g->read_generation_data)
 		graph_data->generation = item->date +
 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 	else
@@ -885,6 +908,7 @@  void load_commit_graph_info(struct repository *r, struct commit *item)
 	uint32_t pos;
 	if (!prepare_commit_graph(r))
 		return;
+
 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
 }
@@ -2192,6 +2216,9 @@  int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2210,6 +2237,9 @@  int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->split_opts)
 			replace = ctx->split_opts->flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index f78c892fc0..3cf89d895d 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -63,6 +63,7 @@  struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	uint32_t read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 531016f405..ac5e7783fb 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -424,4 +424,74 @@  done <<\EOF
 0600 -r--------
 EOF
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --split &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 1
+	num_commits: 2
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 1 4 2
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split=replace &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15 &&
+	git commit-graph verify
+'
+
 test_done