diff mbox series

[v4,08/10] commit-graph: use generation v2 only if entire chain does

Message ID 8ec119edc66814ad4d63908c79437a7f9dd3c08c.1602079786.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Implement Corrected Commit Date | expand

Commit Message

Elijah Newren via GitGitGadget Oct. 7, 2020, 2:09 p.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

Because of the current use of inspecting the current layer for a
chunk_generation_data pointer, the commits in the lower layer will be
interpreted as having very large generation values (commit date plus
offset) compared to the generation numbers in the top layer (topological
level). This violates the expectation that the generation of a parent is
strictly smaller than the generation of a child.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

When writing the new layer in split commit-graph, we write a GDAT chunk
only if the topmost layer has a GDAT chunk. This guarantees that if a
layer has GDAT chunk, all lower layers must have a GDAT chunk as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c                | 29 +++++++++++-
 commit-graph.h                |  1 +
 t/t5324-split-commit-graph.sh | 86 +++++++++++++++++++++++++++++++++++
 3 files changed, 115 insertions(+), 1 deletion(-)

Comments

Jakub Narębski Nov. 1, 2020, 12:55 a.m. UTC | #1
Hi Abhishek,

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> Since there are released versions of Git that understand generation
> numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> chunk, the following scenario is possible:
>
> 1. "New" Git writes a commit-graph with the GDAT chunk.
> 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

All right.

>
> Because of the current use of inspecting the current layer for a
> chunk_generation_data pointer, the commits in the lower layer will be
> interpreted as having very large generation values (commit date plus
> offset) compared to the generation numbers in the top layer (topological
> level). This violates the expectation that the generation of a parent is
> strictly smaller than the generation of a child.

I think this paragraphs tries too much to be concise, with the result it
is less clear than it could be.  Perhaps it would be better to separate
"what-if" from the current behavior.

  If each layer of split commit-graph is treated independently, as it
  were the case before this commit, with Git inspecting only the current
  layer for chunk_generation_data pointer, commits in the lower layer
  (one with GDAT) would have corrected commit date as their generation
  number, while commits in the upper layer would have topological levels
  as their generation.  Corrected commit dates have usually much larger
  values than topological levels.  This means that if we take two
  commits, one from the upper layer, and one reachable from it in the
  lower layer, then the expectation that the generation of a parent is
  smaller than the generation of a child would be violated.

>
> It is difficult to expose this issue in a test. Since we _start_ with
> artificially low generation numbers, any commit walk that prioritizes
> generation numbers will walk all of the commits with high generation
> number before walking the commits with low generation number. In all the
> cases I tried, the commit-graph layers themselves "protect" any
> incorrect behavior since none of the commits in the lower layer can
> reach the commits in the upper layer.

I don't quite understand the issue here. Unless none of the following
query commands short-circuit and all walk the commit graph regardless of
what generation numbers tell them, they should give different results
with and without the commit graph, if we take two commits one from lower
layer of split commit graph with GDAT, and one commit from the higher
layer without GDAT, one lower reachable from the other higher.

We have the following query commands that we can check:
  $ git merge-base --is-ancestor <lower> <higher>
  $ git merge-base --independent <lower> <higher>
  
  $ git tag --contains <tag-to-lower>
  $ git tag --merged <tag-to-higher>
  $ git branch --contains <branch-to-lower>
  $ git branch --merged <branch-to-higher>

The second set of queries require for those commits to be tagged, or
have branch pointing at them, respectively.

Also, shouldn't `git commit-graph verify` fail with split commit graph
where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?


Let's assume that we have the following history, with newer commits
shown on top like in `git log --graph --oneline --all`:

          topological     corrected         generation
          level           commit date       number^*

      d    3                                3
      |
   c  |    3                                3
   |  |                                                 without GDAT
 ..|..|.....[layer.boundary]........................................
   |  |                                                    with GDAT
   |  b    2              1112912113        1112912113
   |  |
   a  |    2              1112912053        1112912053
   | /
   |/
   r       1              1112911993        1112911993

*) each layer inspected individually.

With such history, we can for example reach 'a' from 'c', thus
`git merge-base --is-ancestor a b` should return true value, but
without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
I use here weaker reachability condition, but the one that works
also for commits outside the commit-graph (and those for which
generation numbers overflows).

>
> This issue would manifest itself as a performance problem in this case,
> especially with something like "git log --graph" since the low
> generation numbers would cause the in-degree queue to walk all of the
> commits in the lower layer before allowing the topo-order queue to write
> anything to output (depending on the size of the upper layer).

All right, that's good explanation.

>
> When writing the new layer in split commit-graph, we write a GDAT chunk
> only if the topmost layer has a GDAT chunk. This guarantees that if a
> layer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
> Rewriting layers follows similar approach: if the topmost layer below
> the set of layers being rewritten (in the split commit-graph chain)
> exists, and it does not contain GDAT chunk, then the result of rewrite
> does not have GDAT chunks either.

All right, very good explanation; the only minor suggestion would be to
add some 'intro' to the first of those two paragraphs, for example:

  Therefore, when writing the new layer in split commit-graph...

Though I am not sure if it is necessary.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c                | 29 +++++++++++-
>  commit-graph.h                |  1 +
>  t/t5324-split-commit-graph.sh | 86 +++++++++++++++++++++++++++++++++++
>  3 files changed, 115 insertions(+), 1 deletion(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 71d0b243db..5d15a1399b 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -605,6 +605,21 @@ static struct commit_graph *load_commit_graph_chain(struct repository *r,
>  	return graph_chain;
>  }
>  
> +static void validate_mixed_generation_chain(struct commit_graph *g)
> +{
> +	int read_generation_data;
> +
> +	if (!g)
> +		return;
> +
> +	read_generation_data = !!g->chunk_generation_data;
> +
> +	while (g) {
> +		g->read_generation_data = read_generation_data;
> +		g = g->base_graph;
> +	}
> +}

All right, this function checks assumedly topmost layer if it is
GDAT-less, and if it is propagates this status down the layers of split
commit graph.  This is needed because if we have mixed-generation commit
graph, then for each and every layer we need to use topological levels
as generation number.

The only minor issue is the name of this function (it does not hint that
it propagates the GDAT status downwards), but I don't have a better
idea, unfortunately.  And it does reflect what this function is used for.

> +
>  struct commit_graph *read_commit_graph_one(struct repository *r,
>  					   struct object_directory *odb)
>  {
> @@ -613,6 +628,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  	if (!g)
>  		g = load_commit_graph_chain(r, odb);
>  
> +	validate_mixed_generation_chain(g);
> +

All right, this looks like a good place to put this new check: just
after reading commit-graph chain.

>  	return g;
>  }
>  
> @@ -782,7 +799,7 @@ static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
>  	date_low = get_be32(commit_data + g->hash_len + 12);
>  	item->date = (timestamp_t)((date_high << 32) | date_low);
>  
> -	if (g->chunk_generation_data) {
> +	if (g->chunk_generation_data && g->read_generation_data) {
>  		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);

All right, instead of simply checking if the current layer has
generation data chunk, we need to also check if the whole graph allows
for it (if there are no mixed-generation layers).

The g->read_generation_data should be filled correctly, because
fill_commit_graph_info() is always preceded by read_commit_graph(), if I
understand it correctly.

>  
>  		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
> @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
>  		}
>  	}
>  
> +	if (!ctx->write_generation_data && g->chunk_generation_data)
> +		ctx->write_generation_data = 1;
> +

This needs more careful examination, and looking at larger context of
those lines.

At this point, unless `--split=replace` option is used, 'g' points to
the bottom layer out of all topmost layers being merged. We know that if
there are GDAT-less layers then these must be top layers, so this means
that we can write GDAT chunk in the result of the merge -- because we
would be replacing all possible GDAT-less layers (and maybe some with
GDAT) with a single layer with the GDAT chunk.

The ctx->write_generation_data is set to true unless environment
variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
write_commit_graph() it would be set to false if topmost layer doesn't
have GDAT chunk, and to true if `--split=replace` option is used; see
below.

Looks good to me.


NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
writing GDAT chunk with generation data v2 unless we are merging layers,
or replacing all of them with a single layer: then it is _ignored_.

Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
in t/README?  Currently it reads:

  GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
  commit-graph to be written without generation data chunk.

>  	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
>  		ctx->new_base_graph = g;
>  	else if (ctx->num_commit_graphs_after != 1)
> @@ -2274,6 +2294,7 @@ int write_commit_graph(struct object_directory *odb,
>  		struct commit_graph *g = ctx->r->objects->commit_graph;
>  
>  		while (g) {
> +			g->read_generation_data = 1;
>  			g->topo_levels = &topo_levels;
>  			g = g->base_graph;
>  		}

All right, when writing the commit graph we want to make use of existing
generation data chunks.  This is safe, because when computing generation
numbers for writing we have separate place to store topoogical levels
(`topo_levels`) so they would not be mixed with corrected commit dates:
generation number v1 and v2 are kept separate.

> @@ -2300,6 +2321,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		g = ctx->r->objects->commit_graph;
>  
> +		if (g && !g->chunk_generation_data)
> +			ctx->write_generation_data = 0;
> +

All right, if current (topmost) layed does not have GDAT, then when
creating a new layer do not create GDAT layer either (merging layers and
rewriting the commit-graph is handled separately).

>  		while (g) {
>  			ctx->num_commit_graphs_before++;
>  			g = g->base_graph;
> @@ -2318,6 +2342,9 @@ int write_commit_graph(struct object_directory *odb,
>  
>  		if (ctx->opts)
>  			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
> +
> +		if (replace)
> +			ctx->write_generation_data = 1;

All right, when replacing all layers (`git commit-graph write --split=replace`),
then we can safely write the GDAT chunk.

Note however that here we don't take into account the value of the
environment variable GIT_TEST_COMMIT_GRAPH_NO_GDAT.  Which maybe is what
we want...

>  	}
>  
>  	ctx->approx_nr_objects = approximate_object_count();
> diff --git a/commit-graph.h b/commit-graph.h
> index 19a02001fd..ad52130883 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -64,6 +64,7 @@ struct commit_graph {
>  	struct object_directory *odb;
>  
>  	uint32_t num_commits_in_base;
> +	unsigned int read_generation_data;
>  	struct commit_graph *base_graph;

All right, this new field is here to propagate to each layer the
information whether we can read from the generation number v2 data
chunk.

Though I am not sure whether this field should be added here, and
whether it should be `unsigned int` (we don't have to be that careful
about saving space for this type).

>  
>  	const uint32_t *chunk_oid_fanout;
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 651df89ab2..d0949a9eb8 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
>  	verify_chain_files_exist $graphdir
>  '
>  
> +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> +	mkdir mixed &&

This should probably go just before cd-ing into just created
subdirectory.

> +	graphdir=".git/objects/info/commit-graphs" &&
> +	test_oid_cache <<-EOM &&
> +	oid_version sha1:1
> +	oid_version sha256:2
> +	EOM

Minor nitpick: Why use "EOM", which is used only twice in Git the test
suite, and not the conventional "EOF" (used at least 4000 times)?

> +	cd "$TRASH_DIRECTORY/mixed" &&

The t/README says:

   - Don't chdir around in tests.  It is not sufficient to chdir to
     somewhere and then chdir back to the original location later in
     the test, as any intermediate step can fail and abort the test,
     causing the next test to start in an unexpected directory.  Do so
     inside a subshell if necessary.

Though I am not sure if it should apply also to this situation.

> +	git init &&
> +	git config core.commitGraph true &&
> +	git config gc.writeCommitGraph false &&

All right.

> +	for i in $(test_seq 3)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/1 &&
> +	for i in $(test_seq 4 5)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git reset --hard commits/2 &&
> +	for i in $(test_seq 6 10)
> +	do
> +		test_commit $i &&
> +		git branch commits/$i || return 1
> +	done &&
> +	git commit-graph write --reachable --split &&

Is there a reason why we do not check just written commit-graph file
with `test-tool read-graph >output-layer-1`?

> +	git reset --hard commits/2 &&
> +	git merge commits/4 &&

Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
to use one or the other?

> +	git branch merge/1 &&
> +	git reset --hard commits/4 &&
> +	git merge commits/6 &&
> +	git branch merge/2 &&

It would be nice to have ASCII-art of the history (of the graph of
revisions) created here for subsequent tests:

                                        
           /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
          /    \-\
         /        \
  1 <-- 2 <-- 3*   \--\
  |      \             \ 
  |       \-----\       \
   \             \       \
    \-- 4*<------ M/1     M/2
        |\               /  
        | \-- 5*        /
        \              /
         \------------/

  * - 1st layer  

Though I am not sure if what I have created is readable; I think a
better way to draw this graph is possible, for example:

               /- 3*
              /
             /
  1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
   \         \       \
    \         \       \
     \         \       \
      \- 4* <-- M/1     \    
         |\              \
         | \------------- M/2
         \
          \---- 5*

Edit: as I see the history gets even more complicated, so perhaps
ASCII-art diagram of the history with layers marked would be too
complicated, and wouldn't bring much.

Why do we need such shape of the history in the repository?

> +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 $(test_oid oid_version) 4 1
> +	num_commits: 2
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&

All right, we check that we have 2 commits, and that there is no GDAT
chunk.

> +	git commit-graph verify

All right, we verify commit-graph as a whole (both layers).

> +'
> +
> +test_expect_success 'does not write generation data chunk if not present on existing tip' '

Hmmm... I wonder if we can come up with a better name for this test;
for example should it be "does not write" or "do not write"?

> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git reset --hard commits/3 &&
> +	git merge merge/1 &&
> +	git merge commits/5 &&
> +	git merge merge/2 &&
> +	git branch merge/3 &&

The commit graph gets complicated, so it would not be easy to visualize
it with ASCII-art diagram without any crossed lines.  Maybe `git log
--graph --oneline --all` would help:

*   (merge/3) Merge branch 'merge/2'
|\
| *   (merge/2) Merge branch 'commits/6'
| |\
* | \   Merge branch 'commits/5'
|\ \ \
| * | | (commits/5) 5
| |/ /
* | |   Merge branch 'merge/1'
|\ \ \
| * | | (merge/1) Merge branch 'commits/4'
| |\| |
| | * | (commits/4) 4
* | | | (commits/3) 3
|/ / /
| | | * (commits/10) 10
| | | * (commits/9) 9
| | | * (commits/8) 8
| | | * (commits/7) 7
| | |/
| | * (commits/6) 6
| |/
|/|
* | (commits/2) 2
|/
* (commits/1) 1


> +	git commit-graph write --reachable --split=no-merge &&
> +	test-tool read-graph >output &&
> +	cat >expect <<-EOF &&
> +	header: 43475048 1 $(test_oid oid_version) 4 2
> +	num_commits: 3
> +	chunks: oid_fanout oid_lookup commit_metadata
> +	EOF
> +	test_cmp expect output &&
> +	git commit-graph verify

All right, so here we check that we have layer without GDAT at the top,
and we request not to merge layers thus new layer will be created, then
the new layer also does not have GDAT chunk (and has 3 commits).

Minor nitpick: shouldn't those test be indented?

> +'
> +
> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	git commit-graph write --reachable --split=replace &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	test_line_count = 1 $graphdir/commit-graph-chain &&
> +	verify_chain_files_exist $graphdir &&

All right, this checks that we have split commit-graph chain that
consist of a single layer, and that the commit-graph file for this
single layer exists.

> +	graph_read_expect 15 &&

Shouldn't we use `test-tool read-graph` to check whether generation_data
chunk is present... ah, sorry, I have realized that after previous
patches `graph_read_expect 15` implicitly checks the latter, because in
its' use of `test-tool read-graph` it does expect generation_data chunk.

So we use `test-tool read-graph` manually to check that generation_data
chunk is absent, and we use graph_read_expect to check that it is
present (and in both cases that the number of commits matches).  I
wonder if it would be possible to simplify that...

> +	git commit-graph verify

All right.

> +'
> +
> +test_expect_success 'add one commit, write a tip graph' '
> +	cd "$TRASH_DIRECTORY/mixed" &&
> +	test_commit 11 &&
> +	git branch commits/11 &&
> +	git commit-graph write --reachable --split &&
> +	test_path_is_missing $infodir/commit-graph &&
> +	test_path_is_file $graphdir/commit-graph-chain &&
> +	ls $graphdir/graph-*.graph >graph-files &&
> +	test_line_count = 2 graph-files &&
> +	verify_chain_files_exist $graphdir
> +'

What it is meant to test?  That adding single-commit to a 15 commit
commit-graph file in split mode does not result in layers merging, and
actually adds a new layer: we check that we have exactly two layers and
that they are all OK.

We don't check here that the newly created top layer commit-graph does
have GDAT chunk, as it should be if the top layer (in this case the only
layer) has GDAT chunk.

> +
>  test_done

One test we are missing is testing that merging layers is done
correctly, namely that if we are merging layers in split commit-graph
file, and the layer below the ones we are merging lacks GDAT chunk, then
the result of the merge should also be without GDAT chunk.  This would
require at least two GDAT-less layers in a setup.

I'm not sure how difficult writing such test should be.

Best,
Abhishek Kumar Nov. 12, 2020, 10:01 a.m. UTC | #2
On Sun, Nov 01, 2020 at 01:55:11AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > Since there are released versions of Git that understand generation
> > numbers in the commit-graph's CDAT chunk but do not understand the GDAT
> > chunk, the following scenario is possible:
> >
> > 1. "New" Git writes a commit-graph with the GDAT chunk.
> > 2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
> 
> All right.
> 
> >
> > Because of the current use of inspecting the current layer for a
> > chunk_generation_data pointer, the commits in the lower layer will be
> > interpreted as having very large generation values (commit date plus
> > offset) compared to the generation numbers in the top layer (topological
> > level). This violates the expectation that the generation of a parent is
> > strictly smaller than the generation of a child.
> 
> I think this paragraphs tries too much to be concise, with the result it
> is less clear than it could be.  Perhaps it would be better to separate
> "what-if" from the current behavior.
> 
>   If each layer of split commit-graph is treated independently, as it
>   were the case before this commit, with Git inspecting only the current
>   layer for chunk_generation_data pointer, commits in the lower layer
>   (one with GDAT) would have corrected commit date as their generation
>   number, while commits in the upper layer would have topological levels
>   as their generation.  Corrected commit dates have usually much larger
>   values than topological levels.  This means that if we take two
>   commits, one from the upper layer, and one reachable from it in the
>   lower layer, then the expectation that the generation of a parent is
>   smaller than the generation of a child would be violated.
> 

Thanks, that's better.

> >
> > It is difficult to expose this issue in a test. Since we _start_ with
> > artificially low generation numbers, any commit walk that prioritizes
> > generation numbers will walk all of the commits with high generation
> > number before walking the commits with low generation number. In all the
> > cases I tried, the commit-graph layers themselves "protect" any
> > incorrect behavior since none of the commits in the lower layer can
> > reach the commits in the upper layer.
> 
> I don't quite understand the issue here. Unless none of the following
> query commands short-circuit and all walk the commit graph regardless of
> what generation numbers tell them, they should give different results
> with and without the commit graph, if we take two commits one from lower
> layer of split commit graph with GDAT, and one commit from the higher
> layer without GDAT, one lower reachable from the other higher.
> 
> We have the following query commands that we can check:
>   $ git merge-base --is-ancestor <lower> <higher>
>   $ git merge-base --independent <lower> <higher>
>   
>   $ git tag --contains <tag-to-lower>
>   $ git tag --merged <tag-to-higher>
>   $ git branch --contains <branch-to-lower>
>   $ git branch --merged <branch-to-higher>
> 
> The second set of queries require for those commits to be tagged, or
> have branch pointing at them, respectively.
> 
> Also, shouldn't `git commit-graph verify` fail with split commit graph
> where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
> 
> Let's assume that we have the following history, with newer commits
> shown on top like in `git log --graph --oneline --all`:
> 
>           topological     corrected         generation
>           level           commit date       number^*
> 
>       d    3                                3
>       |
>    c  |    3                                3
>    |  |                                                 without GDAT
>  ..|..|.....[layer.boundary]........................................
>    |  |                                                    with GDAT
>    |  b    2              1112912113        1112912113
>    |  |
>    a  |    2              1112912053        1112912053
>    | /
>    |/
>    r       1              1112911993        1112911993
> 
> *) each layer inspected individually.
> 
> With such history, we can for example reach 'a' from 'c', thus
> `git merge-base --is-ancestor a b` should return true value, but
> without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
> I use here weaker reachability condition, but the one that works
> also for commits outside the commit-graph (and those for which
> generation numbers overflows).
> 

The original explanation was given by Dr. Stolee and he might not have
thought exhaustively about the issue.

In any case, your explanation and the history make sense to me. I will
try to add test and report back to the mailing list if something goes
wrong.

Thank you for clarifying in such detail.

> >
> > This issue would manifest itself as a performance problem in this case,
> > especially with something like "git log --graph" since the low
> > generation numbers would cause the in-degree queue to walk all of the
> > commits in the lower layer before allowing the topo-order queue to write
> > anything to output (depending on the size of the upper layer).
> 
> All right, that's good explanation.
> 
> ...
>
> > @@ -2030,6 +2047,9 @@ static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
> >  		}
> >  	}
> >  
> > +	if (!ctx->write_generation_data && g->chunk_generation_data)
> > +		ctx->write_generation_data = 1;
> > +
> 
> This needs more careful examination, and looking at larger context of
> those lines.
> 
> At this point, unless `--split=replace` option is used, 'g' points to
> the bottom layer out of all topmost layers being merged. We know that if
> there are GDAT-less layers then these must be top layers, so this means
> that we can write GDAT chunk in the result of the merge -- because we
> would be replacing all possible GDAT-less layers (and maybe some with
> GDAT) with a single layer with the GDAT chunk.
> 
> The ctx->write_generation_data is set to true unless environment
> variable GIT_TEST_COMMIT_GRAPH_NO_GDAT is true, and that in
> write_commit_graph() it would be set to false if topmost layer doesn't
> have GDAT chunk, and to true if `--split=replace` option is used; see
> below.
> 
> Looks good to me.
> 
> 
> NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
> writing GDAT chunk with generation data v2 unless we are merging layers,
> or replacing all of them with a single layer: then it is _ignored_.
> 
> Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
> in t/README?  Currently it reads:
> 
>   GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
>   commit-graph to be written without generation data chunk.

I think it's better to *not* write generation data chunk if
GIT_TEST_COMMIT_GRAPH_NO_GDAT is set even though all GDAT-less layers
are merged, that is:

  if (!ctx->write_generation_data &&
      g->chunk_generation_data &&
     !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
    ctx->write_generation_data = 1;

With this change, we would have a method to force-write commit-graph
without generation data chunk regardless of the shape of split
commit-graph files.

> 
> ...
>
> > diff --git a/commit-graph.h b/commit-graph.h
> > index 19a02001fd..ad52130883 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -64,6 +64,7 @@ struct commit_graph {
> >  	struct object_directory *odb;
> >  
> >  	uint32_t num_commits_in_base;
> > +	unsigned int read_generation_data;
> >  	struct commit_graph *base_graph;
> 
> All right, this new field is here to propagate to each layer the
> information whether we can read from the generation number v2 data
> chunk.
> 
> Though I am not sure whether this field should be added here, and
> whether it should be `unsigned int` (we don't have to be that careful
> about saving space for this type).
> 

I cannot think of a more appropriate struct than `struct commit_graph`. 
Any particular suggestions?

> > 
> >  	const uint32_t *chunk_oid_fanout;
> > diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> > index 651df89ab2..d0949a9eb8 100755
> > --- a/t/t5324-split-commit-graph.sh
> > +++ b/t/t5324-split-commit-graph.sh
> > @@ -440,4 +440,90 @@ test_expect_success '--split=replace with partial Bloom data' '
> >  	verify_chain_files_exist $graphdir
> >  '
> >  
> > +test_expect_success 'setup repo for mixed generation commit-graph-chain' '
> > +	mkdir mixed &&
> 
> This should probably go just before cd-ing into just created
> subdirectory.
> 
> > +	graphdir=".git/objects/info/commit-graphs" &&
> > +	test_oid_cache <<-EOM &&
> > +	oid_version sha1:1
> > +	oid_version sha256:2
> > +	EOM
> 
> Minor nitpick: Why use "EOM", which is used only twice in Git the test
> suite, and not the conventional "EOF" (used at least 4000 times)?

Right, both instances of "EOM" are actually my own. I looked up some
test script for oid cache that did use EOM when I first wrote the tests
but it's changed now. Will replace.
> 
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> 
> The t/README says:
> 
>    - Don't chdir around in tests.  It is not sufficient to chdir to
>      somewhere and then chdir back to the original location later in
>      the test, as any intermediate step can fail and abort the test,
>      causing the next test to start in an unexpected directory.  Do so
>      inside a subshell if necessary.
> 
> Though I am not sure if it should apply also to this situation.

While I cannot avoid changing directory, using a subshell would be best
to avoid causing the later tests to start in unexpected directories.

> 
> > +	git init &&
> > +	git config core.commitGraph true &&
> > +	git config gc.writeCommitGraph false &&
> 
> All right.
> 
> > +	for i in $(test_seq 3)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git reset --hard commits/1 &&
> > +	for i in $(test_seq 4 5)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git reset --hard commits/2 &&
> > +	for i in $(test_seq 6 10)
> > +	do
> > +		test_commit $i &&
> > +		git branch commits/$i || return 1
> > +	done &&
> > +	git commit-graph write --reachable --split &&
> 
> Is there a reason why we do not check just written commit-graph file
> with `test-tool read-graph >output-layer-1`?

We could check the written commit-graph file at this point but it's same
as existing tests as above.

> 
> > +	git reset --hard commits/2 &&
> > +	git merge commits/4 &&
> 
> Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
> to use one or the other?

`test_merge` is used in 26 places whereas `git merge` is used in over a
thousand places. `test_merge` is just not widely adopted and this lack
of adoption prevents further use.

> 
> > +	git branch merge/1 &&
> > +	git reset --hard commits/4 &&
> > +	git merge commits/6 &&
> > +	git branch merge/2 &&
> 
> It would be nice to have ASCII-art of the history (of the graph of
> revisions) created here for subsequent tests:
> 
>                                         
>            /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>           /    \-\
>          /        \
>   1 <-- 2 <-- 3*   \--\
>   |      \             \ 
>   |       \-----\       \
>    \             \       \
>     \-- 4*<------ M/1     M/2
>         |\               /  
>         | \-- 5*        /
>         \              /
>          \------------/
> 
>   * - 1st layer  
> 
> Though I am not sure if what I have created is readable; I think a
> better way to draw this graph is possible, for example:
> 
>                /- 3*
>               /
>              /
>   1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>    \         \       \
>     \         \       \
>      \         \       \
>       \- 4* <-- M/1     \    
>          |\              \
>          | \------------- M/2
>          \
>           \---- 5*
> 
> Edit: as I see the history gets even more complicated, so perhaps
> ASCII-art diagram of the history with layers marked would be too
> complicated, and wouldn't bring much.
> 
> Why do we need such shape of the history in the repository?

We don't need such a complicated shape. Any commit-graph file with 2-3
layers regardless of how commits are related should suffice. Will
simplify.

> 
> > +	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
> > +	test-tool read-graph >output &&
> > +	cat >expect <<-EOF &&
> > +	header: 43475048 1 $(test_oid oid_version) 4 1
> > +	num_commits: 2
> > +	chunks: oid_fanout oid_lookup commit_metadata
> > +	EOF
> > +	test_cmp expect output &&
> 
> All right, we check that we have 2 commits, and that there is no GDAT
> chunk.
> 
> > +	git commit-graph verify
> 
> All right, we verify commit-graph as a whole (both layers).
> 
> > +'
> > +
> > +test_expect_success 'does not write generation data chunk if not present on existing tip' '
> 
> Hmmm... I wonder if we can come up with a better name for this test;
> for example should it be "does not write" or "do not write"?

That's better.

> 
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	git reset --hard commits/3 &&
> > +	git merge merge/1 &&
> > +	git merge commits/5 &&
> > +	git merge merge/2 &&
> > +	git branch merge/3 &&
> 
> The commit graph gets complicated, so it would not be easy to visualize
> it with ASCII-art diagram without any crossed lines.  Maybe `git log
> --graph --oneline --all` would help:
> 
> *   (merge/3) Merge branch 'merge/2'
> |\
> | *   (merge/2) Merge branch 'commits/6'
> | |\
> * | \   Merge branch 'commits/5'
> |\ \ \
> | * | | (commits/5) 5
> | |/ /
> * | |   Merge branch 'merge/1'
> |\ \ \
> | * | | (merge/1) Merge branch 'commits/4'
> | |\| |
> | | * | (commits/4) 4
> * | | | (commits/3) 3
> |/ / /
> | | | * (commits/10) 10
> | | | * (commits/9) 9
> | | | * (commits/8) 8
> | | | * (commits/7) 7
> | | |/
> | | * (commits/6) 6
> | |/
> |/|
> * | (commits/2) 2
> |/
> * (commits/1) 1
> 
> 
> > +	git commit-graph write --reachable --split=no-merge &&
> > +	test-tool read-graph >output &&
> > +	cat >expect <<-EOF &&
> > +	header: 43475048 1 $(test_oid oid_version) 4 2
> > +	num_commits: 3
> > +	chunks: oid_fanout oid_lookup commit_metadata
> > +	EOF
> > +	test_cmp expect output &&
> > +	git commit-graph verify
> 
> All right, so here we check that we have layer without GDAT at the top,
> and we request not to merge layers thus new layer will be created, then
> the new layer also does not have GDAT chunk (and has 3 commits).
> 
> Minor nitpick: shouldn't those test be indented?
> 

The tests look indented to me and `git diff HEAD^ --check` gives nothing.

Did you mean the lines enclosed by EOF delimiter?

> > +'
> > +
> > +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	git commit-graph write --reachable --split=replace &&
> > +	test_path_is_file $graphdir/commit-graph-chain &&
> > +	test_line_count = 1 $graphdir/commit-graph-chain &&
> > +	verify_chain_files_exist $graphdir &&
> 
> All right, this checks that we have split commit-graph chain that
> consist of a single layer, and that the commit-graph file for this
> single layer exists.
> 
> > +	graph_read_expect 15 &&
> 
> Shouldn't we use `test-tool read-graph` to check whether generation_data
> chunk is present... ah, sorry, I have realized that after previous
> patches `graph_read_expect 15` implicitly checks the latter, because in
> its' use of `test-tool read-graph` it does expect generation_data chunk.
> 
> So we use `test-tool read-graph` manually to check that generation_data
> chunk is absent, and we use graph_read_expect to check that it is
> present (and in both cases that the number of commits matches).  I
> wonder if it would be possible to simplify that...
>

The problem here is graph_read_expect() as defined in
t5324-split-commit-graph takes two parameters - number of commits and
number of base graphs. If the number of base graphs is not passed to
the function call, it's assumed to be zero. Using a default parameter
is tricky - I can fix it by manually adding a zero to each of 
graph_read_expect() in an additional preparatory patch.

Any other suggestions are welcome too.

> 
> > +	git commit-graph verify
> 
> All right.
> 
> > +'
> > +
> > +test_expect_success 'add one commit, write a tip graph' '
> > +	cd "$TRASH_DIRECTORY/mixed" &&
> > +	test_commit 11 &&
> > +	git branch commits/11 &&
> > +	git commit-graph write --reachable --split &&
> > +	test_path_is_missing $infodir/commit-graph &&
> > +	test_path_is_file $graphdir/commit-graph-chain &&
> > +	ls $graphdir/graph-*.graph >graph-files &&
> > +	test_line_count = 2 graph-files &&
> > +	verify_chain_files_exist $graphdir
> > +'
> 
> What it is meant to test?  That adding single-commit to a 15 commit
> commit-graph file in split mode does not result in layers merging, and
> actually adds a new layer: we check that we have exactly two layers and
> that they are all OK.

This test is meant to check writing to a split graph in "normal"
conditions (i.e. all existing layers have generation data chunk). The
above tests are special cases as they involve merging layers with mixed 
generation number versions.

> 
> We don't check here that the newly created top layer commit-graph does
> have GDAT chunk, as it should be if the top layer (in this case the only
> layer) has GDAT chunk.
> > +
> >  test_done
> 
> One test we are missing is testing that merging layers is done
> correctly, namely that if we are merging layers in split commit-graph
> file, and the layer below the ones we are merging lacks GDAT chunk, then
> the result of the merge should also be without GDAT chunk.  This would
> require at least two GDAT-less layers in a setup.
> 
> I'm not sure how difficult writing such test should be.

It wouldn't be too hard. 

After the last test, I can write some more commits and write split 
commit-graph file without GDAT chunk. Then write some more commits 
and merge layers using `git commit-graph write --max-commits=<nr>`.

Thanks for pointing this out!

> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek
Jakub Narębski Nov. 13, 2020, 9:59 a.m. UTC | #3
Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Sun, Nov 01, 2020 at 01:55:11AM +0100, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
[...]
>>> It is difficult to expose this issue in a test. Since we _start_ with
>>> artificially low generation numbers, any commit walk that prioritizes
>>> generation numbers will walk all of the commits with high generation
>>> number before walking the commits with low generation number. In all the
>>> cases I tried, the commit-graph layers themselves "protect" any
>>> incorrect behavior since none of the commits in the lower layer can
>>> reach the commits in the upper layer.
>> 
>> I don't quite understand the issue here. Unless none of the following
>> query commands short-circuit and all walk the commit graph regardless of
>> what generation numbers tell them, they should give different results
>> with and without the commit graph, if we take two commits one from lower
>> layer of split commit graph with GDAT, and one commit from the higher
>> layer without GDAT, one lower reachable from the other higher.
>> 
>> We have the following query commands that we can check:
>>   $ git merge-base --is-ancestor <lower> <higher>
>>   $ git merge-base --independent <lower> <higher>
>>   
>>   $ git tag --contains <tag-to-lower>
>>   $ git tag --merged <tag-to-higher>
>>   $ git branch --contains <branch-to-lower>
>>   $ git branch --merged <branch-to-higher>
>> 
>> The second set of queries require for those commits to be tagged, or
>> have branch pointing at them, respectively.
>> 
>> Also, shouldn't `git commit-graph verify` fail with split commit graph
>> where the top layer is created with GIT_TEST_COMMIT_GRAPH_NO_GDAT=1?
>> 
>> Let's assume that we have the following history, with newer commits
>> shown on top like in `git log --graph --oneline --all`:
>> 
>>           topological     corrected         generation
>>           level           commit date       number^*
>> 
>>       d    3                                3
>>       |
>>    c  |    3                                3
>>    |  |                                                 without GDAT
>>  ..|..|.....[layer.boundary]........................................
>>    |  |                                                    with GDAT
>>    |  b    2              1112912113        1112912113
>>    |  |
>>    a  |    2              1112912053        1112912053
>>    | /
>>    |/
>>    r       1              1112911993        1112911993
>> 
>> *) each layer inspected individually.
>> 
>> With such history, we can for example reach 'a' from 'c', thus
>> `git merge-base --is-ancestor a b` should return true value, but
>> without this commit gen(a) > gen(c), instead of gen(a) <= gen(c);
>> I use here weaker reachability condition, but the one that works
>> also for commits outside the commit-graph (and those for which
>> generation numbers overflows).
>> 
>
> The original explanation was given by Dr. Stolee and he might not have
> thought exhaustively about the issue.
>
> In any case, your explanation and the history make sense to me. I will
> try to add test and report back to the mailing list if something goes
> wrong.
>
> Thank you for clarifying in such detail.

I don't think you need to add any new test.  It should be enough to check
that the first test introduced in this patch, namely 'setup repo for
mixed generation commit-graph-chain', fails without the change in this
patch -- as I think it does.  This is because `git commit-graph verify`
should fail with mixed-version split commit-graph with GDAT-less layer
on top without this change.

Reporting this (possibly as from one sentence to one paragraph in the
commit message) would be enough, in my opinion.

[...]
>> NOTE that this means that GIT_TEST_COMMIT_GRAPH_NO_GDAT prevents from
>> writing GDAT chunk with generation data v2 unless we are merging layers,
>> or replacing all of them with a single layer: then it is _ignored_.
>> 
>> Should we clarify this fact in the description of GIT_TEST_COMMIT_GRAPH_NO_GDAT
>> in t/README?  Currently it reads:
>> 
>>   GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
>>   commit-graph to be written without generation data chunk.
>
> I think it's better to *not* write generation data chunk if
> GIT_TEST_COMMIT_GRAPH_NO_GDAT is set even though all GDAT-less layers
> are merged, that is:
>
>   if (!ctx->write_generation_data &&
>       g->chunk_generation_data &&
>      !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
>     ctx->write_generation_data = 1;
>
> With this change, we would have a method to force-write commit-graph
> without generation data chunk regardless of the shape of split
> commit-graph files.

While it would be more consistent to always behave like the old Git with
GIT_TEST_COMMIT_GRAPH_NO_GDAT=1, it is in my opinion not necessary.

The only thing we need to test the mixed-version commit-graph chain is
the ability to add new layer on top without GDAT.  It does not matter if
this layer is created from new commits or a result of partial or full
merge of layers.

So the alternative to extending what GIT_TEST_COMMIT_GRAPH_NO_GDAT does
that you propose here would be simply improving the description of t in
t/README, e.g.

     GIT_TEST_COMMIT_GRAPH_NO_GDAT=<boolean>, when true, forces the
     commit-graph, or new layer in split commit-graph chain, to be written
     without generation data chunk.  It does not affect merging of layers.

For me either solution is fine.

[...]
>>> diff --git a/commit-graph.h b/commit-graph.h
>>> index 19a02001fd..ad52130883 100644
>>> --- a/commit-graph.h
>>> +++ b/commit-graph.h
>>> @@ -64,6 +64,7 @@ struct commit_graph {
>>>  	struct object_directory *odb;
>>>  
>>>  	uint32_t num_commits_in_base;
>>> +	unsigned int read_generation_data;
>>>  	struct commit_graph *base_graph;
>> 
>> All right, this new field is here to propagate to each layer the
>> information whether we can read from the generation number v2 data
>> chunk.
>> 
>> Though I am not sure whether this field should be added here, and
>> whether it should be `unsigned int` (we don't have to be that careful
>> about saving space for this type).
>
> I cannot think of a more appropriate struct than `struct commit_graph`. 
> Any particular suggestions?

After thinking about it a bit more, I think it is fine to have it here
in `struct commit_graph`, it is better than using a global variable
(which would make code non-reentrant; not that we use multiple threads
for reading multiple layers of the commit graph, but we might want to in
the future).

[...]
>>> +	cd "$TRASH_DIRECTORY/mixed" &&
>> 
>> The t/README says:
>> 
>>    - Don't chdir around in tests.  It is not sufficient to chdir to
>>      somewhere and then chdir back to the original location later in
>>      the test, as any intermediate step can fail and abort the test,
>>      causing the next test to start in an unexpected directory.  Do so
>>      inside a subshell if necessary.
>> 
>> Though I am not sure if it should apply also to this situation.
>
> While I cannot avoid changing directory, using a subshell would be best
> to avoid causing the later tests to start in unexpected directories.

This would allow for easier skipping of tests, and failed tests would
not propagate the error (because of subsequent tests after a failed one
starting in unexpected directory).

[...]
>>> +	for i in $(test_seq 3)
>>> +	do
>>> +		test_commit $i &&
>>> +		git branch commits/$i || return 1
>>> +	done &&
>>> +	git reset --hard commits/1 &&
>>> +	for i in $(test_seq 4 5)
>>> +	do
>>> +		test_commit $i &&
>>> +		git branch commits/$i || return 1
>>> +	done &&
>>> +	git reset --hard commits/2 &&
>>> +	for i in $(test_seq 6 10)
>>> +	do
>>> +		test_commit $i &&
>>> +		git branch commits/$i || return 1
>>> +	done &&
>>> +	git commit-graph write --reachable --split &&
>> 
>> Is there a reason why we do not check just written commit-graph file
>> with `test-tool read-graph >output-layer-1`?
>
> We could check the written commit-graph file at this point but it's same
> as existing tests as above.

All right, thanks for an explanation.

>> 
>>> +	git reset --hard commits/2 &&
>>> +	git merge commits/4 &&
>> 
>> Shouldn't we use `test_merge` instead of `git merge`; I am not sure when
>> to use one or the other?
>
> `test_merge` is used in 26 places whereas `git merge` is used in over a
> thousand places. `test_merge` is just not widely adopted and this lack
> of adoption prevents further use.

All right then.

>>> +	git branch merge/1 &&
>>> +	git reset --hard commits/4 &&
>>> +	git merge commits/6 &&
>>> +	git branch merge/2 &&
>> 
>> It would be nice to have ASCII-art of the history (of the graph of
>> revisions) created here for subsequent tests:
>> 
>>                                         
>>            /- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>>           /    \-\
>>          /        \
>>   1 <-- 2 <-- 3*   \--\
>>   |      \             \ 
>>   |       \-----\       \
>>    \             \       \
>>     \-- 4*<------ M/1     M/2
>>         |\               /  
>>         | \-- 5*        /
>>         \              /
>>          \------------/
>> 
>>   * - 1st layer  
>> 
>> Though I am not sure if what I have created is readable; I think a
>> better way to draw this graph is possible, for example:
>> 
>>                /- 3*
>>               /
>>              /
>>   1 <------ 2 <---- 6 <-- 7 <-- 8 <-- 9 <-- 10*
>>    \         \       \
>>     \         \       \
>>      \         \       \
>>       \- 4* <-- M/1     \    
>>          |\              \
>>          | \------------- M/2
>>          \
>>           \---- 5*
>> 
>> Edit: as I see the history gets even more complicated, so perhaps
>> ASCII-art diagram of the history with layers marked would be too
>> complicated, and wouldn't bring much.
>> 
>> Why do we need such shape of the history in the repository?
>
> We don't need such a complicated shape. Any commit-graph file with 2-3
> layers regardless of how commits are related should suffice. Will
> simplify.

If you are unsire if we need this shape of history to properly test all
corner cases of the algorithm, or whether simple history would be
enough, you can simply compare code coverage.  Git Makefile ha the
'coverage' target (which requires 'gcov' tool).

NOTE: if it is possible to run 'make coverage' for you, it can be used
to check if there are any parts of the new code that are not tested.

[...]
>>> +	git commit-graph write --reachable --split=no-merge &&
>>> +	test-tool read-graph >output &&
>>> +	cat >expect <<-EOF &&
>>> +	header: 43475048 1 $(test_oid oid_version) 4 2
>>> +	num_commits: 3
>>> +	chunks: oid_fanout oid_lookup commit_metadata
>>> +	EOF
>>> +	test_cmp expect output &&
>>> +	git commit-graph verify
>> 
>> All right, so here we check that we have layer without GDAT at the top,
>> and we request not to merge layers thus new layer will be created, then
>> the new layer also does not have GDAT chunk (and has 3 commits).
>> 
>> Minor nitpick: shouldn't those test be indented?
>> 
>
> The tests look indented to me and `git diff HEAD^ --check` gives nothing.
>
> Did you mean the lines enclosed by EOF delimiter?

I'm sorry, that was my mistake -- tabs are used for indent, and the
tabstop (in my newsreader) when being quoted made it look like it was
not indented.

[...]
>>> +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
>>> +	cd "$TRASH_DIRECTORY/mixed" &&
>>> +	git commit-graph write --reachable --split=replace &&
>>> +	test_path_is_file $graphdir/commit-graph-chain &&
>>> +	test_line_count = 1 $graphdir/commit-graph-chain &&
>>> +	verify_chain_files_exist $graphdir &&
>> 
>> All right, this checks that we have split commit-graph chain that
>> consist of a single layer, and that the commit-graph file for this
>> single layer exists.
>> 
>>> +	graph_read_expect 15 &&
>> 
>> Shouldn't we use `test-tool read-graph` to check whether generation_data
>> chunk is present... ah, sorry, I have realized that after previous
>> patches `graph_read_expect 15` implicitly checks the latter, because in
>> its' use of `test-tool read-graph` it does expect generation_data chunk.
>> 
>> So we use `test-tool read-graph` manually to check that generation_data
>> chunk is absent, and we use graph_read_expect to check that it is
>> present (and in both cases that the number of commits matches).  I
>> wonder if it would be possible to simplify that...

What I wanted to say that it might be better to have a second variant of
graph_read_expect() for GDAT-less layers -- but this might be
unnecessary complication.

> The problem here is graph_read_expect() as defined in
> t5324-split-commit-graph takes two parameters - number of commits and
> number of base graphs. If the number of base graphs is not passed to
> the function call, it's assumed to be zero. Using a default parameter
> is tricky - I can fix it by manually adding a zero to each of 
> graph_read_expect() in an additional preparatory patch.

All right, thanks for an explanation.  I should have examined
graph_read_expect() in more detail.

> Any other suggestions are welcome too.

[...]
>>> +test_expect_success 'add one commit, write a tip graph' '
>>> +	cd "$TRASH_DIRECTORY/mixed" &&
>>> +	test_commit 11 &&
>>> +	git branch commits/11 &&
>>> +	git commit-graph write --reachable --split &&
>>> +	test_path_is_missing $infodir/commit-graph &&
>>> +	test_path_is_file $graphdir/commit-graph-chain &&
>>> +	ls $graphdir/graph-*.graph >graph-files &&
>>> +	test_line_count = 2 graph-files &&
>>> +	verify_chain_files_exist $graphdir
>>> +'
>> 
>> What it is meant to test?  That adding single-commit to a 15 commit
>> commit-graph file in split mode does not result in layers merging, and
>> actually adds a new layer: we check that we have exactly two layers and
>> that they are all OK.
>
> This test is meant to check writing to a split graph in "normal"
> conditions (i.e. all existing layers have generation data chunk). The
> above tests are special cases as they involve merging layers with mixed 
> generation number versions.

All right.

>> 
>> We don't check here that the newly created top layer commit-graph does
>> have GDAT chunk, as it should be if the top layer (in this case the only
>> layer) has GDAT chunk.
>>> +
>>>  test_done
>> 
>> One test we are missing is testing that merging layers is done
>> correctly, namely that if we are merging layers in split commit-graph
>> file, and the layer below the ones we are merging lacks GDAT chunk, then
>> the result of the merge should also be without GDAT chunk.  This would
>> require at least two GDAT-less layers in a setup.
>> 
>> I'm not sure how difficult writing such test should be.
>
> It wouldn't be too hard. 
>
> After the last test, I can write some more commits and write split 
> commit-graph file without GDAT chunk. Then write some more commits 
> and merge layers using `git commit-graph write --max-commits=<nr>`.
>
> Thanks for pointing this out!

Good.

Best,
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index 71d0b243db..5d15a1399b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -605,6 +605,21 @@  static struct commit_graph *load_commit_graph_chain(struct repository *r,
 	return graph_chain;
 }
 
+static void validate_mixed_generation_chain(struct commit_graph *g)
+{
+	int read_generation_data;
+
+	if (!g)
+		return;
+
+	read_generation_data = !!g->chunk_generation_data;
+
+	while (g) {
+		g->read_generation_data = read_generation_data;
+		g = g->base_graph;
+	}
+}
+
 struct commit_graph *read_commit_graph_one(struct repository *r,
 					   struct object_directory *odb)
 {
@@ -613,6 +628,8 @@  struct commit_graph *read_commit_graph_one(struct repository *r,
 	if (!g)
 		g = load_commit_graph_chain(r, odb);
 
+	validate_mixed_generation_chain(g);
+
 	return g;
 }
 
@@ -782,7 +799,7 @@  static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
 	date_low = get_be32(commit_data + g->hash_len + 12);
 	item->date = (timestamp_t)((date_high << 32) | date_low);
 
-	if (g->chunk_generation_data) {
+	if (g->chunk_generation_data && g->read_generation_data) {
 		offset = (timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
 
 		if (offset & CORRECTED_COMMIT_DATE_OFFSET_OVERFLOW) {
@@ -2030,6 +2047,9 @@  static void split_graph_merge_strategy(struct write_commit_graph_context *ctx)
 		}
 	}
 
+	if (!ctx->write_generation_data && g->chunk_generation_data)
+		ctx->write_generation_data = 1;
+
 	if (flags != COMMIT_GRAPH_SPLIT_REPLACE)
 		ctx->new_base_graph = g;
 	else if (ctx->num_commit_graphs_after != 1)
@@ -2274,6 +2294,7 @@  int write_commit_graph(struct object_directory *odb,
 		struct commit_graph *g = ctx->r->objects->commit_graph;
 
 		while (g) {
+			g->read_generation_data = 1;
 			g->topo_levels = &topo_levels;
 			g = g->base_graph;
 		}
@@ -2300,6 +2321,9 @@  int write_commit_graph(struct object_directory *odb,
 
 		g = ctx->r->objects->commit_graph;
 
+		if (g && !g->chunk_generation_data)
+			ctx->write_generation_data = 0;
+
 		while (g) {
 			ctx->num_commit_graphs_before++;
 			g = g->base_graph;
@@ -2318,6 +2342,9 @@  int write_commit_graph(struct object_directory *odb,
 
 		if (ctx->opts)
 			replace = ctx->opts->split_flags & COMMIT_GRAPH_SPLIT_REPLACE;
+
+		if (replace)
+			ctx->write_generation_data = 1;
 	}
 
 	ctx->approx_nr_objects = approximate_object_count();
diff --git a/commit-graph.h b/commit-graph.h
index 19a02001fd..ad52130883 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -64,6 +64,7 @@  struct commit_graph {
 	struct object_directory *odb;
 
 	uint32_t num_commits_in_base;
+	unsigned int read_generation_data;
 	struct commit_graph *base_graph;
 
 	const uint32_t *chunk_oid_fanout;
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 651df89ab2..d0949a9eb8 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -440,4 +440,90 @@  test_expect_success '--split=replace with partial Bloom data' '
 	verify_chain_files_exist $graphdir
 '
 
+test_expect_success 'setup repo for mixed generation commit-graph-chain' '
+	mkdir mixed &&
+	graphdir=".git/objects/info/commit-graphs" &&
+	test_oid_cache <<-EOM &&
+	oid_version sha1:1
+	oid_version sha256:2
+	EOM
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git init &&
+	git config core.commitGraph true &&
+	git config gc.writeCommitGraph false &&
+	for i in $(test_seq 3)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/1 &&
+	for i in $(test_seq 4 5)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git reset --hard commits/2 &&
+	for i in $(test_seq 6 10)
+	do
+		test_commit $i &&
+		git branch commits/$i || return 1
+	done &&
+	git commit-graph write --reachable --split &&
+	git reset --hard commits/2 &&
+	git merge commits/4 &&
+	git branch merge/1 &&
+	git reset --hard commits/4 &&
+	git merge commits/6 &&
+	git branch merge/2 &&
+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 $(test_oid oid_version) 4 1
+	num_commits: 2
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'does not write generation data chunk if not present on existing tip' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git reset --hard commits/3 &&
+	git merge merge/1 &&
+	git merge commits/5 &&
+	git merge merge/2 &&
+	git branch merge/3 &&
+	git commit-graph write --reachable --split=no-merge &&
+	test-tool read-graph >output &&
+	cat >expect <<-EOF &&
+	header: 43475048 1 $(test_oid oid_version) 4 2
+	num_commits: 3
+	chunks: oid_fanout oid_lookup commit_metadata
+	EOF
+	test_cmp expect output &&
+	git commit-graph verify
+'
+
+test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	git commit-graph write --reachable --split=replace &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	test_line_count = 1 $graphdir/commit-graph-chain &&
+	verify_chain_files_exist $graphdir &&
+	graph_read_expect 15 &&
+	git commit-graph verify
+'
+
+test_expect_success 'add one commit, write a tip graph' '
+	cd "$TRASH_DIRECTORY/mixed" &&
+	test_commit 11 &&
+	git branch commits/11 &&
+	git commit-graph write --reachable --split &&
+	test_path_is_missing $infodir/commit-graph &&
+	test_path_is_file $graphdir/commit-graph-chain &&
+	ls $graphdir/graph-*.graph >graph-files &&
+	test_line_count = 2 graph-files &&
+	verify_chain_files_exist $graphdir
+'
+
 test_done