diff mbox series

[v2,3/4] commit-graph: start parsing generation v2 (again)

Message ID 5bc6a7660d897ca6c52eabba8fb9ecfb6304dabb.1646056423.git.gitgitgadget@gmail.com (mailing list archive)
State Superseded
Headers show
Series Commit-graph: Generation Number v2 Fixes | expand

Commit Message

Derrick Stolee Feb. 28, 2022, 1:53 p.m. UTC
From: Derrick Stolee <derrickstolee@github.com>

The 'read_generation_data' member of 'struct commit_graph' was
introduced by 1fdc383c5 (commit-graph: use generation v2 only if entire
chain does, 2021-01-16). The intention was to avoid using corrected
commit dates if not all layers of a commit-graph had that data stored.
The logic in validate_mixed_generation_chain() at that point incorrectly
initialized read_generation_data to 1 if and only if the tip
commit-graph contained the Corrected Commit Date chunk.

This was "fixed" in 448a39e65 (commit-graph: validate layers for
generation data, 2021-02-02) to validate that read_generation_data was
either non-zero for all layers, or it would set read_generation_data to
zero for all layers.

The problem here is that read_generation_data is not initialized to be
non-zero anywhere!

This change initializes read_generation_data immediately after the chunk
is parsed, so each layer will have its value present as soon as
possible.

The read_generation_data member is used in fill_commit_graph_info() to
determine if we should use the corrected commit date or the topological
levels stored in the Commit Data chunk. Due to this bug, all previous
versions of Git were defaulting to topological levels in all cases!

This can be measured with some performance tests. Using the Linux kernel
as a testbed, I generated a complete commit-graph containing corrected
commit dates and tested the 'new' version against the previous, 'old'
version.

First, rev-list with --topo-order demonstrates a 26% improvement using
corrected commit dates:

hyperfine \
	-n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \
	-n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \
	--warmup=10

Benchmark 1: old
  Time (mean ± σ):      57.1 ms ±   3.1 ms
  Range (min … max):    52.9 ms …  62.0 ms    55 runs

Benchmark 2: new
  Time (mean ± σ):      45.5 ms ±   3.3 ms
  Range (min … max):    39.9 ms …  51.7 ms    59 runs

Summary
  'new' ran
    1.26 ± 0.11 times faster than 'old'

These performance improvements are due to the algorithmic improvements
given by walking fewer commits due to the higher cutoffs from corrected
commit dates.

However, this comes at a cost. The additional I/O cost of parsing the
corrected commit dates is visible in case of merge-base commands that do
not reduce the overall number of walked commits.

hyperfine \
        -n "old" "$OLD_GIT merge-base v4.8 v4.9" \
        -n "new" "$NEW_GIT merge-base v4.8 v4.9" \
        --warmup=10

Benchmark 1: old
  Time (mean ± σ):     110.4 ms ±   6.4 ms
  Range (min … max):    96.0 ms … 118.3 ms    25 runs

Benchmark 2: new
  Time (mean ± σ):     150.7 ms ±   1.1 ms
  Range (min … max):   149.3 ms … 153.4 ms    19 runs

Summary
  'old' ran
    1.36 ± 0.08 times faster than 'new'

Performance issues like this are what motivated 702110aac (commit-graph:
use config to specify generation type, 2021-02-25).

In the future, we could fix this performance problem by inserting the
corrected commit date offsets into the Commit Date chunk instead of
having that data in an extra chunk.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 commit-graph.c                |  3 +++
 t/t4216-log-bloom.sh          |  2 +-
 t/t5318-commit-graph.sh       | 14 ++++++++++++--
 t/t5324-split-commit-graph.sh |  9 +++++++--
 4 files changed, 23 insertions(+), 5 deletions(-)

Comments

Ævar Arnfjörð Bjarmason Feb. 28, 2022, 3:30 p.m. UTC | #1
On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <derrickstolee@github.com>
> [...]
> +	GENERATION_VERSION=2
> +	if test ! -z "$3"

TIL this works somewhere :)

I thought it *might* be unportable behavior (but didn't check at
first), but it's not! We have a few such cases already.

But IMO much less puzzling would be at least:

    if ! test -z "$3"

Or in this case, more plainly:

    if test -n "$3"

> +	then
> +		GENERATION_VERSION=$3
> +	fi
> +	OPTIONS=
> +	if test $GENERATION_VERSION -gt 1
> +	then
> +		OPTIONS=" read_generation_data"
> +	fi

Or actually, since we don't use $GENERATION_VERSION further down setting
it to a default etc. here seems a bit odd. Perhaps something closer to:

    if test $# -eq 3 && test $3 -gt 1

It's also possible to be more clever as e.g.:

    test "${3:-2}" -gt 1

But that hardly seems worth it...

>  NUM_COMMITS=9
> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
> index 778fa418de2..669ddc645fa 100755
> --- a/t/t5324-split-commit-graph.sh
> +++ b/t/t5324-split-commit-graph.sh
> @@ -30,11 +30,16 @@ graph_read_expect() {
>  	then
>  		NUM_BASE=$2
>  	fi
> +	OPTIONS=
> +	if test -z "$3"
> +	then
> +		OPTIONS=" read_generation_data"
> +	fi
>  	cat >expect <<- EOF
>  	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
>  	num_commits: $1
>  	chunks: oid_fanout oid_lookup commit_metadata generation_data
> -	options:
> +	options:$OPTIONS
>  	EOF
>  	test-tool read-graph >output &&
>  	test_cmp expect output

Not a new issue, but it would be nice to have the mostly copy/pasted
code in a lib-commit-graph.sh or something, but probably too distracting
for this small series...
Derrick Stolee Feb. 28, 2022, 4:43 p.m. UTC | #2
On 2/28/2022 10:30 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Feb 28 2022, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <derrickstolee@github.com>
>> [...]
>> +	GENERATION_VERSION=2
>> +	if test ! -z "$3"
> 
> TIL this works somewhere :)
> 
> I thought it *might* be unportable behavior (but didn't check at
> first), but it's not! We have a few such cases already.
> 
> But IMO much less puzzling would be at least:
> 
>     if ! test -z "$3"
> 
> Or in this case, more plainly:
> 
>     if test -n "$3"

Sure, that makes sense.

>> +	then
>> +		GENERATION_VERSION=$3
>> +	fi
>> +	OPTIONS=
>> +	if test $GENERATION_VERSION -gt 1
>> +	then
>> +		OPTIONS=" read_generation_data"
>> +	fi
> 
> Or actually, since we don't use $GENERATION_VERSION further down setting
> it to a default etc. here seems a bit odd. Perhaps something closer to:
> 
>     if test $# -eq 3 && test $3 -gt 1
> 
> It's also possible to be more clever as e.g.:
> 
>     test "${3:-2}" -gt 1
> 
> But that hardly seems worth it...

I prefer to use a variable so the code is self-documenting.

>>  NUM_COMMITS=9
>> diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
>> index 778fa418de2..669ddc645fa 100755
>> --- a/t/t5324-split-commit-graph.sh
>> +++ b/t/t5324-split-commit-graph.sh
>> @@ -30,11 +30,16 @@ graph_read_expect() {
>>  	then
>>  		NUM_BASE=$2
>>  	fi
>> +	OPTIONS=
>> +	if test -z "$3"
>> +	then
>> +		OPTIONS=" read_generation_data"
>> +	fi
>>  	cat >expect <<- EOF
>>  	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
>>  	num_commits: $1
>>  	chunks: oid_fanout oid_lookup commit_metadata generation_data
>> -	options:
>> +	options:$OPTIONS
>>  	EOF
>>  	test-tool read-graph >output &&
>>  	test_cmp expect output
> 
> Not a new issue, but it would be nice to have the mostly copy/pasted
> code in a lib-commit-graph.sh or something, but probably too distracting
> for this small series...

These cases are different enough in the needs of the test files
that they cannot be shared without significant complication.

Thanks,
-Stolee
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index a19bd96c2ee..8e52bb09552 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -407,6 +407,9 @@  struct commit_graph *parse_commit_graph(struct repository *r,
 			&graph->chunk_generation_data);
 		pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW,
 			&graph->chunk_generation_data_overflow);
+
+		if (graph->chunk_generation_data)
+			graph->read_generation_data = 1;
 	}
 
 	if (r->settings.commit_graph_read_changed_paths) {
diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh
index 5ed6d2a21c1..fa9d32facfb 100755
--- a/t/t4216-log-bloom.sh
+++ b/t/t4216-log-bloom.sh
@@ -48,7 +48,7 @@  graph_read_expect () {
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data
-	options: bloom(1,10,7)
+	options: bloom(1,10,7) read_generation_data
 	EOF
 	test-tool read-graph >actual &&
 	test_cmp expect actual
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index f9bffe38013..1afee1c2705 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -100,11 +100,21 @@  graph_read_expect() {
 		OPTIONAL=" $2"
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
 	fi
+	GENERATION_VERSION=2
+	if test ! -z "$3"
+	then
+		GENERATION_VERSION=$3
+	fi
+	OPTIONS=
+	if test $GENERATION_VERSION -gt 1
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -498,7 +508,7 @@  test_expect_success 'git commit-graph verify' '
 	cd "$TRASH_DIRECTORY/full" &&
 	git rev-parse commits/8 | git -c commitGraph.generationVersion=1 commit-graph write --stdin-commits &&
 	git commit-graph verify >output &&
-	graph_read_expect 9 extra_edges
+	graph_read_expect 9 extra_edges 1
 '
 
 NUM_COMMITS=9
diff --git a/t/t5324-split-commit-graph.sh b/t/t5324-split-commit-graph.sh
index 778fa418de2..669ddc645fa 100755
--- a/t/t5324-split-commit-graph.sh
+++ b/t/t5324-split-commit-graph.sh
@@ -30,11 +30,16 @@  graph_read_expect() {
 	then
 		NUM_BASE=$2
 	fi
+	OPTIONS=
+	if test -z "$3"
+	then
+		OPTIONS=" read_generation_data"
+	fi
 	cat >expect <<- EOF
 	header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata generation_data
-	options:
+	options:$OPTIONS
 	EOF
 	test-tool read-graph >output &&
 	test_cmp expect output
@@ -624,7 +629,7 @@  test_expect_success 'write generation data chunk if topmost remaining layer has
 		header: 43475048 1 $(test_oid oid_version) 5 1
 		num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
 		chunks: oid_fanout oid_lookup commit_metadata generation_data
-		options:
+		options: read_generation_data
 		EOF
 		test_cmp expect output
 	)