diff mbox series

[5/6] commit-graph: implement file format version 2

Message ID c55e0a738c79c30affcf2cb68564c01c36c66dff.1548280753.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Create commit-graph file format v2 | expand

Commit Message

Derrick Stolee via GitGitGadget Jan. 23, 2019, 9:59 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

The commit-graph file format had some shortcomings which we now
correct:

  1. The hash algorithm was determined by a single byte, instead
     of the 4-byte format identifier.

  2. There was no way to update the reachability index we used.
     We currently only support generation numbers, but that will
     change in the future.

  3. Git did not fail with error if the unused eighth byte was
     non-zero, so we could not use that to indicate an incremental
     file format without breaking compatibility across versions.

The new format modifies the header of the commit-graph to solve
these problems. We use the 4-byte hash format id, freeing up a byte
in our 32-bit alignment to introduce a reachability index version.
We can also fail to read the commit-graph if the eighth byte is
non-zero.

The 'git commit-graph read' subcommand needs updating to show the
new data.

Set the default file format version to 2, and adjust the tests to
expect the new 'git commit-graph read' output.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 .../technical/commit-graph-format.txt         | 26 +++++++++-
 builtin/commit-graph.c                        |  9 ++++
 commit-graph.c                                | 47 ++++++++++++++++---
 commit-graph.h                                |  1 +
 t/t5318-commit-graph.sh                       |  9 +++-
 5 files changed, 83 insertions(+), 9 deletions(-)

Comments

Jonathan Tan Jan. 23, 2019, 11:56 p.m. UTC | #1
> +Version 2:
> +
> +  1-byte number (C) of "chunks"
> +
> +  1-byte reachability index version number:
> +      Currently, the only valid number is 1.
> +
> +  1-byte (reserved for later use)
> +      Current clients expect this value to be zero, and will not
> +      try to read the commit-graph file if it is non-zero.
> +
> +  4-byte format identifier for the hash algorithm:
> +      If this identifier does not agree with the repository's current
> +      hash algorithm, then the client will not read the commit graph.

[snip]

> diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
> index b79d6263e9..3ff5e3b48d 100755
> --- a/t/t5318-commit-graph.sh
> +++ b/t/t5318-commit-graph.sh
> @@ -65,7 +65,8 @@ graph_read_expect() {
>  		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
>  	fi
>  	cat >expect <<- EOF
> -	header: 43475048 1 1 $NUM_CHUNKS 0
> +	header: 43475048 2 $NUM_CHUNKS 1 0
> +	hash algorithm: 73686131
>  	num_commits: $1
>  	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
>  	EOF
> @@ -390,10 +391,14 @@ test_expect_success 'detect bad signature' '
>  '
>  
>  test_expect_success 'detect bad version' '
> -	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
> +	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\03" \
>  		"graph version"
>  '
>  
> +test_expect_success 'detect version 2 with version 1 data' '
> +	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
> +		"reachability index version"
> +'
>  test_expect_success 'detect bad hash version' '
>  	corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \
>  		"hash version"

Should there also be a test that the "reserved" section be 0 and the
4-byte identifier agrees with the repo's hash algorithm? I assume that
this can be done by "corrupting" the version to 2 and then truly
corrupting the subsequent bytes.

Other than that, this and the previous patches look good.
Ævar Arnfjörð Bjarmason Jan. 24, 2019, 9:40 a.m. UTC | #2
On Wed, Jan 23 2019, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph file format had some shortcomings which we now
> correct:
>
>   1. The hash algorithm was determined by a single byte, instead
>      of the 4-byte format identifier.
>
>   2. There was no way to update the reachability index we used.
>      We currently only support generation numbers, but that will
>      change in the future.
>
>   3. Git did not fail with error if the unused eighth byte was
>      non-zero, so we could not use that to indicate an incremental
>      file format without breaking compatibility across versions.
>
> The new format modifies the header of the commit-graph to solve
> these problems. We use the 4-byte hash format id, freeing up a byte
> in our 32-bit alignment to introduce a reachability index version.
> We can also fail to read the commit-graph if the eighth byte is
> non-zero.

I haven't tested, but it looks from the patch like we can transparently
read existing v1 data and then will write v2 the next time. Would be
helpful for reviewers if this was noted explicitly in the commit
message.

Should there be a GIT_TEST_COMMIT_GRAPH_VERSION=[12] going forward to
test the non-default version, or do you feel confident the tests added
here test the upgrade path & old code well enough?

> The 'git commit-graph read' subcommand needs updating to show the
> new data.

Let's say "The ... subcommand has been updated to show the new
data". This sounds like a later patch is going to do that, but in fact
it's done here.

> Set the default file format version to 2, and adjust the tests to
> expect the new 'git commit-graph read' output.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  .../technical/commit-graph-format.txt         | 26 +++++++++-
>  builtin/commit-graph.c                        |  9 ++++
>  commit-graph.c                                | 47 ++++++++++++++++---
>  commit-graph.h                                |  1 +
>  t/t5318-commit-graph.sh                       |  9 +++-
>  5 files changed, 83 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 16452a0504..e367aa94b1 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -31,13 +31,22 @@ and hash type.
>
>  All 4-byte numbers are in network order.
>
> +There are two versions available, 1 and 2. These currently differ only in
> +the header.

Shouldn't this be s/currently/ / ? Won't we add a version 3 if we make
new changes?
Derrick Stolee Jan. 24, 2019, 2:34 p.m. UTC | #3
On 1/24/2019 4:40 AM, Ævar Arnfjörð Bjarmason wrote:
> On Wed, Jan 23 2019, Derrick Stolee via GitGitGadget wrote:
>
>> The new format modifies the header of the commit-graph to solve
>> these problems. We use the 4-byte hash format id, freeing up a byte
>> in our 32-bit alignment to introduce a reachability index version.
>> We can also fail to read the commit-graph if the eighth byte is
>> non-zero.
> I haven't tested, but it looks from the patch like we can transparently
> read existing v1 data and then will write v2 the next time. Would be
> helpful for reviewers if this was noted explicitly in the commit
> message.

Can do.

>
> Should there be a GIT_TEST_COMMIT_GRAPH_VERSION=[12] going forward to
> test the non-default version, or do you feel confident the tests added
> here test the upgrade path & old code well enough?

You're right that we should have an explicit "upgrade" test:

  1. Write a v1 commit-graph
  2. Add a commit
  3. Write a v2 commit-graph

As for a new GIT_TEST_ variable, we should only need to continue relying 
on GIT_TEST_COMMIT_GRAPH to test v2. I can add a 'graph_git_behavior' 
call on an explicitly v1 commit-graph file to get most of the coverage 
we need.

>> The 'git commit-graph read' subcommand needs updating to show the
>> new data.
> Let's say "The ... subcommand has been updated to show the new
> data". This sounds like a later patch is going to do that, but in fact
> it's done here.

Will clean up.

>> Set the default file format version to 2, and adjust the tests to
>> expect the new 'git commit-graph read' output.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>   .../technical/commit-graph-format.txt         | 26 +++++++++-
>>   builtin/commit-graph.c                        |  9 ++++
>>   commit-graph.c                                | 47 ++++++++++++++++---
>>   commit-graph.h                                |  1 +
>>   t/t5318-commit-graph.sh                       |  9 +++-
>>   5 files changed, 83 insertions(+), 9 deletions(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index 16452a0504..e367aa94b1 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -31,13 +31,22 @@ and hash type.
>>
>>   All 4-byte numbers are in network order.
>>
>> +There are two versions available, 1 and 2. These currently differ only in
>> +the header.
> Shouldn't this be s/currently/ / ? Won't we add a version 3 if we make
> new changes?

When we add a new reachability index version, then the content of the 
data chunk will change. Since we have a separate byte for versioning 
that data, we do not need a v3 for the file format as a whole. A similar 
statement applies to the unused byte reserved for the incremental file 
format: we won't need to increase the file format version as we will 
make that number non-zero and add a chunk with extra data.

Thanks,
-Stolee
Ævar Arnfjörð Bjarmason March 21, 2019, 9:21 a.m. UTC | #4
On Wed, Jan 23 2019, Derrick Stolee via GitGitGadget wrote:

>  	graph_version = *(unsigned char*)(data + 4);
> -	if (graph_version != 1) {
> -		error(_("graph version %X does not match version %X"),
> -		      graph_version, 1);
> +	if (!graph_version || graph_version > 2) {
> +		error(_("unsupported graph version %X"),
> +		      graph_version);
>  		goto cleanup_fail;
>  	}

Just noticed this while writing
https://public-inbox.org/git/87va0cd1zp.fsf@evledraar.gmail.com/ i.e. to
resolve the conflict with my commit-graph segfault fixing series.

This really should be something like:

	/* earlier */
	#define GRAPH_MAX_VERSION 2

	if (!graph_version || graph_version > GRAPH_MAX_VERSION) {
		error(_("commit-graph unsupported graph version %X, we support up to %X"),
			graph_version, GRAPH_MAX_VERSION);

Also, I'm confused as to what these patches are based on. Your is doing
"!= 1", but on "master" and ever since 2a2e32bdc5 ("commit-graph:
implement git commit-graph read", 2018-04-10) this has been the macro
"GRAPH_VERSION" instead of "1".
diff mbox series

Patch

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
index 16452a0504..e367aa94b1 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -31,13 +31,22 @@  and hash type.
 
 All 4-byte numbers are in network order.
 
+There are two versions available, 1 and 2. These currently differ only in
+the header.
+
 HEADER:
 
+All commit-graph files use the first five bytes for the same purpose.
+
   4-byte signature:
       The signature is: {'C', 'G', 'P', 'H'}
 
   1-byte version number:
-      Currently, the only valid version is 1.
+      Currently, the valid version numbers are 1 and 2.
+
+The remainder of the header changes depending on the version.
+
+Version 1:
 
   1-byte Hash Version (1 = SHA-1)
       We infer the hash length (H) from this value.
@@ -47,6 +56,21 @@  HEADER:
   1-byte (reserved for later use)
      Current clients should ignore this value.
 
+Version 2:
+
+  1-byte number (C) of "chunks"
+
+  1-byte reachability index version number:
+      Currently, the only valid number is 1.
+
+  1-byte (reserved for later use)
+      Current clients expect this value to be zero, and will not
+      try to read the commit-graph file if it is non-zero.
+
+  4-byte format identifier for the hash algorithm:
+      If this identifier does not agree with the repository's current
+      hash algorithm, then the client will not read the commit graph.
+
 CHUNK LOOKUP:
 
   (C + 1) * 12 bytes listing the table of contents for the chunks:
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index b1bed84260..28787d0c9c 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -102,6 +102,11 @@  static int graph_read(int argc, const char **argv)
 		*(unsigned char*)(graph->data + 5),
 		*(unsigned char*)(graph->data + 6),
 		*(unsigned char*)(graph->data + 7));
+
+	if (*(unsigned char *)(graph->data + 4) == 2)
+		printf("hash algorithm: %X\n",
+		       get_be32(graph->data + 8));
+
 	printf("num_commits: %u\n", graph->num_commits);
 	printf("chunks:");
 
@@ -162,6 +167,10 @@  static int graph_write(int argc, const char **argv)
 	case 1:
 		flags |= COMMIT_GRAPH_VERSION_1;
 		break;
+
+	case 2:
+		flags |= COMMIT_GRAPH_VERSION_2;
+		break;
 	}
 
 	read_replace_refs = 0;
diff --git a/commit-graph.c b/commit-graph.c
index f7f45893fd..aeb6cae656 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -90,7 +90,8 @@  struct commit_graph *load_commit_graph_one(const char *graph_file)
 	uint64_t last_chunk_offset;
 	uint32_t last_chunk_id;
 	uint32_t graph_signature;
-	unsigned char graph_version, hash_version;
+	unsigned char graph_version, hash_version, reach_index_version;
+	uint32_t hash_id;
 
 	if (fd < 0)
 		return NULL;
@@ -115,9 +116,9 @@  struct commit_graph *load_commit_graph_one(const char *graph_file)
 	}
 
 	graph_version = *(unsigned char*)(data + 4);
-	if (graph_version != 1) {
-		error(_("graph version %X does not match version %X"),
-		      graph_version, 1);
+	if (!graph_version || graph_version > 2) {
+		error(_("unsupported graph version %X"),
+		      graph_version);
 		goto cleanup_fail;
 	}
 
@@ -135,6 +136,30 @@  struct commit_graph *load_commit_graph_one(const char *graph_file)
 		graph->num_chunks = *(unsigned char*)(data + 6);
 		chunk_lookup = data + 8;
 		break;
+
+	case 2:
+		graph->num_chunks = *(unsigned char *)(data + 5);
+
+		reach_index_version = *(unsigned char *)(data + 6);
+		if (reach_index_version != 1) {
+			error(_("unsupported reachability index version %d"),
+			      reach_index_version);
+			goto cleanup_fail;
+		}
+
+		if (*(unsigned char*)(data + 7)) {
+			error(_("unsupported value in commit-graph header"));
+			goto cleanup_fail;
+		}
+
+		hash_id = get_be32(data + 8);
+		if (hash_id != the_hash_algo->format_id) {
+			error(_("commit-graph hash algorithm does not match current algorithm"));
+			goto cleanup_fail;
+		}
+
+		chunk_lookup = data + 12;
+		break;
 	}
 
 	graph->hash_len = the_hash_algo->rawsz;
@@ -802,9 +827,11 @@  int write_commit_graph(const char *obj_dir,
 
 	if (flags & COMMIT_GRAPH_VERSION_1)
 		version = 1;
+	if (flags & COMMIT_GRAPH_VERSION_2)
+		version = 2;
 	if (!version)
-		version = 1;
-	if (version != 1) {
+		version = 2;
+	if (version <= 0 || version > 2) {
 		error(_("unsupported commit-graph version %d"),
 		      version);
 		return 1;
@@ -1003,6 +1030,14 @@  int write_commit_graph(const char *obj_dir,
 		hashwrite_u8(f, 0); /* unused padding byte */
 		header_size = 8;
 		break;
+
+	case 2:
+		hashwrite_u8(f, num_chunks);
+		hashwrite_u8(f, 1); /* reachability index version */
+		hashwrite_u8(f, 0); /* unused padding byte */
+		hashwrite_be32(f, the_hash_algo->format_id);
+		header_size = 12;
+		break;
 	}
 
 	chunk_ids[0] = GRAPH_CHUNKID_OIDFANOUT;
diff --git a/commit-graph.h b/commit-graph.h
index e03df54e33..050137063b 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -63,6 +63,7 @@  int generation_numbers_enabled(struct repository *r);
 #define COMMIT_GRAPH_APPEND     (1 << 0)
 #define COMMIT_GRAPH_PROGRESS   (1 << 1)
 #define COMMIT_GRAPH_VERSION_1  (1 << 2)
+#define COMMIT_GRAPH_VERSION_2  (1 << 3)
 
 int write_commit_graph_reachable(const char *obj_dir, int flags);
 int write_commit_graph(const char *obj_dir,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b79d6263e9..3ff5e3b48d 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -65,7 +65,8 @@  graph_read_expect() {
 		NUM_CHUNKS=$((3 + $(echo "$2" | wc -w)))
 	fi
 	cat >expect <<- EOF
-	header: 43475048 1 1 $NUM_CHUNKS 0
+	header: 43475048 2 $NUM_CHUNKS 1 0
+	hash algorithm: 73686131
 	num_commits: $1
 	chunks: oid_fanout oid_lookup commit_metadata$OPTIONAL
 	EOF
@@ -390,10 +391,14 @@  test_expect_success 'detect bad signature' '
 '
 
 test_expect_success 'detect bad version' '
-	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
+	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\03" \
 		"graph version"
 '
 
+test_expect_success 'detect version 2 with version 1 data' '
+	corrupt_graph_and_verify $GRAPH_BYTE_VERSION "\02" \
+		"reachability index version"
+'
 test_expect_success 'detect bad hash version' '
 	corrupt_graph_and_verify $GRAPH_BYTE_HASH "\02" \
 		"hash version"