[10/30] chunk-format: allow trailing table of contents

Message ID	78e585cf4df2bb82a2569cee226a6b97d0ea7629.1667846164.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	Return-Path: <git-owner@kernel.org> Message-Id: <78e585cf4df2bb82a2569cee226a6b97d0ea7629.1667846164.git.gitgitgadget@gmail.com> In-Reply-To: <pull.1408.git.1667846164.gitgitgadget@gmail.com> References: <pull.1408.git.1667846164.gitgitgadget@gmail.com> Date: Mon, 07 Nov 2022 18:35:44 +0000 Subject: [PATCH 10/30] chunk-format: allow trailing table of contents Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: jrnieder@gmail.com, Derrick Stolee <derrickstolee@github.com>, Derrick Stolee <derrickstolee@github.com> Precedence: bulk From: Derrick Stolee <derrickstolee@github.com>
Series	extensions.refFormat and packed-refs v2 file format
	[00/30,RFC] extensions.refFormat and packed-refs v2 file format
	[01/30] hashfile: allow skipping the hash function
	[02/30] read-cache: add index.computeHash config option
	[03/30] extensions: add refFormat extension
	[04/30] config: fix multi-level bulleted list
	[05/30] repository: wire ref extensions to ref backends
	[06/30] refs: allow loose files without packed-refs
	[07/30] chunk-format: number of chunks is optional
	[08/30] chunk-format: document trailing table of contents
	[09/30] chunk-format: store chunk offset during write
	[10/30] chunk-format: allow trailing table of contents
	[11/30] chunk-format: parse trailing table of contents
	[12/30] refs: extract packfile format to new file
	[13/30] packed-backend: extract add_write_error()
	[14/30] packed-backend: extract iterator/updates merge
	[15/30] packed-backend: create abstraction for writing refs
	[16/30] config: add config values for packed-refs v2
	[17/30] packed-backend: create shell of v2 writes
	[18/30] packed-refs: write file format version 2
	[19/30] packed-refs: read file format v2
	[20/30] packed-refs: read optional prefix chunks
	[21/30] packed-refs: write prefix chunks
	[22/30] packed-backend: create GIT_TEST_PACKED_REFS_VERSION
	[23/30] t1409: test with packed-refs v2
	[24/30] t5312: allow packed-refs v2 format
	[25/30] t5502: add PACKED_REFS_V1 prerequisite
	[26/30] t3210: require packed-refs v1 for some tests
	[27/30] t*: skip packed-refs v2 over http tests
	[28/30] ci: run GIT_TEST_PACKED_REFS_VERSION=2 in some builds
	[29/30] p1401: create performance test for ref operations
	[30/30] refs: skip hashing when writing packed-refs v2

Commit Message

Derrick Stolee Nov. 7, 2022, 6:35 p.m. UTC

From: Derrick Stolee <derrickstolee@github.com>

The existing chunk formats place the table of contents at the beginning
of the file. This is intended to speed up the initial loading of the
file, but it comes at a cost during writes: each format must compute
the exact size of every chunk in advance, which usually requires
storing the full file contents in memory.
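
To illustrate why a leading table of contents forces sizes to be known
up front, here is a minimal sketch of the offset arithmetic, modeled on
the loop this patch moves inside the non-trailing branch. The helper
name leading_toc_offsets() and the TOC_ENTRY_SIZE constant are
assumptions made for this example (a 4-byte id plus an 8-byte offset,
matching the hashwrite_be32()/hashwrite_be64() pair in the diff); they
are not part of Git.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed entry size: 4-byte chunk id + 8-byte offset. */
#define TOC_ENTRY_SIZE 12

/*
 * Hypothetical helper, not part of Git: compute the offset of each
 * chunk when the table of contents precedes the chunk data. Every
 * size in sizes[] must be known before the first byte is written.
 * Returns the offset just past the last chunk.
 */
static uint64_t leading_toc_offsets(uint64_t header_size,
				    const uint64_t *sizes, size_t nr,
				    uint64_t *offsets)
{
	/* Chunk data starts after nr entries plus one terminating entry. */
	uint64_t cur = header_size + (uint64_t)(nr + 1) * TOC_ENTRY_SIZE;
	size_t i;

	for (i = 0; i < nr; i++) {
		offsets[i] = cur;
		cur += sizes[i];
	}
	return cur;
}
```

Each offset depends on the sizes of all earlier chunks, so none of the
table can be emitted until every chunk size has been computed.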

Future file formats may want to use the chunk format API in cases where
the writing stage is critical to performance, so we may want to stream
updates from an existing file and then only write the table of contents
at the end.

Add a new 'flags' parameter to write_chunkfile() that allows this
behavior. When CHUNKFILE_TRAILING_TOC is set, the defensive check that
each chunk is written with its precomputed size is disabled. The table
of contents is then written in reverse order at the end of the
hashfile, so a parser can read the chunk list starting from the end of
the file (minus the hash).
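
The trailing layout can be sketched against an in-memory buffer. This
is an illustration under stated assumptions, not the parser from the
later patch: the names toc_entry, write_trailing_toc() and
read_trailing_toc() are hypothetical, the byte-order helpers are local
stand-ins, and TOC_ENTRY_SIZE assumes a 4-byte id plus an 8-byte
offset. The writer mirrors the order used in this patch (terminating
entry first, then chunks from last to first); the reader walks
backwards from the end of the table until it sees the zero id.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ENTRY_SIZE 12 /* assumed: 4-byte id + 8-byte offset */

struct toc_entry {
	uint32_t id;
	uint64_t offset;
};

static void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static void store_be64(unsigned char *p, uint64_t v)
{
	store_be32(p, (uint32_t)(v >> 32));
	store_be32(p + 4, (uint32_t)v);
}

static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/* Append the table at buf + pos (the end of the last chunk). */
static size_t write_trailing_toc(unsigned char *buf, size_t pos,
				 const struct toc_entry *chunks, size_t nr)
{
	uint64_t tail = pos;
	size_t i;

	/* The terminating entry is written first. */
	store_be32(buf + pos, 0);
	store_be64(buf + pos + 4, tail);
	pos += TOC_ENTRY_SIZE;

	/* Then the chunks, last to first. */
	for (i = nr; i-- > 0; ) {
		store_be32(buf + pos, chunks[i].id);
		store_be64(buf + pos + 4, chunks[i].offset);
		pos += TOC_ENTRY_SIZE;
	}
	return pos;
}

/*
 * Walk backwards from toc_end (file size minus the trailing hash).
 * Fills out[] in original chunk order and returns the chunk count,
 * or -1 if no terminating entry is found or out[] is too small.
 */
static int read_trailing_toc(const unsigned char *buf, size_t toc_end,
			     struct toc_entry *out, size_t max,
			     uint64_t *tail)
{
	size_t pos = toc_end;
	size_t nr = 0;

	while (pos >= TOC_ENTRY_SIZE) {
		uint32_t id;
		uint64_t offset;

		pos -= TOC_ENTRY_SIZE;
		id = load_be32(buf + pos);
		offset = load_be64(buf + pos + 4);
		if (!id) {
			*tail = offset;
			return (int)nr;
		}
		if (nr == max)
			return -1;
		out[nr].id = id;
		out[nr].offset = offset;
		nr++;
	}
	return -1;
}
```

Because the entries are stored last to first, a backwards walk yields
the chunks in their original order, and the reader does not need to
know the chunk count in advance.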

Parsing of the trailing table of contents will come in a later change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 chunk-format.c | 53 +++++++++++++++++++++++++++++++++++---------------
 chunk-format.h |  9 ++++++++-
 commit-graph.c |  2 +-
 midx.c         |  2 +-
 4 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/chunk-format.c b/chunk-format.c
index f1b2c8a8b36..3f5cc9b5ddf 100644
--- a/chunk-format.c
+++ b/chunk-format.c
@@ -57,26 +57,31 @@  void add_chunk(struct chunkfile *cf,
 	cf->chunks_nr++;
 }
 
-int write_chunkfile(struct chunkfile *cf, void *data)
+int write_chunkfile(struct chunkfile *cf,
+		    enum chunkfile_flags flags,
+		    void *data)
 {
 	int i, result = 0;
-	uint64_t cur_offset = hashfile_total(cf->f);
 
 	trace2_region_enter("chunkfile", "write", the_repository);
 
-	/* Add the table of contents to the current offset */
-	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
+	if (!(flags & CHUNKFILE_TRAILING_TOC)) {
+		uint64_t cur_offset = hashfile_total(cf->f);
 
-	for (i = 0; i < cf->chunks_nr; i++) {
-		hashwrite_be32(cf->f, cf->chunks[i].id);
-		hashwrite_be64(cf->f, cur_offset);
+		/* Add the table of contents to the current offset */
+		cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;
 
-		cur_offset += cf->chunks[i].size;
-	}
+		for (i = 0; i < cf->chunks_nr; i++) {
+			hashwrite_be32(cf->f, cf->chunks[i].id);
+			hashwrite_be64(cf->f, cur_offset);
 
-	/* Trailing entry marks the end of the chunks */
-	hashwrite_be32(cf->f, 0);
-	hashwrite_be64(cf->f, cur_offset);
+			cur_offset += cf->chunks[i].size;
+		}
+
+		/* Trailing entry marks the end of the chunks */
+		hashwrite_be32(cf->f, 0);
+		hashwrite_be64(cf->f, cur_offset);
+	}
 
 	for (i = 0; i < cf->chunks_nr; i++) {
 		cf->chunks[i].offset = hashfile_total(cf->f);
@@ -85,10 +90,26 @@  int write_chunkfile(struct chunkfile *cf, void *data)
 		if (result)
 			goto cleanup;
 
-		if (hashfile_total(cf->f) - cf->chunks[i].offset != cf->chunks[i].size)
-			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
-			    cf->chunks[i].size, cf->chunks[i].id,
-			    hashfile_total(cf->f) - cf->chunks[i].offset);
+		if (!(flags & CHUNKFILE_TRAILING_TOC)) {
+			if (hashfile_total(cf->f) - cf->chunks[i].offset != cf->chunks[i].size)
+				BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
+				    cf->chunks[i].size, cf->chunks[i].id,
+				    hashfile_total(cf->f) - cf->chunks[i].offset);
+		}
+
+		cf->chunks[i].size = hashfile_total(cf->f) - cf->chunks[i].offset;
+	}
+
+	if (flags & CHUNKFILE_TRAILING_TOC) {
+		size_t last_chunk_tail = hashfile_total(cf->f);
+		/* First entry marks the end of the chunks */
+		hashwrite_be32(cf->f, 0);
+		hashwrite_be64(cf->f, last_chunk_tail);
+
+		for (i = cf->chunks_nr - 1; i >= 0; i--) {
+			hashwrite_be32(cf->f, cf->chunks[i].id);
+			hashwrite_be64(cf->f, cf->chunks[i].offset);
+		}
 	}
 
 cleanup:
diff --git a/chunk-format.h b/chunk-format.h
index 7885aa08487..39e8967e950 100644
--- a/chunk-format.h
+++ b/chunk-format.h
@@ -31,7 +31,14 @@  void add_chunk(struct chunkfile *cf,
 	       uint32_t id,
 	       size_t size,
 	       chunk_write_fn fn);
-int write_chunkfile(struct chunkfile *cf, void *data);
+
+enum chunkfile_flags {
+	CHUNKFILE_TRAILING_TOC = (1 << 0),
+};
+
+int write_chunkfile(struct chunkfile *cf,
+		    enum chunkfile_flags flags,
+		    void *data);
 
 int read_table_of_contents(struct chunkfile *cf,
 			   const unsigned char *mfile,
diff --git a/commit-graph.c b/commit-graph.c
index a7d87559328..c927b81250d 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1932,7 +1932,7 @@  static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 			get_num_chunks(cf) * ctx->commits.nr);
 	}
 
-	write_chunkfile(cf, ctx);
+	write_chunkfile(cf, 0, ctx);
 
 	stop_progress(&ctx->progress);
 	strbuf_release(&progress_title);
diff --git a/midx.c b/midx.c
index 7cfad04a240..03d947a5d33 100644
--- a/midx.c
+++ b/midx.c
@@ -1510,7 +1510,7 @@  static int write_midx_internal(const char *object_dir,
 	}
 
 	write_midx_header(f, get_num_chunks(cf), ctx.nr - dropped_packs);
-	write_chunkfile(cf, &ctx);
+	write_chunkfile(cf, 0, &ctx);
 
 	finalize_hashfile(f, midx_hash, FSYNC_COMPONENT_PACK_METADATA,
 			  CSUM_FSYNC | CSUM_HASH_IN_STREAM);
