[v3,00/11,GSoC] Implement Corrected Commit Date

Message ID	pull.676.v3.git.1597509583.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <SRS0=NddV=BZ=vger.kernel.org=git-owner@kernel.org> Message-Id: <pull.676.v3.git.1597509583.gitgitgadget@gmail.com> In-Reply-To: <pull.676.v2.git.1596941624.gitgitgadget@gmail.com> References: <pull.676.v2.git.1596941624.gitgitgadget@gmail.com> From: "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> Date: Sat, 15 Aug 2020 16:39:32 +0000 Subject: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fcc: Sent To: git@vger.kernel.org Cc: Derrick Stolee <stolee@gmail.com>, Jakub =?utf-8?b?TmFyxJlic2tp?= <jnareb@gmail.com>, Taylor Blau <me@ttaylor.com>, Abhishek Kumar <abhishekkumar8222@gmail.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Implement Corrected Commit Date \| expand [v3,00/11,GSoC] Implement Corrected Commit Date [v3,01/11] commit-graph: fix regression when computing bloom filter [v3,02/11] revision: parse parent in indegree_walk_step() [v3,03/11] commit-graph: consolidate fill_commit_graph_info [v3,04/11] commit-graph: consolidate compare_commits_by_gen [v3,05/11] commit-graph: return 64-bit generation number [v3,06/11] commit-graph: add a slab to store topological levels [v3,07/11] commit-graph: implement corrected commit date [v3,08/11] commit-graph: implement generation data chunk [v3,09/11] commit-graph: use generation v2 only if entire chain does [v3,10/11] commit-reach: use corrected commit dates in paint_down_to_common() [v3,11/11] doc: add corrected commit date info

Philippe Blain via GitGitGadget Aug. 15, 2020, 4:39 p.m. UTC

This patch series implements the corrected commit date offsets as generation
number v2, along with other pre-requisites.

Git uses topological levels in the commit-graph file for commit-graph
traversal operations like git log --graph. Unfortunately, using topological
levels can result in a worse performance than without them when compared
with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
on the Linux repository walks 635,579 commits using topological levels and
walks 167,468 using committer date.

Thus, the need for generation number v2 was born. New generation number
needed to provide good performance, increment updates, and backward
compatibility. Due to an unfortunate problem, we also needed a way to
distinguish between the old and new generation number without incrementing
graph version.

Various candidates were examined (https://github.com/derrickstolee/gen-test, 
https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
number v2, Corrected Commit Date with Mononotically Increasing Offsets 
performed much worse than committer date (506,577 vs. 167,468 commits walked
for git merge-base v4.8 v4.9) and was dropped.

Using Generation Data chunk (GDAT) relieves the requirement of backward
compatibility as we would continue to store topological levels in Commit
Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
number v2. The Corrected Commit Date is defined as:

For a commit C, let its corrected commit date be the maximum of the commit
date of C and the corrected commit dates of its parents. Then corrected
commit date offset is the difference between corrected commit date of C and
commit date of C.

We will introduce an additional commit-graph chunk, Generation Data chunk,
and store corrected commit date offsets in GDAT chunk while storing
topological levels in CDAT chunk. The old versions of Git would ignore GDAT
chunk, using topological levels from CDAT chunk. In contrast, new versions
of Git would use corrected commit dates, falling back to topological level
if the generation data chunk is absent in the commit-graph file.

Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
first version.

I look forward to everyone's reviews!

Thanks

 * Abhishek


----------------------------------------------------------------------------

Changes in version 3:

 * Reordered patches as discussed in 1
   [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
 * Split "implement corrected commit date" into two patches - one
   introducing the topo level slab and other implementing corrected commit
   dates.
 * Extended split-commit-graph tests to verify at the end of test.
 * Use topological levels as generation number if any of split commit-graph
   files do not have generation data chunk.

Changes in version 2:

 * Add tests for generation data chunk.
 * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
   generation data chunk.
 * Compare commits with corrected commit dates if present in
   paint_down_to_common().
 * Update technical documentation.
 * Handle mixed graph version commit chains.
 * Improve commit messages for
 * Revert unnecessary whitespace changes.
 * Split uint_32 -> timestamp_t change into a new commit.

Abhishek Kumar (11):
  commit-graph: fix regression when computing bloom filter
  revision: parse parent in indegree_walk_step()
  commit-graph: consolidate fill_commit_graph_info
  commit-graph: consolidate compare_commits_by_gen
  commit-graph: return 64-bit generation number
  commit-graph: add a slab to store topological levels
  commit-graph: implement corrected commit date
  commit-graph: implement generation data chunk
  commit-graph: use generation v2 only if entire chain does
  commit-reach: use corrected commit dates in paint_down_to_common()
  doc: add corrected commit date info

 .../technical/commit-graph-format.txt         |  12 +-
 Documentation/technical/commit-graph.txt      |  45 ++--
 commit-graph.c                                | 241 +++++++++++++-----
 commit-graph.h                                |  16 +-
 commit-reach.c                                |  49 ++--
 commit-reach.h                                |   2 +-
 commit.c                                      |   9 +-
 commit.h                                      |   4 +-
 revision.c                                    |  13 +-
 t/README                                      |   3 +
 t/helper/test-read-graph.c                    |   2 +
 t/t4216-log-bloom.sh                          |   4 +-
 t/t5000-tar-tree.sh                           |   4 +-
 t/t5318-commit-graph.sh                       |  27 +-
 t/t5324-split-commit-graph.sh                 |  82 +++++-
 t/t6024-recursive-merge.sh                    |   4 +-
 t/t6600-test-reach.sh                         |  62 +++--
 upload-pack.c                                 |   2 +-
 18 files changed, 396 insertions(+), 185 deletions(-)


base-commit: 7814e8a05a59c0cf5fb186661d1551c75d1299b5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/676

Range-diff vs v2:

  1:  a962b9ae4b =  1:  c6b7ade7af commit-graph: fix regression when computing bloom filter
  2:  cf61239f93 =  2:  e673867234 revision: parse parent in indegree_walk_step()
  3:  32da955e31 =  3:  18d5864f81 commit-graph: consolidate fill_commit_graph_info
  4:  b254782858 !  4:  6a0cde983d commit-graph: consolidate compare_commits_by_gen
     @@ Commit message
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
          Reviewed-by: Taylor Blau <me@ttaylorr.com>
     +    Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
      @@ commit-graph.c: uint32_t commit_graph_generation(const struct commit *c)
  6:  1aa2a00a7a =  5:  6be759a954 commit-graph: return 64-bit generation number
  7:  bfe1473201 !  6:  b347dbb01b commit-graph: implement corrected commit date
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: implement corrected commit date
     +    commit-graph: add a slab to store topological levels
      
     -    With most of preparations done, let's implement corrected commit date
     -    offset. We add a new commit-slab to store topogical levels while
     -    writing commit graph and upgrade the generation member in struct
     -    commit_graph_data to a 64-bit timestamp. We store topological levels to
     -    ensure that older versions of Git will still have the performance
     -    benefits from generation number v2.
     +    As we are writing topological levels to commit data chunk to ensure
     +    backwards compatibility with "Old" Git and the member `generation` of
     +    struct commit_graph_data will store corrected commit date in a later
     +    commit, let's introduce a commit-slab to store topological levels while
     +    writing commit-graph.
     +
     +    When Git creates a split commit-graph, it takes advantage of the
     +    generation values that have been computed already and present in
     +    existing commit-graph files.
     +
     +    So, let's add a pointer to struct commit_graph to the topological level
     +    commit-slab and populate it with topological levels while writing a
     +    split commit-graph.
      
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
     @@ commit-graph.c: void git_test_write_commit_graph_or_die(void)
       /* Keep track of the order in which commits are added to our list. */
       define_commit_slab(commit_pos, int);
       static struct commit_pos commit_pos = COMMIT_SLAB_INIT(1, commit_pos);
     -@@ commit-graph.c: static int commit_gen_cmp(const void *va, const void *vb)
     - 	else if (generation_a > generation_b)
     - 		return 1;
     - 
     --	/* use date as a heuristic when generations are equal */
     --	if (a->date < b->date)
     --		return -1;
     --	else if (a->date > b->date)
     --		return 1;
     - 	return 0;
     - }
     - 
      @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
       	item->date = (timestamp_t)((date_high << 32) | date_low);
       
     - 	if (g->chunk_generation_data)
     --		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	{
     -+		graph_data->generation = item->date +
     -+			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     -+	}
     - 	else
     - 		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     + 	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     ++
     ++	if (g->topo_levels)
     ++		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
       }
     + 
     + static inline void set_commit_tree(struct commit *c, struct tree *t)
      @@ commit-graph.c: struct write_commit_graph_context {
     - 	struct progress *progress;
     - 	int progress_done;
     - 	uint64_t progress_cnt;
     -+	struct topo_level_slab *topo_levels;
     + 		 changed_paths:1,
     + 		 order_by_pack:1;
       
     - 	char *base_graph_name;
     - 	int num_commit_graphs_before;
     ++	struct topo_level_slab *topo_levels;
     + 	const struct split_commit_graph_opts *split_opts;
     + 	size_t total_bloom_filter_data_size;
     + 	const struct bloom_filter_settings *bloom_settings;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       		else
       			packedDate[0] = 0;
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       
       		packedDate[1] = htonl((*list)->date);
       		hashwrite(f, packedDate, 8);
     -@@ commit-graph.c: static int write_graph_chunk_generation_data(struct hashfile *f,
     - 	int i;
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     - 		struct commit *c = ctx->commits.list[i];
     -+		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
     - 		display_progress(ctx->progress, ++ctx->progress_cnt);
     --		hashwrite_be32(f, commit_graph_data_at(c)->generation);
     -+
     -+		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     -+			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     -+
     -+		hashwrite_be32(f, offset);
     - 	}
     - 
     - 	return 0;
      @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph_context *ctx)
       					_("Computing commit graph generation numbers"),
       					ctx->commits.nr);
       	for (i = 0; i < ctx->commits.nr; i++) {
      -		uint32_t generation = commit_graph_data_at(ctx->commits.list[i])->generation;
     -+		uint32_t topo_level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
     ++		uint32_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
       
       		display_progress(ctx->progress, i + 1);
      -		if (generation != GENERATION_NUMBER_V1_INFINITY &&
      -		    generation != GENERATION_NUMBER_ZERO)
     -+		if (topo_level != GENERATION_NUMBER_V1_INFINITY &&
     -+		    topo_level != GENERATION_NUMBER_ZERO)
     ++		if (level != GENERATION_NUMBER_V1_INFINITY &&
     ++		    level != GENERATION_NUMBER_ZERO)
       			continue;
       
       		commit_list_insert(ctx->commits.list[i], &list);
     @@ commit-graph.c: static void compute_generation_numbers(struct write_commit_graph
       			int all_parents_computed = 1;
      -			uint32_t max_generation = 0;
      +			uint32_t max_level = 0;
     -+			timestamp_t max_corrected_commit_date = current->date - 1;
       
       			for (parent = current->parents; parent; parent = parent->next) {
      -				generation = commit_graph_data_at(parent->item)->generation;
     -+				topo_level = *topo_level_slab_at(ctx->topo_levels, parent->item);
     ++				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
       
      -				if (generation == GENERATION_NUMBER_V1_INFINITY ||
      -				    generation == GENERATION_NUMBER_ZERO) {
     -+				if (topo_level == GENERATION_NUMBER_V1_INFINITY ||
     -+				    topo_level == GENERATION_NUMBER_ZERO) {
     ++				if (level == GENERATION_NUMBER_V1_INFINITY ||
     ++				    level == GENERATION_NUMBER_ZERO) {
       					all_parents_computed = 0;
       					commit_list_insert(parent->item, &list);
       					break;
      -				} else if (generation > max_generation) {
      -					max_generation = generation;
     -+				} else {
     -+					struct commit_graph_data *data = commit_graph_data_at(parent->item);
     -+
     -+					if (topo_level > max_level)
     -+						max_level = topo_level;
     -+
     -+					if (data->generation > max_corrected_commit_date)
     -+						max_corrected_commit_date = data->generation;
     ++				} else if (level > max_level) {
     ++					max_level = level;
       				}
       			}
       
       			if (all_parents_computed) {
     - 				struct commit_graph_data *data = commit_graph_data_at(current);
     - 
     +-				struct commit_graph_data *data = commit_graph_data_at(current);
     +-
      -				data->generation = max_generation + 1;
     --				pop_commit(&list);
     -+				if (max_level > GENERATION_NUMBER_MAX - 1)
     -+					max_level = GENERATION_NUMBER_MAX - 1;
     -+
     -+				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
     -+				data->generation = max_corrected_commit_date + 1;
     + 				pop_commit(&list);
       
      -				if (data->generation > GENERATION_NUMBER_MAX)
      -					data->generation = GENERATION_NUMBER_MAX;
     -+				pop_commit(&list);
     ++				if (max_level > GENERATION_NUMBER_MAX - 1)
     ++					max_level = GENERATION_NUMBER_MAX - 1;
     ++				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
       			}
       		}
       	}
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	if (!commit_graph_compatible(the_repository))
       		return 0;
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
     - 	ctx->total_bloom_filter_data_size = 0;
     - 	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
     + 		}
     + 	}
       
      +	init_topo_level_slab(&topo_levels);
      +	ctx->topo_levels = &topo_levels;
      +
     - 	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
     - 		ctx->changed_paths = 1;
     - 	if (!(flags & COMMIT_GRAPH_NO_WRITE_BLOOM_FILTERS)) {
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 	for (i = 0; i < g->num_commits; i++) {
     - 		struct commit *graph_commit, *odb_commit;
     - 		struct commit_list *graph_parents, *odb_parents;
     --		timestamp_t max_generation = 0;
     --		timestamp_t generation;
     -+		timestamp_t max_parent_corrected_commit_date = 0;
     -+		timestamp_t corrected_commit_date;
     - 
     - 		display_progress(progress, i + 1);
     - 		hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 					     oid_to_hex(&graph_parents->item->object.oid),
     - 					     oid_to_hex(&odb_parents->item->object.oid));
     - 
     --			generation = commit_graph_generation(graph_parents->item);
     --			if (generation > max_generation)
     --				max_generation = generation;
     -+			corrected_commit_date = commit_graph_generation(graph_parents->item);
     -+			if (corrected_commit_date > max_parent_corrected_commit_date)
     -+				max_parent_corrected_commit_date = corrected_commit_date;
     - 
     - 			graph_parents = graph_parents->next;
     - 			odb_parents = odb_parents->next;
     -@@ commit-graph.c: int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
     - 		if (generation_zero == GENERATION_ZERO_EXISTS)
     - 			continue;
     ++	if (ctx->r->objects->commit_graph) {
     ++		struct commit_graph *g = ctx->r->objects->commit_graph;
     ++
     ++		while (g) {
     ++			g->topo_levels = &topo_levels;
     ++			g = g->base_graph;
     ++		}
     ++	}
     ++
     + 	if (pack_indexes) {
     + 		ctx->order_by_pack = 1;
     + 		if ((res = fill_oids_from_packs(ctx, pack_indexes)))
     +
     + ## commit-graph.h ##
     +@@ commit-graph.h: struct commit_graph {
     + 	const unsigned char *chunk_bloom_indexes;
     + 	const unsigned char *chunk_bloom_data;
     + 
     ++	struct topo_level_slab *topo_levels;
     + 	struct bloom_filter_settings *bloom_filter_settings;
     + };
       
     --		/*
     --		 * If one of our parents has generation GENERATION_NUMBER_MAX, then
     --		 * our generation is also GENERATION_NUMBER_MAX. Decrement to avoid
     --		 * extra logic in the following condition.
     --		 */
     --		if (max_generation == GENERATION_NUMBER_MAX)
     --			max_generation--;
     --
     --		generation = commit_graph_generation(graph_commit);
     --		if (generation != max_generation + 1)
     --			graph_report(_("commit-graph generation for commit %s is %u != %u"),
     -+		corrected_commit_date = commit_graph_generation(graph_commit);
     -+		if (corrected_commit_date < max_parent_corrected_commit_date + 1)
     -+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
     - 				     oid_to_hex(&cur_oid),
     --				     generation,
     --				     max_generation + 1);
     -+				     corrected_commit_date,
     -+				     max_parent_corrected_commit_date + 1);
     - 
     - 		if (graph_commit->date != odb_commit->date)
     - 			graph_report(_("commit date for commit %s in commit-graph is %"PRItime" != %"PRItime),
      
       ## commit.h ##
      @@
  -:  ---------- >  7:  4074ace65b commit-graph: implement corrected commit date
  5:  cb797e20d7 !  8:  4e746628ac commit-graph: implement generation data chunk
     @@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct c
       
      -	graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
      +	if (g->chunk_generation_data)
     -+		graph_data->generation = get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     ++		graph_data->generation = item->date +
     ++			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
      +	else
      +		graph_data->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
     - }
       
     - static inline void set_commit_tree(struct commit *c, struct tree *t)
     + 	if (g->topo_levels)
     + 		*topo_level_slab_at(g->topo_levels, item) = get_be32(commit_data + g->hash_len + 8) >> 2;
      @@ commit-graph.c: struct write_commit_graph_context {
       		 report_progress:1,
       		 split:1,
     @@ commit-graph.c: struct write_commit_graph_context {
      +		 order_by_pack:1,
      +		 write_generation_data:1;
       
     + 	struct topo_level_slab *topo_levels;
       	const struct split_commit_graph_opts *split_opts;
     - 	size_t total_bloom_filter_data_size;
      @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
       	return 0;
       }
     @@ commit-graph.c: static int write_graph_chunk_data(struct hashfile *f,
      +	int i;
      +	for (i = 0; i < ctx->commits.nr; i++) {
      +		struct commit *c = ctx->commits.list[i];
     ++		timestamp_t offset = commit_graph_data_at(c)->generation - c->date;
      +		display_progress(ctx->progress, ++ctx->progress_cnt);
     -+		hashwrite_be32(f, commit_graph_data_at(c)->generation);
     ++
     ++		if (offset > GENERATION_NUMBER_V2_OFFSET_MAX)
     ++			offset = GENERATION_NUMBER_V2_OFFSET_MAX;
     ++		hashwrite_be32(f, offset);
      +	}
      +
      +	return 0;
     @@ commit-graph.c: static int write_commit_graph_file(struct write_commit_graph_con
       	chunks[2].id = GRAPH_CHUNKID_DATA;
       	chunks[2].size = (hashsz + 16) * ctx->commits.nr;
       	chunks[2].write_fn = write_graph_chunk_data;
     ++
     ++	if (git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0))
     ++		ctx->write_generation_data = 0;
      +	if (ctx->write_generation_data) {
      +		chunks[num_chunks].id = GRAPH_CHUNKID_GENERATION_DATA;
      +		chunks[num_chunks].size = sizeof(uint32_t) * ctx->commits.nr;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       	ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0;
       	ctx->split_opts = split_opts;
       	ctx->total_bloom_filter_data_size = 0;
     -+	ctx->write_generation_data = !git_env_bool(GIT_TEST_COMMIT_GRAPH_NO_GDAT, 0);
     ++	ctx->write_generation_data = 1;
       
       	if (flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS)
       		ctx->changed_paths = 1;
     @@ t/t5318-commit-graph.sh: test_expect_success 'replace-objects invalidates commit
      
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: test_expect_success 'setup repo' '
     + 	infodir=".git/objects/info" &&
       	graphdir="$infodir/commit-graphs" &&
     - 	test_oid_init &&
       	test_oid_cache <<-EOM
      -	shallow sha1:1760
      -	shallow sha256:2064
  8:  833779ad53 !  9:  5a147a9704 commit-graph: handle mixed generation commit chains
     @@ Metadata
      Author: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## Commit message ##
     -    commit-graph: handle mixed generation commit chains
     +    commit-graph: use generation v2 only if entire chain does
      
     -    As corrected commit dates and topological levels cannot be compared
     -    directly, we must handle commit graph chains with mixed generation
     -    number definitions.
     +    Since there are released versions of Git that understand generation
     +    numbers in the commit-graph's CDAT chunk but do not understand the GDAT
     +    chunk, the following scenario is possible:
      
     -    While reading a commit graph file, we disable generation numbers if the
     -    chain contains mixed generation numbers.
     +    1. "New" Git writes a commit-graph with the GDAT chunk.
     +    2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
      
     -    While writing to commit graph chain, we write generation data chunk only
     -    if the previous tip of chain had a generation data chunk. Using
     -    `--split=replace` overwrites the existing chain and writes generation
     -    data chunk regardless of previous tip.
     +    Because of the current use of inspecting the current layer for a
     +    chunk_generation_data pointer, the commits in the lower layer will be
     +    interpreted as having very large generation values (commit date plus
     +    offset) compared to the generation numbers in the top layer (topological
     +    level). This violates the expectation that the generation of a parent is
     +    strictly smaller than the generation of a child.
      
     -    In t5324-split-commit-graph, we set up a repo with twelve commits and
     -    write a base commit graph file with no generation data chunk. When add
     -    three commits and write to chain again, Git does not write generation
     -    data chunk even without setting GIT_TEST_COMMIT_GRAPH_NO_GDAT=1. Then,
     -    as we replace the existing chain, Git writes a commit graph file with
     -    generation data chunk.
     +    It is difficult to expose this issue in a test. Since we _start_ with
     +    artificially low generation numbers, any commit walk that prioritizes
     +    generation numbers will walk all of the commits with high generation
     +    number before walking the commits with low generation number. In all the
     +    cases I tried, the commit-graph layers themselves "protect" any
     +    incorrect behavior since none of the commits in the lower layer can
     +    reach the commits in the upper layer.
      
     +    This issue would manifest itself as a performance problem in this case,
     +    especially with something like "git log --graph" since the low
     +    generation numbers would cause the in-degree queue to walk all of the
     +    commits in the lower layer before allowing the topo-order queue to write
     +    anything to output (depending on the size of the upper layer).
     +
     +    Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
      
       ## commit-graph.c ##
     -@@ commit-graph.c: int generation_numbers_enabled(struct repository *r)
     - 	if (!g->num_commits)
     - 		return 0;
     +@@ commit-graph.c: static struct commit_graph *load_commit_graph_chain(struct repository *r,
     + 	return graph_chain;
     + }
       
     -+	/* We cannot compare topological levels and corrected commit dates */
     -+	while (g->base_graph) {
     -+		warning(_("commit-graph-chain contains mixed generation versions"));
     -+		if ((g->chunk_generation_data == NULL) ^ (g->base_graph->chunk_generation_data == NULL))
     -+			return 0;
     ++static void validate_mixed_generation_chain(struct repository *r)
     ++{
     ++	struct commit_graph *g = r->objects->commit_graph;
     ++	int read_generation_data = 1;
     ++
     ++	while (g) {
     ++		if (!g->chunk_generation_data) {
     ++			read_generation_data = 0;
     ++			break;
     ++		}
      +		g = g->base_graph;
      +	}
      +
     - 	first_generation = get_be32(g->chunk_commit_data +
     - 				    g->hash_len + 8) >> 2;
     ++	g = r->objects->commit_graph;
     ++
     ++	while (g) {
     ++		g->read_generation_data = read_generation_data;
     ++		g = g->base_graph;
     ++	}
     ++}
     ++
     + struct commit_graph *read_commit_graph_one(struct repository *r,
     + 					   struct object_directory *odb)
     + {
     +@@ commit-graph.c: struct commit_graph *read_commit_graph_one(struct repository *r,
     + 	if (!g)
     + 		g = load_commit_graph_chain(r, odb);
     + 
     ++	validate_mixed_generation_chain(r);
     ++
     + 	return g;
     + }
     + 
     +@@ commit-graph.c: static void fill_commit_graph_info(struct commit *item, struct commit_graph *g,
     + 	date_low = get_be32(commit_data + g->hash_len + 12);
     + 	item->date = (timestamp_t)((date_high << 32) | date_low);
       
     +-	if (g->chunk_generation_data)
     ++	if (g->chunk_generation_data && g->read_generation_data)
     + 		graph_data->generation = item->date +
     + 			(timestamp_t) get_be32(g->chunk_generation_data + sizeof(uint32_t) * lex_index);
     + 	else
     +@@ commit-graph.c: void load_commit_graph_info(struct repository *r, struct commit *item)
     + 	uint32_t pos;
     + 	if (!prepare_commit_graph(r))
     + 		return;
     ++
     + 	if (find_commit_in_graph(item, r->objects->commit_graph, &pos))
     + 		fill_commit_graph_info(item, r->objects->commit_graph, pos);
     + }
      @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       		g = ctx->r->objects->commit_graph;
     @@ commit-graph.c: int write_commit_graph(struct object_directory *odb,
       
       	ctx->approx_nr_objects = approximate_object_count();
      
     + ## commit-graph.h ##
     +@@ commit-graph.h: struct commit_graph {
     + 	struct object_directory *odb;
     + 
     + 	uint32_t num_commits_in_base;
     ++	uint32_t read_generation_data;
     + 	struct commit_graph *base_graph;
     + 
     + 	const uint32_t *chunk_oid_fanout;
     +
       ## t/t5324-split-commit-graph.sh ##
      @@ t/t5324-split-commit-graph.sh: done <<\EOF
       0600 -r--------
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +		test_commit $i &&
      +		git branch commits/$i || return 1
      +	done &&
     ++	git commit-graph write --reachable --split &&
      +	git reset --hard commits/2 &&
      +	git merge commits/4 &&
      +	git branch merge/1 &&
      +	git reset --hard commits/4 &&
      +	git merge commits/6 &&
      +	git branch merge/2 &&
     -+	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split &&
     ++	GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 3 0
     -+	num_commits: 12
     ++	header: 43475048 1 1 4 1
     ++	num_commits: 2
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     -+	test_cmp expect output
     ++	test_cmp expect output &&
     ++	git commit-graph verify
      +'
      +
      +test_expect_success 'does not write generation data chunk if not present on existing tip' '
     @@ t/t5324-split-commit-graph.sh: done <<\EOF
      +	git merge commits/5 &&
      +	git merge merge/2 &&
      +	git branch merge/3 &&
     -+	git commit-graph write --reachable --split &&
     ++	git commit-graph write --reachable --split=no-merge &&
      +	test-tool read-graph >output &&
      +	cat >expect <<-EOF &&
     -+	header: 43475048 1 1 4 1
     ++	header: 43475048 1 1 4 2
      +	num_commits: 3
      +	chunks: oid_fanout oid_lookup commit_metadata
      +	EOF
     -+	test_cmp expect output
     ++	test_cmp expect output &&
     ++	git commit-graph verify
      +'
      +
      +test_expect_success 'writes generation data chunk when commit-graph chain is replaced' '
      +	cd "$TRASH_DIRECTORY/mixed" &&
     -+	git commit-graph write --reachable --split='replace' &&
     ++	git commit-graph write --reachable --split=replace &&
      +	test_path_is_file $graphdir/commit-graph-chain &&
      +	test_line_count = 1 $graphdir/commit-graph-chain &&
      +	verify_chain_files_exist $graphdir &&
     -+	graph_read_expect 15
     ++	graph_read_expect 15 &&
     ++	git commit-graph verify
      +'
      +
       test_done
  9:  58a2d5da01 = 10:  439adc1718 commit-reach: use corrected commit dates in paint_down_to_common()
 10:  4c34294602 = 11:  f6f91af305 doc: add corrected commit date info

Jakub Narębski Aug. 17, 2020, 12:13 a.m. UTC | #1

"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

I'm not sure if this level of detail is required in the cover letter for
the series, but generation number v2 is corrected commit date; corrected
commit date offsets is how we store this value in the commit-graph file.

>
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.

I would say "committer date heuristics" instead of just "committer
date", to be more exact.

Is this data generated using https://github.com/derrickstolee/gen-test
scripts?

>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.

It would be nice to have reference email (or other place with details)
for "unfortunate problem".

>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2.

This is a bit of simplification, but good enough for a cover letter.

To be more exact, from various candidates the Corrected Commit Date was
chosen.  Then it turned out that old Git crashes on changed commit-graph
format version value, so if the generation number v2 was to replace v1
it needed to be backward-compatibile: hence the idea of Corrected Commit
Date with Monotonically Increasing Offsets.  But with GDAT chunk to
store generation number v2 (and for the time being leaving generation
number v1, i.e. Topological Levels, in CDAT), we are no longer
constrained by the requirement of backward-compatibility to make old Git
work with commit-graph file created by new Git.  So we could go back to
Corrected Commit Date, and as you wrote above the backward-compatibile
variant performs worse.

> The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents.

Actually it needs to be "corrected commit dates of its parents plus 1"
to fulfill the reachability condition for a generation number for a
commit:

      A can reach B   =>  gen(A) < gen(B)

Of course it can be computed in simpler way, because

  max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1


>                                                           Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

All right.

>
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

All right.

However I think the cover letter should also describe what should happen
in a mixed version environment (for example new Git on command line,
copy of old Git used by GUI client), and in particular what should
happen in a mixed-chain case - both for reading and for writing the
commit-graph file.

For *writing*: because old Git would create commit-graph layers without
the GDAT chunk, to simplify the behavior and make easy to reason about
commit-graph data (the situation should be not that common, and
transient -- it should get more rare as the time goes), we want the
following behavior from new Git:

- If top layer contains the GDAT chunk, or we are rewriting commit-graph
  file (--split=replace), or we are merging layers and there are no
  layers without GDAT chunk below set of layers that are merged, then

     write commit-graph file or commit-graph layer with GDAT chunk,

  otherwise

     write commit-graph layer without GDAT chunk.

  This means that there are commit-graph layers without GDAT chunk if
  and only if the top layer is also without GDAT chunk.


For *reading* we want to use generation number v2 (corrected commit
date) if possible, and fall back to generation number v1 (topological
levels).

- If the top layer contains the GDAT chunk (or maybe even if the topmost
  layer that involves all commits in question, not necessarily the top
  layer in the full commit-graph chain), then use generation number v2

  - commit_graph_data->generation stores corrected commit date,
    computed as sum of committer date (from CDAT) and offset (from GDAT)

  - A can reach B   =>  gen(A) < gen(B)

  - there is no need for committer date heuristics, and no need for
    limiting use of generation number to where there is a cutoff (to not
    hamper performance).

- If there are layers without GDAT chunks, which thanks to the write
  behavior means simply top layer without GDAT chunk, we need to turn
  off use of generation numbers or fall back to using topological levels

  - commit_graph_data->generation stores topological levels,
    taken from CDAT chunk (30-bits)

  - A can reach B   =>  gen(A) < gen(B)

  - we probably want to keep tie-breaking of sorting by generation
    number via committer date, and limit use of generation number as
    opposed to using committer date heuristics (with slop) to not make
    performance worse.

>
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> first version.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> Changes in version 3:
>
>  * Reordered patches as discussed in 1
>    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]

If I remember it correctly this was done to always store in GDAT chunk
corrected commit date offsets, isn't it?

>  * Split "implement corrected commit date" into two patches - one
>    introducing the topo level slab and other implementing corrected commit
>    dates.

All right.

I think it might be good idea to split off the change to tar file tests
(as a preparatory patch), to make reviews and bisecting easier.

>  * Extended split-commit-graph tests to verify at the end of test.

Do we also test for proper merging of split commit-graph layers, not
only adding a new layer and a full rewrite (--split=replace)?

>  * Use topological levels as generation number if any of split commit-graph
>    files do not have generation data chunk.

That is good for performance.

>
> Changes in version 2:
>
>  * Add tests for generation data chunk.

Good.

>  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
>    generation data chunk.

Good, that is needed for testing mixed-version behavior.

>  * Compare commits with corrected commit dates if present in
>    paint_down_to_common().

All right, but see the caveat.

>  * Update technical documentation.

Always a good thing.

>  * Handle mixed graph version commit chains.

Where by "version" you mean generation number version - the commit-graph
version number unfortunately needs to stay the same...

>  * Improve commit messages for
                                ^^^^^^
Something missing in this point, the sentence ends abruptly.

>  * Revert unnecessary whitespace changes.

Thanks.

>  * Split uint_32 -> timestamp_t change into a new commit.

It is usually better to keep the commits small.  Good.


Good work!

>
> Abhishek Kumar (11):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: return 64-bit generation number
>   commit-graph: add a slab to store topological levels
>   commit-graph: implement corrected commit date
>   commit-graph: implement generation data chunk
>   commit-graph: use generation v2 only if entire chain does
>   commit-reach: use corrected commit dates in paint_down_to_common()
>   doc: add corrected commit date info
>
>  .../technical/commit-graph-format.txt         |  12 +-
>  Documentation/technical/commit-graph.txt      |  45 ++--
>  commit-graph.c                                | 241 +++++++++++++-----
>  commit-graph.h                                |  16 +-
>  commit-reach.c                                |  49 ++--
>  commit-reach.h                                |   2 +-
>  commit.c                                      |   9 +-
>  commit.h                                      |   4 +-
>  revision.c                                    |  13 +-
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   2 +
>  t/t4216-log-bloom.sh                          |   4 +-
>  t/t5000-tar-tree.sh                           |   4 +-
>  t/t5318-commit-graph.sh                       |  27 +-
>  t/t5324-split-commit-graph.sh                 |  82 +++++-
>  t/t6024-recursive-merge.sh                    |   4 +-
>  t/t6600-test-reach.sh                         |  62 +++--
>  upload-pack.c                                 |   2 +-
>  18 files changed, 396 insertions(+), 185 deletions(-)
>
>
> base-commit: 7814e8a05a59c0cf5fb186661d1551c75d1299b5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
[...]

Best,
--
Jakub Narębski

Taylor Blau Aug. 17, 2020, 1:32 a.m. UTC | #2

On Mon, Aug 17, 2020 at 02:16:08AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
>
> All right.
>
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
>
> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
>
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
>
>      write commit-graph file or commit-graph layer with GDAT chunk,
>
>   otherwise
>
>      write commit-graph layer without GDAT chunk.
>
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.

This seems very sane to me, and I'd be glad to see it spelled out in
more specific detail. I was wondering this myself, and had to double
check with Stolee off-list that my interpretation of Abhishek's code was
correct.

But yes, only writing GDAT chunks when all layers in the chain have GDAT
chunks makes sense, since we can't interoperate between corrected dates
and topological levels. Since we can't fill in the GDAT data of layers
generated in pre-GDAT versions of Git without invalidating the GDAT
layers on-disk, there's no point to speculatively computing both chunks.

Merging rules are obviously correct, which is good. For what it's worth,
the '--split=replace' case is what we'll really care about at GitHub,
since it's unlikely we'd drop all existing commit-graph chains and
rebuild them from scratch. More likely is that we'll let the new GDAT
chunks trickle in over time when we run 'git commit-graph write' with
'--split=replace', which happens "every so often".

> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
>
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2

I don't follow this. If we have a multi-layer chain, either all or none
of the layers have a GDAT chunk. So, "if the top layer contains the GDAT
chunk" makes sense, since it implies that all layers have the GDAT
chunk. I don't see how "even if the topmost layer that involves all
commits in question" would be possible, since (if I'm understanding your
description correctly), we can't have *some* of the layers having a GDAT
chunk with others only having a CDAT chunk.

I'm a little confused here.

>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
>
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels

Good, I'm glad that this can be a quick check (that we can cache for
future reads, but I'm not even sure the caching would be necessary
without measuring).
>
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.

All makes very good sense, except for the one point I raised above.

> >
> > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > first version.

Thanks, Abhishek for your great work on this. I was feeling bad that I
wasn't more involved in the early discussions about the transition plan,
but what you, Stolee, and Jakub came up with all seems like what I would
have suggested, anyway ;-).

> Jakub Narebski

Thanks,
Taylor

Jakub Narębski Aug. 17, 2020, 7:56 a.m. UTC | #3

On Mon, 17 Aug 2020 at 03:32, Taylor Blau <me@ttaylorr.com> wrote:
> On Mon, Aug 17, 2020 at 02:16:08AM +0200, Jakub Narębski wrote:
> > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > >
> > > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > > and store corrected commit date offsets in GDAT chunk while storing
> > > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > > of Git would use corrected commit dates, falling back to topological level
> > > if the generation data chunk is absent in the commit-graph file.
> >
> > All right.
> >
> > However I think the cover letter should also describe what should happen
> > in a mixed version environment (for example new Git on command line,
> > copy of old Git used by GUI client), and in particular what should
> > happen in a mixed-chain case - both for reading and for writing the
> > commit-graph file.
> >
> > For *writing*: because old Git would create commit-graph layers without
> > the GDAT chunk, to simplify the behavior and make easy to reason about
> > commit-graph data (the situation should be not that common, and
> > transient -- it should get more rare as the time goes), we want the
> > following behavior from new Git:
> >
> > - If top layer contains the GDAT chunk, or we are rewriting commit-graph
> >   file (--split=replace), or we are merging layers and there are no
> >   layers without GDAT chunk below set of layers that are merged, then
> >
> >      write commit-graph file or commit-graph layer with GDAT chunk,
> >
> >   otherwise
> >
> >      write commit-graph layer without GDAT chunk.
> >
> >   This means that there are commit-graph layers without GDAT chunk if
> >   and only if the top layer is also without GDAT chunk.
>
> This seems very sane to me, and I'd be glad to see it spelled out in
> more specific detail. I was wondering this myself, and had to double
> check with Stolee off-list that my interpretation of Abhishek's code was
> correct.
>
> But yes, only writing GDAT chunks when all layers in the chain have GDAT
> chunks makes sense, since we can't interoperate between corrected dates
> and topological levels. Since we can't fill in the GDAT data of layers
> generated in pre-GDAT versions of Git without invalidating the GDAT
> layers on-disk, there's no point to speculatively computing both chunks.
>
> Merging rules are obviously correct, which is good. For what it's worth,
> the '--split=replace' case is what we'll really care about at GitHub,
> since it's unlikely we'd drop all existing commit-graph chains and
> rebuild them from scratch. More likely is that we'll let the new GDAT
> chunks trickle in over time when we run 'git commit-graph write' with
> '--split=replace', which happens "every so often".

To be more detailed, without '--split=replace' we would want the following
layer merging behavior:

   [layer with GDAT][with GDAT][without GDAT][without GDAT][without GDAT]

In the split commit-graph chain above, merging two topmost layers
should create a layer without GDAT; merging three topmost layers
(and any other layers, e.g. two middle ones) should create a layer
 with GDAT.

> > For *reading* we want to use generation number v2 (corrected commit
> > date) if possible, and fall back to generation number v1 (topological
> > levels).
> >
> > - If the top layer contains the GDAT chunk (or maybe even if the topmost
> >   layer that involves all commits in question, not necessarily the top
> >   layer in the full commit-graph chain), then use generation number v2
>
> I don't follow this. If we have a multi-layer chain, either all or none
> of the layers have a GDAT chunk. So, "if the top layer contains the GDAT
> chunk" makes sense, since it implies that all layers have the GDAT
> chunk. I don't see how "even if the topmost layer that involves all
> commits in question" would be possible, since (if I'm understanding your
> description correctly), we can't have *some* of the layers having a GDAT
> chunk with others only having a CDAT chunk.
>
> I'm a little confused here.

This is only speculative, and most probably totally unnecessary
complication (either that, or something that we would get for free).
Assume that the command in question operates only on
historical data; for example `git log --topo-order HEAD~1000`.
If all commits (or, what's equivalent, most recent commits
i.e. HEAD~1000) have their data in split commit-graph layers
with GDAT, we can theoretically use generation number v2,
even if there are some newer commits that have their data
in layers without GDAT (and some even newer ones outside
commit-graph files).

I hope that this explains my (possibly harebrained) idea.

> >   - commit_graph_data->generation stores corrected commit date,
> >     computed as sum of committer date (from CDAT) and offset (from GDAT)
> >
> >   - A can reach B   =>  gen(A) < gen(B)
> >
> >   - there is no need for committer date heuristics, and no need for
> >     limiting use of generation number to where there is a cutoff (to not
> >     hamper performance).
> >
> > - If there are layers without GDAT chunks, which thanks to the write
> >   behavior means simply top layer without GDAT chunk, we need to turn
> >   off use of generation numbers or fall back to using topological levels
>
> Good, I'm glad that this can be a quick check (that we can cache for
> future reads, but I'm not even sure the caching would be necessary
> without measuring).

There is a question where to store the information that we cannot
use generation number v2 (that 'generation' contains topological
levels and not corrected commit date):
- create new global variable
- store it in `struct split_commit_graph_opts`
- set `chunk_generation_data` to NULL for all graphs
  in chain (it is in `struct commit_graph`)?

> >
> >   - commit_graph_data->generation stores topological levels,
> >     taken from CDAT chunk (30-bits)
> >
> >   - A can reach B   =>  gen(A) < gen(B)
> >
> >   - we probably want to keep tie-breaking of sorting by generation
> >     number via committer date, and limit use of generation number as
> >     opposed to using committer date heuristics (with slop) to not make
> >     performance worse.
>
> All makes very good sense, except for the one point I raised above.
>
> > >
> > > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > > first version.
>
> Thanks, Abhishek for your great work on this. I was feeling bad that I
> wasn't more involved in the early discussions about the transition plan,
> but what you, Stolee, and Jakub came up with all seems like what I would
> have suggested, anyway ;-).

Thank you for your work on improving this feature.

Best,

Abhishek Kumar Aug. 18, 2020, 6:12 a.m. UTC | #4

On Mon, Aug 17, 2020 at 02:13:18AM +0200, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> I'm not sure if this level of detail is required in the cover letter for
> the series, but generation number v2 is corrected commit date; corrected
> commit date offsets is how we store this value in the commit-graph file.
> 
> >
> > Git uses topological levels in the commit-graph file for commit-graph
> > traversal operations like git log --graph. Unfortunately, using topological
> > levels can result in a worse performance than without them when compared
> > with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> > on the Linux repository walks 635,579 commits using topological levels and
> > walks 167,468 using committer date.
> 
> I would say "committer date heuristics" instead of just "committer
> date", to be more exact.
> 
> Is this data generated using https://github.com/derrickstolee/gen-test
> scripts?
> 

Yes, it is.

> >
> > Thus, the need for generation number v2 was born. New generation number
> > needed to provide good performance, increment updates, and backward
> > compatibility. Due to an unfortunate problem, we also needed a way to
> > distinguish between the old and new generation number without incrementing
> > graph version.
> 
> It would be nice to have reference email (or other place with details)
> for "unfortunate problem".
> 

Will add.

> >
> > Various candidates were examined (https://github.com/derrickstolee/gen-test,
> > https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> > number v2, Corrected Commit Date with Mononotically Increasing Offsets
> > performed much worse than committer date (506,577 vs. 167,468 commits walked
> > for git merge-base v4.8 v4.9) and was dropped.
> >
> > Using Generation Data chunk (GDAT) relieves the requirement of backward
> > compatibility as we would continue to store topological levels in Commit
> > Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> > number v2.
> 
> This is a bit of simplification, but good enough for a cover letter.
> 
> To be more exact, from various candidates the Corrected Commit Date was
> chosen.  Then it turned out that old Git crashes on changed commit-graph
> format version value, so if the generation number v2 was to replace v1
> it needed to be backward-compatibile: hence the idea of Corrected Commit
> Date with Monotonically Increasing Offsets.  But with GDAT chunk to
> store generation number v2 (and for the time being leaving generation
> number v1, i.e. Topological Levels, in CDAT), we are no longer
> constrained by the requirement of backward-compatibility to make old Git
> work with commit-graph file created by new Git.  So we could go back to
> Corrected Commit Date, and as you wrote above the backward-compatibile
> variant performs worse.
> 
> > The Corrected Commit Date is defined as:
> >
> > For a commit C, let its corrected commit date be the maximum of the commit
> > date of C and the corrected commit dates of its parents.
> 
> Actually it needs to be "corrected commit dates of its parents plus 1"
> to fulfill the reachability condition for a generation number for a
> commit:
> 
>       A can reach B   =>  gen(A) < gen(B)
> 
> Of course it can be computed in simpler way, because
> 
>   max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1
> 
> 
> >                                                           Then corrected
> > commit date offset is the difference between corrected commit date of C and
> > commit date of C.
> 
> All right.
> 
> >
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
> 
> All right.
> 
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
> 

Yes, definitely. Will add

> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
> 
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
> 
>      write commit-graph file or commit-graph layer with GDAT chunk,
> 
>   otherwise
> 
>      write commit-graph layer without GDAT chunk.
> 
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.
> 
> 
> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
> 
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2
> 

The current implementation checks the entire chain for GDAT, rather than
just the topmost layer as we cannot assert that `g` would be the topmost
layer of the chain.

See the discussion here: https://lore.kernel.org/git/20200814045957.GA1380@Abhishek-Arch/

It's one of drawbacks of having a single member 64-bit `generation`
instead of two 32-bit members `level` and `odate`.

>
>
>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)
> 
>   - A can reach B   =>  gen(A) < gen(B)
> 
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
> 
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels
> 
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
> 
>   - A can reach B   =>  gen(A) < gen(B)
> 
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.
> 
> >
> > Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> > first version.
> >
> > I look forward to everyone's reviews!
> >
> > Thanks
> >
> >  * Abhishek
> >
> >
> > ----------------------------------------------------------------------------
> >
> > Changes in version 3:
> >
> >  * Reordered patches as discussed in 1
> >    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@gmail.com/]
> 
> If I remember it correctly this was done to always store in GDAT chunk
> corrected commit date offsets, isn't it?
> 

Yes.

> >  * Split "implement corrected commit date" into two patches - one
> >    introducing the topo level slab and other implementing corrected commit
> >    dates.
> 
> All right.
> 
> I think it might be good idea to split off the change to tar file tests
> (as a preparatory patch), to make reviews and bisecting easier.
> 
> >  * Extended split-commit-graph tests to verify at the end of test.
> 
> Do we also test for proper merging of split commit-graph layers, not
> only adding a new layer and a full rewrite (--split=replace)?
> 

We do not, will add a test at end. Thanks for pointing this out.

> >  * Use topological levels as generation number if any of split commit-graph
> >    files do not have generation data chunk.
> 
> That is good for performance.
> 
> >
> > Changes in version 2:
> >
> >  * Add tests for generation data chunk.
> 
> Good.
> 
> >  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
> >    generation data chunk.
> 
> Good, that is needed for testing mixed-version behavior.
> 
> >  * Compare commits with corrected commit dates if present in
> >    paint_down_to_common().
> 
> All right, but see the caveat.
> 
> >  * Update technical documentation.
> 
> Always a good thing.
> 
> >  * Handle mixed graph version commit chains.
> 
> Where by "version" you mean generation number version - the commit-graph
> version number unfortunately needs to stay the same...
> 

Yes, clarified.

> >  * Improve commit messages for
>                                 ^^^^^^
> Something missing in this point, the sentence ends abruptly.

I didn't finish the sentence. Meant to say:

- Improve commit messages for "commit-graph: fix regression when computing bloom filter", "commit-graph: consolidate fill_commit_graph_info",
> 
> >  * Revert unnecessary whitespace changes.
> 
> Thanks.
> 
> >  * Split uint_32 -> timestamp_t change into a new commit.
> 
> It is usually better to keep the commits small.  Good.
> 
> 
> Good work!
> 
> ...
> --
> Jakub Narębski

Jakub Narębski Aug. 23, 2020, 3:27 p.m. UTC | #5

Hello,

Here is a summary of my comments and thoughts after carefully reviewing
all patches in the series.

Jakub Narębski <jnareb@gmail.com> writes:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
[...]
>> The Corrected Commit Date is defined as:
>>
>> For a commit C, let its corrected commit date be the maximum of the commit
>> date of C and the corrected commit dates of its parents.
>
> Actually it needs to be "corrected commit dates of its parents plus 1"
> to fulfill the reachability condition for a generation number for a
> commit:
>
>       A can reach B   =>  gen(A) < gen(B)
>
> Of course it can be computed in simpler way, because
>
>   max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1

I see that it is defined correctly in the documentation, which is more
important than cover letter (which would not be stored in the repository
for memory, even in the commit message for the merge).

[...]
> However I think the cover letter should also describe what should happen
> in a mixed version environment (for example new Git on command line,
> copy of old Git used by GUI client), and in particular what should
> happen in a mixed-chain case - both for reading and for writing the
> commit-graph file.
>
> For *writing*: because old Git would create commit-graph layers without
> the GDAT chunk, to simplify the behavior and make easy to reason about
> commit-graph data (the situation should be not that common, and
> transient -- it should get more rare as the time goes), we want the
> following behavior from new Git:
>
> - If top layer contains the GDAT chunk, or we are rewriting commit-graph
>   file (--split=replace), or we are merging layers and there are no
>   layers without GDAT chunk below set of layers that are merged, then
>
>      write commit-graph file or commit-graph layer with GDAT chunk,

Actually we can simplify handling of merging layers, while still
retaining the property that in mixed-version chain all GDAT-full layers
are before / below GDAT-less layers.

Namely, if merging layers, and at least one layer being merged doesn't
have GDAT chunk, then the result of merge wouldn't have either.

We can always switch to slightly more complicated behavior proposed
above in quoted part later, perhaps as a followup commit.

>
>   otherwise
>
>      write commit-graph layer without GDAT chunk.
>
>   This means that there are commit-graph layers without GDAT chunk if
>   and only if the top layer is also without GDAT chunk.

This might be not necessary if Git would always check the whole chain of
split commit-graph layers for presence of GDAT-less layers.

But I still think it is a good idea to avoid having GDAT-less "holes".

>
> For *reading* we want to use generation number v2 (corrected commit
> date) if possible, and fall back to generation number v1 (topological
> levels).
>
> - If the top layer contains the GDAT chunk (or maybe even if the topmost
>   layer that involves all commits in question, not necessarily the top
>   layer in the full commit-graph chain), then use generation number v2
>
>   - commit_graph_data->generation stores corrected commit date,
>     computed as sum of committer date (from CDAT) and offset (from GDAT)

Or stored directly in GDAT, at the cost of increasing the file size by
at most 7% (if I have done my math correctly).

See also the issue with clamping offsets (GENERATION_NUMBER_V2_OFFSET_MAX).

>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - there is no need for committer date heuristics, and no need for
>     limiting use of generation number to where there is a cutoff (to not
>     hamper performance).
>
> - If there are layers without GDAT chunks, which thanks to the write
>   behavior means simply top layer without GDAT chunk, we need to turn
>   off use of generation numbers or fall back to using topological levels
>
>   - commit_graph_data->generation stores topological levels,
>     taken from CDAT chunk (30-bits)
>
>   - A can reach B   =>  gen(A) < gen(B)
>
>   - we probably want to keep tie-breaking of sorting by generation
>     number via committer date, and limit use of generation number as
>     opposed to using committer date heuristics (with slop) to not make
>     performance worse.

And this is being done in this patch series.  Good!

The thing I was worrying about turned out to be non-issue, as the
comparison function in question is used only when writing, and in this
case we have corrected commit date computer - though perhaps not being
written (as far as I understand it, but I might be mistaken).

[...]
>>
>> Abhishek Kumar (11):
>>   commit-graph: fix regression when computing bloom filter

No problems, maybe just expanding a commit message and/or adding a comment.

>>   revision: parse parent in indegree_walk_step()

Looks good to me.

>>   commit-graph: consolidate fill_commit_graph_info

I think this commit could be split into three:
- fix to the 'generate tar with future mtime' test
  as it is a hidden bug (when using commit-graph)
- simplify implementation of fill_commit_in_graph()
  by using fill_commit_graph_info()
- move loading date into fill_commit_graph_info()
  that uncovers the issue with 'generate tar with future mtime'

On the other hand because they are inter-related, those changes might be
kept in a single commit.

In commit message greater care needs to be taken with
fill_commit_graph_info() and fill_commit_in_graph(), when to use one and
when to use the other. For example it is fill_commit_graph_info() that
changes its behavior, and it is fill_commit_in_graph() that is getting
simplified.

>>   commit-graph: consolidate compare_commits_by_gen

Looks good to me, though it might be good idea to add comments about the
sorting order (inside comment) to appropriate header files.

>>   commit-graph: return 64-bit generation number

This conversion misses one place, though it would be changed to use
topological levels slab in next patch.

This patch also unnecessary introduces GENERATION_NUMBER_V1_INFINITY.
There is no need for it: `generation` field can always simply use
GENERATION_NUMBER_INFINITY for commits not in commit-graph.

>>   commit-graph: add a slab to store topological levels

This is the patch that needs GENERATION_NUMBER_V1_INFINITY (or
TOPOLOGICAL_LEVEL_INFINITY), not the previous patch.

Detailed analysis uncovered hidden bug in the code of
compute_generation_numbers() that works only because of historical
reasons (that topological levels in Git start from 1, not from 0). The
problem is that the 'level' / 'generation' variable for commits not in
graph, and therefore ones that needs to have its generation number
computed, is 0 (default value on commit slab) and not
GENERATION_NUMBER*_INFINITY as it should.

This issue is present since moving commit graph info data to
commit-slab.  We can simply document it and ignore (it works, if by
accident), or try to fix it.

We need to handle GENERATION_NUMBER*_MAX clamping carefully
in the future patches.

I think also that this patch needs a bit more detailed commit message.

>>   commit-graph: implement corrected commit date

Looks good, though verify_commit_graph() now verifies *a* generation
number used, be it v1 (topological level) or v2 (corrected commit date),
so the variable rename is unnecessary.  We verify that they fulfill the
reachability condition promise, that is gen(parent) <= gen(commit),
(the '=' is to include GENERATION_NUMBER*_MAX case), as it is what is
used to speed up graph traversal.

We probably want to verify both topological level values in CDAT, and if
they exist also corrected commit date values in GDAT.  But that might be
left for the future commit.

>>   commit-graph: implement generation data chunk

To save up to 6-7% on commit-graph file size we store 32-bits corrected
commit date offsets, instead of storing 64-bits corrected commit date.

However, as far as I understand it, using non-monotonic values for
on-disk storage with limited field size, like 32-bits corrected
commit date offset, leads to problems with GENERATION_NUMBER*_MAX.
Namely, as I have written in detail in my reply to patch 11/11 in this
series, there is no way to fulfill the reachability condition promise if
we have to store offset which true value do not fit in 32-bits reserved
for it.

This is extremly unlikely to happen in practice, but we need to be able
to handle it somehow.  We can store 64-bit corrected commit date, which
has graph-monotonic values, and the problem goes away in theory and in
practice (we would never have values that do not fit).  We can keep
storing 32-bit offsets, and simply do not use GDAT chunk if there is
offset value that do not fit.

All this of course, provided that I am not wrong about this issue...

>>   commit-graph: use generation v2 only if entire chain does

The commit message do not say anything about the *writing* side.

However, if we want to keep the requirement that GDAT-less layers in the
split commit-graph chain must all be above any GDAT-containing layers,
we need to consider how we want layer merging to behave.  We could
either opt for using GDAT whenever possible, or similify code and skip
using GDAT if we are unsure.

The first approach would mean that if the topmost layer below set of
layers being rewritten (in the split commit-graph chain) exists, and it
does not contain GDAT chunk, then and only then the result of rewrite
should not have GDAT chunk either.

The second approach is, I think, simpler: if any of layers that is being
rewritten is GDAT-less (we need to check only the top layer, though),
and we are not running full rewrite ('--split=replace'), then the result
of rewrite should not have GDAT chunk either.  We can switch to the
first algorithm in later commit.

Whether we choose one or the other, we need test that doing layer
merging do not break GDAT-inness requirement stated above.

Also, we can probably test that we are not using v1 and v2 at the same
time with tests involving --is-ancestor, or --contains / --merged.

>>   commit-reach: use corrected commit dates in paint_down_to_common()

I think this commit could be split into two:
- disable commit graph for entire t6024-recursive-merge test
- use corrected commit dates in paint_down_to_common()

On the other hand because they are inter-related, those changes might be
better kept in a single commit.

It would be nice to have some benchmark data, or at least stating that
performance does not change (within the error bounds), using for example
'git merge-base v4.8 v4.9' in Linux kernel repository.

>>   doc: add corrected commit date info

Looks good, there are a few places where 'generation number' (referring
to the v1 version) should have been replaced with 'topological level'.

I am also unsure how the GDAT chunk can be "later modified to store
future generation number related data.".  I'd like to have an example,
or for this statement to be removed (if it turns out to not be true, not
without introducing yet another chunk type).

>>  18 files changed, 396 insertions(+), 185 deletions(-)

Thanks for all your work.

Best regards,
--
Jakub Narębski

Abhishek Kumar Aug. 24, 2020, 2:49 a.m. UTC | #6

Hello,

On Sun, Aug 23, 2020 at 05:27:46PM +0200, Jakub Narębski wrote:
> Hello,
> 
> Here is a summary of my comments and thoughts after carefully reviewing
> all patches in the series.
> 

Thanks for the detailed review. It must have taken you a lot of time and
focus to go through the patches, so I have really appreciate the effort.

I have been going over the comments and they are reasonable and very
much needed. I will respond to the mails in the specific threads (to
keep the discussion locally scoped and thus manageable), with doubts
and comments that I have as I try to follow through the suggestions.

> Best regards,
> --
> Jakub Narębski

Thanks
- Abhishek

[v3,00/11,GSoC] Implement Corrected Commit Date

Message

Comments