diff mbox series

[v4,09/10] commit-reach: use corrected commit dates in paint_down_to_common()

Message ID bb9b02af32d028fc0c26d372aa490e260c74e74d.1602079786.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Implement Corrected Commit Date | expand

Commit Message

Elijah Newren via GitGitGadget Oct. 7, 2020, 2:09 p.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With corrected commit dates implemented, we no longer have to rely on
commit date as a heuristic in paint_down_to_common().

While using corrected commit dates Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

For example, the command `git merge-base v4.8 v4.9` on the linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

t6404-recursive-merge setups a unique repository where all commits have
the same committer date without well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form a intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form a intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c             | 14 ++++++++++++++
 commit-graph.h             |  8 +++++++-
 commit-reach.c             |  2 +-
 t/t6404-recursive-merge.sh |  5 ++++-
 4 files changed, 26 insertions(+), 3 deletions(-)

Comments

Jakub Narębski Nov. 3, 2020, 5:59 p.m. UTC | #1
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With corrected commit dates implemented, we no longer have to rely on
> commit date as a heuristic in paint_down_to_common().
>
> While using corrected commit dates Git walks nearly the same number of
> commits as commit date, the process is slower as for each comparision we
> have to access a commit-slab (for corrected committer date) instead of
> accessing struct member (for committer date).

Something for the future: I wonder if it would be worth it to bring back
generation number from the commit-slab into `struct commit`.

>
> For example, the command `git merge-base v4.8 v4.9` on the linux
> repository walks 167468 commits, taking 0.135s for committer date and
> 167496 commits, taking 0.157s for corrected committer date respectively.

I think it would be good idea to explicitly refer to the commit that
changed paint_down_to_common() to *not* use generation numbers v1
(topological levels) in the cases such as this, namely 091f4cf3 (commit:
don't use generation numbers if not needed).  In this commit we have the
following:

  This change makes a concrete difference depending on the topology
  of the commit graph. For instance, computing the merge-base between
  consecutive versions of the Linux kernel has no effect for versions
  after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
  regression:

      v2.18.0: 0.122s
  v2.19.0-rc1: 0.547s
         HEAD: 0.127s

  To determine that this was simply an ordering issue, I inserted
  a counter within the while loop of paint_down_to_common() and
  found that the loop runs 167,468 times in v2.18.0 and 635,579
  times in v2.19.0-rc1.

The times you report (0.135s and 0.157s) are close to 0.122s / 0.127s
reported in 091f4cf3 - that is most probably because of the differences
in the system performance (hardware, operating system, load, etc.).
Numbers of commits walked for the committed date heuristics, that is
167,468 agrees with your results; 167,496 (+28) for corrected commit
date (generation number v2) is significantly smaller (-468,083) than
635,579 reported for topological levels (generation number v1).

I suspect that there are cases (with date skew) where corrected commit
date gives better performance than committer date heuristics, and I am
quite sure that generation number v2 can give better performance in case
where paint_down_to_common() uses generation numbers.

.................................................................

Here begins separate second change, which is not put into separate
commit because it is fairly tightly connected to the change described
above.  It would be good idea, in my opinion, to add a sentence that
explicitely marks this switch, for example:

  This change accidentally broke fragile t6404-recursive-merge test.
  t6404-recursive-merge setups a unique repository...

Maybe with s/accidentaly/incidentally/.

Or add some other way of connection those two parts of the commit
messages.

> t6404-recursive-merge setups a unique repository where all commits have
> the same committer date without well-defined merge-base.
>
> While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
> date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
> conflicts' merges commits in the order:
> - Merge C with B to form a intermediate commit.
> - Merge the intermediate commit with A.
>
> With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
> use the corrected committer date, which changes the order in which
> commits are merged:
> - Merge A with B to form a intermediate commit.
> - Merge the intermediate commit with C.
>
> While resulting repositories are equivalent, 6404.4 'virtual trees were
> processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
> different merge-bases and thus have different object ids for the
> intermediate commits.
>
> As this has already causes problems (as noted in 859fdc0 (commit-graph:
> define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
> within t6404-recursive-merge.

Very nice explanation.

Perhaps in the future we could make this test less fragile.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> ---
>  commit-graph.c             | 14 ++++++++++++++
>  commit-graph.h             |  8 +++++++-
>  commit-reach.c             |  2 +-
>  t/t6404-recursive-merge.sh |  5 ++++-
>  4 files changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index 5d15a1399b..3de1933ede 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -705,6 +705,20 @@ int generation_numbers_enabled(struct repository *r)
>  	return !!first_generation;
>  }
>  
> +int corrected_commit_dates_enabled(struct repository *r)
> +{
> +	struct commit_graph *g;
> +	if (!prepare_commit_graph(r))
> +		return 0;
> +
> +	g = r->objects->commit_graph;
> +
> +	if (!g->num_commits)
> +		return 0;
> +
> +	return g->read_generation_data;
> +}

Very nice abstraction.

Minor issue: I wonder if it would be better to use _available() or
"_present()" rather than _enabled() suffix.

> +
>  struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
>  {
>  	struct commit_graph *g = r->objects->commit_graph;
> diff --git a/commit-graph.h b/commit-graph.h
> index ad52130883..d2c048dc64 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
>  struct commit_graph *parse_commit_graph(struct repository *r,
>  					void *graph_map, size_t graph_size);
>  
> +struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> +
>  /*
>   * Return 1 if and only if the repository has a commit-graph
>   * file and generation numbers are computed in that file.
>   */
>  int generation_numbers_enabled(struct repository *r);
>  
> -struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);

This moving get_bloom_filter_settings() before generation_numbers_enabled() 
looks like accidental change.  If not, why it is here?

> +/*
> + * Return 1 if and only if the repository has a commit-graph
> + * file and generation data chunk has been written for the file.
> + */
> +int corrected_commit_dates_enabled(struct repository *r);
>

All right, nice to have documentation for the public function.

>  enum commit_graph_write_flags {
>  	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
> diff --git a/commit-reach.c b/commit-reach.c
> index 20b48b872b..46f5a9e638 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -39,7 +39,7 @@ static struct commit_list *paint_down_to_common(struct repository *r,
>  	int i;
>  	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
>  
> -	if (!min_generation)
> +	if (!min_generation && !corrected_commit_dates_enabled(r))
>  		queue.compare = compare_commits_by_commit_date;
>  
>  	one->object.flags |= PARENT1;

All right, this is the meat of the first change.

> diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
> index 332cfc53fd..7055771b62 100755
> --- a/t/t6404-recursive-merge.sh
> +++ b/t/t6404-recursive-merge.sh
> @@ -15,6 +15,8 @@ GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
>  export GIT_COMMITTER_DATE
>  
>  test_expect_success 'setup tests' '
> +	GIT_TEST_COMMIT_GRAPH=0 &&
> +	export GIT_TEST_COMMIT_GRAPH &&
>  	echo 1 >a1 &&
>  	git add a1 &&
>  	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&

All right, we turn off running this test with commit-graph for the whole
script, not only for a single test.  As this is a setup, it would be run
even if we are skipping some tests.

> @@ -66,7 +68,7 @@ test_expect_success 'setup tests' '
>  '
>  
>  test_expect_success 'combined merge conflicts' '
> -	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
> +	test_must_fail git merge -m final G
>  '

All right, it is no longer necessary to run this specific test with
GIT_TEST_COMMIT_GRAPH=0 as now the whole script is run with this
setting.

>  
>  test_expect_success 'result contains a conflict' '
> @@ -82,6 +84,7 @@ test_expect_success 'result contains a conflict' '
>  '
>  
>  test_expect_success 'virtual trees were processed' '
> +	# TODO: fragile test, relies on ambigious merge-base resolution
>  	git ls-files --stage >out &&
>  
>  	cat >expect <<-EOF &&

Good call!  Nice adding TODO comment for the future.

Best,
Junio C Hamano Nov. 3, 2020, 6:19 p.m. UTC | #2
jnareb@gmail.com (Jakub Narębski) writes:

> I suspect that there are cases (with date skew) where corrected commit
> date gives better performance than committer date heuristics, and I am
> quite sure that generation number v2 can give better performance in case
> where paint_down_to_common() uses generation numbers.

Thanks for a well reasoned review.

>
> .................................................................
>
> Here begins separate second change, which is not put into separate
> commit because it is fairly tightly connected to the change described
> above.  It would be good idea, in my opinion, to add a sentence that
> explicitely marks this switch, for example:
>
>   This change accidentally broke fragile t6404-recursive-merge test.
>   t6404-recursive-merge setups a unique repository...
>
> Maybe with s/accidentaly/incidentally/.

Also "setup" is not a verb.  "... sets up a unique repository".

> Or add some other way of connection those two parts of the commit
> messages.
> ...
>> As this has already causes problems (as noted in 859fdc0 (commit-graph:
>> define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
>> within t6404-recursive-merge.
>
> Very nice explanation.
>
> Perhaps in the future we could make this test less fragile.

If "separate second change" is distracting, would it be an option to
fix the test before this step, perhaps?

Thanks.
Abhishek Kumar Nov. 20, 2020, 10:33 a.m. UTC | #3
On Tue, Nov 03, 2020 at 06:59:03PM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> >
> > With corrected commit dates implemented, we no longer have to rely on
> > commit date as a heuristic in paint_down_to_common().
> >
> > While using corrected commit dates Git walks nearly the same number of
> > commits as commit date, the process is slower as for each comparision we
> > have to access a commit-slab (for corrected committer date) instead of
> > accessing struct member (for committer date).
> 
> Something for the future: I wonder if it would be worth it to bring back
> generation number from the commit-slab into `struct commit`.
> 
> >
> > For example, the command `git merge-base v4.8 v4.9` on the linux
> > repository walks 167468 commits, taking 0.135s for committer date and
> > 167496 commits, taking 0.157s for corrected committer date respectively.
> 
> I think it would be good idea to explicitly refer to the commit that
> changed paint_down_to_common() to *not* use generation numbers v1
> (topological levels) in the cases such as this, namely 091f4cf3 (commit:
> don't use generation numbers if not needed).  In this commit we have the
> following:
> ...
>

I have re-arranged the first half of commit message: 

  091f4cf3 (commit: don't use generation numbers if not needed,
  2018-08-30) changed paint_down_to_common() to use commit dates instead
  of generation numbers v1 (topological levels) as the performance
  regressed on certain topologies. With generation number v2 (corrected
  commit dates) implemented, we no longer have to rely on commit dates and
  can use generation numbers.
  
  For example, the command `git merge-base v4.8 v4.9` on the Linux
  repository walks 167468 commits, taking 0.135s for committer date and
  167496 commits, taking 0.157s for corrected committer date respectively.
  
  While using corrected commit dates Git walks nearly the same number of
  commits as commit date, the process is slower as for each comparision we
  have to access a commit-slab (for corrected committer date) instead of
  accessing struct member (for committer date).

> 
> The times you report (0.135s and 0.157s) are close to 0.122s / 0.127s
> reported in 091f4cf3 - that is most probably because of the differences
> in the system performance (hardware, operating system, load, etc.).
> Numbers of commits walked for the committed date heuristics, that is
> 167,468 agrees with your results; 167,496 (+28) for corrected commit
> date (generation number v2) is significantly smaller (-468,083) than
> 635,579 reported for topological levels (generation number v1).
> 
> I suspect that there are cases (with date skew) where corrected commit
> date gives better performance than committer date heuristics, and I am
> quite sure that generation number v2 can give better performance in case
> where paint_down_to_common() uses generation numbers.
> 
> .................................................................
> 
> Here begins separate second change, which is not put into separate
> commit because it is fairly tightly connected to the change described
> above.  It would be good idea, in my opinion, to add a sentence that
> explicitely marks this switch, for example:
> 
>   This change accidentally broke fragile t6404-recursive-merge test.
>   t6404-recursive-merge setups a unique repository...
> 
> Maybe with s/accidentaly/incidentally/.
> 

Thanks, will add.

> Or add some other way of connection those two parts of the commit
> messages.
> ...
> >  
> > +int corrected_commit_dates_enabled(struct repository *r)
> > +{
> > +	struct commit_graph *g;
> > +	if (!prepare_commit_graph(r))
> > +		return 0;
> > +
> > +	g = r->objects->commit_graph;
> > +
> > +	if (!g->num_commits)
> > +		return 0;
> > +
> > +	return g->read_generation_data;
> > +}
> 
> Very nice abstraction.
> 
> Minor issue: I wonder if it would be better to use _available() or
> "_present()" rather than _enabled() suffix.
> 

We could, but that breaks conformity with `generation_numbers_enabled()`.

I see both functions to be similar in nature, to answer whether the
commit-graph has X? X could be topological levels or corrected commit
dates.

> > +
> >  struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
> >  {
> >  	struct commit_graph *g = r->objects->commit_graph;
> > diff --git a/commit-graph.h b/commit-graph.h
> > index ad52130883..d2c048dc64 100644
> > --- a/commit-graph.h
> > +++ b/commit-graph.h
> > @@ -89,13 +89,19 @@ struct commit_graph *read_commit_graph_one(struct repository *r,
> >  struct commit_graph *parse_commit_graph(struct repository *r,
> >  					void *graph_map, size_t graph_size);
> >  
> > +struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> > +
> >  /*
> >   * Return 1 if and only if the repository has a commit-graph
> >   * file and generation numbers are computed in that file.
> >   */
> >  int generation_numbers_enabled(struct repository *r);
> >  
> > -struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
> 
> This moving get_bloom_filter_settings() before generation_numbers_enabled() 
> looks like accidental change.  If not, why it is here?

Right, that's an accidental change. I wanted to group
generation_numbers_enabled() and corrected_commit_dates_enabled()
together.

> 
> ...
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index 5d15a1399b..3de1933ede 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -705,6 +705,20 @@  int generation_numbers_enabled(struct repository *r)
 	return !!first_generation;
 }
 
+int corrected_commit_dates_enabled(struct repository *r)
+{
+	struct commit_graph *g;
+	if (!prepare_commit_graph(r))
+		return 0;
+
+	g = r->objects->commit_graph;
+
+	if (!g->num_commits)
+		return 0;
+
+	return g->read_generation_data;
+}
+
 struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r)
 {
 	struct commit_graph *g = r->objects->commit_graph;
diff --git a/commit-graph.h b/commit-graph.h
index ad52130883..d2c048dc64 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -89,13 +89,19 @@  struct commit_graph *read_commit_graph_one(struct repository *r,
 struct commit_graph *parse_commit_graph(struct repository *r,
 					void *graph_map, size_t graph_size);
 
+struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+
 /*
  * Return 1 if and only if the repository has a commit-graph
  * file and generation numbers are computed in that file.
  */
 int generation_numbers_enabled(struct repository *r);
 
-struct bloom_filter_settings *get_bloom_filter_settings(struct repository *r);
+/*
+ * Return 1 if and only if the repository has a commit-graph
+ * file and generation data chunk has been written for the file.
+ */
+int corrected_commit_dates_enabled(struct repository *r);
 
 enum commit_graph_write_flags {
 	COMMIT_GRAPH_WRITE_APPEND     = (1 << 0),
diff --git a/commit-reach.c b/commit-reach.c
index 20b48b872b..46f5a9e638 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -39,7 +39,7 @@  static struct commit_list *paint_down_to_common(struct repository *r,
 	int i;
 	timestamp_t last_gen = GENERATION_NUMBER_INFINITY;
 
-	if (!min_generation)
+	if (!min_generation && !corrected_commit_dates_enabled(r))
 		queue.compare = compare_commits_by_commit_date;
 
 	one->object.flags |= PARENT1;
diff --git a/t/t6404-recursive-merge.sh b/t/t6404-recursive-merge.sh
index 332cfc53fd..7055771b62 100755
--- a/t/t6404-recursive-merge.sh
+++ b/t/t6404-recursive-merge.sh
@@ -15,6 +15,8 @@  GIT_COMMITTER_DATE="2006-12-12 23:28:00 +0100"
 export GIT_COMMITTER_DATE
 
 test_expect_success 'setup tests' '
+	GIT_TEST_COMMIT_GRAPH=0 &&
+	export GIT_TEST_COMMIT_GRAPH &&
 	echo 1 >a1 &&
 	git add a1 &&
 	GIT_AUTHOR_DATE="2006-12-12 23:00:00" git commit -m 1 a1 &&
@@ -66,7 +68,7 @@  test_expect_success 'setup tests' '
 '
 
 test_expect_success 'combined merge conflicts' '
-	test_must_fail env GIT_TEST_COMMIT_GRAPH=0 git merge -m final G
+	test_must_fail git merge -m final G
 '
 
 test_expect_success 'result contains a conflict' '
@@ -82,6 +84,7 @@  test_expect_success 'result contains a conflict' '
 '
 
 test_expect_success 'virtual trees were processed' '
+	# TODO: fragile test, relies on ambigious merge-base resolution
 	git ls-files --stage >out &&
 
 	cat >expect <<-EOF &&