diff mbox series

[v3,4/7] revision.c: begin refactoring --topo-order logic

Message ID fd1a0ab7cdaa06fd99f86fec51b483238f588296.1537551564.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Use generation numbers for --topo-order | expand

Commit Message

Johannes Schindelin via GitGitGadget Sept. 21, 2018, 5:39 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
 revision.h |  4 ++++
 2 files changed, 42 insertions(+), 4 deletions(-)

Comments

Jeff King Oct. 11, 2018, 2:06 p.m. UTC | #1
On Fri, Sep 21, 2018 at 10:39:32AM -0700, Derrick Stolee via GitGitGadget wrote:

> [..]
> When setting revs->limited only because revs->topo_order is true,
> only do so if generation numbers are not available. There is no
> reason to use the new logic as it will behave similarly when all
> generation numbers are INFINITY or ZERO.
> 
> In prepare_revision_walk(), if we have revs->topo_order but not
> revs->limited, then we trigger the new logic. It breaks the logic
> into three pieces, to fit with the existing framework:

Nicely explained. Your abstracted init/next/expand API seems sane, but
of course the real test will be reading the later patches that make use
of it. :)

The patch matches my understanding of your explanation.

-Peff
Junio C Hamano Oct. 12, 2018, 6:33 a.m. UTC | #2
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> * revs->limited implies we run limit_list() to walk the entire
>   reachable set. There are some short-cuts here, such as if we
>   perform a range query like 'git rev-list COMPARE..HEAD' and we
>   can stop limit_list() when all queued commits are uninteresting.
>
> * revs->topo_order implies we run sort_in_topological_order(). See
>   the implementation of that method in commit.c. It implies that
>   the full set of commits to order is in the given commit_list.
>
> These two methods imply that a 'git rev-list --topo-order HEAD'
> command must walk the entire reachable set of commits _twice_ before
> returning a single result.

With or without "--topo-order", running rev-list without any
negative commit means we must dig down to the roots that can be
reached from the positive commits we have.

I am to sure if having to run the "sort" of order N counts as "walk
the entire reachable set once" (in addition to the enumeration that
must be done to prepare that N commits, performed in limit_list()).

> In v2.18.0, the commit-graph file contains zero-valued bytes in the
> positions where the generation number is stored in v2.19.0 and later.
> Thus, we use generation_numbers_enabled() to check if the commit-graph
> is available and has non-zero generation numbers.
>
> When setting revs->limited only because revs->topo_order is true,
> only do so if generation numbers are not available. There is no
> reason to use the new logic as it will behave similarly when all
> generation numbers are INFINITY or ZERO.

> In prepare_revision_walk(), if we have revs->topo_order but not
> revs->limited, then we trigger the new logic. It breaks the logic
> into three pieces, to fit with the existing framework:
>
> 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
>    struct. We use the presence of this struct as a signal to use the
>    new methods during our walk. In this patch, this method simply
>    calls limit_list() and sort_in_topological_order(). In the future,
>    this method will set up a new data structure to perform that logic
>    in-line.
>
> 2. next_topo_commit() provides get_revision_1() with the next topo-
>    ordered commit in the list. Currently, this simply pops the commit
>    from revs->commits.

... because everything is already done in #1 above.  Which makes sense.

> 3. expand_topo_walk() provides get_revision_1() with a way to signal
>    walking beyond the latest commit. Currently, this calls
>    add_parents_to_list() exactly like the old logic.

"latest"?  We dig down the history from newer to older, so at some
point we hit an old commit and need to find the parents to keep
walking towards even older parts of the history.  Did you mean
"earliest" instead?

> While this commit presents method redirection for performing the
> exact same logic as before, it allows the next commit to focus only
> on the new logic.

OK.

> diff --git a/revision.c b/revision.c
> index e18bd530e4..2dcde8a8ac 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -25,6 +25,7 @@
>  #include "worktree.h"
>  #include "argv-array.h"
>  #include "commit-reach.h"
> +#include "commit-graph.h"
>  
>  volatile show_early_output_fn_t show_early_output;
>  
> @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
>  	if (revs->diffopt.objfind)
>  		revs->simplify_history = 0;
>  
> -	if (revs->topo_order)
> +	if (revs->topo_order && !generation_numbers_enabled(the_repository))
>  		revs->limited = 1;

Are we expecting that this is always a bool?  Can there be new
commits for which generation numbers are not computed and stored
while all the old, stable and packed commits have generation
numbers?

> @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
>  	return 0;
>  }
>  
> +struct topo_walk_info {};
> +
> +static void init_topo_walk(struct rev_info *revs)
> +{
> +	struct topo_walk_info *info;
> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
> +	info = revs->topo_walk_info;
> +	memset(info, 0, sizeof(struct topo_walk_info));

There is no member in the struct at this point.  Are we sure this is
safe?  Just being curious.  I know xmalloc() gives us at least one
byte and info won't be NULL.  I just do not know offhand if we have
a guarantee that memset() acts sensibly to fill the first 0 bytes.

> +	limit_list(revs);
> +	sort_in_topological_order(&revs->commits, revs->sort_order);
> +}
> +
> +static struct commit *next_topo_commit(struct rev_info *revs)
> +{
> +	return pop_commit(&revs->commits);
> +}
> +
> +static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
> +{
> +	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
> +		if (!revs->ignore_missing_links)
> +			die("Failed to traverse parents of commit %s",
> +			    oid_to_hex(&commit->object.oid));
> +	}
> +}
> +
>  int prepare_revision_walk(struct rev_info *revs)
>  {
>  	int i;
> @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
>  		commit_list_sort_by_date(&revs->commits);
>  	if (revs->no_walk)
>  		return 0;
> -	if (revs->limited)
> +	if (revs->limited) {
>  		if (limit_list(revs) < 0)
>  			return -1;
> -	if (revs->topo_order)
> -		sort_in_topological_order(&revs->commits, revs->sort_order);
> +		if (revs->topo_order)
> +			sort_in_topological_order(&revs->commits, revs->sort_order);
> +	} else if (revs->topo_order)
> +		init_topo_walk(revs);
>  	if (revs->line_level_traverse)
>  		line_log_filter(revs);
>  	if (revs->simplify_merges)

The diff is a bit hard to grok around here, but 

 - when limited *and* topo_order, we do the sort here, as we know we
   already have called limit_list(), i.e. we behave identically as
   the code before this patch in that case.

 - when not limited but topo_order, then we do init_topo_walk();
   currently we do limit_list() and sort_in_topological_order(),
   which means we do the same as above.

As long as limit_list() and sort_in_topological_order() does not
look at revs->limited bit, this patch cannot cause any regression.

> @@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  		if (revs->reflog_info)
>  			commit = next_reflog_entry(revs->reflog_info);
> +		else if (revs->topo_walk_info)
> +			commit = next_topo_commit(revs);
>  		else
>  			commit = pop_commit(&revs->commits);

So this get_revision_1() always grabs the commit from next_topo_commit()
when topo-order is in effect.

> @@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  			if (revs->reflog_info)
>  				try_to_simplify_commit(revs, commit);
> +			else if (revs->topo_walk_info)
> +				expand_topo_walk(revs, commit);
>  			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
>  				if (!revs->ignore_missing_links)
>  					die("Failed to traverse parents of commit %s",

And this add-parents-or-barf is replicated in expand_topo_walk() at
this step, so there is no change in behaviour.

Looks like a cleanly done preparation that is a no-op.

> diff --git a/revision.h b/revision.h
> index 2b30ac270d..fd4154ff75 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct rev_cmdline_info {
>  #define REVISION_WALK_NO_WALK_SORTED 1
>  #define REVISION_WALK_NO_WALK_UNSORTED 2
>  
> +struct topo_walk_info;
> +
>  struct rev_info {
>  	/* Starting list */
>  	struct commit_list *commits;
> @@ -245,6 +247,8 @@ struct rev_info {
>  	const char *break_bar;
>  
>  	struct revision_sources *sources;
> +
> +	struct topo_walk_info *topo_walk_info;
>  };
>  
>  int ref_excluded(struct string_list *, const char *path);
Derrick Stolee Oct. 12, 2018, 12:32 p.m. UTC | #3
On 10/12/2018 2:33 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> * revs->limited implies we run limit_list() to walk the entire
>>    reachable set. There are some short-cuts here, such as if we
>>    perform a range query like 'git rev-list COMPARE..HEAD' and we
>>    can stop limit_list() when all queued commits are uninteresting.
>>
>> * revs->topo_order implies we run sort_in_topological_order(). See
>>    the implementation of that method in commit.c. It implies that
>>    the full set of commits to order is in the given commit_list.
>>
>> These two methods imply that a 'git rev-list --topo-order HEAD'
>> command must walk the entire reachable set of commits _twice_ before
>> returning a single result.
> With or without "--topo-order", running rev-list without any
> negative commit means we must dig down to the roots that can be
> reached from the positive commits we have.
If we use default order in 'git log', we don't walk all the way to the 
root commits, and instead trust the commit-date. (This is different than 
--date-order, which does guarantee parents after children.) In this 
case, revs->limited is false.
> I am to sure if having to run the "sort" of order N counts as "walk
> the entire reachable set once" (in addition to the enumeration that
> must be done to prepare that N commits, performed in limit_list()).

sort_in_topological_order() does actually _two_ walks (the in-degree 
computation plus the walk that peels commits of in-degree zero), but 
those walks are cheaper because we've already parsed the commits in 
limit_list().
>> 3. expand_topo_walk() provides get_revision_1() with a way to signal
>>     walking beyond the latest commit. Currently, this calls
>>     add_parents_to_list() exactly like the old logic.
> "latest"?  We dig down the history from newer to older, so at some
> point we hit an old commit and need to find the parents to keep
> walking towards even older parts of the history.  Did you mean
> "earliest" instead?
I mean "latest" in terms of the algorithm, so "the commit that was 
returned by get_revision_1() most recently". This could use some 
rewriting for clarity.
>>   
>> @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
>>   	if (revs->diffopt.objfind)
>>   		revs->simplify_history = 0;
>>   
>> -	if (revs->topo_order)
>> +	if (revs->topo_order && !generation_numbers_enabled(the_repository))
>>   		revs->limited = 1;
> Are we expecting that this is always a bool?  Can there be new
> commits for which generation numbers are not computed and stored
> while all the old, stable and packed commits have generation
> numbers?

For this algorithm to work, we only care that _some_ commits have 
generation numbers. We expect that if a commit-graph file exists with 
generation numbers, then the majority of commits have generation 
numbers. The commits that were added or fetched since the commit-graph 
was written will have generation number INFINITY, but the topo-order 
algorithm will still work and be efficient in those cases. (This is also 
why we have the "half graph" case in test_three_modes.)

>> @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
>>   	return 0;
>>   }
>>   
>> +struct topo_walk_info {};
>> +
>> +static void init_topo_walk(struct rev_info *revs)
>> +{
>> +	struct topo_walk_info *info;
>> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>> +	info = revs->topo_walk_info;
>> +	memset(info, 0, sizeof(struct topo_walk_info));
> There is no member in the struct at this point.  Are we sure this is
> safe?  Just being curious.  I know xmalloc() gives us at least one
> byte and info won't be NULL.  I just do not know offhand if we have
> a guarantee that memset() acts sensibly to fill the first 0 bytes.
This is a good question. It seems to work for me when I check out your 
version of this commit (6c04ff30 "revision.c: begin refactoring 
--topo-order logic") and run all tests.
>> +	limit_list(revs);
>> +	sort_in_topological_order(&revs->commits, revs->sort_order);
>> +}
>> +
>> +static struct commit *next_topo_commit(struct rev_info *revs)
>> +{
>> +	return pop_commit(&revs->commits);
>> +}
>> +
>> +static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
>> +{
>> +	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
>> +		if (!revs->ignore_missing_links)
>> +			die("Failed to traverse parents of commit %s",
>> +			    oid_to_hex(&commit->object.oid));
>> +	}
>> +}
>> +
>>   int prepare_revision_walk(struct rev_info *revs)
>>   {
>>   	int i;
>> @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
>>   		commit_list_sort_by_date(&revs->commits);
>>   	if (revs->no_walk)
>>   		return 0;
>> -	if (revs->limited)
>> +	if (revs->limited) {
>>   		if (limit_list(revs) < 0)
>>   			return -1;
>> -	if (revs->topo_order)
>> -		sort_in_topological_order(&revs->commits, revs->sort_order);
>> +		if (revs->topo_order)
>> +			sort_in_topological_order(&revs->commits, revs->sort_order);
>> +	} else if (revs->topo_order)
>> +		init_topo_walk(revs);
>>   	if (revs->line_level_traverse)
>>   		line_log_filter(revs);
>>   	if (revs->simplify_merges)
> The diff is a bit hard to grok around here, but
>
>   - when limited *and* topo_order, we do the sort here, as we know we
>     already have called limit_list(), i.e. we behave identically as
>     the code before this patch in that case.
>
>   - when not limited but topo_order, then we do init_topo_walk();
>     currently we do limit_list() and sort_in_topological_order(),
>     which means we do the same as above.
>
> As long as limit_list() and sort_in_topological_order() does not
> look at revs->limited bit, this patch cannot cause any regression.
>
>> @@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>>   
>>   		if (revs->reflog_info)
>>   			commit = next_reflog_entry(revs->reflog_info);
>> +		else if (revs->topo_walk_info)
>> +			commit = next_topo_commit(revs);
>>   		else
>>   			commit = pop_commit(&revs->commits);
> So this get_revision_1() always grabs the commit from next_topo_commit()
> when topo-order is in effect.
And specifically, when the conditions for our new topo-walk algorithm 
are in effect. If the commit-graph doesn't exist, the old logic will 
still go through for "git log --topo-order".

Thanks for the careful look!
-Stolee
Johannes Sixt Oct. 12, 2018, 4:15 p.m. UTC | #4
Am 12.10.18 um 08:33 schrieb Junio C Hamano:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +struct topo_walk_info {};
>> +
>> +static void init_topo_walk(struct rev_info *revs)
>> +{
>> +	struct topo_walk_info *info;
>> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>> +	info = revs->topo_walk_info;
>> +	memset(info, 0, sizeof(struct topo_walk_info));
> 
> There is no member in the struct at this point.  Are we sure this is
> safe?  Just being curious.

sizeof cannot return 0. sizeof(struct topo_walk_info) will be 1 here.

-- Hannes
Junio C Hamano Oct. 13, 2018, 8:05 a.m. UTC | #5
Johannes Sixt <j6t@kdbg.org> writes:

> Am 12.10.18 um 08:33 schrieb Junio C Hamano:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> +struct topo_walk_info {};
>>> +
>>> +static void init_topo_walk(struct rev_info *revs)
>>> +{
>>> +	struct topo_walk_info *info;
>>> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
>>> +	info = revs->topo_walk_info;
>>> +	memset(info, 0, sizeof(struct topo_walk_info));
>>
>> There is no member in the struct at this point.  Are we sure this is
>> safe?  Just being curious.
>
> sizeof cannot return 0. sizeof(struct topo_walk_info) will be 1 here.

Thanks.
diff mbox series

Patch

diff --git a/revision.c b/revision.c
index e18bd530e4..2dcde8a8ac 100644
--- a/revision.c
+++ b/revision.c
@@ -25,6 +25,7 @@ 
 #include "worktree.h"
 #include "argv-array.h"
 #include "commit-reach.h"
+#include "commit-graph.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -2454,7 +2455,7 @@  int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
 	if (revs->diffopt.objfind)
 		revs->simplify_history = 0;
 
-	if (revs->topo_order)
+	if (revs->topo_order && !generation_numbers_enabled(the_repository))
 		revs->limited = 1;
 
 	if (revs->prune_data.nr) {
@@ -2892,6 +2893,33 @@  static int mark_uninteresting(const struct object_id *oid,
 	return 0;
 }
 
+struct topo_walk_info {};
+
+static void init_topo_walk(struct rev_info *revs)
+{
+	struct topo_walk_info *info;
+	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
+	info = revs->topo_walk_info;
+	memset(info, 0, sizeof(struct topo_walk_info));
+
+	limit_list(revs);
+	sort_in_topological_order(&revs->commits, revs->sort_order);
+}
+
+static struct commit *next_topo_commit(struct rev_info *revs)
+{
+	return pop_commit(&revs->commits);
+}
+
+static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
+{
+	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
+		if (!revs->ignore_missing_links)
+			die("Failed to traverse parents of commit %s",
+			    oid_to_hex(&commit->object.oid));
+	}
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -2928,11 +2956,13 @@  int prepare_revision_walk(struct rev_info *revs)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
 		return 0;
-	if (revs->limited)
+	if (revs->limited) {
 		if (limit_list(revs) < 0)
 			return -1;
-	if (revs->topo_order)
-		sort_in_topological_order(&revs->commits, revs->sort_order);
+		if (revs->topo_order)
+			sort_in_topological_order(&revs->commits, revs->sort_order);
+	} else if (revs->topo_order)
+		init_topo_walk(revs);
 	if (revs->line_level_traverse)
 		line_log_filter(revs);
 	if (revs->simplify_merges)
@@ -3257,6 +3287,8 @@  static struct commit *get_revision_1(struct rev_info *revs)
 
 		if (revs->reflog_info)
 			commit = next_reflog_entry(revs->reflog_info);
+		else if (revs->topo_walk_info)
+			commit = next_topo_commit(revs);
 		else
 			commit = pop_commit(&revs->commits);
 
@@ -3278,6 +3310,8 @@  static struct commit *get_revision_1(struct rev_info *revs)
 
 			if (revs->reflog_info)
 				try_to_simplify_commit(revs, commit);
+			else if (revs->topo_walk_info)
+				expand_topo_walk(revs, commit);
 			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
 				if (!revs->ignore_missing_links)
 					die("Failed to traverse parents of commit %s",
diff --git a/revision.h b/revision.h
index 2b30ac270d..fd4154ff75 100644
--- a/revision.h
+++ b/revision.h
@@ -56,6 +56,8 @@  struct rev_cmdline_info {
 #define REVISION_WALK_NO_WALK_SORTED 1
 #define REVISION_WALK_NO_WALK_UNSORTED 2
 
+struct topo_walk_info;
+
 struct rev_info {
 	/* Starting list */
 	struct commit_list *commits;
@@ -245,6 +247,8 @@  struct rev_info {
 	const char *break_bar;
 
 	struct revision_sources *sources;
+
+	struct topo_walk_info *topo_walk_info;
 };
 
 int ref_excluded(struct string_list *, const char *path);