[v4,4/7] revision.c: begin refactoring --topo-order logic

Message ID	cd9eef96885a811196ab0b851a98de4455b950ab.1539729393.git.gitgitgadget@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <git-owner@kernel.org> Date: Tue, 16 Oct 2018 15:36:42 -0700 (PDT) Message-Id: <cd9eef96885a811196ab0b851a98de4455b950ab.1539729393.git.gitgitgadget@gmail.com> In-Reply-To: <pull.25.v4.git.gitgitgadget@gmail.com> References: <pull.25.v3.git.gitgitgadget@gmail.com> <pull.25.v4.git.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Subject: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: peff@peff.net, Junio C Hamano <gitster@pobox.com>, Derrick Stolee <dstolee@microsoft.com> Sender: git-owner@vger.kernel.org Precedence: bulk
Series	Use generation numbers for --topo-order \| expand [v4,0/7] Use generation numbers for --topo-order [v4,1/7] prio-queue: add 'peek' operation [v4,2/7] test-reach: add run_three_modes method [v4,3/7] test-reach: add rev-list tests [v4,4/7] revision.c: begin refactoring --topo-order logic [v4,5/7] commit/revisions: bookkeeping before refactoring [v4,6/7] revision.c: generation-based topo-order algorithm [v4,7/7] t6012: make rev-list tests more interesting

Linus Arver via GitGitGadget Oct. 16, 2018, 10:36 p.m. UTC

From: Derrick Stolee <dstolee@microsoft.com>

When running 'git rev-list --topo-order' and its kin, the topo_order
setting in struct rev_info implies the limited setting. This means
that the following things happen during prepare_revision_walk():

* revs->limited implies we run limit_list() to walk the entire
  reachable set. There are some short-cuts here, such as if we
  perform a range query like 'git rev-list COMPARE..HEAD' and we
  can stop limit_list() when all queued commits are uninteresting.

* revs->topo_order implies we run sort_in_topological_order(). See
  the implementation of that method in commit.c. It implies that
  the full set of commits to order is in the given commit_list.

These two methods imply that a 'git rev-list --topo-order HEAD'
command must walk the entire reachable set of commits _twice_ before
returning a single result.

If we have a commit-graph file with generation numbers computed, then
there is a better way. This patch introduces some necessary logic
redirection when we are in this situation.

In v2.18.0, the commit-graph file contains zero-valued bytes in the
positions where the generation number is stored in v2.19.0 and later.
Thus, we use generation_numbers_enabled() to check if the commit-graph
is available and has non-zero generation numbers.

When setting revs->limited only because revs->topo_order is true,
only do so if generation numbers are not available. There is no
reason to use the new logic as it will behave similarly when all
generation numbers are INFINITY or ZERO.

In prepare_revision_walk(), if we have revs->topo_order but not
revs->limited, then we trigger the new logic. It breaks the logic
into three pieces, to fit with the existing framework:

1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
   struct. We use the presence of this struct as a signal to use the
   new methods during our walk. In this patch, this method simply
   calls limit_list() and sort_in_topological_order(). In the future,
   this method will set up a new data structure to perform that logic
   in-line.

2. next_topo_commit() provides get_revision_1() with the next topo-
   ordered commit in the list. Currently, this simply pops the commit
   from revs->commits.

3. expand_topo_walk() provides get_revision_1() with a way to signal
   walking beyond the latest commit. Currently, this calls
   add_parents_to_list() exactly like the old logic.

While this commit presents method redirection for performing the
exact same logic as before, it allows the next commit to focus only
on the new logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
 revision.h |  4 ++++
 2 files changed, 42 insertions(+), 4 deletions(-)

Jakub Narębski Oct. 21, 2018, 3:55 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> When running 'git rev-list --topo-order' and its kin, the topo_order
> setting in struct rev_info implies the limited setting. This means
> that the following things happen during prepare_revision_walk():
>
> * revs->limited implies we run limit_list() to walk the entire
>   reachable set. There are some short-cuts here, such as if we
>   perform a range query like 'git rev-list COMPARE..HEAD' and we
>   can stop limit_list() when all queued commits are uninteresting.

And if revs->topo_order is set, then (with current implementation) we
need limit_list() to run to generate commit_list with commits to be
topologically sorted, which is done by setting revs->limited.

In short, with current code revs->topo_order implies revs->limited.

>
> * revs->topo_order implies we run sort_in_topological_order(). See
>   the implementation of that method in commit.c. It implies that
>   the full set of commits to order is in the given commit_list.

So the current code uses "generate list of commits, then sort it"
approach...

>
> These two methods imply that a 'git rev-list --topo-order HEAD'
> command must walk the entire reachable set of commits _twice_ before
> returning a single result.
>
> If we have a commit-graph file with generation numbers computed, then
> there is a better way.

...instead of generating commits in topological order as you go.

>                        This patch introduces some necessary logic
> redirection when we are in this situation.

O.K., this should make main commit smaller.  All right.

> In v2.18.0, the commit-graph file contains zero-valued bytes in the
> positions where the generation number is stored in v2.19.0 and later.
> Thus, we use generation_numbers_enabled() to check if the commit-graph
> is available and has non-zero generation numbers.
>
> When setting revs->limited only because revs->topo_order is true,
> only do so if generation numbers are not available. There is no
> reason to use the new logic as it will behave similarly when all
> generation numbers are INFINITY or ZERO.

O.K. we will be using new algorithm only when there actually are some
generation numbers.


> In prepare_revision_walk(), if we have revs->topo_order but not
> revs->limited, then we trigger the new logic. It breaks the logic
> into three pieces, to fit with the existing framework:

So if revs->limited is set (but not because revs->topo_order is set),
which means A..B queries, we will be still using the old algorithm.
All right, though I wonder if it could be improved in the future
(perhaps with the help of other graph labelling / indices than
generation numbers, maybe a positive-cut index).

Do you have an idea why there is no improvement with the new code in
this case?

> 1. init_topo_walk() fills a new struct topo_walk_info in the rev_info
>    struct. We use the presence of this struct as a signal to use the
>    new methods during our walk. In this patch, this method simply
>    calls limit_list() and sort_in_topological_order(). In the future,
>    this method will set up a new data structure to perform that logic
>    in-line.
>
> 2. next_topo_commit() provides get_revision_1() with the next topo-
>    ordered commit in the list. Currently, this simply pops the commit
>    from revs->commits.
>
> 3. expand_topo_walk() provides get_revision_1() with a way to signal
>    walking beyond the latest commit. Currently, this calls
>    add_parents_to_list() exactly like the old logic.

So all three new functions should perform exactly like the old logic,
isn't it?

> While this commit presents method redirection for performing the
> exact same logic as before, it allows the next commit to focus only
> on the new logic.

All right, it's logical.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  revision.c | 42 ++++++++++++++++++++++++++++++++++++++----
>  revision.h |  4 ++++
>  2 files changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/revision.c b/revision.c
> index e18bd530e4..2dcde8a8ac 100644
> --- a/revision.c
> +++ b/revision.c
> @@ -25,6 +25,7 @@
>  #include "worktree.h"
>  #include "argv-array.h"
>  #include "commit-reach.h"
> +#include "commit-graph.h"
>  
>  volatile show_early_output_fn_t show_early_output;
>  
> @@ -2454,7 +2455,7 @@ int setup_revisions(int argc, const char **argv, struct rev_info *revs, struct s
>  	if (revs->diffopt.objfind)
>  		revs->simplify_history = 0;
>  
> -	if (revs->topo_order)
> +	if (revs->topo_order && !generation_numbers_enabled(the_repository))
>  		revs->limited = 1;

All right, with --topo-order and existing generation numbers don't force
the revs->limited code (i.e. explicit not wrapped use of limit_list()).

So with --topo-order and A..B, we have revs->limited set, with
--topo-order and no generation numbers we have revs->limited set.

>  
>  	if (revs->prune_data.nr) {
> @@ -2892,6 +2893,33 @@ static int mark_uninteresting(const struct object_id *oid,
>  	return 0;
>  }
>  
> +struct topo_walk_info {};

Nice trick with using NULL-ness of the pointer to the currently empty
struct as a boolean flag denoting whether to use new generation number
using algorithm for topological sorting.

> +
> +static void init_topo_walk(struct rev_info *revs)
> +{
> +	struct topo_walk_info *info;

I guess this helper variables is here for next revisions, as we could
have made without it...

> +	revs->topo_walk_info = xmalloc(sizeof(struct topo_walk_info));
> +	info = revs->topo_walk_info;
> +	memset(info, 0, sizeof(struct topo_walk_info));

...by using

  +	memset(revs->topo_walk_info, 0, sizeof(struct topo_walk_info));

> +
> +	limit_list(revs);
> +	sort_in_topological_order(&revs->commits, revs->sort_order);

This is not exactly identical to the old code, which has

	if (limit_list(revs) < 0)
		return -1;
	if (revs->topo_order)
		sort_in_topological_order(&revs->commits, revs->sort_order);

We know that init_topo_walk() would be invoked, as the name implies,
only when revs->topo_order is set, but do we know that limit_list()
would not return an error?

> +}
> +
> +static struct commit *next_topo_commit(struct rev_info *revs)
> +{
> +	return pop_commit(&revs->commits);
> +}

All right, identical to the old code.

> +
> +static void expand_topo_walk(struct rev_info *revs, struct commit *commit)
> +{
> +	if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
> +		if (!revs->ignore_missing_links)
> +			die("Failed to traverse parents of commit %s",
> +			    oid_to_hex(&commit->object.oid));
> +	}
> +}

All right, identical to the old code.

While at it, should this message be marked up for translation, or is it
something so low-level (and rare) that should be kept untranslated?  But
this would be better left for separate commit series, to not entangle
this one with spurious changes.

> +
>  int prepare_revision_walk(struct rev_info *revs)
>  {
>  	int i;
> @@ -2928,11 +2956,13 @@ int prepare_revision_walk(struct rev_info *revs)
>  		commit_list_sort_by_date(&revs->commits);
>  	if (revs->no_walk)
>  		return 0;
> -	if (revs->limited)
> +	if (revs->limited) {
>  		if (limit_list(revs) < 0)
>  			return -1;
> -	if (revs->topo_order)
> -		sort_in_topological_order(&revs->commits, revs->sort_order);
> +		if (revs->topo_order)
> +			sort_in_topological_order(&revs->commits, revs->sort_order);
> +	} else if (revs->topo_order)
> +		init_topo_walk(revs);

Previously when revs->topo_order was set, Git called
sort_in_topological_order(), because revs->limited got always set to
truthy value if revs->topo_order was true.

Now running sort_in_topological_order() is done only if revs->limited is
set (because of A..B); if it is not, init_topo_walk() is called.

All right, identical to the old code, up to checking the return value of
limit_list(), see previous comments.

>  	if (revs->line_level_traverse)
>  		line_log_filter(revs);
>  	if (revs->simplify_merges)
> @@ -3257,6 +3287,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  		if (revs->reflog_info)
>  			commit = next_reflog_entry(revs->reflog_info);
> +		else if (revs->topo_walk_info)
> +			commit = next_topo_commit(revs);
>  		else
>  			commit = pop_commit(&revs->commits);

All right, identical to the old code.

> @@ -3278,6 +3310,8 @@ static struct commit *get_revision_1(struct rev_info *revs)
>  
>  			if (revs->reflog_info)
>  				try_to_simplify_commit(revs, commit);
> +			else if (revs->topo_walk_info)
> +				expand_topo_walk(revs, commit);
>  			else if (add_parents_to_list(revs, commit, &revs->commits, NULL) < 0) {
>  				if (!revs->ignore_missing_links)
>  					die("Failed to traverse parents of commit %s",

All right, identical to the old code.

> diff --git a/revision.h b/revision.h
> index 2b30ac270d..fd4154ff75 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct rev_cmdline_info {
>  #define REVISION_WALK_NO_WALK_SORTED 1
>  #define REVISION_WALK_NO_WALK_UNSORTED 2
>  
> +struct topo_walk_info;
> +
>  struct rev_info {
>  	/* Starting list */
>  	struct commit_list *commits;
> @@ -245,6 +247,8 @@ struct rev_info {
>  	const char *break_bar;
>  
>  	struct revision_sources *sources;
> +
> +	struct topo_walk_info *topo_walk_info;
>  };
>  
>  int ref_excluded(struct string_list *, const char *path);

Junio C Hamano Oct. 22, 2018, 1:12 a.m. UTC | #2

Jakub Narebski <jnareb@gmail.com> writes:

> So if revs->limited is set (but not because revs->topo_order is set),
> which means A..B queries, we will be still using the old algorithm.
> All right, though I wonder if it could be improved in the future
> (perhaps with the help of other graph labelling / indices than
> generation numbers, maybe a positive-cut index).
>
> Do you have an idea why there is no improvement with the new code in
> this case?

I didn't get the impression that it would not be possible to improve
the "--topo A..B" case by using generation numbers from this series.
Isn't it just because the necessary code has not been written yet?
In addition to what is needed for "--topo P1 P2 P3..." (all
positive), limited walk needs to notice the bottom boundary and stop
traversal.  Having generation numbers would make it slightly easier
than without, as you know that a positive commit you have will not
be marked UNINTERESTING due to a negative commit whose ancestors
have not been explored, as long as that negative commit has a higher
generation number.  But you still need to adjust the traversal logic
to properly terminate upon hitting UNINTERESTING one, and also
propagate the bit down the history, which is not needed at all if
you only want to support the "positive only" case.

Derrick Stolee Oct. 22, 2018, 1:51 a.m. UTC | #3

On 10/21/2018 9:12 PM, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>
>> So if revs->limited is set (but not because revs->topo_order is set),
>> which means A..B queries, we will be still using the old algorithm.
>> All right, though I wonder if it could be improved in the future
>> (perhaps with the help of other graph labelling / indices than
>> generation numbers, maybe a positive-cut index).
>>
>> Do you have an idea why there is no improvement with the new code in
>> this case?
> I didn't get the impression that it would not be possible to improve
> the "--topo A..B" case by using generation numbers from this series.
> Isn't it just because the necessary code has not been written yet?
> In addition to what is needed for "--topo P1 P2 P3..." (all
> positive), limited walk needs to notice the bottom boundary and stop
> traversal.  Having generation numbers would make it slightly easier
> than without, as you know that a positive commit you have will not
> be marked UNINTERESTING due to a negative commit whose ancestors
> have not been explored, as long as that negative commit has a higher
> generation number.  But you still need to adjust the traversal logic
> to properly terminate upon hitting UNINTERESTING one, and also
> propagate the bit down the history, which is not needed at all if
> you only want to support the "positive only" case.

Actually, the code has been written, but the problem is the same as the 
performance issue when I made paint_down_to_common() use generation 
numbers: the algorithm for deciding what is in the set "reachable from A 
but not reachable from B" uses commit-date order as a heuristic to avoid 
walking the entire graph. Yes, the revision parameters specify "limited" 
in order to call "limit_list()", but it uses the same algorithm to 
determine the reachable set difference.

You can test this yourself! Run the following two commands in the Git 
repository using v2.19.1:

     time git log --topo-order -10 master >/dev/null

     time git log --topo-order -10 maint..master >/dev/null

I get 0.39s for the first call and 0.01s for the second. (Note: I 
specified "-10" to ensure we are only writing 10 commits and the output 
size does not factor into the time.) This is because the first walks the 
entire history, while the second uses the heuristic walk to identify a 
much smaller subgraph that the topo-order algorithm uses.

Just as before, by using this algorithm for the B..A case, we are adding 
an extra restriction on the algorithm: always be correct. This results 
in us walking a larger set (everything reachable from B or A with 
generation number at least the smallest generation of a commit reachable 
from only one).

I believe this can be handled by using a smarter generation number (one 
that relies on commit date as a heuristic, but still have enough 
information to guarantee topological relationships), and I've already 
started testing a few of these directions. It is possible now that we 
have concrete graph algorithms to use on real repositories. I hope to 
share a report on my findings in a couple weeks. I'll include how using 
this algorithm compares to the old algorithm in the B..A case.

Thanks,

-Stolee

Junio C Hamano Oct. 25, 2018, 8:28 a.m. UTC | #4

Derrick Stolee <stolee@gmail.com> writes:

>     time git log --topo-order -10 master >/dev/null
>
>     time git log --topo-order -10 maint..master >/dev/null
>
> I get 0.39s for the first call and 0.01s for the second. (Note: I
> specified "-10" to ensure we are only writing 10 commits and the
> output size does not factor into the time.) This is because the first
> walks the entire history, while the second uses the heuristic walk to
> identify a much smaller subgraph that the topo-order algorithm uses.

The algorithm can be fooled by skewed timestamps (i.e. that SLOP
thing tries to work around), but is helped by being able to leave
early, and it will give us the correct answer as long as there is no
timestamp inversion.

But monotonically increasing "timestamp" without inversion is what
we invented "generation numbers" for, no?  When there is no
timestamp inversion, would a walk based on commit timestamps walk
smaller set than a walk based on commit generation numbers?

> Just as before, by using this algorithm for the B..A case, we are
> adding an extra restriction on the algorithm: always be correct. This
> results in us walking a larger set (everything reachable from B or A
> with generation number at least the smallest generation of a commit
> reachable from only one).
>
> I believe this can be handled by using a smarter generation number
> (one that relies on commit date as a heuristic, but still have enough
> information to guarantee topological relationships), and I've already
> started testing a few of these directions.

Good ot hear.

Jakub Narębski Oct. 26, 2018, 8:56 p.m. UTC | #5

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <stolee@gmail.com> writes:
>
>>     time git log --topo-order -10 master >/dev/null
>>
>>     time git log --topo-order -10 maint..master >/dev/null
>>
>> I get 0.39s for the first call and 0.01s for the second. (Note: I
>> specified "-10" to ensure we are only writing 10 commits and the
>> output size does not factor into the time.) This is because the first
>> walks the entire history, while the second uses the heuristic walk to
>> identify a much smaller subgraph that the topo-order algorithm uses.
>
> The algorithm can be fooled by skewed timestamps (i.e. that SLOP
> thing tries to work around), but is helped by being able to leave
> early, and it will give us the correct answer as long as there is no
> timestamp inversion.
>
> But monotonically increasing "timestamp" without inversion is what
> we invented "generation numbers" for, no?  When there is no
> timestamp inversion, would a walk based on commit timestamps walk
> smaller set than a walk based on commit generation numbers?

The problem, as far as I understand it, is in B..A case, to be more
exact with the walk from the exclusion set.  There can be cases when
there are two paths from the commit in the exclusion set, and sorting by
generation number will always walk the longer one, while date-order
based heuristic walk traverses smaller subgraph.

This problem was described in "[PATCH 1/1] commit: don't use generation
numbers if not needed" [1]; relevant fragment cited below:

DS> For instance, computing the merge-base between consecutive versions of
DS> the Linux kernel has no effect for versions after v4.9, but 'git
DS> merge-base v4.8 v4.9' presents a performance regression [...]
DS> 
DS> The topology of this case can be described in a simplified way
DS> here:
DS> 
DS>   v4.9
DS>    |  \
DS>    |   \
DS>   v4.8  \
DS>    | \   \
DS>    |  \   |
DS>   ...  A  B
DS>    |  /  /
DS>    | /  /
DS>    |/__/
DS>    C
DS> 
DS> Here, the "..." means "a very long line of commits". By generation
DS> number, A and B have generation one more than C. However, A and B
DS> have commit date higher than most of the commits reachable from
DS> v4.8. When the walk reaches v4.8, we realize that it has PARENT1
DS> and PARENT2 flags, so everything it can reach is marked as STALE,
DS> including A. B has only the PARENT1 flag, so is not STALE.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_commit_date, A and B are removed from the queue
DS> early and C is inserted into the queue. At this point, C and the
DS> rest of the queue entries are marked as STALE. The loop then
DS> terminates.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_gen_then_commit_date, B is removed from the
DS> queue only after the many commits reachable from v4.8 are explored.
DS> This causes the loop to run longer. The reason for this regression
DS> is simple: the queue order is intended to not explore a commit
DS> until everything that _could_ reach that commit is explored. From
DS> the information gathered by the original ordering, we have no
DS> guarantee that there is not a commit D reachable from v4.8 that
DS> can also reach B. We gained absolute correctness in exchange for
DS> a performance regression.

[1]: https://public-inbox.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com/T/#u

>> Just as before, by using this algorithm for the B..A case, we are
>> adding an extra restriction on the algorithm: always be correct. This
>> results in us walking a larger set (everything reachable from B or A
>> with generation number at least the smallest generation of a commit
>> reachable from only one).
>>
>> I believe this can be handled by using a smarter generation number
>> (one that relies on commit date as a heuristic, but still have enough
>> information to guarantee topological relationships), and I've already
>> started testing a few of these directions.
>
> Good to hear.

I'm not sure if you can enhance generation numbers, or rather using /
sorting of generation numbers in such way.

With generation numbers, if you have two possible paths to the merge
commit, and one is much longer than the other, then the commits on this
longer path will have larger generation numbers.

        0     1     2     3     4     5     6     7       = gen(c) - gen(B)
    --- B<----*<----*<----*<----*<----*<----*<----M ---
        ^                                        /
         \--------------------------x<--------x</         = gen(c) - gen(B)
                                    1         2

    ===================== commit date ======================>

Using priority-queue which selects max generation number would then
always walk the longer path (or walk longer path first).

This does not matter for the walk from the inclusive set (A in B..A), as
we want to walk all commits anyway; we walk both paths, it does not
matter which we walk first (it may change the unimportant parts of
topological sort, though).

For the walk from the exclusive set (B in B..A), any path is good; we
don't need and do not want to walk all paths.  Sorting by generation
numbers will always choose the longer path, while sorting by commit date
may walk the "shortcut" first, thus marking commits reachable from the
inclusive subset (from A) STALE earlier, thus earlier notice of
all-stale, and end of walk.

I think that in the case of walking from the exclusion subset, using
positive-cut reachbility index would be more useful than trying to come
up with "smarter generation number" (though I am not saying that it
would be impossible).  For example using reachability bitmap indices, or
post-order in spanning tree (the latter descibed in [2]) to mark more
commits stale than just parents, may help more.

Actually, this is something independent from "smarter generation
numbers", and can only help...

[2]: "[RFC] Other chunks for commit-graph, part 2 - reachability indexes"
     https://public-inbox.org/git/86muxcuyod.fsf@gmail.com/

[v4,4/7] revision.c: begin refactoring --topo-order logic

Commit Message

Comments

Patch