diff mbox series

[1/3] commit-reach: implement get_reachable_subset

Message ID 4c0c5c9143a757ee5187ae6fca9a76f29b6c6ae2.1540908961.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Make add_missing_tags() linear | expand

Commit Message

Philippe Blain via GitGitGadget Oct. 30, 2018, 2:16 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

The existing reachability algorithms in commit-reach.c focus on
finding merge-bases or determining if all commits in a set X can
reach at least one commit in a set Y. However, for two commits sets
X and Y, we may also care about which commits in Y are reachable
from at least one commit in X.

Implement get_reachable_subset() which answers this question. Given
two arrays of commits, 'from' and 'to', return a commit_list with
every commit from the 'to' array that is reachable from at least
one commit in the 'from' array.

The algorithm is a simple walk starting at the 'from' commits, using
the PARENT2 flag to indicate "this commit has already been added to
the walk queue". By marking the 'to' commits with the PARENT1 flag,
we can determine when we see a commit from the 'to' array. We remove
the PARENT1 flag as we add that commit to the result list to avoid
duplicates.

The order of the resulting list is a reverse of the order that the
commits are discovered in the walk.

There are a couple shortcuts to avoid walking more than we need:

1. We determine the minimum generation number of commits in the
   'to' array. We do not walk commits with generation number
   below this minimum.

2. We count how many distinct commits are in the 'to' array, and
   decrement this count when we discover a 'to' commit during the
   walk. If this number reaches zero, then we can terminate the
   walk.

Tests will be added using the 'test-tool reach' helper in a
subsequent commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 commit-reach.h | 10 ++++++++
 2 files changed, 80 insertions(+)

Comments

Junio C Hamano Oct. 31, 2018, 3:35 a.m. UTC | #1
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +					 struct commit **to, int nr_to,
> +					 int reachable_flag)

This is OR'ed into object.flags, and I somoehow expected to see it
as 'unsigned int', not a signed one.

> +{
> +	struct commit **item;
> +	struct commit *current;
> +	struct commit_list *found_commits = NULL;
> +	struct commit **to_last = to + nr_to;
> +	struct commit **from_last = from + nr_from;
> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	int num_to_find = 0;
> +
> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
> +
> +	for (item = to; item < to_last; item++) {
> +		struct commit *c = *item;
> +		
> +		parse_commit(c);
> +		if (c->generation < min_generation)
> +			min_generation = c->generation;
> +
> +		if (!(c->object.flags & PARENT1)) {
> +			c->object.flags |= PARENT1;
> +			num_to_find++;
> +		}
> +	}
> +
> +	for (item = from; item < from_last; item++) {
> +		struct commit *c = *item;
> +		if (!(c->object.flags & PARENT2)) {
> +			c->object.flags |= PARENT2;
> +			parse_commit(c);
> +
> +			prio_queue_put(&queue, *item);
> +		}
> +	}

OK, we marked "to" with PARENT1 and counted them in num_to_find
without dups.  We also marked "from" with PARENT2 and threw them in
the "queue" without dups.

Mental note: the caller must guarantee that everybody reachable from
"to" and "from" have PARENT1 and PARENT2 clear.  This might deserve
to be in the comment before the function.

> +	while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
> +		struct commit_list *parents;
> +
> +		if (current->object.flags & PARENT1) {
> +			current->object.flags &= ~PARENT1;
> +			current->object.flags |= reachable_flag;
> +			commit_list_insert(current, &found_commits);
> +			num_to_find--;

What is in "queue" are all reachable from "from" so we found this
object in the "to" set is reachable.  One down.

> +		}
> +
> +		for (parents = current->parents; parents; parents = parents->next) {
> +			struct commit *p = parents->item;
> +
> +			parse_commit(p);
> +
> +			if (p->generation < min_generation)
> +				continue;

Any commit reachable from "from" whose generation is too small
(i.e. old) can never reach any of "to", because all "to" are at
generation "min_generation" or greater.  Makes sense.

> +			if (p->object.flags & PARENT2)
> +				continue;

If the object is painted with PARENT2, we know we have it in queue
and traversing to find more of its ancestors.  Avoid recursing into
it twice.  Makes sense.

> +			p->object.flags |= PARENT2;
> +			prio_queue_put(&queue, p);

Otherwise, explore this parent.

> +		}
> +	}
> +
> +	clear_commit_marks_many(nr_to, to, PARENT1);

Among "to", the ones that were not found still have PARENT1.
Because we do not propagate this flag down the ancestry chain like
merge-base discovery, this only has to clear just one level.

> +	clear_commit_marks_many(nr_from, from, PARENT2);

And then we smudged everything that are reachable from "from"
with this flag.  In the worst case, i.e. when num_to_find is still
positive here, we would have painted all commits down to the root,
and we need to clear all of them.

> +	return found_commits;
> +}

OK, this all makes sense.  Unlike merge-base traversals, this does
not have to traverse from the "to" side at all, which makes it quite
simpler and straight-forward.

I do wonder if we can now reimplement in_merge_bases_many() in terms
of this helper, and if that gives us a better performance.  It asks
"is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable
from, one of the 'references', i.e.  'from'"?

> diff --git a/commit-reach.h b/commit-reach.h
> index 7d313e297..43bd50a70 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  		       int commit_date_cutoff);
>  
> +
> +/*
> + * Return a list of commits containing the commits in the 'to' array
> + * that are reachable from at least one commit in the 'from' array.
> + * Also add the given 'flag' to each of the commits in the returned list.
> + */
> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +					 struct commit **to, int nr_to,
> +					 int reachable_flag);
> +
>  #endif
Elijah Newren Oct. 31, 2018, 6:07 a.m. UTC | #2
On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>         object_array_clear(&from_objs);
>         return result;
>  }
> +
> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +                                        struct commit **to, int nr_to,
> +                                        int reachable_flag)
> +{
> +       struct commit **item;
> +       struct commit *current;
> +       struct commit_list *found_commits = NULL;
> +       struct commit **to_last = to + nr_to;
> +       struct commit **from_last = from + nr_from;
> +       uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +       int num_to_find = 0;
> +
> +       struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
> +
> +       for (item = to; item < to_last; item++) {
> +               struct commit *c = *item;
> +
> +               parse_commit(c);
> +               if (c->generation < min_generation)
> +                       min_generation = c->generation;

So, when we don't have a commit-graph, is c->generation just going to
be 0, making min_generation also be 0? (meaning we get no possible
speed benefit from the commit-graph, since we just don't have that
information available)?

> +
> +               if (!(c->object.flags & PARENT1)) {
> +                       c->object.flags |= PARENT1;
> +                       num_to_find++;
> +               }
> +       }
> +
> +       for (item = from; item < from_last; item++) {
> +               struct commit *c = *item;
> +               if (!(c->object.flags & PARENT2)) {
> +                       c->object.flags |= PARENT2;
> +                       parse_commit(c);
> +
> +                       prio_queue_put(&queue, *item);
> +               }
> +       }
> +
> +       while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
> +               struct commit_list *parents;
> +
> +               if (current->object.flags & PARENT1) {
> +                       current->object.flags &= ~PARENT1;
> +                       current->object.flags |= reachable_flag;
> +                       commit_list_insert(current, &found_commits);
> +                       num_to_find--;
> +               }
> +
> +               for (parents = current->parents; parents; parents = parents->next) {
> +                       struct commit *p = parents->item;
> +
> +                       parse_commit(p);
> +
> +                       if (p->generation < min_generation)
> +                               continue;
> +
> +                       if (p->object.flags & PARENT2)
> +                               continue;
> +
> +                       p->object.flags |= PARENT2;
> +                       prio_queue_put(&queue, p);
> +               }
> +       }
> +
> +       clear_commit_marks_many(nr_to, to, PARENT1);
> +       clear_commit_marks_many(nr_from, from, PARENT2);
> +
> +       return found_commits;
> +}
> +
> diff --git a/commit-reach.h b/commit-reach.h
> index 7d313e297..43bd50a70 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>                        int commit_date_cutoff);
>
> +
> +/*
> + * Return a list of commits containing the commits in the 'to' array
> + * that are reachable from at least one commit in the 'from' array.
> + * Also add the given 'flag' to each of the commits in the returned list.
> + */
> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +                                        struct commit **to, int nr_to,
> +                                        int reachable_flag);
> +
>  #endif
> --
> gitgitgadget
>
Derrick Stolee Oct. 31, 2018, 11:54 a.m. UTC | #3
On 10/31/2018 2:07 AM, Elijah Newren wrote:
> On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> --- a/commit-reach.c
>> +++ b/commit-reach.c
>> @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>>          object_array_clear(&from_objs);
>>          return result;
>>   }
>> +
>> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
>> +                                        struct commit **to, int nr_to,
>> +                                        int reachable_flag)
>> +{
>> +       struct commit **item;
>> +       struct commit *current;
>> +       struct commit_list *found_commits = NULL;
>> +       struct commit **to_last = to + nr_to;
>> +       struct commit **from_last = from + nr_from;
>> +       uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>> +       int num_to_find = 0;
>> +
>> +       struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>> +
>> +       for (item = to; item < to_last; item++) {
>> +               struct commit *c = *item;
>> +
>> +               parse_commit(c);
>> +               if (c->generation < min_generation)
>> +                       min_generation = c->generation;
> So, when we don't have a commit-graph, is c->generation just going to
> be 0, making min_generation also be 0? (meaning we get no possible
> speed benefit from the commit-graph, since we just don't have that
> information available)?

If we don't have a commit-graph, then we can only terminate the loop early
if we discover all of the commits (num_to_find == 0).Otherwise, we need to
walk the entire graph in order to determine non-reachability. This relates
to Junio's point about in_merge_bases_many(), which I'll respond to his
message in more detail about that.

Thanks,
-Stolee
Derrick Stolee Oct. 31, 2018, 12:01 p.m. UTC | #4
On 10/30/2018 11:35 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
>> +					 struct commit **to, int nr_to,
>> +					 int reachable_flag)
> This is OR'ed into object.flags, and I somoehow expected to see it
> as 'unsigned int', not a signed one.

Will do. Thanks.

>
>> +{
>> +	struct commit **item;
>> +	struct commit *current;
>> +	struct commit_list *found_commits = NULL;
>> +	struct commit **to_last = to + nr_to;
>> +	struct commit **from_last = from + nr_from;
>> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>> +	int num_to_find = 0;
>> +
>> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>> +
>> +	for (item = to; item < to_last; item++) {
>> +		struct commit *c = *item;
>> +		
>> +		parse_commit(c);
>> +		if (c->generation < min_generation)
>> +			min_generation = c->generation;
>> +
>> +		if (!(c->object.flags & PARENT1)) {
>> +			c->object.flags |= PARENT1;
>> +			num_to_find++;
>> +		}
>> +	}
>> +
>> +	for (item = from; item < from_last; item++) {
>> +		struct commit *c = *item;
>> +		if (!(c->object.flags & PARENT2)) {
>> +			c->object.flags |= PARENT2;
>> +			parse_commit(c);
>> +
>> +			prio_queue_put(&queue, *item);
>> +		}
>> +	}
> OK, we marked "to" with PARENT1 and counted them in num_to_find
> without dups.  We also marked "from" with PARENT2 and threw them in
> the "queue" without dups.
>
> Mental note: the caller must guarantee that everybody reachable from
> "to" and "from" have PARENT1 and PARENT2 clear.  This might deserve
> to be in the comment before the function.

I'll put that in the header file.

[snip]
> OK, this all makes sense.  Unlike merge-base traversals, this does
> not have to traverse from the "to" side at all, which makes it quite
> simpler and straight-forward.
>
> I do wonder if we can now reimplement in_merge_bases_many() in terms
> of this helper, and if that gives us a better performance.  It asks
> "is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable
> from, one of the 'references', i.e.  'from'"?

We could do this, but it does come with a performance hit when the following
are all true:

1. 'to' is not reachable from any 'from' commits.

2. The 'to' and 'from' commits are close in commit-date.

3. Generation numbers are not available, or the topology is skewed to have
    commits with high commit date and low generation number.

Since in_merge_bases_many() calls paint_down_to_common(), it has the same
issues with the current generation numbers. This can be fixed when we have
the next version of generation numbers available.

I'll make a note to have in_merge_bases_many() call get_reachable_subset()
conditionally (like the generation_numbers_available() trick in the 
--topo-order
series) after the generation numbers are settled and implemented.

Thanks,
-Stolee
Junio C Hamano Nov. 2, 2018, 1:51 a.m. UTC | #5
Derrick Stolee <stolee@gmail.com> writes:

> We could do this, but it does come with a performance hit when the following
> are all true:
>
> 1. 'to' is not reachable from any 'from' commits.
>
> 2. The 'to' and 'from' commits are close in commit-date.
>
> 3. Generation numbers are not available, or the topology is skewed to have
>    commits with high commit date and low generation number.
>
> Since in_merge_bases_many() calls paint_down_to_common(), it has the same
> issues with the current generation numbers. This can be fixed when we have
> the next version of generation numbers available.
>
> I'll make a note to have in_merge_bases_many() call get_reachable_subset()
> conditionally (like the generation_numbers_available() trick in the
> --topo-order
> series) after the generation numbers are settled and implemented.

Sounds good.

I wondered how this compares with in_merge_bases_many() primarily
because how performance characteristics between two approaches trade
off.  After all, what it needs to compute is a specialized case of
get_reachable_subset() where "to" happens to be a set with only
single element.  If the latter with a single element "to" has the
above performance "issues" compared to paint-down-to-common based
approach, it could be possible that, for small enough N>1, running N
in_merge_bases_many() traversals is more performant than a single
get_reachable_subset() call that works on N-element "to".

I am hoping that an update to better generation numbers to help
get_reachable_subset() would help paint-down-to-common the same way,
and would eventually allow us to use a single traversal that is best
for N==1 and N>1.

Thanks.
diff mbox series

Patch

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a2..a98532ecc 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -688,3 +688,73 @@  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	object_array_clear(&from_objs);
 	return result;
 }
+
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+					 struct commit **to, int nr_to,
+					 int reachable_flag)
+{
+	struct commit **item;
+	struct commit *current;
+	struct commit_list *found_commits = NULL;
+	struct commit **to_last = to + nr_to;
+	struct commit **from_last = from + nr_from;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	int num_to_find = 0;
+
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+	for (item = to; item < to_last; item++) {
+		struct commit *c = *item;
+		
+		parse_commit(c);
+		if (c->generation < min_generation)
+			min_generation = c->generation;
+
+		if (!(c->object.flags & PARENT1)) {
+			c->object.flags |= PARENT1;
+			num_to_find++;
+		}
+	}
+
+	for (item = from; item < from_last; item++) {
+		struct commit *c = *item;
+		if (!(c->object.flags & PARENT2)) {
+			c->object.flags |= PARENT2;
+			parse_commit(c);
+
+			prio_queue_put(&queue, *item);
+		}
+	}
+
+	while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
+		struct commit_list *parents;
+
+		if (current->object.flags & PARENT1) {
+			current->object.flags &= ~PARENT1;
+			current->object.flags |= reachable_flag;
+			commit_list_insert(current, &found_commits);
+			num_to_find--;
+		}
+
+		for (parents = current->parents; parents; parents = parents->next) {
+			struct commit *p = parents->item;
+
+			parse_commit(p);
+
+			if (p->generation < min_generation)
+				continue;
+
+			if (p->object.flags & PARENT2)
+				continue;
+
+			p->object.flags |= PARENT2;
+			prio_queue_put(&queue, p);
+		}
+	}
+
+	clear_commit_marks_many(nr_to, to, PARENT1);
+	clear_commit_marks_many(nr_from, from, PARENT2);
+
+	return found_commits;
+}
+
diff --git a/commit-reach.h b/commit-reach.h
index 7d313e297..43bd50a70 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -74,4 +74,14 @@  int can_all_from_reach_with_flag(struct object_array *from,
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
+
+/*
+ * Return a list of commits containing the commits in the 'to' array
+ * that are reachable from at least one commit in the 'from' array.
+ * Also add the given 'flag' to each of the commits in the returned list.
+ */
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+					 struct commit **to, int nr_to,
+					 int reachable_flag);
+
 #endif