Message ID | 4c0c5c9143a757ee5187ae6fca9a76f29b6c6ae2.1540908961.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Make add_missing_tags() linear | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > + struct commit **to, int nr_to, > + int reachable_flag) This is OR'ed into object.flags, and I somoehow expected to see it as 'unsigned int', not a signed one. > +{ > + struct commit **item; > + struct commit *current; > + struct commit_list *found_commits = NULL; > + struct commit **to_last = to + nr_to; > + struct commit **from_last = from + nr_from; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; > + int num_to_find = 0; > + > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > + > + for (item = to; item < to_last; item++) { > + struct commit *c = *item; > + > + parse_commit(c); > + if (c->generation < min_generation) > + min_generation = c->generation; > + > + if (!(c->object.flags & PARENT1)) { > + c->object.flags |= PARENT1; > + num_to_find++; > + } > + } > + > + for (item = from; item < from_last; item++) { > + struct commit *c = *item; > + if (!(c->object.flags & PARENT2)) { > + c->object.flags |= PARENT2; > + parse_commit(c); > + > + prio_queue_put(&queue, *item); > + } > + } OK, we marked "to" with PARENT1 and counted them in num_to_find without dups. We also marked "from" with PARENT2 and threw them in the "queue" without dups. Mental note: the caller must guarantee that everybody reachable from "to" and "from" have PARENT1 and PARENT2 clear. This might deserve to be in the comment before the function. > + while (num_to_find && (current = prio_queue_get(&queue)) != NULL) { > + struct commit_list *parents; > + > + if (current->object.flags & PARENT1) { > + current->object.flags &= ~PARENT1; > + current->object.flags |= reachable_flag; > + commit_list_insert(current, &found_commits); > + num_to_find--; What is in "queue" are all reachable from "from" so we found this object in the "to" set is reachable. One down. > + } > + > + for (parents = current->parents; parents; parents = parents->next) { > + struct commit *p = parents->item; > + > + parse_commit(p); > + > + if (p->generation < min_generation) > + continue; Any commit reachable from "from" whose generation is too small (i.e. old) can never reach any of "to", because all "to" are at generation "min_generation" or greater. Makes sense. > + if (p->object.flags & PARENT2) > + continue; If the object is painted with PARENT2, we know we have it in queue and traversing to find more of its ancestors. Avoid recursing into it twice. Makes sense. > + p->object.flags |= PARENT2; > + prio_queue_put(&queue, p); Otherwise, explore this parent. > + } > + } > + > + clear_commit_marks_many(nr_to, to, PARENT1); Among "to", the ones that were not found still have PARENT1. Because we do not propagate this flag down the ancestry chain like merge-base discovery, this only has to clear just one level. > + clear_commit_marks_many(nr_from, from, PARENT2); And then we smudged everything that are reachable from "from" with this flag. In the worst case, i.e. when num_to_find is still positive here, we would have painted all commits down to the root, and we need to clear all of them. > + return found_commits; > +} OK, this all makes sense. Unlike merge-base traversals, this does not have to traverse from the "to" side at all, which makes it quite simpler and straight-forward. I do wonder if we can now reimplement in_merge_bases_many() in terms of this helper, and if that gives us a better performance. It asks "is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable from, one of the 'references', i.e. 'from'"? > diff --git a/commit-reach.h b/commit-reach.h > index 7d313e297..43bd50a70 100644 > --- a/commit-reach.h > +++ b/commit-reach.h > @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from, > int can_all_from_reach(struct commit_list *from, struct commit_list *to, > int commit_date_cutoff); > > + > +/* > + * Return a list of commits containing the commits in the 'to' array > + * that are reachable from at least one commit in the 'from' array. > + * Also add the given 'flag' to each of the commits in the returned list. > + */ > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > + struct commit **to, int nr_to, > + int reachable_flag); > + > #endif
On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com> wrote: > --- a/commit-reach.c > +++ b/commit-reach.c > @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, > object_array_clear(&from_objs); > return result; > } > + > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > + struct commit **to, int nr_to, > + int reachable_flag) > +{ > + struct commit **item; > + struct commit *current; > + struct commit_list *found_commits = NULL; > + struct commit **to_last = to + nr_to; > + struct commit **from_last = from + nr_from; > + uint32_t min_generation = GENERATION_NUMBER_INFINITY; > + int num_to_find = 0; > + > + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; > + > + for (item = to; item < to_last; item++) { > + struct commit *c = *item; > + > + parse_commit(c); > + if (c->generation < min_generation) > + min_generation = c->generation; So, when we don't have a commit-graph, is c->generation just going to be 0, making min_generation also be 0? (meaning we get no possible speed benefit from the commit-graph, since we just don't have that information available)? > + > + if (!(c->object.flags & PARENT1)) { > + c->object.flags |= PARENT1; > + num_to_find++; > + } > + } > + > + for (item = from; item < from_last; item++) { > + struct commit *c = *item; > + if (!(c->object.flags & PARENT2)) { > + c->object.flags |= PARENT2; > + parse_commit(c); > + > + prio_queue_put(&queue, *item); > + } > + } > + > + while (num_to_find && (current = prio_queue_get(&queue)) != NULL) { > + struct commit_list *parents; > + > + if (current->object.flags & PARENT1) { > + current->object.flags &= ~PARENT1; > + current->object.flags |= reachable_flag; > + commit_list_insert(current, &found_commits); > + num_to_find--; > + } > + > + for (parents = current->parents; parents; parents = parents->next) { > + struct commit *p = parents->item; > + > + parse_commit(p); > + > + if (p->generation < min_generation) > + continue; > + > + if (p->object.flags & PARENT2) > + continue; > + > + p->object.flags |= PARENT2; > + prio_queue_put(&queue, p); > + } > + } > + > + clear_commit_marks_many(nr_to, to, PARENT1); > + clear_commit_marks_many(nr_from, from, PARENT2); > + > + return found_commits; > +} > + > diff --git a/commit-reach.h b/commit-reach.h > index 7d313e297..43bd50a70 100644 > --- a/commit-reach.h > +++ b/commit-reach.h > @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from, > int can_all_from_reach(struct commit_list *from, struct commit_list *to, > int commit_date_cutoff); > > + > +/* > + * Return a list of commits containing the commits in the 'to' array > + * that are reachable from at least one commit in the 'from' array. > + * Also add the given 'flag' to each of the commits in the returned list. > + */ > +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, > + struct commit **to, int nr_to, > + int reachable_flag); > + > #endif > -- > gitgitgadget >
On 10/31/2018 2:07 AM, Elijah Newren wrote: > On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget > <gitgitgadget@gmail.com> wrote: >> --- a/commit-reach.c >> +++ b/commit-reach.c >> @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, >> object_array_clear(&from_objs); >> return result; >> } >> + >> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, >> + struct commit **to, int nr_to, >> + int reachable_flag) >> +{ >> + struct commit **item; >> + struct commit *current; >> + struct commit_list *found_commits = NULL; >> + struct commit **to_last = to + nr_to; >> + struct commit **from_last = from + nr_from; >> + uint32_t min_generation = GENERATION_NUMBER_INFINITY; >> + int num_to_find = 0; >> + >> + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; >> + >> + for (item = to; item < to_last; item++) { >> + struct commit *c = *item; >> + >> + parse_commit(c); >> + if (c->generation < min_generation) >> + min_generation = c->generation; > So, when we don't have a commit-graph, is c->generation just going to > be 0, making min_generation also be 0? (meaning we get no possible > speed benefit from the commit-graph, since we just don't have that > information available)? If we don't have a commit-graph, then we can only terminate the loop early if we discover all of the commits (num_to_find == 0).Otherwise, we need to walk the entire graph in order to determine non-reachability. This relates to Junio's point about in_merge_bases_many(), which I'll respond to his message in more detail about that. Thanks, -Stolee
On 10/30/2018 11:35 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > >> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, >> + struct commit **to, int nr_to, >> + int reachable_flag) > This is OR'ed into object.flags, and I somoehow expected to see it > as 'unsigned int', not a signed one. Will do. Thanks. > >> +{ >> + struct commit **item; >> + struct commit *current; >> + struct commit_list *found_commits = NULL; >> + struct commit **to_last = to + nr_to; >> + struct commit **from_last = from + nr_from; >> + uint32_t min_generation = GENERATION_NUMBER_INFINITY; >> + int num_to_find = 0; >> + >> + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; >> + >> + for (item = to; item < to_last; item++) { >> + struct commit *c = *item; >> + >> + parse_commit(c); >> + if (c->generation < min_generation) >> + min_generation = c->generation; >> + >> + if (!(c->object.flags & PARENT1)) { >> + c->object.flags |= PARENT1; >> + num_to_find++; >> + } >> + } >> + >> + for (item = from; item < from_last; item++) { >> + struct commit *c = *item; >> + if (!(c->object.flags & PARENT2)) { >> + c->object.flags |= PARENT2; >> + parse_commit(c); >> + >> + prio_queue_put(&queue, *item); >> + } >> + } > OK, we marked "to" with PARENT1 and counted them in num_to_find > without dups. We also marked "from" with PARENT2 and threw them in > the "queue" without dups. > > Mental note: the caller must guarantee that everybody reachable from > "to" and "from" have PARENT1 and PARENT2 clear. This might deserve > to be in the comment before the function. I'll put that in the header file. [snip] > OK, this all makes sense. Unlike merge-base traversals, this does > not have to traverse from the "to" side at all, which makes it quite > simpler and straight-forward. > > I do wonder if we can now reimplement in_merge_bases_many() in terms > of this helper, and if that gives us a better performance. It asks > "is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable > from, one of the 'references', i.e. 'from'"? We could do this, but it does come with a performance hit when the following are all true: 1. 'to' is not reachable from any 'from' commits. 2. The 'to' and 'from' commits are close in commit-date. 3. Generation numbers are not available, or the topology is skewed to have commits with high commit date and low generation number. Since in_merge_bases_many() calls paint_down_to_common(), it has the same issues with the current generation numbers. This can be fixed when we have the next version of generation numbers available. I'll make a note to have in_merge_bases_many() call get_reachable_subset() conditionally (like the generation_numbers_available() trick in the --topo-order series) after the generation numbers are settled and implemented. Thanks, -Stolee
Derrick Stolee <stolee@gmail.com> writes: > We could do this, but it does come with a performance hit when the following > are all true: > > 1. 'to' is not reachable from any 'from' commits. > > 2. The 'to' and 'from' commits are close in commit-date. > > 3. Generation numbers are not available, or the topology is skewed to have > commits with high commit date and low generation number. > > Since in_merge_bases_many() calls paint_down_to_common(), it has the same > issues with the current generation numbers. This can be fixed when we have > the next version of generation numbers available. > > I'll make a note to have in_merge_bases_many() call get_reachable_subset() > conditionally (like the generation_numbers_available() trick in the > --topo-order > series) after the generation numbers are settled and implemented. Sounds good. I wondered how this compares with in_merge_bases_many() primarily because how performance characteristics between two approaches trade off. After all, what it needs to compute is a specialized case of get_reachable_subset() where "to" happens to be a set with only single element. If the latter with a single element "to" has the above performance "issues" compared to paint-down-to-common based approach, it could be possible that, for small enough N>1, running N in_merge_bases_many() traversals is more performant than a single get_reachable_subset() call that works on N-element "to". I am hoping that an update to better generation numbers to help get_reachable_subset() would help paint-down-to-common the same way, and would eventually allow us to use a single traversal that is best for N==1 and N>1. Thanks.
diff --git a/commit-reach.c b/commit-reach.c index 9f79ce0a2..a98532ecc 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to, object_array_clear(&from_objs); return result; } + +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, + struct commit **to, int nr_to, + int reachable_flag) +{ + struct commit **item; + struct commit *current; + struct commit_list *found_commits = NULL; + struct commit **to_last = to + nr_to; + struct commit **from_last = from + nr_from; + uint32_t min_generation = GENERATION_NUMBER_INFINITY; + int num_to_find = 0; + + struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; + + for (item = to; item < to_last; item++) { + struct commit *c = *item; + + parse_commit(c); + if (c->generation < min_generation) + min_generation = c->generation; + + if (!(c->object.flags & PARENT1)) { + c->object.flags |= PARENT1; + num_to_find++; + } + } + + for (item = from; item < from_last; item++) { + struct commit *c = *item; + if (!(c->object.flags & PARENT2)) { + c->object.flags |= PARENT2; + parse_commit(c); + + prio_queue_put(&queue, *item); + } + } + + while (num_to_find && (current = prio_queue_get(&queue)) != NULL) { + struct commit_list *parents; + + if (current->object.flags & PARENT1) { + current->object.flags &= ~PARENT1; + current->object.flags |= reachable_flag; + commit_list_insert(current, &found_commits); + num_to_find--; + } + + for (parents = current->parents; parents; parents = parents->next) { + struct commit *p = parents->item; + + parse_commit(p); + + if (p->generation < min_generation) + continue; + + if (p->object.flags & PARENT2) + continue; + + p->object.flags |= PARENT2; + prio_queue_put(&queue, p); + } + } + + clear_commit_marks_many(nr_to, to, PARENT1); + clear_commit_marks_many(nr_from, from, PARENT2); + + return found_commits; +} + diff --git a/commit-reach.h b/commit-reach.h index 7d313e297..43bd50a70 100644 --- a/commit-reach.h +++ b/commit-reach.h @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from, int can_all_from_reach(struct commit_list *from, struct commit_list *to, int commit_date_cutoff); + +/* + * Return a list of commits containing the commits in the 'to' array + * that are reachable from at least one commit in the 'from' array. + * Also add the given 'flag' to each of the commits in the returned list. + */ +struct commit_list *get_reachable_subset(struct commit **from, int nr_from, + struct commit **to, int nr_to, + int reachable_flag); + #endif