Message ID | pull.1489.v2.git.1678468863.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | ref-filter: ahead/behind counting, faster --merged option | expand |
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > I was > initially concerned about the overhead of 'git for-each-ref' and its > generality and sorting, but I was not able to measure any important > difference between this implementation and our internal 'git ahead-behind' > implementation. That certainly is nice to know. > However, for our specific uses, we like to batch a list of exact references > that could be very long. We introduce a new --stdin option here. > > To keep things close to the v1 outline, I replaced the existing patches with > closely-related ones, when possible. > > Patch 1 adds the --stdin option to 'git for-each-ref'. (This is similar to > the boilerplate patch from v1.) > > Patch 2 adds a test to explicitly check that 'git for-each-ref' will still > succeed when all input refs are missing. (This is similar to the > --ignore-missing patch from v1.) Sensible. > Patches 3-5 introduce a new method: ensure_generations_valid(). Patch 3 does > some refactoring of the existing generation number computations to make it > more generic, and patch 4 updates the definition of > commit_graph_generation() slightly, making way for patch 5 to implement the > method. With an existing commit-graph file, the commits that are not present > in the file are considered as having generation number "infinity". This is > useful for most of our reachability queries to this point, since those > commits are "above" the ones tracked by the commit-graph. When these commits > are low in number, then there is very little performance cost and zero > correctness cost. (These patches match v1 exactly.) > > However, we will see that the ahead/behind computation requires accurate > generation numbers to avoid overcounting. Thus, ensure_generations_valid() > is a way to specify a list of commits that need generation numbers computed > before continuing. It's a no-op if all of those commits are in the > commit-graph file. It's expensive if the commit-graph doesn't exist. Reasonable. > However, '%(ahead-behind:)' computations are likely to be slow no matter > what without a commit-graph, so assuming an existing commit-graph file is > reasonable. If we find sufficient desire to have an implementation that does > not have this requirement, we could create a second implementation and > toggle to it when generation_numbers_enabled() returns false. At that point, it might make sense to find a way to make the work ensure_generations_valid() had to spend cycles on not to go to waste. Something like "ah, you do not have commit-graph at all, so let's try to create one if you can write into the repository" at the beginning of the function, or something? Just thinking aloud. > Patch 6 implements the ahead-behind algorithm, but it is not connected to a > builtin. It's a long commit message, so hopefully it explains the algorithm > sufficiently. (The difference from v1 is that it no longer integrates with a > builtin and there are no new tests. It also uses 'unsigned int' and is > correctly co-authored by Taylor.) Nice. > Patch 7 integrates the ahead-behind algorithm with the ref-filter code, > including parsing the "ahead-behind" token. This finally adds tests that > check both ahead_behind() and ensure_generations_valid() via > t6600-test-reach.sh. (This patch is essentially completely new in v2.) > > Patch 8 implements the tips_reachable_from_base() method, and uses it within > the ref-filter code to speed up 'git for-each-ref --merged' and 'git branch > --merged'. (The interface is slightly different than v1, due to the needs of > the new caller.) Very nice. Having read all the patches, I am very impressed and pleased, but are we losing anything by having the feature inside for-each-ref compared to a new command ahead-behind? As far as I can tell, the new "for-each-ref --stdin" would still want to match refs and work only on refs, but there shouldn't be any reason for ahead-behind computation to limit to tips that are at the tip of a ref, so that may be one downside in this updated design. For the intended use case of "let's find which branches are stale", that downside does not matter in practice, but for other use cases people will think of in the future, the limitation might matter (at which time we can easily resurrect the other subcommand, using the internal machinery we have here, so it is not a huge deal, I presume). Thanks.
On 3/10/2023 2:16 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: >> However, '%(ahead-behind:)' computations are likely to be slow no matter >> what without a commit-graph, so assuming an existing commit-graph file is >> reasonable. If we find sufficient desire to have an implementation that does >> not have this requirement, we could create a second implementation and >> toggle to it when generation_numbers_enabled() returns false. > > At that point, it might make sense to find a way to make the work > ensure_generations_valid() had to spend cycles on not to go to > waste. Something like "ah, you do not have commit-graph at all, so > let's try to create one if you can write into the repository" at the > beginning of the function, or something? Just thinking aloud. This is a reasonable idea for helping users get out of a slow situation. Let's see how often this is a problem to see if it is worth doing that extra work in another series. > Having read all the patches, I am very impressed and pleased, but > are we losing anything by having the feature inside for-each-ref > compared to a new command ahead-behind? As far as I can tell, the > new "for-each-ref --stdin" would still want to match refs and work > only on refs, but there shouldn't be any reason for ahead-behind > computation to limit to tips that are at the tip of a ref, so that > may be one downside in this updated design. For the intended use > case of "let's find which branches are stale", that downside does > not matter in practice, but for other use cases people will think > of in the future, the limitation might matter (at which time we can > easily resurrect the other subcommand, using the internal machinery > we have here, so it is not a huge deal, I presume). I think the for-each-ref implementation solves the use case we had in mind, I think. I'll double-check to see if we ever use exact commit IDs instead of reference names, but I think these callers are rarely interested in an exact commit ID but instead want the latest version of refs. The idea of using committish tips definitely changes the functionality boundary. You are right that we can introduce a new builtin easily if that is necessary. Even without the ahead-behind builtin, we are succeeding in reducing the diff between our fork and the core project. Thanks, -Stolee
On Fri, Mar 10 2023, Derrick Stolee via GitGitGadget wrote: > At $DAYJOB, we have used a custom 'ahead-behind' builtin in our fork of Git > for lots of reasons. The main goal of the builtin is to compare multiple > references against a common base reference. The comparison is number of > commits that are in each side of the symmtric difference of their reachable > sets. A commit C is "ahead" of a commit B by the number of commits in B..C > (reachable from C but not reachable from B). Similarly, the commit C is > "behind" the commit B by the number of commits in C..B (reachable from B but > not reachable from C). I have a local change to get rid of the various "the_repository" macros, which a merge of this in "seen" conflicted with (semantically). The below patch on top of "seen" will fix it, could you please squash it in in the appropriate places? Aside from a desire to get rid of "the_repository" macros this also makes your own code consistent, i.e. you use repo_clear_commit_marks(), but then use parse_commit() instead of the repo_parse_commit() in the same function. It also makes the new API more future-proof, I don't think we should be adding new code that implicitly uses "the_repository" to our libraries if we can help it, much better to pass it down, even if all the current users are built-in that end up using "the_repository". diff --git a/builtin/branch.c b/builtin/branch.c index 21526d9883a..7c7dba839cf 100644 --- a/builtin/branch.c +++ b/builtin/branch.c @@ -448,7 +448,7 @@ static void print_ref_list(struct ref_filter *filter, struct ref_sorting *sortin if (verify_ref_format(format)) die(_("unable to parse format string")); - filter_ahead_behind(format, &array); + filter_ahead_behind(the_repository, format, &array); ref_array_sort(sorting, &array); for (i = 0; i < array.nr; i++) { diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c index 1cdf8eb5a6b..e097f44e226 100644 --- a/builtin/for-each-ref.c +++ b/builtin/for-each-ref.c @@ -102,7 +102,7 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix) filter.match_as_path = 1; filter_refs(&array, &filter, FILTER_REFS_ALL); - filter_ahead_behind(&format, &array); + filter_ahead_behind(the_repository, &format, &array); ref_array_sort(sorting, &array); diff --git a/builtin/tag.c b/builtin/tag.c index 8652d5edd47..7e2f686600a 100644 --- a/builtin/tag.c +++ b/builtin/tag.c @@ -67,7 +67,7 @@ static int list_tags(struct ref_filter *filter, struct ref_sorting *sorting, die(_("unable to parse format string")); filter->with_commit_tag_algo = 1; filter_refs(&array, filter, FILTER_REFS_TAGS); - filter_ahead_behind(format, &array); + filter_ahead_behind(the_repository, format, &array); ref_array_sort(sorting, &array); for (i = 0; i < array.nr; i++) { diff --git a/commit-reach.c b/commit-reach.c index 0abd43801fc..4a2216b8ae0 100644 --- a/commit-reach.c +++ b/commit-reach.c @@ -977,8 +977,9 @@ static void free_bit_array(struct commit *c) *bitmap = NULL; } -void ahead_behind(struct commit **commits, size_t commits_nr, - struct ahead_behind_count *counts, size_t counts_nr) +void ahead_behind(struct repository *r, struct commit **commits, + size_t commits_nr, struct ahead_behind_count *counts, + size_t counts_nr) { struct prio_queue queue = { compare_commits_by_gen_then_commit_date }; size_t width = (commits_nr + BITS_IN_EWORD - 1) / BITS_IN_EWORD; @@ -1024,7 +1025,7 @@ void ahead_behind(struct commit **commits, size_t commits_nr, for (p = c->parents; p; p = p->next) { struct bitmap *bitmap_p; - parse_commit(p->item); + repo_parse_commit(r, p->item); bitmap_p = init_bit_array(p->item, width); bitmap_or(bitmap_p, bitmap_c); @@ -1039,7 +1040,7 @@ void ahead_behind(struct commit **commits, size_t commits_nr, } /* STALE is used here, PARENT2 is used by insert_no_dup(). */ - repo_clear_commit_marks(the_repository, PARENT2 | STALE); + repo_clear_commit_marks(r, PARENT2 | STALE); clear_bit_arrays(&bit_arrays); clear_prio_queue(&queue); } diff --git a/commit-reach.h b/commit-reach.h index f871b5dcce9..2269fab8261 100644 --- a/commit-reach.h +++ b/commit-reach.h @@ -131,7 +131,8 @@ struct ahead_behind_count { * Given an array of commits and an array of ahead_behind_count pairs, * compute the ahead/behind counts for each pair. */ -void ahead_behind(struct commit **commits, size_t commits_nr, - struct ahead_behind_count *counts, size_t counts_nr); +void ahead_behind(struct repository *r, struct commit **commits, + size_t commits_nr, struct ahead_behind_count *counts, + size_t counts_nr); #endif diff --git a/ref-filter.c b/ref-filter.c index 4125ec3fd3a..cdc054beede 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -2462,7 +2462,7 @@ static void reach_filter(struct ref_array *array, free(to_clear); } -void filter_ahead_behind(struct ref_format *format, +void filter_ahead_behind(struct repository *r, struct ref_format *format, struct ref_array *array) { struct commit **commits; @@ -2502,7 +2502,7 @@ void filter_ahead_behind(struct ref_format *format, commits_nr++; } - ahead_behind(commits, commits_nr, array->counts, array->counts_nr); + ahead_behind(r, commits, commits_nr, array->counts, array->counts_nr); free(commits); } diff --git a/ref-filter.h b/ref-filter.h index 7e8bff3864e..1a757b49233 100644 --- a/ref-filter.h +++ b/ref-filter.h @@ -163,7 +163,7 @@ struct ref_array_item *ref_array_push(struct ref_array *array, * * If this is not called, then any ahead-behind atoms will be blank. */ -void filter_ahead_behind(struct ref_format *format, +void filter_ahead_behind(struct repository *r, struct ref_format *format, struct ref_array *array); #endif /* REF_FILTER_H */
On 3/15/2023 9:22 AM, Ævar Arnfjörð Bjarmason wrote: > > On Fri, Mar 10 2023, Derrick Stolee via GitGitGadget wrote: > >> At $DAYJOB, we have used a custom 'ahead-behind' builtin in our fork of Git >> for lots of reasons. The main goal of the builtin is to compare multiple >> references against a common base reference. The comparison is number of >> commits that are in each side of the symmtric difference of their reachable >> sets. A commit C is "ahead" of a commit B by the number of commits in B..C >> (reachable from C but not reachable from B). Similarly, the commit C is >> "behind" the commit B by the number of commits in C..B (reachable from B but >> not reachable from C). > > I have a local change to get rid of the various "the_repository" macros, > which a merge of this in "seen" conflicted with (semantically). Thanks for doing that important refactoring. > The below patch on top of "seen" will fix it, could you please squash it > in in the appropriate places? Got it. Thanks. v3 will arrive later today with those changes and the recommended strvec changes. Thanks, -Stolee
On Fri, Mar 10, 2023 at 02:25:52PM -0500, Derrick Stolee wrote: > > Having read all the patches, I am very impressed and pleased, but > > are we losing anything by having the feature inside for-each-ref > > compared to a new command ahead-behind? As far as I can tell, the > > new "for-each-ref --stdin" would still want to match refs and work > > only on refs, but there shouldn't be any reason for ahead-behind > > computation to limit to tips that are at the tip of a ref, so that > > may be one downside in this updated design. For the intended use > > case of "let's find which branches are stale", that downside does > > not matter in practice, but for other use cases people will think > > of in the future, the limitation might matter (at which time we can > > easily resurrect the other subcommand, using the internal machinery > > we have here, so it is not a huge deal, I presume). > > I think the for-each-ref implementation solves the use case we > had in mind, I think. I'll double-check to see if we ever use > exact commit IDs instead of reference names, but I think these > callers are rarely interested in an exact commit ID but instead > want the latest version of refs. One thing I'd worry about here are race conditions. If you have a porcelain-ish view (and I'd count "showing a web page" as a porcelain view) that requires several commands to compute, it's possible for there to be simultaneous ref updates between your commands. If each command is given a refname, then the results may not be consistent. E.g., imagine resolving "main" to 1234abcd in step one, then somebody updates it to 5678cdef, then you run "for-each-ref" to compute ahead/behind, and now you show an inconsistent result: you say that "main" points to 1234abcd, but show the wrong ahead/behind information. Showing 1234abcd at all is out-of-date, of course, but the real problem is the lack of atomicity. Most porcelain scripts deal with this by resolving the refs immediately, assuming object ids are immutable (which they are modulo games like refs/replace), and then working with them. I don't know if this is how your current application-level code calling ahead-behind works, or if it just accepts the possibility of a race (or maybe the call is not presented along with other information so it's sort-of atomic on its own). Presumably your double-checking will find out. :) I do otherwise like exposing this as an option of for-each-ref, as that is the way I'd expect most normal client users to want to get at the information. And if this is step 1 and that's good enough for now, and we have a path forward to later expose it for general commits, that's OK with me. -Peff
On 3/15/2023 1:31 PM, Jeff King wrote: > On Fri, Mar 10, 2023 at 02:25:52PM -0500, Derrick Stolee wrote: > >>> Having read all the patches, I am very impressed and pleased, but >>> are we losing anything by having the feature inside for-each-ref >>> compared to a new command ahead-behind? As far as I can tell, the >>> new "for-each-ref --stdin" would still want to match refs and work >>> only on refs, but there shouldn't be any reason for ahead-behind >>> computation to limit to tips that are at the tip of a ref, so that >>> may be one downside in this updated design. For the intended use >>> case of "let's find which branches are stale", that downside does >>> not matter in practice, but for other use cases people will think >>> of in the future, the limitation might matter (at which time we can >>> easily resurrect the other subcommand, using the internal machinery >>> we have here, so it is not a huge deal, I presume). >> >> I think the for-each-ref implementation solves the use case we >> had in mind, I think. I'll double-check to see if we ever use >> exact commit IDs instead of reference names, but I think these >> callers are rarely interested in an exact commit ID but instead >> want the latest version of refs. > > One thing I'd worry about here are race conditions. > > If you have a porcelain-ish view (and I'd count "showing a web page" as > a porcelain view) that requires several commands to compute, it's > possible for there to be simultaneous ref updates between your commands. > If each command is given a refname, then the results may not be > consistent. > I don't know if this is how your current application-level code calling > ahead-behind works, or if it just accepts the possibility of a race (or > maybe the call is not presented along with other information so it's > sort-of atomic on its own). Presumably your double-checking will find > out. :) I completely agree on both of these points. The major lift in this series is that the two commit walk algorithms are being contributed to the core project in a way that are easy to modify our 'git ahead-behind' builtin to use the "new" internals without any UX change. Actually porting the application layer to use 'git for-each-ref' instead would be a second step, where I'd plan to do this deep dive. From what I understand, though, these race conditions do exist already, but they are minor relative to the cost of doing a lookup of all the ref values and then calling this backend. > I do otherwise like exposing this as an option of for-each-ref, as that > is the way I'd expect most normal client users to want to get at the > information. And if this is step 1 and that's good enough for now, and > we have a path forward to later expose it for general commits, that's OK > with me. And if we truly need more general committish inputs for tips (HEAD~10, too), then a new builtin could be built on top. Modeling it after for-each-ref (for-each-commit?) would be a good start to make the behavior as similar as possible. Doing that in full generality might require strange updates to ref-filter.[c|h], but we can cross that bridge when we come to it. Thanks, -Stolee
Jeff King <peff@peff.net> writes: > E.g., imagine resolving "main" to 1234abcd in step one, then somebody > updates it to 5678cdef, then you run "for-each-ref" to compute > ahead/behind, and now you show an inconsistent result: you say that > "main" points to 1234abcd, but show the wrong ahead/behind information. > > Showing 1234abcd at all is out-of-date, of course, but the real problem > is the lack of atomicity. Most porcelain scripts deal with this by > resolving the refs immediately, assuming object ids are immutable (which > they are modulo games like refs/replace), and then working with them. A really paranoid caller can use %(ahead-behind-detail:refs/heads/main) and get a report on refs/heads/topic, something that conveys refs/heads/topic (at 67f9f40d) is ahead by 2 commits and behind by 4 commits relative to refs/heads/main (at d7c3a768). in a machine readable form. And when the "ahead by 2 commits" disappears, we know 67f9f40d is merged to main sometime before d7c3a768. Then it can say "update-ref -d refs/heads/topic 67f9f40d" to avoid racing with simultanous updaters.