Message ID | patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | blame-tree: add library and tests via "test-tool blame-tree" | expand |
On 2/5/2023 3:47 PM, Ævar Arnfjörð Bjarmason wrote: I've been meaning to get to this all week, but today was my first chance. > The "blame-tree" library allows for finding the most recent > modification to paths in a tree. It does so by expanding the tree at a > given commit, taking note of the current state of each path, and then > walking backwards through history looking for commits where each path > changed into its final sha1. > > The "git blame-tree" command was first noted on the ML 2011[1], and > over the years there have been mentions of it[2][3], and whether it > could be upstreamed. The sources have been available at [4]'s > "jk/blame-tree-wip" and "jk/faster-blame-tree-wip" branches. > > This change attempts to start upstreaming this command, but rather > than adding a command whose interface & semantics may be controversial > starts by exposing & testing the core of its library through a new > test helper. One thing that I think is important for this work and the future we build on top of it is that we need to reconsider every fact about the existing contribution and make sure the new thing is built for the actual needs. In that vein, I think your use of a test-tool is a really good one. It can help us define an API boundary that could then be reflected in a CLI in independent review from improved algorithms underneath the API. > +struct blame_tree_options { > + struct object_id oid; > + const char *prefix; > + unsigned int recursive; > + struct strvec args; > +}; > +#define BLAME_TREE_OPTIONS_INIT(...) { \ > + .args = STRVEC_INIT, \ > + __VA_ARGS__ \ > +} > +void blame_tree_opts_release(struct blame_tree_options *bto); > + > +struct blame_tree { > + struct string_list paths; > + struct rev_info rev; > +}; > +#define BLAME_TREE_INIT { \ > + .paths = STRING_LIST_INIT_DUP, \ > + .rev = REV_INFO_INIT, \ > +} > + > +void blame_tree_init(struct blame_tree *bt, > + const struct blame_tree_options *opts); > +void blame_tree_release(struct blame_tree *); > + > +typedef void (*blame_tree_fn)(const char *path, const struct commit *commit, > + void *data); > +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data); However, this API is too leaky. Specifically, it allows passing arbitrary 'args' instead of structured options. Before I get into what I think needs to change here, let's look at the initial implementation: > +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata) > +{ > + struct blame_tree_callback_data data; > + > + data.paths = &bt->paths; > + data.num_interesting = bt->paths.nr; > + data.callback = cb; > + data.callback_data = cbdata; > + > + bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK; > + bt->rev.diffopt.format_callback = blame_diff; > + bt->rev.diffopt.format_callback_data = &data; > + > + prepare_revision_walk(&bt->rev); > + > + while (data.num_interesting) { > + data.commit = get_revision(&bt->rev); > + if (!data.commit) > + break; > + > + if (data.commit->object.flags & BOUNDARY) { > + diff_tree_oid(the_hash_algo->empty_tree, > + &data.commit->object.oid, > + "", &bt->rev.diffopt); > + diff_flush(&bt->rev.diffopt); > + } else { > + log_tree_commit(&bt->rev, data.commit); > + } > + } > + > + return 0; > +} When I think of 'blame-tree', I think of the following output: For each path (within a pathspec, recursive or not), output the first commit that changed that path according to simplified file history. This history walk begins at a single commit, but may be terminated by a negative commit range, so the output could indicate that we reached the boundary. With that definition, the most-obvious first implementation would be to first generate the path list, then run `git rev-list -n 1 <revisions> -- <path>` for each path in the pathspec. This could be done by N prepare_revision_walk()/get_revision() executions within a loop instead of actually executing subcommands. The implementation in this patch is simultaneously smarter than that basic approach, but also not smart enough for the best performance. In order to get the optimal performance, the following parts of my output definition are critical: 1. We use simplified file history. 2. The history walk begins at a single commit. If either of these conditions are broken, then the current-best algorithm cannot be used (and even more: our proactive caching mechanism cannot be used). The API as currently defined does not allow us to enforce that restriction because the arbitrary arguments could specify `--full-history`, or worse `--simplify-merge` (and also `--first-parent`) to modify the history mode or even `--all` to modify the number of starting commits. The version we use in our custom fork has the historical baggage of this open-ended argument-parsing, but because we have full control of the callers to the CLI, we have enforced that those arguments are not being used. While we are not currently reviewing the CLI of a builtin, the API layer is essentially providing a pass-through of arbitrary revision options. I am happy to see that the 'struct blame_tree_options' includes a callback for the output, as that is valuable for both reporting results to stdout or to collect results into a list for caching purposes, in the future where we have that ability. --- Recommended API --- Using 'struct blame_tree_options' instead of many parameters creates a good future-proof method prototype, so we can always _add_ options when explicitly understood as important to callers of one kind or another. However, to drop the arbitrary 'args' we need to instead make it very explicit which commits are being used in the history walk. struct blame_tree_options { struct object_id oid; const char *prefix; unsigned int recursive; struct commit *start; struct commit_list *uninteresting; }; This definition of the struct should be enough to demonstrate the behavior we are describing. However, for the v1 of the API, we may even want to completely remove the 'uninteresting' choice, and instead focus only on a single starting position ('start'). If we decide that 'uninteresting' is valuable, then it can be added again later, hopefully after the primary use of this command is established. Again, thinking about the lifetime of this command in our infrastructure, the commit range behavior was used at one point as a performance improvement where a cache was given for a specific starting commit B, then we dynamically computed the blame-tree for the range B..A when given a new commit A, and merged the two results together. However, this was not always correct due to complexities around parallel changes. A different caching mechanism was built into the low-level algorithm itself which resolves these complications, but also _that cache cannot be used when given range queries_. Further, I recommend building the simplest implementation first, since it is actually closer to the very fast implementation in that it has two parts: 1. Compute the list of paths at the tip. 2. Perform history walks for those paths. The fast implementation does a single history walk that essentially walks the simplified history for all of the paths simultaneously, but it is critical to have that list of paths at the start of that walk. In that way it's actually easier to adapt the simpler algorithm to the current-best algorithm than it is to adapt the smart algorithm from this patch to the current-best algorithm. All this is to say, that I'd like to see this API start with the smallest possible surface area and with the simplest implementation, and then I'd be happy to contribute those algorithms within the API boundary while the CLI is handled independently. Thanks, -Stolee
On Sun, Feb 05, 2023 at 09:47:03PM +0100, Ævar Arnfjörð Bjarmason wrote: > From: Jeff King <peff@peff.net> I appreciate being credited, of course, but at some point I think this becomes "Based-on-a-patch-by". And we may have crossed the line here. > The "blame-tree" library allows for finding the most recent > modification to paths in a tree. It does so by expanding the tree at a > given commit, taking note of the current state of each path, and then > walking backwards through history looking for commits where each path > changed into its final sha1. > > The "git blame-tree" command was first noted on the ML 2011[1], and > over the years there have been mentions of it[2][3], and whether it > could be upstreamed. The sources have been available at [4]'s > "jk/blame-tree-wip" and "jk/faster-blame-tree-wip" branches. Sort of. jk/blame-tree-wip probably matches what I sent to the list in 2011 (and that probably matches what was deployed at GitHub at the time). And like all of my branches, I continually rebase it on git.git's master branch. But like all branches in my repo with "-wip" in them, it is not part of my daily driver, and I generally do not even build it (and I would not be surprised if it does not build at all, or one of those rebases introduced a horrible bug). The jk/faster-blame-tree-wip was an experiment from 2014 to narrow the pathspec as we found answers. But it was never deployed anywhere, and likewise may or may not even build now. I think the general idea there is sound (make pathspec lookups faster by using a trie, and then narrow it), but I suspect it's not a full solution. In particular, I don't think it does anything clever with merges. Since we are narrowing the pathspec as we traverse, we can't rely on the usual pruning of side branches that happens via limit_list(), as that is all up-front. So it probably needs to look at each merge as we traverse and cull parents based on TREESAME. I believe GitHub later had patches to do that, but they didn't use the pathspec machinery (because without the tries, it becomes accidentally quadratic in the number of paths). I didn't work on those patches, though (Stolee and Taylor did), and they're not anywhere in my repository (and I no longer have access to the private github repo). > This change attempts to start upstreaming this command, but rather > than adding a command whose interface & semantics may be controversial > starts by exposing & testing the core of its library through a new > test helper. > > An eventual "git blame-tree" command, or e.g. a new format for "git > ls-tree" to show what a path "blames" to can then be implemented with > this library. OK. It's a little weird, as I do think the interface and semantics are the more interesting part, but this certainly isn't hurting anybody to go this route. > * Removing the "--max-depth" changes to the diff code. We'll need > those eventually, but it's not required for a blame of a given list > of paths. > > As has been noted in previous on-list discussions the semantics of > the "max-depth" changes might be controversial, so it's worthwhile > to split those out so that they can be reviewed separately. That's probably reasonable. The only two interesting "depths" are really "recurse" and "don't recurse" (where "recurse" is probably what you'd put in a user-facing tool, and "don't recurse" is what a site like GitHub uses to do the blame for a single level of tree it's showing). And that narrows the problem space quite a bit. > * Made the "blame-tree" helper take "--" before any revision options, > for clarity. An eventual built-in command (if any) probably doesn't > want to enforce this, but it makes it clearer in the test helper > what's an argument for "blame-tree" itself, and what's an argument for > the revision machinery. OK. Since this is just a test-helper, we don't care too much either way. > * Minor updates for using C99 syntax, and "size_t" instead of "int" > when we're iterating over types whose "nr" is that size. Reasonable. > * Avoid sub-shelling in the tests, use "test-tool -C .." instead. Yeah, the original code probably predates "-C". ;) > The range-diff here is to peff's jk/blame-tree-wip. As noted above > this is far from the full thing, but hopefully getting the basic bits > of the library (sans the max-depth question) will make the review of > subsequent bits easier. I doubt this range-diff is useful to anybody. I'm probably the person most likely to make sense of it, and it means nothing to me. It probably makes sense conceptually to just treat this as a new topic that happens to be based on older work. But if you did want to base it on something, you probably ought to do so on what GitHub is currently running in production (and again, talk to Taylor or Stolee for that). Unless the intent is to use this as a base for showing their changes. Though even then, I'm not sure the intermediate state is all that interesting. > [oodles of patch] TBH, I didn't really look at this closely. It's been a decade, so even for me reviewing this would basically be looking at it from scratch. And as my Git time is a bit limited these days, I can't really promise timely review of a big topic. -Peff
On Fri, Feb 10 2023, Derrick Stolee wrote: > On 2/5/2023 3:47 PM, Ævar Arnfjörð Bjarmason wrote: > [...] >> +struct blame_tree_options { >> + struct object_id oid; >> + const char *prefix; >> + unsigned int recursive; >> + struct strvec args; >> +}; >> +#define BLAME_TREE_OPTIONS_INIT(...) { \ >> + .args = STRVEC_INIT, \ >> + __VA_ARGS__ \ >> +} >> +void blame_tree_opts_release(struct blame_tree_options *bto); >> + >> +struct blame_tree { >> + struct string_list paths; >> + struct rev_info rev; >> +}; >> +#define BLAME_TREE_INIT { \ >> + .paths = STRING_LIST_INIT_DUP, \ >> + .rev = REV_INFO_INIT, \ >> +} >> + >> +void blame_tree_init(struct blame_tree *bt, >> + const struct blame_tree_options *opts); >> +void blame_tree_release(struct blame_tree *); >> + >> +typedef void (*blame_tree_fn)(const char *path, const struct commit *commit, >> + void *data); >> +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data); > > However, this API is too leaky. Specifically, it allows passing arbitrary > 'args' instead of structured options. > > Before I get into what I think needs to change here, let's look at the > initial implementation: > >> +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata) >> +{ >> + struct blame_tree_callback_data data; >> + >> + data.paths = &bt->paths; >> + data.num_interesting = bt->paths.nr; >> + data.callback = cb; >> + data.callback_data = cbdata; >> + >> + bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK; >> + bt->rev.diffopt.format_callback = blame_diff; >> + bt->rev.diffopt.format_callback_data = &data; >> + >> + prepare_revision_walk(&bt->rev); >> + >> + while (data.num_interesting) { >> + data.commit = get_revision(&bt->rev); >> + if (!data.commit) >> + break; >> + >> + if (data.commit->object.flags & BOUNDARY) { >> + diff_tree_oid(the_hash_algo->empty_tree, >> + &data.commit->object.oid, >> + "", &bt->rev.diffopt); >> + diff_flush(&bt->rev.diffopt); >> + } else { >> + log_tree_commit(&bt->rev, data.commit); >> + } >> + } >> + >> + return 0; >> +} > > When I think of 'blame-tree', I think of the following output: > > For each path (within a pathspec, recursive or not), output the first > commit that changed that path according to simplified file history. This > history walk begins at a single commit, but may be terminated by a > negative commit range, so the output could indicate that we reached the > boundary. > > With that definition, the most-obvious first implementation would be to > first generate the path list, then run > `git rev-list -n 1 <revisions> -- <path>` for each path in the pathspec. > This could be done by N prepare_revision_walk()/get_revision() executions > within a loop instead of actually executing subcommands. > > The implementation in this patch is simultaneously smarter than that basic > approach, but also not smart enough for the best performance. In order to > get the optimal performance, the following parts of my output definition > are critical: > > 1. We use simplified file history. > 2. The history walk begins at a single commit. > > If either of these conditions are broken, then the current-best algorithm > cannot be used (and even more: our proactive caching mechanism cannot be > used). > > The API as currently defined does not allow us to enforce that restriction > because the arbitrary arguments could specify `--full-history`, or worse > `--simplify-merge` (and also `--first-parent`) to modify the history mode > or even `--all` to modify the number of starting commits. There's already a limitation on --all in this patch, or anything else that would yield a many<->many rev<->path relationship, e.g. "--all" or "HEAD HEAD~", that's covered by the "can only blame one tree at a time" error condition. For --full-history v.s. --simplify-merge the output on git.git at least (maybe not on others?) would be the same: diff -u <(t/helper/test-tool -C "$PWD/t" blame-tree -- --full-history) <(t/helper/test-tool -C "$PWD/t" blame-tree -- --simplify-merge) But of course for --first-parent it's not, but more on that below. > The version we use in our custom fork has the historical baggage of this > open-ended argument-parsing, but because we have full control of the > callers to the CLI, we have enforced that those arguments are not being > used. While we are not currently reviewing the CLI of a builtin, the API > layer is essentially providing a pass-through of arbitrary revision options. > > I am happy to see that the 'struct blame_tree_options' includes a callback > for the output, as that is valuable for both reporting results to stdout > or to collect results into a list for caching purposes, in the future > where we have that ability. > > --- Recommended API --- > > Using 'struct blame_tree_options' instead of many parameters creates a > good future-proof method prototype, so we can always _add_ options when > explicitly understood as important to callers of one kind or another. > > However, to drop the arbitrary 'args' we need to instead make it very > explicit which commits are being used in the history walk. > > struct blame_tree_options { > struct object_id oid; > const char *prefix; > unsigned int recursive; > struct commit *start; > struct commit_list *uninteresting; > }; > > This definition of the struct should be enough to demonstrate the behavior > we are describing. However, for the v1 of the API, we may even want to > completely remove the 'uninteresting' choice, and instead focus only on a > single starting position ('start'). If we decide that 'uninteresting' is > valuable, then it can be added again later, hopefully after the primary > use of this command is established. > > Again, thinking about the lifetime of this command in our infrastructure, > the commit range behavior was used at one point as a performance improvement > where a cache was given for a specific starting commit B, then we > dynamically computed the blame-tree for the range B..A when given a new > commit A, and merged the two results together. However, this was not always > correct due to complexities around parallel changes. A different caching > mechanism was built into the low-level algorithm itself which resolves > these complications, but also _that cache cannot be used when given range > queries_. > > Further, I recommend building the simplest implementation first, since it > is actually closer to the very fast implementation in that it has two > parts: > > 1. Compute the list of paths at the tip. > 2. Perform history walks for those paths. > > The fast implementation does a single history walk that essentially walks > the simplified history for all of the paths simultaneously, but it is > critical to have that list of paths at the start of that walk. In that way > it's actually easier to adapt the simpler algorithm to the current-best > algorithm than it is to adapt the smart algorithm from this patch to the > current-best algorithm. > > All this is to say, that I'd like to see this API start with the smallest > possible surface area and with the simplest implementation, and then I'd > be happy to contribute those algorithms within the API boundary while the > CLI is handled independently. I hear your concern about leaving this open for optimization, and in general I'd vehemently agree with it, except for needing to eventually feed a command-line to setup_revisions(). Ideally the revision API would make what you're describing easy, but the way it's currently implemented (and changing it would be a much larger project) someone who'd like to pass structured options in the way you'd describe will end up having to re-implement bug-for-bug compatible versions of some subset of the option parsing in revision.c. Isn't a way to get the best of both worlds to have a small snippet of code that inspects the "struct rev_info" before & after setup_revisions(), and which would only implement certain optimizations if certain known options are provided, but not if any unknown ones are? That way those who'd like the faster happy path could use that subset of options, while the general API would allow any revision options. We'd then error() or BUG() out only if we fail to map our expected paths to OIDs. Specifically, I'm thinking of something similar to what match_stat_with_submodule() in diff-lib.c now does with "orig_flags". I also think as a practical matter it's acceptable for us to leave the underlying low-level API open-ended for potential (ab)use, while any porcelain or plumbing tool for this would document that a given narrow subset of revision options is what we expect. That's what we do for format-patch, and e.g. as of my df569c3f31f (range-diff doc: add a section about output stability, 2018-11-09) range-diff explicitly documents that feeding it options it won't expect might result in nonsense output. I think those are all good ways forward here, and I'd much prefer those to having to re-implement or pull out subsets of the current option parsing logic in revision.c. What do you think?
On 3/7/2023 8:56 AM, Ævar Arnfjörð Bjarmason wrote: > > On Fri, Feb 10 2023, Derrick Stolee wrote: >> All this is to say, that I'd like to see this API start with the smallest >> possible surface area and with the simplest implementation, and then I'd >> be happy to contribute those algorithms within the API boundary while the >> CLI is handled independently. > > I hear your concern about leaving this open for optimization, and in > general I'd vehemently agree with it, except for needing to eventually > feed a command-line to setup_revisions(). The most-correct way to build this, with full optimizations, does not involve revisions.c at all, so this "eventually" is incorrect. It's only something to do for the "first" implementation, as a reference. In order to do the single-walk approach for every path simultaneously, we _must_ have full control of the commit walk. There was a time where we had done a single-walk approach by letting the revision machinery walk all commits that changed the base tree, then looked for changes to the contained paths. However, this results in _incorrect_ results because commits that would normally be ignored by the simplified history walk for "<dir>/<entry>" were not ignored by the simplified history walk for "<dir>/" and thus that algorithm presented _incorrect results_. For that reason, doing a single walk that outputs the blame-tree results for each path must have full control over which commits are walked and which paths could emit a change for those commits. This means we must not use revision.c as a base for full control. > Ideally the revision API would make what you're describing easy, but the > way it's currently implemented (and changing it would be a much larger > project) someone who'd like to pass structured options in the way you'd > describe will end up having to re-implement bug-for-bug compatible > versions of some subset of the option parsing in revision.c. The subset of option parsing is "a starting revision" and "a base tree" and _perhaps_ "is the diff recursive or not?" (and this last one isn't even in revision.c yet). That does not seem like using revision.c's parsing is actually helpful at all. > Isn't a way to get the best of both worlds to have a small snippet of > code that inspects the "struct rev_info" before & after > setup_revisions(), and which would only implement certain optimizations > if certain known options are provided, but not if any unknown ones are? > > That way those who'd like the faster happy path could use that subset of > options, while the general API would allow any revision options. We'd > then error() or BUG() out only if we fail to map our expected paths to > OIDs. This option requires examining the long and ever-growing list of options to struct rev_info which will take much more work than parsing a starting ref and a path from the command-line. > I think those are all good ways forward here, and I'd much prefer those > to having to re-implement or pull out subsets of the current option > parsing logic in revision.c. What do you think? I think you are skirting over the difficult part about upstreaming the blame-tree command, which is the biggest reason we have not done it in the past. The way it is implemented in our fork started with this "just parse args using revision.c" because that's the easiest way to implement the naive implementation, but we were able to make optimizations on top only because we had full control over the callers not using any other options. We would not have been able to make the assumptions that allowed those performance enhancements without that control. Actually building the interface in a way that guarantees the behavior will be stable and understood is not easy, but is worth doing well. Thanks, -Stolee
On Tue, Mar 07, 2023 at 02:56:29PM +0100, Ævar Arnfjörð Bjarmason wrote: > I hear your concern about leaving this open for optimization, and in > general I'd vehemently agree with it, except for needing to eventually > feed a command-line to setup_revisions(). > > Ideally the revision API would make what you're describing easy, but the > way it's currently implemented (and changing it would be a much larger > project) someone who'd like to pass structured options in the way you'd > describe will end up having to re-implement bug-for-bug compatible > versions of some subset of the option parsing in revision.c. I get what you are both saying here, but I think I find myself tending to agree with Ævar a little bit more here. In an ideal world, sure, having the blame-tree API take a single struct called 'blame_tree_options' would be very clean. But the crux is that we have to pass some arguments to setup_revisions(), and that our problems here stem from the leakiness of *that* API, not this one. I ran into a similar problem when looking at rewriting the bitmap traversal code a year or so ago (which is sadly still on my to-do list). Without getting into the details, part of that work involved calling limit_list() as a function of setup_revisions() to discover the traversal boundary. And if the caller happened to put --topo-order in their command-line arguments, we wouldn't end up calling limit_list() at all, since (as Stolee well knows ;-)) those two code paths are quite different. I can't recall if I either detected if '--topo-order' was passed (by looking to see if `revs.topo_order` was set), or grafted an extra `--no-topo-order` argument onto the end of the list. Either way, I think those two problems are more or less equivalent in this context, and that it seemed like a much more expedient solution to work around the fundamental leakiness of the setup_revision() API rather than refactor it. > Isn't a way to get the best of both worlds to have a small snippet of > code that inspects the "struct rev_info" before & after > setup_revisions(), and which would only implement certain optimizations > if certain known options are provided, but not if any unknown ones are? Yeah, I think this is basically the same as my "let's check if the caller passed `--topo-order` by checking the `revs.topo_order` bit" above. > I think those are all good ways forward here, and I'd much prefer those > to having to re-implement or pull out subsets of the current option > parsing logic in revision.c. What do you think? Same :-). Thanks, Taylor
diff --git a/Makefile b/Makefile index 45bd6ac9c3e..ed82dfee8e7 100644 --- a/Makefile +++ b/Makefile @@ -786,6 +786,7 @@ PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS)) TEST_BUILTINS_OBJS += test-advise.o TEST_BUILTINS_OBJS += test-bitmap.o +TEST_BUILTINS_OBJS += test-blame-tree.o TEST_BUILTINS_OBJS += test-bloom.o TEST_BUILTINS_OBJS += test-bundle-uri.o TEST_BUILTINS_OBJS += test-cache-tree.o @@ -974,6 +975,7 @@ LIB_OBJS += archive.o LIB_OBJS += attr.o LIB_OBJS += base85.o LIB_OBJS += bisect.o +LIB_OBJS += blame-tree.o LIB_OBJS += blame.o LIB_OBJS += blob.o LIB_OBJS += bloom.o diff --git a/blame-tree.c b/blame-tree.c new file mode 100644 index 00000000000..dc739b23bff --- /dev/null +++ b/blame-tree.c @@ -0,0 +1,195 @@ +#include "cache.h" +#include "blame-tree.h" +#include "strvec.h" +#include "hash.h" +#include "commit.h" +#include "diffcore.h" +#include "diff.h" +#include "object.h" +#include "revision.h" +#include "log-tree.h" + +void blame_tree_opts_release(struct blame_tree_options *bto) +{ + strvec_clear(&bto->args); +} + +struct blame_tree_entry { + struct object_id oid; + struct commit *commit; +}; + +static void add_from_diff(struct diff_queue_struct *q, + struct diff_options *opt, void *data) +{ + struct blame_tree *bt = data; + + for (size_t i = 0; i < q->nr; i++) { + struct diff_filepair *p = q->queue[i]; + struct blame_tree_entry *ent = xcalloc(1, sizeof(*ent)); + + oidcpy(&ent->oid, &p->two->oid); + string_list_append(&bt->paths, p->two->path)->util = ent; + } +} + +static int add_from_revs(struct blame_tree *bt) +{ + size_t count = 0; + struct diff_options diffopt; + + memcpy(&diffopt, &bt->rev.diffopt, sizeof(diffopt)); + diffopt.output_format = DIFF_FORMAT_CALLBACK; + diffopt.format_callback = add_from_diff; + diffopt.format_callback_data = bt; + diffopt.no_free = 1; + + for (size_t i = 0; i < bt->rev.pending.nr; i++) { + struct object_array_entry *obj = bt->rev.pending.objects + i; + + if (obj->item->flags & UNINTERESTING) + continue; + + if (count++) + return error(_("can only blame one tree at a time")); + diff_tree_oid(the_hash_algo->empty_tree, &obj->item->oid, "", + &diffopt); + diff_flush(&diffopt); + } + + string_list_sort(&bt->paths); + return 0; +} + +void blame_tree_init(struct blame_tree *bt, + const struct blame_tree_options *opts) +{ + init_revisions(&bt->rev, opts->prefix); + bt->rev.def = oid_to_hex(&opts->oid); + bt->rev.combine_merges = 1; + bt->rev.show_root_diff = 1; + bt->rev.boundary = 1; + bt->rev.no_commit_id = 1; + bt->rev.diff = 1; + bt->rev.diffopt.flags.recursive = opts->recursive; + setup_revisions(opts->args.nr, opts->args.v, &bt->rev, NULL); + + if (add_from_revs(bt) < 0) + die(_("unable to setup blame-tree")); +} + +void blame_tree_release(struct blame_tree *bt) +{ + string_list_clear(&bt->paths, 1); + release_revisions(&bt->rev); +} + +struct blame_tree_callback_data { + struct commit *commit; + struct string_list *paths; + size_t num_interesting; + + blame_tree_fn callback; + void *callback_data; +}; + +static void mark_path(const char *path, const struct object_id *oid, + struct blame_tree_callback_data *data) +{ + struct string_list_item *item = string_list_lookup(data->paths, path); + struct blame_tree_entry *ent; + + /* Is it even a path that exists in our tree? */ + if (!item) + return; + + /* Have we already blamed a commit? */ + ent = item->util; + if (ent->commit) + return; + + /* + * Is it arriving at a version of interest, or is it from a side branch + * which did not contribute to the final state? + */ + if (!oideq(oid, &ent->oid)) + return; + + ent->commit = data->commit; + data->num_interesting--; + if (data->callback) + data->callback(path, data->commit, data->callback_data); +} + +static void blame_diff(struct diff_queue_struct *q, + struct diff_options *opt, void *cbdata) +{ + struct blame_tree_callback_data *data = cbdata; + + for (size_t i = 0; i < q->nr; i++) { + struct diff_filepair *p = q->queue[i]; + switch (p->status) { + case DIFF_STATUS_DELETED: + /* + * There's no point in feeding a deletion, as it could + * not have resulted in our current state, which + * actually has the file. + */ + break; + + default: + /* + * Otherwise, we care only that we somehow arrived at + * a final path/sha1 state. Note that this covers some + * potentially controversial areas, including: + * + * 1. A rename or copy will be blamed, as it is the + * first time the content has arrived at the given + * path. + * + * 2. Even a non-content modification like a mode or + * type change will trigger it. + * + * We take the inclusive approach for now, and blame + * anything which impacts the path. Options to tweak + * the behavior (e.g., to "--follow" the content across + * renames) can come later. + */ + mark_path(p->two->path, &p->two->oid, data); + break; + } + } +} + +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *cbdata) +{ + struct blame_tree_callback_data data; + + data.paths = &bt->paths; + data.num_interesting = bt->paths.nr; + data.callback = cb; + data.callback_data = cbdata; + + bt->rev.diffopt.output_format = DIFF_FORMAT_CALLBACK; + bt->rev.diffopt.format_callback = blame_diff; + bt->rev.diffopt.format_callback_data = &data; + + prepare_revision_walk(&bt->rev); + + while (data.num_interesting) { + data.commit = get_revision(&bt->rev); + if (!data.commit) + break; + + if (data.commit->object.flags & BOUNDARY) { + diff_tree_oid(the_hash_algo->empty_tree, + &data.commit->object.oid, + "", &bt->rev.diffopt); + diff_flush(&bt->rev.diffopt); + } else { + log_tree_commit(&bt->rev, data.commit); + } + } + + return 0; +} diff --git a/blame-tree.h b/blame-tree.h new file mode 100644 index 00000000000..fb4ba1004e7 --- /dev/null +++ b/blame-tree.h @@ -0,0 +1,39 @@ +#ifndef BLAME_TREE_H +#define BLAME_TREE_H + +#include "hash.h" +#include "strvec.h" +#include "string-list.h" +#include "revision.h" +#include "commit.h" + +struct blame_tree_options { + struct object_id oid; + const char *prefix; + unsigned int recursive; + struct strvec args; +}; +#define BLAME_TREE_OPTIONS_INIT(...) { \ + .args = STRVEC_INIT, \ + __VA_ARGS__ \ +} +void blame_tree_opts_release(struct blame_tree_options *bto); + +struct blame_tree { + struct string_list paths; + struct rev_info rev; +}; +#define BLAME_TREE_INIT { \ + .paths = STRING_LIST_INIT_DUP, \ + .rev = REV_INFO_INIT, \ +} + +void blame_tree_init(struct blame_tree *bt, + const struct blame_tree_options *opts); +void blame_tree_release(struct blame_tree *); + +typedef void (*blame_tree_fn)(const char *path, const struct commit *commit, + void *data); +int blame_tree_run(struct blame_tree *bt, blame_tree_fn cb, void *data); + +#endif /* BLAME_TREE_H */ diff --git a/t/helper/test-blame-tree.c b/t/helper/test-blame-tree.c new file mode 100644 index 00000000000..1ea5ff783af --- /dev/null +++ b/t/helper/test-blame-tree.c @@ -0,0 +1,62 @@ +#include "test-tool.h" +#include "blame-tree.h" +#include "quote.h" +#include "config.h" +#include "parse-options.h" + +static void show_entry(const char *path, const struct commit *commit, void *d) +{ + struct blame_tree *bt = d; + + if (commit->object.flags & BOUNDARY) + putchar('^'); + printf("%s\t", oid_to_hex(&commit->object.oid)); + + write_name_quoted(path, stdout, bt->rev.diffopt.line_termination); + + fflush(stdout); +} + +int cmd__blame_tree(int argc, const char **argv) +{ + const char *prefix = setup_git_directory(); + struct blame_tree_options opts = BLAME_TREE_OPTIONS_INIT( + .prefix = prefix, + .recursive = 1, + ); + struct blame_tree bt = BLAME_TREE_INIT; + struct option options[] = { + OPT_BOOL(0, "recursive", &opts.recursive, + "recurse into to subtrees"), + OPT_END() + }; + static char const* const usage[] = { + "test-tool blame-tree [--no-recursive] [-- <rev-opts>]", + NULL + }; + + argc = parse_options(argc, argv, prefix, options, usage, + PARSE_OPT_KEEP_DASHDASH); + + if (argc) { + if (strcmp(argv[0], "--") && + strcmp(argv[0], "--end-of-options")) + usage_msg_opt("need -- before <rev-opts>", usage, options); + argc--; + argv++; + } + + if (repo_get_oid(the_repository, "HEAD", &opts.oid)) + die("unable to get HEAD"); + + strvec_push(&opts.args, "blame-tree"); + if (argc) + strvec_pushv(&opts.args, argv); + blame_tree_init(&bt, &opts); + if (blame_tree_run(&bt, show_entry, &bt) < 0) + die("error running blame-tree traversal"); + blame_tree_release(&bt); + blame_tree_opts_release(&opts); + + return 0; +} diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c index abe8a785eb6..8f34ff18d6f 100644 --- a/t/helper/test-tool.c +++ b/t/helper/test-tool.c @@ -12,6 +12,7 @@ static const char * const test_tool_usage[] = { static struct test_cmd cmds[] = { { "advise", cmd__advise_if_enabled }, { "bitmap", cmd__bitmap }, + { "blame-tree", cmd__blame_tree }, { "bloom", cmd__bloom }, { "bundle-uri", cmd__bundle_uri }, { "cache-tree", cmd__cache_tree }, diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h index ea2672436c9..745a2097000 100644 --- a/t/helper/test-tool.h +++ b/t/helper/test-tool.h @@ -5,6 +5,7 @@ int cmd__advise_if_enabled(int argc, const char **argv); int cmd__bitmap(int argc, const char **argv); +int cmd__blame_tree(int argc, const char **argv); int cmd__bloom(int argc, const char **argv); int cmd__bundle_uri(int argc, const char **argv); int cmd__cache_tree(int argc, const char **argv); diff --git a/t/t8020-blame-tree-lib.sh b/t/t8020-blame-tree-lib.sh new file mode 100755 index 00000000000..4213c25de8b --- /dev/null +++ b/t/t8020-blame-tree-lib.sh @@ -0,0 +1,138 @@ +#!/bin/sh + +test_description='basic blame-tree library tests' + +. ./test-lib.sh + +test_expect_success 'setup' ' + test_commit 1 file && + mkdir a && + test_commit 2 a/file && + mkdir a/b && + test_commit 3 a/b/file +' + +test_expect_success 'cannot blame two trees' ' + test_must_fail test-tool blame-tree HEAD HEAD~1 +' + +check_blame() { + local indir= && + while test $# != 0 + do + case "$1" in + -C) + indir="$2" + shift + ;; + *) + break + ;; + esac && + shift + done && + + cat >expect && + test_when_finished "rm -f tmp.*" && + test-tool ${indir:+-C "$indir"} \ + blame-tree "$@" >tmp.1 && + git name-rev --annotate-stdin --name-only --tags \ + <tmp.1 >tmp.2 && + tr '\t' ' ' <tmp.2 >tmp.3 && + sort tmp.3 >actual && + test_cmp expect actual +} + +test_expect_success 'blame recursive' ' + check_blame <<-\EOF + 1 file + 2 a/file + 3 a/b/file + EOF +' + +test_expect_success 'blame root' ' + check_blame --no-recursive <<-\EOF + 1 file + 3 a + EOF +' + +test_expect_success 'blame subdir' ' + check_blame -- a <<-\EOF + 2 a/file + 3 a/b/file + EOF +' + +test_expect_success 'blame from non-HEAD commit' ' + check_blame --no-recursive -- HEAD^ <<-\EOF + 1 file + 2 a + EOF +' + +test_expect_success 'blame from subdir defaults to root' ' + check_blame -C a --no-recursive <<-\EOF + 1 file + 3 a + EOF +' + +test_expect_success 'blame from subdir uses relative pathspecs' ' + check_blame -C a -- b <<-\EOF + 3 a/b/file + EOF +' + +test_expect_success 'limit blame traversal by count' ' + check_blame --no-recursive -- -1 <<-\EOF + 3 a + ^2 file + EOF +' + +test_expect_success 'limit blame traversal by commit' ' + check_blame --no-recursive -- HEAD~2..HEAD <<-\EOF + 3 a + ^1 file + EOF +' + +test_expect_success 'only blame files in the current tree' ' + git rm -rf a && + git commit -m "remove a" && + check_blame <<-\EOF + 1 file + EOF +' + +test_expect_success 'cross merge boundaries in blaming' ' + git checkout HEAD^0 && + git rm -rf . && + test_commit m1 && + git checkout HEAD^ && + git rm -rf . && + test_commit m2 && + git merge m1 && + check_blame <<-\EOF + m1 m1.t + m2 m2.t + EOF +' + +test_expect_success 'blame merge for resolved conflicts' ' + git checkout HEAD^0 && + git rm -rf . && + test_commit c1 conflict && + git checkout HEAD^ && + git rm -rf . && + test_commit c2 conflict && + test_must_fail git merge c1 && + test_commit resolved conflict && + check_blame -- conflict <<-\EOF + resolved conflict + EOF +' + +test_done