mbox series

[0/8] Introduce git-blame-tree(1) command

Message ID 20250326-toon-blame-tree-v1-0-4173133f3786@iotcl.com (mailing list archive)
Headers show
Series Introduce git-blame-tree(1) command | expand

Message

Toon Claes March 26, 2025, 8:18 p.m. UTC
This is yet another attempt to upstream the builtin command
`git-blame-tree(1)`. This command is similar to git-blame(1) and shows
the most recent modification to paths in a tree..

The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
series was based of patches by Peff written in 2011[2].

In contrary to Ævar's attempt, my series implements the
`git-blame-tree(1)` standalone command. Ævar suggested to only implement
the blame-tree subsystem, and test it through test-tool. But since the
command line options of that test-tool subcommand were very similar to
the standalone command we want to land, I decided transfer the test-tool
subcommand into a real Git builtin. Also, nowadays if we want to test
the blame-tree subsystem independently, we should use clar unit tests
instead of test-tool.

This series brings back the --max-depth changes. I know Ævar removed
them because there was controversy, but I think they are relevant to
include. And I want to open up the conversation again.

Also extra patches from Peff are included to implement pathspec tries.
I've taken them from Peff's fork on GitHub[3], but they also have been
posted to this mailing list[4].

Other improvements I've made:

* Rebase onto current master.

* Remove the use of USE_THE_REPOSITORY_VARIABLE.

* Fix various memory leaks.

* Moved code from blame-tree.c to pathspec.c, as suggested in Peff's
  last patch on GitHub.

* Updated the benchmark results in the last commit message, although the
  numbers were roughly the same.

I don't expect this code to be deemed ready to merge, but I mainly want
to gather as much feedback as possible and use it to iterate toward
actually getting it merged.

There are some things I know that are missing, for example
`--porcelain`, but those can be added in a follow-up. At the moment
there's one test marked `test_expect_failure`, but I was not able yet
identify that bug.

So, let me know what you think?

--
Toon

[1]: https://lore.kernel.org/git/patch-1.1-0ea849d900b-20230205T204104Z-avarab@gmail.com/
[2]: https://lore.kernel.org/git/20110302164031.GA18233@sigill.intra.peff.net/
[3]: https://github.com/peff/git/tree/jk/faster-blame-tree-wip
[4]: https://lore.kernel.org/git/YN4zKVK7gvuIZ0vK@coredump.intra.peff.net/

---
Jeff King (7):
      combine-diff: zero memory used for callback filepairs
      within_depth: fix return for empty path
      diff: teach tree-diff a max-depth parameter
      pathspec: add optional trie index
      pathspec: turn on tries when appropriate
      tree-diff: use pathspec tries
      blame-tree: narrow pathspec as we traverse

Toon Claes (1):
      blame-tree: introduce new subcommand to blame files

 .gitignore                      |   1 +
 Documentation/diff-options.adoc |  26 +++++
 Makefile                        |   3 +
 blame-tree.c                    | 216 ++++++++++++++++++++++++++++++++++++++++
 blame-tree.h                    |  43 ++++++++
 builtin.h                       |   1 +
 builtin/blame-tree.c            |  67 +++++++++++++
 combine-diff.c                  |   2 +-
 diff-lib.c                      |   5 +
 diff.c                          |  18 ++++
 diff.h                          |   9 ++
 dir.c                           |   2 +-
 git.c                           |   1 +
 meson.build                     |   2 +
 pathspec.c                      | 216 ++++++++++++++++++++++++++++++++++++++++
 pathspec.h                      |  27 +++++
 t/helper/meson.build            |   1 +
 t/helper/test-pathspec.c        |  96 ++++++++++++++++++
 t/helper/test-tool.c            |   1 +
 t/helper/test-tool.h            |   2 +
 t/meson.build                   |   3 +
 t/perf/p4003-diff-pathspec.sh   |  26 +++++
 t/t4071-diff-max-depth.sh       | 109 ++++++++++++++++++++
 t/t6137-pathspec-trie.sh        |  57 +++++++++++
 t/t8020-blame-tree.sh           | 142 ++++++++++++++++++++++++++
 tree-diff.c                     | 176 +++++++++++++++++++++++++++++---
 26 files changed, 1235 insertions(+), 17 deletions(-)
---



---

base-commit: 683c54c999c301c2cd6f715c411407c413b1d84e
change-id: 20250326-toon-blame-tree-2ca74d7712ac

Thanks
--
Toon

Comments

Taylor Blau March 26, 2025, 8:38 p.m. UTC | #1
On Wed, Mar 26, 2025 at 09:18:24PM +0100, Toon Claes wrote:
> This is yet another attempt to upstream the builtin command
> `git-blame-tree(1)`. This command is similar to git-blame(1) and shows
> the most recent modification to paths in a tree..
>
> The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
> series was based of patches by Peff written in 2011[2].

For what it's worth, the blame-tree implementation that this came from
has evolved significantly since it was originally written in 2011. Most
recently Stolee and I worked on a version that uses changed-path Bloom
filters to narrow the search, passing un-blamed paths to their parents
at each level of the traversal.

I wonder if it would be easier to start from scratch with the modern
implementation rather than land this one and try to build on top of it.

Thanks,
Taylor
Jeff King March 27, 2025, 6:32 a.m. UTC | #2
On Wed, Mar 26, 2025 at 04:38:59PM -0400, Taylor Blau wrote:

> On Wed, Mar 26, 2025 at 09:18:24PM +0100, Toon Claes wrote:
> > This is yet another attempt to upstream the builtin command
> > `git-blame-tree(1)`. This command is similar to git-blame(1) and shows
> > the most recent modification to paths in a tree..
> >
> > The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
> > series was based of patches by Peff written in 2011[2].
> 
> For what it's worth, the blame-tree implementation that this came from
> has evolved significantly since it was originally written in 2011. Most
> recently Stolee and I worked on a version that uses changed-path Bloom
> filters to narrow the search, passing un-blamed paths to their parents
> at each level of the traversal.
> 
> I wonder if it would be easier to start from scratch with the modern
> implementation rather than land this one and try to build on top of it.

Yeah, I'd suggest starting with that work from you and Stolee (though I
do not know if it was ever made public?). It should be much faster and
will have been battle-tested in production.

The pathspec-trie stuff is, I think, still a reasonable idea for general
use. But IIRC, the rewritten blame-tree you guys worked on does not
benefit from it, because it ditches pathspecs entirely (both because
they're too slow without the tries, but also because it's important to
continually narrow the pathspec while traversing). That trie code was
never run in production, I think (and I see there is a patch to narrow
the pathspec while traversing; I suspect that likewise was never used).

The max-depth diff code is also in theory a reasonable thing to have in
general. But it is awkward to use, and not really necessary for
blame-tree. There we really only care about recursing vs not recursing,
but the usual "recursive" flag for diffing isn't enough (we have to
recurse down to the tree of interest, but may not want to go further). I
don't remember how that is handled in your blame-tree rewrites.

So that really mostly leaves the blame-tree scaffolding itself. I
remember Junio left a lot of good comments on the original thread on how
merges should be handled, but I don't think I ever fixed those bits. I
don't recall what your rewritten code does there, but I think it may
have improved things.

So yeah. I don't know if all of this is really a very good starting
point. Taylor, if you can share the current code that GitHub is running,
I think that would be beneficial for the community.

-Peff
Derrick Stolee March 27, 2025, 12:04 p.m. UTC | #3
On 3/26/2025 4:38 PM, Taylor Blau wrote:
> On Wed, Mar 26, 2025 at 09:18:24PM +0100, Toon Claes wrote:
>> This is yet another attempt to upstream the builtin command
>> `git-blame-tree(1)`. This command is similar to git-blame(1) and shows
>> the most recent modification to paths in a tree..
>>
>> The last attempt (I'm aware of) was made by Ævar in 2023[1]. That
>> series was based of patches by Peff written in 2011[2].
> 
> For what it's worth, the blame-tree implementation that this came from
> has evolved significantly since it was originally written in 2011. Most
> recently Stolee and I worked on a version that uses changed-path Bloom
> filters to narrow the search, passing un-blamed paths to their parents
> at each level of the traversal.

It's worth mentioning that the underlying algorithm was nearly rebuilt
from scratch with this "passing un-blamed paths to their parents" aspect,
which unlocked other features such as caching results to be used by
future queries, even when the tip branch advances.

With that in mind, using the 2011 version is unlikely to be valuable.

Taylor: do you have a drop of the latest blame-tree implementation that
could be shared with Toon?

Thanks,
-Stolee
Taylor Blau March 27, 2025, 9:58 p.m. UTC | #4
On Thu, Mar 27, 2025 at 02:32:43AM -0400, Jeff King wrote:
> The pathspec-trie stuff is, I think, still a reasonable idea for general
> use. But IIRC, the rewritten blame-tree you guys worked on does not
> benefit from it, because it ditches pathspecs entirely (both because
> they're too slow without the tries, but also because it's important to
> continually narrow the pathspec while traversing). That trie code was
> never run in production, I think (and I see there is a patch to narrow
> the pathspec while traversing; I suspect that likewise was never used).

Yeah, the rewritten blame-tree code uses changed-path Bloom filters to
narrow the set of revisions that we need to actually compute tree-diffs
for.

The general idea is that we have a set of paths that we have yet to
blame, and those are the "interesting" ones. IOW, if a changed-path
Bloom filter tells us that we are at some revision where there is maybe
a change to one or more unblamed paths, we need to compute a tree-diff.
But if the Bloom filter says "no", then we can skip the tree-diff at
that layer entirely.

> The max-depth diff code is also in theory a reasonable thing to have in
> general. But it is awkward to use, and not really necessary for
> blame-tree. There we really only care about recursing vs not recursing,
> but the usual "recursive" flag for diffing isn't enough (we have to
> recurse down to the tree of interest, but may not want to go further). I
> don't remember how that is handled in your blame-tree rewrites.

I think that's mostly true, but the blame-tree caching stuff that Stolee
worked on many years ago and mentioned below does require it IIRC.

> So yeah. I don't know if all of this is really a very good starting
> point. Taylor, if you can share the current code that GitHub is running,
> I think that would be beneficial for the community.

Sure. You can fetch from the 'tb/blame-tree' branch from my tree (which
is located at 'git@github.com:ttaylorr/git.git'). I owe a huge "thank
you" to Victoria Dye, who split out the various topics from GitHub's
fork into individual rebased branches.

There were many more patches on top that came after Victoria's split
above, and I applied those manually. The commit structure probably needs
significant clean-up and polishing before it's ready for serious review,
since this is more-or-less a raw dump of the work on GitHub's side over
more than a decade.

It also doesn't pass the tests in t9932 (and the test number should
probably also be reworked, it's in the t99xx range so that inclusion in
GitHub's fork doesn't cause collisions with new tests when we merge
upstream). To that end, I removed everybody's Signed-off-by in case I
have mangled their work in some way unintentionally.

Thanks,
Taylor
Toon Claes March 28, 2025, 1:27 p.m. UTC | #5
Taylor Blau <me@ttaylorr.com> writes:

> Sure. You can fetch from the 'tb/blame-tree' branch from my tree (which
> is located at 'git@github.com:ttaylorr/git.git'). I owe a huge "thank
> you" to Victoria Dye, who split out the various topics from GitHub's
> fork into individual rebased branches.
>
> There were many more patches on top that came after Victoria's split
> above, and I applied those manually. The commit structure probably needs
> significant clean-up and polishing before it's ready for serious review,
> since this is more-or-less a raw dump of the work on GitHub's side over
> more than a decade.
>
> It also doesn't pass the tests in t9932 (and the test number should
> probably also be reworked, it's in the t99xx range so that inclusion in
> GitHub's fork doesn't cause collisions with new tests when we merge
> upstream). To that end, I removed everybody's Signed-off-by in case I
> have mangled their work in some way unintentionally.

Thanks for sharing. I will check it out.
Jeff King April 1, 2025, 9:29 a.m. UTC | #6
On Thu, Mar 27, 2025 at 05:58:19PM -0400, Taylor Blau wrote:

> On Thu, Mar 27, 2025 at 02:32:43AM -0400, Jeff King wrote:
> > The pathspec-trie stuff is, I think, still a reasonable idea for general
> > use. But IIRC, the rewritten blame-tree you guys worked on does not
> > benefit from it, because it ditches pathspecs entirely (both because
> > they're too slow without the tries, but also because it's important to
> > continually narrow the pathspec while traversing). That trie code was
> > never run in production, I think (and I see there is a patch to narrow
> > the pathspec while traversing; I suspect that likewise was never used).
> 
> Yeah, the rewritten blame-tree code uses changed-path Bloom filters to
> narrow the set of revisions that we need to actually compute tree-diffs
> for.
> 
> The general idea is that we have a set of paths that we have yet to
> blame, and those are the "interesting" ones. IOW, if a changed-path
> Bloom filter tells us that we are at some revision where there is maybe
> a change to one or more unblamed paths, we need to compute a tree-diff.
> But if the Bloom filter says "no", then we can skip the tree-diff at
> that layer entirely.

You'd still in theory benefit from the tree-diffs you _do_ run using a
continually narrowing pathspec. Skimming over the code from your
tb/blame-tree branch, it looks like it's just fed the original pathspec.
But that's probably good enough in practice. Especially for
non-recursive blame-trees, where pruning already-matched entries will
never save you from opening another tree anyway.

> > So yeah. I don't know if all of this is really a very good starting
> > point. Taylor, if you can share the current code that GitHub is running,
> > I think that would be beneficial for the community.
> 
> Sure. You can fetch from the 'tb/blame-tree' branch from my tree (which
> is located at 'git@github.com:ttaylorr/git.git'). I owe a huge "thank
> you" to Victoria Dye, who split out the various topics from GitHub's
> fork into individual rebased branches.

Thanks. I don't have time to pick it up as a topic myself, but hopefully
it can be useful to Toon (or any others interested in the topic).

-Peff