[0/8] ahead-behind: new builtin for counting multiple commit ranges

Message ID	pull.1489.git.1678111598.gitgitgadget@gmail.com (mailing list archive)
Headers	show Return-Path: <git-owner@vger.kernel.org> Message-Id: <pull.1489.git.1678111598.gitgitgadget@gmail.com> From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> Date: Mon, 06 Mar 2023 14:06:30 +0000 Subject: [PATCH 0/8] ahead-behind: new builtin for counting multiple commit ranges Fcc: Sent Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit MIME-Version: 1.0 To: git@vger.kernel.org Cc: gitster@pobox.com, me@ttaylorr.com, vdye@github.com, Derrick Stolee <derrickstolee@github.com> Precedence: bulk
Series	ahead-behind: new builtin for counting multiple commit ranges \| expand [0/8] ahead-behind: new builtin for counting multiple commit ranges [1/8] ahead-behind: create empty builtin [2/8] ahead-behind: parse tip references [3/8] ahead-behind: implement --ignore-missing option [4/8] commit-graph: combine generation computations [5/8] commit-graph: return generation from memory [6/8] commit-graph: introduce `ensure_generations_valid()` [7/8] ahead-behind: implement ahead_behind() logic [8/8] ahead-behind: add --contains mode

Philippe Blain via GitGitGadget March 6, 2023, 2:06 p.m. UTC

This series introduces the 'git ahead-behind' builtin, which has been used
at $DAYJOB for many years, but took many forms before landing on the current
version.

The main goal of the builtin is to compare multiple references against a
common base reference. The comparison is number of commits that are in each
side of the symmtric difference of their reachable sets. A commit C is
"ahead" of a commit B by the number of commits in B..C (reachable from C but
not reachable from B). Similarly, the commit C is "behind" the commit B by
the number of commits in C..B (reachable from B but not reachable from C).

These numbers can be computed by 'git rev-list --count B..C' and 'git
rev-list --count C..B', but there are common needs that benefit from having
the checks being done in the same process:

1. Our "branches" page lists ahead/behind counts for each listed branch as
compared to the repo's default branch. This can be done with a single
'git ahead-behind' process.
2. When a branch is updated, a background job checks if any pull requests
that target that branch should be closed because their branches were
merged implicitly by that update. These queries can e batched into 'git
ahead-behind' calls.

In that second example, we don't need the full ahead/behind counts (although
it is sufficient to look for branches that are "zero commits ahead", meaning
they are reachable from the base), so this builtin has an extra '--contains'
mode that only checks reachability from the base to each of the tips. 'git
ahead-behind --contains' is sort of the reverse of 'git branch --contains'.

The series starts with some basic boilerplate and argument parsing, along
with error conditions for missing objects. To avoid TOCTOU races, an
'--ignore-missing' option allows being flexible when a tip reference does
not exist. This is all covered in patches 1-3.

Patches 4-6 introduce a new method: ensure_generations_valid(). Patch 4 does
some refactoring of the existing generation number computations to make it
more generic, and patch 5 updates the definition of
commit_graph_generation() slightly, making way for patch 6 to implement the
method. With an existing commit-graph file, the commits that are not present
in the file are considered as having generation number "infinity". This is
useful for most of our reachability queries to this point, since those
commits are "above" the ones tracked by the commit-graph. When these commits
are low in number, then there is very little performance cost and zero
correctness cost.

However, we will see that the ahead/behind computation requires accurate
generation numbers to avoid overcounting. Thus, ensure_generations_valid()
is a way to specify a list of commits that need generation numbers computed
before continuing. It's a no-op if all of those commits are in the
commit-graph file. It's expensive if the commit-graph doesn't exist.
However, 'git ahead-behind' computations are likely to be slow no matter
what without a commit-graph, so assuming an existing commit-graph file is
reasonable. If we find sufficient desire to have an implementation that does
not have this requirement, we could create a second implementation and
toggle to it when generation_numbers_enabled() returns false.

Patch 7 implements the ahead-behind algorithm, as well as integrating the
builtin with that implementation. It's a long commit message, so hopefully
it explains the algorithm sufficiently.

Patch 8 implements the --contains option, which is another algorithm, but
more similar to other depth-first searches that already exist in
commit-reach.c.

Thanks, -Stolee

Derrick Stolee (7):
ahead-behind: create empty builtin
ahead-behind: parse tip references
ahead-behind: implement --ignore-missing option
commit-graph: combine generation computations
commit-graph: return generation from memory
ahead-behind: implement ahead_behind() logic
ahead-behind: add --contains mode

Taylor Blau (1):
commit-graph: introduce `ensure_generations_valid()`

.gitignore | 1 +
Documentation/git-ahead-behind.txt | 78 +++++++++++
Makefile | 1 +
builtin.h | 1 +
builtin/ahead-behind.c | 121 +++++++++++++++++
commit-graph.c | 209 +++++++++++++++++++----------
commit-graph.h | 7 +
commit-reach.c | 205 ++++++++++++++++++++++++++++
commit-reach.h | 37 +++++
git.c | 1 +
t/perf/p1500-graph-walks.sh | 29 ++++
t/t4218-ahead-behind.sh | 162 ++++++++++++++++++++++
t/t5318-commit-graph.sh | 2 +-
t/t6600-test-reach.sh | 120 +++++++++++++++++
14 files changed, 904 insertions(+), 70 deletions(-)
create mode 100644 Documentation/git-ahead-behind.txt
create mode 100644 builtin/ahead-behind.c
create mode 100755 t/perf/p1500-graph-walks.sh
create mode 100755 t/t4218-ahead-behind.sh

base-commit: d15644fe0226af7ffc874572d968598564a230dd
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1489%2Fderrickstolee%2Fstolee%2Fupstream-ahead-behind-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1489/derrickstolee/stolee/upstream-ahead-behind-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1489

Junio C Hamano March 6, 2023, 6:26 p.m. UTC | #1

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> These numbers can be computed by 'git rev-list --count B..C' and 'git
> rev-list --count C..B', but there are common needs that benefit from having
> the checks being done in the same process:

This makes readers wonder if "git rev-list --count B...C" should be
the end-user facing UI for this new feature, perhaps?

Of course if you are checking how C0, C1, C2,... relate to a single
B, the existing rev-list syntax would not work, and makes a totally
new subcommand a possibilty.

>  2. When a branch is updated, a background job checks if any pull requests
>     that target that branch should be closed because their branches were
>     merged implicitly by that update. These queries can e batched into 'git
>     ahead-behind' calls.
>
> In that second example, we don't need the full ahead/behind counts (although
> it is sufficient to look for branches that are "zero commits ahead", meaning
> they are reachable from the base), so this builtin has an extra '--contains'
> mode that only checks reachability from the base to each of the tips. 'git
> ahead-behind --contains' is sort of the reverse of 'git branch --contains'.

I thought that the reverse of "git branch --contains" was "git
branch --merged".  "git branch --merged maint ??/\*" is how I cull
topic branches that have already served their purpose.  

Isn't closing pull requests because they have been already merged
the same idea?  "git for-each-ref --merged main refs/pull/\*" or
something, perhaps?

All of the above are from only reading the cover letter.  I'm sure
I'll have more thoughts or even change my mind after reading the
patches.

Thanks.

Derrick Stolee March 6, 2023, 8:18 p.m. UTC | #2

On 3/6/2023 1:26 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> These numbers can be computed by 'git rev-list --count B..C' and 'git
>> rev-list --count C..B', but there are common needs that benefit from having
>> the checks being done in the same process:
> 
> This makes readers wonder if "git rev-list --count B...C" should be
> the end-user facing UI for this new feature, perhaps?
> 
> Of course if you are checking how C0, C1, C2,... relate to a single
> B, the existing rev-list syntax would not work, and makes a totally
> new subcommand a possibilty.
> 
>>  2. When a branch is updated, a background job checks if any pull requests
>>     that target that branch should be closed because their branches were
>>     merged implicitly by that update. These queries can e batched into 'git
>>     ahead-behind' calls.
>>
>> In that second example, we don't need the full ahead/behind counts (although
>> it is sufficient to look for branches that are "zero commits ahead", meaning
>> they are reachable from the base), so this builtin has an extra '--contains'
>> mode that only checks reachability from the base to each of the tips. 'git
>> ahead-behind --contains' is sort of the reverse of 'git branch --contains'.
> 
> I thought that the reverse of "git branch --contains" was "git
> branch --merged".  "git branch --merged maint ??/\*" is how I cull
> topic branches that have already served their purpose.  
> 
> Isn't closing pull requests because they have been already merged
> the same idea?  "git for-each-ref --merged main refs/pull/\*" or
> something, perhaps?

You are definitely on to something, and I was not aware of --merged as
an option to either of these.

'git branch --merged' has some limitations that tags cannot be used.

'git for-each-ref --merged' is probably sufficient. The only difference
being that it would be nice to specify the matching refs over stdin
with --stdin to avoid long argument lists.

With this in mind, I can update the performance test to look like this
(after updating the setup step to add branches for each line in 'refs')


test_perf 'batch reachability: git ahead-behind --contains' '
	git ahead-behind --contains --base=HEAD --stdin <refs
'

test_perf 'batch reachability: git branch --merged' '
	xargs git branch --merged=HEAD <branches
'

test_perf 'batch reachability: git for-each-ref --merged' '
	xargs git for-each-ref --merged=HEAD <refs
'

And get decent results on all cases with the Linux kernel repository:

Test                                                      this tree      
-------------------------------------------------------------------------
1500.2: ahead-behind counts: git ahead-behind             0.26(0.24+0.01)
1500.3: ahead-behind counts: git rev-list                 4.46(3.91+0.54)
1500.4: batch reachability: git ahead-behind --contains   0.02(0.01+0.01)
1500.5: batch reachability: git branch --merged           0.14(0.13+0.00)
1500.6: batch reachability: git for-each-ref --merged     0.14(0.13+0.00)

So, there is benefit in using this tips_reachable_from_base() method in
the two existing 'git (branch|for-each-ref) --merged' computations. The
API boundary selected in this series might not be the most appropriate
for those builtins, so let's kick out patch 8 from this series for now
and I'll revisit it separately.

Thanks,
-Stolee

Junio C Hamano March 6, 2023, 10:24 p.m. UTC | #3

Derrick Stolee <derrickstolee@github.com> writes:

> 'git for-each-ref --merged' is probably sufficient. The only difference
> being that it would be nice to specify the matching refs over stdin
> with --stdin to avoid long argument lists.

Yeah, if you have a list of concrete refs maintained externally
(as opposed to the example I responded to, where you generate the
refs by telling for-each-ref what pattern they should match), having
--stdin would be a good thing.

> So, there is benefit in using this tips_reachable_from_base() method in
> the two existing 'git (branch|for-each-ref) --merged' computations. The
> API boundary selected in this series might not be the most appropriate
> for those builtins, so let's kick out patch 8 from this series for now
> and I'll revisit it separately.

Yup, if the reachability API refactoring in these patches can also
help the "for-each-ref" listing (which "git branch" and "git tag"
bases their listing behaviour), it would be very good.

Thanks.

Taylor Blau March 7, 2023, 12:33 a.m. UTC | #4

On Mon, Mar 06, 2023 at 02:06:30PM +0000, Derrick Stolee via GitGitGadget wrote:
> This series introduces the 'git ahead-behind' builtin, which has been used
> at $DAYJOB for many years, but took many forms before landing on the current
> version.

Thanks for a helpful summary of all of the details here. I am of course
familiar with your use of this builtin at our common $DAYJOB, but it was
nice to see a from-scratch explanation of what you're trying to do here.

Now let's take a look at how the patches came together... ;-).

Thanks,
Taylor

Taylor Blau March 7, 2023, 12:36 a.m. UTC | #5

On Mon, Mar 06, 2023 at 10:26:26AM -0800, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > These numbers can be computed by 'git rev-list --count B..C' and 'git
> > rev-list --count C..B', but there are common needs that benefit from having
> > the checks being done in the same process:
>
> This makes readers wonder if "git rev-list --count B...C" should be
> the end-user facing UI for this new feature, perhaps?
>
> Of course if you are checking how C0, C1, C2,... relate to a single
> B, the existing rev-list syntax would not work, and makes a totally
> new subcommand a possibilty.

Yeah. You could imagine that `rev-list --count` might do something
fancy like coalescing

    git rev-list --count B...C1 B...C2 B...C3

into a single walk. But I am not sure that just because `rev-list
--count` provides similar functionality that we should fold in the
proposed `ahead-behind` interface into that flag.

On the other hand, I could see a compelling argument for a slightly
different syntax (maybe `--count-ahead-behind` or
`--count=ahead-behind`) that would fold this functionality into
`rev-list`.

And that is the sort of thing that we would want to settle on sooner
rather than later, since it's fairly baked in once we decide one way or
another and then merge this up.

My personal feeling is that we ought to avoid (further) overloading
`rev-list` absent of a compelling reason to do so. But I am definitely
open to other thoughts here.

Thanks,
Taylor

Jeff King March 9, 2023, 9:20 a.m. UTC | #6

On Mon, Mar 06, 2023 at 07:36:23PM -0500, Taylor Blau wrote:

> > This makes readers wonder if "git rev-list --count B...C" should be
> > the end-user facing UI for this new feature, perhaps?
> >
> > Of course if you are checking how C0, C1, C2,... relate to a single
> > B, the existing rev-list syntax would not work, and makes a totally
> > new subcommand a possibilty.
> 
> Yeah. You could imagine that `rev-list --count` might do something
> fancy like coalescing
> 
>     git rev-list --count B...C1 B...C2 B...C3
> 
> into a single walk. But I am not sure that just because `rev-list
> --count` provides similar functionality that we should fold in the
> proposed `ahead-behind` interface into that flag.

It does coalesce all of that into a single walk. The problem is somewhat
the opposite: it only has a notion of two "sides" for a symmetric
traversal: left and right. But in your example there are many sides, and
we have to remember which is which.

I think getting the answer from one walk would require an arbitrary
number of bits to paint down each path. Certainly the ahead-behind that
Vicent and I wrote long ago didn't do that (IIRC it mostly relied on
doing multiple traversals in the same process, which amortized the cost
of commit parsing; that's not really an issue these days with commit
graphs).

Peeking at patch 7 of Stolee's series...yep. That's exactly what it
does. :)

I wondered how much it would matter on top of a naive loop of
single-traversals, now that we have commit graphs. It looks like there's
still quite a nice speedup from the numbers in patch 7 (though the
totally naive "loop of rev-list" is incurring extra startup overhead,
too).

> My personal feeling is that we ought to avoid (further) overloading
> `rev-list` absent of a compelling reason to do so. But I am definitely
> open to other thoughts here.

So I think this actually is what "git rev-list --left-right --count
old...new" does now. But extending it to multiple sets in one traversal
means you need:

  - being able to ask for individual left-right markers for each pair,
    not treating all lefts and all rights together

  - don't stop traversing when you hit an UNINTERESTING commit if there
    are still bits to paint. In a single-pair traversal, those two are
    the same thing (we stop at the merge base), but with multiple pairs
    you may have to keep walking past a commit that is excluded from one
    pair, but not another. This _might_ be doable if you assume all of
    the left-hand bases are the same, but I didn't think hard enough to
    feel confident in that. But even so, that only solves cases like
    "how do these branches compare to HEAD" (which I think is what
    GitHub does). But it doesn't allow "how do these branches compare to
    to their respective @{upstream} refs".

So I don't think it would be impossible to make this a mode of rev-list.
And that mode might even provide flexibility for other similar
operations, like a mass "git rev-list --cherry-mark"[1]. But it is a
pretty big departure from the current rev-list traversal (to my mind,
especially the "keep walking past UNINTERESTING part). I don't mind it
as its own command.

-Peff

[1] The reason you might want a mass cherry-mark is basically doing
    something like the "branches" page, but in a workflow where upstream
    applies patches, like git.git. There you may want to ask about
    "origin/next...$branch" for all of your branches to see which ones
    have been merged where.

Junio C Hamano March 9, 2023, 9:51 p.m. UTC | #7

Jeff King <peff@peff.net> writes:

>> Yeah. You could imagine that `rev-list --count` might do something
>> fancy like coalescing
>> 
>>     git rev-list --count B...C1 B...C2 B...C3
>> 
>> into a single walk. But I am not sure that just because `rev-list
>> --count` provides similar functionality that we should fold in the
>> proposed `ahead-behind` interface into that flag.
>
> It does coalesce all of that into a single walk. The problem is somewhat
> the opposite: it only has a notion of two "sides" for a symmetric
> traversal: left and right. But in your example there are many sides, and
> we have to remember which is which.

Yeah, this reminds me of what I had to do in "show-branch", where
each tip gets assigned a bit in the object->flags (which means it
can only traverse from a very small limited number of tips, like 30
or so), which I once planned to extend to arbitrary number of tips
by storing these bits in commit slab, but it never materialized.

> So I don't think it would be impossible to make this a mode of rev-list.
> And that mode might even provide flexibility for other similar
> operations, like a mass "git rev-list --cherry-mark"[1]. But it is a
> pretty big departure from the current rev-list traversal (to my mind,
> especially the "keep walking past UNINTERESTING part). I don't mind it
> as its own command.

I agree this is not a good fit for the mental model of rev-list or
the revision.c::get_revision() traversal.

Thanks.

[0/8] ahead-behind: new builtin for counting multiple commit ranges

Message

Comments