mbox series

[RFC,0/5] line-log: towards a more responsive, incremental 'git log -L'

Message ID 20190818182801.7580-1-szeder.dev@gmail.com
Headers show
Series line-log: towards a more responsive, incremental 'git log -L' | expand

Message

SZEDER Gábor Aug. 18, 2019, 6:27 p.m. UTC
Line-level log performs a preprocessing step in
prepare_revision_walk(), during which it filters and rewrites history
to keep only commits modifying the given line range.  This
preprocessing causes significant delay before the first commit is
shown, wastes CPU time when the user asks only for a few commits, and
does parent rewriting with no way to turn it off.

This patch series addresses these issues by integrating line-level log
filtering into the revision walking machinery and making it work
together with generation number-based topo-ordering (though for now
only in the case when the user doesn't explicitly asks for parent
rewriting, which is probably the common case).

The first two patches are quite straightforward (and arguably somewhat
unrelated), but the rest deals with history traversal and parent
rewriting, which I don't usually do, hence the RFC.


SZEDER Gábor (5):
  completion: offer '--(no-)patch' among 'git log' options
  line-log: remove unused fields from 'struct line_log_data'
  t4211-line-log: add tests for parent oids
  line-log: more responsive, incremental 'git log -L'
  line-log: try to use generation number-based topo-ordering

 contrib/completion/git-completion.bash |  1 +
 line-log.c                             |  4 +-
 line-log.h                             |  5 +-
 revision.c                             | 38 +++++++++++---
 t/t4211-line-log.sh                    | 68 ++++++++++++++++++++++++++
 5 files changed, 105 insertions(+), 11 deletions(-)

Comments

Derrick Stolee Aug. 19, 2019, noon UTC | #1
On 8/18/2019 2:27 PM, SZEDER Gábor wrote:
> Line-level log performs a preprocessing step in
> prepare_revision_walk(), during which it filters and rewrites history
> to keep only commits modifying the given line range.  This
> preprocessing causes significant delay before the first commit is
> shown, wastes CPU time when the user asks only for a few commits, and
> does parent rewriting with no way to turn it off.
> 
> This patch series addresses these issues by integrating line-level log
> filtering into the revision walking machinery and making it work
> together with generation number-based topo-ordering (though for now
> only in the case when the user doesn't explicitly asks for parent
> rewriting, which is probably the common case).
> 
> The first two patches are quite straightforward (and arguably somewhat
> unrelated), but the rest deals with history traversal and parent
> rewriting, which I don't usually do, hence the RFC.

Hi Szeder,

Thanks for sending this series! I'm particularly excited about it
because we recently got a bug report from a user in the Windows OS
repo about "git log -L" being very slow. I put it on the backlog [1]
but haven't had time to investigate it myself. After we are done
updating to v2.23.0 [2], I'll have time to test your changes on
the Windows repo. I anticipate your change to provide a massive
boost.

In the meantime, your first three patches look good to me. I'll
dig more into the last two at the same time as I run performance
tests.

Thanks,
-Stolee

[1] https://github.com/microsoft/git/issues/166
[2] https://github.com/microsoft/git/pull/165
SZEDER Gábor Aug. 19, 2019, 1:03 p.m. UTC | #2
On Mon, Aug 19, 2019 at 08:00:11AM -0400, Derrick Stolee wrote:
> On 8/18/2019 2:27 PM, SZEDER Gábor wrote:
> > Line-level log performs a preprocessing step in
> > prepare_revision_walk(), during which it filters and rewrites history
> > to keep only commits modifying the given line range.  This
> > preprocessing causes significant delay before the first commit is
> > shown, wastes CPU time when the user asks only for a few commits, and
> > does parent rewriting with no way to turn it off.
> > 
> > This patch series addresses these issues by integrating line-level log
> > filtering into the revision walking machinery and making it work
> > together with generation number-based topo-ordering (though for now
> > only in the case when the user doesn't explicitly asks for parent
> > rewriting, which is probably the common case).
> > 
> > The first two patches are quite straightforward (and arguably somewhat
> > unrelated), but the rest deals with history traversal and parent
> > rewriting, which I don't usually do, hence the RFC.
> 
> Hi Szeder,
> 
> Thanks for sending this series! I'm particularly excited about it
> because we recently got a bug report from a user in the Windows OS
> repo about "git log -L" being very slow. I put it on the backlog [1]
> but haven't had time to investigate it myself. After we are done
> updating to v2.23.0 [2], I'll have time to test your changes on
> the Windows repo. I anticipate your change to provide a massive
> boost.

Well, it depends on what you mean by "boost"...  As discussed in patch
4's commit message, this series doesn't really optimizes much, and the
total amount of work to be done remains the same, except that 'git log
-L... -<N>' will be able to return early.  So when you benchmark it
with e.g. 'time git log -L... >/dev/null', then you won't see much
difference, as it will still take just about as looooong as before,
apart from the faster generation numbers-based topo-ordering.  (But I
have a few WIP patches waiting to be cleaned up that might bring about
3-4x speedup in general.)

In the meantime, until these changes trickle into a Git release, for a
faster line-level log I would recommend to:

  - Limit the revision range up front, i.e.:

      git log -L... ^a-not-too-old-version

    because this can greatly reduce the amount of commits that that
    expensive preprocessing step has to churn through.

  - Use '--no-renames'.  Although it won't follow the evolution of the
    line range over file renames, this might be an acceptable
    compromise.  (this is what those WIP patches are focusing on)

  - Or both.
Derrick Stolee Aug. 19, 2019, 2:50 p.m. UTC | #3
On 8/19/2019 9:03 AM, SZEDER Gábor wrote:
> On Mon, Aug 19, 2019 at 08:00:11AM -0400, Derrick Stolee wrote:
>> On 8/18/2019 2:27 PM, SZEDER Gábor wrote:
>>> Line-level log performs a preprocessing step in
>>> prepare_revision_walk(), during which it filters and rewrites history
>>> to keep only commits modifying the given line range.  This
>>> preprocessing causes significant delay before the first commit is
>>> shown, wastes CPU time when the user asks only for a few commits, and
>>> does parent rewriting with no way to turn it off.
>>>
>>> This patch series addresses these issues by integrating line-level log
>>> filtering into the revision walking machinery and making it work
>>> together with generation number-based topo-ordering (though for now
>>> only in the case when the user doesn't explicitly asks for parent
>>> rewriting, which is probably the common case).
>>>
>>> The first two patches are quite straightforward (and arguably somewhat
>>> unrelated), but the rest deals with history traversal and parent
>>> rewriting, which I don't usually do, hence the RFC.
>>
>> Hi Szeder,
>>
>> Thanks for sending this series! I'm particularly excited about it
>> because we recently got a bug report from a user in the Windows OS
>> repo about "git log -L" being very slow. I put it on the backlog [1]
>> but haven't had time to investigate it myself. After we are done
>> updating to v2.23.0 [2], I'll have time to test your changes on
>> the Windows repo. I anticipate your change to provide a massive
>> boost.
> 
> Well, it depends on what you mean by "boost"...  As discussed in patch
> 4's commit message, this series doesn't really optimizes much, and the
> total amount of work to be done remains the same, except that 'git log
> -L... -<N>' will be able to return early.  So when you benchmark it
> with e.g. 'time git log -L... >/dev/null', then you won't see much
> difference, as it will still take just about as looooong as before,
> apart from the faster generation numbers-based topo-ordering.  (But I
> have a few WIP patches waiting to be cleaned up that might bring about
> 3-4x speedup in general.)
> 
> In the meantime, until these changes trickle into a Git release, for a
> faster line-level log I would recommend to:
> 
>   - Limit the revision range up front, i.e.:
> 
>       git log -L... ^a-not-too-old-version
> 
>     because this can greatly reduce the amount of commits that that
>     expensive preprocessing step has to churn through.
> 
>   - Use '--no-renames'.  Although it won't follow the evolution of the
>     line range over file renames, this might be an acceptable
>     compromise.  (this is what those WIP patches are focusing on)
> 
>   - Or both.

Usually when I'm testing something involving the --topo-order logic,
I'll use a simple "-10" to limit to something similar to a "first page".

In this line-log case, I'll use "-1" to just get the top result, as that
is essentially how long it takes before the user gets some feedback.

Here are the results using a random path I picked out from the Windows
repo (it was only changed ~10 times in the 4.5 million commits):

Before:

real    2m7.308s
real    2m8.572s

With Patch 4:

real    0m38.628s
real    0m38.477s

With Patch 5:

real    0m24.685s
real    0m24.310s

For the specific file in the bug report from a real user, I got
these numbers:

real    0m32.293s (patch 4)
real    0m19.362s (patch 5)

Note that I don't include the "without patch" numbers. For some
reason the path provided is particularly nasty and caused 20,000+
missing blobs to be downloaded one-by-one (remember: VFS for Git
has many partial-clone-like behaviors). I canceled my test after 55
minutes. I'll dig in more to see what is going on, since only 37
commits actually change that path.

Thanks so much for finding these "easy" wins. Your changes are
nicely isolated, but give us a target to improve even further!

I expect that we will incorporate your commits into microsoft/git
very early.

Thanks,
-Stolee
SZEDER Gábor Aug. 19, 2019, 3:02 p.m. UTC | #4
On Mon, Aug 19, 2019 at 10:50:48AM -0400, Derrick Stolee wrote:
> Note that I don't include the "without patch" numbers. For some
> reason the path provided is particularly nasty and caused 20,000+
> missing blobs to be downloaded one-by-one (remember: VFS for Git
> has many partial-clone-like behaviors). I canceled my test after 55
> minutes. I'll dig in more to see what is going on, since only 37
> commits actually change that path.

Don't bother digging into it, I know why it happens (and how to avoid
it! :), but don't have the time right now to explain.
Derrick Stolee Aug. 19, 2019, 4:12 p.m. UTC | #5
On 8/19/2019 11:02 AM, SZEDER Gábor wrote:
> On Mon, Aug 19, 2019 at 10:50:48AM -0400, Derrick Stolee wrote:
>> Note that I don't include the "without patch" numbers. For some
>> reason the path provided is particularly nasty and caused 20,000+
>> missing blobs to be downloaded one-by-one (remember: VFS for Git
>> has many partial-clone-like behaviors). I canceled my test after 55
>> minutes. I'll dig in more to see what is going on, since only 37
>> commits actually change that path.
> 
> Don't bother digging into it, I know why it happens (and how to avoid
> it! :), but don't have the time right now to explain.

Great!  I look forward to your explanation and fix later.

Just to be sure we've got the same issue, here is a section of the
call stack in the hot portion:

line_log_filter
+ queue_diffs
  + diffcore_std
    + diffcore_rename
      + diff_populate_filespec

Thanks,
-Stolee
SZEDER Gábor Aug. 21, 2019, 11:07 a.m. UTC | #6
On Mon, Aug 19, 2019 at 12:12:32PM -0400, Derrick Stolee wrote:
> On 8/19/2019 11:02 AM, SZEDER Gábor wrote:
> > On Mon, Aug 19, 2019 at 10:50:48AM -0400, Derrick Stolee wrote:
> >> Note that I don't include the "without patch" numbers. For some
> >> reason the path provided is particularly nasty and caused 20,000+
> >> missing blobs to be downloaded one-by-one (remember: VFS for Git
> >> has many partial-clone-like behaviors). I canceled my test after 55
> >> minutes. I'll dig in more to see what is going on, since only 37
> >> commits actually change that path.
> > 
> > Don't bother digging into it, I know why it happens (and how to avoid
> > it! :), but don't have the time right now to explain.
> 
> Great!  I look forward to your explanation and fix later.
> 
> Just to be sure we've got the same issue, here is a section of the
> call stack in the hot portion:
> 
> line_log_filter
> + queue_diffs
>   + diffcore_std
>     + diffcore_rename
>       + diff_populate_filespec

Hmm, interesting.  Certainly most of the time is wasted in
queue_diffs(), but in my perf measurements diff_tree_oid() is
responsible for most of that, not diffcore_std().  This might still be
the same issue, though, but perhaps VFS for Git shifts the balance.

We'll see, here are my patches to address that diff_tree_oid()
slowness:

https://public-inbox.org/git/20190821110424.18184-1-szeder.dev@gmail.com/T/
Derrick Stolee April 9, 2020, 12:17 p.m. UTC | #7
On 8/18/2019 2:27 PM, SZEDER Gábor wrote:
> Line-level log performs a preprocessing step in
> prepare_revision_walk(), during which it filters and rewrites history
> to keep only commits modifying the given line range.  This
> preprocessing causes significant delay before the first commit is
> shown, wastes CPU time when the user asks only for a few commits, and
> does parent rewriting with no way to turn it off.
> 
> This patch series addresses these issues by integrating line-level log
> filtering into the revision walking machinery and making it work
> together with generation number-based topo-ordering (though for now
> only in the case when the user doesn't explicitly asks for parent
> rewriting, which is probably the common case).
> 
> The first two patches are quite straightforward (and arguably somewhat
> unrelated), but the rest deals with history traversal and parent
> rewriting, which I don't usually do, hence the RFC.
> 
> 
> SZEDER Gábor (5):
>   completion: offer '--(no-)patch' among 'git log' options
>   line-log: remove unused fields from 'struct line_log_data'
>   t4211-line-log: add tests for parent oids
>   line-log: more responsive, incremental 'git log -L'
>   line-log: try to use generation number-based topo-ordering

Hi Szeder,

I was taking inventory of our issues especially around history now
that the changed-path Bloom filters are close to wrapping up. What's
the status on this RFC? Looking at it now, I understand the situation
better and could help review a bit more than before. Do you have more
context as to the situation on this series?

Thanks,
-Stolee
SZEDER Gábor April 24, 2020, 9:02 a.m. UTC | #8
On Thu, Apr 09, 2020 at 08:17:41AM -0400, Derrick Stolee wrote:
> On 8/18/2019 2:27 PM, SZEDER Gábor wrote:
> > Line-level log performs a preprocessing step in
> > prepare_revision_walk(), during which it filters and rewrites history
> > to keep only commits modifying the given line range.  This
> > preprocessing causes significant delay before the first commit is
> > shown, wastes CPU time when the user asks only for a few commits, and
> > does parent rewriting with no way to turn it off.
> > 
> > This patch series addresses these issues by integrating line-level log
> > filtering into the revision walking machinery and making it work
> > together with generation number-based topo-ordering (though for now
> > only in the case when the user doesn't explicitly asks for parent
> > rewriting, which is probably the common case).
> > 
> > The first two patches are quite straightforward (and arguably somewhat
> > unrelated), but the rest deals with history traversal and parent
> > rewriting, which I don't usually do, hence the RFC.
> > 
> > 
> > SZEDER Gábor (5):
> >   completion: offer '--(no-)patch' among 'git log' options
> >   line-log: remove unused fields from 'struct line_log_data'
> >   t4211-line-log: add tests for parent oids
> >   line-log: more responsive, incremental 'git log -L'
> >   line-log: try to use generation number-based topo-ordering
> 
> Hi Szeder,
> 
> I was taking inventory of our issues especially around history now
> that the changed-path Bloom filters are close to wrapping up.

Well, I'm about to stir it up over the weekend...

> What's
> the status on this RFC? Looking at it now, I understand the situation
> better and could help review a bit more than before. Do you have more
> context as to the situation on this series?

Sadly, I haven't touched this patch series since then, other than
rebasing it on top of new releases once or twice, but since v2.23 not
even that.  I think I run into some conflicts and was not in the mood
to resolve them, because with a2bb801f6a (line-log: avoid unnecessary
full tree diffs, 2019-08-21) the performance benefits are much lower,
so it was not that pressing...

I think patch 4 in itself is not really the right way to integrate
line-log into the revision walking machinery:

  - Line-log follows full-file renames, but it doesn't actually use
    '--follow', but rather implements its own logic to detect them.
    This logic is in some ways better, than '--follow', notably it can
    follow multiple paths at once, while '--follow' only allows a
    single path.
    I think the rename following logic should be extracted from
    line-log, and it should be used to implement '--follow', removing
    some of its restrictions.

  - Line-log should then be ported to use the revamped '--follow'.

  - And then it's finally time for something like that patch 4, and to
    have some "fun" with making explicitly requested parent rewriting
    work (I can only remember that whenever I tried to make that work
    my brain started to hurt :)
 
Anyway, I think the first three patches are worth having.