[0/8] uprobes: RCU-protected hot path optimizations

Message ID: 20240731214256.3588718-1-andrii@kernel.org

Message

Andrii Nakryiko July 31, 2024, 9:42 p.m. UTC
This patch set is heavily inspired by Peter Zijlstra's uprobe optimization
patches ([0]) and continues that work, while trying to keep complexity to a
minimum and attempting to reuse existing primitives as much as possible. The
goal here is to optimize the obvious uprobe triggering hot path, while keeping
the rest of the locking mostly intact.

I've reused the rb_find_rcu()/rb_find_add_rcu() patches as-is, and the "split
uprobe_unregister()" patch is mostly intact, but I've added uprobe_unregister_sync()
to the error handling code path inside uprobe_register(). This is due to
recent refactorings from Oleg Nesterov ([1]), which necessitate this
addition. I'm not sure I got the Co-developed-by/SOB pieces right, for which
I apologize in advance.
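
The resulting usage pattern is roughly as follows (a simplified sketch based
on the patch titles; exact signatures in the series may differ):

#include <linux/uprobes.h>

/* Sketch (not actual series code): register a probe, then tear it down
 * with the split unregister API. uprobe_unregister() only unlinks the
 * consumer without waiting; uprobe_unregister_sync() is the single
 * synchronization point after which freeing the consumer is safe. */
static int probe_attach_detach(struct inode *inode, loff_t offset,
			       struct uprobe_consumer *uc)
{
	struct uprobe *u;

	u = uprobe_register(inode, offset, 0 /* ref_ctr_offset */, uc);
	if (IS_ERR(u))
		return PTR_ERR(u);

	/* ... probe is live, handlers may run concurrently ... */

	uprobe_unregister(u, uc);	/* unlink consumer, no waiting */
	uprobe_unregister_sync();	/* wait for in-flight handlers */
	/* only now is it safe to free or reuse *uc */
	return 0;
}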

Except for the refcounting change patch (which I strongly believe is a good
improvement that lets us forget about the quasi-refcounting scheme of the
uprobe->consumers list), the rest of the changes are similar to Peter's
initial changes in [0].
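
The overall direction is the classic RCU-protected lookup plus conditional
refcount grab; roughly (a generic sketch with assumed helper and field names,
not the exact code from the patch):

#include <linux/refcount.h>
#include <linux/rcupdate.h>

/* Generic sketch of the lifetime scheme the series moves towards:
 * objects are looked up locklessly under RCU and a reference is taken
 * only if the object hasn't already started dying (refcount == 0). */
static struct uprobe *get_uprobe_rcu(struct inode *inode, loff_t offset)
{
	struct uprobe *uprobe;

	rcu_read_lock();
	uprobe = __find_uprobe(inode, offset);	/* lockless tree lookup */
	if (uprobe && !refcount_inc_not_zero(&uprobe->ref))
		uprobe = NULL;			/* lost the race against teardown */
	rcu_read_unlock();

	return uprobe;	/* caller owns a reference (or NULL) */
}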

Main differences would be:
  - no special RCU protection for mmap and fork handling; we just stick to
    refcounts there, as those are infrequent, non-performance-sensitive code
    paths that are complex and thus benefit from proper locking;
  - the above means we don't need any custom SRCU additions to handle the
    forking code path;
  - I handled the UPROBE_HANDLER_REMOVE problem in handler_chain() differently,
    again leveraging the existing locking scheme;
  - I kept refcount usage for uretprobes and single-stepped uprobes; I plan to
    address that in separate follow-up patches. The plan is to avoid
    task_work, but I need to sit down and write and test the code;
  - finally, I dutifully used SRCU throughout all the changes, and only the
    last patch switches from SRCU to RCU Tasks Trace, demonstrating significant
    performance and scalability gains from this (see the read-side sketch
    below).
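
To illustrate what the last patch changes in the hot path, here is roughly
what the read-side section looks like in each flavor (a simplified sketch;
the actual code in the series is more involved):

#include <linux/srcu.h>
#include <linux/rcupdate_trace.h>

/* SRCU flavor: readers pay for srcu_read_lock()/srcu_read_unlock(),
 * which update per-CPU counters of the srcu_struct. */
DEFINE_STATIC_SRCU(uprobes_srcu);

static void handle_with_srcu(void)
{
	int idx = srcu_read_lock(&uprobes_srcu);
	/* ... look up uprobe, run consumer handlers ... */
	srcu_read_unlock(&uprobes_srcu, idx);
}

/* RCU Tasks Trace flavor: the read side is cheaper (essentially a task
 * flag), which is where the extra throughput comes from; writers instead
 * pay for the heavier synchronize_rcu_tasks_trace(). */
static void handle_with_rcu_tasks_trace(void)
{
	rcu_read_lock_trace();
	/* ... look up uprobe, run consumer handlers ... */
	rcu_read_unlock_trace();
}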

The changes in this patch set were tested using BPF selftests and the
uprobe-stress ([2]) tool. One recently added BPF selftest
(uprobe_multi/consumers) from Jiri Olsa will need a single-line adjustment to
its counting logic, but that patch is in bpf-next/master, so we'll have to
address it once the linux-trace (or tip) and bpf-next trees merge. I'll take
care of that when this happens.

Now, for the benchmarking results. I've used the following script (which
utilizes the BPF selftests-based bench tool). The machine was an 80-core Intel
Xeon Gold 6138 @ 2.00GHz running a kernel with a production-like config.
I minimized background noise by stopping every service I could identify and
stop, so the results are pretty stable and variability is small overall.

Benchmark script:

#!/bin/bash

set -eufo pipefail

# For each benchmark (uprobe and uretprobe triggering a nop function) and
# each parallelism level, run the BPF selftests bench tool with a 3s warmup
# and 5s measurement window (-w3 -d5; -p sets the number of parallel
# triggering threads), then parse total and per-CPU throughput out of the
# final summary line.
for i in uprobe-nop uretprobe-nop; do
    for p in 1 2 4 8 16 32 64; do
        summary=$(sudo ./bench -w3 -d5 -p$p -a trig-$i | tail -n1)
        total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
        percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
        printf "%-15s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
    done
    echo
done

With all the lock-avoiding changes done in this patch set, we get a pretty
decent improvement in the performance and scalability of uprobes with the
number of CPUs, even though we are still nowhere near linear scalability. This
is due to the remaining mmap_lock, which is currently taken to resolve the
interrupt address into inode+offset and then into a uprobe instance (see the
sketch below). And, of course, uretprobes still need similar RCU treatment to
avoid refcounting in the hot path, which will be addressed in follow-up
patches.
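
For reference, that hot-path resolution looks roughly like this (a simplified
sketch of find_active_uprobe()-style code, not the exact kernel source):

#include <linux/mm.h>
#include <linux/fs.h>

/* Simplified sketch: every uprobe hit still maps the trap address back
 * to inode+offset under mmap_lock (read), which is what currently limits
 * scalability. */
static struct uprobe *resolve_bp_vaddr(struct mm_struct *mm, unsigned long bp_vaddr)
{
	struct uprobe *uprobe = NULL;
	struct vm_area_struct *vma;

	mmap_read_lock(mm);			/* <-- the contended lock */
	vma = vma_lookup(mm, bp_vaddr);
	if (vma && vma->vm_file) {
		struct inode *inode = file_inode(vma->vm_file);
		loff_t offset = ((loff_t)vma->vm_pgoff << PAGE_SHIFT) +
				(bp_vaddr - vma->vm_start);

		uprobe = find_uprobe(inode, offset);	/* rb-tree lookup */
	}
	mmap_read_unlock(mm);

	return uprobe;
}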

BASELINE (on top of Oleg's clean up patches)
============================================
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)

uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

Up to second-to-last patch (i.e., SRCU-based optimizations)
===========================================================
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

RCU Tasks Trace
===============
uprobe-nop      ( 1 cpus):    3.396 ± 0.002M/s  (  3.396M/s/cpu)
uprobe-nop      ( 2 cpus):    4.271 ± 0.006M/s  (  2.135M/s/cpu)
uprobe-nop      ( 4 cpus):    8.499 ± 0.015M/s  (  2.125M/s/cpu)
uprobe-nop      ( 8 cpus):   10.355 ± 0.028M/s  (  1.294M/s/cpu)
uprobe-nop      (16 cpus):    7.615 ± 0.099M/s  (  0.476M/s/cpu)
uprobe-nop      (32 cpus):    4.430 ± 0.007M/s  (  0.138M/s/cpu)
uprobe-nop      (64 cpus):    6.887 ± 0.020M/s  (  0.108M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.174 ± 0.001M/s  (  2.174M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.853 ± 0.001M/s  (  1.426M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.913 ± 0.002M/s  (  1.228M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.883 ± 0.002M/s  (  0.735M/s/cpu)
uretprobe-nop   (16 cpus):    5.147 ± 0.001M/s  (  0.322M/s/cpu)
uretprobe-nop   (32 cpus):    3.738 ± 0.008M/s  (  0.117M/s/cpu)
uretprobe-nop   (64 cpus):    4.397 ± 0.002M/s  (  0.069M/s/cpu)

For baseline vs SRCU, peak throughput increased from 3.7 M/s (million uprobe
triggerings per second) up to about 8 M/s. For uretprobes it's a bit more
modest, with a bump from 2.4 M/s to 5.3 M/s.

For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further from
8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s (+11%),
as we have more work to do on uretprobes side.

Even single-threaded (no contention) performance is slightly better: 3.276 M/s
to 3.396 M/s (+3.7%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%)
for uretprobes.

  [0] https://lore.kernel.org/linux-trace-kernel/20240711110235.098009979@infradead.org/
  [1] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@redhat.com/
  [2] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress

Andrii Nakryiko (6):
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: protected uprobe lifetime with SRCU
  uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
  uprobes: travers uprobe's consumer list locklessly under SRCU
    protection
  uprobes: perform lockless SRCU-protected uprobes_tree lookup
  uprobes: switch to RCU Tasks Trace flavor for better performance

Peter Zijlstra (2):
  rbtree: provide rb_find_rcu() / rb_find_add_rcu()
  perf/uprobe: split uprobe_unregister()

 include/linux/rbtree.h                        |  67 ++++
 include/linux/uprobes.h                       |  20 +-
 kernel/events/uprobes.c                       | 375 ++++++++++--------
 kernel/trace/bpf_trace.c                      |   8 +-
 kernel/trace/trace_uprobe.c                   |  15 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |   3 +-
 6 files changed, 305 insertions(+), 183 deletions(-)

Comments

Oleg Nesterov Aug. 7, 2024, 1:29 p.m. UTC | #1
On 07/31, Andrii Nakryiko wrote:
>
> Andrii Nakryiko (6):
>   uprobes: revamp uprobe refcounting and lifetime management
>   uprobes: protected uprobe lifetime with SRCU
>   uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
>   uprobes: travers uprobe's consumer list locklessly under SRCU
>     protection
>   uprobes: perform lockless SRCU-protected uprobes_tree lookup
>   uprobes: switch to RCU Tasks Trace flavor for better performance
>
> Peter Zijlstra (2):
>   rbtree: provide rb_find_rcu() / rb_find_add_rcu()
>   perf/uprobe: split uprobe_unregister()

I see nothing wrong in 1-7. LGTM.

But since you are going to send the new version, I'd like to apply V2
and then try to re-check the resulting code.

As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)

Oleg.
Andrii Nakryiko Aug. 7, 2024, 3:24 p.m. UTC | #2
On Wed, Aug 7, 2024 at 6:30 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > Andrii Nakryiko (6):
> >   uprobes: revamp uprobe refcounting and lifetime management
> >   uprobes: protected uprobe lifetime with SRCU
> >   uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
> >   uprobes: travers uprobe's consumer list locklessly under SRCU
> >     protection
> >   uprobes: perform lockless SRCU-protected uprobes_tree lookup
> >   uprobes: switch to RCU Tasks Trace flavor for better performance
> >
> > Peter Zijlstra (2):
> >   rbtree: provide rb_find_rcu() / rb_find_add_rcu()
> >   perf/uprobe: split uprobe_unregister()
>
> I see nothing wrong in 1-7. LGTM.
>

So, I don't know how it slipped through the first time, because I tested
overnight with uprobe-stress; perhaps I adjusted some parameters (more
threads or a different ratio of threads, I'm not sure anymore). But it turns
out that the lockless RB-tree traversal actually crashes after a few
minutes of running uprobe-stress. I'll post details separately later
today, but I suspect that rb_find_rcu() and rb_find_add_rcu() are not
enough to make it safe.

Hopefully someone can help me figure this out.
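
For context, the lookup in question boils down to roughly this pattern
(simplified; the actual code also has to deal with taking a reference on the
found uprobe):

/* Simplified sketch of the lockless lookup under discussion: walk the
 * uprobes_tree rb-tree without taking uprobes_treelock, relying on
 * rb_find_rcu() for properly ordered loads. The caller must be inside
 * an RCU read-side critical section. */
static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
{
	struct __uprobe_key key = {
		.inode = inode,
		.offset = offset,
	};
	struct rb_node *node;

	node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key);
	return node ? rb_entry(node, struct uprobe, rb_node) : NULL;
}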

> But since you are going to send the new version, I'd like to apply V2
> and then try to re-check the resulting code.

Yes, I was waiting for more of Peter's comments, but I guess I'll just
send a v2 today. I'll probably include the SRCU+timeout logic for
return_instances, and maybe the lockless VMA parts as well. Those aren't
meant for landing yet, but it's probably useful to see everything
end-to-end.

Given the hiccup with the lockless uprobes_tree lookup, we shouldn't land
that change yet, but the first 4 patches hopefully are in good enough
shape to apply, reducing the number of patches that need to be
juggled around.

>
> As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)

Honestly curious, why the preference?

>
> Oleg.
>

BTW, while you are here :) What can you say about the
current->sighand->siglock use in handle_singlestep()? Is there any way
to avoid it (or at least not have to take it every single time a
single-stepped uprobe is triggered)? This turned out to be a huge
issue for single-stepped uprobes when scaling to multiple CPUs, even
with all the other things (all the SRCU and lockless VMA lookup) taken
care of.
Oleg Nesterov Aug. 7, 2024, 5:11 p.m. UTC | #3
On 08/07, Andrii Nakryiko wrote:
>
> Yes, I was waiting for more of Peter's comments, but I guess I'll just
> send a v2 today.

OK,

> I'll probably include the SRCU+timeout logic for
> return_instances, and maybe lockless VMA parts as well.

Well, feel free to do what you think is right, but perhaps it would be
better to push this series first? At least 1-4.

As for the lockless VMA lookup, to me this needs more discussion. I didn't read
your conversation with Peter and Suren carefully, but I too have some
concerns. Most probably I am wrong, and until I saw this thread I didn't
even know that vm_area_free() uses call_rcu() if CONFIG_PER_VMA_LOCK,
but still.

> > As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
>
> Honestly curious, why the preference?

Well, you can safely ignore me, but since you have asked ;)

I understand what SRCU does, and years ago I even understood (I hope)
the implementation. More or less the same for rcu_tasks. But as for
the _trace flavour, I simply fail to understand its semantics.

> BTW, while you are here :) What can you say about
> current->sighand->siglock use in handle_singlestep()?

It should die, and this looks simple. I disagree with the patches
from Liao, see the
https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com/
thread, but I agree with the intent.

IMO, we need a simple "bool restore_sigpending" in uprobe_task; it will make the
necessary changes really simple.

(To clarify. In fact I think that a new TIF_ or even PF_ flag makes more sense,
 afaics it can have more users. But I don't think that uprobes can provide enough
 justification for that right now)
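
Roughly something like this (a sketch only, untested; exact placement and
naming to be decided in Liao's thread):

/* Sketch only: remember that uprobe_deny_signal() suppressed
 * TIF_SIGPENDING for the single-step, and restore the flag afterwards
 * without taking current->sighand->siglock. */
struct uprobe_task {
	/* ... existing fields ... */
	bool restore_sigpending;
};

static bool uprobe_deny_signal(void)
{
	struct uprobe_task *utask = current->utask;

	if (utask && utask->state == UTASK_SSTEP &&
	    test_and_clear_thread_flag(TIF_SIGPENDING)) {
		utask->restore_sigpending = true;	/* defer, don't drop */
		return true;
	}
	return false;
}

/* then, in handle_singlestep(), instead of the siglock dance: */
if (utask->restore_sigpending) {
	utask->restore_sigpending = false;
	set_thread_flag(TIF_SIGPENDING);  /* own thread, no siglock needed */
}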

Oleg.
Andrii Nakryiko Aug. 7, 2024, 5:31 p.m. UTC | #4
On Wed, Aug 7, 2024 at 10:11 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/07, Andrii Nakryiko wrote:
> >
> > Yes, I was waiting for more of Peter's comments, but I guess I'll just
> > send a v2 today.
>
> OK,
>
> > I'll probably include the SRCU+timeout logic for
> > return_instances, and maybe lockless VMA parts as well.
>
> Well, feel free to do what you think right, but perhaps it would be
> better to push this series first? at least 1-4.

Ok, I can send those first 4 patches now and hopefully we can land
them soon and move on to the next part. I've also just written up details
about that crash in rb_find_rcu().

>
> As for lockless VMA. To me this needs more discussions. I didn't read

We are still discussing, feel free to join the conversation.

> your conversation with Peter and Suren carefully, but I too have some
> concerns. Most probably I am wrong, and until I saw this thread I didn't
> even know that vm_area_free() uses call_rcu() if CONFIG_PER_VMA_LOCK,
> but still.
>
> > > As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
> >
> > Honestly curious, why the preference?
>
> Well, you can safely ignore me, but since you have asked ;)
>
> I understand what SRCU does, and years ago I even understood (I hope)
> the implementation. More or less the same for rcu_tasks. But as for
> the _trace flavour, I simply fail to understand its semantics.

Ok, I won't try to repeat Paul's explanations. If you are curious, you
can find them in the comments on my previous batch register/unregister API
patches.

>
> > BTW, while you are here :) What can you say about
> > current->sighand->siglock use in handle_singlestep()?
>
> It should die, and this looks simple. I disagree with the patches
> from Liao, see the
> https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com/
> thread, but I agree with the intent.

I wasn't aware of this patch, thanks for mentioning it. Strange that
I, or at least bpf@vger.kernel.org, wasn't CC'ed.

Liao, please cc bpf@ mailing list for future patches like that.

>
> IMO, we need a simple "bool restore_sigpending" in uprobe_task, it will make the
> necessary changes really simple.

The simpler the better; I can't comment on correctness as I don't
understand the logic well enough. Are you going to send a patch with
your bool flag proposal?

>
> (To clarify. In fact I think that a new TIF_ or even PF_ flag makes more sense,
>  afaics it can have more users. But I don't think that uprobes can provide enough
>  justification for that right now)
>
> Oleg.
>
Oleg Nesterov Aug. 7, 2024, 6:24 p.m. UTC | #5
On 08/07, Andrii Nakryiko wrote:
>
> Are you going to send a patch with
> your bool flag proposal?

No, I will wait for the next version from Liao, if he agrees with my
comments. If not, we will continue the discussion in that thread.

Note also another patch from Liao, which removes ->siglock from uprobe_deny_signal():
https://lore.kernel.org/all/20240731095017.1560516-1-liaochang1@huawei.com/

Oleg.
Liao, Chang Aug. 8, 2024, 7:51 a.m. UTC | #6
On 2024/8/8 1:31, Andrii Nakryiko wrote:
> On Wed, Aug 7, 2024 at 10:11 AM Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> On 08/07, Andrii Nakryiko wrote:
>>>
>>> Yes, I was waiting for more of Peter's comments, but I guess I'll just
>>> send a v2 today.
>>
>> OK,
>>
>>> I'll probably include the SRCU+timeout logic for
>>> return_instances, and maybe lockless VMA parts as well.
>>
>> Well, feel free to do what you think right, but perhaps it would be
>> better to push this series first? at least 1-4.
> 
> Ok, I can send those first 4 patches first and hopefully we can land
> them soon and move to the next part. I just also wrote up details
> about that crash in rb_find_rcu().
> 
>>
>> As for lockless VMA. To me this needs more discussions. I didn't read
> 
> We are still discussing, feel free to join the conversation.
> 
>> your conversation with Peter and Suren carefully, but I too have some
>> concerns. Most probably I am wrong, and until I saw this thread I didn't
>> even know that vm_area_free() uses call_rcu() if CONFIG_PER_VMA_LOCK,
>> but still.
>>
>>>> As for 8/8 - I leave it to you and Peter. I'd prefer SRCU though ;)
>>>
>>> Honestly curious, why the preference?
>>
>> Well, you can safely ignore me, but since you have asked ;)
>>
>> I understand what SRCU does, and years ago I even understood (I hope)
>> the implementation. More or less the same for rcu_tasks. But as for
>> the _trace flavour, I simply fail to understand its semantics.
> 
> Ok, I won't try to repeat Paul's explanations. If you are curious you
> can find them in comments to my previous batch register/unregister API
> patches.
> 
>>
>>> BTW, while you are here :) What can you say about
>>> current->sighand->siglock use in handle_singlestep()?
>>
>> It should die, and this looks simple. I disagree with the patches
>> from Liao, see the
>> https://lore.kernel.org/all/20240801082407.1618451-1-liaochang1@huawei.com/
>> thread, but I agree with the intent.
> 
> I wasn't aware of this patch, thanks for mentioning it. Strange that
> me or at least bpf@vger.kernel.org wasn't CC'ed.
> 
> Liao, please cc bpf@ mailing list for future patches like that.

OK, sorry about that.

> 
>>
>> IMO, we need a simple "bool restore_sigpending" in uprobe_task, it will make the
>> necessary changes really simple.
> 

[...]

>>
>> (To clarify. In fact I think that a new TIF_ or even PF_ flag makes more sense,
>>  afaics it can have more users. But I don't think that uprobes can provide enough
>>  justification for that right now)

I also faced the same choice when Oleg suggested adding a new flag to track the
denied signal. Since I haven't encountered scenarios outside of uprobes that
would deny a signal, I'm not confident introducing a new TIF_ flag without
fully understanding the potential impacts.

>>
>> Oleg.
>>