[bpf-next,0/3] Inline two LBR-related helpers

Message ID 20240321180501.734779-1-andrii@kernel.org (mailing list archive)

Message

Andrii Nakryiko March 21, 2024, 6:04 p.m. UTC
Implement inlining of the bpf_get_branch_snapshot() BPF helper using a generic
BPF assembly approach.

Also inline the bpf_get_smp_processor_id() BPF helper, but using
architecture-specific assembly code in the x86-64 JIT compiler, given that
getting the CPU ID is highly architecture-specific.
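
For context, bpf_get_branch_snapshot() itself is conceptually tiny. Below is a
simplified sketch (paraphrased, not the exact kernel source) of the logic that
the verifier-emitted BPF instruction sequence has to mirror:

  /* Simplified sketch of bpf_get_branch_snapshot(); the real helper lives in
   * kernel/trace/bpf_trace.c and reaches the PMU-specific snapshot function
   * (e.g. intel_pmu_snapshot_branch_stack on Intel) through a static call.
   */
  static long branch_snapshot_sketch(void *buf, u32 size, u64 flags)
  {
          const u32 br_entry_size = sizeof(struct perf_branch_entry); /* 24 bytes */
          u32 entry_cnt = size / br_entry_size;

          /* capture LBRs into the caller-provided buffer as early as possible */
          entry_cnt = static_call(perf_snapshot_branch_stack)(buf, entry_cnt);

          if (unlikely(flags))
                  return -EINVAL;
          if (!entry_cnt)
                  return -ENOENT;

          return entry_cnt * br_entry_size; /* number of bytes filled in */
  }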

These two helpers are on the critical direct path to grabbing LBR records from
a BPF program, and inlining them helps save 3 LBR records in
PERF_SAMPLE_BRANCH_ANY mode.

Just to give some visual idea of the effect of these changes (and of inlining
kprobe_multi_link_prog_run(), posted as a separate patch), below is retsnoop's
LBR output (with the --lbr=any flag). I only show the "wasted" records that are
needed to go from the moment some event happened (a kernel function return in
this case) to the BPF program capturing LBR *as the very first thing* it does
(after getting the CPU ID to obtain a temporary buffer).

There are still ways to reduce the number of "wasted" records further; this is
a problem that requires many small and rather independent steps.

fentry mode
===========

BEFORE
------
  [#10] __sys_bpf+0x270                          ->  __x64_sys_bpf+0x18
  [#09] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508684+0x7f
  [#08] bpf_trampoline_6442508684+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#07] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#06] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#05] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508684+0xa1
  [#04] bpf_trampoline_6442508684+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
  [#03] bpf_prog_dc54a596b39d4177_fexit1+0x32    ->  bpf_get_smp_processor_id+0x0
  [#02] bpf_get_smp_processor_id+0xe             ->  bpf_prog_dc54a596b39d4177_fexit1+0x37
  [#01] bpf_prog_dc54a596b39d4177_fexit1+0xe0    ->  bpf_get_branch_snapshot+0x0
  [#00] bpf_get_branch_snapshot+0x13             ->  intel_pmu_snapshot_branch_stack+0x0

AFTER
-----
  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
  [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
  [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0

multi-kprobe mode
=================

BEFORE
------
  [#14] __sys_bpf+0x270                          ->  arch_rethook_trampoline+0x0
  [#13] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#12] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#11] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#10] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#09] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#08] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#07] kprobe_multi_link_exit_handler+0x4       ->  kprobe_multi_link_prog_run+0x0
  [#06] kprobe_multi_link_prog_run+0x2d          ->  migrate_disable+0x0
  [#05] migrate_disable+0x37                     ->  kprobe_multi_link_prog_run+0x32
  [#04] kprobe_multi_link_prog_run+0x58          ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x0
  [#03] bpf_prog_2b455b4f8a8d48c5_kexit+0x32     ->  bpf_get_smp_processor_id+0x0
  [#02] bpf_get_smp_processor_id+0xe             ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x37
  [#01] bpf_prog_2b455b4f8a8d48c5_kexit+0x82     ->  bpf_get_branch_snapshot+0x0
  [#00] bpf_get_branch_snapshot+0x13             ->  intel_pmu_snapshot_branch_stack+0x0

AFTER
-----
  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x0
  [#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3     ->  intel_pmu_snapshot_branch_stack+0x0


For the default --lbr mode (PERF_SAMPLE_BRANCH_ANY_RETURN), interestingly
enough, multi-kprobe is *less* wasteful (by one function call):

fentry mode
===========

BEFORE
------
  [#04] __sys_bpf+0x270                          ->  __x64_sys_bpf+0x18
  [#03] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508684+0x7f
  [#02] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#01] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508684+0xa1
  [#00] bpf_get_smp_processor_id+0xe             ->  bpf_prog_dc54a596b39d4177_fexit1+0x37

AFTER
-----
  [#03] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#02] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#01] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#00] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1

multi-kprobe mode
=================

BEFORE
------
  [#03] __sys_bpf+0x270                          ->  arch_rethook_trampoline+0x0
  [#02] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#01] migrate_disable+0x37                     ->  kprobe_multi_link_prog_run+0x32
  [#00] bpf_get_smp_processor_id+0xe             ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x37

AFTER
-----
  [#02] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#01] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#00] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36

Andrii Nakryiko (3):
  bpf: make bpf_get_branch_snapshot() architecture-agnostic
  bpf: inline bpf_get_branch_snapshot() helper
  bpf,x86: inline bpf_get_smp_processor_id() on x86-64

 arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
 kernel/bpf/verifier.c       | 37 +++++++++++++++++++++++++++++++++++++
 kernel/trace/bpf_trace.c    |  4 ----
 3 files changed, 62 insertions(+), 5 deletions(-)

Comments

Alexei Starovoitov March 21, 2024, 11:46 p.m. UTC | #1
On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
>
>
> There are still ways to reduce number of "wasted" records further, this is
> a problem that requires many small and rather independent steps.

I feel this is a wrong path to follow.
I think it would be better to introduce a flag for kprobe/fentry
to do perf_snapshot_branch_stack() as early as possible
and then bpf prog can copy these 16 or 32 8-byte entries at
its leisure.
Hacking all over the kernel and requiring bpf prog to call
bpf_get_branch_snapshot() in the first few instructions
looks like self-inflicted pain.
Andrii Nakryiko March 22, 2024, 4:45 p.m. UTC | #2
On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> >
> > There are still ways to reduce number of "wasted" records further, this is
> > a problem that requires many small and rather independent steps.
>
> I feel this is a wrong path to follow.
> I think it would be better to introduce a flag for kprobe/fentry
> to do perf_snapshot_branch_stack() as early as possible
> and then bpf prog can copy these 16 or 32 8-byte entries at
> its leasure.

This is basically how Song started when he was adding this feature a
few years ago. And I feel like we discussed this and decided that it
would be cleaner to let the BPF program decide when (and whether) to
get LBR, based on conditions. It still feels like the right tradeoff.

Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and
that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this
doesn't have to happen so fast. BPF program can evaluate some
conditions and grab LBR optionally, saving the overhead.
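
To illustrate that pattern, here is a hypothetical BPF-side sketch (not
retsnoop's actual code; MAX_CPUS, the section target, and the placement of the
condition check are made up for illustration):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define MAX_CPUS 256
  #define MAX_LBR  32

  /* global arrays indexed by CPU ID serve as temporary per-CPU storage */
  struct perf_branch_entry lbrs[MAX_CPUS][MAX_LBR];
  long lbr_sz[MAX_CPUS];

  SEC("fexit/__sys_bpf")
  int BPF_PROG(capture_lbr)
  {
          __u32 cpu = bpf_get_smp_processor_id();

          if (cpu >= MAX_CPUS)
                  return 0;
          /* in ANY_RETURN mode a program-defined condition could be evaluated
           * here first; in ANY mode LBR must be grabbed immediately to avoid
           * wasting records on the program's own branches
           */
          lbr_sz[cpu] = bpf_get_branch_snapshot(&lbrs[cpu], sizeof(lbrs[cpu]), 0);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";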

With a prog flag saying "kernel should capture LBR ASAP", we:
  a) lose this flexibility to decide whether and when to grab LBR;
  b) pay the overhead regardless of whether LBR is ever actually used for any
given prog invocation;
  c) have to dedicate pretty large (32 * 24 = 768 bytes) per-CPU
buffers for something that is pretty niche (though hugely valuable
when needed, of course);
  d) each program type that supports bpf_get_branch_snapshot() helper
needs to implement this logic in their corresponding
`bpf_prog_run_xxx()` running helpers, which is more than a few places.

Now, let's see how much we can also realistically save with this approach.

For fentry, we do save a few (2) entries, indeed. With changes in this
patch we are at:

  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
  [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
  [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0

With flag and kernel support, we'll be at something like

  [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
  [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
  [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
  [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
  [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
  [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0

So we get 2 extra LBRs at the expense of all those downsides I mentioned above.

But for kretprobe-multi it's even worse (just 1). With changes in this
patch set, we are at:

  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  bpf_prog_2b455b4f8a8d48c5_kexit+0x0
  [#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3     ->  intel_pmu_snapshot_branch_stack+0x0

With custom flag support:

  [#10] __sys_bpf+0xdfc                          ->  arch_rethook_trampoline+0x0
  [#09] arch_rethook_trampoline+0x27             ->  arch_rethook_trampoline_callback+0x0
  [#08] arch_rethook_trampoline_callback+0x31    ->  rethook_trampoline_handler+0x0
  [#07] rethook_trampoline_handler+0x6f          ->  fprobe_exit_handler+0x0
  [#06] fprobe_exit_handler+0x3d                 ->  rcu_is_watching+0x0
  [#05] rcu_is_watching+0x17                     ->  fprobe_exit_handler+0x42
  [#04] fprobe_exit_handler+0xb4                 ->  kprobe_multi_link_exit_handler+0x0
  [#03] kprobe_multi_link_exit_handler+0x31      ->  migrate_disable+0x0
  [#02] migrate_disable+0x37                     ->  kprobe_multi_link_exit_handler+0x36
  [#01] kprobe_multi_link_exit_handler+0x5c      ->  intel_pmu_snapshot_branch_stack+0x0

We save just 1 extra LBR record.

For the PERF_SAMPLE_BRANCH_ANY_RETURN mode there will be no savings
at all. Is it really worth it?

Any other improvements (like somehow flattening the rethook call path)
will benefit both approaches equally.

> Hacking all over the kernel and requiring bpf prog to call
> bpf_get_branch_snapshot() in the first few instructions
> looks like self inflicted pain.

While inlining bpf_get_branch_snapshot() benefits only the LBR use
case, it's a rather typical BPF helper inlining procedure that we do for a
lot of helpers, so it's not exactly a hack or anything, just an
optimization.

But inlining bpf_get_smp_processor_id() goes way beyond LBR; it's a
pretty frequently used helper employed to implement various
BPF-program-specific per-CPU usages (recursion protection, temporary
storage, or just plain replacing BPF_MAP_TYPE_PERCPU_ARRAY with global
variables, which is already a somewhat faster approach and now will be
even faster). And the implementation is well-contained.
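
For illustration, a minimal sketch of the kind of per-CPU usage meant here: a
recursion guard built on a CPU-indexed global array instead of a per-CPU array
map (a fragment only, assuming the usual vmlinux.h/libbpf includes; the names
are made up):

  #define MAX_CPUS 256

  static int busy[MAX_CPUS];      /* one recursion-guard slot per CPU */

  SEC("kprobe/somewhere")
  int guarded_prog(struct pt_regs *ctx)
  {
          __u32 cpu = bpf_get_smp_processor_id();

          if (cpu >= MAX_CPUS || busy[cpu])
                  return 0;       /* already running on this CPU, bail out */
          busy[cpu] = 1;

          /* ... actual work, free to use other CPU-indexed scratch arrays ... */

          busy[cpu] = 0;
          return 0;
  }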
Alexei Starovoitov March 25, 2024, 2:05 a.m. UTC | #3
On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > >
> > > There are still ways to reduce number of "wasted" records further, this is
> > > a problem that requires many small and rather independent steps.
> >
> > I feel this is a wrong path to follow.
> > I think it would be better to introduce a flag for kprobe/fentry
> > to do perf_snapshot_branch_stack() as early as possible
> > and then bpf prog can copy these 16 or 32 8-byte entries at
> > its leasure.
>
> This is basically how Song started when he was adding this feature a
> few years ago. And I feel like we discussed this and decided that it
> would be cleaner to let the BPF program decide when (and whether) to
> get LBR, based on conditions. It still feels like a right tradeoff.

Right, we discussed it back then, and at that time it was about
collecting stacks.
What's different now is that you want to collect all types of branches
in retsnoop, including plain 'jmp pc+10' and conditional jmps.
This is not something that C code can control.
always_inline in C and inlining by the verifier reduce call frames,
but they may have both positive and negative effects when
all branches are collected.
Hence __always_inline in kprobe_multi_link_prog_run()
is a leap of faith with the assumption that the compiler won't
add jmps before calling into the prog,
but lots of different compiler flags add instrumentation:
kasan, stack protector, security mitigations that count call depth, etc.


> Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and
> that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this
> doesn't have to happen so fast. BPF program can evaluate some
> conditions and grab LBR optionally, saving the overhead.
>
> With prog flag saying "kernel should capture LBR ASAP", we:

I was suggesting to use a per-attachment flag.
And kprobe is a lost cause.
I would do it for fentry only, where
we can generate a 'save LBR' call first thing in the bpf trampoline.

>   a) lose this flexibility to decide whether and when to grab LBR;
>   b) pay overhead regardless if LBR is ever actually used for any
> given prog invocation;

when retsnoop attaches a prog the prog gotta call that 'save LBR'
as soon as possible without any branches.
So per-attach flag is not really a downside.

>   c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU
> buffers for something that is pretty niche (though hugely valuable
> when needed, of course);

I wouldn't worry about such a tiny buffer.

>   d) each program type that supports bpf_get_branch_snapshot() helper
> needs to implement this logic in their corresponding
> `bpf_prog_run_xxx()` running helpers, which is more than a few places.

I think a new kfunc that copies from the buffer will do.
Nothing needs to change.
Maybe bpf_get_branch_snapshot() can be made smart too,
but that is optional.

> Now, let's see how much we can also realistically save with this approach.
>
> For fentry, we do save a few (2) entries, indeed. With changes in this
> patch we are at:
>
>   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
>   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
>   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
>   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
>   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
>   [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
>   [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
>   [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0
>
> With flag and kernel support, we'll be at something like
>
>   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
>   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
>   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
>   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
>   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
>   [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0

With the flag, migrate_disable and prog_enter* will be gone.
It will be only bpf_trampoline_ and intel_pmu_snapshot_branch_stack.
If we try hard we can inline
wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
Then it will be bpf_trampoline_ only, and
that's as minimal as it can get. One entry.
Hacking all over the kernel with inlining won't get anywhere close.

> For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings
> at all. Is it really worth it?

any_return is ok today.
Especially with fentry.

> While inlining bpf_get_branch_snapshot() does benefit only LBR use
> case, it's a rather typical BPF helper inlining procedure we do for a
> lot of helpers, so it's not exactly a hack or anything, just an
> optimization.

Inlining bpf_get_branch_snapshot() may be ok,
but the way you're doing it is not a clear win.
Replacing size / 24 (which the compiler optimizes into a mul)
with an actual divide and adding an extra mul will be slower.
div+mul vs mul is quite a difference.
How noticeable that is is questionable,
but from an inlining perspective it doesn't feel right to do
"inline to avoid an extra frame" instead of
"inline to improve performance".

> But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a
> pretty frequently used helper used to implement various
> BPF-program-specific per-CPU usages (recursion protection, temporary
> storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global
> variables, which is already a bit faster approach, and now will be
> even faster). And the implementation is well-contained.

Agree that inlining bpf_get_smp_processor_id() is a good thing,
but please do it cleanly so that per-cpu accessors can be
reused in other places.
I'll reply with details in the other thread.
Andrii Nakryiko March 25, 2024, 5:20 p.m. UTC | #4
On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > >
> > > >
> > > > There are still ways to reduce number of "wasted" records further, this is
> > > > a problem that requires many small and rather independent steps.
> > >
> > > I feel this is a wrong path to follow.
> > > I think it would be better to introduce a flag for kprobe/fentry
> > > to do perf_snapshot_branch_stack() as early as possible
> > > and then bpf prog can copy these 16 or 32 8-byte entries at
> > > its leasure.
> >
> > This is basically how Song started when he was adding this feature a
> > few years ago. And I feel like we discussed this and decided that it
> > would be cleaner to let the BPF program decide when (and whether) to
> > get LBR, based on conditions. It still feels like a right tradeoff.
>
> Right we discussed it back then and at that time it was about
> collecting stacks.
> What's different now is you want to collect all types of branches

I was using --lbr=any from the get-go (and it actually was a
motivation for the entire feature, because we were lacking visibility
inside some large function with lots of conditions), but yes, we had
to live with a bunch of wasted entries, which on Intel CPUs with 32
entries was tolerable, but on AMD is now useless (we get only 1-2
useful entries right now).

> in retrnoop including plain 'jmp pc+10' and conditional jmps.
> This is not something that C code can control.
> always_inline in C and inline by the verifier reduce call frames,
> but they may have both positive and negative effect when
> all branches are collected.
> Hence __always_inline in kprobe_multi_link_prog_run()
> is a leap of faith with assumptions that compiler won't
> add jmps before calling into prog,
> but lots of different compiler flags add instrumentation:
> kasan, stack protector, security mitigation that count call depth, etc.
>

I understand that, but at least for now, in practice it does help. I
have some more changes in the fprobe/ftrace space which reduce the waste
of LBR entries some more (and would be beneficial regardless of this
custom flag support we are discussing), and there is some really good
news with aggressive inlining: a) I get only 4 entries wasted for
multi-kprobe (7 for fentry, still not bad, but that one is harder to
optimize) and b) I get a +25% speedup for multi-kprobes, which seems
like a nice side benefit.

So I agree that none of this is any guarantee, but it also is not some
binding UAPI, so it seems worth doing. And as I pointed out above, I don't
think I see any regression in performance, rather the opposite.

>
> > Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and
> > that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this
> > doesn't have to happen so fast. BPF program can evaluate some
> > conditions and grab LBR optionally, saving the overhead.
> >
> > With prog flag saying "kernel should capture LBR ASAP", we:
>
> I was suggesting to use per attachment flag.
> And kprobe is a lost cause.
> I would do it for fentry only where
> we can generate 'save LBR' call first thing in the bpf trampoline.
>

See above, I get down to just 4 unavoidable LBR entries wasted with
multi-kprobe, all without a custom flag anywhere.

I just really don't think it's worth it to complicate the trampoline just
for this; we'll save at most 1-2 LBR entries, while inlining
bpf_get_branch_snapshot() gets basically all the same benefit, but
across all supported program types.


> >   a) lose this flexibility to decide whether and when to grab LBR;
> >   b) pay overhead regardless if LBR is ever actually used for any
> > given prog invocation;
>
> when retsnoop attaches a prog the prog gotta call that 'save LBR'
> as soon as possible without any branches.
> So per-attach flag is not really a downside.

for retsnoop, yes, but only if this is supported in multi-kprobe,
which is the main mode. But see above, I just don't think we have to
do this to get almost all the benefit. I just need to inline
bpf_get_branch_snapshot().

>
> >   c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU
> > buffers for something that is pretty niche (though hugely valuable
> > when needed, of course);
>
> I wouldn't worry about such a tiny buffer.
>
> >   d) each program type that supports bpf_get_branch_snapshot() helper
> > needs to implement this logic in their corresponding
> > `bpf_prog_run_xxx()` running helpers, which is more than a few places.
>
> I think new kfunc that copies from the buffer will do.
> Nothing needs to change.
> Maybe bpf_get_branch_snapshot() can be made smart too,
> but that is optional.
>

I meant that fentry would need to implement this LBR capture in the BPF
trampoline, multi-kprobe in its kprobe_multi_link_prog_run, kprobe in
yet another helper. And so on, we have many targeted "runner"
helpers for specific program types.

And just implementing this for fentry/fexit is not very useful.

> > Now, let's see how much we can also realistically save with this approach.
> >
> > For fentry, we do save a few (2) entries, indeed. With changes in this
> > patch we are at:
> >
> >   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
> >   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
> >   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
> >   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
> >   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
> >   [#02] __bpf_prog_enter_recur+0x43              ->  bpf_trampoline_6442508829+0xa1
> >   [#01] bpf_trampoline_6442508829+0xad           ->  bpf_prog_dc54a596b39d4177_fexit1+0x0
> >   [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101   ->  intel_pmu_snapshot_branch_stack+0x0
> >
> > With flag and kernel support, we'll be at something like
> >
> >   [#07] __sys_bpf+0xdfc                          ->  __x64_sys_bpf+0x18
> >   [#06] __x64_sys_bpf+0x1a                       ->  bpf_trampoline_6442508829+0x7f
> >   [#05] bpf_trampoline_6442508829+0x9c           ->  __bpf_prog_enter_recur+0x0
> >   [#04] __bpf_prog_enter_recur+0x9               ->  migrate_disable+0x0
> >   [#03] migrate_disable+0x37                     ->  __bpf_prog_enter_recur+0xe
> >   [#02] __bpf_prog_enter_recur+0x43              ->  intel_pmu_snapshot_branch_stack+0x0
>
> with flag migrate_disable and prog_enter* will be gone.

I don't think we can get rid of migrate_disable; we need to make sure
we are freezing LBR on the CPU on which the BPF program will run. So it's
either preempt_disable or migrate_disable.

Yes, __bpf_prog_enter_recur() won't be there if we code-generate code
for BPF trampoline (though, ugh, who wants more code generation than
necessary, but that's an aside). But then see above, migrate_disable
will have to be called before __bpf_prog_enter_recur(), which is just
more opaque code generation than necessary.

> It will be only bpf_trampoline_ and intel_pmu_snapshot_branch_stack.

Note also __x64_sys_bpf+0x1a; it's an artifact of how fexit is
implemented: we call into the original function and it returns into the
trampoline. So this seems unavoidable as well without completely
changing how the trampoline works for fexit. Multi-kprobe, actually,
conveniently avoids this problem.

> If we try hard we can inline
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> Then it will be bpf_trampoline_ only and
> that's as minimal as it can get. One entry.
> Hacking all over the kernel with inline won't get anywhere close.

It's not like I'm doing something technically wrong, just enabling
more inlining. And it's not really all over the kernel, just a few
targeted places that deal with LBR and running BPF programs.

>
> > For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings
> > at all. Is it really worth it?
>
> any_return is ok today.
> Especially with fentry.

Yes, even today we are at 2-3 entries, I'm not too worried about this
in general.

>
> > While inlining bpf_get_branch_snapshot() does benefit only LBR use
> > case, it's a rather typical BPF helper inlining procedure we do for a
> > lot of helpers, so it's not exactly a hack or anything, just an
> > optimization.
>
> Inlining bpf_get_branch_snapshot() may be ok,
> but the way you're doing is not a clear win.
> Replacing size / 24 (that compiler optimize into mul)
> with actual divide and adding extra mul will be slower.
> div+mul vs mul is quite a difference.
> How noticable is that is questionable,
> but from inlining perspective it doesn't feel right to do
> "inline to avoid extra frame" instead of
> "inline to improve performance".

Yes, I saw this division-through-multiplication trick. I can
replicate that as well, but it will be pretty hard to understand, so I
thought it was probably not worth it. Note that
bpf_get_branch_snapshot() is not some sort of performance-critical
helper; if you are calling it on some super frequent kprobe/fentry,
you are paying a big price just for copying 700+ bytes (and
probably a bunch of other stuff).

So I think it's a wrong tradeoff to optimize for performance.
bpf_get_branch_snapshot() is about information (most complete LBR),
and that's why the inlining. I know it's a bit unconventional compared
to other inlining cases, but it's still a valid objection, no?

>
> > But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a
> > pretty frequently used helper used to implement various
> > BPF-program-specific per-CPU usages (recursion protection, temporary
> > storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global
> > variables, which is already a bit faster approach, and now will be
> > even faster). And the implementation is well-contained.
>
> Agree that inlining bpf_get_smp_processor_id() is a good thing,
> but please do it cleanly so that per-cpu accessors can be
> reused in other places.
> I'll reply with details in the other thread.

Agreed, internal special instruction makes sense, replied on that patch as well.
Alexei Starovoitov March 26, 2024, 3:13 a.m. UTC | #5
On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > >
> > > > >
> > > > > There are still ways to reduce number of "wasted" records further, this is
> > > > > a problem that requires many small and rather independent steps.
> > > >
> > > > I feel this is a wrong path to follow.
> > > > I think it would be better to introduce a flag for kprobe/fentry
> > > > to do perf_snapshot_branch_stack() as early as possible
> > > > and then bpf prog can copy these 16 or 32 8-byte entries at
> > > > its leasure.
> > >
> > > This is basically how Song started when he was adding this feature a
> > > few years ago. And I feel like we discussed this and decided that it
> > > would be cleaner to let the BPF program decide when (and whether) to
> > > get LBR, based on conditions. It still feels like a right tradeoff.
> >
> > Right we discussed it back then and at that time it was about
> > collecting stacks.
> > What's different now is you want to collect all types of branches
>
> I was using --lbr=any

and that's probably something to fix.
I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
To figure out the path to return in the code
ARCH_LBR_JCC      |
ARCH_LBR_REL_CALL |
ARCH_LBR_IND_CALL |
ARCH_LBR_RETURN

might be good enough and there won't be a need to do
inlining in odd places just to avoid a tail jmp.

> do this to get almost all the benefit. I just need to inline
> bpf_get_branch_snapshot().

If that is the only one that needs inlining then fine,
but I really don't like to always_inline
kprobe_multi_link_prog_run().
A day goes by and somebody will send a patch
to save 500 bytes of kernel .text by removing always_inline.
The argument that it's there to help a user space tool that
wants to do lbr=all instead of excluding rel_jmp
won't look good.

>
> I don't think we can get rid of migrate_disable, we need to make sure
> we are freezing LBR on the CPU on which BPF program will run. So it's
> either preempt_disable or migrate_disable.

we cannot extend preempt disable across the prog
and migrate_disable won't really help, since
there could be another prog on the same cpu
doing the same "save lbr" action in a different hook
that will trash per-cpu scratch space.

But we don't need either migrate_disable or preempt_disable.
We can have a 32*24 byte buffer per attach point.
In case of fentry it can be in bpf_trampoline or in bpf_link
(I didn't analyze pros/cons too far) and
fentry will only do the single "call intel_pmu_snapshot_branch_stack"
with that address.
That's a trivial addition to arch_prepare_bpf_trampoline.

Then bpf prog will take entries from link, since it has access to it.

Same thing for kprobes. As soon as it's triggered it will
call intel_pmu_snapshot_branch_stack.
Should be simple to add.

Recursion can overwrite that per-attach buffer, but
lbr is screwed anyway if we recursed. So not a concern.
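
A hypothetical sketch of the per-attachment buffer idea described above (the
struct and field names are made up, not existing kernel API):

  /* Hypothetical, for illustration only. */
  struct lbr_scratch {
          struct perf_branch_entry entries[32];   /* 32 * 24 = 768 bytes */
          unsigned int cnt;                       /* entries actually captured */
  };

  /* Conceptually, the trampoline or kprobe handler would do, first thing:
   *
   *     scratch->cnt = static_call(perf_snapshot_branch_stack)(scratch->entries, 32);
   *
   * where 'scratch' lives in the bpf_link / bpf_trampoline of that attachment,
   * and a new kfunc would later copy the entries out for the program.
   */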

> Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is
> implemented, we call into original function and it returns into
> trampoline. So this seems unavoidable as well without completely
> changing how trampoline works for fexit. Multi-kprobe actually,
> conveniently, avoids this problem.

Definitely do not want to redesign that to help retsnoop save an lbr entry.

> > Inlining bpf_get_branch_snapshot() may be ok,
> > but the way you're doing is not a clear win.
> > Replacing size / 24 (that compiler optimize into mul)
> > with actual divide and adding extra mul will be slower.
> > div+mul vs mul is quite a difference.
> > How noticable is that is questionable,
> > but from inlining perspective it doesn't feel right to do
> > "inline to avoid extra frame" instead of
> > "inline to improve performance".
>
> Yes, I saw this division-through-multiplication division. I can
> replicate that as well, but it will be pretty hard to understand, so I
> thought it is probably not worth it. Note that
> bpf_get_branch_snapshot() is not some sort of performance-critical
> helper, if you are calling it on some super frequent kprobe/fentry,
> you are paying a lot of price just for copying 700+ bytes (and
> probably a bunch of other stuff).

div is the slowest instruction.
On Skylake it takes 57 uops and 40-90 cycles, while mul takes 3 cycles.
An L1 cache hit is 1-2 cycles.
So it might be faster to copy 768 bytes than to do a single div.

I still think that adding a "call intel_pmu_snapshot_branch_stack"
to fentry and kprobes is a simpler and cleaner solution that eliminates
all the guesswork of compiler inlining and optimizations.

We can potentially optimize it further.
Since arch_prepare_bpf_trampoline() is arch specific,
for x86 we can inline:
        local_irq_save(flags);
        __intel_pmu_disable_all(false);
        __intel_pmu_lbr_disable();
into generated trampoline (since above is just 5-6 instructions)
and call into __intel_pmu_snapshot_branch_stack.
Andrii Nakryiko March 26, 2024, 4:50 p.m. UTC | #6
On Mon, Mar 25, 2024 at 8:13 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > > > >
> > > > > >
> > > > > > There are still ways to reduce number of "wasted" records further, this is
> > > > > > a problem that requires many small and rather independent steps.
> > > > >
> > > > > I feel this is a wrong path to follow.
> > > > > I think it would be better to introduce a flag for kprobe/fentry
> > > > > to do perf_snapshot_branch_stack() as early as possible
> > > > > and then bpf prog can copy these 16 or 32 8-byte entries at
> > > > > its leasure.
> > > >
> > > > This is basically how Song started when he was adding this feature a
> > > > few years ago. And I feel like we discussed this and decided that it
> > > > would be cleaner to let the BPF program decide when (and whether) to
> > > > get LBR, based on conditions. It still feels like a right tradeoff.
> > >
> > > Right we discussed it back then and at that time it was about
> > > collecting stacks.
> > > What's different now is you want to collect all types of branches
> >
> > I was using --lbr=any
>
> and that's probably something to fix.

Fix in the sense of adjusting or adding another generic
PERF_SAMPLE_BRANCH_xxx value? Or do you mean stop using --lbr=any mode?

> I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> To figure out the path to return in the code
> ARCH_LBR_JCC      |
> ARCH_LBR_REL_CALL |
> ARCH_LBR_IND_CALL |
> ARCH_LBR_RETURN
>
> might be good enough and there won't be a need to do
> inling in odd places just to avoid tail jmp.

retsnoop supports all modes perf exposes generically (see [0]); I
believe I tried all of them and keep gravitating back to --lbr=any as
the most useful, unfortunately.

But it's ok, let's put this particular __always_inline on pause for
now; it's one more LBR record, not the end of the world.

  [0] https://github.com/anakryiko/retsnoop/blob/master/src/retsnoop.c#L269-L280

>
> > do this to get almost all the benefit. I just need to inline
> > bpf_get_branch_snapshot().
>
> If that is the only one that needs inling then fine,

yes, it's one of the most important ones, I'll take it :)

> but I really don't like to always_inline
> kprobe_multi_link_prog_run().
> A day goes by and somebody will send a patch
> to save 500 bytes of kernel .text by removing always_inline.
> The argument that it's there to help a user space tool that
> wants to do lbr=all instead of excluding rel_jmp
> won't look good.
>

There is a lot of __always_inline in rethook/ftrace code, as well as
some in BPF code. I don't remember people trying to roll this
back, so this seems a bit overdramatic. But ok, if you think it will
be problematic to reject such hypothetical patches, let's put
kprobe_multi_link_prog_run inlining aside for now.

> >
> > I don't think we can get rid of migrate_disable, we need to make sure
> > we are freezing LBR on the CPU on which BPF program will run. So it's
> > either preempt_disable or migrate_disable.
>
> we cannot extend preempt disable across the prog
> and migrate_disable won't really help, since
> there could be another prog on the same cpu
> doing the same "save lbr" action in a different hook
> that will trash per-cpu scratch space.
>
> But we don't need to either migrate_disable or preempt_disable.
> We can have a 32*24 byte buffer per attach point.

I'm missing how we can get away from having a per-CPU buffer. LBRs on
different CPU cores are completely independent and one BPF prog
attachment can be simultaneously running on many CPUs.

Or do you mean per-CPU allocation for each attach point?

We can do LBR capture before migrate_disable calls and still have
correct data most of the time (hopefully), though, so yeah, it can be
another improvement (but with inlining of those two BPF helpers I'm
not sure we have to do this just yet).

> In case of fentry it can be in bpf_trampoline or in bpf_link
> (I didn't analyze pros/cons too far) and
> fentry will only do the single "call intel_pmu_snapshot_branch_stack"
> with that address.
> That's a trivial addition to arch_prepare_bpf_trampoline.
>
> Then bpf prog will take entries from link, since it has access to it.
>
> Same thing for kprobes. As soon as it's triggered it will
> call intel_pmu_snapshot_branch_stack.
> Should be simple to add.
>
> Recursion can overwrite that per-attach buffer, but
> lbr is screwed anyway if we recursed. So not a concern.
>
> > Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is
> > implemented, we call into original function and it returns into
> > trampoline. So this seems unavoidable as well without completely
> > changing how trampoline works for fexit. Multi-kprobe actually,
> > conveniently, avoids this problem.
>
> Definitely do not want to redesign that to help retsnoop save an lbr entry.

Yep.

>
> > > Inlining bpf_get_branch_snapshot() may be ok,
> > > but the way you're doing is not a clear win.
> > > Replacing size / 24 (that compiler optimize into mul)
> > > with actual divide and adding extra mul will be slower.
> > > div+mul vs mul is quite a difference.
> > > How noticable is that is questionable,
> > > but from inlining perspective it doesn't feel right to do
> > > "inline to avoid extra frame" instead of
> > > "inline to improve performance".
> >
> > Yes, I saw this division-through-multiplication division. I can
> > replicate that as well, but it will be pretty hard to understand, so I
> > thought it is probably not worth it. Note that
> > bpf_get_branch_snapshot() is not some sort of performance-critical
> > helper, if you are calling it on some super frequent kprobe/fentry,
> > you are paying a lot of price just for copying 700+ bytes (and
> > probably a bunch of other stuff).
>
> div is the slowest instruction.
> On skylake it takes 57 uops and 40-90 cycles while mul takes 3 cycles.
> L1 cache is 1-2.
> So it might be faster to copy 768 bytes than do a single div.

Ok, I can add the div-through-multiplication code to keep it 1:1 w/
compiled helper code, no problem.

>
> I still think that adding call intel_pmu_snapshot_branch_stack
> to fenty and kprobes is a simpler and cleaner solution that eliminates
> all guess work of compiler inlining and optimizations.
>
> We can potentially optimize it further.
> Since arch_prepare_bpf_trampoline() is arch specific,
> for x86 we can inline:
>         local_irq_save(flags);
>         __intel_pmu_disable_all(false);
>         __intel_pmu_lbr_disable();
> into generated trampoline (since above is just 5-6 instructions)
> and call into __intel_pmu_snapshot_branch_stack.

Let's keep it as plan B, given this gets into gnarly internals of
Intel-specific x86-64 code.
Alexei Starovoitov March 27, 2024, 9:59 p.m. UTC | #7
On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Fix in the sense to adjust or add another generic
> PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode?
>
> > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> > To figure out the path to return in the code
> > ARCH_LBR_JCC      |
> > ARCH_LBR_REL_CALL |
> > ARCH_LBR_IND_CALL |
> > ARCH_LBR_RETURN
> >
> > might be good enough and there won't be a need to do
> > inling in odd places just to avoid tail jmp.
>
> retsnoop supports all modes perf exposes generically (see [0]), I
> believe I tried all of them and keep gravitating back to --lbr=any as
> most useful, unfortunately.

I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND,
which I suspect will exclude ARCH_LBR_REL_JMP
and will avoid counting normal gotos.
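
As a rough illustration of that branch filter on the perf side (not retsnoop's
actual code; the event setup is simplified):

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Open a per-CPU event whose job is to keep LBR armed with the suggested
   * filter: calls and conditional branches, kernel side only.
   */
  static int open_lbr_event(int cpu)
  {
          struct perf_event_attr attr = {
                  .type = PERF_TYPE_HARDWARE,
                  .config = PERF_COUNT_HW_CPU_CYCLES,
                  .size = sizeof(attr),
                  .sample_type = PERF_SAMPLE_BRANCH_STACK,
                  .branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
                                        PERF_SAMPLE_BRANCH_ANY_CALL |
                                        PERF_SAMPLE_BRANCH_COND,
          };

          return syscall(SYS_perf_event_open, &attr, -1 /* pid */, cpu,
                         -1 /* group_fd */, 0 /* flags */);
  }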

> I'm missing how we can get away from having a per-CPU buffer. LBRs on
> different CPU cores are completely independent and one BPF prog
> attachment can be simultaneously running on many CPUs.
>
> Or do you mean per-CPU allocation for each attach point?
>
> We can do LBR capture before migrate_disable calls and still have
> correct data most of the time (hopefully), though, so yeah, it can be
> another improvement (but with inlining of those two BPF helpers I'm
> not sure we have to do this just yet).

I meant a 32*24 byte buffer per attachment.
Doing per-attachment per-cpu might not scale?
It's certainly cleaner with per-cpu though.
With a single per-attach buffer the assumption was that the different
cpus will likely take the same path towards that kprobe.
retsnoop doesn't care which cpu it collected that stack trace from.
It cares about the code path and it will be there.
We can make the whole thing configurable.
bpf_link_attach would specify a buffer or per-cpu or
a buffer right in the bpf map array that should be used
to store the lbr trace.
Andrii Nakryiko March 28, 2024, 10:53 p.m. UTC | #8
On Wed, Mar 27, 2024 at 2:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > Fix in the sense to adjust or add another generic
> > PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode?
> >
> > > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
> > > To figure out the path to return in the code
> > > ARCH_LBR_JCC      |
> > > ARCH_LBR_REL_CALL |
> > > ARCH_LBR_IND_CALL |
> > > ARCH_LBR_RETURN
> > >
> > > might be good enough and there won't be a need to do
> > > inling in odd places just to avoid tail jmp.
> >
> > retsnoop supports all modes perf exposes generically (see [0]), I
> > believe I tried all of them and keep gravitating back to --lbr=any as
> > most useful, unfortunately.
>
> I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND
> which I suspect will exclude ARCH_LBR_REL_JMP
> and will avoid counting normal goto-s.

This would be equivalent to passing `--lbr=any_call --lbr=cond` to
retsnoop. And yes, you are right that it doesn't record unconditional
jumps, saving a few more LBR frames. The problem is that it makes it
super hard to follow what's going on without disassembling all
relevant functions down to assembly and following *very carefully* to
understand the flow of logic. This is normally absolutely unnecessary
with --lbr=any (unless DWARF line info is screwed up, of course).
--lbr=any allows one to understand C statement-level code flow based on
file:line information alone.

So, in summary, yes `--lbr=any_call --lbr=cond` is a good last resort
option if a few more LBR records are necessary. But it doesn't feel
like something I'd recommend typical users to start with.

>
> > I'm missing how we can get away from having a per-CPU buffer. LBRs on
> > different CPU cores are completely independent and one BPF prog
> > attachment can be simultaneously running on many CPUs.
> >
> > Or do you mean per-CPU allocation for each attach point?
> >
> > We can do LBR capture before migrate_disable calls and still have
> > correct data most of the time (hopefully), though, so yeah, it can be
> > another improvement (but with inlining of those two BPF helpers I'm
> > not sure we have to do this just yet).
>
> I meant 32*24 buffer per attachment.
> Doing per-attachment per-cpu might not scale?
> It's certainly cleaner with per-cpu though.
> With a single per-attach buffer the assumption was that the different
> cpus will likely take the same path towards that kprobe.

This is a rather bold assumption. It might be true in a lot of cases,
but certainly not always. Think about tracing __sys_bpf for errors.
There could be many simultaneous bpf() syscalls in the system, some
returning an expected -ENOENT for map element iteration (so not
really interesting), but some resulting in confusing failures (and
thus "interesting"). It's even worse with some file system related
functions.

My point is that in retsnoop I don't want to rely on these
assumptions. I won't stop anyone trying to add this functionality in
the kernel, but I don't see retsnoop relying on something like that.

> retsnoop doesn't care which cpu it collected that stack trace from.
> It cares about the code path and it will be there.

It does, retsnoop has a bunch of filters to narrow down specific
conditions, where process information is taken into account, among
other things. And these filtering capabilities will just grow over
time.

> We can make the whole thing configurable.
> bpf_link_attach would specify a buffer or per-cpu or
> a buffer right in the bpf map array that should be used
> to store the lbr trace.

I'm pretty happy with existing functionality and don't really need new
APIs. I'll be posting a few more patches improving performance (not
just LBR usage) over the next few days. With those I get close enough:
4 wasted LBR records with --lbr=any.

Thanks for the brainstorming; the internal per-CPU instruction
implementation turned out pretty well, and I'll be sending a patch set
soon.