
[bpf-next] bpf: mark kprobe_multi_link_prog_run as always inlined function

Message ID 20240320200610.2556049-1-andrii@kernel.org (mailing list archive)
State Changes Requested
Delegated to: BPF

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 956 this patch: 956
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 13 maintainers not CCed: haoluo@google.com linux-trace-kernel@vger.kernel.org john.fastabend@gmail.com rostedt@goodmis.org mhiramat@kernel.org eddyz87@gmail.com sdf@google.com song@kernel.org kpsingh@kernel.org yonghong.song@linux.dev martin.lau@linux.dev jolsa@kernel.org mathieu.desnoyers@efficios.com
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 973 this patch: 973
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-39 fail Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-16 fail Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc

Commit Message

Andrii Nakryiko March 20, 2024, 8:06 p.m. UTC
kprobe_multi_link_prog_run() is called both for multi-kprobe and
multi-kretprobe BPF programs from kprobe_multi_link_handler() and
kprobe_multi_link_exit_handler(), respectively.
kprobe_multi_link_prog_run() does all the relevant work, with those
wrappers just satisfying ftrace's interfaces (the kprobe callback is
supposed to return int, while the kretprobe one returns void).
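
The structure is roughly the following (a simplified sketch; the real
callback signatures in the kernel take more arguments than shown here):

static int
kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
			   unsigned long entry_ip, struct pt_regs *regs)
{
	/* ... all of the actual work happens here ... */
	return 0;
}

/* kprobe-side wrapper: has to return int */
static int kprobe_multi_link_handler(struct bpf_kprobe_multi_link *link,
				     unsigned long entry_ip,
				     struct pt_regs *regs)
{
	return kprobe_multi_link_prog_run(link, entry_ip, regs);
}

/* kretprobe-side wrapper: returns void; the call still compiles to a
 * plain jmp (tail call)
 */
static void kprobe_multi_link_exit_handler(struct bpf_kprobe_multi_link *link,
					    unsigned long entry_ip,
					    struct pt_regs *regs)
{
	kprobe_multi_link_prog_run(link, entry_ip, regs);
}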

With this structure, the compiler performs tail-call optimization:

Dump of assembler code for function kprobe_multi_link_exit_handler:
   0xffffffff8122f1e0 <+0>:     add    $0xffffffffffffffc0,%rdi
   0xffffffff8122f1e4 <+4>:     mov    %rcx,%rdx
   0xffffffff8122f1e7 <+7>:     jmp    0xffffffff81230080 <kprobe_multi_link_prog_run>

This means that when trying to capture LBR that traces all indirect branches
we are wasting an entry just to record that kprobe_multi_link_exit_handler
called/jumped into kprobe_multi_link_prog_run.

LBR entries are especially sparse on AMD CPUs (just 16 entries on latest CPUs
vs typically 32 on latest Intel CPUs), and every entry counts (and we already
have a bunch of other LBR entries spent getting to a BPF program), so it would
be great to not waste any more than necessary.
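
(To illustrate why every record matters, here is a minimal sketch, not
retsnoop's actual code, of a multi-kprobe BPF program snapshotting LBR
with the bpf_get_branch_snapshot() helper. Every jmp/call taken between
the traced function and this snapshot burns one LBR record.)

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

#define MAX_LBR_ENTRIES 32

static struct perf_branch_entry lbrs[MAX_LBR_ENTRIES];

SEC("kprobe.multi/some_kernel_func") /* target is just an example */
int snapshot_lbr(struct pt_regs *ctx)
{
	long sz;

	/* grab LBR as early as possible, before more branches are taken */
	sz = bpf_get_branch_snapshot(lbrs, sizeof(lbrs), 0);
	if (sz < 0)
		return 0;
	/* sz is the number of bytes filled in; emit/process records here */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";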

Marking it as just `static inline` doesn't change anything; the compiler
still performs only the tail-call optimization. But by marking
kprobe_multi_link_prog_run() as __always_inline we ensure that the
compiler fully inlines it, avoiding the jump:

Dump of assembler code for function kprobe_multi_link_exit_handler:
   0xffffffff8122f4e0 <+0>:     push   %r15
   0xffffffff8122f4e2 <+2>:     push   %r14
   0xffffffff8122f4e4 <+4>:     push   %r13
   0xffffffff8122f4e6 <+6>:     push   %r12
   0xffffffff8122f4e8 <+8>:     push   %rbx
   0xffffffff8122f4e9 <+9>:     sub    $0x10,%rsp
   0xffffffff8122f4ed <+13>:    mov    %rdi,%r14
   0xffffffff8122f4f0 <+16>:    lea    -0x40(%rdi),%rax

   ...

   0xffffffff8122f590 <+176>:   call   0xffffffff8108e420 <sched_clock>
   0xffffffff8122f595 <+181>:   sub    %r14,%rax
   0xffffffff8122f598 <+184>:   add    %rax,0x8(%rbx,%r13,1)
   0xffffffff8122f59d <+189>:   jmp    0xffffffff8122f541 <kprobe_multi_link_exit_handler+97>

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/bpf_trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Alexei Starovoitov March 21, 2024, 6:55 a.m. UTC | #1
On Wed, Mar 20, 2024 at 1:06 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Dump of assembler code for function kprobe_multi_link_exit_handler:
>    0xffffffff8122f1e0 <+0>:     add    $0xffffffffffffffc0,%rdi
>    0xffffffff8122f1e4 <+4>:     mov    %rcx,%rdx
>    0xffffffff8122f1e7 <+7>:     jmp    0xffffffff81230080 <kprobe_multi_link_prog_run>
>
> This means that when trying to capture LBR that traces all indirect branches
> we are wasting an entry just to record that kprobe_multi_link_exit_handler
> called/jumped into kprobe_multi_link_prog_run.

I don't follow.
If LBR was configured to trace indirect branches then this direct jmp
shouldn't be recorded.
If LBR is capturing stack frames (those should be call/ret instructions)
then this jmp also won't be recorded.
If LBR is actually capturing all indirect, direct and conditional
jmps, and calls then saving an entry won't make a difference.

Maybe it's an LBR configuration issue on the retsnoop side?
Alexei Starovoitov March 21, 2024, 7:02 a.m. UTC | #2
On Wed, Mar 20, 2024 at 11:55 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 20, 2024 at 1:06 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Dump of assembler code for function kprobe_multi_link_exit_handler:
> >    0xffffffff8122f1e0 <+0>:     add    $0xffffffffffffffc0,%rdi
> >    0xffffffff8122f1e4 <+4>:     mov    %rcx,%rdx
> >    0xffffffff8122f1e7 <+7>:     jmp    0xffffffff81230080 <kprobe_multi_link_prog_run>
> >
> > This means that when trying to capture LBR that traces all indirect branches
> > we are wasting an entry just to record that kprobe_multi_link_exit_handler
> > called/jumped into kprobe_multi_link_prog_run.
>
> I don't follow.
> If LBR was configured to trace indirect branches then this direct jmp
> shouldn't be recorded.
> If LBR is capturing stack frames (those should be call/ret instructions)
> then this jmp also won't be recorded.
> If LBR is actually capturing all indirect, direct and conditional
> jmps, and calls then saving an entry won't make a difference.
>
> Maybe it's an LBR configuration issue on the retsnoop side?

furthermore.
This 'jmp kprobe_multi_link_prog_run' is not any different than
"goto" in C code and other unconditional jmp-s within functions.
Something doesn't add up that this particular jmp stands out
from other similar jmps before and after. From cpu pov
the tail call jmp is the same as jmp within a function.
Andrii Nakryiko March 21, 2024, 4:04 p.m. UTC | #3
On Thu, Mar 21, 2024 at 12:02 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Mar 20, 2024 at 11:55 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Mar 20, 2024 at 1:06 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >
> > > Dump of assembler code for function kprobe_multi_link_exit_handler:
> > >    0xffffffff8122f1e0 <+0>:     add    $0xffffffffffffffc0,%rdi
> > >    0xffffffff8122f1e4 <+4>:     mov    %rcx,%rdx
> > >    0xffffffff8122f1e7 <+7>:     jmp    0xffffffff81230080 <kprobe_multi_link_prog_run>
> > >
> > > This means that when trying to capture LBR that traces all indirect branches
> > > we are wasting an entry just to record that kprobe_multi_link_exit_handler
> > > called/jumped into kprobe_multi_link_prog_run.
> >
> > I don't follow.
> > If LBR was configured to trace indirect branches then this direct jmp
> > shouldn't be recorded.
> > If LBR is capturing stack frames (those should be call/ret instructions)
> > then this jmp also won't be recorded.
> > If LBR is actually capturing all indirect, direct and conditional
> > jmps, and calls then saving an entry won't make a difference.

I was imprecise with my wording, sorry. It's not "indirect branches",
it's any change in the linear execution of code. So any jump (conditional
or not) and any call adds an LBR record, as you mention in this case.

This is the --lbr=any option in retsnoop, which is used to "look" inside
the last function that failed, to see which condition (out of many)
caused an error. It's a very sensitive but extremely useful mode, so
minimizing the number of wasted LBR records is very important.
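
Roughly, the distinction comes down to the perf branch filter (a sketch
of the relevant attribute setup, not retsnoop's exact code):

#include <stdbool.h>
#include <string.h>
#include <linux/perf_event.h>

static void setup_lbr_attr(struct perf_event_attr *attr, bool any_branch)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_type = PERF_SAMPLE_BRANCH_STACK;
	/* --lbr=any records every taken branch, including the direct jmp
	 * from the tail call; a calls-only filter (ANY_CALL/IND_CALL)
	 * would indeed skip it
	 */
	attr->branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
		(any_branch ? PERF_SAMPLE_BRANCH_ANY
			    : PERF_SAMPLE_BRANCH_ANY_CALL);
}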

Not sure why you are saying it won't make any difference? I'm working
on some other changes to eliminate a few more calls/jmps/branches. It
all adds up.


> >
> > Maybe it's an LBR configuration issue on the retsnoop side?
>
> furthermore.
> This 'jmp kprobe_multi_link_prog_run' is not any different than
> "goto" in C code and other unconditional jmp-s within functions.
> Something doesn't add up that this particular jmp stands out
> from other similar jmps before and after. From cpu pov
> the tail call jmp is the same as jmp within a function.

Patch

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 434e3ece6688..0bebd6f02e17 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2796,7 +2796,7 @@  static u64 bpf_kprobe_multi_entry_ip(struct bpf_run_ctx *ctx)
 	return run_ctx->entry_ip;
 }
 
-static int
+static __always_inline int
 kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
 			   unsigned long entry_ip, struct pt_regs *regs)
 {