| Message ID | 20240404002640.1774210-3-andrii@kernel.org (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | 314a53623cd4e62e1b88126e5ed2bc87073d90ee |
| Delegated to | BPF |
| Series | Inline bpf_get_branch_snapshot() BPF helper |
On 4/3/24 5:26 PM, Andrii Nakryiko wrote:
> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> inline BPF code which calls directly into underlying callback of
> perf_snapshot_branch_stack static call. This callback is set early
> during kernel initialization and is never updated or reset, so it's ok
> to fetch actual implementation using static_call_query() and call
> directly into it.
>
> This change eliminates a full function call and saves one LBR entry
> in PERF_SAMPLE_BRANCH_ANY LBR mode.
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Yonghong Song <yonghong.song@linux.dev>
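For context on what is being inlined: the sketch below is an editorial addition, not part of the patch or this thread; the attach point (do_nanosleep), program name, and buffer size are illustrative. It shows how a typical libbpf-based BPF program calls bpf_get_branch_snapshot(). Capturing meaningful data also requires LBR to be enabled, e.g. via a perf event with branch sampling.

```c
// Editorial usage sketch, assuming a libbpf-style build with vmlinux.h.
// The attach point (do_nanosleep) and buffer size are arbitrary examples.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_LBR_ENTRIES 32

/* destination buffer for captured LBR records */
struct perf_branch_entry lbrs[MAX_LBR_ENTRIES];

SEC("fentry/do_nanosleep")
int BPF_PROG(capture_lbrs)
{
	/* flags must be 0; on success the helper returns the number of
	 * bytes written, a multiple of sizeof(struct perf_branch_entry),
	 * matching the inlined "entry_cnt * 24" return path in the patch
	 */
	long written = bpf_get_branch_snapshot(lbrs, sizeof(lbrs), 0);

	if (written > 0)
		bpf_printk("captured %ld LBR entries",
			   written / sizeof(struct perf_branch_entry));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```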
On Wed, Apr 3, 2024 at 5:27 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> inline BPF code which calls directly into underlying callback of
> perf_snapshot_branch_stack static call. This callback is set early
> during kernel initialization and is never updated or reset, so it's ok
> to fetch actual implementation using static_call_query() and call
> directly into it.
>
> This change eliminates a full function call and saves one LBR entry
> in PERF_SAMPLE_BRANCH_ANY LBR mode.
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/bpf/verifier.c | 55 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 17c06f1505e4..2cb5db317a5e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20181,6 +20181,61 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>  			goto next_insn;
>  		}
>
> +		/* Implement bpf_get_branch_snapshot inline. */
> +		if (prog->jit_requested && BITS_PER_LONG == 64 &&
> +		    insn->imm == BPF_FUNC_get_branch_snapshot) {
> +			/* We are dealing with the following func protos:
> +			 * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
> +			 * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
> +			 */
> +			const u32 br_entry_size = sizeof(struct perf_branch_entry);
> +
> +			/* struct perf_branch_entry is part of UAPI and is
> +			 * used as an array element, so extremely unlikely to
> +			 * ever grow or shrink
> +			 */
> +			BUILD_BUG_ON(br_entry_size != 24);
> +
> +			/* if (unlikely(flags)) return -EINVAL */
> +			insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
> +
> +			/* Transform size (bytes) into number of entries (cnt = size / 24).
> +			 * But to avoid expensive division instruction, we implement
> +			 * divide-by-3 through multiplication, followed by further
> +			 * division by 8 through 3-bit right shift.
> +			 * Refer to book "Hacker's Delight, 2nd ed." by Henry S. Warren, Jr.,
> +			 * p. 227, chapter "Unsigned Divison by 3" for details and proofs.
> +			 *
> +			 * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
> +			 */
> +			insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
> +			insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
> +			insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
> +
> +			/* call perf_snapshot_branch_stack implementation */
> +			insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));

How will this work on non-x86?
I tried to grep the code and it looks like only x86 does:
static_call_update(perf_snapshot_branch_stack, ...)

so on other arch-s static_call_query() will return zero/einval?
And above will crash?
On 4/4/24 11:14 AM, Alexei Starovoitov wrote:
> On Wed, Apr 3, 2024 at 5:27 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> > inline BPF code which calls directly into underlying callback of
> > perf_snapshot_branch_stack static call.
> > [...]
> > +			/* call perf_snapshot_branch_stack implementation */
> > +			insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
> How will this work on non-x86?
> I tried to grep the code and it looks like only x86 does:
> static_call_update(perf_snapshot_branch_stack, ...)
>
> so on other arch-s static_call_query() will return zero/einval?
> And above will crash?

Patch 1 will give the answer. In events/core.c, we have the following:

    DEFINE_STATIC_CALL_RET0(perf_snapshot_branch_stack,
                            perf_snapshot_branch_stack_t);

    #define DEFINE_STATIC_CALL_RET0(name, _func)                        \
            DECLARE_STATIC_CALL(name, _func);                           \
            struct static_call_key STATIC_CALL_KEY(name) = {            \
                    .func = __static_call_return0,                      \
                    .type = 1,                                          \
            };                                                          \
            ARCH_DEFINE_STATIC_CALL_RET0_TRAMP(name)

So the default value for perf_snapshot_branch_stack is
__static_call_return0.

In static_call.c:

    long __static_call_return0(void)
    {
            return 0;
    }
    EXPORT_SYMBOL_GPL(__static_call_return0);

So we should be fine.
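An editorial aside for readers unfamiliar with static calls: the plain-C model below (illustrative names, not kernel code) captures why calling the queried target directly is safe here. An ordinary function pointer with a return-0 default stands in for the kernel's static_call machinery.

```c
#include <stdio.h>

/* stand-in for the UAPI struct perf_branch_entry; 24 bytes on 64-bit */
struct branch_entry_stub { unsigned long from, to, flags; };

typedef int (*snapshot_fn)(struct branch_entry_stub *entries,
			   unsigned int cnt);

/* DEFINE_STATIC_CALL_RET0 analog: the default target just reports
 * zero captured entries */
static int static_call_return0(struct branch_entry_stub *entries,
			       unsigned int cnt)
{
	return 0;
}

/* updated once at init on x86 (static_call_update analog); never reset */
static snapshot_fn perf_snapshot_branch_stack = static_call_return0;

int main(void)
{
	struct branch_entry_stub buf[16];
	/* static_call_query() analog: fetch the current target once... */
	snapshot_fn fn = perf_snapshot_branch_stack;
	/* ...and call it directly. On an arch that never updates the
	 * call we hit the return-0 default, and the inlined helper maps
	 * a zero entry count to -ENOENT instead of crashing. */
	int cnt = fn(buf, 16);

	printf("%d entries -> helper returns %s\n",
	       cnt, cnt ? "cnt * 24 bytes" : "-ENOENT");
	return 0;
}
```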
On Thu, Apr 4, 2024 at 11:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
> On 4/4/24 11:14 AM, Alexei Starovoitov wrote:
> > On Wed, Apr 3, 2024 at 5:27 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > > Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> > > inline BPF code which calls directly into underlying callback of
> > > perf_snapshot_branch_stack static call.
> > > [...]
> > How will this work on non-x86?
> > I tried to grep the code and it looks like only x86 does:
> > static_call_update(perf_snapshot_branch_stack, ...)
> >
> > so on other arch-s static_call_query() will return zero/einval?
> > And above will crash?
>
> Patch 1 will give the answer. In events/core.c, we have the following:
> [...]
> So the default value for perf_snapshot_branch_stack is
> __static_call_return0.
>
> In static_call.c:
>
>     long __static_call_return0(void)
>     {
>             return 0;
>     }
>     EXPORT_SYMBOL_GPL(__static_call_return0);
>
> So we should be fine.

I see. Thanks for explaining.
On Thu, Apr 4, 2024 at 11:31 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 11:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> > On 4/4/24 11:14 AM, Alexei Starovoitov wrote:
> > > How will this work on non-x86?
> > > I tried to grep the code and it looks like only x86 does:
> > > static_call_update(perf_snapshot_branch_stack, ...)
> > >
> > > so on other arch-s static_call_query() will return zero/einval?
> > > And above will crash?
> >
> > Patch 1 will give the answer. In events/core.c, we have the following:
> > [...]
> > So the default value for perf_snapshot_branch_stack is
> > __static_call_return0.
> > [...]
> > So we should be fine.
>
> I see. Thanks for explaining.

Yep, just like Yonghong explained. And I also dropped that CONFIG_X86
check in anticipation of [0], which hopefully lands some time in the
not too distant future.

  [0] https://lore.kernel.org/all/20240125094119.2542332-1-anshuman.khandual@arm.com/
Hi Andrii,

I was looking around in do_misc_fixups() and came across this:

...
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta += cnt - 1;
> +			env->prog = prog = new_prog;
> +			insn = new_prog->insnsi + i + delta;
> +			continue;

Should the above be turned into "goto next_insn" like the others that
were touched by commit 011832b97b31 ("bpf: Introduce may_goto
instruction")?

> +		}
> +
>  		/* Implement bpf_kptr_xchg inline */
>  		if (prog->jit_requested && BITS_PER_LONG == 64 &&
>  		    insn->imm == BPF_FUNC_kptr_xchg &&
On Wed, Oct 23, 2024 at 1:29 AM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> Hi Andrii,
>
> I was looking around in do_misc_fixups() and came across this:
>
> ...
> > +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > +			if (!new_prog)
> > +				return -ENOMEM;
> > +
> > +			delta += cnt - 1;
> > +			env->prog = prog = new_prog;
> > +			insn = new_prog->insnsi + i + delta;
> > +			continue;
>
> Should the above be turned into "goto next_insn" like the others that
> were touched by commit 011832b97b31 ("bpf: Introduce may_goto
> instruction")?

Yep, seems so, thanks for catching. I'll send a fix.

> > +		}
> > +
> >  		/* Implement bpf_kptr_xchg inline */
> >  		if (prog->jit_requested && BITS_PER_LONG == 64 &&
> >  		    insn->imm == BPF_FUNC_kptr_xchg &&
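For concreteness, the fix being agreed on above would presumably be the following one-line change (an editorial sketch inferred from the discussion; the actual follow-up commit may differ in context and wording):

```diff
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			delta += cnt - 1;
 			env->prog = prog = new_prog;
 			insn = new_prog->insnsi + i + delta;
-			continue;
+			goto next_insn;
 		}
```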
```diff
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 17c06f1505e4..2cb5db317a5e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20181,6 +20181,61 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
+		/* Implement bpf_get_branch_snapshot inline. */
+		if (prog->jit_requested && BITS_PER_LONG == 64 &&
+		    insn->imm == BPF_FUNC_get_branch_snapshot) {
+			/* We are dealing with the following func protos:
+			 * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
+			 * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
+			 */
+			const u32 br_entry_size = sizeof(struct perf_branch_entry);
+
+			/* struct perf_branch_entry is part of UAPI and is
+			 * used as an array element, so extremely unlikely to
+			 * ever grow or shrink
+			 */
+			BUILD_BUG_ON(br_entry_size != 24);
+
+			/* if (unlikely(flags)) return -EINVAL */
+			insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
+
+			/* Transform size (bytes) into number of entries (cnt = size / 24).
+			 * But to avoid expensive division instruction, we implement
+			 * divide-by-3 through multiplication, followed by further
+			 * division by 8 through 3-bit right shift.
+			 * Refer to book "Hacker's Delight, 2nd ed." by Henry S. Warren, Jr.,
+			 * p. 227, chapter "Unsigned Divison by 3" for details and proofs.
+			 *
+			 * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
+			 */
+			insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
+			insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
+			insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
+
+			/* call perf_snapshot_branch_stack implementation */
+			insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
+			/* if (entry_cnt == 0) return -ENOENT */
+			insn_buf[5] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4);
+			/* return entry_cnt * sizeof(struct perf_branch_entry) */
+			insn_buf[6] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size);
+			insn_buf[7] = BPF_JMP_A(3);
+			/* return -EINVAL; */
+			insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL);
+			insn_buf[9] = BPF_JMP_A(1);
+			/* return -ENOENT; */
+			insn_buf[10] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT);
+			cnt = 11;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta += cnt - 1;
+			env->prog = prog = new_prog;
+			insn = new_prog->insnsi + i + delta;
+			continue;
+		}
+
 		/* Implement bpf_kptr_xchg inline */
 		if (prog->jit_requested && BITS_PER_LONG == 64 &&
 		    insn->imm == BPF_FUNC_kptr_xchg &&
```
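The multiply-and-shift division above is easy to sanity-check outside the kernel. The standalone program below (an editorial addition, ordinary userspace C) confirms that `(size * 0xaaaaaaab) >> 36` equals `size / 24` for 32-bit sizes, which is exactly what insn_buf[1..3] compute.

```c
/* Standalone check of the inlined division: for a 32-bit size,
 * size / 3 == (size * 0xaaaaaaab) >> 33 (Hacker's Delight), so the
 * extra 3-bit shift (total >> 36) divides by a further 8, giving
 * size / 24 == entry count. Sampled across the whole u32 range. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t M = 0xaaaaaaabULL;	/* (2^33 + 1) / 3 */

	for (uint64_t n = 0; n <= UINT32_MAX; n += 7919)	/* prime stride */
		assert((n * M) >> 36 == n / 24);

	/* exact boundary values */
	assert((UINT32_MAX * M) >> 36 == UINT32_MAX / 24);
	assert((24ULL * M) >> 36 == 1 && (23ULL * M) >> 36 == 0);

	printf("multiply+shift matches n / 24 for sampled 32-bit n\n");
	return 0;
}
```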