Message ID | 20241019092709.128359-1-xukuohai@huaweicloud.com |
---|---|
State | New |
Series | [bpf] bpf, arm64: Fix stack frame construction for struct_ops trampoline |
Xu Kuohai <xukuohai@huaweicloud.com> writes:

> From: Xu Kuohai <xukuohai@huawei.com>
>
> The callsite layout for arm64 fentry is:
>
>   mov x9, lr
>   nop
>
> When a bpf prog is attached, the nop instruction is patched to a call
> to bpf trampoline:
>
>   mov x9, lr
>   bl <bpf trampoline>
>
> This passes two return addresses to bpf trampoline: the return address
> for the traced function/prog, stored in x9, and the return address for
> the bpf trampoline, stored in lr. To ensure stacktrace works properly,
> the bpf trampoline constructs two fake function stack frames using x9
> and lr.
>
> However, struct_ops progs are used as function callbacks and are invoked
> directly, without x9 being set as the fentry callsite does. Therefore,
> only one stack frame should be constructed using lr for struct_ops.
>
> Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64")
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>

Acked-by: Puranjay Mohan <puranjay@kernel.org>

> ---
>  arch/arm64/net/bpf_jit_comp.c | 47 +++++++++++++++++++++++------------
>  1 file changed, 31 insertions(+), 16 deletions(-)
> [...]
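The frame construction described in the commit message lives in
prepare_trampoline() in the arm64 JIT. A minimal sketch of the prologue
logic this patch introduces, simplified from the diff at the end of the
thread (BTI emission, stack-layout bookkeeping, and error handling
elided; not verbatim kernel code):

  /* sketch of the trampoline prologue after this patch */
  if (!is_struct_ops) {
          /* fake frame for the parent of the patched function:
           * at a fentry callsite, "mov x9, lr" saved the parent's
           * return address in x9 before the bl clobbered lr
           */
          emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
          emit(A64_MOV(1, A64_FP, A64_SP), ctx);
  }

  /* frame for the patched function (tracing) or for the direct
   * caller (struct_ops); lr holds the trampoline's return address
   * in both cases
   */
  emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
  emit(A64_MOV(1, A64_FP, A64_SP), ctx);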
Xu Kuohai <xukuohai@huaweicloud.com> writes:

> From: Xu Kuohai <xukuohai@huawei.com>
>
> [...]
>
> Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64")
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>

Tested-by: Puranjay Mohan <puranjay@kernel.org>

Thanks,
Puranjay
On Sat, Oct 19, 2024 at 2:15 AM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> From: Xu Kuohai <xukuohai@huawei.com>
>
> [...]
>
> However, struct_ops progs are used as function callbacks and are invoked
> directly, without x9 being set as the fentry callsite does. Therefore,
> only one stack frame should be constructed using lr for struct_ops.

Are you saying that the stack unwinder on arm64 is currently completely
broken for struct_ops progs, or that it shows an extra frame that
doesn't have to be shown?

If the former, then it's certainly bpf tree material.
If the latter, then bpf-next will do.
Pls clarify.
On Wed, Oct 23, 2024 at 12:50 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> Are you saying that the stack unwinder on arm64 is currently completely
> broken for struct_ops progs, or that it shows an extra frame that
> doesn't have to be shown?
>
> If the former, then it's certainly bpf tree material.
> If the latter, then bpf-next will do.
> Pls clarify.

It is not completely broken; only an extra garbage frame is shown
between the caller of the trampoline and its caller. So this can go
through the bpf-next tree. But let's wait for Xu to provide more
information.

Thanks,
Puranjay
On 10/23/2024 7:37 AM, Puranjay Mohan wrote:
> [...]
>
> It is not completely broken; only an extra garbage frame is shown
> between the caller of the trampoline and its caller.

Yep, exactly. Here is a perf script sample, where tcp_ack+0x404 is the
garbage frame:

  ffffffc0801a04b4 bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid+0x98 (bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid)
  ffffffc0801a228c [unknown] ([kernel.kallsyms])                 // bpf trampoline
  ffffffd08d362590 tcp_ack+0x798 ([kernel.kallsyms])             // caller of the bpf trampoline
  ffffffd08d3621fc tcp_ack+0x404 ([kernel.kallsyms])             // garbage frame
  ffffffd08d36452c tcp_rcv_established+0x4ac ([kernel.kallsyms])
  ffffffd08d375c58 tcp_v4_do_rcv+0x1f0 ([kernel.kallsyms])
  ffffffd08d378630 tcp_v4_rcv+0xeb8 ([kernel.kallsyms])
  ...

This sample also shows that there is no symbol for the struct_ops bpf
trampoline. Maybe we should add a symbol for it?
On 10/23/2024 11:16 AM, Xu Kuohai wrote:
> [...]
>
> This sample also shows that there is no symbol for the struct_ops bpf
> trampoline. Maybe we should add a symbol for it?

Emm, the stack unwinder on x86 is completely broken for struct_ops
progs.

That is because the following function returns 0 for a struct_ops bpf
trampoline address: there is no corresponding kernel symbol, so the
address is not recognized as kernel text, and the unwinder stops on
ip == 0.

  unsigned long unwind_get_return_address(struct unwind_state *state)
  {
          if (unwind_done(state))
                  return 0;

          return __kernel_text_address(state->ip) ? state->ip : 0;
  }

Here is an example of a broken stack trace from perf sampling, where
only one stack frame is captured:

  ffffffffc000cfb4 bpf_prog_e60d93d3ec88d5ef_bpf_cubic_cong_avoid+0x78 (bpf_prog_e60d93d3ec88d5ef_bpf_cubic_cong_avoid)
  (no more frames)

To fix it, I think a kernel symbol should be added for the struct_ops
trampoline.
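A rough sketch of what registering such a symbol could look like,
assuming the existing bpf_image_ksym_add() helper in kernel/bpf/core.c
keeps its current two-argument form; the wrapper function and symbol
naming below are illustrative only, not from an actual patch:

  #include <linux/bpf.h>
  #include <linux/kallsyms.h>

  /* Illustrative only: publish a struct_ops trampoline image via the
   * BPF ksym infrastructure so that __kernel_text_address() recognizes
   * it and unwinding no longer terminates at ip == 0. The function
   * name and symbol format here are made up for this sketch.
   */
  static void struct_ops_tramp_ksym_add(void *image, struct bpf_ksym *ksym,
                                        const char *st_ops_name,
                                        const char *member_name)
  {
          snprintf(ksym->name, KSYM_NAME_LEN, "bpf__%s_%s",
                   st_ops_name, member_name);
          /* inserts the image range into the latch tree consulted by
           * is_bpf_text_address() and emits a PERF_RECORD_KSYMBOL
           * event so perf learns the symbol as well
           */
          bpf_image_ksym_add(image, ksym);
  }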
On Thu, Oct 24, 2024 at 6:48 AM Xu Kuohai <xukuohai@huaweicloud.com> wrote:
>
> [...]
>
> Emm, the stack unwinder on x86 is completely broken for struct_ops
> progs.
>
> [...]
>
> Here is an example of a broken stack trace from perf sampling, where
> only one stack frame is captured:
>
>   ffffffffc000cfb4 bpf_prog_e60d93d3ec88d5ef_bpf_cubic_cong_avoid+0x78 (bpf_prog_e60d93d3ec88d5ef_bpf_cubic_cong_avoid)
>   (no more frames)

You mean arch_stack_walk() won't see anything after ip == 0, but
dump_stack() will still print the rest with "?" (as unreliable)?

That's bad indeed.

> To fix it, I think a kernel symbol should be added for the struct_ops
> trampoline.

Makes sense. Pls send a patch.

As far as this patch goes, please add the earlier example of the double
tcp_ack trace to the commit log and resubmit targeting bpf-next.
On 10/25/2024 12:24 AM, Alexei Starovoitov wrote:
> [...]
>
> You mean arch_stack_walk() won't see anything after ip == 0, but
> dump_stack() will still print the rest with "?" (as unreliable)?

Yes, dump_stack() does not stop on !__kernel_text_address.

> That's bad indeed.
>
>> To fix it, I think a kernel symbol should be added for the struct_ops
>> trampoline.
>
> Makes sense. Pls send a patch.
>
> As far as this patch goes, please add the earlier example of the double
> tcp_ack trace to the commit log and resubmit targeting bpf-next.

Sure
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 8bbd0b20136a..f8fec244c0c2 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -2094,6 +2094,12 @@ static void restore_args(struct jit_ctx *ctx, int args_off, int nregs)
 	}
 }
 
+static bool is_struct_ops_tramp(const struct bpf_tramp_links *fentry_links)
+{
+	return fentry_links->nr_links == 1 &&
+		fentry_links->links[0]->link.type == BPF_LINK_TYPE_STRUCT_OPS;
+}
+
 /* Based on the x86's implementation of arch_prepare_bpf_trampoline().
  *
  * bpf prog and function entry before bpf trampoline hooked:
@@ -2123,6 +2129,7 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
 	struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
 	bool save_ret;
 	__le32 **branches = NULL;
+	bool is_struct_ops = is_struct_ops_tramp(fentry);
 
 	/* trampoline stack layout:
 	 * [ parent ip ]
@@ -2191,11 +2198,14 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
 	 */
 	emit_bti(A64_BTI_JC, ctx);
 
-	/* frame for parent function */
-	emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
-	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+	/* x9 is not set for struct_ops */
+	if (!is_struct_ops) {
+		/* frame for parent function */
+		emit(A64_PUSH(A64_FP, A64_R(9), A64_SP), ctx);
+		emit(A64_MOV(1, A64_FP, A64_SP), ctx);
+	}
 
-	/* frame for patched function */
+	/* frame for patched function for tracing, or caller for struct_ops */
 	emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
 	emit(A64_MOV(1, A64_FP, A64_SP), ctx);
 
@@ -2281,19 +2291,24 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
 	/* reset SP */
 	emit(A64_MOV(1, A64_SP, A64_FP), ctx);
 
-	/* pop frames */
-	emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
-	emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
-
-	if (flags & BPF_TRAMP_F_SKIP_FRAME) {
-		/* skip patched function, return to parent */
-		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
-		emit(A64_RET(A64_R(9)), ctx);
+	if (is_struct_ops) {
+		emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
+		emit(A64_RET(A64_LR), ctx);
 	} else {
-		/* return to patched function */
-		emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
-		emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
-		emit(A64_RET(A64_R(10)), ctx);
+		/* pop frames */
+		emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
+		emit(A64_POP(A64_FP, A64_R(9), A64_SP), ctx);
+
+		if (flags & BPF_TRAMP_F_SKIP_FRAME) {
+			/* skip patched function, return to parent */
+			emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
+			emit(A64_RET(A64_R(9)), ctx);
+		} else {
+			/* return to patched function */
+			emit(A64_MOV(1, A64_R(10), A64_LR), ctx);
+			emit(A64_MOV(1, A64_LR, A64_R(9)), ctx);
+			emit(A64_RET(A64_R(10)), ctx);
+		}
 	}
 
 	kfree(branches);
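For the struct_ops case, the net effect on the generated code can be
sketched as follows. This is reconstructed by hand from the emit calls
above, so instruction details are approximate; the trampoline body that
spills arguments and calls the prog is elided:

  /* struct_ops trampoline after this patch (hand-reconstructed sketch):
   *
   *   bti jc
   *   stp x29, x30, [sp, #-16]!   // single fake frame: fp + caller's lr
   *   mov x29, sp
   *   sub sp, sp, #<stack_size>   // trampoline scratch area
   *   ...                         // save args, call bpf prog(s), restore
   *   mov sp, x29                 // reset SP
   *   ldp x29, x30, [sp], #16     // pop the single frame
   *   ret                         // return directly to the caller via lr
   */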