Message ID | 161495873696.346821.10161501768906432924.stgit@devnote2 (mailing list archive) |
---|---|
Headers | show |
Series | kprobes: Fix stacktrace in kretprobes | expand |
Hi Masami, On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote: > Hello, > > Here is a series of patches for kprobes and stacktracer to fix the kretprobe > entries in the kernel stack. This was reported by Daniel Xu. I thought that > was in the bpftrace, but it is actually more generic issue. > So I decided to fix the issue in arch independent part. > > While fixing the issue, I found a bug in ia64 related to kretprobe, which is > fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main > issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe > internal change. And [5/5] removing the stacktrace kretprobe fixup code in > ftrace. > > Daniel, can you also check that this fixes your issue too? I hope it is. Unfortunately, this patch series does not fix the issue I reported. I think the reason your tests work is because you're using ftrace and the ORC unwinder is aware of ftrace trampolines (see arch/x86/kernel/unwind_orc.c:orc_ftrace_find). bpftrace kprobes go through perf event subsystem (ie not ftrace) so naturally orc_ftrace_find() does not find an associated trampoline. ORC unwinding fails in this case because arch/x86/kernel/kprobes/core.c:trampoline_handler sets regs->ip = (unsigned long)&kretprobe_trampoline; and `kretprobe_trampoline` is marked STACK_FRAME_NON_STANDARD(kretprobe_trampoline); so it doesn't have a valid ORC entry. Thus, ORC immediately bails when trying to unwind past the first frame. The only way I can think of to fix this issue is to make the ORC unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping you have another idea if my patch isn't acceptable. Thanks, Daniel
On Fri, 5 Mar 2021 11:16:45 -0800 Daniel Xu <dxu@dxuuu.xyz> wrote: > Hi Masami, > > On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote: > > Hello, > > > > Here is a series of patches for kprobes and stacktracer to fix the kretprobe > > entries in the kernel stack. This was reported by Daniel Xu. I thought that > > was in the bpftrace, but it is actually more generic issue. > > So I decided to fix the issue in arch independent part. > > > > While fixing the issue, I found a bug in ia64 related to kretprobe, which is > > fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main > > issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe > > internal change. And [5/5] removing the stacktrace kretprobe fixup code in > > ftrace. > > > > Daniel, can you also check that this fixes your issue too? I hope it is. > > Unfortunately, this patch series does not fix the issue I reported. Ah, OK. This was my mistake I didn't choose ORC unwinder for test kernel. > > I think the reason your tests work is because you're using ftrace and > the ORC unwinder is aware of ftrace trampolines (see > arch/x86/kernel/unwind_orc.c:orc_ftrace_find). OK, so it has to be fixed in ORC unwinder. > > bpftrace kprobes go through perf event subsystem (ie not ftrace) so > naturally orc_ftrace_find() does not find an associated trampoline. ORC > unwinding fails in this case because > arch/x86/kernel/kprobes/core.c:trampoline_handler sets > > regs->ip = (unsigned long)&kretprobe_trampoline; > > and `kretprobe_trampoline` is marked > > STACK_FRAME_NON_STANDARD(kretprobe_trampoline); > > so it doesn't have a valid ORC entry. Thus, ORC immediately bails when > trying to unwind past the first frame. Hm, in ftrace case, it decode kretprobe's stackframe and stoped right after kretprobe_trampoline callsite. => kretprobe_trace_func+0x21f/0x340 => kretprobe_dispatcher+0x73/0xb0 => __kretprobe_trampoline_handler+0xcd/0x1a0 => trampoline_handler+0x3d/0x50 => kretprobe_trampoline+0x25/0x50 => 0 kretprobe replaces the real return address with kretprobe_trampoline and kretprobe_trampoline *calls* trampoline_handler (this part depends on architecture implementation). Thus, if kretprobe_trampoline has no stack frame information, ORC may fails at the first kretprobe_trampoline+0x25, that is different from the kretprobe_trampoline+0, so the hack doesn't work. Hmm, how the other inline-asm function makes its stackframe info? If I get the kretprobe_trampoline+0, I can fix it up. > The only way I can think of to fix this issue is to make the ORC > unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping > you have another idea if my patch isn't acceptable. OK, anyway, your patch doesn't care the case that the multiple functions are probed by kretprobes. In that case, you'll see several entries are replaced by the kretprobe_trampoline. To find it correctly, you have to pass a state-holder (@cur of the kretprobe_find_ret_addr()) to the fixup entries. Thank you,
On Sat, Mar 06, 2021 at 10:13:57AM +0900, Masami Hiramatsu wrote: > On Fri, 5 Mar 2021 11:16:45 -0800 > Daniel Xu <dxu@dxuuu.xyz> wrote: > > > Hi Masami, > > > > On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote: > > > Hello, > > > > > > Here is a series of patches for kprobes and stacktracer to fix the kretprobe > > > entries in the kernel stack. This was reported by Daniel Xu. I thought that > > > was in the bpftrace, but it is actually more generic issue. > > > So I decided to fix the issue in arch independent part. > > > > > > While fixing the issue, I found a bug in ia64 related to kretprobe, which is > > > fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main > > > issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe > > > internal change. And [5/5] removing the stacktrace kretprobe fixup code in > > > ftrace. > > > > > > Daniel, can you also check that this fixes your issue too? I hope it is. > > > > Unfortunately, this patch series does not fix the issue I reported. > > Ah, OK. This was my mistake I didn't choose ORC unwinder for test kernel. > > > > > I think the reason your tests work is because you're using ftrace and > > the ORC unwinder is aware of ftrace trampolines (see > > arch/x86/kernel/unwind_orc.c:orc_ftrace_find). > > OK, so it has to be fixed in ORC unwinder. > > > > > bpftrace kprobes go through perf event subsystem (ie not ftrace) so > > naturally orc_ftrace_find() does not find an associated trampoline. ORC > > unwinding fails in this case because > > arch/x86/kernel/kprobes/core.c:trampoline_handler sets > > > > regs->ip = (unsigned long)&kretprobe_trampoline; > > > > and `kretprobe_trampoline` is marked > > > > STACK_FRAME_NON_STANDARD(kretprobe_trampoline); > > > > so it doesn't have a valid ORC entry. Thus, ORC immediately bails when > > trying to unwind past the first frame. > > Hm, in ftrace case, it decode kretprobe's stackframe and stoped right > after kretprobe_trampoline callsite. > > => kretprobe_trace_func+0x21f/0x340 > => kretprobe_dispatcher+0x73/0xb0 > => __kretprobe_trampoline_handler+0xcd/0x1a0 > => trampoline_handler+0x3d/0x50 > => kretprobe_trampoline+0x25/0x50 > => 0 > > kretprobe replaces the real return address with kretprobe_trampoline > and kretprobe_trampoline *calls* trampoline_handler (this part depends > on architecture implementation). > Thus, if kretprobe_trampoline has no stack frame information, ORC may > fails at the first kretprobe_trampoline+0x25, that is different from > the kretprobe_trampoline+0, so the hack doesn't work. I'm not sure I follow 100% what you're saying, but assuming you're asking why bpftrace fails at `kretprobe_trampoline+0` instead of `kretprobe_trampoline+0x25`: `regs` is set to &kretprobe_trampoline: regs->ip = (unsigned long)&kretprobe_trampoline; Then the kretprobe event is dispatched like this: kretprobe_trampoline_handler __kretprobe_trampoline_handler rp->handler // ie kernel/trace/trace_kprobe.c:kretprobe_dispatcher kretprobe_perf_func trace_bpf_call(.., regs) BPF_PROG_RUN_ARRAY_CHECK bpf_get_stackid(regs, .., ..) // in bpftrace prog And then `bpf_get_stackid` unwinds the stack via: bpf_get_stackid get_perf_callchain(regs, ...) perf_callchain_kernel(.., regs) perf_callchain_store(.., regs->ip) // !!! first unwound entry unwind_start unwind_next_frame In summary: unwinding via BPF begins at regs->ip, which `trampoline_handler` sets to `&kretprobe_trampoline+0x0`. > Hmm, how the other inline-asm function makes its stackframe info? > If I get the kretprobe_trampoline+0, I can fix it up. So I think I misunderstood the mechanics before. I think `call trampoline_handler` does set up a new frame. My current guess is that ftrace doesn't thread through `regs` like the bpf code path does. I'm not very familiar with ftrace internals so I haven't looked. > > > The only way I can think of to fix this issue is to make the ORC > > unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping > > you have another idea if my patch isn't acceptable. > > OK, anyway, your patch doesn't care the case that the multiple functions > are probed by kretprobes. In that case, you'll see several entries are > replaced by the kretprobe_trampoline. To find it correctly, you have > to pass a state-holder (@cur of the kretprobe_find_ret_addr()) > to the fixup entries. I'll see if I can figure something out tomorrow. Thanks, Daniel
On Sun, 7 Mar 2021 13:23:33 -0800 Daniel Xu <dxu@dxuuu.xyz> wrote: > > kretprobe replaces the real return address with kretprobe_trampoline > > and kretprobe_trampoline *calls* trampoline_handler (this part depends > > on architecture implementation). > > Thus, if kretprobe_trampoline has no stack frame information, ORC may > > fails at the first kretprobe_trampoline+0x25, that is different from > > the kretprobe_trampoline+0, so the hack doesn't work. > > I'm not sure I follow 100% what you're saying, but assuming you're > asking why bpftrace fails at `kretprobe_trampoline+0` instead of > `kretprobe_trampoline+0x25`: > > `regs` is set to &kretprobe_trampoline: > > regs->ip = (unsigned long)&kretprobe_trampoline; Ah, OK. bpftrace does the adjustment. > Then the kretprobe event is dispatched like this: > > kretprobe_trampoline_handler > __kretprobe_trampoline_handler > rp->handler // ie kernel/trace/trace_kprobe.c:kretprobe_dispatcher > kretprobe_perf_func > trace_bpf_call(.., regs) > BPF_PROG_RUN_ARRAY_CHECK > bpf_get_stackid(regs, .., ..) // in bpftrace prog > > And then `bpf_get_stackid` unwinds the stack via: > > bpf_get_stackid > get_perf_callchain(regs, ...) > perf_callchain_kernel(.., regs) > perf_callchain_store(.., regs->ip) // !!! first unwound entry > unwind_start > unwind_next_frame > > In summary: unwinding via BPF begins at regs->ip, which > `trampoline_handler` sets to `&kretprobe_trampoline+0x0`. OK, maybe you are using stack_trace_save_regs() with pt_regs instead of stack_trace_save(). this means it started from regs at saved by kretprobe always. In the ftrace, we are using stack_trace_save() which is based on the current stack, this means stack unwinder tracks back the stack of kretprobe itself at first. So it saw the kretprobe_trampoline+0x25 (return address of the trampoline_handler) and stop unwinding because the call frame information (ORC information) and the return address are not there. This issue is not only the ftrace, but also backtrace in interrupt handler and kretprobe handler. > > Hmm, how the other inline-asm function makes its stackframe info? > > If I get the kretprobe_trampoline+0, I can fix it up. > > So I think I misunderstood the mechanics before. I think `call > trampoline_handler` does set up a new frame. My current guess is that > ftrace doesn't thread through `regs` like the bpf code path does. I'm > not very familiar with ftrace internals so I haven't looked. Yes, that's right. Since I made a kretprobe event on top of the ftrace event framework, it doesn't pass the regs to the event trigger. > > > The only way I can think of to fix this issue is to make the ORC > > > unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping > > > you have another idea if my patch isn't acceptable. > > > > OK, anyway, your patch doesn't care the case that the multiple functions > > are probed by kretprobes. In that case, you'll see several entries are > > replaced by the kretprobe_trampoline. To find it correctly, you have > > to pass a state-holder (@cur of the kretprobe_find_ret_addr()) > > to the fixup entries. > > I'll see if I can figure something out tomorrow. To help your understanding, let me explain. If we have a code here caller_func: 0x00 add sp, 0x20 /* 0x20 bytes stack frame allocated */ ... 0x10 call target_func 0x15 ... /* return address */ On the stack in the entry of target_func, we have [stack] 0x0e0 caller_func+0x15 ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ 0x100 /* caller_func return address */ And when we put a kretprobe on the target_func, the stack will be [stack] 0x0e0 kretprobe_trampoline ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ 0x100 /* caller_func return address */ * "caller_func+0x15" is saved in current->kretprobe_instances.first. When returning from the target_func, call consumed the 0x0e0 and jump to kretprobe_trampoline. Let's see the assembler code. ".text\n" ".global kretprobe_trampoline\n" ".type kretprobe_trampoline, @function\n" "kretprobe_trampoline:\n" /* We don't bother saving the ss register */ " pushq %rsp\n" " pushfq\n" SAVE_REGS_STRING " movq %rsp, %rdi\n" " call trampoline_handler\n" /* Replace saved sp with true return address. */ " movq %rax, 19*8(%rsp)\n" RESTORE_REGS_STRING " popfq\n" " ret\n" When the entry of trampoline_handler, stack is like this; [stack] 0x040 kretprobe_trampoline+0x25 0x048 r15 ... /* pt_regs */ 0x0d8 flags 0x0e0 rsp (=0x0e0) ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ 0x100 /* caller_func return address */ And after returned from trampoline_handler, "movq" changes the stack like this. [stack] 0x040 kretprobe_trampoline+0x25 0x048 r15 ... /* pt_regs */ 0x0d8 flags 0x0e0 caller_func+0x15 ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ 0x100 /* caller_func return address */ So at the kretprobe handler, we have 2 issues. 1) the return address (caller_func+0x15) is not on the stack. this can be solved by searching from current->kretprobe_instances. 2) the stack frame size of kretprobe_trampoline is unknown Since the stackframe is fixed, the fixed number (0x98) can be used. However, those solutions are only for the kretprobe handler. The stacktrace from interrupt handler hit in the kretprobe_trampoline still doesn't work. So, here is my idea; 1) Change the trampline code to prepare stack frame at first and save registers on it, instead of "push". This will makes ORC easy to setup stackframe information for this code. 2) change the return addres fixup timing. Instead of using return value of trampoline handler, before removing the real return address from current->kretprobe_instances. 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it checks the contents of the end of stackframe (at the place of regs->sp) is same as the address of it. If it is, it can find the correct address from current->kretprobe_instances. If not, there is a correct address. I need to find how the ORC info is prepared for 1), maybe UNWIND_HINT macro may help? Thank you,
On Mon, 8 Mar 2021 11:52:10 +0900 Masami Hiramatsu <mhiramat@kernel.org> wrote: > So, here is my idea; > > 1) Change the trampline code to prepare stack frame at first and save > registers on it, instead of "push". This will makes ORC easy to setup > stackframe information for this code. > 2) change the return addres fixup timing. Instead of using return value > of trampoline handler, before removing the real return address from > current->kretprobe_instances. > 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it > checks the contents of the end of stackframe (at the place of regs->sp) > is same as the address of it. If it is, it can find the correct address > from current->kretprobe_instances. If not, there is a correct address. Another trickly idea is put a call on top of kretprobe_trampoline like this. "__kretprobe_trampoline:\n" " call kretprobe_trampoline\n" "kretprobe_trampoline:\n" " pushq %rsp\n" " pushfq\n" SAVE_REGS_STRING " movq %rsp, %rdi\n" " call trampoline_handler\n" /* Replace __kretprobe_trampoline with true return address. */ " movq %rax, 20*8(%rsp)\n" RESTORE_REGS_STRING " popfq\n" " popq %rsp\n" " ret\n" This will leave a marker (kretprobe_trampoline or __kretprobe_trampoline+5) on the top of stack, and the stack frame seems like a normal function. If objtool can make an ORC info by disassembling kretprobe_trampoline, I guess it is easy to make a stack frame information. But anyway, from the inside of target function, it still see "__kretprobe_trampoline" on the stack instead of caller_func, so orc_kretprobe_find() is still needed. I'm not familier with the UNWIND_HINT macro, so if it is easy to handle the original case, I think my first idea will be better. Thank you,
On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote: > So at the kretprobe handler, we have 2 issues. > 1) the return address (caller_func+0x15) is not on the stack. > this can be solved by searching from current->kretprobe_instances. > 2) the stack frame size of kretprobe_trampoline is unknown > Since the stackframe is fixed, the fixed number (0x98) can be used. > > However, those solutions are only for the kretprobe handler. The stacktrace > from interrupt handler hit in the kretprobe_trampoline still doesn't work. > > So, here is my idea; > > 1) Change the trampline code to prepare stack frame at first and save > registers on it, instead of "push". This will makes ORC easy to setup > stackframe information for this code. > 2) change the return addres fixup timing. Instead of using return value > of trampoline handler, before removing the real return address from > current->kretprobe_instances. > 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it > checks the contents of the end of stackframe (at the place of regs->sp) > is same as the address of it. If it is, it can find the correct address > from current->kretprobe_instances. If not, there is a correct address. > > I need to find how the ORC info is prepared for 1), maybe UNWIND_HINT macro > may help? Hi Masami, If I understand correctly, for #1 you need an unwind hint which treats the instruction *after* the "pushq %rsp" as the beginning of the function. I'm thinking this should work: diff --git a/arch/x86/include/asm/unwind_hints.h b/arch/x86/include/asm/unwind_hints.h index 8e574c0afef8..8b33674288ea 100644 --- a/arch/x86/include/asm/unwind_hints.h +++ b/arch/x86/include/asm/unwind_hints.h @@ -52,6 +52,11 @@ UNWIND_HINT sp_reg=ORC_REG_SP sp_offset=8 type=UNWIND_HINT_TYPE_FUNC .endm +#else + +#define UNWIND_HINT_FUNC \ + UNWIND_HINT(ORC_REG_SP, 8, UNWIND_HINT_TYPE_FUNC, 0) + #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_UNWIND_HINTS_H */ diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c index df776cdca327..38ff1387f95d 100644 --- a/arch/x86/kernel/kprobes/core.c +++ b/arch/x86/kernel/kprobes/core.c @@ -767,6 +767,7 @@ asm( /* We don't bother saving the ss register */ #ifdef CONFIG_X86_64 " pushq %rsp\n" + UNWIND_HINT_FUNC " pushfq\n" SAVE_REGS_STRING " movq %rsp, %rdi\n" @@ -790,7 +791,6 @@ asm( ".size kretprobe_trampoline, .-kretprobe_trampoline\n" ); NOKPROBE_SYMBOL(kretprobe_trampoline); -STACK_FRAME_NON_STANDARD(kretprobe_trampoline); /*
Hi Masami, Just want to clarify a few points: On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote: > On Sun, 7 Mar 2021 13:23:33 -0800 > Daniel Xu <dxu@dxuuu.xyz> wrote: > To help your understanding, let me explain. > > If we have a code here > > caller_func: > 0x00 add sp, 0x20 /* 0x20 bytes stack frame allocated */ > ... > 0x10 call target_func > 0x15 ... /* return address */ > > On the stack in the entry of target_func, we have > > [stack] > 0x0e0 caller_func+0x15 > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > 0x100 /* caller_func return address */ > > And when we put a kretprobe on the target_func, the stack will be > > [stack] > 0x0e0 kretprobe_trampoline > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > 0x100 /* caller_func return address */ > > * "caller_func+0x15" is saved in current->kretprobe_instances.first. > > When returning from the target_func, call consumed the 0x0e0 and > jump to kretprobe_trampoline. Let's see the assembler code. > > ".text\n" > ".global kretprobe_trampoline\n" > ".type kretprobe_trampoline, @function\n" > "kretprobe_trampoline:\n" > /* We don't bother saving the ss register */ > " pushq %rsp\n" > " pushfq\n" > SAVE_REGS_STRING > " movq %rsp, %rdi\n" > " call trampoline_handler\n" > /* Replace saved sp with true return address. */ > " movq %rax, 19*8(%rsp)\n" > RESTORE_REGS_STRING > " popfq\n" > " ret\n" > > When the entry of trampoline_handler, stack is like this; > > [stack] > 0x040 kretprobe_trampoline+0x25 > 0x048 r15 > ... /* pt_regs */ > 0x0d8 flags > 0x0e0 rsp (=0x0e0) > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > 0x100 /* caller_func return address */ > > And after returned from trampoline_handler, "movq" changes the > stack like this. > > [stack] > 0x040 kretprobe_trampoline+0x25 > 0x048 r15 > ... /* pt_regs */ > 0x0d8 flags > 0x0e0 caller_func+0x15 > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > 0x100 /* caller_func return address */ Thanks for the detailed explanation. I think I understand kretprobe mechanics from a somewhat high level (kprobe saves real return address on entry, overwrites return address to trampoline, then trampoline runs handler and finally resets return address to real return address). I don't usually write much assembly so the details escape me somewhat. > So at the kretprobe handler, we have 2 issues. > 1) the return address (caller_func+0x15) is not on the stack. > this can be solved by searching from current->kretprobe_instances. Yes, agreed. > 2) the stack frame size of kretprobe_trampoline is unknown > Since the stackframe is fixed, the fixed number (0x98) can be used. I'm confused why this is relevant. Is it so ORC knows where to find saved return address in the frame? > However, those solutions are only for the kretprobe handler. The stacktrace > from interrupt handler hit in the kretprobe_trampoline still doesn't work. > > So, here is my idea; > > 1) Change the trampline code to prepare stack frame at first and save > registers on it, instead of "push". This will makes ORC easy to setup > stackframe information for this code. I'm confused on the details here. But this is what Josh solves in his patch, right? > 2) change the return addres fixup timing. Instead of using return value > of trampoline handler, before removing the real return address from > current->kretprobe_instances. Is the idea to have `kretprobe_trampoline` place the real return address on the stack (while telling ORC where to find it) _before_ running `call trampoline_handler` ? So that an unwind from inside the user defined kretprobe handler simply unwinds correctly? And to be extra clear, this would only work for stack_trace_save() and not stack_trace_save_regs()? > 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it > checks the contents of the end of stackframe (at the place of regs->sp) > is same as the address of it. If it is, it can find the correct address > from current->kretprobe_instances. If not, there is a correct address. What do you mean by "it" w.r.t. "is the same address of it"? I'm confused on this point. Thanks, Daniel
Hi Josh, On Mon, 8 Mar 2021 19:19:45 -0600 Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote: > > So at the kretprobe handler, we have 2 issues. > > 1) the return address (caller_func+0x15) is not on the stack. > > this can be solved by searching from current->kretprobe_instances. > > 2) the stack frame size of kretprobe_trampoline is unknown > > Since the stackframe is fixed, the fixed number (0x98) can be used. > > > > However, those solutions are only for the kretprobe handler. The stacktrace > > from interrupt handler hit in the kretprobe_trampoline still doesn't work. > > > > So, here is my idea; > > > > 1) Change the trampline code to prepare stack frame at first and save > > registers on it, instead of "push". This will makes ORC easy to setup > > stackframe information for this code. > > 2) change the return addres fixup timing. Instead of using return value > > of trampoline handler, before removing the real return address from > > current->kretprobe_instances. > > 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it > > checks the contents of the end of stackframe (at the place of regs->sp) > > is same as the address of it. If it is, it can find the correct address > > from current->kretprobe_instances. If not, there is a correct address. > > > > I need to find how the ORC info is prepared for 1), maybe UNWIND_HINT macro > > may help? > > Hi Masami, > > If I understand correctly, for #1 you need an unwind hint which treats > the instruction *after* the "pushq %rsp" as the beginning of the > function. Thanks for the patch. In that case, should I still change the stack allocation? Or can I continue to use a series of "push/pop" ? > > I'm thinking this should work: OK, Let me test it. Thanks! > > > diff --git a/arch/x86/include/asm/unwind_hints.h b/arch/x86/include/asm/unwind_hints.h > index 8e574c0afef8..8b33674288ea 100644 > --- a/arch/x86/include/asm/unwind_hints.h > +++ b/arch/x86/include/asm/unwind_hints.h > @@ -52,6 +52,11 @@ > UNWIND_HINT sp_reg=ORC_REG_SP sp_offset=8 type=UNWIND_HINT_TYPE_FUNC > .endm > > +#else > + > +#define UNWIND_HINT_FUNC \ > + UNWIND_HINT(ORC_REG_SP, 8, UNWIND_HINT_TYPE_FUNC, 0) > + > #endif /* __ASSEMBLY__ */ > > #endif /* _ASM_X86_UNWIND_HINTS_H */ > diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c > index df776cdca327..38ff1387f95d 100644 > --- a/arch/x86/kernel/kprobes/core.c > +++ b/arch/x86/kernel/kprobes/core.c > @@ -767,6 +767,7 @@ asm( > /* We don't bother saving the ss register */ > #ifdef CONFIG_X86_64 > " pushq %rsp\n" > + UNWIND_HINT_FUNC > " pushfq\n" > SAVE_REGS_STRING > " movq %rsp, %rdi\n" > @@ -790,7 +791,6 @@ asm( > ".size kretprobe_trampoline, .-kretprobe_trampoline\n" > ); > NOKPROBE_SYMBOL(kretprobe_trampoline); > -STACK_FRAME_NON_STANDARD(kretprobe_trampoline); > > > /* >
On Tue, 9 Mar 2021 13:34:42 -0800 Daniel Xu <dxu@dxuuu.xyz> wrote: > Hi Masami, > > Just want to clarify a few points: > > On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote: > > On Sun, 7 Mar 2021 13:23:33 -0800 > > Daniel Xu <dxu@dxuuu.xyz> wrote: > > To help your understanding, let me explain. > > > > If we have a code here > > > > caller_func: > > 0x00 add sp, 0x20 /* 0x20 bytes stack frame allocated */ > > ... > > 0x10 call target_func > > 0x15 ... /* return address */ > > > > On the stack in the entry of target_func, we have > > > > [stack] > > 0x0e0 caller_func+0x15 > > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > > 0x100 /* caller_func return address */ > > > > And when we put a kretprobe on the target_func, the stack will be > > > > [stack] > > 0x0e0 kretprobe_trampoline > > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > > 0x100 /* caller_func return address */ > > > > * "caller_func+0x15" is saved in current->kretprobe_instances.first. > > > > When returning from the target_func, call consumed the 0x0e0 and > > jump to kretprobe_trampoline. Let's see the assembler code. > > > > ".text\n" > > ".global kretprobe_trampoline\n" > > ".type kretprobe_trampoline, @function\n" > > "kretprobe_trampoline:\n" > > /* We don't bother saving the ss register */ > > " pushq %rsp\n" > > " pushfq\n" > > SAVE_REGS_STRING > > " movq %rsp, %rdi\n" > > " call trampoline_handler\n" > > /* Replace saved sp with true return address. */ > > " movq %rax, 19*8(%rsp)\n" > > RESTORE_REGS_STRING > > " popfq\n" > > " ret\n" > > > > When the entry of trampoline_handler, stack is like this; > > > > [stack] > > 0x040 kretprobe_trampoline+0x25 > > 0x048 r15 > > ... /* pt_regs */ > > 0x0d8 flags > > 0x0e0 rsp (=0x0e0) > > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > > 0x100 /* caller_func return address */ > > > > And after returned from trampoline_handler, "movq" changes the > > stack like this. > > > > [stack] > > 0x040 kretprobe_trampoline+0x25 > > 0x048 r15 > > ... /* pt_regs */ > > 0x0d8 flags > > 0x0e0 caller_func+0x15 > > ... /* 0x20 bytes = 4 entries are stack frame of caller_func */ > > 0x100 /* caller_func return address */ > > Thanks for the detailed explanation. I think I understand kretprobe > mechanics from a somewhat high level (kprobe saves real return address > on entry, overwrites return address to trampoline, then trampoline > runs handler and finally resets return address to real return address). > > I don't usually write much assembly so the details escape me somewhat. > > > So at the kretprobe handler, we have 2 issues. > > 1) the return address (caller_func+0x15) is not on the stack. > > this can be solved by searching from current->kretprobe_instances. > > Yes, agreed. > > > 2) the stack frame size of kretprobe_trampoline is unknown > > Since the stackframe is fixed, the fixed number (0x98) can be used. > > I'm confused why this is relevant. Is it so ORC knows where to find > saved return address in the frame? No, because the kretprobe_trampoline is somewhat special. Usually, at the function entry, there is a return address on the top of stack, but kretprobe_trampoline doesn't have it. So we have to put a hint at the function entry to mark there should be a next return address. (and ORC unwinder must find it) > > However, those solutions are only for the kretprobe handler. The stacktrace > > from interrupt handler hit in the kretprobe_trampoline still doesn't work. > > > > So, here is my idea; > > > > 1) Change the trampline code to prepare stack frame at first and save > > registers on it, instead of "push". This will makes ORC easy to setup > > stackframe information for this code. > > I'm confused on the details here. But this is what Josh solves in his > patch, right? I'm not so sure how objtool makes the ORC information. If it can trace the push/pop correctly, yes, it is solved. > > 2) change the return addres fixup timing. Instead of using return value > > of trampoline handler, before removing the real return address from > > current->kretprobe_instances. > > Is the idea to have `kretprobe_trampoline` place the real return address > on the stack (while telling ORC where to find it) _before_ running `call > trampoline_handler` ? So that an unwind from inside the user defined > kretprobe handler simply unwinds correctly? No, unless calling the trampoline_handler, we can not get the real return address. Thus, the __kretprobe_trampoline_handler() will call return address fixup function right before unlink the current->kretprobe_instances. > And to be extra clear, this would only work for stack_trace_save() and > not stack_trace_save_regs()? Yes, for the stack_trace_save_regs() and the stack-tracing inside the kretprobe'd target function, we still need a hack as same as orc_ftrace_find(). > > > 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it > > checks the contents of the end of stackframe (at the place of regs->sp) > > is same as the address of it. If it is, it can find the correct address > > from current->kretprobe_instances. If not, there is a correct address. > > What do you mean by "it" w.r.t. "is the same address of it"? I'm > confused on this point. Oh I meant, 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, orc_find() checks the contents of the end of stackframe (at the place of regs->sp) is same as the address of the stackframe (Note that kretprobe_trampoline does "push %sp" at first). If so, orc_find() can find the correct address from current->kretprobe_instances. If not, there is a correct address. I need to see the orc unwinder carefully, orc_find() only gets the ip but to find stackframe, I think this should be fixed in the caller of orc_find(). Thank you,
On Wed, Mar 10, 2021 at 06:57:34PM +0900, Masami Hiramatsu wrote: > > If I understand correctly, for #1 you need an unwind hint which treats > > the instruction *after* the "pushq %rsp" as the beginning of the > > function. > > Thanks for the patch. In that case, should I still change the stack allocation? > Or can I continue to use a series of "push/pop" ? You can continue to use push/pop. Objtool is only getting confused by the unbalanced stack of the function (more pushes than pops). The unwind hint should fix that.
Hi Josh and Daniel, On Wed, 10 Mar 2021 09:08:45 -0600 Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Wed, Mar 10, 2021 at 06:57:34PM +0900, Masami Hiramatsu wrote: > > > If I understand correctly, for #1 you need an unwind hint which treats > > > the instruction *after* the "pushq %rsp" as the beginning of the > > > function. > > > > Thanks for the patch. In that case, should I still change the stack allocation? > > Or can I continue to use a series of "push/pop" ? > > You can continue to use push/pop. Objtool is only getting confused by > the unbalanced stack of the function (more pushes than pops). The > unwind hint should fix that. With you patch, I made a fix for ORC unwinder. I confirmed it works with 2 multiple kretprobes on the call path like this ; cd /sys/kernel/debug/tracing/ echo r vfs_read >> kprobe_events echo r full_proxy_read >> kprobe_events echo traceoff:1 > events/kprobes/r_vfs_read_0/trigger echo stacktrace:1 > events/kprobes/r_full_proxy_read_0/trigger echo 1 > events/kprobes/enable echo 1 > options/sym-offset cat /sys/kernel/debug/kprobes/list echo 0 > events/kprobes/enable cat trace # tracer: nop # # entries-in-buffer/entries-written: 3/3 #P:8 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | <...>-136 [004] ...1 648.481281: r_full_proxy_read_0: (vfs_read+0xab/0x1a0 <- full_proxy_read) <...>-136 [004] ...1 648.481310: <stack trace> => kretprobe_trace_func+0x209/0x2f0 => kretprobe_dispatcher+0x4a/0x70 => __kretprobe_trampoline_handler+0xcd/0x170 => trampoline_handler+0x3d/0x50 => kretprobe_trampoline+0x25/0x50 => vfs_read+0xab/0x1a0 => ksys_read+0x5f/0xe0 => do_syscall_64+0x33/0x40 => entry_SYSCALL_64_after_hwframe+0x44/0xae => 0 => 0 => 0 => 0 => 0 => 0 => 0 I didn't tested it with bpftrace, but this also handles regs->ip == kretprobe_trampoline case. So it should work. commit aa452d999b524b1851f69cc947be3e1a2f3ca1ec Author: Masami Hiramatsu <mhiramat@kernel.org> Date: Sat Mar 6 08:34:51 2021 +0900 x86/unwind/orc: Fixup kretprobe trampoline entry Since the kretprobe replaces the function return address with the kretprobe_trampoline on the stack, the ORC unwinder can not continue the stack unwinding at that point. To fix this issue, correct state->ip as like as function-graph tracer in the unwind_next_frame(). Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h index 70fc159ebe69..ab5e45b848d5 100644 --- a/arch/x86/include/asm/unwind.h +++ b/arch/x86/include/asm/unwind.h @@ -4,6 +4,7 @@ #include <linux/sched.h> #include <linux/ftrace.h> +#include <linux/llist.h> #include <asm/ptrace.h> #include <asm/stacktrace.h> @@ -20,6 +21,9 @@ struct unwind_state { bool signal, full_regs; unsigned long sp, bp, ip; struct pt_regs *regs, *prev_regs; +#if defined(CONFIG_KRETPROBES) + struct llist_node *kr_iter; +#endif #elif defined(CONFIG_UNWINDER_FRAME_POINTER) bool got_irq; unsigned long *bp, *orig_sp, ip; diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c index 2a1d47f47eee..94869516cfc0 100644 --- a/arch/x86/kernel/unwind_orc.c +++ b/arch/x86/kernel/unwind_orc.c @@ -2,6 +2,7 @@ #include <linux/objtool.h> #include <linux/module.h> #include <linux/sort.h> +#include <linux/kprobes.h> #include <asm/ptrace.h> #include <asm/stacktrace.h> #include <asm/unwind.h> @@ -414,6 +415,30 @@ static bool get_reg(struct unwind_state *state, unsigned int reg_off, return false; } +#ifdef CONFIG_KRETPROBES +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state) +{ + return kretprobe_find_ret_addr( + (unsigned long)kretprobe_trampoline_addr(), + state->task, &state->kr_iter); +} + +static bool is_kretprobe_trampoline_address(unsigned long ip) +{ + return ip == (unsigned long)kretprobe_trampoline_addr(); +} +#else +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state) +{ + return state->ip; +} + +static bool is_kretprobe_trampoline_address(unsigned long ip) +{ + return false; +} +#endif + bool unwind_next_frame(struct unwind_state *state) { unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state) state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, state->ip, (void *)ip_p); + /* + * There are special cases when the stack unwinder is called + * from the kretprobe handler or the interrupt handler which + * occurs in the kretprobe trampoline code. In those cases, + * %sp is shown on the stack instead of the return address. + * Or, when the unwinder find the return address is replaced + * by kretprobe_trampoline. + * In those cases, correct address can be found in kretprobe. + */ + if (state->ip == sp || + is_kretprobe_trampoline_address(state->ip)) + state->ip = orc_kretprobe_correct_ip(state); state->sp = sp; state->regs = NULL; @@ -649,6 +686,12 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task, state->full_regs = true; state->signal = true; + /* + * When the unwinder called with regs from kretprobe handler, + * the regs->ip starts from kretprobe_trampoline address. + */ + if (is_kretprobe_trampoline_address(state->ip)) + state->ip = orc_kretprobe_correct_ip(state); } else if (task == current) { asm volatile("lea (%%rip), %0\n\t" "mov %%rsp, %1\n\t"
On Thu, Mar 11, 2021 at 12:55:09AM +0900, Masami Hiramatsu wrote: > +#ifdef CONFIG_KRETPROBES > +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state) > +{ > + return kretprobe_find_ret_addr( > + (unsigned long)kretprobe_trampoline_addr(), > + state->task, &state->kr_iter); > +} > + > +static bool is_kretprobe_trampoline_address(unsigned long ip) > +{ > + return ip == (unsigned long)kretprobe_trampoline_addr(); > +} > +#else > +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state) > +{ > + return state->ip; > +} > + > +static bool is_kretprobe_trampoline_address(unsigned long ip) > +{ > + return false; > +} > +#endif > + Can this code go in a kprobes file? I'd rather not clutter ORC with it, and maybe it would be useful for other arches or unwinders. > bool unwind_next_frame(struct unwind_state *state) > { > unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state) > > state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, > state->ip, (void *)ip_p); > + /* > + * There are special cases when the stack unwinder is called > + * from the kretprobe handler or the interrupt handler which > + * occurs in the kretprobe trampoline code. In those cases, > + * %sp is shown on the stack instead of the return address. > + * Or, when the unwinder find the return address is replaced > + * by kretprobe_trampoline. > + * In those cases, correct address can be found in kretprobe. > + */ > + if (state->ip == sp || Why is the 'state->ip == sp' needed? > + is_kretprobe_trampoline_address(state->ip)) > + state->ip = orc_kretprobe_correct_ip(state); This is similar in concept to ftrace_graph_ret_addr(), right? Would it be possible to have a similar API? Like state->ip = kretprobe_ret_addr(state->task, &state->kr_iter, state->ip); and without the conditional. > > state->sp = sp; > state->regs = NULL; > @@ -649,6 +686,12 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task, > state->full_regs = true; > state->signal = true; > > + /* > + * When the unwinder called with regs from kretprobe handler, > + * the regs->ip starts from kretprobe_trampoline address. > + */ > + if (is_kretprobe_trampoline_address(state->ip)) > + state->ip = orc_kretprobe_correct_ip(state); Shouldn't __kretprobe_trampoline_handler() just set regs->ip to 'correct_ret_addr' before passing the regs to the handler? I'd think that would be a less surprising value for regs->ip than '&kretprobe_trampoline'. And it would make the unwinder just work automatically when unwinding from the handler using the regs. It would also work when unwinding from the handler's stack, if we put an UNWIND_HINT_REGS after saving the regs. The only (rare) case it wouldn't work would be unwinding from an interrupt before regs->ip gets set properly. In which case we'd still need the above call to orc_kretprobe_correct_ip() or so.
On Thu, Mar 11, 2021 at 12:55:09AM +0900, Masami Hiramatsu wrote: > Hi Josh and Daniel, <...> > commit aa452d999b524b1851f69cc947be3e1a2f3ca1ec > Author: Masami Hiramatsu <mhiramat@kernel.org> > Date: Sat Mar 6 08:34:51 2021 +0900 > > x86/unwind/orc: Fixup kretprobe trampoline entry > > Since the kretprobe replaces the function return address with > the kretprobe_trampoline on the stack, the ORC unwinder can not > continue the stack unwinding at that point. > > To fix this issue, correct state->ip as like as function-graph > tracer in the unwind_next_frame(). > > Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> > I applied your original patchset + Josh's patch + this patch and can confirm it works for bpftrace + kretprobes. Tested-by: Daniel Xu <dxu@dxuuu.xyz>
On Wed, 10 Mar 2021 12:31:13 -0600 Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Thu, Mar 11, 2021 at 12:55:09AM +0900, Masami Hiramatsu wrote: > > +#ifdef CONFIG_KRETPROBES > > +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state) > > +{ > > + return kretprobe_find_ret_addr( > > + (unsigned long)kretprobe_trampoline_addr(), > > + state->task, &state->kr_iter); > > +} > > + > > +static bool is_kretprobe_trampoline_address(unsigned long ip) > > +{ > > + return ip == (unsigned long)kretprobe_trampoline_addr(); > > +} > > +#else > > +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state) > > +{ > > + return state->ip; > > +} > > + > > +static bool is_kretprobe_trampoline_address(unsigned long ip) > > +{ > > + return false; > > +} > > +#endif > > + > > Can this code go in a kprobes file? I'd rather not clutter ORC with it, > and maybe it would be useful for other arches or unwinders. Yes, anyway dummy kretprobe_find_ret_addr() and kretprobe_trampoline_addr() should be defined !CONFIG_KRETPROBES case. > > > bool unwind_next_frame(struct unwind_state *state) > > { > > unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state) > > > > state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, > > state->ip, (void *)ip_p); > > + /* > > + * There are special cases when the stack unwinder is called > > + * from the kretprobe handler or the interrupt handler which > > + * occurs in the kretprobe trampoline code. In those cases, > > + * %sp is shown on the stack instead of the return address. > > + * Or, when the unwinder find the return address is replaced > > + * by kretprobe_trampoline. > > + * In those cases, correct address can be found in kretprobe. > > + */ > > + if (state->ip == sp || > > Why is the 'state->ip == sp' needed? As I commented above, until kretprobe_trampoline writes back the real address to the stack, sp value is there (which has been pushed by the 'pushq %rsp' at the entry of kretprobe_trampoline.) ".type kretprobe_trampoline, @function\n" "kretprobe_trampoline:\n" /* We don't bother saving the ss register */ " pushq %rsp\n" // THIS " pushfq\n" Thus, from inside the kretprobe handler, like ftrace, you'll see the sp value instead of the real return address. > > + is_kretprobe_trampoline_address(state->ip)) > > + state->ip = orc_kretprobe_correct_ip(state); > > This is similar in concept to ftrace_graph_ret_addr(), right? Would it > be possible to have a similar API? Like > > state->ip = kretprobe_ret_addr(state->task, &state->kr_iter, state->ip); OK, but, > and without the conditional. As I said, it is not possible because "state->ip == sp" check depends on ORC unwinder. > > state->sp = sp; > > state->regs = NULL; > > @@ -649,6 +686,12 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task, > > state->full_regs = true; > > state->signal = true; > > > > + /* > > + * When the unwinder called with regs from kretprobe handler, > > + * the regs->ip starts from kretprobe_trampoline address. > > + */ > > + if (is_kretprobe_trampoline_address(state->ip)) > > + state->ip = orc_kretprobe_correct_ip(state); > > Shouldn't __kretprobe_trampoline_handler() just set regs->ip to > 'correct_ret_addr' before passing the regs to the handler? I'd think > that would be a less surprising value for regs->ip than > '&kretprobe_trampoline'. Hmm, actually current implementation on x86 mimics the behevior of the int3 exception (which many architectures still do). Previously the kretprobe_trampoline is a place holder like this. "kretprobe_trampoline:\n" " nop\n" And arch_init_kprobes() puts a kprobe (int3) there. So in that case regs->ip should be kretprobe_trampoline. User handler (usually architecutre independent) finds the correct_ret_addr in kretprobe_instance.ret_addr field. > And it would make the unwinder just work automatically when unwinding > from the handler using the regs. > > It would also work when unwinding from the handler's stack, if we put an > UNWIND_HINT_REGS after saving the regs. At that moment, the real return address is not identified. So we can not put it. > > The only (rare) case it wouldn't work would be unwinding from an > interrupt before regs->ip gets set properly. In which case we'd still > need the above call to orc_kretprobe_correct_ip() or so. Thank you,
On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote: > > > bool unwind_next_frame(struct unwind_state *state) > > > { > > > unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; > > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state) > > > > > > state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, > > > state->ip, (void *)ip_p); > > > + /* > > > + * There are special cases when the stack unwinder is called > > > + * from the kretprobe handler or the interrupt handler which > > > + * occurs in the kretprobe trampoline code. In those cases, > > > + * %sp is shown on the stack instead of the return address. > > > + * Or, when the unwinder find the return address is replaced > > > + * by kretprobe_trampoline. > > > + * In those cases, correct address can be found in kretprobe. > > > + */ > > > + if (state->ip == sp || > > > > Why is the 'state->ip == sp' needed? > > As I commented above, until kretprobe_trampoline writes back the real > address to the stack, sp value is there (which has been pushed by the > 'pushq %rsp' at the entry of kretprobe_trampoline.) > > ".type kretprobe_trampoline, @function\n" > "kretprobe_trampoline:\n" > /* We don't bother saving the ss register */ > " pushq %rsp\n" // THIS > " pushfq\n" > > Thus, from inside the kretprobe handler, like ftrace, you'll see > the sp value instead of the real return address. I see. If you change is_kretprobe_trampoline_address() to include the entire function, like: static bool is_kretprobe_trampoline_address(unsigned long ip) { return (void *)ip >= kretprobe_trampoline && (void *)ip < kretprobe_trampoline_end; } then the unwinder won't ever read the bogus %rsp value into state->ip, and the 'state->ip == sp' check can be removed. > > And it would make the unwinder just work automatically when unwinding > > from the handler using the regs. > > > > It would also work when unwinding from the handler's stack, if we put an > > UNWIND_HINT_REGS after saving the regs. > > At that moment, the real return address is not identified. So we can not > put it. True, at the time the regs are originally saved, the real return address isn't available. But by the time the user handler is called, the return address *is* available. So if the real return address were placed in regs->ip before calling the handler, the unwinder could find it there, when called from the handler. Then we wouldn't need the call to orc_kretprobe_correct_ip() in __unwind_start(). But maybe it's not possible due to the regs->ip expectations of legacy handlers?
On Wed, 10 Mar 2021 19:06:15 -0600 Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote: > > > > bool unwind_next_frame(struct unwind_state *state) > > > > { > > > > unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; > > > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state) > > > > > > > > state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, > > > > state->ip, (void *)ip_p); > > > > + /* > > > > + * There are special cases when the stack unwinder is called > > > > + * from the kretprobe handler or the interrupt handler which > > > > + * occurs in the kretprobe trampoline code. In those cases, > > > > + * %sp is shown on the stack instead of the return address. > > > > + * Or, when the unwinder find the return address is replaced > > > > + * by kretprobe_trampoline. > > > > + * In those cases, correct address can be found in kretprobe. > > > > + */ > > > > + if (state->ip == sp || > > > > > > Why is the 'state->ip == sp' needed? > > > > As I commented above, until kretprobe_trampoline writes back the real > > address to the stack, sp value is there (which has been pushed by the > > 'pushq %rsp' at the entry of kretprobe_trampoline.) > > > > ".type kretprobe_trampoline, @function\n" > > "kretprobe_trampoline:\n" > > /* We don't bother saving the ss register */ > > " pushq %rsp\n" // THIS > > " pushfq\n" > > > > Thus, from inside the kretprobe handler, like ftrace, you'll see > > the sp value instead of the real return address. > > I see. If you change is_kretprobe_trampoline_address() to include the > entire function, like: > > static bool is_kretprobe_trampoline_address(unsigned long ip) > { > return (void *)ip >= kretprobe_trampoline && > (void *)ip < kretprobe_trampoline_end; > } > > then the unwinder won't ever read the bogus %rsp value into state->ip, > and the 'state->ip == sp' check can be removed. Hmm, I couldn't get your point. Since sp is the address of stack, it always out of text address. > > > > And it would make the unwinder just work automatically when unwinding > > > from the handler using the regs. > > > > > > It would also work when unwinding from the handler's stack, if we put an > > > UNWIND_HINT_REGS after saving the regs. > > > > At that moment, the real return address is not identified. So we can not > > put it. > > True, at the time the regs are originally saved, the real return address > isn't available. But by the time the user handler is called, the return > address *is* available. So if the real return address were placed in > regs->ip before calling the handler, the unwinder could find it there, > when called from the handler. OK, but this is not arch independent specification. I can make a hack only for x86, but that is not clean implementation, hmm. > > Then we wouldn't need the call to orc_kretprobe_correct_ip() in > __unwind_start(). What about the ORC implementation in other architecture? Is that for x86 only? > > But maybe it's not possible due to the regs->ip expectations of legacy > handlers? Usually, the legacy handlers will ignore it, the official way to access the correct return address is kretprobe_instance.ret_addr. Because it is arch independent. Nowadays there are instruction_pointer() and instruction_pointer_set() APIs in many (not all) architecutre, so I can try to replace to use it instead of the kretprobe_instance.ret_addr. (and it will break the out-of-tree codes) Thank you,
On Thu, Mar 11, 2021 at 10:54:38AM +0900, Masami Hiramatsu wrote: > On Wed, 10 Mar 2021 19:06:15 -0600 > Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > > On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote: > > > > > bool unwind_next_frame(struct unwind_state *state) > > > > > { > > > > > unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; > > > > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state) > > > > > > > > > > state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, > > > > > state->ip, (void *)ip_p); > > > > > + /* > > > > > + * There are special cases when the stack unwinder is called > > > > > + * from the kretprobe handler or the interrupt handler which > > > > > + * occurs in the kretprobe trampoline code. In those cases, > > > > > + * %sp is shown on the stack instead of the return address. > > > > > + * Or, when the unwinder find the return address is replaced > > > > > + * by kretprobe_trampoline. > > > > > + * In those cases, correct address can be found in kretprobe. > > > > > + */ > > > > > + if (state->ip == sp || > > > > > > > > Why is the 'state->ip == sp' needed? > > > > > > As I commented above, until kretprobe_trampoline writes back the real > > > address to the stack, sp value is there (which has been pushed by the > > > 'pushq %rsp' at the entry of kretprobe_trampoline.) > > > > > > ".type kretprobe_trampoline, @function\n" > > > "kretprobe_trampoline:\n" > > > /* We don't bother saving the ss register */ > > > " pushq %rsp\n" // THIS > > > " pushfq\n" > > > > > > Thus, from inside the kretprobe handler, like ftrace, you'll see > > > the sp value instead of the real return address. > > > > I see. If you change is_kretprobe_trampoline_address() to include the > > entire function, like: > > > > static bool is_kretprobe_trampoline_address(unsigned long ip) > > { > > return (void *)ip >= kretprobe_trampoline && > > (void *)ip < kretprobe_trampoline_end; > > } > > > > then the unwinder won't ever read the bogus %rsp value into state->ip, > > and the 'state->ip == sp' check can be removed. > > Hmm, I couldn't get your point. Since sp is the address of stack, > it always out of text address. When unwinding from trampoline_handler(), state->ip will point to the instruction after the call: call trampoline_handler movq %rax, 19*8(%rsp) <-- state->ip points to this insn But then, the above version of is_kretprobe_trampoline_address() is true, so state->ip gets immediately replaced with the real return address: if (is_kretprobe_trampoline_address(state->ip)) state->ip = orc_kretprobe_correct_ip(state); so the unwinder skips over the kretprobe_trampoline() frame and goes straight to the frame of the real return address. Thus it never reads this bogus return value into state->ip: pushq %rsp which is why the weird 'state->ip == sp' check is no longer needed. The only "downside" is that the unwinder skips the kretprobe_trampoline() frame. (note that downside wouldn't exist in the case of UNWIND_HINT_REGS + valid regs->ip). > > > > And it would make the unwinder just work automatically when unwinding > > > > from the handler using the regs. > > > > > > > > It would also work when unwinding from the handler's stack, if we put an > > > > UNWIND_HINT_REGS after saving the regs. > > > > > > At that moment, the real return address is not identified. So we can not > > > put it. > > > > True, at the time the regs are originally saved, the real return address > > isn't available. But by the time the user handler is called, the return > > address *is* available. So if the real return address were placed in > > regs->ip before calling the handler, the unwinder could find it there, > > when called from the handler. > > OK, but this is not arch independent specification. I can make a hack > only for x86, but that is not clean implementation, hmm. > > > > > Then we wouldn't need the call to orc_kretprobe_correct_ip() in > > __unwind_start(). > > What about the ORC implementation in other architecture? Is that for > x86 only? ORC is x86 only. > > But maybe it's not possible due to the regs->ip expectations of legacy > > handlers? > > Usually, the legacy handlers will ignore it, the official way to access > the correct return address is kretprobe_instance.ret_addr. Because it is > arch independent. > > Nowadays there are instruction_pointer() and instruction_pointer_set() APIs > in many (not all) architecutre, so I can try to replace to use it instead > of the kretprobe_instance.ret_addr. > (and it will break the out-of-tree codes) That sounds better to me, though I don't have an understanding of what it would break.
On Thu, 11 Mar 2021 10:51:10 -0600 Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Thu, Mar 11, 2021 at 10:54:38AM +0900, Masami Hiramatsu wrote: > > On Wed, 10 Mar 2021 19:06:15 -0600 > > Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > > > > On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote: > > > > > > bool unwind_next_frame(struct unwind_state *state) > > > > > > { > > > > > > unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp; > > > > > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state) > > > > > > > > > > > > state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx, > > > > > > state->ip, (void *)ip_p); > > > > > > + /* > > > > > > + * There are special cases when the stack unwinder is called > > > > > > + * from the kretprobe handler or the interrupt handler which > > > > > > + * occurs in the kretprobe trampoline code. In those cases, > > > > > > + * %sp is shown on the stack instead of the return address. > > > > > > + * Or, when the unwinder find the return address is replaced > > > > > > + * by kretprobe_trampoline. > > > > > > + * In those cases, correct address can be found in kretprobe. > > > > > > + */ > > > > > > + if (state->ip == sp || > > > > > > > > > > Why is the 'state->ip == sp' needed? > > > > > > > > As I commented above, until kretprobe_trampoline writes back the real > > > > address to the stack, sp value is there (which has been pushed by the > > > > 'pushq %rsp' at the entry of kretprobe_trampoline.) > > > > > > > > ".type kretprobe_trampoline, @function\n" > > > > "kretprobe_trampoline:\n" > > > > /* We don't bother saving the ss register */ > > > > " pushq %rsp\n" // THIS > > > > " pushfq\n" > > > > > > > > Thus, from inside the kretprobe handler, like ftrace, you'll see > > > > the sp value instead of the real return address. > > > > > > I see. If you change is_kretprobe_trampoline_address() to include the > > > entire function, like: > > > > > > static bool is_kretprobe_trampoline_address(unsigned long ip) > > > { > > > return (void *)ip >= kretprobe_trampoline && > > > (void *)ip < kretprobe_trampoline_end; > > > } > > > > > > then the unwinder won't ever read the bogus %rsp value into state->ip, > > > and the 'state->ip == sp' check can be removed. > > > > Hmm, I couldn't get your point. Since sp is the address of stack, > > it always out of text address. > > When unwinding from trampoline_handler(), state->ip will point to the > instruction after the call: > > call trampoline_handler > movq %rax, 19*8(%rsp) <-- state->ip points to this insn > > But then, the above version of is_kretprobe_trampoline_address() is > true, so state->ip gets immediately replaced with the real return > address: > > if (is_kretprobe_trampoline_address(state->ip)) > state->ip = orc_kretprobe_correct_ip(state); Hmm, but that ip is for the "next" frame, isn't it? [stack] 0x040 kretprobe_trampoline+0x25 -+ 0x048 r15 | ... /* pt_regs */ +- ORC sp_offset at the kretprobe_trampoline+0x24 0x0d8 flags -+ 0x0e0 rsp (=0x0e0) /* will be replaced by the real return address */ Your idea seems replacing stack@0x040 with the real return address. In that case, may orc_find() returns the wrong ORC info? > so the unwinder skips over the kretprobe_trampoline() frame and goes > straight to the frame of the real return address. Thus it never reads > this bogus return value into state->ip: > > pushq %rsp > > which is why the weird 'state->ip == sp' check is no longer needed. But that is what the kretprobe_trampoline actually does. I would rather like to see the real information from stacktrace. > The only "downside" is that the unwinder skips the > kretprobe_trampoline() frame. (note that downside wouldn't exist in > the case of UNWIND_HINT_REGS + valid regs->ip). As I said, I would like to see the kretprobe_trampoline+0x25 on my stacktrace, because kretprobe doesn't replace the kretprobe_trampoline but the target_func on the stack. > > > > > And it would make the unwinder just work automatically when unwinding > > > > > from the handler using the regs. > > > > > > > > > > It would also work when unwinding from the handler's stack, if we put an > > > > > UNWIND_HINT_REGS after saving the regs. > > > > > > > > At that moment, the real return address is not identified. So we can not > > > > put it. > > > > > > True, at the time the regs are originally saved, the real return address > > > isn't available. But by the time the user handler is called, the return > > > address *is* available. So if the real return address were placed in > > > regs->ip before calling the handler, the unwinder could find it there, > > > when called from the handler. > > > > OK, but this is not arch independent specification. I can make a hack > > only for x86, but that is not clean implementation, hmm. > > > > > > > > Then we wouldn't need the call to orc_kretprobe_correct_ip() in > > > __unwind_start(). > > > > What about the ORC implementation in other architecture? Is that for > > x86 only? > > ORC is x86 only. > > > > But maybe it's not possible due to the regs->ip expectations of legacy > > > handlers? > > > > Usually, the legacy handlers will ignore it, the official way to access > > the correct return address is kretprobe_instance.ret_addr. Because it is > > arch independent. > > > > Nowadays there are instruction_pointer() and instruction_pointer_set() APIs > > in many (not all) architecutre, so I can try to replace to use it instead > > of the kretprobe_instance.ret_addr. > > (and it will break the out-of-tree codes) > > That sounds better to me, though I don't have an understanding of what > it would break. OK, anyway try to do that and see what happen. Thank you,