mbox series

[-tip,0/5] kprobes: Fix stacktrace in kretprobes

Message ID 161495873696.346821.10161501768906432924.stgit@devnote2 (mailing list archive)
Headers show
Series kprobes: Fix stacktrace in kretprobes | expand

Message

Masami Hiramatsu (Google) March 5, 2021, 3:38 p.m. UTC
Hello,

Here is a series of patches for kprobes and stacktracer to fix the kretprobe
entries in the kernel stack. This was reported by Daniel Xu. I thought that
was in the bpftrace, but it is actually more generic issue.
So I decided to fix the issue in arch independent part.

While fixing the issue, I found a bug in ia64 related to kretprobe, which is
fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main
issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe
internal change. And [5/5] removing the stacktrace kretprobe fixup code in
ftrace. 

Daniel, can you also check that this fixes your issue too? I hope it is.

Note that this doesn't fixup all cases. Unfortunately, stacktracing the
other tasks (non current task) on the arch which doesn't support ARCH_STACKWALK,
I can not fix it in the arch independent code. Maybe each arch dependent
stacktrace implementation must fixup by themselves.

Thank you,

---

Masami Hiramatsu (5):
      ia64: kprobes: Fix to pass correct trampoline address to the handler
      kprobes: treewide: Replace arch_deref_entry_point() with dereference_function_descriptor()
      kprobes: treewide: Remove trampoline_address from kretprobe_trampoline_handler()
      kprobes: stacktrace: Recover the address changed by kretprobe
      tracing: Remove kretprobe unknown indicator from stacktrace


 arch/arc/kernel/kprobes.c          |    2 -
 arch/arm/probes/kprobes/core.c     |    3 -
 arch/arm64/kernel/probes/kprobes.c |    3 -
 arch/csky/kernel/probes/kprobes.c  |    2 -
 arch/ia64/kernel/kprobes.c         |   15 ++----
 arch/mips/kernel/kprobes.c         |    3 -
 arch/parisc/kernel/kprobes.c       |    4 +-
 arch/powerpc/kernel/kprobes.c      |   13 -----
 arch/riscv/kernel/probes/kprobes.c |    2 -
 arch/s390/kernel/kprobes.c         |    2 -
 arch/sh/kernel/kprobes.c           |    2 -
 arch/sparc/kernel/kprobes.c        |    2 -
 arch/x86/kernel/kprobes/core.c     |    2 -
 include/linux/kprobes.h            |   32 +++++++++++--
 kernel/kprobes.c                   |   89 ++++++++++++++++++++++--------------
 kernel/stacktrace.c                |   21 ++++++++
 kernel/trace/trace_output.c        |   27 ++---------
 lib/error-inject.c                 |    3 +
 18 files changed, 126 insertions(+), 101 deletions(-)

--
Masami Hiramatsu (Linaro) <mhiramat@kernel.org>

Comments

Daniel Xu March 5, 2021, 7:16 p.m. UTC | #1
Hi Masami,

On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote:
> Hello,
> 
> Here is a series of patches for kprobes and stacktracer to fix the kretprobe
> entries in the kernel stack. This was reported by Daniel Xu. I thought that
> was in the bpftrace, but it is actually more generic issue.
> So I decided to fix the issue in arch independent part.
> 
> While fixing the issue, I found a bug in ia64 related to kretprobe, which is
> fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main
> issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe
> internal change. And [5/5] removing the stacktrace kretprobe fixup code in
> ftrace. 
> 
> Daniel, can you also check that this fixes your issue too? I hope it is.

Unfortunately, this patch series does not fix the issue I reported.

I think the reason your tests work is because you're using ftrace and
the ORC unwinder is aware of ftrace trampolines (see
arch/x86/kernel/unwind_orc.c:orc_ftrace_find).

bpftrace kprobes go through perf event subsystem (ie not ftrace) so
naturally orc_ftrace_find() does not find an associated trampoline. ORC
unwinding fails in this case because
arch/x86/kernel/kprobes/core.c:trampoline_handler sets

    regs->ip = (unsigned long)&kretprobe_trampoline;

and `kretprobe_trampoline` is marked

    STACK_FRAME_NON_STANDARD(kretprobe_trampoline);

so it doesn't have a valid ORC entry. Thus, ORC immediately bails when
trying to unwind past the first frame.

The only way I can think of to fix this issue is to make the ORC
unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping
you have another idea if my patch isn't acceptable.

Thanks,
Daniel
Masami Hiramatsu (Google) March 6, 2021, 1:13 a.m. UTC | #2
On Fri, 5 Mar 2021 11:16:45 -0800
Daniel Xu <dxu@dxuuu.xyz> wrote:

> Hi Masami,
> 
> On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote:
> > Hello,
> > 
> > Here is a series of patches for kprobes and stacktracer to fix the kretprobe
> > entries in the kernel stack. This was reported by Daniel Xu. I thought that
> > was in the bpftrace, but it is actually more generic issue.
> > So I decided to fix the issue in arch independent part.
> > 
> > While fixing the issue, I found a bug in ia64 related to kretprobe, which is
> > fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main
> > issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe
> > internal change. And [5/5] removing the stacktrace kretprobe fixup code in
> > ftrace. 
> > 
> > Daniel, can you also check that this fixes your issue too? I hope it is.
> 
> Unfortunately, this patch series does not fix the issue I reported.

Ah, OK. This was my mistake I didn't choose ORC unwinder for test kernel.

> 
> I think the reason your tests work is because you're using ftrace and
> the ORC unwinder is aware of ftrace trampolines (see
> arch/x86/kernel/unwind_orc.c:orc_ftrace_find).

OK, so it has to be fixed in ORC unwinder.

> 
> bpftrace kprobes go through perf event subsystem (ie not ftrace) so
> naturally orc_ftrace_find() does not find an associated trampoline. ORC
> unwinding fails in this case because
> arch/x86/kernel/kprobes/core.c:trampoline_handler sets
> 
>     regs->ip = (unsigned long)&kretprobe_trampoline;
> 
> and `kretprobe_trampoline` is marked
> 
>     STACK_FRAME_NON_STANDARD(kretprobe_trampoline);
> 
> so it doesn't have a valid ORC entry. Thus, ORC immediately bails when
> trying to unwind past the first frame.

Hm, in ftrace case, it decode kretprobe's stackframe and stoped right
after kretprobe_trampoline callsite.

 => kretprobe_trace_func+0x21f/0x340
 => kretprobe_dispatcher+0x73/0xb0
 => __kretprobe_trampoline_handler+0xcd/0x1a0
 => trampoline_handler+0x3d/0x50
 => kretprobe_trampoline+0x25/0x50
 => 0

kretprobe replaces the real return address with kretprobe_trampoline
and kretprobe_trampoline *calls* trampoline_handler (this part depends
on architecture implementation).
Thus, if kretprobe_trampoline has no stack frame information, ORC may
fails at the first kretprobe_trampoline+0x25, that is different from
the kretprobe_trampoline+0, so the hack doesn't work.

Hmm, how the other inline-asm function makes its stackframe info?
If I get the kretprobe_trampoline+0, I can fix it up.

> The only way I can think of to fix this issue is to make the ORC
> unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping
> you have another idea if my patch isn't acceptable.

OK, anyway, your patch doesn't care the case that the multiple functions
are probed by kretprobes. In that case, you'll see several entries are
replaced by the kretprobe_trampoline. To find it correctly, you have
to pass a state-holder (@cur of the kretprobe_find_ret_addr())
to the fixup entries.

Thank you,
Daniel Xu March 7, 2021, 9:23 p.m. UTC | #3
On Sat, Mar 06, 2021 at 10:13:57AM +0900, Masami Hiramatsu wrote:
> On Fri, 5 Mar 2021 11:16:45 -0800
> Daniel Xu <dxu@dxuuu.xyz> wrote:
> 
> > Hi Masami,
> > 
> > On Sat, Mar 06, 2021 at 12:38:57AM +0900, Masami Hiramatsu wrote:
> > > Hello,
> > > 
> > > Here is a series of patches for kprobes and stacktracer to fix the kretprobe
> > > entries in the kernel stack. This was reported by Daniel Xu. I thought that
> > > was in the bpftrace, but it is actually more generic issue.
> > > So I decided to fix the issue in arch independent part.
> > > 
> > > While fixing the issue, I found a bug in ia64 related to kretprobe, which is
> > > fixed by [1/5]. [2/5] and [3/5] is a kind of cleanup before fixing the main
> > > issue. [4/5] is the patch to fix the stacktrace, which involves kretprobe
> > > internal change. And [5/5] removing the stacktrace kretprobe fixup code in
> > > ftrace. 
> > > 
> > > Daniel, can you also check that this fixes your issue too? I hope it is.
> > 
> > Unfortunately, this patch series does not fix the issue I reported.
> 
> Ah, OK. This was my mistake I didn't choose ORC unwinder for test kernel.
> 
> > 
> > I think the reason your tests work is because you're using ftrace and
> > the ORC unwinder is aware of ftrace trampolines (see
> > arch/x86/kernel/unwind_orc.c:orc_ftrace_find).
> 
> OK, so it has to be fixed in ORC unwinder.
> 
> > 
> > bpftrace kprobes go through perf event subsystem (ie not ftrace) so
> > naturally orc_ftrace_find() does not find an associated trampoline. ORC
> > unwinding fails in this case because
> > arch/x86/kernel/kprobes/core.c:trampoline_handler sets
> > 
> >     regs->ip = (unsigned long)&kretprobe_trampoline;
> > 
> > and `kretprobe_trampoline` is marked
> > 
> >     STACK_FRAME_NON_STANDARD(kretprobe_trampoline);
> > 
> > so it doesn't have a valid ORC entry. Thus, ORC immediately bails when
> > trying to unwind past the first frame.
> 
> Hm, in ftrace case, it decode kretprobe's stackframe and stoped right
> after kretprobe_trampoline callsite.
> 
>  => kretprobe_trace_func+0x21f/0x340
>  => kretprobe_dispatcher+0x73/0xb0
>  => __kretprobe_trampoline_handler+0xcd/0x1a0
>  => trampoline_handler+0x3d/0x50
>  => kretprobe_trampoline+0x25/0x50
>  => 0
> 
> kretprobe replaces the real return address with kretprobe_trampoline
> and kretprobe_trampoline *calls* trampoline_handler (this part depends
> on architecture implementation).
> Thus, if kretprobe_trampoline has no stack frame information, ORC may
> fails at the first kretprobe_trampoline+0x25, that is different from
> the kretprobe_trampoline+0, so the hack doesn't work.

I'm not sure I follow 100% what you're saying, but assuming you're
asking why bpftrace fails at `kretprobe_trampoline+0` instead of
`kretprobe_trampoline+0x25`:

`regs` is set to &kretprobe_trampoline:

    regs->ip = (unsigned long)&kretprobe_trampoline;

Then the kretprobe event is dispatched like this:

    kretprobe_trampoline_handler
      __kretprobe_trampoline_handler
        rp->handler // ie kernel/trace/trace_kprobe.c:kretprobe_dispatcher
          kretprobe_perf_func
            trace_bpf_call(.., regs)
              BPF_PROG_RUN_ARRAY_CHECK
                bpf_get_stackid(regs, .., ..) // in bpftrace prog 

And then `bpf_get_stackid` unwinds the stack via:

    bpf_get_stackid
      get_perf_callchain(regs, ...)
        perf_callchain_kernel(.., regs)
          perf_callchain_store(.., regs->ip) // !!! first unwound entry
          unwind_start
          unwind_next_frame

In summary: unwinding via BPF begins at regs->ip, which
`trampoline_handler` sets to `&kretprobe_trampoline+0x0`.

> Hmm, how the other inline-asm function makes its stackframe info?
> If I get the kretprobe_trampoline+0, I can fix it up.

So I think I misunderstood the mechanics before. I think `call
trampoline_handler` does set up a new frame. My current guess is that
ftrace doesn't thread through `regs` like the bpf code path does. I'm
not very familiar with ftrace internals so I haven't looked.

> 
> > The only way I can think of to fix this issue is to make the ORC
> > unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping
> > you have another idea if my patch isn't acceptable.
> 
> OK, anyway, your patch doesn't care the case that the multiple functions
> are probed by kretprobes. In that case, you'll see several entries are
> replaced by the kretprobe_trampoline. To find it correctly, you have
> to pass a state-holder (@cur of the kretprobe_find_ret_addr())
> to the fixup entries.

I'll see if I can figure something out tomorrow.

Thanks,
Daniel
Masami Hiramatsu (Google) March 8, 2021, 2:52 a.m. UTC | #4
On Sun, 7 Mar 2021 13:23:33 -0800
Daniel Xu <dxu@dxuuu.xyz> wrote:

> > kretprobe replaces the real return address with kretprobe_trampoline
> > and kretprobe_trampoline *calls* trampoline_handler (this part depends
> > on architecture implementation).
> > Thus, if kretprobe_trampoline has no stack frame information, ORC may
> > fails at the first kretprobe_trampoline+0x25, that is different from
> > the kretprobe_trampoline+0, so the hack doesn't work.
> 
> I'm not sure I follow 100% what you're saying, but assuming you're
> asking why bpftrace fails at `kretprobe_trampoline+0` instead of
> `kretprobe_trampoline+0x25`:
> 
> `regs` is set to &kretprobe_trampoline:
> 
>     regs->ip = (unsigned long)&kretprobe_trampoline;

Ah, OK. bpftrace does the adjustment.

> Then the kretprobe event is dispatched like this:
> 
>     kretprobe_trampoline_handler
>       __kretprobe_trampoline_handler
>         rp->handler // ie kernel/trace/trace_kprobe.c:kretprobe_dispatcher
>           kretprobe_perf_func
>             trace_bpf_call(.., regs)
>               BPF_PROG_RUN_ARRAY_CHECK
>                 bpf_get_stackid(regs, .., ..) // in bpftrace prog 
> 
> And then `bpf_get_stackid` unwinds the stack via:
> 
>     bpf_get_stackid
>       get_perf_callchain(regs, ...)
>         perf_callchain_kernel(.., regs)
>           perf_callchain_store(.., regs->ip) // !!! first unwound entry
>           unwind_start
>           unwind_next_frame
> 
> In summary: unwinding via BPF begins at regs->ip, which
> `trampoline_handler` sets to `&kretprobe_trampoline+0x0`.

OK, maybe you are using stack_trace_save_regs() with pt_regs instead of
stack_trace_save(). this means it started from regs at saved by
kretprobe always.

In the ftrace, we are using stack_trace_save() which is based on
the current stack, this means stack unwinder tracks back the stack of
kretprobe itself at first. So it saw the kretprobe_trampoline+0x25 
(return address of the trampoline_handler) and stop unwinding because
the call frame information (ORC information) and the return address
are not there.

This issue is not only the ftrace, but also backtrace in interrupt
handler and kretprobe handler.

> > Hmm, how the other inline-asm function makes its stackframe info?
> > If I get the kretprobe_trampoline+0, I can fix it up.
> 
> So I think I misunderstood the mechanics before. I think `call
> trampoline_handler` does set up a new frame. My current guess is that
> ftrace doesn't thread through `regs` like the bpf code path does. I'm
> not very familiar with ftrace internals so I haven't looked.

Yes, that's right. Since I made a kretprobe event on top of the ftrace
event framework, it doesn't pass the regs to the event trigger.


> > > The only way I can think of to fix this issue is to make the ORC
> > > unwinder aware of kretprobe (ie the patch I sent earlier). I'm hoping
> > > you have another idea if my patch isn't acceptable.
> > 
> > OK, anyway, your patch doesn't care the case that the multiple functions
> > are probed by kretprobes. In that case, you'll see several entries are
> > replaced by the kretprobe_trampoline. To find it correctly, you have
> > to pass a state-holder (@cur of the kretprobe_find_ret_addr())
> > to the fixup entries.
> 
> I'll see if I can figure something out tomorrow.

To help your understanding, let me explain.

If we have a code here

caller_func:
0x00 add sp, 0x20	/* 0x20 bytes stack frame allocated */
...
0x10 call target_func
0x15 ... /* return address */

On the stack in the entry of target_func, we have

[stack]
0x0e0 caller_func+0x15
... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
0x100 /* caller_func return address */

And when we put a kretprobe on the target_func, the stack will be

[stack]
0x0e0 kretprobe_trampoline
... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
0x100 /* caller_func return address */

* "caller_func+0x15" is saved in current->kretprobe_instances.first.

When returning from the target_func, call consumed the 0x0e0 and
jump to kretprobe_trampoline. Let's see the assembler code.

        ".text\n"
        ".global kretprobe_trampoline\n"
        ".type kretprobe_trampoline, @function\n"
        "kretprobe_trampoline:\n"
        /* We don't bother saving the ss register */
        "       pushq %rsp\n"
        "       pushfq\n"
        SAVE_REGS_STRING
        "       movq %rsp, %rdi\n"
        "       call trampoline_handler\n"
        /* Replace saved sp with true return address. */
        "       movq %rax, 19*8(%rsp)\n"
        RESTORE_REGS_STRING
        "       popfq\n"
        "       ret\n"

When the entry of trampoline_handler, stack is like this;

[stack]
0x040 kretprobe_trampoline+0x25
0x048 r15
...     /* pt_regs */
0x0d8 flags
0x0e0 rsp (=0x0e0)
... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
0x100 /* caller_func return address */

And after returned from trampoline_handler, "movq" changes the
stack like this.

[stack]
0x040 kretprobe_trampoline+0x25
0x048 r15
...     /* pt_regs */
0x0d8 flags
0x0e0 caller_func+0x15
... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
0x100 /* caller_func return address */


So at the kretprobe handler, we have 2 issues.
1) the return address (caller_func+0x15) is not on the stack.
   this can be solved by searching from current->kretprobe_instances.
2) the stack frame size of kretprobe_trampoline is unknown
   Since the stackframe is fixed, the fixed number (0x98) can be used.

However, those solutions are only for the kretprobe handler. The stacktrace
from interrupt handler hit in the kretprobe_trampoline still doesn't work.

So, here is my idea;

1) Change the trampline code to prepare stack frame at first and save
   registers on it, instead of "push". This will makes ORC easy to setup
   stackframe information for this code.
2) change the return addres fixup timing. Instead of using return value
   of trampoline handler, before removing the real return address from
   current->kretprobe_instances.
3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it
   checks the contents of the end of stackframe (at the place of regs->sp)
   is same as the address of it. If it is, it can find the correct address
   from current->kretprobe_instances. If not, there is a correct address.

I need to find how the ORC info is prepared for 1), maybe UNWIND_HINT macro
may help?

Thank you,
Masami Hiramatsu (Google) March 8, 2021, 1:05 p.m. UTC | #5
On Mon, 8 Mar 2021 11:52:10 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> So, here is my idea;
> 
> 1) Change the trampline code to prepare stack frame at first and save
>    registers on it, instead of "push". This will makes ORC easy to setup
>    stackframe information for this code.
> 2) change the return addres fixup timing. Instead of using return value
>    of trampoline handler, before removing the real return address from
>    current->kretprobe_instances.
> 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it
>    checks the contents of the end of stackframe (at the place of regs->sp)
>    is same as the address of it. If it is, it can find the correct address
>    from current->kretprobe_instances. If not, there is a correct address.

Another trickly idea is put a call on top of kretprobe_trampoline like this.

        "__kretprobe_trampoline:\n"
        "       call kretprobe_trampoline\n"
        "kretprobe_trampoline:\n"
        "       pushq %rsp\n"
        "       pushfq\n"
        SAVE_REGS_STRING
        "       movq %rsp, %rdi\n"
        "       call trampoline_handler\n"
        /* Replace __kretprobe_trampoline with true return address. */
        "       movq %rax, 20*8(%rsp)\n"
        RESTORE_REGS_STRING
        "       popfq\n"
        "       popq %rsp\n"
        "       ret\n"

This will leave a marker (kretprobe_trampoline or __kretprobe_trampoline+5) on
the top of stack, and the stack frame seems like a normal function. If objtool
can make an ORC info by disassembling kretprobe_trampoline, I guess it is
easy to make a stack frame information.

But anyway, from the inside of target function, it still see "__kretprobe_trampoline"
on the stack instead of caller_func, so orc_kretprobe_find() is still needed.

I'm not familier with the UNWIND_HINT macro, so if it is easy to handle the original
case, I think my first idea will be better.

Thank you,
Josh Poimboeuf March 9, 2021, 1:19 a.m. UTC | #6
On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote:
> So at the kretprobe handler, we have 2 issues.
> 1) the return address (caller_func+0x15) is not on the stack.
>    this can be solved by searching from current->kretprobe_instances.
> 2) the stack frame size of kretprobe_trampoline is unknown
>    Since the stackframe is fixed, the fixed number (0x98) can be used.
> 
> However, those solutions are only for the kretprobe handler. The stacktrace
> from interrupt handler hit in the kretprobe_trampoline still doesn't work.
> 
> So, here is my idea;
> 
> 1) Change the trampline code to prepare stack frame at first and save
>    registers on it, instead of "push". This will makes ORC easy to setup
>    stackframe information for this code.
> 2) change the return addres fixup timing. Instead of using return value
>    of trampoline handler, before removing the real return address from
>    current->kretprobe_instances.
> 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it
>    checks the contents of the end of stackframe (at the place of regs->sp)
>    is same as the address of it. If it is, it can find the correct address
>    from current->kretprobe_instances. If not, there is a correct address.
> 
> I need to find how the ORC info is prepared for 1), maybe UNWIND_HINT macro
> may help?

Hi Masami,

If I understand correctly, for #1 you need an unwind hint which treats
the instruction *after* the "pushq %rsp" as the beginning of the
function.

I'm thinking this should work:


diff --git a/arch/x86/include/asm/unwind_hints.h b/arch/x86/include/asm/unwind_hints.h
index 8e574c0afef8..8b33674288ea 100644
--- a/arch/x86/include/asm/unwind_hints.h
+++ b/arch/x86/include/asm/unwind_hints.h
@@ -52,6 +52,11 @@
 	UNWIND_HINT sp_reg=ORC_REG_SP sp_offset=8 type=UNWIND_HINT_TYPE_FUNC
 .endm
 
+#else
+
+#define UNWIND_HINT_FUNC \
+	UNWIND_HINT(ORC_REG_SP, 8, UNWIND_HINT_TYPE_FUNC, 0)
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_UNWIND_HINTS_H */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index df776cdca327..38ff1387f95d 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -767,6 +767,7 @@ asm(
 	/* We don't bother saving the ss register */
 #ifdef CONFIG_X86_64
 	"	pushq %rsp\n"
+	UNWIND_HINT_FUNC
 	"	pushfq\n"
 	SAVE_REGS_STRING
 	"	movq %rsp, %rdi\n"
@@ -790,7 +791,6 @@ asm(
 	".size kretprobe_trampoline, .-kretprobe_trampoline\n"
 );
 NOKPROBE_SYMBOL(kretprobe_trampoline);
-STACK_FRAME_NON_STANDARD(kretprobe_trampoline);
 
 
 /*
Daniel Xu March 9, 2021, 9:34 p.m. UTC | #7
Hi Masami,

Just want to clarify a few points:

On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote:
> On Sun, 7 Mar 2021 13:23:33 -0800
> Daniel Xu <dxu@dxuuu.xyz> wrote:
> To help your understanding, let me explain.
> 
> If we have a code here
> 
> caller_func:
> 0x00 add sp, 0x20	/* 0x20 bytes stack frame allocated */
> ...
> 0x10 call target_func
> 0x15 ... /* return address */
> 
> On the stack in the entry of target_func, we have
> 
> [stack]
> 0x0e0 caller_func+0x15
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */
> 
> And when we put a kretprobe on the target_func, the stack will be
> 
> [stack]
> 0x0e0 kretprobe_trampoline
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */
> 
> * "caller_func+0x15" is saved in current->kretprobe_instances.first.
> 
> When returning from the target_func, call consumed the 0x0e0 and
> jump to kretprobe_trampoline. Let's see the assembler code.
> 
>         ".text\n"
>         ".global kretprobe_trampoline\n"
>         ".type kretprobe_trampoline, @function\n"
>         "kretprobe_trampoline:\n"
>         /* We don't bother saving the ss register */
>         "       pushq %rsp\n"
>         "       pushfq\n"
>         SAVE_REGS_STRING
>         "       movq %rsp, %rdi\n"
>         "       call trampoline_handler\n"
>         /* Replace saved sp with true return address. */
>         "       movq %rax, 19*8(%rsp)\n"
>         RESTORE_REGS_STRING
>         "       popfq\n"
>         "       ret\n"
> 
> When the entry of trampoline_handler, stack is like this;
> 
> [stack]
> 0x040 kretprobe_trampoline+0x25
> 0x048 r15
> ...     /* pt_regs */
> 0x0d8 flags
> 0x0e0 rsp (=0x0e0)
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */
> 
> And after returned from trampoline_handler, "movq" changes the
> stack like this.
> 
> [stack]
> 0x040 kretprobe_trampoline+0x25
> 0x048 r15
> ...     /* pt_regs */
> 0x0d8 flags
> 0x0e0 caller_func+0x15
> ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> 0x100 /* caller_func return address */

Thanks for the detailed explanation. I think I understand kretprobe
mechanics from a somewhat high level (kprobe saves real return address
on entry, overwrites return address to trampoline, then trampoline
runs handler and finally resets return address to real return address).

I don't usually write much assembly so the details escape me somewhat.

> So at the kretprobe handler, we have 2 issues.
> 1) the return address (caller_func+0x15) is not on the stack.
>    this can be solved by searching from current->kretprobe_instances.

Yes, agreed.

> 2) the stack frame size of kretprobe_trampoline is unknown
>    Since the stackframe is fixed, the fixed number (0x98) can be used.

I'm confused why this is relevant. Is it so ORC knows where to find
saved return address in the frame?

> However, those solutions are only for the kretprobe handler. The stacktrace
> from interrupt handler hit in the kretprobe_trampoline still doesn't work.
> 
> So, here is my idea;
> 
> 1) Change the trampline code to prepare stack frame at first and save
>    registers on it, instead of "push". This will makes ORC easy to setup
>    stackframe information for this code.

I'm confused on the details here. But this is what Josh solves in his
patch, right?

> 2) change the return addres fixup timing. Instead of using return value
>    of trampoline handler, before removing the real return address from
>    current->kretprobe_instances.

Is the idea to have `kretprobe_trampoline` place the real return address
on the stack (while telling ORC where to find it) _before_ running `call
trampoline_handler` ? So that an unwind from inside the user defined
kretprobe handler simply unwinds correctly?

And to be extra clear, this would only work for stack_trace_save() and
not stack_trace_save_regs()?

> 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it
>    checks the contents of the end of stackframe (at the place of regs->sp)
>    is same as the address of it. If it is, it can find the correct address
>    from current->kretprobe_instances. If not, there is a correct address.

What do you mean by "it" w.r.t. "is the same address of it"? I'm
confused on this point.

Thanks,
Daniel
Masami Hiramatsu (Google) March 10, 2021, 9:57 a.m. UTC | #8
Hi Josh,

On Mon, 8 Mar 2021 19:19:45 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote:
> > So at the kretprobe handler, we have 2 issues.
> > 1) the return address (caller_func+0x15) is not on the stack.
> >    this can be solved by searching from current->kretprobe_instances.
> > 2) the stack frame size of kretprobe_trampoline is unknown
> >    Since the stackframe is fixed, the fixed number (0x98) can be used.
> > 
> > However, those solutions are only for the kretprobe handler. The stacktrace
> > from interrupt handler hit in the kretprobe_trampoline still doesn't work.
> > 
> > So, here is my idea;
> > 
> > 1) Change the trampline code to prepare stack frame at first and save
> >    registers on it, instead of "push". This will makes ORC easy to setup
> >    stackframe information for this code.
> > 2) change the return addres fixup timing. Instead of using return value
> >    of trampoline handler, before removing the real return address from
> >    current->kretprobe_instances.
> > 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it
> >    checks the contents of the end of stackframe (at the place of regs->sp)
> >    is same as the address of it. If it is, it can find the correct address
> >    from current->kretprobe_instances. If not, there is a correct address.
> > 
> > I need to find how the ORC info is prepared for 1), maybe UNWIND_HINT macro
> > may help?
> 
> Hi Masami,
> 
> If I understand correctly, for #1 you need an unwind hint which treats
> the instruction *after* the "pushq %rsp" as the beginning of the
> function.

Thanks for the patch. In that case, should I still change the stack allocation?
Or can I continue to use a series of "push/pop" ?

> 
> I'm thinking this should work:

OK, Let me test it.

Thanks!

> 
> 
> diff --git a/arch/x86/include/asm/unwind_hints.h b/arch/x86/include/asm/unwind_hints.h
> index 8e574c0afef8..8b33674288ea 100644
> --- a/arch/x86/include/asm/unwind_hints.h
> +++ b/arch/x86/include/asm/unwind_hints.h
> @@ -52,6 +52,11 @@
>  	UNWIND_HINT sp_reg=ORC_REG_SP sp_offset=8 type=UNWIND_HINT_TYPE_FUNC
>  .endm
>  
> +#else
> +
> +#define UNWIND_HINT_FUNC \
> +	UNWIND_HINT(ORC_REG_SP, 8, UNWIND_HINT_TYPE_FUNC, 0)
> +
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_UNWIND_HINTS_H */
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index df776cdca327..38ff1387f95d 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -767,6 +767,7 @@ asm(
>  	/* We don't bother saving the ss register */
>  #ifdef CONFIG_X86_64
>  	"	pushq %rsp\n"
> +	UNWIND_HINT_FUNC
>  	"	pushfq\n"
>  	SAVE_REGS_STRING
>  	"	movq %rsp, %rdi\n"
> @@ -790,7 +791,6 @@ asm(
>  	".size kretprobe_trampoline, .-kretprobe_trampoline\n"
>  );
>  NOKPROBE_SYMBOL(kretprobe_trampoline);
> -STACK_FRAME_NON_STANDARD(kretprobe_trampoline);
>  
>  
>  /*
>
Masami Hiramatsu (Google) March 10, 2021, 10:50 a.m. UTC | #9
On Tue, 9 Mar 2021 13:34:42 -0800
Daniel Xu <dxu@dxuuu.xyz> wrote:

> Hi Masami,
> 
> Just want to clarify a few points:
> 
> On Mon, Mar 08, 2021 at 11:52:10AM +0900, Masami Hiramatsu wrote:
> > On Sun, 7 Mar 2021 13:23:33 -0800
> > Daniel Xu <dxu@dxuuu.xyz> wrote:
> > To help your understanding, let me explain.
> > 
> > If we have a code here
> > 
> > caller_func:
> > 0x00 add sp, 0x20	/* 0x20 bytes stack frame allocated */
> > ...
> > 0x10 call target_func
> > 0x15 ... /* return address */
> > 
> > On the stack in the entry of target_func, we have
> > 
> > [stack]
> > 0x0e0 caller_func+0x15
> > ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> > 0x100 /* caller_func return address */
> > 
> > And when we put a kretprobe on the target_func, the stack will be
> > 
> > [stack]
> > 0x0e0 kretprobe_trampoline
> > ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> > 0x100 /* caller_func return address */
> > 
> > * "caller_func+0x15" is saved in current->kretprobe_instances.first.
> > 
> > When returning from the target_func, call consumed the 0x0e0 and
> > jump to kretprobe_trampoline. Let's see the assembler code.
> > 
> >         ".text\n"
> >         ".global kretprobe_trampoline\n"
> >         ".type kretprobe_trampoline, @function\n"
> >         "kretprobe_trampoline:\n"
> >         /* We don't bother saving the ss register */
> >         "       pushq %rsp\n"
> >         "       pushfq\n"
> >         SAVE_REGS_STRING
> >         "       movq %rsp, %rdi\n"
> >         "       call trampoline_handler\n"
> >         /* Replace saved sp with true return address. */
> >         "       movq %rax, 19*8(%rsp)\n"
> >         RESTORE_REGS_STRING
> >         "       popfq\n"
> >         "       ret\n"
> > 
> > When the entry of trampoline_handler, stack is like this;
> > 
> > [stack]
> > 0x040 kretprobe_trampoline+0x25
> > 0x048 r15
> > ...     /* pt_regs */
> > 0x0d8 flags
> > 0x0e0 rsp (=0x0e0)
> > ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> > 0x100 /* caller_func return address */
> > 
> > And after returned from trampoline_handler, "movq" changes the
> > stack like this.
> > 
> > [stack]
> > 0x040 kretprobe_trampoline+0x25
> > 0x048 r15
> > ...     /* pt_regs */
> > 0x0d8 flags
> > 0x0e0 caller_func+0x15
> > ... /* 0x20 bytes = 4 entries  are stack frame of caller_func */
> > 0x100 /* caller_func return address */
> 
> Thanks for the detailed explanation. I think I understand kretprobe
> mechanics from a somewhat high level (kprobe saves real return address
> on entry, overwrites return address to trampoline, then trampoline
> runs handler and finally resets return address to real return address).
> 
> I don't usually write much assembly so the details escape me somewhat.
> 
> > So at the kretprobe handler, we have 2 issues.
> > 1) the return address (caller_func+0x15) is not on the stack.
> >    this can be solved by searching from current->kretprobe_instances.
> 
> Yes, agreed.
> 
> > 2) the stack frame size of kretprobe_trampoline is unknown
> >    Since the stackframe is fixed, the fixed number (0x98) can be used.
> 
> I'm confused why this is relevant. Is it so ORC knows where to find
> saved return address in the frame?

No, because the kretprobe_trampoline is somewhat special. Usually, at the
function entry, there is a return address on the top of stack, but
kretprobe_trampoline doesn't have it.
So we have to put a hint at the function entry to mark there should be
a next return address. (and ORC unwinder must find it)

> > However, those solutions are only for the kretprobe handler. The stacktrace
> > from interrupt handler hit in the kretprobe_trampoline still doesn't work.
> > 
> > So, here is my idea;
> > 
> > 1) Change the trampline code to prepare stack frame at first and save
> >    registers on it, instead of "push". This will makes ORC easy to setup
> >    stackframe information for this code.
> 
> I'm confused on the details here. But this is what Josh solves in his
> patch, right?

I'm not so sure how objtool makes the ORC information. If it can trace the
push/pop correctly, yes, it is solved.

> > 2) change the return addres fixup timing. Instead of using return value
> >    of trampoline handler, before removing the real return address from
> >    current->kretprobe_instances.
> 
> Is the idea to have `kretprobe_trampoline` place the real return address
> on the stack (while telling ORC where to find it) _before_ running `call
> trampoline_handler` ? So that an unwind from inside the user defined
> kretprobe handler simply unwinds correctly?

No, unless calling the trampoline_handler, we can not get the real return
address. Thus, the __kretprobe_trampoline_handler() will call return address
fixup function right before unlink the current->kretprobe_instances.

> And to be extra clear, this would only work for stack_trace_save() and
> not stack_trace_save_regs()?

Yes, for the stack_trace_save_regs() and the stack-tracing inside the
kretprobe'd target function, we still need a hack as same as orc_ftrace_find().

> 
> > 3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, it
> >    checks the contents of the end of stackframe (at the place of regs->sp)
> >    is same as the address of it. If it is, it can find the correct address
> >    from current->kretprobe_instances. If not, there is a correct address.
> 
> What do you mean by "it" w.r.t. "is the same address of it"? I'm
> confused on this point.

Oh I meant,

3) Then, if orc_find() finds the ip is in the kretprobe_trampoline, orc_find()
    checks the contents of the end of stackframe (at the place of regs->sp)
   is same as the address of the stackframe (Note that kretprobe_trampoline
   does "push %sp" at first). If so, orc_find() can find the correct address
   from current->kretprobe_instances. If not, there is a correct address.

I need to see the orc unwinder carefully, orc_find() only gets the ip but
to find stackframe, I think this should be fixed in the caller of orc_find().

Thank you,
Josh Poimboeuf March 10, 2021, 3:08 p.m. UTC | #10
On Wed, Mar 10, 2021 at 06:57:34PM +0900, Masami Hiramatsu wrote:
> > If I understand correctly, for #1 you need an unwind hint which treats
> > the instruction *after* the "pushq %rsp" as the beginning of the
> > function.
> 
> Thanks for the patch. In that case, should I still change the stack allocation?
> Or can I continue to use a series of "push/pop" ?

You can continue to use push/pop.  Objtool is only getting confused by
the unbalanced stack of the function (more pushes than pops).  The
unwind hint should fix that.
Masami Hiramatsu (Google) March 10, 2021, 3:55 p.m. UTC | #11
Hi Josh and Daniel,

On Wed, 10 Mar 2021 09:08:45 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Wed, Mar 10, 2021 at 06:57:34PM +0900, Masami Hiramatsu wrote:
> > > If I understand correctly, for #1 you need an unwind hint which treats
> > > the instruction *after* the "pushq %rsp" as the beginning of the
> > > function.
> > 
> > Thanks for the patch. In that case, should I still change the stack allocation?
> > Or can I continue to use a series of "push/pop" ?
> 
> You can continue to use push/pop.  Objtool is only getting confused by
> the unbalanced stack of the function (more pushes than pops).  The
> unwind hint should fix that.

With you patch, I made a fix for ORC unwinder. I confirmed it works with
2 multiple kretprobes on the call path like this ;

cd /sys/kernel/debug/tracing/
echo r vfs_read >> kprobe_events
echo r full_proxy_read >> kprobe_events
echo traceoff:1 > events/kprobes/r_vfs_read_0/trigger
echo stacktrace:1 > events/kprobes/r_full_proxy_read_0/trigger
echo 1 > events/kprobes/enable
echo 1 > options/sym-offset
cat /sys/kernel/debug/kprobes/list
echo 0 > events/kprobes/enable
cat trace

# tracer: nop
#
# entries-in-buffer/entries-written: 3/3   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
           <...>-136     [004] ...1   648.481281: r_full_proxy_read_0: (vfs_read+0xab/0x1a0 <- full_proxy_read)
           <...>-136     [004] ...1   648.481310: <stack trace>
 => kretprobe_trace_func+0x209/0x2f0
 => kretprobe_dispatcher+0x4a/0x70
 => __kretprobe_trampoline_handler+0xcd/0x170
 => trampoline_handler+0x3d/0x50
 => kretprobe_trampoline+0x25/0x50
 => vfs_read+0xab/0x1a0
 => ksys_read+0x5f/0xe0
 => do_syscall_64+0x33/0x40
 => entry_SYSCALL_64_after_hwframe+0x44/0xae
 => 0
 => 0
 => 0
 => 0
 => 0
 => 0
 => 0

I didn't tested it with bpftrace, but this also handles 
regs->ip == kretprobe_trampoline case. So it should work.

commit aa452d999b524b1851f69cc947be3e1a2f3ca1ec
Author: Masami Hiramatsu <mhiramat@kernel.org>
Date:   Sat Mar 6 08:34:51 2021 +0900

    x86/unwind/orc: Fixup kretprobe trampoline entry
    
    Since the kretprobe replaces the function return address with
    the kretprobe_trampoline on the stack, the ORC unwinder can not
    continue the stack unwinding at that point.
    
    To fix this issue, correct state->ip as like as function-graph
    tracer in the unwind_next_frame().
    
    Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>

diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index 70fc159ebe69..ab5e45b848d5 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched.h>
 #include <linux/ftrace.h>
+#include <linux/llist.h>
 #include <asm/ptrace.h>
 #include <asm/stacktrace.h>
 
@@ -20,6 +21,9 @@ struct unwind_state {
 	bool signal, full_regs;
 	unsigned long sp, bp, ip;
 	struct pt_regs *regs, *prev_regs;
+#if defined(CONFIG_KRETPROBES)
+	struct llist_node *kr_iter;
+#endif
 #elif defined(CONFIG_UNWINDER_FRAME_POINTER)
 	bool got_irq;
 	unsigned long *bp, *orig_sp, ip;
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 2a1d47f47eee..94869516cfc0 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -2,6 +2,7 @@
 #include <linux/objtool.h>
 #include <linux/module.h>
 #include <linux/sort.h>
+#include <linux/kprobes.h>
 #include <asm/ptrace.h>
 #include <asm/stacktrace.h>
 #include <asm/unwind.h>
@@ -414,6 +415,30 @@ static bool get_reg(struct unwind_state *state, unsigned int reg_off,
 	return false;
 }
 
+#ifdef CONFIG_KRETPROBES
+static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state)
+{
+	return kretprobe_find_ret_addr(
+			(unsigned long)kretprobe_trampoline_addr(),
+			state->task, &state->kr_iter);
+}
+
+static bool is_kretprobe_trampoline_address(unsigned long ip)
+{
+	return ip == (unsigned long)kretprobe_trampoline_addr();
+}
+#else
+static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state)
+{
+	return state->ip;
+}
+
+static bool is_kretprobe_trampoline_address(unsigned long ip)
+{
+	return false;
+}
+#endif
+
 bool unwind_next_frame(struct unwind_state *state)
 {
 	unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
@@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state)
 
 		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
 						  state->ip, (void *)ip_p);
+		/*
+		 * There are special cases when the stack unwinder is called
+		 * from the kretprobe handler or the interrupt handler which
+		 * occurs in the kretprobe trampoline code. In those cases,
+		 * %sp is shown on the stack instead of the return address.
+		 * Or, when the unwinder find the return address is replaced
+		 * by kretprobe_trampoline.
+		 * In those cases, correct address can be found in kretprobe.
+		 */
+		if (state->ip == sp ||
+		    is_kretprobe_trampoline_address(state->ip))
+			state->ip = orc_kretprobe_correct_ip(state);
 
 		state->sp = sp;
 		state->regs = NULL;
@@ -649,6 +686,12 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
 		state->full_regs = true;
 		state->signal = true;
 
+		/*
+		 * When the unwinder called with regs from kretprobe handler,
+		 * the regs->ip starts from kretprobe_trampoline address.
+		 */
+		if (is_kretprobe_trampoline_address(state->ip))
+			state->ip = orc_kretprobe_correct_ip(state);
 	} else if (task == current) {
 		asm volatile("lea (%%rip), %0\n\t"
 			     "mov %%rsp, %1\n\t"
Josh Poimboeuf March 10, 2021, 6:31 p.m. UTC | #12
On Thu, Mar 11, 2021 at 12:55:09AM +0900, Masami Hiramatsu wrote:
> +#ifdef CONFIG_KRETPROBES
> +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state)
> +{
> +	return kretprobe_find_ret_addr(
> +			(unsigned long)kretprobe_trampoline_addr(),
> +			state->task, &state->kr_iter);
> +}
> +
> +static bool is_kretprobe_trampoline_address(unsigned long ip)
> +{
> +	return ip == (unsigned long)kretprobe_trampoline_addr();
> +}
> +#else
> +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state)
> +{
> +	return state->ip;
> +}
> +
> +static bool is_kretprobe_trampoline_address(unsigned long ip)
> +{
> +	return false;
> +}
> +#endif
> +

Can this code go in a kprobes file?  I'd rather not clutter ORC with it,
and maybe it would be useful for other arches or unwinders.

>  bool unwind_next_frame(struct unwind_state *state)
>  {
>  	unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
> @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state)
>  
>  		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
>  						  state->ip, (void *)ip_p);
> +		/*
> +		 * There are special cases when the stack unwinder is called
> +		 * from the kretprobe handler or the interrupt handler which
> +		 * occurs in the kretprobe trampoline code. In those cases,
> +		 * %sp is shown on the stack instead of the return address.
> +		 * Or, when the unwinder find the return address is replaced
> +		 * by kretprobe_trampoline.
> +		 * In those cases, correct address can be found in kretprobe.
> +		 */
> +		if (state->ip == sp ||

Why is the 'state->ip == sp' needed?

> +		    is_kretprobe_trampoline_address(state->ip))
> +			state->ip = orc_kretprobe_correct_ip(state);

This is similar in concept to ftrace_graph_ret_addr(), right?  Would it
be possible to have a similar API?  Like

		state->ip = kretprobe_ret_addr(state->task, &state->kr_iter, state->ip);

and without the conditional.

>  
>  		state->sp = sp;
>  		state->regs = NULL;
> @@ -649,6 +686,12 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
>  		state->full_regs = true;
>  		state->signal = true;
>  
> +		/*
> +		 * When the unwinder called with regs from kretprobe handler,
> +		 * the regs->ip starts from kretprobe_trampoline address.
> +		 */
> +		if (is_kretprobe_trampoline_address(state->ip))
> +			state->ip = orc_kretprobe_correct_ip(state);

Shouldn't __kretprobe_trampoline_handler() just set regs->ip to
'correct_ret_addr' before passing the regs to the handler?  I'd think
that would be a less surprising value for regs->ip than
'&kretprobe_trampoline'.

And it would make the unwinder just work automatically when unwinding
from the handler using the regs.

It would also work when unwinding from the handler's stack, if we put an
UNWIND_HINT_REGS after saving the regs.

The only (rare) case it wouldn't work would be unwinding from an
interrupt before regs->ip gets set properly.  In which case we'd still
need the above call to orc_kretprobe_correct_ip() or so.
Daniel Xu March 10, 2021, 10:46 p.m. UTC | #13
On Thu, Mar 11, 2021 at 12:55:09AM +0900, Masami Hiramatsu wrote:
> Hi Josh and Daniel,
<...>

> commit aa452d999b524b1851f69cc947be3e1a2f3ca1ec
> Author: Masami Hiramatsu <mhiramat@kernel.org>
> Date:   Sat Mar 6 08:34:51 2021 +0900
> 
>     x86/unwind/orc: Fixup kretprobe trampoline entry
>     
>     Since the kretprobe replaces the function return address with
>     the kretprobe_trampoline on the stack, the ORC unwinder can not
>     continue the stack unwinding at that point.
>     
>     To fix this issue, correct state->ip as like as function-graph
>     tracer in the unwind_next_frame().
>     
>     Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> 

I applied your original patchset + Josh's patch + this patch and can
confirm it works for bpftrace + kretprobes.

Tested-by: Daniel Xu <dxu@dxuuu.xyz>
Masami Hiramatsu (Google) March 11, 2021, 12:20 a.m. UTC | #14
On Wed, 10 Mar 2021 12:31:13 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Thu, Mar 11, 2021 at 12:55:09AM +0900, Masami Hiramatsu wrote:
> > +#ifdef CONFIG_KRETPROBES
> > +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state)
> > +{
> > +	return kretprobe_find_ret_addr(
> > +			(unsigned long)kretprobe_trampoline_addr(),
> > +			state->task, &state->kr_iter);
> > +}
> > +
> > +static bool is_kretprobe_trampoline_address(unsigned long ip)
> > +{
> > +	return ip == (unsigned long)kretprobe_trampoline_addr();
> > +}
> > +#else
> > +static unsigned long orc_kretprobe_correct_ip(struct unwind_state *state)
> > +{
> > +	return state->ip;
> > +}
> > +
> > +static bool is_kretprobe_trampoline_address(unsigned long ip)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> 
> Can this code go in a kprobes file?  I'd rather not clutter ORC with it,
> and maybe it would be useful for other arches or unwinders.

Yes, anyway dummy kretprobe_find_ret_addr() and kretprobe_trampoline_addr()
should be defined !CONFIG_KRETPROBES case.

> 
> >  bool unwind_next_frame(struct unwind_state *state)
> >  {
> >  	unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
> > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state)
> >  
> >  		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
> >  						  state->ip, (void *)ip_p);
> > +		/*
> > +		 * There are special cases when the stack unwinder is called
> > +		 * from the kretprobe handler or the interrupt handler which
> > +		 * occurs in the kretprobe trampoline code. In those cases,
> > +		 * %sp is shown on the stack instead of the return address.
> > +		 * Or, when the unwinder find the return address is replaced
> > +		 * by kretprobe_trampoline.
> > +		 * In those cases, correct address can be found in kretprobe.
> > +		 */
> > +		if (state->ip == sp ||
> 
> Why is the 'state->ip == sp' needed?

As I commented above, until kretprobe_trampoline writes back the real
address to the stack, sp value is there (which has been pushed by the
'pushq %rsp' at the entry of kretprobe_trampoline.)

        ".type kretprobe_trampoline, @function\n"
        "kretprobe_trampoline:\n"
        /* We don't bother saving the ss register */
        "       pushq %rsp\n"				// THIS
        "       pushfq\n"

Thus, from inside the kretprobe handler, like ftrace, you'll see
the sp value instead of the real return address.

> > +		    is_kretprobe_trampoline_address(state->ip))
> > +			state->ip = orc_kretprobe_correct_ip(state);
> 
> This is similar in concept to ftrace_graph_ret_addr(), right?  Would it
> be possible to have a similar API?  Like
> 
> 		state->ip = kretprobe_ret_addr(state->task, &state->kr_iter, state->ip);

OK, but,

> and without the conditional.

As I said, it is not possible because "state->ip == sp" check depends on
ORC unwinder.

> >  		state->sp = sp;
> >  		state->regs = NULL;
> > @@ -649,6 +686,12 @@ void __unwind_start(struct unwind_state *state, struct task_struct *task,
> >  		state->full_regs = true;
> >  		state->signal = true;
> >  
> > +		/*
> > +		 * When the unwinder called with regs from kretprobe handler,
> > +		 * the regs->ip starts from kretprobe_trampoline address.
> > +		 */
> > +		if (is_kretprobe_trampoline_address(state->ip))
> > +			state->ip = orc_kretprobe_correct_ip(state);
> 
> Shouldn't __kretprobe_trampoline_handler() just set regs->ip to
> 'correct_ret_addr' before passing the regs to the handler?  I'd think
> that would be a less surprising value for regs->ip than
> '&kretprobe_trampoline'.

Hmm, actually current implementation on x86 mimics the behevior of
the int3 exception (which many architectures still do).

Previously the kretprobe_trampoline is a place holder like this.

        "kretprobe_trampoline:\n"
        "       nop\n"

And arch_init_kprobes() puts a kprobe (int3) there.
So in that case regs->ip should be kretprobe_trampoline.
User handler (usually architecutre independent) finds the
correct_ret_addr in kretprobe_instance.ret_addr field.

> And it would make the unwinder just work automatically when unwinding
> from the handler using the regs.
> 
> It would also work when unwinding from the handler's stack, if we put an
> UNWIND_HINT_REGS after saving the regs.

At that moment, the real return address is not identified. So we can not
put it.

> 
> The only (rare) case it wouldn't work would be unwinding from an
> interrupt before regs->ip gets set properly.  In which case we'd still
> need the above call to orc_kretprobe_correct_ip() or so.


Thank you,
Josh Poimboeuf March 11, 2021, 1:06 a.m. UTC | #15
On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote:
> > >  bool unwind_next_frame(struct unwind_state *state)
> > >  {
> > >  	unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
> > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state)
> > >  
> > >  		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
> > >  						  state->ip, (void *)ip_p);
> > > +		/*
> > > +		 * There are special cases when the stack unwinder is called
> > > +		 * from the kretprobe handler or the interrupt handler which
> > > +		 * occurs in the kretprobe trampoline code. In those cases,
> > > +		 * %sp is shown on the stack instead of the return address.
> > > +		 * Or, when the unwinder find the return address is replaced
> > > +		 * by kretprobe_trampoline.
> > > +		 * In those cases, correct address can be found in kretprobe.
> > > +		 */
> > > +		if (state->ip == sp ||
> > 
> > Why is the 'state->ip == sp' needed?
> 
> As I commented above, until kretprobe_trampoline writes back the real
> address to the stack, sp value is there (which has been pushed by the
> 'pushq %rsp' at the entry of kretprobe_trampoline.)
> 
>         ".type kretprobe_trampoline, @function\n"
>         "kretprobe_trampoline:\n"
>         /* We don't bother saving the ss register */
>         "       pushq %rsp\n"				// THIS
>         "       pushfq\n"
> 
> Thus, from inside the kretprobe handler, like ftrace, you'll see
> the sp value instead of the real return address.

I see.  If you change is_kretprobe_trampoline_address() to include the
entire function, like:

static bool is_kretprobe_trampoline_address(unsigned long ip)
{
	return (void *)ip >= kretprobe_trampoline &&
	       (void *)ip < kretprobe_trampoline_end;
}

then the unwinder won't ever read the bogus %rsp value into state->ip,
and the 'state->ip == sp' check can be removed.

> > And it would make the unwinder just work automatically when unwinding
> > from the handler using the regs.
> > 
> > It would also work when unwinding from the handler's stack, if we put an
> > UNWIND_HINT_REGS after saving the regs.
> 
> At that moment, the real return address is not identified. So we can not
> put it.

True, at the time the regs are originally saved, the real return address
isn't available.  But by the time the user handler is called, the return
address *is* available.  So if the real return address were placed in
regs->ip before calling the handler, the unwinder could find it there,
when called from the handler.

Then we wouldn't need the call to orc_kretprobe_correct_ip() in
__unwind_start().

But maybe it's not possible due to the regs->ip expectations of legacy
handlers?
Masami Hiramatsu (Google) March 11, 2021, 1:54 a.m. UTC | #16
On Wed, 10 Mar 2021 19:06:15 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote:
> > > >  bool unwind_next_frame(struct unwind_state *state)
> > > >  {
> > > >  	unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
> > > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state)
> > > >  
> > > >  		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
> > > >  						  state->ip, (void *)ip_p);
> > > > +		/*
> > > > +		 * There are special cases when the stack unwinder is called
> > > > +		 * from the kretprobe handler or the interrupt handler which
> > > > +		 * occurs in the kretprobe trampoline code. In those cases,
> > > > +		 * %sp is shown on the stack instead of the return address.
> > > > +		 * Or, when the unwinder find the return address is replaced
> > > > +		 * by kretprobe_trampoline.
> > > > +		 * In those cases, correct address can be found in kretprobe.
> > > > +		 */
> > > > +		if (state->ip == sp ||
> > > 
> > > Why is the 'state->ip == sp' needed?
> > 
> > As I commented above, until kretprobe_trampoline writes back the real
> > address to the stack, sp value is there (which has been pushed by the
> > 'pushq %rsp' at the entry of kretprobe_trampoline.)
> > 
> >         ".type kretprobe_trampoline, @function\n"
> >         "kretprobe_trampoline:\n"
> >         /* We don't bother saving the ss register */
> >         "       pushq %rsp\n"				// THIS
> >         "       pushfq\n"
> > 
> > Thus, from inside the kretprobe handler, like ftrace, you'll see
> > the sp value instead of the real return address.
> 
> I see.  If you change is_kretprobe_trampoline_address() to include the
> entire function, like:
> 
> static bool is_kretprobe_trampoline_address(unsigned long ip)
> {
> 	return (void *)ip >= kretprobe_trampoline &&
> 	       (void *)ip < kretprobe_trampoline_end;
> }
> 
> then the unwinder won't ever read the bogus %rsp value into state->ip,
> and the 'state->ip == sp' check can be removed.

Hmm, I couldn't get your point. Since sp is the address of stack,
it always out of text address.

> 
> > > And it would make the unwinder just work automatically when unwinding
> > > from the handler using the regs.
> > > 
> > > It would also work when unwinding from the handler's stack, if we put an
> > > UNWIND_HINT_REGS after saving the regs.
> > 
> > At that moment, the real return address is not identified. So we can not
> > put it.
> 
> True, at the time the regs are originally saved, the real return address
> isn't available.  But by the time the user handler is called, the return
> address *is* available.  So if the real return address were placed in
> regs->ip before calling the handler, the unwinder could find it there,
> when called from the handler.

OK, but this is not arch independent specification. I can make a hack
only for x86, but that is not clean implementation, hmm.

> 
> Then we wouldn't need the call to orc_kretprobe_correct_ip() in
> __unwind_start().

What about the ORC implementation in other architecture? Is that for x86 only?

> 
> But maybe it's not possible due to the regs->ip expectations of legacy
> handlers?

Usually, the legacy handlers will ignore it, the official way to access
the correct return address is kretprobe_instance.ret_addr. Because it is
arch independent.

Nowadays there are instruction_pointer() and instruction_pointer_set() APIs
in many (not all) architecutre, so I can try to replace to use it instead
of the kretprobe_instance.ret_addr.
(and it will break the out-of-tree codes)


Thank you,
Josh Poimboeuf March 11, 2021, 4:51 p.m. UTC | #17
On Thu, Mar 11, 2021 at 10:54:38AM +0900, Masami Hiramatsu wrote:
> On Wed, 10 Mar 2021 19:06:15 -0600
> Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote:
> > > > >  bool unwind_next_frame(struct unwind_state *state)
> > > > >  {
> > > > >  	unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
> > > > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state)
> > > > >  
> > > > >  		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
> > > > >  						  state->ip, (void *)ip_p);
> > > > > +		/*
> > > > > +		 * There are special cases when the stack unwinder is called
> > > > > +		 * from the kretprobe handler or the interrupt handler which
> > > > > +		 * occurs in the kretprobe trampoline code. In those cases,
> > > > > +		 * %sp is shown on the stack instead of the return address.
> > > > > +		 * Or, when the unwinder find the return address is replaced
> > > > > +		 * by kretprobe_trampoline.
> > > > > +		 * In those cases, correct address can be found in kretprobe.
> > > > > +		 */
> > > > > +		if (state->ip == sp ||
> > > > 
> > > > Why is the 'state->ip == sp' needed?
> > > 
> > > As I commented above, until kretprobe_trampoline writes back the real
> > > address to the stack, sp value is there (which has been pushed by the
> > > 'pushq %rsp' at the entry of kretprobe_trampoline.)
> > > 
> > >         ".type kretprobe_trampoline, @function\n"
> > >         "kretprobe_trampoline:\n"
> > >         /* We don't bother saving the ss register */
> > >         "       pushq %rsp\n"				// THIS
> > >         "       pushfq\n"
> > > 
> > > Thus, from inside the kretprobe handler, like ftrace, you'll see
> > > the sp value instead of the real return address.
> > 
> > I see.  If you change is_kretprobe_trampoline_address() to include the
> > entire function, like:
> > 
> > static bool is_kretprobe_trampoline_address(unsigned long ip)
> > {
> > 	return (void *)ip >= kretprobe_trampoline &&
> > 	       (void *)ip < kretprobe_trampoline_end;
> > }
> > 
> > then the unwinder won't ever read the bogus %rsp value into state->ip,
> > and the 'state->ip == sp' check can be removed.
> 
> Hmm, I couldn't get your point. Since sp is the address of stack,
> it always out of text address.

When unwinding from trampoline_handler(), state->ip will point to the
instruction after the call:

	call trampoline_handler
	movq %rax, 19*8(%rsp)   <-- state->ip points to this insn

But then, the above version of is_kretprobe_trampoline_address() is
true, so state->ip gets immediately replaced with the real return
address:

	if (is_kretprobe_trampoline_address(state->ip))
		state->ip = orc_kretprobe_correct_ip(state);

so the unwinder skips over the kretprobe_trampoline() frame and goes
straight to the frame of the real return address.  Thus it never reads
this bogus return value into state->ip:

	pushq %rsp

which is why the weird 'state->ip == sp' check is no longer needed.

The only "downside" is that the unwinder skips the
kretprobe_trampoline() frame.  (note that downside wouldn't exist in
the case of UNWIND_HINT_REGS + valid regs->ip).

> > > > And it would make the unwinder just work automatically when unwinding
> > > > from the handler using the regs.
> > > > 
> > > > It would also work when unwinding from the handler's stack, if we put an
> > > > UNWIND_HINT_REGS after saving the regs.
> > > 
> > > At that moment, the real return address is not identified. So we can not
> > > put it.
> > 
> > True, at the time the regs are originally saved, the real return address
> > isn't available.  But by the time the user handler is called, the return
> > address *is* available.  So if the real return address were placed in
> > regs->ip before calling the handler, the unwinder could find it there,
> > when called from the handler.
> 
> OK, but this is not arch independent specification. I can make a hack
> only for x86, but that is not clean implementation, hmm.
> 
> > 
> > Then we wouldn't need the call to orc_kretprobe_correct_ip() in
> > __unwind_start().
> 
> What about the ORC implementation in other architecture? Is that for
> x86 only?

ORC is x86 only.

> > But maybe it's not possible due to the regs->ip expectations of legacy
> > handlers?
> 
> Usually, the legacy handlers will ignore it, the official way to access
> the correct return address is kretprobe_instance.ret_addr. Because it is
> arch independent.
> 
> Nowadays there are instruction_pointer() and instruction_pointer_set() APIs
> in many (not all) architecutre, so I can try to replace to use it instead
> of the kretprobe_instance.ret_addr.
> (and it will break the out-of-tree codes)

That sounds better to me, though I don't have an understanding of what
it would break.
Masami Hiramatsu (Google) March 12, 2021, 3:22 a.m. UTC | #18
On Thu, 11 Mar 2021 10:51:10 -0600
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Thu, Mar 11, 2021 at 10:54:38AM +0900, Masami Hiramatsu wrote:
> > On Wed, 10 Mar 2021 19:06:15 -0600
> > Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > 
> > > On Thu, Mar 11, 2021 at 09:20:18AM +0900, Masami Hiramatsu wrote:
> > > > > >  bool unwind_next_frame(struct unwind_state *state)
> > > > > >  {
> > > > > >  	unsigned long ip_p, sp, tmp, orig_ip = state->ip, prev_sp = state->sp;
> > > > > > @@ -536,6 +561,18 @@ bool unwind_next_frame(struct unwind_state *state)
> > > > > >  
> > > > > >  		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
> > > > > >  						  state->ip, (void *)ip_p);
> > > > > > +		/*
> > > > > > +		 * There are special cases when the stack unwinder is called
> > > > > > +		 * from the kretprobe handler or the interrupt handler which
> > > > > > +		 * occurs in the kretprobe trampoline code. In those cases,
> > > > > > +		 * %sp is shown on the stack instead of the return address.
> > > > > > +		 * Or, when the unwinder find the return address is replaced
> > > > > > +		 * by kretprobe_trampoline.
> > > > > > +		 * In those cases, correct address can be found in kretprobe.
> > > > > > +		 */
> > > > > > +		if (state->ip == sp ||
> > > > > 
> > > > > Why is the 'state->ip == sp' needed?
> > > > 
> > > > As I commented above, until kretprobe_trampoline writes back the real
> > > > address to the stack, sp value is there (which has been pushed by the
> > > > 'pushq %rsp' at the entry of kretprobe_trampoline.)
> > > > 
> > > >         ".type kretprobe_trampoline, @function\n"
> > > >         "kretprobe_trampoline:\n"
> > > >         /* We don't bother saving the ss register */
> > > >         "       pushq %rsp\n"				// THIS
> > > >         "       pushfq\n"
> > > > 
> > > > Thus, from inside the kretprobe handler, like ftrace, you'll see
> > > > the sp value instead of the real return address.
> > > 
> > > I see.  If you change is_kretprobe_trampoline_address() to include the
> > > entire function, like:
> > > 
> > > static bool is_kretprobe_trampoline_address(unsigned long ip)
> > > {
> > > 	return (void *)ip >= kretprobe_trampoline &&
> > > 	       (void *)ip < kretprobe_trampoline_end;
> > > }
> > > 
> > > then the unwinder won't ever read the bogus %rsp value into state->ip,
> > > and the 'state->ip == sp' check can be removed.
> > 
> > Hmm, I couldn't get your point. Since sp is the address of stack,
> > it always out of text address.
> 
> When unwinding from trampoline_handler(), state->ip will point to the
> instruction after the call:
> 
> 	call trampoline_handler
> 	movq %rax, 19*8(%rsp)   <-- state->ip points to this insn
> 
> But then, the above version of is_kretprobe_trampoline_address() is
> true, so state->ip gets immediately replaced with the real return
> address:
> 
> 	if (is_kretprobe_trampoline_address(state->ip))
> 		state->ip = orc_kretprobe_correct_ip(state);

Hmm, but that ip is for the "next" frame, isn't it?

[stack]
0x040 kretprobe_trampoline+0x25 -+
0x048 r15                        | 
...     /* pt_regs */            +- ORC sp_offset at the kretprobe_trampoline+0x24
0x0d8 flags                     -+
0x0e0 rsp (=0x0e0) /* will be replaced by the real return address */

Your idea seems replacing stack@0x040 with the real return address.
In that case, may orc_find() returns the wrong ORC info?


> so the unwinder skips over the kretprobe_trampoline() frame and goes
> straight to the frame of the real return address.  Thus it never reads
> this bogus return value into state->ip:
> 
> 	pushq %rsp
> 
> which is why the weird 'state->ip == sp' check is no longer needed.

But that is what the kretprobe_trampoline actually does. I would rather
like to see the real information from stacktrace.

> The only "downside" is that the unwinder skips the
> kretprobe_trampoline() frame.  (note that downside wouldn't exist in
> the case of UNWIND_HINT_REGS + valid regs->ip).

As I said, I would like to see the kretprobe_trampoline+0x25 on
my stacktrace, because kretprobe doesn't replace the kretprobe_trampoline
but the target_func on the stack.

> > > > > And it would make the unwinder just work automatically when unwinding
> > > > > from the handler using the regs.
> > > > > 
> > > > > It would also work when unwinding from the handler's stack, if we put an
> > > > > UNWIND_HINT_REGS after saving the regs.
> > > > 
> > > > At that moment, the real return address is not identified. So we can not
> > > > put it.
> > > 
> > > True, at the time the regs are originally saved, the real return address
> > > isn't available.  But by the time the user handler is called, the return
> > > address *is* available.  So if the real return address were placed in
> > > regs->ip before calling the handler, the unwinder could find it there,
> > > when called from the handler.
> > 
> > OK, but this is not arch independent specification. I can make a hack
> > only for x86, but that is not clean implementation, hmm.
> > 
> > > 
> > > Then we wouldn't need the call to orc_kretprobe_correct_ip() in
> > > __unwind_start().
> > 
> > What about the ORC implementation in other architecture? Is that for
> > x86 only?
> 
> ORC is x86 only.
> 
> > > But maybe it's not possible due to the regs->ip expectations of legacy
> > > handlers?
> > 
> > Usually, the legacy handlers will ignore it, the official way to access
> > the correct return address is kretprobe_instance.ret_addr. Because it is
> > arch independent.
> > 
> > Nowadays there are instruction_pointer() and instruction_pointer_set() APIs
> > in many (not all) architecutre, so I can try to replace to use it instead
> > of the kretprobe_instance.ret_addr.
> > (and it will break the out-of-tree codes)
> 
> That sounds better to me, though I don't have an understanding of what
> it would break.

OK, anyway try to do that and see what happen.

Thank you,