diff mbox series

[v2] perf,x86: avoid missing caller address in stack traces captured in uprobe

Message ID 20240702171858.187562-1-andrii@kernel.org (mailing list archive)
State Superseded
Headers show
Series [v2] perf,x86: avoid missing caller address in stack traces captured in uprobe | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Andrii Nakryiko July 2, 2024, 5:18 p.m. UTC
When tracing user functions with uprobe functionality, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be `push %rbp` instruction in function
preamble, which means that within that function frame pointer hasn't
been established yet. This leads to consistently missing an actual
caller of the traced function, because perf_callchain_user() only
records current IP (capturing traced function) and then following frame
pointer chain (which would be caller's frame, containing the address of
caller's caller).

So when we have target_1 -> target_2 -> target_3 call chain and we are
tracing an entry to target_3, captured stack trace will report
target_1 -> target_3 call chain, which is wrong and confusing.

This patch proposes a x86-64-specific heuristic to detect `push %rbp`
(`push %ebp` on 32-bit architecture) instruction being traced. Given
entire kernel implementation of user space stack trace capturing works
under assumption that user space code was compiled with frame pointer
register (%rbp/%ebp) preservation, it seems pretty reasonable to use
this instruction as a strong indicator that this is the entry to the
function. In that case, return address is still pointed to by %rsp/%esp,
so we fetch it and add to stack trace before proceeding to unwind the
rest using frame pointer-based logic.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
v1->v2:
  - use native unsigned long for ret_addr (Peter);
  - add same logic for compat logic in perf_callchain_user32 (Peter).

 arch/x86/events/core.c  | 33 +++++++++++++++++++++++++++++++++
 include/linux/uprobes.h |  2 ++
 kernel/events/uprobes.c |  2 ++
 3 files changed, 37 insertions(+)

Comments

Josh Poimboeuf July 2, 2024, 11:35 p.m. UTC | #1
On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> When tracing user functions with uprobe functionality, it's common to
> install the probe (e.g., a BPF program) at the first instruction of the
> function. This is often going to be `push %rbp` instruction in function
> preamble, which means that within that function frame pointer hasn't
> been established yet. This leads to consistently missing an actual
> caller of the traced function, because perf_callchain_user() only
> records current IP (capturing traced function) and then following frame
> pointer chain (which would be caller's frame, containing the address of
> caller's caller).
> 
> So when we have target_1 -> target_2 -> target_3 call chain and we are
> tracing an entry to target_3, captured stack trace will report
> target_1 -> target_3 call chain, which is wrong and confusing.
> 
> This patch proposes a x86-64-specific heuristic to detect `push %rbp`
> (`push %ebp` on 32-bit architecture) instruction being traced. Given
> entire kernel implementation of user space stack trace capturing works
> under assumption that user space code was compiled with frame pointer
> register (%rbp/%ebp) preservation, it seems pretty reasonable to use
> this instruction as a strong indicator that this is the entry to the
> function. In that case, return address is still pointed to by %rsp/%esp,
> so we fetch it and add to stack trace before proceeding to unwind the
> rest using frame pointer-based logic.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Should it also check for ENDBR64?

When compiled with -fcf-protection=branch, the first instruction of the
function will almost always be ENDBR64.  I'm not sure about other
distros, but at least Fedora compiles its binaries like that.
Josh Poimboeuf July 2, 2024, 11:39 p.m. UTC | #2
On Tue, Jul 02, 2024 at 04:35:56PM -0700, Josh Poimboeuf wrote:
> On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> > When tracing user functions with uprobe functionality, it's common to
> > install the probe (e.g., a BPF program) at the first instruction of the
> > function. This is often going to be `push %rbp` instruction in function
> > preamble, which means that within that function frame pointer hasn't
> > been established yet. This leads to consistently missing an actual
> > caller of the traced function, because perf_callchain_user() only
> > records current IP (capturing traced function) and then following frame
> > pointer chain (which would be caller's frame, containing the address of
> > caller's caller).
> > 
> > So when we have target_1 -> target_2 -> target_3 call chain and we are
> > tracing an entry to target_3, captured stack trace will report
> > target_1 -> target_3 call chain, which is wrong and confusing.
> > 
> > This patch proposes a x86-64-specific heuristic to detect `push %rbp`
> > (`push %ebp` on 32-bit architecture) instruction being traced. Given
> > entire kernel implementation of user space stack trace capturing works
> > under assumption that user space code was compiled with frame pointer
> > register (%rbp/%ebp) preservation, it seems pretty reasonable to use
> > this instruction as a strong indicator that this is the entry to the
> > function. In that case, return address is still pointed to by %rsp/%esp,
> > so we fetch it and add to stack trace before proceeding to unwind the
> > rest using frame pointer-based logic.
> > 
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> 
> Should it also check for ENDBR64?
> 
> When compiled with -fcf-protection=branch, the first instruction of the
> function will almost always be ENDBR64.  I'm not sure about other
> distros, but at least Fedora compiles its binaries like that.

BTW, there are some cases (including leaf functions and some stack
alignment sequences) where a "push %rbp" can happen inside a function.
Then it would presumably add a bogus trace entry.  Are such false
positives ok?
Andrii Nakryiko July 3, 2024, 12:06 a.m. UTC | #3
On Tue, Jul 2, 2024 at 4:39 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 04:35:56PM -0700, Josh Poimboeuf wrote:
> > On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> > > When tracing user functions with uprobe functionality, it's common to
> > > install the probe (e.g., a BPF program) at the first instruction of the
> > > function. This is often going to be `push %rbp` instruction in function
> > > preamble, which means that within that function frame pointer hasn't
> > > been established yet. This leads to consistently missing an actual
> > > caller of the traced function, because perf_callchain_user() only
> > > records current IP (capturing traced function) and then following frame
> > > pointer chain (which would be caller's frame, containing the address of
> > > caller's caller).
> > >
> > > So when we have target_1 -> target_2 -> target_3 call chain and we are
> > > tracing an entry to target_3, captured stack trace will report
> > > target_1 -> target_3 call chain, which is wrong and confusing.
> > >
> > > This patch proposes a x86-64-specific heuristic to detect `push %rbp`
> > > (`push %ebp` on 32-bit architecture) instruction being traced. Given
> > > entire kernel implementation of user space stack trace capturing works
> > > under assumption that user space code was compiled with frame pointer
> > > register (%rbp/%ebp) preservation, it seems pretty reasonable to use
> > > this instruction as a strong indicator that this is the entry to the
> > > function. In that case, return address is still pointed to by %rsp/%esp,
> > > so we fetch it and add to stack trace before proceeding to unwind the
> > > rest using frame pointer-based logic.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > Should it also check for ENDBR64?
> >

Sure, I can add a check for endbr64 as well. endbr64 probably can be
used not just at function entry, is that right? So it might be another
case of false positive (which I think is ok, see below).

> > When compiled with -fcf-protection=branch, the first instruction of the
> > function will almost always be ENDBR64.  I'm not sure about other
> > distros, but at least Fedora compiles its binaries like that.
>
> BTW, there are some cases (including leaf functions and some stack
> alignment sequences) where a "push %rbp" can happen inside a function.
> Then it would presumably add a bogus trace entry.  Are such false
> positives ok?

I think such cases should be rare. People mostly seem to trace user
function entry/exit, rarely if ever they trace something within the
function, except for USDT cases, where it will be a nop instruction
that they trace.

In general, even with false positives, I think it's overwhelmingly
better to get correct entry stack trace 99.9% of the time, and in the
rest 0.01% cases it's fine having one extra bogus entry (but the rest
should still be correct), which should be easy for humans to recognize
and filter out, if necessary.

>
> --
> Josh
Josh Poimboeuf July 3, 2024, 1:11 a.m. UTC | #4
On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > Should it also check for ENDBR64?
> > >
> 
> Sure, I can add a check for endbr64 as well. endbr64 probably can be
> used not just at function entry, is that right? So it might be another
> case of false positive (which I think is ok, see below).

Yeah, at least theoretically they could happen in the middle of a
function for implementing C switch jump tables.

> > > When compiled with -fcf-protection=branch, the first instruction of the
> > > function will almost always be ENDBR64.  I'm not sure about other
> > > distros, but at least Fedora compiles its binaries like that.
> >
> > BTW, there are some cases (including leaf functions and some stack
> > alignment sequences) where a "push %rbp" can happen inside a function.
> > Then it would presumably add a bogus trace entry.  Are such false
> > positives ok?
> 
> I think such cases should be rare. People mostly seem to trace user
> function entry/exit, rarely if ever they trace something within the
> function, except for USDT cases, where it will be a nop instruction
> that they trace.
> 
> In general, even with false positives, I think it's overwhelmingly
> better to get correct entry stack trace 99.9% of the time, and in the
> rest 0.01% cases it's fine having one extra bogus entry (but the rest
> should still be correct), which should be easy for humans to recognize
> and filter out, if necessary.

Agreed, this is a definite improvement overall.

BTW, soon there will be support for sframes instead of frame pointers,
at which point these checks should only be done for the frame pointer
case.
Andrii Nakryiko July 3, 2024, 3:35 a.m. UTC | #5
On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > > Should it also check for ENDBR64?
> > > >
> >
> > Sure, I can add a check for endbr64 as well. endbr64 probably can be
> > used not just at function entry, is that right? So it might be another
> > case of false positive (which I think is ok, see below).
>
> Yeah, at least theoretically they could happen in the middle of a
> function for implementing C switch jump tables.
>
> > > > When compiled with -fcf-protection=branch, the first instruction of the
> > > > function will almost always be ENDBR64.  I'm not sure about other
> > > > distros, but at least Fedora compiles its binaries like that.
> > >
> > > BTW, there are some cases (including leaf functions and some stack
> > > alignment sequences) where a "push %rbp" can happen inside a function.
> > > Then it would presumably add a bogus trace entry.  Are such false
> > > positives ok?
> >
> > I think such cases should be rare. People mostly seem to trace user
> > function entry/exit, rarely if ever they trace something within the
> > function, except for USDT cases, where it will be a nop instruction
> > that they trace.
> >
> > In general, even with false positives, I think it's overwhelmingly
> > better to get correct entry stack trace 99.9% of the time, and in the
> > rest 0.01% cases it's fine having one extra bogus entry (but the rest
> > should still be correct), which should be easy for humans to recognize
> > and filter out, if necessary.
>
> Agreed, this is a definite improvement overall.

Cool, I'll incorporate that into v3 and send it soon.

>
> BTW, soon there will be support for sframes instead of frame pointers,
> at which point these checks should only be done for the frame pointer
> case.

Nice, this is one of the reasons I've been thinking about asynchronous
stack trace capture in BPF (see [0] from recent LSF/MM).

Few questions, while we are at it. Does it mean that
perf_callchain_user() will support working from sleepable context and
will wait for data to be paged in? Is anyone already working on this?
Any pointers?

 [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4

>
> --
> Josh
Josh Poimboeuf July 3, 2024, 6:11 a.m. UTC | #6
On Tue, Jul 02, 2024 at 08:35:08PM -0700, Andrii Nakryiko wrote:
> On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > In general, even with false positives, I think it's overwhelmingly
> > > better to get correct entry stack trace 99.9% of the time, and in the
> > > rest 0.01% cases it's fine having one extra bogus entry (but the rest
> > > should still be correct), which should be easy for humans to recognize
> > > and filter out, if necessary.
> >
> > Agreed, this is a definite improvement overall.
> 
> Cool, I'll incorporate that into v3 and send it soon.
> 
> >
> > BTW, soon there will be support for sframes instead of frame pointers,
> > at which point these checks should only be done for the frame pointer
> > case.
> 
> Nice, this is one of the reasons I've been thinking about asynchronous
> stack trace capture in BPF (see [0] from recent LSF/MM).
>  [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4

I don't seem to have permission to open it.

> Few questions, while we are at it. Does it mean that
> perf_callchain_user() will support working from sleepable context and
> will wait for data to be paged in? Is anyone already working on this?
> Any pointers?

I had a prototype here:

  https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org

Hopefully I can get started on v2 soon.
Andrii Nakryiko July 3, 2024, 8:23 p.m. UTC | #7
On Tue, Jul 2, 2024 at 11:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 08:35:08PM -0700, Andrii Nakryiko wrote:
> > On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > > In general, even with false positives, I think it's overwhelmingly
> > > > better to get correct entry stack trace 99.9% of the time, and in the
> > > > rest 0.01% cases it's fine having one extra bogus entry (but the rest
> > > > should still be correct), which should be easy for humans to recognize
> > > > and filter out, if necessary.
> > >
> > > Agreed, this is a definite improvement overall.
> >
> > Cool, I'll incorporate that into v3 and send it soon.
> >

BTW, if you have a chance, please do take a look at v3 and leave your
ack, if you are ok with it. Thanks!

> > >
> > > BTW, soon there will be support for sframes instead of frame pointers,
> > > at which point these checks should only be done for the frame pointer
> > > case.
> >
> > Nice, this is one of the reasons I've been thinking about asynchronous
> > stack trace capture in BPF (see [0] from recent LSF/MM).
> >  [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
>
> I don't seem to have permission to open it.
>

Argh, sorry, it's under my corporate account which doesn't allow
others to view it. Try this, I "published" it, let me know if that
still doesn't work:

  [0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub

> > Few questions, while we are at it. Does it mean that
> > perf_callchain_user() will support working from sleepable context and
> > will wait for data to be paged in? Is anyone already working on this?
> > Any pointers?
>
> I had a prototype here:
>
>   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
>
> Hopefully I can get started on v2 soon.

Ok, so you are going to work on this. Please cc me on future revisions
then. Thanks!

>
> --
> Josh
Josh Poimboeuf July 3, 2024, 10:41 p.m. UTC | #8
On Wed, Jul 03, 2024 at 01:23:39PM -0700, Andrii Nakryiko wrote:
> > >  [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
> >
> > I don't seem to have permission to open it.
> >
> 
> Argh, sorry, it's under my corporate account which doesn't allow
> others to view it. Try this, I "published" it, let me know if that
> still doesn't work:
> 
>   [0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub

The new link doesn't work either :-)

> > > Few questions, while we are at it. Does it mean that
> > > perf_callchain_user() will support working from sleepable context and
> > > will wait for data to be paged in? Is anyone already working on this?
> > > Any pointers?
> >
> > I had a prototype here:
> >
> >   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
> >
> > Hopefully I can get started on v2 soon.
> 
> Ok, so you are going to work on this. Please cc me on future revisions
> then. Thanks!

Will do!
Andrii Nakryiko July 3, 2024, 10:58 p.m. UTC | #9
On Wed, Jul 3, 2024 at 3:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Wed, Jul 03, 2024 at 01:23:39PM -0700, Andrii Nakryiko wrote:
> > > >  [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
> > >
> > > I don't seem to have permission to open it.
> > >
> >
> > Argh, sorry, it's under my corporate account which doesn't allow
> > others to view it. Try this, I "published" it, let me know if that
> > still doesn't work:
> >
> >   [0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub
>
> The new link doesn't work either :-)
>

Goodness, sorry about that. I just recreated it under my public
account and shared it with the world. This HAS to work:

  https://docs.google.com/presentation/d/1eaOf9CVZlCOD6b7_UtZBYMfTyYIDZw9clyjzu-IIOIo

> > > > Few questions, while we are at it. Does it mean that
> > > > perf_callchain_user() will support working from sleepable context and
> > > > will wait for data to be paged in? Is anyone already working on this?
> > > > Any pointers?
> > >
> > > I had a prototype here:
> > >
> > >   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
> > >
> > > Hopefully I can get started on v2 soon.
> >
> > Ok, so you are going to work on this. Please cc me on future revisions
> > then. Thanks!
>
> Will do!
>
> --
> Josh
diff mbox series

Patch

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..60821c1ff2f3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2833,6 +2833,19 @@  perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
 
 	fp = compat_ptr(ss_base + regs->bp);
 	pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+	/* see perf_callchain_user() below for why we do this */
+	if (current->utask) {
+		struct arch_uprobe *auprobe = current->utask->auprobe;
+		u32 ret_addr;
+
+		if (auprobe && auprobe->insn[0] == 0x55 /* push %ebp */ &&
+		    !__get_user(ret_addr, (const u32 __user *)regs->sp))
+			perf_callchain_store(entry, ret_addr);
+	}
+#endif
+
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
@@ -2884,6 +2897,26 @@  perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 		return;
 
 	pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+	/*
+	 * If we are called from uprobe handler, and we are indeed at the very
+	 * entry to user function (which is normally a `push %rbp` instruction,
+	 * under assumption of application being compiled with frame pointers),
+	 * we should read return address from *regs->sp before proceeding
+	 * to follow frame pointers, otherwise we'll skip immediate caller
+	 * as %rbp is not yet setup.
+	 */
+	if (current->utask) {
+		struct arch_uprobe *auprobe = current->utask->auprobe;
+		unsigned long ret_addr;
+
+		if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp/%ebp */ &&
+		    !__get_user(ret_addr, (const unsigned long __user *)regs->sp))
+			perf_callchain_store(entry, ret_addr);
+	}
+#endif
+
 	while (entry->nr < entry->max_stack) {
 		if (!valid_user_frame(fp, sizeof(frame)))
 			break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a270a5892ab4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@  struct uprobe_task {
 	struct uprobe			*active_uprobe;
 	unsigned long			xol_vaddr;
 
+	struct arch_uprobe              *auprobe;
+
 	struct return_instance		*return_instances;
 	unsigned int			depth;
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..6e22e4d80f1e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2082,6 +2082,7 @@  static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 	bool need_prep = false; /* prepare return uprobe, when needed */
 
 	down_read(&uprobe->register_rwsem);
+	current->utask->auprobe = &uprobe->arch;
 	for (uc = uprobe->consumers; uc; uc = uc->next) {
 		int rc = 0;
 
@@ -2096,6 +2097,7 @@  static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 
 		remove &= rc;
 	}
+	current->utask->auprobe = NULL;
 
 	if (need_prep && !remove)
 		prepare_uretprobe(uprobe, regs); /* put bp at return */