diff mbox series

[2/3] arm64: perf: Improve compat perf_callchain_user() for clang leaf functions

Message ID 20210507135509.2.Ib54050e4091679cc31b04d52d7ef200f99faaae5@changeid (mailing list archive)
State New, archived
Headers show
Series arm64: perf: Make compat tracing better | expand

Commit Message

Doug Anderson May 7, 2021, 8:55 p.m. UTC
It turns out that even when you compile code with clang with
"-fno-omit-frame-pointer" that it won't generate a frame pointer for
leaf functions (those that don't call any sub-functions). Presumably
clang does this to reduce the overhead of frame pointers. In a leaf
function you don't really need frame pointers since the Link Register
(LR) is guaranteed to always point to the caller.

Clang's optimization here is a bit of a pain for us, though. A human
might have an easy time figuring out if a function is a leaf function
or not and in theory we could have a way to lookup a given PC to find
out if it's in a leaf function. Unfortunately we haven't passed the
Turing test and we don't have any auxiliary data to help us.

If we just ignore this problem then the end result isn't terrible. It
just turns out that the _callers_ of leaf functions won't be logged. I
guess that's OK, but it could lead to some confusing traces.

Another option (the one proposed in this patch) is to always log the
first LR when we're tracing, assuming that we hadn't already decided
to log it for some other reason.

Good things about always logging the LR:
* clang leaf functions are handled better.
* if PC is right at the start of a function (even on non-clang) it's
  handled better.

Bad things about the LR:
* We could log a "bogus" PC in the trace.

I believe that the most common "bogus" PC that would be logged would
be a PC somewhere in the top function being traced. As an example, if
we have this function:

  non_leaf():
    1. Setup the frame pointer
    2. Call example()
    3. Do something slow
    4. Do something else slow
    5. Call example2()
    6. Return

If the PC was in the middle of "Do something else slow" and then we
tried to trace, our stack trace would look like this:
  Top:  a) A PC in the middle of "Do something else slow".
        b) The return address that example() used, probably in "Do
           something slow"
	c) The caller of non_leaf()

Specifically you can see that there would be two PCs logged for
non_leaf().

To me it feels like this is a net-win. It also should be noted that
the consumer of our trace records probably _does_ have more
information than we do. It could fairly easily ignore this info.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 arch/arm64/kernel/perf_callchain.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

James Clark June 7, 2021, 9:14 a.m. UTC | #1
On 07/05/2021 23:55, Douglas Anderson wrote:
> It turns out that even when you compile code with clang with
> "-fno-omit-frame-pointer" that it won't generate a frame pointer for
> leaf functions (those that don't call any sub-functions). Presumably
> clang does this to reduce the overhead of frame pointers. In a leaf
> function you don't really need frame pointers since the Link Register
> (LR) is guaranteed to always point to the caller> 
[...]
> 
>  arch/arm64/kernel/perf_callchain.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
> index e5ce5f7965d1..b3cd9f371469 100644
> --- a/arch/arm64/kernel/perf_callchain.c
> +++ b/arch/arm64/kernel/perf_callchain.c
> @@ -326,6 +326,20 @@ static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
>  	while ((entry->nr < entry->max_stack) && fp && !(fp & 0x3)) {
>  		err = compat_perf_trace_1(&fp, &pc, leaf_lr);
>  
> +		/*
> +		 * If this is the first trace and it didn't find the LR then
> +		 * let's throw it in the trace first. This isn't perfect but
> +		 * is the best we can do for handling clang leaf functions (or
> +		 * the case where we're right at the start of the function
> +		 * before the new frame has been pushed). In the worst case
> +		 * this can cause us to throw an extra entry that will be some
> +		 * location in the same function as the PC. That's not
> +		 * amazing but shouldn't really hurt. It seems better than
> +		 * throwing away the LR.
> +		 */

Hi Douglas,

I think the behaviour with GCC is also similar. We were working on this change
(https://lore.kernel.org/lkml/20210304163255.10363-4-alexandre.truong@arm.com/)
in userspace Perf which addresses the same issue.

The basic concept of our version is to record only the link register
(as in --user-regs=lr). Then use the existing dwarf based unwind
to determine if the link register is valid for that frame, and then if
it is and it doesn't already exist on the stack then insert it.

You mention that your version isn't perfect, do you think that saving the
LR and using something like libunwind in a post process could be better?

Thanks
James
Doug Anderson June 7, 2021, 8:57 p.m. UTC | #2
Hi,

On Mon, Jun 7, 2021 at 2:14 AM James Clark <james.clark@arm.com> wrote:
>
>
>
> On 07/05/2021 23:55, Douglas Anderson wrote:
> > It turns out that even when you compile code with clang with
> > "-fno-omit-frame-pointer" that it won't generate a frame pointer for
> > leaf functions (those that don't call any sub-functions). Presumably
> > clang does this to reduce the overhead of frame pointers. In a leaf
> > function you don't really need frame pointers since the Link Register
> > (LR) is guaranteed to always point to the caller>
> [...]
> >
> >  arch/arm64/kernel/perf_callchain.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> >
> > diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
> > index e5ce5f7965d1..b3cd9f371469 100644
> > --- a/arch/arm64/kernel/perf_callchain.c
> > +++ b/arch/arm64/kernel/perf_callchain.c
> > @@ -326,6 +326,20 @@ static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
> >       while ((entry->nr < entry->max_stack) && fp && !(fp & 0x3)) {
> >               err = compat_perf_trace_1(&fp, &pc, leaf_lr);
> >
> > +             /*
> > +              * If this is the first trace and it didn't find the LR then
> > +              * let's throw it in the trace first. This isn't perfect but
> > +              * is the best we can do for handling clang leaf functions (or
> > +              * the case where we're right at the start of the function
> > +              * before the new frame has been pushed). In the worst case
> > +              * this can cause us to throw an extra entry that will be some
> > +              * location in the same function as the PC. That's not
> > +              * amazing but shouldn't really hurt. It seems better than
> > +              * throwing away the LR.
> > +              */
>
> Hi Douglas,
>
> I think the behaviour with GCC is also similar. We were working on this change
> (https://lore.kernel.org/lkml/20210304163255.10363-4-alexandre.truong@arm.com/)
> in userspace Perf which addresses the same issue.
>
> The basic concept of our version is to record only the link register
> (as in --user-regs=lr). Then use the existing dwarf based unwind
> to determine if the link register is valid for that frame, and then if
> it is and it doesn't already exist on the stack then insert it.
>
> You mention that your version isn't perfect, do you think that saving the
> LR and using something like libunwind in a post process could be better?

Using post processing atop a patch to always save the LR is definitely
the right solution IMO and (I think) you can fully overcome the "no
frame pointers in leaf functions" with the post processing.

-Doug
diff mbox series

Patch

diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index e5ce5f7965d1..b3cd9f371469 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -326,6 +326,20 @@  static void compat_perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 	while ((entry->nr < entry->max_stack) && fp && !(fp & 0x3)) {
 		err = compat_perf_trace_1(&fp, &pc, leaf_lr);
 
+		/*
+		 * If this is the first trace and it didn't find the LR then
+		 * let's throw it in the trace first. This isn't perfect but
+		 * is the best we can do for handling clang leaf functions (or
+		 * the case where we're right at the start of the function
+		 * before the new frame has been pushed). In the worst case
+		 * this can cause us to throw an extra entry that will be some
+		 * location in the same function as the PC. That's not
+		 * amazing but shouldn't really hurt. It seems better than
+		 * throwing away the LR.
+		 */
+		if (leaf_lr && leaf_lr != pc)
+			perf_callchain_store(entry, leaf_lr & ~BIT(0));
+
 		/* Bail out on any type of error */
 		if (err)
 			break;