[RFC,0/4] perf record: Implement off-cpu profiling with BPF (v1)

Message ID: 20220422053401.208207-1-namhyung@kernel.org

Message

Namhyung Kim April 22, 2022, 5:33 a.m. UTC
Hello,

This is the first version of off-cpu profiling support.  Together with
(PMU-based) cpu profiling, it can show a holistic view of the performance
characteristics of your application or system.

With BPF, it can aggregate scheduling stats for the tasks and/or states
of interest and convert the data into perf sample records.
I chose the bpf-output event, which is a software event supposed to be
consumed by BPF programs, and renamed it to "offcpu-time".  So it
requires no change on the perf report side except for setting the sample
types of the bpf-output event.
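
As an illustration of the aggregation idea, here is a minimal userspace
simulation in Python.  It is a sketch, not the BPF program from this
series: the event tuple layout and the name aggregate_off_cpu() are made
up for the example, and the real off_cpu.bpf.c may work differently.

```python
from collections import defaultdict

def aggregate_off_cpu(switches):
    """Sum off-cpu time per (tid, stack).

    switches: iterable of (timestamp_ns, prev_tid, prev_stack, next_tid),
    a hypothetical, simplified view of a sched_switch event.
    """
    switched_out = {}            # tid -> (timestamp_ns, stack) at switch-out
    totals = defaultdict(int)    # (tid, stack) -> accumulated off-cpu ns
    for ts, prev_tid, prev_stack, next_tid in switches:
        # The task leaving the CPU starts an off-cpu interval.
        switched_out[prev_tid] = (ts, prev_stack)
        # The task entering the CPU closes its pending interval, if any.
        pending = switched_out.pop(next_tid, None)
        if pending is not None:
            start, stack = pending
            totals[(next_tid, stack)] += ts - start
    return dict(totals)

events = [
    (100, 1, ("main", "__read"), 2),
    (250, 2, ("main", "__write"), 1),
    (400, 1, ("main", "__read"), 2),
    (900, 2, ("main", "__poll"), 1),
]
totals = aggregate_off_cpu(events)

# Each aggregated entry would become one sample whose period is the total time.
for (tid, stack), ns in sorted(totals.items()):
    print(f"tid={tid} period={ns} stack={';'.join(stack)}")
```

Only one entry per (task, stack) pair is kept, so the amount of data
handed to userspace depends on the number of distinct stacks rather than
on the number of context switches.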

Basically it collects userspace callstacks for tasks, as that's what users
mostly want.  Maybe we can add support for kernel stacks, but I'm
afraid that it'd cause more overhead.  So the offcpu-time event will
always have callchains regardless of the command line option, and it
enables the children mode in perf report by default.
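
For readers unfamiliar with the children mode, the following Python
sketch shows how the "Children" and "Self" columns relate to callchains.
It is a simplified model with made-up sample data and a hypothetical
helper name, not perf report's actual code.

```python
from collections import defaultdict

def self_and_children(samples):
    """Return {function: (children_pct, self_pct)}.

    samples: list of (callchain, period), where the callchain runs from
    the outermost caller to the leaf function.
    """
    self_t = defaultdict(int)
    children_t = defaultdict(int)
    total = 0
    for chain, period in samples:
        total += period
        self_t[chain[-1]] += period    # only the leaf gets "self" time
        for fn in set(chain):          # every function on the chain gets
            children_t[fn] += period   # "children" time, once per sample
    return {fn: (100.0 * children_t[fn] / total, 100.0 * self_t[fn] / total)
            for fn in children_t}

samples = [
    (("main", "bench", "__read"), 60),
    (("main", "bench", "__write"), 30),
    (("main", "__poll"), 10),
]
report = self_and_children(samples)
for fn, (children, self_pct) in sorted(report.items(), key=lambda kv: -kv[1][0]):
    print(f"{children:7.2f}% {self_pct:7.2f}%  {fn}")
```

Callers such as main show a high Children value with zero Self, which is
the pattern to expect for offcpu-time samples that always carry a
callchain.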

It adds --off-cpu option to perf record like below:

  $ sudo perf record -a --off-cpu -- perf bench sched messaging -l 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

     Total time: 1.518 [sec]
  [ perf record: Woken up 9 times to write data ]
  [ perf record: Captured and wrote 5.313 MB perf.data (53341 samples) ]

Then we can run perf report as usual.  The options below just skip the
less important parts.

  $ sudo perf report --stdio --call-graph=no --percent-limit=2
  # To display the perf.data header info, please use --header/--header-only options.
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 52K of event 'cycles'
  # Event count (approx.): 42522453276
  #
  # Children      Self  Command          Shared Object     Symbol                            
  # ........  ........  ...............  ................  ..................................
  #
       9.58%     9.58%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_rules.constprop.0
       8.46%     8.46%  sched-messaging  [kernel.vmlinux]  [k] audit_filter_syscall
       4.54%     4.54%  sched-messaging  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
       2.94%     2.94%  sched-messaging  [kernel.vmlinux]  [k] unix_stream_read_generic
       2.45%     2.45%  sched-messaging  [kernel.vmlinux]  [k] memcg_slab_free_hook
  
  
  # Samples: 983  of event 'offcpu-time'
  # Event count (approx.): 684538813464
  #
  # Children      Self  Command          Shared Object         Symbol                    
  # ........  ........  ...............  ....................  ..........................
  #
      83.86%     0.00%  sched-messaging  libc-2.33.so          [.] __libc_start_main
      83.86%     0.00%  sched-messaging  perf                  [.] cmd_bench
      83.86%     0.00%  sched-messaging  perf                  [.] main
      83.86%     0.00%  sched-messaging  perf                  [.] run_builtin
      83.64%     0.00%  sched-messaging  perf                  [.] bench_sched_messaging
      41.35%    41.35%  sched-messaging  libpthread-2.33.so    [.] __read
      38.88%    38.88%  sched-messaging  libpthread-2.33.so    [.] __write
       3.41%     3.41%  sched-messaging  libc-2.33.so          [.] __poll

The perf bench sched messaging run created 400 processes to send/receive
messages through unix sockets.  It spent a large portion of cpu cycles
in the audit filter and in reading/copying the messages, while most of
the offcpu-time was in read and write calls.
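
The percentages in the offcpu-time report can be turned back into
absolute amounts with a little arithmetic.  The Python sketch below uses
the numbers from the output above; treating the event count as
nanoseconds is an assumption, since the unit is not stated here.

```python
# Values copied from the 'offcpu-time' section of the report above.
TOTAL_EVENT_COUNT = 684_538_813_464
SELF_PCT = {"__read": 41.35, "__write": 38.88, "__poll": 3.41}

NS_PER_SEC = 1_000_000_000  # assumption: the event count is in nanoseconds

def share_to_seconds(pct, total=TOTAL_EVENT_COUNT):
    """Convert a percentage of the total event count into seconds."""
    return total * pct / 100 / NS_PER_SEC

seconds = {sym: share_to_seconds(pct) for sym, pct in SELF_PCT.items()}
read_write_pct = SELF_PCT["__read"] + SELF_PCT["__write"]

for sym, sec in seconds.items():
    print(f"{sym:8s} {sec:8.1f} s")
print(f"read + write share: {read_write_pct:.2f}%")
```

Under that assumption the total is about 685 seconds summed over all
tasks, of which read and write together account for roughly 80%.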

You can get the code from 'perf/offcpu-v1' branch in my tree at

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Enjoy! :)

Thanks,
Namhyung


Namhyung Kim (4):
  perf report: Do not extend sample type of bpf-output event
  perf record: Enable off-cpu analysis with BPF
  perf record: Implement basic filtering for off-cpu
  perf record: Handle argument change in sched_switch

 tools/perf/Makefile.perf               |   1 +
 tools/perf/builtin-record.c            |  21 ++
 tools/perf/util/Build                  |   1 +
 tools/perf/util/bpf_off_cpu.c          | 301 +++++++++++++++++++++++++
 tools/perf/util/bpf_skel/off_cpu.bpf.c | 214 ++++++++++++++++++
 tools/perf/util/evsel.c                |   4 +-
 6 files changed, 540 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/util/bpf_off_cpu.c
 create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c


base-commit: 41204da4c16071be9090940b18f566832d46becc

Comments

Jiri Olsa April 22, 2022, 10:11 a.m. UTC | #1
On Thu, Apr 21, 2022 at 10:33:57PM -0700, Namhyung Kim wrote:

SNIP

> The perf bench sched messaging created 400 processes to send/receive
> messages through unix sockets.  It spent a large portion of cpu cycles
> for audit filter and read/copy the messages while most of the
> offcpu-time was in read and write calls.
> 
> You can get the code from 'perf/offcpu-v1' branch in my tree at
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> 
> Enjoy! :)

  CC      builtin-record.o
builtin-record.c:52:10: fatal error: util/off_cpu.h: No such file or directory
   52 | #include "util/off_cpu.h"

forgot to add util/off_cpu.h ?

jirka

> 
> Thanks,
> Namhyung
> 
> 
> Namhyung Kim (4):
>   perf report: Do not extend sample type of bpf-output event
>   perf record: Enable off-cpu analysis with BPF
>   perf record: Implement basic filtering for off-cpu
>   perf record: Handle argument change in sched_switch
> 
>  tools/perf/Makefile.perf               |   1 +
>  tools/perf/builtin-record.c            |  21 ++
>  tools/perf/util/Build                  |   1 +
>  tools/perf/util/bpf_off_cpu.c          | 301 +++++++++++++++++++++++++
>  tools/perf/util/bpf_skel/off_cpu.bpf.c | 214 ++++++++++++++++++
>  tools/perf/util/evsel.c                |   4 +-
>  6 files changed, 540 insertions(+), 2 deletions(-)
>  create mode 100644 tools/perf/util/bpf_off_cpu.c
>  create mode 100644 tools/perf/util/bpf_skel/off_cpu.bpf.c
> 
> 
> base-commit: 41204da4c16071be9090940b18f566832d46becc
> -- 
> 2.36.0.rc2.479.g8af0fa9b8e-goog
>
Milian Wolff April 22, 2022, 10:20 a.m. UTC | #2
On Freitag, 22. April 2022 07:33:57 CEST Namhyung Kim wrote:
> Hello,
> 
> This is the first version of off-cpu profiling support.  Together with
> (PMU-based) cpu profiling, it can show holistic view of the performance
> characteristics of your application or system.

Hey Namhyung,

this is awesome news! In hotspot, I've long done off-cpu profiling manually by 
looking at the time between --switch-events. The downside is that we also need 
to track the sched:sched_switch event to get a call stack. But this approach 
also works with dwarf based unwinding, and also includes kernel stacks.

> With BPF, it can aggregate scheduling stats for interested tasks
> and/or states and convert the data into a form of perf sample records.
> I chose the bpf-output event which is a software event supposed to be
> consumed by BPF programs and renamed it as "offcpu-time".  So it
> requires no change on the perf report side except for setting sample
> types of bpf-output event.
> 
> Basically it collects userspace callstack for tasks as it's what users
> want mostly.  Maybe we can add support for the kernel stacks but I'm
> afraid that it'd cause more overhead.  So the offcpu-time event will
> always have callchains regardless of the command line option, and it
> enables the children mode in perf report by default.

Has anything changed wrt perf/bpf and user applications not compiled with
`-fno-omit-frame-pointer`? I.e. does this new utility only work for specially
compiled applications, or do we also get backtraces for "normal" binaries that 
we can install through package managers?

Thanks
Namhyung Kim April 22, 2022, 2:53 p.m. UTC | #3
Hi Jiri,

On Fri, Apr 22, 2022 at 3:11 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Apr 21, 2022 at 10:33:57PM -0700, Namhyung Kim wrote:
>
> SNIP
>
> > The perf bench sched messaging created 400 processes to send/receive
> > messages through unix sockets.  It spent a large portion of cpu cycles
> > for audit filter and read/copy the messages while most of the
> > offcpu-time was in read and write calls.
> >
> > You can get the code from 'perf/offcpu-v1' branch in my tree at
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> >
> > Enjoy! :)
>
>   CC      builtin-record.o
> builtin-record.c:52:10: fatal error: util/off_cpu.h: No such file or directory
>    52 | #include "util/off_cpu.h"
>
> forgot to add util/off_cpu.h ?

Oops, you're right.  Will resend soon.

Thanks,
Namhyung
Namhyung Kim April 22, 2022, 3:01 p.m. UTC | #4
Hi Milian,

On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> On Freitag, 22. April 2022 07:33:57 CEST Namhyung Kim wrote:
> > Hello,
> >
> > This is the first version of off-cpu profiling support.  Together with
> > (PMU-based) cpu profiling, it can show holistic view of the performance
> > characteristics of your application or system.
>
> Hey Namhyung,
>
> this is awesome news! In hotspot, I've long done off-cpu profiling manually by
> looking at the time between --switch-events. The downside is that we also need
> to track the sched:sched_switch event to get a call stack. But this approach
> also works with dwarf based unwinding, and also includes kernel stacks.

Thanks, I've also briefly thought about the switch event based off-cpu
profiling as it doesn't require root.  But collecting call stacks is hard and
I'd like to do it in kernel/bpf to reduce the overhead.

>
> > With BPF, it can aggregate scheduling stats for interested tasks
> > and/or states and convert the data into a form of perf sample records.
> > I chose the bpf-output event which is a software event supposed to be
> > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > requires no change on the perf report side except for setting sample
> > types of bpf-output event.
> >
> > Basically it collects userspace callstack for tasks as it's what users
> > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > afraid that it'd cause more overhead.  So the offcpu-time event will
> > always have callchains regardless of the command line option, and it
> > enables the children mode in perf report by default.
>
> Has anything changed wrt perf/bpf and user applications not compiled with
> `-fno-omit-frame-pointer`? I.e. does this new utility only work for specially
> compiled applications, or do we also get backtraces for "normal" binaries that
> we can install through package managers?

I am not aware of such changes; it still needs a frame pointer to get
backtraces.
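
To make the dependency concrete, here is a small Python simulation of
frame-pointer unwinding.  It assumes the usual x86-64 layout in which
[fp] holds the caller's saved frame pointer and [fp + 8] holds the
return address; the addresses and the function name are invented for the
example, and this is not the kernel's unwinder.

```python
def unwind_with_frame_pointers(memory, fp, pc, max_depth=64):
    """Walk a chain of frame records and return the list of code addresses.

    memory: dict mapping a stack address to the 8-byte word stored there.
    fp, pc: frame pointer and program counter at the time of the sample.
    """
    chain = [pc]
    while fp != 0 and len(chain) < max_depth:
        if fp not in memory or fp + 8 not in memory:
            break                      # fp does not point at a frame record
        chain.append(memory[fp + 8])   # return address into the caller
        next_fp = memory[fp]           # caller's saved frame pointer
        if next_fp != 0 and next_fp <= fp:
            break                      # stack grows down; callers sit higher
        fp = next_fp
    return chain

# Stack of a task blocked in a leaf at 0x3000, called from 0x2000 <- 0x1000 <- 0x500.
stack = {
    0x7000: 0x7020, 0x7008: 0x2000,
    0x7020: 0x7040, 0x7028: 0x1000,
    0x7040: 0x0,    0x7048: 0x500,
}

with_fp = unwind_with_frame_pointers(stack, fp=0x7000, pc=0x3000)
# Built with -fomit-frame-pointer: the register holds unrelated data.
without_fp = unwind_with_frame_pointers(stack, fp=0x1234, pc=0x3000)

print([hex(a) for a in with_fp])
print([hex(a) for a in without_fp])
```

When the register is used as a general-purpose register, the walk stops
immediately and only the sampled address is left, which is why binaries
built without frame pointers produce truncated callchains.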

Thanks,
Namhyung
Arnaldo Carvalho de Melo April 22, 2022, 7:04 p.m. UTC | #5
Em Fri, Apr 22, 2022 at 08:01:15AM -0700, Namhyung Kim escreveu:
> Hi Milian,
 
> On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > On Freitag, 22. April 2022 07:33:57 CEST Namhyung Kim wrote:
> > > This is the first version of off-cpu profiling support.  Together with
> > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > characteristics of your application or system.

> > Hey Namhyung,

> > this is awesome news! In hotspot, I've long done off-cpu profiling manually by
> > looking at the time between --switch-events. The downside is that we also need
> > to track the sched:sched_switch event to get a call stack. But this approach
> > also works with dwarf based unwinding, and also includes kernel stacks.
> 
> Thanks, I've also briefly thought about the switch event based off-cpu
> profiling as it doesn't require root.  But collecting call stacks is hard and
> I'd like to do it in kernel/bpf to reduce the overhead.

It would be great to have both in perf. Since the one in hotspot already
works, perfecting the other method (Namhyung's, which uses BPF to reduce
the amount of data to postprocess in userspace) looks great.
 
> > > With BPF, it can aggregate scheduling stats for interested tasks
> > > and/or states and convert the data into a form of perf sample records.
> > > I chose the bpf-output event which is a software event supposed to be
> > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > requires no change on the perf report side except for setting sample
> > > types of bpf-output event.
> > >
> > > Basically it collects userspace callstack for tasks as it's what users
> > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > always have callchains regardless of the command line option, and it
> > > enables the children mode in perf report by default.
> >
> > Has anything changed wrt perf/bpf and user applications not compiled with
> > `-fno-omit-frame-pointer`? I.e. does this new utility only work for specially
> > compiled applications, or do we also get backtraces for "normal" binaries that
> > we can install through package managers?
> 
> I am not aware of such changes, it still needs a frame pointer to get
> backtraces.

I see this as an initial limitation, one that we can lift later?

- Arnaldo
Milian Wolff April 25, 2022, 12:42 p.m. UTC | #6
On Freitag, 22. April 2022 17:01:15 CEST Namhyung Kim wrote:
> Hi Milian,
> 
> On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > On Freitag, 22. April 2022 07:33:57 CEST Namhyung Kim wrote:
> > > Hello,
> > > 
> > > This is the first version of off-cpu profiling support.  Together with
> > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > characteristics of your application or system.
> > 
> > Hey Namhyung,
> > 
> > this is awesome news! In hotspot, I've long done off-cpu profiling
> > manually by looking at the time between --switch-events. The downside is
> > that we also need to track the sched:sched_switch event to get a call
> > stack. But this approach also works with dwarf based unwinding, and also
> > includes kernel stacks.
>
> Thanks, I've also briefly thought about the switch event based off-cpu
> profiling as it doesn't require root.  But collecting call stacks is hard
> and I'd like to do it in kernel/bpf to reduce the overhead.

I'm all for reducing the overhead, I just wonder about the practicality. At
the very least, please make sure to note this limitation explicitly to end
users. As a preacher for perf, I have come across lots of people stumbling
over `perf record -g` not producing any sensible output because they are
simply not aware that this requires frame pointers, which are basically
nonexistent on most "normal" distributions. Nowadays `man perf record` tries
to educate people; please do the same for the new `--off-cpu` switch.

> > > With BPF, it can aggregate scheduling stats for interested tasks
> > > and/or states and convert the data into a form of perf sample records.
> > > I chose the bpf-output event which is a software event supposed to be
> > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > requires no change on the perf report side except for setting sample
> > > types of bpf-output event.
> > > 
> > > Basically it collects userspace callstack for tasks as it's what users
> > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > always have callchains regardless of the command line option, and it
> > > enables the children mode in perf report by default.
> > 
> > Has anything changed wrt perf/bpf and user applications not compiled with
> > > `-fno-omit-frame-pointer`? I.e. does this new utility only work for
> > specially compiled applications, or do we also get backtraces for
> > "normal" binaries that we can install through package managers?
> 
> I am not aware of such changes, it still needs a frame pointer to get
> backtraces.

May I ask what kind of setup you are using this on? Do you use something like
Gentoo or Yocto where you compile your whole system with
`-fno-omit-frame-pointer`? Because otherwise, any kind of off-cpu time in
system libraries will not be resolved properly, no?

Thanks
Ian Rogers April 25, 2022, 4:49 p.m. UTC | #7
On Mon, Apr 25, 2022 at 5:42 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> On Freitag, 22. April 2022 17:01:15 CEST Namhyung Kim wrote:
> > Hi Milian,
> >
> > On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > > On Freitag, 22. April 2022 07:33:57 CEST Namhyung Kim wrote:
> > > > Hello,
> > > >
> > > > This is the first version of off-cpu profiling support.  Together with
> > > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > > characteristics of your application or system.
> > >
> > > Hey Namhyung,
> > >
> > > this is awesome news! In hotspot, I've long done off-cpu profiling
> > > manually by looking at the time between --switch-events. The downside is
> > > that we also need to track the sched:sched_switch event to get a call
> > > stack. But this approach also works with dwarf based unwinding, and also
> > > includes kernel stacks.
> >
> > Thanks, I've also briefly thought about the switch event based off-cpu
> > profiling as it doesn't require root.  But collecting call stacks is hard
> > and I'd like to do it in kernel/bpf to reduce the overhead.
>
> I'm all for reducing the overhead, I just wonder about the practicality. At
> the very least, please make sure to note this limitation explicitly to end
> users. As a preacher for perf, I have come across lots of people stumbling
> over `perf record -g` not producing any sensible output because they are
> simply not aware that this requires frame pointers which are basically non
> existing on most "normal" distributions. Nowadays `man perf record` tries to
> educate people, please do the same for the new `--off-cpu` switch.

I think documenting that off-cpu has a dependency on frame pointers
makes sense. There has been work to make LBR work:
https://lore.kernel.org/bpf/20210818012937.2522409-1-songliubraving@fb.com/
DWARF unwinding is problematic and is probably something best kept in
user land. There is also Intel's CET, which may provide alternate
backtraces.

More recent Intel and AMD cpus have techniques to turn memory
locations into registers, an approach generally called memory
renaming. There is some description here:
https://www.agner.org/forum/viewtopic.php?t=41
In LLVM there is a pass to promote memory locations into registers
called mem2reg. Having the frame pointer as an extra register will
help this pass as there will be 1 more register to replace something
from memory. The memory renaming optimization is similar to mem2reg
except done in the CPU's front-end. It would be interesting to see
benchmark results on modern CPUs with and without omit-frame-pointer.
My expectation is that the performance wins aren't as great, if any,
as they used to be (cc-ed Michael Larabel as I love Phoronix and it'd
be awesome if someone could do an omit-frame-pointer shoot-out).

> > > > With BPF, it can aggregate scheduling stats for interested tasks
> > > > and/or states and convert the data into a form of perf sample records.
> > > > I chose the bpf-output event which is a software event supposed to be
> > > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > > requires no change on the perf report side except for setting sample
> > > > types of bpf-output event.
> > > >
> > > > Basically it collects userspace callstack for tasks as it's what users
> > > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > > always have callchains regardless of the command line option, and it
> > > > enables the children mode in perf report by default.
> > >
> > > Has anything changed wrt perf/bpf and user applications not compiled with
> > > > `-fno-omit-frame-pointer`? I.e. does this new utility only work for
> > > specially compiled applications, or do we also get backtraces for
> > > "normal" binaries that we can install through package managers?
> >
> > I am not aware of such changes, it still needs a frame pointer to get
> > backtraces.
>
> May I ask what kind of setup you are using this on? Do you use something like
> Gentoo or yocto where you compile your whole system with
> `-fno-omit-frame-pointer`? Because otherwise, any kind of off-cpu time in
> system libraries will not be resolved properly, no?

I agree with your point. Often in cloud environments binaries are
static blobs linking in all their dependencies. This can aid
deployment, bug compatibility, etc. Fwiw, all backtraces gathered in
Google's profiling are frame pointer based. A large motivation for
this is the security aspect of having a privileged application able to
snapshot other threads' stacks, which happens with dwarf based unwinding.

In summary, your point is that frame pointer based unwinding is
largely broken on all major distributions today, limiting the utility
of off-CPU as it is here. I agree; memory renaming in hardware could
hopefully mean that this isn't the case in distributions in the
future. Even if it isn't, alternate backtrace sources like LBR and CET
mean we can fix this in other ways.

Thanks,
Ian

> Thanks
> --
> Milian Wolff | milian.wolff@kdab.com | Senior Software Engineer
> KDAB (Deutschland) GmbH, a KDAB Group company
> Tel: +49-30-521325470
> KDAB - The Qt, C++ and OpenGL Experts
Namhyung Kim April 25, 2022, 6:58 p.m. UTC | #8
On Mon, Apr 25, 2022 at 5:42 AM Milian Wolff <milian.wolff@kdab.com> wrote:
>
> On Freitag, 22. April 2022 17:01:15 CEST Namhyung Kim wrote:
> > Hi Milian,
> >
> > On Fri, Apr 22, 2022 at 3:21 AM Milian Wolff <milian.wolff@kdab.com> wrote:
> > > On Freitag, 22. April 2022 07:33:57 CEST Namhyung Kim wrote:
> > > > Hello,
> > > >
> > > > This is the first version of off-cpu profiling support.  Together with
> > > > (PMU-based) cpu profiling, it can show holistic view of the performance
> > > > characteristics of your application or system.
> > >
> > > Hey Namhyung,
> > >
> > > this is awesome news! In hotspot, I've long done off-cpu profiling
> > > manually by looking at the time between --switch-events. The downside is
> > > that we also need to track the sched:sched_switch event to get a call
> > > stack. But this approach also works with dwarf based unwinding, and also
> > > includes kernel stacks.
> >
> > Thanks, I've also briefly thought about the switch event based off-cpu
> > profiling as it doesn't require root.  But collecting call stacks is hard
> > and I'd like to do it in kernel/bpf to reduce the overhead.
>
> I'm all for reducing the overhead, I just wonder about the practicality. At
> the very least, please make sure to note this limitation explicitly to end
> users. As a preacher for perf, I have come across lots of people stumbling
> over `perf record -g` not producing any sensible output because they are
> simply not aware that this requires frame pointers which are basically non
> existing on most "normal" distributions. Nowadays `man perf record` tries to
> educate people, please do the same for the new `--off-cpu` switch.

Good point, will add it.

>
> > > > With BPF, it can aggregate scheduling stats for interested tasks
> > > > and/or states and convert the data into a form of perf sample records.
> > > > I chose the bpf-output event which is a software event supposed to be
> > > > consumed by BPF programs and renamed it as "offcpu-time".  So it
> > > > requires no change on the perf report side except for setting sample
> > > > types of bpf-output event.
> > > >
> > > > Basically it collects userspace callstack for tasks as it's what users
> > > > want mostly.  Maybe we can add support for the kernel stacks but I'm
> > > > afraid that it'd cause more overhead.  So the offcpu-time event will
> > > > always have callchains regardless of the command line option, and it
> > > > enables the children mode in perf report by default.
> > >
> > > Has anything changed wrt perf/bpf and user applications not compiled with
> > > > `-fno-omit-frame-pointer`? I.e. does this new utility only work for
> > > specially compiled applications, or do we also get backtraces for
> > > "normal" binaries that we can install through package managers?
> >
> > I am not aware of such changes, it still needs a frame pointer to get
> > backtraces.
>
> May I ask what kind of setup you are using this on? Do you use something like
> Gentoo or yocto where you compile your whole system with
> `-fno-omit-frame-pointer`? Because otherwise, any kind of off-cpu time in
> system libraries will not be resolved properly, no?

In my work environment, everything is built with the frame pointer.
It's unfortunate most distros build without it, but as Ian said, I hope
we can lift the limitation with recent technologies soon.

Thanks,
Namhyung