
[v2,0/4] Dump off-cpu samples directly

Message ID: 20240424024805.144759-1-howardchu95@gmail.com

Message

Howard Chu April 24, 2024, 2:48 a.m. UTC
As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323

Currently, off-cpu samples are dumped when perf record is exiting. This
results in off-cpu samples appearing after the regular samples. Also, the
samples are stored in large BPF maps which hold all the stack traces and
the accumulated off-cpu time, and those maps will eventually fill up when
running for an extended period. This patch series fixes those problems by
dumping samples directly into the perf ring buffer and dispatching them to
the correct sample format.

Before, off-cpu samples come after the regular samples:

```
         swapper       0 [000] 963432.136150:    2812933    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
         swapper       0 [000] 963432.637911:    4932876    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
         swapper       0 [001] 963432.798072:    6273398    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
         swapper       0 [000] 963433.541152:    5279005    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
sh 1410180 [000] 18446744069.414584:    2528851 offcpu-time: 
	    7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)


sh 1410185 [000] 18446744069.414584:    2314223 offcpu-time: 
	    7837148e6e87 wait4+0x17 (/usr/lib/libc.so.6)


awk 1409644 [000] 18446744069.414584:     191785 offcpu-time: 
	    702609d03681 read+0x11 (/usr/lib/libc.so.6)
	          4a02a4 [unknown] ([unknown])
```


After, regular samples (cycles:P) and off-cpu samples (offcpu-time) are
collected simultaneously:

```
upowerd     741 [000] 963757.428701:     297848 offcpu-time: 
	    72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)


      irq/9-acpi      56 [000] 963757.429116:    8760875    cycles:P:  ffffffffb779849f acpi_os_read_port+0x2f ([kernel.kallsyms])
upowerd     741 [000] 963757.429172:     459522 offcpu-time: 
	    72b2da11e6bc read+0x4c (/usr/lib/libc.so.6)


         swapper       0 [002] 963757.434529:    5759904    cycles:P:  ffffffffb7db1bc2 intel_idle+0x62 ([kernel.kallsyms])
perf 1419260 [000] 963757.434550: 1001012116 offcpu-time: 
	    7274e5d190bf __poll+0x4f (/usr/lib/libc.so.6)
	    591acfc5daf0 perf_evlist__poll+0x24 (/root/hw/perf-tools-next/tools/perf/perf)
	    591acfb1ca50 perf_evlist__poll_thread+0x160 (/root/hw/perf-tools-next/tools/perf/perf)
	    7274e5ca955a [unknown] (/usr/lib/libc.so.6)
```

Here's a simple flowchart:

[parse_event (sample type: PERF_SAMPLE_RAW)]
  --> [config (bind fds, sample_id, sample_type)]
  --> [off_cpu_strip (sample type: PERF_SAMPLE_RAW)]
  --> [record_done (hooks off_cpu_finish)]
  --> [prepare_parse (sample type: OFFCPU_SAMPLE_TYPES)]
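
To make the flow above concrete, here is a minimal sketch of the idea behind
patch 2: on sched_switch, remember when the previous task went off-cpu and,
when a task comes back on-cpu, emit its off-cpu time straight into the perf
ring buffer. The map names, the offcpu_data layout and the omission of
callchain/sample_id handling are simplifications for illustration, not the
actual off_cpu.bpf.c:

```c
/* Illustrative only; simplified from the approach in this series. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} offcpu_output SEC(".maps");		/* read by perf record */

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, __u32);		/* tid */
	__type(value, __u64);		/* timestamp at switch-out */
} tstamp SEC(".maps");

struct offcpu_data {			/* hypothetical raw payload layout */
	__u32 pid;
	__u32 tid;
	__u64 offcpu_ns;
};

SEC("tp_btf/sched_switch")
int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	__u64 now = bpf_ktime_get_ns(), *start;
	__u32 prev_tid = prev->pid, next_tid = next->pid;

	/* prev goes off-cpu: remember when it left */
	bpf_map_update_elem(&tstamp, &prev_tid, &now, BPF_ANY);

	/* next comes back on-cpu: emit its off-cpu time as a sample */
	start = bpf_map_lookup_elem(&tstamp, &next_tid);
	if (start) {
		struct offcpu_data data = {
			.pid = next->tgid,
			.tid = next_tid,
			.offcpu_ns = now - *start,
		};

		bpf_perf_event_output(ctx, &offcpu_output, BPF_F_CURRENT_CPU,
				      &data, sizeof(data));
		bpf_map_delete_elem(&tstamp, &next_tid);
	}
	return 0;
}
```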

Changes in v2:
 - Remove unnecessary comments.
 - Rename function off_cpu_change_type to off_cpu_prepare_parse

Howard Chu (4):
  perf record off-cpu: Parse off-cpu event, change config location
  perf record off-cpu: BPF perf_event_output on sched_switch
  perf record off-cpu: extract off-cpu sample data from raw_data
  perf record off-cpu: delete bound-to-fail test

 tools/perf/builtin-record.c             |  98 +++++++++-
 tools/perf/tests/shell/record_offcpu.sh |  29 ---
 tools/perf/util/bpf_off_cpu.c           | 242 +++++++++++-------------
 tools/perf/util/bpf_skel/off_cpu.bpf.c  | 163 +++++++++++++---
 tools/perf/util/evsel.c                 |   8 -
 tools/perf/util/off_cpu.h               |  14 +-
 tools/perf/util/perf-hooks-list.h       |   1 +
 7 files changed, 344 insertions(+), 211 deletions(-)

Comments

Namhyung Kim April 24, 2024, 7:12 p.m. UTC | #1
Hello,

On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
>
> As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
>
> Currently, off-cpu samples are dumped when perf record is exiting. This
> results in off-cpu samples being after the regular samples. Also, samples
> are stored in large BPF maps which contain all the stack traces and
> accumulated off-cpu time, but they are eventually going to fill up after
> running for an extensive period. This patch fixes those problems by dumping
> samples directly into perf ring buffer, and dispatching those samples to the
> correct format.

Thanks for working on this.

But the problem of dumping all sched-switch events is that it can be
too frequent on loaded machines.  Copying many events to the buffer
can result in losing other records.  As perf report doesn't care about
timing much, I decided to aggregate the result in a BPF map and dump
them at the end of the profiling session.
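
For reference, the aggregation approach described above looks roughly like
the following sketch; the key layout and map names are simplified and
hypothetical, not the exact off_cpu.bpf.c code:

```c
/* Accumulate off-cpu time per (tgid, stack) key; perf record reads this
 * map once, at the end of the session, and synthesizes the samples then. */
struct offcpu_key {
	__u32 tgid;
	__u32 stack_id;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct offcpu_key);
	__type(value, __u64);		/* accumulated off-cpu time in ns */
} off_cpu SEC(".maps");

static void account_offcpu(struct offcpu_key *key, __u64 delta_ns)
{
	__u64 *total = bpf_map_lookup_elem(&off_cpu, key);

	if (total)
		__sync_fetch_and_add(total, delta_ns);
	else
		bpf_map_update_elem(&off_cpu, key, &delta_ns, BPF_NOEXIST);
}
```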

Maybe that's not a concern for you (or smaller systems).  Then I think
we can keep the original behavior and add a new option (I'm not good
at naming things, but maybe --off-cpu-sample?) to work differently
instead of removing the old behavior.

Thanks,
Namhyung

Arnaldo Carvalho de Melo April 24, 2024, 9:11 p.m. UTC | #2
On Wed, Apr 24, 2024 at 12:12:26PM -0700, Namhyung Kim wrote:
> Hello,
> 
> On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> >
> > Currently, off-cpu samples are dumped when perf record is exiting. This
> > results in off-cpu samples appearing after the regular samples. Also, the
> > samples are stored in large BPF maps which hold all the stack traces and
> > the accumulated off-cpu time, and those maps will eventually fill up when
> > running for an extended period. This patch series fixes those problems by
> > dumping samples directly into the perf ring buffer and dispatching them to
> > the correct sample format.
> 
> Thanks for working on this.
> 
> But the problem of dumping all sched-switch events is that it can be
> too frequent on loaded machines.  Copying many events to the buffer
> can result in losing other records.  As perf report doesn't care about
> timing much, I decided to aggregate the result in a BPF map and dump
> them at the end of the profiling session.

Should we try to adapt when there are too many context switches, i.e.
the BPF program can notice that the interval from the last context
switch is too small and then avoid adding samples, while if the interval
is a long one then indeed this is a problem where the workload is
waiting for a long time for something and we want to know what is that,
and in that case capturing callchains is both desirable and not costly,
no?
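
As a hypothetical sketch of that idea (nothing like this is in the series
yet, and the names are made up): track the time of the last context switch
per CPU and skip the ring-buffer output when switches come too close
together, falling back to the map-based accumulation instead.

```c
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);		/* time of the last sched_switch */
} last_switch SEC(".maps");

static bool switching_too_fast(__u64 now, __u64 min_interval_ns)
{
	__u32 zero = 0;
	__u64 *last = bpf_map_lookup_elem(&last_switch, &zero);
	bool too_fast = false;

	if (last) {
		/* only trust the interval once we have seen one switch */
		too_fast = *last && (now - *last) < min_interval_ns;
		*last = now;
	}
	return too_fast;
}
```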

The tool could then at the end produce one of two outputs: the most
common reasons for being off cpu, or some sort of counter stating that
there are way too many context switches?

And perhaps we should think about what is best to have as a default, not
to present just plain old cycles, but point out that the workload is
most of the time waiting for IO, etc, i.e. the default should give
interesting clues instead of expecting that the tool user knows all the
possible knobs and try them in all sorts of combinations to then reach
some conclusion.

The default should use stuff that isn't that costly, thus not getting in
the way of what is being observed, but at the same time look for common
patterns, etc.

- Arnaldo
 
Ian Rogers April 24, 2024, 10:19 p.m. UTC | #3
On Wed, Apr 24, 2024 at 2:11 PM Arnaldo Carvalho de Melo
<acme@kernel.org> wrote:
>
> On Wed, Apr 24, 2024 at 12:12:26PM -0700, Namhyung Kim wrote:
> > Hello,
> >
> > On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
> > >
> > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > >
> > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > results in off-cpu samples appearing after the regular samples. Also, the
> > > samples are stored in large BPF maps which hold all the stack traces and
> > > the accumulated off-cpu time, and those maps will eventually fill up when
> > > running for an extended period. This patch series fixes those problems by
> > > dumping samples directly into the perf ring buffer and dispatching them to
> > > the correct sample format.
> >
> > Thanks for working on this.
> >
> > But the problem of dumping all sched-switch events is that it can be
> > too frequent on loaded machines.  Copying many events to the buffer
> > can result in losing other records.  As perf report doesn't care about
> > timing much, I decided to aggregate the result in a BPF map and dump
> > them at the end of the profiling session.
>
> Should we try to adapt when there are too many context switches, i.e.
> the BPF program can notice that the interval from the last context
> switch is too small and then avoid adding samples, while if the interval
> is a long one then indeed this is a problem where the workload is
> waiting for a long time for something and we want to know what is that,
> and in that case capturing callchains is both desirable and not costly,
> no?
>
> The tool could then at the end produce one of two outputs: the most
> common reasons for being off cpu, or some sort of counter stating that
> there are way too many context switches?
>
> And perhaps we should think about what is best to have as a default, not
> to present just plain old cycles, but point out that the workload is
> most of the time waiting for IO, etc, i.e. the default should give
> interesting clues instead of expecting that the tool user knows all the
> possible knobs and try them in all sorts of combinations to then reach
> some conclusion.
>
> The default should use stuff that isn't that costly, thus not getting in
> the way of what is being observed, but at the same time look for common
> patterns, etc.
>
> - Arnaldo

I really appreciate Howard doing this work!

I wonder whether there are other cases where we want to synthesize events in
BPF, for example, we may have fast and slow memory on a system, we
could turn memory events on a system into either fast or slow ones in
BPF based on the memory accessed, so that fast/slow memory systems can
be simulated without access to hardware. This also feels like a perf
script type problem. Perhaps we can add something to the bpf-output
event so it can have multiple uses and not just off-cpu.

To turn the bpf-output samples into off-cpu events there is a pass
added to the saving. I wonder if that can be more generic, like a save
time perf inject.

I worry that by dropping short samples we can create a situation where
off-cpu time + on-cpu time != wall clock time. Perhaps such short
things can get pushed into Namhyung's "at the end" approach while
longer things get samples. Perhaps we only do that when the frequency
is too great.

It would be nice to start landing this work so I'm wondering what the
minimal way to do that is. It seems putting behavior behind a flag is
a first step.

Thanks,
Ian

Namhyung Kim April 24, 2024, 10:57 p.m. UTC | #4
On Wed, Apr 24, 2024 at 3:19 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, Apr 24, 2024 at 2:11 PM Arnaldo Carvalho de Melo
> <acme@kernel.org> wrote:
> >
> > On Wed, Apr 24, 2024 at 12:12:26PM -0700, Namhyung Kim wrote:
> > > Hello,
> > >
> > > On Tue, Apr 23, 2024 at 7:46 PM Howard Chu <howardchu95@gmail.com> wrote:
> > > >
> > > > As mentioned in: https://bugzilla.kernel.org/show_bug.cgi?id=207323
> > > >
> > > > Currently, off-cpu samples are dumped when perf record is exiting. This
> > > > results in off-cpu samples appearing after the regular samples. Also, the
> > > > samples are stored in large BPF maps which hold all the stack traces and
> > > > the accumulated off-cpu time, and those maps will eventually fill up when
> > > > running for an extended period. This patch series fixes those problems by
> > > > dumping samples directly into the perf ring buffer and dispatching them to
> > > > the correct sample format.
> > >
> > > Thanks for working on this.
> > >
> > > But the problem of dumping all sched-switch events is that it can be
> > > too frequent on loaded machines.  Copying many events to the buffer
> > > can result in losing other records.  As perf report doesn't care about
> > > timing much, I decided to aggregate the result in a BPF map and dump
> > > them at the end of the profiling session.
> >
> > Should we try to adapt when there are too many context switches, i.e.
> > the BPF program can notice that the interval from the last context
> > switch is too small and then avoid adding samples, while if the interval
> > is a long one then indeed this is a problem where the workload is
> > waiting for a long time for something and we want to know what is that,
> > and in that case capturing callchains is both desirable and not costly,
> > no?

Sounds interesting.  Yeah we could make it adaptive based on the
off-cpu time at the moment.

> >
> > The tool could then at the end produce one of two outputs: the most
> > common reasons for being off cpu, or some sort of counter stating that
> > there are way too many context switches?
> >
> > And perhaps we should think about what is best to have as a default, not
> > to present just plain old cycles, but point out that the workload is
> > most of the time waiting for IO, etc, i.e. the default should give
> > interesting clues instead of expecting that the tool user knows all the
> > possible knobs and try them in all sorts of combinations to then reach
> > some conclusion.
> >
> > The default should use stuff that isn't that costly, thus not getting in
> > the way of what is being observed, but at the same time look for common
> > patterns, etc.
> >
> > - Arnaldo
>
> I really appreciate Howard doing this work!
>
> > I wonder whether there are other cases where we want to synthesize events in
> BPF, for example, we may have fast and slow memory on a system, we
> could turn memory events on a system into either fast or slow ones in
> BPF based on the memory accessed, so that fast/slow memory systems can
> be simulated without access to hardware. This also feels like a perf
> script type problem. Perhaps we can add something to the bpf-output
> event so it can have multiple uses and not just off-cpu.
>
> To turn the bpf-output samples into off-cpu events there is a pass
> added to the saving. I wonder if that can be more generic, like a save
> time perf inject.
>
> > I worry that by dropping short samples we can create a situation where
> off-cpu time + on-cpu time != wall clock time. Perhaps such short
> things can get pushed into Namhyung's "at the end" approach while
> longer things get samples. Perhaps we only do that when the frequency
> is too great.

Sounds good.  We might add an option to specify the threshold to
determine whether to dump the data or to save it for later.  But ideally
it should be able to find a good default.

>
> It would be nice to start landing this work so I'm wondering what the
> minimal way to do that is. It seems putting behavior behind a flag is
> a first step.

Agreed!

Thanks,
Namhyung
Howard Chu May 16, 2024, 4:24 a.m. UTC | #5
Hello,

Here is a little update on --off-cpu.

> > It would be nice to start landing this work so I'm wondering what the
> > minimal way to do that is. It seems putting behavior behind a flag is
> > a first step.

The flag to set the off-cpu output threshold has been implemented. If the
accumulated off-cpu time exceeds this threshold, the sample is output
directly; otherwise, it is saved for off_cpu_write() at the end.
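
A sketch of that decision; offcpu_output, offcpu_data, offcpu_key and
account_offcpu() are the illustrative definitions from the sketches earlier
in this thread, and offcpu_threshold_ns stands in for whatever value
--off-cpu-threshold passes down to the BPF program:

```c
const volatile __u64 offcpu_threshold_ns;	/* set by perf before load */

static int emit_or_save(void *ctx, struct offcpu_data *data,
			struct offcpu_key *key, __u64 delta_ns)
{
	/* long waits: dump a sample into the perf ring buffer right away */
	if (delta_ns >= offcpu_threshold_ns)
		return bpf_perf_event_output(ctx, &offcpu_output,
					     BPF_F_CURRENT_CPU,
					     data, sizeof(*data));

	/* short waits: accumulate in the map for off_cpu_write() at exit */
	account_offcpu(key, delta_ns);
	return 0;
}
```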

But adding an extra pass to handle off-cpu samples introduces performance
issues. Here is the processing rate of --off-cpu sampling (with the extra
pass to extract the raw sample data) and without it. The
--off-cpu-threshold value is in nanoseconds.

+-----------------------------------------------------+---------------------------------------+----------------------+
| perf record options                                 | type                                  | process rate         |
+-----------------------------------------------------+---------------------------------------+----------------------+
| -F 4999 -a                                          | regular samples (w/o extra pass)      | 13128.675 samples/ms |
| -F 1 -a --off-cpu --off-cpu-threshold 100           | offcpu samples (extra pass)           |  2843.247 samples/ms |
| -F 4999 -a --off-cpu --off-cpu-threshold 100        | offcpu & regular samples (extra pass) |  3910.686 samples/ms |
| -F 4999 -a --off-cpu --off-cpu-threshold 1000000000 | few offcpu & regular (extra pass)     |  4661.229 samples/ms |
+-----------------------------------------------------+---------------------------------------+----------------------+

It's not ideal. I will find a way to reduce the overhead, for example by
processing the samples at save time as Ian mentioned.

> > To turn the bpf-output samples into off-cpu events there is a pass
> > added to the saving. I wonder if that can be more generic, like a save
> > time perf inject.

And I will find a default value for such a threshold based on performance
and common use cases.

> Sounds good.  We might add an option to specify the threshold to
> determine whether to dump the data or to save it for later.  But ideally
> it should be able to find a good default.

These will be done before the GSoC kick-off on May 27.

Thanks,
Howard

Ian Rogers May 16, 2024, 4:56 a.m. UTC | #6
On Wed, May 15, 2024 at 9:24 PM Howard Chu <howardchu95@gmail.com> wrote:
>
> Hello,
>
> Here is a little update on --off-cpu.
>
> > > It would be nice to start landing this work so I'm wondering what the
> > > minimal way to do that is. It seems putting behavior behind a flag is
> > > a first step.
>
> The flag to determine output threshold of off-cpu has been implemented.
> If the accumulated off-cpu time exceeds this threshold, output the sample
> directly; otherwise, save it later for off_cpu_write.
>
> But adding an extra pass to handle off-cpu samples introduces performance
> issues, here's the processing rate of --off-cpu sampling(with the
> extra pass to extract raw
> sample data) and without. The --off-cpu-threshold is in nanoseconds.
>
> +-----------------------------------------------------+---------------------------------------+----------------------+
> | perf record options                                 | type                                  | process rate         |
> +-----------------------------------------------------+---------------------------------------+----------------------+
> | -F 4999 -a                                          | regular samples (w/o extra pass)      | 13128.675 samples/ms |
> | -F 1 -a --off-cpu --off-cpu-threshold 100           | offcpu samples (extra pass)           |  2843.247 samples/ms |
> | -F 4999 -a --off-cpu --off-cpu-threshold 100        | offcpu & regular samples (extra pass) |  3910.686 samples/ms |
> | -F 4999 -a --off-cpu --off-cpu-threshold 1000000000 | few offcpu & regular (extra pass)     |  4661.229 samples/ms |
> +-----------------------------------------------------+---------------------------------------+----------------------+
>
> It's not ideal. I will find a way to reduce overhead. For example
> process them samples
> at save time as Ian mentioned.
>
> > > To turn the bpf-output samples into off-cpu events there is a pass
> > > added to the saving. I wonder if that can be more generic, like a save
> > > time perf inject.
>
> And I will find a default value for such a threshold based on performance
> and common use cases.
>
> > Sounds good.  We might add an option to specify the threshold to
> > determine whether to dump the data or to save it for later.  But ideally
> > it should be able to find a good default.
>
> These will be done before the GSoC kick-off on May 27.

This all sounds good. 100ns seems like quite a low threshold and 1s
extremely high, shame such a high threshold is marginal for the
context switch performance change. I wonder if 100 microseconds may be a
more sensible threshold. It's 100 times larger than the cost of 1
context switch but considerably less than a frame redraw at 60FPS (16
milliseconds).

Thanks,
Ian

Namhyung Kim May 23, 2024, 4:34 a.m. UTC | #7
Hello,

On Wed, May 15, 2024 at 9:56 PM Ian Rogers <irogers@google.com> wrote:
>
> On Wed, May 15, 2024 at 9:24 PM Howard Chu <howardchu95@gmail.com> wrote:
> >
> > Hello,
> >
> > Here is a little update on --off-cpu.
> >
> > > > It would be nice to start landing this work so I'm wondering what the
> > > > minimal way to do that is. It seems putting behavior behind a flag is
> > > > a first step.
> >
> > The flag to determine output threshold of off-cpu has been implemented.
> > If the accumulated off-cpu time exceeds this threshold, output the sample
> > directly; otherwise, save it later for off_cpu_write.
> >
> > But adding an extra pass to handle off-cpu samples introduces performance
> > issues, here's the processing rate of --off-cpu sampling(with the
> > extra pass to extract raw
> > sample data) and without. The --off-cpu-threshold is in nanoseconds.
> >
> > +-----------------------------------------------------+---------------------------------------+----------------------+
> > | perf record options                                 | type                                  | process rate         |
> > +-----------------------------------------------------+---------------------------------------+----------------------+
> > | -F 4999 -a                                          | regular samples (w/o extra pass)      | 13128.675 samples/ms |
> > | -F 1 -a --off-cpu --off-cpu-threshold 100           | offcpu samples (extra pass)           |  2843.247 samples/ms |
> > | -F 4999 -a --off-cpu --off-cpu-threshold 100        | offcpu & regular samples (extra pass) |  3910.686 samples/ms |
> > | -F 4999 -a --off-cpu --off-cpu-threshold 1000000000 | few offcpu & regular (extra pass)     |  4661.229 samples/ms |
> > +-----------------------------------------------------+---------------------------------------+----------------------+

What does the process rate mean?  Is the sample for the
off-cpu event or other (cpu-cycles)?  Is it from a single CPU
or system-wide or per-task?

> >
> > It's not ideal. I will find a way to reduce overhead. For example
> > process them samples
> > at save time as Ian mentioned.
> >
> > > > To turn the bpf-output samples into off-cpu events there is a pass
> > > > added to the saving. I wonder if that can be more generic, like a save
> > > > time perf inject.
> >
> > And I will find a default value for such a threshold based on performance
> > and common use cases.
> >
> > > Sounds good.  We might add an option to specify the threshold to
> > > determine whether to dump the data or to save it for later.  But ideally
> > > it should be able to find a good default.
> >
> > These will be done before the GSoC kick-off on May 27.
>
> This all sounds good. 100ns seems like quite a low threshold and 1s
> extremely high, shame such a high threshold is marginal for the
> context switch performance change. I wonder if 100 microseconds may be a
> more sensible threshold. It's 100 times larger than the cost of 1
> context switch but considerably less than a frame redraw at 60FPS (16
> milliseconds).

I don't know what's the sensible default.  But 1 msec could be
another candidate for the similar reason. :)

Thanks,
Namhyung
Howard Chu May 23, 2024, 4:34 p.m. UTC | #8
Hello,

On Thu, May 23, 2024 at 12:34 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> On Wed, May 15, 2024 at 9:56 PM Ian Rogers <irogers@google.com> wrote:
> >
> > On Wed, May 15, 2024 at 9:24 PM Howard Chu <howardchu95@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > Here is a little update on --off-cpu.
> > >
> > > > > It would be nice to start landing this work so I'm wondering what the
> > > > > minimal way to do that is. It seems putting behavior behind a flag is
> > > > > a first step.
> > >
> > > The flag to determine output threshold of off-cpu has been implemented.
> > > If the accumulated off-cpu time exceeds this threshold, output the sample
> > > directly; otherwise, save it later for off_cpu_write.
> > >
> > > But adding an extra pass to handle off-cpu samples introduces performance
> > > issues, here's the processing rate of --off-cpu sampling(with the
> > > extra pass to extract raw
> > > sample data) and without. The --off-cpu-threshold is in nanoseconds.
> > >
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | perf record options                                 | type                                  | process rate         |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
> > > | -F 4999 -a                                          | regular samples (w/o extra pass)      | 13128.675 samples/ms |
> > > | -F 1 -a --off-cpu --off-cpu-threshold 100           | offcpu samples (extra pass)           |  2843.247 samples/ms |
> > > | -F 4999 -a --off-cpu --off-cpu-threshold 100        | offcpu & regular samples (extra pass) |  3910.686 samples/ms |
> > > | -F 4999 -a --off-cpu --off-cpu-threshold 1000000000 | few offcpu & regular (extra pass)     |  4661.229 samples/ms |
> > > +-----------------------------------------------------+---------------------------------------+----------------------+
>
> What does the process rate mean?  Is the sample for the
> off-cpu event or other (cpu-cycles)?  Is it from a single CPU
> or system-wide or per-task?

Process rate is just a silly name for the rate at which record__pushfn()
writes data from the ring buffer. record__pushfn() is where I added the
extra pass that strips the off-cpu samples out of the original raw samples
that eBPF's perf_output collected.

With the -a option it runs on all CPUs, system-wide. Sorry that I only
tested extreme cases.

I ran perf record with `-F 4999 -a`, `-F 1 -a --off-cpu --off-cpu-threshold 100`,
`-F 4999 -a --off-cpu --off-cpu-threshold 100`, and
`-F 4999 -a --off-cpu --off-cpu-threshold 1000000000`.
`-F 4999 -a` collects only cpu-cycles samples and is the fastest
(13128.675 samples/ms) at writing samples to perf.data, because there is
no extra pass for stripping data out of BPF's raw samples.

`-F 1 -a --off-cpu --off-cpu-threshold 100` is mostly off-cpu samples,
which require considerably more time to strip, making it the slowest
(2843.247 samples/ms).

`-F 4999 -a --off-cpu --off-cpu-threshold 100` is half and half, with lots
of cpu-cycles samples, so it is a little faster than the former
(3910.686 samples/ms), because cpu-cycles samples need no extra handling
(though there is still the cost of copying memory back and forth).

`-F 4999 -a --off-cpu --off-cpu-threshold 1000000000` is a blend of a
large number of cpu-cycles samples and only a couple of off-cpu samples.
It is the fastest of the three --off-cpu runs (4661.229 samples/ms) but
still nowhere near the original, which doesn't have the extra
off_cpu_strip() pass.

What I'm trying to say is that stripping/handling off-cpu samples at
runtime is a bad idea, and the extra off_cpu_strip() pass should be
reconsidered. Reading events one by one, putting samples together, and
checking the sample_id introduces a lot of overhead. It should be done
at save time.
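
To illustrate why that per-event walk is costly, here is a rough,
hypothetical outline of what such a strip pass has to do for every chunk
being written to perf.data; is_offcpu_sample() and rewrite_offcpu_sample()
are stand-ins, not functions from the patch:

```c
#include <linux/perf_event.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the real work: match a record against the off-cpu event's
 * sample_id, and convert its PERF_SAMPLE_RAW payload into an offcpu-time
 * sample. */
bool is_offcpu_sample(struct perf_event_header *hdr);
void rewrite_offcpu_sample(struct perf_event_header *hdr);

/* Walk every record in the chunk about to be written out. */
static void strip_offcpu(void *buf, size_t size)
{
	void *end = (char *)buf + size;

	while (buf < end) {
		struct perf_event_header *hdr = buf;

		if (hdr->type == PERF_RECORD_SAMPLE && is_offcpu_sample(hdr))
			rewrite_offcpu_sample(hdr);

		buf = (char *)buf + hdr->size;	/* next record */
	}
}
```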

By the way, the default off_cpu_write() is perfectly fine.

Sorry about the horrible data table and explanation; they will be more
readable next time.

>
> > >
> > > It's not ideal. I will find a way to reduce overhead. For example
> > > process them samples
> > > at save time as Ian mentioned.
> > >
> > > > > To turn the bpf-output samples into off-cpu events there is a pass
> > > > > added to the saving. I wonder if that can be more generic, like a save
> > > > > time perf inject.
> > >
> > > And I will find a default value for such a threshold based on performance
> > > and common use cases.
> > >
> > > > Sounds good.  We might add an option to specify the threshold to
> > > > determine whether to dump the data or to save it for later.  But ideally
> > > > it should be able to find a good default.
> > >
> > > These will be done before the GSoC kick-off on May 27.
> >
> > This all sounds good. 100ns seems like quite a low threshold and 1s
> > extremely high, shame such a high threshold is marginal for the
> > context switch performance change. I wonder if 100 microseconds may be a
> > more sensible threshold. It's 100 times larger than the cost of 1
> > context switch but considerably less than a frame redraw at 60FPS (16
> > milliseconds).
>
> I don't know what's the sensible default.  But 1 msec could be
> another candidate for the similar reason. :)

Sure, I'll give them all a test and see the overhead they cause.

I understand that all I'm talking about is optimization, and that premature
optimization is the root of all evil. However, being almost three times
slower for only a few dozen direct off-CPU samples sounds weird to me.

Thanks,
Howard