
[RFC,00/48] perf tools: Introduce data type profiling (v1)

Message ID 20231012035111.676789-1-namhyung@kernel.org (mailing list archive)


Namhyung Kim Oct. 12, 2023, 3:50 a.m. UTC
Hello,

I'm happy to share my work on data type profiling.  It associates PMU
samples with the data types they access using DWARF debug information,
so the results basically depend on the quality of the PMU events and of
the DWARF info the compiler produces.  But it doesn't require any
changes to the target program.

As it's at an early stage, I've targeted the kernel on x86 to reduce the
amount of work, but IIUC there's no fundamental blocker to applying it
to other architectures and applications.


* How to use it

To get precise memory access samples, users can use the `perf mem
record` command to utilize the events supported by their architecture.
Intel machines would work best as they have dedicated memory access
events, but those events have a filter that ignores low-latency loads,
by default ones under 30 cycles (use the --ldlat option to change the
threshold).

    # To get memory access samples in kernel for 1 second (on Intel)
    $ sudo perf mem record -a -K --ldlat=4 -- sleep 1

    # Similar for the AMD (but it requires 6.3+ kernel for BPF filters)
    $ sudo perf mem record -a --filter 'mem_op == load, ip > 0x8000000000000000' -- sleep 1

Note that it uses 'sudo' because it collects the events in system-wide
mode.  Whether that's actually required depends on the sysctl setting of
kernel.perf_event_paranoid.  AMD still needs root due to the BPF filter
though.

After collecting the profile data, run perf report or perf annotate as
usual to see the result.  Make sure you have a kernel debug package
installed or a vmlinux with DWARF info.

I've added new options and sort keys to enable data type profiling.
Probably I need to add it to the perf mem or perf c2c command for a
better user experience.  I'm open to discussing how we can make it
simpler and more intuitive for regular users, but let's talk about the
lower-level interface for now.

In perf report, it's just a matter of selecting the new sort keys:
'type' and 'typeoff'.  The 'type' key shows the name of the data type as
a whole while 'typeoff' shows the name of the field in the data type.  I
found it useful to combine them with the --hierarchy option to group
relevant entries at the same level.

    $ sudo perf report -s type,typeoff --hierarchy --stdio
    ...
    #
    #    Overhead  Data Type / Data Type Offset
    # ...........  ............................
    #
        23.95%     (stack operation)
           23.95%     (stack operation) +0 (no field)
        23.43%     (unknown)
           23.43%     (unknown) +0 (no field)
        10.30%     struct pcpu_hot
            4.80%     struct pcpu_hot +0 (current_task)
            3.53%     struct pcpu_hot +8 (preempt_count)
            1.88%     struct pcpu_hot +12 (cpu_number)
            0.07%     struct pcpu_hot +24 (top_of_stack)
            0.01%     struct pcpu_hot +40 (softirq_pending)
         4.25%     struct task_struct
            1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
            0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
            0.49%     struct task_struct +2936 (cred)
            0.35%     struct task_struct +3144 (audit_context)
            0.19%     struct task_struct +46 (flags)
            0.17%     struct task_struct +972 (policy)
            0.15%     struct task_struct +32 (stack)
            0.15%     struct task_struct +8 (thread_info.syscall_work)
            0.10%     struct task_struct +976 (nr_cpus_allowed)
            0.09%     struct task_struct +2272 (mm)
        ...

The (stack operation) and (unknown) entries have no type and field info.
FYI, the stack operations are samples in PUSH, POP or RET instructions
which save or restore registers to/from the stack.  They are usually
part of the function prologue and epilogue and have no type info.  Next
is struct pcpu_hot, where you can see that the first field
(current_task) at offset 0 was accessed the most.  Entries are listed in
order of access frequency (not by offset), as you can see in the
task_struct entries.

In perf annotate, a new --data-type option was added to enable data
field level annotation.  For now it only shows the number of samples for
each field, but we can improve it.

    $ sudo perf annotate --data-type
    Annotate type: 'struct pcpu_hot' in [kernel.kallsyms] (223 samples):
    ============================================================================
        samples     offset       size  field
            223          0         64  struct pcpu_hot       {
            223          0         64      union     {
            223          0         48          struct        {
             78          0          8              struct task_struct*      current_task;
             98          8          4              int      preempt_count;
             45         12          4              int      cpu_number;
              0         16          8              u64      call_depth;
              1         24          8              long unsigned int        top_of_stack;
              0         32          8              void*    hardirq_stack_ptr;
              1         40          2              u16      softirq_pending;
              0         42          1              bool     hardirq_stack_inuse;
                                               };
            223          0         64          u8*  pad;
                                           };
                                       };
    ...

This shows each struct one by one with field-level access info in a
C-like style.  The number of samples for the outer struct is the sum of
the samples of every field in the struct.  In a union, the fields are
placed at the same offset, so they all have the same number of samples.
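As a toy illustration of that accounting (this is not perf's actual
code, and the layout below is a simplified stand-in for struct
pcpu_hot):

```python
# Toy model of the field-level sample accounting described above.
# A field is (name, offset, size); overlapping fields (as in a union)
# each count every sample that falls inside their byte range, which is
# why union members all show the same totals.

def count_field_samples(layout, sample_offsets):
    counts = {name: 0 for name, _, _ in layout}
    for off in sample_offsets:
        for name, start, size in layout:
            if start <= off < start + size:
                counts[name] += 1
    return counts

# Simplified pcpu_hot-like layout: two named fields plus a union
# member 'pad' overlapping the whole struct.
layout = [
    ("current_task",  0,  8),
    ("preempt_count", 8,  4),
    ("pad",           0, 64),   # union member covering everything
]
```

Since 'pad' overlaps every field, it collects the total of all samples,
just like the outer union in the output above.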

No TUI support yet.


* How it works

The basic idea is to use the DWARF location expressions in the debug
entries for variables.  Say we got a sample at the instruction below:

    0x123456:  mov    0x18(%rdi), %rcx

Then we know the instruction at 0x123456 is accessing a memory region
whose base address is in the %rdi register, at offset 0x18 from that
base.  DWARF should have a debug info entry for a function or a block
which covers that address.  For example, we might have something like
this:

    <1><100>: Abbrev Number: 10 (DW_TAG_subprogram)
       <101>    DW_AT_name       : (indirect string, offset: 0x184e6): foo
       <105>    DW_AT_type       : <0x29ad7>
       <106>    DW_AT_low_pc     : 0x123400
       <10e>    DW_AT_high_pc    : 0x1234ff
    <2><116>: Abbrev Number: 8 (DW_TAG_formal_parameter)
       <117>    DW_AT_name       : (indirect string, offset: 0x18527): bar
       <11b>    DW_AT_type       : <0x29b3a>
       <11c>    DW_AT_location   : 1 byte block: 55    (DW_OP_reg5 (rdi))

So the function 'foo' covers the instructions from 0x123400 to 0x1234ff,
which means the sampled instruction belongs to this function.  And it
has a parameter called 'bar' which is located in the %rdi register.
Then we know the instruction is using the variable bar and that its type
is a pointer (to a struct).  We can follow the type info of bar and
validate the access by checking the offset in the instruction (0x18)
against the size of the (struct) type.
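The lookup just described can be sketched roughly like this.  It's an
illustrative model only (not perf's implementation), and the struct size
is an assumed value, not taken from real DWARF:

```python
# Illustrative sketch: given a sampled PC plus the base register and
# offset of the memory operand, find a DWARF variable that lives in
# that register at that PC, and check that the access offset falls
# inside the struct its pointer type refers to.

def find_data_type(dwarf_vars, pc, reg, offset):
    for var in dwarf_vars:
        if var["reg"] != reg:
            continue
        if not (var["low_pc"] <= pc < var["high_pc"]):
            continue
        # the access offset must fall inside the pointed-to struct
        if 0 <= offset < var["type_size"]:
            return var["type"], var["name"]
    return None

# The 'foo'/'bar' example from the text; the struct size is assumed.
dwarf_vars = [{
    "name": "bar", "reg": "rdi",
    "low_pc": 0x123400, "high_pc": 0x1234ff,
    "type": "struct baz", "type_size": 0x40,
}]
```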

Well, this is a simple example where 'bar' has a single location.  Other
variables might live in different places over time, but that should be
covered by the location list of the debug entry.  Therefore, as long as
DWARF produces correct location expressions for a variable, the tool
should be able to find the variable using the location info.

Global and local variables are different as they can be accessed
directly without a pointer.  They are located at an absolute address or
at a position relative to the current stack frame, so the tool needs to
handle such location expressions as well.
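A minimal sketch of that direct-access lookup, assuming simplified
variable tables (again, this is not perf's implementation):

```python
# Globals are matched by absolute address range; stack variables by
# their offset from the frame base (as a DW_OP_fbreg location would
# describe).  Both return the variable, its type and the offset of the
# access within the variable.

def find_global(global_vars, addr):
    # global_vars: list of (name, start_addr, size, type_name)
    for name, start, size, ty in global_vars:
        if start <= addr < start + size:
            return name, ty, addr - start
    return None

def find_stack_var(stack_vars, fb_offset):
    # stack_vars: list of (name, frame_base_offset, size, type_name)
    for name, start, size, ty in stack_vars:
        if start <= fb_offset < start + size:
            return name, ty, fb_offset - start
    return None
```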

However, some memory accesses have no corresponding variable at all.
For example, say you have a pointer to a struct which contains other
pointers, and you dereference one of those directly without going
through a named variable.  Consider the following source code.

    int foo(struct baz *bar) {
        ...
        if (bar->p->q == 0)
            return 1;
        ...
    }

This can generate instructions like the following.

    ...
    0x123456:  mov    0x18(%rdi), %rcx
    0x12345a:  mov    0x10(%rcx), %rax     <=== sample
    0x12345e:  test   %rax, %rax
    0x123461:  je     <...>
    ...

Now imagine we have a sample at 0x12345a.  The tool cannot find a
variable for %rcx since DWARF didn't generate one (it only knows about
'bar').  Without compiler support, all it can do is track the code
execution instruction by instruction and propagate the type info through
each register and stack location by following the memory accesses.
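A toy version of that propagation might look like the following; the
field-type table and the 'struct p_ty' name are invented purely for
illustration:

```python
# Scan forward through the instructions; whenever a load dereferences a
# register whose pointee type is known, give the destination register
# the type of the loaded field.  FIELD_TYPES stands in for information
# derived from DWARF type DIEs.

FIELD_TYPES = {
    ("struct baz", 0x18): "struct p_ty",   # hypothetical type of bar->p
}

def track_types(insns, regs):
    """insns: list of (dst_reg, src_reg, offset), each meaning the
    instruction 'mov offset(src_reg), dst_reg'.  regs maps a register
    to the struct type its pointer value refers to (or None)."""
    regs = dict(regs)
    for dst, src, off in insns:
        regs[dst] = FIELD_TYPES.get((regs.get(src), off))
    return regs
```

With %rdi known to hold a struct baz pointer, tracking the first mov
gives %rcx the type of the field at offset 0x18, so the sample at
0x12345a can be attributed to that type at offset 0x10.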

Actually I found a discussion on the DWARF mailing list about supporting
"inverted location lists", and it seems like a perfect fit for this
project.  It'd be great if a new DWARF revision provided a way to look
up variable and type info from a concrete location (like a register
number).

  https://lists.dwarfstd.org/pipermail/dwarf-discuss/2023-June/002278.html 


* Patch structure

Patches 1-5 are cleanups and a fix that can be applied separately.
Patches 6-21 are the main changes in perf report and perf annotate to
support simple cases with a pointer variable.  Patches 22-33 improve it
by handling global and local variables (without a pointer) and some edge
cases.  Patches 34-43 implement instruction tracking to infer the data
type when there's no variable for it.  Patches 44-47 handle
kernel-specific per-cpu variables (only for the current CPU).  Patch 48
helps debugging and is not intended for merging.


* Limitations and future work

As I said earlier, this work is at a very early stage and has many
limitations and room for improvement.  Basically it uses the objdump
tool to extract location information from the sampled instruction, and
the parsing code and instruction tracking work on x86 only.
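For illustration, extracting the base register and offset from a
disassembly line could be sketched like this (a simplification, not
perf's actual parser; real operands may also carry index registers and
scales):

```python
import re

# Simplified parser for the x86 memory operand in AT&T-syntax objdump
# output, e.g. "mov 0x18(%rdi),%rcx" -> ("%rdi", 0x18).
MEM_RE = re.compile(r"(-?0x[0-9a-f]+)?\((%r\w+)\)")

def parse_mem_operand(disasm_line):
    m = MEM_RE.search(disasm_line)
    if m is None:
        return None              # no memory operand on this line
    offset = int(m.group(1), 16) if m.group(1) else 0
    return m.group(2), offset
```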

Actually there's a performance issue with getting the disassembly of the
kernel from objdump.  On my system, GNU objdump was much slower than the
LLVM one for some reason, so I had to pass the following option to each
perf report and perf annotate run.

    $ sudo perf report --objdump=llvm-objdump ...

    # To save it in the config file and drop the command line option
    $ sudo perf config annotate.objdump=llvm-objdump

Even with this change, most of the processing time is still spent in
objdump getting the disassembly.  It'd be nice if we could get the
result without using objdump at all.

Also, I only tested it with C programs (mostly vmlinux) and I believe
there are many issues in handling C++ applications.  Probably other
languages (like Rust?) could be supported too.  But even for C programs,
things could be improved, like better support for union and array types
and dealing with type casts and so on.

I think the compiler could generate more DWARF information to help this
kind of analysis.  As I mentioned, there is no variable for the
intermediate pointers when they are chained: a->b->c.  The chain can be
long, which makes it hard to track the type from the previous variable.
If the compiler could generate (artificial) debug entries for the
intermediate pointers with precise location expressions and type info,
it would be really helpful.

And I plan to improve the analysis in perf tools with better integration
into existing commands like perf mem and/or perf c2c.  It'd be pretty
interesting to see per-struct or per-field access patterns for both load
and store events at the same time.  Also, using the data-source or snoop
info for each struct/field would give some insight into optimizing
memory usage or layout.

There are kernel-specific issues too.  Some per-cpu variable accesses
create complex instruction patterns which make it hard to determine
which data/type was accessed.  For now, it just parses simple patterns
for this-cpu accesses using the %gs segment register.  It should also
handle self-modifying code like kprobes, ftrace, live patching and so
on.  I guess they usually create an out-of-line copy of the modified
instructions, but this needs more checking.  And I have no idea about
the status of struct layout randomization and the DWARF info of the
resulting structs.  There are probably more issues I'm not aware of, so
please let me know if you notice something.


* Summary

Despite all the issues, I believe this would be a good addition to our
performance toolset.  It would help to observe memory overheads from a
different angle and to optimize memory usage.  I'm really looking
forward to hearing any feedback.

The code is available at 'perf/data-profile-v1' branch in

  git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Enjoy,
Namhyung


Cc: Ben Woodard <woodard@redhat.com> 
Cc: Joe Mario <jmario@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Blaikie <blaikie@google.com>
Cc: Xu Liu <xliuprof@google.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>


Namhyung Kim (48):
  perf annotate: Move raw_comment and raw_func_start
  perf annotate: Check if operand has multiple regs
  perf tools: Add util/debuginfo.[ch] files
  perf dwarf-aux: Fix die_get_typename() for void *
  perf dwarf-aux: Move #ifdef code to the header file
  perf dwarf-aux: Add die_get_scopes() helper
  perf dwarf-aux: Add die_find_variable_by_reg() helper
  perf dwarf-aux: Factor out __die_get_typename()
  perf dwarf-regs: Add get_dwarf_regnum()
  perf annotate-data: Add find_data_type()
  perf annotate-data: Add dso->data_types tree
  perf annotate: Factor out evsel__get_arch()
  perf annotate: Add annotate_get_insn_location()
  perf annotate: Implement hist_entry__get_data_type()
  perf report: Add 'type' sort key
  perf report: Support data type profiling
  perf annotate-data: Add member field in the data type
  perf annotate-data: Update sample histogram for type
  perf report: Add 'typeoff' sort key
  perf report: Add 'symoff' sort key
  perf annotate: Add --data-type option
  perf annotate: Add --type-stat option for debugging
  perf annotate: Add --insn-stat option for debugging
  perf annotate-data: Parse 'lock' prefix from llvm-objdump
  perf annotate-data: Handle macro fusion on x86
  perf annotate-data: Handle array style accesses
  perf annotate-data: Add stack operation pseudo type
  perf dwarf-aux: Add die_find_variable_by_addr()
  perf annotate-data: Handle PC-relative addressing
  perf annotate-data: Support global variables
  perf dwarf-aux: Add die_get_cfa()
  perf annotate-data: Support stack variables
  perf dwarf-aux: Check allowed DWARF Ops
  perf dwarf-aux: Add die_collect_vars()
  perf dwarf-aux: Handle type transfer for memory access
  perf annotate-data: Introduce struct data_loc_info
  perf map: Add map__objdump_2rip()
  perf annotate: Add annotate_get_basic_blocks()
  perf annotate-data: Maintain variable type info
  perf annotate-data: Add update_insn_state()
  perf annotate-data: Handle global variable access
  perf annotate-data: Handle call instructions
  perf annotate-data: Implement instruction tracking
  perf annotate: Parse x86 segment register location
  perf annotate-data: Handle this-cpu variables in kernel
  perf annotate-data: Track instructions with a this-cpu variable
  perf annotate-data: Add stack canary type
  perf annotate-data: Add debug message

 tools/perf/Documentation/perf-report.txt      |    3 +
 .../arch/loongarch/annotate/instructions.c    |    6 +-
 tools/perf/arch/x86/util/dwarf-regs.c         |   38 +
 tools/perf/builtin-annotate.c                 |  149 +-
 tools/perf/builtin-report.c                   |   19 +-
 tools/perf/util/Build                         |    2 +
 tools/perf/util/annotate-data.c               | 1246 +++++++++++++++++
 tools/perf/util/annotate-data.h               |  222 +++
 tools/perf/util/annotate.c                    |  763 +++++++++-
 tools/perf/util/annotate.h                    |  104 +-
 tools/perf/util/debuginfo.c                   |  205 +++
 tools/perf/util/debuginfo.h                   |   64 +
 tools/perf/util/dso.c                         |    4 +
 tools/perf/util/dso.h                         |    2 +
 tools/perf/util/dwarf-aux.c                   |  561 +++++++-
 tools/perf/util/dwarf-aux.h                   |   86 +-
 tools/perf/util/dwarf-regs.c                  |   33 +
 tools/perf/util/hist.h                        |    3 +
 tools/perf/util/include/dwarf-regs.h          |   11 +
 tools/perf/util/map.c                         |   20 +
 tools/perf/util/map.h                         |    3 +
 tools/perf/util/probe-finder.c                |  193 +--
 tools/perf/util/probe-finder.h                |   19 +-
 tools/perf/util/sort.c                        |  195 ++-
 tools/perf/util/sort.h                        |    7 +
 tools/perf/util/symbol_conf.h                 |    4 +-
 26 files changed, 3703 insertions(+), 259 deletions(-)
 create mode 100644 tools/perf/util/annotate-data.c
 create mode 100644 tools/perf/util/annotate-data.h
 create mode 100644 tools/perf/util/debuginfo.c
 create mode 100644 tools/perf/util/debuginfo.h


base-commit: 87cd3d48191e533cd9c224f2da1d78b3513daf47

Comments

Ingo Molnar Oct. 12, 2023, 6:03 a.m. UTC | #1
* Namhyung Kim <namhyung@kernel.org> wrote:

> * How to use it
> 
> To get precise memory access samples, users can use `perf mem record`
> command to utilize those events supported by their architecture.  Intel
> machines would work best as they have dedicated memory access events but
> they would have a filter to ignore low latency loads like less than 30
> cycles (use --ldlat option to change the default value).
> 
>     # To get memory access samples in kernel for 1 second (on Intel)
>     $ sudo perf mem record -a -K --ldlat=4 -- sleep 1
> 
>     # Similar for the AMD (but it requires 6.3+ kernel for BPF filters)
>     $ sudo perf mem record -a --filter 'mem_op == load, ip > 0x8000000000000000' -- sleep 1

BTW., it would be nice for 'perf mem record' to just do the right thing on 
whatever machine it is running on.

Also, why are BPF filters required - due to the IP filtering of mem-load 
events?

Could we perhaps add an IP filter to perf events to get this built-in? 
Perhaps attr->exclude_user would achieve something similar?

> In perf report, it's just a matter of selecting new sort keys: 'type'
> and 'typeoff'.  The 'type' shows name of the data type as a whole while
> 'typeoff' shows name of the field in the data type.  I found it useful
> to use it with --hierarchy option to group relevant entries in the same
> level.
> 
>     $ sudo perf report -s type,typeoff --hierarchy --stdio
>     ...
>     #
>     #    Overhead  Data Type / Data Type Offset
>     # ...........  ............................
>     #
>         23.95%     (stack operation)
>            23.95%     (stack operation) +0 (no field)
>         23.43%     (unknown)
>            23.43%     (unknown) +0 (no field)
>         10.30%     struct pcpu_hot
>             4.80%     struct pcpu_hot +0 (current_task)
>             3.53%     struct pcpu_hot +8 (preempt_count)
>             1.88%     struct pcpu_hot +12 (cpu_number)
>             0.07%     struct pcpu_hot +24 (top_of_stack)
>             0.01%     struct pcpu_hot +40 (softirq_pending)
>          4.25%     struct task_struct
>             1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
>             0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
>             0.49%     struct task_struct +2936 (cred)
>             0.35%     struct task_struct +3144 (audit_context)
>             0.19%     struct task_struct +46 (flags)
>             0.17%     struct task_struct +972 (policy)
>             0.15%     struct task_struct +32 (stack)
>             0.15%     struct task_struct +8 (thread_info.syscall_work)
>             0.10%     struct task_struct +976 (nr_cpus_allowed)
>             0.09%     struct task_struct +2272 (mm)
>         ...

This looks really useful!

Thanks,

	Ingo
Peter Zijlstra Oct. 12, 2023, 9:11 a.m. UTC | #2
W00t!! Finally! :-)

On Wed, Oct 11, 2023 at 08:50:23PM -0700, Namhyung Kim wrote:

> * How to use it
> 
> To get precise memory access samples, users can use `perf mem record`
> command to utilize those events supported by their architecture.  Intel
> machines would work best as they have dedicated memory access events but
> they would have a filter to ignore low latency loads like less than 30
> cycles (use --ldlat option to change the default value).
> 
>     # To get memory access samples in kernel for 1 second (on Intel)
>     $ sudo perf mem record -a -K --ldlat=4 -- sleep 1

Fundamentally this should work with anything PEBS from MEM_ as
well, no? No real reason to rely on perf mem for this.

> In perf report, it's just a matter of selecting new sort keys: 'type'
> and 'typeoff'.  The 'type' shows name of the data type as a whole while
> 'typeoff' shows name of the field in the data type.  I found it useful
> to use it with --hierarchy option to group relevant entries in the same
> level.
> 
>     $ sudo perf report -s type,typeoff --hierarchy --stdio
>     ...
>     #
>     #    Overhead  Data Type / Data Type Offset
>     # ...........  ............................
>     #
>         23.95%     (stack operation)
>            23.95%     (stack operation) +0 (no field)
>         23.43%     (unknown)
>            23.43%     (unknown) +0 (no field)
>         10.30%     struct pcpu_hot
>             4.80%     struct pcpu_hot +0 (current_task)
>             3.53%     struct pcpu_hot +8 (preempt_count)
>             1.88%     struct pcpu_hot +12 (cpu_number)
>             0.07%     struct pcpu_hot +24 (top_of_stack)
>             0.01%     struct pcpu_hot +40 (softirq_pending)
>          4.25%     struct task_struct
>             1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
>             0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
>             0.49%     struct task_struct +2936 (cred)
>             0.35%     struct task_struct +3144 (audit_context)
>             0.19%     struct task_struct +46 (flags)
>             0.17%     struct task_struct +972 (policy)
>             0.15%     struct task_struct +32 (stack)
>             0.15%     struct task_struct +8 (thread_info.syscall_work)
>             0.10%     struct task_struct +976 (nr_cpus_allowed)
>             0.09%     struct task_struct +2272 (mm)
>         ...
> 
> The (stack operation) and (unknown) have no type and field info.  FYI,
> the stack operations are samples in PUSH, POP or RET instructions which
> save or restore registers from/to the stack.  They are usually parts of
> function prologue and epilogue and have no type info.  The next is the
> struct pcpu_hot and you can see the first field (current_task) at offset
> 0 was accessed mostly.  It's listed in order of access frequency (not in
> offset) as you can see it in the task_struct.
> 
> In perf annotate, new --data-type option was added to enable data
> field level annotation.  Now it only shows number of samples for each
> field but we can improve it.
> 
>     $ sudo perf annotate --data-type
>     Annotate type: 'struct pcpu_hot' in [kernel.kallsyms] (223 samples):
>     ============================================================================
>         samples     offset       size  field
>             223          0         64  struct pcpu_hot       {
>             223          0         64      union     {
>             223          0         48          struct        {
>              78          0          8              struct task_struct*      current_task;
>              98          8          4              int      preempt_count;
>              45         12          4              int      cpu_number;
>               0         16          8              u64      call_depth;
>               1         24          8              long unsigned int        top_of_stack;
>               0         32          8              void*    hardirq_stack_ptr;
>               1         40          2              u16      softirq_pending;
>               0         42          1              bool     hardirq_stack_inuse;
>                                                };
>             223          0         64          u8*  pad;
>                                            };
>                                        };
>     ...
> 
> This shows each struct one by one and field-level access info in C-like
> style.  The number of samples for the outer struct is a sum of number of
> samples in every field in the struct.  In unions, each field is placed
> in the same offset so they will have the same number of samples.

This is excellent -- and pretty much what I've been asking for forever.

Would it be possible to have multiple sample columns, for eg.
MEM_LOADS_UOPS_RETIRED.L1_HIT and MEM_LOADS_UOPS_RETIRED.L1_MISS
or even more (adding LLC hit and miss as well etc.) ?

(for bonus points: --data-type=typename, would be awesome)

Additionally, annotating the regular perf-annotate output with data-type
information (where we have it) might also be very useful. That way, even
when profiling with PEBS-cycles, an expensive memop immediately gives a
clue as to what data-type to look at.

> No TUI support yet.

Yeah, nobody needs that anyway :-)

> This can generate instructions like below.
> 
>     ...
>     0x123456:  mov    0x18(%rdi), %rcx
>     0x12345a:  mov    0x10(%rcx), %rax     <=== sample
>     0x12345e:  test   %rax, %rax
>     0x123461:  je     <...>
>     ...
> 
> And imagine we have a sample at 0x12345a.  Then it cannot find a
> variable for %rcx since DWARF didn't generate one (it only knows about
> 'bar').  Without compiler support, all it can do is to track the code
> execution in each instruction and propagate the type info in each
> register and stack location by following the memory access.

Right, this has more or less been the 'excuse' for why doing this has
been 'difficult' for the past 10+ years :/

> Actually I found a discussion in the DWARF mailing list to support
> "inverted location lists" and it seems a perfect fit for this project.
> It'd be great if new DWARF would provide a way to lookup variable and
> type info using a concrete location info (like a register number).
> 
>   https://lists.dwarfstd.org/pipermail/dwarf-discuss/2023-June/002278.html 

Stephane was going to talk to tools people about this over 10 years ago
:-)

Thanks for *finally* getting this started!!
Peter Zijlstra Oct. 12, 2023, 9:15 a.m. UTC | #3
On Wed, Oct 11, 2023 at 08:50:23PM -0700, Namhyung Kim wrote:

> Actually there's a performance issue about getting disassembly from the
> objdump for kernel.  On my system, GNU objdump was really slower than the
> one from LLVM for some reason so I had to pass the following option for
> each perf report and perf annotate.
> 
>     $ sudo perf report --objdump=llvm-objdump ...
> 
>     # To save it in the config file and drop the command line option
>     $ sudo perf config annotate.objdump=llvm-objdump
> 
> Even with this change, still the most processing time was spent on the
> objdump to get the disassembly.  It'd be nice if we can get the result
> without using objdump at all.

So the kernel has an instruction decoder, all we need is something that
can pretty print the result. IIRC Masami had an early version of that
somewhere.

With those bits, and some basic ELF parsing (find in objtool for
instance) you can implement most of objdump yourself.
Namhyung Kim Oct. 12, 2023, 4:19 p.m. UTC | #4
Hi Ingo,

On Wed, Oct 11, 2023 at 11:03 PM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Namhyung Kim <namhyung@kernel.org> wrote:
>
> > * How to use it
> >
> > To get precise memory access samples, users can use `perf mem record`
> > command to utilize those events supported by their architecture.  Intel
> > machines would work best as they have dedicated memory access events but
> > they would have a filter to ignore low latency loads like less than 30
> > cycles (use --ldlat option to change the default value).
> >
> >     # To get memory access samples in kernel for 1 second (on Intel)
> >     $ sudo perf mem record -a -K --ldlat=4 -- sleep 1
> >
> >     # Similar for the AMD (but it requires 6.3+ kernel for BPF filters)
> >     $ sudo perf mem record -a --filter 'mem_op == load, ip > 0x8000000000000000' -- sleep 1
>
> BTW., it would be nice for 'perf mem record' to just do the right thing on
> whatever machine it is running on.
>
> Also, why are BPF filters required - due to the IP filtering of mem-load
> events?

Right, because AMD uses IBS for precise events and it doesn't
have a filtering feature.

>
> Could we perhaps add an IP filter to perf events to get this built-in?
> Perhaps attr->exclude_user would achieve something similar?

Unfortunately IBS doesn't support privilege filters IIUC.  Maybe
we could add a general filtering logic in the NMI handler but I'm
afraid it can complicate the code and maybe slow it down a bit.
Probably it's ok to have only a simple privilege filter by IP range.

>
> > In perf report, it's just a matter of selecting new sort keys: 'type'
> > and 'typeoff'.  The 'type' shows name of the data type as a whole while
> > 'typeoff' shows name of the field in the data type.  I found it useful
> > to use it with --hierarchy option to group relevant entries in the same
> > level.
> >
> >     $ sudo perf report -s type,typeoff --hierarchy --stdio
> >     ...
> >     #
> >     #    Overhead  Data Type / Data Type Offset
> >     # ...........  ............................
> >     #
> >         23.95%     (stack operation)
> >            23.95%     (stack operation) +0 (no field)
> >         23.43%     (unknown)
> >            23.43%     (unknown) +0 (no field)
> >         10.30%     struct pcpu_hot
> >             4.80%     struct pcpu_hot +0 (current_task)
> >             3.53%     struct pcpu_hot +8 (preempt_count)
> >             1.88%     struct pcpu_hot +12 (cpu_number)
> >             0.07%     struct pcpu_hot +24 (top_of_stack)
> >             0.01%     struct pcpu_hot +40 (softirq_pending)
> >          4.25%     struct task_struct
> >             1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
> >             0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
> >             0.49%     struct task_struct +2936 (cred)
> >             0.35%     struct task_struct +3144 (audit_context)
> >             0.19%     struct task_struct +46 (flags)
> >             0.17%     struct task_struct +972 (policy)
> >             0.15%     struct task_struct +32 (stack)
> >             0.15%     struct task_struct +8 (thread_info.syscall_work)
> >             0.10%     struct task_struct +976 (nr_cpus_allowed)
> >             0.09%     struct task_struct +2272 (mm)
> >         ...
>
> This looks really useful!

:)

Thanks,
Namhyung
Namhyung Kim Oct. 12, 2023, 4:41 p.m. UTC | #5
Hi Peter,

On Thu, Oct 12, 2023 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> W00t!! Finally! :-)

Yay!

>
> On Wed, Oct 11, 2023 at 08:50:23PM -0700, Namhyung Kim wrote:
>
> > * How to use it
> >
> > To get precise memory access samples, users can use `perf mem record`
> > command to utilize those events supported by their architecture.  Intel
> > machines would work best as they have dedicated memory access events but
> > they would have a filter to ignore low latency loads like less than 30
> > cycles (use --ldlat option to change the default value).
> >
> >     # To get memory access samples in kernel for 1 second (on Intel)
> >     $ sudo perf mem record -a -K --ldlat=4 -- sleep 1
>
> Fundamentally this should work with anything PEBS from MEM_ as
> well, no? No real reason to rely on perf mem for this.

Correct, experienced users can choose any supported event.
Right now it doesn't even use any MEM_ (data_src) fields, but they
should be added later.  BTW I think it'd be better to have an option
to enable data_src sample collection without gathering data MMAPs.

>
> > In perf report, it's just a matter of selecting the new sort keys:
> > 'type' and 'typeoff'.  The 'type' key shows the name of the data type as
> > a whole while 'typeoff' shows the name of the field in the data type.  I
> > found it useful to combine them with the --hierarchy option to group
> > relevant entries at the same level.
> >
> >     $ sudo perf report -s type,typeoff --hierarchy --stdio
> >     ...
> >     #
> >     #    Overhead  Data Type / Data Type Offset
> >     # ...........  ............................
> >     #
> >         23.95%     (stack operation)
> >            23.95%     (stack operation) +0 (no field)
> >         23.43%     (unknown)
> >            23.43%     (unknown) +0 (no field)
> >         10.30%     struct pcpu_hot
> >             4.80%     struct pcpu_hot +0 (current_task)
> >             3.53%     struct pcpu_hot +8 (preempt_count)
> >             1.88%     struct pcpu_hot +12 (cpu_number)
> >             0.07%     struct pcpu_hot +24 (top_of_stack)
> >             0.01%     struct pcpu_hot +40 (softirq_pending)
> >          4.25%     struct task_struct
> >             1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
> >             0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
> >             0.49%     struct task_struct +2936 (cred)
> >             0.35%     struct task_struct +3144 (audit_context)
> >             0.19%     struct task_struct +46 (flags)
> >             0.17%     struct task_struct +972 (policy)
> >             0.15%     struct task_struct +32 (stack)
> >             0.15%     struct task_struct +8 (thread_info.syscall_work)
> >             0.10%     struct task_struct +976 (nr_cpus_allowed)
> >             0.09%     struct task_struct +2272 (mm)
> >         ...
> >
> > The (stack operation) and (unknown) entries have no type and field
> > info.  FYI, the stack operations are samples in PUSH, POP or RET
> > instructions which save or restore registers from/to the stack.  They
> > are usually part of the function prologue and epilogue and have no type
> > info.  Next is struct pcpu_hot, where you can see that the first field
> > (current_task) at offset 0 was accessed the most.  Fields are listed in
> > order of access frequency (not offset), as you can see for task_struct.
> >
> > In perf annotate, a new --data-type option was added to enable data
> > field-level annotation.  For now it only shows the number of samples for
> > each field, but we can improve it.
> >
> >     $ sudo perf annotate --data-type
> >     Annotate type: 'struct pcpu_hot' in [kernel.kallsyms] (223 samples):
> >     ============================================================================
> >         samples     offset       size  field
> >             223          0         64  struct pcpu_hot       {
> >             223          0         64      union     {
> >             223          0         48          struct        {
> >              78          0          8              struct task_struct*      current_task;
> >              98          8          4              int      preempt_count;
> >              45         12          4              int      cpu_number;
> >               0         16          8              u64      call_depth;
> >               1         24          8              long unsigned int        top_of_stack;
> >               0         32          8              void*    hardirq_stack_ptr;
> >               1         40          2              u16      softirq_pending;
> >               0         42          1              bool     hardirq_stack_inuse;
> >                                                };
> >             223          0         64          u8*  pad;
> >                                            };
> >                                        };
> >     ...
> >
> > This shows each struct one by one with field-level access info in
> > C-like style.  The number of samples for the outer struct is the sum of
> > the samples in every field of the struct.  In unions, the fields are
> > placed at the same offset so they have the same number of samples.
>
> This is excellent -- and pretty much what I've been asking for forever.

Glad you like it.

>
> Would it be possible to have multiple sample columns, for eg.
> MEM_LOADS_UOPS_RETIRED.L1_HIT and MEM_LOADS_UOPS_RETIRED.L1_MISS
> or even more (adding LLC hit and miss as well etc.) ?

Yep, that should be supported.  Ideally it would display samples
(or overhead) for each event in an event group.  And you can
group individual events together at report/annotate time.
But that doesn't work well with this yet.  Will fix.

>
> (for bonus points: --data-type=typename, would be awesome)

Right, will do that in the next spin.

>
> Additionally, annotating the regular perf-annotate output with data-type
> information (where we have it) might also be very useful. That way, even
> when profiling with PEBS-cycles, an expensive memop immediately gives a
> clue as to what data-type to look at.
>
> > No TUI support yet.
>
> Yeah, nobody needs that anyway :-)

I need that ;-)

At least, the interactive transition between perf report and
perf annotate is really useful for me.  You should try
it someday.

Note that perf report TUI works well with data types.

>
> > This can generate instructions like below.
> >
> >     ...
> >     0x123456:  mov    0x18(%rdi), %rcx
> >     0x12345a:  mov    0x10(%rcx), %rax     <=== sample
> >     0x12345e:  test   %rax, %rax
> >     0x123461:  je     <...>
> >     ...
> >
> > And imagine we have a sample at 0x12345a.  Then perf cannot find a
> > variable for %rcx since DWARF didn't generate one (it only knows about
> > 'bar').  Without compiler support, all it can do is track the code
> > execution instruction by instruction and propagate the type info in each
> > register and stack location by following the memory accesses.
>
> Right, this has more or less been the 'excuse' for why doing this has
> been 'difficult' for the past 10+ years :/

I'm sure I missed some cases, but I managed to make it work for the
usual ones.  We can improve it by handling more cases and
instructions, but it'd be great if we had better support from the
toolchains.
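
To make the idea concrete, here is a toy sketch in Python (my
illustration, not code from the patch set; the type tables and helper
names are hypothetical) of propagating type info through registers for
the a->b->c example above:

```python
# Toy sketch of instruction tracking: start from a register whose type
# DWARF gives us, then for each 'mov off(%src), %dst' look up the member
# type at that offset to type the destination register.

# Hypothetical layout tables for the a->b->c example.
struct_layouts = {
    'struct baz*': {0x18: 'struct foo*'},   # bar->p at offset 0x18
    'struct foo*': {0x10: 'long'},          # p->q   at offset 0x10
}

def track(insns, reg_types):
    """Propagate types through a list of (offset, src_reg, dst_reg) moves."""
    for off, src, dst in insns:
        src_type = reg_types.get(src)
        # Look up which member sits at this offset in the source's type.
        reg_types[dst] = struct_layouts.get(src_type, {}).get(off)
    return reg_types

# DWARF says 'bar' (a struct baz*) lives in %rdi on function entry.
regs = track([(0x18, 'rdi', 'rcx'),    # mov 0x18(%rdi),%rcx
              (0x10, 'rcx', 'rax')],   # mov 0x10(%rcx),%rax  <== sample
             {'rdi': 'struct baz*'})
print(regs['rax'])  # -> long: the sampled access resolves to p->q's type
```

A real implementation also has to handle stack slots, call boundaries
and control flow, which is where most of the complexity lies.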

>
> > Actually I found a discussion on the DWARF mailing list about supporting
> > "inverted location lists" and it seems like a perfect fit for this
> > project.  It'd be great if a new DWARF version provided a way to look up
> > variable and type info from a concrete location (like a register number).
> >
> >   https://lists.dwarfstd.org/pipermail/dwarf-discuss/2023-June/002278.html
>
> Stephane was going to talk to tools people about this over 10 years ago
> :-)

I hope they make some progress.

>
> Thanks for *finally* getting this started!!

Yep, let's make it better!

Thanks,
Namhyung
Namhyung Kim Oct. 12, 2023, 4:52 p.m. UTC | #6
On Thu, Oct 12, 2023 at 2:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 11, 2023 at 08:50:23PM -0700, Namhyung Kim wrote:
>
> > Actually there's a performance issue with getting the disassembly from
> > objdump for the kernel.  On my system, GNU objdump was much slower than
> > the LLVM one for some reason, so I had to pass the following option to
> > each perf report and perf annotate run.
> >
> >     $ sudo perf report --objdump=llvm-objdump ...
> >
> >     # To save it in the config file and drop the command line option
> >     $ sudo perf config annotate.objdump=llvm-objdump
> >
> > Even with this change, most of the processing time was still spent in
> > objdump getting the disassembly.  It'd be nice if we could get the
> > result without using objdump at all.
>
> So the kernel has an instruction decoder, all we need is something that
> can pretty print the result. IIRC Masami had an early version of that
> somewhere.
>
> With those bits, and some basic ELF parsing (find in objtool for
> instance) you can implement most of objdump yourself.

That would be nice, but I'm a bit wary of dealing with the details
of instruction decoding, especially for unusual instructions, and of
how well it would extend to user space and other architectures.

Thanks,
Namhyung
Ingo Molnar Oct. 12, 2023, 6:33 p.m. UTC | #7
* Namhyung Kim <namhyung@kernel.org> wrote:

> > Could we perhaps add an IP filter to perf events to get this built-in?
> > Perhaps attr->exclude_user would achieve something similar?
> 
> Unfortunately IBS doesn't support privilege filters IIUC.  Maybe
> we could add a general filtering logic in the NMI handler but I'm
> afraid it can complicate the code and maybe slow it down a bit.
> Probably it's ok to have only a simple privilege filter by IP range.

It will still be so much faster than moving it through the BPF machinery, 
and bonus points if we merge this into the existing privilege-domain 
filtering ABI, so no magic 0x800000000000 constants are needed.

'Overhead' to other usecases shouldn't be much more than a single branch 
somewhere.

Thanks,

	Ingo
Namhyung Kim Oct. 12, 2023, 8:45 p.m. UTC | #8
On Thu, Oct 12, 2023 at 11:33 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Namhyung Kim <namhyung@kernel.org> wrote:
>
> > > Could we perhaps add an IP filter to perf events to get this built-in?
> > > Perhaps attr->exclude_user would achieve something similar?
> >
> > Unfortunately IBS doesn't support privilege filters IIUC.  Maybe
> > we could add a general filtering logic in the NMI handler but I'm
> > afraid it can complicate the code and maybe slow it down a bit.
> > Probably it's ok to have only a simple privilege filter by IP range.
>
> It will still be so much faster than moving it through the BPF machinery,
> and bonus points if we merge this into the existing privilege-domain
> filtering ABI, so no magic 0x800000000000 constants are needed.
>
> 'Overhead' to other usecases shouldn't be much more than a single branch
> somewhere.

Ok, maybe overhead is not a concern.  But users need to pass the
filter expression to the kernel.
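
For what it's worth, the semantics of the filter string itself are
simple; here is a toy model in Python (illustration only: perf compiles
the expression into a BPF program in the kernel, and this is not its
real parser or grammar):

```python
# Toy model of a sample filter such as
#   'mem_op == load, ip > 0x8000000000000000'
# Comma-separated terms are ANDed together.

def parse_filter(expr):
    # Each term is 'field op value' separated by whitespace.
    return [tuple(term.split()) for term in expr.split(',')]

def sample_passes(sample, terms):
    ops = {'==': lambda a, b: a == b,
           '>':  lambda a, b: a > b,
           '<':  lambda a, b: a < b}
    for field, op, value in terms:
        try:
            value = int(value, 0)   # numeric compare, hex allowed
        except ValueError:
            pass                    # symbolic value like 'load'
        if not ops[op](sample[field], value):
            return False
    return True

terms = parse_filter('mem_op == load, ip > 0x8000000000000000')
kernel_load = {'mem_op': 'load', 'ip': 0xffffffff81000000}
user_store = {'mem_op': 'store', 'ip': 0x00007f0000001000}
print(sample_passes(kernel_load, terms))   # True: kernel-space load
print(sample_passes(user_store, terms))    # False: filtered out
```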

Thanks,
Namhyung
Arnaldo Carvalho de Melo Oct. 13, 2023, 2:15 p.m. UTC | #9
Em Wed, Oct 11, 2023 at 08:50:23PM -0700, Namhyung Kim escreveu:
> Hello,
> 
> I'm happy to share my work on data type profiling.  This associates
> PMU samples with the data types they refer to, using DWARF debug
> information.  So basically it depends on the quality of the PMU events
> and on the compiler producing DWARF info.  But it doesn't require any
> changes to the target program.

Great news, finally work on this started :-)

And it uses what is already available, so that we get to see what the
roadblocks are to covering more cases.  Great.

I'll test it all and report back,

Thanks for doing this work!

- Arnaldo
 
> As it's at an early stage, I've targeted the kernel on x86 to reduce the
> amount of work, but IIUC there's no fundamental blocker to applying it to
> other architectures and applications.
> 
> 
> * How to use it
> 
> To get precise memory access samples, users can use `perf mem record`
> command to utilize the events supported by their architecture.  Intel
> machines work best as they have dedicated memory access events, but
> those events apply a filter that ignores low-latency loads, e.g. loads
> under 30 cycles (use the --ldlat option to change the default threshold).
> 
>     # To get memory access samples in kernel for 1 second (on Intel)
>     $ sudo perf mem record -a -K --ldlat=4 -- sleep 1
> 
>     # Similar for the AMD (but it requires 6.3+ kernel for BPF filters)
>     $ sudo perf mem record -a --filter 'mem_op == load, ip > 0x8000000000000000' -- sleep 1
> 
> Note that 'sudo' is used because it's collecting events in system-wide
> mode; whether it's needed actually depends on the sysctl setting of
> kernel.perf_event_paranoid.  AMD still needs root due to the BPF filter,
> though.
> 
> After collecting the profile data, run perf report or perf annotate as
> usual to see the result.  Make sure that you have a kernel debug package
> installed or a vmlinux with DWARF info.
 
> I've added new options and sort keys to enable data type profiling.
> Probably I need to add it to the perf mem or perf c2c commands for a
> better user experience.  I'm open to discussing how we can make it
> simpler and more intuitive for regular users.  But let's talk about the
> lower-level interface for now.
> 
> In perf report, it's just a matter of selecting the new sort keys:
> 'type' and 'typeoff'.  The 'type' key shows the name of the data type as
> a whole while 'typeoff' shows the name of the field in the data type.  I
> found it useful to combine them with the --hierarchy option to group
> relevant entries at the same level.
> 
>     $ sudo perf report -s type,typeoff --hierarchy --stdio
>     ...
>     #
>     #    Overhead  Data Type / Data Type Offset
>     # ...........  ............................
>     #
>         23.95%     (stack operation)
>            23.95%     (stack operation) +0 (no field)
>         23.43%     (unknown)
>            23.43%     (unknown) +0 (no field)
>         10.30%     struct pcpu_hot
>             4.80%     struct pcpu_hot +0 (current_task)
>             3.53%     struct pcpu_hot +8 (preempt_count)
>             1.88%     struct pcpu_hot +12 (cpu_number)
>             0.07%     struct pcpu_hot +24 (top_of_stack)
>             0.01%     struct pcpu_hot +40 (softirq_pending)
>          4.25%     struct task_struct
>             1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
>             0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
>             0.49%     struct task_struct +2936 (cred)
>             0.35%     struct task_struct +3144 (audit_context)
>             0.19%     struct task_struct +46 (flags)
>             0.17%     struct task_struct +972 (policy)
>             0.15%     struct task_struct +32 (stack)
>             0.15%     struct task_struct +8 (thread_info.syscall_work)
>             0.10%     struct task_struct +976 (nr_cpus_allowed)
>             0.09%     struct task_struct +2272 (mm)
>         ...
> 
> The (stack operation) and (unknown) entries have no type and field
> info.  FYI, the stack operations are samples in PUSH, POP or RET
> instructions which save or restore registers from/to the stack.  They
> are usually part of the function prologue and epilogue and have no type
> info.  Next is struct pcpu_hot, where you can see that the first field
> (current_task) at offset 0 was accessed the most.  Fields are listed in
> order of access frequency (not offset), as you can see for task_struct.
> 
> In perf annotate, a new --data-type option was added to enable data
> field-level annotation.  For now it only shows the number of samples for
> each field, but we can improve it.
> 
>     $ sudo perf annotate --data-type
>     Annotate type: 'struct pcpu_hot' in [kernel.kallsyms] (223 samples):
>     ============================================================================
>         samples     offset       size  field
>             223          0         64  struct pcpu_hot       {
>             223          0         64      union     {
>             223          0         48          struct        {
>              78          0          8              struct task_struct*      current_task;
>              98          8          4              int      preempt_count;
>              45         12          4              int      cpu_number;
>               0         16          8              u64      call_depth;
>               1         24          8              long unsigned int        top_of_stack;
>               0         32          8              void*    hardirq_stack_ptr;
>               1         40          2              u16      softirq_pending;
>               0         42          1              bool     hardirq_stack_inuse;
>                                                };
>             223          0         64          u8*  pad;
>                                            };
>                                        };
>     ...
> 
> This shows each struct one by one with field-level access info in
> C-like style.  The number of samples for the outer struct is the sum of
> the samples in every field of the struct.  In unions, the fields are
> placed at the same offset so they have the same number of samples.
> 
> No TUI support yet.
> 
> 
> * How it works
> 
> The basic idea is to use the DWARF location expressions in the debug
> entries for variables.  Say we got a sample at the instruction below:
> 
>     0x123456:  mov    0x18(%rdi), %rcx
> 
> Then we know the instruction at 0x123456 is accessing a memory region
> whose base address is in the %rdi register, at offset 0x18 from that
> base.  DWARF would have a debug info entry for a function or a block
> which covers that address.  For example, we might have something like
> this:
> 
>     <1><100>: Abbrev Number: 10 (DW_TAG_subroutine_type)
>        <101>    DW_AT_name       : (indirect string, offset: 0x184e6): foo
>        <105>    DW_AT_type       : <0x29ad7>
>        <106>    DW_AT_low_pc     : 0x123400
>        <10e>    DW_AT_high_pc    : 0x1234ff
>     <2><116>: Abbrev Number: 8 (DW_TAG_formal_parameter)
>        <117>    DW_AT_name       : (indirect string, offset: 0x18527): bar
>        <11b>    DW_AT_type       : <0x29b3a>
>        <11c>    DW_AT_location   : 1 byte block: 55    (DW_OP_reg2 (rdi))
> 
> So the function 'foo' covers the instructions from 0x123400 to 0x1234ff
> and we know the sampled instruction belongs to that function.  The
> function has a parameter called 'bar' which is located in the %rdi
> register.  Then we know the instruction is using the variable bar and
> that its type is a pointer (to a struct).  We can follow the type info
> of bar and verify the access by checking the size of the (struct) type
> against the offset in the instruction (0x18).
> 
> Well, this is a simple example where 'bar' has a single location.
> Other variables might be located in various places over time, but that
> should be covered by the location list of the debug entry.  Therefore,
> as long as DWARF produces a correct location expression for a variable,
> perf should be able to find the variable using the location info.
> 
> Global and local variables are different as they can be accessed
> directly without a pointer.  They are located at an absolute address or
> at a position relative to the current stack frame.  So perf needs to
> handle such location expressions as well.
> 
> However, some memory accesses have no corresponding variable at all.
> For example, you may have a pointer variable for a struct which contains
> other pointers.  An inner pointer can then be dereferenced directly
> without ever being stored in a named variable.  Consider the following
> source code.
> 
>     int foo(struct baz *bar) {
>         ...
>         if (bar->p->q == 0)
>             return 1;
>         ...
>     }
> 
> This can generate instructions like below.
> 
>     ...
>     0x123456:  mov    0x18(%rdi), %rcx
>     0x12345a:  mov    0x10(%rcx), %rax     <=== sample
>     0x12345e:  test   %rax, %rax
>     0x123461:  je     <...>
>     ...
> 
> And imagine we have a sample at 0x12345a.  Then perf cannot find a
> variable for %rcx since DWARF didn't generate one (it only knows about
> 'bar').  Without compiler support, all it can do is track the code
> execution instruction by instruction and propagate the type info in each
> register and stack location by following the memory accesses.
> 
> Actually I found a discussion on the DWARF mailing list about supporting
> "inverted location lists" and it seems like a perfect fit for this
> project.  It'd be great if a new DWARF version provided a way to look up
> variable and type info from a concrete location (like a register number).
> 
>   https://lists.dwarfstd.org/pipermail/dwarf-discuss/2023-June/002278.html 
> 
> 
> * Patch structure
> 
> The patch 1-5 are cleanups and a fix that can be applied separately.
> The patch 6-21 are the main changes in perf report and perf annotate to
> support simple cases with a pointer variable.  The patch 22-33 are to
> improve it by handling global and local variables (without a pointer)
> and some edge cases.  The patch 34-43 implemented instruction tracking
> to infer data type when there's no variable for that.  The patch 44-47
> handles kernel-specific per-cpu variables (only for current CPU).  The
> patch 48 is to help debugging and is not intended for merge.
> 
> 
> * Limitations and future work
> 
> As I said earlier, this work is at a very early stage and has many
> limitations and room for improvement.  Basically it uses the objdump
> tool to extract location information from the sampled instruction.  And
> the parsing code and instruction tracking work on x86 only.
> 
> Actually there's a performance issue with getting the disassembly from
> objdump for the kernel.  On my system, GNU objdump was much slower than
> the LLVM one for some reason, so I had to pass the following option to
> each perf report and perf annotate run.
> 
>     $ sudo perf report --objdump=llvm-objdump ...
> 
>     # To save it in the config file and drop the command line option
>     $ sudo perf config annotate.objdump=llvm-objdump
> 
> Even with this change, most of the processing time was still spent in
> objdump getting the disassembly.  It'd be nice if we could get the
> result without using objdump at all.
> 
> Also, I only tested it with C programs (mostly vmlinux) and I believe
> there are many issues in handling C++ applications.  Probably other
> languages (like Rust?) could be supported too.  But even for C programs,
> it could be improved with better support for union and array types,
> handling of type casts, and so on.
> 
> I think the compiler could generate more DWARF information to help this
> kind of analysis.  Like I mentioned, there is no variable for
> intermediate pointers when they are chained: a->b->c.  Such a chain can
> be long, making it hard to track the type from the previous variable.
> If the compiler could generate (artificial) debug entries for the
> intermediate pointers with precise location expressions and type info,
> it would be really helpful.
> 
> And I plan to improve the analysis in perf tools with better integration
> into existing commands like perf mem and/or perf c2c.  It'd be pretty
> interesting to see per-struct or per-field access patterns for both load
> and store events at the same time.  Also, using data-source or snoop
> info for each struct/field would give some insight into optimizing
> memory usage or layout.
> 
> There are kernel-specific issues too.  Some per-cpu variable accesses
> create complex instruction patterns, so it was hard to determine which
> data/type they accessed.  For now, it just parses simple patterns for
> this-cpu accesses using the %gs segment register.  Also, it should
> handle self-modifying code like kprobes, ftrace, live patching and so
> on.  I guess they usually create an out-of-line copy of the modified
> instructions, but this needs more checking.  And I have no idea about
> the status of struct layout randomization and the DWARF info for the
> resulting struct.  There may be more issues I'm not aware of; please let
> me know if you notice something.
> 
> 
> * Summary
> 
> Despite all the issues, I believe this would be a good addition to our
> performance toolset.  It would help to observe memory overheads from a
> different angle and to optimize memory usage.  I'm really looking
> forward to any feedback.
> 
> The code is available at 'perf/data-profile-v1' branch in
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> 
> Enjoy,
> Namhyung
> 
> 
> Cc: Ben Woodard <woodard@redhat.com> 
> Cc: Joe Mario <jmario@redhat.com>
> CC: Kees Cook <keescook@chromium.org>
> Cc: David Blaikie <blaikie@google.com>
> Cc: Xu Liu <xliuprof@google.com>
> Cc: Kan Liang <kan.liang@linux.intel.com>
> Cc: Ravi Bangoria <ravi.bangoria@amd.com>
> 
> 
> Namhyung Kim (48):
>   perf annotate: Move raw_comment and raw_func_start
>   perf annotate: Check if operand has multiple regs
>   perf tools: Add util/debuginfo.[ch] files
>   perf dwarf-aux: Fix die_get_typename() for void *
>   perf dwarf-aux: Move #ifdef code to the header file
>   perf dwarf-aux: Add die_get_scopes() helper
>   perf dwarf-aux: Add die_find_variable_by_reg() helper
>   perf dwarf-aux: Factor out __die_get_typename()
>   perf dwarf-regs: Add get_dwarf_regnum()
>   perf annotate-data: Add find_data_type()
>   perf annotate-data: Add dso->data_types tree
>   perf annotate: Factor out evsel__get_arch()
>   perf annotate: Add annotate_get_insn_location()
>   perf annotate: Implement hist_entry__get_data_type()
>   perf report: Add 'type' sort key
>   perf report: Support data type profiling
>   perf annotate-data: Add member field in the data type
>   perf annotate-data: Update sample histogram for type
>   perf report: Add 'typeoff' sort key
>   perf report: Add 'symoff' sort key
>   perf annotate: Add --data-type option
>   perf annotate: Add --type-stat option for debugging
>   perf annotate: Add --insn-stat option for debugging
>   perf annotate-data: Parse 'lock' prefix from llvm-objdump
>   perf annotate-data: Handle macro fusion on x86
>   perf annotate-data: Handle array style accesses
>   perf annotate-data: Add stack operation pseudo type
>   perf dwarf-aux: Add die_find_variable_by_addr()
>   perf annotate-data: Handle PC-relative addressing
>   perf annotate-data: Support global variables
>   perf dwarf-aux: Add die_get_cfa()
>   perf annotate-data: Support stack variables
>   perf dwarf-aux: Check allowed DWARF Ops
>   perf dwarf-aux: Add die_collect_vars()
>   perf dwarf-aux: Handle type transfer for memory access
>   perf annotate-data: Introduce struct data_loc_info
>   perf map: Add map__objdump_2rip()
>   perf annotate: Add annotate_get_basic_blocks()
>   perf annotate-data: Maintain variable type info
>   perf annotate-data: Add update_insn_state()
>   perf annotate-data: Handle global variable access
>   perf annotate-data: Handle call instructions
>   perf annotate-data: Implement instruction tracking
>   perf annotate: Parse x86 segment register location
>   perf annotate-data: Handle this-cpu variables in kernel
>   perf annotate-data: Track instructions with a this-cpu variable
>   perf annotate-data: Add stack canary type
>   perf annotate-data: Add debug message
> 
>  tools/perf/Documentation/perf-report.txt      |    3 +
>  .../arch/loongarch/annotate/instructions.c    |    6 +-
>  tools/perf/arch/x86/util/dwarf-regs.c         |   38 +
>  tools/perf/builtin-annotate.c                 |  149 +-
>  tools/perf/builtin-report.c                   |   19 +-
>  tools/perf/util/Build                         |    2 +
>  tools/perf/util/annotate-data.c               | 1246 +++++++++++++++++
>  tools/perf/util/annotate-data.h               |  222 +++
>  tools/perf/util/annotate.c                    |  763 +++++++++-
>  tools/perf/util/annotate.h                    |  104 +-
>  tools/perf/util/debuginfo.c                   |  205 +++
>  tools/perf/util/debuginfo.h                   |   64 +
>  tools/perf/util/dso.c                         |    4 +
>  tools/perf/util/dso.h                         |    2 +
>  tools/perf/util/dwarf-aux.c                   |  561 +++++++-
>  tools/perf/util/dwarf-aux.h                   |   86 +-
>  tools/perf/util/dwarf-regs.c                  |   33 +
>  tools/perf/util/hist.h                        |    3 +
>  tools/perf/util/include/dwarf-regs.h          |   11 +
>  tools/perf/util/map.c                         |   20 +
>  tools/perf/util/map.h                         |    3 +
>  tools/perf/util/probe-finder.c                |  193 +--
>  tools/perf/util/probe-finder.h                |   19 +-
>  tools/perf/util/sort.c                        |  195 ++-
>  tools/perf/util/sort.h                        |    7 +
>  tools/perf/util/symbol_conf.h                 |    4 +-
>  26 files changed, 3703 insertions(+), 259 deletions(-)
>  create mode 100644 tools/perf/util/annotate-data.c
>  create mode 100644 tools/perf/util/annotate-data.h
>  create mode 100644 tools/perf/util/debuginfo.c
>  create mode 100644 tools/perf/util/debuginfo.h
> 
> 
> base-commit: 87cd3d48191e533cd9c224f2da1d78b3513daf47
> -- 
> 2.42.0.655.g421f12c284-goog
>
Andi Kleen Oct. 23, 2023, 9:58 p.m. UTC | #10
Namhyung Kim <namhyung@kernel.org> writes:

> Hello,
>
> I'm happy to share my work on data type profiling.  This associates
> PMU samples with the data types they refer to, using DWARF debug
> information.  So basically it depends on the quality of the PMU events
> and on the compiler producing DWARF info.  But it doesn't require any
> changes to the target program.
>
> As it's at an early stage, I've targeted the kernel on x86 to reduce the
> amount of work, but IIUC there's no fundamental blocker to applying it to
> other architectures and applications.

FWIW I posted a similar patchkit a long time ago

https://lore.kernel.org/lkml/20171128002321.2878-13-andi@firstfloor.org/

It was on my list to resurrect it; it's great that you are doing
something similar.

The latest iteration (not posted) was here:

https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=perf/var-resolve-7

The main difference seems to be that mine was more for perf script
(e.g. I supported PT decoding), while you are more focused on sampling.
I relied on the kprobes/uprobes engine, which unfortunately was always
quite slow and had many limitations.

Perhaps it would be possible merge the useful parts of the two approaches?

-Andi
Namhyung Kim Oct. 24, 2023, 7:16 p.m. UTC | #11
Hi Andi,

On Mon, Oct 23, 2023 at 2:58 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> Namhyung Kim <namhyung@kernel.org> writes:
>
> > Hello,
> >
> > I'm happy to share my work on data type profiling.  This associates
> > PMU samples with the data types they refer to, using DWARF debug
> > information.  So basically it depends on the quality of the PMU events
> > and on the compiler producing DWARF info.  But it doesn't require any
> > changes to the target program.
> >
> > As it's at an early stage, I've targeted the kernel on x86 to reduce the
> > amount of work, but IIUC there's no fundamental blocker to applying it to
> > other architectures and applications.
>
> FWIW i posted a similar patchkit a long time ago
>
> https://lore.kernel.org/lkml/20171128002321.2878-13-andi@firstfloor.org/
>
> It was on my list to resurrect that, it's great that you are doing
> something similar.
>
> The latest iteration (not posted) was here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=perf/var-resolve-7

Oh, I wasn't aware of this series.  I'll take a look.

>
> The main difference seems to be that mine was more for perf script
> (e.g. i supported PT decoding), while you are more focused on sampling.
> I relied on the kprobes/uprobes engine, which unfortunately was always
> quite slow and had many limitations.

Right, I think dealing with regular samples would be more useful.
But Intel PT support looks interesting.

>
> Perhaps it would be possible merge the useful parts of the two approaches?

Sounds good!  Thanks for your comment!
Namhyung
Andi Kleen Oct. 25, 2023, 2:09 a.m. UTC | #12
> 
> >
> > The main difference seems to be that mine was more for perf script
> > (e.g. i supported PT decoding), while you are more focused on sampling.
> > I relied on the kprobes/uprobes engine, which unfortunately was always
> > quite slow and had many limitations.
> 
> Right, I think dealing with regular samples would be more useful.

My code supported samples too, but only through perf script, not report.

See 

https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/commit/?h=perf/var-resolve-7&id=4775664750a6296acb732b7adfa224c6a06a126f

for an example.

My take was that I wasn't sure that perf report is the right interface
to visualize the variables changing; to be really usable you probably
need some plots and likely something like a UI.

For you, I think the focus is more on the types than on the individual
variables? That's a slightly different approach.

But then my engine had a lot of limitations, i suppose redoing that on
top of yours would give better results.


-Andi
Namhyung Kim Oct. 25, 2023, 5:51 a.m. UTC | #13
On Tue, Oct 24, 2023 at 7:09 PM Andi Kleen <ak@linux.intel.com> wrote:
>
> >
> > >
> > > The main difference seems to be that mine was more for perf script
> > > (e.g. i supported PT decoding), while you are more focused on sampling.
> > > I relied on the kprobes/uprobes engine, which unfortunately was always
> > > quite slow and had many limitations.
> >
> > Right, I think dealing with regular samples would be more useful.
>
> My code supported samples too, but only through perf script, not report.
>
> See
>
> https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/commit/?h=perf/var-resolve-7&id=4775664750a6296acb732b7adfa224c6a06a126f
>
> for an example.
>
> My take was that i wasn't sure that perf report is the right interface
> to visualize the variables changing -- to be really usable you probably
> need some plots and likely something like a UI.

I see.  Your concern is about seeing how the variables change.
But it seems you only displayed constant values.

>
> For you, I think the focus is more on the types than on the individual
> variables? That's a slightly different approach.

Right, you can see which fields in a struct are accessed
the most and then probably change the layout for a better result.

>
> But then my engine had a lot of limitations, i suppose redoing that on
> top of yours would give better results.

Sounds good, thanks.
Namhyung
Namhyung Kim Oct. 25, 2023, 5:58 a.m. UTC | #14
Hello,

On Tue, Oct 24, 2023 at 8:07 PM Jason Merrill <jason@redhat.com> wrote:
>
> On Thu, Oct 12, 2023 at 12:44 PM Namhyung Kim <namhyung@kernel.org> wrote:
>>
>> On Thu, Oct 12, 2023 at 2:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> > > This can generate instructions like below.
>> > >
>> > >     ...
>> > >     0x123456:  mov    0x18(%rdi), %rcx
>> > >     0x12345a:  mov    0x10(%rcx), %rax     <=== sample
>> > >     0x12345e:  test   %rax, %rax
>> > >     0x123461:  je     <...>
>> > >     ...
>> > >
>> > > And imagine we have a sample at 0x12345a.  Then it cannot find a
>> > > variable for %rcx since DWARF didn't generate one (it only knows about
>> > > 'bar').  Without compiler support, all it can do is to track the code
>> > > execution in each instruction and propagate the type info in each
>> > > register and stack location by following the memory access.
>> >
>> > Right, this has more or less been the 'excuse' for why doing this has
>> > been 'difficult' for the past 10+ years :/
>>
>> I'm sure I missed some cases, but I managed to make it work in
>> the usual cases.  We can improve it by handling more cases and
>> instructions, but it'd be great if we had better support from the
>> toolchains.
>
>
> How helpful would it be for the compiler to generate an unnamed DW_TAG_variable for the temporary in %rcx?

That'd be fantastic, and that's exactly what I'm asking. :)

>
>>
>> > > Actually I found a discussion in the DWARF mailing list to support
>> > > "inverted location lists" and it seems a perfect fit for this project.
>> > > It'd be great if new DWARF would provide a way to lookup variable and
>> > > type info using a concrete location info (like a register number).
>> > >
>> > >   https://lists.dwarfstd.org/pipermail/dwarf-discuss/2023-June/002278.html
>> >
>> > Stephane was going to talk to tools people about this over 10 years ago
>> > :-)
>>
>> Hope that they would make some progress.
>
>
> This seems expensive in debug info size for a lookup optimization; better IMO to construct the reverse lookup table from the existing information, as it sounds like you are doing.  Though it would be good to cache that table between runs, and ideally share it with other interested tools.

Probably.  Then it still needs a standard way to express such data.
Maybe we can have an option to generate the table.

Thanks,
Namhyung
Andi Kleen Oct. 25, 2023, 8:01 p.m. UTC | #15
On Tue, Oct 24, 2023 at 10:51:41PM -0700, Namhyung Kim wrote:
> On Tue, Oct 24, 2023 at 7:09 PM Andi Kleen <ak@linux.intel.com> wrote:
> >
> > >
> > > >
> > > > The main difference seems to be that mine was more for perf script
> > > > (e.g. i supported PT decoding), while you are more focused on sampling.
> > > > I relied on the kprobes/uprobes engine, which unfortunately was always
> > > > quite slow and had many limitations.
> > >
> > > Right, I think dealing with regular samples would be more useful.
> >
> > My code supported samples too, but only through perf script, not report.
> >
> > See
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/commit/?h=perf/var-resolve-7&id=4775664750a6296acb732b7adfa224c6a06a126f
> >
> > for an example.
> >
> > My take was that i wasn't sure that perf report is the right interface
> > to visualize the variables changing -- to be really usable you probably
> > need some plots and likely something like a UI.
> 
> I see.  Your concern is to see how variables are changing.
> But it seems you only displayed constant values.

Yes, the examples were not very good, but that was the intention.
Values can be much more powerful than types alone!

For PT I also had a special compiler patch that added suitable ptwrites
(see [1]) that allowed tracking any variable.

-Andi

[1] https://github.com/andikleen/gcc-old-svn/tree/ptwrite-18
Joe Mario Nov. 8, 2023, 5:12 p.m. UTC | #16
On 10/11/23 11:50 PM, Namhyung Kim wrote:
> Hello,
> 
> I'm happy to share my work on data type profiling.  This is to associate
> PMU samples with the data types they refer to, using DWARF debug
> information.  So basically it depends on the quality of the PMU events
> and of the compiler producing the DWARF info.  But it doesn't require
> any changes in the target program.
> 
> As it's at an early stage, I've targeted the kernel on x86 to reduce the
> amount of work, but IIUC there's no fundamental blocker to applying it
> to other architectures and applications.
> 
> 
> * How to use it
> 
> To get precise memory access samples, users can use the `perf mem record`
> command to utilize the events supported by their architecture.  Intel
> machines would work best as they have dedicated memory access events, but
> they have a filter to ignore low-latency loads, e.g. those under 30
> cycles (use the --ldlat option to change the default value).
> 
>     # To get memory access samples in kernel for 1 second (on Intel)
>     $ sudo perf mem record -a -K --ldlat=4 -- sleep 1
> 
>     # Similar for the AMD (but it requires 6.3+ kernel for BPF filters)
>     $ sudo perf mem record -a --filter 'mem_op == load, ip > 0x8000000000000000' -- sleep 1
> 
> Note that it uses the 'sudo' command because it's collecting the events
> in system wide mode.  Actually it depends on the sysctl setting of
> kernel.perf_event_paranoid.  AMD still needs root due to the BPF filter
> though.
> 
> After getting the profile data, you would run perf report or perf
> annotate as usual to see the result.  Make sure that you have a kernel
> debug package installed or a vmlinux with DWARF info.
> 
> I've added new options and sort keys to enable data type profiling.
> Probably I need to add it to the perf mem or perf c2c command for a
> better user experience.  I'm open to discussion about how we can make it
> simpler and more intuitive for regular users.  But let's talk about the
> lower level interface for now.
> 
> In perf report, it's just a matter of selecting the new sort keys:
> 'type' and 'typeoff'.  The 'type' key shows the name of the data type as
> a whole while 'typeoff' shows the name of the field in the data type.  I
> found it useful to use them with the --hierarchy option to group
> relevant entries in the same level.
> 
>     $ sudo perf report -s type,typeoff --hierarchy --stdio
>     ...
>     #
>     #    Overhead  Data Type / Data Type Offset
>     # ...........  ............................
>     #
>         23.95%     (stack operation)
>            23.95%     (stack operation) +0 (no field)
>         23.43%     (unknown)
>            23.43%     (unknown) +0 (no field)
>         10.30%     struct pcpu_hot
>             4.80%     struct pcpu_hot +0 (current_task)
>             3.53%     struct pcpu_hot +8 (preempt_count)
>             1.88%     struct pcpu_hot +12 (cpu_number)
>             0.07%     struct pcpu_hot +24 (top_of_stack)
>             0.01%     struct pcpu_hot +40 (softirq_pending)
>          4.25%     struct task_struct
>             1.48%     struct task_struct +2036 (rcu_read_lock_nesting)
>             0.53%     struct task_struct +2040 (rcu_read_unlock_special.b.blocked)
>             0.49%     struct task_struct +2936 (cred)
>             0.35%     struct task_struct +3144 (audit_context)
>             0.19%     struct task_struct +46 (flags)
>             0.17%     struct task_struct +972 (policy)
>             0.15%     struct task_struct +32 (stack)
>             0.15%     struct task_struct +8 (thread_info.syscall_work)
>             0.10%     struct task_struct +976 (nr_cpus_allowed)
>             0.09%     struct task_struct +2272 (mm)
>         ...
> 
> The (stack operation) and (unknown) entries have no type and field info.
> FYI, the stack operations are samples in PUSH, POP or RET instructions
> which save or restore registers from/to the stack.  They are usually
> part of the function prologue and epilogue and have no type info.  Next
> is struct pcpu_hot, and you can see the first field (current_task) at
> offset 0 was accessed the most.  Fields are listed in order of access
> frequency (not offset), as you can see for task_struct.
> 
> In perf annotate, a new --data-type option was added to enable data
> field level annotation.  For now it only shows the number of samples for
> each field, but we can improve it.
> 
>     $ sudo perf annotate --data-type
>     Annotate type: 'struct pcpu_hot' in [kernel.kallsyms] (223 samples):
>     ============================================================================
>         samples     offset       size  field
>             223          0         64  struct pcpu_hot       {
>             223          0         64      union     {
>             223          0         48          struct        {
>              78          0          8              struct task_struct*      current_task;
>              98          8          4              int      preempt_count;
>              45         12          4              int      cpu_number;
>               0         16          8              u64      call_depth;
>               1         24          8              long unsigned int        top_of_stack;
>               0         32          8              void*    hardirq_stack_ptr;
>               1         40          2              u16      softirq_pending;
>               0         42          1              bool     hardirq_stack_inuse;
>                                                };
>             223          0         64          u8*  pad;
>                                            };
>                                        };
>     ...
> 
> This shows each struct one by one with field-level access info in a
> C-like style.  The number of samples for the outer struct is the sum of
> the samples of every field in the struct.  In unions, all fields are
> placed at the same offset so they have the same number of samples.
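
The roll-up rule can be modeled with a few lines of Python; the field
names and counts mirror the struct pcpu_hot output above, purely for
illustration:

```python
# Sketch (not perf's actual code) of how per-field sample counts roll up:
# a struct's count is the sum of its fields' counts, while every member
# of a union shares the count of the union itself.

# Per-field sample counts from the inner anonymous struct of pcpu_hot.
field_samples = {"current_task": 78, "preempt_count": 98,
                 "cpu_number": 45, "top_of_stack": 1,
                 "softirq_pending": 1}

struct_total = sum(field_samples.values())   # inner struct: 223
union_total = struct_total                   # union members overlap, so
pad_total = union_total                      # the 'pad' member is also 223

print(struct_total, pad_total)
```

This is why 'pad' shows the full 223 samples in the listing above even
though it is never referenced by name: it aliases the whole union.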
> 
> No TUI support yet.
> 
> 
> * How it works
> 
> The basic idea is to use the DWARF location expressions in debug entries
> for variables.  Say we got a sample at the instruction below:
> 
>     0x123456:  mov    0x18(%rdi), %rcx
> 
> Then we know the instruction at 0x123456 is accessing a memory region at
> offset 0x18 from the base address held in the %rdi register.  DWARF
> would have a debug info entry for a function or a block which covers
> that address.  For example, we might have something like this:
> 
>     <1><100>: Abbrev Number: 10 (DW_TAG_subroutine_type)
>        <101>    DW_AT_name       : (indirect string, offset: 0x184e6): foo
>        <105>    DW_AT_type       : <0x29ad7>
>        <106>    DW_AT_low_pc     : 0x123400
>        <10e>    DW_AT_high_pc    : 0x1234ff
>     <2><116>: Abbrev Number: 8 (DW_TAG_formal_parameter)
>        <117>    DW_AT_name       : (indirect string, offset: 0x18527): bar
>        <11b>    DW_AT_type       : <0x29b3a>
>        <11c>    DW_AT_location   : 1 byte block: 55    (DW_OP_reg5 (rdi))
> 
> So the function 'foo' covers the instructions from 0x123400 to 0x1234ff,
> and we know the sampled instruction belongs to the function.  And it has
> a parameter called 'bar' which is located in the %rdi register.  Then we
> know the instruction is using the variable bar and its type would be a
> pointer (to a struct).  We can follow the type info of bar and verify
> the access by checking the size of the (struct) type and the offset in
> the instruction (0x18).
> 
> Well.. this is a simple example where 'bar' has a single location.
> Other variables might be located in various places over time, but that
> should be covered by the location list of the debug entry.  Therefore,
> as long as DWARF produces a correct location expression for a variable,
> it should be possible to find the variable using the location info.
> 
> Global variables and local variables are different as they can be
> accessed directly without a pointer.  They are located at an absolute
> address or at a position relative to the current stack frame, so it
> needs to handle such location expressions as well.
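
To illustrate the lookup described above, here is a rough sketch in
Python.  The tables are hypothetical stand-ins for what would be derived
from DW_AT_location and DW_TAG_member entries; this is not perf's actual
code, just a model of the resolution step:

```python
# Sketch: resolve a sampled memory operand like "0x18(%rdi)" to a data
# type and field using DWARF-derived information.

# Per-PC variable table: register -> (variable name, pointed-to type),
# as recovered from DW_OP_regN location expressions (hypothetical data).
VARS_AT_PC = {
    0x123456: {"rdi": ("bar", "struct baz")},
}

# Struct layouts from DW_TAG_member DIEs: type -> [(offset, size, field)]
# (hypothetical layout for illustration).
LAYOUTS = {
    "struct baz": [(0x00, 8, "x"), (0x18, 8, "p"), (0x20, 8, "q")],
}

def find_data_type(pc, base_reg, disp):
    """Map a sample at 'pc' accessing disp(base_reg) to (type, field)."""
    var = VARS_AT_PC.get(pc, {}).get(base_reg)
    if var is None:
        return None          # no variable covers this register at this PC
    name, type_name = var
    for off, size, field in LAYOUTS[type_name]:
        if off <= disp < off + size:
            return type_name, field
    return type_name, None   # access past known members (e.g. a cast)

print(find_data_type(0x123456, "rdi", 0x18))
```

A real implementation walks location lists per PC range and follows DIE
references instead of fixed tables, but the mapping is essentially this.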
> 
> However, some memory accesses don't have a matching variable.  For
> example, you may have a pointer variable for a struct which contains
> other pointers, and then you can dereference them directly without an
> intermediate variable.  Consider the following source code.
> 
>     int foo(struct baz *bar) {
>         ...
>         if (bar->p->q == 0)
>             return 1;
>         ...
>     }
> 
> This can generate instructions like below.
> 
>     ...
>     0x123456:  mov    0x18(%rdi), %rcx
>     0x12345a:  mov    0x10(%rcx), %rax     <=== sample
>     0x12345e:  test   %rax, %rax
>     0x123461:  je     <...>
>     ...
> 
> And imagine we have a sample at 0x12345a.  Then it cannot find a
> variable for %rcx since DWARF didn't generate one (it only knows about
> 'bar').  Without compiler support, all it can do is track the code
> execution instruction by instruction and propagate the type info through
> each register and stack location by following the memory accesses.
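
The instruction-tracking idea can be sketched like this; the type names
and member tables are made up for illustration and are not perf's actual
data structures:

```python
# Sketch: when no DWARF variable covers a register, walk the preceding
# instructions and propagate pointer types through "mov off(src), dst".

POINTER_MEMBERS = {
    # pointer type -> {offset: type of the member loaded at that offset}
    # (hypothetical types for the a->b->c chain in the example above)
    "struct baz *": {0x18: "struct p_t *"},
    "struct p_t *": {0x10: "long"},
}

def track(insns, sample_idx, reg_types):
    """Propagate types across 'mov off(src), dst' instructions up to the
    sampled one, then report the type/member of the sampled access."""
    for i, (off, src, dst) in enumerate(insns):
        src_type = reg_types.get(src)
        member = POINTER_MEMBERS.get(src_type, {}).get(off)
        if i == sample_idx:
            return src_type, member     # type of the sampled access
        reg_types[dst] = member         # dst now holds the loaded value

# bar (struct baz *) is in %rdi at function entry, per DWARF.
insns = [(0x18, "rdi", "rcx"),          # mov 0x18(%rdi), %rcx
         (0x10, "rcx", "rax")]          # mov 0x10(%rcx), %rax  <== sample
print(track(insns, 1, {"rdi": "struct baz *"}))
```

The real tracker must also handle arithmetic, calls, stack slots and
basic-block boundaries, which is where most of the complexity lies.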
> 
> Actually I found a discussion in the DWARF mailing list to support
> "inverted location lists" and it seems a perfect fit for this project.
> It'd be great if new DWARF would provide a way to lookup variable and
> type info using a concrete location info (like a register number).
> 
>   https://lists.dwarfstd.org/pipermail/dwarf-discuss/2023-June/002278.html 
> 
> 
> * Patch structure
> 
> The patch 1-5 are cleanups and a fix that can be applied separately.
> The patch 6-21 are the main changes in perf report and perf annotate to
> support simple cases with a pointer variable.  The patch 22-33 are to
> improve it by handling global and local variables (without a pointer)
> and some edge cases.  The patch 34-43 implemented instruction tracking
> to infer data type when there's no variable for that.  The patch 44-47
> handles kernel-specific per-cpu variables (only for current CPU).  The
> patch 48 is to help debugging and is not intended for merge.
> 
> 
> * Limitations and future work
> 
> As I said earlier, this work is at a very early stage and has many
> limitations and much room for improvement.  Basically it uses the
> objdump tool to extract location information from the sampled
> instructions.  And the parsing code and instruction tracking work on x86
> only.
> 
> Actually there's a performance issue with getting the disassembly from
> objdump for the kernel.  On my system, GNU objdump was much slower than
> the one from LLVM for some reason, so I had to pass the following option
> to each perf report and perf annotate run.
> 
>     $ sudo perf report --objdump=llvm-objdump ...
> 
>     # To save it in the config file and drop the command line option
>     $ sudo perf config annotate.objdump=llvm-objdump
> 
> Even with this change, most of the processing time was still spent in
> objdump getting the disassembly.  It'd be nice if we could get the
> result without using objdump at all.
> 
> Also I only tested it with C programs (mostly vmlinux) and I believe
> there are many issues in handling C++ applications.  Probably other
> languages (like Rust?) could be supported too.  But even for C programs,
> there is room for improvement, like better support for union and array
> types, dealing with type casts, and so on.
> 
> I think the compiler could generate more DWARF information to help this
> kind of analysis.  Like I mentioned, it doesn't have a variable for
> intermediate pointers when they are chained: a->b->c.  The chain could
> be longer, making it hard to track the type from the previous variable.
> If the compiler could generate (artificial) debug entries for the
> intermediate pointers with precise location expressions and type info,
> it would be really helpful.
> 
> And I plan to improve the analysis in perf tools with better integration
> with the existing commands like perf mem and/or perf c2c.  It'd be
> pretty interesting to see per-struct or per-field access patterns for
> both load and store events at the same time.  Also using data-source or
> snoop info for each struct/field would give some insights into
> optimizing memory usage or layout.
> 
> There are kernel-specific issues too.  Some per-cpu variable accesses
> created complex instruction patterns so it was hard to determine which
> data/type was accessed.  For now, it just parses simple patterns for
> this-cpu accesses using the %gs segment register.  Also it should handle
> self-modifying code like kprobes, ftrace, live patching and so on.  I
> guess they would usually create an out-of-line copy of the modified
> instructions, but that needs more checking.  And I have no idea about
> the status of struct layout randomization and the DWARF info of the
> resulting structs.  Maybe there are more issues I'm not aware of, please
> let me know if you notice something.
> 
> 
> * Summary
> 
> Despite all the issues, I believe this would be a good addition to our
> performance toolset.  It would help to observe memory overheads from a
> different angle and to optimize memory usage.  I'm really looking
> forward to hearing any feedback.
> 
> The code is available at 'perf/data-profile-v1' branch in
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
> 
> Enjoy,
> Namhyung
> 
> 
> Cc: Ben Woodard <woodard@redhat.com> 
> Cc: Joe Mario <jmario@redhat.com>
> CC: Kees Cook <keescook@chromium.org>
> Cc: David Blaikie <blaikie@google.com>
> Cc: Xu Liu <xliuprof@google.com>
> Cc: Kan Liang <kan.liang@linux.intel.com>
> Cc: Ravi Bangoria <ravi.bangoria@amd.com>
> 
> 
> Namhyung Kim (48):
>   perf annotate: Move raw_comment and raw_func_start
>   perf annotate: Check if operand has multiple regs
>   perf tools: Add util/debuginfo.[ch] files
>   perf dwarf-aux: Fix die_get_typename() for void *
>   perf dwarf-aux: Move #ifdef code to the header file
>   perf dwarf-aux: Add die_get_scopes() helper
>   perf dwarf-aux: Add die_find_variable_by_reg() helper
>   perf dwarf-aux: Factor out __die_get_typename()
>   perf dwarf-regs: Add get_dwarf_regnum()
>   perf annotate-data: Add find_data_type()
>   perf annotate-data: Add dso->data_types tree
>   perf annotate: Factor out evsel__get_arch()
>   perf annotate: Add annotate_get_insn_location()
>   perf annotate: Implement hist_entry__get_data_type()
>   perf report: Add 'type' sort key
>   perf report: Support data type profiling
>   perf annotate-data: Add member field in the data type
>   perf annotate-data: Update sample histogram for type
>   perf report: Add 'typeoff' sort key
>   perf report: Add 'symoff' sort key
>   perf annotate: Add --data-type option
>   perf annotate: Add --type-stat option for debugging
>   perf annotate: Add --insn-stat option for debugging
>   perf annotate-data: Parse 'lock' prefix from llvm-objdump
>   perf annotate-data: Handle macro fusion on x86
>   perf annotate-data: Handle array style accesses
>   perf annotate-data: Add stack operation pseudo type
>   perf dwarf-aux: Add die_find_variable_by_addr()
>   perf annotate-data: Handle PC-relative addressing
>   perf annotate-data: Support global variables
>   perf dwarf-aux: Add die_get_cfa()
>   perf annotate-data: Support stack variables
>   perf dwarf-aux: Check allowed DWARF Ops
>   perf dwarf-aux: Add die_collect_vars()
>   perf dwarf-aux: Handle type transfer for memory access
>   perf annotate-data: Introduce struct data_loc_info
>   perf map: Add map__objdump_2rip()
>   perf annotate: Add annotate_get_basic_blocks()
>   perf annotate-data: Maintain variable type info
>   perf annotate-data: Add update_insn_state()
>   perf annotate-data: Handle global variable access
>   perf annotate-data: Handle call instructions
>   perf annotate-data: Implement instruction tracking
>   perf annotate: Parse x86 segment register location
>   perf annotate-data: Handle this-cpu variables in kernel
>   perf annotate-data: Track instructions with a this-cpu variable
>   perf annotate-data: Add stack canary type
>   perf annotate-data: Add debug message
> 
>  tools/perf/Documentation/perf-report.txt      |    3 +
>  .../arch/loongarch/annotate/instructions.c    |    6 +-
>  tools/perf/arch/x86/util/dwarf-regs.c         |   38 +
>  tools/perf/builtin-annotate.c                 |  149 +-
>  tools/perf/builtin-report.c                   |   19 +-
>  tools/perf/util/Build                         |    2 +
>  tools/perf/util/annotate-data.c               | 1246 +++++++++++++++++
>  tools/perf/util/annotate-data.h               |  222 +++
>  tools/perf/util/annotate.c                    |  763 +++++++++-
>  tools/perf/util/annotate.h                    |  104 +-
>  tools/perf/util/debuginfo.c                   |  205 +++
>  tools/perf/util/debuginfo.h                   |   64 +
>  tools/perf/util/dso.c                         |    4 +
>  tools/perf/util/dso.h                         |    2 +
>  tools/perf/util/dwarf-aux.c                   |  561 +++++++-
>  tools/perf/util/dwarf-aux.h                   |   86 +-
>  tools/perf/util/dwarf-regs.c                  |   33 +
>  tools/perf/util/hist.h                        |    3 +
>  tools/perf/util/include/dwarf-regs.h          |   11 +
>  tools/perf/util/map.c                         |   20 +
>  tools/perf/util/map.h                         |    3 +
>  tools/perf/util/probe-finder.c                |  193 +--
>  tools/perf/util/probe-finder.h                |   19 +-
>  tools/perf/util/sort.c                        |  195 ++-
>  tools/perf/util/sort.h                        |    7 +
>  tools/perf/util/symbol_conf.h                 |    4 +-
>  26 files changed, 3703 insertions(+), 259 deletions(-)
>  create mode 100644 tools/perf/util/annotate-data.c
>  create mode 100644 tools/perf/util/annotate-data.h
>  create mode 100644 tools/perf/util/debuginfo.c
>  create mode 100644 tools/perf/util/debuginfo.h
> 
> 
> base-commit: 87cd3d48191e533cd9c224f2da1d78b3513daf47

Hi Namhyung:

I've been playing with your datatype profile patch and it looks really promising.
I think it would be a big help if it could be integrated into perf c2c.  

Perf c2c gives a great insight into what's contributing to cpu cacheline contention, but it
can be difficult to understand the output.  Having visuals with your datatype profile output
would be a big help.

I have a simple test program with readers and writers tugging on the data below:

  uint64_t hotVar; 
  typedef struct __foo {
     uint64_t m1;
     uint64_t m2;
     uint64_t m3;
  } FOO;

The rest of this reply looks at both your datatype output and c2c to see where they 
might complement each other.


When I run perf with your patches on a simple program to cause contention on the above data, I get the following:

# perf mem record --ldlat=1 --all-user --  ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
# perf report -s type,typeoff --hierarchy --stdio 

   # Samples: 26K of event 'cpu/mem-loads,ldlat=1/P'
   # Event count (approx.): 2958226
   #
   #    Overhead  Data Type / Data Type Offset
   # ...........  ............................
   #
       54.50%     int      
          54.50%     int +0 (no field)
       23.21%     long int 
          23.21%     long int +0 (no field)
       18.30%     struct __foo
           9.57%     struct __foo +8 (m2)
           8.73%     struct __foo +0 (m1)
        3.86%     long unsigned int
           3.86%     long unsigned int +0 (no field)
       <snip>  
   
   # Samples: 30K of event 'cpu/mem-stores/P'
   # Event count (approx.): 33880197
   #
   #    Overhead  Data Type / Data Type Offset
   # ...........  ............................
   #
       99.85%     struct __foo
          70.48%     struct __foo +0 (m1)
          29.34%     struct __foo +16 (m3)
           0.03%     struct __foo +8 (m2)
        0.09%     long unsigned int
           0.09%     long unsigned int +0 (no field)
        0.06%     (unknown)
           0.06%     (unknown) +0 (no field)
       <snip>  
   
Then I run perf annotate with your patches, and I get the following:

  # perf annotate  --data-type 

   Annotate type: 'long int' in /home/joe/tugtest/tugtest (2901 samples):
   ============================================================================
       samples     offset       size  field
          2901          0          8  long int	;
   
   Annotate type: 'struct __foo' in /home/joe/tugtest/tugtest (5593 samples):
   ============================================================================
       samples     offset       size  field
          5593          0         24  struct __foo	 {
          2755          0          8      uint64_t	m1;
          2838          8          8      uint64_t	m2;
             0         16          8      uint64_t	m3;
                                      };

Now when I run that same simple test using perf c2c and focus on the cacheline that the struct and hotVar reside in, I get:

# perf c2c record --all-user -- ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
# perf c2c report -NNN --stdio 
# <snip>
#
#      ----- HITM -----  ------- Store Refs ------  ------ Data address ------                ---------- cycles ----------    Total    cpu               Shared                     
# Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records    cnt      Symbol   Object    Source:Line  Node{cpu list}
#....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
#
 ---------------------------------------------------------------
    0     1094     2008    17071    13762        0      0x406100
 ---------------------------------------------------------------
         0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
         0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
        68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
        31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
         0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
         0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
         0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
         0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
         0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}

With the above c2c output, we can see:
 - the hottest contended addresses, and the load latencies they caused.
 - the cacheline offset for the contended addresses.
 - the cpus and numa nodes where the accesses came from.
 - the cacheline alignment for the data of interest.
 - the number of cpus and threads concurrently accessing each address.
 - the breakdown of reads causing HITM (contention) and writes hitting or missing the cacheline.
 - the object name, source line and line number for where the accesses occurred.
 - the numa node where the data is allocated.
 - the number of physical pages the virtual addresses were mapped to (e.g. numa_balancing).

What would really help make the c2c output more usable is a better visual presentation.
It's likely the current c2c output can be trimmed a bit.

Here's one idea that incorporates your datatype info, though I'm sure there are better ways, as this may get unwieldy:
     
#      ----- HITM -----  ------- Store Refs ------  ------ Data address ------                ---------- cycles ----------    Total    cpu               Shared                     
# Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records    cnt      Symbol   Object    Source:Line  Node{cpu list}
#....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
#
 ---------------------------------------------------------------
    0     1094     2008    17071    13762        0      0x406100
 ---------------------------------------------------------------
  uint64_t hotVar: tugtest.c:38
         0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
         0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
  struct __foo uint64_t m1: tugtest.c:39
        68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
        31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
         0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
  struct __foo uint64_t m2: tugtest.c:40
         0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
         0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
         0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
  struct __foo uint64_t m3: tugtest.c:41
         0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}

And then it would be good to find a clean way to incorporate your sample counts.

On a related note, is there a way the accesses could be broken down into read counts 
and write counts?   That, with the above source line info for all the accesses, 
helps to convey a picture of "the affinity of the accesses".  

For example, while it's normally good to separate read-mostly data from hot 
written data, if the reads and writes are done together in the same block of 
code by the same thread, then keeping the two data symbols in the same cacheline
could be a win.  I've seen this often. Your datatype info might be able to 
make these affinities more visible to the user.

Thanks for doing this. This is great.
Joe
Namhyung Kim Nov. 9, 2023, 4:48 a.m. UTC | #17
Hello,

On Wed, Nov 8, 2023 at 9:12 AM Joe Mario <jmario@redhat.com> wrote:
>
> Hi Namhyung:
>
> I've been playing with your datatype profile patch and it looks really promising.
> I think it would be a big help if it could be integrated into perf c2c.

Great!  Yeah, I think we can collaborate on it.

>
> Perf c2c gives a great insight into what's contributing to cpu cacheline contention, but it
> can be difficult to understand the output.  Having visuals with your datatype profile output
> would be a big help.

Exactly.

>
> I have a simple test program with readers and writers tugging on the data below:
>
>   uint64_t hotVar;
>   typedef struct __foo {
>      uint64_t m1;
>      uint64_t m2;
>      uint64_t m3;
>   } FOO;
>
> The rest of this reply looks at both your datatype output and c2c to see where they
> might complement each other.
>
>
> When I run perf with your patches on a simple program to cause contention on the above data, I get the following:
>
> # perf mem record --ldlat=1 --all-user --  ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
> # perf report -s type,typeoff --hierarchy --stdio
>
>    # Samples: 26K of event 'cpu/mem-loads,ldlat=1/P'
>    # Event count (approx.): 2958226
>    #
>    #    Overhead  Data Type / Data Type Offset
>    # ...........  ............................
>    #
>        54.50%     int
>           54.50%     int +0 (no field)
>        23.21%     long int
>           23.21%     long int +0 (no field)
>        18.30%     struct __foo
>            9.57%     struct __foo +8 (m2)
>            8.73%     struct __foo +0 (m1)
>         3.86%     long unsigned int
>            3.86%     long unsigned int +0 (no field)
>        <snip>
>
>    # Samples: 30K of event 'cpu/mem-stores/P'
>    # Event count (approx.): 33880197
>    #
>    #    Overhead  Data Type / Data Type Offset
>    # ...........  ............................
>    #
>        99.85%     struct __foo
>           70.48%     struct __foo +0 (m1)
>           29.34%     struct __foo +16 (m3)
>            0.03%     struct __foo +8 (m2)
>         0.09%     long unsigned int
>            0.09%     long unsigned int +0 (no field)
>         0.06%     (unknown)
>            0.06%     (unknown) +0 (no field)
>        <snip>
>
> Then I run perf annotate with your patches, and I get the following:
>
>   # perf annotate  --data-type
>
>    Annotate type: 'long int' in /home/joe/tugtest/tugtest (2901 samples):
>    ============================================================================
>        samples     offset       size  field
>           2901          0          8  long int  ;
>
>    Annotate type: 'struct __foo' in /home/joe/tugtest/tugtest (5593 samples):
>    ============================================================================
>        samples     offset       size  field
>           5593          0         24  struct __foo       {
>           2755          0          8      uint64_t      m1;
>           2838          8          8      uint64_t      m2;
>              0         16          8      uint64_t      m3;
>                                       };
>
> Now when I run that same simple test using perf c2c, and I focus on the cacheline that the struct and hotVar reside in, I get:
>
> # perf c2c record --all-user -- ./tugtest -r3 -r5 -r7 -r9 -r11 -w10 -w12 -w14 -w16 -b5 -H2000000
> # perf c2c report -NNN --stdio
> # <snip>
> #
> #      ----- HITM -----  ------- Store Refs ------  ------ Data address ------                ---------- cycles ----------    Total    cpu               Shared
> # Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records    cnt      Symbol   Object    Source:Line  Node{cpu list}
> #....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
> #
>  ---------------------------------------------------------------
>     0     1094     2008    17071    13762        0      0x406100
>  ---------------------------------------------------------------
>          0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>          0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>         68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>         31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>          0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>          0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>          0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>          0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>          0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>
> With the above c2c output, we can see:
>  - the hottest contended addresses, and the load latencies they caused.
>  - the cacheline offset for the contended addresses.
>  - the cpus and numa nodes where the accesses came from.
>  - the cacheline alignment for the data of interest.
>  - the number of cpus and threads concurrently accessing each address.
>  - the breakdown of reads causing HITM (contention) and writes hitting or missing the cacheline.
>  - the object name, source line and line number for where the accesses occurred.
>  - the numa node where the data is allocated.
>  - the number of physical pages the virtual addresses were mapped to (e.g. numa_balancing).
>
> What would really make the c2c output more usable is a better visual presentation.
> It's likely the current c2c output can be trimmed a bit.
>
> Here's one idea that incorporates your datatype info, though I'm sure there are better ways, as this may get unwieldy:
>
> #      ----- HITM -----  ------- Store Refs ------  ------ Data address ------                ---------- cycles ----------    Total    cpu               Shared
> # Num  RmtHitm  LclHitm   L1 Hit  L1 Miss      N/A        Offset  Node  PA cnt  Code address  rmt hitm  lcl hitm      load  records    cnt      Symbol   Object    Source:Line  Node{cpu list}
> #....  .......  .......  .......  .......  .......  ............  ....  ......  ............  ........  ........  ........  .......  .....  ..........  .......  .............  ....
> #
>  ---------------------------------------------------------------
>     0     1094     2008    17071    13762        0      0x406100
>  ---------------------------------------------------------------
>   uint64_t hotVar: tugtest.c:38
>          0.00%    0.20%    0.00%    0.00%    0.00%           0x8     1       1      0x401355         0       978      1020     2962      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>          0.00%    0.00%    0.12%    0.02%    0.00%           0x8     1       1      0x401360         0         0         0       23      4  [.] writer  tugtest  tugtest.c:129   0{10,12,14,16}
>   struct __foo uint64_t m1: tugtest.c:39
>         68.10%   60.26%    0.00%    0.00%    0.00%          0x10     1       1      0x401505      2181      1541      1393     5813      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>         31.63%   39.34%    0.00%    0.00%    0.00%          0x10     1       1      0x401331      1242      1095       936     3393      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>          0.00%    0.00%   40.03%   40.25%    0.00%          0x10     1       1      0x40133c         0         0         0    12372      4  [.] writer  tugtest  tugtest.c:127   0{10,12,14,16}
>   struct __foo uint64_t m2: tugtest.c:40
>          0.27%    0.15%    0.00%    0.00%    0.00%          0x18     1       1      0x401343       834      1136      1032     2930      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>          0.00%    0.05%    0.00%    0.00%    0.00%          0x18     1       1      0x40150c         0       933      1567     5050      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>          0.00%    0.00%    0.06%    0.00%    0.00%          0x18     1       1      0x40134e         0         0         0       10      4  [.] writer  tugtest  tugtest.c:128   0{10,12,14,16}
>   struct __foo uint64_t m3: tugtest.c:41
>          0.00%    0.00%   59.80%   59.73%    0.00%          0x20     1       1      0x401516         0         0         0    18428      5  [.] reader  tugtest  tugtest.c:163   1{3,5,7,9,11}
>
> And then it would be good to find a clean way to incorporate your sample counts.

I'm not sure we can get the exact source line for the data type/fields.
Of course, we can aggregate the results for each field.  Actually you
can use `perf report -s type,typeoff,symoff --hierarchy` for something
similar. :)

>
> On a related note, is there a way the accesses could be broken down into read counts
> and write counts?   That, with the above source line info for all the accesses,
> helps to convey a picture of "the affinity of the accesses".

Sure, perf report already supports showing events in a group
together.  You can use the --group option to force grouping of
individual events.  perf annotate with --data-type doesn't have
that yet.  I'll update it in v2.

>
> For example, while it's normally good to separate read-mostly data from hot
> written data, if the reads and writes are done together in the same block of
> code by the same thread, then keeping the two data symbols in the same cacheline
> could be a win.  I've seen this often. Your datatype info might be able to
> make these affinities more visible to the user.
>
> Thanks for doing this. This is great.
> Joe

Thanks for your feedback!
Namhyung