
[00/40] Memory allocation profiling

Message ID 20230501165450.15352-1-surenb@google.com (mailing list archive)


Suren Baghdasaryan May 1, 2023, 4:54 p.m. UTC
Memory allocation profiling infrastructure provides a low-overhead
mechanism to make all kernel allocations in the system visible. It can be
used to monitor memory usage, track memory hotspots, detect memory leaks,
and identify memory regressions.

To keep the overhead to a minimum, we record only allocation sizes for
every allocation in the codebase. With that information, if users are
interested in more detailed context for a specific allocation, they can
enable in-depth context tracking, which includes capturing the pid, tgid,
task name, allocation size, timestamp and call stack for every allocation
at the specified code location.

The data is exposed to user space via a read-only debugfs file called
allocations. Usage example:

$ sort -hr /sys/kernel/debug/allocations|head
  153MiB     8599 mm/slub.c:1826 module:slub func:alloc_slab_page
 6.08MiB      49 mm/slab_common.c:950 module:slab_common func:_kmalloc_order
 5.09MiB     6335 mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
 4.54MiB      78 mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
 1.32MiB      338 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
 1.16MiB      603 fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
 1.00MiB      256 mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
  734KiB     5380 fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
  640KiB      160 kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
  640KiB      160 drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf

For allocation context capture, a new debugfs file called allocations.ctx
is used to select which code location should capture allocation context
and to read captured context information. Usage example:

$ cd /sys/kernel/debug/
$ echo "file include/asm-generic/pgalloc.h line 63 enable" > allocations.ctx
$ cat allocations.ctx
  920KiB      230 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
    size: 4096
    pid: 1474
    tgid: 1474
    comm: bash
    ts: 175332940994
    call stack:
         pte_alloc_one+0xfe/0x130
         __pte_alloc+0x22/0xb0
         copy_page_range+0x842/0x1640
         dup_mm+0x42d/0x580
         copy_process+0xfb1/0x1ac0
         kernel_clone+0x92/0x3e0
         __do_sys_clone+0x66/0x90
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
...

The implementation utilizes a more generic concept of code tagging,
introduced as part of this patchset. A code tag is a structure identifying
a specific location in the source code; it is generated at compile time
and can be embedded in an application-specific structure. A number of
applications for code tagging have been presented in the original RFC [1].
Code tagging uses the old trick of "define a special elf section for
objects of a given type so that we can iterate over them at runtime" and
creates a proper library for it.

To profile memory allocations, we instrument page, slab and percpu
allocators to record total memory allocated in the associated code tag at
every allocation in the codebase. Every time an allocation is performed by
an instrumented allocator, the code tag at that location increments its
counter by the allocation size. Every time memory is freed, the counter is
decremented. To decrement the counter upon freeing, the allocated object
needs a reference to its code tag. Page allocators use page_ext to record
this reference, while slab allocators use the memcg_data field (renamed to
the more generic slabobj_ext) of the slab page.

Module allocations are accounted the same way as other kernel allocations.
Module loading and unloading is supported. If a module is unloaded while
one or more of its allocations is still not freed (a rather rare
condition), its data section is kept in memory so that the code tag can
still be referenced when the allocation is eventually freed.

As part of this series we introduce several kernel configs:
CONFIG_CODE_TAGGING - to enable the code tagging framework
CONFIG_MEM_ALLOC_PROFILING - to enable memory allocation profiling
CONFIG_MEM_ALLOC_PROFILING_DEBUG - to enable memory allocation profiling
validation
Note: CONFIG_MEM_ALLOC_PROFILING enables CONFIG_PAGE_EXTENSION to store
the code tag reference in the page_ext object.

A nomem_profiling kernel command-line parameter is also provided to
disable the functionality and avoid the performance overhead.

Performance overhead:
To evaluate performance we implemented an in-kernel test executing
multiple get_free_page/free_page and kmalloc/kfree calls with allocation
sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
affinity set to a specific CPU to minimize noise. Below is a performance
comparison between the baseline kernel, profiling when enabled, profiling
when disabled (nomem_profiling=y) and (for comparison purposes) baseline
with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:

			kmalloc			pgalloc
Baseline (6.3-rc7)	9.200s			31.050s
profiling disabled	9.800 (+6.52%)		32.600 (+4.99%)
profiling enabled	12.500 (+35.87%)	39.010 (+25.60%)
memcg_kmem enabled	41.400 (+350.00%)	70.600 (+127.38%)

[1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/

Kent Overstreet (15):
  lib/string_helpers: Drop space in string_get_size's output
  scripts/kallsyms: Always include __start and __stop symbols
  fs: Convert alloc_inode_sb() to a macro
  nodemask: Split out include/linux/nodemask_types.h
  prandom: Remove unused include
  lib/string.c: strsep_no_empty()
  Lazy percpu counters
  lib: code tagging query helper functions
  mm/slub: Mark slab_free_freelist_hook() __always_inline
  mempool: Hook up to memory allocation profiling
  timekeeping: Fix a circular include dependency
  mm: percpu: Introduce pcpuobj_ext
  mm: percpu: Add codetag reference into pcpuobj_ext
  arm64: Fix circular header dependency
  MAINTAINERS: Add entries for code tagging and memory allocation
    profiling

Suren Baghdasaryan (25):
  mm: introduce slabobj_ext to support slab object extensions
  mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
    creation
  mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
  mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
    objects
  slab: objext: introduce objext_flags as extension to
    page_memcg_data_flags
  lib: code tagging framework
  lib: code tagging module support
  lib: prevent module unloading if memory is not freed
  lib: add allocation tagging support for memory allocation profiling
  lib: introduce support for page allocation tagging
  change alloc_pages name in dma_map_ops to avoid name conflicts
  mm: enable page allocation tagging
  mm/page_ext: enable early_page_ext when
    CONFIG_MEM_ALLOC_PROFILING_DEBUG=y
  mm: create new codetag references during page splitting
  lib: add codetag reference into slabobj_ext
  mm/slab: add allocation accounting into slab allocation and free paths
  mm/slab: enable slab allocation tagging for kmalloc and friends
  mm: percpu: enable per-cpu allocation tagging
  move stack capture functionality into a separate function for reuse
  lib: code tagging context capture support
  lib: implement context capture support for tagged allocations
  lib: add memory allocations report in show_mem()
  codetag: debug: skip objext checking when it's for objext itself
  codetag: debug: mark codetags for reserved pages as empty
  codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext
    allocations

 .../admin-guide/kernel-parameters.txt         |   2 +
 MAINTAINERS                                   |  22 +
 arch/arm64/include/asm/spectre.h              |   4 +-
 arch/x86/kernel/amd_gart_64.c                 |   2 +-
 drivers/iommu/dma-iommu.c                     |   2 +-
 drivers/xen/grant-dma-ops.c                   |   2 +-
 drivers/xen/swiotlb-xen.c                     |   2 +-
 include/asm-generic/codetag.lds.h             |  14 +
 include/asm-generic/vmlinux.lds.h             |   3 +
 include/linux/alloc_tag.h                     | 161 ++++++
 include/linux/codetag.h                       | 159 ++++++
 include/linux/codetag_ctx.h                   |  48 ++
 include/linux/dma-map-ops.h                   |   2 +-
 include/linux/fs.h                            |   6 +-
 include/linux/gfp.h                           | 123 ++--
 include/linux/gfp_types.h                     |  12 +-
 include/linux/hrtimer.h                       |   2 +-
 include/linux/lazy-percpu-counter.h           | 102 ++++
 include/linux/memcontrol.h                    |  56 +-
 include/linux/mempool.h                       |  73 ++-
 include/linux/mm.h                            |   8 +
 include/linux/mm_types.h                      |   4 +-
 include/linux/nodemask.h                      |   2 +-
 include/linux/nodemask_types.h                |   9 +
 include/linux/page_ext.h                      |   1 -
 include/linux/pagemap.h                       |   9 +-
 include/linux/percpu.h                        |  19 +-
 include/linux/pgalloc_tag.h                   |  95 ++++
 include/linux/prandom.h                       |   1 -
 include/linux/sched.h                         |  32 +-
 include/linux/slab.h                          | 182 +++---
 include/linux/slab_def.h                      |   2 +-
 include/linux/slub_def.h                      |   4 +-
 include/linux/stackdepot.h                    |  16 +
 include/linux/string.h                        |   1 +
 include/linux/time_namespace.h                |   2 +
 init/Kconfig                                  |   4 +
 kernel/dma/mapping.c                          |   4 +-
 kernel/module/main.c                          |  25 +-
 lib/Kconfig                                   |   3 +
 lib/Kconfig.debug                             |  26 +
 lib/Makefile                                  |   5 +
 lib/alloc_tag.c                               | 464 +++++++++++++++
 lib/codetag.c                                 | 529 ++++++++++++++++++
 lib/lazy-percpu-counter.c                     | 127 +++++
 lib/show_mem.c                                |  15 +
 lib/stackdepot.c                              |  68 +++
 lib/string.c                                  |  19 +
 lib/string_helpers.c                          |   3 +-
 mm/compaction.c                               |   9 +-
 mm/filemap.c                                  |   6 +-
 mm/huge_memory.c                              |   2 +
 mm/kfence/core.c                              |  14 +-
 mm/kfence/kfence.h                            |   4 +-
 mm/memcontrol.c                               |  56 +-
 mm/mempolicy.c                                |  30 +-
 mm/mempool.c                                  |  28 +-
 mm/mm_init.c                                  |   1 +
 mm/page_alloc.c                               |  75 ++-
 mm/page_ext.c                                 |  21 +-
 mm/page_owner.c                               |  54 +-
 mm/percpu-internal.h                          |  26 +-
 mm/percpu.c                                   | 122 ++--
 mm/slab.c                                     |  22 +-
 mm/slab.h                                     | 224 ++++++--
 mm/slab_common.c                              |  95 +++-
 mm/slub.c                                     |  24 +-
 mm/util.c                                     |  10 +-
 scripts/kallsyms.c                            |  13 +
 scripts/module.lds.S                          |   7 +
 70 files changed, 2765 insertions(+), 554 deletions(-)
 create mode 100644 include/asm-generic/codetag.lds.h
 create mode 100644 include/linux/alloc_tag.h
 create mode 100644 include/linux/codetag.h
 create mode 100644 include/linux/codetag_ctx.h
 create mode 100644 include/linux/lazy-percpu-counter.h
 create mode 100644 include/linux/nodemask_types.h
 create mode 100644 include/linux/pgalloc_tag.h
 create mode 100644 lib/alloc_tag.c
 create mode 100644 lib/codetag.c
 create mode 100644 lib/lazy-percpu-counter.c

Comments

Roman Gushchin May 1, 2023, 5:47 p.m. UTC | #1
On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below is performance
> comparison between the baseline kernel, profiling when enabled, profiling
> when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> 
> 			kmalloc			pgalloc
> Baseline (6.3-rc7)	9.200s			31.050s
> profiling disabled	9.800 (+6.52%)		32.600 (+4.99%)
> profiling enabled	12.500 (+35.87%)	39.010 (+25.60%)
> memcg_kmem enabled	41.400 (+350.00%)	70.600 (+127.38%)

Hm, this makes me think we have a regression with memcg_kmem in one of
the recent releases. When I measured it a couple of years ago, the overhead
was definitely within 100%.

Do you understand what makes your profiling drastically faster than kmem?

Thanks!
Suren Baghdasaryan May 1, 2023, 6:08 p.m. UTC | #2
On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> > Performance overhead:
> > To evaluate performance we implemented an in-kernel test executing
> > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > affinity set to a specific CPU to minimize the noise. Below is performance
> > comparison between the baseline kernel, profiling when enabled, profiling
> > when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> >
> >                       kmalloc                 pgalloc
> > Baseline (6.3-rc7)    9.200s                  31.050s
> > profiling disabled    9.800 (+6.52%)          32.600 (+4.99%)
> > profiling enabled     12.500 (+35.87%)        39.010 (+25.60%)
> > memcg_kmem enabled    41.400 (+350.00%)       70.600 (+127.38%)
>
> Hm, this makes me think we have a regression with memcg_kmem in one of
> the recent releases. When I measured it a couple of years ago, the overhead
> was definitely within 100%.
>
> Do you understand what makes the your profiling drastically faster than kmem?

I haven't profiled or looked into kmem overhead closely but I can do
that. I just wanted to see how the overhead compares with the existing
accounting mechanisms.

For kmalloc, the overhead is low because after we create the vector of
slab_ext objects (which is the same as what memcg_kmem does), memory
profiling just increments a lazy counter (which in many cases would be
a per-cpu counter). memcg_kmem operates on cgroup hierarchy with
additional overhead associated with that. I'm guessing that's the
reason for the big difference between these mechanisms, but I didn't
look into the details to understand memcg_kmem performance.

>
> Thanks!
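The "lazy counter" idea referenced above can be sketched in userspace;
this is not the series' lib/lazy-percpu-counter.c, just an illustrative,
single-threaded model (NR_CPUS and PROMOTE_THRESHOLD are made-up names):
a counter starts as a single shared atomic and, once it proves hot,
promotes itself to per-CPU storage so the fast path touches no shared
cacheline:

```c
#include <stdatomic.h>
#include <stdlib.h>

#define NR_CPUS 4		/* stand-in for the real CPU count */
#define PROMOTE_THRESHOLD 64	/* promote after this many updates */

struct lazy_counter {
	atomic_long value;	/* used until the counter is "hot" */
	atomic_long nr_updates;
	long *percpu;		/* allocated lazily on promotion */
};

static void lazy_counter_add(struct lazy_counter *c, int cpu, long v)
{
	long *pc = c->percpu;

	if (pc) {		/* hot path: private per-CPU slot */
		pc[cpu] += v;
		return;
	}
	atomic_fetch_add(&c->value, v);
	/* Promotion is not made race-free here; illustration only. */
	if (atomic_fetch_add(&c->nr_updates, 1) == PROMOTE_THRESHOLD)
		c->percpu = calloc(NR_CPUS, sizeof(long));
}

static long lazy_counter_read(struct lazy_counter *c)
{
	long sum = atomic_load(&c->value);

	if (c->percpu)
		for (int i = 0; i < NR_CPUS; i++)
			sum += c->percpu[i];
	return sum;
}
```

Reads sum the pre-promotion atomic value and all per-CPU slots, so no
updates are lost across the promotion point.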
Roman Gushchin May 1, 2023, 6:14 p.m. UTC | #3
On Mon, May 01, 2023 at 11:08:05AM -0700, Suren Baghdasaryan wrote:
> On Mon, May 1, 2023 at 10:47 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Mon, May 01, 2023 at 09:54:10AM -0700, Suren Baghdasaryan wrote:
> > > Performance overhead:
> > > To evaluate performance we implemented an in-kernel test executing
> > > multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> > > sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> > > affinity set to a specific CPU to minimize the noise. Below is performance
> > > comparison between the baseline kernel, profiling when enabled, profiling
> > > when disabled (nomem_profiling=y) and (for comparison purposes) baseline
> > > with CONFIG_MEMCG_KMEM enabled and allocations using __GFP_ACCOUNT:
> > >
> > >                       kmalloc                 pgalloc
> > > Baseline (6.3-rc7)    9.200s                  31.050s
> > > profiling disabled    9.800 (+6.52%)          32.600 (+4.99%)
> > > profiling enabled     12.500 (+35.87%)        39.010 (+25.60%)
> > > memcg_kmem enabled    41.400 (+350.00%)       70.600 (+127.38%)
> >
> > Hm, this makes me think we have a regression with memcg_kmem in one of
> > the recent releases. When I measured it a couple of years ago, the overhead
> > was definitely within 100%.
> >
> > Do you understand what makes the your profiling drastically faster than kmem?
> 
> I haven't profiled or looked into kmem overhead closely but I can do
> that. I just wanted to see how the overhead compares with the existing
> accounting mechanisms.

It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
should be ok for production use, which is great!
In reality, most workloads are not that sensitive to the speed of
memory allocation.

> 
> For kmalloc, the overhead is low because after we create the vector of
> slab_ext objects (which is the same as what memcg_kmem does), memory
> profiling just increments a lazy counter (which in many cases would be
> a per-cpu counter).

So does kmem (this is why I'm somewhat surprised by the difference).

> memcg_kmem operates on cgroup hierarchy with
> additional overhead associated with that. I'm guessing that's the
> reason for the big difference between these mechanisms but, I didn't
> look into the details to understand memcg_kmem performance.

I suspect recent rt-related changes and also the wide usage of
rcu primitives in the kmem code. I'll try to look closer as well.

Thanks!
Kent Overstreet May 1, 2023, 7:37 p.m. UTC | #4
On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote:
> It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
> should be ok for the production use, which is great!
> In the reality, most workloads are not that sensitive to the speed of
> memory allocation.

:)

My main takeaway has been "the slub fast path is _really_ fast". No
disabling of preemption, no atomic instructions, just a non-locked
double-word cmpxchg - it's a slick piece of work.

> > For kmalloc, the overhead is low because after we create the vector of
> > slab_ext objects (which is the same as what memcg_kmem does), memory
> > profiling just increments a lazy counter (which in many cases would be
> > a per-cpu counter).
> 
> So does kmem (this is why I'm somewhat surprised by the difference).
> 
> > memcg_kmem operates on cgroup hierarchy with
> > additional overhead associated with that. I'm guessing that's the
> > reason for the big difference between these mechanisms but, I didn't
> > look into the details to understand memcg_kmem performance.
> 
> I suspect recent rt-related changes and also the wide usage of
> rcu primitives in the kmem code. I'll try to look closer as well.

Happy to give you something to compare against :)
Roman Gushchin May 1, 2023, 9:18 p.m. UTC | #5
On Mon, May 01, 2023 at 03:37:58PM -0400, Kent Overstreet wrote:
> On Mon, May 01, 2023 at 11:14:45AM -0700, Roman Gushchin wrote:
> > It's a good idea and I generally think that +25-35% for kmalloc/pgalloc
> > should be ok for the production use, which is great!
> > In the reality, most workloads are not that sensitive to the speed of
> > memory allocation.
> 
> :)
> 
> My main takeaway has been "the slub fast path is _really_ fast". No
> disabling of preemption, no atomic instructions, just a non locked
> double word cmpxchg - it's a slick piece of work.
> 
> > > For kmalloc, the overhead is low because after we create the vector of
> > > slab_ext objects (which is the same as what memcg_kmem does), memory
> > > profiling just increments a lazy counter (which in many cases would be
> > > a per-cpu counter).
> > 
> > So does kmem (this is why I'm somewhat surprised by the difference).
> > 
> > > memcg_kmem operates on cgroup hierarchy with
> > > additional overhead associated with that. I'm guessing that's the
> > > reason for the big difference between these mechanisms but, I didn't
> > > look into the details to understand memcg_kmem performance.
> > 
> > I suspect recent rt-related changes and also the wide usage of
> > rcu primitives in the kmem code. I'll try to look closer as well.
> 
> Happy to give you something to compare against :)

To be fair, it's not an apples-to-apples comparison, because:
1) memcgs are organized in a tree, these days usually with at least 3 layers,
2) memcgs are dynamic. In theory a task can be moved to a different
   memcg while performing a (very slow) allocation, and the original
   memcg can be released. To prevent this we have to perform a lot
   of operations which you can happily avoid.

That said, there is clearly a place for optimization, so thank you
for indirectly bringing this up.

Thanks!
Michal Hocko May 3, 2023, 7:25 a.m. UTC | #6
On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> Memory allocation profiling infrastructure provides a low overhead
> mechanism to make all kernel allocations in the system visible. It can be
> used to monitor memory usage, track memory hotspots, detect memory leaks,
> identify memory regressions.
> 
> To keep the overhead to the minimum, we record only allocation sizes for
> every allocation in the codebase. With that information, if users are
> interested in more detailed context for a specific allocation, they can
> enable in-depth context tracking, which includes capturing the pid, tgid,
> task name, allocation size, timestamp and call stack for every allocation
> at the specified code location.
[...]
> Implementation utilizes a more generic concept of code tagging, introduced
> as part of this patchset. Code tag is a structure identifying a specific
> location in the source code which is generated at compile time and can be
> embedded in an application-specific structure. A number of applications
> for code tagging have been presented in the original RFC [1].
> Code tagging uses the old trick of "define a special elf section for
> objects of a given type so that we can iterate over them at runtime" and
> creates a proper library for it. 
> 
> To profile memory allocations, we instrument page, slab and percpu
> allocators to record total memory allocated in the associated code tag at
> every allocation in the codebase. Every time an allocation is performed by
> an instrumented allocator, the code tag at that location increments its
> counter by allocation size. Every time the memory is freed the counter is
> decremented. To decrement the counter upon freeing, allocated object needs
> a reference to its code tag. Page allocators use page_ext to record this
> reference while slab allocators use memcg_data (renamed into more generic
> slabobj_ext) of the slab page.
[...]
> [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
[...]
>  70 files changed, 2765 insertions(+), 554 deletions(-)

Sorry for cutting the cover considerably but I believe I have quoted the
most important/interesting parts here. The approach is not fundamentally
different from the previous version [1] and there was a significant
discussion around this approach. The cover letter doesn't summarize or
address the concerns expressed previously, AFAICS. So let me bring those
up again. At least those I find the most important:
- This is a big change and it adds a significant maintenance burden
  because each allocation entry point needs to be handled specifically.
  The cost will grow with the intended coverage, especially when the
  allocation is hidden in library code.
- It has been brought up that this is duplicating functionality already
  available via existing tracing infrastructure. You should make it very
  clear why that is not suitable for the job
- We already have page_owner infrastructure that provides allocation
  tracking data. Why it cannot be used/extended?

Thanks!
Kent Overstreet May 3, 2023, 7:34 a.m. UTC | #7
On Wed, May 03, 2023 at 09:25:29AM +0200, Michal Hocko wrote:
> On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> > Memory allocation profiling infrastructure provides a low overhead
> > mechanism to make all kernel allocations in the system visible. It can be
> > used to monitor memory usage, track memory hotspots, detect memory leaks,
> > identify memory regressions.
> > 
> > To keep the overhead to the minimum, we record only allocation sizes for
> > every allocation in the codebase. With that information, if users are
> > interested in more detailed context for a specific allocation, they can
> > enable in-depth context tracking, which includes capturing the pid, tgid,
> > task name, allocation size, timestamp and call stack for every allocation
> > at the specified code location.
> [...]
> > Implementation utilizes a more generic concept of code tagging, introduced
> > as part of this patchset. Code tag is a structure identifying a specific
> > location in the source code which is generated at compile time and can be
> > embedded in an application-specific structure. A number of applications
> > for code tagging have been presented in the original RFC [1].
> > Code tagging uses the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it. 
> > 
> > To profile memory allocations, we instrument page, slab and percpu
> > allocators to record total memory allocated in the associated code tag at
> > every allocation in the codebase. Every time an allocation is performed by
> > an instrumented allocator, the code tag at that location increments its
> > counter by allocation size. Every time the memory is freed the counter is
> > decremented. To decrement the counter upon freeing, allocated object needs
> > a reference to its code tag. Page allocators use page_ext to record this
> > reference while slab allocators use memcg_data (renamed into more generic
> > slabobj_ext) of the slab page.
> [...]
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
> [...]
> >  70 files changed, 2765 insertions(+), 554 deletions(-)
> 
> Sorry for cutting the cover considerably but I believe I have quoted the
> most important/interesting parts here. The approach is not fundamentally
> different from the previous version [1] and there was a significant
> discussion around this approach. The cover letter doesn't summarize nor
> deal with concerns expressed previous AFAICS. So let me bring those up
> back. At least those I find the most important:

We covered this previously, I'll just be giving the same answers I did
before:

> - This is a big change and it adds a significant maintenance burden
>   because each allocation entry point needs to be handled specifically.
>   The cost will grow with the intended coverage especially there when
>   allocation is hidden in a library code.

We've made this as clean and simple as possible: a single new macro
invocation per allocation function, no calling convention changes (that
would indeed have been a lot of churn!)

> - It has been brought up that this is duplicating functionality already
>   available via existing tracing infrastructure. You should make it very
>   clear why that is not suitable for the job

Tracing people _claimed_ this, but never demonstrated it. Tracepoints
exist but the tooling that would consume them to provide this kind of
information does not exist; it would require maintaining an index of
_every outstanding allocation_ so that frees could be accounted
correctly - IOW, it would be _drastically_ higher overhead, so not at
all comparable.

> - We already have page_owner infrastructure that provides allocation
>   tracking data. Why it cannot be used/extended?

Page owner is also very high overhead, its output is not very user
friendly (tracking the full call stack means related overhead gets split
across stacks, not generally what you want), and it doesn't cover slab.

This tracks _all_ memory allocations - slab, page, vmalloc, percpu.
Michal Hocko May 3, 2023, 7:51 a.m. UTC | #8
On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 09:25:29AM +0200, Michal Hocko wrote:
> > On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> > > Memory allocation profiling infrastructure provides a low overhead
> > > mechanism to make all kernel allocations in the system visible. It can be
> > > used to monitor memory usage, track memory hotspots, detect memory leaks,
> > > identify memory regressions.
> > > 
> > > To keep the overhead to the minimum, we record only allocation sizes for
> > > every allocation in the codebase. With that information, if users are
> > > interested in more detailed context for a specific allocation, they can
> > > enable in-depth context tracking, which includes capturing the pid, tgid,
> > > task name, allocation size, timestamp and call stack for every allocation
> > > at the specified code location.
> > [...]
> > > Implementation utilizes a more generic concept of code tagging, introduced
> > > as part of this patchset. Code tag is a structure identifying a specific
> > > location in the source code which is generated at compile time and can be
> > > embedded in an application-specific structure. A number of applications
> > > for code tagging have been presented in the original RFC [1].
> > > Code tagging uses the old trick of "define a special elf section for
> > > objects of a given type so that we can iterate over them at runtime" and
> > > creates a proper library for it. 
> > > 
> > > To profile memory allocations, we instrument page, slab and percpu
> > > allocators to record total memory allocated in the associated code tag at
> > > every allocation in the codebase. Every time an allocation is performed by
> > > an instrumented allocator, the code tag at that location increments its
> > > counter by allocation size. Every time the memory is freed the counter is
> > > decremented. To decrement the counter upon freeing, allocated object needs
> > > a reference to its code tag. Page allocators use page_ext to record this
> > > reference while slab allocators use memcg_data (renamed into more generic
> > > slabobj_ext) of the slab page.
> > [...]
> > > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
> > [...]
> > >  70 files changed, 2765 insertions(+), 554 deletions(-)
> > 
> > Sorry for cutting the cover considerably but I believe I have quoted the
> > most important/interesting parts here. The approach is not fundamentally
> > different from the previous version [1] and there was a significant
> > discussion around this approach. The cover letter doesn't summarize nor
> > deal with concerns expressed previous AFAICS. So let me bring those up
> > back. At least those I find the most important:
> 
> We covered this previously, I'll just be giving the same answers I did
> before:

Your answers have shown your insight into tracing is very limited. I
have a clear recollection there were many suggestions on how to get what
you need and willingness to help out. Repeating your previous position
will not help much to be honest with you.

> > - This is a big change and it adds a significant maintenance burden
> >   because each allocation entry point needs to be handled specifically.
> >   The cost will grow with the intended coverage especially there when
> >   allocation is hidden in a library code.
> 
> We've made this as clean and simple as posssible: a single new macro
> invocation per allocation function, no calling convention changes (that
> would indeed have been a lot of churn!)

That doesn't really make the concern any less relevant. I believe you
and Suren have made a great effort to reduce the churn as much as
possible, but looking at the diffstat the code changes are clearly there,
and you have to convince the rest of the community that this maintenance
overhead is really worth it. The above statement hasn't helped to
convince me, to be honest.

> > - It has been brought up that this is duplicating functionality already
> >   available via existing tracing infrastructure. You should make it very
> >   clear why that is not suitable for the job
> 
> Tracing people _claimed_ this, but never demonstrated it.

The burden is on you and Suren. You are proposing to implement an
alternative tracing infrastructure.

> Tracepoints
> exist but the tooling that would consume them to provide this kind of
> information does not exist;

Any reasons why an appropriate tooling cannot be developed?

> it would require maintaining an index of
> _every outstanding allocation_ so that frees could be accounted
> correctly - IOW, it would be _drastically_ higher overhead, so not at
> all comparable.

Do you have any actual data points to prove your claim?

> > - We already have page_owner infrastructure that provides allocation
> >   tracking data. Why it cannot be used/extended?
> 
> Page owner is also very high overhead,

Is there any data to prove that claim? I would be really surprised if
page_owner gave higher overhead than page tagging with profiling
enabled (there is an allocation for each allocation request!!!). We can
discuss the bare-bones page tagging comparison to page_owner because of
the full stack unwinding, but is that overhead really prohibitively costly?
Can we reduce it by trimming the unwinder information?

> and the output is not very user
> friendly (tracking the full call stack means related overhead gets
> split up, not generally what you want), and it doesn't cover slab.

Is this something we cannot do anything about? Have you explored any
potential ways?

> This tracks _all_ memory allocations - slab, page, vmalloc, percpu.
Kent Overstreet May 3, 2023, 8:05 a.m. UTC | #9
On Wed, May 03, 2023 at 09:51:49AM +0200, Michal Hocko wrote:
> Your answers have shown that your insight into tracing is very limited. I
> have a clear recollection that there were many suggestions on how to get what
> you need, and willingness to help out. Repeating your previous position
> will not help much, to be honest with you.

Please enlighten us, oh wise one.

> > > - It has been brought up that this is duplicating functionality already
> > >   available via existing tracing infrastructure. You should make it very
> > >   clear why that is not suitable for the job
> > 
> > Tracing people _claimed_ this, but never demonstrated it.
> 
> The burden is on you and Suren. You are proposing to implement an
> alternative tracing infrastructure.

No, we're still waiting on the tracing people to _demonstrate_, not
claim, that this is at all possible in a comparable way with tracing. 

It's not on us to make your argument for you, and before making
accusations about honesty you should try to be more honest yourself.

The expectations you're trying to impose have never been the norm in the
kernel community, sorry. When there's a technical argument about the
best way to do something, _code wins_, and we've got working code that does
something that hasn't been possible previously.

There's absolutely no rule that "tracing has to be the one and only tool
for kernel visibility".

I'm considering the tracing discussion closed until someone in the
pro-tracing camp shows something new.

> > > - We already have page_owner infrastructure that provides allocation
> > >   tracking data. Why it cannot be used/extended?
> > 
> > Page owner is also very high overhead,
> 
> Is there any data to prove that claim? I would be really surprised if
> page_owner gave higher overhead than page tagging with profiling
> enabled (there is an allocation for each allocation request!!!). We can
> discuss the bare-bones page tagging comparison to page_owner because of
> the full stack unwinding, but is that overhead really prohibitively costly?
> Can we reduce it by trimming the unwinder information?

Honestly, this isn't terribly relevant, because as noted before page
owner is limited to just page allocations.

> 
> > and the output is not very user
> > friendly (tracking the full call stack means related overhead gets
> > split up, not generally what you want), and it doesn't cover slab.
> 
> Is this something we cannot do anything about? Have you explored any
> potential ways?
> 
> > This tracks _all_ memory allocations - slab, page, vmalloc, percpu.

Michal, the discussions with you seem to perpetually go in circles; it's
clear you're negative on the patchset, you keep raising the same
objections while refusing to concede a single point.

I believe I've answered enough, so I'll leave off further discussions
with you.
Petr Tesařík May 3, 2023, 9:50 a.m. UTC | #10
On Wed, 3 May 2023 09:51:49 +0200
Michal Hocko <mhocko@suse.com> wrote:

> On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
>[...]
> > We've made this as clean and simple as possible: a single new macro
> > invocation per allocation function, no calling convention changes (that
> > would indeed have been a lot of churn!)  
> 
> That doesn't really make the concern any less relevant. I believe you
> and Suren have made a great effort to reduce the churn as much as
> possible but looking at the diffstat the code changes are clearly there
> and you have to convince the rest of the community that this maintenance
> overhead is really worth it.

I believe this is the crucial point.

I have my own concerns about the use of preprocessor macros, which goes
against the basic idea of a code tagging framework (patch 13/40).
AFAICS the CODE_TAG_INIT macro must be expanded on the same source code
line as the tagged code, which makes it hard to use without further
macros (unless you want to make the source code unreadable beyond
imagination). That's why all allocation functions must be converted to
macros.

If anyone ever wants to use this code tagging framework for something
else, they will also have to convert relevant functions to macros,
slowly changing the kernel to a minefield where local identifiers,
struct, union and enum tags, field names and labels must avoid name
conflict with a tagged function. For now, I have to remember that
alloc_pages is forbidden, but the list may grow.

FWIW I can see some occurrences of "alloc_pages" under arch/ which are
not renamed by patch 19/40 of this series. For instance, does the
kernel build for s390x after applying the patch series?

New code may also work initially, but explode after adding an #include
later...

HOWEVER, if the rest of the community agrees that the added value of
code tagging is worth all these potential risks, I can live with it.

Petr T
Kent Overstreet May 3, 2023, 9:54 a.m. UTC | #11
On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> On Wed, 3 May 2023 09:51:49 +0200
> Michal Hocko <mhocko@suse.com> wrote:
> 
> > On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
> >[...]
> > > We've made this as clean and simple as possible: a single new macro
> > > invocation per allocation function, no calling convention changes (that
> > > would indeed have been a lot of churn!)  
> > 
> > That doesn't really make the concern any less relevant. I believe you
> > and Suren have made a great effort to reduce the churn as much as
> > possible but looking at the diffstat the code changes are clearly there
> > and you have to convince the rest of the community that this maintenance
> > overhead is really worth it.
> 
> I believe this is the crucial point.
> 
> I have my own concerns about the use of preprocessor macros, which goes
> against the basic idea of a code tagging framework (patch 13/40).
> AFAICS the CODE_TAG_INIT macro must be expanded on the same source code
> line as the tagged code, which makes it hard to use without further
> macros (unless you want to make the source code unreadable beyond
> imagination). That's why all allocation functions must be converted to
> macros.
> 
> If anyone ever wants to use this code tagging framework for something
> else, they will also have to convert relevant functions to macros,
> slowly changing the kernel to a minefield where local identifiers,
> struct, union and enum tags, field names and labels must avoid name
> conflict with a tagged function. For now, I have to remember that
> alloc_pages is forbidden, but the list may grow.

No, we've got other code tagging applications (that have already been
posted!) and they don't "convert functions to macros" in the way this
patchset does - they do introduce new macros, but as new identifiers,
which we do all the time.

This was simply the least churny way to hook memory allocations.
Kent Overstreet May 3, 2023, 9:57 a.m. UTC | #12
On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> If anyone ever wants to use this code tagging framework for something
> else, they will also have to convert relevant functions to macros,
> slowly changing the kernel to a minefield where local identifiers,
> struct, union and enum tags, field names and labels must avoid name
> conflict with a tagged function. For now, I have to remember that
> alloc_pages is forbidden, but the list may grow.

Also, since you're not actually a kernel contributor yet...

It's not really good decorum to speculate in code review about things
that can be answered by just reading the code. If you're going to
comment, please do the necessary work to make sure you're saying
something that makes sense.
Petr Tesařík May 3, 2023, 10:24 a.m. UTC | #13
On Wed, 3 May 2023 05:54:43 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > On Wed, 3 May 2023 09:51:49 +0200
> > Michal Hocko <mhocko@suse.com> wrote:
> >   
> > > On Wed 03-05-23 03:34:21, Kent Overstreet wrote:
> > >[...]  
> > > > We've made this as clean and simple as possible: a single new macro
> > > > invocation per allocation function, no calling convention changes (that
> > > > would indeed have been a lot of churn!)    
> > > 
> > > That doesn't really make the concern any less relevant. I believe you
> > > and Suren have made a great effort to reduce the churn as much as
> > > possible but looking at the diffstat the code changes are clearly there
> > > and you have to convince the rest of the community that this maintenance
> > > overhead is really worth it.  
> > 
> > I believe this is the crucial point.
> > 
> > I have my own concerns about the use of preprocessor macros, which goes
> > against the basic idea of a code tagging framework (patch 13/40).
> > AFAICS the CODE_TAG_INIT macro must be expanded on the same source code
> > line as the tagged code, which makes it hard to use without further
> > macros (unless you want to make the source code unreadable beyond
> > imagination). That's why all allocation functions must be converted to
> > macros.
> > 
> > If anyone ever wants to use this code tagging framework for something
> > else, they will also have to convert relevant functions to macros,
> > slowly changing the kernel to a minefield where local identifiers,
> > struct, union and enum tags, field names and labels must avoid name
> > conflict with a tagged function. For now, I have to remember that
> > alloc_pages is forbidden, but the list may grow.  
> 
> No, we've got other code tagging applications (that have already been
> posted!) and they don't "convert functions to macros" in the way this
> patchset does - they do introduce new macros, but as new identifiers,
> which we do all the time.

Yes, new all-lowercase macros which do not expand to a single
identifier are still added under include/linux. It's unfortunate IMO,
but it's a fact of life. You have a point here.

> This was simply the least churny way to hook memory allocations.

This is a bold statement. You certainly know what you plan to do, but
other people keep coming up with ideas... Like, what if someone wants to
tag semaphore use: up() and down()?

Don't get me wrong. I can see the benefits of code tagging, and I
agree that my concerns are not very strong. I just want the
consequences to be understood and accepted, so they don't take us by
surprise.

Petr T
Petr Tesařík May 3, 2023, 10:26 a.m. UTC | #14
On Wed, 3 May 2023 05:57:15 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > If anyone ever wants to use this code tagging framework for something
> > else, they will also have to convert relevant functions to macros,
> > slowly changing the kernel to a minefield where local identifiers,
> > struct, union and enum tags, field names and labels must avoid name
> > conflict with a tagged function. For now, I have to remember that
> > alloc_pages is forbidden, but the list may grow.  
> 
> Also, since you're not actually a kernel contributor yet...

I see, I've been around only since 2007...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a97468024fb5b6eccee2a67a7796485c829343a

Petr T
James Bottomley May 3, 2023, 12:33 p.m. UTC | #15
On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > If anyone ever wants to use this code tagging framework for
> > something
> > else, they will also have to convert relevant functions to macros,
> > slowly changing the kernel to a minefield where local identifiers,
> > struct, union and enum tags, field names and labels must avoid name
> > conflict with a tagged function. For now, I have to remember that
> > alloc_pages is forbidden, but the list may grow.
> 
> Also, since you're not actually a kernel contributor yet...

You have an amazing talent for being wrong.  But even if you were
actually right about this, it would be an ad hominem personal attack on
a new contributor which crosses the line into unacceptable behaviour on
the list and runs counter to our code of conduct.

James
Steven Rostedt May 3, 2023, 1:21 p.m. UTC | #16
On Wed, 3 May 2023 04:05:08 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> > The burden is on you and Suren. You are proposing to implement an
> > alternative tracing infrastructure.
> 
> No, we're still waiting on the tracing people to _demonstrate_, not
> claim, that this is at all possible in a comparable way with tracing. 

It's not my job to do your work for you!

I gave you hints on how you can do this with attaching to existing trace
events and your response was "If you don't think it's hard, go ahead and
show us." No! I'm too busy with my own work to do free work for you!

https://lore.kernel.org/all/20220905235007.sc4uk6illlog62fl@kmo-framework/

I know it's easier to create something from scratch that you fully know,
than to work with an existing infrastructure that you need to spend effort
and learn to make it do what you want. But by recreating the work, you now
pass the burden onto everyone else that needs to learn what you did. Not to
mention, we would likely have multiple ways to do the same thing.

Sorry, but that's not how an open source community is supposed to work.

-- Steve
Suren Baghdasaryan May 3, 2023, 2:31 p.m. UTC | #17
On Wed, May 3, 2023 at 5:34 AM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for
> > > something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.
> >
> > Also, since you're not actually a kernel contributor yet...
>
> You have an amazing talent for being wrong.  But even if you were
> actually right about this, it would be an ad hominem personal attack on
> a new contributor which crosses the line into unacceptable behaviour on
> the list and runs counter to our code of conduct.

Kent, I asked you before and I'm asking you again. Please focus on the
technical discussion and stop personal attacks. That is extremely
counter-productive.

>
> James
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Suren Baghdasaryan May 3, 2023, 3:09 p.m. UTC | #18
On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
> > Memory allocation profiling infrastructure provides a low overhead
> > mechanism to make all kernel allocations in the system visible. It can be
> > used to monitor memory usage, track memory hotspots, detect memory leaks,
> > identify memory regressions.
> >
> > To keep the overhead to the minimum, we record only allocation sizes for
> > every allocation in the codebase. With that information, if users are
> > interested in more detailed context for a specific allocation, they can
> > enable in-depth context tracking, which includes capturing the pid, tgid,
> > task name, allocation size, timestamp and call stack for every allocation
> > at the specified code location.
> [...]
> > Implementation utilizes a more generic concept of code tagging, introduced
> > as part of this patchset. Code tag is a structure identifying a specific
> > location in the source code which is generated at compile time and can be
> > embedded in an application-specific structure. A number of applications
> > for code tagging have been presented in the original RFC [1].
> > Code tagging uses the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
> >
> > To profile memory allocations, we instrument page, slab and percpu
> > allocators to record total memory allocated in the associated code tag at
> > every allocation in the codebase. Every time an allocation is performed by
> > an instrumented allocator, the code tag at that location increments its
> > counter by allocation size. Every time the memory is freed the counter is
> > decremented. To decrement the counter upon freeing, the allocated object
> > needs a reference to its code tag. Page allocators use page_ext to record
> > this reference while slab allocators use memcg_data (renamed to the more
> > generic slabobj_ext) of the slab page.
> [...]
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
> [...]
> >  70 files changed, 2765 insertions(+), 554 deletions(-)
>
> Sorry for cutting the cover considerably but I believe I have quoted the
> most important/interesting parts here. The approach is not fundamentally
> different from the previous version [1] and there was a significant
> discussion around it. The cover letter doesn't summarize or address the
> concerns expressed previously, AFAICS, so let me bring those back up.

Thanks for summarizing!

> At least those I find the most important:
> - This is a big change and it adds a significant maintenance burden
>   because each allocation entry point needs to be handled specifically.
>   The cost will grow with the intended coverage especially there when
>   allocation is hidden in a library code.

Do you mean with more allocations in the codebase more codetags will
be generated? Is that the concern? Or maybe as you commented in
another patch that context capturing feature does not limit how many
stacks will be captured?

> - It has been brought up that this is duplicating functionality already
>   available via existing tracing infrastructure. You should make it very
>   clear why that is not suitable for the job

I experimented with using tracing with _RET_IP_ to implement this
accounting. The major issue is the runtime overhead of the _RET_IP_ to
codetag lookup, which is orders of magnitude higher than with the
proposed code tagging approach, where that link is resolved at compile
time. Since we want this mechanism deployed in production, we want to
keep the overhead to the absolute minimum.
You asked me before how much overhead would be tolerable, and the
answer will always be "as small as possible". This is especially true
for slab allocators, which are ridiculously fast; regressing them
would be very noticeable (due to their frequent use).

There is another issue, which I think can be solved in a smart way, but
the solution will either affect performance or require more memory. With the
tracing approach we don't know beforehand how many individual
allocation sites exist, so we have to allocate code tags (or similar
structures for counting) at runtime vs compile time. We can be smart
about it and allocate in batches or even preallocate more than we need
beforehand but, as I said, it will require some kind of compromise.

I understand that code tagging creates additional maintenance burdens
but I hope it also produces enough benefits that people will want
this. The cost is also hopefully amortized when additional
applications like the ones we presented in RFC [1] are built using the
same framework.

> - We already have page_owner infrastructure that provides allocation
>   tracking data. Why it cannot be used/extended?

1. The overhead.
2. Covers only page allocators.

I didn't think about extending the page_owner approach to slab
allocators but I suspect it would not be trivial. I don't see
attaching an owner to every slab object to be a scalable solution. The
overhead would again be of concern here.

I should point out that there was one important technical concern
about lack of a kill switch for this feature, which was an issue for
distributions that can't disable the CONFIG flag. In this series we
addressed that concern.

[1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/

Thanks,
Suren.

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
Kent Overstreet May 3, 2023, 3:28 p.m. UTC | #19
On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote:
> On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for
> > > something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.
> > 
> > Also, since you're not actually a kernel contributor yet...
> 
> You have an amazing talent for being wrong.  But even if you were
> actually right about this, it would be an ad hominem personal attack on
> a new contributor which crosses the line into unacceptable behaviour on
> the list and runs counter to our code of conduct.

...Err, what? That was intended _in no way_ as a personal attack.

If I was mistaken I do apologize, but lately I've run across quite a lot
of people offering review feedback to patches I post that turn out to
have 0 or 10 patches in the kernel, and - to be blunt - a pattern of
offering feedback in strong language with a presumption of experience
that takes a lot to respond to adequately on a technical basis.

I don't think a suggestion to spend a bit more time reading code instead
of speculating is out of order! We could all put more effort into how
we offer review feedback.
Kent Overstreet May 3, 2023, 3:30 p.m. UTC | #20
On Wed, May 03, 2023 at 12:26:27PM +0200, Petr Tesařík wrote:
> On Wed, 3 May 2023 05:57:15 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > If anyone ever wants to use this code tagging framework for something
> > > else, they will also have to convert relevant functions to macros,
> > > slowly changing the kernel to a minefield where local identifiers,
> > > struct, union and enum tags, field names and labels must avoid name
> > > conflict with a tagged function. For now, I have to remember that
> > > alloc_pages is forbidden, but the list may grow.  
> > 
> > Also, since you're not actually a kernel contributor yet...
> 
> I see, I've been around only since 2007...
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a97468024fb5b6eccee2a67a7796485c829343a

My sincere apologies :) I'd searched for your name and email and found
nothing, whoops.
Lorenzo Stoakes May 3, 2023, 3:37 p.m. UTC | #21
On Wed, May 03, 2023 at 11:28:06AM -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote:
> > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > > If anyone ever wants to use this code tagging framework for
> > > > something
> > > > else, they will also have to convert relevant functions to macros,
> > > > slowly changing the kernel to a minefield where local identifiers,
> > > > struct, union and enum tags, field names and labels must avoid name
> > > > conflict with a tagged function. For now, I have to remember that
> > > > alloc_pages is forbidden, but the list may grow.
> > >
> > > Also, since you're not actually a kernel contributor yet...
> >
> > You have an amazing talent for being wrong.  But even if you were
> > actually right about this, it would be an ad hominem personal attack on
> > a new contributor which crosses the line into unacceptable behaviour on
> > the list and runs counter to our code of conduct.
>
> ...Err, what? That was intended _in no way_ as a personal attack.
>

As an outside observer, I can assure you that absolutely came across as a
personal attack, and the precise kind that puts people off from
contributing. I should know as a hobbyist contributor myself.

> If I was mistaken I do apologize, but lately I've run across quite a lot
> of people offering review feedback to patches I post that turn out to
> have 0 or 10 patches in the kernel, and - to be blunt - a pattern of
> offering feedback in strong language with a presumption of experience
> that takes a lot to respond to adequately on a technical basis.
>

I, who may very well not merit being considered a contributor of
significant merit in your view, have had such 'drive-by' commentary on some
of my patches by precisely this type of person, and at no time felt the
need to question whether they were a true Scotsman or not. It's simply not
productive.

> I don't think a suggestion to spend a bit more time reading code instead
> of speculating is out of order! We could all, put more effort into how
> we offer review feedback.

It's the means by which you say it that counts for everything. If you feel
the technical comments might not be merited on a deeper level, perhaps ask
a broader question, or even don't respond at all? There are other means
available.

It's remarkable the impact comments like the one you made can have on
contributors, certainly those of us who are not maintainers and are
naturally plagued with imposter syndrome, so I would ask you on a human
level to try to be a little more considerate.

By all means address technical issues as robustly as you feel appropriate,
that is after all the purpose of code review, but just take a step back and
perhaps find the 'cuddlier' side of yourself when not addressing technical
things :)
James Bottomley May 3, 2023, 3:49 p.m. UTC | #22
On Wed, 2023-05-03 at 11:28 -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 08:33:48AM -0400, James Bottomley wrote:
> > On Wed, 2023-05-03 at 05:57 -0400, Kent Overstreet wrote:
> > > On Wed, May 03, 2023 at 11:50:51AM +0200, Petr Tesařík wrote:
> > > > If anyone ever wants to use this code tagging framework for
> > > > something else, they will also have to convert relevant
> > > > functions to macros, slowly changing the kernel to a minefield
> > > > where local identifiers, struct, union and enum tags, field
> > > > names and labels must avoid name conflict with a tagged
> > > > function. For now, I have to remember that alloc_pages is
> > > > forbidden, but the list may grow.
> > > 
> > > Also, since you're not actually a kernel contributor yet...
> > 
> > You have an amazing talent for being wrong.  But even if you were
> > actually right about this, it would be an ad hominem personal
> > attack on a new contributor which crosses the line into
> > unacceptable behaviour on the list and runs counter to our code of
> > conduct.
> 
> ...Err, what? That was intended _in no way_ as a personal attack.

Your reply went on to say "If you're going to comment, please do the
necessary work to make sure you're saying something that makes sense."
That is a personal attack belittling the person involved and holding
them up for general contempt on the mailing list.  This is exactly how
we should *not* treat newcomers.

> If I was mistaken I do apologize, but lately I've run across quite a
> lot of people offering review feedback to patches I post that turn
> out to have 0 or 10 patches in the kernel, and - to be blunt - a
> pattern of offering feedback in strong language with a presumption of
> experience that takes a lot to respond to adequately on a technical
> basis.

A synopsis of the feedback is that using macros to attach trace tags
pollutes the global function namespace of the kernel.  That's a valid
observation and merits a technical not a personal response.

James
Kent Overstreet May 3, 2023, 4:03 p.m. UTC | #23
On Wed, May 03, 2023 at 04:37:36PM +0100, Lorenzo Stoakes wrote:
> As an outside observer, I can assure you that absolutely came across as a
> personal attack, and the precise kind that puts people off from
> contributing. I should know as a hobbyist contributor myself.
> 
> > If I was mistaken I do apologize, but lately I've run across quite a lot
> > of people offering review feedback to patches I post that turn out to
> > have 0 or 10 patches in the kernel, and - to be blunt - a pattern of
> > offering feedback in strong language with a presumption of experience
> > that takes a lot to respond to adequately on a technical basis.
> >
> 
> I, who may very well not merit being considered a contributor of
> significant merit in your view, have had such 'drive-by' commentary on some
> of my patches by precisely this type of person, and at no time felt the
> need to question whether they were a true Scotsman or not. It's simply not
> productive.
> 
> > I don't think a suggestion to spend a bit more time reading code instead
> > of speculating is out of order! We could all put more effort into how
> > we offer review feedback.
> 
> It's the means by which you say it that counts for everything. If you feel
> the technical comments might not be merited on a deeper level, perhaps ask
> a broader question, or even don't respond at all? There are other means
> available.
> 
> It's remarkable the impact comments like the one you made can have on
> contributors, certainly those of us who are not maintainers and are
> naturally plagued with imposter syndrome, so I would ask you on a human
> level to try to be a little more considerate.
> 
> By all means address technical issues as robustly as you feel appropriate,
> that is after all the purpose of code review, but just take a step back and
> perhaps find the 'cuddlier' side of yourself when not addressing technical
> things :)

Thanks for your reply, it's level headed and appreciated.

But I personally value directness, and I see quite a few people in this
thread going all out on the tone policing - but look, without the
directness the confusion (that Petr is not actually a new contributor)
never would've been cleared up.

Food for thought, perhaps?
Steven Rostedt May 3, 2023, 4:28 p.m. UTC | #24
On Wed, 3 May 2023 08:09:28 -0700
Suren Baghdasaryan <surenb@google.com> wrote:

> There is another issue, which I think can be solved in a smart way but
> will either affect performance or would require more memory. With the
> tracing approach we don't know beforehand how many individual
> allocation sites exist, so we have to allocate code tags (or similar
> structures for counting) at runtime vs compile time. We can be smart
> about it and allocate in batches or even preallocate more than we need
> beforehand but, as I said, it will require some kind of compromise.

This approach is actually quite common, especially since tagging every
instance is usually overkill: if you trace function calls in a running
kernel, you will find that only a small percentage of the kernel ever
executes. It's possible that you will be allocating a lot of tags that will
never be used. If run time allocation is possible, that is usually the
better approach.

-- Steve
Tejun Heo May 3, 2023, 4:35 p.m. UTC | #25
Hello, Kent.

On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> No, we're still waiting on the tracing people to _demonstrate_, not
> claim, that this is at all possible in a comparable way with tracing. 

So, we (meta) happen to do stuff like this all the time in the fleet to hunt
down tricky persistent problems like memory leaks, ref leaks, what-have-you.
In recent kernels, with kprobe and BPF, our ability to debug these sorts of
problems has improved a great deal. Below, I'm attaching a bcc script I used
to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
can follow the same pattern.

There are of course some pros and cons to this approach:

Pros:

* The framework doesn't really have any runtime overhead, so we can have it
  deployed in the entire fleet and debug wherever the problem is.

* It's fully flexible and programmable which enables non-trivial filtering
  and summarizing to be done inside kernel w/ BPF as necessary, which is
  pretty handy for tracking high frequency events.

* BPF is pretty performant. Dedicated built-in kernel code can do better of
  course but BPF's jit compiled code & its data structures are fast enough.
  I don't remember any time this was a problem.

Cons:

* BPF has some learning curve. Also the fact that what it provides is a wide
  open field rather than something scoped out for a specific problem can
  make it seem a bit daunting at the beginning.

* Because tracking starts when the script starts running, it doesn't know
  anything that happened up to that point, so you gotta pay attention to
  handling e.g. frees which don't match allocs. It's kinda annoying
  but not a huge problem usually. There are ways to build BPF progs into
  the kernel and load them early but I haven't experimented with it yet
  personally.
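
In pseudocode, handling that caveat amounts to something like the following
(a hypothetical Python sketch of the bookkeeping only, not the actual BPF
program; the names are made up for illustration):

```python
# Hypothetical sketch: a tracker that starts mid-stream must tolerate
# frees whose matching alloc happened before tracing began.
outstanding = {}       # addr -> size of allocations observed live
unmatched_frees = 0    # frees whose alloc predates the start of tracing

def on_alloc(addr, size):
    outstanding[addr] = size

def on_free(addr):
    global unmatched_frees
    if addr not in outstanding:
        unmatched_frees += 1   # alloc predates tracing; count and skip
        return
    del outstanding[addr]

on_free(0x1000)          # no recorded alloc: skipped, not a double free
on_alloc(0x2000, 64)
on_free(0x2000)          # matched: removed from the live set
```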

I'm not necessarily against adding dedicated memory debugging mechanism but
do wonder whether the extra benefits would be enough to justify the code and
maintenance overhead.

Oh, a bit of a tangent, but for anyone who's more interested in debugging
problems like this: I tend to go for bcc
(https://github.com/iovisor/bcc) for this sort of problem, while others prefer
to write against libbpf directly or use bpftrace
(https://github.com/iovisor/bpftrace).

Thanks.

#!/usr/bin/env bcc-py

import bcc
import time
import datetime
import argparse
import os
import sys
import errno

description = """
Record vmalloc/vfrees and trigger on unmatched vfree
"""

bpf_source = """
#include <uapi/linux/ptrace.h>
#include <linux/vmalloc.h>

struct vmalloc_rec {
	unsigned long		ptr;
	int			last_alloc_stkid;
	int			last_free_stkid;
	int			this_stkid;
	bool			allocated;
};

BPF_STACK_TRACE(stacks, 8192);
BPF_HASH(vmallocs, unsigned long, struct vmalloc_rec, 131072);
BPF_ARRAY(dup_free, struct vmalloc_rec, 1);

int kpret_vmalloc_node_range(struct pt_regs *ctx)
{
	unsigned long ptr = PT_REGS_RC(ctx);
	uint32_t zkey = 0;
	struct vmalloc_rec rec_init = { };
	struct vmalloc_rec *rec;
	int stkid;

	if (!ptr)
		return 0;

	stkid = stacks.get_stackid(ctx, 0);

	rec_init.ptr = ptr;
	rec_init.last_alloc_stkid = -1;
	rec_init.last_free_stkid = -1;
	rec_init.this_stkid = -1;

	rec = vmallocs.lookup_or_init(&ptr, &rec_init);
	rec->allocated = true;
	rec->last_alloc_stkid = stkid;
	return 0;
}

int kp_vfree(struct pt_regs *ctx, const void *addr)
{
	unsigned long ptr = (unsigned long)addr;
	uint32_t zkey = 0;
	struct vmalloc_rec rec_init = { };
	struct vmalloc_rec *rec;
	int stkid;

	stkid = stacks.get_stackid(ctx, 0);

	rec_init.ptr = ptr;
	rec_init.last_alloc_stkid = -1;
	rec_init.last_free_stkid = -1;
	rec_init.this_stkid = -1;

	rec = vmallocs.lookup_or_init(&ptr, &rec_init);
	if (!rec->allocated && rec->last_alloc_stkid >= 0) {
		rec->this_stkid = stkid;
		dup_free.update(&zkey, rec);
	}

	rec->allocated = false;
	rec->last_free_stkid = stkid;
	return 0;
}
"""

bpf = bcc.BPF(text=bpf_source)
bpf.attach_kretprobe(event="__vmalloc_node_range", fn_name="kpret_vmalloc_node_range");
bpf.attach_kprobe(event="vfree", fn_name="kp_vfree");
bpf.attach_kprobe(event="vfree_atomic", fn_name="kp_vfree");

stacks = bpf["stacks"]
vmallocs = bpf["vmallocs"]
dup_free = bpf["dup_free"]
last_dup_free_ptr = dup_free[0].ptr

def print_stack(stkid):
    for addr in stacks.walk(stkid):
        sym = bpf.ksym(addr)
        print('  {}'.format(sym))

def print_dup(dup):
    print('allocated={} ptr={}'.format(dup.allocated, hex(dup.ptr)))
    if (dup.last_alloc_stkid >= 0):
        print('last_alloc_stack: ')
        print_stack(dup.last_alloc_stkid)
    if (dup.last_free_stkid >= 0):
        print('last_free_stack: ')
        print_stack(dup.last_free_stkid)
    if (dup.this_stkid >= 0):
        print('this_stack: ')
        print_stack(dup.this_stkid)

while True:
    time.sleep(1)
    
    if dup_free[0].ptr != last_dup_free_ptr:
        print('\nDUP_FREE:')
        print_dup(dup_free[0])
        last_dup_free_ptr = dup_free[0].ptr
Suren Baghdasaryan May 3, 2023, 5:40 p.m. UTC | #26
On Wed, May 3, 2023 at 9:28 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 3 May 2023 08:09:28 -0700
> Suren Baghdasaryan <surenb@google.com> wrote:
>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or would require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> This approach is actually quite common, especially since tagging every
> instance is usually overkill, as if you trace function calls in a running
> kernel, you will find that only a small percentage of the kernel ever
> executes. It's possible that you will be allocating a lot of tags that will
> never be used. If run time allocation is possible, that is usually the
> better approach.

True but the memory overhead should not be prohibitive here. As a
ballpark number, on my machine I see there are 4838 individual
allocation locations and each codetag structure is 32 bytes, so that's
152KB.
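
For reference, the arithmetic behind that ballpark (using the two figures
above):

```python
# Back-of-the-envelope check of the codetag memory overhead quoted above.
num_sites = 4838      # allocation locations observed on one machine
codetag_bytes = 32    # size of one codetag structure
total = num_sites * codetag_bytes
print(total, total / 1024)  # 154816 bytes, ~151 KiB
```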

>
> -- Steve
Suren Baghdasaryan May 3, 2023, 5:42 p.m. UTC | #27
On Wed, May 3, 2023 at 9:35 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.

Thanks for sharing, Tejun!

>
> There are of course some pros and cons to this approach:
>
> Pros:
>
> * The framework doesn't really have any runtime overhead, so we can have it
>   deployed in the entire fleet and debug wherever the problem is.

Do you mean it has no runtime overhead when disabled?
If so, do you know what the overhead is when enabled? I want to
understand if that's truly a viable solution to track all allocations
(including slab) all the time.
Thanks,
Suren.

>
> * It's fully flexible and programmable which enables non-trivial filtering
>   and summarizing to be done inside kernel w/ BPF as necessary, which is
>   pretty handy for tracking high frequency events.
>
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
>   course but BPF's jit compiled code & its data structures are fast enough.
>   I don't remember any time this was a problem.
>
> Cons:
>
> * BPF has some learning curve. Also the fact that what it provides is a wide
>   open field rather than something scoped out for a specific problem can
>   make it seem a bit daunting at the beginning.
>
> * Because tracking starts when the script starts running, it doesn't know
>   anything that happened up to that point, so you gotta pay attention to
>   handling e.g. frees which don't match allocs. It's kinda annoying
>   but not a huge problem usually. There are ways to build BPF progs into
>   the kernel and load them early but I haven't experimented with it yet
>   personally.
>
> I'm not necessarily against adding dedicated memory debugging mechanism but
> do wonder whether the extra benefits would be enough to justify the code and
> maintenance overhead.
>
> Oh, a bit of a tangent, but for anyone who's more interested in debugging
> problems like this: I tend to go for bcc
> (https://github.com/iovisor/bcc) for this sort of problem, while others prefer
> to write against libbpf directly or use bpftrace
> (https://github.com/iovisor/bpftrace).
>
> Thanks.
>
> [ bcc script snipped; see the previous message for the full listing ]
>
Kent Overstreet May 3, 2023, 5:44 p.m. UTC | #28
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote:
> Hello, Kent.
> 
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing. 
> 
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
> 
> There are of course some pros and cons to this approach:
> 
> Pros:
> 
> * The framework doesn't really have any runtime overhead, so we can have it
>   deployed in the entire fleet and debug wherever the problem is.
> 
> * It's fully flexible and programmable which enables non-trivial filtering
>   and summarizing to be done inside kernel w/ BPF as necessary, which is
>   pretty handy for tracking high frequency events.
> 
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
>   course but BPF's jit compiled code & its data structures are fast enough.
>   I don't remember any time this was a problem.

You're still going to have the inherent overhead of a separate index of
outstanding memory allocations, so that each free can be attributed back
to the correct callsite and its counter decremented.

The BPF approach is going to be _way_ higher overhead if you try to use
it as a general profiler, like this is.
Kent Overstreet May 3, 2023, 5:51 p.m. UTC | #29
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote:
> Hello, Kent.
> 
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing. 
> 
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
> 
> There are of course some pros and cons to this approach:
> 
> Pros:
> 
> * The framework doesn't really have any runtime overhead, so we can have it
>   deployed in the entire fleet and debug wherever the problem is.
> 
> * It's fully flexible and programmable which enables non-trivial filtering
>   and summarizing to be done inside kernel w/ BPF as necessary, which is
>   pretty handy for tracking high frequency events.
> 
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
>   course but BPF's jit compiled code & its data structures are fast enough.
>   I don't remember any time this was a problem.
> 
> Cons:
> 
> * BPF has some learning curve. Also the fact that what it provides is a wide
>   open field rather than something scoped out for a specific problem can
>   make it seem a bit daunting at the beginning.
> 
> * Because tracking starts when the script starts running, it doesn't know
>   anything that happened up to that point, so you gotta pay attention to
>   handling e.g. frees which don't match allocs. It's kinda annoying
>   but not a huge problem usually. There are ways to build BPF progs into
>   the kernel and load them early but I haven't experimented with it yet
>   personally.
> 
> I'm not necessarily against adding dedicated memory debugging mechanism but
> do wonder whether the extra benefits would be enough to justify the code and
> maintenance overhead.
> 
> Oh, a bit of a tangent, but for anyone who's more interested in debugging
> problems like this: I tend to go for bcc
> (https://github.com/iovisor/bcc) for this sort of problem, while others prefer
> to write against libbpf directly or use bpftrace
> (https://github.com/iovisor/bpftrace).

Do you have example output?

TBH I'm skeptical that it's even possible to do full memory allocation
profiling with tracing/bpf, due to recursive memory allocations and
needing an index of outstanding allocations.
Steven Rostedt May 3, 2023, 6:03 p.m. UTC | #30
On Wed, 3 May 2023 10:40:42 -0700
Suren Baghdasaryan <surenb@google.com> wrote:

> > This approach is actually quite common, especially since tagging every
> > instance is usually overkill, as if you trace function calls in a running
> > kernel, you will find that only a small percentage of the kernel ever
> > executes. It's possible that you will be allocating a lot of tags that will
> > never be used. If run time allocation is possible, that is usually the
> > better approach.  
> 
> True but the memory overhead should not be prohibitive here. As a
> ballpark number, on my machine I see there are 4838 individual
> allocation locations and each codetag structure is 32 bytes, so that's
> 152KB.

If it's not that big, then allocating at runtime should not be an issue
either. If runtime allocation can make it less intrusive to the code, that
would be all the more reason to do so.

-- Steve
Tejun Heo May 3, 2023, 6:06 p.m. UTC | #31
Hello, Suren.

On Wed, May 03, 2023 at 10:42:11AM -0700, Suren Baghdasaryan wrote:
> > * The framework doesn't really have any runtime overhead, so we can have it
> >   deployed in the entire fleet and debug wherever the problem is.
> 
> Do you mean it has no runtime overhead when disabled?

Yes, that's what I meant.

> If so, do you know what the overhead is when enabled? I want to
> understand if that's truly a viable solution to track all allocations
> (including slab) all the time.

(cc'ing Alexei and Andrii who know a lot better than me)

I don't have enough concrete benchmark data on hand to answer
definitively but hopefully my general impression will help. We attach
BPF programs to both per-packet and per-IO paths. They obviously aren't free
but their overhead isn't significantly higher than building the same thing
into C code. Once loaded, BPF progs are jit compiled into native code. The
generated code will be a bit worse than regularly compiled C code but those
are really micro differences. There's some bridging code to jump into BPF
but again negligible / acceptable even in the hottest paths.

In terms of execution overhead, I don't think there is a significant
disadvantage to doing these things in BPF. Bigger differences would likely
be in tracking data structures and locking around them. One can definitely
better integrate tracking into alloc / free paths piggybacking on existing
locking and whatnot. That said, BPF hashtable is pretty fast and BPF is
constantly improving in terms of data structure support.

It really depends on the workload and how much overhead one considers
acceptable and I'm sure persistent global tracking can be done more
efficiently with built-in C code. That said, done right, the overhead
difference most likely isn't gonna be orders of magnitude but more like in
the realm of tens of percents, if that.

So, it doesn't nullify the benefits a dedicated mechanism can bring but does
change the conversation quite a bit. Is the extra code justifiable given
that most of what it enables is already possible using a more generic
mechanism, albeit at a bit higher cost? That may well be the case but it
does raise the bar.

Thanks.
Johannes Weiner May 3, 2023, 6:07 p.m. UTC | #32
On Wed, May 03, 2023 at 06:35:49AM -1000, Tejun Heo wrote:
> Hello, Kent.
> 
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing. 
> 
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
> 
> There are of course some pros and cons to this approach:
> 
> Pros:
> 
> * The framework doesn't really have any runtime overhead, so we can have it
>   deployed in the entire fleet and debug wherever the problem is.
> 
> * It's fully flexible and programmable which enables non-trivial filtering
>   and summarizing to be done inside kernel w/ BPF as necessary, which is
>   pretty handy for tracking high frequency events.
> 
> * BPF is pretty performant. Dedicated built-in kernel code can do better of
>   course but BPF's jit compiled code & its data structures are fast enough.
>   I don't remember any time this was a problem.
> 
> Cons:
> 
> * BPF has some learning curve. Also the fact that what it provides is a wide
>   open field rather than something scoped out for a specific problem can
>   make it seem a bit daunting at the beginning.
> 
> * Because tracking starts when the script starts running, it doesn't know
>   anything that happened up to that point, so you gotta pay attention to
>   handling e.g. frees which don't match allocs. It's kinda annoying
>   but not a huge problem usually. There are ways to build BPF progs into
>   the kernel and load them early but I haven't experimented with it yet
>   personally.

Yeah, early loading is definitely important, especially before module
loading etc.

One common usecase is that we see a machine in the wild with a high
amount of kernel memory disappearing somewhere that isn't voluntarily
reported in vmstat/meminfo. Reproducing it isn't always
practical. Something that records early and always (with acceptable
runtime overhead) would be the holy grail.

Matching allocs to frees is doable using the pfn as the key for pages,
and virtual addresses for slab objects.
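
A hypothetical sketch of that keying scheme (illustrative Python only, not
actual tracker code; 4 KiB pages and the helper names are assumptions):

```python
# Sketch: track page allocations by pfn and slab objects by virtual
# address, so a later free can be matched back to its allocating callsite.
PAGE_SHIFT = 12  # assumes 4 KiB pages

live = {}  # key -> callsite string of the allocation

def page_key(phys_addr):
    return ('pfn', phys_addr >> PAGE_SHIFT)

def slab_key(vaddr):
    return ('vaddr', vaddr)

def on_page_alloc(phys_addr, callsite):
    live[page_key(phys_addr)] = callsite

def on_page_free(phys_addr):
    # None means the alloc predates tracing (the mid-stream caveat above)
    return live.pop(page_key(phys_addr), None)

on_page_alloc(0x1234000, 'mm/slub.c:1826')
assert on_page_free(0x1234000) == 'mm/slub.c:1826'
```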

The biggest issue I had when I tried with bpf was losing updates to
the map. IIRC there is some trylocking going on to avoid deadlocks
from nested contexts (alloc interrupted, interrupt frees). It doesn't
sound like an unsolvable problem, though.

Another minor thing was the stack trace map exploding on a basically
infinite number of unique interrupt stacks. This could probably also
be solved by extending the trace extraction API to cut the frames off
at the context switch boundary.

Taking a step back though, given the multitude of allocation sites in
the kernel, it's a bit odd that the only accounting we do is the tiny
fraction of voluntary vmstat/meminfo reporting. We try to cover the
biggest consumers with this of course, but it's always going to be
incomplete and is maintenance overhead too. There are on average
several gigabytes in unknown memory (total - known vmstats) on our
machines. It's difficult to detect regressions easily. And it's per
definition the unexpected cornercases that are the trickiest to track
down. So it might be doable with BPF, but it does feel like the kernel
should do a better job of tracking out of the box and without
requiring too much plumbing and somewhat fragile kernel allocation API
tracking and probing from userspace.
Suren Baghdasaryan May 3, 2023, 6:07 p.m. UTC | #33
On Wed, May 3, 2023 at 11:03 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Wed, 3 May 2023 10:40:42 -0700
> Suren Baghdasaryan <surenb@google.com> wrote:
>
> > > This approach is actually quite common, especially since tagging every
> > > instance is usually overkill, as if you trace function calls in a running
> > > kernel, you will find that only a small percentage of the kernel ever
> > > executes. It's possible that you will be allocating a lot of tags that will
> > > never be used. If run time allocation is possible, that is usually the
> > > better approach.
> >
> > True but the memory overhead should not be prohibitive here. As a
> > ballpark number, on my machine I see there are 4838 individual
> > allocation locations and each codetag structure is 32 bytes, so that's
> > 152KB.
>
> If it's not that big, then allocating at runtime should not be an issue
> either. If runtime allocation can make it less intrusive to the code, that
> would be more rationale to do so.

As I noted, this issue is minor since we can be smart about how we
allocate these entries. The main issue is the performance overhead.
The kmalloc path is extremely fast and very hot. Even adding a per-cpu
increment in our patchset has a 35% overhead. Adding an additional
lookup here would prevent us from having it enabled all the time in
production.

>
> -- Steve
Kent Overstreet May 3, 2023, 6:12 p.m. UTC | #34
On Wed, May 03, 2023 at 02:03:37PM -0400, Steven Rostedt wrote:
> On Wed, 3 May 2023 10:40:42 -0700
> Suren Baghdasaryan <surenb@google.com> wrote:
> 
> > > This approach is actually quite common, especially since tagging every
> > > instance is usually overkill, as if you trace function calls in a running
> > > kernel, you will find that only a small percentage of the kernel ever
> > > executes. It's possible that you will be allocating a lot of tags that will
> > > never be used. If run time allocation is possible, that is usually the
> > > better approach.  
> > 
> > True but the memory overhead should not be prohibitive here. As a
> > ballpark number, on my machine I see there are 4838 individual
> > allocation locations and each codetag structure is 32 bytes, so that's
> > 152KB.
> 
> If it's not that big, then allocating at runtime should not be an issue
> either. If runtime allocation can make it less intrusive to the code, that
> would be more rationale to do so.

We're more optimizing for runtime overhead - a major goal of this
patchset was to be cheap enough to be always on, we've got too many
debugging features that are really useful, but too expensive to have on
all the time.

Doing more runtime allocation would add another pointer fetch to the
fast paths - and I don't see how it would even be possible to runtime
allocate the codetag struct itself.

We already do runtime allocation of percpu counters; see the lazy percpu
counter patch.
Tejun Heo May 3, 2023, 6:19 p.m. UTC | #35
Hello,

On Wed, May 03, 2023 at 02:07:26PM -0400, Johannes Weiner wrote:
...
> > * Because tracking starts when the script starts running, it doesn't know
> >   anything that happened up to that point, so you gotta pay attention to
> >   handling e.g. frees which don't match allocs. It's kinda annoying
> >   but not a huge problem usually. There are ways to build BPF progs into
> >   the kernel and load them early but I haven't experimented with it yet
> >   personally.
> 
> Yeah, early loading is definitely important, especially before module
> loading etc.
> 
> One common usecase is that we see a machine in the wild with a high
> amount of kernel memory disappearing somewhere that isn't voluntarily
> reported in vmstat/meminfo. Reproducing it isn't always
> practical. Something that records early and always (with acceptable
> runtime overhead) would be the holy grail.
> 
> Matching allocs to frees is doable using the pfn as the key for pages,
> and virtual addresses for slab objects.
> 
> The biggest issue I had when I tried with bpf was losing updates to
> the map. IIRC there is some trylocking going on to avoid deadlocks
> from nested contexts (alloc interrupted, interrupt frees). It doesn't
> sound like an unsolvable problem, though.

(cc'ing Alexei and Andrii)

This is the same thing that I hit with sched_ext. BPF plugged it for
struct_ops but I wonder whether it can be done for specific maps / progs -
i.e. just declare that a given map or prog is not to be accessed from NMI and
bypass the trylock deadlock avoidance mechanism. But, yeah, this should be
addressed from BPF side.

> Another minor thing was the stack trace map exploding on a basically
> infinite number of unique interrupt stacks. This could probably also
> be solved by extending the trace extraction API to cut the frames off
> at the context switch boundary.
> 
> Taking a step back though, given the multitude of allocation sites in
> the kernel, it's a bit odd that the only accounting we do is the tiny
> fraction of voluntary vmstat/meminfo reporting. We try to cover the
> biggest consumers with this of course, but it's always going to be
> incomplete and is maintenance overhead too. There are on average
> several gigabytes in unknown memory (total - known vmstats) on our
> machines. It's difficult to detect regressions easily. And it's per
> definition the unexpected cornercases that are the trickiest to track
> down. So it might be doable with BPF, but it does feel like the kernel
> should do a better job of tracking out of the box and without
> requiring too much plumbing and somewhat fragile kernel allocation API
> tracking and probing from userspace.

Yeah, easy / default visibility argument does make sense to me.

Thanks.
Tejun Heo May 3, 2023, 6:24 p.m. UTC | #36
Hello,

On Wed, May 03, 2023 at 01:51:23PM -0400, Kent Overstreet wrote:
> Do you have example output?

Not right now. It's from many months ago. It's just a script I could find
easily.

> TBH I'm skeptical that it's even possible to do full memory allocation
> profiling with tracing/bpf, due to recursive memory allocations and
> needing an index of outstanding allcations.

There are some issues e.g. w/ lossy updates which should be fixed from BPF
side but we do run BPF on every single packet and IO on most of our
machines, so basing this argument on whether tracking all memory allocations
from BPF is possible is probably not a winning strategy for this proposal.

Thanks.
Tejun Heo May 3, 2023, 6:40 p.m. UTC | #37
On Wed, May 03, 2023 at 08:19:24AM -1000, Tejun Heo wrote:
> > Taking a step back though, given the multitude of allocation sites in
> > the kernel, it's a bit odd that the only accounting we do is the tiny
> > fraction of voluntary vmstat/meminfo reporting. We try to cover the
> > biggest consumers with this of course, but it's always going to be
> > incomplete and is maintenance overhead too. There are on average
> > several gigabytes in unknown memory (total - known vmstats) on our
> > machines. It's difficult to detect regressions easily. And it's per
> > definition the unexpected cornercases that are the trickiest to track
> > down. So it might be doable with BPF, but it does feel like the kernel
> > should do a better job of tracking out of the box and without
> > requiring too much plumbing and somewhat fragile kernel allocation API
> > tracking and probing from userspace.
> 
> Yeah, easy / default visibility argument does make sense to me.

So, a bit of addition here. If this is the thrust, the debugfs part seems
rather redundant, right? That's trivially obtainable with tracing / bpf and
in a more flexible and performant manner. Also, are we happy with recording
just single depth for persistent tracking?

Thanks.
Kent Overstreet May 3, 2023, 6:56 p.m. UTC | #38
On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > Yeah, easy / default visibility argument does make sense to me.
> 
> So, a bit of addition here. If this is the thrust, the debugfs part seems
> rather redundant, right? That's trivially obtainable with tracing / bpf and
> in a more flexible and performant manner. Also, are we happy with recording
> just single depth for persistent tracking?

Not sure what you're envisioning?

I'd consider the debugfs interface pretty integral; it's much more
discoverable for users, and it's hardly any code out of the whole
patchset.

Single depth was discussed previously. It's what makes it cheap enough
to be always-on (saving stack traces is expensive!), and I find the
output much more usable than e.g. page owner.
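The per-call-site tag idea that makes single-depth tracking cheap enough to be always-on can be sketched in userspace C. Names here (struct alloc_tag, my_alloc, last_alloc_tag) are invented for illustration and are not the patchset's actual API:

```c
#include <stdlib.h>
#include <assert.h>

/* One statically allocated tag per call site; the macro injects it,
 * so accounting is a couple of counter updates with no runtime lookup. */
struct alloc_tag {
	const char *file;
	int line;
	unsigned long bytes;	/* total bytes requested at this site */
	unsigned long calls;
};

/* Set by the macro so the last-used tag can be inspected in this sketch. */
static struct alloc_tag *last_alloc_tag;

/* Hypothetical macro; the real series wraps kmalloc() and friends.
 * The block-scoped static means every textual call site gets its own
 * tag, resolved entirely at compile time. */
#define my_alloc(size) ({						\
	static struct alloc_tag __tag = { __FILE__, __LINE__, 0, 0 };	\
	__tag.bytes += (size);						\
	__tag.calls++;							\
	last_alloc_tag = &__tag;					\
	malloc(size);							\
})
```

Because the tag is materialized at compile time, the hot path is two increments on a statically known address; there is no hash, no table, and no lookup of any kind.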
Tejun Heo May 3, 2023, 6:58 p.m. UTC | #39
On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > Yeah, easy / default visibility argument does make sense to me.
> > 
> > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > in a more flexible and performant manner. Also, are we happy with recording
> > just single depth for persistent tracking?
> 
> Not sure what you're envisioning?
> 
> I'd consider the debugfs interface pretty integral; it's much more
> discoverable for users, and it's hardly any code out of the whole
> patchset.

You can do the same thing with a bpftrace one liner tho. That's rather
difficult to beat.

Thanks.
Tejun Heo May 3, 2023, 7:09 p.m. UTC | #40
On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > Yeah, easy / default visibility argument does make sense to me.
> > > 
> > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > in a more flexible and performant manner. Also, are we happy with recording
> > > just single depth for persistent tracking?
> > 
> > Not sure what you're envisioning?
> > 
> > I'd consider the debugfs interface pretty integral; it's much more
> > discoverable for users, and it's hardly any code out of the whole
> > patchset.
> 
> You can do the same thing with a bpftrace one liner tho. That's rather
> difficult to beat.

Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
allocations.ctx under debugfs. I meant allocations.ctx is redundant.

Thanks.
Suren Baghdasaryan May 3, 2023, 7:41 p.m. UTC | #41
On Wed, May 3, 2023 at 12:09 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > > Yeah, easy / default visibility argument does make sense to me.
> > > >
> > > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > > in a more flexible and performant manner. Also, are we happy with recording
> > > > just single depth for persistent tracking?

IIUC, by single depth you mean no call stack capturing?
If so, that's the idea behind the context capture feature so that we
can enable it on specific allocations only after we determine there is
something interesting there. So, with low-cost persistent tracking we
can determine the suspects and then pay some more to investigate those
suspects in more detail.

> > >
> > > Not sure what you're envisioning?
> > >
> > > I'd consider the debugfs interface pretty integral; it's much more
> > > discoverable for users, and it's hardly any code out of the whole
> > > patchset.
> >
> > You can do the same thing with a bpftrace one liner tho. That's rather
> > difficult to beat.

debugfs seemed like a natural choice for such information. If another
interface is more appropriate I'm happy to explore that.

>
> Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
> allocations.ctx under debugfs. I meant allocations.ctx is redundant.

Do you mean that we could display allocation context in
debugfs/allocations file (for the allocations which we explicitly
enabled context capturing)?

>
> Thanks.
>
> --
> tejun
Tejun Heo May 3, 2023, 7:48 p.m. UTC | #42
Hello,

On Wed, May 03, 2023 at 12:41:08PM -0700, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:09 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> > > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > > > Yeah, easy / default visibility argument does make sense to me.
> > > > >
> > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > > > in a more flexible and performant manner. Also, are we happy with recording
> > > > > just single depth for persistent tracking?
> 
> IIUC, by single depth you mean no call stack capturing?

Yes.

> If so, that's the idea behind the context capture feature so that we
> can enable it on specific allocations only after we determine there is
> something interesting there. So, with low-cost persistent tracking we
> can determine the suspects and then pay some more to investigate those
> suspects in more detail.

Yeah, I was wondering whether it'd be useful to have that configurable so
that it'd be possible for a user to say "I'm okay with the cost, please
track more context per allocation". Given that tracking the immediate caller
is already a huge improvement and narrowing it down from there using
existing tools shouldn't be that difficult, I don't think this is a blocker
in any way. It just bothers me a bit that the code is structured so that
source line is the main abstraction.

> > > > Not sure what you're envisioning?
> > > >
> > > > I'd consider the debugfs interface pretty integral; it's much more
> > > > discoverable for users, and it's hardly any code out of the whole
> > > > patchset.
> > >
> > > You can do the same thing with a bpftrace one liner tho. That's rather
> > > difficult to beat.
> 
> debugfs seemed like a natural choice for such information. If another
> interface is more appropriate I'm happy to explore that.
> 
> >
> > Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
> > allocations.ctx under debugfs. I meant allocations.ctx is redundant.
> 
> Do you mean that we could display allocation context in
> debugfs/allocations file (for the allocations which we explicitly
> enabled context capturing)?

Sorry about the fumbled communication. Here's what I mean:

* Improving memory allocation visibility makes sense to me. To me, a more
  natural place for that feels like /proc/allocations next to other memory
  info files rather than under debugfs.

* The default visibility provided by "allocations" provides something which
  is more difficult or at least cumbersome to obtain using existing tracing
  tools. However, what's provided by "allocations.ctx" can be trivially
  obtained using kprobe and BPF and seems redundant.

Thanks.
Tejun Heo May 3, 2023, 8 p.m. UTC | #43
Hello,

On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote:
> > If so, that's the idea behind the context capture feature so that we
> > can enable it on specific allocations only after we determine there is
> > something interesting there. So, with low-cost persistent tracking we
> > can determine the suspects and then pay some more to investigate those
> > suspects in more detail.
> 
> Yeah, I was wondering whether it'd be useful to have that configurable so
> that it'd be possible for a user to say "I'm okay with the cost, please
> track more context per allocation". Given that tracking the immediate caller
> is already a huge improvement and narrowing it down from there using
> existing tools shouldn't be that difficult, I don't think this is a blocker
> in any way. It just bothers me a bit that the code is structured so that
> source line is the main abstraction.

Another related question. So, the reason the macro'ing stuff is needed is
that you want to print the line directly from the kernel, right? Is that
really necessary? Values from __builtin_return_address() can easily be
printed out as function+offset from kernel which already gives most of the
necessary information for triaging and mapping that back to source line from
userspace isn't difficult. Wouldn't using __builtin_return_address() make
the whole thing a lot simpler?

Thanks.
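The alternative being proposed here can be sketched in userspace, assuming a made-up wrapper traced_alloc() (not an API from the series): the callee records its caller's address via __builtin_return_address(0), and mapping that back to function+offset or a source line is left to userspace tooling.

```c
#include <stdlib.h>
#include <assert.h>

/* Address of the most recent caller of traced_alloc(). */
static void *last_caller;

/* noinline so the function has a real return address to record. */
__attribute__((noinline))
void *traced_alloc(size_t size)
{
	/* Level 0 is our immediate caller; deeper levels are unreliable
	 * without frame pointers, which is part of the trade-off. */
	last_caller = __builtin_return_address(0);
	return malloc(size);
}
```

The kernel analogue would print last_caller with %pS to get function+offset; the cost this avoids at the call site is exactly the per-site tag, which is where the runtime-lookup question below comes from.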
Andrey Ryabinin May 3, 2023, 8:04 p.m. UTC | #44
On Wed, May 3, 2023 at 6:35 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Kent.
>
> On Wed, May 03, 2023 at 04:05:08AM -0400, Kent Overstreet wrote:
> > No, we're still waiting on the tracing people to _demonstrate_, not
> > claim, that this is at all possible in a comparable way with tracing.
>
> So, we (meta) happen to do stuff like this all the time in the fleet to hunt
> down tricky persistent problems like memory leaks, ref leaks, what-have-you.
> In recent kernels, with kprobe and BPF, our ability to debug these sorts of
> problems has improved a great deal. Below, I'm attaching a bcc script I used
> to hunt down, IIRC, a double vfree. It's not exactly for a leak but leaks
> can follow the same pattern.
>

For leaks there is an example bcc tool:
https://github.com/iovisor/bcc/blob/master/tools/memleak.py
Suren Baghdasaryan May 3, 2023, 8:08 p.m. UTC | #45
On Wed, May 3, 2023 at 12:49 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, May 03, 2023 at 12:41:08PM -0700, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:09 PM Tejun Heo <tj@kernel.org> wrote:
> > >
> > > On Wed, May 03, 2023 at 08:58:51AM -1000, Tejun Heo wrote:
> > > > On Wed, May 03, 2023 at 02:56:44PM -0400, Kent Overstreet wrote:
> > > > > On Wed, May 03, 2023 at 08:40:07AM -1000, Tejun Heo wrote:
> > > > > > > Yeah, easy / default visibility argument does make sense to me.
> > > > > >
> > > > > > So, a bit of addition here. If this is the thrust, the debugfs part seems
> > > > > > rather redundant, right? That's trivially obtainable with tracing / bpf and
> > > > > > in a more flexible and performant manner. Also, are we happy with recording
> > > > > > just single depth for persistent tracking?
> >
> > IIUC, by single depth you mean no call stack capturing?
>
> Yes.
>
> > If so, that's the idea behind the context capture feature so that we
> > can enable it on specific allocations only after we determine there is
> > something interesting there. So, with low-cost persistent tracking we
> > can determine the suspects and then pay some more to investigate those
> > suspects in more detail.
>
> Yeah, I was wondering whether it'd be useful to have that configurable so
> that it'd be possible for a user to say "I'm okay with the cost, please
> track more context per allocation".

I assume by "more context per allocation" you mean for a specific
allocation, not for all allocations.
So, in a sense you are asking if the context capture feature can be
dropped from this series and implemented using some other means. Is
that right?

> Given that tracking the immediate caller
> is already a huge improvement and narrowing it down from there using
> existing tools shouldn't be that difficult, I don't think this is a blocker
> in any way. It just bothers me a bit that the code is structured so that
> source line is the main abstraction.
>
> > > > > Not sure what you're envisioning?
> > > > >
> > > > > I'd consider the debugfs interface pretty integral; it's much more
> > > > > discoverable for users, and it's hardly any code out of the whole
> > > > > patchset.
> > > >
> > > > You can do the same thing with a bpftrace one liner tho. That's rather
> > > > difficult to beat.
> >
> > debugfs seemed like a natural choice for such information. If another
> > interface is more appropriate I'm happy to explore that.
> >
> > >
> > > Ah, shit, I'm an idiot. Sorry. I thought allocations was under /proc and
> > > allocations.ctx under debugfs. I meant allocations.ctx is redundant.
> >
> > Do you mean that we could display allocation context in
> > debugfs/allocations file (for the allocations which we explicitly
> > enabled context capturing)?
>
> Sorry about the fumbled communication. Here's what I mean:
>
> * Improving memory allocation visibility makes sense to me. To me, a more
>   natural place for that feels like /proc/allocations next to other memory
>   info files rather than under debugfs.

TBH I would love that if this approach is acceptable.

>
> * The default visibility provided by "allocations" provides something which
>   is more difficult or at least cumbersome to obtain using existing tracing
>   tools. However, what's provided by "allocations.ctx" can be trivially
>   obtained using kprobe and BPF and seems redundant.

Hmm. That might be a good way forward. Since context capture already
has a high performance overhead, maybe choosing the more generic
rather than the most performant solution is the right answer here. I'll
need to think about it some more, but thanks for the idea!

>
> Thanks.
>
> --
> tejun
Johannes Weiner May 3, 2023, 8:11 p.m. UTC | #46
On Wed, May 03, 2023 at 01:08:40PM -0700, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:49 PM Tejun Heo <tj@kernel.org> wrote:
> > * Improving memory allocation visibility makes sense to me. To me, a more
> >   natural place for that feels like /proc/allocations next to other memory
> >   info files rather than under debugfs.
> 
> TBH I would love that if this approach is acceptable.

Ack
Suren Baghdasaryan May 3, 2023, 8:14 p.m. UTC | #47
On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote:
> > > If so, that's the idea behind the context capture feature so that we
> > > can enable it on specific allocations only after we determine there is
> > > something interesting there. So, with low-cost persistent tracking we
> > > can determine the suspects and then pay some more to investigate those
> > > suspects in more detail.
> >
> > Yeah, I was wondering whether it'd be useful to have that configurable so
> > that it'd be possible for a user to say "I'm okay with the cost, please
> > track more context per allocation". Given that tracking the immediate caller
> > is already a huge improvement and narrowing it down from there using
> > existing tools shouldn't be that difficult, I don't think this is a blocker
> > in any way. It just bothers me a bit that the code is structured so that
> > source line is the main abstraction.
>
> Another related question. So, the reason for macro'ing stuff is needed is
> because you want to print the line directly from kernel, right?

The main reason is that we want to inject a code tag at the
location of the call. If we have a code tag injected at every
allocation call, then finding the allocation counter (code tag) to
operate on takes no time.

> Is that
> really necessary? Values from __builtin_return_address() can easily be
> printed out as function+offset from kernel which already gives most of the
> necessary information for triaging and mapping that back to source line from
> userspace isn't difficult. Wouldn't using __builtin_return_address() make
> the whole thing a lot simpler?

If we do that we have to associate that address with the allocation
counter at runtime on the first allocation and look it up on all
following allocations. That introduces the overhead which we are
trying to avoid by using macros.

>
> Thanks.
>
> --
> tejun
Tejun Heo May 4, 2023, 2:16 a.m. UTC | #48
Hello,

On Wed, May 03, 2023 at 01:08:40PM -0700, Suren Baghdasaryan wrote:
> > Yeah, I was wondering whether it'd be useful to have that configurable so
> > that it'd be possible for a user to say "I'm okay with the cost, please
> > track more context per allocation".
> 
> I assume by "more context per allocation" you mean for a specific
> allocation, not for all allocations.
> So, in a sense you are asking if the context capture feature can be
> dropped from this series and implemented using some other means. Is
> that right?

Oh, no, what I meant was whether it'd make sense to allow enabling richer
tracking (e.g. recording deeper into the call stack) for all allocations. For
targeted tracking, it seems that the kernel already has everything needed.
But this is more of an idle thought and the immediate caller tracking is
already a big improvement in terms of visibility, so no need to be hung up
on this part of discussion at all.

Thanks.
Tejun Heo May 4, 2023, 2:25 a.m. UTC | #49
Hello,

On Wed, May 03, 2023 at 01:14:57PM -0700, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote:
> > Another related question. So, the reason for macro'ing stuff is needed is
> > because you want to print the line directly from kernel, right?
> 
> The main reason is because we want to inject a code tag at the
> location of the call. If we have a code tag injected at every
> allocation call, then finding the allocation counter (code tag) to
> operate takes no time.
>
> > Is that
> > really necessary? Values from __builtin_return_address() can easily be
> > printed out as function+offset from kernel which already gives most of the
> > necessary information for triaging and mapping that back to source line from
> > userspace isn't difficult. Wouldn't using __builtin_return_address() make
> > the whole thing a lot simpler?
> 
> If we do that we have to associate that address with the allocation
> counter at runtime on the first allocation and look it up on all
> following allocations. That introduces the overhead which we are
> trying to avoid by using macros.

I see. I'm a bit skeptical about the performance angle given that the hot
path can be probably made really cheap even with lookups. In most cases,
it's just gonna be an extra pointer deref and a few more arithmetics. That
can show up in microbenchmarks but it's not gonna be much. The benefit of
going that route would be the tracking thing being mostly self contained.

That said, it's nice to not have to worry about allocating tracking slots
and managing hash table, so no strong opinion.

Thanks.
Kent Overstreet May 4, 2023, 3:33 a.m. UTC | #50
On Wed, May 03, 2023 at 04:25:30PM -1000, Tejun Heo wrote:
> I see. I'm a bit skeptical about the performance angle given that the hot
> path can be probably made really cheap even with lookups. In most cases,
> it's just gonna be an extra pointer deref and a few more arithmetics. That
> can show up in microbenchmarks but it's not gonna be much. The benefit of
> going that route would be the tracking thing being mostly self contained.

The only way to do it with a single additional pointer deref would be
with a completely statically sized hash table without chaining - it'd
have to be open addressing.

More realistically you're looking at ~3 dependent loads.
Suren Baghdasaryan May 4, 2023, 3:33 a.m. UTC | #51
On Wed, May 3, 2023 at 7:25 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, May 03, 2023 at 01:14:57PM -0700, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote:
> > > Another related question. So, the reason for macro'ing stuff is needed is
> > > because you want to print the line directly from kernel, right?
> >
> > The main reason is because we want to inject a code tag at the
> > location of the call. If we have a code tag injected at every
> > allocation call, then finding the allocation counter (code tag) to
> > operate takes no time.
> >
> > > Is that
> > > really necessary? Values from __builtin_return_address() can easily be
> > > printed out as function+offset from kernel which already gives most of the
> > > necessary information for triaging and mapping that back to source line from
> > > userspace isn't difficult. Wouldn't using __builtin_return_address() make
> > > the whole thing a lot simpler?
> >
> > If we do that we have to associate that address with the allocation
> > counter at runtime on the first allocation and look it up on all
> > following allocations. That introduces the overhead which we are
> > trying to avoid by using macros.
>
> I see. I'm a bit skeptical about the performance angle given that the hot
> path can be probably made really cheap even with lookups. In most cases,
> it's just gonna be an extra pointer deref and a few more arithmetics. That
> can show up in microbenchmarks but it's not gonna be much. The benefit of
> going that route would be the tracking thing being mostly self contained.

I'm in the process of rerunning the tests to compare the overhead on
the latest kernel but I don't expect that to be cheap compared to
kmalloc().

>
> That said, it's nice to not have to worry about allocating tracking slots
> and managing hash table, so no strong opinion.
>
> Thanks.
>
> --
> tejun
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Petr Tesařík May 4, 2023, 8 a.m. UTC | #52
On Wed, 3 May 2023 13:14:57 -0700
Suren Baghdasaryan <surenb@google.com> wrote:

> On Wed, May 3, 2023 at 1:00 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, May 03, 2023 at 09:48:55AM -1000, Tejun Heo wrote:  
> > > > If so, that's the idea behind the context capture feature so that we
> > > > can enable it on specific allocations only after we determine there is
> > > > something interesting there. So, with low-cost persistent tracking we
> > > > can determine the suspects and then pay some more to investigate those
> > > > suspects in more detail.  
> > >
> > > Yeah, I was wondering whether it'd be useful to have that configurable so
> > > that it'd be possible for a user to say "I'm okay with the cost, please
> > > track more context per allocation". Given that tracking the immediate caller
> > > is already a huge improvement and narrowing it down from there using
> > > existing tools shouldn't be that difficult, I don't think this is a blocker
> > > in any way. It just bothers me a bit that the code is structured so that
> > > source line is the main abstraction.  
> >
> > Another related question. So, the reason for macro'ing stuff is needed is
> > because you want to print the line directly from kernel, right?  
> 
> The main reason is because we want to inject a code tag at the
> location of the call. If we have a code tag injected at every
> allocation call, then finding the allocation counter (code tag) to
> operate takes no time.

Another consequence is that each source code location gets its own tag.
The compiler can no longer apply common subexpression elimination
(because the tag is different). I have some doubts that there are any
places where CSE could be applied to allocation calls, but in general,
this is one more difference to using _RET_IP_.

Petr T

> > Is that
> > really necessary? Values from __builtin_return_address() can easily be
> > printed out as function+offset from kernel which already gives most of the
> > necessary information for triaging and mapping that back to source line from
> > userspace isn't difficult. Wouldn't using __builtin_return_address() make
> > the whole thing a lot simpler?  
> 
> If we do that we have to associate that address with the allocation
> counter at runtime on the first allocation and look it up on all
> following allocations. That introduces the overhead which we are
> trying to avoid by using macros.
> 
> >
> > Thanks.
> >
> > --
> > tejun  
>
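The CSE point above can be illustrated in userspace (plain_id() and TAGGED_ID() are invented for the sketch): once each expansion carries its own static tag, two textually identical calls are no longer common subexpressions.

```c
#include <assert.h>

static int shared_tag;

/* With a plain helper, identical calls return the same object, so the
 * compiler is free to fold them into one. This mirrors the _RET_IP_
 * style where the call sites stay indistinguishable expressions. */
static int *plain_id(void)
{
	return &shared_tag;
}

/* With a tag injected per expansion (as the allocation macros do),
 * each use instantiates its own block-scoped static, so textually
 * identical expressions denote distinct objects and cannot be merged. */
#define TAGGED_ID() ({ static int __tag = __LINE__; &__tag; })
```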
Michal Hocko May 4, 2023, 9:07 a.m. UTC | #53
On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> Thanks for summarizing!
> 
> > At least those I find the most important:
> > - This is a big change and it adds a significant maintenance burden
> >   because each allocation entry point needs to be handled specifically.
> >   The cost will grow with the intended coverage especially there when
> >   allocation is hidden in a library code.
> 
> Do you mean with more allocations in the codebase more codetags will
> be generated? Is that the concern?

No. I am mostly concerned about the _maintenance_ overhead. For the
bare tracking (without profiling and thus stack traces) only those
allocations that are directly inlined into the consumer are really
of any use. That increases the code impact of the tracing because any
relevant allocation location has to go through the micro surgery. 

e.g. is it really interesting to know that there is a likely memory
leak in seq_file proper doing an allocation? No, as it is most likely
the specific implementation using seq_file that is leaking. There are
other examples like that. See?

> Or maybe as you commented in
> another patch that context capturing feature does not limit how many
> stacks will be captured?

That is a memory overhead which can be really huge and it would be nice
to be more explicit about that in the cover letter. It is a downside for
sure but not something that has a code maintenance impact and it is an
opt-in so it can be enabled only when necessary.

Quite honestly, though, the more I look into context capturing part it
seems to me that there is much more to be reconsidered there and if you
really want to move forward with the code tagging part then you should
drop that for now. It would make the whole series smaller and easier to
digest.

> > - It has been brought up that this is duplicating functionality already
> >   available via existing tracing infrastructure. You should make it very
> >   clear why that is not suitable for the job
> 
> I experimented with using tracing with _RET_IP_ to implement this
> accounting. The major issue is the _RET_IP_ to codetag lookup runtime
> overhead which is orders of magnitude higher than proposed code
> tagging approach. With code tagging proposal, that link is resolved at
> compile time. Since we want this mechanism deployed in production, we
> want to keep the overhead to the absolute minimum.
> You asked me before how much overhead would be tolerable and the
> answer will always be "as small as possible". This is especially true
> for slab allocators which are ridiculously fast and regressing them
> would be very noticable (due to the frequent use).

It would have been more convincing if you had some numbers at hand.
E.g. this is a typical workload we are dealing with. With the compile
time tags we are able to learn this with that much of cost. With a dynamic
tracing we are able to learn this much with that cost. See? As small as
possible is a rather vague term that different people will have a very
different idea about.

> There is another issue, which I think can be solved in a smart way but
> will either affect performance or would require more memory. With the
> tracing approach we don't know beforehand how many individual
> allocation sites exist, so we have to allocate code tags (or similar
> structures for counting) at runtime vs compile time. We can be smart
> about it and allocate in batches or even preallocate more than we need
> beforehand but, as I said, it will require some kind of compromise.

I have tried our usual distribution config (only vmlinux without modules
so the real impact will be larger as we build a lot of stuff into
modules) just to get an idea:
   text    data     bss     dec     hex filename
28755345        17040322        19845124        65640791        3e99957 vmlinux.before
28867168        17571838        19386372        65825378        3ec6a62 vmlinux.after

Less than 1% for text, 3% for data. This is not all that terrible
for an initial submission and a more dynamic approach could be added
later. E.g. with a smaller pre-allocated hash table that could be
expanded lazily. Anyway not something I would be losing sleep over. This
can always be improved later on.

> I understand that code tagging creates additional maintenance burdens
> but I hope it also produces enough benefits that people will want
> this. The cost is also hopefully amortized when additional
> applications like the ones we presented in RFC [1] are built using the
> same framework.

TBH I am much more concerned about the maintenance burden on the MM side
than the actual code tagging itself, which is much more self-contained. I
haven't seen other potential applications of the same infrastructure and
maybe the code impact would be much smaller than in the MM proper. Our
allocator API is really hairy and convoluted.

> > - We already have page_owner infrastructure that provides allocation
> >   tracking data. Why it cannot be used/extended?
> 
> 1. The overhead.

Do you have any numbers?

> 2. Covers only page allocators.

Yes this sucks.
> 
> I didn't think about extending the page_owner approach to slab
> allocators but I suspect it would not be trivial. I don't see
> attaching an owner to every slab object to be a scalable solution. The
> overhead would again be of concern here.

This would have been a nice argument to mention in the changelog so that
we know that you have considered that option at least. Why should I (as
a reviewer) have to guess wildly at that?

> I should point out that there was one important technical concern
> about lack of a kill switch for this feature, which was an issue for
> distributions that can't disable the CONFIG flag. In this series we
> addressed that concern.

Thanks, that is certainly appreciated. I haven't looked deeper into that
part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING
implies unconditional page_ext and therefore the memory overhead
associated with that. There seems to be a killswitch nomem_profiling but
from a quick look it doesn't seem to disable page_ext allocations. I
might be missing something there of course. Having a high-level
description for that would be really nice as well.

> [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
Suren Baghdasaryan May 4, 2023, 3:08 p.m. UTC | #54
On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:25 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > Thanks for summarizing!
> >
> > > At least those I find the most important:
> > > - This is a big change and it adds a significant maintenance burden
> > >   because each allocation entry point needs to be handled specifically.
> > >   The cost will grow with the intended coverage especially there when
> > >   allocation is hidden in a library code.
> >
> > Do you mean with more allocations in the codebase more codetags will
> > be generated? Is that the concern?
>
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro surgery.
>
> e.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is the specific
> implementation using seq_file that is most likely leaking. There are
> other examples like that. See?

Yes, I see that. One level tracking does not provide all the
information needed to track such issues. Something more informative
would cost more. That's why our proposal is to have a light-weight
mechanism to get a high level picture and then be able to zoom into a
specific area using context capture. If you have ideas to improve
this, I'm open to suggestions.

>
> > Or maybe as you commented in
> > another patch that context capturing feature does not limit how many
> > stacks will be captured?
>
> That is a memory overhead which can be really huge and it would be nice
> to be more explicit about that in the cover letter. It is a downside for
> sure but not something that has a code maintenance impact and it is an
> opt-in so it can be enabled only when necessary.

You are right, I'll add that into the cover letter.

>
> Quite honestly, though, the more I look into context capturing part it
> seems to me that there is much more to be reconsidered there and if you
> really want to move forward with the code tagging part then you should
> drop that for now. It would make the whole series smaller and easier to
> digest.

Sure, I don't see an issue with removing that for now and refining the
mechanism before posting again.

>
> > > - It has been brought up that this is duplicating functionality already
> > >   available via existing tracing infrastructure. You should make it very
> > >   clear why that is not suitable for the job
> >
> > I experimented with using tracing with _RET_IP_ to implement this
> > accounting. The major issue is the _RET_IP_ to codetag lookup runtime
> > overhead which is orders of magnitude higher than proposed code
> > tagging approach. With code tagging proposal, that link is resolved at
> > compile time. Since we want this mechanism deployed in production, we
> > want to keep the overhead to the absolute minimum.
> > You asked me before how much overhead would be tolerable and the
> > answer will always be "as small as possible". This is especially true
> > for slab allocators which are ridiculously fast and regressing them
> > would be very noticeable (due to the frequent use).
>
> It would have been more convincing if you had some numbers at hands.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this with that much of cost. With a dynamic
> tracing we are able to learn this much with that cost. See? As small as
> possible is a rather vague term that different people will have a very
> different idea about.

I'm rerunning my tests with the latest kernel to collect the
comparison data. I profiled these solutions before but the kernel
changed since then, so I need to update them.

>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or would require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> I have tried our usual distribution config (only vmlinux without modules
> so the real impact will be larger as we build a lot of stuff into
> modules) just to get an idea:
>    text    data     bss     dec     hex filename
> 28755345        17040322        19845124        65640791        3e99957 vmlinux.before
> 28867168        17571838        19386372        65825378        3ec6a62 vmlinux.after
>
> Less than 1% for text, 3% for data.  This is not all that terrible
> for an initial submission and a more dynamic approach could be added
> later. E.g. with a smaller pre-allocated hash table that could be
> expanded lazily. Anyway not something I would be losing sleep over. This
> can always be improved later on.

Ah, right. I should have mentioned this overhead too. Thanks for
keeping me honest.

> > I understand that code tagging creates additional maintenance burdens
> > but I hope it also produces enough benefits that people will want
> > this. The cost is also hopefully amortized when additional
> > applications like the ones we presented in RFC [1] are built using the
> > same framework.
>
> TBH I am much more concerned about the maintenance burden on the MM side
> than the actual code tagging itself, which is much more self-contained. I
> haven't seen other potential applications of the same infrastructure and
> maybe the code impact would be much smaller than in the MM proper. Our
> allocator API is really hairy and convoluted.

Yes, other applications are much smaller and cleaner. MM allocation
code is quite complex indeed.

>
> > > - We already have page_owner infrastructure that provides allocation
> > >   tracking data. Why it cannot be used/extended?
> >
> > 1. The overhead.
>
> Do you have any numbers?

Will post once my tests are completed.

>
> > 2. Covers only page allocators.
>
> Yes this sucks.
> >
> > I didn't think about extending the page_owner approach to slab
> > allocators but I suspect it would not be trivial. I don't see
> > attaching an owner to every slab object to be a scalable solution. The
> > overhead would again be of concern here.
>
> This would have been a nice argument to mention in the changelog so that
> we know that you have considered that option at least. Why should I (as
> a reviewer) wild guess that?

Sorry, it's hard to remember all the decisions, discussions and
conclusions when working on a feature over a long time period. I'll
include more information about that.

>
> > I should point out that there was one important technical concern
> > about lack of a kill switch for this feature, which was an issue for
> > distributions that can't disable the CONFIG flag. In this series we
> > addressed that concern.
>
> Thanks, that is certainly appreciated. I haven't looked deeper into that
> part but from the cover letter I have understood that CONFIG_MEM_ALLOC_PROFILING
> implies unconditional page_ext and therefore the memory overhead
> associated with that. There seems to be a kill switch, nomem_profiling, but
> from a quick look it doesn't seem to disable page_ext allocations. I
> might be missing something there of course. Having a high-level
> description for that would be really nice as well.

Right, will add a description of that as well.
We eliminate the runtime overhead but not the memory one. However I
believe it's also doable using page_ext_operations.need callback. Will
look into it.
Thanks,
Suren.

>
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko May 7, 2023, 10:27 a.m. UTC | #55
On Thu 04-05-23 08:08:13, Suren Baghdasaryan wrote:
> On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > e.g. is it really interesting to know that there is a likely memory
> > leak in seq_file proper doing an allocation? No, as it is the specific
> > implementation using seq_file that is most likely leaking. There are
> > other examples like that. See?
> 
> Yes, I see that. One level tracking does not provide all the
> information needed to track such issues. Something more informative
> would cost more. That's why our proposal is to have a light-weight
> mechanism to get a high level picture and then be able to zoom into a
> specific area using context capture. If you have ideas to improve
> this, I'm open to suggestions.

Well, I think that a more scalable approach would be to not track in
the callers but in the allocator itself. The full stack trace might not
be all that important or interesting, and might even increase the
overall overhead, but a partial one with a configurable depth would
sound more interesting to me. A per-cache hashtable indexed by a stack
trace reference, plus extending the slab metadata to store the reference
for the kfree path, won't be free, but the overhead might be just
acceptable.

If the stack unwinding is really too expensive for tracking, another
option would be to add code tags dynamically to the compiled
kernel without any actual code changes. I can imagine the tracing
infrastructure could be used for that, or maybe even consider compiler
plugins to inject code for functions marked as allocators. So the kernel
could be instrumented even without any userspace tooling required by
users directly.
Kent Overstreet May 7, 2023, 5:01 p.m. UTC | #56
On Sun, May 07, 2023 at 12:27:18PM +0200, Michal Hocko wrote:
> On Thu 04-05-23 08:08:13, Suren Baghdasaryan wrote:
> > On Thu, May 4, 2023 at 2:07 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > e.g. is it really interesting to know that there is a likely memory
> > > leak in seq_file proper doing an allocation? No, as it is the specific
> > > implementation using seq_file that is most likely leaking. There are
> > > other examples like that. See?
> > 
> > Yes, I see that. One level tracking does not provide all the
> > information needed to track such issues. Something more informative
> > would cost more. That's why our proposal is to have a light-weight
> > mechanism to get a high level picture and then be able to zoom into a
> > specific area using context capture. If you have ideas to improve
> > this, I'm open to suggestions.
> 
> Well, I think that a more scalable approach would be to not track in
> the callers but in the allocator itself. The full stack trace might not
> be all that important or interesting, and might even increase the
> overall overhead, but a partial one with a configurable depth would
> sound more interesting to me. A per-cache hashtable indexed by a stack
> trace reference, plus extending the slab metadata to store the reference
> for the kfree path, won't be free, but the overhead might be just
> acceptable.

How would you propose to annotate what call chains need what depth of
stack trace recorded?

How would you propose to make this performant?
Kent Overstreet May 7, 2023, 5:20 p.m. UTC | #57
On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro surgery. 
> 
> e.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is the specific
> implementation using seq_file that is most likely leaking. There are
> other examples like that. See?

So this is a rather strange usage of "maintenance overhead" :)

But it's something we thought of. If we had to plumb around a _RET_IP_
parameter, or a codetag pointer, it would be a hassle annotating the
correct callsite.

Instead, alloc_hooks() wraps a memory allocation function and stashes a
pointer to a codetag in task_struct for use by the core slub/buddy
allocator code.

That means that in your example, to move tracking to a given seq_file
function, we just:
 - hook the seq_file function with alloc_hooks
 - change the seq_file function to call non-hooked memory allocation
   functions.

> It would have been more convincing if you had some numbers at hands.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this with that much of cost. With a dynamic
> tracing we are able to learn this much with that cost. See? As small as
> possible is a rather vague term that different people will have a very
> different idea about.

Engineers don't prototype and benchmark everything as a matter of
> course, we're expected to have the rough equivalent of a CS education
and an understanding of big O notation, cache architecture, etc.

The slub fast path is _really_ fast - double word non locked cmpxchg.
That's what we're trying to compete with. Adding a big globally
accessible hash table is going to tank performance compared to that.

I believe the numbers we already posted speak for themselves. We're
considerably faster than memcg, fast enough to run in production.

I'm not going to be switching to a design that significantly regresses
performance, sorry :)

> TBH I am much more concerned about the maintenance burden on the MM side
> > than the actual code tagging itself, which is much more self-contained. I
> haven't seen other potential applications of the same infrastructure and
> maybe the code impact would be much smaller than in the MM proper. Our
> allocator API is really hairy and convoluted.

You keep saying "maintenance burden", but this is a criticism that can
be directed at _any_ patchset that adds new code; it's generally
understood that that is the accepted cost for new functionality.

If you have specific concerns where you think we did something that
makes the code harder to maintain, _please point them out in the
appropriate patch_. I don't think you'll find too much - the
instrumentation in the allocators simply generalizes what memcg was
already doing, and the hooks themselves are a bit boilerplate-y but hardly
the sort of thing people will be tripping over later.

TL;DR - put up or shut up :)
Steven Rostedt May 7, 2023, 8:55 p.m. UTC | #58
On Sun, 7 May 2023 13:20:55 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> > No. I am mostly concerned about the _maintenance_ overhead. For the
> > bare tracking (without profiling and thus stack traces) only those
> > allocations that are directly inlined into the consumer are really
> > of any use. That increases the code impact of the tracing because any
> > relevant allocation location has to go through the micro surgery. 
> > 
> > e.g. is it really interesting to know that there is a likely memory
> > leak in seq_file proper doing an allocation? No, as it is the specific
> > implementation using seq_file that is most likely leaking. There are
> > other examples like that. See?
> 
> So this is a rather strange usage of "maintenance overhead" :)
> 
> But it's something we thought of. If we had to plumb around a _RET_IP_
> parameter, or a codetag pointer, it would be a hassle annotating the
> correct callsite.
> 
> Instead, alloc_hooks() wraps a memory allocation function and stashes a
> pointer to a codetag in task_struct for use by the core slub/buddy
> allocator code.
> 
> That means that in your example, to move tracking to a given seq_file
> function, we just:
>  - hook the seq_file function with alloc_hooks
>  - change the seq_file function to call non-hooked memory allocation
>    functions.
> 
> > It would have been more convincing if you had some numbers at hands.
> > E.g. this is a typical workload we are dealing with. With the compile
> > time tags we are able to learn this with that much of cost. With a dynamic
> > tracing we are able to learn this much with that cost. See? As small as
> > possible is a rather vague term that different people will have a very
> > different idea about.  
> 
> Engineers don't prototype and benchmark everything as a matter of
> course, we're expected to have the rough equivalent of a CS education
> and an understanding of big O notation, cache architecture, etc.
> 
> The slub fast path is _really_ fast - double word non locked cmpxchg.
> That's what we're trying to compete with. Adding a big globally
> accessible hash table is going to tank performance compared to that.
> 
> I believe the numbers we already posted speak for themselves. We're
> considerably faster than memcg, fast enough to run in production.
> 
> I'm not going to be switching to a design that significantly regresses
> performance, sorry :)
> 
> > TBH I am much more concerned about the maintenance burden on the MM side
> > than the actual code tagging itself, which is much more self-contained. I
> > haven't seen other potential applications of the same infrastructure and
> > maybe the code impact would be much smaller than in the MM proper. Our
> > allocator API is really hairy and convoluted.  
> 
> You keep saying "maintenance burden", but this is a criticism that can
> be directed at _any_ patchset that adds new code; it's generally
> understood that that is the accepted cost for new functionality.
> 
> If you have specific concerns where you think we did something that
> makes the code harder to maintain, _please point them out in the
> appropriate patch_. I don't think you'll find too much - the
> instrumentation in the allocators simply generalizes what memcg was
> already doing, and the hooks themselves are a bit boilerplate-y but hardly
> the sort of thing people will be tripping over later.
> 


> TL;DR - put up or shut up :)

Your email would have been much better if you left the above line out. :-/
Comments like the above do not go over well via text. Even if you add the ":)"

Back to the comment about this being a burden. I just applied all the
patches and did a diff (much easier than wading through 40 patches!)

One thing we need to get rid of, and this isn't your fault but this
series is extending it, is the use of the damn underscores to
differentiate functions. This is one of the abominations of the early
Linux kernel code base. I admit, I'm guilty of this too. But today I
have learned and avoid it at all cost. Underscores are meaningless and
error prone, not to mention confusing to people coming onboard. Let's
use something that has some meaning.

What's the difference between:

  _kmem_cache_alloc_node() and __kmem_cache_alloc_node()?

And if every allocation function requires a double hook, that is a
maintenance burden. We do this for things like system calls, but
there's a strong rationale for that. I'm guessing that Michal's concern
is that he and other mm maintainers will need to make sure any new
allocation function has this double call and is done properly. This
isn't just new code that needs to be maintained, it's something that
needs to be understood when adding any new interface to page
allocations.

It's true that all new code has a maintenance burden, and unless the
maintainer feels the burden is worth their time, they have the right to
complain about it.

I've given talks about how to get code into open source projects, and
the title is "Commits are pulled and never pushed". Where basically I
talk about convincing the maintainers that they want your change, and
not by pushing it because you want it.

-- Steve
Kent Overstreet May 7, 2023, 9:53 p.m. UTC | #59
On Sun, May 07, 2023 at 04:55:38PM -0400, Steven Rostedt wrote:
> > TL;DR - put up or shut up :)
> 
> Your email would have been much better if you left the above line out. :-/
> Comments like the above do not go over well via text. Even if you add the ":)"

I stand by that comment :)

> Back to the comment about this being a burden. I just applied all the
> patches and did a diff (much easier than to wade through 40 patches!)
> 
> One thing we need to get rid of, and this isn't your fault but this
> series is extending it, is the use of the damn underscores to
> differentiate functions. This is one of the abominations of the early
> Linux kernel code base. I admit, I'm guilty of this too. But today I
> have learned and avoid it at all cost. Underscores are meaningless and
> error prone, not to mention confusing to people coming onboard. Let's
> use something that has some meaning.
> 
> What's the difference between:
> 
>   _kmem_cache_alloc_node() and __kmem_cache_alloc_node()?
> 
> And if every allocation function requires a double hook, that is a
> maintenance burden. We do this for things like system calls, but
> there's a strong rationale for that.

The underscore is a legitimate complaint - I brought this up in
development, not sure why it got lost. We'll do something better with a
consistent suffix, perhaps kmem_cache_alloc_noacct().

> I'm guessing that Michal's concern is that he and other mm maintainers
> will need to make sure any new allocation function has this double
> call and is done properly. This isn't just new code that needs to be
> maintained, it's something that needs to be understood when adding any
> new interface to page allocations.

Well, isn't that part of the problem then? We're _this far_ into the
thread and still guessing at what Michal's "maintenance concerns" are?

Regarding your specific concern: My main design consideration was making
sure every allocation gets accounted somewhere; we don't want a memory
allocation profiling system where it's possible for allocations to be
silently not tracked! There's warnings in the core allocators if they
see an allocation without an alloc tag, and in testing we chased down
everything we found.

So if anyone later creates a new memory allocation interface and forgets
to hook it, they'll see the same warning - but perhaps we could improve
the warning message so it says exactly what needs to be done (wrap the
allocation in an alloc_hooks() call).

> It's true that all new code has a maintenance burden, and unless the
> maintainer feels the burden is worth their time, they have the right to
> complain about it.

Sure, but complaints should say what they're complaining about.
Complaints so vague they could be levelled at any patchset don't do
anything for the discussion.
Steven Rostedt May 7, 2023, 10:09 p.m. UTC | #60
On Sun, 7 May 2023 17:53:09 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> The underscore is a legitimate complaint - I brought this up in
> development, not sure why it got lost. We'll do something better with a
> consistent suffix, perhaps kmem_cache_alloc_noacct().

Would "_noprofile()" be a better name? I'm not sure what "acct" means.

-- Steve
Kent Overstreet May 7, 2023, 10:17 p.m. UTC | #61
On Sun, May 07, 2023 at 06:09:11PM -0400, Steven Rostedt wrote:
> On Sun, 7 May 2023 17:53:09 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > The underscore is a legitimate complaint - I brought this up in
> > development, not sure why it got lost. We'll do something better with a
> > consistent suffix, perhaps kmem_cache_alloc_noacct().
> 
> Would "_noprofile()" be a better name? I'm not sure what "acct" means.

account - but _noprofile() is probably better. Didn't suggest it at
first because of an association in my head with CPU profiling, but
considering we already renamed the feature to memory allocation
profiling... :)
Petr Tesařík May 8, 2023, 3:52 p.m. UTC | #62
On Sun, 7 May 2023 13:20:55 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> > No. I am mostly concerned about the _maintenance_ overhead. For the
> > bare tracking (without profiling and thus stack traces) only those
> > allocations that are directly inlined into the consumer are really
> > of any use. That increases the code impact of the tracing because any
> > relevant allocation location has to go through the micro surgery. 
> > 
> > e.g. is it really interesting to know that there is a likely memory
> > leak in seq_file proper doing and allocation? No as it is the specific
> > implementation using seq_file that is leaking most likely. There are
> > other examples like that See?  
> 
> So this is a rather strange usage of "maintenance overhead" :)
> 
> But it's something we thought of. If we had to plumb around a _RET_IP_
> parameter, or a codetag pointer, it would be a hassle annotating the
> correct callsite.
> 
> Instead, alloc_hooks() wraps a memory allocation function and stashes a
> pointer to a codetag in task_struct for use by the core slub/buddy
> allocator code.
> 
> That means that in your example, to move tracking to a given seq_file
> function, we just:
>  - hook the seq_file function with alloc_hooks

Thank you. That's exactly what I was trying to point out. So you hook
seq_buf_alloc(), just to find out it's called from traverse(), which
is not very helpful either. So, you hook traverse(), which sounds quite
generic. Yes, you're lucky, because it is a static function, and the
identifier is not actually used anywhere else (right now), but each
time you want to hook something, you must make sure it does not
conflict with any other identifier in the kernel...

Petr T
Kent Overstreet May 8, 2023, 3:57 p.m. UTC | #63
On Mon, May 08, 2023 at 05:52:06PM +0200, Petr Tesařík wrote:
> On Sun, 7 May 2023 13:20:55 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:
> > > No. I am mostly concerned about the _maintenance_ overhead. For the
> > > bare tracking (without profiling and thus stack traces) only those
> > > allocations that are directly inlined into the consumer are really
> > > of any use. That increases the code impact of the tracing because any
> > > relevant allocation location has to go through the micro surgery. 
> > > 
> > > e.g. is it really interesting to know that there is a likely memory
> > > leak in seq_file proper doing an allocation? No, as it is the specific
> > > implementation using seq_file that is most likely leaking. There are
> > > other examples like that. See?
> > 
> > So this is a rather strange usage of "maintenance overhead" :)
> > 
> > But it's something we thought of. If we had to plumb around a _RET_IP_
> > parameter, or a codetag pointer, it would be a hassle annotating the
> > correct callsite.
> > 
> > Instead, alloc_hooks() wraps a memory allocation function and stashes a
> > pointer to a codetag in task_struct for use by the core slub/buddy
> > allocator code.
> > 
> > That means that in your example, to move tracking to a given seq_file
> > function, we just:
> >  - hook the seq_file function with alloc_hooks
> 
> Thank you. That's exactly what I was trying to point out. So you hook
> seq_buf_alloc(), just to find out it's called from traverse(), which
> is not very helpful either. So, you hook traverse(), which sounds quite
> generic. Yes, you're lucky, because it is a static function, and the
> identifier is not actually used anywhere else (right now), but each
> time you want to hook something, you must make sure it does not
> conflict with any other identifier in the kernel...

Cscope makes quick and easy work of this kind of stuff.
Petr Tesařík May 8, 2023, 4:09 p.m. UTC | #64
On Mon, 8 May 2023 11:57:10 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Mon, May 08, 2023 at 05:52:06PM +0200, Petr Tesařík wrote:
> > On Sun, 7 May 2023 13:20:55 -0400
> > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >   
> > > On Thu, May 04, 2023 at 11:07:22AM +0200, Michal Hocko wrote:  
> > > > No. I am mostly concerned about the _maintenance_ overhead. For the
> > > > bare tracking (without profiling and thus stack traces) only those
> > > > allocations that are directly inlined into the consumer are really
> > > > of any use. That increases the code impact of the tracing because any
> > > > relevant allocation location has to go through the micro surgery. 
> > > > 
> > > > e.g. is it really interesting to know that there is a likely memory
> > > > leak in seq_file proper doing an allocation? No, as it is the specific
> > > > implementation using seq_file that is most likely leaking. There are
> > > > other examples like that. See?
> > > 
> > > So this is a rather strange usage of "maintenance overhead" :)
> > > 
> > > But it's something we thought of. If we had to plumb around a _RET_IP_
> > > parameter, or a codetag pointer, it would be a hassle annotating the
> > > correct callsite.
> > > 
> > > Instead, alloc_hooks() wraps a memory allocation function and stashes a
> > > pointer to a codetag in task_struct for use by the core slub/buddy
> > > allocator code.
> > > 
> > > That means that in your example, to move tracking to a given seq_file
> > > function, we just:
> > >  - hook the seq_file function with alloc_hooks  
> > 
> > Thank you. That's exactly what I was trying to point out. So you hook
> > seq_buf_alloc(), just to find out it's called from traverse(), which
> > is not very helpful either. So, you hook traverse(), which sounds quite
> > generic. Yes, you're lucky, because it is a static function, and the
> > identifier is not actually used anywhere else (right now), but each
> > time you want to hook something, you must make sure it does not
> > conflict with any other identifier in the kernel...  
> 
> Cscope makes quick and easy work of this kind of stuff.

Sure, although AFAIK the index does not cover all possible config
options (so non-x86 arch code is often forgotten). However, that's the
less important part.

What do you do if you need to hook something that does conflict with an
existing identifier?

Petr T
Kent Overstreet May 8, 2023, 4:28 p.m. UTC | #65
On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote:
> Sure, although AFAIK the index does not cover all possible config
> options (so non-x86 arch code is often forgotten). However, that's the
> less important part.
> 
> What do you do if you need to hook something that does conflict with an
> existing identifier?

As already happens in this patchset, rename the other identifier.

But this is C, we avoid these kinds of conflicts already because the
language has no namespacing - it's going to be a pretty rare situation
going forward. Most of the hooking that will be done is done with this
patchset, and there was only one identifier that needed to be renamed.
Petr Tesařík May 8, 2023, 6:59 p.m. UTC | #66
On Mon, 8 May 2023 12:28:52 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote:
> > Sure, although AFAIK the index does not cover all possible config
> > options (so non-x86 arch code is often forgotten). However, that's the
> > less important part.
> > 
> > What do you do if you need to hook something that does conflict with an
> > existing identifier?  
> 
> As already happens in this patchset, rename the other identifier.
> 
> But this is C, we avoid these kinds of conflicts already because the
> language has no namespacing

This statement is not accurate, but I agree there's not much. Refer to
section 6.2.3 of ISO/IEC 9899:2018 (Name spaces of identifiers).

More importantly, macros also interfere with identifier scoping, e.g.
you cannot even have a local variable with the same name as a macro.
That's why I dislike macros so much.

But since there's no clear policy regarding macros in the kernel, I'm
merely showing a downside; it's perfectly fine to write kernel code
like this as long as the maintainers agree that the limitation is
acceptable and outweighed by the benefits.

Petr T

> it's going to be a pretty rare situation
> going forward. Most of the hooking that will be done is done with this
> patchset, and there was only one identifier that needed to be renamed.
>
Kent Overstreet May 8, 2023, 8:48 p.m. UTC | #67
On Mon, May 08, 2023 at 08:59:39PM +0200, Petr Tesařík wrote:
> On Mon, 8 May 2023 12:28:52 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > On Mon, May 08, 2023 at 06:09:13PM +0200, Petr Tesařík wrote:
> > > Sure, although AFAIK the index does not cover all possible config
> > > options (so non-x86 arch code is often forgotten). However, that's the
> > > less important part.
> > > 
> > > What do you do if you need to hook something that does conflict with an
> > > existing identifier?  
> > 
> > As already happens in this patchset, rename the other identifier.
> > 
> > But this is C, we avoid these kinds of conflicts already because the
> > language has no namespacing
> 
> This statement is not accurate, though I agree there's not much
> namespacing to speak of. Refer to section 6.2.3 of ISO/IEC 9899:2018
> (Name spaces of identifiers).
> 
> More importantly, macros also interfere with identifier scoping, e.g.
> you cannot even have a local variable with the same name as a macro.
> That's why I dislike macros so much.

Shadowing a global identifier like that would at best be considered poor
style, so I don't see this as a major downside.

> But since there's no clear policy regarding macros in the kernel, I'm
> merely showing a downside; it's perfectly fine to write kernel code
> like this as long as the maintainers agree that the limitation is
> acceptable and outweighed by the benefits.

Macros do have lots of tricky downsides, but in general we're not shy
about using them for things that can't be done otherwise; see
wait_event(), all of tracing...

I think we could in general do a better job of making the macros
_themselves_ more manageable: when writing things that need to be
macros, I'll often have just the wrapper as a macro and write the bulk
as inline functions.
See the generic radix tree code for example.

Reflection is a major use case for macros, and the underlying mechanism
here - code tagging - is something worth talking about more, since it's
codifying something that's been done ad-hoc in the kernel for a long
time and something we hope to refactor other existing code to use,
including tracing - I've got a patch already written to convert the
dynamic debug code to code tagging; it's a nice -200 loc cleanup.

Regarding the alloc_hooks() macro itself specifically, I've got more
plans for it. I have another patch series after this one that implements
code tagging based fault injection, which is _far_ more ergonomic to use
than our existing fault injection capabilities (and this matters! Fault
injection is a really important tool for getting good test coverage, but
tools that are a pain in the ass to use don't get used) - and with the
alloc_hooks() macro already in place, we'll be able to turn _every
individual memory allocation callsite_ into a distinct, individually
selectable fault injection point - which is something our existing fault
injection framework attempts but doesn't really manage.

If we can get this in, it'll make it really easy to write unit tests
that iterate over every memory allocation site in a given module,
individually telling them to fail, run a workload until they hit, and
verify that the code path being tested was executed. It'll nicely
complement the fuzz testing capabilities that we've been working on,
particularly in filesystem land.