
[RFC,00/30] Code tagging framework and applications

Message ID: 20220830214919.53220-1-surenb@google.com

Message

Suren Baghdasaryan Aug. 30, 2022, 9:48 p.m. UTC
===========================
Code tagging framework
===========================
Code tag is a structure identifying a specific location in the source code
which is generated at compile time and can be embedded in an application-
specific structure. Several applications of code tagging are included in
this RFC, such as memory allocation tracking, dynamic fault injection,
latency tracking and improved error code reporting.
Basically, it takes the old trick of "define a special elf section for
objects of a given type so that we can iterate over them at runtime" and
creates a proper library for it.
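
For readers new to the trick, here is a minimal sketch of the general
technique; the identifiers are illustrative and are not the interface
added by this series:

  /*
   * Illustrative sketch only -- not the interface from these patches.
   * One statically initialized object per callsite (inside a function,
   * so __func__ is available) is emitted into a dedicated ELF section;
   * the linker provides __start_/__stop_ symbols for any section whose
   * name is a valid C identifier, so the objects can be iterated at
   * runtime.
   */
  struct codetag_example {
          const char      *filename;
          const char      *function;
          unsigned int    lineno;
  };

  #define DEFINE_CODETAG_EXAMPLE(var)                                     \
          static struct codetag_example var                               \
          __attribute__((__used__, __section__("codetag_example"))) = {   \
                  .filename = __FILE__,                                   \
                  .function = __func__,                                   \
                  .lineno   = __LINE__,                                   \
          }

  extern struct codetag_example __start_codetag_example[];
  extern struct codetag_example __stop_codetag_example[];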

===========================
Memory allocation tracking
===========================
The goal for using codetags for memory allocation tracking is to minimize
performance and memory overhead. By recording only the call count and
allocation size, the required operations are kept to a minimum while
collecting statistics for every allocation in the codebase. With that
information, if users are interested in more detailed context for a
specific allocation, they can enable more in-depth context tracking,
which includes capturing the pid, tgid, task name, allocation size,
timestamp and call stack for every allocation at the specified code
location.
Memory allocation tracking is implemented in two parts:

part1: instruments page and slab allocators to record call count and total
memory allocated at every allocation in the source code. Every time an
allocation is performed by an instrumented allocator, the codetag at that
location increments its call and size counters. Every time the memory is
freed, these counters are decremented. To decrement the counters upon free,
the allocated object needs a reference to its codetag. Page allocators use
page_ext to record this reference while slab allocators use memcg_data of
the slab page.
The data is exposed to user space via a read-only debugfs file called
alloc_tags.
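
Conceptually, the per-callsite bookkeeping amounts to something like the
sketch below; the names are illustrative and plain atomics stand in for
the lazy percpu counters the series actually uses:

  #include <linux/atomic.h>

  /* Rough sketch of the accounting described above, not the patch's API. */
  struct alloc_tag_example {
          struct codetag_example  ct;     /* file/function/line of the callsite */
          atomic_long_t           calls;  /* outstanding allocations */
          atomic_long_t           bytes;  /* outstanding allocated bytes */
  };

  /* Called from the instrumented allocator; a reference to the tag is
   * stashed in page_ext (page allocator) or the slab page's memcg_data
   * (slab allocators) so the matching free can find it again. */
  static inline void alloc_tag_add_example(struct alloc_tag_example *tag,
                                           size_t bytes)
  {
          atomic_long_inc(&tag->calls);
          atomic_long_add(bytes, &tag->bytes);
  }

  static inline void alloc_tag_sub_example(struct alloc_tag_example *tag,
                                           size_t bytes)
  {
          atomic_long_dec(&tag->calls);
          atomic_long_sub(bytes, &tag->bytes);
  }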

Usage example:

$ sort -hr /sys/kernel/debug/alloc_tags|head
  153MiB     8599 mm/slub.c:1826 module:slub func:alloc_slab_page
 6.08MiB      49 mm/slab_common.c:950 module:slab_common func:_kmalloc_order
 5.09MiB     6335 mm/memcontrol.c:2814 module:memcontrol func:alloc_slab_obj_exts
 4.54MiB      78 mm/page_alloc.c:5777 module:page_alloc func:alloc_pages_exact
 1.32MiB      338 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
 1.16MiB      603 fs/xfs/xfs_log_priv.h:700 module:xfs func:xlog_kvmalloc
 1.00MiB      256 mm/swap_cgroup.c:48 module:swap_cgroup func:swap_cgroup_prepare
  734KiB     5380 fs/xfs/kmem.c:20 module:xfs func:kmem_alloc
  640KiB      160 kernel/rcu/tree.c:3184 module:tree func:fill_page_cache_func
  640KiB      160 drivers/char/virtio_console.c:452 module:virtio_console func:alloc_buf

part2: adds support for the user to select a specific code location to capture
allocation context. A new debugfs file called alloc_tags.ctx is used to select
which code location should capture allocation context and to read captured
context information.

Usage example:

$ cd /sys/kernel/debug/
$ echo "file include/asm-generic/pgalloc.h line 63 enable" > alloc_tags.ctx
$ cat alloc_tags.ctx
  920KiB      230 include/asm-generic/pgalloc.h:63 module:pgtable func:__pte_alloc_one
    size: 4096
    pid: 1474
    tgid: 1474
    comm: bash
    ts: 175332940994
    call stack:
         pte_alloc_one+0xfe/0x130
         __pte_alloc+0x22/0xb0
         copy_page_range+0x842/0x1640
         dup_mm+0x42d/0x580
         copy_process+0xfb1/0x1ac0
         kernel_clone+0x92/0x3e0
         __do_sys_clone+0x66/0x90
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
...

NOTE: slab allocation tracking is not yet stable and has a leak that
shows up in long-running tests. We are working on fixing it and posting
the RFC early to collect some feedback and to have reference code in
public before presenting the idea at LPC2022.

===========================
Dynamic fault injection
===========================
Dynamic fault injection lets you do fault injection with a single call
to dynamic_fault(), with a debugfs interface similar to dynamic_debug.

Calls to dynamic_fault are listed in debugfs and can be enabled at
runtime (oneshot mode or a defined frequency are also available). This
patch also uses the memory allocation wrapper macros introduced by the
memory allocation tracking patches to add distinct fault injection
points for every memory allocation in the kernel.
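
A hedged sketch of what a callsite might look like, assuming dynamic_fault()
takes a class string and returns true when the (debugfs-enabled) fault should
fire at that point; the exact macro in the patch may differ:

  #include <linux/slab.h>
  #include <linux/dynamic_fault.h>        /* new header added by this series */

  /* Illustrative only: the macro records file/line/module/class in a
   * codetag, so every callsite shows up individually in debugfs. */
  static int example_init(void)
  {
          void *buf;

          if (dynamic_fault("memory"))    /* pretend this allocation failed */
                  return -ENOMEM;

          buf = kmalloc(128, GFP_KERNEL);
          if (!buf)
                  return -ENOMEM;

          kfree(buf);
          return 0;
  }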

Example fault injection points, after hooking memory allocation paths:

  fs/xfs/libxfs/xfs_iext_tree.c:606 module:xfs func:xfs_iext_realloc_root class:memory disabled "
  fs/xfs/libxfs/xfs_inode_fork.c:503 module:xfs func:xfs_idata_realloc class:memory disabled "
  fs/xfs/libxfs/xfs_inode_fork.c:399 module:xfs func:xfs_iroot_realloc class:memory disabled "
  fs/xfs/xfs_buf.c:373 module:xfs func:xfs_buf_alloc_pages class:memory disabled "
  fs/xfs/xfs_iops.c:497 module:xfs func:xfs_vn_get_link class:memory disabled "
  fs/xfs/xfs_mount.c:85 module:xfs func:xfs_uuid_mount class:memory disabled "

===========================
Latency tracking
===========================
This lets you instrument code for measuring latency with just two calls
to code_tag_time_stats_start() and code_tag_time_stats_finish(), and
makes statistics available in debugfs on a per-callsite basis.

Recorded statistics include total count, frequency/rate, average
duration, max duration, and event duration quantiles.

Additionally, this patch instruments prepare_to_wait() and finish_wait().
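
A sketch of what manual instrumentation might look like. The argument
handling is an assumption (the start macro is assumed to set up a local
timestamp), my_wq/done are placeholders, and wait-related callers get
this for free through the instrumented prepare_to_wait()/finish_wait():

  #include <linux/wait.h>
  #include <linux/codetag_time_stats.h>   /* added by this series (see diffstat) */

  /* Hypothetical callsite; exact macro signatures in the patch may differ. */
  static void wait_for_thing(wait_queue_head_t *my_wq, bool *done)
  {
          code_tag_time_stats_start(start_time);  /* record a start timestamp */

          wait_event(*my_wq, *done);              /* the section being measured */

          /* Updates count/rate/avg/max/quantiles for this callsite,
           * exposed per-callsite in debugfs as shown below. */
          code_tag_time_stats_finish(start_time);
  }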

Example output:

  fs/xfs/xfs_extent_busy.c:589 module:xfs func:xfs_extent_busy_flush
  count:          61
  rate:           0/sec
  frequency:    19 sec
  avg duration:   632 us
  max duration:   2 ms
  quantiles (us): 274 288 288 296 296 296 296 336 336 336 336 336 336 336 336

===========================
Improved error codes
===========================
Ever waste hours trying to figure out which line of code from some
obscure module is returning you -EINVAL and nothing else?

What if we had... more error codes?

This patch adds ERR(), which returns a unique error code that is related
to the error code passed to it: the original error code can be
recovered with error_class(), and errname() (as well as %pE) returns an
error string that includes the file and line number of the ERR() call.
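
A sketch of the intended usage based on the description above; ERR() and
error_class() are the new helpers from this series (their exact location
and signatures may differ), parse_option()/use_option() are placeholders,
and errname() already lives in lib/errname.c:

  #include <linux/printk.h>
  #include <linux/errname.h>

  static int parse_option(const char *opt)
  {
          if (!opt)
                  return ERR(-EINVAL);    /* unique code tied to this file:line */
          return 0;
  }

  static void use_option(const char *opt)
  {
          int ret = parse_option(opt);

          /* Callers still compare against the original errno class ... */
          if (error_class(ret) == -EINVAL)
                  pr_warn("bad option: %s\n", errname(ret));
          /* ... and errname()/%pE name the ERR() callsite, as in the
           * example output below. */
  }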

Example output:

  VFS: Cannot open root device "sda" or unknown-block(8,0): error -EINVAL at fs/ext4/super.c:4387

===========================
Dynamic debug conversion to code tagging
===========================
There are several open coded implementations of the "define a special elf
section for objects and iterate" technique that should be converted to
code tagging. This series just converts dynamic debug; there are others
(multiple in ftrace, in particular) that should also be converted.
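
For reference, the open-coded pattern in dynamic debug that this series
replaces looks roughly like the following (simplified; the section symbols
exist in lib/dynamic_debug.c today, walk_dyndbg_descriptors() is a
placeholder):

  #include <linux/dynamic_debug.h>

  extern struct _ddebug __start___dyndbg[];
  extern struct _ddebug __stop___dyndbg[];

  static void walk_dyndbg_descriptors(void)
  {
          struct _ddebug *dp;

          /* Descriptors accumulate in the __dyndbg section; the init code
           * walks the linker-provided bounds and groups them by module --
           * exactly the boilerplate the codetag library factors out. */
          for (dp = __start___dyndbg; dp < __stop___dyndbg; dp++)
                  ; /* match file/func/line queries, toggle sites, ... */
  }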

===========================

The patchset applies cleanly on top of Linux 6.0-rc3.
The tree for testing is published at:
https://github.com/surenbaghdasaryan/linux/tree/alloc_tags_rfc

The structure of the patchset is:
- code tagging framework (patches 1-6)
- page allocation tracking (patches 7-10)
- slab allocation tracking (patches 11-16)
- allocation context capture (patches 17-21)
- dynamic fault injection (patch 22)
- latency tracking (patches 23-27)
- improved error codes (patch 28)
- dynamic debug conversion to code tagging (patch 29)
- MAINTAINERS update (patch 30)

Next steps:
- track and fix slab allocator leak mentioned earlier;
- instrument more allocators: vmalloc, per-cpu allocations, others?


Kent Overstreet (14):
  lib/string_helpers: Drop space in string_get_size's output
  Lazy percpu counters
  scripts/kallysms: Always include __start and __stop symbols
  lib/string.c: strsep_no_empty()
  codetag: add codetag query helper functions
  Code tagging based fault injection
  timekeeping: Add a missing include
  wait: Clean up waitqueue_entry initialization
  lib/time_stats: New library for statistics on events
  bcache: Convert to lib/time_stats
  Code tagging based latency tracking
  Improved symbolic error names
  dyndbg: Convert to code tagging
  MAINTAINERS: Add entries for code tagging & related

Suren Baghdasaryan (16):
  kernel/module: move find_kallsyms_symbol_value declaration
  lib: code tagging framework
  lib: code tagging module support
  lib: add support for allocation tagging
  lib: introduce page allocation tagging
  change alloc_pages name in dma_map_ops to avoid name conflicts
  mm: enable page allocation tagging for __get_free_pages and
    alloc_pages
  mm: introduce slabobj_ext to support slab object extensions
  mm: introduce __GFP_NO_OBJ_EXT flag to selectively prevent slabobj_ext
    creation
  mm/slab: introduce SLAB_NO_OBJ_EXT to avoid obj_ext creation
  mm: prevent slabobj_ext allocations for slabobj_ext and kmem_cache
    objects
  lib: introduce slab allocation tagging
  mm: enable slab allocation tagging for kmalloc and friends
  move stack capture functionality into a separate function for reuse
  lib: introduce support for storing code tag context
  lib: implement context capture support for page and slab allocators

 MAINTAINERS                         |  34 ++
 arch/x86/kernel/amd_gart_64.c       |   2 +-
 drivers/iommu/dma-iommu.c           |   2 +-
 drivers/md/bcache/Kconfig           |   1 +
 drivers/md/bcache/bcache.h          |   1 +
 drivers/md/bcache/bset.c            |   8 +-
 drivers/md/bcache/bset.h            |   1 +
 drivers/md/bcache/btree.c           |  12 +-
 drivers/md/bcache/super.c           |   3 +
 drivers/md/bcache/sysfs.c           |  43 ++-
 drivers/md/bcache/util.c            |  30 --
 drivers/md/bcache/util.h            |  57 ---
 drivers/xen/grant-dma-ops.c         |   2 +-
 drivers/xen/swiotlb-xen.c           |   2 +-
 include/asm-generic/codetag.lds.h   |  18 +
 include/asm-generic/vmlinux.lds.h   |   8 +-
 include/linux/alloc_tag.h           |  84 +++++
 include/linux/codetag.h             | 159 +++++++++
 include/linux/codetag_ctx.h         |  48 +++
 include/linux/codetag_time_stats.h  |  54 +++
 include/linux/dma-map-ops.h         |   2 +-
 include/linux/dynamic_debug.h       |  11 +-
 include/linux/dynamic_fault.h       |  79 +++++
 include/linux/err.h                 |   2 +-
 include/linux/errname.h             |  50 +++
 include/linux/gfp.h                 |  10 +-
 include/linux/gfp_types.h           |  12 +-
 include/linux/io_uring_types.h      |   2 +-
 include/linux/lazy-percpu-counter.h |  67 ++++
 include/linux/memcontrol.h          |  23 +-
 include/linux/module.h              |   1 +
 include/linux/page_ext.h            |   3 +-
 include/linux/pgalloc_tag.h         |  63 ++++
 include/linux/sbitmap.h             |   6 +-
 include/linux/sched.h               |   6 +-
 include/linux/slab.h                | 136 +++++---
 include/linux/slab_def.h            |   2 +-
 include/linux/slub_def.h            |   4 +-
 include/linux/stackdepot.h          |   3 +
 include/linux/string.h              |   1 +
 include/linux/time_stats.h          |  44 +++
 include/linux/timekeeping.h         |   1 +
 include/linux/wait.h                |  72 ++--
 include/linux/wait_bit.h            |   7 +-
 init/Kconfig                        |   5 +
 kernel/dma/mapping.c                |   4 +-
 kernel/module/internal.h            |   3 -
 kernel/module/main.c                |  27 +-
 kernel/sched/wait.c                 |  15 +-
 lib/Kconfig                         |   6 +
 lib/Kconfig.debug                   |  46 +++
 lib/Makefile                        |  10 +
 lib/alloc_tag.c                     | 391 +++++++++++++++++++++
 lib/codetag.c                       | 519 ++++++++++++++++++++++++++++
 lib/codetag_time_stats.c            | 143 ++++++++
 lib/dynamic_debug.c                 | 452 +++++++++---------------
 lib/dynamic_fault.c                 | 372 ++++++++++++++++++++
 lib/errname.c                       | 103 ++++++
 lib/lazy-percpu-counter.c           | 141 ++++++++
 lib/pgalloc_tag.c                   |  22 ++
 lib/stackdepot.c                    |  68 ++++
 lib/string.c                        |  19 +
 lib/string_helpers.c                |   3 +-
 lib/time_stats.c                    | 236 +++++++++++++
 mm/kfence/core.c                    |   2 +-
 mm/memcontrol.c                     |  62 ++--
 mm/mempolicy.c                      |   4 +-
 mm/page_alloc.c                     |  13 +-
 mm/page_ext.c                       |   6 +
 mm/page_owner.c                     |  54 +--
 mm/slab.c                           |   4 +-
 mm/slab.h                           | 125 ++++---
 mm/slab_common.c                    |  49 ++-
 mm/slob.c                           |   2 +
 mm/slub.c                           |   7 +-
 scripts/kallsyms.c                  |  13 +
 scripts/module.lds.S                |   7 +
 77 files changed, 3406 insertions(+), 703 deletions(-)
 create mode 100644 include/asm-generic/codetag.lds.h
 create mode 100644 include/linux/alloc_tag.h
 create mode 100644 include/linux/codetag.h
 create mode 100644 include/linux/codetag_ctx.h
 create mode 100644 include/linux/codetag_time_stats.h
 create mode 100644 include/linux/dynamic_fault.h
 create mode 100644 include/linux/lazy-percpu-counter.h
 create mode 100644 include/linux/pgalloc_tag.h
 create mode 100644 include/linux/time_stats.h
 create mode 100644 lib/alloc_tag.c
 create mode 100644 lib/codetag.c
 create mode 100644 lib/codetag_time_stats.c
 create mode 100644 lib/dynamic_fault.c
 create mode 100644 lib/lazy-percpu-counter.c
 create mode 100644 lib/pgalloc_tag.c
 create mode 100644 lib/time_stats.c

Comments

Peter Zijlstra Aug. 31, 2022, 7:38 a.m. UTC | #1
On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> ===========================
> Code tagging framework
> ===========================
> Code tag is a structure identifying a specific location in the source code
> which is generated at compile time and can be embedded in an application-
> specific structure. Several applications of code tagging are included in
> this RFC, such as memory allocation tracking, dynamic fault injection,
> latency tracking and improved error code reporting.
> Basically, it takes the old trick of "define a special elf section for
> objects of a given type so that we can iterate over them at runtime" and
> creates a proper library for it.

I might be super dense this morning, but what!? I've skimmed through the
set and I don't think I get it.

What does this provide that ftrace/kprobes don't already allow?
Kent Overstreet Aug. 31, 2022, 8:42 a.m. UTC | #2
On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > ===========================
> > Code tagging framework
> > ===========================
> > Code tag is a structure identifying a specific location in the source code
> > which is generated at compile time and can be embedded in an application-
> > specific structure. Several applications of code tagging are included in
> > this RFC, such as memory allocation tracking, dynamic fault injection,
> > latency tracking and improved error code reporting.
> > Basically, it takes the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
> 
> I might be super dense this morning, but what!? I've skimmed through the
> set and I don't think I get it.
> 
> What does this provide that ftrace/kprobes don't already allow?

You're kidding, right?
Mel Gorman Aug. 31, 2022, 10:19 a.m. UTC | #3
On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > ===========================
> > > Code tagging framework
> > > ===========================
> > > Code tag is a structure identifying a specific location in the source code
> > > which is generated at compile time and can be embedded in an application-
> > > specific structure. Several applications of code tagging are included in
> > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > latency tracking and improved error code reporting.
> > > Basically, it takes the old trick of "define a special elf section for
> > > objects of a given type so that we can iterate over them at runtime" and
> > > creates a proper library for it.
> > 
> > I might be super dense this morning, but what!? I've skimmed through the
> > set and I don't think I get it.
> > 
> > What does this provide that ftrace/kprobes don't already allow?
> 
> You're kidding, right?

It's a valid question. From the description, its main addition that would
be hard to do with ftrace or probes is catching where an error code is
returned. A secondary addition would be catching all historical state and
not just state since the tracing started.

It's also unclear *who* would enable this. It looks like it would mostly
have value during the development stage of an embedded platform to track
kernel memory usage on a per-application basis in an environment where it
may be difficult to set up tracing and tracking. Would it ever be enabled
in production? Would a distribution ever enable this? If it's enabled, any
overhead cannot be disabled/enabled at run or boot time, so anyone enabling
this would carry the cost without ever necessarily consuming the data.

It might be an ease-of-use thing. Gathering the information from traces
is tricky and would need combining multiple different elements and that
is development effort but not impossible.

Whatever the case, asking for an explanation as to why equivalent functionality
cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
Michal Hocko Aug. 31, 2022, 10:47 a.m. UTC | #4
On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > ===========================
> > > > Code tagging framework
> > > > ===========================
> > > > Code tag is a structure identifying a specific location in the source code
> > > > which is generated at compile time and can be embedded in an application-
> > > > specific structure. Several applications of code tagging are included in
> > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > latency tracking and improved error code reporting.
> > > > Basically, it takes the old trick of "define a special elf section for
> > > > objects of a given type so that we can iterate over them at runtime" and
> > > > creates a proper library for it.
> > > 
> > > I might be super dense this morning, but what!? I've skimmed through the
> > > set and I don't think I get it.
> > > 
> > > What does this provide that ftrace/kprobes don't already allow?
> > 
> > You're kidding, right?
> 
> It's a valid question. From the description, it main addition that would
> be hard to do with ftrace or probes is catching where an error code is
> returned. A secondary addition would be catching all historical state and
> not just state since the tracing started.
> 
> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to setup tracing and tracking. Would it ever be enabled
> in production? Would a distribution ever enable this? If it's enabled, any
> overhead cannot be disabled/enabled at run or boot time so anyone enabling
> this would carry the cost without never necessarily consuming the data.
> 
> It might be an ease-of-use thing. Gathering the information from traces
> is tricky and would need combining multiple different elements and that
> is development effort but not impossible.
> 
> Whatever asking for an explanation as to why equivalent functionality
> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.

Fully agreed and this is especially true for a change this size
77 files changed, 3406 insertions(+), 703 deletions(-)
Suren Baghdasaryan Aug. 31, 2022, 3:28 p.m. UTC | #5
On Wed, Aug 31, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > > ===========================
> > > > > Code tagging framework
> > > > > ===========================
> > > > > Code tag is a structure identifying a specific location in the source code
> > > > > which is generated at compile time and can be embedded in an application-
> > > > > specific structure. Several applications of code tagging are included in
> > > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > > latency tracking and improved error code reporting.
> > > > > Basically, it takes the old trick of "define a special elf section for
> > > > > objects of a given type so that we can iterate over them at runtime" and
> > > > > creates a proper library for it.
> > > >
> > > > I might be super dense this morning, but what!? I've skimmed through the
> > > > set and I don't think I get it.
> > > >
> > > > What does this provide that ftrace/kprobes don't already allow?
> > >
> > > You're kidding, right?
> >
> > It's a valid question. From the description, it main addition that would
> > be hard to do with ftrace or probes is catching where an error code is
> > returned. A secondary addition would be catching all historical state and
> > not just state since the tracing started.
> >
> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to setup tracing and tracking. Would it ever be enabled
> > in production? Would a distribution ever enable this? If it's enabled, any
> > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > this would carry the cost without never necessarily consuming the data.

Thank you for the question.
For memory tracking my intent is to have a mechanism that can be enabled
during field testing (pre-production testing on a large population of
internal users).
The issue we are often facing is that memory leaks happen in the field but
are very hard to reproduce locally. We get a bug report from the user which
indicates the leak but often does not have enough information to track it
down. Note that quite often these leaks/issues happen in the drivers, so
even simply finding out where they came from is a big help.
The way I envision this mechanism being used is to enable the basic memory
tracking in the field tests and have a user-space process collect the
allocation statistics periodically (say once an hour). Once it detects some
counter growing indefinitely or atypically (the definition of this is left
to user space) it can enable context capturing only for that specific
location, still keeping the overhead to a minimum while getting more
information about potential issues. Collected stats and contexts are then
attached to the bug report and we get more visibility into the issue when
we receive it.
The goal is to provide a mechanism with low enough overhead that it can be
enabled all the time during these field tests without affecting the
device's performance profile.
Tracing is very cheap when it's disabled, but having it enabled all the
time would introduce higher overhead than the counter manipulations.
My apologies, I should have clarified all this in the cover letter from
the beginning.

As for other applications, maybe I'm not such an advanced user of tracing,
but I think only the latency tracking application might be done with
tracing, assuming we have all the right tracepoints, and I don't see how we
would use tracing for fault injection and descriptive error codes. Again,
I might be mistaken.

Thanks,
Suren.

> >
> > It might be an ease-of-use thing. Gathering the information from traces
> > is tricky and would need combining multiple different elements and that
> > is development effort but not impossible.
> >
> > Whatever asking for an explanation as to why equivalent functionality
> > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
>
> Fully agreed and this is especially true for a change this size
> 77 files changed, 3406 insertions(+), 703 deletions(-)
>
> --
> Michal Hocko
> SUSE Labs
Kent Overstreet Aug. 31, 2022, 3:59 p.m. UTC | #6
On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > ===========================
> > > > Code tagging framework
> > > > ===========================
> > > > Code tag is a structure identifying a specific location in the source code
> > > > which is generated at compile time and can be embedded in an application-
> > > > specific structure. Several applications of code tagging are included in
> > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > latency tracking and improved error code reporting.
> > > > Basically, it takes the old trick of "define a special elf section for
> > > > objects of a given type so that we can iterate over them at runtime" and
> > > > creates a proper library for it.
> > > 
> > > I might be super dense this morning, but what!? I've skimmed through the
> > > set and I don't think I get it.
> > > 
> > > What does this provide that ftrace/kprobes don't already allow?
> > 
> > You're kidding, right?
> 
> It's a valid question. From the description, it main addition that would
> be hard to do with ftrace or probes is catching where an error code is
> returned. A secondary addition would be catching all historical state and
> not just state since the tracing started.

Catching all historical state is pretty important in the case of memory
allocation accounting, don't you think?

Also, ftrace can drop events. Not really ideal if under system load your memory
accounting numbers start to drift.

> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to setup tracing and tracking. Would it ever be enabled
> in production? Would a distribution ever enable this? If it's enabled, any
> overhead cannot be disabled/enabled at run or boot time so anyone enabling
> this would carry the cost without never necessarily consuming the data.

The whole point of this is to be cheap enough to enable in production -
especially the latency tracing infrastructure. There's a lot of value to
always-on system visibility infrastructure, so that when a live machine starts
to do something wonky the data is already there.

What we've built here is _far_ cheaper than anything that could be done
with ftrace.

> It might be an ease-of-use thing. Gathering the information from traces
> is tricky and would need combining multiple different elements and that
> is development effort but not impossible.
> 
> Whatever asking for an explanation as to why equivalent functionality
> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.

I think perhaps some of the expectation should be on the "ftrace for
everything!" people to explain a: how their alternative could be even built and
b: how it would compare in terms of performance and ease of use.

Look, I've been a tracing user for many years, and it has its uses, but some of
the claims I've been hearing from tracing/bpf people when any alternative
tooling is proposed sound like vaporware and bullshitting.
Suren Baghdasaryan Aug. 31, 2022, 4:48 p.m. UTC | #7
On Wed, Aug 31, 2022 at 8:28 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Aug 31, 2022 at 3:47 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > > > ===========================
> > > > > > Code tagging framework
> > > > > > ===========================
> > > > > > Code tag is a structure identifying a specific location in the source code
> > > > > > which is generated at compile time and can be embedded in an application-
> > > > > > specific structure. Several applications of code tagging are included in
> > > > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > > > latency tracking and improved error code reporting.
> > > > > > Basically, it takes the old trick of "define a special elf section for
> > > > > > objects of a given type so that we can iterate over them at runtime" and
> > > > > > creates a proper library for it.
> > > > >
> > > > > I might be super dense this morning, but what!? I've skimmed through the
> > > > > set and I don't think I get it.
> > > > >
> > > > > What does this provide that ftrace/kprobes don't already allow?
> > > >
> > > > You're kidding, right?
> > >
> > > It's a valid question. From the description, it main addition that would
> > > be hard to do with ftrace or probes is catching where an error code is
> > > returned. A secondary addition would be catching all historical state and
> > > not just state since the tracing started.
> > >
> > > It's also unclear *who* would enable this. It looks like it would mostly
> > > have value during the development stage of an embedded platform to track
> > > kernel memory usage on a per-application basis in an environment where it
> > > may be difficult to setup tracing and tracking. Would it ever be enabled
> > > in production? Would a distribution ever enable this? If it's enabled, any
> > > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > > this would carry the cost without never necessarily consuming the data.
>
> Thank you for the question.
> For memory tracking my intent is to have a mechanism that can be enabled in
> the field testing (pre-production testing on a large population of
> internal users).
> The issue that we are often facing is when some memory leaks are happening
> in the field but very hard to reproduce locally. We get a bugreport
> from the user
> which indicates it but often has not enough information to track it. Note that
> quite often these leaks/issues happen in the drivers, so even simply finding out
> where they came from is a big help.
> The way I envision this mechanism to be used is to enable the basic memory
> tracking in the field tests and have a user space process collecting
> the allocation
> statistics periodically (say once an hour). Once it detects some counter growing
> infinitely or atypically (the definition of this is left to the user
> space) it can enable
> context capturing only for that specific location, still keeping the
> overhead to the
> minimum but getting more information about potential issues. Collected stats and
> contexts are then attached to the bugreport and we get more visibility
> into the issue
> when we receive it.
> The goal is to provide a mechanism with low enough overhead that it
> can be enabled
> all the time during these field tests without affecting the device's
> performance profiles.
> Tracing is very cheap when it's disabled but having it enabled all the
> time would
> introduce higher overhead than the counter manipulations.
> My apologies, I should have clarified all this in this cover letter
> from the beginning.
>
> As for other applications, maybe I'm not such an advanced user of
> tracing but I think only
> the latency tracking application might be done with tracing, assuming
> we have all the
> right tracepoints but I don't see how we would use tracing for fault
> injections and
> descriptive error codes. Again, I might be mistaken.

Sorry about the formatting of my reply. Forgot to reconfigure the editor on
the new machine.

>
> Thanks,
> Suren.
>
> > >
> > > It might be an ease-of-use thing. Gathering the information from traces
> > > is tricky and would need combining multiple different elements and that
> > > is development effort but not impossible.
> > >
> > > Whatever asking for an explanation as to why equivalent functionality
> > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >
> > Fully agreed and this is especially true for a change this size
> > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > --
> > Michal Hocko
> > SUSE Labs
Kent Overstreet Aug. 31, 2022, 7:01 p.m. UTC | #8
On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > Whatever asking for an explanation as to why equivalent functionality
> > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> 
> Fully agreed and this is especially true for a change this size
> 77 files changed, 3406 insertions(+), 703 deletions(-)

In the case of memory allocation accounting, you flat cannot do this with ftrace
- you could maybe do a janky version that isn't fully accurate, much slower,
more complicated for the developer to understand and debug and more complicated
for the end user.

But please, I invite anyone who's actually been doing this with ftrace to
demonstrate otherwise.

Ftrace just isn't the right tool for the job here - we're talking about adding
per callsite accounting to some of the fastest fast paths in the kernel.

And the size of the changes for memory allocation accounting are much more
reasonable:
 33 files changed, 623 insertions(+), 99 deletions(-)

The code tagging library should exist anyways, it's been open coded half a dozen
times in the kernel already.

And once we've got that, the time stats code is _also_ far simpler than doing it
with ftrace would be. If anyone here has successfully debugged latency issues
with ftrace, I'd really like to hear it. Again, for debugging latency issues you
want something that can always be on, and that's not cheap with ftrace - and
never mind the hassle of correlating start and end wait trace events, building
up histograms, etc. - that's all handled here.

Cheap, simple, easy to use. What more could you want?
Yosry Ahmed Aug. 31, 2022, 8:56 p.m. UTC | #9
On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > Whatever asking for an explanation as to why equivalent functionality
> > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >
> > Fully agreed and this is especially true for a change this size
> > 77 files changed, 3406 insertions(+), 703 deletions(-)
>
> In the case of memory allocation accounting, you flat cannot do this with ftrace
> - you could maybe do a janky version that isn't fully accurate, much slower,
> more complicated for the developer to understand and debug and more complicated
> for the end user.
>
> But please, I invite anyone who's actually been doing this with ftrace to
> demonstrate otherwise.
>
> Ftrace just isn't the right tool for the job here - we're talking about adding
> per callsite accounting to some of the fastest fast paths in the kernel.
>
> And the size of the changes for memory allocation accounting are much more
> reasonable:
>  33 files changed, 623 insertions(+), 99 deletions(-)
>
> The code tagging library should exist anyways, it's been open coded half a dozen
> times in the kernel already.
>
> And once we've got that, the time stats code is _also_ far simpler than doing it
> with ftrace would be. If anyone here has successfully debugged latency issues
> with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> want something that can always be on, and that's not cheap with ftrace - and
> never mind the hassle of correlating start and end wait trace events, builting
> up histograms, etc. - that's all handled here.
>
> Cheap, simple, easy to use. What more could you want?
>

This is very interesting work! Do you have any data about the overhead
this introduces, especially in a production environment? I am
especially interested in memory allocation tracking and detecting
leaks.
(Sorry if you already posted this kind of data somewhere that I missed)
Suren Baghdasaryan Aug. 31, 2022, 9:38 p.m. UTC | #10
On Wed, Aug 31, 2022 at 1:56 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> >  33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than doing it
> > with ftrace would be. If anyone here has successfully debugged latency issues
> > with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> > want something that can always be on, and that's not cheap with ftrace - and
> > never mind the hassle of correlating start and end wait trace events, builting
> > up histograms, etc. - that's all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
> >
>
> This is very interesting work! Do you have any data about the overhead
> this introduces, especially in a production environment? I am
> especially interested in memory allocations tracking and detecting
> leaks.

I had the numbers for my previous implementation, before we started using the
lazy percpu counters, but those would not apply to the new implementation. I'll
rerun the measurements and post the exact numbers in a day or so.

> (Sorry if you already posted this kind of data somewhere that I missed)
Oscar Salvador Sept. 1, 2022, 4:52 a.m. UTC | #11
On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> ===========================
> Code tagging framework
> ===========================
> Code tag is a structure identifying a specific location in the source code
> which is generated at compile time and can be embedded in an application-
> specific structure. Several applications of code tagging are included in
> this RFC, such as memory allocation tracking, dynamic fault injection,
> latency tracking and improved error code reporting.
> Basically, it takes the old trick of "define a special elf section for
> objects of a given type so that we can iterate over them at runtime" and
> creates a proper library for it.
> 
> ===========================
> Memory allocation tracking
> ===========================
> The goal for using codetags for memory allocation tracking is to minimize
> performance and memory overhead. By recording only the call count and
> allocation size, the required operations are kept at the minimum while
> collecting statistics for every allocation in the codebase. With that
> information, if users are interested in mode detailed context for a
> specific allocation, they can enable more in-depth context tracking,
> which includes capturing the pid, tgid, task name, allocation size,
> timestamp and call stack for every allocation at the specified code
> location.
> Memory allocation tracking is implemented in two parts:
> 
> part1: instruments page and slab allocators to record call count and total
> memory allocated at every allocation in the source code. Every time an
> allocation is performed by an instrumented allocator, the codetag at that
> location increments its call and size counters. Every time the memory is
> freed these counters are decremented. To decrement the counters upon free,
> allocated object needs a reference to its codetag. Page allocators use
> page_ext to record this reference while slab allocators use memcg_data of
> the slab page.
> The data is exposed to the user space via a read-only debugfs file called
> alloc_tags.

Hi Suren,

I just posted a patch [1] and reading through your changelog and seeing your PoC,
I think we have some kind of overlap.
My patchset aims to give you the stacktrace <-> relationship information and it is
achieved by a small amount of extra code, mostly in page_owner.c and lib/stackdepot.

Of course, your work seems to be more complete wrt. the information you get.

I CCed you in case you want to have a look

[1] https://lkml.org/lkml/2022/9/1/36

Thanks
Suren Baghdasaryan Sept. 1, 2022, 5:05 a.m. UTC | #12
On Wed, Aug 31, 2022 at 9:52 PM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > ===========================
> > Code tagging framework
> > ===========================
> > Code tag is a structure identifying a specific location in the source code
> > which is generated at compile time and can be embedded in an application-
> > specific structure. Several applications of code tagging are included in
> > this RFC, such as memory allocation tracking, dynamic fault injection,
> > latency tracking and improved error code reporting.
> > Basically, it takes the old trick of "define a special elf section for
> > objects of a given type so that we can iterate over them at runtime" and
> > creates a proper library for it.
> >
> > ===========================
> > Memory allocation tracking
> > ===========================
> > The goal for using codetags for memory allocation tracking is to minimize
> > performance and memory overhead. By recording only the call count and
> > allocation size, the required operations are kept at the minimum while
> > collecting statistics for every allocation in the codebase. With that
> > information, if users are interested in mode detailed context for a
> > specific allocation, they can enable more in-depth context tracking,
> > which includes capturing the pid, tgid, task name, allocation size,
> > timestamp and call stack for every allocation at the specified code
> > location.
> > Memory allocation tracking is implemented in two parts:
> >
> > part1: instruments page and slab allocators to record call count and total
> > memory allocated at every allocation in the source code. Every time an
> > allocation is performed by an instrumented allocator, the codetag at that
> > location increments its call and size counters. Every time the memory is
> > freed these counters are decremented. To decrement the counters upon free,
> > allocated object needs a reference to its codetag. Page allocators use
> > page_ext to record this reference while slab allocators use memcg_data of
> > the slab page.
> > The data is exposed to the user space via a read-only debugfs file called
> > alloc_tags.
>
> Hi Suren,
>
> I just posted a patch [1] and reading through your changelog and seeing your PoC,
> I think we have some kind of overlap.
> My patchset aims to give you the stacktrace <-> relationship information and it is
> achieved by a little amount of extra code mostly in page_owner.c/ and lib/stackdepot.
>
> Of course, your works seems to be more complete wrt. the information you get.
>
> I CCed you in case you want to have a look
>
> [1] https://lkml.org/lkml/2022/9/1/36

Hi Oscar,
Thanks for the note. I'll take a look most likely on Friday and will
follow up with you.
Thanks,
Suren.

>
> Thanks
>
>
> --
> Oscar Salvador
> SUSE Labs
Peter Zijlstra Sept. 1, 2022, 7 a.m. UTC | #13
On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:

> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to setup tracing and tracking. Would it ever be enabled
> in production? 

Afaict this is developer only; it is all unconditional code.

> Would a distribution ever enable this? 

I would sincerely hope not. Because:

> If it's enabled, any overhead cannot be disabled/enabled at run or
> boot time so anyone enabling this would carry the cost without never
> necessarily consuming the data.

this.
Peter Zijlstra Sept. 1, 2022, 7:05 a.m. UTC | #14
On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote:

> Also, ftrace can drop events. Not really ideal if under system load your memory
> accounting numbers start to drift.

You could attach custom handlers to tracepoints. If you were to replace
these unconditional code hooks of yours with tracepoints then you could
conditionally (say at boot) register custom handlers that do the
accounting you want.

Nobody is mandating you use the ftrace ringbuffer to consume tracepoints.
Many people these days attach eBPF scripts to them and do whatever they
want.
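
A minimal sketch of that approach for the page allocator case, in the style
of kernel/trace/blktrace.c: the mm_page_alloc tracepoint and its
register_trace_mm_page_alloc() helper exist upstream (for built-in code),
while the counter, probe and init function here are illustrative and not
from any posted patch:

  #include <linux/atomic.h>
  #include <trace/events/kmem.h>

  static atomic_long_t pages_allocated;

  /* Probe signature mirrors the mm_page_alloc tracepoint prototype,
   * with the usual leading void * data argument. */
  static void probe_page_alloc(void *ignore, struct page *page,
                               unsigned int order, gfp_t gfp_flags,
                               int migratetype)
  {
          atomic_long_add(1UL << order, &pages_allocated);
  }

  static int __init my_alloc_accounting_init(void)
  {
          /* Register conditionally (e.g. behind a boot parameter) so the
           * cost is only paid when someone actually wants the data. */
          return register_trace_mm_page_alloc(probe_page_alloc, NULL);
  }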
Michal Hocko Sept. 1, 2022, 7:18 a.m. UTC | #15
On Wed 31-08-22 15:01:54, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > Whatever asking for an explanation as to why equivalent functionality
> > > cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > 
> > Fully agreed and this is especially true for a change this size
> > 77 files changed, 3406 insertions(+), 703 deletions(-)
> 
> In the case of memory allocation accounting, you flat cannot do this with ftrace
> - you could maybe do a janky version that isn't fully accurate, much slower,
> more complicated for the developer to understand and debug and more complicated
> for the end user.
> 
> But please, I invite anyone who's actually been doing this with ftrace to
> demonstrate otherwise.
> 
> Ftrace just isn't the right tool for the job here - we're talking about adding
> per callsite accounting to some of the fastest fast paths in the kernel.
> 
> And the size of the changes for memory allocation accounting are much more
> reasonable:
>  33 files changed, 623 insertions(+), 99 deletions(-)
> 
> The code tagging library should exist anyways, it's been open coded half a dozen
> times in the kernel already.
> 
> And once we've got that, the time stats code is _also_ far simpler than doing it
> with ftrace would be. If anyone here has successfully debugged latency issues
> with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> want something that can always be on, and that's not cheap with ftrace - and
> never mind the hassle of correlating start and end wait trace events, builting
> up histograms, etc. - that's all handled here.
> 
> Cheap, simple, easy to use. What more could you want?

A big ad on a banner. But more seriously.

This patchset is _huge_ and touching a lot of different areas. It will
be not only hard to review but even harder to maintain long term. So
it is completely reasonable to ask for potential alternatives with a
smaller code footprint. I am pretty sure you are aware of that workflow.

So I find Peter's question completely appropriate while your response to
that not so much! Maybe ftrace is not the right tool for the intended
job. Maybe there are other ways and it would be really great to show
that those have been evaluated and they are not suitable for a), b) and
c) reasons.

E.g. Oscar has been working on extending page_ext to track the number of
allocations for a specific calltrace [1]. Is this a 1:1 replacement? No! But
it can help in environments where page_ext can be enabled and it is
completely non-intrusive to the MM code.

If the page_ext overhead is not desirable/acceptable then I am sure
there are other options. E.g. kprobes/LivePatching framework can hook
into functions and alter their behavior. So why not use that for data
collection? Has this been evaluated at all?

And please note that I am not claiming the presented work is approaching
the problem from a wrong direction. It might very well solve multiple
problems in a single go _but_ the long term code maintenance burden
really has to be carefully evaluated and if we can achieve a
reasonable subset of the functionality with an existing infrastructure
then I would be inclined to sacrifice some portions with a considerably
smaller code footprint.

[1] http://lkml.kernel.org/r/20220901044249.4624-1-osalvador@suse.de
Daniel Bristot de Oliveira Sept. 1, 2022, 7:36 a.m. UTC | #16
On 9/1/22 09:05, Peter Zijlstra wrote:
>> Also, ftrace can drop events. Not really ideal if under system load your memory
>> accounting numbers start to drift.
> You could attach custom handlers to tracepoints. If you were to replace
> these unconditional code hooks of yours with tracepoints then you could
> conditionally (say at boot) register custom handlers that do the
> accounting you want.

That is the strategy in RV (kernel/trace/rv/). It is in C, but I am also
adding support for monitors in bpf. The osnoise/timerlat tracers work this
way too, and they are enabled on Fedora/Red Hat/SUSE... production. They
will also be enabled in Ubuntu and Debian (the interwebs say).

The overhead of attaching code to tracepoints (or any "attachable thing") and
processing data in kernel is often lower than consuming it in user-space.
Obviously, when it is possible, e.g., when you respect locking rules, etc.

This paper (the basis for RV) shows a little comparison:
https://bristot.me/wp-content/uploads/2019/09/paper.pdf

By doing so, we also avoid problems of losing events... and you can also
generate other events from your attached code.

(It is also way easier to convince a maintainer to add a tracepoint or a trace
event than to add arbitrary code... ;-)

-- Daniel
Peter Zijlstra Sept. 1, 2022, 7:42 a.m. UTC | #17
On Thu, Sep 01, 2022 at 09:05:36AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote:
> 
> > Also, ftrace can drop events. Not really ideal if under system load your memory
> > accounting numbers start to drift.
> 
> You could attach custom handlers to tracepoints. If you were to replace
> these unconditional code hooks of yours with tracepoints then you could
> conditionally (say at boot) register custom handlers that do the
> accounting you want.
> 
> Nobody is mandating you use the ftrace ringbuffer to consume tracepoints.
> Many people these days attach eBPF scripts to them and do whatever they
> want.

Look at kernel/trace/blktrace.c for a fine in-kernel !BPF example of this.
David Hildenbrand Sept. 1, 2022, 8:05 a.m. UTC | #18
On 31.08.22 21:01, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
>> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
>>> Whatever asking for an explanation as to why equivalent functionality
>>> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.
>>
>> Fully agreed and this is especially true for a change this size
>> 77 files changed, 3406 insertions(+), 703 deletions(-)
> 
> In the case of memory allocation accounting, you flat cannot do this with ftrace
> - you could maybe do a janky version that isn't fully accurate, much slower,
> more complicated for the developer to understand and debug and more complicated
> for the end user.
> 
> But please, I invite anyone who's actually been doing this with ftrace to
> demonstrate otherwise.
> 
> Ftrace just isn't the right tool for the job here - we're talking about adding
> per callsite accounting to some of the fastest fast paths in the kernel.
> 
> And the size of the changes for memory allocation accounting are much more
> reasonable:
>  33 files changed, 623 insertions(+), 99 deletions(-)
> 
> The code tagging library should exist anyways, it's been open coded half a dozen
> times in the kernel already.

Hi Kent,

independent of the other discussions, if it's open coded already, does
it make sense to factor that already-open-coded part out independently
of the remainder of the full series here?

[I didn't immediately spot if this series also attempts already to
replace that open-coded part]
Mel Gorman Sept. 1, 2022, 11:05 a.m. UTC | #19
On Wed, Aug 31, 2022 at 11:59:41AM -0400, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
> > On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > > ===========================
> > > > > Code tagging framework
> > > > > ===========================
> > > > > Code tag is a structure identifying a specific location in the source code
> > > > > which is generated at compile time and can be embedded in an application-
> > > > > specific structure. Several applications of code tagging are included in
> > > > > this RFC, such as memory allocation tracking, dynamic fault injection,
> > > > > latency tracking and improved error code reporting.
> > > > > Basically, it takes the old trick of "define a special elf section for
> > > > > objects of a given type so that we can iterate over them at runtime" and
> > > > > creates a proper library for it.
> > > > 
> > > > I might be super dense this morning, but what!? I've skimmed through the
> > > > set and I don't think I get it.
> > > > 
> > > > What does this provide that ftrace/kprobes don't already allow?
> > > 
> > > You're kidding, right?
> > 
> > It's a valid question. From the description, it main addition that would
> > be hard to do with ftrace or probes is catching where an error code is
> > returned. A secondary addition would be catching all historical state and
> > not just state since the tracing started.
> 
> Catching all historical state is pretty important in the case of memory
> allocation accounting, don't you think?
> 

Not always. If the intent is to catch a memory leak that gets worse over
time, early boot should be sufficient. Sure, there might be drivers that leak
memory allocated at init but if it's not a growing leak, it doesn't matter.

> Also, ftrace can drop events. Not really ideal if under system load your memory
> accounting numbers start to drift.
> 

As pointed out elsewhere, attaching to the tracepoint and recording relevant
state is an option other than trying to parse a raw ftrace feed. For memory
leaks, there are already tracepoints for page allocation and free that could
be used to track allocations that are not freed at a given point in time.
There is also the kernel memory leak detector although I never had reason
to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html)
and it sounds like it would be expensive.

> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to setup tracing and tracking. Would it ever be enabled
> > in production? Would a distribution ever enable this? If it's enabled, any
> > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > this would carry the cost without never necessarily consuming the data.
> 
> The whole point of this is to be cheap enough to enable in production -
> especially the latency tracing infrastructure. There's a lot of value to
> always-on system visibility infrastructure, so that when a live machine starts
> to do something wonky the data is already there.
> 

Sure, there is value but nothing stops the tracepoints being attached as
a boot-time service where there is interest. For latencies, there are already
bpf examples for tracing individual function latency over time e.g.
https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although
I haven't used it recently.

Live parsing of ftrace is possible, albeit expensive.
https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl
tracks counts of high-order allocations and dumps a report on interrupt as
an example of live parsing ftrace and only recording interesting state. It's
not tracking state you are interested in but it demonstrates it is possible
to rely on ftrace alone and monitor from userspace. It's bit-rotted but
can be fixed with

diff --git a/monitors/watch-highorder.pl b/monitors/watch-highorder.pl
index 8c80ae79e556..fd0d477427df 100755
--- a/monitors/watch-highorder.pl
+++ b/monitors/watch-highorder.pl
@@ -52,7 +52,7 @@ my $regex_pagealloc;
 
 # Static regex used. Specified like this for readability and for use with /o
 #                      (process_pid)     (cpus      )   ( time  )   (tpoint    ) (details)
-my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9. ]*):\s*([a-zA-Z_]*):\s*(.*)';
 my $regex_statname = '[-0-9]*\s\((.*)\).*';
 my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
 
@@ -73,6 +73,7 @@ sub generate_traceevent_regex {
 				$regex =~ s/%p/\([0-9a-f]*\)/g;
 				$regex =~ s/%d/\([-0-9]*\)/g;
 				$regex =~ s/%lu/\([0-9]*\)/g;
+				$regex =~ s/%lx/\([0-9a-zA-Z]*\)/g;
 				$regex =~ s/%s/\([A-Z_|]*\)/g;
 				$regex =~ s/\(REC->gfp_flags\).*/REC->gfp_flags/;
 				$regex =~ s/\",.*//;

Example output

3 instances order=2 normal gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO
 => trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd>
 => __alloc_pages+0x188/0x250 <ffffffffb1cee8a8>
 => kmalloc_large_node+0x3f/0x80 <ffffffffb1d1cd3f>
 => __kmalloc_node+0x321/0x420 <ffffffffb1d22351>
 => kvmalloc_node+0x46/0xe0 <ffffffffb1ca4906>
 => ttm_sg_tt_init+0x88/0xb0 [ttm] <ffffffffc03a02c8>
 => amdgpu_ttm_tt_create+0x4f/0x80 [amdgpu] <ffffffffc04cff0f>
 => ttm_tt_create+0x59/0x90 [ttm] <ffffffffc03a03b9>
 => ttm_bo_handle_move_mem+0x7e/0x1c0 [ttm] <ffffffffc03a0d9e>
 => ttm_bo_validate+0xc5/0x140 [ttm] <ffffffffc03a2095>
 => ttm_bo_init_reserved+0x17b/0x200 [ttm] <ffffffffc03a228b>
 => amdgpu_bo_create+0x1a3/0x470 [amdgpu] <ffffffffc04d36c3>
 => amdgpu_bo_create_user+0x34/0x60 [amdgpu] <ffffffffc04d39c4>
 => amdgpu_gem_create_ioctl+0x131/0x3a0 [amdgpu] <ffffffffc04d94f1>
 => drm_ioctl_kernel+0xb5/0x140 <ffffffffb21652c5>
 => drm_ioctl+0x224/0x3e0 <ffffffffb2165574>
 => amdgpu_drm_ioctl+0x49/0x80 [amdgpu] <ffffffffc04bd2d9>
 => __x64_sys_ioctl+0x8a/0xc0 <ffffffffb1d7c2da>
 => do_syscall_64+0x5c/0x90 <ffffffffb253016c>
 => entry_SYSCALL_64_after_hwframe+0x63/0xcd <ffffffffb260009b>

3 instances order=1 normal gfp_flags=GFP_NOWAIT|__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_ACCOUNT
 => trace_event_raw_event_mm_page_alloc+0x7d/0xc0 <ffffffffb1caeccd>
 => __alloc_pages+0x188/0x250 <ffffffffb1cee8a8>
 => __folio_alloc+0x17/0x50 <ffffffffb1cef1a7>
 => vma_alloc_folio+0x8f/0x350 <ffffffffb1d11e4f>
 => __handle_mm_fault+0xa1e/0x1120 <ffffffffb1cc80ee>
 => handle_mm_fault+0xb2/0x280 <ffffffffb1cc88a2>
 => do_user_addr_fault+0x1b9/0x690 <ffffffffb1a89949>
 => exc_page_fault+0x67/0x150 <ffffffffb2534627>
 => asm_exc_page_fault+0x22/0x30 <ffffffffb2600b62>

It's not tracking leaks because that is not what I was interested in at
the time, but it could be done using the same method by recording PFNs that
were allocated and their call sites, and reporting the ones never freed.
These days this approach may look a bit dated, but it was originally written
13 years ago. It could have been done with systemtap back then, but my
recollection is that it was difficult to keep systemtap working with rc
kernels.

> What we've built here is _far_ cheaper than anything that could be done
> with ftrace.
> 
> > It might be an ease-of-use thing. Gathering the information from traces
> > is tricky and would need combining multiple different elements and that
> > is development effort but not impossible.
> > 
> > Whatever asking for an explanation as to why equivalent functionality
> > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> 
> I think perhaps some of the expectation should be on the "ftrace for
> everything!" people to explain a: how their alternative could be even built and
> b: how it would compare in terms of performance and ease of use.
> 

The ease of use is a criticism as there is effort required to develop
the state tracking of in-kernel events, be it from live parsing ftrace,
attaching to tracepoints with systemtap/bpf/whatever and the like. The
main disadvantage with an in-kernel implementation is three-fold. First,
it doesn't work with older kernels without backports. Second, if something
slightly different is needed then it's a kernel rebuild.  Third, if the
option is not enabled in the deployed kernel config then you are relying
on the end user being willing to deploy a custom kernel.  The initial
investment in doing memory leak tracking or latency tracking by attaching
to tracepoints is significant but it works with older kernels up to a point
and is less sensitive to the kernel config options selected as features
like ftrace are often selected.
Kent Overstreet Sept. 1, 2022, 2:23 p.m. UTC | #20
On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
> On 31.08.22 21:01, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> >> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> >>> Whatever asking for an explanation as to why equivalent functionality
> >>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >>
> >> Fully agreed and this is especially true for a change this size
> >> 77 files changed, 3406 insertions(+), 703 deletions(-)
> > 
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> > 
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> > 
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> > 
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> >  33 files changed, 623 insertions(+), 99 deletions(-)
> > 
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
> 
> Hi Kent,
> 
> independent of the other discussions, if it's open coded already, does
> it make sense to factor that already-open-coded part out independently
> of the remainder of the full series here?

It's discussed in the cover letter, that is exactly how the patch series is
structured.
 
> [I didn't immediately spot if this series also attempts already to
> replace that open-coded part]

Uh huh.

Honestly, some days it feels like lkml is just as bad as slashdot, with people
wanting to get in their two cents without actually reading...
Kent Overstreet Sept. 1, 2022, 2:29 p.m. UTC | #21
On Thu, Sep 01, 2022 at 09:00:17AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 31, 2022 at 11:19:48AM +0100, Mel Gorman wrote:
> 
> > It's also unclear *who* would enable this. It looks like it would mostly
> > have value during the development stage of an embedded platform to track
> > kernel memory usage on a per-application basis in an environment where it
> > may be difficult to setup tracing and tracking. Would it ever be enabled
> > in production? 
> 
> Afaict this is developer only; it is all unconditional code.
> 
> > Would a distribution ever enable this? 
> 
> I would sincerely hope not. Because:
> 
> > If it's enabled, any overhead cannot be disabled/enabled at run or
> > boot time so anyone enabling this would carry the cost without ever
> > necessarily consuming the data.
> 
> this.

We could make it a boot parameter, with the alternatives infrastructure - with a
bit of refactoring there'd be a single function call to nop out, and then we
could drop the elf sections as well, so that when built in but disabled the
overhead would be practically nil.
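
As a rough sketch of the kind of kill switch being talked about - a static key
flipped by an early boot parameter - something like the below would do. The
parameter and symbol names here are made up for illustration, the
__alloc_tag_add() signature is assumed, and the series may well use the
alternatives machinery differently.

#include <linux/init.h>
#include <linux/jump_label.h>
#include <linux/kernel.h>

struct alloc_tag;						/* from the series */
void __alloc_tag_add(struct alloc_tag *tag, size_t bytes);	/* assumed signature */

DEFINE_STATIC_KEY_FALSE(mem_alloc_profiling_key);

static int __init setup_mem_alloc_profiling(char *str)
{
	bool enable;

	if (!str || kstrtobool(str, &enable))
		return -EINVAL;
	if (enable)
		static_branch_enable(&mem_alloc_profiling_key);
	return 0;
}
early_param("mem_alloc_profiling", setup_mem_alloc_profiling);

/* In the allocation hot path, accounting then costs a single patched-out
 * branch when the feature is compiled in but disabled at boot: */
static inline void alloc_tag_account(struct alloc_tag *tag, size_t bytes)
{
	if (static_branch_unlikely(&mem_alloc_profiling_key))
		__alloc_tag_add(tag, bytes);
}

With the tag structs kept in their own ELF section, the disabled case could
additionally discard them outright, as noted later in the thread.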
David Hildenbrand Sept. 1, 2022, 3:07 p.m. UTC | #22
On 01.09.22 16:23, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
>> On 31.08.22 21:01, Kent Overstreet wrote:
>>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
>>>> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
>>>>> Whatever asking for an explanation as to why equivalent functionality
>>>>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
>>>>
>>>> Fully agreed and this is especially true for a change this size
>>>> 77 files changed, 3406 insertions(+), 703 deletions(-)
>>>
>>> In the case of memory allocation accounting, you flat cannot do this with ftrace
>>> - you could maybe do a janky version that isn't fully accurate, much slower,
>>> more complicated for the developer to understand and debug and more complicated
>>> for the end user.
>>>
>>> But please, I invite anyone who's actually been doing this with ftrace to
>>> demonstrate otherwise.
>>>
>>> Ftrace just isn't the right tool for the job here - we're talking about adding
>>> per callsite accounting to some of the fastest fast paths in the kernel.
>>>
>>> And the size of the changes for memory allocation accounting are much more
>>> reasonable:
>>>  33 files changed, 623 insertions(+), 99 deletions(-)
>>>
>>> The code tagging library should exist anyways, it's been open coded half a dozen
>>> times in the kernel already.
>>
>> Hi Kent,
>>
>> independent of the other discussions, if it's open coded already, does
>> it make sense to factor that already-open-coded part out independently
>> of the remainder of the full series here?
> 
> It's discussed in the cover letter, that is exactly how the patch series is
> structured.

Skimming over the patches (that I was CCed on) and skimming over the
cover letter, I got the impression that everything after patch 7 is
introducing something new instead of refactoring something out.

>  
>> [I didn't immediately spot if this series also attempts already to
>> replace that open-coded part]
> 
> Uh huh.
> 
> Honestly, some days it feels like lkml is just as bad as slashdot, with people
> wanting to get in their two cents without actually reading...

... and of course you had to reply like that. I should just have learned
from my last upstream experience with you and kept you on my spam list.

Thanks, bye
Suren Baghdasaryan Sept. 1, 2022, 3:33 p.m. UTC | #23
On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 31-08-22 15:01:54, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> >  33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than doing it
> > with ftrace would be. If anyone here has successfully debugged latency issues
> > with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> > want something that can always be on, and that's not cheap with ftrace - and
> > never mind the hassle of correlating start and end wait trace events, building
> > up histograms, etc. - that's all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
>
> A big ad on a banner. But more seriously.
>
> This patchset is _huge_ and touching a lot of different areas. It will
> be not only hard to review but even harder to maintain longterm. So
> it is completely reasonable to ask for potential alternatives with a
> smaller code footprint. I am pretty sure you are aware of that workflow.

The patchset is huge because it introduces a reusable part (the first
6 patches introducing code tagging) and 6 different applications in
very different areas of the kernel. We wanted to present all of them
in the RFC to show the variety of cases this mechanism can be reused
for. If the code tagging is accepted, each application can be posted
separately to the appropriate group of people. Hopefully that makes it
easier to review. Those first 6 patches are not that big and are quite
isolated IMHO:

 include/linux/codetag.h             |  83 ++++++++++
 include/linux/lazy-percpu-counter.h |  67 ++++++++
 include/linux/module.h              |   1 +
 kernel/module/internal.h            |   1 -
 kernel/module/main.c                |   4 +
 lib/Kconfig                         |   3 +
 lib/Kconfig.debug                   |   4 +
 lib/Makefile                        |   3 +
 lib/codetag.c                       | 248 ++++++++++++++++++++++++++++
 lib/lazy-percpu-counter.c           | 141 ++++++++++++++++
 lib/string_helpers.c                |   3 +-
 scripts/kallsyms.c                  |  13 ++

>
> So I find Peter's question completely appropriate while your response to
> that not so much! Maybe ftrace is not the right tool for the intended
> job. Maybe there are other ways and it would be really great to show
> that those have been evaluated and they are not suitable for a), b) and
> c) reasons.

That's fair.
For memory tracking I looked into using kmemleak and page_owner which
can't match the required functionality at an overhead acceptable for
production and pre-production testing environments. traces + BPF I
haven't evaluated myself but heard from other members of my team who
tried using that in production environment with poor results. I'll try
to get more specific information on that.

>
> E.g. Oscar has been working on extending page_ext to track number of
> allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> it can help in environments where page_ext can be enabled and it is
> completely non-intrusive to the MM code.

Thanks for pointing out this work. I'll need to review and maybe
profile it before making any claims.

>
> If the page_ext overhead is not desirable/acceptable then I am sure
> there are other options. E.g. kprobes/LivePatching framework can hook
> into functions and alter their behavior. So why not use that for data
> collection? Has this been evaluated at all?

I'm not sure how I can hook into say alloc_pages() to find out where
it was called from without capturing the call stack (which would
introduce an overhead at every allocation). Would love to discuss this
or other alternatives if they can be done with low enough overhead.
Thanks,
Suren.

>
> And please note that I am not claiming the presented work is approaching
> the problem from a wrong direction. It might very well solve multiple
> problems in a single go _but_ the long term code maintenance burden
> really has to be carefully evaluated and if we can achieve a
> reasonable subset of the functionality with an existing infrastructure
> then I would be inclined to sacrifice some portions with a considerably
> smaller code footprint.
>
> [1] http://lkml.kernel.org/r/20220901044249.4624-1-osalvador@suse.de
>
> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Sept. 1, 2022, 3:39 p.m. UTC | #24
On Thu, Sep 1, 2022 at 8:07 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.09.22 16:23, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 10:05:03AM +0200, David Hildenbrand wrote:
> >> On 31.08.22 21:01, Kent Overstreet wrote:
> >>> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> >>>> On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> >>>>> Whatever asking for an explanation as to why equivalent functionality
> >>>>> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> >>>>
> >>>> Fully agreed and this is especially true for a change this size
> >>>> 77 files changed, 3406 insertions(+), 703 deletions(-)
> >>>
> >>> In the case of memory allocation accounting, you flat cannot do this with ftrace
> >>> - you could maybe do a janky version that isn't fully accurate, much slower,
> >>> more complicated for the developer to understand and debug and more complicated
> >>> for the end user.
> >>>
> >>> But please, I invite anyone who's actually been doing this with ftrace to
> >>> demonstrate otherwise.
> >>>
> >>> Ftrace just isn't the right tool for the job here - we're talking about adding
> >>> per callsite accounting to some of the fastest fast paths in the kernel.
> >>>
> >>> And the size of the changes for memory allocation accounting are much more
> >>> reasonable:
> >>>  33 files changed, 623 insertions(+), 99 deletions(-)
> >>>
> >>> The code tagging library should exist anyways, it's been open coded half a dozen
> >>> times in the kernel already.
> >>
> >> Hi Kent,
> >>
> >> independent of the other discussions, if it's open coded already, does
> >> it make sense to factor that already-open-coded part out independently
> >> of the remainder of the full series here?
> >
> > It's discussed in the cover letter, that is exactly how the patch series is
> > structured.
>
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

Hi David,
Yes, you are right, the RFC does incorporate lots of parts which can
be considered separately. They are sent together to present the
overall scope of the proposal but I do intend to send them separately
once we decide if it's worth working on.
Thanks,
Suren.

>
> >
> >> [I didn't immediately spot if this series also attempts already to
> >> replace that open-coded part]
> >
> > Uh huh.
> >
> > Honestly, some days it feels like lkml is just as bad as slashdot, with people
> > wanting to get in their two cents without actually reading...
>
> ... and of course you had to reply like that. I should just have learned
> from my last upstream experience with you and kept you on my spam list.
>
> Thanks, bye
>
> --
> Thanks,
>
> David / dhildenb
>
Kent Overstreet Sept. 1, 2022, 3:48 p.m. UTC | #25
On Thu, Sep 01, 2022 at 05:07:06PM +0200, David Hildenbrand wrote:
> Skimming over the patches (that I was CCed on) and skimming over the
> cover letter, I got the impression that everything after patch 7 is
> introducing something new instead of refactoring something out.

You skimmed over the dynamic debug patch then...
Kent Overstreet Sept. 1, 2022, 4:31 p.m. UTC | #26
On Thu, Sep 01, 2022 at 12:05:01PM +0100, Mel Gorman wrote:
> As pointed out elsewhere, attaching to the tracepoint and recording relevant
> state is an option other than trying to parse a raw ftrace feed. For memory
> leaks, there are already tracepoints for page allocation and free that could
> be used to track allocations that are not freed at a given point in time.

Page allocation tracepoints are not sufficient for what we're trying to do here,
and a substantial amount of effort in this patchset has gone into just getting
the hooking locations right - our memory allocation interfaces are not trivial.

That's something people should keep in mind when commenting on the size of this
patchset, since that's effort that would have to be spent for /any/ complete
solution, be it tracepoint based or not.

Additionally, we need to be able to write assertions that verify that our hook
locations are correct, that allocations or frees aren't getting double counted
or missed - highly necessary given the maze of nested memory allocation
interfaces we have (i.e. slab.h), and it's something a tracepoint based
implementation would have to account for - otherwise, a tool isn't very useful
if you can't trust the numbers it's giving you.

And then you have to correlate the allocate and free events, so that you know
which allocate callsite to decrement the amount freed from.

How would you plan on doing that with tracepoints?

> There is also the kernel memory leak detector although I never had reason
> to use it (https://www.kernel.org/doc/html/v6.0-rc3/dev-tools/kmemleak.html)
> and it sounds like it would be expensive.

Kmemleak is indeed expensive, and in the past I've had issues with it not
catching everything (I've noticed the kmemleak annotations growing, so maybe
this is less of an issue than it was).

And this is a more complete solution (though not something that could strictly
replace kmemleak): strict memory leaks aren't the only issue, it's also drivers
quietly consuming more memory than expected.

I'll bet you a beer that when people have had this awhile, we're going to have a
bunch of bugs discovered and fixed along the lines of "oh hey, this driver
wasn't supposed to be using this 1 MB of memory, I never noticed that before".

> > > It's also unclear *who* would enable this. It looks like it would mostly
> > > have value during the development stage of an embedded platform to track
> > > kernel memory usage on a per-application basis in an environment where it
> > > may be difficult to setup tracing and tracking. Would it ever be enabled
> > > in production? Would a distribution ever enable this? If it's enabled, any
> > > overhead cannot be disabled/enabled at run or boot time so anyone enabling
> > > this would carry the cost without ever necessarily consuming the data.
> > 
> > The whole point of this is to be cheap enough to enable in production -
> > especially the latency tracing infrastructure. There's a lot of value to
> > always-on system visibility infrastructure, so that when a live machine starts
> > to do something wonky the data is already there.
> > 
> 
> Sure, there is value but nothing stops the tracepoints being attached as
> a boot-time service where there is interest. For latencies, there are already
> bpf examples for tracing individual function latency over time e.g.
> https://github.com/iovisor/bcc/blob/master/tools/funclatency.py although
> I haven't used it recently.

So this is cool, I'll check it out today.

Tracing of /function/ latency is definitely something you'd want tracing/kprobes
for - that's way more practical than any code tagging-based approach. And if the
output is reliable and useful I could definitely see myself using this, thank
you.

But for data collection where it makes sense to annotate in the source code
where the data collection points are, I see the code-tagging based approach as
simpler - it cuts out a whole bunch of indirection. The diffstat on the code
tagging time stats patch is

 8 files changed, 233 insertions(+), 6 deletions(-)

And that includes hooking wait.h - this is really simple, easy stuff.
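
To make the shape of that concrete, a code-tagged latency point boils down to
something like the sketch below - simplified and with made-up names, and the
actual patch keeps richer state (histogram buckets and the like) rather than
just totals.

#include <linux/atomic.h>
#include <linux/ktime.h>
#include <linux/types.h>

struct latency_tag {
	const char	*file;
	unsigned int	line;
	atomic64_t	total_ns;
	atomic64_t	count;
};

/* One static tag per measured callsite, kept in its own ELF section so
 * all tags can be walked and reported at runtime. */
#define MEASURE_LATENCY(tag, stmt)					\
do {									\
	static struct latency_tag tag					\
		__section("latency_tags") =				\
		{ .file = __FILE__, .line = __LINE__ };			\
	u64 _start = ktime_get_ns();					\
	stmt;								\
	atomic64_add(ktime_get_ns() - _start, &tag.total_ns);		\
	atomic64_inc(&tag.count);					\
} while (0)

/* Usage at a wait site, e.g.:
 *	MEASURE_LATENCY(my_io_wait, wait_event(wq, done));
 */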

The memory allocation tracking patches are more complicated because we've got a
ton of memory allocation interfaces and we're aiming for strict correctness
there - because that tool needs strict correctness in order to be useful.

> Live parsing of ftrace is possible, albeit expensive.
> https://github.com/gormanm/mmtests/blob/master/monitors/watch-highorder.pl
> tracks counts of high-order allocations and dumps a report on interrupt as
> an example of live parsing ftrace and only recording interesting state. It's
> not tracking state you are interested in but it demonstrates it is possible
> to rely on ftrace alone and monitor from userspace. It's bit-rotted but
> can be fixed with

Yeah, if this is as far as people have gotten with ftrace on memory allocations
then I don't think tracing is credible here, sorry.

> The ease of use is a criticism as there is effort required to develop
> the state tracking of in-kernel events, be it from live parsing ftrace,
> attaching to tracepoints with systemtap/bpf/whatever and the like. The
> main disadvantage with an in-kernel implementation is three-fold. First,
> it doesn't work with older kernels without backports. Second, if something
> slightly different is needed then it's a kernel rebuild.  Third, if the
> option is not enabled in the deployed kernel config then you are relying
> on the end user being willing to deploy a custom kernel.  The initial
> investment in doing memory leak tracking or latency tracking by attaching
> to tracepoints is significant but it works with older kernels up to a point
> and is less sensitive to the kernel config options selected as features
> like ftrace are often selected.

The next version of this patch set is going to use the alternatives mechanism to
add a boot parameter.

I'm not interested in backporting to older kernels - eesh. People on old
enterprise kernels don't always get all the new shiny things :)
Michal Hocko Sept. 1, 2022, 7:15 p.m. UTC | #27
On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > So I find Peter's question completely appropriate while your response to
> > that not so much! Maybe ftrace is not the right tool for the intended
> > job. Maybe there are other ways and it would be really great to show
> > that those have been evaluated and they are not suitable for a), b) and
> > c) reasons.
> 
> That's fair.
> For memory tracking I looked into using kmemleak and page_owner which
> can't match the required functionality at an overhead acceptable for
> production and pre-production testing environments.

Being more specific would be really helpful. Especially when your cover
letter suggests that you rely on page_owner/memcg metadata as well to
match allocation and their freeing parts.

> traces + BPF I
> haven't evaluated myself but heard from other members of my team who
> tried using that in production environment with poor results. I'll try
> to get more specific information on that.

That would be helpful as well.

> > E.g. Oscar has been working on extending page_ext to track number of
> > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > it can help in environments where page_ext can be enabled and it is
> > completely non-intrusive to the MM code.
> 
> Thanks for pointing out this work. I'll need to review and maybe
> profile it before making any claims.
> 
> >
> > If the page_ext overhead is not desirable/acceptable then I am sure
> > there are other options. E.g. kprobes/LivePatching framework can hook
> > into functions and alter their behavior. So why not use that for data
> > collection? Has this been evaluated at all?
> 
> I'm not sure how I can hook into say alloc_pages() to find out where
> it was called from without capturing the call stack (which would
> introduce an overhead at every allocation). Would love to discuss this
> or other alternatives if they can be done with low enough overhead.

Yes, tracking back the call trace would be really needed. The question
is whether this is really prohibitively expensive. How much overhead are
we talking about? There is no free lunch here, really.  You either have
the overhead during runtime when the feature is used or on the source
code level for all the future development (with a maze of macros and
wrappers).

Thanks!
Suren Baghdasaryan Sept. 1, 2022, 7:39 p.m. UTC | #28
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > So I find Peter's question completely appropriate while your response to
> > > that not so much! Maybe ftrace is not the right tool for the intended
> > > job. Maybe there are other ways and it would be really great to show
> > > that those have been evaluated and they are not suitable for a), b) and
> > > c) reasons.
> >
> > That's fair.
> > For memory tracking I looked into using kmemleak and page_owner which
> > can't match the required functionality at an overhead acceptable for
> > production and pre-production testing environments.
>
> Being more specific would be really helpful. Especially when your cover
> letter suggests that you rely on page_owner/memcg metadata as well to
> match allocation and their freeing parts.

kmemleak is known to be slow and it's even documented [1], so I hope I
can skip that part. For page_owner to provide the comparable
information we would have to capture the call stacks for all page
allocations, unlike our proposal which allows doing that selectively
for specific call sites. I'll post the overhead numbers of call stack
capturing once I'm finished with profiling the latest code, hopefully
sometime tomorrow, in the worst case after the long weekend.

>
> > traces + BPF I
> > haven't evaluated myself but heard from other members of my team who
> > tried using that in production environment with poor results. I'll try
> > to get more specific information on that.
>
> That would be helpful as well.

Ack.

>
> > > E.g. Oscar has been working on extending page_ext to track number of
> > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > > it can help in environments where page_ext can be enabled and it is
> > > completely non-intrusive to the MM code.
> >
> > Thanks for pointing out this work. I'll need to review and maybe
> > profile it before making any claims.
> >
> > >
> > > If the page_ext overhead is not desirable/acceptable then I am sure
> > > there are other options. E.g. kprobes/LivePatching framework can hook
> > > into functions and alter their behavior. So why not use that for data
> > > collection? Has this been evaluated at all?
> >
> > I'm not sure how I can hook into say alloc_pages() to find out where
> > it was called from without capturing the call stack (which would
> > introduce an overhead at every allocation). Would love to discuss this
> > or other alternatives if they can be done with low enough overhead.
>
> Yes, tracking back the call trace would be really needed. The question
> is whether this is really prohibitively expensive. How much overhead are
> we talking about? There is no free lunch here, really.  You either have
> the overhead during runtime when the feature is used or on the source
> code level for all the future development (with a maze of macros and
> wrappers).

Will post the overhead numbers soon.
What I hear loud and clear is that we need a kernel command-line kill
switch that mitigates the overhead for having this feature. That seems
to be the main concern.
Thanks,
Suren.

[1] https://docs.kernel.org/dev-tools/kmemleak.html#limitations-and-drawbacks

>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
Kent Overstreet Sept. 1, 2022, 8:15 p.m. UTC | #29
On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote:
> kmemleak is known to be slow and it's even documented [1], so I hope I
> can skip that part. For page_owner to provide the comparable
> information we would have to capture the call stacks for all page
> allocations unlike our proposal which allows to do that selectively
> for specific call sites. I'll post the overhead numbers of call stack
> capturing once I'm finished with profiling the latest code, hopefully
> sometime tomorrow, in the worst case after the long weekend.

To expand on this further: we're stashing a pointer to the alloc_tag, which is
defined at the allocation callsite. That's how we're able to decrement the
proper counter on free, and why this beats any tracing based approach - with
tracing you'd instead have to correlate allocate/free events. Ouch.
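
For anyone skimming, the mechanism is roughly the following - a simplified
sketch with illustrative names, not the series' actual macros or section
names:

#include <linux/atomic.h>
#include <linux/slab.h>

struct alloc_tag {
	const char	*file;
	unsigned int	line;
	atomic64_t	bytes;
	atomic64_t	calls;
};

/* One statically allocated tag per allocation callsite, placed in a
 * dedicated ELF section so the full set can be iterated and reported. */
#define DEFINE_ALLOC_TAG(tag)						\
	static struct alloc_tag tag __section("alloc_tags") =		\
		{ .file = __FILE__, .line = __LINE__ }

#define my_kmalloc(size, gfp)						\
({									\
	DEFINE_ALLOC_TAG(_tag);						\
	size_t _sz = (size);						\
	void *_ptr = kmalloc(_sz, gfp);					\
	if (_ptr) {							\
		atomic64_add(_sz, &_tag.bytes);				\
		atomic64_inc(&_tag.calls);				\
		/* the series stashes &_tag in per-object metadata */	\
		/* (page_ext / slab obj exts) for the free path */	\
	}								\
	_ptr;								\
})

On free, the stashed tag pointer is loaded and the same counters are
decremented, which is what removes the need to correlate allocate/free events
after the fact.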

> > Yes, tracking back the call trace would be really needed. The question
> > is whether this is really prohibitively expensive. How much overhead are
> > we talking about? There is no free lunch here, really.  You either have
> > the overhead during runtime when the feature is used or on the source
> > code level for all the future development (with a maze of macros and
> > wrappers).

The full call stack is really not what you want in most applications - that's
what people think they want at first, and why page_owner works the way it does,
but it turns out that then combining all the different but related stack traces
_sucks_ (so why were you saving them in the first place?), and then you have to
do a separate memory allocation for each stack trace, which destroys performance.

> 
> Will post the overhead numbers soon.
> What I hear loud and clear is that we need a kernel command-line kill
> switch that mitigates the overhead for having this feature. That seems
> to be the main concern.
> Thanks,

After looking at this more I don't think we should commit just yet - there are
some tradeoffs to be evaluated, and maybe the thing to do first will be to see
if we can cut down on the (huge!) number of allocation interfaces before adding
more complexity.

The ideal approach, from a performance POV, would be to pass a pointer to the
alloc tag to kmalloc() et al., and then we'd have the actual accounting code in
one place and use a jump label to skip over it when this feature is disabled.

However, there are _many, many_ wrapper functions in our allocation code, and
this approach is going to make the plumbing for the hooks quite a bit bigger
than what we have now - and then, do we want to have this extra alloc_tag
parameter that's not used when CONFIG_ALLOC_TAGGING=n? It's a tiny cost for an
extra unused parameter, but it's a cost - or do we get rid of that with some
extra macro hackery (eww, gross)?

If we do the boot parameter before submission, I think we'll have something
that's maybe not strictly ideal from a performance POV when
CONFIG_ALLOC_TAGGING=y but boot parameter=n, but it'll introduce the minimum
amount of macro insanity.

What we should be able to do pretty easily is discard the alloc_tag structs when
the boot parameter is disabled, because they're in special elf sections and we
already do that (e.g. for .init).
Roman Gushchin Sept. 1, 2022, 10:27 p.m. UTC | #30
On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> On Wed, Aug 31, 2022 at 12:02 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > > > Whatever asking for an explanation as to why equivalent functionality
> > > > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > >
> > > Fully agreed and this is especially true for a change this size
> > > 77 files changed, 3406 insertions(+), 703 deletions(-)
> >
> > In the case of memory allocation accounting, you flat cannot do this with ftrace
> > - you could maybe do a janky version that isn't fully accurate, much slower,
> > more complicated for the developer to understand and debug and more complicated
> > for the end user.
> >
> > But please, I invite anyone who's actually been doing this with ftrace to
> > demonstrate otherwise.
> >
> > Ftrace just isn't the right tool for the job here - we're talking about adding
> > per callsite accounting to some of the fastest fast paths in the kernel.
> >
> > And the size of the changes for memory allocation accounting are much more
> > reasonable:
> >  33 files changed, 623 insertions(+), 99 deletions(-)
> >
> > The code tagging library should exist anyways, it's been open coded half a dozen
> > times in the kernel already.
> >
> > And once we've got that, the time stats code is _also_ far simpler than doing it
> > with ftrace would be. If anyone here has successfully debugged latency issues
> > with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> > want something that can always be on, and that's not cheap with ftrace - and
> > never mind the hassle of correlating start and end wait trace events, building
> > up histograms, etc. - that's all handled here.
> >
> > Cheap, simple, easy to use. What more could you want?
> >
> 
> This is very interesting work! Do you have any data about the overhead
> this introduces, especially in a production environment? I am
> especially interested in memory allocations tracking and detecting
> leaks.

+1

I think the question whether it indeed can be always turned on in the production
or not is the main one. If not, the advantage over ftrace/bpf/... is not that
obvious. Otherwise it will be indeed a VERY useful thing.

Also, there is a lot of interesting stuff within this patchset, which
might be useful elsewhere. So thanks to Kent and Suren for this work!

Thanks!
Kent Overstreet Sept. 1, 2022, 10:37 p.m. UTC | #31
On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > This is very interesting work! Do you have any data about the overhead
> > this introduces, especially in a production environment? I am
> > especially interested in memory allocations tracking and detecting
> > leaks.
> 
> +1
> 
> I think the question whether it indeed can be always turned on in the production
> or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> obvious. Otherwise it will be indeed a VERY useful thing.

Low enough overhead to run in production was my primary design goal.

Stats are kept in a struct that's defined at the callsite. So this adds _no_
pointer chasing to the allocation path, unless we've switched to percpu counters
at that callsite (see the lazy percpu counters patch), where we need to deref
one percpu pointer to save an atomic.

Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
For slab allocations this uses the same storage area as memcg, so for
allocations that are using that we won't be touching any additional cachelines.
(I wanted the pointer to the alloc_tag to be stored inline with the allocation,
but that would've caused alignment difficulties).

Then there's a pointer deref introduced to the kfree() path, to get back to the
original alloc_tag and subtract the allocation from that callsite. That one
won't be free, and with percpu counters we've got another dependent load too -
hmm, it might be worth benchmarking with just atomics, skipping the percpu
counters.

So the overhead won't be zero, I expect it'll show up in some synthetic
benchmarks, but yes I do definitely expect this to be worth enabling in
production in many scenarios.
Roman Gushchin Sept. 1, 2022, 10:53 p.m. UTC | #32
On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > This is very interesting work! Do you have any data about the overhead
> > > this introduces, especially in a production environment? I am
> > > especially interested in memory allocations tracking and detecting
> > > leaks.
> > 
> > +1
> > 
> > I think the question whether it indeed can be always turned on in the production
> > or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> > obvious. Otherwise it will be indeed a VERY useful thing.
> 
> Low enough overhead to run in production was my primary design goal.
> 
> Stats are kept in a struct that's defined at the callsite. So this adds _no_
> pointer chasing to the allocation path, unless we've switched to percpu counters
> at that callsite (see the lazy percpu counters patch), where we need to deref
> one percpu pointer to save an atomic.
> 
> Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
> For slab allocations this uses the same storage area as memcg, so for
> allocations that are using that we won't be touching any additional cachelines.
> (I wanted the pointer to the alloc_tag to be stored inline with the allocation,
> but that would've caused alignment difficulties).
> 
> Then there's a pointer deref introduced to the kfree() path, to get back to the
> original alloc_tag and subtract the allocation from that callsite. That one
> won't be free, and with percpu counters we've got another dependent load too -
> hmm, it might be worth benchmarking with just atomics, skipping the percpu
> counters.
> 
> So the overhead won't be zero, I expect it'll show up in some synthetic
> benchmarks, but yes I do definitely expect this to be worth enabling in
> production in many scenarios.

I'm somewhat sceptical, but I usually am. And in this case I'll be really happy
to be wrong.

On the bright side, maybe most of the overhead will come from a few allocations,
so an option to explicitly exclude them will do the trick.

I'd suggest to run something like iperf on a fast hardware. And maybe some
io_uring stuff too. These are two places which were historically most sensitive
to the (kernel) memory accounting speed.

Thanks!
Suren Baghdasaryan Sept. 1, 2022, 11:36 p.m. UTC | #33
On Thu, Sep 1, 2022 at 3:54 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Thu, Sep 01, 2022 at 06:37:20PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:27:27PM -0700, Roman Gushchin wrote:
> > > On Wed, Aug 31, 2022 at 01:56:08PM -0700, Yosry Ahmed wrote:
> > > > This is very interesting work! Do you have any data about the overhead
> > > > this introduces, especially in a production environment? I am
> > > > especially interested in memory allocations tracking and detecting
> > > > leaks.
> > >
> > > +1
> > >
> > > I think the question whether it indeed can be always turned on in the production
> > > or not is the main one. If not, the advantage over ftrace/bpf/... is not that
> > > obvious. Otherwise it will be indeed a VERY useful thing.
> >
> > Low enough overhead to run in production was my primary design goal.
> >
> > Stats are kept in a struct that's defined at the callsite. So this adds _no_
> > pointer chasing to the allocation path, unless we've switched to percpu counters
> > at that callsite (see the lazy percpu counters patch), where we need to deref
> > one percpu pointer to save an atomic.
> >
> > Then we need to stash a pointer to the alloc_tag, so that kfree() can find it.
> > For slab allocations this uses the same storage area as memcg, so for
> > allocations that are using that we won't be touching any additional cachelines.
> > (I wanted the pointer to the alloc_tag to be stored inline with the allocation,
> > but that would've caused alignment difficulties).
> >
> > Then there's a pointer deref introduced to the kfree() path, to get back to the
> > original alloc_tag and subtract the allocation from that callsite. That one
> > won't be free, and with percpu counters we've got another dependent load too -
> > hmm, it might be worth benchmarking with just atomics, skipping the percpu
> > counters.
> >
> > So the overhead won't be zero, I expect it'll show up in some synthetic
> > benchmarks, but yes I do definitely expect this to be worth enabling in
> > production in many scenarios.
>
> I'm somewhat sceptical, but I usually am. And in this case I'll be really happy
> to be wrong.
>
> On the bright side, maybe most of the overhead will come from a few allocations,
> so an option to explicitly exclude them will do the trick.
>
> I'd suggest to run something like iperf on a fast hardware. And maybe some
> io_uring stuff too. These are two places which were historically most sensitive
> to the (kernel) memory accounting speed.

Thanks for the suggestions, Roman. I'll see how I can get this done.
I'll have to find someone with access to fast hardware (Android is not
great for that) and backport the patchset to the supported kernel
version. Will do my best.
Thanks,
Suren.

>
> Thanks!
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Kent Overstreet Sept. 2, 2022, 12:17 a.m. UTC | #34
On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> I'd suggest to run something like iperf on a fast hardware. And maybe some
> io_uring stuff too. These are two places which were historically most sensitive
> to the (kernel) memory accounting speed.

I'm getting wildly inconsistent results with iperf.

io_uring-echo-server and rust_echo_bench gets me:
Benchmarking: 127.0.0.1:12345
50 clients, running 512 bytes, 60 sec.

Without alloc tagging:	120547 request/sec
With:			116748 request/sec

https://github.com/frevib/io_uring-echo-server
https://github.com/haraldh/rust_echo_bench

How's that look to you? Close enough? :)
Roman Gushchin Sept. 2, 2022, 1:04 a.m. UTC | #35
On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > I'd suggest to run something like iperf on a fast hardware. And maybe some
> > io_uring stuff too. These are two places which were historically most sensitive
> > to the (kernel) memory accounting speed.
> 
> I'm getting wildly inconsistent results with iperf.
> 
> io_uring-echo-server and rust_echo_bench gets me:
> Benchmarking: 127.0.0.1:12345
> 50 clients, running 512 bytes, 60 sec.
> 
> Without alloc tagging:	120547 request/sec
> With:			116748 request/sec
> 
> https://github.com/frevib/io_uring-echo-server
> https://github.com/haraldh/rust_echo_bench
> 
> How's that look to you? Close enough? :)

Yes, this looks good (a bit too good).

I'm not that familiar with io_uring; Jens and Pavel should have a better idea
of what and how to run (I know they've worked around the kernel memory accounting
because of its performance in the past, which is why I suspect it might be an
issue here as well).

This is a recent optimization on the networking side:
https://lore.kernel.org/linux-mm/20220825000506.239406-1-shakeelb@google.com/

Maybe you can try to repeat this experiment.

Thanks!
Kent Overstreet Sept. 2, 2022, 1:16 a.m. UTC | #36
On Thu, Sep 01, 2022 at 06:04:46PM -0700, Roman Gushchin wrote:
> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> > On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> > > I'd suggest to run something like iperf on a fast hardware. And maybe some
> > > io_uring stuff too. These are two places which were historically most sensitive
> > > to the (kernel) memory accounting speed.
> > 
> > I'm getting wildly inconsistent results with iperf.
> > 
> > io_uring-echo-server and rust_echo_bench gets me:
> > Benchmarking: 127.0.0.1:12345
> > 50 clients, running 512 bytes, 60 sec.
> > 
> > Without alloc tagging:	120547 request/sec
> > With:			116748 request/sec
> > 
> > https://github.com/frevib/io_uring-echo-server
> > https://github.com/haraldh/rust_echo_bench
> > 
> > How's that look to you? Close enough? :)
> 
> Yes, this looks good (a bit too good).

Eh, I was hoping for better :)

> I'm not that familiar with io_uring, Jens and Pavel should have a better idea
> what and how to run (I know they've worked around the kernel memory accounting
> because of the performance in the past, this is why I suspect it might be an
> issue here as well).
> 
> This is a recent optimization on the networking side:
> https://lore.kernel.org/linux-mm/20220825000506.239406-1-shakeelb@google.com/
> 
> Maybe you can try to repeat this experiment.

I'd be more interested in a synthetic benchmark, if you know of any.
Jens Axboe Sept. 2, 2022, 12:02 p.m. UTC | #37
On 9/1/22 7:04 PM, Roman Gushchin wrote:
> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
>> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
>>> I'd suggest to run something like iperf on a fast hardware. And maybe some
>>> io_uring stuff too. These are two places which were historically most sensitive
>>> to the (kernel) memory accounting speed.
>>
>> I'm getting wildly inconsistent results with iperf.
>>
>> io_uring-echo-server and rust_echo_bench gets me:
>> Benchmarking: 127.0.0.1:12345
>> 50 clients, running 512 bytes, 60 sec.
>>
>> Without alloc tagging:	120547 request/sec
>> With:			116748 request/sec
>>
>> https://github.com/frevib/io_uring-echo-server
>> https://github.com/haraldh/rust_echo_bench
>>
>> How's that look to you? Close enough? :)
> 
> Yes, this looks good (a bit too good).
> 
> I'm not that familiar with io_uring, Jens and Pavel should have a better idea
> what and how to run (I know they've worked around the kernel memory accounting
> because of the performance in the past, this is why I suspect it might be an
> issue here as well).

io_uring isn't alloc+free intensive on a per request basis anymore, it
would not be a good benchmark if the goal is to check for regressions in
that area.
Kent Overstreet Sept. 2, 2022, 7:48 p.m. UTC | #38
On Fri, Sep 02, 2022 at 06:02:12AM -0600, Jens Axboe wrote:
> On 9/1/22 7:04 PM, Roman Gushchin wrote:
> > On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
> >> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
> >>> I'd suggest to run something like iperf on a fast hardware. And maybe some
> >>> io_uring stuff too. These are two places which were historically most sensitive
> >>> to the (kernel) memory accounting speed.
> >>
> >> I'm getting wildly inconsistent results with iperf.
> >>
> >> io_uring-echo-server and rust_echo_bench gets me:
> >> Benchmarking: 127.0.0.1:12345
> >> 50 clients, running 512 bytes, 60 sec.
> >>
> >> Without alloc tagging:	120547 request/sec
> >> With:			116748 request/sec
> >>
> >> https://github.com/frevib/io_uring-echo-server
> >> https://github.com/haraldh/rust_echo_bench
> >>
> >> How's that look to you? Close enough? :)
> > 
> > Yes, this looks good (a bit too good).
> > 
> > I'm not that familiar with io_uring, Jens and Pavel should have a better idea
> > what and how to run (I know they've worked around the kernel memory accounting
> > because of the performance in the past, this is why I suspect it might be an
> > issue here as well).
> 
> io_uring isn't alloc+free intensive on a per request basis anymore, it
> would not be a good benchmark if the goal is to check for regressions in
> that area.

Good to know. The benchmark is still a TCP benchmark though, so still useful.

Matthew suggested
  while true; do echo 1 >/tmp/foo; rm /tmp/foo; done

I ran that on tmpfs, and the numbers with and without alloc tagging were
statistically equal - there was a fair amount of variation, it wasn't a super
controlled test, anywhere from 38-41 seconds with 100000 iterations (and alloc
tagging was some of the faster runs).

But with memcg off, it ran in 32-33 seconds. We're piggybacking on the same
mechanism memcg uses for stashing per-object pointers, so it looks like that's
the bigger cost.
Jens Axboe Sept. 2, 2022, 7:53 p.m. UTC | #39
On 9/2/22 1:48 PM, Kent Overstreet wrote:
> On Fri, Sep 02, 2022 at 06:02:12AM -0600, Jens Axboe wrote:
>> On 9/1/22 7:04 PM, Roman Gushchin wrote:
>>> On Thu, Sep 01, 2022 at 08:17:47PM -0400, Kent Overstreet wrote:
>>>> On Thu, Sep 01, 2022 at 03:53:57PM -0700, Roman Gushchin wrote:
>>>>> I'd suggest to run something like iperf on a fast hardware. And maybe some
>>>>> io_uring stuff too. These are two places which were historically most sensitive
>>>>> to the (kernel) memory accounting speed.
>>>>
>>>> I'm getting wildly inconsistent results with iperf.
>>>>
>>>> io_uring-echo-server and rust_echo_bench gets me:
>>>> Benchmarking: 127.0.0.1:12345
>>>> 50 clients, running 512 bytes, 60 sec.
>>>>
>>>> Without alloc tagging:	120547 request/sec
>>>> With:			116748 request/sec
>>>>
>>>> https://github.com/frevib/io_uring-echo-server
>>>> https://github.com/haraldh/rust_echo_bench
>>>>
>>>> How's that look to you? Close enough? :)
>>>
>>> Yes, this looks good (a bit too good).
>>>
>>> I'm not that familiar with io_uring, Jens and Pavel should have a better idea
>>> what and how to run (I know they've worked around the kernel memory accounting
>>> because of the performance in the past, this is why I suspect it might be an
>>> issue here as well).
>>
>> io_uring isn't alloc+free intensive on a per request basis anymore, it
>> would not be a good benchmark if the goal is to check for regressions in
>> that area.
> 
> Good to know. The benchmark is still a TCP benchmark though, so still useful.
> 
> Matthew suggested
>   while true; do echo 1 >/tmp/foo; rm /tmp/foo; done
> 
> I ran that on tmpfs, and the numbers with and without alloc tagging were
> statistically equal - there was a fair amount of variation, it wasn't a super
> controlled test, anywhere from 38-41 seconds with 100000 iterations (and alloc
> tagging was some of the faster runs).
> 
> But with memcg off, it ran in 32-33 seconds. We're piggybacking on the same
> mechanism memcg uses for stashing per-object pointers, so it looks like that's
> the bigger cost.

I've complained about memcg accounting before, the slowness of it is why
io_uring works around it by caching. Anything we account we try NOT to do
in the fast path because of it, the slowdown is considerable.

You care about efficiency now? I thought that was relegated to
irrelevant 10M IOPS cases.
Kent Overstreet Sept. 2, 2022, 8:05 p.m. UTC | #40
On Fri, Sep 02, 2022 at 01:53:53PM -0600, Jens Axboe wrote:
> I've complained about memcg accounting before, the slowness of it is why
> io_uring works around it by caching. Anything we account we try NOT to do
> in the fast path because of it, the slowdown is considerable.

I'm with you on that, it definitely raises an eyebrow.

> You care about efficiency now? I thought that was relegated to
> irrelevant 10M IOPS cases.

I always did, it's just not the only thing I care about.
Jens Axboe Sept. 2, 2022, 8:23 p.m. UTC | #41
On 9/2/22 2:05 PM, Kent Overstreet wrote:
> On Fri, Sep 02, 2022 at 01:53:53PM -0600, Jens Axboe wrote:
>> I've complained about memcg accounting before, the slowness of it is why
>> io_uring works around it by caching. Anything we account we try NOT to do
>> in the fast path because of it, the slowdown is considerable.
> 
> I'm with you on that, it definitely raises an eyebrow.
> 
>> You care about efficiency now? I thought that was relegated to
>> irrelevant 10M IOPS cases.
> 
> I always did, it's just not the only thing I care about.

It's not the only thing anyone cares about.
Suren Baghdasaryan Sept. 5, 2022, 1:32 a.m. UTC | #42
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > So I find Peter's question completely appropriate while your response to
> > > that not so much! Maybe ftrace is not the right tool for the intended
> > > job. Maybe there are other ways and it would be really great to show
> > > that those have been evaluated and they are not suitable for a), b) and
> > > c) reasons.
> >
> > That's fair.
> > For memory tracking I looked into using kmemleak and page_owner which
> > can't match the required functionality at an overhead acceptable for
> > production and pre-production testing environments.
>
> Being more specific would be really helpful. Especially when your cover
> letter suggests that you rely on page_owner/memcg metadata as well to
> match allocation and their freeing parts.
>
> > traces + BPF I
> > haven't evaluated myself but heard from other members of my team who
> > tried using that in production environment with poor results. I'll try
> > to get more specific information on that.
>
> That would be helpful as well.
>
> > > E.g. Oscar has been working on extending page_ext to track number of
> > > allocations for specific calltrace[1]. Is this 1:1 replacement? No! But
> > > it can help in environments where page_ext can be enabled and it is
> > > completely non-intrusive to the MM code.
> >
> > Thanks for pointing out this work. I'll need to review and maybe
> > profile it before making any claims.
> >
> > >
> > > If the page_ext overhead is not desirable/acceptable then I am sure
> > > there are other options. E.g. kprobes/LivePatching framework can hook
> > > into functions and alter their behavior. So why not use that for data
> > > collection? Has this been evaluated at all?
> >
> > I'm not sure how I can hook into say alloc_pages() to find out where
> > it was called from without capturing the call stack (which would
> > introduce an overhead at every allocation). Would love to discuss this
> > or other alternatives if they can be done with low enough overhead.
>
> Yes, tracking back the call trace would be really needed. The question
> is whether this is really prohibitively expensive. How much overhead are
> we talking about? There is no free lunch here, really.  You either have
> the overhead during runtime when the feature is used or on the source
> code level for all the future development (with a maze of macros and
> wrappers).

As promised, I profiled a simple code that repeatedly makes 10
allocations/frees in a loop and measured overheads of code tagging,
call stack capturing and tracing+BPF for page and slab allocations.
Summary:

Page allocations (overheads are compared to get_free_pages() duration):
6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
8.8% lookup_page_ext
1237% call stack capture
139% tracepoint with attached empty BPF program

Slab allocations (overheads are compared to __kmalloc() duration):
With CONFIG_MEMCG_KMEM=y
39% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
55% get_slab_tag_ref
3.9% __ksize
3027% call stack capture
397% tracepoint with attached empty BPF program

With CONFIG_MEMCG_KMEM=n
26% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
72% get_slab_tag_ref
7.4% __ksize
2789% call stack capture
345% tracepoint with attached empty BPF program

Details:
__get_free_pages() is used as the page allocation duration baseline
__kmalloc() is used as the slab allocation duration baseline
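
A minimal sketch of the kind of loop being profiled might look like the
module below. This is only an illustration: the real test code is not posted
in this thread, and everything except the standard kernel APIs
(__get_free_pages, free_pages, __kmalloc, kfree) is hypothetical. The
my__get_free_page name matches the wrapper visible in the profiles below;
the slab variant would do the same with __kmalloc()/kfree().

/* Sketch only: approximates "10 allocations/frees in a loop". */
#include <linux/module.h>
#include <linux/gfp.h>

static unsigned long iterations = 1000000;
module_param(iterations, ulong, 0444);

static noinline void my__get_free_page(void)
{
	unsigned long pages[10];
	int i;

	for (i = 0; i < 10; i++)
		pages[i] = __get_free_pages(GFP_KERNEL, 0);
	for (i = 0; i < 10; i++)
		free_pages(pages[i], 0);	/* free_pages() ignores a 0 address */
}

static int __init bench_init(void)
{
	unsigned long i;

	for (i = 0; i < iterations; i++)
		my__get_free_page();
	return 0;
}

static void __exit bench_exit(void) { }

module_init(bench_init);
module_exit(bench_exit);
MODULE_LICENSE("GPL");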

1. Profile with instrumented page allocator
|--50.13%--my__get_free_page
|          |
|          |--38.99%--_get_free_pages
|          |          |
|          |          |--34.75%--__alloc_pages
|          |          |          |
|          |          |          |--27.59%--get_page_from_freelist
|          |          |
|          |           --3.98%--_alloc_pages
|          |                     |
|          |                      --0.53%--policy_node
|          |
|          |--3.45%--lookup_page_ext
|          |
|          |--1.59%--__lazy_percpu_counter_add
|          |          |
|          |           --0.80%--pcpu_alloc
|          |                     memset_orig
|          |
|           --1.06%--__alloc_tag_add
|                     |
|                      --0.80%--__lazy_percpu_counter_add
|
|--35.28%--free_unref_page
|          |
|          |--23.08%--_raw_spin_unlock_irqrestore
|          |
|          |--2.39%--preempt_count_add
|          |          |
|          |           --0.80%--in_lock_functions
|          |
|          |--1.59%--free_pcp_prepare
|          |
|          |--1.33%--preempt_count_sub
|          |
|           --0.80%--check_preemption_disabled
|
|--4.24%--__free_pages
|
 --1.59%--free_pages


2. Profile with non-instrumented page allocator and call stack capturing
|--84.18%--my__get_free_page
|          |
|           --83.91%--stack_depot_capture_stack
|                     |
|                     |--77.99%--stack_trace_save
|                     |          |
|                     |           --77.53%--arch_stack_walk
|                     |                     |
|                     |                     |--37.17%--unwind_next_frame
|                     |                     |          |
|                     |                     |          |--8.44%--__orc_find
|                     |                     |          |
|                     |                     |--10.57%-stack_trace_consume_entry
|                     |                     |
|                     |                      --9.64%--unwind_get_return_address
|                     |
|                      --5.78%--__stack_depot_save
|
|--6.78%--__get_free_pages
|          |
|          |--5.85%--__alloc_pages
|          |          |
|          |           --3.86%--get_page_from_freelist
|          |                     |
|          |                      --1.33%--_raw_spin_unlock_irqrestore
|          |
|           --0.80%--alloc_pages
|
|--5.19%--free_unref_page
|          |
|          |--2.73%--_raw_spin_unlock_irqrestore
|          |
|           --0.60%--free_pcp_prepare
|
 --0.73%--__free_pages


3. Profile with non-instrumented page allocator and BPF attached to tracepoint
|--42.42%--my__get_free_page
|          |
|           --38.53%--perf_trace_kmem_alloc
|                     |
|                     |--25.76%--perf_trace_run_bpf_submit
|                     |          |
|                     |          |--21.86%--trace_call_bpf
|                     |          |          |
|                     |          |          |--4.76%--migrate_enable
|                     |          |          |
|                     |          |          |--4.55%--migrate_disable
|                     |          |          |
|                     |          |          |--3.03%--check_preemption_disabled
|                     |          |          |
|                     |          |          |--0.65%--__this_cpu_preempt_check
|                     |          |          |
|                     |          |           --0.65%--__rcu_read_unlock
|                     |          |
|                     |           --0.87%--check_preemption_disabled
|                     |
|                     |--8.01%--perf_trace_buf_alloc
|                     |          |
|                     |          |--3.68%--perf_swevent_get_recursion_context
|                     |          |          |
|                     |          |           --0.87%--check_preemption_disabled
|                     |          |
|                     |           --1.30%--check_preemption_disabled
|                     |
|                      --0.87%--check_preemption_disabled
|
|--27.71%--__get_free_pages
|          |
|          |--23.38%--__alloc_pages
|          |          |
|          |           --17.75%--get_page_from_freelist
|          |                     |
|          |                     |--8.66%--_raw_spin_unlock_irqrestore
|          |                     |          |
|          |                     |           --1.95%--preempt_count_sub
|          |                     |
|          |                      --1.08%--preempt_count_add
|          |
|           --4.33%--alloc_pages
|                     |
|                     |--0.87%--policy_node
|                     |
|                      --0.65%--policy_nodemask
|
|--15.37%--free_unref_page
|          |
|          |--6.71%--_raw_spin_unlock_irqrestore
|          |
|          |--1.52%--check_preemption_disabled
|          |
|          |--0.65%--free_pcp_prepare
|          |
|           --0.65%--preempt_count_add
|--4.98%--__free_pages


4. Profile with instrumented slab allocator CONFIG_MEMCG_KMEM=y
|--51.28%--my__get_free_page
|          |
|          |--21.79%--__kmalloc
|          |          |
|          |          |--3.42%--memcg_slab_post_alloc_hook
|          |          |
|          |          |--1.71%--kmalloc_slab
|          |          |
|          |           --0.85%--should_failslab
|          |
|          |--11.97%--get_slab_tag_ref
|          |
|          |--5.56%--__alloc_tag_add
|          |          |
|          |           --2.56%--__lazy_percpu_counter_add
|          |
|          |--2.99%--__lazy_percpu_counter_add
|          |
|           --0.85%--__ksize
|
 --35.90%--kfree
           |
           |--13.68%--get_slab_tag_ref
           |
           |--6.41%--__alloc_tag_sub
           |          |
           |           --4.70%--__lazy_percpu_counter_add
           |
            --2.14%--__ksize


5. Profile with non-instrumented slab allocator and call stack
capturing CONFIG_MEMCG_KMEM=y
|--91.50%--my__get_free_page
|          |
|           --91.13%--stack_depot_capture_stack
|                     |
|                     |--85.48%--stack_trace_save
|                     |          |
|                     |           --85.12%--arch_stack_walk
|                     |                     |
|                     |                     |--40.54%--unwind_next_frame
|                     |                     |
|                     |                     |--14.30%--__unwind_start
|                     |                     |
|                     |                     |--11.95%-unwind_get_return_address
|                     |                     |
|                     |                      --10.48%-stack_trace_consume_entry
|                     |
|                      --4.99%--__stack_depot_save
|                                |
|                                 --0.66%--filter_irq_stacks
|
|--3.01%--__kmalloc
|
|--2.05%--kfree

6. Profile with non-instrumented slab allocator and BPF attached to a
tracepoint CONFIG_MEMCG_KMEM=y
|--72.39%--__kmalloc
|          |
|          |--57.84%--perf_trace_kmem_alloc
|          |          |
|          |          |--38.06%--perf_trace_run_bpf_submit
|          |          |          |
|          |          |           --33.96%--trace_call_bpf
|          |          |                     |
|          |          |                     |--10.07%--migrate_disable
|          |          |                     |
|          |          |                     |--4.85%--migrate_enable
|          |          |                     |
|          |          |                     |--4.10%--check_preemption_disabled
|          |          |                     |
|          |          |                     |--1.87%--__rcu_read_unlock
|          |          |                     |
|          |          |                      --0.75%--__rcu_read_lock
|          |          |
|          |           --9.70%--perf_trace_buf_alloc
|          |                     |
|          |                     |--2.99%--perf_swevent_get_recursion_context
|          |                     |
|          |                     |--1.12%--check_preemption_disabled
|          |                     |
|          |                      --0.75%--debug_smp_processor_id
|          |
|          |--2.24%--kmalloc_slab
|          |
|          |--1.49%--memcg_slab_post_alloc_hook
|          |
|           --1.12%--__cond_resched
|
|--7.84%--kfree


7. Profile with instrumented slab allocator CONFIG_MEMCG_KMEM=n
|--49.39%--my__get_free_page
|          |
|          |--22.04%--__kmalloc
|          |          |
|          |          |--3.27%--kmalloc_slab
|          |          |
|          |           --0.82%--asm_sysvec_apic_timer_interrupt
|          |                     sysvec_apic_timer_interrupt
|          |                     __irq_exit_rcu
|          |                     __softirqentry_text_start
|          |
|          |--15.92%--get_slab_tag_ref
|          |
|          |--3.27%--__alloc_tag_add
|          |          |
|          |           --2.04%--__lazy_percpu_counter_add
|          |
|           --2.45%--__lazy_percpu_counter_add
|
|--35.51%--kfree
|          |
|          |--13.88%--get_slab_tag_ref
|          |
|          |--11.84%--__alloc_tag_sub
|          |          |
|          |           --5.31%--__lazy_percpu_counter_add
|          |
|           --1.63%--__ksize

8. Profile with non-instrumented slab allocator and call stack
capturing CONFIG_MEMCG_KMEM=n
|--91.70%--my__get_free_page
|          |
|           --91.48%--stack_depot_capture_stack
|                     |
|                     |--85.29%--stack_trace_save
|                     |          |
|                     |           --85.07%--arch_stack_walk
|                     |                     |
|                     |                     |--45.23%--unwind_next_frame
|                     |                     |
|                     |                     |--12.89%--__unwind_start
|                     |                     |
|                     |                     |--10.20%-unwind_get_return_address
|                     |                     |
|                     |                      --10.12%-stack_trace_consume_entry
|                     |
|                      --5.75%--__stack_depot_save
|                                |
|                                 --0.87%--filter_irq_stacks
|
|--3.28%--__kmalloc
|
 --1.89%--kfree

9. Profile with non-instrumented slab allocator and BPF attached to a
tracepoint CONFIG_MEMCG_KMEM=n
|--71.65%--__kmalloc
|          |
|          |--55.56%--perf_trace_kmem_alloc
|          |          |
|          |          |--38.31%--perf_trace_run_bpf_submit
|          |          |          |
|          |          |          |--31.80%--trace_call_bpf
|          |          |          |          |
|          |          |          |          |--9.96%--migrate_enable
|          |          |          |          |
|          |          |          |          |--4.98%--migrate_disable
|          |          |          |          |
|          |          |          |          |--1.92%--check_preemption_disabled
|          |          |          |          |
|          |          |          |          |--1.92%--__rcu_read_unlock
|          |          |          |          |
|          |          |          |           --1.15%--__rcu_read_lock
|          |          |          |
|          |          |           --0.77%--check_preemption_disabled
|          |          |
|          |           --11.11%--perf_trace_buf_alloc
|          |                     |
|          |                      --4.98%--perf_swevent_get_recursion_context
|          |                                |
|          |                                 --1.53%--check_preemption_disabled
|          |
|          |--2.68%--kmalloc_slab
|          |
|           --1.15%--__cond_resched
|
 --9.58%--kfree


>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Sept. 5, 2022, 8:12 a.m. UTC | #43
On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Yes, tracking back the call trace would be really needed. The question
> > is whether this is really prohibitively expensive. How much overhead are
> > we talking about? There is no free lunch here, really.  You either have
> > the overhead during runtime when the feature is used or on the source
> > code level for all the future development (with a maze of macros and
> > wrappers).
> 
> As promised, I profiled a simple code that repeatedly makes 10
> allocations/frees in a loop and measured overheads of code tagging,
> call stack capturing and tracing+BPF for page and slab allocations.
> Summary:
> 
> Page allocations (overheads are compared to get_free_pages() duration):
> 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> 8.8% lookup_page_ext
> 1237% call stack capture
> 139% tracepoint with attached empty BPF program

Yes, I am not surprised that the call stack capturing is really
expensive comparing to the allocator fast path (which is really highly
optimized and I suspect that with 10 allocation/free loop you mostly get
your memory from the pcp lists). Is this overhead still _that_ visible
for somehow less microoptimized workloads which have to take slow paths
as well?

Also what kind of stack unwinder is configured (I guess ORC)? This is
not my area but from what I remember the unwinder overhead varies
between ORC and FP.

And just to make it clear. I do realize that an overhead from the stack
unwinding is unavoidable. And code tagging would logically have lower
overhead as it performs much less work. But the main point is whether
our existing stack unwinding approach is really prohibitively expensive
to be used for debugging purposes on production systems. I might
misremember but I recall people having bigger concerns with page_owner
memory footprint than the actual stack unwinder overhead.
Michal Hocko Sept. 5, 2022, 8:49 a.m. UTC | #44
On Thu 01-09-22 16:15:02, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote:
> > kmemleak is known to be slow and it's even documented [1], so I hope I
> > can skip that part. For page_owner to provide the comparable
> > information we would have to capture the call stacks for all page
> > allocations unlike our proposal which allows to do that selectively
> > for specific call sites. I'll post the overhead numbers of call stack
> > capturing once I'm finished with profiling the latest code, hopefully
> > sometime tomorrow, in the worst case after the long weekend.
> 
> To expand on this further: we're stashing a pointer to the alloc_tag, which is
> defined at the allocation callsite. That's how we're able to decrement the
> proper counter on free, and why this beats any tracing based approach - with
> tracing you'd instead have to correlate allocate/free events. Ouch.
> 
> > > Yes, tracking back the call trace would be really needed. The question
> > > is whether this is really prohibitively expensive. How much overhead are
> > > we talking about? There is no free lunch here, really.  You either have
> > > the overhead during runtime when the feature is used or on the source
> > > code level for all the future development (with a maze of macros and
> > > wrappers).
> 
> The full call stack is really not what you want in most applications - that's
> what people think they want at first, and why page_owner works the way it does,
> but it turns out that then combining all the different but related stack traces
> _sucks_ (so why were you saving them in the first place?), and then you have to
> do a separate memory allocate for each stack track, which destroys performance.

I do agree that the full stack trace is likely not what you need. But
the portion of the stack that you need is not really clear because the
relevant part might be on a different level of the calltrace depending
on the allocation site. Take this as an example:
{traverse, seq_read_iter, single_open_size}->seq_buf_alloc -> kvmalloc -> kmalloc

This whole part of the stack is not really all that interesting, and you
would have to place the tracking pretty high up at the API layer to catch
something useful. And please remember that the seq_file interface is
heavily used throughout the kernel. I wouldn't suspect seq_file itself to
be buggy, that is well-exercised code, but its users can botch things and
that is where the leak would happen. There are many other examples like
that where the allocation is done at a lib/infrastructure layer (sysfs
framework, mempools, network pool allocators and whatnot). We do care
about those users, really. Ad-hoc pool allocators built on top of the
core MM allocators are not really uncommon. And I am really skeptical
that we want to add all the tagging source-code-level changes to each and
every one of them.

This is really my main concern about this whole work. Not only does it add a
considerable maintenance burden to the core MM, because it layers complexity
on top of our existing allocators, but it would also need to spread beyond
MM to be useful, because it is usually outside of MM where leaks happen.
Marco Elver Sept. 5, 2022, 8:58 a.m. UTC | #45
On Mon, 5 Sept 2022 at 10:12, Michal Hocko <mhocko@suse.com> wrote:
> On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Yes, tracking back the call trace would be really needed. The question
> > > is whether this is really prohibitively expensive. How much overhead are
> > > we talking about? There is no free lunch here, really.  You either have
> > > the overhead during runtime when the feature is used or on the source
> > > code level for all the future development (with a maze of macros and
> > > wrappers).
> >
> > As promised, I profiled a simple code that repeatedly makes 10
> > allocations/frees in a loop and measured overheads of code tagging,
> > call stack capturing and tracing+BPF for page and slab allocations.
> > Summary:
> >
> > Page allocations (overheads are compared to get_free_pages() duration):
> > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > 8.8% lookup_page_ext
> > 1237% call stack capture
> > 139% tracepoint with attached empty BPF program
>
> Yes, I am not surprised that the call stack capturing is really
> expensive comparing to the allocator fast path (which is really highly
> optimized and I suspect that with 10 allocation/free loop you mostly get
> your memory from the pcp lists). Is this overhead still _that_ visible
> for somehow less microoptimized workloads which have to take slow paths
> as well?
>
> Also what kind of stack unwinder is configured (I guess ORC)? This is
> not my area but from what I remember the unwinder overhead varies
> between ORC and FP.
>
> And just to make it clear. I do realize that an overhead from the stack
> unwinding is unavoidable. And code tagging would logically have lower
> overhead as it performs much less work. But the main point is whether
> our existing stack unwiding approach is really prohibitively expensive
> to be used for debugging purposes on production systems. I might
> misremember but I recall people having bigger concerns with page_owner
> memory footprint than the actual stack unwinder overhead.

This is just to point out that we've also been looking at cheaper
collection of the stack trace (for KASAN and other sanitizers). The
cheapest way to unwind the stack would be a system with "shadow call
stack" enabled. With compiler support it's available on arm64, see
CONFIG_SHADOW_CALL_STACK. For x86 the hope is that at one point the
kernel will support CET, which newer Intel and AMD CPUs support.
Collecting the call stack would then be a simple memcpy.
Steven Rostedt Sept. 5, 2022, 3:07 p.m. UTC | #46
On Sun, 4 Sep 2022 18:32:58 -0700
Suren Baghdasaryan <surenb@google.com> wrote:

> Page allocations (overheads are compared to get_free_pages() duration):
> 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> 8.8% lookup_page_ext
> 1237% call stack capture
> 139% tracepoint with attached empty BPF program

Have you tried tracepoint with custom callback?

static void my_callback(void *data, unsigned long call_site,
			const void *ptr, struct kmem_cache *s,
			size_t bytes_req, size_t bytes_alloc,
			gfp_t gfp_flags)
{
	struct my_data_struct *my_data = data;

	{ do whatever }
}

[..]
	register_trace_kmem_alloc(my_callback, my_data);

Now the my_callback function will be called directly every time the
kmem_alloc tracepoint is hit.

This avoids that perf and BPF overhead.
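
A minimal, self-contained version of that could look like the sketch below.
This assumes the 6.0-era kmem_alloc tracepoint prototype shown above (it has
changed in later kernels) and that the tracepoint is reachable from your code
(built-in, or exported to modules); the counter and function names are just
illustrative.

#include <linux/module.h>
#include <linux/tracepoint.h>
#include <linux/atomic.h>
#include <trace/events/kmem.h>

static atomic64_t nr_kmem_allocs;

/* Called directly from the kmem_alloc tracepoint, no perf/BPF in between */
static void count_kmem_alloc(void *data, unsigned long call_site,
			     const void *ptr, struct kmem_cache *s,
			     size_t bytes_req, size_t bytes_alloc,
			     gfp_t gfp_flags)
{
	atomic64_inc(&nr_kmem_allocs);
}

static int __init hook_init(void)
{
	return register_trace_kmem_alloc(count_kmem_alloc, NULL);
}

static void __exit hook_exit(void)
{
	unregister_trace_kmem_alloc(count_kmem_alloc, NULL);
	tracepoint_synchronize_unregister();
	pr_info("kmem_alloc fired %lld times\n",
		(long long)atomic64_read(&nr_kmem_allocs));
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");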

-- Steve
Suren Baghdasaryan Sept. 5, 2022, 6:03 p.m. UTC | #47
On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Yes, tracking back the call trace would be really needed. The question
> > > is whether this is really prohibitively expensive. How much overhead are
> > > we talking about? There is no free lunch here, really.  You either have
> > > the overhead during runtime when the feature is used or on the source
> > > code level for all the future development (with a maze of macros and
> > > wrappers).
> >
> > As promised, I profiled a simple code that repeatedly makes 10
> > allocations/frees in a loop and measured overheads of code tagging,
> > call stack capturing and tracing+BPF for page and slab allocations.
> > Summary:
> >
> > Page allocations (overheads are compared to get_free_pages() duration):
> > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > 8.8% lookup_page_ext
> > 1237% call stack capture
> > 139% tracepoint with attached empty BPF program
>
> Yes, I am not surprised that the call stack capturing is really
> expensive comparing to the allocator fast path (which is really highly
> optimized and I suspect that with 10 allocation/free loop you mostly get
> your memory from the pcp lists). Is this overhead still _that_ visible
> for somehow less microoptimized workloads which have to take slow paths
> as well?

Correct, it's a comparison with the allocation fast path, so in a
sense it represents the worst-case scenario. However, at the same time the
measurements are fair because they measure the overheads against the
same meaningful baseline, and therefore can be used for comparison.

>
> Also what kind of stack unwinder is configured (I guess ORC)? This is
> not my area but from what I remember the unwinder overhead varies
> between ORC and FP.

I used whatever is default and didn't try other mechanisms. Don't
think the difference would be orders of magnitude better though.

>
> And just to make it clear. I do realize that an overhead from the stack
> unwinding is unavoidable. And code tagging would logically have lower
> overhead as it performs much less work. But the main point is whether
> our existing stack unwiding approach is really prohibitively expensive
> to be used for debugging purposes on production systems. I might
> misremember but I recall people having bigger concerns with page_owner
> memory footprint than the actual stack unwinder overhead.

That's one of those questions which are very difficult to answer (if
even possible) because that would depend on the use scenario. If the
workload allocates frequently then adding the overhead will likely
affect it, otherwise might not be even noticeable. In general, in
pre-production testing we try to minimize the difference in
performance and memory profiles between the software we are testing
and the production one. From that point of view, the smaller the
overhead, the better. I know it's kinda obvious but unfortunately I
have no better answer to that question.

For the memory overhead, in my early internal proposal with assumption
of 10000 instrumented allocation call sites, I've made some
calculations for an 8GB 8-core system (quite typical for Android) and
ended up with the following:

                        per-cpu counters   atomic counters
page_ext references     16MB               16MB
slab object references  10.5MB             10.5MB
alloc_tags              900KB              312KB
Total memory overhead   27.4MB             26.8MB

so, about 0.34% of the total memory. Our implementation has changed
since then and the number might not be completely correct but it
should be in the ballpark.
I just checked the number of instrumented calls that we currently have
in the 6.0-rc3 built with defconfig and it's 165 page allocation and
2684 slab allocation sites. I readily accept that we are probably
missing some allocations and additional modules can also contribute to
these numbers but my guess it's still less than 10000 that I used in
my calculations.
I don't claim that 0.34% overhead is low enough to be always
acceptable, just posting the numbers to provide some reference points.

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Sept. 5, 2022, 6:07 p.m. UTC | #48
On Mon, Sep 5, 2022 at 1:58 AM Marco Elver <elver@google.com> wrote:
>
> On Mon, 5 Sept 2022 at 10:12, Michal Hocko <mhocko@suse.com> wrote:
> > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > Yes, tracking back the call trace would be really needed. The question
> > > > is whether this is really prohibitively expensive. How much overhead are
> > > > we talking about? There is no free lunch here, really.  You either have
> > > > the overhead during runtime when the feature is used or on the source
> > > > code level for all the future development (with a maze of macros and
> > > > wrappers).
> > >
> > > As promised, I profiled a simple code that repeatedly makes 10
> > > allocations/frees in a loop and measured overheads of code tagging,
> > > call stack capturing and tracing+BPF for page and slab allocations.
> > > Summary:
> > >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8% lookup_page_ext
> > > 1237% call stack capture
> > > 139% tracepoint with attached empty BPF program
> >
> > Yes, I am not surprised that the call stack capturing is really
> > expensive comparing to the allocator fast path (which is really highly
> > optimized and I suspect that with 10 allocation/free loop you mostly get
> > your memory from the pcp lists). Is this overhead still _that_ visible
> > for somehow less microoptimized workloads which have to take slow paths
> > as well?
> >
> > Also what kind of stack unwinder is configured (I guess ORC)? This is
> > not my area but from what I remember the unwinder overhead varies
> > between ORC and FP.
> >
> > And just to make it clear. I do realize that an overhead from the stack
> > unwinding is unavoidable. And code tagging would logically have lower
> > overhead as it performs much less work. But the main point is whether
> > our existing stack unwiding approach is really prohibitively expensive
> > to be used for debugging purposes on production systems. I might
> > misremember but I recall people having bigger concerns with page_owner
> > memory footprint than the actual stack unwinder overhead.
>
> This is just to point out that we've also been looking at cheaper
> collection of the stack trace (for KASAN and other sanitizers). The
> cheapest way to unwind the stack would be a system with "shadow call
> stack" enabled. With compiler support it's available on arm64, see
> CONFIG_SHADOW_CALL_STACK. For x86 the hope is that at one point the
> kernel will support CET, which newer Intel and AMD CPUs support.
> Collecting the call stack would then be a simple memcpy.

Thanks for the note Marco! I'll check out the CONFIG_SHADOW_CALL_STACK
on Android.
Suren Baghdasaryan Sept. 5, 2022, 6:08 p.m. UTC | #49
On Mon, Sep 5, 2022 at 8:06 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Sun, 4 Sep 2022 18:32:58 -0700
> Suren Baghdasaryan <surenb@google.com> wrote:
>
> > Page allocations (overheads are compared to get_free_pages() duration):
> > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > 8.8% lookup_page_ext
> > 1237% call stack capture
> > 139% tracepoint with attached empty BPF program
>
> Have you tried tracepoint with custom callback?
>
> static void my_callback(void *data, unsigned long call_site,
>                         const void *ptr, struct kmem_cache *s,
>                         size_t bytes_req, size_t bytes_alloc,
>                         gfp_t gfp_flags)
> {
>         struct my_data_struct *my_data = data;
>
>         { do whatever }
> }
>
> [..]
>         register_trace_kmem_alloc(my_callback, my_data);
>
> Now the my_callback function will be called directly every time the
> kmem_alloc tracepoint is hit.
>
> This avoids that perf and BPF overhead.

Haven't tried that yet but will do. Thanks for the reference code!

>
> -- Steve
Nadav Amit Sept. 5, 2022, 6:44 p.m. UTC | #50
On Aug 31, 2022, at 3:19 AM, Mel Gorman <mgorman@suse.de> wrote:

> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
>> On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
>>> On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
>>>> ===========================
>>>> Code tagging framework
>>>> ===========================
>>>> Code tag is a structure identifying a specific location in the source code
>>>> which is generated at compile time and can be embedded in an application-
>>>> specific structure. Several applications of code tagging are included in
>>>> this RFC, such as memory allocation tracking, dynamic fault injection,
>>>> latency tracking and improved error code reporting.
>>>> Basically, it takes the old trick of "define a special elf section for
>>>> objects of a given type so that we can iterate over them at runtime" and
>>>> creates a proper library for it.
>>> 
>>> I might be super dense this morning, but what!? I've skimmed through the
>>> set and I don't think I get it.
>>> 
>>> What does this provide that ftrace/kprobes don't already allow?
>> 
>> You're kidding, right?
> 
> It's a valid question. From the description, it main addition that would
> be hard to do with ftrace or probes is catching where an error code is
> returned. A secondary addition would be catching all historical state and
> not just state since the tracing started.
> 
> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to setup tracing and tracking. Would it ever be enabled
> in production? Would a distribution ever enable this? If it's enabled, any
> overhead cannot be disabled/enabled at run or boot time so anyone enabling
> this would carry the cost without never necessarily consuming the data.
> 
> It might be an ease-of-use thing. Gathering the information from traces
> is tricky and would need combining multiple different elements and that
> is development effort but not impossible.
> 
> Whatever asking for an explanation as to why equivalent functionality
> cannot not be created from ftrace/kprobe/eBPF/whatever is reasonable.

I would note that I have a solution in the making (which pretty much works)
for this matter, and it does not require any kernel changes. It produces a
call stack that leads to the code that led to the syscall failure.

The way it works is by using seccomp to trap syscall failures, and then
setting ftrace function filters and kprobes on conditional branches,
indirect branch targets and function returns.

Using symbolic execution, backtracking is performed and the condition that
led to the failure is then pinpointed.

I hope to share the code soon.
Steven Rostedt Sept. 5, 2022, 7:16 p.m. UTC | #51
On Mon, 5 Sep 2022 11:44:55 -0700
Nadav Amit <nadav.amit@gmail.com> wrote:

> I would note that I have a solution in the making (which pretty much works)
> for this matter, and does not require any kernel changes. It produces a
> call stack that leads to the code that lead to syscall failure.
> 
> The way it works is by using seccomp to trap syscall failures, and then
> setting ftrace function filters and kprobes on conditional branches,
> indirect branch targets and function returns.

Ooh nifty!

> 
> Using symbolic execution, backtracking is performed and the condition that
> lead to the failure is then pin-pointed.
> 
> I hope to share the code soon.

Looking forward to it.

-- Steve
Kent Overstreet Sept. 5, 2022, 8:42 p.m. UTC | #52
On Mon, Sep 05, 2022 at 11:08:21AM -0700, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 8:06 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Sun, 4 Sep 2022 18:32:58 -0700
> > Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8% lookup_page_ext
> > > 1237% call stack capture
> > > 139% tracepoint with attached empty BPF program
> >
> > Have you tried tracepoint with custom callback?
> >
> > static void my_callback(void *data, unsigned long call_site,
> >                         const void *ptr, struct kmem_cache *s,
> >                         size_t bytes_req, size_t bytes_alloc,
> >                         gfp_t gfp_flags)
> > {
> >         struct my_data_struct *my_data = data;
> >
> >         { do whatever }
> > }
> >
> > [..]
> >         register_trace_kmem_alloc(my_callback, my_data);
> >
> > Now the my_callback function will be called directly every time the
> > kmem_alloc tracepoint is hit.
> >
> > This avoids that perf and BPF overhead.
> 
> Haven't tried that yet but will do. Thanks for the reference code!

Is it really worth the effort of benchmarking tracing API overhead here?

The main cost of a tracing-based approach is going to be the data structure
for remembering outstanding allocations so that free events can be matched to
the appropriate callsite. Regardless of whether it's done with BPF or by
attaching to the tracepoints directly, that's going to be the main overhead.
Steven Rostedt Sept. 5, 2022, 10:16 p.m. UTC | #53
On Mon, 5 Sep 2022 16:42:29 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> > Haven't tried that yet but will do. Thanks for the reference code!  
> 
> Is it really worth the effort of benchmarking tracing API overhead here?
> 
> The main cost of a tracing based approach is going to to be the data structure
> for remembering outstanding allocations so that free events can be matched to
> the appropriate callsite. Regardless of whether it's done with BFP or by
> attaching to the tracepoints directly, that's going to be the main overhead.

The point I was making here is that you do not need your own hooking
mechanism. You can get the information directly by attaching to the
tracepoint.

> > static void my_callback(void *data, unsigned long call_site,
> >                         const void *ptr, struct kmem_cache *s,
> >                         size_t bytes_req, size_t bytes_alloc,
> >                         gfp_t gfp_flags)
> > {
> >         struct my_data_struct *my_data = data;
> >
> >         { do whatever }
> > }

The "do whatever" is anything you want to do.

Or is the data structure you create with this approach going to be too much
overhead? How hard is it for a hash or binary search lookup?
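
To make that concrete, a hedged sketch of such a lookup (purely illustrative,
all names here are made up) could be a pointer-keyed hash table that the
alloc/free callbacks update. A real version would also need a recursion guard,
since the kmalloc()/kfree() in these helpers would themselves hit the
allocation tracepoints:

#include <linux/hashtable.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct alloc_rec {
	struct hlist_node node;
	const void *ptr;
	unsigned long call_site;
	size_t bytes;
};

static DEFINE_HASHTABLE(alloc_map, 12);		/* 4096 buckets */
static DEFINE_SPINLOCK(alloc_map_lock);

static void record_alloc(unsigned long call_site, const void *ptr, size_t bytes)
{
	struct alloc_rec *rec = kmalloc(sizeof(*rec), GFP_ATOMIC);
	unsigned long flags;

	if (!rec)
		return;
	rec->ptr = ptr;
	rec->call_site = call_site;
	rec->bytes = bytes;
	spin_lock_irqsave(&alloc_map_lock, flags);
	hash_add(alloc_map, &rec->node, (unsigned long)ptr);
	spin_unlock_irqrestore(&alloc_map_lock, flags);
}

static void record_free(const void *ptr)
{
	struct alloc_rec *rec, *found = NULL;
	unsigned long flags;

	spin_lock_irqsave(&alloc_map_lock, flags);
	hash_for_each_possible(alloc_map, rec, node, (unsigned long)ptr) {
		if (rec->ptr == ptr) {
			hash_del(&rec->node);
			found = rec;
			break;
		}
	}
	spin_unlock_irqrestore(&alloc_map_lock, flags);

	if (found) {
		/* account found->bytes back to found->call_site here */
		kfree(found);
	}
}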


-- Steve
Kent Overstreet Sept. 5, 2022, 11:46 p.m. UTC | #54
On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> This is really my main concern about this whole work. Not only it adds a
> considerable maintenance burden to the core MM because

[citation needed]

> it adds on top of
> our existing allocator layers complexity but it would need to spread beyond
> MM to be useful because it is usually outside of MM where leaks happen.

If you want the tracking to happen at a different level of the call stack, just
call _kmalloc() directly and call alloc_tag_add()/sub() yourself.
Kent Overstreet Sept. 5, 2022, 11:50 p.m. UTC | #55
On Mon, Sep 05, 2022 at 06:16:50PM -0400, Steven Rostedt wrote:
> On Mon, 5 Sep 2022 16:42:29 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > > Haven't tried that yet but will do. Thanks for the reference code!  
> > 
> > Is it really worth the effort of benchmarking tracing API overhead here?
> > 
> > The main cost of a tracing based approach is going to to be the data structure
> > for remembering outstanding allocations so that free events can be matched to
> > the appropriate callsite. Regardless of whether it's done with BFP or by
> > attaching to the tracepoints directly, that's going to be the main overhead.
> 
> The point I was making here is that you do not need your own hooking
> mechanism. You can get the information directly by attaching to the
> tracepoint.
> 
> > > static void my_callback(void *data, unsigned long call_site,
> > >                         const void *ptr, struct kmem_cache *s,
> > >                         size_t bytes_req, size_t bytes_alloc,
> > >                         gfp_t gfp_flags)
> > > {
> > >         struct my_data_struct *my_data = data;
> > >
> > >         { do whatever }
> > > }
> 
> The "do whatever" is anything you want to do.
> 
> Or is the data structure you create with this approach going to be too much
> overhead? How hard is it for a hash or binary search lookup?

If you don't think it's hard, go ahead and show us.
Michal Hocko Sept. 6, 2022, 7:23 a.m. UTC | #56
On Mon 05-09-22 19:46:49, Kent Overstreet wrote:
> On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> > This is really my main concern about this whole work. Not only it adds a
> > considerable maintenance burden to the core MM because
> 
> [citation needed]

I thought this was clear from the email content (the part you haven't
quoted here). But let me be explicit one more time for you.

I hope we can agree that in order for this kind of tracking to be useful
you need to cover _callers_ of the allocator or in the ideal world
the users/owner of the tracked memory (the later is sometimes much
harder/impossible to track when the memory is handed over from one peer
to another).

It is not particularly useful IMO to see that a large portion of the
memory has been allocated by say vmalloc or kvmalloc, right?  How
much does it really tell you that a lot of memory has been allocated
by kvmalloc or vmalloc? Yet, neither of the two is handled by the
proposed tracking and it would require additional code to be added and
_maintained_ to cover them. But that would still be far from complete;
we have the bulk allocator, mempools, etc.

If that was not enough, some of those allocators are used by library code
like seq_file, networking pools, the module loader and whatnot. So this
grows and effectively doubles the API space for many allocators, as they
need both the normal API and one which can pass the tracking context
down the path to prevent double tracking. Right?

This in my book is a considerable maintenance burden. And especially for
the MM subsystem this means an additional burden, because we have very
rich allocator APIs.

You are absolutely right that processing stack traces is a PITA, but it
allows you to see the actual callers irrespective of how many layers of
indirection or library code are involved.

> > it adds on top of
> > our existing allocator layers complexity but it would need to spread beyond
> > MM to be useful because it is usually outside of MM where leaks happen.
> 
> If you want the tracking to happen at a different level of the call stack, just
> call _kmalloc() directly and call alloc_tag_add()/sub() yourself.

As pointed out above, this just scales poorly and adds to the API space. Not
to mention that direct use of alloc_tag_add can just confuse layers
below which rely on the same thing.

Hope this makes it clearer.
Michal Hocko Sept. 6, 2022, 8:01 a.m. UTC | #57
On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > Yes, tracking back the call trace would be really needed. The question
> > > > is whether this is really prohibitively expensive. How much overhead are
> > > > we talking about? There is no free lunch here, really.  You either have
> > > > the overhead during runtime when the feature is used or on the source
> > > > code level for all the future development (with a maze of macros and
> > > > wrappers).
> > >
> > > As promised, I profiled a simple code that repeatedly makes 10
> > > allocations/frees in a loop and measured overheads of code tagging,
> > > call stack capturing and tracing+BPF for page and slab allocations.
> > > Summary:
> > >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8% lookup_page_ext
> > > 1237% call stack capture
> > > 139% tracepoint with attached empty BPF program
> >
> > Yes, I am not surprised that the call stack capturing is really
> > expensive comparing to the allocator fast path (which is really highly
> > optimized and I suspect that with 10 allocation/free loop you mostly get
> > your memory from the pcp lists). Is this overhead still _that_ visible
> > for somehow less microoptimized workloads which have to take slow paths
> > as well?
> 
> Correct, it's a comparison with the allocation fast path, so in a
> sense represents the worst case scenario. However at the same time the
> measurements are fair because they measure the overheads against the
> same meaningful baseline, therefore can be used for comparison.

Yes, I am not saying it is an unfair comparison. It is just not a
particularly practical one for real-life situations. So I am not sure
you can draw many conclusions from that. Or let me put it differently.
There is no real point comparing the code tagging and stack unwinding
approaches, because the latter is simply more complex as it collects
more state. The main question is whether that additional state
collection is too expensive to be practically used.
 
> > Also what kind of stack unwinder is configured (I guess ORC)? This is
> > not my area but from what I remember the unwinder overhead varies
> > between ORC and FP.
> 
> I used whatever is default and didn't try other mechanisms. Don't
> think the difference would be orders of magnitude better though.
> 
> >
> > And just to make it clear. I do realize that an overhead from the stack
> > unwinding is unavoidable. And code tagging would logically have lower
> > overhead as it performs much less work. But the main point is whether
> > our existing stack unwiding approach is really prohibitively expensive
> > to be used for debugging purposes on production systems. I might
> > misremember but I recall people having bigger concerns with page_owner
> > memory footprint than the actual stack unwinder overhead.
> 
> That's one of those questions which are very difficult to answer (if
> even possible) because that would depend on the use scenario. If the
> workload allocates frequently then adding the overhead will likely
> affect it, otherwise might not be even noticeable. In general, in
> pre-production testing we try to minimize the difference in
> performance and memory profiles between the software we are testing
> and the production one. From that point of view, the smaller the
> overhead, the better. I know it's kinda obvious but unfortunately I
> have no better answer to that question.

This is clear but it doesn't really tell whether the existing tooling is
unusable for _your_ or any specific scenarios. Because when we are
talking about adding quite a lot of code and making our allocator APIs
more complicated to track the state, we should carefully weigh the
benefit and the cost. As I replied in the other email, I am really skeptical
that this patchset is at the final stage, and the more allocators get covered
the more code we have to maintain. So there must be a very strong reason
to add it.

> For the memory overhead, in my early internal proposal with assumption
> of 10000 instrumented allocation call sites, I've made some
> calculations for an 8GB 8-core system (quite typical for Android) and
> ended up with the following:
> 
>                                     per-cpu counters      atomic counters
> page_ext references     16MB                      16MB
> slab object references   10.5MB                   10.5MB
> alloc_tags                      900KB                    312KB
> Total memory overhead 27.4MB                  26.8MB

I do not really think this is all that interesting, because the major
memory overhead contributors (page_ext and objcg) are going to be there
with other approaches that want to match alloc and free as well, since
that clearly requires storing a reference to the allocated objects somewhere.

> so, about 0.34% of the total memory. Our implementation has changed
> since then and the number might not be completely correct but it
> should be in the ballpark.
> I just checked the number of instrumented calls that we currently have
> in the 6.0-rc3 built with defconfig and it's 165 page allocation and
> 2684 slab allocation sites. I readily accept that we are probably
> missing some allocations and additional modules can also contribute to
> these numbers but my guess it's still less than 10000 that I used in
> my calculations.

Yes, in the current implementation you are missing most indirect users
of the page allocator, as stated elsewhere, so the usefulness can be
really limited. Better coverage will not increase the memory
consumption much, but it will add an additional maintenance burden that
will scale with different use cases.
Suren Baghdasaryan Sept. 6, 2022, 3:35 p.m. UTC | #58
On Tue, Sep 6, 2022 at 1:01 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote:
> > On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > Yes, tracking back the call trace would be really needed. The question
> > > > > is whether this is really prohibitively expensive. How much overhead are
> > > > > we talking about? There is no free lunch here, really.  You either have
> > > > > the overhead during runtime when the feature is used or on the source
> > > > > code level for all the future development (with a maze of macros and
> > > > > wrappers).
> > > >
> > > > As promised, I profiled a simple code that repeatedly makes 10
> > > > allocations/frees in a loop and measured overheads of code tagging,
> > > > call stack capturing and tracing+BPF for page and slab allocations.
> > > > Summary:
> > > >
> > > > Page allocations (overheads are compared to get_free_pages() duration):
> > > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > > 8.8% lookup_page_ext
> > > > 1237% call stack capture
> > > > 139% tracepoint with attached empty BPF program
> > >
> > > Yes, I am not surprised that the call stack capturing is really
> > > expensive comparing to the allocator fast path (which is really highly
> > > optimized and I suspect that with 10 allocation/free loop you mostly get
> > > your memory from the pcp lists). Is this overhead still _that_ visible
> > > for somehow less microoptimized workloads which have to take slow paths
> > > as well?
> >
> > Correct, it's a comparison with the allocation fast path, so in a
> > sense represents the worst case scenario. However at the same time the
> > measurements are fair because they measure the overheads against the
> > same meaningful baseline, therefore can be used for comparison.
>
> Yes, I am not saying it is an unfair comparision. It is just not a
> particularly practical one for real life situations. So I am not sure
> you can draw many conclusions from that. Or let me put it differently.
> There is not real point comparing the code tagging and stack unwiding
> approaches because the later is simply more complex because it collects
> more state. The main question is whether that additional state
> collection is too expensive to be practically used.

You asked me to provide the numbers in one of your replies, that's what I did.

>
> > > Also what kind of stack unwinder is configured (I guess ORC)? This is
> > > not my area but from what I remember the unwinder overhead varies
> > > between ORC and FP.
> >
> > I used whatever is default and didn't try other mechanisms. Don't
> > think the difference would be orders of magnitude better though.
> >
> > >
> > > And just to make it clear. I do realize that an overhead from the stack
> > > unwinding is unavoidable. And code tagging would logically have lower
> > > overhead as it performs much less work. But the main point is whether
> > > our existing stack unwiding approach is really prohibitively expensive
> > > to be used for debugging purposes on production systems. I might
> > > misremember but I recall people having bigger concerns with page_owner
> > > memory footprint than the actual stack unwinder overhead.
> >
> > That's one of those questions which are very difficult to answer (if
> > even possible) because that would depend on the use scenario. If the
> > workload allocates frequently then adding the overhead will likely
> > affect it, otherwise might not be even noticeable. In general, in
> > pre-production testing we try to minimize the difference in
> > performance and memory profiles between the software we are testing
> > and the production one. From that point of view, the smaller the
> > overhead, the better. I know it's kinda obvious but unfortunately I
> > have no better answer to that question.
>
> This is clear but it doesn't really tell whether the existing tooling is
> unusable for _your_ or any specific scenarios. Because when we are
> talking about adding quite a lot of code and make our allocators APIs
> more complicated to track the state then we should carefully weigh the
> benefit and the cost. As replied to other email I am really skeptical
> this patchset is at the final stage and the more allocators get covered
> the more code we have to maintain. So there must be a very strong reason
> to add it.

The patchset is quite complete at this point. Instrumenting new
allocators takes 3 lines of code; see how the kmalloc_hooks macro is used
in https://lore.kernel.org/all/20220830214919.53220-17-surenb@google.com/

>
> > For the memory overhead, in my early internal proposal with assumption
> > of 10000 instrumented allocation call sites, I've made some
> > calculations for an 8GB 8-core system (quite typical for Android) and
> > ended up with the following:
> >
> >                                     per-cpu counters      atomic counters
> > page_ext references     16MB                      16MB
> > slab object references   10.5MB                   10.5MB
> > alloc_tags                      900KB                    312KB
> > Total memory overhead 27.4MB                  26.8MB
>
> I do not really think this is all that interesting because the major
> memory overhead contributors (page_ext and objcg are going to be there
> with other approaches that want to match alloc and free as that clearly
> requires to store the allocator objects somewhere).

You mentioned that memory consumption in the page_owner approach was
the more important overhead, so I provided the numbers for that part of
the discussion.

>
> > so, about 0.34% of the total memory. Our implementation has changed
> > since then and the number might not be completely correct but it
> > should be in the ballpark.
> > I just checked the number of instrumented calls that we currently have
> > in the 6.0-rc3 built with defconfig and it's 165 page allocation and
> > 2684 slab allocation sites. I readily accept that we are probably
> > missing some allocations and additional modules can also contribute to
> > these numbers but my guess it's still less than 10000 that I used in
> > my calculations.
>
> yes, in the current implementation you are missing most indirect users
> of the page allocator as stated elsewhere so the usefulness can be
> really limited. A better coverege will not increase the memory
> consumption much but it will add an additional maintenance burden that
> will scale with different usecases.

Your comments in the last two emails about needing stack tracing
and covering indirect users of the allocators make me think that you
missed my reply here:
https://lore.kernel.org/all/CAJuCfpGZ==v0HGWBzZzHTgbo4B_ZBe6V6U4T_788LVWj8HhCRQ@mail.gmail.com/.
I messed up the formatting but hopefully it's still readable. The
idea of having two-stage tracking - the first one very cheap and the
second one more in-depth - should, I think, address your concerns about
indirect users.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs
Kent Overstreet Sept. 6, 2022, 6:20 p.m. UTC | #59
On Tue, Sep 06, 2022 at 09:23:31AM +0200, Michal Hocko wrote:
> On Mon 05-09-22 19:46:49, Kent Overstreet wrote:
> > On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> > > This is really my main concern about this whole work. Not only it adds a
> > > considerable maintenance burden to the core MM because
> > 
> > [citation needed]
> 
> I thought this was clear from the email content (the part you haven't
> quoted here). But let me be explicit one more time for you.
> 
> I hope we can agree that in order for this kind of tracking to be useful
> you need to cover _callers_ of the allocator or in the ideal world
> the users/owner of the tracked memory (the later is sometimes much
> harder/impossible to track when the memory is handed over from one peer
> to another).
> 
> It is not particularly useful IMO to see that a large portion of the
> memory has been allocated by say vmalloc or kvmalloc, right?  How
> much does it really tell you that a lot of memory has been allocated
> by kvmalloc or vmalloc? Yet, neither of the two is handled by the
> proposed tracking and it would require additional code to be added and
> _maintained_ to cover them. But that would still be far from complete -
> we have the bulk allocator, mempools, etc.

Of course - and even a light skimming of the patch set would show that it does
indeed address this. We still have to do vmalloc and percpu memory allocations,
but slab is certainly handled, and that's the big one.

> As pointed out above, this just scales poorly and adds to the API space. Not
> to mention that direct use of alloc_tag_add can just confuse layers
> below which rely on the same thing.

It might help you make your case if you'd say something about what you'd like
better.

Otherwise, saying "code has to be maintained" is a little bit like saying water
is wet, and we're all engineers here, I think we know that :)
Michal Hocko Sept. 7, 2022, 11 a.m. UTC | #60
On Tue 06-09-22 14:20:58, Kent Overstreet wrote:
[...]
> Otherwise, saying "code has to be maintained" is a little bit like saying water
> is wet, and we're all engineers here, I think we know that :)

Hmm, it seems that further discussion doesn't really make much sense
here. I know how to use my time better.
Kent Overstreet Sept. 7, 2022, 1:04 p.m. UTC | #61
On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> Hmm, it seems that further discussion doesn't really make much sense
> here. I know how to use my time better.

Just a thought, but I generally find it more productive to propose ideas than to
just be disparaging.

Cheers,
Kent
Steven Rostedt Sept. 7, 2022, 1:45 p.m. UTC | #62
On Wed, 7 Sep 2022 09:04:28 -0400
Kent Overstreet <kent.overstreet@linux.dev> wrote:

> On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > Hmm, it seems that further discussion doesn't really make much sense
> > here. I know how to use my time better.  
> 
> Just a thought, but I generally find it more productive to propose ideas than to
> just be disparaging.
> 

But it's not Michal's job to do so. He's just telling you that the given
feature is not worth the burden. He's telling you the issues that he has
with the patch set. It's the submitter's job to address those concerns and
not the maintainer's to tell you how to make it better.

When Linus tells us that a submission is crap, we don't ask him how to make
it less crap, we listen to why he called it crap, and then rewrite to be
not so crappy. If we cannot figure it out, it doesn't get in.

-- Steve
Kent Overstreet Sept. 8, 2022, 6:35 a.m. UTC | #63
On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> On Wed, 7 Sep 2022 09:04:28 -0400
> Kent Overstreet <kent.overstreet@linux.dev> wrote:
> 
> > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > Hmm, it seems that further discussion doesn't really make much sense
> > > here. I know how to use my time better.  
> > 
> > Just a thought, but I generally find it more productive to propose ideas than to
> > just be disparaging.
> > 
> 
> But it's not Michal's job to do so. He's just telling you that the given
> feature is not worth the burden. He's telling you the issues that he has
> with the patch set. It's the submitter's job to address those concerns and
> not the maintainer's to tell you how to make it better.
> 
> When Linus tells us that a submission is crap, we don't ask him how to make
> it less crap, we listen to why he called it crap, and then rewrite to be
> not so crappy. If we cannot figure it out, it doesn't get in.

When Linus tells someone a submission is crap, he _always_ has a sound, and
_specific_ technical justification for doing so.

"This code is going to be a considerable maintenance burden" is vapid, and lazy.
It's the kind of feedback made by someone who has looked at the number of lines
of code a patch touches and not much more.
Suren Baghdasaryan Sept. 8, 2022, 6:49 a.m. UTC | #64
On Wed, Sep 7, 2022 at 11:35 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> > On Wed, 7 Sep 2022 09:04:28 -0400
> > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> >
> > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > > Hmm, it seems that further discussion doesn't really make much sense
> > > > here. I know how to use my time better.
> > >
> > > Just a thought, but I generally find it more productive to propose ideas than to
> > > just be disparaging.
> > >
> >
> > But it's not Michal's job to do so. He's just telling you that the given
> > feature is not worth the burden. He's telling you the issues that he has
> > with the patch set. It's the submitter's job to address those concerns and
> > not the maintainer's to tell you how to make it better.
> >
> > When Linus tells us that a submission is crap, we don't ask him how to make
> > it less crap, we listen to why he called it crap, and then rewrite to be
> > not so crappy. If we cannot figure it out, it doesn't get in.
>
> When Linus tells someone a submission is crap, he _always_ has a sound, and
> _specific_ technical justification for doing so.
>
> "This code is going to be a considerable maintenance burden" is vapid, and lazy.
> It's the kind of feedback made by someone who has looked at the number of lines
> of code a patch touches and not much more.

I would really appreciate it if everyone could please stick to the
technical side of the conversation. That way we can get some
constructive feedback. Everything else is not helpful and at best is a
distraction.
Maintenance burden is a price we pay and I think it's the prerogative
of the maintainers to take that into account. Our job is to prove that
the price is worth paying.

Kent Overstreet Sept. 8, 2022, 7:07 a.m. UTC | #65
On Wed, Sep 07, 2022 at 11:49:37PM -0700, Suren Baghdasaryan wrote:
> I would really appreciate it if everyone could please stick to the
> technical side of the conversation. That way we can get some
> constructive feedback. Everything else is not helpful and at best is a
> distraction.
> Maintenance burden is a price we pay and I think it's the prerogative
> of the maintainers to take that into account. Our job is to prove that
> the price is worth paying.

Well said.

I'd also like to add - slab.h does look pretty overgrown and messy. We've grown
a _lot_ of special purpose memory allocation interfaces, and I think it probably
is time to try and wrangle that back.

The API complexity isn't just an issue for this patch - it's an issue for
anything that has to wrap and plumb through memory allocation interfaces. It's a
pain point for the Rust people, and it also comes up in e.g. the mempool API.

I think we should keep going with the memalloc_no*_save()/restore() approach,
and extend it to other things:

 - memalloc_nowait_save()
 - memalloc_highpri_save()

(these two get you GFP_ATOMIC).

Also, I don't think these all need to be separate functions, we could have

memalloc_gfp_apply()
memalloc_gfp_restore()

which simply takes a gfp flags argument and applies it to the current
PF_MEMALLOC flags.
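
For illustration, here's a rough sketch of what that pair could look like,
patterned after the existing memalloc_noio_save()/memalloc_noio_restore()
helpers. None of this exists today - the memalloc_gfp task_struct field and
the exact merging semantics are assumptions for the sake of the example;
something like current_gfp_context() would then have to consult the field:

  /* Sketch only - not existing kernel API. Assumes a new task_struct
   * field ("memalloc_gfp") that the allocator core would consult when
   * constraining/upgrading gfp flags for nested allocations. */
  static inline gfp_t memalloc_gfp_apply(gfp_t gfp)
  {
          gfp_t old = current->memalloc_gfp;      /* hypothetical field */

          current->memalloc_gfp = gfp;
          return old;
  }

  static inline void memalloc_gfp_restore(gfp_t old)
  {
          current->memalloc_gfp = old;
  }

  /* usage fragment - roughly the GFP_ATOMIC-ish behaviour mentioned above: */
  gfp_t old = memalloc_gfp_apply(GFP_NOWAIT | __GFP_HIGH);

  p = vmalloc(size);              /* nested page-table allocs inherit it */
  memalloc_gfp_restore(old);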

We've had long-standing bugs where vmalloc() can't correctly take gfp flags
because some of the allocations it does for page tables don't have them
correctly plumbed through; switching to the memalloc_*_(save|restore) approach
is something people have been wanting for years in order to fix this. Actually
following through and completing this would let us kill the gfp flags arguments
to our various memory allocators entirely.

I think we can do the same thing with the NUMA node parameter - kill
kmalloc_node() et al., and move the node preference to task_struct with a set
of save/restore functions.
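
The same save/restore shape would work there too - again just a sketch with a
hypothetical task_struct field, not anything that exists in the tree:

  /* Sketch only: a task-local preferred node so callers wouldn't need
   * kmalloc_node(); "memalloc_node" is a hypothetical task_struct field. */
  static inline int memalloc_node_save(int node)
  {
          int old = current->memalloc_node;

          current->memalloc_node = node;
          return old;
  }

  static inline void memalloc_node_restore(int old)
  {
          current->memalloc_node = old;
  }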

There are probably other things we can do to simplify slab.h if we look more.
I've been hoping to start pushing patches for some of this stuff - it's going to
be some time before I can get to it though; I can only handle so many projects
in flight at a time :)
Michal Hocko Sept. 8, 2022, 7:12 a.m. UTC | #66
On Thu 08-09-22 02:35:48, Kent Overstreet wrote:
> On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> > On Wed, 7 Sep 2022 09:04:28 -0400
> > Kent Overstreet <kent.overstreet@linux.dev> wrote:
> > 
> > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > > Hmm, it seems that further discussion doesn't really make much sense
> > > > here. I know how to use my time better.  
> > > 
> > > Just a thought, but I generally find it more productive to propose ideas than to
> > > just be disparaging.
> > > 
> > 
> > But it's not Michal's job to do so. He's just telling you that the given
> > feature is not worth the burden. He's telling you the issues that he has
> > with the patch set. It's the submitter's job to address those concerns and
> > not the maintainer's to tell you how to make it better.
> > 
> > When Linus tells us that a submission is crap, we don't ask him how to make
> > it less crap, we listen to why he called it crap, and then rewrite to be
> > not so crappy. If we cannot figure it out, it doesn't get in.
> 
> When Linus tells someone a submission is crap, he _always_ has a sound, and
> _specific_ technical justification for doing so.
> 
> "This code is going to be a considerable maintenance burden" is vapid, and lazy.
> It's the kind of feedback made by someone who has looked at the number of lines
> of code a patch touches and not much more.

Then you have probably missed a huge part of my emails. Please
re-read. If those arguments are not clear, feel free to ask for
clarification. Reducing the whole of my reasoning and objections to the
sentence above and calling that vapid and lazy is not only unfair but
also disrespectful.
Kent Overstreet Sept. 8, 2022, 7:29 a.m. UTC | #67
On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote:
> Then you have probably missed a huge part of my emails. Please
> re-read. If those arguments are not clear, feel free to ask for
> clarification. Reducing the whole of my reasoning and objections to the
> sentence above and calling that vapid and lazy is not only unfair but
> also disrespectful.

What, where you complained about slab's page allocations showing up in the
profile instead of slab itself, and I pointed out to you that actually each and
every slab call is instrumented and you're just seeing some double counting
(which we will no doubt fix)?

Or when you complained about allocation sites where it should actually be the
caller that should be instrumented, and I pointed out that it'd be quite easy to
simply change that code to use _kmalloc() and slab_tag_add() directly, if it
becomes an issue.
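
A rough sketch of what I mean - the exact helper signatures here are
illustrative placeholders, not the patch set's real interface; only the
shape of the idea matters:

  /* Hypothetical sketch, not the actual patch-set API: the wrapper does an
   * untagged allocation and charges the tag supplied by its caller, so the
   * caller's code location is what shows up in the accounting. */
  void *my_wrapper_alloc(size_t size, gfp_t gfp, struct alloc_tag *tag)
  {
          void *obj = _kmalloc(size, gfp);        /* untagged allocation */

          if (obj)
                  slab_tag_add(tag, obj, size);   /* attribute to the caller */
          return obj;
  }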

Of course, if we got that far, we'd have this code to thank for telling us where
to look!

Did I miss anything?
Michal Hocko Sept. 8, 2022, 7:47 a.m. UTC | #68
On Thu 08-09-22 03:29:50, Kent Overstreet wrote:
> On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote:
> > Then you have probably missed a huge part of my emails. Please
> > re-read. If those arguments are not clear, feel free to ask for
> > clarification. Reducing the whole of my reasoning and objections to the
> > sentence above and calling that vapid and lazy is not only unfair but
> > also disrespectful.
> 
> What, where you complained about slab's page allocations showing up in the
> profile instead of slab itself, and I pointed out to you that actually each and
> every slab call is instrumented and you're just seeing some double counting
> (which we will no doubt fix)?
> 
> Or when you complained about allocation sites where it should actually be the
> caller that should be instrumented, and I pointed out that it'd be quite easy to
> simply change that code to use _kmalloc() and slab_tag_add() directly, if it
> becomes an issue.
> 
> Of course, if we got that far, we'd have this code to thank for telling us where
> to look!
> 
> Did I miss anything?

Feel free to respond to the specific arguments as I wrote them. I won't
repeat them again. Sure, we can discuss how important/relevant those
are. And that _can_ be a productive discussion.