Message ID: 20240321163705.3067592-1-surenb@google.com (mailing list archive)
Series: Memory allocation profiling
On Thu, 21 Mar 2024 09:36:22 -0700 Suren Baghdasaryan <surenb@google.com> wrote:

> Low overhead [1] per-callsite memory allocation profiling. Not just for
> debug kernels, overhead low enough to be deployed in production.
>
> Example output:
>   root@moria-kvm:~# sort -rn /proc/allocinfo
>    127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
>     56373248     4737 mm/slub.c:2259 func:alloc_slab_page
>     14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
>     14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
>     13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
>     11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
>      9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
>      4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
>      4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
>      3940352      962 mm/memory.c:4214 func:alloc_anon_folio
>      2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node

Did you consider adding a knob to permit all the data to be wiped out?
So people can zap everything, run the chosen workload then go see what
happened?

Of course, this can be done in userspace by taking a snapshot before
and after, then crunching on the two....
On Thu, Mar 21, 2024 at 1:42 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 21 Mar 2024 09:36:22 -0700 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > Low overhead [1] per-callsite memory allocation profiling. Not just for
> > debug kernels, overhead low enough to be deployed in production.
> >
> > Example output:
> >   root@moria-kvm:~# sort -rn /proc/allocinfo
> > [...]
>
> Did you consider adding a knob to permit all the data to be wiped out?
> So people can zap everything, run the chosen workload then go see what
> happened?
>
> Of course, this can be done in userspace by taking a snapshot before
> and after, then crunching on the two....

Yeah, that's exactly what I was envisioning. I don't think we need to
complicate things further by adding reset functionality unless there are
other reasons for it.
Thanks!
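The snapshot-before-and-after workflow discussed here can live entirely in
userspace. A minimal sketch in Python, assuming the "<bytes> <calls>
<callsite>" layout shown in the example output above (reading /proc/allocinfo
usually requires root):

#!/usr/bin/env python3
# Rough sketch of the before/after approach: snapshot /proc/allocinfo, run
# the workload, snapshot again, and print the largest per-callsite deltas.
ALLOCINFO = "/proc/allocinfo"

def snapshot(path=ALLOCINFO):
    tags = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            # Keep only "<bytes> <calls> <callsite>" rows; skip anything else.
            if len(fields) >= 3 and fields[0].isdigit() and fields[1].isdigit():
                tags[" ".join(fields[2:])] = (int(fields[0]), int(fields[1]))
    return tags

def diff(before, after):
    deltas = []
    for tag, (size, calls) in after.items():
        old_size, old_calls = before.get(tag, (0, 0))
        if (size, calls) != (old_size, old_calls):
            deltas.append((size - old_size, calls - old_calls, tag))
    return sorted(deltas, reverse=True)

if __name__ == "__main__":
    before = snapshot()
    input("Snapshot taken; run the workload, then press Enter... ")
    after = snapshot()
    for dsize, dcalls, tag in diff(before, after)[:20]:
        print(f"{dsize:>12} {dcalls:>8} {tag}")

Run it as root, trigger the workload when prompted, and the callsites whose
tracked bytes grew the most are printed first.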
Hi,

On 2024-03-21 17:36, Suren Baghdasaryan wrote:
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for
> debug kernels, overhead low enough to be deployed in production.
>
> Example output:
>   root@moria-kvm:~# sort -rn /proc/allocinfo
>    127664128    31168 mm/page_ext.c:270 func:alloc_page_ext
>     56373248     4737 mm/slub.c:2259 func:alloc_slab_page
>     14880768     3633 mm/readahead.c:247 func:page_cache_ra_unbounded
>     14417920     3520 mm/mm_init.c:2530 func:alloc_large_system_hash
>     13377536      234 block/blk-mq.c:3421 func:blk_mq_alloc_rqs
>     11718656     2861 mm/filemap.c:1919 func:__filemap_get_folio
>      9192960     2800 kernel/fork.c:307 func:alloc_thread_stack_node
>      4206592        4 net/netfilter/nf_conntrack_core.c:2567 func:nf_ct_alloc_hashtable
>      4136960     1010 drivers/staging/ctagmod/ctagmod.c:20 [ctagmod] func:ctagmod_start
>      3940352      962 mm/memory.c:4214 func:alloc_anon_folio
>      2894464    22613 fs/kernfs/dir.c:615 func:__kernfs_new_node
> ...
>
> Since v5 [2]:
> - Added Reviewed-by and Acked-by, per Vlastimil Babka and Miguel Ojeda
> - Changed pgalloc_tag_{add|sub} to use number of pages instead of order, per Matthew Wilcox
> - Changed pgalloc_tag_sub_bytes to pgalloc_tag_sub_pages and adjusted the usage, per Matthew Wilcox
> - Moved static key check before prepare_slab_obj_exts_hook(), per Vlastimil Babka
> - Fixed RUST helper, per Miguel Ojeda
> - Fixed documentation, per Randy Dunlap
> - Rebased over mm-unstable
>
> Usage:
> kconfig options:
>  - CONFIG_MEM_ALLOC_PROFILING
>  - CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
>  - CONFIG_MEM_ALLOC_PROFILING_DEBUG
>    adds warnings for allocations that weren't accounted because of a
>    missing annotation
>
> sysctl:
>   /proc/sys/vm/mem_profiling
>
> Runtime info:
>   /proc/allocinfo
>
> Notes:
>
> [1]: Overhead
> To measure the overhead we are comparing the following configurations:
> (1) Baseline with CONFIG_MEMCG_KMEM=n
> (2) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n)
> (3) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y)
> (4) Enabled at runtime (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n && /proc/sys/vm/mem_profiling=1)
> (5) Baseline with CONFIG_MEMCG_KMEM=y && allocating with __GFP_ACCOUNT
> (6) Disabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=n) && CONFIG_MEMCG_KMEM=y
> (7) Enabled by default (CONFIG_MEM_ALLOC_PROFILING=y &&
>     CONFIG_MEM_ALLOC_PROFILING_BY_DEFAULT=y) && CONFIG_MEMCG_KMEM=y
>
> Performance overhead:
> To evaluate performance we implemented an in-kernel test executing
> multiple get_free_page/free_page and kmalloc/kfree calls with allocation
> sizes growing from 8 to 240 bytes with CPU frequency set to max and CPU
> affinity set to a specific CPU to minimize the noise. Below are results
> from running the test on Ubuntu 22.04.2 LTS with 6.8.0-rc1 kernel on
> 56 core Intel Xeon:
>
>                         kmalloc               pgalloc
> (1 baseline)            6.764s                16.902s
> (2 default disabled)    6.793s  (+0.43%)      17.007s (+0.62%)
> (3 default enabled)     7.197s  (+6.40%)      23.666s (+40.02%)
> (4 runtime enabled)     7.405s  (+9.48%)      23.901s (+41.41%)
> (5 memcg)               13.388s (+97.94%)     48.460s (+186.71%)
> (6 def disabled+memcg)  13.332s (+97.10%)     48.105s (+184.61%)
> (7 def enabled+memcg)   13.446s (+98.78%)     54.963s (+225.18%)
>
> Memory overhead:
> Kernel size:
>
>         text        data         bss         dec      diff
> (1) 26515311    18890222    17018880    62424413
> (2) 26524728    19423818    16740352    62688898    264485
> (3) 26524724    19423818    16740352    62688894    264481
> (4) 26524728    19423818    16740352    62688898    264485
> (5) 26541782    18964374    16957440    62463596     39183
>
> Memory consumption on a 56 core Intel CPU with 125GB of memory:
> Code tags:           192 kB
> PageExts:         262144 kB (256MB)
> SlabExts:           9876 kB (9.6MB)
> PcpuExts:            512 kB (0.5MB)
>
> Total overhead is 0.2% of total memory.
>
> Benchmarks:
>
> Hackbench tests run 100 times:
> hackbench -s 512 -l 200 -g 15 -f 25 -P
>         baseline    disabled profiling    enabled profiling
> avg     0.3543      0.3559 (+0.0016)      0.3566 (+0.0023)
> stdev   0.0137      0.0188                0.0077
>
> hackbench -l 10000
>         baseline    disabled profiling    enabled profiling
> avg     6.4218      6.4306 (+0.0088)      6.5077 (+0.0859)
> stdev   0.0933      0.0286                0.0489
>
> stress-ng tests:
> stress-ng --class memory --seq 4 -t 60
> stress-ng --class cpu --seq 4 -t 60
> Results posted at: https://evilpiepirate.org/~kent/memalloc_prof_v4_stress-ng/
>
> [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/

If I enable this, I consistently get percpu allocation failures. I can
occasionally reproduce it in qemu. I've attached the logs and my config,
please let me know if there's anything else that could be relevant.

Kind regards,
Klara Modin
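The runtime interface quoted above (the /proc/sys/vm/mem_profiling sysctl plus
/proc/allocinfo) can be exercised with a short script along these lines. This
is an illustrative sketch, not part of the series; it assumes a kernel built
with CONFIG_MEM_ALLOC_PROFILING=y, the "<bytes> <calls> <callsite>" layout from
the example output, and root privileges:

#!/usr/bin/env python3
# Enable allocation profiling at runtime, then summarize what it is tracking.
SYSCTL = "/proc/sys/vm/mem_profiling"
ALLOCINFO = "/proc/allocinfo"

def set_profiling(enabled: bool):
    # Equivalent to: echo 1 > /proc/sys/vm/mem_profiling (echo 0 to disable).
    with open(SYSCTL, "w") as f:
        f.write("1" if enabled else "0")

def total_tracked_bytes():
    total = 0
    with open(ALLOCINFO) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3 and fields[0].isdigit():
                total += int(fields[0])  # first column: bytes per callsite
    return total

if __name__ == "__main__":
    set_profiling(True)
    print(f"currently tracked: {total_tracked_bytes() / 1024:.1f} KiB")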
On Fri, Apr 5, 2024 at 6:37 AM Klara Modin <klarasmodin@gmail.com> wrote:
>
> Hi,
>
> On 2024-03-21 17:36, Suren Baghdasaryan wrote:
> > Overview:
> > Low overhead [1] per-callsite memory allocation profiling. Not just for
> > debug kernels, overhead low enough to be deployed in production.
> >
> > [...]
> >
> > [2] https://lore.kernel.org/all/20240306182440.2003814-1-surenb@google.com/
>
> If I enable this, I consistently get percpu allocation failures. I can
> occasionally reproduce it in qemu. I've attached the logs and my config,
> please let me know if there's anything else that could be relevant.

Thanks for the report!
In debug_alloc_profiling.log I see:

[ 7.445127] percpu: limit reached, disable warning

That's probably the reason. I'll take a closer look at the cause of
that and how we can fix it.

In qemu-alloc3.log I see a couple of warnings:

[ 1.111620] alloc_tag was not set
[ 1.111880] WARNING: CPU: 0 PID: 164 at include/linux/alloc_tag.h:118
kfree (./include/linux/alloc_tag.h:118 (discriminator 1)
./include/linux/alloc_tag.h:161 (discriminator 1) mm/slub.c:2043 ...

[ 1.161710] alloc_tag was not cleared (got tag for fs/squashfs/cache.c:413)
[ 1.162289] WARNING: CPU: 0 PID: 195 at include/linux/alloc_tag.h:109
kmalloc_trace_noprof (./include/linux/alloc_tag.h:109 (discriminator 1)
./include/linux/alloc_tag.h:149 (discriminator 1) ...

Which means we missed instrumenting some allocation. Can you please
check if disabling CONFIG_MEM_ALLOC_PROFILING_DEBUG fixes the QEMU case?
In the meantime I'll try to reproduce and fix this.
Thanks,
Suren.

> Kind regards,
> Klara Modin
On 2024-04-05 16:14, Suren Baghdasaryan wrote:
> On Fri, Apr 5, 2024 at 6:37 AM Klara Modin <klarasmodin@gmail.com> wrote:
>> If I enable this, I consistently get percpu allocation failures. I can
>> occasionally reproduce it in qemu. I've attached the logs and my config,
>> please let me know if there's anything else that could be relevant.
>
> Thanks for the report!
> In debug_alloc_profiling.log I see:
>
> [ 7.445127] percpu: limit reached, disable warning
>
> That's probably the reason. I'll take a closer look at the cause of
> that and how we can fix it.

Thanks!

> In qemu-alloc3.log I see a couple of warnings:
>
> [ 1.111620] alloc_tag was not set
> [ 1.111880] WARNING: CPU: 0 PID: 164 at
> include/linux/alloc_tag.h:118 kfree (./include/linux/alloc_tag.h:118
> (discriminator 1) ./include/linux/alloc_tag.h:161 (discriminator 1)
> mm/slub.c:2043 ...
>
> [ 1.161710] alloc_tag was not cleared (got tag for fs/squashfs/cache.c:413)
> [ 1.162289] WARNING: CPU: 0 PID: 195 at
> include/linux/alloc_tag.h:109 kmalloc_trace_noprof
> (./include/linux/alloc_tag.h:109 (discriminator 1)
> ./include/linux/alloc_tag.h:149 (discriminator 1) ...
>
> Which means we missed instrumenting some allocation. Can you please
> check if disabling CONFIG_MEM_ALLOC_PROFILING_DEBUG fixes the QEMU case?
> In the meantime I'll try to reproduce and fix this.
> Thanks,
> Suren.

That does seem to be the case from what I can tell. I didn't get the
warning in qemu consistently, but it hasn't reappeared over a number of
runs at least with the debugging option off.

Regards,
Klara Modin
On Fri, Apr 5, 2024 at 7:30 AM Klara Modin <klarasmodin@gmail.com> wrote:
>
> On 2024-04-05 16:14, Suren Baghdasaryan wrote:
> > On Fri, Apr 5, 2024 at 6:37 AM Klara Modin <klarasmodin@gmail.com> wrote:
> >> If I enable this, I consistently get percpu allocation failures. I can
> >> occasionally reproduce it in qemu. I've attached the logs and my config,
> >> please let me know if there's anything else that could be relevant.
> >
> > Thanks for the report!
> > In debug_alloc_profiling.log I see:
> >
> > [ 7.445127] percpu: limit reached, disable warning
> >
> > That's probably the reason. I'll take a closer look at the cause of
> > that and how we can fix it.
>
> Thanks!

In the build that produced debug_alloc_profiling.log I think we are
consuming all the per-cpu memory reserved for the modules. Could you
please try this change and see if that fixes the issue:

 include/linux/percpu.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index a790afba9386..03053de557cf 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -17,7 +17,7 @@
 /* enough to cover all DEFINE_PER_CPUs in modules */
 #ifdef CONFIG_MODULES
 #ifdef CONFIG_MEM_ALLOC_PROFILING
-#define PERCPU_MODULE_RESERVE (8 << 12)
+#define PERCPU_MODULE_RESERVE (8 << 13)
 #else
 #define PERCPU_MODULE_RESERVE (8 << 10)
 #endif

> > In qemu-alloc3.log I see a couple of warnings:
> >
> > [...]
> >
> > Which means we missed instrumenting some allocation. Can you please
> > check if disabling CONFIG_MEM_ALLOC_PROFILING_DEBUG fixes the QEMU case?
> > In the meantime I'll try to reproduce and fix this.
> > Thanks,
> > Suren.
>
> That does seem to be the case from what I can tell. I didn't get the
> warning in qemu consistently, but it hasn't reappeared over a number of
> runs at least with the debugging option off.
>
> Regards,
> Klara Modin
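For scale, PERCPU_MODULE_RESERVE is the per-cpu space set aside for static
per-cpu data in modules: the stock 8 << 10 is 8 KiB, the 8 << 12 used when
profiling is enabled is 32 KiB, and the proposed 8 << 13 doubles that to
64 KiB, presumably giving the per-module allocation tag counters that
exhausted the reserve enough room.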
On 2024-04-05 17:20, Suren Baghdasaryan wrote:
> On Fri, Apr 5, 2024 at 7:30 AM Klara Modin <klarasmodin@gmail.com> wrote:
>>
>> On 2024-04-05 16:14, Suren Baghdasaryan wrote:
>>> On Fri, Apr 5, 2024 at 6:37 AM Klara Modin <klarasmodin@gmail.com> wrote:
>>>> If I enable this, I consistently get percpu allocation failures. I can
>>>> occasionally reproduce it in qemu. I've attached the logs and my config,
>>>> please let me know if there's anything else that could be relevant.
>>>
>>> Thanks for the report!
>>> In debug_alloc_profiling.log I see:
>>>
>>> [ 7.445127] percpu: limit reached, disable warning
>>>
>>> That's probably the reason. I'll take a closer look at the cause of
>>> that and how we can fix it.
>>
>> Thanks!
>
> In the build that produced debug_alloc_profiling.log I think we are
> consuming all the per-cpu memory reserved for the modules. Could you
> please try this change and see if that fixes the issue:
>
>  include/linux/percpu.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index a790afba9386..03053de557cf 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -17,7 +17,7 @@
>  /* enough to cover all DEFINE_PER_CPUs in modules */
>  #ifdef CONFIG_MODULES
>  #ifdef CONFIG_MEM_ALLOC_PROFILING
> -#define PERCPU_MODULE_RESERVE (8 << 12)
> +#define PERCPU_MODULE_RESERVE (8 << 13)
>  #else
>  #define PERCPU_MODULE_RESERVE (8 << 10)
>  #endif
>

Yeah, that patch fixes the issue for me.

Thanks,
Tested-by: Klara Modin
On Fri, Apr 5, 2024 at 8:38 AM Klara Modin <klarasmodin@gmail.com> wrote:
>
> On 2024-04-05 17:20, Suren Baghdasaryan wrote:
> > In the build that produced debug_alloc_profiling.log I think we are
> > consuming all the per-cpu memory reserved for the modules. Could you
> > please try this change and see if that fixes the issue:
> >
> > [...]
> > -#define PERCPU_MODULE_RESERVE (8 << 12)
> > +#define PERCPU_MODULE_RESERVE (8 << 13)
> > [...]
>
> Yeah, that patch fixes the issue for me.
>
> Thanks,
> Tested-by: Klara Modin

Official fix is posted at
https://lore.kernel.org/all/20240406214044.1114406-1-surenb@google.com/

Thanks,
Suren.
On Thu, Mar 21, 2024 at 09:36:22AM -0700, Suren Baghdasaryan wrote:
> Low overhead [1] per-callsite memory allocation profiling. Not just for
> debug kernels, overhead low enough to be deployed in production.

Okay, I think I'm holding it wrong. With next-20240424 if I set:

CONFIG_CODE_TAGGING=y
CONFIG_MEM_ALLOC_PROFILING=y
CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y

My test system totally freaks out:

...
SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
Oops: general protection fault, probably for non-canonical address 0xc388d881e4808550: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 0 Comm: swapper Not tainted 6.9.0-rc5-next-20240424 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:__kmalloc_node_noprof+0xcd/0x560

Which is:

__kmalloc_node_noprof+0xcd/0x560:
__slab_alloc_node at mm/slub.c:3780 (discriminator 2)
(inlined by) slab_alloc_node at mm/slub.c:3982 (discriminator 2)
(inlined by) __do_kmalloc_node at mm/slub.c:4114 (discriminator 2)
(inlined by) __kmalloc_node_noprof at mm/slub.c:4122 (discriminator 2)

Which is:

	tid = READ_ONCE(c->tid);

I haven't gotten any further than that; I'm EOD. Anyone seen anything
like this with this series?

-Kees
On Wed, Apr 24, 2024 at 06:59:01PM -0700, Kees Cook wrote:
> On Thu, Mar 21, 2024 at 09:36:22AM -0700, Suren Baghdasaryan wrote:
> > Low overhead [1] per-callsite memory allocation profiling. Not just for
> > debug kernels, overhead low enough to be deployed in production.
>
> Okay, I think I'm holding it wrong. With next-20240424 if I set:
>
> CONFIG_CODE_TAGGING=y
> CONFIG_MEM_ALLOC_PROFILING=y
> CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
>
> My test system totally freaks out:
>
> [...]
> RIP: 0010:__kmalloc_node_noprof+0xcd/0x560
> [...]
>
> I haven't gotten any further than that; I'm EOD. Anyone seen anything
> like this with this series?

I certainly haven't. That looks like some real corruption; we're in slub
internal data structures and derefing a garbage address. Check kasan and
all that?
On Wed, Apr 24, 2024 at 8:26 PM Kent Overstreet <kent.overstreet@linux.dev> wrote:
>
> On Wed, Apr 24, 2024 at 06:59:01PM -0700, Kees Cook wrote:
> > Okay, I think I'm holding it wrong. With next-20240424 if I set:
> >
> > CONFIG_CODE_TAGGING=y
> > CONFIG_MEM_ALLOC_PROFILING=y
> > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> >
> > My test system totally freaks out:
> >
> > [...]
> >
> > I haven't gotten any further than that; I'm EOD. Anyone seen anything
> > like this with this series?
>
> I certainly haven't. That looks like some real corruption; we're in slub
> internal data structures and derefing a garbage address. Check kasan and
> all that?

Hi Kees,
I tested next-20240424 yesterday with defconfig and
CONFIG_MEM_ALLOC_PROFILING enabled but didn't see any issue like that.
Could you share your config file please?
Thanks,
Suren.
On Thu, Apr 25, 2024 at 08:39:37AM -0700, Suren Baghdasaryan wrote:
> On Wed, Apr 24, 2024 at 8:26 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Wed, Apr 24, 2024 at 06:59:01PM -0700, Kees Cook wrote:
> > > Okay, I think I'm holding it wrong. With next-20240424 if I set:
> > >
> > > CONFIG_CODE_TAGGING=y
> > > CONFIG_MEM_ALLOC_PROFILING=y
> > > CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y
> > >
> > > My test system totally freaks out:
> > >
> > > [...]
>
> Hi Kees,
> I tested next-20240424 yesterday with defconfig and
> CONFIG_MEM_ALLOC_PROFILING enabled but didn't see any issue like that.
> Could you share your config file please?

Well *that* took a while to .config bisect. I probably should have found
it sooner, but CONFIG_DEBUG_KMEMLEAK=y is what broke me. Without that,
everything is lovely! :)

I can reproduce it now with:

$ make defconfig kvm_guest.config
$ ./scripts/config -e CONFIG_MEM_ALLOC_PROFILING -e CONFIG_DEBUG_KMEMLEAK

-Kees
On Thu, Mar 21, 2024 at 09:36:22AM -0700, Suren Baghdasaryan wrote:
> Overview:
> Low overhead [1] per-callsite memory allocation profiling. Not just for
> debug kernels, overhead low enough to be deployed in production.

A bit late to actually _running_ this code, but I remain a fan:

Tested-by: Kees Cook <keescook@chromium.org>

I have a little tweak patch I'll send out too...
On Thu, Apr 25, 2024 at 1:01 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Apr 25, 2024 at 08:39:37AM -0700, Suren Baghdasaryan wrote:
> > Hi Kees,
> > I tested next-20240424 yesterday with defconfig and
> > CONFIG_MEM_ALLOC_PROFILING enabled but didn't see any issue like that.
> > Could you share your config file please?
>
> Well *that* took a while to .config bisect. I probably should have found
> it sooner, but CONFIG_DEBUG_KMEMLEAK=y is what broke me. Without that,
> everything is lovely! :)
>
> I can reproduce it now with:
>
> $ make defconfig kvm_guest.config
> $ ./scripts/config -e CONFIG_MEM_ALLOC_PROFILING -e CONFIG_DEBUG_KMEMLEAK

Thanks! I'll use this to reproduce the issue and will see if we can
handle that recursion in a better way.

>
> -Kees
>
> --
> Kees Cook