mbox series

[v3,00/19] stackdepot: allow evicting stack traces

Message ID cover.1698077459.git.andreyknvl@google.com (mailing list archive)
Headers show
Series stackdepot: allow evicting stack traces | expand

Message

andrey.konovalov@linux.dev Oct. 23, 2023, 4:22 p.m. UTC
From: Andrey Konovalov <andreyknvl@google.com>

Currently, the stack depot grows indefinitely until it reaches its
capacity. Once that happens, the stack depot stops saving new stack
traces.

This creates a problem for using the stack depot for in-field testing
and in production.

For such uses, an ideal stack trace storage should:

1. Allow saving fresh stack traces on systems with a large uptime while
   limiting the amount of memory used to store the traces;
2. Have a low performance impact.

Implementing #1 in the stack depot is impossible with the current
keep-forever approach. This series targets to address that. Issue #2 is
left to be addressed in a future series.

This series changes the stack depot implementation to allow evicting
unneeded stack traces from the stack depot. The users of the stack depot
can do that via new stack_depot_save_flags(STACK_DEPOT_FLAG_GET) and
stack_depot_put APIs.

Internal changes to the stack depot code include:

1. Storing stack traces in fixed-frame-sized slots; the slot size is
   controlled via CONFIG_STACKDEPOT_MAX_FRAMES (vs precisely-sized
   slots in the current implementation);
2. Keeping available slots in a freelist (vs keeping an offset to the next
   free slot);
3. Using a read/write lock for synchronization (vs a lock-free approach
   combined with a spinlock).

This series also integrates the eviction functionality in the tag-based
KASAN modes.

Despite wasting some space on rounding up the size of each stack record,
with CONFIG_STACKDEPOT_MAX_FRAMES=32, the tag-based KASAN modes end up
consuming ~5% less memory in stack depot during boot (with the default
stack ring size of 32k entries). The reason for this is the eviction of
irrelevant stack traces from the stack depot, which frees up space for
other stack traces.

For other tools that heavily rely on the stack depot, like Generic KASAN
and KMSAN, this change leads to the stack depot capacity being reached
sooner than before. However, as these tools are mainly used in fuzzing
scenarios where the kernel is frequently rebooted, this outcome should
be acceptable.

There is no measurable boot time performance impact of these changes for
KASAN on x86-64. I haven't done any tests for arm64 modes (the stack
depot without performance optimizations is not suitable for intended use
of those anyway), but I expect a similar result. Obtaining and copying
stack trace frames when saving them into stack depot is what takes the
most time.

This series does not yet provide a way to configure the maximum size of
the stack depot externally (e.g. via a command-line parameter). This will
be added in a separate series, possibly together with the performance
improvement changes.

---

Changes v2->v3:
- Fix null-ptr-deref by using the proper number of entries for
  initializing the stack table when alloc_large_system_hash()
  auto-calculates the number (see patch #12).
- Keep STACKDEPOT/STACKDEPOT_ALWAYS_INIT Kconfig options not configurable
  by users.
- Use lockdep_assert_held_read annotation in depot_fetch_stack.
- WARN_ON invalid flags in stack_depot_save_flags.
- Moved "../slab.h" include in mm/kasan/report_tags.c in the right patch.
- Various comment fixes.

Changes v1->v2:
- Rework API to stack_depot_save_flags(STACK_DEPOT_FLAG_GET) +
  stack_depot_put.
- Add CONFIG_STACKDEPOT_MAX_FRAMES Kconfig option.
- Switch stack depot to using list_head's.
- Assorted minor changes, see the commit message for each path.

Andrey Konovalov (19):
  lib/stackdepot: check disabled flag when fetching
  lib/stackdepot: simplify __stack_depot_save
  lib/stackdepot: drop valid bit from handles
  lib/stackdepot: add depot_fetch_stack helper
  lib/stackdepot: use fixed-sized slots for stack records
  lib/stackdepot: fix and clean-up atomic annotations
  lib/stackdepot: rework helpers for depot_alloc_stack
  lib/stackdepot: rename next_pool_required to new_pool_required
  lib/stackdepot: store next pool pointer in new_pool
  lib/stackdepot: store free stack records in a freelist
  lib/stackdepot: use read/write lock
  lib/stackdepot: use list_head for stack record links
  kmsan: use stack_depot_save instead of __stack_depot_save
  lib/stackdepot, kasan: add flags to __stack_depot_save and rename
  lib/stackdepot: add refcount for records
  lib/stackdepot: allow users to evict stack traces
  kasan: remove atomic accesses to stack ring entries
  kasan: check object_size in kasan_complete_mode_report_info
  kasan: use stack_depot_put for tag-based modes

 include/linux/stackdepot.h |  59 ++++--
 lib/Kconfig                |  10 +
 lib/stackdepot.c           | 418 ++++++++++++++++++++++++-------------
 mm/kasan/common.c          |   7 +-
 mm/kasan/generic.c         |   9 +-
 mm/kasan/kasan.h           |   2 +-
 mm/kasan/report_tags.c     |  27 +--
 mm/kasan/tags.c            |  24 ++-
 mm/kmsan/core.c            |   7 +-
 9 files changed, 365 insertions(+), 198 deletions(-)

Comments

Anders Roxell Oct. 24, 2023, 9:32 a.m. UTC | #1
On 2023-10-23 18:22, andrey.konovalov@linux.dev wrote:
> From: Andrey Konovalov <andreyknvl@google.com>
> 
> Currently, the stack depot grows indefinitely until it reaches its
> capacity. Once that happens, the stack depot stops saving new stack
> traces.
> 
> This creates a problem for using the stack depot for in-field testing
> and in production.
> 
> For such uses, an ideal stack trace storage should:
> 
> 1. Allow saving fresh stack traces on systems with a large uptime while
>    limiting the amount of memory used to store the traces;
> 2. Have a low performance impact.
> 
> Implementing #1 in the stack depot is impossible with the current
> keep-forever approach. This series targets to address that. Issue #2 is
> left to be addressed in a future series.
> 
> This series changes the stack depot implementation to allow evicting
> unneeded stack traces from the stack depot. The users of the stack depot
> can do that via new stack_depot_save_flags(STACK_DEPOT_FLAG_GET) and
> stack_depot_put APIs.
> 
> Internal changes to the stack depot code include:
> 
> 1. Storing stack traces in fixed-frame-sized slots; the slot size is
>    controlled via CONFIG_STACKDEPOT_MAX_FRAMES (vs precisely-sized
>    slots in the current implementation);
> 2. Keeping available slots in a freelist (vs keeping an offset to the next
>    free slot);
> 3. Using a read/write lock for synchronization (vs a lock-free approach
>    combined with a spinlock).
> 
> This series also integrates the eviction functionality in the tag-based
> KASAN modes.
> 
> Despite wasting some space on rounding up the size of each stack record,
> with CONFIG_STACKDEPOT_MAX_FRAMES=32, the tag-based KASAN modes end up
> consuming ~5% less memory in stack depot during boot (with the default
> stack ring size of 32k entries). The reason for this is the eviction of
> irrelevant stack traces from the stack depot, which frees up space for
> other stack traces.
> 
> For other tools that heavily rely on the stack depot, like Generic KASAN
> and KMSAN, this change leads to the stack depot capacity being reached
> sooner than before. However, as these tools are mainly used in fuzzing
> scenarios where the kernel is frequently rebooted, this outcome should
> be acceptable.
> 
> There is no measurable boot time performance impact of these changes for
> KASAN on x86-64. I haven't done any tests for arm64 modes (the stack
> depot without performance optimizations is not suitable for intended use
> of those anyway), but I expect a similar result. Obtaining and copying
> stack trace frames when saving them into stack depot is what takes the
> most time.
> 
> This series does not yet provide a way to configure the maximum size of
> the stack depot externally (e.g. via a command-line parameter). This will
> be added in a separate series, possibly together with the performance
> improvement changes.
> 
> ---
> 
> Changes v2->v3:
> - Fix null-ptr-deref by using the proper number of entries for
>   initializing the stack table when alloc_large_system_hash()
>   auto-calculates the number (see patch #12).
> - Keep STACKDEPOT/STACKDEPOT_ALWAYS_INIT Kconfig options not configurable
>   by users.
> - Use lockdep_assert_held_read annotation in depot_fetch_stack.
> - WARN_ON invalid flags in stack_depot_save_flags.
> - Moved "../slab.h" include in mm/kasan/report_tags.c in the right patch.
> - Various comment fixes.
> 
> Changes v1->v2:
> - Rework API to stack_depot_save_flags(STACK_DEPOT_FLAG_GET) +
>   stack_depot_put.
> - Add CONFIG_STACKDEPOT_MAX_FRAMES Kconfig option.
> - Switch stack depot to using list_head's.
> - Assorted minor changes, see the commit message for each path.
> 
> Andrey Konovalov (19):
>   lib/stackdepot: check disabled flag when fetching
>   lib/stackdepot: simplify __stack_depot_save
>   lib/stackdepot: drop valid bit from handles
>   lib/stackdepot: add depot_fetch_stack helper
>   lib/stackdepot: use fixed-sized slots for stack records
>   lib/stackdepot: fix and clean-up atomic annotations
>   lib/stackdepot: rework helpers for depot_alloc_stack
>   lib/stackdepot: rename next_pool_required to new_pool_required
>   lib/stackdepot: store next pool pointer in new_pool
>   lib/stackdepot: store free stack records in a freelist
>   lib/stackdepot: use read/write lock
>   lib/stackdepot: use list_head for stack record links
>   kmsan: use stack_depot_save instead of __stack_depot_save
>   lib/stackdepot, kasan: add flags to __stack_depot_save and rename
>   lib/stackdepot: add refcount for records
>   lib/stackdepot: allow users to evict stack traces
>   kasan: remove atomic accesses to stack ring entries
>   kasan: check object_size in kasan_complete_mode_report_info
>   kasan: use stack_depot_put for tag-based modes

Tested-by: Anders Roxell <anders.roxell@linaro.org>

Applied this patchset to linux-next tag next-20231023 and built an arm64
kernel and that
booted fine in QEMU.

Cheers,
Anders
Marco Elver Oct. 24, 2023, 1:13 p.m. UTC | #2
On Mon, 23 Oct 2023 at 18:22, <andrey.konovalov@linux.dev> wrote:
[...]
> ---
>
> Changes v2->v3:
> - Fix null-ptr-deref by using the proper number of entries for
>   initializing the stack table when alloc_large_system_hash()
>   auto-calculates the number (see patch #12).
> - Keep STACKDEPOT/STACKDEPOT_ALWAYS_INIT Kconfig options not configurable
>   by users.
> - Use lockdep_assert_held_read annotation in depot_fetch_stack.
> - WARN_ON invalid flags in stack_depot_save_flags.
> - Moved "../slab.h" include in mm/kasan/report_tags.c in the right patch.
> - Various comment fixes.
>
> Changes v1->v2:
> - Rework API to stack_depot_save_flags(STACK_DEPOT_FLAG_GET) +
>   stack_depot_put.
> - Add CONFIG_STACKDEPOT_MAX_FRAMES Kconfig option.
> - Switch stack depot to using list_head's.
> - Assorted minor changes, see the commit message for each path.
>
> Andrey Konovalov (19):
>   lib/stackdepot: check disabled flag when fetching
>   lib/stackdepot: simplify __stack_depot_save
>   lib/stackdepot: drop valid bit from handles
>   lib/stackdepot: add depot_fetch_stack helper
>   lib/stackdepot: use fixed-sized slots for stack records

1. I know fixed-sized slots are need for eviction to work, but have
you evaluated if this causes some excessive memory waste now? Or is it
negligible?

If it turns out to be a problem, one way out would be to partition the
freelist into stack size classes; e.g. one for each of stack traces of
size 8, 16, 32, 64.

>   lib/stackdepot: fix and clean-up atomic annotations
>   lib/stackdepot: rework helpers for depot_alloc_stack
>   lib/stackdepot: rename next_pool_required to new_pool_required
>   lib/stackdepot: store next pool pointer in new_pool
>   lib/stackdepot: store free stack records in a freelist
>   lib/stackdepot: use read/write lock

2. I still think switching to the percpu_rwsem right away is the right
thing, and not actually a downside. I mentioned this before, but you
promised a follow-up patch, so I trust that this will happen. ;-)

>   lib/stackdepot: use list_head for stack record links
>   kmsan: use stack_depot_save instead of __stack_depot_save
>   lib/stackdepot, kasan: add flags to __stack_depot_save and rename
>   lib/stackdepot: add refcount for records
>   lib/stackdepot: allow users to evict stack traces
>   kasan: remove atomic accesses to stack ring entries
>   kasan: check object_size in kasan_complete_mode_report_info
>   kasan: use stack_depot_put for tag-based modes
>
>  include/linux/stackdepot.h |  59 ++++--
>  lib/Kconfig                |  10 +
>  lib/stackdepot.c           | 418 ++++++++++++++++++++++++-------------
>  mm/kasan/common.c          |   7 +-
>  mm/kasan/generic.c         |   9 +-
>  mm/kasan/kasan.h           |   2 +-
>  mm/kasan/report_tags.c     |  27 +--
>  mm/kasan/tags.c            |  24 ++-
>  mm/kmsan/core.c            |   7 +-
>  9 files changed, 365 insertions(+), 198 deletions(-)

Acked-by: Marco Elver <elver@google.com>

The series looks good in its current state. However, see my 2
higher-level comments above.
Andrey Konovalov Nov. 3, 2023, 9:37 p.m. UTC | #3
On Tue, Oct 24, 2023 at 3:14 PM Marco Elver <elver@google.com> wrote:
>
> 1. I know fixed-sized slots are need for eviction to work, but have
> you evaluated if this causes some excessive memory waste now? Or is it
> negligible?

With the current default stack depot slot size of 64 frames, a single
stack trace takes up ~3-4x on average compared to precisely sized
slots (KMSAN is closer to ~4x due to its 3-frame-sized linking
records).

However, as the tag-based KASAN modes evict old stack traces, the
average total amount of memory used for stack traces is ~0.5 MB (with
the default stack ring size of 32k entries).

I also have just mailed an eviction implementation for Generic KASAN.
With it, the stack traces take up ~1 MB per 1 GB of RAM while running
syzkaller (stack traces are evicted when they are flushed from
quarantine, and quarantine's size depends on the amount of RAM.)

The only problem is KMSAN. Based on a discussion with Alexander, it
might not be possible to implement the eviction for it. So I suspect,
with this change, syzbot might run into the capacity WARNING from time
to time.

The simplest solution would be to bump the maximum size of stack depot
storage to x4 if KMSAN is enabled (to 512 MB from the current 128 MB).
KMSAN requires a significant amount of RAM for shadow anyway.

Would that be acceptable?

> If it turns out to be a problem, one way out would be to partition the
> freelist into stack size classes; e.g. one for each of stack traces of
> size 8, 16, 32, 64.

This shouldn't be hard to implement.

However, as one of the perf improvements, I'm thinking of saving a
stack trace directly into a stack depot slot (to avoid copying it).
With this, we won't know the stack trace size before it is saved. So
this won't work together with the size classes.

> 2. I still think switching to the percpu_rwsem right away is the right
> thing, and not actually a downside. I mentioned this before, but you
> promised a follow-up patch, so I trust that this will happen. ;-)

First thing on my TODO list wrt perf improvements :)

> Acked-by: Marco Elver <elver@google.com>
>
> The series looks good in its current state. However, see my 2
> higher-level comments above.

Thank you!
Andrey Konovalov Nov. 6, 2023, 5:41 p.m. UTC | #4
On Fri, Nov 3, 2023 at 10:37 PM Andrey Konovalov <andreyknvl@gmail.com> wrote:
>
> On Tue, Oct 24, 2023 at 3:14 PM Marco Elver <elver@google.com> wrote:
> >
> > 1. I know fixed-sized slots are need for eviction to work, but have
> > you evaluated if this causes some excessive memory waste now? Or is it
> > negligible?
>
> With the current default stack depot slot size of 64 frames, a single
> stack trace takes up ~3-4x on average compared to precisely sized
> slots (KMSAN is closer to ~4x due to its 3-frame-sized linking
> records).
>
> However, as the tag-based KASAN modes evict old stack traces, the
> average total amount of memory used for stack traces is ~0.5 MB (with
> the default stack ring size of 32k entries).
>
> I also have just mailed an eviction implementation for Generic KASAN.
> With it, the stack traces take up ~1 MB per 1 GB of RAM while running
> syzkaller (stack traces are evicted when they are flushed from
> quarantine, and quarantine's size depends on the amount of RAM.)
>
> The only problem is KMSAN. Based on a discussion with Alexander, it
> might not be possible to implement the eviction for it. So I suspect,
> with this change, syzbot might run into the capacity WARNING from time
> to time.
>
> The simplest solution would be to bump the maximum size of stack depot
> storage to x4 if KMSAN is enabled (to 512 MB from the current 128 MB).
> KMSAN requires a significant amount of RAM for shadow anyway.
>
> Would that be acceptable?
>
> > If it turns out to be a problem, one way out would be to partition the
> > freelist into stack size classes; e.g. one for each of stack traces of
> > size 8, 16, 32, 64.
>
> This shouldn't be hard to implement.
>
> However, as one of the perf improvements, I'm thinking of saving a
> stack trace directly into a stack depot slot (to avoid copying it).
> With this, we won't know the stack trace size before it is saved. So
> this won't work together with the size classes.

On a second thought, saving stack traces directly into a stack depot
slot will require taking the write lock, which will badly affect
performance, or using some other elaborate locking scheme, which might
be an overkill.

> > 2. I still think switching to the percpu_rwsem right away is the right
> > thing, and not actually a downside. I mentioned this before, but you
> > promised a follow-up patch, so I trust that this will happen. ;-)
>
> First thing on my TODO list wrt perf improvements :)
>
> > Acked-by: Marco Elver <elver@google.com>
> >
> > The series looks good in its current state. However, see my 2
> > higher-level comments above.
>
> Thank you!