mbox series

[v3,0/3] kasan, arm64, scs: collect stack traces from Shadow Call Stack

Message ID cover.1649877511.git.andreyknvl@google.com (mailing list archive)
Headers show
Series kasan, arm64, scs: collect stack traces from Shadow Call Stack | expand

Message

andrey.konovalov@linux.dev April 13, 2022, 7:26 p.m. UTC
From: Andrey Konovalov <andreyknvl@google.com>

Currently, when saving alloc and free stack traces, KASAN uses the normal
stack trace collection routines, which rely on the unwinder.

Instead of invoking the unwinder, collect the stack trace by copying
frames from the Shadow Call Stack. This reduces boot time by ~30% for
all KASAN modes when Shadow Call Stack is enabled. See below for the
details of how the measurements were performed.

Stack staces are collected from the Shadow Call Stack via a new
stack_trace_save_shadow() interface.

Note that the implementation is best-effort and only works in certain
contexts. See patch #3 for details.

---

Changes
=======

v2->v3:
- Limit hardirq and drop SDEI support for performance and simplicity.
- Move stack_trace_save_shadow() implementation back to mm/kasan:
  it's not mature enough to be used as a system-wide stack trace
  collection replacement.
- Clarify -ENOSYS return value from stack_trace_save_shadow().
- Don't rename nr_entries to size in kasan_save_stack().
- Check return value of stack_trace_save_shadow() instead of checking
  CONFIG_HAVE_SHADOW_STACKTRACE in kasan_save_stack().

v1->v2:
- Provide a kernel-wide stack_trace_save_shadow() interface for collecting
  stack traces from shadow stack.
- Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
- Get SCS pointer from x18, as per-task value is meant to save the SCS
  value on CPU switches.
- Collect stack frames from SDEI and IRQ contexts.

Perf
====

To measure performance impact, I used QEMU in full system emulation mode
on an x86-64 host.

As proposed by Mark, I passed no filesystem to QEMU and booted with panic=-1:

qemu-system-aarch64 \
	-machine virt,mte=on -cpu max \
	-m 2G -smp 1 -nographic \
	-kernel ./xbins/Image \
	-append "console=ttyAMA0 earlyprintk=serial panic=-1" \
	-no-shutdown -no-reboot

Just in case, the QEMU version is:

$ qemu-system-aarch64 --version
QEMU emulator version 6.2.94 (v5.2.0-rc3-12124-g81c7ed41a1)
Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers

Then, I recorded the timestamp of when the "Kernel panic" line was printed
to the kernel log.

The measurements were done on 5 kernel flavors:

master                 (mainline commit a19944809fe99):
master-no-stack-traces (stack trace collection commented out)
master-no-stack-depot  (saving to stack depot commented out)
up-scs-stacks-v3       (collecting stack traces from SCS)
up-scs-stacks-v3-noscs (up-scs-stacks-v3 with __noscs marking)

(The last flavor is included just for the record: it produces an unexpected
 slowdown. The likely reason is that helper functions stop getting inlined.)

All the branches can be found here:

https://github.com/xairy/linux/branches/all

The measurements were performed for Generic and HW_TAGS KASAN modes.

The .configs are here (essentially, defconfig + SCS + KASAN):

Generic KASAN: https://gist.github.com/xairy/d527ad31c0b54898512c92898d62beed
HW_TAGS KASAN: https://gist.github.com/xairy/390e4ef0140de3f4f9a49efe20708d21

The results:

Generic KASAN
-------------

master-no-stack-traces: 8.03
master:                 11.55 (+43.8%)
master-no-stack-depot:  11.53 (+43.5%)
up-scs-stacks-v3:       8.31  (+3.4%)
up-scs-stacks-v3-noscs: 9.11  (+13.4%)

HW_TAGS KASAN
-------------

master-no-stack-traces: 3.31
master:                 5.01 (+51%)
master-no-stack-depot:  4.85 (+47%)
up-scs-stacks-v3:       3.49 (+5.4%)
up-scs-stacks-v3-noscs: 4.27 (+29%)

The deviation for all numbers above is ~0.05.

As can be seen, the up-scs-stacks-v3 flavor results in a significantly
faster boot compared to master.

Andrey Konovalov (3):
  arm64, scs: expose irq_shadow_call_stack_ptr
  kasan, arm64: implement stack_trace_save_shadow
  kasan: use stack_trace_save_shadow

 arch/arm64/include/asm/scs.h | 10 +++++-
 arch/arm64/kernel/irq.c      |  4 +--
 arch/arm64/kernel/sdei.c     |  3 --
 mm/kasan/common.c            | 66 +++++++++++++++++++++++++++++++++++-
 4 files changed, 75 insertions(+), 8 deletions(-)

Comments

Mark Rutland April 14, 2022, 12:36 p.m. UTC | #1
On Wed, Apr 13, 2022 at 09:26:43PM +0200, andrey.konovalov@linux.dev wrote:
> From: Andrey Konovalov <andreyknvl@google.com>
> 
> Currently, when saving alloc and free stack traces, KASAN uses the normal
> stack trace collection routines, which rely on the unwinder.
> 
> Instead of invoking the unwinder, collect the stack trace by copying
> frames from the Shadow Call Stack. This reduces boot time by ~30% for
> all KASAN modes when Shadow Call Stack is enabled. See below for the
> details of how the measurements were performed.
> 
> Stack staces are collected from the Shadow Call Stack via a new
> stack_trace_save_shadow() interface.
> 
> Note that the implementation is best-effort and only works in certain
> contexts. See patch #3 for details.
> 
> ---
> 
> Changes
> =======
> 
> v2->v3:
> - Limit hardirq and drop SDEI support for performance and simplicity.
> - Move stack_trace_save_shadow() implementation back to mm/kasan:
>   it's not mature enough to be used as a system-wide stack trace
>   collection replacement.
> - Clarify -ENOSYS return value from stack_trace_save_shadow().
> - Don't rename nr_entries to size in kasan_save_stack().
> - Check return value of stack_trace_save_shadow() instead of checking
>   CONFIG_HAVE_SHADOW_STACKTRACE in kasan_save_stack().
> 
> v1->v2:
> - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
>   stack traces from shadow stack.
> - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> - Get SCS pointer from x18, as per-task value is meant to save the SCS
>   value on CPU switches.
> - Collect stack frames from SDEI and IRQ contexts.
> 
> Perf
> ====
> 
> To measure performance impact, I used QEMU in full system emulation mode
> on an x86-64 host.

Just to be clear: QEMU TCG mode is *in no way* representative of HW
performance, and has drastically different performance characteristics
compared to real HW. Please be very clear when you are quoting
performance figures from QEMU TCG mode.

Previously you said you were trying to optimize this so that some
version of KASAN could be enabled in production builds, and the above is
not a suitable benchmark system for that.

Is that *actually* what you're trying to enable, or are you just trying
to speed up running instances under QEMU (e.g. for arm64 Syzkaller runs
on GCE)?

> As proposed by Mark, I passed no filesystem to QEMU and booted with panic=-1:
> 
> qemu-system-aarch64 \
> 	-machine virt,mte=on -cpu max \
> 	-m 2G -smp 1 -nographic \
> 	-kernel ./xbins/Image \
> 	-append "console=ttyAMA0 earlyprintk=serial panic=-1" \
> 	-no-shutdown -no-reboot
> 
> Just in case, the QEMU version is:
> 
> $ qemu-system-aarch64 --version
> QEMU emulator version 6.2.94 (v5.2.0-rc3-12124-g81c7ed41a1)

Thanks for this information, this makes it *much* easier to see what's
going on here.

As I suspected, you're hitting a known performance oddity with QEMU TCG
mode where pointer authentication is *incredibly* slow when using the
architected QARMA5 algorithm (enabled by default with `-cpu max`).

Practically speaking, when using TCG mode with `-cpu max`, PACIASP and
AUTIASP instructions in function prologues/epilogues take much longer
than they would on real hardware, and this means that those skew
performance with an overhead whcih scales with the number of function
calls. The regular unwinder is hit pretty bad by this because it has a
few function calls (including KASAN instrumentation), and the overhead
of this dominates the cost of the actual unwind logic. Your SCS unwinder
isn't as badly hit by that because it has fewer function calls.

This overhead has nothing to do with the *nature* of the unwinder, and
is an artifact of the *platform* and the *structure* of the code.
There's plenty that can be done to avoid that overhead, including basic
QEMU options which don't require kernel changes.

For example, if you replace:

	-cpu max

With:

	-cpu max,pauth-impdef=true

... QEMU will use an IMPLEMENTATION DEFINED pointer authentication
algorithm which is *much* faster than its implementation of QARMA5 (and
tests below show that make the kernel reach the panic in ~1/6 the time).

Since you're testing with shadow call stack, you could decide to disable
pointer authentication entirely. You can tell QEMU to not provide that
with:

	-cpu max,pauth=false

... which in tests below makes the kernel reach the panic in 1/9 the
time.

You could instead change your config to have
CONFIG_ARM64_PTR_AUTH_KERNEL=n, which would cause the kernel to be built
without the instructions which are causing the slowdown.

For comparisons below, I've built the same version of QEMU as you're
using. The numbering in that version string is a bit confusing, so I've
gone by the git commit ID:

| commit 81c7ed41a1b33031f3e4fe24191a998a492044b8 (HEAD, tag: v7.0.0-rc4, origin/master, origin/HEAD)
| Author:     Peter Maydell <peter.maydell@linaro.org>
| AuthorDate: Tue Apr 12 17:01:34 2022 +0100
| Commit:     Peter Maydell <peter.maydell@linaro.org>
| CommitDate: Tue Apr 12 17:01:34 2022 +0100
| 
|     Update version for v7.0.0-rc4 release
|     
|     Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
| 
| diff --git a/VERSION b/VERSION
| index 80d0afb063..9c57137cdf 100644
| --- a/VERSION
| +++ b/VERSION
| @@ -1 +1 @@
| -6.2.93
| +6.2.94

My host machine is an Intel Xeon E5-2660.

> Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
> 
> Then, I recorded the timestamp of when the "Kernel panic" line was printed
> to the kernel log.
> 
> The measurements were done on 5 kernel flavors:
> 
> master                 (mainline commit a19944809fe99):
> master-no-stack-traces (stack trace collection commented out)
> master-no-stack-depot  (saving to stack depot commented out)
> up-scs-stacks-v3       (collecting stack traces from SCS)
> up-scs-stacks-v3-noscs (up-scs-stacks-v3 with __noscs marking)
> 
> (The last flavor is included just for the record: it produces an unexpected
>  slowdown. The likely reason is that helper functions stop getting inlined.)

As above, that case is hitting the overhead of QEMU TCG mode's
incredibly slow pointer authentication.

> All the branches can be found here:
> 
> https://github.com/xairy/linux/branches/all
> 
> The measurements were performed for Generic and HW_TAGS KASAN modes.
> 
> The .configs are here (essentially, defconfig + SCS + KASAN):
> 
> Generic KASAN: https://gist.github.com/xairy/d527ad31c0b54898512c92898d62beed
> HW_TAGS KASAN: https://gist.github.com/xairy/390e4ef0140de3f4f9a49efe20708d21
> 
> The results:
> 
> Generic KASAN
> -------------
> 
> master-no-stack-traces: 8.03
> master:                 11.55 (+43.8%)
> master-no-stack-depot:  11.53 (+43.5%)
> up-scs-stacks-v3:       8.31  (+3.4%)
> up-scs-stacks-v3-noscs: 9.11  (+13.4%)

I made the same measurements, reading the timestamp on the panic
message. From my local results, the performance oddity with pointer
authentication dominates everything else, and by changing QEMU options
to minimize or eliminate that overhead the difference in unwinder
overhead becomes far less pronouced and the overall boot time is reduced
to a fraction of the time taken when the incredibly slow imlpementation
of pointer authentication is used:

Generic KASAN w/ `-cpu max`
---------------------------

master-no-stack-traces: 12.66
master:                 18.39 (+45.2%)
master-no-stack-depot:  17.85 (+40.1%)
up-scs-stacks-v3:       13.54 (+7.0%)

Generic KASAN w/ `-cpu max,pauth-impdef=true`
---------------------------------------------

master-no-stack-traces: 2.69
master:                 3.35 (+24.5%)
master-no-stack-depot:  3.54 (+31.5%)
up-scs-stacks-v3:       2.80 (+4.1%)

All results are 5x to 6x faster; the regular unwinder overhead is
reduced by ~20% relative to `-cpu max`.

Generic KASAN w/ `-cpu max,pauth=false`
---------------------------------------

master-no-stack-traces: 1.92
master:                 2.27  (+18.2%)
master-no-stack-depot:  2.22  (+15.6%)
up-scs-stacks-v3:       2.06  (+7.3%)

All results are 6x to 9x faster; the regular unwinder overhead is
reduced by 27% relative to `-cpu max`.

To speed up your QEMU runs, there are a couple of trivial options
available to you which'll improve your runs by ~6x to ~9x, which vastly
outstrips any benefit gained from changing the unwinder. I'd recommend
you at least consider moving to `pauth-impdef=true`.

While the SCS unwinder is still faster, the difference is nowhere near
as pronounced. As I mentioned before, there are changes that we can make
to the regular unwinder to close that gap somewhat, some of which I
intend to make as part of ongoing cleanup/rework in that area.

I haven't bothered testing HW_TAGS, because the performance
characteristics of emulated MTE are also nothing like that of a real HW
implementation.

So, given that and the problems I mentioned before, I don't think
there's a justification for adding a separate SCS unwinder. As before,
I'm still happy to try to make the regular unwinder faster (and I'm
happy to make changes which benefit QEMU TCG mode if those don't harm
the maintainability of the unwinder).

NAK to adding an SCS-specific unwinder, regardless of where in the
source tree that is placed.

Thanks,
Mark.
Mark Rutland April 14, 2022, 1:40 p.m. UTC | #2
On Thu, Apr 14, 2022 at 01:36:59PM +0100, Mark Rutland wrote:
> As I suspected, you're hitting a known performance oddity with QEMU TCG
> mode where pointer authentication is *incredibly* slow when using the
> architected QARMA5 algorithm (enabled by default with `-cpu max`).

> This overhead has nothing to do with the *nature* of the unwinder, and
> is an artifact of the *platform* and the *structure* of the code.
> There's plenty that can be done to avoid that overhead

FWIW, from a quick look, disabling KASAN instrumentation for the
stacktrace object alone (with no other changes) has a significant impact
(compounded by the TCG QARMA5 slowdown), and I note that x86 doesn't
both instrumenting its stacktrace code anyway, so we could consider
doing likewise.

Atop that, replacing set_bit() with __set_bit() brings the regular
unwinder *really* close to the earlier SCS unwinder figures. I know that
the on_accessible_stack() calculations and checks could be ammortized
with some refactoring (which I'd planned to do anyway), so I think it's
plausible that with some changes to the existing unwinder we can bring
the difference into the noise.

> generic kasan w/ `-cpu max`
> ---------------------------
> 
> master-no-stack-traces: 12.66
> master:                 18.39 (+45.2%)
> master-no-stack-depot:  17.85 (+40.1%)
> up-scs-stacks-v3:       13.54 (+7.0%)

master-noasan:            15.67 (+23.8%)
master-noasan-__set_bit:  14.61 (+15.5%)

> Generic KASAN w/ `-cpu max,pauth-impdef=true`
> ---------------------------------------------
> 
> master-no-stack-traces: 2.69
> master:                 3.35 (+24.5%)
> master-no-stack-depot:  3.54 (+31.5%)
> up-scs-stacks-v3:       2.80 (+4.1%)

master-noasan:            3.05 (+13.0%)
master-noasan-__set_bit:  2.96 (+10.0%)

> Generic KASAN w/ `-cpu max,pauth=false`
> ---------------------------------------
> 
> master-no-stack-traces: 1.92
> master:                 2.27  (+18.2%)
> master-no-stack-depot:  2.22  (+15.6%)
> up-scs-stacks-v3:       2.06  (+7.3%)

master-noasan:             2.14 (+11.4%)
master-noasan-__set_bit:   2.10 (+9.4%)

Thanks,
Mark.
Andrey Konovalov May 21, 2022, 10:30 p.m. UTC | #3
On Thu, Apr 14, 2022 at 2:37 PM Mark Rutland <mark.rutland@arm.com> wrote:
>

Hi Mark,

Sorry for the delayed response, it took some time getting my hands on
hardware for testing these changes.

> Just to be clear: QEMU TCG mode is *in no way* representative of HW
> performance, and has drastically different performance characteristics
> compared to real HW. Please be very clear when you are quoting
> performance figures from QEMU TCG mode.
>
> Previously you said you were trying to optimize this so that some
> version of KASAN could be enabled in production builds, and the above is
> not a suitable benchmark system for that.

Understood.

My expectation was that performance numbers from QEMU would be close
to hardware. I knew that there are instructions that take longer to be
emulated, but I expected that they would be uniformly spread across
the code.

However, your explanation proved this wrong. This indeed doesn't apply
when measuring the performance of a piece of code with a different
density of function calls.

Thank you for the detailed explanation! Those QEMU arguments will
definitely be handy when I need a faster QEMU setup.

> Is that *actually* what you're trying to enable, or are you just trying
> to speed up running instances under QEMU (e.g. for arm64 Syzkaller runs
> on GCE)?

No, I'm not trying to speed up QEMU. QEMU was just the only setup that
I had access to at that moment.

The goal is to allow enabling stack trace collection in production on
HW_TAGS-enabled devices once those are created.

[...]

> While the SCS unwinder is still faster, the difference is nowhere near
> as pronounced. As I mentioned before, there are changes that we can make
> to the regular unwinder to close that gap somewhat, some of which I
> intend to make as part of ongoing cleanup/rework in that area.

I tried running the same experiments on Pixel 6.

Unfortunately, I was only able to test the OUTLINE SW_TAGS mode
(without STACK instrumentation, as HW_TAGS doesn't support STACK at
the moment.) All of the other modes either fail to flash or fail to
boot with AOSP on Pixel 6 :(

The results are (timestamps were measured when "ALSA device list" was
printed to the kernel log):

sw_tags outline nostacks: 2.218
sw_tags outline: 2.516 (+13.4%)
sw_tags outline nosanitize: 2.364 (+6.5%)
sw_tags outline nosanitize __set_bit: 2.364 (+6.5%)
sw_tags outline nosanitize scs: 2.236 (+0.8%)

Used markings:

nostacks: patch from master-no-stack-traces applied
nosanitize: KASAN_SANITIZE_stacktrace.o := n
__set_bit: set_bit -> __set_bit change applied
scs: patches from up-scs-stacks-v3 applied

First, disabling instrumentation of stacktrace.c is indeed a great
idea for software KASAN modes! I will send a patch for this later.

Changing set_bit to __set_bit seems to make no difference on Pixel 6.

The awesome part is that the overhead of collecting stack traces with
SCS and even saving them into the stack depot is less than 1%.

However once again note, that this is for OUTLINE SW_TAGS without STACK.

> I haven't bothered testing HW_TAGS, because the performance
> characteristics of emulated MTE are also nothing like that of a real HW
> implementation.
>
> So, given that and the problems I mentioned before, I don't think
> there's a justification for adding a separate SCS unwinder. As before,
> I'm still happy to try to make the regular unwinder faster (and I'm
> happy to make changes which benefit QEMU TCG mode if those don't harm
> the maintainability of the unwinder).
>
> NAK to adding an SCS-specific unwinder, regardless of where in the
> source tree that is placed.

I see.

Perhaps, it makes sense to wait until there's HW_TAGS-enabled hardware
available before continuing to look into this. At the end, the
performance overhead for that setup is what matters.

I'll look into improving the performance of the existing unwinder a
bit more. However, I don't think I'll be able to speed it up to < 1%.
Which means that we'll likely need a sample-based approach for HW_TAGS
stack collection to reduce the overhead.

Thank you!