[v2,0/4] kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack

Message ID cover.1648049113.git.andreyknvl@google.com (mailing list archive)

Message

andrey.konovalov@linux.dev March 23, 2022, 3:32 p.m. UTC
From: Andrey Konovalov <andreyknvl@google.com>

kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack

Currently, KASAN always uses the normal stack trace collection routines,
which rely on the unwinder, when saving alloc and free stack traces.

Instead of invoking the unwinder, collect the stack trace by copying
frames from the Shadow Call Stack whenever it is enabled. This reduces
boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.

Stack traces are collected from the Shadow Call Stack via a new
stack_trace_save_shadow() interface.

Note that the frame of the interrupted function is not included in
the stack trace, as it is not yet saved on the SCS when an interrupt
happens.
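
For illustration, the core of the approach looks roughly like the sketch
below. This is a simplified sketch rather than the actual patch: the
task_scs() lookup, the inline read of x18 and the frame ordering are
assumptions made for brevity, and the IRQ/SDEI shadow stacks are ignored
here.

/*
 * Simplified sketch: collect a stack trace by copying return addresses
 * from the Shadow Call Stack instead of unwinding frame pointers.
 * Covers the current task only; IRQ/SDEI shadow stacks are not handled.
 */
#include <linux/compiler.h>
#include <linux/sched.h>
#include <linux/scs.h>
#include <asm/pointer_auth.h>

static noinline unsigned int scs_save_sketch(unsigned long *store,
                                             unsigned int size)
{
        unsigned long *base = task_scs(current);
        unsigned long *p, *top;
        unsigned int len = 0;

        /* On arm64, x18 points at the current SCS top (the stack grows up). */
        asm volatile("mov %0, x18" : "=r" (top));

        /* Copy return addresses, most recent first, stripping PAC bits. */
        for (p = top - 1; p >= base && len < size; p--)
                store[len++] = ptrauth_strip_insn_pac(READ_ONCE_NOCHECK(*p));

        return len;
}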

---

To deal with this last limitation, we could save the interrupted frame address
in another per-CPU variable. I'll look into implementing this for v3.

I decided to postpone the changes to stack depot that avoid copying
frames twice until a planned upcoming update for stack depot.

Changes v1->v2:
- Provide a kernel-wide stack_trace_save_shadow() interface for collecting
  stack traces from shadow stack.
- Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
- Get the SCS pointer from x18, as the per-task value is meant to save the SCS
  value on CPU switches.
- Collect stack frames from SDEI and IRQ contexts.

Andrey Konovalov (4):
  stacktrace: add interface based on shadow call stack
  arm64, scs: save scs_sp values per-cpu when switching stacks
  arm64: implement stack_trace_save_shadow
  kasan: use stack_trace_save_shadow

 arch/Kconfig                       |  6 +++
 arch/arm64/Kconfig                 |  1 +
 arch/arm64/include/asm/assembler.h | 12 +++++
 arch/arm64/include/asm/scs.h       | 13 ++++-
 arch/arm64/kernel/entry.S          | 28 ++++++++--
 arch/arm64/kernel/irq.c            |  4 +-
 arch/arm64/kernel/sdei.c           |  5 +-
 arch/arm64/kernel/stacktrace.c     | 83 ++++++++++++++++++++++++++++++
 include/linux/stacktrace.h         | 15 ++++++
 kernel/stacktrace.c                | 21 ++++++++
 mm/kasan/common.c                  |  9 ++--
 11 files changed, 183 insertions(+), 14 deletions(-)

Comments

Marco Elver March 28, 2022, 12:36 p.m. UTC | #1
On Wed, 23 Mar 2022 at 16:33, <andrey.konovalov@linux.dev> wrote:
>
> From: Andrey Konovalov <andreyknvl@google.com>
>
> kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack
>
> Currently, KASAN always uses the normal stack trace collection routines,
> which rely on the unwinder, when saving alloc and free stack traces.
>
> Instead of invoking the unwinder, collect the stack trace by copying
> frames from the Shadow Call Stack whenever it is enabled. This reduces
> boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.
>
> Stack staces are collected from the Shadow Call Stack via a new
> stack_trace_save_shadow() interface.
>
> Note that the frame of the interrupted function is not included into
> the stack trace, as it is not yet saved on the SCS when an interrupt
> happens.
>
> ---
>
> To deal with this last thing, we could save the interrupted frame address
> in another per-CPU variable. I'll look into implementing this for v3.
>
> I decided to postpone the changes to stack depot that avoid copying
> frames twice until a planned upcoming update for stack depot.

That's fair.

> Changes v1->v2:
> - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
>   stack traces from shadow stack.
> - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> - Get SCS pointer from x18, as per-task value is meant to save the SCS
>   value on CPU switches.
> - Collect stack frames from SDEI and IRQ contexts.

Do any of these new changes introduce new (noticeable) overhead (in
particular patch 2)?
Andrey Konovalov March 29, 2022, 6:36 p.m. UTC | #2
On Mon, Mar 28, 2022 at 2:36 PM Marco Elver <elver@google.com> wrote:
>
> > Changes v1->v2:
> > - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
> >   stack traces from shadow stack.
> > - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> > - Get SCS pointer from x18, as per-task value is meant to save the SCS
> >   value on CPU switches.
> > - Collect stack frames from SDEI and IRQ contexts.
>
> Do any of these new changes introduce new (noticeable) overhead (in
> particular patch 2)?

I'll measure the overheads and include the results into v3. Thanks!
Andrey Konovalov March 29, 2022, 8:11 p.m. UTC | #3
On Tue, Mar 29, 2022 at 8:36 PM Andrey Konovalov <andreyknvl@gmail.com> wrote:
>
> On Mon, Mar 28, 2022 at 2:36 PM Marco Elver <elver@google.com> wrote:
> >
> > > Changes v1->v2:
> > > - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
> > >   stack traces from shadow stack.
> > > - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> > > - Get SCS pointer from x18, as per-task value is meant to save the SCS
> > >   value on CPU switches.
> > > - Collect stack frames from SDEI and IRQ contexts.
> >
> > Do any of these new changes introduce new (noticeable) overhead (in
> > particular patch 2)?
>
> I'll measure the overheads and include the results into v3. Thanks!

Hm, it looks like the overhead is too significant: ~5%. I'll explore a
different approach in v3 instead.
Mark Rutland March 31, 2022, 9:54 a.m. UTC | #4
Hi Andrey,

On Wed, Mar 23, 2022 at 04:32:51PM +0100, andrey.konovalov@linux.dev wrote:
> From: Andrey Konovalov <andreyknvl@google.com>
> 
> kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack
> 
> Currently, KASAN always uses the normal stack trace collection routines,
> which rely on the unwinder, when saving alloc and free stack traces.
> 
> Instead of invoking the unwinder, collect the stack trace by copying
> frames from the Shadow Call Stack whenever it is enabled. This reduces
> boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.

That is an impressive number. TBH, I'm shocked that this has *that* much of an
improvement, and I suspect this means we're doing something unnecessarily
expensive in the regular unwinder.

I've given some specific comments on the patches, but at a high level, I don't want
to add yet another unwind mechanism. For maintenance and correctness reasons,
we've spent the last few years consolidating various unwinders, which this
unfortunately goes against.

I see that there are a number of cases this unwinder will fall afoul of (e.g.
kretprobes and ftrace graph trampolines), and making those work correctly will
require changes elsewhere (e.g. as we rely upon a snapshot of the FP to
disambiguate cases today).
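
Roughly (paraphrasing today's arch/arm64/kernel/stacktrace.c from memory, so
treat the exact calls below as illustrative rather than authoritative), the FP
unwinder recovers the real return address behind those trampolines by using
the on-stack FP as a key:

#include <linux/ftrace.h>
#include <linux/kprobes.h>
#include <asm/stacktrace.h>

/* Illustrative only: how the FP unwinder undoes trampoline rewrites. */
static unsigned long unwind_recover_pc(struct task_struct *tsk,
                                       struct stackframe *frame)
{
        unsigned long pc = frame->pc;

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
        if (pc == (unsigned long)return_to_handler)
                pc = ftrace_graph_ret_addr(tsk, NULL, pc, (void *)frame->fp);
#endif
#ifdef CONFIG_KRETPROBES
        if (is_kretprobe_trampoline(pc))
                pc = kretprobe_find_ret_addr(tsk, (void *)frame->fp,
                                             &frame->kr_cur);
#endif
        return pc;
}

An SCS entry alone doesn't carry that FP snapshot, so the same disambiguation
isn't possible there without extra state.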

I'm also very much not keen on having to stash things in the entry assembly for
this distinct unwinder.

Going forward, I'm also planning on making changes to the way we unwind across
exception boundaries (e.g. to report the LR and FP), and as that depends on
finding the pt_regs based on the FP, that's not going to work with SCS.

So at a high level, I don't want to add an SCS based unwinder.

However, I'm very much open to how we could improve the standard unwinder to be
faster, which would be more generally beneficial. I can see that there are some
things we could reasonably do with simple refactoring.

Thanks,
Mark.

> Stack staces are collected from the Shadow Call Stack via a new
> stack_trace_save_shadow() interface.
> 
> Note that the frame of the interrupted function is not included into
> the stack trace, as it is not yet saved on the SCS when an interrupt
> happens.
> 
> ---
> 
> To deal with this last thing, we could save the interrupted frame address
> in another per-CPU variable. I'll look into implementing this for v3.
> 
> I decided to postpone the changes to stack depot that avoid copying
> frames twice until a planned upcoming update for stack depot.
> 
> Changes v1->v2:
> - Provide a kernel-wide stack_trace_save_shadow() interface for collecting
>   stack traces from shadow stack.
> - Use ptrauth_strip_insn_pac() and READ_ONCE_NOCHECK, see the comments.
> - Get SCS pointer from x18, as per-task value is meant to save the SCS
>   value on CPU switches.
> - Collect stack frames from SDEI and IRQ contexts.
> 
> Andrey Konovalov (4):
>   stacktrace: add interface based on shadow call stack
>   arm64, scs: save scs_sp values per-cpu when switching stacks
>   arm64: implement stack_trace_save_shadow
>   kasan: use stack_trace_save_shadow
> 
>  arch/Kconfig                       |  6 +++
>  arch/arm64/Kconfig                 |  1 +
>  arch/arm64/include/asm/assembler.h | 12 +++++
>  arch/arm64/include/asm/scs.h       | 13 ++++-
>  arch/arm64/kernel/entry.S          | 28 ++++++++--
>  arch/arm64/kernel/irq.c            |  4 +-
>  arch/arm64/kernel/sdei.c           |  5 +-
>  arch/arm64/kernel/stacktrace.c     | 83 ++++++++++++++++++++++++++++++
>  include/linux/stacktrace.h         | 15 ++++++
>  kernel/stacktrace.c                | 21 ++++++++
>  mm/kasan/common.c                  |  9 ++--
>  11 files changed, 183 insertions(+), 14 deletions(-)
> 
> -- 
> 2.25.1
>
Mark Rutland March 31, 2022, 12:39 p.m. UTC | #5
On Thu, Mar 31, 2022 at 10:54:08AM +0100, Mark Rutland wrote:
> On Wed, Mar 23, 2022 at 04:32:51PM +0100, andrey.konovalov@linux.dev wrote:
> > From: Andrey Konovalov <andreyknvl@google.com>
> > 
> > kasan, arm64, scs, stacktrace: collect stack traces from Shadow Call Stack
> > 
> > Currently, KASAN always uses the normal stack trace collection routines,
> > which rely on the unwinder, when saving alloc and free stack traces.
> > 
> > Instead of invoking the unwinder, collect the stack trace by copying
> > frames from the Shadow Call Stack whenever it is enabled. This reduces
> > boot time by 30% for all KASAN modes when Shadow Call Stack is enabled.
> 
> That is an impressive number. TBH, I'm shocked that this has *that* much of an
> improvement, and I suspect this means we're doing something unnecssarily
> expensive in the regular unwinder.

I've had a quick look into this, to see what we could do to improve the regular
unwinder, but I can't reproduce that 30% number.

In local testing the worst case I could get to was 6-13% (with both the
stacktrace *and* stackdepot logic hacked out entirely).

I'm testing with clang 13.0.0 from the llvm.org binary releases, with defconfig
+ SHADOW_CALL_STACK + KASAN_<option>, using a very recent snapshot of mainline
(commit d888c83fcec75194a8a48ccd283953bdba7b2550). I'm booting a
KVM-accelerated QEMU VM on ThunderX2 with "init=/sbin/reboot -- -f" in the
kernel bootargs, timing the whole run from the outside with "perf stat --null".

The 6% figure is if I count boot as a whole including VM startup and teardown
(i.e. an under-estimate of the proportion); the 13% figure is if I subtract a
baseline timing from a run without KASAN (i.e. an over-estimate of the
proportion).

Could you let me know how you're measuring this, and which platform+config
you're using?

I'll have a play with some configs in case there's a pathological
configuration, but if you could let me know how/what you're testing that'd be a
great help.

Thanks,
Mark.
Andrey Konovalov April 5, 2022, 3:09 p.m. UTC | #6
On Thu, Mar 31, 2022 at 11:54 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> That is an impressive number. TBH, I'm shocked that this has *that* much of an
> improvement, and I suspect this means we're doing something unnecssarily
> expensive in the regular unwinder.
>
> I've given some specific comments on patches, but a a high-level, I don't want
> to add yet another unwind mechanism. For maintenance and correctness reasons,
> we've spent the last few years consolidating various unwinders, which this
> unfortunately goes against.
>
> I see that there are number of cases this unwinder will fall afoul of (e.g.
> kretprobes and ftrace graph trampolines), and making those work correctly will
> require changes elsewhere (e.g. as we rely upon a snapshot of the FP to
> disambiguate cases today).

Do I understand correctly that kretprobes and ftrace modify frames
saved on SCS? So, if either is enabled, SCS frames might contain their
addresses instead of actual PCs?

If so, this is good enough for our use case. Having kretprobes or
ftrace enabled is an unusual setting and there's no requirement to
support it.

The goal is to have stack trace collection working in most cases
during a normal usage of an Android device. Being not feature-complete
and not covering all possible peculiar cases is fine.

> I'm also very much not keen on having to stash things in the entry assembly for
> this distinct unwinder.

I'll drop these changes; I'll respond on that patch.

> Going forward, I'm also planning on making changes to the way we unwind across
> exception boundaries (e.g. to report the LR and FP), and as that depends on
> finding the pt_regs based on the FP, that's not going to work with SCS.
>
> So at a high level, I don't want to add an SCS based unwinder.
>
> However, I'm very much open to how we could improve the standard unwinder to be
> faster, which would be more generally beneficial. I can see that there are some
> things we could reasonably do with simple refactoring.

The intention of adding an SCS-based unwinder is to use it in production
together with MTE-based KASAN. Thus, it needs to be as fast as
possible. I doubt even a very optimized FP-based unwinder will compare
with a simple loop over SCS frames.

It seems a pity to let the kernel maintain the current call trace via
SCS and then not to use it to collect stack traces.

Would it be acceptable if we keep the SCS unwinder code in mm/kasan
and not integrate it with the common stacktrace mechanisms?

Thanks!
Andrey Konovalov April 5, 2022, 3:10 p.m. UTC | #7
On Thu, Mar 31, 2022 at 2:39 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> I've had a quick look into this, to see what we could do to improve the regular
> unwinder, but I can't reproduce that 30% number.
>
> In local testing the worst can I could get to was 6-13% (with both the
> stacktrace *and* stackdepot logic hacked out entirely).
>
> I'm testing with clang 13.0.0 from the llvm.org binary releases, with defconfig
> + SHADOW_CALL_STACK + KASAN_<option>, using a very recent snapshot of mainline
> (commit d888c83fcec75194a8a48ccd283953bdba7b2550). I'm booting a
> KVM-accelerated QEMU VM on ThunderX2 with "init=/sbin/reboot -- -f" in the
> kernel bootargs, timing the whole run from the outside with "perf stat --null".
>
> The 6% figure is if I count boot as a whole including VM startup and teardown
> (i.e. an under-estimate of the proportion), the 13% figure is if I subtract a
> baseline timing from a run without KASAN (i.e. an over-estimate of the
> proportion).

I think this is the reason for the limited improvement that you
observe. If you measure the time throughout VM startup and teardown,
you include the time required for userspace apps, which is irrelevant.

I measure boot time until a certain point during kernel boot. E.g.,
with the attached config, I measure the time until test_meminit starts
running.

It takes 6 seconds for the kernel to reach test_meminit as is, and 4
seconds with kasan_save_stack() commented out. Only commenting out
__stack_depot_save() gives 5.9 seconds, so stack_trace_save() is the
slow part.
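
For reference, a hypothetical way to get such a timestamp into the log (not
necessarily exactly how I measure it) is a marker initcall that prints the
boot-clock time around the same point in boot:

#include <linux/init.h>
#include <linux/printk.h>
#include <linux/timekeeping.h>

/* Hypothetical boot-time marker; runs alongside the other late initcalls. */
static int __init boot_marker(void)
{
        pr_info("boot marker: %llu ns since boot\n", ktime_get_boot_fast_ns());
        return 0;
}
late_initcall(boot_marker);

The dmesg timestamp of that line then gives the boot time up to that point.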

> Could you let me know how you're measuring this, and which platform+config
> you're using?

I've attached the config that I use. It's essentially defconfig + SCS
+ KASAN + maybe a few other options.

> I'll have a play with some configs in case there's a pathological
> configuration, but if you could let me know how/what you're testing that'd be a
> great help.

Thanks!
Mark Rutland April 7, 2022, 6:41 p.m. UTC | #8
On Tue, Apr 05, 2022 at 05:10:02PM +0200, Andrey Konovalov wrote:
> On Thu, Mar 31, 2022 at 2:39 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > I've had a quick look into this, to see what we could do to improve the regular
> > unwinder, but I can't reproduce that 30% number.
> >
> > In local testing the worst can I could get to was 6-13% (with both the
> > stacktrace *and* stackdepot logic hacked out entirely).
> >
> > I'm testing with clang 13.0.0 from the llvm.org binary releases, with defconfig
> > + SHADOW_CALL_STACK + KASAN_<option>, using a very recent snapshot of mainline
> > (commit d888c83fcec75194a8a48ccd283953bdba7b2550). I'm booting a
> > KVM-accelerated QEMU VM on ThunderX2 with "init=/sbin/reboot -- -f" in the
> > kernel bootargs, timing the whole run from the outside with "perf stat --null".
> >
> > The 6% figure is if I count boot as a whole including VM startup and teardown
> > (i.e. an under-estimate of the proportion), the 13% figure is if I subtract a
> > baseline timing from a run without KASAN (i.e. an over-estimate of the
> > proportion).
> 
> I think this is the reason for the limited improvement that you
> observe. If you measure the time throughout VM startup and teardown,
> you include the time required for userspace apps, which is irrelevant.

Yes, that's the case for the 6% figure. However I also mentioned how I
accounted for that to get the 13% figure, which does not include those
irrelevant timings (and is an over-estimate of that proportion).

I think the discrepancy lies elsewhere, e.g. kernel version, boot arguments,
platform (and hence driver / subsystem behaviour), etc.

Can you share any of those details? Are you able to test with a mainline
kernel, e.g. v5.18-rc1?

I have a bunch of numbers below, and I think those imply one of the following:

* In your test setup, there are significantly more allocs/frees for which a
  stacktrace is being recorded. That could be down to the platform you're
  testing on, and the drivers that are in use.

* In your test setup, for some reason, some aspect of the stacktrace is
  significantly more expensive than in my setup. There's the potential that a
  hardware quirk has some impact here, so knowing which hardware you're testing
  on would be very helpful.

* There is a secondary effect at play. There are a number of potential things
  here (e.g. console accesses, the number of stacktraces taken in interrupt
  context, etc).

I'd like to figure out which (if any) of those apply.

> I measure boot time until a certain point during kernel boot. E.g.,
> with the attached config, I measure the time until test_meminit start
> running.

How exactly are you measuring the point at which test_meminit() starts running?
Are you looking through dmesg, or passing some debug options? I ask because
that doesn't seem to dump anything into dmesg until at least one test has run.

FWIW, I was measuring the kernel boot up to the point we'd run the userspace
init program, by booting the kernel with:

	init=/sbin/reboot -- -f

... which I think is more representative of the full boot time.

I can instead avoid that by not passing a filesystem and booting with:

	panic=-1

... to trigger an instant reboot when we'd normally mount the filesystem, but
the numbers are similar either way for me.

I've followed the latter approach for my numbers below, since it's easier to
reproduce.

> It takes 6 seconds for the kernel to reach test_meminit as is, and 4
> seconds with kasan_save_stack() commented out. Only commenting out
> __stack_depot_save() gives 5.9 seconds, so stack_trace_save() is the
> slow part.

As above, how are you measuring this?

... and since your config has CONFIG_KASAN_KUNIT_TEST=y, which console
options (e.g. 'quiet', 'loglevel=') are you passing on the kernel command line?

I ask because in my local testing, that option results in a load of test
results being dumped to the console, and the time taken to do so dominates
everything else. Hacking out the stack sampling reduces the amount of data we
log to the console, and this secondary effect reduces boot time.

> > Could you let me know how you're measuring this, and which platform+config
> > you're using?
> 
> I've attached the config that I use. It's essentially defconfig + SCS
> + KASAN + maybe a few other options.
> 
> > I'll have a play with some configs in case there's a pathological
> > configuration, but if you could let me know how/what you're testing that'd be a
> > great help.

I'm afraid from local testing (atop v5.18-rc1), with your config, I still can't
get anywhere near your figures. I've tried to match toolchain versions with
what was in your .config file, so I'm using clang 12.0.0 from the llvm.org
binary releases, and binutils from the kernel.org crosstool 11.1.0 release.

I took baselines with defconfig and defconfig + SHADOW_CALL_STACK, with console
output completely suppressed with 'quiet loglevel=0':

| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image               
| 
|  Performance counter stats for
|  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
|  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
|  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image -append
|  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
| 
|        0.512626031 seconds time elapsed                                          ( +-  0.26% )
| 
| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image
| 
|  Performance counter stats for
|  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
|  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
|  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image -append
|  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
| 
|        0.523245952 seconds time elapsed                                          ( +-  0.18% )

Then I tried the same with your config, without your patches:

| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image
| 
|  Performance counter stats for
|  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
|  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
|  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
|  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
| 
|        1.994692366 seconds time elapsed                                          ( +-  0.05% )

Then with your config, without your patches, with the stacktrace hacked out:

| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image            
| 
|  Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
| -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
| 
|        1.861823869 seconds time elapsed                                          ( +-  0.05% )

If I use those numbers to estimate the proportion of time spent stacktracing,
with the baseline SCS number discounted to remove the hypervisor+VMM overheads,
I get:

	(1.994692366 - 0.523245952) - (1.861823869 - 0.523245952)
        ---------------------------------------------------------  = 0.09029788358
	(1.994692366 - 0.523245952)

So roughly 9% when I try to maximize that figure. When actually poking hardware
and doing real work, that figure goes down. For example, if just using "quiet":

| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image > /dev/null
| 
|  Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
| loglevel=9 earlycon panic=-1 quiet' (50 runs):
| 
|        4.653286475 seconds time elapsed                                          ( +-  0.06% )

| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image > /dev/null
| 
|  Performance counter stats for
|  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
|  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
|  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
|  -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
| 
|        4.585750154 seconds time elapsed                                          ( +-  0.05% )

Which gives an estimate of:

	(4.653286475 - 0.523245952) - (4.585750154 - 0.523245952)
	---------------------------------------------------------  = 0.01635245964
	(4.653286475 - 0.523245952)

... or ~1.6% of time spent backtracing.

FWIW, applying your patches does show some benefit, but not as drastic as I was
expecting:

With console output suppressed:

| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image               
| 
|  Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
| -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
| 
|        1.920300410 seconds time elapsed                                          ( +-  0.05% )

... down from ~9% to ~4%

With console output merely reduced:

| [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image > /dev/null
| 
|  Performance counter stats for
| '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
| -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
| /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
| -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
| 
|        4.611277833 seconds time elapsed                                          ( +-  0.04% )

... down from 1.6% to 0.6%

Given the above I still think we need to understand this a bit better before we
consider pursuing the SCS unwinder, given the issues I laid out in my prior mails.

My hope is that we can improve the regular unwinder or other code such that
this becomes moot. I'm aware of a few things we could try, but given it's very
easy to sink a lot of time and effort into this, I'd like to first get some
more details, as above.

Thanks,
Mark.
Andrey Konovalov April 13, 2022, 7:28 p.m. UTC | #9
On Thu, Apr 7, 2022 at 8:42 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> I'm afraid from local testing (atop v5.18-rc1), with your config, I still can't
> get anywhere near your figures. I've tried to match toolchain versions with
> what was in your .config file, so I'm using clang 12.0.0 from the llvm.org
> binary releases, and binutils from the kernel.org crosstool 11.1.0 release.
>
> I took baselines with defconfig and defconfig + SHADOW_CALL_STACK, with console
> output completely suppressed with 'quiet loglevel=0':
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image
> |
> |  Performance counter stats for
> |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image -append
> |  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> |        0.512626031 seconds time elapsed                                          ( +-  0.26% )
> |
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image
> |
> |  Performance counter stats for
> |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image -append
> |  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> |        0.523245952 seconds time elapsed                                          ( +-  0.18% )
>
> Then I tried the same with your config, without your patches:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image
> |
> |  Performance counter stats for
> |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> |  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> |        1.994692366 seconds time elapsed                                          ( +-  0.05% )
>
> Then with your config, without your patches, with the stacktrace hacked out:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> |
> |  Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> |        1.861823869 seconds time elapsed                                          ( +-  0.05% )
>
> If I use those number to estimate the proportion of time spent stacktracing,
> with the baseline SCS number discounted to remove the hypervisor+VMM overheads,
> I get:
>
>         (1.994692366 - 0.523245952) - (1.861823869 - 0.523245952)
>         ---------------------------------------------------------  = 0.09029788358
>         (1.994692366 - 0.523245952)
>
> So roughly 9% when I try to maximize that figure. When actually poking hardware
> and doing real work, that figure goes down. For example, if just using "quiet":
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image > /dev/null
> |
> |  Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> | loglevel=9 earlycon panic=-1 quiet' (50 runs):
> |
> |        4.653286475 seconds time elapsed                                          ( +-  0.06% )
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image > /dev/null
> |
> |  Performance counter stats for
> |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> |  -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> |
> |        4.585750154 seconds time elapsed                                          ( +-  0.05% )
>
> Which gives an estimate of:
>
>         (4.653286475 - 0.523245952) - (4.585750154 - 0.523245952)
>         ---------------------------------------------------------  = 0.01635245964
>         (4.653286475 - 0.523245952)
>
> ... or ~1.6% time spent backtracing:
>
> FWIW, applying your patches do show some benefit, but not as drastic as I was
> expecting:
>
> With console output suprressed:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> |
> |  Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> |
> |        1.920300410 seconds time elapsed                                          ( +-  0.05% )
>
> ... down from ~9% to ~4%
>
> With console output merely reduced:
>
> | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image > /dev/null
> |
> |  Performance counter stats for
> | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> | -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> |
> |        4.611277833 seconds time elapsed                                          ( +-  0.04% )
>
> ... down from 1.6% to 0.6%
>
> Given the above I still think we need to understand this a bit better before we
> consider pursuing the SCS unwinder, given the issues I laid out in my prior mails.
>
> My hope is that we can improve the regular unwinder or other code such that
> this becomes moot. I'm aware of a few things we could try, but given it's very
> easy to sink a lot of time and effort into this, I'd like to first get some
> more details, as above.

Hi Mark,

I'm about to publish v3, where I'll include a detailed description of
how I measured the performance.

Perhaps we see different performance numbers because you're using
a KVM-enabled VM on an Arm host and I'm using QEMU on an x86-64 host.
Still, it's suspicious that the difference is so drastic. I'll try
to get my hands on some Arm hardware in the next few days and do the
measurements there.

This new version will also not be making any changes to the entry
code, as these changes add unwanted additional slowdown. It would be
great if you could check the performance impact of v3 with your
setup.

Thanks!
Mark Rutland April 14, 2022, 7:02 a.m. UTC | #10
On Wed, Apr 13, 2022 at 09:28:00PM +0200, Andrey Konovalov wrote:
> On Thu, Apr 7, 2022 at 8:42 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > I'm afraid from local testing (atop v5.18-rc1), with your config, I still can't
> > get anywhere near your figures. I've tried to match toolchain versions with
> > what was in your .config file, so I'm using clang 12.0.0 from the llvm.org
> > binary releases, and binutils from the kernel.org crosstool 11.1.0 release.
> >
> > I took baselines with defconfig and defconfig + SHADOW_CALL_STACK, with console
> > output completely suppressed with 'quiet loglevel=0':
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image
> > |
> > |  Performance counter stats for
> > |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig/Image -append
> > |  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > |        0.512626031 seconds time elapsed                                          ( +-  0.26% )
> > |
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image
> > |
> > |  Performance counter stats for
> > |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-defconfig+scs/Image -append
> > |  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > |        0.523245952 seconds time elapsed                                          ( +-  0.18% )
> >
> > Then I tried the same with your config, without your patches:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image
> > |
> > |  Performance counter stats for
> > |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> > |  loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > |        1.994692366 seconds time elapsed                                          ( +-  0.05% )
> >
> > Then with your config, without your patches, with the stacktrace hacked out:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> > |
> > |  Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> > | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > |        1.861823869 seconds time elapsed                                          ( +-  0.05% )
> >
> > If I use those number to estimate the proportion of time spent stacktracing,
> > with the baseline SCS number discounted to remove the hypervisor+VMM overheads,
> > I get:
> >
> >         (1.994692366 - 0.523245952) - (1.861823869 - 0.523245952)
> >         ---------------------------------------------------------  = 0.09029788358
> >         (1.994692366 - 0.523245952)
> >
> > So roughly 9% when I try to maximize that figure. When actually poking hardware
> > and doing real work, that figure goes down. For example, if just using "quiet":
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image > /dev/null
> > |
> > |  Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config/Image -append
> > | loglevel=9 earlycon panic=-1 quiet' (50 runs):
> > |
> > |        4.653286475 seconds time elapsed                                          ( +-  0.06% )
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image > /dev/null
> > |
> > |  Performance counter stats for
> > |  '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > |  -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > |  /home/mark/kernel-images/andrey-unwind-v5.18-rc1-andrey.config-nostacktrace/Image
> > |  -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> > |
> > |        4.585750154 seconds time elapsed                                          ( +-  0.05% )
> >
> > Which gives an estimate of:
> >
> >         (4.653286475 - 0.523245952) - (4.585750154 - 0.523245952)
> >         ---------------------------------------------------------  = 0.01635245964
> >         (4.653286475 - 0.523245952)
> >
> > ... or ~1.6% time spent backtracing:
> >
> > FWIW, applying your patches do show some benefit, but not as drastic as I was
> > expecting:
> >
> > With console output suprressed:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -s -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> > |
> > |  Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> > | -append loglevel=9 earlycon panic=-1 quiet loglevel=0' (50 runs):
> > |
> > |        1.920300410 seconds time elapsed                                          ( +-  0.05% )
> >
> > ... down from ~9% to ~4%
> >
> > With console output merely reduced:
> >
> > | [mark@gravadlaks:~/repro]% taskset -c 64 ./vmboot.sh --perf-trials 50 -q -k ~/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image > /dev/null
> > |
> > |  Performance counter stats for
> > | '/home/mark/.opt/apps/qemu/bin/qemu-system-aarch64 -m 2048 -smp 1 -nographic
> > | -no-reboot -machine virt,accel=kvm,gic-version=host -cpu host -kernel
> > | /home/mark/kernel-images/andrey-unwind-v5.18-rc1+scs-unwind-andrey.config/Image
> > | -append loglevel=9 earlycon panic=-1 quiet' (50 runs):
> > |
> > |        4.611277833 seconds time elapsed                                          ( +-  0.04% )
> >
> > ... down from 1.6% to 0.6%
> >
> > Given the above I still think we need to understand this a bit better before we
> > consider pursuing the SCS unwinder, given the issues I laid out in my prior mails.
> >
> > My hope is that we can improve the regular unwinder or other code such that
> > this becomes moot. I'm aware of a few things we could try, but given it's very
> > easy to sink a lot of time and effort into this, I'd like to first get some
> > more details, as above.
> 
> Hi Mark,
> 
> I'm about to publish v3, where I'll include a detailed description of
> how I measured the performance.
> 
> Perhaps we see different performance numbers because you're using
> KVM-enabled VM on an Arm host and I'm using QEMU on x86-64 host.

Hold on; are you using QEMU in TCG mode? If so that's in no way representative
of real HW performance, and there are operations it simply cannot make as fast
as HW can (e.g. pointer authentication using the architected QARMA variants).

> Although, it's suspicious that the difference is so drastic.

I'm not surprised at all. Some operations can be *orders of magnitude slower*
under TCG than on real HW even when considered relative to other operations,
and this can drastically skew benchmarks. We recently hit a case where PACIASP
and AUTIASP were so slow under TCG mode that they appeared to be causing a boot
hang, and we eventually figured out that they were just *very* slow, adding
minutes to the boot time. Richard Henderson added options to QEMU to mitigate
that (either disabling authentication, or using an IMPLEMENTATION DEFINED
algorithm).

The bottom line is that QEMU TCG mode is in no way representative of real-world
performance, and *cannot* be used for benchmarking.

I think we first need to understand *what* is so slow under QEMU TCG mode, and
*why* TCG mode performance matters. I suspect there are other ways we could
avoid this overhead *without* adding another unwinder, but even then we need a
justification for *why* we should care.

> I'll try to get my hands on some Arm hardware in the next few days and do the
> measurements there.
> 
> This new version also will not be making any changes to the entry
> code, as these changes add unwanted additional slowdown. That would be
> great, if you could check the performance impact of v3 with your
> setup.

I'll take a look at the series, but as before I do not think we should add
another unwinder. As above, I *certainly* do not think we should add another
unwinder based on TCG performance.

Thanks,
Mark.