[RFC] x86/entry/64: randomize kernel stack offset upon system call

Message ID 1549628149-11881-2-git-send-email-elena.reshetova@intel.com

Commit Message

Reshetova, Elena Feb. 8, 2019, 12:15 p.m. UTC
If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
the kernel stack offset is randomized upon each
exit from a system call via the trampoline stack.

This feature is based on the original idea from
PaX's RANDKSTACK feature:
https://pax.grsecurity.net/docs/randkstack.txt
All credit for the original idea goes to the PaX team.
However, the implementation of RANDOMIZE_KSTACK_OFFSET
differs greatly from the RANDKSTACK feature (see below).

Reasoning for the feature:

This feature should make various stack-based attacks
considerably harder, in particular those based upon
overflowing a kernel stack into an adjacent kernel stack
with a possibility to jump over a guard page.
Since the stack offset is randomized upon each
system call, it is very hard for an attacker to reliably
land in any particular place on the adjacent stack.

Design description:

During most of the kernel's execution, it runs on the "thread
stack", which is allocated at fork.c/dup_task_struct() and stored in
a per-task variable (tsk->stack). Since the stack grows downwards,
the stack top can always be calculated using the task_top_of_stack(tsk)
function, which essentially returns the address of tsk->stack + stack
size. When VMAP_STACK is enabled, the thread stack is allocated from
vmalloc space.

The thread stack is pretty deterministic in its structure: it is
fixed in size, and upon every syscall entry from userspace to the
kernel, the thread stack starts to be constructed from an address
fetched from the per-cpu cpu_current_top_of_stack variable.
This variable is required since there is no way to reference "current"
from the kernel entry/exit code, so the value of task_top_of_stack(tsk)
is "shadowed" in a per-cpu variable each time the kernel context
switches to a new task.

The RANDOMIZE_KSTACK_OFFSET feature works by randomizing the value of
task_top_of_stack(tsk) every time a process exits from a syscall. As
a result, the thread stack for that process will be constructed at a
random offset from the fixed tsk->stack + stack size value upon the
subsequent syscall.

Since the kernel is always exited (via IRET / SYSRET) from a per-cpu
"trampoline stack", this stack provides a safe place for modifying the
value of cpu_current_top_of_stack, because the thread stack is no
longer in use at that point.

There is only one small issue: currently the thread stack top is never
stored in a per-task variable, but always calculated as needed
via task_top_of_stack(tsk) and the existing tsk->stack value (essentially
relying on its fixed size and structure). So we need to create a new
per-task variable, tsk->stack_start, that stores the newly calculated
random value for the thread stack top. Together with the value of
cpu_current_top_of_stack, tsk->stack_start is also updated when
leaving kernel space from the trampoline stack, so that it can be
used by the scheduler to correctly "shadow" cpu_current_top_of_stack
upon a task switch.

Impact on the kernel thread stack size:

Since the current version does not allocate any additional pages
for the thread stack, it shifts the cpu_current_top_of_stack value
randomly between 0x000 and 0xFF0 (or 0x00 and 0xF0 if only 4 bits
are randomized). So, in the worst case (random offset 0xFF0/0xF0),
the actual usable stack size is 12304/16144 bytes.

Performance impact:

All measurements were done on an Intel Kaby Lake i7-8550U with 16GB RAM.

1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
    base:           Time: 12.243
    random_offset:  Time: 13.411

2) kernel build time (as one example of real-world load):
    base:           user  299m20,348s;  sys   21m39,047s
    random_offset:  user  300m19,759s;  sys   20m48,173s

3) perf on a fopen/fclose loop run 1000000000 times:
(the perf values below still differ somewhat between
runs, so I don't consider them very representative,
apart from obviously showing the big impact of
using get_random_u64())

    base:
     8.46%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
     4.77%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
     4.14%  time     [kernel.kallsyms]  [k] fsnotify
     3.94%  time     [kernel.kallsyms]  [k] _raw_spin_lock
     2.48%  time     [kernel.kallsyms]  [k] syscall_return_via_sysret
     2.42%  time     [kernel.kallsyms]  [k] entry_SYSCALL_64
     2.28%  time     [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.07%  time     [kernel.kallsyms]  [k] inotify_handle_event

    random_offset:
     8.35%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
     5.61%  time     [kernel.kallsyms]  [k] get_random_u64
     4.88%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
     3.08%  time     [kernel.kallsyms]  [k] _raw_spin_lock
     2.98%  time     [kernel.kallsyms]  [k] fsnotify
     2.73%  time     [kernel.kallsyms]  [k] syscall_return_via_sysret
     2.45%  time     [kernel.kallsyms]  [k] entry_SYSCALL_64
     1.87%  time     [kernel.kallsyms]  [k] __ext4_get_inode_loc
     1.65%  time     [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

Comparison to grsecurity RANDKSTACK feature:

The basic idea is taken from RANDKSTACK: randomization of the
cpu_current_top_of_stack value is performed within the existing 4
pages of memory allocated for the thread stack.
No additional pages are allocated.

This patch introduces 8 bits of randomness (bits 4-11 are
randomized; bits 0-3 must be zero due to stack alignment)
to the kernel stack top.
The very old grsecurity patch I checked had only 4 bits
of randomization for x86-64. This patch works with that
little randomness as well; we only have to decide how much
stack space we wish to (or can) trade for security.

  Notable differences from RANDKSTACK:

  - x86_64 only, since this does not make sense
    without vmap-based stack allocation, which provides
    guard pages, and the latter is only implemented for x86-64.
  - randomization is performed on the trampoline stack upon
    system call exit.
  - random bits are taken from get_random_long() instead of
    rdtsc() for better randomness. This, however, has a big
    performance impact (see the numbers above), and additionally,
    if we happen to hit a point when the generator needs to be
    reseeded, we might have an issue. An alternative would be to
    make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
    which can solve some of these issues, but I doubt all of them.
    Of course, rdtsc() can be a fallback if there is no way to
    make calls for proper randomness from the trampoline stack.
  - instead of storing the actual top of the stack in
    task->thread.sp0 (which does not exist on x86-64), a
    new unsigned long variable, stack_start, is created in
    the task struct, and key stack functions, like task_pt_regs,
    are updated to use it when available.
  - instead of preserving only the set of registers that are
    used within the randomization function, the current
    version uses the PUSH_AND_CLEAR_REGS/POP_REGS combination,
    similar to STACKLEAK. It would seem that we could
    get away with only preserving rax, rdx, rbx, rsi and rdi,
    but I am not sure how stable this is in the long run.

Future work possibilities:

  - One could do a version where we allocate an
    additional page for each kernel stack and then employ
    proper randomization. This could be a stricter config
    option, for example.
  - Alternatively, one could normally allocate only 4 pages
    of stack and allocate an additional page if the stack
    + randomized offset grows beyond 4 pages (this only
    happens for deep call chains).

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
---
 arch/Kconfig                     | 15 +++++++++++++++
 arch/x86/Kconfig                 |  1 +
 arch/x86/entry/calling.h         |  8 ++++++++
 arch/x86/entry/common.c          | 21 +++++++++++++++++++++
 arch/x86/entry/entry_64.S        |  4 ++++
 arch/x86/include/asm/processor.h | 15 ++++++++++++---
 arch/x86/kernel/dumpstack.c      |  2 +-
 arch/x86/kernel/irq_64.c         |  2 +-
 arch/x86/kernel/process.c        |  2 +-
 include/linux/sched.h            |  3 +++
 include/linux/sched/task_stack.h | 18 +++++++++++++++++-
 kernel/fork.c                    | 10 ++++++++++
 mm/kmemleak.c                    |  2 +-
 mm/usercopy.c                    |  2 +-
 14 files changed, 96 insertions(+), 9 deletions(-)

Comments

Peter Zijlstra Feb. 8, 2019, 1:05 p.m. UTC | #1
On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> 1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
>     base:           Time: 12.243
>     random_offset:  Time: 13.411

>     base:
>      8.46%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
>      4.77%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
>      4.14%  time     [kernel.kallsyms]  [k] fsnotify
> 
>     random_offset:
>      8.35%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
>      5.61%  time     [kernel.kallsyms]  [k] get_random_u64
>      4.88%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty

*ouch*

>   Notable differences from RANDKSTACK:

>   - random bits are taken from get_random_long() instead of
>     rdtsc() for a better randomness. This however has a big
>     performance impact (see above the numbers) and additionally
>     if we happen to hit a point when a generator needs to be
>     reseeded, we might have an issue. Alternatives can be to
>     make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
>     which can solve some issues, but I doubt that all of them.
>     Of course rdtsc() can be a fallback if there is no way to
>     make calls for a proper randomness from the trampoline stack.

http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.html

That would seem to suggest that the low bits of rdtsc would in fact be a
fairly good source of random.

Still, doing this on sysexit seems painful at best, syscall performance
matters (and hopefully we'll get rid of meltdown 'soon').

Why can't we change the stack offset periodically from an interrupt or
so, and then have every later entry use that?
Reshetova, Elena Feb. 8, 2019, 1:20 p.m. UTC | #2
> On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > 1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
> >     base:           Time: 12.243
> >     random_offset:  Time: 13.411
> 
> >     base:
> >      8.46%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
> >      4.77%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
> >      4.14%  time     [kernel.kallsyms]  [k] fsnotify
> >
> >     random_offset:
> >      8.35%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
> >      5.61%  time     [kernel.kallsyms]  [k] get_random_u64
> >      4.88%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
> 
> *ouch*
> 
> >   Notable differences from RANDKSTACK:
> 
> >   - random bits are taken from get_random_long() instead of
> >     rdtsc() for a better randomness. This however has a big
> >     performance impact (see above the numbers) and additionally
> >     if we happen to hit a point when a generator needs to be
> >     reseeded, we might have an issue. Alternatives can be to
> >     make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
> >     which can solve some issues, but I doubt that all of them.
> >     Of course rdtsc() can be a fallback if there is no way to
> >     make calls for a proper randomness from the trampoline stack.
> 
> http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.html

Oh, this seems to be a great write-up, I will read it carefully,
thank you for the pointer!

Yes, I believe RANDKSTACK was using rdtsc() exactly because of the
many issues get_random_* brings, in addition to its horrid performance.
This patch works just fine with rdtsc() as well; I just wanted to show
the *full* randomness option first, with the impact it brings.

> 
> That would seem to suggest that the low bits of rdtsc would in fact be a
> fairly good source of random.
> 
> Still, doing this on sysexit seems painful at best, syscall performance
> matters (and hopefully we'll get rid of meltdown 'soon').

I can measure the impact with rdtsc(); I think it is *very* small.
Btw, what would be a good measurement test?
I am not that happy with just looping on fopen-fclose, too
much noise.

> 
> Why can't we change the stack offset periodically from an interrupt or
> so, and then have every later entry use that.

Hm... This sounds more complex conceptually - we cannot touch the
stack while it is in use, so we have to periodically probe for a
good time (when the process is in userspace, I guess) to change it
from an interrupt? IMO the trampoline stack provides such a good,
clean place for doing it, and we have stackleak there doing stack
cleanup, so it would make sense to keep these features operating
together.

Best Regards,
Elena.
Peter Zijlstra Feb. 8, 2019, 2:26 p.m. UTC | #3
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:

> > 
> > Why can't we change the stack offset periodically from an interrupt or
> > so, and then have every later entry use that.
> 
> Hm... This sounds more complex conceptually - we cannot touch
> stack when it is in use, so we have to periodically probe for a 
> good time (when process is in userspace I guess) to change it from an interrupt?
> IMO trampoline stack provides such a good clean place for doing it and we
> have stackleak there doing stack cleanup, so would make sense to keep
> these features operating together.

The idea was to just change a per-cpu (possibly per-task if you ctxsw
it) offset that is used on entry to offset the stack.

So only entries after the change will have the updated offset, any
in-progress syscalls will continue with their current offset and will be
unaffected.
Peter Zijlstra Feb. 8, 2019, 3:15 p.m. UTC | #4
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:

> I can measure the impact with rdtsc(), I think it is *very* small. 
> Btw, what should be the good measurement test?
> I am not that happy with just looping on fopen-fclose, too
> much noise. 

IIRC amluto had a nice micro-bench for 'pure' syscall overhead
somewhere.

Something like getpid/gettid in a loop should work fairly well, that's a
really light syscall to make.
Andy Lutomirski Feb. 8, 2019, 4:34 p.m. UTC | #5
> On Feb 8, 2019, at 4:15 AM, Elena Reshetova <elena.reshetova@intel.com> wrote:
> 
> If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> the kernel stack offset is randomized upon each
> exit from a system call via the trampoline stack.
> 
> This feature is based on the original idea from
> the PaX's RANDKSTACK feature:
> https://pax.grsecurity.net/docs/randkstack.txt
> All the credits for the original idea goes to the PaX team.
> However, the implementation of RANDOMIZE_KSTACK_OFFSET
> differs greatly from the RANDKSTACK feature (see below).
> 
> Reasoning for the feature:
> 
> This feature should make considerably harder various
> stack-based attacks that are based upon overflowing
> a kernel stack into adjusted kernel stack with a
> possibility to jump over a guard page.
> Since the stack offset is randomized upon each
> system call, it is very hard for attacker to reliably
> land in any particular place on the adjusted stack.
> 

I think we need a better justification. With VLAs gone, it should be statically impossible to overflow past a guard page.

> Design description:
> 
> During most of the kernel's execution, it runs on the "thread
> stack", which is allocated at fork.c/dup_task_struct() and stored in
> a per-task variable (tsk->stack). Since stack is growing downwards,
> the stack top can be always calculated using task_top_of_stack(tsk)
> function, which essentially returns an address of tsk->stack + stack
> size. When VMAP_STACK is enabled, the thread stack is allocated from
> vmalloc space.
> 
> Thread stack is pretty deterministic on its structure - fixed in size,
> and upon every enter from a userspace to kernel on a
> syscall the thread stack is started to be constructed from an
> address fetched from a per-cpu cpu_current_top_of_stack variable.
> This variable is required since there is no way to reference "current"
> from the kernel entry/exit code, so the value of task_top_of_stack(tsk)
> is "shadowed" in a per-cpu variable each time the kernel context
> switches to a new task.
> 
> The RANDOMIZE_KSTACK_OFFSET feature works by randomizing the value of
> task_top_of_stack(tsk) every time a process exits from a syscall. As
> a result the thread stack for that process will be constructed from a
> random offset from a fixed tsk->stack + stack size value upon subsequent
> syscall.
> 

There is a vastly simpler way to do this: leave pt_regs in place and subtract a random multiple of 8 before pushing anything else onto the stack.  This gets most of the benefit with much less complexity.

It’s maybe even better. If you can overflow a stack buffer to rewrite pt_regs, you gain control of all registers. If the stack to pt_regs offset is randomized, then this gets much harder.


>  - random bits are taken from get_random_long() instead of
>    rdtsc() for a better randomness. This however has a big
>    performance impact (see above the numbers) and additionally
>    if we happen to hit a point when a generator needs to be
>    reseeded, we might have an issue.

NAK.  I do not want entry code calling non-arch C code that is not explicitly intended to be used from entry code like this. You may be violating all kinds of rules about context tracking and even stack usage.

Just use ALTERNATIVE to select between RDTSC and RDRAND.  This isn’t used for crypto — refusing to “trust” the CPU here makes no sense.

I think that, if you make these two changes, you’ll have a very straightforward 50-ish line patch.  The only real complication is how to find pt_regs on the way out. I think you could use RBP for this — just make it look like you have a regular frame-pointer-using stack frame between do_syscall_whatever and pt_regs. Josh Poimboeuf can help make sure you get all the unwinding details right. It should be straightforward.
Kees Cook Feb. 8, 2019, 9:28 p.m. UTC | #6
On Fri, Feb 8, 2019 at 4:16 AM Elena Reshetova
<elena.reshetova@intel.com> wrote:
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -37,7 +37,7 @@
>  static noinline int check_stack_object(const void *obj, unsigned long len)
>  {
>         const void * const stack = task_stack_page(current);
> -       const void * const stackend = stack + THREAD_SIZE;
> +       const void * const stackend = (void *)task_top_of_stack(current);
>         int ret;
>
>         /* Object is not on the stack at all. */

It seems like having task_top_of_stack() would be a nice refactoring
to have regardless of this feature (and splitting out would make this
patch a bit easier to read too).
Reshetova, Elena Feb. 9, 2019, 11:13 a.m. UTC | #7
> On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> 
> > >
> > > Why can't we change the stack offset periodically from an interrupt or
> > > so, and then have every later entry use that.
> >
> > Hm... This sounds more complex conceptually - we cannot touch
> > stack when it is in use, so we have to periodically probe for a
> > good time (when process is in userspace I guess) to change it from an interrupt?
> > IMO trampoline stack provides such a good clean place for doing it and we
> > have stackleak there doing stack cleanup, so would make sense to keep
> > these features operating together.
> 
> The idea was to just change a per-cpu (possible per-task if you ctxsw
> it) offset that is used on entry to offset the stack.
> So only entries after the change will have the updated offset, any
> in-progress syscalls will continue with their current offset and will be
> unaffected.

Let me try to write this out in simple steps to make sure I understand
your approach:

- create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
- periodically issue an interrupt, and inside it walk the process tree and
  update stack_offset randomly for each process
- when a process makes a new syscall, it subtracts the stack_offset value from top_of_stack(),
  and that becomes its new top_of_stack() for that system call.

Something like this?

I think it is close to what Andy has proposed
in his reply, but the main difference is that you propose to do this via an interrupt.
And the main reasoning for doing this via an interrupt would be not to affect
syscall performance, right?

The problem I see with the interrupt approach is how often it should be done.
We don't want to end up in a situation where we issue it too often, since
it is not going to be a very lightweight operation (updating all processes), and we
don't want it done so rarely that we end up with processes that execute many
syscalls with the same offset. So, we might have a situation where some processes
will execute a number of syscalls with the same offset, while some will change their
offset more than once without even making a single syscall.

Also, I fear that an attacker might have more room to play here, since we don't have
any fixed guarantees on randomization anymore; it depends on interrupt scheduling,
the syscall rate of a given process, etc. I don't know how real these concerns are in
practice, but it feels like much more to worry about.

Best Regards,
Elena.
Reshetova, Elena Feb. 9, 2019, 11:38 a.m. UTC | #8
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> 
> > I can measure the impact with rdtsc(), I think it is *very* small.
> > Btw, what should be the good measurement test?
> > I am not that happy with just looping on fopen-fclose, too
> > much noise.
> 
> IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> somewhere.
> 
> Something like getpid/gettid in a loop should work fairly well, that's a
> really light syscall to make.

Thanks! Let me try this for measurements next. I was surprised that
there is nothing ready-made in perf or rt-tools. You would think
this must be something that regularly needs measuring.

Best Regards,
Elena.
Greg Kroah-Hartman Feb. 9, 2019, 12:09 p.m. UTC | #9
On Sat, Feb 09, 2019 at 11:38:00AM +0000, Reshetova, Elena wrote:
> 
>  On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > 
> > > I can measure the impact with rdtsc(), I think it is *very* small.
> > > Btw, what should be the good measurement test?
> > > I am not that happy with just looping on fopen-fclose, too
> > > much noise.
> > 
> > IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> > somewhere.
> > 
> > Something like getpid/gettid in a loop should work fairly well, that's a
> > really light syscall to make.
> 
> Thanks! Let me try this for measurements next. I was surprise that
> there nothing  ready-made in perf or rt-tools. You would think
> this must be regularly in a need to measure. 

lmbench has traditionally been used for something like this in the past,
you might want to look at that too.

greg k-h
Andy Lutomirski Feb. 9, 2019, 6:25 p.m. UTC | #10
On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
>
> > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> >
> > > >
> > > > Why can't we change the stack offset periodically from an interrupt or
> > > > so, and then have every later entry use that.
> > >
> > > Hm... This sounds more complex conceptually - we cannot touch
> > > stack when it is in use, so we have to periodically probe for a
> > > good time (when process is in userspace I guess) to change it from an interrupt?
> > > IMO trampoline stack provides such a good clean place for doing it and we
> > > have stackleak there doing stack cleanup, so would make sense to keep
> > > these features operating together.
> >
> > The idea was to just change a per-cpu (possible per-task if you ctxsw
> > it) offset that is used on entry to offset the stack.
> > So only entries after the change will have the updated offset, any
> > in-progress syscalls will continue with their current offset and will be
> > unaffected.
>
> Let me try to write this into simple steps to make sure I understand your
> approach:
>
> - create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
> - periodically issue an interrupt, and inside it walk the process tree and
>   update stack_offset randomly for each process
> - when a process makes a new syscall, it subtracts stack_offset value from top_of_stack()
>  and that becomes its new  top_of_stack() for that system call.
>
> Smth like this?

I'm proposing something that is conceptually different.  You are,
conceptually, changing the location of the stack.  I'm suggesting that
you leave the stack alone and, instead, randomize how you use the
stack.  In plain C, this would consist of adding roughly this snippet
in do_syscall_64() and possibly other entry functions:

if (randomize_stack()) {
  void *dummy = alloca(rdrand() & 0x7f8);

  /* Make sure the compiler doesn't optimize out the alloca. */
  asm volatile ("" :: "r" (dummy));
}

... do the actual syscall work here.

This has a few problems, namely that the generated code might be awful
and that alloca is more or less banned in the kernel.  I suppose
alloca could be unbanned in the entry C code, but this could also be
done fairly easily in the asm code.  You'd just need to use a register
to store whatever is needed to put RSP back in the exit code.  The
obvious way would be to use RBP, but it's plausible that using a
different callee-saved register would make the unwinder interactions
easier to get right.

With this approach, you don't modify any of the top_of_stack()
functions or macros at all -- the top of stack isn't changed.

>
> I think it is close to what Andy has proposed
> in his reply, but the main difference is that you propose to do this via an interrupt.
> And the main reasoning for doing this via interrupt would be not to affect
> syscall performance, right?
>
> The problem I see with interrupt approach is how often that should be done?
> Because we don't want to end up with situation when we issue it too often, since
> it is not going to be very light-weight operation (update all processes), and we
> don't want it to be too rarely done that we end up with processes that execute many
> syscalls with the same offset. So, we might have a situation when some processes
>  will execute a number of syscalls with same offset and some will change their offset
> more than once without even making a single syscall.

I bet that any attacker worth their salt could learn the offset by
doing a couple of careful syscalls and looking for cache and/or TLB
effects.  This might make the whole exercise mostly useless.  Isn't
RDRAND supposed to be extremely fast, though?

I usually benchmark like this:

$ ./timing_test_64 10M sys_enosys
10000000 loops in 2.53484s = 253.48 nsec / loop

using https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/
Reshetova, Elena Feb. 11, 2019, 6:05 a.m. UTC | #11
> On Sat, Feb 09, 2019 at 11:38:00AM +0000, Reshetova, Elena wrote:
> >
> >  On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > >
> > > > I can measure the impact with rdtsc(), I think it is *very* small.
> > > > Btw, what should be the good measurement test?
> > > > I am not that happy with just looping on fopen-fclose, too
> > > > much noise.
> > >
> > > IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> > > somewhere.
> > >
> > > Something like getpid/gettid in a loop should work fairly well, that's a
> > > really light syscall to make.
> >
> > Thanks! Let me try this for measurements next. I was surprise that
> > there nothing  ready-made in perf or rt-tools. You would think
> > this must be regularly in a need to measure.
> 
> lmbench has traditionally been used for something like this in the past,
> you might want to look at that too.

Thank you! I will check!

Best Regards,
Elena.
Reshetova, Elena Feb. 11, 2019, 6:39 a.m. UTC | #12
> On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> >
> > > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > >
> > > > >
> > > > > Why can't we change the stack offset periodically from an interrupt or
> > > > > so, and then have every later entry use that.
> > > >
> > > > Hm... This sounds more complex conceptually - we cannot touch
> > > > stack when it is in use, so we have to periodically probe for a
> > > > good time (when process is in userspace I guess) to change it from an
> interrupt?
> > > > IMO trampoline stack provides such a good clean place for doing it and we
> > > > have stackleak there doing stack cleanup, so would make sense to keep
> > > > these features operating together.
> > >
> > > The idea was to just change a per-cpu (possible per-task if you ctxsw
> > > it) offset that is used on entry to offset the stack.
> > > So only entries after the change will have the updated offset, any
> > > in-progress syscalls will continue with their current offset and will be
> > > unaffected.
> >
> > Let me try to write this into simple steps to make sure I understand your
> > approach:
> >
> > - create a new per-stack value (and potentially its per-cpu "shadow") called
> stack_offset = 0
> > - periodically issue an interrupt, and inside it walk the process tree and
> >   update stack_offset randomly for each process
> > - when a process makes a new syscall, it subtracts stack_offset value from
> top_of_stack()
> >  and that becomes its new  top_of_stack() for that system call.
> >
> > Smth like this?
> 
> I'm proposing somthing that is conceptually different. 

OK, it looks like I fully misunderstood what you meant, indeed.
The reason I didn't reply to your earlier answer is that I started to look
into the unwinder code & logic to get at least a slight clue about how things
can be done, since I hadn't really looked into it before (I wasn't changing
anything with regard to it, so I didn't have to). So, I meant to come back
with a more rigid answer than just "let me study this first"...

> You are,
> conceptually, changing the location of the stack.  I'm suggesting that
> you leave the stack alone and, instead, randomize how you use the
> stack. 


So, yes, instead of having:

allocated_stack_top
random_offset
actual_stack_top
pt_regs
...
and so on

We will have something like:

allocated_stack_top = actual_stack_top
pt_regs
random_offset
...

So, conceptually, we have the same amount of randomization with
both approaches, but it is applied very differently.

Security-wise, I will have to think more about whether the second approach
has any negative consequences, in addition to the positive ones. As a
paranoid security person, you might want to merge both approaches and
randomize both places (before and after pt_regs) with different offsets,
but I guess that would be out of the question, right?

I am not that experienced with exploits, but we have been
talking about this with Jann and Enrico, so I think it is best if they
comment directly here. I am just wondering if having pt_regs in a fixed
place can be an advantage for an attacker under any scenario...

> In plain C, this would consist of adding roughly this snippet
> in do_syscall_64() and possibly other entry functions:
> 
> if (randomize_stack()) {
>   void *dummy = alloca(rdrand() & 0x7f8);
> 
>   /* Make sure the compiler doesn't optimize out the alloca. */
>   asm volatile ("" :: "=rm" (dummy));
> }
> 
> ... do the actual syscall work here.
> 
> This has a few problems, namely that the generated code might be awful
> and that alloca is more or less banned in the kernel.  I suppose
> alloca could be unbanned in the entry C code, but this could also be
> done fairly easily in the asm code.  You'd just need to use a register
> to store whatever is needed to put RSP back in the exit code.  

Yes, I was actually thinking of doing it in assembly, since I think it would
look smaller and clearer that way. I just need to get the details slowly into
place. 


> The
> obvious way would be to use RBP, but it's plausible that using a
> different callee-saved register would make the unwinder interactions
> easier to get right.

This is what I started looking into; please bear with me, all this stuff is new
to my eyes, so I am slow...

> 
> With this approach, you don't modify any of the top_of_stack()
> functions or macros at all -- the top of stack isn't changed.

Yes, understood. 

> 
> >
> > I think it is close to what Andy has proposed
> > in his reply, but the main difference is that you propose to do this via an interrupt.
> > And the main reasoning for doing this via interrupt would be not to affect
> > syscall performance, right?
> >
> > The problem I see with interrupt approach is how often that should be done?
> > Because we don't want to end up with situation when we issue it too often, since
> > it is not going to be very light-weight operation (update all processes), and we
> > don't want it to be too rarely done that we end up with processes that execute
> many
> > syscalls with the same offset. So, we might have a situation when some processes
> >  will execute a number of syscalls with same offset and some will change their
> offset
> > more than once without even making a single syscall.
> 
> I bet that any attacker worth their salt could learn the offset by
> doing a couple of careful syscalls and looking for cache and/or TLB
> effects.  This might make the whole exercise mostly useless.  Isn't
> RDRAND supposed to be extremely fast, though?
> 
> I usually benchmark like this:
> 
> $ ./timing_test_64 10M sys_enosys
> 10000000 loops in 2.53484s = 253.48 nsec / loop
> 
> using https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/

Thank you for the pointer! 
With everyone's suggestions I am now having much better set of tools to do
my next measurements.

Best Regards,
Elena.
Reshetova, Elena Feb. 11, 2019, 12:47 p.m. UTC | #13
> On Fri, Feb 8, 2019 at 4:16 AM Elena Reshetova
> <elena.reshetova@intel.com> wrote:
> > --- a/mm/usercopy.c
> > +++ b/mm/usercopy.c
> > @@ -37,7 +37,7 @@
> >  static noinline int check_stack_object(const void *obj, unsigned long len)
> >  {
> >         const void * const stack = task_stack_page(current);
> > -       const void * const stackend = stack + THREAD_SIZE;
> > +       const void * const stackend = (void *)task_top_of_stack(current);
> >         int ret;
> >
> >         /* Object is not on the stack at all. */
> 
> It seems like having task_top_of_stack() would be a nice refactoring
> to have regardless of this feature (and splitting out would make this
> patch a bit easier to read too).

Yes, if my refactoring in these places looks correct, I can create a separate
patch that just uses task_top_of_stack() instead of hard-coding the math as
above. Does that sound right to people? I could not find a reason why these
places do not use task_top_of_stack() to begin with... 

Best Regards,
Elena.
Andy Lutomirski Feb. 11, 2019, 3:54 p.m. UTC | #14
On Feb 10, 2019, at 10:39 PM, Reshetova, Elena <elena.reshetova@intel.com> wrote:

>> On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
>> <elena.reshetova@intel.com> wrote:
>>> 
>>>> On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
>>>>>> On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
>>>> 
>>>>>> 
>>>>>> Why can't we change the stack offset periodically from an interrupt or
>>>>>> so, and then have every later entry use that.
>>>>> 
>>>>> Hm... This sounds more complex conceptually - we cannot touch
>>>>> stack when it is in use, so we have to periodically probe for a
>>>>> good time (when process is in userspace I guess) to change it from an
>> interrupt?
>>>>> IMO trampoline stack provides such a good clean place for doing it and we
>>>>> have stackleak there doing stack cleanup, so would make sense to keep
>>>>> these features operating together.
>>>> 
>>>> The idea was to just change a per-cpu (possible per-task if you ctxsw
>>>> it) offset that is used on entry to offset the stack.
>>>> So only entries after the change will have the updated offset, any
>>>> in-progress syscalls will continue with their current offset and will be
>>>> unaffected.
>>> 
>>> Let me try to write this into simple steps to make sure I understand your
>>> approach:
>>> 
>>> - create a new per-stack value (and potentially its per-cpu "shadow") called
>> stack_offset = 0
>>> - periodically issue an interrupt, and inside it walk the process tree and
>>>  update stack_offset randomly for each process
>>> - when a process makes a new syscall, it subtracts stack_offset value from
>> top_of_stack()
>>> and that becomes its new  top_of_stack() for that system call.
>>> 
>>> Smth like this?
>> 
>> I'm proposing something that is conceptually different. 
> 
> OK, looks like I fully misunderstood what you meant indeed.
> The reason I didn’t reply to your earlier answer is that I started to look
> into the unwinder code & logic to get at least a slight clue on how things
> can be done, since I had hardly looked at it before (I wasn't changing
> anything with regards to it, so I didn't have to). So, I meant to come back
> with a more rigid answer than just "let me study this first"...

Fair enough.

> 
> You are,
>> conceptually, changing the location of the stack.  I'm suggesting that
>> you leave the stack alone and, instead, randomize how you use the
>> stack. 
> 
> 
> So, yes, instead of having:
> 
> allocated_stack_top
> random_offset
> actual_stack_top
> pt_regs
> ...
> and so on
> 
> We will have smth like:
> 
> allocated_stack_top = actual_stack_top
> pt_regs
> random_offset
> ...
> 
> So, conceptually we have the same amount of randomization with 
> both approaches, but it is applied very differently. 

Exactly.

> 
> Security-wise I will have to think more about whether the second approach has any negative
> consequences, in addition to the positive ones. As a paranoid security person,
> you might want to merge both approaches and randomize both places (before and
> after pt_regs) with different offsets, but I guess this would be out of the question, right? 

It’s not out of the question, but it needs some amount of cost vs benefit analysis.  The costs are complexity, speed, and a reduction in available randomness for any given amount of memory consumed.

> 
> I am not that experienced with exploits, but we have been
> talking about this with Jann and Enrico, so I think it is best if they comment
> directly here. I am just wondering if having pt_regs in a fixed place can
> be an advantage for an attacker under any scenario... 

If an attacker has write-what-where (i.e. can write controlled values to controlled absolute virtual addresses), then I expect that pt_regs is a pretty low ranking target.  But it may be a fairly juicy target if you have a stack buffer overflow that lets an attacker write to a controlled *offset* from the stack. We used to keep thread_info at the bottom of the stack, and that was a great attack target.

But there’s an easier mitigation: just do regs->cs |= 3 or something like that in the exit code. Then any such attack can only corrupt *user* state.  The performance impact would be *very* low, since this could go in the asm path that’s only used for IRET to user mode.
Perla, Enrico Feb. 12, 2019, 10:16 a.m. UTC | #15
Hi,
  I was somewhat fond of randomizing the pt_regs location, as that is something I could relate to from writing exploits (it is a handy way to load user-controlled data into the kernel at a known location).

But, as Jann pointed out, that only has value in a ptrace-blocked sandbox, because the randomization offset can be leaked otherwise through ptrace PEEK/POKE and observing cache behavior. Worse, if ptrace is present, then the randomization is moot.

Since containers seem to be going towards leaving ptrace open, I'm now wondering whether that is a good motivation at all and whether the proposed simplified version is not simply better. 

> 
> If an attacker has write-what-where (i.e. can write controlled values to
> controlled absolute virtual addresses), then I expect that pt_regs is a pretty
> low ranking target.  But it may be a fairly juicy target if you have a stack
> buffer overflow that lets an attacker write to a controlled *offset* from the
> stack. We used to keep thread_info at the bottom of the stack, and that was
> a great attack target.
> 
> But there’s an easier mitigation: just do regs->cs |= 3 or something like that
> in the exit code. Then any such attack can only corrupt *user* state.  The
> performance impact would be *very* low, since this could go in the asm
> path that’s only used for IRET to user mode.

That's all fair. What I struggle with is finding a precise motivation for the randomization (granted, this might be extended to other KASLR cases, so perhaps it is not a strong hard stop).

The proposed randomization does fit the overall KASLR story and it does its job of not letting an attacker predict future stack offset from one leak, but in practical terms I'm struggling to find a case or two where this would have made a difference in an exploit.

Can any of you think of some?


                 -   Enrico

Reshetova, Elena Feb. 14, 2019, 7:52 a.m. UTC | #16
After some thinking and discussions, let me try to summarize the options and the
security benefits of each approach, based on everyone's feedback, before
going any further with the work. We all want only useful things in the kernel, so merging
smth that provides a subtle/almost non-existent benefit is clearly not a priority. 

So, first about the protection methods. We can do randomization in two ways:

1. stack top itself (and location of pt_regs) 
2. relative stack positioning (offset from pt_regs).

In addition to randomization we can do some fixup on exit, like Andy proposed:

3. Make sure CS always points to user on exit (for example by regs->cs |= 3),
potentially smth else can be fixed up similarly, if needed

Here are *known* (at least to me, please shout if you know more!) possible
attacks vectors and implications on them by doing 1), 2) or 3)

a) attacker's goal is to store some user-controlled data in pt_regs to reference it later
  1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
     If discovered, then attack succeeds as of now.
  2) nothing changed for this type of attack, attack succeeds as of now
  3) nothing changed for this type of attack, attack succeeds as of now

b) attacker's goal is to return to userland from a syscall with CS pointing to kernel
 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
     If discovered, then attack succeeds as of now.
 2) nothing changed for this type of attack, attack succeeds as of now
 3) CS changed explicitly on exit, so impossible to have this attack, *given* that the
    change is done late enough and no races, sleeps, etc. are possible

c) attacker's goal is to perform some kind of stack overflow into parts of adjusted stack 
  memory via some method. Here the main unknown is the "method".
This vector of attack is the challenge for current exploit writers: I guess if you can do
it now with all the current stack protections enabled, you get a nice defcon talk at least :) 
VLAs used to be an easy way of doing it; hopefully they are gone from the mainline source now
(out-of-tree drivers are a different story). 
Uninitialized locals might be another way, but there is ongoing work to close this (see Kees's
patches on stackinit).
Smth else we don’t yet know? 
Now back to our proposed countermeasures given that attacker has found a way to do
a crafted overflow and overwrite:

  1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
     If discovered, then attack succeeds as of now. 
  2) relative stack offset is not predictable and randomized, cannot be probed very easily via 
      cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non-
      deterministic now.
  3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS
      in adjusted pt_regs. If it is his goal, then it helps with that. 


Now summary:

It would seem to me that:

- regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it 
might make a positive difference in two out of three attack scenarios. Objections?

- randomization of stack top is only worth doing in ptrace-blocked scenario. 
Do we have such scenarios left that people care about? 
Because if we do, then we know that there is a real attack vector that we close this way, otherwise not. 
This is actually interesting, because we need to remember to take ptrace into our overall
kernel hardening threat model (smth that at least I haven't quite realized before) and evaluate every new
feature (especially randomization ones) being robust against ptrace probing. 

- randomization after pt_regs would only make a difference in attack scenario "c", for which 
  we don't yet have a proof-of-concept exploit or technique that would work (which does not
guarantee that attackers don't have such exploits ready, though :( ). 
So, if we implement this, the "justification part" for the feature would be smth like "to make it
harder for possible future stack-based exploits that utilize overflows", if/when someone finds a new
'ala VLA' way of doing a controlled overflow. 
How do people feel about it? Is it worth having? I can work on the POC for this in the direction that Andy 
outlined and can provide the performance impact/etc., but it is good that we understand that we cannot
provide a better justification for this feature at the moment unless someone is ready to share some
new exploit technique with us. 

Best Regards,
Elena.
Jann Horn Feb. 19, 2019, 2:47 p.m. UTC | #17
On Thu, Feb 14, 2019 at 8:52 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
> After some thinking and discussions, let me try to summarize the options and the
> security benefits of each approach based on everyone else feedback before
> going any further with work. We all want only useful things in kernel, so merging
> smth that provides a subtle/almost non-existent benefit is clearly not a priority.
>
> So, first about the protection methods. We can do randomization in two ways:
>
> 1. stack top itself (and location of pt_regs)
> 2. relative stack positioning (offset from pt_regs).
>
> In addition to randomization we can do some fixup on exit, like Andy proposed:
>
> 3. Make sure CS always points to user on exit (for example by regs->cs |= 3),
> potentially smth else can be fixed up similarly, if needed
>
> Here are *known* (at least to me, please shout if you know more!) possible
> attacks vectors and implications on them by doing 1), 2) or 3)
>
> a) attacker's goal is to store some user-controlled data in pt_regs to reference it later
>   1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
>      If discovered, then attack succeeds as of now.
>   2) nothing changed for this type of attack, attack succeeds as of now
>   3) nothing changed for this type of attack, attack succeeds as of now
>
> b) attacker's goal is to return to userland from a syscall with CS pointing to kernel
>  1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
>      If discovered, then attack succeeds as of now.
>  2) nothing changed for this type of attack, attack succeeds as of now
>  3) CS changed explicitly on exit, so impossible to have this attack, *given* that the
>     change is done late enough and no races, sleeps, etc. are possible
>
> c) attacker's goal is to perform some kind of stack overflow into parts of adjusted stack
>   memory via some method. Here the main unknown is the "method".
> This vector of attack is the challenge for current exploit writers: I guess if you can do
> it now with all the current stack protections enabled, you get a nice defcon talk at least :)
> VLAs used to be an easy way of doing it; hopefully they are gone from the mainline source now
> (out-of-tree drivers are a different story).
> Uninitialized locals might be another way, but there is ongoing work to close this (see Kees's
> patches on stackinit).
> Smth else we don’t yet know?
> Now back to our proposed countermeasures given that attacker has found a way to do
> a crafted overflow and overwrite:
>
>   1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
>      If discovered, then attack succeeds as of now.
>   2) relative stack offset is not predictable and randomized, cannot be probed very easily via
>       cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non-
>       deterministic now.
>   3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS
>       in adjusted pt_regs. If it is his goal, then it helps with that.
>
>
> Now summary:
>
> It would seem to me that:
>
> - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it
> might make a positive difference in two out of three attack scenarios. Objections?
>
> - randomization of stack top is only worth doing in ptrace-blocked scenario.
> Do we have such scenarios left that people care about?

There are some, I think; for example, Chrome renderers on desktop
Linux can't use ptrace.

Patch
diff mbox series

diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540f..577186e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -802,6 +802,21 @@  config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization.
+
+config RANDOMIZE_KSTACK_OFFSET
+	default n
+	bool "Randomize kernel stack offset on syscall exit"
+	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET && VMAP_STACK
+	help
+	  Enable this if you want to randomize the kernel stack offset upon
+	  each syscall exit. This causes the kernel stack to have a randomized
+	  offset upon executing each system call.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e79..85d3849 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -134,6 +134,7 @@  config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET  if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 25e5a6b..d644f72 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -337,6 +337,14 @@  For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+.macro RANDOMIZE_KSTACK_NOCLOBBER
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	PUSH_AND_CLEAR_REGS
+	call randomize_kstack
+	POP_REGS
+#endif
+.endm
+
 #endif /* CONFIG_X86_64 */
 
 .macro STACKLEAK_ERASE
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b..0031887 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -23,6 +23,7 @@ 
 #include <linux/user-return-notifier.h>
 #include <linux/nospec.h>
 #include <linux/uprobes.h>
+#include <linux/random.h>
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
 
@@ -294,6 +295,26 @@  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+__visible void randomize_kstack(void)
+{
+	unsigned long r_offset, new_top, stack_bottom;
+
+	if (current->stack_start != 0) {
+
+		r_offset = get_random_long();
+		r_offset &= 0xFFUL;
+		r_offset <<= 4;
+		stack_bottom = (unsigned long)task_stack_page(current);
+
+		new_top = stack_bottom + THREAD_SIZE - 0xFF0UL;
+		new_top += r_offset;
+		this_cpu_write(cpu_current_top_of_stack, new_top);
+		current->stack_start = new_top;
+	}
+}
+#endif
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 /*
  * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb..ae9d370 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -268,6 +268,8 @@  syscall_return_via_sysret:
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
+	RANDOMIZE_KSTACK_NOCLOBBER
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
@@ -630,6 +632,8 @@  GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
+	RANDOMIZE_KSTACK_NOCLOBBER
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	/* Restore RDI. */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 071b2a6..dad09f2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -569,10 +569,15 @@  static inline unsigned long current_top_of_stack(void)
 
 static inline bool on_thread_stack(void)
 {
+	/* this might need adjustment to a more fine-grained comparison
+	 * we want a condition like
+	 * "< current_top_of_stack() - task_stack_page(current)"
+	 */
 	return (unsigned long)(current_top_of_stack() -
 			       current_stack_pointer) < THREAD_SIZE;
 }
 
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include <asm/paravirt.h>
 #else
@@ -829,12 +834,16 @@  static inline void spin_lock_prefetch(const void *x)
 
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#define task_pt_regs(task) ((struct pt_regs *)(task_ptregs(task)))
+#else
 #define task_pt_regs(task) \
-({									\
+({                                 \
 	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
-	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
-	((struct pt_regs *)__ptr) - 1;					\
+	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;         \
+	((struct pt_regs *)__ptr) - 1;                              \
 })
+#endif
 
 #ifdef CONFIG_X86_32
 /*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2b58864..030ee15 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -33,7 +33,7 @@  bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info)
 {
 	unsigned long *begin = task_stack_page(task);
-	unsigned long *end   = task_stack_page(task) + THREAD_SIZE;
+	unsigned long *end   = (unsigned long *)task_top_of_stack(task);
 
 	if (stack < begin || stack >= end)
 		return false;
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 0469cd0..3f03b79 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -43,7 +43,7 @@  static inline void stack_overflow_check(struct pt_regs *regs)
 		return;
 
 	if (regs->sp >= curbase + sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
-	    regs->sp <= curbase + THREAD_SIZE)
+	    regs->sp <= task_top_of_stack(current))
 		return;
 
 	irq_stack_top = (u64)this_cpu_ptr(irq_stack_union.irq_stack) +
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7d31192..f30485a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -819,7 +819,7 @@  unsigned long get_wchan(struct task_struct *p)
 	 * We need to read FP and IP, so we need to adjust the upper
 	 * bound by another unsigned long.
 	 */
-	top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
+	top = task_top_of_stack(p);
 	top -= 2 * sizeof(unsigned long);
 	bottom = start;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 291a9bd..8e748e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -605,6 +605,9 @@  struct task_struct {
 	randomized_struct_fields_start
 
 	void				*stack;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	unsigned long		stack_start;
+#endif
 	atomic_t			usage;
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 6a84192..229c434 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -21,6 +21,22 @@  static inline void *task_stack_page(const struct task_struct *task)
 	return task->stack;
 }
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+static inline void *task_ptregs(const struct task_struct *task)
+{
+	unsigned long __ptr;
+
+	if (task->stack_start == 0) {
+		__ptr = (unsigned long)task_stack_page(task);
+		__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
+		return ((struct pt_regs *)__ptr) - 1;
+	}
+
+	__ptr = task->stack_start;
+	return ((struct pt_regs *)__ptr) - 1;
+}
+#endif
+
 #define setup_thread_stack(new,old)	do { } while(0)
 
 static inline unsigned long *end_of_stack(const struct task_struct *task)
@@ -82,7 +98,7 @@  static inline int object_is_on_stack(const void *obj)
 {
 	void *stack = task_stack_page(current);
 
-	return (obj >= stack) && (obj < (stack + THREAD_SIZE));
+	return (obj >= stack) && (obj < ((void *)task_top_of_stack(current)));
 }
 
 extern void thread_stack_cache_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff..8eccf94 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -422,6 +422,9 @@  static void release_task_stack(struct task_struct *tsk)
 	tsk->stack = NULL;
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = NULL;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+#endif
 #endif
 }
 
@@ -863,6 +866,10 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->stack = stack;
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = stack_vm_area;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+	tsk->stack_start = (unsigned long)task_top_of_stack(tsk);
+#endif
 #endif
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	atomic_set(&tsk->stack_refcount, 1);
@@ -922,6 +929,9 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 free_stack:
 	free_thread_stack(tsk);
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+#endif
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 877de4f..e52c76f 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1572,7 +1572,7 @@  static void kmemleak_scan(void)
 		do_each_thread(g, p) {
 			void *stack = try_get_task_stack(p);
 			if (stack) {
-				scan_block(stack, stack + THREAD_SIZE, NULL);
+				scan_block(stack, task_top_of_stack(p), NULL);
 				put_task_stack(p);
 			}
 		} while_each_thread(g, p);
diff --git a/mm/usercopy.c b/mm/usercopy.c
index 852eb4e..4b07542 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -37,7 +37,7 @@ 
 static noinline int check_stack_object(const void *obj, unsigned long len)
 {
 	const void * const stack = task_stack_page(current);
-	const void * const stackend = stack + THREAD_SIZE;
+	const void * const stackend = (void *)task_top_of_stack(current);
 	int ret;
 
 	/* Object is not on the stack at all. */