Patchwork [RFC,v10,2/6] x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls

login
register
mail settings
Submitter Alexander Popov
Date March 28, 2018, 7:57 p.m.
Message ID <1522267032-6603-3-git-send-email-alex.popov@linux.com>
Download mbox | patch
Permalink /patch/10313989/
State New
Headers show

Comments

Alexander Popov - March 28, 2018, 7:57 p.m.
The STACKLEAK feature erases the kernel stack before returning from
syscalls. That reduces the information which kernel stack leak bugs can
reveal and blocks some uninitialized stack variable attacks. Moreover,
STACKLEAK provides runtime checks for kernel stack overflow detection.

This commit introduces the architecture-specific code filling the used
part of the kernel stack with a poison value before returning to the
userspace. Full STACKLEAK feature also contains the gcc plugin which
comes in a separate commit.

The STACKLEAK feature is ported from grsecurity/PaX. More information at:
  https://grsecurity.net/
  https://pax.grsecurity.net/

This code is modified from Brad Spengler/PaX Team's code in the last
public patch of grsecurity/PaX based on our understanding of the code.
Changes or omissions from the original code are ours and don't reflect
the original grsecurity/PaX code.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
---
 Documentation/x86/x86_64/mm.txt  |  2 ++
 arch/Kconfig                     | 27 ++++++++++++++++++++
 arch/x86/Kconfig                 |  1 +
 arch/x86/entry/Makefile          |  3 +++
 arch/x86/entry/entry_32.S        | 11 ++++++++
 arch/x86/entry/entry_64.S        | 11 ++++++++
 arch/x86/entry/entry_64_compat.S | 11 ++++++++
 arch/x86/entry/erase.c           | 54 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h |  4 +++
 arch/x86/kernel/process_32.c     |  5 ++++
 arch/x86/kernel/process_64.c     |  5 ++++
 include/linux/compiler.h         |  4 +++
 12 files changed, 138 insertions(+)
 create mode 100644 arch/x86/entry/erase.c
Dave Hansen - March 28, 2018, 10:55 p.m.
On 03/28/2018 12:57 PM, Alexander Popov wrote:
> +.macro ERASE_KSTACK
> +#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
> +	PUSH_AND_CLEAR_REGS
> +	call erase_kstack
> +	POP_REGS
> +#endif
> +.endm

Thanks again for the hard work to trim down the complexity of the
assembly.  I noticed the 64-bit version saves/restores registers while
the 32-bit version doesn't.  What's the reasoning there?
Alexander Popov - March 29, 2018, 6:58 a.m.
On 29.03.2018 01:55, Dave Hansen wrote:
> On 03/28/2018 12:57 PM, Alexander Popov wrote:
>> +.macro ERASE_KSTACK
>> +#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
>> +	PUSH_AND_CLEAR_REGS
>> +	call erase_kstack
>> +	POP_REGS
>> +#endif
>> +.endm
> 
> Thanks again for the hard work to trim down the complexity of the
> assembly. 

Hello Dave, thanks!

> I noticed the 64-bit version saves/restores registers while
> the 32-bit version doesn't.  What's the reasoning there?

When erase_kstack() is called from the trampoline stack, it must save and
restore any modified registers, since all registers except RDI are live
(prepared for the userspace).

When erase_kstack() is called from the thread stack, it can clobber registers
according the function call convention without any harm.

Best regards,
Alexander
Dave Hansen - March 29, 2018, 1:51 p.m.
On 03/28/2018 11:58 PM, Alexander Popov wrote:
>> I noticed the 64-bit version saves/restores registers while
>> the 32-bit version doesn't.  What's the reasoning there?
> When erase_kstack() is called from the trampoline stack, it must save and
> restore any modified registers, since all registers except RDI are live
> (prepared for the userspace).
> 
> When erase_kstack() is called from the thread stack, it can clobber registers
> according the function call convention without any harm.

Oh, and since there's no 32-bit trampoline stack, we don't need it on
32-bit?

If end up reposting this set again, could you add a few comments about
this around the ERASE_KSTACK macro definitions, or perhaps the call
sites?  You might even want to call them ERASE_KSTACK_CLOBBER (for
32-bit) and ERASE_KSTACK_NOCLOBBER (for 64-bit) to make this more clear.
Alexander Popov - March 29, 2018, 8:56 p.m.
Hello Dave and Boris,

Thanks for your replies!

On 29.03.2018 18:09, Boris Lukashev wrote:
> On Thu, Mar 29, 2018 at 9:51 AM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>> On 03/28/2018 11:58 PM, Alexander Popov wrote:
>>>> I noticed the 64-bit version saves/restores registers while
>>>> the 32-bit version doesn't.  What's the reasoning there?
>>> When erase_kstack() is called from the trampoline stack, it must save and
>>> restore any modified registers, since all registers except RDI are live
>>> (prepared for the userspace).
>>>
>>> When erase_kstack() is called from the thread stack, it can clobber registers
>>> according the function call convention without any harm.
>>
>> Oh, and since there's no 32-bit trampoline stack, we don't need it on
>> 32-bit?
>>
>> If end up reposting this set again,

Hope so. Let's see...

>> could you add a few comments about
>> this around the ERASE_KSTACK macro definitions, or perhaps the call
>> sites?  You might even want to call them ERASE_KSTACK_CLOBBER (for
>> 32-bit) and ERASE_KSTACK_NOCLOBBER (for 64-bit) to make this more clear.
> 
> Not sure if the macro name differentiation is such a good idea, might
> entice improper use attempts.
> A more detailed explanation of this should probably go into the
> headers and doc/commit log for future implementation on architectures
> which may have their own weird semantics around the trampoline
> stack/not have one.

Ok, I see. Let me give the overview and propose the solution.

The current version has 3 separate ERASE_KSTACK definitions:

1. a simple one in entry_32.S, used only in that file:

+.macro ERASE_KSTACK
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	call erase_kstack
+#endif
+.endm


2. another one saving registers in entry_64.S, used only in that file for
erasing from the trampoline stack:

+.macro ERASE_KSTACK
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	PUSH_AND_CLEAR_REGS
+	call erase_kstack
+	POP_REGS
+#endif
+.endm

The call sights are already prepared and documented by Andy Lutomirski:

	/*
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	ERASE_KSTACK


3. a simple one in entry_64_compat.S (similar to case 1), used only in that file:

+	.macro ERASE_KSTACK
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	call erase_kstack
+#endif
+	.endm
+

The call sight is documented as well:

 sysret32_from_system_call:
+	/*
+	 * We are not going to return to the userspace from the trampoline
+	 * stack. So let's erase the thread stack right now.
+	 */
+	ERASE_KSTACK


If STACKLEAK is not banned, would you like me to introduce ERASE_KSTACK (for
cases 1 and 3) and ERASE_KSTACK_NOCLOBBER (for case 2) in
arch/x86/entry/calling.h?

Best regards,
Alexander
Dave Hansen - March 29, 2018, 9 p.m.
On 03/29/2018 01:56 PM, Alexander Popov wrote:
> 
> If STACKLEAK is not banned, would you like me to introduce ERASE_KSTACK (for
> cases 1 and 3) and ERASE_KSTACK_NOCLOBBER (for case 2) in
> arch/x86/entry/calling.h?

Yes, please.  But, whatever you do, I think having each name have
constant behavior is a must.  Please don't have "ERASE_KSTACK" in one
file clobber and another not clobber (even if one is 32-bit and the
other is not).
Alexander Popov - March 29, 2018, 9:34 p.m.
On 29.03.2018 21:38, Shivappa Vikas wrote:
> On Wed, 28 Mar 2018, Alexander Popov wrote:
> 
>> +	const unsigned long check_depth = STACKLEAK_POISON_CHECK_DEPTH /
>> +							sizeof(unsigned long);
>> +
>> +	/*
>> +	 * Let's search for the poison value in the stack.
>> +	 * Start from the lowest_stack and go to the bottom.
>> +	 */
>> +	while (p > boundary && poison <= check_depth) {
>> +		if (*(unsigned long *)p == STACKLEAK_POISON)
>> +			poison++;
>> +		else
>> +			poison = 0;
>> +
>> +		p -= sizeof(unsigned long);
>> +	}
>> +
>> +	/*
>> +	 * One long int at the bottom of the thread stack is reserved and
>> +	 * should not be poisoned (see CONFIG_SCHED_STACK_END_CHECK).
>> +	 */
>> +	if (p == boundary)
>> +		p += sizeof(unsigned long);
>> +
>> +	/*
>> +	 * So let's write the poison value to the kernel stack.
>> +	 * Start from the address in p and move up till the new boundary.
>> +	 */
>> +	if (on_thread_stack())
>> +		boundary = current_stack_pointer;
>> +	else
>> +		boundary = current_top_of_stack();
>> +
>> +	BUG_ON(boundary - p >= THREAD_SIZE);
>> +
>> +	while (p < boundary) {
>> +		*(unsigned long *)p = STACKLEAK_POISON;
>> +		p += sizeof(unsigned long);
>> +	}
> 
> Sorry catching up late on this. 

Hello Vikas,

You are just in time! Thank you for looking at the code.

> can we use a stosq or something here ? wondering 
> if that helps get more performance. since you seem to copy the whole stack with 
> the poison value. If that was already discussed sorry for the spam :)

Yes, the previous version of the patch series had this function written in
assembly. It used scasq for the poison searching and then stosq for writing:
http://www.openwall.com/lists/kernel-hardening/2018/03/03/9

That amount of assembly was blamed by the maintainers. So I've done my best to
rework it in C bypassing the pitfalls:
http://www.openwall.com/lists/kernel-hardening/2018/03/21/4

In fact, this implementation is not much slower than the assembly one, since we
erase only the used part of the thread stack. This is achieved by the
lowest_stack variable, which is updated during the syscall (please see the next
patch of the series).

Best regards,
Alexander

Patch

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index ea91cb6..21ee7c5 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -24,6 +24,7 @@  ffffffffa0000000 - [fixmap start]   (~1526 MB) module mapping space (variable)
 [fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
 ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+STACKLEAK_POISON value in this last hole: ffffffffffff4111
 
 Virtual memory map with 5 level page tables:
 
@@ -50,6 +51,7 @@  ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space
 [fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
 ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+STACKLEAK_POISON value in this last hole: ffffffffffff4111
 
 Architecture defines a 64-bit virtual address. Implementations can support
 less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
diff --git a/arch/Kconfig b/arch/Kconfig
index 76c0b54..368e2fb 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -401,6 +401,13 @@  config SECCOMP_FILTER
 
 	  See Documentation/prctl/seccomp_filter.txt for details.
 
+config HAVE_ARCH_STACKLEAK
+	bool
+	help
+	  An architecture should select this if it has the code which
+	  fills the used part of the kernel stack with the STACKLEAK_POISON
+	  value before returning from system calls.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
@@ -531,6 +538,26 @@  config GCC_PLUGIN_RANDSTRUCT_PERFORMANCE
 	  in structures.  This reduces the performance hit of RANDSTRUCT
 	  at the cost of weakened randomization.
 
+config GCC_PLUGIN_STACKLEAK
+	bool "Erase the kernel stack before returning from syscalls"
+	depends on GCC_PLUGINS
+	depends on HAVE_ARCH_STACKLEAK
+	help
+	  This option makes the kernel erase the kernel stack before it
+	  returns from a system call. That reduces the information which
+	  kernel stack leak bugs can reveal and blocks some uninitialized
+	  stack variable attacks. This option also provides runtime checks
+	  for kernel stack overflow detection.
+
+	  The tradeoff is the performance impact: on a single CPU system kernel
+	  compilation sees a 1% slowdown, other systems and workloads may vary
+	  and you are advised to test this feature on your expected workload
+	  before deploying it.
+
+	  This plugin was ported from grsecurity/PaX. More information at:
+	   * https://grsecurity.net/
+	   * https://pax.grsecurity.net/
+
 config HAVE_CC_STACKPROTECTOR
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 0fa71a7..e700879 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -119,6 +119,7 @@  config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
+	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
diff --git a/arch/x86/entry/Makefile b/arch/x86/entry/Makefile
index 06fc70c..abe4d92 100644
--- a/arch/x86/entry/Makefile
+++ b/arch/x86/entry/Makefile
@@ -15,3 +15,6 @@  obj-y				+= vsyscall/
 
 obj-$(CONFIG_IA32_EMULATION)	+= entry_64_compat.o syscall_32.o
 
+obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += erase.o
+KASAN_SANITIZE_erase.o		:= n
+
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 6ad064c..0dec4f3 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -77,6 +77,12 @@ 
 #endif
 .endm
 
+.macro ERASE_KSTACK
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	call erase_kstack
+#endif
+.endm
+
 /*
  * User gs save/restore
  *
@@ -298,6 +304,7 @@  ENTRY(ret_from_fork)
 	/* When we fork, we trace the syscall return in the child, too. */
 	movl    %esp, %eax
 	call    syscall_return_slowpath
+	ERASE_KSTACK
 	jmp     restore_all
 
 	/* kernel thread */
@@ -458,6 +465,8 @@  ENTRY(entry_SYSENTER_32)
 	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
 		    "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
 
+	ERASE_KSTACK
+
 /* Opportunistic SYSEXIT */
 	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
 	movl	PT_EIP(%esp), %edx	/* pt_regs->ip */
@@ -544,6 +553,8 @@  ENTRY(entry_INT80_32)
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
 
+	ERASE_KSTACK
+
 restore_all:
 	TRACE_IRQS_IRET
 .Lrestore_all_notrace:
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 18ed349..c85af83 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -66,6 +66,14 @@  END(native_usergs_sysret64)
 	TRACE_IRQS_FLAGS EFLAGS(%rsp)
 .endm
 
+.macro ERASE_KSTACK
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	PUSH_AND_CLEAR_REGS
+	call erase_kstack
+	POP_REGS
+#endif
+.endm
+
 /*
  * When dynamic function tracer is enabled it will add a breakpoint
  * to all locations that it is about to modify, sync CPUs, update
@@ -323,6 +331,8 @@  syscall_return_via_sysret:
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	ERASE_KSTACK
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
@@ -681,6 +691,7 @@  GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	ERASE_KSTACK
 
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 08425c4..6c29df6 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -19,6 +19,12 @@ 
 
 	.section .entry.text, "ax"
 
+	.macro ERASE_KSTACK
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	call erase_kstack
+#endif
+	.endm
+
 /*
  * 32-bit SYSENTER entry.
  *
@@ -258,6 +264,11 @@  GLOBAL(entry_SYSCALL_compat_after_hwframe)
 
 	/* Opportunistic SYSRET */
 sysret32_from_system_call:
+	/*
+	 * We are not going to return to the userspace from the trampoline
+	 * stack. So let's erase the thread stack right now.
+	 */
+	ERASE_KSTACK
 	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
 	movq	RBX(%rsp), %rbx		/* pt_regs->rbx */
 	movq	RBP(%rsp), %rbp		/* pt_regs->rbp */
diff --git a/arch/x86/entry/erase.c b/arch/x86/entry/erase.c
new file mode 100644
index 0000000..4892335
--- /dev/null
+++ b/arch/x86/entry/erase.c
@@ -0,0 +1,54 @@ 
+#include <linux/bug.h>
+#include <linux/sched.h>
+#include <asm/current.h>
+#include <asm/linkage.h>
+#include <asm/processor.h>
+
+asmlinkage void erase_kstack(void)
+{
+	register unsigned long p = current->thread.lowest_stack;
+	register unsigned long boundary = p & ~(THREAD_SIZE - 1);
+	unsigned long poison = 0;
+	const unsigned long check_depth = STACKLEAK_POISON_CHECK_DEPTH /
+							sizeof(unsigned long);
+
+	/*
+	 * Let's search for the poison value in the stack.
+	 * Start from the lowest_stack and go to the bottom.
+	 */
+	while (p > boundary && poison <= check_depth) {
+		if (*(unsigned long *)p == STACKLEAK_POISON)
+			poison++;
+		else
+			poison = 0;
+
+		p -= sizeof(unsigned long);
+	}
+
+	/*
+	 * One long int at the bottom of the thread stack is reserved and
+	 * should not be poisoned (see CONFIG_SCHED_STACK_END_CHECK).
+	 */
+	if (p == boundary)
+		p += sizeof(unsigned long);
+
+	/*
+	 * So let's write the poison value to the kernel stack.
+	 * Start from the address in p and move up till the new boundary.
+	 */
+	if (on_thread_stack())
+		boundary = current_stack_pointer;
+	else
+		boundary = current_top_of_stack();
+
+	BUG_ON(boundary - p >= THREAD_SIZE);
+
+	while (p < boundary) {
+		*(unsigned long *)p = STACKLEAK_POISON;
+		p += sizeof(unsigned long);
+	}
+
+	/* Reset the lowest_stack value for the next syscall */
+	current->thread.lowest_stack = current_top_of_stack() - 256;
+}
+
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b0ccd48..0c87813 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -494,6 +494,10 @@  struct thread_struct {
 
 	mm_segment_t		addr_limit;
 
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	unsigned long		lowest_stack;
+#endif
+
 	unsigned int		sig_on_uaccess_err:1;
 	unsigned int		uaccess_err:1;	/* uaccess failed */
 
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5224c60..1b0892e 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -136,6 +136,11 @@  int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	p->thread.sp0 = (unsigned long) (childregs+1);
 	memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	p->thread.lowest_stack = (unsigned long)task_stack_page(p) +
+						sizeof(unsigned long);
+#endif
+
 	if (unlikely(p->flags & PF_KTHREAD)) {
 		/* kernel thread */
 		memset(childregs, 0, sizeof(struct pt_regs));
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9eb448c..82122af 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -281,6 +281,11 @@  int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
 	p->thread.sp = (unsigned long) fork_frame;
 	p->thread.io_bitmap_ptr = NULL;
 
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	p->thread.lowest_stack = (unsigned long)task_stack_page(p) +
+						sizeof(unsigned long);
+#endif
+
 	savesegment(gs, p->thread.gsindex);
 	p->thread.gsbase = p->thread.gsindex ? 0 : me->thread.gsbase;
 	savesegment(fs, p->thread.fsindex);
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index ab4711c..341b6cf8 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -342,4 +342,8 @@  unsigned long read_word_at_a_time(const void *addr)
 	compiletime_assert(__native_word(t),				\
 		"Need native word sized stores/loads for atomicity.")
 
+/* Poison value points to the unused hole in the virtual memory map */
+#define STACKLEAK_POISON -0xBEEF
+#define STACKLEAK_POISON_CHECK_DEPTH 128
+
 #endif /* __LINUX_COMPILER_H */