Patchwork [v13,(resend),2/6] x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls

login
register
mail settings
Submitter Alexander Popov
Date June 22, 2018, 4:58 p.m.
Message ID <1529686717-16017-3-git-send-email-alex.popov@linux.com>
Download mbox | patch
Permalink /patch/10482707/
State New
Headers show

Comments

Alexander Popov - June 22, 2018, 4:58 p.m.
The STACKLEAK feature erases the kernel stack before returning from
syscalls. That reduces the information which kernel stack leak bugs can
reveal and blocks some uninitialized stack variable attacks. Moreover,
STACKLEAK blocks kernel stack depth overflow caused by alloca (aka
Stack Clash attack).

This commit introduces the code filling the used part of the kernel
stack with a poison value before returning to the userspace. Full
STACKLEAK feature also contains the gcc plugin which comes in a
separate commit.

The STACKLEAK feature is ported from grsecurity/PaX. More information at:
  https://grsecurity.net/
  https://pax.grsecurity.net/

This code is modified from Brad Spengler/PaX Team's code in the last
public patch of grsecurity/PaX based on our understanding of the code.
Changes or omissions from the original code are ours and don't reflect
the original grsecurity/PaX code.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 Documentation/x86/x86_64/mm.txt  |  2 ++
 arch/Kconfig                     | 27 +++++++++++++++++
 arch/x86/Kconfig                 |  1 +
 arch/x86/entry/calling.h         | 14 +++++++++
 arch/x86/entry/entry_32.S        |  7 +++++
 arch/x86/entry/entry_64.S        |  3 ++
 arch/x86/entry/entry_64_compat.S |  5 ++++
 include/linux/sched.h            |  4 +++
 include/linux/stackleak.h        | 24 +++++++++++++++
 kernel/Makefile                  |  4 +++
 kernel/fork.c                    |  3 ++
 kernel/stackleak.c               | 63 ++++++++++++++++++++++++++++++++++++++++
 12 files changed, 157 insertions(+)
 create mode 100644 include/linux/stackleak.h
 create mode 100644 kernel/stackleak.c
Laura Abbott - June 22, 2018, 6:58 p.m.
On 06/22/2018 09:58 AM, Alexander Popov wrote:
> +	/*
> +	 * Now write the poison value to the kernel stack. Start from
> +	 * 'kstack_ptr' and move up till the new 'boundary'. We assume that
> +	 * the stack pointer doesn't change when we write poison.
> +	 */
> +	if (on_thread_stack())
> +		boundary = current_stack_pointer;
> +	else
> +		boundary = current_top_of_stack();
> +
> +	BUG_ON(boundary - kstack_ptr >= THREAD_SIZE);
> +
> +	while (kstack_ptr < boundary) {
> +		*(unsigned long *)kstack_ptr = STACKLEAK_POISON;
> +		kstack_ptr += sizeof(unsigned long);
> +	}
> +
> +	/* Reset the 'lowest_stack' value for the next syscall */
> +	current->lowest_stack = current_top_of_stack() - THREAD_SIZE / 64;

on_thread_stack() and current_top_of_stack() are x86 only
functions currently defined in asm/processor.h. There are
similar functions in arch/arm64/include/asm/stacktrace.h
which I think I can use to build those functions. Should
I just throw the arm64 versions in processor.h or do
we want to consider abstracting these into something like
asm/stackleak.h? I'd like to know a direction before I
start ripping apart stacktrace.h.

Thanks,
Laura
Kees Cook - June 27, 2018, 12:13 a.m.
On Fri, Jun 22, 2018 at 11:58 AM, Laura Abbott <labbott@redhat.com> wrote:
> On 06/22/2018 09:58 AM, Alexander Popov wrote:
>>
>> +       /*
>> +        * Now write the poison value to the kernel stack. Start from
>> +        * 'kstack_ptr' and move up till the new 'boundary'. We assume
>> that
>> +        * the stack pointer doesn't change when we write poison.
>> +        */
>> +       if (on_thread_stack())
>> +               boundary = current_stack_pointer;
>> +       else
>> +               boundary = current_top_of_stack();
>> +
>> +       BUG_ON(boundary - kstack_ptr >= THREAD_SIZE);
>> +
>> +       while (kstack_ptr < boundary) {
>> +               *(unsigned long *)kstack_ptr = STACKLEAK_POISON;
>> +               kstack_ptr += sizeof(unsigned long);
>> +       }
>> +
>> +       /* Reset the 'lowest_stack' value for the next syscall */
>> +       current->lowest_stack = current_top_of_stack() - THREAD_SIZE / 64;
>
>
> on_thread_stack() and current_top_of_stack() are x86 only
> functions currently defined in asm/processor.h. There are
> similar functions in arch/arm64/include/asm/stacktrace.h
> which I think I can use to build those functions. Should
> I just throw the arm64 versions in processor.h or do
> we want to consider abstracting these into something like
> asm/stackleak.h? I'd like to know a direction before I
> start ripping apart stacktrace.h.

static inline unsigned long current_top_of_stack(void)
{
        /*
         *  We can't read directly from tss.sp0: sp0 on x86_32 is special in
         *  and around vm86 mode and sp0 on x86_64 is special because of the
         *  entry trampoline.
         */
        return this_cpu_read_stable(cpu_current_top_of_stack);
}

static inline bool on_thread_stack(void)
{
        return (unsigned long)(current_top_of_stack() -
                               current_stack_pointer) < THREAD_SIZE;
}

The only arch-specific part is current_top_of_stack(), so it seems
like just solving current_top_of_stack() and copying on_thread_stack()
directly should work, yes? I would think sticking those directly into
asm/processor.h on arm64 would be okay?

FWIW, I don't think a separate asm/stackleak.h is needed.

-Kees
Kees Cook - June 27, 2018, 12:21 a.m.
On Tue, Jun 26, 2018 at 5:13 PM, Kees Cook <keescook@chromium.org> wrote:
> The only arch-specific part is current_top_of_stack(), so it seems
> like just solving current_top_of_stack() and copying on_thread_stack()
> directly should work, yes? I would think sticking those directly into
> asm/processor.h on arm64 would be okay?

The difference in the earlier x86 and arm64 erase_kstack was:

x86:
        if (on_thread_stack())
                boundary = current_stack_pointer;
        else
                boundary = current_top_of_stack();

arm64:
        boundary = current_stack_pointer;

If a sane current_top_of_stack() isn't possible on arm64, I'm not
opposed to a small bit of arch-specific ifdef in there:

        boundary = get_stack_boundary();

static inline unsigned long get_stack_boundary(void)
{
#ifdef CONFIG_X86
        if (!on_thread_stack())
                return current_top_of_stack();
#else
         return current_stack_pointer;
}

-Kees
Laura Abbott - June 27, 2018, 12:55 a.m.
On 06/26/2018 05:21 PM, Kees Cook wrote:
> On Tue, Jun 26, 2018 at 5:13 PM, Kees Cook <keescook@chromium.org> wrote:
>> The only arch-specific part is current_top_of_stack(), so it seems
>> like just solving current_top_of_stack() and copying on_thread_stack()
>> directly should work, yes? I would think sticking those directly into
>> asm/processor.h on arm64 would be okay?
> 
> The difference in the earlier x86 and arm64 erase_kstack was:
> 
> x86:
>          if (on_thread_stack())
>                  boundary = current_stack_pointer;
>          else
>                  boundary = current_top_of_stack();
> 
> arm64:
>          boundary = current_stack_pointer;
> 
> If a sane current_top_of_stack() isn't possible on arm64, I'm not
> opposed to a small bit of arch-specific ifdef in there:
> 
>          boundary = get_stack_boundary();
> 
> static inline unsigned long get_stack_boundary(void)
> {
> #ifdef CONFIG_X86
>          if (!on_thread_stack())
>                  return current_top_of_stack();
> #else
>           return current_stack_pointer;
> }
> 
> -Kees
> 

I _think_ I have reasonable implementations for both functions,
I was asking about pulling stuff out into a new header because
I wanted to see about reusing on_task_stack which was defined
in asm/stacktrace.h and I was concerned about include nightmares
with asm/processor.h . I'll see what I come up with.

Thanks,
Laura
Ingo Molnar - July 5, 2018, 8:12 a.m.
* Alexander Popov <alex.popov@linux.com> wrote:

> The STACKLEAK feature erases the kernel stack before returning from
> syscalls. That reduces the information which kernel stack leak bugs can
> reveal and blocks some uninitialized stack variable attacks. Moreover,
> STACKLEAK blocks kernel stack depth overflow caused by alloca (aka
> Stack Clash attack).
> 
> This commit introduces the code filling the used part of the kernel
> stack with a poison value before returning to the userspace. Full
> STACKLEAK feature also contains the gcc plugin which comes in a
> separate commit.
> 
> The STACKLEAK feature is ported from grsecurity/PaX. More information at:
>   https://grsecurity.net/
>   https://pax.grsecurity.net/
> 
> This code is modified from Brad Spengler/PaX Team's code in the last
> public patch of grsecurity/PaX based on our understanding of the code.
> Changes or omissions from the original code are ours and don't reflect
> the original grsecurity/PaX code.
> 
> Signed-off-by: Alexander Popov <alex.popov@linux.com>
> Acked-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
> ---
>  Documentation/x86/x86_64/mm.txt  |  2 ++
>  arch/Kconfig                     | 27 +++++++++++++++++
>  arch/x86/Kconfig                 |  1 +
>  arch/x86/entry/calling.h         | 14 +++++++++
>  arch/x86/entry/entry_32.S        |  7 +++++
>  arch/x86/entry/entry_64.S        |  3 ++
>  arch/x86/entry/entry_64_compat.S |  5 ++++
>  include/linux/sched.h            |  4 +++
>  include/linux/stackleak.h        | 24 +++++++++++++++
>  kernel/Makefile                  |  4 +++
>  kernel/fork.c                    |  3 ++
>  kernel/stackleak.c               | 63 ++++++++++++++++++++++++++++++++++++++++
>  12 files changed, 157 insertions(+)
>  create mode 100644 include/linux/stackleak.h
>  create mode 100644 kernel/stackleak.c
> 
> diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> index 5432a96..600bc2a 100644
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -24,6 +24,7 @@ ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space
>  [fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
>  ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
>  ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
> +STACKLEAK_POISON value in this last hole: ffffffffffff4111
>  
>  Virtual memory map with 5 level page tables:
>  
> @@ -50,6 +51,7 @@ ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space
>  [fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
>  ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
>  ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
> +STACKLEAK_POISON value in this last hole: ffffffffffff4111
>  
>  Architecture defines a 64-bit virtual address. Implementations can support
>  less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 1aa5906..57817f0 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -414,6 +414,13 @@ config PLUGIN_HOSTCC
>  	  Host compiler used to build GCC plugins.  This can be $(HOSTCXX),
>  	  $(HOSTCC), or a null string if GCC plugin is unsupported.
>  
> +config HAVE_ARCH_STACKLEAK
> +	bool
> +	help
> +	  An architecture should select this if it has the code which
> +	  fills the used part of the kernel stack with the STACKLEAK_POISON
> +	  value before returning from system calls.
> +
>  config HAVE_GCC_PLUGINS
>  	bool
>  	help
> @@ -549,6 +556,26 @@ config GCC_PLUGIN_RANDSTRUCT_PERFORMANCE
>  	  in structures.  This reduces the performance hit of RANDSTRUCT
>  	  at the cost of weakened randomization.
>  
> +config GCC_PLUGIN_STACKLEAK
> +	bool "Erase the kernel stack before returning from syscalls"
> +	depends on GCC_PLUGINS
> +	depends on HAVE_ARCH_STACKLEAK
> +	help
> +	  This option makes the kernel erase the kernel stack before
> +	  returning from system calls. That reduces the information which
> +	  kernel stack leak bugs can reveal and blocks some uninitialized
> +	  stack variable attacks. This option also blocks kernel stack depth
> +	  overflow caused by alloca (aka Stack Clash attack).

Nit, please pick one of these variants to refer to library functions:

	  overflow caused by 'alloca' (aka Stack Clash attack).
	  overflow caused by alloca() (aka Stack Clash attack).

Like you correctly did later on in a C comment block:

> + * STACKLEAK blocks stack depth overflow caused by alloca() (aka Stack Clash
> + * attack).
> + */


> +	  The tradeoff is the performance impact: on a single CPU system kernel
> +	  compilation sees a 1% slowdown, other systems and workloads may vary
> +	  and you are advised to test this feature on your expected workload
> +	  before deploying it.

Is there a way to patch this out runtime? I.e. if a distro enabled it, is there an 
easy way to disable much of the overhead without rebooting the kernel?

> --- /dev/null
> +++ b/include/linux/stackleak.h
> @@ -0,0 +1,24 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_STACKLEAK_H
> +#define _LINUX_STACKLEAK_H
> +
> +#include <linux/sched.h>
> +#include <linux/sched/task_stack.h>
> +
> +/*
> + * Check that the poison value points to the unused hole in the
> + * virtual memory map for your platform.
> + */
> +#define STACKLEAK_POISON -0xBEEF
> +
> +#define STACKLEAK_POISON_CHECK_DEPTH 128
> +
> +static inline void stackleak_task_init(struct task_struct *task)
> +{
> +#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
> +	task->lowest_stack = (unsigned long)end_of_stack(task) +
> +						sizeof(unsigned long);
> +#endif

Please don't break the line in such an ugly fashion - just keep the line long and 
ignore checkpatch, because the cure is worse than the disease.

> --- /dev/null
> +++ b/kernel/stackleak.c
> @@ -0,0 +1,63 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This code fills the used part of the kernel stack with a poison value
> + * before returning to the userspace. It's a part of the STACKLEAK feature
> + * ported from grsecurity/PaX.

s/returning to the userspace
 /returning to userspace

s/It's a part of the STACKLEAK feature
 /It's part of the STACKLEAK feature

> + * Author: Alexander Popov <alex.popov@linux.com>
> + *
> + * STACKLEAK reduces the information which kernel stack leak bugs can
> + * reveal and blocks some uninitialized stack variable attacks. Moreover,
> + * STACKLEAK blocks stack depth overflow caused by alloca() (aka Stack Clash
> + * attack).
> + */
> +
> +#include <linux/stackleak.h>
> +
> +asmlinkage void stackleak_erase_kstack(void)

s/stackleak_erase_kstack
 /stackleak_erase

?

> +{
> +	/* It would be nice not to have 'kstack_ptr' and 'boundary' on stack */
> +	unsigned long kstack_ptr = current->lowest_stack;
> +	unsigned long boundary = kstack_ptr & ~(THREAD_SIZE - 1);
> +	unsigned int poison_count = 0;
> +	const unsigned int check_depth =
> +			STACKLEAK_POISON_CHECK_DEPTH / sizeof(unsigned long);

ugly linebreak.

> +
> +	/* Search for the poison value in the kernel stack */
> +	while (kstack_ptr > boundary && poison_count <= check_depth) {
> +		if (*(unsigned long *)kstack_ptr == STACKLEAK_POISON)
> +			poison_count++;
> +		else
> +			poison_count = 0;
> +
> +		kstack_ptr -= sizeof(unsigned long);
> +	}
> +
> +	/*
> +	 * One 'long int' at the bottom of the thread stack is reserved and
> +	 * should not be poisoned (see CONFIG_SCHED_STACK_END_CHECK).

Nit:

  s/CONFIG_SCHED_STACK_END_CHECK
   /CONFIG_SCHED_STACK_END_CHECK=y

> +	 */
> +	if (kstack_ptr == boundary)
> +		kstack_ptr += sizeof(unsigned long);
> +
> +	/*
> +	 * Now write the poison value to the kernel stack. Start from
> +	 * 'kstack_ptr' and move up till the new 'boundary'. We assume that
> +	 * the stack pointer doesn't change when we write poison.
> +	 */
> +	if (on_thread_stack())
> +		boundary = current_stack_pointer;
> +	else
> +		boundary = current_top_of_stack();
> +
> +	BUG_ON(boundary - kstack_ptr >= THREAD_SIZE);

Should never happen, right? If so then please make this:

	if (WARN_ON(boundary - kstack_ptr >= THREAD_SIZE))
		return;

or so, to make it non-fatal and to allow users to report it, should it trigger 
against all expectations.

> +
> +	while (kstack_ptr < boundary) {
> +		*(unsigned long *)kstack_ptr = STACKLEAK_POISON;
> +		kstack_ptr += sizeof(unsigned long);
> +	}
> +
> +	/* Reset the 'lowest_stack' value for the next syscall */
> +	current->lowest_stack = current_top_of_stack() - THREAD_SIZE / 64;
> +}

Nit, I'd write this as:

	current->lowest_stack = current_top_of_stack() - THREAD_SIZE/64;

to make it group better visually. (Again, ignore checkpatch if it complains, it's 
wrong.)

Overall I like the interface cleanups in v13, so if these nits are addressed and 
it becomes possible for (root users) to disable the checking then I suppose this 
is fine.

Thanks,

	Ingo
Peter Zijlstra - July 5, 2018, 10 a.m.
On Thu, Jul 05, 2018 at 10:12:37AM +0200, Ingo Molnar wrote:
> * Alexander Popov <alex.popov@linux.com> wrote:
> > +static inline void stackleak_task_init(struct task_struct *task)
> > +{
> > +#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
> > +	task->lowest_stack = (unsigned long)end_of_stack(task) +
> > +						sizeof(unsigned long);
> > +#endif
> 
> Please don't break the line in such an ugly fashion - just keep the line long and 
> ignore checkpatch, because the cure is worse than the disease.

FWIW I don't totally hate on:

	task->lowest_stack =
		(unsigned long)end_of_stack(task) + sizeof(unsigned long);

which is a much more readable split.
Ingo Molnar - July 5, 2018, 10:35 a.m.
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jul 05, 2018 at 10:12:37AM +0200, Ingo Molnar wrote:
> > * Alexander Popov <alex.popov@linux.com> wrote:
> > > +static inline void stackleak_task_init(struct task_struct *task)
> > > +{
> > > +#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
> > > +	task->lowest_stack = (unsigned long)end_of_stack(task) +
> > > +						sizeof(unsigned long);
> > > +#endif
> > 
> > Please don't break the line in such an ugly fashion - just keep the line long and 
> > ignore checkpatch, because the cure is worse than the disease.
> 
> FWIW I don't totally hate on:
> 
> 	task->lowest_stack =
> 		(unsigned long)end_of_stack(task) + sizeof(unsigned long);
> 
> which is a much more readable split.

Yeah, or rename 'task' to the regular 't' or 'p' and do:

	t->lowest_stack = (long)end_of_stack(t) + sizeof(long);

;-)

	Ingo
Alexander Popov - July 5, 2018, 9:55 p.m.
Hello Ingo,

Thanks for your review! I'll fix the style issues you pointed at.

Please also see my answers below.

On 05.07.2018 11:12, Ingo Molnar wrote:
>> +	  The tradeoff is the performance impact: on a single CPU system kernel
>> +	  compilation sees a 1% slowdown, other systems and workloads may vary
>> +	  and you are advised to test this feature on your expected workload
>> +	  before deploying it.
> 
> Is there a way to patch this out runtime? I.e. if a distro enabled it, is there an 
> easy way to disable much of the overhead without rebooting the kernel?

Hm. We can't completely disable STACKLEAK in runtime, since STACKLEAK gcc plugin
performs compile-time instrumentation of the kernel code. So we can only chop
off a part of functionality, for example, by introducing some variable and
checking it before every stack erasing (additional performance impact), but the
kernel will stay uselessly instrumented. It doesn't look reasonable to me.


>> +	BUG_ON(boundary - kstack_ptr >= THREAD_SIZE);
> 
> Should never happen, right? 

Yes. It can happen only if 'current->lowest_stack' was corrupted.

> If so then please make this:
> 
> 	if (WARN_ON(boundary - kstack_ptr >= THREAD_SIZE))
> 		return;
> 
> or so, to make it non-fatal and to allow users to report it, should it trigger 
> against all expectations.

I've made an experiment. The results:
 1. BUG_ON() here doesn't freeze the kernel output - I see a full 'PANIC: double
fault' report;
 2. WARN_ON() here gives absolutely same 'PANIC: double fault' here.

So there is no reason to introduce the surplus "WARN_ON() + return" logic here.


> Overall I like the interface cleanups in v13, so if these nits are addressed and 
> it becomes possible for (root users) to disable the checking then I suppose this 
> is fine.

Thanks a lot for your positive attitude.

Best regards,
Alexander
Ingo Molnar - July 5, 2018, 10:20 p.m.
* Alexander Popov <alex.popov@linux.com> wrote:

> Hello Ingo,
> 
> Thanks for your review! I'll fix the style issues you pointed at.
> 
> Please also see my answers below.
> 
> On 05.07.2018 11:12, Ingo Molnar wrote:
> >> +	  The tradeoff is the performance impact: on a single CPU system kernel
> >> +	  compilation sees a 1% slowdown, other systems and workloads may vary
> >> +	  and you are advised to test this feature on your expected workload
> >> +	  before deploying it.
> > 
> > Is there a way to patch this out runtime? I.e. if a distro enabled it, is there an 
> > easy way to disable much of the overhead without rebooting the kernel?
> 
> Hm. We can't completely disable STACKLEAK in runtime, since STACKLEAK gcc plugin
> performs compile-time instrumentation of the kernel code. So we can only chop
> off a part of functionality, for example, by introducing some variable and
> checking it before every stack erasing (additional performance impact), but the
> kernel will stay uselessly instrumented. It doesn't look reasonable to me.

Or we could use what every other performance critical instrumentation feature uses 
to reduce overhead (ftrace, perf): kernel patching.

> > If so then please make this:
> > 
> > 	if (WARN_ON(boundary - kstack_ptr >= THREAD_SIZE))
> > 		return;
> > 
> > or so, to make it non-fatal and to allow users to report it, should it trigger 
> > against all expectations.
> 
> I've made an experiment. The results:
>  1. BUG_ON() here doesn't freeze the kernel output - I see a full 'PANIC: double
> fault' report;

Only in text mode - very few users are using text mode.

>  2. WARN_ON() here gives absolutely same 'PANIC: double fault' here.

that should only happen if the kernel is otherwise already fatally corrupted, 
right?

Thanks,

	Ingo
Alexander Popov - July 6, 2018, 11:46 a.m.
On 06.07.2018 01:20, Ingo Molnar wrote:
> 
> * Alexander Popov <alex.popov@linux.com> wrote:
> 
>> Hello Ingo,
>>
>> Thanks for your review! I'll fix the style issues you pointed at.
>>
>> Please also see my answers below.
>>
>> On 05.07.2018 11:12, Ingo Molnar wrote:
>>>> +	  The tradeoff is the performance impact: on a single CPU system kernel
>>>> +	  compilation sees a 1% slowdown, other systems and workloads may vary
>>>> +	  and you are advised to test this feature on your expected workload
>>>> +	  before deploying it.
>>>
>>> Is there a way to patch this out runtime? I.e. if a distro enabled it, is there an 
>>> easy way to disable much of the overhead without rebooting the kernel?
>>
>> Hm. We can't completely disable STACKLEAK in runtime, since STACKLEAK gcc plugin
>> performs compile-time instrumentation of the kernel code. So we can only chop
>> off a part of functionality, for example, by introducing some variable and
>> checking it before every stack erasing (additional performance impact), but the
>> kernel will stay uselessly instrumented. It doesn't look reasonable to me.
> 
> Or we could use what every other performance critical instrumentation feature uses 
> to reduce overhead (ftrace, perf): kernel patching.

I see. It would be a big separate research - how to combine those different
kinds of instrumentation. I would propose to postpone it until we have a request
for STACKLEAK runtime disabling.

>>> If so then please make this:
>>>
>>> 	if (WARN_ON(boundary - kstack_ptr >= THREAD_SIZE))
>>> 		return;
>>>
>>> or so, to make it non-fatal and to allow users to report it, should it trigger 
>>> against all expectations.
>>
>> I've made an experiment. The results:
>>  1. BUG_ON() here doesn't freeze the kernel output - I see a full 'PANIC: double
>> fault' report;
> 
> Only in text mode - very few users are using text mode.
> 
>>  2. WARN_ON() here gives absolutely same 'PANIC: double fault' here.
> 
> that should only happen if the kernel is otherwise already fatally corrupted, 
> right?

No, I mean WARN_ON() in stackleak_erase_kstack() gives the double fault just
like BUG_ON() (without any corruption). In my experiment I've made the following
change:

-       BUG_ON(boundary - kstack_ptr >= THREAD_SIZE);
+//     BUG_ON(boundary - kstack_ptr >= THREAD_SIZE);
+       WARN_ON(1);

It might be caused by the fact, that stackleak_erase_kstack() is called from the
trampoline stack just before returning to the userspace.

So I mean 'WARN_ON() + return' here wouldn't give any profit over a single
BUG_ON() check.

Best regards,
Alexander

Patch

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index 5432a96..600bc2a 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -24,6 +24,7 @@  ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space
 [fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
 ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+STACKLEAK_POISON value in this last hole: ffffffffffff4111
 
 Virtual memory map with 5 level page tables:
 
@@ -50,6 +51,7 @@  ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space
 [fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
 ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+STACKLEAK_POISON value in this last hole: ffffffffffff4111
 
 Architecture defines a 64-bit virtual address. Implementations can support
 less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
diff --git a/arch/Kconfig b/arch/Kconfig
index 1aa5906..57817f0 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@  config PLUGIN_HOSTCC
 	  Host compiler used to build GCC plugins.  This can be $(HOSTCXX),
 	  $(HOSTCC), or a null string if GCC plugin is unsupported.
 
+config HAVE_ARCH_STACKLEAK
+	bool
+	help
+	  An architecture should select this if it has the code which
+	  fills the used part of the kernel stack with the STACKLEAK_POISON
+	  value before returning from system calls.
+
 config HAVE_GCC_PLUGINS
 	bool
 	help
@@ -549,6 +556,26 @@  config GCC_PLUGIN_RANDSTRUCT_PERFORMANCE
 	  in structures.  This reduces the performance hit of RANDSTRUCT
 	  at the cost of weakened randomization.
 
+config GCC_PLUGIN_STACKLEAK
+	bool "Erase the kernel stack before returning from syscalls"
+	depends on GCC_PLUGINS
+	depends on HAVE_ARCH_STACKLEAK
+	help
+	  This option makes the kernel erase the kernel stack before
+	  returning from system calls. That reduces the information which
+	  kernel stack leak bugs can reveal and blocks some uninitialized
+	  stack variable attacks. This option also blocks kernel stack depth
+	  overflow caused by alloca (aka Stack Clash attack).
+
+	  The tradeoff is the performance impact: on a single CPU system kernel
+	  compilation sees a 1% slowdown, other systems and workloads may vary
+	  and you are advised to test this feature on your expected workload
+	  before deploying it.
+
+	  This plugin was ported from grsecurity/PaX. More information at:
+	   * https://grsecurity.net/
+	   * https://pax.grsecurity.net/
+
 config HAVE_STACKPROTECTOR
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f1dbb4e..3718d70 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@  config X86
 	select HAVE_ARCH_COMPAT_MMAP_BASES	if MMU && COMPAT
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
+	select HAVE_ARCH_STACKLEAK
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 352e70c..ba001e2 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -329,8 +329,22 @@  For 32-bit we have the following conventions - kernel is built with
 
 #endif
 
+.macro STACKLEAK_ERASE_NOCLOBBER
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	PUSH_AND_CLEAR_REGS
+	call stackleak_erase_kstack
+	POP_REGS
+#endif
+.endm
+
 #endif /* CONFIG_X86_64 */
 
+.macro STACKLEAK_ERASE
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	call stackleak_erase_kstack
+#endif
+.endm
+
 /*
  * This does 'call enter_from_user_mode' unless we can avoid it based on
  * kernel config or using the static jump infrastructure.
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2582881..72ce401 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -46,6 +46,8 @@ 
 #include <asm/frame.h>
 #include <asm/nospec-branch.h>
 
+#include "calling.h"
+
 	.section .entry.text, "ax"
 
 /*
@@ -298,6 +300,7 @@  ENTRY(ret_from_fork)
 	/* When we fork, we trace the syscall return in the child, too. */
 	movl    %esp, %eax
 	call    syscall_return_slowpath
+	STACKLEAK_ERASE
 	jmp     restore_all
 
 	/* kernel thread */
@@ -458,6 +461,8 @@  ENTRY(entry_SYSENTER_32)
 	ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
 		    "jmp .Lsyscall_32_done", X86_FEATURE_XENPV
 
+	STACKLEAK_ERASE
+
 /* Opportunistic SYSEXIT */
 	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
 	movl	PT_EIP(%esp), %edx	/* pt_regs->ip */
@@ -544,6 +549,8 @@  ENTRY(entry_INT80_32)
 	call	do_int80_syscall_32
 .Lsyscall_32_done:
 
+	STACKLEAK_ERASE
+
 restore_all:
 	TRACE_IRQS_IRET
 .Lrestore_all_notrace:
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 73a522d..408101b 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -329,6 +329,8 @@  syscall_return_via_sysret:
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	STACKLEAK_ERASE_NOCLOBBER
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
@@ -687,6 +689,7 @@  GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	STACKLEAK_ERASE_NOCLOBBER
 
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 9de7f1e..193c7e6 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -261,6 +261,11 @@  GLOBAL(entry_SYSCALL_compat_after_hwframe)
 
 	/* Opportunistic SYSRET */
 sysret32_from_system_call:
+	/*
+	 * We are not going to return to the userspace from the trampoline
+	 * stack. So let's erase the thread stack right now.
+	 */
+	STACKLEAK_ERASE
 	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
 	movq	RBX(%rsp), %rbx		/* pt_regs->rbx */
 	movq	RBP(%rsp), %rbp		/* pt_regs->rbp */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d..89c4cba 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1180,6 +1180,10 @@  struct task_struct {
 	void				*security;
 #endif
 
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	unsigned long			lowest_stack;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/linux/stackleak.h b/include/linux/stackleak.h
new file mode 100644
index 0000000..743c911
--- /dev/null
+++ b/include/linux/stackleak.h
@@ -0,0 +1,24 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_STACKLEAK_H
+#define _LINUX_STACKLEAK_H
+
+#include <linux/sched.h>
+#include <linux/sched/task_stack.h>
+
+/*
+ * Check that the poison value points to the unused hole in the
+ * virtual memory map for your platform.
+ */
+#define STACKLEAK_POISON -0xBEEF
+
+#define STACKLEAK_POISON_CHECK_DEPTH 128
+
+static inline void stackleak_task_init(struct task_struct *task)
+{
+#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
+	task->lowest_stack = (unsigned long)end_of_stack(task) +
+						sizeof(unsigned long);
+#endif
+}
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 04bc07c..e5dc293 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -117,6 +117,10 @@  obj-$(CONFIG_HAS_IOMEM) += iomem.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_RSEQ) += rseq.o
 
+obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o
+KASAN_SANITIZE_stackleak.o := n
+KCOV_INSTRUMENT_stackleak.o := n
+
 $(obj)/configs.o: $(obj)/config_data.h
 
 targets += config_data.gz
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61..1bc86ba 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -91,6 +91,7 @@ 
 #include <linux/kcov.h>
 #include <linux/livepatch.h>
 #include <linux/thread_info.h>
+#include <linux/stackleak.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -1813,6 +1814,8 @@  static __latent_entropy struct task_struct *copy_process(
 	if (retval)
 		goto bad_fork_cleanup_io;
 
+	stackleak_task_init(p);
+
 	if (pid != &init_struct_pid) {
 		pid = alloc_pid(p->nsproxy->pid_ns_for_children);
 		if (IS_ERR(pid)) {
diff --git a/kernel/stackleak.c b/kernel/stackleak.c
new file mode 100644
index 0000000..a84a161
--- /dev/null
+++ b/kernel/stackleak.c
@@ -0,0 +1,63 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This code fills the used part of the kernel stack with a poison value
+ * before returning to the userspace. It's a part of the STACKLEAK feature
+ * ported from grsecurity/PaX.
+ *
+ * Author: Alexander Popov <alex.popov@linux.com>
+ *
+ * STACKLEAK reduces the information which kernel stack leak bugs can
+ * reveal and blocks some uninitialized stack variable attacks. Moreover,
+ * STACKLEAK blocks stack depth overflow caused by alloca() (aka Stack Clash
+ * attack).
+ */
+
+#include <linux/stackleak.h>
+
+asmlinkage void stackleak_erase_kstack(void)
+{
+	/* It would be nice not to have 'kstack_ptr' and 'boundary' on stack */
+	unsigned long kstack_ptr = current->lowest_stack;
+	unsigned long boundary = kstack_ptr & ~(THREAD_SIZE - 1);
+	unsigned int poison_count = 0;
+	const unsigned int check_depth =
+			STACKLEAK_POISON_CHECK_DEPTH / sizeof(unsigned long);
+
+	/* Search for the poison value in the kernel stack */
+	while (kstack_ptr > boundary && poison_count <= check_depth) {
+		if (*(unsigned long *)kstack_ptr == STACKLEAK_POISON)
+			poison_count++;
+		else
+			poison_count = 0;
+
+		kstack_ptr -= sizeof(unsigned long);
+	}
+
+	/*
+	 * One 'long int' at the bottom of the thread stack is reserved and
+	 * should not be poisoned (see CONFIG_SCHED_STACK_END_CHECK).
+	 */
+	if (kstack_ptr == boundary)
+		kstack_ptr += sizeof(unsigned long);
+
+	/*
+	 * Now write the poison value to the kernel stack. Start from
+	 * 'kstack_ptr' and move up till the new 'boundary'. We assume that
+	 * the stack pointer doesn't change when we write poison.
+	 */
+	if (on_thread_stack())
+		boundary = current_stack_pointer;
+	else
+		boundary = current_top_of_stack();
+
+	BUG_ON(boundary - kstack_ptr >= THREAD_SIZE);
+
+	while (kstack_ptr < boundary) {
+		*(unsigned long *)kstack_ptr = STACKLEAK_POISON;
+		kstack_ptr += sizeof(unsigned long);
+	}
+
+	/* Reset the 'lowest_stack' value for the next syscall */
+	current->lowest_stack = current_top_of_stack() - THREAD_SIZE / 64;
+}
+