diff mbox

fork: Allow stack to be wiped on fork

Message ID 20180117055015.GA15256@beast (mailing list archive)
State New, archived
Headers show

Commit Message

Kees Cook Jan. 17, 2018, 5:50 a.m. UTC
One of the classes of kernel stack content leaks is exposing the contents
of prior heap or stack contents when a new process stack is allocated.
Normally, those stacks are not zeroed, and the old contents remain in
place. With some types of stack content exposure flaws, those contents
can leak to userspace. Kernels built with CONFIG_CLEAR_STACK_FORK will
no longer be vulnerable to this, as the stack will be wiped each time
a stack is assigned to a new process. There's not a meaningful change
in runtime performance; it almost looks like it provides a benefit.

Performing back-to-back kernel builds before:
	Run times: 157.86 157.09 158.90 160.94 160.80
	Mean: 159.12
	Std Dev: 1.54

With CONFIG_CLEAR_STACK_FORK=y:
	Run times: 159.31 157.34 156.71 158.15 160.81
	Mean: 158.46
	Std Dev: 1.46

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/Kconfig                | 8 ++++++++
 include/linux/thread_info.h | 4 +++-
 kernel/fork.c               | 2 +-
 3 files changed, 12 insertions(+), 2 deletions(-)

Comments

Michal Hocko Jan. 17, 2018, 9:17 a.m. UTC | #1
On Tue 16-01-18 21:50:15, Kees Cook wrote:
> One of the classes of kernel stack content leaks is exposing the contents
> of prior heap or stack contents when a new process stack is allocated.
> Normally, those stacks are not zeroed, and the old contents remain in
> place. With some types of stack content exposure flaws, those contents
> can leak to userspace. Kernels built with CONFIG_CLEAR_STACK_FORK will
> no longer be vulnerable to this, as the stack will be wiped each time
> a stack is assigned to a new process. There's not a meaningful change
> in runtime performance; it almost looks like it provides a benefit.

Have you tried something as simple as /bin/true in a loop. kbuild will
certainly amortize few cycles for the clearing and I would expect, most
reasonable applications would do as well. But it would be better to know
the worst case scenario IMHO.

> Performing back-to-back kernel builds before:
> 	Run times: 157.86 157.09 158.90 160.94 160.80
> 	Mean: 159.12
> 	Std Dev: 1.54
> 
> With CONFIG_CLEAR_STACK_FORK=y:
> 	Run times: 159.31 157.34 156.71 158.15 160.81
> 	Mean: 158.46
> 	Std Dev: 1.46
>
> Signed-off-by: Kees Cook <keescook@chromium.org>

The change seems reasonable to me. Although it would be better to extend
on the types of attacks this prevents from, with some examples ideally.
How many attacks of that kind we had in the past and how often they
appear. That might help people to decide whether to deserve few cycles
on each fork. Also the config option sounds rather limiting. Consider
distros, should they enable it just to be on the safe side? This is kind
of generic concern with other hardening options though.
Laura Abbott Jan. 19, 2018, 7:16 p.m. UTC | #2
On 01/17/2018 01:17 AM, Michal Hocko wrote:
> On Tue 16-01-18 21:50:15, Kees Cook wrote:
>> One of the classes of kernel stack content leaks is exposing the contents
>> of prior heap or stack contents when a new process stack is allocated.
>> Normally, those stacks are not zeroed, and the old contents remain in
>> place. With some types of stack content exposure flaws, those contents
>> can leak to userspace. Kernels built with CONFIG_CLEAR_STACK_FORK will
>> no longer be vulnerable to this, as the stack will be wiped each time
>> a stack is assigned to a new process. There's not a meaningful change
>> in runtime performance; it almost looks like it provides a benefit.
> 
> Have you tried something as simple as /bin/true in a loop. kbuild will
> certainly amortize few cycles for the clearing and I would expect, most
> reasonable applications would do as well. But it would be better to know
> the worst case scenario IMHO.
> 

I tried /bin/true in a loop in my QEMU setup and didn't see a difference
there.

>> Performing back-to-back kernel builds before:
>> 	Run times: 157.86 157.09 158.90 160.94 160.80
>> 	Mean: 159.12
>> 	Std Dev: 1.54
>>
>> With CONFIG_CLEAR_STACK_FORK=y:
>> 	Run times: 159.31 157.34 156.71 158.15 160.81
>> 	Mean: 158.46
>> 	Std Dev: 1.46
>>
>> Signed-off-by: Kees Cook <keescook@chromium.org>
> 
> The change seems reasonable to me. Although it would be better to extend
> on the types of attacks this prevents from, with some examples ideally.
> How many attacks of that kind we had in the past and how often they
> appear. That might help people to decide whether to deserve few cycles
> on each fork. Also the config option sounds rather limiting. Consider
> distros, should they enable it just to be on the safe side? This is kind
> of generic concern with other hardening options though.
> 

Agreed this could use a few more words, but it looks good to me overall.

Thanks,
Laura
Jiri Kosina Jan. 26, 2018, 10:01 p.m. UTC | #3
On Tue, 16 Jan 2018, Kees Cook wrote:

> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 34f053a150a9..091f53fe31cc 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -43,7 +43,9 @@ enum {
>  #define THREAD_ALIGN	THREAD_SIZE
>  #endif
>  
> -#if IS_ENABLED(CONFIG_DEBUG_STACK_USAGE) || IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
> +#if IS_ENABLED(CONFIG_DEBUG_STACK_USAGE) || \
> +    IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) || \
> +    IS_ENABLED(CONFIG_CLEAR_STACK_FORK)
>  # define THREADINFO_GFP		(GFP_KERNEL_ACCOUNT | __GFP_ZERO)
>  #else
>  # define THREADINFO_GFP		(GFP_KERNEL_ACCOUNT)
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 2295fc69717f..215b1ce2b2cd 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -215,7 +215,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
>  		if (!s)
>  			continue;
>  
> -#ifdef CONFIG_DEBUG_KMEMLEAK
> +#if IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) || IS_ENABLED(CONFIG_CLEAR_STACK_FORK)
>  		/* Clear stale pointers from reused stack. */
>  		memset(s->addr, 0, THREAD_SIZE);
>  #endif

Is there any good reason not to do it symmetricaly also for non-vmapped 
stacks? (by passing __GFP_ZERO to alloc_pages_node())?

Thanks,
Jiri Kosina Jan. 26, 2018, 10:31 p.m. UTC | #4
On Fri, 26 Jan 2018, Jiri Kosina wrote:

> > diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> > index 34f053a150a9..091f53fe31cc 100644
> > --- a/include/linux/thread_info.h
> > +++ b/include/linux/thread_info.h
> > @@ -43,7 +43,9 @@ enum {
> >  #define THREAD_ALIGN	THREAD_SIZE
> >  #endif
> >  
> > -#if IS_ENABLED(CONFIG_DEBUG_STACK_USAGE) || IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
> > +#if IS_ENABLED(CONFIG_DEBUG_STACK_USAGE) || \
> > +    IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) || \
> > +    IS_ENABLED(CONFIG_CLEAR_STACK_FORK)
> >  # define THREADINFO_GFP		(GFP_KERNEL_ACCOUNT | __GFP_ZERO)
> >  #else
> >  # define THREADINFO_GFP		(GFP_KERNEL_ACCOUNT)
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 2295fc69717f..215b1ce2b2cd 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -215,7 +215,7 @@ static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
> >  		if (!s)
> >  			continue;
> >  
> > -#ifdef CONFIG_DEBUG_KMEMLEAK
> > +#if IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) || IS_ENABLED(CONFIG_CLEAR_STACK_FORK)
> >  		/* Clear stale pointers from reused stack. */
> >  		memset(s->addr, 0, THREAD_SIZE);
> >  #endif
> 
> Is there any good reason not to do it symmetricaly also for non-vmapped 
> stacks? (by passing __GFP_ZERO to alloc_pages_node())?

Ah, of course you already do by extending THREADINFO_GFP, sorry for the 
noise.
Andrew Morton Feb. 21, 2018, 12:31 a.m. UTC | #5
On Tue, 16 Jan 2018 21:50:15 -0800 Kees Cook <keescook@chromium.org> wrote:

> One of the classes of kernel stack content leaks is exposing the contents
> of prior heap or stack contents when a new process stack is allocated.
> Normally, those stacks are not zeroed, and the old contents remain in
> place. With some types of stack content exposure flaws, those contents
> can leak to userspace. Kernels built with CONFIG_CLEAR_STACK_FORK will
> no longer be vulnerable to this, as the stack will be wiped each time
> a stack is assigned to a new process. There's not a meaningful change
> in runtime performance; it almost looks like it provides a benefit.
> 
> Performing back-to-back kernel builds before:
> 	Run times: 157.86 157.09 158.90 160.94 160.80
> 	Mean: 159.12
> 	Std Dev: 1.54
> 
> With CONFIG_CLEAR_STACK_FORK=y:
> 	Run times: 159.31 157.34 156.71 158.15 160.81
> 	Mean: 158.46
> 	Std Dev: 1.46
> 
> ...
>
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -904,6 +904,14 @@ config VMAP_STACK
>  	  the stack to map directly to the KASAN shadow map using a formula
>  	  that is incorrect if the stack is in vmalloc space.
>  
> +config CLEAR_STACK_FORK
> +	bool "Clear the kernel stack at each fork"
> +	help
> +	  To resist stack content leak flaws, this clears newly allocated
> +	  kernel stacks to keep previously freed heap or stack contents
> +	  from being present in the new stack. This has almost no
> +	  measurable performance impact.
> +

It would be much nicer to be able to control this at runtime rather
than compile-time.  Why not a /proc tunable?  We could always use more
of those ;)
Andy Lutomirski Feb. 21, 2018, 1:56 a.m. UTC | #6
On Wed, Feb 21, 2018 at 12:31 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 16 Jan 2018 21:50:15 -0800 Kees Cook <keescook@chromium.org> wrote:
>
>> One of the classes of kernel stack content leaks is exposing the contents
>> of prior heap or stack contents when a new process stack is allocated.
>> Normally, those stacks are not zeroed, and the old contents remain in
>> place. With some types of stack content exposure flaws, those contents
>> can leak to userspace. Kernels built with CONFIG_CLEAR_STACK_FORK will
>> no longer be vulnerable to this, as the stack will be wiped each time
>> a stack is assigned to a new process. There's not a meaningful change
>> in runtime performance; it almost looks like it provides a benefit.
>>
>> Performing back-to-back kernel builds before:
>>       Run times: 157.86 157.09 158.90 160.94 160.80
>>       Mean: 159.12
>>       Std Dev: 1.54
>>
>> With CONFIG_CLEAR_STACK_FORK=y:
>>       Run times: 159.31 157.34 156.71 158.15 160.81
>>       Mean: 158.46
>>       Std Dev: 1.46
>>
>> ...
>>
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -904,6 +904,14 @@ config VMAP_STACK
>>         the stack to map directly to the KASAN shadow map using a formula
>>         that is incorrect if the stack is in vmalloc space.
>>
>> +config CLEAR_STACK_FORK
>> +     bool "Clear the kernel stack at each fork"
>> +     help
>> +       To resist stack content leak flaws, this clears newly allocated
>> +       kernel stacks to keep previously freed heap or stack contents
>> +       from being present in the new stack. This has almost no
>> +       measurable performance impact.
>> +
>
> It would be much nicer to be able to control this at runtime rather
> than compile-time.  Why not a /proc tunable?  We could always use more
> of those ;)

/proc/sys/kernel/hardening_features_that_cost_essentially_nothing?

Seriously, though, why don't we just enable it unconditionally?  It
wouldn't surprise me if it really is a speedup on more workloads than
it slows down -- it'll fill the kernel stack into the CPU cache with
exclusive ownership very quickly (streamily and without actually
reading from memory, I imagine, at least on new enough CPUs) rather
than grabbing each cache line one by one as they get used.
Mel Gorman Feb. 22, 2018, 9:47 a.m. UTC | #7
On Wed, Feb 21, 2018 at 01:56:33AM +0000, Andrew Lutomirski wrote:
> > It would be much nicer to be able to control this at runtime rather
> > than compile-time.  Why not a /proc tunable?  We could always use more
> > of those ;)
> 
> /proc/sys/kernel/hardening_features_that_cost_essentially_nothing?
> 
> Seriously, though, why don't we just enable it unconditionally?  It
> wouldn't surprise me if it really is a speedup on more workloads than
> it slows down -- it'll fill the kernel stack into the CPU cache with
> exclusive ownership very quickly (streamily and without actually
> reading from memory, I imagine, at least on new enough CPUs) rather
> than grabbing each cache line one by one as they get used.

Note that this is not unconditionally true, it depends on the calling
context that clears the page. If this is during fork, then the parent may
be doing the clear (I didn't check) which means it's quite likely when the
child wakes for the first time that it will not necessary wake on the same
CPU. Up until recently on NUMA machines, the child was almost guaranteed
to be running on a remote node (mitigated in tip for sched now).

I'm not claiming I've measured the overhead of this, just pointing out that
"cache hotness" may actually result in double the cache line bounces --
first clear, then write on first wake. If only zeroing pages was a bit
faster :/
diff mbox

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index 400b9e1b2f27..42d56dad03ec 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -904,6 +904,14 @@  config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config CLEAR_STACK_FORK
+	bool "Clear the kernel stack at each fork"
+	help
+	  To resist stack content leak flaws, this clears newly allocated
+	  kernel stacks to keep previously freed heap or stack contents
+	  from being present in the new stack. This has almost no
+	  measurable performance impact.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 34f053a150a9..091f53fe31cc 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -43,7 +43,9 @@  enum {
 #define THREAD_ALIGN	THREAD_SIZE
 #endif
 
-#if IS_ENABLED(CONFIG_DEBUG_STACK_USAGE) || IS_ENABLED(CONFIG_DEBUG_KMEMLEAK)
+#if IS_ENABLED(CONFIG_DEBUG_STACK_USAGE) || \
+    IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) || \
+    IS_ENABLED(CONFIG_CLEAR_STACK_FORK)
 # define THREADINFO_GFP		(GFP_KERNEL_ACCOUNT | __GFP_ZERO)
 #else
 # define THREADINFO_GFP		(GFP_KERNEL_ACCOUNT)
diff --git a/kernel/fork.c b/kernel/fork.c
index 2295fc69717f..215b1ce2b2cd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -215,7 +215,7 @@  static unsigned long *alloc_thread_stack_node(struct task_struct *tsk, int node)
 		if (!s)
 			continue;
 
-#ifdef CONFIG_DEBUG_KMEMLEAK
+#if IS_ENABLED(CONFIG_DEBUG_KMEMLEAK) || IS_ENABLED(CONFIG_CLEAR_STACK_FORK)
 		/* Clear stale pointers from reused stack. */
 		memset(s->addr, 0, THREAD_SIZE);
 #endif