[v7,4/6] x86/entry: Enable random_kstack_offset support

Message ID 20210319212835.3928492-5-keescook@chromium.org (mailing list archive)
State New, archived
Series Optionally randomize kernel stack offset each syscall

Commit Message

Kees Cook March 19, 2021, 9:28 p.m. UTC
Allow for a randomized stack offset on a per-syscall basis, with roughly
5-6 bits of entropy, depending on compiler and word size. Since the
method of offsetting uses macros, this cannot live in the common entry
code (the stack offset needs to be retained for the life of the syscall,
which means it needs to happen at the actual entry point).

Signed-off-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig                    | 1 +
 arch/x86/entry/common.c             | 3 +++
 arch/x86/include/asm/entry-common.h | 8 ++++++++
 3 files changed, 12 insertions(+)
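
In outline, the hooks land as follows; the sketch below is condensed
from the full diff at the bottom of this page, with the surrounding
code trimmed:

/* arch/x86/entry/common.c */
__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
	add_random_kstack_offset();	/* consume the offset chosen at the previous exit */
	nr = syscall_enter_from_user_mode(regs, nr);
	/* ... dispatch the syscall ... */
}

/* arch/x86/include/asm/entry-common.h */
static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
						  unsigned long ti_work)
{
	/* ... existing exit work ... */
	choose_random_kstack_offset(rdtsc() & 0xFF);	/* pick the next offset */
}

The same add_random_kstack_offset() call is placed in the two 32-bit
entry points as well, so every syscall entry consumes an offset and
every return to userspace refreshes it.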

Comments

Ingo Molnar March 20, 2021, 11:58 a.m. UTC | #1
* Kees Cook <keescook@chromium.org> wrote:

> Allow for a randomized stack offset on a per-syscall basis, with roughly
> 5-6 bits of entropy, depending on compiler and word size. Since the
> method of offsetting uses macros, this cannot live in the common entry
> code (the stack offset needs to be retained for the life of the syscall,
> which means it needs to happen at the actual entry point).

>  __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
>  {
> +	add_random_kstack_offset();
>  	nr = syscall_enter_from_user_mode(regs, nr);

> @@ -83,6 +84,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
>  {
>  	unsigned int nr = syscall_32_enter(regs);
>  
> +	add_random_kstack_offset();

>  	unsigned int nr = syscall_32_enter(regs);
>  	int res;
>  
> +	add_random_kstack_offset();

> @@ -70,6 +71,13 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
>  	 */
>  	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
>  #endif
> +
> +	/*
> +	 * x86_64 stack alignment means 3 bits are ignored, so keep
> +	 * the top 5 bits. x86_32 needs only 2 bits of alignment, so
> +	 * the top 6 bits will be used.
> +	 */
> +	choose_random_kstack_offset(rdtsc() & 0xFF);
>  }

1)

Wondering why the calculation of the kstack offset (which happens in 
every syscall) is separated from the entry-time logic and happens 
during return to user-space?

The two methods:

+#define add_random_kstack_offset() do {                                        \
+       if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
+                               &randomize_kstack_offset)) {            \
+               u32 offset = this_cpu_read(kstack_offset);              \
+               u8 *ptr = __builtin_alloca(offset & 0x3FF);             \
+               asm volatile("" : "=m"(*ptr) :: "memory");              \
+       }                                                               \
+} while (0)
+
+#define choose_random_kstack_offset(rand) do {                         \
+       if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
+                               &randomize_kstack_offset)) {            \
+               u32 offset = this_cpu_read(kstack_offset);              \
+               offset ^= (rand);                                       \
+               this_cpu_write(kstack_offset, offset);                  \
+       }                                                               \
+} while (0)

choose_random_kstack_offset() basically calculates the offset and 
stores it in a percpu variable (mixing it with the previous offset 
value), while add_random_kstack_offset() uses it for an alloca() 
dynamic stack allocation.

Wouldn't it be (slightly) lower combined overhead to just do it in a 
single step? There would be duplication across the 3 syscall entry 
points, but that should be marginal, as the code looks small and the 
entry points would probably be cache-hot.
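
For concreteness, a single-step variant might look something like the
sketch below (hypothetical, reusing the constructs of the two macros
above; not something the patch proposes):

#define add_random_kstack_offset_oneshot(rand) do {			\
	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
				&randomize_kstack_offset)) {		\
		/* sample and consume the randomness in one place */	\
		u32 offset = (rand) & 0x3FF;				\
		u8 *ptr = __builtin_alloca(offset);			\
		asm volatile("" : "=m"(*ptr) :: "memory");		\
	}								\
} while (0)

This removes the percpu read/write pair, at the cost of sampling the
randomness at syscall entry time (see the reply below for why the
patch avoids that).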

2)

Another detail I noticed: add_random_kstack_offset() limits the offset 
to 0x3ff, i.e. 1k (10 bits).

But the RDTSC mask is 0xff, 8 bits:

+       /*
+        * x86_64 stack alignment means 3 bits are ignored, so keep
+        * the top 5 bits. x86_32 needs only 2 bits of alignment, so
+        * the top 6 bits will be used.
+        */
+       choose_random_kstack_offset(rdtsc() & 0xFF);

alloca() itself works in byte units and will round the allocation to 8 
bytes on x86-64 and to 4 bytes on x86-32; that is what the 'ignored 
bits' in the comment refer to, right?

Why is there a 0x3ff mask for the alloca() call and a 0xff mask to the 
RDTSC randomizing value? Shouldn't the two be synced up? Or was the 
intention to shift the RDTSC value to the left by 3 bits?

3)

Finally, kstack_offset is a percpu variable:

  #ifdef CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
  ...
  DEFINE_PER_CPU(u32, kstack_offset);

This is inherited across tasks on scheduling, and new syscalls will 
mix in new RDTSC values to continue to randomize the offset.

Wouldn't it make sense to further mix values into this across context 
switching boundaries? A really inexpensive way would be to take the 
kernel stack value and mix it into the offset, and maybe even the 
randomized t->stack_canary value?

This would further isolate the syscall kernel stack offsets of 
separate execution contexts from each other, should an attacker find a 
way to estimate or influence likely RDTSC values.
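
A rough illustration of the kind of mixing meant here; placement and
naming are hypothetical, not a worked-out patch:

/* Hypothetical: fold per-task values into the percpu offset at switch. */
static inline void mix_kstack_offset(struct task_struct *next)
{
	u32 offset = this_cpu_read(kstack_offset);

	/* mix in bits of the incoming task's stack base and canary */
	offset ^= (u32)(unsigned long)task_stack_page(next);
	offset ^= (u32)next->stack_canary;
	this_cpu_write(kstack_offset, offset);
}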

Thanks,

	Ingo
Kees Cook March 21, 2021, 5:03 p.m. UTC | #2
On Sat, Mar 20, 2021 at 12:58:20PM +0100, Ingo Molnar wrote:
> 
> * Kees Cook <keescook@chromium.org> wrote:
> 
> > Allow for a randomized stack offset on a per-syscall basis, with roughly
> > 5-6 bits of entropy, depending on compiler and word size. Since the
> > method of offsetting uses macros, this cannot live in the common entry
> > code (the stack offset needs to be retained for the life of the syscall,
> > which means it needs to happen at the actual entry point).
> 
> >  __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
> >  {
> > +	add_random_kstack_offset();
> >  	nr = syscall_enter_from_user_mode(regs, nr);
> 
> > @@ -83,6 +84,7 @@ __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
> >  {
> >  	unsigned int nr = syscall_32_enter(regs);
> >  
> > +	add_random_kstack_offset();
> 
> >  	unsigned int nr = syscall_32_enter(regs);
> >  	int res;
> >  
> > +	add_random_kstack_offset();
> 
> > @@ -70,6 +71,13 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
> >  	 */
> >  	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
> >  #endif
> > +
> > +	/*
> > +	 * x86_64 stack alignment means 3 bits are ignored, so keep
> > +	 * the top 5 bits. x86_32 needs only 2 bits of alignment, so
> > +	 * the top 6 bits will be used.
> > +	 */
> > +	choose_random_kstack_offset(rdtsc() & 0xFF);
> >  }
> 
> 1)
> 
> Wondering why the calculation of the kstack offset (which happens in 
> every syscall) is separated from the entry-time logic and happens 
> during return to user-space?
> 
> The two methods:
> 
> +#define add_random_kstack_offset() do {                                        \
> +       if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> +                               &randomize_kstack_offset)) {            \
> +               u32 offset = this_cpu_read(kstack_offset);              \
> +               u8 *ptr = __builtin_alloca(offset & 0x3FF);             \
> +               asm volatile("" : "=m"(*ptr) :: "memory");              \
> +       }                                                               \
> +} while (0)
> +
> +#define choose_random_kstack_offset(rand) do {                         \
> +       if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> +                               &randomize_kstack_offset)) {            \
> +               u32 offset = this_cpu_read(kstack_offset);              \
> +               offset ^= (rand);                                       \
> +               this_cpu_write(kstack_offset, offset);                  \
> +       }                                                               \
> +} while (0)
> 
> choose_random_kstack_offset() basically calculates the offset and 
> stores it in a percpu variable (mixing it with the previous offset 
> value), while add_random_kstack_offset() uses it for an alloca() 
> dynamic stack allocation.
> 
> Wouldn't it be (slightly) lower combined overhead to just do it in a 
> single step? There would be duplication across the 3 syscall entry 
> points, but that should be marginal, as the code looks small and the 
> entry points would probably be cache-hot.

In earlier threads it was pointed out that one way to make things less
predictable was to do the calculation at the end of a syscall, so that
it is more distant from the entry from userspace (the thinking being
that things like rdtsc are more predictable by userspace if the sample
always happens X cycles after entering a syscall). Additionally, using
a percpu variable means the chosen values aren't tied to a process,
making even "short" syscalls (e.g. getpid) less predictable, because an
attacker would need to have pinned the process to a single CPU, etc.

I can include these details more explicitly in the next change log, if
you think that makes sense?

> 2)
> 
> Another detail I noticed: add_random_kstack_offset() limits the offset 
> to 0x3ff, i.e. 1k (10 bits).
> 
> But the RDTSC mask is 0xff, 8 bits:
> 
> +       /*
> +        * x86_64 stack alignment means 3 bits are ignored, so keep
> +        * the top 5 bits. x86_32 needs only 2 bits of alignment, so
> +        * the top 6 bits will be used.
> +        */
> +       choose_random_kstack_offset(rdtsc() & 0xFF);
> 
> alloca() itself works in byte units and will round the allocation to 8 
> bytes on x86-64 and to 4 bytes on x86-32; that is what the 'ignored 
> bits' in the comment refer to, right?
> 
> Why is there a 0x3ff mask for the alloca() call and a 0xff mask to the 
> RDTSC randomizing value? Shouldn't the two be synced up? Or was the 
> intention to shift the RDTSC value to the left by 3 bits?

Yes, it's intentional -- the 0x3ff limit is there to make sure the
alloca has a hard upper bound, and the 8-bit mask is there to let the
compiler choose how much of those 8 bits it wants to throw away to
stack alignment. The limit of "at most 8 bits" (really 5) is a middle
ground: it raises unpredictability without shrinking the available
stack space too much. (Note that arm64's alignment has to tweak this by
1 more bit, so it masks with 0x1ff.)

I could attempt to adjust the comments to reflect these considerations.
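
To make the mask interaction concrete, here is a small userspace model
of the arithmetic (illustrative only; the names merely mirror the
kernel macros):

#include <stdint.h>
#include <stdio.h>

static uint32_t kstack_offset;	/* stands in for the percpu variable */

/* models choose_random_kstack_offset(rdtsc() & 0xFF) */
static void choose(uint32_t tsc)
{
	kstack_offset ^= tsc & 0xFF;	/* per-arch mask: at most 8 bits mixed in */
}

/* models what the alloca() in add_random_kstack_offset() effectively sees */
static uint32_t applied(void)
{
	uint32_t offset = kstack_offset & 0x3FF;	/* hard upper bound: < 1024 bytes */

	return offset & ~0x7u;	/* x86_64: 8-byte stack alignment drops the low 3 bits */
}

int main(void)
{
	choose(0x1234a7u);	/* pretend TSC sample */
	/* 256 masked inputs / 8-byte granularity = 32 distinct offsets, ~5 bits */
	printf("applied offset: %u bytes\n", applied());
	return 0;
}

On x86_32 the alignment is 4 bytes instead of 8, so the same 8-bit
input yields 64 distinct offsets, i.e. the ~6 bits mentioned in the
commit log.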

> 3)
> 
> Finally, kstack_offset is a percpu variable:
> 
>   #ifdef CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
>   ...
>   DEFINE_PER_CPU(u32, kstack_offset);
> 
> This is inherited across tasks on scheduling, and new syscalls will 
> mix in new RDTSC values to continue to randomize the offset.
> 
> Wouldn't it make sense to further mix values into this across context 
> switching boundaries? A really inexpensive way would be to take the 
> kernel stack value and mix it into the offset, and maybe even the 
> randomized t->stack_canary value?
> 
> This would further isolate the syscall kernel stack offsets of 
> separate execution contexts from each other, should an attacker find a 
> way to estimate or influence likely RDTSC values.

I think this was discussed at some point too, though my search-foo is
failing me. I'm open to adding this to the mix, though care is needed
for both stack address and stack canary values, since they both have
specific structure (i.e. high and low bits of address are "known", and
low bits of canary are always zero) and we don't want to run the risk
of entangling secret values: if one can be exposed, does the entangling
expose the others?

I've had to deal with both of these issues with the slab freelist pointer
obfuscation, so my instinct here is to avoid mixing the other values in.
I'm open to improving it, of course, but I think rdtsc is a good first
step.
Thomas Gleixner March 28, 2021, 2:18 p.m. UTC | #3
On Fri, Mar 19 2021 at 14:28, Kees Cook wrote:
> +
> +	/*
> +	 * x86_64 stack alignment means 3 bits are ignored, so keep
> +	 * the top 5 bits. x86_32 needs only 2 bits of alignment, so
> +	 * the top 6 bits will be used.
> +	 */
> +	choose_random_kstack_offset(rdtsc() & 0xFF);

Comment mumbles about 5/6 bits and the TSC value is masked with 0xFF and
then the applied offset is itself limited with 0x3FF.

Too many moving parts for someone who does not have the details of all
this memorized.

Thanks,

        tglx
Kees Cook March 29, 2021, 6:43 p.m. UTC | #4
On Sun, Mar 28, 2021 at 04:18:56PM +0200, Thomas Gleixner wrote:
> On Fri, Mar 19 2021 at 14:28, Kees Cook wrote:
> > +
> > +	/*
> > +	 * x86_64 stack alignment means 3 bits are ignored, so keep
> > +	 * the top 5 bits. x86_32 needs only 2 bits of alignment, so
> > +	 * the top 6 bits will be used.
> > +	 */
> > +	choose_random_kstack_offset(rdtsc() & 0xFF);
> 
> Comment mumbles about 5/6 bits and the TSC value is masked with 0xFF and
> then the applied offset is itself limited with 0x3FF.
> 
> Too many moving parts for someone who does not have the details of all
> this memorized.

Each piece is intentional -- I will improve the comments to explain
each level of masking happening (implicit compiler stack alignment mask,
explicit per-arch mask, and the VLA upper-bound protection mask).
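
One possible shape for such an expanded comment (wording is only a
sketch of what a later revision might say):

	/*
	 * Three masks are in play here:
	 *  - rdtsc() & 0xFF: the per-arch choice to feed in at most 8 bits,
	 *    bounding how much stack the offset can ever consume;
	 *  - the 0x3FF clamp inside add_random_kstack_offset(): a hard
	 *    upper bound on the size handed to the alloca()/VLA;
	 *  - the compiler's implicit stack alignment, which discards the
	 *    low 3 bits on x86_64 (2 bits on x86_32), leaving roughly 5
	 *    (or 6) bits of effective entropy.
	 */
	choose_random_kstack_offset(rdtsc() & 0xFF);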

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..4b4ad8ec10d2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -165,6 +165,7 @@  config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
 	select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_ASM_MODVERSIONS
 	select HAVE_CMPXCHG_DOUBLE
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index a2433ae8a65e..810983d7c26f 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -38,6 +38,7 @@ 
 #ifdef CONFIG_X86_64
 __visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 {
+	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
 
 	instrumentation_begin();
@@ -83,6 +84,7 @@  __visible noinstr void do_int80_syscall_32(struct pt_regs *regs)
 {
 	unsigned int nr = syscall_32_enter(regs);
 
+	add_random_kstack_offset();
 	/*
 	 * Subtlety here: if ptrace pokes something larger than 2^32-1 into
 	 * orig_ax, the unsigned int return value truncates it.  This may
@@ -102,6 +104,7 @@  static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
 	unsigned int nr = syscall_32_enter(regs);
 	int res;
 
+	add_random_kstack_offset();
 	/*
 	 * This cannot use syscall_enter_from_user_mode() as it has to
 	 * fetch EBP before invoking any of the syscall entry work
diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 2b87b191b3b8..8e41566e154a 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -2,6 +2,7 @@ 
 #ifndef _ASM_X86_ENTRY_COMMON_H
 #define _ASM_X86_ENTRY_COMMON_H
 
+#include <linux/randomize_kstack.h>
 #include <linux/user-return-notifier.h>
 
 #include <asm/nospec-branch.h>
@@ -70,6 +71,13 @@  static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 	 */
 	current_thread_info()->status &= ~(TS_COMPAT | TS_I386_REGS_POKED);
 #endif
+
+	/*
+	 * x86_64 stack alignment means 3 bits are ignored, so keep
+	 * the top 5 bits. x86_32 needs only 2 bits of alignment, so
+	 * the top 6 bits will be used.
+	 */
+	choose_random_kstack_offset(rdtsc() & 0xFF);
 }
 #define arch_exit_to_user_mode_prepare arch_exit_to_user_mode_prepare