[RFC] x86/entry/64: randomize kernel stack offset upon system call

Message ID 1549628149-11881-2-git-send-email-elena.reshetova@intel.com

Commit Message

Reshetova, Elena Feb. 8, 2019, 12:15 p.m. UTC
If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
the kernel stack offset is randomized upon each
exit from a system call via the trampoline stack.

This feature is based on the original idea from
PaX's RANDKSTACK feature:
https://pax.grsecurity.net/docs/randkstack.txt
All credit for the original idea goes to the PaX team.
However, the implementation of RANDOMIZE_KSTACK_OFFSET
differs greatly from the RANDKSTACK feature (see below).

Reasoning for the feature:

This feature should make various stack-based attacks
considerably harder, in particular those based upon
overflowing a kernel stack into an adjacent kernel stack
with a possibility to jump over a guard page.
Since the stack offset is randomized upon each
system call, it is very hard for an attacker to reliably
land in any particular place on the adjacent stack.

Design description:

During most of the kernel's execution, it runs on the "thread
stack", which is allocated at fork.c/dup_task_struct() and stored in
a per-task variable (tsk->stack). Since the stack grows downwards,
the stack top can always be calculated using the task_top_of_stack(tsk)
function, which essentially returns the address of tsk->stack + stack
size. When VMAP_STACK is enabled, the thread stack is allocated from
vmalloc space.

The thread stack is pretty deterministic in its structure: it is
fixed in size, and upon every syscall entry from userspace to the
kernel, the thread stack starts to be constructed from an address
fetched from the per-cpu cpu_current_top_of_stack variable.
This variable is required since there is no way to reference "current"
from the kernel entry/exit code, so the value of task_top_of_stack(tsk)
is "shadowed" in a per-cpu variable each time the kernel context
switches to a new task.

The RANDOMIZE_KSTACK_OFFSET feature works by randomizing the value of
task_top_of_stack(tsk) every time a process exits from a syscall. As
a result, the thread stack for that process will be constructed at a
random offset from the fixed tsk->stack + stack size value upon the
subsequent syscall.

Since the kernel is always exited (via IRET / SYSRET) from a per-cpu
"trampoline stack", this stack provides a safe place for modifying the
value of cpu_current_top_of_stack, because the thread stack is no
longer in use at that point.

There is only one small issue: currently the thread stack top is never
stored in a per-task variable, but always calculated as needed
via task_top_of_stack(tsk) and the existing tsk->stack value (essentially
relying on its fixed size and structure). So we need to create a new
per-task variable, tsk->stack_start, that stores the newly calculated
random value for the thread stack top. Together with the value of
cpu_current_top_of_stack, tsk->stack_start is also updated when
leaving kernel space from the trampoline stack, so that it can be
used by the scheduler to correctly "shadow" cpu_current_top_of_stack
upon a task switch.

Impact on the kernel thread stack size:

Since the current version does not allocate any additional pages
for the thread stack, it shifts the cpu_current_top_of_stack value
randomly between 0x000 and 0xFF0 (or 0x00 and 0xF0 if only 4 bits
are randomized). So, in the worst case (random offset 0xFF0/0xF0),
the actual usable stack size is 12304/16144 bytes.

Performance impact:

All measurements were done on an Intel Kaby Lake i7-8550U with 16GB RAM.

1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
    base:           Time: 12.243
    random_offset:  Time: 13.411

2) kernel build time (as one example of real-world load):
    base:           user  299m20,348s;  sys   21m39,047s
    random_offset:  user  300m19,759s;  sys   20m48,173s

3) perf on a fopen/fclose loop run 1000000000 times:
(the perf values below still differ somewhat between
runs, so I don't consider them very representative,
apart from obviously showing the big impact of
using get_random_u64())

    base:
     8.46%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
     4.77%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
     4.14%  time     [kernel.kallsyms]  [k] fsnotify
     3.94%  time     [kernel.kallsyms]  [k] _raw_spin_lock
     2.48%  time     [kernel.kallsyms]  [k] syscall_return_via_sysret
     2.42%  time     [kernel.kallsyms]  [k] entry_SYSCALL_64
     2.28%  time     [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.07%  time     [kernel.kallsyms]  [k] inotify_handle_event

    random_offset:
     8.35%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
     5.61%  time     [kernel.kallsyms]  [k] get_random_u64
     4.88%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
     3.08%  time     [kernel.kallsyms]  [k] _raw_spin_lock
     2.98%  time     [kernel.kallsyms]  [k] fsnotify
     2.73%  time     [kernel.kallsyms]  [k] syscall_return_via_sysret
     2.45%  time     [kernel.kallsyms]  [k] entry_SYSCALL_64
     1.87%  time     [kernel.kallsyms]  [k] __ext4_get_inode_loc
     1.65%  time     [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

Comparison to grsecurity RANDKSTACK feature:

The basic idea is taken from RANDKSTACK: randomization of the
cpu_current_top_of_stack value is performed within the existing 4
pages of memory allocated for the thread stack.
No additional pages are allocated.

This patch introduces 8 bits of randomness (bits 4-11 are
randomized; bits 0-3 must be zero due to stack alignment)
to the kernel stack top.
The very old grsecurity patch I checked had only 4 bits
of randomization for x86-64. This patch works with that
little randomness as well; we only have to decide how much
stack space we wish to (or can) trade for security.

  Notable differences from RANDKSTACK:

  - x86_64 only, since this does not make sense
    without vmap-based stack allocation, which provides
    guard pages, and the latter is only implemented for x86-64.
  - randomization is performed on the trampoline stack upon
    system call exit.
  - random bits are taken from get_random_long() instead of
    rdtsc() for better randomness. This, however, has a big
    performance impact (see the numbers above), and additionally,
    if we happen to hit a point when the generator needs to be
    reseeded, we might have an issue. An alternative would be to
    make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
    which can solve some of these issues, but I doubt all of them.
    Of course, rdtsc() can be a fallback if there is no way to
    make calls for proper randomness from the trampoline stack.
  - instead of storing the actual top of the stack in
    task->thread.sp0 (which does not exist on x86-64), a
    new unsigned long variable, stack_start, is created in
    the task struct, and key stack functions, like task_pt_regs,
    are updated to use it when available.
  - instead of preserving only the set of registers that are
    used within the randomization function, the current
    version uses the PUSH_AND_CLEAR_REGS/POP_REGS combination,
    similar to STACKLEAK. It would seem that we could
    get away with only preserving rax, rdx, rbx, rsi and rdi,
    but I am not sure how stable this is in the long run.

Future work possibilities:

  - One could do a version where we allocate an
    additional page for each kernel stack and then employ
    proper randomization. This could be a stricter config
    option, for example.
  - Alternatively, one could normally allocate only 4 pages
    of stack and allocate an additional page if the stack
    + randomized offset grows beyond 4 pages (this only
    happens for deep call chains).

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
---
 arch/Kconfig                     | 15 +++++++++++++++
 arch/x86/Kconfig                 |  1 +
 arch/x86/entry/calling.h         |  8 ++++++++
 arch/x86/entry/common.c          | 21 +++++++++++++++++++++
 arch/x86/entry/entry_64.S        |  4 ++++
 arch/x86/include/asm/processor.h | 15 ++++++++++++---
 arch/x86/kernel/dumpstack.c      |  2 +-
 arch/x86/kernel/irq_64.c         |  2 +-
 arch/x86/kernel/process.c        |  2 +-
 include/linux/sched.h            |  3 +++
 include/linux/sched/task_stack.h | 18 +++++++++++++++++-
 kernel/fork.c                    | 10 ++++++++++
 mm/kmemleak.c                    |  2 +-
 mm/usercopy.c                    |  2 +-
 14 files changed, 96 insertions(+), 9 deletions(-)

Comments

Peter Zijlstra Feb. 8, 2019, 1:05 p.m. UTC | #1
On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> 1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
>     base:           Time: 12.243
>     random_offset:  Time: 13.411

>     base:
>      8.46%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
>      4.77%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
>      4.14%  time     [kernel.kallsyms]  [k] fsnotify
> 
>     random_offset:
>      8.35%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
>      5.61%  time     [kernel.kallsyms]  [k] get_random_u64
>      4.88%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty

*ouch*

>   Notable differences from RANDKSTACK:

>   - random bits are taken from get_random_long() instead of
>     rdtsc() for a better randomness. This however has a big
>     performance impact (see above the numbers) and additionally
>     if we happen to hit a point when a generator needs to be
>     reseeded, we might have an issue. Alternatives can be to
>     make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
>     which can solve some issues, but I doubt that all of them.
>     Of course rdtsc() can be a fallback if there is no way to
>     make calls for a proper randomness from the trampoline stack.

http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.html

That would seem to suggest that the low bits of rdtsc would in fact be a
fairly good source of random.

Still, doing this on sysexit seems painful at best, syscall performance
matters (and hopefully we'll get rid of meltdown 'soon').

Why can't we change the stack offset periodically from an interrupt or
so, and then have every later entry use that?
Reshetova, Elena Feb. 8, 2019, 1:20 p.m. UTC | #2
> On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > 1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
> >     base:           Time: 12.243
> >     random_offset:  Time: 13.411
> 
> >     base:
> >      8.46%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
> >      4.77%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
> >      4.14%  time     [kernel.kallsyms]  [k] fsnotify
> >
> >     random_offset:
> >      8.35%  time     [kernel.kallsyms]  [k] crc32c_pcl_intel_update
> >      5.61%  time     [kernel.kallsyms]  [k] get_random_u64
> >      4.88%  time     [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
> 
> *ouch*
> 
> >   Notable differences from RANDKSTACK:
> 
> >   - random bits are taken from get_random_long() instead of
> >     rdtsc() for a better randomness. This however has a big
> >     performance impact (see above the numbers) and additionally
> >     if we happen to hit a point when a generator needs to be
> >     reseeded, we might have an issue. Alternatives can be to
> >     make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
> >     which can solve some issues, but I doubt that all of them.
> >     Of course rdtsc() can be a fallback if there is no way to
> >     make calls for a proper randomness from the trampoline stack.
> 
> http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.html

Oh, this seems to be a great write-up, I will read it carefully,
thank you for the pointer!

Yes, I believe RANDKSTACK was using rdtsc() exactly because of the
many issues get_random_* brings, in addition to its horrid performance.
This patch works just fine with rdtsc() as well; I just wanted to show
the *full* randomness option first, with the impact it brings.

> 
> That would seem to suggest that the low bits of rdtsc would in fact be a
> fairly good source of random.
> 
> Still, doing this on sysexit seems painful at best, syscall performance
> matters (and hopefully we'll get rid of meltdown 'soon').

I can measure the impact with rdtsc(); I think it is *very* small.
Btw, what would be a good measurement test?
I am not that happy with just looping on fopen-fclose, too
much noise.

> 
> Why can't we change the stack offset periodically from an interrupt or
> so, and then have every later entry use that.

Hm... This sounds more complex conceptually - we cannot touch the
stack while it is in use, so we have to periodically probe for a
good time (when the process is in userspace, I guess) to change it
from an interrupt? IMO the trampoline stack provides such a good,
clean place for doing it, and we have stackleak there doing stack
cleanup, so it would make sense to keep these features operating
together.

Best Regards,
Elena.
Peter Zijlstra Feb. 8, 2019, 2:26 p.m. UTC | #3
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:

> > 
> > Why can't we change the stack offset periodically from an interrupt or
> > so, and then have every later entry use that.
> 
> Hm... This sounds more complex conceptually - we cannot touch
> stack when it is in use, so we have to periodically probe for a 
> good time (when process is in userspace I guess) to change it from an interrupt?
> IMO trampoline stack provides such a good clean place for doing it and we
> have stackleak there doing stack cleanup, so would make sense to keep
> these features operating together.

The idea was to just change a per-cpu (possibly per-task if you ctxsw
it) offset that is used on entry to offset the stack.

So only entries after the change will have the updated offset, any
in-progress syscalls will continue with their current offset and will be
unaffected.
Peter Zijlstra Feb. 8, 2019, 3:15 p.m. UTC | #4
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:

> I can measure the impact with rdtsc(), I think it is *very* small. 
> Btw, what should be the good measurement test?
> I am not that happy with just looping on fopen-fclose, too
> much noise. 

IIRC amluto had a nice micro-bench for 'pure' syscall overhead
somewhere.

Something like getpid/gettid in a loop should work fairly well, that's a
really light syscall to make.
Andy Lutomirski Feb. 8, 2019, 4:34 p.m. UTC | #5
> On Feb 8, 2019, at 4:15 AM, Elena Reshetova <elena.reshetova@intel.com> wrote:
> 
> If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> the kernel stack offset is randomized upon each
> exit from a system call via the trampoline stack.
> 
> This feature is based on the original idea from
> the PaX's RANDKSTACK feature:
> https://pax.grsecurity.net/docs/randkstack.txt
> All the credits for the original idea goes to the PaX team.
> However, the implementation of RANDOMIZE_KSTACK_OFFSET
> differs greatly from the RANDKSTACK feature (see below).
> 
> Reasoning for the feature:
> 
> This feature should make considerably harder various
> stack-based attacks that are based upon overflowing
> a kernel stack into adjusted kernel stack with a
> possibility to jump over a guard page.
> Since the stack offset is randomized upon each
> system call, it is very hard for attacker to reliably
> land in any particular place on the adjusted stack.
> 

I think we need a better justification. With VLAs gone, it should be statically impossible to overflow past a guard page.

> Design description:
> 
> During most of the kernel's execution, it runs on the "thread
> stack", which is allocated at fork.c/dup_task_struct() and stored in
> a per-task variable (tsk->stack). Since stack is growing downwards,
> the stack top can be always calculated using task_top_of_stack(tsk)
> function, which essentially returns an address of tsk->stack + stack
> size. When VMAP_STACK is enabled, the thread stack is allocated from
> vmalloc space.
> 
> Thread stack is pretty deterministic on its structure - fixed in size,
> and upon every enter from a userspace to kernel on a
> syscall the thread stack is started to be constructed from an
> address fetched from a per-cpu cpu_current_top_of_stack variable.
> This variable is required since there is no way to reference "current"
> from the kernel entry/exit code, so the value of task_top_of_stack(tsk)
> is "shadowed" in a per-cpu variable each time the kernel context
> switches to a new task.
> 
> The RANDOMIZE_KSTACK_OFFSET feature works by randomizing the value of
> task_top_of_stack(tsk) every time a process exits from a syscall. As
> a result the thread stack for that process will be constructed from a
> random offset from a fixed tsk->stack + stack size value upon subsequent
> syscall.
> 

There is a vastly simpler way to do this: leave pt_regs in place and subtract a random multiple of 8 before pushing anything else onto the stack.  This gets most of the benefit with much less complexity.

It’s maybe even better. If you can overflow a stack buffer to rewrite pt_regs, you gain control of all registers. If the stack to pt_regs offset is randomized, then this gets much harder.


>  - random bits are taken from get_random_long() instead of
>    rdtsc() for a better randomness. This however has a big
>    performance impact (see above the numbers) and additionally
>    if we happen to hit a point when a generator needs to be
>    reseeded, we might have an issue.

NAK.  I do not want entry code calling non-arch C code that is not explicitly intended to be used from entry code like this. You may be violating all kinds of rules about context tracking and even stack usage.

Just use ALTERNATIVE to select between RDTSC and RDRAND.  This isn’t used for crypto — refusing to “trust” the CPU here makes no sense.

I think that, if you make these two changes, you’ll have a very straightforward 50-ish line patch.  The only real complication is how to find pt_regs on the way out. I think you could use RBP for this — just make it look like you have a regular frame-pointer-using stack frame between do_syscall_whatever and pt_regs. Josh Poimboeuf can help make sure you get all the unwinding details right. It should be straightforward.
Kees Cook Feb. 8, 2019, 9:28 p.m. UTC | #6
On Fri, Feb 8, 2019 at 4:16 AM Elena Reshetova
<elena.reshetova@intel.com> wrote:
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -37,7 +37,7 @@
>  static noinline int check_stack_object(const void *obj, unsigned long len)
>  {
>         const void * const stack = task_stack_page(current);
> -       const void * const stackend = stack + THREAD_SIZE;
> +       const void * const stackend = (void *)task_top_of_stack(current);
>         int ret;
>
>         /* Object is not on the stack at all. */

It seems like having task_top_of_stack() would be a nice refactoring
to have regardless of this feature (and splitting out would make this
patch a bit easier to read too).
Reshetova, Elena Feb. 9, 2019, 11:13 a.m. UTC | #7
> On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> 
> > >
> > > Why can't we change the stack offset periodically from an interrupt or
> > > so, and then have every later entry use that.
> >
> > Hm... This sounds more complex conceptually - we cannot touch
> > stack when it is in use, so we have to periodically probe for a
> > good time (when process is in userspace I guess) to change it from an interrupt?
> > IMO trampoline stack provides such a good clean place for doing it and we
> > have stackleak there doing stack cleanup, so would make sense to keep
> > these features operating together.
> 
> The idea was to just change a per-cpu (possible per-task if you ctxsw
> it) offset that is used on entry to offset the stack.
> So only entries after the change will have the updated offset, any
> in-progress syscalls will continue with their current offset and will be
> unaffected.

Let me try to write this out in simple steps to make sure I understand
your approach:

- create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
- periodically issue an interrupt, and inside it walk the process tree and
  update stack_offset randomly for each process
- when a process makes a new syscall, it subtracts the stack_offset value from top_of_stack(),
  and that becomes its new top_of_stack() for that system call.

Something like this?

I think it is close to what Andy has proposed
in his reply, but the main difference is that you propose to do this via an interrupt.
And the main reasoning for doing this via an interrupt would be not to affect
syscall performance, right?

The problem I see with the interrupt approach is how often it should be done.
We don't want to end up in a situation where we issue it too often, since
it is not going to be a very lightweight operation (updating all processes), and we
don't want it done so rarely that we end up with processes that execute many
syscalls with the same offset. So, we might have a situation where some processes
will execute a number of syscalls with the same offset, while some will change their
offset more than once without even making a single syscall.

Also, I fear that an attacker might have more room to play here, since we don't have
any fixed guarantees on randomization anymore; it depends on interrupt scheduling,
the syscall rate of a given process, etc. I don't know how real these concerns are in
practice, but it feels like much more to worry about.

Best Regards,
Elena.
Reshetova, Elena Feb. 9, 2019, 11:38 a.m. UTC | #8
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> 
> > I can measure the impact with rdtsc(), I think it is *very* small.
> > Btw, what should be the good measurement test?
> > I am not that happy with just looping on fopen-fclose, too
> > much noise.
> 
> IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> somewhere.
> 
> Something like getpid/gettid in a loop should work fairly well, that's a
> really light syscall to make.

Thanks! Let me try this for measurements next. I was surprised that
there is nothing ready-made in perf or rt-tools. You would think
this must be something that regularly needs measuring.

Best Regards,
Elena.
Greg Kroah-Hartman Feb. 9, 2019, 12:09 p.m. UTC | #9
On Sat, Feb 09, 2019 at 11:38:00AM +0000, Reshetova, Elena wrote:
> 
>  On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > 
> > > I can measure the impact with rdtsc(), I think it is *very* small.
> > > Btw, what should be the good measurement test?
> > > I am not that happy with just looping on fopen-fclose, too
> > > much noise.
> > 
> > IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> > somewhere.
> > 
> > Something like getpid/gettid in a loop should work fairly well, that's a
> > really light syscall to make.
> 
> Thanks! Let me try this for measurements next. I was surprise that
> there nothing  ready-made in perf or rt-tools. You would think
> this must be regularly in a need to measure. 

lmbench has traditionally been used for something like this in the past,
you might want to look at that too.

greg k-h
Andy Lutomirski Feb. 9, 2019, 6:25 p.m. UTC | #10
On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
>
> > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> >
> > > >
> > > > Why can't we change the stack offset periodically from an interrupt or
> > > > so, and then have every later entry use that.
> > >
> > > Hm... This sounds more complex conceptually - we cannot touch
> > > stack when it is in use, so we have to periodically probe for a
> > > good time (when process is in userspace I guess) to change it from an interrupt?
> > > IMO trampoline stack provides such a good clean place for doing it and we
> > > have stackleak there doing stack cleanup, so would make sense to keep
> > > these features operating together.
> >
> > The idea was to just change a per-cpu (possible per-task if you ctxsw
> > it) offset that is used on entry to offset the stack.
> > So only entries after the change will have the updated offset, any
> > in-progress syscalls will continue with their current offset and will be
> > unaffected.
>
> Let me try to write this into simple steps to make sure I understand your
> approach:
>
> - create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
> - periodically issue an interrupt, and inside it walk the process tree and
>   update stack_offset randomly for each process
> - when a process makes a new syscall, it subtracts stack_offset value from top_of_stack()
>  and that becomes its new  top_of_stack() for that system call.
>
> Smth like this?

I'm proposing something that is conceptually different.  You are,
conceptually, changing the location of the stack.  I'm suggesting that
you leave the stack alone and, instead, randomize how you use the
stack.  In plain C, this would consist of adding roughly this snippet
in do_syscall_64() and possibly other entry functions:

if (randomize_stack()) {
  void *dummy = alloca(rdrand() & 0x7f8);

  /* Make sure the compiler doesn't optimize out the alloca. */
  asm volatile ("" :: "r" (dummy));
}

... do the actual syscall work here.

This has a few problems, namely that the generated code might be awful
and that alloca is more or less banned in the kernel.  I suppose
alloca could be unbanned in the entry C code, but this could also be
done fairly easily in the asm code.  You'd just need to use a register
to store whatever is needed to put RSP back in the exit code.  The
obvious way would be to use RBP, but it's plausible that using a
different callee-saved register would make the unwinder interactions
easier to get right.

With this approach, you don't modify any of the top_of_stack()
functions or macros at all -- the top of stack isn't changed.

>
> I think it is close to what Andy has proposed
> in his reply, but the main difference is that you propose to do this via an interrupt.
> And the main reasoning for doing this via interrupt would be not to affect
> syscall performance, right?
>
> The problem I see with interrupt approach is how often that should be done?
> Because we don't want to end up with situation when we issue it too often, since
> it is not going to be very light-weight operation (update all processes), and we
> don't want it to be too rarely done that we end up with processes that execute many
> syscalls with the same offset. So, we might have a situation when some processes
>  will execute a number of syscalls with same offset and some will change their offset
> more than once without even making a single syscall.

I bet that any attacker worth their salt could learn the offset by
doing a couple of careful syscalls and looking for cache and/or TLB
effects.  This might make the whole exercise mostly useless.  Isn't
RDRAND supposed to be extremely fast, though?

I usually benchmark like this:

$ ./timing_test_64 10M sys_enosys
10000000 loops in 2.53484s = 253.48 nsec / loop

using https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/
Reshetova, Elena Feb. 11, 2019, 6:05 a.m. UTC | #11
> On Sat, Feb 09, 2019 at 11:38:00AM +0000, Reshetova, Elena wrote:
> >
> >  On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > >
> > > > I can measure the impact with rdtsc(), I think it is *very* small.
> > > > Btw, what should be the good measurement test?
> > > > I am not that happy with just looping on fopen-fclose, too
> > > > much noise.
> > >
> > > IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> > > somewhere.
> > >
> > > Something like getpid/gettid in a loop should work fairly well, that's a
> > > really light syscall to make.
> >
> > Thanks! Let me try this for measurements next. I was surprise that
> > there nothing  ready-made in perf or rt-tools. You would think
> > this must be regularly in a need to measure.
> 
> lmbench has traditionally been used for something like this in the past,
> you might want to look at that too.

Thank you! I will check!

Best Regards,
Elena.
Reshetova, Elena Feb. 11, 2019, 6:39 a.m. UTC | #12
> On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> >
> > > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > >
> > > > >
> > > > > Why can't we change the stack offset periodically from an interrupt or
> > > > > so, and then have every later entry use that.
> > > >
> > > > Hm... This sounds more complex conceptually - we cannot touch
> > > > stack when it is in use, so we have to periodically probe for a
> > > > good time (when process is in userspace I guess) to change it from an
> interrupt?
> > > > IMO trampoline stack provides such a good clean place for doing it and we
> > > > have stackleak there doing stack cleanup, so would make sense to keep
> > > > these features operating together.
> > >
> > > The idea was to just change a per-cpu (possible per-task if you ctxsw
> > > it) offset that is used on entry to offset the stack.
> > > So only entries after the change will have the updated offset, any
> > > in-progress syscalls will continue with their current offset and will be
> > > unaffected.
> >
> > Let me try to write this into simple steps to make sure I understand your
> > approach:
> >
> > - create a new per-stack value (and potentially its per-cpu "shadow") called
> stack_offset = 0
> > - periodically issue an interrupt, and inside it walk the process tree and
> >   update stack_offset randomly for each process
> > - when a process makes a new syscall, it subtracts stack_offset value from
> top_of_stack()
> >  and that becomes its new  top_of_stack() for that system call.
> >
> > Smth like this?
> 
> I'm proposing somthing that is conceptually different. 

OK, it looks like I fully misunderstood what you meant, indeed.
The reason I didn't reply to your earlier answer is that I started to look
into the unwinder code & logic to get at least a slight clue about how things
can be done, since I hadn't really looked into it before (I wasn't changing
anything with regard to it, so I didn't have to). So, I meant to come back
with a more rigid answer than just "let me study this first"...

> You are,
> conceptually, changing the location of the stack.  I'm suggesting that
> you leave the stack alone and, instead, randomize how you use the
> stack. 


So, yes, instead of having:

allocated_stack_top
random_offset
actual_stack_top
pt_regs
...
and so on

We will have something like:

allocated_stack_top = actual_stack_top
pt_regs
random_offset
...

So, conceptually, we have the same amount of randomization with
both approaches, but it is applied very differently.

Security-wise, I will have to think more about whether the second approach
has any negative consequences, in addition to the positive ones. As a
paranoid security person, you might want to merge both approaches and
randomize both places (before and after pt_regs) with different offsets,
but I guess that would be out of the question, right?

I am not that experienced with exploits, but we have been
talking about this with Jann and Enrico, so I think it is best if they
comment directly here. I am just wondering if having pt_regs in a fixed
place can be an advantage for an attacker under any scenario...

> In plain C, this would consist of adding roughly this snippet
> in do_syscall_64() and possibly other entry functions:
> 
> if (randomize_stack()) {
>   void *dummy = alloca(rdrand() & 0x7f8);
> 
>   /* Make sure the compiler doesn't optimize out the alloca. */
>   asm volatile ("" :: "=rm" (dummy));
> }
> 
> ... do the actual syscall work here.
> 
> This has a few problems, namely that the generated code might be awful
> and that alloca is more or less banned in the kernel.  I suppose
> alloca could be unbanned in the entry C code, but this could also be
> done fairly easily in the asm code.  You'd just need to use a register
> to store whatever is needed to put RSP back in the exit code.  

Yes, I was actually thinking of doing it in assembly, since I think it would
look smaller and clearer that way. I just need to get the details slowly into
place. 


> The
> obvious way would be to use RBP, but it's plausible that using a
> different callee-saved register would make the unwinder interactions
> easier to get right.

This is what I started looking into; please bear with me, all this stuff is new
to my eyes, so I am slow...

> 
> With this approach, you don't modify any of the top_of_stack()
> functions or macros at all -- the top of stack isn't changed.

Yes, understood. 

> 
> >
> > I think it is close to what Andy has proposed
> > in his reply, but the main difference is that you propose to do this via an interrupt.
> > And the main reasoning for doing this via interrupt would be not to affect
> > syscall performance, right?
> >
> > The problem I see with interrupt approach is how often that should be done?
> > Because we don't want to end up with situation when we issue it too often, since
> > it is not going to be very light-weight operation (update all processes), and we
> > don't want it to be too rarely done that we end up with processes that execute
> many
> > syscalls with the same offset. So, we might have a situation when some processes
> >  will execute a number of syscalls with same offset and some will change their
> offset
> > more than once without even making a single syscall.
> 
> I bet that any attacker worth their salt could learn the offset by
> doing a couple of careful syscalls and looking for cache and/or TLB
> effects.  This might make the whole exercise mostly useless.  Isn't
> RDRAND supposed to be extremely fast, though?
> 
> I usually benchmark like this:
> 
> $ ./timing_test_64 10M sys_enosys
> 10000000 loops in 2.53484s = 253.48 nsec / loop
> 
> using https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/

Thank you for the pointer! 
With everyone's suggestions I am now having much better set of tools to do
my next measurements.

Best Regards,
Elena.
Reshetova, Elena Feb. 11, 2019, 12:47 p.m. UTC | #13
> On Fri, Feb 8, 2019 at 4:16 AM Elena Reshetova
> <elena.reshetova@intel.com> wrote:
> > --- a/mm/usercopy.c
> > +++ b/mm/usercopy.c
> > @@ -37,7 +37,7 @@
> >  static noinline int check_stack_object(const void *obj, unsigned long len)
> >  {
> >         const void * const stack = task_stack_page(current);
> > -       const void * const stackend = stack + THREAD_SIZE;
> > +       const void * const stackend = (void *)task_top_of_stack(current);
> >         int ret;
> >
> >         /* Object is not on the stack at all. */
> 
> It seems like having task_top_of_stack() would be a nice refactoring
> to have regardless of this feature (and splitting out would make this
> patch a bit easier to read too).

Yes, if my refactoring in these places looks correct, I can create a separate
patch that just uses task_top_of_stack() instead of hard-coding the math as
above. Does that sound right to people? I could not find a reason why these
places do not use task_top_of_stack() to begin with... 

Best Regards,
Elena.
Andy Lutomirski Feb. 11, 2019, 3:54 p.m. UTC | #14
On Feb 10, 2019, at 10:39 PM, Reshetova, Elena <elena.reshetova@intel.com> wrote:

>> On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
>> <elena.reshetova@intel.com> wrote:
>>> 
>>>> On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
>>>>>> On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
>>>> 
>>>>>> 
>>>>>> Why can't we change the stack offset periodically from an interrupt or
>>>>>> so, and then have every later entry use that.
>>>>> 
>>>>> Hm... This sounds more complex conceptually - we cannot touch
>>>>> stack when it is in use, so we have to periodically probe for a
>>>>> good time (when process is in userspace I guess) to change it from an
>> interrupt?
>>>>> IMO trampoline stack provides such a good clean place for doing it and we
>>>>> have stackleak there doing stack cleanup, so would make sense to keep
>>>>> these features operating together.
>>>> 
>>>> The idea was to just change a per-cpu (possible per-task if you ctxsw
>>>> it) offset that is used on entry to offset the stack.
>>>> So only entries after the change will have the updated offset, any
>>>> in-progress syscalls will continue with their current offset and will be
>>>> unaffected.
>>> 
>>> Let me try to write this into simple steps to make sure I understand your
>>> approach:
>>> 
>>> - create a new per-stack value (and potentially its per-cpu "shadow") called
>> stack_offset = 0
>>> - periodically issue an interrupt, and inside it walk the process tree and
>>>  update stack_offset randomly for each process
>>> - when a process makes a new syscall, it subtracts stack_offset value from
>> top_of_stack()
>>> and that becomes its new  top_of_stack() for that system call.
>>> 
>>> Smth like this?
>> 
>> I'm proposing something that is conceptually different. 
> 
> OK, looks like I fully misunderstood what you meant indeed.
> The reason I didn’t reply to your earlier answer is that I started to look
> into the unwinder code & logic to get at least a slight clue on how things
> can be done, since I had hardly looked at it before (I wasn't changing
> anything with regards to it, so I didn't have to). So, I meant to come back
> with a more rigid answer than just "let me study this first"...

Fair enough.

> 
> You are,
>> conceptually, changing the location of the stack.  I'm suggesting that
>> you leave the stack alone and, instead, randomize how you use the
>> stack. 
> 
> 
> So, yes, instead of having:
> 
> allocated_stack_top
> random_offset
> actual_stack_top
> pt_regs
> ...
> and so on
> 
> We will have smth like:
> 
> allocated_stack_top = actual_stack_top
> pt_regs
> random_offset
> ...
> 
> So, conceptually we have the same amount of randomization with 
> both approaches, but it is applied very differently. 

Exactly.

> 
> Security-wise I will have to think more about whether the second approach has any negative
> consequences, in addition to the positive ones. As a paranoid security person,
> you might want to merge both approaches and randomize both places (before and
> after pt_regs) with different offsets, but I guess this would be out of the question, right? 

It’s not out of the question, but it needs some amount of cost vs benefit analysis.  The costs are complexity, speed, and a reduction in available randomness for any given amount of memory consumed.

> 
> I am not that experienced with exploits, but we have been
> talking about this with Jann and Enrico, so I think it is best if they comment
> directly here. I am just wondering if having pt_regs in a fixed place can
> be an advantage for an attacker under any scenario... 

If an attacker has write-what-where (i.e. can write controlled values to controlled absolute virtual addresses), then I expect that pt_regs is a pretty low ranking target.  But it may be a fairly juicy target if you have a stack buffer overflow that lets an attacker write to a controlled *offset* from the stack. We used to keep thread_info at the bottom of the stack, and that was a great attack target.

But there’s an easier mitigation: just do regs->cs |= 3 or something like that in the exit code. Then any such attack can only corrupt *user* state.  The performance impact would be *very* low, since this could go in the asm path that’s only used for IRET to user mode.
Perla, Enrico Feb. 12, 2019, 10:16 a.m. UTC | #15
Hi,
  I was somewhat fond of randomizing the pt_regs location, as that is something I could relate to from writing exploits (it is a handy way to load user-controlled data into the kernel at a known location).

But, as Jann pointed out, that only has value in a ptrace-blocked sandbox, because the randomization offset can be leaked otherwise through ptrace PEEK/POKE and observing cache behavior. Worse, if ptrace is present, then the randomization is moot.

Since containers seem to be going towards leaving ptrace open, I'm now wondering whether that is a good motivation at all and whether the proposed simplified version is not simply better. 

> 
> If an attacker has write-what-where (i.e. can write controlled values to
> controlled absolute virtual addresses), then I expect that pt_regs is a pretty
> low ranking target.  But it may be a fairly juicy target if you have a stack
> buffer overflow that lets an attacker write to a controlled *offset* from the
> stack. We used to keep thread_info at the bottom of the stack, and that was
> a great attack target.
> 
> But there’s an easier mitigation: just do regs->cs |= 3 or something like that
> in the exit code. Then any such attack can only corrupt *user* state.  The
> performance impact would be *very* low, since this could go in the asm
> path that’s only used for IRET to user mode.

That's all fair. What I struggle with is finding a precise motivation for the randomization (granted, this might be extended to other KASLR cases, so perhaps it is not a strong hard stop).

The proposed randomization does fit the overall KASLR story and it does its job of not letting an attacker predict future stack offset from one leak, but in practical terms I'm struggling to find a case or two where this would have made a difference in an exploit.

Can any of you think of some?


                 -   Enrico

Reshetova, Elena Feb. 14, 2019, 7:52 a.m. UTC | #16
After some thinking and discussions, let me try to summarize the options and the
security benefits of each approach, based on everyone's feedback, before
going any further with the work. We all want only useful things in the kernel, so merging
smth that provides a subtle/almost non-existent benefit is clearly not a priority. 

So, first about the protection methods. We can do randomization in two ways:

1. stack top itself (and location of pt_regs) 
2. relative stack positioning (offset from pt_regs).

In addition to randomization we can do some fixup on exit, like Andy proposed:

3. Make sure CS always points to user on exit (for example by regs->cs |= 3),
potentially smth else can be fixed up similarly, if needed

Here are *known* (at least to me, please shout if you know more!) possible
attacks vectors and implications on them by doing 1), 2) or 3)

a) attacker's goal is to store some user-controlled data in pt_regs to reference it later
  1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
     If discovered, then attack succeeds as of now.
  2) nothing changed for this type of attack, attack succeeds as of now
  3) nothing changed for this type of attack, attack succeeds as of now

b) attacker's goal is to return to userland from a syscall with CS pointing to kernel
 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
     If discovered, then attack succeeds as of now.
 2) nothing changed for this type of attack, attack succeeds as of now
 3) CS changed explicitly on exit, so impossible to have this attack, *given* that the
    change is done late enough and no races, sleeps, etc. are possible

c) attacker's goal is to perform some kind of stack overflow into parts of adjusted stack 
  memory via some method. Here the main unknown is the "method".
This vector of attack is the challenge for current exploit writers: I guess if you can do
it now with all the current stack protections enabled, you get a nice defcon talk at least :) 
VLAs used to be an easy way of doing it; hopefully they are gone from the mainline source now
(out-of-tree drivers are a different story). 
Uninitialized locals might be another way, but there is ongoing work to close this (see Kees's
patches on stackinit).
Smth else we don’t yet know? 
Now back to our proposed countermeasures given that attacker has found a way to do
a crafted overflow and overwrite:

  1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
     If discovered, then attack succeeds as of now. 
  2) relative stack offset is not predictable and randomized, cannot be probed very easily via 
      cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non-
      deterministic now.
  3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS
      in adjusted pt_regs. If it is his goal, then it helps with that. 


Now summary:

It would seem to me that:

- regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it 
might make a positive difference in two out of three attack scenarios. Objections?

- randomization of stack top is only worth doing in ptrace-blocked scenario. 
Do we have such scenarios left that people care about? 
Because if we do, then we know that there is a real attack vector that we close this way, otherwise not. 
This is actually interesting, because we need to remember to take ptrace into our overall
kernel hardening threat model (smth that at least I haven't quite realized before) and evaluate every new
feature (especially randomization ones) being robust against ptrace probing. 

- randomization after pt_regs would only make a difference in attack scenario "c", for which 
  we don't yet have a proof-of-concept exploit or technique that would work (which does not
guarantee that attackers don't have such exploits ready, though :( ). 
So, if we implement this, the "justification part" for the feature would be smth like "to make it
harder for possible future stack-based exploits that utilize overflows", if/when someone finds a new
'ala VLA' way of doing a controlled overflow. 
How do people feel about it? Is it worth having? I can work on the POC for this in the direction that Andy 
outlined and can provide the performance impact/etc., but it is good that we understand that we cannot
provide a better justification for this feature at the moment unless someone is ready to share some
new exploit technique with us. 

Best Regards,
Elena.
Jann Horn Feb. 19, 2019, 2:47 p.m. UTC | #17
On Thu, Feb 14, 2019 at 8:52 AM Reshetova, Elena
<elena.reshetova@intel.com> wrote:
> After some thinking and discussions, let me try to summarize the options and the
> security benefits of each approach based on everyone else feedback before
> going any further with work. We all want only useful things in kernel, so merging
> smth that provides a subtle/almost non-existent benefit is clearly not a priority.
>
> So, first about the protection methods. We can do randomization in two ways:
>
> 1. stack top itself (and location of pt_regs)
> 2. relative stack positioning (offset from pt_regs).
>
> In addition to randomization we can do some fixup on exit, like Andy proposed:
>
> 3. Make sure CS always points to user on exit (for example by regs->cs |= 3),
> potentially smth else can be fixed up similarly, if needed
>
> Here are *known* (at least to me, please shout if you know more!) possible
> attacks vectors and implications on them by doing 1), 2) or 3)
>
> a) attacker's goal is to store some user-controlled data in pt_regs to reference it later
>   1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
>      If discovered, then attack succeeds as of now.
>   2) nothing changed for this type of attack, attack succeeds as of now
>   3) nothing changed for this type of attack, attack succeeds as of now
>
> b) attacker's goal is to return to userland from a syscall with CS pointing to kernel
>  1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
>      If discovered, then attack succeeds as of now.
>  2) nothing changed for this type of attack, attack succeeds as of now
>  3) CS changed explicitly on exit, so impossible to have this attack, *given* that the
>     change is done late enough and no races, sleeps, etc. are possible
>
> c) attacker's goal is to perform some kind of stack overflow into parts of adjusted stack
>   memory via some method. Here the main unknown is the "method".
> This vector of attack is the challenge for current exploit writers: I guess if you can do
> it now with all the current stack protections enabled, you get a nice defcon talk at least :)
> VLAs used to be an easy way of doing it; hopefully they are gone from the mainline source now
> (out-of-tree drivers are a different story).
> Uninitialized locals might be another way, but there is ongoing work to close this (see Kees's
> patches on stackinit).
> Smth else we don’t yet know?
> Now back to our proposed countermeasures given that attacker has found a way to do
> a crafted overflow and overwrite:
>
>   1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing.
>      If discovered, then attack succeeds as of now.
>   2) relative stack offset is not predictable and randomized, cannot be probed very easily via
>       cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non-
>       deterministic now.
>   3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS
>       in adjusted pt_regs. If it is his goal, then it helps with that.
>
>
> Now summary:
>
> It would seem to me that:
>
> - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it
> might make a positive difference in two out of three attack scenarios. Objections?
>
> - randomization of stack top is only worth doing in ptrace-blocked scenario.
> Do we have such scenarios left that people care about?

There are some, I think; for example, Chrome renderers on desktop
Linux can't use ptrace.

Patch
diff mbox series

diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540f..577186e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -802,6 +802,21 @@  config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization.
+
+config RANDOMIZE_KSTACK_OFFSET
+	default n
+	bool "Randomize kernel stack offset on syscall exit"
+	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET && VMAP_STACK
+	help
+	  Enable this if you want to randomize the kernel stack offset upon
+	  each syscall exit. This causes the kernel stack to have a randomized
+	  offset upon executing each system call.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e79..85d3849 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -134,6 +134,7 @@  config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET  if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 25e5a6b..d644f72 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -337,6 +337,14 @@  For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+.macro RANDOMIZE_KSTACK_NOCLOBBER
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	PUSH_AND_CLEAR_REGS
+	call randomize_kstack
+	POP_REGS
+#endif
+.endm
+
 #endif /* CONFIG_X86_64 */
 
 .macro STACKLEAK_ERASE
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b..0031887 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -23,6 +23,7 @@ 
 #include <linux/user-return-notifier.h>
 #include <linux/nospec.h>
 #include <linux/uprobes.h>
+#include <linux/random.h>
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
 
@@ -294,6 +295,26 @@  __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+__visible void randomize_kstack(void)
+{
+	unsigned long r_offset, new_top, stack_bottom;
+
+	if (current->stack_start != 0) {
+
+		r_offset = get_random_long();
+		r_offset &= 0xFFUL;
+		r_offset <<= 4;
+		stack_bottom = (unsigned long)task_stack_page(current);
+
+		new_top = stack_bottom + THREAD_SIZE - 0xFF0UL;
+		new_top += r_offset;
+		this_cpu_write(cpu_current_top_of_stack, new_top);
+		current->stack_start = new_top;
+	}
+}
+#endif
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 /*
  * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb..ae9d370 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -268,6 +268,8 @@  syscall_return_via_sysret:
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
+	RANDOMIZE_KSTACK_NOCLOBBER
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
@@ -630,6 +632,8 @@  GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
+	RANDOMIZE_KSTACK_NOCLOBBER
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	/* Restore RDI. */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 071b2a6..dad09f2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -569,10 +569,15 @@  static inline unsigned long current_top_of_stack(void)
 
 static inline bool on_thread_stack(void)
 {
+	/* this might need adjustment to a more fine-grained comparison
+	 * we want a condition like
+	 * "< current_top_of_stack() - task_stack_page(current)"
+	 */
 	return (unsigned long)(current_top_of_stack() -
 			       current_stack_pointer) < THREAD_SIZE;
 }
 
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include <asm/paravirt.h>
 #else
@@ -829,12 +834,16 @@  static inline void spin_lock_prefetch(const void *x)
 
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#define task_pt_regs(task) ((struct pt_regs *)(task_ptregs(task)))
+#else
 #define task_pt_regs(task) \
-({									\
+({                                 \
 	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
-	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
-	((struct pt_regs *)__ptr) - 1;					\
+	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;         \
+	((struct pt_regs *)__ptr) - 1;                              \
 })
+#endif
 
 #ifdef CONFIG_X86_32
 /*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2b58864..030ee15 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -33,7 +33,7 @@  bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info)
 {
 	unsigned long *begin = task_stack_page(task);
-	unsigned long *end   = task_stack_page(task) + THREAD_SIZE;
+	unsigned long *end   = (unsigned long *)task_top_of_stack(task);
 
 	if (stack < begin || stack >= end)
 		return false;
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 0469cd0..3f03b79 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -43,7 +43,7 @@  static inline void stack_overflow_check(struct pt_regs *regs)
 		return;
 
 	if (regs->sp >= curbase + sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
-	    regs->sp <= curbase + THREAD_SIZE)
+	    regs->sp <= task_top_of_stack(current))
 		return;
 
 	irq_stack_top = (u64)this_cpu_ptr(irq_stack_union.irq_stack) +
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7d31192..f30485a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -819,7 +819,7 @@  unsigned long get_wchan(struct task_struct *p)
 	 * We need to read FP and IP, so we need to adjust the upper
 	 * bound by another unsigned long.
 	 */
-	top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
+	top = task_top_of_stack(p);
 	top -= 2 * sizeof(unsigned long);
 	bottom = start;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 291a9bd..8e748e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -605,6 +605,9 @@  struct task_struct {
 	randomized_struct_fields_start
 
 	void				*stack;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	unsigned long		stack_start;
+#endif
 	atomic_t			usage;
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 6a84192..229c434 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -21,6 +21,22 @@  static inline void *task_stack_page(const struct task_struct *task)
 	return task->stack;
 }
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+static inline void *task_ptregs(const struct task_struct *task)
+{
+	unsigned long __ptr;
+
+	if (task->stack_start == 0) {
+		__ptr = (unsigned long)task_stack_page(task);
+		__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
+		return ((struct pt_regs *)__ptr) - 1;
+	}
+
+	__ptr = task->stack_start;
+	return ((struct pt_regs *)__ptr) - 1;
+}
+#endif
+
 #define setup_thread_stack(new,old)	do { } while(0)
 
 static inline unsigned long *end_of_stack(const struct task_struct *task)
@@ -82,7 +98,7 @@  static inline int object_is_on_stack(const void *obj)
 {
 	void *stack = task_stack_page(current);
 
-	return (obj >= stack) && (obj < (stack + THREAD_SIZE));
+	return (obj >= stack) && (obj < ((void *)task_top_of_stack(current)));
 }
 
 extern void thread_stack_cache_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff..8eccf94 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -422,6 +422,9 @@  static void release_task_stack(struct task_struct *tsk)
 	tsk->stack = NULL;
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = NULL;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+#endif
 #endif
 }
 
@@ -863,6 +866,10 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->stack = stack;
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = stack_vm_area;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+	tsk->stack_start = (unsigned long)task_top_of_stack(tsk);
+#endif
 #endif
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	atomic_set(&tsk->stack_refcount, 1);
@@ -922,6 +929,9 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 free_stack:
 	free_thread_stack(tsk);
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+#endif
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 877de4f..e52c76f 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1572,7 +1572,7 @@  static void kmemleak_scan(void)
 		do_each_thread(g, p) {
 			void *stack = try_get_task_stack(p);
 			if (stack) {
-				scan_block(stack, stack + THREAD_SIZE, NULL);
+				scan_block(stack, task_top_of_stack(p), NULL);
 				put_task_stack(p);
 			}
 		} while_each_thread(g, p);
diff --git a/mm/usercopy.c b/mm/usercopy.c
index 852eb4e..4b07542 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -37,7 +37,7 @@ 
 static noinline int check_stack_object(const void *obj, unsigned long len)
 {
 	const void * const stack = task_stack_page(current);
-	const void * const stackend = stack + THREAD_SIZE;
+	const void * const stackend = (void *)task_top_of_stack(current);
 	int ret;
 
 	/* Object is not on the stack at all. */