Message ID: 1549628149-11881-2-git-send-email-elena.reshetova@intel.com (mailing list archive)
State: New, archived
Series: [RFC] x86/entry/64: randomize kernel stack offset upon system call
On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> 1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
> base:          Time: 12.243
> random_offset: Time: 13.411
>
> base:
>   8.46%  time  [kernel.kallsyms]  [k] crc32c_pcl_intel_update
>   4.77%  time  [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
>   4.14%  time  [kernel.kallsyms]  [k] fsnotify
>
> random_offset:
>   8.35%  time  [kernel.kallsyms]  [k] crc32c_pcl_intel_update
>   5.61%  time  [kernel.kallsyms]  [k] get_random_u64
>   4.88%  time  [kernel.kallsyms]  [k] ext4_mark_iloc_dirty

*ouch*

> Notable differences from RANDKSTACK:
>
> - random bits are taken from get_random_long() instead of
>   rdtsc() for a better randomness. This however has a big
>   performance impact (see above the numbers) and additionally
>   if we happen to hit a point when a generator needs to be
>   reseeded, we might have an issue. Alternatives can be to
>   make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
>   which can solve some issues, but I doubt that all of them.
>   Of course rdtsc() can be a fallback if there is no way to
>   make calls for a proper randomness from the trampoline stack.

http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.html

That would seem to suggest that the low bits of rdtsc would in fact be a
fairly good source of random.

Still, doing this on sysexit seems painful at best; syscall performance
matters (and hopefully we'll get rid of Meltdown 'soon').

Why can't we change the stack offset periodically from an interrupt or
so, and then have every later entry use that?
> On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > 1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
> > base:          Time: 12.243
> > random_offset: Time: 13.411
> >
> > base:
> >   8.46%  time  [kernel.kallsyms]  [k] crc32c_pcl_intel_update
> >   4.77%  time  [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
> >   4.14%  time  [kernel.kallsyms]  [k] fsnotify
> >
> > random_offset:
> >   8.35%  time  [kernel.kallsyms]  [k] crc32c_pcl_intel_update
> >   5.61%  time  [kernel.kallsyms]  [k] get_random_u64
> >   4.88%  time  [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
>
> *ouch*
>
> > Notable differences from RANDKSTACK:
> >
> > - random bits are taken from get_random_long() instead of
> >   rdtsc() for a better randomness. This however has a big
> >   performance impact (see above the numbers) and additionally
> >   if we happen to hit a point when a generator needs to be
> >   reseeded, we might have an issue. Alternatives can be to
> >   make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
> >   which can solve some issues, but I doubt that all of them.
> >   Of course rdtsc() can be a fallback if there is no way to
> >   make calls for a proper randomness from the trampoline stack.
>
> http://www.chronox.de/jent/doc/CPU-Jitter-NPTRNG.html

Oh, this seems to be a great write-up; I will read it carefully, thank you for the pointer! Yes, I believe RANDKSTACK used rdtsc() exactly because of the many issues get_random_* brings, in addition to the horrid performance. This patch works just fine with rdtsc() as well; I just wanted to show the *full* randomness option first, with the impact it brings.

> That would seem to suggest that the low bits of rdtsc would in fact be a
> fairly good source of random.
>
> Still, doing this on sysexit seems painful at best, syscall performance
> matters (and hopefully we'll get rid of meltdown 'soon').

I can measure the impact with rdtsc(); I think it is *very* small. Btw, what would be a good measurement test? I am not happy with just looping on fopen-fclose - too much noise.

> Why can't we change the stack offset periodically from an interrupt or
> so, and then have every later entry use that.

Hm... This sounds conceptually more complex - we cannot touch the stack while it is in use, so we would have to periodically probe for a good time (when the process is in userspace, I guess) to change it from an interrupt? IMO the trampoline stack provides such a good, clean place for doing it, and we have stackleak there doing stack cleanup, so it would make sense to keep these features operating together.

Best Regards,
Elena.
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> >
> > Why can't we change the stack offset periodically from an interrupt or
> > so, and then have every later entry use that.
>
> Hm... This sounds more complex conceptually - we cannot touch
> stack when it is in use, so we have to periodically probe for a
> good time (when process is in userspace I guess) to change it from an interrupt?
> IMO trampoline stack provides such a good clean place for doing it and we
> have stackleak there doing stack cleanup, so would make sense to keep
> these features operating together.

The idea was to just change a per-cpu (possibly per-task if you ctxsw it) offset that is used on entry to offset the stack.

So only entries after the change will have the updated offset; any in-progress syscalls will continue with their current offset and will be unaffected.
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
>
> I can measure the impact with rdtsc(), I think it is *very* small.
> Btw, what should be the good measurement test?
> I am not that happy with just looping on fopen-fclose, too
> much noise.

IIRC amluto had a nice micro-bench for 'pure' syscall overhead somewhere.

Something like getpid/gettid in a loop should work fairly well; that's a really light syscall to make.
> On Feb 8, 2019, at 4:15 AM, Elena Reshetova <elena.reshetova@intel.com> wrote:
>
> If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected,
> the kernel stack offset is randomized upon each
> exit from a system call via the trampoline stack.
>
> This feature is based on the original idea from
> the PaX's RANDKSTACK feature:
> https://pax.grsecurity.net/docs/randkstack.txt
> All the credits for the original idea goes to the PaX team.
> However, the implementation of RANDOMIZE_KSTACK_OFFSET
> differs greatly from the RANDKSTACK feature (see below).
>
> Reasoning for the feature:
>
> This feature should make considerably harder various
> stack-based attacks that are based upon overflowing
> a kernel stack into an adjusted kernel stack with a
> possibility to jump over a guard page.
> Since the stack offset is randomized upon each
> system call, it is very hard for an attacker to reliably
> land in any particular place on the adjusted stack.

I think we need a better justification. With VLAs gone, it should be statically impossible to overflow past a guard page.

> Design description:
>
> During most of the kernel's execution, it runs on the "thread
> stack", which is allocated at fork.c/dup_task_struct() and stored in
> a per-task variable (tsk->stack). Since the stack grows downwards,
> the stack top can always be calculated using the task_top_of_stack(tsk)
> function, which essentially returns the address of tsk->stack + stack
> size. When VMAP_STACK is enabled, the thread stack is allocated from
> vmalloc space.
>
> The thread stack is pretty deterministic in its structure - fixed in
> size, and upon every entry from userspace to the kernel on a
> syscall the thread stack starts being constructed from an
> address fetched from the per-cpu cpu_current_top_of_stack variable.
> This variable is required since there is no way to reference "current"
> from the kernel entry/exit code, so the value of task_top_of_stack(tsk)
> is "shadowed" in a per-cpu variable each time the kernel context
> switches to a new task.
>
> The RANDOMIZE_KSTACK_OFFSET feature works by randomizing the value of
> task_top_of_stack(tsk) every time a process exits from a syscall. As
> a result the thread stack for that process will be constructed from a
> random offset from the fixed tsk->stack + stack size value upon the
> subsequent syscall.

There is a vastly simpler way to do this: leave pt_regs in place and subtract a random multiple of 8 before pushing anything else onto the stack. This gets most of the benefit with much less complexity. It's maybe even better: if you can overflow a stack buffer to rewrite pt_regs, you gain control of all registers, but if the stack-to-pt_regs offset is randomized, that gets much harder.

> - random bits are taken from get_random_long() instead of
>   rdtsc() for a better randomness. This however has a big
>   performance impact (see above the numbers) and additionally
>   if we happen to hit a point when a generator needs to be
>   reseeded, we might have an issue.

NAK. I do not want entry code calling non-arch C code that is not explicitly intended to be used from entry code like this. You may be violating all kinds of rules about context tracking and even stack usage. Just use ALTERNATIVE to select between RDTSC and RDRAND. This isn't used for crypto - refusing to "trust" the CPU here makes no sense.

I think that, if you make these two changes, you'll have a very straightforward 50-ish line patch. The only real complication is how to find pt_regs on the way out. I think you could use RBP for this - just make it look like you have a regular frame-pointer-using stack frame between do_syscall_whatever and pt_regs. Josh Poimboeuf can help make sure you get all the unwinding details right. It should be straightforward.
On Fri, Feb 8, 2019 at 4:16 AM Elena Reshetova <elena.reshetova@intel.com> wrote:
> --- a/mm/usercopy.c
> +++ b/mm/usercopy.c
> @@ -37,7 +37,7 @@
>  static noinline int check_stack_object(const void *obj, unsigned long len)
>  {
>         const void * const stack = task_stack_page(current);
> -       const void * const stackend = stack + THREAD_SIZE;
> +       const void * const stackend = (void *)task_top_of_stack(current);
>         int ret;
>
>         /* Object is not on the stack at all. */

It seems like having task_top_of_stack() would be a nice refactoring to have regardless of this feature (and splitting it out would make this patch a bit easier to read too).
> On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > >
> > > Why can't we change the stack offset periodically from an interrupt or
> > > so, and then have every later entry use that.
> >
> > Hm... This sounds more complex conceptually - we cannot touch
> > stack when it is in use, so we have to periodically probe for a
> > good time (when process is in userspace I guess) to change it from an interrupt?
> > IMO trampoline stack provides such a good clean place for doing it and we
> > have stackleak there doing stack cleanup, so would make sense to keep
> > these features operating together.
>
> The idea was to just change a per-cpu (possible per-task if you ctxsw
> it) offset that is used on entry to offset the stack.
> So only entries after the change will have the updated offset, any
> in-progress syscalls will continue with their current offset and will be
> unaffected.

Let me try to write this out in simple steps to make sure I understand your approach:

- create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
- periodically issue an interrupt, and inside it walk the process tree and update stack_offset randomly for each process
- when a process makes a new syscall, it subtracts the stack_offset value from top_of_stack(), and that becomes its new top_of_stack() for that system call.

Smth like this? I think it is close to what Andy proposed in his reply, but the main difference is that you propose to do this via an interrupt, and the main reasoning for doing it that way would be not to affect syscall performance, right?

The problem I see with the interrupt approach is how often it should be done. We don't want to issue it too often, since updating all processes is not going to be a very light-weight operation, but we also don't want it done so rarely that processes end up executing many syscalls with the same offset. So we might have a situation where some processes execute a number of syscalls with the same offset, while others change their offset more than once without even making a single syscall.

Also, I fear that an attacker might have more room to play here, since we no longer have any fixed guarantees on randomization - it depends on interrupt scheduling, the syscall rate of a given process, etc. I don't know how real these concerns are in practice, but it feels like much more to worry about.

Best Regards,
Elena.
On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> >
> > I can measure the impact with rdtsc(), I think it is *very* small.
> > Btw, what should be the good measurement test?
> > I am not that happy with just looping on fopen-fclose, too
> > much noise.
>
> IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> somewhere.
>
> Something like getpid/gettid in a loop should work fairly well, that's a
> really light syscall to make.

Thanks! Let me try this for measurements next. I was surprised that there is nothing ready-made in perf or rt-tools; you would think this is something people regularly need to measure.

Best Regards,
Elena.
On Sat, Feb 09, 2019 at 11:38:00AM +0000, Reshetova, Elena wrote:
> > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > >
> > > I can measure the impact with rdtsc(), I think it is *very* small.
> > > Btw, what should be the good measurement test?
> > > I am not that happy with just looping on fopen-fclose, too
> > > much noise.
> >
> > IIRC amluto had a nice micro-bench for 'pure' syscall overhead
> > somewhere.
> >
> > Something like getpid/gettid in a loop should work fairly well, that's a
> > really light syscall to make.
>
> Thanks! Let me try this for measurements next. I was surprise that
> there nothing ready-made in perf or rt-tools. You would think
> this must be regularly in a need to measure.

lmbench has traditionally been used for something like this in the past; you might want to look at that too.

greg k-h
On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena <elena.reshetova@intel.com> wrote:
>
> > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote:
> > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote:
> > > >
> > > > Why can't we change the stack offset periodically from an interrupt or
> > > > so, and then have every later entry use that.
> > >
> > > Hm... This sounds more complex conceptually - we cannot touch
> > > stack when it is in use, so we have to periodically probe for a
> > > good time (when process is in userspace I guess) to change it from an interrupt?
> > > IMO trampoline stack provides such a good clean place for doing it and we
> > > have stackleak there doing stack cleanup, so would make sense to keep
> > > these features operating together.
> >
> > The idea was to just change a per-cpu (possible per-task if you ctxsw
> > it) offset that is used on entry to offset the stack.
> > So only entries after the change will have the updated offset, any
> > in-progress syscalls will continue with their current offset and will be
> > unaffected.
>
> Let me try to write this into simple steps to make sure I understand your
> approach:
>
> - create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
> - periodically issue an interrupt, and inside it walk the process tree and
>   update stack_offset randomly for each process
> - when a process makes a new syscall, it subtracts stack_offset value from top_of_stack()
>   and that becomes its new top_of_stack() for that system call.
>
> Smth like this?

I'm proposing something that is conceptually different. You are, conceptually, changing the location of the stack. I'm suggesting that you leave the stack alone and, instead, randomize how you use the stack.
In plain C, this would consist of adding roughly this snippet in do_syscall_64() and possibly other entry functions:

    if (randomize_stack()) {
        void *dummy = alloca(rdrand() & 0x7f8);

        /* Make sure the compiler doesn't optimize out the alloca. */
        asm volatile ("" :: "r" (dummy));
    }

    ... do the actual syscall work here.

This has a few problems, namely that the generated code might be awful and that alloca is more or less banned in the kernel. I suppose alloca could be unbanned in the entry C code, but this could also be done fairly easily in the asm code. You'd just need to use a register to store whatever is needed to put RSP back in the exit code. The obvious way would be to use RBP, but it's plausible that using a different callee-saved register would make the unwinder interactions easier to get right.

With this approach, you don't modify any of the top_of_stack() functions or macros at all -- the top of stack isn't changed.

> I think it is close to what Andy has proposed
> in his reply, but the main difference is that you propose to do this via an interrupt.
> And the main reasoning for doing this via interrupt would be not to affect
> syscall performance, right?
>
> The problem I see with interrupt approach is how often that should be done?
> Because we don't want to end up with situation when we issue it too often, since
> it is not going to be very light-weight operation (update all processes), and we
> don't want it to be too rarely done that we end up with processes that execute many
> syscalls with the same offset. So, we might have a situation when some processes
> will execute a number of syscalls with same offset and some will change their offset
> more than once without even making a single syscall.

I bet that any attacker worth their salt could learn the offset by doing a couple of careful syscalls and looking for cache and/or TLB effects. This might make the whole exercise mostly useless. Isn't RDRAND supposed to be extremely fast, though?
I usually benchmark like this:

    $ ./timing_test_64 10M sys_enosys
    10000000 loops in 2.53484s = 253.48 nsec / loop

using https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/
> On Sat, Feb 09, 2019 at 11:38:00AM +0000, Reshetova, Elena wrote:
> >
> > Thanks! Let me try this for measurements next. I was surprise that
> > there nothing ready-made in perf or rt-tools. You would think
> > this must be regularly in a need to measure.
>
> lmbench has traditionally been used for something like this in the past,
> you might want to look at that too.

Thank you! I will check!

Best Regards,
Elena.
> On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
> <elena.reshetova@intel.com> wrote:
> >
> > Let me try to write this into simple steps to make sure I understand your
> > approach:
> >
> > - create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
> > - periodically issue an interrupt, and inside it walk the process tree and
> >   update stack_offset randomly for each process
> > - when a process makes a new syscall, it subtracts stack_offset value from top_of_stack()
> >   and that becomes its new top_of_stack() for that system call.
> >
> > Smth like this?
>
> I'm proposing somthing that is conceptually different.

OK, looks like I fully misunderstood what you meant, indeed. The reason I didn't reply to your earlier answer is that I started to look into the unwinder code & logic to get at least a slight clue about how things can be done, since I hadn't really looked into it before (I wasn't changing anything with regard to it, so I didn't have to). So I meant to come back with a more rigid answer than just "let me study this first"...

> You are, conceptually, changing the location of the stack. I'm suggesting that
> you leave the stack alone and, instead, randomize how you use the
> stack.

So, yes, instead of having:

    allocated_stack_top
    random_offset
    actual_stack_top
    pt_regs
    ... and so on

we will have smth like:

    allocated_stack_top = actual_stack_top
    pt_regs
    random_offset
    ...

So, conceptually we have the same amount of randomization with both approaches, but it is applied very differently.

Security-wise I will have to think more about whether the second approach has any negative consequences, in addition to the positive ones. As a paranoid security person, you might want to merge both approaches and randomize both places (before and after pt_regs) with different offsets, but I guess this would be out of the question, right?

I am not that experienced with exploits, but we have been talking about this with Jann and Enrico, so I think it is best if they comment directly here. I am just wondering if having pt_regs in a fixed place can be an advantage for an attacker under any scenario...

> In plain C, this would consist of adding roughly this snippet
> in do_syscall_64() and possibly other entry functions:
>
>     if (randomize_stack()) {
>         void *dummy = alloca(rdrand() & 0x7f8);
>
>         /* Make sure the compiler doesn't optimize out the alloca. */
>         asm volatile ("" :: "=rm" (dummy));
>     }
>
>     ... do the actual syscall work here.
>
> This has a few problems, namely that the generated code might be awful
> and that alloca is more or less banned in the kernel.

Yes, I was actually thinking of doing it in assembly, since I think it would look smaller and clearer that way. I just need to slowly get the details in place.

> I suppose
> alloca could be unbanned in the entry C code, but this could also be
> done fairly easily in the asm code. You'd just need to use a register
> to store whatever is needed to put RSP back in the exit code. The
> obvious way would be to use RBP, but it's plausible that using a
> different callee-saved register would make the unwinder interactions
> easier to get right.

This is what I started looking into - bear with me please, all this stuff is new to my eyes, so I am slow...

> With this approach, you don't modify any of the top_of_stack()
> functions or macros at all -- the top of stack isn't changed.

Yes, understood.

> > I think it is close to what Andy has proposed
> > in his reply, but the main difference is that you propose to do this via an interrupt.
> > And the main reasoning for doing this via interrupt would be not to affect
> > syscall performance, right?
> >
> > The problem I see with interrupt approach is how often that should be done?
> > Because we don't want to end up with situation when we issue it too often, since
> > it is not going to be very light-weight operation (update all processes), and we
> > don't want it to be too rarely done that we end up with processes that execute many
> > syscalls with the same offset. So, we might have a situation when some processes
> > will execute a number of syscalls with same offset and some will change their offset
> > more than once without even making a single syscall.
>
> I bet that any attacker worth their salt could learn the offset by
> doing a couple of careful syscalls and looking for cache and/or TLB
> effects. This might make the whole exercise mostly useless. Isn't
> RDRAND supposed to be extremely fast, though?
> I usually benchmark like this:
>
>     $ ./timing_test_64 10M sys_enosys
>     10000000 loops in 2.53484s = 253.48 nsec / loop
>
> using https://git.kernel.org/pub/scm/linux/kernel/git/luto/misc-tests.git/

Thank you for the pointer! With everyone's suggestions I now have a much better set of tools for my next measurements.

Best Regards,
Elena.
> On Fri, Feb 8, 2019 at 4:16 AM Elena Reshetova
> <elena.reshetova@intel.com> wrote:
> > --- a/mm/usercopy.c
> > +++ b/mm/usercopy.c
> > @@ -37,7 +37,7 @@
> >  static noinline int check_stack_object(const void *obj, unsigned long len)
> >  {
> >         const void * const stack = task_stack_page(current);
> > -       const void * const stackend = stack + THREAD_SIZE;
> > +       const void * const stackend = (void *)task_top_of_stack(current);
> >         int ret;
> >
> >         /* Object is not on the stack at all. */
>
> It seems like having task_top_of_stack() would be a nice refactoring
> to have regardless of this feature (and splitting out would make this
> patch a bit easier to read too).

Yes, if my refactoring in these places looks correct, I can create a separate patch that just uses task_top_of_stack() instead of hard-coding some math like the above. Does that sound right to people? I could not find a reason why these places do not use task_top_of_stack() to begin with...

Best Regards,
Elena.
On Feb 10, 2019, at 10:39 PM, Reshetova, Elena <elena.reshetova@intel.com> wrote:

>> On Sat, Feb 9, 2019 at 3:13 AM Reshetova, Elena
>> <elena.reshetova@intel.com> wrote:
>>>
>>> Let me try to write this into simple steps to make sure I understand your
>>> approach:
>>>
>>> - create a new per-stack value (and potentially its per-cpu "shadow") called stack_offset = 0
>>> - periodically issue an interrupt, and inside it walk the process tree and
>>>   update stack_offset randomly for each process
>>> - when a process makes a new syscall, it subtracts stack_offset value from top_of_stack()
>>>   and that becomes its new top_of_stack() for that system call.
>>>
>>> Smth like this?
>>
>> I'm proposing somthing that is conceptually different.
>
> OK, looks like I fully misunderstand what you meant indeed.
> The reason I didn’t reply to your earlier answer is that I started to look
> into unwinder code & logic to get at least a slight clue on how things
> can be done since I haven't looked in it almost at all before. So, I meant
> to come back with a more rigid answer that just "let me study this first"...

Fair enough.

>> You are, conceptually, changing the location of the stack. I'm suggesting that
>> you leave the stack alone and, instead, randomize how you use the
>> stack.
>
> So, yes, instead of having:
>
>     allocated_stack_top
>     random_offset
>     actual_stack_top
>     pt_regs
>     ... and so on
>
> We will have smth like:
>
>     allocated_stack_top = actual_stack_top
>     pt_regs
>     random_offset
>     ...
>
> So, conceptually we have the same amount of randomization with
> both approaches, but it is applied very differently.

Exactly.

> Security-wise I will have to think more if second approach has any negative
> consequences, in addition to positive ones. As a paranoid security person,
> you might want to merge both approaches and randomize both places (before and
> after pt_regs) with different offsets, but I guess this would be out of question, right?

It's not out of the question, but it needs some amount of cost vs benefit analysis. The costs are complexity, speed, and a reduction in available randomness for any given amount of memory consumed.

> I am not that experienced with exploits, but we have been
> talking now with Jann and Enrico on this, so I think it is the best they comment
> directly here. I am just wondering if having pt_regs in a fixed place can
> be an advantage for an attacker under any scenario...

If an attacker has write-what-where (i.e. can write controlled values to controlled absolute virtual addresses), then I expect that pt_regs is a pretty low-ranking target. But it may be a fairly juicy target if you have a stack buffer overflow that lets an attacker write to a controlled *offset* from the stack. We used to keep thread_info at the bottom of the stack, and that was a great attack target.

But there's an easier mitigation: just do regs->cs |= 3 or something like that in the exit code. Then any such attack can only corrupt *user* state. The performance impact would be *very* low, since this could go in the asm path that's only used for IRET to user mode.
Hi,

I was somewhat fond of randomizing the pt_regs location, as that is something I could relate to in writing an exploit (a handy way to load user-controlled data into the kernel at a known location). But, as Jann pointed out, that only has value in a ptrace-blocked sandbox, because the randomization offset can otherwise be leaked through ptrace PEEK/POKE or by observing cache behavior. Worse, if ptrace is available, the randomization is moot. Since containers seem to be going towards leaving ptrace open, I'm now wondering whether that is a good motivation at all and whether the proposed simplified version is not just better.

> If an attacker has write-what-where (i.e. can write controlled values to
> controlled absolute virtual addresses), then I expect that pt_regs is a pretty
> low ranking target. But it may be a fairly juicy target if you have a stack
> buffer overflow that lets an attacker write to a controlled *offset* from the
> stack. We used to keep thread_info at the bottom of the stack, and that was
> a great attack target.
>
> But there's an easier mitigation: just do regs->cs |= 3 or something like that
> in the exit code. Then any such attack can only corrupt *user* state. The
> performance impact would be *very* low, since this could go in the asm
> path that's only used for IRET to user mode.

That's all fair. What I struggle with is finding a precise motivation for the randomization (granted, this might be extended to other KASLR cases, so perhaps it is not a strong hard stop). The proposed randomization does fit the overall KASLR story, and it does its job of not letting an attacker predict a future stack offset from one leak, but in practical terms I'm struggling to find a case or two where this would have made a difference in an exploit. Can any of you think of some?

- Enrico
con unico socio Sede: Milanofiori Palazzo E 4 CAP 20094 Assago (MI) Capitale Sociale Euro 104.000,00 interamente versato Partita I.V.A. e Codice Fiscale 04236760155 Repertorio Economico Amministrativo n. 997124 Registro delle Imprese di Milano nr. 183983/5281/33 Soggetta ad attivita' di direzione e coordinamento di INTEL CORPORATION, USA This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
After some thinking and discussions, let me try to summarize the options and the security benefits of each approach based on everyone else feedback before going any further with work. We all want only useful things in kernel, so merging smth that provides a subtle/almost non-existing benefit is clearly not a priority. So, first about the protection methods. We can do randomization in two ways: 1. stack top itself (and location of pt_regs) 2. relative stack positioning (offset from pt_regs). In addition to randomization we can do some fixup on exit, like Andy proposed: 3. Make sure CS always points to user on exit (for example by regs->cs |= 3), potentially smth else can be fixed up similarly, if needed Here are *known* (at least to me, please shout if you know more!) possible attacks vectors and implications on them by doing 1), 2) or 3) a) attacker's goal is to store some user-controlled data in pt_regs to reference it later 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. If discovered, then attack succeeds as of now. 2) nothing changed for this type of attack, attack succeeds as of now 3) nothing changed for this type of attack, attack succeeds as of now b) attacker's goal is to return to userland from a syscall with CS pointing to kernel 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. If discovered, then attack succeeds as of now. 2) nothing changed for this type of attack, attack succeeds as of now 3) CS changed explicitly on exit, so impossible to have this attack, *given* that the change is done late enough and no races, sleeps, etc. are possible c) attacker's goal is to perform some kind of stack overflow into parts of adjusted stack memory via some method. Here the main unknown is the "method". 
This vector of attack is the challenge for the current exploit writers: I guess if you can do it now with all the current protections for stack enabled, you get a nice defcon talk at least :) VLA used to be an easy way of doing it, hopefully they are gone from the main source now (out of tree drivers is a different story). Uninitialized locals might be other way, but there is work ongoing to close this (see Kees's patches on stackinit). Smth else we don’t yet know? Now back to our proposed countermeasures given that attacker has found a way to do a crafted overflow and overwrite: 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. If discovered, then attack succeeds as of now. 2) relative stack offset is not predictable and randomized, cannot be probed very easily via cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non- deterministic now. 3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS in adjusted pt_regs. If it is his goal, then it helps with that. Now summary: It would seem to me that: - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it might make a positive difference in two out of three attack scenarios. Objections? - randomization of stack top is only worth doing in ptrace-blocked scenario. Do we have such scenarios left that people care about? Because if we do, then we know that there is a real attack vector that we close this way, otherwise not. This is actually interesting, because we need to remember to take ptrace into our overall kernel hardening threat model (smth that at least I haven't quite realized before) and evaluate every new feature (especially randomization ones) being robust against ptrace probing. 
- randomization after pt_regs only would make a difference in attack scenario "c", for which we don't yet have a proof of concept exploit or technique that would work (does not guarantee that attackers don't have the exploits ready though :( ). So, if we implement this, the "justification part" for the feature would be smth like "to make it harder for future possible stack-based exploits that utilize overflows", if/when someone finds a new 'ala VLA' way of doing the controlled overflow. How do people feel about it? Is it worth having? I can work on the POC for this in the direction that Andy outlined and can provide performance impact/etc., but it is good that we understand that we cannot provide a better justification for this feature at the moment unless someone is ready to share some new exploit technique with us. Best Regards, Elena.
On Thu, Feb 14, 2019 at 8:52 AM Reshetova, Elena <elena.reshetova@intel.com> wrote: > After some thinking and discussions, let me try to summarize the options and the > security benefits of each approach based on everyone else feedback before > going any further with work. We all want only useful things in kernel, so merging > smth that provides a subtle/almost non-existing benefit is clearly not a priority. > > So, first about the protection methods. We can do randomization in two ways: > > 1. stack top itself (and location of pt_regs) > 2. relative stack positioning (offset from pt_regs). > > In addition to randomization we can do some fixup on exit, like Andy proposed: > > 3. Make sure CS always points to user on exit (for example by regs->cs |= 3), > potentially smth else can be fixed up similarly, if needed > > Here are *known* (at least to me, please shout if you know more!) possible > attacks vectors and implications on them by doing 1), 2) or 3) > > a) attacker's goal is to store some user-controlled data in pt_regs to reference it later > 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. > If discovered, then attack succeeds as of now. > 2) nothing changed for this type of attack, attack succeeds as of now > 3) nothing changed for this type of attack, attack succeeds as of now > > b) attacker's goal is to return to userland from a syscall with CS pointing to kernel > 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. > If discovered, then attack succeeds as of now. > 2) nothing changed for this type of attack, attack succeeds as of now > 3) CS changed explicitly on exit, so impossible to have this attack, *given* that the > change is done late enough and no races, sleeps, etc. are possible > > c) attacker's goal is to perform some kind of stack overflow into parts of adjusted stack > memory via some method. Here the main unknown is the "method". 
> This vector of attack is the challenge for the current exploit writers: I guess if you can do > it now with all the current protections for stack enabled, you get a nice defcon talk at least :) > VLA used to be an easy way of doing it, hopefully they are gone from the main source now > (out of tree drivers is a different story). > Uninitialized locals might be other way, but there is work ongoing to close this (see Kees's > patches on stackinit). > Smth else we don’t yet know? > Now back to our proposed countermeasures given that attacker has found a way to do > a crafted overflow and overwrite: > > 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. > If discovered, then attack succeeds as of now. > 2) relative stack offset is not predictable and randomized, cannot be probed very easily via > cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non- > deterministic now. > 3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS > in adjusted pt_regs. If it is his goal, then it helps with that. > > > Now summary: > > It would seem to me that: > > - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it > might make a positive difference in two out of three attack scenarios. Objections? > > - randomization of stack top is only worth doing in ptrace-blocked scenario. > Do we have such scenarios left that people care about? There are some, I think; for example, Chrome renderers on desktop Linux can't use ptrace.
On Fri, Feb 8, 2019 at 6:26 AM Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Feb 08, 2019 at 01:20:09PM +0000, Reshetova, Elena wrote: > > > On Fri, Feb 08, 2019 at 02:15:49PM +0200, Elena Reshetova wrote: > > > > > > > Why can't we change the stack offset periodically from an interrupt or > > > so, and then have every later entry use that. > > > > Hm... This sounds more complex conceptually - we cannot touch > > stack when it is in use, so we have to periodically probe for a > > good time (when process is in userspace I guess) to change it from an interrupt? > > IMO trampoline stack provides such a good clean place for doing it and we > > have stackleak there doing stack cleanup, so would make sense to keep > > these features operating together. > > The idea was to just change a per-cpu (possible per-task if you ctxsw > it) offset that is used on entry to offset the stack. > > So only entries after the change will have the updated offset, any > in-progress syscalls will continue with their current offset and will be > unaffected. These defenses tend to need randomization between syscalls to actually disrupt an attack. (Take, for example, the attack I demonstrated with uninitialized stack variables[1], which used back-to-back syscalls, one to prime the memory contents followed by one to use those values for an exploit.) (Though I would say that CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y largely closes the uninitialized variables issue.) -Kees [1] https://www.defcon.org/images/defcon-19/dc-19-presentations/Cook/DEFCON-19-Cook-Kernel-Exploitation.pdf slide 21
On Fri, Feb 8, 2019 at 8:34 AM Andy Lutomirski <luto@amacapital.net> wrote: > > On Feb 8, 2019, at 4:15 AM, Elena Reshetova <elena.reshetova@intel.com> wrote: > > This feature should make considerably harder various > > stack-based attacks that are based upon overflowing > > a kernel stack into adjusted kernel stack with a > > possibility to jump over a guard page. > > Since the stack offset is randomized upon each > > system call, it is very hard for attacker to reliably > > land in any particular place on the adjusted stack. > > > > I think we need a better justification. With VLAs gone, it should be statically impossible to overflow past a guard page. I do agree, that the stack is a much more well-defended area right now, and that the urgency for this feature is lower, but if it's easy to implement, I think we should do it. With VLAs universally gone, we can't get unbounded allocations that lead to both linear overflows and indexed overflows. But this doesn't mean a counter or index can't still go crazy -- it just means the expected stack layout will no longer be attached to it. With VMAP_STACK we stop linear overflows of the stack since we'll hit a guard page. However, this does not stop indexed overflows where we can jump the guard page. It is harder to control an indexed overflow vs a linear overflow, but it's still possible. Adding more variability to this, I think, has value in making attacks less reliable. Also, I'd note that while this is currently an x86-only implementation, it's not hard to extend to other architectures that don't already have VMAP_STACK. (And while it is the default, not all x86 builds have CONFIG_VMAP_STACK=y, though why you'd turn that off and turn this on, I'm not sure.)
On Fri, Feb 8, 2019 at 4:16 AM Elena Reshetova <elena.reshetova@intel.com> wrote: > +.macro RANDOMIZE_KSTACK_NOCLOBBER > +#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET > + PUSH_AND_CLEAR_REGS > + call randomize_kstack > + POP_REGS > +#endif > +.endm > + > [...] > @@ -268,6 +268,8 @@ syscall_return_via_sysret: > */ > STACKLEAK_ERASE_NOCLOBBER > > + RANDOMIZE_KSTACK_NOCLOBBER Probably we could extract the PUSH_AND_CLEAR_REGS and POP_REGS out here from both this and stackleak to avoid doing it twice?
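[Editorial note: Kees's suggestion amounts to sharing one register save/restore pair between the two features. A rough sketch of what the combined macro could look like; the macro and function names here are hypothetical, not from the actual patch:]

```
.macro STACK_EXIT_WORK_NOCLOBBER
#if defined(CONFIG_GCC_PLUGIN_STACKLEAK) || defined(CONFIG_RANDOMIZE_KSTACK_OFFSET)
	PUSH_AND_CLEAR_REGS
#ifdef CONFIG_GCC_PLUGIN_STACKLEAK
	call stackleak_erase
#endif
#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
	call randomize_kstack
#endif
	POP_REGS
#endif
.endm
```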
On Tue, Feb 12, 2019 at 2:16 AM Perla, Enrico <enrico.perla@intel.com> wrote: > I was somewhat fond of randomizing the pt_regs location, as that is something I could relate with in writing an exploit (handy way to load user controlled data to kernel at a known location). > > But, as Jann pointed out, that only has value in a ptrace-blocked sandbox, because the randomization offset can be leaked otherwise through ptrace PEEK/POKE and observing cache behavior. Worse, if ptrace is present, then the randomization is moot. > > Since containers seems to be going towards leaving ptrace open, I'm now wondering whether that is a good motivation at all and the proposed simplified version is not just better. It does seem that using a flaw to attack one's own registers is rather pointless. Maybe we'll eat our words, but for now, I'd agree. > That's all fair. What I struggle with is finding a precise motivation for the randomization (granted this might be extended to other KASLR cases, so perhaps is not a strong hard stop). > > The proposed randomization does fit the overall KASLR story and it does its job of not letting an attacker predict future stack offset from one leak, but in practical terms I'm struggling to find a case or two where this would have made a difference in an exploit. > > Can any of you think of some? As you know, exploits tend to be written using the easiest-possible-path to attack, so prior to VMAP_STACK, thread_info moving, and VLA removal, attacks would use those properties. However, looking at something like half-nelson[1], an attack may just probe to find the distance between stacks and use a confused index to jump over guard pages, etc. But if that calculation is disrupted at every syscall, reliability goes way down. (Which, BTW, is likely why stack randomization might be better to do at syscall entry time so the offset isn't calculated and left hanging around in memory to be exposed via some other flaw before starting the next syscall.)
On Wed, Feb 13, 2019 at 11:52 PM Reshetova, Elena <elena.reshetova@intel.com> wrote: > Now back to our proposed countermeasures given that attacker has found a way to do > a crafted overflow and overwrite: > > 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. > If discovered, then attack succeeds as of now. > 2) relative stack offset is not predictable and randomized, cannot be probed very easily via > cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non- > deterministic now. > 3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS > in adjusted pt_regs. If it is his goal, then it helps with that. > > > Now summary: > > It would seem to me that: > > - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it > might make a positive difference in two out of three attack scenarios. Objections? I would agree, let's just do this. > - randomization of stack top is only worth doing in ptrace-blocked scenario. > Do we have such scenarios left that people care about? > Because if we do, then we know that there is a real attack vector that we close this way, otherwise not. > This is actually interesting, because we need to remember to take ptrace into our overall > kernel hardening threat model (smth that at least I haven't quite realized before) and evaluate every new > feature (especially randomization ones) being robust against ptrace probing. > > - randomization after pt_regs only would make a difference in attack scenario "c", for which > we don't yet have a proof of concept exploit or technique that would work (does not guarantee that > attackers don't have the exploits ready through :( ). 
> So, if we implement this, the "justification part" for the feature would be smth like "to make it > harder for future possible stack-based exploits that utilize overflows", if/when someone find a new > 'ala VLA' way of doing the controlled overflow. > How do people feel about it? Is it worth having? I can work on the POC for this in direction that Andy > outlined and can provide performance impact/etc., but it is good that we understand that we cannot > provide a better justification for this feature at the moment unless someone is ready to share some > new exploit technique with us. I think this makes sense. I do think, however, the work should be done at syscall entry, though. Thoughts?
On Wed, Feb 20, 2019 at 2:15 PM Kees Cook <keescook@chromium.org> wrote: > > On Tue, Feb 12, 2019 at 2:16 AM Perla, Enrico <enrico.perla@intel.com> wrote: > > I was somewhat fond of randomizing the pt_regs location, as that is something I could relate with in writing an exploit (handy way to load user controlled data to kernel at a known location). > > > > But, as Jann pointed out, that only has value in a ptrace-blocked sandbox, because the randomization offset can be leaked otherwise through ptrace PEEK/POKE and observing cache behavior. Worse, if ptrace is present, then the randomization is moot. > > > > Since containers seems to be going towards leaving ptrace open, I'm now wondering whether that is a good motivation at all and the proposed simplified version is not just better. > > It does seem that using a flaw to attack one's own registers is rather > pointless. Maybe we'll eat our words, but for now, I'd agree. > > > That's all fair. What I struggle with is finding a precise motivation for the randomization (granted this might be extended to other KASLR cases, so perhaps is not a strong hard stop). > > > > The proposed randomization does fit the overall KASLR story and it does its job of not letting an attacker predict future stack offset from one leak, but in practical terms I'm struggling to find a case or two where this would have made a difference in an exploit. > > > > Can any of you think of some? > > As you know, exploits tend to be written using the > easiest-possible-path to attack, so prior to VMAP_STACK, thread_info > moving, and VLA removal, attacks would use those properties. However, > looking at something like half-nelson[1], an attack may just probe to > find the distance between stacks and using a confused index to jump > over guard pages, etc. But if that calculation is disrupted at every > syscall, reliability goes way down. 
(Which, BTW, is likely why stack > randomization might be better to do at syscall entry time so the > offset isn't calculated and left hanging around in memory to be > exposed via some other flaw before starting the next syscall.) BTW, the attack that inspired grsecurity's RANDKSTACK is described in these slides (lots of steps, see slide 79): https://www.slideshare.net/scovetta/stackjacking
> On Feb 20, 2019, at 2:20 PM, Kees Cook <keescook@chromium.org> wrote: > > On Wed, Feb 13, 2019 at 11:52 PM Reshetova, Elena > <elena.reshetova@intel.com> wrote: >> Now back to our proposed countermeasures given that attacker has found a way to do >> a crafted overflow and overwrite: >> >> 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. >> If discovered, then attack succeeds as of now. >> 2) relative stack offset is not predictable and randomized, cannot be probed very easily via >> cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non- >> deterministic now. >> 3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS >> in adjusted pt_regs. If it is his goal, then it helps with that. >> >> >> Now summary: >> >> It would seem to me that: >> >> - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it >> might make a positive difference in two out of three attack scenarios. Objections? > > I would agree, let's just do this. Thinking slightly more about this, it’s an incomplete protection. It keeps an attacker from returning to kernel mode, but it does not protect the privileged flag bits. I think that IOPL is the only thing we really care about, and doing anything useful about IOPL would be rather more complex, unfortunately. I suppose we could just zero it and guard that with a static branch that is switched off the first time anyone uses iopl(3). I suppose we could also add a config option to straight-up disable IOPL. I sincerely hope that no one uses it any more. Even the small number of semi-legit users really ought to be using ioperm() instead. > >> - randomization of stack top is only worth doing in ptrace-blocked scenario. >> Do we have such scenarios left that people care about? >> Because if we do, then we know that there is a real attack vector that we close this way, otherwise not. 
>> This is actually interesting, because we need to remember to take ptrace into our overall >> kernel hardening threat model (smth that at least I haven't quite realized before) and evaluate every new >> feature (especially randomization ones) being robust against ptrace probing. >> >> - randomization after pt_regs only would make a difference in attack scenario "c", for which >> we don't yet have a proof of concept exploit or technique that would work (does not guarantee that >> attackers don't have the exploits ready through :( ). >> So, if we implement this, the "justification part" for the feature would be smth like "to make it >> harder for future possible stack-based exploits that utilize overflows", if/when someone find a new >> 'ala VLA' way of doing the controlled overflow. >> How do people feel about it? Is it worth having? I can work on the POC for this in direction that Andy >> outlined and can provide performance impact/etc., but it is good that we understand that we cannot >> provide a better justification for this feature at the moment unless someone is ready to share some >> new exploit technique with us. > > I think this make sense. I do think, however, the work should be done > at syscall entry, though. Thoughts? > > Seems reasonable to me.
> > It does seem that using a flaw to attack one's own registers is rather > pointless. Maybe we'll eat our words, but for now, I'd agree. > You don't attack your own registers, you use them to load controlled data to the kernel and emulate structures or similar at any stage of an exploit, bypassing SMAP and co.
On Thu, Feb 21, 2019 at 7:38 AM Andy Lutomirski <luto@kernel.org> wrote: > > On Feb 20, 2019, at 2:20 PM, Kees Cook <keescook@chromium.org> wrote: > > > > On Wed, Feb 13, 2019 at 11:52 PM Reshetova, Elena > > <elena.reshetova@intel.com> wrote: > >> Now back to our proposed countermeasures given that attacker has found a way to do > >> a crafted overflow and overwrite: > >> > >> 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. > >> If discovered, then attack succeeds as of now. > >> 2) relative stack offset is not predictable and randomized, cannot be probed very easily via > >> cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non- > >> deterministic now. > >> 3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS > >> in adjusted pt_regs. If it is his goal, then it helps with that. > >> > >> > >> Now summary: > >> > >> It would seem to me that: > >> > >> - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it > >> might make a positive difference in two out of three attack scenarios. Objections? > > > > I would agree, let's just do this. > > Thinking slightly more about this, it’s an incomplete protection. It > keeps an attacker from returning to kernel mode, but it does not > protect the privileged flag bits. I think that IOPL is the only thing > we really care about, and doing anything useful about IOPL would be > rather more complex, unfortunately. I suppose we could just zero it > and guard that with a static branch that is switched off the first > time anyone uses iopl(3). > > I suppose we could also add a config option to straight-up disable > IOPL. I sincerely hope that no one uses it any more. Even the small > number of semi-legit users really ought to be using ioperm() instead. /me raises hand. 
iopl(3) is useful for making CLI and STI work from userspace, I've used it for that (for testing stuff, not for anything that has been shipped to people). Of course, that's probably a reason to get rid of it, not to keep it. ^^
> On Feb 21, 2019, at 5:20 AM, Jann Horn <jannh@google.com> wrote: > > On Thu, Feb 21, 2019 at 7:38 AM Andy Lutomirski <luto@kernel.org> wrote: >>> On Feb 20, 2019, at 2:20 PM, Kees Cook <keescook@chromium.org> wrote: >>> >>> On Wed, Feb 13, 2019 at 11:52 PM Reshetova, Elena >>> <elena.reshetova@intel.com> wrote: >>>> Now back to our proposed countermeasures given that attacker has found a way to do >>>> a crafted overflow and overwrite: >>>> >>>> 1) pt_regs is not predictable, but can be discovered in ptrace-style scenario or cache-probing. >>>> If discovered, then attack succeeds as of now. >>>> 2) relative stack offset is not predictable and randomized, cannot be probed very easily via >>>> cache or ptrace. So, this is an additional hurdle on the attacker's way since stack is non- >>>> deterministic now. >>>> 3) nothing changed for this type of attack, given that attacker's goal is not to overwrite CS >>>> in adjusted pt_regs. If it is his goal, then it helps with that. >>>> >>>> >>>> Now summary: >>>> >>>> It would seem to me that: >>>> >>>> - regs->cs |= 3 on exit is a thing worth doing anyway, just because it is cheap, as Andy said, and it >>>> might make a positive difference in two out of three attack scenarios. Objections? >>> >>> I would agree, let's just do this. >> >> Thinking slightly more about this, it’s an incomplete protection. It >> keeps an attacker from returning to kernel mode, but it does not >> protect the privileged flag bits. I think that IOPL is the only thing >> we really care about, and doing anything useful about IOPL would be >> rather more complex, unfortunately. I suppose we could just zero it >> and guard that with a static branch that is switched off the first >> time anyone uses iopl(3). >> >> I suppose we could also add a config option to straight-up disable >> IOPL. I sincerely hope that no one uses it any more. Even the small >> number of semi-legit users really ought to be using ioperm() instead. > > /me raises hand. 
iopl(3) is useful for making CLI and STI work from > userspace, I've used it for that (for testing stuff, not for anything > that has been shipped to people). Of course, that's probably a reason > to get rid of it, not to keep it. ^^ I was thinking that I wouldn't even try to make this use case work correctly ;)
On Thu, Feb 21, 2019 at 1:35 AM Perla, Enrico <enrico.perla@intel.com> wrote: > > It does seem that using a flaw to attack one's own registers is rather > > pointless. Maybe we'll eat our words, but for now, I'd agree. > > > > You don't attack your own registers, you use them to load controlled data to the kernel and emulate structures or similar at any stage of an exploit, bypassing SMAP and co. Given all the rewriting of the syscall entry code over the last few years, perhaps I'm missing something. My understanding was that at syscall entry we do effectively this: - save pt_regs - clear all registers not needed for a syscall - run syscall - restore pt_regs (excepting syscall return value) I didn't think pt_regs got used during the syscall? In looking more closely, I see some current_pt_regs() in some paths, but again: what's the attack you're thinking of that isn't directly overlapped with existing control over registers at entry or via ptrace?
In many kernel exploits one needs to emulate structures (or provide some controlled data) to keep things stable, do a second stage, etc. Old school approach was to dereference to userland. Now, with SMAP (or any other dereference protection), that cannot be done anymore. If I have a stack address leak, then I have a pretty nice primitive through pt_regs to load some arbitrary content at a known address. As discussed with Jann, if I have ptrace, randomization is basically moot. I can just PTRACE_SYSCALL and time my way to the correct location. Actually, randomization could even help getting some needed alignment. So the open questions are: 1) are pt_regs considered enough of a vector to add the randomization nuisance? 2) is it worthwhile to randomize both pt_regs and the stack start location, so that ptrace doesn't leak at least the latter? I had mostly sandboxed scenarios in mind, I guess you had mostly the stackjacking case. Any variation on the above is ok, from not considering any of this worthwhile to doing just some - as long as the tradeoffs are clear (which is basically Elena's email), since randomization ends up being always a stopgap, not really a great defense. I don't have a strong opinion on any of this, especially since lots is happening to reduce/remove the leaking of kernel stack contents.
- Enrico > -----Original Message----- > From: Kees Cook [mailto:keescook@chromium.org] > Sent: Thursday, February 21, 2019 6:24 PM > To: Perla, Enrico <enrico.perla@intel.com> > Cc: Andy Lutomirski <luto@amacapital.net>; Reshetova, Elena > <elena.reshetova@intel.com>; Andy Lutomirski <luto@kernel.org>; Jann > Horn <jannh@google.com>; Peter Zijlstra <peterz@infradead.org>; kernel-hardening@lists.openwall.com; tglx@linutronix.de; mingo@redhat.com; bp@alien8.de; tytso@mit.edu > Subject: Re: [RFC PATCH] x86/entry/64: randomize kernel stack offset upon > system call > > On Thu, Feb 21, 2019 at 1:35 AM Perla, Enrico <enrico.perla@intel.com> > wrote: > > > It does seem that using a flaw to attack one's own registers is > > > rather pointless. Maybe we'll eat our words, but for now, I'd agree. > > > > > > > You don't attack your own registers, you use them to load controlled data > to the kernel and emulate structures or similar at any stage of an exploit, > bypassing SMAP and co. > > Given all the rewriting of the syscall entry code over the last few years, > perhaps I'm missing something. My understanding was that at syscall entry > we do effectively this: > > - save pt_regs > - clear all registers not needed for a syscall > - run syscall > - restore pt_regs (excepting syscall return value) > > I didn't think pt_regs got used during the syscall? In looking more closely, I > see some current_pt_regs() in some paths, but again: what's the attack > you're thinking of that isn't directly overlapped with existing control over > registers at entry or via ptrace? > > -- > Kees Cook
On Thu, Feb 21, 2019 at 9:48 AM Perla, Enrico <enrico.perla@intel.com> wrote:
> In many kernel exploits one needs to emulate structures (or provide some controlled data) to keep things stable, do a second stage, etc.
> Old school approach was to dereference to userland. Now, with SMAP (or any other dereference protection), that cannot be done anymore.

Oh, I see now: using the pt_regs storage as a place in kernel memory to have built the malicious structure. Got it.

> If I have a stack address leak, then I have a pretty nice primitive through pt_regs to load some arbitrary content at a known address.
> As discussed with Jann, if I have ptrace, randomization is basically moot. I can just PTRACE_SYSCALL and time my way to the correct location.
> Actually, randomization could even help getting some needed alignment.

Okay, I've chatted more with Jann off-list now; he's enlightened me with more examples. Thank you both for walking me through my denseness. :)

> So the open questions are:
> 1) are pt_regs considered enough of a vector to add the randomization nuisance?

In most cases, to be able to locate pt_regs at a fixed location, you already need a stack address leak. If you have such a leak, and the pt_regs location is randomized, a second thread's pt_regs could be used, held with ptrace, and prime+probe with PEEKs could figure out the offset to pt_regs. And the other thread would just use THAT pt_regs for its scratch space. So, while it'd be nice to randomize it, it seems that it would only be useful when there are no other threads and no ptrace (i.e. the sandbox situation, as you've said).

> 2) is it worthwhile to randomize both pt_regs and the stack start location, so that ptrace doesn't leak at least the latter?

Given the pile of prerequisites needed for these attacks, I would say "no" currently. In my assessment, a sandboxed environment will generally already have a reduced attack surface, so whatever needed flaws may already be unavailable.
These stack attacks tend to need a lot of pieces working together, so better to optimize this defense for the non-sandboxed case, in order to reduce the code's complexity.

> I had mostly sandboxed scenarios in mind, I guess you had mostly the stackjacking case.
>
> Any variation on the above is ok, from not considering any of this worthwhile to doing just some - as long as the tradeoffs are clear (which is basically Elena's email), since randomization ends up being always a stopgap, not really a great defense.
>
> I don't have a strong opinion on any of this, especially since lots is happening to reduce/remove the leaking of kernel stack contents.

I'm sufficiently convinced that with all the trade-offs, leaving pt_regs unrandomized is fine. If all we see is sandbox escapes using fixed-location pt_regs, well, then we can change this later. :)

--
Kees Cook
On Wed, Feb 20, 2019 at 2:53 PM Kees Cook <keescook@chromium.org> wrote:
> BTW, the attack that inspired grsecurity's RANDKSTACK is described in
> these slides (lots of steps, see slide 79):
> https://www.slideshare.net/scovetta/stackjacking

Sorry, as the PaX Team reminded me, I misremembered this. RANDKSTACK already existed; it was STACKLEAK that was created in response to this particular attack. I still think this attack is worth understanding to see what hoops must be jumped through when dealing with stack randomization (and other defenses).
> On Wed, Feb 20, 2019 at 2:53 PM Kees Cook <keescook@chromium.org> wrote:
> > BTW, the attack that inspired grsecurity's RANDKSTACK is described in
> > these slides (lots of steps, see slide 79):
> > https://www.slideshare.net/scovetta/stackjacking
>
> Sorry, as PaX Team reminded me, I misremembered this. RANDKSTACK
> already existed. It was STACKLEAK that was created in response to this
> particular attack. I still think this attack is worth understanding to
> see what hoops must be jumped through when dealing with stack
> randomization (and other defenses).

Yes, I actually went through a number of stack-based attacks, including the above, in order to understand what we are trying to protect against. If you are interested, I wrote some notes here, mainly for organizing my own thoughts and understanding:

https://docs.google.com/document/d/1h1gRuZpOjVxaaDag-MxOrASka0OEBeApQOl8OK2GIVY/edit?usp=sharing

It also has references to slide decks of the relevant attacks. I am going to update them now based on our good discussion here.

Anyhow, I am glad that we arrived at a conclusion here and I know how to proceed. So, I will start working on randomizing after pt_regs in the direction that Andy outlined.

With regards to disabling iopl(), this is a pretty separate thing. If anyone wants to run with this and submit a patch, please go ahead; I can also do it a bit later (after studying it, since I have never used it before) if no one finds the bandwidth in the meantime.

Best Regards,
Elena.
diff --git a/arch/Kconfig b/arch/Kconfig
index e1e540f..577186e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -802,6 +802,21 @@ config VMAP_STACK
 	  the stack to map directly to the KASAN shadow map using a formula
 	  that is incorrect if the stack is in vmalloc space.
 
+config HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
+	def_bool n
+	help
+	  An arch should select this symbol if it can support kernel stack
+	  offset randomization.
+
+config RANDOMIZE_KSTACK_OFFSET
+	default n
+	bool "Randomize kernel stack offset on syscall exit"
+	depends on HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET && VMAP_STACK
+	help
+	  Enable this if you want the randomize kernel stack offset upon
+	  each syscall exit. This causes kernel stack to have a randomized
+	  offset upon executing each system call.
+
 config ARCH_OPTIONAL_KERNEL_RWX
 	def_bool n
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e79..85d3849 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -134,6 +134,7 @@ config X86
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
 	select HAVE_ARCH_VMAP_STACK		if X86_64
+	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
 	select HAVE_CMPXCHG_DOUBLE
 	select HAVE_CMPXCHG_LOCAL
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 25e5a6b..d644f72 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -337,6 +337,14 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+.macro RANDOMIZE_KSTACK_NOCLOBBER
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	PUSH_AND_CLEAR_REGS
+	call randomize_kstack
+	POP_REGS
+#endif
+.endm
+
 #endif /* CONFIG_X86_64 */
 
 .macro STACKLEAK_ERASE
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b..0031887 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -23,6 +23,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/nospec.h>
 #include <linux/uprobes.h>
+#include <linux/random.h>
 #include <linux/livepatch.h>
 #include <linux/syscalls.h>
 
@@ -294,6 +295,26 @@ __visible void do_syscall_64(unsigned long nr, struct pt_regs *regs)
 }
 #endif
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+__visible void randomize_kstack(void)
+{
+	unsigned long r_offset, new_top, stack_bottom;
+
+	if (current->stack_start != 0) {
+
+		r_offset = get_random_long();
+		r_offset &= 0xFFUL;
+		r_offset <<= 4;
+		stack_bottom = (unsigned long)task_stack_page(current);
+
+		new_top = stack_bottom + THREAD_SIZE - 0xFF0UL;
+		new_top += r_offset;
+		this_cpu_write(cpu_current_top_of_stack, new_top);
+		current->stack_start = new_top;
+	}
+}
+#endif
+
 #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
 /*
  * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb..ae9d370 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -268,6 +268,8 @@ syscall_return_via_sysret:
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
+	RANDOMIZE_KSTACK_NOCLOBBER
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
@@ -630,6 +632,8 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 */
 	STACKLEAK_ERASE_NOCLOBBER
 
+	RANDOMIZE_KSTACK_NOCLOBBER
+
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	/* Restore RDI. */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 071b2a6..dad09f2 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -569,10 +569,15 @@ static inline unsigned long current_top_of_stack(void)
 
 static inline bool on_thread_stack(void)
 {
+	/* this might need adjustment to a more fine-grained comparison
+	 * we want a condition like
+	 * "< current_top_of_stack() - task_stack_page(current)"
+	 */
 	return (unsigned long)(current_top_of_stack() -
 			       current_stack_pointer) < THREAD_SIZE;
 }
 
+
 #ifdef CONFIG_PARAVIRT_XXL
 #include <asm/paravirt.h>
 #else
@@ -829,12 +834,16 @@ static inline void spin_lock_prefetch(const void *x)
 
 #define task_top_of_stack(task) ((unsigned long)(task_pt_regs(task) + 1))
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+#define task_pt_regs(task) ((struct pt_regs *)(task_ptregs(task)))
+#else
 #define task_pt_regs(task) \
-({									\
+({									\
 	unsigned long __ptr = (unsigned long)task_stack_page(task);	\
-	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
-	((struct pt_regs *)__ptr) - 1;					\
+	__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;		\
+	((struct pt_regs *)__ptr) - 1;					\
 })
+#endif
 
 #ifdef CONFIG_X86_32
 /*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 2b58864..030ee15 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -33,7 +33,7 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info)
 {
 	unsigned long *begin = task_stack_page(task);
-	unsigned long *end   = task_stack_page(task) + THREAD_SIZE;
+	unsigned long *end   = (unsigned long *)task_top_of_stack(task);
 
 	if (stack < begin || stack >= end)
 		return false;
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 0469cd0..3f03b79 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -43,7 +43,7 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 		return;
 
 	if (regs->sp >= curbase + sizeof(struct pt_regs) + STACK_TOP_MARGIN &&
-	    regs->sp <= curbase + THREAD_SIZE)
+	    regs->sp <= task_top_of_stack(current))
 		return;
 
 	irq_stack_top = (u64)this_cpu_ptr(irq_stack_union.irq_stack) +
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 7d31192..f30485a 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -819,7 +819,7 @@ unsigned long get_wchan(struct task_struct *p)
 	 * We need to read FP and IP, so we need to adjust the upper
 	 * bound by another unsigned long.
 	 */
-	top = start + THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
+	top = task_top_of_stack(p);
 	top -= 2 * sizeof(unsigned long);
 	bottom = start;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 291a9bd..8e748e4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -605,6 +605,9 @@ struct task_struct {
 	randomized_struct_fields_start
 
 	void				*stack;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	unsigned long			stack_start;
+#endif
 	atomic_t			usage;
 	/* Per task flags (PF_*), defined further below: */
 	unsigned int			flags;
diff --git a/include/linux/sched/task_stack.h b/include/linux/sched/task_stack.h
index 6a84192..229c434 100644
--- a/include/linux/sched/task_stack.h
+++ b/include/linux/sched/task_stack.h
@@ -21,6 +21,22 @@ static inline void *task_stack_page(const struct task_struct *task)
 	return task->stack;
 }
 
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+static inline void *task_ptregs(const struct task_struct *task)
+{
+	unsigned long __ptr;
+
+	if (task->stack_start == 0) {
+		__ptr = (unsigned long)task_stack_page(task);
+		__ptr += THREAD_SIZE - TOP_OF_KERNEL_STACK_PADDING;
+		return ((struct pt_regs *)__ptr) - 1;
+	}
+
+	__ptr = task->stack_start;
+	return ((struct pt_regs *)__ptr) - 1;
+}
+#endif
+
 #define setup_thread_stack(new,old)	do { } while(0)
 
 static inline unsigned long *end_of_stack(const struct task_struct *task)
@@ -82,7 +98,7 @@ static inline int object_is_on_stack(const void *obj)
 {
 	void *stack = task_stack_page(current);
 
-	return (obj >= stack) && (obj < (stack + THREAD_SIZE));
+	return (obj >= stack) && (obj < ((void *)task_top_of_stack(current)));
 }
 
 extern void thread_stack_cache_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff..8eccf94 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -422,6 +422,9 @@ static void release_task_stack(struct task_struct *tsk)
 	tsk->stack = NULL;
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = NULL;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+#endif
 #endif
 }
 
@@ -863,6 +866,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->stack = stack;
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = stack_vm_area;
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+	tsk->stack_start = (unsigned long)task_top_of_stack(tsk);
+#endif
 #endif
 #ifdef CONFIG_THREAD_INFO_IN_TASK
 	atomic_set(&tsk->stack_refcount, 1);
@@ -922,6 +929,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 free_stack:
 	free_thread_stack(tsk);
+#ifdef CONFIG_RANDOMIZE_KSTACK_OFFSET
+	tsk->stack_start = 0;
+#endif
 free_tsk:
 	free_task_struct(tsk);
 	return NULL;
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 877de4f..e52c76f 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -1572,7 +1572,7 @@ static void kmemleak_scan(void)
 	do_each_thread(g, p) {
 		void *stack = try_get_task_stack(p);
 		if (stack) {
-			scan_block(stack, stack + THREAD_SIZE, NULL);
+			scan_block(stack, task_top_of_stack(p), NULL);
 			put_task_stack(p);
 		}
 	} while_each_thread(g, p);
diff --git a/mm/usercopy.c b/mm/usercopy.c
index 852eb4e..4b07542 100644
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -37,7 +37,7 @@ static noinline int check_stack_object(const void *obj, unsigned long len)
 {
 	const void * const stack = task_stack_page(current);
-	const void * const stackend = stack + THREAD_SIZE;
+	const void * const stackend = (void *)task_top_of_stack(current);
 	int ret;
 
 	/* Object is not on the stack at all. */
If CONFIG_RANDOMIZE_KSTACK_OFFSET is selected, the kernel stack offset is randomized upon each exit from a system call via the trampoline stack.

This feature is based on the original idea from PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
All credit for the original idea goes to the PaX team. However, the implementation of RANDOMIZE_KSTACK_OFFSET differs greatly from the RANDKSTACK feature (see below).

Reasoning for the feature:

This feature should make various stack-based attacks considerably harder, in particular those based on overflowing a kernel stack into an adjacent kernel stack with the possibility of jumping over a guard page. Since the stack offset is randomized upon each system call, it is very hard for an attacker to reliably land in any particular place on the adjusted stack.

Design description:

During most of the kernel's execution, it runs on the "thread stack", which is allocated in fork.c/dup_task_struct() and stored in a per-task variable (tsk->stack). Since the stack grows downwards, the stack top can always be calculated using the task_top_of_stack(tsk) function, which essentially returns the address of tsk->stack + stack size. When VMAP_STACK is enabled, the thread stack is allocated from vmalloc space.

The thread stack is pretty deterministic in its structure: it is fixed in size, and upon every entry from userspace to the kernel on a syscall, the thread stack starts being constructed from an address fetched from the per-cpu cpu_current_top_of_stack variable. This variable is required because there is no way to reference "current" from the kernel entry/exit code, so the value of task_top_of_stack(tsk) is "shadowed" in a per-cpu variable each time the kernel context-switches to a new task.

The RANDOMIZE_KSTACK_OFFSET feature works by randomizing the value of task_top_of_stack(tsk) every time a process exits from a syscall.
As a result, the thread stack for that process will be constructed from a random offset from the fixed tsk->stack + stack size value upon the subsequent syscall.

Since the kernel always exits (IRET / SYSRET) from a per-cpu "trampoline stack", this provides a safe place for modifying the value of cpu_current_top_of_stack, because the thread stack is no longer in use at that point.

There is only one small issue: currently the thread stack top is never stored in a per-task variable, but is always calculated as needed via task_top_of_stack(tsk) and the existing tsk->stack value (essentially relying on its fixed size and structure). So we need to create a new per-task variable, tsk->stack_start, that stores the newly calculated random value for the thread stack top. Together with the value of cpu_current_top_of_stack, tsk->stack_start is also updated when leaving kernel space from the trampoline stack, so that it can be used by the scheduler to correctly "shadow" cpu_current_top_of_stack upon a task switch.

Impact on the kernel thread stack size:

Since the current version does not allocate any additional pages for the thread stack, it shifts the cpu_current_top_of_stack value randomly between 000 .. FF0 (or 00 .. F0 if only 4 bits are randomized). So, in the worst case (random offsets FF0/F0), the actual usable stack size is 12304/16144 bytes.
Performance impact:

All measurements were done on an Intel Kaby Lake i7-8550U, 16GB RAM.

1) hackbench -s 4096 -l 2000 -g 15 -f 25 -P
   base:          Time: 12.243
   random_offset: Time: 13.411

2) kernel build time (as one example of real-world load):
   base:          user 299m20,348s; sys 21m39,047s
   random_offset: user 300m19,759s; sys 20m48,173s

3) perf on a fopen/fclose loop, 1000000000 times
   (the perf values below still differ somewhat between runs, so I don't
   consider them very representative, apart from obviously showing the
   big impact of using get_random_u64()):

base:
   8.46%  time  [kernel.kallsyms]  [k] crc32c_pcl_intel_update
   4.77%  time  [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
   4.14%  time  [kernel.kallsyms]  [k] fsnotify
   3.94%  time  [kernel.kallsyms]  [k] _raw_spin_lock
   2.48%  time  [kernel.kallsyms]  [k] syscall_return_via_sysret
   2.42%  time  [kernel.kallsyms]  [k] entry_SYSCALL_64
   2.28%  time  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
   2.07%  time  [kernel.kallsyms]  [k] inotify_handle_event

random_offset:
   8.35%  time  [kernel.kallsyms]  [k] crc32c_pcl_intel_update
   5.61%  time  [kernel.kallsyms]  [k] get_random_u64
   4.88%  time  [kernel.kallsyms]  [k] ext4_mark_iloc_dirty
   3.08%  time  [kernel.kallsyms]  [k] _raw_spin_lock
   2.98%  time  [kernel.kallsyms]  [k] fsnotify
   2.73%  time  [kernel.kallsyms]  [k] syscall_return_via_sysret
   2.45%  time  [kernel.kallsyms]  [k] entry_SYSCALL_64
   1.87%  time  [kernel.kallsyms]  [k] __ext4_get_inode_loc
   1.65%  time  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave

Comparison to the grsecurity RANDKSTACK feature:

The basic idea is taken from RANDKSTACK: randomization of cpu_current_top_of_stack is performed within the existing 4 pages of memory allocated for the thread stack. No additional pages are allocated.

This patch introduces 8 bits of randomness (bits 4 - 11 are randomized, bits 0 - 3 must be zero due to stack alignment) to the kernel stack top. The very old grsecurity patch I checked had only 4 bits of randomization for x86-64.
This patch also works with that little randomness; we only have to decide how much stack space we wish/can trade for security.

Notable differences from RANDKSTACK:

- x86_64 only, since this does not make sense without vmap-based stack
  allocation, which provides guard pages, and the latter is only
  implemented for x86-64.

- randomization is performed on the trampoline stack upon system call
  exit.

- random bits are taken from get_random_long() instead of rdtsc() for
  better randomness. This however has a big performance impact (see the
  numbers above), and additionally, if we happen to hit a point when the
  generator needs to be reseeded, we might have an issue. An alternative
  would be to make this feature dependent on CONFIG_RANDOM_TRUST_CPU,
  which can solve some of these issues, but I doubt all of them. Of
  course, rdtsc() can be a fallback if there is no way to make calls for
  proper randomness from the trampoline stack.

- instead of storing the actual top of the stack in task->thread.sp0
  (which does not exist on x86-64), a new unsigned long variable,
  stack_start, is created in the task struct, and key stack functions,
  like task_pt_regs, are updated to use it when available.

- instead of preserving the set of registers that are used within the
  randomization function, the current version uses a
  PUSH_AND_CLEAR_REGS/POP_REGS combination similar to STACKLEAK. It
  would seem that we can get away with only preserving rax, rdx, rbx,
  rsi and rdi, but I am not sure how stable this is in the long run.

Future work possibilities:

- One could do a version where we allocate an additional page for each
  kernel stack and then employ proper randomization. This could be a
  stricter config option, for example.

- Alternatively, one could normally allocate only 4 pages of stack and
  allocate an additional page if the stack plus randomized offset grows
  beyond 4 pages (which only happens for big call chains).
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
---
 arch/Kconfig                     | 15 +++++++++++++++
 arch/x86/Kconfig                 |  1 +
 arch/x86/entry/calling.h         |  8 ++++++++
 arch/x86/entry/common.c          | 21 +++++++++++++++++++++
 arch/x86/entry/entry_64.S        |  4 ++++
 arch/x86/include/asm/processor.h | 15 ++++++++++++---
 arch/x86/kernel/dumpstack.c      |  2 +-
 arch/x86/kernel/irq_64.c         |  2 +-
 arch/x86/kernel/process.c        |  2 +-
 include/linux/sched.h            |  3 +++
 include/linux/sched/task_stack.h | 18 +++++++++++++++++-
 kernel/fork.c                    | 10 ++++++++++
 mm/kmemleak.c                    |  2 +-
 mm/usercopy.c                    |  2 +-
 14 files changed, 96 insertions(+), 9 deletions(-)