[v2,0/5] Optionally randomize kernel stack offset each syscall

Message ID 20200324203231.64324-1-keescook@chromium.org (mailing list archive)

Message

Kees Cook March 24, 2020, 8:32 p.m. UTC
Hi,

This is a continuation and refactoring of Elena's earlier effort to add
kernel stack base offset randomization. In the time since the previous
discussions, two attacks[1][2] were made public that depended on stack
determinism, so we're no longer in the position of "this is a good idea
but we have no examples of attacks". :)

Earlier discussions also devolved into debates on entropy sources, which
is mostly a red herring, given the already low entropy available due
to stack size. Regardless, entropy can be changed/improved separately
from this series as needed.
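
As a rough illustration of why the available entropy is small: the offset
has to stay a small fraction of the kernel stack, and the resulting stack
pointer still has to respect the architecture's alignment. The sketch below
counts the distinct positions that survive. The function name, and the 0xFF
mask and 16-byte alignment used in the note after it, are assumptions for
illustration, not values taken from this series.

```c
#include <stdint.h>

/*
 * Hypothetical helper: how many bits of entropy remain when a random
 * offset is limited to [0, mask] and then rounded down to a multiple
 * of `align` bytes. Assumes align is non-zero.
 */
unsigned kstack_entropy_bits(uint32_t mask, uint32_t align)
{
	/* Offsets 0..mask collapse onto this many aligned positions. */
	uint32_t positions = mask / align + 1;
	unsigned bits = 0;

	/* floor(log2(positions)) */
	while (positions >>= 1)
		bits++;
	return bits;
}
```

Under those assumed values, kstack_entropy_bits(0xFF, 16) gives 4 bits, and
widening the mask to 0x3FF gives 6.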

Earlier discussions also got stuck debating how much syscall overhead
was too much, but this is also a red herring since the feature itself
needs to be selectable at boot with no cost for those that don't want it:
this is solved here with static branches.

So, here is an improved version, made as arch-agnostic as possible,
with usage added for x86 and arm64. It also includes some small static
branch cleanups.

-Kees

[1] https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
[2] https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf

v2:
- move to per-cpu rdtsc() saved on syscall exit
- add static branches for zero-cost dynamic enabling
- Kconfig just selects the default state of static branch
- __builtin_alloca() produces ugly asm without -fno-stack-clash-protection
- made arch agnostic
rfc: https://lore.kernel.org/kernel-hardening/20190329081358.30497-1-elena.reshetova@intel.com/
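
The v2 items above can be modeled in userspace, as a sketch only: a value
saved on "syscall exit" is masked and handed to __builtin_alloca() on the
next "syscall entry", and a flag stands in for the static branch. Every
identifier and the 0xFF mask are hypothetical stand-ins, not the names used
in the patches, and the code assumes GCC or Clang (for the builtins and the
inline asm).

```c
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical stand-ins, not the series' identifiers. */
#define KSTACK_OFFSET_MASK 0xFFu	/* assumed cap on the offset */

bool kstack_randomize = true;		/* models the static branch */
uint32_t kstack_saved_offset;		/* models the per-cpu saved value */

/* Stands in for the real syscall body; reports where its frame landed. */
static __attribute__((noinline)) uintptr_t syscall_body(void)
{
	return (uintptr_t)__builtin_frame_address(0);
}

/* Entry: shift the stack by the saved offset before running the body. */
__attribute__((noinline)) uintptr_t model_syscall_entry(void)
{
	volatile uintptr_t where;

	if (kstack_randomize) {
		char *pad = __builtin_alloca(kstack_saved_offset &
					     KSTACK_OFFSET_MASK);
		/* Empty asm keeps the compiler from discarding the allocation. */
		__asm__ volatile("" : : "r"(pad) : "memory");
	}

	/* The volatile store keeps this call out of tail position. */
	where = syscall_body();
	return where;
}

/* Exit: store a fresh value (the series reads rdtsc()) for the next entry. */
void model_syscall_exit(uint32_t fresh)
{
	kstack_saved_offset = fresh;
}
```

Calling model_syscall_exit() with different values between two calls to
model_syscall_entry() moves the frame of syscall_body(); with
kstack_randomize cleared, repeated calls from the same caller land at the
same address. In the kernel the flag would be a static branch, so the
disabled case costs a patched-out jump rather than a load and compare.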

Kees Cook (5):
  jump_label: Provide CONFIG-driven build state defaults
  init_on_alloc: Unpessimize default-on builds
  stack: Optionally randomize kernel stack offset each syscall
  x86/entry: Enable random_kstack_offset support
  arm64: entry: Enable random_kstack_offset support

 Makefile                         |  4 ++++
 arch/Kconfig                     | 19 +++++++++++++++
 arch/arm64/Kconfig               |  1 +
 arch/arm64/kernel/syscall.c      | 10 ++++++++
 arch/x86/Kconfig                 |  1 +
 arch/x86/entry/common.c          | 12 +++++++++-
 include/linux/jump_label.h       | 19 +++++++++++++++
 include/linux/mm.h               | 18 +++++---------
 include/linux/randomize_kstack.h | 40 ++++++++++++++++++++++++++++++++
 init/main.c                      | 23 ++++++++++++++++++
 mm/page_alloc.c                  | 12 ++--------
 11 files changed, 136 insertions(+), 23 deletions(-)
 create mode 100644 include/linux/randomize_kstack.h

Comments

Jann Horn March 24, 2020, 9:28 p.m. UTC | #1
On Tue, Mar 24, 2020 at 9:32 PM Kees Cook <keescook@chromium.org> wrote:
> This is a continuation and refactoring of Elena's earlier effort to add
> kernel stack base offset randomization. In the time since the previous
> discussions, two attacks[1][2] were made public that depended on stack
> determinism, so we're no longer in the position of "this is a good idea
> but we have no examples of attacks". :)
[...]
> [1] https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html

This one only starts using the stack's location after having parsed
it out of dmesg (which in any environment that wants to provide a
reasonable level of security really ought to be restricted to root),
right? If you give people read access to dmesg, they can leak all
sorts of pointers; not just the stack pointer, but also whatever else
happens to be in the registers at that point - which is likely to give
the attacker more ways to place controlled data at a known location.
See e.g. <https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html>,
which leaks the pointer to a BPF map out of dmesg.

Also, are you sure that it isn't possible to make the syscall that
leaked its stack pointer never return to userspace (via ptrace or
SIGSTOP or something like that), and therefore never realign its
stack, while keeping some controlled data present on the syscall's
stack?

> [2] https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf

That's a moderately large document; which specific part are you referencing?

Kees Cook March 24, 2020, 11:07 p.m. UTC | #2
[-enrico, who is bouncing]

On Tue, Mar 24, 2020 at 10:28:35PM +0100, Jann Horn wrote:
> On Tue, Mar 24, 2020 at 9:32 PM Kees Cook <keescook@chromium.org> wrote:
> > This is a continuation and refactoring of Elena's earlier effort to add
> > kernel stack base offset randomization. In the time since the previous
> > discussions, two attacks[1][2] were made public that depended on stack
> > determinism, so we're no longer in the position of "this is a good idea
> > but we have no examples of attacks". :)
> [...]
> > [1] https://a13xp0p0v.github.io/2020/02/15/CVE-2019-18683.html
> 
> This one only starts using the stack's location after having parsed
> it out of dmesg (which in any environment that wants to provide a
> reasonable level of security really ought to be restricted to root),
> right? If you give people read access to dmesg, they can leak all
> sorts of pointers; not just the stack pointer, but also whatever else
> happens to be in the registers at that point - which is likely to give
> the attacker more ways to place controlled data at a known location.
> See e.g. <https://googleprojectzero.blogspot.com/2018/09/a-cache-invalidation-bug-in-linux.html>,
> which leaks the pointer to a BPF map out of dmesg.

It was mentioned that it would re-use the base across syscalls, so this
defense would have frustrated it.

More to my point was that there still are attacks using a deterministic
stack as part of the exploit chain. We have a low-cost way to make that
go away.

> Also, are you sure that it isn't possible to make the syscall that
> leaked its stack pointer never return to userspace (via ptrace or
> SIGSTOP or something like that), and therefore never realign its
> stack, while keeping some controlled data present on the syscall's
> stack?
> 
> > [2] https://repositorio-aberto.up.pt/bitstream/10216/125357/2/374717.pdf
> 
> That's a moderately large document; which specific part are you referencing?

IIRC, section 3.3 discusses using the stack for CFI bypass, though
thinking about it again, it may have been targeting pt_regs. I'll
double check and remove this reference if that's the case.

But, as I mention, this is proactive and I'd like to stop yet more
things from being able to depend on the stack location.

Reshetova, Elena March 25, 2020, 12:15 p.m. UTC | #3
> > Also, are you sure that it isn't possible to make the syscall that
> > leaked its stack pointer never return to userspace (via ptrace or
> > SIGSTOP or something like that), and therefore never realign its
> > stack, while keeping some controlled data present on the syscall's
> > stack?

How would you reliably detect that a stack pointer has been leaked
to userspace while it has been in a syscall? Does not seem to be a trivial
task to me. 

Best Regards,
Elena.

Kees Cook March 25, 2020, 8:27 p.m. UTC | #4
On Wed, Mar 25, 2020 at 12:15:12PM +0000, Reshetova, Elena wrote:
> > > Also, are you sure that it isn't possible to make the syscall that
> > > leaked its stack pointer never return to userspace (via ptrace or
> > > SIGSTOP or something like that), and therefore never realign its
> > > stack, while keeping some controlled data present on the syscall's
> > > stack?
> 
> How would you reliably detect that a stack pointer has been leaked
> to userspace while it has been in a syscall? Does not seem to be a trivial
> task to me. 

Well, my expectation is that folks using this defense are also using
panic_on_warn sysctl, etc, so attackers don't get a chance to actually
_use_ register values spilled to dmesg.

Jann Horn March 25, 2020, 11:20 p.m. UTC | #5
On Wed, Mar 25, 2020 at 9:27 PM Kees Cook <keescook@chromium.org> wrote:
> On Wed, Mar 25, 2020 at 12:15:12PM +0000, Reshetova, Elena wrote:
> > > > Also, are you sure that it isn't possible to make the syscall that
> > > > leaked its stack pointer never return to userspace (via ptrace or
> > > > SIGSTOP or something like that), and therefore never realign its
> > > > stack, while keeping some controlled data present on the syscall's
> > > > stack?
> >
> > How would you reliably detect that a stack pointer has been leaked
> > to userspace while it has been in a syscall? Does not seem to be a trivial
> > task to me.
>
> Well, my expectation is that folks using this defense are also using
> panic_on_warn sysctl, etc, so attackers don't get a chance to actually
> _use_ register values spilled to dmesg.

Uh... I thought that thing was exclusively for stuff like syzkaller,
because nuking the entire system because of a WARN is far too
excessive? WARNs should be safe to add almost anywhere in the kernel,
so that developers can put their assumptions about system behavior
into code without having to worry about bringing down the entire
system if that assumption turns out to have been false in some
harmless edgecase.

Also, there are other places that dump register state. In particular
the soft lockup detection, which you can IIRC easily trip even
accidentally if you play around with stuff like FUSE filesystems, or
if a disk becomes unresponsive. Sure, *theoretically* you can also set
the "panic on soft lockup" flag, but that seems like a really terrible
idea to me.

As far as I can tell, the only clean way to fix this is to tell
distros that give non-root users access to dmesg (Ubuntu in
particular) that they have to stop doing that. E.g. Debian seems to
get by just fine with root-restricted dmesg.

Kees Cook March 26, 2020, 5:18 p.m. UTC | #6
On Thu, Mar 26, 2020 at 12:20:19AM +0100, Jann Horn wrote:
> On Wed, Mar 25, 2020 at 9:27 PM Kees Cook <keescook@chromium.org> wrote:
> > On Wed, Mar 25, 2020 at 12:15:12PM +0000, Reshetova, Elena wrote:
> > > > > Also, are you sure that it isn't possible to make the syscall that
> > > > > leaked its stack pointer never return to userspace (via ptrace or
> > > > > SIGSTOP or something like that), and therefore never realign its
> > > > > stack, while keeping some controlled data present on the syscall's
> > > > > stack?
> > >
> > > How would you reliably detect that a stack pointer has been leaked
> > > to userspace while it has been in a syscall? Does not seem to be a trivial
> > > task to me.
> >
> > Well, my expectation is that folks using this defense are also using
> > panic_on_warn sysctl, etc, so attackers don't get a chance to actually
> > _use_ register values spilled to dmesg.
> 
> Uh... I thought that thing was exclusively for stuff like syzkaller,
> because nuking the entire system because of a WARN is far too
> excessive? WARNs should be safe to add almost anywhere in the kernel,
> so that developers can put their assumptions about system behavior
> into code without having to worry about bringing down the entire
> system if that assumption turns out to have been false in some
> harmless edgecase.

So, I'm caught in a tight spot between Linus's deprecation of BUG()[1],
and the desire for high-sensitivity security-oriented system builders
to have a "completely stop running that kernel thread" option. Linus's
entirely reasonable observation that BUG() destabilizes the kernel more
often than it doesn't means there isn't actually a safe "stop that kernel
thread" option, especially since many mitigations that detect badness span
a spectrum of "stops the badness before it happens" (e.g. NX memory) to
"I see badness has already happened" (e.g. stack protector). As a result,
the only option to offer security-prioritized users is to
downgrade corruptions to DoSes via panic(). I wish there was a magic
way to have a perfect kernel state unwinder to get us the BUG() we
wanted it to be, but given the kernel's complexity, it doesn't exist
(and is unlikely to be worth developing). Right now, we either get
"WARN() and keep going as best we can" or we get "WARN() and panic".

And with regard to "WARNs should be safe to add", yes, that's generally
true, but the goal is to not make them reachable from userspace because
of this need to be able to "upgrade" them to panic(). I have tried to
document[1] this:

  Note that the WARN()-family should only be used for "expected to
  be unreachable" situations. If you want to warn about "reachable
  but undesirable" situations, please use the pr_warn()-family of
  functions. System owners may have set the *panic_on_warn* sysctl,
  to make sure their systems do not continue running in the face of
  "unreachable" conditions. (For example, see commits like `this one
  <https://git.kernel.org/linus/d4689846881d160a4d12a514e991a740bcb5d65a>`_.)

[1] https://lore.kernel.org/lkml/202003141524.59C619B51A@keescook/

> Also, there are other places that dump register state. In particular
> the soft lockup detection, which you can IIRC easily trip even
> accidentally if you play around with stuff like FUSE filesystems, or
> if a disk becomes unresponsive. Sure, *theoretically* you can also set
> the "panic on soft lockup" flag, but that seems like a really terrible
> idea to me.

I understand your general objection to non-deterministic defenses,
as there will always be ways to weaken them, but I don't think that's
reason enough to not have them. I prefer to look at mitigations as a
spectrum, and to recognize that some are more effective with certain
system configurations. They become tools to choose from when building
defense in depth.

> As far as I can tell, the only clean way to fix this is to tell
> distros that give non-root users access to dmesg (Ubuntu in
> particular) that they have to stop doing that. E.g. Debian seems to
> get by just fine with root-restricted dmesg.

Totally agreed about that. Ubuntu may be hard to convince as one of
their design principles has been to make the first user able to use the
system completely with as little interruption as possible. (e.g. pop-up
confirmation dialogs are strongly discouraged, etc.)

So, for this series, I think the benefit-to-complexity value is high.
It's a simple solution even if it's not perfect (most things can't be
given the existing kernel design trade-offs).

-Kees