[RFC,0/7] x86: introduce system calls address space isolation

Message ID 1556228754-12996-1-git-send-email-rppt@linux.ibm.com (mailing list archive)
Series x86: introduce system calls address space isolation

Message

Mike Rapoport April 25, 2019, 9:45 p.m. UTC
Hi,

Address space isolation has been used to protect the kernel from
userspace and userspace programs from each other since the invention of
virtual memory.

Assuming that kernel bugs, and therefore vulnerabilities, are inevitable,
it might be worth isolating parts of the kernel to minimize the damage
these vulnerabilities can cause.

The idea here is to allow an untrusted user access to a potentially
vulnerable kernel in such a way that any kernel vulnerability they find to
exploit is either prevented, or its consequences are confined to their
isolated address space, so that the compromise attempt has minimal impact
on other tenants or on the protected structures of the monolithic kernel.
Although we hope to prevent many classes of attack, the first target we're
looking at is ROP gadget protection.

These patches implement a "system call isolation (SCI)" mechanism that
allows running system calls in an isolated address space with reduced page
tables to prevent ROP attacks.

ROP attacks involve corrupting the stack return address to redirect it to a
piece of code that is known to exist in the kernel and can be used to
perform the action needed to exploit the system.

The idea behind the prevention is that if we fault in pages in the
execution path, we can compare the target address against the kernel symbol
table.  So if we're in a function, we allow local jumps (and simply falling
off the end of a page), but if we're jumping to a new function it must be to
an external label in the symbol table.  Since ROP attacks are all about
jumping to gadget code which is effectively in the middle of real
functions, the jumps they induce are to code that doesn't have an external
symbol, so this should detect most of them.
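
As a rough sketch of that check (illustrative only: sci_verify_code_fault()
and its exact policy are made up here and are not the code from patch 2, and
the page-adjacency rule is a simplification), the page-fault-time decision
could look something like this:

#include <linux/kallsyms.h>
#include <linux/mm.h>

/*
 * Decide whether a code fetch that faulted while transferring control
 * from @from to @to should be allowed.  Hypothetical helper.
 */
static bool sci_verify_code_fault(unsigned long from, unsigned long to)
{
        unsigned long size, offset;
        char namebuf[KSYM_NAME_LEN];

        /* Local jump, or simply falling off the end of the current page. */
        if ((to >> PAGE_SHIFT) == (from >> PAGE_SHIFT) ||
            (to >> PAGE_SHIFT) == (from >> PAGE_SHIFT) + 1)
                return true;

        /* A cross-function transfer must resolve to a kernel symbol... */
        if (!kallsyms_lookup(to, &size, &offset, NULL, namebuf))
                return false;

        /* ...and land at its start, not in the middle of a real function. */
        return offset == 0;
}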

This is a very early POC: it is able to run the simple dummy system calls
and a little bit beyond that, but it is not yet stable and robust enough to
boot a system with system call isolation enabled for all system calls.
Still, we wanted to get some feedback about the concept in general as early
as possible.
 
At this time we are not suggesting any API that would enable system call
isolation. Because of the overhead required, it should only be activated
for processes or containers we know should be treated as untrusted. We
still have no actual numbers, but forcing page faults during system call
execution will surely not come for free.

One possible way is to create a namespace and force system call isolation
on all the processes in that namespace. Another option that came to mind
is to use a seccomp filter to allow fine-grained control of this feature.
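
Purely as an illustration of the seccomp idea (this series defines no
user-visible API; SECCOMP_RET_ISOLATE below is an invented return action
that does not exist in the kernel), userspace opt-in might eventually look
something like:

#include <stddef.h>
#include <stdio.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#define SECCOMP_RET_ISOLATE     0x00100000U     /* hypothetical action */

int main(void)
{
        struct sock_filter filter[] = {
                /* Load the syscall number. */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct seccomp_data, nr)),
                /* Ask for write(2) to run in the isolated context... */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ISOLATE),
                /* ...and let everything else run normally. */
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = sizeof(filter) / sizeof(filter[0]),
                .filter = filter,
        };

        /* Illustrative only: current kernels do not honor the invented action. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
                perror("prctl");

        return 0;
}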

The current implementation is pretty much x86-centric, but the general idea
can be used on other architectures.

A brief TOC of the set:
* patch 1 adds the definition of X86_FEATURE_SCI
* patch 2 is the core implementation of system call isolation (SCI)
* patches 3-5 add hooks to SCI at entry paths and in the page fault
  handler
* patch 6 enables the SCI in Kconfig
* patch 7 includes example dummy system calls that are used to
  demonstrate the SCI in action (a rough sketch of these follows below).
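
For reference, a minimal sketch of what those dummy syscalls might look
like (the actual code lives in kernel/sci-examples.c; the sci_write_dmesg
prototype and buffer size here are assumptions based on the description
later in the thread):

#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/printk.h>
#include <linux/errno.h>

/* Trivial syscall: touches almost nothing beyond the entry path. */
SYSCALL_DEFINE0(get_answer)
{
        return 42;
}

/* Copies a string from userspace and prints it to the kernel log. */
SYSCALL_DEFINE2(sci_write_dmesg, const char __user *, buf, size_t, count)
{
        char kbuf[256];

        if (count >= sizeof(kbuf))
                return -EINVAL;
        if (copy_from_user(kbuf, buf, count))
                return -EFAULT;
        kbuf[count] = '\0';

        pr_info("%s\n", kbuf);
        return count;
}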

Mike Rapoport (7):
  x86/cpufeatures: add X86_FEATURE_SCI
  x86/sci: add core implementation for system call isolation
  x86/entry/64: add infrastructure for switching to isolated syscall
    context
  x86/sci: hook up isolated system call entry and exit
  x86/mm/fault: hook up SCI verification
  security: enable system call isolation in kernel config
  sci: add example system calls to exercise SCI

 arch/x86/entry/calling.h                 |  65 ++++
 arch/x86/entry/common.c                  |  65 ++++
 arch/x86/entry/entry_64.S                |  13 +-
 arch/x86/entry/syscalls/syscall_64.tbl   |   3 +
 arch/x86/include/asm/cpufeatures.h       |   1 +
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/processor-flags.h   |   8 +
 arch/x86/include/asm/sci.h               |  55 +++
 arch/x86/include/asm/tlbflush.h          |   8 +-
 arch/x86/kernel/asm-offsets.c            |   7 +
 arch/x86/kernel/process_64.c             |   5 +
 arch/x86/mm/Makefile                     |   1 +
 arch/x86/mm/fault.c                      |  28 ++
 arch/x86/mm/init.c                       |   2 +
 arch/x86/mm/sci.c                        | 608 +++++++++++++++++++++++++++++++
 include/linux/sched.h                    |   5 +
 include/linux/sci.h                      |  12 +
 kernel/Makefile                          |   2 +-
 kernel/exit.c                            |   3 +
 kernel/sci-examples.c                    |  52 +++
 security/Kconfig                         |  10 +
 21 files changed, 956 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/include/asm/sci.h
 create mode 100644 arch/x86/mm/sci.c
 create mode 100644 include/linux/sci.h
 create mode 100644 kernel/sci-examples.c

Comments

Andy Lutomirski April 26, 2019, 12:30 a.m. UTC | #1
On Thu, Apr 25, 2019 at 2:46 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
>
> Hi,
>
> Address space isolation has been used to protect the kernel from the
> userspace and userspace programs from each other since the invention of the
> virtual memory.
>
> Assuming that kernel bugs and therefore vulnerabilities are inevitable it
> might be worth isolating parts of the kernel to minimize damage that these
> vulnerabilities can cause.
>
> The idea here is to allow an untrusted user access to a potentially
> vulnerable kernel in such a way that any kernel vulnerability they find to
> exploit is either prevented or the consequences confined to their isolated
> address space such that the compromise attempt has minimal impact on other
> tenants or the protected structures of the monolithic kernel.  Although we
> hope to prevent many classes of attack, the first target we're looking at
> is ROP gadget protection.
>
> These patches implement a "system call isolation (SCI)" mechanism that
> allows running system calls in an isolated address space with reduced page
> tables to prevent ROP attacks.
>
> ROP attacks involve corrupting the stack return address to repoint it to a
> segment of code you know exists in the kernel that can be used to perform
> the action you need to exploit the system.
>
> The idea behind the prevention is that if we fault in pages in the
> execution path, we can compare target address against the kernel symbol
> table.  So if we're in a function, we allow local jumps (and simply falling
> of the end of a page) but if we're jumping to a new function it must be to
> an external label in the symbol table.

That's quite an assumption.  The entry code at least uses .L labels.
Do you get that right?

As far as I can see, most of what's going on here has very little to
do with jumps and calls.  The benefit seems to come from making sure
that the RET instruction actually goes somewhere that's already been
faulted in.  Am I understanding right?

--Andy
Jiri Kosina April 26, 2019, 8:07 a.m. UTC | #2
On Thu, 25 Apr 2019, Andy Lutomirski wrote:

> The benefit seems to come from making sure that the RET instruction 
> actually goes somewhere that's already been faulted in.

Which doesn't seem to be really compatible with things like retpolines, or 
with anyone using FTRACE_WITH_REGS to modify the stored instruction pointer.
Dave Hansen April 26, 2019, 2:41 p.m. UTC | #3
On 4/25/19 2:45 PM, Mike Rapoport wrote:
> The idea behind the prevention is that if we fault in pages in the
> execution path, we can compare target address against the kernel symbol
> table.  So if we're in a function, we allow local jumps (and simply falling
> of the end of a page) but if we're jumping to a new function it must be to
> an external label in the symbol table.  Since ROP attacks are all about
> jumping to gadget code which is effectively in the middle of real
> functions, the jumps they induce are to code that doesn't have an external
> symbol, so it should mostly detect when they happen.

This turns the problem from: "attackers can leverage any data/code that
the kernel has mapped (anything)" to "attackers can leverage any
code/data that the current syscall has faulted in".

That seems like a pretty restrictive change.

> At this time we are not suggesting any API that will enable the system
> calls isolation. Because of the overhead required for this, it should only
> be activated for processes or containers we know should be untrusted. We
> still have no actual numbers, but surely forcing page faults during system
> call execution will not come for free.

What's the minimum number of faults that have to occur to handle the
simplest dummy syscall?
Mike Rapoport April 28, 2019, 6:01 a.m. UTC | #4
On Thu, Apr 25, 2019 at 05:30:13PM -0700, Andy Lutomirski wrote:
> On Thu, Apr 25, 2019 at 2:46 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> >
> > Hi,
> >
> > Address space isolation has been used to protect the kernel from the
> > userspace and userspace programs from each other since the invention of the
> > virtual memory.
> >
> > Assuming that kernel bugs and therefore vulnerabilities are inevitable it
> > might be worth isolating parts of the kernel to minimize damage that these
> > vulnerabilities can cause.
> >
> > The idea here is to allow an untrusted user access to a potentially
> > vulnerable kernel in such a way that any kernel vulnerability they find to
> > exploit is either prevented or the consequences confined to their isolated
> > address space such that the compromise attempt has minimal impact on other
> > tenants or the protected structures of the monolithic kernel.  Although we
> > hope to prevent many classes of attack, the first target we're looking at
> > is ROP gadget protection.
> >
> > These patches implement a "system call isolation (SCI)" mechanism that
> > allows running system calls in an isolated address space with reduced page
> > tables to prevent ROP attacks.
> >
> > ROP attacks involve corrupting the stack return address to repoint it to a
> > segment of code you know exists in the kernel that can be used to perform
> > the action you need to exploit the system.
> >
> > The idea behind the prevention is that if we fault in pages in the
> > execution path, we can compare target address against the kernel symbol
> > table.  So if we're in a function, we allow local jumps (and simply falling
> > of the end of a page) but if we're jumping to a new function it must be to
> > an external label in the symbol table.
> 
> That's quite an assumption.  The entry code at least uses .L labels.
> Do you get that right?
> 
> As far as I can see, most of what's going on here has very little to
> do with jumps and calls.  The benefit seems to come from making sure
> that the RET instruction actually goes somewhere that's already been
> faulted in.  Am I understanding right?

Well, RET indeed will go somewhere that's already been faulted in. But
before that, the first CALL to not-yet-mapped code will fault and bring in
the page containing the CALL target.

If the CALL is made into the middle of a function, SCI will refuse to
continue the syscall execution.

As for the local jumps, as long as they are inside a page that was already
mapped or the next page, they are allowed.

This does not take care (yet) of larger functions where local jumps go
further than PAGE_SIZE.

Here's an example trace of #PF's produced by a dummy get_answer system call
from patch 7:

[   12.012906] #PF: DATA: do_syscall_64+0x26b/0x4c0 fault at 0xffffffff82000bb8
[   12.012918] #PF: INSN: __x86_indirect_thunk_rax+0x0/0x20 fault at __x86_indirect_thunk_rax+0x0/0x20
[   12.012929] #PF: INSN: __x64_sys_get_answer+0x0/0x10 fault at __x64_sys_get_answer+0x0/0x10
 
> --Andy
>
Mike Rapoport April 28, 2019, 6:08 a.m. UTC | #5
On Fri, Apr 26, 2019 at 07:41:09AM -0700, Dave Hansen wrote:
> On 4/25/19 2:45 PM, Mike Rapoport wrote:
> > The idea behind the prevention is that if we fault in pages in the
> > execution path, we can compare target address against the kernel symbol
> > table.  So if we're in a function, we allow local jumps (and simply falling
> > of the end of a page) but if we're jumping to a new function it must be to
> > an external label in the symbol table.  Since ROP attacks are all about
> > jumping to gadget code which is effectively in the middle of real
> > functions, the jumps they induce are to code that doesn't have an external
> > symbol, so it should mostly detect when they happen.
> 
> This turns the problem from: "attackers can leverage any data/code that
> the kernel has mapped (anything)" to "attackers can leverage any
> code/data that the current syscall has faulted in".
> 
> That seems like a pretty restrictive change.
> 
> > At this time we are not suggesting any API that will enable the system
> > calls isolation. Because of the overhead required for this, it should only
> > be activated for processes or containers we know should be untrusted. We
> > still have no actual numbers, but surely forcing page faults during system
> > call execution will not come for free.
> 
> What's the minimum number of faults that have to occur to handle the
> simplest dummy fault?
 
For the current implementation it's 3.

Here is the example trace of #PF's produced by a dummy get_answer
system call from patch 7:

[   12.012906] #PF: DATA: do_syscall_64+0x26b/0x4c0 fault at 0xffffffff82000bb8
[   12.012918] #PF: INSN: __x86_indirect_thunk_rax+0x0/0x20 fault at __x86_indirect_thunk_rax+0x0/0x20
[   12.012929] #PF: INSN: __x64_sys_get_answer+0x0/0x10 fault at __x64_sys_get_answer+0x0/0x10

For the sci_write_dmesg syscall, which does copy_from_user() and printk(), it's
between 35 and 60, depending on the console and /proc/sys/kernel/printk values.

This includes both code and data accesses. The data page faults can be
avoided if we pre-populate SCI page tables with data.