
[v7,1/6] seccomp: add a return code to trap to userspace

Message ID 20180927151119.9989-2-tycho@tycho.ws (mailing list archive)
State New, archived
Series seccomp trap to userspace

Commit Message

Tycho Andersen Sept. 27, 2018, 3:11 p.m. UTC
This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.
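
To make the comparison concrete, the SECCOMP_RET_TRACE flavor looks
roughly like this from the tracer's side (sketch only, existing ptrace
API, error and signal handling omitted):

  ptrace(PTRACE_ATTACH, pid, NULL, NULL);
  waitpid(pid, &status, 0);
  ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACESECCOMP);
  ptrace(PTRACE_CONT, pid, NULL, 0);

  for (;;) {
          if (waitpid(pid, &status, 0) != pid)
                  break;
          /* the tracee hit a filter rule returning SECCOMP_RET_TRACE */
          if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
                  /* inspect or rewrite the syscall via the registers here */
          }
          ptrace(PTRACE_CONT, pid, NULL, 0);
  }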

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.
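
For example, a handler built on the interface below might look roughly
like this (sketch only; "listener" is the fd created via
SECCOMP_FILTER_FLAG_NEW_LISTENER, and allowed() is a stand-in for
whatever policy check the handler actually applies):

  struct seccomp_notif req = {};
  struct seccomp_notif_resp resp = {};
  char path[64], copy[4096];
  int memfd;

  req.len = sizeof(req);
  ioctl(listener, SECCOMP_NOTIF_RECV, &req);

  /* Copy the tracee's argument memory exactly once, up front. */
  snprintf(path, sizeof(path), "/proc/%d/mem", req.pid);
  memfd = open(path, O_RDONLY);
  pread(memfd, copy, sizeof(copy), req.data.args[0]);
  close(memfd);

  /*
   * Apply policy to the private copy; anything the tracee writes to its
   * memory after this point cannot change what was already read.
   */
  resp.len = sizeof(resp);
  resp.id = req.id;
  resp.error = allowed(copy) ? 0 : -EPERM;
  resp.val = 0;
  ioctl(listener, SECCOMP_NOTIF_SEND, &resp);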

v2: * make id a u64; the idea here being that it will never overflow,
      because 2^64 is huge (one syscall every nanosecond => wrap every 584
      years) (Andy)
    * prevent nesting of user notifications: if someone is already attached
      to the tree in one place, nobody else can attach to the tree (Andy)
    * notify the listener of signals the tracee receives as well (Andy)
    * implement poll
v3: * lockdep fix (Oleg)
    * drop unnecessary WARN()s (Christian)
    * rearrange error returns to be more pretty (Christian)
    * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
v4: * fix implementation of poll to use poll_wait() (Jann)
    * change listener's fd flags to be 0 (Jann)
    * hoist filter initialization out of ifdefs to its own function
      init_user_notification()
    * add some more testing around poll() and closing the listener while a
      syscall is in action
    * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
      creates a new one (Matthew)
    * correctly handle pid namespaces, add some testcases (Matthew)
    * use EINPROGRESS instead of EINVAL when a notification response is
      written twice (Matthew)
    * fix comment typo from older version (SEND vs READ) (Matthew)
    * whitespace and logic simplification (Tobin)
    * add some Documentation/ bits on userspace trapping
v5: * fix documentation typos (Jann)
    * add signalled field to struct seccomp_notif (Jann)
    * switch to using ioctls instead of read()/write() for struct passing
      (Jann)
    * add an ioctl to ensure an id is still valid
v6: * docs typo fixes, update docs for ioctl() change (Christian)
v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
    * use notify_lock in IS_ID_VALID query to avoid racing
    * s/signalled/signaled (Tyler)
    * fix docs to reflect that ids are not globally unique (Tyler)
    * add a test to check -ERESTARTSYS behavior (Tyler)
    * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
    * reorder USER_NOTIF in seccomp return codes list (Tyler)
    * return size instead of sizeof(struct user_notif) (Tyler)
    * ENOENT instead of EINVAL when invalid id is passed (Tyler)
    * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
    * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
    * add a new struct notification to minimize the additions to
      struct seccomp_filter, also pack the necessary additions a bit more
      cleverly (Tyler)
    * switch to keeping track of the task itself instead of the pid (we'll
      use this for implementing PUT_FD)

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  73 +++
 include/linux/seccomp.h                       |   7 +-
 include/uapi/linux/seccomp.h                  |  33 +-
 kernel/seccomp.c                              | 436 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 413 ++++++++++++++++-
 6 files changed, 954 insertions(+), 9 deletions(-)

Comments

Kees Cook Sept. 27, 2018, 9:31 p.m. UTC | #1
On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
>
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 2^64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       to the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more pretty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> v4: * fix implementation of poll to use poll_wait() (Jann)
>     * change listener's fd flags to be 0 (Jann)
>     * hoist filter initialization out of ifdefs to its own function
>       init_user_notification()
>     * add some more testing around poll() and closing the listener while a
>       syscall is in action
>     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
>       creates a new one (Matthew)
>     * correctly handle pid namespaces, add some testcases (Matthew)
>     * use EINPROGRESS instead of EINVAL when a notification response is
>       written twice (Matthew)
>     * fix comment typo from older version (SEND vs READ) (Matthew)
>     * whitespace and logic simplification (Tobin)
>     * add some Documentation/ bits on userspace trapping
> v5: * fix documentation typos (Jann)
>     * add signalled field to struct seccomp_notif (Jann)
>     * switch to using ioctls instead of read()/write() for struct passing
>       (Jann)
>     * add an ioctl to ensure an id is still valid
> v6: * docs typo fixes, update docs for ioctl() change (Christian)
> v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
>     * use notify_lock in IS_ID_VALID query to avoid racing
>     * s/signalled/signaled (Tyler)
>     * fix docs to reflect that ids are not globally unique (Tyler)
>     * add a test to check -ERESTARTSYS behavior (Tyler)
>     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
>     * reorder USER_NOTIF in seccomp return codes list (Tyler)
>     * return size instead of sizeof(struct user_notif) (Tyler)
>     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
>     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
>     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
>     * add a new struct notification to minimize the additions to
>       struct seccomp_filter, also pack the necessary additions a bit more
>       cleverly (Tyler)
>     * switch to keeping track of the task itself instead of the pid (we'll
>       use this for implementing PUT_FD)

Patch-sending nit: can you put the versioning below the "---" line so
it isn't included in the final commit? (And I normally read these
backwards, so I'd expect v7 at the top, but that's not a big deal. I
mean... neither is the --- thing, but it makes "git am" easier for me
since I don't have to go edit the versioning out of the log.)

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  Documentation/ioctl/ioctl-number.txt          |   1 +
>  .../userspace-api/seccomp_filter.rst          |  73 +++
>  include/linux/seccomp.h                       |   7 +-
>  include/uapi/linux/seccomp.h                  |  33 +-
>  kernel/seccomp.c                              | 436 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 413 ++++++++++++++++-
>  6 files changed, 954 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 13a7c999c04a..31e9707f7e06 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -345,4 +345,5 @@ Code  Seq#(hex)     Include File            Comments
>                                         <mailto:raph@8d.com>
>  0xF6   all     LTTng                   Linux Trace Toolkit Next Generation
>                                         <mailto:mathieu.desnoyers@efficios.com>
> +0xF7    00-1F   uapi/linux/seccomp.h
>  0xFD   all     linux/dm-ioctl.h

I spent some time looking at this, and yes, it seems preferable to add
an entry here.

> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index e5320f6c8654..017444b5efed 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -4,9 +4,10 @@
>
>  #include <uapi/linux/seccomp.h>
>
> -#define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC      | \
> -                                        SECCOMP_FILTER_FLAG_LOG        | \
> -                                        SECCOMP_FILTER_FLAG_SPEC_ALLOW)
> +#define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC | \
> +                                        SECCOMP_FILTER_FLAG_LOG | \
> +                                        SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
> +                                        SECCOMP_FILTER_FLAG_NEW_LISTENER)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 9efc0e73d50b..d4ccb32fe089 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,9 +17,10 @@
>  #define SECCOMP_GET_ACTION_AVAIL       2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)

Since these are all getting indentation updates, can you switch them
to BIT(0), BIT(1), etc?

>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -35,6 +36,7 @@
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};

So, len has to come first, for versioning. However, since it's ahead
of a u64, this leaves a struct padding hole. pahole output:

struct seccomp_notif {
        __u16                      len;                  /*     0     2 */

        /* XXX 6 bytes hole, try to pack */

        __u64                      id;                   /*     8     8 */
        __u32                      pid;                  /*    16     4 */
        __u8                       signaled;             /*    20     1 */

        /* XXX 3 bytes hole, try to pack */

        struct seccomp_data        data;                 /*    24    64 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */

        /* size: 88, cachelines: 2, members: 5 */
        /* sum members: 79, holes: 2, sum holes: 9 */
        /* last cacheline: 24 bytes */
};
struct seccomp_notif_resp {
        __u16                      len;                  /*     0     2 */

        /* XXX 6 bytes hole, try to pack */

        __u64                      id;                   /*     8     8 */
        __s32                      error;                /*    16     4 */

        /* XXX 4 bytes hole, try to pack */

        __s64                      val;                  /*    24     8 */

        /* size: 32, cachelines: 1, members: 4 */
        /* sum members: 22, holes: 2, sum holes: 10 */
        /* last cacheline: 32 bytes */
};

How about making len u32, and moving pid and error above "id"? This
leaves a hole after signaled, so changing "len" won't be sufficient
for versioning here. Perhaps move it after data?
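
i.e. something roughly like this (illustration only, offsets worked out
by hand, not pahole output):

struct seccomp_notif {
        __u32 len;                      /* offset  0 */
        __u32 pid;                      /* offset  4 */
        __u64 id;                       /* offset  8 */
        struct seccomp_data data;       /* offset 16 */
        __u8 signaled;                  /* offset 80; only tail padding left */
};

struct seccomp_notif_resp {
        __u32 len;                      /* offset  0 */
        __s32 error;                    /* offset  4 */
        __u64 id;                       /* offset  8 */
        __s64 val;                      /* offset 16; size 24, no holes */
};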

> +
> +#define SECCOMP_IOC_MAGIC              0xF7

Was there any specific reason for picking this value? There are lots
of fun ASCII codes left, like '!' or '*'. :)

> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)

To match other UAPI ioctls, can these have a prefix of "SECCOMP_IOCTL_..."?

It may also be useful to match how other uapis do this, like for DRM:

#define DRM_IOCTL_BASE                  'd'
#define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
#define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
#define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
#define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)

#define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
#define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
#define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
...


> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..fa6fe9756c80 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,12 +33,78 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_SENT,
> +       SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +       /* The struct pid of the task whose filter triggered the notification */
> +       struct task_struct *task;
> +
> +       /* The "cookie" for this request; this is unique for this filter. */
> +       u64 id;
> +
> +       /* Whether or not this task has been given an interruptible signal. */
> +       bool signaled;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once the
> +        * handler reads the notification off of an FD, it transitions to SENT.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies, state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +
> +/**
> + * struct notification - container for seccomp userspace notifications. Since
> + * most seccomp filters will not have notification listeners attached and this
> + * structure is fairly large, we store the notification-specific stuff in a
> + * separate structure.
> + *
> + * @request: A semaphore that users of this notification can wait on for
> + *           changes. Actual reads and writes are still controlled with
> + *           filter->notify_lock.
> + * @notify_lock: A lock for all notification-related accesses.
> + * @next_id: The id of the next request.
> + * @notifications: A list of struct seccomp_knotif elements.
> + * @wqh: A wait queue for poll.
> + */
> +struct notification {
> +       struct semaphore request;
> +       u64 next_id;
> +       struct list_head notifications;
> +       wait_queue_head_t wqh;
> +};
>
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
> @@ -66,6 +132,8 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +       struct notification *notif;
> +       struct mutex notify_lock;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +       mutex_init(&sfilter->notify_lock);
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
> @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE              (1 << 4)
>  #define SECCOMP_LOG_LOG                        (1 << 5)
>  #define SECCOMP_LOG_ALLOW              (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
>
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>                                     SECCOMP_LOG_KILL_THREAD  |
>                                     SECCOMP_LOG_TRAP  |
>                                     SECCOMP_LOG_ERRNO |
> +                                   SECCOMP_LOG_USER_NOTIF |
>                                     SECCOMP_LOG_TRACE |
>                                     SECCOMP_LOG_LOG;
>
> @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>         case SECCOMP_RET_TRACE:
>                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +               break;
>         case SECCOMP_RET_LOG:
>                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>                 break;
> @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
>  #else
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +       /* Note: overflow is ok here, the id just needs to be unique */

Maybe just clarify in the comment: unique to the filter.

> +       return filter->notif->next_id++;

Also, it might be useful to add for both documentation and lockdep:

lockdep_assert_held(&filter->notify_lock);

into this function?


> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->notif)
> +               goto out;
> +
> +       n.task = current;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notif->notifications);
> +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->notif->request);
> +

Maybe add a big comment here saying this is where we're waiting for a reply?

> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the mutex
> +        * while the reply was sent, so let's be sure there wasn't a response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it (potentially
> +                * again, if we had already notified them about the first one).
> +                */
> +               n.signaled = true;
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->notif->request);
> +               }
> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;
> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
>  {
> @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>
>                 return 0;
>
> +       case SECCOMP_RET_USER_NOTIF:
> +               seccomp_do_user_notification(this_syscall, match, sd);
> +               goto skip;

Nit: please add a blank line here (to match the other cases).

>         case SECCOMP_RET_LOG:
>                 seccomp_log(this_syscall, 0, action, true);
>                 return 0;
> @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +                                 struct seccomp_filter *);

Why is the forward declaration needed instead of just moving the
function here? I didn't see anything in it that looked like it
couldn't move.

> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;

Nit: "invalid fd" should be -1, not 0.

> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               listener = get_unused_fd_flags(0);

As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?

> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(current, prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }

Can you update the kern-docs for seccomp_set_mode_filter(), since we
can now return positive values?

 * Returns 0 on success or -EINVAL on failure.

(this shouldn't say only -EINVAL, I realize too)

I have to say, I'm vaguely nervous about changing the semantics here
for passing back the fd as the return code from the seccomp() syscall.
Alternatives seem less appealing, though: changing the meaning of the
uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
example. Hmm.
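
For reference, the calling convention this creates (and which the
selftests below rely on) is roughly:

        int listener;

        /* prog's filter returns SECCOMP_RET_USER_NOTIF for some syscall */
        listener = seccomp(SECCOMP_SET_MODE_FILTER,
                           SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (listener < 0) {
                /* errno-style failure, as before */
        } else {
                /* success: the return value is the new notification fd,
                 * usable with poll() and the SECCOMP_NOTIF_* ioctls */
        }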

>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
> @@ -911,6 +1085,7 @@ static long seccomp_get_action_avail(const char __user *uaction)
>         case SECCOMP_RET_KILL_THREAD:
>         case SECCOMP_RET_TRAP:
>         case SECCOMP_RET_ERRNO:
> +       case SECCOMP_RET_USER_NOTIF:
>         case SECCOMP_RET_TRACE:
>         case SECCOMP_RET_LOG:
>         case SECCOMP_RET_ALLOW:
> @@ -1111,6 +1286,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME   "kill_thread"
>  #define SECCOMP_RET_TRAP_NAME          "trap"
>  #define SECCOMP_RET_ERRNO_NAME         "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME    "user_notif"
>  #define SECCOMP_RET_TRACE_NAME         "trace"
>  #define SECCOMP_RET_LOG_NAME           "log"
>  #define SECCOMP_RET_ALLOW_NAME         "allow"
> @@ -1120,6 +1296,7 @@ static const char seccomp_actions_avail[] =
>                                 SECCOMP_RET_KILL_THREAD_NAME    " "
>                                 SECCOMP_RET_TRAP_NAME           " "
>                                 SECCOMP_RET_ERRNO_NAME          " "
> +                               SECCOMP_RET_USER_NOTIF_NAME     " "
>                                 SECCOMP_RET_TRACE_NAME          " "
>                                 SECCOMP_RET_LOG_NAME            " "
>                                 SECCOMP_RET_ALLOW_NAME;
> @@ -1134,6 +1311,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>         { SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME },
>         { SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME },
>         { SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME },
> +       { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>         { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>         { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>         { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_knotif *knotif;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +                       continue;
> +
> +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +
> +               complete(&knotif->ready);
> +       }
> +
> +       wake_up_all(&filter->notif->wqh);
> +       kfree(filter->notif);
> +       filter->notif = NULL;
> +       mutex_unlock(&filter->notify_lock);

It looks like that means nothing waiting on knotif->ready can access
filter->notif without rechecking it, yes?

e.g. in seccomp_do_user_notification() I see:

                        up(&match->notif->request);

I *think* this isn't reachable due to the test for n.state !=
SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
it's not fast-path, we could add a WARN_ON() while checking for
unreplied signal death?

                n.signaled = true;
                if (n.state == SECCOMP_NOTIFY_SENT) {
                        n.state = SECCOMP_NOTIFY_INIT;
                        if (!WARN_ON(!match->notif))
                            up(&match->notif->request);
                }
                mutex_unlock(&match->notify_lock);


> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif = {};
> +       ssize_t ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;

I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
anything from accidentally using "arg" directly here.
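
i.e. something like this (sketch), with the helpers changed to take a
void __user * instead of the raw unsigned long:

static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
                                 unsigned long arg)
{
        struct seccomp_filter *filter = file->private_data;
        void __user *buf = (void __user *)arg;

        switch (cmd) {
        case SECCOMP_NOTIF_RECV:
                return seccomp_notify_recv(filter, buf);
        case SECCOMP_NOTIF_SEND:
                return seccomp_notify_send(filter, buf);
        case SECCOMP_NOTIF_ID_VALID:
                return seccomp_notify_id_valid(filter, buf);
        default:
                return -EINVAL;
        }
}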

> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +
> +       ret = down_interruptible(&filter->notif->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were able to
> +        * acquire the rw lock.
> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       size = min_t(size_t, size, sizeof(unotif));
> +

It is possible (though unlikely given the type widths involved here)
for unotif = {} to not initialize padding, so I would recommend an
explicit memset(&unotif, 0, sizeof(unotif)) here.

> +       unotif.len = size;
> +       unotif.id = knotif->id;
> +       unotif.pid = task_pid_vnr(knotif->task);
> +       unotif.signaled = knotif->signaled;
> +       unotif.data = *(knotif->data);
> +
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);

Is there some way to rearrange the locking here to avoid holding the
mutex while doing copy_to_user() (which userspace could block with
userfaultfd, and then stall all the other notifications for this
filter)?
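
One possible shape (hand-waving, and the "notification disappeared while
we were unlocked" case needs care) is to snapshot into the local struct
under the lock and only copy out after dropping it:

        /* under notify_lock: fill unotif from knotif, then... */
        knotif->state = SECCOMP_NOTIFY_SENT;
        wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
        mutex_unlock(&filter->notify_lock);

        if (copy_to_user(buf, &unotif, size)) {
                /*
                 * The tracee may have been signaled or killed while we were
                 * unlocked, so look the notification up again by id before
                 * rolling its state back, and bump the request semaphore so
                 * another RECV can pick it up.
                 */
                mutex_lock(&filter->notify_lock);
                list_for_each_entry(cur, &filter->notif->notifications, list) {
                        if (cur->id == unotif.id &&
                            cur->state == SECCOMP_NOTIFY_SENT) {
                                cur->state = SECCOMP_NOTIFY_INIT;
                                up(&filter->notif->request);
                                break;
                        }
                }
                mutex_unlock(&filter->notify_lock);
                return -EFAULT;
        }
        return size;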

> +       return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;

Same cast note as above.

> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +       size = min_t(size_t, size, sizeof(resp));
> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;

For sanity checking on a double-read from userspace, please add:

    if (resp.len != size)
        return -EINVAL;

> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == resp.id)
> +                       break;
> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       /* Allow exactly one reply. */
> +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +               ret = -EINPROGRESS;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> +                                   unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       void __user *buf = (void __user *)arg;
> +       u64 id;
> +       long ret;
> +
> +       if (copy_from_user(&id, buf, sizeof(id)))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -1;

Isn't this EPERM? Shouldn't it be -ENOENT?

> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == id) {
> +                       ret = 0;
> +                       goto out;
> +               }
> +       }
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> +                                unsigned long arg)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +
> +       switch (cmd) {
> +       case SECCOMP_NOTIF_RECV:
> +               return seccomp_notify_recv(filter, arg);
> +       case SECCOMP_NOTIF_SEND:
> +               return seccomp_notify_send(filter, arg);
> +       case SECCOMP_NOTIF_ID_VALID:
> +               return seccomp_notify_id_valid(filter, arg);
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       poll_wait(file, &filter->notif->wqh, poll_tab);
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
> +               if (ret & EPOLLIN && ret & EPOLLOUT)

My eyes! :) Can you wrap the bit operations in parens here?

> +                       break;
> +       }

Should POLLERR be handled here too? I don't quite see which conditions
would expose it. If all the processes using the filter die, what happens
here?

> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .poll = seccomp_notify_poll,
> +       .release = seccomp_notify_release,
> +       .unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{
> +       struct file *ret = ERR_PTR(-EBUSY);
> +       struct seccomp_filter *cur, *last_locked = NULL;
> +       int filter_nesting = 0;
> +
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +               filter_nesting++;
> +               last_locked = cur;
> +               if (cur->notif)
> +                       goto out;
> +       }
> +
> +       ret = ERR_PTR(-ENOMEM);
> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> +       if (!filter->notif)
> +               goto out;
> +
> +       sema_init(&filter->notif->request, 0);
> +       INIT_LIST_HEAD(&filter->notif->notifications);
> +       filter->notif->next_id = get_random_u64();
> +       init_waitqueue_head(&filter->notif->wqh);
> +
> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret))
> +               goto out;
> +
> +
> +       /* The file has a reference to it now */
> +       __get_seccomp_filter(filter);
> +
> +out:
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_unlock(&cur->notify_lock);
> +               if (cur == last_locked)
> +                       break;
> +       }
> +
> +       return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index e1473234968d..5f4b836a6792 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -5,6 +5,7 @@
>   * Test code for seccomp bpf.
>   */
>
> +#define _GNU_SOURCE
>  #include <sys/types.h>
>
>  /*
> @@ -40,10 +41,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
>
> -#define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>
>  #include "../kselftest_harness.h"
>
> @@ -154,6 +157,34 @@ struct seccomp_metadata {
>  };
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +#define SECCOMP_IOC_MAGIC              0xF7
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2077,7 +2108,8 @@ TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
>                                  SECCOMP_FILTER_FLAG_LOG,
> -                                SECCOMP_FILTER_FLAG_SPEC_ALLOW };
> +                                SECCOMP_FILTER_FLAG_SPEC_ALLOW,
> +                                SECCOMP_FILTER_FLAG_NEW_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2933,6 +2965,383 @@ TEST(get_metadata)
>         ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +       int ret;
> +
> +       do {
> +               errno = 0;
> +               req->len = sizeof(*req);
> +               ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
> +       } while (ret == -1 && errno == ENOENT);
> +       return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +       struct pollfd pollfd;
> +
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               if (user_trap_syscall(__NR_getpid, 0) < 0)
> +                       exit(1);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Add some no-op filters so that we (don't) trigger lockdep. */
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_NEW_LISTENER);
> +       EXPECT_GE(listener, 0);
> +
> +       /* Installing a second listener in the chain should EBUSY */
> +       EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +                                   SECCOMP_FILTER_FLAG_NEW_LISTENER),
> +                 -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLIN);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +       EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> +
> +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);

Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
been wondering what it's for, and now I see it's kind of an advisory
"is the other end still alive?" test.

> +
> +       resp.id = req.id;
> +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, ENOENT);
> +
> +       /*
> +        * Check that we get another notification about a signal in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +                       perror("signal");
> +                       exit(1);
> +               }
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(req.signaled, 1);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = -512; /* -ERESTARTSYS */
> +       resp.val = 0;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       ret = read_notif(listener, &req);
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);

I was slightly confused here: why have there been 3 reads? I was
expecting one notification for hitting getpid and one from catching a
signal. But in rereading, I see that NOTIF_RECV will return the most
recently unresponded notification, yes?

But... catching a signal replaces the existing seccomp_knotif? I
remain confused about how signal handling is meant to work here. What
happens if two signals get sent? It looks like you just block without
allowing more signals? (Thank you for writing the tests!)

(And can you document the expected behavior in the seccomp_filter.rst too?)

> +       EXPECT_EQ(ret, sizeof(resp));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that we get an ENOSYS when the listener is closed.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +       if (pid == 0) {
> +               close(listener);
> +               ret = syscall(__NR_getpid);
> +               exit(ret != -1 && errno != ENOSYS);
> +       }
> +
> +       close(listener);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +/*
> + * Check that a pid in a child namespace still shows up as valid in ours.
> + */
> +TEST(user_notification_child_pid_ns)
> +{
> +       pid_t pid;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +       ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +               /* Signal we're ready and have installed the filter. */
> +               EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +
> +               exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +       EXPECT_EQ(c, 'J');
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +       EXPECT_GE(listener, 0);
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +       /* Now signal we are done and respond with magic */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +       EXPECT_EQ(req.pid, pid);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +       close(listener);
> +}
> +
> +/*
> + * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
> + * invalid.
> + */
> +TEST(user_notification_sibling_pid_ns)
> +{
> +       pid_t pid, pid2;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               int child_pair[2];
> +
> +               ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +               ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
> +
> +               pid2 = fork();
> +               ASSERT_GE(pid2, 0);
> +
> +               if (pid2 == 0) {
> +                       close(child_pair[0]);
> +                       EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +                       /* Signal we're ready and have installed the filter. */
> +                       EXPECT_EQ(write(child_pair[1], "J", 1), 1);
> +
> +                       EXPECT_EQ(read(child_pair[1], &c, 1), 1);
> +                       EXPECT_EQ(c, 'H');
> +
> +                       exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +               }
> +
> +               /* check that child has installed the filter */
> +               EXPECT_EQ(read(child_pair[0], &c, 1), 1);
> +               EXPECT_EQ(c, 'J');
> +
> +               /* tell parent who child is */
> +               EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +               /* parent has installed listener, tell child to call syscall */
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +               EXPECT_EQ(write(child_pair[0], "H", 1), 1);
> +
> +               EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +               EXPECT_EQ(true, WIFEXITED(status));
> +               EXPECT_EQ(0, WEXITSTATUS(status));
> +               exit(WEXITSTATUS(status));
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
> +       EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
> +       EXPECT_GE(listener, 0);
> +       EXPECT_EQ(errno, 0);
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
> +
> +       /* Create the sibling ns, and sibling in it. */
> +       EXPECT_EQ(unshare(CLONE_NEWPID), 0);
> +       EXPECT_EQ(errno, 0);
> +
> +       pid2 = fork();
> +       EXPECT_GE(pid2, 0);
> +
> +       if (pid2 == 0) {
> +               req.len = sizeof(req);
> +               ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +               /*
> +                * The pid should be 0, i.e. the task is in some namespace that
> +                * we can't "see".
> +                */
> +               ASSERT_EQ(req.pid, 0);
> +
> +               resp.len = sizeof(resp);
> +               resp.id = req.id;
> +               resp.error = 0;
> +               resp.val = USER_NOTIF_MAGIC;
> +
> +               ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +               exit(0);
> +       }
> +
> +       close(listener);
> +
> +       /* Now signal we are done setting up sibling listener. */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> --
> 2.17.1
>

Looking good!

-Kees
Jann Horn Sept. 27, 2018, 9:51 p.m. UTC | #2
+Christoph Hellwig, Al Viro, fsdevel: For two questions about the poll
interface (search for "seccomp_notify_poll" and
"seccomp_notify_release" in the patch)

@Tycho: FYI, I've gone through all of v7 now, apart from the
test/sample code. So don't wait for more comments from me before
sending out v8.

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Note that in that case, the trusted runtime needs to be in the same
mount namespace as the container. mount() doesn't work on the mount
structure of a foreign mount namespace; check_mnt() specifically
checks for this case, and I think pretty much everything in
sys_mount() uses that check. So you'd have to join the container's
mount namespace before forwarding a mount syscall.
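
Roughly (just a sketch; `container_init_pid` is a stand-in, and the
privilege checks plus the need for a single-threaded helper are glossed
over here):

        /* sketch: join the container's mount namespace before replaying
         * the intercepted mount() on its behalf
         */
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/ns/mnt", container_init_pid);
        int nsfd = open(path, O_RDONLY | O_CLOEXEC);
        if (nsfd < 0 || setns(nsfd, CLONE_NEWNS) < 0)
                err(1, "joining container mount namespace");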

> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here,

Actually, it doesn't, right? It would apply if you told the kernel "go
ahead, that syscall is fine", but that's not how the API works - you
always intercept the syscall, copy argument data to a trusted tracer,
and then the tracer can make a replacement syscall. Sounds fine to me.

> but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
[...]
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
[...]
> +which (on success) will return a listener fd for the filter, which can then be
> +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> +acquired via:
> +
> +.. code-block::
> +
> +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);

The manpage documents ptrace() as taking four arguments, not three. I
know that the header defines it with varargs, but it would probably be
more useful to require passing in zero as the fourth argument so that
we have a place to stick flags if necessary in the future.
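
Something like this, purely as a sketch of the calling convention:

        /* sketch: pass an explicit 0 as the fourth (flags) argument so it
         * can grow meaning later without breaking existing callers
         */
        int listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0, 0);
        if (listener < 0)
                err(1, "PTRACE_SECCOMP_NEW_LISTENER");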

> +which grabs the 0th filter for some task which the tracer has privilege over.
> +Note that filter fds correspond to a particular filter, and not a particular
> +task. So if this task then forks, notifications from both tasks will appear on
> +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> +so a filter fd can safely have many readers.

Add a note about needing CAP_SYS_ADMIN here? Also, might be useful to
clarify in which direction "nth filter" counts.

> +The interface for a seccomp notification fd consists of two structures:
> +
> +.. code-block::
> +
> +    struct seccomp_notif {
> +        __u16 len;
> +        __u64 id;
> +        pid_t pid;
> +        __u8 signalled;
> +        struct seccomp_data data;
> +    };
> +
> +    struct seccomp_notif_resp {
> +        __u16 len;
> +        __u64 id;
> +        __s32 error;
> +        __s64 val;
> +    };
> +
> +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> +notification fd to receive a ``struct seccomp_notif``, which contains five
> +members: the input length of the structure, a unique-per-filter ``id``, the
> +``pid`` of the task which triggered this request (which may be 0 if the task is
> +in a pid ns not visible from the listener's pid namespace), a flag representing
> +whether or not the notification is a result of a non-fatal signal, and the
> +``data`` passed to seccomp. Userspace can then make a decision based on this
> +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> +indicating what should be returned to userspace. The ``id`` member of ``struct
> +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> +
> +It is worth noting that ``struct seccomp_data`` contains the values of register
> +arguments to the syscall, but does not contain pointers to memory. The task's
> +memory is accessible to suitably privileged tracers via ``ptrace()`` or
> +``/proc/pid/map_files/``.

You probably don't actually want to use /proc/pid/map_files here; you
can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
And while reading memory via ptrace() is possible, the interface is
really ugly (e.g. you can only read data in word-sized chunks), and your
caveat about locking out other ptracers (or getting locked out by
them) applies. I'm not even sure if you could read memory via ptrace
while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
requires the target to be in a __TASK_TRACED state.
The two interfaces you might want to use instead are /proc/$pid/mem
and process_vm_{readv,writev}, which allow you to do nice,
arbitrarily-sized, vectored IO on the memory of another process.
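
For example, a tracer could pull in a pointer argument with something like
this (a sketch; the buffer size and the `req` variable, a struct
seccomp_notif read earlier, are assumptions):

        /* sketch: vectored read of tracee memory keyed off the notification;
         * no ptrace attach needed
         */
        char buf[4096];
        struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct iovec remote = { .iov_base = (void *)req.data.args[0],
                                .iov_len = sizeof(buf) };
        ssize_t copied = process_vm_readv(req.pid, &local, 1, &remote, 1, 0);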

> However, care should be taken to avoid the TOCTOU
> +mentioned above in this document: all arguments being read from the tracee's
> +memory should be read into the tracer's memory before any policy decisions are
> +made. This allows for an atomic decision on syscall arguments.

Again, I don't really see how you could get this wrong.
[...]
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
[...]
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +
> +#define SECCOMP_IOC_MAGIC              0xF7
> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..fa6fe9756c80 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,12 +33,78 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_SENT,
> +       SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +       /* The struct pid of the task whose filter triggered the notification */
> +       struct task_struct *task;
> +
> +       /* The "cookie" for this request; this is unique for this filter. */
> +       u64 id;
> +
> +       /* Whether or not this task has been given an interruptible signal. */
> +       bool signaled;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once the
> +        * handler reads the notification off of an FD, it transitions to SENT.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies, state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +
> +/**
> + * struct notification - container for seccomp userspace notifications. Since
> + * most seccomp filters will not have notification listeners attached and this
> + * structure is fairly large, we store the notification-specific stuff in a
> + * separate structure.
> + *
> + * @request: A semaphore that users of this notification can wait on for
> + *           changes. Actual reads and writes are still controlled with
> + *           filter->notify_lock.
> + * @notify_lock: A lock for all notification-related accesses.

notify_lock is documented here, but is a member of struct
seccomp_filter, not of struct notification.

> + * @next_id: The id of the next request.
> + * @notifications: A list of struct seccomp_knotif elements.
> + * @wqh: A wait queue for poll.
> + */
> +struct notification {
> +       struct semaphore request;
> +       u64 next_id;
> +       struct list_head notifications;
> +       wait_queue_head_t wqh;
> +};
>
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
> @@ -66,6 +132,8 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +       struct notification *notif;
> +       struct mutex notify_lock;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +       mutex_init(&sfilter->notify_lock);
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
[...]
> @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
>  #else
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +       /* Note: overflow is ok here, the id just needs to be unique */
> +       return filter->notif->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->notif)
> +               goto out;
> +
> +       n.task = current;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notif->notifications);
> +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->notif->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the mutex
> +        * while the reply was sent, so let's be sure there wasn't a response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it (potentially
> +                * again, if we had already notified them about the first one).
> +                */
> +               n.signaled = true;
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->notif->request);
> +               }

Do you need another wake_up_poll() here?

> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;

Add a comment here explaining that we intentionally leave the
semaphore count too high (because otherwise we'd have to block), and
seccomp_notify_recv() compensates for that?

> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
>  {
[...]
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +                                 struct seccomp_filter *);
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;
> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               listener = get_unused_fd_flags(0);
> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(current, prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
[...]
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_knotif *knotif;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +                       continue;
> +
> +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +
> +               complete(&knotif->ready);
> +       }
> +
> +       wake_up_all(&filter->notif->wqh);

If select() is polling us, a reference to the open file is being held,
and this can't be reached; and I think if epoll is polling us,
eventpoll_release() will remove itself from the wait queue, right? So
can this wake_up_all() actually ever notify anyone?

> +       kfree(filter->notif);
> +       filter->notif = NULL;
> +       mutex_unlock(&filter->notify_lock);
> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif = {};
> +       ssize_t ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;
> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +
> +       ret = down_interruptible(&filter->notif->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were able to

s/interrupted/interrupted by a fatal signal/ ?

> +        * acquire the rw lock.

State more explicitly here that we are compensating for an incorrectly
high semaphore count?

> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       size = min_t(size_t, size, sizeof(unotif));
> +
> +       unotif.len = size;
> +       unotif.id = knotif->id;
> +       unotif.pid = task_pid_vnr(knotif->task);
> +       unotif.signaled = knotif->signaled;
> +       unotif.data = *(knotif->data);
> +
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;
> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +       size = min_t(size_t, size, sizeof(resp));
> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == resp.id)
> +                       break;
> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {

Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
NULL here. If `filter->notif->notifications` is empty, I think
`knotif` will be `container_of(&filter->notif->notifications, struct
seccomp_knotif, list)` - in other words, you'll have a type confusion,
and `knotif` probably points into some random memory in front of
`filter->notif`.

Am I missing something?

> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       /* Allow exactly one reply. */
> +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +               ret = -EINPROGRESS;
> +               goto out;
> +       }

This means that if seccomp_do_user_notification() has in the meantime
received a signal and transitioned from SENT back to INIT, this will
fail, right? So we fail here, then we read the new notification, and
then we can retry SECCOMP_NOTIF_SEND? Is that intended?

> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> +                                   unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       void __user *buf = (void __user *)arg;
> +       u64 id;
> +       long ret;
> +
> +       if (copy_from_user(&id, buf, sizeof(id)))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -1;

In strace, this is going to show up as EPERM. Maybe use something like
-ENOENT instead? Or whatever you think resembles a fitting error
number.

> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == id) {
> +                       ret = 0;

Would it make sense to treat notifications that have already been
replied to as invalid?

> +                       goto out;
> +               }
> +       }
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
[...]
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       poll_wait(file, &filter->notif->wqh, poll_tab);
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;

Looking at the callers of vfs_poll(), as far as I can tell, a poll
handler is not allowed to return error codes. Perhaps someone who
knows the poll interface better can weigh in here. I've CCed some
people who should hopefully know better how this stuff works.
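
If error returns really aren't allowed, one way out (just a sketch,
assuming a non-interruptible lock is acceptable in the poll path) would be:

        /* sketch: take the lock unconditionally so the handler only ever
         * returns an event mask, never a negative error
         */
        mutex_lock(&filter->notify_lock);
        /* ... build the EPOLLIN/EPOLLOUT mask as in the loop below ... */
        mutex_unlock(&filter->notify_lock);
        return ret;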

> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
> +               if (ret & EPOLLIN && ret & EPOLLOUT)
> +                       break;
> +       }
> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .poll = seccomp_notify_poll,
> +       .release = seccomp_notify_release,
> +       .unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{

Why does this function take a `task` pointer instead of always
accessing `current`? If `task` actually wasn't `current`, I would have
concurrency concerns. A comment in seccomp.h even explains:

 *          @filter must only be accessed from the context of current as there
 *          is no read locking.

Unless there's a good reason for it, I would prefer it if this
function didn't take a `task` pointer.

> +       struct file *ret = ERR_PTR(-EBUSY);
> +       struct seccomp_filter *cur, *last_locked = NULL;
> +       int filter_nesting = 0;
> +
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +               filter_nesting++;
> +               last_locked = cur;
> +               if (cur->notif)
> +                       goto out;
> +       }
> +
> +       ret = ERR_PTR(-ENOMEM);
> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);

sizeof(struct notification) instead, to make the code clearer?

> +       if (!filter->notif)
> +               goto out;
> +
> +       sema_init(&filter->notif->request, 0);
> +       INIT_LIST_HEAD(&filter->notif->notifications);
> +       filter->notif->next_id = get_random_u64();
> +       init_waitqueue_head(&filter->notif->wqh);

Nit: next_id and notifications are declared in reverse order in the
struct. Could you flip them around here?

> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret))
> +               goto out;
> +
> +
> +       /* The file has a reference to it now */
> +       __get_seccomp_filter(filter);

__get_seccomp_filter() has a comment in it that claims "/* Reference
count is bounded by the number of total processes. */". I think this
change invalidates that comment. I think it should be fine to just
remove the comment.

> +out:
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {

s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
here, something went very wrong.

> +               mutex_unlock(&cur->notify_lock);
> +               if (cur == last_locked)
> +                       break;
> +       }
> +
> +       return ret;
> +}
> +#endif
Kees Cook Sept. 27, 2018, 10:45 p.m. UTC | #3
On Thu, Sep 27, 2018 at 2:51 PM, Jann Horn <jannh@google.com> wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
>> However, care should be taken to avoid the TOCTOU
>> +mentioned above in this document: all arguments being read from the tracee's
>> +memory should be read into the tracer's memory before any policy decisions are
>> +made. This allows for an atomic decision on syscall arguments.
>
> Again, I don't really see how you could get this wrong.

Doesn't hurt to mention it, IMO.

>> +static long seccomp_notify_send(struct seccomp_filter *filter,
>> +                               unsigned long arg)
>> +{
>> +       struct seccomp_notif_resp resp = {};
>> +       struct seccomp_knotif *knotif = NULL;
>> +       long ret;
>> +       u16 size;
>> +       void __user *buf = (void __user *)arg;
>> +
>> +       if (copy_from_user(&size, buf, sizeof(size)))
>> +               return -EFAULT;
>> +       size = min_t(size_t, size, sizeof(resp));
>> +       if (copy_from_user(&resp, buf, size))
>> +               return -EFAULT;
>> +
>> +       ret = mutex_lock_interruptible(&filter->notify_lock);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
>> +               if (knotif->id == resp.id)
>> +                       break;
>> +       }
>> +
>> +       if (!knotif || knotif->id != resp.id) {
>
> Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> NULL here. If `filter->notif->notifications` is empty, I think
> `knotif` will be `container_of(&filter->notif->notifications, struct
> seccomp_knotif, list)` - in other words, you'll have a type confusion,
> and `knotif` probably points into some random memory in front of
> `filter->notif`.
>
> Am I missing something?

Oh, good catch. This just needs to be fixed like it's done in
seccomp_notify_recv() (separate cur and knotif).
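
Roughly (sketch of just the lookup):

        struct seccomp_knotif *knotif = NULL, *cur;

        /* walk with a cursor and only latch a match, so knotif stays NULL
         * when no notification carries the requested id
         */
        list_for_each_entry(cur, &filter->notif->notifications, list) {
                if (cur->id == resp.id) {
                        knotif = cur;
                        break;
                }
        }

        if (!knotif) {
                ret = -ENOENT;
                goto out;
        }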

>> +static struct file *init_listener(struct task_struct *task,
>> +                                 struct seccomp_filter *filter)
>> +{
>
> Why does this function take a `task` pointer instead of always
> accessing `current`? If `task` actually wasn't `current`, I would have
> concurrency concerns. A comment in seccomp.h even explains:
>
>  *          @filter must only be accessed from the context of current as there
>  *          is no read locking.
>
> Unless there's a good reason for it, I would prefer it if this
> function didn't take a `task` pointer.

This is to support PTRACE_SECCOMP_NEW_LISTENER.

But you make an excellent point. Even TSYNC expects to operate only on
the current thread group. Hmm.

While the process is stopped by ptrace, we could, in theory, update
task->seccomp.filter via something like TSYNC.

So perhaps use:

mutex_lock_killable(&task->signal->cred_guard_mutex);

before walking the notify_locks?
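
Roughly (a sketch; assumes returning -EINTR from init_listener() is fine):

        if (mutex_lock_killable(&task->signal->cred_guard_mutex))
                return ERR_PTR(-EINTR);
        /* ... walk task->seccomp.filter, taking each notify_lock ... */
        mutex_unlock(&task->signal->cred_guard_mutex);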

>
>> +       struct file *ret = ERR_PTR(-EBUSY);
>> +       struct seccomp_filter *cur, *last_locked = NULL;
>> +       int filter_nesting = 0;
>> +
>> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
>> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
>> +               filter_nesting++;
>> +               last_locked = cur;
>> +               if (cur->notif)
>> +                       goto out;
>> +       }
>> +
>> +       ret = ERR_PTR(-ENOMEM);
>> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
>
> sizeof(struct notification) instead, to make the code clearer?

I prefer what Tycho has: I want to allocate an instance of whatever
filter->notif is.

Though, let's do the kzalloc outside of the locking, instead?

>> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
>> +                                filter, O_RDWR);
>> +       if (IS_ERR(ret))
>> +               goto out;
>> +
>> +
>> +       /* The file has a reference to it now */
>> +       __get_seccomp_filter(filter);
>
> __get_seccomp_filter() has a comment in it that claims "/* Reference
> count is bounded by the number of total processes. */". I think this
> change invalidates that comment. I think it should be fine to just
> remove the comment.

Update it to "bounded by total processes and notification listeners"?

>> +out:
>> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
>
> s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> here, something went very wrong.

Hm? This is correct. This is how seccomp_run_filters() walks the list too:

        struct seccomp_filter *f =
                        READ_ONCE(current->seccomp.filter);
        ...
        for (; f; f = f->prev) {

Especially if we'll be holding the cred_guard_mutex.

-Kees
Tycho Andersen Sept. 27, 2018, 10:48 p.m. UTC | #4
On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> >
> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> >
> > v2: * make id a u64; the idea here being that it will never overflow,
> >       because 64 is huge (one syscall every nanosecond => wrap every 584
> >       years) (Andy)
> >     * prevent nesting of user notifications: if someone is already attached
> >       the tree in one place, nobody else can attach to the tree (Andy)
> >     * notify the listener of signals the tracee receives as well (Andy)
> >     * implement poll
> > v3: * lockdep fix (Oleg)
> >     * drop unnecessary WARN()s (Christian)
> >     * rearrange error returns to be more rpetty (Christian)
> >     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> > v4: * fix implementation of poll to use poll_wait() (Jann)
> >     * change listener's fd flags to be 0 (Jann)
> >     * hoist filter initialization out of ifdefs to its own function
> >       init_user_notification()
> >     * add some more testing around poll() and closing the listener while a
> >       syscall is in action
> >     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
> >       creates a new one (Matthew)
> >     * correctly handle pid namespaces, add some testcases (Matthew)
> >     * use EINPROGRESS instead of EINVAL when a notification response is
> >       written twice (Matthew)
> >     * fix comment typo from older version (SEND vs READ) (Matthew)
> >     * whitespace and logic simplification (Tobin)
> >     * add some Documentation/ bits on userspace trapping
> > v5: * fix documentation typos (Jann)
> >     * add signalled field to struct seccomp_notif (Jann)
> >     * switch to using ioctls instead of read()/write() for struct passing
> >       (Jann)
> >     * add an ioctl to ensure an id is still valid
> > v6: * docs typo fixes, update docs for ioctl() change (Christian)
> > v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
> >     * use notify_lock in IS_ID_VALID query to avoid racing
> >     * s/signalled/signaled (Tyler)
> >     * fix docs to reflect that ids are not globally unique (Tyler)
> >     * add a test to check -ERESTARTSYS behavior (Tyler)
> >     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
> >     * reorder USER_NOTIF in seccomp return codes list (Tyler)
> >     * return size instead of sizeof(struct user_notif) (Tyler)
> >     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
> >     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
> >     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
> >     * add a new struct notification to minimize the additions to
> >       struct seccomp_filter, also pack the necessary additions a bit more
> >       cleverly (Tyler)
> >     * switch to keeping track of the task itself instead of the pid (we'll
> >       use this for implementing PUT_FD)
> 
> Patch-sending nit: can you put the versioning below the "---" line so
> it isn't included in the final commit? (And I normally read these
> backwards, so I'd expect v7 at the top, but that's not a big deal. I
> mean... neither is the --- thing, but it makes "git am" easier for me
> since I don't have to go edit the versioning out of the log.)

Sure, will do.

> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index 9efc0e73d50b..d4ccb32fe089 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -17,9 +17,10 @@
> >  #define SECCOMP_GET_ACTION_AVAIL       2
> >
> >  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> > -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> > -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> > +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> > +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)
> 
> Since these are all getting indentation updates, can you switch them
> to BIT(0), BIT(1), etc?

Will do.

> >  /*
> >   * All BPF programs must return a 32-bit value.
> > @@ -35,6 +36,7 @@
> >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> >  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
> >  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >         __u64 args[6];
> >  };
> >
> > +struct seccomp_notif {
> > +       __u16 len;
> > +       __u64 id;
> > +       __u32 pid;
> > +       __u8 signaled;
> > +       struct seccomp_data data;
> > +};
> > +
> > +struct seccomp_notif_resp {
> > +       __u16 len;
> > +       __u64 id;
> > +       __s32 error;
> > +       __s64 val;
> > +};
> 
> So, len has to come first, for versioning. However, since it's ahead
> of a u64, this leaves a struct padding hole. pahole output:
> 
> struct seccomp_notif {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __u32                      pid;                  /*    16     4 */
>         __u8                       signaled;             /*    20     1 */
> 
>         /* XXX 3 bytes hole, try to pack */
> 
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> 
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 79, holes: 2, sum holes: 9 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __s32                      error;                /*    16     4 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         __s64                      val;                  /*    24     8 */
> 
>         /* size: 32, cachelines: 1, members: 4 */
>         /* sum members: 22, holes: 2, sum holes: 10 */
>         /* last cacheline: 32 bytes */
> };
> 
> How about making len u32, and moving pid and error above "id"? This
> leaves a hole after signaled, so changing "len" won't be sufficient
> for versioning here. Perhaps move it after data?

I'm not sure what you mean by "len won't be sufficient for versioning
here"? Anyway, I can do some packing on these; I didn't bother before
since I figured it's a userspace interface, so saving a few bytes
isn't a huge deal.

> > +
> > +#define SECCOMP_IOC_MAGIC              0xF7
> 
> Was there any specific reason for picking this value? There are lots
> of fun ASCII codes left like '!' or '*'. :)

No, ! it is :)

> > +
> > +/* Flags for seccomp notification fd ioctl. */
> > +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> > +                                       struct seccomp_notif)
> > +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> > +                                       struct seccomp_notif_resp)
> > +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> > +                                       __u64)
> 
> To match other UAPI ioctls, can these have a prefix of "SECCOMP_IOCTL_..."?
> 
> It may also be useful to match how other uapis do this, like for DRM:
> 
> #define DRM_IOCTL_BASE                  'd'
> #define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
> #define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
> #define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
> #define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)
> 
> #define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
> #define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
> #define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
> ...

Will do.

> 
> > +
> >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index fd023ac24e10..fa6fe9756c80 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -33,12 +33,78 @@
> >  #endif
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +#include <linux/file.h>
> >  #include <linux/filter.h>
> >  #include <linux/pid.h>
> >  #include <linux/ptrace.h>
> >  #include <linux/security.h>
> >  #include <linux/tracehook.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/anon_inodes.h>
> > +
> > +enum notify_state {
> > +       SECCOMP_NOTIFY_INIT,
> > +       SECCOMP_NOTIFY_SENT,
> > +       SECCOMP_NOTIFY_REPLIED,
> > +};
> > +
> > +struct seccomp_knotif {
> > +       /* The struct pid of the task whose filter triggered the notification */
> > +       struct task_struct *task;
> > +
> > +       /* The "cookie" for this request; this is unique for this filter. */
> > +       u64 id;
> > +
> > +       /* Whether or not this task has been given an interruptible signal. */
> > +       bool signaled;
> > +
> > +       /*
> > +        * The seccomp data. This pointer is valid the entire time this
> > +        * notification is active, since it comes from __seccomp_filter which
> > +        * eclipses the entire lifecycle here.
> > +        */
> > +       const struct seccomp_data *data;
> > +
> > +       /*
> > +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > +        * struct seccomp_knotif is created and starts out in INIT. Once the
> > +        * handler reads the notification off of an FD, it transitions to SENT.
> > +        * If a signal is received the state transitions back to INIT and
> > +        * another message is sent. When the userspace handler replies, state
> > +        * transitions to REPLIED.
> > +        */
> > +       enum notify_state state;
> > +
> > +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > +       int error;
> > +       long val;
> > +
> > +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > +       struct completion ready;
> > +
> > +       struct list_head list;
> > +};
> > +
> > +/**
> > + * struct notification - container for seccomp userspace notifications. Since
> > + * most seccomp filters will not have notification listeners attached and this
> > + * structure is fairly large, we store the notification-specific stuff in a
> > + * separate structure.
> > + *
> > + * @request: A semaphore that users of this notification can wait on for
> > + *           changes. Actual reads and writes are still controlled with
> > + *           filter->notify_lock.
> > + * @notify_lock: A lock for all notification-related accesses.
> > + * @next_id: The id of the next request.
> > + * @notifications: A list of struct seccomp_knotif elements.
> > + * @wqh: A wait queue for poll.
> > + */
> > +struct notification {
> > +       struct semaphore request;
> > +       u64 next_id;
> > +       struct list_head notifications;
> > +       wait_queue_head_t wqh;
> > +};
> >
> >  /**
> >   * struct seccomp_filter - container for seccomp BPF programs
> > @@ -66,6 +132,8 @@ struct seccomp_filter {
> >         bool log;
> >         struct seccomp_filter *prev;
> >         struct bpf_prog *prog;
> > +       struct notification *notif;
> > +       struct mutex notify_lock;
> >  };
> >
> >  /* Limit any path through the tree to 256KB worth of instructions. */
> > @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> >         if (!sfilter)
> >                 return ERR_PTR(-ENOMEM);
> >
> > +       mutex_init(&sfilter->notify_lock);
> >         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
> >                                         seccomp_check_filter, save_orig);
> >         if (ret < 0) {
> > @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
> >  #define SECCOMP_LOG_TRACE              (1 << 4)
> >  #define SECCOMP_LOG_LOG                        (1 << 5)
> >  #define SECCOMP_LOG_ALLOW              (1 << 6)
> > +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
> >
> >  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
> >                                     SECCOMP_LOG_KILL_THREAD  |
> >                                     SECCOMP_LOG_TRAP  |
> >                                     SECCOMP_LOG_ERRNO |
> > +                                   SECCOMP_LOG_USER_NOTIF |
> >                                     SECCOMP_LOG_TRACE |
> >                                     SECCOMP_LOG_LOG;
> >
> > @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> >         case SECCOMP_RET_TRACE:
> >                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
> >                 break;
> > +       case SECCOMP_RET_USER_NOTIF:
> > +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> > +               break;
> >         case SECCOMP_RET_LOG:
> >                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
> >                 break;
> > @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
> >  #else
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > +{
> > +       /* Note: overflow is ok here, the id just needs to be unique */
> 
> Maybe just clarify in the comment: unique to the filter.
> 
> > +       return filter->notif->next_id++;
> 
> Also, it might be useful to add for both documentation and lockdep:
> 
> lockdep_assert_held(&filter->notify_lock);
> 
> into this function?

Will do.

> 
> > +}
> > +
> > +static void seccomp_do_user_notification(int this_syscall,
> > +                                        struct seccomp_filter *match,
> > +                                        const struct seccomp_data *sd)
> > +{
> > +       int err;
> > +       long ret = 0;
> > +       struct seccomp_knotif n = {};
> > +
> > +       mutex_lock(&match->notify_lock);
> > +       err = -ENOSYS;
> > +       if (!match->notif)
> > +               goto out;
> > +
> > +       n.task = current;
> > +       n.state = SECCOMP_NOTIFY_INIT;
> > +       n.data = sd;
> > +       n.id = seccomp_next_notify_id(match);
> > +       init_completion(&n.ready);
> > +
> > +       list_add(&n.list, &match->notif->notifications);
> > +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > +
> > +       mutex_unlock(&match->notify_lock);
> > +       up(&match->notif->request);
> > +
> 
> Maybe add a big comment here saying this is where we're waiting for a reply?

Will do.

> > +       err = wait_for_completion_interruptible(&n.ready);
> > +       mutex_lock(&match->notify_lock);
> > +
> > +       /*
> > +        * Here it's possible we got a signal and then had to wait on the mutex
> > +        * while the reply was sent, so let's be sure there wasn't a response
> > +        * in the meantime.
> > +        */
> > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +               /*
> > +                * We got a signal. Let's tell userspace about it (potentially
> > +                * again, if we had already notified them about the first one).
> > +                */
> > +               n.signaled = true;
> > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > +                       n.state = SECCOMP_NOTIFY_INIT;
> > +                       up(&match->notif->request);
> > +               }
> > +               mutex_unlock(&match->notify_lock);
> > +               err = wait_for_completion_killable(&n.ready);
> > +               mutex_lock(&match->notify_lock);
> > +               if (err < 0)
> > +                       goto remove_list;
> > +       }
> > +
> > +       ret = n.val;
> > +       err = n.error;
> > +
> > +remove_list:
> > +       list_del(&n.list);
> > +out:
> > +       mutex_unlock(&match->notify_lock);
> > +       syscall_set_return_value(current, task_pt_regs(current),
> > +                                err, ret);
> > +}
> > +
> >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >                             const bool recheck_after_trace)
> >  {
> > @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >
> >                 return 0;
> >
> > +       case SECCOMP_RET_USER_NOTIF:
> > +               seccomp_do_user_notification(this_syscall, match, sd);
> > +               goto skip;
> 
> Nit: please add a blank line here (to match the other cases).
> 
> >         case SECCOMP_RET_LOG:
> >                 seccomp_log(this_syscall, 0, action, true);
> >                 return 0;
> > @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
> >  }
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > +                                 struct seccomp_filter *);
> 
> Why is the forward declaration needed instead of just moving the
> function here? I didn't see anything in it that looked like it
> couldn't move.

I think there was a cycle in some earlier version, but I agree there
isn't now. I'll fix it.

> > +
> >  /**
> >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> >   * @flags:  flags to change filter behavior
> > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> >         struct seccomp_filter *prepared = NULL;
> >         long ret = -EINVAL;
> > +       int listener = 0;
> 
> Nit: "invalid fd" should be -1, not 0.
> 
> > +       struct file *listener_f = NULL;
> >
> >         /* Validate flags. */
> >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               listener = get_unused_fd_flags(0);
> 
> As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?

Yep, will do.

> > +               if (listener < 0) {
> > +                       ret = listener;
> > +                       goto out_free;
> > +               }
> > +
> > +               listener_f = init_listener(current, prepared);
> > +               if (IS_ERR(listener_f)) {
> > +                       put_unused_fd(listener);
> > +                       ret = PTR_ERR(listener_f);
> > +                       goto out_free;
> > +               }
> > +       }
> > +
> >         /*
> >          * Make sure we cannot change seccomp or nnp state via TSYNC
> >          * while another thread is in the middle of calling exec.
> >          */
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > -               goto out_free;
> > +               goto out_put_fd;
> >
> >         spin_lock_irq(&current->sighand->siglock);
> >
> > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         spin_unlock_irq(&current->sighand->siglock);
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > +out_put_fd:
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               if (ret < 0) {
> > +                       fput(listener_f);
> > +                       put_unused_fd(listener);
> > +               } else {
> > +                       fd_install(listener, listener_f);
> > +                       ret = listener;
> > +               }
> > +       }
> 
> Can you update the kern-docs for seccomp_set_mode_filter(), since we
> can now return positive values?
> 
>  * Returns 0 on success or -EINVAL on failure.
> 
> (this shouldn't say only -EINVAL, I realize too)

Sure, I can fix both of these.

> I have to say, I'm vaguely nervous about changing the semantics here
> for passing back the fd as the return code from the seccomp() syscall.
> Alternatives seem less appealing, though: changing the meaning of the
> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> example. Hmm.

From my perspective we can drop this whole thing. The only thing I'll
ever use is the ptrace version. Someone at some point (I don't
remember who, maybe stgraber) suggested this version would be useful
as well.

Anyway, let me know if your nervousness outweighs this, I'm happy to
drop it.

> > @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
> >  device_initcall(seccomp_sysctl_init)
> >
> >  #endif /* CONFIG_SYSCTL */
> > +
> > +#ifdef CONFIG_SECCOMP_FILTER
> > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_knotif *knotif;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +
> > +       /*
> > +        * If this file is being closed because e.g. the task who owned it
> > +        * died, let's wake everyone up who was waiting on us.
> > +        */
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > +                       continue;
> > +
> > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +               knotif->error = -ENOSYS;
> > +               knotif->val = 0;
> > +
> > +               complete(&knotif->ready);
> > +       }
> > +
> > +       wake_up_all(&filter->notif->wqh);
> > +       kfree(filter->notif);
> > +       filter->notif = NULL;
> > +       mutex_unlock(&filter->notify_lock);
> 
> It looks like that means nothing waiting on knotif->ready can access
> filter->notif without rechecking it, yes?
> 
> e.g. in seccomp_do_user_notification() I see:
> 
>                         up(&match->notif->request);
> 
> I *think* this isn't reachable due to the test for n.state !=
> SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
> it's not fast-path, we could add a WARN_ON() while checking for
> unreplied signal death?
> 
>                 n.signaled = true;
>                 if (n.state == SECCOMP_NOTIFY_SENT) {
>                         n.state = SECCOMP_NOTIFY_INIT;
>                         if (!WARN_ON(match->notif))
>                             up(&match->notif->request);
>                 }
>                 mutex_unlock(&match->notify_lock);

So this code path should actually be safe, since notify_lock is held
throughout, as it is in the release handler. However, there is one just above
it that is not, because we do:

        mutex_unlock(&match->notify_lock);
        up(&match->notif->request);

When this was all a member of struct seccomp_filter the order didn't matter,
but now it very much does, and I think you're right that these statements need
to be reordered. There may be others; I'll check everything else as well.
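
One possible reordering, as a sketch (the wider audit still needs to happen):

        /* up() while notify_lock is still held, so the release handler
         * cannot free ->notif out from under us */
        up(&match->notif->request);
        mutex_unlock(&match->notify_lock);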

> 
> > +       __put_seccomp_filter(filter);
> > +       return 0;
> > +}
> > +
> > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL, *cur;
> > +       struct seccomp_notif unotif = {};
> > +       ssize_t ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> 
> I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
> anything from accidentally using "arg" directly here.

Will do.
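
For reference, a sketch of the dispatcher with the cast hoisted (the three
helpers would then take a void __user * instead of the raw unsigned long):

static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
                                 unsigned long arg)
{
        struct seccomp_filter *filter = file->private_data;
        void __user *buf = (void __user *)arg;

        switch (cmd) {
        case SECCOMP_NOTIF_RECV:
                return seccomp_notify_recv(filter, buf);
        case SECCOMP_NOTIF_SEND:
                return seccomp_notify_send(filter, buf);
        case SECCOMP_NOTIF_ID_VALID:
                return seccomp_notify_id_valid(filter, buf);
        default:
                return -EINVAL;
        }
}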

> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +
> > +       ret = down_interruptible(&filter->notif->request);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > +                       knotif = cur;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If we didn't find a notification, it could be that the task was
> > +        * interrupted between the time we were woken and when we were able to
> > +        * acquire the rw lock.
> > +        */
> > +       if (!knotif) {
> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       size = min_t(size_t, size, sizeof(unotif));
> > +
> 
> It is possible (though unlikely given the type widths involved here)
> for unotif = {} to not initialize padding, so I would recommend an
> explicit memset(&unotif, 0, sizeof(unotif)) here.

Orly? I didn't know that, thanks.

> > +       unotif.len = size;
> > +       unotif.id = knotif->id;
> > +       unotif.pid = task_pid_vnr(knotif->task);
> > +       unotif.signaled = knotif->signaled;
> > +       unotif.data = *(knotif->data);
> > +
> > +       if (copy_to_user(buf, &unotif, size)) {
> > +               ret = -EFAULT;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> 
> Is there some way to rearrange the locking here to avoid holding the
> mutex while doing copy_to_user() (which userspace could block with
> userfaultfd, and then stall all the other notifications for this
> filter)?

Yes, I don't think it'll cause any problems to release the lock earlier.
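Roughly, the idea would be (sketch only; rolling the state back if the copy
faults is omitted here):

        /* fill in unotif and flip the state while holding notify_lock ... */
        knotif->state = SECCOMP_NOTIFY_SENT;
        wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
        mutex_unlock(&filter->notify_lock);

        /* ... and only touch userspace memory afterwards */
        if (copy_to_user(buf, &unotif, size))
                return -EFAULT;

        return size;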

> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> 
> Same cast note as above.
> 
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +       size = min_t(size_t, size, sizeof(resp));
> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> 
> For sanity checking on a double-read from userspace, please add:
> 
>     if (resp.len != size)
>         return -EINVAL;

Won't that fail if sizeof(resp) < resp.len, because of the min_t()?

> > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > +                                   unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL;
> > +       void __user *buf = (void __user *)arg;
> > +       u64 id;
> > +       long ret;
> > +
> > +       if (copy_from_user(&id, buf, sizeof(id)))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -1;
> 
> Isn't this EPERM? Shouldn't it be -ENOENT?

Yes, I wasn't thinking of errno here, I'll switch it.

> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == id) {
> > +                       ret = 0;
> > +                       goto out;
> > +               }
> > +       }
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> > +                                unsigned long arg)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +
> > +       switch (cmd) {
> > +       case SECCOMP_NOTIF_RECV:
> > +               return seccomp_notify_recv(filter, arg);
> > +       case SECCOMP_NOTIF_SEND:
> > +               return seccomp_notify_send(filter, arg);
> > +       case SECCOMP_NOTIF_ID_VALID:
> > +               return seccomp_notify_id_valid(filter, arg);
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> 
> My eyes! :) Can you wrap the bit operations in parens here?
> 
> > +                       break;
> > +       }
> 
> Should POLLERR be handled here too? I don't quite see the conditions
> that might be exposed? All the processes die for the filter, which
> does what here?

I think it shouldn't do anything, because I was thinking of the semantics of
poll() as "when a tracee does a syscall that matches, fire". So a task could
start, never make a targeted syscall, and exit, and poll() shouldn't return a
value. Maybe it's useful to write that down somewhere, though.
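
So a handler that wants to block would just do the obvious thing; a fragment,
assuming `listener` and `req` as in the selftests:

        struct pollfd pfd = { .fd = listener, .events = POLLIN };

        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                ioctl(listener, SECCOMP_NOTIF_RECV, &req);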

> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> > +
> > +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> > +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> > +
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
> 
> Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
> been wondering what it's for, and now I see it's kind of an advisory
> "is the other end still alive?" test.

Yes, in fact it's necessary for avoiding races. There are some comments in the
sample code, but I'll update seccomp_filter.rst too.
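
A rough sketch of the ordering that avoids the pid-reuse race (illustrative
only; `mem_path` and `memfd` are made-up names, and the sample code has the
authoritative version):

        ioctl(listener, SECCOMP_NOTIF_RECV, &req);
        memfd = open(mem_path, O_RDONLY);  /* mem_path built from req.pid */
        if (ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id) != 0) {
                /* the tracee died; req.pid may already belong to someone else */
                close(memfd);
                return;
        }
        /* memfd is now known to refer to the task that triggered req */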

> > +
> > +       resp.id = req.id;
> > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > +       EXPECT_EQ(ret, -1);
> > +       EXPECT_EQ(errno, ENOENT);
> > +
> > +       /*
> > +        * Check that we get another notification about a signal in the middle
> > +        * of a syscall.
> > +        */
> > +       pid = fork();
> > +       ASSERT_GE(pid, 0);
> > +
> > +       if (pid == 0) {
> > +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> > +                       perror("signal");
> > +                       exit(1);
> > +               }
> > +               ret = syscall(__NR_getpid);
> > +               exit(ret != USER_NOTIF_MAGIC);
> > +       }
> > +
> > +       ret = read_notif(listener, &req);
> > +       EXPECT_EQ(ret, sizeof(req));
> > +       EXPECT_EQ(errno, 0);
> > +
> > +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > +
> > +       ret = read_notif(listener, &req);
> > +       EXPECT_EQ(req.signaled, 1);
> > +       EXPECT_EQ(ret, sizeof(req));
> > +       EXPECT_EQ(errno, 0);
> > +
> > +       resp.len = sizeof(resp);
> > +       resp.id = req.id;
> > +       resp.error = -512; /* -ERESTARTSYS */
> > +       resp.val = 0;
> > +
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> > +
> > +       ret = read_notif(listener, &req);
> > +       resp.len = sizeof(resp);
> > +       resp.id = req.id;
> > +       resp.error = 0;
> > +       resp.val = USER_NOTIF_MAGIC;
> > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> 
> I was slightly confused here: why have there been 3 reads? I was
> expecting one notification for hitting getpid and one from catching a
> signal. But in rereading, I see that NOTIF_RECV will return the most
> recently unresponded notification, yes?

The three reads are:

1. original syscall
# send SIGUSR1
2. another notif with signaled set
# respond with -ERESTARTSYS to make sure that works
3. this is the result of -ERESTARTSYS

> But... catching a signal replaces the existing seccomp_knotif? I
> remain confused about how signal handling is meant to work here. What
> happens if two signals get sent? It looks like you just block without
> allowing more signals? (Thank you for writing the tests!)

Yes, that's the idea. This is an implementation of Andy's pseudocode:
https://lkml.org/lkml/2018/3/15/1122

> (And can you document the expected behavior in the seccomp_filter.rst too?)

Will do.

> 
> Looking good!

Thanks for your review!

Tycho
Tycho Andersen Sept. 27, 2018, 11:04 p.m. UTC | #5
On Thu, Sep 27, 2018 at 11:51:40PM +0200, Jann Horn wrote:
> +Christoph Hellwig, Al Viro, fsdevel: For two questions about the poll
> interface (search for "seccomp_notify_poll" and
> "seccomp_notify_release" in the patch)
> 
> @Tycho: FYI, I've gone through all of v7 now, apart from the
> test/sample code. So don't wait for more comments from me before
> sending out v8.

(assuming you meant v8 -> v9) Yes, thanks for your reviews! Much
appreciated.

> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> 
> Note that in that case, the trusted runtime needs to be in the same
> mount namespace as the container. mount() doesn't work on the mount
> structure of a foreign mount namespace; check_mnt() specifically
> checks for this case, and I think pretty much everything in
> sys_mount() uses that check. So you'd have to join the container's
> mount namespace before forwarding a mount syscall.

Yep, Serge came up with a pretty neat trick that we used in LXD to
accomplish sending mounts to containers, but it requires some
coordination up front.

> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here,
> 
> Actually, it doesn't, right? It would apply if you told the kernel "go
> ahead, that syscall is fine", but that's not how the API works - you
> always intercept the syscall, copy argument data to a trusted tracer,
> and then the tracer can make a replacement syscall. Sounds fine to me.

Right, I guess the point here is just "you need to copy all the data
to the tracer *before* making a policy decision".

> > but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> [...]
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> [...]
> > +which (on success) will return a listener fd for the filter, which can then be
> > +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> > +acquired via:
> > +
> > +.. code-block::
> > +
> > +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> 
> The manpage documents ptrace() as taking four arguments, not three. I
> know that the header defines it with varargs, but it would probably be
> more useful to require passing in zero as the fourth argument so that
> we have a place to stick flags if necessary in the future.

Yep, I'll fix this, thanks. But also this documentation should really
live in the seccomp patch; some rebase got screwed up somewhere.

> > +which grabs the 0th filter for some task which the tracer has privilege over.
> > +Note that filter fds correspond to a particular filter, and not a particular
> > +task. So if this task then forks, notifications from both tasks will appear on
> > +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> > +so a filter fd can safely have many readers.
> 
> Add a note about needing CAP_SYS_ADMIN here? Also, might be useful to
> clarify in which direction "nth filter" counts.

Will do.

> > +The interface for a seccomp notification fd consists of two structures:
> > +
> > +.. code-block::
> > +
> > +    struct seccomp_notif {
> > +        __u16 len;
> > +        __u64 id;
> > +        pid_t pid;
> > +        __u8 signalled;
> > +        struct seccomp_data data;
> > +    };
> > +
> > +    struct seccomp_notif_resp {
> > +        __u16 len;
> > +        __u64 id;
> > +        __s32 error;
> > +        __s64 val;
> > +    };
> > +
> > +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> > +notification fd to receive a ``struct seccomp_notif``, which contains five
> > +members: the input length of the structure, a unique-per-filter ``id``, the
> > +``pid`` of the task which triggered this request (which may be 0 if the task is
> > +in a pid ns not visible from the listener's pid namespace), a flag representing
> > +whether or not the notification is a result of a non-fatal signal, and the
> > +``data`` passed to seccomp. Userspace can then make a decision based on this
> > +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> > +indicating what should be returned to userspace. The ``id`` member of ``struct
> > +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> > +
> > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > +arguments to the syscall, but does not contain pointers to memory. The task's
> > +memory is accessible to suitably privileged traces via ``ptrace()`` or
> > +``/proc/pid/map_files/``.
> 
> You probably don't actually want to use /proc/pid/map_files here; you
> can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
> And while reading memory via ptrace() is possible, the interface is
> really ugly (e.g. you can only read data in 4-byte chunks), and your
> caveat about locking out other ptracers (or getting locked out by
> them) applies. I'm not even sure if you could read memory via ptrace
> while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
> requires the target to be in a __TASK_TRACED state.
> The two interfaces you might want to use instead are /proc/$pid/mem
> and process_vm_{readv,writev}, which allow you to do nice,
> arbitrarily-sized, vectored IO on the memory of another process.

Yes, in fact the sample code does use /proc/$pid/mem, but the docs
should be correct :)
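
For completeness, a fragment of how that read might look (path, buffer sizes
and the choice of args[0] are illustrative, not lifted from the sample):

        char path[64], buf[4096];
        int memfd;

        snprintf(path, sizeof(path), "/proc/%d/mem", req.pid);
        memfd = open(path, O_RDONLY);
        if (memfd >= 0) {
                /* e.g. fetch a string argument whose pointer is in args[0] */
                pread(memfd, buf, sizeof(buf), req.data.args[0]);
                close(memfd);
        }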

> > +       /*
> > +        * Here it's possible we got a signal and then had to wait on the mutex
> > +        * while the reply was sent, so let's be sure there wasn't a response
> > +        * in the meantime.
> > +        */
> > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +               /*
> > +                * We got a signal. Let's tell userspace about it (potentially
> > +                * again, if we had already notified them about the first one).
> > +                */
> > +               n.signaled = true;
> > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > +                       n.state = SECCOMP_NOTIFY_INIT;
> > +                       up(&match->notif->request);
> > +               }
> 
> Do you need another wake_up_poll() here?

Yes! Good point.

> > +               mutex_unlock(&match->notify_lock);
> > +               err = wait_for_completion_killable(&n.ready);
> > +               mutex_lock(&match->notify_lock);
> > +               if (err < 0)
> > +                       goto remove_list;
> 
> Add a comment here explaining that we intentionally leave the
> semaphore count too high (because otherwise we'd have to block), and
> seccomp_notify_recv() compensates for that?

Will do.
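
Something along these lines, perhaps (just a sketch of the comment):

        /*
         * We intentionally leave the semaphore count one too high here:
         * fixing it up would mean blocking, and seccomp_notify_recv()
         * already compensates by tolerating a wakeup with no INIT-state
         * notification to hand out (it returns -ENOENT).
         */
        if (err < 0)
                goto remove_list;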

> > +       }
> > +
> > +       ret = n.val;
> > +       err = n.error;
> > +
> > +remove_list:
> > +       list_del(&n.list);
> > +out:
> > +       mutex_unlock(&match->notify_lock);
> > +       syscall_set_return_value(current, task_pt_regs(current),
> > +                                err, ret);
> > +}
> > +
> >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >                             const bool recheck_after_trace)
> >  {
> [...]
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > +                                 struct seccomp_filter *);
> > +
> >  /**
> >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> >   * @flags:  flags to change filter behavior
> > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> >         struct seccomp_filter *prepared = NULL;
> >         long ret = -EINVAL;
> > +       int listener = 0;
> > +       struct file *listener_f = NULL;
> >
> >         /* Validate flags. */
> >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               listener = get_unused_fd_flags(0);
> > +               if (listener < 0) {
> > +                       ret = listener;
> > +                       goto out_free;
> > +               }
> > +
> > +               listener_f = init_listener(current, prepared);
> > +               if (IS_ERR(listener_f)) {
> > +                       put_unused_fd(listener);
> > +                       ret = PTR_ERR(listener_f);
> > +                       goto out_free;
> > +               }
> > +       }
> > +
> >         /*
> >          * Make sure we cannot change seccomp or nnp state via TSYNC
> >          * while another thread is in the middle of calling exec.
> >          */
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > -               goto out_free;
> > +               goto out_put_fd;
> >
> >         spin_lock_irq(&current->sighand->siglock);
> >
> > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         spin_unlock_irq(&current->sighand->siglock);
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > +out_put_fd:
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               if (ret < 0) {
> > +                       fput(listener_f);
> > +                       put_unused_fd(listener);
> > +               } else {
> > +                       fd_install(listener, listener_f);
> > +                       ret = listener;
> > +               }
> > +       }
> >  out_free:
> >         seccomp_filter_free(prepared);
> >         return ret;
> [...]
> > +
> > +#ifdef CONFIG_SECCOMP_FILTER
> > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_knotif *knotif;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +
> > +       /*
> > +        * If this file is being closed because e.g. the task who owned it
> > +        * died, let's wake everyone up who was waiting on us.
> > +        */
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > +                       continue;
> > +
> > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +               knotif->error = -ENOSYS;
> > +               knotif->val = 0;
> > +
> > +               complete(&knotif->ready);
> > +       }
> > +
> > +       wake_up_all(&filter->notif->wqh);
> 
> If select() is polling us, a reference to the open file is being held,
> and this can't be reached; and I think if epoll is polling us,
> eventpoll_release() will remove itself from the wait queue, right? So
> can this wake_up_all() actually ever notify anyone?

I don't know actually, I just thought better safe than sorry. I can
drop it, though.

> > +       kfree(filter->notif);
> > +       filter->notif = NULL;
> > +       mutex_unlock(&filter->notify_lock);
> > +       __put_seccomp_filter(filter);
> > +       return 0;
> > +}
> > +
> > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL, *cur;
> > +       struct seccomp_notif unotif = {};
> > +       ssize_t ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +
> > +       ret = down_interruptible(&filter->notif->request);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > +                       knotif = cur;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If we didn't find a notification, it could be that the task was
> > +        * interrupted between the time we were woken and when we were able to
> 
> s/interrupted/interrupted by a fatal signal/ ?
> 
> > +        * acquire the rw lock.
> 
> State more explicitly here that we are compensating for an incorrectly
> high semaphore count?

Will do, thanks.

> > +        */
> > +       if (!knotif) {
> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       size = min_t(size_t, size, sizeof(unotif));
> > +
> > +       unotif.len = size;
> > +       unotif.id = knotif->id;
> > +       unotif.pid = task_pid_vnr(knotif->task);
> > +       unotif.signaled = knotif->signaled;
> > +       unotif.data = *(knotif->data);
> > +
> > +       if (copy_to_user(buf, &unotif, size)) {
> > +               ret = -EFAULT;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +       size = min_t(size_t, size, sizeof(resp));
> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == resp.id)
> > +                       break;
> > +       }
> > +
> > +       if (!knotif || knotif->id != resp.id) {
> 
> Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> NULL here. If `filter->notif->notifications` is empty, I think
> `knotif` will be `container_of(&filter->notif->notifications, struct
> seccom_knotif, list)` - in other words, you'll have a type confusion,
> and `knotif` probably points into some random memory in front of
> `filter->notif`.
> 
> Am I missing something?

No, I just flubbed the list API.

> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       /* Allow exactly one reply. */
> > +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > +               ret = -EINPROGRESS;
> > +               goto out;
> > +       }
> 
> This means that if seccomp_do_user_notification() has in the meantime
> received a signal and transitioned from SENT back to INIT, this will
> fail, right? So we fail here, then we read the new notification, and
> then we can retry SECCOMP_NOTIF_SEND? Is that intended?

I think so, the idea being that you might want to do something
different if a signal was sent. But Andy seemed to think that we might
not actually do anything different.

Either way, for the case you describe, EINPROGRESS is a little weird.
Perhaps it should be:

if (knotif->state == SECCOMP_NOTIFY_INIT) {
        ret = -EBUSY; /* or something? */
        goto out;
} else if (knotif->state == SECCOMP_NOTIFY_REPLIED) {
        ret = -EINPROGRESS;
        goto out;
}

?

> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +       knotif->error = resp.error;
> > +       knotif->val = resp.val;
> > +       complete(&knotif->ready);
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > +                                   unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL;
> > +       void __user *buf = (void __user *)arg;
> > +       u64 id;
> > +       long ret;
> > +
> > +       if (copy_from_user(&id, buf, sizeof(id)))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -1;
> 
> In strace, this is going to show up as EPERM. Maybe use something like
> -ENOENT instead? Or whatever you think resembles a fitting error
> number.

Yep, will do.

> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == id) {
> > +                       ret = 0;
> 
> Would it make sense to treat notifications that have already been
> replied to as invalid?

I suppose so, since we aren't going to let you reply to them anyway.

> > +                       goto out;
> > +               }
> > +       }
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> [...]
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> 
> Looking at the callers of vfs_poll(), as far as I can tell, a poll
> handler is not allowed to return error codes. Perhaps someone who
> knows the poll interface better can weigh in here. I've CCed some
> people who should hopefully know better how this stuff works.

Thanks.

> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> > +                       break;
> > +       }
> > +
> > +       mutex_unlock(&filter->notify_lock);
> > +
> > +       return ret;
> > +}
> > +
> > +static const struct file_operations seccomp_notify_ops = {
> > +       .poll = seccomp_notify_poll,
> > +       .release = seccomp_notify_release,
> > +       .unlocked_ioctl = seccomp_notify_ioctl,
> > +};
> > +
> > +static struct file *init_listener(struct task_struct *task,
> > +                                 struct seccomp_filter *filter)
> > +{
> 
> Why does this function take a `task` pointer instead of always
> accessing `current`? If `task` actually wasn't `current`, I would have
> concurrency concerns. A comment in seccomp.h even explains:
> 
>  *          @filter must only be accessed from the context of current as there
>  *          is no read locking.
> 
> Unless there's a good reason for it, I would prefer it if this
> function didn't take a `task` pointer.

I think Kees replied already, but yes, this is a good point :(. We can
continue in his thread.

> > +       struct file *ret = ERR_PTR(-EBUSY);
> > +       struct seccomp_filter *cur, *last_locked = NULL;
> > +       int filter_nesting = 0;
> > +
> > +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> > +               filter_nesting++;
> > +               last_locked = cur;
> > +               if (cur->notif)
> > +                       goto out;
> > +       }
> > +
> > +       ret = ERR_PTR(-ENOMEM);
> > +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> 
> sizeof(struct notification) instead, to make the code clearer?
> 
> > +       if (!filter->notif)
> > +               goto out;
> > +
> > +       sema_init(&filter->notif->request, 0);
> > +       INIT_LIST_HEAD(&filter->notif->notifications);
> > +       filter->notif->next_id = get_random_u64();
> > +       init_waitqueue_head(&filter->notif->wqh);
> 
> Nit: next_id and notifications are declared in reverse order in the
> struct. Could you flip them around here?

Sure, will do.

> > +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> > +                                filter, O_RDWR);
> > +       if (IS_ERR(ret))
> > +               goto out;
> > +
> > +
> > +       /* The file has a reference to it now */
> > +       __get_seccomp_filter(filter);
> 
> __get_seccomp_filter() has a comment in it that claims "/* Reference
> count is bounded by the number of total processes. */". I think this
> change invalidates that comment. I think it should be fine to just
> remove the comment.

Will do, thanks.

> > +out:
> > +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> 
> s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> here, something went very wrong.

I suppose so, given that we have last_locked.

Tycho
Tycho Andersen Sept. 27, 2018, 11:08 p.m. UTC | #6
On Thu, Sep 27, 2018 at 03:45:11PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 2:51 PM, Jann Horn <jannh@google.com> wrote:
> > On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >> However, care should be taken to avoid the TOCTOU
> >> +mentioned above in this document: all arguments being read from the tracee's
> >> +memory should be read into the tracer's memory before any policy decisions are
> >> +made. This allows for an atomic decision on syscall arguments.
> >
> > Again, I don't really see how you could get this wrong.
> 
> Doesn't hurt to mention it, IMO.
> 
> >> +static long seccomp_notify_send(struct seccomp_filter *filter,
> >> +                               unsigned long arg)
> >> +{
> >> +       struct seccomp_notif_resp resp = {};
> >> +       struct seccomp_knotif *knotif = NULL;
> >> +       long ret;
> >> +       u16 size;
> >> +       void __user *buf = (void __user *)arg;
> >> +
> >> +       if (copy_from_user(&size, buf, sizeof(size)))
> >> +               return -EFAULT;
> >> +       size = min_t(size_t, size, sizeof(resp));
> >> +       if (copy_from_user(&resp, buf, size))
> >> +               return -EFAULT;
> >> +
> >> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> >> +       if (ret < 0)
> >> +               return ret;
> >> +
> >> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> >> +               if (knotif->id == resp.id)
> >> +                       break;
> >> +       }
> >> +
> >> +       if (!knotif || knotif->id != resp.id) {
> >
> > Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> > NULL here. If `filter->notif->notifications` is empty, I think
> > `knotif` will be `container_of(&filter->notif->notifications, struct
> > seccom_knotif, list)` - in other words, you'll have a type confusion,
> > and `knotif` probably points into some random memory in front of
> > `filter->notif`.
> >
> > Am I missing something?
> 
> Oh, good catch. This just needs to be fixed like it's done in
> seccomp_notif_recv (separate cur and knotif).
> 
> >> +static struct file *init_listener(struct task_struct *task,
> >> +                                 struct seccomp_filter *filter)
> >> +{
> >
> > Why does this function take a `task` pointer instead of always
> > accessing `current`? If `task` actually wasn't `current`, I would have
> > concurrency concerns. A comment in seccomp.h even explains:
> >
> >  *          @filter must only be accessed from the context of current as there
> >  *          is no read locking.
> >
> > Unless there's a good reason for it, I would prefer it if this
> > function didn't take a `task` pointer.
> 
> This is to support PTRACE_SECCOMP_NEW_LISTENER.
> 
> But you make an excellent point. Even TSYNC expects to operate only on
> the current thread group. Hmm.
> 
> While the process is stopped by ptrace, we could, in theory, update
> task->seccomp.filter via something like TSYNC.
> 
> So perhaps use:
> 
> mutex_lock_killable(&task->signal->cred_guard_mutex);
> 
> before walking the notify_locks?

This means that all the seccomp/ptrace code probably needs to be
updated for this? I'll try to send patches for this as well as the
return code thing Jann pointed out.

> >
> >> +       struct file *ret = ERR_PTR(-EBUSY);
> >> +       struct seccomp_filter *cur, *last_locked = NULL;
> >> +       int filter_nesting = 0;
> >> +
> >> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> >> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> >> +               filter_nesting++;
> >> +               last_locked = cur;
> >> +               if (cur->notif)
> >> +                       goto out;
> >> +       }
> >> +
> >> +       ret = ERR_PTR(-ENOMEM);
> >> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> >
> > sizeof(struct notification) instead, to make the code clearer?
> 
> I prefer what Tycho has: I want to allocate an instances of whatever
> filter->notif is.
> 
> Though, let's do the kzalloc outside of the locking, instead?

Yep, sounds good.

> >> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> >> +                                filter, O_RDWR);
> >> +       if (IS_ERR(ret))
> >> +               goto out;
> >> +
> >> +
> >> +       /* The file has a reference to it now */
> >> +       __get_seccomp_filter(filter);
> >
> > __get_seccomp_filter() has a comment in it that claims "/* Reference
> > count is bounded by the number of total processes. */". I think this
> > change invalidates that comment. I think it should be fine to just
> > remove the comment.
> 
> Update it to "bounded by total processes and notification listeners"?

Will do.

> >> +out:
> >> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> >
> > s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> > here, something went very wrong.
> 
> Hm? This is correct. This is how seccomp_run_filters() walks the list too:
> 
>         struct seccomp_filter *f =
>                         READ_ONCE(current->seccomp.filter);
>         ...
>         for (; f; f = f->prev) {
> 
> Especially if we'll be holding the cred_guard_mutex.

There is a last_locked local here though; I think that's what Jann is
pointing out.

Cheers,

Tycho
Kees Cook Sept. 27, 2018, 11:10 p.m. UTC | #7
On Thu, Sep 27, 2018 at 3:48 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
>> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> struct seccomp_notif {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __u32                      pid;                  /*    16     4 */
>>         __u8                       signaled;             /*    20     1 */
>>
>>         /* XXX 3 bytes hole, try to pack */
>>
>>         struct seccomp_data        data;                 /*    24    64 */
>>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>>
>>         /* size: 88, cachelines: 2, members: 5 */
>>         /* sum members: 79, holes: 2, sum holes: 9 */
>>         /* last cacheline: 24 bytes */
>> };
>> struct seccomp_notif_resp {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __s32                      error;                /*    16     4 */
>>
>>         /* XXX 4 bytes hole, try to pack */
>>
>>         __s64                      val;                  /*    24     8 */
>>
>>         /* size: 32, cachelines: 1, members: 4 */
>>         /* sum members: 22, holes: 2, sum holes: 10 */
>>         /* last cacheline: 32 bytes */
>> };
>>
>> How about making len u32, and moving pid and error above "id"? This
>> leaves a hole after signaled, so changing "len" won't be sufficient
>> for versioning here. Perhaps move it after data?
>
> I'm not sure what you mean by "len won't be sufficient for versioning
> here"? Anyway, I can do some packing on these; I didn't bother before
> since I figured it's a userspace interface, so saving a few bytes
> isn't a huge deal.

I was thinking the "len" portion was for determining if the API ever
changes in the future. My point was that given the padding holes, e.g.
adding a u8 after signaled, "len" wouldn't change, so the kernel might
expect to start reading something after signaled that it wasn't
checking before, but the len would be the same.

>> I have to say, I'm vaguely nervous about changing the semantics here
>> for passing back the fd as the return code from the seccomp() syscall.
>> Alternatives seem less appealing, though: changing the meaning of the
>> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
>> example. Hmm.
>
> From my perspective we can drop this whole thing. The only thing I'll
> ever use is the ptrace version. Someone at some point (I don't
> remember who, maybe stgraber) suggested this version would be useful
> as well.

Well that would certainly change the exposure of the interface pretty
drastically. :)

So, let's talk more about this, as it raises another thought I had
too: for the PTRACE interface to work, you have to know specifically
which filter you want to get notifications for. Won't that be slightly
tricky?

> Anyway, let me know if your nervousness outweighs this, I'm happy to
> drop it.

I'm not opposed to keeping it, but if you don't think anyone will use
it ... we should probably drop it just to avoid the complexity. It's a
cool API, though, so I'd like to hear from others first before you go
tearing it out. ;) (stgraber added to CC)

>> It is possible (though unlikely given the type widths involved here)
>> for unotif = {} to not initialize padding, so I would recommend an
>> explicit memset(&unotif, 0, sizeof(unotif)) here.
>
> Orly? I didn't know that, thanks.

Yeah, it's a pretty annoying C-ism. The spec says that struct
_members_ will get zero-initialized, but it doesn't say anything about
padding. >_< In most cases, the padding gets initialized too, just
because of bit widths being small enough that they're caught in the
member initialization that the compiler does. But for REALLY big
holes, they may get missed. In this case, while the padding is small,
it's directly exposed to userspace, so I want to make it robust.
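
In concrete terms, the requested change is just (sketch):

        struct seccomp_notif unotif;

        /* unlike "= {}", memset() clears the padding bytes as well */
        memset(&unotif, 0, sizeof(unotif));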

>> > +       if (copy_from_user(&size, buf, sizeof(size)))
>> > +               return -EFAULT;
>> > +       size = min_t(size_t, size, sizeof(resp));
>> > +       if (copy_from_user(&resp, buf, size))
>> > +               return -EFAULT;
>>
>> For sanity checking on a double-read from userspace, please add:
>>
>>     if (resp.len != size)
>>         return -EINVAL;
>
> Won't that fail if sizeof(resp) < resp.len, because of the min_t()?

Ah, true. In that case, probably do resp.len = size to avoid any logic
failures due to the double-read? I just want to avoid any chance of
confusing the size and actually using it somewhere.
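
i.e. something like (sketch):

        size = min_t(size_t, size, sizeof(resp));
        if (copy_from_user(&resp, buf, size))
                return -EFAULT;
        /* only ever use the clamped size from the first read */
        resp.len = size;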

-Kees
Jann Horn Sept. 27, 2018, 11:37 p.m. UTC | #8
On Fri, Sep 28, 2018 at 1:04 AM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 11:51:40PM +0200, Jann Horn wrote:
> > > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > > +arguments to the syscall, but does not contain pointers to memory. The task's
> > > +memory is accessible to suitably privileged traces via ``ptrace()`` or
> > > +``/proc/pid/map_files/``.
> >
> > You probably don't actually want to use /proc/pid/map_files here; you
> > can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
> > And while reading memory via ptrace() is possible, the interface is
> > really ugly (e.g. you can only read data in 4-byte chunks), and your
> > caveat about locking out other ptracers (or getting locked out by
> > them) applies. I'm not even sure if you could read memory via ptrace
> > while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
> > requires the target to be in a __TASK_TRACED state.
> > The two interfaces you might want to use instead are /proc/$pid/mem
> > and process_vm_{readv,writev}, which allow you to do nice,
> > arbitrarily-sized, vectored IO on the memory of another process.
>
> Yes, in fact the sample code does use /proc/$pid/mem, but the docs
> should be correct :)

Please also mention the process_vm_readv/writev syscalls though, given
that fast access to remote processes is what they were made for.

> > > +#ifdef CONFIG_SECCOMP_FILTER
> > > +static int seccomp_notify_release(struct inode *inode, struct file *file)
[...]
> > > +       wake_up_all(&filter->notif->wqh);
> >
> > If select() is polling us, a reference to the open file is being held,
> > and this can't be reached; and I think if epoll is polling us,
> > eventpoll_release() will remove itself from the wait queue, right? So
> > can this wake_up_all() actually ever notify anyone?
>
> I don't know actually, I just thought better safe than sorry. I can
> drop it, though.

Let's see if any fs people have some insight...

> > > +               ret = -ENOENT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       /* Allow exactly one reply. */
> > > +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > > +               ret = -EINPROGRESS;
> > > +               goto out;
> > > +       }
> >
> > This means that if seccomp_do_user_notification() has in the meantime
> > received a signal and transitioned from SENT back to INIT, this will
> > fail, right? So we fail here, then we read the new notification, and
> > then we can retry SECCOMP_NOTIF_SEND? Is that intended?
>
> I think so, the idea being that you might want to do something
> different if a signal was sent. But Andy seemed to think that we might
> not actually do anything different.

If you already have the proper response ready, you'd probably want to
just go through with it, no? Otherwise you'll just end up re-emulating
the syscall afterwards for no good reason. If you noticed the
interruption in the middle of the emulated syscall, that'd be
different, but since this is the case where we're already done with
the emulation and getting ready to continue...
Tycho Andersen Sept. 28, 2018, 2:39 p.m. UTC | #9
On Thu, Sep 27, 2018 at 04:10:29PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 3:48 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > I'm not sure what you mean by "len won't be sufficient for versioning
> > here"? Anyway, I can do some packing on these; I didn't bother before
> > since I figured it's a userspace interface, so saving a few bytes
> > isn't a huge deal.
> 
> I was thinking the "len" portion was for determining if the API ever
> changes in the future. My point was that given the padding holes, e.g.
> adding a u8 after signaled, "len" wouldn't change, so the kernel might
> expect to starting reading something after signaled that it wasn't
> checking before, but the len would be the same.

Oh, yeah. That's ugly :(

> >> I have to say, I'm vaguely nervous about changing the semantics here
> >> for passing back the fd as the return code from the seccomp() syscall.
> >> Alternatives seem less appealing, though: changing the meaning of the
> >> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> >> example. Hmm.
> >
> > From my perspective we can drop this whole thing. The only thing I'll
> > ever use is the ptrace version. Someone at some point (I don't
> > remember who, maybe stgraber) suggested this version would be useful
> > as well.
> 
> Well that would certainly change the exposure of the interface pretty
> drastically. :)
> 
> So, let's talk more about this, as it raises another thought I had
> too: for the PTRACE interface to work, you have to know specifically
> which filter you want to get notifications for. Won't that be slightly
> tricky?

Not necessarily. The way I imagine using it is:

1. container manager forks init task
2. init task does a bunch of setup stuff, then installs the filter
3. optionally install any user specified filter (or just merge the
   filter with step 2 instead of chaining them)
4. container manager grabs the listener fd from the container init via
   ptrace
5. init execs the user specified init

So the offset will always be known, at least in my use case. The
container manager doesn't want to install the filter on itself, so it
won't use NEW_LISTENER. Similarly, we don't want init to use
NEW_LISTENER, because if the user has decided to block sendmsg as part
of their policy, there's no way to get the fd out.
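
For step 4, the grab itself would be tiny; a sketch, glossing over the usual
attach/stop details and assuming the PTRACE_SECCOMP_NEW_LISTENER request from
this series (with 0 passed explicitly as the fourth argument, per Jann's
comment on the docs):

        int listener;

        ptrace(PTRACE_ATTACH, pid, 0, 0);
        waitpid(pid, NULL, 0);
        /* the addr argument picks which filter; here the 0th one */
        listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0, 0);
        ptrace(PTRACE_DETACH, pid, 0, 0);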

> > Anyway, let me know if your nervousness outweighs this, I'm happy to
> > drop it.
> 
> I'm not opposed to keeping it, but if you don't think anyone will use
> it ... we should probably drop it just to avoid the complexity. It's a
> cool API, though, so I'd like to hear from others first before you go
> tearing it out. ;) (stgraber added to CC)

It does seem useful for lighter weight cases than a container. The "I
want to run some random binary that I don't have the source for that
tries to make some privileged calls it doesn't really need" case. But
as a Container Guy I think I have in my contract somewhere that I have
to use containers :). But let's see what people think.

> >> > +       if (copy_from_user(&size, buf, sizeof(size)))
> >> > +               return -EFAULT;
> >> > +       size = min_t(size_t, size, sizeof(resp));
> >> > +       if (copy_from_user(&resp, buf, size))
> >> > +               return -EFAULT;
> >>
> >> For sanity checking on a double-read from userspace, please add:
> >>
> >>     if (resp.len != size)
> >>         return -EINVAL;
> >
> > Won't that fail if sizeof(resp) < resp.len, because of the min_t()?
> 
> Ah, true. In that case, probably do resp.len = size to avoid any logic
> failures due to the double-read? I just want to avoid any chance of
> confusing the size and actually using it somewhere.

Yep, sounds good.

Tycho
Aleksa Sarai Sept. 29, 2018, 12:28 a.m. UTC | #10
On 2018-09-27, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Minor thing, but this is no longer _entirely_ true (now it checks
ns_capable(sb->s_user_ns)). I think the kernel module auto-loading is a
much more interesting example, but since this is just a commit message
feel free to ignore my pedantry. :P

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

Would you mind adding me to the Cc: list for the next round of patches?
It's looking pretty neat!

Thanks!
Christian Brauner Oct. 8, 2018, 2:58 p.m. UTC | #11
On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > This patch introduces a means for syscalls matched in seccomp to notify
> > > some other task that a particular filter has been triggered.
> > >
> > > The motivation for this is primarily for use with containers. For example,
> > > if a container does an init_module(), we obviously don't want to load this
> > > untrusted code, which may be compiled for the wrong version of the kernel
> > > anyway. Instead, we could parse the module image, figure out which module
> > > the container is trying to load and load it on the host.
> > >
> > > As another example, containers cannot mknod(), since this checks
> > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > > coding some whitelist in the kernel. Another example is mount(), which has
> > > many security restrictions for good reason, but configuration or runtime
> > > knowledge could potentially be used to relax these restrictions.
> > >
> > > This patch adds functionality that is already possible via at least two
> > > other means that I know about, both of which involve ptrace(): first, one
> > > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > > Unfortunately this is slow, so a faster version would be to install a
> > > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > > Since ptrace allows only one tracer, if the container runtime is that
> > > tracer, users inside the container (or outside) trying to debug it will not
> > > be able to use ptrace, which is annoying. It also means that older
> > > distributions based on Upstart cannot boot inside containers using ptrace,
> > > since upstart itself uses ptrace to start services.
> > >
> > > The actual implementation of this is fairly small, although getting the
> > > synchronization right was/is slightly complex.
> > >
> > > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > > memory data from the task still applies here, but can be avoided with
> > > careful design of the userspace handler: if the userspace handler reads all
> > > of the task memory that is necessary before applying its security policy,
> > > the tracee's subsequent memory edits will not be read by the tracer.
> > >
> > > v2: * make id a u64; the idea here being that it will never overflow,
> > >       because 64 is huge (one syscall every nanosecond => wrap every 584
> > >       years) (Andy)
> > >     * prevent nesting of user notifications: if someone is already attached
> > >       the tree in one place, nobody else can attach to the tree (Andy)
> > >     * notify the listener of signals the tracee receives as well (Andy)
> > >     * implement poll
> > > v3: * lockdep fix (Oleg)
> > >     * drop unnecessary WARN()s (Christian)
> > >     * rearrange error returns to be more pretty (Christian)
> > >     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> > > v4: * fix implementation of poll to use poll_wait() (Jann)
> > >     * change listener's fd flags to be 0 (Jann)
> > >     * hoist filter initialization out of ifdefs to its own function
> > >       init_user_notification()
> > >     * add some more testing around poll() and closing the listener while a
> > >       syscall is in action
> > >     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
> > >       creates a new one (Matthew)
> > >     * correctly handle pid namespaces, add some testcases (Matthew)
> > >     * use EINPROGRESS instead of EINVAL when a notification response is
> > >       written twice (Matthew)
> > >     * fix comment typo from older version (SEND vs READ) (Matthew)
> > >     * whitespace and logic simplification (Tobin)
> > >     * add some Documentation/ bits on userspace trapping
> > > v5: * fix documentation typos (Jann)
> > >     * add signalled field to struct seccomp_notif (Jann)
> > >     * switch to using ioctls instead of read()/write() for struct passing
> > >       (Jann)
> > >     * add an ioctl to ensure an id is still valid
> > > v6: * docs typo fixes, update docs for ioctl() change (Christian)
> > > v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
> > >     * use notify_lock in IS_ID_VALID query to avoid racing
> > >     * s/signalled/signaled (Tyler)
> > >     * fix docs to reflect that ids are not globally unique (Tyler)
> > >     * add a test to check -ERESTARTSYS behavior (Tyler)
> > >     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
> > >     * reorder USER_NOTIF in seccomp return codes list (Tyler)
> > >     * return size instead of sizeof(struct user_notif) (Tyler)
> > >     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
> > >     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
> > >     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
> > >     * add a new struct notification to minimize the additions to
> > >       struct seccomp_filter, also pack the necessary additions a bit more
> > >       cleverly (Tyler)
> > >     * switch to keeping track of the task itself instead of the pid (we'll
> > >       use this for implementing PUT_FD)
> > 
> > Patch-sending nit: can you put the versioning below the "---" line so
> > it isn't included in the final commit? (And I normally read these
> > backwards, so I'd expect v7 at the top, but that's not a big deal. I
> > mean... neither is the --- thing, but it makes "git am" easier for me
> > since I don't have to go edit the versioning out of the log.)
> 
> Sure, will do.
> 
> > > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > > index 9efc0e73d50b..d4ccb32fe089 100644
> > > --- a/include/uapi/linux/seccomp.h
> > > +++ b/include/uapi/linux/seccomp.h
> > > @@ -17,9 +17,10 @@
> > >  #define SECCOMP_GET_ACTION_AVAIL       2
> > >
> > >  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> > > -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> > > -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> > > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> > > +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> > > +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> > > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> > > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)
> > 
> > Since these are all getting indentation updates, can you switch them
> > to BIT(0), BIT(1), etc?
> 
> Will do.
> 
> > >  /*
> > >   * All BPF programs must return a 32-bit value.
> > > @@ -35,6 +36,7 @@
> > >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> > >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> > >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> > >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> > >  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
> > >  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> > > @@ -60,4 +62,29 @@ struct seccomp_data {
> > >         __u64 args[6];
> > >  };
> > >
> > > +struct seccomp_notif {
> > > +       __u16 len;
> > > +       __u64 id;
> > > +       __u32 pid;
> > > +       __u8 signaled;
> > > +       struct seccomp_data data;
> > > +};
> > > +
> > > +struct seccomp_notif_resp {
> > > +       __u16 len;
> > > +       __u64 id;
> > > +       __s32 error;
> > > +       __s64 val;
> > > +};
> > 
> > So, len has to come first, for versioning. However, since it's ahead
> > of a u64, this leaves a struct padding hole. pahole output:
> > 
> > struct seccomp_notif {
> >         __u16                      len;                  /*     0     2 */
> > 
> >         /* XXX 6 bytes hole, try to pack */
> > 
> >         __u64                      id;                   /*     8     8 */
> >         __u32                      pid;                  /*    16     4 */
> >         __u8                       signaled;             /*    20     1 */
> > 
> >         /* XXX 3 bytes hole, try to pack */
> > 
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> > 
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 79, holes: 2, sum holes: 9 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u16                      len;                  /*     0     2 */
> > 
> >         /* XXX 6 bytes hole, try to pack */
> > 
> >         __u64                      id;                   /*     8     8 */
> >         __s32                      error;                /*    16     4 */
> > 
> >         /* XXX 4 bytes hole, try to pack */
> > 
> >         __s64                      val;                  /*    24     8 */
> > 
> >         /* size: 32, cachelines: 1, members: 4 */
> >         /* sum members: 22, holes: 2, sum holes: 10 */
> >         /* last cacheline: 32 bytes */
> > };
> > 
> > How about making len u32, and moving pid and error above "id"? This
> > leaves a hole after signaled, so changing "len" won't be sufficient
> > for versioning here. Perhaps move it after data?
> 
> I'm not sure what you mean by "len won't be sufficient for versioning
> here"? Anyway, I can do some packing on these; I didn't bother before
> since I figured it's a userspace interface, so saving a few bytes
> isn't a huge deal.
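
For reference, the hole Kees describes can be shown with a compile-time check from
userspace. The struct below only mirrors the v7 layout quoted above (with a stand-in
for struct seccomp_data), so the names and offsets are illustrative, not part of any
UAPI; the assertion holds on common 64-bit ABIs where u64 has 8-byte alignment.

#include <stddef.h>
#include <stdint.h>

/* Mirror of the v7 layout under discussion; not the final UAPI. */
struct notif_v7_layout {
	uint16_t len;
	uint64_t id;
	uint32_t pid;
	uint8_t  signaled;
	uint64_t data[8];	/* stand-in for the 64-byte struct seccomp_data */
};

/*
 * The 64-bit id forces 8-byte alignment, so a 6-byte hole follows the
 * 16-bit len field -- exactly what the pahole output above shows.
 */
_Static_assert(offsetof(struct notif_v7_layout, id) == 8,
	       "6-byte hole after the 16-bit len field");
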
> 
> > > +
> > > +#define SECCOMP_IOC_MAGIC              0xF7
> > 
> > Was there any specific reason for picking this value? There are lots
> > of fun ASCII code left like '!' or '*'. :)
> 
> No, ! it is :)
> 
> > > +
> > > +/* Flags for seccomp notification fd ioctl. */
> > > +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> > > +                                       struct seccomp_notif)
> > > +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> > > +                                       struct seccomp_notif_resp)
> > > +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> > > +                                       __u64)
> > 
> > To match other UAPI ioctl, can these have a prefix of "SECCOMP_IOCTOL_..."?
> > 
> > It may also be useful to match how other uapis do this, like for DRM:
> > 
> > #define DRM_IOCTL_BASE                  'd'
> > #define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
> > #define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
> > #define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
> > #define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)
> > 
> > #define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
> > #define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
> > #define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
> > ...
> 
> Will do.
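
A sketch of what the DRM-style naming could look like for seccomp, assuming the '!'
magic settled on above; the helper macros and the SECCOMP_IOCTL_* names here are only
illustrative, not the final UAPI.

#include <linux/ioctl.h>

/* Hypothetical DRM-style helpers for the seccomp notification ioctls. */
#define SECCOMP_IOC_MAGIC		'!'
#define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
#define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOW(nr, type)		_IOW(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOWR(nr, type)		_IOWR(SECCOMP_IOC_MAGIC, nr, type)

#define SECCOMP_IOCTL_NOTIF_RECV	SECCOMP_IOWR(0, struct seccomp_notif)
#define SECCOMP_IOCTL_NOTIF_SEND	SECCOMP_IOWR(1, struct seccomp_notif_resp)
#define SECCOMP_IOCTL_NOTIF_ID_VALID	SECCOMP_IOR(2, __u64)
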
> 
> > 
> > > +
> > >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index fd023ac24e10..fa6fe9756c80 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -33,12 +33,78 @@
> > >  #endif
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +#include <linux/file.h>
> > >  #include <linux/filter.h>
> > >  #include <linux/pid.h>
> > >  #include <linux/ptrace.h>
> > >  #include <linux/security.h>
> > >  #include <linux/tracehook.h>
> > >  #include <linux/uaccess.h>
> > > +#include <linux/anon_inodes.h>
> > > +
> > > +enum notify_state {
> > > +       SECCOMP_NOTIFY_INIT,
> > > +       SECCOMP_NOTIFY_SENT,
> > > +       SECCOMP_NOTIFY_REPLIED,
> > > +};
> > > +
> > > +struct seccomp_knotif {
> > > +       /* The struct pid of the task whose filter triggered the notification */
> > > +       struct task_struct *task;
> > > +
> > > +       /* The "cookie" for this request; this is unique for this filter. */
> > > +       u64 id;
> > > +
> > > +       /* Whether or not this task has been given an interruptible signal. */
> > > +       bool signaled;
> > > +
> > > +       /*
> > > +        * The seccomp data. This pointer is valid the entire time this
> > > +        * notification is active, since it comes from __seccomp_filter which
> > > +        * eclipses the entire lifecycle here.
> > > +        */
> > > +       const struct seccomp_data *data;
> > > +
> > > +       /*
> > > +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > > +        * struct seccomp_knotif is created and starts out in INIT. Once the
> > > +        * handler reads the notification off of an FD, it transitions to SENT.
> > > +        * If a signal is received the state transitions back to INIT and
> > > +        * another message is sent. When the userspace handler replies, state
> > > +        * transitions to REPLIED.
> > > +        */
> > > +       enum notify_state state;
> > > +
> > > +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > > +       int error;
> > > +       long val;
> > > +
> > > +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > > +       struct completion ready;
> > > +
> > > +       struct list_head list;
> > > +};
> > > +
> > > +/**
> > > + * struct notification - container for seccomp userspace notifications. Since
> > > + * most seccomp filters will not have notification listeners attached and this
> > > + * structure is fairly large, we store the notification-specific stuff in a
> > > + * separate structure.
> > > + *
> > > + * @request: A semaphore that users of this notification can wait on for
> > > + *           changes. Actual reads and writes are still controlled with
> > > + *           filter->notify_lock.
> > > + * @notify_lock: A lock for all notification-related accesses.
> > > + * @next_id: The id of the next request.
> > > + * @notifications: A list of struct seccomp_knotif elements.
> > > + * @wqh: A wait queue for poll.
> > > + */
> > > +struct notification {
> > > +       struct semaphore request;
> > > +       u64 next_id;
> > > +       struct list_head notifications;
> > > +       wait_queue_head_t wqh;
> > > +};
> > >
> > >  /**
> > >   * struct seccomp_filter - container for seccomp BPF programs
> > > @@ -66,6 +132,8 @@ struct seccomp_filter {
> > >         bool log;
> > >         struct seccomp_filter *prev;
> > >         struct bpf_prog *prog;
> > > +       struct notification *notif;
> > > +       struct mutex notify_lock;
> > >  };
> > >
> > >  /* Limit any path through the tree to 256KB worth of instructions. */
> > > @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> > >         if (!sfilter)
> > >                 return ERR_PTR(-ENOMEM);
> > >
> > > +       mutex_init(&sfilter->notify_lock);
> > >         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
> > >                                         seccomp_check_filter, save_orig);
> > >         if (ret < 0) {
> > > @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
> > >  #define SECCOMP_LOG_TRACE              (1 << 4)
> > >  #define SECCOMP_LOG_LOG                        (1 << 5)
> > >  #define SECCOMP_LOG_ALLOW              (1 << 6)
> > > +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
> > >
> > >  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
> > >                                     SECCOMP_LOG_KILL_THREAD  |
> > >                                     SECCOMP_LOG_TRAP  |
> > >                                     SECCOMP_LOG_ERRNO |
> > > +                                   SECCOMP_LOG_USER_NOTIF |
> > >                                     SECCOMP_LOG_TRACE |
> > >                                     SECCOMP_LOG_LOG;
> > >
> > > @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> > >         case SECCOMP_RET_TRACE:
> > >                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
> > >                 break;
> > > +       case SECCOMP_RET_USER_NOTIF:
> > > +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> > > +               break;
> > >         case SECCOMP_RET_LOG:
> > >                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
> > >                 break;
> > > @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
> > >  #else
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > > +{
> > > +       /* Note: overflow is ok here, the id just needs to be unique */
> > 
> > Maybe just clarify in the comment: unique to the filter.
> > 
> > > +       return filter->notif->next_id++;
> > 
> > Also, it might be useful to add for both documentation and lockdep:
> > 
> > lockdep_assert_held(filter->notif->notify_lock);
> > 
> > into this function?
> 
> Will do.
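
A sketch of the function with both suggestions folded in; since notify_lock lives on
the filter rather than on struct notification, the assertion would presumably take
&filter->notify_lock.

static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
{
	/*
	 * Note: overflow is ok here; the id only needs to be unique to
	 * this filter.
	 */
	lockdep_assert_held(&filter->notify_lock);
	return filter->notif->next_id++;
}
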
> 
> > 
> > > +}
> > > +
> > > +static void seccomp_do_user_notification(int this_syscall,
> > > +                                        struct seccomp_filter *match,
> > > +                                        const struct seccomp_data *sd)
> > > +{
> > > +       int err;
> > > +       long ret = 0;
> > > +       struct seccomp_knotif n = {};
> > > +
> > > +       mutex_lock(&match->notify_lock);
> > > +       err = -ENOSYS;
> > > +       if (!match->notif)
> > > +               goto out;
> > > +
> > > +       n.task = current;
> > > +       n.state = SECCOMP_NOTIFY_INIT;
> > > +       n.data = sd;
> > > +       n.id = seccomp_next_notify_id(match);
> > > +       init_completion(&n.ready);
> > > +
> > > +       list_add(&n.list, &match->notif->notifications);
> > > +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > > +
> > > +       mutex_unlock(&match->notify_lock);
> > > +       up(&match->notif->request);
> > > +
> > 
> > Maybe add a big comment here saying this is where we're waiting for a reply?
> 
> Will do.
> 
> > > +       err = wait_for_completion_interruptible(&n.ready);
> > > +       mutex_lock(&match->notify_lock);
> > > +
> > > +       /*
> > > +        * Here it's possible we got a signal and then had to wait on the mutex
> > > +        * while the reply was sent, so let's be sure there wasn't a response
> > > +        * in the meantime.
> > > +        */
> > > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > > +               /*
> > > +                * We got a signal. Let's tell userspace about it (potentially
> > > +                * again, if we had already notified them about the first one).
> > > +                */
> > > +               n.signaled = true;
> > > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > > +                       n.state = SECCOMP_NOTIFY_INIT;
> > > +                       up(&match->notif->request);
> > > +               }
> > > +               mutex_unlock(&match->notify_lock);
> > > +               err = wait_for_completion_killable(&n.ready);
> > > +               mutex_lock(&match->notify_lock);
> > > +               if (err < 0)
> > > +                       goto remove_list;
> > > +       }
> > > +
> > > +       ret = n.val;
> > > +       err = n.error;
> > > +
> > > +remove_list:
> > > +       list_del(&n.list);
> > > +out:
> > > +       mutex_unlock(&match->notify_lock);
> > > +       syscall_set_return_value(current, task_pt_regs(current),
> > > +                                err, ret);
> > > +}
> > > +
> > >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> > >                             const bool recheck_after_trace)
> > >  {
> > > @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> > >
> > >                 return 0;
> > >
> > > +       case SECCOMP_RET_USER_NOTIF:
> > > +               seccomp_do_user_notification(this_syscall, match, sd);
> > > +               goto skip;
> > 
> > Nit: please add a blank line here (to match the other cases).
> > 
> > >         case SECCOMP_RET_LOG:
> > >                 seccomp_log(this_syscall, 0, action, true);
> > >                 return 0;
> > > @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
> > >  }
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +static struct file *init_listener(struct task_struct *,
> > > +                                 struct seccomp_filter *);
> > 
> > Why is the forward declaration needed instead of just moving the
> > function here? I didn't see anything in it that looked like it
> > couldn't move.
> 
> I think there was a cycle in some earlier version, but I agree there
> isn't now. I'll fix it.
> 
> > > +
> > >  /**
> > >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> > >   * @flags:  flags to change filter behavior
> > > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> > >         struct seccomp_filter *prepared = NULL;
> > >         long ret = -EINVAL;
> > > +       int listener = 0;
> > 
> > Nit: "invalid fd" should be -1, not 0.
> > 
> > > +       struct file *listener_f = NULL;
> > >
> > >         /* Validate flags. */
> > >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         if (IS_ERR(prepared))
> > >                 return PTR_ERR(prepared);
> > >
> > > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > +               listener = get_unused_fd_flags(0);
> > 
> > As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?
> 
> Yep, will do.
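
Putting both nits together, the listener setup could look roughly like the fragment
below; this is only a sketch of the fixups agreed in this thread, not the committed
code.

	int listener = -1;			/* invalid fd is -1, per the nit above */
	struct file *listener_f = NULL;

	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
		listener = get_unused_fd_flags(O_CLOEXEC);	/* was 0 */
		if (listener < 0) {
			ret = listener;
			goto out_free;
		}

		listener_f = init_listener(current, prepared);
		if (IS_ERR(listener_f)) {
			put_unused_fd(listener);
			ret = PTR_ERR(listener_f);
			goto out_free;
		}
	}
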
> 
> > > +               if (listener < 0) {
> > > +                       ret = listener;
> > > +                       goto out_free;
> > > +               }
> > > +
> > > +               listener_f = init_listener(current, prepared);
> > > +               if (IS_ERR(listener_f)) {
> > > +                       put_unused_fd(listener);
> > > +                       ret = PTR_ERR(listener_f);
> > > +                       goto out_free;
> > > +               }
> > > +       }
> > > +
> > >         /*
> > >          * Make sure we cannot change seccomp or nnp state via TSYNC
> > >          * while another thread is in the middle of calling exec.
> > >          */
> > >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> > >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > > -               goto out_free;
> > > +               goto out_put_fd;
> > >
> > >         spin_lock_irq(&current->sighand->siglock);
> > >
> > > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         spin_unlock_irq(&current->sighand->siglock);
> > >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> > >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > > +out_put_fd:
> > > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > +               if (ret < 0) {
> > > +                       fput(listener_f);
> > > +                       put_unused_fd(listener);
> > > +               } else {
> > > +                       fd_install(listener, listener_f);
> > > +                       ret = listener;
> > > +               }
> > > +       }
> > 
> > Can you update the kern-docs for seccomp_set_mode_filter(), since we
> > can now return positive values?
> > 
> >  * Returns 0 on success or -EINVAL on failure.
> > 
> > (this shoudln't say only -EINVAL, I realize too)
> 
> Sure, I can fix both of these.
> 
> > I have to say, I'm vaguely nervous about changing the semantics here
> > for passing back the fd as the return code from the seccomp() syscall.
> > Alternatives seem less appealing, though: changing the meaning of the
> > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > example. Hmm.
> 
> From my perspective we can drop this whole thing. The only thing I'll
> ever use is the ptrace version. Someone at some point (I don't
> remember who, maybe stgraber) suggested this version would be useful
> as well.

So I think we want to have the ability to get an fd via seccomp().
Especially, if all we worry about are weird semantics. When we
discussed this we knew the whole patchset was going to be weird. :)

This is a seccomp feature so seccomp should - if feasible - equip you
with everything to use it in a meaningful way without having to go
through a different kernel api. I know ptrace and seccomp are
already connected but I still find this cleaner. :)

Another thing is that the container itself might be traced for some
reason while you still might want to get an fd out.

Also, I wonder what happens if you want to filter the ptrace() syscall
itself? Then you'd deadlock?

Also, it seems that getting an fd via ptrace requires CAP_SYS_ADMIN in
the initial user namespace (which I just realized now) whereas getting
the fd via seccomp() doesn't seem to.
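
For context, the seccomp()-only path being argued for here would look roughly like the
sketch below from userspace. The flag value and the fd-as-return-value semantics are
the ones proposed in this series (SECCOMP_FILTER_FLAG_NEW_LISTENER is assumed to come
from the series' headers), so nothing here is a settled API.

#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Install @prog and get the notification listener fd back directly from
 * seccomp(2), with no ptrace() involvement.
 */
static int install_notify_filter(struct sock_fprog *prog)
{
	int listener;

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;

	listener = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
			   SECCOMP_FILTER_FLAG_NEW_LISTENER, prog);
	if (listener < 0)
		return -1;

	return listener;	/* use with the SECCOMP_NOTIF_* ioctls */
}
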

> 
> Anyway, let me know if your nervousness outweighs this, I'm happy to
> drop it.
> 
> > > @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
> > >  device_initcall(seccomp_sysctl_init)
> > >
> > >  #endif /* CONFIG_SYSCTL */
> > > +
> > > +#ifdef CONFIG_SECCOMP_FILTER
> > > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +       struct seccomp_knotif *knotif;
> > > +
> > > +       mutex_lock(&filter->notify_lock);
> > > +
> > > +       /*
> > > +        * If this file is being closed because e.g. the task who owned it
> > > +        * died, let's wake everyone up who was waiting on us.
> > > +        */
> > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > > +                       continue;
> > > +
> > > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > > +               knotif->error = -ENOSYS;
> > > +               knotif->val = 0;
> > > +
> > > +               complete(&knotif->ready);
> > > +       }
> > > +
> > > +       wake_up_all(&filter->notif->wqh);
> > > +       kfree(filter->notif);
> > > +       filter->notif = NULL;
> > > +       mutex_unlock(&filter->notify_lock);
> > 
> > It looks like that means nothing waiting on knotif->ready can access
> > filter->notif without rechecking it, yes?
> > 
> > e.g. in seccomp_do_user_notification() I see:
> > 
> >                         up(&match->notif->request);
> > 
> > I *think* this isn't reachable due to the test for n.state !=
> > SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
> > it's not fast-path, we could add a WARN_ON() while checking for
> > unreplied signal death?
> > 
> >                 n.signaled = true;
> >                 if (n.state == SECCOMP_NOTIFY_SENT) {
> >                         n.state = SECCOMP_NOTIFY_INIT;
> >                         if (!WARN_ON(match->notif))
> >                             up(&match->notif->request);
> >                 }
> >                 mutex_unlock(&match->notify_lock);
> 
> So this code path should actually be safe, since notify_lock is held
> throughout, as it is in the release handler. However, there is one just above
> it that is not, because we do:
> 
>         mutex_unlock(&match->notify_lock);
>         up(&match->notif->request);
> 
> When this was all a member of struct seccomp_filter the order didn't matter,
> but now it very much does, and I think you're right that these statements need
> to be reordered. There may be others; I'll check everything else as well.
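
Concretely, the reordering being discussed is just swapping the two statements so the
semaphore is kicked while notify_lock still pins match->notif; a sketch, not the final
patch:

	/*
	 * Wake the listener before dropping the lock, so a concurrent
	 * release of the listener fd cannot free match->notif between
	 * the unlock and the up().
	 */
	up(&match->notif->request);
	mutex_unlock(&match->notify_lock);
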
> 
> > 
> > > +       __put_seccomp_filter(filter);
> > > +       return 0;
> > > +}
> > > +
> > > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > > +                               unsigned long arg)
> > > +{
> > > +       struct seccomp_knotif *knotif = NULL, *cur;
> > > +       struct seccomp_notif unotif = {};
> > > +       ssize_t ret;
> > > +       u16 size;
> > > +       void __user *buf = (void __user *)arg;
> > 
> > I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
> > anything from accidentally using "arg" directly here.
> 
> Will do.
> 
> > > +
> > > +       if (copy_from_user(&size, buf, sizeof(size)))
> > > +               return -EFAULT;
> > > +
> > > +       ret = down_interruptible(&filter->notif->request);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       mutex_lock(&filter->notify_lock);
> > > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > > +                       knotif = cur;
> > > +                       break;
> > > +               }
> > > +       }
> > > +
> > > +       /*
> > > +        * If we didn't find a notification, it could be that the task was
> > > +        * interrupted between the time we were woken and when we were able to
> > > +        * acquire the rw lock.
> > > +        */
> > > +       if (!knotif) {
> > > +               ret = -ENOENT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       size = min_t(size_t, size, sizeof(unotif));
> > > +
> > 
> > It is possible (though unlikely given the type widths involved here)
> > for unotif = {} to not initialize padding, so I would recommend an
> > explicit memset(&unotif, 0, sizeof(unotif)) here.
> 
> Orly? I didn't know that, thanks.
> 
> > > +       unotif.len = size;
> > > +       unotif.id = knotif->id;
> > > +       unotif.pid = task_pid_vnr(knotif->task);
> > > +       unotif.signaled = knotif->signaled;
> > > +       unotif.data = *(knotif->data);
> > > +
> > > +       if (copy_to_user(buf, &unotif, size)) {
> > > +               ret = -EFAULT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       ret = size;
> > > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > > +
> > > +
> > > +out:
> > > +       mutex_unlock(&filter->notify_lock);
> > 
> > Is there some way to rearrange the locking here to avoid holding the
> > mutex while doing copy_to_user() (which userspace could block with
> > userfaultfd, and then stall all the other notifications for this
> > filter)?
> 
> Yes, I don't think it'll cause any problems to release the lock earlier.
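
A sketch of what releasing the lock earlier might look like: snapshot the notification
and update its state under notify_lock, then do the copy_to_user() unlocked. If the
copy fails, a real version would also need to put the notification back into the INIT
state; that detail is omitted here.

	unotif.len = size;
	unotif.id = knotif->id;
	unotif.pid = task_pid_vnr(knotif->task);
	unotif.signaled = knotif->signaled;
	unotif.data = *(knotif->data);

	knotif->state = SECCOMP_NOTIFY_SENT;
	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
	mutex_unlock(&filter->notify_lock);

	/*
	 * Userspace can block this copy (e.g. via userfaultfd) without
	 * stalling other notifications, since the lock is no longer held.
	 */
	if (copy_to_user(buf, &unotif, size))
		return -EFAULT;

	return size;
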
> 
> > > +       return ret;
> > > +}
> > > +
> > > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > > +                               unsigned long arg)
> > > +{
> > > +       struct seccomp_notif_resp resp = {};
> > > +       struct seccomp_knotif *knotif = NULL;
> > > +       long ret;
> > > +       u16 size;
> > > +       void __user *buf = (void __user *)arg;
> > 
> > Same cast note as above.
> > 
> > > +
> > > +       if (copy_from_user(&size, buf, sizeof(size)))
> > > +               return -EFAULT;
> > > +       size = min_t(size_t, size, sizeof(resp));
> > > +       if (copy_from_user(&resp, buf, size))
> > > +               return -EFAULT;
> > 
> > For sanity checking on a double-read from userspace, please add:
> > 
> >     if (resp.len != size)
> >         return -EINVAL;
> 
> Won't that fail if sizeof(resp) < resp.len, because of the min_t()?
> 
> > > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > > +                                   unsigned long arg)
> > > +{
> > > +       struct seccomp_knotif *knotif = NULL;
> > > +       void __user *buf = (void __user *)arg;
> > > +       u64 id;
> > > +       long ret;
> > > +
> > > +       if (copy_from_user(&id, buf, sizeof(id)))
> > > +               return -EFAULT;
> > > +
> > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       ret = -1;
> > 
> > Isn't this EPERM? Shouldn't it be -ENOENT?
> 
> Yes, I wasn't thinking of errno here, I'll switch it.
> 
> > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > +               if (knotif->id == id) {
> > > +                       ret = 0;
> > > +                       goto out;
> > > +               }
> > > +       }
> > > +
> > > +out:
> > > +       mutex_unlock(&filter->notify_lock);
> > > +       return ret;
> > > +}
> > > +
> > > +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> > > +                                unsigned long arg)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +
> > > +       switch (cmd) {
> > > +       case SECCOMP_NOTIF_RECV:
> > > +               return seccomp_notify_recv(filter, arg);
> > > +       case SECCOMP_NOTIF_SEND:
> > > +               return seccomp_notify_send(filter, arg);
> > > +       case SECCOMP_NOTIF_ID_VALID:
> > > +               return seccomp_notify_id_valid(filter, arg);
> > > +       default:
> > > +               return -EINVAL;
> > > +       }
> > > +}
> > > +
> > > +static __poll_t seccomp_notify_poll(struct file *file,
> > > +                                   struct poll_table_struct *poll_tab)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +       __poll_t ret = 0;
> > > +       struct seccomp_knotif *cur;
> > > +
> > > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > > +
> > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> > 
> > My eyes! :) Can you wrap the bit operations in parens here?
> > 
> > > +                       break;
> > > +       }
> > 
> > Should POLLERR be handled here too? I don't quite see the conditions
> > that might be exposed? All the processes die for the filter, which
> > does what here?
> 
> I think it shouldn't do anything, because I was thinking of the semantics of
> poll() as "when a tracee does a syscall that matches, fire". So a task could
> start, never make a targeted syscall, and exit, and poll() shouldn't return a
> value. Maybe it's useful to write that down somewhere, though.
> 
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> > > +
> > > +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> > > +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> > > +
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
> > 
> > Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
> > been wondering what it's for, and now I see it's kind of an advisory
> > "is the other end still alive?" test.
> 
> Yes, in fact it's necessary for avoiding races. There's some comments in the
> sample code, but I'll update seccomp_filter.rst too.
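
For readers following along, a minimal handler loop against the v7 interface might
look like the sketch below. The struct fields and ioctl names are the ones in this
revision (and so may change), the definitions are assumed to come from the series'
headers, and error handling is trimmed down to the ID_VALID race check being
discussed.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>	/* assumes the v7 struct/ioctl definitions */

static int handle_one_notification(int listener)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	memset(&req, 0, sizeof(req));
	req.len = sizeof(req);
	if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) < 0)
		return -1;

	/*
	 * Read whatever task memory the policy needs *before* deciding,
	 * then confirm the notification still refers to a live syscall;
	 * this is the race SECCOMP_NOTIF_ID_VALID exists to close.
	 */
	if (ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id) != 0)
		return -1;

	memset(&resp, 0, sizeof(resp));
	resp.len = sizeof(resp);
	resp.id = req.id;
	resp.error = 0;			/* emulate success ... */
	resp.val = 0;			/* ... with this return value */

	return ioctl(listener, SECCOMP_NOTIF_SEND, &resp) < 0 ? -1 : 0;
}
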
> 
> > > +
> > > +       resp.id = req.id;
> > > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > > +       EXPECT_EQ(ret, -1);
> > > +       EXPECT_EQ(errno, ENOENT);
> > > +
> > > +       /*
> > > +        * Check that we get another notification about a signal in the middle
> > > +        * of a syscall.
> > > +        */
> > > +       pid = fork();
> > > +       ASSERT_GE(pid, 0);
> > > +
> > > +       if (pid == 0) {
> > > +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> > > +                       perror("signal");
> > > +                       exit(1);
> > > +               }
> > > +               ret = syscall(__NR_getpid);
> > > +               exit(ret != USER_NOTIF_MAGIC);
> > > +       }
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       EXPECT_EQ(ret, sizeof(req));
> > > +       EXPECT_EQ(errno, 0);
> > > +
> > > +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       EXPECT_EQ(req.signaled, 1);
> > > +       EXPECT_EQ(ret, sizeof(req));
> > > +       EXPECT_EQ(errno, 0);
> > > +
> > > +       resp.len = sizeof(resp);
> > > +       resp.id = req.id;
> > > +       resp.error = -512; /* -ERESTARTSYS */
> > > +       resp.val = 0;
> > > +
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       resp.len = sizeof(resp);
> > > +       resp.id = req.id;
> > > +       resp.error = 0;
> > > +       resp.val = USER_NOTIF_MAGIC;
> > > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > 
> > I was slightly confused here: why have there been 3 reads? I was
> > expecting one notification for hitting getpid and one from catching a
> > signal. But in rereading, I see that NOTIF_RECV will return the most
> > recently unresponded notification, yes?
> 
> The three reads are:
> 
> 1. original syscall
> # send SIGUSR1
> 2. another notif with signaled is set
> # respond with -ERESTARTSYS to make sure that works
> 3. this is the result of -ERESTARTSYS
> 
> > But... catching a signal replaces the existing seccomp_knotif? I
> > remain confused about how signal handling is meant to work here. What
> > happens if two signals get sent? It looks like you just block without
> > allowing more signals? (Thank you for writing the tests!)
> 
> Yes, that's the idea. This is an implementation of Andy's pseudocode:
> https://lkml.org/lkml/2018/3/15/1122
> 
> > (And can you document the expected behavior in the seccomp_filter.rst too?)
> 
> Will do.
> 
> > 
> > Looking good!
> 
> Thanks for your review!
> 
> Tycho
Tycho Andersen Oct. 9, 2018, 2:28 p.m. UTC | #12
On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > I have to say, I'm vaguely nervous about changing the semantics here
> > > for passing back the fd as the return code from the seccomp() syscall.
> > > Alternatives seem less appealing, though: changing the meaning of the
> > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > example. Hmm.
> > 
> > From my perspective we can drop this whole thing. The only thing I'll
> > ever use is the ptrace version. Someone at some point (I don't
> > remember who, maybe stgraber) suggested this version would be useful
> > as well.
> 
> So I think we want to have the ability to get an fd via seccomp().
> Especially, if we all we worry about are weird semantics. When we
> discussed this we knew the whole patchset was going to be weird. :)
> 
> This is a seccomp feature so seccomp should - if feasible - equip you
> with everything to use it in a meaningful way without having to go
> through a different kernel api. I know ptrace and seccomp are
> already connected but I still find this cleaner. :)
> 
> Another thing is that the container itself might be traced for some
> reason while you still might want to get an fd out.

Sure, I don't see the problem here.

> Also, I wonder what happens if you want to filter the ptrace() syscall
> itself? Then you'd deadlock?

No, are you confusing the tracee with the tracer here? Filtering
ptrace() will happen just like any other syscall... what would you
deadlock with?

> Also, it seems that getting an fd via ptrace requires CAP_SYS_ADMIN in
> the initial user namespace (which I just realized now) whereas getting
> the fd via seccomp() doesn't seem to.

Yep, I'll leave this discussion to the other thread.

Tycho
Christian Brauner Oct. 9, 2018, 4:24 p.m. UTC | #13
On Tue, Oct 09, 2018 at 07:28:33AM -0700, Tycho Andersen wrote:
> On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> > On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > > I have to say, I'm vaguely nervous about changing the semantics here
> > > > for passing back the fd as the return code from the seccomp() syscall.
> > > > Alternatives seem less appealing, though: changing the meaning of the
> > > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > > example. Hmm.
> > > 
> > > From my perspective we can drop this whole thing. The only thing I'll
> > > ever use is the ptrace version. Someone at some point (I don't
> > > remember who, maybe stgraber) suggested this version would be useful
> > > as well.
> > 
> > So I think we want to have the ability to get an fd via seccomp().
> > Especially, if all we worry about are weird semantics. When we
> > discussed this we knew the whole patchset was going to be weird. :)
> > 
> > This is a seccomp feature so seccomp should - if feasible - equip you
> > with everything to use it in a meaningful way without having to go
> > through a different kernel api. I know ptrace and seccomp are
> > already connected but I still find this cleaner. :)
> > 
> > Another thing is that the container itself might be traced for some
> > reason while you still might want to get an fd out.
> 
> Sure, I don't see the problem here.

How'd you do PTRACE_ATTACH in that case?

Anyway, the whole point is, as we've discussed in the other thread, we
really want a one-syscall-only, purely-seccomp() based way of getting
the fd. There are multiple options to get the fd even when you block
sendmsg()/socket()/whatever, and there's no good reason to only be able
to get the fd via a three-syscall-ptrace dance. :)
Tycho Andersen Oct. 9, 2018, 4:29 p.m. UTC | #14
On Tue, Oct 09, 2018 at 06:24:14PM +0200, Christian Brauner wrote:
> On Tue, Oct 09, 2018 at 07:28:33AM -0700, Tycho Andersen wrote:
> > On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> > > On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > > > I have to say, I'm vaguely nervous about changing the semantics here
> > > > > for passing back the fd as the return code from the seccomp() syscall.
> > > > > Alternatives seem less appealing, though: changing the meaning of the
> > > > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > > > example. Hmm.
> > > > 
> > > > From my perspective we can drop this whole thing. The only thing I'll
> > > > ever use is the ptrace version. Someone at some point (I don't
> > > > remember who, maybe stgraber) suggested this version would be useful
> > > > as well.
> > > 
> > > So I think we want to have the ability to get an fd via seccomp().
> > > Especially, if all we worry about are weird semantics. When we
> > > discussed this we knew the whole patchset was going to be weird. :)
> > > 
> > > This is a seccomp feature so seccomp should - if feasible - equip you
> > > with everything to use it in a meaningful way without having to go
> > > through a different kernel api. I know ptrace and seccomp are
> > > already connected but I still find this cleaner. :)
> > > 
> > > Another thing is that the container itself might be traced for some
> > > reason while you still might want to get an fd out.
> > 
> > Sure, I don't see the problem here.
> 
> How'd you do PTRACE_ATTACH in that case?

Oh, you mean if someone has *ptrace*'d the task, and a third party
wants to get a seccomp fd? I think "too bad" is the answer; I don't
really mind not supporting this case.

> Anyway, the whole point is, as we've discussed in the other thread, we
> really want a one-syscall-only, purely-seccomp() based way of getting
> the fd. There are multiple options to get the fd even when you block
> sendmsg()/socket()/whatever, and there's no good reason to only be able
> to get the fd via a three-syscall-ptrace dance. :)

Ok, I'll leave these bits in then for v8.

Tycho
Tycho Andersen Oct. 17, 2018, 8:29 p.m. UTC | #15
On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >         __u64 args[6];
> >  };
> >
> > +struct seccomp_notif {
> > +       __u16 len;
> > +       __u64 id;
> > +       __u32 pid;
> > +       __u8 signaled;
> > +       struct seccomp_data data;
> > +};
> > +
> > +struct seccomp_notif_resp {
> > +       __u16 len;
> > +       __u64 id;
> > +       __s32 error;
> > +       __s64 val;
> > +};
> 
> So, len has to come first, for versioning. However, since it's ahead
> of a u64, this leaves a struct padding hole. pahole output:
> 
> struct seccomp_notif {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __u32                      pid;                  /*    16     4 */
>         __u8                       signaled;             /*    20     1 */
> 
>         /* XXX 3 bytes hole, try to pack */
> 
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> 
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 79, holes: 2, sum holes: 9 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __s32                      error;                /*    16     4 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         __s64                      val;                  /*    24     8 */
> 
>         /* size: 32, cachelines: 1, members: 4 */
>         /* sum members: 22, holes: 2, sum holes: 10 */
>         /* last cacheline: 32 bytes */
> };
> 
> How about making len u32, and moving pid and error above "id"? This
> leaves a hole after signaled, so changing "len" won't be sufficient
> for versioning here. Perhaps move it after data?

Just to confirm my understanding; I've got these as:

struct seccomp_notif {
	__u32                      len;                  /*     0     4 */
	__u32                      pid;                  /*     4     4 */
	__u64                      id;                   /*     8     8 */
	__u8                       signaled;             /*    16     1 */

	/* XXX 7 bytes hole, try to pack */

	struct seccomp_data        data;                 /*    24    64 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */

	/* size: 88, cachelines: 2, members: 5 */
	/* sum members: 81, holes: 1, sum holes: 7 */
	/* last cacheline: 24 bytes */
};
struct seccomp_notif_resp {
	__u32                      len;                  /*     0     4 */
	__s32                      error;                /*     4     4 */
	__u64                      id;                   /*     8     8 */
	__s64                      val;                  /*    16     8 */

	/* size: 24, cachelines: 1, members: 4 */
	/* last cacheline: 24 bytes */
};

in the next version. Since the structure has no padding at the end of
it, I think the Right Thing will happen. Note that this is slightly
different than what Kees suggested, if I add signaled after data, then
I end up with:

struct seccomp_notif {
	__u32                      len;                  /*     0     4 */
	__u32                      pid;                  /*     4     4 */
	__u64                      id;                   /*     8     8 */
	struct seccomp_data        data;                 /*    16    64 */
	/* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
	__u8                       signaled;             /*    80     1 */

	/* size: 88, cachelines: 2, members: 5 */
	/* padding: 7 */
	/* last cacheline: 24 bytes */
};

which I think will have the versioning problem if the next member
introduced is < 7 bytes.

Tycho
Kees Cook Oct. 17, 2018, 10:21 p.m. UTC | #16
On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
>> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> > @@ -60,4 +62,29 @@ struct seccomp_data {
>> >         __u64 args[6];
>> >  };
>> >
>> > +struct seccomp_notif {
>> > +       __u16 len;
>> > +       __u64 id;
>> > +       __u32 pid;
>> > +       __u8 signaled;
>> > +       struct seccomp_data data;
>> > +};
>> > +
>> > +struct seccomp_notif_resp {
>> > +       __u16 len;
>> > +       __u64 id;
>> > +       __s32 error;
>> > +       __s64 val;
>> > +};
>>
>> So, len has to come first, for versioning. However, since it's ahead
>> of a u64, this leaves a struct padding hole. pahole output:
>>
>> struct seccomp_notif {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __u32                      pid;                  /*    16     4 */
>>         __u8                       signaled;             /*    20     1 */
>>
>>         /* XXX 3 bytes hole, try to pack */
>>
>>         struct seccomp_data        data;                 /*    24    64 */
>>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>>
>>         /* size: 88, cachelines: 2, members: 5 */
>>         /* sum members: 79, holes: 2, sum holes: 9 */
>>         /* last cacheline: 24 bytes */
>> };
>> struct seccomp_notif_resp {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __s32                      error;                /*    16     4 */
>>
>>         /* XXX 4 bytes hole, try to pack */
>>
>>         __s64                      val;                  /*    24     8 */
>>
>>         /* size: 32, cachelines: 1, members: 4 */
>>         /* sum members: 22, holes: 2, sum holes: 10 */
>>         /* last cacheline: 32 bytes */
>> };
>>
>> How about making len u32, and moving pid and error above "id"? This
>> leaves a hole after signaled, so changing "len" won't be sufficient
>> for versioning here. Perhaps move it after data?
>
> Just to confirm my understanding; I've got these as:
>
> struct seccomp_notif {
>         __u32                      len;                  /*     0     4 */
>         __u32                      pid;                  /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         __u8                       signaled;             /*    16     1 */
>
>         /* XXX 7 bytes hole, try to pack */
>
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 81, holes: 1, sum holes: 7 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u32                      len;                  /*     0     4 */
>         __s32                      error;                /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         __s64                      val;                  /*    16     8 */
>
>         /* size: 24, cachelines: 1, members: 4 */
>         /* last cacheline: 24 bytes */
> };
>
> in the next version. Since the structure has no padding at the end of
> it, I think the Right Thing will happen. Note that this is slightly
> different than what Kees suggested, if I add signaled after data, then
> I end up with:
>
> struct seccomp_notif {
>         __u32                      len;                  /*     0     4 */
>         __u32                      pid;                  /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         struct seccomp_data        data;                 /*    16    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
>         __u8                       signaled;             /*    80     1 */
>
>         /* size: 88, cachelines: 2, members: 5 */
>         /* padding: 7 */
>         /* last cacheline: 24 bytes */
> };
>
> which I think will have the versioning problem if the next member
> > introduced is < 7 bytes.

It'll be a problem in either place. What I was thinking was that
specific versioning is required instead of just length.

-Kees
Tycho Andersen Oct. 17, 2018, 10:33 p.m. UTC | #17
On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >> >         __u64 args[6];
> >> >  };
> >> >
> >> > +struct seccomp_notif {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __u32 pid;
> >> > +       __u8 signaled;
> >> > +       struct seccomp_data data;
> >> > +};
> >> > +
> >> > +struct seccomp_notif_resp {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __s32 error;
> >> > +       __s64 val;
> >> > +};
> >>
> >> So, len has to come first, for versioning. However, since it's ahead
> >> of a u64, this leaves a struct padding hole. pahole output:
> >>
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > Just to confirm my understanding; I've got these as:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __u8                       signaled;             /*    16     1 */
> >
> >         /* XXX 7 bytes hole, try to pack */
> >
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 81, holes: 1, sum holes: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u32                      len;                  /*     0     4 */
> >         __s32                      error;                /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __s64                      val;                  /*    16     8 */
> >
> >         /* size: 24, cachelines: 1, members: 4 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > in the next version. Since the structure has no padding at the end of
> > it, I think the Right Thing will happen. Note that this is slightly
> > different than what Kees suggested, if I add signaled after data, then
> > I end up with:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         struct seccomp_data        data;                 /*    16    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> >         __u8                       signaled;             /*    80     1 */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* padding: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > which I think will have the versioning problem if the next member
> > > introduced is < 7 bytes.
> 
> It'll be a problem in either place. What I was thinking was that
> specific versioning is required instead of just length.

Oh, if we decide to use the padded space? Yes, that makes sense. Ok,
I'll switch it to a version.

Tycho
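
To illustrate the direction agreed on here, a version-keyed layout might look
something like the sketch below (hypothetical field names; nothing here is the
committed UAPI). Padding can later be reclaimed unambiguously because readers branch
on the version rather than on how many bytes happen to be present.

#include <linux/types.h>
#include <linux/seccomp.h>	/* for struct seccomp_data */

struct seccomp_notif_sketch {
	__u32 version;		/* bumped whenever the layout changes */
	__u32 pid;
	__u64 id;
	__u8  signaled;
	__u8  pad[7];		/* reserved; must be zero in this version */
	struct seccomp_data data;
};
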
Tycho Andersen Oct. 21, 2018, 4:04 p.m. UTC | #18
On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >> >         __u64 args[6];
> >> >  };
> >> >
> >> > +struct seccomp_notif {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __u32 pid;
> >> > +       __u8 signaled;
> >> > +       struct seccomp_data data;
> >> > +};
> >> > +
> >> > +struct seccomp_notif_resp {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __s32 error;
> >> > +       __s64 val;
> >> > +};
> >>
> >> So, len has to come first, for versioning. However, since it's ahead
> >> of a u64, this leaves a struct padding hole. pahole output:
> >>
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > Just to confirm my understanding; I've got these as:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __u8                       signaled;             /*    16     1 */
> >
> >         /* XXX 7 bytes hole, try to pack */
> >
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 81, holes: 1, sum holes: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u32                      len;                  /*     0     4 */
> >         __s32                      error;                /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __s64                      val;                  /*    16     8 */
> >
> >         /* size: 24, cachelines: 1, members: 4 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > in the next version. Since the structure has no padding at the end of
> > it, I think the Right Thing will happen. Note that this is slightly
> > different than what Kees suggested, if I add signaled after data, then
> > I end up with:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         struct seccomp_data        data;                 /*    16    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> >         __u8                       signaled;             /*    80     1 */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* padding: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > which I think will have the versioning problem if the next member
> > introduced is < 7 bytes.
> 
> It'll be a problem in either place. What I was thinking was that
> specific versioning is required instead of just length.

Euh, so I implemented this, and it sucks :). It's ugly, and generally
feels bad.

What if we just get rid of versioning altogether, and instead
introduce a u32 flags field? We could have one flag right now
(SECCOMP_NOTIF_FLAG_SIGNALED), and introduce others as we add more
information to the response. Then we can add
SECCOMP_NOTIF_FLAG_EXTRA_FOO, and add another SECCOMP_IOCTL_GET_FOO to
grab the info?

FWIW, it's not really clear to me that we'll ever add anything to the
response since hopefully we'll land PUT_FD, so maybe this is all moot
anyway.

Tycho
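
As a rough sketch of what the flags idea might look like in the uapi header
(names and layout here are hypothetical, not what this series currently
defines):

#include <linux/types.h>
#include <linux/seccomp.h>		/* struct seccomp_data */

#define SECCOMP_NOTIF_FLAG_SIGNALED	(1UL << 0)

struct seccomp_notif_flags {
	__u64 id;
	__u32 pid;
	__u32 flags;			/* SECCOMP_NOTIF_FLAG_* bits */
	struct seccomp_data data;	/* no holes, no trailing padding */
};

New information would then be advertised with additional flag bits and fetched
via additional ioctls, rather than by growing the struct.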
Christian Brauner Oct. 22, 2018, 9:42 a.m. UTC | #19
On Sun, Oct 21, 2018 at 05:04:37PM +0100, Tycho Andersen wrote:
> On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> > On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > >> > @@ -60,4 +62,29 @@ struct seccomp_data {
> > >> >         __u64 args[6];
> > >> >  };
> > >> >
> > >> > +struct seccomp_notif {
> > >> > +       __u16 len;
> > >> > +       __u64 id;
> > >> > +       __u32 pid;
> > >> > +       __u8 signaled;
> > >> > +       struct seccomp_data data;
> > >> > +};
> > >> > +
> > >> > +struct seccomp_notif_resp {
> > >> > +       __u16 len;
> > >> > +       __u64 id;
> > >> > +       __s32 error;
> > >> > +       __s64 val;
> > >> > +};
> > >>
> > >> So, len has to come first, for versioning. However, since it's ahead
> > >> of a u64, this leaves a struct padding hole. pahole output:
> > >>
> > >> struct seccomp_notif {
> > >>         __u16                      len;                  /*     0     2 */
> > >>
> > >>         /* XXX 6 bytes hole, try to pack */
> > >>
> > >>         __u64                      id;                   /*     8     8 */
> > >>         __u32                      pid;                  /*    16     4 */
> > >>         __u8                       signaled;             /*    20     1 */
> > >>
> > >>         /* XXX 3 bytes hole, try to pack */
> > >>
> > >>         struct seccomp_data        data;                 /*    24    64 */
> > >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> > >>
> > >>         /* size: 88, cachelines: 2, members: 5 */
> > >>         /* sum members: 79, holes: 2, sum holes: 9 */
> > >>         /* last cacheline: 24 bytes */
> > >> };
> > >> struct seccomp_notif_resp {
> > >>         __u16                      len;                  /*     0     2 */
> > >>
> > >>         /* XXX 6 bytes hole, try to pack */
> > >>
> > >>         __u64                      id;                   /*     8     8 */
> > >>         __s32                      error;                /*    16     4 */
> > >>
> > >>         /* XXX 4 bytes hole, try to pack */
> > >>
> > >>         __s64                      val;                  /*    24     8 */
> > >>
> > >>         /* size: 32, cachelines: 1, members: 4 */
> > >>         /* sum members: 22, holes: 2, sum holes: 10 */
> > >>         /* last cacheline: 32 bytes */
> > >> };
> > >>
> > >> How about making len u32, and moving pid and error above "id"? This
> > >> leaves a hole after signaled, so changing "len" won't be sufficient
> > >> for versioning here. Perhaps move it after data?
> > >
> > > Just to confirm my understanding; I've got these as:
> > >
> > > struct seccomp_notif {
> > >         __u32                      len;                  /*     0     4 */
> > >         __u32                      pid;                  /*     4     4 */
> > >         __u64                      id;                   /*     8     8 */
> > >         __u8                       signaled;             /*    16     1 */
> > >
> > >         /* XXX 7 bytes hole, try to pack */
> > >
> > >         struct seccomp_data        data;                 /*    24    64 */
> > >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> > >
> > >         /* size: 88, cachelines: 2, members: 5 */
> > >         /* sum members: 81, holes: 1, sum holes: 7 */
> > >         /* last cacheline: 24 bytes */
> > > };
> > > struct seccomp_notif_resp {
> > >         __u32                      len;                  /*     0     4 */
> > >         __s32                      error;                /*     4     4 */
> > >         __u64                      id;                   /*     8     8 */
> > >         __s64                      val;                  /*    16     8 */
> > >
> > >         /* size: 24, cachelines: 1, members: 4 */
> > >         /* last cacheline: 24 bytes */
> > > };
> > >
> > > in the next version. Since the structure has no padding at the end of
> > > it, I think the Right Thing will happen. Note that this is slightly
> > > different than what Kees suggested, if I add signaled after data, then
> > > I end up with:
> > >
> > > struct seccomp_notif {
> > >         __u32                      len;                  /*     0     4 */
> > >         __u32                      pid;                  /*     4     4 */
> > >         __u64                      id;                   /*     8     8 */
> > >         struct seccomp_data        data;                 /*    16    64 */
> > >         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> > >         __u8                       signaled;             /*    80     1 */
> > >
> > >         /* size: 88, cachelines: 2, members: 5 */
> > >         /* padding: 7 */
> > >         /* last cacheline: 24 bytes */
> > > };
> > >
> > > which I think will have the versioning problem if the next member
> > > introduced is < 7 bytes.
> > 
> > It'll be a problem in either place. What I was thinking was that
> > specific versioning is required instead of just length.
> 
> Euh, so I implemented this, and it sucks :). It's ugly, and generally
> feels bad.
> 
> What if we just get rid of versioning altogether, and instead
> introduce a u32 flags field? We could have one flag right now
> (SECCOMP_NOTIF_FLAG_SIGNALED), and introduce others as we add more
> information to the response. Then we can add
> SECCOMP_NOTIF_FLAG_EXTRA_FOO, and add another SECCOMP_IOCTL_GET_FOO to
> grab the info?
> 
> FWIW, it's not really clear to me that we'll ever add anything to the
> response since hopefully we'll land PUT_FD, so maybe this is all moot
> anyway.

I guess the only argument against a flag would be that you run out of
bits quickly if your interface grows (cf. mount, netlink etc.). But this
is likely not a concern here.
I actually think that the way vfs capabilities are done is pretty
nice. By accident or design they allow transparent translation between
old and new formats in-kernel. So it would be cool if we could have the same
guarantee for this interface.

Christian
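
One way to picture the in-kernel translation Christian describes is a
size-based copy-in that zero-fills whatever an older caller did not supply and
ignores whatever a newer caller supplies beyond what this kernel knows about.
This is illustrative only and not part of the patch; it simply mirrors the
min_t()/copy_from_user() pattern seccomp_notify_send() below already uses:

static int copy_resp_from_user(struct seccomp_notif_resp *resp,
			       const void __user *buf, u16 usize)
{
	memset(resp, 0, sizeof(*resp));
	if (copy_from_user(resp, buf, min_t(size_t, usize, sizeof(*resp))))
		return -EFAULT;
	return 0;
}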
diff mbox series

Patch

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 13a7c999c04a..31e9707f7e06 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -345,4 +345,5 @@  Code  Seq#(hex)	Include File		Comments
 					<mailto:raph@8d.com>
 0xF6	all	LTTng			Linux Trace Toolkit Next Generation
 					<mailto:mathieu.desnoyers@efficios.com>
+0xF7    00-1F   uapi/linux/seccomp.h
 0xFD	all	linux/dm-ioctl.h
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 82a468bc7560..d2e61f1c0a0b 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -122,6 +122,11 @@  In precedence order, they are:
 	Results in the lower 16-bits of the return value being passed
 	to userland as the errno without executing the system call.
 
+``SECCOMP_RET_USER_NOTIF``:
+    Results in a ``struct seccomp_notif`` message sent on the userspace
+    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See the
+    discussion of userspace notifications below.
+
 ``SECCOMP_RET_TRACE``:
 	When returned, this value will cause the kernel to attempt to
 	notify a ``ptrace()``-based tracer prior to executing the system
@@ -183,6 +188,74 @@  The ``samples/seccomp/`` directory contains both an x86-specific example
 and a more generic example of a higher level macro interface for BPF
 program generation.
 
+Userspace Notification
+======================
+
+The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
+particular syscall to userspace to be handled. This may be useful for
+applications like container managers, which wish to intercept particular
+syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
+
+There are currently two APIs to acquire a userspace notification fd for a
+particular filter. The first is at filter installation time: the task
+installing the filter can request one from the ``seccomp()`` syscall:
+
+.. code-block::
+
+    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
+
+which (on success) will return a listener fd for the filter, which can then be
+passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
+acquired via:
+
+.. code-block::
+
+    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+
+which grabs the 0th filter for a task over which the tracer has privilege.
+Note that filter fds correspond to a particular filter, and not a particular
+task. So if this task then forks, notifications from both tasks will appear on
+the same filter fd. Reads and writes to/from a filter fd are also synchronized,
+so a filter fd can safely have many readers.
+
+The interface for a seccomp notification fd consists of two structures:
+
+.. code-block::
+
+    struct seccomp_notif {
+        __u16 len;
+        __u64 id;
+        __u32 pid;
+        __u8 signaled;
+        struct seccomp_data data;
+    };
+
+    struct seccomp_notif_resp {
+        __u16 len;
+        __u64 id;
+        __s32 error;
+        __s64 val;
+    };
+
+Users can read via ``ioctl(SECCOMP_NOTIF_RECV)`` (or ``poll()``) on a seccomp
+notification fd to receive a ``struct seccomp_notif``, which contains five
+members: the input length of the structure, a unique-per-filter ``id``, the
+``pid`` of the task which triggered this request (which may be 0 if the task
+is in a pid namespace not visible from the listener's pid namespace), a flag
+indicating whether the notification is the result of a non-fatal signal, and
+the ``data`` passed to seccomp. Userspace can then decide what to do and send
+a response via ``ioctl(SECCOMP_NOTIF_SEND)``, indicating what return value the
+task that triggered the notification should see. The ``id`` member of ``struct
+seccomp_notif_resp`` must be the same ``id`` as in ``struct seccomp_notif``.
+
+It is worth noting that ``struct seccomp_data`` contains the values of register
+arguments to the syscall, but does not contain pointers to memory. The task's
+memory is accessible to suitably privileged tracers via ``ptrace()`` or
+``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
+mentioned above in this document: all arguments being read from the tracee's
+memory should be read into the tracer's memory before any policy decisions are
+made. This allows for an atomic decision on syscall arguments.
+
 Sysctls
 =======
 
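
For illustration, a minimal userspace handler loop for the RECV/SEND flow
documented above. This is a sketch, not part of the patch itself, and it
assumes a listener fd obtained with SECCOMP_FILTER_FLAG_NEW_LISTENER plus
headers providing the structs and ioctl numbers added by this series:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>	/* structs/ioctls added by this series */

static int handle_notifications(int listener)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	for (;;) {
		memset(&req, 0, sizeof(req));
		req.len = sizeof(req);
		if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) < 0)
			return -1;

		/* Policy decision based on req.pid and req.data goes here. */

		memset(&resp, 0, sizeof(resp));
		resp.len = sizeof(resp);
		resp.id = req.id;	/* must match the request's id */
		resp.error = 0;		/* or a negative errno to fail the syscall */
		resp.val = 0;		/* the value the syscall appears to return */

		if (ioctl(listener, SECCOMP_NOTIF_SEND, &resp) < 0)
			return -1;
	}
}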
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index e5320f6c8654..017444b5efed 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -4,9 +4,10 @@ 
 
 #include <uapi/linux/seccomp.h>
 
-#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC	| \
-					 SECCOMP_FILTER_FLAG_LOG	| \
-					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
+#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
+					 SECCOMP_FILTER_FLAG_NEW_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 9efc0e73d50b..d4ccb32fe089 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,9 +17,10 @@ 
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
+#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -35,6 +36,7 @@ 
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -60,4 +62,29 @@  struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signaled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
+#define SECCOMP_IOC_MAGIC		0xF7
+
+/* Flags for seccomp notification fd ioctl. */
+#define SECCOMP_NOTIF_RECV	_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+					struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND	_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+					struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+					__u64)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fd023ac24e10..fa6fe9756c80 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -33,12 +33,78 @@ 
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The task whose filter triggered the notification */
+	struct task_struct *task;
+
+	/* The "cookie" for this request; this is unique for this filter. */
+	u64 id;
+
+	/* Whether this task was interrupted by a signal while waiting. */
+	bool signaled;
+
+	/*
+	 * The seccomp data. This pointer is valid for the entire time this
+	 * notification is active, since it comes from __seccomp_filter(),
+	 * whose lifetime encompasses this notification's entire lifecycle.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to SENT.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+
+/**
+ * struct notification - container for seccomp userspace notifications. Since
+ * most seccomp filters will not have notification listeners attached and this
+ * structure is fairly large, we store the notification-specific stuff in a
+ * separate structure.
+ *
+ * @request: A semaphore that users of this notification can wait on for
+ *           changes. Actual reads and writes are still controlled with
+ *           filter->notify_lock.
+ * @next_id: The id of the next request.
+ * @notifications: A list of struct seccomp_knotif elements.
+ * @wqh: A wait queue for poll.
+ */
+struct notification {
+	struct semaphore request;
+	u64 next_id;
+	struct list_head notifications;
+	wait_queue_head_t wqh;
+};
 
 /**
  * struct seccomp_filter - container for seccomp BPF programs
@@ -66,6 +132,8 @@  struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+	struct notification *notif;
+	struct mutex notify_lock;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -392,6 +460,7 @@  static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&sfilter->notify_lock);
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -556,11 +625,13 @@  static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
+				    SECCOMP_LOG_USER_NOTIF |
 				    SECCOMP_LOG_TRACE |
 				    SECCOMP_LOG_LOG;
 
@@ -581,6 +652,9 @@  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -652,6 +726,73 @@  void secure_computing_strict(int this_syscall)
 #else
 
 #ifdef CONFIG_SECCOMP_FILTER
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	/* Note: overflow is ok here, the id just needs to be unique */
+	return filter->notif->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	err = -ENOSYS;
+	if (!match->notif)
+		goto out;
+
+	n.task = current;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notif->notifications);
+	wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->notif->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		n.signaled = true;
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->notif->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
 {
@@ -728,6 +869,9 @@  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -834,6 +978,9 @@  static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -853,6 +1000,8 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -863,13 +1012,28 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		listener = get_unused_fd_flags(0);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -887,6 +1051,16 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -911,6 +1085,7 @@  static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_KILL_THREAD:
 	case SECCOMP_RET_TRAP:
 	case SECCOMP_RET_ERRNO:
+	case SECCOMP_RET_USER_NOTIF:
 	case SECCOMP_RET_TRACE:
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
@@ -1111,6 +1286,7 @@  long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1120,6 +1296,7 @@  static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1134,6 +1311,7 @@  static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME },
 	{ SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME },
 	{ SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
@@ -1342,3 +1520,259 @@  static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_FILTER
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	wake_up_all(&filter->notif->wqh);
+	kfree(filter->notif);
+	filter->notif = NULL;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static long seccomp_notify_recv(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif = {};
+	ssize_t ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+
+	ret = down_interruptible(&filter->notif->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire the notify_lock mutex.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	size = min_t(size_t, size, sizeof(unotif));
+
+	unotif.len = size;
+	unotif.id = knotif->id;
+	unotif.pid = task_pid_vnr(knotif->task);
+	unotif.signaled = knotif->signaled;
+	unotif.data = *(knotif->data);
+
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_SENT;
+	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_send(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	long ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_id_valid(struct seccomp_filter *filter,
+				    unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL;
+	void __user *buf = (void __user *)arg;
+	u64 id;
+	long ret;
+
+	if (copy_from_user(&id, buf, sizeof(id)))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	ret = -1;
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->id == id) {
+			ret = 0;
+			goto out;
+		}
+	}
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	struct seccomp_filter *filter = file->private_data;
+
+	switch (cmd) {
+	case SECCOMP_NOTIF_RECV:
+		return seccomp_notify_recv(filter, arg);
+	case SECCOMP_NOTIF_SEND:
+		return seccomp_notify_send(filter, arg);
+	case SECCOMP_NOTIF_ID_VALID:
+		return seccomp_notify_id_valid(filter, arg);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	poll_wait(file, &filter->notif->wqh, poll_tab);
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+		if (ret & EPOLLIN && ret & EPOLLOUT)
+			break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+	.unlocked_ioctl = seccomp_notify_ioctl,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur, *last_locked = NULL;
+	int filter_nesting = 0;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock_nested(&cur->notify_lock, filter_nesting);
+		filter_nesting++;
+		last_locked = cur;
+		if (cur->notif)
+			goto out;
+	}
+
+	ret = ERR_PTR(-ENOMEM);
+	filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
+	if (!filter->notif)
+		goto out;
+
+	sema_init(&filter->notif->request, 0);
+	INIT_LIST_HEAD(&filter->notif->notifications);
+	filter->notif->next_id = get_random_u64();
+	init_waitqueue_head(&filter->notif->wqh);
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_unlock(&cur->notify_lock);
+		if (cur == last_locked)
+			break;
+	}
+
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index e1473234968d..5f4b836a6792 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -5,6 +5,7 @@ 
  * Test code for seccomp bpf.
  */
 
+#define _GNU_SOURCE
 #include <sys/types.h>
 
 /*
@@ -40,10 +41,12 @@ 
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
 
-#define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -154,6 +157,34 @@  struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+#define SECCOMP_IOC_MAGIC		0xF7
+#define SECCOMP_NOTIF_RECV	_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+					struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND	_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+					struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+					__u64)
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signaled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2077,7 +2108,8 @@  TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
 				 SECCOMP_FILTER_FLAG_LOG,
-				 SECCOMP_FILTER_FLAG_SPEC_ALLOW };
+				 SECCOMP_FILTER_FLAG_SPEC_ALLOW,
+				 SECCOMP_FILTER_FLAG_NEW_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2933,6 +2965,383 @@  TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		req->len = sizeof(*req);
+		ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Add some no-op filters so that we (don't) trigger lockdep. */
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_NEW_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLIN);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
+
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, ENOENT);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(req.signaled, 1);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = -512; /* -ERESTARTSYS */
+	resp.val = 0;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	ret = read_notif(listener, &req);
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that we get an ENOSYS when the listener is closed.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		close(listener);
+		ret = syscall(__NR_getpid);
+		exit(ret != -1 || errno != ENOSYS);
+	}
+
+	close(listener);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+/*
+ * Check that a pid in a child namespace still shows up as valid in ours.
+ */
+TEST(user_notification_child_pid_ns)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+	ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(req.pid, pid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+/*
+ * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
+ * invalid.
+ */
+TEST(user_notification_sibling_pid_ns)
+{
+	pid_t pid, pid2;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int child_pair[2];
+
+		ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+		ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
+
+		pid2 = fork();
+		ASSERT_GE(pid2, 0);
+
+		if (pid2 == 0) {
+			close(child_pair[0]);
+			EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+			/* Signal we're ready and have installed the filter. */
+			EXPECT_EQ(write(child_pair[1], "J", 1), 1);
+
+			EXPECT_EQ(read(child_pair[1], &c, 1), 1);
+			EXPECT_EQ(c, 'H');
+
+			exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+		}
+
+		/* check that child has installed the filter */
+		EXPECT_EQ(read(child_pair[0], &c, 1), 1);
+		EXPECT_EQ(c, 'J');
+
+		/* tell parent who child is */
+		EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
+
+		/* parent has installed listener, tell child to call syscall */
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		EXPECT_EQ(write(child_pair[0], "H", 1), 1);
+
+		EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+		EXPECT_EQ(true, WIFEXITED(status));
+		EXPECT_EQ(0, WEXITSTATUS(status));
+		exit(WEXITSTATUS(status));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
+	EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(errno, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
+
+	/* Create the sibling ns, and sibling in it. */
+	EXPECT_EQ(unshare(CLONE_NEWPID), 0);
+	EXPECT_EQ(errno, 0);
+
+	pid2 = fork();
+	EXPECT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		req.len = sizeof(req);
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+		/*
+		 * The pid should be 0, i.e. the task is in some namespace that
+		 * we can't "see".
+		 */
+		ASSERT_EQ(req.pid, 0);
+
+		resp.len = sizeof(resp);
+		resp.id = req.id;
+		resp.error = 0;
+		resp.val = USER_NOTIF_MAGIC;
+
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+		exit(0);
+	}
+
+	close(listener);
+
+	/* Now signal we are done setting up sibling listener. */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 /*
  * TODO:
  * - add microbenchmarks