
[v7,1/6] seccomp: add a return code to trap to userspace

Message ID 20180927151119.9989-2-tycho@tycho.ws (mailing list archive)
State New, archived
Series seccomp trap to userspace

Commit Message

Tycho Andersen Sept. 27, 2018, 3:11 p.m. UTC
This patch introduces a means for syscalls matched in seccomp to notify
some other task that a particular filter has been triggered.

The motivation for this is primarily for use with containers. For example,
if a container does an init_module(), we obviously don't want to load this
untrusted code, which may be compiled for the wrong version of the kernel
anyway. Instead, we could parse the module image, figure out which module
the container is trying to load and load it on the host.

As another example, containers cannot mknod(), since this checks
capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
/dev/zero should be ok for containers to mknod, but we'd like to avoid hard
coding some whitelist in the kernel. Another example is mount(), which has
many security restrictions for good reason, but configuration or runtime
knowledge could potentially be used to relax these restrictions.

This patch adds functionality that is already possible via at least two
other means that I know about, both of which involve ptrace(): first, one
could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
Unfortunately this is slow, so a faster version would be to install a
filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
Since ptrace allows only one tracer, if the container runtime is that
tracer, users inside the container (or outside) trying to debug it will not
be able to use ptrace, which is annoying. It also means that older
distributions based on Upstart cannot boot inside containers using ptrace,
since upstart itself uses ptrace to start services.
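
To make the comparison concrete, the SECCOMP_RET_TRACE flavor looks
roughly like this from the tracer's side (sketch only, existing ptrace
API, error and signal handling omitted):

  ptrace(PTRACE_ATTACH, pid, NULL, NULL);
  waitpid(pid, &status, 0);
  ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACESECCOMP);
  ptrace(PTRACE_CONT, pid, NULL, 0);

  for (;;) {
          if (waitpid(pid, &status, 0) != pid)
                  break;
          /* the tracee hit a filter rule returning SECCOMP_RET_TRACE */
          if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
                  /* inspect or rewrite the syscall via the registers here */
          }
          ptrace(PTRACE_CONT, pid, NULL, 0);
  }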

The actual implementation of this is fairly small, although getting the
synchronization right was/is slightly complex.

Finally, it's worth noting that the classic seccomp TOCTOU of reading
memory data from the task still applies here, but can be avoided with
careful design of the userspace handler: if the userspace handler reads all
of the task memory that is necessary before applying its security policy,
the tracee's subsequent memory edits will not be read by the tracer.
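
For example, a handler built on the interface below might look roughly
like this (sketch only; "listener" is the fd created via
SECCOMP_FILTER_FLAG_NEW_LISTENER, and allowed() is a stand-in for
whatever policy check the handler actually applies):

  struct seccomp_notif req = {};
  struct seccomp_notif_resp resp = {};
  char path[64], copy[4096];
  int memfd;

  req.len = sizeof(req);
  ioctl(listener, SECCOMP_NOTIF_RECV, &req);

  /* Copy the tracee's argument memory exactly once, up front. */
  snprintf(path, sizeof(path), "/proc/%d/mem", req.pid);
  memfd = open(path, O_RDONLY);
  pread(memfd, copy, sizeof(copy), req.data.args[0]);
  close(memfd);

  /*
   * Apply policy to the private copy; anything the tracee writes to its
   * memory after this point cannot change what was already read.
   */
  resp.len = sizeof(resp);
  resp.id = req.id;
  resp.error = allowed(copy) ? 0 : -EPERM;
  resp.val = 0;
  ioctl(listener, SECCOMP_NOTIF_SEND, &resp);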

v2: * make id a u64; the idea here being that it will never overflow,
      because 2^64 is huge (one syscall every nanosecond => wrap every 584
      years) (Andy)
    * prevent nesting of user notifications: if someone is already attached
      to the tree in one place, nobody else can attach to the tree (Andy)
    * notify the listener of signals the tracee receives as well (Andy)
    * implement poll
v3: * lockdep fix (Oleg)
    * drop unnecessary WARN()s (Christian)
    * rearrange error returns to be more pretty (Christian)
    * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
v4: * fix implementation of poll to use poll_wait() (Jann)
    * change listener's fd flags to be 0 (Jann)
    * hoist filter initialization out of ifdefs to its own function
      init_user_notification()
    * add some more testing around poll() and closing the listener while a
      syscall is in action
    * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
      creates a new one (Matthew)
    * correctly handle pid namespaces, add some testcases (Matthew)
    * use EINPROGRESS instead of EINVAL when a notification response is
      written twice (Matthew)
    * fix comment typo from older version (SEND vs READ) (Matthew)
    * whitespace and logic simplification (Tobin)
    * add some Documentation/ bits on userspace trapping
v5: * fix documentation typos (Jann)
    * add signalled field to struct seccomp_notif (Jann)
    * switch to using ioctls instead of read()/write() for struct passing
      (Jann)
    * add an ioctl to ensure an id is still valid
v6: * docs typo fixes, update docs for ioctl() change (Christian)
v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
    * use notify_lock in IS_ID_VALID query to avoid racing
    * s/signalled/signaled (Tyler)
    * fix docs to reflect that ids are not globally unique (Tyler)
    * add a test to check -ERESTARTSYS behavior (Tyler)
    * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
    * reorder USER_NOTIF in seccomp return codes list (Tyler)
    * return size instead of sizeof(struct user_notif) (Tyler)
    * ENOENT instead of EINVAL when invalid id is passed (Tyler)
    * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
    * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
    * add a new struct notification to minimize the additions to
      struct seccomp_filter, also pack the necessary additions a bit more
      cleverly (Tyler)
    * switch to keeping track of the task itself instead of the pid (we'll
      use this for implementing PUT_FD)

Signed-off-by: Tycho Andersen <tycho@tycho.ws>
CC: Kees Cook <keescook@chromium.org>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Eric W. Biederman <ebiederm@xmission.com>
CC: "Serge E. Hallyn" <serge@hallyn.com>
CC: Christian Brauner <christian.brauner@ubuntu.com>
CC: Tyler Hicks <tyhicks@canonical.com>
CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
---
 Documentation/ioctl/ioctl-number.txt          |   1 +
 .../userspace-api/seccomp_filter.rst          |  73 +++
 include/linux/seccomp.h                       |   7 +-
 include/uapi/linux/seccomp.h                  |  33 +-
 kernel/seccomp.c                              | 436 +++++++++++++++++-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 413 ++++++++++++++++-
 6 files changed, 954 insertions(+), 9 deletions(-)

Comments

Kees Cook Sept. 27, 2018, 9:31 p.m. UTC | #1
On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
>
> v2: * make id a u64; the idea here being that it will never overflow,
>       because 2^64 is huge (one syscall every nanosecond => wrap every 584
>       years) (Andy)
>     * prevent nesting of user notifications: if someone is already attached
>       to the tree in one place, nobody else can attach to the tree (Andy)
>     * notify the listener of signals the tracee receives as well (Andy)
>     * implement poll
> v3: * lockdep fix (Oleg)
>     * drop unnecessary WARN()s (Christian)
>     * rearrange error returns to be more pretty (Christian)
>     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> v4: * fix implementation of poll to use poll_wait() (Jann)
>     * change listener's fd flags to be 0 (Jann)
>     * hoist filter initialization out of ifdefs to its own function
>       init_user_notification()
>     * add some more testing around poll() and closing the listener while a
>       syscall is in action
>     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
>       creates a new one (Matthew)
>     * correctly handle pid namespaces, add some testcases (Matthew)
>     * use EINPROGRESS instead of EINVAL when a notification response is
>       written twice (Matthew)
>     * fix comment typo from older version (SEND vs READ) (Matthew)
>     * whitespace and logic simplification (Tobin)
>     * add some Documentation/ bits on userspace trapping
> v5: * fix documentation typos (Jann)
>     * add signalled field to struct seccomp_notif (Jann)
>     * switch to using ioctls instead of read()/write() for struct passing
>       (Jann)
>     * add an ioctl to ensure an id is still valid
> v6: * docs typo fixes, update docs for ioctl() change (Christian)
> v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
>     * use notify_lock in IS_ID_VALID query to avoid racing
>     * s/signalled/signaled (Tyler)
>     * fix docs to reflect that ids are not globally unique (Tyler)
>     * add a test to check -ERESTARTSYS behavior (Tyler)
>     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
>     * reorder USER_NOTIF in seccomp return codes list (Tyler)
>     * return size instead of sizeof(struct user_notif) (Tyler)
>     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
>     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
>     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
>     * add a new struct notification to minimize the additions to
>       struct seccomp_filter, also pack the necessary additions a bit more
>       cleverly (Tyler)
>     * switch to keeping track of the task itself instead of the pid (we'll
>       use this for implementing PUT_FD)

Patch-sending nit: can you put the versioning below the "---" line so
it isn't included in the final commit? (And I normally read these
backwards, so I'd expect v7 at the top, but that's not a big deal. I
mean... neither is the --- thing, but it makes "git am" easier for me
since I don't have to go edit the versioning out of the log.)

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
> ---
>  Documentation/ioctl/ioctl-number.txt          |   1 +
>  .../userspace-api/seccomp_filter.rst          |  73 +++
>  include/linux/seccomp.h                       |   7 +-
>  include/uapi/linux/seccomp.h                  |  33 +-
>  kernel/seccomp.c                              | 436 +++++++++++++++++-
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 413 ++++++++++++++++-
>  6 files changed, 954 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 13a7c999c04a..31e9707f7e06 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -345,4 +345,5 @@ Code  Seq#(hex)     Include File            Comments
>                                         <mailto:raph@8d.com>
>  0xF6   all     LTTng                   Linux Trace Toolkit Next Generation
>                                         <mailto:mathieu.desnoyers@efficios.com>
> +0xF7    00-1F   uapi/linux/seccomp.h
>  0xFD   all     linux/dm-ioctl.h

I spent some time looking at this, and yes, it seems preferable to add
an entry here.

> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index e5320f6c8654..017444b5efed 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -4,9 +4,10 @@
>
>  #include <uapi/linux/seccomp.h>
>
> -#define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC      | \
> -                                        SECCOMP_FILTER_FLAG_LOG        | \
> -                                        SECCOMP_FILTER_FLAG_SPEC_ALLOW)
> +#define SECCOMP_FILTER_FLAG_MASK       (SECCOMP_FILTER_FLAG_TSYNC | \
> +                                        SECCOMP_FILTER_FLAG_LOG | \
> +                                        SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
> +                                        SECCOMP_FILTER_FLAG_NEW_LISTENER)
>
>  #ifdef CONFIG_SECCOMP
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 9efc0e73d50b..d4ccb32fe089 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -17,9 +17,10 @@
>  #define SECCOMP_GET_ACTION_AVAIL       2
>
>  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)

Since these are all getting indentation updates, can you switch them
to BIT(0), BIT(1), etc?

>  /*
>   * All BPF programs must return a 32-bit value.
> @@ -35,6 +36,7 @@
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};

So, len has to come first, for versioning. However, since it's ahead
of a u64, this leaves a struct padding hole. pahole output:

struct seccomp_notif {
        __u16                      len;                  /*     0     2 */

        /* XXX 6 bytes hole, try to pack */

        __u64                      id;                   /*     8     8 */
        __u32                      pid;                  /*    16     4 */
        __u8                       signaled;             /*    20     1 */

        /* XXX 3 bytes hole, try to pack */

        struct seccomp_data        data;                 /*    24    64 */
        /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */

        /* size: 88, cachelines: 2, members: 5 */
        /* sum members: 79, holes: 2, sum holes: 9 */
        /* last cacheline: 24 bytes */
};
struct seccomp_notif_resp {
        __u16                      len;                  /*     0     2 */

        /* XXX 6 bytes hole, try to pack */

        __u64                      id;                   /*     8     8 */
        __s32                      error;                /*    16     4 */

        /* XXX 4 bytes hole, try to pack */

        __s64                      val;                  /*    24     8 */

        /* size: 32, cachelines: 1, members: 4 */
        /* sum members: 22, holes: 2, sum holes: 10 */
        /* last cacheline: 32 bytes */
};

How about making len u32, and moving pid and error above "id"? This
leaves a hole after signaled, so changing "len" won't be sufficient
for versioning here. Perhaps move it after data?
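
i.e. something roughly like this (illustration only, offsets worked out
by hand, not pahole output):

struct seccomp_notif {
        __u32 len;                      /* offset  0 */
        __u32 pid;                      /* offset  4 */
        __u64 id;                       /* offset  8 */
        struct seccomp_data data;       /* offset 16 */
        __u8 signaled;                  /* offset 80; only tail padding left */
};

struct seccomp_notif_resp {
        __u32 len;                      /* offset  0 */
        __s32 error;                    /* offset  4 */
        __u64 id;                       /* offset  8 */
        __s64 val;                      /* offset 16; size 24, no holes */
};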

> +
> +#define SECCOMP_IOC_MAGIC              0xF7

Was there any specific reason for picking this value? There are lots
of fun ASCII codes left, like '!' or '*'. :)

> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)

To match other UAPI ioctls, can these have a prefix of "SECCOMP_IOCTL_..."?

It may also be useful to match how other uapis do this, like for DRM:

#define DRM_IOCTL_BASE                  'd'
#define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
#define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
#define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
#define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)

#define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
#define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
#define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
...


> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..fa6fe9756c80 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,12 +33,78 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_SENT,
> +       SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +       /* The struct pid of the task whose filter triggered the notification */
> +       struct task_struct *task;
> +
> +       /* The "cookie" for this request; this is unique for this filter. */
> +       u64 id;
> +
> +       /* Whether or not this task has been given an interruptible signal. */
> +       bool signaled;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once the
> +        * handler reads the notification off of an FD, it transitions to SENT.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies, state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +
> +/**
> + * struct notification - container for seccomp userspace notifications. Since
> + * most seccomp filters will not have notification listeners attached and this
> + * structure is fairly large, we store the notification-specific stuff in a
> + * separate structure.
> + *
> + * @request: A semaphore that users of this notification can wait on for
> + *           changes. Actual reads and writes are still controlled with
> + *           filter->notify_lock.
> + * @notify_lock: A lock for all notification-related accesses.
> + * @next_id: The id of the next request.
> + * @notifications: A list of struct seccomp_knotif elements.
> + * @wqh: A wait queue for poll.
> + */
> +struct notification {
> +       struct semaphore request;
> +       u64 next_id;
> +       struct list_head notifications;
> +       wait_queue_head_t wqh;
> +};
>
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
> @@ -66,6 +132,8 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +       struct notification *notif;
> +       struct mutex notify_lock;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +       mutex_init(&sfilter->notify_lock);
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
> @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
>  #define SECCOMP_LOG_TRACE              (1 << 4)
>  #define SECCOMP_LOG_LOG                        (1 << 5)
>  #define SECCOMP_LOG_ALLOW              (1 << 6)
> +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
>
>  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
>                                     SECCOMP_LOG_KILL_THREAD  |
>                                     SECCOMP_LOG_TRAP  |
>                                     SECCOMP_LOG_ERRNO |
> +                                   SECCOMP_LOG_USER_NOTIF |
>                                     SECCOMP_LOG_TRACE |
>                                     SECCOMP_LOG_LOG;
>
> @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
>         case SECCOMP_RET_TRACE:
>                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
>                 break;
> +       case SECCOMP_RET_USER_NOTIF:
> +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> +               break;
>         case SECCOMP_RET_LOG:
>                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
>                 break;
> @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
>  #else
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +       /* Note: overflow is ok here, the id just needs to be unique */

Maybe just clarify in the comment: unique to the filter.

> +       return filter->notif->next_id++;

Also, it might be useful to add for both documentation and lockdep:

lockdep_assert_held(&filter->notify_lock);

into this function?


> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->notif)
> +               goto out;
> +
> +       n.task = current;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notif->notifications);
> +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->notif->request);
> +

Maybe add a big comment here saying this is where we're waiting for a reply?

> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the mutex
> +        * while the reply was sent, so let's be sure there wasn't a response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it (potentially
> +                * again, if we had already notified them about the first one).
> +                */
> +               n.signaled = true;
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->notif->request);
> +               }
> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;
> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
>  {
> @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>
>                 return 0;
>
> +       case SECCOMP_RET_USER_NOTIF:
> +               seccomp_do_user_notification(this_syscall, match, sd);
> +               goto skip;

Nit: please add a blank line here (to match the other cases).

>         case SECCOMP_RET_LOG:
>                 seccomp_log(this_syscall, 0, action, true);
>                 return 0;
> @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
>  }
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +                                 struct seccomp_filter *);

Why is the forward declaration needed instead of just moving the
function here? I didn't see anything in it that looked like it
couldn't move.

> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;

Nit: "invalid fd" should be -1, not 0.

> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               listener = get_unused_fd_flags(0);

As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?

> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(current, prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }

Can you update the kern-docs for seccomp_set_mode_filter(), since we
can now return positive values?

 * Returns 0 on success or -EINVAL on failure.

(this shouldn't say only -EINVAL, I realize too)

I have to say, I'm vaguely nervous about changing the semantics here
for passing back the fd as the return code from the seccomp() syscall.
Alternatives seem less appealing, though: changing the meaning of the
uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
example. Hmm.
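
For reference, the calling convention this creates (and which the
selftests below rely on) is roughly:

        int listener;

        /* prog's filter returns SECCOMP_RET_USER_NOTIF for some syscall */
        listener = seccomp(SECCOMP_SET_MODE_FILTER,
                           SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
        if (listener < 0) {
                /* errno-style failure, as before */
        } else {
                /* success: the return value is the new notification fd,
                 * usable with poll() and the SECCOMP_NOTIF_* ioctls */
        }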

>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
> @@ -911,6 +1085,7 @@ static long seccomp_get_action_avail(const char __user *uaction)
>         case SECCOMP_RET_KILL_THREAD:
>         case SECCOMP_RET_TRAP:
>         case SECCOMP_RET_ERRNO:
> +       case SECCOMP_RET_USER_NOTIF:
>         case SECCOMP_RET_TRACE:
>         case SECCOMP_RET_LOG:
>         case SECCOMP_RET_ALLOW:
> @@ -1111,6 +1286,7 @@ long seccomp_get_metadata(struct task_struct *task,
>  #define SECCOMP_RET_KILL_THREAD_NAME   "kill_thread"
>  #define SECCOMP_RET_TRAP_NAME          "trap"
>  #define SECCOMP_RET_ERRNO_NAME         "errno"
> +#define SECCOMP_RET_USER_NOTIF_NAME    "user_notif"
>  #define SECCOMP_RET_TRACE_NAME         "trace"
>  #define SECCOMP_RET_LOG_NAME           "log"
>  #define SECCOMP_RET_ALLOW_NAME         "allow"
> @@ -1120,6 +1296,7 @@ static const char seccomp_actions_avail[] =
>                                 SECCOMP_RET_KILL_THREAD_NAME    " "
>                                 SECCOMP_RET_TRAP_NAME           " "
>                                 SECCOMP_RET_ERRNO_NAME          " "
> +                               SECCOMP_RET_USER_NOTIF_NAME     " "
>                                 SECCOMP_RET_TRACE_NAME          " "
>                                 SECCOMP_RET_LOG_NAME            " "
>                                 SECCOMP_RET_ALLOW_NAME;
> @@ -1134,6 +1311,7 @@ static const struct seccomp_log_name seccomp_log_names[] = {
>         { SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME },
>         { SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME },
>         { SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME },
> +       { SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
>         { SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
>         { SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
>         { SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
> @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
>  device_initcall(seccomp_sysctl_init)
>
>  #endif /* CONFIG_SYSCTL */
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_knotif *knotif;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +                       continue;
> +
> +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +
> +               complete(&knotif->ready);
> +       }
> +
> +       wake_up_all(&filter->notif->wqh);
> +       kfree(filter->notif);
> +       filter->notif = NULL;
> +       mutex_unlock(&filter->notify_lock);

It looks like that means nothing waiting on knotif->ready can access
filter->notif without rechecking it, yes?

e.g. in seccomp_do_user_notification() I see:

                        up(&match->notif->request);

I *think* this isn't reachable due to the test for n.state !=
SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
it's not fast-path, we could add a WARN_ON() while checking for
unreplied signal death?

                n.signaled = true;
                if (n.state == SECCOMP_NOTIFY_SENT) {
                        n.state = SECCOMP_NOTIFY_INIT;
                        if (!WARN_ON(!match->notif))
                            up(&match->notif->request);
                }
                mutex_unlock(&match->notify_lock);


> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif = {};
> +       ssize_t ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;

I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
anything from accidentally using "arg" directly here.
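
i.e. something like this (sketch), with the helpers changed to take a
void __user * instead of the raw unsigned long:

static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
                                 unsigned long arg)
{
        struct seccomp_filter *filter = file->private_data;
        void __user *buf = (void __user *)arg;

        switch (cmd) {
        case SECCOMP_NOTIF_RECV:
                return seccomp_notify_recv(filter, buf);
        case SECCOMP_NOTIF_SEND:
                return seccomp_notify_send(filter, buf);
        case SECCOMP_NOTIF_ID_VALID:
                return seccomp_notify_id_valid(filter, buf);
        default:
                return -EINVAL;
        }
}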

> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +
> +       ret = down_interruptible(&filter->notif->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were able to
> +        * acquire the rw lock.
> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       size = min_t(size_t, size, sizeof(unotif));
> +

It is possible (though unlikely given the type widths involved here)
for unotif = {} to not initialize padding, so I would recommend an
explicit memset(&unotif, 0, sizeof(unotif)) here.

> +       unotif.len = size;
> +       unotif.id = knotif->id;
> +       unotif.pid = task_pid_vnr(knotif->task);
> +       unotif.signaled = knotif->signaled;
> +       unotif.data = *(knotif->data);
> +
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);

Is there some way to rearrange the locking here to avoid holding the
mutex while doing copy_to_user() (which userspace could block with
userfaultfd, and then stall all the other notifications for this
filter)?
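
One possible shape (hand-waving, and the "notification disappeared while
we were unlocked" case needs care) is to snapshot into the local struct
under the lock and only copy out after dropping it:

        /* under notify_lock: fill unotif from knotif, then... */
        knotif->state = SECCOMP_NOTIFY_SENT;
        wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
        mutex_unlock(&filter->notify_lock);

        if (copy_to_user(buf, &unotif, size)) {
                /*
                 * The tracee may have been signaled or killed while we were
                 * unlocked, so look the notification up again by id before
                 * rolling its state back, and bump the request semaphore so
                 * another RECV can pick it up.
                 */
                mutex_lock(&filter->notify_lock);
                list_for_each_entry(cur, &filter->notif->notifications, list) {
                        if (cur->id == unotif.id &&
                            cur->state == SECCOMP_NOTIFY_SENT) {
                                cur->state = SECCOMP_NOTIFY_INIT;
                                up(&filter->notif->request);
                                break;
                        }
                }
                mutex_unlock(&filter->notify_lock);
                return -EFAULT;
        }
        return size;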

> +       return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;

Same cast note as above.

> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +       size = min_t(size_t, size, sizeof(resp));
> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;

For sanity checking on a double-read from userspace, please add:

    if (resp.len != size)
        return -EINVAL;

> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == resp.id)
> +                       break;
> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       /* Allow exactly one reply. */
> +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +               ret = -EINPROGRESS;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> +                                   unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       void __user *buf = (void __user *)arg;
> +       u64 id;
> +       long ret;
> +
> +       if (copy_from_user(&id, buf, sizeof(id)))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -1;

Isn't this EPERM? Shouldn't it be -ENOENT?

> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == id) {
> +                       ret = 0;
> +                       goto out;
> +               }
> +       }
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> +                                unsigned long arg)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +
> +       switch (cmd) {
> +       case SECCOMP_NOTIF_RECV:
> +               return seccomp_notify_recv(filter, arg);
> +       case SECCOMP_NOTIF_SEND:
> +               return seccomp_notify_send(filter, arg);
> +       case SECCOMP_NOTIF_ID_VALID:
> +               return seccomp_notify_id_valid(filter, arg);
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       poll_wait(file, &filter->notif->wqh, poll_tab);
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
> +               if (ret & EPOLLIN && ret & EPOLLOUT)

My eyes! :) Can you wrap the bit operations in parens here?

> +                       break;
> +       }

Should POLLERR be handled here too? I don't quite see which conditions
would expose it. If all the processes using the filter die, what happens
here?

> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .poll = seccomp_notify_poll,
> +       .release = seccomp_notify_release,
> +       .unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{
> +       struct file *ret = ERR_PTR(-EBUSY);
> +       struct seccomp_filter *cur, *last_locked = NULL;
> +       int filter_nesting = 0;
> +
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +               filter_nesting++;
> +               last_locked = cur;
> +               if (cur->notif)
> +                       goto out;
> +       }
> +
> +       ret = ERR_PTR(-ENOMEM);
> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> +       if (!filter->notif)
> +               goto out;
> +
> +       sema_init(&filter->notif->request, 0);
> +       INIT_LIST_HEAD(&filter->notif->notifications);
> +       filter->notif->next_id = get_random_u64();
> +       init_waitqueue_head(&filter->notif->wqh);
> +
> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret))
> +               goto out;
> +
> +
> +       /* The file has a reference to it now */
> +       __get_seccomp_filter(filter);
> +
> +out:
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_unlock(&cur->notify_lock);
> +               if (cur == last_locked)
> +                       break;
> +       }
> +
> +       return ret;
> +}
> +#endif
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index e1473234968d..5f4b836a6792 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -5,6 +5,7 @@
>   * Test code for seccomp bpf.
>   */
>
> +#define _GNU_SOURCE
>  #include <sys/types.h>
>
>  /*
> @@ -40,10 +41,12 @@
>  #include <sys/fcntl.h>
>  #include <sys/mman.h>
>  #include <sys/times.h>
> +#include <sys/socket.h>
> +#include <sys/ioctl.h>
>
> -#define _GNU_SOURCE
>  #include <unistd.h>
>  #include <sys/syscall.h>
> +#include <poll.h>
>
>  #include "../kselftest_harness.h"
>
> @@ -154,6 +157,34 @@ struct seccomp_metadata {
>  };
>  #endif
>
> +#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
> +#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
> +
> +#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
> +
> +#define SECCOMP_IOC_MAGIC              0xF7
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +#endif
> +
>  #ifndef seccomp
>  int seccomp(unsigned int op, unsigned int flags, void *args)
>  {
> @@ -2077,7 +2108,8 @@ TEST(detect_seccomp_filter_flags)
>  {
>         unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
>                                  SECCOMP_FILTER_FLAG_LOG,
> -                                SECCOMP_FILTER_FLAG_SPEC_ALLOW };
> +                                SECCOMP_FILTER_FLAG_SPEC_ALLOW,
> +                                SECCOMP_FILTER_FLAG_NEW_LISTENER };
>         unsigned int flag, all_flags;
>         int i;
>         long ret;
> @@ -2933,6 +2965,383 @@ TEST(get_metadata)
>         ASSERT_EQ(0, kill(pid, SIGKILL));
>  }
>
> +static int user_trap_syscall(int nr, unsigned int flags)
> +{
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> +                       offsetof(struct seccomp_data, nr)),
> +               BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
> +               BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
> +}
> +
> +static int read_notif(int listener, struct seccomp_notif *req)
> +{
> +       int ret;
> +
> +       do {
> +               errno = 0;
> +               req->len = sizeof(*req);
> +               ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
> +       } while (ret == -1 && errno == ENOENT);
> +       return ret;
> +}
> +
> +static void signal_handler(int signal)
> +{
> +}
> +
> +#define USER_NOTIF_MAGIC 116983961184613L
> +TEST(get_user_notification_syscall)
> +{
> +       pid_t pid;
> +       long ret;
> +       int status, listener;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +       struct pollfd pollfd;
> +
> +       struct sock_filter filter[] = {
> +               BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> +       };
> +       struct sock_fprog prog = {
> +               .len = (unsigned short)ARRAY_SIZE(filter),
> +               .filter = filter,
> +       };
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       /* Check that we get -ENOSYS with no listener attached */
> +       if (pid == 0) {
> +               if (user_trap_syscall(__NR_getpid, 0) < 0)
> +                       exit(1);
> +               ret = syscall(__NR_getpid);
> +               exit(ret >= 0 || errno != ENOSYS);
> +       }
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /* Add some no-op filters so that we (don't) trigger lockdep. */
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +       EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
> +
> +       /* Check that the basic notification machinery works */
> +       listener = user_trap_syscall(__NR_getpid,
> +                                    SECCOMP_FILTER_FLAG_NEW_LISTENER);
> +       EXPECT_GE(listener, 0);
> +
> +       /* Installing a second listener in the chain should EBUSY */
> +       EXPECT_EQ(user_trap_syscall(__NR_getpid,
> +                                   SECCOMP_FILTER_FLAG_NEW_LISTENER),
> +                 -1);
> +       EXPECT_EQ(errno, EBUSY);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLIN);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +
> +       pollfd.fd = listener;
> +       pollfd.events = POLLIN | POLLOUT;
> +
> +       EXPECT_GT(poll(&pollfd, 1, -1), 0);
> +       EXPECT_EQ(pollfd.revents, POLLOUT);
> +
> +       EXPECT_EQ(req.data.nr,  __NR_getpid);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that nothing bad happens when we kill the task in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> +
> +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);

Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
been wondering what it's for, and now I see it's kind of an advisory
"is the other end still alive?" test.

> +
> +       resp.id = req.id;
> +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> +       EXPECT_EQ(ret, -1);
> +       EXPECT_EQ(errno, ENOENT);
> +
> +       /*
> +        * Check that we get another notification about a signal in the middle
> +        * of a syscall.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> +                       perror("signal");
> +                       exit(1);
> +               }
> +               ret = syscall(__NR_getpid);
> +               exit(ret != USER_NOTIF_MAGIC);
> +       }
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> +
> +       ret = read_notif(listener, &req);
> +       EXPECT_EQ(req.signaled, 1);
> +       EXPECT_EQ(ret, sizeof(req));
> +       EXPECT_EQ(errno, 0);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = -512; /* -ERESTARTSYS */
> +       resp.val = 0;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       ret = read_notif(listener, &req);
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);

I was slightly confused here: why have there been 3 reads? I was
expecting one notification for hitting getpid and one from catching a
signal. But in rereading, I see that NOTIF_RECV will return the most
recently unresponded notification, yes?

But... catching a signal replaces the existing seccomp_knotif? I
remain confused about how signal handling is meant to work here. What
happens if two signals get sent? It looks like you just block without
allowing more signals? (Thank you for writing the tests!)

(And can you document the expected behavior in the seccomp_filter.rst too?)

> +       EXPECT_EQ(ret, sizeof(resp));
> +       EXPECT_EQ(errno, 0);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       /*
> +        * Check that we get an ENOSYS when the listener is closed.
> +        */
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +       if (pid == 0) {
> +               close(listener);
> +               ret = syscall(__NR_getpid);
> +               exit(ret != -1 && errno != ENOSYS);
> +       }
> +
> +       close(listener);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +/*
> + * Check that a pid in a child namespace still shows up as valid in ours.
> + */
> +TEST(user_notification_child_pid_ns)
> +{
> +       pid_t pid;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +       ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +               /* Signal we're ready and have installed the filter. */
> +               EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
> +
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +
> +               exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
> +       EXPECT_EQ(c, 'J');
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
> +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> +       EXPECT_GE(listener, 0);
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
> +
> +       /* Now signal we are done and respond with magic */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       req.len = sizeof(req);
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +       EXPECT_EQ(req.pid, pid);
> +
> +       resp.len = sizeof(resp);
> +       resp.id = req.id;
> +       resp.error = 0;
> +       resp.val = USER_NOTIF_MAGIC;
> +
> +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +       close(listener);
> +}
> +
> +/*
> + * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
> + * invalid.
> + */
> +TEST(user_notification_sibling_pid_ns)
> +{
> +       pid_t pid, pid2;
> +       int status, listener;
> +       int sk_pair[2];
> +       char c;
> +       struct seccomp_notif req = {};
> +       struct seccomp_notif_resp resp = {};
> +
> +       ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
> +
> +       pid = fork();
> +       ASSERT_GE(pid, 0);
> +
> +       if (pid == 0) {
> +               int child_pair[2];
> +
> +               ASSERT_EQ(unshare(CLONE_NEWPID), 0);
> +
> +               ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
> +
> +               pid2 = fork();
> +               ASSERT_GE(pid2, 0);
> +
> +               if (pid2 == 0) {
> +                       close(child_pair[0]);
> +                       EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
> +
> +                       /* Signal we're ready and have installed the filter. */
> +                       EXPECT_EQ(write(child_pair[1], "J", 1), 1);
> +
> +                       EXPECT_EQ(read(child_pair[1], &c, 1), 1);
> +                       EXPECT_EQ(c, 'H');
> +
> +                       exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
> +               }
> +
> +               /* check that child has installed the filter */
> +               EXPECT_EQ(read(child_pair[0], &c, 1), 1);
> +               EXPECT_EQ(c, 'J');
> +
> +               /* tell parent who child is */
> +               EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +               /* parent has installed listener, tell child to call syscall */
> +               EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
> +               EXPECT_EQ(c, 'H');
> +               EXPECT_EQ(write(child_pair[0], "H", 1), 1);
> +
> +               EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +               EXPECT_EQ(true, WIFEXITED(status));
> +               EXPECT_EQ(0, WEXITSTATUS(status));
> +               exit(WEXITSTATUS(status));
> +       }
> +
> +       EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
> +
> +       EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
> +       EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
> +       listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
> +       EXPECT_GE(listener, 0);
> +       EXPECT_EQ(errno, 0);
> +       EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
> +
> +       /* Create the sibling ns, and sibling in it. */
> +       EXPECT_EQ(unshare(CLONE_NEWPID), 0);
> +       EXPECT_EQ(errno, 0);
> +
> +       pid2 = fork();
> +       EXPECT_GE(pid2, 0);
> +
> +       if (pid2 == 0) {
> +               req.len = sizeof(req);
> +               ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> +               /*
> +                * The pid should be 0, i.e. the task is in some namespace that
> +                * we can't "see".
> +                */
> +               ASSERT_EQ(req.pid, 0);
> +
> +               resp.len = sizeof(resp);
> +               resp.id = req.id;
> +               resp.error = 0;
> +               resp.val = USER_NOTIF_MAGIC;
> +
> +               ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> +               exit(0);
> +       }
> +
> +       close(listener);
> +
> +       /* Now signal we are done setting up sibling listener. */
> +       EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
> +
> +       EXPECT_EQ(waitpid(pid, &status, 0), pid);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +
> +       EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
> +       EXPECT_EQ(true, WIFEXITED(status));
> +       EXPECT_EQ(0, WEXITSTATUS(status));
> +}
> +
> +
>  /*
>   * TODO:
>   * - add microbenchmarks
> --
> 2.17.1
>

Looking good!

-Kees
Jann Horn Sept. 27, 2018, 9:51 p.m. UTC | #2
+Christoph Hellwig, Al Viro, fsdevel: For two questions about the poll
interface (search for "seccomp_notify_poll" and
"seccomp_notify_release" in the patch)

@Tycho: FYI, I've gone through all of v7 now, apart from the
test/sample code. So don't wait for more comments from me before
sending out v8.

On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Note that in that case, the trusted runtime needs to be in the same
mount namespace as the container. mount() doesn't work on the mount
structure of a foreign mount namespace; check_mnt() specifically
checks for this case, and I think pretty much everything in
sys_mount() uses that check. So you'd have to join the container's
mount namespace before forwarding a mount syscall.
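
Roughly (just a sketch; `container_init_pid` is a stand-in, and the
privilege checks plus the need for a single-threaded helper are glossed
over here):

        /* sketch: join the container's mount namespace before replaying
         * the intercepted mount() on its behalf
         */
        char path[64];
        snprintf(path, sizeof(path), "/proc/%d/ns/mnt", container_init_pid);
        int nsfd = open(path, O_RDONLY | O_CLOEXEC);
        if (nsfd < 0 || setns(nsfd, CLONE_NEWNS) < 0)
                err(1, "joining container mount namespace");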

> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here,

Actually, it doesn't, right? It would apply if you told the kernel "go
ahead, that syscall is fine", but that's not how the API works - you
always intercept the syscall, copy argument data to a trusted tracer,
and then the tracer can make a replacement syscall. Sounds fine to me.

> but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
[...]
> diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
[...]
> +which (on success) will return a listener fd for the filter, which can then be
> +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> +acquired via:
> +
> +.. code-block::
> +
> +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);

The manpage documents ptrace() as taking four arguments, not three. I
know that the header defines it with varargs, but it would probably be
more useful to require passing in zero as the fourth argument so that
we have a place to stick flags if necessary in the future.
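
Something like this, purely as a sketch of the calling convention:

        /* sketch: pass an explicit 0 as the fourth (flags) argument so it
         * can grow meaning later without breaking existing callers
         */
        int listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0, 0);
        if (listener < 0)
                err(1, "PTRACE_SECCOMP_NEW_LISTENER");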

> +which grabs the 0th filter for some task which the tracer has privilege over.
> +Note that filter fds correspond to a particular filter, and not a particular
> +task. So if this task then forks, notifications from both tasks will appear on
> +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> +so a filter fd can safely have many readers.

Add a note about needing CAP_SYS_ADMIN here? Also, might be useful to
clarify in which direction "nth filter" counts.

> +The interface for a seccomp notification fd consists of two structures:
> +
> +.. code-block::
> +
> +    struct seccomp_notif {
> +        __u16 len;
> +        __u64 id;
> +        pid_t pid;
> +        __u8 signalled;
> +        struct seccomp_data data;
> +    };
> +
> +    struct seccomp_notif_resp {
> +        __u16 len;
> +        __u64 id;
> +        __s32 error;
> +        __s64 val;
> +    };
> +
> +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> +notification fd to receive a ``struct seccomp_notif``, which contains five
> +members: the input length of the structure, a unique-per-filter ``id``, the
> +``pid`` of the task which triggered this request (which may be 0 if the task is
> +in a pid ns not visible from the listener's pid namespace), a flag representing
> +whether or not the notification is a result of a non-fatal signal, and the
> +``data`` passed to seccomp. Userspace can then make a decision based on this
> +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> +indicating what should be returned to userspace. The ``id`` member of ``struct
> +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> +
> +It is worth noting that ``struct seccomp_data`` contains the values of register
> +arguments to the syscall, but does not contain pointers to memory. The task's
> +memory is accessible to suitably privileged tracers via ``ptrace()`` or
> +``/proc/pid/map_files/``.

You probably don't actually want to use /proc/pid/map_files here; you
can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
And while reading memory via ptrace() is possible, the interface is
really ugly (e.g. you can only read data in word-sized chunks), and your
caveat about locking out other ptracers (or getting locked out by
them) applies. I'm not even sure if you could read memory via ptrace
while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
requires the target to be in a __TASK_TRACED state.
The two interfaces you might want to use instead are /proc/$pid/mem
and process_vm_{readv,writev}, which allow you to do nice,
arbitrarily-sized, vectored IO on the memory of another process.
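
For example, a tracer could pull in a pointer argument with something like
this (a sketch; the buffer size and the `req` variable, a struct
seccomp_notif read earlier, are assumptions):

        /* sketch: vectored read of tracee memory keyed off the notification;
         * no ptrace attach needed
         */
        char buf[4096];
        struct iovec local  = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct iovec remote = { .iov_base = (void *)req.data.args[0],
                                .iov_len = sizeof(buf) };
        ssize_t copied = process_vm_readv(req.pid, &local, 1, &remote, 1, 0);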

> However, care should be taken to avoid the TOCTOU
> +mentioned above in this document: all arguments being read from the tracee's
> +memory should be read into the tracer's memory before any policy decisions are
> +made. This allows for an atomic decision on syscall arguments.

Again, I don't really see how you could get this wrong.
[...]
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
[...]
>  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
>  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
>  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
>  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
>  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
>  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> @@ -60,4 +62,29 @@ struct seccomp_data {
>         __u64 args[6];
>  };
>
> +struct seccomp_notif {
> +       __u16 len;
> +       __u64 id;
> +       __u32 pid;
> +       __u8 signaled;
> +       struct seccomp_data data;
> +};
> +
> +struct seccomp_notif_resp {
> +       __u16 len;
> +       __u64 id;
> +       __s32 error;
> +       __s64 val;
> +};
> +
> +#define SECCOMP_IOC_MAGIC              0xF7
> +
> +/* Flags for seccomp notification fd ioctl. */
> +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> +                                       struct seccomp_notif)
> +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> +                                       struct seccomp_notif_resp)
> +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> +                                       __u64)
> +
>  #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fd023ac24e10..fa6fe9756c80 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -33,12 +33,78 @@
>  #endif
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +#include <linux/file.h>
>  #include <linux/filter.h>
>  #include <linux/pid.h>
>  #include <linux/ptrace.h>
>  #include <linux/security.h>
>  #include <linux/tracehook.h>
>  #include <linux/uaccess.h>
> +#include <linux/anon_inodes.h>
> +
> +enum notify_state {
> +       SECCOMP_NOTIFY_INIT,
> +       SECCOMP_NOTIFY_SENT,
> +       SECCOMP_NOTIFY_REPLIED,
> +};
> +
> +struct seccomp_knotif {
> +       /* The struct pid of the task whose filter triggered the notification */
> +       struct task_struct *task;
> +
> +       /* The "cookie" for this request; this is unique for this filter. */
> +       u64 id;
> +
> +       /* Whether or not this task has been given an interruptible signal. */
> +       bool signaled;
> +
> +       /*
> +        * The seccomp data. This pointer is valid the entire time this
> +        * notification is active, since it comes from __seccomp_filter which
> +        * eclipses the entire lifecycle here.
> +        */
> +       const struct seccomp_data *data;
> +
> +       /*
> +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> +        * struct seccomp_knotif is created and starts out in INIT. Once the
> +        * handler reads the notification off of an FD, it transitions to SENT.
> +        * If a signal is received the state transitions back to INIT and
> +        * another message is sent. When the userspace handler replies, state
> +        * transitions to REPLIED.
> +        */
> +       enum notify_state state;
> +
> +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> +       int error;
> +       long val;
> +
> +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> +       struct completion ready;
> +
> +       struct list_head list;
> +};
> +
> +/**
> + * struct notification - container for seccomp userspace notifications. Since
> + * most seccomp filters will not have notification listeners attached and this
> + * structure is fairly large, we store the notification-specific stuff in a
> + * separate structure.
> + *
> + * @request: A semaphore that users of this notification can wait on for
> + *           changes. Actual reads and writes are still controlled with
> + *           filter->notify_lock.
> + * @notify_lock: A lock for all notification-related accesses.

notify_lock is documented here, but is a member of struct
seccomp_filter, not of struct notification.

> + * @next_id: The id of the next request.
> + * @notifications: A list of struct seccomp_knotif elements.
> + * @wqh: A wait queue for poll.
> + */
> +struct notification {
> +       struct semaphore request;
> +       u64 next_id;
> +       struct list_head notifications;
> +       wait_queue_head_t wqh;
> +};
>
>  /**
>   * struct seccomp_filter - container for seccomp BPF programs
> @@ -66,6 +132,8 @@ struct seccomp_filter {
>         bool log;
>         struct seccomp_filter *prev;
>         struct bpf_prog *prog;
> +       struct notification *notif;
> +       struct mutex notify_lock;
>  };
>
>  /* Limit any path through the tree to 256KB worth of instructions. */
> @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
>         if (!sfilter)
>                 return ERR_PTR(-ENOMEM);
>
> +       mutex_init(&sfilter->notify_lock);
>         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
>                                         seccomp_check_filter, save_orig);
>         if (ret < 0) {
[...]
> @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
>  #else
>
>  #ifdef CONFIG_SECCOMP_FILTER
> +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> +{
> +       /* Note: overflow is ok here, the id just needs to be unique */
> +       return filter->notif->next_id++;
> +}
> +
> +static void seccomp_do_user_notification(int this_syscall,
> +                                        struct seccomp_filter *match,
> +                                        const struct seccomp_data *sd)
> +{
> +       int err;
> +       long ret = 0;
> +       struct seccomp_knotif n = {};
> +
> +       mutex_lock(&match->notify_lock);
> +       err = -ENOSYS;
> +       if (!match->notif)
> +               goto out;
> +
> +       n.task = current;
> +       n.state = SECCOMP_NOTIFY_INIT;
> +       n.data = sd;
> +       n.id = seccomp_next_notify_id(match);
> +       init_completion(&n.ready);
> +
> +       list_add(&n.list, &match->notif->notifications);
> +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> +
> +       mutex_unlock(&match->notify_lock);
> +       up(&match->notif->request);
> +
> +       err = wait_for_completion_interruptible(&n.ready);
> +       mutex_lock(&match->notify_lock);
> +
> +       /*
> +        * Here it's possible we got a signal and then had to wait on the mutex
> +        * while the reply was sent, so let's be sure there wasn't a response
> +        * in the meantime.
> +        */
> +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> +               /*
> +                * We got a signal. Let's tell userspace about it (potentially
> +                * again, if we had already notified them about the first one).
> +                */
> +               n.signaled = true;
> +               if (n.state == SECCOMP_NOTIFY_SENT) {
> +                       n.state = SECCOMP_NOTIFY_INIT;
> +                       up(&match->notif->request);
> +               }

Do you need another wake_up_poll() here?

> +               mutex_unlock(&match->notify_lock);
> +               err = wait_for_completion_killable(&n.ready);
> +               mutex_lock(&match->notify_lock);
> +               if (err < 0)
> +                       goto remove_list;

Add a comment here explaining that we intentionally leave the
semaphore count too high (because otherwise we'd have to block), and
seccomp_notify_recv() compensates for that?

> +       }
> +
> +       ret = n.val;
> +       err = n.error;
> +
> +remove_list:
> +       list_del(&n.list);
> +out:
> +       mutex_unlock(&match->notify_lock);
> +       syscall_set_return_value(current, task_pt_regs(current),
> +                                err, ret);
> +}
> +
>  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
>                             const bool recheck_after_trace)
>  {
[...]
>  #ifdef CONFIG_SECCOMP_FILTER
> +static struct file *init_listener(struct task_struct *,
> +                                 struct seccomp_filter *);
> +
>  /**
>   * seccomp_set_mode_filter: internal function for setting seccomp filter
>   * @flags:  flags to change filter behavior
> @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
>         struct seccomp_filter *prepared = NULL;
>         long ret = -EINVAL;
> +       int listener = 0;
> +       struct file *listener_f = NULL;
>
>         /* Validate flags. */
>         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         if (IS_ERR(prepared))
>                 return PTR_ERR(prepared);
>
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               listener = get_unused_fd_flags(0);
> +               if (listener < 0) {
> +                       ret = listener;
> +                       goto out_free;
> +               }
> +
> +               listener_f = init_listener(current, prepared);
> +               if (IS_ERR(listener_f)) {
> +                       put_unused_fd(listener);
> +                       ret = PTR_ERR(listener_f);
> +                       goto out_free;
> +               }
> +       }
> +
>         /*
>          * Make sure we cannot change seccomp or nnp state via TSYNC
>          * while another thread is in the middle of calling exec.
>          */
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
>             mutex_lock_killable(&current->signal->cred_guard_mutex))
> -               goto out_free;
> +               goto out_put_fd;
>
>         spin_lock_irq(&current->sighand->siglock);
>
> @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
>         spin_unlock_irq(&current->sighand->siglock);
>         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
>                 mutex_unlock(&current->signal->cred_guard_mutex);
> +out_put_fd:
> +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> +               if (ret < 0) {
> +                       fput(listener_f);
> +                       put_unused_fd(listener);
> +               } else {
> +                       fd_install(listener, listener_f);
> +                       ret = listener;
> +               }
> +       }
>  out_free:
>         seccomp_filter_free(prepared);
>         return ret;
[...]
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +static int seccomp_notify_release(struct inode *inode, struct file *file)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       struct seccomp_knotif *knotif;
> +
> +       mutex_lock(&filter->notify_lock);
> +
> +       /*
> +        * If this file is being closed because e.g. the task who owned it
> +        * died, let's wake everyone up who was waiting on us.
> +        */
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> +                       continue;
> +
> +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> +               knotif->error = -ENOSYS;
> +               knotif->val = 0;
> +
> +               complete(&knotif->ready);
> +       }
> +
> +       wake_up_all(&filter->notif->wqh);

If select() is polling us, a reference to the open file is being held,
and this can't be reached; and I think if epoll is polling us,
eventpoll_release() will remove itself from the wait queue, right? So
can this wake_up_all() actually ever notify anyone?

> +       kfree(filter->notif);
> +       filter->notif = NULL;
> +       mutex_unlock(&filter->notify_lock);
> +       __put_seccomp_filter(filter);
> +       return 0;
> +}
> +
> +static long seccomp_notify_recv(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL, *cur;
> +       struct seccomp_notif unotif = {};
> +       ssize_t ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;
> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +
> +       ret = down_interruptible(&filter->notif->request);
> +       if (ret < 0)
> +               return ret;
> +
> +       mutex_lock(&filter->notify_lock);
> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> +                       knotif = cur;
> +                       break;
> +               }
> +       }
> +
> +       /*
> +        * If we didn't find a notification, it could be that the task was
> +        * interrupted between the time we were woken and when we were able to

s/interrupted/interrupted by a fatal signal/ ?

> +        * acquire the rw lock.

State more explicitly here that we are compensating for an incorrectly
high semaphore count?

> +        */
> +       if (!knotif) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       size = min_t(size_t, size, sizeof(unotif));
> +
> +       unotif.len = size;
> +       unotif.id = knotif->id;
> +       unotif.pid = task_pid_vnr(knotif->task);
> +       unotif.signaled = knotif->signaled;
> +       unotif.data = *(knotif->data);
> +
> +       if (copy_to_user(buf, &unotif, size)) {
> +               ret = -EFAULT;
> +               goto out;
> +       }
> +
> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_SENT;
> +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> +
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_send(struct seccomp_filter *filter,
> +                               unsigned long arg)
> +{
> +       struct seccomp_notif_resp resp = {};
> +       struct seccomp_knotif *knotif = NULL;
> +       long ret;
> +       u16 size;
> +       void __user *buf = (void __user *)arg;
> +
> +       if (copy_from_user(&size, buf, sizeof(size)))
> +               return -EFAULT;
> +       size = min_t(size_t, size, sizeof(resp));
> +       if (copy_from_user(&resp, buf, size))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == resp.id)
> +                       break;
> +       }
> +
> +       if (!knotif || knotif->id != resp.id) {

Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
NULL here. If `filter->notif->notifications` is empty, I think
`knotif` will be `container_of(&filter->notif->notifications, struct
seccomp_knotif, list)` - in other words, you'll have a type confusion,
and `knotif` probably points into some random memory in front of
`filter->notif`.

Am I missing something?

> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       /* Allow exactly one reply. */
> +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> +               ret = -EINPROGRESS;
> +               goto out;
> +       }

This means that if seccomp_do_user_notification() has in the meantime
received a signal and transitioned from SENT back to INIT, this will
fail, right? So we fail here, then we read the new notification, and
then we can retry SECCOMP_NOTIF_SEND? Is that intended?

> +       ret = size;
> +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> +       knotif->error = resp.error;
> +       knotif->val = resp.val;
> +       complete(&knotif->ready);
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
> +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> +                                   unsigned long arg)
> +{
> +       struct seccomp_knotif *knotif = NULL;
> +       void __user *buf = (void __user *)arg;
> +       u64 id;
> +       long ret;
> +
> +       if (copy_from_user(&id, buf, sizeof(id)))
> +               return -EFAULT;
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;
> +
> +       ret = -1;

In strace, this is going to show up as EPERM. Maybe use something like
-ENOENT instead? Or whatever you think resembles a fitting error
number.

> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> +               if (knotif->id == id) {
> +                       ret = 0;

Would it make sense to treat notifications that have already been
replied to as invalid?

> +                       goto out;
> +               }
> +       }
> +
> +out:
> +       mutex_unlock(&filter->notify_lock);
> +       return ret;
> +}
> +
[...]
> +static __poll_t seccomp_notify_poll(struct file *file,
> +                                   struct poll_table_struct *poll_tab)
> +{
> +       struct seccomp_filter *filter = file->private_data;
> +       __poll_t ret = 0;
> +       struct seccomp_knotif *cur;
> +
> +       poll_wait(file, &filter->notif->wqh, poll_tab);
> +
> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> +       if (ret < 0)
> +               return ret;

Looking at the callers of vfs_poll(), as far as I can tell, a poll
handler is not allowed to return error codes. Perhaps someone who
knows the poll interface better can weigh in here. I've CCed some
people who should hopefully know better how this stuff works.
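
If error returns really aren't allowed, one way out (just a sketch,
assuming a non-interruptible lock is acceptable in the poll path) would be:

        /* sketch: take the lock unconditionally so the handler only ever
         * returns an event mask, never a negative error
         */
        mutex_lock(&filter->notify_lock);
        /* ... build the EPOLLIN/EPOLLOUT mask as in the loop below ... */
        mutex_unlock(&filter->notify_lock);
        return ret;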

> +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> +               if (cur->state == SECCOMP_NOTIFY_INIT)
> +                       ret |= EPOLLIN | EPOLLRDNORM;
> +               if (cur->state == SECCOMP_NOTIFY_SENT)
> +                       ret |= EPOLLOUT | EPOLLWRNORM;
> +               if (ret & EPOLLIN && ret & EPOLLOUT)
> +                       break;
> +       }
> +
> +       mutex_unlock(&filter->notify_lock);
> +
> +       return ret;
> +}
> +
> +static const struct file_operations seccomp_notify_ops = {
> +       .poll = seccomp_notify_poll,
> +       .release = seccomp_notify_release,
> +       .unlocked_ioctl = seccomp_notify_ioctl,
> +};
> +
> +static struct file *init_listener(struct task_struct *task,
> +                                 struct seccomp_filter *filter)
> +{

Why does this function take a `task` pointer instead of always
accessing `current`? If `task` actually wasn't `current`, I would have
concurrency concerns. A comment in seccomp.h even explains:

 *          @filter must only be accessed from the context of current as there
 *          is no read locking.

Unless there's a good reason for it, I would prefer it if this
function didn't take a `task` pointer.

> +       struct file *ret = ERR_PTR(-EBUSY);
> +       struct seccomp_filter *cur, *last_locked = NULL;
> +       int filter_nesting = 0;
> +
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> +               filter_nesting++;
> +               last_locked = cur;
> +               if (cur->notif)
> +                       goto out;
> +       }
> +
> +       ret = ERR_PTR(-ENOMEM);
> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);

sizeof(struct notification) instead, to make the code clearer?

> +       if (!filter->notif)
> +               goto out;
> +
> +       sema_init(&filter->notif->request, 0);
> +       INIT_LIST_HEAD(&filter->notif->notifications);
> +       filter->notif->next_id = get_random_u64();
> +       init_waitqueue_head(&filter->notif->wqh);

Nit: next_id and notifications are declared in reverse order in the
struct. Could you flip them around here?

> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> +                                filter, O_RDWR);
> +       if (IS_ERR(ret))
> +               goto out;
> +
> +
> +       /* The file has a reference to it now */
> +       __get_seccomp_filter(filter);

__get_seccomp_filter() has a comment in it that claims "/* Reference
count is bounded by the number of total processes. */". I think this
change invalidates that comment. I think it should be fine to just
remove the comment.

> +out:
> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {

s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
here, something went very wrong.

> +               mutex_unlock(&cur->notify_lock);
> +               if (cur == last_locked)
> +                       break;
> +       }
> +
> +       return ret;
> +}
> +#endif
Kees Cook Sept. 27, 2018, 10:45 p.m. UTC | #3
On Thu, Sep 27, 2018 at 2:51 PM, Jann Horn <jannh@google.com> wrote:
> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
>> However, care should be taken to avoid the TOCTOU
>> +mentioned above in this document: all arguments being read from the tracee's
>> +memory should be read into the tracer's memory before any policy decisions are
>> +made. This allows for an atomic decision on syscall arguments.
>
> Again, I don't really see how you could get this wrong.

Doesn't hurt to mention it, IMO.

>> +static long seccomp_notify_send(struct seccomp_filter *filter,
>> +                               unsigned long arg)
>> +{
>> +       struct seccomp_notif_resp resp = {};
>> +       struct seccomp_knotif *knotif = NULL;
>> +       long ret;
>> +       u16 size;
>> +       void __user *buf = (void __user *)arg;
>> +
>> +       if (copy_from_user(&size, buf, sizeof(size)))
>> +               return -EFAULT;
>> +       size = min_t(size_t, size, sizeof(resp));
>> +       if (copy_from_user(&resp, buf, size))
>> +               return -EFAULT;
>> +
>> +       ret = mutex_lock_interruptible(&filter->notify_lock);
>> +       if (ret < 0)
>> +               return ret;
>> +
>> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
>> +               if (knotif->id == resp.id)
>> +                       break;
>> +       }
>> +
>> +       if (!knotif || knotif->id != resp.id) {
>
> Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> NULL here. If `filter->notif->notifications` is empty, I think
> `knotif` will be `container_of(&filter->notif->notifications, struct
> seccomp_knotif, list)` - in other words, you'll have a type confusion,
> and `knotif` probably points into some random memory in front of
> `filter->notif`.
>
> Am I missing something?

Oh, good catch. This just needs to be fixed like it's done in
seccomp_notify_recv() (separate cur and knotif).
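
Roughly (sketch of just the lookup):

        struct seccomp_knotif *knotif = NULL, *cur;

        /* walk with a cursor and only latch a match, so knotif stays NULL
         * when no notification carries the requested id
         */
        list_for_each_entry(cur, &filter->notif->notifications, list) {
                if (cur->id == resp.id) {
                        knotif = cur;
                        break;
                }
        }

        if (!knotif) {
                ret = -ENOENT;
                goto out;
        }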

>> +static struct file *init_listener(struct task_struct *task,
>> +                                 struct seccomp_filter *filter)
>> +{
>
> Why does this function take a `task` pointer instead of always
> accessing `current`? If `task` actually wasn't `current`, I would have
> concurrency concerns. A comment in seccomp.h even explains:
>
>  *          @filter must only be accessed from the context of current as there
>  *          is no read locking.
>
> Unless there's a good reason for it, I would prefer it if this
> function didn't take a `task` pointer.

This is to support PTRACE_SECCOMP_NEW_LISTENER.

But you make an excellent point. Even TSYNC expects to operate only on
the current thread group. Hmm.

While the process is stopped by ptrace, we could, in theory, update
task->seccomp.filter via something like TSYNC.

So perhaps use:

mutex_lock_killable(&task->signal->cred_guard_mutex);

before walking the notify_locks?
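
Roughly (a sketch; assumes returning -EINTR from init_listener() is fine):

        if (mutex_lock_killable(&task->signal->cred_guard_mutex))
                return ERR_PTR(-EINTR);
        /* ... walk task->seccomp.filter, taking each notify_lock ... */
        mutex_unlock(&task->signal->cred_guard_mutex);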

>
>> +       struct file *ret = ERR_PTR(-EBUSY);
>> +       struct seccomp_filter *cur, *last_locked = NULL;
>> +       int filter_nesting = 0;
>> +
>> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
>> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
>> +               filter_nesting++;
>> +               last_locked = cur;
>> +               if (cur->notif)
>> +                       goto out;
>> +       }
>> +
>> +       ret = ERR_PTR(-ENOMEM);
>> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
>
> sizeof(struct notification) instead, to make the code clearer?

I prefer what Tycho has: I want to allocate an instance of whatever
filter->notif is.

Though, let's do the kzalloc outside of the locking, instead?

>> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
>> +                                filter, O_RDWR);
>> +       if (IS_ERR(ret))
>> +               goto out;
>> +
>> +
>> +       /* The file has a reference to it now */
>> +       __get_seccomp_filter(filter);
>
> __get_seccomp_filter() has a comment in it that claims "/* Reference
> count is bounded by the number of total processes. */". I think this
> change invalidates that comment. I think it should be fine to just
> remove the comment.

Update it to "bounded by total processes and notification listeners"?

>> +out:
>> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
>
> s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> here, something went very wrong.

Hm? This is correct. This is how seccomp_run_filters() walks the list too:

        struct seccomp_filter *f =
                        READ_ONCE(current->seccomp.filter);
        ...
        for (; f; f = f->prev) {

Especially if we'll be holding the cred_guard_mutex.

-Kees
Tycho Andersen Sept. 27, 2018, 10:48 p.m. UTC | #4
On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> >
> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here, but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> >
> > v2: * make id a u64; the idea here being that it will never overflow,
> >       because 64 is huge (one syscall every nanosecond => wrap every 584
> >       years) (Andy)
> >     * prevent nesting of user notifications: if someone is already attached
> >       the tree in one place, nobody else can attach to the tree (Andy)
> >     * notify the listener of signals the tracee receives as well (Andy)
> >     * implement poll
> > v3: * lockdep fix (Oleg)
> >     * drop unnecessary WARN()s (Christian)
> >     * rearrange error returns to be more rpetty (Christian)
> >     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> > v4: * fix implementation of poll to use poll_wait() (Jann)
> >     * change listener's fd flags to be 0 (Jann)
> >     * hoist filter initialization out of ifdefs to its own function
> >       init_user_notification()
> >     * add some more testing around poll() and closing the listener while a
> >       syscall is in action
> >     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
> >       creates a new one (Matthew)
> >     * correctly handle pid namespaces, add some testcases (Matthew)
> >     * use EINPROGRESS instead of EINVAL when a notification response is
> >       written twice (Matthew)
> >     * fix comment typo from older version (SEND vs READ) (Matthew)
> >     * whitespace and logic simplification (Tobin)
> >     * add some Documentation/ bits on userspace trapping
> > v5: * fix documentation typos (Jann)
> >     * add signalled field to struct seccomp_notif (Jann)
> >     * switch to using ioctls instead of read()/write() for struct passing
> >       (Jann)
> >     * add an ioctl to ensure an id is still valid
> > v6: * docs typo fixes, update docs for ioctl() change (Christian)
> > v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
> >     * use notify_lock in IS_ID_VALID query to avoid racing
> >     * s/signalled/signaled (Tyler)
> >     * fix docs to reflect that ids are not globally unique (Tyler)
> >     * add a test to check -ERESTARTSYS behavior (Tyler)
> >     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
> >     * reorder USER_NOTIF in seccomp return codes list (Tyler)
> >     * return size instead of sizeof(struct user_notif) (Tyler)
> >     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
> >     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
> >     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
> >     * add a new struct notification to minimize the additions to
> >       struct seccomp_filter, also pack the necessary additions a bit more
> >       cleverly (Tyler)
> >     * switch to keeping track of the task itself instead of the pid (we'll
> >       use this for implementing PUT_FD)
> 
> Patch-sending nit: can you put the versioning below the "---" line so
> it isn't included in the final commit? (And I normally read these
> backwards, so I'd expect v7 at the top, but that's not a big deal. I
> mean... neither is the --- thing, but it makes "git am" easier for me
> since I don't have to go edit the versioning out of the log.)

Sure, will do.

> > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > index 9efc0e73d50b..d4ccb32fe089 100644
> > --- a/include/uapi/linux/seccomp.h
> > +++ b/include/uapi/linux/seccomp.h
> > @@ -17,9 +17,10 @@
> >  #define SECCOMP_GET_ACTION_AVAIL       2
> >
> >  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> > -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> > -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> > +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> > +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)
> 
> Since these are all getting indentation updates, can you switch them
> to BIT(0), BIT(1), etc?

Will do.

> >  /*
> >   * All BPF programs must return a 32-bit value.
> > @@ -35,6 +36,7 @@
> >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> >  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
> >  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >         __u64 args[6];
> >  };
> >
> > +struct seccomp_notif {
> > +       __u16 len;
> > +       __u64 id;
> > +       __u32 pid;
> > +       __u8 signaled;
> > +       struct seccomp_data data;
> > +};
> > +
> > +struct seccomp_notif_resp {
> > +       __u16 len;
> > +       __u64 id;
> > +       __s32 error;
> > +       __s64 val;
> > +};
> 
> So, len has to come first, for versioning. However, since it's ahead
> of a u64, this leaves a struct padding hole. pahole output:
> 
> struct seccomp_notif {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __u32                      pid;                  /*    16     4 */
>         __u8                       signaled;             /*    20     1 */
> 
>         /* XXX 3 bytes hole, try to pack */
> 
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> 
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 79, holes: 2, sum holes: 9 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __s32                      error;                /*    16     4 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         __s64                      val;                  /*    24     8 */
> 
>         /* size: 32, cachelines: 1, members: 4 */
>         /* sum members: 22, holes: 2, sum holes: 10 */
>         /* last cacheline: 32 bytes */
> };
> 
> How about making len u32, and moving pid and error above "id"? This
> leaves a hole after signaled, so changing "len" won't be sufficient
> for versioning here. Perhaps move it after data?

I'm not sure what you mean by "len won't be sufficient for versioning
here"? Anyway, I can do some packing on these; I didn't bother before
since I figured it's a userspace interface, so saving a few bytes
isn't a huge deal.

> > +
> > +#define SECCOMP_IOC_MAGIC              0xF7
> 
> Was there any specific reason for picking this value? There are lots
> of fun ASCII codes left like '!' or '*'. :)

No, ! it is :)

> > +
> > +/* Flags for seccomp notification fd ioctl. */
> > +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> > +                                       struct seccomp_notif)
> > +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> > +                                       struct seccomp_notif_resp)
> > +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> > +                                       __u64)
> 
> To match other UAPI ioctls, can these have a prefix of "SECCOMP_IOCTL_..."?
> 
> It may also be useful to match how other uapis do this, like for DRM:
> 
> #define DRM_IOCTL_BASE                  'd'
> #define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
> #define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
> #define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
> #define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)
> 
> #define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
> #define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
> #define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
> ...

Will do.

> 
> > +
> >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index fd023ac24e10..fa6fe9756c80 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -33,12 +33,78 @@
> >  #endif
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +#include <linux/file.h>
> >  #include <linux/filter.h>
> >  #include <linux/pid.h>
> >  #include <linux/ptrace.h>
> >  #include <linux/security.h>
> >  #include <linux/tracehook.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/anon_inodes.h>
> > +
> > +enum notify_state {
> > +       SECCOMP_NOTIFY_INIT,
> > +       SECCOMP_NOTIFY_SENT,
> > +       SECCOMP_NOTIFY_REPLIED,
> > +};
> > +
> > +struct seccomp_knotif {
> > +       /* The struct pid of the task whose filter triggered the notification */
> > +       struct task_struct *task;
> > +
> > +       /* The "cookie" for this request; this is unique for this filter. */
> > +       u64 id;
> > +
> > +       /* Whether or not this task has been given an interruptible signal. */
> > +       bool signaled;
> > +
> > +       /*
> > +        * The seccomp data. This pointer is valid the entire time this
> > +        * notification is active, since it comes from __seccomp_filter which
> > +        * eclipses the entire lifecycle here.
> > +        */
> > +       const struct seccomp_data *data;
> > +
> > +       /*
> > +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > +        * struct seccomp_knotif is created and starts out in INIT. Once the
> > +        * handler reads the notification off of an FD, it transitions to SENT.
> > +        * If a signal is received the state transitions back to INIT and
> > +        * another message is sent. When the userspace handler replies, state
> > +        * transitions to REPLIED.
> > +        */
> > +       enum notify_state state;
> > +
> > +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > +       int error;
> > +       long val;
> > +
> > +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > +       struct completion ready;
> > +
> > +       struct list_head list;
> > +};
> > +
> > +/**
> > + * struct notification - container for seccomp userspace notifications. Since
> > + * most seccomp filters will not have notification listeners attached and this
> > + * structure is fairly large, we store the notification-specific stuff in a
> > + * separate structure.
> > + *
> > + * @request: A semaphore that users of this notification can wait on for
> > + *           changes. Actual reads and writes are still controlled with
> > + *           filter->notify_lock.
> > + * @notify_lock: A lock for all notification-related accesses.
> > + * @next_id: The id of the next request.
> > + * @notifications: A list of struct seccomp_knotif elements.
> > + * @wqh: A wait queue for poll.
> > + */
> > +struct notification {
> > +       struct semaphore request;
> > +       u64 next_id;
> > +       struct list_head notifications;
> > +       wait_queue_head_t wqh;
> > +};
> >
> >  /**
> >   * struct seccomp_filter - container for seccomp BPF programs
> > @@ -66,6 +132,8 @@ struct seccomp_filter {
> >         bool log;
> >         struct seccomp_filter *prev;
> >         struct bpf_prog *prog;
> > +       struct notification *notif;
> > +       struct mutex notify_lock;
> >  };
> >
> >  /* Limit any path through the tree to 256KB worth of instructions. */
> > @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> >         if (!sfilter)
> >                 return ERR_PTR(-ENOMEM);
> >
> > +       mutex_init(&sfilter->notify_lock);
> >         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
> >                                         seccomp_check_filter, save_orig);
> >         if (ret < 0) {
> > @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
> >  #define SECCOMP_LOG_TRACE              (1 << 4)
> >  #define SECCOMP_LOG_LOG                        (1 << 5)
> >  #define SECCOMP_LOG_ALLOW              (1 << 6)
> > +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
> >
> >  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
> >                                     SECCOMP_LOG_KILL_THREAD  |
> >                                     SECCOMP_LOG_TRAP  |
> >                                     SECCOMP_LOG_ERRNO |
> > +                                   SECCOMP_LOG_USER_NOTIF |
> >                                     SECCOMP_LOG_TRACE |
> >                                     SECCOMP_LOG_LOG;
> >
> > @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> >         case SECCOMP_RET_TRACE:
> >                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
> >                 break;
> > +       case SECCOMP_RET_USER_NOTIF:
> > +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> > +               break;
> >         case SECCOMP_RET_LOG:
> >                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
> >                 break;
> > @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
> >  #else
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > +{
> > +       /* Note: overflow is ok here, the id just needs to be unique */
> 
> Maybe just clarify in the comment: unique to the filter.
> 
> > +       return filter->notif->next_id++;
> 
> Also, it might be useful to add for both documentation and lockdep:
> 
> lockdep_assert_held(&filter->notify_lock);
> 
> into this function?

Will do.

> 
> > +}
> > +
> > +static void seccomp_do_user_notification(int this_syscall,
> > +                                        struct seccomp_filter *match,
> > +                                        const struct seccomp_data *sd)
> > +{
> > +       int err;
> > +       long ret = 0;
> > +       struct seccomp_knotif n = {};
> > +
> > +       mutex_lock(&match->notify_lock);
> > +       err = -ENOSYS;
> > +       if (!match->notif)
> > +               goto out;
> > +
> > +       n.task = current;
> > +       n.state = SECCOMP_NOTIFY_INIT;
> > +       n.data = sd;
> > +       n.id = seccomp_next_notify_id(match);
> > +       init_completion(&n.ready);
> > +
> > +       list_add(&n.list, &match->notif->notifications);
> > +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > +
> > +       mutex_unlock(&match->notify_lock);
> > +       up(&match->notif->request);
> > +
> 
> Maybe add a big comment here saying this is where we're waiting for a reply?

Will do.

> > +       err = wait_for_completion_interruptible(&n.ready);
> > +       mutex_lock(&match->notify_lock);
> > +
> > +       /*
> > +        * Here it's possible we got a signal and then had to wait on the mutex
> > +        * while the reply was sent, so let's be sure there wasn't a response
> > +        * in the meantime.
> > +        */
> > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +               /*
> > +                * We got a signal. Let's tell userspace about it (potentially
> > +                * again, if we had already notified them about the first one).
> > +                */
> > +               n.signaled = true;
> > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > +                       n.state = SECCOMP_NOTIFY_INIT;
> > +                       up(&match->notif->request);
> > +               }
> > +               mutex_unlock(&match->notify_lock);
> > +               err = wait_for_completion_killable(&n.ready);
> > +               mutex_lock(&match->notify_lock);
> > +               if (err < 0)
> > +                       goto remove_list;
> > +       }
> > +
> > +       ret = n.val;
> > +       err = n.error;
> > +
> > +remove_list:
> > +       list_del(&n.list);
> > +out:
> > +       mutex_unlock(&match->notify_lock);
> > +       syscall_set_return_value(current, task_pt_regs(current),
> > +                                err, ret);
> > +}
> > +
> >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >                             const bool recheck_after_trace)
> >  {
> > @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >
> >                 return 0;
> >
> > +       case SECCOMP_RET_USER_NOTIF:
> > +               seccomp_do_user_notification(this_syscall, match, sd);
> > +               goto skip;
> 
> Nit: please add a blank line here (to match the other cases).
> 
> >         case SECCOMP_RET_LOG:
> >                 seccomp_log(this_syscall, 0, action, true);
> >                 return 0;
> > @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
> >  }
> >
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > +                                 struct seccomp_filter *);
> 
> Why is the forward declaration needed instead of just moving the
> function here? I didn't see anything in it that looked like it
> couldn't move.

I think there was a cycle in some earlier version, but I agree there
isn't now. I'll fix it.

> > +
> >  /**
> >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> >   * @flags:  flags to change filter behavior
> > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> >         struct seccomp_filter *prepared = NULL;
> >         long ret = -EINVAL;
> > +       int listener = 0;
> 
> Nit: "invalid fd" should be -1, not 0.
> 
> > +       struct file *listener_f = NULL;
> >
> >         /* Validate flags. */
> >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               listener = get_unused_fd_flags(0);
> 
> As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?

Yep, will do.

> > +               if (listener < 0) {
> > +                       ret = listener;
> > +                       goto out_free;
> > +               }
> > +
> > +               listener_f = init_listener(current, prepared);
> > +               if (IS_ERR(listener_f)) {
> > +                       put_unused_fd(listener);
> > +                       ret = PTR_ERR(listener_f);
> > +                       goto out_free;
> > +               }
> > +       }
> > +
> >         /*
> >          * Make sure we cannot change seccomp or nnp state via TSYNC
> >          * while another thread is in the middle of calling exec.
> >          */
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > -               goto out_free;
> > +               goto out_put_fd;
> >
> >         spin_lock_irq(&current->sighand->siglock);
> >
> > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         spin_unlock_irq(&current->sighand->siglock);
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > +out_put_fd:
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               if (ret < 0) {
> > +                       fput(listener_f);
> > +                       put_unused_fd(listener);
> > +               } else {
> > +                       fd_install(listener, listener_f);
> > +                       ret = listener;
> > +               }
> > +       }
> 
> Can you update the kern-docs for seccomp_set_mode_filter(), since we
> can now return positive values?
> 
>  * Returns 0 on success or -EINVAL on failure.
> 
> (this shouldn't say only -EINVAL, I realize too)

Sure, I can fix both of these.

> I have to say, I'm vaguely nervous about changing the semantics here
> for passing back the fd as the return code from the seccomp() syscall.
> Alternatives seem less appealing, though: changing the meaning of the
> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> example. Hmm.

From my perspective we can drop this whole thing. The only thing I'll
ever use is the ptrace version. Someone at some point (I don't
remember who, maybe stgraber) suggested this version would be useful
as well.

Anyway, let me know if your nervousness outweighs this, I'm happy to
drop it.

> > @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
> >  device_initcall(seccomp_sysctl_init)
> >
> >  #endif /* CONFIG_SYSCTL */
> > +
> > +#ifdef CONFIG_SECCOMP_FILTER
> > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_knotif *knotif;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +
> > +       /*
> > +        * If this file is being closed because e.g. the task who owned it
> > +        * died, let's wake everyone up who was waiting on us.
> > +        */
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > +                       continue;
> > +
> > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +               knotif->error = -ENOSYS;
> > +               knotif->val = 0;
> > +
> > +               complete(&knotif->ready);
> > +       }
> > +
> > +       wake_up_all(&filter->notif->wqh);
> > +       kfree(filter->notif);
> > +       filter->notif = NULL;
> > +       mutex_unlock(&filter->notify_lock);
> 
> It looks like that means nothing waiting on knotif->ready can access
> filter->notif without rechecking it, yes?
> 
> e.g. in seccomp_do_user_notification() I see:
> 
>                         up(&match->notif->request);
> 
> I *think* this isn't reachable due to the test for n.state !=
> SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
> it's not fast-path, we could add a WARN_ON() while checking for
> unreplied signal death?
> 
>                 n.signaled = true;
>                 if (n.state == SECCOMP_NOTIFY_SENT) {
>                         n.state = SECCOMP_NOTIFY_INIT;
>                         if (!WARN_ON(match->notif))
>                             up(&match->notif->request);
>                 }
>                 mutex_unlock(&match->notify_lock);

So this code path should actually be safe, since notify_lock is held
throughout, as it is in the release handler. However, there is one just above
it that is not, because we do:

        mutex_unlock(&match->notify_lock);
        up(&match->notif->request);

When this was all a member of struct seccomp_filter the order didn't matter,
but now it very much does, and I think you're right that these statements need
to be reordered. There may be others; I'll check everything else as well.
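
One possible reordering, as a sketch (the wider audit still needs to happen):

        /* up() while notify_lock is still held, so the release handler
         * cannot free ->notif out from under us */
        up(&match->notif->request);
        mutex_unlock(&match->notify_lock);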

> 
> > +       __put_seccomp_filter(filter);
> > +       return 0;
> > +}
> > +
> > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL, *cur;
> > +       struct seccomp_notif unotif = {};
> > +       ssize_t ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> 
> I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
> anything from accidentally using "arg" directly here.

Will do.
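
For reference, a sketch of the dispatcher with the cast hoisted (the three
helpers would then take a void __user * instead of the raw unsigned long):

static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
                                 unsigned long arg)
{
        struct seccomp_filter *filter = file->private_data;
        void __user *buf = (void __user *)arg;

        switch (cmd) {
        case SECCOMP_NOTIF_RECV:
                return seccomp_notify_recv(filter, buf);
        case SECCOMP_NOTIF_SEND:
                return seccomp_notify_send(filter, buf);
        case SECCOMP_NOTIF_ID_VALID:
                return seccomp_notify_id_valid(filter, buf);
        default:
                return -EINVAL;
        }
}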

> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +
> > +       ret = down_interruptible(&filter->notif->request);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > +                       knotif = cur;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If we didn't find a notification, it could be that the task was
> > +        * interrupted between the time we were woken and when we were able to
> > +        * acquire the rw lock.
> > +        */
> > +       if (!knotif) {
> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       size = min_t(size_t, size, sizeof(unotif));
> > +
> 
> It is possible (though unlikely given the type widths involved here)
> for unotif = {} to not initialize padding, so I would recommend an
> explicit memset(&unotif, 0, sizeof(unotif)) here.

Orly? I didn't know that, thanks.

> > +       unotif.len = size;
> > +       unotif.id = knotif->id;
> > +       unotif.pid = task_pid_vnr(knotif->task);
> > +       unotif.signaled = knotif->signaled;
> > +       unotif.data = *(knotif->data);
> > +
> > +       if (copy_to_user(buf, &unotif, size)) {
> > +               ret = -EFAULT;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> 
> Is there some way to rearrange the locking here to avoid holding the
> mutex while doing copy_to_user() (which userspace could block with
> userfaultfd, and then stall all the other notifications for this
> filter)?

Yes, I don't think it'll cause any problems to release the lock earlier.
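Roughly, the idea would be (sketch only; rolling the state back if the copy
faults is omitted here):

        /* fill in unotif and flip the state while holding notify_lock ... */
        knotif->state = SECCOMP_NOTIFY_SENT;
        wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
        mutex_unlock(&filter->notify_lock);

        /* ... and only touch userspace memory afterwards */
        if (copy_to_user(buf, &unotif, size))
                return -EFAULT;

        return size;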

> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> 
> Same cast note as above.
> 
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +       size = min_t(size_t, size, sizeof(resp));
> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> 
> For sanity checking on a double-read from userspace, please add:
> 
>     if (resp.len != size)
>         return -EINVAL;

Won't that fail if sizeof(resp) < resp.len, because of the min_t()?

> > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > +                                   unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL;
> > +       void __user *buf = (void __user *)arg;
> > +       u64 id;
> > +       long ret;
> > +
> > +       if (copy_from_user(&id, buf, sizeof(id)))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -1;
> 
> Isn't this EPERM? Shouldn't it be -ENOENT?

Yes, I wasn't thinking of errno here, I'll switch it.

> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == id) {
> > +                       ret = 0;
> > +                       goto out;
> > +               }
> > +       }
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> > +                                unsigned long arg)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +
> > +       switch (cmd) {
> > +       case SECCOMP_NOTIF_RECV:
> > +               return seccomp_notify_recv(filter, arg);
> > +       case SECCOMP_NOTIF_SEND:
> > +               return seccomp_notify_send(filter, arg);
> > +       case SECCOMP_NOTIF_ID_VALID:
> > +               return seccomp_notify_id_valid(filter, arg);
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> 
> My eyes! :) Can you wrap the bit operations in parens here?
> 
> > +                       break;
> > +       }
> 
> Should POLLERR be handled here too? I don't quite see the conditions
> that might be exposed? All the processes die for the filter, which
> does what here?

I think it shouldn't do anything, because I was thinking of the semantics of
poll() as "when a tracee does a syscall that matches, fire". So a task could
start, never make a targeted syscall, and exit, and poll() shouldn't return a
value. Maybe it's useful to write that down somewhere, though.
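
So a handler that wants to block would just do the obvious thing; a fragment,
assuming `listener` and `req` as in the selftests:

        struct pollfd pfd = { .fd = listener, .events = POLLIN };

        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN))
                ioctl(listener, SECCOMP_NOTIF_RECV, &req);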

> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> > +
> > +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> > +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> > +
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
> 
> Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
> been wondering what it's for, and now I see it's kind of an advisory
> "is the other end still alive?" test.

Yes, in fact it's necessary for avoiding races. There are some comments in the
sample code, but I'll update seccomp_filter.rst too.
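
A rough sketch of the ordering that avoids the pid-reuse race (illustrative
only; `mem_path` and `memfd` are made-up names, and the sample code has the
authoritative version):

        ioctl(listener, SECCOMP_NOTIF_RECV, &req);
        memfd = open(mem_path, O_RDONLY);  /* mem_path built from req.pid */
        if (ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id) != 0) {
                /* the tracee died; req.pid may already belong to someone else */
                close(memfd);
                return;
        }
        /* memfd is now known to refer to the task that triggered req */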

> > +
> > +       resp.id = req.id;
> > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > +       EXPECT_EQ(ret, -1);
> > +       EXPECT_EQ(errno, ENOENT);
> > +
> > +       /*
> > +        * Check that we get another notification about a signal in the middle
> > +        * of a syscall.
> > +        */
> > +       pid = fork();
> > +       ASSERT_GE(pid, 0);
> > +
> > +       if (pid == 0) {
> > +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> > +                       perror("signal");
> > +                       exit(1);
> > +               }
> > +               ret = syscall(__NR_getpid);
> > +               exit(ret != USER_NOTIF_MAGIC);
> > +       }
> > +
> > +       ret = read_notif(listener, &req);
> > +       EXPECT_EQ(ret, sizeof(req));
> > +       EXPECT_EQ(errno, 0);
> > +
> > +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > +
> > +       ret = read_notif(listener, &req);
> > +       EXPECT_EQ(req.signaled, 1);
> > +       EXPECT_EQ(ret, sizeof(req));
> > +       EXPECT_EQ(errno, 0);
> > +
> > +       resp.len = sizeof(resp);
> > +       resp.id = req.id;
> > +       resp.error = -512; /* -ERESTARTSYS */
> > +       resp.val = 0;
> > +
> > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> > +
> > +       ret = read_notif(listener, &req);
> > +       resp.len = sizeof(resp);
> > +       resp.id = req.id;
> > +       resp.error = 0;
> > +       resp.val = USER_NOTIF_MAGIC;
> > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> 
> I was slightly confused here: why have there been 3 reads? I was
> expecting one notification for hitting getpid and one from catching a
> signal. But in rereading, I see that NOTIF_RECV will return the most
> recently unresponded notification, yes?

The three reads are:

1. original syscall
# send SIGUSR1
2. another notif with signaled set
# respond with -ERESTARTSYS to make sure that works
3. this is the result of -ERESTARTSYS

> But... catching a signal replaces the existing seccomp_knotif? I
> remain confused about how signal handling is meant to work here. What
> happens if two signals get sent? It looks like you just block without
> allowing more signals? (Thank you for writing the tests!)

Yes, that's the idea. This is an implementation of Andy's pseudocode:
https://lkml.org/lkml/2018/3/15/1122

> (And can you document the expected behavior in the seccomp_filter.rst too?)

Will do.

> 
> Looking good!

Thanks for your review!

Tycho
Tycho Andersen Sept. 27, 2018, 11:04 p.m. UTC | #5
On Thu, Sep 27, 2018 at 11:51:40PM +0200, Jann Horn wrote:
> +Christoph Hellwig, Al Viro, fsdevel: For two questions about the poll
> interface (search for "seccomp_notify_poll" and
> "seccomp_notify_release" in the patch)
> 
> @Tycho: FYI, I've gone through all of v7 now, apart from the
> test/sample code. So don't wait for more comments from me before
> sending out v8.

(assuming you meant v8 -> v9) Yes, thanks for your reviews! Much
appreciated.

> On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > This patch introduces a means for syscalls matched in seccomp to notify
> > some other task that a particular filter has been triggered.
> >
> > The motivation for this is primarily for use with containers. For example,
> > if a container does an init_module(), we obviously don't want to load this
> > untrusted code, which may be compiled for the wrong version of the kernel
> > anyway. Instead, we could parse the module image, figure out which module
> > the container is trying to load and load it on the host.
> >
> > As another example, containers cannot mknod(), since this checks
> > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > coding some whitelist in the kernel. Another example is mount(), which has
> > many security restrictions for good reason, but configuration or runtime
> > knowledge could potentially be used to relax these restrictions.
> 
> Note that in that case, the trusted runtime needs to be in the same
> mount namespace as the container. mount() doesn't work on the mount
> structure of a foreign mount namespace; check_mnt() specifically
> checks for this case, and I think pretty much everything in
> sys_mount() uses that check. So you'd have to join the container's
> mount namespace before forwarding a mount syscall.

Yep, Serge came up with a pretty neat trick that we used in LXD to
accomplish sending mounts to containers, but it requires some
coordination up front.

> > This patch adds functionality that is already possible via at least two
> > other means that I know about, both of which involve ptrace(): first, one
> > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > Unfortunately this is slow, so a faster version would be to install a
> > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > Since ptrace allows only one tracer, if the container runtime is that
> > tracer, users inside the container (or outside) trying to debug it will not
> > be able to use ptrace, which is annoying. It also means that older
> > distributions based on Upstart cannot boot inside containers using ptrace,
> > since upstart itself uses ptrace to start services.
> >
> > The actual implementation of this is fairly small, although getting the
> > synchronization right was/is slightly complex.
> >
> > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > memory data from the task still applies here,
> 
> Actually, it doesn't, right? It would apply if you told the kernel "go
> ahead, that syscall is fine", but that's not how the API works - you
> always intercept the syscall, copy argument data to a trusted tracer,
> and then the tracer can make a replacement syscall. Sounds fine to me.

Right, I guess the point here is just "you need to copy all the data
to the tracer *before* making a policy decision".

> > but can be avoided with
> > careful design of the userspace handler: if the userspace handler reads all
> > of the task memory that is necessary before applying its security policy,
> > the tracee's subsequent memory edits will not be read by the tracer.
> [...]
> > diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
> [...]
> > +which (on success) will return a listener fd for the filter, which can then be
> > +passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
> > +acquired via:
> > +
> > +.. code-block::
> > +
> > +    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
> 
> The manpage documents ptrace() as taking four arguments, not three. I
> know that the header defines it with varargs, but it would probably be
> more useful to require passing in zero as the fourth argument so that
> we have a place to stick flags if necessary in the future.

Yep, I'll fix this, thanks. But also this documentation should really
live in the seccomp patch; some rebase got screwed up somewhere.

> > +which grabs the 0th filter for some task which the tracer has privilege over.
> > +Note that filter fds correspond to a particular filter, and not a particular
> > +task. So if this task then forks, notifications from both tasks will appear on
> > +the same filter fd. Reads and writes to/from a filter fd are also synchronized,
> > +so a filter fd can safely have many readers.
> 
> Add a note about needing CAP_SYS_ADMIN here? Also, might be useful to
> clarify in which direction "nth filter" counts.

Will do.

> > +The interface for a seccomp notification fd consists of two structures:
> > +
> > +.. code-block::
> > +
> > +    struct seccomp_notif {
> > +        __u16 len;
> > +        __u64 id;
> > +        pid_t pid;
> > +        __u8 signalled;
> > +        struct seccomp_data data;
> > +    };
> > +
> > +    struct seccomp_notif_resp {
> > +        __u16 len;
> > +        __u64 id;
> > +        __s32 error;
> > +        __s64 val;
> > +    };
> > +
> > +Users can read via ``ioctl(SECCOMP_NOTIF_RECV)``  (or ``poll()``) on a seccomp
> > +notification fd to receive a ``struct seccomp_notif``, which contains five
> > +members: the input length of the structure, a unique-per-filter ``id``, the
> > +``pid`` of the task which triggered this request (which may be 0 if the task is
> > +in a pid ns not visible from the listener's pid namespace), a flag representing
> > +whether or not the notification is a result of a non-fatal signal, and the
> > +``data`` passed to seccomp. Userspace can then make a decision based on this
> > +information about what to do, and ``ioctl(SECCOMP_NOTIF_SEND)`` a response,
> > +indicating what should be returned to userspace. The ``id`` member of ``struct
> > +seccomp_notif_resp`` should be the same ``id`` as in ``struct seccomp_notif``.
> > +
> > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > +arguments to the syscall, but does not contain pointers to memory. The task's
> > +memory is accessible to suitably privileged traces via ``ptrace()`` or
> > +``/proc/pid/map_files/``.
> 
> You probably don't actually want to use /proc/pid/map_files here; you
> can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
> And while reading memory via ptrace() is possible, the interface is
> really ugly (e.g. you can only read data in 4-byte chunks), and your
> caveat about locking out other ptracers (or getting locked out by
> them) applies. I'm not even sure if you could read memory via ptrace
> while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
> requires the target to be in a __TASK_TRACED state.
> The two interfaces you might want to use instead are /proc/$pid/mem
> and process_vm_{readv,writev}, which allow you to do nice,
> arbitrarily-sized, vectored IO on the memory of another process.

Yes, in fact the sample code does use /proc/$pid/mem, but the docs
should be correct :)
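
For completeness, a fragment of how that read might look (path, buffer sizes
and the choice of args[0] are illustrative, not lifted from the sample):

        char path[64], buf[4096];
        int memfd;

        snprintf(path, sizeof(path), "/proc/%d/mem", req.pid);
        memfd = open(path, O_RDONLY);
        if (memfd >= 0) {
                /* e.g. fetch a string argument whose pointer is in args[0] */
                pread(memfd, buf, sizeof(buf), req.data.args[0]);
                close(memfd);
        }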

> > +       /*
> > +        * Here it's possible we got a signal and then had to wait on the mutex
> > +        * while the reply was sent, so let's be sure there wasn't a response
> > +        * in the meantime.
> > +        */
> > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > +               /*
> > +                * We got a signal. Let's tell userspace about it (potentially
> > +                * again, if we had already notified them about the first one).
> > +                */
> > +               n.signaled = true;
> > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > +                       n.state = SECCOMP_NOTIFY_INIT;
> > +                       up(&match->notif->request);
> > +               }
> 
> Do you need another wake_up_poll() here?

Yes! Good point.

> > +               mutex_unlock(&match->notify_lock);
> > +               err = wait_for_completion_killable(&n.ready);
> > +               mutex_lock(&match->notify_lock);
> > +               if (err < 0)
> > +                       goto remove_list;
> 
> Add a comment here explaining that we intentionally leave the
> semaphore count too high (because otherwise we'd have to block), and
> seccomp_notify_recv() compensates for that?

Will do.
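
Something along these lines, perhaps (just a sketch of the comment):

        /*
         * We intentionally leave the semaphore count one too high here:
         * fixing it up would mean blocking, and seccomp_notify_recv()
         * already compensates by tolerating a wakeup with no INIT-state
         * notification to hand out (it returns -ENOENT).
         */
        if (err < 0)
                goto remove_list;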

> > +       }
> > +
> > +       ret = n.val;
> > +       err = n.error;
> > +
> > +remove_list:
> > +       list_del(&n.list);
> > +out:
> > +       mutex_unlock(&match->notify_lock);
> > +       syscall_set_return_value(current, task_pt_regs(current),
> > +                                err, ret);
> > +}
> > +
> >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> >                             const bool recheck_after_trace)
> >  {
> [...]
> >  #ifdef CONFIG_SECCOMP_FILTER
> > +static struct file *init_listener(struct task_struct *,
> > +                                 struct seccomp_filter *);
> > +
> >  /**
> >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> >   * @flags:  flags to change filter behavior
> > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> >         struct seccomp_filter *prepared = NULL;
> >         long ret = -EINVAL;
> > +       int listener = 0;
> > +       struct file *listener_f = NULL;
> >
> >         /* Validate flags. */
> >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         if (IS_ERR(prepared))
> >                 return PTR_ERR(prepared);
> >
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               listener = get_unused_fd_flags(0);
> > +               if (listener < 0) {
> > +                       ret = listener;
> > +                       goto out_free;
> > +               }
> > +
> > +               listener_f = init_listener(current, prepared);
> > +               if (IS_ERR(listener_f)) {
> > +                       put_unused_fd(listener);
> > +                       ret = PTR_ERR(listener_f);
> > +                       goto out_free;
> > +               }
> > +       }
> > +
> >         /*
> >          * Make sure we cannot change seccomp or nnp state via TSYNC
> >          * while another thread is in the middle of calling exec.
> >          */
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > -               goto out_free;
> > +               goto out_put_fd;
> >
> >         spin_lock_irq(&current->sighand->siglock);
> >
> > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> >         spin_unlock_irq(&current->sighand->siglock);
> >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > +out_put_fd:
> > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > +               if (ret < 0) {
> > +                       fput(listener_f);
> > +                       put_unused_fd(listener);
> > +               } else {
> > +                       fd_install(listener, listener_f);
> > +                       ret = listener;
> > +               }
> > +       }
> >  out_free:
> >         seccomp_filter_free(prepared);
> >         return ret;
> [...]
> > +
> > +#ifdef CONFIG_SECCOMP_FILTER
> > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       struct seccomp_knotif *knotif;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +
> > +       /*
> > +        * If this file is being closed because e.g. the task who owned it
> > +        * died, let's wake everyone up who was waiting on us.
> > +        */
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > +                       continue;
> > +
> > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +               knotif->error = -ENOSYS;
> > +               knotif->val = 0;
> > +
> > +               complete(&knotif->ready);
> > +       }
> > +
> > +       wake_up_all(&filter->notif->wqh);
> 
> If select() is polling us, a reference to the open file is being held,
> and this can't be reached; and I think if epoll is polling us,
> eventpoll_release() will remove itself from the wait queue, right? So
> can this wake_up_all() actually ever notify anyone?

I don't know actually, I just thought better safe than sorry. I can
drop it, though.

> > +       kfree(filter->notif);
> > +       filter->notif = NULL;
> > +       mutex_unlock(&filter->notify_lock);
> > +       __put_seccomp_filter(filter);
> > +       return 0;
> > +}
> > +
> > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL, *cur;
> > +       struct seccomp_notif unotif = {};
> > +       ssize_t ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +
> > +       ret = down_interruptible(&filter->notif->request);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       mutex_lock(&filter->notify_lock);
> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > +                       knotif = cur;
> > +                       break;
> > +               }
> > +       }
> > +
> > +       /*
> > +        * If we didn't find a notification, it could be that the task was
> > +        * interrupted between the time we were woken and when we were able to
> 
> s/interrupted/interrupted by a fatal signal/ ?
> 
> > +        * acquire the rw lock.
> 
> State more explicitly here that we are compensating for an incorrectly
> high semaphore count?

Will do, thanks.

> > +        */
> > +       if (!knotif) {
> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       size = min_t(size_t, size, sizeof(unotif));
> > +
> > +       unotif.len = size;
> > +       unotif.id = knotif->id;
> > +       unotif.pid = task_pid_vnr(knotif->task);
> > +       unotif.signaled = knotif->signaled;
> > +       unotif.data = *(knotif->data);
> > +
> > +       if (copy_to_user(buf, &unotif, size)) {
> > +               ret = -EFAULT;
> > +               goto out;
> > +       }
> > +
> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > +
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > +                               unsigned long arg)
> > +{
> > +       struct seccomp_notif_resp resp = {};
> > +       struct seccomp_knotif *knotif = NULL;
> > +       long ret;
> > +       u16 size;
> > +       void __user *buf = (void __user *)arg;
> > +
> > +       if (copy_from_user(&size, buf, sizeof(size)))
> > +               return -EFAULT;
> > +       size = min_t(size_t, size, sizeof(resp));
> > +       if (copy_from_user(&resp, buf, size))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == resp.id)
> > +                       break;
> > +       }
> > +
> > +       if (!knotif || knotif->id != resp.id) {
> 
> Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> NULL here. If `filter->notif->notifications` is empty, I think
> `knotif` will be `container_of(&filter->notif->notifications, struct
> seccom_knotif, list)` - in other words, you'll have a type confusion,
> and `knotif` probably points into some random memory in front of
> `filter->notif`.
> 
> Am I missing something?

No, I just flubbed the list API.

> > +               ret = -ENOENT;
> > +               goto out;
> > +       }
> > +
> > +       /* Allow exactly one reply. */
> > +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > +               ret = -EINPROGRESS;
> > +               goto out;
> > +       }
> 
> This means that if seccomp_do_user_notification() has in the meantime
> received a signal and transitioned from SENT back to INIT, this will
> fail, right? So we fail here, then we read the new notification, and
> then we can retry SECCOMP_NOTIF_SEND? Is that intended?

I think so, the idea being that you might want to do something
different if a signal was sent. But Andy seemed to think that we might
not actually do anything different.

Either way, for the case you describe, EINPROGRESS is a little weird.
Perhaps it should be:

if (knotif->state == SECCOMP_NOTIFY_INIT) {
        ret = -EBUSY; /* or something? */
        goto out;
} else if (knotif->state == SECCOMP_NOTIFY_REPLIED) {
        ret = -EINPROGRESS;
        goto out;
}

?

> > +       ret = size;
> > +       knotif->state = SECCOMP_NOTIFY_REPLIED;
> > +       knotif->error = resp.error;
> > +       knotif->val = resp.val;
> > +       complete(&knotif->ready);
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > +                                   unsigned long arg)
> > +{
> > +       struct seccomp_knotif *knotif = NULL;
> > +       void __user *buf = (void __user *)arg;
> > +       u64 id;
> > +       long ret;
> > +
> > +       if (copy_from_user(&id, buf, sizeof(id)))
> > +               return -EFAULT;
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> > +
> > +       ret = -1;
> 
> In strace, this is going to show up as EPERM. Maybe use something like
> -ENOENT instead? Or whatever you think resembles a fitting error
> number.

Yep, will do.

> > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > +               if (knotif->id == id) {
> > +                       ret = 0;
> 
> Would it make sense to treat notifications that have already been
> replied to as invalid?

I suppose so, since we aren't going to let you reply to them anyway.

> > +                       goto out;
> > +               }
> > +       }
> > +
> > +out:
> > +       mutex_unlock(&filter->notify_lock);
> > +       return ret;
> > +}
> > +
> [...]
> > +static __poll_t seccomp_notify_poll(struct file *file,
> > +                                   struct poll_table_struct *poll_tab)
> > +{
> > +       struct seccomp_filter *filter = file->private_data;
> > +       __poll_t ret = 0;
> > +       struct seccomp_knotif *cur;
> > +
> > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > +
> > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > +       if (ret < 0)
> > +               return ret;
> 
> Looking at the callers of vfs_poll(), as far as I can tell, a poll
> handler is not allowed to return error codes. Perhaps someone who
> knows the poll interface better can weigh in here. I've CCed some
> people who should hopefully know better how this stuff works.

Thanks.

> > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> > +                       break;
> > +       }
> > +
> > +       mutex_unlock(&filter->notify_lock);
> > +
> > +       return ret;
> > +}
> > +
> > +static const struct file_operations seccomp_notify_ops = {
> > +       .poll = seccomp_notify_poll,
> > +       .release = seccomp_notify_release,
> > +       .unlocked_ioctl = seccomp_notify_ioctl,
> > +};
> > +
> > +static struct file *init_listener(struct task_struct *task,
> > +                                 struct seccomp_filter *filter)
> > +{
> 
> Why does this function take a `task` pointer instead of always
> accessing `current`? If `task` actually wasn't `current`, I would have
> concurrency concerns. A comment in seccomp.h even explains:
> 
>  *          @filter must only be accessed from the context of current as there
>  *          is no read locking.
> 
> Unless there's a good reason for it, I would prefer it if this
> function didn't take a `task` pointer.

I think Kees replied already, but yes, this is a good point :(. We can
continue in his thread.

> > +       struct file *ret = ERR_PTR(-EBUSY);
> > +       struct seccomp_filter *cur, *last_locked = NULL;
> > +       int filter_nesting = 0;
> > +
> > +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> > +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> > +               filter_nesting++;
> > +               last_locked = cur;
> > +               if (cur->notif)
> > +                       goto out;
> > +       }
> > +
> > +       ret = ERR_PTR(-ENOMEM);
> > +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> 
> sizeof(struct notification) instead, to make the code clearer?
> 
> > +       if (!filter->notif)
> > +               goto out;
> > +
> > +       sema_init(&filter->notif->request, 0);
> > +       INIT_LIST_HEAD(&filter->notif->notifications);
> > +       filter->notif->next_id = get_random_u64();
> > +       init_waitqueue_head(&filter->notif->wqh);
> 
> Nit: next_id and notifications are declared in reverse order in the
> struct. Could you flip them around here?

Sure, will do.

> > +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> > +                                filter, O_RDWR);
> > +       if (IS_ERR(ret))
> > +               goto out;
> > +
> > +
> > +       /* The file has a reference to it now */
> > +       __get_seccomp_filter(filter);
> 
> __get_seccomp_filter() has a comment in it that claims "/* Reference
> count is bounded by the number of total processes. */". I think this
> change invalidates that comment. I think it should be fine to just
> remove the comment.

Will do, thanks.

> > +out:
> > +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> 
> s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> here, something went very wrong.

I suppose so, given that we have last_locked.

Tycho
Tycho Andersen Sept. 27, 2018, 11:08 p.m. UTC | #6
On Thu, Sep 27, 2018 at 03:45:11PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 2:51 PM, Jann Horn <jannh@google.com> wrote:
> > On Thu, Sep 27, 2018 at 5:11 PM Tycho Andersen <tycho@tycho.ws> wrote:
> >> However, care should be taken to avoid the TOCTOU
> >> +mentioned above in this document: all arguments being read from the tracee's
> >> +memory should be read into the tracer's memory before any policy decisions are
> >> +made. This allows for an atomic decision on syscall arguments.
> >
> > Again, I don't really see how you could get this wrong.
> 
> Doesn't hurt to mention it, IMO.
> 
> >> +static long seccomp_notify_send(struct seccomp_filter *filter,
> >> +                               unsigned long arg)
> >> +{
> >> +       struct seccomp_notif_resp resp = {};
> >> +       struct seccomp_knotif *knotif = NULL;
> >> +       long ret;
> >> +       u16 size;
> >> +       void __user *buf = (void __user *)arg;
> >> +
> >> +       if (copy_from_user(&size, buf, sizeof(size)))
> >> +               return -EFAULT;
> >> +       size = min_t(size_t, size, sizeof(resp));
> >> +       if (copy_from_user(&resp, buf, size))
> >> +               return -EFAULT;
> >> +
> >> +       ret = mutex_lock_interruptible(&filter->notify_lock);
> >> +       if (ret < 0)
> >> +               return ret;
> >> +
> >> +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> >> +               if (knotif->id == resp.id)
> >> +                       break;
> >> +       }
> >> +
> >> +       if (!knotif || knotif->id != resp.id) {
> >
> > Uuuh, this looks unsafe and wrong. I don't think `knotif` can ever be
> > NULL here. If `filter->notif->notifications` is empty, I think
> > `knotif` will be `container_of(&filter->notif->notifications, struct
> > seccom_knotif, list)` - in other words, you'll have a type confusion,
> > and `knotif` probably points into some random memory in front of
> > `filter->notif`.
> >
> > Am I missing something?
> 
> Oh, good catch. This just needs to be fixed like it's done in
> seccomp_notif_recv (separate cur and knotif).
> 
> >> +static struct file *init_listener(struct task_struct *task,
> >> +                                 struct seccomp_filter *filter)
> >> +{
> >
> > Why does this function take a `task` pointer instead of always
> > accessing `current`? If `task` actually wasn't `current`, I would have
> > concurrency concerns. A comment in seccomp.h even explains:
> >
> >  *          @filter must only be accessed from the context of current as there
> >  *          is no read locking.
> >
> > Unless there's a good reason for it, I would prefer it if this
> > function didn't take a `task` pointer.
> 
> This is to support PTRACE_SECCOMP_NEW_LISTENER.
> 
> But you make an excellent point. Even TSYNC expects to operate only on
> the current thread group. Hmm.
> 
> While the process is stopped by ptrace, we could, in theory, update
> task->seccomp.filter via something like TSYNC.
> 
> So perhaps use:
> 
> mutex_lock_killable(&task->signal->cred_guard_mutex);
> 
> before walking the notify_locks?

This means that all the seccomp/ptrace code probably needs to be
updated for this? I'll try to send patches for this as well as the
return code thing Jann pointed out.

> >
> >> +       struct file *ret = ERR_PTR(-EBUSY);
> >> +       struct seccomp_filter *cur, *last_locked = NULL;
> >> +       int filter_nesting = 0;
> >> +
> >> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> >> +               mutex_lock_nested(&cur->notify_lock, filter_nesting);
> >> +               filter_nesting++;
> >> +               last_locked = cur;
> >> +               if (cur->notif)
> >> +                       goto out;
> >> +       }
> >> +
> >> +       ret = ERR_PTR(-ENOMEM);
> >> +       filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
> >
> > sizeof(struct notification) instead, to make the code clearer?
> 
> I prefer what Tycho has: I want to allocate an instances of whatever
> filter->notif is.
> 
> Though, let's do the kzalloc outside of the locking, instead?

Yep, sounds good.

> >> +       ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
> >> +                                filter, O_RDWR);
> >> +       if (IS_ERR(ret))
> >> +               goto out;
> >> +
> >> +
> >> +       /* The file has a reference to it now */
> >> +       __get_seccomp_filter(filter);
> >
> > __get_seccomp_filter() has a comment in it that claims "/* Reference
> > count is bounded by the number of total processes. */". I think this
> > change invalidates that comment. I think it should be fine to just
> > remove the comment.
> 
> Update it to "bounded by total processes and notification listeners"?

Will do.

> >> +out:
> >> +       for (cur = task->seccomp.filter; cur; cur = cur->prev) {
> >
> > s/; cur;/; 1;/, or use a while loop instead? If the NULL check fires
> > here, something went very wrong.
> 
> Hm? This is correct. This is how seccomp_run_filters() walks the list too:
> 
>         struct seccomp_filter *f =
>                         READ_ONCE(current->seccomp.filter);
>         ...
>         for (; f; f = f->prev) {
> 
> Especially if we'll be holding the cred_guard_mutex.

There is a last_locked local here though; I think that's what Jann is
pointing out.

Cheers,

Tycho
Kees Cook Sept. 27, 2018, 11:10 p.m. UTC | #7
On Thu, Sep 27, 2018 at 3:48 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
>> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> struct seccomp_notif {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __u32                      pid;                  /*    16     4 */
>>         __u8                       signaled;             /*    20     1 */
>>
>>         /* XXX 3 bytes hole, try to pack */
>>
>>         struct seccomp_data        data;                 /*    24    64 */
>>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>>
>>         /* size: 88, cachelines: 2, members: 5 */
>>         /* sum members: 79, holes: 2, sum holes: 9 */
>>         /* last cacheline: 24 bytes */
>> };
>> struct seccomp_notif_resp {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __s32                      error;                /*    16     4 */
>>
>>         /* XXX 4 bytes hole, try to pack */
>>
>>         __s64                      val;                  /*    24     8 */
>>
>>         /* size: 32, cachelines: 1, members: 4 */
>>         /* sum members: 22, holes: 2, sum holes: 10 */
>>         /* last cacheline: 32 bytes */
>> };
>>
>> How about making len u32, and moving pid and error above "id"? This
>> leaves a hole after signaled, so changing "len" won't be sufficient
>> for versioning here. Perhaps move it after data?
>
> I'm not sure what you mean by "len won't be sufficient for versioning
> here"? Anyway, I can do some packing on these; I didn't bother before
> since I figured it's a userspace interface, so saving a few bytes
> isn't a huge deal.

I was thinking the "len" portion was for determining if the API ever
changes in the future. My point was that given the padding holes, e.g.
adding a u8 after signaled, "len" wouldn't change, so the kernel might
expect to start reading something after signaled that it wasn't
checking before, but the len would be the same.

>> I have to say, I'm vaguely nervous about changing the semantics here
>> for passing back the fd as the return code from the seccomp() syscall.
>> Alternatives seem less appealing, though: changing the meaning of the
>> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
>> example. Hmm.
>
> From my perspective we can drop this whole thing. The only thing I'll
> ever use is the ptrace version. Someone at some point (I don't
> remember who, maybe stgraber) suggested this version would be useful
> as well.

Well that would certainly change the exposure of the interface pretty
drastically. :)

So, let's talk more about this, as it raises another thought I had
too: for the PTRACE interface to work, you have to know specifically
which filter you want to get notifications for. Won't that be slightly
tricky?

> Anyway, let me know if your nervousness outweighs this, I'm happy to
> drop it.

I'm not opposed to keeping it, but if you don't think anyone will use
it ... we should probably drop it just to avoid the complexity. It's a
cool API, though, so I'd like to hear from others first before you go
tearing it out. ;) (stgraber added to CC)

>> It is possible (though unlikely given the type widths involved here)
>> for unotif = {} to not initialize padding, so I would recommend an
>> explicit memset(&unotif, 0, sizeof(unotif)) here.
>
> Orly? I didn't know that, thanks.

Yeah, it's a pretty annoying C-ism. The spec says that struct
_members_ will get zero-initialized, but it doesn't say anything about
padding. >_< In most cases, the padding gets initialized too, just
because of bit widths being small enough that they're caught in the
member initialization that the compiler does. But for REALLY big
holes, they may get missed. In this case, while the padding is small,
it's directly exposed to userspace, so I want to make it robust.
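
In concrete terms, the requested change is just (sketch):

        struct seccomp_notif unotif;

        /* unlike "= {}", memset() clears the padding bytes as well */
        memset(&unotif, 0, sizeof(unotif));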

>> > +       if (copy_from_user(&size, buf, sizeof(size)))
>> > +               return -EFAULT;
>> > +       size = min_t(size_t, size, sizeof(resp));
>> > +       if (copy_from_user(&resp, buf, size))
>> > +               return -EFAULT;
>>
>> For sanity checking on a double-read from userspace, please add:
>>
>>     if (resp.len != size)
>>         return -EINVAL;
>
> Won't that fail if sizeof(resp) < resp.len, because of the min_t()?

Ah, true. In that case, probably do resp.len = size to avoid any logic
failures due to the double-read? I just want to avoid any chance of
confusing the size and actually using it somewhere.
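
i.e. something like (sketch):

        size = min_t(size_t, size, sizeof(resp));
        if (copy_from_user(&resp, buf, size))
                return -EFAULT;
        /* only ever use the clamped size from the first read */
        resp.len = size;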

-Kees
Jann Horn Sept. 27, 2018, 11:37 p.m. UTC | #8
On Fri, Sep 28, 2018 at 1:04 AM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 11:51:40PM +0200, Jann Horn wrote:
> > > +It is worth noting that ``struct seccomp_data`` contains the values of register
> > > +arguments to the syscall, but does not contain pointers to memory. The task's
> > > +memory is accessible to suitably privileged traces via ``ptrace()`` or
> > > +``/proc/pid/map_files/``.
> >
> > You probably don't actually want to use /proc/pid/map_files here; you
> > can't use that to access anonymous memory, and it needs CAP_SYS_ADMIN.
> > And while reading memory via ptrace() is possible, the interface is
> > really ugly (e.g. you can only read data in 4-byte chunks), and your
> > caveat about locking out other ptracers (or getting locked out by
> > them) applies. I'm not even sure if you could read memory via ptrace
> > while a process is stopped in the seccomp logic? PTRACE_PEEKDATA
> > requires the target to be in a __TASK_TRACED state.
> > The two interfaces you might want to use instead are /proc/$pid/mem
> > and process_vm_{readv,writev}, which allow you to do nice,
> > arbitrarily-sized, vectored IO on the memory of another process.
>
> Yes, in fact the sample code does use /proc/$pid/mem, but the docs
> should be correct :)

Please also mention the process_vm_readv/writev syscalls though, given
that fast access to remote processes is what they were made for.

> > > +#ifdef CONFIG_SECCOMP_FILTER
> > > +static int seccomp_notify_release(struct inode *inode, struct file *file)
[...]
> > > +       wake_up_all(&filter->notif->wqh);
> >
> > If select() is polling us, a reference to the open file is being held,
> > and this can't be reached; and I think if epoll is polling us,
> > eventpoll_release() will remove itself from the wait queue, right? So
> > can this wake_up_all() actually ever notify anyone?
>
> I don't know actually, I just thought better safe than sorry. I can
> drop it, though.

Let's see if any fs people have some insight...

> > > +               ret = -ENOENT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       /* Allow exactly one reply. */
> > > +       if (knotif->state != SECCOMP_NOTIFY_SENT) {
> > > +               ret = -EINPROGRESS;
> > > +               goto out;
> > > +       }
> >
> > This means that if seccomp_do_user_notification() has in the meantime
> > received a signal and transitioned from SENT back to INIT, this will
> > fail, right? So we fail here, then we read the new notification, and
> > then we can retry SECCOMP_NOTIF_SEND? Is that intended?
>
> I think so, the idea being that you might want to do something
> different if a signal was sent. But Andy seemed to think that we might
> not actually do anything different.

If you already have the proper response ready, you'd probably want to
just go through with it, no? Otherwise you'll just end up re-emulating
the syscall afterwards for no good reason. If you noticed the
interruption in the middle of the emulated syscall, that'd be
different, but since this is the case where we're already done with
the emulation and getting ready to continue...
Tycho Andersen Sept. 28, 2018, 2:39 p.m. UTC | #9
On Thu, Sep 27, 2018 at 04:10:29PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 3:48 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > I'm not sure what you mean by "len won't be sufficient for versioning
> > here"? Anyway, I can do some packing on these; I didn't bother before
> > since I figured it's a userspace interface, so saving a few bytes
> > isn't a huge deal.
> 
> I was thinking the "len" portion was for determining if the API ever
> changes in the future. My point was that given the padding holes, e.g.
> adding a u8 after signaled, "len" wouldn't change, so the kernel might
> expect to starting reading something after signaled that it wasn't
> checking before, but the len would be the same.

Oh, yeah. That's ugly :(

> >> I have to say, I'm vaguely nervous about changing the semantics here
> >> for passing back the fd as the return code from the seccomp() syscall.
> >> Alternatives seem less appealing, though: changing the meaning of the
> >> uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> >> example. Hmm.
> >
> > From my perspective we can drop this whole thing. The only thing I'll
> > ever use is the ptrace version. Someone at some point (I don't
> > remember who, maybe stgraber) suggested this version would be useful
> > as well.
> 
> Well that would certainly change the exposure of the interface pretty
> drastically. :)
> 
> So, let's talk more about this, as it raises another thought I had
> too: for the PTRACE interface to work, you have to know specifically
> which filter you want to get notifications for. Won't that be slightly
> tricky?

Not necessarily. The way I imagine using it is:

1. container manager forks init task
2. init task does a bunch of setup stuff, then installs the filter
3. optionally install any user specified filter (or just merge the
   filter with step 2 instead of chaining them)
4. container manager grabs the listener fd from the container init via
   ptrace
5. init execs the user specified init

So the offset will always be known, at least in my use case. The
container manager doesn't want to install the filter on itself, so it
won't use NEW_LISTENER. Similarly, we don't want init to use
NEW_LISTENER, because if the user has decided to block sendmsg as part
of their policy, there's no way to get the fd out.
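
For step 4, the grab itself would be tiny; a sketch, glossing over the usual
attach/stop details and assuming the PTRACE_SECCOMP_NEW_LISTENER request from
this series (with 0 passed explicitly as the fourth argument, per Jann's
comment on the docs):

        int listener;

        ptrace(PTRACE_ATTACH, pid, 0, 0);
        waitpid(pid, NULL, 0);
        /* the addr argument picks which filter; here the 0th one */
        listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0, 0);
        ptrace(PTRACE_DETACH, pid, 0, 0);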

> > Anyway, let me know if your nervousness outweighs this, I'm happy to
> > drop it.
> 
> I'm not opposed to keeping it, but if you don't think anyone will use
> it ... we should probably drop it just to avoid the complexity. It's a
> cool API, though, so I'd like to hear from others first before you go
> tearing it out. ;) (stgraber added to CC)

It does seem useful for lighter weight cases than a container. The "I
want to run some random binary that I don't have the source for that
tries to make some privileged calls it doesn't really need" case. But
as a Container Guy I think I have in my contract somewhere that I have
to use containers :). But let's see what people think.

> >> > +       if (copy_from_user(&size, buf, sizeof(size)))
> >> > +               return -EFAULT;
> >> > +       size = min_t(size_t, size, sizeof(resp));
> >> > +       if (copy_from_user(&resp, buf, size))
> >> > +               return -EFAULT;
> >>
> >> For sanity checking on a double-read from userspace, please add:
> >>
> >>     if (resp.len != size)
> >>         return -EINVAL;
> >
> > Won't that fail if sizeof(resp) < resp.len, because of the min_t()?
> 
> Ah, true. In that case, probably do resp.len = size to avoid any logic
> failures due to the double-read? I just want to avoid any chance of
> confusing the size and actually using it somewhere.

Yep, sounds good.

Tycho
Aleksa Sarai Sept. 29, 2018, 12:28 a.m. UTC | #10
On 2018-09-27, Tycho Andersen <tycho@tycho.ws> wrote:
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
> 
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
> 
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.

Minor thing, but this is no longer _entirely_ true (now it checks
ns_capable(sb->s_user_ns)). I think the kernel module auto-loading is a
much more interesting example, but since this is just a commit message
feel free to ignore my pedantry. :P

> Signed-off-by: Tycho Andersen <tycho@tycho.ws>
> CC: Kees Cook <keescook@chromium.org>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Eric W. Biederman <ebiederm@xmission.com>
> CC: "Serge E. Hallyn" <serge@hallyn.com>
> CC: Christian Brauner <christian.brauner@ubuntu.com>
> CC: Tyler Hicks <tyhicks@canonical.com>
> CC: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>

Would you mind adding me to the Cc: list for the next round of patches?
It's looking pretty neat!

Thanks!
Christian Brauner Oct. 8, 2018, 2:58 p.m. UTC | #11
On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > This patch introduces a means for syscalls matched in seccomp to notify
> > > some other task that a particular filter has been triggered.
> > >
> > > The motivation for this is primarily for use with containers. For example,
> > > if a container does an init_module(), we obviously don't want to load this
> > > untrusted code, which may be compiled for the wrong version of the kernel
> > > anyway. Instead, we could parse the module image, figure out which module
> > > the container is trying to load and load it on the host.
> > >
> > > As another example, containers cannot mknod(), since this checks
> > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > > coding some whitelist in the kernel. Another example is mount(), which has
> > > many security restrictions for good reason, but configuration or runtime
> > > knowledge could potentially be used to relax these restrictions.
> > >
> > > This patch adds functionality that is already possible via at least two
> > > other means that I know about, both of which involve ptrace(): first, one
> > > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > > Unfortunately this is slow, so a faster version would be to install a
> > > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > > Since ptrace allows only one tracer, if the container runtime is that
> > > tracer, users inside the container (or outside) trying to debug it will not
> > > be able to use ptrace, which is annoying. It also means that older
> > > distributions based on Upstart cannot boot inside containers using ptrace,
> > > since upstart itself uses ptrace to start services.
> > >
> > > The actual implementation of this is fairly small, although getting the
> > > synchronization right was/is slightly complex.
> > >
> > > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > > memory data from the task still applies here, but can be avoided with
> > > careful design of the userspace handler: if the userspace handler reads all
> > > of the task memory that is necessary before applying its security policy,
> > > the tracee's subsequent memory edits will not be read by the tracer.
> > >
> > > v2: * make id a u64; the idea here being that it will never overflow,
> > >       because 64 is huge (one syscall every nanosecond => wrap every 584
> > >       years) (Andy)
> > >     * prevent nesting of user notifications: if someone is already attached
> > >       the tree in one place, nobody else can attach to the tree (Andy)
> > >     * notify the listener of signals the tracee receives as well (Andy)
> > >     * implement poll
> > > v3: * lockdep fix (Oleg)
> > >     * drop unnecessary WARN()s (Christian)
> > >     * rearrange error returns to be more pretty (Christian)
> > >     * fix build in !CONFIG_SECCOMP_USER_NOTIFICATION case
> > > v4: * fix implementation of poll to use poll_wait() (Jann)
> > >     * change listener's fd flags to be 0 (Jann)
> > >     * hoist filter initialization out of ifdefs to its own function
> > >       init_user_notification()
> > >     * add some more testing around poll() and closing the listener while a
> > >       syscall is in action
> > >     * s/GET_LISTENER/NEW_LISTENER, since you can't _get_ a listener, but it
> > >       creates a new one (Matthew)
> > >     * correctly handle pid namespaces, add some testcases (Matthew)
> > >     * use EINPROGRESS instead of EINVAL when a notification response is
> > >       written twice (Matthew)
> > >     * fix comment typo from older version (SEND vs READ) (Matthew)
> > >     * whitespace and logic simplification (Tobin)
> > >     * add some Documentation/ bits on userspace trapping
> > > v5: * fix documentation typos (Jann)
> > >     * add signalled field to struct seccomp_notif (Jann)
> > >     * switch to using ioctls instead of read()/write() for struct passing
> > >       (Jann)
> > >     * add an ioctl to ensure an id is still valid
> > > v6: * docs typo fixes, update docs for ioctl() change (Christian)
> > > v7: * switch struct seccomp_knotif's id member to a u64 (derp :)
> > >     * use notify_lock in IS_ID_VALID query to avoid racing
> > >     * s/signalled/signaled (Tyler)
> > >     * fix docs to reflect that ids are not globally unique (Tyler)
> > >     * add a test to check -ERESTARTSYS behavior (Tyler)
> > >     * drop CONFIG_SECCOMP_USER_NOTIFICATION (Tyler)
> > >     * reorder USER_NOTIF in seccomp return codes list (Tyler)
> > >     * return size instead of sizeof(struct user_notif) (Tyler)
> > >     * ENOENT instead of EINVAL when invalid id is passed (Tyler)
> > >     * drop CONFIG_SECCOMP_USER_NOTIFICATION guards (Tyler)
> > >     * s/IS_ID_VALID/ID_VALID and switch ioctl to be "well behaved" (Tyler)
> > >     * add a new struct notification to minimize the additions to
> > >       struct seccomp_filter, also pack the necessary additions a bit more
> > >       cleverly (Tyler)
> > >     * switch to keeping track of the task itself instead of the pid (we'll
> > >       use this for implementing PUT_FD)
> > 
> > Patch-sending nit: can you put the versioning below the "---" line so
> > it isn't included in the final commit? (And I normally read these
> > backwards, so I'd expect v7 at the top, but that's not a big deal. I
> > mean... neither is the --- thing, but it makes "git am" easier for me
> > since I don't have to go edit the versioning out of the log.)
> 
> Sure, will do.
> 
> > > diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> > > index 9efc0e73d50b..d4ccb32fe089 100644
> > > --- a/include/uapi/linux/seccomp.h
> > > +++ b/include/uapi/linux/seccomp.h
> > > @@ -17,9 +17,10 @@
> > >  #define SECCOMP_GET_ACTION_AVAIL       2
> > >
> > >  /* Valid flags for SECCOMP_SET_MODE_FILTER */
> > > -#define SECCOMP_FILTER_FLAG_TSYNC      (1UL << 0)
> > > -#define SECCOMP_FILTER_FLAG_LOG                (1UL << 1)
> > > -#define SECCOMP_FILTER_FLAG_SPEC_ALLOW (1UL << 2)
> > > +#define SECCOMP_FILTER_FLAG_TSYNC              (1UL << 0)
> > > +#define SECCOMP_FILTER_FLAG_LOG                        (1UL << 1)
> > > +#define SECCOMP_FILTER_FLAG_SPEC_ALLOW         (1UL << 2)
> > > +#define SECCOMP_FILTER_FLAG_NEW_LISTENER       (1UL << 3)
> > 
> > Since these are all getting indentation updates, can you switch them
> > to BIT(0), BIT(1), etc?
> 
> Will do.
> 
> > >  /*
> > >   * All BPF programs must return a 32-bit value.
> > > @@ -35,6 +36,7 @@
> > >  #define SECCOMP_RET_KILL        SECCOMP_RET_KILL_THREAD
> > >  #define SECCOMP_RET_TRAP        0x00030000U /* disallow and force a SIGSYS */
> > >  #define SECCOMP_RET_ERRNO       0x00050000U /* returns an errno */
> > > +#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
> > >  #define SECCOMP_RET_TRACE       0x7ff00000U /* pass to a tracer or disallow */
> > >  #define SECCOMP_RET_LOG                 0x7ffc0000U /* allow after logging */
> > >  #define SECCOMP_RET_ALLOW       0x7fff0000U /* allow */
> > > @@ -60,4 +62,29 @@ struct seccomp_data {
> > >         __u64 args[6];
> > >  };
> > >
> > > +struct seccomp_notif {
> > > +       __u16 len;
> > > +       __u64 id;
> > > +       __u32 pid;
> > > +       __u8 signaled;
> > > +       struct seccomp_data data;
> > > +};
> > > +
> > > +struct seccomp_notif_resp {
> > > +       __u16 len;
> > > +       __u64 id;
> > > +       __s32 error;
> > > +       __s64 val;
> > > +};
> > 
> > So, len has to come first, for versioning. However, since it's ahead
> > of a u64, this leaves a struct padding hole. pahole output:
> > 
> > struct seccomp_notif {
> >         __u16                      len;                  /*     0     2 */
> > 
> >         /* XXX 6 bytes hole, try to pack */
> > 
> >         __u64                      id;                   /*     8     8 */
> >         __u32                      pid;                  /*    16     4 */
> >         __u8                       signaled;             /*    20     1 */
> > 
> >         /* XXX 3 bytes hole, try to pack */
> > 
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> > 
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 79, holes: 2, sum holes: 9 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u16                      len;                  /*     0     2 */
> > 
> >         /* XXX 6 bytes hole, try to pack */
> > 
> >         __u64                      id;                   /*     8     8 */
> >         __s32                      error;                /*    16     4 */
> > 
> >         /* XXX 4 bytes hole, try to pack */
> > 
> >         __s64                      val;                  /*    24     8 */
> > 
> >         /* size: 32, cachelines: 1, members: 4 */
> >         /* sum members: 22, holes: 2, sum holes: 10 */
> >         /* last cacheline: 32 bytes */
> > };
> > 
> > How about making len u32, and moving pid and error above "id"? This
> > leaves a hole after signaled, so changing "len" won't be sufficient
> > for versioning here. Perhaps move it after data?
> 
> I'm not sure what you mean by "len won't be sufficient for versioning
> here"? Anyway, I can do some packing on these; I didn't bother before
> since I figured it's a userspace interface, so saving a few bytes
> isn't a huge deal.
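
For reference, the hole Kees describes can be shown with a compile-time check from
userspace. The struct below only mirrors the v7 layout quoted above (with a stand-in
for struct seccomp_data), so the names and offsets are illustrative, not part of any
UAPI; the assertion holds on common 64-bit ABIs where u64 has 8-byte alignment.

#include <stddef.h>
#include <stdint.h>

/* Mirror of the v7 layout under discussion; not the final UAPI. */
struct notif_v7_layout {
	uint16_t len;
	uint64_t id;
	uint32_t pid;
	uint8_t  signaled;
	uint64_t data[8];	/* stand-in for the 64-byte struct seccomp_data */
};

/*
 * The 64-bit id forces 8-byte alignment, so a 6-byte hole follows the
 * 16-bit len field -- exactly what the pahole output above shows.
 */
_Static_assert(offsetof(struct notif_v7_layout, id) == 8,
	       "6-byte hole after the 16-bit len field");
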
> 
> > > +
> > > +#define SECCOMP_IOC_MAGIC              0xF7
> > 
> > Was there any specific reason for picking this value? There are lots
> > of fun ASCII code left like '!' or '*'. :)
> 
> No, ! it is :)
> 
> > > +
> > > +/* Flags for seccomp notification fd ioctl. */
> > > +#define SECCOMP_NOTIF_RECV     _IOWR(SECCOMP_IOC_MAGIC, 0,     \
> > > +                                       struct seccomp_notif)
> > > +#define SECCOMP_NOTIF_SEND     _IOWR(SECCOMP_IOC_MAGIC, 1,     \
> > > +                                       struct seccomp_notif_resp)
> > > +#define SECCOMP_NOTIF_ID_VALID _IOR(SECCOMP_IOC_MAGIC, 2,      \
> > > +                                       __u64)
> > 
> > To match other UAPI ioctl, can these have a prefix of "SECCOMP_IOCTOL_..."?
> > 
> > It may also be useful to match how other uapis do this, like for DRM:
> > 
> > #define DRM_IOCTL_BASE                  'd'
> > #define DRM_IO(nr)                      _IO(DRM_IOCTL_BASE,nr)
> > #define DRM_IOR(nr,type)                _IOR(DRM_IOCTL_BASE,nr,type)
> > #define DRM_IOW(nr,type)                _IOW(DRM_IOCTL_BASE,nr,type)
> > #define DRM_IOWR(nr,type)               _IOWR(DRM_IOCTL_BASE,nr,type)
> > 
> > #define DRM_IOCTL_VERSION               DRM_IOWR(0x00, struct drm_version)
> > #define DRM_IOCTL_GET_UNIQUE            DRM_IOWR(0x01, struct drm_unique)
> > #define DRM_IOCTL_GET_MAGIC             DRM_IOR( 0x02, struct drm_auth)
> > ...
> 
> Will do.
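
A sketch of what the DRM-style naming could look like for seccomp, assuming the '!'
magic settled on above; the helper macros and the SECCOMP_IOCTL_* names here are only
illustrative, not the final UAPI.

#include <linux/ioctl.h>

/* Hypothetical DRM-style helpers for the seccomp notification ioctls. */
#define SECCOMP_IOC_MAGIC		'!'
#define SECCOMP_IO(nr)			_IO(SECCOMP_IOC_MAGIC, nr)
#define SECCOMP_IOR(nr, type)		_IOR(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOW(nr, type)		_IOW(SECCOMP_IOC_MAGIC, nr, type)
#define SECCOMP_IOWR(nr, type)		_IOWR(SECCOMP_IOC_MAGIC, nr, type)

#define SECCOMP_IOCTL_NOTIF_RECV	SECCOMP_IOWR(0, struct seccomp_notif)
#define SECCOMP_IOCTL_NOTIF_SEND	SECCOMP_IOWR(1, struct seccomp_notif_resp)
#define SECCOMP_IOCTL_NOTIF_ID_VALID	SECCOMP_IOR(2, __u64)
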
> 
> > 
> > > +
> > >  #endif /* _UAPI_LINUX_SECCOMP_H */
> > > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > > index fd023ac24e10..fa6fe9756c80 100644
> > > --- a/kernel/seccomp.c
> > > +++ b/kernel/seccomp.c
> > > @@ -33,12 +33,78 @@
> > >  #endif
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +#include <linux/file.h>
> > >  #include <linux/filter.h>
> > >  #include <linux/pid.h>
> > >  #include <linux/ptrace.h>
> > >  #include <linux/security.h>
> > >  #include <linux/tracehook.h>
> > >  #include <linux/uaccess.h>
> > > +#include <linux/anon_inodes.h>
> > > +
> > > +enum notify_state {
> > > +       SECCOMP_NOTIFY_INIT,
> > > +       SECCOMP_NOTIFY_SENT,
> > > +       SECCOMP_NOTIFY_REPLIED,
> > > +};
> > > +
> > > +struct seccomp_knotif {
> > > +       /* The struct pid of the task whose filter triggered the notification */
> > > +       struct task_struct *task;
> > > +
> > > +       /* The "cookie" for this request; this is unique for this filter. */
> > > +       u64 id;
> > > +
> > > +       /* Whether or not this task has been given an interruptible signal. */
> > > +       bool signaled;
> > > +
> > > +       /*
> > > +        * The seccomp data. This pointer is valid the entire time this
> > > +        * notification is active, since it comes from __seccomp_filter which
> > > +        * eclipses the entire lifecycle here.
> > > +        */
> > > +       const struct seccomp_data *data;
> > > +
> > > +       /*
> > > +        * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
> > > +        * struct seccomp_knotif is created and starts out in INIT. Once the
> > > +        * handler reads the notification off of an FD, it transitions to SENT.
> > > +        * If a signal is received the state transitions back to INIT and
> > > +        * another message is sent. When the userspace handler replies, state
> > > +        * transitions to REPLIED.
> > > +        */
> > > +       enum notify_state state;
> > > +
> > > +       /* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
> > > +       int error;
> > > +       long val;
> > > +
> > > +       /* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
> > > +       struct completion ready;
> > > +
> > > +       struct list_head list;
> > > +};
> > > +
> > > +/**
> > > + * struct notification - container for seccomp userspace notifications. Since
> > > + * most seccomp filters will not have notification listeners attached and this
> > > + * structure is fairly large, we store the notification-specific stuff in a
> > > + * separate structure.
> > > + *
> > > + * @request: A semaphore that users of this notification can wait on for
> > > + *           changes. Actual reads and writes are still controlled with
> > > + *           filter->notify_lock.
> > > + * @notify_lock: A lock for all notification-related accesses.
> > > + * @next_id: The id of the next request.
> > > + * @notifications: A list of struct seccomp_knotif elements.
> > > + * @wqh: A wait queue for poll.
> > > + */
> > > +struct notification {
> > > +       struct semaphore request;
> > > +       u64 next_id;
> > > +       struct list_head notifications;
> > > +       wait_queue_head_t wqh;
> > > +};
> > >
> > >  /**
> > >   * struct seccomp_filter - container for seccomp BPF programs
> > > @@ -66,6 +132,8 @@ struct seccomp_filter {
> > >         bool log;
> > >         struct seccomp_filter *prev;
> > >         struct bpf_prog *prog;
> > > +       struct notification *notif;
> > > +       struct mutex notify_lock;
> > >  };
> > >
> > >  /* Limit any path through the tree to 256KB worth of instructions. */
> > > @@ -392,6 +460,7 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
> > >         if (!sfilter)
> > >                 return ERR_PTR(-ENOMEM);
> > >
> > > +       mutex_init(&sfilter->notify_lock);
> > >         ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
> > >                                         seccomp_check_filter, save_orig);
> > >         if (ret < 0) {
> > > @@ -556,11 +625,13 @@ static void seccomp_send_sigsys(int syscall, int reason)
> > >  #define SECCOMP_LOG_TRACE              (1 << 4)
> > >  #define SECCOMP_LOG_LOG                        (1 << 5)
> > >  #define SECCOMP_LOG_ALLOW              (1 << 6)
> > > +#define SECCOMP_LOG_USER_NOTIF         (1 << 7)
> > >
> > >  static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
> > >                                     SECCOMP_LOG_KILL_THREAD  |
> > >                                     SECCOMP_LOG_TRAP  |
> > >                                     SECCOMP_LOG_ERRNO |
> > > +                                   SECCOMP_LOG_USER_NOTIF |
> > >                                     SECCOMP_LOG_TRACE |
> > >                                     SECCOMP_LOG_LOG;
> > >
> > > @@ -581,6 +652,9 @@ static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
> > >         case SECCOMP_RET_TRACE:
> > >                 log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
> > >                 break;
> > > +       case SECCOMP_RET_USER_NOTIF:
> > > +               log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
> > > +               break;
> > >         case SECCOMP_RET_LOG:
> > >                 log = seccomp_actions_logged & SECCOMP_LOG_LOG;
> > >                 break;
> > > @@ -652,6 +726,73 @@ void secure_computing_strict(int this_syscall)
> > >  #else
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
> > > +{
> > > +       /* Note: overflow is ok here, the id just needs to be unique */
> > 
> > Maybe just clarify in the comment: unique to the filter.
> > 
> > > +       return filter->notif->next_id++;
> > 
> > Also, it might be useful to add for both documentation and lockdep:
> > 
> > lockdep_assert_held(filter->notif->notify_lock);
> > 
> > into this function?
> 
> Will do.
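
A sketch of the function with both suggestions folded in; since notify_lock lives on
the filter rather than on struct notification, the assertion would presumably take
&filter->notify_lock.

static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
{
	/*
	 * Note: overflow is ok here; the id only needs to be unique to
	 * this filter.
	 */
	lockdep_assert_held(&filter->notify_lock);
	return filter->notif->next_id++;
}
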
> 
> > 
> > > +}
> > > +
> > > +static void seccomp_do_user_notification(int this_syscall,
> > > +                                        struct seccomp_filter *match,
> > > +                                        const struct seccomp_data *sd)
> > > +{
> > > +       int err;
> > > +       long ret = 0;
> > > +       struct seccomp_knotif n = {};
> > > +
> > > +       mutex_lock(&match->notify_lock);
> > > +       err = -ENOSYS;
> > > +       if (!match->notif)
> > > +               goto out;
> > > +
> > > +       n.task = current;
> > > +       n.state = SECCOMP_NOTIFY_INIT;
> > > +       n.data = sd;
> > > +       n.id = seccomp_next_notify_id(match);
> > > +       init_completion(&n.ready);
> > > +
> > > +       list_add(&n.list, &match->notif->notifications);
> > > +       wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
> > > +
> > > +       mutex_unlock(&match->notify_lock);
> > > +       up(&match->notif->request);
> > > +
> > 
> > Maybe add a big comment here saying this is where we're waiting for a reply?
> 
> Will do.
> 
> > > +       err = wait_for_completion_interruptible(&n.ready);
> > > +       mutex_lock(&match->notify_lock);
> > > +
> > > +       /*
> > > +        * Here it's possible we got a signal and then had to wait on the mutex
> > > +        * while the reply was sent, so let's be sure there wasn't a response
> > > +        * in the meantime.
> > > +        */
> > > +       if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
> > > +               /*
> > > +                * We got a signal. Let's tell userspace about it (potentially
> > > +                * again, if we had already notified them about the first one).
> > > +                */
> > > +               n.signaled = true;
> > > +               if (n.state == SECCOMP_NOTIFY_SENT) {
> > > +                       n.state = SECCOMP_NOTIFY_INIT;
> > > +                       up(&match->notif->request);
> > > +               }
> > > +               mutex_unlock(&match->notify_lock);
> > > +               err = wait_for_completion_killable(&n.ready);
> > > +               mutex_lock(&match->notify_lock);
> > > +               if (err < 0)
> > > +                       goto remove_list;
> > > +       }
> > > +
> > > +       ret = n.val;
> > > +       err = n.error;
> > > +
> > > +remove_list:
> > > +       list_del(&n.list);
> > > +out:
> > > +       mutex_unlock(&match->notify_lock);
> > > +       syscall_set_return_value(current, task_pt_regs(current),
> > > +                                err, ret);
> > > +}
> > > +
> > >  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> > >                             const bool recheck_after_trace)
> > >  {
> > > @@ -728,6 +869,9 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> > >
> > >                 return 0;
> > >
> > > +       case SECCOMP_RET_USER_NOTIF:
> > > +               seccomp_do_user_notification(this_syscall, match, sd);
> > > +               goto skip;
> > 
> > Nit: please add a blank line here (to match the other cases).
> > 
> > >         case SECCOMP_RET_LOG:
> > >                 seccomp_log(this_syscall, 0, action, true);
> > >                 return 0;
> > > @@ -834,6 +978,9 @@ static long seccomp_set_mode_strict(void)
> > >  }
> > >
> > >  #ifdef CONFIG_SECCOMP_FILTER
> > > +static struct file *init_listener(struct task_struct *,
> > > +                                 struct seccomp_filter *);
> > 
> > Why is the forward declaration needed instead of just moving the
> > function here? I didn't see anything in it that looked like it
> > couldn't move.
> 
> I think there was a cycle in some earlier version, but I agree there
> isn't now. I'll fix it.
> 
> > > +
> > >  /**
> > >   * seccomp_set_mode_filter: internal function for setting seccomp filter
> > >   * @flags:  flags to change filter behavior
> > > @@ -853,6 +1000,8 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
> > >         struct seccomp_filter *prepared = NULL;
> > >         long ret = -EINVAL;
> > > +       int listener = 0;
> > 
> > Nit: "invalid fd" should be -1, not 0.
> > 
> > > +       struct file *listener_f = NULL;
> > >
> > >         /* Validate flags. */
> > >         if (flags & ~SECCOMP_FILTER_FLAG_MASK)
> > > @@ -863,13 +1012,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         if (IS_ERR(prepared))
> > >                 return PTR_ERR(prepared);
> > >
> > > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > +               listener = get_unused_fd_flags(0);
> > 
> > As with the other place pointed out by Jann, this should maybe be O_CLOEXEC too?
> 
> Yep, will do.
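
Putting both nits together, the listener setup could look roughly like the fragment
below; this is only a sketch of the fixups agreed in this thread, not the committed
code.

	int listener = -1;			/* invalid fd is -1, per the nit above */
	struct file *listener_f = NULL;

	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
		listener = get_unused_fd_flags(O_CLOEXEC);	/* was 0 */
		if (listener < 0) {
			ret = listener;
			goto out_free;
		}

		listener_f = init_listener(current, prepared);
		if (IS_ERR(listener_f)) {
			put_unused_fd(listener);
			ret = PTR_ERR(listener_f);
			goto out_free;
		}
	}
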
> 
> > > +               if (listener < 0) {
> > > +                       ret = listener;
> > > +                       goto out_free;
> > > +               }
> > > +
> > > +               listener_f = init_listener(current, prepared);
> > > +               if (IS_ERR(listener_f)) {
> > > +                       put_unused_fd(listener);
> > > +                       ret = PTR_ERR(listener_f);
> > > +                       goto out_free;
> > > +               }
> > > +       }
> > > +
> > >         /*
> > >          * Make sure we cannot change seccomp or nnp state via TSYNC
> > >          * while another thread is in the middle of calling exec.
> > >          */
> > >         if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
> > >             mutex_lock_killable(&current->signal->cred_guard_mutex))
> > > -               goto out_free;
> > > +               goto out_put_fd;
> > >
> > >         spin_lock_irq(&current->sighand->siglock);
> > >
> > > @@ -887,6 +1051,16 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         spin_unlock_irq(&current->sighand->siglock);
> > >         if (flags & SECCOMP_FILTER_FLAG_TSYNC)
> > >                 mutex_unlock(&current->signal->cred_guard_mutex);
> > > +out_put_fd:
> > > +       if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
> > > +               if (ret < 0) {
> > > +                       fput(listener_f);
> > > +                       put_unused_fd(listener);
> > > +               } else {
> > > +                       fd_install(listener, listener_f);
> > > +                       ret = listener;
> > > +               }
> > > +       }
> > 
> > Can you update the kern-docs for seccomp_set_mode_filter(), since we
> > can now return positive values?
> > 
> >  * Returns 0 on success or -EINVAL on failure.
> > 
> > (this shoudln't say only -EINVAL, I realize too)
> 
> Sure, I can fix both of these.
> 
> > I have to say, I'm vaguely nervous about changing the semantics here
> > for passing back the fd as the return code from the seccomp() syscall.
> > Alternatives seem less appealing, though: changing the meaning of the
> > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > example. Hmm.
> 
> From my perspective we can drop this whole thing. The only thing I'll
> ever use is the ptrace version. Someone at some point (I don't
> remember who, maybe stgraber) suggested this version would be useful
> as well.

So I think we want to have the ability to get an fd via seccomp().
Especially, if all we worry about are weird semantics. When we
discussed this we knew the whole patchset was going to be weird. :)

This is a seccomp feature so seccomp should - if feasible - equip you
with everything to use it in a meaningful way without having to go
through a different kernel api. I know ptrace and seccomp are
already connected but I still find this cleaner. :)

Another thing is that the container itself might be traced for some
reason while you still might want to get an fd out.

Also, I wonder what happens if you want to filter the ptrace() syscall
itself? Then you'd deadlock?

Also, it seems that getting an fd via ptrace requires CAP_SYS_ADMIN in
the initial user namespace (which I just realized now) whereas getting
the fd via seccomp() doesn't seem to.
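
For context, the seccomp()-only path being argued for here would look roughly like the
sketch below from userspace. The flag value and the fd-as-return-value semantics are
the ones proposed in this series (SECCOMP_FILTER_FLAG_NEW_LISTENER is assumed to come
from the series' headers), so nothing here is a settled API.

#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Install @prog and get the notification listener fd back directly from
 * seccomp(2), with no ptrace() involvement.
 */
static int install_notify_filter(struct sock_fprog *prog)
{
	int listener;

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;

	listener = syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
			   SECCOMP_FILTER_FLAG_NEW_LISTENER, prog);
	if (listener < 0)
		return -1;

	return listener;	/* use with the SECCOMP_NOTIF_* ioctls */
}
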

> 
> Anyway, let me know if your nervousness outweighs this, I'm happy to
> drop it.
> 
> > > @@ -1342,3 +1520,259 @@ static int __init seccomp_sysctl_init(void)
> > >  device_initcall(seccomp_sysctl_init)
> > >
> > >  #endif /* CONFIG_SYSCTL */
> > > +
> > > +#ifdef CONFIG_SECCOMP_FILTER
> > > +static int seccomp_notify_release(struct inode *inode, struct file *file)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +       struct seccomp_knotif *knotif;
> > > +
> > > +       mutex_lock(&filter->notify_lock);
> > > +
> > > +       /*
> > > +        * If this file is being closed because e.g. the task who owned it
> > > +        * died, let's wake everyone up who was waiting on us.
> > > +        */
> > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > +               if (knotif->state == SECCOMP_NOTIFY_REPLIED)
> > > +                       continue;
> > > +
> > > +               knotif->state = SECCOMP_NOTIFY_REPLIED;
> > > +               knotif->error = -ENOSYS;
> > > +               knotif->val = 0;
> > > +
> > > +               complete(&knotif->ready);
> > > +       }
> > > +
> > > +       wake_up_all(&filter->notif->wqh);
> > > +       kfree(filter->notif);
> > > +       filter->notif = NULL;
> > > +       mutex_unlock(&filter->notify_lock);
> > 
> > It looks like that means nothing waiting on knotif->ready can access
> > filter->notif without rechecking it, yes?
> > 
> > e.g. in seccomp_do_user_notification() I see:
> > 
> >                         up(&match->notif->request);
> > 
> > I *think* this isn't reachable due to the test for n.state !=
> > SECCOMP_NOTIFY_REPLIED, though. Perhaps, just for sanity and because
> > it's not fast-path, we could add a WARN_ON() while checking for
> > unreplied signal death?
> > 
> >                 n.signaled = true;
> >                 if (n.state == SECCOMP_NOTIFY_SENT) {
> >                         n.state = SECCOMP_NOTIFY_INIT;
> >                         if (!WARN_ON(match->notif))
> >                             up(&match->notif->request);
> >                 }
> >                 mutex_unlock(&match->notify_lock);
> 
> So this code path should actually be safe, since notify_lock is held
> throughout, as it is in the release handler. However, there is one just above
> it that is not, because we do:
> 
>         mutex_unlock(&match->notify_lock);
>         up(&match->notif->request);
> 
> When this was all a member of struct seccomp_filter the order didn't matter,
> but now it very much does, and I think you're right that these statements need
> to be reordered. There may be others; I'll check everything else as well.
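
Concretely, the reordering being discussed is just swapping the two statements so the
semaphore is kicked while notify_lock still pins match->notif; a sketch, not the final
patch:

	/*
	 * Wake the listener before dropping the lock, so a concurrent
	 * release of the listener fd cannot free match->notif between
	 * the unlock and the up().
	 */
	up(&match->notif->request);
	mutex_unlock(&match->notify_lock);
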
> 
> > 
> > > +       __put_seccomp_filter(filter);
> > > +       return 0;
> > > +}
> > > +
> > > +static long seccomp_notify_recv(struct seccomp_filter *filter,
> > > +                               unsigned long arg)
> > > +{
> > > +       struct seccomp_knotif *knotif = NULL, *cur;
> > > +       struct seccomp_notif unotif = {};
> > > +       ssize_t ret;
> > > +       u16 size;
> > > +       void __user *buf = (void __user *)arg;
> > 
> > I'd prefer this casting happen in seccomp_notify_ioctl(). This keeps
> > anything from accidentally using "arg" directly here.
> 
> Will do.
> 
> > > +
> > > +       if (copy_from_user(&size, buf, sizeof(size)))
> > > +               return -EFAULT;
> > > +
> > > +       ret = down_interruptible(&filter->notif->request);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       mutex_lock(&filter->notify_lock);
> > > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > > +               if (cur->state == SECCOMP_NOTIFY_INIT) {
> > > +                       knotif = cur;
> > > +                       break;
> > > +               }
> > > +       }
> > > +
> > > +       /*
> > > +        * If we didn't find a notification, it could be that the task was
> > > +        * interrupted between the time we were woken and when we were able to
> > > +        * acquire the rw lock.
> > > +        */
> > > +       if (!knotif) {
> > > +               ret = -ENOENT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       size = min_t(size_t, size, sizeof(unotif));
> > > +
> > 
> > It is possible (though unlikely given the type widths involved here)
> > for unotif = {} to not initialize padding, so I would recommend an
> > explicit memset(&unotif, 0, sizeof(unotif)) here.
> 
> Orly? I didn't know that, thanks.
> 
> > > +       unotif.len = size;
> > > +       unotif.id = knotif->id;
> > > +       unotif.pid = task_pid_vnr(knotif->task);
> > > +       unotif.signaled = knotif->signaled;
> > > +       unotif.data = *(knotif->data);
> > > +
> > > +       if (copy_to_user(buf, &unotif, size)) {
> > > +               ret = -EFAULT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       ret = size;
> > > +       knotif->state = SECCOMP_NOTIFY_SENT;
> > > +       wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
> > > +
> > > +
> > > +out:
> > > +       mutex_unlock(&filter->notify_lock);
> > 
> > Is there some way to rearrange the locking here to avoid holding the
> > mutex while doing copy_to_user() (which userspace could block with
> > userfaultfd, and then stall all the other notifications for this
> > filter)?
> 
> Yes, I don't think it'll cause any problems to release the lock earlier.
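
A sketch of what releasing the lock earlier might look like: snapshot the notification
and update its state under notify_lock, then do the copy_to_user() unlocked. If the
copy fails, a real version would also need to put the notification back into the INIT
state; that detail is omitted here.

	unotif.len = size;
	unotif.id = knotif->id;
	unotif.pid = task_pid_vnr(knotif->task);
	unotif.signaled = knotif->signaled;
	unotif.data = *(knotif->data);

	knotif->state = SECCOMP_NOTIFY_SENT;
	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
	mutex_unlock(&filter->notify_lock);

	/*
	 * Userspace can block this copy (e.g. via userfaultfd) without
	 * stalling other notifications, since the lock is no longer held.
	 */
	if (copy_to_user(buf, &unotif, size))
		return -EFAULT;

	return size;
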
> 
> > > +       return ret;
> > > +}
> > > +
> > > +static long seccomp_notify_send(struct seccomp_filter *filter,
> > > +                               unsigned long arg)
> > > +{
> > > +       struct seccomp_notif_resp resp = {};
> > > +       struct seccomp_knotif *knotif = NULL;
> > > +       long ret;
> > > +       u16 size;
> > > +       void __user *buf = (void __user *)arg;
> > 
> > Same cast note as above.
> > 
> > > +
> > > +       if (copy_from_user(&size, buf, sizeof(size)))
> > > +               return -EFAULT;
> > > +       size = min_t(size_t, size, sizeof(resp));
> > > +       if (copy_from_user(&resp, buf, size))
> > > +               return -EFAULT;
> > 
> > For sanity checking on a double-read from userspace, please add:
> > 
> >     if (resp.len != size)
> >         return -EINVAL;
> 
> Won't that fail if sizeof(resp) < resp.len, because of the min_t()?
> 
> > > +static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> > > +                                   unsigned long arg)
> > > +{
> > > +       struct seccomp_knotif *knotif = NULL;
> > > +       void __user *buf = (void __user *)arg;
> > > +       u64 id;
> > > +       long ret;
> > > +
> > > +       if (copy_from_user(&id, buf, sizeof(id)))
> > > +               return -EFAULT;
> > > +
> > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       ret = -1;
> > 
> > Isn't this EPERM? Shouldn't it be -ENOENT?
> 
> Yes, I wasn't thinking of errno here, I'll switch it.
> 
> > > +       list_for_each_entry(knotif, &filter->notif->notifications, list) {
> > > +               if (knotif->id == id) {
> > > +                       ret = 0;
> > > +                       goto out;
> > > +               }
> > > +       }
> > > +
> > > +out:
> > > +       mutex_unlock(&filter->notify_lock);
> > > +       return ret;
> > > +}
> > > +
> > > +static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> > > +                                unsigned long arg)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +
> > > +       switch (cmd) {
> > > +       case SECCOMP_NOTIF_RECV:
> > > +               return seccomp_notify_recv(filter, arg);
> > > +       case SECCOMP_NOTIF_SEND:
> > > +               return seccomp_notify_send(filter, arg);
> > > +       case SECCOMP_NOTIF_ID_VALID:
> > > +               return seccomp_notify_id_valid(filter, arg);
> > > +       default:
> > > +               return -EINVAL;
> > > +       }
> > > +}
> > > +
> > > +static __poll_t seccomp_notify_poll(struct file *file,
> > > +                                   struct poll_table_struct *poll_tab)
> > > +{
> > > +       struct seccomp_filter *filter = file->private_data;
> > > +       __poll_t ret = 0;
> > > +       struct seccomp_knotif *cur;
> > > +
> > > +       poll_wait(file, &filter->notif->wqh, poll_tab);
> > > +
> > > +       ret = mutex_lock_interruptible(&filter->notify_lock);
> > > +       if (ret < 0)
> > > +               return ret;
> > > +
> > > +       list_for_each_entry(cur, &filter->notif->notifications, list) {
> > > +               if (cur->state == SECCOMP_NOTIFY_INIT)
> > > +                       ret |= EPOLLIN | EPOLLRDNORM;
> > > +               if (cur->state == SECCOMP_NOTIFY_SENT)
> > > +                       ret |= EPOLLOUT | EPOLLWRNORM;
> > > +               if (ret & EPOLLIN && ret & EPOLLOUT)
> > 
> > My eyes! :) Can you wrap the bit operations in parens here?
> > 
> > > +                       break;
> > > +       }
> > 
> > Should POLLERR be handled here too? I don't quite see the conditions
> > that might be exposed? All the processes die for the filter, which
> > does what here?
> 
> I think it shouldn't do anything, because I was thinking of the semantics of
> poll() as "when a tracee does a syscall that matches, fire". So a task could
> start, never make a targeted syscall, and exit, and poll() shouldn't return a
> value. Maybe it's useful to write that down somewhere, though.
> 
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
> > > +
> > > +       EXPECT_EQ(kill(pid, SIGKILL), 0);
> > > +       EXPECT_EQ(waitpid(pid, NULL, 0), pid);
> > > +
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
> > 
> > Please document SECCOMP_NOTIF_ID_VALID in seccomp_filter.rst. I had
> > been wondering what it's for, and now I see it's kind of an advisory
> > "is the other end still alive?" test.
> 
> Yes, in fact it's necessary for avoiding races. There's some comments in the
> sample code, but I'll update seccomp_filter.rst too.
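
For readers following along, a minimal handler loop against the v7 interface might
look like the sketch below. The struct fields and ioctl names are the ones in this
revision (and so may change), the definitions are assumed to come from the series'
headers, and error handling is trimmed down to the ID_VALID race check being
discussed.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>	/* assumes the v7 struct/ioctl definitions */

static int handle_one_notification(int listener)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	memset(&req, 0, sizeof(req));
	req.len = sizeof(req);
	if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) < 0)
		return -1;

	/*
	 * Read whatever task memory the policy needs *before* deciding,
	 * then confirm the notification still refers to a live syscall;
	 * this is the race SECCOMP_NOTIF_ID_VALID exists to close.
	 */
	if (ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id) != 0)
		return -1;

	memset(&resp, 0, sizeof(resp));
	resp.len = sizeof(resp);
	resp.id = req.id;
	resp.error = 0;			/* emulate success ... */
	resp.val = 0;			/* ... with this return value */

	return ioctl(listener, SECCOMP_NOTIF_SEND, &resp) < 0 ? -1 : 0;
}
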
> 
> > > +
> > > +       resp.id = req.id;
> > > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > > +       EXPECT_EQ(ret, -1);
> > > +       EXPECT_EQ(errno, ENOENT);
> > > +
> > > +       /*
> > > +        * Check that we get another notification about a signal in the middle
> > > +        * of a syscall.
> > > +        */
> > > +       pid = fork();
> > > +       ASSERT_GE(pid, 0);
> > > +
> > > +       if (pid == 0) {
> > > +               if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
> > > +                       perror("signal");
> > > +                       exit(1);
> > > +               }
> > > +               ret = syscall(__NR_getpid);
> > > +               exit(ret != USER_NOTIF_MAGIC);
> > > +       }
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       EXPECT_EQ(ret, sizeof(req));
> > > +       EXPECT_EQ(errno, 0);
> > > +
> > > +       EXPECT_EQ(kill(pid, SIGUSR1), 0);
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       EXPECT_EQ(req.signaled, 1);
> > > +       EXPECT_EQ(ret, sizeof(req));
> > > +       EXPECT_EQ(errno, 0);
> > > +
> > > +       resp.len = sizeof(resp);
> > > +       resp.id = req.id;
> > > +       resp.error = -512; /* -ERESTARTSYS */
> > > +       resp.val = 0;
> > > +
> > > +       EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
> > > +
> > > +       ret = read_notif(listener, &req);
> > > +       resp.len = sizeof(resp);
> > > +       resp.id = req.id;
> > > +       resp.error = 0;
> > > +       resp.val = USER_NOTIF_MAGIC;
> > > +       ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
> > 
> > I was slightly confused here: why have there been 3 reads? I was
> > expecting one notification for hitting getpid and one from catching a
> > signal. But in rereading, I see that NOTIF_RECV will return the most
> > recently unresponded notification, yes?
> 
> The three reads are:
> 
> 1. original syscall
> # send SIGUSR1
> 2. another notif with signaled is set
> # respond with -ERESTARTSYS to make sure that works
> 3. this is the result of -ERESTARTSYS
> 
> > But... catching a signal replaces the existing seccomp_knotif? I
> > remain confused about how signal handling is meant to work here. What
> > happens if two signals get sent? It looks like you just block without
> > allowing more signals? (Thank you for writing the tests!)
> 
> Yes, that's the idea. This is an implementation of Andy's pseudocode:
> https://lkml.org/lkml/2018/3/15/1122
> 
> > (And can you document the expected behavior in the seccomp_filter.rst too?)
> 
> Will do.
> 
> > 
> > Looking good!
> 
> Thanks for your review!
> 
> Tycho
Tycho Andersen Oct. 9, 2018, 2:28 p.m. UTC | #12
On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > I have to say, I'm vaguely nervous about changing the semantics here
> > > for passing back the fd as the return code from the seccomp() syscall.
> > > Alternatives seem less appealing, though: changing the meaning of the
> > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > example. Hmm.
> > 
> > From my perspective we can drop this whole thing. The only thing I'll
> > ever use is the ptrace version. Someone at some point (I don't
> > remember who, maybe stgraber) suggested this version would be useful
> > as well.
> 
> So I think we want to have the ability to get an fd via seccomp().
> Especially, if we all we worry about are weird semantics. When we
> discussed this we knew the whole patchset was going to be weird. :)
> 
> This is a seccomp feature so seccomp should - if feasible - equip you
> with everything to use it in a meaningful way without having to go
> through a different kernel api. I know ptrace and seccomp are
> already connected but I still find this cleaner. :)
> 
> Another thing is that the container itself might be traced for some
> reason while you still might want to get an fd out.

Sure, I don't see the problem here.

> Also, I wonder what happens if you want to filter the ptrace() syscall
> itself? Then you'd deadlock?

No, are you confusing the tracee with the tracer here? Filtering
ptrace() will happen just like any other syscall... what would you
deadlock with?

> Also, it seems that getting an fd via ptrace requires CAP_SYS_ADMIN in
> the initial user namespace (which I just realized now) whereas getting
> the fd via seccomp() doesn't seem to.

Yep, I'll leave this discussion to the other thread.

Tycho
Christian Brauner Oct. 9, 2018, 4:24 p.m. UTC | #13
On Tue, Oct 09, 2018 at 07:28:33AM -0700, Tycho Andersen wrote:
> On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> > On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > > I have to say, I'm vaguely nervous about changing the semantics here
> > > > for passing back the fd as the return code from the seccomp() syscall.
> > > > Alternatives seem less appealing, though: changing the meaning of the
> > > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > > example. Hmm.
> > > 
> > > From my perspective we can drop this whole thing. The only thing I'll
> > > ever use is the ptrace version. Someone at some point (I don't
> > > remember who, maybe stgraber) suggested this version would be useful
> > > as well.
> > 
> > So I think we want to have the ability to get an fd via seccomp().
> > Especially, if all we worry about are weird semantics. When we
> > discussed this we knew the whole patchset was going to be weird. :)
> > 
> > This is a seccomp feature so seccomp should - if feasible - equip you
> > with everything to use it in a meaningful way without having to go
> > through a different kernel api. I know ptrace and seccomp are
> > already connected but I still find this cleaner. :)
> > 
> > Another thing is that the container itself might be traced for some
> > reason while you still might want to get an fd out.
> 
> Sure, I don't see the problem here.

How'd you do PTRACE_ATTACH in that case?

Anyway, the whole point is, as we've discussed in the other thread, we
really want a one-syscall-only, purely-seccomp() based way of getting
the fd. There are multiple options to get the fd even when you block
sendmsg()/socket()/whatever, and there's no good reason to only be able
to get the fd via a three-syscall-ptrace dance. :)
Tycho Andersen Oct. 9, 2018, 4:29 p.m. UTC | #14
On Tue, Oct 09, 2018 at 06:24:14PM +0200, Christian Brauner wrote:
> On Tue, Oct 09, 2018 at 07:28:33AM -0700, Tycho Andersen wrote:
> > On Mon, Oct 08, 2018 at 04:58:05PM +0200, Christian Brauner wrote:
> > > On Thu, Sep 27, 2018 at 04:48:39PM -0600, Tycho Andersen wrote:
> > > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > > > > I have to say, I'm vaguely nervous about changing the semantics here
> > > > > for passing back the fd as the return code from the seccomp() syscall.
> > > > > Alternatives seem less appealing, though: changing the meaning of the
> > > > > uargs parameter when SECCOMP_FILTER_FLAG_NEW_LISTENER is set, for
> > > > > example. Hmm.
> > > > 
> > > > From my perspective we can drop this whole thing. The only thing I'll
> > > > ever use is the ptrace version. Someone at some point (I don't
> > > > remember who, maybe stgraber) suggested this version would be useful
> > > > as well.
> > > 
> > > So I think we want to have the ability to get an fd via seccomp().
> > > Especially, if all we worry about are weird semantics. When we
> > > discussed this we knew the whole patchset was going to be weird. :)
> > > 
> > > This is a seccomp feature so seccomp should - if feasible - equip you
> > > with everything to use it in a meaningful way without having to go
> > > through a different kernel api. I know ptrace and seccomp are
> > > already connected but I still find this cleaner. :)
> > > 
> > > Another thing is that the container itself might be traced for some
> > > reason while you still might want to get an fd out.
> > 
> > Sure, I don't see the problem here.
> 
> How'd you do PTRACE_ATTACH in that case?

Oh, you mean if someone has *ptrace*'d the task, and a third party
wants to get a seccomp fd? I think "too bad" is the answer; I don't
really mind not supporting this case.

> Anyway, the whole point is, as we've discussed in the other thread, we
> really want a one-syscall-only, purely-seccomp() based way of getting
> the fd. There are multiple options to get the fd even when you block
> sendmsg()/socket()/whatever, and there's no good reason to only be able
> to get the fd via a three-syscall-ptrace dance. :)

Ok, I'll leave these bits in then for v8.

Tycho
Tycho Andersen Oct. 17, 2018, 8:29 p.m. UTC | #15
On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >         __u64 args[6];
> >  };
> >
> > +struct seccomp_notif {
> > +       __u16 len;
> > +       __u64 id;
> > +       __u32 pid;
> > +       __u8 signaled;
> > +       struct seccomp_data data;
> > +};
> > +
> > +struct seccomp_notif_resp {
> > +       __u16 len;
> > +       __u64 id;
> > +       __s32 error;
> > +       __s64 val;
> > +};
> 
> So, len has to come first, for versioning. However, since it's ahead
> of a u64, this leaves a struct padding hole. pahole output:
> 
> struct seccomp_notif {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __u32                      pid;                  /*    16     4 */
>         __u8                       signaled;             /*    20     1 */
> 
>         /* XXX 3 bytes hole, try to pack */
> 
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> 
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 79, holes: 2, sum holes: 9 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u16                      len;                  /*     0     2 */
> 
>         /* XXX 6 bytes hole, try to pack */
> 
>         __u64                      id;                   /*     8     8 */
>         __s32                      error;                /*    16     4 */
> 
>         /* XXX 4 bytes hole, try to pack */
> 
>         __s64                      val;                  /*    24     8 */
> 
>         /* size: 32, cachelines: 1, members: 4 */
>         /* sum members: 22, holes: 2, sum holes: 10 */
>         /* last cacheline: 32 bytes */
> };
> 
> How about making len u32, and moving pid and error above "id"? This
> leaves a hole after signaled, so changing "len" won't be sufficient
> for versioning here. Perhaps move it after data?

Just to confirm my understanding; I've got these as:

struct seccomp_notif {
	__u32                      len;                  /*     0     4 */
	__u32                      pid;                  /*     4     4 */
	__u64                      id;                   /*     8     8 */
	__u8                       signaled;             /*    16     1 */

	/* XXX 7 bytes hole, try to pack */

	struct seccomp_data        data;                 /*    24    64 */
	/* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */

	/* size: 88, cachelines: 2, members: 5 */
	/* sum members: 81, holes: 1, sum holes: 7 */
	/* last cacheline: 24 bytes */
};
struct seccomp_notif_resp {
	__u32                      len;                  /*     0     4 */
	__s32                      error;                /*     4     4 */
	__u64                      id;                   /*     8     8 */
	__s64                      val;                  /*    16     8 */

	/* size: 24, cachelines: 1, members: 4 */
	/* last cacheline: 24 bytes */
};

in the next version. Since the structure has no padding at the end of
it, I think the Right Thing will happen. Note that this is slightly
different than what Kees suggested, if I add signaled after data, then
I end up with:

struct seccomp_notif {
	__u32                      len;                  /*     0     4 */
	__u32                      pid;                  /*     4     4 */
	__u64                      id;                   /*     8     8 */
	struct seccomp_data        data;                 /*    16    64 */
	/* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
	__u8                       signaled;             /*    80     1 */

	/* size: 88, cachelines: 2, members: 5 */
	/* padding: 7 */
	/* last cacheline: 24 bytes */
};

which I think will have the versioning problem if the next member
introduced is < 7 bytes.

Tycho
Kees Cook Oct. 17, 2018, 10:21 p.m. UTC | #16
On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
>> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
>> > @@ -60,4 +62,29 @@ struct seccomp_data {
>> >         __u64 args[6];
>> >  };
>> >
>> > +struct seccomp_notif {
>> > +       __u16 len;
>> > +       __u64 id;
>> > +       __u32 pid;
>> > +       __u8 signaled;
>> > +       struct seccomp_data data;
>> > +};
>> > +
>> > +struct seccomp_notif_resp {
>> > +       __u16 len;
>> > +       __u64 id;
>> > +       __s32 error;
>> > +       __s64 val;
>> > +};
>>
>> So, len has to come first, for versioning. However, since it's ahead
>> of a u64, this leaves a struct padding hole. pahole output:
>>
>> struct seccomp_notif {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __u32                      pid;                  /*    16     4 */
>>         __u8                       signaled;             /*    20     1 */
>>
>>         /* XXX 3 bytes hole, try to pack */
>>
>>         struct seccomp_data        data;                 /*    24    64 */
>>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>>
>>         /* size: 88, cachelines: 2, members: 5 */
>>         /* sum members: 79, holes: 2, sum holes: 9 */
>>         /* last cacheline: 24 bytes */
>> };
>> struct seccomp_notif_resp {
>>         __u16                      len;                  /*     0     2 */
>>
>>         /* XXX 6 bytes hole, try to pack */
>>
>>         __u64                      id;                   /*     8     8 */
>>         __s32                      error;                /*    16     4 */
>>
>>         /* XXX 4 bytes hole, try to pack */
>>
>>         __s64                      val;                  /*    24     8 */
>>
>>         /* size: 32, cachelines: 1, members: 4 */
>>         /* sum members: 22, holes: 2, sum holes: 10 */
>>         /* last cacheline: 32 bytes */
>> };
>>
>> How about making len u32, and moving pid and error above "id"? This
>> leaves a hole after signaled, so changing "len" won't be sufficient
>> for versioning here. Perhaps move it after data?
>
> Just to confirm my understanding; I've got these as:
>
> struct seccomp_notif {
>         __u32                      len;                  /*     0     4 */
>         __u32                      pid;                  /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         __u8                       signaled;             /*    16     1 */
>
>         /* XXX 7 bytes hole, try to pack */
>
>         struct seccomp_data        data;                 /*    24    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
>
>         /* size: 88, cachelines: 2, members: 5 */
>         /* sum members: 81, holes: 1, sum holes: 7 */
>         /* last cacheline: 24 bytes */
> };
> struct seccomp_notif_resp {
>         __u32                      len;                  /*     0     4 */
>         __s32                      error;                /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         __s64                      val;                  /*    16     8 */
>
>         /* size: 24, cachelines: 1, members: 4 */
>         /* last cacheline: 24 bytes */
> };
>
> in the next version. Since the structure has no padding at the end of
> it, I think the Right Thing will happen. Note that this is slightly
> different than what Kees suggested, if I add signaled after data, then
> I end up with:
>
> struct seccomp_notif {
>         __u32                      len;                  /*     0     4 */
>         __u32                      pid;                  /*     4     4 */
>         __u64                      id;                   /*     8     8 */
>         struct seccomp_data        data;                 /*    16    64 */
>         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
>         __u8                       signaled;             /*    80     1 */
>
>         /* size: 88, cachelines: 2, members: 5 */
>         /* padding: 7 */
>         /* last cacheline: 24 bytes */
> };
>
> which I think will have the versioning problem if the next member
> > introduced is < 7 bytes.

It'll be a problem in either place. What I was thinking was that
specific versioning is required instead of just length.

-Kees
Tycho Andersen Oct. 17, 2018, 10:33 p.m. UTC | #17
On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >> >         __u64 args[6];
> >> >  };
> >> >
> >> > +struct seccomp_notif {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __u32 pid;
> >> > +       __u8 signaled;
> >> > +       struct seccomp_data data;
> >> > +};
> >> > +
> >> > +struct seccomp_notif_resp {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __s32 error;
> >> > +       __s64 val;
> >> > +};
> >>
> >> So, len has to come first, for versioning. However, since it's ahead
> >> of a u64, this leaves a struct padding hole. pahole output:
> >>
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > Just to confirm my understanding; I've got these as:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __u8                       signaled;             /*    16     1 */
> >
> >         /* XXX 7 bytes hole, try to pack */
> >
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 81, holes: 1, sum holes: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u32                      len;                  /*     0     4 */
> >         __s32                      error;                /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __s64                      val;                  /*    16     8 */
> >
> >         /* size: 24, cachelines: 1, members: 4 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > in the next version. Since the structure has no padding at the end of
> > it, I think the Right Thing will happen. Note that this is slightly
> > different than what Kees suggested, if I add signaled after data, then
> > I end up with:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         struct seccomp_data        data;                 /*    16    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> >         __u8                       signaled;             /*    80     1 */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* padding: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > which I think will have the versioning problem if the next member
> > > introduced is < 7 bytes.
> 
> It'll be a problem in either place. What I was thinking was that
> specific versioning is required instead of just length.

Oh, if we decide to use the padded space? Yes, that makes sense. Ok,
I'll switch it to a version.

Tycho
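
To illustrate the direction agreed on here, a version-keyed layout might look
something like the sketch below (hypothetical field names; nothing here is the
committed UAPI). Padding can later be reclaimed unambiguously because readers branch
on the version rather than on how many bytes happen to be present.

#include <linux/types.h>
#include <linux/seccomp.h>	/* for struct seccomp_data */

struct seccomp_notif_sketch {
	__u32 version;		/* bumped whenever the layout changes */
	__u32 pid;
	__u64 id;
	__u8  signaled;
	__u8  pad[7];		/* reserved; must be zero in this version */
	struct seccomp_data data;
};
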
Tycho Andersen Oct. 21, 2018, 4:04 p.m. UTC | #18
On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> >> > @@ -60,4 +62,29 @@ struct seccomp_data {
> >> >         __u64 args[6];
> >> >  };
> >> >
> >> > +struct seccomp_notif {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __u32 pid;
> >> > +       __u8 signaled;
> >> > +       struct seccomp_data data;
> >> > +};
> >> > +
> >> > +struct seccomp_notif_resp {
> >> > +       __u16 len;
> >> > +       __u64 id;
> >> > +       __s32 error;
> >> > +       __s64 val;
> >> > +};
> >>
> >> So, len has to come first, for versioning. However, since it's ahead
> >> of a u64, this leaves a struct padding hole. pahole output:
> >>
> >> struct seccomp_notif {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __u32                      pid;                  /*    16     4 */
> >>         __u8                       signaled;             /*    20     1 */
> >>
> >>         /* XXX 3 bytes hole, try to pack */
> >>
> >>         struct seccomp_data        data;                 /*    24    64 */
> >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >>
> >>         /* size: 88, cachelines: 2, members: 5 */
> >>         /* sum members: 79, holes: 2, sum holes: 9 */
> >>         /* last cacheline: 24 bytes */
> >> };
> >> struct seccomp_notif_resp {
> >>         __u16                      len;                  /*     0     2 */
> >>
> >>         /* XXX 6 bytes hole, try to pack */
> >>
> >>         __u64                      id;                   /*     8     8 */
> >>         __s32                      error;                /*    16     4 */
> >>
> >>         /* XXX 4 bytes hole, try to pack */
> >>
> >>         __s64                      val;                  /*    24     8 */
> >>
> >>         /* size: 32, cachelines: 1, members: 4 */
> >>         /* sum members: 22, holes: 2, sum holes: 10 */
> >>         /* last cacheline: 32 bytes */
> >> };
> >>
> >> How about making len u32, and moving pid and error above "id"? This
> >> leaves a hole after signaled, so changing "len" won't be sufficient
> >> for versioning here. Perhaps move it after data?
> >
> > Just to confirm my understanding; I've got these as:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __u8                       signaled;             /*    16     1 */
> >
> >         /* XXX 7 bytes hole, try to pack */
> >
> >         struct seccomp_data        data;                 /*    24    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* sum members: 81, holes: 1, sum holes: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> > struct seccomp_notif_resp {
> >         __u32                      len;                  /*     0     4 */
> >         __s32                      error;                /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         __s64                      val;                  /*    16     8 */
> >
> >         /* size: 24, cachelines: 1, members: 4 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > in the next version. Since the structure has no padding at the end of
> > it, I think the Right Thing will happen. Note that this is slightly
> > different than what Kees suggested, if I add signaled after data, then
> > I end up with:
> >
> > struct seccomp_notif {
> >         __u32                      len;                  /*     0     4 */
> >         __u32                      pid;                  /*     4     4 */
> >         __u64                      id;                   /*     8     8 */
> >         struct seccomp_data        data;                 /*    16    64 */
> >         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> >         __u8                       signaled;             /*    80     1 */
> >
> >         /* size: 88, cachelines: 2, members: 5 */
> >         /* padding: 7 */
> >         /* last cacheline: 24 bytes */
> > };
> >
> > which I think will have the versioning problem if the next member
> > introduced is < 7 bytes.
> 
> It'll be a problem in either place. What I was thinking was that
> specific versioning is required instead of just length.

Euh, so I implemented this, and it sucks :). It's ugly, and generally
feels bad.

What if we just get rid of versioning altogether, and instead
introduce a u32 flags field? We could have one flag right now
(SECCOMP_NOTIF_FLAG_SIGNALED), and introduce others as we add more
information to the response. Then we can add
SECCOMP_NOTIF_FLAG_EXTRA_FOO, and add another SECCOMP_IOCTL_GET_FOO to
grab the info?

FWIW, it's not really clear to me that we'll ever add anything to the
response since hopefully we'll land PUT_FD, so maybe this is all moot
anyway.

Tycho
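
As a rough sketch of what the flags idea might look like in the uapi header
(names and layout here are hypothetical, not what this series currently
defines):

#include <linux/types.h>
#include <linux/seccomp.h>		/* struct seccomp_data */

#define SECCOMP_NOTIF_FLAG_SIGNALED	(1UL << 0)

struct seccomp_notif_flags {
	__u64 id;
	__u32 pid;
	__u32 flags;			/* SECCOMP_NOTIF_FLAG_* bits */
	struct seccomp_data data;	/* no holes, no trailing padding */
};

New information would then be advertised with additional flag bits and fetched
via additional ioctls, rather than by growing the struct.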
Christian Brauner Oct. 22, 2018, 9:42 a.m. UTC | #19
On Sun, Oct 21, 2018 at 05:04:37PM +0100, Tycho Andersen wrote:
> On Wed, Oct 17, 2018 at 03:21:02PM -0700, Kees Cook wrote:
> > On Wed, Oct 17, 2018 at 1:29 PM, Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Thu, Sep 27, 2018 at 02:31:24PM -0700, Kees Cook wrote:
> > >> On Thu, Sep 27, 2018 at 8:11 AM, Tycho Andersen <tycho@tycho.ws> wrote:
> > >> > @@ -60,4 +62,29 @@ struct seccomp_data {
> > >> >         __u64 args[6];
> > >> >  };
> > >> >
> > >> > +struct seccomp_notif {
> > >> > +       __u16 len;
> > >> > +       __u64 id;
> > >> > +       __u32 pid;
> > >> > +       __u8 signaled;
> > >> > +       struct seccomp_data data;
> > >> > +};
> > >> > +
> > >> > +struct seccomp_notif_resp {
> > >> > +       __u16 len;
> > >> > +       __u64 id;
> > >> > +       __s32 error;
> > >> > +       __s64 val;
> > >> > +};
> > >>
> > >> So, len has to come first, for versioning. However, since it's ahead
> > >> of a u64, this leaves a struct padding hole. pahole output:
> > >>
> > >> struct seccomp_notif {
> > >>         __u16                      len;                  /*     0     2 */
> > >>
> > >>         /* XXX 6 bytes hole, try to pack */
> > >>
> > >>         __u64                      id;                   /*     8     8 */
> > >>         __u32                      pid;                  /*    16     4 */
> > >>         __u8                       signaled;             /*    20     1 */
> > >>
> > >>         /* XXX 3 bytes hole, try to pack */
> > >>
> > >>         struct seccomp_data        data;                 /*    24    64 */
> > >>         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> > >>
> > >>         /* size: 88, cachelines: 2, members: 5 */
> > >>         /* sum members: 79, holes: 2, sum holes: 9 */
> > >>         /* last cacheline: 24 bytes */
> > >> };
> > >> struct seccomp_notif_resp {
> > >>         __u16                      len;                  /*     0     2 */
> > >>
> > >>         /* XXX 6 bytes hole, try to pack */
> > >>
> > >>         __u64                      id;                   /*     8     8 */
> > >>         __s32                      error;                /*    16     4 */
> > >>
> > >>         /* XXX 4 bytes hole, try to pack */
> > >>
> > >>         __s64                      val;                  /*    24     8 */
> > >>
> > >>         /* size: 32, cachelines: 1, members: 4 */
> > >>         /* sum members: 22, holes: 2, sum holes: 10 */
> > >>         /* last cacheline: 32 bytes */
> > >> };
> > >>
> > >> How about making len u32, and moving pid and error above "id"? This
> > >> leaves a hole after signaled, so changing "len" won't be sufficient
> > >> for versioning here. Perhaps move it after data?
> > >
> > > Just to confirm my understanding; I've got these as:
> > >
> > > struct seccomp_notif {
> > >         __u32                      len;                  /*     0     4 */
> > >         __u32                      pid;                  /*     4     4 */
> > >         __u64                      id;                   /*     8     8 */
> > >         __u8                       signaled;             /*    16     1 */
> > >
> > >         /* XXX 7 bytes hole, try to pack */
> > >
> > >         struct seccomp_data        data;                 /*    24    64 */
> > >         /* --- cacheline 1 boundary (64 bytes) was 24 bytes ago --- */
> > >
> > >         /* size: 88, cachelines: 2, members: 5 */
> > >         /* sum members: 81, holes: 1, sum holes: 7 */
> > >         /* last cacheline: 24 bytes */
> > > };
> > > struct seccomp_notif_resp {
> > >         __u32                      len;                  /*     0     4 */
> > >         __s32                      error;                /*     4     4 */
> > >         __u64                      id;                   /*     8     8 */
> > >         __s64                      val;                  /*    16     8 */
> > >
> > >         /* size: 24, cachelines: 1, members: 4 */
> > >         /* last cacheline: 24 bytes */
> > > };
> > >
> > > in the next version. Since the structure has no padding at the end of
> > > it, I think the Right Thing will happen. Note that this is slightly
> > > different than what Kees suggested, if I add signaled after data, then
> > > I end up with:
> > >
> > > struct seccomp_notif {
> > >         __u32                      len;                  /*     0     4 */
> > >         __u32                      pid;                  /*     4     4 */
> > >         __u64                      id;                   /*     8     8 */
> > >         struct seccomp_data        data;                 /*    16    64 */
> > >         /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
> > >         __u8                       signaled;             /*    80     1 */
> > >
> > >         /* size: 88, cachelines: 2, members: 5 */
> > >         /* padding: 7 */
> > >         /* last cacheline: 24 bytes */
> > > };
> > >
> > > which I think will have the versioning problem if the next member
> > > introduced is < 7 bytes.
> > 
> > It'll be a problem in either place. What I was thinking was that
> > specific versioning is required instead of just length.
> 
> Euh, so I implemented this, and it sucks :). It's ugly, and generally
> feels bad.
> 
> What if we just get rid of versioning altogether, and instead
> introduce a u32 flags field? We could have one flag right now
> (SECCOMP_NOTIF_FLAG_SIGNALED), and introduce others as we add more
> information to the response. Then we can add
> SECCOMP_NOTIF_FLAG_EXTRA_FOO, and add another SECCOMP_IOCTL_GET_FOO to
> grab the info?
> 
> FWIW, it's not really clear to me that we'll ever add anything to the
> response since hopefully we'll land PUT_FD, so maybe this is all moot
> anyway.

I guess the only argument against a flag would be that you run out of
bits quickly if your interface grows (cf. mount, netlink etc.). But this
is likely not a concern here.
I actually think that the way vfs capabilities are done is pretty
nice. By accident or design they allow transparent translation between
old and new formats in-kernel. So it would be cool if we could have the same
guarantee for this interface.

Christian
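
One way to picture the in-kernel translation Christian describes is a
size-based copy-in that zero-fills whatever an older caller did not supply and
ignores whatever a newer caller supplies beyond what this kernel knows about.
This is illustrative only and not part of the patch; it simply mirrors the
min_t()/copy_from_user() pattern seccomp_notify_send() below already uses:

static int copy_resp_from_user(struct seccomp_notif_resp *resp,
			       const void __user *buf, u16 usize)
{
	memset(resp, 0, sizeof(*resp));
	if (copy_from_user(resp, buf, min_t(size_t, usize, sizeof(*resp))))
		return -EFAULT;
	return 0;
}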
diff mbox series

Patch

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 13a7c999c04a..31e9707f7e06 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -345,4 +345,5 @@  Code  Seq#(hex)	Include File		Comments
 					<mailto:raph@8d.com>
 0xF6	all	LTTng			Linux Trace Toolkit Next Generation
 					<mailto:mathieu.desnoyers@efficios.com>
+0xF7    00-1F   uapi/linux/seccomp.h
 0xFD	all	linux/dm-ioctl.h
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index 82a468bc7560..d2e61f1c0a0b 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -122,6 +122,11 @@  In precedence order, they are:
 	Results in the lower 16-bits of the return value being passed
 	to userland as the errno without executing the system call.
 
+``SECCOMP_RET_USER_NOTIF``:
+    Results in a ``struct seccomp_notif`` message sent on the userspace
+    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See the
+    discussion of userspace notifications below.
+
 ``SECCOMP_RET_TRACE``:
 	When returned, this value will cause the kernel to attempt to
 	notify a ``ptrace()``-based tracer prior to executing the system
@@ -183,6 +188,74 @@  The ``samples/seccomp/`` directory contains both an x86-specific example
 and a more generic example of a higher level macro interface for BPF
 program generation.
 
+Userspace Notification
+======================
+
+The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
+particular syscall to userspace to be handled. This may be useful for
+applications like container managers, which wish to intercept particular
+syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.
+
+There are currently two APIs to acquire a userspace notification fd for a
+particular filter. The first is at filter installation time: the task
+installing the filter can request one from the ``seccomp()`` syscall:
+
+.. code-block::
+
+    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
+
+which (on success) will return a listener fd for the filter, which can then be
+passed around via ``SCM_RIGHTS`` or similar. Alternatively, a filter fd can be
+acquired via:
+
+.. code-block::
+
+    fd = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+
+which grabs the 0th filter for a task over which the tracer has privilege.
+Note that filter fds correspond to a particular filter, and not a particular
+task. So if this task then forks, notifications from both tasks will appear on
+the same filter fd. Reads and writes to/from a filter fd are also synchronized,
+so a filter fd can safely have many readers.
+
+The interface for a seccomp notification fd consists of two structures:
+
+.. code-block::
+
+    struct seccomp_notif {
+        __u16 len;
+        __u64 id;
+        __u32 pid;
+        __u8 signaled;
+        struct seccomp_data data;
+    };
+
+    struct seccomp_notif_resp {
+        __u16 len;
+        __u64 id;
+        __s32 error;
+        __s64 val;
+    };
+
+Users can read via ``ioctl(SECCOMP_NOTIF_RECV)`` (or ``poll()``) on a seccomp
+notification fd to receive a ``struct seccomp_notif``, which contains five
+members: the input length of the structure, a unique-per-filter ``id``, the
+``pid`` of the task which triggered this request (which may be 0 if the task
+is in a pid namespace not visible from the listener's pid namespace), a flag
+indicating whether the notification is the result of a non-fatal signal, and
+the ``data`` passed to seccomp. Userspace can then decide what to do and send
+a response via ``ioctl(SECCOMP_NOTIF_SEND)``, indicating what return value the
+task that triggered the notification should see. The ``id`` member of ``struct
+seccomp_notif_resp`` must be the same ``id`` as in ``struct seccomp_notif``.
+
+It is worth noting that ``struct seccomp_data`` contains the values of register
+arguments to the syscall, but does not contain pointers to memory. The task's
+memory is accessible to suitably privileged tracers via ``ptrace()`` or
+``/proc/pid/map_files/``. However, care should be taken to avoid the TOCTOU
+mentioned above in this document: all arguments being read from the tracee's
+memory should be read into the tracer's memory before any policy decisions are
+made. This allows for an atomic decision on syscall arguments.
+
 Sysctls
 =======
 
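
For illustration, a minimal userspace handler loop for the RECV/SEND flow
documented above. This is a sketch, not part of the patch itself, and it
assumes a listener fd obtained with SECCOMP_FILTER_FLAG_NEW_LISTENER plus
headers providing the structs and ioctl numbers added by this series:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>	/* structs/ioctls added by this series */

static int handle_notifications(int listener)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	for (;;) {
		memset(&req, 0, sizeof(req));
		req.len = sizeof(req);
		if (ioctl(listener, SECCOMP_NOTIF_RECV, &req) < 0)
			return -1;

		/* Policy decision based on req.pid and req.data goes here. */

		memset(&resp, 0, sizeof(resp));
		resp.len = sizeof(resp);
		resp.id = req.id;	/* must match the request's id */
		resp.error = 0;		/* or a negative errno to fail the syscall */
		resp.val = 0;		/* the value the syscall appears to return */

		if (ioctl(listener, SECCOMP_NOTIF_SEND, &resp) < 0)
			return -1;
	}
}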
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index e5320f6c8654..017444b5efed 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -4,9 +4,10 @@ 
 
 #include <uapi/linux/seccomp.h>
 
-#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC	| \
-					 SECCOMP_FILTER_FLAG_LOG	| \
-					 SECCOMP_FILTER_FLAG_SPEC_ALLOW)
+#define SECCOMP_FILTER_FLAG_MASK	(SECCOMP_FILTER_FLAG_TSYNC | \
+					 SECCOMP_FILTER_FLAG_LOG | \
+					 SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
+					 SECCOMP_FILTER_FLAG_NEW_LISTENER)
 
 #ifdef CONFIG_SECCOMP
 
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 9efc0e73d50b..d4ccb32fe089 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -17,9 +17,10 @@ 
 #define SECCOMP_GET_ACTION_AVAIL	2
 
 /* Valid flags for SECCOMP_SET_MODE_FILTER */
-#define SECCOMP_FILTER_FLAG_TSYNC	(1UL << 0)
-#define SECCOMP_FILTER_FLAG_LOG		(1UL << 1)
-#define SECCOMP_FILTER_FLAG_SPEC_ALLOW	(1UL << 2)
+#define SECCOMP_FILTER_FLAG_TSYNC		(1UL << 0)
+#define SECCOMP_FILTER_FLAG_LOG			(1UL << 1)
+#define SECCOMP_FILTER_FLAG_SPEC_ALLOW		(1UL << 2)
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER	(1UL << 3)
 
 /*
  * All BPF programs must return a 32-bit value.
@@ -35,6 +36,7 @@ 
 #define SECCOMP_RET_KILL	 SECCOMP_RET_KILL_THREAD
 #define SECCOMP_RET_TRAP	 0x00030000U /* disallow and force a SIGSYS */
 #define SECCOMP_RET_ERRNO	 0x00050000U /* returns an errno */
+#define SECCOMP_RET_USER_NOTIF   0x7fc00000U /* notifies userspace */
 #define SECCOMP_RET_TRACE	 0x7ff00000U /* pass to a tracer or disallow */
 #define SECCOMP_RET_LOG		 0x7ffc0000U /* allow after logging */
 #define SECCOMP_RET_ALLOW	 0x7fff0000U /* allow */
@@ -60,4 +62,29 @@  struct seccomp_data {
 	__u64 args[6];
 };
 
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signaled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+
+#define SECCOMP_IOC_MAGIC		0xF7
+
+/* Flags for seccomp notification fd ioctl. */
+#define SECCOMP_NOTIF_RECV	_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+					struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND	_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+					struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+					__u64)
+
 #endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fd023ac24e10..fa6fe9756c80 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -33,12 +33,78 @@ 
 #endif
 
 #ifdef CONFIG_SECCOMP_FILTER
+#include <linux/file.h>
 #include <linux/filter.h>
 #include <linux/pid.h>
 #include <linux/ptrace.h>
 #include <linux/security.h>
 #include <linux/tracehook.h>
 #include <linux/uaccess.h>
+#include <linux/anon_inodes.h>
+
+enum notify_state {
+	SECCOMP_NOTIFY_INIT,
+	SECCOMP_NOTIFY_SENT,
+	SECCOMP_NOTIFY_REPLIED,
+};
+
+struct seccomp_knotif {
+	/* The task whose filter triggered the notification */
+	struct task_struct *task;
+
+	/* The "cookie" for this request; this is unique for this filter. */
+	u64 id;
+
+	/* Whether this task was interrupted by a signal while waiting. */
+	bool signaled;
+
+	/*
+	 * The seccomp data. This pointer is valid for the entire time this
+	 * notification is active, since it comes from __seccomp_filter(),
+	 * whose lifetime encompasses this notification's entire lifecycle.
+	 */
+	const struct seccomp_data *data;
+
+	/*
+	 * Notification states. When SECCOMP_RET_USER_NOTIF is returned, a
+	 * struct seccomp_knotif is created and starts out in INIT. Once the
+	 * handler reads the notification off of an FD, it transitions to SENT.
+	 * If a signal is received the state transitions back to INIT and
+	 * another message is sent. When the userspace handler replies, state
+	 * transitions to REPLIED.
+	 */
+	enum notify_state state;
+
+	/* The return values, only valid when in SECCOMP_NOTIFY_REPLIED */
+	int error;
+	long val;
+
+	/* Signals when this has entered SECCOMP_NOTIFY_REPLIED */
+	struct completion ready;
+
+	struct list_head list;
+};
+
+/**
+ * struct notification - container for seccomp userspace notifications. Since
+ * most seccomp filters will not have notification listeners attached and this
+ * structure is fairly large, we store the notification-specific stuff in a
+ * separate structure.
+ *
+ * @request: A semaphore that users of this notification can wait on for
+ *           changes. Actual reads and writes are still controlled with
+ *           filter->notify_lock.
+ * @next_id: The id of the next request.
+ * @notifications: A list of struct seccomp_knotif elements.
+ * @wqh: A wait queue for poll.
+ */
+struct notification {
+	struct semaphore request;
+	u64 next_id;
+	struct list_head notifications;
+	wait_queue_head_t wqh;
+};
 
 /**
  * struct seccomp_filter - container for seccomp BPF programs
@@ -66,6 +132,8 @@  struct seccomp_filter {
 	bool log;
 	struct seccomp_filter *prev;
 	struct bpf_prog *prog;
+	struct notification *notif;
+	struct mutex notify_lock;
 };
 
 /* Limit any path through the tree to 256KB worth of instructions. */
@@ -392,6 +460,7 @@  static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)
 	if (!sfilter)
 		return ERR_PTR(-ENOMEM);
 
+	mutex_init(&sfilter->notify_lock);
 	ret = bpf_prog_create_from_user(&sfilter->prog, fprog,
 					seccomp_check_filter, save_orig);
 	if (ret < 0) {
@@ -556,11 +625,13 @@  static void seccomp_send_sigsys(int syscall, int reason)
 #define SECCOMP_LOG_TRACE		(1 << 4)
 #define SECCOMP_LOG_LOG			(1 << 5)
 #define SECCOMP_LOG_ALLOW		(1 << 6)
+#define SECCOMP_LOG_USER_NOTIF		(1 << 7)
 
 static u32 seccomp_actions_logged = SECCOMP_LOG_KILL_PROCESS |
 				    SECCOMP_LOG_KILL_THREAD  |
 				    SECCOMP_LOG_TRAP  |
 				    SECCOMP_LOG_ERRNO |
+				    SECCOMP_LOG_USER_NOTIF |
 				    SECCOMP_LOG_TRACE |
 				    SECCOMP_LOG_LOG;
 
@@ -581,6 +652,9 @@  static inline void seccomp_log(unsigned long syscall, long signr, u32 action,
 	case SECCOMP_RET_TRACE:
 		log = requested && seccomp_actions_logged & SECCOMP_LOG_TRACE;
 		break;
+	case SECCOMP_RET_USER_NOTIF:
+		log = requested && seccomp_actions_logged & SECCOMP_LOG_USER_NOTIF;
+		break;
 	case SECCOMP_RET_LOG:
 		log = seccomp_actions_logged & SECCOMP_LOG_LOG;
 		break;
@@ -652,6 +726,73 @@  void secure_computing_strict(int this_syscall)
 #else
 
 #ifdef CONFIG_SECCOMP_FILTER
+static u64 seccomp_next_notify_id(struct seccomp_filter *filter)
+{
+	/* Note: overflow is ok here, the id just needs to be unique */
+	return filter->notif->next_id++;
+}
+
+static void seccomp_do_user_notification(int this_syscall,
+					 struct seccomp_filter *match,
+					 const struct seccomp_data *sd)
+{
+	int err;
+	long ret = 0;
+	struct seccomp_knotif n = {};
+
+	mutex_lock(&match->notify_lock);
+	err = -ENOSYS;
+	if (!match->notif)
+		goto out;
+
+	n.task = current;
+	n.state = SECCOMP_NOTIFY_INIT;
+	n.data = sd;
+	n.id = seccomp_next_notify_id(match);
+	init_completion(&n.ready);
+
+	list_add(&n.list, &match->notif->notifications);
+	wake_up_poll(&match->notif->wqh, EPOLLIN | EPOLLRDNORM);
+
+	mutex_unlock(&match->notify_lock);
+	up(&match->notif->request);
+
+	err = wait_for_completion_interruptible(&n.ready);
+	mutex_lock(&match->notify_lock);
+
+	/*
+	 * Here it's possible we got a signal and then had to wait on the mutex
+	 * while the reply was sent, so let's be sure there wasn't a response
+	 * in the meantime.
+	 */
+	if (err < 0 && n.state != SECCOMP_NOTIFY_REPLIED) {
+		/*
+		 * We got a signal. Let's tell userspace about it (potentially
+		 * again, if we had already notified them about the first one).
+		 */
+		n.signaled = true;
+		if (n.state == SECCOMP_NOTIFY_SENT) {
+			n.state = SECCOMP_NOTIFY_INIT;
+			up(&match->notif->request);
+		}
+		mutex_unlock(&match->notify_lock);
+		err = wait_for_completion_killable(&n.ready);
+		mutex_lock(&match->notify_lock);
+		if (err < 0)
+			goto remove_list;
+	}
+
+	ret = n.val;
+	err = n.error;
+
+remove_list:
+	list_del(&n.list);
+out:
+	mutex_unlock(&match->notify_lock);
+	syscall_set_return_value(current, task_pt_regs(current),
+				 err, ret);
+}
+
 static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 			    const bool recheck_after_trace)
 {
@@ -728,6 +869,9 @@  static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
 
 		return 0;
 
+	case SECCOMP_RET_USER_NOTIF:
+		seccomp_do_user_notification(this_syscall, match, sd);
+		goto skip;
 	case SECCOMP_RET_LOG:
 		seccomp_log(this_syscall, 0, action, true);
 		return 0;
@@ -834,6 +978,9 @@  static long seccomp_set_mode_strict(void)
 }
 
 #ifdef CONFIG_SECCOMP_FILTER
+static struct file *init_listener(struct task_struct *,
+				  struct seccomp_filter *);
+
 /**
  * seccomp_set_mode_filter: internal function for setting seccomp filter
  * @flags:  flags to change filter behavior
@@ -853,6 +1000,8 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	const unsigned long seccomp_mode = SECCOMP_MODE_FILTER;
 	struct seccomp_filter *prepared = NULL;
 	long ret = -EINVAL;
+	int listener = 0;
+	struct file *listener_f = NULL;
 
 	/* Validate flags. */
 	if (flags & ~SECCOMP_FILTER_FLAG_MASK)
@@ -863,13 +1012,28 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	if (IS_ERR(prepared))
 		return PTR_ERR(prepared);
 
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		listener = get_unused_fd_flags(0);
+		if (listener < 0) {
+			ret = listener;
+			goto out_free;
+		}
+
+		listener_f = init_listener(current, prepared);
+		if (IS_ERR(listener_f)) {
+			put_unused_fd(listener);
+			ret = PTR_ERR(listener_f);
+			goto out_free;
+		}
+	}
+
 	/*
 	 * Make sure we cannot change seccomp or nnp state via TSYNC
 	 * while another thread is in the middle of calling exec.
 	 */
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC &&
 	    mutex_lock_killable(&current->signal->cred_guard_mutex))
-		goto out_free;
+		goto out_put_fd;
 
 	spin_lock_irq(&current->sighand->siglock);
 
@@ -887,6 +1051,16 @@  static long seccomp_set_mode_filter(unsigned int flags,
 	spin_unlock_irq(&current->sighand->siglock);
 	if (flags & SECCOMP_FILTER_FLAG_TSYNC)
 		mutex_unlock(&current->signal->cred_guard_mutex);
+out_put_fd:
+	if (flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) {
+		if (ret < 0) {
+			fput(listener_f);
+			put_unused_fd(listener);
+		} else {
+			fd_install(listener, listener_f);
+			ret = listener;
+		}
+	}
 out_free:
 	seccomp_filter_free(prepared);
 	return ret;
@@ -911,6 +1085,7 @@  static long seccomp_get_action_avail(const char __user *uaction)
 	case SECCOMP_RET_KILL_THREAD:
 	case SECCOMP_RET_TRAP:
 	case SECCOMP_RET_ERRNO:
+	case SECCOMP_RET_USER_NOTIF:
 	case SECCOMP_RET_TRACE:
 	case SECCOMP_RET_LOG:
 	case SECCOMP_RET_ALLOW:
@@ -1111,6 +1286,7 @@  long seccomp_get_metadata(struct task_struct *task,
 #define SECCOMP_RET_KILL_THREAD_NAME	"kill_thread"
 #define SECCOMP_RET_TRAP_NAME		"trap"
 #define SECCOMP_RET_ERRNO_NAME		"errno"
+#define SECCOMP_RET_USER_NOTIF_NAME	"user_notif"
 #define SECCOMP_RET_TRACE_NAME		"trace"
 #define SECCOMP_RET_LOG_NAME		"log"
 #define SECCOMP_RET_ALLOW_NAME		"allow"
@@ -1120,6 +1296,7 @@  static const char seccomp_actions_avail[] =
 				SECCOMP_RET_KILL_THREAD_NAME	" "
 				SECCOMP_RET_TRAP_NAME		" "
 				SECCOMP_RET_ERRNO_NAME		" "
+				SECCOMP_RET_USER_NOTIF_NAME     " "
 				SECCOMP_RET_TRACE_NAME		" "
 				SECCOMP_RET_LOG_NAME		" "
 				SECCOMP_RET_ALLOW_NAME;
@@ -1134,6 +1311,7 @@  static const struct seccomp_log_name seccomp_log_names[] = {
 	{ SECCOMP_LOG_KILL_THREAD, SECCOMP_RET_KILL_THREAD_NAME },
 	{ SECCOMP_LOG_TRAP, SECCOMP_RET_TRAP_NAME },
 	{ SECCOMP_LOG_ERRNO, SECCOMP_RET_ERRNO_NAME },
+	{ SECCOMP_LOG_USER_NOTIF, SECCOMP_RET_USER_NOTIF_NAME },
 	{ SECCOMP_LOG_TRACE, SECCOMP_RET_TRACE_NAME },
 	{ SECCOMP_LOG_LOG, SECCOMP_RET_LOG_NAME },
 	{ SECCOMP_LOG_ALLOW, SECCOMP_RET_ALLOW_NAME },
@@ -1342,3 +1520,259 @@  static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_FILTER
+static int seccomp_notify_release(struct inode *inode, struct file *file)
+{
+	struct seccomp_filter *filter = file->private_data;
+	struct seccomp_knotif *knotif;
+
+	mutex_lock(&filter->notify_lock);
+
+	/*
+	 * If this file is being closed because e.g. the task who owned it
+	 * died, let's wake everyone up who was waiting on us.
+	 */
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->state == SECCOMP_NOTIFY_REPLIED)
+			continue;
+
+		knotif->state = SECCOMP_NOTIFY_REPLIED;
+		knotif->error = -ENOSYS;
+		knotif->val = 0;
+
+		complete(&knotif->ready);
+	}
+
+	wake_up_all(&filter->notif->wqh);
+	kfree(filter->notif);
+	filter->notif = NULL;
+	mutex_unlock(&filter->notify_lock);
+	__put_seccomp_filter(filter);
+	return 0;
+}
+
+static long seccomp_notify_recv(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL, *cur;
+	struct seccomp_notif unotif = {};
+	ssize_t ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+
+	ret = down_interruptible(&filter->notif->request);
+	if (ret < 0)
+		return ret;
+
+	mutex_lock(&filter->notify_lock);
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT) {
+			knotif = cur;
+			break;
+		}
+	}
+
+	/*
+	 * If we didn't find a notification, it could be that the task was
+	 * interrupted between the time we were woken and when we were able to
+	 * acquire the notify_lock mutex.
+	 */
+	if (!knotif) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	size = min_t(size_t, size, sizeof(unotif));
+
+	unotif.len = size;
+	unotif.id = knotif->id;
+	unotif.pid = task_pid_vnr(knotif->task);
+	unotif.signaled = knotif->signaled;
+	unotif.data = *(knotif->data);
+
+	if (copy_to_user(buf, &unotif, size)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_SENT;
+	wake_up_poll(&filter->notif->wqh, EPOLLOUT | EPOLLWRNORM);
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_send(struct seccomp_filter *filter,
+				unsigned long arg)
+{
+	struct seccomp_notif_resp resp = {};
+	struct seccomp_knotif *knotif = NULL;
+	long ret;
+	u16 size;
+	void __user *buf = (void __user *)arg;
+
+	if (copy_from_user(&size, buf, sizeof(size)))
+		return -EFAULT;
+	size = min_t(size_t, size, sizeof(resp));
+	if (copy_from_user(&resp, buf, size))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->id == resp.id)
+			break;
+	}
+
+	if (!knotif || knotif->id != resp.id) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	/* Allow exactly one reply. */
+	if (knotif->state != SECCOMP_NOTIFY_SENT) {
+		ret = -EINPROGRESS;
+		goto out;
+	}
+
+	ret = size;
+	knotif->state = SECCOMP_NOTIFY_REPLIED;
+	knotif->error = resp.error;
+	knotif->val = resp.val;
+	complete(&knotif->ready);
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_id_valid(struct seccomp_filter *filter,
+				    unsigned long arg)
+{
+	struct seccomp_knotif *knotif = NULL;
+	void __user *buf = (void __user *)arg;
+	u64 id;
+	long ret;
+
+	if (copy_from_user(&id, buf, sizeof(id)))
+		return -EFAULT;
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	ret = -1;
+	list_for_each_entry(knotif, &filter->notif->notifications, list) {
+		if (knotif->id == id) {
+			ret = 0;
+			goto out;
+		}
+	}
+
+out:
+	mutex_unlock(&filter->notify_lock);
+	return ret;
+}
+
+static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	struct seccomp_filter *filter = file->private_data;
+
+	switch (cmd) {
+	case SECCOMP_NOTIF_RECV:
+		return seccomp_notify_recv(filter, arg);
+	case SECCOMP_NOTIF_SEND:
+		return seccomp_notify_send(filter, arg);
+	case SECCOMP_NOTIF_ID_VALID:
+		return seccomp_notify_id_valid(filter, arg);
+	default:
+		return -EINVAL;
+	}
+}
+
+static __poll_t seccomp_notify_poll(struct file *file,
+				    struct poll_table_struct *poll_tab)
+{
+	struct seccomp_filter *filter = file->private_data;
+	__poll_t ret = 0;
+	struct seccomp_knotif *cur;
+
+	poll_wait(file, &filter->notif->wqh, poll_tab);
+
+	ret = mutex_lock_interruptible(&filter->notify_lock);
+	if (ret < 0)
+		return ret;
+
+	list_for_each_entry(cur, &filter->notif->notifications, list) {
+		if (cur->state == SECCOMP_NOTIFY_INIT)
+			ret |= EPOLLIN | EPOLLRDNORM;
+		if (cur->state == SECCOMP_NOTIFY_SENT)
+			ret |= EPOLLOUT | EPOLLWRNORM;
+		if (ret & EPOLLIN && ret & EPOLLOUT)
+			break;
+	}
+
+	mutex_unlock(&filter->notify_lock);
+
+	return ret;
+}
+
+static const struct file_operations seccomp_notify_ops = {
+	.poll = seccomp_notify_poll,
+	.release = seccomp_notify_release,
+	.unlocked_ioctl = seccomp_notify_ioctl,
+};
+
+static struct file *init_listener(struct task_struct *task,
+				  struct seccomp_filter *filter)
+{
+	struct file *ret = ERR_PTR(-EBUSY);
+	struct seccomp_filter *cur, *last_locked = NULL;
+	int filter_nesting = 0;
+
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_lock_nested(&cur->notify_lock, filter_nesting);
+		filter_nesting++;
+		last_locked = cur;
+		if (cur->notif)
+			goto out;
+	}
+
+	ret = ERR_PTR(-ENOMEM);
+	filter->notif = kzalloc(sizeof(*(filter->notif)), GFP_KERNEL);
+	if (!filter->notif)
+		goto out;
+
+	sema_init(&filter->notif->request, 0);
+	INIT_LIST_HEAD(&filter->notif->notifications);
+	filter->notif->next_id = get_random_u64();
+	init_waitqueue_head(&filter->notif->wqh);
+
+	ret = anon_inode_getfile("seccomp notify", &seccomp_notify_ops,
+				 filter, O_RDWR);
+	if (IS_ERR(ret))
+		goto out;
+
+	/* The file has a reference to it now */
+	__get_seccomp_filter(filter);
+
+out:
+	for (cur = task->seccomp.filter; cur; cur = cur->prev) {
+		mutex_unlock(&cur->notify_lock);
+		if (cur == last_locked)
+			break;
+	}
+
+	return ret;
+}
+#endif
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index e1473234968d..5f4b836a6792 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -5,6 +5,7 @@ 
  * Test code for seccomp bpf.
  */
 
+#define _GNU_SOURCE
 #include <sys/types.h>
 
 /*
@@ -40,10 +41,12 @@ 
 #include <sys/fcntl.h>
 #include <sys/mman.h>
 #include <sys/times.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
 
-#define _GNU_SOURCE
 #include <unistd.h>
 #include <sys/syscall.h>
+#include <poll.h>
 
 #include "../kselftest_harness.h"
 
@@ -154,6 +157,34 @@  struct seccomp_metadata {
 };
 #endif
 
+#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
+#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
+
+#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
+
+#define SECCOMP_IOC_MAGIC		0xF7
+#define SECCOMP_NOTIF_RECV	_IOWR(SECCOMP_IOC_MAGIC, 0,	\
+					struct seccomp_notif)
+#define SECCOMP_NOTIF_SEND	_IOWR(SECCOMP_IOC_MAGIC, 1,	\
+					struct seccomp_notif_resp)
+#define SECCOMP_NOTIF_ID_VALID	_IOR(SECCOMP_IOC_MAGIC, 2,	\
+					__u64)
+struct seccomp_notif {
+	__u16 len;
+	__u64 id;
+	__u32 pid;
+	__u8 signaled;
+	struct seccomp_data data;
+};
+
+struct seccomp_notif_resp {
+	__u16 len;
+	__u64 id;
+	__s32 error;
+	__s64 val;
+};
+#endif
+
 #ifndef seccomp
 int seccomp(unsigned int op, unsigned int flags, void *args)
 {
@@ -2077,7 +2108,8 @@  TEST(detect_seccomp_filter_flags)
 {
 	unsigned int flags[] = { SECCOMP_FILTER_FLAG_TSYNC,
 				 SECCOMP_FILTER_FLAG_LOG,
-				 SECCOMP_FILTER_FLAG_SPEC_ALLOW };
+				 SECCOMP_FILTER_FLAG_SPEC_ALLOW,
+				 SECCOMP_FILTER_FLAG_NEW_LISTENER };
 	unsigned int flag, all_flags;
 	int i;
 	long ret;
@@ -2933,6 +2965,383 @@  TEST(get_metadata)
 	ASSERT_EQ(0, kill(pid, SIGKILL));
 }
 
+static int user_trap_syscall(int nr, unsigned int flags)
+{
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+			offsetof(struct seccomp_data, nr)),
+		BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_USER_NOTIF),
+		BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+	};
+
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+static int read_notif(int listener, struct seccomp_notif *req)
+{
+	int ret;
+
+	do {
+		errno = 0;
+		req->len = sizeof(*req);
+		ret = ioctl(listener, SECCOMP_NOTIF_RECV, req);
+	} while (ret == -1 && errno == ENOENT);
+	return ret;
+}
+
+static void signal_handler(int signal)
+{
+}
+
+#define USER_NOTIF_MAGIC 116983961184613L
+TEST(get_user_notification_syscall)
+{
+	pid_t pid;
+	long ret;
+	int status, listener;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+	struct pollfd pollfd;
+
+	struct sock_filter filter[] = {
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog prog = {
+		.len = (unsigned short)ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	/* Check that we get -ENOSYS with no listener attached */
+	if (pid == 0) {
+		if (user_trap_syscall(__NR_getpid, 0) < 0)
+			exit(1);
+		ret = syscall(__NR_getpid);
+		exit(ret >= 0 || errno != ENOSYS);
+	}
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/* Add some no-op filters so that we (don't) trigger lockdep. */
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+	EXPECT_EQ(seccomp(SECCOMP_SET_MODE_FILTER, 0, &prog), 0);
+
+	/* Check that the basic notification machinery works */
+	listener = user_trap_syscall(__NR_getpid,
+				     SECCOMP_FILTER_FLAG_NEW_LISTENER);
+	EXPECT_GE(listener, 0);
+
+	/* Installing a second listener in the chain should EBUSY */
+	EXPECT_EQ(user_trap_syscall(__NR_getpid,
+				    SECCOMP_FILTER_FLAG_NEW_LISTENER),
+		  -1);
+	EXPECT_EQ(errno, EBUSY);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLIN);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+
+	pollfd.fd = listener;
+	pollfd.events = POLLIN | POLLOUT;
+
+	EXPECT_GT(poll(&pollfd, 1, -1), 0);
+	EXPECT_EQ(pollfd.revents, POLLOUT);
+
+	EXPECT_EQ(req.data.nr,  __NR_getpid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that nothing bad happens when we kill the task in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), 0);
+
+	EXPECT_EQ(kill(pid, SIGKILL), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_ID_VALID, &req.id), -1);
+
+	resp.id = req.id;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, -1);
+	EXPECT_EQ(errno, ENOENT);
+
+	/*
+	 * Check that we get another notification about a signal in the middle
+	 * of a syscall.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		if (signal(SIGUSR1, signal_handler) == SIG_ERR) {
+			perror("signal");
+			exit(1);
+		}
+		ret = syscall(__NR_getpid);
+		exit(ret != USER_NOTIF_MAGIC);
+	}
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(kill(pid, SIGUSR1), 0);
+
+	ret = read_notif(listener, &req);
+	EXPECT_EQ(req.signaled, 1);
+	EXPECT_EQ(ret, sizeof(req));
+	EXPECT_EQ(errno, 0);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = -512; /* -ERESTARTSYS */
+	resp.val = 0;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	ret = read_notif(listener, &req);
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+	ret = ioctl(listener, SECCOMP_NOTIF_SEND, &resp);
+	EXPECT_EQ(ret, sizeof(resp));
+	EXPECT_EQ(errno, 0);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	/*
+	 * Check that we get an ENOSYS when the listener is closed.
+	 */
+	pid = fork();
+	ASSERT_GE(pid, 0);
+	if (pid == 0) {
+		close(listener);
+		ret = syscall(__NR_getpid);
+		exit(ret != -1 || errno != ENOSYS);
+	}
+
+	close(listener);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
+/*
+ * Check that a pid in a child namespace still shows up as valid in ours.
+ */
+TEST(user_notification_child_pid_ns)
+{
+	pid_t pid;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+	ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+		/* Signal we're ready and have installed the filter. */
+		EXPECT_EQ(write(sk_pair[1], "J", 1), 1);
+
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+
+		exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &c, 1), 1);
+	EXPECT_EQ(c, 'J');
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid), 0);
+	EXPECT_EQ(waitpid(pid, NULL, 0), pid);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid, NULL, 0), 0);
+
+	/* Now signal we are done and respond with magic */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	req.len = sizeof(req);
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+	EXPECT_EQ(req.pid, pid);
+
+	resp.len = sizeof(resp);
+	resp.id = req.id;
+	resp.error = 0;
+	resp.val = USER_NOTIF_MAGIC;
+
+	EXPECT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+	close(listener);
+}
+
+/*
+ * Check that a pid in a sibling (i.e. unrelated) namespace shows up as 0, i.e.
+ * invalid.
+ */
+TEST(user_notification_sibling_pid_ns)
+{
+	pid_t pid, pid2;
+	int status, listener;
+	int sk_pair[2];
+	char c;
+	struct seccomp_notif req = {};
+	struct seccomp_notif_resp resp = {};
+
+	ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair), 0);
+
+	pid = fork();
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int child_pair[2];
+
+		ASSERT_EQ(unshare(CLONE_NEWPID), 0);
+
+		ASSERT_EQ(socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, child_pair), 0);
+
+		pid2 = fork();
+		ASSERT_GE(pid2, 0);
+
+		if (pid2 == 0) {
+			close(child_pair[0]);
+			EXPECT_EQ(user_trap_syscall(__NR_getpid, 0), 0);
+
+			/* Signal we're ready and have installed the filter. */
+			EXPECT_EQ(write(child_pair[1], "J", 1), 1);
+
+			EXPECT_EQ(read(child_pair[1], &c, 1), 1);
+			EXPECT_EQ(c, 'H');
+
+			exit(syscall(__NR_getpid) != USER_NOTIF_MAGIC);
+		}
+
+		/* check that child has installed the filter */
+		EXPECT_EQ(read(child_pair[0], &c, 1), 1);
+		EXPECT_EQ(c, 'J');
+
+		/* tell parent who child is */
+		EXPECT_EQ(write(sk_pair[1], &pid2, sizeof(pid2)), sizeof(pid2));
+
+		/* parent has installed listener, tell child to call syscall */
+		EXPECT_EQ(read(sk_pair[1], &c, 1), 1);
+		EXPECT_EQ(c, 'H');
+		EXPECT_EQ(write(child_pair[0], "H", 1), 1);
+
+		EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+		EXPECT_EQ(true, WIFEXITED(status));
+		EXPECT_EQ(0, WEXITSTATUS(status));
+		exit(WEXITSTATUS(status));
+	}
+
+	EXPECT_EQ(read(sk_pair[0], &pid2, sizeof(pid2)), sizeof(pid2));
+
+	EXPECT_EQ(ptrace(PTRACE_ATTACH, pid2), 0);
+	EXPECT_EQ(waitpid(pid2, NULL, 0), pid2);
+	listener = ptrace(PTRACE_SECCOMP_NEW_LISTENER, pid2, 0);
+	EXPECT_GE(listener, 0);
+	EXPECT_EQ(errno, 0);
+	EXPECT_EQ(ptrace(PTRACE_DETACH, pid2, NULL, 0), 0);
+
+	/* Create the sibling ns, and sibling in it. */
+	EXPECT_EQ(unshare(CLONE_NEWPID), 0);
+	EXPECT_EQ(errno, 0);
+
+	pid2 = fork();
+	EXPECT_GE(pid2, 0);
+
+	if (pid2 == 0) {
+		req.len = sizeof(req);
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_RECV, &req), sizeof(req));
+		/*
+		 * The pid should be 0, i.e. the task is in some namespace that
+		 * we can't "see".
+		 */
+		ASSERT_EQ(req.pid, 0);
+
+		resp.len = sizeof(resp);
+		resp.id = req.id;
+		resp.error = 0;
+		resp.val = USER_NOTIF_MAGIC;
+
+		ASSERT_EQ(ioctl(listener, SECCOMP_NOTIF_SEND, &resp), sizeof(resp));
+		exit(0);
+	}
+
+	close(listener);
+
+	/* Now signal we are done setting up sibling listener. */
+	EXPECT_EQ(write(sk_pair[0], "H", 1), 1);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+
+	EXPECT_EQ(waitpid(pid2, &status, 0), pid2);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 /*
  * TODO:
  * - add microbenchmarks