From patchwork Wed Sep 30 15:19:12 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: YiFei Zhu X-Patchwork-Id: 11809425 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 84168618 for ; Wed, 30 Sep 2020 15:25:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5DF9E20759 for ; Wed, 30 Sep 2020 15:25:04 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="faLp8XRW" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730626AbgI3PUp (ORCPT ); Wed, 30 Sep 2020 11:20:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40826 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725385AbgI3PUZ (ORCPT ); Wed, 30 Sep 2020 11:20:25 -0400 Received: from mail-io1-xd41.google.com (mail-io1-xd41.google.com [IPv6:2607:f8b0:4864:20::d41]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7F276C061755; Wed, 30 Sep 2020 08:20:25 -0700 (PDT) Received: by mail-io1-xd41.google.com with SMTP id v8so2193206iom.6; Wed, 30 Sep 2020 08:20:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=Mv4ii8zAP+OnnhUrvn84xAFhnizL7MtbdUx+GH7PQaM=; b=faLp8XRWWZBJTGcWmfT4Onxq+2qJR2ZNvRXCr5h8BPn4Uc5zUe5sKpa3bU2zZx/UoH oPN7V7n4HUbMwcESTej9CA3ep+EYCoXa2JynNjku8sN3Cv/DmbQWmdkkFnA5OLd9YJNp +pZnXwWT10vJsEK/RknN+QI7HU13OGEyKwR05Y5/7TLZpW7+f4EfzY5YQhGerKNUOCwq 6h8vlrYz7QRAwC68dMtgihsPR8nnakqyn0u/6V5iLnsx0LNbddmPxxRuBnykd+7Y1K6Q cJ65cIAW1Q4d+xOQUqKCyCQVix4it5tExeZvGIsDatvLDDuKJ8MaA+ndRb8QVLO1yuCx NaMg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=Mv4ii8zAP+OnnhUrvn84xAFhnizL7MtbdUx+GH7PQaM=; b=iNt9ZJxx2qfMKKUFuFFnrUXLgDRICZ3uMd05Ygl7KvboyutBeffyTkjOCVtDiFueuF bCKKBoa3lvlpAo70FB60S/lpqFWFOUIncMZ4hyc25eWU5GTmraBK/FxMd3t5lcFlHMXG Lqk4f1DO4gyRaTnQ+7CRw8LmzL0X0Jq0XtEi/iM2i2eyBleAKR5b2RZJdtz0EGXyq4oa oFmgE6vsox4qfwb83XK+2yGcBFpThsnNdw6sJ9GpBgqsJv1LtrSqnxwJF4zt1KxYGgHD zGWdWlCXRxG7kB4xwfgpqnKilUF7JQbqtDVfx7EGz8PmifdXXmezPm/e0qEv59cXaYUM wQRQ== X-Gm-Message-State: AOAM533ZgpISUOSgPEv+jacaupiHjsg1HEE9cJbSZwG7hMoagx9z42cT JS817GPaG8vJ/vHE2XXl+ds= X-Google-Smtp-Source: ABdhPJwjIxz2zEuBhqj0fQKlXjCbvbF3OmJHQpSM1hO02ApFZSa4yKPPp5z7QVD9U1wWVPWWnSoEhA== X-Received: by 2002:a5d:9693:: with SMTP id m19mr2082274ion.161.1601479224812; Wed, 30 Sep 2020 08:20:24 -0700 (PDT) Received: from localhost.localdomain (ip-99-203-15-156.pools.cgn.spcsdns.net. [99.203.15.156]) by smtp.gmail.com with ESMTPSA id t10sm770788iog.49.2020.09.30.08.20.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 30 Sep 2020 08:20:24 -0700 (PDT) From: YiFei Zhu To: containers@lists.linux-foundation.org Cc: YiFei Zhu , bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Aleksa Sarai , Andrea Arcangeli , Andy Lutomirski , David Laight , Dimitrios Skarlatos , Giuseppe Scrivano , Hubertus Franke , Jack Chen , Jann Horn , Josep Torrellas , Kees Cook , Tianyin Xu , Tobin Feldman-Fitzthum , Tycho Andersen , Valentin Rothberg , Will Drewry Subject: [PATCH v3 seccomp 1/5] x86: Enable seccomp architecture tracking Date: Wed, 30 Sep 2020 10:19:12 -0500 Message-Id: <484392624b475cc25d90a787525ede70df9f7d51.1601478774.git.yifeifz2@illinois.edu> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org From: Kees Cook Provide seccomp internals with the details to calculate which syscall table the running kernel is expecting to deal 
with. This allows for efficient architecture pinning and paves the way for constant-action bitmaps. Signed-off-by: Kees Cook [YiFei: Removed x32, added macro for nr_syscalls] Signed-off-by: YiFei Zhu --- arch/x86/include/asm/seccomp.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h index 2bd1338de236..7b3a58271656 100644 --- a/arch/x86/include/asm/seccomp.h +++ b/arch/x86/include/asm/seccomp.h @@ -16,6 +16,18 @@ #define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn #endif +#ifdef CONFIG_X86_64 +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_X86_64 +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +# ifdef CONFIG_COMPAT +# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls +# endif +#else /* !CONFIG_X86_64 */ +# define SECCOMP_ARCH_DEFAULT AUDIT_ARCH_I386 +# define SECCOMP_ARCH_DEFAULT_NR NR_syscalls +#endif + #include #endif /* _ASM_X86_SECCOMP_H */ From patchwork Wed Sep 30 15:19:13 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: YiFei Zhu X-Patchwork-Id: 11809429 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0F07C618 for ; Wed, 30 Sep 2020 15:25:24 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CA40A207FB for ; Wed, 30 Sep 2020 15:25:23 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="e51pN1L8" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729395AbgI3PUp (ORCPT ); Wed, 30 Sep 2020 11:20:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40834 "EHLO 
lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728031AbgI3PU3 (ORCPT ); Wed, 30 Sep 2020 11:20:29 -0400 Received: from mail-il1-x144.google.com (mail-il1-x144.google.com [IPv6:2607:f8b0:4864:20::144]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2DE6DC061755; Wed, 30 Sep 2020 08:20:29 -0700 (PDT) Received: by mail-il1-x144.google.com with SMTP id l16so2037181ilt.13; Wed, 30 Sep 2020 08:20:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=hYUJTNyuZKx7323reM1G24J+fKq7O2/Mzw5J9lqQY70=; b=e51pN1L8+paKoCv7+p4JqO2pXYH+VKoz7CIJBuiphVR3caLailgxgi0tItgMfXxW/r +GStvQ7rE0rldtpKQ9W/sC2tQ49VffAJTd06+Kw3Ge5QW1wCk8zKiGs0T4vPkdA5OvOb pFFT44YI6ObbrkoGuMNPdKgcVqQl5aLu0RyNeSpF5aRZ2bovaboRkxv45Rz/jkzqlFHG UMSkblDoFQ4d0zkqneV5m4sz3iB43ZSgzVjYpdjk08lCbaKntAsNYBtK3CLvxqu8tjBT C+cPw7XPacoBwVUSAnnR0qyTnCdRpu8tgoRKEG58af4qL0eg2qH60s1OMnyPL3Rx/JUo NxjA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=hYUJTNyuZKx7323reM1G24J+fKq7O2/Mzw5J9lqQY70=; b=i+iRDFEsMMRwjr/B39QGxvuebqtzK6YHD2z1etK3bkFYX5Qi15zj0t/veD0+2znLcA Kl5J7MFPmKF/9MMtGxlNhNcJRUXOB9ji2CpO4nu42ZAooSPK0qh5g0J+/UGjG+bRq36k 4NbsSG7tCK+BIfh6gkd1kCQfR9sI2rcfrnK43IOd5n5GIa+FUT9dcb0iySHpN0wNSZQI nu9hSKCblWvw1sZcHG/5aUeFM7VQAW4DVIRmnb12bcrhzqKXWd85226b8qn6dGdxgXht V0zWwaNMV3BGTwQUs0yRzfqsKH7WX9hjR6LfOcrkUaruQtACmrFot9UgW+R88uJKnghO Lu7Q== X-Gm-Message-State: AOAM533PR2V/gQxEvcCqVEpOM3aSUN+j564oROglKkxbJjLnHbszMWdm XSLuSMAtplPsgBsjIZ3TFh4= X-Google-Smtp-Source: ABdhPJy1DzpCglh2ifXStoYVzKUDINbDv8p9VTC1kvYdyF9HZJheV703bxFS8sr3L7mbWHIjq8KF7w== X-Received: by 2002:a05:6e02:4cc:: with SMTP id f12mr2483926ils.28.1601479228424; Wed, 30 Sep 2020 08:20:28 -0700 (PDT) Received: from 
localhost.localdomain (ip-99-203-15-156.pools.cgn.spcsdns.net. [99.203.15.156]) by smtp.gmail.com with ESMTPSA id t10sm770788iog.49.2020.09.30.08.20.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 30 Sep 2020 08:20:27 -0700 (PDT) From: YiFei Zhu To: containers@lists.linux-foundation.org Cc: YiFei Zhu , bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Aleksa Sarai , Andrea Arcangeli , Andy Lutomirski , David Laight , Dimitrios Skarlatos , Giuseppe Scrivano , Hubertus Franke , Jack Chen , Jann Horn , Josep Torrellas , Kees Cook , Tianyin Xu , Tobin Feldman-Fitzthum , Tycho Andersen , Valentin Rothberg , Will Drewry Subject: [PATCH v3 seccomp 2/5] seccomp/cache: Add "emulator" to check if filter is constant allow Date: Wed, 30 Sep 2020 10:19:13 -0500 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org From: YiFei Zhu SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not access any syscall arguments or the instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. Each common BPF instruction is emulated. Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns SECCOMP_RET_ALLOW, the syscall is marked as good. Emulator structure and comments are from Kees [1] and Jann [2]. Emulation is done at attach time. If a filter is stacked on earlier filters, and any of those does not guarantee to allow the syscall, then we skip the emulation of this syscall. 
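For illustration, the analysis described above can be sketched in user space roughly as follows. This is a simplified stand-in and not the kernel code in this patch: the helper filter_is_const_allow() and the demo_filter array are made up for the example, the Linux UAPI headers <linux/filter.h> and <linux/seccomp.h> are assumed to be installed, and only loads of nr/arch, BPF_JEQ, BPF_JA and BPF_RET are emulated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/filter.h>	/* struct sock_filter, BPF_STMT, BPF_JUMP, opcodes */
#include <linux/seccomp.h>	/* struct seccomp_data, SECCOMP_RET_ALLOW */

/*
 * Hypothetical helper (not the kernel function): returns true only if the
 * filter reaches "return SECCOMP_RET_ALLOW" while looking at nothing but
 * the syscall number and the architecture.
 */
static bool filter_is_const_allow(const struct sock_filter *insns, size_t len,
				  int nr, unsigned int arch)
{
	unsigned int acc = 0;	/* emulated accumulator register */
	size_t pc;

	assert(insns != NULL);

	for (pc = 0; pc < len; pc++) {
		const struct sock_filter *insn = &insns[pc];

		switch (insn->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (insn->k == offsetof(struct seccomp_data, nr))
				acc = (unsigned int)nr;
			else if (insn->k == offsetof(struct seccomp_data, arch))
				acc = arch;
			else
				return false;	/* reads args or instruction pointer */
			break;
		case BPF_RET | BPF_K:
			/* reached a return using constant inputs only */
			return insn->k == SECCOMP_RET_ALLOW;
		case BPF_JMP | BPF_JA:
			pc += insn->k;
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			pc += (acc == insn->k) ? insn->jt : insn->jf;
			break;
		default:
			return false;	/* unknown instruction: give up */
		}
	}

	return false;	/* ran off the end of the program */
}

/* Demo filter (illustrative): allow syscall 39 outright, otherwise read args[0]. */
static const struct sock_filter demo_filter[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 39, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
#define DEMO_FILTER_LEN (sizeof(demo_filter) / sizeof(demo_filter[0]))
```

With demo_filter, syscall number 39 is reported as constant-allow, while any other number is not, because the filter then goes on to load args[0] and the sketch bails out.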
[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ Signed-off-by: YiFei Zhu Signed-off-by: Kees Cook --- arch/Kconfig | 34 ++++++++++ arch/x86/Kconfig | 1 + kernel/seccomp.c | 167 ++++++++++++++++++++++++++++++++++++++++++++++- 3 files changed, 201 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 21a3675a7a3a..ca867b2a5d71 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -471,6 +471,14 @@ config HAVE_ARCH_SECCOMP_FILTER results in the system call being skipped immediately. - seccomp syscall wired up +config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + bool + help + An arch should select this symbol if it provides all of these things: + - all the requirements for HAVE_ARCH_SECCOMP_FILTER + - SECCOMP_ARCH_DEFAULT + - SECCOMP_ARCH_DEFAULT_NR + config SECCOMP prompt "Enable seccomp to safely execute untrusted bytecode" def_bool y @@ -498,6 +506,32 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +choice + prompt "Seccomp filter cache" + default SECCOMP_CACHE_NONE + depends on SECCOMP_FILTER + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + Seccomp filters can potentially incur large overhead for each + system call. This can alleviate some of the overhead. + + If in doubt, select 'syscall numbers only'. + +config SECCOMP_CACHE_NONE + bool "None" + help + No caching is done. Seccomp filters will be called each time + a system call occurs in a seccomp-guarded task. + +config SECCOMP_CACHE_NR_ONLY + bool "Syscall number only" + depends on HAVE_ARCH_SECCOMP_CACHE_NR_ONLY + help + For each syscall number, if the seccomp filter has a fixed + result, store that result in a bitmap to speed up system calls. 
+ +endchoice + config HAVE_ARCH_STACKLEAK bool help diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 1ab22869a765..ff5289228ea5 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -150,6 +150,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_SECCOMP_CACHE_NR_ONLY select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/kernel/seccomp.c b/kernel/seccomp.c index ae6b40cc39f4..f09c9e74ae05 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -143,6 +143,37 @@ struct notification { struct list_head notifications; }; +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * struct seccomp_cache_filter_data - container for cache's per-filter data + * + * This struct is ordered to minimize padding holes. + * + * @syscall_allow_default: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * default architecture. + * @syscall_allow_compat: A bitmap where each bit represents whether the + * filter will always allow the syscall, for the + * compat architecture. + */ +struct seccomp_cache_filter_data { +#ifdef SECCOMP_ARCH_DEFAULT + DECLARE_BITMAP(syscall_allow_default, SECCOMP_ARCH_DEFAULT_NR); +#endif +#ifdef SECCOMP_ARCH_COMPAT + DECLARE_BITMAP(syscall_allow_compat, SECCOMP_ARCH_COMPAT_NR); +#endif +}; + +#define SECCOMP_EMU_MAX_PENDING_STATES 64 +#else +struct seccomp_cache_filter_data { }; + +static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * struct seccomp_filter - container for seccomp BPF programs * @@ -159,6 +190,7 @@ struct notification { * this filter after reaching 0. The @users count is always smaller * or equal to @refs. Hence, reaching 0 for @users does not mean * the filter can be freed. + * @cache: container for cache-related data. 
* @log: true if all actions except for SECCOMP_RET_ALLOW should be logged * @prev: points to a previously installed, or inherited, filter * @prog: the BPF program to evaluate @@ -180,6 +212,7 @@ struct seccomp_filter { refcount_t refs; refcount_t users; bool log; + struct seccomp_cache_filter_data cache; struct seccomp_filter *prev; struct bpf_prog *prog; struct notification *notif; @@ -544,7 +577,8 @@ static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog) { struct seccomp_filter *sfilter; int ret; - const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE); + const bool save_orig = IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) || + IS_ENABLED(CONFIG_SECCOMP_CACHE_NR_ONLY); if (fprog->len == 0 || fprog->len > BPF_MAXINSNS) return ERR_PTR(-EINVAL); @@ -610,6 +644,136 @@ seccomp_prepare_user_filter(const char __user *user_filter) return filter; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +/** + * seccomp_emu_is_const_allow - check if filter is constant allow with given data + * @fprog: The BPF program + * @sd: The seccomp data to check against, only syscall number and arch + * number are considered constant. 
+ */ +static bool seccomp_emu_is_const_allow(struct sock_fprog_kern *fprog, + struct seccomp_data *sd) +{ + unsigned int insns; + unsigned int reg_value = 0; + unsigned int pc; + bool op_res; + + if (WARN_ON_ONCE(!fprog)) + return false; + + insns = bpf_classic_proglen(fprog); + for (pc = 0; pc < insns; pc++) { + struct sock_filter *insn = &fprog->filter[pc]; + u16 code = insn->code; + u32 k = insn->k; + + switch (code) { + case BPF_LD | BPF_W | BPF_ABS: + switch (k) { + case offsetof(struct seccomp_data, nr): + reg_value = sd->nr; + break; + case offsetof(struct seccomp_data, arch): + reg_value = sd->arch; + break; + default: + /* can't optimize (non-constant value load) */ + return false; + } + break; + case BPF_RET | BPF_K: + /* reached return with constant values only, check allow */ + return k == SECCOMP_RET_ALLOW; + case BPF_JMP | BPF_JA: + pc += insn->k; + break; + case BPF_JMP | BPF_JEQ | BPF_K: + case BPF_JMP | BPF_JGE | BPF_K: + case BPF_JMP | BPF_JGT | BPF_K: + case BPF_JMP | BPF_JSET | BPF_K: + switch (BPF_OP(code)) { + case BPF_JEQ: + op_res = reg_value == k; + break; + case BPF_JGE: + op_res = reg_value >= k; + break; + case BPF_JGT: + op_res = reg_value > k; + break; + case BPF_JSET: + op_res = !!(reg_value & k); + break; + default: + /* can't optimize (unknown jump) */ + return false; + } + + pc += op_res ? insn->jt : insn->jf; + break; + case BPF_ALU | BPF_AND | BPF_K: + reg_value &= k; + break; + default: + /* can't optimize (unknown insn) */ + return false; + } + } + + /* ran off the end of the filter?! 
*/ + WARN_ON(1); + return false; +} + +static void seccomp_cache_prepare_bitmap(struct seccomp_filter *sfilter, + void *bitmap, const void *bitmap_prev, + size_t bitmap_size, int arch) +{ + struct sock_fprog_kern *fprog = sfilter->prog->orig_prog; + struct seccomp_data sd; + int nr; + + for (nr = 0; nr < bitmap_size; nr++) { + if (bitmap_prev && !test_bit(nr, bitmap_prev)) + continue; + + sd.nr = nr; + sd.arch = arch; + + if (seccomp_emu_is_const_allow(fprog, &sd)) + set_bit(nr, bitmap); + } +} + +/** + * seccomp_cache_prepare - emulate the filter to find cacheable syscalls + * @sfilter: The seccomp filter + * + * The result is recorded in @sfilter->cache; nothing is returned. + */ +static void seccomp_cache_prepare(struct seccomp_filter *sfilter) +{ + struct seccomp_cache_filter_data *cache = &sfilter->cache; + const struct seccomp_cache_filter_data *cache_prev = + sfilter->prev ? &sfilter->prev->cache : NULL; + +#ifdef SECCOMP_ARCH_DEFAULT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_default, + cache_prev ? cache_prev->syscall_allow_default : NULL, + SECCOMP_ARCH_DEFAULT_NR, + SECCOMP_ARCH_DEFAULT); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + seccomp_cache_prepare_bitmap(sfilter, cache->syscall_allow_compat, + cache_prev ? cache_prev->syscall_allow_compat : NULL, + SECCOMP_ARCH_COMPAT_NR, + SECCOMP_ARCH_COMPAT); +#endif /* SECCOMP_ARCH_COMPAT */ +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_attach_filter: validate and attach filter * @flags: flags to change filter behavior @@ -659,6 +823,7 @@ static long seccomp_attach_filter(unsigned int flags, * task reference. 
*/ filter->prev = current->seccomp.filter; + seccomp_cache_prepare(filter); current->seccomp.filter = filter; atomic_inc(&current->seccomp.filter_count); From patchwork Wed Sep 30 15:19:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: YiFei Zhu X-Patchwork-Id: 11809451 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0B23E92C for ; Wed, 30 Sep 2020 15:27:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D98142071E for ; Wed, 30 Sep 2020 15:27:09 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="WAIVZfAy" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729432AbgI3PUp (ORCPT ); Wed, 30 Sep 2020 11:20:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40844 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729351AbgI3PUc (ORCPT ); Wed, 30 Sep 2020 11:20:32 -0400 Received: from mail-io1-xd42.google.com (mail-io1-xd42.google.com [IPv6:2607:f8b0:4864:20::d42]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 94E23C061755; Wed, 30 Sep 2020 08:20:32 -0700 (PDT) Received: by mail-io1-xd42.google.com with SMTP id k6so2210379ior.2; Wed, 30 Sep 2020 08:20:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=AThlbV8qSIf//EFg3le7/Kv8iNlRjvJHI5D0t9AJZm8=; b=WAIVZfAyst0tKhF9rP+a6GZ+1JKCLVToxPDvriVX3CfSR+N3Xd6eMx+ty3tny7GcQX Ox0ZwqAzeN9LprFtrND1a43fsROAkb06iKQnUVhQ+6308rCs33JcrgNiXG2E4BXkGOW+ Da5kBXuylCu+EJnHnu4DodHho0uQsQ7U2CXPjgtT8tRH6nLserbvWx/b7mH6EITU6xCw 
Lj/Unacm/R864IYgL21+F2Zxt4Hm2yc8eW7zsqUI8mj2aXsP0X5XaFl7CZSBcQv/UXDA CQmHuaXTcI2lVPGJ6g6+9M/ltyubt8DetU3jMrh1xULbP0DyAtxUbZVjajpdC/5bd7zp 1EIA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=AThlbV8qSIf//EFg3le7/Kv8iNlRjvJHI5D0t9AJZm8=; b=cTc3AaxMx4t2wGj67+G40zoRnN4w5ojGtx516++2MwHOZ0qD+C6TOmc//kydK9Pi5p TpkAdtNKmiYiPgEHxqIY7LqE+TKX0o+gKyo95xwC5XvOwyUCF9phJDRCpTyFAZqAH0tV gCKuDJZyYe43SZivC7F8FRDaXrrcCid+cV5BBUv+hlw8dE3tnl4yOdpBoZ902o3TfMYs yOg7PHSq97HDlOknhRkInZyAckoSxLicTn0TlWFTnASyuCmD42oLxLcQgBrOBafoGDL2 wobb0TX+tnv591KLJV867yumYqZCMjY9JOmvAdzkFnyjTu8z2SQ3QVvXQginzNn5Elge jDqw== X-Gm-Message-State: AOAM532Ga5Z0wLt8pB9nJORWauo03f1mCT7zF1n+LEd6xQGoKolz/RRV IwM75WlWQs6QaAPPjDlptwI= X-Google-Smtp-Source: ABdhPJzWaCyTch/tKZaigzg6FeIVXvEYSGeGYLTEGOfVr8Nx/Z0PJIexqASFa5s5Jud5vcYxyB44VA== X-Received: by 2002:a05:6638:da:: with SMTP id w26mr2363943jao.137.1601479231865; Wed, 30 Sep 2020 08:20:31 -0700 (PDT) Received: from localhost.localdomain (ip-99-203-15-156.pools.cgn.spcsdns.net. 
[99.203.15.156]) by smtp.gmail.com with ESMTPSA id t10sm770788iog.49.2020.09.30.08.20.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 30 Sep 2020 08:20:31 -0700 (PDT) From: YiFei Zhu To: containers@lists.linux-foundation.org Cc: YiFei Zhu , bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Aleksa Sarai , Andrea Arcangeli , Andy Lutomirski , David Laight , Dimitrios Skarlatos , Giuseppe Scrivano , Hubertus Franke , Jack Chen , Jann Horn , Josep Torrellas , Kees Cook , Tianyin Xu , Tobin Feldman-Fitzthum , Tycho Andersen , Valentin Rothberg , Will Drewry Subject: [PATCH v3 seccomp 3/5] seccomp/cache: Lookup syscall allowlist for fast path Date: Wed, 30 Sep 2020 10:19:14 -0500 Message-Id: <83c72471f9f79fa982508bd4db472686a67b8320.1601478774.git.yifeifz2@illinois.edu> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org From: YiFei Zhu The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. This first finds the current allow bitmask by iterating through syscall_arches[] array and comparing it to the one in struct seccomp_data; this loop is expected to be unrolled. It then does a test_bit against the bitmask. If the bit is set, then there is no need to run the full filter; it returns SECCOMP_RET_ALLOW immediately. 
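The lookup described above amounts to a bounds check followed by a single bit test. A minimal user-space sketch of that idea, under stated assumptions: DEMO_NR_SYSCALLS, struct allow_bitmap and both helpers are invented for the example, and the speculation barrier that the kernel code applies with array_index_nospec() is left out here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_SYSCALLS 512	/* hypothetical table size, not a kernel constant */
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS	((DEMO_NR_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per syscall number: set means "filter always allows this syscall". */
struct allow_bitmap {
	unsigned long words[BITMAP_WORDS];
};

/* Zero-initialised demo instance: nothing is cached as allowed at first. */
static struct allow_bitmap demo_allow;

/* Mark syscall number nr as always allowed (done once, when a filter is attached). */
static void allow_bitmap_set(struct allow_bitmap *bm, int nr)
{
	assert(nr >= 0 && nr < DEMO_NR_SYSCALLS);
	bm->words[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/*
 * Fast-path check (done on every syscall): out-of-range numbers are never
 * treated as cached, so the caller falls back to running the full filter.
 */
static bool allow_bitmap_check(const struct allow_bitmap *bm, int nr)
{
	if (nr < 0 || nr >= DEMO_NR_SYSCALLS)
		return false;
	return (bm->words[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}
```

A false result only means "not cached"; it says nothing about whether the full filter would allow or deny the syscall.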
Co-developed-by: Dimitrios Skarlatos Signed-off-by: Dimitrios Skarlatos Signed-off-by: YiFei Zhu --- kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/kernel/seccomp.c b/kernel/seccomp.c index f09c9e74ae05..bed3b2a7f6c8 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -172,6 +172,12 @@ struct seccomp_cache_filter_data { }; static inline void seccomp_cache_prepare(struct seccomp_filter *sfilter) { } + +static inline bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + return false; +} #endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ /** @@ -331,6 +337,49 @@ static int seccomp_check_filter(struct sock_filter *filter, unsigned int flen) return 0; } +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY +static bool seccomp_cache_check_bitmap(const void *bitmap, size_t bitmap_size, + int syscall_nr) +{ + if (unlikely(syscall_nr < 0 || syscall_nr >= bitmap_size)) + return false; + syscall_nr = array_index_nospec(syscall_nr, bitmap_size); + + return test_bit(syscall_nr, bitmap); +} + +/** + * seccomp_cache_check - lookup seccomp cache + * @sfilter: The seccomp filter + * @sd: The seccomp data to lookup the cache with + * + * Returns true if the seccomp_data is cached and allowed. 
+ */ +static bool seccomp_cache_check(const struct seccomp_filter *sfilter, + const struct seccomp_data *sd) +{ + int syscall_nr = sd->nr; + const struct seccomp_cache_filter_data *cache = &sfilter->cache; + +#ifdef SECCOMP_ARCH_DEFAULT + if (likely(sd->arch == SECCOMP_ARCH_DEFAULT)) + return seccomp_cache_check_bitmap(cache->syscall_allow_default, + SECCOMP_ARCH_DEFAULT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_DEFAULT */ + +#ifdef SECCOMP_ARCH_COMPAT + if (likely(sd->arch == SECCOMP_ARCH_COMPAT)) + return seccomp_cache_check_bitmap(cache->syscall_allow_compat, + SECCOMP_ARCH_COMPAT_NR, + syscall_nr); +#endif /* SECCOMP_ARCH_COMPAT */ + + WARN_ON_ONCE(true); + return false; +} +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */ + /** * seccomp_run_filters - evaluates all seccomp filters against @sd * @sd: optional seccomp data to be passed to filters @@ -353,6 +402,9 @@ static u32 seccomp_run_filters(const struct seccomp_data *sd, if (WARN_ON(f == NULL)) return SECCOMP_RET_KILL_PROCESS; + if (seccomp_cache_check(f, sd)) + return SECCOMP_RET_ALLOW; + /* * All filters in the list are evaluated and the lowest BPF return * value always takes priority (ignoring the DATA). 
From patchwork Wed Sep 30 15:19:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: YiFei Zhu X-Patchwork-Id: 11809465 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 5A738618 for ; Wed, 30 Sep 2020 15:28:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 281A32071E for ; Wed, 30 Sep 2020 15:28:01 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="d713WqGD" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730519AbgI3PUp (ORCPT ); Wed, 30 Sep 2020 11:20:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40856 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729395AbgI3PUg (ORCPT ); Wed, 30 Sep 2020 11:20:36 -0400 Received: from mail-io1-xd43.google.com (mail-io1-xd43.google.com [IPv6:2607:f8b0:4864:20::d43]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 18BF3C061755; Wed, 30 Sep 2020 08:20:36 -0700 (PDT) Received: by mail-io1-xd43.google.com with SMTP id z13so2191886iom.8; Wed, 30 Sep 2020 08:20:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=CqHR3iMjvi/+BrsGXxXP4f66oyzEMJKnPL1FtLSH8L8=; b=d713WqGDCNpGjBvAOshn0SivS92z6rxcw8GKKckeY3VNOinC44xgP3uWRFu/+Xz+sy YvKgTk9LhXN6gbsUAnDE/rUFNfwsfMu4cmCa6xEjDI1k/cLU1/VnkcNP6gIDik19Z92y snHokw11f7ndkL8ALgMO+eqmOs9+9bk1FW5Nw6UZOdke5K6KNy99IQn1kiztsKvkRQAC U+/KHgr7bEt6ibWOe/jxKZY7DHpguxEHGngrFAvo1CGhHHtzCJp7QAVreX4oDspyABaD HAPm58H9wdqXC6JaeMNL1vYW6yILYMOF2Cq3d+fR1V0g0YzXoIwjMYMtk/UpYtYAH+Hc WjgQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=CqHR3iMjvi/+BrsGXxXP4f66oyzEMJKnPL1FtLSH8L8=; b=Ahiz61DjDHM0XwqrEErye4COYzUjhVaDT9lqs87xyHiNzdg843XPwTllljkF+ll+tg NRTBeClQC+wdkKxzI7Efq+pNICS4wvF+HNd8+V2GvkJuK3BsJMeFAhXt4iazYvjdVn7Y xJLmGaz8q332G0YMqExcdc383//PatmsWY+pJQvUPEI1tl+tDDAhVKeZGypLyo4nAErK bcYpvZ9OGLH4m6I3yBv6GMXf+GlcvZlY3TKZjWOiJ580ksEs1LWOccO3X3k0n9YJ3RVq a5zwAErkLKhAOy70xFSnn3nt2NxndCuPd0Wh9UEwHOcoTCNUaWf0db2CT4pUh9mGKmP6 S+Lg== X-Gm-Message-State: AOAM532aMJzmOOQw1pDW+ekcXBTnN0vdu2csbexPnK5Gdse0wVTG4PJL QeLxwJ9sIszHFprgOSt4BOg= X-Google-Smtp-Source: ABdhPJzZB8L46jzftLd5djIuk3m9qq3lXlQEDaXsco/Ve7fJaCZ1l/d+c4jPWYiXOfyIhv74FtKxlg== X-Received: by 2002:a02:b149:: with SMTP id s9mr2430538jah.80.1601479235320; Wed, 30 Sep 2020 08:20:35 -0700 (PDT) Received: from localhost.localdomain (ip-99-203-15-156.pools.cgn.spcsdns.net. [99.203.15.156]) by smtp.gmail.com with ESMTPSA id t10sm770788iog.49.2020.09.30.08.20.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 30 Sep 2020 08:20:34 -0700 (PDT) From: YiFei Zhu To: containers@lists.linux-foundation.org Cc: YiFei Zhu , bpf@vger.kernel.org, linux-kernel@vger.kernel.org, Aleksa Sarai , Andrea Arcangeli , Andy Lutomirski , David Laight , Dimitrios Skarlatos , Giuseppe Scrivano , Hubertus Franke , Jack Chen , Jann Horn , Josep Torrellas , Kees Cook , Tianyin Xu , Tobin Feldman-Fitzthum , Tycho Andersen , Valentin Rothberg , Will Drewry Subject: [PATCH v3 seccomp 4/5] selftests/seccomp: Compare bitmap vs filter overhead Date: Wed, 30 Sep 2020 10:19:15 -0500 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org From: Kees Cook As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better. 
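The estimates that these expectations compare are plain differences between the measured per-call times. A minimal sketch of that arithmetic, using the same formulas as the ESTIMATE() calls in this patch; the struct and function names are invented for the example.

```c
#include <assert.h>

/* Illustrative container for the three derived figures (all in ns). */
struct seccomp_estimates {
	unsigned long long entry;		/* cost of entering seccomp at all */
	unsigned long long per_filter_diff;	/* from the last two measurements */
	unsigned long long per_filter_avg;	/* averaged over four filters */
};

/*
 * All inputs are average ns per getpid() call. The subtractions are
 * unsigned, so a noisy run can wrap around; the benchmark itself treats
 * any result above INT_MAX as a miscalculation and asks for more samples.
 */
static struct seccomp_estimates
estimate_overheads(unsigned long long native,
		   unsigned long long bitmap1, unsigned long long bitmap2,
		   unsigned long long filter1, unsigned long long filter2)
{
	struct seccomp_estimates e;

	e.entry = bitmap1 - native - (bitmap2 - bitmap1);
	e.per_filter_diff = filter2 - filter1;
	e.per_filter_avg = (filter2 - native - e.entry) / 4;
	return e;
}
```

Fed the sample x86 figures quoted in this message (646, 675, 675, 732 and 752 ns), this yields an entry overhead of 29 ns and per-filter overheads of 20 ns and 19 ns.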
Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
	native ≤ 1 bitmap (646 ≤ 675): ✔️
	native ≤ 1 filter (646 ≤ 732): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

Signed-off-by: Kees Cook
[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: YiFei Zhu
---
 .../selftests/seccomp/seccomp_benchmark.c | 151 +++++++++++++++---
 tools/testing/selftests/seccomp/settings  |   2 +-
 2 files changed, 130 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/seccomp/seccomp_benchmark.c b/tools/testing/selftests/seccomp/seccomp_benchmark.c
index 91f5a89cadac..fcc806585266 100644
--- a/tools/testing/selftests/seccomp/seccomp_benchmark.c
+++ b/tools/testing/selftests/seccomp/seccomp_benchmark.c
@@ -4,12 +4,16 @@
  */
 #define _GNU_SOURCE
 #include <assert.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <unistd.h>
 #include <linux/filter.h>
 #include <linux/seccomp.h>
+#include <sys/param.h>
 #include <sys/prctl.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
@@ -70,18 +74,74 @@ unsigned long long calibrate(void)
 	return samples * seconds;
 }
 
+bool approx(int i_one, int i_two)
+{
+	double one = i_one, one_bump = one * 0.01;
+	double two = i_two, two_bump = two * 0.01;
+
+	one_bump = one + MAX(one_bump, 2.0);
+	two_bump = two + MAX(two_bump, 2.0);
+
+	/* Equal to, or within 1% or 2 digits */
+	if (one == two ||
+	    (one > two && one <= two_bump) ||
+	    (two > one && two <= one_bump))
+		return true;
+	return false;
+}
+
+bool le(int i_one, int i_two)
+{
+	if (i_one <= i_two)
+		return true;
+	return false;
+}
+
+long compare(const char *name_one, const char *name_eval, const char *name_two,
+	     unsigned long long one, bool (*eval)(int, int), unsigned long long two)
+{
+	bool good;
+
+	printf("\t%s %s %s (%lld %s %lld): ", name_one, name_eval, name_two,
+	       (long long)one, name_eval, (long long)two);
+	if (one > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)one);
+		return 1;
+	}
+	if (two > INT_MAX) {
+		printf("Miscalculation! Measurement went negative: %lld\n", (long long)two);
+		return 1;
+	}
+
+	good = eval(one, two);
+	printf("%s\n", good ? "✔️" : "❌");
+
+	return good ? 0 : 1;
+}
+
 int main(int argc, char *argv[])
 {
+	struct sock_filter bitmap_filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
+		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+	};
+	struct sock_fprog bitmap_prog = {
+		.len = (unsigned short)ARRAY_SIZE(bitmap_filter),
+		.filter = bitmap_filter,
+	};
 	struct sock_filter filter[] = {
+		BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, args[0])),
 		BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
 	};
 	struct sock_fprog prog = {
 		.len = (unsigned short)ARRAY_SIZE(filter),
 		.filter = filter,
 	};
-	long ret;
-	unsigned long long samples;
-	unsigned long long native, filter1, filter2;
+
+	long ret, bits;
+	unsigned long long samples, calc;
+	unsigned long long native, filter1, filter2, bitmap1, bitmap2;
+	unsigned long long entry, per_filter1, per_filter2;
 
 	printf("Current BPF sysctl settings:\n");
 	system("sysctl net.core.bpf_jit_enable");
@@ -101,35 +161,82 @@ int main(int argc, char *argv[])
 	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
 	assert(ret == 0);
 
-	/* One filter */
-	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
+	/* One filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
 	assert(ret == 0);
 
-	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 1 filter: %llu ns\n", filter1);
+	bitmap1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 1 filter (bitmap): %llu ns\n", bitmap1);
+
+	/* Second filter resulting in a bitmap */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	if (filter1 == native)
-		printf("No overhead measured!? Try running again with more samples.\n");
+	bitmap2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 2 filters (bitmap): %llu ns\n", bitmap2);
 
-	/* Two filters */
+	/* Third filter, can no longer be converted to bitmap */
 	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
 	assert(ret == 0);
 
-	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
-	printf("getpid RET_ALLOW 2 filters: %llu ns\n", filter2);
-
-	/* Calculations */
-	printf("Estimated total seccomp overhead for 1 filter: %llu ns\n",
-			filter1 - native);
+	filter1 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 3 filters (full): %llu ns\n", filter1);
 
-	printf("Estimated total seccomp overhead for 2 filters: %llu ns\n",
-			filter2 - native);
+	/* Fourth filter, can not be converted to bitmap because of filter 3 */
+	ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bitmap_prog);
+	assert(ret == 0);
 
-	printf("Estimated seccomp per-filter overhead: %llu ns\n",
-			filter2 - filter1);
+	filter2 = timing(CLOCK_PROCESS_CPUTIME_ID, samples) / samples;
+	printf("getpid RET_ALLOW 4 filters (full): %llu ns\n", filter2);
+
+	/* Estimations */
+#define ESTIMATE(fmt, var, what)	do {			\
+		var = (what);					\
+		printf("Estimated " fmt ": %llu ns\n", var);	\
+		if (var > INT_MAX)				\
+			goto more_samples;			\
+	} while (0)
+
+	ESTIMATE("total seccomp overhead for 1 bitmapped filter", calc,
+		 bitmap1 - native);
+	ESTIMATE("total seccomp overhead for 2 bitmapped filters", calc,
+		 bitmap2 - native);
+	ESTIMATE("total seccomp overhead for 3 full filters", calc,
+		 filter1 - native);
+	ESTIMATE("total seccomp overhead for 4 full filters", calc,
+		 filter2 - native);
+	ESTIMATE("seccomp entry overhead", entry,
+		 bitmap1 - native - (bitmap2 - bitmap1));
+	ESTIMATE("seccomp per-filter overhead (last 2 diff)", per_filter1,
+		 filter2 - filter1);
+	ESTIMATE("seccomp per-filter overhead (filters / 4)", per_filter2,
+		 (filter2 - native - entry) / 4);
+
+	printf("Expectations:\n");
+	ret |= compare("native", "≤", "1 bitmap", native, le, bitmap1);
+	bits = compare("native", "≤", "1 filter", native, le, filter1);
+	if (bits)
+		goto more_samples;
+
+	ret |= compare("per-filter (last 2 diff)", "≈", "per-filter (filters / 4)",
+			per_filter1, approx, per_filter2);
+
+	bits = compare("1 bitmapped", "≈", "2 bitmapped",
+			bitmap1 - native, approx, bitmap2 - native);
+	if (bits) {
+		printf("Skipping constant action bitmap expectations: they appear unsupported.\n");
+		goto out;
+	}
 
-	printf("Estimated seccomp entry overhead: %llu ns\n",
-			filter1 - native - (filter2 - filter1));
+	ret |= compare("entry", "≈", "1 bitmapped", entry, approx, bitmap1 - native);
+	ret |= compare("entry", "≈", "2 bitmapped", entry, approx, bitmap2 - native);
+	ret |= compare("native + entry + (per filter * 4)", "≈", "4 filters total",
+			entry + (per_filter1 * 4) + native, approx, filter2);
+	if (ret == 0)
+		goto out;
 
+more_samples:
+	printf("Saw unexpected benchmark result. Try running again with more samples?\n");
+out:
 	return 0;
 }
diff --git a/tools/testing/selftests/seccomp/settings b/tools/testing/selftests/seccomp/settings
index ba4d85f74cd6..6091b45d226b 100644
--- a/tools/testing/selftests/seccomp/settings
+++ b/tools/testing/selftests/seccomp/settings
@@ -1 +1 @@
-timeout=90
+timeout=120

From patchwork Wed Sep 30 15:19:16 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: YiFei Zhu
X-Patchwork-Id: 11809469
From: YiFei Zhu
To: containers@lists.linux-foundation.org
Cc: YiFei Zhu, bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
 Aleksa Sarai, Andrea Arcangeli, Andy Lutomirski, David Laight,
 Dimitrios Skarlatos, Giuseppe Scrivano, Hubertus Franke, Jack Chen,
 Jann Horn, Josep Torrellas, Kees Cook, Tianyin Xu,
 Tobin Feldman-Fitzthum, Tycho Andersen, Valentin Rothberg, Will Drewry
Subject: [PATCH v3 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache
Date: Wed, 30 Sep 2020 10:19:16 -0500
X-Mailer: git-send-email 2.28.0
X-Mailing-List: bpf@vger.kernel.org

From: YiFei Zhu

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_DEBUG_SECCOMP_CACHE with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.
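
The line format described above can be consumed with a few lines of C. The following is only a sketch under stated assumptions: `struct cache_tally` and `tally_cache_line()` are hypothetical names invented for illustration and are not part of this patch or of any kernel header, and since the Kconfig help notes that the file format is subject to change, any real tool should treat unparsable lines as an error rather than guess.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-architecture tally; not a kernel type. */
struct cache_tally {
	unsigned int allow;	/* syscalls the cache allows outright */
	unsigned int filter;	/* syscalls still passed to the BPF filter */
};

/*
 * Parse one line of the form "<arch name> <nr> <ALLOW | FILTER>" and
 * update the tally. Returns 0 on success, -1 if the line does not match.
 */
static int tally_cache_line(const char *line, struct cache_tally *t)
{
	char arch[32], status[8];
	int nr;

	assert(line && t);
	/* Field widths keep sscanf() inside the two local buffers. */
	if (sscanf(line, "%31s %d %7s", arch, &nr, status) != 3 || nr < 0)
		return -1;
	if (strcmp(status, "ALLOW") == 0)
		t->allow++;
	else if (strcmp(status, "FILTER") == 0)
		t->filter++;
	else
		return -1;
	return 0;
}
```

A caller would typically read /proc/pid/seccomp_cache line by line (for example with fgets()) and pass each line to tally_cache_line(); the ratio of ALLOW to FILTER entries then gives a rough idea of how many syscalls skip the BPF filter for that task.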

Suggested-by: Jann Horn
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu
---
 arch/Kconfig                   | 15 +++++++++++
 arch/x86/include/asm/seccomp.h |  3 +++
 fs/proc/base.c                 |  3 +++
 include/linux/seccomp.h        |  5 ++++
 kernel/seccomp.c               | 46 ++++++++++++++++++++++++++++++++++
 5 files changed, 72 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index ca867b2a5d71..b840cadcc882 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -478,6 +478,7 @@ config HAVE_ARCH_SECCOMP_CACHE_NR_ONLY
 	  - all the requirements for HAVE_ARCH_SECCOMP_FILTER
 	  - SECCOMP_ARCH_DEFAULT
 	  - SECCOMP_ARCH_DEFAULT_NR
+	  - SECCOMP_ARCH_DEFAULT_NAME
 
 config SECCOMP
 	prompt "Enable seccomp to safely execute untrusted bytecode"
@@ -532,6 +533,20 @@ config SECCOMP_CACHE_NR_ONLY
 
 endchoice
 
+config DEBUG_SECCOMP_CACHE
+	bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+	depends on SECCOMP_CACHE_NR_ONLY
+	depends on PROC_FS
+	help
+	  This enables the /proc/pid/seccomp_cache interface to monitor
+	  seccomp cache data. The file format is subject to change. Reading
+	  the file requires CAP_SYS_ADMIN.
+
+	  This option is for debugging only. Enabling it presents the risk
+	  that an adversary may be able to infer the seccomp filter logic.
+
+	  If unsure, say N.
+
 config HAVE_ARCH_STACKLEAK
 	bool
 	help
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 7b3a58271656..33ccc074be7a 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -19,13 +19,16 @@
 #ifdef CONFIG_X86_64
 # define SECCOMP_ARCH_DEFAULT			AUDIT_ARCH_X86_64
 # define SECCOMP_ARCH_DEFAULT_NR		NR_syscalls
+# define SECCOMP_ARCH_DEFAULT_NAME		"x86_64"
 # ifdef CONFIG_COMPAT
 #  define SECCOMP_ARCH_COMPAT			AUDIT_ARCH_I386
 #  define SECCOMP_ARCH_COMPAT_NR		IA32_NR_syscalls
+#  define SECCOMP_ARCH_COMPAT_NAME		"x86_32"
 # endif
 #else /* !CONFIG_X86_64 */
 # define SECCOMP_ARCH_DEFAULT			AUDIT_ARCH_I386
 # define SECCOMP_ARCH_DEFAULT_NR		NR_syscalls
+# define SECCOMP_ARCH_DEFAULT_NAME		"x86_32"
 #endif
 
 #include <asm-generic/seccomp.h>
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..c60c5fce70fa 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
 #endif
+#ifdef CONFIG_DEBUG_SECCOMP_CACHE
+	ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
 };
 
 static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..c35430f5f553 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task,
 	return -EINVAL;
 }
 #endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_DEBUG_SECCOMP_CACHE
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task);
+#endif
 #endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index bed3b2a7f6c8..c5ca5e30281b 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2297,3 +2297,49 @@ static int __init seccomp_sysctl_init(void)
 device_initcall(seccomp_sysctl_init)
 
 #endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_DEBUG_SECCOMP_CACHE
+/* Currently CONFIG_DEBUG_SECCOMP_CACHE implies CONFIG_SECCOMP_CACHE_NR_ONLY */
+static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,
+					const void *bitmap, size_t bitmap_size)
+{
+	int nr;
+
+	for (nr = 0; nr < bitmap_size; nr++) {
+		bool cached = test_bit(nr, bitmap);
+		char *status = cached ? "ALLOW" : "FILTER";
+
+		seq_printf(m, "%s %d %s\n", name, nr, status);
+	}
+}
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+			   struct pid *pid, struct task_struct *task)
+{
+	struct seccomp_filter *f;
+
+	/*
+	 * We don't want some sandboxed process to know what their seccomp
+	 * filters consist of.
+	 */
+	if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
+		return -EACCES;
+
+	f = READ_ONCE(task->seccomp.filter);
+	if (!f)
+		return 0;
+
+#ifdef SECCOMP_ARCH_DEFAULT
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_DEFAULT_NAME,
+				    f->cache.syscall_allow_default,
+				    SECCOMP_ARCH_DEFAULT_NR);
+#endif /* SECCOMP_ARCH_DEFAULT */
+
+#ifdef SECCOMP_ARCH_COMPAT
+	proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
+				    f->cache.syscall_allow_compat,
+				    SECCOMP_ARCH_COMPAT_NR);
+#endif /* SECCOMP_ARCH_COMPAT */
+	return 0;
+}
+#endif /* CONFIG_DEBUG_SECCOMP_CACHE */
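
For readers who want to experiment with the output format outside the kernel, the loop in proc_pid_seccomp_cache_arch() can be approximated in user space. This is a minimal sketch, not the kernel code: `bit_is_set()` and `dump_cache()` are hypothetical stand-ins for test_bit() and the seq_printf() loop, and the bitmap layout (bit nr stored in word nr / bits-per-long) is assumed to mirror the kernel's generic bitmap convention.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* User-space stand-in for the kernel's test_bit(); illustrative only. */
static int bit_is_set(const unsigned long *bitmap, size_t nr)
{
	return (bitmap[nr / BITS_PER_ULONG] >> (nr % BITS_PER_ULONG)) & 1UL;
}

/*
 * Write one "<arch name> <nr> <ALLOW | FILTER>" line per syscall number,
 * as the patch does with seq_printf(). Returns the number of lines written.
 */
static size_t dump_cache(FILE *out, const char *name,
			 const unsigned long *bitmap, size_t nr_syscalls)
{
	size_t nr;

	assert(out && name && bitmap);
	for (nr = 0; nr < nr_syscalls; nr++)
		fprintf(out, "%s %zu %s\n", name, nr,
			bit_is_set(bitmap, nr) ? "ALLOW" : "FILTER");
	return nr;
}
```

Calling dump_cache(stdout, "x86_64", bitmap, n) with a bitmap whose set bits mark always-allowed syscalls produces lines in the same shape as the docker example in the commit message, which can be handy for testing parsers without a CONFIG_DEBUG_SECCOMP_CACHE kernel.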