[v8,3/7] kernel: Implement selective syscall userspace redirection

Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])

The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector.  This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.

Cc: Matthew Wilcox <willy@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paul Gofman <gofmanp@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-api@vger.kernel.org
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

---
Changes since v7:
  - Correct half-open interval in commit message (PeterZ)
  - Solve rebase conflicts

Changes since v6:
  (Matthew Wilcox)
  - Use unsigned long for mode
  (peterZ)
  - Change interface to {offset,len}
  - Use SYSCALL_WORK interface instead of TIF flags

Changes since v4:
  (Andy Lutomirski)
  - Allow sigreturn coming from vDSO
  - Exit with SIGSYS instead of SIGSEGV on bad selector
  (Thomas Gleixner)
  - Use sizeof selector in access_ok
  - Document usage of __get_user
  - Use constant for state value
  - Split out x86 parts
  - Rebase on top of Gleixner's common entry code
  - Don't expose do_syscall_user_dispatch

Changes since v3:
  - NTR.

Changes since v2:
  (Matthew Wilcox suggestions)
  - Drop __user on non-ptr type.
  - Move #define closer to similar defs
  - Allow a memory region that can dispatch directly
  (Kees Cook suggestions)
  - Improve kconfig summary line
  - Move flag cleanup on execve to begin_new_exec
  - Hint branch predictor in the syscall path
  (Me)
  - Convert selector to char

Changes since RFC:
  (Kees Cook suggestions)
  - Don't mention personality while explaining the feature
  - Use syscall_get_nr
  - Remove header guard on several places
  - Convert WARN_ON to WARN_ON_ONCE
  - Explicit check for state values
  - Rename to syscall user dispatcher
---
 fs/exec.c                             |   3 +
 include/linux/sched.h                 |   2 +
 include/linux/syscall_user_dispatch.h |  40 ++++++++++
 include/linux/thread_info.h           |   2 +
 include/uapi/linux/prctl.h            |   5 ++
 kernel/entry/Makefile                 |   2 +-
 kernel/entry/common.h                 |  16 ++++
 kernel/entry/syscall_user_dispatch.c  | 102 ++++++++++++++++++++++++++
 kernel/fork.c                         |   1 +
 kernel/sys.c                          |   5 ++
 10 files changed, 177 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/syscall_user_dispatch.h
 create mode 100644 kernel/entry/common.h
 create mode 100644 kernel/entry/syscall_user_dispatch.c

Message ID	20201127193238.821364-4-krisman@collabora.com (mailing list archive)
State	New
Headers	show Return-Path: <linux-kselftest-owner@kernel.org> X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY,URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 65FFDC64E8A for <linux-kselftest@archiver.kernel.org>; Sat, 28 Nov 2020 22:13:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3043B22240 for <linux-kselftest@archiver.kernel.org>; Sat, 28 Nov 2020 22:13:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2387517AbgK1Vt5 (ORCPT <rfc822;linux-kselftest@archiver.kernel.org>); Sat, 28 Nov 2020 16:49:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55560 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730635AbgK0TwS (ORCPT <rfc822;linux-kselftest@vger.kernel.org>); Fri, 27 Nov 2020 14:52:18 -0500 Received: from bhuna.collabora.co.uk (bhuna.collabora.co.uk [IPv6:2a00:1098:0:82:1000:25:2eeb:e3e3]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CD44BC08E862; Fri, 27 Nov 2020 11:38:08 -0800 (PST) Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: krisman) with ESMTPSA id 9BA6C1F465A1 From: Gabriel Krisman Bertazi <krisman@collabora.com> To: luto@kernel.org, tglx@linutronix.de, keescook@chromium.org Cc: gofmanp@gmail.com, christian.brauner@ubuntu.com, peterz@infradead.org, willy@infradead.org, shuah@kernel.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org, linux-kselftest@vger.kernel.org, x86@kernel.org, Gabriel Krisman Bertazi <krisman@collabora.com>, kernel@collabora.com Subject: [PATCH v8 3/7] kernel: Implement selective syscall userspace redirection Date: Fri, 27 Nov 2020 14:32:34 -0500 Message-Id: <20201127193238.821364-4-krisman@collabora.com> X-Mailer: git-send-email 2.29.2 In-Reply-To: <20201127193238.821364-1-krisman@collabora.com> References: <20201127193238.821364-1-krisman@collabora.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: <linux-kselftest.vger.kernel.org> X-Mailing-List: linux-kselftest@vger.kernel.org
Series	Syscall User Dispatch \| expand [v8,0/7] Syscall User Dispatch [v8,1/7] x86: vdso: Expose sigreturn address on vdso to the kernel [v8,2/7] signal: Expose SYS_USER_DISPATCH si_code type [v8,3/7] kernel: Implement selective syscall userspace redirection [v8,4/7] entry: Support Syscall User Dispatch on common syscall entry [v8,5/7] selftests: Add kselftest for syscall user dispatch [v8,6/7] selftests: Add benchmark for syscall user dispatch [v8,7/7] docs: Document Syscall User Dispatch

[v8,3/7] kernel: Implement selective syscall userspace redirection

Commit Message

Comments

Patch