From patchwork Fri Mar 31 23:50:39 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ackerley Tng
X-Patchwork-Id: 13196693
Date: Fri, 31 Mar 2023 23:50:39 +0000
X-Mailer: git-send-email 2.40.0.348.gf938b09366-goog
Message-ID: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com>
Subject: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify
 mount for memfd_restricted
From: Ackerley Tng
To: kvm@vger.kernel.org, linux-api@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, qemu-devel@nongnu.org
Cc: aarcange@redhat.com, ak@linux.intel.com, akpm@linux-foundation.org,
 arnd@arndb.de, bfields@fieldses.org, bp@alien8.de,
 chao.p.peng@linux.intel.com, corbet@lwn.net, dave.hansen@intel.com,
 david@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, hpa@zytor.com,
 hughd@google.com, jlayton@kernel.org, jmattson@google.com, joro@8bytes.org,
 jun.nakajima@intel.com, kirill.shutemov@linux.intel.com,
 linmiaohe@huawei.com, luto@kernel.org, mail@maciej.szmigiero.name,
 mhocko@suse.com, michael.roth@amd.com, mingo@redhat.com,
 naoya.horiguchi@nec.com, pbonzini@redhat.com, qperret@google.com,
 rppt@kernel.org, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
 tabba@google.com, tglx@linutronix.de, vannapurve@google.com, vbabka@suse.cz,
 vkuznets@redhat.com, wanpengli@tencent.com, wei.w.wang@intel.com,
 x86@kernel.org, yu.c.zhang@linux.intel.com, Ackerley Tng

By default, the backing shmem file for a restrictedmem fd is created
on shmem's kernel space mount.

With this patch, an optional tmpfs mount can be specified via an fd,
which will be used as the mountpoint for backing the shmem file
associated with a restrictedmem fd.

This lets restrictedmem fds inherit the properties of the provided
tmpfs mount, for example hugepage allocation hints and NUMA binding
hints.

Permissions for the fd passed to memfd_restricted() are modeled after
the openat() syscall, since both of these allow creation of a file
upon a mount/directory.

Permission to reference the mount the fd represents is checked upon
fd creation by other syscalls (e.g. fsmount(), open(), or
open_tree()), and any process that can present memfd_restricted()
with a valid fd is expected to have obtained permission to use the
mount represented by the fd. This behavior is intended to parallel
that of the openat() syscall.

memfd_restricted() will check that the tmpfs superblock is writable,
and that the mount is also writable, before attempting to create a
restrictedmem file on the mount.
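
For illustration only, the argument handling of the two-argument
memfd_restricted() can be modeled in userspace roughly as below. This is
a minimal sketch, not kernel code: rmfd_classify(), rmfd_classify_demo()
and enum rmfd_target are hypothetical names that do not appear in the
patch, and the RMFD_USERMNT value is copied from the new uapi header.
The model covers only the flag and mount_fd checks, not the fd lookup,
the tmpfs and mount-root checks, or the write permission checks.

```c
#include <assert.h>
#include <errno.h>

/* Value copied from include/uapi/linux/restrictedmem.h in this patch. */
#define RMFD_USERMNT 0x0001U

/* Hypothetical names, used only by this sketch. */
enum rmfd_target {
	RMFD_TARGET_KERNEL_MNT,	/* default: shmem's kernel space mount */
	RMFD_TARGET_USER_MNT,	/* tmpfs mount referenced by mount_fd */
};

/*
 * Model of the argument checks in the syscall entry point.
 * Returns the selected target (>= 0) or -EINVAL.
 */
static int rmfd_classify(unsigned int flags, int mount_fd)
{
	/* Any flag other than RMFD_USERMNT is rejected. */
	if (flags & ~RMFD_USERMNT)
		return -EINVAL;

	if (flags == RMFD_USERMNT) {
		/* A user mount was requested, so the fd must be plausible. */
		if (mount_fd < 0)
			return -EINVAL;
		return RMFD_TARGET_USER_MNT;
	}

	/* flags == 0: mount_fd is not looked at. */
	return RMFD_TARGET_KERNEL_MNT;
}

/* Small self-check covering each branch of the model. */
static void rmfd_classify_demo(void)
{
	assert(rmfd_classify(0, -1) == RMFD_TARGET_KERNEL_MNT);
	assert(rmfd_classify(RMFD_USERMNT, 3) == RMFD_TARGET_USER_MNT);
	assert(rmfd_classify(RMFD_USERMNT, -1) == -EINVAL);
	assert(rmfd_classify(0x2U, 3) == -EINVAL);
}
```

Under this model, callers that pass flags == 0 keep the old behavior
regardless of what they put in mount_fd.
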

Signed-off-by: Ackerley Tng
---
 include/linux/syscalls.h           |  2 +-
 include/uapi/linux/restrictedmem.h |  8 ++++
 mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
 3 files changed, 77 insertions(+), 7 deletions(-)
 create mode 100644 include/uapi/linux/restrictedmem.h

-- 
2.40.0.348.gf938b09366-goog

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f9e9e0c820c5..a23c4c385cd3 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
-asmlinkage long sys_memfd_restricted(unsigned int flags);
+asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
new file mode 100644
index 000000000000..22d6f2285f6d
--- /dev/null
+++ b/include/uapi/linux/restrictedmem.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
+#define _UAPI_LINUX_RESTRICTEDMEM_H
+
+/* flags for memfd_restricted */
+#define RMFD_USERMNT		0x0001U
+
+#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
index c5d869d8c2d8..f7b62364a31a 100644
--- a/mm/restrictedmem.c
+++ b/mm/restrictedmem.c
@@ -1,11 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0
-#include "linux/sbitmap.h"
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
 #include
 
 struct restrictedmem {
@@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
 	return file;
 }
 
-SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
+static int restrictedmem_create(struct vfsmount *mount)
 {
 	struct file *file, *restricted_file;
 	int fd, err;
 
-	if (flags)
-		return -EINVAL;
-
 	fd = get_unused_fd_flags(0);
 	if (fd < 0)
 		return fd;
 
-	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+	if (mount)
+		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
+	else
+		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
+
 	if (IS_ERR(file)) {
 		err = PTR_ERR(file);
 		goto err_fd;
@@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
 	return err;
 }
 
+static bool is_shmem_mount(struct vfsmount *mnt)
+{
+	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
+}
+
+static bool is_mount_root(struct file *file)
+{
+	return file->f_path.dentry == file->f_path.mnt->mnt_root;
+}
+
+static int restrictedmem_create_on_user_mount(int mount_fd)
+{
+	int ret;
+	struct fd f;
+	struct vfsmount *mnt;
+
+	f = fdget_raw(mount_fd);
+	if (!f.file)
+		return -EBADF;
+
+	ret = -EINVAL;
+	if (!is_mount_root(f.file))
+		goto out;
+
+	mnt = f.file->f_path.mnt;
+	if (!is_shmem_mount(mnt))
+		goto out;
+
+	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
+	if (ret)
+		goto out;
+
+	ret = mnt_want_write(mnt);
+	if (unlikely(ret))
+		goto out;
+
+	ret = restrictedmem_create(mnt);
+
+	mnt_drop_write(mnt);
+out:
+	fdput(f);
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
+{
+	if (flags & ~RMFD_USERMNT)
+		return -EINVAL;
+
+	if (flags == RMFD_USERMNT) {
+		if (mount_fd < 0)
+			return -EINVAL;
+
+		return restrictedmem_create_on_user_mount(mount_fd);
+	} else {
+		return restrictedmem_create(NULL);
+	}
+}
+
 int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
 		       struct restrictedmem_notifier *notifier, bool exclusive)
 {