From patchwork Tue Feb 13 16:45:46 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Christian Brauner
X-Patchwork-Id: 13555377
From: Christian Brauner
Date: Tue, 13 Feb 2024 17:45:46 +0100
Subject: [PATCH 1/2] pidfd: move struct pidfd_fops
X-Mailing-List: linux-fsdevel@vger.kernel.org
Message-Id: <20240213-vfs-pidfd_fs-v1-1-f863f58cfce1@kernel.org>
References: <20240213-vfs-pidfd_fs-v1-0-f863f58cfce1@kernel.org>
In-Reply-To: <20240213-vfs-pidfd_fs-v1-0-f863f58cfce1@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds, Alexander Viro, Seth Forshee, Tycho Andersen,
 Christian Brauner

Signed-off-by: Christian Brauner
---
 fs/Makefile   |   2 +-
 fs/pidfdfs.c  | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c | 110 ---------------------------------------------------
 3 files changed, 124 insertions(+), 111 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index c09016257f05..0fe5d0151fcc 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -15,7 +15,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
-		kernel_read_file.o mnt_idmapping.o remap_range.o
+		kernel_read_file.o mnt_idmapping.o remap_range.o pidfdfs.o
 
 obj-$(CONFIG_BUFFER_HEAD) += buffer.o mpage.o
 obj-$(CONFIG_PROC_FS) += proc_namespace.o
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
new file mode 100644
index 000000000000..55e8396e7fc4
--- /dev/null
+++ b/fs/pidfdfs.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+/**
+ * pidfd_show_fdinfo - print information about a pidfd
+ * @m: proc fdinfo file
+ * @f: file referencing a pidfd
+ *
+ * Pid:
+ * This function will print the pid that a given pidfd refers to in the
+ * pid namespace of the procfs instance.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its pid. This is
+ * similar to calling getppid() on a process whose parent is outside of
+ * its pid namespace.
+ *
+ * NSpid:
+ * If pid namespaces are supported then this function will also print
+ * the pid of a given pidfd refers to for all descendant pid namespaces
+ * starting from the current pid namespace of the instance, i.e. the
+ * Pid field and the first entry in the NSpid field will be identical.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its first NSpid
+ * entry and no others will be shown.
+ * Note that this differs from the Pid and NSpid fields in
+ * /proc//status where Pid and NSpid are always shown relative to
+ * the pid namespace of the procfs instance. The difference becomes
+ * obvious when sending around a pidfd between pid namespaces from a
+ * different branch of the tree, i.e. where no ancestral relation is
+ * present between the pid namespaces:
+ * - create two new pid namespaces ns1 and ns2 in the initial pid
+ *   namespace (also take care to create new mount namespaces in the
+ *   new pid namespace and mount procfs)
+ * - create a process with a pidfd in ns1
+ * - send pidfd from ns1 to ns2
+ * - read /proc/self/fdinfo/ and observe that both Pid and NSpid
+ *   have exactly one entry, which is 0
+ */
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid *pid = f->private_data;
+	struct pid_namespace *ns;
+	pid_t nr = -1;
+
+	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
+		ns = proc_pid_ns(file_inode(m->file)->i_sb);
+		nr = pid_nr_ns(pid, ns);
+	}
+
+	seq_put_decimal_ll(m, "Pid:\t", nr);
+
+#ifdef CONFIG_PID_NS
+	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
+	if (nr > 0) {
+		int i;
+
+		/* If nr is non-zero it means that 'pid' is valid and that
+		 * ns, i.e. the pid namespace associated with the procfs
+		 * instance, is in the pid namespace hierarchy of pid.
+		 * Start at one below the already printed level.
+		 */
+		for (i = ns->level + 1; i <= pid->level; i++)
+			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
+	}
+#endif
+	seq_putc(m, '\n');
+}
+#endif
+
+/*
+ * Poll support for process exit notification.
+ */
+static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct pid *pid = file->private_data;
+	bool thread = file->f_flags & PIDFD_THREAD;
+	struct task_struct *task;
+	__poll_t poll_flags = 0;
+
+	poll_wait(file, &pid->wait_pidfd, pts);
+	/*
+	 * Depending on PIDFD_THREAD, inform pollers when the thread
+	 * or the whole thread-group exits.
+	 */
+	rcu_read_lock();
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
+	else if (task->exit_state && (thread || thread_group_empty(task)))
+		poll_flags = EPOLLIN | EPOLLRDNORM;
+	rcu_read_unlock();
+
+	return poll_flags;
+}
+
+const struct file_operations pidfd_fops = {
+	.release = pidfd_release,
+	.poll = pidfd_poll,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f22ec90c5c6..662a61f340ce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1993,116 +1993,6 @@ struct pid *pidfd_pid(const struct file *file)
 	return ERR_PTR(-EBADF);
 }
 
-static int pidfd_release(struct inode *inode, struct file *file)
-{
-	struct pid *pid = file->private_data;
-
-	file->private_data = NULL;
-	put_pid(pid);
-	return 0;
-}
-
-#ifdef CONFIG_PROC_FS
-/**
- * pidfd_show_fdinfo - print information about a pidfd
- * @m: proc fdinfo file
- * @f: file referencing a pidfd
- *
- * Pid:
- * This function will print the pid that a given pidfd refers to in the
- * pid namespace of the procfs instance.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its pid. This is
- * similar to calling getppid() on a process whose parent is outside of
- * its pid namespace.
- *
- * NSpid:
- * If pid namespaces are supported then this function will also print
- * the pid of a given pidfd refers to for all descendant pid namespaces
- * starting from the current pid namespace of the instance, i.e. the
- * Pid field and the first entry in the NSpid field will be identical.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its first NSpid
- * entry and no others will be shown.
- * Note that this differs from the Pid and NSpid fields in
- * /proc//status where Pid and NSpid are always shown relative to
- * the pid namespace of the procfs instance. The difference becomes
- * obvious when sending around a pidfd between pid namespaces from a
- * different branch of the tree, i.e. where no ancestral relation is
- * present between the pid namespaces:
- * - create two new pid namespaces ns1 and ns2 in the initial pid
- *   namespace (also take care to create new mount namespaces in the
- *   new pid namespace and mount procfs)
- * - create a process with a pidfd in ns1
- * - send pidfd from ns1 to ns2
- * - read /proc/self/fdinfo/ and observe that both Pid and NSpid
- *   have exactly one entry, which is 0
- */
-static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
-{
-	struct pid *pid = f->private_data;
-	struct pid_namespace *ns;
-	pid_t nr = -1;
-
-	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
-		ns = proc_pid_ns(file_inode(m->file)->i_sb);
-		nr = pid_nr_ns(pid, ns);
-	}
-
-	seq_put_decimal_ll(m, "Pid:\t", nr);
-
-#ifdef CONFIG_PID_NS
-	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
-	if (nr > 0) {
-		int i;
-
-		/* If nr is non-zero it means that 'pid' is valid and that
-		 * ns, i.e. the pid namespace associated with the procfs
-		 * instance, is in the pid namespace hierarchy of pid.
-		 * Start at one below the already printed level.
-		 */
-		for (i = ns->level + 1; i <= pid->level; i++)
-			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
-	}
-#endif
-	seq_putc(m, '\n');
-}
-#endif
-
-/*
- * Poll support for process exit notification.
- */
-static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
-{
-	struct pid *pid = file->private_data;
-	bool thread = file->f_flags & PIDFD_THREAD;
-	struct task_struct *task;
-	__poll_t poll_flags = 0;
-
-	poll_wait(file, &pid->wait_pidfd, pts);
-	/*
-	 * Depending on PIDFD_THREAD, inform pollers when the thread
-	 * or the whole thread-group exits.
-	 */
-	rcu_read_lock();
-	task = pid_task(pid, PIDTYPE_PID);
-	if (!task)
-		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
-	else if (task->exit_state && (thread || thread_group_empty(task)))
-		poll_flags = EPOLLIN | EPOLLRDNORM;
-	rcu_read_unlock();
-
-	return poll_flags;
-}
-
-const struct file_operations pidfd_fops = {
-	.release = pidfd_release,
-	.poll = pidfd_poll,
-#ifdef CONFIG_PROC_FS
-	.show_fdinfo = pidfd_show_fdinfo,
-#endif
-};
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid: the struct pid for which to create a pidfd

From patchwork Tue Feb 13 16:45:47 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Christian Brauner
X-Patchwork-Id: 13555378
From: Christian Brauner
Date: Tue, 13 Feb 2024 17:45:47 +0100
Subject: [PATCH 2/2] pidfd: add pidfdfs
X-Mailing-List: linux-fsdevel@vger.kernel.org
Message-Id: <20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org>
References: <20240213-vfs-pidfd_fs-v1-0-f863f58cfce1@kernel.org>
In-Reply-To: <20240213-vfs-pidfd_fs-v1-0-f863f58cfce1@kernel.org>
To: linux-fsdevel@vger.kernel.org
Cc: Linus Torvalds, Alexander Viro, Seth Forshee, Tycho Andersen,
 Christian Brauner

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem.
This has been on my todo list for quite a while, as it will unblock
further work that we weren't able to do simply because of the very
justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem has the following consequences:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
  numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have
  been closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode, but the same struct
  pid will refer to the same inode if it's opened multiple times. This
  is in contrast to the current behavior, where every struct pid refers
  to the same inode. Even if we were to move to
  anon_inode_create_getfile(), which creates new inodes, we'd still be
  associating the same struct pid with multiple different inodes.
* Pidfds now go through the regular dentry_open() path, which means
  that all security hooks are called, unblocking proper LSM management
  for pidfds. In addition, fsnotify hooks are called and allow for
  listening to open events on pidfds.

The tiny pseudo filesystem is not visible anywhere in userspace,
exactly like, e.g., pipefs and sockfs. There's no lookup, there are no
complex inode operations, nothing. Dentries and inodes are always
deleted when the last pidfd is closed.

The code is entirely optional and fairly small. If it's not selected we
fall back to anonymous inodes.

Heavily inspired by nsfs, which uses a similar stashing mechanism just
for namespaces.
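
For illustration only, here is a rough userspace sketch of the statx()
comparison mentioned above. fd_same_inode() is a hypothetical helper
name and not part of this series, and the sketch assumes a libc that
exposes statx() (e.g. glibc 2.28 or later). On kernels where pidfds are
still backed by the shared anonymous inode, every pidfd reports the same
inode number, so the result is only meaningful with pidfdfs enabled.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* AT_EMPTY_PATH */
#include <sys/stat.h>   /* statx(), struct statx, STATX_INO */

/*
 * Hypothetical helper, not part of this series.
 *
 * Returns 1 if both file descriptors refer to the same inode on the
 * same filesystem, 0 if they do not, and -1 if statx() fails on
 * either descriptor.
 */
int fd_same_inode(int fd_a, int fd_b)
{
	struct statx a, b;

	/* An empty path plus AT_EMPTY_PATH makes statx() operate on the fd. */
	if (statx(fd_a, "", AT_EMPTY_PATH, STATX_INO, &a) < 0 ||
	    statx(fd_b, "", AT_EMPTY_PATH, STATX_INO, &b) < 0)
		return -1;

	/* Inode numbers are only comparable within one filesystem. */
	return a.stx_dev_major == b.stx_dev_major &&
	       a.stx_dev_minor == b.stx_dev_minor &&
	       a.stx_ino == b.stx_ino;
}
```

With this series applied, two pidfds obtained for the same process
would be expected to compare equal and pidfds for different processes
to compare unequal.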

Reported-by: Nathan Chancellor
Signed-off-by: Christian Brauner
---
 fs/Kconfig                 |   6 ++
 fs/pidfdfs.c               | 189 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/pid.h        |   4 +
 include/linux/pidfdfs.h    |   9 +++
 include/uapi/linux/magic.h |   1 +
 init/main.c                |   2 +
 kernel/fork.c              |  13 +---
 kernel/nsproxy.c           |   2 +-
 kernel/pid.c               |   2 +
 9 files changed, 214 insertions(+), 14 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..c7ed65e34820 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -174,6 +174,12 @@ source "fs/proc/Kconfig"
 source "fs/kernfs/Kconfig"
 source "fs/sysfs/Kconfig"
 
+config FS_PIDFD
+	bool "Pseudo filesystem for process file descriptors"
+	depends on 64BIT
+	help
+	  Pidfdfs implements advanced features for process file descriptors.
+
 config TMPFS
 	bool "Tmpfs virtual memory file system support (former shm fs)"
 	depends on SHMEM
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index 55e8396e7fc4..efc68ef3a08d 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -1,9 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
+#include
 #include
 #include
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -12,12 +14,25 @@
 #include
 #include
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op != &pidfd_fops)
+		return ERR_PTR(-EBADF);
+#ifdef CONFIG_FS_PIDFD
+	return file_inode(file)->i_private;
+#else
+	return file->private_data;
+#endif
+}
+
 static int pidfd_release(struct inode *inode, struct file *file)
 {
+#ifndef CONFIG_FS_PIDFD
 	struct pid *pid = file->private_data;
 
 	file->private_data = NULL;
 	put_pid(pid);
+#endif
 	return 0;
 }
 
@@ -59,7 +74,7 @@ static int pidfd_release(struct inode *inode, struct file *file)
  */
 static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
-	struct pid *pid = f->private_data;
+	struct pid *pid = pidfd_pid(f);
 	struct pid_namespace *ns;
 	pid_t nr = -1;
 
@@ -93,7 +108,7 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
  */
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
-	struct pid *pid = file->private_data;
+	struct pid *pid = pidfd_pid(file);
 	bool thread = file->f_flags & PIDFD_THREAD;
 	struct task_struct *task;
 	__poll_t poll_flags = 0;
@@ -121,3 +136,173 @@ const struct file_operations pidfd_fops = {
 	.show_fdinfo = pidfd_show_fdinfo,
 #endif
 };
+
+#ifdef CONFIG_FS_PIDFD
+static struct vfsmount *pidfdfs_mnt __ro_after_init;
+static struct super_block *pidfdfs_sb __ro_after_init;
+static u64 pidfdfs_ino = 0;
+
+static void pidfdfs_evict_inode(struct inode *inode)
+{
+	struct pid *pid = inode->i_private;
+
+	clear_inode(inode);
+	put_pid(pid);
+}
+
+static const struct super_operations pidfdfs_sops = {
+	.statfs = simple_statfs,
+	.evict_inode = pidfdfs_evict_inode,
+};
+
+static void pidfdfs_prune_dentry(struct dentry *dentry)
+{
+	struct inode *inode;
+	struct pid *pid;
+
+	inode = d_inode(dentry);
+	if (!inode)
+		return;
+
+	pid = inode->i_private;
+	atomic_long_set(&pid->stashed, 0);
+}
+
+static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(buffer, buflen, "pidfd:[%lu]",
+			     d_inode(dentry)->i_ino);
+}
+
+const struct dentry_operations pidfdfs_dentry_operations = {
+	.d_prune = pidfdfs_prune_dentry,
+	.d_delete = always_delete_dentry,
+	.d_dname = pidfdfs_dname,
+};
+
+static int pidfdfs_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	ctx = init_pseudo(fc, PIDFDFS_MAGIC);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ops = &pidfdfs_sops;
+	ctx->dops = &pidfdfs_dentry_operations;
+	return 0;
+}
+
+static struct file_system_type pidfdfs_type = {
+	.name = "pidfdfs",
+	.init_fs_context = pidfdfs_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static struct dentry *pidfdfs_dentry(struct pid *pid)
+{
+	struct inode *inode;
+	struct dentry *dentry;
+	unsigned long i_ptr;
+
+	inode = new_inode_pseudo(pidfdfs_sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode->i_ino = pid->ino;
+	inode->i_mode = S_IFREG | S_IRUGO;
+	inode->i_fop = &pidfd_fops;
+	inode->i_flags |= S_IMMUTABLE;
+	simple_inode_init_ts(inode);
+	/* grab a reference */
+	inode->i_private = get_pid(pid);
+
+	/* consumes @inode */
+	dentry = d_make_root(inode);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+
+	i_ptr = atomic_long_cmpxchg(&pid->stashed, 0, (unsigned long)dentry);
+	if (i_ptr) {
+		d_delete(dentry); /* make sure ->d_prune() does nothing */
+		dput(dentry);
+		cpu_relax();
+		return ERR_PTR(-EAGAIN);
+	}
+
+	return dentry;
+}
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+
+	struct path path;
+	struct dentry *dentry;
+	struct file *pidfd_file;
+
+	for (;;) {
+		rcu_read_lock();
+		dentry = (struct dentry *)atomic_long_read(&pid->stashed);
+		if (!dentry || !lockref_get_not_dead(&dentry->d_lockref)) {
+			rcu_read_unlock();
+
+			dentry = pidfdfs_dentry(pid);
+			if (!IS_ERR(dentry))
+				break;
+			if (PTR_ERR(dentry) == -EAGAIN)
+				continue;
+		}
+		rcu_read_unlock();
+		break;
+	}
+	if (IS_ERR(dentry))
+		return ERR_CAST(dentry);
+
+	path.mnt = mntget(pidfdfs_mnt);
+	path.dentry = dentry;
+
+	pidfd_file = dentry_open(&path, flags, current_cred());
+	path_put(&path);
+
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid)
+{
+	atomic_long_set(&pid->stashed, 0);
+	pid->ino = ++pidfdfs_ino;
+}
+
+void __init pidfdfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&pidfdfs_type);
+	if (err)
+		panic("Failed to register pidfdfs pseudo filesystem");
+
+	pidfdfs_mnt = kern_mount(&pidfdfs_type);
+	if (IS_ERR(pidfdfs_mnt))
+		panic("Failed to mount pidfdfs pseudo filesystem");
+
+	pidfdfs_sb = pidfdfs_mnt->mnt_sb;
+}
+
+#else /* !CONFIG_FS_PIDFD */
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+	struct file *pidfd_file;
+
+	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
+					flags | O_RDWR);
+	if (IS_ERR(pidfd_file))
+		return pidfd_file;
+
+	get_pid(pid);
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid) { }
+void __init pidfdfs_init(void) { }
+#endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8124d57752b9..1a47676a04c2 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -55,6 +55,10 @@ struct pid
 	refcount_t count;
 	unsigned int level;
 	spinlock_t lock;
+#ifdef CONFIG_FS_PIDFD
+	atomic_long_t stashed;
+	unsigned long ino;
+#endif
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct hlist_head inodes;
diff --git a/include/linux/pidfdfs.h b/include/linux/pidfdfs.h
new file mode 100644
index 000000000000..760dbc163625
--- /dev/null
+++ b/include/linux/pidfdfs.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PIDFDFS_H
+#define _LINUX_PIDFDFS_H
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags);
+void __init pidfdfs_init(void);
+void pid_init_pidfdfs(struct pid *pid);
+
+#endif /* _LINUX_PIDFDFS_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..a0d5480115c5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define PIDFDFS_MAGIC		0x50494446	/* "PIDF" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..0663003f3146 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -1059,6 +1060,7 @@ void start_kernel(void)
 	seq_file_init();
 	proc_root_init();
 	nsfs_init();
+	pidfdfs_init();
 	cpuset_init();
 	cgroup_init();
 	taskstats_init_early();
diff --git a/kernel/fork.c b/kernel/fork.c
index 662a61f340ce..eab2fcc90342 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -102,6 +102,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -1985,14 +1986,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
-struct pid *pidfd_pid(const struct file *file)
-{
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
-
-	return ERR_PTR(-EBADF);
-}
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid: the struct pid for which to create a pidfd
@@ -2030,13 +2023,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (pidfd < 0)
 		return pidfd;
 
-	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
-					flags | O_RDWR);
+	pidfd_file = pidfdfs_alloc_file(pid, flags | O_RDWR);
 	if (IS_ERR(pidfd_file)) {
 		put_unused_fd(pidfd);
 		return PTR_ERR(pidfd_file);
 	}
-	get_pid(pid); /* held by pidfd_file now */
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 15781acaac1c..6ec3deec68c2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -573,7 +573,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	if (proc_ns_file(f.file))
 		err = validate_ns(&nsset, ns);
 	else
-		err = validate_nsset(&nsset, f.file->private_data);
+		err = validate_nsset(&nsset, pidfd_pid(f.file));
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);
diff --git a/kernel/pid.c b/kernel/pid.c
index c1d940fbd314..dbff84493bff 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -270,6 +271,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
+	pid_init_pidfdfs(pid);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {