diff mbox series

[2/2] pidfd: add pidfdfs

Message ID 20240213-vfs-pidfd_fs-v1-2-f863f58cfce1@kernel.org (mailing list archive)
State New
Headers show
Series Move pidfd to tiny pseudo fs | expand

Commit Message

Christian Brauner Feb. 13, 2024, 4:45 p.m. UTC
This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
  numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.
* Pidfds now go through the regular dentry_open() path which means that
  all security hooks are called unblocking proper LSM management for
  pidfds. In addition fsnotify hooks are called and allow for listening
  to open events on pidfds.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.

The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Kconfig                 |   6 ++
 fs/pidfdfs.c               | 189 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/pid.h        |   4 +
 include/linux/pidfdfs.h    |   9 +++
 include/uapi/linux/magic.h |   1 +
 init/main.c                |   2 +
 kernel/fork.c              |  13 +---
 kernel/nsproxy.c           |   2 +-
 kernel/pid.c               |   2 +
 9 files changed, 214 insertions(+), 14 deletions(-)

Comments

Linus Torvalds Feb. 13, 2024, 5:17 p.m. UTC | #1
On Tue, 13 Feb 2024 at 08:46, Christian Brauner <brauner@kernel.org> wrote:
>
> * Each struct pid will refer to a different inode but the same struct
>   pid will refer to the same inode if it's opened multiple times. In
>   contrast to now where each struct pid refers to the same inode.

The games for this are disgusting. This needs to be done some other way.

Yes, I realize that you copied what Al did for nsfs. It's still
entirely disgusting.

To quote the Romans: "Quod licet Al, non licet bovi".

It might be a *bit* less disgusting if we hid the casting games with
"atomic_long_xyz" by having nice wrappers that do "atomic_ptr_xyz".
Because having an "atomic_long_t" and stuffing a dentry pointer in it
is just too ugly for words.

But I do think that if we're going to do the same disgusting thing
that nsfs does, then the whole "keep one dentry" logic needs to be
shared between the two filesystem. Not be two different copies of the
same inscrutable thing.

Because they *are* the same thing, although you wrote the pidfs copy
slightly differently.

I was ok with one incredibly ugly thing, but I'm not ok with having it
copied and duplicated.

               Linus
Christian Brauner Feb. 14, 2024, 2:40 p.m. UTC | #2
On Tue, Feb 13, 2024 at 09:17:18AM -0800, Linus Torvalds wrote:
> On Tue, 13 Feb 2024 at 08:46, Christian Brauner <brauner@kernel.org> wrote:
> >
> > * Each struct pid will refer to a different inode but the same struct
> >   pid will refer to the same inode if it's opened multiple times. In
> >   contrast to now where each struct pid refers to the same inode.
> 
> The games for this are disgusting. This needs to be done some other way.

Yeah, I'm not particularly happy about it and I did consider deviating
from this completely and doing a lookup based on stashed inode number
alone. But that would've been a bit more code. Let me see though.

> To quote the Romans:

uniti aedificamus

> Because they *are* the same thing, although you wrote the pidfs copy

I didn't see the point in sharing the code particularly for an rfc
because I didn't expect you to be enthusiastic about the original code
(Which is also why I Cced you here in the first place).
Christian Brauner Feb. 14, 2024, 6:27 p.m. UTC | #3
On Wed, Feb 14, 2024 at 03:40:26PM +0100, Christian Brauner wrote:
> On Tue, Feb 13, 2024 at 09:17:18AM -0800, Linus Torvalds wrote:
> > On Tue, 13 Feb 2024 at 08:46, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > * Each struct pid will refer to a different inode but the same struct
> > >   pid will refer to the same inode if it's opened multiple times. In
> > >   contrast to now where each struct pid refers to the same inode.
> > 
> > The games for this are disgusting. This needs to be done some other way.
> 
> Yeah, I'm not particularly happy about it and I did consider deviating
> from this completely and doing a lookup based on stashed inode number
> alone. But that would've been a bit more code. Let me see though.

Ok, that turned out to be simpler than I had evisioned - unless I made
horrible mistakes:

From c96d88910d029ea639902d245619acbd910fc32d Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 12 Feb 2024 16:32:38 +0100
Subject: [PATCH] [RFC] pidfd: add pidfdfs

This moves pidfds from the anonymous inode infrastructure to a tiny
pseudo filesystem. This has been on my todo for quite a while as it will
unblock further work that we weren't able to do simply because of the
very justified limitations of anonymous inodes. Moving pidfds to a tiny
pseudo filesystem allows:

* statx() on pidfds becomes useful for the first time.
* pidfds can be compared simply via statx() and then comparing inode
  numbers.
* pidfds have unique inode numbers for the system lifetime.
* struct pid is now stashed in inode->i_private instead of
  file->private_data. This means it is now possible to introduce
  concepts that operate on a process once all file descriptors have been
  closed. A concrete example is kill-on-last-close.
* file->private_data is freed up for per-file options for pidfds.
* Each struct pid will refer to a different inode but the same struct
  pid will refer to the same inode if it's opened multiple times. In
  contrast to now where each struct pid refers to the same inode. Even
  if we were to move to anon_inode_create_getfile() which creates new
  inodes we'd still be associating the same struct pid with multiple
  different inodes.

The tiny pseudo filesystem is not visible anywhere in userspace exactly
like e.g., pipefs and sockfs. There's no lookup, there's no complex
inode operations, nothing. Dentries and inodes are always deleted when
the last pidfd is closed.

We allocate a new inode for each struct pid and we reuse that inode for
all pidfds. We use iget_locked() to find that inode again based on the
inode number which isn't recycled. We allocate a new dentry for each
pidfd that uses the same inode. That is similar to anonymous inodes
which reuse the same inode for thousands of dentries. For pidfds we're
talking way less than that. There usually won't be a lot of concurrent
openers of the same struct pid. They can probably often be counted on
two hands. I know that systemd does use separate pidfd for the same
struct pid for various complex process tracking issues. So I think with
that things actually become way simpler. Especially because we don't
have to care about lookup. Dentries and inodes continue to be always
deleted.

The code is entirely optional and fairly small. If it's not selected we
fallback to anonymous inodes. Heavily inspired by nsfs which uses a
similar stashing mechanism just for namespaces.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Kconfig                 |   6 ++
 fs/pidfdfs.c               | 134 ++++++++++++++++++++++++++++++++++++-
 include/linux/pid.h        |   3 +
 include/linux/pidfdfs.h    |   9 +++
 include/uapi/linux/magic.h |   1 +
 init/main.c                |   2 +
 kernel/fork.c              |  13 +---
 kernel/nsproxy.c           |   2 +-
 kernel/pid.c               |   2 +
 9 files changed, 158 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/pidfdfs.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..c7ed65e34820 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -174,6 +174,12 @@ source "fs/proc/Kconfig"
 source "fs/kernfs/Kconfig"
 source "fs/sysfs/Kconfig"
 
+config FS_PIDFD
+	bool "Pseudo filesystem for process file descriptors"
+	depends on 64BIT
+	help
+	  Pidfdfs implements advanced features for process file descriptors.
+
 config TMPFS
 	bool "Tmpfs virtual memory file system support (former shm fs)"
 	depends on SHMEM
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index 55e8396e7fc4..80caf1759f97 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -1,9 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/anon_inodes.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/magic.h>
 #include <linux/mount.h>
 #include <linux/pid.h>
+#include <linux/pidfdfs.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -12,12 +14,25 @@
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op != &pidfd_fops)
+		return ERR_PTR(-EBADF);
+#ifdef CONFIG_FS_PIDFD
+	return file_inode(file)->i_private;
+#else
+	return file->private_data;
+#endif
+}
+
 static int pidfd_release(struct inode *inode, struct file *file)
 {
+#ifndef CONFIG_FS_PIDFD
 	struct pid *pid = file->private_data;
 
 	file->private_data = NULL;
 	put_pid(pid);
+#endif
 	return 0;
 }
 
@@ -59,7 +74,7 @@ static int pidfd_release(struct inode *inode, struct file *file)
  */
 static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
-	struct pid *pid = f->private_data;
+	struct pid *pid = pidfd_pid(f);
 	struct pid_namespace *ns;
 	pid_t nr = -1;
 
@@ -93,7 +108,7 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
  */
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
-	struct pid *pid = file->private_data;
+	struct pid *pid = pidfd_pid(file);
 	bool thread = file->f_flags & PIDFD_THREAD;
 	struct task_struct *task;
 	__poll_t poll_flags = 0;
@@ -121,3 +136,118 @@ const struct file_operations pidfd_fops = {
 	.show_fdinfo	= pidfd_show_fdinfo,
 #endif
 };
+
+#ifdef CONFIG_FS_PIDFD
+static struct vfsmount *pidfdfs_mnt __ro_after_init;
+static struct super_block *pidfdfs_sb __ro_after_init;
+static u64 pidfdfs_ino = 0;
+
+static void pidfdfs_evict_inode(struct inode *inode)
+{
+	struct pid *pid = inode->i_private;
+
+	clear_inode(inode);
+	put_pid(pid);
+}
+
+static const struct super_operations pidfdfs_sops = {
+	.drop_inode	= generic_delete_inode,
+	.evict_inode	= pidfdfs_evict_inode,
+	.statfs		= simple_statfs,
+};
+
+static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(buffer, buflen, "pidfd:[%lu]",
+			     d_inode(dentry)->i_ino);
+}
+
+const struct dentry_operations pidfdfs_dentry_operations = {
+	.d_delete	= always_delete_dentry,
+	.d_dname	= pidfdfs_dname,
+};
+
+static int pidfdfs_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	ctx = init_pseudo(fc, PIDFDFS_MAGIC);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ops = &pidfdfs_sops;
+	ctx->dops = &pidfdfs_dentry_operations;
+	return 0;
+}
+
+static struct file_system_type pidfdfs_type = {
+	.name			= "pidfdfs",
+	.init_fs_context	= pidfdfs_init_fs_context,
+	.kill_sb		= kill_anon_super,
+};
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+
+	struct inode *inode;
+	struct file *pidfd_file;
+
+	inode = iget_locked(pidfdfs_sb, pid->ino);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	if (inode->i_state & I_NEW) {
+		inode->i_ino = pid->ino;
+		inode->i_mode = S_IFREG | S_IRUGO;
+		inode->i_fop = &pidfd_fops;
+		inode->i_flags |= S_IMMUTABLE;
+		inode->i_private = get_pid(pid);
+		simple_inode_init_ts(inode);
+		unlock_new_inode(inode);
+	}
+
+	pidfd_file = alloc_file_pseudo(inode, pidfdfs_mnt, "", flags, &pidfd_fops);
+	if (IS_ERR(pidfd_file))
+		iput(inode);
+
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid)
+{
+	pid->ino = ++pidfdfs_ino;
+}
+
+void __init pidfdfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&pidfdfs_type);
+	if (err)
+		panic("Failed to register pidfdfs pseudo filesystem");
+
+	pidfdfs_mnt = kern_mount(&pidfdfs_type);
+	if (IS_ERR(pidfdfs_mnt))
+		panic("Failed to mount pidfdfs pseudo filesystem");
+
+	pidfdfs_sb = pidfdfs_mnt->mnt_sb;
+}
+
+#else /* !CONFIG_FS_PIDFD */
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+	struct file *pidfd_file;
+
+	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
+					flags | O_RDWR);
+	if (IS_ERR(pidfd_file))
+		return pidfd_file;
+
+	get_pid(pid);
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid) { }
+void __init pidfdfs_init(void) { }
+#endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8124d57752b9..7b6f5deab36a 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -55,6 +55,9 @@ struct pid
 	refcount_t count;
 	unsigned int level;
 	spinlock_t lock;
+#ifdef CONFIG_FS_PIDFD
+	unsigned long ino;
+#endif
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct hlist_head inodes;
diff --git a/include/linux/pidfdfs.h b/include/linux/pidfdfs.h
new file mode 100644
index 000000000000..760dbc163625
--- /dev/null
+++ b/include/linux/pidfdfs.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PIDFDFS_H
+#define _LINUX_PIDFDFS_H
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags);
+void __init pidfdfs_init(void);
+void pid_init_pidfdfs(struct pid *pid);
+
+#endif /* _LINUX_PIDFDFS_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..a0d5480115c5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define PIDFDFS_MAGIC		0x50494446	/* "PIDF" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..0663003f3146 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
+#include <linux/pidfdfs.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -1059,6 +1060,7 @@ void start_kernel(void)
 	seq_file_init();
 	proc_root_init();
 	nsfs_init();
+	pidfdfs_init();
 	cpuset_init();
 	cgroup_init();
 	taskstats_init_early();
diff --git a/kernel/fork.c b/kernel/fork.c
index 662a61f340ce..eab2fcc90342 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -102,6 +102,7 @@
 #include <linux/iommu.h>
 #include <linux/rseq.h>
 #include <uapi/linux/pidfd.h>
+#include <linux/pidfdfs.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -1985,14 +1986,6 @@ static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
-struct pid *pidfd_pid(const struct file *file)
-{
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
-
-	return ERR_PTR(-EBADF);
-}
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd
@@ -2030,13 +2023,11 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (pidfd < 0)
 		return pidfd;
 
-	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
-					flags | O_RDWR);
+	pidfd_file = pidfdfs_alloc_file(pid, flags | O_RDWR);
 	if (IS_ERR(pidfd_file)) {
 		put_unused_fd(pidfd);
 		return PTR_ERR(pidfd_file);
 	}
-	get_pid(pid); /* held by pidfd_file now */
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 15781acaac1c..6ec3deec68c2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -573,7 +573,7 @@ SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	if (proc_ns_file(f.file))
 		err = validate_ns(&nsset, ns);
 	else
-		err = validate_nsset(&nsset, f.file->private_data);
+		err = validate_nsset(&nsset, pidfd_pid(f.file));
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);
diff --git a/kernel/pid.c b/kernel/pid.c
index c1d940fbd314..dbff84493bff 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/pidfdfs.h>
 #include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 
@@ -270,6 +271,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
+	pid_init_pidfdfs(pid);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {
Linus Torvalds Feb. 14, 2024, 6:37 p.m. UTC | #4
On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
>
> Ok, that turned out to be simpler than I had evisioned - unless I made
> horrible mistakes:

Hmm. Could we do the same for nsfs?

Also, quick question:

> +void pid_init_pidfdfs(struct pid *pid)
> +{
> +       pid->ino = ++pidfdfs_ino;
> +}

As far as I can tell, the above is only safe because it is serialized by

        spin_lock_irq(&pidmap_lock);

in the only caller.

Can we please just not have a function at all, and just move it there,
so that it's a whole lot more obvious that that inode generation
really gets us a unique number?

Am I missing something?

                 Linus
Christian Brauner Feb. 15, 2024, 4:11 p.m. UTC | #5
On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote:
> On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Ok, that turned out to be simpler than I had evisioned - unless I made
> > horrible mistakes:
> 
> Hmm. Could we do the same for nsfs?

Jein (Ja/Yes + Nein/No).
We could but it would alter nsfs quite a bit.

Before stashing nsfs did (roughly):

/* get an unhashed dentry */
dentry = d_alloc_pseudo(sb, &qname);

/* get new or find existing inode */
inode = iget_locked(sb, ns->inum);

/* Add new or reuse existing dentry atomically */
d_instantiate_unique(dentry, inode);

Afaict, that was intended to work because d_instantiate_unique()
existed. But I don't think it ever worked before the stashing. I'll get
to that below.

d_instantiate_unique() used to to find an existing dentry or added a new
dentry all with inode->i_lock held.

A little after that stashing mechanism for nsfs was added
d_instantiate_unique() was split into two: d_exact_alias() and
d_splice_alias(). The semantics are different. d_exact_alias() discounts
hashed dentries - returns NULL - and only considers unhashed dentries
eligible for reuse. If it finds an unhashed alias it will rehash it and
return it.

And whereas d_instantiate_unique() found and added a new dentry with
inode->i_lock held and thus guaranteed that there would only be one
dentry the d_exact_alias() followed by d_splice_alias() doesn't because
i_lock is dropped in between of course.

But even the logic back then didn't work. Because back then nsfs called
d_alloc_pseudo() which sets dentry->d_parent = dentry. IOW, creates
IS_ROOT() dentries. But __d_instantiate_unique() did:

if (alias->d_parent != entry->d_parent)
        continue;

and so would reliably discount d_alloc_pseudo() dentries... So even back
then nsfs created aliases for each open of the same namespace.

Right now the same problem plus another problem would exist. Consider:

/* get an unhashed dentry */
dentry = d_alloc_anon(sb, &qname);

/* get new or find existing inode */
inode = iget_locked(sb, ns->inum);

/* Add new or reuse existing dentry atomically */
alias = d_exact_alias(dentry, inode);
if (!alias)
	d_splice_alias(inode, dentry);

That doesn't work. First, because of the inode->i_lock drop in between
d_exact_alias() and d_splice_alias(). Second, because d_exact_alias()
discounts IS_ROOT() for reuse. Third, because d_exact_alias() discounts
hashed dentries for reuse.

Consequently switching to a similar iget_locked() mechanism for nsfs as
for pidfdfs right now would mean that there's now not just a single
dentry that's reused by all concurrent __ns_get_path() calls. Instead
one does get multiple dentries like in the old (likely broken) scheme.

For nsfs this might matter because in contrast to anonymous inodes and
pidfds these namespace fds can be bind-mounted and thus persisted. And
in that case it's nice if you only have a single dentry. Whether that
matters in practice in terms of memory I'm not sure. It likely doesn't.

Now, for pidfds we don't care. pidfds can never be bind-mounted and
there's no reason for that. That doesn't work with anonymous inodes and
it doesn't work with pidfdfs. The pidfds_mnt is marked as vfs internal
preventing it from being used for mounting.

For pidfds we also don't care about multiple dentries for the same
inode. Because right now - with pidfds being anonymous inodes - that's
exacty what already happens. And that really hasn't been an issue so far
so I don't see why it would be an issue now.

But, if we wanted this we could solve it without that stashing mechanism
as a patch on top of both pidfdfs and nsfs later but it would require
some dcache help.

I think what we'd need - specifically tailored to both nsfs and pidfdfs
- is something like the below. So, this is only expected to work on
stuff that does d_alloc_anon(). So has no real name and dentry->d_parent
== dentry; basically a bunch of IS_ROOT() dentries. I can add that on
top and then pidfds can use that and nsfs as well. But again, I don't
need it for pidfdfs to be functional. Up to you.

diff --git a/fs/dcache.c b/fs/dcache.c
index b813528fb147..7c78b8b1a8b3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2642,6 +2642,39 @@ void d_add(struct dentry *entry, struct inode *inode)
 }
 EXPORT_SYMBOL(d_add);

+/*
+ * Helper for special filesystems that want to recycle the exact same dentry
+ * over and over. Requires d_alloc_anon() and will reject anything that isn't
+ * IS_ROOT(). Caller must provide valid inode.
+ */
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode)
+{
+       struct dentry *alias;
+       unsigned int hash = entry->d_name.hash;
+
+       if (!inode)
+               return NULL;
+
+       if (!IS_ROOT(entry))
+               return NULL;
+
+       spin_lock(&inode->i_lock);
+       hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+               if (alias->d_name.hash != hash)
+                       continue;
+               if (!d_same_name(alias, entry->d_parent, &entry->d_name))
+                       continue;
+               dget_dlock(alias);
+               spin_unlock(&inode->i_lock);
+               return alias;
+       }
+
+       __d_instantiate(entry, inode);
+       spin_unlock(&inode->i_lock);
+       security_d_instantiate(entry, inode);
+       return NULL;
+}
+
 /**
  * d_exact_alias - find and hash an exact unhashed alias
  * @entry: dentry to add
diff --git a/fs/internal.h b/fs/internal.h
index b67406435fc0..41b441c7b2a0 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -310,3 +310,4 @@ ssize_t __kernel_write_iter(struct file *file, struct iov_iter *from, loff_t *po
 struct mnt_idmap *alloc_mnt_idmap(struct user_namespace *mnt_userns);
 struct mnt_idmap *mnt_idmap_get(struct mnt_idmap *idmap);
 void mnt_idmap_put(struct mnt_idmap *idmap);
+struct dentry *d_instantiate_unique_anon(struct dentry *entry, struct inode *inode);

And then this can become roughly:

struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
{
        struct path path;
        struct dentry *dentry, *alias;
        struct inode *inode;
        struct file *pidfd_file;

        dentry = d_alloc_anon(pidfdfs_sb);
        if (!dentry)
                return ERR_PTR(-ENOMEM);

        inode = iget_locked(pidfdfs_sb, pid->ino);
        if (!inode) {
                dput(dentry);
                return ERR_PTR(-ENOMEM);
        }

        if (inode->i_state & I_NEW) {
                inode->i_ino = pid->ino;
                inode->i_mode = S_IFREG | S_IRUGO;
                inode->i_fop = &pidfd_fops;
                inode->i_flags |= S_IMMUTABLE;
                inode->i_private = get_pid(pid);
                simple_inode_init_ts(inode);
                unlock_new_inode(inode);
        }

        alias = d_instantiate_unique_anon(dentry, inode);
        if (alias) {
                dput(dentry);
                dentry = alias;
        }

        path.dentry = dentry;
        path.mnt = mntget(pidfdfs_mnt);

        pidfd_file = dentry_open(&path, flags, current_cred());
        path_put(&path);
        return pidfd_file;
}

> 
> Also, quick question:
> 
> > +void pid_init_pidfdfs(struct pid *pid)
> > +{
> > +       pid->ino = ++pidfdfs_ino;
> > +}
> 
> As far as I can tell, the above is only safe because it is serialized by
> 
>         spin_lock_irq(&pidmap_lock);

Yes, I'm explicitly relying on that because that thing serializes all
alloc_pid() calls anyway which came in very handy nice.

> 
> in the only caller.
> 
> Can we please just not have a function at all, and just move it there,
> so that it's a whole lot more obvious that that inode generation
> really gets us a unique number?

Yes, of course. I just did it to avoid an ifdef basically.

> 
> Am I missing something?

Nope.
Christian Brauner Feb. 16, 2024, 11:50 a.m. UTC | #6
On Thu, Feb 15, 2024 at 05:11:31PM +0100, Christian Brauner wrote:
> On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote:
> > On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > Ok, that turned out to be simpler than I had evisioned - unless I made
> > > horrible mistakes:
> > 
> > Hmm. Could we do the same for nsfs?

Ok, here's what I got. I'll put the changes to switch both nsfs and
pidfdfs to the proposed unique mechanism suggested yesterday on top. I
would send that in two batches in any case. So if that's somehow broken
then we can just drop it.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Makefile   |   2 +-
 fs/pidfdfs.c  | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c | 110 ---------------------------------------------------
 3 files changed, 124 insertions(+), 111 deletions(-)

diff --git a/fs/Makefile b/fs/Makefile
index c09016257f05..0fe5d0151fcc 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -15,7 +15,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
 		pnode.o splice.o sync.o utimes.o d_path.o \
 		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
 		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
-		kernel_read_file.o mnt_idmapping.o remap_range.o
+		kernel_read_file.o mnt_idmapping.o remap_range.o pidfdfs.o
 
 obj-$(CONFIG_BUFFER_HEAD)	+= buffer.o mpage.o
 obj-$(CONFIG_PROC_FS)		+= proc_namespace.o
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
new file mode 100644
index 000000000000..55e8396e7fc4
--- /dev/null
+++ b/fs/pidfdfs.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/magic.h>
+#include <linux/mount.h>
+#include <linux/pid.h>
+#include <linux/pid_namespace.h>
+#include <linux/poll.h>
+#include <linux/proc_fs.h>
+#include <linux/proc_ns.h>
+#include <linux/pseudo_fs.h>
+#include <linux/seq_file.h>
+#include <uapi/linux/pidfd.h>
+
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+	struct pid *pid = file->private_data;
+
+	file->private_data = NULL;
+	put_pid(pid);
+	return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+/**
+ * pidfd_show_fdinfo - print information about a pidfd
+ * @m: proc fdinfo file
+ * @f: file referencing a pidfd
+ *
+ * Pid:
+ * This function will print the pid that a given pidfd refers to in the
+ * pid namespace of the procfs instance.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its pid. This is
+ * similar to calling getppid() on a process whose parent is outside of
+ * its pid namespace.
+ *
+ * NSpid:
+ * If pid namespaces are supported then this function will also print
+ * the pid of a given pidfd refers to for all descendant pid namespaces
+ * starting from the current pid namespace of the instance, i.e. the
+ * Pid field and the first entry in the NSpid field will be identical.
+ * If the pid namespace of the process is not a descendant of the pid
+ * namespace of the procfs instance 0 will be shown as its first NSpid
+ * entry and no others will be shown.
+ * Note that this differs from the Pid and NSpid fields in
+ * /proc/<pid>/status where Pid and NSpid are always shown relative to
+ * the  pid namespace of the procfs instance. The difference becomes
+ * obvious when sending around a pidfd between pid namespaces from a
+ * different branch of the tree, i.e. where no ancestral relation is
+ * present between the pid namespaces:
+ * - create two new pid namespaces ns1 and ns2 in the initial pid
+ *   namespace (also take care to create new mount namespaces in the
+ *   new pid namespace and mount procfs)
+ * - create a process with a pidfd in ns1
+ * - send pidfd from ns1 to ns2
+ * - read /proc/self/fdinfo/<pidfd> and observe that both Pid and NSpid
+ *   have exactly one entry, which is 0
+ */
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+	struct pid *pid = f->private_data;
+	struct pid_namespace *ns;
+	pid_t nr = -1;
+
+	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
+		ns = proc_pid_ns(file_inode(m->file)->i_sb);
+		nr = pid_nr_ns(pid, ns);
+	}
+
+	seq_put_decimal_ll(m, "Pid:\t", nr);
+
+#ifdef CONFIG_PID_NS
+	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
+	if (nr > 0) {
+		int i;
+
+		/* If nr is non-zero it means that 'pid' is valid and that
+		 * ns, i.e. the pid namespace associated with the procfs
+		 * instance, is in the pid namespace hierarchy of pid.
+		 * Start at one below the already printed level.
+		 */
+		for (i = ns->level + 1; i <= pid->level; i++)
+			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
+	}
+#endif
+	seq_putc(m, '\n');
+}
+#endif
+
+/*
+ * Poll support for process exit notification.
+ */
+static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
+{
+	struct pid *pid = file->private_data;
+	bool thread = file->f_flags & PIDFD_THREAD;
+	struct task_struct *task;
+	__poll_t poll_flags = 0;
+
+	poll_wait(file, &pid->wait_pidfd, pts);
+	/*
+	 * Depending on PIDFD_THREAD, inform pollers when the thread
+	 * or the whole thread-group exits.
+	 */
+	rcu_read_lock();
+	task = pid_task(pid, PIDTYPE_PID);
+	if (!task)
+		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
+	else if (task->exit_state && (thread || thread_group_empty(task)))
+		poll_flags = EPOLLIN | EPOLLRDNORM;
+	rcu_read_unlock();
+
+	return poll_flags;
+}
+
+const struct file_operations pidfd_fops = {
+	.release	= pidfd_release,
+	.poll		= pidfd_poll,
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= pidfd_show_fdinfo,
+#endif
+};
diff --git a/kernel/fork.c b/kernel/fork.c
index 3f22ec90c5c6..662a61f340ce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1993,116 +1993,6 @@ struct pid *pidfd_pid(const struct file *file)
 	return ERR_PTR(-EBADF);
 }
 
-static int pidfd_release(struct inode *inode, struct file *file)
-{
-	struct pid *pid = file->private_data;
-
-	file->private_data = NULL;
-	put_pid(pid);
-	return 0;
-}
-
-#ifdef CONFIG_PROC_FS
-/**
- * pidfd_show_fdinfo - print information about a pidfd
- * @m: proc fdinfo file
- * @f: file referencing a pidfd
- *
- * Pid:
- * This function will print the pid that a given pidfd refers to in the
- * pid namespace of the procfs instance.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its pid. This is
- * similar to calling getppid() on a process whose parent is outside of
- * its pid namespace.
- *
- * NSpid:
- * If pid namespaces are supported then this function will also print
- * the pid of a given pidfd refers to for all descendant pid namespaces
- * starting from the current pid namespace of the instance, i.e. the
- * Pid field and the first entry in the NSpid field will be identical.
- * If the pid namespace of the process is not a descendant of the pid
- * namespace of the procfs instance 0 will be shown as its first NSpid
- * entry and no others will be shown.
- * Note that this differs from the Pid and NSpid fields in
- * /proc/<pid>/status where Pid and NSpid are always shown relative to
- * the  pid namespace of the procfs instance. The difference becomes
- * obvious when sending around a pidfd between pid namespaces from a
- * different branch of the tree, i.e. where no ancestral relation is
- * present between the pid namespaces:
- * - create two new pid namespaces ns1 and ns2 in the initial pid
- *   namespace (also take care to create new mount namespaces in the
- *   new pid namespace and mount procfs)
- * - create a process with a pidfd in ns1
- * - send pidfd from ns1 to ns2
- * - read /proc/self/fdinfo/<pidfd> and observe that both Pid and NSpid
- *   have exactly one entry, which is 0
- */
-static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
-{
-	struct pid *pid = f->private_data;
-	struct pid_namespace *ns;
-	pid_t nr = -1;
-
-	if (likely(pid_has_task(pid, PIDTYPE_PID))) {
-		ns = proc_pid_ns(file_inode(m->file)->i_sb);
-		nr = pid_nr_ns(pid, ns);
-	}
-
-	seq_put_decimal_ll(m, "Pid:\t", nr);
-
-#ifdef CONFIG_PID_NS
-	seq_put_decimal_ll(m, "\nNSpid:\t", nr);
-	if (nr > 0) {
-		int i;
-
-		/* If nr is non-zero it means that 'pid' is valid and that
-		 * ns, i.e. the pid namespace associated with the procfs
-		 * instance, is in the pid namespace hierarchy of pid.
-		 * Start at one below the already printed level.
-		 */
-		for (i = ns->level + 1; i <= pid->level; i++)
-			seq_put_decimal_ll(m, "\t", pid->numbers[i].nr);
-	}
-#endif
-	seq_putc(m, '\n');
-}
-#endif
-
-/*
- * Poll support for process exit notification.
- */
-static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
-{
-	struct pid *pid = file->private_data;
-	bool thread = file->f_flags & PIDFD_THREAD;
-	struct task_struct *task;
-	__poll_t poll_flags = 0;
-
-	poll_wait(file, &pid->wait_pidfd, pts);
-	/*
-	 * Depending on PIDFD_THREAD, inform pollers when the thread
-	 * or the whole thread-group exits.
-	 */
-	rcu_read_lock();
-	task = pid_task(pid, PIDTYPE_PID);
-	if (!task)
-		poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
-	else if (task->exit_state && (thread || thread_group_empty(task)))
-		poll_flags = EPOLLIN | EPOLLRDNORM;
-	rcu_read_unlock();
-
-	return poll_flags;
-}
-
-const struct file_operations pidfd_fops = {
-	.release = pidfd_release,
-	.poll = pidfd_poll,
-#ifdef CONFIG_PROC_FS
-	.show_fdinfo = pidfd_show_fdinfo,
-#endif
-};
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd
Christian Brauner Feb. 16, 2024, 4:41 p.m. UTC | #7
On Fri, Feb 16, 2024 at 12:50:45PM +0100, Christian Brauner wrote:
> On Thu, Feb 15, 2024 at 05:11:31PM +0100, Christian Brauner wrote:
> > On Wed, Feb 14, 2024 at 10:37:33AM -0800, Linus Torvalds wrote:
> > > On Wed, 14 Feb 2024 at 10:27, Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > Ok, that turned out to be simpler than I had evisioned - unless I made
> > > > horrible mistakes:
> > > 
> > > Hmm. Could we do the same for nsfs?
> 
> Ok, here's what I got. I'll put the changes to switch both nsfs and
> pidfdfs to the proposed unique mechanism suggested yesterday on top. I
> would send that in two batches in any case. So if that's somehow broken
> then we can just drop it.

Bah, I pasted the version where I forgot an iput() when a reusable
dentry is found. Fixed that already.
Oleg Nesterov Feb. 17, 2024, 1:59 p.m. UTC | #8
On 02/16, Christian Brauner wrote:
>
> +struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
> +{
> +
> +	struct inode *inode;
> +	struct file *pidfd_file;
> +
> +	inode = iget_locked(pidfdfs_sb, pid->ino);
> +	if (!inode)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (inode->i_state & I_NEW) {
> +		inode->i_ino = pid->ino;

I guess this is unnecessary, iget_locked() should initialize i_ino if I_NEW ?

But I have a really stupid (I know nothing about vfs) question, why do we
need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?

IIUC, if this pid is freed and then another "struct pid" has the same address
we can rely on __wait_on_freeing_inode() ?

Oleg.
Linus Torvalds Feb. 17, 2024, 5:30 p.m. UTC | #9
On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
>
> But I have a really stupid (I know nothing about vfs) question, why do we
> need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
>
> IIUC, if this pid is freed and then another "struct pid" has the same address
> we can rely on __wait_on_freeing_inode() ?

Heh. Maybe it would work, but we really don't want to expose core
kernel pointers to user space as the inode number.

So then we'd have to add extra hackery to that (ie we'd have to
intercept stat calls, and we'd have to have something else for
->d_dname() etc..).

Those are all things that the VFS does support, but ...

So I do prefer Christian's new approach, although some of it ends up
being a bit unclear.

Christian, can you explain why this:

        spin_lock(&alias->d_lock);
        dget_dlock(alias);
        spin_unlock(&alias->d_lock);

instead of just 'dget()'?

Also, while I found the old __ns_get_path() to be fairly disgusting, I
actually think it's simpler and clearer than playing around with the
dentry alias list. So my expectation on code sharing was that you'd
basically lift the old __ns_get_path(), make *that* the helper, and
just pass it an argument that is the pointer to the filesystem
"stashed" entry...

And yes, using "atomic_long_t" for stashed is a crime against
humanity. It's also entirely pointless. There are no actual atomic
operations that the code wants except for reading and writing (aka
READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
"atomic_long_t" buys the code nothing, and only makes things more
complicated and requires crazy casts.

So I think the nsfs.c code should be changed to do

-       atomic_long_t stashed;
+       struct dentry *stashed;

and remove the crazy workarounds for using the wrong type.

Something like the attached patch.

Then, I think the whole "lockref_get_not_dead()" etc games of
__ns_get_path() could be extracted out into a helper function that
takes that "&ns->stashed" pointer as an argument, and now that helper
could also be used for pidfs, except pidfs obviously just does
"&pid->stashed" instead.

Hmm?

Entirely untested patch. Also, the above idea may be broken because of
some obvious issue that I didn't think about. Christian?

               Linus
Linus Torvalds Feb. 17, 2024, 5:38 p.m. UTC | #10
On Sat, 17 Feb 2024 at 09:30, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> And yes, using "atomic_long_t" for stashed is a crime against
> humanity. It's also entirely pointless. There are no actual atomic
> operations that the code wants [..]

Just to clarify: the reason to use 'atomic_long_t' is for the
_arithmetic_ atomic ops. So the "inc/dec/inc_and_test" etc.

The code wants none of that, and can make do with the regular smp-safe
cmpxchg operation that works on any word-sized type.

              Linus
Christian Brauner Feb. 18, 2024, 9:30 a.m. UTC | #11
On Sat, Feb 17, 2024 at 02:59:16PM +0100, Oleg Nesterov wrote:
> On 02/16, Christian Brauner wrote:
> >
> > +struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
> > +{
> > +
> > +	struct inode *inode;
> > +	struct file *pidfd_file;
> > +
> > +	inode = iget_locked(pidfdfs_sb, pid->ino);
> > +	if (!inode)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	if (inode->i_state & I_NEW) {
> > +		inode->i_ino = pid->ino;
> 
> I guess this is unnecessary, iget_locked() should initialize i_ino if I_NEW ?

Yes, it does. I just like to be explicit in such cases.

> 
> But I have a really stupid (I know nothing about vfs) question, why do we
> need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> 
> IIUC, if this pid is freed and then another "struct pid" has the same address
> we can rely on __wait_on_freeing_inode() ?

Yeah, I had thought about something like this but see Linus' reply.
Christian Brauner Feb. 18, 2024, 11:15 a.m. UTC | #12
On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > But I have a really stupid (I know nothing about vfs) question, why do we
> > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> >
> > IIUC, if this pid is freed and then another "struct pid" has the same address
> > we can rely on __wait_on_freeing_inode() ?
> 
> Heh. Maybe it would work, but we really don't want to expose core
> kernel pointers to user space as the inode number.

And then also the property that the inode number is unique for the
system lifetime is extremely useful for userspace and I would like to
retain that property.

> 
> So then we'd have to add extra hackery to that (ie we'd have to
> intercept stat calls, and we'd have to have something else for
> ->d_dname() etc..).
> 
> Those are all things that the VFS does support, but ...
> 
> So I do prefer Christian's new approach, although some of it ends up
> being a bit unclear.
> 
> Christian, can you explain why this:
> 
>         spin_lock(&alias->d_lock);
>         dget_dlock(alias);
>         spin_unlock(&alias->d_lock);
> 
> instead of just 'dget()'?

No reason other than I forgot to switch to dget().

> 
> Also, while I found the old __ns_get_path() to be fairly disgusting, I
> actually think it's simpler and clearer than playing around with the
> dentry alias list. So my expectation on code sharing was that you'd

It's overall probably also cheaper, I think.

> basically lift the old __ns_get_path(), make *that* the helper, and
> just pass it an argument that is the pointer to the filesystem
> "stashed" entry...
> 
> And yes, using "atomic_long_t" for stashed is a crime against
> humanity. It's also entirely pointless. There are no actual atomic
> operations that the code wants except for reading and writing (aka
> READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> "atomic_long_t" buys the code nothing, and only makes things more
> complicated and requires crazy casts.

Yup, I had that as a draft and that introduced struct ino_stash which
contained a dentry pointer and the inode number using cmpxchg(). But I
decided against this because ns_common.h would require to have access to
ino_stash definition so we wouldn't just able to hide it in internal.h
where it should belong.

> 
> So I think the nsfs.c code should be changed to do
> 
> -       atomic_long_t stashed;
> +       struct dentry *stashed;
> 
> and remove the crazy workarounds for using the wrong type.
> 
> Something like the attached patch.
> 
> Then, I think the whole "lockref_get_not_dead()" etc games of
> __ns_get_path() could be extracted out into a helper function that
> takes that "&ns->stashed" pointer as an argument, and now that helper
> could also be used for pidfs, except pidfs obviously just does
> "&pid->stashed" instead.
> 
> Hmm?
> 
> Entirely untested patch. Also, the above idea may be broken because of
> some obvious issue that I didn't think about. Christian?

Yeah, sure. Works for me. Let me play with something.
Christian Brauner Feb. 18, 2024, 11:33 a.m. UTC | #13
On Sun, Feb 18, 2024 at 12:15:02PM +0100, Christian Brauner wrote:
> On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> > On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > But I have a really stupid (I know nothing about vfs) question, why do we
> > > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> > >
> > > IIUC, if this pid is freed and then another "struct pid" has the same address
> > > we can rely on __wait_on_freeing_inode() ?
> > 
> > Heh. Maybe it would work, but we really don't want to expose core
> > kernel pointers to user space as the inode number.
> 
> And then also the property that the inode number is unique for the
> system lifetime is extremely useful for userspace and I would like to
> retain that property.
> 
> > 
> > So then we'd have to add extra hackery to that (ie we'd have to
> > intercept stat calls, and we'd have to have something else for
> > ->d_dname() etc..).
> > 
> > Those are all things that the VFS does support, but ...
> > 
> > So I do prefer Christian's new approach, although some of it ends up
> > being a bit unclear.
> > 
> > Christian, can you explain why this:
> > 
> >         spin_lock(&alias->d_lock);
> >         dget_dlock(alias);
> >         spin_unlock(&alias->d_lock);
> > 
> > instead of just 'dget()'?
> 
> No reason other than I forgot to switch to dget().
> 
> > 
> > Also, while I found the old __ns_get_path() to be fairly disgusting, I
> > actually think it's simpler and clearer than playing around with the
> > dentry alias list. So my expectation on code sharing was that you'd
> 
> It's overall probably also cheaper, I think.
> 
> > basically lift the old __ns_get_path(), make *that* the helper, and
> > just pass it an argument that is the pointer to the filesystem
> > "stashed" entry...
> > 
> > And yes, using "atomic_long_t" for stashed is a crime against
> > humanity. It's also entirely pointless. There are no actual atomic
> > operations that the code wants except for reading and writing (aka
> > READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> > "atomic_long_t" buys the code nothing, and only makes things more
> > complicated and requires crazy casts.
> 
> Yup, I had that as a draft and that introduced struct ino_stash which
> contained a dentry pointer and the inode number using cmpxchg(). But I
> decided against this because ns_common.h would require to have access to
> ino_stash definition so we wouldn't just able to hide it in internal.h
> where it should belong.

Right, I remember. The annoying thing will be how to cleanly handle this
without having to pass too many parameters because we need d_fsdata, the
vfsmount, and the inode->i_fop. So let me see if I can get this to
something that doesn't look too ugly.
Oleg Nesterov Feb. 18, 2024, 2:27 p.m. UTC | #14
On 02/18, Christian Brauner wrote:
>
> On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> > On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> > >
> > > But I have a really stupid (I know nothing about vfs) question, why do we
> > > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> > >
> > > IIUC, if this pid is freed and then another "struct pid" has the same address
> > > we can rely on __wait_on_freeing_inode() ?
> >
> > Heh. Maybe it would work, but we really don't want to expose core
> > kernel pointers to user space as the inode number.

We could use ptr_to_hashval(pid).

> And then also the property that the inode number is unique for the
> system lifetime is extremely useful for userspace and I would like to
> retain that property.

OK.

and perhaps task->thread_pid->ino can also be used as task_monotonic_id(task)
if we move the pid->ino initialization into init_task_pid(PIDTYPE_PID), this
allows to implement for_each_process_thread_break/continue... Nevermind, please
forget.

> > > +	if (inode->i_state & I_NEW) {
> > > +		inode->i_ino = pid->ino;
> >
> > I guess this is unnecessary, iget_locked() should initialize i_ino if I_NEW ?
>
> Yes, it does. I just like to be explicit in such cases.

Well. Of course I won't insist, this is minor. But to me this adds the
unnecessary confusion, as if we need to override ->ino for some reason.

Oleg.
Christian Brauner Feb. 18, 2024, 5:54 p.m. UTC | #15
On Sun, Feb 18, 2024 at 12:33:41PM +0100, Christian Brauner wrote:
> On Sun, Feb 18, 2024 at 12:15:02PM +0100, Christian Brauner wrote:
> > On Sat, Feb 17, 2024 at 09:30:19AM -0800, Linus Torvalds wrote:
> > > On Sat, 17 Feb 2024 at 06:00, Oleg Nesterov <oleg@redhat.com> wrote:
> > > >
> > > > But I have a really stupid (I know nothing about vfs) question, why do we
> > > > need pidfdfs_ino and pid->ino ? Can you explain why pidfdfs_alloc_file()
> > > > can't simply use, say, iget_locked(pidfdfs_sb, (unsigned long)pid) ?
> > > >
> > > > IIUC, if this pid is freed and then another "struct pid" has the same address
> > > > we can rely on __wait_on_freeing_inode() ?
> > > 
> > > Heh. Maybe it would work, but we really don't want to expose core
> > > kernel pointers to user space as the inode number.
> > 
> > And then also the property that the inode number is unique for the
> > system lifetime is extremely useful for userspace and I would like to
> > retain that property.
> > 
> > > 
> > > So then we'd have to add extra hackery to that (ie we'd have to
> > > intercept stat calls, and we'd have to have something else for
> > > ->d_dname() etc..).
> > > 
> > > Those are all things that the VFS does support, but ...
> > > 
> > > So I do prefer Christian's new approach, although some of it ends up
> > > being a bit unclear.
> > > 
> > > Christian, can you explain why this:
> > > 
> > >         spin_lock(&alias->d_lock);
> > >         dget_dlock(alias);
> > >         spin_unlock(&alias->d_lock);
> > > 
> > > instead of just 'dget()'?
> > 
> > No reason other than I forgot to switch to dget().
> > 
> > > 
> > > Also, while I found the old __ns_get_path() to be fairly disgusting, I
> > > actually think it's simpler and clearer than playing around with the
> > > dentry alias list. So my expectation on code sharing was that you'd
> > 
> > It's overall probably also cheaper, I think.
> > 
> > > basically lift the old __ns_get_path(), make *that* the helper, and
> > > just pass it an argument that is the pointer to the filesystem
> > > "stashed" entry...
> > > 
> > > And yes, using "atomic_long_t" for stashed is a crime against
> > > humanity. It's also entirely pointless. There are no actual atomic
> > > operations that the code wants except for reading and writing (aka
> > > READ_ONCE() and WRITE_ONCE()) and cmpxchg (aka just cmpxchg()). Using
> > > "atomic_long_t" buys the code nothing, and only makes things more
> > > complicated and requires crazy casts.
> > 
> > Yup, I had that as a draft and that introduced struct ino_stash which
> > contained a dentry pointer and the inode number using cmpxchg(). But I
> > decided against this because ns_common.h would require to have access to
> > ino_stash definition so we wouldn't just able to hide it in internal.h
> > where it should belong.
> 
> Right, I remember. The annoying thing will be how to cleanly handle this
> without having to pass too many parameters because we need d_fsdata, the
> vfsmount, and the inode->i_fop. So let me see if I can get this to
> something that doesn't look too ugly.

So, I'm running out of time today so I'm appending the draft I jotted
down now (untested).
Roughly, here's what I think would work for both nsfs and pidfs (I got
the hint and will rename from pidfdfs ;)). The thing that makes it a bit
tricky is that we need to indicate to the caller whether we've reused a
stashed or added a new inode/dentry so the caller can put the reference
to the object it took for i_private in case we reused a dentry/inode. On
EAGAIN i_private is property of the fs and will be cleaned up in
->evict(). Alternative is a callback for getting a reference which I
think is also ugly. Better suggestions welcome of course.
Linus Torvalds Feb. 18, 2024, 6:08 p.m. UTC | #16
On Sun, 18 Feb 2024 at 09:54, Christian Brauner <brauner@kernel.org> wrote:
>
> So, I'm running out of time today so I'm appending the draft I jotted
> down now (untested).

This looks quite nice to me. Written like that with those well-named
helpers it is much more readable than the old nsfs code, honestly.

The only ugliness I see is the one that comes from the original code -
I'm not thrilled about the "return -EAGAIN" part, and I think that if
we found a previously stashed entry after all, we should loop.

But I think that whole horror comes from a fear of an endless loop
when the dentry is marked dead, and another CPU still sees the old
value (so you can't re-use it, and yet it's not NULL).

Which probably never happens, but I didn't think about it too much.

Anyway, that issue is pre-existing code, and your patch looks like an
improvement regardless, so that is *not* in any way a NAK, it's just
me musing on the code.

Al's email address has been bouncing for me for the last week or so,
so presumably he is not getting this and commenting on it. Oh well.

Anyway, ACK from me on this approach (assuming it passes your testing
when you get back, of course).

              Linus
Linus Torvalds Feb. 18, 2024, 6:57 p.m. UTC | #17
On Sun, 18 Feb 2024 at 10:08, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The only ugliness I see is the one that comes from the original code -
> I'm not thrilled about the "return -EAGAIN" part, and I think that if
> we found a previously stashed entry after all, we should loop.
>
> But I think that whole horror comes from a fear of an endless loop
> when the dentry is marked dead, and another CPU still sees the old
> value (so you can't re-use it, and yet it's not NULL).

Actually, I think this is fairly easily fixable, but let's fix it
*after* you've done your cleanups.

The eventual fix is fairly simple: allow installing a new dentry not
just as a replacement for a previous NULL dentry, but also replacing a
previous dead dentry.

That requires just two simple changes:

 - the ->d_prune() callback should no longer just blindly set the
stashed value to NULL, it would do

        // Somebody could have re-used our stash as the dentry
        // died, so we only NULL it out of if matches our pruned one
        cmpxchg(&stashed, dentry, NULL);

 - when installing, instead of doing

        if (cmpxchg(&stashed, NULL, dentry) .. FAIL ..

   we'd loop with something like this:

        guard(rcu)();
        for (;;) {
                struct dentry *old;

                // Assume any old dentry was cleared out
                old = cmpxchg(&stashed, NULL, dentry);
                if (likely(!old))
                        break;

                // Maybe somebody else installed a good dentry
                // .. so release ours and use the new one
                if (lockref_get_not_dead(&old->d_lockref)) {
                        d_delete(dentry);
                        dput(dentry);
                        return old;
                }

                // There's an old dead dentry there, try to take it over
                if (likely(try_cmpxchg(&stashed, old, dentry)))
                        break;

                // Ooops, that failed, to back and try again
                cpu_relax();
        }

        // We successfully installed our dentry
        // as the new stashed one
        return dentry;

which really isn't doesn't look that complicated (note the RCU guard
as a way to make sure this all runs RCU-locked without needing to
worry about the unlock sequence).

Note: your initial optimistic "get_stashed_dentry()" stays exactly as
it is. The above loop is just for the "oh, we didn't trivially re-use
an old dentry, so now we need to allocate a new one and install it"
case.

Anyway, the above is written just in the MUA, there's no testing of
the above, and again - I think this should be done *after* you've done
the cleanups of the current code. But I think it would clarify the odd
race condition with an old dentry dying just as a new one is created,
and make sure there isn't some -EAGAIN case that people need to worry
about.

Because either we can re-use the old one, or there isn't an old one,
or we find a dead one that can't be reused but can just be replaced.

Fairly straightforward, no?

               Linus
Christian Brauner Feb. 19, 2024, 6:05 p.m. UTC | #18
> Fairly straightforward, no?

So, I've finished the series today. It took a while because I wanted to
hammer on it with stress-ng --{procfs,pidfd}, run LTP, and then do some
other tests. It survives everything now for over 24h. There were a few
bugs as expected in the first draft.

@Linus, if you're up for it, please take a look at:

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.pidfd

The topmost 6 commits contain everything we've had on here. I renamed
pidfdfs to pidfs and all helpers completely mechanically. The rest is
unchanged. At the end we have the conversion fo path_from_stashed(). And
the topmost commit implements your suggestion. Let me know if your ACK
still stands.
Linus Torvalds Feb. 19, 2024, 6:34 p.m. UTC | #19
On Mon, 19 Feb 2024 at 10:05, Christian Brauner <brauner@kernel.org> wrote:
>
> @Linus, if you're up for it, please take a look at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.pidfd
>
> The topmost 6 commits contain everything we've had on here.

Looks ok. The commit message on that last one reads a bit oddly,
because the "Quoting Linus" part looks like it means just the first
quote, even if it's really everything.

I'd suggest you just not attribute that explanation to me at all, and
edit it down to just a neutral explanation of what is going on.

But the code itself looks fine, and I like how it just cleaned up the
callers a lot now that they don't have that odd EAGAIN loop thing.

I expected that to happen, of course, and it was the point of my
suggestion, but it's still nice to actually see it as a patch that
removes the nasty code rather than just my "I think that's ugly and
could be done differently".

             Linus
Christian Brauner Feb. 19, 2024, 9:18 p.m. UTC | #20
On Mon, Feb 19, 2024 at 10:34:47AM -0800, Linus Torvalds wrote:
> On Mon, 19 Feb 2024 at 10:05, Christian Brauner <brauner@kernel.org> wrote:
> >
> > @Linus, if you're up for it, please take a look at:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.pidfd
> >
> > The topmost 6 commits contain everything we've had on here.
> 
> Looks ok. The commit message on that last one reads a bit oddly,
> because the "Quoting Linus" part looks like it means just the first
> quote, even if it's really everything.
> 
> I'd suggest you just not attribute that explanation to me at all, and
> edit it down to just a neutral explanation of what is going on.

Done.

> 
> But the code itself looks fine, and I like how it just cleaned up the
> callers a lot now that they don't have that odd EAGAIN loop thing.

Yup.

> 
> I expected that to happen, of course, and it was the point of my
> suggestion, but it's still nice to actually see it as a patch that
> removes the nasty code rather than just my "I think that's ugly and
> could be done differently".

I've moved the shared cmpxchg() bit in {ns,pidfs}_prune_dentry() to a
tiny helper so the cmpxchg() isn't coded in the open and is documented
in a single location. Pushed.
Linus Torvalds Feb. 19, 2024, 11:24 p.m. UTC | #21
On Mon, 19 Feb 2024 at 13:18, Christian Brauner <brauner@kernel.org> wrote:
>
> I've moved the shared cmpxchg() bit in {ns,pidfs}_prune_dentry() to a
> tiny helper so the cmpxchg() isn't coded in the open and is documented
> in a single location.

Ack, thumbs up. LGTM,

              Linus
Nathan Chancellor Feb. 22, 2024, 7:03 p.m. UTC | #22
Hi Christian,

On Tue, Feb 13, 2024 at 05:45:47PM +0100, Christian Brauner wrote:
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do simply because of the
> very justified limitations of anonymous inodes. Moving pidfds to a tiny
> pseudo filesystem allows:
> 
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() and then comparing inode
>   numbers.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>   file->private_data. This means it is now possible to introduce
>   concepts that operate on a process once all file descriptors have been
>   closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>   pid will refer to the same inode if it's opened multiple times. In
>   contrast to now where each struct pid refers to the same inode. Even
>   if we were to move to anon_inode_create_getfile() which creates new
>   inodes we'd still be associating the same struct pid with multiple
>   different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>   all security hooks are called unblocking proper LSM management for
>   pidfds. In addition fsnotify hooks are called and allow for listening
>   to open events on pidfds.
> 
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no complex
> inode operations, nothing. Dentries and inodes are always deleted when
> the last pidfd is closed.
> 
> The code is entirely optional and fairly small. If it's not selected we
> fallback to anonymous inodes. Heavily inspired by nsfs which uses a
> similar stashing mechanism just for namespaces.
> 
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Apologies if this has already been reported or fixed but I did not see
anything on the mailing list.

On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
services such as abrtd, dbus, and polkit fail to start on my Fedora
machines, which causes further isssues like failing to start network
interfaces with NetworkManager. I can easily reproduce this in a Fedora
39 QEMU virtual machine, which has:

  # systemctl --version
  systemd 254 (254.9-1.fc39)
  +PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP -GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN -IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +BPF_FRAMEWORK +XKBCOMMON +UTMP +SYSVINIT
   default-hierarchy=unified

Unfortunately, there does not really appear to be much information to
provide off bat but I am more than happy to try and gather whatever
information would be helpful if you are not able to reproduce locally.

  # uname -r
  6.8.0-rc1-00017-ga1a466d5af6c

  # zgrep CONFIG_FS_PID /proc/config.gz
  CONFIG_FS_PID=y

  # systemctl status polkit.service
  × polkit.service - Authorization Manager
       Loaded: loaded (/usr/lib/systemd/system/polkit.service; static)
      Drop-In: /usr/lib/systemd/system/service.d
               └─10-timeout-abort.conf
       Active: failed (Result: timeout) since Thu 2024-02-22 11:35:52 MST; 11min ago
         Docs: man:polkit(8)
      Process: 844 ExecStart=/usr/lib/polkit-1/polkitd --no-debug (code=killed, signal=TERM)
     Main PID: 844 (code=killed, signal=TERM)
          CPU: 116ms

  Feb 22 11:34:22 qemu systemd[1]: Starting polkit.service - Authorization Manager...
  Feb 22 11:34:22 qemu polkitd[844]: Started polkitd version 123
  Feb 22 11:34:22 qemu polkitd[844]: Loading rules from directory /etc/polkit-1/rules.d
  Feb 22 11:34:22 qemu polkitd[844]: Loading rules from directory /usr/share/polkit-1/rules.d
  Feb 22 11:34:22 qemu polkitd[844]: Finished loading, compiling and executing 5 rules
  Feb 22 11:34:22 qemu polkitd[844]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
  Feb 22 11:35:52 qemu systemd[1]: polkit.service: start operation timed out. Terminating.
  Feb 22 11:35:52 qemu systemd[1]: polkit.service: Failed with result 'timeout'.
  Feb 22 11:35:52 qemu systemd[1]: Failed to start polkit.service - Authorization Manager.

vs.

  # uname -r
  6.8.0-rc1-00016-gd68c1231c030

  # systemctl status polkit.service
  ● polkit.service - Authorization Manager
       Loaded: loaded (/usr/lib/systemd/system/polkit.service; static)
      Drop-In: /usr/lib/systemd/system/service.d
               └─10-timeout-abort.conf
       Active: active (running) since Thu 2024-02-22 11:30:38 MST; 21s ago
         Docs: man:polkit(8)
     Main PID: 843 (polkitd)
        Tasks: 4 (limit: 19010)
       Memory: 5.0M
          CPU: 169ms
       CGroup: /system.slice/polkit.service
               └─843 /usr/lib/polkit-1/polkitd --no-debug

  Feb 22 11:30:38 qemu systemd[1]: Starting polkit.service - Authorization Manager...
  Feb 22 11:30:38 qemu polkitd[843]: Started polkitd version 123
  Feb 22 11:30:38 qemu polkitd[843]: Loading rules from directory /etc/polkit-1/rules.d
  Feb 22 11:30:38 qemu polkitd[843]: Loading rules from directory /usr/share/polkit-1/rules.d
  Feb 22 11:30:38 qemu polkitd[843]: Finished loading, compiling and executing 5 rules
  Feb 22 11:30:38 qemu polkitd[843]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
  Feb 22 11:30:38 qemu systemd[1]: Started polkit.service - Authorization Manager.

or

  # uname -r
  6.8.0-rc1-00017-ga1a466d5af6c

  # zgrep CONFIG_FS_PID /proc/config.gz
  # CONFIG_FS_PID is not set

  # systemctl status polkit.service
  ● polkit.service - Authorization Manager
       Loaded: loaded (/usr/lib/systemd/system/polkit.service; static)
      Drop-In: /usr/lib/systemd/system/service.d
               └─10-timeout-abort.conf
       Active: active (running) since Thu 2024-02-22 11:52:41 MST; 5min ago
         Docs: man:polkit(8)
     Main PID: 845 (polkitd)
        Tasks: 4 (limit: 19010)
       Memory: 5.0M
          CPU: 177ms
       CGroup: /system.slice/polkit.service
               └─845 /usr/lib/polkit-1/polkitd --no-debug

  Feb 22 11:52:41 qemu systemd[1]: Starting polkit.service - Authorization Manager...
  Feb 22 11:52:41 qemu polkitd[845]: Started polkitd version 123
  Feb 22 11:52:41 qemu polkitd[845]: Loading rules from directory /etc/polkit-1/rules.d
  Feb 22 11:52:41 qemu polkitd[845]: Loading rules from directory /usr/share/polkit-1/rules.d
  Feb 22 11:52:41 qemu polkitd[845]: Finished loading, compiling and executing 5 rules
  Feb 22 11:52:41 qemu polkitd[845]: Acquired the name org.freedesktop.PolicyKit1 on the system bus
  Feb 22 11:52:41 qemu systemd[1]: Started polkit.service - Authorization Manager.

I looked your most recent push of vfs.pidfd but I did not see anything
that would have appeared to fix this, so I did not test it.

Cheers,
Nathan
Heiko Carstens Feb. 23, 2024, 10:18 a.m. UTC | #23
On Thu, Feb 22, 2024 at 12:03:34PM -0700, Nathan Chancellor wrote:
> Apologies if this has already been reported or fixed but I did not see
> anything on the mailing list.
> 
> On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> services such as abrtd, dbus, and polkit fail to start on my Fedora
> machines, which causes further isssues like failing to start network
> interfaces with NetworkManager. I can easily reproduce this in a Fedora
> 39 QEMU virtual machine, which has:

Same here with Fedora 39 on s390 and next-20240223: network does not
come up.

Disabling CONFIG_FS_PID "fixes" the problem.
Christian Brauner Feb. 23, 2024, 11:55 a.m. UTC | #24
> Apologies if this has already been reported or fixed but I did not see
> anything on the mailing list.
> 
> On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> services such as abrtd, dbus, and polkit fail to start on my Fedora
> machines, which causes further isssues like failing to start network
> interfaces with NetworkManager. I can easily reproduce this in a Fedora
> 39 QEMU virtual machine, which has:
> 
>   # systemctl --version
>   systemd 254 (254.9-1.fc39)

If something fails for completely inexplicable reasons:

Feb 23 12:09:58 fed1 audit[353]: AVC avc:  denied  { read write open } for  pid=353 comm="systemd-userdbd" path="pidfd:[709]" dev="pidfs" ino=709 scontext=system_u:system_r:systemd_userdbd_t:>

>   +SELINUX

pidfd creation can now be mediated by LSMs since we can finally go
through the regular open path. That wasn't possible before but LSM
mediation ability had been requested a few times.

In short, we have to update the selinux policy for Fedora. (Fwiw, went
through the same excercise with nsfs back then.)

I've created a pull-request here:

https://github.com/fedora-selinux/selinux-policy/pull/2050

and filed an issue here:

https://bugzilla.redhat.com/show_bug.cgi?id=2265630

We have sufficient time to get this resolved and I was assured that this
would be resolved. If we can't get it resolved in a timely manner we'll
default to N for a while until everything's updated but I'd like to
avoid that. I'll track that issue.
Christian Brauner Feb. 23, 2024, 11:56 a.m. UTC | #25
On Fri, Feb 23, 2024 at 11:18:33AM +0100, Heiko Carstens wrote:
> On Thu, Feb 22, 2024 at 12:03:34PM -0700, Nathan Chancellor wrote:
> > Apologies if this has already been reported or fixed but I did not see
> > anything on the mailing list.
> > 
> > On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> > services such as abrtd, dbus, and polkit fail to start on my Fedora
> > machines, which causes further isssues like failing to start network
> > interfaces with NetworkManager. I can easily reproduce this in a Fedora
> > 39 QEMU virtual machine, which has:
> 
> Same here with Fedora 39 on s390 and next-20240223: network does not
> come up.
> 
> Disabling CONFIG_FS_PID "fixes" the problem.

It's Selinux. See the other reply. It's already tracked.
Heiko Carstens Feb. 23, 2024, 12:57 p.m. UTC | #26
On Fri, Feb 23, 2024 at 12:55:07PM +0100, Christian Brauner wrote:
> In short, we have to update the selinux policy for Fedora. (Fwiw, went
> through the same excercise with nsfs back then.)
> 
> I've created a pull-request here:
> 
> https://github.com/fedora-selinux/selinux-policy/pull/2050
> 
> and filed an issue here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2265630
> 
> We have sufficient time to get this resolved and I was assured that this
> would be resolved. If we can't get it resolved in a timely manner we'll
> default to N for a while until everything's updated but I'd like to
> avoid that. I'll track that issue.

So you are basically saying that for now it is ok to break everybody's
system who tries linux-next and let them bisect, just to figure out they
have to disable a config option?

To me this is a clear indication that the default is wrong already now.
Christian Brauner Feb. 23, 2024, 1:27 p.m. UTC | #27
> So you are basically saying that for now it is ok to break everybody's
> system who tries linux-next and let them bisect, just to figure out they
> have to disable a config option?

Let me fix that suggestive phrasing for you. You seem to struggle a bit:

> I think we should flip the default because this is breaking people who
> are trying linux-next and making them bisect which can be annoying.

Yes, that was the intent. I probably should've made that clear. What I
was trying to do is to flip the default to N and fix the policy. When
that is done we should aim to flip back to y.
Heiko Carstens Feb. 23, 2024, 1:35 p.m. UTC | #28
On Fri, Feb 23, 2024 at 02:27:23PM +0100, Christian Brauner wrote:
> > So you are basically saying that for now it is ok to break everybody's
> > system who tries linux-next and let them bisect, just to figure out they
> > have to disable a config option?
> 
> Let me fix that suggestive phrasing for you. You seem to struggle a bit:

Thanks for helping!

> > I think we should flip the default because this is breaking people who
> > are trying linux-next and making them bisect which can be annoying.
> 
> Yes, that was the intent. I probably should've made that clear. What I
> was trying to do is to flip the default to N and fix the policy. When
> that is done we should aim to flip back to y.

Ok, thanks.
Christian Brauner Feb. 23, 2024, 1:41 p.m. UTC | #29
> default to N for a while until everything's updated but I'd like to

Ok, updated vfs.pidfd with a patch to flip this to n as the default for
now until the LSM learns to deal with this. Should show up in -next
tomorrow.

---

From 57a220844820980f8e3de1c1cd9d112e6e73da83 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Fri, 23 Feb 2024 14:17:21 +0100
Subject: [PATCH] pidfs: default to n for now

Moving pidfds from the anonymous inode infrastructure to a separate tiny
in-kernel filesystem similar to sockfs, pipefs, and anon_inodefs causes
Selinux denials and thus various userspace components that make heavy
use of pidfds to fail.

Feb 23 12:09:58 fed1 audit[353]: AVC avc:  denied  { read write open } for  pid=353 comm="systemd-userdbd" path="pidfd:[709]" dev="pidfs" ino=709 scontext=system_u:system_r:systemd_userdbd_t:>

So far pidfds weren't able to be mediated by selinux which was requested
multiple times. Now that pidfs exists it is actually possible to medite
pidfds because they go through the regular open path that calls the
security_file_open() hook. This is a huge advantage.

Until the Selinux policy is fixed we need to default to n to avoid
breaking people. That process is under way in [1] and [2].

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2265630 [1]
Link: https://github.com/fedora-selinux/selinux-policy/pull/2050 [2]
Reported-by: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240222190334.GA412503@dev-arch.thelio-3990X
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index f3dbd84a0e40..eea2582fd4af 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -177,7 +177,7 @@ source "fs/sysfs/Kconfig"
 config FS_PID
 	bool "Pseudo filesystem for process file descriptors"
 	depends on 64BIT
-	default y
+	default n
 	help
 	  Pidfs implements advanced features for process file descriptors.
Christian Brauner Feb. 23, 2024, 9:26 p.m. UTC | #30
On Fri, Feb 23, 2024 at 12:55:07PM +0100, Christian Brauner wrote:
> > Apologies if this has already been reported or fixed but I did not see
> > anything on the mailing list.
> > 
> > On next-20240221 and next-20240222, with CONFIG_FS_PID=y, some of my
> > services such as abrtd, dbus, and polkit fail to start on my Fedora
> > machines, which causes further isssues like failing to start network
> > interfaces with NetworkManager. I can easily reproduce this in a Fedora
> > 39 QEMU virtual machine, which has:
> > 
> >   # systemctl --version
> >   systemd 254 (254.9-1.fc39)
> 
> If something fails for completely inexplicable reasons:
> 
> Feb 23 12:09:58 fed1 audit[353]: AVC avc:  denied  { read write open } for  pid=353 comm="systemd-userdbd" path="pidfd:[709]" dev="pidfs" ino=709 scontext=system_u:system_r:systemd_userdbd_t:>
> 
> >   +SELINUX
> 
> pidfd creation can now be mediated by LSMs since we can finally go
> through the regular open path. That wasn't possible before but LSM
> mediation ability had been requested a few times.
> 
> In short, we have to update the selinux policy for Fedora. (Fwiw, went
> through the same excercise with nsfs back then.)
> 
> I've created a pull-request here:
> 
> https://github.com/fedora-selinux/selinux-policy/pull/2050
> 
> and filed an issue here:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=2265630
> 
> We have sufficient time to get this resolved and I was assured that this
> would be resolved. If we can't get it resolved in a timely manner we'll
> default to N for a while until everything's updated but I'd like to
> avoid that. I'll track that issue.

So I want to provide more context since I took the time to track this
all down in detail.

The failure you are seeing is indeed an selinux denial as I've pointed
out. The core failure is dbus-broker. That cascades into all the other
services failing. When dbus-broker fails to start polkit and all the
others won't be able to work because they depend on dbus-broker.

The reason for dbus-broker failing is because it doesn't handle failures
for SO_PEERPIDFD correctly. Last kernel release (either v6.7 or v6.6,
I'm not completely sure right now) we introduced SO_PEERPIDFD (and
SCM_PIDFD). SO_PEERPIDFD allows dbus-broker and polkit and others to
receive a pidfd for the peer of an AF_UNIX socket. This is the first
time in the history of Linux that we can safely authenticate clients in
a race-free manner. :)

dbus-broker immediately made use of this but messed up the error
checking. It only allowed EINVAL as a valid failure for SO_PEERPIDFD.
That's obviously problematic not just because of LSM denials but because
of seccomp denials that would prevent SO_PEERPIDFD from working; or any
other new error code from there.

So this is catching a flawed implementation in dbus-broker as well. It
_has_ to fallback to the old pid-based authentication when SO_PEERPIDFD
doesn't work no matter the reasons otherwise it'll always risk such
failures.

So, the immediate fix separate from the selinux policy update is to fix
dbus-broker which we've done now:

https://github.com/bus1/dbus-broker/pull/343

That should make it into Fedora asap as well.

The selinux reference policy is also updated in addition to the Fedora
policy:

https://github.com/bus1/dbus-broker/pull/343

So overall that LSM denial should not have caused dbus-broker to fail.
It can never assume that a feature released one kernel ago like
SO_PEERPIDFD can be assumed to be available.
Linus Torvalds Feb. 23, 2024, 9:58 p.m. UTC | #31
On Fri, 23 Feb 2024 at 13:26, Christian Brauner <brauner@kernel.org> wrote:
>
> So, the immediate fix separate from the selinux policy update is to fix
> dbus-broker which we've done now:
>
> https://github.com/bus1/dbus-broker/pull/343

Why is that code then continuing the idiocy of doing different things
for different error conditions?

IOW, it causes user space failure when that code doesn't fall back to
"don't do pidfd", but then it continues the crazy habit of treating
*some* error returns as "fallback to not use pidfd" and other errors
as "fail user space".

That was the fundamental bug with special-casing EINVAL in the first
place, and the above "fix" continues the braindamage.

Did nobody learn anything?

Also, honestly, if this breaks existing setups, then we should fix the
kernel anyway. Changing things from the old anonymous inodes to the
new pidfs inodes should *not* have caused any LSM denial issues.

You used the same pointer to dbus-broker for the LSM changes, but I
really don't think this should have required LSM changes in the first
place. Your reaction to "my kernel change caused LSM to barf" should
have made you go "let's fix the kernel so that LSM _doesn't_ barf".

Maybe by making pidfs look exactly like anonfs to LSM. Since I don't
see the LSM change, I'm not actually sure exactly what LSM even
reacted to in that switch-over.

           Linus
Christian Brauner Feb. 24, 2024, 5:52 a.m. UTC | #32
On Fri, Feb 23, 2024 at 01:58:36PM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 13:26, Christian Brauner <brauner@kernel.org> wrote:
> >
> > So, the immediate fix separate from the selinux policy update is to fix
> > dbus-broker which we've done now:
> >
> > https://github.com/bus1/dbus-broker/pull/343
> 
> Why is that code then continuing the idiocy of doing different things
> for different error conditions?

Not under my control unfortunately.

> Also, honestly, if this breaks existing setups, then we should fix the
> kernel anyway. Changing things from the old anonymous inodes to the
> new pidfs inodes should *not* have caused any LSM denial issues.
> 
> You used the same pointer to dbus-broker for the LSM changes, but I
> really don't think this should have required LSM changes in the first
> place. Your reaction to "my kernel change caused LSM to barf" should
> have made you go "let's fix the kernel so that LSM _doesn't_ barf".
> 
> Maybe by making pidfs look exactly like anonfs to LSM. Since I don't
> see the LSM change, I'm not actually sure exactly what LSM even
> reacted to in that switch-over.

This is selinux. So I think this is a misunderstanding. This isn't
something we can fix in the kernel. If Selinux is in enforcing mode in
userspace and it encounters anything that it doesn't know about it will
deny it by default. And the policy is entirely in userspace including
declaring new types for stuff like nsfs or pidfs to allow it. There's
just nothing to do in the kernel.

The Selinux policy update in userspace would always have to happen just
like it had to for nsfs. Usually that happens after a change has landed
and people realize breakage or realize that new functionality isn't
available. This time it's just interacting with bad error handling in
dbus-broker.
Christian Brauner Feb. 24, 2024, 6:05 a.m. UTC | #33
On Sat, Feb 24, 2024 at 06:52:41AM +0100, Christian Brauner wrote:
> On Fri, Feb 23, 2024 at 01:58:36PM -0800, Linus Torvalds wrote:
> > On Fri, 23 Feb 2024 at 13:26, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > So, the immediate fix separate from the selinux policy update is to fix
> > > dbus-broker which we've done now:
> > >
> > > https://github.com/bus1/dbus-broker/pull/343
> > 
> > Why is that code then continuing the idiocy of doing different things
> > for different error conditions?
> 
> Not under my control unfortunately.
> 
> > Also, honestly, if this breaks existing setups, then we should fix the
> > kernel anyway. Changing things from the old anonymous inodes to the
> > new pidfs inodes should *not* have caused any LSM denial issues.
> > 
> > You used the same pointer to dbus-broker for the LSM changes, but I
> > really don't think this should have required LSM changes in the first
> > place. Your reaction to "my kernel change caused LSM to barf" should
> > have made you go "let's fix the kernel so that LSM _doesn't_ barf".
> > 
> > Maybe by making pidfs look exactly like anonfs to LSM. Since I don't
> > see the LSM change, I'm not actually sure exactly what LSM even
> > reacted to in that switch-over.
> 
> This is selinux. So I think this is a misunderstanding. This isn't
> something we can fix in the kernel. If Selinux is in enforcing mode in
> userspace and it encounters anything that it doesn't know about it will
> deny it by default. And the policy is entirely in userspace including
> declaring new types for stuff like nsfs or pidfs to allow it. There's
> just nothing to do in the kernel.
> 
> The Selinux policy update in userspace would always have to happen just
> like it had to for nsfs. Usually that happens after a change has landed
> and people realize breakage or realize that new functionality isn't
> available. This time it's just interacting with bad error handling in
> dbus-broker.

I found the old thread for nsfs for example. Same thing:

https://www.spinics.net/lists/selinux/msg18425.html

"Since Linux 3.19 targets of /proc/PID/ns/* symlinks have lived in a fs
separated from /proc, named nsfs [1].  [...] 
When using a recent kernel with a policy without nsfs support, the
inodes are not labeled, as reported for example in Fedora bug #1234757
[3].  As I encounter this issue on my systems, I asked yesterday on the
refpolicy ML how nsfs inodes should be labeled [4]."

With the asker being pointed to a userspace policy update in

https://spinics.net/lists/selinux/msg18426.html

Honestly, my default reaction is always to test things like that with
various security modules and if I encounter anything that I can fix in
the kernel I do it. But the policies aren't in the kernel. The last link
above explicitly mentions this.
Linus Torvalds Feb. 24, 2024, 6:48 p.m. UTC | #34
On Fri, 23 Feb 2024 at 21:52, Christian Brauner <brauner@kernel.org> wrote:
>
> This is selinux. So I think this is a misunderstanding. This isn't
> something we can fix in the kernel.

Sure it is. SELinux just goes by what the kernel tells it anyway.

Presumably this is purely about the fact that the inode in question
*used* to be that magical 'anon_inode_inode' that is shared when you
don't want or need a separate inode allocation. I assume it doesn't
even look at that, it just looks at the 'anon_inode_fs_type' thing (or
maybe at the anon_inode_mnt->mnt_sb that is created by kern_mount in
anon_inode_init?)

IOW, isn't the *only* difference that selinux can actually see just
the inode allocation? It used to be that

       inode = anon_inode_getfile();

now it is

        inode = new_inode_pseudo(pidfdfs_sb);

and instead of sharing one single inode (like anon_inode_getfile()
does unless you ask for separate inodes), it now shares the dentry
instead (for the same pid).

Would selinux be happy if the inode allocation just used the
anon_inode superblock instead of pidfdfs_sb?

               Linus
Christian Brauner Feb. 24, 2024, 7:15 p.m. UTC | #35
On Sat, Feb 24, 2024 at 10:48:11AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 21:52, Christian Brauner <brauner@kernel.org> wrote:
> >
> > This is selinux. So I think this is a misunderstanding. This isn't
> > something we can fix in the kernel.
> 
> Sure it is. SELinux just goes by what the kernel tells it anyway.
> 
> Presumably this is purely about the fact that the inode in question
> *used* to be that magical 'anon_inode_inode' that is shared when you
> don't want or need a separate inode allocation. I assume it doesn't
> even look at that, it just looks at the 'anon_inode_fs_type' thing (or
> maybe at the anon_inode_mnt->mnt_sb that is created by kern_mount in
> anon_inode_init?)
> 
> IOW, isn't the *only* difference that selinux can actually see just
> the inode allocation? It used to be that
> 
>        inode = anon_inode_getfile();
> 
> now it is
> 
>         inode = new_inode_pseudo(pidfdfs_sb);
> 
> and instead of sharing one single inode (like anon_inode_getfile()
> does unless you ask for separate inodes), it now shares the dentry
> instead (for the same pid).
> 
> Would selinux be happy if the inode allocation just used the
> anon_inode superblock instead of pidfdfs_sb?

No, unfortunately not. The core issue is that anon_inode_getfile() isn't
subject to any LSM hooks which is what pidfds used. But dentry_open() is
via security_file_open(). LSMs wanted to have a say in pidfd mediation
which is now possible. So the switch to dentry_open() is what is causing
the issue.

But here's a straightforward fix appended. We let pidfs.c use that fix
as and then we introduce a new LSM hook for pidfds that allows mediation
of pidfds and selinux can implement it when they're ready. This is
regression free and future proof. I actually tested this already today.

How does that sounds?
Christian Brauner Feb. 24, 2024, 7:19 p.m. UTC | #36
> How does that sounds?

And fwiw, this is equivalent to what we do today.
Linus Torvalds Feb. 24, 2024, 7:21 p.m. UTC | #37
On Sat, 24 Feb 2024 at 11:16, Christian Brauner <brauner@kernel.org> wrote:
>
> > Would selinux be happy if the inode allocation just used the
> > anon_inode superblock instead of pidfdfs_sb?
>
> No, unfortunately not. The core issue is that anon_inode_getfile() isn't
> subject to any LSM hooks which is what pidfds used. But dentry_open() is
> via security_file_open().

Ahh.

> But here's a straightforward fix appended. We let pidfs.c use that fix
> as and then we introduce a new LSM hook for pidfds that allows mediation
> of pidfds and selinux can implement it when they're ready. This is
> regression free and future proof. I actually tested this already today.
>
> How does that sounds?

Ack. Perfect. This is how new features go in: they act like the old
ones, but have expanded capabilities that you can expose for people
who want to use them.

The fact that this all apparently happened in nsfs too is a bit sad. I
hadn't even been aware of it.

I absolutely *hate* how some kernel people will just say "the fix is
to upgrade your user space".

Oh well, water under the bridge. But let's do it right for pidfs, and
your fix looks good to me.

Thanks,

            Linus
Nathan Chancellor Feb. 27, 2024, 7:26 p.m. UTC | #38
Hi Christian,

On Sat, Feb 24, 2024 at 08:15:53PM +0100, Christian Brauner wrote:
> On Sat, Feb 24, 2024 at 10:48:11AM -0800, Linus Torvalds wrote:
> > On Fri, 23 Feb 2024 at 21:52, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > This is selinux. So I think this is a misunderstanding. This isn't
> > > something we can fix in the kernel.
> > 
> > Sure it is. SELinux just goes by what the kernel tells it anyway.
> > 
> > Presumably this is purely about the fact that the inode in question
> > *used* to be that magical 'anon_inode_inode' that is shared when you
> > don't want or need a separate inode allocation. I assume it doesn't
> > even look at that, it just looks at the 'anon_inode_fs_type' thing (or
> > maybe at the anon_inode_mnt->mnt_sb that is created by kern_mount in
> > anon_inode_init?)
> > 
> > IOW, isn't the *only* difference that selinux can actually see just
> > the inode allocation? It used to be that
> > 
> >        inode = anon_inode_getfile();
> > 
> > now it is
> > 
> >         inode = new_inode_pseudo(pidfdfs_sb);
> > 
> > and instead of sharing one single inode (like anon_inode_getfile()
> > does unless you ask for separate inodes), it now shares the dentry
> > instead (for the same pid).
> > 
> > Would selinux be happy if the inode allocation just used the
> > anon_inode superblock instead of pidfdfs_sb?
> 
> No, unfortunately not. The core issue is that anon_inode_getfile() isn't
> subject to any LSM hooks which is what pidfds used. But dentry_open() is
> via security_file_open(). LSMs wanted to have a say in pidfd mediation
> which is now possible. So the switch to dentry_open() is what is causing
> the issue.
> 
> But here's a straightforward fix appended. We let pidfs.c use that fix
> as and then we introduce a new LSM hook for pidfds that allows mediation
> of pidfds and selinux can implement it when they're ready. This is
> regression free and future proof. I actually tested this already today.
> 
> How does that sounds?

I see a patch similar to this change in your vfs.pidfs branch as
commit 47a1fbce74c3 ("pidfs: convert to path_from_stashed() helper"),
which also appears to be in next-20240227. However, I still seem to be
having similar issues (although I cannot reproduce them every single
boot like I used to). I do see some SELinux denials for pidfs, although
it seems like it is only write that is being denied, rather than open,
read, and write?

  # uname -r
  6.8.0-rc6-next-20240227

  # systemctl --failed --no-legend --plain
  fwupd-refresh.service loaded failed failed Refresh fwupd metadata and update motd
  mcelog.service        loaded failed failed Machine Check Exception Logging Daemon
  polkit.service        loaded failed failed Authorization Manager

  # journalctl -b 0 -g denied -t audit | head -3
  Feb 27 10:49:20 qemu audit[1]: AVC avc:  denied  { write } for  pid=1 comm="systemd" path="pidfd:[1547]" dev="pidfs" ino=1547 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0
  Feb 27 10:49:21 qemu audit[1]: AVC avc:  denied  { write } for  pid=1 comm="systemd" path="pidfd:[1564]" dev="pidfs" ino=1564 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0
  Feb 27 10:50:50 qemu audit[1]: AVC avc:  denied  { write } for  pid=1 comm="systemd" path="pidfd:[1547]" dev="pidfs" ino=1547 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0

Cheers,
Nathan
Christian Brauner Feb. 27, 2024, 10:13 p.m. UTC | #39
> boot like I used to). I do see some SELinux denials for pidfs, although
> it seems like it is only write that is being denied, rather than open,
> read, and write?

Yeah, the full extent of this just became apparent to me today. pidfd
inodes are currently private and they need to continue to be so until
userspace has caught up. This is now fixed and will show up in -next
tomorrow.
Geert Uytterhoeven March 12, 2024, 10:35 a.m. UTC | #40
Hi Christian,

On Tue, 13 Feb 2024, Christian Brauner wrote:
> This moves pidfds from the anonymous inode infrastructure to a tiny
> pseudo filesystem. This has been on my todo for quite a while as it will
> unblock further work that we weren't able to do simply because of the
> very justified limitations of anonymous inodes. Moving pidfds to a tiny
> pseudo filesystem allows:
>
> * statx() on pidfds becomes useful for the first time.
> * pidfds can be compared simply via statx() and then comparing inode
>  numbers.
> * pidfds have unique inode numbers for the system lifetime.
> * struct pid is now stashed in inode->i_private instead of
>  file->private_data. This means it is now possible to introduce
>  concepts that operate on a process once all file descriptors have been
>  closed. A concrete example is kill-on-last-close.
> * file->private_data is freed up for per-file options for pidfds.
> * Each struct pid will refer to a different inode but the same struct
>  pid will refer to the same inode if it's opened multiple times. In
>  contrast to now where each struct pid refers to the same inode. Even
>  if we were to move to anon_inode_create_getfile() which creates new
>  inodes we'd still be associating the same struct pid with multiple
>  different inodes.
> * Pidfds now go through the regular dentry_open() path which means that
>  all security hooks are called unblocking proper LSM management for
>  pidfds. In addition fsnotify hooks are called and allow for listening
>  to open events on pidfds.
>
> The tiny pseudo filesystem is not visible anywhere in userspace exactly
> like e.g., pipefs and sockfs. There's no lookup, there's no complex
> inode operations, nothing. Dentries and inodes are always deleted when
> the last pidfd is closed.
>
> The code is entirely optional and fairly small. If it's not selected we
> fallback to anonymous inodes. Heavily inspired by nsfs which uses a
> similar stashing mechanism just for namespaces.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>

Thanks for your patch, which is now commit cb12fd8e0dabb9a1
("pidfd: add pidfs") upstream.

> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -174,6 +174,12 @@ source "fs/proc/Kconfig"
> source "fs/kernfs/Kconfig"
> source "fs/sysfs/Kconfig"
>
> +config FS_PIDFD
> +	bool "Pseudo filesystem for process file descriptors"
> +	depends on 64BIT

I think it would have been good if this dependency would have been
explained in the commit message.  I.e. why does this depend on 64BIT?

What is the risk this will not stay entirely optional?  I.e. can it
become a requirement for modern userspace, and thus be used as a stick
to kill support for 32-bit architectures?

> +	help
> +	  Pidfdfs implements advanced features for process file descriptors.
> +
> config TMPFS
> 	bool "Tmpfs virtual memory file system support (former shm fs)"
> 	depends on SHMEM

Gr{oetje,eeting}s,

 						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
 							    -- Linus Torvalds
Christian Brauner March 12, 2024, 2:09 p.m. UTC | #41
> What is the risk this will not stay entirely optional?  I.e. can it
> become a requirement for modern userspace, and thus be used as a stick
> to kill support for 32-bit architectures?

Yeah, Linus has requested to remove the config option.
I'm about to send him a patch.
diff mbox series

Patch

diff --git a/fs/Kconfig b/fs/Kconfig
index 89fdbefd1075..c7ed65e34820 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -174,6 +174,12 @@  source "fs/proc/Kconfig"
 source "fs/kernfs/Kconfig"
 source "fs/sysfs/Kconfig"
 
+config FS_PIDFD
+	bool "Pseudo filesystem for process file descriptors"
+	depends on 64BIT
+	help
+	  Pidfdfs implements advanced features for process file descriptors.
+
 config TMPFS
 	bool "Tmpfs virtual memory file system support (former shm fs)"
 	depends on SHMEM
diff --git a/fs/pidfdfs.c b/fs/pidfdfs.c
index 55e8396e7fc4..efc68ef3a08d 100644
--- a/fs/pidfdfs.c
+++ b/fs/pidfdfs.c
@@ -1,9 +1,11 @@ 
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/anon_inodes.h>
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/magic.h>
 #include <linux/mount.h>
 #include <linux/pid.h>
+#include <linux/pidfdfs.h>
 #include <linux/pid_namespace.h>
 #include <linux/poll.h>
 #include <linux/proc_fs.h>
@@ -12,12 +14,25 @@ 
 #include <linux/seq_file.h>
 #include <uapi/linux/pidfd.h>
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op != &pidfd_fops)
+		return ERR_PTR(-EBADF);
+#ifdef CONFIG_FS_PIDFD
+	return file_inode(file)->i_private;
+#else
+	return file->private_data;
+#endif
+}
+
 static int pidfd_release(struct inode *inode, struct file *file)
 {
+#ifndef CONFIG_FS_PIDFD
 	struct pid *pid = file->private_data;
 
 	file->private_data = NULL;
 	put_pid(pid);
+#endif
 	return 0;
 }
 
@@ -59,7 +74,7 @@  static int pidfd_release(struct inode *inode, struct file *file)
  */
 static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
-	struct pid *pid = f->private_data;
+	struct pid *pid = pidfd_pid(f);
 	struct pid_namespace *ns;
 	pid_t nr = -1;
 
@@ -93,7 +108,7 @@  static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
  */
 static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
 {
-	struct pid *pid = file->private_data;
+	struct pid *pid = pidfd_pid(file);
 	bool thread = file->f_flags & PIDFD_THREAD;
 	struct task_struct *task;
 	__poll_t poll_flags = 0;
@@ -121,3 +136,173 @@  const struct file_operations pidfd_fops = {
 	.show_fdinfo	= pidfd_show_fdinfo,
 #endif
 };
+
+#ifdef CONFIG_FS_PIDFD
+static struct vfsmount *pidfdfs_mnt __ro_after_init;
+static struct super_block *pidfdfs_sb __ro_after_init;
+static u64 pidfdfs_ino = 0;
+
+static void pidfdfs_evict_inode(struct inode *inode)
+{
+	struct pid *pid = inode->i_private;
+
+	clear_inode(inode);
+	put_pid(pid);
+}
+
+static const struct super_operations pidfdfs_sops = {
+	.statfs		= simple_statfs,
+	.evict_inode	= pidfdfs_evict_inode,
+};
+
+static void pidfdfs_prune_dentry(struct dentry *dentry)
+{
+	struct inode *inode;
+	struct pid *pid;
+
+	inode = d_inode(dentry);
+	if (!inode)
+		return;
+
+	pid = inode->i_private;
+	atomic_long_set(&pid->stashed, 0);
+}
+
+static char *pidfdfs_dname(struct dentry *dentry, char *buffer, int buflen)
+{
+	return dynamic_dname(buffer, buflen, "pidfd:[%lu]",
+			     d_inode(dentry)->i_ino);
+}
+
+const struct dentry_operations pidfdfs_dentry_operations = {
+	.d_prune	= pidfdfs_prune_dentry,
+	.d_delete	= always_delete_dentry,
+	.d_dname	= pidfdfs_dname,
+};
+
+static int pidfdfs_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	ctx = init_pseudo(fc, PIDFDFS_MAGIC);
+	if (!ctx)
+		return -ENOMEM;
+
+	ctx->ops = &pidfdfs_sops;
+	ctx->dops = &pidfdfs_dentry_operations;
+	return 0;
+}
+
+static struct file_system_type pidfdfs_type = {
+	.name			= "pidfdfs",
+	.init_fs_context	= pidfdfs_init_fs_context,
+	.kill_sb		= kill_anon_super,
+};
+
+static struct dentry *pidfdfs_dentry(struct pid *pid)
+{
+	struct inode *inode;
+	struct dentry *dentry;
+	unsigned long i_ptr;
+
+	inode = new_inode_pseudo(pidfdfs_sb);
+	if (!inode)
+		return ERR_PTR(-ENOMEM);
+
+	inode->i_ino	= pid->ino;
+	inode->i_mode	= S_IFREG | S_IRUGO;
+	inode->i_fop	= &pidfd_fops;
+	inode->i_flags |= S_IMMUTABLE;
+	simple_inode_init_ts(inode);
+	/* grab a reference */
+	inode->i_private = get_pid(pid);
+
+	/* consumes @inode */
+	dentry = d_make_root(inode);
+	if (!dentry)
+		return ERR_PTR(-ENOMEM);
+
+	i_ptr = atomic_long_cmpxchg(&pid->stashed, 0, (unsigned long)dentry);
+	if (i_ptr) {
+		d_delete(dentry); /* make sure ->d_prune() does nothing */
+		dput(dentry);
+		cpu_relax();
+		return ERR_PTR(-EAGAIN);
+	}
+
+	return dentry;
+}
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+
+	struct path path;
+	struct dentry *dentry;
+	struct file *pidfd_file;
+
+	for (;;) {
+		rcu_read_lock();
+		dentry = (struct dentry *)atomic_long_read(&pid->stashed);
+		if (!dentry || !lockref_get_not_dead(&dentry->d_lockref)) {
+			rcu_read_unlock();
+
+			dentry = pidfdfs_dentry(pid);
+			if (!IS_ERR(dentry))
+				break;
+			if (PTR_ERR(dentry) == -EAGAIN)
+				continue;
+		}
+		rcu_read_unlock();
+		break;
+	}
+	if (IS_ERR(dentry))
+		return ERR_CAST(dentry);
+
+	path.mnt = mntget(pidfdfs_mnt);
+	path.dentry = dentry;
+
+	pidfd_file = dentry_open(&path, flags, current_cred());
+	path_put(&path);
+
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid)
+{
+	atomic_long_set(&pid->stashed, 0);
+	pid->ino = ++pidfdfs_ino;
+}
+
+void __init pidfdfs_init(void)
+{
+	int err;
+
+	err = register_filesystem(&pidfdfs_type);
+	if (err)
+		panic("Failed to register pidfdfs pseudo filesystem");
+
+	pidfdfs_mnt = kern_mount(&pidfdfs_type);
+	if (IS_ERR(pidfdfs_mnt))
+		panic("Failed to mount pidfdfs pseudo filesystem");
+
+	pidfdfs_sb = pidfdfs_mnt->mnt_sb;
+}
+
+#else /* !CONFIG_FS_PIDFD */
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags)
+{
+	struct file *pidfd_file;
+
+	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
+					flags | O_RDWR);
+	if (IS_ERR(pidfd_file))
+		return pidfd_file;
+
+	get_pid(pid);
+	return pidfd_file;
+}
+
+void pid_init_pidfdfs(struct pid *pid) { }
+void __init pidfdfs_init(void) { }
+#endif
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 8124d57752b9..1a47676a04c2 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -55,6 +55,10 @@  struct pid
 	refcount_t count;
 	unsigned int level;
 	spinlock_t lock;
+#ifdef CONFIG_FS_PIDFD
+	atomic_long_t stashed;
+	unsigned long ino;
+#endif
 	/* lists of tasks that use this pid */
 	struct hlist_head tasks[PIDTYPE_MAX];
 	struct hlist_head inodes;
diff --git a/include/linux/pidfdfs.h b/include/linux/pidfdfs.h
new file mode 100644
index 000000000000..760dbc163625
--- /dev/null
+++ b/include/linux/pidfdfs.h
@@ -0,0 +1,9 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PIDFDFS_H
+#define _LINUX_PIDFDFS_H
+
+struct file *pidfdfs_alloc_file(struct pid *pid, unsigned int flags);
+void __init pidfdfs_init(void);
+void pid_init_pidfdfs(struct pid *pid);
+
+#endif /* _LINUX_PIDFDFS_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 6325d1d0e90f..a0d5480115c5 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -101,5 +101,6 @@ 
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
+#define PIDFDFS_MAGIC		0x50494446	/* "PIDF" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..0663003f3146 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@ 
 #include <linux/init_syscalls.h>
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
+#include <linux/pidfdfs.h>
 #include <net/net_namespace.h>
 
 #include <asm/io.h>
@@ -1059,6 +1060,7 @@  void start_kernel(void)
 	seq_file_init();
 	proc_root_init();
 	nsfs_init();
+	pidfdfs_init();
 	cpuset_init();
 	cgroup_init();
 	taskstats_init_early();
diff --git a/kernel/fork.c b/kernel/fork.c
index 662a61f340ce..eab2fcc90342 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -102,6 +102,7 @@ 
 #include <linux/iommu.h>
 #include <linux/rseq.h>
 #include <uapi/linux/pidfd.h>
+#include <linux/pidfdfs.h>
 
 #include <asm/pgalloc.h>
 #include <linux/uaccess.h>
@@ -1985,14 +1986,6 @@  static inline void rcu_copy_process(struct task_struct *p)
 #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */
 }
 
-struct pid *pidfd_pid(const struct file *file)
-{
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
-
-	return ERR_PTR(-EBADF);
-}
-
 /**
  * __pidfd_prepare - allocate a new pidfd_file and reserve a pidfd
  * @pid:   the struct pid for which to create a pidfd
@@ -2030,13 +2023,11 @@  static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 	if (pidfd < 0)
 		return pidfd;
 
-	pidfd_file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
-					flags | O_RDWR);
+	pidfd_file = pidfdfs_alloc_file(pid, flags | O_RDWR);
 	if (IS_ERR(pidfd_file)) {
 		put_unused_fd(pidfd);
 		return PTR_ERR(pidfd_file);
 	}
-	get_pid(pid); /* held by pidfd_file now */
 	/*
 	 * anon_inode_getfile() ignores everything outside of the
 	 * O_ACCMODE | O_NONBLOCK mask, set PIDFD_THREAD manually.
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 15781acaac1c..6ec3deec68c2 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -573,7 +573,7 @@  SYSCALL_DEFINE2(setns, int, fd, int, flags)
 	if (proc_ns_file(f.file))
 		err = validate_ns(&nsset, ns);
 	else
-		err = validate_nsset(&nsset, f.file->private_data);
+		err = validate_nsset(&nsset, pidfd_pid(f.file));
 	if (!err) {
 		commit_nsset(&nsset);
 		perf_event_namespaces(current);
diff --git a/kernel/pid.c b/kernel/pid.c
index c1d940fbd314..dbff84493bff 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -42,6 +42,7 @@ 
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
 #include <linux/idr.h>
+#include <linux/pidfdfs.h>
 #include <net/sock.h>
 #include <uapi/linux/pidfd.h>
 
@@ -270,6 +271,7 @@  struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	upid = pid->numbers + ns->level;
 	spin_lock_irq(&pidmap_lock);
+	pid_init_pidfdfs(pid);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	for ( ; upid >= pid->numbers; --upid) {