diff mbox series

anon_inode: use a proper mode internally

Message ID 20250404-aphorismen-reibung-12028a1db7e3@brauner (mailing list archive)
State New
Headers show
Series anon_inode: use a proper mode internally | expand

Commit Message

Christian Brauner April 4, 2025, 10:39 a.m. UTC
This allows the VFS to not trip over anonymous inodes and we can add
asserts based on the mode into the vfs. When we report it to userspace
we can simply hide the mode to avoid regressions. I've audited all
direct callers of alloc_anon_inode() and only secretmen overrides i_mode
and i_op inode operations but it already uses a regular file.

Fixes: af153bb63a336 ("vfs: catch invalid modes in may_open()")
Reported-by: syzbot+5d8e79d323a13aa0b248@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67ed3fb3.050a0220.14623d.0009.GAE@google.com"
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/anon_inodes.c | 36 ++++++++++++++++++++++++++++++++++++
 fs/internal.h    |  3 +++
 fs/libfs.c       |  2 +-
 fs/pidfs.c       | 24 +-----------------------
 4 files changed, 41 insertions(+), 24 deletions(-)

Comments

Mateusz Guzik April 5, 2025, 4:54 a.m. UTC | #1
On Fri, Apr 4, 2025 at 12:39 PM Christian Brauner <brauner@kernel.org> wrote:
>
> This allows the VFS to not trip over anonymous inodes and we can add
> asserts based on the mode into the vfs. When we report it to userspace
> we can simply hide the mode to avoid regressions. I've audited all
> direct callers of alloc_anon_inode() and only secretmen overrides i_mode
> and i_op inode operations but it already uses a regular file.
>
[snip]
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 6393d7c49ee6..0ad3336f5b49 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1647,7 +1647,7 @@ struct inode *alloc_anon_inode(struct super_block *s)
>          * that it already _is_ on the dirty list.
>          */
>         inode->i_state = I_DIRTY;
> -       inode->i_mode = S_IRUSR | S_IWUSR;
> +       inode->i_mode = S_IFREG | S_IRUSR | S_IWUSR;
>         inode->i_uid = current_fsuid();
>         inode->i_gid = current_fsgid();
>         inode->i_flags |= S_PRIVATE;

Switching the mode to S_IFREG imo can open a can of worms (perhaps a
dedicated in-kernel only mode would be better? S_IFANON?), but that's
a long and handwave-y subject, so I'm going to stop at this remark.

I think the key here is to provide an invariant that anon inodes don't
pass MAY_EXEC in may_open() regardless of their specific i_mode and
which the kernel fails to provide at the moment afaics.

With the patch as proposed in the other thread this is handled with
MAY_EXEC check added to the default: clause.

With your patch this is going to depend on the i_mode not allowing
exec or at least the anon inode having a mount or sb with noexec flags
(thus failing path_noexec() check), but from code reading I'm not at
all confident it has one (if it does not, it should get it). Merely
depending on mode is imo too risky as sooner or later there may be a
way to add +x to it. That is to say I think your patch works, provided
the backing mount or sb for anon inodes is guaranteed to have noexec
on it.




--
Mateusz Guzik <mjguzik gmail.com>
Christian Brauner April 6, 2025, 7:14 a.m. UTC | #2
On Sat, Apr 05, 2025 at 06:54:28AM +0200, Mateusz Guzik wrote:
> On Fri, Apr 4, 2025 at 12:39 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > This allows the VFS to not trip over anonymous inodes and we can add
> > asserts based on the mode into the vfs. When we report it to userspace
> > we can simply hide the mode to avoid regressions. I've audited all
> > direct callers of alloc_anon_inode() and only secretmen overrides i_mode
> > and i_op inode operations but it already uses a regular file.
> >
> [snip]
> > diff --git a/fs/libfs.c b/fs/libfs.c
> > index 6393d7c49ee6..0ad3336f5b49 100644
> > --- a/fs/libfs.c
> > +++ b/fs/libfs.c
> > @@ -1647,7 +1647,7 @@ struct inode *alloc_anon_inode(struct super_block *s)
> >          * that it already _is_ on the dirty list.
> >          */
> >         inode->i_state = I_DIRTY;
> > -       inode->i_mode = S_IRUSR | S_IWUSR;
> > +       inode->i_mode = S_IFREG | S_IRUSR | S_IWUSR;
> >         inode->i_uid = current_fsuid();
> >         inode->i_gid = current_fsgid();
> >         inode->i_flags |= S_PRIVATE;
> 
> Switching the mode to S_IFREG imo can open a can of worms (perhaps a
> dedicated in-kernel only mode would be better? S_IFANON?), but that's

No, I don't think we should have weird kernel-only things in i_mode.
That's what i_flags is for.

> a long and handwave-y subject, so I'm going to stop at this remark.
> 
> I think the key here is to provide an invariant that anon inodes don't
> pass MAY_EXEC in may_open() regardless of their specific i_mode and
> which the kernel fails to provide at the moment afaics.

The current way of relying on IS_REG() in do_open_execat() is broken
with current kernel semantics ever since commit 633fb6ac3980 ("exec:
move S_ISREG() check earlier"). That commit introduced a WARN_ON_ONCE()
around the S_ISREG() check which indicates that it wasn't clear to the
authors that this is very likely a reachable WARN_ON_ONCE() via e.g.:

int main(int argc, char *argv[])
{
	int ret, sfd;
	sigset_t mask;
	struct signalfd_siginfo fdsi;

	sigemptyset(&mask);
	sigaddset(&mask, SIGINT);
	sigaddset(&mask, SIGQUIT);

	ret = sigprocmask(SIG_BLOCK, &mask, NULL);
	if (ret < 0)
		_exit(1);

	sfd = signalfd(-1, &mask, 0);
	if (sfd < 0)
		_exit(2);

	ret = fchown(sfd, 5555, 5555);
        if (ret < 0)
		_exit(3);

	ret = fchmod(sfd, 0777);
	if (ret < 0)
		_exit(3);

	/* trigger the WARN_ON_ONCE() */
	execveat(sfd, "", NULL, NULL, AT_EMPTY_PATH);
	_exit(4);
}

Because the mode and owner of the single anonymous inode in the
kernel can be changed by anonyone with suitable privileges. That of
course it itself a bug although not a really meaningful one because
anonymous inodes don't really figure into path lookup. They cannot be
reopened via /proc/<pid>/fd/<nr> and can't be used for lookup itself. So
they can only ever serve as direct references. But I'm going to fix
that. Similar to pidfs we should simply fail setattr [1]. Given that noone has
been hitting the execveat() WARN_ON_ONCE() it's safe to say that no one
ever bothered chowning/chmoding anonymous inodes. With that removed it's
not possible to exec anonymous inodes but we should harden this by
setting SB_I_NOEXEC as well (Secretmem is another very fun example
because it uses anonymous inodes but does set S_ISREG already. So
chmod() on such an inode and passing it to execveat() could be fun.)

Anyway, I'm finishing the patch and testing tomorrow and will send out
with all the things I mentioned (unless I find out I'm wrong).

[1]: That's been a constant sore spot for me. We currently simply
     fallback to simple_setattr() if the fs doesn't provide a dedicated
     ->setattr() function. Imho that's broken and we need to remove
     that. All fses that need it need to set ->setattr() and if it isn't
     set EOPNOTSUPP will be returned. So that would need some code audit
     which fses definitely don't need or expect it. Otherwise we risk
     setting attributes on things that don't want them changed. That's
     patch I had on my backlog for quite a while ever since the big
     xattr/acl rework.
Christian Brauner April 6, 2025, 7:51 p.m. UTC | #3
> Anyway, I'm finishing the patch and testing tomorrow and will send out
> with all the things I mentioned (unless I find out I'm wrong).

Found my notes about this. I knew I had notes about this somewhere...
It isn't possible to execute anoymous inodes because you cannot open
them. That includes stuff like:

execveat(fd_anon_inode, "", NULL, NULL, AT_EMPTY_PATH)

Look, anonymous inodes have inode->f_op set to no_open_fops which sets
no_open() which returns ENXIO. That means any call to do_dentry_open()
which is the endpoint of the do_open_execat() will fail. There's no
chance to execute an anonymous inode. Unless a given subsystem overrides
it ofc.

I still agree that we need to be more coherent about this and we need to
improve various semantical quirks I pointed out. But the exec problem
isn't really an issue so the patch itself still seems correct to me.
diff mbox series

Patch

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 583ac81669c2..42e4b9c34f89 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -24,9 +24,43 @@ 
 
 #include <linux/uaccess.h>
 
+#include "internal.h"
+
 static struct vfsmount *anon_inode_mnt __ro_after_init;
 static struct inode *anon_inode_inode __ro_after_init;
 
+/*
+ * User space expects anonymous inodes to have no file type in st_mode.
+ *
+ * In particular, 'lsof' has this legacy logic:
+ *
+ *	type = s->st_mode & S_IFMT;
+ *	switch (type) {
+ *	  ...
+ *	case 0:
+ *		if (!strcmp(p, "anon_inode"))
+ *			Lf->ntype = Ntype = N_ANON_INODE;
+ *
+ * to detect our old anon_inode logic.
+ *
+ * Rather than mess with our internal sane inode data, just fix it
+ * up here in getattr() by masking off the format bits.
+ */
+int anon_inode_getattr(struct mnt_idmap *idmap, const struct path *path,
+		       struct kstat *stat, u32 request_mask,
+		       unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+
+	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
+	stat->mode &= ~S_IFMT;
+	return 0;
+}
+
+static const struct inode_operations anon_inode_operations = {
+	.getattr = anon_inode_getattr,
+};
+
 /*
  * anon_inodefs_dname() is called from d_path().
  */
@@ -66,6 +100,7 @@  static struct inode *anon_inode_make_secure_inode(
 	if (IS_ERR(inode))
 		return inode;
 	inode->i_flags &= ~S_PRIVATE;
+	inode->i_op = &anon_inode_operations;
 	error =	security_inode_init_security_anon(inode, &QSTR(name),
 						  context_inode);
 	if (error) {
@@ -313,6 +348,7 @@  static int __init anon_inode_init(void)
 	anon_inode_inode = alloc_anon_inode(anon_inode_mnt->mnt_sb);
 	if (IS_ERR(anon_inode_inode))
 		panic("anon_inode_init() inode allocation failed (%ld)\n", PTR_ERR(anon_inode_inode));
+	anon_inode_inode->i_op = &anon_inode_operations;
 
 	return 0;
 }
diff --git a/fs/internal.h b/fs/internal.h
index b9b3e29a73fd..717dc9eb6185 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -343,3 +343,6 @@  static inline bool path_mounted(const struct path *path)
 void file_f_owner_release(struct file *file);
 bool file_seek_cur_needs_f_lock(struct file *file);
 int statmount_mnt_idmap(struct mnt_idmap *idmap, struct seq_file *seq, bool uid_map);
+int anon_inode_getattr(struct mnt_idmap *idmap, const struct path *path,
+		       struct kstat *stat, u32 request_mask,
+		       unsigned int query_flags);
diff --git a/fs/libfs.c b/fs/libfs.c
index 6393d7c49ee6..0ad3336f5b49 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1647,7 +1647,7 @@  struct inode *alloc_anon_inode(struct super_block *s)
 	 * that it already _is_ on the dirty list.
 	 */
 	inode->i_state = I_DIRTY;
-	inode->i_mode = S_IRUSR | S_IWUSR;
+	inode->i_mode = S_IFREG | S_IRUSR | S_IWUSR;
 	inode->i_uid = current_fsuid();
 	inode->i_gid = current_fsgid();
 	inode->i_flags |= S_PRIVATE;
diff --git a/fs/pidfs.c b/fs/pidfs.c
index d64a4cbeb0da..809c3393b6a3 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -572,33 +572,11 @@  static int pidfs_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return -EOPNOTSUPP;
 }
 
-
-/*
- * User space expects pidfs inodes to have no file type in st_mode.
- *
- * In particular, 'lsof' has this legacy logic:
- *
- *	type = s->st_mode & S_IFMT;
- *	switch (type) {
- *	  ...
- *	case 0:
- *		if (!strcmp(p, "anon_inode"))
- *			Lf->ntype = Ntype = N_ANON_INODE;
- *
- * to detect our old anon_inode logic.
- *
- * Rather than mess with our internal sane inode data, just fix it
- * up here in getattr() by masking off the format bits.
- */
 static int pidfs_getattr(struct mnt_idmap *idmap, const struct path *path,
 			 struct kstat *stat, u32 request_mask,
 			 unsigned int query_flags)
 {
-	struct inode *inode = d_inode(path->dentry);
-
-	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
-	stat->mode &= ~S_IFMT;
-	return 0;
+	return anon_inode_getattr(idmap, path, stat, request_mask, query_flags);
 }
 
 static const struct inode_operations pidfs_inode_operations = {