diff mbox series

[v6,2/5] userfaultfd: add /dev/userfaultfd for fine grained access control

Message ID 20220817214728.489904-3-axelrasmussen@google.com (mailing list archive)
State New
Series userfaultfd: add /dev/userfaultfd for fine grained access control

Commit Message

Axel Rasmussen Aug. 17, 2022, 9:47 p.m. UTC
Historically, it has been shown that intercepting kernel faults with
userfaultfd (thereby forcing the kernel to wait for an arbitrary amount
of time) can be exploited, or at least can make some kinds of exploits
easier. So, in 37cd0575b8 "userfaultfd: add UFFD_USER_MODE_ONLY" we
changed things so, in order for kernel faults to be handled by
userfaultfd, either the process needs CAP_SYS_PTRACE, or the
vm.unprivileged_userfaultfd sysctl must be set so that any unprivileged
user can do it.

In a typical implementation of a hypervisor with live migration (take
QEMU/KVM as one such example), we do indeed need to be able to handle
kernel faults. But, both options above are less than ideal:

- Toggling the sysctl increases attack surface by allowing any
  unprivileged user to do it.

- Granting the live migration process CAP_SYS_PTRACE gives it this
  ability, but *also* the ability to "observe and control the
  execution of another process [...], and examine and change [its]
  memory and registers" (from ptrace(2)). This isn't something we need
  or want to be able to do, so granting this permission violates the
  "principle of least privilege".

This is all a long-winded way to say: we want a more fine-grained way to
grant access to userfaultfd, without granting other additional
permissions at the same time.

To achieve this, add a /dev/userfaultfd misc device. This device
provides an alternative to the userfaultfd(2) syscall for the creation
of new userfaultfds. The idea is, any userfaultfds created this way will
be able to handle kernel faults, without the caller having any special
capabilities. Access to this mechanism is instead restricted using e.g.
standard filesystem permissions.

Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Nadav Amit <namit@vmware.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 fs/userfaultfd.c                 | 73 +++++++++++++++++++++++++-------
 include/uapi/linux/userfaultfd.h |  4 ++
 2 files changed, 61 insertions(+), 16 deletions(-)

Comments

Greg Kroah-Hartman Aug. 18, 2022, 6:25 a.m. UTC | #1
On Wed, Aug 17, 2022 at 02:47:25PM -0700, Axel Rasmussen wrote:
>  static int __init userfaultfd_init(void)
>  {
> +	WARN_ON(misc_register(&userfaultfd_misc));

Please no.

Spell this out and properly error out if there is an issue:
	int ret;

	ret = misc_register(&userfaultfd_misc);
	if (ret)
		return ret;

Handle issues properly, don't paper over them with WARN_ON().

thanks,

greg k-h
Greg Kroah-Hartman Aug. 18, 2022, 6:26 a.m. UTC | #2
On Wed, Aug 17, 2022 at 02:47:25PM -0700, Axel Rasmussen wrote:
> +static int userfaultfd_dev_open(struct inode *inode, struct file *file)
> +{
> +	return 0;

If your open does nothing, no need to list it here at all, right?

> +}
> +
> +static long userfaultfd_dev_ioctl(struct file *file, unsigned int cmd, unsigned long flags)
> +{
> +	if (cmd != USERFAULTFD_IOC_NEW)
> +		return -EINVAL;
> +
> +	return new_userfaultfd(flags);
> +}
> +
> +static const struct file_operations userfaultfd_dev_fops = {
> +	.open = userfaultfd_dev_open,
> +	.unlocked_ioctl = userfaultfd_dev_ioctl,
> +	.compat_ioctl = userfaultfd_dev_ioctl,

Why do you need to set compat_ioctl?  Shouldn't it just default to the
existing one?

And why is this a device node at all?  Shouldn't the syscall handle all
of this (to be honest, I didn't read anything but the misc code, sorry.)

thanks,

greg k-h
Greg Kroah-Hartman Aug. 18, 2022, 6:32 a.m. UTC | #3
On Thu, Aug 18, 2022 at 08:26:38AM +0200, Greg KH wrote:
> On Wed, Aug 17, 2022 at 02:47:25PM -0700, Axel Rasmussen wrote:
> > +static int userfaultfd_dev_open(struct inode *inode, struct file *file)
> > +{
> > +	return 0;
> 
> If your open does nothing, no need to list it here at all, right?
> 
> > +}
> > +
> > +static long userfaultfd_dev_ioctl(struct file *file, unsigned int cmd, unsigned long flags)
> > +{
> > +	if (cmd != USERFAULTFD_IOC_NEW)
> > +		return -EINVAL;
> > +
> > +	return new_userfaultfd(flags);
> > +}
> > +
> > +static const struct file_operations userfaultfd_dev_fops = {
> > +	.open = userfaultfd_dev_open,
> > +	.unlocked_ioctl = userfaultfd_dev_ioctl,
> > +	.compat_ioctl = userfaultfd_dev_ioctl,
> 
> Why do you need to set compat_ioctl?  Shouldn't it just default to the
> existing one?
> 
> And why is this a device node at all?  Shouldn't the syscall handle all
> of this (to be honest, I didn't read anything but the misc code, sorry.)

Ah, read the documentation now.  Seems you want to make it easier for
people to get permissions on a system.  Doesn't seem wise, but hey, it's
not my feature...

thanks,

greg k-h
Axel Rasmussen Aug. 18, 2022, 5:22 p.m. UTC | #4
On Wed, Aug 17, 2022 at 11:32 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Thu, Aug 18, 2022 at 08:26:38AM +0200, Greg KH wrote:
> > On Wed, Aug 17, 2022 at 02:47:25PM -0700, Axel Rasmussen wrote:
> > > +static int userfaultfd_dev_open(struct inode *inode, struct file *file)
> > > +{
> > > +   return 0;
> >
> > If your open does nothing, no need to list it here at all, right?
> >
> > > +}
> > > +
> > > +static long userfaultfd_dev_ioctl(struct file *file, unsigned int cmd, unsigned long flags)
> > > +{
> > > +   if (cmd != USERFAULTFD_IOC_NEW)
> > > +           return -EINVAL;
> > > +
> > > +   return new_userfaultfd(flags);
> > > +}
> > > +
> > > +static const struct file_operations userfaultfd_dev_fops = {
> > > +   .open = userfaultfd_dev_open,
> > > +   .unlocked_ioctl = userfaultfd_dev_ioctl,
> > > +   .compat_ioctl = userfaultfd_dev_ioctl,
> >
> > Why do you need to set compat_ioctl?  Shouldn't it just default to the
> > existing one?
> >
> > And why is this a device node at all?  Shouldn't the syscall handle all
> > of this (to be honest, I didn't read anything but the misc code, sorry.)
>
> Ah, read the documentation now.  Seems you want to make it easier for
> people to get permissions on a system.  Doesn't seem wise, but hey, it's
> not my feature...

Thanks for taking a look Greg!

With the syscall, the only way to get access to this feature is to
have CAP_SYS_PTRACE. Which gives you access to this, *plus* a bunch
more stuff.

My basic goal is to grant access to just this feature by itself, not
really just to make it easier to access. I think a device node is the
simplest way to achieve that (see the cover letter for considered
alternatives).

The other feedback looks like good simplification to me - I'll send
another version with those changes. I have to admit this is the first
time I've messed with misc device nodes, so apologies for being overly
explicit. :)

>
> thanks,
>
> greg k-h
Axel Rasmussen Aug. 19, 2022, 8:12 p.m. UTC | #5
On Wed, Aug 17, 2022 at 11:26 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Wed, Aug 17, 2022 at 02:47:25PM -0700, Axel Rasmussen wrote:
> > +static int userfaultfd_dev_open(struct inode *inode, struct file *file)
> > +{
> > +     return 0;
>
> If your open does nothing, no need to list it here at all, right?
>
> > +}
> > +
> > +static long userfaultfd_dev_ioctl(struct file *file, unsigned int cmd, unsigned long flags)
> > +{
> > +     if (cmd != USERFAULTFD_IOC_NEW)
> > +             return -EINVAL;
> > +
> > +     return new_userfaultfd(flags);
> > +}
> > +
> > +static const struct file_operations userfaultfd_dev_fops = {
> > +     .open = userfaultfd_dev_open,
> > +     .unlocked_ioctl = userfaultfd_dev_ioctl,
> > +     .compat_ioctl = userfaultfd_dev_ioctl,
>
> Why do you need to set compat_ioctl?  Shouldn't it just default to the
> existing one?

I took some more time looking at this today, and I think it actually
has to be the way it is.

I didn't find anywhere that notices compat_ioctl is unset and falls
back to the "normal" one (e.g. see the compat ioctl syscall definition
in fs/ioctl.c). It looks to me like it really does need some value. It's
common to use compat_ptr_ioctl for this, but since we're interpreting
the arg as a scalar not as a pointer, doing that here would be
incorrect.

It looks like there are other existing examples that do it the same
way, e.g. seccomp_notify_ops in kernel/seccomp.c.
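
To make the scalar-vs-pointer distinction concrete, here is a small
userspace model (not kernel code). It assumes that compat_ptr_ioctl()
passes the argument through compat_ptr() before calling the native
handler, and that on s390 compat_ptr() clears the top bit because
compat addresses there are 31 bits wide. Both function names below are
hypothetical stand-ins for illustration.

```c
#include <stdint.h>

/* Model of reusing the native handler for .compat_ioctl: the 32-bit
 * argument is forwarded unchanged as a scalar. */
static unsigned long model_scalar_arg(uint32_t arg)
{
	return (unsigned long)arg;
}

/* Model of the pointer conversion on s390 (assumption): the argument
 * is treated as a 31-bit address, so bit 31 is masked off. */
static unsigned long model_compat_ptr_s390(uint32_t arg)
{
	return (unsigned long)(arg & 0x7fffffffU);
}
```

For the flag values accepted today the two agree, since none of them
use bit 31. The point is only that a flags word is not an address, so
routing it through a pointer conversion would be the wrong
interpretation.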

>
> And why is this a device node at all?  Shouldn't the syscall handle all
> of this (to be honest, I didn't read anything but the misc code, sorry.)
>
> thanks,
>
> greg k-h

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 1c44bf75f916..698e768d5c3d 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -30,6 +30,7 @@ 
 #include <linux/security.h>
 #include <linux/hugetlb.h>
 #include <linux/swapops.h>
+#include <linux/miscdevice.h>
 
 int sysctl_unprivileged_userfaultfd __read_mostly;
 
@@ -415,13 +416,8 @@  vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	if (ctx->features & UFFD_FEATURE_SIGBUS)
 		goto out;
-	if ((vmf->flags & FAULT_FLAG_USER) == 0 &&
-	    ctx->flags & UFFD_USER_MODE_ONLY) {
-		printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd "
-			"sysctl knob to 1 if kernel faults must be handled "
-			"without obtaining CAP_SYS_PTRACE capability\n");
+	if (!(vmf->flags & FAULT_FLAG_USER) && (ctx->flags & UFFD_USER_MODE_ONLY))
 		goto out;
-	}
 
 	/*
 	 * If it's already released don't get it. This avoids to loop
@@ -2052,20 +2048,11 @@  static void init_once_userfaultfd_ctx(void *mem)
 	seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
 }
 
-SYSCALL_DEFINE1(userfaultfd, int, flags)
+static int new_userfaultfd(int flags)
 {
 	struct userfaultfd_ctx *ctx;
 	int fd;
 
-	if (!sysctl_unprivileged_userfaultfd &&
-	    (flags & UFFD_USER_MODE_ONLY) == 0 &&
-	    !capable(CAP_SYS_PTRACE)) {
-		printk_once(KERN_WARNING "uffd: Set unprivileged_userfaultfd "
-			"sysctl knob to 1 if kernel faults must be handled "
-			"without obtaining CAP_SYS_PTRACE capability\n");
-		return -EPERM;
-	}
-
 	BUG_ON(!current->mm);
 
 	/* Check the UFFD_* constants for consistency.  */
@@ -2098,8 +2085,62 @@  SYSCALL_DEFINE1(userfaultfd, int, flags)
 	return fd;
 }
 
+static inline bool userfaultfd_syscall_allowed(int flags)
+{
+	/* Userspace-only page faults are always allowed */
+	if (flags & UFFD_USER_MODE_ONLY)
+		return true;
+
+	/*
+	 * The user is requesting a userfaultfd which can handle kernel faults.
+	 * Privileged users are always allowed to do this.
+	 */
+	if (capable(CAP_SYS_PTRACE))
+		return true;
+
+	/* Otherwise, access to kernel fault handling is sysctl controlled. */
+	return sysctl_unprivileged_userfaultfd;
+}
+
+SYSCALL_DEFINE1(userfaultfd, int, flags)
+{
+	if (!userfaultfd_syscall_allowed(flags))
+		return -EPERM;
+
+	return new_userfaultfd(flags);
+}
+
+static int userfaultfd_dev_open(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static long userfaultfd_dev_ioctl(struct file *file, unsigned int cmd, unsigned long flags)
+{
+	if (cmd != USERFAULTFD_IOC_NEW)
+		return -EINVAL;
+
+	return new_userfaultfd(flags);
+}
+
+static const struct file_operations userfaultfd_dev_fops = {
+	.open = userfaultfd_dev_open,
+	.unlocked_ioctl = userfaultfd_dev_ioctl,
+	.compat_ioctl = userfaultfd_dev_ioctl,
+	.owner = THIS_MODULE,
+	.llseek = noop_llseek,
+};
+
+static struct miscdevice userfaultfd_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "userfaultfd",
+	.fops = &userfaultfd_dev_fops
+};
+
 static int __init userfaultfd_init(void)
 {
+	WARN_ON(misc_register(&userfaultfd_misc));
+
 	userfaultfd_ctx_cachep = kmem_cache_create("userfaultfd_ctx_cache",
 						sizeof(struct userfaultfd_ctx),
 						0,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 7d32b1e797fb..005e5e306266 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -12,6 +12,10 @@ 
 
 #include <linux/types.h>
 
+/* ioctls for /dev/userfaultfd */
+#define USERFAULTFD_IOC 0xAA
+#define USERFAULTFD_IOC_NEW _IO(USERFAULTFD_IOC, 0x00)
+
 /*
  * If the UFFDIO_API is upgraded someday, the UFFDIO_UNREGISTER and
  * UFFDIO_WAKE ioctls should be defined as _IOW and not as _IOR.  In