[v5,2/2] Add user-mode only option to unprivileged_userfaultfd sysctl knob

Message ID 20201011062456.4065576-3-lokeshgidra@google.com (mailing list archive)
State New, archived
Series Control over userfaultfd kernel-fault handling

Commit Message

Lokesh Gidra Oct. 11, 2020, 6:24 a.m. UTC
With this change, when the knob is set to 0, unprivileged users can
still call userfaultfd, as when it is set to 1, but with the
restriction that only page faults from user mode can be handled.
In this mode, an unprivileged user (without the CAP_SYS_PTRACE capability)
must pass UFFD_USER_MODE_ONLY to userfaultfd or the call will fail with
EPERM.

This enables administrators to reduce the likelihood that
an attacker with access to userfaultfd can delay faulting kernel
code to widen timing windows for other exploits.

The default value of this knob is changed to 0. This is required for
correct functioning of pipe mutex. However, it will cause postcopy
live migration to fail, although the failure will be unnoticeable to
the VM guests. To avoid this, set 'vm.unprivileged_userfaultfd = 1' in
/etc/sysctl.conf. For more details, refer to Andrea's reply [1].

[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/

Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
---
 Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++-----
 fs/userfaultfd.c                        |  6 ++++--
 2 files changed, 14 insertions(+), 7 deletions(-)
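
For illustration only, the user-space side of the behavior described in the
commit message could look like the sketch below. The helper name
open_uffd_user_mode_only() is hypothetical, and the fallback definition of
UFFD_USER_MODE_ONLY assumes the value 1 from the uapi header that patch 1/2 of
this series introduces; prefer the definition in <linux/userfaultfd.h> when
your headers provide it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: value taken from the uapi header added by patch 1/2.
 * Defined here only so the sketch builds against older headers. */
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

/*
 * Hypothetical helper: request a userfaultfd that only handles page
 * faults raised from user mode. With the knob set to 0, this is the
 * form an unprivileged caller (no CAP_SYS_PTRACE) is expected to use.
 * Returns a file descriptor, or -1 with errno set on failure.
 */
static int open_uffd_user_mode_only(void)
{
#ifdef SYS_userfaultfd
	return (int)syscall(SYS_userfaultfd,
			    O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
#else
	/* The build headers do not know the syscall number. */
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would check the result and inspect errno on failure: per the commit
message, EPERM is what an unprivileged caller gets when the flag is omitted
and the knob is 0, while kernels that predate the flag are likely to reject
it with EINVAL.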

Comments

Andrea Arcangeli Oct. 24, 2020, 2:48 a.m. UTC | #1
Hello everyone,

On Sat, Oct 10, 2020 at 11:24:56PM -0700, Lokesh Gidra wrote:
> With this change, when the knob is set to 0, unprivileged users can
> still call userfaultfd, as when it is set to 1, but with the
> restriction that only page faults from user mode can be handled.
> In this mode, an unprivileged user (without the CAP_SYS_PTRACE capability)
> must pass UFFD_USER_MODE_ONLY to userfaultfd or the call will fail with
> EPERM.
> 
> This enables administrators to reduce the likelihood that
> an attacker with access to userfaultfd can delay faulting kernel
> code to widen timing windows for other exploits.
> 
> The default value of this knob is changed to 0. This is required for
> correct functioning of pipe mutex. However, it will cause postcopy
> live migration to fail, although the failure will be unnoticeable to
> the VM guests. To avoid this, set 'vm.unprivileged_userfaultfd = 1' in
> /etc/sysctl.conf. For more details, refer to Andrea's reply [1].
> 
> [1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/
> 
> Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>

Nobody commented so it seems everyone is on board with this change to
synchronize the kernel default with the post-boot Android default.

The email in the link above was pretty long, so the below would be a
summary that could be added to the commit header:

==

The main reason this change is desirable in the short term is that
the Android userland will behave as with the sysctl set to zero. So
without this commit, any Linux binary using userfaultfd to manage its
memory would behave differently if run within the Android userland.

==

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>


BTW, this is still a minor nitpick, but a printk_once like the one in
1/2 could be added before the return -EPERM too; that's actually what I
meant when I suggested adding a printk_once :). However, the printk_once
you added can turn out to be useful too for devs converting code to use
bounce buffers, so it's fine too. It could just go under DEBUG_VM and
be ratelimited (similarly to the "FAULT_FLAG_ALLOW_RETRY missing
%x\n" printk).

Thanks,
Andrea
Lokesh Gidra Oct. 24, 2020, 4:08 a.m. UTC | #2
On Fri, Oct 23, 2020 at 7:48 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello everyone,
>
> On Sat, Oct 10, 2020 at 11:24:56PM -0700, Lokesh Gidra wrote:
> > With this change, when the knob is set to 0, unprivileged users can
> > still call userfaultfd, as when it is set to 1, but with the
> > restriction that only page faults from user mode can be handled.
> > In this mode, an unprivileged user (without the CAP_SYS_PTRACE capability)
> > must pass UFFD_USER_MODE_ONLY to userfaultfd or the call will fail with
> > EPERM.
> >
> > This enables administrators to reduce the likelihood that
> > an attacker with access to userfaultfd can delay faulting kernel
> > code to widen timing windows for other exploits.
> >
> > The default value of this knob is changed to 0. This is required for
> > correct functioning of pipe mutex. However, it will cause postcopy
> > live migration to fail, although the failure will be unnoticeable to
> > the VM guests. To avoid this, set 'vm.unprivileged_userfaultfd = 1' in
> > /etc/sysctl.conf. For more details, refer to Andrea's reply [1].
> >
> > [1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/
> >
> > Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
>
> Nobody commented so it seems everyone is on board with this change to
> synchronize the kernel default with the post-boot Android default.
>
> The email in the link above was pretty long, so the below would be a
> summary that could be added to the commit header:
>
> ==
>
> The main reason this change is desirable in the short term is that
> the Android userland will behave as with the sysctl set to zero. So
> without this commit, any Linux binary using userfaultfd to manage its
> memory would behave differently if run within the Android userland.
>
> ==

Sure. I'll add it in the next revision.
>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
>
Thanks so much for the review. I hope it's ok to add your
'reviewed-by' in the next revision?
>
> BTW, this is still a minor nitpick, but a printk_once like the one in
> 1/2 could be added before the return -EPERM too; that's actually what I
> meant when I suggested adding a printk_once :). However, the printk_once
> you added can turn out to be useful too for devs converting code to use
> bounce buffers, so it's fine too. It could just go under DEBUG_VM and
> be ratelimited (similarly to the "FAULT_FLAG_ALLOW_RETRY missing
> %x\n" printk).

I'll move the printk_once from 1/2 to this patch, as you suggested.
>
> Thanks,
> Andrea
>
Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4b9d2e8e9142..4263d38c3c21 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -871,12 +871,17 @@  file-backed pages is less than the high watermark in a zone.
 unprivileged_userfaultfd
 ========================
 
-This flag controls whether unprivileged users can use the userfaultfd
-system calls.  Set this to 1 to allow unprivileged users to use the
-userfaultfd system calls, or set this to 0 to restrict userfaultfd to only
-privileged users (with SYS_CAP_PTRACE capability).
+This flag controls the mode in which unprivileged users can use the
+userfaultfd system calls. Set this to 0 to restrict unprivileged users
+to handle page faults in user mode only. In this case, users without
+CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
+succeed. Prohibiting use of userfaultfd for handling faults from kernel
+mode may make certain vulnerabilities more difficult to exploit.
 
-The default value is 1.
+Set this to 1 to allow unprivileged users to use the userfaultfd system
+calls without any restrictions.
+
+The default value is 0.
 
 
 user_reserve_kbytes
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index bd229f06d4e9..0f8a975db3be 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -28,7 +28,7 @@ 
 #include <linux/security.h>
 #include <linux/hugetlb.h>
 
-int sysctl_unprivileged_userfaultfd __read_mostly = 1;
+int sysctl_unprivileged_userfaultfd __read_mostly;
 
 static struct kmem_cache *userfaultfd_ctx_cachep __read_mostly;
 
@@ -1976,7 +1976,9 @@  SYSCALL_DEFINE1(userfaultfd, int, flags)
 	struct userfaultfd_ctx *ctx;
 	int fd;
 
-	if (!sysctl_unprivileged_userfaultfd && !capable(CAP_SYS_PTRACE))
+	if (!sysctl_unprivileged_userfaultfd &&
+	    (flags & UFFD_USER_MODE_ONLY) == 0 &&
+	    !capable(CAP_SYS_PTRACE))
 		return -EPERM;
 
 	BUG_ON(!current->mm);
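
As a reading aid, the permission check in the hunk above can be modeled as a
pure function in user space. This is only a sketch: uffd_check_passes() and
its has_cap_sys_ptrace parameter are hypothetical stand-ins for the in-kernel
capable(CAP_SYS_PTRACE) call, and UFFD_USER_MODE_ONLY is assumed to be 1 as in
the uapi header from patch 1/2. It covers this one check only, not any
validation the syscall performs afterwards.

```c
#include <stdbool.h>

/* Assumption: value from the uapi header added by patch 1/2. */
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

/*
 * Model of the check at the top of the userfaultfd syscall after this
 * patch. Returns false where the kernel would return -EPERM, true where
 * execution continues past the check.
 */
static bool uffd_check_passes(int sysctl_unprivileged_userfaultfd,
			      int flags, bool has_cap_sys_ptrace)
{
	if (!sysctl_unprivileged_userfaultfd &&
	    (flags & UFFD_USER_MODE_ONLY) == 0 &&
	    !has_cap_sys_ptrace)
		return false;

	return true;
}
```

Read as a table: with the knob at 1 every caller passes; with the knob at 0 a
caller passes if it either holds CAP_SYS_PTRACE or sets UFFD_USER_MODE_ONLY,
and is refused only when it does neither.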