[v4,0/2] Control over userfaultfd kernel-fault handling

Message ID	20200924065606.3351177-1-lokeshgidra@google.com (mailing list archive)
Headers	show Return-Path: <SRS0=eAdI=DB=vger.kernel.org=linux-fsdevel-owner@kernel.org> Sender: "lokeshgidra via sendgmr" <lokeshgidra@lg.mtv.corp.google.com> Date: Wed, 23 Sep 2020 23:56:04 -0700 Message-Id: <20200924065606.3351177-1-lokeshgidra@google.com> Mime-Version: 1.0 Subject: [PATCH v4 0/2] Control over userfaultfd kernel-fault handling From: Lokesh Gidra <lokeshgidra@google.com> To: Kees Cook <keescook@chromium.org>, Jonathan Corbet <corbet@lwn.net>, Peter Xu <peterx@redhat.com>, Andrea Arcangeli <aarcange@redhat.com>, Sebastian Andrzej Siewior <bigeasy@linutronix.de>, Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Stephen Smalley <stephen.smalley.work@gmail.com>, Eric Biggers <ebiggers@kernel.org>, Lokesh Gidra <lokeshgidra@google.com>, Daniel Colascione <dancol@dancol.org>, "Joel Fernandes (Google)" <joel@joelfernandes.org>, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kaleshsingh@google.com, calin@google.com, surenb@google.com, nnk@google.com, jeffv@google.com, kernel-team@android.com, Mike Rapoport <rppt@linux.vnet.ibm.com>, Shaohua Li <shli@fb.com>, Jerome Glisse <jglisse@redhat.com>, Mauro Carvalho Chehab <mchehab+huawei@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>, Mel Gorman <mgorman@techsingularity.net>, Nitin Gupta <nigupta@nvidia.com>, Vlastimil Babka <vbabka@suse.cz>, Iurii Zaikin <yzaikin@google.com>, Luis Chamberlain <mcgrof@kernel.org> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	Control over userfaultfd kernel-fault handling \| expand [v4,0/2] Control over userfaultfd kernel-fault handling [v4,1/2] Add UFFD_USER_MODE_ONLY [v4,2/2] Add user-mode only option to unprivileged_userfaultfd sysctl knob

Lokesh Gidra Sept. 24, 2020, 6:56 a.m. UTC

This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and
movement can be controlled.

It has been demonstrated on various occasions that suspending kernel
code execution for an arbitrary amount of time at any access to
userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
to change the intended behavior of the kernel. For instance, handling
page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
Likewise, FUSE, which is similar to userfaultfd in this respect, has been
exploited in [4, 5] for similar outcome.

This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to
the unprivileged_userfaultfd sysctl knob to require unprivileged
callers to use this new flag.

The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to
enhance security vulnerabilities by lengthening the race window in
kernel code.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808

Changes since v3:

- Modified the meaning of value '0' of unprivileged_userfaultfd
sysctl knob. Setting this knob to '0' now allows unprivileged users
to use userfaultfd, but can handle page faults in user-mode only.
- The default value of unprivileged_userfaultfd sysctl knob is changed
to '0'.

Changes since v2:

- Removed 'uffd_flags' and directly used 'UFFD_USER_MODE_ONLY' in
userfaultfd().

Changes since v1:

- Added external references to the threats from allowing unprivileged
users to handle page faults from kernel-mode.
- Removed the new sysctl knob restricting handling of page
faults from kernel-mode, and added an option for the same
in the existing 'unprivileged_userfaultfd' knob.

Lokesh Gidra (2):
Add UFFD_USER_MODE_ONLY
Add user-mode only option to unprivileged_userfaultfd sysctl knob

Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++-----
fs/userfaultfd.c | 12 +++++++++---
include/uapi/linux/userfaultfd.h | 9 +++++++++
3 files changed, 28 insertions(+), 8 deletions(-)

Lokesh Gidra Oct. 7, 2020, 8:26 p.m. UTC | #1

On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> This patch series is split from [1]. The other series enables SELinux
> support for userfaultfd file descriptors so that its creation and
> movement can be controlled.
>
> It has been demonstrated on various occasions that suspending kernel
> code execution for an arbitrary amount of time at any access to
> userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> to change the intended behavior of the kernel. For instance, handling
> page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> exploited in [4, 5] for similar outcome.
>
> This small patch series adds a new flag to userfaultfd(2) that allows
> callers to give up the ability to handle kernel-mode faults with the
> resulting UFFD file object. It then adds a 'user-mode only' option to
> the unprivileged_userfaultfd sysctl knob to require unprivileged
> callers to use this new flag.
>
> The purpose of this new interface is to decrease the chance of an
> unprivileged userfaultfd user taking advantage of userfaultfd to
> enhance security vulnerabilities by lengthening the race window in
> kernel code.
>
> [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> [2] https://duasynt.com/blog/linux-kernel-heap-spray
> [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
> [4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
> [5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
>
> Changes since v3:
>
>   - Modified the meaning of value '0' of unprivileged_userfaultfd
>     sysctl knob. Setting this knob to '0' now allows unprivileged users
>     to use userfaultfd, but can handle page faults in user-mode only.
>   - The default value of unprivileged_userfaultfd sysctl knob is changed
>     to '0'.
>
Request reviewers and maintainers to please take a look.

> Changes since v2:
>
>   - Removed 'uffd_flags' and directly used 'UFFD_USER_MODE_ONLY' in
>     userfaultfd().
>
> Changes since v1:
>
>   - Added external references to the threats from allowing unprivileged
>     users to handle page faults from kernel-mode.
>   - Removed the new sysctl knob restricting handling of page
>     faults from kernel-mode, and added an option for the same
>     in the existing 'unprivileged_userfaultfd' knob.
>
> Lokesh Gidra (2):
>   Add UFFD_USER_MODE_ONLY
>   Add user-mode only option to unprivileged_userfaultfd sysctl knob
>
>  Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++-----
>  fs/userfaultfd.c                        | 12 +++++++++---
>  include/uapi/linux/userfaultfd.h        |  9 +++++++++
>  3 files changed, 28 insertions(+), 8 deletions(-)
>
> --
> 2.28.0.681.g6f77f65b4e-goog
>

Andrea Arcangeli Oct. 8, 2020, 4:01 a.m. UTC | #2

Hello Lokesh,

On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > This patch series is split from [1]. The other series enables SELinux
> > support for userfaultfd file descriptors so that its creation and
> > movement can be controlled.
> >
> > It has been demonstrated on various occasions that suspending kernel
> > code execution for an arbitrary amount of time at any access to
> > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > to change the intended behavior of the kernel. For instance, handling
> > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > exploited in [4, 5] for similar outcome.
> >
> > This small patch series adds a new flag to userfaultfd(2) that allows
> > callers to give up the ability to handle kernel-mode faults with the
> > resulting UFFD file object. It then adds a 'user-mode only' option to
> > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > callers to use this new flag.
> >
> > The purpose of this new interface is to decrease the chance of an
> > unprivileged userfaultfd user taking advantage of userfaultfd to
> > enhance security vulnerabilities by lengthening the race window in
> > kernel code.
> >
> > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit

I've looking at those links and I've been trying to verify the link
[3] is relevant.

Specifically I've been trying to verify if 1) current state of the art
modern SLUB randomization techniques already enabled in production and
rightfully wasting some CPU in all enterprise kernels to prevent
things like above to become an issue in practice 2) combined with the
fact different memcg need to share the same kmemcaches (which was
incidentally fixed a few months ago upstream) and 3) further
robustness enhancements against exploits in the slub metadata, may
already render the exploit [3] from 2016 irrelevant in practice.

So I started by trying to reproduce [3] by building 4.5.1 with a
.config with no robustness features and I booted it on fedora-32 or
gentoo userland and I cannot even invoke call_usermodehelper. Calling
socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
for call_usermodehelper to be invoked by the exploit? Could you help
to verify it?

It even has uninitialized variable spawning random perrors so it
doesn't give a warm fuzzy feeling:

====
int main(int argc, char **argv) {
	void *region, *map;
	              ^^^^^
	pthread_t uffd_thread;
	int uffd, msqid, i;

	region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
                               MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);

	if (!region) {
		perror("mmap");
		exit(2);
	}

	setup_pagefault(region + 0x1000, 0x1000, 1);

	printf("my pid = %d\n", getpid());

	if (!map) {
	^^^^^^^^
		perror("mmap");
====

The whole point of being able to reproduce on 4.5.1 is then to
simulate if the same exploit would also reproduce on current kernels
with all enterprise default robustness features enabled. Or did
anybody already verify it?

Anyway the links I was playing with are all in the cover letter, the
cover letter is not as important as the actual patches. The actual
patches looks fine to me.

The only improvement I can think of is, what about to add a
printk_once to suggest to toggle the sysctl if userfaultfd bails out
because the process lacks the CAP_SYS_PTRACE capability? That would
facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
aren't verbose enough.

It's not relevant anymore with this latest patchset, but about the
previous argument that seccomp couldn't be used in all Android
processes because of performance concern, I'm slightly confused.

https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html

"Android O includes a single seccomp filter installed into zygote, the
process from which all the Android applications are derived. Because
the filter is installed into zygote—and therefore all apps—the Android
security team took extra caution to not break existing apps"

Example:

$ uname -mo
aarch64 Android
$ cat swapoff.c
#include <sys/swap.h>

int main()
{
        swapoff("");
}
$ gcc swapoff.c -o swapoff -O2
$ ./swapoff
Bad system call
$

It's hard to imagine what is more performance critical than the zygote
process and the actual apps as above?

It's also hard to imagine what kind of performance concern can arise
by adding seccomp filters also to background system apps that
generally should consume ~0% of CPU.

If performance is really a concern, the BPF JIT representation with
the bitmap to be able to run the filter in O(1) sounds a better
solution than not adding ad-hoc filters and it's being worked on for
x86-64 and can be ported to aarch64 too. Many of the standalone
background processes likely wouldn't even use uffd at all so you could
block the user initiated faults too that way.

Ultimately because of issues as [3] (be them still relevant or not, to
be double checked), no matter if through selinux, seccomp or a
different sysctl value, without this patchset applied the default
behavior of the userfaultfd syscall for all Linux binaries running on
Android kernels, would deviate from the upstream kernel. So even if we
would make the pipe mutex logic more complex the deviation would
remain. Your patchset adds much less risk of breakage than adding a
timeout to kernel initiated userfaults and it resolves all concerns as
well as a timeout. We'll also make better use of the "0" value this
way. So while I'm not certain this is the best for the long term, this
looks the sweet spot for the short term to resolve many issues at
once.

Thanks!
Andrea

Nick Kralevich Oct. 8, 2020, 11:22 p.m. UTC | #3

On Wed, Oct 7, 2020 at 9:01 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello Lokesh,
>
> On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > >
> > > This patch series is split from [1]. The other series enables SELinux
> > > support for userfaultfd file descriptors so that its creation and
> > > movement can be controlled.
> > >
> > > It has been demonstrated on various occasions that suspending kernel
> > > code execution for an arbitrary amount of time at any access to
> > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > > to change the intended behavior of the kernel. For instance, handling
> > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > > exploited in [4, 5] for similar outcome.
> > >
> > > This small patch series adds a new flag to userfaultfd(2) that allows
> > > callers to give up the ability to handle kernel-mode faults with the
> > > resulting UFFD file object. It then adds a 'user-mode only' option to
> > > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > > callers to use this new flag.
> > >
> > > The purpose of this new interface is to decrease the chance of an
> > > unprivileged userfaultfd user taking advantage of userfaultfd to
> > > enhance security vulnerabilities by lengthening the race window in
> > > kernel code.
> > >
> > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> > > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
>
> I've looking at those links and I've been trying to verify the link
> [3] is relevant.
>
> Specifically I've been trying to verify if 1) current state of the art
> modern SLUB randomization techniques already enabled in production and
> rightfully wasting some CPU in all enterprise kernels to prevent
> things like above to become an issue in practice 2) combined with the
> fact different memcg need to share the same kmemcaches (which was
> incidentally fixed a few months ago upstream) and 3) further
> robustness enhancements against exploits in the slub metadata, may
> already render the exploit [3] from 2016 irrelevant in practice.

It's quite possible that some other mitigation was helpful against the
technique used by this particular exploit. It's the nature of exploits
that they are fragile and will change as new soft mitigations are
introduced. The effectiveness of a particular exploit mitigation
change is orthogonal to the change presented here.

The purpose of this change is to prevent an attacker from suspending
kernel code execution and having kernel data structures in a
predictable state. This makes it harder for an attacker to "win" race
conditions against various kernel data structures. This change
compliments other kernel hardening changes such as the changes you've
referenced above. Focusing on one particular exploit somewhat misses
the point of this change.

>
> So I started by trying to reproduce [3] by building 4.5.1 with a
> .config with no robustness features and I booted it on fedora-32 or
> gentoo userland and I cannot even invoke call_usermodehelper. Calling
> socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
> on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
> for call_usermodehelper to be invoked by the exploit? Could you help
> to verify it?

I haven't tried to verify this myself. I wonder if the usermode
hardening changes also impacted this exploit? See
https://lkml.org/lkml/2017/1/16/468

But again, focusing on an exploit, which is inherently fragile in
nature and dependent on the state of the kernel tree at a particular
time, is unlikely to be useful to analyze this patch.

>
> It even has uninitialized variable spawning random perrors so it
> doesn't give a warm fuzzy feeling:
>
> ====
> int main(int argc, char **argv) {
>         void *region, *map;
>                       ^^^^^
>         pthread_t uffd_thread;
>         int uffd, msqid, i;
>
>         region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
>                                MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);
>
>         if (!region) {
>                 perror("mmap");
>                 exit(2);
>         }
>
>         setup_pagefault(region + 0x1000, 0x1000, 1);
>
>         printf("my pid = %d\n", getpid());
>
>         if (!map) {
>         ^^^^^^^^
>                 perror("mmap");
> ====
>
> The whole point of being able to reproduce on 4.5.1 is then to
> simulate if the same exploit would also reproduce on current kernels
> with all enterprise default robustness features enabled. Or did
> anybody already verify it?
>
> Anyway the links I was playing with are all in the cover letter, the
> cover letter is not as important as the actual patches. The actual
> patches looks fine to me.

That's great to hear.

>
> The only improvement I can think of is, what about to add a
> printk_once to suggest to toggle the sysctl if userfaultfd bails out
> because the process lacks the CAP_SYS_PTRACE capability? That would
> facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
> aren't verbose enough.
>
> It's not relevant anymore with this latest patchset, but about the
> previous argument that seccomp couldn't be used in all Android
> processes because of performance concern, I'm slightly confused.

Seccomp causes more problems than just performance. Seccomp is not
designed for whole-of-system protections. Please see my other writeup
at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/

>
> https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html
>
> "Android O includes a single seccomp filter installed into zygote, the
> process from which all the Android applications are derived. Because
> the filter is installed into zygote—and therefore all apps—the Android
> security team took extra caution to not break existing apps"
>
> Example:
>
> $ uname -mo
> aarch64 Android
> $ cat swapoff.c
> #include <sys/swap.h>
>
> int main()
> {
>         swapoff("");
> }
> $ gcc swapoff.c -o swapoff -O2
> $ ./swapoff
> Bad system call
> $
>
> It's hard to imagine what is more performance critical than the zygote
> process and the actual apps as above?
>
> It's also hard to imagine what kind of performance concern can arise
> by adding seccomp filters also to background system apps that
> generally should consume ~0% of CPU.
>
> If performance is really a concern, the BPF JIT representation with
> the bitmap to be able to run the filter in O(1) sounds a better
> solution than not adding ad-hoc filters and it's being worked on for
> x86-64 and can be ported to aarch64 too. Many of the standalone
> background processes likely wouldn't even use uffd at all so you could
> block the user initiated faults too that way.
>
> Ultimately because of issues as [3] (be them still relevant or not, to
> be double checked), no matter if through selinux, seccomp or a
> different sysctl value, without this patchset applied the default
> behavior of the userfaultfd syscall for all Linux binaries running on
> Android kernels, would deviate from the upstream kernel. So even if we
> would make the pipe mutex logic more complex the deviation would
> remain. Your patchset adds much less risk of breakage than adding a
> timeout to kernel initiated userfaults and it resolves all concerns as
> well as a timeout. We'll also make better use of the "0" value this
> way. So while I'm not certain this is the best for the long term, this
> looks the sweet spot for the short term to resolve many issues at
> once.
>
> Thanks!
> Andrea
>

Lokesh Gidra Oct. 22, 2020, 8:38 p.m. UTC | #4

On Thu, Oct 8, 2020 at 4:22 PM Nick Kralevich <nnk@google.com> wrote:
>
> On Wed, Oct 7, 2020 at 9:01 PM Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Hello Lokesh,
> >
> > On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote:
> > > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
> > > >
> > > > This patch series is split from [1]. The other series enables SELinux
> > > > support for userfaultfd file descriptors so that its creation and
> > > > movement can be controlled.
> > > >
> > > > It has been demonstrated on various occasions that suspending kernel
> > > > code execution for an arbitrary amount of time at any access to
> > > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited
> > > > to change the intended behavior of the kernel. For instance, handling
> > > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3].
> > > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been
> > > > exploited in [4, 5] for similar outcome.
> > > >
> > > > This small patch series adds a new flag to userfaultfd(2) that allows
> > > > callers to give up the ability to handle kernel-mode faults with the
> > > > resulting UFFD file object. It then adds a 'user-mode only' option to
> > > > the unprivileged_userfaultfd sysctl knob to require unprivileged
> > > > callers to use this new flag.
> > > >
> > > > The purpose of this new interface is to decrease the chance of an
> > > > unprivileged userfaultfd user taking advantage of userfaultfd to
> > > > enhance security vulnerabilities by lengthening the race window in
> > > > kernel code.
> > > >
> > > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
> > > > [2] https://duasynt.com/blog/linux-kernel-heap-spray
> > > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
> >
> > I've looking at those links and I've been trying to verify the link
> > [3] is relevant.
> >
> > Specifically I've been trying to verify if 1) current state of the art
> > modern SLUB randomization techniques already enabled in production and
> > rightfully wasting some CPU in all enterprise kernels to prevent
> > things like above to become an issue in practice 2) combined with the
> > fact different memcg need to share the same kmemcaches (which was
> > incidentally fixed a few months ago upstream) and 3) further
> > robustness enhancements against exploits in the slub metadata, may
> > already render the exploit [3] from 2016 irrelevant in practice.
>
> It's quite possible that some other mitigation was helpful against the
> technique used by this particular exploit. It's the nature of exploits
> that they are fragile and will change as new soft mitigations are
> introduced. The effectiveness of a particular exploit mitigation
> change is orthogonal to the change presented here.
>
> The purpose of this change is to prevent an attacker from suspending
> kernel code execution and having kernel data structures in a
> predictable state. This makes it harder for an attacker to "win" race
> conditions against various kernel data structures. This change
> compliments other kernel hardening changes such as the changes you've
> referenced above. Focusing on one particular exploit somewhat misses
> the point of this change.
>
> >
> > So I started by trying to reproduce [3] by building 4.5.1 with a
> > .config with no robustness features and I booted it on fedora-32 or
> > gentoo userland and I cannot even invoke call_usermodehelper. Calling
> > socket(22, AF_INET, 0) won't invoke such function. Can you reproduce
> > on 4.5.1? Which kernel .config should I use to build 4.5.1 in order
> > for call_usermodehelper to be invoked by the exploit? Could you help
> > to verify it?
>
> I haven't tried to verify this myself. I wonder if the usermode
> hardening changes also impacted this exploit? See
> https://lkml.org/lkml/2017/1/16/468
>
> But again, focusing on an exploit, which is inherently fragile in
> nature and dependent on the state of the kernel tree at a particular
> time, is unlikely to be useful to analyze this patch.
>
> >
> > It even has uninitialized variable spawning random perrors so it
> > doesn't give a warm fuzzy feeling:
> >
> > ====
> > int main(int argc, char **argv) {
> >         void *region, *map;
> >                       ^^^^^
> >         pthread_t uffd_thread;
> >         int uffd, msqid, i;
> >
> >         region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE,
> >                                MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0);
> >
> >         if (!region) {
> >                 perror("mmap");
> >                 exit(2);
> >         }
> >
> >         setup_pagefault(region + 0x1000, 0x1000, 1);
> >
> >         printf("my pid = %d\n", getpid());
> >
> >         if (!map) {
> >         ^^^^^^^^
> >                 perror("mmap");
> > ====
> >
> > The whole point of being able to reproduce on 4.5.1 is then to
> > simulate if the same exploit would also reproduce on current kernels
> > with all enterprise default robustness features enabled. Or did
> > anybody already verify it?
> >
> > Anyway the links I was playing with are all in the cover letter, the
> > cover letter is not as important as the actual patches. The actual
> > patches looks fine to me.
>
> That's great to hear.
>
> >
> > The only improvement I can think of is, what about to add a
> > printk_once to suggest to toggle the sysctl if userfaultfd bails out
> > because the process lacks the CAP_SYS_PTRACE capability? That would
> > facilitate the /etc/sysctl.conf or tuned tweaking in case the apps
> > aren't verbose enough.
> >
> > It's not relevant anymore with this latest patchset, but about the
> > previous argument that seccomp couldn't be used in all Android
> > processes because of performance concern, I'm slightly confused.
>
> Seccomp causes more problems than just performance. Seccomp is not
> designed for whole-of-system protections. Please see my other writeup
> at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/
>
> >
> > https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html
> >
> > "Android O includes a single seccomp filter installed into zygote, the
> > process from which all the Android applications are derived. Because
> > the filter is installed into zygote—and therefore all apps—the Android
> > security team took extra caution to not break existing apps"
> >
> > Example:
> >
> > $ uname -mo
> > aarch64 Android
> > $ cat swapoff.c
> > #include <sys/swap.h>
> >
> > int main()
> > {
> >         swapoff("");
> > }
> > $ gcc swapoff.c -o swapoff -O2
> > $ ./swapoff
> > Bad system call
> > $
> >
> > It's hard to imagine what is more performance critical than the zygote
> > process and the actual apps as above?
> >
> > It's also hard to imagine what kind of performance concern can arise
> > by adding seccomp filters also to background system apps that
> > generally should consume ~0% of CPU.
> >
> > If performance is really a concern, the BPF JIT representation with
> > the bitmap to be able to run the filter in O(1) sounds a better
> > solution than not adding ad-hoc filters and it's being worked on for
> > x86-64 and can be ported to aarch64 too. Many of the standalone
> > background processes likely wouldn't even use uffd at all so you could
> > block the user initiated faults too that way.
> >
> > Ultimately because of issues as [3] (be them still relevant or not, to
> > be double checked), no matter if through selinux, seccomp or a
> > different sysctl value, without this patchset applied the default
> > behavior of the userfaultfd syscall for all Linux binaries running on
> > Android kernels, would deviate from the upstream kernel. So even if we
> > would make the pipe mutex logic more complex the deviation would
> > remain. Your patchset adds much less risk of breakage than adding a
> > timeout to kernel initiated userfaults and it resolves all concerns as
> > well as a timeout. We'll also make better use of the "0" value this
> > way. So while I'm not certain this is the best for the long term, this
> > looks the sweet spot for the short term to resolve many issues at
> > once.
> >
> > Thanks!
> > Andrea
> >
>
>
> --
> Nick Kralevich | nnk@google.com

Hi Andrea,

Did you get a chance to go through Nick's reply to your questions?
Also, I sent another revision of this patch series which takes care of
the printk that you suggested. Please take a look.

Andrea Arcangeli Oct. 24, 2020, 5:28 a.m. UTC | #5

Hello,

On Thu, Oct 08, 2020 at 04:22:36PM -0700, Nick Kralevich wrote:
> I haven't tried to verify this myself. I wonder if the usermode
> hardening changes also impacted this exploit? See
> https://lkml.org/lkml/2017/1/16/468

My plan was to:

1) reproduce with the old buggy kernel

2) forward port the bug to the very first version that had both the
   slub and page freelist randomization available and keep them
   disabled

3) enable the freelist randomization features (which are already
   enabled by default in the current enterprise kernels) and see if
   that makes the attack not workable

The hardening of the usermode helper you mentioned is spot on, but it
would have been something to worry about and possibly roll back at
point 2), but I couldn't get past point 1)..

Plenty other hardening techniques (just like the usermode helper) are
very specific to a single attack, but the randomization looks generic
enough to cover the entire class.

> But again, focusing on an exploit, which is inherently fragile in
> nature and dependent on the state of the kernel tree at a particular
> time, is unlikely to be useful to analyze this patch.

Agreed. A single exploit using userfaultfd to enlarge the race window
of the use-after-free, not being workable anymore with randomized slub
and page freelist enabled, wouldn't have meant a thing by itself.

As opposed if that single exploit was still fairly reproducible, it
would have been enough to consider the sysctl default to zero as
something providing a more tangible long term benefit. That would have
been good information to have too, if that's actually the case.

I was merely attempting to get a first data point.. overall it would
be nice to initiate some research to verify the exact statistical
effects that slub/page randomization has on those use-after-free race
conditions that can be enlarged by blocking kernel faults, given we're
already paying the price for it. I don't think anybody has a sure
answer at this point, if we can entirely rely on those features or not.

> Seccomp causes more problems than just performance. Seccomp is not
> designed for whole-of-system protections. Please see my other writeup
> at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/

Whole-of-system protection I guess helps primarily because it requires
no change to userland I guess.

An example of a task not running as root (and without ptrace
capability) that could use more seccomp blocking:

# cat /proc/1517/cmdline ; echo ; grep CapEff /proc/1517/status; grep Seccomp /proc/1517/status
/vendor/bin/hw/qcrild
CapEff: 0000001000003000
Seccomp:        0

My view is that if the various binaries started by init.rc are run
without a strict seccomp filter there would be more things to worry
about, than kernel initiated userfaults for those.

Still the solution in the v5 patchset looks the safest for all until
we'll be able to tell if the slub/page randomizaton (or any other
generic enough robustness feature) is already effective against an
enlarged race window of kernel initiated userfaults and at the same
time it provides the main benefit of avoiding divergence in the
behavior of the userfaultfd syscall if invoked within the Android
userland.

Thanks,
Andrea

[v4,0/2] Control over userfaultfd kernel-fault handling

Message

Comments