mbox series

[v2,0/6] Harden userfaultfd

Message ID 20200211225547.235083-1-dancol@google.com (mailing list archive)
Headers show
Series Harden userfaultfd | expand

Message

Daniel Colascione Feb. 11, 2020, 10:55 p.m. UTC
Userfaultfd in unprivileged contexts could be potentially very
useful. We'd like to harden userfaultfd to make such unprivileged use
less risky. This patch series allows SELinux to manage userfaultfd
file descriptors and allows administrators to limit userfaultfd to
servicing user-mode faults, increasing the difficulty of using
userfaultfd in exploit chains invoking delaying kernel faults.

A new anon_inodes interface allows callers to opt into SELinux
management of anonymous file objects. In this mode, anon_inodes
creates new ephemeral inodes for anonymous file objects instead of
reusing a singleton dummy inode. A new LSM hook gives security modules
an opportunity to configure and veto these ephemeral inodes.

Existing anon_inodes users must opt into the new functionality.

Daniel Colascione (6):
  Add a new flags-accepting interface for anonymous inodes
  Add a concept of a "secure" anonymous file
  Teach SELinux about a new userfaultfd class
  Wire UFFD up to SELinux
  Let userfaultfd opt out of handling kernel-mode faults
  Add a new sysctl for limiting userfaultfd to user mode faults

 Documentation/admin-guide/sysctl/vm.rst | 13 ++++
 fs/anon_inodes.c                        | 89 +++++++++++++++++--------
 fs/userfaultfd.c                        | 29 ++++++--
 include/linux/anon_inodes.h             | 27 ++++++--
 include/linux/lsm_hooks.h               |  8 +++
 include/linux/security.h                |  2 +
 include/linux/userfaultfd_k.h           |  3 +
 include/uapi/linux/userfaultfd.h        |  9 +++
 kernel/sysctl.c                         |  9 +++
 security/security.c                     |  8 +++
 security/selinux/hooks.c                | 68 +++++++++++++++++++
 security/selinux/include/classmap.h     |  2 +
 12 files changed, 229 insertions(+), 38 deletions(-)

Comments

Casey Schaufler Feb. 11, 2020, 11:13 p.m. UTC | #1
On 2/11/2020 2:55 PM, Daniel Colascione wrote:
> Userfaultfd in unprivileged contexts could be potentially very
> useful. We'd like to harden userfaultfd to make such unprivileged use
> less risky. This patch series allows SELinux to manage userfaultfd
> file descriptors and allows administrators to limit userfaultfd to
> servicing user-mode faults, increasing the difficulty of using
> userfaultfd in exploit chains invoking delaying kernel faults.
>
> A new anon_inodes interface allows callers to opt into SELinux
> management of anonymous file objects. In this mode, anon_inodes
> creates new ephemeral inodes for anonymous file objects instead of
> reusing a singleton dummy inode. A new LSM hook gives security modules
> an opportunity to configure and veto these ephemeral inodes.
>
> Existing anon_inodes users must opt into the new functionality.
>
> Daniel Colascione (6):
>   Add a new flags-accepting interface for anonymous inodes
>   Add a concept of a "secure" anonymous file
>   Teach SELinux about a new userfaultfd class
>   Wire UFFD up to SELinux
>   Let userfaultfd opt out of handling kernel-mode faults
>   Add a new sysctl for limiting userfaultfd to user mode faults

This must be posted to the linux Security Module list
<linux-security-module@vger.kernel.org>

>
>  Documentation/admin-guide/sysctl/vm.rst | 13 ++++
>  fs/anon_inodes.c                        | 89 +++++++++++++++++--------
>  fs/userfaultfd.c                        | 29 ++++++--
>  include/linux/anon_inodes.h             | 27 ++++++--
>  include/linux/lsm_hooks.h               |  8 +++
>  include/linux/security.h                |  2 +
>  include/linux/userfaultfd_k.h           |  3 +
>  include/uapi/linux/userfaultfd.h        |  9 +++
>  kernel/sysctl.c                         |  9 +++
>  security/security.c                     |  8 +++
>  security/selinux/hooks.c                | 68 +++++++++++++++++++
>  security/selinux/include/classmap.h     |  2 +
>  12 files changed, 229 insertions(+), 38 deletions(-)
>
Daniel Colascione Feb. 11, 2020, 11:27 p.m. UTC | #2
On Tue, Feb 11, 2020 at 3:13 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 2/11/2020 2:55 PM, Daniel Colascione wrote:
> > Userfaultfd in unprivileged contexts could be potentially very
> > useful. We'd like to harden userfaultfd to make such unprivileged use
> > less risky. This patch series allows SELinux to manage userfaultfd
> > file descriptors and allows administrators to limit userfaultfd to
> > servicing user-mode faults, increasing the difficulty of using
> > userfaultfd in exploit chains invoking delaying kernel faults.
> >
> > A new anon_inodes interface allows callers to opt into SELinux
> > management of anonymous file objects. In this mode, anon_inodes
> > creates new ephemeral inodes for anonymous file objects instead of
> > reusing a singleton dummy inode. A new LSM hook gives security modules
> > an opportunity to configure and veto these ephemeral inodes.
> >
> > Existing anon_inodes users must opt into the new functionality.
> >
> > Daniel Colascione (6):
> >   Add a new flags-accepting interface for anonymous inodes
> >   Add a concept of a "secure" anonymous file
> >   Teach SELinux about a new userfaultfd class
> >   Wire UFFD up to SELinux
> >   Let userfaultfd opt out of handling kernel-mode faults
> >   Add a new sysctl for limiting userfaultfd to user mode faults
>
> This must be posted to the linux Security Module list
> <linux-security-module@vger.kernel.org>

Added. I thought selinux@ was sufficient.
Kees Cook Feb. 12, 2020, 7:50 a.m. UTC | #3
Hi!

Firstly, thanks for working on this! It's been on my TODO list for a
while. :)

Casey already recommended including the LSM list to CC (since this is a
new LSM -- there are many LSMs). Additionally, the series should
probably be sent _to_ the userfaultfd maintainers:
	Andrea Arcangeli <aarcange@redhat.com>
	Mike Rapoport <rppt@linux.ibm.com>
and I'd also CC a couple other people that have done recent work:
	Peter Xu <peterx@redhat.com>
	Jann Horn <jannh@google.com>

More notes below...

On Tue, Feb 11, 2020 at 02:55:41PM -0800, Daniel Colascione wrote:
> Userfaultfd in unprivileged contexts could be potentially very
> useful. We'd like to harden userfaultfd to make such unprivileged use
> less risky. This patch series allows SELinux to manage userfaultfd
> file descriptors and allows administrators to limit userfaultfd to
> servicing user-mode faults, increasing the difficulty of using
> userfaultfd in exploit chains invoking delaying kernel faults.

I actually think these are two very different goals and likely the
series could be split into two for them. One is LSM hooking of
userfaultfd and the SELinux attachment, and the other is the user-mode
fault restrictions. And they would likely go via separate trees (LSM
through James's LSM tree, and probably akpm's -mm tree for the sysctl).

> A new anon_inodes interface allows callers to opt into SELinux
> management of anonymous file objects. In this mode, anon_inodes
> creates new ephemeral inodes for anonymous file objects instead of
> reusing a singleton dummy inode. A new LSM hook gives security modules
> an opportunity to configure and veto these ephemeral inodes.
> 
> Existing anon_inodes users must opt into the new functionality.
> 
> Daniel Colascione (6):
>   Add a new flags-accepting interface for anonymous inodes
>   Add a concept of a "secure" anonymous file
>   Teach SELinux about a new userfaultfd class
>   Wire UFFD up to SELinux

The above is the first "series"... I don't have much opinion about it,
though I do like the idea of making userfaultfd visible to the LSM.

>   Let userfaultfd opt out of handling kernel-mode faults
>   Add a new sysctl for limiting userfaultfd to user mode faults

Now this I'm very interested in. Can you go into more detail about two
things:

- What is the threat being solved? (I understand the threat, but detailing
  it in the commit log is important for people who don't know it. Existing
  commit cefdca0a86be517bc390fc4541e3674b8e7803b0 gets into some of the
  details already, but I'd like to see reference to external sources like
  https://duasynt.com/blog/linux-kernel-heap-spray)

- Why is this needed in addition to the existing vm.unprivileged_userfaultfd
  sysctl? (And should this maybe just be another setting for that
  sysctl, like "2"?)

As to the mechanics of the change, I'm not sure I like the idea of adding
a UAPI flag for this. Why not just retain the permission check done at
open() and if kernelmode faults aren't allowed, ignore them? This would
require no changes to existing programs and gains the desired defense.
(And, I think, the sysctl value could be bumped to "2" as that's a
better default state -- does qemu actually need kernelmode traps?)

Thanks again for the patches!

-Kees

> 
>  Documentation/admin-guide/sysctl/vm.rst | 13 ++++
>  fs/anon_inodes.c                        | 89 +++++++++++++++++--------
>  fs/userfaultfd.c                        | 29 ++++++--
>  include/linux/anon_inodes.h             | 27 ++++++--
>  include/linux/lsm_hooks.h               |  8 +++
>  include/linux/security.h                |  2 +
>  include/linux/userfaultfd_k.h           |  3 +
>  include/uapi/linux/userfaultfd.h        |  9 +++
>  kernel/sysctl.c                         |  9 +++
>  security/security.c                     |  8 +++
>  security/selinux/hooks.c                | 68 +++++++++++++++++++
>  security/selinux/include/classmap.h     |  2 +
>  12 files changed, 229 insertions(+), 38 deletions(-)
> 
> -- 
> 2.25.0.225.g125e21ebc7-goog
>
Stephen Smalley Feb. 12, 2020, 4:09 p.m. UTC | #4
On 2/11/20 6:27 PM, Daniel Colascione wrote:
> On Tue, Feb 11, 2020 at 3:13 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>> On 2/11/2020 2:55 PM, Daniel Colascione wrote:
>>> Userfaultfd in unprivileged contexts could be potentially very
>>> useful. We'd like to harden userfaultfd to make such unprivileged use
>>> less risky. This patch series allows SELinux to manage userfaultfd
>>> file descriptors and allows administrators to limit userfaultfd to
>>> servicing user-mode faults, increasing the difficulty of using
>>> userfaultfd in exploit chains invoking delaying kernel faults.
>>>
>>> A new anon_inodes interface allows callers to opt into SELinux
>>> management of anonymous file objects. In this mode, anon_inodes
>>> creates new ephemeral inodes for anonymous file objects instead of
>>> reusing a singleton dummy inode. A new LSM hook gives security modules
>>> an opportunity to configure and veto these ephemeral inodes.
>>>
>>> Existing anon_inodes users must opt into the new functionality.
>>>
>>> Daniel Colascione (6):
>>>    Add a new flags-accepting interface for anonymous inodes
>>>    Add a concept of a "secure" anonymous file
>>>    Teach SELinux about a new userfaultfd class
>>>    Wire UFFD up to SELinux
>>>    Let userfaultfd opt out of handling kernel-mode faults
>>>    Add a new sysctl for limiting userfaultfd to user mode faults
>>
>> This must be posted to the linux Security Module list
>> <linux-security-module@vger.kernel.org>
> 
> Added. I thought selinux@ was sufficient.

scripts/get_maintainer.pl can be helpful in identifying relevant lists 
and maintainers for each patch.  I don't use its output blindly as it 
tends to over-approximate but since your patches span the VFS, LSM 
framework, and selinux, you do need to include relevant 
maintainers/lists for each.
Jann Horn Feb. 12, 2020, 4:54 p.m. UTC | #5
On Wed, Feb 12, 2020 at 8:51 AM Kees Cook <keescook@chromium.org> wrote:
> On Tue, Feb 11, 2020 at 02:55:41PM -0800, Daniel Colascione wrote:
> >   Let userfaultfd opt out of handling kernel-mode faults
> >   Add a new sysctl for limiting userfaultfd to user mode faults
>
> Now this I'm very interested in. Can you go into more detail about two
> things:
[...]
> - Why is this needed in addition to the existing vm.unprivileged_userfaultfd
>   sysctl? (And should this maybe just be another setting for that
>   sysctl, like "2"?)
>
> As to the mechanics of the change, I'm not sure I like the idea of adding
> a UAPI flag for this. Why not just retain the permission check done at
> open() and if kernelmode faults aren't allowed, ignore them? This would
> require no changes to existing programs and gains the desired defense.
> (And, I think, the sysctl value could be bumped to "2" as that's a
> better default state -- does qemu actually need kernelmode traps?)

I think this might be necessary for I/O emulation? As in, if before
getting migrated, the guest writes some data into a buffer, then the
guest gets migrated, and then while the postcopy migration stuff is
still running, the guest tells QEMU to write that data from
guest-physical memory to disk or whatever; I think in that case, QEMU
will do something like a pwrite() syscall where the userspace pointer
points into the memory area containing guest-physical memory, which
would return -EFAULT if userfaultfd was restricted to userspace
accesses.

This was described in this old presentation about why userfaultfd is
better than a SIGSEGV handler:
https://drive.google.com/file/d/0BzyAwvVlQckeSzlCSDFmRHVybzQ/view
(slide 6) (recording at https://youtu.be/pC8cWWRVSPw?t=463)
Daniel Colascione Feb. 12, 2020, 5:12 p.m. UTC | #6
Thanks for taking a look and for the fast reply!

On Tue, Feb 11, 2020 at 11:51 PM Kees Cook <keescook@chromium.org> wrote:
>
> Hi!
>
> Firstly, thanks for working on this! It's been on my TODO list for a
> while. :)
>
> Casey already recommended including the LSM list to CC (since this is a
> new LSM -- there are many LSMs). Additionally, the series should
> probably be sent _to_ the userfaultfd maintainers:
>         Andrea Arcangeli <aarcange@redhat.com>
>         Mike Rapoport <rppt@linux.ibm.com>
> and I'd also CC a couple other people that have done recent work:
>         Peter Xu <peterx@redhat.com>
>         Jann Horn <jannh@google.com>
>
> More notes below...

In general, in the event that a patch series doesn't include all
needed parties on the to-line, what's the right way to fix the
situation without spamming everyone and forking the thread? In this
case, since I'm splitting the patch series anyway, I can just expand
the to-line in the reroll.

> On Tue, Feb 11, 2020 at 02:55:41PM -0800, Daniel Colascione wrote:
> > Userfaultfd in unprivileged contexts could be potentially very
> > useful. We'd like to harden userfaultfd to make such unprivileged use
> > less risky. This patch series allows SELinux to manage userfaultfd
> > file descriptors and allows administrators to limit userfaultfd to
> > servicing user-mode faults, increasing the difficulty of using
> > userfaultfd in exploit chains invoking delaying kernel faults.
>
> I actually think these are two very different goals and likely the
> series could be split into two for them. One is LSM hooking of
> userfaultfd and the SELinux attachment, and the other is the user-mode
> fault restrictions. And they would likely go via separate trees (LSM
> through James's LSM tree, and probably akpm's -mm tree for the sysctl).
>
> > A new anon_inodes interface allows callers to opt into SELinux
> > management of anonymous file objects. In this mode, anon_inodes
> > creates new ephemeral inodes for anonymous file objects instead of
> > reusing a singleton dummy inode. A new LSM hook gives security modules
> > an opportunity to configure and veto these ephemeral inodes.
> >
> > Existing anon_inodes users must opt into the new functionality.
> >
> > Daniel Colascione (6):
> >   Add a new flags-accepting interface for anonymous inodes
> >   Add a concept of a "secure" anonymous file
> >   Teach SELinux about a new userfaultfd class
> >   Wire UFFD up to SELinux
>
> The above is the first "series"... I don't have much opinion about it,
> though I do like the idea of making userfaultfd visible to the LSM.

Yeah. The interesting part there is the anon_inodes API change. I'll
split that half of the series out.

> >   Let userfaultfd opt out of handling kernel-mode faults
> >   Add a new sysctl for limiting userfaultfd to user mode faults
>
> Now this I'm very interested in. Can you go into more detail about two
> things:
>
> - What is the threat being solved? (I understand the threat, but detailing
>   it in the commit log is important for people who don't know it. Existing
>   commit cefdca0a86be517bc390fc4541e3674b8e7803b0 gets into some of the
>   details already, but I'd like to see reference to external sources like
>   https://duasynt.com/blog/linux-kernel-heap-spray)

Sure. I can add a reference to that and a more general discussion of
how delaying kernel fault handling can broaden race windows for other
exploits.

> - Why is this needed in addition to the existing vm.unprivileged_userfaultfd
>   sysctl? (And should this maybe just be another setting for that
>   sysctl, like "2"?)

We want to use UFFD for a new garbage collector in Android, and we
want unprivileged processes to be able to use this new garbage
collector. Giving them CAP_SYS_PTRACE is dangerous.

> As to the mechanics of the change, I'm not sure I like the idea of adding
> a UAPI flag for this. Why not just retain the permission check done at
> open() and if kernelmode faults aren't allowed, ignore them? This would
> require no changes to existing programs and gains the desired defense.

As Jann mentions below, it's a matter of the kernel's contractual
obligation to userspace. Right now, userfaultfd(2) either succeeds or
it fails with one of the enumerated error codes. You're proposing
having the userfaultfd(2) succeed but return a file descriptor that
doesn't do the job the kernel promised it would do. If we were to
adopt your proposal, an application would see UFFD succeed, then see
unexpeced EFAULTs from system calls, which would probably cause the
application to malfunctioning in exciting ways. An explicit "sorry, I
can't do that" error code is better: an application that gets an error
code from userfaultfd(2) can fall back to something else, but an
application that silently gets an underfeatured UFFD doesn't know
anything is wrong until it's too late. The additional flag preserves
the UFFD contract and gives applications a way to probe for feature
support.

> (And, I think, the sysctl value could be bumped to "2" as that's a
> better default state -- does qemu actually need kernelmode traps?)

I prefer a default of one for regular systems because I don't want to
make experimentation with novel ways to use UFFD more difficult, and a
default of two would require users go out of their way to handle user
faults, and few will. For a more constrained system like Android, two
is fine.
Peter Xu Feb. 12, 2020, 5:14 p.m. UTC | #7
On Wed, Feb 12, 2020 at 05:54:35PM +0100, Jann Horn wrote:
> On Wed, Feb 12, 2020 at 8:51 AM Kees Cook <keescook@chromium.org> wrote:
> > On Tue, Feb 11, 2020 at 02:55:41PM -0800, Daniel Colascione wrote:
> > >   Let userfaultfd opt out of handling kernel-mode faults
> > >   Add a new sysctl for limiting userfaultfd to user mode faults
> >
> > Now this I'm very interested in. Can you go into more detail about two
> > things:
> [...]
> > - Why is this needed in addition to the existing vm.unprivileged_userfaultfd
> >   sysctl? (And should this maybe just be another setting for that
> >   sysctl, like "2"?)
> >
> > As to the mechanics of the change, I'm not sure I like the idea of adding
> > a UAPI flag for this. Why not just retain the permission check done at
> > open() and if kernelmode faults aren't allowed, ignore them? This would
> > require no changes to existing programs and gains the desired defense.
> > (And, I think, the sysctl value could be bumped to "2" as that's a
> > better default state -- does qemu actually need kernelmode traps?)
> 
> I think this might be necessary for I/O emulation? As in, if before
> getting migrated, the guest writes some data into a buffer, then the
> guest gets migrated, and then while the postcopy migration stuff is
> still running, the guest tells QEMU to write that data from
> guest-physical memory to disk or whatever; I think in that case, QEMU
> will do something like a pwrite() syscall where the userspace pointer
> points into the memory area containing guest-physical memory, which
> would return -EFAULT if userfaultfd was restricted to userspace
> accesses.
> 
> This was described in this old presentation about why userfaultfd is
> better than a SIGSEGV handler:
> https://drive.google.com/file/d/0BzyAwvVlQckeSzlCSDFmRHVybzQ/view
> (slide 6) (recording at https://youtu.be/pC8cWWRVSPw?t=463)

Right. AFAICT QEMU uses it far more than disk IOs.  A guest page can
be accessed by any kernel component on the destination host during a
postcopy procedure.  It can be as simple as when a vcpu writes to a
missing guest page which still resides on the source host, then KVM
will get a page fault and trap into userfaultfd asking for that page.
The same thing happens to other modules like vhost, etc., as long as a
missing guest page is touched by a kernel module.

Thanks,
Andrea Arcangeli Feb. 12, 2020, 7:41 p.m. UTC | #8
Hello everyone,

On Wed, Feb 12, 2020 at 12:14:16PM -0500, Peter Xu wrote:
> Right. AFAICT QEMU uses it far more than disk IOs.  A guest page can
> be accessed by any kernel component on the destination host during a
> postcopy procedure.  It can be as simple as when a vcpu writes to a
> missing guest page which still resides on the source host, then KVM
> will get a page fault and trap into userfaultfd asking for that page.
> The same thing happens to other modules like vhost, etc., as long as a
> missing guest page is touched by a kernel module.

Correct.

How does the android garbage collection work to make sure there cannot
be kernel faults on the missing memory?

If I understood correctly (I didn't have much time to review sorry)
what's proposed with regard to limiting uffd events from kernel
faults, the only use case I know that could deal with it is the
UFFD_FEATURE_SIGBUS but that's not normal userfaultfd: that's also the
only feature required from uffd to implement a pure malloc library in
userland that never takes the mmap sem for writing to implement
userland mremap/mmap/munmap lib calls (as those will convert to
UFFDIO_ZEROPAGE and MADV_DONTNEED internally to the lib and there will
be always a single vma). We just need to extend UFFDIO_ZEROPAGE to map
the THP zeropage to make this future pure-uffd malloc lib perform
better.

On the other end I'm also planning a mremap_vma_merge userland syscall
that will merge fragmented vmas.

Currently once you have a nice heap all contiguous but with small
objects and you free the fragments you can't build THP anymore even if
you make the memory virtually contiguous again by calling mremap. That
just build up a ton of vmas slowing down the app forever and also
preventing THP collapsing ever again.

mremap_vma_merge will require no new kernel feature, but it
fundamentally must be able to handle kernel faults. If databases
starts to use that, how can you enable this feature without breaking
random apps then?

So it'd be a feature usable only by one user (Android) perhaps? And
only until you start defragging the vmas of small objects?

Thanks,
Andrea
Daniel Colascione Feb. 12, 2020, 8:04 p.m. UTC | #9
On Wed, Feb 12, 2020 at 11:41 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Hello everyone,
>
> On Wed, Feb 12, 2020 at 12:14:16PM -0500, Peter Xu wrote:
> > Right. AFAICT QEMU uses it far more than disk IOs.  A guest page can
> > be accessed by any kernel component on the destination host during a
> > postcopy procedure.  It can be as simple as when a vcpu writes to a
> > missing guest page which still resides on the source host, then KVM
> > will get a page fault and trap into userfaultfd asking for that page.
> > The same thing happens to other modules like vhost, etc., as long as a
> > missing guest page is touched by a kernel module.
>
> Correct.
>
> How does the android garbage collection work to make sure there cannot
> be kernel faults on the missing memory?

We don't pass pointers to the heap into system calls. (Big primitive
arrays, ByteBuffer, etc. are allocated off the regular heap.)

> If I understood correctly (I didn't have much time to review sorry)
> what's proposed with regard to limiting uffd events from kernel
> faults,

I don't understand what you mean. The purpose of preventing UFFD from
handling kernel faults is exploit mitigation.

> the only use case I know that could deal with it is the
> UFFD_FEATURE_SIGBUS but that's not normal userfaultfd: that's also the
> only feature required from uffd to implement a pure malloc library in
> userland that never takes the mmap sem for writing to implement
> userland mremap/mmap/munmap lib calls (as those will convert to
> UFFDIO_ZEROPAGE and MADV_DONTNEED internally to the lib and there will
> be always a single vma). We just need to extend UFFDIO_ZEROPAGE to map
> the THP zeropage to make this future pure-uffd malloc lib perform
> better.

The key requirement here is the ability to prevent unprivileged
processes from using UFFD to widen kernel exploit windows by
preventing UFFD from taking kernel faults. Forcing unprivileged
processes to use UFFD only with UFFD_FEATURE_SIGBUS would satisfy this
requirement, but it's much less flexible and unnecessarily couples two
features.

> On the other end I'm also planning a mremap_vma_merge userland syscall
> that will merge fragmented vmas.

This is probably a separate discussion, but does that operation really
need to be a system call? Historically, userspace has operated mostly
on page ranges and not VMAs per se, and the kernel has been free to
merge and split VMAs as needed for its internal purposes. This
approach has serious negative side effects (like making munmap
fallible: see [1]), but it is what it is.

[1] https://lore.kernel.org/linux-mm/CAKOZuetOD6MkGPVvYFLj5RXh200FaDyu3sQqZviVRhTFFS3fjA@mail.gmail.com/

> Currently once you have a nice heap all contiguous but with small
> objects and you free the fragments you can't build THP anymore even if
> you make the memory virtually contiguous again by calling mremap. That
> just build up a ton of vmas slowing down the app forever and also
> preventing THP collapsing ever again.

Shouldn't the THP background kthread take care of merging VMAs?

> mremap_vma_merge will require no new kernel feature, but it
> fundamentally must be able to handle kernel faults. If databases
> starts to use that, how can you enable this feature without breaking
> random apps then?

Presumably, those apps wouldn't issue the system call on address
ranges managed with a non-kernel-fault UFFD.

> So it'd be a feature usable only by one user (Android) perhaps? And
> only until you start defragging the vmas of small objects?

We shouldn't be fragmenting at all, either at the memory level or the
VMA level. The GC is a moving collector, and we don't punch holes in
the heap.
Andrea Arcangeli Feb. 12, 2020, 11:41 p.m. UTC | #10
Hi Daniel,

On Wed, Feb 12, 2020 at 12:04:39PM -0800, Daniel Colascione wrote:
> We don't pass pointers to the heap into system calls. (Big primitive
> arrays, ByteBuffer, etc. are allocated off the regular heap.)

That sounds pretty restrictive, I wonder what you gain by enforcing
that invariant or if it just happened incidentally for some other
reason?  Do you need to copy that memory every time if you need to do
I/O on it? Are you sure this won't need to change any time soon to
increase performance?

> I don't understand what you mean. The purpose of preventing UFFD from
> handling kernel faults is exploit mitigation.

That part was clear. What wasn't clear is what the new feature
does exactly and what it blocks, because it's all about blocking or
how does it make things more secure?

> The key requirement here is the ability to prevent unprivileged
> processes from using UFFD to widen kernel exploit windows by
> preventing UFFD from taking kernel faults. Forcing unprivileged
> processes to use UFFD only with UFFD_FEATURE_SIGBUS would satisfy this
> requirement, but it's much less flexible and unnecessarily couples two
> features.

I mentioned it in case you could use something like that model.

> > On the other end I'm also planning a mremap_vma_merge userland syscall
> > that will merge fragmented vmas.
> 
> This is probably a separate discussion, but does that operation really
> need to be a system call? Historically, userspace has operated mostly

mremap_vma_merge was not intended as a system call, if implemented as
a system call it wouldn't use uffd.

> on page ranges and not VMAs per se, and the kernel has been free to

Userland doesn't need to know anything.. unless it wants to optimize.

The userland can know full well if it does certain mremap operations
and puts its ranges virtually contiguous in a non linear way, so that
the kernel cannot merge the vmas.

> merge and split VMAs as needed for its internal purposes. This
> approach has serious negative side effects (like making munmap
> fallible: see [1]), but it is what it is.
> 
> [1] https://lore.kernel.org/linux-mm/CAKOZuetOD6MkGPVvYFLj5RXh200FaDyu3sQqZviVRhTFFS3fjA@mail.gmail.com/

The fact it's fallible is a secondary concern here. Even if you make
it unlimited, if it grows it slowdown everything and also prevents THP
to be collapsed. Even the scalability of the mmap_sem worsens.

> > Currently once you have a nice heap all contiguous but with small
> > objects and you free the fragments you can't build THP anymore even if
> > you make the memory virtually contiguous again by calling mremap. That
> > just build up a ton of vmas slowing down the app forever and also
> > preventing THP collapsing ever again.
> 
> Shouldn't the THP background kthread take care of merging VMAs?

The solution can't depend on any THP feature, because the buildup of
vmas is a scalability issue and a performance regression over time
even if THP is not configured in the kernel. However once that's
solved THP also gets naturally optimized.

What should happen (in my view) is just the simplest solution that can
defrag and forcefully merge the vmas without having to stop or restart
the app.

> Presumably, those apps wouldn't issue the system call on address
> ranges managed with a non-kernel-fault UFFD.

So the new security feature won't have to block kernel faults on those
apps and they can run side by side with the blocked app?

> We shouldn't be fragmenting at all, either at the memory level or the
> VMA level. The GC is a moving collector, and we don't punch holes in
> the heap.

That sounds good.

Thanks,
Andrea
Daniel Colascione Feb. 14, 2020, 3:26 a.m. UTC | #11
Userfaultfd in unprivileged contexts could be potentially very
useful. We'd like to harden userfaultfd to make such unprivileged use
less risky. This patch series allows SELinux to manage userfaultfd
file descriptors and in the future, other kinds of
anonymous-inode-based file descriptor.  SELinux policy authors can
apply policy types to anonymous inodes by providing name-based
transition rules keyed off the anonymous inode internal name (
"[userfaultfd]" in the case of userfaultfd(2) file descriptors) and
applying policy to the new SIDs thus produced.

Inside the kernel, a pair of new anon_inodes interface,
anon_inode_getfile_secure and anon_inode_getfd_secure, allow callers
to opt into this SELinux management. In this new "secure" mode,
anon_inodes creates new ephemeral inodes for anonymous file objects
instead of reusing the normal anon_inodes singleton dummy inode. A new
LSM hook gives security modules an opportunity to configure and veto
these ephemeral inodes.

This patch series is one of two fork of [1] and is an
alternative to [2].

The primary difference between the two patch series is that this
partch series creates a unique inode for each "secure" anonymous
inode, while the other patch series ([2]) continues using the
singleton dummy anonymous inode and adds a way to attach SELinux
security information directly to file objects.

I prefer the approach in this patch series because 1) it's a smaller
patch than [2], and 2) it produces a more regular security
architecture: in this patch series, secure anonymous inodes aren't
S_PRIVATE and they maintain the SELinux property that the label for a
file is in its inode. We do need an additional inode per anonymous
file, but per-struct-file inode creation doesn't seem to be a problem
for pipes and sockets.

The previous version of this feature ([1]) created a new SELinux
security class for userfaultfd file descriptors. This version adopts
the generic transition-based approach of [2].

This patch series also differs from [2] in that it doesn't affect all
anonymous inodes right away --- instead requiring anon_inodes callers
to opt in --- but this difference isn't one of basic approach. The
important question to resolve is whether we should be creating new
inodes or enhancing per-file data.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://lore.kernel.org/linux-fsdevel/20200213194157.5877-1-sds@tycho.nsa.gov/

Daniel Colascione (3):
  Add a new LSM-supporting anonymous inode interface
  Teach SELinux about anonymous inodes
  Wire UFFD up to SELinux

 fs/anon_inodes.c            | 196 ++++++++++++++++++++++++++++--------
 fs/userfaultfd.c            |  34 +++++--
 include/linux/anon_inodes.h |  13 +++
 include/linux/lsm_hooks.h   |   9 ++
 include/linux/security.h    |   4 +
 security/security.c         |  10 ++
 security/selinux/hooks.c    |  57 +++++++++++
 7 files changed, 274 insertions(+), 49 deletions(-)
James Morris Feb. 21, 2020, 5:56 p.m. UTC | #12
On Tue, 11 Feb 2020, Daniel Colascione wrote:

> > This must be posted to the linux Security Module list
> > <linux-security-module@vger.kernel.org>
> 
> Added. I thought selinux@ was sufficient.
> 

Please cc: me on these patches.
Daniel Colascione March 26, 2020, 6:14 p.m. UTC | #13
Userfaultfd in unprivileged contexts could be potentially very
useful. We'd like to harden userfaultfd to make such unprivileged use
less risky. This patch series allows SELinux to manage userfaultfd
file descriptors and in the future, other kinds of
anonymous-inode-based file descriptor.  SELinux policy authors can
apply policy types to anonymous inodes by providing name-based
transition rules keyed off the anonymous inode internal name (
"[userfaultfd]" in the case of userfaultfd(2) file descriptors) and
applying policy to the new SIDs thus produced.

Inside the kernel, a pair of new anon_inodes interface,
anon_inode_getfile_secure and anon_inode_getfd_secure, allow callers
to opt into this SELinux management. In this new "secure" mode,
anon_inodes creates new ephemeral inodes for anonymous file objects
instead of reusing the normal anon_inodes singleton dummy inode. A new
LSM hook gives security modules an opportunity to configure and veto
these ephemeral inodes.

This patch series is one of two fork of [1] and is an
alternative to [2].

The primary difference between the two patch series is that this
partch series creates a unique inode for each "secure" anonymous
inode, while the other patch series ([2]) continues using the
singleton dummy anonymous inode and adds a way to attach SELinux
security information directly to file objects.

I prefer the approach in this patch series because 1) it's a smaller
patch than [2], and 2) it produces a more regular security
architecture: in this patch series, secure anonymous inodes aren't
S_PRIVATE and they maintain the SELinux property that the label for a
file is in its inode. We do need an additional inode per anonymous
file, but per-struct-file inode creation doesn't seem to be a problem
for pipes and sockets.

The previous version of this feature ([1]) created a new SELinux
security class for userfaultfd file descriptors. This version adopts
the generic transition-based approach of [2].

This patch series also differs from [2] in that it doesn't affect all
anonymous inodes right away --- instead requiring anon_inodes callers
to opt in --- but this difference isn't one of basic approach. The
important question to resolve is whether we should be creating new
inodes or enhancing per-file data.

Changes from the first version of the patch:

  - Removed some error checks
  - Defined a new anon_inode SELinux class to resolve the
    ambiguity in [3]
  - Inherit sclass as well as descriptor from context inode

Changes from the second version of the patch:

  - Fixed example policy in the commit message to reflect the use of
    the new anon_inode class.

[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://lore.kernel.org/linux-fsdevel/20200213194157.5877-1-sds@tycho.nsa.gov/
[3] https://lore.kernel.org/lkml/23f725ca-5b5a-5938-fcc8-5bbbfc9ba9bc@tycho.nsa.gov/

Daniel Colascione (3):
  Add a new LSM-supporting anonymous inode interface
  Teach SELinux about anonymous inodes
  Wire UFFD up to SELinux

 fs/anon_inodes.c                    | 196 ++++++++++++++++++++++------
 fs/userfaultfd.c                    |  30 ++++-
 include/linux/anon_inodes.h         |  13 ++
 include/linux/lsm_hooks.h           |   9 ++
 include/linux/security.h            |   4 +
 security/security.c                 |  10 ++
 security/selinux/hooks.c            |  54 ++++++++
 security/selinux/include/classmap.h |   2 +
 8 files changed, 272 insertions(+), 46 deletions(-)