mbox series

[0/3] userfaultfd: allow to forbid unprivileged users

Message ID 20190311093701.15734-1-peterx@redhat.com (mailing list archive)
Headers show
Series userfaultfd: allow to forbid unprivileged users | expand

Message

Peter Xu March 11, 2019, 9:36 a.m. UTC
Hi,

(The idea comes from Andrea, and following discussions with Mike and
 other people)

This patchset introduces a new sysctl flag to allow the admin to
forbid users from using userfaultfd:

  $ cat /proc/sys/vm/unprivileged_userfaultfd
  [disabled] enabled kvm

  - When set to "disabled", all unprivileged users are forbidden to
    use userfaultfd syscalls.

  - When set to "enabled", all users are allowed to use userfaultfd
    syscalls.

  - When set to "kvm", all unprivileged users are forbidden to use the
    userfaultfd syscalls, except the user who has permission to open
    /dev/kvm.

This new flag can add one more layer of security to reduce the attack
surface of the kernel by abusing userfaultfd.  Here we grant the
thread userfaultfd permission by checking against CAP_SYS_PTRACE
capability.  By default, the value is "disabled" which is the most
strict policy.  Distributions can have their own perferred value.

The "kvm" entry is a bit special here only to make sure that existing
users like QEMU/KVM won't break by this newly introduced flag.  What
we need to do is simply set the "unprivileged_userfaultfd" flag to
"kvm" here to automatically grant userfaultfd permission for processes
like QEMU/KVM without extra code to tweak these flags in the admin
code.

Patch 1:  The interface patch to introduce the flag

Patch 2:  The KVM related changes to detect opening of /dev/kvm

Patch 3:  Apply the flag to userfaultfd syscalls

All comments would be greatly welcomed.  Thanks,

Peter Xu (3):
  userfaultfd/sysctl: introduce unprivileged_userfaultfd
  kvm/mm: introduce MMF_USERFAULTFD_ALLOW flag
  userfaultfd: apply unprivileged_userfaultfd check

 fs/userfaultfd.c               | 121 +++++++++++++++++++++++++++++++++
 include/linux/sched/coredump.h |   1 +
 include/linux/userfaultfd_k.h  |   5 ++
 init/Kconfig                   |  11 +++
 kernel/sysctl.c                |  11 +++
 virt/kvm/kvm_main.c            |   7 ++
 6 files changed, 156 insertions(+)

Comments

Mike Rapoport March 12, 2019, 7:01 a.m. UTC | #1
Hi Peter,

On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote:
> Hi,
> 
> (The idea comes from Andrea, and following discussions with Mike and
>  other people)
> 
> This patchset introduces a new sysctl flag to allow the admin to
> forbid users from using userfaultfd:
> 
>   $ cat /proc/sys/vm/unprivileged_userfaultfd
>   [disabled] enabled kvm
> 
>   - When set to "disabled", all unprivileged users are forbidden to
>     use userfaultfd syscalls.
> 
>   - When set to "enabled", all users are allowed to use userfaultfd
>     syscalls.
> 
>   - When set to "kvm", all unprivileged users are forbidden to use the
>     userfaultfd syscalls, except the user who has permission to open
>     /dev/kvm.
> 
> This new flag can add one more layer of security to reduce the attack
> surface of the kernel by abusing userfaultfd.  Here we grant the
> thread userfaultfd permission by checking against CAP_SYS_PTRACE
> capability.  By default, the value is "disabled" which is the most
> strict policy.  Distributions can have their own perferred value.
> 
> The "kvm" entry is a bit special here only to make sure that existing
> users like QEMU/KVM won't break by this newly introduced flag.  What
> we need to do is simply set the "unprivileged_userfaultfd" flag to
> "kvm" here to automatically grant userfaultfd permission for processes
> like QEMU/KVM without extra code to tweak these flags in the admin
> code.
> 
> Patch 1:  The interface patch to introduce the flag
> 
> Patch 2:  The KVM related changes to detect opening of /dev/kvm
> 
> Patch 3:  Apply the flag to userfaultfd syscalls
 
I'd appreciate to see "Patch 4: documentation update" ;-)
It'd be also great to update the man pages after this is merged.

Except for the comment to patch 1, feel free to add

Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

> All comments would be greatly welcomed.  Thanks,
> 
> Peter Xu (3):
>   userfaultfd/sysctl: introduce unprivileged_userfaultfd
>   kvm/mm: introduce MMF_USERFAULTFD_ALLOW flag
>   userfaultfd: apply unprivileged_userfaultfd check
> 
>  fs/userfaultfd.c               | 121 +++++++++++++++++++++++++++++++++
>  include/linux/sched/coredump.h |   1 +
>  include/linux/userfaultfd_k.h  |   5 ++
>  init/Kconfig                   |  11 +++
>  kernel/sysctl.c                |  11 +++
>  virt/kvm/kvm_main.c            |   7 ++
>  6 files changed, 156 insertions(+)
> 
> -- 
> 2.17.1
>
Kirill A. Shutemov March 12, 2019, 7:49 a.m. UTC | #2
On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote:
> Hi,
> 
> (The idea comes from Andrea, and following discussions with Mike and
>  other people)
> 
> This patchset introduces a new sysctl flag to allow the admin to
> forbid users from using userfaultfd:
> 
>   $ cat /proc/sys/vm/unprivileged_userfaultfd
>   [disabled] enabled kvm

CC linux-api@

This is unusual way to return current value for sysctl. Does it work fine
with sysctl tool?

Have you considered to place the switch into /sys/kernel/mm instead?
I doubt it's the last tunable for userfaultfd. Maybe we should have an
directory for it under /sys/kernel/mm?

>   - When set to "disabled", all unprivileged users are forbidden to
>     use userfaultfd syscalls.
> 
>   - When set to "enabled", all users are allowed to use userfaultfd
>     syscalls.
> 
>   - When set to "kvm", all unprivileged users are forbidden to use the
>     userfaultfd syscalls, except the user who has permission to open
>     /dev/kvm.
> 
> This new flag can add one more layer of security to reduce the attack
> surface of the kernel by abusing userfaultfd.  Here we grant the
> thread userfaultfd permission by checking against CAP_SYS_PTRACE
> capability.  By default, the value is "disabled" which is the most
> strict policy.  Distributions can have their own perferred value.
> 
> The "kvm" entry is a bit special here only to make sure that existing
> users like QEMU/KVM won't break by this newly introduced flag.  What
> we need to do is simply set the "unprivileged_userfaultfd" flag to
> "kvm" here to automatically grant userfaultfd permission for processes
> like QEMU/KVM without extra code to tweak these flags in the admin
> code.
> 
> Patch 1:  The interface patch to introduce the flag
> 
> Patch 2:  The KVM related changes to detect opening of /dev/kvm
> 
> Patch 3:  Apply the flag to userfaultfd syscalls
> 
> All comments would be greatly welcomed.  Thanks,
> 
> Peter Xu (3):
>   userfaultfd/sysctl: introduce unprivileged_userfaultfd
>   kvm/mm: introduce MMF_USERFAULTFD_ALLOW flag
>   userfaultfd: apply unprivileged_userfaultfd check
> 
>  fs/userfaultfd.c               | 121 +++++++++++++++++++++++++++++++++
>  include/linux/sched/coredump.h |   1 +
>  include/linux/userfaultfd_k.h  |   5 ++
>  init/Kconfig                   |  11 +++
>  kernel/sysctl.c                |  11 +++
>  virt/kvm/kvm_main.c            |   7 ++
>  6 files changed, 156 insertions(+)
> 
> -- 
> 2.17.1
>
Peter Xu March 12, 2019, 12:29 p.m. UTC | #3
On Tue, Mar 12, 2019 at 09:01:47AM +0200, Mike Rapoport wrote:
> Hi Peter,
> 
> On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote:
> > Hi,
> > 
> > (The idea comes from Andrea, and following discussions with Mike and
> >  other people)
> > 
> > This patchset introduces a new sysctl flag to allow the admin to
> > forbid users from using userfaultfd:
> > 
> >   $ cat /proc/sys/vm/unprivileged_userfaultfd
> >   [disabled] enabled kvm
> > 
> >   - When set to "disabled", all unprivileged users are forbidden to
> >     use userfaultfd syscalls.
> > 
> >   - When set to "enabled", all users are allowed to use userfaultfd
> >     syscalls.
> > 
> >   - When set to "kvm", all unprivileged users are forbidden to use the
> >     userfaultfd syscalls, except the user who has permission to open
> >     /dev/kvm.
> > 
> > This new flag can add one more layer of security to reduce the attack
> > surface of the kernel by abusing userfaultfd.  Here we grant the
> > thread userfaultfd permission by checking against CAP_SYS_PTRACE
> > capability.  By default, the value is "disabled" which is the most
> > strict policy.  Distributions can have their own perferred value.
> > 
> > The "kvm" entry is a bit special here only to make sure that existing
> > users like QEMU/KVM won't break by this newly introduced flag.  What
> > we need to do is simply set the "unprivileged_userfaultfd" flag to
> > "kvm" here to automatically grant userfaultfd permission for processes
> > like QEMU/KVM without extra code to tweak these flags in the admin
> > code.
> > 
> > Patch 1:  The interface patch to introduce the flag
> > 
> > Patch 2:  The KVM related changes to detect opening of /dev/kvm
> > 
> > Patch 3:  Apply the flag to userfaultfd syscalls
>  
> I'd appreciate to see "Patch 4: documentation update" ;-)
> It'd be also great to update the man pages after this is merged.

Oops, sorry!  I should have remembered that.

> 
> Except for the comment to patch 1, feel free to add
> 
> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>

Thanks Mike!  I'll take it for 2/3 until I got confirmation from you
on patch 1.

Regards,
Peter Xu March 12, 2019, 12:43 p.m. UTC | #4
Hi, Kirill,

On Tue, Mar 12, 2019 at 10:49:51AM +0300, Kirill A. Shutemov wrote:
> On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote:
> > Hi,
> > 
> > (The idea comes from Andrea, and following discussions with Mike and
> >  other people)
> > 
> > This patchset introduces a new sysctl flag to allow the admin to
> > forbid users from using userfaultfd:
> > 
> >   $ cat /proc/sys/vm/unprivileged_userfaultfd
> >   [disabled] enabled kvm
> 
> CC linux-api@
> 
> This is unusual way to return current value for sysctl. Does it work fine
> with sysctl tool?

It can work, though it displays the same as "cat":

$ sysctl vm.unprivileged_userfaultfd
vm.unprivileged_userfaultfd = disabled enabled [kvm] 

> 
> Have you considered to place the switch into /sys/kernel/mm instead?
> I doubt it's the last tunable for userfaultfd. Maybe we should have an
> directory for it under /sys/kernel/mm?

I haven't thought about sysfs, if that's preferred I can consider to
switch to that.  And yes I think creating a directory should be a good
idea.

Thanks,
Mike Kravetz March 12, 2019, 7:59 p.m. UTC | #5
On 3/11/19 2:36 AM, Peter Xu wrote:
> 
> The "kvm" entry is a bit special here only to make sure that existing
> users like QEMU/KVM won't break by this newly introduced flag.  What
> we need to do is simply set the "unprivileged_userfaultfd" flag to
> "kvm" here to automatically grant userfaultfd permission for processes
> like QEMU/KVM without extra code to tweak these flags in the admin
> code.

Another user is Oracle DB, specifically with hugetlbfs.  For them, we would
like to add a special case like kvm described above.  The admin controls
who can have access to hugetlbfs, so I think adding code to the open
routine as in patch 2 of this series would seem to work.

However, I can imagine more special cases being added for other users.  And,
once you have more than one special case then you may want to combine them.
For example, kvm and hugetlbfs together.
Peter Xu March 13, 2019, 6 a.m. UTC | #6
On Tue, Mar 12, 2019 at 12:59:34PM -0700, Mike Kravetz wrote:
> On 3/11/19 2:36 AM, Peter Xu wrote:
> > 
> > The "kvm" entry is a bit special here only to make sure that existing
> > users like QEMU/KVM won't break by this newly introduced flag.  What
> > we need to do is simply set the "unprivileged_userfaultfd" flag to
> > "kvm" here to automatically grant userfaultfd permission for processes
> > like QEMU/KVM without extra code to tweak these flags in the admin
> > code.
> 
> Another user is Oracle DB, specifically with hugetlbfs.  For them, we would
> like to add a special case like kvm described above.  The admin controls
> who can have access to hugetlbfs, so I think adding code to the open
> routine as in patch 2 of this series would seem to work.

Yes I think if there's an explicit and safe place we can hook for
hugetlbfs then we can do the similar trick as KVM case.  Though I
noticed that we can not only create hugetlbfs files under the
mountpoint (which the admin can control), but also using some other
ways.  The question (of me... sorry if it's a silly one!) is whether
all other ways to use hugetlbfs is still under control of the admin.
One I know of is memfd_create() which seems to be doable even as
unprivileged users.  If so, should we only limit the uffd privilege to
those hugetlbfs users who use the mountpoint directly?

Another question is about fork() of privileged processes - for KVM we
only grant privilege for the exact process that opened the /dev/kvm
node, and the privilege will be lost for any forked childrens.  Is
that the same thing for OracleDB/Hugetlbfs?

> 
> However, I can imagine more special cases being added for other users.  And,
> once you have more than one special case then you may want to combine them.
> For example, kvm and hugetlbfs together.

It looks fine to me if we're using MMF_USERFAULTFD_ALLOW flag upon
mm_struct, since that seems to be a very general flag that can be used
by anything we want to grant privilege for, not only KVM?

Thanks,
Paolo Bonzini March 13, 2019, 8:22 a.m. UTC | #7
On 13/03/19 07:00, Peter Xu wrote:
>> However, I can imagine more special cases being added for other users.  And,
>> once you have more than one special case then you may want to combine them.
>> For example, kvm and hugetlbfs together.
> It looks fine to me if we're using MMF_USERFAULTFD_ALLOW flag upon
> mm_struct, since that seems to be a very general flag that can be used
> by anything we want to grant privilege for, not only KVM?

Perhaps you can remove the fork() limitation, and add a new suboption to
prctl(PR_SET_MM) that sets/resets MMF_USERFAULTFD_ALLOW.  If somebody
wants to forbid unprivileged userfaultfd and use KVM, they'll have to
use libvirt or some other privileged management tool.

We could also add support for this prctl to systemd, and then one could
do "systemd-run -pAllowUserfaultfd=yes COMMAND".

Paolo
Mike Kravetz March 13, 2019, 5:50 p.m. UTC | #8
On 3/12/19 11:00 PM, Peter Xu wrote:
> On Tue, Mar 12, 2019 at 12:59:34PM -0700, Mike Kravetz wrote:
>> On 3/11/19 2:36 AM, Peter Xu wrote:
>>>
>>> The "kvm" entry is a bit special here only to make sure that existing
>>> users like QEMU/KVM won't break by this newly introduced flag.  What
>>> we need to do is simply set the "unprivileged_userfaultfd" flag to
>>> "kvm" here to automatically grant userfaultfd permission for processes
>>> like QEMU/KVM without extra code to tweak these flags in the admin
>>> code.
>>
>> Another user is Oracle DB, specifically with hugetlbfs.  For them, we would
>> like to add a special case like kvm described above.  The admin controls
>> who can have access to hugetlbfs, so I think adding code to the open
>> routine as in patch 2 of this series would seem to work.
> 
> Yes I think if there's an explicit and safe place we can hook for
> hugetlbfs then we can do the similar trick as KVM case.  Though I
> noticed that we can not only create hugetlbfs files under the
> mountpoint (which the admin can control), but also using some other
> ways.  The question (of me... sorry if it's a silly one!) is whether
> all other ways to use hugetlbfs is still under control of the admin.
> One I know of is memfd_create() which seems to be doable even as
> unprivileged users.  If so, should we only limit the uffd privilege to
> those hugetlbfs users who use the mountpoint directly?

Wow!  I did not realize that apps which specify mmap(MAP_HUGETLB) do not
need any special privilege to use huge pages.  Honestly, I am not sure if
that was by design or a bug.  The memfd_create code is based on the MAP_HUGETLB
code and also does not need any special privilege.  Not to sidetrack this
discussion, but people on Cc may know if this is a bug or by design.  My
opinion is that huge pages are a limited resource and should be under control.
One needs to be a member of a special group (or root) to access via System V
interfaces.

The DB use case only does mmap of files in an explicitly mounted filesystem.
So, limiting it in that manner would work for them.

> Another question is about fork() of privileged processes - for KVM we
> only grant privilege for the exact process that opened the /dev/kvm
> node, and the privilege will be lost for any forked childrens.  Is
> that the same thing for OracleDB/Hugetlbfs?

I need to confirm with the DB people, but it is my understanding that the
exact process which does the open/mmap will be the one using userfaultfd.
Andrea Arcangeli March 13, 2019, 6:52 p.m. UTC | #9
Hello,

On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote:
> On 13/03/19 07:00, Peter Xu wrote:
> >> However, I can imagine more special cases being added for other users.  And,
> >> once you have more than one special case then you may want to combine them.
> >> For example, kvm and hugetlbfs together.
> > It looks fine to me if we're using MMF_USERFAULTFD_ALLOW flag upon
> > mm_struct, since that seems to be a very general flag that can be used
> > by anything we want to grant privilege for, not only KVM?
> 
> Perhaps you can remove the fork() limitation, and add a new suboption to
> prctl(PR_SET_MM) that sets/resets MMF_USERFAULTFD_ALLOW.  If somebody
> wants to forbid unprivileged userfaultfd and use KVM, they'll have to
> use libvirt or some other privileged management tool.
> 
> We could also add support for this prctl to systemd, and then one could
> do "systemd-run -pAllowUserfaultfd=yes COMMAND".

systemd can already implement -pAllowUserfaultfd=no with seccomp if it
wants. It can also implement -yes if by default turns off userfaultfd
like firejail -seccomp would do.

If the end goal is to implement the filtering with an userland policy
instead of a kernel policy, seccomp enabled for all services sounds
reasonable. It's very unlikely you'll block only userfaultfd, firejail
-seccomp by default blocks dozen of syscalls that are unnecessary
99.9% of the time.

This is not about implementing an userland flexible policy, it's just
a simple kernel policy, to use until userland disables the kernel
policy to takeover with seccomp across the board.

I wouldn't like this too be too complicated because this is already
theoretically overlapping 100% with seccomp.

hugetlbfs is more complicated to detect, because even if you inherit
it from fork(), the services that mounts the fs may be in a different
container than the one that Oracle that uses userfaultfd later on down
the road from a different context. And I don't think it would be ok to
allow running userfaultfd just because you can open a file in an
hugetlbfs file system. With /dev/kvm it's a bit different, that's
chmod o-r by default.. no luser should be able to open it.

Unless somebody suggests a consistent way to make hugetlbfs "just
work" (like we could achieve clean with CRIU and KVM), I think Oracle
will need a one liner change in the Oracle setup to echo into that
file in addition of running the hugetlbfs mount.

Note that DPDK host bridge process will also need a one liner change
to do a dummy open/close of /dev/kvm to unblock the syscall.

Thanks,
Andrea
Paolo Bonzini March 13, 2019, 7:12 p.m. UTC | #10
> On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote:
> Unless somebody suggests a consistent way to make hugetlbfs "just
> work" (like we could achieve clean with CRIU and KVM), I think Oracle
> will need a one liner change in the Oracle setup to echo into that
> file in addition of running the hugetlbfs mount.

Hi Andrea, can you explain more in detail the risks of enabling
userfaultfd for unprivileged users?

Paolo

> Note that DPDK host bridge process will also need a one liner change
> to do a dummy open/close of /dev/kvm to unblock the syscall.
> 
> Thanks,
> Andrea
>
Mike Kravetz March 13, 2019, 8:01 p.m. UTC | #11
On 3/13/19 11:52 AM, Andrea Arcangeli wrote:
> 
> hugetlbfs is more complicated to detect, because even if you inherit
> it from fork(), the services that mounts the fs may be in a different
> container than the one that Oracle that uses userfaultfd later on down
> the road from a different context. And I don't think it would be ok to
> allow running userfaultfd just because you can open a file in an
> hugetlbfs file system. With /dev/kvm it's a bit different, that's
> chmod o-r by default.. no luser should be able to open it.
> 
> Unless somebody suggests a consistent way to make hugetlbfs "just
> work" (like we could achieve clean with CRIU and KVM), I think Oracle
> will need a one liner change in the Oracle setup to echo into that
> file in addition of running the hugetlbfs mount.

I think you are suggesting the DB setup process enable uffd for all users.
Correct?

This may be too simple, and I don't really like group access, but how about
just defining a uffd group?  If you are in the group you can make uffd
system calls.
Andrea Arcangeli March 13, 2019, 11:44 p.m. UTC | #12
Hi Paolo,

On Wed, Mar 13, 2019 at 03:12:28PM -0400, Paolo Bonzini wrote:
> 
> > On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote:
> > Unless somebody suggests a consistent way to make hugetlbfs "just
> > work" (like we could achieve clean with CRIU and KVM), I think Oracle
> > will need a one liner change in the Oracle setup to echo into that
> > file in addition of running the hugetlbfs mount.
> 
> Hi Andrea, can you explain more in detail the risks of enabling
> userfaultfd for unprivileged users?

There's no more risk than in allowing mremap (direct memory overwrite
after underflow) or O_DIRECT (dirtycow) for unprivileged users.

If there was risk in allowing userfaultfd for unprivileged users,
CAP_SYS_PTRACE would be mandatory and there wouldn't be an option to
allow userfaultfd to all unprivileged users.

This is just an hardening policy in kernel (for those that don't run
everything under seccomp) that may even be removed later.

Unlike mremap and O_DIRECT, because we've only an handful of
(important) applications using userfaultfd so far, we can do like the
bpf syscall:

SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
	union bpf_attr attr = {};
	int err;

	if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
		return -EPERM;

Except we picked CAP_SYS_PTRACE because CRIU already has to run with
such capability for other reasons.

So this is intended as the "sysctl_unprivileged_bpf_disabled" trick
applied to the bpf syscall, also applied to the userfaultfd syscall,
nothing more nothing less.

Then I thought we can add a tristate so an open of /dev/kvm would also
allow the syscall to make things more user friendly because
unprivileged containers ideally should have writable mounts done with
nodev and no matter the privilege they shouldn't ever get an hold on
the KVM driver (and those who do, like kubevirt, will then just work).

There has been one complaint because userfaultfd can also be used to
facilitate exploiting use after free bugs in other kernel code:

https://cyseclabs.com/blog/linux-kernel-heap-spray

This isn't a bug in userfaultfd, the only bug there is some kernel
code doing an use after free in a reproducible place, userfaultfd only
allows to stop copy-user where the exploits like copy-user to be
stopped.

This isn't particularly concerning, because you can achieve the same
objective with FUSE. In fact even if you set CONFIG_USERFAULTFD=n in
the kernel config and CONFIG_FUSE_FS=n, a security bug like that can
still be exploited eventually. It's just less reproducible if you
can't stop copy-user.

Restricting userfaultfd to privileged processes, won't make such
kernel bug become harmless, it'll just require more attempts to
exploit as far as I can tell. To put things in prospective, the
exploits for the most serious security bugs like mremap missing
underflow check, dirtycow or the missing stack_guard_gap would not
get any facilitation by userfaultfd.

I also thought we were randomizing all slab heaps by now to avoid
issues like above, I don't know if the randomized slab freelist oreder
CONFIG_SLAB_FREELIST_RANDOM and also the pointer with
CONFIG_SLAB_FREELIST_HARDENED precisely to avoid the exploits like
above. It's not like you can disable those two options even if you set
CONFIG_USERFAULTFD=n. I wonder if in that blog post the exploit was
set on a kernel with those two options enabled.

In any case not allowing non privileged processes to run the
userfaultfd syscall will provide some hardening feature also against
such kind of concern.

Thanks,
Andrea
Andrea Arcangeli March 13, 2019, 11:55 p.m. UTC | #13
On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote:
> On 3/13/19 11:52 AM, Andrea Arcangeli wrote:
> > 
> > hugetlbfs is more complicated to detect, because even if you inherit
> > it from fork(), the services that mounts the fs may be in a different
> > container than the one that Oracle that uses userfaultfd later on down
> > the road from a different context. And I don't think it would be ok to
> > allow running userfaultfd just because you can open a file in an
> > hugetlbfs file system. With /dev/kvm it's a bit different, that's
> > chmod o-r by default.. no luser should be able to open it.
> > 
> > Unless somebody suggests a consistent way to make hugetlbfs "just
> > work" (like we could achieve clean with CRIU and KVM), I think Oracle
> > will need a one liner change in the Oracle setup to echo into that
> > file in addition of running the hugetlbfs mount.
> 
> I think you are suggesting the DB setup process enable uffd for all users.
> Correct?

Yes. In addition of the hugetlbfs setup, various apps requires to also
increase fs.inotify.max_user_watches or file-max and other tweaks,
this would be one of those tweaks.

> This may be too simple, and I don't really like group access, but how about
> just defining a uffd group?  If you are in the group you can make uffd
> system calls.

Everything is possible, I'm just afraid it gets too complex.

So you suggest to echo a gid into the file?
Mike Kravetz March 14, 2019, 3:32 a.m. UTC | #14
On 3/13/19 4:55 PM, Andrea Arcangeli wrote:
> On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote:
>> On 3/13/19 11:52 AM, Andrea Arcangeli wrote:
>>> Unless somebody suggests a consistent way to make hugetlbfs "just
>>> work" (like we could achieve clean with CRIU and KVM), I think Oracle
>>> will need a one liner change in the Oracle setup to echo into that
>>> file in addition of running the hugetlbfs mount.
>>
>> I think you are suggesting the DB setup process enable uffd for all users.
>> Correct?
> 
> Yes. In addition of the hugetlbfs setup, various apps requires to also
> increase fs.inotify.max_user_watches or file-max and other tweaks,
> this would be one of those tweaks.

Yes, I agree.
It is just that unprivileged_userfaultfd disabled would likely to be the
default set by distros.  Or, perhaps 'kvm'?  Then, if you want to run the
DB, the admin (or the DB) will need to set it to enabled.  And, this results
in it being enabled for everyone.  I think I understand the scope of any
security exposure this would cause from a technical perspective.  However,
I can imagine people being concerned about enabling everywhere if this is
not the default setting.

If it is OK to disable everywhere, why not just use disable for the kvm
use case as well? :)

>> This may be too simple, and I don't really like group access, but how about
>> just defining a uffd group?  If you are in the group you can make uffd
>> system calls.
> 
> Everything is possible, I'm just afraid it gets too complex.
> 
> So you suggest to echo a gid into the file?

That is what I was thinking.  But, I was mostly thinking of that because
Peter's earlier comment made me go and check hugetlbfs code.  There is a
sysctl_hugetlb_shm_group variable that does this, even though it is mostly
unused in the hugetlbfs code.

I know the kvm dev open scheme works for kvm.  Was just trying to think
of something more general that would work for hugetlbfs/DB or other use
cases.
Paolo Bonzini March 14, 2019, 10:58 a.m. UTC | #15
On 14/03/19 00:44, Andrea Arcangeli wrote:
> Then I thought we can add a tristate so an open of /dev/kvm would also
> allow the syscall to make things more user friendly because
> unprivileged containers ideally should have writable mounts done with
> nodev and no matter the privilege they shouldn't ever get an hold on
> the KVM driver (and those who do, like kubevirt, will then just work).

I wouldn't even bother with the KVM special case.  Containers can use
seccomp if they want a fine-grained policy.

(Actually I wouldn't bother with the knob at all; the attack surface of
userfaultfd is infinitesimal compared to the BPF JIT...).

Paolo
Alexei Starovoitov March 14, 2019, 3:23 p.m. UTC | #16
On Thu, Mar 14, 2019 at 4:00 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 14/03/19 00:44, Andrea Arcangeli wrote:
> > Then I thought we can add a tristate so an open of /dev/kvm would also
> > allow the syscall to make things more user friendly because
> > unprivileged containers ideally should have writable mounts done with
> > nodev and no matter the privilege they shouldn't ever get an hold on
> > the KVM driver (and those who do, like kubevirt, will then just work).
>
> I wouldn't even bother with the KVM special case.  Containers can use
> seccomp if they want a fine-grained policy.
>
> (Actually I wouldn't bother with the knob at all; the attack surface of
> userfaultfd is infinitesimal compared to the BPF JIT...).

please name _one_ BPF JIT bug that affected unprivileged user space.
Paolo Bonzini March 14, 2019, 4 p.m. UTC | #17
On 14/03/19 16:23, Alexei Starovoitov wrote:
> On Thu, Mar 14, 2019 at 4:00 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> On 14/03/19 00:44, Andrea Arcangeli wrote:
>>> Then I thought we can add a tristate so an open of /dev/kvm would also
>>> allow the syscall to make things more user friendly because
>>> unprivileged containers ideally should have writable mounts done with
>>> nodev and no matter the privilege they shouldn't ever get an hold on
>>> the KVM driver (and those who do, like kubevirt, will then just work).
>>
>> I wouldn't even bother with the KVM special case.  Containers can use
>> seccomp if they want a fine-grained policy.
>>
>> (Actually I wouldn't bother with the knob at all; the attack surface of
>> userfaultfd is infinitesimal compared to the BPF JIT...).
> 
> please name _one_ BPF JIT bug that affected unprivileged user space.

I didn't say there were any bugs, I talked about attack surface.  The
potential impact would obviously be much bigger and, even leaving the
JIT aside, the userspace API is much more complex.

All this is just about paranoia, not about past experience.

Paolo
Andrea Arcangeli March 14, 2019, 4:16 p.m. UTC | #18
On Thu, Mar 14, 2019 at 11:58:15AM +0100, Paolo Bonzini wrote:
> On 14/03/19 00:44, Andrea Arcangeli wrote:
> > Then I thought we can add a tristate so an open of /dev/kvm would also
> > allow the syscall to make things more user friendly because
> > unprivileged containers ideally should have writable mounts done with
> > nodev and no matter the privilege they shouldn't ever get an hold on
> > the KVM driver (and those who do, like kubevirt, will then just work).
> 
> I wouldn't even bother with the KVM special case.  Containers can use
> seccomp if they want a fine-grained policy.

We can have a single boolean 0|1 and stick to a simpler sysctl and no
gid and if you want to use userfaultfd you need to enable it for all
users. I agree seccomp already provides more than enough granularity
to do more finegrined choices.

So this will be for who's paranoid and prefers to disable userfaultfd
as a whole as an hardening feature like the bpf sysctl allows: it will
allow to block uffd syscall without having to rebuild the kernel with
CONFIG_USERFAULTFD=n in environments where seccomp cannot be easily
enabled (i.e. without requiring userland changes).

That's very fine with me, but then it wasn't me complaining in the
first place. Kees?

If the above is ok, we can implement it as a static key, not that the
syscall itself is particularly performance critical but it'll be
simple enough as a boolean (only the ioctl are performance critical
but those are unaffected).

The blog post about UAF is not particularly interesting in my view,
unless both of the following points are true 1) it can be also proven
that the very same two UAF bugs, cannot be exploited by other means
(as far as I can tell it can be exploited by other means regardless of
userfaultfd) and 2) the slab randomization was actually enabled (99%
of the time in all POC all randomization features like kalsr are
incidentally disabled first to facilitate publishing papers and blog
posts, but those are really the features intended to reduce the
reproduciblity of exploits against UAF bugs, not disabling userfaultfd
which only provides a minor advantage, and unlike in PoC environments,
we enable those slab randomization in production 100% of the time
whenever they're available in the kernel).

Thanks,
Andrea
Peter Xu March 15, 2019, 8:26 a.m. UTC | #19
On Wed, Mar 13, 2019 at 10:50:48AM -0700, Mike Kravetz wrote:
> On 3/12/19 11:00 PM, Peter Xu wrote:
> > On Tue, Mar 12, 2019 at 12:59:34PM -0700, Mike Kravetz wrote:
> >> On 3/11/19 2:36 AM, Peter Xu wrote:
> >>>
> >>> The "kvm" entry is a bit special here only to make sure that existing
> >>> users like QEMU/KVM won't break by this newly introduced flag.  What
> >>> we need to do is simply set the "unprivileged_userfaultfd" flag to
> >>> "kvm" here to automatically grant userfaultfd permission for processes
> >>> like QEMU/KVM without extra code to tweak these flags in the admin
> >>> code.
> >>
> >> Another user is Oracle DB, specifically with hugetlbfs.  For them, we would
> >> like to add a special case like kvm described above.  The admin controls
> >> who can have access to hugetlbfs, so I think adding code to the open
> >> routine as in patch 2 of this series would seem to work.
> > 
> > Yes I think if there's an explicit and safe place we can hook for
> > hugetlbfs then we can do the similar trick as KVM case.  Though I
> > noticed that we can not only create hugetlbfs files under the
> > mountpoint (which the admin can control), but also using some other
> > ways.  The question (of me... sorry if it's a silly one!) is whether
> > all other ways to use hugetlbfs is still under control of the admin.
> > One I know of is memfd_create() which seems to be doable even as
> > unprivileged users.  If so, should we only limit the uffd privilege to
> > those hugetlbfs users who use the mountpoint directly?
> 
> Wow!  I did not realize that apps which specify mmap(MAP_HUGETLB) do not
> need any special privilege to use huge pages.  Honestly, I am not sure if
> that was by design or a bug.  The memfd_create code is based on the MAP_HUGETLB
> code and also does not need any special privilege.  Not to sidetrack this
> discussion, but people on Cc may know if this is a bug or by design.  My
> opinion is that huge pages are a limited resource and should be under control.
> One needs to be a member of a special group (or root) to access via System V
> interfaces.

Yeah I completely agree that huge pages should need some special
care...

> 
> The DB use case only does mmap of files in an explicitly mounted filesystem.
> So, limiting it in that manner would work for them.
> 
> > Another question is about fork() of privileged processes - for KVM we
> > only grant privilege for the exact process that opened the /dev/kvm
> > node, and the privilege will be lost for any forked childrens.  Is
> > that the same thing for OracleDB/Hugetlbfs?
> 
> I need to confirm with the DB people, but it is my understanding that the
> exact process which does the open/mmap will be the one using userfaultfd.

It'll be nice if these can be confirmed and if above proposal could
still be an alternative for us (grant privilege for processes who do
mknod() upon the hugetlbfs mountpoint; drop privilege when fork as
usual), since IMHO it is still the simplest approach comparing to what
we've discussed in the other threads...

Thanks,
Kees Cook March 15, 2019, 4:09 p.m. UTC | #20
On Thu, Mar 14, 2019 at 9:16 AM Andrea Arcangeli <aarcange@redhat.com> wrote:
> So this will be for who's paranoid and prefers to disable userfaultfd
> as a whole as an hardening feature like the bpf sysctl allows: it will
> allow to block uffd syscall without having to rebuild the kernel with
> CONFIG_USERFAULTFD=n in environments where seccomp cannot be easily
> enabled (i.e. without requiring userland changes).
>
> That's very fine with me, but then it wasn't me complaining in the
> first place. Kees?

I'm fine with a boolean. I just wanted to find a way to disable at
runtime (so distro users had it available to them).