Message ID | 20190311093701.15734-1-peterx@redhat.com (mailing list archive) |
---|---|
Headers | show |
Series | userfaultfd: allow to forbid unprivileged users | expand |
Hi Peter, On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote: > Hi, > > (The idea comes from Andrea, and following discussions with Mike and > other people) > > This patchset introduces a new sysctl flag to allow the admin to > forbid users from using userfaultfd: > > $ cat /proc/sys/vm/unprivileged_userfaultfd > [disabled] enabled kvm > > - When set to "disabled", all unprivileged users are forbidden to > use userfaultfd syscalls. > > - When set to "enabled", all users are allowed to use userfaultfd > syscalls. > > - When set to "kvm", all unprivileged users are forbidden to use the > userfaultfd syscalls, except the user who has permission to open > /dev/kvm. > > This new flag can add one more layer of security to reduce the attack > surface of the kernel by abusing userfaultfd. Here we grant the > thread userfaultfd permission by checking against CAP_SYS_PTRACE > capability. By default, the value is "disabled" which is the most > strict policy. Distributions can have their own perferred value. > > The "kvm" entry is a bit special here only to make sure that existing > users like QEMU/KVM won't break by this newly introduced flag. What > we need to do is simply set the "unprivileged_userfaultfd" flag to > "kvm" here to automatically grant userfaultfd permission for processes > like QEMU/KVM without extra code to tweak these flags in the admin > code. > > Patch 1: The interface patch to introduce the flag > > Patch 2: The KVM related changes to detect opening of /dev/kvm > > Patch 3: Apply the flag to userfaultfd syscalls I'd appreciate to see "Patch 4: documentation update" ;-) It'd be also great to update the man pages after this is merged. Except for the comment to patch 1, feel free to add Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> > All comments would be greatly welcomed. Thanks, > > Peter Xu (3): > userfaultfd/sysctl: introduce unprivileged_userfaultfd > kvm/mm: introduce MMF_USERFAULTFD_ALLOW flag > userfaultfd: apply unprivileged_userfaultfd check > > fs/userfaultfd.c | 121 +++++++++++++++++++++++++++++++++ > include/linux/sched/coredump.h | 1 + > include/linux/userfaultfd_k.h | 5 ++ > init/Kconfig | 11 +++ > kernel/sysctl.c | 11 +++ > virt/kvm/kvm_main.c | 7 ++ > 6 files changed, 156 insertions(+) > > -- > 2.17.1 >
On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote: > Hi, > > (The idea comes from Andrea, and following discussions with Mike and > other people) > > This patchset introduces a new sysctl flag to allow the admin to > forbid users from using userfaultfd: > > $ cat /proc/sys/vm/unprivileged_userfaultfd > [disabled] enabled kvm CC linux-api@ This is unusual way to return current value for sysctl. Does it work fine with sysctl tool? Have you considered to place the switch into /sys/kernel/mm instead? I doubt it's the last tunable for userfaultfd. Maybe we should have an directory for it under /sys/kernel/mm? > - When set to "disabled", all unprivileged users are forbidden to > use userfaultfd syscalls. > > - When set to "enabled", all users are allowed to use userfaultfd > syscalls. > > - When set to "kvm", all unprivileged users are forbidden to use the > userfaultfd syscalls, except the user who has permission to open > /dev/kvm. > > This new flag can add one more layer of security to reduce the attack > surface of the kernel by abusing userfaultfd. Here we grant the > thread userfaultfd permission by checking against CAP_SYS_PTRACE > capability. By default, the value is "disabled" which is the most > strict policy. Distributions can have their own perferred value. > > The "kvm" entry is a bit special here only to make sure that existing > users like QEMU/KVM won't break by this newly introduced flag. What > we need to do is simply set the "unprivileged_userfaultfd" flag to > "kvm" here to automatically grant userfaultfd permission for processes > like QEMU/KVM without extra code to tweak these flags in the admin > code. > > Patch 1: The interface patch to introduce the flag > > Patch 2: The KVM related changes to detect opening of /dev/kvm > > Patch 3: Apply the flag to userfaultfd syscalls > > All comments would be greatly welcomed. Thanks, > > Peter Xu (3): > userfaultfd/sysctl: introduce unprivileged_userfaultfd > kvm/mm: introduce MMF_USERFAULTFD_ALLOW flag > userfaultfd: apply unprivileged_userfaultfd check > > fs/userfaultfd.c | 121 +++++++++++++++++++++++++++++++++ > include/linux/sched/coredump.h | 1 + > include/linux/userfaultfd_k.h | 5 ++ > init/Kconfig | 11 +++ > kernel/sysctl.c | 11 +++ > virt/kvm/kvm_main.c | 7 ++ > 6 files changed, 156 insertions(+) > > -- > 2.17.1 >
On Tue, Mar 12, 2019 at 09:01:47AM +0200, Mike Rapoport wrote: > Hi Peter, > > On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote: > > Hi, > > > > (The idea comes from Andrea, and following discussions with Mike and > > other people) > > > > This patchset introduces a new sysctl flag to allow the admin to > > forbid users from using userfaultfd: > > > > $ cat /proc/sys/vm/unprivileged_userfaultfd > > [disabled] enabled kvm > > > > - When set to "disabled", all unprivileged users are forbidden to > > use userfaultfd syscalls. > > > > - When set to "enabled", all users are allowed to use userfaultfd > > syscalls. > > > > - When set to "kvm", all unprivileged users are forbidden to use the > > userfaultfd syscalls, except the user who has permission to open > > /dev/kvm. > > > > This new flag can add one more layer of security to reduce the attack > > surface of the kernel by abusing userfaultfd. Here we grant the > > thread userfaultfd permission by checking against CAP_SYS_PTRACE > > capability. By default, the value is "disabled" which is the most > > strict policy. Distributions can have their own perferred value. > > > > The "kvm" entry is a bit special here only to make sure that existing > > users like QEMU/KVM won't break by this newly introduced flag. What > > we need to do is simply set the "unprivileged_userfaultfd" flag to > > "kvm" here to automatically grant userfaultfd permission for processes > > like QEMU/KVM without extra code to tweak these flags in the admin > > code. > > > > Patch 1: The interface patch to introduce the flag > > > > Patch 2: The KVM related changes to detect opening of /dev/kvm > > > > Patch 3: Apply the flag to userfaultfd syscalls > > I'd appreciate to see "Patch 4: documentation update" ;-) > It'd be also great to update the man pages after this is merged. Oops, sorry! I should have remembered that. > > Except for the comment to patch 1, feel free to add > > Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Thanks Mike! I'll take it for 2/3 until I got confirmation from you on patch 1. Regards,
Hi, Kirill, On Tue, Mar 12, 2019 at 10:49:51AM +0300, Kirill A. Shutemov wrote: > On Mon, Mar 11, 2019 at 05:36:58PM +0800, Peter Xu wrote: > > Hi, > > > > (The idea comes from Andrea, and following discussions with Mike and > > other people) > > > > This patchset introduces a new sysctl flag to allow the admin to > > forbid users from using userfaultfd: > > > > $ cat /proc/sys/vm/unprivileged_userfaultfd > > [disabled] enabled kvm > > CC linux-api@ > > This is unusual way to return current value for sysctl. Does it work fine > with sysctl tool? It can work, though it displays the same as "cat": $ sysctl vm.unprivileged_userfaultfd vm.unprivileged_userfaultfd = disabled enabled [kvm] > > Have you considered to place the switch into /sys/kernel/mm instead? > I doubt it's the last tunable for userfaultfd. Maybe we should have an > directory for it under /sys/kernel/mm? I haven't thought about sysfs, if that's preferred I can consider to switch to that. And yes I think creating a directory should be a good idea. Thanks,
On 3/11/19 2:36 AM, Peter Xu wrote: > > The "kvm" entry is a bit special here only to make sure that existing > users like QEMU/KVM won't break by this newly introduced flag. What > we need to do is simply set the "unprivileged_userfaultfd" flag to > "kvm" here to automatically grant userfaultfd permission for processes > like QEMU/KVM without extra code to tweak these flags in the admin > code. Another user is Oracle DB, specifically with hugetlbfs. For them, we would like to add a special case like kvm described above. The admin controls who can have access to hugetlbfs, so I think adding code to the open routine as in patch 2 of this series would seem to work. However, I can imagine more special cases being added for other users. And, once you have more than one special case then you may want to combine them. For example, kvm and hugetlbfs together.
On Tue, Mar 12, 2019 at 12:59:34PM -0700, Mike Kravetz wrote: > On 3/11/19 2:36 AM, Peter Xu wrote: > > > > The "kvm" entry is a bit special here only to make sure that existing > > users like QEMU/KVM won't break by this newly introduced flag. What > > we need to do is simply set the "unprivileged_userfaultfd" flag to > > "kvm" here to automatically grant userfaultfd permission for processes > > like QEMU/KVM without extra code to tweak these flags in the admin > > code. > > Another user is Oracle DB, specifically with hugetlbfs. For them, we would > like to add a special case like kvm described above. The admin controls > who can have access to hugetlbfs, so I think adding code to the open > routine as in patch 2 of this series would seem to work. Yes I think if there's an explicit and safe place we can hook for hugetlbfs then we can do the similar trick as KVM case. Though I noticed that we can not only create hugetlbfs files under the mountpoint (which the admin can control), but also using some other ways. The question (of me... sorry if it's a silly one!) is whether all other ways to use hugetlbfs is still under control of the admin. One I know of is memfd_create() which seems to be doable even as unprivileged users. If so, should we only limit the uffd privilege to those hugetlbfs users who use the mountpoint directly? Another question is about fork() of privileged processes - for KVM we only grant privilege for the exact process that opened the /dev/kvm node, and the privilege will be lost for any forked childrens. Is that the same thing for OracleDB/Hugetlbfs? > > However, I can imagine more special cases being added for other users. And, > once you have more than one special case then you may want to combine them. > For example, kvm and hugetlbfs together. It looks fine to me if we're using MMF_USERFAULTFD_ALLOW flag upon mm_struct, since that seems to be a very general flag that can be used by anything we want to grant privilege for, not only KVM? Thanks,
On 13/03/19 07:00, Peter Xu wrote: >> However, I can imagine more special cases being added for other users. And, >> once you have more than one special case then you may want to combine them. >> For example, kvm and hugetlbfs together. > It looks fine to me if we're using MMF_USERFAULTFD_ALLOW flag upon > mm_struct, since that seems to be a very general flag that can be used > by anything we want to grant privilege for, not only KVM? Perhaps you can remove the fork() limitation, and add a new suboption to prctl(PR_SET_MM) that sets/resets MMF_USERFAULTFD_ALLOW. If somebody wants to forbid unprivileged userfaultfd and use KVM, they'll have to use libvirt or some other privileged management tool. We could also add support for this prctl to systemd, and then one could do "systemd-run -pAllowUserfaultfd=yes COMMAND". Paolo
On 3/12/19 11:00 PM, Peter Xu wrote: > On Tue, Mar 12, 2019 at 12:59:34PM -0700, Mike Kravetz wrote: >> On 3/11/19 2:36 AM, Peter Xu wrote: >>> >>> The "kvm" entry is a bit special here only to make sure that existing >>> users like QEMU/KVM won't break by this newly introduced flag. What >>> we need to do is simply set the "unprivileged_userfaultfd" flag to >>> "kvm" here to automatically grant userfaultfd permission for processes >>> like QEMU/KVM without extra code to tweak these flags in the admin >>> code. >> >> Another user is Oracle DB, specifically with hugetlbfs. For them, we would >> like to add a special case like kvm described above. The admin controls >> who can have access to hugetlbfs, so I think adding code to the open >> routine as in patch 2 of this series would seem to work. > > Yes I think if there's an explicit and safe place we can hook for > hugetlbfs then we can do the similar trick as KVM case. Though I > noticed that we can not only create hugetlbfs files under the > mountpoint (which the admin can control), but also using some other > ways. The question (of me... sorry if it's a silly one!) is whether > all other ways to use hugetlbfs is still under control of the admin. > One I know of is memfd_create() which seems to be doable even as > unprivileged users. If so, should we only limit the uffd privilege to > those hugetlbfs users who use the mountpoint directly? Wow! I did not realize that apps which specify mmap(MAP_HUGETLB) do not need any special privilege to use huge pages. Honestly, I am not sure if that was by design or a bug. The memfd_create code is based on the MAP_HUGETLB code and also does not need any special privilege. Not to sidetrack this discussion, but people on Cc may know if this is a bug or by design. My opinion is that huge pages are a limited resource and should be under control. One needs to be a member of a special group (or root) to access via System V interfaces. The DB use case only does mmap of files in an explicitly mounted filesystem. So, limiting it in that manner would work for them. > Another question is about fork() of privileged processes - for KVM we > only grant privilege for the exact process that opened the /dev/kvm > node, and the privilege will be lost for any forked childrens. Is > that the same thing for OracleDB/Hugetlbfs? I need to confirm with the DB people, but it is my understanding that the exact process which does the open/mmap will be the one using userfaultfd.
Hello, On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote: > On 13/03/19 07:00, Peter Xu wrote: > >> However, I can imagine more special cases being added for other users. And, > >> once you have more than one special case then you may want to combine them. > >> For example, kvm and hugetlbfs together. > > It looks fine to me if we're using MMF_USERFAULTFD_ALLOW flag upon > > mm_struct, since that seems to be a very general flag that can be used > > by anything we want to grant privilege for, not only KVM? > > Perhaps you can remove the fork() limitation, and add a new suboption to > prctl(PR_SET_MM) that sets/resets MMF_USERFAULTFD_ALLOW. If somebody > wants to forbid unprivileged userfaultfd and use KVM, they'll have to > use libvirt or some other privileged management tool. > > We could also add support for this prctl to systemd, and then one could > do "systemd-run -pAllowUserfaultfd=yes COMMAND". systemd can already implement -pAllowUserfaultfd=no with seccomp if it wants. It can also implement -yes if by default turns off userfaultfd like firejail -seccomp would do. If the end goal is to implement the filtering with an userland policy instead of a kernel policy, seccomp enabled for all services sounds reasonable. It's very unlikely you'll block only userfaultfd, firejail -seccomp by default blocks dozen of syscalls that are unnecessary 99.9% of the time. This is not about implementing an userland flexible policy, it's just a simple kernel policy, to use until userland disables the kernel policy to takeover with seccomp across the board. I wouldn't like this too be too complicated because this is already theoretically overlapping 100% with seccomp. hugetlbfs is more complicated to detect, because even if you inherit it from fork(), the services that mounts the fs may be in a different container than the one that Oracle that uses userfaultfd later on down the road from a different context. And I don't think it would be ok to allow running userfaultfd just because you can open a file in an hugetlbfs file system. With /dev/kvm it's a bit different, that's chmod o-r by default.. no luser should be able to open it. Unless somebody suggests a consistent way to make hugetlbfs "just work" (like we could achieve clean with CRIU and KVM), I think Oracle will need a one liner change in the Oracle setup to echo into that file in addition of running the hugetlbfs mount. Note that DPDK host bridge process will also need a one liner change to do a dummy open/close of /dev/kvm to unblock the syscall. Thanks, Andrea
> On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote: > Unless somebody suggests a consistent way to make hugetlbfs "just > work" (like we could achieve clean with CRIU and KVM), I think Oracle > will need a one liner change in the Oracle setup to echo into that > file in addition of running the hugetlbfs mount. Hi Andrea, can you explain more in detail the risks of enabling userfaultfd for unprivileged users? Paolo > Note that DPDK host bridge process will also need a one liner change > to do a dummy open/close of /dev/kvm to unblock the syscall. > > Thanks, > Andrea >
On 3/13/19 11:52 AM, Andrea Arcangeli wrote: > > hugetlbfs is more complicated to detect, because even if you inherit > it from fork(), the services that mounts the fs may be in a different > container than the one that Oracle that uses userfaultfd later on down > the road from a different context. And I don't think it would be ok to > allow running userfaultfd just because you can open a file in an > hugetlbfs file system. With /dev/kvm it's a bit different, that's > chmod o-r by default.. no luser should be able to open it. > > Unless somebody suggests a consistent way to make hugetlbfs "just > work" (like we could achieve clean with CRIU and KVM), I think Oracle > will need a one liner change in the Oracle setup to echo into that > file in addition of running the hugetlbfs mount. I think you are suggesting the DB setup process enable uffd for all users. Correct? This may be too simple, and I don't really like group access, but how about just defining a uffd group? If you are in the group you can make uffd system calls.
Hi Paolo, On Wed, Mar 13, 2019 at 03:12:28PM -0400, Paolo Bonzini wrote: > > > On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote: > > Unless somebody suggests a consistent way to make hugetlbfs "just > > work" (like we could achieve clean with CRIU and KVM), I think Oracle > > will need a one liner change in the Oracle setup to echo into that > > file in addition of running the hugetlbfs mount. > > Hi Andrea, can you explain more in detail the risks of enabling > userfaultfd for unprivileged users? There's no more risk than in allowing mremap (direct memory overwrite after underflow) or O_DIRECT (dirtycow) for unprivileged users. If there was risk in allowing userfaultfd for unprivileged users, CAP_SYS_PTRACE would be mandatory and there wouldn't be an option to allow userfaultfd to all unprivileged users. This is just an hardening policy in kernel (for those that don't run everything under seccomp) that may even be removed later. Unlike mremap and O_DIRECT, because we've only an handful of (important) applications using userfaultfd so far, we can do like the bpf syscall: SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr = {}; int err; if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN)) return -EPERM; Except we picked CAP_SYS_PTRACE because CRIU already has to run with such capability for other reasons. So this is intended as the "sysctl_unprivileged_bpf_disabled" trick applied to the bpf syscall, also applied to the userfaultfd syscall, nothing more nothing less. Then I thought we can add a tristate so an open of /dev/kvm would also allow the syscall to make things more user friendly because unprivileged containers ideally should have writable mounts done with nodev and no matter the privilege they shouldn't ever get an hold on the KVM driver (and those who do, like kubevirt, will then just work). There has been one complaint because userfaultfd can also be used to facilitate exploiting use after free bugs in other kernel code: https://cyseclabs.com/blog/linux-kernel-heap-spray This isn't a bug in userfaultfd, the only bug there is some kernel code doing an use after free in a reproducible place, userfaultfd only allows to stop copy-user where the exploits like copy-user to be stopped. This isn't particularly concerning, because you can achieve the same objective with FUSE. In fact even if you set CONFIG_USERFAULTFD=n in the kernel config and CONFIG_FUSE_FS=n, a security bug like that can still be exploited eventually. It's just less reproducible if you can't stop copy-user. Restricting userfaultfd to privileged processes, won't make such kernel bug become harmless, it'll just require more attempts to exploit as far as I can tell. To put things in prospective, the exploits for the most serious security bugs like mremap missing underflow check, dirtycow or the missing stack_guard_gap would not get any facilitation by userfaultfd. I also thought we were randomizing all slab heaps by now to avoid issues like above, I don't know if the randomized slab freelist oreder CONFIG_SLAB_FREELIST_RANDOM and also the pointer with CONFIG_SLAB_FREELIST_HARDENED precisely to avoid the exploits like above. It's not like you can disable those two options even if you set CONFIG_USERFAULTFD=n. I wonder if in that blog post the exploit was set on a kernel with those two options enabled. In any case not allowing non privileged processes to run the userfaultfd syscall will provide some hardening feature also against such kind of concern. Thanks, Andrea
On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote: > On 3/13/19 11:52 AM, Andrea Arcangeli wrote: > > > > hugetlbfs is more complicated to detect, because even if you inherit > > it from fork(), the services that mounts the fs may be in a different > > container than the one that Oracle that uses userfaultfd later on down > > the road from a different context. And I don't think it would be ok to > > allow running userfaultfd just because you can open a file in an > > hugetlbfs file system. With /dev/kvm it's a bit different, that's > > chmod o-r by default.. no luser should be able to open it. > > > > Unless somebody suggests a consistent way to make hugetlbfs "just > > work" (like we could achieve clean with CRIU and KVM), I think Oracle > > will need a one liner change in the Oracle setup to echo into that > > file in addition of running the hugetlbfs mount. > > I think you are suggesting the DB setup process enable uffd for all users. > Correct? Yes. In addition of the hugetlbfs setup, various apps requires to also increase fs.inotify.max_user_watches or file-max and other tweaks, this would be one of those tweaks. > This may be too simple, and I don't really like group access, but how about > just defining a uffd group? If you are in the group you can make uffd > system calls. Everything is possible, I'm just afraid it gets too complex. So you suggest to echo a gid into the file?
On 3/13/19 4:55 PM, Andrea Arcangeli wrote: > On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote: >> On 3/13/19 11:52 AM, Andrea Arcangeli wrote: >>> Unless somebody suggests a consistent way to make hugetlbfs "just >>> work" (like we could achieve clean with CRIU and KVM), I think Oracle >>> will need a one liner change in the Oracle setup to echo into that >>> file in addition of running the hugetlbfs mount. >> >> I think you are suggesting the DB setup process enable uffd for all users. >> Correct? > > Yes. In addition of the hugetlbfs setup, various apps requires to also > increase fs.inotify.max_user_watches or file-max and other tweaks, > this would be one of those tweaks. Yes, I agree. It is just that unprivileged_userfaultfd disabled would likely to be the default set by distros. Or, perhaps 'kvm'? Then, if you want to run the DB, the admin (or the DB) will need to set it to enabled. And, this results in it being enabled for everyone. I think I understand the scope of any security exposure this would cause from a technical perspective. However, I can imagine people being concerned about enabling everywhere if this is not the default setting. If it is OK to disable everywhere, why not just use disable for the kvm use case as well? :) >> This may be too simple, and I don't really like group access, but how about >> just defining a uffd group? If you are in the group you can make uffd >> system calls. > > Everything is possible, I'm just afraid it gets too complex. > > So you suggest to echo a gid into the file? That is what I was thinking. But, I was mostly thinking of that because Peter's earlier comment made me go and check hugetlbfs code. There is a sysctl_hugetlb_shm_group variable that does this, even though it is mostly unused in the hugetlbfs code. I know the kvm dev open scheme works for kvm. Was just trying to think of something more general that would work for hugetlbfs/DB or other use cases.
On 14/03/19 00:44, Andrea Arcangeli wrote: > Then I thought we can add a tristate so an open of /dev/kvm would also > allow the syscall to make things more user friendly because > unprivileged containers ideally should have writable mounts done with > nodev and no matter the privilege they shouldn't ever get an hold on > the KVM driver (and those who do, like kubevirt, will then just work). I wouldn't even bother with the KVM special case. Containers can use seccomp if they want a fine-grained policy. (Actually I wouldn't bother with the knob at all; the attack surface of userfaultfd is infinitesimal compared to the BPF JIT...). Paolo
On Thu, Mar 14, 2019 at 4:00 AM Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 14/03/19 00:44, Andrea Arcangeli wrote: > > Then I thought we can add a tristate so an open of /dev/kvm would also > > allow the syscall to make things more user friendly because > > unprivileged containers ideally should have writable mounts done with > > nodev and no matter the privilege they shouldn't ever get an hold on > > the KVM driver (and those who do, like kubevirt, will then just work). > > I wouldn't even bother with the KVM special case. Containers can use > seccomp if they want a fine-grained policy. > > (Actually I wouldn't bother with the knob at all; the attack surface of > userfaultfd is infinitesimal compared to the BPF JIT...). please name _one_ BPF JIT bug that affected unprivileged user space.
On 14/03/19 16:23, Alexei Starovoitov wrote: > On Thu, Mar 14, 2019 at 4:00 AM Paolo Bonzini <pbonzini@redhat.com> wrote: >> >> On 14/03/19 00:44, Andrea Arcangeli wrote: >>> Then I thought we can add a tristate so an open of /dev/kvm would also >>> allow the syscall to make things more user friendly because >>> unprivileged containers ideally should have writable mounts done with >>> nodev and no matter the privilege they shouldn't ever get an hold on >>> the KVM driver (and those who do, like kubevirt, will then just work). >> >> I wouldn't even bother with the KVM special case. Containers can use >> seccomp if they want a fine-grained policy. >> >> (Actually I wouldn't bother with the knob at all; the attack surface of >> userfaultfd is infinitesimal compared to the BPF JIT...). > > please name _one_ BPF JIT bug that affected unprivileged user space. I didn't say there were any bugs, I talked about attack surface. The potential impact would obviously be much bigger and, even leaving the JIT aside, the userspace API is much more complex. All this is just about paranoia, not about past experience. Paolo
On Thu, Mar 14, 2019 at 11:58:15AM +0100, Paolo Bonzini wrote: > On 14/03/19 00:44, Andrea Arcangeli wrote: > > Then I thought we can add a tristate so an open of /dev/kvm would also > > allow the syscall to make things more user friendly because > > unprivileged containers ideally should have writable mounts done with > > nodev and no matter the privilege they shouldn't ever get an hold on > > the KVM driver (and those who do, like kubevirt, will then just work). > > I wouldn't even bother with the KVM special case. Containers can use > seccomp if they want a fine-grained policy. We can have a single boolean 0|1 and stick to a simpler sysctl and no gid and if you want to use userfaultfd you need to enable it for all users. I agree seccomp already provides more than enough granularity to do more finegrined choices. So this will be for who's paranoid and prefers to disable userfaultfd as a whole as an hardening feature like the bpf sysctl allows: it will allow to block uffd syscall without having to rebuild the kernel with CONFIG_USERFAULTFD=n in environments where seccomp cannot be easily enabled (i.e. without requiring userland changes). That's very fine with me, but then it wasn't me complaining in the first place. Kees? If the above is ok, we can implement it as a static key, not that the syscall itself is particularly performance critical but it'll be simple enough as a boolean (only the ioctl are performance critical but those are unaffected). The blog post about UAF is not particularly interesting in my view, unless both of the following points are true 1) it can be also proven that the very same two UAF bugs, cannot be exploited by other means (as far as I can tell it can be exploited by other means regardless of userfaultfd) and 2) the slab randomization was actually enabled (99% of the time in all POC all randomization features like kalsr are incidentally disabled first to facilitate publishing papers and blog posts, but those are really the features intended to reduce the reproduciblity of exploits against UAF bugs, not disabling userfaultfd which only provides a minor advantage, and unlike in PoC environments, we enable those slab randomization in production 100% of the time whenever they're available in the kernel). Thanks, Andrea
On Wed, Mar 13, 2019 at 10:50:48AM -0700, Mike Kravetz wrote: > On 3/12/19 11:00 PM, Peter Xu wrote: > > On Tue, Mar 12, 2019 at 12:59:34PM -0700, Mike Kravetz wrote: > >> On 3/11/19 2:36 AM, Peter Xu wrote: > >>> > >>> The "kvm" entry is a bit special here only to make sure that existing > >>> users like QEMU/KVM won't break by this newly introduced flag. What > >>> we need to do is simply set the "unprivileged_userfaultfd" flag to > >>> "kvm" here to automatically grant userfaultfd permission for processes > >>> like QEMU/KVM without extra code to tweak these flags in the admin > >>> code. > >> > >> Another user is Oracle DB, specifically with hugetlbfs. For them, we would > >> like to add a special case like kvm described above. The admin controls > >> who can have access to hugetlbfs, so I think adding code to the open > >> routine as in patch 2 of this series would seem to work. > > > > Yes I think if there's an explicit and safe place we can hook for > > hugetlbfs then we can do the similar trick as KVM case. Though I > > noticed that we can not only create hugetlbfs files under the > > mountpoint (which the admin can control), but also using some other > > ways. The question (of me... sorry if it's a silly one!) is whether > > all other ways to use hugetlbfs is still under control of the admin. > > One I know of is memfd_create() which seems to be doable even as > > unprivileged users. If so, should we only limit the uffd privilege to > > those hugetlbfs users who use the mountpoint directly? > > Wow! I did not realize that apps which specify mmap(MAP_HUGETLB) do not > need any special privilege to use huge pages. Honestly, I am not sure if > that was by design or a bug. The memfd_create code is based on the MAP_HUGETLB > code and also does not need any special privilege. Not to sidetrack this > discussion, but people on Cc may know if this is a bug or by design. My > opinion is that huge pages are a limited resource and should be under control. > One needs to be a member of a special group (or root) to access via System V > interfaces. Yeah I completely agree that huge pages should need some special care... > > The DB use case only does mmap of files in an explicitly mounted filesystem. > So, limiting it in that manner would work for them. > > > Another question is about fork() of privileged processes - for KVM we > > only grant privilege for the exact process that opened the /dev/kvm > > node, and the privilege will be lost for any forked childrens. Is > > that the same thing for OracleDB/Hugetlbfs? > > I need to confirm with the DB people, but it is my understanding that the > exact process which does the open/mmap will be the one using userfaultfd. It'll be nice if these can be confirmed and if above proposal could still be an alternative for us (grant privilege for processes who do mknod() upon the hugetlbfs mountpoint; drop privilege when fork as usual), since IMHO it is still the simplest approach comparing to what we've discussed in the other threads... Thanks,
On Thu, Mar 14, 2019 at 9:16 AM Andrea Arcangeli <aarcange@redhat.com> wrote: > So this will be for who's paranoid and prefers to disable userfaultfd > as a whole as an hardening feature like the bpf sysctl allows: it will > allow to block uffd syscall without having to rebuild the kernel with > CONFIG_USERFAULTFD=n in environments where seccomp cannot be easily > enabled (i.e. without requiring userland changes). > > That's very fine with me, but then it wasn't me complaining in the > first place. Kees? I'm fine with a boolean. I just wanted to find a way to disable at runtime (so distro users had it available to them).