Message ID | 20200924065606.3351177-1-lokeshgidra@google.com (mailing list archive) |
---|---|
Headers | show |
Series | Control over userfaultfd kernel-fault handling | expand |
On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote: > > This patch series is split from [1]. The other series enables SELinux > support for userfaultfd file descriptors so that its creation and > movement can be controlled. > > It has been demonstrated on various occasions that suspending kernel > code execution for an arbitrary amount of time at any access to > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited > to change the intended behavior of the kernel. For instance, handling > page faults in kernel-mode using userfaultfd has been exploited in [2, 3]. > Likewise, FUSE, which is similar to userfaultfd in this respect, has been > exploited in [4, 5] for similar outcome. > > This small patch series adds a new flag to userfaultfd(2) that allows > callers to give up the ability to handle kernel-mode faults with the > resulting UFFD file object. It then adds a 'user-mode only' option to > the unprivileged_userfaultfd sysctl knob to require unprivileged > callers to use this new flag. > > The purpose of this new interface is to decrease the chance of an > unprivileged userfaultfd user taking advantage of userfaultfd to > enhance security vulnerabilities by lengthening the race window in > kernel code. > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/ > [2] https://duasynt.com/blog/linux-kernel-heap-spray > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit > [4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html > [5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808 > > Changes since v3: > > - Modified the meaning of value '0' of unprivileged_userfaultfd > sysctl knob. Setting this knob to '0' now allows unprivileged users > to use userfaultfd, but can handle page faults in user-mode only. > - The default value of unprivileged_userfaultfd sysctl knob is changed > to '0'. > Request reviewers and maintainers to please take a look. > Changes since v2: > > - Removed 'uffd_flags' and directly used 'UFFD_USER_MODE_ONLY' in > userfaultfd(). > > Changes since v1: > > - Added external references to the threats from allowing unprivileged > users to handle page faults from kernel-mode. > - Removed the new sysctl knob restricting handling of page > faults from kernel-mode, and added an option for the same > in the existing 'unprivileged_userfaultfd' knob. > > Lokesh Gidra (2): > Add UFFD_USER_MODE_ONLY > Add user-mode only option to unprivileged_userfaultfd sysctl knob > > Documentation/admin-guide/sysctl/vm.rst | 15 ++++++++++----- > fs/userfaultfd.c | 12 +++++++++--- > include/uapi/linux/userfaultfd.h | 9 +++++++++ > 3 files changed, 28 insertions(+), 8 deletions(-) > > -- > 2.28.0.681.g6f77f65b4e-goog >
Hello Lokesh, On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote: > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > This patch series is split from [1]. The other series enables SELinux > > support for userfaultfd file descriptors so that its creation and > > movement can be controlled. > > > > It has been demonstrated on various occasions that suspending kernel > > code execution for an arbitrary amount of time at any access to > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited > > to change the intended behavior of the kernel. For instance, handling > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3]. > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been > > exploited in [4, 5] for similar outcome. > > > > This small patch series adds a new flag to userfaultfd(2) that allows > > callers to give up the ability to handle kernel-mode faults with the > > resulting UFFD file object. It then adds a 'user-mode only' option to > > the unprivileged_userfaultfd sysctl knob to require unprivileged > > callers to use this new flag. > > > > The purpose of this new interface is to decrease the chance of an > > unprivileged userfaultfd user taking advantage of userfaultfd to > > enhance security vulnerabilities by lengthening the race window in > > kernel code. > > > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/ > > [2] https://duasynt.com/blog/linux-kernel-heap-spray > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit I've looking at those links and I've been trying to verify the link [3] is relevant. Specifically I've been trying to verify if 1) current state of the art modern SLUB randomization techniques already enabled in production and rightfully wasting some CPU in all enterprise kernels to prevent things like above to become an issue in practice 2) combined with the fact different memcg need to share the same kmemcaches (which was incidentally fixed a few months ago upstream) and 3) further robustness enhancements against exploits in the slub metadata, may already render the exploit [3] from 2016 irrelevant in practice. So I started by trying to reproduce [3] by building 4.5.1 with a .config with no robustness features and I booted it on fedora-32 or gentoo userland and I cannot even invoke call_usermodehelper. Calling socket(22, AF_INET, 0) won't invoke such function. Can you reproduce on 4.5.1? Which kernel .config should I use to build 4.5.1 in order for call_usermodehelper to be invoked by the exploit? Could you help to verify it? It even has uninitialized variable spawning random perrors so it doesn't give a warm fuzzy feeling: ==== int main(int argc, char **argv) { void *region, *map; ^^^^^ pthread_t uffd_thread; int uffd, msqid, i; region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE, MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0); if (!region) { perror("mmap"); exit(2); } setup_pagefault(region + 0x1000, 0x1000, 1); printf("my pid = %d\n", getpid()); if (!map) { ^^^^^^^^ perror("mmap"); ==== The whole point of being able to reproduce on 4.5.1 is then to simulate if the same exploit would also reproduce on current kernels with all enterprise default robustness features enabled. Or did anybody already verify it? Anyway the links I was playing with are all in the cover letter, the cover letter is not as important as the actual patches. The actual patches looks fine to me. The only improvement I can think of is, what about to add a printk_once to suggest to toggle the sysctl if userfaultfd bails out because the process lacks the CAP_SYS_PTRACE capability? That would facilitate the /etc/sysctl.conf or tuned tweaking in case the apps aren't verbose enough. It's not relevant anymore with this latest patchset, but about the previous argument that seccomp couldn't be used in all Android processes because of performance concern, I'm slightly confused. https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html "Android O includes a single seccomp filter installed into zygote, the process from which all the Android applications are derived. Because the filter is installed into zygote—and therefore all apps—the Android security team took extra caution to not break existing apps" Example: $ uname -mo aarch64 Android $ cat swapoff.c #include <sys/swap.h> int main() { swapoff(""); } $ gcc swapoff.c -o swapoff -O2 $ ./swapoff Bad system call $ It's hard to imagine what is more performance critical than the zygote process and the actual apps as above? It's also hard to imagine what kind of performance concern can arise by adding seccomp filters also to background system apps that generally should consume ~0% of CPU. If performance is really a concern, the BPF JIT representation with the bitmap to be able to run the filter in O(1) sounds a better solution than not adding ad-hoc filters and it's being worked on for x86-64 and can be ported to aarch64 too. Many of the standalone background processes likely wouldn't even use uffd at all so you could block the user initiated faults too that way. Ultimately because of issues as [3] (be them still relevant or not, to be double checked), no matter if through selinux, seccomp or a different sysctl value, without this patchset applied the default behavior of the userfaultfd syscall for all Linux binaries running on Android kernels, would deviate from the upstream kernel. So even if we would make the pipe mutex logic more complex the deviation would remain. Your patchset adds much less risk of breakage than adding a timeout to kernel initiated userfaults and it resolves all concerns as well as a timeout. We'll also make better use of the "0" value this way. So while I'm not certain this is the best for the long term, this looks the sweet spot for the short term to resolve many issues at once. Thanks! Andrea
On Wed, Oct 7, 2020 at 9:01 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > Hello Lokesh, > > On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote: > > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > > > This patch series is split from [1]. The other series enables SELinux > > > support for userfaultfd file descriptors so that its creation and > > > movement can be controlled. > > > > > > It has been demonstrated on various occasions that suspending kernel > > > code execution for an arbitrary amount of time at any access to > > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited > > > to change the intended behavior of the kernel. For instance, handling > > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3]. > > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been > > > exploited in [4, 5] for similar outcome. > > > > > > This small patch series adds a new flag to userfaultfd(2) that allows > > > callers to give up the ability to handle kernel-mode faults with the > > > resulting UFFD file object. It then adds a 'user-mode only' option to > > > the unprivileged_userfaultfd sysctl knob to require unprivileged > > > callers to use this new flag. > > > > > > The purpose of this new interface is to decrease the chance of an > > > unprivileged userfaultfd user taking advantage of userfaultfd to > > > enhance security vulnerabilities by lengthening the race window in > > > kernel code. > > > > > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/ > > > [2] https://duasynt.com/blog/linux-kernel-heap-spray > > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit > > I've looking at those links and I've been trying to verify the link > [3] is relevant. > > Specifically I've been trying to verify if 1) current state of the art > modern SLUB randomization techniques already enabled in production and > rightfully wasting some CPU in all enterprise kernels to prevent > things like above to become an issue in practice 2) combined with the > fact different memcg need to share the same kmemcaches (which was > incidentally fixed a few months ago upstream) and 3) further > robustness enhancements against exploits in the slub metadata, may > already render the exploit [3] from 2016 irrelevant in practice. It's quite possible that some other mitigation was helpful against the technique used by this particular exploit. It's the nature of exploits that they are fragile and will change as new soft mitigations are introduced. The effectiveness of a particular exploit mitigation change is orthogonal to the change presented here. The purpose of this change is to prevent an attacker from suspending kernel code execution and having kernel data structures in a predictable state. This makes it harder for an attacker to "win" race conditions against various kernel data structures. This change compliments other kernel hardening changes such as the changes you've referenced above. Focusing on one particular exploit somewhat misses the point of this change. > > So I started by trying to reproduce [3] by building 4.5.1 with a > .config with no robustness features and I booted it on fedora-32 or > gentoo userland and I cannot even invoke call_usermodehelper. Calling > socket(22, AF_INET, 0) won't invoke such function. Can you reproduce > on 4.5.1? Which kernel .config should I use to build 4.5.1 in order > for call_usermodehelper to be invoked by the exploit? Could you help > to verify it? I haven't tried to verify this myself. I wonder if the usermode hardening changes also impacted this exploit? See https://lkml.org/lkml/2017/1/16/468 But again, focusing on an exploit, which is inherently fragile in nature and dependent on the state of the kernel tree at a particular time, is unlikely to be useful to analyze this patch. > > It even has uninitialized variable spawning random perrors so it > doesn't give a warm fuzzy feeling: > > ==== > int main(int argc, char **argv) { > void *region, *map; > ^^^^^ > pthread_t uffd_thread; > int uffd, msqid, i; > > region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE, > MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0); > > if (!region) { > perror("mmap"); > exit(2); > } > > setup_pagefault(region + 0x1000, 0x1000, 1); > > printf("my pid = %d\n", getpid()); > > if (!map) { > ^^^^^^^^ > perror("mmap"); > ==== > > The whole point of being able to reproduce on 4.5.1 is then to > simulate if the same exploit would also reproduce on current kernels > with all enterprise default robustness features enabled. Or did > anybody already verify it? > > Anyway the links I was playing with are all in the cover letter, the > cover letter is not as important as the actual patches. The actual > patches looks fine to me. That's great to hear. > > The only improvement I can think of is, what about to add a > printk_once to suggest to toggle the sysctl if userfaultfd bails out > because the process lacks the CAP_SYS_PTRACE capability? That would > facilitate the /etc/sysctl.conf or tuned tweaking in case the apps > aren't verbose enough. > > It's not relevant anymore with this latest patchset, but about the > previous argument that seccomp couldn't be used in all Android > processes because of performance concern, I'm slightly confused. Seccomp causes more problems than just performance. Seccomp is not designed for whole-of-system protections. Please see my other writeup at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/ > > https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html > > "Android O includes a single seccomp filter installed into zygote, the > process from which all the Android applications are derived. Because > the filter is installed into zygote—and therefore all apps—the Android > security team took extra caution to not break existing apps" > > Example: > > $ uname -mo > aarch64 Android > $ cat swapoff.c > #include <sys/swap.h> > > int main() > { > swapoff(""); > } > $ gcc swapoff.c -o swapoff -O2 > $ ./swapoff > Bad system call > $ > > It's hard to imagine what is more performance critical than the zygote > process and the actual apps as above? > > It's also hard to imagine what kind of performance concern can arise > by adding seccomp filters also to background system apps that > generally should consume ~0% of CPU. > > If performance is really a concern, the BPF JIT representation with > the bitmap to be able to run the filter in O(1) sounds a better > solution than not adding ad-hoc filters and it's being worked on for > x86-64 and can be ported to aarch64 too. Many of the standalone > background processes likely wouldn't even use uffd at all so you could > block the user initiated faults too that way. > > Ultimately because of issues as [3] (be them still relevant or not, to > be double checked), no matter if through selinux, seccomp or a > different sysctl value, without this patchset applied the default > behavior of the userfaultfd syscall for all Linux binaries running on > Android kernels, would deviate from the upstream kernel. So even if we > would make the pipe mutex logic more complex the deviation would > remain. Your patchset adds much less risk of breakage than adding a > timeout to kernel initiated userfaults and it resolves all concerns as > well as a timeout. We'll also make better use of the "0" value this > way. So while I'm not certain this is the best for the long term, this > looks the sweet spot for the short term to resolve many issues at > once. > > Thanks! > Andrea >
On Thu, Oct 8, 2020 at 4:22 PM Nick Kralevich <nnk@google.com> wrote: > > On Wed, Oct 7, 2020 at 9:01 PM Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Hello Lokesh, > > > > On Wed, Oct 07, 2020 at 01:26:55PM -0700, Lokesh Gidra wrote: > > > On Wed, Sep 23, 2020 at 11:56 PM Lokesh Gidra <lokeshgidra@google.com> wrote: > > > > > > > > This patch series is split from [1]. The other series enables SELinux > > > > support for userfaultfd file descriptors so that its creation and > > > > movement can be controlled. > > > > > > > > It has been demonstrated on various occasions that suspending kernel > > > > code execution for an arbitrary amount of time at any access to > > > > userspace memory (copy_from_user()/copy_to_user()/...) can be exploited > > > > to change the intended behavior of the kernel. For instance, handling > > > > page faults in kernel-mode using userfaultfd has been exploited in [2, 3]. > > > > Likewise, FUSE, which is similar to userfaultfd in this respect, has been > > > > exploited in [4, 5] for similar outcome. > > > > > > > > This small patch series adds a new flag to userfaultfd(2) that allows > > > > callers to give up the ability to handle kernel-mode faults with the > > > > resulting UFFD file object. It then adds a 'user-mode only' option to > > > > the unprivileged_userfaultfd sysctl knob to require unprivileged > > > > callers to use this new flag. > > > > > > > > The purpose of this new interface is to decrease the chance of an > > > > unprivileged userfaultfd user taking advantage of userfaultfd to > > > > enhance security vulnerabilities by lengthening the race window in > > > > kernel code. > > > > > > > > [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/ > > > > [2] https://duasynt.com/blog/linux-kernel-heap-spray > > > > [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit > > > > I've looking at those links and I've been trying to verify the link > > [3] is relevant. > > > > Specifically I've been trying to verify if 1) current state of the art > > modern SLUB randomization techniques already enabled in production and > > rightfully wasting some CPU in all enterprise kernels to prevent > > things like above to become an issue in practice 2) combined with the > > fact different memcg need to share the same kmemcaches (which was > > incidentally fixed a few months ago upstream) and 3) further > > robustness enhancements against exploits in the slub metadata, may > > already render the exploit [3] from 2016 irrelevant in practice. > > It's quite possible that some other mitigation was helpful against the > technique used by this particular exploit. It's the nature of exploits > that they are fragile and will change as new soft mitigations are > introduced. The effectiveness of a particular exploit mitigation > change is orthogonal to the change presented here. > > The purpose of this change is to prevent an attacker from suspending > kernel code execution and having kernel data structures in a > predictable state. This makes it harder for an attacker to "win" race > conditions against various kernel data structures. This change > compliments other kernel hardening changes such as the changes you've > referenced above. Focusing on one particular exploit somewhat misses > the point of this change. > > > > > So I started by trying to reproduce [3] by building 4.5.1 with a > > .config with no robustness features and I booted it on fedora-32 or > > gentoo userland and I cannot even invoke call_usermodehelper. Calling > > socket(22, AF_INET, 0) won't invoke such function. Can you reproduce > > on 4.5.1? Which kernel .config should I use to build 4.5.1 in order > > for call_usermodehelper to be invoked by the exploit? Could you help > > to verify it? > > I haven't tried to verify this myself. I wonder if the usermode > hardening changes also impacted this exploit? See > https://lkml.org/lkml/2017/1/16/468 > > But again, focusing on an exploit, which is inherently fragile in > nature and dependent on the state of the kernel tree at a particular > time, is unlikely to be useful to analyze this patch. > > > > > It even has uninitialized variable spawning random perrors so it > > doesn't give a warm fuzzy feeling: > > > > ==== > > int main(int argc, char **argv) { > > void *region, *map; > > ^^^^^ > > pthread_t uffd_thread; > > int uffd, msqid, i; > > > > region = (void *)mmap((void *)0x40000000, 0x2000, PROT_READ|PROT_WRITE, > > MAP_FIXED|MAP_PRIVATE|MAP_ANON, -1, 0); > > > > if (!region) { > > perror("mmap"); > > exit(2); > > } > > > > setup_pagefault(region + 0x1000, 0x1000, 1); > > > > printf("my pid = %d\n", getpid()); > > > > if (!map) { > > ^^^^^^^^ > > perror("mmap"); > > ==== > > > > The whole point of being able to reproduce on 4.5.1 is then to > > simulate if the same exploit would also reproduce on current kernels > > with all enterprise default robustness features enabled. Or did > > anybody already verify it? > > > > Anyway the links I was playing with are all in the cover letter, the > > cover letter is not as important as the actual patches. The actual > > patches looks fine to me. > > That's great to hear. > > > > > The only improvement I can think of is, what about to add a > > printk_once to suggest to toggle the sysctl if userfaultfd bails out > > because the process lacks the CAP_SYS_PTRACE capability? That would > > facilitate the /etc/sysctl.conf or tuned tweaking in case the apps > > aren't verbose enough. > > > > It's not relevant anymore with this latest patchset, but about the > > previous argument that seccomp couldn't be used in all Android > > processes because of performance concern, I'm slightly confused. > > Seccomp causes more problems than just performance. Seccomp is not > designed for whole-of-system protections. Please see my other writeup > at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/ > > > > > https://android-developers.googleblog.com/2017/07/seccomp-filter-in-android-o.html > > > > "Android O includes a single seccomp filter installed into zygote, the > > process from which all the Android applications are derived. Because > > the filter is installed into zygote—and therefore all apps—the Android > > security team took extra caution to not break existing apps" > > > > Example: > > > > $ uname -mo > > aarch64 Android > > $ cat swapoff.c > > #include <sys/swap.h> > > > > int main() > > { > > swapoff(""); > > } > > $ gcc swapoff.c -o swapoff -O2 > > $ ./swapoff > > Bad system call > > $ > > > > It's hard to imagine what is more performance critical than the zygote > > process and the actual apps as above? > > > > It's also hard to imagine what kind of performance concern can arise > > by adding seccomp filters also to background system apps that > > generally should consume ~0% of CPU. > > > > If performance is really a concern, the BPF JIT representation with > > the bitmap to be able to run the filter in O(1) sounds a better > > solution than not adding ad-hoc filters and it's being worked on for > > x86-64 and can be ported to aarch64 too. Many of the standalone > > background processes likely wouldn't even use uffd at all so you could > > block the user initiated faults too that way. > > > > Ultimately because of issues as [3] (be them still relevant or not, to > > be double checked), no matter if through selinux, seccomp or a > > different sysctl value, without this patchset applied the default > > behavior of the userfaultfd syscall for all Linux binaries running on > > Android kernels, would deviate from the upstream kernel. So even if we > > would make the pipe mutex logic more complex the deviation would > > remain. Your patchset adds much less risk of breakage than adding a > > timeout to kernel initiated userfaults and it resolves all concerns as > > well as a timeout. We'll also make better use of the "0" value this > > way. So while I'm not certain this is the best for the long term, this > > looks the sweet spot for the short term to resolve many issues at > > once. > > > > Thanks! > > Andrea > > > > > -- > Nick Kralevich | nnk@google.com Hi Andrea, Did you get a chance to go through Nick's reply to your questions? Also, I sent another revision of this patch series which takes care of the printk that you suggested. Please take a look.
Hello, On Thu, Oct 08, 2020 at 04:22:36PM -0700, Nick Kralevich wrote: > I haven't tried to verify this myself. I wonder if the usermode > hardening changes also impacted this exploit? See > https://lkml.org/lkml/2017/1/16/468 My plan was to: 1) reproduce with the old buggy kernel 2) forward port the bug to the very first version that had both the slub and page freelist randomization available and keep them disabled 3) enable the freelist randomization features (which are already enabled by default in the current enterprise kernels) and see if that makes the attack not workable The hardening of the usermode helper you mentioned is spot on, but it would have been something to worry about and possibly roll back at point 2), but I couldn't get past point 1).. Plenty other hardening techniques (just like the usermode helper) are very specific to a single attack, but the randomization looks generic enough to cover the entire class. > But again, focusing on an exploit, which is inherently fragile in > nature and dependent on the state of the kernel tree at a particular > time, is unlikely to be useful to analyze this patch. Agreed. A single exploit using userfaultfd to enlarge the race window of the use-after-free, not being workable anymore with randomized slub and page freelist enabled, wouldn't have meant a thing by itself. As opposed if that single exploit was still fairly reproducible, it would have been enough to consider the sysctl default to zero as something providing a more tangible long term benefit. That would have been good information to have too, if that's actually the case. I was merely attempting to get a first data point.. overall it would be nice to initiate some research to verify the exact statistical effects that slub/page randomization has on those use-after-free race conditions that can be enlarged by blocking kernel faults, given we're already paying the price for it. I don't think anybody has a sure answer at this point, if we can entirely rely on those features or not. > Seccomp causes more problems than just performance. Seccomp is not > designed for whole-of-system protections. Please see my other writeup > at https://lore.kernel.org/lkml/CAFJ0LnEo-7YUvgOhb4pHteuiUW+wPfzqbwXUCGAA35ZMx11A-w@mail.gmail.com/ Whole-of-system protection I guess helps primarily because it requires no change to userland I guess. An example of a task not running as root (and without ptrace capability) that could use more seccomp blocking: # cat /proc/1517/cmdline ; echo ; grep CapEff /proc/1517/status; grep Seccomp /proc/1517/status /vendor/bin/hw/qcrild CapEff: 0000001000003000 Seccomp: 0 My view is that if the various binaries started by init.rc are run without a strict seccomp filter there would be more things to worry about, than kernel initiated userfaults for those. Still the solution in the v5 patchset looks the safest for all until we'll be able to tell if the slub/page randomizaton (or any other generic enough robustness feature) is already effective against an enlarged race window of kernel initiated userfaults and at the same time it provides the main benefit of avoiding divergence in the behavior of the userfaultfd syscall if invoked within the Android userland. Thanks, Andrea