[v2,0/5] memfd: cleanups for vm.memfd_noexec

Message ID 20230814-memfd-vm-noexec-uapi-fixes-v2-0-7ff9e3e10ba6@cyphar.com (mailing list archive)

Aleksa Sarai Aug. 14, 2023, 8:40 a.m. UTC
The most critical issue with vm.memfd_noexec=2 (the fact that passing
MFD_EXEC would bypass it entirely[1]) has been fixed in Andrew's
tree[2], but there are still some outstanding issues that need to be
addressed:

 * vm.memfd_noexec=2 shouldn't reject old-style memfd_create(2) syscalls
   because it will make it far too difficult to ever migrate. Instead it
   should imply MFD_EXEC.

 * The dmesg warnings are pr_warn_once(), which on most systems means
   that they will be used up by systemd or some other boot process and
   userspace developers will never see them.

   - For the !(flags & (MFD_EXEC | MFD_NOEXEC_SEAL)) case, outputting a
     rate-limited message to the kernel log is necessary to tell
     userspace that they should add the new flags.

     Arguably the most ideal way to deal with the spam concern[3,4]
     while still prompting userspace to switch to the new flags would be
     to only log the warning once per task or something similar.
     However, adding something to task_struct for tracking this would be
     needless bloat for a single pr_warn_ratelimited().

     So just switch to pr_info_ratelimited() to avoid spamming the log
     with something that isn't a real warning. There's lots of
     info-level stuff in dmesg, so it seems really unlikely that this
     would be an actual problem. Most programs are already switching to
     the new flags anyway.

   - For the vm.memfd_noexec=2 case, we need to log a warning for every
     failure because otherwise userspace will have no idea why their
     previously working program started returning -EACCES (previously
     -EINVAL) from memfd_create(2). pr_warn_once() is simply wrong here.

 * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
   unappealing for most users to enable the sysctl because enabling it
   on &init_pid_ns means you need a system reboot to unset it. Given the
   actual security threat being protected against, CAP_SYS_ADMIN users
   being restricted in this way makes little sense.

   The argument for this ratcheting by the original author was that it
   allows you to have a hierarchical setting that cannot be unset by
   child pidnses, but this is not accurate -- changing the parent
   pidns's vm.memfd_noexec setting to be more restrictive didn't affect
   children.

   Instead, switch the vm.memfd_noexec sysctl to be properly
   hierarchical and allow CAP_SYS_ADMIN users (in the pidns's owning
   userns) to lower the setting as long as it is not lower than the
   parent's effective setting. This change also makes it so that
   changing a parent pidns's vm.memfd_noexec will affect all
   descendants, providing a properly hierarchical setting. The
   performance impact of this is incredibly minimal since the maximum
   depth of pidns is 32 and it is only checked during memfd_create(2)
   and unshare(CLONE_NEWPID).

 * The memfd selftests would not exit with a non-zero error code when
   certain tests that ran in a forked process (specifically the ones
   related to MFD_EXEC and MFD_NOEXEC_SEAL) failed.
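
For userspace developers reading the second bullet above, here is a
minimal sketch of what "adding the new flags" could look like in a
program that must also run on kernels that predate them. The helper name
memfd_create_noexec() is hypothetical and not part of this series. The
fallback assumes that EINVAL means the kernel does not know
MFD_NOEXEC_SEAL, although EINVAL can also be caused by other invalid
arguments.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/memfd.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Older uapi headers lack the exec-related flags; values are from <linux/memfd.h>. */
#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif
#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

/*
 * Hypothetical helper: request a non-executable, exec-sealed memfd and
 * retry without the new flag if the kernel rejects it with EINVAL.
 */
int memfd_create_noexec(const char *name, unsigned int flags)
{
	int fd;

	assert(name != NULL);

	fd = (int)syscall(SYS_memfd_create, name, flags | MFD_NOEXEC_SEAL);
	if (fd < 0 && errno == EINVAL) {
		/* Assumption: EINVAL here means the flag is unknown to this kernel. */
		fd = (int)syscall(SYS_memfd_create, name, flags);
	}
	return fd;
}
```

A caller that really needs an executable memfd would pass MFD_EXEC
explicitly instead, and should expect the call to be refused when
vm.memfd_noexec=2 is in effect.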

[1]: https://lore.kernel.org/all/ZJwcsU0vI-nzgOB_@codewreck.org/
[2]: https://lore.kernel.org/all/20230705063315.3680666-1-jeffxu@google.com/
[3]: https://lore.kernel.org/Y5yS8wCnuYGLHMj4@x1n/
[4]: https://lore.kernel.org/f185bb42-b29c-977e-312e-3349eea15383@linuxfoundation.org/

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v2:
- Make vm.memfd_noexec restrictions properly hierarchical.
- Allow vm.memfd_noexec setting to be lowered by CAP_SYS_ADMIN as long
  as it is not lower than the parent's effective setting.
- Fix the logging behaviour related to the new flags and
  vm.memfd_noexec=2.
- Add more thorough tests for vm.memfd_noexec in selftests.
- v1: <https://lore.kernel.org/r/20230713143406.14342-1-cyphar@cyphar.com>
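
To illustrate the first two changelog items, here is a small userspace
model of the hierarchical rule. It is only a sketch under stated
assumptions: struct pidns_model and both function names are hypothetical
and are not the kernel's data structures. The effective setting is taken
to be the maximum value found while walking up the parents, and a write
is refused if it would be lower than the parent's effective setting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pid namespace; not struct pid_namespace. */
struct pidns_model {
	const struct pidns_model *parent;
	int memfd_noexec;	/* this namespace's own setting, 0..2 */
};

/* Effective setting: the most restrictive value in ns and all ancestors. */
int effective_memfd_noexec(const struct pidns_model *ns)
{
	int scope = 0;

	for (; ns != NULL; ns = ns->parent) {
		if (ns->memfd_noexec > scope)
			scope = ns->memfd_noexec;
	}
	return scope;
}

/* Returns 0 on success, -1 if the value is out of range or below the parent. */
int set_memfd_noexec(struct pidns_model *ns, int value)
{
	assert(ns != NULL);

	if (value < 0 || value > 2)
		return -1;
	if (value < effective_memfd_noexec(ns->parent))
		return -1;
	ns->memfd_noexec = value;
	return 0;
}
```

In this model, raising the root namespace's value immediately raises the
effective value seen by every descendant, and a child cannot go below
it. The walk is bounded by the namespace nesting depth, which is why the
cost is expected to be small.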

---
Aleksa Sarai (5):
      selftests: memfd: error out test process when child test fails
      memfd: do not -EACCES old memfd_create() users with vm.memfd_noexec=2
      memfd: improve userspace warnings for missing exec-related flags
      memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy
      selftests: improve vm.memfd_noexec sysctl tests

 include/linux/pid_namespace.h              |  39 ++--
 kernel/pid.c                               |   3 +
 kernel/pid_namespace.c                     |   6 +-
 kernel/pid_sysctl.h                        |  28 ++-
 mm/memfd.c                                 |  33 ++-
 tools/testing/selftests/memfd/memfd_test.c | 332 +++++++++++++++++++++++------
 6 files changed, 322 insertions(+), 119 deletions(-)
---
base-commit: 3ff995246e801ea4de0a30860a1d8da4aeb538e7
change-id: 20230803-memfd-vm-noexec-uapi-fixes-ace725c67b0f

Best regards,

Comments

Jeff Xu Aug. 16, 2023, 5:08 a.m. UTC | #1
On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>  * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
>    unappealing for most users to enable the sysctl because enabling it
>    on &init_pid_ns means you need a system reboot to unset it. Given the
>    actual security threat being protected against, CAP_SYS_ADMIN users
>    being restricted in this way makes little sense.
>
>    The argument for this ratcheting by the original author was that it
>    allows you to have a hierarchical setting that cannot be unset by
>    child pidnses, but this is not accurate -- changing the parent
>    pidns's vm.memfd_noexec setting to be more restrictive didn't affect
>    children.
>
That is not exactly what I said though.
From ChromeOS's position, allowing a downgrade is less secure, and this
setting was designed from the very beginning to be set at startup/reboot
time, such as on the kernel command line or as part of the container
runtime environment (passed to the sandboxed container).
I understand your viewpoint: from another distribution's point of view,
the original design might be too restrictive, so if the kernel wants
to weigh ease of administration more heavily, I'm OK with your approach.
It is less secure for ChromeOS, though: we try to prevent arbitrary
code execution as much as possible, even for CAP_SYS_ADMIN. With this
change, there is one more possibility for us to consider.




Aleksa Sarai Aug. 19, 2023, 2:50 a.m. UTC | #2
On 2023-08-15, Jeff Xu <jeffxu@google.com> wrote:
> On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >  * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
> >    unappealing for most users to enable the sysctl because enabling it
> >    on &init_pid_ns means you need a system reboot to unset it. Given the
> >    actual security threat being protected against, CAP_SYS_ADMIN users
> >    being restricted in this way makes little sense.
> >
> >    The argument for this ratcheting by the original author was that it
> >    allows you to have a hierarchical setting that cannot be unset by
> >    child pidnses, but this is not accurate -- changing the parent
> >    pidns's vm.memfd_noexec setting to be more restrictive didn't affect
> >    children.
> >
> That is not exactly what I said though.

Sorry, I probably should've phrased this as "one of the main arguments".
In the last discussion thread we had in the v1 of this patch, it was my
impression that this was the primary sticking point.

> From ChromeOS's position, allowing a downgrade is less secure, and this
> setting was designed from the very beginning to be set at startup/reboot
> time, such as on the kernel command line or as part of the container
> runtime environment (passed to the sandboxed container).

If this had been implemented as a cmdline flag, it would be completely
reasonable that you need to reboot to change it. However, it was
implemented as a sysctl and the behaviour of sysctls is that admins can
(generally) change them after they've been set -- even for
security-related sysctls such as the fs.protected_* sysctls. The only
counter-example I know of is the YAMA one, and if I'm being honest I think
that behaviour is also weird.

> I understand your viewpoint: from another distribution's point of view,
> the original design might be too restrictive, so if the kernel wants
> to weigh ease of administration more heavily, I'm OK with your approach.
> It is less secure for ChromeOS, though: we try to prevent arbitrary
> code execution as much as possible, even for CAP_SYS_ADMIN. With this
> change, there is one more possibility for us to consider.

FWIW I still don't think it makes sense, for the vast majority of systems
(if any), to have a threat model where the memfd_create(MFD_EXEC)
restriction has to protect against a &init_user_ns-privileged
CAP_SYS_ADMIN process being tricked into writing a sysctl.

If ChromeOS really wants the old vm.memfd_noexec=2 behaviour to be
enforced, this can be done with a very simple seccomp filter. If applied
to pid1, this would also not be possible to unset without a reboot.
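
As a rough illustration of the kind of seccomp filter meant here, the
following sketch builds a classic-BPF program that refuses
memfd_create(2) calls which do not pass MFD_NOEXEC_SEAL. It assumes that
this is the behaviour to enforce, that the machine is little-endian (so
the low 32 bits of the flags argument sit at the start of args[1]), and
it omits the architecture check that a production filter should have.
The names memfd_noexec_filter and install_memfd_noexec_filter() are
hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef MFD_NOEXEC_SEAL
#define MFD_NOEXEC_SEAL 0x0008U
#endif

struct sock_filter memfd_noexec_filter[] = {
	/* [0] A = syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* [1] not memfd_create: skip ahead to the ALLOW at [5] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_memfd_create, 0, 3),
	/* [2] A = low 32 bits of the flags argument (little-endian assumed) */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
	/* [3] MFD_NOEXEC_SEAL set: jump to the ALLOW at [5], else fall through */
	BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, MFD_NOEXEC_SEAL, 1, 0),
	/* [4] refuse the call with EACCES */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
	/* [5] allow everything else */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

#define MEMFD_NOEXEC_FILTER_LEN \
	(sizeof(memfd_noexec_filter) / sizeof(memfd_noexec_filter[0]))

/* Installs the filter on the calling thread; returns 0 on success, -1 on error. */
int install_memfd_noexec_filter(void)
{
	struct sock_fprog prog = {
		.len = (unsigned short)MEMFD_NOEXEC_FILTER_LEN,
		.filter = memfd_noexec_filter,
	};

	assert(MEMFD_NOEXEC_FILTER_LEN <= BPF_MAXINSNS);

	/* Needed for unprivileged callers; harmless for privileged ones. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Because seccomp filters are inherited across fork(2) and execve(2) and
cannot be removed once installed, a filter like this applied early by
pid1 would stay in force for its descendants until reboot.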
Jeff Xu Aug. 21, 2023, 7:04 p.m. UTC | #3
On Fri, Aug 18, 2023 at 7:50 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2023-08-15, Jeff Xu <jeffxu@google.com> wrote:
> > On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >  * The ratcheting mechanism for vm.memfd_noexec makes it incredibly
> > >    unappealing for most users to enable the sysctl because enabling it
> > >    on &init_pid_ns means you need a system reboot to unset it. Given the
> > >    actual security threat being protected against, CAP_SYS_ADMIN users
> > >    being restricted in this way makes little sense.
> > >
> > >    The argument for this ratcheting by the original author was that it
> > >    allows you to have a hierarchical setting that cannot be unset by
> > >    child pidnses, but this is not accurate -- changing the parent
> > >    pidns's vm.memfd_noexec setting to be more restrictive didn't affect
> > >    children.
> > >
> > That is not exactly what I said though.
>
> Sorry, I probably should've phrased this as "one of the main arguments".
> In the last discussion thread we had in the v1 of this patch, it was my
> impression that this was the primary sticking point.
>
> > From ChromeOS's position, allowing a downgrade is less secure, and this
> > setting was designed from the very beginning to be set at startup/reboot
> > time, such as on the kernel command line or as part of the container
> > runtime environment (passed to the sandboxed container).
>
> If this had been implemented as a cmdline flag, it would be completely
> reasonable that you need to reboot to change it. However, it was

You might already know that sysctls can be set on the kernel command line,
thanks to Vlastimil Babka from SUSE. [1]
[1] https://lore.kernel.org/lkml/20200325120345.12946-1-vbabka@suse.cz/

> implemented as a sysctl and the behaviour of sysctls is that admins can
> (generally) change them after they've been set -- even for
> security-related sysctls such as the fs.protected_* sysctls. The only
> counter-example I know of is the YAMA one, and if I'm being honest I think
> that behaviour is also weird.
>

> > I understand your viewpoint: from another distribution's point of view,
> > the original design might be too restrictive, so if the kernel wants
> > to weigh ease of administration more heavily, I'm OK with your approach.
> > It is less secure for ChromeOS, though: we try to prevent arbitrary
> > code execution as much as possible, even for CAP_SYS_ADMIN. With this
> > change, there is one more possibility for us to consider.
>
> FWIW I still don't think it makes sense, for the vast majority of systems
> (if any), to have a threat model where the memfd_create(MFD_EXEC)
> restriction has to protect against a &init_user_ns-privileged
> CAP_SYS_ADMIN process being tricked into writing a sysctl.
>
I agree other distributions might not care much about running
arbitrary code on the host for CAP_SYS_ADMIN, similar to traditional
unix in this aspect. ChromeOS has some unique security features.

> If ChromeOS really wants the old vm.memfd_noexec=2 behaviour to be
> enforced, this can be done with a very simple seccomp filter. If applied
> to pid1, this would also not be possible to unset without a reboot.
>
In practice, the host and a process can have different values for
vm.memfd_noexec, so this can't easily be implemented through seccomp.
Seccomp also requires no_new_privs to be set, and there are implications
if we set it on pid 1 and apply it to all its children.


> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>

Thanks
Best regards,
-Jeff