diff mbox series

[v2,4/5] memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy

Message ID 20230814-memfd-vm-noexec-uapi-fixes-v2-4-7ff9e3e10ba6@cyphar.com (mailing list archive)
State Accepted
Commit 9876cfe8ec1cb3c88de31f4d58d57b0e7e22bcc4
Headers show
Series memfd: cleanups for vm.memfd_noexec | expand

Commit Message

Aleksa Sarai Aug. 14, 2023, 8:41 a.m. UTC
This sysctl has the very unusual behaviour of not allowing any user (even
CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you
were to set this sysctl to a more restrictive option in the host pidns
you would need to reboot your machine in order to reset it.

The justification given in [1] is that this is a security feature and
thus it should not be possible to disable. Aside from the fact that we
have plenty of security-related sysctls that can be disabled after being
enabled (fs.protected_symlinks for instance), the protection provided by
the sysctl is to stop users from being able to create a binary and then
execute it. A user with CAP_SYS_ADMIN can trivially do this without
memfd_create(2):

  % cat mount-memfd.c
  #include <fcntl.h>
  #include <string.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <linux/mount.h>

  #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"

  int main(void)
  {
  	int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
  	assert(fsfd >= 0);
  	assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));

  	int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
  	assert(dfd >= 0);

  	int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
  	assert(execfd >= 0);
  	assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
  	assert(!close(execfd));

  	char *execpath = NULL;
  	char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
  	execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
  	assert(execfd >= 0);
  	assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
  	assert(!execve(execpath, argv, envp));
  }
  % ./mount-memfd
  this file was executed from this totally private tmpfs: /proc/self/fd/5
  %

Given that it is possible for CAP_SYS_ADMIN users to create executable
binaries without memfd_create(2) and without touching the host
filesystem (not to mention the many other things a CAP_SYS_ADMIN process
would be able to do that would be equivalent or worse), it seems strange
to cause a fair amount of headache to admins when there doesn't appear
to be an actual security benefit to blocking this. There appear to be
concerns about confused-deputy-esque attacks[2] but a confused deputy that
can write to arbitrary sysctls is a bigger security issue than
executable memfds.

/* New API */

The primary requirement from the original author appears to be more
based on the need to be able to restrict an entire system in a
hierarchical manner[3], such that child namespaces cannot re-enable
executable memfds.

So, implement that behaviour explicitly -- the vm.memfd_noexec scope is
evaluated up the pidns tree to &init_pid_ns and you have the most
restrictive value applied to you. The new lower limit you can set
vm.memfd_noexec is whatever limit applies to your parent.

Note that a pidns will inherit a copy of the parent pidns's effective
vm.memfd_noexec setting at unshare() time. This matches the existing
behaviour, and it also ensures that a pidns will never have its
vm.memfd_noexec setting *lowered* behind its back (but it will be raised
if the parent raises theirs).

/* Backwards Compatibility */

As the previous version of the sysctl didn't allow you to lower the
setting at all, there are no backwards compatibility issues with this
aspect of the change.

However it should be noted that now that the setting is completely
hierarchical. Previously, a cloned pidns would just copy the current
pidns setting, meaning that if the parent's vm.memfd_noexec was changed
it wouldn't propoagate to existing pid namespaces. Now, the restriction
applies recursively. This is a uAPI change, however:

 * The sysctl is very new, having been merged in 6.3.
 * Several aspects of the sysctl were broken up until this patchset and
   the other patchset by Jeff Xu last month.

And thus it seems incredibly unlikely that any real users would run into
this issue. In the worst case, if this causes userspace isues we could
make it so that modifying the setting follows the hierarchical rules but
the restriction checking uses the cached copy.

[1]: https://lore.kernel.org/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@mail.gmail.com/
[2]: https://lore.kernel.org/CALmYWFs_dNCzw_pW1yRAo4bGCPEtykroEQaowNULp7svwMLjOg@mail.gmail.com/
[3]: https://lore.kernel.org/CALmYWFuahdUF7cT4cm7_TGLqPanuHXJ-hVSfZt7vpTnc18DPrw@mail.gmail.com/

Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: stable@vger.kernel.org # v6.3+
Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/pid_namespace.h | 23 ++++++++++++++++++++++-
 kernel/pid.c                  |  3 +++
 kernel/pid_namespace.c        |  6 +++---
 kernel/pid_sysctl.h           | 28 ++++++++++++----------------
 mm/memfd.c                    |  3 ++-
 5 files changed, 42 insertions(+), 21 deletions(-)

Comments

Jeff Xu Aug. 16, 2023, 5:13 a.m. UTC | #1
On Mon, Aug 14, 2023 at 1:41 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> This sysctl has the very unusual behaviour of not allowing any user (even
> CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you
> were to set this sysctl to a more restrictive option in the host pidns
> you would need to reboot your machine in order to reset it.
>
> The justification given in [1] is that this is a security feature and
> thus it should not be possible to disable. Aside from the fact that we
> have plenty of security-related sysctls that can be disabled after being
> enabled (fs.protected_symlinks for instance), the protection provided by
> the sysctl is to stop users from being able to create a binary and then
> execute it. A user with CAP_SYS_ADMIN can trivially do this without
> memfd_create(2):
>
>   % cat mount-memfd.c
>   #include <fcntl.h>
>   #include <string.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <unistd.h>
>   #include <linux/mount.h>
>
>   #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:"
>
>   int main(void)
>   {
>         int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
>         assert(fsfd >= 0);
>         assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2));
>
>         int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
>         assert(dfd >= 0);
>
>         int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782);
>         assert(execfd >= 0);
>         assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE));
>         assert(!close(execfd));
>
>         char *execpath = NULL;
>         char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL };
>         execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC);
>         assert(execfd >= 0);
>         assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0);
>         assert(!execve(execpath, argv, envp));
>   }
>   % ./mount-memfd
>   this file was executed from this totally private tmpfs: /proc/self/fd/5
>   %
>
> Given that it is possible for CAP_SYS_ADMIN users to create executable
> binaries without memfd_create(2) and without touching the host
> filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> would be able to do that would be equivalent or worse), it seems strange
> to cause a fair amount of headache to admins when there doesn't appear
> to be an actual security benefit to blocking this. There appear to be
> concerns about confused-deputy-esque attacks[2] but a confused deputy that
> can write to arbitrary sysctls is a bigger security issue than
> executable memfds.
>
Something to point out: The demo code might be enough to prove your
case in other distributions, however, in ChromeOS, you can't run this
code. The executable in ChromeOS are all from known sources and
verified at boot.
If an attacker could run this code in ChromeOS, that means the
attacker already acquired arbitrary code execution through other ways,
at that point, the attacker no longer needs to create/find an
executable memfd, they already have the vehicle. You can't use an
example of an attacker already running arbitrary code to prove that
disable downgrading is useless.
I agree it is a big problem that an attacker already can modify a
sysctl.  Assuming this can happen by controlling arguments passed into
sysctl, at the time, the attacker might not have full arbitrary code
execution yet, that is the reason the original design is so
restrictive.

Best regards,
-Jeff
Dominique Martinet Aug. 16, 2023, 5:44 a.m. UTC | #2
Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700:
> > Given that it is possible for CAP_SYS_ADMIN users to create executable
> > binaries without memfd_create(2) and without touching the host
> > filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> > would be able to do that would be equivalent or worse), it seems strange
> > to cause a fair amount of headache to admins when there doesn't appear
> > to be an actual security benefit to blocking this. There appear to be
> > concerns about confused-deputy-esque attacks[2] but a confused deputy that
> > can write to arbitrary sysctls is a bigger security issue than
> > executable memfds.
> >
> Something to point out: The demo code might be enough to prove your
> case in other distributions, however, in ChromeOS, you can't run this
> code. The executable in ChromeOS are all from known sources and
> verified at boot.
> If an attacker could run this code in ChromeOS, that means the
> attacker already acquired arbitrary code execution through other ways,
> at that point, the attacker no longer needs to create/find an
> executable memfd, they already have the vehicle. You can't use an
> example of an attacker already running arbitrary code to prove that
> disable downgrading is useless.
> I agree it is a big problem that an attacker already can modify a
> sysctl.  Assuming this can happen by controlling arguments passed into
> sysctl, at the time, the attacker might not have full arbitrary code
> execution yet, that is the reason the original design is so
> restrictive.

I don't understand how you can say an attacker cannot run arbitrary code
within a process here, yet assert that they'd somehow run memfd_create +
execveat on it if this sysctl is lowered -- the two look equivalent to
me?

CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as
soon as you can run arbitrary code (just have a look at the various
container escape example when the capability is given); I see little
point in trying to harden just this here.
It'd make more sense to limit all sysctl modifications in the context
you're thinking of through e.g. selinux or another LSM.

(in the context of users making their own containers, my suggestion is
always to never use CAP_SYS_ADMIN, or if they must give it to a separate
minimal container where they can limit user interaction)


FWIW, I also think the proposed =2 behaviour makes more sense, but this
is something we already discussed last month so I won't come back to it
as not really involved here.
Jeff Xu Aug. 16, 2023, 10:46 p.m. UTC | #3
On Tue, Aug 15, 2023 at 10:44 PM Dominique Martinet
<asmadeus@codewreck.org> wrote:
>
> Jeff Xu wrote on Tue, Aug 15, 2023 at 10:13:18PM -0700:
> > > Given that it is possible for CAP_SYS_ADMIN users to create executable
> > > binaries without memfd_create(2) and without touching the host
> > > filesystem (not to mention the many other things a CAP_SYS_ADMIN process
> > > would be able to do that would be equivalent or worse), it seems strange
> > > to cause a fair amount of headache to admins when there doesn't appear
> > > to be an actual security benefit to blocking this. There appear to be
> > > concerns about confused-deputy-esque attacks[2] but a confused deputy that
> > > can write to arbitrary sysctls is a bigger security issue than
> > > executable memfds.
> > >
> > Something to point out: The demo code might be enough to prove your
> > case in other distributions, however, in ChromeOS, you can't run this
> > code. The executable in ChromeOS are all from known sources and
> > verified at boot.
> > If an attacker could run this code in ChromeOS, that means the
> > attacker already acquired arbitrary code execution through other ways,
> > at that point, the attacker no longer needs to create/find an
> > executable memfd, they already have the vehicle. You can't use an
> > example of an attacker already running arbitrary code to prove that
> > disable downgrading is useless.
> > I agree it is a big problem that an attacker already can modify a
> > sysctl.  Assuming this can happen by controlling arguments passed into
> > sysctl, at the time, the attacker might not have full arbitrary code
> > execution yet, that is the reason the original design is so
> > restrictive.
>
> I don't understand how you can say an attacker cannot run arbitrary code
> within a process here, yet assert that they'd somehow run memfd_create +
> execveat on it if this sysctl is lowered -- the two look equivalent to
> me?
>
It might require multiple steps for this attack, one possible scenario:
1> control a write primitive in CAP_SYSADMIN process's memory,  change
arguments of sysctl call, and downgrade the setting for memfd, e.g. change
it=0 to revert to old behavior (by default creating executable memfd)
2> control a non-privileged process that creates and writes to
memfd, and write the contents with the binary that the
attacker wants. This process just needs non-executable memfd, but
isn't updated yet.
3> Confuse a non-privilege process to execute the memfd the attacker
wrote in step 2.

In chromeOS, because all the executables are from verified sources,
attackers typically can't easily use the step 3 alone (without step
2),  and memfd was such a hole that enables an unverified executable.

In the original design, downgrading is not allowed, the attack chain
of 2/3 is completely blocked.  With this new approach, attackers will
try to find an additional step (step 1) to make the old attack (step 2
and 3) working again. It is difficult but I can't say it is
impossible.

> CAP_SYS_ADMIN is a kludge of a capability that pretty much gives root as
> soon as you can run arbitrary code (just have a look at the various
> container escape example when the capability is given); I see little
> point in trying to harden just this here.

I'm not an expert in containers, if the industry is giving up on
privileged containers, then the reasoning makes sense.
From ChromeOS point of view, we don't use runc currently, so I think
it makes more sense for runc users to drive these features.  The
original design is with runc's in mind, and even privileged containers
can't downgrade its own setting.

> It'd make more sense to limit all sysctl modifications in the context
> you're thinking of through e.g. selinux or another LSM.
>
I agree,  when I think more about this.
Security features fit LSM better, LSM can do additional "allow/deny"
on otherwise allowed behavior from user space code. Based on that,
"disallow downgrading" fits LSM better.  Also from the same reasoning,
I have second thoughts on the "=2", originally the "MEMFD_EXE was left
out due to the thinking, if user code explicitly setting MEMFD_EXE,
sysctl should not block it, it is the work of LSM. However, the "=2"
has evolved to block MEMFD_EXE completely ... alas .. it might be too
late to revert this, if this is what devs want, it can be that way.

Thanks
Best regards,
-Jeff




-Jeff

> (in the context of users making their own containers, my suggestion is
> always to never use CAP_SYS_ADMIN, or if they must give it to a separate
> minimal container where they can limit user interaction)
>
>
> FWIW, I also think the proposed =2 behaviour makes more sense, but this
> is something we already discussed last month so I won't come back to it
> as not really involved here.
>
> --
> Dominique Martinet | Asmadeus
diff mbox series

Patch

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 53974d79d98e..f9f9931e02d6 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -39,7 +39,6 @@  struct pid_namespace {
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-	/* sysctl for vm.memfd_noexec */
 	int memfd_noexec_scope;
 #endif
 } __randomize_layout;
@@ -56,6 +55,23 @@  static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	int scope = MEMFD_NOEXEC_SCOPE_EXEC;
+
+	for (; ns; ns = ns->parent)
+		scope = max(scope, READ_ONCE(ns->memfd_noexec_scope));
+
+	return scope;
+}
+#else
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+#endif
+
 extern struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns);
 extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
@@ -70,6 +86,11 @@  static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 	return ns;
 }
 
+static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)
+{
+	return 0;
+}
+
 static inline struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *ns)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index 6a1d23a11026..fee14a4486a3 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -83,6 +83,9 @@  struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
+#endif
 };
 EXPORT_SYMBOL_GPL(init_pid_ns);
 
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0bf44afe04dd..619972c78774 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -110,9 +110,9 @@  static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->user_ns = get_user_ns(user_ns);
 	ns->ucounts = ucounts;
 	ns->pid_allocated = PIDNS_ADDING;
-
-	initialize_memfd_noexec_scope(ns);
-
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
+#endif
 	return ns;
 
 out_free_idr:
diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h
index b26e027fc9cd..2ee41a3a1dfd 100644
--- a/kernel/pid_sysctl.h
+++ b/kernel/pid_sysctl.h
@@ -5,33 +5,30 @@ 
 #include <linux/pid_namespace.h>
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns)
-{
-	ns->memfd_noexec_scope =
-		task_active_pid_ns(current)->memfd_noexec_scope;
-}
-
 static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
 	int write, void *buf, size_t *lenp, loff_t *ppos)
 {
 	struct pid_namespace *ns = task_active_pid_ns(current);
 	struct ctl_table table_copy;
+	int err, scope, parent_scope;
 
 	if (write && !ns_capable(ns->user_ns, CAP_SYS_ADMIN))
 		return -EPERM;
 
 	table_copy = *table;
-	if (ns != &init_pid_ns)
-		table_copy.data = &ns->memfd_noexec_scope;
 
-	/*
-	 * set minimum to current value, the effect is only bigger
-	 * value is accepted.
-	 */
-	if (*(int *)table_copy.data > *(int *)table_copy.extra1)
-		table_copy.extra1 = table_copy.data;
+	/* You cannot set a lower enforcement value than your parent. */
+	parent_scope = pidns_memfd_noexec_scope(ns->parent);
+	/* Equivalent to pidns_memfd_noexec_scope(ns). */
+	scope = max(READ_ONCE(ns->memfd_noexec_scope), parent_scope);
+
+	table_copy.data = &scope;
+	table_copy.extra1 = &parent_scope;
 
-	return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	err = proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+	if (!err && write)
+		WRITE_ONCE(ns->memfd_noexec_scope, scope);
+	return err;
 }
 
 static struct ctl_table pid_ns_ctl_table_vm[] = {
@@ -51,7 +48,6 @@  static inline void register_pid_ns_sysctl_table_vm(void)
 	register_sysctl("vm", pid_ns_ctl_table_vm);
 }
 #else
-static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns) {}
 static inline void register_pid_ns_sysctl_table_vm(void) {}
 #endif
 
diff --git a/mm/memfd.c b/mm/memfd.c
index aa46521057ab..1cad1904fc26 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -271,7 +271,8 @@  long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
 #ifdef CONFIG_SYSCTL
-	int sysctl = task_active_pid_ns(current)->memfd_noexec_scope;
+	struct pid_namespace *ns = task_active_pid_ns(current);
+	int sysctl = pidns_memfd_noexec_scope(ns);
 
 	if (!(*flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
 		if (sysctl >= MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL)