diff mbox series

ucounts: Fix systemd LimigtNPROC with private users regression

Message ID 87sfs8jmpz.fsf_-_@email.froward.int.ebiederm.org (mailing list archive)
State Accepted
Commit 0ac983f512033cb7b5e210c9589768ad25b1e36b
Headers show
Series ucounts: Fix systemd LimigtNPROC with private users regression | expand

Commit Message

Eric W. Biederman Feb. 24, 2022, 3:41 p.m. UTC
Long story short recursively enforcing RLIMIT_NPROC when it is not
enforced on the process that creates a new user namespace, causes
currently working code to fail.  There is no reason to enforce
RLIMIT_NPROC recursively when we don't enforce it normally so update
the code to detect this case.

I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
RLIMIT_NPROC is charged and checked for enforcement based upon the
real uid, using capable() wich is euid based is inconsistent with reality.
Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
testing for when the real uid would match the conditions when
CAP_SYS_RESOURCE would be present if the real uid was the effective
uid.

Reported-by: Etienne Dechamps <etienne@edechamps.fr>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---

The previous conversation has given me enough clarity that I can see
which tests I am comfortable with use for this pending regression fix.

I have tested this and it works for me.  Does anyone have any concerns
with this change?

 kernel/user_namespace.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Kees Cook Feb. 24, 2022, 4:28 p.m. UTC | #1
typo: Subject's LimigtNPROC -> LimitNPROC

On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote:
> 
> Long story short recursively enforcing RLIMIT_NPROC when it is not
> enforced on the process that creates a new user namespace, causes
> currently working code to fail.  There is no reason to enforce
> RLIMIT_NPROC recursively when we don't enforce it normally so update
> the code to detect this case.
> 
> I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
> RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
> RLIMIT_NPROC is charged and checked for enforcement based upon the
> real uid, using capable() wich is euid based is inconsistent with reality.

typo: wich -> which

> Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
> testing for when the real uid would match the conditions when
> CAP_SYS_RESOURCE would be present if the real uid was the effective
> uid.
> 
> Reported-by: Etienne Dechamps <etienne@edechamps.fr>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
> Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
> 
> The previous conversation has given me enough clarity that I can see
> which tests I am comfortable with use for this pending regression fix.
> 
> I have tested this and it works for me.  Does anyone have any concerns
> with this change?

I'd really love some kind of selftest that exercises the edge cases; do
you have your tests in some form that could be converted?

But otherwise, yes, this looks like the best option here.

Reviewed-by: Kees Cook <keescook@chromium.org>

> 
>  kernel/user_namespace.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 6b2e3ca7ee99..5481ba44a8d6 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  	cred->user_ns = user_ns;
>  }
>  
> +static unsigned long enforced_nproc_rlimit(void)
> +{
> +	unsigned long limit = RLIM_INFINITY;
> +
> +	/* Is RLIMIT_NPROC currently enforced? */
> +	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
> +	    (current_user_ns() != &init_user_ns))
> +		limit = rlimit(RLIMIT_NPROC);
> +
> +	return limit;
> +}
> +
>  /*
>   * Create a new user namespace, deriving the creator from the user in the
>   * passed credentials, and replacing that user with the new root user for the
> @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
>  	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
>  		ns->ucount_max[i] = INT_MAX;
>  	}
> -	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
> +	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
> -- 
> 2.29.2
>
Michal Koutný Feb. 24, 2022, 6:53 p.m. UTC | #2
On Thu, Feb 24, 2022 at 08:28:41AM -0800, Kees Cook <keescook@chromium.org> wrote:
> I'd really love some kind of selftest that exercises the edge cases; do
> you have your tests in some form that could be converted?

There's the original
tools/testing/selftests/rlimits/rlimits-per-userns.c selftest.

I've been rewriting it to cover more situations, I'm sending it as one
monster patch (I'd need spend more time reordering my commits into some
logical patch order) if anyone wishes to try it.

I've tried it on 5c1ee569660d4a205dced9cb4d0306b907fb7599 + this Eric's
patch.

The test rlimit-per-userns-root passes
- together with that I claim this patch

Reviewed-by: Michal Koutný <mkoutny@suse.com>

The test rlimit-per-userns-nonroot fails. It's similar off-by-one
mistake as was in the fork path, but it's in the do_execveat_common():

        if ((current->flags & PF_NPROC_EXCEEDED) &&
            is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
                retval = -EAGAIN;
                goto out_ret;
        }

(If RLIMIT_NPROC should be strictly honored, setuid+execve should fail
when given uid's ucount is at the limit already.)

Funnily, the original
tools/testing/selftests/rlimits/rlimits-per-userns.c passes thanks to
the off-by-one check even though it should not pass because unshare(2)
is called after setuid(2).

Michal

-- >8 --
From be67d903f1f179f585bf302f6c2d2446f24263d7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Michal=20Koutn=C3=BD?= <mkoutny@suse.com>
Date: Thu, 20 Jan 2022 19:32:54 +0100
Subject: [RFC PATCH] selftests: Rewrite RLIMIT_NPROC checks (in user
 namespaces)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This adds two test programs:
- rlimit-per-userns-root -- creates user namespaces owned by root,
- rlimit-per-userns-nonroot -- creates user namespaces owned by non-root.

The forking tree:

    main [init_user_ns]
    ` service [user_ns_1]
      ` worker 1
      ` worker 2
      ...
      ` worker k
    ...
    ` service [user_ns_n]
      ` worker 1
      ` worker 2
      ...
      ` worker k

Expectations rlimit-per-userns-root:

    n > RLIMIT_NPROC		privileged user can spawn mutliple services in different user namespaces
    (k+1) <= RLIMIT_NPROC	limit is honored within user namespace
    k >= RLIMIT_NPROC-1		separate per-user namespace counters

Expectations rlimit-per-userns-nonroot:

    n <= RLIMIT_NPROC		global RLIMIT_NPROC is honored
    (k+1) <= RLIMIT_NPROC	limit is honored within user namespace

Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 tools/testing/selftests/rlimits/Makefile      |   6 +-
 .../rlimits/rlimits-per-userns-nonroot.c      |  37 ++
 .../rlimits/rlimits-per-userns-root.c         |  34 ++
 .../selftests/rlimits/rlimits-per-userns.c    | 161 -------
 .../selftests/rlimits/service_common.c        | 400 ++++++++++++++++++
 .../selftests/rlimits/service_common.h        |  24 ++
 6 files changed, 500 insertions(+), 162 deletions(-)
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c
 create mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns-root.c
 delete mode 100644 tools/testing/selftests/rlimits/rlimits-per-userns.c
 create mode 100644 tools/testing/selftests/rlimits/service_common.c
 create mode 100644 tools/testing/selftests/rlimits/service_common.h

diff --git a/tools/testing/selftests/rlimits/Makefile b/tools/testing/selftests/rlimits/Makefile
index 03aadb406212..8ccb92020206 100644
--- a/tools/testing/selftests/rlimits/Makefile
+++ b/tools/testing/selftests/rlimits/Makefile
@@ -1,6 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0-or-later
 
 CFLAGS += -Wall -O2 -g
-TEST_GEN_PROGS := rlimits-per-userns
+TEST_GEN_PROGS := rlimits-per-userns-root
+TEST_GEN_PROGS += rlimits-per-userns-nonroot
 
 include ../lib.mk
+
+$(OUTPUT)/rlimits-per-userns-root: service_common.c
+$(OUTPUT)/rlimits-per-userns-nonroot: service_common.c
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c b/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c
new file mode 100644
index 000000000000..ccf021769f88
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns-nonroot.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
+ * Author: Michal Koutný <mkoutny@suse.com>
+ */
+#include <err.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "service_common.h"
+
+int main(int argc, char **argv)
+{
+	struct services_ctx *ctx;
+	pid = getpid();
+
+	if (getenv(ENV_PARAM))
+		return run_service(atoi(getenv(ENV_PARAM)));
+
+	if (getuid() > 0)
+		errx(KSFT_SKIP, "This selftest must start as (global) root user.");
+
+	warnx("(pid=%d) Starting testcase", pid);
+
+	ctx = start_services(argv[0], UM_UNSHARE);
+	stop_services(ctx);
+
+	if (count_services(ctx) > THE_LIMIT)
+		errx(KSFT_FAIL, "(pid=%d): Test failed, exec'd services > RLIMIT_NPROC", pid);
+
+	if (check_services(ctx) < count_services(ctx))
+		errx(KSFT_FAIL, "(pid=%d): Test failed, failed services", pid);
+
+	warnx("(pid=%d): Test passed", pid);
+	exit(KSFT_PASS);
+}
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns-root.c b/tools/testing/selftests/rlimits/rlimits-per-userns-root.c
new file mode 100644
index 000000000000..3bf0149ac93d
--- /dev/null
+++ b/tools/testing/selftests/rlimits/rlimits-per-userns-root.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
+ * Author: Michal Koutný <mkoutny@suse.com>
+ */
+#include <err.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "service_common.h"
+
+int main(int argc, char **argv)
+{
+	struct services_ctx *ctx;
+	pid = getpid();
+
+	if (getenv(ENV_PARAM))
+		return run_service(atoi(getenv(ENV_PARAM)));
+
+	if (getuid() > 0)
+		errx(KSFT_SKIP, "This selftest must start as (global) root user.");
+
+	warnx("(pid=%d) Starting testcase", pid);
+
+	ctx = start_services(argv[0], UM_CLONE_NEWUSER);
+	stop_services(ctx);
+
+	if (check_services(ctx) != NR_SERVICES)
+		errx(KSFT_FAIL, "(pid=%d): Test failed, unexpected terminations", pid);
+
+	warnx("(pid=%d): Test passed", pid);
+	exit(KSFT_PASS);
+}
diff --git a/tools/testing/selftests/rlimits/rlimits-per-userns.c b/tools/testing/selftests/rlimits/rlimits-per-userns.c
deleted file mode 100644
index 26dc949e93ea..000000000000
--- a/tools/testing/selftests/rlimits/rlimits-per-userns.c
+++ /dev/null
@@ -1,161 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
- */
-#define _GNU_SOURCE
-#include <sys/types.h>
-#include <sys/wait.h>
-#include <sys/time.h>
-#include <sys/resource.h>
-#include <sys/prctl.h>
-#include <sys/stat.h>
-
-#include <unistd.h>
-#include <stdlib.h>
-#include <stdio.h>
-#include <string.h>
-#include <sched.h>
-#include <signal.h>
-#include <limits.h>
-#include <fcntl.h>
-#include <errno.h>
-#include <err.h>
-
-#define NR_CHILDS 2
-
-static char *service_prog;
-static uid_t user   = 60000;
-static uid_t group  = 60000;
-
-static void setrlimit_nproc(rlim_t n)
-{
-	pid_t pid = getpid();
-	struct rlimit limit = {
-		.rlim_cur = n,
-		.rlim_max = n
-	};
-
-	warnx("(pid=%d): Setting RLIMIT_NPROC=%ld", pid, n);
-
-	if (setrlimit(RLIMIT_NPROC, &limit) < 0)
-		err(EXIT_FAILURE, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
-}
-
-static pid_t fork_child(void)
-{
-	pid_t pid = fork();
-
-	if (pid < 0)
-		err(EXIT_FAILURE, "fork");
-
-	if (pid > 0)
-		return pid;
-
-	pid = getpid();
-
-	warnx("(pid=%d): New process starting ...", pid);
-
-	if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
-		err(EXIT_FAILURE, "(pid=%d): prctl(PR_SET_PDEATHSIG)", pid);
-
-	signal(SIGUSR1, SIG_DFL);
-
-	warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
-
-	if (setgid(group) < 0)
-		err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
-	if (setuid(user) < 0)
-		err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
-
-	warnx("(pid=%d): Service running ...", pid);
-
-	warnx("(pid=%d): Unshare user namespace", pid);
-	if (unshare(CLONE_NEWUSER) < 0)
-		err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
-
-	char *const argv[] = { "service", NULL };
-	char *const envp[] = { "I_AM_SERVICE=1", NULL };
-
-	warnx("(pid=%d): Executing real service ...", pid);
-
-	execve(service_prog, argv, envp);
-	err(EXIT_FAILURE, "(pid=%d): execve", pid);
-}
-
-int main(int argc, char **argv)
-{
-	size_t i;
-	pid_t child[NR_CHILDS];
-	int wstatus[NR_CHILDS];
-	int childs = NR_CHILDS;
-	pid_t pid;
-
-	if (getenv("I_AM_SERVICE")) {
-		pause();
-		exit(EXIT_SUCCESS);
-	}
-
-	service_prog = argv[0];
-	pid = getpid();
-
-	warnx("(pid=%d) Starting testcase", pid);
-
-	/*
-	 * This rlimit is not a problem for root because it can be exceeded.
-	 */
-	setrlimit_nproc(1);
-
-	for (i = 0; i < NR_CHILDS; i++) {
-		child[i] = fork_child();
-		wstatus[i] = 0;
-		usleep(250000);
-	}
-
-	while (1) {
-		for (i = 0; i < NR_CHILDS; i++) {
-			if (child[i] <= 0)
-				continue;
-
-			errno = 0;
-			pid_t ret = waitpid(child[i], &wstatus[i], WNOHANG);
-
-			if (!ret || (!WIFEXITED(wstatus[i]) && !WIFSIGNALED(wstatus[i])))
-				continue;
-
-			if (ret < 0 && errno != ECHILD)
-				warn("(pid=%d): waitpid(%d)", pid, child[i]);
-
-			child[i] *= -1;
-			childs -= 1;
-		}
-
-		if (!childs)
-			break;
-
-		usleep(250000);
-
-		for (i = 0; i < NR_CHILDS; i++) {
-			if (child[i] <= 0)
-				continue;
-			kill(child[i], SIGUSR1);
-		}
-	}
-
-	for (i = 0; i < NR_CHILDS; i++) {
-		if (WIFEXITED(wstatus[i]))
-			warnx("(pid=%d): pid %d exited, status=%d",
-				pid, -child[i], WEXITSTATUS(wstatus[i]));
-		else if (WIFSIGNALED(wstatus[i]))
-			warnx("(pid=%d): pid %d killed by signal %d",
-				pid, -child[i], WTERMSIG(wstatus[i]));
-
-		if (WIFSIGNALED(wstatus[i]) && WTERMSIG(wstatus[i]) == SIGUSR1)
-			continue;
-
-		warnx("(pid=%d): Test failed", pid);
-		exit(EXIT_FAILURE);
-	}
-
-	warnx("(pid=%d): Test passed", pid);
-	exit(EXIT_SUCCESS);
-}
diff --git a/tools/testing/selftests/rlimits/service_common.c b/tools/testing/selftests/rlimits/service_common.c
new file mode 100644
index 000000000000..043c59828a03
--- /dev/null
+++ b/tools/testing/selftests/rlimits/service_common.c
@@ -0,0 +1,400 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Alexey Gladkov <gladkov.alexey@gmail.com>
+ * Author: Michal Koutný <mkoutny@suse.com>
+ */
+/*
+ * The forking tree:
+ *
+ *     main [init_user_ns]
+ *     ` service [user_ns_1]
+ *       ` worker 1
+ *       ` worker 2
+ *       ...
+ *       ` worker k
+ *     ...
+ *     ` service [user_ns_n]
+ *       ` worker 1
+ *       ` worker 2
+ *       ...
+ *       ` worker k
+ *
+ * Sequence (synchronization) diagram:
+ *   main			service
+ *   ----			-------
+ *   setrlimit()
+ *   service=clone([CLONE_NEWUSER])
+ *   define_maps()
+ *   	  	MAP_DEFINE ->
+ *   	  			setuid()
+ *   	  			[unshare(CLONE_NEWUSER)]
+ *   		<- UNSHARE
+ *   rlimit_restore()
+ *   		RLIMIT_RESTORE ->
+ *   				execve()
+ *   		POST_EXEC ->
+ *
+ * Expectations UM_UNSHARE:
+ *
+ *     n <= RLIMIT_NPROC	global RLIMIT_NPROC is honored
+ *     (k+1) <= RLIMIT_NPROC	limit is honored within user namespace
+ *
+ * Expectations UM_CLONE_NEWUSER:
+ *
+ *     n > RLIMIT_NPROC		privileged user can spawn mutliple services in different user namespaces
+ *     k >= RLIMIT_NPROC-1	separate per-user namespace counters
+ *     (k+1) <= RLIMIT_NPROC	limit is honored within user namespace
+ */
+
+#define _GNU_SOURCE
+#include <assert.h>
+#include <err.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+#include "service_common.h"
+
+#define STACK_SIZE (2 * (1<<20))
+#define SERVICE_RUNTIME	250000 /* μs */
+
+static_assert(NR_SERVICES > THE_LIMIT, "Services must exceed THE_LIMIT for effective test.");
+static_assert(NR_WORKERS > THE_LIMIT-1, "Need enough workers to challenge THE_LIIMT.");
+
+static struct services_ctx {
+	int control_fd[NR_SERVICES];
+	pid_t child[NR_SERVICES];
+	int wstatus[NR_SERVICES];
+	int fork_ed;
+	int exec_ed;
+} services_ctx;
+
+static uid_t user   = 60000;
+static uid_t group  = 60000;
+static struct rlimit saved_limit;
+
+static struct {
+	int control_fd;
+	const char *pathname;
+	enum userns_mode um;
+} child_args;
+
+pid_t pid;
+
+static void setrlimit_nproc(rlim_t n)
+{
+	struct rlimit limit = {
+		.rlim_cur = n,
+		.rlim_max = n
+	};
+	if (getrlimit(RLIMIT_NPROC, &saved_limit) < 0)
+		err(KSFT_FAIL, "(pid=%d): getrlimit(RLIMIT_NPROC)", pid);
+
+	if (setrlimit(RLIMIT_NPROC, &limit) < 0)
+		err(KSFT_FAIL, "(pid=%d): setrlimit(RLIMIT_NPROC)", pid);
+
+	warnx("(pid=%d): Set RLIMIT_NPROC=%ld", pid, n);
+}
+
+static void restore_rlimit_nproc(void)
+{
+	if (setrlimit(RLIMIT_NPROC, &saved_limit) < 0)
+		err(KSFT_FAIL, "(pid=%d): setrlimit(RLIMIT_NPROC, saved)", pid);
+	warnx("(pid=%d) Restored RLIMIT_NPROC", pid);
+}
+
+enum msg_sync {
+	MAP_DEFINE,
+	UNSHARE,
+	RLIMIT_RESTORE,
+	POST_EXEC,
+};
+
+static int _sync_notify(int fd, enum msg_sync m)
+{
+	char tmp = m;
+
+	return write(fd, &tmp, 1);
+}
+static void sync_notify(int fd, enum msg_sync m)
+{
+	if (_sync_notify(fd, m) < 0)
+		warnx("(pid=%d): failed sync-write", pid);
+}
+
+static void sync_wait(int fd, enum msg_sync m)
+{
+	char tmp;
+
+	if (read(fd, &tmp, 1) < 0)
+		warn("(pid=%d): failed sync-read", pid);
+	else if (tmp != m)
+		warnx("(pid=%d): unexpected sync", pid);
+}
+
+static int define_maps(pid_t child_pid, enum userns_mode um)
+{
+	FILE *f;
+	char filename[PATH_MAX];
+
+	if (um != UM_CLONE_NEWUSER)
+		return 0;
+
+	snprintf(filename, PATH_MAX, "/proc/%i/uid_map", child_pid);
+	f = fopen(filename, "w");
+	if (fprintf(f, "%i %i 1\n", user, user) < 0)
+		return -1;
+	fclose(f);
+
+	snprintf(filename, PATH_MAX, "/proc/%i/gid_map", child_pid);
+	f = fopen(filename, "w");
+	if (fprintf(f, "%i %i 1\n", group, group) < 0)
+		return -1;
+	fclose(f);
+
+	return 0;
+}
+
+static int setup_and_exec(void *arg)
+{
+	int control_fd = child_args.control_fd;
+
+	pid = getpid();
+	warnx("(pid=%d): New process starting ...", pid);
+
+	signal(SIGUSR1, SIG_DFL);
+
+	sync_wait(control_fd, MAP_DEFINE);
+	warnx("(pid=%d): Changing to uid=%d, gid=%d", pid, user, group);
+
+	if (setgid(group) < 0)
+		err(EXIT_FAILURE, "(pid=%d): setgid(%d)", pid, group);
+	if (setuid(user) < 0)
+		err(EXIT_FAILURE, "(pid=%d): setuid(%d)", pid, user);
+
+	warnx("(pid=%d): Service running ...", pid);
+
+	if (child_args.um == UM_UNSHARE) {
+		warnx("(pid=%d): Unshare user namespace", pid);
+		if (unshare(CLONE_NEWUSER) < 0)
+			err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
+	}
+
+	sync_notify(control_fd, UNSHARE);
+	sync_wait(control_fd, RLIMIT_RESTORE);
+
+	char *param = NULL;
+	asprintf(&param, ENV_PARAM "=%i", child_args.um);
+	char *const argv[] = { "service", NULL };
+	char *const envp[] = { param, NULL };
+
+	warnx("(pid=%d): Executing real service ...", pid);
+
+	execve(child_args.pathname, argv, envp);
+
+	/* stay around until parent notifies/signals */
+	warn("(pid=%d): execve failed", pid);
+	sync_wait(control_fd, POST_EXEC);
+	pause();
+	return 0;
+}
+
+static pid_t start_child(const char *pathname, int control_fd, enum userns_mode um)
+{
+	char *stack = malloc(STACK_SIZE);
+	int flags = um == UM_CLONE_NEWUSER ? CLONE_NEWUSER : 0;
+	pid_t new_pid;
+
+	/* Pass via global variable to child */
+	child_args.control_fd = control_fd;
+	child_args.pathname = pathname;
+	child_args.um = um;
+
+	new_pid = clone(setup_and_exec, stack+STACK_SIZE-1, flags, NULL);
+
+	free(stack);
+	close(control_fd);
+	return new_pid;
+}
+
+static void dump_context(size_t n_workers)
+{
+	struct rlimit limit;
+	char user_ns[PATH_MAX];
+	ssize_t len;
+
+	if (getrlimit(RLIMIT_NPROC, &limit) < 0)
+		err(EXIT_FAILURE, "(pid=%d) failed getrlimit", pid);
+	if ((len = readlink("/proc/self/ns/user", user_ns, PATH_MAX)) < 0)
+		err(EXIT_FAILURE, "(pid=%d) failed readlink", pid);
+	user_ns[len] = 0;
+
+	warnx("(pid=%d) Service instance attempts %lu workers, limit %lu:%lu, ns=%s",
+	      pid, n_workers, limit.rlim_cur, limit.rlim_max, user_ns);
+}
+
+int run_service(enum userns_mode um)
+{
+	size_t i;
+	pid_t worker[NR_WORKERS];
+	int ret = EXIT_SUCCESS;
+
+	dump_context(NR_WORKERS);
+
+	/* test RLIMIT_NPROC inside the service, last worker should fail because of service itself */
+	for (i = 0; i < NR_WORKERS; i++) {
+		worker[i] = fork();
+		if (worker[i] == 0) {
+			/* service worker */
+			pause();
+			exit(EXIT_SUCCESS);
+		}
+		if (worker[i] < 0) {
+			warn("(pid=%d) service fork %lu failed", pid, i+1);
+			if (um == UM_CLONE_NEWUSER && !(i >= (THE_LIMIT-1) && errno == EAGAIN))
+				ret = EXIT_FAILURE;
+		} else if (i >= (THE_LIMIT-1)) {
+			warnx("(pid=%d) RLIMIT_NPROC not honored", pid);
+			ret = EXIT_FAILURE;
+		}
+	}
+
+	/* service cleanup */
+	for (i = 0; i < NR_WORKERS; i++)
+		if (worker[i] > 0)
+			kill(worker[i], SIGUSR1);
+
+	for (i = 0; i < NR_WORKERS; i++)
+		if (worker[i] > 0)
+			waitpid(worker[i], NULL, WNOHANG);
+
+	if (ret) {
+		warnx("(pid=%d) service failed, ret=%i", pid, ret);
+		return ret;
+	}
+	/* we must get here before SERVICE_RUNTIME elapses */
+	pause();
+	return EXIT_FAILURE;
+}
+
+struct services_ctx *start_services(const char *pathname, enum userns_mode um)
+{
+	size_t i;
+	int sockets[2];
+	struct services_ctx *ctx = &services_ctx;
+
+	signal(SIGPIPE, SIG_IGN);
+	setrlimit_nproc(THE_LIMIT);
+	ctx->fork_ed = 0;
+	ctx->exec_ed = 0;
+	for (i = 0; i < NR_SERVICES; i++) {
+		if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0, sockets) < 0)
+			err(KSFT_FAIL, "(pid=%d) socketpair failed", pid);
+		ctx->control_fd[i] = sockets[0];
+		ctx->child[i] = start_child(pathname, sockets[1], um);
+		ctx->wstatus[i] = 0;
+		if (ctx->child[i] < 0)
+			continue;
+		ctx->fork_ed++;
+
+		if (define_maps(ctx->child[i], um) < 0)
+			err(KSFT_FAIL, "(pid=%d) user_ns maps definition failed", pid);
+		sync_notify(ctx->control_fd[i], MAP_DEFINE);
+	}
+
+	for (i = 0; i < NR_SERVICES; i++)
+		sync_wait(ctx->control_fd[i], UNSHARE);
+	restore_rlimit_nproc();
+
+	for (i = 0; i < NR_SERVICES; i++) {
+		sync_notify(ctx->control_fd[i], RLIMIT_RESTORE);
+	}
+
+	return ctx;
+}
+
+void stop_services(struct services_ctx *ctx)
+{
+	size_t i;
+	int children = ctx->fork_ed;
+
+	/* Well behaved service would pause() and wait for our SIGUSR1, if it
+	 * failed check it early.
+	 */
+	while (1) {
+		for (i = 0; i < NR_SERVICES; i++) {
+			if (ctx->child[i] <= 0)
+				continue;
+
+			errno = 0;
+			pid_t ret = waitpid(ctx->child[i], &ctx->wstatus[i], WNOHANG | __WALL);
+
+			if (!ret)
+				continue;
+
+			if (ret < 0 && errno != ECHILD)
+				warn("(pid=%d): waitpid(%d)", pid, ctx->child[i]);
+
+			ctx->child[i] *= -1;
+			children -= 1;
+		}
+
+		if (!children)
+			break;
+
+		usleep(SERVICE_RUNTIME);
+
+		for (i = 0; i < NR_SERVICES; i++) {
+			if (ctx->child[i] <= 0)
+				continue;
+			if (_sync_notify(ctx->control_fd[i], POST_EXEC) < 0 &&
+			    (errno == EPIPE || errno == ECONNREFUSED))
+				ctx->exec_ed++;
+			close(ctx->control_fd[i]);
+			kill(ctx->child[i], SIGUSR1);
+		}
+	}
+
+	warnx("(pid=%d): stats: fork_ed=%i exec_ed=%i", pid, ctx->fork_ed, ctx->exec_ed);
+}
+
+int count_services(struct services_ctx *ctx)
+{
+	return ctx->exec_ed;
+}
+
+int check_services(struct services_ctx *ctx)
+{
+	size_t i;
+	int correct = 0;
+	
+	for (i = 0; i < NR_SERVICES; i++) {
+		if (WIFEXITED(ctx->wstatus[i]))
+			warnx("(pid=%d): pid %d exited, status=%d",
+				pid, -ctx->child[i], WEXITSTATUS(ctx->wstatus[i]));
+		else if (WIFSIGNALED(ctx->wstatus[i]))
+			warnx("(pid=%d): pid %d killed by signal %d",
+				pid, -ctx->child[i], WTERMSIG(ctx->wstatus[i]));
+
+		/* The only acceptable service termination */
+		if (WIFSIGNALED(ctx->wstatus[i]) && WTERMSIG(ctx->wstatus[i]) == SIGUSR1)
+			correct++;
+	}
+
+	return correct;
+}
+
+
diff --git a/tools/testing/selftests/rlimits/service_common.h b/tools/testing/selftests/rlimits/service_common.h
new file mode 100644
index 000000000000..4a3cd929d865
--- /dev/null
+++ b/tools/testing/selftests/rlimits/service_common.h
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+#include <sys/types.h>
+
+#define THE_LIMIT 4
+#define NR_SERVICES 5
+#define NR_WORKERS 5
+
+#define ENV_PARAM "I_AM_SERVICE"
+
+enum userns_mode {
+	UM_UNSHARE,		/* setrlimit,clone(0),setuid,unshare,execve */
+	UM_CLONE_NEWUSER,	/* setrlimit,clone(NEWUSER),setuid,execve */
+};
+
+struct services_ctx;
+
+/* Cache current pid */
+extern pid_t pid;
+
+int run_service(enum userns_mode um);
+struct services_ctx *start_services(const char *pathname, enum userns_mode um);
+void stop_services(struct services_ctx *ctx);
+int count_services(struct services_ctx *ctx);
+int check_services(struct services_ctx *ctx);
Eric W. Biederman Feb. 25, 2022, 12:29 a.m. UTC | #3
Kees Cook <keescook@chromium.org> writes:

> typo: Subject's LimigtNPROC -> LimitNPROC
>
> On Thu, Feb 24, 2022 at 09:41:44AM -0600, Eric W. Biederman wrote:
>> 
>> Long story short recursively enforcing RLIMIT_NPROC when it is not
>> enforced on the process that creates a new user namespace, causes
>> currently working code to fail.  There is no reason to enforce
>> RLIMIT_NPROC recursively when we don't enforce it normally so update
>> the code to detect this case.
>> 
>> I would like to simply use capable(CAP_SYS_RESOURCE) to detect when
>> RLIMIT_NPROC is not enforced upon the caller.  Unfortunately because
>> RLIMIT_NPROC is charged and checked for enforcement based upon the
>> real uid, using capable() wich is euid based is inconsistent with reality.
>
> typo: wich -> which

Ahh... Typos.

>> Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by
>> testing for when the real uid would match the conditions when
>> CAP_SYS_RESOURCE would be present if the real uid was the effective
>> uid.
>> 
>> Reported-by: Etienne Dechamps <etienne@edechamps.fr>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596
>> Link: https://lkml.kernel.org/r/e9589141-cfeb-90cd-2d0e-83a62787239a@edechamps.fr
>> Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts")
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>> 
>> The previous conversation has given me enough clarity that I can see
>> which tests I am comfortable with use for this pending regression fix.
>> 
>> I have tested this and it works for me.  Does anyone have any concerns
>> with this change?
>
> I'd really love some kind of selftest that exercises the edge cases; do
> you have your tests in some form that could be converted?
>
> But otherwise, yes, this looks like the best option here.

Let's start with Michal Koutný tests.  I keep forgetting to look at
them.  This cold has really been kicking my butt.

For this issue the test case was a systemd unit file.  Which is simple
and demonstrates the real-world regression but not really minimal in the
way a kernel selftest should be.

> Reviewed-by: Kees Cook <keescook@chromium.org>
>
>> 
>>  kernel/user_namespace.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>> 
>> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> index 6b2e3ca7ee99..5481ba44a8d6 100644
>> --- a/kernel/user_namespace.c
>> +++ b/kernel/user_namespace.c
>> @@ -58,6 +58,18 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>>  	cred->user_ns = user_ns;
>>  }
>>  
>> +static unsigned long enforced_nproc_rlimit(void)
>> +{
>> +	unsigned long limit = RLIM_INFINITY;
>> +
>> +	/* Is RLIMIT_NPROC currently enforced? */
>> +	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
>> +	    (current_user_ns() != &init_user_ns))
>> +		limit = rlimit(RLIMIT_NPROC);
>> +
>> +	return limit;
>> +}
>> +
>>  /*
>>   * Create a new user namespace, deriving the creator from the user in the
>>   * passed credentials, and replacing that user with the new root user for the
>> @@ -122,7 +134,7 @@ int create_user_ns(struct cred *new)
>>  	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
>>  		ns->ucount_max[i] = INT_MAX;
>>  	}
>> -	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
>> +	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
>>  	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
>> -- 
>> 2.29.2
>> 

Eric
diff mbox series

Patch

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 6b2e3ca7ee99..5481ba44a8d6 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -58,6 +58,18 @@  static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 	cred->user_ns = user_ns;
 }
 
+static unsigned long enforced_nproc_rlimit(void)
+{
+	unsigned long limit = RLIM_INFINITY;
+
+	/* Is RLIMIT_NPROC currently enforced? */
+	if (!uid_eq(current_uid(), GLOBAL_ROOT_UID) ||
+	    (current_user_ns() != &init_user_ns))
+		limit = rlimit(RLIMIT_NPROC);
+
+	return limit;
+}
+
 /*
  * Create a new user namespace, deriving the creator from the user in the
  * passed credentials, and replacing that user with the new root user for the
@@ -122,7 +134,7 @@  int create_user_ns(struct cred *new)
 	for (i = 0; i < MAX_PER_NAMESPACE_UCOUNTS; i++) {
 		ns->ucount_max[i] = INT_MAX;
 	}
-	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC));
+	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_NPROC, enforced_nproc_rlimit());
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
 	set_rlimit_ucount_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));